OBJECTIVE BAYESIAN METHODS FOR OCCUPANCY MODEL ESTIMATION AND SELECTION
By
DANIEL TAYLOR-RODRIGUEZ
A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
UNIVERSITY OF FLORIDA
2014
© 2014 Daniel Taylor-Rodríguez
In memory of George Casella
It is a capital mistake to theorize before one has data. Insensibly one begins to twist facts to suit theories, instead of theories to suit facts.

–Sherlock Holmes, A Scandal in Bohemia
ACKNOWLEDGMENTS
Completing this dissertation would not have been possible without the support of the people who have helped me remain focused, motivated, and inspired throughout the years. I am undeservingly fortunate to be surrounded by such amazing people.
First of all, I would like to express my gratitude to Professor George Casella. It was an unsurpassable honor to work with him. His wisdom, generosity, optimism, and unyielding resolve will forever inspire me. I will always treasure his teachings and the fond memories I have of him. I thank him and Anne for treating me and my wife as family.
I would like to acknowledge all of my committee members. My heartfelt thanks to my advisor, Professor Linda J. Young; I will carry her thoughtful and patient recommendations throughout my life. I have no words to express how thankful I am to her for guiding me through the difficult times that followed Dr. Casella's passing. She also has my gratitude for sharing her knowledge and wealth of experience, and for providing me with so many amazing opportunities. I am forever grateful to my local advisor, Professor Nikolay Bliznyuk, for unsparingly sharing his insightful reflections and knowledge. His generosity and drive to help students develop are a model to follow. His kind and extensive efforts, our many conversations, and his suggestions and advice in all aspects of academic and non-academic life have made me a better statistician and have had a profound influence on my way of thinking. My appreciation to Professor Madan Oli for his enlightening advice and for helping me advance my understanding of ecology.
I would like to express my absolute gratitude to Dr. Andrew Womack, my friend and young mentor. His love for good science and hard work, although impossible to keep up with, made my doctoral training one of the most exciting times in my life. I have sincerely enjoyed working with and learning from him these last couple of years. I offer my gratitude to Dr. Salvador Gezan for his friendship and the patience with which he taught me so much more about statistics (boring our wives to death in the process). I am grateful to Professor Mary Christman for her mentorship and enormous support. I would like to thank Dr. Mihai Giurcanu for spending countless hours helping me think more deeply about statistics; his insight has been instrumental in shaping my own ideas. Thanks to Dr. Claudio Fuentes for taking an interest in my work and for his advice, support, and kind words, which helped me retain the confidence to continue.
I would like to acknowledge my friends at UF. Juan Jose Acosta, Mauricio Mosquera, Diana Falla Salvador, Emma Weeks, and Anna Denicol: thanks for becoming my family away from home. Andreas, Tavis, Emily, Alex, Sasha, Mike, Yeonhee, and Laura: thanks for being there for me; I truly enjoyed sharing these years with you. Vitor, Paula, Rafa, Leandro, Fabio, Eduardo, Marcelo, and all the other Brazilians in the Animal Science Department: thanks for your friendship and for the many unforgettable (though blurry) weekends.
Also, I would like to thank Pablo Arboleda for believing in me; because of him, I was able to take the first step towards fulfilling my educational goals. My gratitude to Grupo Bancolombia, Fulbright Colombia, Colfuturo, and the IGERT QSE3 program for supporting me throughout my studies. Thanks also to Marc Kery and Christian Monnerat for providing data to validate our methods. Thanks to the staff in the Statistics Department, especially Ryan Chance; to the staff at the HPC; and to Karen Bray at SNRE.
Above all else, I would like to thank my wife and family. Nata, you have always been there for me, pushing me forward, believing in me, and helping me make better decisions; regardless of how hard things get, you have always managed to give me true and lasting happiness. Thank you for your love, strength, and patience. Mom, Dad, Alejandro, Alberto, Laura, Sammy, Vale, and Tommy: without your love, trust, and support, getting this far would not have been possible. Thank you for giving me so much. Gustavo, Lilia, Angelica, and Juan Pablo: thanks for taking me into your family; your words of encouragement have led the way.
TABLE OF CONTENTS

ACKNOWLEDGMENTS

LIST OF TABLES

LIST OF FIGURES

ABSTRACT

CHAPTER

1 GENERAL INTRODUCTION
   1.1 Occupancy Modeling
   1.2 A Primer on Objective Bayesian Testing
   1.3 Overview of the Chapters

2 MODEL ESTIMATION METHODS
   2.1 Introduction
      2.1.1 The Occupancy Model
      2.1.2 Data Augmentation Algorithms for Binary Models
   2.2 Single Season Occupancy
      2.2.1 Probit Link Model
      2.2.2 Logit Link Model
   2.3 Temporal Dynamics and Spatial Structure
      2.3.1 Dynamic Mixture Occupancy State-Space Model
      2.3.2 Incorporating Spatial Dependence
   2.4 Summary

3 INTRINSIC ANALYSIS FOR OCCUPANCY MODELS
   3.1 Introduction
   3.2 Objective Bayesian Inference
      3.2.1 The Intrinsic Methodology
      3.2.2 Mixtures of g-Priors
         3.2.2.1 Intrinsic priors
         3.2.2.2 Other mixtures of g-priors
   3.3 Objective Bayes Occupancy Model Selection
      3.3.1 Preliminaries
      3.3.2 Intrinsic Priors for the Occupancy Problem
      3.3.3 Model Posterior Probabilities
      3.3.4 Model Selection Algorithm
   3.4 Alternative Formulation
   3.5 Simulation Experiments
      3.5.1 Marginal Posterior Inclusion Probabilities for Model Predictors
      3.5.2 Summary Statistics for the Highest Posterior Probability Model
   3.6 Case Study: Blue Hawker Data Analysis
      3.6.1 Results: Variable Selection Procedure
      3.6.2 Validation for the Selection Procedure
   3.7 Discussion

4 PRIORS ON THE MODEL SPACE AND WELL-FORMULATED MODELS
   4.1 Introduction
   4.2 Setup for Well-Formulated Models
      4.2.1 Well-Formulated Model Spaces
   4.3 Priors on the Model Space
      4.3.1 Model Prior Definition
      4.3.2 Choice of Prior Structure and Hyper-Parameters
      4.3.3 Posterior Sensitivity to the Choice of Prior
   4.4 Random Walks on the Model Space
      4.4.1 Simple Pruning and Growing
      4.4.2 Degree Based Pruning and Growing
   4.5 Simulation Study
      4.5.1 SNR and Sample Size Effect
      4.5.2 Coefficient Magnitude
      4.5.3 Special Points on the Scale
   4.6 Case Study: Ozone Data Analysis
   4.7 Discussion

5 CONCLUSIONS

APPENDIX

A FULL CONDITIONAL DENSITIES: DYMOSS

B RANDOM WALK ALGORITHMS

C WFM SIMULATION DETAILS

D SUPPLEMENTARY INFORMATION FOR THE OZONE DATA ANALYSIS

REFERENCES

BIOGRAPHICAL SKETCH
LIST OF TABLES

1-1 Interpretation of BF_{jk} when contrasting M_j and M_k

3-1 Simulation control parameters, occupancy model selector

3-2 Comparison of average minOdds(MPIP) under scenarios having different numbers of sites (N=50, N=100) and different numbers of surveys per site (J=3, J=5), for the presence and detection components, using uniform and multiplicity correction priors

3-3 Comparison of average minOdds(MPIP) for different levels of signal considered in the occupancy and detection probabilities, for the presence and detection components, using uniform and multiplicity correction priors

3-4 Comparison between scenarios with 50 and 100 sites in terms of the average percentage of true positive and true negative terms over the highest probability models, for the presence and detection components, using uniform and multiplicity correcting priors on the model space

3-5 Comparison between scenarios with 3 and 5 surveys per site in terms of the percentage of true positive and true negative predictors averaged over the highest probability models, for the presence and detection components, using uniform and multiplicity correcting priors on the model space

3-6 Comparison between scenarios with different levels of signal in the occupancy component in terms of the percentage of true positive and true negative predictors averaged over the highest probability models, for the presence and detection components, using uniform and multiplicity correcting priors on the model space

3-7 Comparison between scenarios with different levels of signal in the detection component in terms of the percentage of true positive and true negative predictors averaged over the highest probability models, for the presence and detection components, using uniform and multiplicity correcting priors on the model space

3-8 Posterior probability for the five highest probability models in the presence component of the blue hawker data

3-9 Posterior probability for the five highest probability models in the detection component of the blue hawker data

3-10 MPIP, presence component

3-11 MPIP, detection component

3-12 Mean misclassification rate for HPMs and MPMs using uniform and multiplicity correction model priors

4-1 Characterization of the full models MF and corresponding model spaces M considered in simulations

4-2 Mean number of false and true positives in 100 randomly generated datasets as the number of main effects increases from three to five predictors in a full quadratic model, under the equal probability prior (EPP), the hierarchical independence prior (HIP), the hierarchical order prior (HOP), and the hierarchical uniform prior (HUP)

4-3 Mean number of false and true positives in 100 randomly generated datasets as the maximum order of MF increases from two to four in a full model with three main effects, under the equal probability prior (EPP), the hierarchical independence prior (HIP), the hierarchical order prior (HOP), and the hierarchical uniform prior (HUP)

4-4 Mean number of false and true positives in 100 randomly generated datasets with unstructured or irregular model spaces, under the equal probability prior (EPP), the hierarchical independence prior (HIP), the hierarchical order prior (HOP), and the hierarchical uniform prior (HUP)

4-5 Variables used in the analyses of the ozone contamination dataset

4-6 Median probability models (MPM) from different combinations of parameter and model priors vs. the model selected using the hierarchical lasso

C-1 Experimental conditions, WFM simulations

D-1 Variables used in the analyses of the ozone contamination dataset

D-2 Marginal inclusion probabilities, intrinsic prior

D-3 Marginal inclusion probabilities, Zellner-Siow prior

D-4 Marginal inclusion probabilities, Hyper-g11

D-5 Marginal inclusion probabilities, Hyper-g21
LIST OF FIGURES

2-1 Graphical representation, occupancy model

2-2 Graphical representation, occupancy model after data augmentation

2-3 Graphical representation, multiseason model for a single site

2-4 Graphical representation, data-augmented multiseason model

3-1 Predictor MPIP averaged over scenarios with N=50 and N=100 sites, using uniform (U) and multiplicity correction (MC) priors

3-2 Predictor MPIP averaged over scenarios with J=3 and J=5 surveys per site, using uniform (U) and multiplicity correction (MC) priors

3-3 Predictor MPIP averaged over scenarios with the interaction between the number of sites and the surveys per site, using uniform (U) and multiplicity correction (MC) priors

3-4 Predictor MPIP averaged over scenarios with equal signal in the occupancy probabilities, using uniform (U) and multiplicity correction (MC) priors

3-5 Predictor MPIP averaged over scenarios with equal signal in the detection probabilities, using uniform (U) and multiplicity correction (MC) priors

4-1 Graphs of well-formulated polynomial models for p = 2

4-2 E(M) and C(M) in M defined by a quadratic surface in two main effects, for model M = {1, x1, x1^2}

4-3 Graphical representation of assumptions on M defined by the quadratic surface in two main effects

4-4 Prior probabilities for the space of well-formulated models associated to the quadratic surface on two variables, where MB is taken to be the intercept-only model and (a,b) ∈ {(1,1), (1,ch)}

4-5 Prior probabilities for the space of well-formulated models associated to three main effects and one interaction term, where MB is taken to be the intercept-only model and (a,b) ∈ {(1,1), (1,ch)}

4-6 MT, DAG of the largest true model used in simulations

4-7 Average true positives (TP) and average false positives (FP) in all simulated scenarios for the median probability model, with EPP and HOP(1,ch)

C-1 SNR vs. n: average model size, average true positives, and average false positives for all simulated scenarios, by model ranking according to model posterior probabilities

C-2 SNR vs. coefficient values: average model size, average true positives, and average false positives for all simulated scenarios, by model ranking according to model posterior probabilities

C-3 SNR vs. different true models MT: average model size, average true positives, and average false positives for all simulated scenarios, by model ranking according to model posterior probabilities
Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy
OBJECTIVE BAYESIAN METHODS FOR OCCUPANCY MODEL ESTIMATION ANDSELECTION
By
Daniel Taylor-Rodríguez
August 2014
Chair: Linda J. Young
Cochair: Nikolay Bliznyuk
Major: Interdisciplinary Ecology
The ecological literature contains numerous methods for conducting inference about the dynamics that govern biological populations. Among these methods, occupancy models have played a leading role during the past decade in the analysis of large biological population surveys. The flexibility of the occupancy framework has brought about useful extensions for determining key population parameters, which provide insights about the distribution, structure, and dynamics of a population. However, the methods used to fit the models and to conduct inference have gradually grown in complexity, leaving practitioners unable to fully understand their implicit assumptions and increasing the potential for misuse. This motivated our first contribution: we develop a flexible and straightforward estimation method for occupancy models that provides the means to directly incorporate temporal and spatial heterogeneity, using covariate information that characterizes habitat quality and the detectability of a species.

Adding to the issue mentioned above, studies of complex ecological systems now collect large amounts of information. To identify the drivers of these systems, robust techniques that account for test multiplicity and for the structure in the predictors are necessary, but unavailable for ecological models. We develop tools to address this methodological gap. First, working in an "objective" Bayesian framework, we develop the first fully automatic and objective method for occupancy model selection based on intrinsic parameter priors. Moreover, for the general variable selection problem, we propose three sets of prior structures on the model space that correct for multiple testing, and a stochastic search algorithm that relies on the priors on the model space to account for the polynomial structure in the predictors.
CHAPTER 1
GENERAL INTRODUCTION
As with any other branch of science, ecology strives to grasp truths about the world that surrounds us, and in particular about nature. The objective truth sought by ecology may well be beyond our grasp; however, it is reasonable to think that, at least partially, "Nature is capable of being understood" (Dewey 1958). We can observe and interpret nature to formulate hypotheses, which can then be tested against reality. Hypotheses that encounter no or little opposition when confronted with reality may become contextual versions of the truth, and may be generalized by scaling them spatially and/or temporally to delimit the bounds within which they are valid.
To formulate hypotheses accurately, and in a fashion amenable to scientific inquiry, not only must the point of view and assumptions considered be made explicit, but also the object of interest, the properties of that object worthy of consideration, and the methods used in studying such properties (Reiners & Lockwood 2009; Rigler & Peters 1995). Ecology, as defined by Krebs (1972), is "the study of interactions that determine the distribution and abundance of organisms." This characterizes organisms and their interactions as the objects of interest to ecology, and prescribes distribution and abundance as relevant properties of these organisms.
With regard to the methods used to acquire ecological scientific knowledge, traditionally theoretical mathematical models (such as deterministic PDEs) have been used. However, naturally varying systems are imprecisely observed and, as such, are subject to multiple sources of uncertainty that must be explicitly accounted for. Because of this, the ecological scientific community has developed a growing interest in flexible and powerful statistical methods, among which Bayesian hierarchical models predominate. These methods rely on empirical observations and can accommodate fairly complex relationships between empirical observations and theoretical process models, while accounting for diverse sources of uncertainty (Hooten 2006).
Bayesian approaches are now used extensively in ecological modeling; however, there are two issues of concern, one from the standpoint of ecological practitioners and another from the perspective of scientific ecological endeavors. First, Bayesian modeling tools require a considerable understanding of probability and statistical theory, leading practitioners to view them as black-box approaches (Kery 2010). Second, although Bayesian applications proliferate in the literature, in general there is a lack of awareness of the distinction between approaches specifically devised for testing and those for estimation (Ellison 2004). Furthermore, there is a dangerous unfamiliarity with the proven risks of using tools designed for estimation in testing procedures, e.g., the use of flat priors in hypothesis testing (Berger & Pericchi 1996; Berger et al. 2001; Kass & Raftery 1995; Moreno et al. 1998; Robert et al. 2009; Robert 1993).
Occupancy models have played a leading role during the past decade in large biological population surveys. The flexibility of the occupancy framework has allowed the development of useful extensions to determine several key population parameters, which provide robust notions of the distribution, structure, and dynamics of a population. In order to address some of the concerns stated in the previous paragraph, we concentrate on the occupancy framework to develop estimation and testing tools that allow ecologists, first, to gain insight into the estimation procedure and, second, to conduct statistically sound model selection for site-occupancy data.
1.1 Occupancy Modeling
Since MacKenzie et al. (2002) and Tyre et al. (2003) introduced the site-occupancy framework, countless applications and extensions of the method have been developed in the ecological literature, as evidenced by the 438,000 hits on Google Scholar for a search of "occupancy model". This class of models acknowledges that techniques used to conduct biological population surveys are prone to detection errors: if an individual is detected it must be present, while if it is not detected it might or might not be. Occupancy models improve upon traditional binary regression by accounting for observed detection and partially observed presence as two separate but related components. In the site-occupancy setting, the chosen locations are surveyed repeatedly in order to reduce the ambiguity caused by the observed zeros. This approach therefore allows probabilities of both presence (occurrence) and detection to be estimated.
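To make this data-generating structure concrete, the following sketch simulates single-season site-occupancy data under the assumptions just described; the site count, survey count, and probability values are illustrative choices, not quantities from this dissertation:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative sizes: N sites, J repeat surveys per site.
N, J = 100, 5
psi, p = 0.6, 0.4  # occupancy and detection probabilities (hypothetical)

# Latent presence indicator at each site (never observed directly).
z = rng.binomial(1, psi, size=N)

# Detection histories: the species can only be detected where present,
# so y[i, j] ~ Bernoulli(z[i] * p).
y = rng.binomial(1, z[:, None] * p, size=(N, J))

# A site with an all-zero history is ambiguous: truly absent, or present
# but missed on every survey. The naive estimate therefore underestimates psi.
naive_occ = (y.sum(axis=1) > 0).mean()
print(naive_occ, z.mean())
```

Repeated surveys shrink the ambiguity: as J grows, the chance that an occupied site yields an all-zero history, (1 - p)^J, goes to zero.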
The uses of site-occupancy models are many. For example, metapopulation and island biogeography models are often parameterized in terms of site (or patch) occupancy (Hanski 1992, 1994, 1997, as cited in MacKenzie et al. (2003)), and occupancy may be used as a surrogate for abundance to answer questions regarding geographic distribution, range size, and metapopulation dynamics (MacKenzie et al. 2004; Royle & Kery 2007).
The basic occupancy framework, which assumes a single closed population with fixed probabilities through time, has proven to be quite useful; however, it might be of limited utility when addressing some problems. In particular, the assumptions of the basic model may become too restrictive or unrealistic whenever the study period extends through multiple years or seasons, especially given the increasingly changing environmental conditions that most ecosystems are currently experiencing.
Among the extensions found in the literature, one that we consider particularly relevant incorporates heterogeneous occupancy probabilities through time. Models that incorporate temporally varying probabilities stem from important metapopulation notions provided by Hanski (1994), such as occupancy probabilities depending on local colonization and local extinction processes. In spite of the conceptual usefulness of Hanski's model, several strong and untenable assumptions (e.g., all patches being homogeneous in quality) are required for it to provide practically meaningful results.
A more viable alternative, which builds on Hanski (1994), is the extension of the single-season occupancy model developed by MacKenzie et al. (2003). In this model, the heterogeneity of occupancy probabilities across seasons arises from local colonization and extinction processes. The model is flexible enough to let the detection, occurrence, extinction, and colonization probabilities each depend upon its own set of covariates. Model parameters are obtained through likelihood-based estimation.
Using a maximum likelihood approach presents two drawbacks. First, the uncertainty assessment for maximum likelihood parameter estimates relies on asymptotic results, obtained through the delta method, making it sensitive to sample size. Second, to obtain parameter estimates, the latent process (occupancy) is marginalized out of the likelihood, leading to the usual zero-inflated Bernoulli model. Although this is a convenient strategy for solving the estimation problem, once the latent state variables (occupancy indicators) have been integrated out, they are no longer available. Therefore, finite sample estimates cannot be calculated directly; instead, a supplementary parametric bootstrapping step is necessary. Further, additional structure, such as temporal or spatial variation, cannot be introduced by means of random effects (Royle & Kery 2007).
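For the single-season model, the marginalization described above can be written out explicitly. The sketch below computes the resulting zero-inflated Bernoulli marginal likelihood for one site's detection history, summing the latent occupancy indicator out of the joint distribution; the function name and the constant-probability parameterization are illustrative simplifications, not the dissertation's notation:

```python
import numpy as np

def site_marginal_likelihood(y, psi, p):
    """Marginal likelihood of one site's detection history y (0/1 array),
    after summing the latent occupancy indicator z over {0, 1}."""
    y = np.asarray(y)
    # z = 1: site occupied, detections are independent Bernoulli(p).
    occupied = psi * np.prod(p**y * (1 - p)**(1 - y))
    # z = 0: site unoccupied, an all-zero history is the only possibility.
    unoccupied = (1 - psi) * float(np.all(y == 0))
    return occupied + unoccupied

# An all-zero history mixes both explanations; any detection rules out z = 0.
print(site_marginal_likelihood([0, 0, 0], psi=0.6, p=0.4))  # 0.6*0.4^0*0.6^3 + 0.4 = 0.5296
print(site_marginal_likelihood([1, 0, 1], psi=0.6, p=0.4))
```

The sum over z is exactly what removes the occupancy indicators from the likelihood: after this step, z no longer appears, which is why finite-sample summaries of occupancy require the extra bootstrapping step mentioned above.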
1.2 A Primer on Objective Bayesian Testing
With the advent of high-dimensional data, such as that found in modern problems in ecology, genetics, physics, etc., coupled with evolving computing capability, objective Bayesian inferential methods have gained increasing popularity. This, however, is by no means a new approach to conducting Bayesian inference. In fact, starting with Bayes and Laplace, and continuing for almost 200 years, Bayesian analysis was primarily based on "noninformative" priors (Berger & Bernardo 1992).
Now, subjective elicitation of prior probabilities in Bayesian analysis is widely recognized as the ideal (Berger et al. 2001); however, it is often the case that the available information is insufficient to specify appropriate prior probabilistic statements. Commonly, as in model selection problems where large model spaces have to be explored, the number of model parameters is prohibitively large, preventing one from eliciting prior information for the entire parameter space. As a consequence, in practice the determination of priors through the definition of structural rules has become the alternative to subjective elicitation for a variety of problems in Bayesian testing. Priors arising from these rules are known in the literature as noninformative, objective, default, or reference priors. Many of these connotations generate controversy and are accused, perhaps rightly, of providing a false pretension of objectivity. Nevertheless, we will avoid that discussion and refer to them herein interchangeably as noninformative or objective priors, to convey the sense that no attempt to introduce an informed opinion is made in defining prior probabilities.
A plethora of "noninformative" methods has been developed in the past few decades (see Berger & Bernardo (1992); Berger & Pericchi (1996); Berger et al. (2001); Clyde & George (2004); Kass & Wasserman (1995, 1996); Liang et al. (2008); Moreno et al. (1998); Spiegelhalter & Smith (1982); Wasserman (2000); and the references therein). We find particularly interesting those derived from the model structure, in which no tuning parameters are required, especially since these can be regarded as automatic methods. Among them, methods based on the Bayes factor for intrinsic priors have proven their worth in a variety of inferential problems, given their excellent performance, flexibility, and ease of use. This class of priors is discussed in detail in Chapter 3. For now, some basic notation and notions of Bayesian inferential procedures are introduced.
Hypothesis testing and the Bayes factor
Bayesian model selection techniques that aim to find the true model, as opposed to searching for the model that best predicts the data, are fundamentally extensions of Bayesian hypothesis testing strategies. In general, this Bayesian approach to hypothesis testing and model selection relies on determining the amount of evidence found in favor of one hypothesis (or model) over another, given an observed set of data. Approached from a Bayesian standpoint, this type of problem can be formulated in great generality, using a natural, well-defined probabilistic framework that incorporates both model and parameter uncertainty.
Jeffreys (1935) first developed the Bayesian strategy for hypothesis testing and, consequently, for the model selection problem. Bayesian model selection within a model space M = {M_1, M_2, ..., M_J}, where each model M_j is associated with a parameter \theta_j (which may itself be a vector of parameters), incorporates three types of probability distributions: (1) a prior probability distribution for each model, \pi(M_j); (2) a prior probability distribution for the parameters in each model, \pi(\theta_j \mid M_j); and (3) the distribution of the data conditional on both the model and the model's parameters, f(x \mid \theta_j, M_j). These three probability densities induce the joint distribution p(x, \theta_j, M_j) = f(x \mid \theta_j, M_j) \cdot \pi(\theta_j \mid M_j) \cdot \pi(M_j), which is instrumental in producing model posterior probabilities. The model posterior probability is the probability that a model is true given the data; it is obtained by marginalizing over the parameter space and applying Bayes' rule:

\[
p(M_j \mid x) = \frac{m(x \mid M_j)\,\pi(M_j)}{\sum_{i=1}^{J} m(x \mid M_i)\,\pi(M_i)}, \qquad (1\text{-}1)
\]

where m(x \mid M_j) = \int f(x \mid \theta_j, M_j)\,\pi(\theta_j \mid M_j)\,d\theta_j is the marginal likelihood of M_j.
Given that interest lies in comparing different models, evidence in favor of one or another model is assessed through pairwise comparisons using posterior odds:

\[
\frac{p(M_j \mid x)}{p(M_k \mid x)} = \frac{m(x \mid M_j)}{m(x \mid M_k)} \cdot \frac{\pi(M_j)}{\pi(M_k)}. \qquad (1\text{-}2)
\]

The first term on the right-hand side of (1-2), m(x \mid M_j)/m(x \mid M_k), is known as the Bayes factor comparing model M_j to model M_k, and is denoted by BF_{jk}(x). The Bayes factor provides a measure of the evidence in favor of either model given the data, and updates the model prior odds, \pi(M_j)/\pi(M_k), to produce the posterior odds.
Note that the model posterior probability in (1-1) can be expressed as a function of Bayes factors. To illustrate, let M_* \in M be a reference model against which all other models in M are compared. Then, dividing both the numerator and denominator in (1-1) by m(x \mid M_*)\,\pi(M_*) yields

\[
p(M_j \mid x) = \frac{BF_{j*}(x)\,\dfrac{\pi(M_j)}{\pi(M_*)}}{1 + \sum_{M_i \in M,\, M_i \neq M_*} BF_{i*}(x)\,\dfrac{\pi(M_i)}{\pi(M_*)}}. \qquad (1\text{-}3)
\]
Therefore, as the Bayes factor increases, the posterior probability of model M_j given the data increases. If all models have equal prior probabilities, a straightforward criterion to select the best among all candidate models is to choose the model with the largest Bayes factor. As such, the Bayes factor is not only useful for identifying models favored by the data, but it also provides a means to rank models in terms of their posterior probabilities.
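Equation (1-3) translates directly into a small numerical routine. The sketch below recovers model posterior probabilities from Bayes factors taken against a reference model, working on the log scale for numerical stability; the function name and the example Bayes factor values are hypothetical:

```python
import numpy as np

def posterior_probs(log_bf, prior=None):
    """Model posterior probabilities from Bayes factors against a reference
    model, as in equation (1-3). log_bf[j] = ln BF_{j*}; the reference model
    itself contributes log_bf = 0. `prior` defaults to equal model priors."""
    log_bf = np.asarray(log_bf, dtype=float)
    if prior is None:
        prior = np.full(log_bf.size, 1.0 / log_bf.size)
    # Unnormalized log posterior weights: ln pi(M_j) + ln BF_{j*}.
    w = np.log(prior) + log_bf
    w -= w.max()            # guard against overflow before exponentiating
    p = np.exp(w)
    return p / p.sum()      # normalizing reproduces the ratio in (1-3)

# Hypothetical example: a reference model plus two competitors
# with ln BF = 2 ("positive") and ln BF = 6 ("strong").
probs = posterior_probs([0.0, 2.0, 6.0])
print(probs)  # the ln BF = 6 model dominates
```

Including the reference model with BF_{**} = 1 in the list and normalizing is algebraically identical to the explicit "1 + sum" denominator of (1-3).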
Assuming equal model prior probabilities in (1-3), the prior odds are set equal to one, and the model posterior odds in (1-2) become p(M_j \mid x)/p(M_k \mid x) = BF_{jk}(x). Based on the Bayes factors, the evidence in favor of one or another model can be interpreted using Table 1-1, adapted from Kass & Raftery (1995).
Table 1-1. Interpretation of BF_{jk} when contrasting M_j and M_k

ln BF_{jk}   BF_{jk}      Evidence in favor of M_j   P(M_j | x)
0 to 2       1 to 3       Weak evidence              0.5 to 0.75
2 to 6       3 to 20      Positive evidence          0.75 to 0.95
6 to 10      20 to 150    Strong evidence            0.95 to 0.99
> 10         > 150        Very strong evidence       > 0.99
Bayesian hypothesis testing and model selection procedures through Bayes factors and posterior probabilities have several desirable features. First, these methods have a straightforward interpretation, since the Bayes factor is an increasing function of model (or hypothesis) posterior probabilities. Second, these methods can yield frequentist-matching confidence bounds when implemented with good testing priors (Kass & Wasserman 1996), such as the reference priors of Berger & Bernardo (1992). Third, since the Bayes factor contains the ratio of marginal densities, it automatically penalizes complexity according to the number of parameters in each model; this property is known as Ockham's razor (Kass & Raftery 1995). Fourth, the use of Bayes factors does not require nested hypotheses (i.e., the null hypothesis nested in the alternative), standard distributions, or regular asymptotics (e.g., convergence to normal or chi-squared distributions) (Berger et al. 2001). In contrast, this is not always the case with frequentist and likelihood ratio tests, which depend on known distributions (at least asymptotically) for the test statistic. Finally, Bayesian hypothesis testing procedures using the Bayes factor can naturally incorporate model uncertainty through the Bayesian machinery for model-averaged predictions and confidence bounds (Kass & Raftery 1995); it is not clear how to account for this uncertainty rigorously in a fully frequentist approach.
1.3 Overview of the Chapters
In the chapters that follow, we develop a flexible and straightforward hierarchical Bayesian framework for occupancy models, allowing us to obtain estimates and conduct robust testing from an "objective" Bayesian perspective. Latent mixtures of random variables supply a foundation for our methodology. This approach provides a means to directly incorporate spatial dependency and temporal heterogeneity through predictors that characterize either the habitat quality of a given site or the detectability features of a particular survey conducted at a specific site. The Bayesian testing methods we propose, in turn, are (1) a fully automatic and objective method for occupancy model selection, and (2) an objective Bayesian testing tool that accounts for multiple testing and for polynomial hierarchical structure in the space of predictors.
Chapter 2 introduces the methods proposed for estimation of occupancy model parameters. A simple estimation procedure for the single-season occupancy model with covariates is formulated, using both probit and logit links. Building on this simple version, an extension is provided to cope with metapopulation dynamics by introducing persistence and colonization processes. Finally, given the fundamental role that spatial dependence plays in defining temporal dynamics, a strategy to seamlessly account for this feature within our framework is introduced.
Chapter 3 develops a new, fully automatic and objective method for occupancy model selection that is asymptotically consistent for variable selection and averts the use of tuning parameters. In this chapter, some issues surrounding multimodel inference are first described, and insight into objective Bayesian inferential procedures is provided. Then, building on modern methods for "objective" Bayesian testing to generate priors on the parameter space, the intrinsic priors for the parameters of the occupancy model are obtained. These are used in the construction of an algorithm for "objective" variable selection tailored to the occupancy model framework.
Chapter 4 touches on two important and interconnected issues in model testing that have yet to receive the attention they deserve: (1) controlling for false discovery in hypothesis testing given the size of the model space, i.e., given the number of tests performed; and (2) non-invariance to location transformations of variable selection procedures in the presence of polynomial predictor structure. Both elements depend on the definition of prior probabilities on the model space. In this chapter, a set of priors on the model space and a stochastic search algorithm are proposed. Together, these control for model multiplicity and account for the polynomial structure among the predictors.
CHAPTER 2
MODEL ESTIMATION METHODS
"Data! Data! Data!" he cried impatiently. "I can't make bricks without clay."
–Sherlock Holmes, The Adventure of the Copper Beeches
2.1 Introduction
Prior to the introduction of site-occupancy models (MacKenzie et al., 2002; Tyre et al., 2003), presence-absence data from ecological monitoring programs were used without any adjustment to assess the impact of management actions, to observe trends in species distribution through space and time, or to model the habitat of a species (Tyre et al., 2003). These efforts, however, were suspect because false-negative errors were not accounted for. False-negative errors occur whenever a species is present at a site but goes undetected during the survey.
Site-occupancy models, developed independently by MacKenzie et al. (2002) and Tyre et al. (2003), extend simple binary-regression models to account for the aforementioned errors in detection of individuals, common in surveys of animal or plant populations. Since their introduction, the site-occupancy framework has been used in countless applications, and numerous extensions for it have been proposed. Occupancy models improve upon traditional binary regression by analyzing observed detection and partially observed presence as two separate but related components. In the site-occupancy setting, the chosen locations are surveyed repeatedly in order to reduce the ambiguity caused by the observed zeros. This approach therefore allows simultaneous estimation of the probabilities of presence (occurrence) and detection.
Several extensions to the basic single-season, closed-population model are now available. The occupancy approach has been used to determine species range dynamics (MacKenzie et al., 2003; Royle & Kery, 2007), to understand age/stage structure within populations (Nichols et al., 2007), and to model species co-occurrence (MacKenzie et al., 2004; Ovaskainen et al., 2010; Waddle et al., 2010). It has even been suggested as a surrogate for abundance (MacKenzie & Nichols, 2004): MacKenzie et al. suggested using occupancy models to conduct large-scale monitoring programs, since this approach avoids the high costs associated with surveys designed for abundance estimation. Also, to investigate metapopulation dynamics, occupancy models improve upon incidence function models (Hanski, 1994), which are often parameterized in terms of site (or patch) occupancy and assume homogeneous patches and a metapopulation at colonization-extinction equilibrium.
Nevertheless, the implementation of Bayesian occupancy models commonly resorts to sampling strategies dependent on hyper-parameters, subjective prior elicitation, and relatively elaborate algorithms. From the standpoint of practitioners, these are often treated as black-box methods (Kery, 2010), so the potential for using the methodology incorrectly is high. Commonly, these procedures are fitted with packages such as BUGS or JAGS. Although the ease of use of these packages has led to widespread adoption of the methods, the user may be oblivious to the assumptions underpinning the analysis.
We believe that providing straightforward and robust alternatives for implementing these methods will help practitioners gain insight into how occupancy modeling, and more generally Bayesian modeling, is performed. In this chapter, using a simple Gibbs sampling approach, we first develop a versatile method to estimate the single-season, closed-population site-occupancy model, then extend it to analyze metapopulation dynamics through time, and finally provide a further adaptation to incorporate spatial dependence among neighboring sites.
2.1.1 The Occupancy Model
In this section we first introduce our results published in Dorazio & Taylor-Rodríguez (2012) and build upon them to propose relevant extensions. Under the standard sampling protocol for collecting site-occupancy data, J > 1 independent surveys are conducted at each of N representative sample locations (sites), noting whether a species is detected or not detected during each survey. Let yij denote a binary random variable that indicates detection (yij = 1) or non-detection (yij = 0) during the j-th survey of site i. Without loss of generality, J may be assumed constant among all N sites to simplify description of the model; in practice, site-specific variation in J poses no real difficulties and is easily implemented. This sampling protocol therefore yields an N × J matrix Y of detection/non-detection data.
Note that the observed process yij is an imperfect representation of the underlying occupancy (presence) process. Hence, letting zi denote the presence indicator at site i, this model specification can be represented through the hierarchy

yij | zi, λ ∼ Bernoulli(zi pij)
zi | α ∼ Bernoulli(ψi),  (2–1)

where pij is the probability of correctly classifying the i-th site as occupied during the j-th survey, and ψi is the presence probability at the i-th site. The graphical representation of this process is shown in Figure 2-1.
Figure 2-1. Graphical representation of the occupancy model (ψi → zi → yi ← pi).
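The generative hierarchy in (2–1) is easy to simulate, which is useful later for checking estimation code. Below is a minimal sketch; the constant ψ and p (rather than site- and survey-specific covariates) and the function name are our simplifying assumptions for illustration.

```python
import numpy as np

def simulate_occupancy(N, J, psi, p, rng):
    """Simulate detection histories from the hierarchy (2-1):
    z_i ~ Bernoulli(psi), y_ij | z_i ~ Bernoulli(z_i * p)."""
    z = rng.binomial(1, psi, size=N)                   # latent presence at each site
    y = rng.binomial(1, p, size=(N, J)) * z[:, None]   # detections occur only where present
    return z, y

rng = np.random.default_rng(42)
z, y = simulate_occupancy(N=200, J=5, psi=0.6, p=0.4, rng=rng)

# Sites with all-zero histories may still be occupied (false negatives),
# so the naive detection-based occupancy rate underestimates psi:
naive_occ = (y.sum(axis=1) > 0).mean()
```

Comparing `naive_occ` with `z.mean()` illustrates exactly the ambiguity of observed zeros that repeated surveys are meant to reduce.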
Probabilities of detection and occupancy can both be made functions of covariates, and their corresponding parameter estimates can be obtained using either a maximum likelihood or a Bayesian approach. Existing methodologies from the likelihood perspective marginalize over the latent occupancy process (zi), making the estimation procedure depend only on the detections. Most Bayesian strategies rely on MCMC algorithms that require parameter prior specification and tuning. However, Albert & Chib (1993) proposed a longstanding strategy in the Bayesian statistical literature that models binary outcomes using a simple Gibbs sampler. This procedure, described in the following section, can be extrapolated to the occupancy setting, eliminating the need for tuning parameters and subjective prior elicitation.
2.1.2 Data Augmentation Algorithms for Binary Models
Probit model: data augmentation with latent normal variables
At the root of Albert & Chib's algorithm lies the idea that if the observed outcome is 0, the latent variable can be simulated from a truncated normal distribution with support (−∞, 0], and if the outcome is 1, the latent variable can be simulated from a truncated normal distribution on (0, ∞). To understand the reasoning behind this strategy, let Y ∼ Bernoulli(Φ(xTβ)) and V = xTβ + ε, with ε ∼ N(0, 1). In such a case, note that

Pr(y = 1 | xTβ) = Φ(xTβ) = Pr(ε < xTβ) = Pr(ε > −xTβ) = Pr(v > 0 | xTβ).

Thus, whenever y = 1, then v > 0, and v ≤ 0 otherwise. In other words, we may think of y as a truncated version of v. We can therefore sample iteratively, alternating between the latent variables conditioned on the model parameters and vice versa, to draw from the desired posterior densities. By augmenting the data with the latent variables, we obtain full conditional posterior distributions for the model parameters that are easy to draw from (Equation 2–3 below): just as we may sample the latent variables given the parameters, we may also sample the parameters given the latent variables.
Given some initial values for all model parameters, values for the latent variables can be simulated. Conditioning on the latter, it is then possible to draw samples from the parameters' posterior distributions; these samples can in turn be used to generate new values for the latent variables, and so on. The process is iterated using a Gibbs sampling approach. After a large number of iterations, it yields draws from the joint posterior distribution of the latent variables and the model parameters, conditional on the observed outcome values. We formalize the procedure below.
Assume that each outcome Y1, Y2, ..., Yn is such that Yi | xi, β ∼ Bernoulli(qi), where qi = Φ(xTi β), with Φ(·) the standard normal CDF, and where xi and β are the p-dimensional vectors of observed covariates for the i-th observation and their corresponding parameters, respectively.

Now let y = (y1, y2, ..., yn) be the vector of observed outcomes, and let [β] represent the prior distribution of the model parameters. The posterior distribution of β is then given by

[β | y] ∝ [β] ∏i=1..n Φ(xTi β)^yi (1 − Φ(xTi β))^(1−yi),  (2–2)
which is intractable. Nevertheless, introducing latent random variables V = (V1, ..., Vn) such that Vi ∼ N(xTi β, 1), with Vi > 0 whenever Yi = 1 and Vi ≤ 0 whenever Yi = 0, resolves this difficulty. This yields

[β, v | y] ∝ [β] ∏i=1..n φ(vi | xTi β, 1) {I(vi ≤ 0) I(yi = 0) + I(vi > 0) I(yi = 1)},  (2–3)

where φ(x | µ, τ²) is the probability density function of a normal random variable x with mean µ and variance τ². The data augmentation device works because [β | y] = ∫ [β, v | y] dv; hence, if we sample from the joint posterior (2–3) and keep only the sampled values of β, they correspond to samples from [β | y].
From the expression above it is possible to obtain the full conditional distributions for V and β, so a Gibbs sampler can be constructed. For example, if we use a flat prior for β (i.e., [β] ∝ 1), the full conditionals are given by

β | V, y ∼ MVNp((XTX)−1(XTV), (XTX)−1)  (2–4)
V | β, y ∼ ∏i=1..n trN(xTi β, 1, Qi),  (2–5)

where MVNp(µ, Σ) represents a p-variate normal distribution with mean vector µ and variance-covariance matrix Σ, and trN(ξ, σ², Q) stands for the truncated normal distribution with mean ξ, variance σ², and truncation region Q. For each i = 1, 2, ..., n, the support of the truncated variable is Qi = (−∞, 0] if yi = 0 and Qi = (0, ∞) otherwise. Note that conjugate normal priors could be used instead.

At iteration m + 1, the Gibbs sampler draws V(m+1) conditional on β(m) from (2–5), and then samples β(m+1) conditional on V(m+1) from (2–4). This process is repeated for m = 0, 1, ..., nsim, where nsim is the number of iterations of the Gibbs sampler.
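The two steps above translate almost line for line into code. The sketch below assumes a flat prior on β, so the full conditionals are exactly (2–4) and (2–5); the function name and default settings are ours, for illustration only.

```python
import numpy as np
from scipy.stats import truncnorm

def probit_gibbs(X, y, nsim=1000, rng=None):
    """Albert & Chib (1993) Gibbs sampler for probit regression
    under a flat prior on beta (full conditionals (2-4)-(2-5))."""
    if rng is None:
        rng = np.random.default_rng(0)
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)          # posterior covariance of beta
    beta = np.zeros(p)
    draws = np.empty((nsim, p))
    # Truncation region Q_i: v_i > 0 if y_i = 1, v_i <= 0 if y_i = 0
    lower = np.where(y == 1, 0.0, -np.inf)
    upper = np.where(y == 1, np.inf, 0.0)
    for m in range(nsim):
        mu = X @ beta
        # (2-5): draw latent v_i from trN(x_i' beta, 1, Q_i)
        v = truncnorm.rvs(lower - mu, upper - mu, loc=mu, scale=1.0,
                          random_state=rng)
        # (2-4): draw beta | v from MVN((X'X)^-1 X'v, (X'X)^-1)
        beta = rng.multivariate_normal(XtX_inv @ (X.T @ v), XtX_inv)
        draws[m] = beta
    return draws
```

On data simulated from a probit model, the posterior mean of the retained draws recovers the generating coefficients up to Monte Carlo error.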
Logit model: data augmentation with latent Polya-gamma variables
Recently, Polson et al. (2013) developed a novel and efficient approach to Bayesian inference for logistic models using Polya-gamma latent variables, analogous to the Albert & Chib algorithm. The result arises from what the authors refer to as the Polya-gamma distribution. To construct a random variable from this family, consider the infinite mixture of an iid sequence of Exp(1) random variables {Ek}k≥1 given by

ω = (2/π²) Σk=1..∞ Ek / (2k − 1)²,

with probability density function

g(ω) = Σk=0..∞ (−1)^k [(2k + 1)/√(2πω³)] e^(−(2k+1)²/(8ω)) I(ω ∈ (0, ∞))  (2–6)

and Laplace transform E[e^(−tω)] = cosh^(−1)(√(t/2)).
The Polya-gamma family of densities is obtained through an exponential tilting of the density g in (2–6). These densities, indexed by c ≥ 0, are characterized by

f(ω | c) = cosh(c/2) e^(−c²ω/2) g(ω).
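Polson et al. (2013) also give a series representation, PG(b, c) =d (1/(2π²)) Σk≥1 gk / ((k − 1/2)² + c²/(4π²)) with gk iid Gamma(b, 1), which yields a simple approximate sampler by truncating the sum. Exact rejection samplers exist; the sketch below, including the function name and the truncation level K, is our own simplification for experimentation.

```python
import numpy as np

def rpg_approx(b, c, K=200, rng=None):
    """Approximate draw from PG(b, c) via the truncated series
    omega = (1/(2 pi^2)) * sum_k g_k / ((k - 1/2)^2 + c^2/(4 pi^2)),
    g_k ~ Gamma(b, 1) iid (Polson, Scott & Windle, 2013)."""
    if rng is None:
        rng = np.random.default_rng(0)
    k = np.arange(1, K + 1)
    g = rng.gamma(shape=b, scale=1.0, size=K)
    return (g / ((k - 0.5) ** 2 + c ** 2 / (4 * np.pi ** 2))).sum() / (2 * np.pi ** 2)

# Sanity check against the known mean E[PG(b, c)] = (b / (2c)) tanh(c / 2):
rng = np.random.default_rng(3)
draws = np.array([rpg_approx(1.0, 2.0, rng=rng) for _ in range(4000)])
```

The truncation bias is O(1/K), so K = 200 is already adequate for matching low-order moments.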
The likelihood for the binomial logistic model can be expressed in terms of latent Polya-gamma variables as follows. Assume yi ∼ Bernoulli(δi), with predictors x′i = (xi1, ..., xip) and success probability δi = e^(x′iβ) / (1 + e^(x′iβ)). The posterior for the model parameters can then be represented as

[β | y] = [β] ∏i=1..n δi^yi (1 − δi)^(1−yi) / c(y),

where c(y) is the normalizing constant.

To facilitate the sampling procedure, a data augmentation step can be performed by introducing a Polya-gamma random variable ω ∼ PG(1, x′β). This yields the data-augmented posterior

[β, ω | y] = (∏i=1..n Pr(yi = 1 | β)) f(ω | x′β) [β] / c(y),  (2–7)

such that [β | y] = ∫R+ [β, ω | y] dω.
Thus, from the augmented model, the full conditional density for β is given by

[β | ω, y] ∝ (∏i=1..n Pr(yi = 1 | β)) f(ω | x′β) [β]
= ∏i=1..n [(e^(x′iβ))^yi / (1 + e^(x′iβ))] ∏i=1..n cosh(|x′iβ|/2) exp(−(x′iβ)² ωi / 2) g(ωi) [β].  (2–8)

This expression yields a normal posterior distribution if β is assigned a flat or normal prior. Hence, a two-step sampling strategy analogous to that of Albert & Chib (1993) can be used to estimate β in the occupancy framework.
2.2 Single Season Occupancy
Let pij = F(qTij λ) be the probability of correctly classifying the i-th site as occupied during the j-th survey, conditional on the site being occupied, and let ψi = F(xTi α) correspond to the presence probability at the i-th site. Further, let F−1(·) denote a link function (i.e., probit or logit) connecting the response to the predictors, and denote by λ and α, respectively, the r-variate and p-variate coefficient vectors for the detection and presence probabilities. Then the joint posterior for the presence indicators and the model parameters is

π*(z, α, λ) ∝ πα(α) πλ(λ) ∏i=1..N F(x′iα)^zi (1 − F(x′iα))^(1−zi) × ∏j=1..J (zi F(q′ijλ))^yij (1 − zi F(q′ijλ))^(1−yij).  (2–9)
As in the simple probit regression problem, this posterior is intractable, so sampling from it directly is not possible. However, the procedures of Albert & Chib for the probit model and of Polson et al. for the logit model can be extended to generate an MCMC sampling strategy for the occupancy problem. In what follows, we use this framework to develop samplers with which occupancy parameter estimates can be obtained for both probit and logit link functions. These algorithms have the added benefit that they require neither tuning parameters nor subjective elicitation of parameter priors.
2.2.1 Probit Link Model
To extend Albert & Chib's algorithm to the occupancy framework with a probit link, we first introduce two sets of latent variables, denoted by wij and vi, corresponding to the normal latent variables used to augment the data. The corresponding hierarchy is

yij | zi, wij ∼ Bernoulli(zi I(wij > 0))
wij | λ ∼ N(q′ijλ, 1)
λ ∼ [λ]
zi | vi ∼ Bernoulli(I(vi > 0))
vi | α ∼ N(x′iα, 1)
α ∼ [α],  (2–10)
represented by the directed graph in Figure 2-2.

Figure 2-2. Graphical representation of the occupancy model after data augmentation (α → vi → zi → yi ← wi ← λ).
Under this hierarchical model, the joint density is given by

π*(z, v, α, w, λ) ∝ Cy πα(α) πλ(λ) ∏i=1..N φ(vi; x′iα, 1) I(vi > 0)^zi I(vi ≤ 0)^(1−zi) × ∏j=1..J (zi I(wij > 0))^yij (1 − zi I(wij > 0))^(1−yij) φ(wij; q′ijλ, 1).  (2–11)
The full conditional densities derived from the posterior in Equation 2–11 are detailed below.

1. The full conditional of z, obtained after integrating out v and w:

f(z | α, λ) = ∏i=1..N f(zi | α, λ) = ∏i=1..N ψ*i^zi (1 − ψ*i)^(1−zi),
where ψ*i = ψi ∏j=1..J pij^yij (1 − pij)^(1−yij) / [ψi ∏j=1..J pij^yij (1 − pij)^(1−yij) + (1 − ψi) ∏j=1..J I(yij = 0)].  (2–12)

2. f(v | z, α) = ∏i=1..N f(vi | zi, α) = ∏i=1..N trN(x′iα, 1, Ai),
where Ai = (−∞, 0] if zi = 0 and Ai = (0, ∞) if zi = 1,  (2–13)
and trN(µ, σ², A) denotes the pdf of a truncated normal random variable with mean µ, variance σ², and truncation region A.

3. f(α | v) = φp(α; Σα X′v, Σα),  (2–14)
where Σα = (X′X)−1 and φk(x; µ, Σ) represents the k-variate normal density with mean vector µ and variance matrix Σ.

4. f(w | y, z, λ) = ∏i=1..N ∏j=1..J f(wij | yij, zi, λ) = ∏i=1..N ∏j=1..J trN(q′ijλ, 1, Bij),
where Bij = (−∞, ∞) if zi = 0; (−∞, 0] if zi = 1 and yij = 0; and (0, ∞) if zi = 1 and yij = 1.  (2–15)

5. f(λ | w) = φr(λ; Σλ Q′w, Σλ),  (2–16)
where Σλ = (Q′Q)−1.
The Gibbs sampling algorithm for the model can then be summarized as follows:

1. Initialize z, α, v, λ, and w.
2. Sample zi ∼ Bernoulli(ψ*i).
3. Sample vi from a truncated normal with µ = x′iα, σ = 1, and truncation region depending on zi.
4. Sample α ∼ N(Σα X′v, Σα), with Σα = (X′X)−1.
5. Sample wij from a truncated normal with µ = q′ijλ, σ = 1, and truncation region depending on yij and zi.
6. Sample λ ∼ N(Σλ Q′w, Σλ), with Σλ = (Q′Q)−1.
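The six steps above can be sketched compactly. The code below is illustrative only, not the dissertation's implementation: it assumes a constant number of surveys J, flat priors on α and λ, and probit links on both components, and the function name and data layout are our own choices.

```python
import numpy as np
from scipy.stats import norm, truncnorm

def occupancy_probit_gibbs(Y, X, Q, nsim=500, rng=None):
    """Single-season occupancy Gibbs sampler (probit links, flat priors).
    Y: (N, J) detections; X: (N, p) occupancy covariates;
    Q: (N, J, r) detection covariates."""
    if rng is None:
        rng = np.random.default_rng(0)
    N, J = Y.shape
    p, r = X.shape[1], Q.shape[2]
    Qf = Q.reshape(N * J, r)                    # flattened detection design matrix
    Sx = np.linalg.inv(X.T @ X)                 # (X'X)^-1
    Sq = np.linalg.inv(Qf.T @ Qf)               # (Q'Q)^-1
    alpha, lam = np.zeros(p), np.zeros(r)
    z = (Y.sum(axis=1) > 0).astype(float)       # detected at least once => present
    never = Y.sum(axis=1) == 0                  # only all-zero sites are uncertain
    keep = {"alpha": np.empty((nsim, p)), "lam": np.empty((nsim, r))}
    for m in range(nsim):
        # Step 2: z_i ~ Bern(psi*_i), with psi*_i as in (2-12)
        psi = norm.cdf(X @ alpha)
        pij = norm.cdf(np.einsum("ijk,k->ij", Q, lam))
        num = psi * (1 - pij).prod(axis=1)
        z[never] = rng.binomial(1, num[never] / (num[never] + 1 - psi[never]))
        # Step 3: v_i truncated normal, support set by z_i (2-13)
        mu = X @ alpha
        lo = np.where(z == 1, 0.0, -np.inf)
        hi = np.where(z == 1, np.inf, 0.0)
        v = truncnorm.rvs(lo - mu, hi - mu, loc=mu, scale=1.0, random_state=rng)
        # Step 4: alpha | v (2-14)
        alpha = rng.multivariate_normal(Sx @ (X.T @ v), Sx)
        # Step 5: w_ij truncated normal; unconstrained when z_i = 0 (2-15)
        muq = (Qf @ lam).reshape(N, J)
        lo = np.where((z[:, None] == 1) & (Y == 1), 0.0, -np.inf)
        hi = np.where((z[:, None] == 1) & (Y == 0), 0.0, np.inf)
        w = truncnorm.rvs(lo - muq, hi - muq, loc=muq, scale=1.0, random_state=rng)
        # Step 6: lambda | w (2-16)
        lam = rng.multivariate_normal(Sq @ (Qf.T @ w.ravel()), Sq)
        keep["alpha"][m], keep["lam"][m] = alpha, lam
    return keep
```

Note that sites with at least one detection have ψ*i = 1 under (2–12), so only all-zero detection histories need their zi resampled.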
2.2.2 Logit Link Model
Now turning to the logit link version of the occupancy model, again let yij be the indicator variable used to mark detection of the target species on the j-th survey at the i-th site, and let zi be the indicator variable that denotes presence (zi = 1) or absence (zi = 0) of the target species at the i-th site. The model is now defined by

yij | zi, λ ∼ Bernoulli(zi pij), where pij = e^(q′ijλ) / (1 + e^(q′ijλ))
λ ∼ [λ]
zi | α ∼ Bernoulli(ψi), where ψi = e^(x′iα) / (1 + e^(x′iα))
α ∼ [α].
In this hierarchy, the contribution of a single site to the likelihood is

Li(α, λ) = [(e^(x′iα))^zi / (1 + e^(x′iα))] ∏j=1..J [zi e^(q′ijλ) / (1 + e^(q′ijλ))]^yij [1 − zi e^(q′ijλ) / (1 + e^(q′ijλ))]^(1−yij).  (2–17)
As in the probit case, we augment the likelihood with two separate sets of latent variables, in this case each having a Polya-gamma distribution. Augmenting the model and using the posterior in (2–7), the joint is

[z, v, w, α, λ | y] ∝ [α][λ] ∏i=1..N [(e^(x′iα))^zi / (1 + e^(x′iα))] cosh(|x′iα|/2) exp(−(x′iα)² vi / 2) g(vi) × ∏j=1..J [zi e^(q′ijλ) / (1 + e^(q′ijλ))]^yij [1 − zi e^(q′ijλ) / (1 + e^(q′ijλ))]^(1−yij) × cosh(|zi q′ijλ|/2) exp(−(zi q′ijλ)² wij / 2) g(wij).  (2–18)
The full conditionals for z, α, v, λ, and w obtained from (2–18) are provided below.

1. The full conditional for z, obtained after marginalizing the latent variables, yields

f(z | α, λ) = ∏i=1..N f(zi | α, λ) = ∏i=1..N ψ*i^zi (1 − ψ*i)^(1−zi),
where ψ*i = ψi ∏j=1..J pij^yij (1 − pij)^(1−yij) / [ψi ∏j=1..J pij^yij (1 − pij)^(1−yij) + (1 − ψi) ∏j=1..J I(yij = 0)].  (2–19)
2. Using the result derived in Polson et al. (2013), we have that

f(v | z, α) = ∏i=1..N f(vi | zi, α) = ∏i=1..N PG(1, x′iα).  (2–20)

3. f(α | v, z) ∝ [α] ∏i=1..N exp[zi x′iα − x′iα/2 − (x′iα)² vi / 2].  (2–21)

4. By the same result as that used for v, the full conditional for w is

f(w | y, z, λ) = ∏i=1..N ∏j=1..J f(wij | yij, zi, λ) = (∏i∈S1 ∏j=1..J PG(1, |q′ijλ|)) (∏i∉S1 ∏j=1..J PG(1, 0)),  (2–22)
with S1 = {i ∈ {1, 2, ..., N} : zi = 1}.

5. f(λ | z, y, w) ∝ [λ] ∏i∈S1 ∏j=1..J exp[yij q′ijλ − q′ijλ/2 − (q′ijλ)² wij / 2],  (2–23)
with S1 as defined above.
The Gibbs sampling algorithm is analogous to the one with a probit link, but with the obvious modifications to incorporate Polya-gamma instead of normal latent variables.
2.3 Temporal Dynamics and Spatial Structure
The uses of the single-season model are limited to very specific problems. In particular, the assumptions of the basic model may become too restrictive or unrealistic whenever the study period extends over multiple years or seasons, especially given the increasingly changing environmental conditions that most ecosystems are currently experiencing.

Among the many extensions found in the literature, one that we consider particularly relevant incorporates heterogeneous occupancy probabilities through time. Extensions of site-occupancy models that incorporate temporally varying probabilities can be traced back to Hanski (1994); the heterogeneity of occupancy probabilities through time arises from local colonization and extinction processes. MacKenzie et al. (2003) proposed an alternative to Hanski's approach in order to incorporate imperfect detection. The method is flexible enough to let detection, occurrence, survival, and colonization probabilities each depend upon its own set of covariates, using likelihood-based estimation for the model parameters.
However, the approach of MacKenzie et al. presents two drawbacks. First, the uncertainty assessment for maximum likelihood parameter estimates relies on asymptotic results (obtained via the delta method), making it sensitive to sample size. Second, to obtain parameter estimates, the latent occupancy process is marginalized out of the likelihood, leading to the usual zero-inflated Bernoulli model. Although this is a convenient strategy for solving the estimation problem, the latent state variables (occupancy indicators) are no longer available, and as such, finite-sample estimates cannot be calculated unless an additional (and computationally expensive) parametric bootstrap step is performed (Royle & Kery, 2007). Additionally, because the occupancy process is integrated out, the likelihood approach precludes incorporating additional structural dependence through random effects. Thus, the model cannot account for spatial dependence, which plays a fundamental role in this setting.
To work around some of the shortcomings encountered when fitting dynamic occupancy models via likelihood-based methods, Royle & Kery developed what they refer to as the dynamic occupancy state-space (DOSS) model, alluding to the conceptual similarity between this model and the class of state-space models found in the time series literature. In particular, this model retains the latent process (occupancy indicators), making it possible to obtain small-sample estimates and, eventually, to generate extensions that incorporate structure in time and/or space through random effects.
The data used in the DOSS model come from standard repeated presence/absence surveys with N sampling locations (patches or sites), indexed by i = 1, 2, ..., N. Within a given season (e.g., year, month, or week, depending on the biology of the species), each sampling location is visited (surveyed) j = 1, 2, ..., J times. This process is repeated for t = 1, 2, ..., T seasons. An important assumption here is that site occupancy status is closed within, but not across, seasons.
As is usual in the occupancy modeling framework, two different processes are considered. The first is the detection process per site-visit-season combination, denoted by yijt. The yijt are indicator variables that take the value 1 if the species is detected at site i, survey j, and season t, and 0 otherwise; these detection indicators are assumed independent within each site and season. The second process comprises the partially observed presence (occupancy) indicators zit. These indicator variables equal 1 whenever yijt = 1 for one or more of the visits made to site i during season t; otherwise, the values of the zit's are unknown. Royle & Kery refer to these two processes as the observation (yijt) and state (zit) models.
In this setting, the parameters of greatest interest are the occurrence (site occupancy) probabilities, denoted by ψit, as well as those representing the population dynamics, which are accounted for by introducing changes in occupancy status over time through local colonization and survival. That is, if a site was not occupied at season t − 1, at season t it can either be colonized or remain unoccupied. On the other hand, if the site was occupied at season t − 1, it can remain that way (survival) or become abandoned (local extinction) at season t. The probabilities of survival and colonization from season t − 1 to season t at the i-th site are denoted by θi(t−1) and γi(t−1), respectively.

During the initial season, the model for the state process is expressed in terms of the occupancy probability (Equation 2–24). For subsequent seasons, the state process is specified in terms of survival and colonization probabilities (Equation 2–25). In particular,

zi1 ∼ Bernoulli(ψi1)  (2–24)
zit | zi(t−1) ∼ Bernoulli(zi(t−1) θi(t−1) + (1 − zi(t−1)) γi(t−1)).  (2–25)
The observation model, conditional on the latent process zit, is defined by

yijt | zit ∼ Bernoulli(zit pijt).  (2–26)
Royle & Kery induce heterogeneity by site, site-season, and site-survey-season, respectively, in the occupancy, survival, colonization, and detection probabilities through the following specification:

logit(ψi1) = x1 + ri,  ri ∼ N(0, σ²ψ),  logit−1(x1) ∼ Unif(0, 1)
logit(θit) = at + ui,  ui ∼ N(0, σ²θ),  logit−1(at) ∼ Unif(0, 1)
logit(γit) = bt + vi,  vi ∼ N(0, σ²γ),  logit−1(bt) ∼ Unif(0, 1)
logit(pijt) = ct + wij,  wij ∼ N(0, σ²p),  logit−1(ct) ∼ Unif(0, 1),  (2–27)

where x1, at, bt, and ct are the season fixed effects for the corresponding probabilities, and where ri, ui, vi, and wij are the site and site-survey random effects, respectively. Additionally, all variance components are assigned the usual inverse gamma priors.

As the authors state, this formulation can be regarded as "being suitably vague"; however, it is also restrictive, in the sense that it is not clear how to incorporate additional covariates while preserving the straightforward sampling strategy.
2.3.1 Dynamic Mixture Occupancy State-Space Model
We assume that the probabilities for occupancy, survival, colonization, and detection are all functions of linear combinations of covariates. However, our setup varies slightly from that considered by Royle & Kery (2007). In essence, we modify the way in which the estimates for survival and colonization probabilities are attained. Our model incorporates the notion that occupancy at a site occupied during the previous season takes place through persistence, where we define persistence as a function of both survival and colonization. That is, a site occupied at time t may again be occupied at time t + 1 if the current settlers survive, if they perish and new settlers colonize simultaneously, or if both the current settlers survive and new ones colonize.

Our functional forms of choice are again the probit and logit link functions. This means that each probability of interest, which we will refer to for illustration as δ, is linked to a linear combination of covariates x′ξ through the relationship defined by δ = F(xTξ), where F(·) represents the inverse link function. This particular assumption facilitates relating the data augmentation algorithms of Albert & Chib and Polson et al. to Royle & Kery's DOSS model. We refer to this extension of Royle & Kery's model as the Dynamic Mixture Occupancy State-Space (DYMOSS) model.
As before, let yijt be the indicator variable used to mark detection of the target species on the j-th survey at the i-th site during the t-th season, and let zit be the indicator variable that denotes presence (zit = 1) or absence (zit = 0) of the target species at the i-th site in the t-th season, with i ∈ {1, 2, ..., N}, j ∈ {1, 2, ..., J}, and t ∈ {1, 2, ..., T}. Additionally, assume that the probabilities for occupancy at time t = 1, persistence, colonization, and detection are all functions of covariates, with corresponding parameter vectors α, Δ(s) = {δ(s)t−1 : t = 2, ..., T}, B(c) = {β(c)t−1 : t = 2, ..., T}, and Λ = {λt : t = 1, ..., T}, and covariate matrices X(o), X = {Xt−1 : t = 2, ..., T}, and Q = {Qt : t = 1, ..., T}, respectively. Using the notation above, our proposed dynamic occupancy model is defined by the following hierarchy.

State model:

zi1 | α ∼ Bernoulli(ψi1), where ψi1 = F(x′(o)i α)
zit | zi(t−1), δ(s)t−1, β(c)t−1 ∼ Bernoulli(zi(t−1) θi(t−1) + (1 − zi(t−1)) γi(t−1)),
where θi(t−1) = F(δ(s)t−1 + x′i(t−1) β(c)t−1) and γi(t−1) = F(x′i(t−1) β(c)t−1).  (2–28)

Observation model:

yijt | zit, λt ∼ Bernoulli(zit pijt), where pijt = F(qTijt λt).  (2–29)
In the hierarchical setup given by Equations 2–28 and 2–29, θi(t−1) corresponds to the probability of persistence from time t − 1 to time t at site i, and γi(t−1) denotes the colonization probability. Note that θi(t−1) − γi(t−1) yields the survival probability from t − 1 to t. The effect of survival is introduced by changing the intercept of the linear predictor by a quantity δ(s)t−1. Although in this version of the model this effect is accomplished by modifying only the intercept, it can be extended to have covariates determining δ(s)t−1 as well. The graphical representation of the model for a single site is given in Figure 2-3.
Figure 2-3. Graphical representation of the multiseason model for a single site (nodes α, zit, yit, λt, δ(s)t−1, and β(c)t−1 for t = 1, ..., T).
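The persistence/colonization dynamics of the state model (2–28) are straightforward to simulate forward in time. The sketch below is our own illustration: it assumes scalar θ and γ, constant over sites and seasons, and ignores covariates and the detection layer.

```python
import numpy as np

def simulate_dynamic_occupancy(N, T, psi1, theta, gamma, rng):
    """Forward-simulate the state process of (2-28):
    z_{i1} ~ Bern(psi1); z_{it} | z_{i,t-1} ~ Bern(z*theta + (1-z)*gamma)."""
    Z = np.empty((N, T), dtype=int)
    Z[:, 0] = rng.binomial(1, psi1, size=N)
    for t in range(1, T):
        # occupied sites persist with prob theta; empty sites colonize with prob gamma
        prob = np.where(Z[:, t - 1] == 1, theta, gamma)
        Z[:, t] = rng.binomial(1, prob, size=N)
    return Z

rng = np.random.default_rng(11)
Z = simulate_dynamic_occupancy(N=5000, T=30, psi1=0.2, theta=0.8, gamma=0.3, rng=rng)
# Occupancy converges to the stationary level gamma / (1 - theta + gamma) = 0.6
```

This two-state Markov chain view also makes the colonization-extinction equilibrium assumption of incidence function models concrete: the stationary occupancy level is γ / (1 − θ + γ).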
The joint posterior for the model defined by this hierarchical setting is

[z, α, B(c), Δ(s), Λ | y] = Cy ∏i=1..N [ψi1 ∏j=1..J pij1^yij1 (1 − pij1)^(1−yij1)]^zi1 [(1 − ψi1) ∏j=1..J I(yij1 = 0)]^(1−zi1) [λ1][α]
× ∏t=2..T ∏i=1..N [θi(t−1)^zit (1 − θi(t−1))^(1−zit)]^zi(t−1) [γi(t−1)^zit (1 − γi(t−1))^(1−zit)]^(1−zi(t−1))
× [∏j=1..J pijt^yijt (1 − pijt)^(1−yijt)]^zit [∏j=1..J I(yijt = 0)]^(1−zit) [λt][β(c)t−1][δ(s)t−1],  (2–30)
which, as in the single-season case, is intractable; once again, a Gibbs sampler cannot be constructed directly to sample from this joint posterior. The graphical representation of the model for one site, incorporating the latent variables, is provided in Figure 2-4.
Figure 2-4. Graphical representation of the data-augmented multiseason model (the nodes of Figure 2-3 augmented with the latent variables ui, vi(t−1), and wijt).
Probit link: normal-mixture DYMOSS model
We deal with the intractability of the joint posterior distribution as before, that is, by introducing latent random variables. Each latent variable incorporates the relevant linear combination of covariates for the corresponding probability in the model; this device enables us to sample from the joint posterior distribution of the model parameters. For the probit link, the sets of latent random variables, respectively for first-season occupancy, persistence and colonization, and detection, are:

• ui ∼ N(x′(o)i α, 1),
• vi(t−1) ∼ zi(t−1) N(δ(s)t−1 + xTi(t−1) β(c)t−1, 1) + (1 − zi(t−1)) N(xTi(t−1) β(c)t−1, 1), and
• wijt ∼ N(qTijt λt, 1).
Introducing these latent variables into the hierarchical formulation yields the following.

State model:

ui | α ∼ N(x′(o)i α, 1)
zi1 | ui ∼ Bernoulli(I(ui > 0)),

and, for t > 1,

vi(t−1) | zi(t−1), βt−1 ∼ zi(t−1) N(δ(s)t−1 + x′i(t−1) β(c)t−1, 1) + (1 − zi(t−1)) N(x′i(t−1) β(c)t−1, 1)
zit | vi(t−1) ∼ Bernoulli(I(vi(t−1) > 0)).  (2–31)

Observation model:

wijt | λt ∼ N(qTijt λt, 1)
yijt | zit, wijt ∼ Bernoulli(zit I(wijt > 0)).  (2–32)

Note that the result presented in Section 2.2 corresponds to the particular case T = 1 of the model specified by Equations 2–31 and 2–32.
As mentioned previously, model parameters are obtained using a Gibbs sampling approach. Let φ(x | µ, σ²) denote the pdf of a normally distributed random variable x with mean µ and variance σ². Also let:

1. Wt = (w1t, w2t, ..., wNt), with wit = (wi1t, wi2t, ..., wiJit t) (for i = 1, 2, ..., N and t = 1, 2, ..., T);
2. u = (u1, u2, ..., uN); and
3. V = (v1, ..., vT−1), with vt = (v1t, v2t, ..., vNt).
For the probit link model, the joint posterior distribution is

π(Z, u, V, {Wt}t=1..T, α, B(c), Δ(s), Λ) ∝ [α] ∏i=1..N φ(ui | x′(o)i α, 1) I(ui > 0)^zi1 I(ui ≤ 0)^(1−zi1)
× ∏t=2..T [β(c)t−1, δ(s)t−1] ∏i=1..N φ(vi(t−1) | µ(v)i(t−1), 1) I(vi(t−1) > 0)^zit I(vi(t−1) ≤ 0)^(1−zit)
× ∏t=1..T [λt] ∏i=1..N ∏j=1..Jit φ(wijt | q′ijt λt, 1) (zit I(wijt > 0))^yijt (1 − zit I(wijt > 0))^(1−yijt),
where µ(v)i(t−1) = zi(t−1) δ(s)t−1 + x′i(t−1) β(c)t−1.  (2–33)
Initialize the Gibbs sampler at $\alpha^{(0)}$, $B^{(c)(0)}$, $\delta^{(s)(0)}$, and $\Lambda^{(0)}$. The sampler proceeds iteratively by block sampling sequentially for each primary sampling period as follows: first the presence process, then the latent variables from the data-augmentation step for the presence component, followed by the parameters for the presence process, then the latent variables for the detection component, and finally the parameters for the detection component. Letting $[\,\cdot \mid \cdot\,]$ denote the full conditional probability density function of a component given all other unknown parameters and the observed data, for $m = 1, \ldots, n_{\text{sim}}$ the sampling procedure can be summarized as

\begin{align*}
\big[z_1^{(m)} \mid \cdot\big] \rightarrow \big[u^{(m)} \mid \cdot\big] \rightarrow \big[\alpha^{(m)} \mid \cdot\big] \rightarrow \big[W_1^{(m)} \mid \cdot\big] \rightarrow \big[\lambda_1^{(m)} \mid \cdot\big] \rightarrow \big[z_2^{(m)} \mid \cdot\big] \rightarrow \big[V_{2-1}^{(m)} \mid \cdot\big] \rightarrow \big[\beta_{2-1}^{(c)(m)}, \delta_{2-1}^{(s)(m)} \mid \cdot\big] \\
\rightarrow \big[W_2^{(m)} \mid \cdot\big] \rightarrow \big[\lambda_2^{(m)} \mid \cdot\big] \rightarrow \cdots \rightarrow \big[z_T^{(m)} \mid \cdot\big] \rightarrow \big[V_{T-1}^{(m)} \mid \cdot\big] \rightarrow \big[\beta_{T-1}^{(c)(m)}, \delta_{T-1}^{(s)(m)} \mid \cdot\big] \rightarrow \big[W_T^{(m)} \mid \cdot\big] \rightarrow \big[\lambda_T^{(m)} \mid \cdot\big]
\end{align*}

The full conditional probability densities for this Gibbs sampling algorithm are presented in detail in Appendix A.
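To make the block-sampling scheme concrete, the following sketch implements the $T = 1$ special case (the single season probit occupancy model) with flat priors on $\alpha$ and $\lambda$. The truncated-normal helper and all variable names are ours, and the dynamic-model full conditionals of Appendix A are not reproduced here; this is a minimal illustration, not the dissertation's software.

```python
import numpy as np
from scipy.stats import norm

def rtnorm(rng, mu, positive):
    """Inverse-CDF draw from N(mu, 1) truncated to (0, inf) where
    `positive` is True, and to (-inf, 0] elsewhere."""
    p0 = norm.cdf(-mu)                          # P(N(mu, 1) <= 0)
    u = rng.uniform(size=np.shape(mu))
    u = np.where(positive, p0 + u * (1.0 - p0), u * p0)
    return mu + norm.ppf(np.clip(u, 1e-12, 1.0 - 1e-12))

def gibbs_probit_occupancy(y, X, Q, n_iter=1000, seed=0):
    """Albert-Chib-style Gibbs sampler for the T = 1 probit occupancy
    model with flat priors on alpha and lambda.
    y: (N, J) detections; X: (N, p) occupancy design; Q: (N, J, q)."""
    rng = np.random.default_rng(seed)
    N, J = y.shape
    Qf = Q.reshape(N * J, Q.shape[2])
    XtXinv = np.linalg.inv(X.T @ X)
    QtQinv = np.linalg.inv(Qf.T @ Qf)
    alpha = np.zeros(X.shape[1])
    lam = np.zeros(Q.shape[2])
    detected = y.sum(axis=1) > 0                 # z_i = 1 with certainty
    keep_a, keep_l = [], []
    for _ in range(n_iter):
        # 1. presence z for never-detected sites
        psi = norm.cdf(X @ alpha)
        p_all_miss = np.prod(1.0 - norm.cdf(Q @ lam), axis=1)
        pz = psi * p_all_miss / (psi * p_all_miss + 1.0 - psi)
        z = np.where(detected, 1, rng.binomial(1, pz))
        # 2. latent occupancy scores v | z, alpha (truncated normal)
        v = rtnorm(rng, X @ alpha, z == 1)
        # 3. alpha | v (normal linear-model update under a flat prior)
        alpha = rng.multivariate_normal(XtXinv @ (X.T @ v), XtXinv)
        # 4. latent detection scores w | y, z, lambda
        mu_w = Q @ lam
        w = np.where(z[:, None] == 1, rtnorm(rng, mu_w, y == 1),
                     rng.normal(mu_w))           # unconstrained when z_i = 0
        # 5. lambda | w
        lam = rng.multivariate_normal(QtQinv @ (Qf.T @ w.ravel()), QtQinv)
        keep_a.append(alpha)
        keep_l.append(lam)
    return np.array(keep_a), np.array(keep_l)

# Demo on simulated data (all sizes and parameter values are assumptions)
rng = np.random.default_rng(7)
N, J = 150, 4
X = np.column_stack([np.ones(N), rng.normal(size=N)])
Q = np.dstack([np.ones((N, J)), rng.normal(size=(N, J))])
z_true = rng.binomial(1, norm.cdf(X @ np.array([0.2, 1.0])))
y = rng.binomial(1, z_true[:, None] * norm.cdf(Q @ np.array([0.5, 0.3])))
alpha_draws, lam_draws = gibbs_probit_occupancy(y, X, Q, n_iter=200)
print(alpha_draws.mean(axis=0), lam_draws.mean(axis=0))
```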
Logit link: Polya-Gamma DYMOSS model

Using the same notation as before, the logit link model resorts to the hierarchy given by

State model:
\begin{align*}
u_i \mid \alpha &\sim \text{PG}\big(1,\, |x_{(o)i}'\alpha|\big) \\
z_{i1} \mid \alpha &\sim \text{Bernoulli}\left(\frac{e^{x_{(o)i}'\alpha}}{1 + e^{x_{(o)i}'\alpha}}\right) \\
\intertext{and, for $t > 1$,}
v_{i(t-1)} \mid z_{i(t-1)}, \delta_{(t-1)}^{(s)}, \beta_{(t-1)}^{(c)} &\sim \text{PG}\big(1,\, \big|z_{i(t-1)}\delta_{(t-1)}^{(s)} + x_{i(t-1)}'\beta_{(t-1)}^{(c)}\big|\big) \\
z_{it} \mid z_{i(t-1)}, \delta_{(t-1)}^{(s)}, \beta_{(t-1)}^{(c)} &\sim \text{Bernoulli}\left(\frac{e^{\mu_{i(t-1)}^{(v)}}}{1 + e^{\mu_{i(t-1)}^{(v)}}}\right) \tag{2-34}
\end{align*}

Observed model:
\begin{align*}
w_{ijt} \mid \lambda_t &\sim \text{PG}\big(1,\, |q_{ijt}'\lambda_t|\big) \\
y_{ijt} \mid z_{it}, \lambda_t &\sim \text{Bernoulli}\left(z_{it}\, \frac{e^{q_{ijt}'\lambda_t}}{1 + e^{q_{ijt}'\lambda_t}}\right) \tag{2-35}
\end{align*}
The logit link version of the joint posterior is given by

\begin{align*}
\pi\big(Z, u, V, \{W_t\}_{t=1}^{T}, \alpha, B^{(c)}, \delta^{(s)}\big)
\propto {} & [\alpha][\lambda_1] \prod_{i=1}^{N} \frac{\big(e^{x_{(o)i}'\alpha}\big)^{z_{i1}}}{1 + e^{x_{(o)i}'\alpha}}\, \text{PG}\big(u_i;\, 1, |x_{(o)i}'\alpha|\big) \\
& \times \prod_{j=1}^{J_{i1}} \left(z_{i1} \frac{e^{q_{ij1}'\lambda_1}}{1 + e^{q_{ij1}'\lambda_1}}\right)^{y_{ij1}} \left(1 - z_{i1} \frac{e^{q_{ij1}'\lambda_1}}{1 + e^{q_{ij1}'\lambda_1}}\right)^{1 - y_{ij1}} \text{PG}\big(w_{ij1};\, 1, |z_{i1} q_{ij1}'\lambda_1|\big) \\
& \times \prod_{t=2}^{T} \big[\delta_{t-1}^{(s)}\big]\big[\beta_{t-1}^{(c)}\big][\lambda_t] \prod_{i=1}^{N} \frac{\big(e^{\mu_{i(t-1)}^{(v)}}\big)^{z_{it}}}{1 + e^{\mu_{i(t-1)}^{(v)}}}\, \text{PG}\big(v_{i(t-1)};\, 1, \big|\mu_{i(t-1)}^{(v)}\big|\big) \\
& \times \prod_{j=1}^{J_{it}} \left(z_{it} \frac{e^{q_{ijt}'\lambda_t}}{1 + e^{q_{ijt}'\lambda_t}}\right)^{y_{ijt}} \left(1 - z_{it} \frac{e^{q_{ijt}'\lambda_t}}{1 + e^{q_{ijt}'\lambda_t}}\right)^{1 - y_{ijt}} \text{PG}\big(w_{ijt};\, 1, |z_{it} q_{ijt}'\lambda_t|\big), \tag{2-36}
\end{align*}

with $\mu_{i(t-1)}^{(v)} = z_{i(t-1)}\delta_{t-1}^{(s)} + x_{i(t-1)}'\beta_{t-1}^{(c)}$.
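The Polya-Gamma weights appearing in Equation 2-36 can be drawn, approximately, from the infinite-sum representation of the PG(1, z) distribution. The sketch below (not the dissertation's implementation) truncates the sum at a fixed number of terms and checks the known mean $\tanh(z/2)/(2z)$:

```python
import numpy as np

def rpg1_approx(rng, z, n_draws, n_terms=200):
    """Approximate PG(1, z) draws by truncating the infinite-sum
    representation omega = (1 / (2 pi^2)) sum_k g_k / ((k - 1/2)^2
    + z^2 / (4 pi^2)), with g_k iid Exp(1)."""
    k = np.arange(1, n_terms + 1)
    denom = (k - 0.5) ** 2 + z ** 2 / (4.0 * np.pi ** 2)
    g = rng.exponential(size=(n_draws, n_terms))
    return (g / denom).sum(axis=1) / (2.0 * np.pi ** 2)

rng = np.random.default_rng(3)
z = 1.5
omega = rpg1_approx(rng, z, n_draws=100_000)

# PG(1, z) has mean tanh(z / 2) / (2 z); the truncated sum is biased
# slightly downward because the dropped tail terms are all positive.
print(omega.mean(), np.tanh(z / 2.0) / (2.0 * z))
```

In practice, dedicated samplers are far more efficient than this truncation; the point here is only to make the augmentation tangible.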
The sampling procedure is entirely analogous to that described for the probit version. The full conditional densities derived from Expression 2-36 are described in detail in Appendix A.

2.3.2 Incorporating Spatial Dependence

In this section we describe how an additional layer of complexity, space, can also be accounted for within the same data-augmentation framework. The method we employ to incorporate spatial dependence is a slightly modified version of the traditional approach for spatial generalized linear mixed models (GLMMs) and extends the model proposed by Johnson et al. (2013) for the single season, closed population occupancy model.

The traditional approach consists of using spatial random effects to induce a correlation structure among adjacent sites. This formulation, introduced by Besag et al. (1991), assumes that the spatial random effect corresponds to a Gaussian Markov random field (GMRF). The model, known as the spatial GLMM (SGLMM), is used to analyze areal data. It has been applied extensively, given the flexibility of its hierarchical formulation and the availability of software for its implementation (Hughes & Haran, 2013).
Succinctly, spatial dependence is accounted for in the model by adding a random vector $\eta$, assumed to have a conditionally autoregressive (CAR) prior (also known as a Gaussian Markov random field prior). To define the prior, let the pair $G = (V, E)$ represent the undirected graph for the entire spatial region studied, where $V = (1, 2, \ldots, N)$ denotes the vertices of the graph (sites) and $E$ the set of edges between sites; $E$ is constituted by elements of the form $(i, j)$ indicating that sites $i$ and $j$ are spatially adjacent, for $i, j \in V$. The prior for the spatial effects is then characterized by

\begin{equation*}
[\eta \mid \tau] \propto \tau^{\operatorname{rank}(Q)/2} \exp\Big[-\frac{\tau}{2}\, \eta' Q \eta\Big], \tag{2-37}
\end{equation*}
43
where $Q = \operatorname{diag}(A\mathbf{1}) - A$ is the precision matrix, with $A$ denoting the adjacency matrix. The entries of $A$ are such that $\operatorname{diag}(A) = 0$ and $A_{ij} = I_{(i,j) \in E}$.

The matrix $Q$ is singular; hence the probability density defined in Equation 2-37 is improper, i.e., it does not integrate to 1. Regardless of the impropriety of the prior, this model can be fitted using a Bayesian approach, since even if the prior is improper, the posterior for the model parameters is proper. If a constraint such as $\sum_k \eta_k = 0$ is imposed, or if the precision matrix is replaced by a positive definite matrix, the model can also be fitted using a maximum likelihood approach.
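For a concrete example, the precision matrix $Q$ and its rank deficiency can be computed for a small lattice; the adjacency structure below is assumed purely for illustration:

```python
import numpy as np

def lattice_adjacency(nrow, ncol):
    """Rook-neighbor adjacency matrix A for an nrow x ncol lattice."""
    n = nrow * ncol
    A = np.zeros((n, n))
    for r in range(nrow):
        for c in range(ncol):
            i = r * ncol + c
            if c + 1 < ncol:
                A[i, i + 1] = A[i + 1, i] = 1.0
            if r + 1 < nrow:
                A[i, i + ncol] = A[i + ncol, i] = 1.0
    return A

A = lattice_adjacency(2, 3)                  # toy 2 x 3 study region
Qmat = np.diag(A.sum(axis=1)) - A            # Q = diag(A 1) - A
rank = np.linalg.matrix_rank(Qmat)

# Q annihilates the constant vector, so rank(Q) = N - 1 for a connected
# graph, and the density in Equation 2-37 is improper.
print(rank)                                  # 5
```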
Assuming that all but the detection process are subject to spatial correlation, and using the notation we have developed up to this point, the spatially explicit version of the DYMOSS model is characterized by the hierarchy represented by Equations 2-38 and 2-39.

Hence, adding spatial structure into the DYMOSS framework described in the previous section only involves adding steps to sample $\eta^{(o)}$ and $\{\eta_t\}_{t=2}^{T}$ conditional on all other parameters. Furthermore, the regression parameters and spatial random effects of a given component (i.e., occupancy, survival, and colonization) can be effortlessly pooled into a single parameter vector to perform block sampling. For each of the latent variables, the only modification required is to add the corresponding spatial effect to the linear predictor, so that the latent variables retain their conditional independence given the linear combination of fixed effects and the spatial effects.
State model:
\begin{align*}
z_{i1} \mid \alpha, \eta^{(o)} &\sim \text{Bernoulli}(\psi_{i1}), \quad \text{where } \psi_{i1} = F\big(x_{(o)i}'\alpha + \eta_i^{(o)}\big) \\
\big[\eta^{(o)} \mid \tau\big] &\propto \tau^{\operatorname{rank}(Q)/2} \exp\Big[-\frac{\tau}{2}\, \eta^{(o)\prime} Q \eta^{(o)}\Big] \\
z_{it} \mid z_{i(t-1)}, \beta_{t-1}^{(c)}, \delta_{t-1}^{(s)}, \eta_t &\sim \text{Bernoulli}\big(z_{i(t-1)}\theta_{i(t-1)} + (1 - z_{i(t-1)})\gamma_{i(t-1)}\big), \\
&\quad\ \text{where } \theta_{i(t-1)} = F\big(\delta_{(t-1)}^{(s)} + x_{i(t-1)}'\beta_{t-1}^{(c)} + \eta_{it}\big) \text{ and } \gamma_{i(t-1)} = F\big(x_{i(t-1)}'\beta_{t-1}^{(c)} + \eta_{it}\big) \\
[\eta_t \mid \tau] &\propto \tau^{\operatorname{rank}(Q)/2} \exp\Big[-\frac{\tau}{2}\, \eta_t' Q \eta_t\Big] \tag{2-38}
\end{align*}

Observed model:
\begin{equation*}
y_{ijt} \mid z_{it}, \lambda_t \sim \text{Bernoulli}(z_{it} p_{ijt}), \quad \text{where } p_{ijt} = F\big(q_{ijt}'\lambda_t\big) \tag{2-39}
\end{equation*}
In spite of the popularity of this approach to incorporating spatial dependence, three shortcomings have been reported in the literature (Hughes & Haran, 2013; Reich et al., 2006): (1) model parameters have no clear interpretation, due to spatial confounding of the predictors with the spatial effect; (2) there is variance inflation due to spatial confounding; and (3) the high dimensionality of the latent spatial variables leads to high computational costs. To avoid such difficulties, we follow the approach used by Hughes & Haran (2013), which builds upon the earlier work of Reich et al. (2006). This methodology is summarized in what follows.
Let a vector of spatial effects $\eta$ have the CAR prior given by 2-37 above. Now consider a random vector $\zeta \sim \text{MVN}\big(0, \tau K' Q K\big)$, with $Q$ defined as above, where $\tau K' Q K$ corresponds to the precision of the distribution (and not the covariance matrix), and where the matrix $K$ satisfies $K'K = I$.

Setting $\eta = K\zeta$ implies that the linear predictor becomes $X\beta + \eta = X\beta + K\zeta$. With respect to how the matrix $K$ is chosen, Hughes & Haran (2013) recommend basing its construction on the spectral decomposition of operator matrices based on Moran's I. The Moran operator matrix is defined as $P^{\perp} A P^{\perp}$, with $P^{\perp} = I - X(X'X)^{-1}X'$, where $A$ is the adjacency matrix previously described. The choice of the Moran operator is based on the fact that it accounts for the underlying graph while incorporating the spatial structure residual to the design matrix $X$. These elements are reflected in its spectral decomposition: its eigenvalues correspond to the values of Moran's I statistic (a measure of spatial autocorrelation) for a spatial process orthogonal to $X$, while its eigenvectors provide the patterns of spatial dependence residual to $X$. Thus, the matrix $K$ is chosen to be the matrix whose columns are the eigenvectors of the Moran operator for a particular adjacency matrix.
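A sketch of this construction, with an arbitrary chain-graph adjacency and a toy design matrix standing in for the real data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: in practice X is the occupancy design matrix and A the
# study region's adjacency matrix.
n, n_eig = 30, 5
X = np.column_stack([np.ones(n), rng.normal(size=n)])
A = np.zeros((n, n))
for i in range(n - 1):                       # chain-graph adjacency
    A[i, i + 1] = A[i + 1, i] = 1.0

# Moran operator P_perp A P_perp, with P_perp = I - X (X'X)^{-1} X'
P_perp = np.eye(n) - X @ np.linalg.solve(X.T @ X, X.T)
M = P_perp @ A @ P_perp

# K = eigenvectors with the largest eigenvalues (largest Moran's I:
# the smoothest spatial patterns residual to X)
eigval, eigvec = np.linalg.eigh(M)
order = np.argsort(eigval)[::-1]
K = eigvec[:, order[:n_eig]]

# The reduced effect eta = K zeta is orthogonal to the design matrix
print(np.abs(X.T @ K).max())
```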
Using this strategy, the new hierarchical formulation of our model is obtained simply by letting $\eta^{(o)} = K^{(o)}\zeta^{(o)}$ and $\eta_t = K_t\zeta_t$, with

1. $\zeta^{(o)} \sim \text{MVN}\big(0, \tau^{(o)} K^{(o)\prime} Q K^{(o)}\big)$, where $K^{(o)}$ is the eigenvector matrix of $P^{(o)\perp} A P^{(o)\perp}$, and
2. $\zeta_t \sim \text{MVN}\big(0, \tau_t K_t' Q K_t\big)$, where $K_t$ is the eigenvector matrix of $P_t^{\perp} A P_t^{\perp}$, for $t = 2, 3, \ldots, T$.

The algorithms for the probit and logit links from Section 2.3.1 can be readily adapted to incorporate the spatial structure, simply by obtaining the joint posteriors for $(\alpha, \zeta^{(o)})$ and $(\beta_{t-1}^{(c)}, \delta_{t-1}^{(s)}, \zeta_t)$, after making the obvious modification of the corresponding linear predictors to incorporate the spatial components.

2.4 Summary
With a few exceptions (Dorazio & Taylor-Rodríguez, 2012; Johnson et al., 2013; Royle & Kéry, 2007), recent Bayesian approaches to site-occupancy modeling with covariates have relied on model configurations (e.g., multivariate normal priors on parameters in the logit scale) that lead to unfamiliar conditional posterior distributions, thus precluding the use of a direct sampling approach. Therefore, the sampling strategies available are based on algorithms (e.g., Metropolis-Hastings) that require tuning and the knowledge to do so correctly.

In Dorazio & Taylor-Rodríguez (2012) we proposed a Bayesian specification for which a Gibbs sampler of the basic occupancy model is available, and allowed detection and occupancy probabilities to depend on linear combinations of predictors. This method, described in Section 2.2.1, is based on the data augmentation algorithm of Albert & Chib (1993), in which the full conditional posteriors of the parameters of the probit regression model are cast as latent mixtures of normal random variables. The probit and the logit links yield similar results with large sample sizes; however, their results may differ for small to moderate sample sizes, because the logit link function places more mass in the tails of the distribution than the probit link does. In Section 2.2.2 we adapt the method for the single season model to work with the logit link function.
The basic occupancy framework is useful, but it assumes a single closed population with fixed probabilities through time. Hence, its assumptions may not be appropriate for problems where the interest lies in the temporal dynamics of the population. We therefore developed a dynamic model that incorporates the notion that occupancy at a previously occupied site takes place through persistence, which depends both on survival and on habitat suitability. By this we mean that a site occupied at time $t$ may again be occupied at time $t + 1$ if (1) the current settlers survive, (2) the existing settlers perish but new settlers simultaneously colonize, or (3) current settlers survive and new ones colonize during the same season. In our current formulation of the DYMOSS model, both colonization and persistence depend on habitat suitability, characterized by $x_{i(t-1)}'\beta_{t-1}^{(c)}$. They differ only in that persistence is also influenced by whether occupancy of the site during season $t - 1$ enhances or harms the suitability of the site, through density dependence.

Additionally, the study of the dynamics that govern the distribution and abundance of biological populations requires an understanding of the physical and biotic processes that act upon them, and these vary in time and space. Consequently, as a final step in this chapter, we described a straightforward strategy to add spatial dependence among neighboring sites to the dynamic metapopulation model. This extension is based on the popular Bayesian spatial modeling technique of Besag et al. (1991), updated using the methods described in Hughes & Haran (2013).
Future steps along these lines are to (1) develop the software necessary to implement the tools described throughout the chapter, and (2) build a suite of additional extensions of this framework for occupancy models. The first of these extensions will incorporate information from different sources, such as tracks, scats, surveys, and direct observations, into a single model. This can be accomplished by adding a layer to the hierarchy in which the source and spatial scale of the data are accounted for. The second extension is a single season, spatially explicit, multiple species co-occupancy model. This model will allow studying complex interactions and testing hypotheses about species interactions at a given point in time. Lastly, this co-occupancy model will be adapted to incorporate temporal dynamics in the spirit of the DYMOSS model.
CHAPTER 3
INTRINSIC ANALYSIS FOR OCCUPANCY MODELS

Eliminate all other factors, and the one which remains must be the truth.
–Sherlock Holmes, The Sign of Four

3.1 Introduction
Occupancy models are often used to understand the mechanisms that dictate the distribution of a species; therefore, variable selection plays a fundamental role in achieving this goal. To the best of our knowledge, "objective" Bayesian alternatives for variable selection have not been put forth for this problem, and with a few exceptions (Hooten & Hobbs, 2014; Link & Barker, 2009), AIC is the method used to choose among competing site-occupancy models. In addition, the procedures currently implemented and accessible to ecologists require enumerating and estimating all the candidate models (Fiske & Chandler, 2011; Mazerolle & Mazerolle, 2013). In practice this can be achieved if the model space considered is small enough, which is possible if the choice of the model space is guided by substantial prior knowledge about the underlying ecological processes. Nevertheless, many site-occupancy surveys collect large amounts of covariate information about the sampled sites. Given that the total number of candidate models grows exponentially fast with the number of predictors considered, choosing a reduced set of models guided by ecological intuition becomes increasingly difficult. This is even more so the case in the occupancy model context, where the model space is the Cartesian product of models for presence and models for detection. Given the issues mentioned above, we propose the first objective Bayesian variable selection method for the single-season occupancy model framework. This approach explores the entire model space in a principled manner. It is completely automatic, precluding the need both for tuning parameters in the sampling algorithm and for subjective elicitation of parameter prior distributions.
As mentioned above, in ecological modeling, if model selection or (less frequently) model averaging is considered, the Akaike information criterion (AIC) (Akaike, 1983), or a version of it, is the measure of choice for comparing candidate models (Fiske & Chandler, 2011; Mazerolle & Mazerolle, 2013). The AIC is designed to find the model whose density is, on average, closest in Kullback-Leibler distance to the density of the true data generating mechanism; the model with the smallest AIC is selected. However, if nested models are considered, one of them being the true one, the AIC generally will not select the true model (Wasserman, 2000). Commonly, the model selected by AIC will be more complex than the true one. The reason for this is that the AIC has a weak signal-to-noise ratio, and as such it tends to overfit (Rao & Wu, 2001). Other versions of the AIC provide a bias correction that enhances the signal-to-noise ratio, leading to a stronger penalization for model complexity; some examples are the AICc (Hurvich & Tsai, 1989) and AICu (McQuarrie et al., 1997). However, these are also not consistent for selection, albeit asymptotically efficient (Rao & Wu, 2001).
If we are interested in prediction, as opposed to testing, the AIC is certainly appropriate. However, when conducting inference, the use of Bayesian model averaging and selection methods is more fitting. If the true data generating mechanism is among those considered, asymptotically, Bayesian methods choose the true model with probability one. Conversely, if the true model is not among the alternatives and a suitable parameter prior is used, the posterior probability of the most parsimonious model closest to the true one tends asymptotically to one.
In spite of this, in general, for Bayesian testing, direct elicitation of prior probabilistic statements is often impeded because the problems studied may not be sufficiently well understood to make an informed decision about the priors. Alternatively, there may be a prohibitively large number of parameters, making the specification of priors for each of these parameters an arduous task. In addition, seemingly innocuous subjective choices for the priors on the parameter space may drastically affect test outcomes. This has been a recurring argument in favor of objective Bayesian procedures, which appeal to the use of formal rules to build parameter priors that incorporate the structural information inside the likelihood while utilizing some objective criterion (Kass & Wasserman, 1996).
One popular choice of "objective" prior is the reference prior (Berger & Bernardo, 1992), which is the prior that maximizes the amount of signal extracted from the data. These priors have proven to be effective, as they are fully automatic and can be frequentist matching, in the sense that the posterior credible interval agrees with the frequentist confidence interval from repeated sampling with equal coverage probability (Kass & Wasserman, 1996). Reference priors, however, are improper, and while they yield reasonable posterior parameter probabilities, the derived model posterior probabilities may be ill defined. To avoid this shortcoming, Berger & Pericchi (1996) introduced the intrinsic Bayes factor (IBF) for model comparison. Moreno et al. (1998), building on the IBF of Berger & Pericchi (1996), developed a limiting procedure to generate a system of priors that yield well-defined posteriors, even though these priors may sometimes be improper. The IBF is built using a data-dependent prior to automatically generate Bayes factors; however, the extension introduced by Moreno et al. (1998) generates the intrinsic prior by taking a theoretical average over the space of training samples, freeing the prior from data dependence.

In our view, in the face of a large number of predictors, the best alternative is to run a stochastic search algorithm using good "objective" testing parameter priors and to incorporate suitable model priors. This being said, the discussion about model priors is deferred until Chapter 4; this chapter focuses on the priors on the parameter space.

The chapter is structured as follows. First, issues surrounding multimodel inference are described and insight about objective Bayesian inferential procedures is provided. Then, building on modern methods for "objective" Bayesian testing to generate priors on the parameter space, the intrinsic priors for the parameters of the occupancy model are derived. These are used in the construction of an algorithm for "objective" model selection tailored to the occupancy model framework. To assess the performance of our methods, we provide results from a simulation study in which distinct scenarios, both favorable and unfavorable, are used to determine the robustness of these tools, and we analyze the Blue Hawker data set, which has been examined previously in the ecological literature (Dorazio & Taylor-Rodríguez, 2012; Kéry et al., 2010).

3.2 Objective Bayesian Inference
As mentioned before, in practice, noninformative priors arising from structural rules are an alternative to subjective elicitation of priors. Some of the rules used in defining noninformative priors include the principle of insufficient reason, parametrization invariance, maximum entropy, geometric arguments, coverage matching, and decision-theoretic approaches (see Kass & Wasserman (1996) for a discussion).

These rules reflect one of two attitudes: (1) noninformative priors either aim to convey unique representations of ignorance, or (2) they attempt to produce probability statements that may be accepted by convention. This latter attitude is in the same spirit as how weights and distances are defined (Kass & Wasserman, 1996), and characterizes the way in which Bayesian reference methods are interpreted today, i.e., noninformative priors are seen to be chosen by convention according to the situation.

A word of caution must be given when using noninformative priors. Difficulties arise in their implementation that should not be taken lightly. In particular, these difficulties may occur because noninformative priors are generally improper (meaning that they do not integrate or sum to a finite number) and, as such, are said to depend on arbitrary constants.

Bayes factors strongly depend upon the prior distributions for the parameters included in each of the models being compared. This can be an important limitation when using noninformative priors, since their introduction results in Bayes factors that are functions of ratios of arbitrary constants, given that these priors are typically improper (see Jeffreys, 1961; Pericchi, 2005; and references therein). Many different approaches have since been developed to deal with the arbitrary constants that arise when using improper priors. These include the use of partial Bayes factors (Berger & Pericchi, 1996; Good, 1950; Lempers, 1971), setting the ratio of arbitrary constants to a predefined value (Spiegelhalter & Smith, 1982), and approximations to the Bayes factor (see Haughton, 1988, as cited in Berger & Pericchi, 1996; Kass & Raftery, 1995; Tierney & Kadane, 1986).

3.2.1 The Intrinsic Methodology
Berger & Pericchi (1996) cleverly dealt with the arbitrary constants that arise when using improper priors by introducing the intrinsic Bayes factor (IBF) procedure. This solution, based on partial Bayes factors, provides the means to replace the improper priors by proper "posterior" priors. The IBF is obtained by combining the model structure with information contained in the observed data. Furthermore, Berger & Pericchi (1996) showed that, as the sample size tends to infinity, the intrinsic Bayes factor corresponds to the proper Bayes factor arising from the intrinsic priors.

Intrinsic priors, however, are not unique. The asymptotic correspondence between the IBF and the Bayes factor arising from the intrinsic prior yields two functional equations that are solved by a whole class of intrinsic priors. Because all the priors in the class produce Bayes factors that are asymptotically equivalent to the IBF, for finite sample sizes the resulting Bayes factor is not unique. To address this issue, Moreno et al. (1998) formalized the methodology through the "limiting procedure." This procedure allows one to obtain a unique Bayes factor, consolidating the method as a valid objective Bayesian model selection procedure, which we will refer to as the Bayes factor for intrinsic priors (BFIP). This result is particularly valid for nested models, although the methodology may be extended, with some caution, to nonnested models.
As mentioned before, the Bayesian hypothesis testing procedure is highly sensitive to parameter-prior specification, and not all priors that are useful for estimation are recommended for hypothesis testing or model selection. Evidence of this is provided by the Jeffreys-Lindley paradox, which states that a point null hypothesis will always be accepted when the variance of a conjugate prior goes to infinity (Robert, 1993). Additionally, when comparing nested models, the null model should correspond to a substantial reduction in complexity from that of the larger alternative models. Hence, priors for the larger alternative models that place probability mass away from the null model are wasteful: if the true model is "far" from the null, it will be easily detected by any statistical procedure. Therefore, the prior on the alternative models should "work harder" at selecting competitive models that are "close" to the null. This principle, known as the Savage continuity condition (Gunel & Dickey, 1974), is widely recognized by statisticians.

Interestingly, the intrinsic prior in correspondence with the BFIP automatically satisfies the Savage continuity condition. That is, when comparing nested models, the intrinsic prior for the more complex model is centered around the null model, and in spite of being derived from a limiting procedure, it is not subject to the Jeffreys-Lindley paradox.
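The paradox is easy to reproduce numerically in the simplest conjugate setting, $\bar{y} \sim N(\theta, 1/n)$ with $H_0: \theta = 0$ versus $H_1: \theta \sim N(0, \tau^2)$ (a stand-in example, not the occupancy model):

```python
import math

def bf01(ybar, n, tau2):
    """BF of H0: theta = 0 vs H1: theta ~ N(0, tau2), for ybar ~ N(theta, 1/n).
    Marginals: ybar ~ N(0, 1/n) under H0 and N(0, 1/n + tau2) under H1."""
    v0, v1 = 1.0 / n, 1.0 / n + tau2
    log_bf = 0.5 * math.log(v1 / v0) - 0.5 * ybar ** 2 * (1.0 / v0 - 1.0 / v1)
    return math.exp(log_bf)

# Data two standard errors away from the null; yet BF01 -> infinity as the
# prior variance tau2 grows, so the point null is ultimately favored.
n, ybar = 100, 0.2
for tau2 in [0.1, 1.0, 100.0, 1e6]:
    print(tau2, bf01(ybar, n, tau2))
```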
Moreover, beyond the usual pairwise consistency of the Bayes factor for nested models, Casella et al. (2009) show that the corresponding Bayesian procedure with intrinsic priors for variable selection in normal regression is consistent over the entire class of normal linear models, adding an important feature to the list of virtues of the procedure. Consistency of the BFIP for the case where the dimension of the alternative model grows with the sample size is discussed in Moreno et al. (2010).

3.2.2 Mixtures of g-Priors
As previously mentioned, in the Bayesian paradigm a model $M \in \mathcal{M}$ is defined by a sampling density and a prior distribution. The sampling density associated with model $M$ is denoted by $f(y \mid \beta_M, \sigma_M^2, M)$, where $(\beta_M, \sigma_M^2)$ is a vector of model-specific unknown parameters. The prior for model $M$ and its corresponding set of parameters is

$\pi(\beta_M, \sigma_M^2, M \mid \mathcal{M}) = \pi(\beta_M, \sigma_M^2 \mid M, \mathcal{M}) \cdot \pi(M \mid \mathcal{M}).$

Objective local priors for the model parameters $(\beta_M, \sigma_M^2)$ are achieved through modifications and extensions of Zellner's g-prior (Liang et al., 2008; Womack et al., 2014). In particular, below we focus on the intrinsic prior and provide some details for other scaled mixtures of g-priors. We defer the discussion of priors over the model space until Chapter 5, where we describe them in detail and develop a few alternatives of our own.

3.2.2.1 Intrinsic priors
An automatic choice of an objective prior is the intrinsic prior (Berger & Pericchi, 1996; Moreno et al., 1998). Because $M_B \subseteq M$ for all $M \in \mathcal{M}$, the intrinsic prior for $(\beta_M, \sigma_M^2)$ is defined as an expected posterior prior,

\begin{equation*}
\pi^{I}(\beta_M, \sigma_M^2 \mid M) = \int p^{R}(\beta_M, \sigma_M^2 \mid \tilde{y}, M)\, m^{R}(\tilde{y} \mid M_B)\, d\tilde{y}, \tag{3-1}
\end{equation*}

where $\tilde{y}$ is a minimal training sample for model $M$, $I$ denotes the intrinsic distributions, and $R$ denotes distributions derived from the reference prior $\pi^{R}(\beta_M, \sigma_M^2 \mid M) = c_M / \sigma_M^2$. In (3-1), $m^{R}(\tilde{y} \mid M) = \iint f(\tilde{y} \mid \beta_M, \sigma_M^2, M)\, \pi^{R}(\beta_M, \sigma_M^2 \mid M)\, d\beta_M\, d\sigma_M^2$ is the reference marginal of $\tilde{y}$ under model $M$, and $p^{R}(\beta_M, \sigma_M^2 \mid \tilde{y}, M) = f(\tilde{y} \mid \beta_M, \sigma_M^2, M)\, \pi^{R}(\beta_M, \sigma_M^2 \mid M) / m^{R}(\tilde{y} \mid M)$ is the reference posterior density.
In the regression framework, the reference marginal $m^{R}$ is improper and produces improper intrinsic priors. However, the intrinsic Bayes factor of model $M$ to the base model $M_B$ is well defined and given by

\begin{equation*}
BF_{M,M_B}^{I}(y) = (1 - R_M^2)^{-\frac{n - |M_B|}{2}} \times \int_0^1 \left[\frac{n + \sin^2\!\big(\tfrac{\pi}{2}\theta\big)(|M| + 1)}{n + \frac{\sin^2(\frac{\pi}{2}\theta)(|M| + 1)}{1 - R_M^2}}\right]^{\frac{n - |M|}{2}} \left[\frac{\sin^2\!\big(\tfrac{\pi}{2}\theta\big)(|M| + 1)}{n + \frac{\sin^2(\frac{\pi}{2}\theta)(|M| + 1)}{1 - R_M^2}}\right]^{\frac{|M| - |M_B|}{2}} d\theta, \tag{3-2}
\end{equation*}

where $R_M^2$ is the coefficient of determination of model $M$ versus model $M_B$. The Bayes factor between two models $M$ and $M'$ is defined as $BF_{M,M'}^{I}(y) = BF_{M,M_B}^{I}(y) / BF_{M',M_B}^{I}(y)$.

The "goodness" of the model $M$ based on the intrinsic priors is given by its posterior probability,

\begin{equation*}
p^{I}(M \mid y, \mathcal{M}) = \frac{BF_{M,M_B}^{I}(y)\, \pi(M \mid \mathcal{M})}{\sum_{M' \in \mathcal{M}} BF_{M',M_B}^{I}(y)\, \pi(M' \mid \mathcal{M})}. \tag{3-3}
\end{equation*}
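Equations 3-2 and 3-3 involve only a one-dimensional integral and a finite sum, so they can be evaluated directly by quadrature. In the sketch below, the $R^2$ values and model sizes are invented for illustration:

```python
import numpy as np
from scipy.integrate import quad

def bf_intrinsic(R2, n, k, k0):
    """Equation 3-2: intrinsic Bayes factor of model M (|M| = k) versus
    the base model M_B (|M_B| = k0), given the R^2 of M relative to M_B."""
    def integrand(theta):
        s = np.sin(np.pi * theta / 2.0) ** 2 * (k + 1)
        denom = n + s / (1.0 - R2)
        return ((n + s) / denom) ** ((n - k) / 2.0) \
            * (s / denom) ** ((k - k0) / 2.0)
    val, _ = quad(integrand, 0.0, 1.0)
    return (1.0 - R2) ** (-(n - k0) / 2.0) * val

# Equation 3-3 over a toy model space with a uniform model prior; the
# R^2 values and sizes below are made up.
n = 50
models = {"M_B": (0.00, 1), "M_1": (0.30, 2), "M_2": (0.32, 3)}
bf = {m: bf_intrinsic(r2, n, k, k0=1) for m, (r2, k) in models.items()}
post = {m: v / sum(bf.values()) for m, v in bf.items()}
print(post)
```

Note that $BF^{I}_{M_B, M_B} = 1$ by construction, which provides a convenient sanity check for the quadrature.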
It has been shown that the system of intrinsic priors produces consistent model selection (Casella et al., 2009; Girón et al., 2010). In the context of well-formulated models, the true model $M_T$ is the smallest well-formulated model $M \in \mathcal{M}$ such that $\alpha \in M$ if $\beta_\alpha \neq 0$. If $M_T$ is the true model, then the posterior probability of model $M_T$ based on Equation 3-3 converges to 1.

3.2.2.2 Other mixtures of g-priors
Scaled mixtures of g-priors place a reference prior on $(\beta_{M_B}, \sigma^2)$ and a multivariate normal distribution on $\beta_{M \setminus M_B}$, that is, a normal with mean $0$ and precision matrix

$\frac{q_M w}{n \sigma^2}\, Z_M'(I - H_0) Z_M,$

where $H_0$ is the hat matrix associated with $Z_{M_B}$. The prior is completed by a prior on $w$ and a choice of scaling $q_M$, which is set to $|M| + 1$ to account for the minimal sample size of $M$. Under these assumptions, the Bayes factor of $M$ to $M_B$ is given by

\begin{equation*}
BF_{M,M_B}(y) = (1 - R_M^2)^{-\frac{n - |M_B|}{2}} \int \left[\frac{n + w(|M| + 1)}{n + \frac{w(|M| + 1)}{1 - R_M^2}}\right]^{\frac{n - |M|}{2}} \left[\frac{w(|M| + 1)}{n + \frac{w(|M| + 1)}{1 - R_M^2}}\right]^{\frac{|M| - |M_B|}{2}} \pi(w)\, dw.
\end{equation*}

We consider the following priors on $w$. The intrinsic prior corresponds to $w \sim \text{Beta}(1/2, 1/2)$, which is defined only for $w \in (0, 1)$. A version of the Zellner-Siow prior is given by $w \sim \text{Gamma}(1/2, 1/2)$, which produces a multivariate Cauchy distribution on $\beta$. A family of hyper-g priors is defined by $\pi(w) \propto w^{-1/2}(b + w)^{-(a + 1)/2}$; these have Cauchy-like tails but produce more shrinkage than the Cauchy prior.
3.3 Objective Bayes Occupancy Model Selection

As mentioned before, Bayesian inferential approaches for ecological models are lacking. In particular, there is a need for suitable objective and automatic Bayesian testing procedures, and for software implementations that explore the model space considered thoroughly. With this goal in mind, in this section we develop an objective, intrinsic, and fully automatic Bayesian model selection methodology for single season site-occupancy models. We refer to this method as automatic and objective given that its implementation requires no hyperparameter tuning and that it is built using noninformative priors with good testing properties (e.g., intrinsic priors).

An inferential method for the occupancy problem is possible using the intrinsic approach, given that we are able to link intrinsic-Bayesian tools for the normal linear model through our probit formulation of the occupancy model. In other words, because we can represent the single season probit occupancy model through the hierarchy

\begin{align*}
y_{ij} \mid z_i, w_{ij} &\sim \text{Bernoulli}\big(z_i I_{w_{ij} > 0}\big) \\
w_{ij} \mid \lambda &\sim N\big(q_{ij}'\lambda,\, 1\big) \\
z_i \mid v_i &\sim \text{Bernoulli}\big(I_{v_i > 0}\big) \\
v_i \mid \alpha &\sim N\big(x_i'\alpha,\, 1\big),
\end{align*}

it is possible to solve the selection problem on the latent-scale variables $w_{ij}$ and $v_i$ and to use those results at the level of the occupancy and detection processes.

In what follows, we first provide some necessary notation. Then, a derivation of the intrinsic priors for the parameters of the detection and occupancy components is outlined. Using these priors, we obtain the general form of the model posterior probabilities. Finally, the results are incorporated in a model selection algorithm for site-occupancy data. Although priors on the model space are not discussed in this chapter, the software and methods developed have different choices of model priors built in.
331 Preliminaries
The notation used in Chapter 2 will be considered in this section as well Namely
presence will be denoted by z detection by y their corresponding latent processes are
v and w and the model parameters are denoted by α and λ However some additional
notation is also necessary Let M0 =M0y M0z
denote the ldquobaserdquo model defined by
the smallest models considered for the detection and presence processes The base
models M0y and M0z include predictors that must be contained in every model that
belongs to the model space Some examples of base models are the intercept only
model a model with covariates related to the sampling design and a model including
some predictors important to the researcher that should be included in every model
Furthermore let the sets [Kz ] = 1 2 Kz and [Ky ] = 1 2 Ky index
the covariates considered for the variable selection procedure for the presence and
detection processes respectively That is these sets denote the covariates that can
be added from the base models in M0 or removed from the largest possible models
considered MF z and MF y which we will refer to as the ldquofullrdquo models The model space
can then be represented by the Cartesian product of subsets such that Ay sube [Ky ]
and Az sube [Kz ] The entire model space is populated by models of the form MA =MAy
MAz
isin M = My timesMz with MAy
isin My and MAzisin Mz
For the presence process z, the design matrix for model M_{A_z} is given by the block matrix X_{A_z} = (X_0 | X_{rA}), where X_0 is the design matrix of the base model (which is such that M_{0z} ⊆ M_{A_z} ∈ M_z for all A_z ⊆ [K_z]) and X_{rA} is the submatrix that contains the covariates indexed by A_z. Analogously, for the detection process y, the design matrix is given by Q_{A_y} = (Q_0 | Q_{rA}). Similarly, the coefficients for models M_{A_z} and M_{A_y} are given by α_A = (α_0', α_{rA}')' and λ_A = (λ_0', λ_{rA}')'.
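The block structure X_{A_z} = (X_0 | X_{rA}) is straightforward to assemble in code. A minimal sketch (the helper and variable names are ours, not from the text):

```python
import numpy as np

def design_matrix(X0, X_candidates, A):
    """Assemble the block design matrix X_A = (X0 | X_rA), where X0 holds the
    base-model columns and A indexes the active candidate columns.
    Illustrative helper, not the author's code."""
    XrA = X_candidates[:, sorted(A)]
    return np.hstack([X0, XrA])

# toy example: N = 4 sites, intercept-only base model, Kz = 3 candidates
N = 4
X0 = np.ones((N, 1))
Xcand = np.arange(12, dtype=float).reshape(N, 3)
XA = design_matrix(X0, Xcand, {0, 2})
print(XA.shape)  # (4, 3): the intercept plus the two active predictors
```

The same helper applies verbatim to the detection side, with Q_0 and the candidate q-columns in place of X_0 and the x-columns.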
With these elements in place, the model selection problem consists of finding subsets of covariates, indexed by A = {A_z, A_y}, that have high posterior probability given the detection and occupancy processes. This is equivalent to finding models with
high posterior odds when compared to a suitable base model. These posterior odds are given by

\[
\frac{p(M_A \mid y, z)}{p(M_0 \mid y, z)} = \frac{m(y, z \mid M_A)\,\pi(M_A)}{m(y, z \mid M_0)\,\pi(M_0)} = BF_{M_A, M_0}(y, z)\,\frac{\pi(M_A)}{\pi(M_0)}.
\]
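Once the marginal likelihoods are available, the posterior odds above are simple arithmetic, conveniently carried out in log space to avoid underflow. A sketch with hypothetical marginal-likelihood values (the numbers are placeholders, not results from the text):

```python
import math

def log_posterior_odds(log_m_A, log_m_0, prior_A, prior_0):
    """Posterior odds of M_A against M_0: the log Bayes factor plus the
    log prior odds.  Inputs are hypothetical, for illustration only."""
    log_bf = log_m_A - log_m_0
    return log_bf + math.log(prior_A / prior_0)

# hypothetical log-marginals under a uniform model prior
lo = log_posterior_odds(log_m_A=-120.3, log_m_0=-124.8, prior_A=0.5, prior_0=0.5)
print(round(lo, 1))  # 4.5: M_A is favored by about exp(4.5) to one
```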
Since we are able to represent the occupancy model as a truncation of latent normal variables, it is possible to work through the occupancy model selection problem on the latent normal scale used for the presence and detection processes. We formulate two solutions to this problem: one that depends on the observed and latent components, and another that depends solely on the latent-level variables used to data-augment the problem. We focus on the latter approach, as it yields a straightforward MCMC sampling scheme. For completeness, the other alternative is described in Section 3.4.
At the root of our objective inferential procedure for occupancy models lies the conditioning argument introduced by Womack et al. (work in progress) for simple probit regression. In the occupancy setting, the argument is

\[
\begin{aligned}
p(M_A \mid y, z, w, v) &= \frac{m(y, z, v, w \mid M_A)\,\pi(M_A)}{m(y, z, w, v)} \\
&= \frac{f_{yz}(y, z \mid w, v)\left(\int f_{vw}(v, w \mid \alpha, \lambda, M_A)\,\pi_{\alpha\lambda}(\alpha, \lambda \mid M_A)\,d(\alpha, \lambda)\right)\pi(M_A)}{f_{yz}(y, z \mid w, v)\sum_{M^* \in \mathcal{M}}\left(\int f_{vw}(v, w \mid \alpha, \lambda, M^*)\,\pi_{\alpha\lambda}(\alpha, \lambda \mid M^*)\,d(\alpha, \lambda)\right)\pi(M^*)} \\
&= \frac{m(v \mid M_{A_z})\,m(w \mid M_{A_y})\,\pi(M_A)}{m(v)\,m(w)} \\
&\propto m(v \mid M_{A_z})\,m(w \mid M_{A_y})\,\pi(M_A), \qquad (3\text{–}4)
\end{aligned}
\]
where

1. \( f_{yz}(y, z \mid w, v) = \prod_{i=1}^{N} I_{\{v_i > 0\}}^{z_i}\, I_{\{v_i \le 0\}}^{1 - z_i} \prod_{j=1}^{J_i} \left(z_i I_{\{w_{ij} > 0\}}\right)^{y_{ij}} \left(1 - z_i I_{\{w_{ij} > 0\}}\right)^{1 - y_{ij}} \),

2. \( f_{vw}(v, w \mid \alpha, \lambda, M_A) = \underbrace{\left(\prod_{i=1}^{N}\phi(v_i;\, x_i'\alpha_{M_{A_z}},\, 1)\right)}_{f(v \mid \alpha_{rA}, \alpha_0, M_{A_z})} \underbrace{\left(\prod_{i=1}^{N}\prod_{j=1}^{J_i}\phi(w_{ij};\, q_{ij}'\lambda_{M_{A_y}},\, 1)\right)}_{f(w \mid \lambda_{rA}, \lambda_0, M_{A_y})} \), and

3. \( \pi_{\alpha\lambda}(\alpha, \lambda \mid M_A) = \pi_{\alpha}(\alpha \mid M_{A_z})\,\pi_{\lambda}(\lambda \mid M_{A_y}) \).
This result implies that once the occupancy and detection indicators are conditioned on the latent processes v and w, respectively, the model posterior probabilities depend only on the latent variables. Hence, in this case, the model selection problem is driven by the posterior odds

\[
\frac{p(M_A \mid y, z, w, v)}{p(M_0 \mid y, z, w, v)} = \frac{m(w, v \mid M_A)}{m(w, v \mid M_0)}\,\frac{\pi(M_A)}{\pi(M_0)}, \qquad (3\text{–}5)
\]
where m(w, v | M_A) = m(w | M_{A_y}) · m(v | M_{A_z}), with

\[
m(v \mid M_{A_z}) = \int\!\!\int f(v \mid \alpha_{rA}, \alpha_0, M_{A_z})\,\pi(\alpha_{rA} \mid \alpha_0, M_{A_z})\,\pi(\alpha_0)\,d\alpha_{rA}\,d\alpha_0, \qquad (3\text{–}6)
\]
\[
m(w \mid M_{A_y}) = \int\!\!\int f(w \mid \lambda_{rA}, \lambda_0, M_{A_y})\,\pi(\lambda_{rA} \mid \lambda_0, M_{A_y})\,\pi(\lambda_0)\,d\lambda_0\,d\lambda_{rA}. \qquad (3\text{–}7)
\]
3.3.2 Intrinsic Priors for the Occupancy Problem

In general, the intrinsic priors defined by Moreno et al. (1998) use the functional form of the response to inform their construction, assuming some preliminary prior distribution, proper or improper, on the model parameters. For our purposes, we assume noninformative improper priors for the parameters, denoted by π^N(·|·). Specifically, the intrinsic priors π^{IP}(θ_{M*} | M*) for a vector of parameters θ_{M*}, corresponding to model M* ∈ {M_0, M} ⊂ \mathcal{M}, for a response vector s with probability density (or mass) function f(s | θ_{M*}), are defined by

\[
\pi^{IP}(\theta_{M_0} \mid M_0) = \pi^{N}(\theta_{M_0} \mid M_0),
\]
\[
\pi^{IP}(\theta_{M} \mid M) = \pi^{N}(\theta_{M} \mid M) \int \frac{m(\tilde{s} \mid M_0)}{m(\tilde{s} \mid M)}\, f(\tilde{s} \mid \theta_M, M)\, d\tilde{s},
\]

where \(\tilde{s}\) is a theoretical training sample.
In what follows, whenever it is clear from the context, M_A will be used to refer to M_{A_z} or M_{A_y}, and A will denote A_z or A_y, in an attempt to simplify the notation. To derive the parameter priors involved in Equations 3–6 and 3–7 using the objective intrinsic prior strategy, we start by assuming flat priors π^N(α_A | M_A) ∝ c_A and π^N(λ_A | M_A) ∝ d_A, where c_A and d_A are unknown constants.
The intrinsic prior for the parameters associated with the occupancy process, α_A, conditional on model M_A, is

\[
\pi^{IP}(\alpha_A \mid M_A) = \pi^{N}(\alpha_A \mid M_A) \int \frac{m(\tilde{v} \mid M_0)}{m(\tilde{v} \mid M_A)}\, f(\tilde{v} \mid \alpha_A, M_A)\, d\tilde{v},
\]

where the marginals m(\tilde{v} | M_j), with j ∈ {A, 0}, are obtained by solving the analogue of Equation 3–6 for the (theoretical) training sample \(\tilde{v}\). These marginals are given by

\[
m(\tilde{v} \mid M_j) = c_j\, (2\pi)^{-\frac{p_{A_z} - p_j}{2}}\, |\tilde{X}_j'\tilde{X}_j|^{-\frac{1}{2}}\, e^{-\frac{1}{2}\tilde{v}'(I - \tilde{H}_j)\tilde{v}}.
\]

The training sample \(\tilde{v}\) has dimension p_{A_z} = |M_{A_z}|, that is, the total number of parameters in model M_{A_z}. Note that, without ambiguity, we use | · | to denote both the cardinality of a set and the determinant of a matrix. The design matrix \(\tilde{X}_A\) corresponds to the training sample \(\tilde{v}\) and is chosen such that \(\tilde{X}_A'\tilde{X}_A = \frac{p_{A_z}}{N} X_A'X_A\) (Leon-Novelo et al., 2012), and \(\tilde{H}_j\) is the corresponding hat matrix.
Replacing m(\tilde{v} | M_A) and m(\tilde{v} | M_0) in π^{IP}(α_A | M_A) and solving the integral with respect to the theoretical training sample \(\tilde{v}\), we have

\[
\begin{aligned}
\pi^{IP}(\alpha_A \mid M_A) &= c_A \int \left( (2\pi)^{-\frac{p_{A_z} - p_{0_z}}{2}} \left(\frac{c_0}{c_A}\right) e^{-\frac{1}{2}\tilde{v}'\left((I - \tilde{H}_0) - (I - \tilde{H}_A)\right)\tilde{v}}\, \frac{|\tilde{X}_A'\tilde{X}_A|^{1/2}}{|\tilde{X}_0'\tilde{X}_0|^{1/2}} \right) \times \left( (2\pi)^{-\frac{p_{A_z}}{2}} e^{-\frac{1}{2}(\tilde{v} - \tilde{X}_A\alpha_A)'(\tilde{v} - \tilde{X}_A\alpha_A)} \right) d\tilde{v} \\
&= c_0\, (2\pi)^{-\frac{p_{A_z} - p_{0_z}}{2}}\, |\tilde{X}_{rA}'\tilde{X}_{rA}|^{\frac{1}{2}}\, 2^{-\frac{p_{A_z} - p_{0_z}}{2}} \exp\left[-\frac{1}{2}\alpha_{rA}'\left(\frac{1}{2}\tilde{X}_{rA}'\tilde{X}_{rA}\right)\alpha_{rA}\right] \\
&= \pi^{N}(\alpha_0) \times N\!\left(\alpha_{rA} \,\middle|\, 0,\; 2\,(\tilde{X}_{rA}'\tilde{X}_{rA})^{-1}\right). \qquad (3\text{–}8)
\end{aligned}
\]
Analogously, the intrinsic prior for the parameters associated with the detection process is

\[
\begin{aligned}
\pi^{IP}(\lambda_A \mid M_A) &= d_0\, (2\pi)^{-\frac{p_{A_y} - p_{0_y}}{2}}\, |\tilde{Q}_{rA}'\tilde{Q}_{rA}|^{\frac{1}{2}}\, 2^{-\frac{p_{A_y} - p_{0_y}}{2}} \exp\left[-\frac{1}{2}\lambda_{rA}'\left(\frac{1}{2}\tilde{Q}_{rA}'\tilde{Q}_{rA}\right)\lambda_{rA}\right] \\
&= \pi^{N}(\lambda_0) \times N\!\left(\lambda_{rA} \,\middle|\, 0,\; 2\,(\tilde{Q}_{rA}'\tilde{Q}_{rA})^{-1}\right). \qquad (3\text{–}9)
\end{aligned}
\]
In short, the intrinsic priors for α_A = (α_0', α_{rA}')' and λ_A = (λ_0', λ_{rA}')' are the product of a reference prior on the parameters of the base model and a normal density on the parameters indexed by A_z and A_y, respectively.

3.3.3 Model Posterior Probabilities
We now derive the expressions involved in the calculation of the model posterior probabilities. First, recall that p(M_A | y, z, w, v) ∝ m(w, v | M_A) π(M_A). Hence, determining this posterior probability only requires calculating m(w, v | M_A).

Note that, since w and v are independent, obtaining the model posteriors from Expression 3–4 reduces to finding closed-form expressions for the marginals m(v | M_{A_z}) and m(w | M_{A_y}), respectively, from Equations 3–6 and 3–7. Therefore,

\[
m(w, v \mid M_A) = \int\!\!\int f(v, w \mid \alpha, \lambda, M_A)\, \pi^{IP}(\alpha \mid M_{A_z})\, \pi^{IP}(\lambda \mid M_{A_y})\, d\alpha\, d\lambda. \qquad (3\text{–}10)
\]
For the latent variable associated with the occupancy process, plugging the parameter intrinsic prior given by 3–8 into Equation 3–6 (recalling that \(\tilde{X}_A'\tilde{X}_A = \frac{p_{A_z}}{N} X_A'X_A\)) and integrating out α_A yields

\[
\begin{aligned}
m(v \mid M_A) &= \int\!\!\int c_0\, N(v \mid X_0\alpha_0 + X_{rA}\alpha_{rA},\, I)\; N\!\left(\alpha_{rA} \,\middle|\, 0,\; 2(\tilde{X}_{rA}'\tilde{X}_{rA})^{-1}\right) d\alpha_{rA}\, d\alpha_0 \\
&= c_0\, (2\pi)^{-n/2} \int \left(\frac{p_{A_z}}{2N + p_{A_z}}\right)^{\frac{p_{A_z} - p_{0_z}}{2}} \exp\left[-\frac{1}{2}(v - X_0\alpha_0)'\left(I - \left(\frac{2N}{2N + p_{A_z}}\right)H_{rA_z}\right)(v - X_0\alpha_0)\right] d\alpha_0 \\
&= c_0\, (2\pi)^{-(n - p_{0_z})/2} \left(\frac{p_{A_z}}{2N + p_{A_z}}\right)^{\frac{p_{A_z} - p_{0_z}}{2}} |X_0'X_0|^{-\frac{1}{2}} \exp\left[-\frac{1}{2} v'\left(I - H_{0_z} - \left(\frac{2N}{2N + p_{A_z}}\right)H_{rA_z}\right) v\right], \qquad (3\text{–}11)
\end{aligned}
\]
with H_{rA_z} = H_{A_z} − H_{0_z}, where H_{A_z} is the hat matrix for the entire model M_{A_z} and H_{0_z} is the hat matrix for the base model.
Similarly, the marginal distribution for w is

\[
m(w \mid M_A) = d_0\, (2\pi)^{-(J - p_{0_y})/2} \left(\frac{p_{A_y}}{2J + p_{A_y}}\right)^{\frac{p_{A_y} - p_{0_y}}{2}} |Q_0'Q_0|^{-\frac{1}{2}} \exp\left[-\frac{1}{2} w'\left(I - H_{0_y} - \left(\frac{2J}{2J + p_{A_y}}\right)H_{rA_y}\right) w\right], \qquad (3\text{–}12)
\]

where J = \(\sum_{i=1}^{N} J_i\); in other words, J denotes the total number of surveys conducted.
Now, the posteriors for the base model M_0 = {M_{0_y}, M_{0_z}} are

\[
\begin{aligned}
m(v \mid M_0) &= \int c_0\, N(v \mid X_0\alpha_0,\, I)\, d\alpha_0 \\
&= c_0\, (2\pi)^{-(n - p_{0_z})/2}\, |X_0'X_0|^{-\frac{1}{2}} \exp\left[-\frac{1}{2} v'(I - H_{0_z})v\right] \qquad (3\text{–}13)
\end{aligned}
\]

and

\[
m(w \mid M_0) = d_0\, (2\pi)^{-(J - p_{0_y})/2}\, |Q_0'Q_0|^{-\frac{1}{2}} \exp\left[-\frac{1}{2} w'(I - H_{0_y})w\right]. \qquad (3\text{–}14)
\]
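The closed-form marginals above can be coded directly and, on a toy example, checked against brute-force numerical integration. The sketch below assumes an intercept-only base model with one centered extra covariate (so that X_0'X_{rA} = 0, as the derivation implicitly requires), sets c_0 = 1, and is our own illustration rather than the author's code:

```python
import numpy as np
from numpy.linalg import slogdet

def log_marginal_vA(v, X0, XrA, N):
    """log m(v | M_A) / c0 from the closed form (3-11); the constant c0 of the
    flat base-model prior is left out, since it cancels in posterior odds.
    Assumes X0 and XrA have orthogonal columns (e.g., centered covariates)."""
    n, p0 = X0.shape
    pA = p0 + XrA.shape[1]
    H0 = X0 @ np.linalg.solve(X0.T @ X0, X0.T)
    HrA = XrA @ np.linalg.solve(XrA.T @ XrA, XrA.T)   # equals H_A - H_0 here
    shrink = 2 * N / (2 * N + pA)
    quad = v @ (v - H0 @ v - shrink * (HrA @ v))
    _, logdet0 = slogdet(X0.T @ X0)
    return (-(n - p0) / 2 * np.log(2 * np.pi)
            + (pA - p0) / 2 * np.log(pA / (2 * N + pA))
            - 0.5 * logdet0 - 0.5 * quad)

def log_marginal_v0(v, X0):
    """log m(v | M_0) / c0 from the closed form (3-13)."""
    n, p0 = X0.shape
    H0 = X0 @ np.linalg.solve(X0.T @ X0, X0.T)
    _, logdet0 = slogdet(X0.T @ X0)
    return (-(n - p0) / 2 * np.log(2 * np.pi)
            - 0.5 * logdet0 - 0.5 * (v @ (v - H0 @ v)))

# brute-force check of (3-11): N = n = 4 sites, one centered extra covariate
v = np.array([0.3, -0.5, 0.8, -0.1])
x = np.array([-1.5, -0.5, 0.5, 1.5])           # centered, so X0'XrA = 0
X0, XrA, N, pA = np.ones((4, 1)), x[:, None], 4, 2
sig2 = 2 * N / (pA * (x @ x))                  # prior variance 2*(X~'X~)^{-1}
g = np.arange(-7, 7, 0.025)                    # grid over (alpha0, alpha_rA)
A0, A1 = np.meshgrid(g, g, indexing="ij")
rss = ((v[:, None, None] - A0 - x[:, None, None] * A1) ** 2).sum(axis=0)
dens = ((2 * np.pi) ** (-2.0) * np.exp(-0.5 * rss)   # N(v | X0 a0 + XrA a1, I)
        * np.exp(-0.5 * A1 ** 2 / sig2) / np.sqrt(2 * np.pi * sig2))
numeric = np.log(dens.sum() * 0.025 * 0.025)
closed = log_marginal_vA(v, X0, XrA, N)
print(abs(numeric - closed) < 1e-3)  # True
```

The difference `closed - log_marginal_v0(v, X0)` is then the log Bayes factor of M_A against M_0 on the latent occupancy scale, with c_0 cancelling as noted.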
3.3.4 Model Selection Algorithm

Having the parameter intrinsic priors in place, and knowing the form of the model posterior probabilities, it is finally possible to develop a strategy to conduct model selection in the occupancy framework.

For each of the two components of the model (occupancy and detection), the algorithm first draws the set of active predictors (i.e., A_z and A_y) together with their corresponding parameters. This is a reversible jump step, which uses a Metropolis-Hastings correction with proposal distributions given by

\[
\begin{aligned}
q(A_z^* \mid z_o, z_u^{(t)}, v^{(t)}, M_{A_z}) &= \frac{1}{2}\left( p\!\left(M_{A_z^*} \,\middle|\, z_o, z_u^{(t)}, v^{(t)},\; M_{A_z^*} \in L(M_{A_z})\right) + \frac{1}{|L(M_{A_z})|} \right), \\
q(A_y^* \mid y, z_o, z_u^{(t)}, w^{(t)}, M_{A_y}) &= \frac{1}{2}\left( p\!\left(M_{A_y^*} \,\middle|\, y, z_o, z_u^{(t)}, w^{(t)},\; M_{A_y^*} \in L(M_{A_y})\right) + \frac{1}{|L(M_{A_y})|} \right), \qquad (3\text{–}15)
\end{aligned}
\]
where L(M_{A_z}) and L(M_{A_y}) denote the sets of models obtained by adding or removing one predictor at a time from M_{A_z} and M_{A_y}, respectively.
To promote mixing, this step is followed by an additional draw from the full conditionals of α and λ. The densities p(α_0 | ·), p(α_{rA} | ·), p(λ_0 | ·) and p(λ_{rA} | ·) can be sampled from directly with Gibbs steps. Using the notation a | · to denote the random variable a conditioned on all other parameters and on the data, these densities are given by

• α_0 | · ∼ N((X_0'X_0)^{-1}X_0'v, (X_0'X_0)^{-1});

• α_{rA} | · ∼ N(μ_{α_{rA}}, Σ_{α_{rA}}), where the covariance matrix and mean vector are given by Σ_{α_{rA}} = \(\frac{2N}{2N + p_{A_z}}\)(X_{rA}'X_{rA})^{-1} and μ_{α_{rA}} = Σ_{α_{rA}} X_{rA}'v;

• λ_0 | · ∼ N((Q_0'Q_0)^{-1}Q_0'w, (Q_0'Q_0)^{-1}); and

• λ_{rA} | · ∼ N(μ_{λ_{rA}}, Σ_{λ_{rA}}), analogously, with covariance matrix and mean given by Σ_{λ_{rA}} = \(\frac{2J}{2J + p_{A_y}}\)(Q_{rA}'Q_{rA})^{-1} and μ_{λ_{rA}} = Σ_{λ_{rA}} Q_{rA}'w.
Finally, Gibbs sampling steps are also available for the unobserved occupancy indicators z_u and for the corresponding latent variables v and w. The full conditional posterior densities for z_u^{(t+1)}, v^{(t+1)} and w^{(t+1)} are those introduced in Chapter 2 for the single-season probit model.
The following steps summarize the stochastic search algorithm:

1. Initialize A_y^{(0)}, A_z^{(0)}, z_u^{(0)}, v^{(0)}, w^{(0)}, α_0^{(0)}, λ_0^{(0)}.

2. Sample the model indices and corresponding parameters:

(a) Draw simultaneously
• A_z^* ∼ q(A_z | z_o, z_u^{(t)}, v^{(t)}, M_{A_z}),
• α_0^* ∼ p(α_0 | M_{A_z^*}, z_o, z_u^{(t)}, v^{(t)}), and
• α_{rA^*}^* ∼ p(α_{rA} | M_{A_z^*}, z_o, z_u^{(t)}, v^{(t)}).

(b) Accept (M_{A_z}^{(t+1)}, α_0^{(t+1,1)}, α_{rA}^{(t+1,1)}) = (M_{A_z^*}, α_0^*, α_{rA^*}^*) with probability

\[
\delta_z = \min\left(1,\; \frac{p(M_{A_z^*} \mid z_o, z_u^{(t)}, v^{(t)})}{p(M_{A_z^{(t)}} \mid z_o, z_u^{(t)}, v^{(t)})}\, \frac{q(A_z^{(t)} \mid z_o, z_u^{(t)}, v^{(t)}, M_{A_z^*})}{q(A_z^* \mid z_o, z_u^{(t)}, v^{(t)}, M_{A_z})}\right);
\]

otherwise, let (M_{A_z}^{(t+1)}, α_0^{(t+1,1)}, α_{rA}^{(t+1,1)}) = (M_{A_z^{(t)}}, α_0^{(t,2)}, α_{rA}^{(t,2)}).

(c) Draw simultaneously
• A_y^* ∼ q(A_y | y, z_o, z_u^{(t)}, w^{(t)}, M_{A_y}),
• λ_0^* ∼ p(λ_0 | M_{A_y^*}, y, z_o, z_u^{(t)}, w^{(t)}), and
• λ_{rA^*}^* ∼ p(λ_{rA} | M_{A_y^*}, y, z_o, z_u^{(t)}, w^{(t)}).

(d) Accept (M_{A_y}^{(t+1)}, λ_0^{(t+1,1)}, λ_{rA}^{(t+1,1)}) = (M_{A_y^*}, λ_0^*, λ_{rA^*}^*) with probability

\[
\delta_y = \min\left(1,\; \frac{p(M_{A_y^*} \mid y, z_o, z_u^{(t)}, w^{(t)})}{p(M_{A_y^{(t)}} \mid y, z_o, z_u^{(t)}, w^{(t)})}\, \frac{q(A_y^{(t)} \mid y, z_o, z_u^{(t)}, w^{(t)}, M_{A_y^*})}{q(A_y^* \mid y, z_o, z_u^{(t)}, w^{(t)}, M_{A_y})}\right);
\]

otherwise, let (M_{A_y}^{(t+1)}, λ_0^{(t+1,1)}, λ_{rA}^{(t+1,1)}) = (M_{A_y^{(t)}}, λ_0^{(t,2)}, λ_{rA}^{(t,2)}).

3. Sample the base model parameters:

(a) Draw α_0^{(t+1,2)} ∼ p(α_0 | M_{A_z}^{(t+1)}, z_o, z_u^{(t)}, v^{(t)}).
(b) Draw λ_0^{(t+1,2)} ∼ p(λ_0 | M_{A_y}^{(t+1)}, y, z_o, z_u^{(t)}, w^{(t)}).

4. To improve mixing, resample the model coefficients that are not in the base model but are in M_A:

(a) Draw α_{rA}^{(t+1,2)} ∼ p(α_{rA} | M_{A_z}^{(t+1)}, z_o, z_u^{(t)}, v^{(t)}).
(b) Draw λ_{rA}^{(t+1,2)} ∼ p(λ_{rA} | M_{A_y}^{(t+1)}, y, z_o, z_u^{(t)}, w^{(t)}).

5. Sample the latent and missing (unobserved) variables:

(a) Sample z_u^{(t+1)} ∼ p(z_u | M_{A_z}^{(t+1)}, y, α_{rA}^{(t+1,2)}, α_0^{(t+1,2)}, λ_{rA}^{(t+1,2)}, λ_0^{(t+1,2)}).
(b) Sample v^{(t+1)} ∼ p(v | M_{A_z}^{(t+1)}, z_o, z_u^{(t+1)}, α_{rA}^{(t+1,2)}, α_0^{(t+1,2)}).
(c) Sample w^{(t+1)} ∼ p(w | M_{A_y}^{(t+1)}, z_o, z_u^{(t+1)}, λ_{rA}^{(t+1,2)}, λ_0^{(t+1,2)}).
3.4 Alternative Formulation

Because the occupancy process is partially observed, it is reasonable to consider the posterior odds in terms of the observed responses, that is, the detections y and the presences at sites where at least one detection takes place. Partitioning the vector of presences into observed and unobserved components, z = (z_o', z_u')', and integrating out the unobserved component, the model posterior for M_A can be obtained as

\[
p(M_A \mid y, z_o) \propto E_{z_u}\left[m(y, z \mid M_A)\right]\, \pi(M_A). \qquad (3\text{–}16)
\]

Data-augmenting the model in terms of latent normal variables à la Albert and Chib, the marginals of z and y for any model {M_y, M_z} = M ∈ \mathcal{M} inside the expectation in Equation 3–16 can be expressed in terms of the latent variables:

\[
\begin{aligned}
m(y, z \mid M) &= \int_{T(z)} \int_{T(y,z)} m(w, v \mid M)\, dw\, dv \\
&= \left(\int_{T(z)} m(v \mid M_z)\, dv\right) \left(\int_{T(y,z)} m(w \mid M_y)\, dw\right), \qquad (3\text{–}17)
\end{aligned}
\]
where T(z) and T(y, z) denote the corresponding truncation regions for v and w, which depend on the values taken by z and y, and

\[
m(v \mid M_z) = \int f(v \mid \alpha, M_z)\, \pi(\alpha \mid M_z)\, d\alpha, \qquad (3\text{–}18)
\]
\[
m(w \mid M_y) = \int f(w \mid \lambda, M_y)\, \pi(\lambda \mid M_y)\, d\lambda. \qquad (3\text{–}19)
\]

The last equality in Equation 3–17 is a consequence of the independence of the latent processes v and w. Using Expressions 3–18 and 3–19 allows one to embed this model selection problem in the classical normal linear regression setting, where many "objective" Bayesian inferential tools are available. In particular, these expressions facilitate deriving the parameter intrinsic priors (Berger & Pericchi, 1996; Moreno et al., 1998) for this problem. This approach is an extension of the one implemented in Leon-Novelo et al. (2012) for the simple probit regression problem.
Using this alternative approach, all that is left is to integrate m(v | M_A) and m(w | M_A) over their corresponding truncation regions T(z) and T(y, z), which yields m(y, z | M_A), and then to obtain the expectation with respect to the unobserved z's. Note, however, that two issues arise. First, such integrals are not available in closed form. Second, calculating the expectation over the limits of integration further complicates things. To address these difficulties, it is possible to express E_{z_u}[m(y, z | M_A)] as

\[
\begin{aligned}
E_{z_u}\left[m(y, z \mid M_A)\right] &= E_{z_u}\left[\left(\int_{T(z)} m(v \mid M_{A_z})\, dv\right)\left(\int_{T(y,z)} m(w \mid M_{A_y})\, dw\right)\right] \\
&= E_{z_u}\left[\left(\int_{T(z)}\!\int m(v \mid M_{A_z}, \alpha_0)\, \pi^{IP}(\alpha_0 \mid M_{A_z})\, d\alpha_0\, dv\right) \times \left(\int_{T(y,z)}\!\int m(w \mid M_{A_y}, \lambda_0)\, \pi^{IP}(\lambda_0 \mid M_{A_y})\, d\lambda_0\, dw\right)\right] \\
&= E_{z_u}\left[\int \underbrace{\left(\int_{T(z)} m(v \mid M_{A_z}, \alpha_0)\, dv\right)}_{g_1(T(z)\,\mid\, M_{A_z}, \alpha_0)} \pi^{IP}(\alpha_0 \mid M_{A_z})\, d\alpha_0 \times \int \underbrace{\left(\int_{T(y,z)} m(w \mid M_{A_y}, \lambda_0)\, dw\right)}_{g_2(T(y,z)\,\mid\, M_{A_y}, \lambda_0)} \pi^{IP}(\lambda_0 \mid M_{A_y})\, d\lambda_0\right] \\
&= c_0\, d_0 \int\!\!\int E_{z_u}\left[g_1(T(z) \mid M_{A_z}, \alpha_0)\, g_2(T(y, z) \mid M_{A_y}, \lambda_0)\right] d\alpha_0\, d\lambda_0, \qquad (3\text{–}20)
\end{aligned}
\]

where the last equality follows from Fubini's theorem, since m(v | M_{A_z}, α_0) and m(w | M_{A_y}, λ_0) are proper densities. From 3–20, the posterior odds are
\[
\frac{p(M_A \mid y, z_o)}{p(M_0 \mid y, z_o)} = \frac{\int\!\int E_{z_u}\left[g_1(T(z) \mid M_{A_z}, \alpha_0)\, g_2(T(y, z) \mid M_{A_y}, \lambda_0)\right] d\alpha_0\, d\lambda_0}{\int\!\int E_{z_u}\left[g_1(T(z) \mid M_{0_z}, \alpha_0)\, g_2(T(y, z) \mid M_{0_y}, \lambda_0)\right] d\alpha_0\, d\lambda_0}\; \frac{\pi(M_A)}{\pi(M_0)}. \qquad (3\text{–}21)
\]
3.5 Simulation Experiments

The proposed methodology was tested under 36 different scenarios, in which we evaluate the behavior of the algorithm by varying the number of sites, the number of surveys, the amount of signal in the predictors for the presence component and, finally, the amount of signal in the predictors for the detection component.

For each model component, the base model is taken to be the intercept-only model, and the full models considered for the presence and the detection have, respectively, 30 and 20 predictors. Therefore, the model space contains 2^30 × 2^20 ≈ 1.12 × 10^15 candidate models.

To control the amount of signal in the presence and detection components, values for the model parameters were purposefully chosen so that quantiles 10, 50 and 90 of the occupancy and detection probabilities match pre-specified values. Because presence and detection are binary variables, the amount of signal in each model component is tied to the spread and center of the distributions of the occupancy and detection probabilities, respectively. Low signal levels correspond to occupancy or detection probabilities close to 0.5; high signal levels correspond to probabilities close to 0 or 1. Larger spreads of the distributions of the occupancy and detection probabilities reflect greater heterogeneity among the observations collected, improving the discrimination capability of the model, and vice versa.
Therefore, for the presence component, the parameter values of the true model were chosen so that the median of the occupancy probabilities equals 0.5. The chosen parameter values also fix quantiles 10 and 90 symmetrically about 0.5, at small (Q^z_10 = 0.3, Q^z_90 = 0.7), intermediate (Q^z_10 = 0.2, Q^z_90 = 0.8) and large (Q^z_10 = 0.1, Q^z_90 = 0.9) distances. For the detection component, the model parameters are chosen to reflect detection probabilities concentrated about low (Q^y_50 = 0.2), intermediate (Q^y_50 = 0.5) and high (Q^y_50 = 0.8) values, while keeping quantiles 10 and 90 fixed at 0.1 and 0.9, respectively.
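One way such quantile-matched coefficients can be obtained under a probit link, assuming a single standard-normal covariate (an assumption of ours for illustration; the text does not specify the covariate distribution, and the dissertation's actual construction may differ):

```python
from statistics import NormalDist

nd = NormalDist()

def probit_coefs(q50, q90):
    """Pick (b0, b1) so that, for a covariate x ~ N(0,1), the occupancy
    probability psi = Phi(b0 + b1*x) has median q50 and 90th percentile q90.
    Sketch of the quantile-matching idea under our simplifying assumptions."""
    b0 = nd.inv_cdf(q50)
    # the 90th percentile of b0 + b1*x (with b1 > 0) sits at x = z_90
    b1 = (nd.inv_cdf(q90) - b0) / nd.inv_cdf(0.90)
    return b0, b1

b0, b1 = probit_coefs(0.5, 0.7)            # the "small distance" setting
z90 = nd.inv_cdf(0.90)
print(round(nd.cdf(b0 + b1 * z90), 3),     # 0.7 -> 90th percentile matches
      round(nd.cdf(b0 - b1 * z90), 3))     # 0.3 -> symmetric 10th percentile
```

Because the targets are symmetric about 0.5, matching the median and the 90th percentile automatically fixes the 10th percentile as well.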
Table 3-1. Simulation control parameters, occupancy model selector.

Parameter | Values considered
N | 50, 100
J | 3, 5
(Q^z_10, Q^z_50, Q^z_90) | (0.3, 0.5, 0.7), (0.2, 0.5, 0.8), (0.1, 0.5, 0.9)
(Q^y_10, Q^y_50, Q^y_90) | (0.1, 0.2, 0.9), (0.1, 0.5, 0.9), (0.1, 0.8, 0.9)
In total there are 36 scenarios, resulting from crossing all levels of the simulation control parameters (Table 3-1). Under each of these scenarios, 20 data sets were generated at random. True presence and detection indicators were generated with the probit model formulation from Chapter 2, with the assumed true models M_Tz = {1, x_2, x_15, x_16, x_22, x_28} for the presence and M_Ty = {1, q_7, q_10, q_12, q_17} for the detection, with the predictors included in the randomly generated data sets. In this context, 1 represents the intercept term. Throughout this section, we refer to predictors included in the true models as true predictors, and to those absent from them as false predictors.

The selection procedure was conducted on each of these data sets with two different priors on the model space: the uniform (equal probability) prior and a multiplicity-correcting prior.
The results are summarized through the marginal posterior inclusion probabilities (MPIPs) for each predictor, and also through the five highest posterior probability models (HPMs). The MPIP for a given predictor, under a specific scenario and for a particular data set, is defined as

\[
p(\text{predictor is included} \mid y, z, w, v) = \sum_{M \in \mathcal{M}} I_{\{\text{predictor} \in M\}}\; p(M \mid y, z, w, v). \qquad (3\text{–}22)
\]

In addition, we compare the MPIP odds between predictors present in the true model and predictors absent from it. Specifically, we consider the minimum odds of the marginal posterior inclusion probabilities of true predictors against false predictors. Let \(\tilde{\xi}\) and ξ denote, respectively, a predictor in the true model M_T and a predictor absent from M_T. We define the minimum MPIP odds between the inclusion probabilities of true and false predictors as

\[
\text{minOdds}_{\text{MPIP}} = \frac{\min_{\tilde{\xi} \in M_T}\, p(I_{\tilde{\xi}} = 1 \mid \tilde{\xi} \in M_T)}{\max_{\xi \notin M_T}\, p(I_{\xi} = 1 \mid \xi \notin M_T)}. \qquad (3\text{–}23)
\]
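Given posterior probabilities for a set of sampled models, both quantities reduce to a few lines. A sketch with a hypothetical posterior over three candidate predictors (labels and values are ours):

```python
def mpip(post, predictor):
    """Marginal posterior inclusion probability (3-22): the summed posterior
    mass of the models containing the predictor.  `post` maps a model,
    encoded as a frozenset of predictor labels, to its posterior probability."""
    return sum(p for M, p in post.items() if predictor in M)

def min_odds_mpip(post, true_model, candidates):
    """Minimum MPIP odds (3-23) between true and false predictors."""
    worst_true = min(mpip(post, j) for j in true_model)
    best_false = max(mpip(post, j) for j in candidates - true_model)
    return worst_true / best_false

# hypothetical posterior over four models on candidates {x1, x2, x3}
post = {frozenset({"x1"}): 0.5, frozenset({"x1", "x2"}): 0.3,
        frozenset({"x2"}): 0.1, frozenset({"x1", "x3"}): 0.1}
print(round(mpip(post, "x1"), 3))                                       # 0.9
print(round(min_odds_mpip(post, {"x1", "x2"}, {"x1", "x2", "x3"}), 3))  # 4.0
```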
If the variable selection procedure adequately discriminates between true and false predictors, minOdds_MPIP takes values larger than one. The ability of the method to discriminate between the least probable true predictor and the most probable false predictor worsens as the indicator approaches 0.

3.5.1 Marginal Posterior Inclusion Probabilities for Model Predictors
For clarity, in Figures 3-1 through 3-5 only the predictors in the true models are labeled, and they are emphasized with a dotted line passing through them. The left-hand plots in these figures contain the results for the presence component, and those on the right correspond to predictors in the detection component. The results obtained with the uniform model prior correspond to the black lines, and those for the multiplicity-correcting prior are in red. In these figures, the MPIPs have been averaged over all data sets from the scenarios matching the condition indicated.
In Figure 3-1, we contrast the mean MPIPs of the predictors over all data sets from scenarios with 50 sites to the mean MPIPs obtained for the scenarios with 100 sites. Similarly, Figure 3-2 compares the mean MPIPs of scenarios where 3 surveys are performed to those of scenarios having 5 surveys per site. Figures 3-4 and 3-5 show the effect of the different levels of signal considered in the occupancy probabilities and in the detection probabilities.
Three main results can be drawn from these figures: (1) the effect of the model prior is substantial; (2) the proposed methods yield MPIPs that clearly separate true predictors from false predictors; and (3) the separation between the MPIPs of true and false predictors is noticeably larger in the detection component.
Regardless of the simulation scenario and model component observed, under the uniform prior false predictors obtain a relatively high MPIP. Conversely, the multiplicity correction prior strongly shrinks the MPIPs of false predictors towards 0. In the presence component, the MPIPs of the true predictors are also shrunk substantially under the multiplicity prior; however, a clear separation between true and false predictors remains. In contrast, in the detection component the MPIPs of true predictors remain relatively high (Figures 3-1 through 3-5).
Figure 3-1. Predictor MPIP averaged over scenarios with N=50 and N=100 sites, using uniform (U) and multiplicity correction (MC) priors.
Figure 3-2. Predictor MPIP averaged over scenarios with J=3 and J=5 surveys per site, using uniform (U) and multiplicity correction (MC) priors.
Figure 3-3. Predictor MPIP averaged over scenarios with the interaction between the number of sites and the surveys per site, using uniform (U) and multiplicity correction (MC) priors.
Figure 3-4. Predictor MPIP averaged over scenarios with equal signal in the occupancy probabilities, using uniform (U) and multiplicity correction (MC) priors.
Figure 3-5. Predictor MPIP averaged over scenarios with equal signal in the detection probabilities, using uniform (U) and multiplicity correction (MC) priors.
In scenarios where more sites were surveyed, the separation between the MPIPs of true and false predictors grew in both model components (Figure 3-1). Increasing the number of sites affects both components, given that every time a new site is included, covariate information is added to the design matrices of both the presence and the detection components.

On the other hand, increasing the number of surveys affects the MPIPs of predictors in the detection component (Figures 3-2 and 3-3) but has only a marginal effect on predictors in the presence component. This may appear counterintuitive; however, increasing the number of surveys only increases the number of observations in the design matrix for the detection, while leaving the design matrix for the presence unaltered. The small changes observed in the MPIPs of the presence predictors as J increases are exclusively a result of having additional detection indicators equal to 1 at sites that, with fewer surveys, would only have had 0-valued detections.
From Figure 3-3 it is clear that, for the presence component, the effect of the number of sites dominates the behavior of the MPIP, especially when using the multiplicity correction priors. In the detection component, the MPIP is influenced by both the number of sites and the number of surveys. The influence of increasing the number of surveys is larger when considering a smaller number of sites, and vice versa.
Regarding the effect of the distribution of the occupancy probabilities, we observe that mostly the presence component is affected: the discrimination between true and false predictors strengthens as the distribution becomes more variable (Figure 3-4). This is consistent with intuition, since having the presence probabilities concentrated about 0.5 implies that the predictors do not vary much from one site to the next, whereas having the occupancy probabilities more spread out has the opposite effect.
Finally, consider the effect of concentrating the detection probabilities about high or low values. For predictors in the detection component, the separation between the MPIPs of true and false predictors is larger in scenarios where the distribution of the detection probability is centered about 0.2 or 0.8 than in those scenarios where this distribution is centered about 0.5 (where the signal of the predictors is weakest). For predictors in the presence component, having the detection probabilities centered at higher values slightly increases the inclusion probabilities of the true predictors (Figure 3-5) and reduces those of false predictors.
Table 3-2. Comparison of average minOdds_MPIP under scenarios having different numbers of sites (N=50, N=100), and under scenarios having different numbers of surveys per site (J=3, J=5), for the presence and detection components, using uniform and multiplicity correction priors.

Comp | π(M) | N=50 | N=100 | J=3 | J=5
Presence | Unif | 1.12 | 1.31 | 1.19 | 1.24
Presence | MC | 3.20 | 8.46 | 4.20 | 6.74
Detection | Unif | 2.03 | 2.64 | 2.11 | 2.57
Detection | MC | 21.15 | 32.46 | 21.39 | 32.52
Table 3-3. Comparison of average minOdds_MPIP for the different levels of signal considered in the occupancy and detection probabilities, for the presence and detection components, using uniform and multiplicity correction priors.

Comp | π(M) | (Q^z_10, Q^z_50, Q^z_90): (0.3,0.5,0.7) | (0.2,0.5,0.8) | (0.1,0.5,0.9) | (Q^y_10, Q^y_50, Q^y_90): (0.1,0.2,0.9) | (0.1,0.5,0.9) | (0.1,0.8,0.9)
Presence | Unif | 1.05 | 1.20 | 1.34 | 1.10 | 1.23 | 1.24
Presence | MC | 2.02 | 4.55 | 8.05 | 2.38 | 6.19 | 6.40
Detection | Unif | 2.34 | 2.34 | 2.30 | 2.57 | 2.00 | 2.38
Detection | MC | 25.37 | 20.77 | 25.28 | 29.33 | 18.52 | 28.49
The separation between the MPIPs of true and false predictors is even more evident in Tables 3-2 and 3-3, where the minimum MPIP odds between true and false predictors are shown. Under every scenario, the value of minOdds_MPIP (as defined in 3–23) was greater than 1, implying that, on average, even the lowest MPIP for a true predictor is higher than the maximum MPIP for a false predictor. In both components of the model, the minOdds_MPIP are markedly larger under the multiplicity correction prior, and they increase with the number of sites and with the number of surveys.

For the presence component, increasing the signal in the occupancy probabilities, or having the detection probabilities concentrate about higher values, has a considerable positive effect on the magnitude of the odds. For the detection component, these odds are particularly high, especially under the multiplicity correction prior. Also, having the distribution of the detection probabilities center about low or high values increases the minOdds_MPIP.

3.5.2 Summary Statistics for the Highest Posterior Probability Model
Tables 3-4 through 3-7 show the proportion of true predictors included in the HPM (True +) and the proportion of false predictors excluded from it (True −). The mean proportions observed in these tables deliver one clear message: the highest probability models chosen with either model prior commonly differ from the corresponding true models. The multiplicity correction prior's strong shrinkage only allows a few true predictors to be selected, but at the same time it prevents any false predictors from being included in the HPM. The uniform prior, on the other hand, includes in the HPM a larger proportion of true predictors, but at the expense of also introducing a large number of false predictors. This situation is exacerbated in the presence component, but also occurs, to a lesser extent, in the detection component.
Table 3-4. Comparison between scenarios with 50 and 100 sites in terms of the average proportion of true positive and true negative terms over the highest probability models, for the presence and the detection components, using uniform and multiplicity correcting priors on the model space.

Comp | π(M) | True +: N=50 | N=100 | True −: N=50 | N=100
Presence | Unif | 0.57 | 0.63 | 0.51 | 0.55
Presence | MC | 0.06 | 0.13 | 1.00 | 1.00
Detection | Unif | 0.77 | 0.85 | 0.87 | 0.93
Detection | MC | 0.49 | 0.70 | 1.00 | 1.00
Having more sites or surveys improves the inclusion of true predictors and the exclusion of false ones in the HPM, for both the presence and detection components (Tables 3-4 and 3-5). On the other hand, if the distribution of the occupancy probabilities is more
Table 3-5. Comparison between scenarios with 3 and 5 surveys per site in terms of the proportion of true positive and true negative predictors, averaged over the highest probability models, for the presence and the detection components, using uniform and multiplicity correcting priors on the model space.

Comp | π(M) | True +: J=3 | J=5 | True −: J=3 | J=5
Presence | Unif | 0.59 | 0.61 | 0.52 | 0.54
Presence | MC | 0.08 | 0.10 | 1.00 | 1.00
Detection | Unif | 0.78 | 0.85 | 0.87 | 0.92
Detection | MC | 0.50 | 0.68 | 1.00 | 1.00
spread out, the HPM includes more true predictors and fewer false ones in the presence component. In contrast, the effect of the spread of the occupancy probabilities on the detection HPM is negligible (Table 3-6). Finally, there is a positive relationship between the location of the median of the detection probabilities and the number of correctly classified true and false predictors for the presence. The HPM in the detection part of the model responds positively to low and high values of the median detection probability (increased signal levels) in terms of correctly classified true and false predictors (Table 3-7).
Table 3-6. Comparison between scenarios with different levels of signal in the occupancy component in terms of the proportion of true positive and true negative predictors, averaged over the highest probability models, for the presence and the detection components, using uniform and multiplicity correcting priors on the model space.

Comp | π(M) | True +: (0.3,0.5,0.7) | (0.2,0.5,0.8) | (0.1,0.5,0.9) | True −: (0.3,0.5,0.7) | (0.2,0.5,0.8) | (0.1,0.5,0.9)
Presence | Unif | 0.55 | 0.61 | 0.64 | 0.50 | 0.54 | 0.55
Presence | MC | 0.02 | 0.08 | 0.18 | 1.00 | 1.00 | 1.00
Detection | Unif | 0.81 | 0.82 | 0.81 | 0.90 | 0.89 | 0.89
Detection | MC | 0.57 | 0.61 | 0.59 | 1.00 | 1.00 | 1.00
3.6 Case Study: Blue Hawker Data Analysis

During 1999 and 2000, an intensive volunteer surveying effort coordinated by the Centre Suisse de Cartographie de la Faune (CSCF) was conducted in order to analyze the distribution of the blue hawker, Aeshna cyanea (Odonata: Aeshnidae), a common dragonfly in Switzerland. Given that Switzerland is a small and mountainous country,
Table 3-7. Comparison between scenarios with different levels of signal in the detection component in terms of the proportion of true positive and true negative predictors, averaged over the highest probability models, for the presence and the detection components, using uniform and multiplicity correcting priors on the model space.

Comp | π(M) | True +: (0.1,0.2,0.9) | (0.1,0.5,0.9) | (0.1,0.8,0.9) | True −: (0.1,0.2,0.9) | (0.1,0.5,0.9) | (0.1,0.8,0.9)
Presence | Unif | 0.59 | 0.59 | 0.62 | 0.51 | 0.54 | 0.54
Presence | MC | 0.06 | 0.10 | 0.11 | 1.00 | 1.00 | 1.00
Detection | Unif | 0.89 | 0.77 | 0.78 | 0.91 | 0.87 | 0.91
Detection | MC | 0.70 | 0.48 | 0.59 | 1.00 | 1.00 | 1.00
there is large variation in its topography and physiogeography; as such, elevation is a good candidate covariate for predicting species occurrence at a large spatial scale. It can be used as a proxy for habitat type, intensity of land use and temperature, as well as for some biotic factors (Kéry et al., 2010).
Repeated visits to 1-ha pixels took place to obtain the corresponding detection histories. In addition to the survey outcome, the x- and y-coordinates, the thermal level, the date of the survey and the elevation were recorded. Surveys were restricted to the known flight period of the blue hawker, which takes place between May 1 and October 10. In total, 2,572 sites were surveyed at least once during the surveying period. The number of surveys per site ranges from 1 to 22 within each survey year.
Kery et al. (2010) summarize the results of this effort using AIC-based model
comparisons: first, following a backwards-elimination approach for the detection
process while keeping the occupancy component fixed at the most complex model, and
then, for the presence component, choosing among a group of three models while using
the detection model already chosen. In our analysis of this dataset, for the detection and the
presence components we consider as full models those used in Kery et al. (2010), namely
Φ⁻¹(ψ) = α0 + α1 year + α2 elev + α3 elev² + α4 elev³

Φ⁻¹(p) = λ0 + λ1 year + λ2 elev + λ3 elev² + λ4 elev³ + λ5 date + λ6 date²
where year = I(year = 2000).
The model spaces for these data contain 2⁶ = 64 and 2⁴ = 16 models, respectively,
for the detection and occupancy components; in total, the model space contains
2⁴⁺⁶ = 1,024 models. Although this model space can be enumerated entirely, for
illustration we implemented the algorithm from Section 3.3.4, generating 10,000 draws
from the Gibbs sampler. Each model sampled was chosen from the set of
models that could be reached by changing the state of a single term in the current model
(to inclusion or exclusion, accordingly). This allows a more thorough exploration of the
model space because, for each of the 10,000 models drawn, the posterior probabilities
of many more models can be observed. Below, the labels for the predictors are followed
by either "z" or "y" to denote the component they pertain to. Finally,
using the results from the model selection procedure, we conducted a validation step to
determine the predictive accuracy of the highest probability models (HPMs) and of the median probability models
(MPMs). The performance of these models is then contrasted with that of the model
ultimately selected by Kery et al. (2010).

3.6.1 Results: Variable Selection Procedure
The model finally chosen for the presence component in Kery et al. (2010) was not
found among the five highest probability models under either model prior (Table 3-8). Moreover,
the year indicator was never chosen under the multiplicity-correcting prior, hinting that
this term might be a falsely identified predictor under the uniform prior.
Results in Table 3-10 support this claim: the marginal posterior inclusion probability for
the year predictor is 0.07 under the multiplicity-correction prior. The multiplicity-correction
prior also concentrates the model posterior probability mass more densely in the highest
ranked models (90% of the mass is in the top five models) than the uniform prior does
(whose top five models account for only 40% of the mass).

For the detection component, the HPM under both priors is the intercept-only model,
which we represent in Table 3-9 with a blank label. In both cases this model obtains very
Table 3-8. Posterior probability of the five highest probability models in the presence
component of the blue hawker data.

Uniform model prior
Rank  Mz selected              p(Mz | y)
1     yrz+elevz                  0.10
2     yrz+elevz+elevz3           0.08
3     elevz2+elevz3              0.08
4     yrz+elevz2                 0.07
5     yrz+elevz3                 0.07

Multiplicity-correcting model prior
Rank  Mz selected              p(Mz | y)
1     elevz+elevz3               0.53
2     (intercept only)           0.15
3     elevz+elevz2               0.09
4     elevz2                     0.06
5     elevz+elevz2+elevz3        0.05
high posterior probabilities. The terms in the cubic polynomial for elevation
appear to contain some relevant information; however, this conflicts with the MPIPs
observed in Table 3-11, which under both model priors are relatively low (< 20% with the
uniform prior and ≤ 4% with the multiplicity-correcting prior).
Table 3-9. Posterior probability of the five highest probability models in the detection
component of the blue hawker data.

Uniform model prior
Rank  My selected        p(My | y)
1     (intercept only)     0.45
2     elevy3               0.06
3     elevy2               0.05
4     elevy                0.05
5     yry                  0.04

Multiplicity-correcting model prior
Rank  My selected        p(My | y)
1     (intercept only)     0.86
2     elevy3               0.02
3     datey2               0.02
4     elevy2               0.02
5     yry                  0.02
Finally, it is possible to use the MPIPs to obtain the median probability model, which
contains the terms with an MPIP higher than 50%. For the occupancy process
(Table 3-10), under the uniform prior the year, the elevation, and the
elevation cubed are included. The MPM with the multiplicity-correction prior coincides with
the HPM from this prior. The MPM chosen for the detection component (Table 3-11)
under both priors is the intercept-only model, coinciding again with the HPM.
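The MPM rule just described is simple to express in code. The sketch below is a minimal illustration (the function name is ours, and the MPIP values are those reported in Table 3-10 for the presence component); the threshold is applied as "at least 0.5" so that a probability printed as 0.50 in the table is retained, matching the models reported in the text.

```python
# Minimal sketch of the median probability model (MPM) rule: keep every
# predictor whose marginal posterior inclusion probability (MPIP) reaches 0.5.

def median_probability_model(mpip, threshold=0.5):
    """Return the (sorted) predictors whose MPIP reaches the threshold."""
    return sorted(term for term, p in mpip.items() if p >= threshold)

# MPIPs for the presence component, as reported in Table 3-10.
presence_unif = {"yrz": 0.53, "elevz": 0.51, "elevz2": 0.45, "elevz3": 0.50}
presence_mc   = {"yrz": 0.07, "elevz": 0.73, "elevz2": 0.23, "elevz3": 0.67}

print(median_probability_model(presence_unif))  # ['elevz', 'elevz3', 'yrz']
print(median_probability_model(presence_mc))    # ['elevz', 'elevz3']
```

Under the uniform prior this recovers yrz+elevz+elevz3, and under the multiplicity-correction prior elevz+elevz3, as stated above.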
Given the outcomes of the simulation studies from Section 3.5, especially those
pertaining to the detection component, the results in Table 3-11 appear to indicate that
none of the predictors considered belong to the true model, especially when considering
Table 3-10. MPIP, presence component: p(predictor ∈ MTz | y, z, w, v).

Predictor   Unif   MultCorr
yrz         0.53     0.07
elevz       0.51     0.73
elevz2      0.45     0.23
elevz3      0.50     0.67
Table 3-11. MPIP, detection component: p(predictor ∈ MTy | y, z, w, v).

Predictor   Unif   MultCorr
yry         0.19     0.03
elevy       0.18     0.03
elevy2      0.18     0.03
elevy3      0.19     0.04
datey       0.16     0.03
datey2      0.15     0.04
those derived with the multiplicity-correction prior. On the other hand, for the presence
component (Table 3-10), there is an indication that terms related to the cubic polynomial
in elevz can explain the occupancy patterns.

3.6.2 Validation for the Selection Procedure
Approximately half of the sites were selected at random for training (i.e., for model
selection and parameter estimation), and the remaining half were used as test data. In
the previous section we observed that, using the marginal posterior inclusion probabilities
of the predictors, our method effectively separates predictors in the true model from
those that are not in it. However, in Tables 3-10 and 3-11 this separation is only clear for
the presence component using the multiplicity-correction prior.
Therefore, in the validation procedure we examine the misclassification rates for the
detections using the following models: (1) the model ultimately recommended in Kery
et al. (2010) (yrz+elevz+elevz2+elevz3 + elevy+elevy2+datey+datey2); (2) the
highest probability model (HPM) with a uniform prior (yrz+elevz); (3) the HPM with a
multiplicity-correcting prior (elevz+elevz3); (4) the median probability model (MPM),
that is, the model including only predictors with an MPIP larger than 50%, with the uniform
prior (yrz+elevz+elevz3); and finally (5) the MPM with a multiplicity-correction prior
(elevz+elevz3, the same as the HPM with multiplicity correction).
We must emphasize that the models resulting from the implementation of our model
selection procedure used exclusively the training dataset. In contrast, the model
in Kery et al. (2010) was chosen to minimize the prediction error of the complete data.
Because that model was obtained from the full dataset, results derived from it can only
be considered a lower bound for the prediction errors.

The benchmark misclassification error rate for true 1's is high (close to 70%).
However, the misclassification rate for true 0's, which account for most of the
responses, is less pronounced (15%). Overall, the performance of the selected models
is comparable: they yield considerably worse results than the benchmark for the true
1's, but achieve rates close to the benchmark for the true 0's. Pooling together
the results for true ones and true zeros, the selected models under either prior have
misclassification rates close to 30%, while the benchmark model attains a
joint misclassification error of 23% (Table 3-12).
Table 3-12. Mean misclassification rates for HPMs and MPMs using uniform and
multiplicity-correction model priors.

Model                                                      True 1  True 0  Joint
Benchmark (Kery et al. 2010):
  yrz+elevz+elevz2+elevz3+elevy+elevy2+datey+datey2         0.66    0.15    0.23
HPM, Unif: yrz+elevz                                        0.83    0.17    0.28
HPM = MPM, MC: elevz+elevz3                                 0.82    0.18    0.28
MPM, Unif: yrz+elevz+elevz3                                 0.82    0.18    0.29
3.7 Discussion
In this Chapter we proposed an objective and fully automatic Bayesian methodology for
the single-season site-occupancy model. The methodology is said to be fully automatic
because no hyper-parameter specification is necessary in defining the parameter priors,
and objective because it relies on intrinsic priors derived from noninformative priors.
The intrinsic priors have been shown to have desirable properties as testing priors. We
also propose a fast stochastic search algorithm to explore large model spaces using our
model selection procedure.

Our simulation experiments demonstrated the ability of the method to single out the
predictors present in the true model when considering the marginal posterior inclusion
probabilities of the predictors. For predictors in the true model, these probabilities
were comparatively larger than those for predictors absent from it. Also, the simulations
indicated that the method has greater discrimination capability for predictors in the
detection component of the model, especially when using multiplicity-correction priors.

Multiplicity-correction priors were not described in this Chapter; however, their
influence on the selection outcome is significant. This behavior was observed in the
simulation experiment and in the analysis of the blue hawker data. Model priors play an
essential role: as the number of predictors grows, they are instrumental in controlling
the selection of false positive predictors. Additionally, model priors can be used to
account for predictor structure in the selection process, which helps both to reduce the
size of the model space and to make the selection more robust. These issues are the
topic of the next Chapter.

Accounting for the polynomial hierarchy in the predictors within the occupancy
context is a straightforward extension of the procedures we describe in Chapter 4.
Hence, our next step is to develop efficient software for it. An additional direction we
plan to pursue is developing methods for occupancy variable selection in a multivariate
setting, which can be used to conduct hypothesis testing in scenarios with conditions
varying through time, or where multiple species are co-observed. A
final variation we will investigate for this problem is occupancy model selection
incorporating random effects.
CHAPTER 4
PRIORS ON THE MODEL SPACE AND WELL-FORMULATED MODELS
It has long been an axiom of mine that the little things are infinitely the most important.

–Sherlock Holmes, A Case of Identity
4.1 Introduction
In regression problems, if a large number of potential predictors is available, the
complete model space is too large to enumerate, and automatic selection algorithms are
necessary to find informative, parsimonious models. This multiple testing problem
is difficult, and even more so when interactions or powers of the predictors are
considered. In the ecological literature, models with interactions and/or higher-order
polynomial terms are ubiquitous (Johnson et al. 2013; Kery et al. 2010; Zeller et al.
2011), given the complexity and non-linearities found in ecological processes. Several
model selection procedures, even in the classical normal linear setting, fail to address
two fundamental issues: (1) the model selection outcome is not invariant to affine
transformations when interactions or polynomial structures are present among the
predictors, and (2) additional penalization is required to control for false positives as the
model space grows (i.e., as more covariates are considered).
These two issues motivate the methods developed throughout this Chapter.
Building on the results of Chipman (1996), we propose, investigate, and provide
recommendations for three different prior distributions on the model space. These
priors help control for test multiplicity while accounting for polynomial structure in the
predictors. They improve upon those proposed by Chipman, first by avoiding the need
to specify values for the prior inclusion probabilities of the predictors, and second
by formulating principled alternatives to introduce additional structure into the model
priors. Finally, we design a stochastic search algorithm that allows fast and thorough
exploration of model spaces with polynomial structure.
Having structure in the predictors can determine the selection outcome. As an
illustration, consider the model E[y] = β00 + β01 x2 + β20 x1², where the order-one
term x1 is not present (this choice of subscripts for the coefficients is defined in the
following section). Transforming x1 → x1* = x1 + c for some c ≠ 0, the model
becomes E[y] = β00 + β01 x2 + β20* x1*². Note that, in terms of the original predictors,
x1*² = x1² + 2c x1 + c², implying that this seemingly innocuous transformation of x1
modifies the column space of the design matrix by including x1, which was not in the
original model. That is, when lower-order terms in the hierarchy are omitted from the
model, the column space of the design matrix is not invariant to affine transformations.
As the hat matrix depends on the column space, the model's predictive capability is also
affected by how the covariates in the model are coded, an undesirable feature for any
model selection procedure. To make model selection invariant to affine transformations,
the selection must be constrained to the subset of models that respect the hierarchy
(Griepentrog et al. 1982; Khuri 2002; McCullagh & Nelder 1989; Nelder 2000;
Peixoto 1987, 1990). These models are known as well-formulated models (WFMs).
Succinctly, a model is well-formulated if, for any predictor in the model, every lower-order
predictor associated with it is also in the model. The model above is not well-formulated,
as it contains x1² but not x1.
WFMs exhibit strong heredity, in that all lower-order terms dividing higher-order
terms in the model must also be included. An alternative is to require only weak heredity
(Chipman 1996), which forces only some of the lower terms in the corresponding
polynomial hierarchy to be in the model. However, Nelder (1998) demonstrated that the
conditions under which weak heredity keeps the design matrix invariant to affine
transformations of the predictors are too restrictive to be useful in practice.
Although this topic appeared in the literature more than three decades ago (Nelder
1977), only recently have modern variable selection techniques been adapted to
account for the constraints imposed by heredity. As described in Bien et al. (2013),
the current literature on variable selection for polynomial response surface models
can be classified into three broad groups: multi-step procedures (Brusco et al. 2009;
Peixoto 1987), regularized regression methods (Bien et al. 2013; Yuan et al. 2009),
and Bayesian approaches (Chipman 1996). The methods introduced in this Chapter
take a Bayesian approach to variable selection for well-formulated models, with
particular emphasis on model priors.
As mentioned in previous chapters, the Bayesian variable selection problem
consists of finding models with high posterior probabilities within a pre-specified model
space M. The model posterior probability for M ∈ M is given by

p(M | y, M) ∝ m(y | M) π(M | M).   (4-1)

Model posterior probabilities depend on the prior distribution on the model space,
as well as on the prior distributions for the model-specific parameters, implicitly through
the marginals m(y | M). Priors on the model-specific parameters have been extensively
discussed in the literature (Berger & Pericchi 1996; Berger et al. 2001; George 2000;
Jeffreys 1961; Kass & Wasserman 1996; Liang et al. 2008; Zellner & Siow 1980). In
contrast, the effect of the prior on the model space has, until recently, been neglected.
A few authors (e.g., Casella et al. 2014; Scott & Berger 2010; Wilson et al. 2010)
have highlighted the relevance of priors on the model space in the context of multiple
testing. Adequately formulated priors on the model space can both account for structure
in the predictors and provide additional control over the detection of false positive terms.
In addition, using the popular uniform prior over the model space may lead to the
undesirable and "informative" implication of favoring models of size p/2 (where p is the
total number of covariates), since this is the most abundant model size in the
model space.
Variable selection within the model space of well-formulated polynomial models
poses two challenges for automatic objective model selection procedures. First, the
notion of model complexity takes on a new dimension: complexity is not exclusively
a function of the number of predictors, but also depends upon the depth and
connectedness of the associations defined by the polynomial hierarchy. Second,
because the model space is shaped by such relationships, stochastic search algorithms
used to explore the models must also conform to these restrictions.

Models without polynomial hierarchy constitute a special case of WFMs in which
all predictors are of order one. Hence, all the methods developed throughout this
Chapter also apply to models with no predictor structure. Additionally, although our
proposed methods are presented for the normal linear case to simplify the exposition,
they are general enough to be embedded in many Bayesian selection
and averaging procedures, including, of course, the occupancy framework previously
discussed.
In this Chapter, we first provide the necessary definitions to characterize the
well-formulated model selection problem. We then introduce three new prior
structures on the well-formulated model space and characterize their behavior with
simple examples and simulations. With the model priors in place, we build a stochastic
search algorithm to explore spaces of well-formulated models that relies on intrinsic
priors for the model-specific parameters (though this assumption can be relaxed
to use other mixtures of g-priors). Finally, we implement our procedures using both
simulated and real data.
4.2 Setup for Well-Formulated Models
Suppose that the observations yi are modeled using the polynomial regression on the
covariates xi1, ..., xip given by

yi = Σ_{α ∈ N0^p} βα ∏_{j=1}^{p} xij^{αj} + ϵi,   (4-2)

where α = (α1, ..., αp) belongs to N0^p, the p-dimensional space of natural numbers
including 0, with ϵi iid N(0, σ²), and only finitely many βα allowed to be non-zero.
As an illustration, consider a model space that includes polynomial terms incorporating
covariates xi1 and xi2 only; the terms xi2² and xi1²xi2 can be represented by α = (0, 2)
and α = (2, 1), respectively.
The notation y = Z(X)β + ϵ is used to denote that the observed response y =
(y1, ..., yn)′ is modeled via a polynomial function Z of the original covariates contained
in X = (x1, ..., xp) (where xj = (x1j, ..., xnj)′), with the coefficients of the polynomial
terms given by β. A specific polynomial model M is defined by the set of coefficients
βα that are allowed to be non-zero. This definition is equivalent to characterizing M
through a collection of multi-indices α ∈ N0^p. In particular, model M is specified by
M = {αM,1, ..., αM,|M|} for αM,k ∈ N0^p, where βα = 0 for α ∉ M.

Any particular model M uses a subset XM of the original covariates X to form the
polynomial terms in the design matrix ZM(X). Without ambiguity, a polynomial model
ZM(X) on X can be identified with a polynomial model ZM(XM) on the covariates XM.
The number of terms used by M to model the response y, denoted by |M|, corresponds
to the number of columns of ZM(XM). The coefficient vector and error variance of
model M are denoted by βM and σ²M, respectively. Thus, M models the data as
y = ZM(XM)βM + ϵM, where ϵM ∼ N(0, I σ²M). Model M is said to be nested in model M′
if M ⊂ M′. M models the response to the covariates in two distinct ways: by choosing the
set of meaningful covariates XM, and by choosing the polynomial structure ZM(XM) of these
covariates.
The set N0^p constitutes a partially ordered set, or more succinctly a poset: a set
partially ordered through a binary relation "≼". In this context, the binary relation
on the poset N0^p is defined between pairs (α, α′) by α′ ≼ α whenever αj ≥ α′j for all
j = 1, ..., p, with α′ ≺ α if, additionally, αj > α′j for some j. The order of a term α ∈ N0^p
is given by the sum of its elements, order(α) = Σj αj. When order(α) = order(α′) + 1
and α′ ≺ α, then α′ is said to immediately precede α, which is denoted by α′ → α.
The parent set of α is defined by P(α) = {α′ ∈ N0^p : α′ → α}, and is given by the
set of nodes that immediately precede the given node. A polynomial model M is said to
be well-formulated if α ∈ M implies that P(α) ⊂ M. For example, any well-formulated
model using xi1²xi2 to model yi must also include the parent terms xi1xi2 and xi1², their
corresponding parent terms xi1 and xi2, and the intercept term 1.
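The parent-set and well-formulation definitions translate directly into code. The following sketch is our own illustration (function names are ours): multi-indices are tuples in N0^p, a parent is obtained by decreasing one positive exponent by 1, and a model is well-formulated exactly when it is closed under taking parents.

```python
def parents(alpha):
    """Parent set P(alpha): decrease one positive exponent by 1."""
    return {alpha[:j] + (alpha[j] - 1,) + alpha[j + 1:]
            for j in range(len(alpha)) if alpha[j] > 0}

def is_well_formulated(model):
    """A model (a set of multi-indices) is a WFM if alpha in M => P(alpha) in M."""
    return all(parents(a) <= model for a in model)

# x1^2 * x2 corresponds to alpha = (2, 1); its full ancestry must be present.
wfm = {(0, 0), (1, 0), (0, 1), (2, 0), (1, 1), (2, 1)}
not_wfm = {(0, 0), (0, 1), (2, 0)}   # contains x1^2 but not x1

print(is_well_formulated(wfm))      # True
print(is_well_formulated(not_wfm))  # False
```

The second example is precisely the violation discussed above: including x1² without its parent x1.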
The poset N0^p can be represented by a directed acyclic graph (DAG), denoted here
by Γ(N0^p). Without ambiguity, we can identify nodes in the graph, α ∈ N0^p, with terms in
the set of covariates. The graph has directed edges to each node from its parents. Any
well-formulated model M is represented by a subgraph Γ(M) of Γ(N0^p) with the property
that if node α ∈ Γ(M), then the nodes corresponding to P(α) are also in Γ(M). Figure
4-1 shows examples of well-formulated polynomial models, where α ∈ N0^p is identified
with ∏_{j=1}^{p} xj^{αj}.
The motivation for considering only well-formulated polynomial models is
compelling. Let ZM be the design matrix associated with a polynomial model M. The
subspace of y modeled by ZM, given by the hat matrix HM = ZM(Z′M ZM)⁻¹ Z′M, is
invariant to affine transformations of the matrix XM if and only if M corresponds to a
well-formulated polynomial model (Peixoto 1990).
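This invariance is easy to check numerically. The sketch below is an illustration of the property, not part of the original development; it assumes NumPy is available, uses random data, and compares hat matrices before and after the shift x1 → x1 + c for a well-formulated model, {1, x1, x2, x1x2}, and a non-well-formulated one, {1, x1²}.

```python
import numpy as np

def hat(Z):
    """Hat matrix H = Z (Z'Z)^{-1} Z', computed via the pseudo-inverse."""
    return Z @ np.linalg.pinv(Z)

rng = np.random.default_rng(0)
x1, x2 = rng.normal(size=10), rng.normal(size=10)
c = 1.7  # an arbitrary nonzero shift

# Well-formulated: {1, x1, x2, x1*x2} -- the hat matrix is unchanged by the shift,
# because the shifted columns span the same column space.
Z  = np.column_stack([np.ones(10), x1, x2, x1 * x2])
Zc = np.column_stack([np.ones(10), x1 + c, x2, (x1 + c) * x2])
print(np.allclose(hat(Z), hat(Zc)))   # True

# Not well-formulated: {1, x1^2} -- (x1 + c)^2 = x1^2 + 2c*x1 + c^2 drags in x1,
# so the column space (and hence the hat matrix) changes.
W  = np.column_stack([np.ones(10), x1 ** 2])
Wc = np.column_stack([np.ones(10), (x1 + c) ** 2])
print(np.allclose(hat(W), hat(Wc)))   # False
```

The second half reproduces the β(2,0) example given next in the text.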
Figure 4-1. Graphs of well-formulated polynomial models for p = 2 (panels A and B).
For example, if p = 2 and yi = β(0,0) + β(1,0)xi1 + β(0,1)xi2 + β(1,1)xi1xi2 + ϵi, then
the hat matrix is invariant to any covariate transformation of the form A(xi1, xi2)′ + b, for any
real-valued positive definite 2 × 2 matrix A and any real-valued vector b of dimension two.
In contrast, if yi = β(0,0) + β(2,0)xi1² + ϵi, then the hat matrix formed after applying the
transformation xi1 → xi1 + c, for real c ≠ 0, is not the same as the hat matrix formed with
the original xi1.

4.2.1 Well-Formulated Model Spaces
The spaces of WFMs M considered in this chapter can be characterized in terms
of two WFMs: MB, the base model, and MF, the full model. The base model contains at
least the intercept term and is nested in the full model. The model space M is populated
by all well-formulated models M that nest MB and are nested in MF:

M = {M : MB ⊆ M ⊆ MF and M is well-formulated}.

For M to be well-formulated, the entire ancestry of each node in M must also be
included in M. Because of this, M ∈ M can be uniquely identified by two different sets
of nodes in MF: the set of extreme nodes and the set of children nodes. For M ∈ M,
the sets of extreme and children nodes, respectively denoted by E(M) and C(M), are
defined by

E(M) = {α ∈ M \ MB : α ∉ P(α′) for all α′ ∈ M},
C(M) = {α ∈ MF \ M : {α} ∪ M is well-formulated}.

The extreme nodes are those nodes that, when removed from M, give rise to a WFM in
M. The children nodes are those nodes that, when added to M, give rise to a WFM in
M. Because MB ⊆ M for all M ∈ M, the set of nodes E(M) ∪ MB determines M, by
beginning with this set and iteratively adding parent nodes. Similarly, the nodes in C(M)
determine the set {α′ ∈ P(α) : α ∈ C(M)} ∪ {α′ ∈ E(MF) : α ⋠ α′ for all α ∈ C(M)},
which contains E(M) ∪ MB and thus uniquely identifies M.
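For a concrete check, E(M) and C(M) can be computed directly from the definitions. The sketch below (helper names are ours; multi-indices are tuples) reproduces the configuration of Figure 4-2, where M = {1, x1, x1²} in the quadratic space has extreme set {x1²} and a single child node, x2.

```python
def parents(alpha):
    """Parent set P(alpha): decrease one positive exponent by 1."""
    return {alpha[:j] + (alpha[j] - 1,) + alpha[j + 1:]
            for j in range(len(alpha)) if alpha[j] > 0}

def extreme_nodes(M, MB):
    """E(M): nodes of M \\ MB that are parents of no node in M."""
    return {a for a in M - MB if not any(a in parents(b) for b in M)}

def children_nodes(M, MF):
    """C(M): nodes of MF outside M whose parents all lie in M."""
    return {a for a in MF - M if parents(a) <= M}

MB = {(0, 0)}                                       # intercept-only base model
MF = {(0, 0), (1, 0), (0, 1), (2, 0), (1, 1), (0, 2)}  # quadratic surface in x1, x2
M  = {(0, 0), (1, 0), (2, 0)}                       # M = {1, x1, x1^2}

print(extreme_nodes(M, MB))    # {(2, 0)}  i.e. x1^2
print(children_nodes(M, MF))   # {(0, 1)}  i.e. x2
```

Removing the extreme node x1² or adding the child node x2 leaves a well-formulated model, as the definitions require.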
Figure 4-2. A) Extreme node set. B) Children node set. (Each panel shows the DAG
with nodes 1, x1, x2, x1², x1x2, x2².)
In Figure 4-2, the extreme and children sets for the model M = {1, x1, x1²} are shown for
the model space characterized by MF = {1, x1, x2, x1², x1x2, x2²}. In Figure 4-2A, the solid
nodes represent nodes α ∈ M \ E(M), the dashed node corresponds to α ∈ E(M), and
the dotted nodes are not in M. Solid nodes in Figure 4-2B correspond to those in M;
the dashed node is the single node in C(M), and the dotted nodes are not in M ∪ C(M).

4.3 Priors on the Model Space
As discussed in Scott & Berger (2010), the Ockham's-razor effect found
automatically in Bayesian variable selection through the Bayes factor does not correct
for multiple testing. This penalization acts against more complex models, but does not
account for the collection of models in the model space, which describes the multiplicity
of the testing problem. This is where the role of the prior on the model space becomes
important: as Scott & Berger explain, the multiplicity penalty is "hidden away" in the
model prior probabilities π(M | M).
In what follows, we propose three different prior structures on the model space
for WFMs, discuss their advantages and disadvantages, and describe reasonable
choices for their hyper-parameters. In addition, we investigate how the choice of
prior structure and hyper-parameter combination affects the posterior probabilities of
predictor inclusion, providing some recommendations for different situations.

4.3.1 Model Prior Definition
The graphical structure of the model space suggests a method for prior
construction on M guided by the notion of inheritance. A node α is said to inherit from
a node α′ if there is a directed path from α′ to α in the graph Γ(MF). The inheritance
is said to be immediate if order(α) = order(α′) + 1 (equivalently, if α′ ∈ P(α), or if α′
immediately precedes α).

For convenience, define ∆(M) = M \ MB to be the set of nodes in M that are not
in the base model MB. For α ∈ ∆(MF), let γα(M) be the indicator function describing
whether α is included in M, i.e., γα(M) = I(α ∈ M). Denote by γ^ν(M) the set of indicators
of inclusion in M for all order-ν nodes in ∆(MF). Finally, let γ^{<ν}(M) = ∪_{j=0}^{ν−1} γ^j(M),
the set of indicators of inclusion in M for all nodes in ∆(MF) of order less than ν. With
these definitions, the prior probability of any model M ∈ M can be factored as

π(M | M) = ∏_{j=Jmin_M}^{Jmax_M} π(γ^j(M) | γ^{<j}(M), M),   (4-3)

where Jmin_M and Jmax_M are, respectively, the minimum and maximum order of nodes in
∆(MF), and π(γ^{Jmin_M}(M) | γ^{<Jmin_M}(M), M) = π(γ^{Jmin_M}(M) | M).
Prior distributions on M can be simplified by making two assumptions. First, if
order(α) = order(α′) = j, then γα and γα′ are assumed to be conditionally independent
given γ^{<j}, denoted by γα ⊥⊥ γα′ | γ^{<j}. Second, immediate inheritance is
invoked: it is assumed that if order(α) = j, then γα(M) | γ^{<j}(M) = γα(M) | γP(α)(M),
where γP(α)(M) is the inclusion indicator for the set of parent nodes of α. This indicator
is one if the complete parent set of α is contained in M, and zero otherwise.

In Figure 4-3, these two assumptions are depicted with MF being an order-two
surface in two main effects. The conditional independence assumption (Figure 4-3A)
implies that the inclusion indicators for x1², x2², and x1x2 are independent when conditioned
on all the lower-order terms. In this same space, immediate inheritance implies that
the inclusion of x1², conditioned on the inclusion of all lower-order nodes, is equivalent to
conditioning on its parent set (x1 in this case).
Figure 4-3. A) Conditional independence: x1² ⊥⊥ x1x2 ⊥⊥ x2², given {1, x1, x2}.
B) Immediate inheritance: x1² | {1, x1, x2} = x1² | x1.
Denote the conditional inclusion probability of node α in model M by πα =
π(γα(M) = 1 | γP(α)(M), M). Under the assumptions of conditional independence
and immediate inheritance, the prior probability of M is

π(M | πM, M) = ∏_{α ∈ ∆(MF)} πα^{γα(M)} (1 − πα)^{1−γα(M)},   (4-4)

with πM = {πα : α ∈ ∆(MF)}. Because M must be well-formulated, πα = γα =
0 if γP(α)(M) = 0. Thus, the product in (4-4) can be restricted to the set of nodes
α ∈ ∆(M) ∪ C(M). Additional structure can be built into the prior on M by making
assumptions about the inclusion probabilities πα, such as equality assumptions or
the assumption of a hyper-prior for these parameters. Three such prior classes are
developed next, first by assigning hyper-priors on πM assuming some structure among
its elements, and then marginalizing out πM.
Hierarchical Uniform Prior (HUP). The HUP assumes that the non-zero πα
are all equal. Specifically, for a model M ∈ M it is assumed that πα = π for all
α ∈ ∆(M) ∪ C(M). A complete Bayesian specification of the HUP is obtained by
assuming a prior distribution for π. The choice π ∼ Beta(a, b) produces

π_HUP(M | M, a, b) = B(|∆(M)| + a, |C(M)| + b) / B(a, b),   (4-5)

where B is the beta function. Setting a = b = 1 gives the particular value

π_HUP(M | M, a = 1, b = 1) = [1 / (|∆(M)| + |C(M)| + 1)] · (|∆(M)| + |C(M)| choose |∆(M)|)⁻¹.   (4-6)

The HUP assigns equal probabilities to all models for which the sets of nodes ∆(M)
and C(M) have the same cardinality. This prior provides a combinatorial penalization,
but essentially fails to account for the hierarchical structure of the model space. An
additional penalization for model complexity can be incorporated into the HUP by
changing the values of a and b. Because πα = π for all α, this penalization can only
depend on some aspect of the entire graph of MF, such as the total number of nodes
not in the null model, |∆(MF)|.
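Since the HUP mass of a model depends only on the counts |∆(M)| and |C(M)|, it can be evaluated with log-beta functions. The sketch below is our own illustration (function names are ours) for the model {1, x1} in the quadratic space of Figure 4-4, where ∆(M) = {x1} and C(M) = {x2, x1²}, so the a = b = 1 probability is 1/12.

```python
from math import lgamma, exp, comb

def log_beta(a, b):
    """log B(a, b) = log Gamma(a) + log Gamma(b) - log Gamma(a + b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def hup_prob(n_delta, n_children, a=1.0, b=1.0):
    """Eq. (4-5): B(|Delta(M)| + a, |C(M)| + b) / B(a, b)."""
    return exp(log_beta(n_delta + a, n_children + b) - log_beta(a, b))

# Model {1, x1}: |Delta(M)| = 1, |C(M)| = 2.
p = hup_prob(1, 2)
print(round(p, 4))   # 0.0833, i.e. 1/12

# Consistency with the closed form (4-6) for a = b = 1:
assert abs(p - 1 / ((1 + 2 + 1) * comb(1 + 2, 1))) < 1e-12
```

The same counts-only structure is what makes the HUP blind to where in the hierarchy the nodes sit.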
Hierarchical Independence Prior (HIP). The HIP assumes that there are no
equality constraints among the non-zero πα. Each non-zero πα is given its own prior,
which is assumed to be a Beta distribution with parameters aα and bα. Thus, the prior
probability of M under the HIP is

π_HIP(M | M, a, b) = ∏_{α ∈ ∆(M)} [aα / (aα + bα)] · ∏_{α ∈ C(M)} [bα / (aα + bα)],   (4-7)

where a product over the empty set is taken to be 1. Because the πα are totally independent, any
choice of aα and bα is equivalent to choosing a probability of success πα for a given α.
Setting aα = bα = 1 for all α ∈ ∆(M) ∪ C(M) gives the particular value

π_HIP(M | M, a = 1, b = 1) = (1/2)^{|∆(M)| + |C(M)|}.   (4-8)

Although the prior with this choice of hyper-parameters accounts for the hierarchical
structure of the model space, it provides essentially no penalization for combinatorial
complexity at different levels of the hierarchy. This can be observed by considering a
model space with main effects only: the exponent in (4-8) is the same for every model in
the space, because each node is either in the model or in the children set.
Additional penalizations for model complexity can be incorporated into the HIP.
Because each γ^j is conditioned on γ^{<j} in the prior construction, the aα and bα for α of
order j can be conditioned on γ^{<j}. One such additional penalization utilizes the number
of nodes of order j that could be added to produce a WFM, conditioned on the inclusion
vector γ^{<j}, which is denoted ch_j(γ^{<j}). Choosing aα = 1 and bα(M) = ch_j(γ^{<j}) is
equivalent to choosing a probability of success πα = 1/ch_j(γ^{<j}). This penalization can
drive down the false positive rate when ch_j(γ^{<j}) is large, but may produce more false
negatives.
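A hedged sketch of the HIP computation (function names and dictionary layout are ours): each node carries its own (aα, bα) pair, and Eq. (4-7) is a product of per-node odds. For the model {1, x1} in the quadratic space of Figure 4-4, ∆(M) = {x1} and C(M) = {x2, x1²}.

```python
def hip_prob(delta, children, ab):
    """Eq. (4-7): prod over Delta(M) of a/(a+b), times prod over C(M) of b/(a+b).
    `ab` maps each node label to its (a_alpha, b_alpha) pair."""
    p = 1.0
    for alpha in delta:
        a, b = ab[alpha]
        p *= a / (a + b)
    for alpha in children:
        a, b = ab[alpha]
        p *= b / (a + b)
    return p

# a = b = 1 everywhere: Eq. (4-8) gives (1/2)^3 = 1/8 for {1, x1}.
ab_unit = {"x1": (1, 1), "x2": (1, 1), "x1^2": (1, 1)}
print(hip_prob(["x1"], ["x2", "x1^2"], ab_unit))          # 0.125

# a = 1, b = ch: two order-1 nodes are addable at order 1, one node at order 2,
# so b = 2 for x1 and x2 and b = 1 for x1^2, giving 1/9 as in Figure 4-4.
ab_ch = {"x1": (1, 2), "x2": (1, 2), "x1^2": (1, 1)}
print(round(hip_prob(["x1"], ["x2", "x1^2"], ab_ch), 4))  # 0.1111
```

Both printed values match the corresponding HIP entries for model {1, x1} in Figure 4-4.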
Hierarchical Order Prior (HOP). A compromise between complete equality and
complete independence of the πα is to assume equality between the πα of a given
order and independence across orders. Define ∆_j(M) = {α ∈ ∆(M) :
order(α) = j} and C_j(M) = {α ∈ C(M) : order(α) = j}. The HOP assumes that πα = πj
for all α ∈ ∆_j(M) ∪ C_j(M). Assuming that πj ∼ Beta(aj, bj) provides the prior probability

π_HOP(M | M, a, b) = ∏_{j=Jmin_M}^{Jmax_M} B(|∆_j(M)| + aj, |C_j(M)| + bj) / B(aj, bj).   (4-9)

The specific choice of aj = bj = 1 for all j gives

π_HOP(M | M, a = 1, b = 1) = ∏_j [1 / (|∆_j(M)| + |C_j(M)| + 1)] · (|∆_j(M)| + |C_j(M)| choose |∆_j(M)|)⁻¹,   (4-10)

and produces a hierarchical version of the Scott and Berger multiplicity correction.
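Eq. (4-9) factors across orders, so it can be sketched as a product of per-order beta ratios. The helper below is ours (names and argument layout are assumptions); it is checked against the model {1, x1} in the quadratic space of Figure 4-4, where at order 1, ∆₁ = {x1} and C₁ = {x2}, and at order 2, ∆₂ = {} and C₂ = {x1²}, giving 1/12 under a = b = 1.

```python
from math import lgamma, exp

def log_beta(a, b):
    """log B(a, b) = log Gamma(a) + log Gamma(b) - log Gamma(a + b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def hop_prob(counts, ab):
    """Eq. (4-9): product over orders j of B(n_j + a_j, c_j + b_j) / B(a_j, b_j),
    where counts[j] = (n_j, c_j) = (|Delta_j(M)|, |C_j(M)|)."""
    log_p = 0.0
    for j, (n_j, c_j) in counts.items():
        a_j, b_j = ab[j]
        log_p += log_beta(n_j + a_j, c_j + b_j) - log_beta(a_j, b_j)
    return exp(log_p)

# Model {1, x1}: order 1 -> (1, 1); order 2 -> (0, 1); a_j = b_j = 1.
p = hop_prob({1: (1, 1), 2: (0, 1)}, {1: (1, 1), 2: (1, 1)})
print(round(p, 4))   # 0.0833 -> the 1/12 HOP entry in Figure 4-4
```

Per-order factorization is exactly what lets the HOP penalize combinatorial complexity level by level while remaining independent across levels.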
The HOP arises from a conditional exchangeability assumption on the indicator
variables: conditioned on γ^{<j}(M), the indicators {γα : α ∈ ∆_j(M) ∪ C_j(M)} are
assumed to be exchangeable Bernoulli random variables. By de Finetti's theorem, these
arise from independent Bernoulli random variables with common probability of success
πj, itself carrying a prior distribution; our construction of the HOP assumes that this prior is a
beta distribution. Additional complexity penalizations can be incorporated into the HOP
in a fashion similar to the HIP. The number of nodes of order j that could be added
while maintaining a WFM is given by ch_j(M) = ch_j(γ^{<j}(M)) = |∆_j(M) ∪ C_j(M)|.
Using aj = 1 and bj(M) = ch_j(M) produces a prior with two desirable properties.
First, if M′ ⊂ M, then π(M) ≤ π(M′). Second, for each order j, the conditional
probability of including k nodes is greater than or equal to that of including k + 1 nodes,
for k = 0, 1, ..., ch_j(M) − 1.

4.3.2 Choice of Prior Structure and Hyper-Parameters
Each of the priors introduced in Section 4.3.1 defines a whole family of model priors,
characterized by the probability distribution assumed for the inclusion probabilities πM.
For the sake of simplicity, this chapter focuses on those arising from Beta distributions,
and concentrates on particular choices of hyper-parameters that can be specified
automatically. First, we describe some general features of how each of the three
prior structures (HUP, HIP, HOP) allocates mass to the models in the model space.
Second, as there is an infinite number of ways in which the hyper-parameters can be
specified, focus is placed on the default choice a = b = 1, as well as on the complexity
penalizations described in Section 4.3.1. The second alternative is referred to as a =
1, b = ch, where b = ch has a slightly different interpretation depending on the prior
structure. Accordingly, b = ch is given by bj(M) = bα(M) = ch_j(M) = |∆_j(M) ∪ C_j(M)|
for the HOP and HIP (where j = order(α)), while b = ch denotes that b = |∆(MF)| for
the HUP. The prior behavior is illustrated for two model spaces; in both cases, the base model MB is
taken to be the intercept-only model and MF is the DAG shown (Figures 4-4 and 4-5).
The priors treat model complexity differently, and some general properties
can be seen in these examples.
     Model                          HIP             HOP             HUP
                                (1,1)  (1,ch)   (1,1)  (1,ch)   (1,1)  (1,ch)
  1  1                           1/4    4/9      1/3    1/2      1/3    5/7
  2  1, x1                       1/8    1/9      1/12   1/12     1/12   5/56
  3  1, x2                       1/8    1/9      1/12   1/12     1/12   5/56
  4  1, x1, x1^2                 1/8    1/9      1/12   1/12     1/12   5/168
  5  1, x2, x2^2                 1/8    1/9      1/12   1/12     1/12   5/168
  6  1, x1, x2                   1/32   3/64     1/12   1/12     1/60   1/72
  7  1, x1, x2, x1^2             1/32   1/64     1/36   1/60     1/60   1/168
  8  1, x1, x2, x1x2             1/32   1/64     1/36   1/60     1/60   1/168
  9  1, x1, x2, x2^2             1/32   1/64     1/36   1/60     1/60   1/168
 10  1, x1, x2, x1^2, x1x2       1/32   1/192    1/36   1/120    1/30   1/252
 11  1, x1, x2, x1^2, x2^2       1/32   1/192    1/36   1/120    1/30   1/252
 12  1, x1, x2, x1x2, x2^2       1/32   1/192    1/36   1/120    1/30   1/252
 13  1, x1, x2, x1^2, x1x2, x2^2 1/32   1/576    1/12   1/120    1/6    1/252

Figure 4-4. Prior probabilities for the space of well-formulated models associated with the quadratic surface in two variables, where M_B is taken to be the intercept-only model and (a, b) ∈ {(1, 1), (1, ch)}
First, contrast the HIP, HUP, and HOP for the choice of (a, b) = (1, 1). The HIP induces a complexity penalization that only accounts for the order of the terms in the model. This is best exhibited by the model space in Figure 4-4: models including x1 and x2 (models 6 through 13) are given the same prior probability, and no penalization is incurred for the inclusion of any or all of the quadratic terms. In contrast to the HIP, the
     Model                      HIP             HOP             HUP
                            (1,1)  (1,ch)   (1,1)  (1,ch)   (1,1)  (1,ch)
  1  1                       1/8   27/64     1/4    1/2      1/4    4/7
  2  1, x1                   1/8    9/64     1/12   1/10     1/12   2/21
  3  1, x2                   1/8    9/64     1/12   1/10     1/12   2/21
  4  1, x3                   1/8    9/64     1/12   1/10     1/12   2/21
  5  1, x1, x3               1/8    3/64     1/12   1/20     1/12   4/105
  6  1, x2, x3               1/8    3/64     1/12   1/20     1/12   4/105
  7  1, x1, x2               1/16   3/128    1/24   1/40     1/30   1/42
  8  1, x1, x2, x1x2         1/16   3/128    1/24   1/40     1/20   1/70
  9  1, x1, x2, x3           1/16   1/128    1/8    1/40     1/20   1/70
 10  1, x1, x2, x3, x1x2     1/16   1/128    1/8    1/40     1/5    1/70

Figure 4-5. Prior probabilities for the space of well-formulated models associated with three main effects and one interaction term, where M_B is taken to be the intercept-only model and (a, b) ∈ {(1, 1), (1, ch)}
HUP induces a penalization for model complexity, but it does not adequately penalize models for including additional terms. Using the HUP, models including all of the terms are given at least as much probability as any model containing a non-empty set of terms (Figures 4-4 and 4-5). This lack of penalization of the full model originates from its combinatorial simplicity (i.e., this is the only model that contains every term), and as an unfortunate consequence, this model space distribution favors the base and full models. Similar behavior is observed with the HOP with (a, b) = (1, 1). As models become more complex, they are appropriately penalized for their size. However, after a sufficient number of nodes are added, the number of possible models of that particular size is considerably reduced; thus, combinatorial complexity is negligible for the largest models. This is best exhibited in Figure 4-5, where the HOP places more mass on the full model than on any model containing a single order-one node, highlighting an undesirable behavior of the priors with this choice of hyper-parameters.
In contrast, if (a, b) = (1, ch), all three priors produce strong penalization as models become more complex, both in terms of the number and the order of the nodes contained in the model. For all of the priors, adding a node α to a model M to form M′ produces π(M) ≥ π(M′). However, differences between the priors are apparent. The HIP penalizes the full model the most, with the HOP penalizing it the least and the HUP lying between them. At face value, the HOP creates the most compelling penalization of model complexity. In Figure 4-5 the penalization of the HOP is the least dramatic, producing prior odds of 20 for M_B versus M_F, as opposed to the HUP and HIP, which produce prior odds of 40 and 54, respectively. Similarly, the prior odds in Figure 4-4 are 60, 180, and 256 for the HOP, HUP, and HIP, respectively.

4.3.3 Posterior Sensitivity to the Choice of Prior
To determine how the proposed priors adjust the posterior probabilities to account for multiplicity, a simple simulation was performed. The goal of this exercise was to understand how the priors respond to increasing complexity. First, the priors are compared as the number of main effects p grows. Second, they are compared as the depth of the hierarchy increases, or in other words, as the order J^M_max increases.

The quality of a node is characterized by its marginal posterior inclusion probability, defined as p_α = Σ_{M ∈ M} I(α ∈ M) p(M | y, M) for α ∈ M_F. These posteriors were obtained for the proposed priors as well as for the Equal Probability Prior (EPP) on M. For all prior structures, both the default hyper-parameters a = b = 1 and the penalizing choice of a = 1 and b = ch are considered. The results for the different combinations of M_F and M_T incorporated in the analysis were obtained from 100 random replications (i.e., generating at random 100 matrices of main effects and responses). The simulation proceeds as follows:
1. Randomly generate main effects matrices X = (x_1, ..., x_18) with x_i iid ∼ N_n(0, I_n), and error vectors ϵ ∼ N_n(0, I_n), for n = 60.

2. Setting all coefficient values equal to one, calculate y = Z_{M_T} β + ϵ for the true models given by
   M_T1 = {x1, x2, x3, x1^2, x1x2, x2^2, x2x3} with |M_T1| = 7;
   M_T2 = {x1, x2, ..., x16} with |M_T2| = 16;
   M_T3 = {x1, x2, x3, x4} with |M_T3| = 4;
   M_T4 = {x1, x2, ..., x8, x1^2, x3x4} with |M_T4| = 10;
   M_T5 = {x1, x2, x3, x4, x1^2, x3x4} with |M_T5| = 6.
Table 4-1. Characterization of the full models M_F and corresponding model spaces M considered in simulations

Growing p, fixed J^M_max              Fixed p, growing J^M_max
M_F              |M_F|    |M|  M_T used   M_F              |M_F|     |M|  M_T used
(x1+x2+x3)^2        9      95  M_T1       (x1+x2+x3)^2        9      95  M_T1
(x1+...+x4)^2      14    1337  M_T1       (x1+x2+x3)^3       19    2497  M_T1
(x1+...+x5)^2      20   38619  M_T1       (x1+x2+x3)^4       34  161421  M_T1

Other model spaces
M_F                          |M_F|     |M|  M_T used
x1 + x2 + ... + x18             18  262144  M_T2, M_T3
(x1+...+x4)^2 + x5 + ... + x10  20   85568  M_T4, M_T5
3. In all simulations, the base model M_B is the intercept-only model. The notation (x1 + ... + xp)^d is used to represent the full order-d polynomial response surface in p main effects. The model spaces, characterized by their corresponding full model M_F, are presented in Table 4-1, as well as the true models used in each case.

4. Enumerate the model spaces and calculate p(M | y, M) for all M ∈ M using the EPP, HUP, HIP, and HOP, the latter two each with the two sets of hyper-parameters.

5. Count the number of true positives and false positives in each M for the different priors.
The true positives (TP) are defined as those nodes α ∈ M_T such that p_α > 0.5. For the false positives (FP), three different cutoffs on p_α are considered for α ∉ M_T, elucidating the adjustment for multiplicity induced by the model priors; these cutoffs are 0.10, 0.20, and 0.50. The results from this exercise provide insight about the influence of the prior on the marginal posterior inclusion probabilities. In Table 4-1, the model spaces considered are described in terms of the number of models they contain and in terms of the number of nodes of M_F, the full model that defines the DAG for M.
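The bookkeeping in Steps 4 and 5 is straightforward once the posterior model probabilities are available. A minimal sketch (hypothetical helper names and a toy three-model posterior, not the dissertation's code) of the inclusion-probability and TP/FP computation:

```python
def inclusion_probs(post):
    # p_alpha = sum of p(M | y) over models containing alpha;
    # post maps frozensets of term labels to posterior probabilities
    terms = set().union(*post)
    return {a: sum(p for m, p in post.items() if a in m) for a in terms}

def count_tp_fp(p_alpha, true_model, cutoff):
    # TP: true terms whose inclusion probability clears the cutoff; FP: spurious terms that do
    tp = sum(1 for a, p in p_alpha.items() if a in true_model and p > cutoff)
    fp = sum(1 for a, p in p_alpha.items() if a not in true_model and p > cutoff)
    return tp, fp

# Toy enumerated posterior over a three-model space
posterior = {frozenset(): 0.05,
             frozenset({"x1"}): 0.60,
             frozenset({"x1", "x2"}): 0.35}
p_alpha = inclusion_probs(posterior)       # x1 -> 0.95, x2 -> 0.35
print(count_tp_fp(p_alpha, {"x1"}, 0.50))  # (1, 0): x2 is screened out at the 0.50 cutoff
print(count_tp_fp(p_alpha, {"x1"}, 0.10))  # (1, 1): at the 0.10 cutoff x2 counts as a false positive
```

Lower cutoffs admit more false positives, which is exactly the sensitivity the three cutoffs above are meant to expose.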
Growing number of main effects, fixed polynomial degree. This simulation investigates the posterior behavior as the number of covariates grows for a polynomial surface of degree two. The true model is assumed to be M_T1 and has 7 polynomial terms. The false positive and true positive rates are displayed in Table 4-2.
First, focus on the posterior when (a, b) = (1, 1). As p increases and the cutoff is low, the number of false positives increases for the EPP as well as for the hierarchical priors, although less dramatically for the latter. All of the priors identify all of the true positives. The false positive rate for the 0.50 cutoff is less than one for all four prior structures, with the HIP exhibiting the smallest false positive rate.

With the second choice of hyper-parameters, (1, ch), the improvement of the hierarchical priors over the EPP is dramatic, and the difference in performance is more pronounced as p increases. These also considerably outperform the priors using the default hyper-parameters a = b = 1 in terms of the false positives. Regarding the number of true positives, all priors discovered the 7 true predictors in M_T1 for most of the 100 random samples of data, with only minor differences observed between any of the priors considered. That being said, the means for the priors with a = 1, b = ch are slightly lower for the true positives. With a 0.50 cutoff, the hierarchical priors keep tight control on the number of false positives, but in doing so discard true positives with slightly higher frequency.
Growing polynomial degree, fixed main effects. For these examples the true model is once again M_T1. When the complexity is increased by making the order of M_F larger (Table 4-3), the inability of the EPP to adjust the inclusion posteriors for multiplicity becomes more pronounced: the EPP becomes less and less efficient at removing false positives when the FP cutoff is low. Among the priors with a = b = 1, as the order increases the HIP is the best at filtering out the false positives. Using the 0.5 false positive cutoff, some false positives are included both for the EPP and for all the priors with a = b = 1, indicating that the default hyper-parameters might not be the best option to control FP. The 7 covariates in the true model all obtain a high posterior inclusion probability both with the EPP and with the a = b = 1 priors.
Table 4-2. Mean number of false and true positives in 100 randomly generated datasets as the number of main effects increases from three to five predictors in a full quadratic surface, under the equal probability prior (EPP), the hierarchical independence prior (HIP), the hierarchical order prior (HOP), and the hierarchical uniform prior (HUP)

                                              a = 1, b = 1         a = 1, b = ch
Cutoff     |M_T|  M_F                  EPP   HIP   HUP   HOP    HIP   HUP   HOP
FP(>0.10)    7    (x1+x2+x3)^2        1.78  1.78  2.00  2.00   0.11  1.31  1.06
FP(>0.20)                             0.43  0.43  2.00  1.98   0.01  0.28  0.24
FP(>0.50)                             0.04  0.04  0.97  0.36   0.00  0.03  0.02
TP(>0.50)         (M_T1)              7.00  7.00  7.00  7.00   6.97  6.99  6.99
FP(>0.10)    7    (x1+...+x4)^2       3.62  1.94  2.33  2.45   0.10  0.63  1.07
FP(>0.20)                             1.60  0.47  2.17  2.15   0.01  0.17  0.24
FP(>0.50)                             0.25  0.06  0.35  0.36   0.00  0.02  0.02
TP(>0.50)         (M_T1)              7.00  7.00  7.00  7.00   6.97  6.99  6.99
FP(>0.10)    7    (x1+...+x5)^2       6.00  2.16  2.60  2.55   0.12  0.43  1.15
FP(>0.20)                             2.91  0.55  2.13  2.18   0.02  0.19  0.27
FP(>0.50)                             0.66  0.11  0.25  0.37   0.00  0.03  0.01
TP(>0.50)         (M_T1)              7.00  7.00  7.00  7.00   6.97  6.99  6.99
In contrast, any of the a = 1, b = ch priors dramatically improve upon their a = b = 1 counterparts, consistently assigning low inclusion probabilities to the majority of the false positive terms, even for low cutoffs. As the order of the polynomial surface increases, the difference in performance between these priors and either the EPP or their default versions becomes even more clear. At the 0.50 cutoff, the hierarchical priors with complexity penalization exhibit very low false positive rates. The true positive rate decreases slightly for these priors, but not to an alarming degree.
Other model spaces. This part of the analysis considers model spaces that do not correspond to full polynomial response surfaces (Table 4-4). The first example is a model space with main effects only. The second example includes a full quadratic surface of order 2, but in addition includes six terms for which only main effects are to be modeled. Two true models are used in combination with each model space to observe how the posterior probabilities vary under the influence of the different priors for "large" and "small" true models.
Table 4-3. Mean number of false and true positives in 100 randomly generated datasets as the maximum order of M_F increases from two to four in a full model with three main effects, under the equal probability prior (EPP), the hierarchical independence prior (HIP), the hierarchical order prior (HOP), and the hierarchical uniform prior (HUP)

                                              a = 1, b = 1         a = 1, b = ch
Cutoff     |M_T|  M_F                  EPP   HIP   HUP   HOP    HIP   HUP   HOP
FP(>0.10)    7    (x1+x2+x3)^2        1.78  1.78  2.00  2.00   0.11  1.31  1.06
FP(>0.20)                             0.43  0.43  2.00  1.98   0.01  0.28  0.24
FP(>0.50)                             0.04  0.04  0.97  0.36   0.00  0.03  0.02
TP(>0.50)         (M_T1)              7.00  7.00  7.00  7.00   6.97  6.99  6.99
FP(>0.10)    7    (x1+x2+x3)^3        7.37  5.21  6.06  2.91   0.55  1.05  1.39
FP(>0.20)                             2.91  1.55  3.61  2.08   0.17  0.34  0.31
FP(>0.50)                             0.40  0.21  0.50  0.26   0.03  0.03  0.04
TP(>0.50)         (M_T1)              7.00  7.00  7.00  7.00   6.97  6.98  7.00
FP(>0.10)    7    (x1+x2+x3)^4        8.22  4.00  4.69  2.61   0.52  0.55  1.32
FP(>0.20)                             4.21  1.13  1.76  2.03   0.12  0.15  0.31
FP(>0.50)                             0.56  0.17  0.22  0.27   0.03  0.03  0.04
TP(>0.50)         (M_T1)              7.00  7.00  7.00  7.00   6.97  6.97  6.99
By construction, in model spaces with main effects only, HIP(1,1) and EPP are equivalent, as are HOP(a,b) and HUP(a,b). This accounts for the similarities observed among the results for the first two cases presented in Table 4-4, where the model space corresponds to a full model with 18 main effects and the true models are a model with 16 and a model with 4 main effects, respectively. When the number of true coefficients is large, the HUP(1,1) and HOP(1,1) do poorly at controlling false positives, even at the 0.50 cutoff. In contrast, the HIP (and thus the EPP) with the 0.50 cutoff identifies the true positives and no false positives. This result, however, does not imply that the EPP controls false positives well. The true model contains 16 out of the 18 nodes in M_F, so there is little potential for false positives. The a = 1, b = ch priors show dramatically different behavior. The HIP controls false positives well but fails to identify the true coefficients at the 0.50 cutoff. In contrast, the HOP identifies all of the true positives and has a small false positive rate at the 0.50 cutoff.
If the number of true positives is small, most terms in the full model are truly zero. The EPP includes at least one false positive in approximately 50% of the randomly sampled datasets. On the other hand, the HUP(1,1) provides some control for multiplicity, obtaining on average a lower number of false positives than the EPP. Furthermore, the proposed hierarchical priors with a = 1, b = ch are substantially better than the EPP (and the choice of a = b = 1) at controlling false positives and capturing all true positives using the marginal posterior inclusion probabilities. The two examples suggest that the HOP(1, ch) is the best default choice for model selection when the number of terms available at a given degree is large.

The third and fourth examples in Table 4-4 consider the same irregular model space, with data generated from M_T4 with ten terms and from M_T5 with six terms. HIP(1,1) and EPP again behave quite similarly, incorporating a large number of false positives at the 0.10 cutoff; at the 0.50 cutoff some false positives are still included. The HUP(1,1) and HOP(1,1) behave similarly, with a slightly higher false positive rate at the 0.50 cutoff. In terms of the true positives, the EPP and a = b = 1 priors always include all of the predictors in M_T4 and M_T5. On the other hand, the ability of the a = 1, b = ch priors to control false positives is markedly better than that of the EPP and of the hierarchical priors with the choice a = b = 1. At the 0.50 cutoff these priors identify all of the true positives and true negatives. Once again, these examples point to the hierarchical priors with additional penalization for complexity as being good default priors on the model space.

4.4 Random Walks on the Model Space
When the model space M is too large to enumerate, a stochastic procedure can be used to find models with high posterior probability. In particular, an MCMC algorithm can be utilized to generate a dependent sample of models from the model posterior. The structure of the model space M both presents difficulties and provides clues on how to build algorithms to explore it. Different MCMC strategies can be adopted, two of which
Table 4-4. Mean number of false and true positives in 100 randomly generated datasets with unstructured or irregular model spaces, under the equal probability prior (EPP), the hierarchical independence prior (HIP), the hierarchical order prior (HOP), and the hierarchical uniform prior (HUP)

                                                 a = 1, b = 1            a = 1, b = ch
Cutoff     |M_T|  M_F                    EPP    HIP    HUP    HOP     HIP    HUP    HOP
FP(>0.10)   16    x1 + x2 + ... + x18   1.93   1.93   2.00   2.00    0.03   1.80   1.80
FP(>0.20)                               0.52   0.52   2.00   2.00    0.01   0.46   0.46
FP(>0.50)                               0.07   0.07   2.00   2.00    0.01   0.04   0.04
TP(>0.50)         (M_T2)               15.99  15.99  16.00  16.00    6.99  15.99  15.99
FP(>0.10)    4    x1 + x2 + ... + x18  13.95  13.95   9.15   9.15    0.26   1.31   1.31
FP(>0.20)                               5.45   5.45   3.03   3.03    0.05   0.45   0.45
FP(>0.50)                               0.84   0.84   0.45   0.45    0.02   0.06   0.06
TP(>0.50)         (M_T3)                4.00   4.00   4.00   4.00    4.00   4.00   4.00
FP(>0.10)   10    (x1+...+x4)^2         9.73   9.71  10.00   5.60    0.34   2.33   2.20
FP(>0.20)           + x5 + ... + x10    2.65   2.65   8.73   3.05    0.12   0.74   0.69
FP(>0.50)                               0.35   0.35   1.36   1.68    0.02   0.11   0.12
TP(>0.50)         (M_T4)               10.00  10.00  10.00   9.99    9.94   9.98   9.99
FP(>0.10)    6    (x1+...+x4)^2        13.52  13.52  11.06   9.94    0.44   1.63   1.96
FP(>0.20)           + x5 + ... + x10    4.22   4.21   3.60   5.01    0.15   0.48   0.68
FP(>0.50)                               0.53   0.53   0.57   0.75    0.01   0.08   0.11
TP(>0.50)         (M_T5)                6.00   6.00   6.00   6.00    5.99   5.99   5.99
are outlined in this section. Combining the different strategies allows the model selection algorithm to explore the model space thoroughly and relatively fast.

4.4.1 Simple Pruning and Growing
This first strategy relies on small, localized jumps around the model space, turning on or off a single node at each step. The idea behind this algorithm is to grow the model by activating one node in the children set, or to prune the model by removing one node in the extreme set. At a given step of the algorithm, assume that the current state of the chain is model M. Let p_G be the probability that the algorithm chooses the growth step. The proposed model M′ can either be M⁺ = M ∪ {α} for some α ∈ C(M), or M⁻ = M \ {α} for some α ∈ E(M).

An example transition kernel is defined by the mixture
g(M′ | M) = p_G · q_Grow(M′ | M) + (1 − p_G) · q_Prune(M′ | M)

          = [ I{M ≠ M_F} / (1 + I{M ≠ M_B}) ] · I{α ∈ C(M)} / |C(M)|
          + [ I{M ≠ M_B} / (1 + I{M ≠ M_F}) ] · I{α ∈ E(M)} / |E(M)|    (4–11)
where p_G has explicitly been defined as 0.5 when both C(M) and E(M) are non-empty, and as 0 (or 1) when C(M) = ∅ (or E(M) = ∅). After choosing pruning or growing, a single node is proposed for addition to or deletion from M uniformly at random.
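A sketch of this proposal, under the assumption that the caller supplies C(·) and E(·) for the model space at hand (the function and variable names here are our own, not the dissertation's code):

```python
import random

def grow_prune_proposal(model, children_fn, extreme_fn, rng=random):
    """One draw from g(M'|M) in Equation 4-11: grow with probability p_G,
    then pick the node uniformly at random from C(M) or E(M)."""
    c_set, e_set = children_fn(model), extreme_fn(model)
    if not c_set:
        p_grow = 0.0          # at the full model only pruning is possible
    elif not e_set:
        p_grow = 1.0          # at the base model only growing is possible
    else:
        p_grow = 0.5
    if rng.random() < p_grow:
        return model | {rng.choice(sorted(c_set))}
    return model - {rng.choice(sorted(e_set))}

# Toy main-effects-only space on {x1, x2}: any absent term is a child,
# and any present term is extreme (removable)
full = frozenset({"x1", "x2"})
children_fn = lambda m: full - m
extreme_fn = lambda m: set(m)

rng = random.Random(7)
state = frozenset()
for _ in range(10):
    state = grow_prune_proposal(state, children_fn, extreme_fn, rng)
    assert state <= full  # every proposal stays inside the model space
```

The Metropolis-Hastings acceptance step (not shown) compares the posterior of the proposal with that of the current model, with the 1/|C(M)| and 1/|E(M)| factors of Equation 4-11 entering the proposal ratio.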
For this simple algorithm, pruning is the reverse kernel of growing and vice-versa. From this construction, more elaborate algorithms can be specified. First, instead of choosing the node uniformly at random from the corresponding set, nodes can be selected using the relative posterior probability of adding or removing the node. Second, more than one node can be selected at any step, for instance by also sampling at random the number of nodes to add or remove given the size of the set. Third, the strategy could combine pruning and growing in a single step by sampling one node α ∈ C(M) ∪ E(M) and adding or removing it accordingly. Fourth, sets of nodes from C(M) ∪ E(M) that yield well-formulated models can be added or removed. This simple algorithm produces small moves around the model space by focusing node addition or removal only on the set C(M) ∪ E(M).

4.4.2 Degree-Based Pruning and Growing
In exploring the model space, it is possible to take advantage of the hierarchical structure defined between nodes of different order. One can update the vector of inclusion indicators by blocks, denoted M_j(M). Two flavors of this algorithm are proposed: one that separates the pruning and growing steps, and one where both are done simultaneously.

Assume that at a given step, say t, the algorithm is at M. If growing, the strategy proceeds successively by order class, going from j = J_min up to j = J_max, with J_min and J_max being the lowest and highest orders of nodes in M_F \ M_B, respectively. Define M_{t,(J_min − 1)} = M and set j = J_min. The growth kernel comprises the following steps, proceeding from j = J_min to j = J_max:
1) Propose a model M′ by selecting a set of nodes from C_j(M_{t,(j−1)}) through the kernel q_{Grow,j}(· | M_{t,(j−1)}).

2) Compute the Metropolis-Hastings correction for M′ versus M_{t,(j−1)}. If M′ is accepted, then set M_{t,(j)} = M′; otherwise set M_{t,(j)} = M_{t,(j−1)}.

3) If j < J_max, then set j = j + 1 and return to 1); otherwise proceed to 4).

4) Set M_t = M_{t,(J_max)}.
The pruning step is defined in a similar fashion; however, it starts at order j = J_max and proceeds down to j = J_min. Let E_j(M) = E(M) ∩ M_j(M_F) be the set of nodes of order j that can be removed from the model M to produce a WFM. Define M_{t,(J_max + 1)} = M and set j = J_max. The pruning kernel comprises the following steps:
1) Propose a model M′ by selecting a set of nodes from E_j(M_{t,(j+1)}) through the kernel q_{Prune,j}(· | M_{t,(j+1)}).

2) Compute the Metropolis-Hastings correction for M′ versus M_{t,(j+1)}. If M′ is accepted, then set M_{t,(j)} = M′; otherwise set M_{t,(j)} = M_{t,(j+1)}.

3) If j > J_min, then set j = j − 1 and return to Step 1); otherwise proceed to Step 4).

4) Set M_t = M_{t,(J_min)}.
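The growth sweep above can be sketched as follows. This is a schematic rendering with caller-supplied C_j(·), E_j(·), and log-posterior; for simplicity it proposes a single node per order rather than a set, so the acceptance ratio is a special case of the set-valued kernels described in the text:

```python
import math
import random

def growth_sweep(model, orders, children_j, extreme_j, log_post, rng=random):
    """One pass of the degree-based growth kernel: for j = J_min .. J_max,
    propose adding one node from C_j(M) and apply a Metropolis-Hastings correction."""
    for j in orders:
        c_j = sorted(children_j(model, j))
        if not c_j:
            continue  # nothing of order j can be added
        candidate = model | {rng.choice(c_j)}
        # forward proposal prob. 1/|C_j(M)|; the reverse move deletes from E_j(candidate)
        log_acc = (log_post(candidate) - log_post(model)
                   - math.log(len(extreme_j(candidate, j)))
                   + math.log(len(c_j)))
        if math.log(rng.random()) < log_acc:
            model = candidate
    return model

# Toy main-effects-only space: every term has order 1
full = frozenset({"x1", "x2", "x3"})
children_j = lambda m, j: full - m
extreme_j = lambda m, j: set(m)
log_post = lambda m: 5.0 * len(m)  # toy log posterior that rewards added terms

state = growth_sweep(frozenset(), [1], children_j, extreme_j, log_post, random.Random(3))
```

The pruning sweep is the mirror image: it runs j from J_max down to J_min and swaps the roles of C_j and E_j in the proposal and its reverse.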
It is clear that the growing and pruning steps are reverse kernels of each other. Pruning and growing can also be combined for each j: the forward kernel proceeds from j = J_min to j = J_max and proposes adding or removing sets of nodes from C_j(M) ∪ E_j(M); the reverse kernel simply reverses the direction of j, proceeding from j = J_max to j = J_min.

4.5 Simulation Study
To study the operating characteristics of the proposed priors, a simulation experiment was designed with three goals. First, the priors are characterized by how the posterior distributions are affected by the sample size and the signal-to-noise ratio (SNR). Second, given the SNR level, the influence of the allocation of the signal across the terms in the model is investigated. Third, performance is assessed when the true model has special points on the scale (McCullagh & Nelder, 1989), i.e., when the true model has coefficients equal to zero for some lower-order terms in the polynomial hierarchy.

With these goals in mind, sets of predictors and responses are generated under various experimental conditions. The model space is defined with M_B being the intercept-only model and M_F being the complete order-four polynomial surface in five main effects, which has 126 nodes. The entries of the matrix of main effects are generated as independent standard normal. The response vectors are drawn from the n-variate normal distribution as y ∼ N_n(Z_{M_T}(X) β_γ, I_n), where M_T is the true model and I_n is the n × n identity matrix.
The sample sizes considered are n ∈ {130, 260, 1040}, which ensures that Z_{M_F}(X) is of full rank. The cardinality of this model space is |M| > 1.2 × 10^22, which makes enumeration of all models unfeasible. Because the value of the 2k-th moment of the standard normal distribution increases with k = 1, 2, ..., higher-order terms by construction have a larger variance than their ancestors. As such, assuming equal values for all coefficients, higher-order terms necessarily contain more "signal" than the lower-order terms from which they inherit (e.g., x1^2 has more signal than x1). Once a higher-order term is selected, its entire ancestry is also included. Therefore, to prevent the simulation results from being overly optimistic (because of the larger signals from the higher-order terms), sphering is used to calculate meaningful values of the coefficients, ensuring that the signal is of the magnitude intended in any given direction. Given the results of the simulations from Section 4.3.3, only the HOP with a = 1, b = ch is considered, with the EPP included for comparison.
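One plausible reading of the sphering step is sketched below (our own interpretation; the dissertation's exact scaling may differ): transform the coefficient vector by the inverse Cholesky factor of the sample second-moment matrix of the true-model design, so that the realized signal strength no longer depends on the inflated variances of the higher-order columns.

```python
import numpy as np

rng = np.random.default_rng(0)
n, snr = 260, 1.0
x = rng.standard_normal((n, 2))

# design for a toy true model {x1, x2, x1^2}; the squared column has larger variance
Z = np.column_stack([x[:, 0], x[:, 1], x[:, 0] ** 2])
b = np.ones(Z.shape[1])                 # intended (equal) signal in each direction

S = (Z.T @ Z) / n                       # sample second-moment matrix of the design
L = np.linalg.cholesky(S)
beta = np.linalg.solve(L.T, snr * b / np.linalg.norm(b))

# after sphering, beta' S beta = snr^2 exactly, whatever the column variances are
y = Z @ beta + rng.standard_normal(n)
```

With this choice, the quadratic form βᵀSβ equals snr², so the magnitude of the signal injected into y is the one intended regardless of the variance inflation of x1².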
The total number of combinations of SNR, sample size, regression coefficient values, and nodes in M_T amounts to 108 different scenarios. Each scenario was run with 100 independently generated datasets, and the mean behavior across the samples was observed. The results presented in this section correspond to the median probability model (MPM) from each of the 108 simulation scenarios considered. Figure 4-7 shows the comparison between the two priors for the mean number of true positive (TP) and false positive (FP) terms. Although some of the scenarios consider true models that are not well-formulated, the smallest well-formulated model that stems from M_T is always the one shown in Figure 4-6.
Figure 4-6. M_T: DAG of the largest true model used in simulations
The results are summarized in Figure 4-7. Each point on the horizontal axis corresponds to the average for a given set of simulation conditions. Only labels for the SNR and sample size are included, for clarity, but results are also shown for the different values of the regression coefficients and the different true models considered. Additional details about the procedure and other results are included in the appendices.

4.5.1 SNR and Sample Size Effect
As expected, small sample sizes conditioned upon a small SNR impair the ability of the algorithm to detect true coefficients with both the EPP and the HOP(1, ch), with this effect being greater when using the latter prior. However, considering the mean number of TPs jointly with the number of FPs, it is clear that although the number of TPs is especially low with the HOP(1, ch), most of the few predictors that are discovered in fact belong to the true model. In comparison to the results with the EPP in terms of FPs, the HOP(1, ch) does better, and even more so when both the sample size and the SNR are
Figure 4-7. Average true positives (TP) and average false positives (FP) in all simulated scenarios for the median probability model with EPP and HOP(1, ch)
smallest. Finally, when either the SNR or the sample size is large, the performance in terms of TPs is similar between both priors, but the number of FPs is somewhat lower with the HOP.

4.5.2 Coefficient Magnitude
Three ways to allocate the amount of signal across predictors are considered. For the first choice, all coefficients contain the same amount of signal, regardless of their order. In the second, each order-one coefficient contains twice as much signal as any order-two coefficient, and four times as much as any order-three coefficient. Finally, each order-one coefficient contains half as much signal as any order-two coefficient, and a quarter of what any order-three coefficient has. These choices are denoted by β(1) = c(1_o1, 1_o2, 1_o3), β(2) = c(1_o1, 0.5_o2, 0.25_o3), and β(3) = c(0.25_o1, 0.5_o2, 1_o3), respectively. In Figure 4-7, the first four scenarios correspond to simulations with β(1), the next four use β(2), the next four correspond to β(3), and then the values are cycled in the same way. The results show that scenarios using either β(1) or β(3) behave similarly, contrasting with the negative impact of having the highest signal in the order-one terms through β(2). In Figure 4-7, the effect of using β(2) is evident, as it corresponds to the lowest values for the TPs, regardless of the sample size, the SNR, or the prior used. This is an intuitive result, since giving more signal to higher-order terms makes it easier to detect higher-order terms, and consequently, by strong heredity, the algorithm will also select the corresponding lower-order terms included in the true model.

4.5.3 Special Points on the Scale
Four true models were considered: (1) the model from Figure 4-6 (M_T1); (2) the model without the order-one terms (M_T2); (3) the model without order-two terms (M_T3); and (4) the model without x1^2 and x2x5 (M_T4). The last three are clearly not well-formulated. In Figure 4-7, the leftmost point on the horizontal axis corresponds to scenarios with M_T1, the next point is for scenarios with M_T2, followed by those with M_T3, then those with M_T4, then M_T1 again, and so on. In comparison to the EPP, the HOP(1, ch) tightly controls the inclusion of FPs by choosing smaller models, at the expense of also reducing the TP count, especially when there is more uncertainty about the true model (i.e., SNR = 0.25). For both prior structures, the results indicate that at low SNR levels the presence of special points has no apparent impact, as the selection behavior is similar among the four models in terms of both the TP and FP. An interesting observation is that the effect of having special points on the scale is vastly magnified whenever the coefficients that assign more weight to order-one terms (β(2)) are used.

4.6 Case Study: Ozone Data Analysis
This section uses the ozone data from Breiman & Friedman (1985) and follows the analysis performed by Liang et al. (2008), who investigated hyper-g priors. After removing observations with missing values, 330 observations remain, including daily measurements of maximum ozone concentration near Los Angeles and eight meteorological variables (Table 4-5). From the 330 observations, 165 were sampled at random without replacement and used to run the variable selection procedure; the remaining 165 were used for validation. The eight meteorological variables, their interactions, and their squared terms are used as predictors, resulting in a full model with 44 predictors. The model space assumes that the base model M_B is the intercept-only model and that M_F is the quadratic surface in the eight meteorological variables. The model space contains approximately 7.1 billion models, and computation of all model posterior probabilities is not feasible.
Table 4-5. Variables used in the analyses of the ozone contamination dataset

Name   Description
ozone  Daily max 1-hr-average ozone (ppm) at Upland, CA
vh     500-millibar pressure height (m) at Vandenberg AFB
wind   Wind speed (mph) at LAX
hum    Humidity (%) at LAX
temp   Temperature (F) measured at Sandburg, CA
ibh    Inversion base height (ft) at LAX
dpg    Pressure gradient (mm Hg) from LAX to Daggett, CA
vis    Visibility (miles) measured at LAX
ibt    Inversion base temperature (F) at LAX
The HOP, HUP, and HIP with a = 1 and b = ch, as well as the EPP, are considered for comparison purposes. To obtain the Bayes factors in Equation 3-3, four different mixtures of g-priors are utilized: intrinsic priors (IP) (which yield the expression in Equation 3-2), hyper-g (HG) priors (Liang et al., 2008) with hyper-parameters α = 2, β = 1 and α = β = 1, and Zellner-Siow (ZS) priors (Zellner & Siow, 1980). The results were extracted for the median probability (MPM) models. Additionally, the model is estimated using the R package hierNet (Bien et al., 2013) to compare model selection results to those obtained using the hierarchical lasso (Bien et al., 2013), restricted to well-formulated models by imposing the strong heredity constraint. The procedures were assessed on the basis of their predictive accuracy on the validation dataset.
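The predictive-accuracy comparison amounts to an out-of-sample RMSE on the held-out half. A generic sketch (hypothetical variable names; an ordinary-least-squares refit of a selected model's design, not the dissertation's code):

```python
import numpy as np

def validation_rmse(X_train, y_train, X_valid, y_valid):
    # refit the selected model's design by least squares on the training half,
    # then score root-mean-squared prediction error on the validation half
    coef, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
    resid = y_valid - X_valid @ coef
    return float(np.sqrt(np.mean(resid ** 2)))

# toy check: when the truth lies inside the selected design,
# the validation RMSE approaches the noise standard deviation
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(330), rng.standard_normal((330, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + 0.5 * rng.standard_normal(330)
rmse = validation_rmse(X[:165], y[:165], X[165:], y[165:])
```

In the case study each candidate MPM supplies its own design matrix (the selected columns of the 44-term quadratic surface), and the models in Table 4-6 are ranked by this validation RMSE.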
Among all models, the one that yields the smallest RMSE is the median probability model obtained using the HOP and the EPP with the ZS prior, and also using the HOP with both HG priors (Table 4-6). The HOP model with the intrinsic prior has all the terms contained in the lowest-RMSE model with the exception of dpg^2, which has a relatively high marginal inclusion probability of 0.46. This disparity between the IP and the other mixtures of g-priors is explained by the fact that the IP induces less posterior shrinkage than the ZS and HG priors. The MPMs obtained through the HUP and HIP are nested in the best model, suggesting that these model space priors penalize complexity too much and result in false negatives. Consideration of these MPMs suggests that the HOP is best at producing true positives while controlling for false positives.
Finally, the model obtained from the hierarchical lasso (hierNet) is the largest model and produces the second-largest RMSE. All of the terms contained in any of the other models, except for vh, are nested within the hierarchical lasso model, and most of the terms that are exclusive to this model receive extremely low marginal inclusion probabilities under any of the model priors and parameter priors considered under Bayesian model selection.
Table 4-6. Median probability models (MPM) from different combinations of parameter and model priors vs. the model selected using the hierarchical lasso

BF prior  Model prior  Model                                                  R^2     RMSE
IP        EPP          hum, dpg, ibt, hum^2, hum*dpg, hum*ibt, dpg^2, ibt^2   0.8054  4.2739
IP        HIP          hum, ibt, hum^2, hum*ibt, ibt^2                        0.7740  4.3396
IP        HOP          hum, dpg, ibt, hum^2, hum*ibt, ibt^2                   0.7848  4.3175
IP        HUP          hum, dpg, ibt, hum*ibt, ibt^2                          0.7767  4.3508
ZS        EPP          hum, dpg, ibt, hum^2, hum*ibt, dpg^2, ibt^2            0.7896  4.2518
ZS        HIP          hum, ibt, hum*ibt, ibt^2                               0.7525  4.3505
ZS        HOP          hum, dpg, ibt, hum^2, hum*ibt, dpg^2, ibt^2            0.7896  4.2518
ZS        HUP          hum, dpg, ibt, hum*ibt, ibt^2                          0.7767  4.3508
HG(1,1)   EPP          vh, hum, dpg, ibt, hum^2, hum*ibt, dpg^2               0.7701  4.3049
HG(1,1)   HIP          hum, ibt, hum*ibt, ibt^2                               0.7525  4.3505
HG(1,1)   HOP          hum, dpg, ibt, hum^2, hum*ibt, dpg^2, ibt^2            0.7896  4.2518
HG(1,1)   HUP          hum, dpg, ibt, hum*ibt, ibt^2                          0.7767  4.3508
HG(2,1)   EPP          hum, dpg, ibt, hum^2, hum*ibt, dpg^2                   0.7701  4.3037
HG(2,1)   HIP          hum, dpg, ibt, hum*ibt, ibt^2                          0.7767  4.3508
HG(2,1)   HOP          hum, dpg, ibt, hum^2, hum*ibt, dpg^2, ibt^2            0.7896  4.2518
HG(2,1)   HUP          hum, dpg, ibt, hum^2, hum*ibt                          0.7526  4.4036
HierNet                hum, temp, ibh, dpg, ibt, vis, hum^2, hum*ibt,         0.7651  4.3680
                       temp^2, temp*ibt, dpg^2
4.7 Discussion

Scott & Berger (2010) noted that the Ockham's-razor effect found automatically in Bayesian
variable selection through the Bayes factor does not correct for multiple testing. The Bayes factor penalizes the complexity of the alternative model according to the number of parameters in excess of those of the null model; therefore, the Bayes factor only controls complexity in a pairwise fashion. If the model selection procedure uses equal prior probabilities for all M ∈ M, then these comparisons ignore the multiplicity of the testing problem. This is where the role of the prior on the model space becomes important: the multiplicity penalty is "hidden away" in the model prior probabilities π(M|M).
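To see how π(M|M) can hide a multiplicity penalty, consider a small numeric sketch (our illustration, not part of the original analysis): a uniform prior over all 2^p models applies no correction, whereas a prior that is uniform over model sizes, and uniform over models within each size, makes the prior odds of any fixed one-predictor model against the null model shrink as the number of candidate predictors p grows.

```python
from math import comb

def uniform_prior(k, p):
    # Uniform over all 2^p models: no multiplicity penalty at all
    return 1.0 / 2 ** p

def beta_binomial_prior(k, p):
    # Uniform on model size, then uniform over the C(p, k) models of
    # that size: pi(M) = 1 / ((p + 1) * C(p, |M|))
    return 1.0 / ((p + 1) * comb(p, k))

# Prior odds of a fixed size-1 model against the null model as p grows
for p in (5, 20, 100):
    odds = beta_binomial_prior(1, p) / beta_binomial_prior(0, p)
    print(p, round(odds, 4))  # shrinks like 1/p: automatic multiplicity penalty
```

Under the uniform prior the corresponding odds stay at 1 no matter how many candidate predictors are screened, which is exactly the multiplicity problem the model space priors address.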
In addition to the multiplicity of the testing problem, disregarding the hierarchical polynomial structure of the predictors in model selection procedures can lead to different results depending on how the predictors are set up (e.g., in what units these predictors are expressed).
In this chapter we investigated a solution to these two issues. We defined prior structures for well-formulated models and developed random walk algorithms to traverse this type of model space. The key to understanding prior distributions on the space of WFMs is the hierarchical nature of the model space itself. The prior distributions described take advantage of that hierarchy in two ways. First, conditional independence and immediate inheritance are used to develop the HOP, HIP, and HUP structures discussed in Section 4.3. Second, the conditional nature of the priors allows for the direct incorporation of complexity penalizations. Of the priors proposed, the HOP with the hyperparameter choice (1, ch) provides the best control of false positives while maintaining a reasonable true positive rate; thus, this prior is recommended as the default prior on the space of WFMs.
In the near future, the software developed to carry out the Metropolis-Hastings random walk on the space of WFMs will be integrated into the R package varSelectIP. These new functions implement various local priors for the regression coefficients, including the intrinsic prior, the Zellner-Siow prior, and hyper-g priors. In addition, the software supports the computation of credible sets for each regression coefficient, conditioned on the selected model as well as under model averaging.
CHAPTER 5
CONCLUSIONS
Ecologists are now embracing the use of Bayesian methods to investigate the interactions that dictate the distribution and abundance of organisms. These tools are both powerful and flexible: they allow integrating, under a single methodology, empirical observations and theoretical process models, and can seamlessly account for several sources of uncertainty and dependence. The estimation and testing methods proposed throughout this document will contribute to the understanding of Bayesian methods used in ecology and, hopefully, will shed light on the differences between Bayesian estimation and testing tools.
All of our contributions exploit the potential of the latent variable formulation. This approach greatly simplifies the analysis of complex models: it redirects the bulk of the inferential burden away from the original response variables and places it on the easy-to-work-with latent scale, for which several time-tested approaches are available. Our methods are distinctly classified into estimation and testing tools.
For estimation, we proposed a Bayesian specification of the single-season occupancy model for which a Gibbs sampler is available using both logit and probit link functions. This setup allows detection and occupancy probabilities to depend on linear combinations of predictors. We then developed a dynamic version of this approach, incorporating the notion that occupancy at a previously occupied site depends both on the survival of current settlers and on habitat suitability. Additionally, because these dynamics also vary in space, we suggested a strategy to add spatial dependence among neighboring sites.
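To illustrate the latent variable formulation behind the probit-link single-season sampler, the following minimal sketch (our simplification: intercept-only occupancy and detection, flat priors; all variable names are ours) alternates imputation of the occupancy indicators with Albert and Chib (1993) style truncated normal draws and conjugate normal updates for the two probit intercepts.

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(1)
Phi = lambda x: 0.5 * (1.0 + erf(x / sqrt(2.0)))  # standard normal CDF

def rtruncnorm(mu, positive):
    # Rejection draw from N(mu, 1) truncated to (0, inf) or (-inf, 0]
    while True:
        d = rng.normal(mu, 1.0)
        if (d > 0) == bool(positive):
            return d

# Simulated single-season data: N sites, J surveys, intercept-only model
N, J = 200, 5
psi_true, p_true = 0.6, 0.4
z_true = rng.random(N) < psi_true
y = (rng.random((N, J)) < p_true) & z_true[:, None]  # detection histories

alpha, lam = 0.0, 0.0  # probit-scale occupancy and detection intercepts
for _ in range(300):
    # 1. Update z_i; sites with at least one detection are occupied for sure
    psi, p = Phi(alpha), Phi(lam)
    cond = psi * (1 - p) ** J / (psi * (1 - p) ** J + 1 - psi)
    z = y.any(axis=1) | (rng.random(N) < cond)
    # 2. Albert-Chib step for occupancy: truncated normals, then a
    #    conjugate normal draw for alpha under a flat prior
    u = np.array([rtruncnorm(alpha, zi) for zi in z])
    alpha = rng.normal(u.mean(), 1.0 / sqrt(N))
    # 3. Same two-step update for the detection intercept, using only
    #    surveys at currently occupied sites
    w = np.array([rtruncnorm(lam, yij) for yij in y[z].ravel()])
    lam = rng.normal(w.mean(), 1.0 / sqrt(w.size))

print(round(Phi(alpha), 2), round(Phi(lam), 2))  # near psi_true and p_true
```

The dynamic and spatial versions add time-indexed coefficients and spatially structured random effects on top of this same augmentation scheme.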
Ecological inquiry usually involves competing explanations, and uncertainty surrounds the choice among them. Hence, a model, or a set of probable models, should be selected from all the viable alternatives. To address this testing problem, we proposed an objective and fully automatic Bayesian methodology for the single-season site-occupancy model. Our approach relies on the intrinsic prior, which prevents introducing (commonly unavailable) subjective information into the model. In simulation experiments, we observed that the method accurately singles out the predictors present in the true model using the marginal posterior inclusion probabilities of the predictors: for predictors in the true model, these probabilities were comparatively larger than those for predictors not present in the true model. The simulations also indicated that the method provides better discrimination for predictors in the detection component of the model.
In our simulations and in the analysis of the Blue Hawker data, we observed that the effect of using the multiplicity correction prior was substantial. This occurs because the Bayes factor only penalizes the complexity of the alternative model according to its number of parameters in excess of those of the null model. As the number of predictors grows, the number of models in the model space also grows, increasing the chances of false positive decisions on the inclusion of predictors. This is where the role of the prior on the model space becomes important: the multiplicity penalty is "hidden away" in the model prior probabilities π(M|M). In addition to the multiplicity of the testing problem, disregarding the hierarchical polynomial structure of the predictors in model selection procedures can lead to different results depending on how the predictors are coded (e.g., in what units these predictors are expressed).
To confront this situation, we proposed three prior structures for well-formulated models that take advantage of the hierarchical structure of the predictors. Of the priors proposed, we recommend the HOP with the hyperparameter choice (1, ch), which provides the best control of false positives while maintaining a reasonable true positive rate.

Overall, considering the flexibility of the latent approach, several other extensions of these methods follow. Currently, we envision three future developments: (1) occupancy models that incorporate various sources of information; (2) multi-species models that make use of spatial and interspecific dependence; and (3) methods to conduct model selection for the dynamic and spatially explicit versions of the model.
APPENDIX A
FULL CONDITIONAL DENSITIES: DYMOSS

In this appendix we present the full conditional probability density functions for all the parameters involved in the DYMOSS model, using probit as well as logit links.
Sampler Z

The full conditionals corresponding to the presence indicators have the same form regardless of the link used. These are derived separately for the cases $t = 1$, $1 < t < T$, and $t = T$, since their corresponding probabilities take on slightly different forms.

Let $\phi(\nu \mid \mu, \sigma^2)$ represent the density of a normal random variable $\nu$ with mean $\mu$ and variance $\sigma^2$, and recall that $\psi_{i1} = F(x_{(o)i}'\alpha)$ and $p_{ijt} = F(q_{ijt}'\lambda_t)$, where $F(\cdot)$ is the inverse link function. The full conditional for $z_{it}$ is given by:

1. For $t = 1$,
$$\pi(z_{i1} \mid v_{i1}, \alpha, \lambda_1, \beta_1^c, \delta_1^s) = (\psi_{i1}^*)^{z_{i1}}(1 - \psi_{i1}^*)^{1 - z_{i1}} = \mathrm{Bernoulli}(\psi_{i1}^*), \tag{A-1}$$
where
$$\psi_{i1}^* = \frac{\psi_{i1}\,\phi(v_{i1} \mid x_{i1}'\beta_1^c + \delta_1^s, 1)\prod_{j=1}^{J_{i1}}(1 - p_{ij1})}{\psi_{i1}\,\phi(v_{i1} \mid x_{i1}'\beta_1^c + \delta_1^s, 1)\prod_{j=1}^{J_{i1}}(1 - p_{ij1}) + (1 - \psi_{i1})\,\phi(v_{i1} \mid x_{i1}'\beta_1^c, 1)\prod_{j=1}^{J_{i1}} I_{\{y_{ij1} = 0\}}}.$$

2. For $1 < t < T$,
$$\pi(z_{it} \mid z_{i(t-1)}, z_{i(t+1)}, \lambda_t, \beta_{t-1}^c, \delta_{t-1}^s) = (\psi_{it}^*)^{z_{it}}(1 - \psi_{it}^*)^{1 - z_{it}} = \mathrm{Bernoulli}(\psi_{it}^*), \tag{A-2}$$
where
$$\psi_{it}^* = \frac{\kappa_{it}\prod_{j=1}^{J_{it}}(1 - p_{ijt})}{\kappa_{it}\prod_{j=1}^{J_{it}}(1 - p_{ijt}) + \nabla_{it}\prod_{j=1}^{J_{it}} I_{\{y_{ijt} = 0\}}},$$
with
(a) $\kappa_{it} = F(x_{i(t-1)}'\beta_{t-1}^c + z_{i(t-1)}\delta_{t-1}^s)\,\phi(v_{it} \mid x_{it}'\beta_t^c + \delta_t^s, 1)$, and
(b) $\nabla_{it} = \left(1 - F(x_{i(t-1)}'\beta_{t-1}^c + z_{i(t-1)}\delta_{t-1}^s)\right)\phi(v_{it} \mid x_{it}'\beta_t^c, 1)$.

3. For $t = T$,
$$\pi(z_{iT} \mid z_{i(T-1)}, \lambda_T, \beta_{T-1}^c, \delta_{T-1}^s) = (\psi_{iT}^\star)^{z_{iT}}(1 - \psi_{iT}^\star)^{1 - z_{iT}} = \mathrm{Bernoulli}(\psi_{iT}^\star), \tag{A-3}$$
where
$$\psi_{iT}^\star = \frac{\kappa_{iT}^\star\prod_{j=1}^{J_{iT}}(1 - p_{ijT})}{\kappa_{iT}^\star\prod_{j=1}^{J_{iT}}(1 - p_{ijT}) + \nabla_{iT}^\star\prod_{j=1}^{J_{iT}} I_{\{y_{ijT} = 0\}}},$$
with
(a) $\kappa_{iT}^\star = F(x_{i(T-1)}'\beta_{T-1}^c + z_{i(T-1)}\delta_{T-1}^s)$, and
(b) $\nabla_{iT}^\star = 1 - F(x_{i(T-1)}'\beta_{T-1}^c + z_{i(T-1)}\delta_{T-1}^s)$.

Sampler $u_i$

$$\pi(u_i \mid z_{i1}, \alpha) = \mathrm{tr\,N}(x_{(o)i}'\alpha,\, 1,\, \mathrm{trunc}(z_{i1})), \quad \mathrm{trunc}(z_{i1}) = \begin{cases}(-\infty, 0] & z_{i1} = 0 \\ (0, \infty) & z_{i1} = 1\end{cases} \tag{A-4}$$
where $\mathrm{tr\,N}(\mu, \sigma^2, A)$ denotes the pdf of a truncated normal random variable with mean $\mu$, variance $\sigma^2$, and truncation region $A$.
Sampler $\alpha$

$$\pi(\alpha \mid u) \propto [\alpha]\prod_{i=1}^{N}\phi(u_i \mid x_{(o)i}'\alpha, 1) \tag{A-5}$$
If $[\alpha] \propto 1$, then
$$\alpha \mid u \sim \mathrm{N}(m^{(\alpha)}, \Sigma^{(\alpha)}),$$
with $m^{(\alpha)} = \Sigma^{(\alpha)} X_{(o)}'u$ and $\Sigma^{(\alpha)} = (X_{(o)}'X_{(o)})^{-1}$.
Sampler $v_{it}$

For $t > 1$,
$$\pi(v_{i(t-1)} \mid z_{i(t-1)}, z_{it}, \beta_{t-1}^c, \delta_{t-1}^s) = \mathrm{tr\,N}\left(\mu_{i(t-1)}^{(v)},\, 1,\, \mathrm{trunc}(z_{it})\right), \tag{A-6}$$
where $\mu_{i(t-1)}^{(v)} = x_{i(t-1)}'\beta_{t-1}^c + z_{i(t-1)}\delta_{t-1}^s$ and $\mathrm{trunc}(z_{it})$ defines the corresponding truncation region given by $z_{it}$.
Sampler $(\beta_{t-1}^c, \delta_{t-1}^s)$

For $t > 1$,
$$\pi(\beta_{t-1}^c, \delta_{t-1}^s \mid v_{t-1}, z_{t-1}) \propto [\beta_{t-1}^c, \delta_{t-1}^s]\prod_{i=1}^{N}\phi(v_{i(t-1)} \mid x_{i(t-1)}'\beta_{t-1}^c + z_{i(t-1)}\delta_{t-1}^s, 1) \tag{A-7}$$
If $[\beta_{t-1}^c, \delta_{t-1}^s] \propto 1$, then
$$\beta_{t-1}^c, \delta_{t-1}^s \mid v_{t-1}, z_{t-1} \sim \mathrm{N}(m(\beta_{t-1}^c, \delta_{t-1}^s), \Sigma_{t-1}),$$
with $m(\beta_{t-1}^c, \delta_{t-1}^s) = \Sigma_{t-1}\tilde{X}_{t-1}'v_{t-1}$ and $\Sigma_{t-1} = (\tilde{X}_{t-1}'\tilde{X}_{t-1})^{-1}$, where $\tilde{X}_{t-1} = (X_{t-1},\, z_{t-1})$.
Sampler $w_{ijt}$

For $t \geq 1$ and $z_{it} = 1$,
$$\pi(w_{ijt} \mid z_{it} = 1, y_{ijt}, \lambda_t) = \mathrm{tr\,N}\left(q_{ijt}'\lambda_t,\, 1,\, \mathrm{trunc}(y_{ijt})\right) \tag{A-8}$$
Sampler $\lambda_t$

For $t = 1, 2, \ldots, T$,
$$\pi(\lambda_t \mid z_t, w_t) \propto [\lambda_t]\prod_{i:\, z_{it} = 1}\,\prod_{j=1}^{J_{it}}\phi(w_{ijt} \mid q_{ijt}'\lambda_t, 1) \tag{A-9}$$
If $[\lambda_t] \propto 1$, then
$$\lambda_t \mid w_t, z_t \sim \mathrm{N}(m(\lambda_t), \Sigma_{\lambda_t}),$$
with $m(\lambda_t) = \Sigma_{\lambda_t} Q_t'w_t$ and $\Sigma_{\lambda_t} = (Q_t'Q_t)^{-1}$, where $Q_t$ and $w_t$, respectively, are the design matrix and the vector of latent variables for surveys of sites such that $z_{it} = 1$.
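As a concrete reading of the interior-time conditional (A-2), the sketch below (a simplification of ours: scalar linear predictors are passed in directly, and the argument names are hypothetical) computes ψ*_it for the probit link.

```python
import numpy as np
from math import erf, sqrt, pi, exp

def Phi(x):
    # Standard normal CDF, i.e., the probit inverse link F
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def npdf(v, mu):
    # phi(v | mu, 1): normal density with unit variance
    return exp(-0.5 * (v - mu) ** 2) / sqrt(2.0 * pi)

def psi_star_interior(y_it, p_ijt, v_it, xb_prev, z_prev, delta_s, xb_cur, delta_cur):
    """psi*_it of (A-2) for 1 < t < T under a probit link.

    xb_prev = x'_{i,t-1} beta^c_{t-1} and xb_cur = x'_{it} beta^c_t are
    scalars; y_it and p_ijt hold the J_it detection records and probabilities.
    """
    kappa = Phi(xb_prev + z_prev * delta_s) * npdf(v_it, xb_cur + delta_cur)
    nabla = (1.0 - Phi(xb_prev + z_prev * delta_s)) * npdf(v_it, xb_cur)
    no_detection = float(np.all(np.asarray(y_it) == 0))  # prod_j I(y_ijt = 0)
    num = kappa * np.prod(1.0 - np.asarray(p_ijt))
    return num / (num + nabla * no_detection)

# Any detection in season t forces z_it = 1; with none, psi* lies in (0, 1)
p = [0.3, 0.3, 0.3]
print(psi_star_interior([0, 1, 0], p, 0.2, 0.1, 1, 0.5, 0.0, 0.4))  # 1.0
```

The t = 1 and t = T cases differ only in which of the κ and ∇ factors carry the extra normal density term.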
APPENDIX B
RANDOM WALK ALGORITHMS
Global Jump. From the current state $M$, the global jump is performed by drawing a model $M'$ at random from the model space. This is achieved by beginning at the base model and increasing the order from $J_M^{\min}$ to $J_M^{\max}$, the minimum and maximum orders of nodes in $M_F \setminus M_B$; at each order, a set of nodes is selected at random from the prior, conditioned on the nodes already in the model. The MH correction is
$$\alpha = \min\left\{1,\ \frac{m(y \mid M', \mathcal{M})}{m(y \mid M, \mathcal{M})}\right\}.$$
Local Jump. From the current state $M$, the local jump is performed by drawing a model from the set $L(M) = \{M_\alpha : \alpha \in E(M) \cup C(M)\}$, where $M_\alpha = M \setminus \{\alpha\}$ for $\alpha \in E(M)$ and $M_\alpha = M \cup \{\alpha\}$ for $\alpha \in C(M)$. The proposal probabilities for the model are computed as a mixture of $p(M' \mid y, \mathcal{M}, M' \in L(M))$ and the discrete uniform distribution over $L(M)$. The proposal kernel is
$$q(M' \mid y, \mathcal{M}, M' \in L(M)) = \frac{1}{2}\left(p(M' \mid y, \mathcal{M}, M' \in L(M)) + \frac{1}{|L(M)|}\right).$$
This choice promotes moving to better models while maintaining a non-negligible probability of moving to any of the possible models. The MH correction is
$$\alpha = \min\left\{1,\ \frac{m(y \mid M', \mathcal{M})}{m(y \mid M, \mathcal{M})}\,\frac{q(M \mid y, \mathcal{M}, M \in L(M'))}{q(M' \mid y, \mathcal{M}, M' \in L(M))}\right\}.$$
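The local jump can be sketched as follows (a toy implementation of ours: the posterior model probabilities over L(M), which in practice come from the marginal likelihoods, are stubbed with a dictionary, and E(M) and C(M) are supplied as functions returning the extreme and children node sets):

```python
import random

def local_neighborhood(M, extreme, children):
    # L(M): remove an extreme node, or add a child node (either move keeps
    # the model well formulated under strong heredity)
    return [frozenset(M - {a}) for a in extreme(M)] + \
           [frozenset(M | {a}) for a in children(M)]

def local_proposal(M, post, extreme, children, rng=random):
    """Draw M' from the mixture kernel q = 1/2 posterior + 1/2 uniform."""
    L = local_neighborhood(M, extreme, children)
    tot = sum(post[Mp] for Mp in L)  # renormalize p(M'|y) over L(M)
    probs = [0.5 * post[Mp] / tot + 0.5 / len(L) for Mp in L]
    return rng.choices(L, weights=probs)[0], dict(zip(L, probs))

# Toy model space on nodes {x, x2} with the heredity constraint x2 => x
post = {frozenset(): 0.1, frozenset({"x"}): 0.6, frozenset({"x", "x2"}): 0.3}
extreme = lambda M: {a for a in M if a == "x2" or "x2" not in M}
children = lambda M: ({"x2"} if "x" in M else {"x"}) - M
Mp, q = local_proposal(frozenset({"x"}), post, extreme, children)
print(sorted(q.values()))  # [0.375, 0.625]
```

The uniform component guarantees every neighbor retains proposal probability at least 1/(2|L(M)|), so the chain cannot get trapped by a sharply peaked posterior over the neighborhood.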
Intermediate Jump. The intermediate jump is performed by increasing or decreasing the order of the nodes under consideration, performing local proposals based on order. For a model $M'$, define $L_j(M') = \{M'\} \cup \{M'_\alpha : \alpha \in (E(M') \cup C(M')) \cap O_j(M_F)\}$, where $O_j(M_F)$ denotes the set of order-$j$ nodes of $M_F$. From a state $M$, the kernel chooses at random whether to increase or decrease the order. If $M = M_F$, then decreasing the order is chosen with probability 1, and if $M = M_B$, then increasing the order is chosen with probability 1; in all other cases, increasing and decreasing the order each have probability 1/2. The proposal kernels are given by:
Increasing order proposal kernel:

1. Set $j = J_M^{\min} - 1$ and $M'_j = M$.

2. Draw $M'_{j+1}$ from $q_{j+1}^{\mathrm{inc}}(M' \mid y, \mathcal{M}, M' \in L_{j+1}(M'_j))$, where
$$q_{j+1}^{\mathrm{inc}}(M' \mid y, \mathcal{M}, M' \in L_{j+1}(M'_j)) = \frac{1}{2}\left(p(M' \mid y, \mathcal{M}, M' \in L_{j+1}(M'_j)) + \frac{1}{|L_{j+1}(M'_j)|}\right).$$

3. Set $j = j + 1$.

4. If $j < J_M^{\max}$, then return to 2; otherwise proceed to 5.

5. Set $M' = M'_{J_M^{\max}}$ and compute the proposal probability
$$q_{\mathrm{inc}}(M' \mid y, \mathcal{M}) = \prod_{j = J_M^{\min} - 1}^{J_M^{\max} - 1} q_{j+1}^{\mathrm{inc}}(M'_{j+1} \mid y, \mathcal{M}, M'_{j+1} \in L_{j+1}(M'_j)). \tag{B-1}$$
Decreasing order proposal kernel:

1. Set $j = J_M^{\max} + 1$ and $M'_j = M$.

2. Draw $M'_{j-1}$ from $q_{j-1}^{\mathrm{dec}}(M' \mid y, \mathcal{M}, M' \in L_{j-1}(M'_j))$, where
$$q_{j-1}^{\mathrm{dec}}(M' \mid y, \mathcal{M}, M' \in L_{j-1}(M'_j)) = \frac{1}{2}\left(p(M' \mid y, \mathcal{M}, M' \in L_{j-1}(M'_j)) + \frac{1}{|L_{j-1}(M'_j)|}\right).$$

3. Set $j = j - 1$.

4. If $j > J_M^{\min}$, then return to 2; otherwise proceed to 5.

5. Set $M' = M'_{J_M^{\min}}$ and compute the proposal probability
$$q_{\mathrm{dec}}(M' \mid y, \mathcal{M}) = \prod_{j = J_M^{\max} + 1}^{J_M^{\min} + 1} q_{j-1}^{\mathrm{dec}}(M'_{j-1} \mid y, \mathcal{M}, M'_{j-1} \in L_{j-1}(M'_j)). \tag{B-2}$$
If increasing order is chosen, then the MH correction is given by
$$\alpha = \min\left\{1,\ \frac{1 + I(M' = M_F)}{1 + I(M = M_B)}\,\frac{q_{\mathrm{dec}}(M \mid y, \mathcal{M})}{q_{\mathrm{inc}}(M' \mid y, \mathcal{M})}\,\frac{p(M' \mid y, \mathcal{M})}{p(M \mid y, \mathcal{M})}\right\}, \tag{B-3}$$
and similarly if decreasing order is chosen.
Other Local and Intermediate Kernels. The local and intermediate kernels described here perform a kind of stochastic forwards-backwards selection. Each kernel q can be relaxed to allow more than one node to be turned on or off at each step, which could provide larger jumps for each of these kernels. The tradeoff is that the number of proposed models for such jumps could be very large, precluding the use of posterior information in the construction of the proposal kernel.
APPENDIX C
WFM SIMULATION DETAILS
Briefly, the idea is to let $Z_{M_T}(X)\beta_{M_T} = (QR)\beta_{M_T} = Q\eta_{M_T}$ (i.e., $\beta_{M_T} = R^{-1}\eta_{M_T}$), using the QR decomposition. As such, setting all values in $\eta_{M_T}$ proportional to one corresponds to distributing the signal in the model uniformly across all predictors, regardless of their order.
The (unconditional) variance of a single observation $y_i$ is $\mathrm{var}(y_i) = \mathrm{var}(E[y_i \mid z_i]) + E[\mathrm{var}(y_i \mid z_i)]$, where $z_i$ is the $i$-th row of the design matrix $Z_{M_T}$. Hence we take the signal-to-noise ratio for each observation to be
$$\mathrm{SNR}(\eta) = \frac{\eta_{M_T}'\,R^{-T}\Sigma_z R^{-1}\,\eta_{M_T}}{\sigma^2},$$
where $\Sigma_z = \mathrm{var}(z_i)$. We determine how the signal is distributed across predictors only up to a proportionality constant, so as to be able to simultaneously control the signal-to-noise ratio.
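The construction above can be sketched numerically; a hypothetical three-predictor Gaussian design stands in for $Z_{M_T}(X)$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical design matrix for the true model (n x 3)
n = 1000
Z = rng.normal(size=(n, 3))

# Uniform signal on the orthogonal scale: eta proportional to a vector of
# ones, mapped back to the coefficient scale by beta = R^{-1} eta
Q, R = np.linalg.qr(Z)
beta = np.linalg.solve(R, np.ones(3))

# Rescale beta so that SNR = beta' Sigma_z beta / sigma^2 (equivalently
# eta' R^{-T} Sigma_z R^{-1} eta / sigma^2) hits the target value k
sigma2, k = 1.0, 4.0
Sigma_z = np.cov(Z, rowvar=False)
beta *= np.sqrt(k / (beta @ Sigma_z @ beta / sigma2))
print(round(beta @ Sigma_z @ beta / sigma2, 6))  # 4.0
```

Because the rescaling acts on the whole coefficient vector, the relative allocation of signal across orders is preserved while the overall SNR is pinned to k.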
Additionally, to investigate the ability of the method to correctly capture the hierarchical structure, we specify four different 0-1 vectors that determine the predictors in $M_T$, the model that generates the data in the different scenarios.
Table C-1. Experimental conditions, WFM simulations

Parameter            Values considered
SNR(η_{M_T}) = k     0.25, 1, 4
η_{M_T} ∝            (1_{o1}, 0.5_{o2}, 0.25_{o3}), (1_{o1}, 1_{o2}, 1_{o3}), (0.25_{o1}, 0.5_{o2}, 1_{o3})
γ_{M_T}              M_T of Figure 4-6; M_T without order-one terms; M_T without order-two terms; M_T without x_1^2 and x_2x_5
n                    130, 260, 1040
The results presented below are somewhat different from those found in the main body of the document in Section 5. They are obtained by averaging the number of FPs, TPs, and model sizes over the 100 independent runs and across the corresponding scenarios, for the 20 highest posterior probability models.
SNR and Sample Size Effect
In terms of the SNR and the sample size (Figure C-1), we observe that, as expected, small sample sizes combined with a small SNR impair the ability of the algorithm to detect true coefficients under both the EPP and the HOP(1, ch), with the effect more pronounced under the latter prior. However, considering the mean number of true positives (TP) jointly with the mean model size, it is clear that, although the sensitivity is low, most of the few predictors that are discovered belong to the true model. The results observed with an SNR of 0.25 and a relatively small sample size are far from impressive; however, real problems where the SNR is as low as 0.25 will yield many spurious associations under the EPP, and the fact that the HOP(1, ch) offers strong protection against false positives is commendable in itself. An SNR of 1 also represents a feeble relationship between the predictors and the response; nonetheless, the method captures approximately half of the true coefficients while including very few false positives. Following intuition, as either the sample size or the SNR increases, the algorithm's performance is greatly enhanced: either a large sample size or a large SNR yields models that contain mostly true predictors. Additionally, the HOP(1, ch) provides strong control over the number of false positives; therefore, for high SNR or larger sample sizes, the number of predictors in the top 20 models is close to the size of the true model. In general, the EPP allows the detection of more TPs, while the HOP(1, ch) provides stronger control on the number of FPs included when small sample sizes are combined with small SNRs. As either the sample size or the SNR grows, the differences between the two priors become indistinct.
Figure C-1. SNR vs. n: average model size, average true positives, and average false positives for all simulated scenarios, by model ranking according to model posterior probabilities.
Coefficient Magnitude
This part of the experiment explores the effect of how the signal is distributed across predictors. As mentioned before, sphering is used to assign the coefficient values in a manner that controls the amount of signal that goes into each coefficient. Three allocations of the signal are considered: first, each order-one coefficient contains twice as much signal as any order-two coefficient and four times as much as any order-three coefficient; second, all coefficients contain the same amount of signal regardless of their order; and third, each order-one coefficient contains half as much signal as any order-two coefficient and a quarter as much as any order-three coefficient. In Figure C-2 these allocations are denoted by β = c(1_{o1}, 0.5_{o2}, 0.25_{o3}), β = c(1_{o1}, 1_{o2}, 1_{o3}), and β = c(0.25_{o1}, 0.5_{o2}, 1_{o3}), respectively.
Observe that the number of FPs is insensitive to how the SNR is distributed across predictors when using the HOP(1, ch); conversely, when using the EPP, the number of FPs decreases as the SNR grows, always remaining slightly higher than that obtained with the HOP. With either prior structure, the algorithm performs better whenever all coefficients are equally weighted or when those of the order-three terms receive higher weights; in these two cases (i.e., with β = c(1_{o1}, 1_{o2}, 1_{o3}) or β = c(0.25_{o1}, 0.5_{o2}, 1_{o3})), the effect of the SNR appears to be similar. In contrast, when more weight is given to order-one terms, the algorithm yields slightly worse models at any SNR level. This is an intuitive result: giving more signal to higher-order terms makes them easier to detect, and consequently, by strong heredity, the algorithm will also select the corresponding lower-order terms included in the true model.
Special Points on the Scale
In Nelder (1998) the author argues that the conditions under which the
weak-heredity principle can be used for model selection are so restrictive that the
principle is commonly not valid in practice in this context In addition the author states
that considering well-formulated models only does not take into account the possible
presence of special points on the scales of the predictors that is situations where
omitting lower order terms is justified due to the nature of the data However it is our
contention that every model has an underlying well-formulated structure whether or not
some predictor has special points on its scale will be determined through the estimation
of the coefficients once a valid well-formulated structure has been chosen
To understand how the algorithm behaves whenever the true data generating
mechanism has zero-valued coefficients for some lower order terms in the hierarchy
four different true models are considered Three of them are not well-formulated while
the remaining one is the WFM shown in Figure 4-6 The three models that have special
127
Figure C-2. SNR vs. coefficient values: average model size, average true positives, and average false positives for all simulated scenarios, by model ranking according to model posterior probabilities.
points correspond to the same model M_T from Figure 4-6, but have, respectively, zero-valued coefficients for all the order-one terms, for all the order-two terms, and for x_1^2 and x_2x_5.

As seen before, in comparison to the EPP, the HOP(1, ch) tightly controls the inclusion of FPs by choosing smaller models, at the expense of also reducing the TP count, especially when there is more uncertainty about the true model (i.e., SNR = 0.25). For both prior structures, the results in Figure C-3 indicate that at low SNR levels the presence of special points has no apparent impact, as the selection behavior is similar across the four models in terms of both TPs and FPs. As the SNR increases, the TPs and the model size are affected for true models with zero-valued lower-order
Figure C-3. SNR vs. different true models M_T: average model size, average true positives, and average false positives for all simulated scenarios, by model ranking according to model posterior probabilities.
terms. These differences, however, are not very large. Relatively smaller models are selected whenever some terms in the hierarchy are missing; but at high SNR, which is where the differences are most pronounced, the predictors included are mostly true coefficients. The impact is almost imperceptible for the true model that lacks order-one terms and for the model with zero coefficients for x_1^2 and x_2x_5, and it is more visible for models without order-two terms. This last result is expected due to strong heredity: whenever the order-one coefficients are missing, the inclusion of order-two and order-three terms will force their selection, which is also the case when only a few order-two terms have zero-valued coefficients. Conversely, when all order-two predictors are removed, some order-three predictors are not selected, as their signal is attributed to the order-two predictors missing from the true model. This is especially the case for the order-three interaction term x_1x_2x_5, which depends on the inclusion of three order-two terms (x_1x_2, x_1x_5, x_2x_5) in order for it to be included as well. This makes the inclusion of this term somewhat more challenging: the three order-two interactions capture most of the variation of the polynomial terms that is present when the order-three term is also included. However, special points on the scale commonly occur on a single covariate, or at most on a few covariates; a true data-generating mechanism that removes all terms of a given order in the context of polynomial models is clearly not justified, and this was done here only for comparison purposes.
APPENDIX D
SUPPLEMENTARY INFORMATION FOR THE OZONE DATA ANALYSIS
The covariates considered for the ozone data analysis match those used in Liang et al. (2008); these are displayed in Table D-1 below.
Table D-1. Variables used in the analyses of the ozone contamination dataset

Name    Description
ozone   Daily maximum 1-hour-average ozone (ppm) at Upland, CA
vh      500-millibar pressure height (m) at Vandenberg AFB
wind    Wind speed (mph) at LAX
hum     Humidity (%) at LAX
temp    Temperature (F) measured at Sandburg, CA
ibh     Inversion base height (ft) at LAX
dpg     Pressure gradient (mm Hg) from LAX to Daggett, CA
vis     Visibility (miles) measured at LAX
ibt     Inversion base temperature (F) at LAX
The marginal posterior inclusion probability is the probability of including a given term of the full model $M_F$, after summing over all models in the model space. For each node $\alpha \in M_F$, this probability is given by $p_\alpha = \sum_{M \in \mathcal{M}} I(\alpha \in M)\,p(M \mid y, \mathcal{M})$. In problems with a large model space, such as the one considered for the ozone concentration problem, enumeration of the entire space is not feasible; thus, these probabilities are estimated by summing over every model drawn by the random walk over the model space $\mathcal{M}$.

Given that there are in total 44 potential predictors, for convenience, Tables D-2 to D-5 below display the marginal posterior probabilities only for the terms included under at least one of the model priors considered (EPP, HIP, HUP, and HOP), for each of the parameter priors utilized (intrinsic priors, Zellner-Siow priors, hyper-g(1,1), and hyper-g(2,1)).
Table D-2. Marginal inclusion probabilities, intrinsic prior

          EPP   HIP   HUP   HOP
hum       0.99  0.69  0.85  0.76
dpg       0.85  0.48  0.52  0.53
ibt       0.99  1.00  1.00  1.00
hum^2     0.76  0.51  0.43  0.62
hum*dpg   0.55  0.02  0.03  0.17
hum*ibt   0.98  0.69  0.84  0.75
dpg^2     0.72  0.36  0.25  0.46
ibt^2     0.59  0.78  0.57  0.81

Table D-3. Marginal inclusion probabilities, Zellner-Siow prior

          EPP   HIP   HUP   HOP
hum       0.76  0.67  0.80  0.69
dpg       0.89  0.50  0.55  0.58
ibt       0.99  1.00  1.00  1.00
hum^2     0.57  0.49  0.40  0.57
hum*ibt   0.72  0.66  0.78  0.68
dpg^2     0.81  0.38  0.31  0.51
ibt^2     0.54  0.76  0.55  0.77

Table D-4. Marginal inclusion probabilities, hyper-g(1,1) prior

          EPP   HIP   HUP   HOP
vh        0.54  0.05  0.10  0.11
hum       0.81  0.67  0.80  0.69
dpg       0.90  0.50  0.55  0.58
ibt       0.99  1.00  0.99  0.99
hum^2     0.61  0.49  0.40  0.57
hum*ibt   0.78  0.66  0.78  0.68
dpg^2     0.83  0.38  0.30  0.51
ibt^2     0.49  0.76  0.54  0.77

Table D-5. Marginal inclusion probabilities, hyper-g(2,1) prior

          EPP   HIP   HUP   HOP
hum       0.79  0.64  0.73  0.67
dpg       0.90  0.52  0.60  0.59
ibt       0.99  1.00  0.99  1.00
hum^2     0.60  0.47  0.37  0.55
hum*ibt   0.76  0.64  0.71  0.67
dpg^2     0.82  0.41  0.36  0.52
ibt^2     0.47  0.73  0.49  0.75
REFERENCES
Akaike H (1983) Information measures and model selection Bull Int Statist Inst 50277ndash290
Albert J H amp Chib S (1993) Bayesian-analysis of binary and polychotomousresponse data Journal of the American Statistical Association 88(422) 669ndash679
Berger J amp Bernardo J (1992) On the development of reference priors BayesianStatistics 4 (pp 35ndash60)
URL httpisbastatdukeedueventsvalencia1992Valencia4Refpdf
Berger J amp Pericchi L (1996) The intrinsic Bayes factor for model selection andprediction Journal of the American Statistical Association 91(433) 109ndash122
URL httpamstattandfonlinecomdoiabs10108001621459199610476668
Berger J Pericchi L amp Ghosh J (2001) Objective Bayesian methods for modelselection introduction and comparison In Model selection vol 38 of IMS LectureNotes Monogr Ser (pp 135ndash207) Inst Math Statist
URL httpwwwjstororgstable1023074356165
Besag J York J amp Mollie A (1991) Bayesian Image-Restoration with 2 Applicationsin Spatial Statistics Annals of the Institute of Statistical Mathematics 43 1ndash20
Bien J Taylor J amp Tibshirani R (2013) A lasso for hierarchical interactions TheAnnals of Statistics 41(3) 1111ndash1141
URL httpprojecteuclidorgeuclidaos1371150895
Breiman L amp Friedman J (1985) Estimating optimal transformations for multipleregression and correlation Journal of the American Statistical Association 80580ndash598
Brusco M J Steinley D amp Cradit J D (2009) An exact algorithm for hierarchicallywell-formulated subsets in second-order polynomial regression Technometrics 51(3)306ndash315
Casella G Giron F J Martınez M L amp Moreno E (2009) Consistency of Bayesianprocedures for variable selection The Annals of Statistics 37 (3) 1207ndash1228
URL httpprojecteuclidorgeuclidaos1239369020
Casella G Moreno E amp Giron F (2014) Cluster Analysis Model Selection and PriorDistributions on Models Bayesian Analysis TBA(TBA) 1ndash46
URL httpwwwstatufledu~casellaPapersClusterModel-July11-Apdf
133
Chipman H (1996) Bayesian variable selection with related predictors CanadianJournal of Statistics 24(1) 17ndash36
URL httponlinelibrarywileycomdoi1023073315687abstract
Clyde M amp George E I (2004) Model Uncertainty Statistical Science 19(1) 81ndash94
URL httpprojecteuclidorgDienstgetRecordid=euclidss1089808274
Dewey J (1958) Experience and nature New York Dover Publications
Dorazio R M amp Taylor-Rodrıguez D (2012) A Gibbs sampler for Bayesian analysis ofsite-occupancy data Methods in Ecology and Evolution 3 1093ndash1098
Ellison A M (2004) Bayesian inference in ecology Ecology Letters 7 509ndash520
Fiske I amp Chandler R (2011) unmarked An R package for fitting hierarchical modelsof wildlife occurrence and abundance Journal of Statistical Software 43(10)
URL httpcorekmiopenacukdownloadpdf5701760pdf
George E (2000) The variable selection problem Journal of the American StatisticalAssociation 95(452) 1304ndash1308
URL httpwwwtandfonlinecomdoiabs10108001621459200010474336
Giron F J Moreno E Casella G amp Martınez M L (2010) Consistency of objectiveBayes factors for nonnested linear models and increasing model dimension Revistade la Real Academia de Ciencias Exactas Fisicas y Naturales Serie A Matematicas104(1) 57ndash67
URL httpwwwspringerlinkcomindex105052RACSAM201006
Good I J (1950) Probability and the Weighing of Evidence New York Haffner
Griepentrog G L Ryan J M amp Smith L D (1982) Linear transformations ofpolynomial regression-models American Statistician 36(3) 171ndash174
Gunel E amp Dickey J (1974) Bayes factors for independence in contingency tablesBiometrika 61 545ndash557
Hanski I (1994) A Practical Model of Metapopulation Dynamics Journal of AnimalEcology 63 151ndash162
Hooten M (2006) Hierarchical spatio-temporal models for ecological processesDoctoral dissertation University of Missouri-Columbia
URL httpsmospacelibraryumsystemeduxmluihandle103554500
Hooten M B amp Hobbs N T (2014) A Guide to Bayesian Model Selection forEcologists Ecological Monographs (In Press)
134
Hughes, J., & Haran, M. (2013). Dimension reduction and alleviation of confounding for spatial generalized linear mixed models. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 75, 139–159.

Hurvich, C. M., & Tsai, C.-L. (1989). Regression and time series model selection in small samples. Biometrika, 76, 297–307.
URL http://biomet.oxfordjournals.org/content/76/2/297.abstract

Jeffreys, H. (1935). Some tests of significance, treated by the theory of probability. Proceedings of the Cambridge Philosophical Society, 31, 203–222.

Jeffreys, H. (1961). Theory of Probability (3rd ed.). London: Oxford University Press.

Johnson, D., Conn, P., Hooten, M., Ray, J., & Pond, B. (2013). Spatial occupancy models for large data sets. Ecology, 94(4), 801–808.
URL http://www.esajournals.org/doi/abs/10.1890/12-0564.1

Kass, R., & Wasserman, L. (1995). A reference Bayesian test for nested hypotheses and its relationship to the Schwarz criterion. Journal of the American Statistical Association, 90(431).
URL http://amstat.tandfonline.com/doi/abs/10.1080/01621459.1995.10476592

Kass, R. E., & Raftery, A. E. (1995). Bayes factors. Journal of the American Statistical Association, 90, 773–795.
URL http://www.tandfonline.com/doi/abs/10.1080/01621459.1995.10476572

Kass, R. E., & Wasserman, L. (1996). The selection of prior distributions by formal rules. Journal of the American Statistical Association, 91(435), 1343.
URL http://www.jstor.org/stable/2291752?origin=crossref

Kéry, M. (2010). Introduction to WinBUGS for Ecologists: A Bayesian Approach to Regression, ANOVA, Mixed Models and Related Analyses (1st ed.). Academic Press.

Kéry, M., Gardner, B., & Monnerat, C. (2010). Predicting species distributions from checklist data using site-occupancy models. Journal of Biogeography, 37(10), 1851–1862.

Khuri, A. (2002). Nonsingular linear transformations of the control variables in response surface models. Technical report.

Krebs, C. J. (1972). Ecology: The Experimental Analysis of Distribution and Abundance.
Lempers, F. B. (1971). Posterior Probabilities of Alternative Linear Models. Rotterdam: University of Rotterdam Press.

León-Novelo, L., Moreno, E., & Casella, G. (2012). Objective Bayes model selection in probit models. Statistics in Medicine, 31(4), 353–365.
URL http://www.ncbi.nlm.nih.gov/pubmed/22162041

Liang, F., Paulo, R., Molina, G., Clyde, M. A., & Berger, J. O. (2008). Mixtures of g priors for Bayesian variable selection. Journal of the American Statistical Association, 103(481), 410–423.
URL http://www.tandfonline.com/doi/abs/10.1198/016214507000001337

Link, W., & Barker, R. (2009). Bayesian Inference with Ecological Applications. Elsevier.
URL http://books.google.com/books?id=hecon2l2QPcC

MacKenzie, D., & Nichols, J. (2004). Occupancy as a surrogate for abundance estimation. Animal Biodiversity and Conservation, 1, 461–467.

MacKenzie, D., Nichols, J., & Hines, J. (2003). Estimating site occupancy, colonization, and local extinction when a species is detected imperfectly. Ecology, 84(8), 2200–2207.
URL http://www.esajournals.org/doi/abs/10.1890/02-3090

MacKenzie, D. I., Bailey, L. L., & Nichols, J. D. (2004). Investigating species co-occurrence patterns when species are detected imperfectly. Journal of Animal Ecology, 73, 546–555.

MacKenzie, D. I., Nichols, J. D., Lachman, G. B., Droege, S., Royle, J. A., & Langtimm, C. A. (2002). Estimating site occupancy rates when detection probabilities are less than one. Ecology, 83(8), 2248–2255.

Mazerolle, M. (2013). Package 'AICcmodavg'.
URL ftp://heanet.archive.gnewsense.org/disk1/CRAN/web/packages/AICcmodavg/AICcmodavg.pdf

McCullagh, P., & Nelder, J. A. (1989). Generalized Linear Models (2nd ed.). London, England: Chapman & Hall.

McQuarrie, A., Shumway, R., & Tsai, C.-L. (1997). The model selection criterion AICu.
Moreno, E., Bertolino, F., & Racugno, W. (1998). An intrinsic limiting procedure for model selection and hypotheses testing. Journal of the American Statistical Association, 93(444), 1451–1460.

Moreno, E., Girón, F. J., & Casella, G. (2010). Consistency of objective Bayes factors as the model dimension grows. The Annals of Statistics, 38(4), 1937–1952.
URL http://projecteuclid.org/euclid.aos/1278861238

Nelder, J. A. (1977). Reformulation of linear models. Journal of the Royal Statistical Society, Series A (Statistics in Society), 140, 48–77.

Nelder, J. A. (1998). The selection of terms in response-surface models: how strong is the weak-heredity principle? American Statistician, 52(4), 315–318.

Nelder, J. A. (2000). Functional marginality and response-surface fitting. Journal of Applied Statistics, 27(1), 109–112.

Nichols, J., Hines, J., & MacKenzie, D. (2007). Occupancy estimation and modeling with multiple states and state uncertainty. Ecology, 88(6), 1395–1400.
URL http://www.esajournals.org/doi/pdf/10.1890/06-1474

Ovaskainen, O., Hottola, J., & Siitonen, J. (2010). Modeling species co-occurrence by multivariate logistic regression generates new hypotheses on fungal interactions. Ecology, 91(9), 2514–2521.
URL http://www.ncbi.nlm.nih.gov/pubmed/20957941

Peixoto, J. L. (1987). Hierarchical variable selection in polynomial regression models. American Statistician, 41(4), 311–313.

Peixoto, J. L. (1990). A property of well-formulated polynomial regression models. American Statistician, 44(1), 26–30.

Pericchi, L. R. (2005). Model selection and hypothesis testing based on objective probabilities and Bayes factors. In Handbook of Statistics. Elsevier.

Polson, N. G., Scott, J. G., & Windle, J. (2013). Bayesian inference for logistic models using Pólya-Gamma latent variables. Journal of the American Statistical Association, 108, 1339–1349.
URL http://dx.doi.org/10.1080/01621459.2013.829001

Rao, C. R., & Wu, Y. (2001). On model selection. Vol. 38 of Lecture Notes–Monograph Series (pp. 1–57). Beachwood, OH: Institute of Mathematical Statistics.
URL http://dx.doi.org/10.1214/lnms/1215540960
Reich, B. J., Hodges, J. S., & Zadnik, V. (2006). Effects of residual smoothing on the posterior of the fixed effects in disease-mapping models. Biometrics, 62, 1197–1206.

Reiners, W., & Lockwood, J. (2009). Philosophical Foundations for the Practices of Ecology. Cambridge University Press.
URL http://books.google.com/books?id=dr9cPgAACAAJ

Rigler, F., & Peters, R. (1995). Science and Limnology. Excellence in Ecology. Germany: Ecology Institute.
URL http://orton.catie.ac.cr/cgi-bin/wxis.exe/?IsisScript=CIENL.xis&method=post&formato=2&cantidad=1&expresion=mfn=008268

Robert, C., Chopin, N., & Rousseau, J. (2009). Harold Jeffreys's Theory of Probability revisited. Statistical Science, 24(2), 141–179.
URL https://www.newton.ac.uk/preprints/NI08021.pdf

Robert, C. P. (1993). A note on Jeffreys–Lindley paradox. Statistica Sinica, 3, 601–608.

Royle, J. A., & Kéry, M. (2007). A Bayesian state-space formulation of dynamic occupancy models. Ecology, 88(7), 1813–1823.
URL http://www.ncbi.nlm.nih.gov/pubmed/17645027

Scott, J., & Berger, J. (2010). Bayes and empirical-Bayes multiplicity adjustment in the variable selection problem. The Annals of Statistics.
URL http://projecteuclid.org/euclid.aos/1278861454

Spiegelhalter, D. J., & Smith, A. F. M. (1982). Bayes factors for linear and log-linear models with vague prior information. Journal of the Royal Statistical Society, Series B, 44, 377–387.

Tierney, L., & Kadane, J. B. (1986). Accurate approximations for posterior moments and marginal densities. Journal of the American Statistical Association, 81, 82–86.

Tyre, A. J., Tenhumberg, B., Field, S. A., Niejalke, D., Parris, K., & Possingham, H. P. (2003). Improving precision and reducing bias in biological surveys: estimating false-negative error rates. Ecological Applications, 13(6), 1790–1801.
URL http://www.esajournals.org/doi/abs/10.1890/02-5078

Waddle, J. H., Dorazio, R. M., Walls, S. C., Rice, K. G., Beauchamp, J., Schuman, M. J., & Mazzotti, F. J. (2010). A new parameterization for estimating co-occurrence of interacting species. Ecological Applications, 20, 1467–1475.

Wasserman, L. (2000). Bayesian model selection and model averaging. Journal of Mathematical Psychology, 44(1), 92–107.
URL http://www.ncbi.nlm.nih.gov/pubmed/10733859

Wilson, M., Iversen, E., Clyde, M. A., Schmidler, S. C., & Schildkraut, J. M. (2010). Bayesian model search and multilevel inference for SNP association studies. The Annals of Applied Statistics, 4(3), 1342–1364.
URL http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3004292

Womack, A. J., León-Novelo, L., & Casella, G. (2014). Inference from intrinsic Bayes procedures under model selection and uncertainty. Journal of the American Statistical Association.
URL http://www.tandfonline.com/doi/abs/10.1080/01621459.2014.880348

Yuan, M., Joseph, V. R., & Zou, H. (2009). Structured variable selection and estimation. The Annals of Applied Statistics, 3(4), 1738–1757.
URL http://projecteuclid.org/euclid.aoas/1267453962

Zeller, K. A., Nijhawan, S., Salom-Pérez, R., Potosme, S. H., & Hines, J. E. (2011). Integrating occupancy modeling and interview data for corridor identification: a case study for jaguars in Nicaragua. Biological Conservation, 144(2), 892–901.

Zellner, A., & Siow, A. (1980). Posterior odds ratios for selected regression hypotheses. In Trabajos de Estadística y de Investigación Operativa (pp. 585–603).
URL http://www.springerlink.com/index/5300770UP12246M9.pdf
BIOGRAPHICAL SKETCH
Daniel Taylor-Rodríguez was born in Bogotá, Colombia. He earned a BS degree in economics from the Universidad de Los Andes (2004) and a Specialist degree in statistics from the Universidad Nacional de Colombia. In 2009, he traveled to Gainesville, Florida, to pursue a master's in statistics under the supervision of George Casella. Upon completion, he started a PhD in interdisciplinary ecology with a concentration in statistics, again under George Casella's supervision. After George's passing, Linda Young and Nikolay Bliznyuk continued to oversee Daniel's mentorship. He has accepted a joint postdoctoral fellowship at the Statistical and Applied Mathematical Sciences Institute and the Department of Statistical Science at Duke University.
© 2014 Daniel Taylor-Rodríguez
In memory of George Casella
It is a capital mistake to theorize before one has data. Insensibly one begins to twist facts to suit theories, instead of theories to suit facts.

–Sherlock Holmes, A Scandal in Bohemia
ACKNOWLEDGMENTS
Completing this dissertation would not have been possible without the support of the people who have helped me remain focused, motivated, and inspired throughout the years. I am undeservingly fortunate to be surrounded by such amazing people.

First of all, I would like to express my gratitude to Professor George Casella. It was an unsurpassable honor to work with him. His wisdom, generosity, optimism, and unyielding resolve will forever inspire me. I will always treasure his teachings and the fond memories I have of him. I thank him and Anne for treating me and my wife as family.

I would like to acknowledge all of my committee members. My heartfelt thanks to my advisor, Professor Linda J. Young. I will carry her thoughtful and patient recommendations throughout my life. I have no words to express how thankful I am to her for guiding me through the difficult times that followed Dr. Casella's passing. Also, she has my gratitude for sharing her knowledge and wealth of experience, and for providing me with so many amazing opportunities. I am forever grateful to my local advisor, Professor Nikolay Bliznyuk, for unsparingly sharing his insightful reflections and knowledge. His generosity and drive to help students develop are a model to follow. His kind and extensive efforts, our many conversations, and his suggestions and advice in all aspects of academic and non-academic life have made me a better statistician and have had a profound influence on my way of thinking. My appreciation to Professor Madan Oli for his enlightening advice and for helping me advance my understanding of ecology.

I would like to express my absolute gratitude to Dr. Andrew Womack, my friend and young mentor. His love for good science and hard work, although impossible to keep up with, made my doctoral training one of the most exciting times in my life. I have sincerely enjoyed working with and learning from him over the last couple of years. I offer my gratitude to Dr. Salvador Gezan for his friendship and the patience with which he taught me so much more about statistics (boring our wives to death in the process). I am grateful to
Professor Mary Christman for her mentorship and enormous support. I would like to thank Dr. Mihai Giurcanu for spending countless hours helping me think more deeply about statistics; his insight has been instrumental in shaping my own ideas. Thanks to Dr. Claudio Fuentes for taking an interest in my work and for his advice, support, and kind words, which helped me retain the confidence to continue.

I would like to acknowledge my friends at UF. Juan Jose Acosta, Mauricio Mosquera, Diana Falla, Salvador and Emma Weeks, and Anna Denicol, thanks for becoming my family away from home. Andreas, Tavis, Emily, Alex, Sasha, Mike, Yeonhee, and Laura, thanks for being there for me; I truly enjoyed sharing these years with you. Vitor, Paula, Rafa, Leandro, Fabio, Eduardo, Marcelo, and all the other Brazilians in the Animal Science Department, thanks for your friendship and for the many unforgettable (though blurry) weekends.

Also, I would like to thank Pablo Arboleda for believing in me. Because of him, I was able to take the first step towards fulfilling my educational goals. My gratitude to Grupo Bancolombia, Fulbright Colombia, Colfuturo, and the IGERT QSE3 program for supporting me throughout my studies. Also, thanks to Marc Kéry and Christian Monnerat for providing data to validate our methods. Thanks to the staff in the Statistics Department, especially to Ryan Chance, to the staff at the HPC, and also to Karen Bray at SNRE.

Above all else, I would like to thank my wife and family. Nata, you have always been there for me, pushing me forward, believing in me, and helping me make better decisions; regardless of how hard things get, you have always managed to give me true and lasting happiness. Thank you for your love, strength, and patience. Mom, Dad, Alejandro, Alberto, Laura, Sammy, Vale, and Tommy: without your love, trust, and support, getting this far would not have been possible. Thank you for giving me so much. Gustavo, Lilia, Angelica, and Juan Pablo, thanks for taking me into your family; your words of encouragement have led the way.
TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER

1 GENERAL INTRODUCTION
  1.1 Occupancy Modeling
  1.2 A Primer on Objective Bayesian Testing
  1.3 Overview of the Chapters

2 MODEL ESTIMATION METHODS
  2.1 Introduction
    2.1.1 The Occupancy Model
    2.1.2 Data Augmentation Algorithms for Binary Models
  2.2 Single Season Occupancy
    2.2.1 Probit Link Model
    2.2.2 Logit Link Model
  2.3 Temporal Dynamics and Spatial Structure
    2.3.1 Dynamic Mixture Occupancy State-Space Model
    2.3.2 Incorporating Spatial Dependence
  2.4 Summary

3 INTRINSIC ANALYSIS FOR OCCUPANCY MODELS
  3.1 Introduction
  3.2 Objective Bayesian Inference
    3.2.1 The Intrinsic Methodology
    3.2.2 Mixtures of g-Priors
      3.2.2.1 Intrinsic priors
      3.2.2.2 Other mixtures of g-priors
  3.3 Objective Bayes Occupancy Model Selection
    3.3.1 Preliminaries
    3.3.2 Intrinsic Priors for the Occupancy Problem
    3.3.3 Model Posterior Probabilities
    3.3.4 Model Selection Algorithm
  3.4 Alternative Formulation
  3.5 Simulation Experiments
    3.5.1 Marginal Posterior Inclusion Probabilities for Model Predictors
    3.5.2 Summary Statistics for the Highest Posterior Probability Model
  3.6 Case Study: Blue Hawker Data Analysis
    3.6.1 Results: Variable Selection Procedure
    3.6.2 Validation for the Selection Procedure
  3.7 Discussion

4 PRIORS ON THE MODEL SPACE AND WELL-FORMULATED MODELS
  4.1 Introduction
  4.2 Setup for Well-Formulated Models
    4.2.1 Well-Formulated Model Spaces
  4.3 Priors on the Model Space
    4.3.1 Model Prior Definition
    4.3.2 Choice of Prior Structure and Hyper-Parameters
    4.3.3 Posterior Sensitivity to the Choice of Prior
  4.4 Random Walks on the Model Space
    4.4.1 Simple Pruning and Growing
    4.4.2 Degree Based Pruning and Growing
  4.5 Simulation Study
    4.5.1 SNR and Sample Size Effect
    4.5.2 Coefficient Magnitude
    4.5.3 Special Points on the Scale
  4.6 Case Study: Ozone Data Analysis
  4.7 Discussion

5 CONCLUSIONS

APPENDIX

A FULL CONDITIONAL DENSITIES: DYMOSS
B RANDOM WALK ALGORITHMS
C WFM SIMULATION DETAILS
D SUPPLEMENTARY INFORMATION FOR THE OZONE DATA ANALYSIS

REFERENCES

BIOGRAPHICAL SKETCH
LIST OF TABLES

1-1 Interpretation of BFji when contrasting Mj and Mi

3-1 Simulation control parameters, occupancy model selector

3-2 Comparison of average minOddsMPIP under scenarios having different numbers of sites (N=50, N=100) and under scenarios having different numbers of surveys per site (J=3, J=5), for the presence and detection components, using uniform and multiplicity correction priors

3-3 Comparison of average minOddsMPIP for different levels of signal considered in the occupancy and detection probabilities, for the presence and detection components, using uniform and multiplicity correction priors

3-4 Comparison between scenarios with 50 and 100 sites in terms of the average percentage of true positive and true negative terms over the highest probability models, for the presence and the detection components, using uniform and multiplicity correcting priors on the model space

3-5 Comparison between scenarios with 3 and 5 surveys per site in terms of the percentage of true positive and true negative predictors, averaged over the highest probability models, for the presence and the detection components, using uniform and multiplicity correcting priors on the model space

3-6 Comparison between scenarios with different levels of signal in the occupancy component in terms of the percentage of true positive and true negative predictors, averaged over the highest probability models, for the presence and the detection components, using uniform and multiplicity correcting priors on the model space

3-7 Comparison between scenarios with different levels of signal in the detection component in terms of the percentage of true positive and true negative predictors, averaged over the highest probability models, for the presence and the detection components, using uniform and multiplicity correcting priors on the model space

3-8 Posterior probability for the five highest probability models in the presence component of the blue hawker data

3-9 Posterior probability for the five highest probability models in the detection component of the blue hawker data

3-10 MPIP, presence component

3-11 MPIP, detection component

3-12 Mean misclassification rate for HPMs and MPMs using uniform and multiplicity correction model priors
4-1 Characterization of the full models MF and corresponding model spaces M considered in simulations

4-2 Mean number of false and true positives in 100 randomly generated datasets as the number of main effects increases from three to five predictors in a full quadratic model, under the equal probability prior (EPP), the hierarchical independence prior (HIP), the hierarchical order prior (HOP), and the hierarchical uniform prior (HUP)

4-3 Mean number of false and true positives in 100 randomly generated datasets as the maximum order of MF increases from two to four in a full model with three main effects, under the equal probability prior (EPP), the hierarchical independence prior (HIP), the hierarchical order prior (HOP), and the hierarchical uniform prior (HUP)

4-4 Mean number of false and true positives in 100 randomly generated datasets with unstructured or irregular model spaces, under the equal probability prior (EPP), the hierarchical independence prior (HIP), the hierarchical order prior (HOP), and the hierarchical uniform prior (HUP)

4-5 Variables used in the analyses of the ozone contamination dataset

4-6 Median probability models (MPM) from different combinations of parameter and model priors vs. the model selected using the hierarchical lasso

C-1 Experimental conditions, WFM simulations

D-1 Variables used in the analyses of the ozone contamination dataset

D-2 Marginal inclusion probabilities, intrinsic prior

D-3 Marginal inclusion probabilities, Zellner-Siow prior

D-4 Marginal inclusion probabilities, Hyper-g11

D-5 Marginal inclusion probabilities, Hyper-g21
LIST OF FIGURES

2-1 Graphical representation, occupancy model

2-2 Graphical representation, occupancy model after data augmentation

2-3 Graphical representation, multiseason model for a single site

2-4 Graphical representation, data-augmented multiseason model

3-1 Predictor MPIP averaged over scenarios with N=50 and N=100 sites, using uniform (U) and multiplicity correction (MC) priors

3-2 Predictor MPIP averaged over scenarios with J=3 and J=5 surveys per site, using uniform (U) and multiplicity correction (MC) priors

3-3 Predictor MPIP averaged over scenarios with the interaction between the number of sites and the surveys per site, using uniform (U) and multiplicity correction (MC) priors

3-4 Predictor MPIP averaged over scenarios with equal signal in the occupancy probabilities, using uniform (U) and multiplicity correction (MC) priors

3-5 Predictor MPIP averaged over scenarios with equal signal in the detection probabilities, using uniform (U) and multiplicity correction (MC) priors

4-1 Graphs of well-formulated polynomial models for p = 2

4-2 E(M) and C(M) in M defined by a quadratic surface in two main effects, for model M = {1, x1, x1^2}

4-3 Graphical representation of assumptions on M defined by the quadratic surface in two main effects

4-4 Prior probabilities for the space of well-formulated models associated with the quadratic surface on two variables, where MB is taken to be the intercept-only model and (a, b) ∈ {(1, 1), (1, ch)}

4-5 Prior probabilities for the space of well-formulated models associated with three main effects and one interaction term, where MB is taken to be the intercept-only model and (a, b) ∈ {(1, 1), (1, ch)}

4-6 MT: DAG of the largest true model used in simulations

4-7 Average true positives (TP) and average false positives (FP) in all simulated scenarios, for the median probability model with EPP and HOP(1, ch)

C-1 SNR vs. n: average model size, average true positives, and average false positives for all simulated scenarios, by model ranking according to model posterior probabilities
C-2 SNR vs. coefficient values: average model size, average true positives, and average false positives for all simulated scenarios, by model ranking according to model posterior probabilities

C-3 SNR vs. different true models MT: average model size, average true positives, and average false positives for all simulated scenarios, by model ranking according to model posterior probabilities
Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy

OBJECTIVE BAYESIAN METHODS FOR OCCUPANCY MODEL ESTIMATION AND SELECTION

By

Daniel Taylor-Rodríguez

August 2014

Chair: Linda J. Young
Cochair: Nikolay Bliznyuk
Major: Interdisciplinary Ecology
The ecological literature contains numerous methods for conducting inference about the dynamics that govern biological populations. Among these methods, occupancy models have played a leading role during the past decade in the analysis of large biological population surveys. The flexibility of the occupancy framework has brought about useful extensions for determining key population parameters, which provide insights about the distribution, structure, and dynamics of a population. However, the methods used to fit the models and to conduct inference have gradually grown in complexity, leaving practitioners unable to fully understand their implicit assumptions and increasing the potential for misuse. This motivated our first contribution: we develop a flexible and straightforward estimation method for occupancy models that provides the means to directly incorporate temporal and spatial heterogeneity, using covariate information that characterizes habitat quality and the detectability of a species.

Adding to the issue mentioned above, studies of complex ecological systems now collect large amounts of information. To identify the drivers of these systems, robust techniques that account for test multiplicity and for the structure in the predictors are necessary but unavailable for ecological models. We develop tools to address this methodological gap. First, working in an "objective" Bayesian framework, we develop the first fully automatic and objective method for occupancy model selection, based on intrinsic parameter priors. Moreover, for the general variable selection problem, we propose three sets of prior structures on the model space that correct for multiple testing, and a stochastic search algorithm that relies on the priors on the model space to account for the polynomial structure in the predictors.
CHAPTER 1
GENERAL INTRODUCTION
As with any other branch of science, ecology strives to grasp truths about the world that surrounds us, and in particular about nature. The objective truth sought by ecology may well be beyond our grasp; however, it is reasonable to think that, at least partially, "Nature is capable of being understood" (Dewey, 1958). We can observe and interpret nature to formulate hypotheses, which can then be tested against reality. Hypotheses that encounter no or little opposition when confronted with reality may become contextual versions of the truth, and may be generalized by scaling them spatially and/or temporally to delimit the bounds within which they are valid.

To formulate hypotheses accurately, and in a fashion amenable to scientific inquiry, not only must the point of view and assumptions considered be made explicit, but also the object of interest, the properties of that object worthy of consideration, and the methods used in studying such properties (Reiners & Lockwood, 2009; Rigler & Peters, 1995). Ecology, as defined by Krebs (1972), is "the study of interactions that determine the distribution and abundance of organisms". This characterizes organisms and their interactions as the objects of interest to ecology, and prescribes distribution and abundance as relevant properties of these organisms.

With regard to the methods used to acquire ecological scientific knowledge, traditionally theoretical mathematical models (such as deterministic PDEs) have been used. However, naturally varying systems are imprecisely observed and, as such, are subject to multiple sources of uncertainty that must be explicitly accounted for. Because of this, the ecological scientific community has developed a growing interest in flexible and powerful statistical methods, among which Bayesian hierarchical models predominate. These methods rely on empirical observations and can accommodate fairly complex relationships between empirical observations and theoretical process models, while accounting for diverse sources of uncertainty (Hooten, 2006).
Bayesian approaches are now used extensively in ecological modeling; however, there are two issues of concern, one from the standpoint of ecological practitioners and another from the perspective of scientific ecological endeavors. First, Bayesian modeling tools require a considerable understanding of probability and statistical theory, leading practitioners to view them as black-box approaches (Kéry, 2010). Second, although Bayesian applications proliferate in the literature, in general there is a lack of awareness of the distinction between approaches specifically devised for testing and those for estimation (Ellison, 2004). Furthermore, there is a dangerous unfamiliarity with the proven risks of using tools designed for estimation in testing procedures, e.g., the use of flat priors in hypothesis testing (Berger & Pericchi, 1996; Berger et al., 2001; Kass & Raftery, 1995; Moreno et al., 1998; Robert et al., 2009; Robert, 1993).

Occupancy models have played a leading role during the past decade in large biological population surveys. The flexibility of the occupancy framework has allowed the development of useful extensions to determine several key population parameters, which provide robust notions of the distribution, structure, and dynamics of a population. In order to address some of the concerns stated in the previous paragraph, we concentrate on the occupancy framework to develop estimation and testing tools that allow ecologists, first, to gain insight into the estimation procedure and, second, to conduct statistically sound model selection for site-occupancy data.
1.1 Occupancy Modeling

Since MacKenzie et al. (2002) and Tyre et al. (2003) introduced the site-occupancy framework, countless applications and extensions of the method have been developed in the ecological literature, as evidenced by the 438,000 hits on Google Scholar for a search of "occupancy model". This class of models acknowledges that techniques used to conduct biological population surveys are prone to detection errors: if an individual is detected, it must be present, while if it is not detected, it might or might not be. Occupancy models improve upon traditional binary regression by accounting for observed detection and partially observed presence as two separate but related components. In the site-occupancy setting, the chosen locations are surveyed repeatedly in order to reduce the ambiguity caused by the observed zeros. This approach therefore allows probabilities of both presence (occurrence) and detection to be estimated.
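The two-component structure described above can be made concrete with a short simulation, a minimal sketch in Python with illustrative parameter values that are not taken from this dissertation: latent presence z_i is Bernoulli(psi), and each of J surveys detects the species with probability p only at sites where z_i = 1. The naive occupancy estimate (the fraction of sites with at least one detection) can never exceed the true occupied fraction, which illustrates why detection must be modeled separately.

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative values (assumed for this sketch, not from the dissertation)
psi, p = 0.6, 0.4   # occupancy and per-survey detection probabilities
N, J = 100, 5       # N sites, J repeat surveys per site

# Latent presence indicators: z_i ~ Bernoulli(psi)
z = rng.binomial(1, psi, size=N)

# Detections: y_ij ~ Bernoulli(z_i * p), so detection is possible only
# where the species is actually present
y = rng.binomial(1, z[:, None] * p, size=(N, J))

# A naive estimator that ignores imperfect detection: the fraction of
# sites with at least one detection; it understates the occupied fraction
naive = (y.sum(axis=1) > 0).mean()
true_frac = z.mean()
```

Repeated surveys shrink the gap between `naive` and `true_frac`: the chance of missing the species at an occupied site is (1 - p)^J, which vanishes as J grows.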
The uses of site-occupancy models are many. For example, metapopulation and island biogeography models are often parameterized in terms of site (or patch) occupancy (Hanski, 1992, 1994, 1997, as cited in MacKenzie et al. (2003)), and occupancy may be used as a surrogate for abundance to answer questions regarding geographic distribution, range size, and metapopulation dynamics (MacKenzie et al., 2004; Royle & Kéry, 2007).
The basic occupancy framework, which assumes a single closed population with fixed probabilities through time, has proven to be quite useful; however, it might be of limited utility when addressing some problems. In particular, assumptions for the basic model may become too restrictive or unrealistic whenever the study period extends throughout multiple years or seasons, especially given the increasingly changing environmental conditions that most ecosystems are currently experiencing.
Among the extensions found in the literature, one that we consider particularly relevant incorporates heterogeneous occupancy probabilities through time. Models that incorporate temporally varying probabilities stem from important metapopulation notions provided by Hanski (1994), such as occupancy probabilities depending on local colonization and local extinction processes. In spite of the conceptual usefulness of Hanski's model, several strong and untenable assumptions (e.g., all patches being homogeneous in quality) are required for it to provide practically meaningful results.
A more viable alternative, which builds on Hanski (1994), is the extension of the single-season occupancy model proposed by MacKenzie et al. (2003). In this model, the heterogeneity of occupancy probabilities across seasons arises from local colonization and extinction processes. This model is flexible enough to let detection, occurrence, extinction, and colonization probabilities each depend upon its own set of covariates. Model parameters are obtained through likelihood-based estimation.
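The season-to-season heterogeneity can be written as a simple recursion: if psi_t is the occupancy probability in season t, eps the local extinction probability, and gamma the local colonization probability, then psi_{t+1} = psi_t (1 - eps) + (1 - psi_t) gamma. The sketch below assumes constant gamma and eps for illustration; in the MacKenzie et al. (2003) model each may depend on its own covariates.

```python
def occupancy_trajectory(psi1, gamma, eps, T):
    """Propagate occupancy probability through T seasons via
        psi[t+1] = psi[t] * (1 - eps) + (1 - psi[t]) * gamma,
    where gamma is local colonization and eps is local extinction.
    Constant gamma/eps are a simplification for illustration only."""
    psi = [psi1]
    for _ in range(T - 1):
        psi.append(psi[-1] * (1 - eps) + (1 - psi[-1]) * gamma)
    return psi

# Illustrative values (not from the dissertation)
traj = occupancy_trajectory(psi1=0.8, gamma=0.1, eps=0.3, T=20)

# With constant rates the recursion has the fixed point gamma / (gamma + eps),
# here 0.1 / 0.4 = 0.25, which the trajectory approaches geometrically
stationary = 0.1 / (0.1 + 0.3)
```

Solving psi* = psi*(1 - eps) + (1 - psi*) gamma gives the stationary value psi* = gamma / (gamma + eps), a useful sanity check on any implementation.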
Using a maximum likelihood approach presents two drawbacks. First, the uncertainty assessment for maximum likelihood parameter estimates relies on asymptotic results, obtained from implementation of the delta method, making it sensitive to sample size. Second, to obtain parameter estimates, the latent process (occupancy) is marginalized out of the likelihood, leading to the usual zero-inflated Bernoulli model. Although this is a convenient strategy for solving the estimation problem, after integrating out the latent state variables (occupancy indicators) they are no longer available. Therefore, finite sample estimates cannot be calculated directly; instead, a supplementary parametric bootstrapping step is necessary. Further, additional structure, such as temporal or spatial variation, cannot be introduced by means of random effects (Royle & Kéry, 2007).
1.2 A Primer on Objective Bayesian Testing
With the advent of high-dimensional data, such as that found in modern problems
in ecology, genetics, physics, etc., coupled with evolving computing capability, objective
Bayesian inferential methods have gained increasing popularity. This, however, is by no
means a new approach to the way Bayesian inference is conducted. In fact, starting with
Bayes and Laplace, and continuing for almost 200 years, Bayesian analysis was primarily
based on "noninformative" priors (Berger & Bernardo 1992).

Now, subjective elicitation of prior probabilities in Bayesian analysis is widely
recognized as the ideal (Berger et al. 2001); however, it is often the case that the
available information is insufficient to specify appropriate prior probabilistic statements.
Commonly, as in model selection problems where large model spaces have to be
explored, the number of model parameters is prohibitively large, preventing one from
eliciting prior information for the entire parameter space. As a consequence, in practice,
the determination of priors through the definition of structural rules has become the
alternative to subjective elicitation for a variety of problems in Bayesian testing. Priors
arising from these rules are known in the literature as noninformative, objective, default,
or reference priors. Many of these labels generate controversy and are accused,
perhaps rightly, of providing a false pretension of objectivity. Nevertheless, we will avoid
that discussion and refer to them herein interchangeably as noninformative or objective
priors, to convey the sense that no attempt to introduce an informed opinion is made in
defining the prior probabilities.
A plethora of "noninformative" methods has been developed in the past few
decades (see Berger & Bernardo (1992); Berger & Pericchi (1996); Berger et al. (2001);
Clyde & George (2004); Kass & Wasserman (1995, 1996); Liang et al. (2008); Moreno
et al. (1998); Spiegelhalter & Smith (1982); Wasserman (2000); and the references
therein). We find particularly interesting those derived from the model structure, in which
no tuning parameters are required, especially since these can be regarded as automatic
methods. Among them, methods based on the Bayes factor for Intrinsic Priors have
proven their worth in a variety of inferential problems, given their excellent performance,
flexibility, and ease of use. This class of priors is discussed in detail in Chapter 3. For
now, some basic notation and notions of Bayesian inferential procedures are introduced.
Hypothesis testing and the Bayes factor
Bayesian model selection techniques that aim to find the true model, as opposed
to searching for the model that best predicts the data, are fundamentally extensions of
Bayesian hypothesis testing strategies. In general, this Bayesian approach to hypothesis
testing and model selection relies on determining the amount of evidence found in favor
of one hypothesis (or model) over the other, given an observed set of data. Approached
from a Bayesian standpoint, this type of problem can be formulated in great generality
using a natural, well-defined probabilistic framework that incorporates both model and
parameter uncertainty.
Jeffreys (1935) first developed the Bayesian strategy for hypothesis testing and,
consequently, for the model selection problem. Bayesian model selection within
a model space M = {M_1, M_2, ..., M_J}, where each model is associated with a
parameter θ_j (which may itself be a vector of parameters), incorporates three types
of probability distributions: (1) a prior probability distribution for each model, π(M_j);
(2) a prior probability distribution for the parameters in each model, π(θ_j | M_j); and (3)
the distribution of the data conditional on both the model and the model's parameters,
f(x | θ_j, M_j). These three probability densities induce the joint distribution p(x, θ_j, M_j) =
f(x | θ_j, M_j) · π(θ_j | M_j) · π(M_j), which is instrumental in producing model posterior
probabilities. The model posterior probability is the probability that a model is true given
the data. It is obtained by marginalizing over the parameter space and using Bayes' rule:

    p(M_j | x) = \frac{m(x | M_j)\,\pi(M_j)}{\sum_{i=1}^{J} m(x | M_i)\,\pi(M_i)},   (1–1)

where m(x | M_j) = \int f(x | θ_j, M_j)\,\pi(θ_j | M_j)\,dθ_j is the marginal likelihood of M_j.
Given that interest lies in comparing different models, evidence in favor of one or
another model is assessed through pairwise comparisons using posterior odds:

    \frac{p(M_j | x)}{p(M_k | x)} = \frac{m(x | M_j)}{m(x | M_k)} \cdot \frac{\pi(M_j)}{\pi(M_k)}.   (1–2)

The first term on the right-hand side of (1–2), m(x | M_j)/m(x | M_k), is known as the Bayes factor
comparing model M_j to model M_k, and is denoted by BF_{jk}(x). The Bayes factor
provides a measure of the evidence in favor of either model given the data, and updates
the model prior odds, π(M_j)/π(M_k), to produce the posterior odds.
Note that the model posterior probability in (1–1) can be expressed as a function of
Bayes factors. To illustrate, let M_* ∈ M be a reference model to which all other models
in M are compared. Then dividing both the numerator and denominator in (1–1) by
m(x | M_*)\,π(M_*) yields

    p(M_j | x) = \frac{BF_{j*}(x)\,\frac{\pi(M_j)}{\pi(M_*)}}{1 + \sum_{M_i \in M \setminus \{M_*\}} BF_{i*}(x)\,\frac{\pi(M_i)}{\pi(M_*)}}.   (1–3)
Therefore, as the Bayes factor increases, the posterior probability of model M_j given the
data increases. If all models have equal prior probabilities, a straightforward criterion
to select the best among all candidate models is to choose the model with the largest
Bayes factor. As such, the Bayes factor is not only useful for identifying models favored
by the data, but also provides a means to rank models in terms of their posterior
probabilities.
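To make (1–3) concrete, the following sketch (assuming NumPy; the function name and inputs are our own illustration, not part of the original text) computes posterior model probabilities from log Bayes factors against a reference model, working on the log scale for numerical stability:

```python
import numpy as np

def posterior_model_probs(log_bf, prior_odds=None):
    """Posterior model probabilities via (1-3), computed on the log scale.
    log_bf[j] = ln BF_{j*}(x) against a reference model M* (whose own
    entry is 0); prior_odds[j] = pi(M_j) / pi(M*), defaulting to 1."""
    log_bf = np.asarray(log_bf, dtype=float)
    log_po = np.zeros_like(log_bf) if prior_odds is None else np.log(prior_odds)
    w = log_bf + log_po          # ln[ BF_{j*}(x) * pi(M_j) / pi(M*) ]
    w -= w.max()                 # subtract the max to avoid overflow
    probs = np.exp(w)
    return probs / probs.sum()   # normalizing reproduces the sum in (1-3)

# three models with equal priors; the first entry is the reference (ln BF = 0)
probs = posterior_model_probs([0.0, np.log(3.0), np.log(20.0)])
```

Under equal priors the probabilities are simply the Bayes factors renormalized, here (1, 3, 20)/24, so the third model is the one favored by the data.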
Assuming equal model prior probabilities in (1–3), the prior odds are set equal to
one, and the model posterior odds in (1–2) become p(M_j | x)/p(M_k | x) = BF_{jk}(x). Based
on the Bayes factors, the evidence in favor of one or another model can be interpreted
using Table 1-1, adapted from Kass & Raftery (1995).
Table 1-1. Interpretation of BF_{jk} when contrasting M_j and M_k

    ln BF_{jk}   BF_{jk}       Evidence in favor of M_j    P(M_j | x)
    0 to 2       1 to 3        Weak evidence               0.50-0.75
    2 to 6       3 to 20       Positive evidence           0.75-0.95
    6 to 10      20 to 150     Strong evidence             0.95-0.99
    >10          >150          Very strong evidence        >0.99
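Table 1-1 translates directly into a small helper function (an illustrative sketch; the cutoffs are exactly those of the table):

```python
def bf_evidence(ln_bf):
    """Qualitative evidence category for M_j over M_k from ln BF_jk,
    following Table 1-1 (adapted from Kass & Raftery, 1995)."""
    if ln_bf < 0:
        raise ValueError("Table 1-1 assumes ln BF_jk >= 0; swap the models")
    if ln_bf <= 2:
        return "weak"
    if ln_bf <= 6:
        return "positive"
    if ln_bf <= 10:
        return "strong"
    return "very strong"
```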
Bayesian hypothesis testing and model selection procedures through Bayes factors
and posterior probabilities have several desirable features. First, these methods have a
straightforward interpretation, since the Bayes factor is an increasing function of model
(or hypothesis) posterior probabilities. Second, these methods can yield frequentist-matching
confidence bounds when implemented with good testing priors (Kass &
Wasserman 1996), such as the reference priors of Berger & Bernardo (1992). Third,
since the Bayes factor contains the ratio of marginal densities, it automatically penalizes
complexity according to the number of parameters in each model; this property is
known as Ockham's razor (Kass & Raftery 1995). Fourth, the use of Bayes factors does
not require having nested hypotheses (i.e., having the null hypothesis nested in the
alternative), standard distributions, or regular asymptotics (e.g., convergence to normal
or chi-squared distributions) (Berger et al. 2001). In contrast, this is not always the case
with frequentist and likelihood ratio tests, which depend on known distributions (at least
asymptotically) for the test statistic to perform the test. Finally, Bayesian hypothesis
testing procedures using the Bayes factor can naturally incorporate model uncertainty by
using the Bayesian machinery for model-averaged predictions and confidence bounds
(Kass & Raftery 1995). It is not clear how to account for this uncertainty rigorously in a
fully frequentist approach.
1.3 Overview of the Chapters
In the chapters that follow, we develop a flexible and straightforward hierarchical
Bayesian framework for occupancy models, allowing us to obtain estimates and conduct
robust testing from an "objective" Bayesian perspective. Latent mixtures of random
variables supply the foundation for our methodology. This approach provides a means to
directly incorporate spatial dependency and temporal heterogeneity through predictors
that characterize either the habitat quality of a given site or the detectability features of a
particular survey conducted at a specific site. The Bayesian testing
methods we propose are (1) a fully automatic and objective method for occupancy
model selection, and (2) an objective Bayesian testing tool that accounts for multiple
testing and for polynomial hierarchical structure in the space of predictors.

Chapter 2 introduces the methods proposed for estimation of occupancy model
parameters. A simple estimation procedure for the single season occupancy model
with covariates is formulated using both probit and logit links. Based on the simple
version, an extension is provided to cope with metapopulation dynamics by introducing
persistence and colonization processes. Finally, given the fundamental role that spatial
dependence plays in defining temporal dynamics, a strategy to seamlessly account for
this feature in our framework is introduced.
Chapter 3 develops a new, fully automatic and objective method for occupancy
model selection that is asymptotically consistent for variable selection and averts the
use of tuning parameters. In this chapter, first, some issues surrounding multimodel
inference are described, and insight about objective Bayesian inferential procedures is
provided. Then, building on modern methods for "objective" Bayesian testing to generate
priors on the parameter space, the intrinsic priors for the parameters of the occupancy
model are obtained. These are used in the construction of an algorithm for "objective"
variable selection tailored to the occupancy model framework.

Chapter 4 touches on two important and interconnected issues in model testing
that have yet to receive the attention they deserve: (1) controlling for false
discovery in hypothesis testing given the size of the model space (i.e., given the number
of tests performed), and (2) non-invariance to location transformations of variable
selection procedures in the presence of polynomial predictor structure. These elements both
depend on the definition of prior probabilities on the model space. In this chapter, a set
of priors on the model space and a stochastic search algorithm are proposed. Together,
these control for model multiplicity and account for the polynomial structure among the
predictors.
CHAPTER 2
MODEL ESTIMATION METHODS
"Data! Data! Data!" he cried impatiently. "I can't make bricks without clay."

–Sherlock Holmes, The Adventure of the Copper Beeches
2.1 Introduction
Prior to the introduction of site-occupancy models (MacKenzie et al. 2002; Tyre
et al. 2003), presence-absence data from ecological monitoring programs were used
without any adjustment to assess the impact of management actions, to observe trends
in species distribution through space and time, or to model the habitat of a species (Tyre
et al. 2003). These efforts, however, were suspect due to false-negative errors not
being accounted for. False-negative errors occur whenever a species is present at a site
but goes undetected during the survey.
Site-occupancy models, developed independently by MacKenzie et al. (2002)
and Tyre et al. (2003), extend simple binary-regression models to account for the
aforementioned errors in detection of individuals, common in surveys of animal or plant
populations. Since their introduction, the site-occupancy framework has been used in
countless applications, and numerous extensions for it have been proposed. Occupancy
models improve upon traditional binary regression by analyzing observed detection
and partially observed presence as two separate but related components. In the site-occupancy
setting, the chosen locations are surveyed repeatedly in order to reduce the
ambiguity caused by the observed zeros. This approach therefore allows simultaneous
estimation of the probabilities of presence (occurrence) and detection.
Several extensions to the basic single-season closed population model are
now available. The occupancy approach has been used to determine species range
dynamics (MacKenzie et al. 2003; Royle & Kery 2007), to understand age/stage
structure within populations (Nichols et al. 2007), and to model species co-occurrence
(MacKenzie et al. 2004; Ovaskainen et al. 2010; Waddle et al. 2010). It has even been
suggested as a surrogate for abundance (MacKenzie & Nichols 2004); MacKenzie & Nichols
suggested using occupancy models to conduct large-scale monitoring programs, since
this approach avoids the high costs associated with surveys designed for abundance
estimation. Also, to investigate metapopulation dynamics, occupancy models improve
upon incidence function models (Hanski 1994), which are often parameterized in terms
of site (or patch) occupancy and assume homogeneous patches and a metapopulation
that is at a colonization-extinction equilibrium.
Nevertheless, the implementation of Bayesian occupancy models commonly resorts
to sampling strategies dependent on hyper-parameters, subjective prior elicitation,
and relatively elaborate algorithms. From the standpoint of practitioners, these are
often treated as black-box methods (Kery 2010); as such, the potential for using the
methodology incorrectly is high. Commonly, these procedures are fitted with packages
such as BUGS or JAGS. Although these packages' ease of use has led to widespread
adoption of the methods, the user may be oblivious to the assumptions underpinning
the analysis.

We believe that providing straightforward and robust alternatives for implementing these
methods will help practitioners gain insight into how occupancy modeling, and more
generally Bayesian modeling, is performed. In this chapter, using a simple Gibbs
sampling approach, we first develop a versatile method to estimate the single season
closed population site-occupancy model, then extend it to analyze metapopulation
dynamics through time, and finally provide a further adaptation to incorporate spatial
dependence among neighboring sites.

2.1.1 The Occupancy Model
In this section of the document, we first introduce our results published in Dorazio
& Taylor-Rodriguez (2012) and build upon them to propose relevant extensions. Under
the standard sampling protocol for collecting site-occupancy data, J > 1 independent
surveys are conducted at each of N representative sample locations (sites), noting
whether a species is detected or not detected during each survey. Let y_ij denote a binary
random variable that indicates detection (y_ij = 1) or non-detection (y_ij = 0) during the
j-th survey of site i. Without loss of generality, J may be assumed constant among all N
sites to simplify description of the model; in practice, however, site-specific variation in
J poses no real difficulties and is easily implemented. This sampling protocol therefore
yields an N × J matrix Y of detection/non-detection data.

Note that the observed process y_ij is an imperfect representation of the underlying
occupancy or presence process. Hence, letting z_i denote the presence indicator at site i,
this model specification can be represented through the hierarchy

    y_ij | z_i, λ ~ Bernoulli(z_i p_ij)
    z_i | α ~ Bernoulli(ψ_i),   (2–1)

where p_ij is the probability of correctly classifying the i-th site as occupied during the j-th
survey, and ψ_i is the presence probability at the i-th site. The graphical representation of this
process is shown in Figure 2-1.
Figure 2-1. Graphical representation of the occupancy model.
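As an illustration, data from the hierarchy in (2–1) with constant ψ and p can be simulated in a few lines (a hypothetical sketch with arbitrary parameter values):

```python
import numpy as np

rng = np.random.default_rng(1)
N, J = 100, 5          # sites and surveys per site (arbitrary illustrative values)
psi, p = 0.6, 0.4      # constant occupancy and detection probabilities

z = rng.binomial(1, psi, size=N)                   # z_i ~ Bernoulli(psi)
Y = rng.binomial(1, z[:, None] * p, size=(N, J))   # y_ij | z_i ~ Bernoulli(z_i * p)
```

By construction, every detection history at an unoccupied site (z_i = 0) is a row of zeros; the ambiguity the repeated surveys resolve is that some occupied sites also produce all-zero rows.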
Probabilities of detection and occupancy can both be made functions of covariates,
and their corresponding parameter estimates can be obtained using either a maximum
likelihood or a Bayesian approach. Existing methodologies from the likelihood
perspective marginalize over the latent occupancy process (z_i), making the estimation
procedure depend only on the detections. Most Bayesian strategies rely on MCMC
algorithms that require parameter prior specification and tuning. However, Albert & Chib
(1993) proposed what is now a longstanding strategy in the Bayesian statistical literature for
modeling binary outcomes using a simple Gibbs sampler. This procedure, which is described in
the following section, can be extrapolated to the occupancy setting, eliminating the need
for tuning parameters and subjective prior elicitation.

2.1.2 Data Augmentation Algorithms for Binary Models
Probit model: Data-augmentation with latent normal variables
At the root of Albert & Chib's algorithm lies the idea that if the observed outcome is
0, the latent variable can be simulated from a truncated normal distribution with support
(−∞, 0], and if the outcome is 1, the latent variable can be simulated from a truncated
normal distribution on (0, ∞). To understand the reasoning behind this strategy, let
Y ~ Bernoulli(Φ(x^T β)) and V = x^T β + ε, with ε ~ N(0, 1), where Φ(·) denotes the
standard normal CDF. In this case, note that

    Pr(y = 1 | x^T β) = Φ(x^T β) = Pr(ε < x^T β)
                      = Pr(ε > −x^T β)
                      = Pr(v > 0 | x^T β).

Thus, whenever y = 1, then v > 0, and v ≤ 0 otherwise. In other words, we
may think of y as a truncated version of v. We can therefore sample iteratively, alternating
between the latent variables conditioned on the model parameters and vice versa, to draw
from the desired posterior densities. By augmenting the data with the latent variables,
we obtain full conditional posterior distributions for the model parameters that are
easy to draw from (Equation 2–3 below). Further, just as we may sample the latent variables
given the parameters, we may also sample the parameters given the latent variables.
Given some initial values for all model parameters, values for the latent variables
can be simulated. By conditioning on the latter, it is then possible to draw samples
from the parameters' posterior distributions. These samples can be used to generate
new values for the latent variables, and so on. The process is iterated using a Gibbs sampling
approach. Generally, after a large number of iterations, it yields draws from the joint
posterior distribution of the latent variables and the model parameters, conditional on the
observed outcome values. We formalize the procedure below.
Assume that each outcome Y_1, Y_2, ..., Y_n is such that Y_i | x_i, β ~ Bernoulli(q_i),
where q_i = Φ(x_i^T β) is the standard normal CDF evaluated at x_i^T β, and x_i and β
are the p-dimensional vectors of observed covariates for the i-th observation and their
corresponding parameters, respectively.

Now let y = (y_1, y_2, ..., y_n) be the vector of observed outcomes, and let [β] represent
the prior distribution of the model parameters. The posterior distribution of β is then
given by

    [β | y] ∝ [β] \prod_{i=1}^{n} Φ(x_i^T β)^{y_i} (1 − Φ(x_i^T β))^{1−y_i},   (2–2)
which is intractable. Nevertheless, introducing latent random variables V = (V_1, ..., V_n),
such that V_i ~ N(x_i^T β, 1), resolves this difficulty by specifying that whenever Y_i = 1
then V_i > 0, and if Y_i = 0 then V_i ≤ 0. This yields

    [β, v | y] ∝ [β] \prod_{i=1}^{n} φ(v_i | x_i^T β, 1) \{ I(v_i ≤ 0) I(y_i = 0) + I(v_i > 0) I(y_i = 1) \},   (2–3)

where φ(x | µ, τ²) is the probability density function of a normal random variable x
with mean µ and variance τ². The data augmentation artifact works because [β | y] =
\int [β, v | y]\,dv; hence, if we sample from the joint posterior (2–3) and retain only the sampled
values for β, these correspond to samples from [β | y].
From the expression above it is possible to obtain the full conditional distributions
for V and β, and thus a Gibbs sampler can be proposed. For example, if we use a flat prior
for β (i.e., [β] ∝ 1), the full conditionals are given by

    β | V, y ~ MVN_p((X^T X)^{−1}(X^T V), (X^T X)^{−1})   (2–4)

    V | β, y ~ \prod_{i=1}^{n} tr N(x_i^T β, 1, Q_i),   (2–5)

where MVN_q(µ, Σ) represents a multivariate normal distribution with mean vector µ
and variance-covariance matrix Σ, and tr N(ξ, σ², Q) stands for the truncated normal
distribution with mean ξ, variance σ², and truncation region Q. For each i = 1, 2, ..., n,
the support of the truncated variables is given by Q_i = (−∞, 0] if y_i = 0 and Q_i = (0, ∞)
otherwise. Note that conjugate normal priors could be used alternatively.

At iteration m + 1, the Gibbs sampler draws V^(m+1) conditional on β^(m) from (2–5)
and then samples β^(m+1) conditional on V^(m+1) from (2–4). This process is repeated for
m = 0, 1, ..., n_sim, where n_sim is the number of iterations in the Gibbs sampler.
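A minimal sketch of this sampler for plain probit regression might look as follows (assuming NumPy and SciPy; the function name and synthetic data are illustrative, not part of the original text):

```python
import numpy as np
from scipy.stats import norm, truncnorm

def probit_gibbs(X, y, n_iter=500, rng=None):
    """Albert & Chib (1993) Gibbs sampler for probit regression under a flat
    prior on beta, alternating the draws in (2-4) and (2-5)."""
    rng = np.random.default_rng() if rng is None else rng
    n, p = X.shape
    cov = np.linalg.inv(X.T @ X)         # posterior covariance in (2-4)
    chol = np.linalg.cholesky(cov)
    beta = np.zeros(p)
    draws = np.empty((n_iter, p))
    # truncation region Q_i: (0, inf) if y_i = 1, (-inf, 0] if y_i = 0
    lo = np.where(y == 1, 0.0, -np.inf)
    hi = np.where(y == 1, np.inf, 0.0)
    for s in range(n_iter):
        m = X @ beta
        # (2-5): latent v_i ~ truncated N(x_i' beta, 1) on Q_i
        v = truncnorm.rvs(lo - m, hi - m, loc=m, scale=1.0, random_state=rng)
        # (2-4): beta | v ~ MVN((X'X)^{-1} X'v, (X'X)^{-1})
        beta = cov @ (X.T @ v) + chol @ rng.standard_normal(p)
        draws[s] = beta
    return draws

# small synthetic check: true beta = (0.3, 1.0)
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(300), rng.standard_normal(300)])
y_obs = rng.binomial(1, norm.cdf(X @ np.array([0.3, 1.0])))
draws = probit_gibbs(X, y_obs, n_iter=400, rng=rng)
```

After discarding a burn-in, the posterior means of the retained draws should sit near the generating coefficients; no tuning parameters are involved, which is the appeal of the scheme.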
Logit model: Data-augmentation with latent Polya-gamma variables
Recently, Polson et al. (2013) developed a novel and efficient approach to Bayesian
inference for logistic models using Polya-gamma latent variables, which is analogous
to the Albert & Chib algorithm. The result arises from what the authors refer to as the
Polya-gamma distribution. To construct a random variable from this family, consider the
infinite mixture of the iid sequence of Exp(1) random variables {E_k}_{k=1}^{∞}, given by

    ω = \frac{2}{π^2} \sum_{k=1}^{\infty} \frac{E_k}{(2k − 1)^2},

with probability density function

    g(ω) = \sum_{k=0}^{\infty} (−1)^k \frac{2k + 1}{\sqrt{2πω^3}}\, e^{−\frac{(2k+1)^2}{8ω}}\, I(ω ∈ (0, ∞))   (2–6)

and Laplace transform E[e^{−tω}] = cosh^{−1}(\sqrt{t/2}).
The Polya-gamma family of densities is obtained through an exponential tilting of
the density g in (2–6). These densities, indexed by c ≥ 0, are characterized by

    f(ω | c) = \cosh\left(\frac{c}{2}\right) e^{−\frac{c^2 ω}{2}} g(ω).
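One crude way to draw from PG(1, c) is to truncate the infinite-sum representation of the distribution; exact samplers exist (Polson et al. 2013), so the K-term truncation below is only an illustrative sketch:

```python
import numpy as np

def rpg_approx(c, K=200, rng=None):
    """Approximate draw from PG(1, c) by truncating the infinite-sum
    representation of the Polya-gamma distribution,
        omega = (1 / (2 pi^2)) * sum_k g_k / ((k - 1/2)^2 + c^2 / (4 pi^2)),
    with g_k iid Exp(1).  At c = 0 this reduces to the mixture of Exp(1)
    variables displayed above.  Illustrative only; not an exact sampler."""
    rng = np.random.default_rng() if rng is None else rng
    k = np.arange(1, K + 1)
    g = rng.exponential(1.0, size=K)
    return (g / ((k - 0.5) ** 2 + c ** 2 / (4 * np.pi ** 2))).sum() / (2 * np.pi ** 2)

# Monte Carlo check against the known mean E[omega] = tanh(c/2) / (2c) at c = 1
rng = np.random.default_rng(42)
draws = np.array([rpg_approx(1.0, rng=rng) for _ in range(4000)])
```

The truncation bias is O(1/K), so a few hundred terms already reproduce the analytic mean closely; production code would instead use the exact rejection sampler of Polson et al.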
The likelihood for the binomial logistic model can be expressed in terms of latent
Polya-gamma variables as follows. Assume y_i ~ Bernoulli(δ_i), with predictors x_i' =
(x_i1, ..., x_ip) and success probability δ_i = e^{x_i'β}/(1 + e^{x_i'β}). The posterior for the
model parameters can then be represented as

    [β | y] = \frac{[β] \prod_{i=1}^{n} δ_i^{y_i} (1 − δ_i)^{1−y_i}}{c(y)},

where c(y) is the normalizing constant.
To facilitate the sampling procedure, a data augmentation step can be performed
by introducing Polya-gamma random variables ω_i ~ PG(1, x_i'β). This yields the
data-augmented posterior

    [β, ω | y] = \frac{\left( \prod_{i=1}^{n} Pr(y_i | β) \right) f(ω | x'β)\,[β]}{c(y)},   (2–7)

such that [β | y] = \int_{R_+^n} [β, ω | y]\,dω.
Thus, from the augmented model, the full conditional density for β is given by

    [β | ω, y] ∝ \left( \prod_{i=1}^{n} Pr(y_i | β) \right) f(ω | x'β)\,[β]

             = [β] \prod_{i=1}^{n} \frac{(e^{x_i'β})^{y_i}}{1 + e^{x_i'β}}
               \prod_{i=1}^{n} \cosh\left(\frac{|x_i'β|}{2}\right)
               \exp\left[−\frac{(x_i'β)^2 ω_i}{2}\right] g(ω_i).   (2–8)

This expression yields a normal posterior distribution if β is assigned a flat or normal
prior. Hence, a two-step sampling strategy analogous to that of Albert & Chib (1993)
can be used to estimate β in the occupancy framework.

2.2 Single Season Occupancy
Let p_ij = F(q_ij^T λ) be the probability of correctly classifying as occupied the i-th
site during the j-th survey, conditional on the site being occupied, and let ψ_i = F(x_i^T α)
correspond to the presence probability at the i-th site. Further, let F^{−1}(·) denote a link
function (i.e., probit or logit) connecting the response to the predictors, and denote by λ
and α, respectively, the r-variate and p-variate coefficient vectors for the detection and
for the presence probabilities. Then the following is the joint posterior probability for the
presence indicators and the model parameters:

    π*(z, α, λ) ∝ π_α(α)\,π_λ(λ) \prod_{i=1}^{N} F(x_i'α)^{z_i} (1 − F(x_i'α))^{1−z_i}
                  × \prod_{j=1}^{J} (z_i F(q_ij'λ))^{y_ij} (1 − z_i F(q_ij'λ))^{1−y_ij}.   (2–9)
As in the simple probit regression problem, this posterior is intractable; consequently,
sampling from it directly is not possible. However, the procedures of Albert & Chib for the
probit model and of Polson et al. for the logit model can be extended to generate an
MCMC sampling strategy for the occupancy problem. In what follows, we make use of
this framework to develop samplers with which occupancy parameter estimates can be
obtained for both probit and logit link functions. These algorithms have the added benefit
that they require neither tuning parameters nor subjective elicitation of parameter priors.

2.2.1 Probit Link Model
To extend Albert & Chib's algorithm to the occupancy framework with a probit link,
we first introduce two sets of latent variables, denoted by w_ij and v_i, corresponding to
the normal latent variables used to augment the data. The corresponding hierarchy is

    y_ij | z_i, w_ij ~ Bernoulli(z_i I(w_ij > 0))
    w_ij | λ ~ N(q_ij'λ, 1)
    λ ~ [λ]
    z_i | v_i = I(v_i > 0)
    v_i | α ~ N(x_i'α, 1)
    α ~ [α],   (2–10)
represented by the directed graph in Figure 2-2.

Figure 2-2. Graphical representation of the occupancy model after data-augmentation.
Under this hierarchical model, the joint density is given by

    π*(z, v, α, w, λ) ∝ C_y\,π_α(α)\,π_λ(λ) \prod_{i=1}^{N} φ(v_i | x_i'α, 1)\,I(v_i > 0)^{z_i}\,I(v_i ≤ 0)^{1−z_i}
                        × \prod_{j=1}^{J} (z_i I(w_ij > 0))^{y_ij} (1 − z_i I(w_ij > 0))^{1−y_ij}\,φ(w_ij | q_ij'λ, 1).   (2–11)
The full conditional densities derived from the posterior in Equation 2–11 are
detailed below.

1. The full conditional of z, obtained after integrating out v and w:

       f(z | α, λ) = \prod_{i=1}^{N} f(z_i | α, λ) = \prod_{i=1}^{N} (ψ*_i)^{z_i} (1 − ψ*_i)^{1−z_i},

       where ψ*_i = \frac{ψ_i \prod_{j=1}^{J} p_ij^{y_ij} (1 − p_ij)^{1−y_ij}}{ψ_i \prod_{j=1}^{J} p_ij^{y_ij} (1 − p_ij)^{1−y_ij} + (1 − ψ_i) \prod_{j=1}^{J} I(y_ij = 0)}.   (2–12)

2.     f(v | z, α) = \prod_{i=1}^{N} f(v_i | z_i, α) = \prod_{i=1}^{N} tr N(x_i'α, 1, A_i),

       where A_i = (−∞, 0] if z_i = 0, and A_i = (0, ∞) if z_i = 1,   (2–13)

   and tr N(µ, σ², A) denotes the pdf of a truncated normal random variable with
   mean µ, variance σ², and truncation region A.

3.     f(α | v) = φ_p(α; Σ_α X'v, Σ_α),   (2–14)

   where Σ_α = (X'X)^{−1}, and φ_k(x; µ, Σ) represents the k-variate normal density with
   mean vector µ and variance matrix Σ.

4.     f(w | y, z, λ) = \prod_{i=1}^{N} \prod_{j=1}^{J} f(w_ij | y_ij, z_i, λ) = \prod_{i=1}^{N} \prod_{j=1}^{J} tr N(q_ij'λ, 1, B_ij),

       where B_ij = (−∞, ∞) if z_i = 0; (−∞, 0] if z_i = 1 and y_ij = 0; (0, ∞) if z_i = 1 and y_ij = 1.   (2–15)

5.     f(λ | w) = φ_r(λ; Σ_λ Q'w, Σ_λ),   (2–16)

   where Σ_λ = (Q'Q)^{−1}.
The Gibbs sampling algorithm for the model can then be summarized as follows:

1. Initialize z, α, v, λ, and w.

2. Sample z_i ~ Bernoulli(ψ*_i).

3. Sample v_i from a truncated normal with µ = x_i'α, σ = 1, and truncation region
   depending on z_i.

4. Sample α ~ N(Σ_α X'v, Σ_α), with Σ_α = (X'X)^{−1}.

5. Sample w_ij from a truncated normal with µ = q_ij'λ, σ = 1, and truncation region
   depending on y_ij and z_i.

6. Sample λ ~ N(Σ_λ Q'w, Σ_λ), with Σ_λ = (Q'Q)^{−1}.
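Steps 1–6 can be sketched as follows (an illustrative implementation assuming NumPy/SciPy and flat priors; the function name and the synthetic intercept-only example are our own, not the exact code used in this work):

```python
import numpy as np
from scipy.stats import norm, truncnorm

def occupancy_probit_gibbs(Y, X, Q, n_iter=300, rng=None):
    """Illustrative Gibbs sampler for the single-season probit occupancy model,
    following steps 1-6 above with flat priors on alpha and lambda.
    Y : (N, J) detections; X : (N, p) occupancy covariates;
    Q : (N, J, r) detection covariates."""
    rng = np.random.default_rng() if rng is None else rng
    N, J = Y.shape
    p, r = X.shape[1], Q.shape[2]
    Qf = Q.reshape(N * J, r)                                    # stacked design for lambda
    Sa = np.linalg.inv(X.T @ X); La = np.linalg.cholesky(Sa)    # Sigma_alpha, (2-14)
    Sl = np.linalg.inv(Qf.T @ Qf); Ll = np.linalg.cholesky(Sl)  # Sigma_lambda, (2-16)
    detected = Y.any(axis=1)                                    # such sites have z_i = 1
    alpha, lam = np.zeros(p), np.zeros(r)
    out_a, out_l = np.empty((n_iter, p)), np.empty((n_iter, r))
    for s in range(n_iter):
        psi = norm.cdf(X @ alpha)
        pij = norm.cdf(np.einsum('ijr,r->ij', Q, lam))
        # step 2: z_i ~ Bern(psi*_i) from (2-12); only undetected sites are uncertain
        num = psi * (1.0 - pij).prod(axis=1)
        psi_star = num / (num + (1.0 - psi))
        z = np.where(detected, 1, rng.binomial(1, psi_star))
        # step 3: v_i truncated at 0 according to z_i, as in (2-13)
        m = X @ alpha
        lo = np.where(z == 1, 0.0, -np.inf); hi = np.where(z == 1, np.inf, 0.0)
        v = truncnorm.rvs(lo - m, hi - m, loc=m, random_state=rng)
        # step 4: alpha | v from (2-14)
        alpha = Sa @ (X.T @ v) + La @ rng.standard_normal(p)
        # step 5: w_ij unconstrained when z_i = 0, truncated by y_ij otherwise, (2-15)
        mw = np.einsum('ijr,r->ij', Q, lam)
        lo_w = np.where((z[:, None] == 1) & (Y == 1), 0.0, -np.inf)
        hi_w = np.where((z[:, None] == 1) & (Y == 0), 0.0, np.inf)
        w = truncnorm.rvs(lo_w - mw, hi_w - mw, loc=mw, random_state=rng)
        # step 6: lambda | w from (2-16)
        lam = Sl @ (Qf.T @ w.ravel()) + Ll @ rng.standard_normal(r)
        out_a[s], out_l[s] = alpha, lam
    return out_a, out_l

# synthetic check: intercept-only model, psi = Phi(0.5), p = Phi(0) = 0.5
rng = np.random.default_rng(7)
N, J = 300, 4
X, Q = np.ones((N, 1)), np.ones((N, J, 1))
z_true = rng.binomial(1, norm.cdf(0.5), size=N)
Y_obs = rng.binomial(1, z_true[:, None] * 0.5, size=(N, J))
a_draws, l_draws = occupancy_probit_gibbs(Y_obs, X, Q, n_iter=300, rng=rng)
```

After a short burn-in, the posterior means of Φ(α) and Φ(λ) should recover the generating ψ and p; as promised, every conditional draw is a standard distribution, with no tuning required.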
2.2.2 Logit Link Model
Now turning to the logit link version of the occupancy model, again let y_ij be the
indicator variable used to mark detection of the target species on the j-th survey at the
i-th site, and let z_i be the indicator variable that denotes presence (z_i = 1) or absence
(z_i = 0) of the target species at the i-th site. The model is now defined by

    y_ij | z_i, λ ~ Bernoulli(z_i p_ij), where p_ij = \frac{e^{q_ij'λ}}{1 + e^{q_ij'λ}}
    λ ~ [λ]
    z_i | α ~ Bernoulli(ψ_i), where ψ_i = \frac{e^{x_i'α}}{1 + e^{x_i'α}}
    α ~ [α].
In this hierarchy, the contribution of a single site to the likelihood is

    L_i(α, λ) = \frac{(e^{x_i'α})^{z_i}}{1 + e^{x_i'α}} \prod_{j=1}^{J} \left( z_i \frac{e^{q_ij'λ}}{1 + e^{q_ij'λ}} \right)^{y_ij} \left( 1 − z_i \frac{e^{q_ij'λ}}{1 + e^{q_ij'λ}} \right)^{1−y_ij}.   (2–17)
As in the probit case, we data-augment the likelihood with two separate sets
of latent variables, in this case each having a Polya-gamma distribution.
Augmenting the model and using the posterior in (2–7), the joint is

    [z, v, α, w, λ | y] ∝ [α][λ] \prod_{i=1}^{N} \frac{(e^{x_i'α})^{z_i}}{1 + e^{x_i'α}} \cosh\left(\frac{|x_i'α|}{2}\right) \exp\left[−\frac{(x_i'α)^2 v_i}{2}\right] g(v_i)
        × \prod_{j=1}^{J} \left( z_i \frac{e^{q_ij'λ}}{1 + e^{q_ij'λ}} \right)^{y_ij} \left( 1 − z_i \frac{e^{q_ij'λ}}{1 + e^{q_ij'λ}} \right)^{1−y_ij}
          \cosh\left(\frac{|z_i q_ij'λ|}{2}\right) \exp\left[−\frac{(z_i q_ij'λ)^2 w_ij}{2}\right] g(w_ij).   (2–18)
The full conditionals for z, α, v, λ, and w obtained from (2–18) are provided below.

1. The full conditional for z is obtained after marginalizing the latent variables, and
   yields

       f(z | α, λ) = \prod_{i=1}^{N} f(z_i | α, λ) = \prod_{i=1}^{N} (ψ*_i)^{z_i} (1 − ψ*_i)^{1−z_i},

       where ψ*_i = \frac{ψ_i \prod_{j=1}^{J} p_ij^{y_ij} (1 − p_ij)^{1−y_ij}}{ψ_i \prod_{j=1}^{J} p_ij^{y_ij} (1 − p_ij)^{1−y_ij} + (1 − ψ_i) \prod_{j=1}^{J} I(y_ij = 0)}.   (2–19)
2. Using the result derived in Polson et al. (2013), we have that

       f(v | z, α) = \prod_{i=1}^{N} f(v_i | z_i, α) = \prod_{i=1}^{N} PG(1, x_i'α).   (2–20)

3.     f(α | v) ∝ [α] \prod_{i=1}^{N} \exp\left[ z_i x_i'α − \frac{x_i'α}{2} − \frac{(x_i'α)^2 v_i}{2} \right].   (2–21)
4. By the same result as that used for v, the full conditional for w is

       f(w | y, z, λ) = \prod_{i=1}^{N} \prod_{j=1}^{J} f(w_ij | y_ij, z_i, λ)
                      = \left( \prod_{i ∈ S_1} \prod_{j=1}^{J} PG(1, |q_ij'λ|) \right) \left( \prod_{i ∉ S_1} \prod_{j=1}^{J} PG(1, 0) \right),   (2–22)

   with S_1 = {i ∈ {1, 2, ..., N} : z_i = 1}.

5.     f(λ | z, y, w) ∝ [λ] \prod_{i ∈ S_1} \prod_{j=1}^{J} \exp\left[ y_ij q_ij'λ − \frac{q_ij'λ}{2} − \frac{(q_ij'λ)^2 w_ij}{2} \right],   (2–23)

   with S_1 as defined above.
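Note that the z_i update in (2–19) coincides with (2–12), so both link functions share the same conditional occupancy probability; a small sketch of this shared computation (illustrative, assuming NumPy):

```python
import numpy as np

def psi_star(psi, p, y):
    """Conditional occupancy probability psi*_i appearing in (2-12)/(2-19):
    Pr(z_i = 1 | y_i, alpha, lambda) for a site with detection history y_i.
    psi : occupancy probability; p : (J,) detection probabilities; y : (J,) 0/1."""
    p, y = np.asarray(p, dtype=float), np.asarray(y, dtype=int)
    like_occ = (p ** y * (1.0 - p) ** (1 - y)).prod()  # Prod_j p^y (1-p)^(1-y)
    like_emp = 1.0 - y.any()                           # Prod_j I(y_ij = 0)
    return psi * like_occ / (psi * like_occ + (1.0 - psi) * like_emp)
```

Any detection forces ψ*_i = 1, while an all-zero history shrinks ψ*_i below ψ_i by exactly the factor the repeated surveys justify.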
The Gibbs sampling algorithm is analogous to the one with a probit link, but with the
obvious modifications to incorporate Polya-gamma instead of normal latent variables.

2.3 Temporal Dynamics and Spatial Structure
The uses of the single-season model are limited to very specific problems. In
particular, assumptions for the basic model may become too restrictive or unrealistic
whenever the study period extends throughout multiple years or seasons, especially
given the increasingly changing environmental conditions that most ecosystems are
currently experiencing.
Among the many extensions found in the literature, one that we consider particularly
relevant incorporates heterogeneous occupancy probabilities through time. Extensions of
site-occupancy models that incorporate temporally varying probabilities can be traced
back to Hanski (1994). The heterogeneity of occupancy probabilities through time arises
from local colonization and extinction processes. MacKenzie et al. (2003) proposed an
alternative to Hanski's approach in order to incorporate imperfect detection. The method
is flexible enough to let detection, occurrence, survival, and colonization probabilities
each depend upon its own set of covariates, using likelihood-based estimation for the
model parameters.
However, the approach of MacKenzie et al. presents two drawbacks. First,
the uncertainty assessment for maximum likelihood parameter estimates relies on
asymptotic results (obtained from implementation of the delta method), making it
sensitive to sample size. And second, to obtain parameter estimates, the latent process
(occupancy) is marginalized out of the likelihood, leading to the usual zero-inflated
Bernoulli model. Although this is a convenient strategy to solve the estimation problem,
the latent state variables (occupancy indicators) are no longer available, and as such,
finite sample estimates cannot be calculated unless an additional (and computationally
expensive) parametric bootstrap step is performed (Royle & Kery 2007). Additionally, as
the occupancy process is integrated out, the likelihood approach precludes incorporation
of additional structural dependence using random effects. Thus, the model cannot
account for spatial dependence, which plays a fundamental role in this setting.
To work around some of the shortcomings encountered when fitting dynamic
occupancy models via likelihood-based methods, Royle & Kery developed what they
refer to as a dynamic occupancy state space model (DOSS), alluding to the conceptual
similarity between this model and the class of state space models found in the
time series literature. In particular, this model allows one to retain the latent process
(occupancy indicators) in order to obtain small sample estimates, and to eventually
generate extensions that incorporate structure in time and/or space through random
effects.
The data used in the DOSS model come from standard repeated presence/absence
surveys with N sampling locations (patches or sites), indexed by i = 1, 2, ..., N. Within
a given season (e.g., year, month, or week, depending on the biology of the species), each
sampling location is visited (surveyed) j = 1, 2, ..., J times. This process is repeated for
t = 1, 2, ..., T seasons. Here, an important assumption is that the site occupancy status
is closed within, but not across, seasons.
As is usual in the occupancy modeling framework, two different processes are considered. The first is the detection process per site-visit-season combination, denoted by $y_{ijt}$. The $y_{ijt}$ are indicator variables that take the value 1 if the species is detected at site $i$, survey $j$, and season $t$, and 0 otherwise. These detection indicators are assumed to be independent within each site and season. The second response considered is the partially observed presence (occupancy) indicator $z_{it}$. These are indicator variables equal to 1 whenever $y_{ijt} = 1$ for one or more of the visits made to site $i$ during season $t$; otherwise the values of the $z_{it}$'s are unknown. Royle & Kéry refer to these two processes as the observation ($y_{ijt}$) and state ($z_{it}$) models.
In this setting, the parameters of greatest interest are the occurrence or site-occupancy probabilities, denoted by $\psi_{it}$, as well as those representing the population dynamics, which are accounted for by introducing changes in occupancy status over time through local colonization and survival. That is, if a site was not occupied at season $t-1$, at season $t$ it can either be colonized or remain unoccupied. On the other hand, if the site was in fact occupied at season $t-1$, it can remain that way (survival) or become abandoned (local extinction) at season $t$. The probabilities of survival and colonization from season $t-1$ to season $t$ at the $i$th site are denoted by $\theta_{i(t-1)}$ and $\gamma_{i(t-1)}$, respectively.

During the initial period (or season), the model for the state process is expressed in terms of the occupancy probability (Equation 2–24). For subsequent periods, the state process is specified in terms of survival and colonization probabilities (Equation 2–25). In particular,
$$z_{i1} \sim \text{Bernoulli}(\psi_{i1}) \qquad (2\text{–}24)$$

$$z_{it} \mid z_{i(t-1)} \sim \text{Bernoulli}\!\left( z_{i(t-1)}\theta_{i(t-1)} + \left(1 - z_{i(t-1)}\right)\gamma_{i(t-1)} \right) \qquad (2\text{–}25)$$

The observation model, conditional on the latent process $z_{it}$, is defined by

$$y_{ijt} \mid z_{it} \sim \text{Bernoulli}\!\left(z_{it}\, p_{ijt}\right) \qquad (2\text{–}26)$$
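The generative process in Equations 2–24 to 2–26 can be simulated directly. The sketch below is ours, with made-up constant probabilities purely for illustration; it draws occupancy forward through the seasons and then detections conditional on occupancy:

```python
import numpy as np

rng = np.random.default_rng(1)
N, J, T = 50, 4, 6                           # sites, visits per season, seasons
psi1, theta, gamma, p = 0.5, 0.8, 0.2, 0.6   # illustrative constants

z = np.empty((N, T), dtype=int)
z[:, 0] = rng.binomial(1, psi1, size=N)      # Eq. 2-24: first-season occupancy
for t in range(1, T):
    # occupied sites persist w.p. theta; empty sites are colonized w.p. gamma
    pr = z[:, t - 1] * theta + (1 - z[:, t - 1]) * gamma   # Eq. 2-25
    z[:, t] = rng.binomial(1, pr)

# Eq. 2-26: detections occur only at occupied sites
y = rng.binomial(1, p * z[:, None, :], size=(N, J, T))

# sanity check: every detection at an unoccupied site-season is impossible
assert not np.any(y[np.broadcast_to(z[:, None, :], y.shape) == 0])
```

Note how the closure assumption appears in the code: $z$ is fixed within a season and only updated between seasons.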
Royle & Kéry induce heterogeneity by site, site-season, and site-survey-season, respectively, in the occupancy, survival and colonization, and detection probabilities through the following specification:

$$\begin{aligned}
\operatorname{logit}(\psi_{i1}) &= x_1 + r_i, & r_i &\sim N(0, \sigma^2_{\psi}), & \operatorname{logit}^{-1}(x_1) &\sim \text{Unif}(0, 1)\\
\operatorname{logit}(\theta_{it}) &= a_t + u_i, & u_i &\sim N(0, \sigma^2_{\theta}), & \operatorname{logit}^{-1}(a_t) &\sim \text{Unif}(0, 1)\\
\operatorname{logit}(\gamma_{it}) &= b_t + v_i, & v_i &\sim N(0, \sigma^2_{\gamma}), & \operatorname{logit}^{-1}(b_t) &\sim \text{Unif}(0, 1)\\
\operatorname{logit}(p_{ijt}) &= c_t + w_{ij}, & w_{ij} &\sim N(0, \sigma^2_{p}), & \operatorname{logit}^{-1}(c_t) &\sim \text{Unif}(0, 1)
\end{aligned} \qquad (2\text{–}27)$$
where $x_1$, $a_t$, $b_t$, and $c_t$ are the season fixed effects for the corresponding probabilities, and $(r_i, u_i, v_i)$ and $w_{ij}$ are the site and site-survey random effects, respectively. Additionally, all variance components are assigned the usual inverse-gamma priors.

As the authors state, this formulation can be regarded as "being suitably vague"; however, it is also restrictive, in the sense that it is not clear what strategy to follow to incorporate additional covariates while preserving the straightforward sampling strategy.

2.3.1 Dynamic Mixture Occupancy State-Space Model
We assume that the probabilities for occupancy, survival, colonization, and detection are all functions of linear combinations of covariates. However, our setup varies slightly from that considered by Royle & Kéry (2007). In essence, we modify the way in which the estimates for survival and colonization probabilities are attained. Our model incorporates the notion that occupancy at a site occupied during the previous season takes place through persistence, where we define persistence as a function of both survival and colonization. That is, a site occupied at time $t$ may again be occupied at time $t+1$ if the current settlers survive, if they perish and new settlers colonize simultaneously, or if both current settlers survive and new ones colonize.
Our functional forms of choice are again the probit and logit link functions. This means that each probability of interest, which we will refer to for illustration as $\delta$, is linked to a linear combination of covariates $\mathbf{x}^T\xi$ through the relationship defined by $\delta = F(\mathbf{x}^T\xi)$, where $F(\cdot)$ represents the inverse link function. This particular assumption facilitates relating the data augmentation algorithms of Albert & Chib and Polson et al. to Royle & Kéry's DOSS model. We refer to this extension of Royle & Kéry's model as the Dynamic Mixture Occupancy State-Space model (DYMOSS).
As before, let $y_{ijt}$ be the indicator variable used to mark detection of the target species on the $j$th survey at the $i$th site during the $t$th season, and let $z_{it}$ be the indicator variable that denotes presence ($z_{it}=1$) or absence ($z_{it}=0$) of the target species at the $i$th site, $t$th season, with $i \in \{1, 2, \ldots, N\}$, $j \in \{1, 2, \ldots, J\}$, and $t \in \{1, 2, \ldots, T\}$. Additionally, assume that the probabilities for occupancy at time $t=1$, persistence, colonization, and detection are all functions of covariates, with corresponding parameter vectors $\alpha$, $\Delta^{(s)} = \{\delta^{(s)}_{t-1}\}_{t=2}^{T}$, $\mathbf{B}^{(c)} = \{\beta^{(c)}_{t-1}\}_{t=2}^{T}$, and $\Lambda = \{\lambda_t\}_{t=1}^{T}$, and covariate matrices $\mathbf{X}^{(o)}$, $\mathbf{X} = \{X_{t-1}\}_{t=2}^{T}$, and $\mathbf{Q} = \{Q_t\}_{t=1}^{T}$, respectively. Using the notation above, our proposed dynamic occupancy model is defined by the following hierarchy:
State model:
$$\begin{aligned}
z_{i1}\mid\alpha &\sim \text{Bernoulli}(\psi_{i1}), \quad \text{where } \psi_{i1} = F\!\left(\mathbf{x}'_{(o)i}\alpha\right)\\
z_{it}\mid z_{i(t-1)}, \delta^{(s)}_{t-1}, \beta^{(c)}_{t-1} &\sim \text{Bernoulli}\!\left( z_{i(t-1)}\theta_{i(t-1)} + \left(1 - z_{i(t-1)}\right)\gamma_{i(t-1)} \right),\\
&\quad\text{where } \theta_{i(t-1)} = F\!\left(\delta^{(s)}_{t-1} + \mathbf{x}'_{i(t-1)}\beta^{(c)}_{t-1}\right) \text{ and } \gamma_{i(t-1)} = F\!\left(\mathbf{x}'_{i(t-1)}\beta^{(c)}_{t-1}\right)
\end{aligned} \qquad (2\text{–}28)$$

Observation model:
$$y_{ijt}\mid z_{it}, \lambda_t \sim \text{Bernoulli}\!\left(z_{it}\, p_{ijt}\right), \quad \text{where } p_{ijt} = F(\mathbf{q}^T_{ijt}\lambda_t) \qquad (2\text{–}29)$$
In the hierarchical setup given by Equations 2–28 and 2–29, $\theta_{i(t-1)}$ corresponds to the probability of persistence from time $t-1$ to time $t$ at site $i$, and $\gamma_{i(t-1)}$ denotes the colonization probability. Note that $\theta_{i(t-1)} - \gamma_{i(t-1)}$ yields the survival probability from $t-1$ to $t$. The effect of survival is introduced by shifting the intercept of the linear predictor by a quantity $\delta^{(s)}_{t-1}$. Although in this version of the model this effect is accomplished by modifying only the intercept, it can be extended so that covariates determine $\delta^{(s)}_{t-1}$ as well. The graphical representation of the model for a single site is
[Figure: a chain $z_{i1} \rightarrow z_{i2} \rightarrow \cdots \rightarrow z_{iT}$, with $\alpha$ pointing into $z_{i1}$, the pair $(\delta^{(s)}_{t-1}, \beta^{(c)}_{t-1})$ pointing into each $z_{it}$ for $t > 1$, and each observation $y_{it}$ generated from $z_{it}$ with detection parameters $\lambda_t$.]

Figure 2-3. Graphical representation of the multiseason model for a single site.
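Under the probit link, the persistence and colonization probabilities in Equation 2–28 differ only by the survival offset $\delta^{(s)}_{t-1}$, so the implied survival probability is just their difference. A small sketch of ours, with purely illustrative numbers, makes this concrete:

```python
from math import erf, sqrt

def probit_inv(x):
    """Inverse probit link: the standard normal CDF."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

x_beta = 0.3     # illustrative linear predictor x'beta (habitat suitability)
delta_s = 0.8    # illustrative survival offset delta^(s)

theta = probit_inv(delta_s + x_beta)   # persistence probability
gamma = probit_inv(x_beta)             # colonization probability
survival = theta - gamma               # implied survival probability

print(round(theta, 3), round(gamma, 3), round(survival, 3))
```

With a positive offset the persistence probability exceeds the colonization probability, so the implied survival probability is positive, as the model intends.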
The joint posterior for the model defined by this hierarchical setting is

$$\begin{aligned}
[\mathbf{z}, \alpha, \mathbf{B}^{(c)}, \Delta^{(s)}, \Lambda \mid \mathbf{y}] = C_{\mathbf{y}} &\prod_{i=1}^{N}\left\{\left[\psi_{i1}\prod_{j=1}^{J}p_{ij1}^{y_{ij1}}(1-p_{ij1})^{(1-y_{ij1})}\right]^{z_{i1}}\left[(1-\psi_{i1})\prod_{j=1}^{J}I_{\{y_{ij1}=0\}}\right]^{1-z_{i1}}\right\}[\lambda_1][\alpha]\\
&\times \prod_{t=2}^{T}\prod_{i=1}^{N}\left\{\left[\theta_{i(t-1)}^{z_{it}}(1-\theta_{i(t-1)})^{1-z_{it}}\right]^{z_{i(t-1)}}\left[\gamma_{i(t-1)}^{z_{it}}(1-\gamma_{i(t-1)})^{1-z_{it}}\right]^{1-z_{i(t-1)}}\right.\\
&\qquad\left.\times\left[\prod_{j=1}^{J}p_{ijt}^{y_{ijt}}(1-p_{ijt})^{1-y_{ijt}}\right]^{z_{it}}\left[\prod_{j=1}^{J}I_{\{y_{ijt}=0\}}\right]^{1-z_{it}}\right\}[\delta^{(s)}_{t-1}][\beta^{(c)}_{t-1}][\lambda_t]
\end{aligned} \qquad (2\text{–}30)$$
which, as in the single-season case, is intractable. Once again, a Gibbs sampler cannot be constructed directly to sample from this joint posterior. The graphical representation of the model for one site, incorporating the latent variables, is provided in Figure 2-4.
[Figure: the single-site chain of Figure 2-3 augmented with the latent variables: $u_{i}$ pointing into $z_{i1}$, $v_{i,t-1}$ pointing into each $z_{it}$ for $t > 1$, and $w_{it}$ pointing into each $y_{it}$.]

Figure 2-4. Graphical representation of the data-augmented multiseason model.
Probit link normal-mixture DYMOSS model
We deal with the intractability of the joint posterior distribution as before, that is, by introducing latent random variables. Each latent variable incorporates the relevant linear combination of covariates for the probabilities considered in the model. This artifact enables us to sample from the joint posterior distribution of the model parameters. For the probit link, the sets of latent random variables, respectively for first-season occupancy, persistence and colonization, and detection, are

- $u_i \sim N\!\left(\mathbf{x}^T_{(o)i}\alpha,\, 1\right)$,
- $v_{i(t-1)} \sim z_{i(t-1)}\,N\!\left(\delta^{(s)}_{(t-1)} + \mathbf{x}^T_{i(t-1)}\beta^{(c)}_{(t-1)},\, 1\right) + (1 - z_{i(t-1)})\,N\!\left(\mathbf{x}^T_{i(t-1)}\beta^{(c)}_{(t-1)},\, 1\right)$, and
- $w_{ijt} \sim N\!\left(\mathbf{q}^T_{ijt}\lambda_t,\, 1\right)$.
Introducing these latent variables into the hierarchical formulation yields:

State model:
$$\begin{aligned}
u_{i}\mid\alpha &\sim N\!\left(\mathbf{x}'_{(o)i}\alpha,\, 1\right)\\
z_{i1}\mid u_{i} &\sim \text{Bernoulli}\!\left(I_{\{u_{i}>0\}}\right)\\
\text{for } t > 1:\quad v_{i(t-1)}\mid z_{i(t-1)}, \beta_{t-1} &\sim z_{i(t-1)}\,N\!\left(\delta^{(s)}_{(t-1)} + \mathbf{x}'_{i(t-1)}\beta^{(c)}_{(t-1)},\, 1\right) + (1 - z_{i(t-1)})\,N\!\left(\mathbf{x}'_{i(t-1)}\beta^{(c)}_{(t-1)},\, 1\right)\\
z_{it}\mid v_{i(t-1)} &\sim \text{Bernoulli}\!\left(I_{\{v_{i(t-1)}>0\}}\right)
\end{aligned} \qquad (2\text{–}31)$$

Observation model:
$$\begin{aligned}
w_{ijt}\mid\lambda_t &\sim N\!\left(\mathbf{q}^T_{ijt}\lambda_t,\, 1\right)\\
y_{ijt}\mid z_{it}, w_{ijt} &\sim \text{Bernoulli}\!\left(z_{it}\, I_{\{w_{ijt}>0\}}\right)
\end{aligned} \qquad (2\text{–}32)$$
Note that the result presented in Section 2.2 corresponds to the particular case $T=1$ of the model specified by Equations 2–31 and 2–32.
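The practical payoff of Equation 2–31 is that, conditional on $z$, each latent $u_i$ is a normal truncated to the positive or nonpositive half-line, so its Gibbs update has a closed form. A minimal sketch of that update (ours, using only the standard library's normal CDF; `mu` stands for the linear predictor $\mathbf{x}'_{(o)i}\alpha$):

```python
import numpy as np
from statistics import NormalDist

_std = NormalDist()  # standard normal, provides cdf and inverse cdf

def draw_u(z, mu, rng):
    """Albert-Chib style update: u_i ~ N(mu_i, 1) truncated to (0, inf)
    when z_i = 1 and to (-inf, 0] when z_i = 0, via inverse-CDF sampling."""
    out = np.empty(len(z))
    for i, (zi, mi) in enumerate(zip(z, mu)):
        a = _std.cdf(-mi)                   # P(u_i <= 0) under N(mi, 1)
        v = rng.uniform(1e-12, 1.0)         # avoid exactly 0 for inv_cdf
        if zi == 1:                         # sample the upper tail (0, inf)
            out[i] = mi + _std.inv_cdf(a + v * (1.0 - a))
        else:                               # sample the lower tail (-inf, 0]
            out[i] = mi + _std.inv_cdf(v * a)
    return out

rng = np.random.default_rng(0)
u = draw_u(z=[1, 0, 1], mu=[0.5, -0.2, 1.3], rng=rng)
# every draw agrees in sign with its occupancy indicator
assert u[0] > 0 and u[1] <= 0 and u[2] > 0
```

The same pattern applies to the $v_{i(t-1)}$ and $w_{ijt}$ updates, with the appropriate linear predictors and truncation indicators.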
As mentioned previously, model parameters are obtained using a Gibbs sampling approach. Let $\phi(x \mid \mu, \sigma^2)$ denote the pdf of a normally distributed random variable $x$ with mean $\mu$ and variance $\sigma^2$. Also let

1. $\mathbf{W}_t = (\mathbf{w}_{1t}, \mathbf{w}_{2t}, \ldots, \mathbf{w}_{Nt})$ with $\mathbf{w}_{it} = (w_{i1t}, w_{i2t}, \ldots, w_{iJ_{it}t})$ (for $i = 1, 2, \ldots, N$ and $t = 1, 2, \ldots, T$),
2. $\mathbf{u} = (u_1, u_2, \ldots, u_N)$, and
3. $\mathbf{V} = (\mathbf{v}_1, \ldots, \mathbf{v}_{T-1})$ with $\mathbf{v}_t = (v_{1t}, v_{2t}, \ldots, v_{Nt})$.
For the probit link model, the joint posterior distribution is

$$\begin{aligned}
\pi\!\left(\mathbf{Z}, \mathbf{u}, \mathbf{V}, \{\mathbf{W}_t\}_{t=1}^{T}, \alpha, \mathbf{B}^{(c)}, \Delta^{(s)}, \Lambda\right) \propto\; & [\alpha]\prod_{i=1}^{N}\phi\!\left(u_i \,\middle|\, \mathbf{x}'_{(o)i}\alpha,\, 1\right) I^{\,z_{i1}}_{\{u_i>0\}}\, I^{\,1-z_{i1}}_{\{u_i\le 0\}}\\
&\times \prod_{t=2}^{T}\left[\beta^{(c)}_{t-1}, \delta^{(s)}_{t-1}\right]\prod_{i=1}^{N}\phi\!\left(v_{i(t-1)} \,\middle|\, \mu^{(v)}_{i(t-1)},\, 1\right) I^{\,z_{it}}_{\{v_{i(t-1)}>0\}}\, I^{\,1-z_{it}}_{\{v_{i(t-1)}\le 0\}}\\
&\times \prod_{t=1}^{T}[\lambda_t]\prod_{i=1}^{N}\prod_{j=1}^{J_{it}}\phi\!\left(w_{ijt} \,\middle|\, \mathbf{q}'_{ijt}\lambda_t,\, 1\right)\left(z_{it}\, I_{\{w_{ijt}>0\}}\right)^{y_{ijt}}\left(1 - z_{it}\, I_{\{w_{ijt}>0\}}\right)^{(1-y_{ijt})}
\end{aligned} \qquad (2\text{–}33)$$

where $\mu^{(v)}_{i(t-1)} = z_{i(t-1)}\delta^{(s)}_{t-1} + \mathbf{x}'_{i(t-1)}\beta^{(c)}_{t-1}$.
Initialize the Gibbs sampler at $\alpha^{(0)}$, $\mathbf{B}^{(c)(0)}$, $\Delta^{(s)(0)}$, and $\Lambda^{(0)}$. The sampler proceeds iteratively by block sampling sequentially for each primary sampling period, as follows: first the presence process, then the latent variables from the data-augmentation step for the presence component, followed by the parameters for the presence process, then the latent variables for the detection component, and finally the parameters for the detection component. Letting $[\,\cdot \mid \cdot\,]$ denote the full conditional probability density function of a component given all other unknown parameters and the observed data, for $m = 1, \ldots, n_{sim}$ the sampling procedure can be summarized as

$$\begin{aligned}
\left[\mathbf{z}^{(m)}_1 \mid \cdot\right] &\rightarrow \left[\mathbf{u}^{(m)} \mid \cdot\right] \rightarrow \left[\alpha^{(m)} \mid \cdot\right] \rightarrow \left[\mathbf{W}^{(m)}_1 \mid \cdot\right] \rightarrow \left[\lambda^{(m)}_1 \mid \cdot\right]\\
&\rightarrow \left[\mathbf{z}^{(m)}_2 \mid \cdot\right] \rightarrow \left[\mathbf{v}^{(m)}_1 \mid \cdot\right] \rightarrow \left[\beta^{(c)(m)}_1, \delta^{(s)(m)}_1 \mid \cdot\right] \rightarrow \left[\mathbf{W}^{(m)}_2 \mid \cdot\right] \rightarrow \left[\lambda^{(m)}_2 \mid \cdot\right] \rightarrow \cdots\\
&\rightarrow \left[\mathbf{z}^{(m)}_T \mid \cdot\right] \rightarrow \left[\mathbf{v}^{(m)}_{T-1} \mid \cdot\right] \rightarrow \left[\beta^{(c)(m)}_{T-1}, \delta^{(s)(m)}_{T-1} \mid \cdot\right] \rightarrow \left[\mathbf{W}^{(m)}_T \mid \cdot\right] \rightarrow \left[\lambda^{(m)}_T \mid \cdot\right]
\end{aligned}$$

The full conditional probability densities for this Gibbs sampling algorithm are presented in detail in Appendix A.
Logit link Polya-Gamma DYMOSS model

Using the same notation as before, the logit link model resorts to the hierarchy given by:

State model:
$$\begin{aligned}
u_{i}\mid\alpha &\sim PG\!\left(1,\, \left|\mathbf{x}^T_{(o)i}\alpha\right|\right)\\
z_{i1}\mid u_{i} &\sim \text{Bernoulli}\!\left(I_{\{u_{i}>0\}}\right)\\
\text{for } t > 1:\quad v_{i(t-1)}\mid z_{i(t-1)}, \delta^{(s)}_{t-1}, \beta^{(c)}_{t-1} &\sim PG\!\left(1,\, \left|z_{i(t-1)}\delta^{(s)}_{(t-1)} + \mathbf{x}'_{i(t-1)}\beta^{(c)}_{(t-1)}\right|\right)\\
z_{it}\mid v_{i(t-1)} &\sim \text{Bernoulli}\!\left(I_{\{v_{i(t-1)}>0\}}\right)
\end{aligned} \qquad (2\text{–}34)$$

Observation model:
$$\begin{aligned}
w_{ijt}\mid\lambda_t &\sim PG\!\left(1,\, \left|\mathbf{q}^T_{ijt}\lambda_t\right|\right)\\
y_{ijt}\mid z_{it}, w_{ijt} &\sim \text{Bernoulli}\!\left(z_{it}\, I_{\{w_{ijt}>0\}}\right)
\end{aligned} \qquad (2\text{–}35)$$
The logit link version of the joint posterior is given by

$$\begin{aligned}
\pi\!\left(\mathbf{Z}, \mathbf{u}, \mathbf{V}, \{\mathbf{W}_t\}_{t=1}^{T}, \alpha, \mathbf{B}^{(c)}, \Delta^{(s)}, \Lambda\right) \propto\; & \prod_{i=1}^{N}\frac{\left(e^{\mathbf{x}'_{(o)i}\alpha}\right)^{z_{i1}}}{1 + e^{\mathbf{x}'_{(o)i}\alpha}}\, PG\!\left(u_i;\, 1,\, \left|\mathbf{x}'_{(o)i}\alpha\right|\right)[\lambda_1][\alpha]\\
&\times \prod_{j=1}^{J_{i1}}\left(z_{i1}\frac{e^{\mathbf{q}'_{ij1}\lambda_1}}{1 + e^{\mathbf{q}'_{ij1}\lambda_1}}\right)^{y_{ij1}}\left(1 - z_{i1}\frac{e^{\mathbf{q}'_{ij1}\lambda_1}}{1 + e^{\mathbf{q}'_{ij1}\lambda_1}}\right)^{1-y_{ij1}} PG\!\left(w_{ij1};\, 1,\, \left|z_{i1}\mathbf{q}'_{ij1}\lambda_1\right|\right)\\
&\times \prod_{t=2}^{T}\left[\delta^{(s)}_{t-1}\right]\left[\beta^{(c)}_{t-1}\right][\lambda_t]\prod_{i=1}^{N}\frac{\left(\exp\!\left[\mu^{(v)}_{i(t-1)}\right]\right)^{z_{it}}}{1 + \exp\!\left[\mu^{(v)}_{i(t-1)}\right]}\, PG\!\left(v_{it};\, 1,\, \left|\mu^{(v)}_{i(t-1)}\right|\right)\\
&\times \prod_{j=1}^{J_{it}}\left(z_{it}\frac{e^{\mathbf{q}'_{ijt}\lambda_t}}{1 + e^{\mathbf{q}'_{ijt}\lambda_t}}\right)^{y_{ijt}}\left(1 - z_{it}\frac{e^{\mathbf{q}'_{ijt}\lambda_t}}{1 + e^{\mathbf{q}'_{ijt}\lambda_t}}\right)^{1-y_{ijt}} PG\!\left(w_{ijt};\, 1,\, \left|z_{it}\mathbf{q}'_{ijt}\lambda_t\right|\right)
\end{aligned} \qquad (2\text{–}36)$$

with $\mu^{(v)}_{i(t-1)} = z_{i(t-1)}\delta^{(s)}_{t-1} + \mathbf{x}'_{i(t-1)}\beta^{(c)}_{t-1}$.
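The only nonstandard ingredient in the logit version is drawing the Polya-Gamma variables themselves. Polson et al. give an exact sampler; purely for illustration, the sketch below (ours) uses the defining infinite-sum representation of $PG(1, c)$ with a finite truncation, which is not what one would use in production:

```python
import numpy as np

def rand_pg1(c, rng, K=200, size=1):
    """Approximate PG(1, c) draws via the truncated series representation
    PG(1, c) = (1 / (2 pi^2)) * sum_k g_k / ((k - 1/2)^2 + c^2 / (4 pi^2)),
    with g_k ~ Exp(1).  K terms; exact samplers are preferable in practice."""
    k = np.arange(1, K + 1)
    denom = (k - 0.5) ** 2 + c ** 2 / (4.0 * np.pi ** 2)   # shape (K,)
    g = rng.exponential(1.0, size=(size, K))               # Gamma(1,1) = Exp(1)
    return (g / denom).sum(axis=1) / (2.0 * np.pi ** 2)

rng = np.random.default_rng(42)
draws = rand_pg1(c=1.0, rng=rng, size=20000)
# sanity check against the known mean E[PG(1, c)] = tanh(c/2) / (2c)
print(draws.mean())   # close to tanh(0.5)/2, about 0.231
```

Within the Gibbs sampler, each such draw tilts the Gaussian full conditionals of the regression coefficients, which is what makes the logit updates conjugate.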
The sampling procedure is entirely analogous to that described for the probit version. The full conditional densities derived from Expression 2–36 are described in detail in Appendix A.

2.3.2 Incorporating Spatial Dependence
In this section we describe how an additional layer of complexity, space, can also be accounted for within the same data-augmentation framework. The method we employ to incorporate spatial dependence is a slightly modified version of the traditional approach for spatial generalized linear mixed models (GLMMs), and extends the model proposed by Johnson et al. (2013) for the single-season, closed-population occupancy model.

The traditional approach consists of using spatial random effects to induce a correlation structure among adjacent sites. This formulation, introduced by Besag et al. (1991), assumes that the spatial random effect corresponds to a Gaussian Markov random field (GMRF). The model, known as the spatial GLMM (SGLMM), is used to analyze areal data. It has been applied extensively, given the flexibility of its hierarchical formulation and the availability of software for its implementation (Hughes & Haran, 2013).
Succinctly, the spatial dependence is accounted for in the model by adding a random vector $\eta$ assumed to have a conditionally autoregressive (CAR) prior (also known as a Gaussian Markov random field prior). To define the prior, let the pair $G = (V, E)$ represent the undirected graph for the entire spatial region studied, where $V = (1, 2, \ldots, N)$ denotes the vertices of the graph (sites) and $E$ the set of edges between sites; $E$ consists of elements of the form $(i, j)$, indicating that sites $i$ and $j$ are spatially adjacent, for some $i, j \in V$. The prior for the spatial effects is then characterized by

$$[\eta \mid \tau] \propto \tau^{\operatorname{rank}(\mathbf{Q})/2}\exp\!\left[-\frac{\tau}{2}\eta'\mathbf{Q}\eta\right], \qquad (2\text{–}37)$$
where $\mathbf{Q} = (\operatorname{diag}(\mathbf{A}\mathbf{1}) - \mathbf{A})$ is the precision matrix, with $\mathbf{A}$ denoting the adjacency matrix. The entries of the adjacency matrix $\mathbf{A}$ are such that $\operatorname{diag}(\mathbf{A}) = \mathbf{0}$ and $A_{ij} = I_{\{(i,j)\in E\}}$.

The matrix $\mathbf{Q}$ is singular; hence the probability density defined in Equation 2–37 is improper, i.e., it does not integrate to 1. Regardless of the impropriety of the prior, this model can be fitted using a Bayesian approach, since even if the prior is improper, the posterior for the model parameters is proper. If a constraint such as $\sum_k \eta_k = 0$ is imposed, or if the precision matrix is replaced by a positive definite matrix, the model can also be fitted using a maximum likelihood approach.
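The singularity is easy to see numerically: the constant vector always lies in the null space of the CAR precision. A toy sketch of ours, using a made-up four-site line graph rather than real areal data:

```python
import numpy as np

# toy 4-site line graph 1-2-3-4 (illustrative adjacency, not real data)
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

# CAR precision: Q = diag(A 1) - A  (row sums on the diagonal)
Q = np.diag(A.sum(axis=1)) - A

# Q annihilates the constant vector, which is why the prior in
# Eq. 2-37 is improper; a connected graph loses exactly one rank
print(Q @ np.ones(4))            # -> [0. 0. 0. 0.]
print(np.linalg.matrix_rank(Q))  # -> 3
```

Imposing the sum-to-zero constraint removes precisely this flat direction, which restores propriety for maximum likelihood fitting.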
Assuming that all but the detection process are subject to spatial correlation, and using the notation developed up to this point, the spatially explicit version of the DYMOSS model is characterized by the hierarchy represented by Equations 2–38 and 2–39.

Hence, adding spatial structure into the DYMOSS framework described in the previous section only involves adding the steps to sample $\eta^{(o)}$ and $\{\eta_t\}_{t=2}^{T}$, conditional on all other parameters. Furthermore, the corresponding parameters and spatial random effects of a given component (i.e., occupancy, survival, and colonization) can be effortlessly pooled together into a single parameter vector to perform block sampling. For each of the latent variables, the only modification required is to add the corresponding spatial effect to the linear predictor, so that the latent variables retain their conditional independence given the linear combination of fixed effects and the spatial effects.
State model:
$$\begin{aligned}
z_{i1}\mid\alpha, \eta^{(o)} &\sim \text{Bernoulli}(\psi_{i1}), \quad \text{where } \psi_{i1} = F\!\left(\mathbf{x}^T_{(o)i}\alpha + \eta^{(o)}_i\right)\\
\left[\eta^{(o)}\mid\tau\right] &\propto \tau^{\operatorname{rank}(\mathbf{Q})/2}\exp\!\left[-\frac{\tau}{2}\eta^{(o)\prime}\mathbf{Q}\eta^{(o)}\right]\\
z_{it}\mid z_{i(t-1)}, \delta^{(s)}_{t-1}, \beta^{(c)}_{t-1}, \eta_t &\sim \text{Bernoulli}\!\left(z_{i(t-1)}\theta_{i(t-1)} + \left(1 - z_{i(t-1)}\right)\gamma_{i(t-1)}\right),\\
&\quad\text{where } \theta_{i(t-1)} = F\!\left(\delta^{(s)}_{(t-1)} + \mathbf{x}^T_{i(t-1)}\beta^{(c)}_{t-1} + \eta_{it}\right) \text{ and } \gamma_{i(t-1)} = F\!\left(\mathbf{x}^T_{i(t-1)}\beta^{(c)}_{t-1} + \eta_{it}\right)\\
[\eta_t\mid\tau] &\propto \tau^{\operatorname{rank}(\mathbf{Q})/2}\exp\!\left[-\frac{\tau}{2}\eta'_t\mathbf{Q}\eta_t\right]
\end{aligned} \qquad (2\text{–}38)$$

Observation model:
$$y_{ijt}\mid z_{it}, \lambda_t \sim \text{Bernoulli}\!\left(z_{it}\, p_{ijt}\right), \quad \text{where } p_{ijt} = F(\mathbf{q}^T_{ijt}\lambda_t) \qquad (2\text{–}39)$$
In spite of the popularity of this approach to incorporating spatial dependence, three shortcomings have been reported in the literature (Hughes & Haran, 2013; Reich et al., 2006): (1) model parameters have no clear interpretation, due to spatial confounding of the predictors with the spatial effect; (2) there is variance inflation due to spatial confounding; and (3) the high dimensionality of the latent spatial variables leads to high computational costs. To avoid these difficulties, we follow the approach used by Hughes & Haran (2013), which builds upon the earlier work of Reich et al. (2006). This methodology is summarized in what follows.
Let a vector of spatial effects $\eta$ have the CAR prior given by Equation 2–37 above. Now consider a random vector $\zeta \sim \text{MVN}\!\left(\mathbf{0},\, \tau\mathbf{K}'\mathbf{Q}\mathbf{K}\right)$, with $\mathbf{Q}$ defined as above, where $\tau\mathbf{K}'\mathbf{Q}\mathbf{K}$ corresponds to the precision of the distribution (not the covariance matrix), and with the matrix $\mathbf{K}$ satisfying $\mathbf{K}'\mathbf{K} = \mathbf{I}$.
This last condition implies that the linear predictor can be written $\mathbf{X}\beta + \eta = \mathbf{X}\beta + \mathbf{K}\zeta$. With respect to how the matrix $\mathbf{K}$ is chosen, Hughes & Haran (2013) recommend basing its construction on the spectral decomposition of operator matrices based on Moran's $I$. The Moran operator matrix is defined as $\mathbf{P}^{\perp}\mathbf{A}\mathbf{P}^{\perp}$, with $\mathbf{P}^{\perp} = \mathbf{I} - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'$, and where $\mathbf{A}$ is the adjacency matrix previously described. The choice of the Moran operator is based on the fact that it accounts for the underlying graph while incorporating the spatial structure residual to the design matrix $\mathbf{X}$. These elements are reflected in its spectral decomposition: its eigenvalues correspond to the values of Moran's $I$ statistic (a measure of spatial autocorrelation) for a spatial process orthogonal to $\mathbf{X}$, while its eigenvectors provide the patterns of spatial dependence residual to $\mathbf{X}$. Thus, the matrix $\mathbf{K}$ is chosen to be the matrix whose columns are the eigenvectors of the Moran operator for a particular adjacency matrix.
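The construction of $\mathbf{K}$ is a few lines of linear algebra. The sketch below is ours, on a toy circular adjacency standing in for real areal data, and keeps an illustrative number of eigenvectors:

```python
import numpy as np

rng = np.random.default_rng(0)
N, p, q = 30, 3, 5            # sites, covariates, retained eigenvectors (illustrative)

# toy circular adjacency: each site neighbors the next around a ring
A = np.zeros((N, N))
for i in range(N):
    A[i, (i + 1) % N] = A[(i + 1) % N, i] = 1

X = np.column_stack([np.ones(N), rng.normal(size=(N, p - 1))])

# projection onto the orthogonal complement of the column space of X
P_perp = np.eye(N) - X @ np.linalg.solve(X.T @ X, X.T)

# Moran operator and its spectral decomposition (symmetric, so eigh applies)
M = P_perp @ A @ P_perp
eigvals, eigvecs = np.linalg.eigh(M)
order = np.argsort(eigvals)[::-1]   # largest Moran's I (positive autocorrelation) first
K = eigvecs[:, order[:q]]           # columns of K

# K has orthonormal columns, and its span is orthogonal to X
print(np.allclose(K.T @ K, np.eye(q)))   # -> True
print(np.allclose(X.T @ K, 0))           # -> True
```

Keeping only the leading $q \ll N$ eigenvectors is what delivers the dimension reduction and de-confounding that motivate the Hughes & Haran approach.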
Using this strategy, the new hierarchical formulation of our model is obtained simply by letting $\eta^{(o)} = \mathbf{K}^{(o)}\zeta^{(o)}$ and $\eta_t = \mathbf{K}_t\zeta_t$, with

1. $\zeta^{(o)} \sim \text{MVN}\!\left(\mathbf{0},\, \tau^{(o)}\mathbf{K}^{(o)\prime}\mathbf{Q}\mathbf{K}^{(o)}\right)$, where $\mathbf{K}^{(o)}$ is the eigenvector matrix of $\mathbf{P}^{(o)\perp}\mathbf{A}\mathbf{P}^{(o)\perp}$; and
2. $\zeta_t \sim \text{MVN}\!\left(\mathbf{0},\, \tau_t\mathbf{K}'_t\mathbf{Q}\mathbf{K}_t\right)$, where $\mathbf{K}_t$ is the eigenvector matrix of $\mathbf{P}^{\perp}_t\mathbf{A}\mathbf{P}^{\perp}_t$, for $t = 2, 3, \ldots, T$.
The algorithms for the probit and logit links from Section 2.3.1 can be readily adapted to incorporate the spatial structure, simply by obtaining the joint posteriors for $(\alpha, \zeta^{(o)})$ and $(\beta^{(c)}_{t-1}, \delta^{(s)}_{t-1}, \zeta_t)$ and making the obvious modification of the corresponding linear predictors to incorporate the spatial components.

2.4 Summary
With a few exceptions (Dorazio & Taylor-Rodríguez, 2012; Johnson et al., 2013; Royle & Kéry, 2007), recent Bayesian approaches to site-occupancy modeling with covariates have relied on model configurations (e.g., multivariate normal priors on parameters in the logit scale) that lead to unfamiliar conditional posterior distributions, thus precluding the use of a direct sampling approach. Therefore, the sampling strategies available are based on algorithms (e.g., Metropolis-Hastings) that require tuning and the knowledge to do so correctly.
In Dorazio & Taylor-Rodríguez (2012) we proposed a Bayesian specification for which a Gibbs sampler of the basic occupancy model is available, allowing detection and occupancy probabilities to depend on linear combinations of predictors. This method, described in Section 2.2.1, is based on the data augmentation algorithm of Albert & Chib (1993), in which the full conditional posteriors of the parameters of the probit regression model are cast as latent mixtures of normal random variables. The probit and logit links yield similar results with large sample sizes; however, their results may differ when small to moderate sample sizes are considered, because the logit link function places more mass in the tails of the distribution than the probit link does. In Section 2.2.2 we adapt the method for the single-season model to work with the logit link function.
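The tail-mass claim is easy to verify numerically: matching the logistic to unit variance (scale $\sqrt{3}/\pi$) and comparing two-sided tail probabilities with the standard normal (a quick check of ours, not from the text):

```python
from math import erf, exp, sqrt, pi

def normal_tail(x):
    """P(|Z| > x) for a standard normal Z."""
    return 1.0 - erf(x / sqrt(2.0))

def logistic_tail(x):
    """P(|L| > x) for a logistic scaled to unit variance (scale = sqrt(3)/pi)."""
    s = sqrt(3.0) / pi
    return 2.0 / (1.0 + exp(x / s))

# the logistic tail dominates at every cutoff, and the gap widens with x
for x in (2.0, 3.0, 4.0):
    print(x, normal_tail(x), logistic_tail(x))
```

This is why link choice matters most for sites with extreme linear predictors and small samples: the two links disagree exactly where the data are sparsest.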
The basic occupancy framework is useful, but it assumes a single closed population with fixed probabilities through time. Hence, its assumptions may not be appropriate for problems where the interest lies in the temporal dynamics of the population. We therefore developed a dynamic model that incorporates the notion that occupancy at a previously occupied site takes place through persistence, which depends on both survival and habitat suitability. By this we mean that a site occupied at time $t$ may again be occupied at time $t+1$ if (1) the current settlers survive, (2) the existing settlers perish but new settlers simultaneously colonize, or (3) current settlers survive and new ones colonize during the same season. In our current formulation of the DYMOSS, both colonization and persistence depend on habitat suitability, characterized by $\mathbf{x}'_{i(t-1)}\beta^{(c)}_{t-1}$. They differ only in that persistence is also influenced by whether the site's being occupied during season $t-1$ enhances the suitability of the site or harms it through density dependence.
Additionally, the study of the dynamics that govern the distribution and abundance of biological populations requires an understanding of the physical and biotic processes that act upon them, and these vary in time and space. Consequently, as a final step in this chapter, we described a straightforward strategy to add spatial dependence among neighboring sites to the dynamic metapopulation model. This extension is based on the popular Bayesian spatial modeling technique of Besag et al. (1991), updated using the methods described in Hughes & Haran (2013).
Future steps along these lines are to (1) develop the software necessary to implement the tools described throughout the chapter, and (2) build a suite of additional extensions of this framework for occupancy models. The first extension will incorporate information from different sources, such as tracks, scats, surveys, and direct observations, into a single model. This can be accomplished by adding a layer to the hierarchy in which the source and spatial scale of the data are accounted for. The second extension is a single-season, spatially explicit, multiple-species co-occupancy model. This model will allow studying complex interactions and testing hypotheses about species interactions at a given point in time. Lastly, this co-occupancy model will be adapted to incorporate temporal dynamics in the spirit of the DYMOSS model.
CHAPTER 3
INTRINSIC ANALYSIS FOR OCCUPANCY MODELS

Eliminate all other factors, and the one which remains must be the truth.
–Sherlock Holmes
The Sign of Four
3.1 Introduction
Occupancy models are often used to understand the mechanisms that dictate the distribution of a species. Therefore, variable selection plays a fundamental role in achieving this goal. To the best of our knowledge, "objective" Bayesian alternatives for variable selection have not been put forth for this problem, and with a few exceptions (Hooten & Hobbs, 2014; Link & Barker, 2009), AIC is the method used to choose among competing site-occupancy models. In addition, the procedures currently implemented and accessible to ecologists require enumerating and estimating all of the candidate models (Fiske & Chandler, 2011; Mazerolle & Mazerolle, 2013). In practice, this can be achieved if the model space considered is small enough, which is possible if the choice of the model space is guided by substantial prior knowledge about the underlying ecological processes. Nevertheless, many site-occupancy surveys collect large amounts of covariate information about the sampled sites. Given that the total number of candidate models grows exponentially fast with the number of predictors considered, choosing a reduced set of models guided by ecological intuition becomes increasingly difficult. This is even more so the case in the occupancy model context, where the model space is the Cartesian product of models for presence and models for detection. Given the issues mentioned above, we propose the first objective Bayesian variable selection method for the single-season occupancy model framework. This approach explores the entire model space in a principled manner. It is completely automatic, precluding the need both for tuning parameters in the sampling algorithm and for subjective elicitation of parameter prior distributions.
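The combinatorial explosion from the Cartesian-product structure is worth quantifying: every subset of occupancy covariates can pair with every subset of detection covariates. A one-line count, with illustrative covariate numbers of our choosing:

```python
# number of candidate models in a single-season occupancy setting:
# 2^(occupancy covariates) * 2^(detection covariates)
for p_occ, p_det in [(5, 3), (10, 8), (15, 12)]:
    n_models = 2 ** p_occ * 2 ** p_det
    print(p_occ, p_det, n_models)   # 256, then 262144, then 134217728
```

Even modest surveys therefore put exhaustive enumeration out of reach, which motivates the stochastic search over the model space proposed here.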
As mentioned above, in ecological modeling, if model selection or (less frequently) model averaging is considered, the Akaike information criterion (AIC) (Akaike, 1983), or a version of it, is the measure of choice for comparing candidate models (Fiske & Chandler, 2011; Mazerolle & Mazerolle, 2013). The AIC is designed to find the model whose density is, on average, closest in Kullback-Leibler distance to the density of the true data-generating mechanism; the model with the smallest AIC is selected. However, if nested models are considered, one of them being the true one, generally the AIC will not select it (Wasserman, 2000). Commonly, the model selected by AIC will be more complex than the true one. The reason for this is that the AIC has a weak signal-to-noise ratio, and as such it tends to overfit (Rao & Wu, 2001). Other versions of the AIC provide a bias correction that enhances the signal-to-noise ratio, leading to a stronger penalization for model complexity; some examples are the AICc (Hurvich & Tsai, 1989) and AICu (McQuarrie et al., 1997). However, these are also not consistent for selection, albeit asymptotically efficient (Rao & Wu, 2001).
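For reference, the standard AIC and the Hurvich-Tsai small-sample correction are simple to state; the numbers below are purely illustrative:

```python
def aic(loglik, k):
    """Akaike information criterion: 2k - 2 log L."""
    return 2 * k - 2 * loglik

def aicc(loglik, k, n):
    """Small-sample corrected AIC (Hurvich & Tsai 1989):
    AIC + 2k(k+1) / (n - k - 1)."""
    return aic(loglik, k) + 2 * k * (k + 1) / (n - k - 1)

# with few observations the correction penalizes extra parameters hard
print(aic(-40.0, 3), aicc(-40.0, 3, 20))   # -> 86.0 87.5
print(aic(-40.0, 8), aicc(-40.0, 8, 20))   # second correction is much larger
```

The sharper penalty illustrates the stronger complexity penalization mentioned above, though, as noted, neither criterion is consistent for selection.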
If we are interested in prediction, as opposed to testing, the AIC is certainly appropriate. However, when conducting inference, the use of Bayesian model averaging and selection methods is more fitting. If the true data-generating mechanism is among those considered, asymptotically, Bayesian methods choose the true model with probability one. Conversely, if the true model is not among the alternatives and a suitable parameter prior is used, the posterior probability of the most parsimonious model closest to the true one tends asymptotically to one.
In spite of this, in general, direct elicitation of prior probabilistic statements for Bayesian testing is often impeded because the problems studied may not be sufficiently well understood to make an informed decision about the priors. Conversely, there may be a prohibitively large number of parameters, making specifying priors for each of these parameters an arduous task. In addition, seemingly innocuous subjective choices of the priors on the parameter space may drastically affect test outcomes. This has been a recurring argument in favor of objective Bayesian procedures, which appeal to the use of formal rules to build parameter priors that incorporate the structural information inside the likelihood while utilizing some objective criterion (Kass & Wasserman, 1996).
One popular choice of "objective" prior is the reference prior (Berger & Bernardo, 1992), which is the prior that maximizes the amount of signal extracted from the data. These priors have proven effective, as they are fully automatic and can be frequentist matching, in the sense that the posterior credible interval agrees with the frequentist confidence interval from repeated sampling, with equal coverage probability (Kass & Wasserman, 1996). Reference priors, however, are improper, and while they yield reasonable posterior parameter probabilities, the derived model posterior probabilities may be ill defined. To avoid this shortcoming, Berger & Pericchi (1996) introduced the intrinsic Bayes factor (IBF) for model comparison. Moreno et al. (1998), building on the IBF of Berger & Pericchi (1996), developed a limiting procedure to generate a system of priors that yield well-defined posteriors, even though these priors may sometimes be improper. The IBF is built using a data-dependent prior to automatically generate Bayes factors; however, the extension introduced by Moreno et al. (1998) generates the intrinsic prior by taking a theoretical average over the space of training samples, freeing the prior from data dependence.
In our view, in the face of a large number of predictors, the best alternative is to run a stochastic search algorithm using good "objective" testing parameter priors and to incorporate suitable model priors. This being said, the discussion of model priors is deferred until Chapter 4; this chapter focuses on the priors on the parameter space.

The chapter is structured as follows. First, issues surrounding multimodel inference are described, and insight into objective Bayesian inferential procedures is provided.
Then, building on modern methods for "objective" Bayesian testing to generate priors on the parameter space, the intrinsic priors for the parameters of the occupancy model are derived. These are used in the construction of an algorithm for "objective" model selection tailored to the occupancy model framework. To assess the performance of our methods, we provide results from a simulation study in which distinct scenarios, both favorable and unfavorable, are used to determine the robustness of these tools, and we analyze the Blue Hawker data set, which has been examined previously in the ecological literature (Dorazio & Taylor-Rodríguez, 2012; Kéry et al., 2010).

3.2 Objective Bayesian Inference
As mentioned before, in practice, noninformative priors arising from structural rules are an alternative to subjective elicitation of priors. Some of the rules used in defining noninformative priors include the principle of insufficient reason, parametrization invariance, maximum entropy, geometric arguments, coverage matching, and decision-theoretic approaches (see Kass & Wasserman (1996) for a discussion).

These rules reflect one of two attitudes: (1) noninformative priors either aim to convey unique representations of ignorance, or (2) they attempt to produce probability statements that may be accepted by convention. This latter attitude is in the same spirit as how weights and distances are defined (Kass & Wasserman, 1996), and characterizes the way in which Bayesian reference methods are interpreted today; i.e., noninformative priors are seen to be chosen by convention according to the situation.
A word of caution must be given when using noninformative priors: difficulties arise in their implementation that should not be taken lightly. In particular, these difficulties may occur because noninformative priors are generally improper (meaning that they do not integrate or sum to a finite number), and as such are said to depend on arbitrary constants.

Bayes factors strongly depend upon the prior distributions for the parameters included in each of the models being compared. This can be an important limitation when using noninformative priors, since their introduction results in Bayes factors that are functions of ratios of arbitrary constants, given that these priors are typically improper (see Jeffreys, 1961; Pericchi, 2005; and references therein).
Many different approaches have since been developed to deal with the arbitrary constants that arise when using improper priors. These include the use of partial Bayes factors (Berger & Pericchi, 1996; Good, 1950; Lempers, 1971), setting the ratio of arbitrary constants to a predefined value (Spiegelhalter & Smith, 1982), and approximating the Bayes factor (see Haughton, 1988, as cited in Berger & Pericchi, 1996; Kass & Raftery, 1995; Tierney & Kadane, 1986).

3.2.1 The Intrinsic Methodology
Berger & Pericchi (1996) cleverly dealt with the arbitrary constants that arise when using improper priors by introducing the intrinsic Bayes factor (IBF) procedure. This solution, based on partial Bayes factors, provides the means to replace the improper priors by proper "posterior" priors. The IBF is obtained by combining the model structure with information contained in the observed data. Furthermore, they showed that as the sample size tends to infinity, the intrinsic Bayes factor corresponds to the proper Bayes factor arising from the intrinsic priors.
Intrinsic priors, however, are not unique. The asymptotic correspondence between the IBF and the Bayes factor arising from the intrinsic prior yields two functional equations that are solved by a whole class of intrinsic priors. Because all the priors in the class produce Bayes factors that are only asymptotically equivalent to the IBF, for finite sample sizes the resulting Bayes factor is not unique. To address this issue, Moreno et al. (1998) formalized the methodology through the "limiting procedure". This procedure allows one to obtain a unique Bayes factor, consolidating the method as a valid objective Bayesian model selection procedure, which we will refer to as the Bayes factor for intrinsic priors (BFIP). This result holds for nested models, although the methodology may be extended, with some caution, to nonnested models.
As mentioned before, the Bayesian hypothesis testing procedure is highly sensitive to the parameter-prior specification, and not all priors that are useful for estimation are recommended for hypothesis testing or model selection. Evidence of this is provided by the Jeffreys-Lindley paradox, which states that a point null hypothesis will always be accepted when the variance of a conjugate prior goes to infinity (Robert, 1993). Additionally, when comparing nested models, the null model should correspond to a substantial reduction in complexity from that of the larger alternative models. Hence, priors for the larger alternative models that place probability mass away from the null model are wasteful: if the true model is "far" from the null, it will be easily detected by any statistical procedure. Therefore, the prior on the alternative models should "work harder" at selecting competitive models that are "close" to the null. This principle, known as the Savage continuity condition (Gunel & Dickey, 1974), is widely recognized by statisticians.
Interestingly, the intrinsic prior in correspondence with the BFIP automatically satisfies the Savage continuity condition. That is, when comparing nested models, the intrinsic prior for the more complex model is centered around the null model, and, in spite of arising from a limiting procedure, it is not subject to the Jeffreys-Lindley paradox.
Moreover, beyond the usual pairwise consistency of the Bayes factor for nested models, Casella et al. (2009) show that the corresponding Bayesian procedure with intrinsic priors for variable selection in normal regression is consistent over the entire class of normal linear models, adding an important feature to the list of virtues of the procedure. Consistency of the BFIP for the case where the dimension of the alternative model grows with the sample size is discussed in Moreno et al. (2010).

3.2.2 Mixtures of g-Priors
As previously mentioned, in the Bayesian paradigm a model M ∈ \mathcal{M} is defined by a sampling density and a prior distribution. The sampling density associated with model M is denoted by f(y \mid \beta_M, \sigma^2_M, M), where (\beta_M, \sigma^2_M) is a vector of model-specific unknown parameters. The prior for model M and its corresponding set of parameters is

\pi(\beta_M, \sigma^2_M, M \mid \mathcal{M}) = \pi(\beta_M, \sigma^2_M \mid M, \mathcal{M}) \cdot \pi(M \mid \mathcal{M}).
Objective local priors for the model parameters (\beta_M, \sigma^2_M) are achieved through modifications and extensions of Zellner's g-prior (Liang et al., 2008; Womack et al., 2014). In particular, below we focus on the intrinsic prior and provide some details for other scaled mixtures of g-priors. We defer the discussion of priors over the model space until Chapter 5, where we describe them in detail and develop a few alternatives of our own.

3.2.2.1 Intrinsic priors
An automatic choice of an objective prior is the intrinsic prior (Berger & Pericchi, 1996; Moreno et al., 1998). Because M_B \subseteq M for all M \in \mathcal{M}, the intrinsic prior for (\beta_M, \sigma^2_M) is defined as an expected posterior prior

\pi^I(\beta_M, \sigma^2_M \mid M) = \int p^R(\beta_M, \sigma^2_M \mid \tilde{y}, M)\, m^R(\tilde{y} \mid M_B)\, d\tilde{y},    (3–1)

where \tilde{y} is a minimal training sample for model M, I denotes intrinsic distributions, and R denotes distributions derived from the reference prior \pi^R(\beta_M, \sigma^2_M \mid M) = c_M\, d\beta_M\, d\sigma^2_M / \sigma^2_M. In (3–1), m^R(\tilde{y} \mid M) = \iint f(\tilde{y} \mid \beta_M, \sigma^2_M, M)\, \pi^R(\beta_M, \sigma^2_M \mid M)\, d\beta_M\, d\sigma^2_M is the reference marginal of \tilde{y} under model M, and p^R(\beta_M, \sigma^2_M \mid \tilde{y}, M) = f(\tilde{y} \mid \beta_M, \sigma^2_M, M)\, \pi^R(\beta_M, \sigma^2_M \mid M) / m^R(\tilde{y} \mid M) is the reference posterior density.
In the regression framework, the reference marginal m^R is improper and produces improper intrinsic priors. However, the intrinsic Bayes factor of model M to the base model M_B is well defined and given by

BF^I_{M,M_B}(y) = \left(1 - R^2_M\right)^{-\frac{n - |M_B|}{2}} \times \int_0^1 \left[\frac{n + \sin^2\!\left(\frac{\pi}{2}\theta\right)(|M|+1)}{n + \frac{\sin^2\left(\frac{\pi}{2}\theta\right)(|M|+1)}{1 - R^2_M}}\right]^{\frac{n - |M|}{2}} \left[\frac{\sin^2\!\left(\frac{\pi}{2}\theta\right)(|M|+1)}{n + \frac{\sin^2\left(\frac{\pi}{2}\theta\right)(|M|+1)}{1 - R^2_M}}\right]^{\frac{|M| - |M_B|}{2}} d\theta,    (3–2)

where R^2_M is the coefficient of determination of model M versus model M_B. The Bayes factor between two models M and M' is defined as BF^I_{M,M'}(y) = BF^I_{M,M_B}(y) / BF^I_{M',M_B}(y).
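As a quick numerical sketch (ours, not part of the original text), the one-dimensional integral in (3–2) can be evaluated with standard quadrature; the values of R^2_M, n, and the model sizes used below are hypothetical:

```python
import numpy as np
from scipy.integrate import quad

def intrinsic_bf(r2, n, p_m, p_b):
    """Intrinsic Bayes factor BF^I_{M,M_B}(y) of equation (3-2).

    r2  : coefficient of determination of M versus the base model M_B
    n   : sample size
    p_m : number of parameters |M|;  p_b : number of parameters |M_B|
    """
    def integrand(theta):
        s = np.sin(np.pi * theta / 2.0) ** 2 * (p_m + 1)
        denom = n + s / (1.0 - r2)
        return ((n + s) / denom) ** ((n - p_m) / 2.0) * (s / denom) ** ((p_m - p_b) / 2.0)
    integral, _ = quad(integrand, 0.0, 1.0)
    return (1.0 - r2) ** (-(n - p_b) / 2.0) * integral

def intrinsic_bf_between(r2_m, p_m, r2_mp, p_mp, n, p_b):
    """BF^I_{M,M'}(y) as the ratio of the two Bayes factors to the base model."""
    return intrinsic_bf(r2_m, n, p_m, p_b) / intrinsic_bf(r2_mp, n, p_mp, p_b)
```

As expected for nested comparisons, the factor grows with R^2_M, and it equals one when M coincides with the base model.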
The "goodness" of the model M based on the intrinsic priors is given by its posterior probability

p^I(M \mid y, \mathcal{M}) = \frac{BF^I_{M,M_B}(y)\, \pi(M \mid \mathcal{M})}{\sum_{M' \in \mathcal{M}} BF^I_{M',M_B}(y)\, \pi(M' \mid \mathcal{M})}.    (3–3)

It has been shown that the system of intrinsic priors produces consistent model selection (Casella et al., 2009; Girón et al., 2010). In the context of well-formulated models, the true model M_T is the smallest well-formulated model M \in \mathcal{M} such that \alpha \in M if \beta_\alpha \neq 0. If M_T is the true model, then the posterior probability of model M_T based on equation (3–3) converges to 1.

3.2.2.2 Other mixtures of g-priors
Scaled mixtures of g-priors place a reference prior on (\beta_{M_B}, \sigma^2) and a multivariate normal distribution on \beta in M \setminus M_B; that is, a normal with mean 0 and precision matrix

\frac{q_M w}{n \sigma^2}\, Z_M' (I - H_0) Z_M,

where H_0 is the hat matrix associated with Z_{M_B}. The prior is completed by a prior on w and a choice of scaling q_M, which is set at |M| + 1 to account for the minimal sample size of M. Under these assumptions, the Bayes factor for M to M_B is given by

BF_{M,M_B}(y) = \left(1 - R^2_M\right)^{-\frac{n - |M_B|}{2}} \int \left[\frac{n + w(|M|+1)}{n + \frac{w(|M|+1)}{1 - R^2_M}}\right]^{\frac{n - |M|}{2}} \left[\frac{w(|M|+1)}{n + \frac{w(|M|+1)}{1 - R^2_M}}\right]^{\frac{|M| - |M_B|}{2}} \pi(w)\, dw.

We consider the following priors on w. The intrinsic prior is \pi(w) = \text{Beta}(w \mid 0.5, 0.5), which is only defined for w \in (0, 1). A version of the Zellner-Siow prior is given by w \sim \text{Gamma}(0.5, 0.5), which produces a multivariate Cauchy distribution on \beta. A family of hyper-g priors is defined by \pi(w) \propto w^{-1/2} (\beta + w)^{-(\alpha+1)/2}, which has Cauchy-like tails but produces more shrinkage than the Cauchy prior.
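The connection between the Beta(0.5, 0.5) mixing density and the intrinsic Bayes factor (3–2) can be checked numerically: under the change of variables w = sin²(πθ/2), with θ uniform on (0, 1), w follows the arcsine law, i.e., Beta(0.5, 0.5). A small sketch (seed and sample size are arbitrary):

```python
import numpy as np
from scipy import stats

# Draw theta ~ Uniform(0, 1) and map it through w = sin^2(pi * theta / 2);
# the resulting w should be distributed Beta(0.5, 0.5) (the arcsine law),
# which is why mixing over w with that density recovers the theta-integral
# representation of the intrinsic Bayes factor.
rng = np.random.default_rng(1)
theta = rng.uniform(size=100_000)
w = np.sin(np.pi * theta / 2.0) ** 2

# Kolmogorov-Smirnov comparison against the Beta(0.5, 0.5) cdf
ks = stats.kstest(w, stats.beta(0.5, 0.5).cdf)
print(ks.statistic)  # small values indicate agreement
```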
3.3 Objective Bayes Occupancy Model Selection

As mentioned before, objective Bayesian inferential approaches for ecological models are lacking. In particular, there is a need for suitable objective and automatic Bayesian testing procedures, and for software implementations that explore the model space considered thoroughly. With this goal in mind, in this section we develop an objective, intrinsic, and fully automatic Bayesian model selection methodology for single-season site-occupancy models. We refer to this method as automatic and objective given that its implementation requires no hyperparameter tuning and that it is built using noninformative priors with good testing properties (e.g., intrinsic priors).
An inferential method for the occupancy problem is possible using the intrinsic approach, given that we are able to link intrinsic-Bayesian tools for the normal linear model through our probit formulation of the occupancy model. In other words, because we can represent the single-season probit occupancy model through the hierarchy

y_{ij} \mid z_i, w_{ij} \sim \text{Bernoulli}\!\left(z_i\, I_{\{w_{ij} > 0\}}\right),\qquad w_{ij} \mid \lambda \sim N(q_{ij}'\lambda,\, 1),
z_i \mid v_i \sim \text{Bernoulli}\!\left(I_{\{v_i > 0\}}\right),\qquad v_i \mid \alpha \sim N(x_i'\alpha,\, 1),

it is possible to solve the selection problem on the latent-scale variables w_{ij} and v_i, and to use those results at the level of the occupancy and detection processes.
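A minimal simulation from this hierarchy can be sketched as follows (all dimensions, design matrices, and coefficient values below are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(7)
N, J = 100, 5                              # sites and surveys per site (assumed)
alpha = np.array([0.0, 1.0])               # presence coefficients (hypothetical)
lam = np.array([-0.5, 1.5])                # detection coefficients (hypothetical)

X = np.column_stack([np.ones(N), rng.normal(size=N)])   # site-level design x_i
qcov = rng.normal(size=(N, J))                          # survey-level covariate q_ij

v = X @ alpha + rng.normal(size=N)                      # v_i | alpha ~ N(x_i'alpha, 1)
z = (v > 0).astype(int)                                 # z_i = I(v_i > 0)
w = lam[0] + lam[1] * qcov + rng.normal(size=(N, J))    # w_ij | lambda ~ N(q_ij'lambda, 1)
y = (z[:, None] * (w > 0)).astype(int)                  # y_ij = z_i * I(w_ij > 0)
```

Note that detections can only occur at occupied sites: rows of y with z_i = 0 are identically zero.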
In what follows, we first provide some necessary notation. Then, a derivation of the intrinsic priors for the parameters of the detection and occupancy components is outlined. Using these priors, we obtain the general form of the model posterior probabilities. Finally, the results are incorporated into a model selection algorithm for site-occupancy data. Although priors on the model space are not discussed in this chapter, the software and methods developed have different choices of model priors built in.
3.3.1 Preliminaries

The notation used in Chapter 2 will be considered in this section as well. Namely, presence will be denoted by z, detection by y, their corresponding latent processes by v and w, and the model parameters by \alpha and \lambda. However, some additional notation is also necessary. Let M_0 = \{M_{0y}, M_{0z}\} denote the "base" model, defined by the smallest models considered for the detection and presence processes. The base models M_{0y} and M_{0z} include predictors that must be contained in every model that belongs to the model space. Some examples of base models are the intercept-only model, a model with covariates related to the sampling design, and a model including some predictors important to the researcher that should be included in every model.
Furthermore, let the sets [K_z] = \{1, 2, \ldots, K_z\} and [K_y] = \{1, 2, \ldots, K_y\} index the covariates considered for the variable selection procedure for the presence and detection processes, respectively. That is, these sets index the covariates that can be added to the base models in M_0, or removed from the largest possible models considered, M_{Fz} and M_{Fy}, which we will refer to as the "full" models. The model space can then be represented by the Cartesian product of subsets A_y \subseteq [K_y] and A_z \subseteq [K_z]. The entire model space is populated by models of the form M_A = \{M_{A_y}, M_{A_z}\} \in \mathcal{M} = \mathcal{M}_y \times \mathcal{M}_z, with M_{A_y} \in \mathcal{M}_y and M_{A_z} \in \mathcal{M}_z.
For the presence process z, the design matrix for model M_{A_z} is given by the block matrix X_{A_z} = (X_0 \mid X_{r,A}); X_0 corresponds to the design matrix of the base model (which is such that M_{0z} \subseteq M_{A_z} \in \mathcal{M}_z for all A_z \subseteq [K_z]), and X_{r,A} corresponds to the submatrix that contains the covariates indexed by A_z. Analogously, for the detection process y, the design matrix is given by Q_{A_y} = (Q_0 \mid Q_{r,A}). Similarly, the coefficients for models M_{A_z} and M_{A_y} are given by \alpha_A = (\alpha_0', \alpha_{r,A}')' and \lambda_A = (\lambda_0', \lambda_{r,A}')'.
With these elements in place, the model selection problem consists of finding subsets of covariates, indexed by A = \{A_z, A_y\}, that have a high posterior probability given the detection and occupancy processes. This is equivalent to finding models with high posterior odds when compared to a suitable base model. These posterior odds are given by

\frac{p(M_A \mid y, z)}{p(M_0 \mid y, z)} = \frac{m(y, z \mid M_A)\, \pi(M_A)}{m(y, z \mid M_0)\, \pi(M_0)} = BF_{M_A, M_0}(y, z)\, \frac{\pi(M_A)}{\pi(M_0)}.
Since we are able to represent the occupancy model as a truncation of latent normal variables, it is possible to work through the occupancy model selection problem on the latent normal scale used for the presence and detection processes. We formulate two solutions to this problem: one that depends on the observed and latent components, and another that depends solely on the latent-level variables used to data-augment the problem. We will, however, focus on the latter approach, as this yields a straightforward MCMC sampling scheme. For completeness, the other alternative is described in Section 3.4.
At the root of our objective inferential procedure for occupancy models lies the conditional argument introduced by Womack et al. (work in progress) for simple probit regression. In the occupancy setting, the argument is

p(M_A \mid y, z, w, v) = \frac{m(y, z, v, w \mid M_A)\, \pi(M_A)}{m(y, z, w, v)}
= \frac{f_{yz}(y, z \mid w, v) \left(\int f_{vw}(v, w \mid \alpha, \lambda, M_A)\, \pi_{\alpha\lambda}(\alpha, \lambda \mid M_A)\, d(\alpha, \lambda)\right) \pi(M_A)}{f_{yz}(y, z \mid w, v) \sum_{M^* \in \mathcal{M}} \left(\int f_{vw}(v, w \mid \alpha, \lambda, M^*)\, \pi_{\alpha\lambda}(\alpha, \lambda \mid M^*)\, d(\alpha, \lambda)\right) \pi(M^*)}
= \frac{m(v \mid M_{A_z})\, m(w \mid M_{A_y})\, \pi(M_A)}{m(v)\, m(w)}
\propto m(v \mid M_{A_z})\, m(w \mid M_{A_y})\, \pi(M_A),    (3–4)
where

1. f_{yz}(y, z \mid w, v) = \prod_{i=1}^{N} I_{\{v_i > 0\}}^{z_i}\, I_{\{v_i \le 0\}}^{1 - z_i} \prod_{j=1}^{J_i} \left(z_i I_{\{w_{ij} > 0\}}\right)^{y_{ij}} \left(1 - z_i I_{\{w_{ij} > 0\}}\right)^{1 - y_{ij}},

2. f_{vw}(v, w \mid \alpha, \lambda, M_A) = \underbrace{\left(\prod_{i=1}^{N} \phi(v_i \mid x_i'\alpha_{M_{A_z}},\, 1)\right)}_{f(v \mid \alpha_{r,A}, \alpha_0, M_{A_z})} \underbrace{\left(\prod_{i=1}^{N} \prod_{j=1}^{J_i} \phi(w_{ij} \mid q_{ij}'\lambda_{M_{A_y}},\, 1)\right)}_{f(w \mid \lambda_{r,A}, \lambda_0, M_{A_y})}, and

3. \pi_{\alpha\lambda}(\alpha, \lambda \mid M_A) = \pi_{\alpha}(\alpha \mid M_{A_z})\, \pi_{\lambda}(\lambda \mid M_{A_y}).
This result implies that, once the occupancy and detection indicators are conditioned on the latent processes v and w respectively, the model posterior probabilities depend only on the latent variables. Hence, in this case, the model selection problem is driven by the posterior odds

\frac{p(M_A \mid y, z, w, v)}{p(M_0 \mid y, z, w, v)} = \frac{m(w, v \mid M_A)}{m(w, v \mid M_0)}\, \frac{\pi(M_A)}{\pi(M_0)},    (3–5)

where m(w, v \mid M_A) = m(w \mid M_{A_y}) \cdot m(v \mid M_{A_z}), with

m(v \mid M_{A_z}) = \iint f(v \mid \alpha_{r,A}, \alpha_0, M_{A_z})\, \pi(\alpha_{r,A} \mid \alpha_0, M_{A_z})\, \pi(\alpha_0)\, d\alpha_{r,A}\, d\alpha_0,    (3–6)
m(w \mid M_{A_y}) = \iint f(w \mid \lambda_{r,A}, \lambda_0, M_{A_y})\, \pi(\lambda_{r,A} \mid \lambda_0, M_{A_y})\, \pi(\lambda_0)\, d\lambda_0\, d\lambda_{r,A}.    (3–7)
3.3.2 Intrinsic Priors for the Occupancy Problem

In general, the intrinsic priors, as defined by Moreno et al. (1998), use the functional form of the response to inform their construction, assuming some preliminary prior distribution, proper or improper, on the model parameters. For our purposes, we assume noninformative improper priors for the parameters, denoted by \pi^N(\cdot \mid \cdot). Specifically, the intrinsic priors \pi^{IP}(\theta_{M^*} \mid M^*), for a vector of parameters \theta_{M^*} corresponding to model M^* \in \{M_0, M\} \subset \mathcal{M} and a response vector s with probability density (or mass) function f(s \mid \theta_{M^*}), are defined by

\pi^{IP}(\theta_{M_0} \mid M_0) = \pi^N(\theta_{M_0} \mid M_0),
\pi^{IP}(\theta_M \mid M) = \pi^N(\theta_M \mid M) \int \frac{m(\tilde{s} \mid M_0)}{m(\tilde{s} \mid M)}\, f(\tilde{s} \mid \theta_M, M)\, d\tilde{s},

where \tilde{s} is a theoretical training sample.
In what follows, whenever it is clear from the context, in an attempt to simplify the notation, M_A will be used to refer to M_{A_z} or M_{A_y}, and A will denote A_z or A_y. To derive the parameter priors involved in equations 3–6 and 3–7 using the objective intrinsic prior strategy, we start by assuming flat priors \pi^N(\alpha_A \mid M_A) \propto c_A and \pi^N(\lambda_A \mid M_A) \propto d_A, where c_A and d_A are unknown constants.

The intrinsic prior for the parameters associated with the occupancy process, \alpha_A, conditional on model M_A, is

\pi^{IP}(\alpha_A \mid M_A) = \pi^N(\alpha_A \mid M_A) \int \frac{m(\tilde{v} \mid M_0)}{m(\tilde{v} \mid M_A)}\, f(\tilde{v} \mid \alpha_A, M_A)\, d\tilde{v},
where the marginals m(\tilde{v} \mid M_j), with j \in \{A, 0\}, are obtained by solving the analogue of equation 3–6 for the (theoretical) training sample \tilde{v}. These marginals are given by

m(\tilde{v} \mid M_j) = c_j\, (2\pi)^{-\frac{p_{A_z} - p_j}{2}}\, |\tilde{X}_j'\tilde{X}_j|^{-\frac{1}{2}}\, e^{-\frac{1}{2}\tilde{v}'(I - \tilde{H}_j)\tilde{v}}.

The training sample \tilde{v} has dimension p_{A_z} = |M_{A_z}|, that is, the total number of parameters in model M_{A_z}. Note that, without ambiguity, we use |\cdot| to denote both the cardinality of a set and the determinant of a matrix. The design matrix \tilde{X}_A corresponds to the training sample \tilde{v} and is chosen such that \tilde{X}_A'\tilde{X}_A = \frac{p_{A_z}}{N} X_A'X_A (Leon-Novelo et al., 2012), and \tilde{H}_j is the corresponding hat matrix.
Replacing m(\tilde{v} \mid M_A) and m(\tilde{v} \mid M_0) in \pi^{IP}(\alpha_A \mid M_A), and solving the integral with respect to the theoretical training sample \tilde{v}, we have

\pi^{IP}(\alpha_A \mid M_A) = c_A \int \left((2\pi)^{-\frac{p_{A_z} - p_{0_z}}{2}} \left(\frac{c_0}{c_A}\right) e^{-\frac{1}{2}\tilde{v}'\left((I - \tilde{H}_0) - (I - \tilde{H}_A)\right)\tilde{v}}\, \frac{|\tilde{X}_A'\tilde{X}_A|^{1/2}}{|\tilde{X}_0'\tilde{X}_0|^{1/2}}\right) \times \left((2\pi)^{-\frac{p_{A_z}}{2}} e^{-\frac{1}{2}(\tilde{v} - \tilde{X}_A\alpha_A)'(\tilde{v} - \tilde{X}_A\alpha_A)}\right) d\tilde{v}
= c_0\, (2\pi)^{-\frac{p_{A_z} - p_{0_z}}{2}}\, |\tilde{X}_{r,A}'\tilde{X}_{r,A}|^{\frac{1}{2}}\, 2^{-\frac{p_{A_z} - p_{0_z}}{2}} \exp\left[-\frac{1}{2}\alpha_{r,A}'\left(\frac{1}{2}\tilde{X}_{r,A}'\tilde{X}_{r,A}\right)\alpha_{r,A}\right]
= \pi^N(\alpha_0) \times N\!\left(\alpha_{r,A} \mid 0,\; 2\,(\tilde{X}_{r,A}'\tilde{X}_{r,A})^{-1}\right).    (3–8)
Analogously, the intrinsic prior for the parameters associated with the detection process is

\pi^{IP}(\lambda_A \mid M_A) = d_0\, (2\pi)^{-\frac{p_{A_y} - p_{0_y}}{2}}\, |\tilde{Q}_{r,A}'\tilde{Q}_{r,A}|^{\frac{1}{2}}\, 2^{-\frac{p_{A_y} - p_{0_y}}{2}} \exp\left[-\frac{1}{2}\lambda_{r,A}'\left(\frac{1}{2}\tilde{Q}_{r,A}'\tilde{Q}_{r,A}\right)\lambda_{r,A}\right]
= \pi^N(\lambda_0) \times N\!\left(\lambda_{r,A} \mid 0,\; 2\,(\tilde{Q}_{r,A}'\tilde{Q}_{r,A})^{-1}\right).    (3–9)

In short, the intrinsic priors for \alpha_A = (\alpha_0', \alpha_{r,A}')' and \lambda_A = (\lambda_0', \lambda_{r,A}')' are each the product of a reference prior on the parameters of the base model and a normal density on the parameters indexed by A_z and A_y, respectively.

3.3.3 Model Posterior Probabilities
We now derive the expressions involved in the calculation of the model posterior probabilities. First, recall that p(M_A \mid y, z, w, v) \propto m(w, v \mid M_A)\, \pi(M_A). Hence, determining this posterior probability only requires calculating m(w, v \mid M_A).

Note that, since w and v are independent, obtaining the model posteriors from expression 3–4 reduces to finding closed-form expressions for the marginals m(v \mid M_{A_z}) and m(w \mid M_{A_y}), respectively, from equations 3–6 and 3–7. Therefore,

m(w, v \mid M_A) = \iint f(v, w \mid \alpha, \lambda, M_A)\, \pi^{IP}(\alpha \mid M_{A_z})\, \pi^{IP}(\lambda \mid M_{A_y})\, d\alpha\, d\lambda.    (3–10)
For the latent variable associated with the occupancy process, plugging the parameter intrinsic prior given by 3–8 into equation 3–6 (recalling that \tilde{X}_A'\tilde{X}_A = \frac{p_{A_z}}{N} X_A'X_A) and integrating out \alpha_A yields

m(v \mid M_A) = \iint c_0\, N\!\left(v \mid X_0\alpha_0 + X_{r,A}\alpha_{r,A},\, I\right) N\!\left(\alpha_{r,A} \mid 0,\; 2\,(\tilde{X}_{r,A}'\tilde{X}_{r,A})^{-1}\right) d\alpha_{r,A}\, d\alpha_0
= c_0 (2\pi)^{-n/2} \int \left(\frac{p_{A_z}}{2N + p_{A_z}}\right)^{\frac{p_{A_z} - p_{0_z}}{2}} \exp\left[-\frac{1}{2}(v - X_0\alpha_0)'\left(I - \left(\frac{2N}{2N + p_{A_z}}\right)H_{r,A_z}\right)(v - X_0\alpha_0)\right] d\alpha_0
= c_0\, (2\pi)^{-(n - p_{0_z})/2} \left(\frac{p_{A_z}}{2N + p_{A_z}}\right)^{\frac{p_{A_z} - p_{0_z}}{2}} |X_0'X_0|^{-\frac{1}{2}} \exp\left[-\frac{1}{2} v'\left(I - H_{0_z} - \left(\frac{2N}{2N + p_{A_z}}\right)H_{r,A_z}\right)v\right],    (3–11)

with H_{r,A_z} = H_{A_z} - H_{0_z}, where H_{A_z} is the hat matrix for the entire model M_{A_z} and H_{0_z} is the hat matrix for the base model.
Similarly, the marginal distribution for w is

m(w \mid M_A) = d_0\, (2\pi)^{-(J - p_{0_y})/2} \left(\frac{p_{A_y}}{2J + p_{A_y}}\right)^{\frac{p_{A_y} - p_{0_y}}{2}} |Q_0'Q_0|^{-\frac{1}{2}} \exp\left[-\frac{1}{2} w'\left(I - H_{0_y} - \left(\frac{2J}{2J + p_{A_y}}\right)H_{r,A_y}\right)w\right],    (3–12)

where J = \sum_{i=1}^{N} J_i; in other words, J denotes the total number of surveys conducted.

Now, the marginals for the base model M_0 = \{M_{0_y}, M_{0_z}\} are

m(v \mid M_0) = \int c_0\, N(v \mid X_0\alpha_0,\, I)\, d\alpha_0 = c_0 (2\pi)^{-(n - p_{0_z})/2} |X_0'X_0|^{-1/2} \exp\left[-\frac{1}{2} v'(I - H_{0_z})v\right]    (3–13)

and

m(w \mid M_0) = d_0 (2\pi)^{-(J - p_{0_y})/2} |Q_0'Q_0|^{-1/2} \exp\left[-\frac{1}{2} w'(I - H_{0_y})w\right].    (3–14)
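On the latent scale, the ratio of (3–11) to (3–13) gives the presence-component Bayes factor in closed form, since c_0, the powers of 2π, and |X_0'X_0|^{-1/2} cancel. A sketch of the log Bayes factor (the function name and toy designs are ours; X_{r,A} is any submatrix of extra columns, with H_{r,A_z} computed as H_{A_z} - H_{0_z}):

```python
import numpy as np

def latent_log_bf(v, X0, Xr):
    """log[m(v | M_A) / m(v | M_0)] from equations (3-11) and (3-13)."""
    N = len(v)
    XA = np.column_stack([X0, Xr])
    pA, p0 = XA.shape[1], X0.shape[1]
    H0 = X0 @ np.linalg.solve(X0.T @ X0, X0.T)   # hat matrix of the base model
    HA = XA @ np.linalg.solve(XA.T @ XA, XA.T)   # hat matrix of the full model
    Hr = HA - H0                                 # H_{r,Az} = H_{Az} - H_{0z}
    shrink = 2 * N / (2 * N + pA)
    log_det_term = 0.5 * (pA - p0) * np.log(pA / (2 * N + pA))
    quad_term = 0.5 * shrink * v @ Hr @ v        # difference of the exponents
    return log_det_term + quad_term
```

When v carries no signal beyond the base model, only the (negative) determinant penalty remains; when v aligns with the extra columns, the quadratic term dominates.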
3.3.4 Model Selection Algorithm

Having the parameter intrinsic priors in place, and knowing the form of the model posterior probabilities, it is finally possible to develop a strategy to conduct model selection for the occupancy framework.

For each of the two components of the model (occupancy and detection), the algorithm first draws the set of active predictors (i.e., A_z and A_y) together with their corresponding parameters. This is a reversible jump step, which uses a Metropolis-Hastings correction with proposal distributions given by

q(A_z^* \mid z_o, z_u^{(t)}, v^{(t)}, M_{A_z}) = \frac{1}{2}\left(p\!\left(M_{A_z^*} \mid z_o, z_u^{(t)}, v^{(t)}, \mathcal{M}_z,\; M_{A_z^*} \in L(M_{A_z})\right) + \frac{1}{|L(M_{A_z})|}\right),
q(A_y^* \mid y, z_o, z_u^{(t)}, w^{(t)}, M_{A_y}) = \frac{1}{2}\left(p\!\left(M_{A_y^*} \mid y, z_o, z_u^{(t)}, w^{(t)}, \mathcal{M}_y,\; M_{A_y^*} \in L(M_{A_y})\right) + \frac{1}{|L(M_{A_y})|}\right),    (3–15)

where L(M_{A_z}) and L(M_{A_y}) denote the sets of models obtained by adding or removing one predictor at a time from M_{A_z} and M_{A_y}, respectively.
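The structure of this proposal can be sketched as follows (the helper names and the toy log-posterior are ours, not the dissertation's code): the neighborhood L(A) toggles one predictor at a time, and the proposal is an equal mixture of the local posterior over the neighborhood and a uniform draw from it, as in (3–15).

```python
import numpy as np

def neighborhood(A, K):
    """All index sets obtained from A by adding or removing one of K predictors."""
    return [sorted(set(A) ^ {k}) for k in range(1, K + 1)]

def propose(A, K, log_post, rng):
    """Draw A* from 0.5 * p(. | neighborhood) + 0.5 * uniform(neighborhood)."""
    nbhd = neighborhood(A, K)
    lp = np.array([log_post(B) for B in nbhd])
    p = np.exp(lp - lp.max())
    p /= p.sum()                                 # posterior restricted to L(A)
    probs = 0.5 * p + 0.5 / len(nbhd)            # the mixture in (3-15)
    idx = rng.choice(len(nbhd), p=probs)
    return nbhd[idx], probs[idx]                 # proposal and its probability
```

The uniform half of the mixture keeps every neighbor reachable, which protects the sampler from getting stuck when the local posterior is sharply peaked.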
To promote mixing, this step is followed by an additional draw from the full conditionals of \alpha and \lambda. The densities p(\alpha_0 \mid \cdot), p(\alpha_{r,A} \mid \cdot), p(\lambda_0 \mid \cdot), and p(\lambda_{r,A} \mid \cdot) can be sampled from directly with Gibbs steps. Using the notation a \mid \cdot to denote the random variable a conditioned on all other parameters and on the data, these densities are given by

• \alpha_0 \mid \cdot \sim N\!\left((X_0'X_0)^{-1}X_0'v,\; (X_0'X_0)^{-1}\right),
• \alpha_{r,A} \mid \cdot \sim N\!\left(\mu_{\alpha_{r,A}},\; \Sigma_{\alpha_{r,A}}\right), where the covariance matrix and mean vector are given by \Sigma_{\alpha_{r,A}} = \frac{2N}{2N + p_{A_z}}(X_{r,A}'X_{r,A})^{-1} and \mu_{\alpha_{r,A}} = \Sigma_{\alpha_{r,A}} X_{r,A}'v,
• \lambda_0 \mid \cdot \sim N\!\left((Q_0'Q_0)^{-1}Q_0'w,\; (Q_0'Q_0)^{-1}\right), and
• \lambda_{r,A} \mid \cdot \sim N\!\left(\mu_{\lambda_{r,A}},\; \Sigma_{\lambda_{r,A}}\right), analogously, with covariance matrix and mean given by \Sigma_{\lambda_{r,A}} = \frac{2J}{2J + p_{A_y}}(Q_{r,A}'Q_{r,A})^{-1} and \mu_{\lambda_{r,A}} = \Sigma_{\lambda_{r,A}} Q_{r,A}'w.
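A sketch of the Gibbs draw for the presence-component coefficients (detection draws are analogous, with Q and w in place of X and v; function name and inputs are hypothetical):

```python
import numpy as np

def draw_alpha(v, X0, Xr, rng):
    """One Gibbs draw of (alpha_0, alpha_rA) given the latent vector v."""
    N, pA = len(v), X0.shape[1] + Xr.shape[1]
    # alpha_0 | . ~ N((X0'X0)^{-1} X0'v, (X0'X0)^{-1})
    S0 = np.linalg.inv(X0.T @ X0)
    a0 = rng.multivariate_normal(S0 @ X0.T @ v, S0)
    # alpha_rA | . ~ N(Sr Xr'v, Sr) with Sr = 2N/(2N + p_Az) (Xr'Xr)^{-1}
    Sr = (2 * N / (2 * N + pA)) * np.linalg.inv(Xr.T @ Xr)
    ar = rng.multivariate_normal(Sr @ Xr.T @ v, Sr)
    return a0, ar
```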
Finally, Gibbs sampling steps are also available for the unobserved occupancy indicators z_u and for the corresponding latent variables v and w. The full conditional posterior densities for z_u^{(t+1)}, v^{(t+1)}, and w^{(t+1)} are those introduced in Chapter 2 for the single-season probit model.
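Those full conditionals are truncated normals under the Chapter 2 data augmentation: v_i is N(x_i'\alpha, 1) truncated to (0, ∞) when z_i = 1 and to (−∞, 0] when z_i = 0. A sketch of the update for v (function name ours):

```python
import numpy as np
from scipy.stats import truncnorm

def draw_v(z, mean, rng):
    """Draw v_i ~ N(mean_i, 1) truncated by the sign constraint implied by z_i."""
    # truncnorm takes standardized bounds (lower - loc)/scale, (upper - loc)/scale
    lower = np.where(z == 1, 0.0 - mean, -np.inf)
    upper = np.where(z == 1, np.inf, 0.0 - mean)
    return truncnorm.rvs(lower, upper, loc=mean, scale=1.0, random_state=rng)
```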
The following steps summarize the stochastic search algorithm.

1. Initialize A_y^{(0)}, A_z^{(0)}, z_u^{(0)}, v^{(0)}, w^{(0)}, \alpha_0^{(0)}, \lambda_0^{(0)}.

2. Sample the model indices and corresponding parameters.
(a) Draw simultaneously
• A_z^* \sim q(A_z \mid z_o, z_u^{(t)}, v^{(t)}, M_{A_z}),
• \alpha_0^* \sim p(\alpha_0 \mid M_{A_z^*}, z_o, z_u^{(t)}, v^{(t)}), and
• \alpha_{r,A^*}^* \sim p(\alpha_{r,A} \mid M_{A_z^*}, z_o, z_u^{(t)}, v^{(t)}).
(b) Accept (M_{A_z}^{(t+1)}, \alpha_0^{(t+1),1}, \alpha_{r,A}^{(t+1),1}) = (M_{A_z^*}, \alpha_0^*, \alpha_{r,A^*}^*) with probability

\delta_z = \min\left(1,\; \frac{p(M_{A_z^*} \mid z_o, z_u^{(t)}, v^{(t)})}{p(M_{A_z^{(t)}} \mid z_o, z_u^{(t)}, v^{(t)})}\, \frac{q(A_z^{(t)} \mid z_o, z_u^{(t)}, v^{(t)}, M_{A_z^*})}{q(A_z^* \mid z_o, z_u^{(t)}, v^{(t)}, M_{A_z^{(t)}})}\right);

otherwise, let (M_{A_z}^{(t+1)}, \alpha_0^{(t+1),1}, \alpha_{r,A}^{(t+1),1}) = (M_{A_z^{(t)}}, \alpha_0^{(t),2}, \alpha_{r,A}^{(t),2}).
(c) Draw simultaneously
• A_y^* \sim q(A_y \mid y, z_o, z_u^{(t)}, w^{(t)}, M_{A_y}),
• \lambda_0^* \sim p(\lambda_0 \mid M_{A_y^*}, y, z_o, z_u^{(t)}, w^{(t)}), and
• \lambda_{r,A^*}^* \sim p(\lambda_{r,A} \mid M_{A_y^*}, y, z_o, z_u^{(t)}, w^{(t)}).
(d) Accept (M_{A_y}^{(t+1)}, \lambda_0^{(t+1),1}, \lambda_{r,A}^{(t+1),1}) = (M_{A_y^*}, \lambda_0^*, \lambda_{r,A^*}^*) with probability

\delta_y = \min\left(1,\; \frac{p(M_{A_y^*} \mid y, z_o, z_u^{(t)}, w^{(t)})}{p(M_{A_y^{(t)}} \mid y, z_o, z_u^{(t)}, w^{(t)})}\, \frac{q(A_y^{(t)} \mid y, z_o, z_u^{(t)}, w^{(t)}, M_{A_y^*})}{q(A_y^* \mid y, z_o, z_u^{(t)}, w^{(t)}, M_{A_y^{(t)}})}\right);

otherwise, let (M_{A_y}^{(t+1)}, \lambda_0^{(t+1),1}, \lambda_{r,A}^{(t+1),1}) = (M_{A_y^{(t)}}, \lambda_0^{(t),2}, \lambda_{r,A}^{(t),2}).

3. Sample the base model parameters.
(a) Draw \alpha_0^{(t+1),2} \sim p(\alpha_0 \mid M_{A_z}^{(t+1)}, z_o, z_u^{(t)}, v^{(t)}).
(b) Draw \lambda_0^{(t+1),2} \sim p(\lambda_0 \mid M_{A_y}^{(t+1)}, y, z_o, z_u^{(t)}, w^{(t)}).

4. To improve mixing, resample the model coefficients that are not in the base model but are in M_A.
(a) Draw \alpha_{r,A}^{(t+1),2} \sim p(\alpha_{r,A} \mid M_{A_z}^{(t+1)}, z_o, z_u^{(t)}, v^{(t)}).
(b) Draw \lambda_{r,A}^{(t+1),2} \sim p(\lambda_{r,A} \mid M_{A_y}^{(t+1)}, y, z_o, z_u^{(t)}, w^{(t)}).

5. Sample the latent and missing (unobserved) variables.
(a) Sample z_u^{(t+1)} \sim p(z_u \mid M_{A_z}^{(t+1)}, M_{A_y}^{(t+1)}, y, \alpha_{r,A}^{(t+1),2}, \alpha_0^{(t+1),2}, \lambda_{r,A}^{(t+1),2}, \lambda_0^{(t+1),2}).
(b) Sample v^{(t+1)} \sim p(v \mid M_{A_z}^{(t+1)}, z_o, z_u^{(t+1)}, \alpha_{r,A}^{(t+1),2}, \alpha_0^{(t+1),2}).
(c) Sample w^{(t+1)} \sim p(w \mid M_{A_y}^{(t+1)}, y, z_o, z_u^{(t+1)}, \lambda_{r,A}^{(t+1),2}, \lambda_0^{(t+1),2}).
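The accept/reject decisions in steps 2(b) and 2(d) are usually implemented on the log scale; a minimal sketch (function names ours; the four inputs are the log posterior and log proposal values appearing in \delta_z or \delta_y):

```python
import numpy as np

def log_accept(lp_star, lp_curr, lq_curr_given_star, lq_star_given_curr):
    """log of the Metropolis-Hastings acceptance probability min(1, ratio)."""
    return min(0.0, (lp_star - lp_curr) + (lq_curr_given_star - lq_star_given_curr))

def mh_step(lp_star, lp_curr, lq_cs, lq_sc, rng):
    """Return True if the proposed model is accepted."""
    return np.log(rng.uniform()) < log_accept(lp_star, lp_curr, lq_cs, lq_sc)
```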
3.4 Alternative Formulation

Because the occupancy process is only partially observed, it is reasonable to consider the posterior odds in terms of the observed responses, that is, the detections y and the presences at sites where at least one detection takes place. Partitioning the vector of presences into observed and unobserved components, z = (z_o', z_u')', and integrating out the unobserved component, the model posterior for M_A can be obtained as

p(M_A \mid y, z_o) \propto E_{z_u}\left[m(y, z \mid M_A)\right] \pi(M_A).    (3–16)

Data-augmenting the model in terms of latent normal variables, à la Albert and Chib, the marginals of z and y inside the expectation in equation 3–16, for any model \{M_y, M_z\} = M \in \mathcal{M}, can be expressed in terms of the latent variables:

m(y, z \mid M) = \int_{T(z)} \int_{T(y,z)} m(w, v \mid M)\, dw\, dv = \left(\int_{T(z)} m(v \mid M_z)\, dv\right) \left(\int_{T(y,z)} m(w \mid M_y)\, dw\right),    (3–17)

where T(z) and T(y, z) denote the corresponding truncation regions for v and w, which depend on the values taken by z and y, and

m(v \mid M_z) = \int f(v \mid \alpha, M_z)\, \pi(\alpha \mid M_z)\, d\alpha,    (3–18)
m(w \mid M_y) = \int f(w \mid \lambda, M_y)\, \pi(\lambda \mid M_y)\, d\lambda.    (3–19)

The last equality in equation 3–17 is a consequence of the independence of the latent processes v and w. Using expressions 3–18 and 3–19 allows one to embed this model selection problem in the classical normal linear regression setting, where many "objective" Bayesian inferential tools are available. In particular, these expressions facilitate deriving the parameter intrinsic priors (Berger & Pericchi, 1996; Moreno et al., 1998) for this problem. This approach is an extension of the one implemented in Leon-Novelo et al. (2012) for the simple probit regression problem.
Using this alternative approach, all that is left is to integrate m(v \mid M_A) and m(w \mid M_A) over their corresponding truncation regions T(z) and T(y, z), which yields m(y, z \mid M_A), and then to obtain the expectation with respect to the unobserved z's. Note, however, that two issues arise. First, such integrals are not available in closed form. Second, calculating the expectation over the limits of integration further complicates things. To address these difficulties, it is possible to express E_{z_u}[m(y, z \mid M_A)] as

E_{z_u}\left[m(y, z \mid M_A)\right] = E_{z_u}\left[\left(\int_{T(z)} m(v \mid M_{A_z})\, dv\right) \left(\int_{T(y,z)} m(w \mid M_{A_y})\, dw\right)\right]
= E_{z_u}\left[\left(\int_{T(z)} \int m(v \mid M_{A_z}, \alpha_0)\, \pi^{IP}(\alpha_0 \mid M_{A_z})\, d\alpha_0\, dv\right) \times \left(\int_{T(y,z)} \int m(w \mid M_{A_y}, \lambda_0)\, \pi^{IP}(\lambda_0 \mid M_{A_y})\, d\lambda_0\, dw\right)\right]
= E_{z_u}\left[\int \underbrace{\left(\int_{T(z)} m(v \mid M_{A_z}, \alpha_0)\, dv\right)}_{g_1(T(z) \mid M_{A_z}, \alpha_0)} \pi^{IP}(\alpha_0 \mid M_{A_z})\, d\alpha_0 \times \int \underbrace{\left(\int_{T(y,z)} m(w \mid M_{A_y}, \lambda_0)\, dw\right)}_{g_2(T(y,z) \mid M_{A_y}, \lambda_0)} \pi^{IP}(\lambda_0 \mid M_{A_y})\, d\lambda_0\right]
= c_0\, d_0 \iint E_{z_u}\left[g_1(T(z) \mid M_{A_z}, \alpha_0)\, g_2(T(y, z) \mid M_{A_y}, \lambda_0)\right] d\alpha_0\, d\lambda_0,    (3–20)

where the last equality follows from Fubini's theorem, since m(v \mid M_{A_z}, \alpha_0) and m(w \mid M_{A_y}, \lambda_0) are proper densities. From 3–20, the posterior odds are

\frac{p(M_A \mid y, z_o)}{p(M_0 \mid y, z_o)} = \frac{\iint E_{z_u}\left[g_1(T(z) \mid M_{A_z}, \alpha_0)\, g_2(T(y,z) \mid M_{A_y}, \lambda_0)\right] d\alpha_0\, d\lambda_0}{\iint E_{z_u}\left[g_1(T(z) \mid M_{0_z}, \alpha_0)\, g_2(T(y,z) \mid M_{0_y}, \lambda_0)\right] d\alpha_0\, d\lambda_0}\, \frac{\pi(M_A)}{\pi(M_0)}.    (3–21)
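To make g_1 concrete: for the base model with \alpha_0 fixed, m(v \mid M_{0_z}, \alpha_0) is a N(X_0\alpha_0, I) density, so the inner truncation-region integral in (3–20) factors over sites into probit probabilities. A sketch (function name ours):

```python
import numpy as np
from scipy.stats import norm

def g1_base(z, X0, alpha0):
    """g_1(T(z) | M_0z, alpha_0): probability that v falls in the region T(z)."""
    p = norm.cdf(X0 @ alpha0)                      # P(v_i > 0) for each site
    return np.prod(np.where(z == 1, p, 1.0 - p))   # product over sites
```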
3.5 Simulation Experiments

The proposed methodology was tested under 36 different scenarios, in which we evaluate the behavior of the algorithm by varying the number of sites, the number of surveys, the amount of signal in the predictors for the presence component, and the amount of signal in the predictors for the detection component.

For each model component, the base model is taken to be the intercept-only model, and the full models considered for the presence and the detection components have, respectively, 30 and 20 predictors. Therefore, the model space contains 2^30 × 2^20 ≈ 1.12 × 10^15 candidate models.

To control the amount of signal in the presence and detection components, values for the model parameters were purposefully chosen so that quantiles 10, 50, and 90 of the occupancy and detection probabilities match pre-specified values. Because presence and detection are binary variables, the amount of signal in each model component is associated with the spread and center of the distribution of the occupancy and detection probabilities, respectively. Low signal levels correspond to occupancy or detection probabilities close to 0.5; high signal levels correspond to probabilities close to 0 or 1. A larger spread of the distribution of the occupancy and detection probabilities reflects greater heterogeneity among the observations collected, improving the discrimination capability of the model, and vice versa.
Therefore, for the presence component, the parameter values of the true model were chosen to set the median of the occupancy probabilities equal to 0.5. The chosen parameter values also fix quantiles 10 and 90 symmetrically about 0.5, at small (Q^z_10 = 0.3, Q^z_90 = 0.7), intermediate (Q^z_10 = 0.2, Q^z_90 = 0.8), and large (Q^z_10 = 0.1, Q^z_90 = 0.9) distances. For the detection component, the model parameters are chosen to reflect detection probabilities concentrated about low values (Q^y_50 = 0.2), intermediate values (Q^y_50 = 0.5), and high values (Q^y_50 = 0.8), while keeping quantiles 10 and 90 fixed at 0.1 and 0.9, respectively.
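To illustrate the calibration idea in the simplest symmetric case (this is our sketch, not the dissertation's actual construction): with a single standard normal covariate x and occupancy probability Φ(a + b·x), the pair (a, b) can be solved so that the 10th/50th/90th percentiles of the occupancy probabilities hit the "small distance" targets (0.3, 0.5, 0.7):

```python
import numpy as np
from scipy.stats import norm

q10, q50, q90 = 0.3, 0.5, 0.7               # target probability quantiles
a = norm.ppf(q50)                            # median: Phi(a) = 0.5, so a = 0
b = (norm.ppf(q90) - a) / norm.ppf(0.9)      # match the 90th percentile

# by symmetry of the targets about 0.5, the 10th percentile is matched too
check = norm.cdf(a + b * norm.ppf(0.1))
```

With asymmetric targets (as in the detection settings), matching all three quantiles requires richer covariate configurations than this one-coefficient sketch.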
Table 3-1. Simulation control parameters, occupancy model selector.

Parameter | Values considered
N | 50, 100
J | 3, 5
(Q^z_10, Q^z_50, Q^z_90) | (0.3, 0.5, 0.7), (0.2, 0.5, 0.8), (0.1, 0.5, 0.9)
(Q^y_10, Q^y_50, Q^y_90) | (0.1, 0.2, 0.9), (0.1, 0.5, 0.9), (0.1, 0.8, 0.9)
There are in total 36 scenarios; these result from crossing all the levels of the simulation control parameters (Table 3-1). Under each of these scenarios, 20 data sets were generated at random. True presence and detection indicators were generated with the probit model formulation from Chapter 2, with the assumed true models M_{T_z} = \{1, x_2, x_{15}, x_{16}, x_{22}, x_{28}\} for the presence and M_{T_y} = \{1, q_7, q_{10}, q_{12}, q_{17}\} for the detection, with the predictors included in the randomly generated data sets. In this context, 1 represents the intercept term. Throughout this section, we refer to predictors included in the true models as true predictors, and to those absent from them as false predictors.

The selection procedure was conducted using each one of these data sets with two different priors on the model space: the uniform (equal probability) prior and a multiplicity correcting prior.
The results are summarized through the marginal posterior inclusion probabilities (MPIPs) for each predictor, and also through the five highest posterior probability models (HPMs). The MPIP for a given predictor, under a specific scenario and for a particular data set, is defined as

p(\text{predictor is included} \mid y, z, w, v) = \sum_{M \in \mathcal{M}} I_{(\text{predictor} \in M)}\, p(M \mid y, z, w, v, \mathcal{M}).    (3–22)
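In practice, (3–22) is estimated from MCMC output as the fraction of sampled models that include each predictor; a sketch (function name and inputs ours):

```python
import numpy as np

def mpip(model_draws, K):
    """Estimate MPIPs from a sampler trace.

    model_draws : list of index sets (1-based predictor indices) visited
    K           : total number of candidate predictors
    """
    draws = np.zeros((len(model_draws), K))
    for t, A in enumerate(model_draws):
        draws[t, [k - 1 for k in A]] = 1.0   # inclusion indicators at draw t
    return draws.mean(axis=0)                # average over the trace
```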
In addition, we compare the MPIP odds between predictors present in the true model and predictors absent from it. Specifically, we consider the minimum odds of marginal posterior inclusion probabilities between true and false predictors. Let \tilde{\xi} and \xi denote, respectively, a predictor in the true model M_T and a predictor absent from M_T. We define the minimum MPIP odds between the probabilities of true and false predictors as

\text{minOdds}_{MPIP} = \frac{\min_{\tilde{\xi} \in M_T} p(I_{\tilde{\xi}} = 1 \mid \tilde{\xi} \in M_T)}{\max_{\xi \notin M_T} p(I_{\xi} = 1 \mid \xi \notin M_T)}.    (3–23)
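A sketch of this summary (function name ours; mpips would come from an MPIP estimate such as (3–22)):

```python
import numpy as np

def min_odds_mpip(mpips, true_idx):
    """Ratio of the smallest MPIP among true predictors to the largest
    MPIP among false predictors, as in (3-23).

    mpips    : array of inclusion probabilities, one per predictor
    true_idx : 0-based indices of the predictors in the true model
    """
    true_mask = np.zeros(len(mpips), dtype=bool)
    true_mask[list(true_idx)] = True
    return mpips[true_mask].min() / mpips[~true_mask].max()
```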
If the variable selection procedure adequately discriminates between true and false predictors, minOdds_MPIP takes values larger than one. The ability of the method to discriminate between the least probable true predictor and the most probable false predictor worsens as the indicator approaches 0.

3.5.1 Marginal Posterior Inclusion Probabilities for Model Predictors
For clarity, in Figures 3-1 through 3-5 only the predictors in the true models are labeled, and they are emphasized with a dotted line passing through them. The left-hand-side plots in these figures contain the results for the presence component, and those on the right correspond to predictors in the detection component. The results obtained with the uniform model prior correspond to the black lines, and those for the multiplicity correcting prior are in red. In these figures, the MPIPs have been averaged over all data sets from the scenarios matching the indicated condition.
In Figure 3-1, we contrast the mean MPIPs of the predictors over all data sets from scenarios with 50 sites to the mean MPIPs obtained for the scenarios with 100 sites. Similarly, Figure 3-2 compares the mean MPIPs of scenarios where 3 surveys are performed to those of scenarios having 5 surveys per site. Figures 3-4 and 3-5 show the effect of the different levels of signal considered in the occupancy probabilities and in the detection probabilities.
From these figures, three main results can be drawn: (1) the effect of the model prior is substantial; (2) the proposed methods yield MPIPs that clearly separate true predictors from false predictors; and (3) the separation between the MPIPs of true and false predictors is noticeably larger in the detection component.
Regardless of the simulation scenario and model component observed, under the uniform prior false predictors obtain a relatively high MPIP. Conversely, the multiplicity correction prior strongly shrinks the MPIP for false predictors towards 0. In the presence component, the MPIP for the true predictors is shrunk substantially under the multiplicity prior; however, there remains a clear separation between true and false predictors. In contrast, in the detection component, the MPIP for true predictors remains relatively high (Figures 3-1 through 3-5).
[Figure 3-1. Predictor MPIP averaged over scenarios with N = 50 and N = 100 sites, using uniform (U) and multiplicity correction (MC) priors.]
[Figure 3-2. Predictor MPIP averaged over scenarios with J = 3 and J = 5 surveys per site, using uniform (U) and multiplicity correction (MC) priors.]
[Figure 3-3. Predictor MPIP averaged over scenarios with the interaction between the number of sites and the surveys per site, using uniform (U) and multiplicity correction (MC) priors.]
[Figure 3-4. Predictor MPIP averaged over scenarios with equal signal in the occupancy probabilities, using uniform (U) and multiplicity correction (MC) priors.]
[Figure 3-5. Predictor MPIP averaged over scenarios with equal signal in the detection probabilities, using uniform (U) and multiplicity correction (MC) priors.]
In scenarios where more sites were surveyed, the separation between the MPIP of true and false predictors grew in both model components (Figure 3-1). Increasing the number of sites affects both components, given that every time a new site is included, covariate information is added to the design matrices of both the presence and the detection components.
On the other hand, increasing the number of surveys affects the MPIP of predictors in the detection component (Figures 3-2 and 3-3), but has only a marginal effect on predictors in the presence component. This may appear counterintuitive; however, increasing the number of surveys only increases the number of observations in the design matrix for the detection component, while leaving the design matrix for the presence component unaltered. The small changes observed in the MPIP for the presence predictors as J increases are exclusively a result of having additional detection indicators equal to 1 at sites that, with fewer surveys, would only have 0-valued detections.
From Figure 3-3 it is clear that, for the presence component, the effect of the number of sites dominates the behavior of the MPIP, especially when using the multiplicity correction priors. In the detection component, the MPIP is influenced by both the number of sites and the number of surveys. The influence of increasing the number of surveys is larger when the number of sites is smaller, and vice versa.
Regarding the effect of the distribution of the occupancy probabilities, we observe that mostly the detection component is affected. Discrimination between true and false predictors is stronger as the distribution becomes more variable (Figure 3-4). This is consistent with intuition: having the presence probabilities more concentrated about 0.5 implies that the predictors do not vary much from one site to the next, whereas having the occupancy probabilities more spread out has the opposite effect.
Finally, consider concentrating the detection probabilities about high or low values. For predictors in the detection component, the separation between the MPIP of true and false predictors is larger in scenarios where the distribution of the detection probability is centered about 0.2 or 0.8 than in those where this distribution is centered about 0.5 (where the signal of the predictors is weakest). For predictors in the presence component, having the detection probabilities centered at higher values slightly increases the inclusion probabilities of the true predictors (Figure 3-5) and reduces those of false predictors.
Table 3-2. Comparison of average minOddsMPIP under scenarios having different numbers of sites (N=50, N=100) and under scenarios having different numbers of surveys per site (J=3, J=5), for the presence and detection components, using uniform and multiplicity correction priors.

                        Sites             Surveys
Comp        π(M)    N=50    N=100     J=3     J=5
Presence    Unif    1.12    1.31      1.19    1.24
            MC      3.20    8.46      4.20    6.74
Detection   Unif    2.03    2.64      2.11    2.57
            MC      21.15   32.46     21.39   32.52
Table 3-3. Comparison of average minOddsMPIP for the different levels of signal considered in the occupancy and detection probabilities, for the presence and detection components, using uniform and multiplicity correction priors. The first three columns vary the occupancy quantiles (Qz10, Qz50, Qz90); the last three vary the detection quantiles (Qy10, Qy50, Qy90).

Comp        π(M)    (0.3,0.5,0.7)  (0.2,0.5,0.8)  (0.1,0.5,0.9)  (0.1,0.2,0.9)  (0.1,0.5,0.9)  (0.1,0.8,0.9)
Presence    Unif    1.05           1.20           1.34           1.10           1.23           1.24
            MC      2.02           4.55           8.05           2.38           6.19           6.40
Detection   Unif    2.34           2.34           2.30           2.57           2.00           2.38
            MC      25.37          20.77          25.28          29.33          18.52          28.49
The separation between the MPIP of true and false predictors is even more evident in Tables 3-2 and 3-3, where the minimum MPIP odds between true and false predictors are shown. Under every scenario, the value of the minOddsMPIP (as defined in 3-23) was greater than 1, implying that, on average, even the lowest MPIP for a true predictor is higher than the maximum MPIP for a false predictor. In both components of the model, the minOddsMPIP are markedly larger under the multiplicity correction prior and increase with the number of sites and with the number of surveys.
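As a small sketch (function name and MPIP values hypothetical, not taken from the simulation output), the minOddsMPIP of a scenario can be computed as the ratio of the smallest MPIP among true predictors to the largest MPIP among false predictors:

```python
def min_odds_mpip(mpip, true_predictors):
    """Ratio of the smallest MPIP among true predictors to the largest
    MPIP among false predictors; values above 1 mean every true
    predictor outranks every false one."""
    true_vals = [p for name, p in mpip.items() if name in true_predictors]
    false_vals = [p for name, p in mpip.items() if name not in true_predictors]
    return min(true_vals) / max(false_vals)

# Hypothetical MPIPs for one simulated scenario
mpip = {"x2": 0.91, "x15": 0.88, "x22": 0.12, "x28": 0.09}
print(min_odds_mpip(mpip, {"x2", "x15"}))  # 0.88 / 0.12, i.e. well above 1
```

A scenario's average minOddsMPIP (as in Tables 3-2 and 3-3) would then average this quantity over replicated data sets.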
For the presence component, increasing the signal in the occupancy probabilities or having the detection probabilities concentrated about higher values has a positive and considerable effect on the magnitude of the odds. For the detection component, these odds are particularly high, especially under the multiplicity correction prior. Also, having the distribution of the detection probabilities centered about low or high values increases the minOddsMPIP.

3.5.2 Summary Statistics for the Highest Posterior Probability Model
Tables 3-4 through 3-7 show the number of true predictors included in the HPM (True +) and the number of false predictors excluded from it (True −). The mean percentages observed in these tables provide one clear message: the highest probability models chosen with either model prior commonly differ from the corresponding true models. The multiplicity correction prior's strong shrinkage allows only a few true predictors to be selected, but at the same time it prevents any false predictors from being included in the HPM. On the other hand, the uniform prior includes in the HPM a larger proportion of true predictors, but at the expense of also introducing a large number of false predictors. This situation is exacerbated in the presence component, but also occurs to a lesser extent in the detection component.
Table 3-4. Comparison between scenarios with 50 and 100 sites in terms of the average proportion of true positive and true negative terms over the highest probability models for the presence and the detection components, using uniform and multiplicity correcting priors on the model space.

                         True +             True −
Comp        π(M)    N=50     N=100     N=50     N=100
Presence    Unif    0.57     0.63      0.51     0.55
            MC      0.06     0.13      1.00     1.00
Detection   Unif    0.77     0.85      0.87     0.93
            MC      0.49     0.70      1.00     1.00
Having more sites or surveys improves the inclusion of true predictors and the exclusion of false ones in the HPM for both the presence and detection components (Tables 3-4 and 3-5). On the other hand, if the distribution of the occupancy probabilities is more
Table 3-5. Comparison between scenarios with 3 and 5 surveys per site in terms of the proportion of true positive and true negative predictors averaged over the highest probability models for the presence and the detection components, using uniform and multiplicity correcting priors on the model space.

                         True +           True −
Comp        π(M)    J=3      J=5      J=3      J=5
Presence    Unif    0.59     0.61     0.52     0.54
            MC      0.08     0.10     1.00     1.00
Detection   Unif    0.78     0.85     0.87     0.92
            MC      0.50     0.68     1.00     1.00
spread out, the HPM includes more true predictors and fewer false ones in the presence component. In contrast, the effect of the spread of the occupancy probabilities on the detection HPM is negligible (Table 3-6). Finally, there is a positive relationship between the location of the median of the detection probabilities and the number of correctly classified true and false predictors for the presence component. The HPM in the detection part of the model responds positively to low and high values of the median detection probability (increased signal levels) in terms of correctly classified true and false predictors (Table 3-7).
Table 3-6. Comparison between scenarios with different levels of signal in the occupancy component in terms of the proportion of true positive and true negative predictors averaged over the highest probability models for the presence and the detection components, using uniform and multiplicity correcting priors on the model space.

                         True +                                        True −
Comp        π(M)    (0.3,0.5,0.7)  (0.2,0.5,0.8)  (0.1,0.5,0.9)   (0.3,0.5,0.7)  (0.2,0.5,0.8)  (0.1,0.5,0.9)
Presence    Unif    0.55           0.61           0.64            0.50           0.54           0.55
            MC      0.02           0.08           0.18            1.00           1.00           1.00
Detection   Unif    0.81           0.82           0.81            0.90           0.89           0.89
            MC      0.57           0.61           0.59            1.00           1.00           1.00
3.6 Case Study: Blue Hawker Data Analysis

During 1999 and 2000, an intensive volunteer surveying effort coordinated by the Centre Suisse de Cartographie de la Faune (CSCF) was conducted to analyze the distribution of the blue hawker, Aeshna cyanea (Odonata: Aeshnidae), a common dragonfly in Switzerland. Given that Switzerland is a small and mountainous country,
Table 3-7. Comparison between scenarios with different levels of signal in the detection component in terms of the proportion of true positive and true negative predictors averaged over the highest probability models for the presence and the detection components, using uniform and multiplicity correcting priors on the model space.

                         True +                                        True −
Comp        π(M)    (0.1,0.2,0.9)  (0.1,0.5,0.9)  (0.1,0.8,0.9)   (0.1,0.2,0.9)  (0.1,0.5,0.9)  (0.1,0.8,0.9)
Presence    Unif    0.59           0.59           0.62            0.51           0.54           0.54
            MC      0.06           0.10           0.11            1.00           1.00           1.00
Detection   Unif    0.89           0.77           0.78            0.91           0.87           0.91
            MC      0.70           0.48           0.59            1.00           1.00           1.00
there is large variation in its topography and physio-geography; as such, elevation is a good candidate covariate to predict species occurrence at a large spatial scale. It can be used as a proxy for habitat type, intensity of land use, temperature, as well as some biotic factors (Kery et al 2010).
Repeated visits to 1-ha pixels took place to obtain the corresponding detection history. In addition to the survey outcome, the x- and y-coordinates, the thermal level, the date of the survey, and the elevation were recorded. Surveys were restricted to the known flight period of the blue hawker, which takes place between May 1 and October 10. In total, 2,572 sites were surveyed at least once during the surveying period. The number of surveys per site ranged from 1 to 22 within each survey year.
Kery et al (2010) summarize the results of this effort using AIC-based model comparisons: first by following a backwards elimination approach for the detection process while keeping the occupancy component fixed at the most complex model, and then, for the presence component, by choosing among a group of three models while using the detection model already chosen. In our analysis of this dataset, for the detection and the presence components we consider as full models those used in Kery et al (2010), namely

Φ⁻¹(ψ) = α0 + α1 year + α2 elev + α3 elev² + α4 elev³

Φ⁻¹(p) = λ0 + λ1 year + λ2 elev + λ3 elev² + λ4 elev³ + λ5 date + λ6 date²
where year = I(year = 2000).
The model spaces for these data contain 2^6 = 64 and 2^4 = 16 models, respectively, for the detection and occupancy components. That is, in total, the model space contains 2^(4+6) = 1,024 models. Although this model space can be enumerated entirely, for illustration we implemented the algorithm from Section 3.3.4, generating 10,000 draws from the Gibbs sampler. Each of the sampled models was chosen from the set of models reachable by changing the state of a single term in the current model (to inclusion or exclusion, accordingly). This allows a more thorough exploration of the model space because, for each of the 10,000 models drawn, the posterior probabilities of many more models can be observed. Below, the predictor labels are suffixed with either "z" or "y" to indicate the component they pertain to. Finally, using the results from the model selection procedure, we conducted a validation step to determine the predictive accuracy of the HPMs and of the median probability models (MPMs). The performance of these models is then contrasted with that of the model ultimately selected by Kery et al (2010).

3.6.1 Results: Variable Selection Procedure
The model finally chosen for the presence component in Kery et al (2010) was not found among the five highest probability models under either model prior (Table 3-8). Moreover, the year indicator was never chosen under the multiplicity correcting prior, hinting that this term might be a falsely identified predictor under the uniform prior. Results in Table 3-10 support this claim: the marginal posterior inclusion probability for the year predictor is 7% under the multiplicity correction prior. The multiplicity correction prior concentrates the model posterior probability mass more densely in the highest ranked models (90% of the mass is in the top five models) than the uniform prior (under which these models account for 40% of the mass).
For the detection component, the HPM under both priors is the intercept-only model, which we represent in Table 3-9 with a blank label. In both cases this model obtains very
Table 3-8. Posterior probability of the five highest probability models in the presence component of the blue hawker data.

Uniform model prior
Rank   Mz selected               p(Mz|y)
1      yrz+elevz                 0.10
2      yrz+elevz+elevz3          0.08
3      elevz2+elevz3             0.08
4      yrz+elevz2                0.07
5      yrz+elevz3                0.07

Multiplicity correcting model prior
Rank   Mz selected               p(Mz|y)
1      elevz+elevz3              0.53
2      (intercept only)          0.15
3      elevz+elevz2              0.09
4      elevz2                    0.06
5      elevz+elevz2+elevz3       0.05
high posterior probabilities. The terms contained in the cubic polynomial for the elevation appear to contain some relevant information; however, this conflicts with the MPIPs observed in Table 3-11, which under both model priors are relatively low (< 20% with the uniform and ≤ 4% with the multiplicity correcting prior).
Table 3-9. Posterior probability of the five highest probability models in the detection component of the blue hawker data.

Uniform model prior
Rank   My selected        p(My|y)
1      (intercept only)   0.45
2      elevy3             0.06
3      elevy2             0.05
4      elevy              0.05
5      yry                0.04

Multiplicity correcting model prior
Rank   My selected        p(My|y)
1      (intercept only)   0.86
2      elevy3             0.02
3      datey2             0.02
4      elevy2             0.02
5      yry                0.02
Finally, it is possible to use the MPIPs to obtain the median probability model (MPM), which contains the terms that have a MPIP higher than 50%. For the occupancy process (Table 3-10), under the uniform prior, the year, the elevation, and the elevation cubed are included. The MPM with the multiplicity correction prior coincides with the HPM under this prior. The MPM chosen for the detection component (Table 3-11) under both priors is the intercept-only model, again coinciding with the HPM.
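The MPM rule can be sketched in a few lines of Python (a hypothetical helper, not code from this work). A ≥ comparison is used so that the borderline 0.50 MPIP of elevz3 under the uniform prior (Table 3-10) counts as included, matching the model described in the text:

```python
def median_probability_model(mpip, threshold=0.5):
    """Terms whose marginal posterior inclusion probability reaches
    the threshold (the median probability model of the text)."""
    return {name for name, p in mpip.items() if p >= threshold}

# MPIPs for the presence component under the uniform prior (Table 3-10)
mpip_unif = {"yrz": 0.53, "elevz": 0.51, "elevz2": 0.45, "elevz3": 0.50}
print(sorted(median_probability_model(mpip_unif)))
# ['elevz', 'elevz3', 'yrz'] -- elevz2 (0.45) is excluded
```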
Given the outcomes of the simulation studies from Section 3.5, especially those pertaining to the detection component, the results in Table 3-11 appear to indicate that none of the predictors considered belong to the true model, especially when considering
Table 3-10. MPIP, presence component.

Predictor    p(predictor ∈ MTz | y, z, w, v)
             Unif      MultCorr
yrz          0.53      0.07
elevz        0.51      0.73
elevz2       0.45      0.23
elevz3       0.50      0.67

Table 3-11. MPIP, detection component.

Predictor    p(predictor ∈ MTy | y, z, w, v)
             Unif      MultCorr
yry          0.19      0.03
elevy        0.18      0.03
elevy2       0.18      0.03
elevy3       0.19      0.04
datey        0.16      0.03
datey2       0.15      0.04
those derived with the multiplicity correction prior. On the other hand, for the presence component (Table 3-10), there is an indication that terms related to the cubic polynomial in elevz can explain the occupancy patterns.

3.6.2 Validation for the Selection Procedure
Approximately half of the sites were selected at random for training (i.e., for model selection and parameter estimation), and the remaining half were used as test data. In the previous section we observed that, using the marginal posterior inclusion probabilities of the predictors, our method effectively separates predictors in the true model from those that are not in it. However, in Tables 3-10 and 3-11 this separation is only clear for the presence component using the multiplicity correction prior.
Therefore, in the validation procedure we observe the misclassification rates for the detections using the following models: (1) the model ultimately recommended in Kery et al (2010) (yrz+elevz+elevz2+elevz3 + elevy+elevy2+datey+datey2); (2) the highest probability model (HPM) with a uniform prior (yrz+elevz); (3) the HPM with a multiplicity correcting prior (elevz+elevz3); (4) the median probability model (MPM), i.e., the model including only predictors with a MPIP larger than 50%, with the uniform prior (yrz+elevz+elevz3); and finally (5) the MPM with a multiplicity correction prior (elevz+elevz3, the same as the HPM with multiplicity correction).
We must emphasize that the models resulting from the implementation of our model selection procedure used exclusively the training dataset. On the other hand, the model in Kery et al (2010) was chosen to minimize the prediction error of the complete data.
Because this model was obtained from the full dataset, results derived from it can only be considered a lower bound for the prediction errors.
The benchmark misclassification error rate for true 1's is high (close to 70%). However, the misclassification rate for true 0's, which account for most of the responses, is less pronounced (15%). Overall, the performance of the selected models is comparable. They yield considerably worse results than the benchmark for the true 1's, but achieve rates close to the benchmark for the true zeros. Pooling together the results for true ones and true zeros, the selected models with either prior have misclassification rates close to 30%. The benchmark model performs comparably, with a joint misclassification error of 23% (Table 3-12).
Table 3-12. Mean misclassification rate for HPMs and MPMs using uniform and multiplicity correction model priors.

Model                                                             True 1   True 0   Joint
Benchmark (Kery et al 2010):
  yrz+elevz+elevz2+elevz3 + elevy+elevy2+datey+datey2             0.66     0.15     0.23
HPM Unif: yrz+elevz                                               0.83     0.17     0.28
HPM MC: elevz+elevz3                                              0.82     0.18     0.28
MPM Unif: yrz+elevz+elevz3                                        0.82     0.18     0.29
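The three error rates reported above can be computed as follows (a sketch with hypothetical detection vectors, not the case-study data):

```python
def misclassification_rates(y_true, y_pred):
    """Error rate among the true 1's, among the true 0's, and jointly,
    as laid out in Table 3-12."""
    pairs = list(zip(y_true, y_pred))
    n1 = sum(1 for t, _ in pairs if t == 1)
    n0 = sum(1 for t, _ in pairs if t == 0)
    err1 = sum(1 for t, p in pairs if t == 1 and p != 1) / n1
    err0 = sum(1 for t, p in pairs if t == 0 and p != 0) / n0
    joint = sum(1 for t, p in pairs if t != p) / len(pairs)
    return err1, err0, joint

# Toy detections: 0's dominate, as in the blue hawker data
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [0, 1, 0, 0, 0, 0, 1, 0]
print(misclassification_rates(y_true, y_pred))  # (2/3, 1/5, 3/8)
```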
3.7 Discussion
In this chapter we proposed an objective and fully automatic Bayesian methodology for the single-season site-occupancy model. The methodology is said to be fully automatic because no hyper-parameter specification is necessary in defining the parameter priors, and objective because it relies on the intrinsic priors derived from noninformative priors. The intrinsic priors have been shown to have desirable properties as testing priors. We also propose a fast stochastic search algorithm to explore large model spaces using our model selection procedure.
Our simulation experiments demonstrated the ability of the method to single out the predictors present in the true model when considering the marginal posterior inclusion probabilities of the predictors. For predictors in the true model, these probabilities were comparatively larger than those for predictors absent from it. Also, the simulations
indicated that the method has a greater discrimination capability for predictors in the detection component of the model, especially when using multiplicity correction priors.
Multiplicity correction priors were not described in detail in this chapter; however, their influence on the selection outcome is significant. This behavior was observed in the simulation experiment and in the analysis of the blue hawker data. Model priors play an essential role: as the number of predictors grows, they are instrumental in controlling the selection of false positive predictors. Additionally, model priors can be used to account for predictor structure in the selection process, which helps both to reduce the size of the model space and to make the selection more robust. These issues are the topic of the next chapter.
Accounting for the polynomial hierarchy in the predictors within the occupancy context is a straightforward extension of the procedures we describe in Chapter 4; hence, our next step is to develop efficient software for it. An additional direction we plan to pursue is developing methods for occupancy variable selection in a multivariate setting. These can be used to conduct hypothesis testing in scenarios with conditions varying through time, or where multiple species are co-observed. A final variation we will investigate for this problem is occupancy model selection incorporating random effects.
CHAPTER 4
PRIORS ON THE MODEL SPACE AND WELL-FORMULATED MODELS

It has long been an axiom of mine that the little things are infinitely the most important.

–Sherlock Holmes, A Case of Identity
4.1 Introduction
In regression problems, if a large number of potential predictors is available, the complete model space is too large to enumerate, and automatic selection algorithms are necessary to find informative, parsimonious models. This multiple testing problem is difficult, and even more so when interactions or powers of the predictors are considered. In the ecological literature, models with interactions and/or higher-order polynomial terms are ubiquitous (Johnson et al 2013; Kery et al 2010; Zeller et al 2011), given the complexity and non-linearities found in ecological processes. Several model selection procedures, even in the classical normal linear setting, fail to address two fundamental issues: (1) the model selection outcome is not invariant to affine transformations when interactions or polynomial structures are found among the predictors, and (2) additional penalization is required to control for false positives as the model space grows (i.e., as more covariates are considered).
These two issues motivate the developments presented throughout this chapter. Building on the results of Chipman (1996), we propose, investigate, and provide recommendations for three different prior distributions on the model space. These priors help control for test multiplicity while accounting for polynomial structure in the predictors. They improve upon those proposed by Chipman, first by avoiding the need to specify values for the prior inclusion probabilities of the predictors, and second by formulating principled alternatives to introduce additional structure in the model priors. Finally, we design a stochastic search algorithm that allows fast and thorough exploration of model spaces with polynomial structure.
Structure in the predictors can determine the selection outcome. As an illustration, consider the model E[y] = β(0,0) + β(0,1) x2 + β(2,0) x1², where the order-one term x1 is not present (this choice of subscripts for the coefficients is defined in the following section). Transforming x1 → x1* = x1 + c for some c ≠ 0, the model becomes E[y] = β(0,0) + β(0,1) x2 + β*(2,0) x1*². Note that, in terms of the original predictors, x1*² = x1² + 2c·x1 + c², implying that this seemingly innocuous transformation of x1 modifies the column space of the design matrix by including x1, which was not in the original model. That is, when lower-order terms in the hierarchy are omitted from the model, the column space of the design matrix is not invariant to affine transformations. As the hat matrix depends on the column space, the model's predictive capability is also affected by how the covariates in the model are coded, an undesirable feature for any model selection procedure. To make model selection invariant to affine transformations, the selection must be constrained to the subset of models that respect the hierarchy (Griepentrog et al 1982; Khuri 2002; McCullagh & Nelder 1989; Nelder 2000; Peixoto 1987, 1990). These models are known as well-formulated models (WFMs). Succinctly, a model is well-formulated if, for any predictor in the model, every lower-order predictor associated with it is also in the model. The model above is not well-formulated, as it contains x1² but not x1.
WFMs exhibit strong heredity, in that all lower-order terms dividing higher-order terms in the model must also be included. An alternative is to require only weak heredity (Chipman 1996), which forces only some of the lower terms in the corresponding polynomial hierarchy to be in the model. However, Nelder (1998) demonstrated that the conditions under which weak heredity allows the design matrix to be invariant to affine transformations of the predictors are too restrictive to be useful in practice.
Although this topic appeared in the literature more than three decades ago (Nelder 1977), only recently have modern variable selection techniques been adapted to account for the constraints imposed by heredity. As described in Bien et al (2013), the current literature on variable selection for polynomial response surface models can be classified into three broad groups: multi-step procedures (Brusco et al 2009; Peixoto 1987), regularized regression methods (Bien et al 2013; Yuan et al 2009), and Bayesian approaches (Chipman 1996). The methods introduced in this chapter take a Bayesian approach to variable selection for well-formulated models, with particular emphasis on model priors.
As mentioned in previous chapters, the Bayesian variable selection problem consists of finding models with high posterior probability within a pre-specified model space M. The model posterior probability for M ∈ M is given by

p(M | y, M) ∝ m(y | M) π(M | M).    (4-1)
Model posterior probabilities depend on the prior distribution over the model space, as well as on the prior distributions for the model-specific parameters, implicitly through the marginals m(y|M). Priors on the model-specific parameters have been extensively discussed in the literature (Berger & Pericchi 1996; Berger et al 2001; George 2000; Jeffreys 1961; Kass & Wasserman 1996; Liang et al 2008; Zellner & Siow 1980). In contrast, the effect of the prior on the model space has, until recently, been neglected. A few authors (e.g., Casella et al (2014), Scott & Berger (2010), Wilson et al (2010)) have highlighted the relevance of priors on the model space in the context of multiple testing. Adequately formulating priors on the model space can both account for structure in the predictors and provide additional control over the detection of false positive terms.
In addition, using the popular uniform prior over the model space may lead to the undesirable and "informative" implication of favoring models of size p/2 (where p is the total number of covariates), since this is the most abundant model size contained in the model space.
Variable selection within the model space of well-formulated polynomial models poses two challenges for automatic objective model selection procedures. First, the notion of model complexity takes on a new dimension: complexity is not exclusively a function of the number of predictors, but also depends upon the depth and connectedness of the associations defined by the polynomial hierarchy. Second, because the model space is shaped by such relationships, stochastic search algorithms used to explore the models must also conform to these restrictions.
Models without polynomial hierarchy constitute a special case of WFMs in which all predictors are of order one. Hence, all the methods developed throughout this chapter also apply to models with no predictor structure. Additionally, although our proposed methods are presented for the normal linear case to simplify the exposition, they are general enough to be embedded in many Bayesian selection and averaging procedures, including, of course, the occupancy framework previously discussed.
In this chapter, we first provide the necessary definitions to characterize the well-formulated model selection problem. We then introduce three new prior structures on the well-formulated model space and characterize their behavior with simple examples and simulations. With the model priors in place, we build a stochastic search algorithm to explore spaces of well-formulated models that relies on intrinsic priors for the model-specific parameters, though this assumption can be relaxed to use other mixtures of g-priors. Finally, we implement our procedures using both simulated and real data.
4.2 Setup for Well-Formulated Models
Suppose that the observations yi are modeled using the polynomial regression on the covariates xi1, ..., xip given by

yi = Σ_α β_(α1,...,αp) ∏_{j=1}^{p} xij^{αj} + εi,    (4-2)

where α = (α1, ..., αp) belongs to N^p_0, the p-dimensional space of natural numbers including 0, εi ~iid N(0, σ²), and only finitely many βα are allowed to be non-zero. As an illustration, consider a model space that includes polynomial terms incorporating covariates xi1 and xi2 only. The terms x²i2 and x²i1 xi2 can be represented by α = (0, 2) and α = (2, 1), respectively.
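As a sketch of how this multi-index notation maps onto design-matrix columns (illustrative Python, with hypothetical helper names), each α contributes the column with entries ∏_j xij^{αj}:

```python
def poly_term(x_row, alpha):
    """Evaluate prod_j x_j^alpha_j for a single observation."""
    out = 1.0
    for x, a in zip(x_row, alpha):
        out *= x ** a
    return out

def design_matrix(X, model):
    """Z_M(X): one column per multi-index alpha in the model M."""
    return [[poly_term(row, alpha) for alpha in model] for row in X]

# Terms 1, x1, x2, and x1^2 x2 encoded as multi-indices for p = 2
M = [(0, 0), (1, 0), (0, 1), (2, 1)]
print(design_matrix([(2.0, 3.0)], M))  # [[1.0, 2.0, 3.0, 12.0]]
```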
The notation y = Z(X)β + ε is used to denote that the observed response y = (y1, ..., yn) is modeled via a polynomial function Z of the original covariates contained in X = (x1, ..., xp) (where xj = (x1j, ..., xnj)′), and the coefficients of the polynomial terms are given by β. A specific polynomial model M is defined by the set of coefficients βα that are allowed to be non-zero. This definition is equivalent to characterizing M through a collection of multi-indices α ∈ N^p_0. In particular, model M is specified by M = {αM1, ..., αM|M|}, for αMk ∈ N^p_0, where βα = 0 for α ∉ M.

Any particular model M uses a subset XM of the original covariates X to form the polynomial terms in the design matrix ZM(X). Without ambiguity, a polynomial model ZM(X) on X can be identified with a polynomial model ZM(XM) on the covariates XM. The number of terms used by M to model the response y, denoted by |M|, corresponds to the number of columns of ZM(XM). The coefficient vector and error variance of the model M are denoted by βM and σ²M, respectively. Thus, M models the data as y = ZM(XM)βM + εM, where εM ~ N(0, I σ²M). Model M is said to be nested in model M′ if M ⊂ M′. M models the response of the covariates in two distinct ways: choosing the set of meaningful covariates XM, as well as choosing the polynomial structure of these covariates, ZM(XM).
The set N^p_0 constitutes a partially ordered set or, more succinctly, a poset. A poset is a set partially ordered through a binary relation "≼". In this context, the binary relation on the poset N^p_0 is defined between pairs (α, α′) by α′ ≼ α whenever αj ≥ α′j for all j = 1, ..., p, with α′ ≺ α if additionally αj > α′j for some j. The order of a term α ∈ N^p_0 is given by the sum of its elements, order(α) = Σj αj. When order(α) = order(α′) + 1 and α′ ≺ α, then α′ is said to immediately precede α, which is denoted by α′ → α. The parent set of α is defined by P(α) = {α′ ∈ N^p_0 : α′ → α}, and is given by the set of nodes that immediately precede the given node. A polynomial model M is said to be well-formulated if α ∈ M implies that P(α) ⊂ M. For example, any well-formulated model using x²i1 xi2 to model yi must also include the parent terms xi1 xi2 and x²i1, their corresponding parent terms xi1 and xi2, and the intercept term 1.
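The well-formulation condition is straightforward to check programmatically. The Python sketch below (hypothetical helper names, not code from this work) encodes terms as multi-index tuples and verifies that every term's parent set is contained in the model:

```python
def parents(alpha):
    """P(alpha): multi-indices obtained by decrementing one positive entry."""
    return {tuple(a - (i == j) for i, a in enumerate(alpha))
            for j in range(len(alpha)) if alpha[j] > 0}

def is_well_formulated(model):
    """True if every term's parent set is contained in the model."""
    model = set(model)
    return all(parents(alpha) <= model for alpha in model)

# x1^2 x2 requires x1 x2 and x1^2, and recursively x1, x2, and 1
wf = {(0, 0), (1, 0), (0, 1), (2, 0), (1, 1), (2, 1)}
print(is_well_formulated(wf))              # True
print(is_well_formulated({(0, 0), (2, 0)}))  # False: x1^2 without x1
```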
The poset N^p_0 can be represented by a directed acyclic graph (DAG). Without ambiguity, we can identify nodes in the graph, α ∈ N^p_0, with terms in the set of covariates. The graph has directed edges to each node from its parents. Any well-formulated model M is represented by a subgraph of the DAG with the property that if node α belongs to the subgraph, then the nodes corresponding to P(α) are also in it. Figure 4-1 shows examples of well-formulated polynomial models, where α ∈ N^p_0 is identified with ∏_{j=1}^{p} xj^{αj}.
The motivation for considering only well-formulated polynomial models is compelling. Let ZM be the design matrix associated with a polynomial model M. The subspace of y modeled by ZM, given by the hat matrix HM = ZM (Z′M ZM)⁻¹ Z′M, is invariant to affine transformations of the matrix XM if and only if M corresponds to a well-formulated polynomial model (Peixoto 1990).
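This invariance is easy to verify numerically. The following sketch (using NumPy, with simulated data that is not from the text) compares hat matrices before and after the shift x1 → x1 + c, for a non-well-formulated model {1, x1²} and for its well-formulated completion {1, x1, x1²}:

```python
import numpy as np

def hat_matrix(Z):
    """H = Z (Z'Z)^{-1} Z', computed via the pseudoinverse."""
    return Z @ np.linalg.pinv(Z)

rng = np.random.default_rng(0)
x1 = rng.normal(size=8)
c = 3.0

# Non-well-formulated: columns [1, x1^2]; the shift changes the column space
Z_bad = np.column_stack([np.ones(8), x1**2])
Z_bad_shift = np.column_stack([np.ones(8), (x1 + c)**2])
print(np.allclose(hat_matrix(Z_bad), hat_matrix(Z_bad_shift)))  # False

# Well-formulated: columns [1, x1, x1^2]; the hat matrix is invariant
Z_wf = np.column_stack([np.ones(8), x1, x1**2])
Z_wf_shift = np.column_stack([np.ones(8), x1 + c, (x1 + c)**2])
print(np.allclose(hat_matrix(Z_wf), hat_matrix(Z_wf_shift)))  # True
```

The second comparison succeeds because (x1 + c)² = x1² + 2c·x1 + c² lies in the span of {1, x1, x1²}, so the column space, and hence the projection, is unchanged.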
Figure 4-1. Graphs of well-formulated polynomial models for p = 2 (panels A and B).
For example, if p = 2 and yi = β(0,0) + β(1,0) xi1 + β(0,1) xi2 + β(1,1) xi1 xi2 + εi, then the hat matrix is invariant to any covariate transformation of the form A (xi1, xi2)′ + b, for any real-valued positive definite 2 × 2 matrix A and any real-valued vector b of dimension two. In contrast, if yi = β(0,0) + β(2,0) x²i1 + εi, then the hat matrix formed after applying the transformation xi1 → xi1 + c, for real c ≠ 0, is not the same as the hat matrix formed from the original xi1.

4.2.1 Well-Formulated Model Spaces
The spaces of WFMs M considered in this chapter can be characterized in terms of two WFMs: MB, the base model, and MF, the full model. The base model contains at least the intercept term and is nested in the full model. The model space M is populated by all well-formulated models M that nest MB and are nested in MF:

M = {M : MB ⊆ M ⊆ MF and M is well-formulated}.
For M to be well-formulated the entire ancestry of each node in M must also be
included in M Because of this M isin M can be uniquely identified by two different sets
of nodes in MF the set of extreme nodes and the set of children nodes For M isin M
90
the sets of extreme and children nodes respectively denoted by E(M) and C(M) are
defined by
E(M) = α isin M MB α isin P(αprime) forall αprime isin M
C(M) = α isin MF M α cupM is well-formulated
The extreme nodes are those nodes that, when removed from M, give rise to a WFM in ℳ. The children nodes are those nodes that, when added to M, give rise to a WFM in ℳ. Because M_B ⊆ M for all M ∈ ℳ, the set of nodes E(M) ∪ M_B determines M, by beginning with this set and iteratively adding parent nodes. Similarly, the nodes in C(M) determine the set {α′ ∈ P(α) : α ∈ C(M)} ∪ {α′ ∈ E(M_F) : α ⋠ α′ for all α ∈ C(M)}, which contains E(M) ∪ M_B and thus uniquely identifies M.
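As a concrete illustration (our own sketch, not from the dissertation), E(M) and C(M) can be computed directly when each node is represented as a tuple of exponents α = (α1, ..., αp); the helper names are ours.

```python
from itertools import product

def parents(alpha):
    """Immediate predecessors: lower one positive exponent by 1."""
    return [tuple(a - (j == i) for j, a in enumerate(alpha))
            for i in range(len(alpha)) if alpha[i] > 0]

def is_wfm(M):
    """A model is well-formulated if it contains every node's parents."""
    return all(p in M for alpha in M for p in parents(alpha))

def extreme_nodes(M, MB):
    """E(M): nodes of M \\ MB that are not a parent of any node in M."""
    return {a for a in M - MB
            if all(a not in parents(b) for b in M)}

def children_nodes(M, MF):
    """C(M): nodes of MF \\ M whose addition keeps M well-formulated."""
    return {a for a in MF - M if is_wfm(M | {a})}

# Full quadratic surface in p = 2 covariates (the space of Figure 4-2)
MF = {a for a in product(range(3), repeat=2) if sum(a) <= 2}
MB = {(0, 0)}                      # intercept-only base model
M  = {(0, 0), (1, 0), (2, 0)}      # the model {1, x1, x1^2}

print(extreme_nodes(M, MB))   # {(2, 0)}  -> x1^2
print(children_nodes(M, MF))  # {(0, 1)}  -> x2
```

The outputs match the Figure 4-2 example: the only extreme node of {1, x1, x1²} is x1², and its only child is x2.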
Figure 4-2. A) Extreme node set. B) Children node set.
In Figure 4-2, the extreme and children sets for model M = {1, x1, x1²} are shown for the model space characterized by M_F = {1, x1, x2, x1², x1x2, x2²}. In Figure 4-2A, the solid nodes represent nodes α ∈ M \ E(M), the dashed node corresponds to α ∈ E(M), and the dotted nodes are not in M. Solid nodes in Figure 4-2B correspond to those in M; the dashed node is the single node in C(M), and the dotted nodes are not in M ∪ C(M).

4.3 Priors on the Model Space
As discussed in Scott & Berger (2010), the Ockham's-razor effect found automatically in Bayesian variable selection through the Bayes factor does not correct for multiple testing. This penalization acts against more complex models, but does not account for the collection of models in the model space, which describes the multiplicity of the testing problem. This is where the role of the prior on the model space becomes important. As Scott & Berger explain, the multiplicity penalty is "hidden away" in the model prior probabilities π(M | ℳ).
In what follows, we propose three different prior structures on the model space for WFMs, discuss their advantages and disadvantages, and describe reasonable choices for their hyper-parameters. In addition, we investigate how the choice of prior structure and hyper-parameter combinations affects the posterior probabilities for predictor inclusion, providing some recommendations for different situations.

4.3.1 Model Prior Definition
The graphical structure of the model space suggests a method for prior construction on ℳ guided by the notion of inheritance. A node α is said to inherit from a node α′ if there is a directed path from α′ to α in the graph Γ(M_F). The inheritance is said to be immediate if order(α) = order(α′) + 1 (equivalently, if α′ ∈ P(α), or if α′ immediately precedes α).

For convenience, define Δ(M) = M \ M_B to be the set of nodes in M that are not in the base model M_B. For α ∈ Δ(M_F), let γ_α(M) be the indicator function describing whether α is included in M, i.e., γ_α(M) = I(α ∈ M). Denote by γ^ν(M) the set of indicators of inclusion in M for all order-ν nodes in Δ(M_F). Finally, let γ^{<ν}(M) = ∪_{j=0}^{ν−1} γ^j(M), the set of indicators of inclusion in M for all nodes in Δ(M_F) of order less than ν. With these definitions, the prior probability of any model M ∈ ℳ can be factored as

π(M | ℳ) = ∏_{j=J_M^min}^{J_M^max} π(γ^j(M) | γ^{<j}(M), ℳ),    (4–3)

where J_M^min and J_M^max are, respectively, the minimum and maximum orders of nodes in Δ(M_F), and π(γ^{J_M^min}(M) | γ^{<J_M^min}(M), ℳ) = π(γ^{J_M^min}(M) | ℳ).
Prior distributions on ℳ can be simplified by making two assumptions. First, if order(α) = order(α′) = j, then γ_α and γ_α′ are assumed to be conditionally independent given γ^{<j}, denoted by γ_α ⊥⊥ γ_α′ | γ^{<j}. Second, immediate inheritance is invoked: it is assumed that if order(α) = j, then γ_α(M) | γ^{<j}(M) = γ_α(M) | γ_{P(α)}(M), where γ_{P(α)}(M) is the inclusion indicator for the set of parent nodes of α. This indicator is one if the complete parent set of α is contained in M and zero otherwise.

In Figure 4-3, these two assumptions are depicted with M_F being an order-two surface in two main effects. The conditional independence assumption (Figure 4-3A) implies that the inclusion indicators for x1², x2², and x1x2 are independent when conditioned on all the lower-order terms. In this same space, immediate inheritance implies that the inclusion of x1², conditioned on the inclusion of all lower-order nodes, is equivalent to conditioning on its parent set (x1 in this case).
Figure 4-3. A) Conditional independence: x1² ⊥⊥ x1x2 ⊥⊥ x2² | {1, x1, x2}. B) Immediate inheritance: x1² | {1, x1, x2} = x1² | {x1}.
Denote the conditional inclusion probability of node α in model M by π_α = π(γ_α(M) = 1 | γ_{P(α)}(M), ℳ). Under the assumptions of conditional independence and immediate inheritance, the prior probability of M is

π(M | π_ℳ, ℳ) = ∏_{α ∈ Δ(M_F)} π_α^{γ_α(M)} (1 − π_α)^{1−γ_α(M)},    (4–4)

with π_ℳ = {π_α : α ∈ Δ(M_F)}. Because M must be well-formulated, π_α = γ_α = 0 if γ_{P(α)}(M) = 0. Thus, the product in 4–4 can be restricted to the set of nodes α ∈ Δ(M) ∪ C(M). Additional structure can be built into the prior on ℳ by making assumptions about the inclusion probabilities π_α, such as equality assumptions or assumptions of a hyper-prior for these parameters. Three such prior classes are developed next, first by assigning hyperpriors on π_ℳ assuming some structure among its elements, and then marginalizing out π_ℳ.
Hierarchical Uniform Prior (HUP). The HUP assumes that the non-zero π_α are all equal. Specifically, for a model M ∈ ℳ, it is assumed that π_α = π for all α ∈ Δ(M) ∪ C(M). A complete Bayesian specification of the HUP is obtained by assuming a prior distribution for π. The choice of π ~ Beta(a, b) produces

π_HUP(M | ℳ, a, b) = B(|Δ(M)| + a, |C(M)| + b) / B(a, b),    (4–5)

where B is the beta function. Setting a = b = 1 gives the particular value

π_HUP(M | ℳ, a = 1, b = 1) = [ (|Δ(M)| + |C(M)| + 1) · binom(|Δ(M)| + |C(M)|, |Δ(M)|) ]^{−1},    (4–6)

where binom(n, k) denotes a binomial coefficient. The HUP assigns equal probabilities to all models for which the sets of nodes Δ(M) and C(M) have the same cardinalities. This prior provides a combinatorial penalization, but essentially fails to account for the hierarchical structure of the model space. An additional penalization for model complexity can be incorporated into the HUP by changing the values of a and b. Because π_α = π for all α, this penalization can only depend on some aspect of the entire graph of M_F, such as the total number of nodes not in the null model, |Δ(M_F)|.
Hierarchical Independence Prior (HIP). The HIP assumes that there are no equality constraints among the non-zero π_α. Each non-zero π_α is given its own prior, which is assumed to be a Beta distribution with parameters a_α and b_α. Thus, the prior probability of M under the HIP is

π_HIP(M | ℳ, a, b) = ∏_{α ∈ Δ(M)} a_α/(a_α + b_α) · ∏_{α ∈ C(M)} b_α/(a_α + b_α),    (4–7)

where a product over the empty set is taken to be 1. Because the π_α are totally independent, any choice of a_α and b_α is equivalent to choosing a probability of success π_α for a given α. Setting a_α = b_α = 1 for all α ∈ Δ(M) ∪ C(M) gives the particular value

π_HIP(M | ℳ, a = 1, b = 1) = (1/2)^{|Δ(M)| + |C(M)|}.    (4–8)
Although the prior with this choice of hyper-parameters accounts for the hierarchical structure of the model space, it essentially provides no penalization for combinatorial complexity at different levels of the hierarchy. This can be observed by considering a model space with main effects only: the exponent in 4–8 is the same for every model in the space, because each node is either in the model or in the children set.

Additional penalizations for model complexity can be incorporated into the HIP. Because each γ^j is conditioned on γ^{<j} in the prior construction, the a_α and b_α for α of order j can be conditioned on γ^{<j}. One such additional penalization utilizes the number of nodes of order j that could be added to produce a WFM conditioned on the inclusion vector γ^{<j}, which is denoted ch_j(γ^{<j}). Choosing a_α = 1 and b_α(M) = ch_j(γ^{<j}) is equivalent to choosing a probability of success π_α = 1/(ch_j(γ^{<j}) + 1). This penalization can drive down the false positive rate when ch_j(γ^{<j}) is large, but may produce more false negatives.
Hierarchical Order Prior (HOP). A compromise between complete equality and complete independence of the π_α is to assume equality between the π_α of a given order and independence across the different orders. Define Δ_j(M) = {α ∈ Δ(M) : order(α) = j} and C_j(M) = {α ∈ C(M) : order(α) = j}. The HOP assumes that π_α = π_j for all α ∈ Δ_j(M) ∪ C_j(M). Assuming that π_j ~ Beta(a_j, b_j) provides a prior probability of

π_HOP(M | ℳ, a, b) = ∏_{j=J_M^min}^{J_M^max} B(|Δ_j(M)| + a_j, |C_j(M)| + b_j) / B(a_j, b_j).    (4–9)

The specific choice of a_j = b_j = 1 for all j gives the value

π_HOP(M | ℳ, a = 1, b = 1) = ∏_j [ (|Δ_j(M)| + |C_j(M)| + 1) · binom(|Δ_j(M)| + |C_j(M)|, |Δ_j(M)|) ]^{−1},    (4–10)

and produces a hierarchical version of the Scott and Berger multiplicity correction.
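With the defaults a = b = 1, equations 4–6, 4–8, and 4–10 can be evaluated mechanically. The sketch below is ours (nodes represented as exponent tuples, helper names invented); it reproduces the model {1, x1} row of Figure 4-4.

```python
from math import comb
from itertools import product

def parents(a):
    """Immediate predecessors of a node: lower one positive exponent by 1."""
    return [tuple(x - (j == i) for j, x in enumerate(a))
            for i in range(len(a)) if a[i] > 0]

def children(M, MF):
    """C(M): nodes of MF \\ M whose full parent set lies in M."""
    return {a for a in MF - M if all(p in M for p in parents(a))}

def hup(M, MB, MF):
    """Equation 4-6: HUP with a = b = 1."""
    d, c = len(M - MB), len(children(M, MF))
    return 1 / ((d + c + 1) * comb(d + c, d))

def hip(M, MB, MF):
    """Equation 4-8: HIP with a = b = 1."""
    return 0.5 ** (len(M - MB) + len(children(M, MF)))

def hop(M, MB, MF):
    """Equation 4-10: HOP with a = b = 1, one factor per order class j."""
    p, C = 1.0, children(M, MF)
    for j in {sum(a) for a in MF - MB}:
        d = sum(1 for a in M - MB if sum(a) == j)
        c = sum(1 for a in C if sum(a) == j)
        if d + c:
            p /= (d + c + 1) * comb(d + c, d)
    return p

MF = {a for a in product(range(3), repeat=2) if sum(a) <= 2}  # quadratic in 2 vars
MB = {(0, 0)}                                                 # intercept only
M = {(0, 0), (1, 0)}                                          # the model {1, x1}
print(hip(M, MB, MF), hop(M, MB, MF), hup(M, MB, MF))
# 1/8, 1/12, 1/12, i.e. the (1, 1) columns of model 2 in Figure 4-4
```

The (1, ch) columns of Figures 4-4 and 4-5 follow the same pattern, with the Beta(1, 1) factors replaced by Beta(1, ch_j) factors.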
The HOP arises from a conditional exchangeability assumption on the indicator variables. Conditioned on γ^{<j}(M), the indicators {γ_α : α ∈ Δ_j(M) ∪ C_j(M)} are assumed to be exchangeable Bernoulli random variables. By de Finetti's theorem, these arise from independent Bernoulli random variables with a common probability of success π_j, which itself has a prior distribution; our construction of the HOP assumes that this prior is a beta distribution. Additional complexity penalizations can be incorporated into the HOP in a similar fashion to the HIP. The number of possible nodes of order j that could be added, conditioned on γ^{<j}(M), while maintaining a WFM is given by ch_j(M) = ch_j(γ^{<j}(M)) = |Δ_j(M) ∪ C_j(M)|. Using a_j = 1 and b_j(M) = ch_j(M) produces a prior with two desirable properties. First, if M′ ⊂ M, then π(M) ≤ π(M′). Second, for each order j, the conditional probability of including k nodes is greater than or equal to that of including k + 1 nodes, for k = 0, 1, ..., ch_j(M) − 1.

4.3.2 Choice of Prior Structure and Hyper-Parameters
Each of the priors introduced in Section 4.3.1 defines a whole family of model priors, characterized by the probability distribution assumed for the inclusion probabilities π_ℳ. For the sake of simplicity, this paper focuses on those arising from Beta distributions and concentrates on particular choices of hyper-parameters that can be specified automatically. First, we describe some general features of how each of the three prior structures (HUP, HIP, HOP) allocates mass to the models in the model space. Second, as there is an infinite number of ways in which the hyper-parameters can be specified, focus is placed on the default choice a = b = 1, as well as on the complexity penalizations described in Section 4.3.1. The second alternative is referred to as a = 1, b = ch, where b = ch has a slightly different interpretation depending on the prior structure. Accordingly, b = ch is given by b_j(M) = b_α(M) = ch_j(M) = |Δ_j(M) ∪ C_j(M)| for the HOP and HIP, where j = order(α), while b = ch denotes that b = |Δ(M_F)| for the HUP. The prior behavior is illustrated for two model spaces; in both cases the base model M_B is taken to be the intercept-only model, and M_F is the DAG shown (Figures 4-4 and 4-5). The priors treat model complexity differently, and some general properties can be seen in these examples.
Figure 4-4. Prior probabilities for the space of well-formulated models associated with the quadratic surface in two variables, where M_B is taken to be the intercept-only model and (a, b) ∈ {(1, 1), (1, ch)}.

     Model                          HIP(1,1)  HIP(1,ch)  HOP(1,1)  HOP(1,ch)  HUP(1,1)  HUP(1,ch)
  1  {1}                            1/4       4/9        1/3       1/2        1/3       5/7
  2  {1, x1}                        1/8       1/9        1/12      1/12       1/12      5/56
  3  {1, x2}                        1/8       1/9        1/12      1/12       1/12      5/56
  4  {1, x1, x1²}                   1/8       1/9        1/12      1/12       1/12      5/168
  5  {1, x2, x2²}                   1/8       1/9        1/12      1/12       1/12      5/168
  6  {1, x1, x2}                    1/32      3/64       1/12      1/12       1/60      1/72
  7  {1, x1, x2, x1²}               1/32      1/64       1/36      1/60       1/60      1/168
  8  {1, x1, x2, x1x2}              1/32      1/64       1/36      1/60       1/60      1/168
  9  {1, x1, x2, x2²}               1/32      1/64       1/36      1/60       1/60      1/168
 10  {1, x1, x2, x1², x1x2}         1/32      1/192      1/36      1/120      1/30      1/252
 11  {1, x1, x2, x1², x2²}          1/32      1/192      1/36      1/120      1/30      1/252
 12  {1, x1, x2, x1x2, x2²}         1/32      1/192      1/36      1/120      1/30      1/252
 13  {1, x1, x2, x1², x1x2, x2²}    1/32      1/576      1/12      1/120      1/6       1/252
First, contrast the HIP, HUP, and HOP for the choice of (a, b) = (1, 1). The HIP induces a complexity penalization that only accounts for the order of the terms in the model. This is best exhibited by the model space in Figure 4-4: models including x1 and x2 (models 6 through 13) are given the same prior probability, and no penalization is incurred for the inclusion of any or all of the quadratic terms. In contrast to the HIP, the HUP induces a penalization for model complexity, but it does not adequately penalize models for including additional terms. Under the HUP, models including all of the terms are given at least as much probability as any model containing any non-empty set of terms (Figures 4-4 and 4-5). This lack of penalization of the full model originates from its combinatorial simplicity (i.e., it is the only model that contains every term), and as an unfortunate consequence, this model space distribution favors the base and full models. Similar behavior is observed with the HOP with (a, b) = (1, 1). As models become more complex, they are appropriately penalized for their size. However, after a sufficient number of nodes are added, the number of possible models of that particular size is considerably reduced; thus, combinatorial complexity is negligible for the largest models. This is best exhibited in Figure 4-5, where the HOP places more mass on the full model than on any model containing a single order-one node, highlighting an undesirable behavior of the priors with this choice of hyper-parameters.

Figure 4-5. Prior probabilities for the space of well-formulated models associated with three main effects and one interaction term, where M_B is taken to be the intercept-only model and (a, b) ∈ {(1, 1), (1, ch)}.

     Model                    HIP(1,1)  HIP(1,ch)  HOP(1,1)  HOP(1,ch)  HUP(1,1)  HUP(1,ch)
  1  {1}                      1/8       27/64      1/4       1/2        1/4       4/7
  2  {1, x1}                  1/8       9/64       1/12      1/10       1/12      2/21
  3  {1, x2}                  1/8       9/64       1/12      1/10       1/12      2/21
  4  {1, x3}                  1/8       9/64       1/12      1/10       1/12      2/21
  5  {1, x1, x3}              1/8       3/64       1/12      1/20       1/12      4/105
  6  {1, x2, x3}              1/8       3/64       1/12      1/20       1/12      4/105
  7  {1, x1, x2}              1/16      3/128      1/24      1/40       1/30      1/42
  8  {1, x1, x2, x1x2}        1/16      3/128      1/24      1/40       1/20      1/70
  9  {1, x1, x2, x3}          1/16      1/128      1/8       1/40       1/20      1/70
 10  {1, x1, x2, x3, x1x2}    1/16      1/128      1/8       1/40       1/5       1/70
In contrast, if (a, b) = (1, ch), all three priors produce strong penalization as models become more complex, both in terms of the number and the order of the nodes contained in the model. For all of the priors, adding a node α to a model M to form M′ produces p(M) ≥ p(M′). However, differences between the priors are apparent: the HIP penalizes the full model the most, the HOP penalizes it the least, and the HUP lies between them. At face value, the HOP creates the most compelling penalization of model complexity. In Figure 4-5, the penalization of the HOP is the least dramatic, producing prior odds of 20 for M_B versus M_F, as opposed to the HUP and HIP, which produce prior odds of 40 and 54, respectively. Similarly, the prior odds in Figure 4-4 are 60, 180, and 256 for the HOP, HUP, and HIP, respectively.

4.3.3 Posterior Sensitivity to the Choice of Prior
To determine how the proposed priors adjust the posterior probabilities to account for multiplicity, a simple simulation was performed. The goal of this exercise was to understand how the priors respond to increasing complexity. First, the priors are compared as the number of main effects p grows. Second, they are compared as the depth of the hierarchy increases, in other words, as the order J_M^max increases.

The quality of a node is characterized by its marginal posterior inclusion probability, defined as p_α = Σ_{M ∈ ℳ} I(α ∈ M) p(M | y, ℳ) for α ∈ M_F. These posteriors were obtained for the proposed priors as well as for the Equal Probability Prior (EPP) on ℳ. For all prior structures, both the default hyper-parameters a = b = 1 and the penalizing choice of a = 1 and b = ch are considered. The results for the different combinations of M_F and M_T incorporated in the analysis were obtained from 100 random replications (i.e., generating at random 100 matrices of main effects and responses). The simulation proceeds as follows:
1. Randomly generate main effects matrices X = (x1, ..., x18), with x_i iid ~ N_n(0, I_n), and error vectors ε ~ N_n(0, I_n), for n = 60.

2. Setting all coefficient values equal to one, calculate y = Z_{M_T} β + ε for the true models given by:
   M_T,1 = {x1, x2, x3, x1², x1x2, x2², x2x3}, with |M_T,1| = 7
   M_T,2 = {x1, x2, ..., x16}, with |M_T,2| = 16
   M_T,3 = {x1, x2, x3, x4}, with |M_T,3| = 4
   M_T,4 = {x1, x2, ..., x8, x1², x3x4}, with |M_T,4| = 10
   M_T,5 = {x1, x2, x3, x4, x1², x3x4}, with |M_T,5| = 6
Table 4-1. Characterization of the full models M_F and corresponding model spaces ℳ considered in simulations.

Growing p, fixed J_M^max:
M_F                  |M_F|    |ℳ|     M_T used
(x1 + x2 + x3)²        9        95    M_T,1
(x1 + ... + x4)²      14      1337    M_T,1
(x1 + ... + x5)²      20     38619    M_T,1

Fixed p, growing J_M^max:
M_F                  |M_F|    |ℳ|     M_T used
(x1 + x2 + x3)²        9        95    M_T,1
(x1 + x2 + x3)³       19      2497    M_T,1
(x1 + x2 + x3)⁴       34    161421    M_T,1

Other model spaces:
M_F                                 |M_F|    |ℳ|      M_T used
x1 + x2 + ... + x18                  18     262144    M_T,2, M_T,3
(x1 + ... + x4)² + x5 + ... + x10    20      85568    M_T,4, M_T,5

3. In all simulations, the base model M_B is the intercept-only model. The notation (x1 + ... + xp)^d is used to represent the full order-d polynomial response surface in p main effects. The model spaces, characterized by their corresponding full model M_F, are presented in Table 4-1, along with the true models used in each case.

4. Enumerate the model spaces and calculate p(M | y, ℳ) for all M ∈ ℳ using the EPP, HUP, HIP, and HOP, the latter three each with the two sets of hyper-parameters.

5. Count the number of true positives and false positives in each ℳ for the different priors.
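Steps 1 and 2 can be sketched as follows for, e.g., the true model M_T,3 (a sketch of ours; variable names are not from the dissertation):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 60, 18

# Step 1: main-effects matrix X = (x1, ..., x18) and errors, all N(0, 1)
X = rng.standard_normal((n, p))
eps = rng.standard_normal(n)

# Step 2: y = Z_MT beta + eps with every coefficient equal to one;
# for M_T,3 = {x1, x2, x3, x4}, Z_MT is just the first four columns of X
y = X[:, :4].sum(axis=1) + eps
```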
The true positives (TP) are defined as those nodes α ∈ M_T such that p_α > 0.5. For the false positives (FP), three different cutoffs on p_α are considered, elucidating the adjustment for multiplicity induced by the model priors; these cutoffs are 0.10, 0.20, and 0.50, applied to α ∉ M_T. The results from this exercise provide insight into the influence of the prior on the marginal posterior inclusion probabilities. In Table 4-1, the model spaces considered are described in terms of the number of models they contain and the number of nodes of M_F, the full model that defines the DAG for ℳ.
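Given an enumerated model space with posterior probabilities, p_α and the TP/FP counts reduce to a few lines. A toy sketch (our own names and toy numbers, not the dissertation's):

```python
def inclusion_probs(models, post_probs):
    """p_alpha = sum of p(M | y) over the models M that contain alpha."""
    p = {}
    for M, pm in zip(models, post_probs):
        for alpha in M:
            p[alpha] = p.get(alpha, 0.0) + pm
    return p

def tp_fp(p, M_true, tp_cut=0.5, fp_cut=0.1):
    """Count nodes of M_true above tp_cut and other nodes above fp_cut."""
    tp = sum(1 for a, v in p.items() if a in M_true and v > tp_cut)
    fp = sum(1 for a, v in p.items() if a not in M_true and v > fp_cut)
    return tp, fp

# toy posterior over three models on the nodes {"x1", "x2"}
models = [{"x1"}, {"x1", "x2"}, set()]
post = [0.6, 0.3, 0.1]
p = inclusion_probs(models, post)   # {"x1": ~0.9, "x2": 0.3}
print(tp_fp(p, {"x1"}, 0.5, 0.1))   # (1, 1): x1 is a TP, x2 an FP at 0.10
```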
Growing number of main effects, fixed polynomial degree. This simulation investigates the posterior behavior as the number of covariates grows for a polynomial surface of degree two. The true model is assumed to be M_T,1, which has 7 polynomial terms. The false positive and true positive rates are displayed in Table 4-2.
First, focus on the posterior when (a, b) = (1, 1). As p increases and the cutoff is low, the number of false positives increases for the EPP as well as for the hierarchical priors, although less dramatically for the latter. All of the priors identify all of the true positives. The false positive rate for the 0.50 cutoff is less than one for all four prior structures, with the HIP exhibiting the smallest false positive rate.

With the second choice of hyper-parameters, (1, ch), the improvement of the hierarchical priors over the EPP is dramatic, and the difference in performance is more pronounced as p increases. These priors also considerably outperform those using the default hyper-parameters a = b = 1 in terms of false positives. Regarding the number of true positives, all priors discovered the 7 true predictors in M_T,1 for most of the 100 random samples of data, with only minor differences observed between any of the priors considered. That being said, the means for the priors with a = 1, b = ch are slightly lower for the true positives. With a 0.50 cutoff, the hierarchical priors keep tight control on the number of false positives, but in doing so discard true positives with slightly higher frequency.
Growing polynomial degree, fixed main effects. For these examples, the true model is once again M_T,1. When the complexity is increased by making the order of M_F larger (Table 4-3), the inability of the EPP to adjust the inclusion posteriors for multiplicity becomes more pronounced: the EPP becomes less and less efficient at removing false positives when the FP cutoff is low. Among the priors with a = b = 1, as the order increases, the HIP is best at filtering out the false positives. Using the 0.50 false positive cutoff, some false positives are included both for the EPP and for all the priors with a = b = 1, indicating that the default hyper-parameters might not be the best option to control FP. The 7 covariates in the true model all obtain a high posterior inclusion probability both with the EPP and with the a = b = 1 priors.
Table 4-2. Mean number of false and true positives in 100 randomly generated datasets as the number of main effects increases from three to five predictors in a full quadratic, under the equal probability prior (EPP), the hierarchical independence prior (HIP), the hierarchical order prior (HOP), and the hierarchical uniform prior (HUP).

                                                  --- a = 1, b = 1 ---   --- a = 1, b = ch ---
Cutoff      |M_T|  M_F                     EPP    HIP    HUP    HOP      HIP    HUP    HOP
FP(>0.10)     7    (x1 + x2 + x3)²         1.78   1.78   2.00   2.00     0.11   1.31   1.06
FP(>0.20)                                  0.43   0.43   2.00   1.98     0.01   0.28   0.24
FP(>0.50)                                  0.04   0.04   0.97   0.36     0.00   0.03   0.02
TP(>0.50)          (M_T,1)                 7.00   7.00   7.00   7.00     6.97   6.99   6.99
FP(>0.10)     7    (x1 + x2 + x3 + x4)²    3.62   1.94   2.33   2.45     0.10   0.63   1.07
FP(>0.20)                                  1.60   0.47   2.17   2.15     0.01   0.17   0.24
FP(>0.50)                                  0.25   0.06   0.35   0.36     0.00   0.02   0.02
TP(>0.50)          (M_T,1)                 7.00   7.00   7.00   7.00     6.97   6.99   6.99
FP(>0.10)     7    (x1 + ... + x5)²        6.00   2.16   2.60   2.55     0.12   0.43   1.15
FP(>0.20)                                  2.91   0.55   2.13   2.18     0.02   0.19   0.27
FP(>0.50)                                  0.66   0.11   0.25   0.37     0.00   0.03   0.01
TP(>0.50)          (M_T,1)                 7.00   7.00   7.00   7.00     6.97   6.99   6.99
In contrast, any of the a = 1, b = ch priors dramatically improve upon their a = b = 1 counterparts, consistently assigning low inclusion probabilities to the majority of the false positive terms, even for low cutoffs. As the order of the polynomial surface increases, the difference in performance between these priors and either the EPP or their default versions becomes even clearer. At the 0.50 cutoff, the hierarchical priors with complexity penalization exhibit very low false positive rates. The true positive rate decreases slightly for these priors, but not to an alarming degree.
Other model spaces. This part of the analysis considers model spaces that do not correspond to full polynomial response surfaces (Table 4-4). The first example is a model space with main effects only. The second example includes a full quadratic surface in four main effects, but in addition includes six terms for which only main effects are to be modeled. Two true models are used in combination with each model space, to observe how the posterior probabilities vary under the influence of the different priors for "large" and "small" true models.
Table 4-3. Mean number of false and true positives in 100 randomly generated datasets as the maximum order of M_F increases from two to four in a full model with three main effects, under the equal probability prior (EPP), the hierarchical independence prior (HIP), the hierarchical order prior (HOP), and the hierarchical uniform prior (HUP).

                                             --- a = 1, b = 1 ---   --- a = 1, b = ch ---
Cutoff      |M_T|  M_F                EPP    HIP    HUP    HOP      HIP    HUP    HOP
FP(>0.10)     7    (x1 + x2 + x3)²    1.78   1.78   2.00   2.00     0.11   1.31   1.06
FP(>0.20)                             0.43   0.43   2.00   1.98     0.01   0.28   0.24
FP(>0.50)                             0.04   0.04   0.97   0.36     0.00   0.03   0.02
TP(>0.50)          (M_T,1)            7.00   7.00   7.00   7.00     6.97   6.99   6.99
FP(>0.10)     7    (x1 + x2 + x3)³    7.37   5.21   6.06   2.91     0.55   1.05   1.39
FP(>0.20)                             2.91   1.55   3.61   2.08     0.17   0.34   0.31
FP(>0.50)                             0.40   0.21   0.50   0.26     0.03   0.03   0.04
TP(>0.50)          (M_T,1)            7.00   7.00   7.00   7.00     6.97   6.98   7.00
FP(>0.10)     7    (x1 + x2 + x3)⁴    8.22   4.00   4.69   2.61     0.52   0.55   1.32
FP(>0.20)                             4.21   1.13   1.76   2.03     0.12   0.15   0.31
FP(>0.50)                             0.56   0.17   0.22   0.27     0.03   0.03   0.04
TP(>0.50)          (M_T,1)            7.00   7.00   7.00   7.00     6.97   6.97   6.99
By construction, in model spaces with main effects only, HIP(1,1) and EPP are equivalent, as are HOP(a, b) and HUP(a, b). This accounts for the similarities observed among the results for the first two cases presented in Table 4-4, where the model space corresponds to a full model with 18 main effects and the true models have 16 and 4 main effects, respectively. When the number of true coefficients is large, the HUP(1,1) and HOP(1,1) do poorly at controlling false positives, even at the 0.50 cutoff. In contrast, the HIP (and thus the EPP) with the 0.50 cutoff identifies the true positives and no false positives. This result, however, does not imply that the EPP controls false positives well: the true model contains 16 out of the 18 nodes in M_F, so there is little potential for false positives. The a = 1, b = ch priors show dramatically different behavior. The HIP controls false positives well but fails to identify the true coefficients at the 0.50 cutoff; in contrast, the HOP identifies all of the true positives and has a small false positive rate at the 0.50 cutoff.
If the number of true positives is small, most terms in the full model are truly zero. The EPP includes at least one false positive in approximately 50% of the randomly sampled datasets. On the other hand, the HUP(1,1) provides some control for multiplicity, obtaining on average a lower number of false positives than the EPP. Furthermore, the proposed hierarchical priors with a = 1, b = ch are substantially better than the EPP (and the choice a = b = 1) at controlling false positives and capturing all true positives using the marginal posterior inclusion probabilities. These two examples suggest that the HOP(1, ch) is the best default choice for model selection when the number of terms available at a given degree is large.

The third and fourth examples in Table 4-4 consider the same irregular model space, with data generated from M_T,4 with ten terms and M_T,5 with six terms. HIP(1,1) and EPP again behave quite similarly, incorporating a large number of false positives at the 0.10 cutoff; at the 0.50 cutoff, some false positives are still included. The HUP(1,1) and HOP(1,1) behave similarly, with a slightly higher false positive rate at the 0.50 cutoff. In terms of the true positives, the EPP and a = b = 1 priors always include all of the predictors in M_T,4 and M_T,5. On the other hand, the ability of the a = 1, b = ch priors to control false positives is markedly better than that of the EPP and of the hierarchical priors with the choice a = b = 1. At the 0.50 cutoff, these priors identify all of the true positives and true negatives. Once again, these examples point to the hierarchical priors with additional penalization for complexity as being good default priors on the model space.

Table 4-4. Mean number of false and true positives in 100 randomly generated datasets with unstructured or irregular model spaces, under the equal probability prior (EPP), the hierarchical independence prior (HIP), the hierarchical order prior (HOP), and the hierarchical uniform prior (HUP).

                                                 ---- a = 1, b = 1 ----   --- a = 1, b = ch ---
Cutoff      |M_T|  M_F                   EPP     HIP    HUP    HOP        HIP    HUP    HOP
FP(>0.10)    16    x1 + x2 + ... + x18    1.93    1.93   2.00   2.00      0.03   1.80   1.80
FP(>0.20)                                 0.52    0.52   2.00   2.00      0.01   0.46   0.46
FP(>0.50)                                 0.07    0.07   2.00   2.00      0.01   0.04   0.04
TP(>0.50)          (M_T,2)               15.99   15.99  16.00  16.00      6.99  15.99  15.99
FP(>0.10)     4    x1 + x2 + ... + x18   13.95   13.95   9.15   9.15      0.26   1.31   1.31
FP(>0.20)                                 5.45    5.45   3.03   3.03      0.05   0.45   0.45
FP(>0.50)                                 0.84    0.84   0.45   0.45      0.02   0.06   0.06
TP(>0.50)          (M_T,3)                4.00    4.00   4.00   4.00      4.00   4.00   4.00
FP(>0.10)    10    (x1 + ... + x4)² +     9.73    9.71  10.00   5.60      0.34   2.33   2.20
FP(>0.20)          x5 + ... + x10         2.65    2.65   8.73   3.05      0.12   0.74   0.69
FP(>0.50)                                 0.35    0.35   1.36   1.68      0.02   0.11   0.12
TP(>0.50)          (M_T,4)               10.00   10.00  10.00   9.99      9.94   9.98   9.99
FP(>0.10)     6    (x1 + ... + x4)² +    13.52   13.52  11.06   9.94      0.44   1.63   1.96
FP(>0.20)          x5 + ... + x10         4.22    4.21   3.60   5.01      0.15   0.48   0.68
FP(>0.50)                                 0.53    0.53   0.57   0.75      0.01   0.08   0.11
TP(>0.50)          (M_T,5)                6.00    6.00   6.00   6.00      5.99   5.99   5.99

4.4 Random Walks on the Model Space

When the model space ℳ is too large to enumerate, a stochastic procedure can be used to find models with high posterior probability. In particular, an MCMC algorithm can be utilized to generate a dependent sample of models from the model posterior. The structure of the model space ℳ both presents difficulties and provides clues on how to build algorithms to explore it. Different MCMC strategies can be adopted, two of which are outlined in this section. Combining the different strategies allows the model selection algorithm to explore the model space thoroughly and relatively fast.

4.4.1 Simple Pruning and Growing
This first strategy relies on small, localized jumps around the model space, turning a single node on or off at each step. The idea behind this algorithm is to grow the model by activating one node in the children set, or to prune the model by removing one node in the extreme set. At a given step of the algorithm, assume that the current state of the chain is model M, and let p_G be the probability that the algorithm chooses the growth step. The proposed model M′ can either be M⁺ = M ∪ {α} for some α ∈ C(M), or M⁻ = M \ {α} for some α ∈ E(M).
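A minimal sketch of this move (our own; nodes as exponent tuples, helper names invented, and the Metropolis–Hastings accept/reject step that would follow is omitted):

```python
import random
from itertools import product

def parents(a):
    return [tuple(x - (j == i) for j, x in enumerate(a))
            for i in range(len(a)) if a[i] > 0]

def children(M, MF):
    """C(M): addable nodes (full parent set already in M)."""
    return {a for a in MF - M if all(p in M for p in parents(a))}

def extreme(M, MB):
    """E(M): removable nodes (not a parent of anything in M)."""
    return {a for a in M - MB if all(a not in parents(b) for b in M)}

def propose(M, MB, MF, rng=random):
    """One grow/prune proposal; p_G = 1/2 when both moves are possible.

    Assumes M_B != M_F, so at least one move is always available.
    """
    C, E = children(M, MF), extreme(M, MB)
    grow = bool(C) and (not E or rng.random() < 0.5)
    if grow:
        return M | {rng.choice(sorted(C))}    # M+ = M union {alpha}
    return M - {rng.choice(sorted(E))}        # M- = M minus {alpha}

MF = {a for a in product(range(3), repeat=2) if sum(a) <= 2}
MB = {(0, 0)}
M = propose(MB, MB, MF)   # from the base model, only growth is possible
```

In a full sampler, the proposal would then be accepted with the usual Metropolis–Hastings probability, using the proposal masses 1/|C(M)| and 1/|E(M)| that appear in 4–11.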
An example transition kernel is defined by the mixture

g(M′ | M) = p_G · q_Grow(M′ | M) + (1 − p_G) · q_Prune(M′ | M)
          = [I{M ≠ M_F} / (1 + I{M ≠ M_B})] · I{α ∈ C(M)} / |C(M)|
            + [I{M ≠ M_B} / (1 + I{M ≠ M_F})] · I{α ∈ E(M)} / |E(M)|,    (4–11)
where p_G has explicitly been defined as 0.5 when both C(M) and E(M) are non-empty, and as 0 (or 1) when C(M) = ∅ (or E(M) = ∅). After choosing pruning or growing, a single node is proposed for addition to, or deletion from, M uniformly at random.

For this simple algorithm, pruning has the reverse kernel of growing, and vice-versa. From this construction, more elaborate algorithms can be specified. First, instead of choosing the node uniformly at random from the corresponding set, nodes can be selected using the relative posterior probability of adding or removing them. Second, more than one node can be selected at any step, for instance by also sampling at random the number of nodes to add or remove given the size of the set. Third, the strategy could combine pruning and growing in a single step, by sampling one node α ∈ C(M) ∪ E(M) and adding or removing it accordingly. Fourth, sets of nodes from C(M) ∪ E(M) that yield well-formulated models can be added or removed. This simple algorithm produces small moves around the model space by focusing node addition or removal only on the set C(M) ∪ E(M).

4.4.2 Degree-Based Pruning and Growing
In exploring the model space, it is possible to take advantage of the hierarchical structure defined between nodes of different orders. One can update the vector of inclusion indicators by blocks, denoted γ^j(M). Two flavors of this algorithm are proposed: one that separates the pruning and growing steps, and one where both are done simultaneously.

Assume that at a given step, say t, the algorithm is at M. If growing, the strategy proceeds successively by order class, going from j = J_min up to j = J_max, with J_min and J_max being the lowest and highest orders of nodes in M_F \ M_B, respectively. Define M_t(J_min − 1) = M and set j = J_min. The growth kernel comprises the following steps, proceeding from j = J_min to j = J_max:
1) Propose a model M′ by selecting a set of nodes from C_j(M_t(j−1)) through the kernel q_Grow,j(· | M_t(j−1)).

2) Compute the Metropolis–Hastings correction for M′ versus M_t(j−1). If M′ is accepted, then set M_t(j) = M′; otherwise set M_t(j) = M_t(j−1).

3) If j < J_max, then set j = j + 1 and return to 1); otherwise proceed to 4).

4) Set M_t = M_t(J_max).
The pruning step is defined in a similar fashion; however, it starts at order j = J_max and proceeds down to j = J_min. Let E_j(M′) = E(M′) ∩ Δ_j(M_F) be the set of nodes of order j that can be removed from the model M′ to produce a WFM. Define M_t(J_max + 1) = M and set j = J_max. The pruning kernel comprises the following steps:

1) Propose a model M′ by selecting a set of nodes from E_j(M_t(j+1)) through the kernel q_Prune,j(· | M_t(j+1)).

2) Compute the Metropolis–Hastings correction for M′ versus M_t(j+1). If M′ is accepted, then set M_t(j) = M′; otherwise set M_t(j) = M_t(j+1).

3) If j > J_min, then set j = j − 1 and return to Step 1); otherwise proceed to Step 4).

4) Set M_t = M_t(J_min).
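The order-by-order growth sweep can be sketched as follows (our own sketch; nodes as exponent tuples, helper names invented, and the acceptance function left abstract, so the Metropolis–Hastings correction, including its proposal-ratio bookkeeping, is only a placeholder here):

```python
import random
from itertools import product

def parents(a):
    return [tuple(x - (j == i) for j, x in enumerate(a))
            for i in range(len(a)) if a[i] > 0]

def children_j(M, MF, j):
    """C_j(M): order-j nodes addable while staying well-formulated."""
    return {a for a in MF - M
            if sum(a) == j and all(p in M for p in parents(a))}

def grow_sweep(M, MB, MF, accept, rng=random):
    """Growth kernel: sweep j from J_min up to J_max, proposing to add a
    random subset of C_j at each order, with an accept/reject step."""
    for j in sorted({sum(a) for a in MF - MB}):
        C = sorted(children_j(M, MF, j))
        add = {a for a in C if rng.random() < 0.5}   # q_{Grow,j}: coin-flip subset
        if add and accept(M | add, M):               # MH correction (abstract)
            M = M | add
    return M

MF = {a for a in product(range(3), repeat=2) if sum(a) <= 2}
MB = {(0, 0)}
M = grow_sweep(MB, MB, MF, accept=lambda new, old: True)
```

Because children_j only offers nodes whose parents are already present, every intermediate state of the sweep remains well-formulated by construction; the pruning sweep is the mirror image, running j from J_max down to J_min over the sets E_j.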
It is clear that the growing and pruning steps are reverse kernels of each other. Pruning and growing can also be combined for each j: the forward kernel proceeds from j = J_min to j = J_max and proposes adding or removing sets of nodes from C_j(M) ∪ E_j(M), while the reverse kernel simply reverses the direction of j, proceeding from j = J_max to j = J_min.

4.5 Simulation Study
To study the operating characteristics of the proposed priors, a simulation experiment was designed with three goals. First, the priors are characterized by how the posterior distributions are affected by the sample size and the signal-to-noise ratio (SNR). Second, given the SNR level, the influence of the allocation of the signal across the terms in the model is investigated. Third, performance is assessed when the true model has special points in the scale (McCullagh & Nelder 1989), i.e., when the true model has coefficients equal to zero for some lower-order terms in the polynomial hierarchy.
With these goals in mind, sets of predictors and responses are generated under various experimental conditions. The model space is defined with MB being the intercept-only model and MF being the complete order-four polynomial surface in five main effects, which has 126 nodes. The entries of the matrix of main effects are generated as independent standard normal variables. The response vectors are drawn from the n-variate normal distribution as y ∼ Nn(Z_MT(X) β_MT, In), where MT is the true model and In is the n × n identity matrix.
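Under the stated assumptions (standard-normal main effects, unit error variance), the data-generating step can be sketched as below; `build_design` and `toy_design` are hypothetical stand-ins for the polynomial expansion Z_MT(·), shown here with a small expansion rather than the 126-node surface.

```python
import numpy as np

rng = np.random.default_rng(2014)

def simulate_dataset(n, beta_true, build_design):
    """Draw y ~ N_n(Z_MT(X) beta_MT, I_n): generate five standard-normal main
    effects, expand them to the true-model design Z_MT(X), and add N(0, 1) noise."""
    X = rng.standard_normal((n, 5))        # main effects, iid N(0, 1)
    Z = build_design(X)                    # columns of the true model M_T
    y = Z @ beta_true + rng.standard_normal(n)
    return X, Z, y

# toy expansion: intercept, x1, x2, and the order-two term x1*x2
toy_design = lambda X: np.column_stack(
    [np.ones(len(X)), X[:, 0], X[:, 1], X[:, 0] * X[:, 1]])
```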
The sample sizes considered are n ∈ {130, 260, 1040}, which ensures that ZMF(X) is of full rank. The cardinality of this model space is |M| > 1.2 × 10^22, which makes enumeration of all models unfeasible. Because the value of the 2k-th moment of the standard normal distribution increases with k = 1, 2, ..., higher-order terms by construction have a larger variance than their ancestors. As such, assuming equal values for all coefficients, higher-order terms necessarily contain more "signal" than the lower-order terms from which they inherit (e.g., x1^2 has more signal than x1). Once a higher-order term is selected, its entire ancestry is also included. Therefore, to prevent the simulation results from being overly optimistic (because of the larger signals from the higher-order terms), sphering is used to calculate meaningful values of the coefficients, ensuring that the signal is of the magnitude intended in any given direction. Given the results of the simulations from Section 4.3.3, only the HOP with a = 1, b = ch is considered, with the EPP included for comparison.
The total number of combinations of SNR, sample size, regression coefficient values, and nodes in MT amounts to 108 different scenarios. Each scenario was run with 100 independently generated datasets, and the mean behavior across the samples was observed. The results presented in this section correspond to the median probability model (MPM) from each of the 108 simulation scenarios considered. Figure 4-7 shows the comparison between the two priors for the mean number of true positive (TP) and false positive (FP) terms. Although some of the scenarios consider true models that are not well-formulated, the smallest well-formulated model that stems from MT is always the one shown in Figure 4-6.
Figure 4-6. MT: DAG of the largest true model used in simulations
The results are summarized in Figure 4-7. Each point on the horizontal axis corresponds to the average for a given set of simulation conditions. Only labels for the SNR and sample size are included for clarity, but the results are also shown for the different values of the regression coefficients and the different true models considered. Additional details about the procedure and other results are included in the appendices.

4.5.1 SNR and Sample Size Effect
As expected, small sample sizes conditioned upon a small SNR impair the ability of the algorithm to detect true coefficients with both the EPP and HOP(1, ch), with this effect being greater when using the latter prior. However, considering the mean number of TPs jointly with the number of FPs, it is clear that although the number of TPs is especially low with HOP(1, ch), most of the few predictors that are discovered in fact belong to the true model. In comparison to the results with the EPP, in terms of FPs the HOP(1, ch) does better, and even more so when both the sample size and the SNR are smallest.

Figure 4-7. Average true positives (TP) and average false positives (FP) in all simulated scenarios for the median probability model with EPP and HOP(1, ch)

Finally, when either the SNR or the sample size is large, the performance in terms of TPs is similar between the two priors, but the numbers of FPs are somewhat lower with the HOP.

4.5.2 Coefficient Magnitude
Three ways to allocate the amount of signal across predictors are considered. For the first choice, all coefficients contain the same amount of signal, regardless of their order. In the second, each order-one coefficient contains twice as much signal as any order-two coefficient and four times as much as any order-three coefficient. Finally, each order-one coefficient contains half as much signal as any order-two coefficient and a quarter of what any order-three coefficient has. These choices are denoted by β(1) = c(1_o1, 1_o2, 1_o3), β(2) = c(1_o1, 0.5_o2, 0.25_o3), and β(3) = c(0.25_o1, 0.5_o2, 1_o3), respectively. In Figure 4-7, the first four scenarios correspond to simulations with β(1), the next four use β(2), the next four correspond to β(3), and then the values are cycled in the same way. The results show that scenarios using either β(1) or β(3) behave similarly, contrasting with the negative impact of having the highest signal in the order-one terms through β(2). In Figure 4-7, the effect of using β(2) is evident, as it corresponds to the lowest values for the TPs regardless of the sample size, the SNR, or the prior used. This is an intuitive result: giving more signal to higher-order terms makes it easier to detect higher-order terms, and consequently, by strong heredity, the algorithm will also select the corresponding lower-order terms included in the true model.

4.5.3 Special Points on the Scale
Four true models were considered: (1) the model from Figure 4-6 (MT1); (2) the model without the order-one terms (MT2); (3) the model without order-two terms (MT3); and (4) the model without x1^2 and x2x5 (MT4). The last three are clearly not well-formulated. In Figure 4-7, the leftmost point on the horizontal axis corresponds to scenarios with MT1; the next point is for scenarios with MT2, followed by those with MT3, then with MT4, then MT1 again, etc. In comparison to the EPP, the HOP(1, ch) tightly controls the inclusion of FPs by choosing smaller models, at the expense of also reducing the TP count, especially when there is more uncertainty about the true model (i.e., SNR = 0.25). For both prior structures, the results indicate that at low SNR levels the presence of special points has no apparent impact, as the selection behavior is similar between the four models in terms of both the TP and FP. An interesting observation is that the effect of having special points on the scale is vastly magnified whenever the coefficients that assign more weight to order-one terms (β(2)) are used.

4.6 Case Study: Ozone Data Analysis
This section uses the ozone data from Breiman & Friedman (1985) and follows the analysis performed by Liang et al. (2008), who investigated hyper-g priors. After removing observations with missing values, 330 observations remain, including daily measurements of maximum ozone concentration near Los Angeles and eight meteorological variables (Table 4-5). From the 330 observations, 165 were sampled at random without replacement and used to run the variable selection procedure; the remaining 165 were used for validation. The eight meteorological variables, their interactions, and their squared terms are used as predictors, resulting in a full model with 44 predictors. The model space assumes that the base model MB is the intercept-only model and that MF is the quadratic surface in the eight meteorological variables. The model space contains approximately 7.1 billion models, and computation of all model posterior probabilities is not feasible.
Table 4-5. Variables used in the analyses of the ozone contamination dataset

Name    Description
ozone   Daily max 1hr-average ozone (ppm) at Upland, CA
vh      500 millibar pressure height (m) at Vandenberg AFB
wind    Wind speed (mph) at LAX
hum     Humidity (%) at LAX
temp    Temperature (F) measured at Sandburg, CA
ibh     Inversion base height (ft) at LAX
dpg     Pressure gradient (mm Hg) from LAX to Daggett, CA
vis     Visibility (miles) measured at LAX
ibt     Inversion base temperature (F) at LAX
The HOP, HUP, and HIP with a = 1 and b = ch, as well as the EPP, are considered for comparison purposes. To obtain the Bayes factors in Equation 3–3, four different mixtures of g-priors are utilized: intrinsic priors (IP), which yield the expression in Equation 3–2; hyper-g (HG) priors (Liang et al., 2008) with hyperparameters α = 2, β = 1 and α = β = 1; and Zellner-Siow (ZS) priors (Zellner & Siow, 1980). The results were extracted for the median probability model (MPM) under each combination. Additionally, the model is estimated using the R package hierNet (Bien et al., 2013) to compare the model selection results to those obtained using the hierarchical lasso (Bien et al., 2013), restricted to well-formulated models by imposing the strong heredity constraint. The procedures were assessed on the basis of their predictive accuracy on the validation dataset.
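As a rough sketch of the setup (not of the Bayesian machinery itself), the 44-column quadratic-surface design and the hold-out RMSE computation can be written as follows; a plain least-squares fit stands in for whichever selected model is being evaluated, and the function names are illustrative.

```python
import numpy as np
from itertools import combinations

def quadratic_surface(X):
    """Expand 8 main effects into the full quadratic surface: the 8 linear
    terms, their 8 squares, and the 28 pairwise interactions (44 columns)."""
    pairs = [X[:, [i]] * X[:, [j]] for i, j in combinations(range(X.shape[1]), 2)]
    return np.hstack([X, X ** 2] + pairs)

def holdout_rmse(Z_train, y_train, Z_valid, y_valid):
    """Least-squares fit on the training half, RMSE on the validation half."""
    A = np.column_stack([np.ones(len(Z_train)), Z_train])
    coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    pred = np.column_stack([np.ones(len(Z_valid)), Z_valid]) @ coef
    return float(np.sqrt(np.mean((y_valid - pred) ** 2)))
```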
Among all models, the one that yields the smallest RMSE is the median probability model obtained using the HOP and EPP with the ZS prior, and also using the HOP with both HG priors (Table 4-6). The HOP model with the intrinsic prior has all the terms contained in the lowest-RMSE model with the exception of dpg^2, which has a relatively high marginal inclusion probability of 0.46. This disparity between the IP and the other mixtures of g-priors is explained by the fact that the IP induces less posterior shrinkage than the ZS and HG priors. The MPMs obtained through the HUP and HIP are nested in the best model, suggesting that these model space priors penalize complexity too much and result in false negatives. Consideration of these MPMs suggests that the HOP is best at producing true positives while controlling for false positives.
Finally, the model obtained from the hierarchical lasso (HierNet) is the largest model and produces the second-largest RMSE. All of the terms contained in any of the other models, except for vh, are nested within the hierarchical lasso model, and most of the terms that are exclusive to this model receive extremely low marginal inclusion probabilities under any of the model priors and parameter priors considered under Bayesian model selection.
Table 4-6. Median probability models (MPM) from different combinations of parameter and model priors vs. the model selected using the hierarchical lasso

BF       Prior    Model                                                      R2      RMSE
IP       EPP      hum, dpg, ibt, hum^2, hum*dpg, hum*ibt, dpg^2, ibt^2       0.8054  4.2739
IP       HIP      hum, ibt, hum^2, hum*ibt, ibt^2                            0.7740  4.3396
IP       HOP      hum, dpg, ibt, hum^2, hum*ibt, ibt^2                       0.7848  4.3175
IP       HUP      hum, dpg, ibt, hum*ibt, ibt^2                              0.7767  4.3508
ZS       EPP      hum, dpg, ibt, hum^2, hum*ibt, dpg^2, ibt^2                0.7896  4.2518
ZS       HIP      hum, ibt, hum*ibt, ibt^2                                   0.7525  4.3505
ZS       HOP      hum, dpg, ibt, hum^2, hum*ibt, dpg^2, ibt^2                0.7896  4.2518
ZS       HUP      hum, dpg, ibt, hum*ibt, ibt^2                              0.7767  4.3508
HG(1,1)  EPP      vh, hum, dpg, ibt, hum^2, hum*ibt, dpg^2                   0.7701  4.3049
HG(1,1)  HIP      hum, ibt, hum*ibt, ibt^2                                   0.7525  4.3505
HG(1,1)  HOP      hum, dpg, ibt, hum^2, hum*ibt, dpg^2, ibt^2                0.7896  4.2518
HG(1,1)  HUP      hum, dpg, ibt, hum*ibt, ibt^2                              0.7767  4.3508
HG(2,1)  EPP      hum, dpg, ibt, hum^2, hum*ibt, dpg^2                       0.7701  4.3037
HG(2,1)  HIP      hum, dpg, ibt, hum*ibt, ibt^2                              0.7767  4.3508
HG(2,1)  HOP      hum, dpg, ibt, hum^2, hum*ibt, dpg^2, ibt^2                0.7896  4.2518
HG(2,1)  HUP      hum, dpg, ibt, hum*ibt                                     0.7526  4.4036
         HierNet  hum, temp, ibh, dpg, ibt, vis, hum^2, hum*ibt,             0.7651  4.3680
                  temp^2, temp*ibt, dpg^2
4.7 Discussion

Scott & Berger (2010) noted that the Ockham's-razor effect found automatically in Bayesian variable selection through the Bayes factor does not correct for multiple testing. The Bayes factor penalizes the complexity of the alternative model according to the number of parameters in excess of those of the null model. Therefore, the Bayes factor only controls complexity in a pairwise fashion. If the model selection procedure uses equal prior probabilities for all M ∈ M, then these comparisons ignore the effect of the multiplicity of the testing problem. This is where the role of the prior on the model space becomes important: the multiplicity penalty is "hidden away" in the model prior probabilities π(M|M).
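To make the point concrete, here is a small illustration, not the WFM priors of this chapter, contrasting an equal-probability model prior with a beta-binomial prior of the kind Scott & Berger (2010) analyze for ordinary variable selection over K candidate predictors.

```python
from math import comb

def equal_prior(size, K):
    # uniform over all 2^K models: the same prior mass regardless of K or size,
    # so no multiplicity control
    return 1.0 / 2 ** K

def beta_binomial_prior(size, K):
    # uniform on model size, then uniform over models of that size:
    # pi(M) = 1 / ((K + 1) * C(K, |M|))
    return 1.0 / ((K + 1) * comb(K, size))
```

Under equal priors the prior odds of any one-predictor model against the null are 1 for every K, so adding candidate predictors costs nothing a priori; under the beta-binomial prior those odds are 1/K, which shrink as the number of candidates grows. This is the sense in which the multiplicity penalty lives in π(M|M).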
In addition to the multiplicity of the testing problem, disregarding the hierarchical polynomial structure of the predictors in model selection procedures can lead to different results depending on how the predictors are set up (e.g., in what units the predictors are expressed).
In this chapter we investigated a solution to these two issues. We define prior structures for well-formulated models and develop random walk algorithms to traverse this type of model space. The key to understanding prior distributions on the space of WFMs is the hierarchical nature of the model space itself. The prior distributions described take advantage of that hierarchy in two ways. First, conditional independence and immediate inheritance are used to develop the HOP, HIP, and HUP structures discussed in Section 4.3. Second, the conditional nature of the priors allows for the direct incorporation of complexity penalizations. Of the priors proposed, the HOP with the hyperparameter choice (1, ch) provides the best control of false positives while maintaining a reasonable true positive rate. Thus, this prior is recommended as the default prior on the space of WFMs.
In the near future, the software developed to carry out a Metropolis-Hastings random walk on the space of WFMs will be integrated into the R package varSelectIP. These new functions implement various local priors for the regression coefficients, including the intrinsic prior, the Zellner-Siow prior, and hyper-g priors. In addition, the software supports the computation of credible sets for each regression coefficient, conditioned on the selected model as well as under model averaging.
CHAPTER 5
CONCLUSIONS
Ecologists are now embracing the use of Bayesian methods to investigate the interactions that dictate the distribution and abundance of organisms. These tools are both powerful and flexible: they allow empirical observations and theoretical process models to be integrated under a single methodology, and they can seamlessly account for several sources of uncertainty and dependence. The estimation and testing methods proposed throughout this document will contribute to the understanding of Bayesian methods used in ecology and, hopefully, will shed light on the differences between Bayesian estimation and testing tools.

All of our contributions exploit the potential of the latent variable formulation. This approach greatly simplifies the analysis of complex models: it redirects the bulk of the inferential burden away from the original response variables and places it on the easy-to-work-with latent scale, for which several time-tested approaches are available. Our methods are distinctly classified into estimation and testing tools.
For estimation, we proposed a Bayesian specification of the single-season occupancy model for which a Gibbs sampler is available using both logit and probit link functions. This setup allows detection and occupancy probabilities to depend on linear combinations of predictors. We then developed a dynamic version of this approach, incorporating the notion that occupancy at a previously occupied site depends both on the survival of current settlers and on habitat suitability. Additionally, because these dynamics also vary in space, we suggested a strategy to add spatial dependence among neighboring sites.
Ecological inquiry usually involves competing explanations, and uncertainty surrounds the decision of choosing any one of them. Hence, a model, or a set of probable models, should be selected from all the viable alternatives. To address this testing problem, we proposed an objective and fully automatic Bayesian methodology for the single-season site-occupancy model. Our approach relies on the intrinsic prior, which prevents the introduction of (commonly unavailable) subjective information into the model. In simulation experiments, we observed that the methods accurately single out the predictors present in the true model using the marginal posterior inclusion probabilities of the predictors. For predictors in the true model, these probabilities were comparatively larger than those for predictors not present in the true model. The simulations also indicated that the method provides better discrimination for predictors in the detection component of the model.
In our simulations and in the analysis of the Blue Hawker data, we observed that the effect of using the multiplicity correction prior was substantial. This occurs because the Bayes factor only penalizes the complexity of the alternative model according to its number of parameters in excess of those of the null model. As the number of predictors grows, the number of models in the model space also grows, increasing the chances of making false positive decisions on the inclusion of predictors. This is where the role of the prior on the model space becomes important: the multiplicity penalty is "hidden away" in the model prior probabilities π(M|M). In addition to the multiplicity of the testing problem, disregarding the hierarchical polynomial structure of the predictors in model selection procedures can lead to different results depending on how the predictors are coded (e.g., in what units the predictors are expressed).
To confront this situation, we proposed three prior structures for well-formulated models that take advantage of the hierarchical structure of the predictors. Of the priors proposed, we recommend the HOP with the hyperparameter choice (1, ch), which provides the best control of false positives while maintaining a reasonable true positive rate.
Overall, considering the flexibility of the latent approach, several other extensions of these methods follow. Currently, we envision three future developments: (1) occupancy models that incorporate various sources of information; (2) multi-species models that make use of spatial and interspecific dependence; and (3) methods to conduct model selection for the dynamic and spatially explicit versions of the model.
APPENDIX A
FULL CONDITIONAL DENSITIES: DYMOSS

In this appendix we present the full conditional probability density functions for all the parameters involved in the DYMOSS model, using probit as well as logit links.

Sampler: Z

The full conditionals corresponding to the presence indicators have the same form regardless of the link used. These are derived separately for the cases t = 1, 1 < t < T, and t = T, since their corresponding probabilities take on slightly different forms.
Let φ(ν | µ, σ²) represent the density of a normal random variable ν with mean µ and variance σ², and recall that ψ_i1 = F(x′_(o)i α) and p_ijt = F(q′_ijt λ_t), where F(·) is the inverse link function. The full conditional for z_it is given by:

1. For t = 1,

    π(z_i1 | v_i1, α, λ_1, β^c_1, δ^s_1) = (ψ*_i1)^{z_i1} (1 − ψ*_i1)^{1 − z_i1} = Bernoulli(ψ*_i1),    (A–1)

where

    ψ*_i1 = ψ_i1 φ(v_i1 | x′_i1 β^c_1 + δ^s_1, 1) ∏_{j=1}^{J_i1} (1 − p_ij1)
            / [ ψ_i1 φ(v_i1 | x′_i1 β^c_1 + δ^s_1, 1) ∏_{j=1}^{J_i1} (1 − p_ij1)
              + (1 − ψ_i1) φ(v_i1 | x′_i1 β^c_1, 1) ∏_{j=1}^{J_i1} I{y_ij1 = 0} ].

2. For 1 < t < T,

    π(z_it | z_i(t−1), z_i(t+1), λ_t, β^c_{t−1}, δ^s_{t−1}) = (ψ*_it)^{z_it} (1 − ψ*_it)^{1 − z_it} = Bernoulli(ψ*_it),    (A–2)

where

    ψ*_it = κ_it ∏_{j=1}^{J_it} (1 − p_ijt) / [ κ_it ∏_{j=1}^{J_it} (1 − p_ijt) + ∇_it ∏_{j=1}^{J_it} I{y_ijt = 0} ],

with

    (a) κ_it = F(x′_{i(t−1)} β^c_{t−1} + z_i(t−1) δ^s_{t−1}) φ(v_it | x′_it β^c_t + δ^s_t, 1), and
    (b) ∇_it = [1 − F(x′_{i(t−1)} β^c_{t−1} + z_i(t−1) δ^s_{t−1})] φ(v_it | x′_it β^c_t, 1).

3. For t = T,

    π(z_iT | z_i(T−1), λ_T, β^c_{T−1}, δ^s_{T−1}) = (ψ⋆_iT)^{z_iT} (1 − ψ⋆_iT)^{1 − z_iT} = Bernoulli(ψ⋆_iT),    (A–3)

where

    ψ⋆_iT = κ⋆_iT ∏_{j=1}^{J_iT} (1 − p_ijT) / [ κ⋆_iT ∏_{j=1}^{J_iT} (1 − p_ijT) + ∇⋆_iT ∏_{j=1}^{J_iT} I{y_ijT = 0} ],

with

    (a) κ⋆_iT = F(x′_{i(T−1)} β^c_{T−1} + z_i(T−1) δ^s_{T−1}), and
    (b) ∇⋆_iT = 1 − F(x′_{i(T−1)} β^c_{T−1} + z_i(T−1) δ^s_{T−1}).
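As an illustration, under the probit link the t = T update reduces to a Bernoulli draw; the helper below is a sketch with scalar inputs, where `eta_prev` stands for x′_{i(T−1)} β^c_{T−1} + z_{i(T−1)} δ^s_{T−1}, `p_det` for the detection probabilities p_ijT, and `y_obs` for the detection history y_ijT.

```python
import math
import random

def norm_cdf(x):
    # probit inverse link: F is the standard normal CDF
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def draw_z_T(eta_prev, p_det, y_obs):
    """Draw z_iT from its Bernoulli full conditional (the t = T case)."""
    kappa = norm_cdf(eta_prev)                       # kappa*_iT
    nabla = 1.0 - kappa                              # nabla*_iT
    miss = 1.0
    for p in p_det:
        miss *= (1.0 - p)                            # prod_j (1 - p_ijT)
    all_zero = 1.0 if not any(y_obs) else 0.0        # prod_j I(y_ijT = 0)
    psi = kappa * miss / (kappa * miss + nabla * all_zero)
    return 1 if random.random() < psi else 0
```

Note that any detection (some y_ijT = 1) zeroes the indicator term, forcing psi = 1 and hence z_iT = 1, as it should.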
Sampler: u_i

    π(u_i | z_i1, α) = trN(x′_(o)i α, 1, trunc(z_i1)),    (A–4)

where

    trunc(z_i1) = (−∞, 0] if z_i1 = 0, and (0, ∞) if z_i1 = 1,

and trN(µ, σ², A) denotes the pdf of a truncated normal random variable with mean µ, variance σ², and truncation region A.
Sampler: α

    π(α | u) ∝ [α] ∏_{i=1}^{N} φ(u_i | x′_(o)i α, 1).    (A–5)

If [α] ∝ 1, then

    α | u ∼ N(m(α), Σ_α),

with m(α) = Σ_α X′_(o) u and Σ_α = (X′_(o) X_(o))^{−1}.
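With the flat prior [α] ∝ 1, (A–5) is a standard conjugate normal draw; pairing it with the truncated-normal draw for u_i in (A–4) gives the familiar probit data-augmentation step. The sketch below uses a simple rejection sampler for the truncated normal (an assumption for illustration; it is slow when the mean sits far in the rejected tail).

```python
import numpy as np

rng = np.random.default_rng(7)

def draw_alpha(X, u):
    """alpha | u ~ N(m, S) with S = (X'X)^{-1} and m = S X'u (flat prior)."""
    S = np.linalg.inv(X.T @ X)
    m = S @ (X.T @ u)
    return rng.multivariate_normal(m, S)

def draw_u(x_row, alpha, z):
    """u_i | z_i1: truncated normal with mean x'_i alpha and variance 1,
    restricted to (-inf, 0] when z = 0 and (0, inf) when z = 1."""
    mu = float(x_row @ alpha)
    while True:                       # naive rejection sampling
        u = rng.normal(mu, 1.0)
        if (u > 0) == (z == 1):
            return u
```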
Sampler: v_it

For t > 1,

    π(v_i(t−1) | z_i(t−1), z_it, β^c_{t−1}, δ^s_{t−1}) = trN(µ^(v)_{i(t−1)}, 1, trunc(z_it)),    (A–6)

where µ^(v)_{i(t−1)} = x′_{i(t−1)} β^c_{t−1} + z_i(t−1) δ^s_{t−1}, and trunc(z_it) defines the corresponding truncation region given by z_it.
Sampler: (β^c_{t−1}, δ^s_{t−1})

For t > 1,

    π(β^c_{t−1}, δ^s_{t−1} | v_{t−1}, z_{t−1}) ∝ [β^c_{t−1}, δ^s_{t−1}] ∏_{i=1}^{N} φ(v_i(t−1) | x′_{i(t−1)} β^c_{t−1} + z_i(t−1) δ^s_{t−1}, 1).    (A–7)

If [β^c_{t−1}, δ^s_{t−1}] ∝ 1, then

    (β^c_{t−1}, δ^s_{t−1}) | v_{t−1}, z_{t−1} ∼ N(m(β^c_{t−1}, δ^s_{t−1}), Σ_{t−1}),

with m(β^c_{t−1}, δ^s_{t−1}) = Σ_{t−1} X̃′_{t−1} v_{t−1} and Σ_{t−1} = (X̃′_{t−1} X̃_{t−1})^{−1}, where X̃_{t−1} = (X_{t−1}, z_{t−1}).
Sampler: w_ijt

For t = 1, ..., T and z_it = 1,

    π(w_ijt | z_it = 1, y_ijt, λ_t) = trN(q′_ijt λ_t, 1, trunc(y_ijt)).    (A–8)

Sampler: λ_t

For t = 1, 2, ..., T,

    π(λ_t | z_t, w_t) ∝ [λ_t] ∏_{i : z_it = 1} ∏_{j=1}^{J_it} φ(w_ijt | q′_ijt λ_t, 1).    (A–9)

If [λ_t] ∝ 1, then

    λ_t | w_t, z_t ∼ N(m(λ_t), Σ_{λ_t}),

with m(λ_t) = Σ_{λ_t} Q′_t w_t and Σ_{λ_t} = (Q′_t Q_t)^{−1}, where Q_t and w_t, respectively, are the design matrix and the vector of latent variables for surveys of sites such that z_it = 1.
APPENDIX B
RANDOM WALK ALGORITHMS

Global Jump. From the current state M, the global jump is performed by drawing a model M′ at random from the model space. This is achieved by beginning at the base model and increasing the order from J^min to J^max, the minimum and maximum orders of nodes in N(MF) = MF \ MB; at each order, a set of nodes is selected at random from the prior, conditioned on the nodes already in the model. The MH correction is

    α = min{ 1, m(y | M′, M) / m(y | M, M) }.
Local Jump. From the current state M, the local jump is performed by drawing a model from the set of models L(M) = {M_α : α ∈ E(M) ∪ C(M)}, where M_α is M \ {α} for α ∈ E(M) and M ∪ {α} for α ∈ C(M). The proposal probabilities are computed as a mixture of p(M′ | y, M′ ∈ L(M)) and the discrete uniform distribution. The proposal kernel is

    q(M′ | y, M, M′ ∈ L(M)) = (1/2) [ p(M′ | y, M, M′ ∈ L(M)) + 1/|L(M)| ].

This choice promotes moving to better models while maintaining a non-negligible probability of moving to any of the possible models. The MH correction is

    α = min{ 1, [m(y | M′, M) / m(y | M, M)] · [q(M | y, M′, M ∈ L(M′)) / q(M′ | y, M, M′ ∈ L(M))] }.
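A minimal sketch of the local jump follows; `neighborhood` stands in for L(·), `log_post` for log p(M | y), and models are assumed to be values for which `L(M')` contains M whenever M′ ∈ L(M), as holds for single-node add/remove moves.

```python
import math
import random

def proposal_probs(models, log_post):
    """q = 0.5 * (renormalized posterior over the neighborhood) + 0.5 * uniform."""
    mx = max(log_post(m) for m in models)
    w = [math.exp(log_post(m) - mx) for m in models]   # stable exponentiation
    tot = sum(w)
    return [0.5 * wi / tot + 0.5 / len(models) for wi in w]

def local_jump(M, neighborhood, log_post):
    """One local jump with its Metropolis-Hastings correction."""
    fwd = neighborhood(M)                              # L(M)
    probs = proposal_probs(fwd, log_post)
    M_new = random.choices(fwd, weights=probs)[0]
    q_fwd = probs[fwd.index(M_new)]
    back = neighborhood(M_new)                         # L(M')
    q_back = proposal_probs(back, log_post)[back.index(M)]
    log_alpha = min(0.0, log_post(M_new) - log_post(M)
                    + math.log(q_back) - math.log(q_fwd))
    return M_new if math.log(random.random()) < log_alpha else M
```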
Intermediate Jump. The intermediate jump is performed by increasing or decreasing the order of the nodes under consideration, performing local proposals based on order. For a model M′, define L_j(M′) = {M′} ∪ {M′_α : α ∈ (E(M′) ∪ C(M′)) ∩ N_j(MF)}. From a state M, the kernel chooses at random whether to increase or decrease the order. If M = MF, then decreasing the order is chosen with probability 1, and if M = MB, then increasing the order is chosen with probability 1; in all other cases, the probability of increasing and of decreasing the order is 1/2. The proposal kernels are given below.
Increasing order proposal kernel:

1. Set j = J^min_M − 1 and M′_j = M.

2. Draw M′_{j+1} from q_inc,j+1(· | y, M, M′ ∈ L_{j+1}(M′_j)), where

    q_inc,j+1(M′ | y, M, M′ ∈ L_{j+1}(M′_j)) = (1/2) [ p(M′ | y, M, M′ ∈ L_{j+1}(M′_j)) + 1/|L_{j+1}(M′_j)| ].

3. Set j = j + 1.

4. If j < J^max_M, then return to Step 2; otherwise, proceed to Step 5.

5. Set M′ = M′_{J^max_M} and compute the proposal probability

    q_inc(M′ | y, M) = ∏_{j = J^min_M − 1}^{J^max_M − 1} q_inc,j+1(M′_{j+1} | y, M, M′ ∈ L_{j+1}(M′_j)).    (B–1)
Decreasing order proposal kernel:

1. Set j = J^max_M + 1 and M′_j = M.

2. Draw M′_{j−1} from q_dec,j−1(· | y, M, M′ ∈ L_{j−1}(M′_j)), where

    q_dec,j−1(M′ | y, M, M′ ∈ L_{j−1}(M′_j)) = (1/2) [ p(M′ | y, M, M′ ∈ L_{j−1}(M′_j)) + 1/|L_{j−1}(M′_j)| ].

3. Set j = j − 1.

4. If j > J^min_M, then return to Step 2; otherwise, proceed to Step 5.

5. Set M′ = M′_{J^min_M} and compute the proposal probability

    q_dec(M′ | y, M) = ∏_{j = J^max_M + 1}^{J^min_M + 1} q_dec,j−1(M′_{j−1} | y, M, M′ ∈ L_{j−1}(M′_j)).    (B–2)
If increasing the order is chosen, then the MH correction is given by

    α = min{ 1, [(1 + I(M′ = MF)) / (1 + I(M = MB))] · [q_dec(M | y, M′) / q_inc(M′ | y, M)] · [p(M′ | y, M) / p(M | y, M)] },    (B–3)

and similarly if decreasing the order is chosen.
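The order-increasing chain can be sketched as below; `level_set` is a hypothetical stand-in for L_{j+1}(·) (which always contains the current model), and the function returns both the final proposal M′ and the accumulated log proposal probability from (B–1). The decreasing-order kernel is the mirror image, and (B–3) combines the two accumulated probabilities.

```python
import math
import random

def mixture_probs(cands, log_post):
    # 0.5 * renormalized posterior over the level set + 0.5 * uniform
    mx = max(log_post(m) for m in cands)
    w = [math.exp(log_post(m) - mx) for m in cands]
    tot = sum(w)
    return [0.5 * wi / tot + 0.5 / len(cands) for wi in w]

def increase_order(M, j_min, j_max, level_set, log_post):
    """Order-increasing proposal: sweep the orders upward, at each step drawing
    the next model from L_{j+1}(M'_j) and accumulating log q_inc(M'|y, M)."""
    current, log_q = M, 0.0
    for j in range(j_min - 1, j_max):
        cands = level_set(current, j + 1)        # L_{j+1}(M'_j)
        probs = mixture_probs(cands, log_post)
        nxt = random.choices(cands, weights=probs)[0]
        log_q += math.log(probs[cands.index(nxt)])
        current = nxt
    return current, log_q                        # M' and log q_inc(M'|y, M)
```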
Other Local and Intermediate Kernels. The local and intermediate kernels described here perform a kind of stochastic forwards-backwards selection. Each kernel q can be relaxed to allow more than one node to be turned on or off at each step, which could provide larger jumps for each of these kernels. The tradeoff is that the number of proposed models for such jumps could be very large, precluding the use of posterior information in the construction of the proposal kernel.
APPENDIX CWFM SIMULATION DETAILS
Briefly the idea is to let ZMT(X )βMT
= (QR)βMT= QηMT
(ie βMT= Rminus1ηMT
)
using the QR decomposition As such setting all values in ηMTproportional to one
corresponds to distributing the signal in the model uniformly across all predictors
regardless of their order
The (unconditional) variance of a single observation y_i is var(y_i) = var(E[y_i | z_i]) + E[var(y_i | z_i)], where z_i is the i-th row of the design matrix Z_MT. Hence, we take the signal-to-noise ratio for each observation to be

    SNR(η) = η′_MT R^{−T} Σ_z R^{−1} η_MT / σ²,

where Σ_z = var(z_i). We determine how the signal is distributed across predictors up to a proportionality constant, to be able to control the signal-to-noise ratio simultaneously.
Additionally, to investigate the ability of the model to capture the hierarchical structure correctly, we specify four different 0-1 vectors γ_MT that determine the predictors in the true model MT generating the data in the different scenarios.

Table C-1. Experimental conditions, WFM simulations

Parameter        Values considered
SNR(η_MT) = k    0.25; 1; 4
η_MT ∝           (1, 1_3, 1_4, 1_2); (1, 1_3, (1/2)_4, (1/4)_2); (1, (1/4)_3, (1/2)_4, 1_2)
γ_MT             (1, 1_3, 1_4, 1_2); (1, 1_3, 1_4, 0_2); (1, 1_3, 0_4, 1_2); (1, 0_3, 0, 1, 1, 0, 1_2)
n                130; 260; 1040

The results presented below differ somewhat from those in the main body (Section 4.5): here, the numbers of FPs, TPs, and model sizes are averaged over the 100 independent runs and across the corresponding scenarios for the 20 highest-probability models.
SNR and Sample Size Effect

In terms of the SNR and the sample size (Figure C-1), we observe that, as expected, small sample sizes conditioned upon a small SNR impair the ability of the algorithm to detect true coefficients with both the EPP and HOP(1, ch), with this effect more pronounced when using the latter prior. However, considering the mean number of true positives (TP) jointly with the mean model size, it is clear that although the sensitivity is low, most of the few predictors that are discovered belong to the true model. The results observed with an SNR of 0.25 and a relatively small sample size are far from impressive; however, real problems where the SNR is as low as 0.25 will yield many spurious associations under the EPP. The fact that the HOP(1, ch) offers strong protection against false positives is commendable in itself. An SNR of 1 also represents a feeble relationship between the predictors and the response; nonetheless, the method captures approximately half of the true coefficients while including very few false positives. Following intuition, as either the sample size or the SNR increases, the algorithm's performance is greatly enhanced. Having either a large sample size or a large SNR yields models that contain mostly true predictors. Additionally, the HOP(1, ch) provides strong control over the number of false positives; therefore, for high SNR or larger sample sizes, the number of predictors in the top 20 models is close to the size of the true model. In general, the EPP allows the detection of more TPs, while the HOP(1, ch) provides stronger control over the number of FPs included when considering small sample sizes combined with small SNRs. As either the sample size or the SNR grows, the differences between the two priors become indistinct.
Figure C-1. SNR vs. n. Average model size, average true positives, and average false positives for all simulated scenarios, by model ranking according to model posterior probabilities
Coefficient Magnitude

This part of the experiment explores the effect of how the signal is distributed across predictors. As mentioned before, sphering is used to assign the coefficient values in a manner that controls the amount of signal that goes into each coefficient. Three possible ways to allocate the signal are considered: first, each order-one coefficient contains twice as much signal as any order-two coefficient and four times as much as any order-three coefficient; second, all coefficients contain the same amount of signal, regardless of their order; and third, each order-one coefficient contains half as much signal as any order-two coefficient and a quarter of what any order-three coefficient has. In Figure C-2, these values are denoted by β = c(1_o1, 0.5_o2, 0.25_o3), β = c(1_o1, 1_o2, 1_o3), and β = c(0.25_o1, 0.5_o2, 1_o3), respectively.
Observe that the number of FPs is insensitive to how the SNR is distributed across predictors when using the HOP(1, ch); conversely, when using the EPP, the number of FPs decreases as the SNR grows, always being slightly higher than that obtained with the HOP. With either prior structure, the algorithm performs better whenever all coefficients are equally weighted or when those for the order-three terms have higher weights. In these two cases (i.e., with β = c(1_o1, 1_o2, 1_o3) or β = c(0.25_o1, 0.5_o2, 1_o3)), the effect of the SNR appears to be similar. In contrast, when more weight is given to order-one terms, the algorithm yields slightly worse models at any SNR level. This is an intuitive result, since giving more signal to higher-order terms makes it easier to detect higher-order terms, and consequently, by strong heredity, the algorithm will also select the corresponding lower-order terms included in the true model.
Special Points on the Scale

In Nelder (1998), the author argues that the conditions under which the weak-heredity principle can be used for model selection are so restrictive that the principle is commonly not valid in practice in this context. In addition, the author states that considering only well-formulated models does not take into account the possible presence of special points on the scales of the predictors, that is, situations where omitting lower-order terms is justified due to the nature of the data. However, it is our contention that every model has an underlying well-formulated structure; whether or not some predictor has special points on its scale will be determined through the estimation of the coefficients once a valid well-formulated structure has been chosen.

To understand how the algorithm behaves whenever the true data-generating mechanism has zero-valued coefficients for some lower-order terms in the hierarchy, four different true models are considered. Three of them are not well-formulated, while the remaining one is the WFM shown in Figure 4-6. The three models that have special points correspond to the same model MT from Figure 4-6 but have, respectively, zero-valued coefficients for all the order-one terms, for all the order-two terms, and for x1^2 and x2x5.

Figure C-2. SNR vs. coefficient values. Average model size, average true positives, and average false positives for all simulated scenarios, by model ranking according to model posterior probabilities
As seen before in comparison to the EPP the HOP(1 ch) tightly controls the
inclusion FPs by choosing smaller models at the expense of also reducing the TP
count especially when there is more uncertainty about the true model (ie SNR=025)
For both prior structures the results in Figure C-3 indicate that at low SNR levels the
presence of special points has no apparent impact as the selection behavior is similar
between the four models in terms of both the TP and FP As the SNR increases the
TPs and the model size are affected for true models with zero-valued lower order
128
Figure C-3 SNR vs different true models MT Average model size average truepositives and average false positives for all simulated scenarios by modelranking according to model posterior probabilities
terms These differences however are not very large Relatively smaller models are
selected whenever some terms in the hierarchy are missing; but at high SNR, which
is where the differences are most pronounced, the predictors included are mostly true
coefficients. The impact is almost imperceptible for the true model that lacks order-one
terms and for the model with zero coefficients for x₁² and x₂x₅, and is more visible for
models without order-two terms. This last result is expected due to strong heredity:
whenever the order-one coefficients are missing, the inclusion of order-two and
order-three terms will force their selection, which is also the case when only a few
order-two terms have zero-valued coefficients. Conversely, when all order-two predictors
are removed, some order-three predictors are not selected, as their signal is attributed
to the order-two predictors missing from the true model. This is especially the case for
the order-three interaction term x₁x₂x₅, which depends on the inclusion of the three
order-two terms (x₁x₂, x₁x₅, x₂x₅) in order to be included itself. This makes the
inclusion of this term somewhat more challenging: the three order-two interactions
capture most of the variation of the polynomial terms that is present when the
order-three term is also included. However, special points on the scale commonly occur
on a single covariate or, at most, on a few covariates. A true data-generating mechanism
that removes all terms of a given order is, in the context of polynomial models, clearly
not justified; this was done only for comparison purposes.
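The forced selections described above can be made explicit: under strong heredity, including any term drags in all of its lower-order ancestors. A minimal sketch, again using a hypothetical exponent-tuple encoding of terms:

```python
def heredity_closure(terms):
    """Add every lower-order ancestor required by strong heredity.

    Terms are tuples of exponents over an ordered set of predictors."""
    closed = set(terms)
    stack = list(terms)
    while stack:
        t = stack.pop()
        for i, e in enumerate(t):
            if e > 0:
                p = t[:i] + (e - 1,) + t[i + 1:]  # decrement one exponent
                if p not in closed:
                    closed.add(p)
                    stack.append(p)
    return closed

# Including x1*x2*x5 (exponents over (x1, x2, x5)) forces the three
# two-way interactions, all three main effects, and the intercept.
closure = heredity_closure({(1, 1, 1)})
assert {(1, 1, 0), (1, 0, 1), (0, 1, 1)} <= closure  # two-way interactions
assert {(1, 0, 0), (0, 1, 0), (0, 0, 1)} <= closure  # main effects
assert (0, 0, 0) in closure                          # intercept
```

This is exactly why zeroing the true coefficients of all order-two terms distorts order-three selection: the closure still makes the model carry those order-two predictors.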
APPENDIX D
SUPPLEMENTARY INFORMATION FOR THE OZONE DATA ANALYSIS
The covariates considered for the ozone data analysis match those used in Liang
et al. (2008); these are displayed in Table D-1 below.
Table D-1. Variables used in the analyses of the ozone contamination dataset
Name   Description
ozone  Daily maximum 1-hour-average ozone (ppm) at Upland, CA
vh     500-millibar pressure height (m) at Vandenberg AFB
wind   Wind speed (mph) at LAX
hum    Humidity (%) at LAX
temp   Temperature (°F) measured at Sandburg, CA
ibh    Inversion base height (ft) at LAX
dpg    Pressure gradient (mm Hg) from LAX to Daggett, CA
vis    Visibility (miles) measured at LAX
ibt    Inversion base temperature (°F) at LAX
The marginal posterior inclusion probability corresponds to the probability of including a
given term of the full model MF, after summing over all models in the model space. For each
node α ∈ MF, this probability is given by pα = Σ_{M ∈ M} I(α ∈ M) p(M | y, M). In problems
with a large model space, such as the one considered for the ozone concentration problem,
enumeration of the entire space is not feasible; thus, these probabilities are estimated by
summing over every model drawn by the random walk on the model space M.
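In practice, the Monte Carlo estimate of pα is simply the fraction of models visited by the random walk that contain the term α. A minimal sketch (the term labels are hypothetical):

```python
from collections import Counter

def estimate_mpip(sampled_models):
    """Estimate marginal posterior inclusion probabilities from MCMC output.

    `sampled_models` is the sequence of models visited by the random walk,
    each model given as a frozenset of term labels; the estimate of p_alpha
    is the proportion of draws whose model contains alpha."""
    n = len(sampled_models)
    counts = Counter(term for m in sampled_models for term in m)
    return {term: c / n for term, c in counts.items()}

# Four hypothetical draws over terms from the ozone example.
draws = [frozenset({"ibt", "hum"}),
         frozenset({"ibt", "hum", "hum:ibt"}),
         frozenset({"ibt"}),
         frozenset({"ibt", "hum"})]
mpip = estimate_mpip(draws)
assert mpip["ibt"] == 1.0
assert mpip["hum"] == 0.75
assert mpip["hum:ibt"] == 0.25
```

Because the sampler visits models with frequency proportional to their posterior probability, these proportions converge to the exact sums over the model space.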
Given that there are in total 44 potential predictors, for convenience Tables D-2 to D-5
below display only the marginal posterior inclusion probabilities for the terms included under
at least one of the model priors considered (EPP, HIP, HUP, and HOP), for each of the
parameter priors utilized (intrinsic priors, Zellner-Siow priors, Hyper-g(11), and Hyper-g(21)).
Table D-2. Marginal inclusion probabilities, intrinsic prior
         EPP    HIP    HUP    HOP
hum      0.99   0.69   0.85   0.76
dpg      0.85   0.48   0.52   0.53
ibt      0.99   1.00   1.00   1.00
hum²     0.76   0.51   0.43   0.62
hum·dpg  0.55   0.02   0.03   0.17
hum·ibt  0.98   0.69   0.84   0.75
dpg²     0.72   0.36   0.25   0.46
ibt²     0.59   0.78   0.57   0.81
Table D-3. Marginal inclusion probabilities, Zellner-Siow prior
         EPP    HIP    HUP    HOP
hum      0.76   0.67   0.80   0.69
dpg      0.89   0.50   0.55   0.58
ibt      0.99   1.00   1.00   1.00
hum²     0.57   0.49   0.40   0.57
hum·ibt  0.72   0.66   0.78   0.68
dpg²     0.81   0.38   0.31   0.51
ibt²     0.54   0.76   0.55   0.77
Table D-4. Marginal inclusion probabilities, Hyper-g(11)
         EPP    HIP    HUP    HOP
vh       0.54   0.05   0.10   0.11
hum      0.81   0.67   0.80   0.69
dpg      0.90   0.50   0.55   0.58
ibt      0.99   1.00   0.99   0.99
hum²     0.61   0.49   0.40   0.57
hum·ibt  0.78   0.66   0.78   0.68
dpg²     0.83   0.38   0.30   0.51
ibt²     0.49   0.76   0.54   0.77
Table D-5. Marginal inclusion probabilities, Hyper-g(21)
         EPP    HIP    HUP    HOP
hum      0.79   0.64   0.73   0.67
dpg      0.90   0.52   0.60   0.59
ibt      0.99   1.00   0.99   1.00
hum²     0.60   0.47   0.37   0.55
hum·ibt  0.76   0.64   0.71   0.67
dpg²     0.82   0.41   0.36   0.52
ibt²     0.47   0.73   0.49   0.75
REFERENCES
Akaike, H. (1983). Information measures and model selection. Bulletin of the International Statistical Institute, 50, 277–290.

Albert, J. H., & Chib, S. (1993). Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association, 88(422), 669–679.

Berger, J., & Bernardo, J. (1992). On the development of reference priors. In Bayesian Statistics 4 (pp. 35–60).

Berger, J., & Pericchi, L. (1996). The intrinsic Bayes factor for model selection and prediction. Journal of the American Statistical Association, 91(433), 109–122.

Berger, J., Pericchi, L., & Ghosh, J. (2001). Objective Bayesian methods for model selection: Introduction and comparison. In Model Selection, vol. 38 of IMS Lecture Notes – Monograph Series (pp. 135–207). Institute of Mathematical Statistics.

Besag, J., York, J., & Mollié, A. (1991). Bayesian image restoration, with two applications in spatial statistics. Annals of the Institute of Statistical Mathematics, 43, 1–20.

Bien, J., Taylor, J., & Tibshirani, R. (2013). A lasso for hierarchical interactions. The Annals of Statistics, 41(3), 1111–1141.

Breiman, L., & Friedman, J. (1985). Estimating optimal transformations for multiple regression and correlation. Journal of the American Statistical Association, 80, 580–598.

Brusco, M. J., Steinley, D., & Cradit, J. D. (2009). An exact algorithm for hierarchically well-formulated subsets in second-order polynomial regression. Technometrics, 51(3), 306–315.

Casella, G., Girón, F. J., Martínez, M. L., & Moreno, E. (2009). Consistency of Bayesian procedures for variable selection. The Annals of Statistics, 37(3), 1207–1228.

Casella, G., Moreno, E., & Girón, F. (2014). Cluster analysis, model selection, and prior distributions on models. Bayesian Analysis, TBA, 1–46.
Chipman, H. (1996). Bayesian variable selection with related predictors. Canadian Journal of Statistics, 24(1), 17–36.

Clyde, M., & George, E. I. (2004). Model uncertainty. Statistical Science, 19(1), 81–94.

Dewey, J. (1958). Experience and Nature. New York: Dover Publications.

Dorazio, R. M., & Taylor-Rodríguez, D. (2012). A Gibbs sampler for Bayesian analysis of site-occupancy data. Methods in Ecology and Evolution, 3, 1093–1098.

Ellison, A. M. (2004). Bayesian inference in ecology. Ecology Letters, 7, 509–520.

Fiske, I., & Chandler, R. (2011). unmarked: An R package for fitting hierarchical models of wildlife occurrence and abundance. Journal of Statistical Software, 43(10).

George, E. (2000). The variable selection problem. Journal of the American Statistical Association, 95(452), 1304–1308.

Girón, F. J., Moreno, E., Casella, G., & Martínez, M. L. (2010). Consistency of objective Bayes factors for nonnested linear models and increasing model dimension. Revista de la Real Academia de Ciencias Exactas, Físicas y Naturales. Serie A. Matemáticas, 104(1), 57–67.

Good, I. J. (1950). Probability and the Weighing of Evidence. New York: Haffner.

Griepentrog, G. L., Ryan, J. M., & Smith, L. D. (1982). Linear transformations of polynomial regression models. American Statistician, 36(3), 171–174.

Gunel, E., & Dickey, J. (1974). Bayes factors for independence in contingency tables. Biometrika, 61, 545–557.

Hanski, I. (1994). A practical model of metapopulation dynamics. Journal of Animal Ecology, 63, 151–162.

Hooten, M. (2006). Hierarchical spatio-temporal models for ecological processes. Doctoral dissertation, University of Missouri-Columbia.

Hooten, M. B., & Hobbs, N. T. (2014). A guide to Bayesian model selection for ecologists. Ecological Monographs (in press).
Hughes, J., & Haran, M. (2013). Dimension reduction and alleviation of confounding for spatial generalized linear mixed models. Journal of the Royal Statistical Society, Series B: Statistical Methodology, 75, 139–159.

Hurvich, C. M., & Tsai, C.-L. (1989). Regression and time series model selection in small samples. Biometrika, 76, 297–307.

Jeffreys, H. (1935). Some tests of significance, treated by the theory of probability. Proceedings of the Cambridge Philosophical Society, 31, 203–222.

Jeffreys, H. (1961). Theory of Probability (3rd ed.). London: Oxford University Press.

Johnson, D., Conn, P., Hooten, M., Ray, J., & Pond, B. (2013). Spatial occupancy models for large data sets. Ecology, 94(4), 801–808.

Kass, R., & Wasserman, L. (1995). A reference Bayesian test for nested hypotheses and its relationship to the Schwarz criterion. Journal of the American Statistical Association, 90(431).

Kass, R. E., & Raftery, A. E. (1995). Bayes factors. Journal of the American Statistical Association, 90, 773–795.

Kass, R. E., & Wasserman, L. (1996). The selection of prior distributions by formal rules. Journal of the American Statistical Association, 91(435), 1343.

Kéry, M. (2010). Introduction to WinBUGS for Ecologists: A Bayesian Approach to Regression, ANOVA, Mixed Models and Related Analyses (1st ed.). Academic Press.

Kéry, M., Gardner, B., & Monnerat, C. (2010). Predicting species distributions from checklist data using site-occupancy models. Journal of Biogeography, 37(10), 1851–1862.

Khuri, A. (2002). Nonsingular linear transformations of the control variables in response surface models. Technical report.

Krebs, C. J. (1972). Ecology: The Experimental Analysis of Distribution and Abundance.
Lempers, F. B. (1971). Posterior Probabilities of Alternative Linear Models. Rotterdam: University of Rotterdam Press.

León-Novelo, L., Moreno, E., & Casella, G. (2012). Objective Bayes model selection in probit models. Statistics in Medicine, 31(4), 353–365.

Liang, F., Paulo, R., Molina, G., Clyde, M. A., & Berger, J. O. (2008). Mixtures of g priors for Bayesian variable selection. Journal of the American Statistical Association, 103(481), 410–423.

Link, W., & Barker, R. (2009). Bayesian Inference with Ecological Applications. Elsevier.

MacKenzie, D., & Nichols, J. (2004). Occupancy as a surrogate for abundance estimation. Animal Biodiversity and Conservation, 1, 461–467.

MacKenzie, D., Nichols, J., & Hines, J. (2003). Estimating site occupancy, colonization, and local extinction when a species is detected imperfectly. Ecology, 84(8), 2200–2207.

MacKenzie, D. I., Bailey, L. L., & Nichols, J. D. (2004). Investigating species co-occurrence patterns when species are detected imperfectly. Journal of Animal Ecology, 73, 546–555.

MacKenzie, D. I., Nichols, J. D., Lachman, G. B., Droege, S., Royle, J. A., & Langtimm, C. A. (2002). Estimating site occupancy rates when detection probabilities are less than one. Ecology, 83(8), 2248–2255.

Mazerolle, M. (2013). Package 'AICcmodavg'.

McCullagh, P., & Nelder, J. A. (1989). Generalized Linear Models (2nd ed.). London: Chapman & Hall.

McQuarrie, A., Shumway, R., & Tsai, C.-L. (1997). The model selection criterion AICu.
Moreno, E., Bertolino, F., & Racugno, W. (1998). An intrinsic limiting procedure for model selection and hypotheses testing. Journal of the American Statistical Association, 93(444), 1451–1460.

Moreno, E., Girón, F. J., & Casella, G. (2010). Consistency of objective Bayes factors as the model dimension grows. The Annals of Statistics, 38(4), 1937–1952.

Nelder, J. A. (1977). A reformulation of linear models. Journal of the Royal Statistical Society, Series A, 140, 48–77.

Nelder, J. A. (1998). The selection of terms in response-surface models: How strong is the weak-heredity principle? American Statistician, 52(4), 315–318.

Nelder, J. A. (2000). Functional marginality and response-surface fitting. Journal of Applied Statistics, 27(1), 109–112.

Nichols, J., Hines, J., & MacKenzie, D. (2007). Occupancy estimation and modeling with multiple states and state uncertainty. Ecology, 88(6), 1395–1400.

Ovaskainen, O., Hottola, J., & Siitonen, J. (2010). Modeling species co-occurrence by multivariate logistic regression generates new hypotheses on fungal interactions. Ecology, 91(9), 2514–2521.

Peixoto, J. L. (1987). Hierarchical variable selection in polynomial regression models. American Statistician, 41(4), 311–313.

Peixoto, J. L. (1990). A property of well-formulated polynomial regression models. American Statistician, 44(1), 26–30.

Pericchi, L. R. (2005). Model selection and hypothesis testing based on objective probabilities and Bayes factors. In Handbook of Statistics. Elsevier.

Polson, N. G., Scott, J. G., & Windle, J. (2013). Bayesian inference for logistic models using Pólya-Gamma latent variables. Journal of the American Statistical Association, 108, 1339–1349.

Rao, C. R., & Wu, Y. (2001). On model selection. In Lecture Notes – Monograph Series, vol. 38 (pp. 1–57). Beachwood, OH: Institute of Mathematical Statistics.
Reich, B. J., Hodges, J. S., & Zadnik, V. (2006). Effects of residual smoothing on the posterior of the fixed effects in disease-mapping models. Biometrics, 62, 1197–1206.

Reiners, W., & Lockwood, J. (2009). Philosophical Foundations for the Practices of Ecology. Cambridge University Press.

Rigler, F., & Peters, R. (1995). Science and Limnology (Excellence in Ecology). Germany: Ecology Institute.

Robert, C., Chopin, N., & Rousseau, J. (2009). Harold Jeffreys's Theory of Probability revisited. Statistical Science, 24(2), 141–179.

Robert, C. P. (1993). A note on the Jeffreys-Lindley paradox. Statistica Sinica, 3, 601–608.

Royle, J. A., & Kéry, M. (2007). A Bayesian state-space formulation of dynamic occupancy models. Ecology, 88(7), 1813–1823.

Scott, J., & Berger, J. (2010). Bayes and empirical-Bayes multiplicity adjustment in the variable-selection problem. The Annals of Statistics.

Spiegelhalter, D. J., & Smith, A. F. M. (1982). Bayes factors for linear and log-linear models with vague prior information. Journal of the Royal Statistical Society, Series B, 44, 377–387.

Tierney, L., & Kadane, J. B. (1986). Accurate approximations for posterior moments and marginal densities. Journal of the American Statistical Association, 81, 82–86.

Tyre, A. J., Tenhumberg, B., Field, S. A., Niejalke, D., Parris, K., & Possingham, H. P. (2003). Improving precision and reducing bias in biological surveys: Estimating false-negative error rates. Ecological Applications, 13(6), 1790–1801.

Waddle, J. H., Dorazio, R. M., Walls, S. C., Rice, K. G., Beauchamp, J., Schuman, M. J., & Mazzotti, F. J. (2010). A new parameterization for estimating co-occurrence of interacting species. Ecological Applications, 20, 1467–1475.

Wasserman, L. (2000). Bayesian model selection and model averaging. Journal of Mathematical Psychology, 44(1), 92–107.

Wilson, M., Iversen, E., Clyde, M. A., Schmidler, S. C., & Schildkraut, J. M. (2010). Bayesian model search and multilevel inference for SNP association studies. The Annals of Applied Statistics, 4(3), 1342–1364.

Womack, A. J., León-Novelo, L., & Casella, G. (2014). Inference from intrinsic Bayes procedures under model selection and uncertainty. Journal of the American Statistical Association.

Yuan, M., Joseph, V. R., & Zou, H. (2009). Structured variable selection and estimation. The Annals of Applied Statistics, 3(4), 1738–1757.

Zeller, K. A., Nijhawan, S., Salom-Pérez, R., Potosme, S. H., & Hines, J. E. (2011). Integrating occupancy modeling and interview data for corridor identification: A case study for jaguars in Nicaragua. Biological Conservation, 144(2), 892–901.

Zellner, A., & Siow, A. (1980). Posterior odds ratios for selected regression hypotheses. In Trabajos de Estadística y de Investigación Operativa (pp. 585–603).
BIOGRAPHICAL SKETCH
Daniel Taylor-Rodríguez was born in Bogotá, Colombia. He earned a BS
degree in economics from the Universidad de Los Andes (2004) and a Specialist
degree in statistics from the Universidad Nacional de Colombia. In 2009 he traveled
to Gainesville, Florida, to pursue a master's in statistics under the supervision of
George Casella. Upon completion, he started a PhD in interdisciplinary ecology with
concentration in statistics, again under George Casella's supervision. After George's
passing, Linda Young and Nikolay Bliznyuk continued to oversee Daniel's mentorship.
He has accepted a joint postdoctoral fellowship at the Statistical and Applied
Mathematical Sciences Institute and the Department of Statistical Science at Duke
University.
LIST OF FIGURES
Figure page
2-1 Graphical representation, occupancy model
2-2 Graphical representation, occupancy model after data augmentation
2-3 Graphical representation, multiseason model for a single site
2-4 Graphical representation, data-augmented multiseason model
3-1 Predictor MPIP averaged over scenarios with N=50 and N=100 sites, using uniform (U) and multiplicity correction (MC) priors
3-2 Predictor MPIP averaged over scenarios with J=3 and J=5 surveys per site, using uniform (U) and multiplicity correction (MC) priors
3-3 Predictor MPIP averaged over scenarios with the interaction between the number of sites and the surveys per site, using uniform (U) and multiplicity correction (MC) priors
3-4 Predictor MPIP averaged over scenarios with equal signal in the occupancy probabilities, using uniform (U) and multiplicity correction (MC) priors
3-5 Predictor MPIP averaged over scenarios with equal signal in the detection probabilities, using uniform (U) and multiplicity correction (MC) priors
4-1 Graphs of well-formulated polynomial models for p = 2
4-2 E(M) and C(M) in M defined by a quadratic surface in two main effects, for model M = {1, x1, x1²}
4-3 Graphical representation of assumptions on M defined by the quadratic surface in two main effects
4-4 Prior probabilities for the space of well-formulated models associated with the quadratic surface on two variables, where MB is taken to be the intercept-only model and (a, b) ∈ {(1, 1), (1, ch)}
4-5 Prior probabilities for the space of well-formulated models associated with three main effects and one interaction term, where MB is taken to be the intercept-only model and (a, b) ∈ {(1, 1), (1, ch)}
4-6 MT: DAG of the largest true model used in simulations
4-7 Average true positives (TP) and average false positives (FP) in all simulated scenarios for the median probability model with EPP and HOP(1, ch)
C-1 SNR vs. n: average model size, average true positives, and average false positives for all simulated scenarios, by model ranking according to model posterior probabilities
C-2 SNR vs. coefficient values: average model size, average true positives, and average false positives for all simulated scenarios, by model ranking according to model posterior probabilities
C-3 SNR vs. different true models MT: average model size, average true positives, and average false positives for all simulated scenarios, by model ranking according to model posterior probabilities
Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy
OBJECTIVE BAYESIAN METHODS FOR OCCUPANCY MODEL ESTIMATION ANDSELECTION
By
Daniel Taylor-Rodríguez
August 2014
Chair: Linda J. Young
Cochair: Nikolay Bliznyuk
Major: Interdisciplinary Ecology
The ecological literature contains numerous methods for conducting inference about the dynamics that govern biological populations. Among these methods, occupancy models have played a leading role during the past decade in the analysis of large biological population surveys. The flexibility of the occupancy framework has brought about useful extensions for determining key population parameters, which provide insights about the distribution, structure, and dynamics of a population. However, the methods used to fit the models and to conduct inference have gradually grown in complexity, leaving practitioners unable to fully understand their implicit assumptions and increasing the potential for misuse. This motivated our first contribution: we develop a flexible and straightforward estimation method for occupancy models that provides the means to directly incorporate temporal and spatial heterogeneity, using covariate information that characterizes habitat quality and the detectability of a species.
Adding to the issue mentioned above, studies of complex ecological systems now collect large amounts of information. To identify the drivers of these systems, robust techniques that account for test multiplicity and for the structure in the predictors are necessary but unavailable for ecological models. We develop tools to address this methodological gap. First, working in an "objective" Bayesian framework, we develop the first fully automatic and objective method for occupancy model selection, based on intrinsic parameter priors. Moreover, for the general variable selection problem, we propose three sets of prior structures on the model space that correct for multiple testing, and a stochastic search algorithm that relies on the priors on the model space to account for the polynomial structure in the predictors.
CHAPTER 1
GENERAL INTRODUCTION
As with any other branch of science, ecology strives to grasp truths about the world that surrounds us, and in particular about nature. The objective truth sought by ecology may well be beyond our grasp; however, it is reasonable to think that, at least partially, "Nature is capable of being understood" (Dewey 1958). We can observe and interpret nature to formulate hypotheses, which can then be tested against reality. Hypotheses that encounter little or no opposition when confronted with reality may become contextual versions of the truth, and may be generalized by scaling them spatially and/or temporally to delimit the bounds within which they are valid.
To formulate hypotheses accurately, and in a fashion amenable to scientific inquiry, not only must the point of view and the assumptions considered be made explicit, but also the object of interest, the properties of that object worthy of consideration, and the methods used in studying such properties (Reiners & Lockwood 2009; Rigler & Peters 1995). Ecology, as defined by Krebs (1972), is "the study of interactions that determine the distribution and abundance of organisms". This characterizes organisms and their interactions as the objects of interest to ecology, and prescribes distribution and abundance as relevant properties of these organisms.
With regard to the methods used to acquire ecological scientific knowledge, traditionally theoretical mathematical models (such as deterministic PDEs) have been used. However, naturally varying systems are imprecisely observed and, as such, are subject to multiple sources of uncertainty that must be explicitly accounted for. Because of this, the ecological scientific community has developed a growing interest in flexible and powerful statistical methods, and among these, Bayesian hierarchical models predominate. These methods rely on empirical observations and can accommodate fairly complex relationships between empirical observations and theoretical process models, while accounting for diverse sources of uncertainty (Hooten 2006).
Bayesian approaches are now used extensively in ecological modeling; however, there are two issues of concern, one from the standpoint of ecological practitioners and another from the perspective of scientific ecological endeavors. First, Bayesian modeling tools require a considerable understanding of probability and statistical theory, leading practitioners to view them as black-box approaches (Kery 2010). Second, although Bayesian applications proliferate in the literature, in general there is a lack of awareness of the distinction between approaches specifically devised for testing and those for estimation (Ellison 2004). Furthermore, there is a dangerous unfamiliarity with the proven risks of using tools designed for estimation in testing procedures (e.g., the use of flat priors in hypothesis testing) (Berger & Pericchi 1996; Berger et al 2001; Kass & Raftery 1995; Moreno et al 1998; Robert et al 2009; Robert 1993).
Occupancy models have played a leading role during the past decade in large biological population surveys. The flexibility of the occupancy framework has allowed the development of useful extensions to determine several key population parameters, which provide robust notions of the distribution, structure, and dynamics of a population. In order to address some of the concerns stated in the previous paragraphs, we concentrate on the occupancy framework to develop estimation and testing tools that will allow ecologists, first, to gain insight about the estimation procedure and, second, to conduct statistically sound model selection for site-occupancy data.
1.1 Occupancy Modeling
Since MacKenzie et al (2002) and Tyre et al (2003) introduced the site-occupancy framework, countless applications and extensions of the method have been developed in the ecological literature, as evidenced by the 438,000 hits on Google Scholar for a search of "occupancy model". This class of models acknowledges that techniques used to conduct biological population surveys are prone to detection errors: if an individual is detected, it must be present, while if it is not detected, it might or might not be. Occupancy models improve upon traditional binary regression by accounting
for observed detection and partially observed presence as two separate but related components. In the site-occupancy setting, the chosen locations are surveyed repeatedly in order to reduce the ambiguity caused by the observed zeros. This approach therefore allows probabilities of both presence (occurrence) and detection to be estimated.
The uses of site-occupancy models are many. For example, metapopulation and island biogeography models are often parameterized in terms of site (or patch) occupancy (Hanski 1992, 1994, 1997, as cited in MacKenzie et al (2003)), and occupancy may be used as a surrogate for abundance to answer questions regarding geographic distribution, range size, and metapopulation dynamics (MacKenzie et al 2004; Royle & Kery 2007).
The basic occupancy framework, which assumes a single closed population with fixed probabilities through time, has proven to be quite useful; however, it might be of limited utility when addressing some problems. In particular, the assumptions of the basic model may become too restrictive or unrealistic whenever the study period extends over multiple years or seasons, especially given the increasingly changing environmental conditions that most ecosystems are currently experiencing.
Among the extensions found in the literature, one that we consider particularly relevant incorporates heterogeneous occupancy probabilities through time. Models that incorporate temporally varying probabilities stem from important metapopulation notions provided by Hanski (1994), such as occupancy probabilities depending on local colonization and local extinction processes. In spite of the conceptual usefulness of Hanski's model, several strong and untenable assumptions (e.g., all patches being homogeneous in quality) are required for it to provide practically meaningful results.
A more viable alternative, which builds on Hanski (1994), is the extension of the single-season occupancy model developed by MacKenzie et al (2003). In this model, the heterogeneity of occupancy probabilities across seasons arises from local colonization and extinction processes. The model is flexible enough to allow detection, occurrence, extinction, and colonization probabilities each to depend on its own set of covariates. Model parameters are obtained through likelihood-based estimation.
Using a maximum likelihood approach presents two drawbacks. First, the uncertainty assessment for maximum likelihood parameter estimates relies on asymptotic results, which are obtained from implementation of the delta method, making it sensitive to sample size. Second, to obtain parameter estimates, the latent process (occupancy) is marginalized out of the likelihood, leading to the usual zero-inflated Bernoulli model. Although this is a convenient strategy for solving the estimation problem, after integrating out the latent state variables (occupancy indicators), they are no longer available; therefore, finite-sample estimates cannot be calculated directly. Instead, a supplementary parametric bootstrapping step is necessary. Further, additional structure, such as temporal or spatial variation, cannot be introduced by means of random effects (Royle & Kery 2007).
1.2 A Primer on Objective Bayesian Testing
With the advent of high-dimensional data, such as that found in modern problems in ecology, genetics, physics, etc., coupled with evolving computing capability, objective Bayesian inferential methods have gained increasing popularity. This, however, is by no means a new approach to the way Bayesian inference is conducted. In fact, starting with Bayes and Laplace and continuing for almost 200 years, Bayesian analysis was primarily based on "noninformative" priors (Berger & Bernardo 1992).
Now, subjective elicitation of prior probabilities in Bayesian analysis is widely recognized as the ideal (Berger et al 2001); however, it is often the case that the available information is insufficient to specify appropriate prior probabilistic statements. Commonly, as in model selection problems where large model spaces have to be explored, the number of model parameters is prohibitively large, preventing one from eliciting prior information for the entire parameter space. As a consequence, in practice,
the determination of priors through the definition of structural rules has become the alternative to subjective elicitation for a variety of problems in Bayesian testing. Priors arising from these rules are known in the literature as noninformative, objective, default, or reference priors. Many of these connotations generate controversy and are accused, perhaps rightly, of providing a false pretension of objectivity. Nevertheless, we will avoid that discussion and refer to them herein interchangeably as noninformative or objective priors, to convey the sense that no attempt to introduce an informed opinion is made in defining prior probabilities.
A plethora of "noninformative" methods has been developed in the past few decades (see Berger & Bernardo (1992); Berger & Pericchi (1996); Berger et al (2001); Clyde & George (2004); Kass & Wasserman (1995, 1996); Liang et al (2008); Moreno et al (1998); Spiegelhalter & Smith (1982); Wasserman (2000); and the references therein). We find particularly interesting those derived from the model structure, in which no tuning parameters are required, especially since these can be regarded as automatic methods. Among them, methods based on the Bayes factor for intrinsic priors have proven their worth in a variety of inferential problems, given their excellent performance, flexibility, and ease of use. This class of priors is discussed in detail in Chapter 3. For now, some basic notation and notions of Bayesian inferential procedures are introduced.
Hypothesis testing and the Bayes factor
Bayesian model selection techniques that aim to find the true model, as opposed to searching for the model that best predicts the data, are fundamentally extensions of Bayesian hypothesis testing strategies. In general, this Bayesian approach to hypothesis testing and model selection relies on determining the amount of evidence found in favor of one hypothesis (or model) over the other, given an observed set of data. Approached from a Bayesian standpoint, this type of problem can be formulated in great generality using a natural, well-defined probabilistic framework that incorporates both model and parameter uncertainty.
Jeffreys (1935) first developed the Bayesian strategy for hypothesis testing and, consequently, for the model selection problem. Bayesian model selection within a model space M = {M1, M2, ..., MJ}, where each model Mj is associated with a parameter θj (which may itself be a vector of parameters), incorporates three types of probability distributions: (1) a prior probability distribution for each model, π(Mj); (2) a prior probability distribution for the parameters in each model, π(θj | Mj); and (3) the distribution of the data conditional on both the model and the model's parameters, f(x | θj, Mj). These three probability densities induce the joint distribution p(x, θj, Mj) = f(x | θj, Mj) · π(θj | Mj) · π(Mj), which is instrumental in producing model posterior probabilities. The model posterior probability is the probability that a model is true given the data. It is obtained by marginalizing over the parameter space and using Bayes' rule:

p(Mj | x) = m(x | Mj) π(Mj) / Σ_{i=1}^J m(x | Mi) π(Mi),  (1–1)

where m(x | Mj) = ∫ f(x | θj, Mj) π(θj | Mj) dθj is the marginal likelihood of Mj.
Given that interest lies in comparing different models, evidence in favor of one or another model is assessed with pairwise comparisons using posterior odds:

p(Mj | x) / p(Mk | x) = [m(x | Mj) / m(x | Mk)] · [π(Mj) / π(Mk)].  (1–2)

The first term on the right-hand side of (1–2), m(x | Mj)/m(x | Mk), is known as the Bayes factor comparing model Mj to model Mk, and is denoted by BFjk(x). The Bayes factor provides a measure of the evidence in favor of either model given the data, and updates the model prior odds, π(Mj)/π(Mk), to produce the posterior odds.
Note that the model posterior probability in (1–1) can be expressed as a function of Bayes factors. To illustrate, let M* ∈ M be a reference model, to which all other models in M are compared. Then dividing both the numerator and denominator in (1–1) by m(x | M*) π(M*) yields

p(Mj | x) = BFj*(x) [π(Mj)/π(M*)] / { 1 + Σ_{Mi ∈ M, Mi ≠ M*} BFi*(x) [π(Mi)/π(M*)] }.  (1–3)
Therefore, as the Bayes factor increases, the posterior probability of model Mj given the data increases. If all models have equal prior probabilities, a straightforward criterion to select the best among all candidate models is to choose the model with the largest Bayes factor. As such, the Bayes factor is not only useful for identifying models favored by the data, but it also provides a means to rank models in terms of their posterior probabilities. Assuming equal model prior probabilities in (1–3), the prior odds are set equal to one, and the model posterior odds in (1–2) become p(Mj | x)/p(Mk | x) = BFjk(x). Based on the Bayes factor, the evidence in favor of one or the other model can be interpreted using Table 1-1, adapted from Kass & Raftery (1995).
Table 1-1. Interpretation of BFjk when contrasting Mj and Mk

ln BFjk    BFjk        Evidence in favor of Mj    P(Mj | x)
0 to 2     1 to 3      Weak evidence              0.5 to 0.75
2 to 6     3 to 20     Positive evidence          0.75 to 0.95
6 to 10    20 to 150   Strong evidence            0.95 to 0.99
>10        >150        Very strong evidence       >0.99
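As a concrete illustration of (1–1)–(1–3) and of the thresholds in Table 1-1, consider testing a point null θ = 1/2 against θ ~ Uniform(0, 1) for Bernoulli data, where both marginal likelihoods are available in closed form. This is a minimal sketch; the data (15 successes in 20 trials) are hypothetical and chosen only for illustration.

```python
from math import lgamma, log, exp

def log_beta(a, b):
    # log of the Beta function, B(a, b) = Γ(a)Γ(b)/Γ(a+b)
    return lgamma(a) + lgamma(b) - lgamma(a + b)

# Hypothetical data: s successes in n Bernoulli trials.
n, s = 20, 15

# M1: theta fixed at 1/2 -> m(x|M1) = (1/2)^n (no free parameters).
log_m1 = n * log(0.5)

# M2: theta ~ Uniform(0,1) -> m(x|M2) = B(s+1, n-s+1), the Bernoulli
# likelihood integrated against the flat prior.
log_m2 = log_beta(s + 1, n - s + 1)

# Bayes factor BF21 = m(x|M2)/m(x|M1), as in (1-2).
bf21 = exp(log_m2 - log_m1)

# With equal model prior probabilities, (1-3) reduces to
# p(M2|x) = BF21/(1 + BF21).
post_m2 = bf21 / (1.0 + bf21)

# BF21 is roughly 3.2: "positive evidence" for M2 per Table 1-1.
print(bf21, post_m2)
```

The same calculation extends to any pair of models for which the marginal likelihoods can be computed or approximated.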
Bayesian hypothesis testing and model selection procedures through Bayes factors and posterior probabilities have several desirable features. First, these methods have a straightforward interpretation, since the Bayes factor is an increasing function of model (or hypothesis) posterior probabilities. Second, these methods can yield frequentist-matching confidence bounds when implemented with good testing priors (Kass & Wasserman 1996), such as the reference priors of Berger & Bernardo (1992). Third, since the Bayes factor contains the ratio of marginal densities, it automatically penalizes complexity according to the number of parameters in each model; this property is known as Ockham's razor (Kass & Raftery 1995). Fourth, the use of Bayes factors does not require nested hypotheses (i.e., the null hypothesis nested in the alternative), standard distributions, or regular asymptotics (e.g., convergence to normal or chi-squared distributions) (Berger et al 2001). In contrast, this is not always the case with frequentist and likelihood ratio tests, which depend on known distributions (at least asymptotically) for the test statistic to perform the test. Finally, Bayesian hypothesis testing procedures using the Bayes factor can naturally incorporate model uncertainty, by using the Bayesian machinery for model-averaged predictions and confidence bounds (Kass & Raftery 1995). It is not clear how to account for this uncertainty rigorously in a fully frequentist approach.
1.3 Overview of the Chapters
In the chapters that follow, we develop a flexible and straightforward hierarchical Bayesian framework for occupancy models, allowing us to obtain estimates and conduct robust testing from an "objective" Bayesian perspective. Latent mixtures of random variables supply a foundation for our methodology. This approach provides a means to directly incorporate spatial dependency and temporal heterogeneity through predictors that characterize either the habitat quality of a given site or the detectability features of a particular survey conducted at a specific site. The Bayesian testing methods we propose are (1) a fully automatic and objective method for occupancy model selection, and (2) an objective Bayesian testing tool that accounts for multiple testing and for polynomial hierarchical structure in the space of predictors.
Chapter 2 introduces the methods proposed for estimation of occupancy model parameters. A simple estimation procedure for the single-season occupancy model with covariates is formulated using both probit and logit links. Building on this simple version, an extension is provided to cope with metapopulation dynamics by introducing persistence and colonization processes. Finally, given the fundamental role that spatial dependence plays in defining temporal dynamics, a strategy to seamlessly account for this feature in our framework is introduced.
Chapter 3 develops a new, fully automatic and objective method for occupancy model selection that is asymptotically consistent for variable selection and averts the use of tuning parameters. In this chapter, some issues surrounding multimodel inference are first described, and insight about objective Bayesian inferential procedures is provided. Then, building on modern methods for "objective" Bayesian testing to generate priors on the parameter space, the intrinsic priors for the parameters of the occupancy model are obtained. These are used in the construction of an algorithm for "objective" variable selection tailored to the occupancy model framework.
Chapter 4 touches on two important and interconnected issues in model testing that have yet to receive the attention they deserve: (1) controlling for false discovery in hypothesis testing given the size of the model space, i.e., given the number of tests performed; and (2) non-invariance to location transformations of variable selection procedures in the face of polynomial predictor structure. Both elements depend on the definition of prior probabilities on the model space. In this chapter, a set of priors on the model space and a stochastic search algorithm are proposed. Together, these control for model multiplicity and account for the polynomial structure among the predictors.
CHAPTER 2
MODEL ESTIMATION METHODS
"Data! Data! Data!" he cried impatiently. "I can't make bricks without clay."

–Sherlock Holmes, The Adventure of the Copper Beeches
2.1 Introduction
Prior to the introduction of site-occupancy models (MacKenzie et al 2002; Tyre et al 2003), presence-absence data from ecological monitoring programs were used without any adjustment to assess the impact of management actions, to observe trends in species distribution through space and time, or to model the habitat of a species (Tyre et al 2003). These efforts, however, were suspect due to false-negative errors not being accounted for. False-negative errors occur whenever a species is present at a site but goes undetected during the survey.
Site-occupancy models, developed independently by MacKenzie et al (2002) and Tyre et al (2003), extend simple binary-regression models to account for the aforementioned errors in detection of individuals, common in surveys of animal or plant populations. Since their introduction, the site-occupancy framework has been used in countless applications and numerous extensions of it have been proposed. Occupancy models improve upon traditional binary regression by analyzing observed detection and partially observed presence as two separate but related components. In the site-occupancy setting, the chosen locations are surveyed repeatedly in order to reduce the ambiguity caused by the observed zeros. This approach therefore allows simultaneous estimation of the probabilities of presence (occurrence) and detection.
Several extensions to the basic single-season closed-population model are now available. The occupancy approach has been used to determine species range dynamics (MacKenzie et al 2003; Royle & Kery 2007), to understand age/stage structure within populations (Nichols et al 2007), and to model species co-occurrence (MacKenzie et al 2004; Ovaskainen et al 2010; Waddle et al 2010). It has even been suggested as a surrogate for abundance (MacKenzie & Nichols 2004): MacKenzie et al suggested using occupancy models to conduct large-scale monitoring programs, since this approach avoids the high costs associated with surveys designed for abundance estimation. Also, to investigate metapopulation dynamics, occupancy models improve upon incidence function models (Hanski 1994), which are often parameterized in terms of site (or patch) occupancy and assume homogeneous patches and a metapopulation that is at a colonization-extinction equilibrium.
Nevertheless, the implementation of Bayesian occupancy models commonly resorts to sampling strategies dependent on hyperparameters, subjective prior elicitation, and relatively elaborate algorithms. From the standpoint of practitioners, these are often treated as black-box methods (Kery 2010); as such, the potential for using the methodology incorrectly is high. Commonly, these procedures are fitted with packages such as BUGS or JAGS. Although these packages' ease of use has led to widespread adoption of the methods, the user may be oblivious to the assumptions underpinning the analysis.
We believe that providing straightforward and robust alternatives for implementing these methods will help practitioners gain insight about how occupancy modeling, and more generally Bayesian modeling, is performed. In this chapter, using a simple Gibbs sampling approach, we first develop a versatile method to estimate the single-season closed-population site-occupancy model, then extend it to analyze metapopulation dynamics through time, and finally provide a further adaptation to incorporate spatial dependence among neighboring sites.
2.1.1 The Occupancy Model
In this section we first introduce our results published in Dorazio & Taylor-Rodríguez (2012) and build upon them to propose relevant extensions. Under the standard sampling protocol for collecting site-occupancy data, J > 1 independent surveys are conducted at each of N representative sample locations (sites), noting whether a species is detected or not detected during each survey. Let yij denote a binary random variable that indicates detection (yij = 1) or non-detection (yij = 0) during the jth survey of site i. Without loss of generality, J may be assumed constant among all N sites to simplify description of the model; in practice, site-specific variation in J poses no real difficulties and is easily implemented. This sampling protocol therefore yields an N × J matrix Y of detection/non-detection data.
Note that the observed process yij is an imperfect representation of the underlying occupancy, or presence, process. Hence, letting zi denote the presence indicator at site i, this model specification can be represented through the hierarchy

yij | zi, λ ~ Bernoulli(zi pij),
zi | α ~ Bernoulli(ψi),  (2–1)

where pij is the probability of correctly classifying the ith site as occupied during the jth survey, and ψi is the presence probability at the ith site. This process is represented graphically in Figure 2-1.
Figure 2-1. Graphical representation of the occupancy model (nodes ψi, zi, yi, pi).
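A small simulation following the hierarchy in (2–1) may clarify the data structure; the values of N, J, ψ, and p below are hypothetical constants chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical design: N sites, J surveys, constant psi and p.
N, J = 100, 5
psi, p = 0.6, 0.4

# Latent presence indicators z_i ~ Bernoulli(psi).
z = rng.binomial(1, psi, size=N)

# Detections y_ij | z_i ~ Bernoulli(z_i * p): a species can only be
# detected at occupied sites, so unoccupied sites yield all-zero rows.
y = rng.binomial(1, z[:, None] * p, size=(N, J))

# A naive occupancy estimate that ignores detection error (fraction of
# sites with at least one detection) is biased low relative to mean(z).
naive = (y.sum(axis=1) > 0).mean()
print(z.mean(), naive)
```

The repeated surveys are what make ψ and p separately estimable: an all-zero row is ambiguous, but its probability differs under occupancy (ψ(1−p)^J) and non-occupancy (1−ψ).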
Probabilities of detection and occupancy can both be made functions of covariates, and their corresponding parameter estimates can be obtained using either a maximum likelihood or a Bayesian approach. Existing methodologies from the likelihood perspective marginalize over the latent occupancy process (zi), making the estimation procedure depend only on the detections. Most Bayesian strategies rely on MCMC algorithms that require parameter prior specification and tuning. However, Albert & Chib (1993) proposed a strategy, now longstanding in the Bayesian statistical literature, that models binary outcomes using a simple Gibbs sampler. This procedure, which is described in the following section, can be extrapolated to the occupancy setting, eliminating the need for tuning parameters and subjective prior elicitation.
2.1.2 Data Augmentation Algorithms for Binary Models
Probit model: data augmentation with latent normal variables
At the root of Albert & Chib's algorithm lies the idea that if the observed outcome is 0, the latent variable can be simulated from a truncated normal distribution with support (−∞, 0], and if the outcome is 1, from a truncated normal distribution on (0, ∞). To understand the reasoning behind this strategy, let Y ~ Bernoulli(Φ(x'β)) and V = x'β + ε, with ε ~ N(0, 1). In such a case, note that

Pr(y = 1 | x'β) = Φ(x'β) = Pr(ε < x'β) = Pr(ε > −x'β) = Pr(v > 0 | x'β).

Thus, whenever y = 1 then v > 0, and v ≤ 0 otherwise. In other words, we may think of y as a truncated version of v. We can therefore sample iteratively, alternating between the latent variables conditioned on the model parameters and vice versa, to draw from the desired posterior densities. By augmenting the data with the latent variables, we obtain full conditional posterior distributions for the model parameters that are easy to draw from (Equation 2–3 below). Further, because we may sample the latent variables, we may also sample the parameters.
Given some initial values for all model parameters, values for the latent variables can be simulated. By conditioning on the latter, it is then possible to draw samples from the parameters' posterior distributions; these samples can be used to generate new values for the latent variables, and so on. The process is iterated using a Gibbs sampling approach. Generally, after a large number of iterations, it yields draws from the joint posterior distribution of the latent variables and the model parameters, conditional on the observed outcome values. We formalize the procedure below.
Assume that each outcome Y1, Y2, ..., Yn is such that Yi | xi, β ~ Bernoulli(qi), where qi = Φ(x'iβ) is the standard normal CDF evaluated at x'iβ, and where xi and β are the p-dimensional vectors of observed covariates for the ith observation and of their corresponding parameters, respectively. Now let y = (y1, y2, ..., yn) be the vector of observed outcomes, and let [β] represent the prior distribution of the model parameters. The posterior distribution of β is then given by

[β | y] ∝ [β] ∏_{i=1}^n Φ(x'iβ)^{yi} (1 − Φ(x'iβ))^{1−yi},  (2–2)
which is intractable. Nevertheless, introducing latent random variables V = (V1, ..., Vn), such that Vi ~ N(x'iβ, 1) with Vi > 0 whenever Yi = 1 and Vi ≤ 0 whenever Yi = 0, resolves this difficulty. This yields

[β, v | y] ∝ [β] ∏_{i=1}^n φ(vi | x'iβ, 1) { I(vi ≤ 0) I(yi = 0) + I(vi > 0) I(yi = 1) },  (2–3)

where φ(x | μ, τ²) is the probability density function of a normal random variable x with mean μ and variance τ². The data-augmentation artifact works because [β | y] = ∫ [β, v | y] dv; hence, if we sample from the joint posterior (2–3) and extract only the sampled values of β, these correspond to samples from [β | y].
From the expression above it is possible to obtain the full conditional distributions
for V and β Thus a Gibbs sampler can be proposed For example if we use a flat prior
27
for β (ie [ β ] prop 1) the full conditionals are given by
β|V y sim MVNk
((XTX )minus1(XTV ) (XTX )minus1
)(2ndash4)
V|β y simnprodi=1
tr N (xTi β 1Qi) (2ndash5)
where MVNq(micro ) represents a multivariate normal distribution with mean vector micro
and variance-covariance matrix and tr N (ξσ2Q) stands for the truncated normal
distribution with mean ξ variance σ2 and truncation region Q For each i = 1 2 n
the support of the truncated variables is given by Q = (minusinfin 0 ] if yi = 0 and Q = (0infin)
otherwise Note that conjugate normal priors could be used alternatively
At iteration m + 1 the Gibbs sampler draws V(m+1) conditional on β(m) from (2ndash5)
and then samples β(m+1) conditional on V(m+1) from (2ndash4) This process is repeated for
s = 0 1 nsim where nsim is the number of iterations in the Gibbs sampler
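A minimal sketch of this Gibbs sampler for ordinary probit regression (not yet the occupancy model) might look as follows; the simulated data, sample size, and "true" coefficients are hypothetical choices for illustration only.

```python
import numpy as np
from scipy.stats import norm, truncnorm

def probit_gibbs(y, X, n_iter=600, seed=1):
    """Albert-Chib Gibbs sampler for probit regression with a flat
    prior on beta, alternating full conditionals (2-4) and (2-5)."""
    rng = np.random.default_rng(seed)
    n, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    chol = np.linalg.cholesky(XtX_inv)
    beta = np.zeros(k)
    draws = np.empty((n_iter, k))
    for m in range(n_iter):
        mu = X @ beta
        # V_i | beta, y: truncated normal on (0, inf) if y_i = 1 and on
        # (-inf, 0] if y_i = 0 (truncnorm bounds are standardized).
        lower = np.where(y == 1, -mu, -np.inf)
        upper = np.where(y == 1, np.inf, -mu)
        v = truncnorm.rvs(lower, upper, loc=mu, scale=1.0, random_state=rng)
        # beta | V, y ~ MVN((X'X)^{-1} X'V, (X'X)^{-1}).
        mean = XtX_inv @ (X.T @ v)
        beta = mean + chol @ rng.standard_normal(k)
        draws[m] = beta
    return draws

# Simulated check with hypothetical true coefficients.
rng = np.random.default_rng(0)
n = 500
X = np.column_stack([np.ones(n), rng.standard_normal(n)])
beta_true = np.array([0.5, -1.0])
y = rng.binomial(1, norm.cdf(X @ beta_true))
draws = probit_gibbs(y, X)
print(draws[200:].mean(axis=0))  # posterior means after burn-in
```

Note that no tuning parameters appear anywhere: both full conditionals are sampled exactly.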
Logit model: data augmentation with latent Polya-gamma variables
Recently, Polson et al (2013) developed a novel and efficient approach to Bayesian inference for logistic models using Polya-gamma latent variables, analogous to the Albert & Chib algorithm. The result arises from what the authors refer to as the Polya-gamma distribution. To construct a random variable from this family, consider the infinite mixture of the iid sequence of Exp(1) random variables {Ek}, k = 1, 2, ..., given by

ω = (2/π²) Σ_{k=1}^∞ Ek / (2k − 1)²,

with probability density function

g(ω) = Σ_{k=0}^∞ (−1)^k [(2k + 1)/√(2πω³)] e^{−(2k+1)²/(8ω)} I(ω ∈ (0, ∞)),  (2–6)

and Laplace transform E[e^{−tω}] = 1/cosh(√(t/2)).
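The series construction above can be checked numerically by truncating the infinite sum; the truncation point and sample size below are arbitrary choices, and the truncated sum slightly underestimates ω.

```python
import numpy as np

def pg_series_draw(rng, n_draws, n_terms=200):
    """Approximate draws of omega = (2/pi^2) * sum_k E_k/(2k-1)^2 by
    truncating the infinite sum at n_terms (a crude approximation)."""
    k = np.arange(1, n_terms + 1)
    weights = 2.0 / (np.pi**2 * (2 * k - 1) ** 2)
    e = rng.exponential(1.0, size=(n_draws, n_terms))
    return e @ weights

rng = np.random.default_rng(7)
omega = pg_series_draw(rng, 20_000)

# Since E[E_k] = 1, E[omega] = (2/pi^2) * sum 1/(2k-1)^2
#               = (2/pi^2) * (pi^2/8) = 1/4.
print(omega.mean())
```

The sample mean should sit very close to 1/4, the known mean of this distribution.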
The Polya-gamma family of densities is obtained through an exponential tilting of the density g in (2–6). These densities, indexed by c ≥ 0, are characterized by

f(ω | c) = cosh(c/2) e^{−c²ω/2} g(ω).
The likelihood for the binomial logistic model can be expressed in terms of latent Polya-gamma variables as follows. Assume yi ~ Bernoulli(δi), with predictors x'i = (xi1, ..., xip) and success probability δi = e^{x'iβ}/(1 + e^{x'iβ}). Hence the posterior for the model parameters can be represented as

[β | y] = [β] ∏_{i=1}^n δi^{yi} (1 − δi)^{1−yi} / c(y),

where c(y) is the normalizing constant. To facilitate the sampling procedure, a data-augmentation step can be performed by introducing a Polya-gamma random variable ω ~ PG(x'β, 1). This yields the data-augmented posterior

[β, ω | y] = ( ∏_{i=1}^n Pr(yi | β) ) f(ω | x'β) [β] / c(y),  (2–7)

such that [β | y] = ∫_{R+} [β, ω | y] dω.
Thus, from the augmented model, the full conditional density for β is given by

[β|ω, y] ∝ (∏_{i=1}^n Pr(y_i|β)) f(ω|x′β) [β]

= ∏_{i=1}^n [(e^{x_i′β})^{y_i}/(1+e^{x_i′β})] ∏_{i=1}^n cosh(|x_i′β|/2) exp[−(x_i′β)²ω_i/2] g(ω_i) [β].   (2–8)
This expression yields a normal full conditional distribution for β whenever β is assigned a flat or normal prior. Hence, a two-step sampling strategy analogous to that of Albert & Chib (1993) can be used to estimate β in the occupancy framework.

2.2 Single Season Occupancy
Let p_ij = F(q_ij′λ) be the probability of correctly classifying the i-th site as occupied during the j-th survey, conditional on the site being occupied, and let ψ_i = F(x_i′α) correspond to the presence probability at the i-th site. Further, let F^{−1}(·) denote the link function (i.e. probit or logit) connecting the response to the predictors, and denote by λ and α, respectively, the r-variate and p-variate coefficient vectors for the detection and presence probabilities. Then the joint posterior for the presence indicators and the model parameters is

π*(z, α, λ) ∝ π_α(α) π_λ(λ) ∏_{i=1}^N F(x_i′α)^{z_i} (1−F(x_i′α))^{1−z_i} × ∏_{j=1}^J (z_i F(q_ij′λ))^{y_ij} (1 − z_i F(q_ij′λ))^{1−y_ij}.   (2–9)
As in the simple probit regression problem, this posterior is intractable, and consequently sampling from it directly is not possible. However, the procedures of Albert & Chib for the probit model and of Polson et al. for the logit model can be extended to generate an MCMC sampling strategy for the occupancy problem. In what follows we use this framework to develop samplers with which occupancy parameter estimates can be obtained for both probit and logit link functions. These algorithms have the added benefit that they require neither tuning parameters nor subjective elicitation of parameter priors.

2.2.1 Probit Link Model
To extend Albert & Chib's algorithm to the occupancy framework with a probit link, we first introduce two sets of latent variables, denoted w_ij and v_i, corresponding to the normal latent variables used to augment the data. The corresponding hierarchy is

y_ij|z_i, w_ij ∼ Bernoulli(z_i I_{w_ij>0}),   w_ij|λ ∼ N(q_ij′λ, 1),   λ ∼ [λ],
z_i|v_i ∼ Bernoulli(I_{v_i>0}),   v_i|α ∼ N(x_i′α, 1),   α ∼ [α],   (2–10)
represented by the directed graph found in Figure 2-2.

Figure 2-2. Graphical representation of the occupancy model after data augmentation.
Under this hierarchical model, the joint density is given by

π*(z, v, α, w, λ) ∝ C_y π_α(α) π_λ(λ) ∏_{i=1}^N φ(v_i; x_i′α, 1) I_{v_i>0}^{z_i} I_{v_i≤0}^{1−z_i} × ∏_{j=1}^J (z_i I_{w_ij>0})^{y_ij} (1 − z_i I_{w_ij>0})^{1−y_ij} φ(w_ij; q_ij′λ, 1).   (2–11)
The full conditional densities derived from the posterior in (2–11) are detailed below.

1. The full conditional of z, obtained after integrating out v and w, is

f(z|α, λ) = ∏_{i=1}^N f(z_i|α, λ) = ∏_{i=1}^N ψ_i*^{z_i} (1−ψ_i*)^{1−z_i},

where

ψ_i* = ψ_i ∏_{j=1}^J p_ij^{y_ij}(1−p_ij)^{1−y_ij} / [ψ_i ∏_{j=1}^J p_ij^{y_ij}(1−p_ij)^{1−y_ij} + (1−ψ_i) ∏_{j=1}^J I(y_ij = 0)].   (2–12)
2.

f(v|z, α) = ∏_{i=1}^N f(v_i|z_i, α) = ∏_{i=1}^N trN(x_i′α, 1, A_i),   where A_i = (−∞, 0] if z_i = 0 and A_i = (0, ∞) if z_i = 1,   (2–13)

and trN(μ, σ², A) denotes the pdf of a truncated normal random variable with mean μ, variance σ² and truncation region A.
3.

f(α|v) = φ_p(α; Σ_α X′v, Σ_α),   (2–14)

where Σ_α = (X′X)^{−1} and φ_k(x; μ, Σ) represents the k-variate normal density with mean vector μ and variance matrix Σ.
4.

f(w|y, z, λ) = ∏_{i=1}^N ∏_{j=1}^J f(w_ij|y_ij, z_i, λ) = ∏_{i=1}^N ∏_{j=1}^J trN(q_ij′λ, 1, B_ij),

where B_ij = (−∞, ∞) if z_i = 0; B_ij = (−∞, 0] if z_i = 1 and y_ij = 0; B_ij = (0, ∞) if z_i = 1 and y_ij = 1.   (2–15)

5.

f(λ|w) = φ_r(λ; Σ_λ Q′w, Σ_λ),   (2–16)

where Σ_λ = (Q′Q)^{−1}.
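The only non-standard ingredient above is the conditional occupancy weight ψ_i* in (2–12): a site with a detection is occupied with probability one, while an all-zero detection history down-weights but does not rule out occupancy. A direct computation can be sketched as (names illustrative):

```python
import numpy as np

def conditional_occupancy(psi, p, y):
    """psi_i* from eq. (2-12): probability that site i is occupied
    given its detection history y, occupancy probability psi, and
    per-survey detection probabilities p."""
    like_occ = np.prod(p ** y * (1.0 - p) ** (1 - y))   # detections given occupied
    like_unocc = float(np.all(y == 0))                  # prod_j I(y_ij = 0)
    return psi * like_occ / (psi * like_occ + (1.0 - psi) * like_unocc)

# Three surveys, no detections: occupancy down-weighted, not excluded
psi_star = conditional_occupancy(0.5, np.array([0.4, 0.4, 0.4]), np.zeros(3))
```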
The Gibbs sampling algorithm for the model can then be summarized as:

1. Initialize z, α, v, λ and w.
2. Sample z_i ∼ Bern(ψ_i*).
3. Sample v_i from a truncated normal with μ = x_i′α, σ = 1 and truncation region depending on z_i.
4. Sample α ∼ N(Σ_α X′v, Σ_α), with Σ_α = (X′X)^{−1}.
5. Sample w_ij from a truncated normal with μ = q_ij′λ, σ = 1 and truncation region depending on y_ij and z_i.
6. Sample λ ∼ N(Σ_λ Q′w, Σ_λ), with Σ_λ = (Q′Q)^{−1}.
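Putting the six steps together, a minimal NumPy sketch of the whole sampler (flat priors, naive rejection sampling for the truncated normals, erf-based normal CDF to avoid extra dependencies; all names are illustrative):

```python
import numpy as np
from math import erf, sqrt

def _Phi(x):
    # standard normal CDF via math.erf
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def _tnorm(rng, mu, positive):
    # rejection sampler for N(mu, 1) on (0, inf) if positive, else (-inf, 0]
    while True:
        d = rng.normal(mu, 1.0)
        if (d > 0) == positive:
            return d

def occupancy_probit_gibbs(X, Q, Y, n_iter=200, seed=0):
    """Steps 1-6 for the single-season probit occupancy model with
    flat priors (a sketch). X: (N,p) site covariates; Q: (N,J,r)
    survey covariates; Y: (N,J) binary detection histories."""
    rng = np.random.default_rng(seed)
    N, p = X.shape
    _, J, r = Q.shape
    Sig_a = np.linalg.inv(X.T @ X)                 # Sigma_alpha = (X'X)^{-1}
    Qf = Q.reshape(N * J, r)
    Sig_l = np.linalg.inv(Qf.T @ Qf)               # Sigma_lambda = (Q'Q)^{-1}
    La, Ll = np.linalg.cholesky(Sig_a), np.linalg.cholesky(Sig_l)
    alpha, lam = np.zeros(p), np.zeros(r)
    z = (Y.sum(axis=1) > 0).astype(int)            # step 1: initialize
    keep = np.empty((n_iter, p + r))
    for m in range(n_iter):
        mu_z = X @ alpha
        for i in range(N):                         # step 2: z_i ~ Bern(psi*_i)
            if Y[i].any():
                z[i] = 1                           # a detection implies occupancy
            else:
                psi = _Phi(mu_z[i])
                no_det = np.prod([1.0 - _Phi(Q[i, j] @ lam) for j in range(J)])
                z[i] = rng.uniform() < psi * no_det / (psi * no_det + 1.0 - psi)
        v = np.array([_tnorm(rng, mu_z[i], z[i] == 1) for i in range(N)])  # step 3
        alpha = Sig_a @ (X.T @ v) + La @ rng.normal(size=p)                # step 4
        w = np.empty((N, J))
        for i in range(N):                         # step 5: region B_ij from (2-15)
            for j in range(J):
                mu = Q[i, j] @ lam
                w[i, j] = rng.normal(mu, 1.0) if z[i] == 0 else \
                          _tnorm(rng, mu, Y[i, j] == 1)
        lam = Sig_l @ (Qf.T @ w.ravel()) + Ll @ rng.normal(size=r)         # step 6
        keep[m] = np.concatenate([alpha, lam])
    return keep

# Toy data: 150 sites, 4 surveys, intercept-plus-covariate designs
rng = np.random.default_rng(1)
N, J = 150, 4
X = np.column_stack([np.ones(N), rng.normal(size=N)])
Q = np.dstack([np.ones((N, J)), rng.normal(size=(N, J))])
true_alpha, true_lam = np.array([0.5, 1.0]), np.array([0.5, 0.5])
z_true = (X @ true_alpha + rng.normal(size=N) > 0).astype(int)
P_det = np.array([[_Phi(Q[i, j] @ true_lam) for j in range(J)] for i in range(N)])
Y = (rng.uniform(size=(N, J)) < z_true[:, None] * P_det).astype(int)
draws = occupancy_probit_gibbs(X, Q, Y, n_iter=200)
post = draws[50:].mean(axis=0)
```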
2.2.2 Logit Link Model

Now turning to the logit link version of the occupancy model, again let y_ij be the indicator variable used to mark detection of the target species on the j-th survey at the i-th site, and let z_i be the indicator variable that denotes presence (z_i = 1) or absence
(z_i = 0) of the target species at the i-th site. The model is now defined by

y_ij|z_i, λ ∼ Bernoulli(z_i p_ij),   where p_ij = e^{q_ij′λ}/(1+e^{q_ij′λ}),   λ ∼ [λ],
z_i|α ∼ Bernoulli(ψ_i),   where ψ_i = e^{x_i′α}/(1+e^{x_i′α}),   α ∼ [α].

In this hierarchy, the contribution of a single site to the likelihood is

L_i(α, λ) = [(e^{x_i′α})^{z_i}/(1+e^{x_i′α})] ∏_{j=1}^J (z_i e^{q_ij′λ}/(1+e^{q_ij′λ}))^{y_ij} (1 − z_i e^{q_ij′λ}/(1+e^{q_ij′λ}))^{1−y_ij}.   (2–17)
As in the probit case, we augment the likelihood with two separate sets of latent variables, although in this case each of them has a Polya-Gamma distribution. Augmenting the model and using the posterior in (2–7), the joint is

[z, v, w, α, λ|y] ∝ [α][λ] ∏_{i=1}^N (e^{x_i′α})^{z_i}/(1+e^{x_i′α}) cosh(|x_i′α|/2) exp[−(x_i′α)²v_i/2] g(v_i) ×
∏_{j=1}^J (z_i e^{q_ij′λ}/(1+e^{q_ij′λ}))^{y_ij} (1 − z_i e^{q_ij′λ}/(1+e^{q_ij′λ}))^{1−y_ij} cosh(|z_i q_ij′λ|/2) exp[−(z_i q_ij′λ)²w_ij/2] g(w_ij).   (2–18)
The full conditionals for z, α, v, λ and w obtained from (2–18) are provided below.

1. The full conditional for z, obtained after marginalizing out the latent variables, is

f(z|α, λ) = ∏_{i=1}^N f(z_i|α, λ) = ∏_{i=1}^N ψ_i*^{z_i} (1−ψ_i*)^{1−z_i},

where

ψ_i* = ψ_i ∏_{j=1}^J p_ij^{y_ij}(1−p_ij)^{1−y_ij} / [ψ_i ∏_{j=1}^J p_ij^{y_ij}(1−p_ij)^{1−y_ij} + (1−ψ_i) ∏_{j=1}^J I(y_ij = 0)].   (2–19)
2. Using the result derived in Polson et al. (2013), we have that

f(v|z, α) = ∏_{i=1}^N f(v_i|z_i, α) = ∏_{i=1}^N PG(1, x_i′α).   (2–20)

3.

f(α|v) ∝ [α] ∏_{i=1}^N exp[z_i x_i′α − x_i′α/2 − (x_i′α)²v_i/2].   (2–21)

4. By the same result as that used for v, the full conditional for w is

f(w|y, z, λ) = ∏_{i=1}^N ∏_{j=1}^J f(w_ij|y_ij, z_i, λ) = (∏_{i∈S₁} ∏_{j=1}^J PG(1, |q_ij′λ|)) (∏_{i∉S₁} ∏_{j=1}^J PG(1, 0)),   (2–22)

with S₁ = {i ∈ {1, 2, ..., N} : z_i = 1}.

5.

f(λ|z, y, w) ∝ [λ] ∏_{i∈S₁} ∏_{j=1}^J exp[y_ij q_ij′λ − q_ij′λ/2 − (q_ij′λ)²w_ij/2],   (2–23)

with S₁ as defined above.
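The two Polya-Gamma moves driving these conditionals — drawing ω ∼ PG(1, |x′β|) and then a Gaussian coefficient update — can be sketched for a plain logistic regression; the occupancy sampler applies the same pair of moves to each component. The PG draws below use a truncated sum-of-gammas representation, an approximation adopted here purely for illustration (Polson et al. give an exact sampler):

```python
import numpy as np

def pg1_approx(rng, c, K=200):
    """Approximate PG(1, c) draws via a K-term truncation of the
    sum-of-gammas representation (an approximation)."""
    g = rng.exponential(size=(c.size, K))              # Gamma(1,1) = Exp(1)
    k = np.arange(1, K + 1)
    denom = (k - 0.5) ** 2 + (c[:, None] / (2.0 * np.pi)) ** 2
    return (g / denom).sum(axis=1) / (2.0 * np.pi ** 2)

def logistic_pg_gibbs(X, y, n_iter=300, seed=0):
    """Gibbs sampler for logistic regression with a flat prior:
    omega_i ~ PG(1, |x_i' beta|), then beta | omega is Gaussian
    with precision X' Omega X and mean V X' kappa (a sketch)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    beta = np.zeros(p)
    kappa = y - 0.5                                    # y_i - 1/2
    draws = np.empty((n_iter, p))
    for m in range(n_iter):
        omega = pg1_approx(rng, np.abs(X @ beta))
        V = np.linalg.inv(X.T @ (omega[:, None] * X))  # (X' Omega X)^{-1}
        beta = V @ (X.T @ kappa) + np.linalg.cholesky(V) @ rng.normal(size=p)
        draws[m] = beta
    return draws

# Toy run on simulated logistic data
rng = np.random.default_rng(2)
n = 300
X = np.column_stack([np.ones(n), rng.normal(size=n)])
true_beta = np.array([0.5, 1.0])
prob = 1.0 / (1.0 + np.exp(-X @ true_beta))
y = (rng.uniform(size=n) < prob).astype(float)
draws = logistic_pg_gibbs(X, y, n_iter=300)
post_mean = draws[100:].mean(axis=0)
```

As with the probit sampler, every step is a direct draw, so no tuning is required.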
The Gibbs sampling algorithm is analogous to the one with a probit link, but with the obvious modifications to incorporate Polya-Gamma instead of normal latent variables.

2.3 Temporal Dynamics and Spatial Structure
The uses of the single-season model are limited to very specific problems. In particular, the assumptions of the basic model may become too restrictive or unrealistic whenever the study period extends over multiple years or seasons, especially given the increasingly changing environmental conditions that most ecosystems are currently experiencing.

Among the many extensions found in the literature, one that we consider particularly relevant incorporates heterogeneous occupancy probabilities through time. Extensions of
site-occupancy models that incorporate temporally varying probabilities can be traced back to Hanski (1994). The heterogeneity of occupancy probabilities through time arises from local colonization and extinction processes. MacKenzie et al. (2003) proposed an alternative to Hanski's approach in order to incorporate imperfect detection. The method is flexible enough to let detection, occurrence, survival and colonization probabilities each depend on its own set of covariates, using likelihood-based estimation for the model parameters.
However, the approach of MacKenzie et al. presents two drawbacks. First, the uncertainty assessment for maximum likelihood parameter estimates relies on asymptotic results (obtained from implementation of the delta method), making it sensitive to sample size. Second, to obtain parameter estimates, the latent process (occupancy) is marginalized out of the likelihood, leading to the usual zero-inflated Bernoulli model. Although this is a convenient strategy to solve the estimation problem, the latent state variables (occupancy indicators) are no longer available, and as such finite-sample estimates cannot be calculated unless an additional (and computationally expensive) parametric bootstrap step is performed (Royle & Kery 2007). Additionally, as the occupancy process is integrated out, the likelihood approach precludes incorporation of additional structural dependence through random effects. Thus the model cannot account for spatial dependence, which plays a fundamental role in this setting.
To work around some of the shortcomings encountered when fitting dynamic occupancy models via likelihood-based methods, Royle & Kery developed what they refer to as a dynamic occupancy state-space model (DOSS), alluding to the conceptual similarity between this model and the class of state-space models found in the time series literature. In particular, this model allows one to retain the latent process (occupancy indicators) in order to obtain small-sample estimates and to eventually generate extensions that incorporate structure in time and/or space through random effects.
The data used in the DOSS model come from standard repeated presence/absence surveys with N sampling locations (patches or sites), indexed by i = 1, 2, ..., N. Within a given season (e.g. year, month or week, depending on the biology of the species), each sampling location is visited (surveyed) j = 1, 2, ..., J times. This process is repeated for t = 1, 2, ..., T seasons. Here an important assumption is that the site occupancy status is closed within, but not across, seasons.
As is usual in the occupancy modeling framework, two different processes are considered. The first is the detection process per site-visit-season combination, denoted y_ijt. The y_ijt are indicator variables that take the value 1 if the species is detected at site i, survey j and season t, and 0 otherwise. These detection indicators are assumed to be independent within each site and season. The second response considered is the partially observed presence (occupancy) indicator z_it. These are indicator variables which are equal to 1 whenever y_ijt = 1 for one or more of the visits made to site i during season t; otherwise the values of the z_it's are unknown. Royle & Kery refer to these two processes as the observation (y_ijt) and the state (z_it) models.
In this setting, the parameters of greatest interest are the occurrence or site-occupancy probabilities, denoted ψ_it, as well as those representing the population dynamics, which are accounted for by introducing changes in occupancy status over time through local colonization and survival. That is, if a site was not occupied at season t − 1, at season t it can either be colonized or remain unoccupied. On the other hand, if the site was in fact occupied at season t − 1, it can remain that way (survival) or become abandoned (local extinction) at season t. The probabilities of survival and colonization from season t − 1 to season t at the i-th site are denoted θ_i(t−1) and γ_i(t−1), respectively.

During the initial period (or season), the model for the state process is expressed in terms of the occupancy probability (equation 2–24). For subsequent periods, the state process is specified in terms of survival and colonization probabilities (equation 2–25); in particular,

z_i1 ∼ Bernoulli(ψ_i1),   (2–24)
z_it|z_i(t−1) ∼ Bernoulli(z_i(t−1) θ_i(t−1) + (1 − z_i(t−1)) γ_i(t−1)).   (2–25)

The observation model, conditional on the latent process z_it, is defined by

y_ijt|z_it ∼ Bernoulli(z_it p_ijt).   (2–26)
Royle & Kery induce heterogeneity by site, site-season and site-survey-season, respectively, in the occupancy, survival, colonization and detection probabilities through the following specification:

logit(ψ_i1) = x₁ + r_i,   r_i ∼ N(0, σ²_ψ),   logit^{−1}(x₁) ∼ Unif(0, 1)
logit(θ_it) = a_t + u_i,   u_i ∼ N(0, σ²_θ),   logit^{−1}(a_t) ∼ Unif(0, 1)
logit(γ_it) = b_t + v_i,   v_i ∼ N(0, σ²_γ),   logit^{−1}(b_t) ∼ Unif(0, 1)
logit(p_ijt) = c_t + w_ij,   w_ij ∼ N(0, σ²_p),   logit^{−1}(c_t) ∼ Unif(0, 1)   (2–27)

where x₁, a_t, b_t, c_t are the season fixed effects for the corresponding probabilities, and where (r_i, u_i, v_i) and w_ij are the site and site-survey random effects, respectively. Additionally, all variance components are assigned the usual inverse gamma priors.
As the authors state, this formulation can be regarded as "being suitably vague"; however, it is also restrictive, in the sense that it is not clear what strategy to follow in order to incorporate additional covariates while preserving the straightforward sampling strategy.

2.3.1 Dynamic Mixture Occupancy State-Space Model
We assume that the probabilities for occupancy, survival, colonization and detection are all functions of linear combinations of covariates. However, our setup varies slightly from that considered by Royle & Kery (2007). In essence, we modify the way in which the estimates for survival and colonization probabilities are attained. Our model incorporates the notion that occupancy at a site occupied during the previous season takes place through persistence, where we define persistence as a function of both survival and colonization. That is, a site occupied at time t may again be occupied at time t + 1 if the current settlers survive, if they perish and new settlers colonize simultaneously, or if both current settlers survive and new ones colonize.

Our functional forms of choice are again the probit and logit link functions. This means that each probability of interest, which we will refer to for illustration as δ, is
linked to a linear combination of covariates x′ξ through the relationship defined by δ = F(x′ξ), where F(·) represents the inverse link function. This particular assumption facilitates relating the data augmentation algorithms of Albert & Chib and Polson et al. to Royle & Kery's DOSS model. We refer to this extension of Royle & Kery's model as the Dynamic Mixture Occupancy State-Space model (DYMOSS).
As before, let y_ijt be the indicator variable used to mark detection of the target species on the j-th survey at the i-th site during the t-th season, and let z_it be the indicator variable that denotes presence (z_it = 1) or absence (z_it = 0) of the target species at the i-th site in the t-th season, with i ∈ {1, 2, ..., N}, j ∈ {1, 2, ..., J} and t ∈ {1, 2, ..., T}. Additionally, assume that the probabilities for occupancy at time t = 1, persistence, colonization and detection are all functions of covariates, with corresponding parameter vectors α, Δ^(s) = {δ^(s)_{t−1}}_{t=2}^T, B^(c) = {β^(c)_{t−1}}_{t=2}^T and Λ = {λ_t}_{t=1}^T, and covariate matrices X^(o), X = {X_{t−1}}_{t=2}^T and Q = {Q_t}_{t=1}^T, respectively. Using the notation above, our proposed dynamic occupancy model is defined by the following hierarchy.

State model:
z_i1|α ∼ Bernoulli(ψ_i1),   where ψ_i1 = F(x_(o)i′α),
z_it|z_i(t−1), δ^(s)_{t−1}, β^(c)_{t−1} ∼ Bernoulli(z_i(t−1) θ_i(t−1) + (1 − z_i(t−1)) γ_i(t−1)),
where θ_i(t−1) = F(δ^(s)_{t−1} + x_i(t−1)′β^(c)_{t−1}), and γ_i(t−1) = F(x_i(t−1)′β^(c)_{t−1}).   (2–28)

Observation model:
y_ijt|z_it, λ_t ∼ Bernoulli(z_it p_ijt),   where p_ijt = F(q_ijt′λ_t).   (2–29)
In the hierarchical setup given by Equations 2–28 and 2–29, θ_i(t−1) corresponds to the probability of persistence from time t − 1 to time t at site i, and γ_i(t−1) denotes the colonization probability. Note that θ_i(t−1) − γ_i(t−1) yields the survival probability from t − 1 to t. The effect of survival is introduced by shifting the intercept of the linear predictor by a quantity δ^(s)_{t−1}. Although in this version of the model this effect is accomplished by just modifying the intercept, it can be extended to have covariates determining δ^(s)_{t−1} as well. The graphical representation of the model for a single site is shown in Figure 2-3.
Figure 2-3. Graphical representation of the multiseason model for a single site.
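The persistence/colonization mixture in (2–28) is straightforward to evaluate directly. A small sketch with a logit link (the covariate and coefficient values are purely illustrative):

```python
import numpy as np

def transition_prob(z_prev, x, beta_c, delta_s):
    """Pr(z_it = 1 | z_{i(t-1)}) under eq. (2-28) with a logit link:
    persistence theta = F(delta_s + x'beta_c) when the site was
    occupied, colonization gamma = F(x'beta_c) when it was not."""
    F = lambda u: 1.0 / (1.0 + np.exp(-u))
    eta = float(np.dot(x, beta_c))
    theta = F(delta_s + eta)          # survival shifts the intercept by delta_s
    gamma = F(eta)
    return z_prev * theta + (1 - z_prev) * gamma

# Illustrative values chosen so the suitability term x'beta_c is zero
x = np.array([1.0, 0.5])
beta_c = np.array([-0.5, 1.0])
p_persist = transition_prob(1, x, beta_c, delta_s=1.0)   # occupied at t-1
p_colonize = transition_prob(0, x, beta_c, delta_s=1.0)  # empty at t-1
```

With a positive intercept shift δ^(s), an occupied site is more likely to stay occupied than an empty one is to be colonized, which is exactly the density-dependence mechanism described above.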
The joint posterior for the model defined by this hierarchical setting is

[z, α, B^(c), Δ^(s), Λ|y] = C_y ∏_{i=1}^N {ψ_i1 ∏_{j=1}^J p_ij1^{y_ij1}(1−p_ij1)^{1−y_ij1}}^{z_i1} {(1−ψ_i1) ∏_{j=1}^J I(y_ij1 = 0)}^{1−z_i1} [λ_1][α] ×
∏_{t=2}^T ∏_{i=1}^N [(θ_i(t−1)^{z_it}(1−θ_i(t−1))^{1−z_it})^{z_i(t−1)} (γ_i(t−1)^{z_it}(1−γ_i(t−1))^{1−z_it})^{1−z_i(t−1)}] {∏_{j=1}^J p_ijt^{y_ijt}(1−p_ijt)^{1−y_ijt}}^{z_it} {∏_{j=1}^J I(y_ijt = 0)}^{1−z_it} [λ_t][β^(c)_{t−1}][δ^(s)_{t−1}],   (2–30)

which, as in the single-season case, is intractable. Once again, a Gibbs sampler cannot be constructed directly to sample from this joint posterior. The graphical representation of the model for one site, incorporating the latent variables, is provided in Figure 2-4.
Figure 2-4. Graphical representation of the data-augmented multiseason model.
Probit link: normal-mixture DYMOSS model
We deal with the intractability of the joint posterior distribution as before, that is, by introducing latent random variables. Each of the latent variables incorporates the relevant linear combination of covariates for the probabilities considered in the model. This artifact enables us to sample from the joint posterior distribution of the model parameters. For the probit link, the sets of latent random variables, respectively for first-season occupancy, persistence and colonization, and detection, are:

- u_i ∼ N(x_(o)i′α, 1),
- v_i(t−1) ∼ z_i(t−1) N(δ^(s)_{t−1} + x_i(t−1)′β^(c)_{t−1}, 1) + (1 − z_i(t−1)) N(x_i(t−1)′β^(c)_{t−1}, 1), and
- w_ijt ∼ N(q_ijt′λ_t, 1).
Introducing these latent variables into the hierarchical formulation yields:

State model:
u_i|α ∼ N(x_(o)i′α, 1),   z_i1|u_i ∼ Bernoulli(I_{u_i>0}),
and for t > 1,
v_i(t−1)|z_i(t−1), β^(c)_{t−1}, δ^(s)_{t−1} ∼ z_i(t−1) N(δ^(s)_{t−1} + x_i(t−1)′β^(c)_{t−1}, 1) + (1 − z_i(t−1)) N(x_i(t−1)′β^(c)_{t−1}, 1),
z_it|v_i(t−1) ∼ Bernoulli(I_{v_i(t−1)>0}).   (2–31)

Observation model:
w_ijt|λ_t ∼ N(q_ijt′λ_t, 1),   y_ijt|z_it, w_ijt ∼ Bernoulli(z_it I_{w_ijt>0}).   (2–32)

Note that the result presented in Section 2.2 corresponds to the particular case T = 1 of the model specified by Equations 2–31 and 2–32.
As mentioned previously, model parameters are obtained using a Gibbs sampling approach. Let φ(x|μ, σ²) denote the pdf of a normally distributed random variable x with mean μ and variance σ². Also let:

1. W_t = (w_1t, w_2t, ..., w_Nt), with w_it = (w_i1t, w_i2t, ..., w_iJ_it t) (for i = 1, 2, ..., N and t = 1, 2, ..., T),
2. u = (u_1, u_2, ..., u_N), and
3. V = (v_1, ..., v_{T−1}), with v_t = (v_1t, v_2t, ..., v_Nt).
For the probit link model, the joint posterior distribution is

π(z, u, V, {W_t}_{t=1}^T, α, B^(c), Δ^(s), Λ) ∝ [α] ∏_{i=1}^N φ(u_i | x_(o)i′α, 1) I_{u_i>0}^{z_i1} I_{u_i≤0}^{1−z_i1} ×
∏_{t=2}^T [β^(c)_{t−1}, δ^(s)_{t−1}] ∏_{i=1}^N φ(v_i(t−1) | μ^(v)_i(t−1), 1) I_{v_i(t−1)>0}^{z_it} I_{v_i(t−1)≤0}^{1−z_it} ×
∏_{t=1}^T [λ_t] ∏_{i=1}^N ∏_{j=1}^{J_it} φ(w_ijt | q_ijt′λ_t, 1) (z_it I_{w_ijt>0})^{y_ijt} (1 − z_it I_{w_ijt>0})^{1−y_ijt},

where μ^(v)_i(t−1) = z_i(t−1) δ^(s)_{t−1} + x_i(t−1)′β^(c)_{t−1}.   (2–33)
Initialize the Gibbs sampler at α^(0), B^(c)(0), Δ^(s)(0) and Λ^(0). The sampler proceeds iteratively by block sampling sequentially for each primary sampling period as follows: first the presence process, then the latent variables from the data augmentation step for the presence component, followed by the parameters for the presence process, then the latent variables for the detection component, and finally the parameters for the detection component. Letting [·|·] denote the full conditional probability density function of the component conditional on all other unknown parameters and the observed data, for m = 1, ..., n_sim the sampling procedure can be summarized as
[z_1^(m)|·] → [u^(m)|·] → [α^(m)|·] → [W_1^(m)|·] → [λ_1^(m)|·] → [z_2^(m)|·] → [V_1^(m)|·] → [β^(c)(m)_1, δ^(s)(m)_1|·] → [W_2^(m)|·] → [λ_2^(m)|·] → ⋯
⋯ → [z_T^(m)|·] → [V_{T−1}^(m)|·] → [β^(c)(m)_{T−1}, δ^(s)(m)_{T−1}|·] → [W_T^(m)|·] → [λ_T^(m)|·].

The full conditional probability densities for this Gibbs sampling algorithm are presented in detail in Appendix A.
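The blocking order can be made concrete with a stub callback standing in for the full conditionals of Appendix A (all labels and the callback itself are illustrative stubs, not the actual sampler):

```python
def dymoss_sweep(T, draw):
    """One sweep of the DYMOSS Gibbs sampler in the blocking order
    shown above; `draw(name, t)` stands in for sampling the named
    block's full conditional at season t."""
    draw("z", 1); draw("u", 1); draw("alpha", 1)   # first-season presence block
    draw("W", 1); draw("lambda", 1)                # first-season detection block
    for t in range(2, T + 1):
        draw("z", t)               # presence indicators for season t
        draw("V", t - 1)           # latent persistence/colonization variables
        draw("beta_delta", t - 1)  # (beta^(c)_{t-1}, delta^(s)_{t-1})
        draw("W", t)               # latent detection variables
        draw("lambda", t)          # detection parameters

order = []
dymoss_sweep(3, lambda name, t: order.append((name, t)))
```

Recording the visit order for T = 3 reproduces the chain displayed above, season by season.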
Logit link: Polya-Gamma DYMOSS model

Using the same notation as before, the logit link model resorts to the hierarchy given by:

State model:
u_i|α ∼ PG(1, x_(o)i′α),   z_i1|u_i ∼ Bernoulli(I_{u_i>0}),
and for t > 1,
v_i(t−1)|· ∼ PG(1, |z_i(t−1) δ^(s)_{t−1} + x_i(t−1)′β^(c)_{t−1}|),   z_it|v_i(t−1) ∼ Bernoulli(I_{v_i(t−1)>0}).   (2–34)

Observation model:
w_ijt|λ_t ∼ PG(1, q_ijt′λ_t),   y_ijt|z_it, w_ijt ∼ Bernoulli(z_it I_{w_ijt>0}).   (2–35)
The logit link version of the joint posterior is given by

π(z, u, V, {W_t}_{t=1}^T, α, B^(c), Δ^(s), Λ) ∝ [α][λ_1] ∏_{i=1}^N (e^{x_(o)i′α})^{z_i1}/(1+e^{x_(o)i′α}) PG(u_i; 1, |x_(o)i′α|) ×
∏_{j=1}^{J_i1} (z_i1 e^{q_ij1′λ_1}/(1+e^{q_ij1′λ_1}))^{y_ij1} (1 − z_i1 e^{q_ij1′λ_1}/(1+e^{q_ij1′λ_1}))^{1−y_ij1} PG(w_ij1; 1, |z_i1 q_ij1′λ_1|) ×
∏_{t=2}^T [δ^(s)_{t−1}][β^(c)_{t−1}][λ_t] ∏_{i=1}^N (e^{μ^(v)_i(t−1)})^{z_it}/(1+e^{μ^(v)_i(t−1)}) PG(v_i(t−1); 1, |μ^(v)_i(t−1)|) ×
∏_{j=1}^{J_it} (z_it e^{q_ijt′λ_t}/(1+e^{q_ijt′λ_t}))^{y_ijt} (1 − z_it e^{q_ijt′λ_t}/(1+e^{q_ijt′λ_t}))^{1−y_ijt} PG(w_ijt; 1, |z_it q_ijt′λ_t|),   (2–36)

with μ^(v)_i(t−1) = z_i(t−1) δ^(s)_{t−1} + x_i(t−1)′β^(c)_{t−1}.
The sampling procedure is entirely analogous to that described for the probit version. The full conditional densities derived from expression 2–36 are described in detail in Appendix A.

2.3.2 Incorporating Spatial Dependence
In this section we describe how an additional layer of complexity, space, can also be accounted for within the same data augmentation framework. The method we employ to incorporate spatial dependence is a slightly modified version of the traditional approach for spatial generalized linear mixed models (GLMMs), and extends the model proposed by Johnson et al. (2013) for the single-season, closed-population occupancy model.

The traditional approach consists of using spatial random effects to induce a correlation structure among adjacent sites. This formulation, introduced by Besag et al. (1991), assumes that the spatial random effect corresponds to a Gaussian Markov Random Field (GMRF). The model, known as the spatial GLMM (SGLMM), is used to analyze areal data. It has been applied extensively, given the flexibility of its hierarchical formulation and the availability of software for its implementation (Hughes & Haran 2013).
Succinctly, the spatial dependence is accounted for in the model by adding a random vector η assumed to have a conditionally autoregressive (CAR) prior (also known as a Gaussian Markov random field prior). To define the prior, let the pair G = (V, E) represent the undirected graph for the entire spatial region studied, where V = (1, 2, ..., N) denotes the vertices of the graph (sites), and E the set of edges between sites. E is constituted by elements of the form (i, j), indicating that sites i and j are spatially adjacent, for some i, j ∈ V. The prior for the spatial effects is then characterized by

[η|τ] ∝ τ^{rank(Q)/2} exp[−(τ/2) η′Qη],   (2–37)
where Q = diag(A1) − A is the precision matrix, with A denoting the adjacency matrix. The entries of the adjacency matrix A are such that diag(A) = 0 and A_ij = I_{(i,j)∈E}.

The matrix Q is singular; hence the probability density defined in equation 2–37 is improper, i.e. it doesn't integrate to 1. Regardless of the impropriety of the prior, this model can be fitted using a Bayesian approach, since even though the prior is improper, the posterior for the model parameters is proper. If a constraint such as ∑_k η_k = 0 is imposed, or if the precision matrix is replaced by a positive definite matrix, the model can also be fitted using a maximum likelihood approach.
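The precision matrix in (2–37) is simple to build from an adjacency matrix, and its singularity (the source of the prior's impropriety) can be verified directly (a minimal sketch):

```python
import numpy as np

def icar_precision(A):
    """ICAR precision matrix Q = diag(A 1) - A from eq. (2-37).
    Q is singular: the constant vector lies in its null space,
    which is why the CAR prior does not integrate to 1."""
    A = np.asarray(A, dtype=float)
    return np.diag(A.sum(axis=1)) - A

# A four-site chain: 1 - 2 - 3 - 4
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
Qmat = icar_precision(A)
```

For a connected graph on N sites, Q has rank N − 1, matching the τ^{rank(Q)/2} normalization in (2–37).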
Assuming that all but the detection process are subject to spatial correlation, and using the notation developed up to this point, the spatially explicit version of the DYMOSS model is characterized by the hierarchy represented by equations 2–38 and 2–39.

Hence, adding spatial structure into the DYMOSS framework described in the previous section only involves adding steps to sample η^(o) and {η_t}_{t=2}^T conditional on all other parameters. Furthermore, the corresponding parameters and spatial random effects of a given component (i.e. occupancy, survival and colonization) can be effortlessly pooled together into a single parameter vector to perform block sampling. For each of the latent variables, the only modification required is to add the corresponding spatial effect to the linear predictor, so that these retain their conditional independence given the linear combination of fixed effects and the spatial effects.
State model:
z_i1|α ∼ Bernoulli(ψ_i1),   where ψ_i1 = F(x_(o)i′α + η^(o)_i),
[η^(o)|τ] ∝ τ^{rank(Q)/2} exp[−(τ/2) η^(o)′Qη^(o)],
z_it|z_i(t−1), α, β_{t−1}, λ_{t−1} ∼ Bernoulli(z_i(t−1) θ_i(t−1) + (1 − z_i(t−1)) γ_i(t−1)),
where θ_i(t−1) = F(δ^(s)_{t−1} + x_i(t−1)′β^(c)_{t−1} + η_it), and γ_i(t−1) = F(x_i(t−1)′β^(c)_{t−1} + η_it),
[η_t|τ] ∝ τ^{rank(Q)/2} exp[−(τ/2) η_t′Qη_t].   (2–38)

Observation model:
y_ijt|z_it, λ_t ∼ Bernoulli(z_it p_ijt),   where p_ijt = F(q_ijt′λ_t).   (2–39)
In spite of the popularity of this approach to incorporating spatial dependence, three shortcomings have been reported in the literature (Hughes & Haran 2013; Reich et al. 2006): (1) model parameters have no clear interpretation, due to spatial confounding of the predictors with the spatial effect; (2) there is variance inflation due to spatial confounding; and (3) the high dimensionality of the latent spatial variables leads to high computational costs. To avoid such difficulties, we follow the approach used by Hughes & Haran (2013), which builds upon earlier work by Reich et al. (2006). This methodology is summarized in what follows.
Let a vector of spatial effects η have the CAR prior given by 2–37 above. Now consider a random vector ζ ∼ MVN(0, τK′QK), with Q defined as above, and where τK′QK corresponds to the precision of the distribution (not the covariance matrix), with the matrix K satisfying K′K = I.
This last condition implies that the linear predictor can be written Xβ + η = Xβ + Kζ. With respect to how the matrix K is chosen, Hughes & Haran (2013) recommend basing its construction on the spectral decomposition of operator matrices based on Moran's I. The Moran operator matrix is defined as P⊥AP⊥, with P⊥ = I − X(X′X)^{−1}X′, and where A is the adjacency matrix previously described. The choice of the Moran operator is based on the fact that it accounts for the underlying graph while incorporating the spatial structure residual to the design matrix X. These elements are captured by its spectral decomposition: its eigenvalues correspond to the values of Moran's I statistic (a measure of spatial autocorrelation) for a spatial process orthogonal to X, while its eigenvectors provide the patterns of spatial dependence residual to X. Thus, the matrix K is chosen to be the matrix whose columns are the eigenvectors of the Moran operator for a particular adjacency matrix.
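A minimal sketch of this construction (the function name and toy graph are illustrative):

```python
import numpy as np

def moran_basis(X, A, q):
    """First q eigenvectors of the Moran operator P_perp A P_perp
    (Hughes & Haran 2013): spatial patterns residual to the
    fixed-effect design X, ordered by decreasing Moran's I."""
    n = X.shape[0]
    P = np.eye(n) - X @ np.linalg.inv(X.T @ X) @ X.T   # projection onto span(X)^perp
    M = P @ A @ P
    vals, vecs = np.linalg.eigh(M)                     # M is symmetric
    idx = np.argsort(vals)[::-1]                       # largest eigenvalues first
    return vecs[:, idx[:q]]

# Six sites on a chain, intercept-only design
n = 6
A = np.zeros((n, n))
for i in range(n - 1):
    A[i, i + 1] = A[i + 1, i] = 1.0
X = np.ones((n, 1))
K = moran_basis(X, A, q=2)
```

By construction the retained columns of K are orthonormal and orthogonal to the design matrix, so Kζ adds spatial structure only in directions not already explained by X.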
Using this strategy, the new hierarchical formulation of our model is obtained simply by letting η^(o) = K^(o)ζ^(o) and η_t = K_tζ_t, with:

1. ζ^(o) ∼ MVN(0, τ^(o) K^(o)′QK^(o)), where K^(o) is the eigenvector matrix of P^(o)⊥AP^(o)⊥, and
2. ζ_t ∼ MVN(0, τ_t K_t′QK_t), where K_t is the eigenvector matrix of P⊥_tAP⊥_t, for t = 2, 3, ..., T.

The algorithms for the probit and logit links from Section 2.3.1 can be readily adapted to incorporate the spatial structure, simply by obtaining the joint posteriors for (α, ζ^(o)) and (β^(c)_{t−1}, δ^(s)_{t−1}, ζ_t), and making the obvious modification of the corresponding linear predictors to incorporate the spatial components.

2.4 Summary
With a few exceptions (Dorazio & Taylor-Rodriguez 2012; Johnson et al. 2013; Royle & Kery 2007), recent Bayesian approaches to site-occupancy modeling with covariates have relied on model configurations (e.g. multivariate normal priors on parameters on the logit scale) that lead to unfamiliar conditional posterior distributions, thus precluding the use of a direct sampling approach. Therefore, the sampling strategies available are based on algorithms (e.g. Metropolis-Hastings) that require tuning and the knowledge to do so correctly.

In Dorazio & Taylor-Rodriguez (2012) we proposed a Bayesian specification for which a Gibbs sampler of the basic occupancy model is available, and allowed detection and occupancy probabilities to depend on linear combinations of predictors. This method, described in Section 2.2.1, is based on the data augmentation algorithm of Albert & Chib (1993), in which the full conditional posteriors of the parameters of the probit regression model are cast as latent mixtures of normal random variables. The probit and the logit links yield similar results with large sample sizes; however, their results may differ when small to moderate sample sizes are considered, because the logit link function places more mass in the tails of the distribution than the probit link does. In Section 2.2.2 we adapt the method for the single-season model to work with the logit link function.
The basic occupancy framework is useful, but it assumes a single closed population with fixed probabilities through time. Hence its assumptions may not be appropriate for problems where the interest lies in the temporal dynamics of the population. We therefore developed a dynamic model that incorporates the notion that occupancy at a previously occupied site takes place through persistence, which depends both on survival and habitat suitability. By this we mean that a site occupied at time t may again be occupied at time t + 1 if (1) the current settlers survive, (2) the existing settlers perish but new settlers simultaneously colonize, or (3) current settlers survive and new ones colonize during the same season. In our current formulation of the DYMOSS, both colonization and persistence depend on habitat suitability, characterized by x_i(t−1)′β^(c)_{t−1}. They differ only in that persistence is also influenced by whether the site being occupied during season t − 1 enhances the suitability of the site or harms it through density dependence.
Additionally, the study of the dynamics that govern the distribution and abundance of biological populations requires an understanding of the physical and biotic processes that act upon them, and these vary in time and space. Consequently, as a final step in this chapter, we described a straightforward strategy to add spatial dependence among neighboring sites to the dynamic metapopulation model. This extension is based on the popular Bayesian spatial modeling technique of Besag et al. (1991), updated using the methods described in Hughes & Haran (2013).
Future steps along these lines are to (1) develop the software necessary to implement the tools described throughout this chapter, and (2) build a suite of additional extensions of this framework for occupancy models. The first of these extensions will incorporate information from different sources, such as tracks, scats, surveys and direct observations, into a single model. This can be accomplished by adding a layer to the hierarchy in which the source and spatial scale of the data are accounted for. The second extension is a single-season, spatially explicit, multiple-species co-occupancy model. This model will allow studying complex interactions and testing hypotheses about species interactions at a given point in time. Lastly, this co-occupancy model will be adapted to incorporate temporal dynamics in the spirit of the DYMOSS model.
CHAPTER 3
INTRINSIC ANALYSIS FOR OCCUPANCY MODELS
Eliminate all other factors, and the one which remains must be the truth.
–Sherlock Holmes
The Sign of Four
3.1 Introduction
Occupancy models are often used to understand the mechanisms that dictate the distribution of a species; therefore, variable selection plays a fundamental role in achieving this goal. To the best of our knowledge, "objective" Bayesian alternatives for variable selection have not been put forth for this problem, and with a few exceptions (Hooten & Hobbs 2014; Link & Barker 2009), AIC is the method used to choose from competing site-occupancy models. In addition, the procedures currently implemented and accessible to ecologists require enumerating and estimating all the candidate models (Fiske & Chandler 2011; Mazerolle & Mazerolle 2013). In practice this can be achieved if the model space considered is small enough, which is possible if the choice of the model space is guided by substantial prior knowledge about the underlying ecological processes. Nevertheless, many site-occupancy surveys collect large amounts of covariate information about the sampled sites. Given that the total number of candidate models grows exponentially fast with the number of predictors considered, choosing a reduced set of models guided by ecological intuition becomes increasingly difficult. This is even more so the case in the occupancy model context, where the model space is the Cartesian product of models for presence and models for detection. Given the issues mentioned above, we propose the first objective Bayesian variable selection method for the single-season occupancy model framework. This approach explores the entire model space in a principled manner. It is completely
automatic, precluding the need both for tuning parameters in the sampling algorithm and for subjective elicitation of parameter prior distributions.
As mentioned above, in ecological modeling, if model selection or (less frequently) model averaging is considered, the Akaike Information Criterion (AIC) (Akaike 1983), or a version of it, is the measure of choice for comparing candidate models (Fiske & Chandler 2011; Mazerolle & Mazerolle 2013). The AIC is designed to find the model whose density is, on average, closest in Kullback-Leibler divergence to the density of the true data-generating mechanism. The model with the smallest AIC is selected. However, if nested models are considered, one of them being the true one, generally the AIC will not select it (Wasserman 2000). Commonly, the model selected by AIC will be more complex than the true one. The reason for this is that the AIC has a weak signal-to-noise ratio and, as such, tends to overfit (Rao & Wu 2001). Other versions of the AIC provide a bias correction that enhances the signal-to-noise ratio, leading to a stronger penalization for model complexity. Some examples are the AICc (Hurvich & Tsai 1989) and AICu (McQuarrie et al. 1997); however, these are also not consistent for selection, albeit asymptotically efficient (Rao & Wu 2001).
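As a concrete illustration of the stronger complexity penalty, the sketch below computes AIC and AICc for Gaussian linear models; the function names and the toy data are ours, not from the text, and the log-likelihood is profiled over the error variance.

```python
import numpy as np

def aic_gaussian(rss, n, k):
    """AIC for a Gaussian linear model with k mean coefficients; the
    error variance counts as one additional parameter."""
    p = k + 1
    return n * np.log(rss / n) + 2 * p

def aicc_gaussian(rss, n, k):
    """AICc: AIC plus the small-sample bias correction of Hurvich & Tsai."""
    p = k + 1
    return aic_gaussian(rss, n, k) + 2 * p * (p + 1) / (n - p - 1)

rng = np.random.default_rng(1)
n = 25
x = rng.normal(size=(n, 3))
y = 1.0 + x[:, 0] + rng.normal(size=n)   # only the first predictor matters

# Compare the one-predictor model against the full three-predictor model.
for cols in ([0], [0, 1, 2]):
    X = np.column_stack([np.ones(n), x[:, cols]])
    _, rss_arr, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(rss_arr[0])
    k = X.shape[1]
    print(cols, round(aic_gaussian(rss, n, k), 2), round(aicc_gaussian(rss, n, k), 2))
```

Because the correction term $2p(p+1)/(n-p-1)$ grows with $p$, AICc penalizes the larger model more heavily than AIC does, especially for small $n$.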
If we are interested in prediction, as opposed to testing, the AIC is certainly appropriate. However, when conducting inference, the use of Bayesian model averaging and selection methods is more fitting. If the true data-generating mechanism is among those considered, Bayesian methods asymptotically choose the true model with probability one. Conversely, if the true model is not among the alternatives and a suitable parameter prior is used, the posterior probability of the most parsimonious model closest to the true one tends asymptotically to one.
In spite of this, for Bayesian testing in general, direct elicitation of prior probabilistic statements is often impeded because the problems studied may not be sufficiently well understood to make an informed decision about the priors. Alternatively, there may be a prohibitively large number of parameters, making specifying priors for each of
these parameters an arduous task. In addition, seemingly innocuous subjective choices for the priors on the parameter space may drastically affect test outcomes. This has been a recurring argument in favor of objective Bayesian procedures, which appeal to the use of formal rules to build parameter priors that incorporate the structural information inside the likelihood while utilizing some objective criterion (Kass & Wasserman 1996).
One popular choice of "objective" prior is the reference prior (Berger & Bernardo 1992), the prior that maximizes the amount of signal extracted from the data. These priors have proven to be effective, as they are fully automatic and can be frequentist matching, in the sense that the posterior credible interval agrees with the frequentist confidence interval from repeated sampling with equal coverage probability (Kass & Wasserman 1996). Reference priors, however, are improper, and while they yield reasonable posterior parameter probabilities, the derived model posterior probabilities may be ill defined. To avoid this shortcoming, Berger & Pericchi (1996) introduced the intrinsic Bayes factor (IBF) for model comparison. Moreno et al. (1998), building on the IBF of Berger & Pericchi (1996), developed a limiting procedure to generate a system of priors that yield well-defined posteriors, even though these priors may sometimes be improper. The IBF is built using a data-dependent prior to automatically generate Bayes factors; however, the extension introduced by Moreno et al. (1998) generates the intrinsic prior by taking a theoretical average over the space of training samples, freeing the prior from data dependence.
In our view, in the face of a large number of predictors, the best alternative is to run a stochastic search algorithm using good "objective" testing parameter priors and to incorporate suitable model priors. This being said, the discussion of model priors is deferred until Chapter 4; this chapter focuses on the priors on the parameter space.
The chapter is structured as follows. First, issues surrounding multimodel inference are described and insight into objective Bayesian inferential procedures is provided.
Then, building on modern methods for "objective" Bayesian testing to generate priors on the parameter space, the intrinsic priors for the parameters of the occupancy model are derived. These are used in the construction of an algorithm for "objective" model selection tailored to the occupancy model framework. To assess the performance of our methods, we provide results from a simulation study in which distinct scenarios, both favorable and unfavorable, are used to determine the robustness of these tools, and we analyze the Blue Hawker data set, which has been examined previously in the ecological literature (Dorazio & Taylor-Rodríguez 2012; Kéry et al. 2010).

3.2 Objective Bayesian Inference
As mentioned before, in practice, noninformative priors arising from structural rules are an alternative to subjective elicitation of priors. Some of the rules used in defining noninformative priors include the principle of insufficient reason, parametrization invariance, maximum entropy, geometric arguments, coverage matching, and decision-theoretic approaches (see Kass & Wasserman (1996) for a discussion).
These rules reflect one of two attitudes: (1) noninformative priors either aim to convey unique representations of ignorance, or (2) they attempt to produce probability statements that may be accepted by convention. This latter attitude is in the same spirit as how weights and distances are defined (Kass & Wasserman 1996) and characterizes the way in which Bayesian reference methods are interpreted today; i.e., noninformative priors are seen as chosen by convention according to the situation.
A word of caution must be given when using noninformative priors. Difficulties arise in their implementation that should not be taken lightly. In particular, these difficulties may occur because noninformative priors are generally improper (meaning that they do not integrate or sum to a finite number) and, as such, are said to depend on arbitrary constants.
Bayes factors depend strongly on the prior distributions for the parameters included in each of the models being compared. This can be an important limitation,
considering that when noninformative priors are used, the Bayes factors become functions of the ratio of arbitrary constants, given that these priors are typically improper (see Jeffreys 1961; Pericchi 2005, and references therein). Many different approaches have since been developed to deal with the arbitrary constants that arise when using improper priors. These include the use of partial Bayes factors (Berger & Pericchi 1996; Good 1950; Lempers 1971), setting the ratio of arbitrary constants to a predefined value (Spiegelhalter & Smith 1982), and approximations to the Bayes factor (see Haughton 1988, as cited in Berger & Pericchi 1996; Kass & Raftery 1995; Tierney & Kadane 1986).

3.2.1 The Intrinsic Methodology
Berger & Pericchi (1996) cleverly dealt with the arbitrary constants that arise when using improper priors by introducing the intrinsic Bayes factor (IBF) procedure. This solution, based on partial Bayes factors, provides the means to replace the improper priors by proper "posterior" priors. The IBF is obtained by combining the model structure with information contained in the observed data. Furthermore, they showed that, as the sample size tends to infinity, the intrinsic Bayes factor corresponds to the proper Bayes factor arising from the intrinsic priors.
Intrinsic priors, however, are not unique. The asymptotic correspondence between the IBF and the Bayes factor arising from the intrinsic prior yields two functional equations that are solved by a whole class of intrinsic priors. Because all the priors in the class produce Bayes factors that are asymptotically equivalent to the IBF, for finite sample sizes the resulting Bayes factor is not unique. To address this issue, Moreno et al. (1998) formalized the methodology through the "limiting procedure". This procedure allows one to obtain a unique Bayes factor, consolidating the method as a valid objective Bayesian model selection procedure, which we will refer to as the Bayes factor for intrinsic priors (BFIP). This result is valid in particular for nested models, although the methodology may be extended, with some caution, to nonnested models.
As mentioned before, the Bayesian hypothesis testing procedure is highly sensitive to parameter-prior specification, and not all priors that are useful for estimation are recommended for hypothesis testing or model selection. Evidence of this is provided by the Jeffreys-Lindley paradox, which states that a point null hypothesis will always be accepted when the variance of a conjugate prior goes to infinity (Robert 1993). Additionally, when comparing nested models, the null model should correspond to a substantial reduction in complexity from that of larger alternative models. Hence, priors for the larger alternative models that place probability mass away from the null model are wasteful: if the true model is "far" from the null, it will be easily detected by any statistical procedure. Therefore, the prior on the alternative models should "work harder" at selecting competitive models that are "close" to the null. This principle, known as the Savage continuity condition (Gunel & Dickey 1974), is widely recognized by statisticians.
Interestingly, the intrinsic prior in correspondence with the BFIP automatically satisfies the Savage continuity condition. That is, when comparing nested models, the intrinsic prior for the more complex model is centered around the null model and, despite arising from a limiting procedure, it is not subject to the Jeffreys-Lindley paradox.
Moreover, beyond the usual pairwise consistency of the Bayes factor for nested models, Casella et al. (2009) show that the corresponding Bayesian procedure with intrinsic priors for variable selection in normal regression is consistent over the entire class of normal linear models, adding an important feature to the list of virtues of the procedure. Consistency of the BFIP for the case where the dimension of the alternative model grows with the sample size is discussed in Moreno et al. (2010).

3.2.2 Mixtures of g-Priors
As previously mentioned, in the Bayesian paradigm a model $M \in \mathcal{M}$ is defined by a sampling density and a prior distribution. The sampling density associated with model $M$ is denoted by $f(y \mid \beta_M, \sigma^2_M, M)$, where $(\beta_M, \sigma^2_M)$ is a vector of model-specific
unknown parameters. The prior for model $M$ and its corresponding set of parameters is
$$\pi(\beta_M, \sigma^2_M, M \mid \mathcal{M}) = \pi(\beta_M, \sigma^2_M \mid M, \mathcal{M}) \cdot \pi(M \mid \mathcal{M}).$$
Objective local priors for the model parameters $(\beta_M, \sigma^2_M)$ are achieved through modifications and extensions of Zellner's g-prior (Liang et al. 2008; Womack et al. 2014). In particular, below we focus on the intrinsic prior and provide some details for other scaled mixtures of g-priors. We defer the discussion of priors over the model space until Chapter 5, where we describe them in detail and develop a few alternatives of our own.

3.2.2.1 Intrinsic priors
An automatic choice of an objective prior is the intrinsic prior (Berger & Pericchi 1996; Moreno et al. 1998). Because $M_B \subseteq M$ for all $M \in \mathcal{M}$, the intrinsic prior for $(\beta_M, \sigma^2_M)$ is defined as an expected posterior prior
$$\pi^I(\beta_M, \sigma^2_M \mid M) = \int p^R(\beta_M, \sigma^2_M \mid \tilde{y}, M)\, m^R(\tilde{y} \mid M_B)\, d\tilde{y}, \qquad (3–1)$$
where $\tilde{y}$ is a minimal training sample for model $M$, $I$ denotes the intrinsic distributions, and $R$ denotes distributions derived from the reference prior $\pi^R(\beta_M, \sigma^2_M \mid M) = c_M\, d\beta_M\, d\sigma^2_M / \sigma^2_M$. In (3–1), $m^R(\tilde{y} \mid M) = \iint f(\tilde{y} \mid \beta_M, \sigma^2_M, M)\, \pi^R(\beta_M, \sigma^2_M \mid M)\, d\beta_M\, d\sigma^2_M$ is the reference marginal of $\tilde{y}$ under model $M$, and $p^R(\beta_M, \sigma^2_M \mid \tilde{y}, M) = f(\tilde{y} \mid \beta_M, \sigma^2_M, M)\, \pi^R(\beta_M, \sigma^2_M \mid M) / m^R(\tilde{y} \mid M)$ is the reference posterior density.
In the regression framework, the reference marginal $m^R$ is improper and produces improper intrinsic priors. However, the intrinsic Bayes factor of model $M$ to the base model $M_B$ is well defined and given by
$$BF^I_{M,M_B}(y) = (1 - R^2_M)^{-\frac{n-|M_B|}{2}} \times \int_0^1 \left( \frac{n + \sin^2(\tfrac{\pi}{2}\theta)\,(|M|+1)}{n + \frac{\sin^2(\frac{\pi}{2}\theta)\,(|M|+1)}{1-R^2_M}} \right)^{\frac{n-|M|}{2}} \left( \frac{\sin^2(\tfrac{\pi}{2}\theta)\,(|M|+1)}{n + \frac{\sin^2(\frac{\pi}{2}\theta)\,(|M|+1)}{1-R^2_M}} \right)^{\frac{|M|-|M_B|}{2}} d\theta, \qquad (3–2)$$
where $R^2_M$ is the coefficient of determination of model $M$ versus model $M_B$. The Bayes factor between two models $M$ and $M'$ is defined as $BF^I_{M,M'}(y) = BF^I_{M,M_B}(y) / BF^I_{M',M_B}(y)$.
The "goodness" of the model $M$ based on the intrinsic priors is given by its posterior probability
$$p^I(M \mid y, \mathcal{M}) = \frac{BF^I_{M,M_B}(y)\, \pi(M \mid \mathcal{M})}{\sum_{M' \in \mathcal{M}} BF^I_{M',M_B}(y)\, \pi(M' \mid \mathcal{M})}. \qquad (3–3)$$
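The Bayes factor in (3–2) is a one-dimensional integral and is straightforward to evaluate numerically. The sketch below is our own, using a midpoint rule over $\theta$ and assuming a uniform prior over the model space when forming the posterior probabilities of (3–3); the function names and toy inputs are illustrative, not part of the text.

```python
import numpy as np

def intrinsic_bf(r2, n, pm, pb, grid=4000):
    """Midpoint-rule approximation of BF^I_{M,M_B}(y) in Eq. (3-2).
    r2: coefficient of determination of M versus M_B;
    n: sample size; pm = |M|; pb = |M_B|."""
    theta = (np.arange(grid) + 0.5) / grid          # midpoints of (0, 1)
    s = np.sin(np.pi * theta / 2.0) ** 2 * (pm + 1)
    denom = n + s / (1.0 - r2)
    integrand = ((n + s) / denom) ** ((n - pm) / 2.0) * (s / denom) ** ((pm - pb) / 2.0)
    return (1.0 - r2) ** (-(n - pb) / 2.0) * integrand.mean()

def posterior_probs(bfs):
    """Eq. (3-3) with a uniform model prior: normalize the Bayes factors."""
    bfs = np.asarray(bfs, dtype=float)
    return bfs / bfs.sum()

# Toy comparison: the base model (BF = 1 by definition) against two
# nested alternatives with pb = 1.
bfs = [1.0,
       intrinsic_bf(r2=0.40, n=60, pm=2, pb=1),
       intrinsic_bf(r2=0.42, n=60, pm=4, pb=1)]
print(posterior_probs(bfs))
```

Note how a larger $R^2_M$ raises the Bayes factor, while at $R^2_M = 0$ the integrand penalizes the extra parameters and the Bayes factor falls below one.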
It has been shown that the system of intrinsic priors produces consistent model selection (Casella et al. 2009; Girón et al. 2010). In the context of well-formulated models, the true model $M_T$ is the smallest well-formulated model $M \in \mathcal{M}$ such that $\alpha \in M$ if $\beta_\alpha \neq 0$. If $M_T$ is the true model, then the posterior probability of model $M_T$ based on equation (3–3) converges to 1.

3.2.2.2 Other mixtures of g-priors
Scaled mixtures of g-priors place a reference prior on $(\beta_{M_B}, \sigma^2)$ and a multivariate normal distribution on $\beta$ in $M \setminus M_B$, namely a normal with mean $0$ and precision matrix
$$\frac{q_M w}{n \sigma^2}\, Z'_M (I - H_0) Z_M,$$
where $H_0$ is the hat matrix associated with $Z_{M_B}$. The prior is completed by a prior on $w$ and a choice of scaling $q_M$, which is set at $|M|+1$ to account for the minimal sample size of $M$. Under these assumptions, the Bayes factor for $M$ to $M_B$ is given by
$$BF_{M,M_B}(y) = (1 - R^2_M)^{-\frac{n-|M_B|}{2}} \int \left( \frac{n + w(|M|+1)}{n + \frac{w(|M|+1)}{1-R^2_M}} \right)^{\frac{n-|M|}{2}} \left( \frac{w(|M|+1)}{n + \frac{w(|M|+1)}{1-R^2_M}} \right)^{\frac{|M|-|M_B|}{2}} \pi(w)\, dw.$$
We consider the following priors on $w$. The intrinsic prior is $\pi(w) = \mathrm{Beta}(w \mid 0.5, 0.5)$, which is only defined for $w \in (0, 1)$. A version of the Zellner-Siow prior is given by $w \sim \mathrm{Gamma}(0.5, 0.5)$, which produces a multivariate Cauchy distribution on $\beta$. A family of hyper-g priors is defined by $\pi(w) \propto w^{-1/2}(\beta + w)^{-(\alpha+1)/2}$, which have Cauchy-like tails but produce more shrinkage than the Cauchy prior.
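Since the integrand above is the same for every scaled mixture of g-priors, the Bayes factor can be approximated by Monte Carlo, averaging the integrand over draws from $\pi(w)$. A sketch under our own naming, using the $\mathrm{Beta}(1/2, 1/2)$ intrinsic mixing density and a $\mathrm{Gamma}(1/2, 1/2)$ Zellner-Siow-type density:

```python
import numpy as np

def mixture_g_bf(r2, n, pm, pb, w_draws):
    """Monte Carlo approximation of the scaled mixture-of-g Bayes factor:
    average the integrand over draws from pi(w)."""
    w = np.asarray(w_draws, dtype=float)
    s = w * (pm + 1)
    denom = n + s / (1.0 - r2)
    integrand = ((n + s) / denom) ** ((n - pm) / 2.0) * (s / denom) ** ((pm - pb) / 2.0)
    return (1.0 - r2) ** (-(n - pb) / 2.0) * integrand.mean()

rng = np.random.default_rng(7)
w_intrinsic = rng.beta(0.5, 0.5, size=20000)   # intrinsic: w ~ Beta(1/2, 1/2)
w_zs = rng.gamma(0.5, 2.0, size=20000)         # Zellner-Siow-type: shape 1/2, rate 1/2
print(mixture_g_bf(0.4, 60, 2, 1, w_intrinsic))
print(mixture_g_bf(0.4, 60, 2, 1, w_zs))
```

The Gamma draws use NumPy's shape/scale parameterization, so rate $1/2$ corresponds to scale $2.0$.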
3.3 Objective Bayes Occupancy Model Selection
As mentioned before, objective Bayesian inferential approaches for ecological models are lacking. In particular, there is a need for suitable objective and automatic Bayesian testing procedures, and for software implementations that explore the model space considered thoroughly. With this goal in mind, in this section we develop an objective, intrinsic, and fully automatic Bayesian model selection methodology for single-season site-occupancy models. We refer to this method as automatic and objective given that its implementation requires no hyperparameter tuning and that it is built using noninformative priors with good testing properties (e.g., intrinsic priors).
An inferential method for the occupancy problem using the intrinsic approach is possible given that we are able to link intrinsic Bayesian tools for the normal linear model through our probit formulation of the occupancy model. In other words, because we can represent the single-season probit occupancy model through the hierarchy
$$y_{ij} \mid z_i, w_{ij} \sim \mathrm{Bernoulli}(z_i I_{w_{ij} > 0}), \qquad w_{ij} \mid \lambda \sim N(q'_{ij}\lambda, 1),$$
$$z_i \mid v_i \sim \mathrm{Bernoulli}(I_{v_i > 0}), \qquad v_i \mid \alpha \sim N(x'_i \alpha, 1),$$
it is possible to solve the selection problem on the scale of the latent variables $w_{ij}$ and $v_i$ and to use those results at the level of the occupancy and detection processes.
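A minimal simulation from this hierarchy clarifies the data-augmented structure; the covariate values, coefficients, and sizes below are hypothetical, chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(11)
N, J = 100, 4                                  # sites and surveys per site (toy sizes)

# Hypothetical designs: one intercept plus one covariate on each level.
x = np.column_stack([np.ones(N), rng.normal(size=N)])             # occupancy, N x 2
q = np.stack([np.ones((N, J)), rng.normal(size=(N, J))], axis=2)  # detection, N x J x 2
alpha = np.array([0.3, 1.0])
lam = np.array([-0.2, 0.8])

# Latent-variable representation of the single-season probit occupancy model.
v = x @ alpha + rng.normal(size=N)             # v_i | alpha ~ N(x_i' alpha, 1)
z = (v > 0).astype(int)                        # z_i = I(v_i > 0)
w = q @ lam + rng.normal(size=(N, J))          # w_ij | lambda ~ N(q_ij' lambda, 1)
y = (z[:, None] * (w > 0)).astype(int)         # y_ij | z_i, w_ij

print(z.mean(), y.mean())
```

By construction, detections can occur only at occupied sites, which is the defining feature the selection method must respect.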
In what follows, we first provide some necessary notation. Then, a derivation of the intrinsic priors for the parameters of the detection and occupancy components is outlined. Using these priors, we obtain the general form of the model posterior probabilities. Finally, the results are incorporated into a model selection algorithm for site-occupancy data. Although the priors on the model space are not discussed in this chapter, the software and methods developed have different choices of model priors built in.
3.3.1 Preliminaries
The notation used in Chapter 2 is carried over to this section. Namely, presence is denoted by $z$, detection by $y$, their corresponding latent processes are $v$ and $w$, and the model parameters are denoted by $\alpha$ and $\lambda$. However, some additional notation is also necessary. Let $M_0 = \{M_{0y}, M_{0z}\}$ denote the "base" model, defined by the smallest models considered for the detection and presence processes. The base models $M_{0y}$ and $M_{0z}$ include predictors that must be contained in every model that belongs to the model space. Some examples of base models are the intercept-only model, a model with covariates related to the sampling design, and a model including some predictors important to the researcher that should appear in every model.
Furthermore, let the sets $[K_z] = \{1, 2, \ldots, K_z\}$ and $[K_y] = \{1, 2, \ldots, K_y\}$ index the covariates considered for the variable selection procedure for the presence and detection processes, respectively. That is, these sets denote the covariates that can be added to the base models in $M_0$ or removed from the largest possible models considered, $M_{Fz}$ and $M_{Fy}$, which we will refer to as the "full" models. The model space can then be represented by the Cartesian product of subsets $A_y \subseteq [K_y]$ and $A_z \subseteq [K_z]$. The entire model space is populated by models of the form $M_A = \{M_{A_y}, M_{A_z}\} \in \mathcal{M} = \mathcal{M}_y \times \mathcal{M}_z$, with $M_{A_y} \in \mathcal{M}_y$ and $M_{A_z} \in \mathcal{M}_z$.
For the presence process $z$, the design matrix for model $M_{A_z}$ is given by the block matrix $X_{A_z} = (X_0 \mid X_{r,A})$, where $X_0$ corresponds to the design matrix of the base model (which is such that $M_{0z} \subseteq M_{A_z} \in \mathcal{M}_z$ for all $A_z \subseteq [K_z]$) and $X_{r,A}$ corresponds to the submatrix that contains the covariates indexed by $A_z$. Analogously, for the detection process $y$, the design matrix is given by $Q_{A_y} = (Q_0 \mid Q_{r,A})$. Similarly, the coefficients for models $M_{A_z}$ and $M_{A_y}$ are given by $\alpha_A = (\alpha'_0, \alpha'_{r,A})'$ and $\lambda_A = (\lambda'_0, \lambda'_{r,A})'$.
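This bookkeeping can be sketched in a few lines; `all_subsets` and `design_for` are our illustrative names, with the base design $X_0$ prepended to the columns indexed by the chosen subset.

```python
import numpy as np
from itertools import chain, combinations

def all_subsets(K):
    """All subsets of [K] = {1, ..., K}, from the empty set to the full set."""
    items = range(1, K + 1)
    return list(chain.from_iterable(combinations(items, r) for r in range(K + 1)))

def design_for(A, X0, X_extra):
    """Block design matrix X_A = (X0 | X_{r,A}) for a subset A of [K].
    X_extra holds the K candidate columns; A uses 1-based indices."""
    idx = [a - 1 for a in A]
    return np.hstack([X0, X_extra[:, idx]]) if idx else X0

Kz, Ky = 3, 2
model_space = [(Az, Ay) for Az in all_subsets(Kz) for Ay in all_subsets(Ky)]
print(len(model_space))   # 2^3 * 2^2 = 32 candidate models
```

The Cartesian-product structure is what makes the occupancy model space grow so quickly: doubling with every covariate added to either component.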
With these elements in place, the model selection problem consists of finding subsets of covariates, indexed by $A = \{A_z, A_y\}$, that have high posterior probability given the detection and occupancy processes. This is equivalent to finding models with
high posterior odds when compared to a suitable base model. These posterior odds are given by
$$\frac{p(M_A \mid y, z)}{p(M_0 \mid y, z)} = \frac{m(y, z \mid M_A)\, \pi(M_A)}{m(y, z \mid M_0)\, \pi(M_0)} = BF_{M_A, M_0}(y, z)\, \frac{\pi(M_A)}{\pi(M_0)}.$$
Since we are able to represent the occupancy model as a truncation of latent normal variables, it is possible to work through the occupancy model selection problem on the latent normal scale used for the presence and detection processes. We formulate two solutions to this problem: one that depends on the observed and latent components, and another that depends solely on the latent-level variables used to data-augment the problem. We will, however, focus on the latter approach, as it yields a straightforward MCMC sampling scheme. For completeness, the other alternative is described in Section 3.4.
At the root of our objective inferential procedure for occupancy models lies the conditional argument introduced by Womack et al. (work in progress) for simple probit regression. In the occupancy setting, the argument is
$$p(M_A \mid y, z, w, v) = \frac{m(y, z, v, w \mid M_A)\, \pi(M_A)}{m(y, z, w, v)}$$
$$= \frac{f_{y,z}(y, z \mid w, v) \left( \int f_{v,w}(v, w \mid \alpha, \lambda, M_A)\, \pi_{\alpha,\lambda}(\alpha, \lambda \mid M_A)\, d(\alpha, \lambda) \right) \pi(M_A)}{f_{y,z}(y, z \mid w, v) \sum_{M^* \in \mathcal{M}} \left( \int f_{v,w}(v, w \mid \alpha, \lambda, M^*)\, \pi_{\alpha,\lambda}(\alpha, \lambda \mid M^*)\, d(\alpha, \lambda) \right) \pi(M^*)}$$
$$= \frac{m(v \mid M_{A_z})\, m(w \mid M_{A_y})\, \pi(M_A)}{m(v)\, m(w)} \propto m(v \mid M_{A_z})\, m(w \mid M_{A_y})\, \pi(M_A), \qquad (3–4)$$
where

1. $f_{y,z}(y, z \mid w, v) = \prod_{i=1}^N I_{z_i v_i > 0}\, I_{(1 - z_i) v_i \le 0} \prod_{j=1}^{J_i} (z_i I_{w_{ij} > 0})^{y_{ij}} (1 - z_i I_{w_{ij} > 0})^{1 - y_{ij}}$,

2. $f_{v,w}(v, w \mid \alpha, \lambda, M_A) = \underbrace{\left( \prod_{i=1}^N \phi(v_i;\, x'_i \alpha_{M_{A_z}}, 1) \right)}_{f(v \mid \alpha_{r,A}, \alpha_0, M_{A_z})} \underbrace{\left( \prod_{i=1}^N \prod_{j=1}^{J_i} \phi(w_{ij};\, q'_{ij} \lambda_{M_{A_y}}, 1) \right)}_{f(w \mid \lambda_{r,A}, \lambda_0, M_{A_y})}$, and
3. $\pi_{\alpha,\lambda}(\alpha, \lambda \mid M_A) = \pi_\alpha(\alpha \mid M_{A_z})\, \pi_\lambda(\lambda \mid M_{A_y})$.
This result implies that, once the occupancy and detection indicators are conditioned on the latent processes $v$ and $w$, respectively, the model posterior probabilities depend only on the latent variables. Hence, in this case, the model selection problem is driven by the posterior odds
$$\frac{p(M_A \mid y, z, w, v)}{p(M_0 \mid y, z, w, v)} = \frac{m(w, v \mid M_A)}{m(w, v \mid M_0)}\, \frac{\pi(M_A)}{\pi(M_0)}, \qquad (3–5)$$
where $m(w, v \mid M_A) = m(w \mid M_{A_y}) \cdot m(v \mid M_{A_z})$, with
$$m(v \mid M_{A_z}) = \iint f(v \mid \alpha_{r,A}, \alpha_0, M_{A_z})\, \pi(\alpha_{r,A} \mid \alpha_0, M_{A_z})\, \pi(\alpha_0)\, d\alpha_{r,A}\, d\alpha_0, \qquad (3–6)$$
$$m(w \mid M_{A_y}) = \iint f(w \mid \lambda_{r,A}, \lambda_0, M_{A_y})\, \pi(\lambda_{r,A} \mid \lambda_0, M_{A_y})\, \pi(\lambda_0)\, d\lambda_0\, d\lambda_{r,A}. \qquad (3–7)$$
3.3.2 Intrinsic Priors for the Occupancy Problem
In general, the intrinsic priors defined by Moreno et al. (1998) use the functional form of the response to inform their construction, assuming some preliminary prior distribution, proper or improper, on the model parameters. For our purposes, we assume noninformative improper priors for the parameters, denoted by $\pi^N(\cdot \mid \cdot)$. Specifically, the intrinsic priors $\pi^{IP}(\theta_{M^*} \mid M^*)$ for a vector of parameters $\theta_{M^*}$, corresponding to model $M^* \in \{M_0, M\} \subset \mathcal{M}$, for a response vector $s$ with probability density (or mass) function $f(s \mid \theta_{M^*})$, are defined by
$$\pi^{IP}(\theta_{M_0} \mid M_0) = \pi^N(\theta_{M_0} \mid M_0),$$
$$\pi^{IP}(\theta_M \mid M) = \pi^N(\theta_M \mid M) \int \frac{m(\tilde{s} \mid M_0)}{m(\tilde{s} \mid M)}\, f(\tilde{s} \mid \theta_M, M)\, d\tilde{s},$$
where $\tilde{s}$ is a theoretical training sample.
In what follows, whenever it is clear from the context, in an attempt to simplify the notation, $M_A$ will be used to refer to $M_{A_z}$ or $M_{A_y}$, and $A$ will denote $A_z$ or $A_y$. To derive
the parameter priors involved in equations 3–6 and 3–7 using the objective intrinsic prior strategy, we start by assuming flat priors $\pi^N(\alpha_A \mid M_A) \propto c_A$ and $\pi^N(\lambda_A \mid M_A) \propto d_A$, where $c_A$ and $d_A$ are unknown constants.
The intrinsic prior for the parameters associated with the occupancy process, $\alpha_A$, conditional on model $M_A$, is
$$\pi^{IP}(\alpha_A \mid M_A) = \pi^N(\alpha_A \mid M_A) \int \frac{m(\tilde{v} \mid M_0)}{m(\tilde{v} \mid M_A)}\, f(\tilde{v} \mid \alpha_A, M_A)\, d\tilde{v},$$
where the marginals $m(\tilde{v} \mid M_j)$, with $j \in \{A, 0\}$, are obtained by solving the analog of equation 3–6 for the (theoretical) training sample $\tilde{v}$. These marginals are given by
$$m(\tilde{v} \mid M_j) = c_j\, (2\pi)^{-\frac{p_{A_z} - p_j}{2}}\, |\tilde{X}'_j \tilde{X}_j|^{-\frac{1}{2}}\, e^{-\frac{1}{2} \tilde{v}'(I - \tilde{H}_j)\tilde{v}}.$$
The training sample $\tilde{v}$ has dimension $p_{A_z} = |M_{A_z}|$, that is, the total number of parameters in model $M_{A_z}$. Note that, without ambiguity, we use $|\cdot|$ to denote both the cardinality of a set and the determinant of a matrix. The design matrix $\tilde{X}_A$ corresponds to the training sample $\tilde{v}$ and is chosen such that $\tilde{X}'_A \tilde{X}_A = \frac{p_{A_z}}{N} X'_A X_A$ (León-Novelo et al. 2012), and $\tilde{H}_j$ is the corresponding hat matrix.
Replacing $m(\tilde{v} \mid M_0)$ and $m(\tilde{v} \mid M_A)$ in $\pi^{IP}(\alpha_A \mid M_A)$ and solving the integral with respect to the theoretical training sample $\tilde{v}$, we have
$$\pi^{IP}(\alpha_A \mid M_A) = c_A \int \left( (2\pi)^{-\frac{p_{A_z} - p_{0_z}}{2}} \left( \frac{c_0}{c_A} \right) e^{-\frac{1}{2} \tilde{v}'\left((I - \tilde{H}_0) - (I - \tilde{H}_A)\right)\tilde{v}}\, \frac{|\tilde{X}'_A \tilde{X}_A|^{1/2}}{|\tilde{X}'_0 \tilde{X}_0|^{1/2}} \right) \times \left( (2\pi)^{-\frac{p_{A_z}}{2}} e^{-\frac{1}{2}(\tilde{v} - \tilde{X}_A \alpha_A)'(\tilde{v} - \tilde{X}_A \alpha_A)} \right) d\tilde{v}$$
$$= c_0\, (2\pi)^{-\frac{p_{A_z} - p_{0_z}}{2}}\, |\tilde{X}'_{r,A} \tilde{X}_{r,A}|^{\frac{1}{2}}\, 2^{-\frac{p_{A_z} - p_{0_z}}{2}} \exp\left[ -\frac{1}{2} \alpha'_{r,A} \left( \frac{1}{2} \tilde{X}'_{r,A} \tilde{X}_{r,A} \right) \alpha_{r,A} \right]$$
$$= \pi^N(\alpha_0) \times N\left( \alpha_{r,A} \,\middle|\, 0,\; 2 \cdot (\tilde{X}'_{r,A} \tilde{X}_{r,A})^{-1} \right). \qquad (3–8)$$
Analogously, the intrinsic prior for the parameters associated with the detection process is
$$\pi^{IP}(\lambda_A \mid M_A) = d_0\, (2\pi)^{-\frac{p_{A_y} - p_{0_y}}{2}}\, |\tilde{Q}'_{r,A} \tilde{Q}_{r,A}|^{\frac{1}{2}}\, 2^{-\frac{p_{A_y} - p_{0_y}}{2}} \exp\left[ -\frac{1}{2} \lambda'_{r,A} \left( \frac{1}{2} \tilde{Q}'_{r,A} \tilde{Q}_{r,A} \right) \lambda_{r,A} \right]$$
$$= \pi^N(\lambda_0) \times N\left( \lambda_{r,A} \,\middle|\, 0,\; 2 \cdot (\tilde{Q}'_{r,A} \tilde{Q}_{r,A})^{-1} \right). \qquad (3–9)$$
In short, the intrinsic priors for $\alpha_A = (\alpha'_0, \alpha'_{r,A})'$ and $\lambda_A = (\lambda'_0, \lambda'_{r,A})'$ are each the product of a reference prior on the parameters of the base model and a normal density on the parameters indexed by $A_z$ and $A_y$, respectively.

3.3.3 Model Posterior Probabilities
We now derive the expressions involved in the calculation of the model posterior probabilities. First, recall that $p(M_A \mid y, z, w, v) \propto m(w, v \mid M_A)\, \pi(M_A)$. Hence, determining this posterior probability only requires calculating $m(w, v \mid M_A)$.
Note that, since $w$ and $v$ are independent, obtaining the model posteriors from expression 3–4 reduces to finding closed-form expressions for the marginals $m(v \mid M_{A_z})$ and $m(w \mid M_{A_y})$ from equations 3–6 and 3–7, respectively. Therefore,
$$m(w, v \mid M_A) = \iint f(v, w \mid \alpha, \lambda, M_A)\, \pi^{IP}(\alpha \mid M_{A_z})\, \pi^{IP}(\lambda \mid M_{A_y})\, d\alpha\, d\lambda. \qquad (3–10)$$
For the latent variable associated with the occupancy process, plugging the parameter intrinsic prior given by 3–8 into equation 3–6 (recalling that $\tilde{X}'_A \tilde{X}_A = \frac{p_{A_z}}{N} X'_A X_A$) and integrating out $\alpha_A$ yields
$$m(v \mid M_A) = \iint c_0\, N(v \mid X_0 \alpha_0 + X_{r,A} \alpha_{r,A},\, I)\, N\left( \alpha_{r,A} \mid 0,\; 2 (\tilde{X}'_{r,A} \tilde{X}_{r,A})^{-1} \right) d\alpha_{r,A}\, d\alpha_0$$
$$= c_0\, (2\pi)^{-n/2} \int \left( \frac{p_{A_z}}{2N + p_{A_z}} \right)^{\frac{p_{A_z} - p_{0_z}}{2}} \exp\left[ -\frac{1}{2}(v - X_0 \alpha_0)' \left( I - \left( \frac{2N}{2N + p_{A_z}} \right) H_{r,A_z} \right) (v - X_0 \alpha_0) \right] d\alpha_0$$
$$= c_0\, (2\pi)^{-(n - p_{0_z})/2} \left( \frac{p_{A_z}}{2N + p_{A_z}} \right)^{\frac{p_{A_z} - p_{0_z}}{2}} |X'_0 X_0|^{-\frac{1}{2}} \times \exp\left[ -\frac{1}{2} v' \left( I - H_{0_z} - \left( \frac{2N}{2N + p_{A_z}} \right) H_{r,A_z} \right) v \right], \qquad (3–11)$$
with $H_{r,A_z} = H_{A_z} - H_{0_z}$, where $H_{A_z}$ is the hat matrix for the entire model $M_{A_z}$ and $H_{0_z}$ is the hat matrix for the base model.
Similarly, the marginal distribution for $w$ is
$$m(w \mid M_A) = d_0\, (2\pi)^{-(J - p_{0_y})/2} \left( \frac{p_{A_y}}{2J + p_{A_y}} \right)^{\frac{p_{A_y} - p_{0_y}}{2}} |Q'_0 Q_0|^{-\frac{1}{2}} \times \exp\left[ -\frac{1}{2} w' \left( I - H_{0_y} - \left( \frac{2J}{2J + p_{A_y}} \right) H_{r,A_y} \right) w \right], \qquad (3–12)$$
where $J = \sum_{i=1}^N J_i$; in other words, $J$ denotes the total number of surveys conducted.
Now, the marginals under the base model $M_0 = \{M_{0y}, M_{0z}\}$ are
$$m(v \mid M_0) = \int c_0\, N(v \mid X_0 \alpha_0, I)\, d\alpha_0 = c_0\, (2\pi)^{-(n - p_{0_z})/2}\, |X'_0 X_0|^{-1/2} \exp\left[ -\frac{1}{2} v'(I - H_{0_z}) v \right] \qquad (3–13)$$
and
$$m(w \mid M_0) = d_0\, (2\pi)^{-(J - p_{0_y})/2}\, |Q'_0 Q_0|^{-1/2} \exp\left[ -\frac{1}{2} w'(I - H_{0_y}) w \right]. \qquad (3–14)$$
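Because (3–11) through (3–14) are in closed form, the latent marginals can be evaluated directly, up to the common constants $c_0$ and $d_0$, which cancel in posterior odds. The sketch below is ours, written for the occupancy side and assuming the extra columns are orthogonal to the base design, as in the derivation above; `log_m_v` is our name.

```python
import numpy as np

def hat(X):
    """Hat (projection) matrix X (X'X)^{-1} X'."""
    return X @ np.linalg.solve(X.T @ X, X.T)

def log_m_v(v, X0, Xr=None):
    """log m(v | M) from Eq. (3-11); with Xr=None it reduces to Eq. (3-13).
    The arbitrary constant c0 is dropped (it cancels in posterior odds)."""
    n, p0 = X0.shape
    H0 = hat(X0)
    out = (-0.5 * (n - p0) * np.log(2 * np.pi)
           - 0.5 * np.linalg.slogdet(X0.T @ X0)[1])
    if Xr is None:
        return out - 0.5 * v @ (np.eye(n) - H0) @ v
    XA = np.hstack([X0, Xr])
    pA = XA.shape[1]
    Hr = hat(XA) - H0                       # H_{r,A} = H_A - H_0
    out += 0.5 * (pA - p0) * np.log(pA / (2 * n + pA))
    return out - 0.5 * v @ (np.eye(n) - H0 - (2 * n / (2 * n + pA)) * Hr) @ v
```

Posterior odds against the base model then follow as `np.exp(log_m_v(v, X0, Xr) - log_m_v(v, X0))`, mirroring (3–5).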
3.3.4 Model Selection Algorithm
Having the parameter intrinsic priors in place and knowing the form of the model posterior probabilities, it is finally possible to develop a strategy to conduct model selection for the occupancy framework.
For each of the two components of the model, occupancy and detection, the algorithm first draws the set of active predictors (i.e., $A_z$ and $A_y$) together with their corresponding parameters. This is a reversible jump step, which uses a Metropolis-Hastings correction with proposal distributions given by
$$q(A^*_z \mid z_o, z_u^{(t)}, v^{(t)}, M_{A_z}) = \frac{1}{2} \left( p(M_{A^*_z} \mid z_o, z_u^{(t)}, v^{(t)}, \mathcal{M}_z,\, M_{A^*_z} \in L(M_{A_z})) + \frac{1}{|L(M_{A_z})|} \right),$$
$$q(A^*_y \mid y, z_o, z_u^{(t)}, w^{(t)}, M_{A_y}) = \frac{1}{2} \left( p(M_{A^*_y} \mid y, z_o, z_u^{(t)}, w^{(t)}, \mathcal{M}_y,\, M_{A^*_y} \in L(M_{A_y})) + \frac{1}{|L(M_{A_y})|} \right), \qquad (3–15)$$
where $L(M_{A_z})$ and $L(M_{A_y})$ denote the sets of models obtained by adding or removing one predictor at a time from $M_{A_z}$ and $M_{A_y}$, respectively.
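The neighborhoods $L(M_A)$ and the mixture proposal in (3–15) can be sketched as follows; the names are ours, and `scores` stands in for the conditional model posteriors, renormalized over the neighborhood.

```python
def neighborhood(A, K, base=frozenset()):
    """L(M_A): models reached by adding or removing one predictor at a time.
    A and base are sets of 1-based covariate indices; base predictors stay."""
    A = set(A)
    out = []
    for k in range(1, K + 1):
        if k in base:
            continue
        out.append(frozenset(A ^ {k}))   # toggle predictor k in or out
    return out

def proposal_probs(A, K, scores):
    """Eq. (3-15)-style proposal: an equal mixture of the renormalized
    posterior scores over L(M_A) and a uniform distribution on L(M_A)."""
    L = neighborhood(A, K)
    s = [scores[m] for m in L]
    tot = sum(s)
    return {m: 0.5 * (si / tot + 1.0 / len(L)) for m, si in zip(L, s)}
```

Mixing in the uniform component keeps every neighbor reachable even when its current posterior score is negligible, which helps the chain escape local modes.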
To promote mixing, this step is followed by an additional draw from the full conditionals of $\alpha$ and $\lambda$. The densities $p(\alpha_0 \mid \cdot)$, $p(\alpha_{r,A} \mid \cdot)$, $p(\lambda_0 \mid \cdot)$, and $p(\lambda_{r,A} \mid \cdot)$ can be sampled from directly with Gibbs steps. Using the notation $a \mid \cdot$ to denote the random variable $a$ conditioned on all other parameters and on the data, these densities are given by

• $\alpha_0 \mid \cdot \sim N\left( (X'_0 X_0)^{-1} X'_0 v,\; (X'_0 X_0)^{-1} \right)$;

• $\alpha_{r,A} \mid \cdot \sim N(\mu_{\alpha_{r,A}}, \Sigma_{\alpha_{r,A}})$, where the covariance matrix and mean vector are given by $\Sigma_{\alpha_{r,A}} = \frac{2N}{2N + p_{A_z}} (X'_{r,A} X_{r,A})^{-1}$ and $\mu_{\alpha_{r,A}} = \Sigma_{\alpha_{r,A}} X'_{r,A} v$;

• $\lambda_0 \mid \cdot \sim N\left( (Q'_0 Q_0)^{-1} Q'_0 w,\; (Q'_0 Q_0)^{-1} \right)$; and

• $\lambda_{r,A} \mid \cdot \sim N(\mu_{\lambda_{r,A}}, \Sigma_{\lambda_{r,A}})$, analogously, with covariance matrix and mean given by $\Sigma_{\lambda_{r,A}} = \frac{2J}{2J + p_{A_y}} (Q'_{r,A} Q_{r,A})^{-1}$ and $\mu_{\lambda_{r,A}} = \Sigma_{\lambda_{r,A}} Q'_{r,A} w$.
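The two occupancy-side Gibbs draws can be sketched as below (our function names); the detection-side draws are identical with $Q$, $w$, and $J$ in place of $X$, $v$, and $N$.

```python
import numpy as np

def draw_alpha0(v, X0, rng):
    """alpha_0 | . ~ N((X0'X0)^{-1} X0' v, (X0'X0)^{-1})."""
    P = X0.T @ X0
    mean = np.linalg.solve(P, X0.T @ v)
    return rng.multivariate_normal(mean, np.linalg.inv(P))

def draw_alpha_r(v, Xr, pA, rng):
    """alpha_{r,A} | . ~ N(mu, Sigma) with
    Sigma = (2N/(2N + p_A)) (Xr'Xr)^{-1} and mu = Sigma Xr' v."""
    N = Xr.shape[0]
    Sigma = (2 * N / (2 * N + pA)) * np.linalg.inv(Xr.T @ Xr)
    mu = Sigma @ (Xr.T @ v)
    return rng.multivariate_normal(mu, Sigma)

rng = np.random.default_rng(3)
X0 = np.ones((50, 1))
Xr = rng.normal(size=(50, 2))
v = rng.normal(size=50)
print(draw_alpha0(v, X0, rng), draw_alpha_r(v, Xr, 3, rng))
```

The shrinkage factor $2N/(2N + p_{A_z})$ comes directly from combining the likelihood with the intrinsic prior of (3–8).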
Finally, Gibbs sampling steps are also available for the unobserved occupancy indicators $z_u$ and for the corresponding latent variables $v$ and $w$. The full conditional posterior densities for $z_u^{(t+1)}$, $v^{(t+1)}$, and $w^{(t+1)}$ are those introduced in Chapter 2 for the single-season probit model.
The following steps summarize the stochastic search algorithm:

1. Initialize $A_y^{(0)}, A_z^{(0)}, z_u^{(0)}, v^{(0)}, w^{(0)}, \alpha_0^{(0)}, \lambda_0^{(0)}$.

2. Sample the model indices and corresponding parameters:

(a) Draw simultaneously:
• $A^*_z \sim q(A_z \mid z_o, z_u^{(t)}, v^{(t)}, M_{A_z})$,
• $\alpha^*_0 \sim p(\alpha_0 \mid M_{A^*_z}, z_o, z_u^{(t)}, v^{(t)})$, and
• $\alpha^*_{r,A^*} \sim p(\alpha_{r,A} \mid M_{A^*_z}, z_o, z_u^{(t)}, v^{(t)})$.

(b) Accept $(M_{A_z}^{(t+1)}, \alpha_0^{(t+1,1)}, \alpha_{r,A}^{(t+1,1)}) = (M_{A^*_z}, \alpha^*_0, \alpha^*_{r,A^*})$ with probability
$$\delta_z = \min\left( 1,\; \frac{p(M_{A^*_z} \mid z_o, z_u^{(t)}, v^{(t)})}{p(M_{A_z^{(t)}} \mid z_o, z_u^{(t)}, v^{(t)})}\, \frac{q(A_z^{(t)} \mid z_o, z_u^{(t)}, v^{(t)}, M_{A^*_z})}{q(A^*_z \mid z_o, z_u^{(t)}, v^{(t)}, M_{A_z})} \right);$$
otherwise, let $(M_{A_z}^{(t+1)}, \alpha_0^{(t+1,1)}, \alpha_{r,A}^{(t+1,1)}) = (M_{A_z^{(t)}}, \alpha_0^{(t,2)}, \alpha_{r,A}^{(t,2)})$.
(c) Sample simultaneously:
• $A^*_y \sim q(A_y \mid y, z_o, z_u^{(t)}, w^{(t)}, M_{A_y})$,
• $\lambda^*_0 \sim p(\lambda_0 \mid M_{A^*_y}, y, z_o, z_u^{(t)}, w^{(t)})$, and
• $\lambda^*_{r,A^*} \sim p(\lambda_{r,A} \mid M_{A^*_y}, y, z_o, z_u^{(t)}, w^{(t)})$.

(d) Accept $(M_{A_y}^{(t+1)}, \lambda_0^{(t+1,1)}, \lambda_{r,A}^{(t+1,1)}) = (M_{A^*_y}, \lambda^*_0, \lambda^*_{r,A^*})$ with probability
$$\delta_y = \min\left( 1,\; \frac{p(M_{A^*_y} \mid y, z_o, z_u^{(t)}, w^{(t)})}{p(M_{A_y^{(t)}} \mid y, z_o, z_u^{(t)}, w^{(t)})}\, \frac{q(A_y^{(t)} \mid y, z_o, z_u^{(t)}, w^{(t)}, M_{A^*_y})}{q(A^*_y \mid y, z_o, z_u^{(t)}, w^{(t)}, M_{A_y})} \right);$$
otherwise, let $(M_{A_y}^{(t+1)}, \lambda_0^{(t+1,1)}, \lambda_{r,A}^{(t+1,1)}) = (M_{A_y^{(t)}}, \lambda_0^{(t,2)}, \lambda_{r,A}^{(t,2)})$.
3. Sample base model parameters:

(a) Draw $\alpha_0^{(t+1,2)} \sim p(\alpha_0 \mid M_{A_z}^{(t+1)}, z_o, z_u^{(t)}, v^{(t)})$.

(b) Draw $\lambda_0^{(t+1,2)} \sim p(\lambda_0 \mid M_{A_y}^{(t+1)}, y, z_o, z_u^{(t)}, w^{(t)})$.
4. To improve mixing, resample the model coefficients that are not in the base model but are in $M_A$:

(a) Draw $\alpha_{r,A}^{(t+1,2)} \sim p(\alpha_{r,A} \mid M_{A_z}^{(t+1)}, z_o, z_u^{(t)}, v^{(t)})$.

(b) Draw $\lambda_{r,A}^{(t+1,2)} \sim p(\lambda_{r,A} \mid M_{A_y}^{(t+1)}, y, z_o, z_u^{(t)}, w^{(t)})$.
5. Sample latent and missing (unobserved) variables:

(a) Sample $z_u^{(t+1)} \sim p(z_u \mid M_{A_z}^{(t+1)}, y, \alpha_{r,A}^{(t+1,2)}, \alpha_0^{(t+1,2)}, \lambda_{r,A}^{(t+1,2)}, \lambda_0^{(t+1,2)})$.

(b) Sample $v^{(t+1)} \sim p(v \mid M_{A_z}^{(t+1)}, z_o, z_u^{(t+1)}, \alpha_{r,A}^{(t+1,2)}, \alpha_0^{(t+1,2)})$.
(c) Sample $w^{(t+1)} \sim p(w \mid M_{A_y}^{(t+1)}, z_o, z_u^{(t+1)}, \lambda_{r,A}^{(t+1,2)}, \lambda_0^{(t+1,2)})$.
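The Metropolis-Hastings corrections in steps 2(b) and 2(d) share one accept/reject computation; a generic sketch (our own naming), working on the log scale for numerical stability:

```python
import numpy as np

def mh_accept(log_post_new, log_post_old, log_q_old_given_new, log_q_new_given_old, rng):
    """Accept the proposed model with probability
    delta = min(1, posterior ratio x proposal ratio), as in steps 2(b)/2(d)."""
    log_delta = min(0.0, (log_post_new - log_post_old)
                    + (log_q_old_given_new - log_q_new_given_old))
    return bool(np.log(rng.uniform()) < log_delta)

rng = np.random.default_rng(5)
# A proposal with higher posterior and symmetric proposal odds is always accepted.
print(mh_accept(-10.0, -12.0, 0.0, 0.0, rng))   # prints True
```

Keeping everything on the log scale avoids overflow when the marginals (3–11) and (3–12) differ by many orders of magnitude.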
3.4 Alternative Formulation
Because the occupancy process is only partially observed, it is reasonable to consider the posterior odds in terms of the observed responses, that is, the detections $y$ and the presences at sites where at least one detection takes place. Partitioning the vector of presences into observed and unobserved components, $z = (z'_o, z'_u)'$, and integrating out the unobserved component, the model posterior for $M_A$ can be obtained as
$$p(M_A \mid y, z_o) \propto E_{z_u}\left[ m(y, z \mid M_A) \right] \pi(M_A). \qquad (3–16)$$
Data-augmenting the model in terms of latent normal variables à la Albert and Chib, the marginals for any model $M = \{M_y, M_z\} \in \mathcal{M}$ of $z$ and $y$ inside the expectation in equation 3–16 can be expressed in terms of the latent variables:
$$m(y, z \mid M) = \int_{T(z)} \int_{T(y,z)} m(w, v \mid M)\, dw\, dv = \left( \int_{T(z)} m(v \mid M_z)\, dv \right) \left( \int_{T(y,z)} m(w \mid M_y)\, dw \right), \qquad (3–17)$$
where $T(z)$ and $T(y, z)$ denote the corresponding truncation regions for $v$ and $w$, which depend on the values taken by $z$ and $y$, and
$$m(v \mid M_z) = \int f(v \mid \alpha, M_z)\, \pi(\alpha \mid M_z)\, d\alpha, \qquad (3–18)$$
$$m(w \mid M_y) = \int f(w \mid \lambda, M_y)\, \pi(\lambda \mid M_y)\, d\lambda. \qquad (3–19)$$
The last equality in Equation 3-17 is a consequence of the independence of the latent processes v and w. Using expressions 3-18 and 3-19 allows one to embed this model selection problem in the classical normal linear regression setting, where many "objective" Bayesian inferential tools are available. In particular, these expressions facilitate deriving the parameter intrinsic priors (Berger & Pericchi, 1996; Moreno et al., 1998) for this problem. This approach extends the one implemented in Leon-Novelo et al. (2012) for the simple probit regression problem.
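As a sanity check on the factorization in Equation 3-17, a toy Monte Carlo with independent standard-normal stand-ins for v and w and orthant truncation regions (all distributional choices here are illustrative, not the model's actual latent distributions):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
v = rng.standard_normal(n)  # latent presence process (toy marginal)
w = rng.standard_normal(n)  # latent detection process (toy marginal)

# toy truncation regions: T(z) = {v > 0}, T(y, z) = {w > 0}
joint = np.mean((v > 0) & (w > 0))           # double integral over T(z) x T(y, z)
factored = np.mean(v > 0) * np.mean(w > 0)   # product of the two marginal integrals
```

Because v and w are independent, the joint truncated-region probability matches the product of the marginal ones up to Monte Carlo error, which is exactly what licenses the factorized form of Equation 3-17.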
Using this alternative approach, all that is left is to integrate m(v | M_{A_z}) and m(w | M_{A_y}) over their corresponding truncation regions T(z) and T(y, z), which yields m(y, z | M_A), and then to take the expectation with respect to the unobserved z's. Note, however, that two issues arise. First, such integrals are not available in closed form. Second, calculating the expectation over the limits of integration further complicates things. To address these difficulties, it is possible to express E_{z_u}[m(y, z | M_A)] as

E_{z_u}[m(y, z | M_A)]
= E_{z_u}[ ( ∫_{T(z)} m(v | M_{A_z}) dv ) ( ∫_{T(y,z)} m(w | M_{A_y}) dw ) ]   (3-20)
= E_{z_u}[ ( ∫_{T(z)} ∫ m(v | M_{A_z}, α_0) π^{IP}(α_0 | M_{A_z}) dα_0 dv ) × ( ∫_{T(y,z)} ∫ m(w | M_{A_y}, λ_0) π^{IP}(λ_0 | M_{A_y}) dλ_0 dw ) ]
= E_{z_u}[ ∫ ( ∫_{T(z)} m(v | M_{A_z}, α_0) dv ) π^{IP}(α_0 | M_{A_z}) dα_0 × ∫ ( ∫_{T(y,z)} m(w | M_{A_y}, λ_0) dw ) π^{IP}(λ_0 | M_{A_y}) dλ_0 ]
= E_{z_u}[ ∫ g_1(T(z) | M_{A_z}, α_0) π^{IP}(α_0 | M_{A_z}) dα_0 × ∫ g_2(T(y, z) | M_{A_y}, λ_0) π^{IP}(λ_0 | M_{A_y}) dλ_0 ]
= c_0 d_0 ∫∫ E_{z_u}[ g_1(T(z) | M_{A_z}, α_0) g_2(T(y, z) | M_{A_y}, λ_0) ] dα_0 dλ_0,

with g_1(T(z) | M_{A_z}, α_0) = ∫_{T(z)} m(v | M_{A_z}, α_0) dv and g_2(T(y, z) | M_{A_y}, λ_0) = ∫_{T(y,z)} m(w | M_{A_y}, λ_0) dw,
where the last equality follows from Fubini's theorem, since m(v | M_{A_z}, α_0) and m(w | M_{A_y}, λ_0) are proper densities. From 3-20, the posterior odds are

p(M_A | y, z_o) / p(M_0 | y, z_o)
= [ ∫∫ E_{z_u}[ g_1(T(z) | M_{A_z}, α_0) g_2(T(y, z) | M_{A_y}, λ_0) ] dα_0 dλ_0
  / ∫∫ E_{z_u}[ g_1(T(z) | M_{0_z}, α_0) g_2(T(y, z) | M_{0_y}, λ_0) ] dα_0 dλ_0 ] × π(M_A) / π(M_0).   (3-21)
3.5 Simulation Experiments

The proposed methodology was tested under 36 different scenarios, in which we evaluate the behavior of the algorithm by varying the number of sites, the number of surveys, the amount of signal in the predictors for the presence component, and the amount of signal in the predictors for the detection component.

For each model component, the base model is taken to be the intercept-only model, and the full models considered for the presence and the detection components have, respectively, 30 and 20 predictors. Therefore, the model space contains 2^30 × 2^20 ≈ 1.12 × 10^15 candidate models.

To control the amount of signal in the presence and detection components, values for the model parameters were purposefully chosen so that quantiles 10, 50, and 90 of the occupancy and detection probabilities match pre-specified probabilities. Because presence and detection are binary variables, the amount of signal in each model component is associated with the spread and center of the distribution of the occupancy and detection probabilities, respectively. Low signal levels correspond to occupancy or detection probabilities close to 0.5; high signal levels correspond to probabilities close to 0 or 1. Large spreads of the distributions of the occupancy and detection probabilities reflect greater heterogeneity among the observations collected, improving the discrimination capability of the model, and vice versa.
Therefore, for the presence component, the parameter values of the true model were chosen to set the median of the occupancy probabilities equal to 0.5. The chosen parameter values also fix quantiles 10 and 90 symmetrically about 0.5, at small (Q^z_10 = 0.3, Q^z_90 = 0.7), intermediate (Q^z_10 = 0.2, Q^z_90 = 0.8), and large (Q^z_10 = 0.1, Q^z_90 = 0.9) distances. For the detection component, the model parameters are obtained to reflect detection probabilities concentrated about low values (Q^y_50 = 0.2), intermediate values (Q^y_50 = 0.5), and high values (Q^y_50 = 0.8), while keeping quantiles 10 and 90 fixed at 0.1 and 0.9, respectively.
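One way to realize such quantile targets under a probit link is to center and scale the linear predictor so that η ~ N(0, σ²) across sites; the q-th quantile of the occupancy probabilities Φ(η) is then Φ(σ Φ⁻¹(q)), so σ solves in closed form. This is a sketch of the calibration idea only, not necessarily the exact construction used in the simulations:

```python
from statistics import NormalDist

std = NormalDist()  # standard normal: probit link Phi and its inverse

def sigma_for_quantile(target_prob, q=0.10):
    """Spread sigma of a centered probit linear predictor eta ~ N(0, sigma^2)
    such that the q-th quantile of the occupancy probabilities Phi(eta) equals
    target_prob (the median is automatically 0.5 since eta is centered)."""
    return std.inv_cdf(target_prob) / std.inv_cdf(q)

sigma = sigma_for_quantile(0.3)  # low-signal scenario: (Q10, Q50, Q90) = (0.3, 0.5, 0.7)
q10 = std.cdf(sigma * std.inv_cdf(0.10))
q50 = std.cdf(sigma * std.inv_cdf(0.50))
q90 = std.cdf(sigma * std.inv_cdf(0.90))
```

By symmetry of the normal, hitting Q10 = 0.3 with a centered predictor automatically gives Q90 = 0.7, matching the small-distance scenario above.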
Table 3-1. Simulation control parameters, occupancy model selector

Parameter                       Values considered
N                               50, 100
J                               3, 5
(Q^z_10, Q^z_50, Q^z_90)        (0.3, 0.5, 0.7), (0.2, 0.5, 0.8), (0.1, 0.5, 0.9)
(Q^y_10, Q^y_50, Q^y_90)        (0.1, 0.2, 0.9), (0.1, 0.5, 0.9), (0.1, 0.8, 0.9)
In total there are 36 scenarios, resulting from crossing all levels of the simulation control parameters (Table 3-1). Under each scenario, 20 data sets were generated at random. True presence and detection indicators were generated with the probit model formulation from Chapter 2, using the assumed true models M_{T_z} = {1, x2, x15, x16, x22, x28} for the presence and M_{T_y} = {1, q7, q10, q12, q17} for the detection, with the predictors included in the randomly generated data sets. In this context, 1 represents the intercept term. Throughout this section, we refer to predictors included in the true models as true predictors, and to those absent from them as false predictors.

The selection procedure was conducted on each of these data sets with two different priors on the model space: the uniform (equal probability) prior and a multiplicity-correcting prior.
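The 36-scenario grid follows directly from crossing the control parameters in Table 3-1 (the variable names below are ours):

```python
from itertools import product

N_SITES = (50, 100)
J_SURVEYS = (3, 5)
OCC_QUANTILES = ((0.3, 0.5, 0.7), (0.2, 0.5, 0.8), (0.1, 0.5, 0.9))
DET_QUANTILES = ((0.1, 0.2, 0.9), (0.1, 0.5, 0.9), (0.1, 0.8, 0.9))

# cross all control-parameter levels: 2 x 2 x 3 x 3 = 36 scenarios,
# each paired with 20 randomly generated data sets
scenarios = list(product(N_SITES, J_SURVEYS, OCC_QUANTILES, DET_QUANTILES))
```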
The results are summarized through the marginal posterior inclusion probabilities (MPIPs) of the predictors, and also through the five highest posterior probability models (HPMs). The MPIP for a given predictor, under a specific scenario and for a particular data set, is defined as

p(predictor is included | y, z, w, v) = Σ_{M ∈ ℳ} I(predictor ∈ M) p(M | y, z, w, v).   (3-22)
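Given posterior model probabilities, Equation 3-22 is a one-line computation; the model labels and probabilities below are invented for illustration:

```python
def mpip(post, predictor):
    """Marginal posterior inclusion probability (Eq. 3-22): total posterior
    probability of the models that contain the predictor."""
    return sum(p for model, p in post.items() if predictor in model)

# toy posterior over four models (illustrative probabilities, summing to 1)
post = {frozenset({"x2"}): 0.4,
        frozenset({"x2", "x15"}): 0.3,
        frozenset({"x15"}): 0.2,
        frozenset(): 0.1}
```

For example, `mpip(post, "x2")` sums the first two model probabilities, since those are the models containing x2.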
In addition, we compare the MPIP odds between predictors present in the true model and predictors absent from it. Specifically, we consider the minimum odds of marginal posterior inclusion probabilities. Let ξ̃ and ξ denote, respectively, a predictor in the true model M_T and a predictor absent from M_T. We define the minimum MPIP odds between the probabilities of true and false predictors as

minOdds_MPIP = min_{ξ̃ ∈ M_T} p(I_{ξ̃} = 1 | ξ̃ ∈ M_T) / max_{ξ ∉ M_T} p(I_ξ = 1 | ξ ∉ M_T).   (3-23)
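Equation 3-23 in code, with invented MPIP values for illustration:

```python
def min_odds_mpip(mpips, true_predictors):
    """Eq. 3-23: minimum MPIP over true predictors divided by the maximum MPIP
    over false predictors; values well above 1 indicate clean separation."""
    true_min = min(mpips[name] for name in true_predictors)
    false_max = max(p for name, p in mpips.items() if name not in true_predictors)
    return true_min / false_max

# illustrative MPIPs: x2 and x15 are true predictors; x7 and x9 are false
mpips = {"x2": 0.92, "x15": 0.88, "x7": 0.11, "x9": 0.22}
odds = min_odds_mpip(mpips, {"x2", "x15"})  # 0.88 / 0.22
```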
If the variable selection procedure adequately discriminates true from false predictors, minOdds_MPIP takes values larger than one. The ability of the method to discriminate between the least probable true predictor and the most probable false predictor worsens as this indicator approaches 0.

3.5.1 Marginal Posterior Inclusion Probabilities for Model Predictors
For clarity, in Figures 3-1 through 3-5 only predictors in the true models are labeled, and they are emphasized with a dotted line passing through them. The left-hand-side plots in these figures contain the results for the presence component, and those on the right correspond to predictors in the detection component. The results obtained with the uniform model prior correspond to the black lines, and those for the multiplicity-correcting prior are in red. In these figures, the MPIPs have been averaged over all data sets from scenarios matching the condition indicated.
In Figure 3-1 we contrast the mean MPIPs of the predictors over all data sets from scenarios with 50 sites against the mean MPIPs obtained for scenarios with 100 sites. Similarly, Figure 3-2 compares the mean MPIPs of scenarios with 3 surveys per site to those of scenarios with 5 surveys per site. Figures 3-4 and 3-5 show the effect of the different levels of signal considered in the occupancy probabilities and in the detection probabilities, respectively.

Three main results can be drawn from these figures: (1) the effect of the model prior is substantial; (2) the proposed methods yield MPIPs that clearly separate true predictors from false predictors; and (3) the separation between MPIPs of true and false predictors is noticeably larger in the detection component.
Regardless of the simulation scenario and model component, under the uniform prior false predictors obtain relatively high MPIPs. Conversely, the multiplicity-correcting prior strongly shrinks the MPIPs of false predictors toward 0. In the presence component, the MPIPs of the true predictors are also shrunk substantially under the multiplicity prior; however, a clear separation between true and false predictors remains. In contrast, in the detection component the MPIPs of true predictors remain relatively high (Figures 3-1 through 3-5).
[Figure: marginal inclusion probabilities (0 to 1) for presence predictors x2, x15, x22, x28 and detection predictors q7, q10, q17; legend: Unif N=50, MC N=50, Unif N=100, MC N=100.]

Figure 3-1. Predictor MPIP averaged over scenarios with N=50 and N=100 sites, using uniform (U) and multiplicity correction (MC) priors.
[Figure: marginal inclusion probabilities (0 to 1) for presence predictors x2, x15, x22, x28 and detection predictors q7, q10, q17; legend: Unif J=3, MC J=3, Unif J=5, MC J=5.]

Figure 3-2. Predictor MPIP averaged over scenarios with J=3 and J=5 surveys per site, using uniform (U) and multiplicity correction (MC) priors.
[Figure: marginal inclusion probabilities (0 to 1) for presence predictors x2, x15, x22, x28 and detection predictors q7, q10, q17; legend crosses Unif/MC with N=50/N=100 and J=3/J=5.]

Figure 3-3. Predictor MPIP averaged over scenarios with the interaction between the number of sites and the surveys per site, using uniform (U) and multiplicity correction (MC) priors.
[Figure: marginal inclusion probabilities (0 to 1) for presence predictors x2, x15, x22, x28 and detection predictors q7, q10, q17; legend crosses U/MC with occupancy quantile settings (0.3, 0.5, 0.7), (0.2, 0.5, 0.8), and (0.1, 0.5, 0.9).]

Figure 3-4. Predictor MPIP averaged over scenarios with equal signal in the occupancy probabilities, using uniform (U) and multiplicity correction (MC) priors.
[Figure: marginal inclusion probabilities (0 to 1) for presence predictors x2, x15, x22, x28 and detection predictors q7, q10, q17; legend crosses U/MC with detection quantile settings (0.1, 0.2, 0.9), (0.1, 0.5, 0.9), and (0.1, 0.8, 0.9).]

Figure 3-5. Predictor MPIP averaged over scenarios with equal signal in the detection probabilities, using uniform (U) and multiplicity correction (MC) priors.
In scenarios where more sites were surveyed, the separation between the MPIPs of true and false predictors grew in both model components (Figure 3-1). Increasing the number of sites affects both components, given that every time a new site is included, covariate information is added to the design matrices of both the presence and the detection components.

On the other hand, increasing the number of surveys affects the MPIPs of predictors in the detection component (Figures 3-2 and 3-3) but has only a marginal effect on predictors of the presence component. This may appear counterintuitive; however, increasing the number of surveys only increases the number of observations in the design matrix for the detection, while leaving the design matrix for the presence unaltered. The small changes observed in the MPIPs of the presence predictors as J increases are exclusively a result of having additional detection indicators equal to 1 at sites that, with fewer surveys, would have had only zero-valued detections.
From Figure 3-3 it is clear that, for the presence component, the effect of the number of sites dominates the behavior of the MPIP, especially when using the multiplicity correction priors. In the detection component, the MPIP is influenced by both the number of sites and the number of surveys. The influence of increasing the number of surveys is larger when the number of sites is smaller, and vice versa.
Regarding the effect of the distribution of the occupancy probabilities, we observe that mostly the detection component is affected. There is stronger discrimination between true and false predictors as the distribution becomes more variable (Figure 3-4). This is consistent with intuition: having the presence probabilities concentrated about 0.5 implies that the predictors do not vary much from one site to the next, whereas having the occupancy probabilities more spread out has the opposite effect.

Finally, consider the effect of concentrating the detection probabilities about high or low values. For predictors in the detection component, the separation between MPIPs of true and false predictors is larger in scenarios where the distribution of the detection probability is centered about 0.2 or 0.8 than in scenarios where this distribution is centered about 0.5 (where the signal of the predictors is weakest). For predictors in the presence component, having the detection probabilities centered at higher values slightly increases the inclusion probabilities of the true predictors (Figure 3-5) and reduces those of false predictors.
Table 3-2 Comparison of average minOddsMPIP under scenarios having differentnumber of sites (N=50 N=100) and under scenarios having different numberof surveys per site (J=3 J=5) for the presence and detection componentsusing uniform and multiplicity correction priors
Sites SurveysComp π(M) N=50 N=100 J=3 J=5
Presence Unif 112 131 119 124MC 320 846 420 674
Detection Unif 203 264 211 257MC 2115 3246 2139 3252
Table 3-3 Comparison of average minOddsMPIP for different levels of signal consideredin the occupancy and detection probabilities for the presence and detectioncomponents using uniform and multiplicity correction priors
(Qz10Q
z50Q
z90) (Qy
10Qy50Q
y90)
Comp π(M) (030507) (020508) (010509) (010209) (010509) (010809)
Presence Unif 105 120 134 110 123 124MC 202 455 805 238 619 640
Detection Unif 234 234 230 257 200 238MC 2537 2077 2528 2933 1852 2849
The separation between the MPIPs of true and false predictors is even more evident in Tables 3-2 and 3-3, which show the minimum MPIP odds between true and false predictors. Under every scenario, the value of minOdds_MPIP (as defined in 3-23) was greater than 1, implying that, on average, even the lowest MPIP for a true predictor is higher than the maximum MPIP for a false predictor. In both components of the model, the minOdds_MPIP are markedly larger under the multiplicity correction prior, and they increase with the number of sites and with the number of surveys.

For the presence component, increasing the signal in the occupancy probabilities, or having the detection probabilities concentrate about higher values, has a considerable positive effect on the magnitude of the odds. For the detection component, these odds are particularly high, especially under the multiplicity correction prior. Also, having the distribution of the detection probabilities centered about low or high values increases the minOdds_MPIP.

3.5.2 Summary Statistics for the Highest Posterior Probability Model
Tables 3-4 through 3-7 show the number of true predictors included in the HPM (True +) and the number of false predictors excluded from it (True −). The mean percentages in these tables convey one clear message: the highest probability models chosen with either model prior commonly differ from the corresponding true models. The strong shrinkage induced by the multiplicity correction prior allows only a few true predictors to be selected, but at the same time it prevents any false predictors from entering the HPM. The uniform prior, on the other hand, includes in the HPM a larger proportion of true predictors, but at the expense of also introducing a large number of false predictors. This situation is exacerbated in the presence component, and also occurs, to a lesser extent, in the detection component.
Table 3-4 Comparison between scenarios with 50 and 100 sites in terms of the averagepercentage of true positive and true negative terms over the highestprobability models for the presence and the detection components usinguniform and multiplicity correcting priors on the model space
True + True minusComp π(M) N=50 N=100 N=50 N=100
Presence Unif 057 063 051 055MC 006 013 100 100
Detection Unif 077 085 087 093MC 049 070 100 100
Having more sites or surveys improves the inclusion of true predictors in, and the exclusion of false ones from, the HPM for both the presence and detection components (Tables 3-4 and 3-5). On the other hand, if the distribution of the occupancy probabilities is more
Table 3-5 Comparison between scenarios with 3 and 5 surveys per site in terms of thepercentage of true positive and true negative predictors averaged over thehighest probability models for the presence and the detection componentsusing uniform and multiplicity correcting priors on the model space
True + True minusComp π(M) J=3 J=5 J=3 J=5
Presence Unif 059 061 052 054MC 008 010 100 100
Detection Unif 078 085 087 092MC 050 068 100 100
spread out, the HPM includes more true predictors and fewer false ones in the presence component. In contrast, the effect of the spread of the occupancy probabilities on the detection HPM is negligible (Table 3-6). Finally, there is a positive relationship between the location of the median of the detection probabilities and the number of correctly classified true and false predictors for the presence. The HPM in the detection part of the model responds positively to low and high values of the median detection probability (increased signal levels) in terms of correctly classified true and false predictors (Table 3-7).
Table 3-6 Comparison between scenarios with different level of signal in the occupancycomponent in terms of the percentage of true positive and true negativepredictors averaged over the highest probability models for the presence andthe detection components using uniform and multiplicity correcting priors onthe model space
True + True minusComp π(M) (030507) (020508) (010509) (030507) (020508) (010509)
Presence Unif 055 061 064 050 054 055MC 002 008 018 100 100 100
Detection Unif 081 082 081 090 089 089MC 057 061 059 100 100 100
3.6 Case Study: Blue Hawker Data Analysis

During 1999 and 2000, an intensive volunteer surveying effort coordinated by the Centre Suisse de Cartographie de la Faune (CSCF) was conducted to analyze the distribution of the blue hawker, Aeshna cyanea (Odonata: Aeshnidae), a common dragonfly in Switzerland. Given that Switzerland is a small and mountainous country,
Table 3-7 Comparison between scenarios with different level of signal in the detectioncomponent in terms of the percentage of true positive and true negativepredictors averaged over the highest probability models for the presence andthe detection components using uniform and multiplicity correcting priors onthe model space
True + True minusComp π(M) (010209) (010509) (010809) (010209) (010509) (010809)
Presence Unif 059 059 062 051 054 054MC 006 010 011 100 100 100
Detection Unif 089 077 078 091 087 091MC 070 048 059 100 100 100
there is large variation in its topography and physiogeography; as such, elevation is a good candidate covariate to predict species occurrence at a large spatial scale. It can be used as a proxy for habitat type, intensity of land use, temperature, and some biotic factors (Kery et al., 2010).

Repeated visits to 1-ha pixels took place to obtain the corresponding detection history. In addition to the survey outcome, the x- and y-coordinates, thermal level, date of the survey, and elevation were recorded. Surveys were restricted to the known flight period of the blue hawker, which takes place between May 1 and October 10. In total, 2,572 sites were surveyed at least once during the surveying period. The number of surveys per site ranges from 1 to 22 within each survey year.
Kery et al. (2010) summarize the results of this effort using AIC-based model comparisons: first, following a backwards elimination approach for the detection process while keeping the occupancy component fixed at the most complex model; and then, for the presence component, choosing among a group of three models while using the detection model already selected. In our analysis of this data set, for the detection and the presence we consider as full models those used in Kery et al. (2010), namely

Φ^{-1}(ψ) = α_0 + α_1 year + α_2 elev + α_3 elev^2 + α_4 elev^3,
Φ^{-1}(p) = λ_0 + λ_1 year + λ_2 elev + λ_3 elev^2 + λ_4 elev^3 + λ_5 date + λ_6 date^2,

where year = I_{year=2000}.
The model spaces for these data contain 2^6 = 64 and 2^4 = 16 models, respectively, for the detection and occupancy components; in total, the model space contains 2^{4+6} = 1,024 models. Although this model space can be enumerated entirely, for illustration we implemented the algorithm from Section 3.3.4, generating 10,000 draws from the Gibbs sampler. Each model sampled was chosen from the set of models that could be reached by changing the state of a single term in the current model (to inclusion or exclusion, accordingly). This allows a more thorough exploration of the model space, because for each of the 10,000 models drawn, the posterior probabilities of many more models can be observed. Below, the labels for the predictors are followed
by either "z" or "y", accordingly, to represent the component they pertain to. Finally, using the results from the model selection procedure, we conducted a validation step to determine the predictive accuracy of the HPMs and of the median probability models (MPMs). The performance of these models is then contrasted with that of the model ultimately selected by Kery et al. (2010).

3.6.1 Results: Variable Selection Procedure
The model finally chosen for the presence component in Kery et al. (2010) was not found among the five highest probability models under either model prior (Table 3-8). Moreover, the year indicator was never chosen under the multiplicity correcting prior, hinting that this term might correspond to a falsely identified predictor under the uniform prior. Results in Table 3-10 support this claim: the marginal posterior inclusion probability for the year predictor is 7% under the multiplicity correction prior. The multiplicity correction prior also concentrates the model posterior probability mass more densely in the highest ranked models (90% of the mass is in the top five models) than the uniform prior (under which the top five account for 40% of the mass).

For the detection component, the HPM under both priors is the intercept-only model, which we represent in Table 3-9 with a blank label. In both cases this model obtains very
Table 3-8 Posterior probability for the five highest probability models in the presencecomponent of the blue hawker data
Uniform model priorRank Mz selected p(Mz |y)
1 yrz+elevz 0102 yrz+elevz+elevz3 0083 elevz2+elevz3 0084 yrz+elevz2 0075 yrz+elevz3 007
Multiplicity correcting model priorRank Mz selected p(Mz |y)
1 elevz+elevz3 0532 0153 elevz+elevz2 0094 elevz2 0065 elevz+elevz2+elevz3 005
high posterior probability. The terms contained in the cubic polynomial for elevation appear to carry some relevant information; however, this conflicts with the MPIPs in Table 3-11, which under both model priors are relatively low (< 20% with the uniform prior and ≤ 4% with the multiplicity correcting prior).
Table 3-9. Posterior probability for the five highest probability models in the detection component of the blue hawker data

Uniform model prior:
Rank   M_y selected       p(M_y|y)
1      (intercept only)   0.45
2      elevy3             0.06
3      elevy2             0.05
4      elevy              0.05
5      yry                0.04

Multiplicity correcting model prior:
Rank   M_y selected       p(M_y|y)
1      (intercept only)   0.86
2      elevy3             0.02
3      datey2             0.02
4      elevy2             0.02
5      yry                0.02
Finally, it is possible to use the MPIPs to obtain the median probability model, which contains the terms having an MPIP higher than 50%. For the occupancy process (Table 3-10), under the uniform prior the model with year, elevation, and elevation cubed is selected. The MPM under the multiplicity correction prior coincides with the HPM for that prior. The MPM chosen for the detection component (Table 3-11) under both priors is the intercept-only model, again coinciding with the HPM.
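Extracting the median probability model from the MPIPs is mechanical; here with the uniform-prior presence-component values of Table 3-10, taking the threshold as inclusive so that the borderline 0.50 of elevz3 is retained, which matches the MPM just described:

```python
def median_probability_model(mpips, threshold=0.5):
    """Median probability model: the set of terms whose MPIP is at least 1/2
    (inclusive, so a borderline 0.50 such as elevz3's is retained)."""
    return {term for term, p in mpips.items() if p >= threshold}

# MPIPs for the presence component under the uniform prior (Table 3-10)
mpip_unif = {"yrz": 0.53, "elevz": 0.51, "elevz2": 0.45, "elevz3": 0.50}
mpm = median_probability_model(mpip_unif)  # {"yrz", "elevz", "elevz3"}
```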
Given the outcomes of the simulation studies from Section 3.5, especially those pertaining to the detection component, the results in Table 3-11 appear to indicate that none of the predictors considered belong to the true model, especially when considering
Table 3-10. MPIP, presence component

Predictor   p(predictor ∈ M_{T_z} | y, z, w, v)
            Unif     MultCorr
yrz         0.53     0.07
elevz       0.51     0.73
elevz2      0.45     0.23
elevz3      0.50     0.67

Table 3-11. MPIP, detection component

Predictor   p(predictor ∈ M_{T_y} | y, z, w, v)
            Unif     MultCorr
yry         0.19     0.03
elevy       0.18     0.03
elevy2      0.18     0.03
elevy3      0.19     0.04
datey       0.16     0.03
datey2      0.15     0.04
those derived with the multiplicity correction prior. On the other hand, for the presence component (Table 3-10), there is an indication that terms related to the cubic polynomial in elevz can explain the occupancy patterns.

3.6.2 Validation for the Selection Procedure
Approximately half of the sites were selected at random for training (i.e., for model selection and parameter estimation), and the remaining half were used as test data. In the previous section we observed that, using the marginal posterior inclusion probabilities of the predictors, our method effectively separates predictors in the true model from those that are not in it. However, in Tables 3-10 and 3-11 this separation is only clear for the presence component using the multiplicity correction prior.

Therefore, in the validation procedure we observe the misclassification rates for the detections using the following models: (1) the model ultimately recommended in Kery et al. (2010) (yrz+elevz+elevz2+elevz3 + elevy+elevy2+datey+datey2); (2) the highest probability model (HPM) with a uniform prior (yrz+elevz); (3) the HPM with a multiplicity correcting prior (elevz+elevz3); (4) the median probability model (MPM), that is, the model including only predictors with an MPIP larger than 50%, with the uniform prior (yrz+elevz+elevz3); and finally (5) the MPM with a multiplicity correction prior (elevz+elevz3, the same as the HPM with multiplicity correction).
We must emphasize that the models resulting from the implementation of our model selection procedure used exclusively the training data set. On the other hand, the model in Kery et al. (2010) was chosen to minimize the prediction error of the complete data. Because this model was obtained from the full data set, results derived from it can only be considered a lower bound for the prediction errors.
The benchmark misclassification error rate for true 1s is high (close to 70%). However, the misclassification rate for true 0s, which account for most of the responses, is less pronounced (15%). Overall, the performance of the selected models is comparable: they yield considerably worse results than the benchmark for the true 1s, but achieve rates close to the benchmark for the true 0s. Pooling together the results for true ones and true zeros, the selected models with either prior have misclassification rates close to 30%, while the benchmark model performs comparably, with a joint misclassification error of 23% (Table 3-12).
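The error rates reported in Table 3-12 are plain misclassification proportions; a small sketch with invented detection outcomes:

```python
def misclass_rates(y_true, y_pred):
    """Table 3-12-style error rates: on the true 1s, on the true 0s, and joint."""
    pairs = list(zip(y_true, y_pred))
    err_1 = sum(p != t for t, p in pairs if t == 1) / sum(t == 1 for t in y_true)
    err_0 = sum(p != t for t, p in pairs if t == 0) / sum(t == 0 for t in y_true)
    joint = sum(p != t for t, p in pairs) / len(pairs)
    return err_1, err_0, joint

# tiny illustrative example: two true 1s (one missed), three true 0s (one missed)
e1, e0, ej = misclass_rates([1, 1, 0, 0, 0], [0, 1, 0, 0, 1])
```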
Table 3-12. Mean misclassification rate for HPMs and MPMs using uniform and multiplicity correction model priors

Model                                                            True 1   True 0   Joint
Benchmark (Kery et al., 2010):
  yrz+elevz+elevz2+elevz3+elevy+elevy2+datey+datey2              0.66     0.15     0.23
HPM Unif: yrz+elevz                                              0.83     0.17     0.28
HPM/MPM MC: elevz+elevz3                                         0.82     0.18     0.28
MPM Unif: yrz+elevz+elevz3                                       0.82     0.18     0.29
3.7 Discussion

In this chapter we proposed an objective and fully automatic Bayesian methodology for the single-season site-occupancy model. The methodology is said to be fully automatic because no hyperparameter specification is necessary in defining the parameter priors, and objective because it relies on intrinsic priors derived from noninformative priors. Intrinsic priors have been shown to have desirable properties as testing priors. We also propose a fast stochastic search algorithm to explore large model spaces using our model selection procedure.

Our simulation experiments demonstrated the ability of the method to single out the predictors present in the true model when considering their marginal posterior inclusion probabilities. For predictors in the true model, these probabilities were comparatively larger than for predictors absent from it. The simulations also indicated that the method has greater discrimination capability for predictors in the detection component of the model, especially when using multiplicity correction priors.
Multiplicity correction priors were not described in this chapter; however, their influence on the selection outcome is significant. This behavior was observed both in the simulation experiments and in the analysis of the blue hawker data. Model priors play an essential role: as the number of predictors grows, they are instrumental in controlling the selection of false positive predictors. Additionally, model priors can be used to account for predictor structure in the selection process, which helps both to reduce the size of the model space and to make the selection more robust. These issues are the topic of the next chapter.

Accounting for the polynomial hierarchy in the predictors within the occupancy context is a straightforward extension of the procedures we describe in Chapter 4; hence, our next step is to develop efficient software for it. An additional direction we plan to pursue is developing methods for occupancy variable selection in a multivariate setting, which can be used to conduct hypothesis testing in scenarios with conditions varying through time, or where multiple species are co-observed. A final variation we will investigate for this problem is occupancy model selection incorporating random effects.
CHAPTER 4
PRIORS ON THE MODEL SPACE AND WELL-FORMULATED MODELS

It has long been an axiom of mine that the little things are infinitely the most important.

–Sherlock Holmes, A Case of Identity
4.1 Introduction

In regression problems, if a large number of potential predictors is available, the complete model space is too large to enumerate, and automatic selection algorithms are necessary to find informative, parsimonious models. This multiple testing problem is difficult, and even more so when interactions or powers of the predictors are considered. In the ecological literature, models with interactions and/or higher order polynomial terms are ubiquitous (Johnson et al., 2013; Kery et al., 2010; Zeller et al., 2011), given the complexity and non-linearities found in ecological processes. Several model selection procedures, even in the classical normal linear setting, fail to address two fundamental issues: (1) the model selection outcome is not invariant to affine transformations when interactions or polynomial structures are present among the predictors, and (2) additional penalization is required to control for false positives as the model space grows (i.e., as more covariates are considered).
These two issues motivate the developments presented throughout this chapter. Building on the results of Chipman (1996), we propose, investigate, and provide recommendations for three different prior distributions on the model space. These priors help control for test multiplicity while accounting for polynomial structure in the predictors. They improve upon those proposed by Chipman, first by avoiding the need to specify values for the prior inclusion probabilities of the predictors, and second by formulating principled alternatives to introduce additional structure in the model priors. Finally, we design a stochastic search algorithm that allows fast and thorough exploration of model spaces with polynomial structure.
Having structure in the predictors can determine the selection outcome. As an illustration, consider the model E[y] = β_{00} + β_{01}x_2 + β_{20}x_1^2, where the order-one term x_1 is not present (this choice of subscripts for the coefficients is defined in the following section). Transforming x_1 → x_1* = x_1 + c for some c ≠ 0, the model becomes E[y] = β_{00} + β_{01}x_2 + β_{20}*x_1*^2. Note that, in terms of the original predictors, x_1*^2 = x_1^2 + 2c·x_1 + c^2, implying that this seemingly innocuous transformation of x_1 modifies the column space of the design matrix by including x_1, which was not in the original model. That is, when lower order terms in the hierarchy are omitted from the model, the column space of the design matrix is not invariant to affine transformations. As the hat matrix depends on the column space, the model's predictive capability is also affected by how the covariates in the model are coded, an undesirable feature for any model selection procedure. To make model selection invariant to affine transformations, the selection must be constrained to the subset of models that respect the hierarchy (Griepentrog et al., 1982; Khuri, 2002; McCullagh & Nelder, 1989; Nelder, 2000; Peixoto, 1987, 1990). These models are known as well-formulated models (WFMs). Succinctly, a model is well-formulated if, for any predictor in the model, every lower order predictor associated with it is also in the model. The model above is not well-formulated, as it contains x_1^2 but not x_1.
WFMs exhibit strong heredity, in that all lower-order terms dividing higher-order terms in the model must also be included. An alternative is to only require weak heredity (Chipman 1996), which only forces some of the lower terms in the corresponding polynomial hierarchy to be in the model. However, Nelder (1998) demonstrated that the conditions under which weak heredity allows the design matrix to be invariant to affine transformations of the predictors are too restrictive to be useful in practice.
Although this topic appeared in the literature more than three decades ago (Nelder 1977), only recently have modern variable selection techniques been adapted to account for the constraints imposed by heredity. As described in Bien et al. (2013), the current literature on variable selection for polynomial response surface models can be classified into three broad groups: multi-step procedures (Brusco et al. 2009; Peixoto 1987), regularized regression methods (Bien et al. 2013; Yuan et al. 2009), and Bayesian approaches (Chipman 1996). The methods introduced in this Chapter take a Bayesian approach towards variable selection for well-formulated models, with particular emphasis on model priors.
As mentioned in previous chapters, the Bayesian variable selection problem consists of finding models with high posterior probabilities within a pre-specified model space ℳ. The model posterior probability for M ∈ ℳ is given by

    p(M | y, ℳ) ∝ m(y | M) π(M | ℳ).   (4–1)
Model posterior probabilities depend on the prior distribution on the model space as well as on the prior distributions for the model-specific parameters, implicitly through the marginals m(y|M). Priors on the model-specific parameters have been extensively discussed in the literature (Berger & Pericchi 1996; Berger et al. 2001; George 2000; Jeffreys 1961; Kass & Wasserman 1996; Liang et al. 2008; Zellner & Siow 1980). In contrast, the effect of the prior on the model space has, until recently, been neglected. A few authors (e.g., Casella et al. (2014); Scott & Berger (2010); Wilson et al. (2010)) have highlighted the relevance of the priors on the model space in the context of multiple testing. Adequately formulating priors on the model space can both account for structure in the predictors and provide additional control on the detection of false positive terms. In addition, using the popular uniform prior over the model space may lead to the undesirable and "informative" implication of favoring models of size p/2 (where p is the total number of covariates), since this is the most abundant model size contained in the model space.
Variable selection within the model space of well-formulated polynomial models poses two challenges for automatic objective model selection procedures. First, the notion of model complexity takes on a new dimension: complexity is not exclusively a function of the number of predictors, but also depends upon the depth and connectedness of the associations defined by the polynomial hierarchy. Second, because the model space is shaped by such relationships, stochastic search algorithms used to explore the models must also conform to these restrictions.
Models without polynomial hierarchy constitute a special case of WFMs where all predictors are of order one. Hence, all the methods developed throughout this Chapter also apply to models with no predictor structure. Additionally, although our proposed methods are presented for the normal linear case to simplify the exposition, these methods are general enough to be embedded in many Bayesian selection and averaging procedures, including, of course, the occupancy framework previously discussed.
In this Chapter, we first provide the necessary definitions to characterize the well-formulated model selection problem. Then we proceed to introduce three new prior structures on the well-formulated model space and characterize their behavior with simple examples and simulations. With the model priors in place, we build a stochastic search algorithm to explore spaces of well-formulated models that relies on intrinsic priors for the model-specific parameters, though this assumption can be relaxed to use other mixtures of g-priors. Finally, we implement our procedures using both simulated and real data.
4.2 Setup for Well-Formulated Models
Suppose that the observations y_i are modeled using the polynomial regression on the covariates x_i1, ..., x_ip given by

    y_i = Σ_{α ∈ ℕ_0^p} β_(α_1,...,α_p) ∏_{j=1}^p x_ij^{α_j} + ε_i,   (4–2)

where α = (α_1, ..., α_p) belongs to ℕ_0^p, the p-dimensional space of natural numbers including 0, with ε_i ~iid N(0, σ^2), and only finitely many β_α are allowed to be non-zero. As an illustration, consider a model space that includes polynomial terms incorporating covariates x_i1 and x_i2 only. The terms x_i2^2 and x_i1^2 x_i2 can be represented by α = (0, 2) and α = (2, 1), respectively.
The notation y = Z(X)β + ε is used to denote that the observed response y = (y_1, ..., y_n)′ is modeled via a polynomial function Z of the original covariates contained in X = (x_1, ..., x_p) (where x_j = (x_1j, ..., x_nj)′), and the coefficients of the polynomial terms are given by β. A specific polynomial model M is defined by the set of coefficients β_α that are allowed to be non-zero. This definition is equivalent to characterizing M through a collection of multi-indices α ∈ ℕ_0^p. In particular, model M is specified by M = {α_1^M, ..., α_|M|^M} for α_k^M ∈ ℕ_0^p, where β_α = 0 for α ∉ M.

Any particular model M uses a subset X_M of the original covariates X to form the polynomial terms in the design matrix Z_M(X). Without ambiguity, a polynomial model Z_M(X) on X can be identified with a polynomial model Z_M(X_M) on the covariates X_M. The number of terms used by M to model the response y, denoted by |M|, corresponds to the number of columns of Z_M(X_M). The coefficient vector and error variance of the model M are denoted by β_M and σ_M^2, respectively. Thus, M models the data as y = Z_M(X_M)β_M + ε_M, where ε_M ~ N(0, I σ_M^2). Model M is said to be nested in model M′ if M ⊂ M′. M models the response through the covariates in two distinct ways: choosing the set of meaningful covariates X_M, as well as choosing the polynomial structure of these covariates, Z_M(X_M).
The set ℕ_0^p constitutes a partially ordered set or, more succinctly, a poset. A poset is a set partially ordered through a binary relation "≼". In this context, the binary relation on the poset ℕ_0^p is defined between pairs (α, α′) by α′ ≼ α whenever α_j ≥ α′_j for all j = 1, ..., p, with α′ ≺ α if, additionally, α_j > α′_j for some j. The order of a term α ∈ ℕ_0^p is given by the sum of its elements, order(α) = Σ_j α_j. When order(α) = order(α′) + 1 and α′ ≺ α, then α′ is said to immediately precede α, which is denoted by α′ → α. The parent set of α is defined by P(α) = {α′ ∈ ℕ_0^p : α′ → α} and is given by the set of nodes that immediately precede the given node. A polynomial model M is said to be well-formulated if α ∈ M implies that P(α) ⊂ M. For example, any well-formulated model using x_i1^2 x_i2 to model y_i must also include the parent terms x_i1 x_i2 and x_i1^2, their corresponding parent terms x_i1 and x_i2, and the intercept term 1.
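To make the definition concrete, the parent-set relation and the well-formulatedness check can be sketched in a few lines of Python. This is an illustrative sketch, not code from the text; multi-indices α are represented as integer tuples, and the function names are my own.

```python
def parents(alpha):
    """Parent set P(alpha): nodes obtained by decreasing one positive
    exponent of alpha by 1 (the immediate predecessors in the poset)."""
    return {tuple(a - (j == i) for j, a in enumerate(alpha))
            for i in range(len(alpha)) if alpha[i] > 0}

def is_well_formulated(model):
    """A model (a set of multi-indices) is well-formulated if it
    contains the parent set of each of its terms."""
    return all(parents(alpha) <= model for alpha in model)

# x1^2*x2 = (2, 1) requires x1*x2 and x1^2, which in turn require x1, x2, and 1
wfm = {(0, 0), (1, 0), (0, 1), (2, 0), (1, 1), (2, 1)}
not_wfm = {(0, 0), (0, 1), (2, 0)}          # contains x1^2 but not x1
print(is_well_formulated(wfm), is_well_formulated(not_wfm))  # True False
```

The check is purely set-theoretic: no design matrix is needed to decide whether a candidate model respects the hierarchy.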
The poset ℕ_0^p can be represented by a directed acyclic graph (DAG), denoted by Γ(ℕ_0^p). Without ambiguity, we can identify nodes in the graph, α ∈ ℕ_0^p, with terms in the set of covariates. The graph has directed edges to a node from its parents. Any well-formulated model M is represented by a subgraph Γ(M) of Γ(ℕ_0^p) with the property that if node α ∈ Γ(M), then the nodes corresponding to P(α) are also in Γ(M). Figure 4-1 shows examples of well-formulated polynomial models, where α ∈ ℕ_0^p is identified with ∏_{j=1}^p x_j^{α_j}.
The motivation for considering only well-formulated polynomial models is compelling. Let Z_M be the design matrix associated with a polynomial model. The subspace of y modeled by Z_M, given by the hat matrix H_M = Z_M(Z_M′ Z_M)^{-1} Z_M′, is invariant to affine transformations of the matrix X_M if and only if M corresponds to a well-formulated polynomial model (Peixoto 1990).
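This invariance property is easy to verify numerically. The sketch below (illustrative code, not from the text) compares the hat matrices of the well-formulated model {1, x1, x1^2} and the non-well-formulated model {1, x1^2} before and after the shift x1 ↦ x1 + c:

```python
import numpy as np

def hat(Z):
    """Hat matrix H = Z (Z'Z)^{-1} Z'."""
    return Z @ np.linalg.solve(Z.T @ Z, Z.T)

rng = np.random.default_rng(0)
x1 = rng.normal(size=8)
one = np.ones_like(x1)
c = 2.0  # an arbitrary non-zero shift

# Well-formulated model {1, x1, x1^2}: hat matrix unchanged by x1 -> x1 + c
H_wf = hat(np.column_stack([one, x1, x1 ** 2]))
H_wfc = hat(np.column_stack([one, x1 + c, (x1 + c) ** 2]))

# Non-well-formulated model {1, x1^2}: the shift changes the column space
H_nwf = hat(np.column_stack([one, x1 ** 2]))
H_nwfc = hat(np.column_stack([one, (x1 + c) ** 2]))

print(np.allclose(H_wf, H_wfc), np.allclose(H_nwf, H_nwfc))  # True False
```

In the well-formulated case the shifted columns are linear combinations of the original ones, so the column space, and hence the hat matrix, is unchanged; in the second case the expansion (x1 + c)^2 = x1^2 + 2c·x1 + c^2 introduces x1, which is not in the original span.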
Figure 4-1. Graphs of well-formulated polynomial models for p = 2.
For example, if p = 2 and y_i = β_(0,0) + β_(1,0) x_i1 + β_(0,1) x_i2 + β_(1,1) x_i1 x_i2 + ε_i, then the hat matrix is invariant to any covariate transformation of the form A(x_i1, x_i2)′ + b, for any real-valued positive definite 2 × 2 matrix A and any real-valued two-dimensional vector b. In contrast, if y_i = β_(0,0) + β_(2,0) x_i1^2 + ε_i, then the hat matrix formed after applying the transformation x_i1 ↦ x_i1 + c, for real c ≠ 0, is not the same as the hat matrix formed from the original x_i1.

4.2.1 Well-Formulated Model Spaces
The spaces of WFMs ℳ considered here can be characterized in terms of two WFMs: M_B, the base model, and M_F, the full model. The base model contains at least the intercept term and is nested in the full model. The model space ℳ is populated by all well-formulated models M that nest M_B and are nested in M_F:

    ℳ = {M : M_B ⊆ M ⊆ M_F and M is well-formulated}.
For M to be well-formulated, the entire ancestry of each node in M must also be included in M. Because of this, M ∈ ℳ can be uniquely identified by two different sets of nodes in M_F: the set of extreme nodes and the set of children nodes. For M ∈ ℳ, the sets of extreme and children nodes, respectively denoted by E(M) and C(M), are defined by

    E(M) = {α ∈ M \ M_B : α ∉ P(α′) for all α′ ∈ M},
    C(M) = {α ∈ M_F \ M : {α} ∪ M is well-formulated}.

The extreme nodes are those nodes that, when removed from M, give rise to a WFM in ℳ. The children nodes are those nodes that, when added to M, give rise to a WFM in ℳ. Because M_B ⊆ M for all M ∈ ℳ, the set of nodes E(M) ∪ M_B determines M by beginning with this set and iteratively adding parent nodes. Similarly, the nodes in C(M) determine the set {α′ ∈ P(α) : α ∈ C(M)} ∪ {α′ ∈ E(M_F) : α ≼ α′ for no α ∈ C(M)}, which contains E(M) ∪ M_B and thus uniquely identifies M.
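Both node sets can be computed directly from the parent-set relation. The following illustrative Python sketch (my own helper names, not from the text) reproduces the sets shown in Figure 4-2 for M = {1, x1, x1^2}:

```python
def parents(alpha):
    """Immediate predecessors of the multi-index alpha in the poset."""
    return {tuple(a - (j == i) for j, a in enumerate(alpha))
            for i in range(len(alpha)) if alpha[i] > 0}

def extreme_nodes(M, MB):
    """E(M): nodes of M \\ MB that are a parent of no other node in M."""
    return {a for a in M - MB if not any(a in parents(b) for b in M)}

def children_nodes(M, MF):
    """C(M): nodes of MF \\ M whose entire parent set lies in M."""
    return {a for a in MF - M if parents(a) <= M}

MB = {(0, 0)}                                          # intercept only
MF = {(0, 0), (1, 0), (0, 1), (2, 0), (1, 1), (0, 2)}  # full quadratic in x1, x2
M = {(0, 0), (1, 0), (2, 0)}                           # M = {1, x1, x1^2}
print(extreme_nodes(M, MB), children_nodes(M, MF))  # {(2, 0)} {(0, 1)}
```

Here the only extreme node is x1^2 (removing it leaves the WFM {1, x1}), and the only child node is x2 (its sole parent, the intercept, is in M), matching the description of Figure 4-2.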
Figure 4-2. Extreme node set (A) and children node set (B) for the model M = {1, x1, x1^2} in the space with full model M_F = {1, x1, x2, x1^2, x1x2, x2^2}.

In Figure 4-2, the extreme and children sets for model M = {1, x1, x1^2} are shown for the model space characterized by M_F = {1, x1, x2, x1^2, x1x2, x2^2}. In Figure 4-2A, the solid nodes represent nodes α ∈ M \ E(M), the dashed node corresponds to α ∈ E(M), and the dotted nodes are not in M. Solid nodes in Figure 4-2B correspond to those in M; the dashed node is the single node in C(M), and the dotted nodes are not in M ∪ C(M).

4.3 Priors on the Model Space
As discussed in Scott & Berger (2010), the Ockham's-razor effect found automatically in Bayesian variable selection through the Bayes factor does not correct for multiple testing. This penalization acts against more complex models, but does not account for the collection of models in the model space, which describes the multiplicity of the testing problem. This is where the role of the prior on the model space becomes important. As Scott & Berger explain, the multiplicity penalty is "hidden away" in the model prior probabilities π(M|ℳ).
In what follows, we propose three different prior structures on the model space for WFMs, discuss their advantages and disadvantages, and describe reasonable choices for their hyper-parameters. In addition, we investigate how the choice of prior structure and hyper-parameter combinations affects the posterior probabilities for predictor inclusion, providing some recommendations for different situations.

4.3.1 Model Prior Definition
The graphical structure of the model space suggests a method for prior construction on ℳ, guided by the notion of inheritance. A node α is said to inherit from a node α′ if there is a directed path from α′ to α in the graph Γ(M_F). The inheritance is said to be immediate if order(α) = order(α′) + 1 (equivalently, if α′ ∈ P(α), or if α′ immediately precedes α).

For convenience, define Δ(M) = M \ M_B to be the set of nodes in M that are not in the base model M_B. For α ∈ Δ(M_F), let γ_α(M) be the indicator function describing whether α is included in M, i.e., γ_α(M) = I(α ∈ M). Denote by γ^ν(M) the set of indicators of inclusion in M for all order-ν nodes in Δ(M_F). Finally, let γ^{<ν}(M) = ∪_{j=0}^{ν-1} γ^j(M), the set of indicators for inclusion in M for all nodes in Δ(M_F) of order less than ν. With these definitions, the prior probability of any model M ∈ ℳ can be factored as

    π(M|ℳ) = ∏_{j = J_M^min}^{J_M^max} π(γ^j(M) | γ^{<j}(M), ℳ),   (4–3)

where J_M^min and J_M^max are, respectively, the minimum and maximum order of nodes in Δ(M_F), and π(γ^{J_M^min}(M) | γ^{<J_M^min}(M), ℳ) = π(γ^{J_M^min}(M) | ℳ).
Prior distributions on ℳ can be simplified by making two assumptions. First, if order(α) = order(α′) = j, then γ_α and γ_α′ are assumed to be conditionally independent when conditioned on γ^{<j}, denoted by γ_α ⊥⊥ γ_α′ | γ^{<j}. Second, immediate inheritance is invoked, and it is assumed that if order(α) = j, then γ_α(M) | γ^{<j}(M) = γ_α(M) | γ_{P(α)}(M), where γ_{P(α)}(M) is the inclusion indicator for the set of parent nodes of α. This indicator is one if the complete parent set of α is contained in M, and zero otherwise.
In Figure 4-3, these two assumptions are depicted with M_F being an order-two surface in two main effects. The conditional independence assumption (Figure 4-3A) implies that the inclusion indicators for x1^2, x2^2, and x1x2 are independent when conditioned on all the lower-order terms. In this same space, immediate inheritance implies that the inclusion of x1^2, conditioned on the inclusion of all lower-order nodes, is equivalent to conditioning it on its parent set ({x1} in this case).
Figure 4-3. A) Conditional independence: x1^2 ⊥⊥ x1x2 ⊥⊥ x2^2 given {1, x1, x2}. B) Immediate inheritance: x1^2 conditioned on {1, x1, x2} is equivalent to x1^2 conditioned on {x1}.
Denote the conditional inclusion probability of node α in model M by π_α = π(γ_α(M) = 1 | γ_{P(α)}(M), ℳ). Under the assumptions of conditional independence and immediate inheritance, the prior probability of M is

    π(M | π_M, ℳ) = ∏_{α ∈ Δ(M_F)} π_α^{γ_α(M)} (1 − π_α)^{1 − γ_α(M)},   (4–4)

with π_M = {π_α : α ∈ Δ(M_F)}. Because M must be well-formulated, π_α = γ_α = 0 if γ_{P(α)}(M) = 0. Thus, the product in 4–4 can be restricted to the set of nodes α ∈ Δ(M) ∪ C(M). Additional structure can be built into the prior on ℳ by making assumptions about the inclusion probabilities π_α, such as equality assumptions or assumptions of a hyper-prior for these parameters. Three such prior classes are developed next, first by assigning hyper-priors on π_M assuming some structure among its elements, and then marginalizing out the π_M.
Hierarchical Uniform Prior (HUP). The HUP assumes that the non-zero π_α are all equal. Specifically, for a model M ∈ ℳ, it is assumed that π_α = π for all α ∈ Δ(M) ∪ C(M). A complete Bayesian specification of the HUP is completed by assuming a prior distribution for π. The choice of π ~ Beta(a, b) produces

    π_HUP(M | ℳ, a, b) = B(|Δ(M)| + a, |C(M)| + b) / B(a, b),   (4–5)

where B is the beta function. Setting a = b = 1 gives the particular value of

    π_HUP(M | ℳ, a = 1, b = 1) = [1 / (|Δ(M)| + |C(M)| + 1)] · C(|Δ(M)| + |C(M)|, |Δ(M)|)^{-1},   (4–6)

where C(n, k) denotes the binomial coefficient. The HUP assigns equal probabilities to all models for which the sets of nodes Δ(M) and C(M) have the same cardinality. This prior provides a combinatorial penalization, but essentially fails to account for the hierarchical structure of the model space. An additional penalization for model complexity can be incorporated into the HUP by changing the values of a and b. Because π_α = π for all α, this penalization can only depend on some aspect of the entire graph of M_F, such as the total number of nodes not in the null model, |Δ(M_F)|.
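As a quick check of 4–5, note that the HUP probability depends on M only through the two counts |Δ(M)| and |C(M)|. The illustrative sketch below (helper names are my own) recovers the value 1/60 listed for model 6 of Figure 4-4:

```python
from math import gamma

def beta_fn(a, b):
    """Beta function B(a, b) = Gamma(a) Gamma(b) / Gamma(a + b)."""
    return gamma(a) * gamma(b) / gamma(a + b)

def pi_hup(n_delta, n_children, a=1.0, b=1.0):
    """Eq. 4-5: pi_HUP depends on M only through |Delta(M)| and |C(M)|."""
    return beta_fn(n_delta + a, n_children + b) / beta_fn(a, b)

# Model 6 of Figure 4-4: M = {1, x1, x2}, so |Delta(M)| = 2 and |C(M)| = 3
print(pi_hup(2, 3))  # 1/60
```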
Hierarchical Independence Prior (HIP). The HIP assumes that there are no equality constraints among the non-zero π_α. Each non-zero π_α is given its own prior, which is assumed to be a Beta distribution with parameters a_α and b_α. Thus, the prior probability of M under the HIP is

    π_HIP(M | ℳ, a, b) = ∏_{α ∈ Δ(M)} [a_α / (a_α + b_α)] · ∏_{α ∈ C(M)} [b_α / (a_α + b_α)],   (4–7)

where the product over ∅ is taken to be 1. Because the π_α are totally independent, any choice of a_α and b_α is equivalent to choosing a probability of success π_α for a given α. Setting a_α = b_α = 1 for all α ∈ Δ(M) ∪ C(M) gives the particular value of

    π_HIP(M | ℳ, a = 1, b = 1) = (1/2)^{|Δ(M)| + |C(M)|}.   (4–8)

Although the prior with this choice of hyper-parameters accounts for the hierarchical structure of the model space, it essentially provides no penalization for combinatorial complexity at different levels of the hierarchy. This can be observed by considering a model space with main effects only: the exponent in 4–8 is the same for every model in the space, because each node is either in the model or in the children set.
Additional penalizations for model complexity can be incorporated into the HIP. Because each γ^j is conditioned on γ^{<j} in the prior construction, the a_α and b_α for α of order j can be conditioned on γ^{<j}. One such additional penalization utilizes the number of nodes of order j that could be added to produce a WFM conditioned on the inclusion vector γ^{<j}, which is denoted by ch_j(γ^{<j}). Choosing a_α = 1 and b_α(M) = ch_j(γ^{<j}) is equivalent to choosing a probability of success π_α = 1/ch_j(γ^{<j}). This penalization can drive down the false positive rate when ch_j(γ^{<j}) is large, but may produce more false negatives.
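Equation 4–7 is a simple product over the included nodes and the excluded children. The sketch below (illustrative, with per-node hyper-parameters stored in dictionaries of my own devising) recovers the value 1/32 listed for model 6 of Figure 4-4:

```python
def pi_hip(delta, children, a=None, b=None):
    """Eq. 4-7 with per-node hyper-parameters a_alpha and b_alpha
    (dictionaries keyed by node; defaults a_alpha = b_alpha = 1)."""
    a = a if a is not None else {alpha: 1.0 for alpha in delta | children}
    b = b if b is not None else {alpha: 1.0 for alpha in delta | children}
    p = 1.0
    for alpha in delta:        # included nodes contribute a/(a+b)
        p *= a[alpha] / (a[alpha] + b[alpha])
    for alpha in children:     # excluded children contribute b/(a+b)
        p *= b[alpha] / (a[alpha] + b[alpha])
    return p

# Model 6 of Figure 4-4: Delta(M) = {x1, x2}, C(M) = {x1^2, x1x2, x2^2}
delta = {(1, 0), (0, 1)}
children = {(2, 0), (1, 1), (0, 2)}
print(pi_hip(delta, children))  # (1/2)^5 = 0.03125
```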
Hierarchical Order Prior (HOP). A compromise between complete equality and complete independence of the π_α is to assume equality between the π_α of a given order and independence across the different orders. Define Δ_j(M) = {α ∈ Δ(M) : order(α) = j} and C_j(M) = {α ∈ C(M) : order(α) = j}. The HOP assumes that π_α = π_j for all α ∈ Δ_j(M) ∪ C_j(M). Assuming that π_j ~ Beta(a_j, b_j) provides a prior probability of

    π_HOP(M | ℳ, a, b) = ∏_{j = J_M^min}^{J_M^max} B(|Δ_j(M)| + a_j, |C_j(M)| + b_j) / B(a_j, b_j).   (4–9)
The specific choice of a_j = b_j = 1 for all j gives a value of

    π_HOP(M | ℳ, a = 1, b = 1) = ∏_j [1 / (|Δ_j(M)| + |C_j(M)| + 1)] · C(|Δ_j(M)| + |C_j(M)|, |Δ_j(M)|)^{-1},   (4–10)

where C(n, k) denotes the binomial coefficient, and produces a hierarchical version of the Scott and Berger multiplicity correction.
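Because 4–9 factors over the orders j, it depends on M only through the per-order counts |Δ_j(M)| and |C_j(M)|. The illustrative sketch below (helper names are my own) recovers the value 1/36 listed for model 7 of Figure 4-4:

```python
from math import gamma

def beta_fn(a, b):
    """Beta function B(a, b) = Gamma(a) Gamma(b) / Gamma(a + b)."""
    return gamma(a) * gamma(b) / gamma(a + b)

def pi_hop(delta_sizes, child_sizes, a=1.0, b=1.0):
    """Eq. 4-9 with a_j = a and b_j = b for every order j; the arguments
    are the per-order counts |Delta_j(M)| and |C_j(M)|."""
    p = 1.0
    for dj, cj in zip(delta_sizes, child_sizes):
        p *= beta_fn(dj + a, cj + b) / beta_fn(a, b)
    return p

# Model 7 of Figure 4-4: M = {1, x1, x2, x1^2}
# order 1: |Delta_1| = 2, |C_1| = 0;  order 2: |Delta_2| = 1, |C_2| = 2
print(pi_hop([2, 1], [0, 2]))  # 1/36
```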
The HOP arises from a conditional exchangeability assumption on the indicator variables. Conditioned on γ^{<j}(M), the indicators {γ_α : α ∈ Δ_j(M) ∪ C_j(M)} are assumed to be exchangeable Bernoulli random variables. By de Finetti's theorem, these arise from independent Bernoulli random variables with common probability of success π_j, with a prior distribution. Our construction of the HOP assumes that this prior is a beta distribution. Additional complexity penalizations can be incorporated into the HOP in a similar fashion to the HIP. The number of possible nodes of order j that could be added while maintaining a WFM is given by ch_j(M) = ch_j(γ^{<j}(M)) = |Δ_j(M) ∪ C_j(M)|. Using a_j = 1 and b_j(M) = ch_j(M) produces a prior with two desirable properties. First, if M′ ⊂ M, then π(M) ≤ π(M′). Second, for each order j, the conditional probability of including k nodes is greater than or equal to that of including k + 1 nodes, for k = 0, 1, ..., ch_j(M) − 1.

4.3.2 Choice of Prior Structure and Hyper-Parameters
Each of the priors introduced in Section 4.3.1 defines a whole family of model priors characterized by the probability distribution assumed for the inclusion probabilities π_M. For the sake of simplicity, this chapter focuses on those arising from Beta distributions and concentrates on particular choices of hyper-parameters, which can be specified automatically. First, we describe some general features of how each of the three prior structures (HUP, HIP, HOP) allocates mass to the models in the model space. Second, as there is an infinite number of ways in which the hyper-parameters can be specified, focus is placed on the default choice a = b = 1, as well as on the complexity penalizations described in Section 4.3.1. The second alternative is referred to as a = 1, b = ch, where b = ch has a slightly different interpretation depending on the prior structure. Accordingly, b = ch is given by b_j(M) = b_α(M) = ch_j(M) = |Δ_j(M) ∪ C_j(M)| for the HOP and HIP, where j = order(α), while b = ch denotes that b = |Δ(M_F)| for the HUP. The prior behavior is illustrated for two model spaces; in both cases, the base model M_B is taken to be the intercept-only model and M_F is the DAG shown (Figures 4-4 and 4-5). The priors considered treat model complexity differently, and some general properties can be seen in these examples.
Model | HIP (1,1) | HIP (1,ch) | HOP (1,1) | HOP (1,ch) | HUP (1,1) | HUP (1,ch)
1. {1} | 1/4 | 4/9 | 1/3 | 1/2 | 1/3 | 5/7
2. {1, x1} | 1/8 | 1/9 | 1/12 | 1/12 | 1/12 | 5/56
3. {1, x2} | 1/8 | 1/9 | 1/12 | 1/12 | 1/12 | 5/56
4. {1, x1, x1^2} | 1/8 | 1/9 | 1/12 | 1/12 | 1/12 | 5/168
5. {1, x2, x2^2} | 1/8 | 1/9 | 1/12 | 1/12 | 1/12 | 5/168
6. {1, x1, x2} | 1/32 | 3/64 | 1/12 | 1/12 | 1/60 | 1/72
7. {1, x1, x2, x1^2} | 1/32 | 1/64 | 1/36 | 1/60 | 1/60 | 1/168
8. {1, x1, x2, x1x2} | 1/32 | 1/64 | 1/36 | 1/60 | 1/60 | 1/168
9. {1, x1, x2, x2^2} | 1/32 | 1/64 | 1/36 | 1/60 | 1/60 | 1/168
10. {1, x1, x2, x1^2, x1x2} | 1/32 | 1/192 | 1/36 | 1/120 | 1/30 | 1/252
11. {1, x1, x2, x1^2, x2^2} | 1/32 | 1/192 | 1/36 | 1/120 | 1/30 | 1/252
12. {1, x1, x2, x1x2, x2^2} | 1/32 | 1/192 | 1/36 | 1/120 | 1/30 | 1/252
13. {1, x1, x2, x1^2, x1x2, x2^2} | 1/32 | 1/576 | 1/12 | 1/120 | 1/6 | 1/252

Figure 4-4. Prior probabilities for the space of well-formulated models associated to the quadratic surface on two variables, where M_B is taken to be the intercept-only model and (a, b) ∈ {(1, 1), (1, ch)}.
First, contrast the choices of HIP, HUP, and HOP for (a, b) = (1, 1). The HIP induces a complexity penalization that only accounts for the order of the terms in the model. This is best exhibited by the model space in Figure 4-4: models including x1 and x2 (models 6 through 13) are given the same prior probability, and no penalization is incurred for the inclusion of any or all of the quadratic terms.

Model | HIP (1,1) | HIP (1,ch) | HOP (1,1) | HOP (1,ch) | HUP (1,1) | HUP (1,ch)
1. {1} | 1/8 | 27/64 | 1/4 | 1/2 | 1/4 | 4/7
2. {1, x1} | 1/8 | 9/64 | 1/12 | 1/10 | 1/12 | 2/21
3. {1, x2} | 1/8 | 9/64 | 1/12 | 1/10 | 1/12 | 2/21
4. {1, x3} | 1/8 | 9/64 | 1/12 | 1/10 | 1/12 | 2/21
5. {1, x1, x3} | 1/8 | 3/64 | 1/12 | 1/20 | 1/12 | 4/105
6. {1, x2, x3} | 1/8 | 3/64 | 1/12 | 1/20 | 1/12 | 4/105
7. {1, x1, x2} | 1/16 | 3/128 | 1/24 | 1/40 | 1/30 | 1/42
8. {1, x1, x2, x1x2} | 1/16 | 3/128 | 1/24 | 1/40 | 1/20 | 1/70
9. {1, x1, x2, x3} | 1/16 | 1/128 | 1/8 | 1/40 | 1/20 | 1/70
10. {1, x1, x2, x3, x1x2} | 1/16 | 1/128 | 1/8 | 1/40 | 1/5 | 1/70

Figure 4-5. Prior probabilities for the space of well-formulated models associated to three main effects and one interaction term, where M_B is taken to be the intercept-only model and (a, b) ∈ {(1, 1), (1, ch)}.

In contrast to the HIP, the HUP induces a penalization for model complexity, but it does not adequately penalize models for including additional terms. Using the HUP, models including all of the terms are given at least as much probability as any model containing any non-empty set of terms (Figures 4-4 and 4-5). This lack of penalization of the full model originates from its combinatorial simplicity (i.e., this is the only model that contains every term), and, as an unfortunate consequence, this model space distribution favors the base and full models. Similar behavior is observed with the HOP with (a, b) = (1, 1). As models become more complex, they are appropriately penalized for their size. However, after a sufficient number of nodes are added, the number of possible models of that particular size is considerably reduced. Thus, combinatorial complexity is negligible for the largest models. This is best exhibited in Figure 4-5, where the HOP places more mass on the full model than on any model containing a single order-one node, highlighting an undesirable behavior of the priors with this choice of hyper-parameters.

In contrast, if (a, b) = (1, ch), all three priors produce strong penalization as models become more complex, both in terms of the number and the order of the nodes contained in the model. For all of the priors, adding a node α to a model M to form M′ produces π(M) ≥ π(M′). However, differences between the priors are apparent. The HIP penalizes the full model the most, with the HOP penalizing it the least and the HUP lying between them. At face value, the HOP creates the most compelling penalization of model complexity. In Figure 4-5, the penalization of the HOP is the least dramatic, producing prior odds of 20 for M_B versus M_F, as opposed to the HUP and HIP, which produce prior odds of 40 and 54, respectively. Similarly, the prior odds in Figure 4-4 are 60, 180, and 256 for the HOP, HUP, and HIP, respectively.

4.3.3 Posterior Sensitivity to the Choice of Prior
To determine how the proposed priors adjust the posterior probabilities to account for multiplicity, a simple simulation was performed. The goal of this exercise was to understand how the priors respond to increasing complexity. First, the priors are compared as the number of main effects p grows. Second, they are compared as the depth of the hierarchy increases or, in other words, as the order J_max^M increases.

The quality of a node is characterized by its marginal posterior inclusion probability, defined as p_α = Σ_{M ∈ ℳ} I(α ∈ M) p(M | y, ℳ) for α ∈ M_F. These posteriors were obtained for the proposed priors as well as for the Equal Probability Prior (EPP) on ℳ. For all prior structures, both the default hyper-parameters a = b = 1 and the penalizing choice of a = 1 and b = ch are considered. The results for the different combinations of M_F and M_T incorporated in the analysis were obtained from 100 random replications (i.e., generating at random 100 matrices of main effects and responses). The simulation proceeds as follows:
1. Randomly generate main-effects matrices X = (x_1, ..., x_18) with x_i ~iid N_n(0, I_n), and error vectors ε ~ N_n(0, I_n), for n = 60.

2. Setting all coefficient values equal to one, calculate y = Z_{M_T} β + ε for the true models given by
   M_T1 = {x1, x2, x3, x1^2, x1x2, x2^2, x2x3}, with |M_T1| = 7;
   M_T2 = {x1, x2, ..., x16}, with |M_T2| = 16;
   M_T3 = {x1, x2, x3, x4}, with |M_T3| = 4;
   M_T4 = {x1, x2, ..., x8, x1^2, x3x4}, with |M_T4| = 10;
   M_T5 = {x1, x2, x3, x4, x1^2, x3x4}, with |M_T5| = 6.
Table 4-1. Characterization of the full models M_F and corresponding model spaces ℳ considered in the simulations.

Growing p, fixed J_max^M:
M_F | |M_F| | |ℳ| | M_T used
(x1 + x2 + x3)^2 | 9 | 95 | M_T1
(x1 + ... + x4)^2 | 14 | 1337 | M_T1
(x1 + ... + x5)^2 | 20 | 38619 | M_T1

Fixed p, growing J_max^M:
M_F | |M_F| | |ℳ| | M_T used
(x1 + x2 + x3)^2 | 9 | 95 | M_T1
(x1 + x2 + x3)^3 | 19 | 2497 | M_T1
(x1 + x2 + x3)^4 | 34 | 161421 | M_T1

Other model spaces:
M_F | |M_F| | |ℳ| | M_T used
x1 + x2 + ... + x18 | 18 | 262144 | M_T2, M_T3
(x1 + x2 + x3 + x4)^2 + x5 + x6 + ... + x10 | 20 | 85568 | M_T4, M_T5
3. In all simulations, the base model M_B is the intercept-only model. The notation (x1 + ... + xp)^d is used to represent the full order-d polynomial response surface in p main effects. The model spaces, characterized by their corresponding full model M_F, are presented in Table 4-1, as well as the true models used in each case.

4. Enumerate the model spaces and calculate p(M | y, ℳ) for all M ∈ ℳ using the EPP, HUP, HIP, and HOP, the latter two each with the two sets of hyper-parameters.

5. Count the number of true positives and false positives in each model space for the different priors.
The true positives (TP) are defined as those nodes α ∈ M_T such that p_α > 0.5. For the false positives (FP), three different cutoffs for p_α are considered, elucidating the adjustment for multiplicity induced by the model priors. These cutoffs are 0.10, 0.20, and 0.50, for α ∉ M_T. The results from this exercise provide insight into the influence of the prior on the marginal posterior inclusion probabilities. In Table 4-1, the model spaces considered are described in terms of the number of models they contain and in terms of the number of nodes of M_F, the full model that defines the DAG for ℳ.
Growing number of main effects, fixed polynomial degree. This simulation investigates the posterior behavior as the number of covariates grows, for a polynomial surface of degree two. The true model is assumed to be M_T1 and has 7 polynomial terms. The false positive and true positive rates are displayed in Table 4-2.
First, focus on the posterior when (a, b) = (1, 1). As p increases and the cutoff is low, the number of false positives increases for the EPP as well as for the hierarchical priors, although less dramatically for the latter. All of the priors identify all of the true positives. The false positive rate for the 0.50 cutoff is less than one for all four prior structures, with the HIP exhibiting the smallest false positive rate.

With the second choice of hyper-parameters, (1, ch), the improvement of the hierarchical priors over the EPP is dramatic, and the difference in performance is more pronounced as p increases. These also considerably outperform the priors using the default hyper-parameters a = b = 1 in terms of the false positives. Regarding the number of true positives, all priors discovered the 7 true predictors in M_T1 for most of the 100 random samples of data, with only minor differences observed between any of the priors considered. That being said, the means for the priors with a = 1, b = ch are slightly lower for the true positives. With a 0.50 cutoff, the hierarchical priors keep tight control on the number of false positives, but in doing so discard true positives with slightly higher frequency.
Growing polynomial degree, fixed main effects. For these examples, the true model is once again M_T1. When the complexity is increased by making the order of M_F larger (Table 4-3), the inability of the EPP to adjust the inclusion posteriors for multiplicity becomes more pronounced: the EPP becomes less and less efficient at removing false positives when the FP cutoff is low. Among the priors with a = b = 1, as the order increases, the HIP is the best at filtering out the false positives. Using the 0.50 false positive cutoff, some false positives are included both for the EPP and for all the priors with a = b = 1, indicating that the default hyper-parameters might not be the best option to control FP. The 7 covariates in the true model all obtain a high inclusion posterior probability, both with the EPP and with the a = b = 1 priors.
Table 4-2. Mean number of false and true positives in 100 randomly generated datasets as the number of main effects increases from three to five predictors in a full quadratic surface, under the equal probability prior (EPP), the hierarchical independence prior (HIP), the hierarchical order prior (HOP), and the hierarchical uniform prior (HUP).

M_F = (x1 + x2 + x3)^2, |M_T| = 7 (M_T1):
Cutoff | EPP | HIP(1,1) | HUP(1,1) | HOP(1,1) | HIP(1,ch) | HUP(1,ch) | HOP(1,ch)
FP(>0.10) | 1.78 | 1.78 | 2.00 | 2.00 | 0.11 | 1.31 | 1.06
FP(>0.20) | 0.43 | 0.43 | 2.00 | 1.98 | 0.01 | 0.28 | 0.24
FP(>0.50) | 0.04 | 0.04 | 0.97 | 0.36 | 0.00 | 0.03 | 0.02
TP(>0.50) | 7.00 | 7.00 | 7.00 | 7.00 | 6.97 | 6.99 | 6.99

M_F = (x1 + x2 + x3 + x4)^2, |M_T| = 7 (M_T1):
FP(>0.10) | 3.62 | 1.94 | 2.33 | 2.45 | 0.10 | 0.63 | 1.07
FP(>0.20) | 1.60 | 0.47 | 2.17 | 2.15 | 0.01 | 0.17 | 0.24
FP(>0.50) | 0.25 | 0.06 | 0.35 | 0.36 | 0.00 | 0.02 | 0.02
TP(>0.50) | 7.00 | 7.00 | 7.00 | 7.00 | 6.97 | 6.99 | 6.99

M_F = (x1 + x2 + x3 + x4 + x5)^2, |M_T| = 7 (M_T1):
FP(>0.10) | 6.00 | 2.16 | 2.60 | 2.55 | 0.12 | 0.43 | 1.15
FP(>0.20) | 2.91 | 0.55 | 2.13 | 2.18 | 0.02 | 0.19 | 0.27
FP(>0.50) | 0.66 | 0.11 | 0.25 | 0.37 | 0.00 | 0.03 | 0.01
TP(>0.50) | 7.00 | 7.00 | 7.00 | 7.00 | 6.97 | 6.99 | 6.99
In contrast, any of the a = 1 and b = ch priors dramatically improve upon their a = b = 1 counterparts, consistently assigning low inclusion probabilities to the majority of the false positive terms, even for low cutoffs. As the order of the polynomial surface increases, the difference in performance between these priors and either the EPP or their default versions becomes even clearer. At the 0.50 cutoff, the hierarchical priors with complexity penalization exhibit very low false positive rates. The true positive rate decreases slightly for these priors, but not to an alarming degree.
Other model spaces. This part of the analysis considers model spaces that do not correspond to full-polynomial-degree response surfaces (Table 4-4). The first example is a model space with main effects only. The second example includes a full quadratic surface of order 2, but in addition includes six terms for which only main effects are to be modeled. Two true models are used in combination with each model space to observe how the posterior probabilities vary under the influence of the different priors for "large" and "small" true models.
Table 4-3. Mean number of false and true positives in 100 randomly generated datasets as the maximum order of M_F increases from two to four in a full model with three main effects, under the equal probability prior (EPP), the hierarchical independence prior (HIP), the hierarchical order prior (HOP), and the hierarchical uniform prior (HUP).

M_F = (x1 + x2 + x3)^2, |M_T| = 7 (M_T1):
Cutoff | EPP | HIP(1,1) | HUP(1,1) | HOP(1,1) | HIP(1,ch) | HUP(1,ch) | HOP(1,ch)
FP(>0.10) | 1.78 | 1.78 | 2.00 | 2.00 | 0.11 | 1.31 | 1.06
FP(>0.20) | 0.43 | 0.43 | 2.00 | 1.98 | 0.01 | 0.28 | 0.24
FP(>0.50) | 0.04 | 0.04 | 0.97 | 0.36 | 0.00 | 0.03 | 0.02
TP(>0.50) | 7.00 | 7.00 | 7.00 | 7.00 | 6.97 | 6.99 | 6.99

M_F = (x1 + x2 + x3)^3, |M_T| = 7 (M_T1):
FP(>0.10) | 7.37 | 5.21 | 6.06 | 2.91 | 0.55 | 1.05 | 1.39
FP(>0.20) | 2.91 | 1.55 | 3.61 | 2.08 | 0.17 | 0.34 | 0.31
FP(>0.50) | 0.40 | 0.21 | 0.50 | 0.26 | 0.03 | 0.03 | 0.04
TP(>0.50) | 7.00 | 7.00 | 7.00 | 7.00 | 6.97 | 6.98 | 7.00

M_F = (x1 + x2 + x3)^4, |M_T| = 7 (M_T1):
FP(>0.10) | 8.22 | 4.00 | 4.69 | 2.61 | 0.52 | 0.55 | 1.32
FP(>0.20) | 4.21 | 1.13 | 1.76 | 2.03 | 0.12 | 0.15 | 0.31
FP(>0.50) | 0.56 | 0.17 | 0.22 | 0.27 | 0.03 | 0.03 | 0.04
TP(>0.50) | 7.00 | 7.00 | 7.00 | 7.00 | 6.97 | 6.97 | 6.99
By construction, in model spaces with main effects only, HIP(1,1) and EPP are equivalent, as are HOP(a,b) and HUP(a,b). This accounts for the similarities observed among the results for the first two cases presented in Table 4-4, where the model space corresponds to a full model with 18 main effects and the true models contain 16 and 4 main effects, respectively. When the number of true coefficients is large, the HUP(1,1) and HOP(1,1) do poorly at controlling false positives, even at the 50% cutoff. In contrast, the HIP (and thus the EPP) with the 50% cutoff identifies the true positives and no false positives. This result, however, does not imply that the EPP controls false positives well: the true model contains 16 out of the 18 nodes in MF, so there is little potential for false positives. The a = 1, b = ch priors show dramatically different behavior. The HIP controls false positives well but fails to identify the true coefficients at the 50% cutoff. In contrast, the HOP identifies all of the true positives and has a small false positive rate at the 50% cutoff.
If the number of true positives is small, most terms in the full model are truly zero. The EPP includes at least one false positive in approximately 50% of the randomly sampled datasets. On the other hand, the HUP(1,1) provides some control for multiplicity, obtaining on average a lower number of false positives than the EPP. Furthermore, the proposed hierarchical priors with a = 1, b = ch are substantially better than the EPP (and the choice a = b = 1) at controlling false positives and capturing all true positives using the marginal posterior inclusion probabilities. These two examples suggest that the HOP(1, ch) is the best default choice for model selection when the number of terms available at a given degree is large.
The third and fourth examples in Table 4-4 consider the same irregular model space, with data generated from MT4 with ten terms and from MT5 with six terms. HIP(1,1) and EPP again behave quite similarly, incorporating a large number of false positives at the 10% cutoff; at the 50% cutoff some false positives are still included. The HUP(1,1) and HOP(1,1) behave similarly, with a slightly higher false positive rate at the 50% cutoff. In terms of true positives, the EPP and the a = b = 1 priors always include all of the predictors in MT4 and MT5. On the other hand, the ability of the a = 1, b = ch priors to control false positives is markedly better than that of the EPP and the hierarchical priors with a = b = 1. At the 50% cutoff these priors identify all of the true positives and true negatives. Once again, these examples point to the hierarchical priors with additional penalization for complexity as good default priors on the model space.

4.4 Random Walks on the Model Space
When the model space M is too large to enumerate, a stochastic procedure can be used to find models with high posterior probability. In particular, an MCMC algorithm can be utilized to generate a dependent sample of models from the posterior over the model space. The structure of the model space M both presents difficulties and provides clues on how to build algorithms to explore it. Different MCMC strategies can be adopted, two of which
Table 4-4. Mean number of false and true positives in 100 randomly generated datasets with unstructured or irregular model spaces, under the equal probability prior (EPP), the hierarchical independence prior (HIP), the hierarchical order prior (HOP), and the hierarchical uniform prior (HUP)

                                                  a = 1, b = 1             a = 1, b = ch
Cutoff      |MT|  MF                     EPP     HIP    HUP    HOP     HIP    HUP    HOP
FP(>0.10)   16    x1 + x2 + ... + x18   1.93    1.93   2.00   2.00    0.03   1.80   1.80
FP(>0.20)                               0.52    0.52   2.00   2.00    0.01   0.46   0.46
FP(>0.50)                               0.07    0.07   2.00   2.00    0.01   0.04   0.04
TP(>0.50)         (MT 2)               15.99   15.99  16.00  16.00    6.99  15.99  15.99
FP(>0.10)    4    x1 + x2 + ... + x18  13.95   13.95   9.15   9.15    0.26   1.31   1.31
FP(>0.20)                               5.45    5.45   3.03   3.03    0.05   0.45   0.45
FP(>0.50)                               0.84    0.84   0.45   0.45    0.02   0.06   0.06
TP(>0.50)         (MT 3)                4.00    4.00   4.00   4.00    4.00   4.00   4.00
FP(>0.10)   10    (x1+...+x4)^2 +       9.73    9.71  10.00   5.60    0.34   2.33   2.20
FP(>0.20)         x5 + ... + x10        2.65    2.65   8.73   3.05    0.12   0.74   0.69
FP(>0.50)                               0.35    0.35   1.36   1.68    0.02   0.11   0.12
TP(>0.50)         (MT 4)               10.00   10.00  10.00   9.99    9.94   9.98   9.99
FP(>0.10)    6    (x1+...+x4)^2 +      13.52   13.52  11.06   9.94    0.44   1.63   1.96
FP(>0.20)         x5 + ... + x10        4.22    4.21   3.60   5.01    0.15   0.48   0.68
FP(>0.50)                               0.53    0.53   0.57   0.75    0.01   0.08   0.11
TP(>0.50)         (MT 5)                6.00    6.00   6.00   6.00    5.99   5.99   5.99
are outlined in this section. Combining the different strategies allows the model selection algorithm to explore the model space thoroughly and relatively fast.

4.4.1 Simple Pruning and Growing
This first strategy relies on small, localized jumps around the model space, turning a single node on or off at each step. The idea behind this algorithm is to grow the model by activating one node in the children set, or to prune the model by removing one node in the extreme set. At a given step of the algorithm, assume that the current state of the chain is model M, and let pG be the probability that the algorithm chooses the growth step. The proposed model M′ can either be M+ = M ∪ {α} for some α ∈ C(M), or M− = M \ {α} for some α ∈ E(M).
An example transition kernel is defined by the mixture

g(M′ | M) = pG · qGrow(M′ | M) + (1 − pG) · qPrune(M′ | M)
          = [ I{M ≠ MF} / (1 + I{M ≠ MB}) ] · I{α ∈ C(M)} / |C(M)|
            + [ I{M ≠ MB} / (1 + I{M ≠ MF}) ] · I{α ∈ E(M)} / |E(M)|,      (4–11)

where pG has explicitly been defined as 0.5 when both C(M) and E(M) are non-empty, and as 0 (or 1) when C(M) = ∅ (or E(M) = ∅). After choosing between pruning and growing, a single node is proposed for addition to or deletion from M uniformly at random.
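This grow/prune proposal can be sketched as follows, under the assumption that the caller supplies functions computing C(M) and E(M); the names and the toy model space below are ours, not the dissertation's code:

```python
import random

def propose(model, children, extreme):
    """One grow-or-prune move from model M; children/extreme compute C(M) and E(M).
    At least one of the two sets must be non-empty."""
    C, E = children(model), extreme(model)
    if C and E:
        grow = random.random() < 0.5      # p_G = 1/2 when both moves are available
    else:
        grow = bool(C)                    # forced move at the base or full model
    if grow:
        return model | {random.choice(sorted(C))}   # M+ = M u {alpha}, alpha uniform on C(M)
    return model - {random.choice(sorted(E))}       # M- = M \ {alpha}, alpha uniform on E(M)

# Toy space with no hierarchy: any absent node is a child, any present node is extreme.
nodes = {"x1", "x2", "x3"}
M = frozenset({"x1"})
M_new = propose(M, lambda m: nodes - m, lambda m: set(m))
print(len(M_new ^ M))  # 1: the proposal differs from M by exactly one node
```

In a real well-formulated model space, `children` and `extreme` would enforce strong heredity rather than treat all nodes symmetrically.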
For this simple algorithm, pruning has the reverse kernel of growing, and vice versa. From this construction, more elaborate algorithms can be specified. First, instead of choosing the node uniformly at random from the corresponding set, nodes can be selected using the relative posterior probability of adding or removing them. Second, more than one node can be selected at any step, for instance by also sampling at random the number of nodes to add or remove given the size of the set. Third, the strategy could combine pruning and growing in a single step by sampling one node α ∈ C(M) ∪ E(M) and adding or removing it accordingly. Fourth, sets of nodes from C(M) ∪ E(M) that yield well-formulated models can be added or removed. This simple algorithm produces small moves around the model space by focusing node addition or removal only on the set C(M) ∪ E(M).

4.4.2 Degree Based Pruning and Growing
In exploring the model space, it is possible to take advantage of the hierarchical structure defined between nodes of different order by updating the vector of inclusion indicators in blocks, one order class j at a time. Two flavors of this algorithm are proposed: one that separates the pruning and growing steps, and one where both are done simultaneously.

Assume that at a given step, say t, the algorithm is at M. If growing, the strategy proceeds successively by order class, going from j = Jmin up to j = Jmax, with Jmin and Jmax being the lowest and highest orders of nodes in MF \ MB, respectively. Define Mt(Jmin−1) = M and set j = Jmin. The growth kernel comprises the following steps, proceeding from j = Jmin to j = Jmax:
1) Propose a model M′ by selecting a set of nodes from Cj(Mt(j−1)) through the kernel qGrow,j(· | Mt(j−1)).

2) Compute the Metropolis-Hastings correction for M′ versus Mt(j−1). If M′ is accepted, then set Mt(j) = M′; otherwise set Mt(j) = Mt(j−1).

3) If j < Jmax, then set j = j + 1 and return to 1); otherwise proceed to 4).

4) Set Mt = Mt(Jmax).
The pruning step is defined in a similar fashion; however, it starts at order j = Jmax and proceeds down to j = Jmin. Let Ej(M′) ⊂ E(M′) be the set of nodes of order j that can be removed from the model to produce a WFM. Define Mt(Jmax+1) = M and set j = Jmax. The pruning kernel comprises the following steps:

1) Propose a model M′ by selecting a set of nodes from Ej(Mt(j+1)) through the kernel qPrune,j(· | Mt(j+1)).

2) Compute the Metropolis-Hastings correction for M′ versus Mt(j+1). If M′ is accepted, then set Mt(j) = M′; otherwise set Mt(j) = Mt(j+1).

3) If j > Jmin, then set j = j − 1 and return to Step 1); otherwise proceed to Step 4).

4) Set Mt = Mt(Jmin).
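The growth sweep above can be sketched as follows. This is an illustration only: `C_j` and `log_post` are assumed helpers of our own naming, and the acceptance step uses a simplified posterior ratio in place of the full Metropolis-Hastings correction described in the text:

```python
import math
import random

def grow_sweep(model, j_min, j_max, C_j, log_post):
    """Sweep order classes j = j_min..j_max, proposing one order-j node per step."""
    current = set(model)
    for j in range(j_min, j_max + 1):
        candidates = C_j(current, j)       # order-j children of the current model
        if not candidates:
            continue
        proposal = current | {random.choice(sorted(candidates))}
        # Simplified acceptance: posterior ratio only; the kernel in the text also
        # carries the proposal-probability correction terms.
        if random.random() < math.exp(min(0.0, log_post(proposal) - log_post(current))):
            current = proposal
    return current

# Toy illustration: one candidate node per order, posterior strongly favoring growth.
C_j = lambda M, j: {f"t{j}"} - M
log_post = lambda M: 10.0 * len(M)
print(grow_sweep(set(), 1, 3, C_j, log_post))  # the set {'t1', 't2', 't3'}
```

The pruning sweep is the mirror image, iterating j downward and drawing from the order-j extreme set instead.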
It is clear that the growing and pruning steps are reverse kernels of each other. Pruning and growing can also be combined for each j. The forward kernel proceeds from j = Jmin to j = Jmax and proposes sets of nodes from Cj(M) ∪ Ej(M) to add or remove. The reverse kernel simply reverses the direction of j, proceeding from j = Jmax to j = Jmin.

4.5 Simulation Study
To study the operating characteristics of the proposed priors, a simulation experiment was designed with three goals. First, the priors are characterized by how the posterior distributions are affected by the sample size and the signal-to-noise ratio (SNR). Second, given the SNR level, the influence of the allocation of the signal across the terms in the model is investigated. Third, performance is assessed when the true model has special points on the scale (McCullagh & Nelder, 1989), i.e., when the true model has coefficients equal to zero for some lower-order terms in the polynomial hierarchy.
With these goals in mind, sets of predictors and responses are generated under various experimental conditions. The model space is defined with MB being the intercept-only model and MF being the complete order-four polynomial surface in five main effects, which has 126 nodes. The entries of the matrix of main effects are generated as independent standard normals. The response vectors are drawn from the n-variate normal distribution as y ~ Nn(Z_MT(X) β_γ, In), where MT is the true model and In is the n × n identity matrix.
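The data-generating step just described can be sketched as follows; the particular true model (terms x1, x2, and x1·x2) and all names are ours, chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 130
X = rng.standard_normal((n, 5))                              # five iid standard-normal main effects
Z = np.column_stack([X[:, 0], X[:, 1], X[:, 0] * X[:, 1]])   # design Z_MT(X) for the toy M_T
beta = np.array([1.0, 1.0, 1.0])
y = Z @ beta + rng.standard_normal(n)                        # y ~ N_n(Z_MT(X) beta, I_n)
print(y.shape)  # (130,)
```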
The sample sizes considered are n ∈ {130, 260, 1040}, which ensures that ZMF(X) is of full rank. The cardinality of this model space is |M| > 1.2 × 10^22, which makes enumeration of all models unfeasible. Because the value of the 2k-th moment of the standard normal distribution increases with k = 1, 2, ..., higher-order terms by construction have a larger variance than their ancestors. As such, assuming equal values for all coefficients, higher-order terms necessarily contain more "signal" than the lower-order terms from which they inherit (e.g., x1^2 has more signal than x1). Once a higher-order term is selected, its entire ancestry is also included. Therefore, to prevent the simulation results from being overly optimistic (because of the larger signals from the higher-order terms), sphering is used to calculate meaningful values of the coefficients, ensuring that the signal is of the magnitude intended in any given direction. Given the results of the simulations from Section 4.3.3, only the HOP with a = 1, b = ch is considered, with the EPP included for comparison.
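A minimal sketch of the sphering idea: rescaling a target coefficient vector by the inverse Cholesky factor of the term covariance, so the realized signal has the intended magnitude regardless of the larger variances of the higher-order terms. The construction shown is generic and may differ in detail from the dissertation's:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5000, 3))
Z = np.column_stack([X, X[:, 0] ** 2, X[:, 0] * X[:, 1]])  # higher-order terms have larger variance
S = np.cov(Z, rowvar=False)                                 # covariance of the model terms
L = np.linalg.cholesky(S)                                   # S = L L'
beta0 = np.ones(Z.shape[1])                                 # intended signal, equal in every direction
beta = np.linalg.solve(L.T, beta0)                          # "sphered" coefficients
signal_var = beta @ S @ beta                                 # equals beta0' beta0 by construction
print(round(float(signal_var), 6), float(beta0 @ beta0))    # 5.0 5.0
```

Without the rescaling, `beta0 @ S @ beta0` would be inflated by the high-variance squared and interaction columns.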
The total number of combinations of SNR, sample size, regression coefficient values, and nodes in MT amounts to 108 different scenarios. Each scenario was run with 100 independently generated datasets, and the mean behavior across the samples was observed. The results presented in this section correspond to the median probability model (MPM) from each of the 108 simulation scenarios considered. Figure 4-7 shows the comparison between the two priors for the mean number of true positive (TP) and false positive (FP) terms. Although some of the scenarios consider true models that are not well-formulated, the smallest well-formulated model that stems from MT is always the one shown in Figure 4-6.
Figure 4-6. MT: DAG of the largest true model used in simulations
The results are summarized in Figure 4-7. Each point on the horizontal axis corresponds to the average for a given set of simulation conditions. Only labels for the SNR and sample size are included for clarity, but the results are also shown for the different values of the regression coefficients and the different true models considered. Additional details about the procedure and other results are included in the appendices.

4.5.1 SNR and Sample Size Effect
As expected, small sample sizes conditioned upon a small SNR impair the ability of the algorithm to detect true coefficients under both the EPP and the HOP(1, ch), with this effect being greater for the latter prior. However, considering the mean number of TPs jointly with the number of FPs, it is clear that although the number of TPs is especially low with the HOP(1, ch), most of the few predictors that are discovered in fact belong to the true model. In terms of FPs, the HOP(1, ch) does better than the EPP, and even more so when both the sample size and the SNR are smallest. Finally, when either the SNR or the sample size is large, the performance in terms of TPs is similar between the two priors, but the number of FPs is somewhat lower with the HOP.

Figure 4-7. Average true positives (TP) and average false positives (FP) in all simulated scenarios for the median probability model with EPP and HOP(1, ch)

4.5.2 Coefficient Magnitude
Three ways to allocate the signal across predictors are considered. In the first, all coefficients contain the same amount of signal, regardless of their order. In the second, each order-one coefficient contains twice as much signal as any order-two coefficient, and four times as much as any order-three coefficient. Finally, in the third, each order-one coefficient contains half as much signal as any order-two coefficient, and a quarter of what any order-three coefficient has. These choices are denoted by β(1) = c(1o1, 1o2, 1o3), β(2) = c(1o1, 0.5o2, 0.25o3), and β(3) = c(0.25o1, 0.5o2, 1o3), respectively. In Figure 4-7, the first four scenarios correspond to simulations with β(1), the next four use β(2), the next four correspond to β(3), and then the values are cycled in the same way. The results show that scenarios using either β(1) or β(3) behave similarly, contrasting with the negative impact of having the highest signal in the order-one terms through β(2). In Figure 4-7, the effect of using β(2) is evident, as it corresponds to the lowest values for the TPs, regardless of the sample size, the SNR, or the prior used. This is an intuitive result, since giving more signal to higher-order terms makes it easier to detect them, and consequently, by strong heredity, the algorithm will also select the corresponding lower-order terms included in the true model.

4.5.3 Special Points on the Scale
Four true models were considered: (1) the model from Figure 4-6 (MT1); (2) the model without the order-one terms (MT2); (3) the model without the order-two terms (MT3); and (4) the model without x1^2 and x2x5 (MT4). The last three are clearly not well-formulated. In Figure 4-7, the leftmost point on the horizontal axis corresponds to scenarios with MT1, the next point is for scenarios with MT2, followed by those with MT3, then those with MT4, then MT1 again, and so on. In comparison to the EPP, the HOP(1, ch) tightly controls the inclusion of FPs by choosing smaller models, at the expense of also reducing the TP count, especially when there is more uncertainty about the true model (i.e., SNR = 0.25). For both prior structures, the results indicate that at low SNR levels the presence of special points has no apparent impact, as the selection behavior is similar across the four models in terms of both TPs and FPs. An interesting observation is that the effect of having special points on the scale is vastly magnified whenever the coefficients that assign more weight to order-one terms (β(2)) are used.

4.6 Case Study: Ozone Data Analysis
This section uses the ozone data from Breiman & Friedman (1985) and follows the analysis performed by Liang et al. (2008), who investigated hyper-g priors. After removing observations with missing values, 330 observations remain, including daily measurements of maximum ozone concentration near Los Angeles and eight meteorological variables (Table 4-5). From the 330 observations, 165 were sampled at random without replacement and used to run the variable selection procedure; the remaining 165 were used for validation. The eight meteorological variables, their interactions, and their squared terms are used as predictors, resulting in a full model with 44 predictors. The model space assumes that the base model MB is the intercept-only model and that MF is the quadratic surface in the eight meteorological variables. The model space contains approximately 7.1 billion models, and computation of all model posterior probabilities is not feasible.
Table 4-5. Variables used in the analyses of the ozone contamination dataset

Name    Description
ozone   Daily maximum 1-hour-average ozone (ppm) at Upland, CA
vh      500-millibar pressure height (m) at Vandenberg AFB
wind    Wind speed (mph) at LAX
hum     Humidity (%) at LAX
temp    Temperature (F) measured at Sandburg, CA
ibh     Inversion base height (ft) at LAX
dpg     Pressure gradient (mm Hg) from LAX to Daggett, CA
vis     Visibility (miles) measured at LAX
ibt     Inversion base temperature (F) at LAX
The HOP, HUP, and HIP with a = 1 and b = ch, as well as the EPP, are considered for comparison purposes. To obtain the Bayes factors in Equation 3–3, four different mixtures of g-priors are utilized: intrinsic priors (IP), which yield the expression in Equation 3–2; hyper-g (HG) priors (Liang et al., 2008) with hyper-parameters α = 2, β = 1 and α = β = 1; and Zellner-Siow (ZS) priors (Zellner & Siow, 1980). The results were extracted for the median probability model (MPM). Additionally, the model is estimated using the R package hierNet (Bien et al., 2013) to compare the model selection results to those obtained using the hierarchical lasso (Bien et al., 2013), restricted to well-formulated models by imposing the strong heredity constraint. The procedures were assessed on the basis of their predictive accuracy on the validation dataset.
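The predictive assessment can be sketched as follows: fit the selected terms by least squares on the training half and compute the RMSE on the held-out half. The data below are synthetic stand-ins for the ozone dataset; only the 165/165 split mirrors the analysis, and the variable names follow Table 4-5:

```python
import numpy as np

def validation_rmse(X_train, y_train, X_valid, y_valid):
    """Least-squares fit (with intercept) on the training half; RMSE on the validation half."""
    A = np.column_stack([np.ones(len(X_train)), X_train])
    coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    pred = np.column_stack([np.ones(len(X_valid)), X_valid]) @ coef
    return float(np.sqrt(np.mean((y_valid - pred) ** 2)))

# Synthetic stand-ins, for shape only; a real run would use the ozone measurements.
rng = np.random.default_rng(2)
hum, ibt = rng.standard_normal(330), rng.standard_normal(330)
ozone = 2.0 * hum + hum * ibt + rng.standard_normal(330)
terms = np.column_stack([hum, ibt, hum * ibt])              # a candidate MPM: hum, ibt, hum*ibt
rmse = validation_rmse(terms[:165], ozone[:165], terms[165:], ozone[165:])
print(rmse > 0.0)  # True
```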
Among all models, the one that yields the smallest RMSE is the median probability model obtained using the HOP and EPP with the ZS prior, and also using the HOP with both HG priors (Table 4-6). The HOP model with the intrinsic prior has all the terms contained in the lowest-RMSE model with the exception of dpg^2, which has a relatively high marginal inclusion probability of 0.46. This disparity between the IP and the other mixtures of g-priors is explained by the fact that the IP induces less posterior shrinkage than the ZS and HG priors. The MPMs obtained through the HUP and HIP are nested in the best model, suggesting that these model space priors penalize complexity too much and result in false negatives. Consideration of these MPMs suggests that the HOP is best at producing true positives while controlling for false positives.
Finally, the model obtained from the hierarchical lasso (hierNet) is the largest model and produces the second-largest RMSE. All of the terms contained in any of the other models, except for vh, are nested within the hierarchical lasso model, and most of the terms that are exclusive to this model receive extremely low marginal inclusion probabilities under any of the model priors and parameter priors considered in the Bayesian model selection.
Table 4-6. Median probability models (MPM) from different combinations of parameter and model priors vs. the model selected using the hierarchical lasso

BF     Prior    Model                                                R^2     RMSE
IP     EPP      hum, dpg, ibt, hum^2, hum*dpg, hum*ibt, dpg^2, ibt^2 0.8054  4.2739
IP     HIP      hum, ibt, hum^2, hum*ibt, ibt^2                      0.7740  4.3396
IP     HOP      hum, dpg, ibt, hum^2, hum*ibt, ibt^2                 0.7848  4.3175
IP     HUP      hum, dpg, ibt, hum*ibt, ibt^2                        0.7767  4.3508
ZS     EPP      hum, dpg, ibt, hum^2, hum*ibt, dpg^2, ibt^2          0.7896  4.2518
ZS     HIP      hum, ibt, hum*ibt, ibt^2                             0.7525  4.3505
ZS     HOP      hum, dpg, ibt, hum^2, hum*ibt, dpg^2, ibt^2          0.7896  4.2518
ZS     HUP      hum, dpg, ibt, hum*ibt, ibt^2                        0.7767  4.3508
HG11   EPP      vh, hum, dpg, ibt, hum^2, hum*ibt, dpg^2             0.7701  4.3049
HG11   HIP      hum, ibt, hum*ibt, ibt^2                             0.7525  4.3505
HG11   HOP      hum, dpg, ibt, hum^2, hum*ibt, dpg^2, ibt^2          0.7896  4.2518
HG11   HUP      hum, dpg, ibt, hum*ibt, ibt^2                        0.7767  4.3508
HG21   EPP      hum, dpg, ibt, hum^2, hum*ibt, dpg^2                 0.7701  4.3037
HG21   HIP      hum, dpg, ibt, hum*ibt, ibt^2                        0.7767  4.3508
HG21   HOP      hum, dpg, ibt, hum^2, hum*ibt, dpg^2, ibt^2          0.7896  4.2518
HG21   HUP      hum, dpg, ibt, hum*ibt                               0.7526  4.4036
       HierNet  hum, temp, ibh, dpg, ibt, vis, hum^2, hum*ibt,       0.7651  4.3680
                temp^2, temp*ibt, dpg^2
4.7 Discussion

Scott & Berger (2010) noted that the Ockham's-razor effect found automatically in Bayesian variable selection through the Bayes factor does not correct for multiple testing. The Bayes factor penalizes the complexity of the alternative model according to the number of parameters in excess of those of the null model; therefore, the Bayes factor only controls complexity in a pairwise fashion. If the model selection procedure uses equal prior probabilities for all M ∈ M, then these comparisons ignore the multiplicity of the testing problem. This is where the role of the prior on the model space becomes important: the multiplicity penalty is "hidden away" in the model prior probabilities π(M|M).

In addition to the multiplicity of the testing problem, disregarding the hierarchical polynomial structure of the predictors in model selection procedures has the potential to lead to different results depending on how the predictors are set up (e.g., in what units they are expressed).

In this chapter we investigated a solution to these two issues. We defined prior structures for well-formulated models and developed random walk algorithms to traverse this type of model space. The key to understanding prior distributions on the space of WFMs is the hierarchical nature of the model space itself. The prior distributions described take advantage of that hierarchy in two ways. First, conditional independence and immediate inheritance are used to develop the HOP, HIP, and HUP structures discussed in Section 4.3. Second, the conditional nature of the priors allows for the direct incorporation of complexity penalizations. Of the priors proposed, the HOP with the hyperparameter choice (1, ch) provides the best control of false positives while maintaining a reasonable true positive rate; thus, this prior is recommended as the default prior on the space of WFMs.
In the near future, the software developed to carry out a Metropolis-Hastings random walk on the space of WFMs will be integrated into the R package varSelectIP. These new functions implement various local priors for the regression coefficients, including the intrinsic prior, the Zellner-Siow prior, and hyper-g priors. In addition, the software supports the computation of credible sets for each regression coefficient, conditioned on the selected model as well as under model averaging.
CHAPTER 5
CONCLUSIONS
Ecologists are now embracing the use of Bayesian methods to investigate the interactions that dictate the distribution and abundance of organisms. These tools are both powerful and flexible: they allow integrating, under a single methodology, empirical observations and theoretical process models, and they can seamlessly account for several sources of uncertainty and dependence. The estimation and testing methods proposed throughout this document will contribute to the understanding of Bayesian methods used in ecology, and hopefully they will shed light on the differences between Bayesian estimation and testing tools.

All of our contributions exploit the potential of the latent variable formulation. This approach greatly simplifies the analysis of complex models: it redirects the bulk of the inferential burden away from the original response variables and places it on the easy-to-work-with latent scale, for which several time-tested approaches are available. Our methods are distinctly classified into estimation and testing tools.
For estimation, we proposed a Bayesian specification of the single-season occupancy model for which a Gibbs sampler is available using both logit and probit link functions. This setup allows detection and occupancy probabilities to depend on linear combinations of predictors. We then developed a dynamic version of this approach, incorporating the notion that occupancy at a previously occupied site depends both on the survival of current settlers and on habitat suitability. Additionally, because these dynamics also vary in space, we suggested a strategy to add spatial dependence among neighboring sites.
Ecological inquiry usually involves competing explanations, and uncertainty surrounds the decision of choosing any one of them. Hence, a model, or a set of probable models, should be selected from all the viable alternatives. To address this testing problem, we proposed an objective and fully automatic Bayesian methodology for the single-season site-occupancy model. Our approach relies on the intrinsic prior, which avoids introducing (commonly unavailable) subjective information into the model. In simulation experiments, we observed that the method accurately singles out the predictors present in the true model using their marginal posterior inclusion probabilities. For predictors in the true model, these probabilities were comparatively larger than those for predictors not present in the true model. The simulations also indicated that the method provides better discrimination for predictors in the detection component of the model.
In our simulations and in the analysis of the Blue Hawker data, we observed that the effect of using the multiplicity correction prior was substantial. This occurs because the Bayes factor only penalizes the complexity of the alternative model according to its number of parameters in excess of those of the null model. As the number of predictors grows, the number of models in the model space also grows, increasing the chances of making false positive decisions on the inclusion of predictors. This is where the role of the prior on the model space becomes important: the multiplicity penalty is "hidden away" in the model prior probabilities π(M|M). In addition to the multiplicity of the testing problem, disregarding the hierarchical polynomial structure of the predictors in model selection procedures has the potential to lead to different results depending on how the predictors are coded (e.g., in what units they are expressed).
To confront this situation, we proposed three prior structures for well-formulated models that take advantage of the hierarchical structure of the predictors. Of the priors proposed, we recommend the HOP with the hyperparameter choice (1, ch), which provides the best control of false positives while maintaining a reasonable true positive rate.
Overall, considering the flexibility of the latent approach, several other extensions of these methods follow. Currently, we envision three future developments: (1) occupancy models that incorporate various sources of information; (2) multi-species models that make use of spatial and interspecific dependence; and (3) methods to conduct model selection for the dynamic and spatially explicit version of the model.
APPENDIX A
FULL CONDITIONAL DENSITIES: DYMOSS

In this appendix we present the full conditional probability density functions for all the parameters involved in the DYMOSS model, using probit as well as logit links.
Sampler Z

The full conditionals for the presence indicators have the same form regardless of the link used. These are derived separately for the cases t = 1, 1 < t < T, and t = T, since their corresponding probabilities take on slightly different forms.

Let φ(ν | μ, σ^2) represent the density of a normal random variable ν with mean μ and variance σ^2, and recall that ψ_i1 = F(x′_(o)i α) and p_ijt = F(q′_ijt λ_t), where F(·) is the inverse link function. The full conditional for z_it is given by:

1. For t = 1:

π(z_i1 | v_i1, α, λ_1, β^c_1, δ^s_1) = (ψ*_i1)^{z_i1} (1 − ψ*_i1)^{1 − z_i1} = Bernoulli(ψ*_i1),   (A–1)

where

ψ*_i1 = [ ψ_i1 φ(v_i1 | x′_i1 β^c_1 + δ^s_1, 1) ∏_{j=1}^{J_i1} (1 − p_ij1) ] /
        [ ψ_i1 φ(v_i1 | x′_i1 β^c_1 + δ^s_1, 1) ∏_{j=1}^{J_i1} (1 − p_ij1) + (1 − ψ_i1) φ(v_i1 | x′_i1 β^c_1, 1) ∏_{j=1}^{J_i1} I{y_ij1 = 0} ].

2. For 1 < t < T:

π(z_it | z_i(t−1), z_i(t+1), λ_t, β^c_{t−1}, δ^s_{t−1}) = (ψ*_it)^{z_it} (1 − ψ*_it)^{1 − z_it} = Bernoulli(ψ*_it),   (A–2)

where

ψ*_it = [ κ_it ∏_{j=1}^{J_it} (1 − p_ijt) ] / [ κ_it ∏_{j=1}^{J_it} (1 − p_ijt) + ∇_it ∏_{j=1}^{J_it} I{y_ijt = 0} ],

with

(a) κ_it = F(x′_i(t−1) β^c_{t−1} + z_i(t−1) δ^s_{t−1}) φ(v_it | x′_it β^c_t + δ^s_t, 1), and
(b) ∇_it = (1 − F(x′_i(t−1) β^c_{t−1} + z_i(t−1) δ^s_{t−1})) φ(v_it | x′_it β^c_t, 1).

3. For t = T:

π(z_iT | z_i(T−1), λ_T, β^c_{T−1}, δ^s_{T−1}) = (ψ⋆_iT)^{z_iT} (1 − ψ⋆_iT)^{1 − z_iT} = Bernoulli(ψ⋆_iT),   (A–3)

where

ψ⋆_iT = [ κ⋆_iT ∏_{j=1}^{J_iT} (1 − p_ijT) ] / [ κ⋆_iT ∏_{j=1}^{J_iT} (1 − p_ijT) + ∇⋆_iT ∏_{j=1}^{J_iT} I{y_ijT = 0} ],

with

(a) κ⋆_iT = F(x′_i(T−1) β^c_{T−1} + z_i(T−1) δ^s_{T−1}), and
(b) ∇⋆_iT = 1 − F(x′_i(T−1) β^c_{T−1} + z_i(T−1) δ^s_{T−1}).

Sampler u_i

π(u_i | z_i1, α) = tr N(x′_(o)i α, 1, trunc(z_i1)),  where  trunc(z_i1) = (−∞, 0] if z_i1 = 0 and (0, ∞) if z_i1 = 1,   (A–4)

and tr N(μ, σ^2, A) denotes the pdf of a truncated normal random variable with mean μ, variance σ^2, and truncation region A.

Sampler α

π(α | u) ∝ [α] ∏_{i=1}^{N} φ(u_i | x′_(o)i α, 1).   (A–5)

If [α] ∝ 1, then

α | u ~ N(m(α), Σ_α),

with m(α) = Σ_α X′_(o) u and Σ_α = (X′_(o) X_(o))^{−1}.

Sampler v_it

For t > 1:

π(v_i(t−1) | z_i(t−1), z_it, β^c_{t−1}, δ^s_{t−1}) = tr N(μ^(v)_i(t−1), 1, trunc(z_it)),   (A–6)

where μ^(v)_i(t−1) = x′_i(t−1) β^c_{t−1} + z_i(t−1) δ^s_{t−1}, and trunc(z_it) defines the corresponding truncation region given by z_it.

Sampler (β^c_{t−1}, δ^s_{t−1})

For t > 1:

π(β^c_{t−1}, δ^s_{t−1} | v_{t−1}, z_{t−1}) ∝ [β^c_{t−1}, δ^s_{t−1}] ∏_{i=1}^{N} φ(v_i(t−1) | x′_i(t−1) β^c_{t−1} + z_i(t−1) δ^s_{t−1}, 1).   (A–7)

If [β^c_{t−1}, δ^s_{t−1}] ∝ 1, then

(β^c_{t−1}, δ^s_{t−1}) | v_{t−1}, z_{t−1} ~ N(m(β^c_{t−1}, δ^s_{t−1}), Σ_{t−1}),

with m(β^c_{t−1}, δ^s_{t−1}) = Σ_{t−1} X̃′_{t−1} v_{t−1} and Σ_{t−1} = (X̃′_{t−1} X̃_{t−1})^{−1}, where X̃_{t−1} = (X_{t−1}, z_{t−1}).

Sampler w_ijt

For t > 1 and z_it = 1:

π(w_ijt | z_it = 1, y_ijt, λ) = tr N(q′_ijt λ_t, 1, trunc(y_ijt)).   (A–8)

Sampler λ_t

For t = 1, 2, ..., T:

π(λ_t | z_t, w_t) ∝ [λ_t] ∏_{i: z_it = 1} ∏_{j=1}^{J_it} φ(w_ijt | q′_ijt λ_t, 1).   (A–9)

If [λ_t] ∝ 1, then

λ_t | w_t, z_t ~ N(m(λ_t), Σ_{λ_t}),

with m(λ_t) = Σ_{λ_t} Q′_t w_t and Σ_{λ_t} = (Q′_t Q_t)^{−1}, where Q_t and w_t, respectively, are the design matrix and the vector of latent variables for surveys of sites such that z_it = 1.
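The probit-link samplers above follow the standard data-augmentation pattern: a truncated-normal draw for the latent variable given the binary indicator, then a multivariate normal draw for the coefficients under a flat prior. Below is a generic sketch of one such scan, in the spirit of Samplers u_i and α; this is an Albert-Chib style update in our own notation, not the dissertation's code:

```python
import numpy as np
from scipy import stats

def gibbs_probit_step(X, z, alpha, rng):
    """One scan for probit regression z_i ~ Bernoulli(F(x_i' alpha)), F the normal cdf."""
    mean = X @ alpha
    # u_i | z_i, alpha ~ N(x_i' alpha, 1) truncated to (0, inf) if z_i = 1, (-inf, 0] if z_i = 0
    lo = np.where(z == 1, 0.0, -np.inf)
    hi = np.where(z == 1, np.inf, 0.0)
    u = stats.truncnorm.rvs(lo - mean, hi - mean, loc=mean, scale=1.0, random_state=rng)
    # alpha | u ~ N((X'X)^{-1} X'u, (X'X)^{-1}) under a flat prior on alpha
    Sigma = np.linalg.inv(X.T @ X)
    return rng.multivariate_normal(Sigma @ X.T @ u, Sigma)

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 2))
z = (X[:, 0] > 0).astype(int)
alpha = gibbs_probit_step(X, z, np.zeros(2), rng)
print(alpha.shape)  # (2,)
```

The DYMOSS samplers apply the same two moves block by block (occupancy, survival/colonization, and detection components).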
APPENDIX B
RANDOM WALK ALGORITHMS
Global Jump. From the current state M, a global jump is performed by drawing a model M′ at random from the model space. This is achieved by beginning at the base model and increasing the order from J^min_M to J^max_M, the minimum and maximum orders of nodes in MF \ MB; at each order, a set of nodes is selected at random from the prior, conditioned on the nodes already in the model. The MH correction is

α = min{ 1, m(y | M′, M) / m(y | M, M) }.
Local Jump. From the current state M, a local jump is performed by drawing a model from the set L(M) = {M_α : α ∈ E(M) ∪ C(M)}, where M_α is M \ {α} for α ∈ E(M) and M ∪ {α} for α ∈ C(M). The proposal probabilities are computed as a mixture of p(M′ | y, M, M′ ∈ L(M)) and the discrete uniform distribution on L(M). The proposal kernel is

q(M′ | y, M, M′ ∈ L(M)) = (1/2) [ p(M′ | y, M, M′ ∈ L(M)) + 1/|L(M)| ].

This choice promotes moving to better models while maintaining a non-negligible probability of moving to any of the possible models. The MH correction is

α = min{ 1, [m(y | M′, M) / m(y | M, M)] · [q(M | y, M′, M ∈ L(M′)) / q(M′ | y, M, M′ ∈ L(M))] }.
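The local-jump proposal kernel can be sketched as follows; `neighbors` stands for L(M) and `post_weights` for unnormalized values of p(· | y, M, · ∈ L(M)), both assumed supplied by the caller:

```python
import random

def local_proposal(neighbors, post_weights):
    """Draw M' from q(.|M) = 0.5 p(.|y, L(M)) + 0.5 uniform(L(M)); return (M', q(M'))."""
    total = sum(post_weights[Mp] for Mp in neighbors)
    q = {Mp: 0.5 * post_weights[Mp] / total + 0.5 / len(neighbors) for Mp in neighbors}
    Mp = random.choices(list(q), weights=list(q.values()))[0]
    return Mp, q[Mp]

# Two-model neighborhood with unnormalized posterior weights 3:1.
Mp, prob = local_proposal(["A", "B"], {"A": 3.0, "B": 1.0})
print(prob in (0.625, 0.375))  # True: q(A) = 0.625, q(B) = 0.375
```

The returned proposal probability, together with the reverse-direction probability q(M | y, M′), supplies the ratio needed in the MH correction above.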
Intermediate Jump. The intermediate jump is performed by increasing or decreasing the order of the nodes under consideration, performing local proposals based on order. For a model M′, define L_j(M′) = {M′} ∪ { M′_α : α ∈ (E(M′) ∪ C(M′)), α of order j }. From a state M, the kernel chooses at random whether to increase or decrease the order. If M = M_F, then decreasing the order is chosen with probability 1, and if M = M_B, then increasing the order is chosen with probability 1; in all other cases, the probability of increasing and of decreasing the order is 1/2. The proposal kernels are given below.
Increasing order proposal kernel:

1. Set j = J^min_M − 1 and M′_j = M.

2. Draw M′_{j+1} from q_{inc,j+1}( · | y, ℳ, M′ ∈ L_{j+1}(M′_j)), where

    q_{inc,j+1}(M′ | y, ℳ, M′ ∈ L_{j+1}(M′_j)) = (1/2) [ p(M′ | y, ℳ, M′ ∈ L_{j+1}(M′_j)) + 1/|L_{j+1}(M′_j)| ].

3. Set j = j + 1.

4. If j < J^max_M, return to 2; otherwise, proceed to 5.

5. Set M′ = M′_{J^max_M} and compute the proposal probability

    q_inc(M′ | y, ℳ) = ∏_{j = J^min_M − 1}^{J^max_M − 1} q_{inc,j+1}( M′_{j+1} | y, ℳ, M′ ∈ L_{j+1}(M′_j) ).    (B–1)
Decreasing order proposal kernel:

1. Set j = J^max_M + 1 and M′_j = M.

2. Draw M′_{j−1} from q_{dec,j−1}( · | y, ℳ, M′ ∈ L_{j−1}(M′_j)), where

    q_{dec,j−1}(M′ | y, ℳ, M′ ∈ L_{j−1}(M′_j)) = (1/2) [ p(M′ | y, ℳ, M′ ∈ L_{j−1}(M′_j)) + 1/|L_{j−1}(M′_j)| ].

3. Set j = j − 1.

4. If j > J^min_M, return to 2; otherwise, proceed to 5.

5. Set M′ = M′_{J^min_M} and compute the proposal probability

    q_dec(M′ | y, ℳ) = ∏_{j = J^min_M + 1}^{J^max_M + 1} q_{dec,j−1}( M′_{j−1} | y, ℳ, M′ ∈ L_{j−1}(M′_j) ).    (B–2)
If increasing the order is chosen, then the MH correction is given by

    α = min{ 1, [ (1 + I(M′ = M_F)) / (1 + I(M = M_B)) ] · [ q_dec(M | y, M′) / q_inc(M′ | y, M) ] · [ p(M′ | y, ℳ) / p(M | y, ℳ) ] },    (B–3)

and similarly if decreasing the order is chosen.
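As a sanity check of (B–3), the acceptance probability is a product of three ratios: the boundary-indicator factor, the reverse-to-forward proposal ratio, and the posterior ratio. The numbers below are arbitrary illustrative values, not output from the dissertation's sampler.

```python
def mh_correction_inc(q_dec_rev, q_inc_fwd, post_new, post_cur,
                      new_is_full, cur_is_base):
    """Acceptance probability (B-3) for an increasing-order move.
    The indicator factor accounts for the direction of the jump being
    forced (prob. 1 instead of 1/2) at the base and full models."""
    direction = (1 + new_is_full) / (1 + cur_is_base)
    return min(1.0, direction * (q_dec_rev / q_inc_fwd)
                    * (post_new / post_cur))

# a move starting from the base model (I(M = M_B) = 1) toward a
# model that is not the full model (I(M' = M_F) = 0)
alpha = mh_correction_inc(q_dec_rev=0.2, q_inc_fwd=0.1,
                          post_new=0.25, post_cur=0.5,
                          new_is_full=False, cur_is_base=True)
# min(1, (1/2) * 2 * 0.5) = 0.5
```
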
Other Local and Intermediate Kernels. The local and intermediate kernels described here perform a kind of stochastic forward-backward selection. Each kernel q can be relaxed to allow more than one node to be turned on or off at each step, which could provide larger jumps. The tradeoff is that the number of proposed models for such jumps could be very large, precluding the use of posterior information in the construction of the proposal kernel.
APPENDIX C
WFM SIMULATION DETAILS

Briefly, the idea is to let Z_{M_T}(X) β_{M_T} = (QR) β_{M_T} = Q η_{M_T} (i.e., β_{M_T} = R^{−1} η_{M_T}), using the QR decomposition. As such, setting all values in η_{M_T} proportional to one corresponds to distributing the signal in the model uniformly across all predictors, regardless of their order.
The (unconditional) variance of a single observation y_i is var(y_i) = var(E[y_i | z_i]) + E[var(y_i | z_i)], where z_i is the i-th row of the design matrix Z_{M_T}. Hence we take the signal-to-noise ratio for each observation to be

    SNR(η) = η′_{M_T} R^{−T} Σ_z R^{−1} η_{M_T} / σ²,

where Σ_z = var(z_i). We determine how the signal is distributed across predictors only up to a proportionality constant, so that the signal-to-noise ratio can be controlled simultaneously.
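This construction can be sketched in a few lines of Python/NumPy (the function name is ours, not the dissertation's code): a pattern vector η is rescaled so that SNR(η) hits a target value, and the implied coefficients β = R^{−1}η are returned.

```python
import numpy as np

rng = np.random.default_rng(0)

def coefficients_for_snr(Z, eta_pattern, snr_target, sigma2=1.0):
    """Scale eta ∝ eta_pattern so that
    SNR = eta' R^{-T} Sigma_z R^{-1} eta / sigma2
    equals snr_target, then return beta = R^{-1} eta (with Z = QR)."""
    _, R = np.linalg.qr(Z)                 # reduced QR of the design
    Sigma_z = np.cov(Z, rowvar=False)      # sample var(z_i)
    Rinv = np.linalg.inv(R)
    A = Rinv.T @ Sigma_z @ Rinv            # quadratic form in eta
    eta = np.asarray(eta_pattern, dtype=float)
    eta *= np.sqrt(snr_target * sigma2 / (eta @ A @ eta))
    return Rinv @ eta, eta

# toy design with three predictors; signal halved at each higher order
Z = rng.normal(size=(500, 3))
beta, eta = coefficients_for_snr(Z, [1.0, 0.5, 0.25], snr_target=1.0)
```

By construction, β′ Σ_z β / σ² equals the requested SNR exactly.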
Additionally, to investigate the ability of the model to correctly capture the hierarchical structure, we specify four different 0-1 vectors that determine the predictors in M_T, the model that generates the data in the different scenarios.
Table C-1. Experimental conditions, WFM simulations

Parameter            Values considered
SNR(η_{M_T}) = k     0.25, 1, 4
η_{M_T} ∝            (1_{o1}, 0.5_{o2}, 0.25_{o3}), (1_{o1}, 1_{o2}, 1_{o3}), (0.25_{o1}, 0.5_{o2}, 1_{o3})
γ_{M_T}              (1, 1_3, 1_4, 1_2), (1, 0_3, 1_4, 1_2), (1, 1_3, 0_4, 1_2), (1, 1_3, (0,1,1,0), 1_2)
n                    130, 260, 1040

Here the subscripts o1, o2, o3 index the order of the terms, and 1_k (0_k) denotes a block of k ones (zeros); the four γ_{M_T} vectors correspond to the full WFM and the three special-point models described below.
The results presented below are somewhat different from those found in the main body of the article in Section 5. They are obtained by averaging the number of FPs, TPs, and model sizes, respectively, over the 100 independent runs and across the corresponding scenarios for the 20 highest-posterior-probability models.
SNR and Sample Size Effect
In terms of the SNR and the sample size (Figure C-1), we observe that, as expected, small sample sizes combined with a small SNR impair the ability of the algorithm to detect true coefficients under both the EPP and the HOP(1, ch), with this effect more pronounced under the latter prior. However, considering the mean number of true positives (TP) jointly with the mean model size, it is clear that although the sensitivity is low, most of the few predictors that are discovered belong to the true model. The results observed with an SNR of 0.25 and a relatively small sample size are far from impressive; however, real problems where the SNR is as low as 0.25 will yield many spurious associations under the EPP, so the strong protection against false positives provided by the HOP(1, ch) is commendable in itself. An SNR of 1 also represents a feeble relationship between the predictors and the response; nonetheless, the method captures approximately half of the true coefficients while including very few false positives. Following intuition, as either the sample size or the SNR increases, the algorithm's performance is greatly enhanced: either a large sample size or a large SNR yields models that contain mostly true predictors. Additionally, the HOP(1, ch) provides strong control over the number of false positives; therefore, for high SNR or larger sample sizes, the number of predictors in the top 20 models is close to the size of the true model. In general, the EPP allows the detection of more TPs, while the HOP(1, ch) provides stronger control on the number of FPs included when small sample sizes are combined with small SNRs. As either the sample size or the SNR grows, the differences between the two priors become indistinct.
Figure C-1. SNR vs. n: average model size, average true positives, and average false positives for all simulated scenarios, by model ranking according to model posterior probabilities.
Coefficient Magnitude
This part of the experiment explores the effect of how the signal is distributed across predictors. As mentioned before, sphering is used to assign the coefficient values in a manner that controls the amount of signal that goes into each coefficient. Three possible ways to allocate the signal are considered: first, each order-one coefficient contains twice as much signal as any order-two coefficient and four times as much as any order-three coefficient; second, all coefficients contain the same amount of signal regardless of their order; and third, each order-one coefficient contains half as much signal as any order-two coefficient and a quarter of what any order-three coefficient has. In Figure C-2 these settings are denoted by β = c(1_{o1}, 0.5_{o2}, 0.25_{o3}), β = c(1_{o1}, 1_{o2}, 1_{o3}), and β = c(0.25_{o1}, 0.5_{o2}, 1_{o3}), respectively.
Observe that with the HOP(1, ch) the number of FPs is insensitive to how the SNR is distributed across predictors; conversely, with the EPP the number of FPs decreases as the SNR grows, always remaining slightly higher than that obtained with the HOP. With either prior structure, the algorithm performs better whenever all coefficients are equally weighted or when the order-three terms receive higher weights; in these two cases (i.e., with β = c(1_{o1}, 1_{o2}, 1_{o3}) or β = c(0.25_{o1}, 0.5_{o2}, 1_{o3})) the effect of the SNR appears to be similar. In contrast, when more weight is given to order-one terms, the algorithm yields slightly worse models at every SNR level. This is an intuitive result: giving more signal to higher-order terms makes them easier to detect, and consequently, by strong heredity, the algorithm will also select the corresponding lower-order terms included in the true model.
Special Points on the Scale
In Nelder (1998), the author argues that the conditions under which the weak-heredity principle can be used for model selection are so restrictive that, in this context, the principle is commonly not valid in practice. In addition, the author states that considering only well-formulated models does not take into account the possible presence of special points on the scales of the predictors, that is, situations where omitting lower-order terms is justified by the nature of the data. However, it is our contention that every model has an underlying well-formulated structure; whether or not some predictor has special points on its scale will be determined through the estimation of the coefficients once a valid well-formulated structure has been chosen.
To understand how the algorithm behaves whenever the true data-generating mechanism has zero-valued coefficients for some lower-order terms in the hierarchy, four different true models are considered. Three of them are not well formulated, while the remaining one is the WFM shown in Figure 4-6. The three models that have special points correspond to the same model M_T from Figure 4-6, but have zero-valued coefficients for all the order-one terms, for all the order-two terms, and for x1² and x2x5, respectively.

Figure C-2. SNR vs. coefficient values: average model size, average true positives, and average false positives for all simulated scenarios, by model ranking according to model posterior probabilities.
As seen before, in comparison to the EPP, the HOP(1, ch) tightly controls the inclusion of FPs by choosing smaller models, at the expense of also reducing the TP count, especially when there is more uncertainty about the true model (i.e., SNR = 0.25). For both prior structures, the results in Figure C-3 indicate that at low SNR levels the presence of special points has no apparent impact, as the selection behavior is similar across the four models in terms of both TPs and FPs. As the SNR increases, the TPs and the model size are affected for true models with zero-valued lower-order terms. These differences, however, are not very large. Relatively smaller models are selected whenever some terms in the hierarchy are missing, but at high SNR, which is where the differences are most pronounced, the predictors included are mostly true coefficients. The impact is almost imperceptible for the true model that lacks order-one terms and for the model with zero coefficients for x1² and x2x5, and is more visible for the model without order-two terms. This last result is expected due to strong heredity: whenever the order-one coefficients are missing, the inclusion of order-two and order-three terms will force their selection, which is also the case when only a few order-two terms have zero-valued coefficients. Conversely, when all order-two predictors are removed, some order-three predictors are not selected, as their signal is attributed to the order-two predictors missing from the true model. This is especially the case for the order-three interaction term x1x2x5, which depends on the inclusion of three order-two terms (x1x2, x1x5, x2x5) in order for it to be included as well. This makes the inclusion of this term somewhat more challenging: the three order-two interactions capture most of the variation of the polynomial terms that is present when the order-three term is also included. However, special points on the scale commonly occur on a single covariate or at most on a few covariates; a true data-generating mechanism that removes all terms of a given order in the context of polynomial models is clearly not justified, and this was done only for comparison purposes.

Figure C-3. SNR vs. different true models M_T: average model size, average true positives, and average false positives for all simulated scenarios, by model ranking according to model posterior probabilities.
APPENDIX D
SUPPLEMENTARY INFORMATION FOR THE OZONE DATA ANALYSIS

The covariates considered for the ozone data analysis match those used in Liang et al. (2008); these are displayed in Table D-1 below.

Table D-1. Variables used in the analyses of the ozone contamination dataset

Name   Description
ozone  Daily maximum 1-hour-average ozone (ppm) at Upland, CA
vh     500-millibar pressure height (m) at Vandenberg AFB
wind   Wind speed (mph) at LAX
hum    Humidity (%) at LAX
temp   Temperature (F) measured at Sandburg, CA
ibh    Inversion base height (ft) at LAX
dpg    Pressure gradient (mm Hg) from LAX to Daggett, CA
vis    Visibility (miles) measured at LAX
ibt    Inversion base temperature (F) at LAX
The marginal posterior inclusion probability corresponds to the probability of including a given term of the full model M_F, after summing over all models in the model space. For each node α ∈ M_F, this probability is given by p_α = Σ_{M ∈ ℳ} I(α ∈ M) p(M | y, ℳ). In problems with a model space as large as the one considered for the ozone concentration problem, enumeration of the entire space is not feasible; thus, these probabilities are estimated by summing over every model drawn by the random walk over the model space ℳ.
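This estimator can be sketched directly: with the chain's visited models recorded as sets of term labels, the Monte Carlo estimate of p_α is simply the visit frequency of models containing α. The toy chain and term names below are illustrative only.

```python
import numpy as np

def inclusion_probs(draws, terms):
    """Estimate p_alpha = sum_M I(alpha in M) p(M | y, model space)
    from the models visited by the random walk; visit frequencies
    stand in for the posterior model probabilities."""
    return {a: float(np.mean([a in m for m in draws])) for a in terms}

# toy chain over three models (each draw is the set of included terms)
chain = [{"hum", "ibt"}, {"ibt"}, {"hum", "ibt"}, {"ibt", "dpg"}]
probs = inclusion_probs(chain, ["hum", "dpg", "ibt"])
print(probs)  # {'hum': 0.5, 'dpg': 0.25, 'ibt': 1.0}
```
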
Given that there are in total 44 potential predictors, for convenience Tables D-2 to D-5 below display the marginal posterior inclusion probabilities only for the terms included under at least one of the model priors considered (EPP, HIP, HUP, and HOP), for each of the parameter priors utilized (intrinsic priors, Zellner-Siow priors, Hyper-g(11), and Hyper-g(21)).
Table D-2. Marginal inclusion probabilities, intrinsic prior

Term      EPP   HIP   HUP   HOP
hum       0.99  0.69  0.85  0.76
dpg       0.85  0.48  0.52  0.53
ibt       0.99  1.00  1.00  1.00
hum^2     0.76  0.51  0.43  0.62
hum·dpg   0.55  0.02  0.03  0.17
hum·ibt   0.98  0.69  0.84  0.75
dpg^2     0.72  0.36  0.25  0.46
ibt^2     0.59  0.78  0.57  0.81

Table D-3. Marginal inclusion probabilities, Zellner-Siow prior

Term      EPP   HIP   HUP   HOP
hum       0.76  0.67  0.80  0.69
dpg       0.89  0.50  0.55  0.58
ibt       0.99  1.00  1.00  1.00
hum^2     0.57  0.49  0.40  0.57
hum·ibt   0.72  0.66  0.78  0.68
dpg^2     0.81  0.38  0.31  0.51
ibt^2     0.54  0.76  0.55  0.77

Table D-4. Marginal inclusion probabilities, Hyper-g(11)

Term      EPP   HIP   HUP   HOP
vh        0.54  0.05  0.10  0.11
hum       0.81  0.67  0.80  0.69
dpg       0.90  0.50  0.55  0.58
ibt       0.99  1.00  0.99  0.99
hum^2     0.61  0.49  0.40  0.57
hum·ibt   0.78  0.66  0.78  0.68
dpg^2     0.83  0.38  0.30  0.51
ibt^2     0.49  0.76  0.54  0.77

Table D-5. Marginal inclusion probabilities, Hyper-g(21)

Term      EPP   HIP   HUP   HOP
hum       0.79  0.64  0.73  0.67
dpg       0.90  0.52  0.60  0.59
ibt       0.99  1.00  0.99  1.00
hum^2     0.60  0.47  0.37  0.55
hum·ibt   0.76  0.64  0.71  0.67
dpg^2     0.82  0.41  0.36  0.52
ibt^2     0.47  0.73  0.49  0.75
REFERENCES

Akaike, H. (1983). Information measures and model selection. Bulletin of the International Statistical Institute, 50, 277–290.

Albert, J. H., & Chib, S. (1993). Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association, 88(422), 669–679.

Berger, J., & Bernardo, J. (1992). On the development of reference priors. In Bayesian Statistics 4 (pp. 35–60).

Berger, J., & Pericchi, L. (1996). The intrinsic Bayes factor for model selection and prediction. Journal of the American Statistical Association, 91(433), 109–122.

Berger, J., Pericchi, L., & Ghosh, J. (2001). Objective Bayesian methods for model selection: Introduction and comparison. In Model Selection, vol. 38 of IMS Lecture Notes – Monograph Series (pp. 135–207). Institute of Mathematical Statistics.

Besag, J., York, J., & Mollié, A. (1991). Bayesian image restoration, with two applications in spatial statistics. Annals of the Institute of Statistical Mathematics, 43, 1–20.

Bien, J., Taylor, J., & Tibshirani, R. (2013). A lasso for hierarchical interactions. The Annals of Statistics, 41(3), 1111–1141.

Breiman, L., & Friedman, J. (1985). Estimating optimal transformations for multiple regression and correlation. Journal of the American Statistical Association, 80, 580–598.

Brusco, M. J., Steinley, D., & Cradit, J. D. (2009). An exact algorithm for hierarchically well-formulated subsets in second-order polynomial regression. Technometrics, 51(3), 306–315.

Casella, G., Girón, F. J., Martínez, M. L., & Moreno, E. (2009). Consistency of Bayesian procedures for variable selection. The Annals of Statistics, 37(3), 1207–1228.

Casella, G., Moreno, E., & Girón, F. (2014). Cluster analysis, model selection, and prior distributions on models. Bayesian Analysis, TBA, 1–46.

Chipman, H. (1996). Bayesian variable selection with related predictors. Canadian Journal of Statistics, 24(1), 17–36.

Clyde, M., & George, E. I. (2004). Model uncertainty. Statistical Science, 19(1), 81–94.

Dewey, J. (1958). Experience and Nature. New York: Dover Publications.

Dorazio, R. M., & Taylor-Rodríguez, D. (2012). A Gibbs sampler for Bayesian analysis of site-occupancy data. Methods in Ecology and Evolution, 3, 1093–1098.

Ellison, A. M. (2004). Bayesian inference in ecology. Ecology Letters, 7, 509–520.

Fiske, I., & Chandler, R. (2011). unmarked: An R package for fitting hierarchical models of wildlife occurrence and abundance. Journal of Statistical Software, 43(10).

George, E. (2000). The variable selection problem. Journal of the American Statistical Association, 95(452), 1304–1308.

Girón, F. J., Moreno, E., Casella, G., & Martínez, M. L. (2010). Consistency of objective Bayes factors for nonnested linear models and increasing model dimension. Revista de la Real Academia de Ciencias Exactas, Físicas y Naturales. Serie A. Matemáticas, 104(1), 57–67.

Good, I. J. (1950). Probability and the Weighing of Evidence. New York: Haffner.

Griepentrog, G. L., Ryan, J. M., & Smith, L. D. (1982). Linear transformations of polynomial regression models. The American Statistician, 36(3), 171–174.

Gunel, E., & Dickey, J. (1974). Bayes factors for independence in contingency tables. Biometrika, 61, 545–557.

Hanski, I. (1994). A practical model of metapopulation dynamics. Journal of Animal Ecology, 63, 151–162.

Hooten, M. (2006). Hierarchical spatio-temporal models for ecological processes. Doctoral dissertation, University of Missouri-Columbia.

Hooten, M. B., & Hobbs, N. T. (2014). A guide to Bayesian model selection for ecologists. Ecological Monographs (in press).

Hughes, J., & Haran, M. (2013). Dimension reduction and alleviation of confounding for spatial generalized linear mixed models. Journal of the Royal Statistical Society, Series B, 75, 139–159.

Hurvich, C. M., & Tsai, C.-L. (1989). Regression and time series model selection in small samples. Biometrika, 76, 297–307.

Jeffreys, H. (1935). Some tests of significance, treated by the theory of probability. Proceedings of the Cambridge Philosophical Society, 31, 203–222.

Jeffreys, H. (1961). Theory of Probability (3rd ed.). London: Oxford University Press.

Johnson, D., Conn, P., Hooten, M., Ray, J., & Pond, B. (2013). Spatial occupancy models for large data sets. Ecology, 94(4), 801–808.

Kass, R., & Wasserman, L. (1995). A reference Bayesian test for nested hypotheses and its relationship to the Schwarz criterion. Journal of the American Statistical Association, 90(431).

Kass, R. E., & Raftery, A. E. (1995). Bayes factors. Journal of the American Statistical Association, 90, 773–795.

Kass, R. E., & Wasserman, L. (1996). The selection of prior distributions by formal rules. Journal of the American Statistical Association, 91(435), 1343.

Kéry, M. (2010). Introduction to WinBUGS for Ecologists: Bayesian Approach to Regression, ANOVA, Mixed Models and Related Analyses (1st ed.). Academic Press.

Kéry, M., Gardner, B., & Monnerat, C. (2010). Predicting species distributions from checklist data using site-occupancy models. Journal of Biogeography, 37(10), 1851–1862.

Khuri, A. (2002). Nonsingular linear transformations of the control variables in response surface models. Technical report.

Krebs, C. J. (1972). Ecology: The Experimental Analysis of Distribution and Abundance.

Lempers, F. B. (1971). Posterior Probabilities of Alternative Linear Models. Rotterdam: University of Rotterdam Press.

León-Novelo, L., Moreno, E., & Casella, G. (2012). Objective Bayes model selection in probit models. Statistics in Medicine, 31(4), 353–365.

Liang, F., Paulo, R., Molina, G., Clyde, M. A., & Berger, J. O. (2008). Mixtures of g priors for Bayesian variable selection. Journal of the American Statistical Association, 103(481), 410–423.

Link, W., & Barker, R. (2009). Bayesian Inference with Ecological Applications. Elsevier.

MacKenzie, D., & Nichols, J. (2004). Occupancy as a surrogate for abundance estimation. Animal Biodiversity and Conservation, 1, 461–467.

MacKenzie, D., Nichols, J., & Hines, J. (2003). Estimating site occupancy, colonization, and local extinction when a species is detected imperfectly. Ecology, 84(8), 2200–2207.

MacKenzie, D. I., Bailey, L. L., & Nichols, J. D. (2004). Investigating species co-occurrence patterns when species are detected imperfectly. Journal of Animal Ecology, 73, 546–555.

MacKenzie, D. I., Nichols, J. D., Lachman, G. B., Droege, S., Royle, J. A., & Langtimm, C. A. (2002). Estimating site occupancy rates when detection probabilities are less than one. Ecology, 83(8), 2248–2255.

Mazerolle, M. (2013). Package 'AICcmodavg'.

McCullagh, P., & Nelder, J. A. (1989). Generalized Linear Models (2nd ed.). London: Chapman & Hall.

McQuarrie, A., Shumway, R., & Tsai, C.-L. (1997). The model selection criterion AICu.

Moreno, E., Bertolino, F., & Racugno, W. (1998). An intrinsic limiting procedure for model selection and hypotheses testing. Journal of the American Statistical Association, 93(444), 1451–1460.

Moreno, E., Girón, F. J., & Casella, G. (2010). Consistency of objective Bayes factors as the model dimension grows. The Annals of Statistics, 38(4), 1937–1952.

Nelder, J. A. (1977). A reformulation of linear models. Journal of the Royal Statistical Society, Series A, 140, 48–77.

Nelder, J. A. (1998). The selection of terms in response-surface models: How strong is the weak-heredity principle? The American Statistician, 52(4), 315–318.

Nelder, J. A. (2000). Functional marginality and response-surface fitting. Journal of Applied Statistics, 27(1), 109–112.

Nichols, J., Hines, J., & MacKenzie, D. (2007). Occupancy estimation and modeling with multiple states and state uncertainty. Ecology, 88(6), 1395–1400.

Ovaskainen, O., Hottola, J., & Siitonen, J. (2010). Modeling species co-occurrence by multivariate logistic regression generates new hypotheses on fungal interactions. Ecology, 91(9), 2514–2521.

Peixoto, J. L. (1987). Hierarchical variable selection in polynomial regression models. The American Statistician, 41(4), 311–313.

Peixoto, J. L. (1990). A property of well-formulated polynomial regression models. The American Statistician, 44(1), 26–30.

Pericchi, L. R. (2005). Model selection and hypothesis testing based on objective probabilities and Bayes factors. In Handbook of Statistics. Elsevier.

Polson, N. G., Scott, J. G., & Windle, J. (2013). Bayesian inference for logistic models using Pólya-Gamma latent variables. Journal of the American Statistical Association, 108, 1339–1349.

Rao, C. R., & Wu, Y. (2001). On model selection. In Vol. 38 of Lecture Notes – Monograph Series (pp. 1–57). Beachwood, OH: Institute of Mathematical Statistics.

Reich, B. J., Hodges, J. S., & Zadnik, V. (2006). Effects of residual smoothing on the posterior of the fixed effects in disease-mapping models. Biometrics, 62, 1197–1206.

Reiners, W., & Lockwood, J. (2009). Philosophical Foundations for the Practices of Ecology. Cambridge University Press.

Rigler, F., & Peters, R. (1995). Excellence in Ecology: Science and Limnology. Germany: Ecology Institute.

Robert, C., Chopin, N., & Rousseau, J. (2009). Harold Jeffreys's Theory of Probability revisited. Statistical Science, 24(2), 141–179.

Robert, C. P. (1993). A note on the Jeffreys–Lindley paradox. Statistica Sinica, 3, 601–608.

Royle, J. A., & Kéry, M. (2007). A Bayesian state-space formulation of dynamic occupancy models. Ecology, 88(7), 1813–1823.

Scott, J., & Berger, J. (2010). Bayes and empirical-Bayes multiplicity adjustment in the variable-selection problem. The Annals of Statistics.

Spiegelhalter, D. J., & Smith, A. F. M. (1982). Bayes factors for linear and log-linear models with vague prior information. Journal of the Royal Statistical Society, Series B, 44, 377–387.

Tierney, L., & Kadane, J. B. (1986). Accurate approximations for posterior moments and marginal densities. Journal of the American Statistical Association, 81, 82–86.

Tyre, A. J., Tenhumberg, B., Field, S. A., Niejalke, D., Parris, K., & Possingham, H. P. (2003). Improving precision and reducing bias in biological surveys: Estimating false-negative error rates. Ecological Applications, 13(6), 1790–1801.

Waddle, J. H., Dorazio, R. M., Walls, S. C., Rice, K. G., Beauchamp, J., Schuman, M. J., & Mazzotti, F. J. (2010). A new parameterization for estimating co-occurrence of interacting species. Ecological Applications, 20, 1467–1475.

Wasserman, L. (2000). Bayesian model selection and model averaging. Journal of Mathematical Psychology, 44(1), 92–107.

Wilson, M., Iversen, E., Clyde, M. A., Schmidler, S. C., & Schildkraut, J. M. (2010). Bayesian model search and multilevel inference for SNP association studies. The Annals of Applied Statistics, 4(3), 1342–1364.

Womack, A. J., León-Novelo, L., & Casella, G. (2014). Inference from intrinsic Bayes procedures under model selection and uncertainty. Journal of the American Statistical Association.

Yuan, M., Joseph, V. R., & Zou, H. (2009). Structured variable selection and estimation. The Annals of Applied Statistics, 3(4), 1738–1757.

Zeller, K. A., Nijhawan, S., Salom-Pérez, R., Potosme, S. H., & Hines, J. E. (2011). Integrating occupancy modeling and interview data for corridor identification: A case study for jaguars in Nicaragua. Biological Conservation, 144(2), 892–901.

Zellner, A., & Siow, A. (1980). Posterior odds ratios for selected regression hypotheses. In Trabajos de Estadística y de Investigación Operativa (pp. 585–603).
BIOGRAPHICAL SKETCH

Daniel Taylor-Rodríguez was born in Bogotá, Colombia. He earned a B.S. degree in economics from the Universidad de Los Andes (2004) and a Specialist degree in statistics from the Universidad Nacional de Colombia. In 2009, he traveled to Gainesville, Florida, to pursue a master's in statistics under the supervision of George Casella. Upon completion, he started a Ph.D. in interdisciplinary ecology with a concentration in statistics, again under George Casella's supervision. After George's passing, Linda Young and Nikolay Bliznyuk continued to oversee Daniel's mentorship. He has accepted a joint postdoctoral fellowship at the Statistical and Applied Mathematical Sciences Institute and the Department of Statistical Science at Duke University.
ACKNOWLEDGMENTS

Completing this dissertation would not have been possible without the support of the people who have helped me remain focused, motivated, and inspired throughout the years. I am undeservingly fortunate to be surrounded by such amazing people.

First of all, I would like to express my gratitude to Professor George Casella. It was an unsurpassable honor to work with him. His wisdom, generosity, optimism, and unyielding resolve will forever inspire me. I will always treasure his teachings and the fond memories I have of him. I thank him and Anne for treating me and my wife as family.

I would like to acknowledge all of my committee members. My heartfelt thanks to my advisor, Professor Linda J. Young; I will carry her thoughtful and patient recommendations throughout my life. I have no words to express how thankful I am to her for guiding me through the difficult times that followed Dr. Casella's passing. Also, she has my gratitude for sharing her knowledge and wealth of experience, and for providing me with so many amazing opportunities. I am forever grateful to my local advisor, Professor Nikolay Bliznyuk, for unsparingly sharing his insightful reflections and knowledge. His generosity and drive to help students develop are a model to follow. His kind and extensive efforts, our many conversations, and his suggestions and advice in all aspects of academic and non-academic life have made me a better statistician and have had a profound influence on my way of thinking. My appreciation to Professor Madan Oli for his enlightening advice and for helping me advance my understanding of ecology.

I would like to express my absolute gratitude to Dr. Andrew Womack, my friend and young mentor. His love for good science and hard work, although impossible to keep up with, made my doctoral training one of the most exciting times in my life. I have sincerely enjoyed working with and learning from him over the last couple of years. I offer my gratitude to Dr. Salvador Gezan for his friendship and the patience with which he taught me so much more about statistics (boring our wives to death in the process). I am grateful to
Professor Mary Christman for her mentorship and enormous support I would like to
thank Dr Mihai Giurcanu for spending countless hours helping me think more deeply
about statistics his insight has been instrumental to shaping my own ideas Thanks to
Dr Claudio Fuentes for taking an interest in my work and for his advise support and
kind words which helped me retain the confidence to continue
I would like to acknowledge my friends at UF. Juan Jose Acosta, Mauricio
Mosquera, Diana Falla, Salvador and Emma Weeks, and Anna Denicol: thanks for
becoming my family away from home. Andreas, Tavis, Emily, Alex, Sasha, Mike,
Yeonhee, and Laura: thanks for being there for me; I truly enjoyed sharing these
years with you. Vitor, Paula, Rafa, Leandro, Fabio, Eduardo, Marcelo, and all the other
Brazilians in the Animal Science Department: thanks for your friendship and for the
many unforgettable (though blurry) weekends.
Also, I would like to thank Pablo Arboleda for believing in me. Because of him I
was able to take the first step towards fulfilling my educational goals. My gratitude to
Grupo Bancolombia, Fulbright Colombia, Colfuturo, and the IGERT QSE3 program
for supporting me throughout my studies. Also, thanks to Marc Kery and Christian
Monnerat for providing data to validate our methods. Thanks to the staff in the Statistics
Department, especially to Ryan Chance; to the staff at the HPC; and also to Karen Bray
at SNRE.
Above all else, I would like to thank my wife and family. Nata, you have always been
there for me, pushing me forward, believing in me, and helping me make better decisions;
regardless of how hard things get, you have always managed to give me true and
lasting happiness. Thank you for your love, strength, and patience. Mom, Dad, Alejandro,
Alberto, Laura, Sammy, Vale, and Tommy: without your love, trust, and support, getting
this far would not have been possible. Thank you for giving me so much. Gustavo,
Lilia, Angelica, and Juan Pablo: thanks for taking me into your family; your words of
encouragement have led the way.
TABLE OF CONTENTS
page
ACKNOWLEDGMENTS 4
LIST OF TABLES 8
LIST OF FIGURES 10
ABSTRACT 12
CHAPTER
1 GENERAL INTRODUCTION 14
1.1 Occupancy Modeling 15
1.2 A Primer on Objective Bayesian Testing 17
1.3 Overview of the Chapters 21
2 MODEL ESTIMATION METHODS 23
2.1 Introduction 23
2.1.1 The Occupancy Model 24
2.1.2 Data Augmentation Algorithms for Binary Models 26
2.2 Single Season Occupancy 29
2.2.1 Probit Link Model 30
2.2.2 Logit Link Model 32
2.3 Temporal Dynamics and Spatial Structure 34
2.3.1 Dynamic Mixture Occupancy State-Space Model 37
2.3.2 Incorporating Spatial Dependence 43
2.4 Summary 46
3 INTRINSIC ANALYSIS FOR OCCUPANCY MODELS 49
3.1 Introduction 49
3.2 Objective Bayesian Inference 52
3.2.1 The Intrinsic Methodology 53
3.2.2 Mixtures of g-Priors 54
3.2.2.1 Intrinsic priors 55
3.2.2.2 Other mixtures of g-priors 56
3.3 Objective Bayes Occupancy Model Selection 57
3.3.1 Preliminaries 58
3.3.2 Intrinsic Priors for the Occupancy Problem 60
3.3.3 Model Posterior Probabilities 62
3.3.4 Model Selection Algorithm 63
3.4 Alternative Formulation 66
3.5 Simulation Experiments 68
3.5.1 Marginal Posterior Inclusion Probabilities for Model Predictors 70
3.5.2 Summary Statistics for the Highest Posterior Probability Model 76
3.6 Case Study: Blue Hawker Data Analysis 77
3.6.1 Results: Variable Selection Procedure 79
3.6.2 Validation for the Selection Procedure 81
3.7 Discussion 82
4 PRIORS ON THE MODEL SPACE AND WELL-FORMULATED MODELS 84
4.1 Introduction 84
4.2 Setup for Well-Formulated Models 88
4.2.1 Well-Formulated Model Spaces 90
4.3 Priors on the Model Space 91
4.3.1 Model Prior Definition 92
4.3.2 Choice of Prior Structure and Hyper-Parameters 96
4.3.3 Posterior Sensitivity to the Choice of Prior 99
4.4 Random Walks on the Model Space 104
4.4.1 Simple Pruning and Growing 105
4.4.2 Degree Based Pruning and Growing 106
4.5 Simulation Study 107
4.5.1 SNR and Sample Size Effect 109
4.5.2 Coefficient Magnitude 110
4.5.3 Special Points on the Scale 111
4.6 Case Study: Ozone Data Analysis 111
4.7 Discussion 113
5 CONCLUSIONS 115
APPENDIX
A FULL CONDITIONAL DENSITIES DYMOSS 118
B RANDOM WALK ALGORITHMS 121
C WFM SIMULATION DETAILS 124
D SUPPLEMENTARY INFORMATION FOR THE OZONE DATA ANALYSIS 131
REFERENCES 133
BIOGRAPHICAL SKETCH 140
LIST OF TABLES
Table page
1-1 Interpretation of BFjk when contrasting Mj and Mk 20
3-1 Simulation control parameters occupancy model selector 69
3-2 Comparison of average minOdds/MPIP under scenarios having different number of sites (N=50, N=100) and under scenarios having different number of surveys per site (J=3, J=5), for the presence and detection components, using uniform and multiplicity correction priors 75
3-3 Comparison of average minOdds/MPIP for different levels of signal considered in the occupancy and detection probabilities, for the presence and detection components, using uniform and multiplicity correction priors 75
3-4 Comparison between scenarios with 50 and 100 sites in terms of the average percentage of true positive and true negative terms over the highest probability models, for the presence and the detection components, using uniform and multiplicity correcting priors on the model space 76
3-5 Comparison between scenarios with 3 and 5 surveys per site in terms of the percentage of true positive and true negative predictors averaged over the highest probability models, for the presence and the detection components, using uniform and multiplicity correcting priors on the model space 77
3-6 Comparison between scenarios with different levels of signal in the occupancy component in terms of the percentage of true positive and true negative predictors averaged over the highest probability models, for the presence and the detection components, using uniform and multiplicity correcting priors on the model space 77
3-7 Comparison between scenarios with different levels of signal in the detection component in terms of the percentage of true positive and true negative predictors averaged over the highest probability models, for the presence and the detection components, using uniform and multiplicity correcting priors on the model space 78
3-8 Posterior probability for the five highest probability models in the presence component of the blue hawker data 80
3-9 Posterior probability for the five highest probability models in the detection component of the blue hawker data 80
3-10 MPIP presence component 81
3-11 MPIP detection component 81
3-12 Mean misclassification rate for HPMs and MPMs using uniform and multiplicity correction model priors 82
4-1 Characterization of the full models MF and corresponding model spaces M considered in simulations 100
4-2 Mean number of false and true positives in 100 randomly generated datasets as the number of main effects increases from three to five predictors in a full quadratic model, under the equal probability prior (EPP), the hierarchical independence prior (HIP), the hierarchical order prior (HOP), and the hierarchical uniform prior (HUP) 102
4-3 Mean number of false and true positives in 100 randomly generated datasets as the maximum order of MF increases from two to four in a full model with three main effects, under the equal probability prior (EPP), the hierarchical independence prior (HIP), the hierarchical order prior (HOP), and the hierarchical uniform prior (HUP) 103
4-4 Mean number of false and true positives in 100 randomly generated datasets with unstructured or irregular model spaces, under the equal probability prior (EPP), the hierarchical independence prior (HIP), the hierarchical order prior (HOP), and the hierarchical uniform prior (HUP) 105
4-5 Variables used in the analyses of the ozone contamination dataset 112
4-6 Median probability models (MPM) from different combinations of parameter and model priors vs. model selected using the hierarchical lasso 113
C-1 Experimental conditions WFM simulations 124
D-1 Variables used in the analyses of the ozone contamination dataset 131
D-2 Marginal inclusion probabilities intrinsic prior 132
D-3 Marginal inclusion probabilities Zellner-Siow prior 132
D-4 Marginal inclusion probabilities Hyper-g11 132
D-5 Marginal inclusion probabilities Hyper-g21 132
LIST OF FIGURES
Figure page
2-1 Graphical representation occupancy model 25
2-2 Graphical representation occupancy model after data-augmentation 31
2-3 Graphical representation multiseason model for a single site 39
2-4 Graphical representation data-augmented multiseason model 39
3-1 Predictor MPIP averaged over scenarios with N=50 and N=100 sites, using uniform (U) and multiplicity correction (MC) priors 71
3-2 Predictor MPIP averaged over scenarios with J=3 and J=5 surveys per site, using uniform (U) and multiplicity correction (MC) priors 72
3-3 Predictor MPIP averaged over scenarios with the interaction between the number of sites and the surveys per site, using uniform (U) and multiplicity correction (MC) priors 72
3-4 Predictor MPIP averaged over scenarios with equal signal in the occupancy probabilities, using uniform (U) and multiplicity correction (MC) priors 73
3-5 Predictor MPIP averaged over scenarios with equal signal in the detection probabilities, using uniform (U) and multiplicity correction (MC) priors 73
4-1 Graphs of well-formulated polynomial models for p = 2 90
4-2 E(M) and C(M) in M defined by a quadratic surface in two main effects, for model M = {1, x1, x1^2} 91
4-3 Graphical representation of assumptions on M defined by the quadratic surface in two main effects 93
4-4 Prior probabilities for the space of well-formulated models associated with the quadratic surface on two variables, where MB is taken to be the intercept-only model and (a,b) in {(1,1), (1,ch)} 97
4-5 Prior probabilities for the space of well-formulated models associated with three main effects and one interaction term, where MB is taken to be the intercept-only model and (a,b) in {(1,1), (1,ch)} 98
4-6 DAG of the largest true model MT used in simulations 109
4-7 Average true positives (TP) and average false positives (FP) in all simulated scenarios for the median probability model, with EPP and HOP(1,ch) 110
C-1 SNR vs. n: average model size, average true positives, and average false positives for all simulated scenarios, by model ranking according to model posterior probabilities 126
C-2 SNR vs. coefficient values: average model size, average true positives, and average false positives for all simulated scenarios, by model ranking according to model posterior probabilities 128
C-3 SNR vs. different true models MT: average model size, average true positives, and average false positives for all simulated scenarios, by model ranking according to model posterior probabilities 129
Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy
OBJECTIVE BAYESIAN METHODS FOR OCCUPANCY MODEL ESTIMATION ANDSELECTION
By
Daniel Taylor-Rodríguez
August 2014
Chair: Linda J. Young
Cochair: Nikolay Bliznyuk
Major: Interdisciplinary Ecology
The ecological literature contains numerous methods for conducting inference about
the dynamics that govern biological populations. Among these methods, occupancy
models have played a leading role during the past decade in the analysis of large
biological population surveys. The flexibility of the occupancy framework has brought
about useful extensions for determining key population parameters, which provide
insights about the distribution, structure, and dynamics of a population. However, the
methods used to fit the models and to conduct inference have gradually grown in
complexity, leaving practitioners unable to fully understand their implicit assumptions
and increasing the potential for misuse. This motivated our first contribution: we develop
a flexible and straightforward estimation method for occupancy models that provides
the means to directly incorporate temporal and spatial heterogeneity, using covariate
information that characterizes habitat quality and the detectability of a species.
Adding to the issue mentioned above, studies of complex ecological systems now
collect large amounts of information. To identify the drivers of these systems, robust
techniques that account for test multiplicity and for the structure in the predictors are
necessary but unavailable for ecological models. We develop tools to address this
methodological gap. First, working in an "objective" Bayesian framework, we develop
the first fully automatic and objective method for occupancy model selection, based
on intrinsic parameter priors. Moreover, for the general variable selection problem, we
propose three sets of prior structures on the model space that correct for multiple testing,
and a stochastic search algorithm that relies on the priors on the model space to
account for the polynomial structure in the predictors.
CHAPTER 1
GENERAL INTRODUCTION
As with any other branch of science, ecology strives to grasp truths about the
world that surrounds us, and in particular about nature. The objective truth sought
by ecology may well be beyond our grasp; however, it is reasonable to think that, at
least partially, "Nature is capable of being understood" (Dewey 1958). We can observe
and interpret nature to formulate hypotheses, which can then be tested against reality.
Hypotheses that encounter no or little opposition when confronted with reality may
become contextual versions of the truth, and may be generalized by scaling them
spatially and/or temporally to delimit the bounds within which they are valid.
To formulate hypotheses accurately, and in a fashion amenable to scientific inquiry,
not only must the point of view and assumptions considered be made explicit, but
also the object of interest, the properties of that object worthy of consideration, and
the methods used in studying such properties (Reiners & Lockwood 2009; Rigler &
Peters 1995). Ecology, as defined by Krebs (1972), is "the study of interactions that
determine the distribution and abundance of organisms". This characterizes organisms
and their interactions as the objects of interest to ecology, and prescribes distribution
and abundance as relevant properties of these organisms.
With regards to the methods used to acquire ecological scientific knowledge,
traditionally theoretical mathematical models (such as deterministic PDEs) have been
used. However, naturally varying systems are imprecisely observed and, as such, are
subject to multiple sources of uncertainty that must be explicitly accounted for. Because
of this, the ecological scientific community is developing a growing interest in flexible
and powerful statistical methods, and among these, Bayesian hierarchical models
predominate. These methods rely on empirical observations and can accommodate
fairly complex relationships between empirical observations and theoretical process
models, while accounting for diverse sources of uncertainty (Hooten 2006).
Bayesian approaches are now used extensively in ecological modeling; however,
there are two issues of concern, one from the standpoint of ecological practitioners
and another from the perspective of scientific ecological endeavors. First, Bayesian
modeling tools require a considerable understanding of probability and statistical theory,
leading practitioners to view them as black-box approaches (Kery 2010). Second,
although Bayesian applications proliferate in the literature, in general there is a lack of
awareness of the distinction between approaches specifically devised for testing and
those for estimation (Ellison 2004). Furthermore, there is a dangerous unfamiliarity with
the proven risks of using tools designed for estimation in testing procedures, e.g., the
use of flat priors in hypothesis testing (Berger & Pericchi 1996; Berger et al. 2001;
Kass & Raftery 1995; Moreno et al. 1998; Robert et al. 2009; Robert 1993).
Occupancy models have played a leading role during the past decade in large
biological population surveys. The flexibility of the occupancy framework has allowed
the development of useful extensions to determine several key population parameters,
which provide robust notions of the distribution, structure, and dynamics of a population.
In order to address some of the concerns stated in the previous paragraph, we concentrate
on the occupancy framework to develop estimation and testing tools that will allow
ecologists, first, to gain insight about the estimation procedure, and second, to conduct
statistically sound model selection for site-occupancy data.
1.1 Occupancy Modeling
Since MacKenzie et al. (2002) and Tyre et al. (2003) introduced the site-occupancy
framework, countless applications and extensions of the method have been developed
in the ecological literature, as evidenced by the 438,000 hits on Google Scholar for
a search of "occupancy model". This class of models acknowledges that techniques
used to conduct biological population surveys are prone to detection errors: if an
individual is detected it must be present, while if it is not detected it might or might
not be. Occupancy models improve upon traditional binary regression by accounting
for observed detection and partially observed presence as two separate but related
components. In the site-occupancy setting, the chosen locations are surveyed
repeatedly in order to reduce the ambiguity caused by the observed zeros. This
approach therefore allows probabilities of both presence (occurrence) and detection
to be estimated.
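The mechanics of presence and detection as two linked binary components can be illustrated with a short simulation. The sketch below is illustrative only: the site count, survey count, and the values of the occupancy probability psi and detection probability p are hypothetical, and the naive estimator shown is not part of the occupancy model itself.

```python
import numpy as np

rng = np.random.default_rng(42)

N, J = 100, 5       # number of sites and surveys per site (hypothetical values)
psi, p = 0.6, 0.4   # occupancy and detection probabilities (hypothetical values)

# Latent presence at each site, then detections only where the species is present
z = rng.binomial(1, psi, size=N)
y = rng.binomial(1, p, size=(N, J)) * z[:, None]

# Ignoring imperfect detection, a site with an all-zero history is treated as
# unoccupied, even though it may be occupied and simply never detected (which
# happens with probability (1 - p)**J given presence); the naive estimate
# therefore understates occupancy on average.
naive_occ = (y.sum(axis=1) > 0).mean()
print(f"true psi: {psi}, naive occupancy estimate: {naive_occ:.2f}")
```

Repeated surveys are what shrink the ambiguity of the observed zeros: as J grows, (1 - p)**J shrinks, and an all-zero history becomes stronger evidence of true absence.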
The uses of site-occupancy models are many. For example, metapopulation
and island biogeography models are often parameterized in terms of site (or patch)
occupancy (Hanski 1992, 1994, 1997, as cited in MacKenzie et al. (2003)), and
occupancy may be used as a surrogate for abundance to answer questions regarding
geographic distribution, range size, and metapopulation dynamics (MacKenzie et al.
2004; Royle & Kery 2007).
The basic occupancy framework, which assumes a single closed population with
fixed probabilities through time, has proven to be quite useful; however, it might be of
limited utility when addressing some problems. In particular, assumptions for the basic
model may become too restrictive or unrealistic whenever the study period extends
throughout multiple years or seasons, especially given the increasingly changing
environmental conditions that most ecosystems are currently experiencing.
Among the extensions found in the literature, one that we consider particularly
relevant incorporates heterogeneous occupancy probabilities through time. Models
that incorporate temporally varying probabilities stem from important metapopulation
notions provided by Hanski (1994), such as occupancy probabilities depending on local
colonization and local extinction processes. In spite of the conceptual usefulness of
Hanski's model, several strong and untenable assumptions (e.g., all patches being
homogeneous in quality) are required for it to provide practically meaningful results.

A more viable alternative, which builds on Hanski (1994), is the extension of
the single-season occupancy model by MacKenzie et al. (2003). In this model, the
heterogeneity of occupancy probabilities across seasons arises from local colonization
and extinction processes. This model is flexible enough to let detection, occurrence,
extinction, and colonization probabilities each depend upon its own set of covariates.
Model parameters are obtained through likelihood-based estimation.
Using a maximum likelihood approach presents two drawbacks. First, the
uncertainty assessment for maximum likelihood parameter estimates relies on
asymptotic results, obtained through implementation of the delta method,
making it sensitive to sample size. Second, to obtain parameter estimates, the latent
process (occupancy) is marginalized out of the likelihood, leading to the usual zero-inflated
Bernoulli model. Although this is a convenient strategy for solving the estimation
problem, once the latent state variables (occupancy indicators) have been integrated out
they are no longer available. Therefore, finite sample estimates cannot be calculated directly;
instead, a supplementary parametric bootstrapping step is necessary. Further,
additional structure, such as temporal or spatial variation, cannot be introduced by
means of random effects (Royle & Kery 2007).
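For concreteness, the zero-inflated Bernoulli likelihood that results from marginalizing the latent occupancy indicators can be sketched as follows, for the simplest case of a constant occupancy probability psi and detection probability p. The function name and the toy data are illustrative assumptions, not the estimation method developed in Chapter 2.

```python
import numpy as np

def zib_loglik(psi, p, y):
    """Zero-inflated Bernoulli log-likelihood with constant psi and p.

    y is an (N, J) 0/1 detection matrix; each site's latent occupancy
    state has been summed (marginalized) out of its likelihood term.
    """
    det = y.sum(axis=1)                  # number of detections per site
    J = y.shape[1]
    # P(detection history | site occupied), independent Bernoulli(p) surveys
    lik_occupied = p**det * (1 - p)**(J - det)
    # An all-zero history may also come from an unoccupied site
    lik = psi * lik_occupied + (1 - psi) * (det == 0)
    return np.log(lik).sum()

y = np.array([[1, 0, 1],                 # site detected twice in three surveys
              [0, 0, 0]])                # ambiguous all-zero history
print(zib_loglik(0.5, 0.4, y))
```

Maximizing this function over (psi, p), or over regression coefficients when both probabilities depend on covariates, yields the usual likelihood-based estimates; but the site-level occupancy indicators are no longer available once summed out, which is precisely the limitation discussed above.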
1.2 A Primer on Objective Bayesian Testing
With the advent of high-dimensional data, such as that found in modern problems
in ecology, genetics, physics, etc., coupled with evolving computing capability, objective
Bayesian inferential methods have gained increasing popularity. This, however, is by no
means a new approach to conducting Bayesian inference. In fact, starting with
Bayes and Laplace, and continuing for almost 200 years, Bayesian analysis was primarily
based on "noninformative" priors (Berger & Bernardo 1992).
Now, subjective elicitation of prior probabilities in Bayesian analysis is widely
recognized as the ideal (Berger et al. 2001); however, it is often the case that the
available information is insufficient to specify appropriate prior probabilistic statements.
Commonly, as in model selection problems where large model spaces have to be
explored, the number of model parameters is prohibitively large, preventing one from
eliciting prior information for the entire parameter space. As a consequence, in practice,
the determination of priors through the definition of structural rules has become the
alternative to subjective elicitation for a variety of problems in Bayesian testing. Priors
arising from these rules are known in the literature as noninformative, objective, default,
or reference priors. Many of these labels generate controversy, and they are accused,
perhaps rightly, of providing a false pretension of objectivity. Nevertheless, we will avoid
that discussion and refer to them herein interchangeably as noninformative or objective
priors, to convey the sense that no attempt to introduce an informed opinion is made in
defining prior probabilities.
A plethora of "noninformative" methods has been developed in the past few
decades (see Berger & Bernardo (1992); Berger & Pericchi (1996); Berger et al. (2001);
Clyde & George (2004); Kass & Wasserman (1995, 1996); Liang et al. (2008); Moreno
et al. (1998); Spiegelhalter & Smith (1982); Wasserman (2000); and the references
therein). We find particularly interesting those derived from the model structure, in which
no tuning parameters are required, especially since these can be regarded as automatic
methods. Among them, methods based on the Bayes factor with intrinsic priors have
proven their worth in a variety of inferential problems, given their excellent performance,
flexibility, and ease of use. This class of priors is discussed in detail in Chapter 3. For
now, some basic notation and notions of Bayesian inferential procedures are introduced.
Hypothesis testing and the Bayes factor
Bayesian model selection techniques that aim to find the true model, as opposed
to searching for the model that best predicts the data, are fundamentally extensions of
Bayesian hypothesis testing strategies. In general, this Bayesian approach to hypothesis
testing and model selection relies on determining the amount of evidence found in favor
of one hypothesis (or model) over another, given an observed set of data. Approached
from a Bayesian standpoint, this type of problem can be formulated in great generality,
using a natural, well-defined probabilistic framework that incorporates both model and
parameter uncertainty.
Jeffreys (1935) first developed the Bayesian strategy for hypothesis testing and,
consequently, for the model selection problem. Bayesian model selection within
a model space M = (M1, M2, ..., MJ), where each model is associated with a
parameter θj (which may be a vector of parameters itself), incorporates three types
of probability distributions: (1) a prior probability distribution for each model, π(Mj);
(2) a prior probability distribution for the parameters in each model, π(θj|Mj); and (3)
the distribution of the data conditional on both the model and the model's parameters,
f(x|θj, Mj). These three probability densities induce the joint distribution p(x, θj, Mj) =
f(x|θj, Mj) · π(θj|Mj) · π(Mj), which is instrumental in producing model posterior
probabilities. The model posterior probability is the probability that a model is true given
the data. It is obtained by marginalizing over the parameter space and using Bayes rule:
\[
p(M_j \mid x) = \frac{m(x \mid M_j)\,\pi(M_j)}{\sum_{i=1}^{J} m(x \mid M_i)\,\pi(M_i)}, \tag{1-1}
\]

where m(x|Mj) = ∫ f(x|θj, Mj) π(θj|Mj) dθj is the marginal likelihood of Mj.
Given that interest lies in comparing different models, evidence in favor of one or
another model is assessed with pairwise comparisons using posterior odds:

\[
\frac{p(M_j \mid x)}{p(M_k \mid x)} = \frac{m(x \mid M_j)}{m(x \mid M_k)} \cdot \frac{\pi(M_j)}{\pi(M_k)}. \tag{1-2}
\]

The first term on the right-hand side of (1-2), m(x|Mj)/m(x|Mk), is known as the Bayes factor
comparing model Mj to model Mk, and it is denoted by BFjk(x). The Bayes factor
provides a measure of the evidence in favor of either model given the data, and updates
the model prior odds, π(Mj)/π(Mk), to produce the posterior odds.
Note that the model posterior probability in (1-1) can be expressed as a function of
Bayes factors. To illustrate, let model M∗ ∈ M be a reference model to which all other
models in M are compared. Then, dividing both the numerator and denominator in (1-1)
by m(x|M∗)π(M∗) yields

\[
p(M_j \mid x) = \frac{BF_{j\ast}(x)\,\dfrac{\pi(M_j)}{\pi(M_\ast)}}{1 + \displaystyle\sum_{M_i \in \mathcal{M},\, M_i \neq M_\ast} BF_{i\ast}(x)\,\dfrac{\pi(M_i)}{\pi(M_\ast)}}. \tag{1-3}
\]
Therefore, as the Bayes factor increases, the posterior probability of model Mj given the
data increases. If all models have equal prior probabilities, a straightforward criterion
to select the best among all candidate models is to choose the model with the largest
Bayes factor. As such, the Bayes factor is not only useful for identifying models favored
by the data, but it also provides a means to rank models in terms of their posterior
probabilities.
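The conversion from Bayes factors against a reference model to posterior model probabilities, as in (1-3), can be sketched in a few lines. This is a generic numerical illustration with made-up log Bayes factor values, not code from the dissertation; equal model prior probabilities are assumed, so the prior odds terms are all one.

```python
import numpy as np

def posterior_probs(log_bf):
    """Posterior model probabilities from Bayes factors vs. a reference model.

    log_bf[i] holds ln BF_i*(x); the reference model itself contributes
    ln BF = 0. Equal model prior probabilities are assumed, so all prior
    odds terms in (1-3) equal one.
    """
    log_bf = np.asarray(log_bf, dtype=float)
    w = np.exp(log_bf - log_bf.max())    # subtract the max for numerical stability
    return w / w.sum()

# Reference model plus two competitors with hypothetical ln Bayes factors
probs = posterior_probs([0.0, 2.3, -1.0])
print(probs)                             # the largest ln BF gets the largest mass
```

Working on the log scale and subtracting the maximum before exponentiating avoids overflow when some marginal likelihoods dominate by many orders of magnitude, which is routine in large model spaces.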
Assuming equal model prior probabilities in (1-3), the prior odds are set equal to
one, and the model posterior odds in (1-2) become p(Mj|x)/p(Mk|x) = BFjk(x). Based
on the Bayes factor, the evidence in favor of one or another model can be interpreted
using Table 1-1, adapted from Kass & Raftery (1995).
Table 1-1. Interpretation of BFjk when contrasting Mj and Mk

ln BFjk     BFjk         Evidence in favor of Mj     P(Mj | x)
0 to 2      1 to 3       Weak evidence               0.50 to 0.75
2 to 6      3 to 20      Positive evidence           0.75 to 0.95
6 to 10     20 to 150    Strong evidence             0.95 to 0.99
> 10        > 150        Very strong evidence        > 0.99
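The bands in Table 1-1 can be encoded directly. The small helper below is a convenience sketch (its name and error behavior are our own choices, not part of the dissertation's methods) that maps ln BFjk to the corresponding evidence category:

```python
def evidence_category(ln_bf):
    """Map ln BF_jk to the evidence scale of Table 1-1 (Kass & Raftery 1995)."""
    if ln_bf < 0:
        # Evidence favors the competing model; apply the scale to -ln_bf instead
        raise ValueError("ln BF is negative; interpret the scale for the other model")
    if ln_bf <= 2:
        return "weak"
    if ln_bf <= 6:
        return "positive"
    if ln_bf <= 10:
        return "strong"
    return "very strong"

print(evidence_category(3.5))
```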
Bayesian hypothesis testing and model selection procedures through Bayes factors
and posterior probabilities have several desirable features. First, these methods have a
straightforward interpretation, since the Bayes factor is an increasing function of model
(or hypothesis) posterior probabilities. Second, these methods can yield frequentist-matching
confidence bounds when implemented with good testing priors (Kass &
Wasserman 1996), such as the reference priors of Berger & Bernardo (1992). Third,
since the Bayes factor contains the ratio of marginal densities, it automatically penalizes
complexity according to the number of parameters in each model; this property is
known as Ockham's razor (Kass & Raftery 1995). Fourth, the use of Bayes factors does
not require nested hypotheses (i.e., the null hypothesis need not be nested in the
alternative), standard distributions, or regular asymptotics (e.g., convergence to normal
or chi-squared distributions) (Berger et al. 2001). In contrast, this is not always the case
with frequentist and likelihood ratio tests, which depend on known distributions (at least
asymptotically) for the test statistic to perform the test. Finally, Bayesian hypothesis
testing procedures using the Bayes factor can naturally incorporate model uncertainty by
using the Bayesian machinery for model-averaged predictions and confidence bounds
(Kass & Raftery 1995). It is not clear how to account for this uncertainty rigorously in a
fully frequentist approach.
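The model-averaging idea just mentioned amounts to weighting each model's prediction by its posterior probability. A minimal numerical sketch, with made-up predictions and posterior weights:

```python
import numpy as np

# Hypothetical per-model predictions E[y_new | x, M_j] for a new observation,
# and hypothetical posterior model probabilities p(M_j | x) summing to one
preds = np.array([0.42, 0.55, 0.48])
post = np.array([0.10, 0.65, 0.25])

# Bayesian model averaging: E[y_new | x] = sum_j p(M_j | x) * E[y_new | x, M_j]
bma_pred = float(np.sum(post * preds))
print(bma_pred)
```

The averaged prediction always lies within the range of the per-model predictions, and its spread across models is one source of the model uncertainty that a single-model analysis ignores.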
1.3 Overview of the Chapters
In the chapters that follow, we develop a flexible and straightforward hierarchical
Bayesian framework for occupancy models, allowing us to obtain estimates and conduct
robust testing from an "objective" Bayesian perspective. Latent mixtures of random
variables supply a foundation for our methodology. This approach provides a means to
directly incorporate spatial dependency and temporal heterogeneity through predictors
that characterize either the habitat quality of a given site or the detectability features of a
particular survey conducted at a specific site. The Bayesian testing
methods we propose are (1) a fully automatic and objective method for occupancy
model selection, and (2) an objective Bayesian testing tool that accounts for multiple
testing and for polynomial hierarchical structure in the space of predictors.

Chapter 2 introduces the methods proposed for estimation of occupancy model
parameters. A simple estimation procedure for the single-season occupancy model
with covariates is formulated, using both probit and logit links. Based on this simple
version, an extension is provided to cope with metapopulation dynamics, by introducing
persistence and colonization processes. Finally, given the fundamental role that spatial
dependence plays in defining temporal dynamics, a strategy to seamlessly account for
this feature in our framework is introduced.
Chapter 3 develops a new, fully automatic and objective method for occupancy
model selection that is asymptotically consistent for variable selection and averts the
use of tuning parameters. In this chapter, first, some issues surrounding multimodel
inference are described, and insight about objective Bayesian inferential procedures is
provided. Then, building on modern methods for "objective" Bayesian testing to generate
priors on the parameter space, the intrinsic priors for the parameters of the occupancy
model are obtained. These are used in the construction of an algorithm for "objective"
variable selection, tailored to the occupancy model framework.
Chapter 4 touches on two important and interconnected issues in model testing
that have yet to receive the attention they deserve: (1) controlling for false
discovery in hypothesis testing, given the size of the model space (i.e., given the number
of tests performed); and (2) non-invariance to location transformations of variable
selection procedures in the presence of polynomial predictor structure. These elements both
depend on the definition of prior probabilities on the model space. In this chapter, a set
of priors on the model space and a stochastic search algorithm are proposed. Together,
these control for model multiplicity and account for the polynomial structure among the
predictors.
CHAPTER 2
MODEL ESTIMATION METHODS
"Data! Data! Data!" he cried impatiently. "I can't make bricks without clay."

–Sherlock Holmes, The Adventure of the Copper Beeches
2.1 Introduction
Prior to the introduction of site-occupancy models (MacKenzie et al. 2002; Tyre
et al. 2003), presence-absence data from ecological monitoring programs were used,
without any adjustment, to assess the impact of management actions, to observe trends
in species distribution through space and time, or to model the habitat of a species (Tyre
et al. 2003). These efforts, however, were suspect, due to false-negative errors not
being accounted for. False-negative errors occur whenever a species is present at a site
but goes undetected during the survey.
Site-occupancy models, developed independently by MacKenzie et al. (2002)
and Tyre et al. (2003), extend simple binary-regression models to account for the
aforementioned errors in detection of individuals, common in surveys of animal or plant
populations. Since their introduction, the site-occupancy framework has been used in
countless applications, and numerous extensions of it have been proposed. Occupancy
models improve upon traditional binary regression by analyzing observed detection
and partially observed presence as two separate but related components. In the site-occupancy
setting, the chosen locations are surveyed repeatedly in order to reduce the
ambiguity caused by the observed zeros. This approach therefore allows simultaneous
estimation of the probabilities of presence (occurrence) and detection.
Several extensions to the basic single-season, closed-population model are now available. The occupancy approach has been used to determine species range dynamics (MacKenzie et al., 2003; Royle & Kéry, 2007) and to understand age/stage
structure within populations (Nichols et al., 2007), and to model species co-occurrence (MacKenzie et al., 2004; Ovaskainen et al., 2010; Waddle et al., 2010). It has even been suggested as a surrogate for abundance (MacKenzie & Nichols, 2004). MacKenzie et al. suggested using occupancy models to conduct large-scale monitoring programs, since this approach avoids the high costs associated with surveys designed for abundance estimation. Also, for investigating metapopulation dynamics, occupancy models improve upon incidence function models (Hanski, 1994), which are often parameterized in terms of site (or patch) occupancy and assume homogeneous patches and a metapopulation at colonization-extinction equilibrium.
Nevertheless, the implementation of Bayesian occupancy models commonly resorts to sampling strategies dependent on hyperparameters, subjective prior elicitation, and relatively elaborate algorithms. From the standpoint of practitioners, these are often treated as black-box methods (Kéry, 2010); as such, the potential for using the methodology incorrectly is high. Commonly these procedures are fitted with packages such as BUGS or JAGS. Although the ease of use of these packages has led to widespread adoption of the methods, the user may be oblivious to the assumptions underpinning the analysis.
We believe providing straightforward and robust alternatives for implementing these methods will help practitioners gain insight about how occupancy modeling, and more generally Bayesian modeling, is performed. In this chapter, using a simple Gibbs sampling approach, we first develop a versatile method to estimate the single-season, closed-population site-occupancy model, then extend it to analyze metapopulation dynamics through time, and finally provide a further adaptation that incorporates spatial dependence among neighboring sites.

2.1.1 The Occupancy Model
In this section we first introduce our results published in Dorazio & Taylor-Rodríguez (2012) and build upon them to propose relevant extensions. For
the standard sampling protocol for collecting site-occupancy data, J > 1 independent surveys are conducted at each of N representative sample locations (sites), noting whether a species is detected or not detected during each survey. Let y_{ij} denote a binary random variable that indicates detection (y_{ij} = 1) or non-detection (y_{ij} = 0) during the jth survey of site i. Without loss of generality, J may be assumed constant among all N sites to simplify description of the model; in practice, site-specific variation in J poses no real difficulties and is easily implemented. This sampling protocol therefore yields an N × J matrix Y of detection/non-detection data.
Note that the observed process y_{ij} is an imperfect representation of the underlying occupancy (presence) process. Hence, letting z_i denote the presence indicator at site i, this model specification can be represented through the hierarchy

y_{ij} \mid z_i, \lambda \sim \mathrm{Bernoulli}(z_i\, p_{ij})
z_i \mid \alpha \sim \mathrm{Bernoulli}(\psi_i)    (2-1)

where p_{ij} is the probability of correctly classifying the ith site as occupied during the jth survey, and \psi_i is the presence probability at the ith site. The graphical representation of this process is shown in Figure 2-1.
process is
ψi
zi
yi
pi
Figure 2-1 Graphical representation occupancy model
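To make the data structure concrete, the sampling protocol above can be simulated in a few lines. This sketch is our own illustration, not code from the dissertation; the constant values of \psi and p, the sizes N and J, and the seed are all assumed for demonstration.

```python
import numpy as np

rng = np.random.default_rng(1)

N, J = 200, 5        # sites and surveys per site (illustrative choices)
psi, p = 0.6, 0.4    # constant occupancy and detection probabilities (assumed)

z = rng.binomial(1, psi, size=N)                   # latent presence z_i
Y = rng.binomial(1, p, size=(N, J)) * z[:, None]   # detections y_ij, zeroed where z_i = 0

# A site with no detections may be unoccupied or occupied-but-undetected,
# so the naive proportion of sites with at least one detection is biased downward.
naive_occ = (Y.sum(axis=1) > 0).mean()
```

The last line makes the motivation for the model tangible: ignoring imperfect detection underestimates occupancy, since the all-zero rows of Y mix true absences with missed detections.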
Probabilities of detection and occupancy can both be made functions of covariates, and their corresponding parameter estimates can be obtained using either a maximum likelihood or a Bayesian approach. Existing likelihood-based methodologies marginalize over the latent occupancy process (z_i), making the estimation procedure depend only on the detections. Most Bayesian strategies rely on MCMC algorithms that require subjective prior specification and tuning. However, a longstanding strategy in the Bayesian statistical literature, proposed by Albert & Chib (1993), models binary outcomes using a simple Gibbs sampler. This procedure, which is described in the following section, can be extrapolated to the occupancy setting, eliminating the need for tuning parameters and subjective prior elicitation.

2.1.2 Data Augmentation Algorithms for Binary Models
Probit model: data augmentation with latent normal variables
At the root of Albert & Chib's algorithm lies the idea that if the observed outcome is 0, the latent variable can be simulated from a normal distribution truncated to (-\infty, 0], and if the outcome is 1, the latent variable can be simulated from a normal distribution truncated to (0, \infty). To understand the reasoning behind this strategy, let Y \sim \mathrm{Bernoulli}(\Phi(x^T\beta)) and V = x^T\beta + \varepsilon, with \varepsilon \sim N(0, 1). In such a case, note that

\Pr(y = 1 \mid x^T\beta) = \Phi(x^T\beta) = \Pr(\varepsilon < x^T\beta) = \Pr(\varepsilon > -x^T\beta) = \Pr(v > 0 \mid x^T\beta).

Thus, whenever y = 1, then v > 0, and v \le 0 otherwise. In other words, we may think of y as a truncated version of v. We can therefore sample iteratively, alternating between the latent variables conditioned on the model parameters and vice versa, to draw from the desired posterior densities. By augmenting the data with the latent variables, we obtain full conditional posterior distributions for the model parameters that are easy to draw from (Equation 2-3 below).
Given some initial values for the model parameters, values for the latent variables can be simulated. Conditioning on the latter, it is then possible to draw samples from the parameters' posterior distributions, which can in turn be used to generate new values for the latent variables, and so on. The process is iterated using a Gibbs sampling approach; after a large number of iterations, it yields draws from the joint posterior distribution of the latent variables and the model parameters, conditional on the observed outcome values. We formalize the procedure below.
Assume that each outcome Y_1, Y_2, ..., Y_n is such that Y_i \mid x_i, \beta \sim \mathrm{Bernoulli}(q_i), where q_i = \Phi(x_i^T\beta) is the standard normal CDF evaluated at x_i^T\beta, and where x_i and \beta are the p-dimensional vectors of observed covariates for the ith observation and their corresponding parameters, respectively.

Now let y = (y_1, y_2, ..., y_n) be the vector of observed outcomes and let [\beta] represent the prior distribution of the model parameters. The posterior distribution of \beta is then given by

[\beta \mid y] \propto [\beta] \prod_{i=1}^n \Phi(x_i^T\beta)^{y_i} \big(1 - \Phi(x_i^T\beta)\big)^{1-y_i},    (2-2)
which is intractable. Nevertheless, introducing latent random variables V = (V_1, ..., V_n) such that V_i \sim N(x_i^T\beta, 1), with V_i > 0 whenever Y_i = 1 and V_i \le 0 whenever Y_i = 0, resolves this difficulty. This yields

[\beta, v \mid y] \propto [\beta] \prod_{i=1}^n \phi(v_i \mid x_i^T\beta, 1) \big\{ I(v_i \le 0) I(y_i = 0) + I(v_i > 0) I(y_i = 1) \big\},    (2-3)

where \phi(x \mid \mu, \tau^2) is the probability density function of a normal random variable x with mean \mu and variance \tau^2. The data augmentation artifact works because [\beta \mid y] = \int [\beta, v \mid y]\, dv; hence, if we sample from the joint posterior (2-3) and extract only the sampled values of \beta, they correspond to samples from [\beta \mid y].
From the expression above it is possible to obtain the full conditional distributions of V and \beta, so a Gibbs sampler can be proposed. For example, if we use a flat prior for \beta (i.e., [\beta] \propto 1), the full conditionals are given by

\beta \mid V, y \sim \mathrm{MVN}_p\big((X^TX)^{-1}X^TV,\; (X^TX)^{-1}\big)    (2-4)
V \mid \beta, y \sim \prod_{i=1}^n \mathrm{trN}(x_i^T\beta, 1, Q_i)    (2-5)

where \mathrm{MVN}_p(\mu, \Sigma) represents the p-variate normal distribution with mean vector \mu and variance-covariance matrix \Sigma, and \mathrm{trN}(\xi, \sigma^2, Q) stands for the truncated normal distribution with mean \xi, variance \sigma^2, and truncation region Q. For each i = 1, 2, ..., n, the support of the truncated variable is Q_i = (-\infty, 0] if y_i = 0 and Q_i = (0, \infty) otherwise. Note that conjugate normal priors could be used instead.

At iteration m + 1, the Gibbs sampler draws V^{(m+1)} conditional on \beta^{(m)} from (2-5) and then samples \beta^{(m+1)} conditional on V^{(m+1)} from (2-4). This process is repeated for m = 0, 1, ..., n_{sim}, where n_{sim} is the number of iterations of the Gibbs sampler.
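The two-step scheme just described can be sketched on simulated data. This is our own illustration rather than code from the dissertation; the data sizes, seed, true parameter values, and the use of SciPy's truncated normal sampler are all assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulated probit-regression data (illustrative)
n = 400
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([0.5, -1.0])
y = rng.binomial(1, stats.norm.cdf(X @ beta_true))

XtX_inv = np.linalg.inv(X.T @ X)   # posterior covariance under the flat prior
beta = np.zeros(2)
draws = []
for m in range(1200):
    # Draw v_i from a normal truncated to (0, inf) if y_i = 1, (-inf, 0] if y_i = 0 (eq. 2-5)
    mu = X @ beta
    lo = np.where(y == 1, 0.0, -np.inf)
    hi = np.where(y == 1, np.inf, 0.0)
    v = stats.truncnorm.rvs(lo - mu, hi - mu, loc=mu, scale=1.0, random_state=rng)
    # Draw beta | v ~ N((X'X)^{-1} X'v, (X'X)^{-1})  (eq. 2-4)
    beta = rng.multivariate_normal(XtX_inv @ (X.T @ v), XtX_inv)
    if m >= 300:                    # discard burn-in
        draws.append(beta)

beta_hat = np.mean(draws, axis=0)   # posterior mean; recovers beta_true approximately
```

Note that no tuning parameters appear anywhere in the loop, which is precisely the appeal of the data-augmentation strategy.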
Logit model: data augmentation with latent Pólya-Gamma variables
Recently, Polson et al. (2013) developed a novel and efficient approach for Bayesian inference in logistic models using Pólya-Gamma latent variables, which is analogous to the Albert & Chib algorithm. The result arises from what the authors refer to as the Pólya-Gamma distribution. To construct a random variable from this family, consider the infinite mixture of an iid sequence of Exp(1) random variables \{E_k\}_{k=1}^{\infty} given by

\omega = \frac{2}{\pi^2} \sum_{k=1}^{\infty} \frac{E_k}{(2k-1)^2},

with probability density function

g(\omega) = \sum_{k=0}^{\infty} (-1)^k \frac{2k+1}{\sqrt{2\pi\omega^3}}\, e^{-(2k+1)^2/(8\omega)}\, I(\omega \in (0, \infty))    (2-6)

and Laplace transform E[e^{-t\omega}] = \cosh(\sqrt{t/2})^{-1}.
The Pólya-Gamma family of densities is obtained through an exponential tilting of the density g in (2-6). These densities, indexed by c \ge 0, are characterized by

f(\omega \mid c) = \cosh\!\left(\frac{c}{2}\right) e^{-c^2\omega/2}\, g(\omega).
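The mixture construction above suggests a simple, if inefficient, way to draw approximate Pólya-Gamma variates: truncate the sum-of-gammas representation given in Polson et al. (2013), which at c = 0 reduces to the mixture defining (2-6). The sketch below is our illustration; the truncation point is an assumption, and efficient exact samplers exist.

```python
import numpy as np

rng = np.random.default_rng(0)

def rpg1_approx(c, n_terms=200, size=1):
    """Approximate PG(1, c) draws via the truncated sum-of-gammas
    representation of Polson et al. (2013):
        omega = (1 / (2 pi^2)) * sum_k g_k / ((k - 1/2)^2 + c^2 / (4 pi^2)),
    with g_k iid Exp(1).  At c = 0 this reduces to the mixture in (2-6)."""
    k = np.arange(1, n_terms + 1)
    denom = (k - 0.5) ** 2 + c ** 2 / (4.0 * np.pi ** 2)
    g = rng.exponential(size=(size, n_terms))
    return (g / denom).sum(axis=1) / (2.0 * np.pi ** 2)

draws = rpg1_approx(c=1.5, size=20000)
mean_hat = draws.mean()
mean_exact = np.tanh(1.5 / 2) / (2 * 1.5)   # E[PG(1, c)] = tanh(c/2) / (2c)
```

Comparing `mean_hat` against the known mean tanh(c/2)/(2c) gives a quick sanity check that the tilted family behaves as advertised.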
The likelihood of the binomial logistic model can be expressed in terms of latent Pólya-Gamma variables as follows. Assume y_i \sim \mathrm{Bernoulli}(\delta_i) with predictors x_i' = (x_{i1}, ..., x_{ip}) and success probability \delta_i = e^{x_i'\beta}/(1 + e^{x_i'\beta}). The posterior for the model parameters can then be represented as

[\beta \mid y] = \frac{[\beta] \prod_{i=1}^n \delta_i^{y_i} (1 - \delta_i)^{1-y_i}}{c(y)},

where c(y) is the normalizing constant.
To facilitate the sampling procedure, a data augmentation step can be performed by introducing Pólya-Gamma random variables \omega_i \sim \mathrm{PG}(1, x_i'\beta). This yields the data-augmented posterior

[\beta, \omega \mid y] = \frac{\left(\prod_{i=1}^n \Pr(y_i \mid \beta)\right) \prod_{i=1}^n f(\omega_i \mid x_i'\beta)\, [\beta]}{c(y)},    (2-7)

such that [\beta \mid y] = \int_{\mathbb{R}_+^n} [\beta, \omega \mid y]\, d\omega.
Thus, from the augmented model, the full conditional density of \beta is given by

[\beta \mid \omega, y] \propto \left(\prod_{i=1}^n \Pr(y_i \mid \beta)\right) \prod_{i=1}^n f(\omega_i \mid x_i'\beta)\, [\beta]
= [\beta] \prod_{i=1}^n \frac{(e^{x_i'\beta})^{y_i}}{1 + e^{x_i'\beta}} \cosh\!\left(\frac{|x_i'\beta|}{2}\right) \exp\!\left[-\frac{(x_i'\beta)^2 \omega_i}{2}\right] g(\omega_i).    (2-8)
This expression yields a normal posterior distribution if \beta is assigned a flat or normal prior. Hence, a two-step sampling strategy analogous to that of Albert & Chib (1993) can be used to estimate \beta in the occupancy framework.

2.2 Single Season Occupancy
Let p_{ij} = F(q_{ij}^T\lambda) be the probability of correctly classifying the ith site as occupied during the jth survey, conditional on the site being occupied, and let \psi_i = F(x_i^T\alpha) correspond to the presence probability at the ith site. Here F(\cdot) denotes the inverse of the link function (probit or logit) connecting the response to the predictors, and \lambda and \alpha denote, respectively, the r-variate and p-variate coefficient vectors for the detection and presence probabilities. Then the joint posterior of the presence indicators and the model parameters is

\pi^*(z, \alpha, \lambda) \propto \pi_\alpha(\alpha)\, \pi_\lambda(\lambda) \prod_{i=1}^N F(x_i'\alpha)^{z_i} \big(1 - F(x_i'\alpha)\big)^{1-z_i} \times \prod_{j=1}^J \big(z_i F(q_{ij}'\lambda)\big)^{y_{ij}} \big(1 - z_i F(q_{ij}'\lambda)\big)^{1-y_{ij}}.    (2-9)
As in the simple probit regression problem, this posterior is intractable, so sampling from it directly is not possible. However, the procedures of Albert & Chib for the probit model and of Polson et al. for the logit model can be extended to generate an MCMC sampling strategy for the occupancy problem. In what follows, we use this framework to develop samplers with which occupancy parameter estimates can be obtained under both probit and logit link functions. These algorithms have the added benefit that they require neither tuning parameters nor subjective elicitation of parameter priors.

2.2.1 Probit Link Model
To extend Albert & Chib's algorithm to the occupancy framework with a probit link, we first introduce two sets of latent variables, denoted w_{ij} and v_i, corresponding to the normal latent variables used to augment the data. The corresponding hierarchy is

y_{ij} \mid z_i, w_{ij} \sim \mathrm{Bernoulli}\big(z_i\, I(w_{ij} > 0)\big)
w_{ij} \mid \lambda \sim N(q_{ij}'\lambda, 1)
\lambda \sim [\lambda]
z_i \mid v_i \sim \mathrm{Bernoulli}\big(I(v_i > 0)\big)
v_i \mid \alpha \sim N(x_i'\alpha, 1)
\alpha \sim [\alpha]    (2-10)

represented by the directed graph in Figure 2-2 (\alpha \to v_i \to z_i \to y_i \leftarrow w_i \leftarrow \lambda).

Figure 2-2. Graphical representation of the occupancy model after data augmentation.
Under this hierarchical model, the joint density is given by

\pi^*(z, v, \alpha, w, \lambda) \propto C_y\, \pi_\alpha(\alpha)\, \pi_\lambda(\lambda) \prod_{i=1}^N \phi(v_i \mid x_i'\alpha, 1)\, I(v_i > 0)^{z_i}\, I(v_i \le 0)^{1-z_i} \times \prod_{j=1}^J \big(z_i I(w_{ij} > 0)\big)^{y_{ij}} \big(1 - z_i I(w_{ij} > 0)\big)^{1-y_{ij}} \phi(w_{ij} \mid q_{ij}'\lambda, 1).    (2-11)
The full conditional densities derived from the posterior in Equation 2-11 are detailed below.

1. The full conditional of z, obtained after integrating v and w out of (2-11):

f(z \mid \alpha, \lambda) = \prod_{i=1}^N f(z_i \mid \alpha, \lambda) = \prod_{i=1}^N \psi_i^{*\,z_i} (1 - \psi_i^*)^{1-z_i},
where \psi_i^* = \frac{\psi_i \prod_{j=1}^J p_{ij}^{y_{ij}} (1-p_{ij})^{1-y_{ij}}}{\psi_i \prod_{j=1}^J p_{ij}^{y_{ij}} (1-p_{ij})^{1-y_{ij}} + (1-\psi_i) \prod_{j=1}^J I(y_{ij} = 0)}.    (2-12)

2. f(v \mid z, \alpha) = \prod_{i=1}^N f(v_i \mid z_i, \alpha) = \prod_{i=1}^N \mathrm{trN}(x_i'\alpha, 1, A_i),
where A_i = (-\infty, 0] if z_i = 0 and A_i = (0, \infty) if z_i = 1,    (2-13)
and \mathrm{trN}(\mu, \sigma^2, A) denotes the pdf of a truncated normal random variable with mean \mu, variance \sigma^2, and truncation region A.

3. f(\alpha \mid v) = \phi_p(\alpha \mid \Sigma_\alpha X'v, \Sigma_\alpha),    (2-14)
where \Sigma_\alpha = (X'X)^{-1} and \phi_k(x \mid \mu, \Sigma) represents the k-variate normal density with mean vector \mu and variance matrix \Sigma.

4. f(w \mid y, z, \lambda) = \prod_{i=1}^N \prod_{j=1}^J f(w_{ij} \mid y_{ij}, z_i, \lambda) = \prod_{i=1}^N \prod_{j=1}^J \mathrm{trN}(q_{ij}'\lambda, 1, B_{ij}),
where B_{ij} = (-\infty, \infty) if z_i = 0; B_{ij} = (-\infty, 0] if z_i = 1 and y_{ij} = 0; and B_{ij} = (0, \infty) if z_i = 1 and y_{ij} = 1.    (2-15)

5. f(\lambda \mid w) = \phi_r(\lambda \mid \Sigma_\lambda Q'w, \Sigma_\lambda),    (2-16)
where \Sigma_\lambda = (Q'Q)^{-1}.
The Gibbs sampling algorithm for the model can then be summarized as:

1. Initialize z, \alpha, v, \lambda, and w.
2. Sample z_i \sim \mathrm{Bernoulli}(\psi_i^*).
3. Sample v_i from a truncated normal with \mu = x_i'\alpha, \sigma = 1, and truncation region depending on z_i.
4. Sample \alpha \sim N(\Sigma_\alpha X'v, \Sigma_\alpha), with \Sigma_\alpha = (X'X)^{-1}.
5. Sample w_{ij} from a truncated normal with \mu = q_{ij}'\lambda, \sigma = 1, and truncation region depending on y_{ij} and z_i.
6. Sample \lambda \sim N(\Sigma_\lambda Q'w, \Sigma_\lambda), with \Sigma_\lambda = (Q'Q)^{-1}.
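Steps 1-6 can be sketched end to end on simulated data. The following is our own illustration, not the dissertation's code: the sizes, covariate structure, true parameter values, and iteration counts are assumptions chosen only to make the sampler's behavior visible.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Simulated single-season occupancy data (illustrative)
N, J = 300, 4
X = np.column_stack([np.ones(N), rng.normal(size=N)])        # occupancy design (N x 2)
Q = np.dstack([np.ones((N, J)), rng.normal(size=(N, J))])    # detection design (N x J x 2)
alpha_true, lam_true = np.array([0.4, 0.8]), np.array([-0.2, 0.6])
z_true = rng.binomial(1, stats.norm.cdf(X @ alpha_true))
Y = rng.binomial(1, stats.norm.cdf(Q @ lam_true)) * z_true[:, None]

detected = Y.sum(axis=1) > 0
XtX_inv = np.linalg.inv(X.T @ X)
Qf = Q.reshape(N * J, 2)
QtQ_inv = np.linalg.inv(Qf.T @ Qf)
alpha, lam = np.zeros(2), np.zeros(2)
keep_a, keep_l = [], []
for m in range(1200):
    # Step 2: z_i ~ Bern(psi*_i) (eq. 2-12); z_i = 1 wherever a detection occurred
    psi, p = stats.norm.cdf(X @ alpha), stats.norm.cdf(Q @ lam)
    no_det = np.prod(1 - p, axis=1)
    psi_star = psi * no_det / (psi * no_det + 1 - psi)
    z = np.where(detected, 1, rng.binomial(1, psi_star))
    # Step 3: v_i truncated normal, region set by z_i (eq. 2-13)
    mu_v = X @ alpha
    lo = np.where(z == 1, 0.0, -np.inf); hi = np.where(z == 1, np.inf, 0.0)
    v = stats.truncnorm.rvs(lo - mu_v, hi - mu_v, loc=mu_v, random_state=rng)
    # Step 4: alpha | v (eq. 2-14)
    alpha = rng.multivariate_normal(XtX_inv @ (X.T @ v), XtX_inv)
    # Step 5: w_ij truncated normal; unconstrained when z_i = 0 (eq. 2-15)
    mu_w = (Q @ lam).ravel()
    zf, yf = np.repeat(z, J), Y.ravel()
    lo = np.where((zf == 1) & (yf == 1), 0.0, -np.inf)
    hi = np.where((zf == 1) & (yf == 0), 0.0, np.inf)
    w = stats.truncnorm.rvs(lo - mu_w, hi - mu_w, loc=mu_w, random_state=rng)
    # Step 6: lambda | w (eq. 2-16)
    lam = rng.multivariate_normal(QtQ_inv @ (Qf.T @ w), QtQ_inv)
    if m >= 300:
        keep_a.append(alpha); keep_l.append(lam)

alpha_hat, lam_hat = np.mean(keep_a, axis=0), np.mean(keep_l, axis=0)
```

Every draw in the loop is from a standard distribution, so no tuning or subjective prior input is needed, which is the point of the construction.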
2.2.2 Logit Link Model
Turning now to the logit-link version of the occupancy model, again let y_{ij} be the indicator variable that marks detection of the target species on the jth survey at the ith site, and let z_i be the indicator variable that denotes presence (z_i = 1) or absence (z_i = 0) of the target species at the ith site. The model is now defined by

y_{ij} \mid z_i, \lambda \sim \mathrm{Bernoulli}(z_i\, p_{ij}), where p_{ij} = \frac{e^{q_{ij}'\lambda}}{1 + e^{q_{ij}'\lambda}},    \lambda \sim [\lambda]
z_i \mid \alpha \sim \mathrm{Bernoulli}(\psi_i), where \psi_i = \frac{e^{x_i'\alpha}}{1 + e^{x_i'\alpha}},    \alpha \sim [\alpha].
In this hierarchy, the contribution of a single site to the likelihood is

L_i(\alpha, \lambda) = \frac{(e^{x_i'\alpha})^{z_i}}{1 + e^{x_i'\alpha}} \prod_{j=1}^J \left(z_i \frac{e^{q_{ij}'\lambda}}{1 + e^{q_{ij}'\lambda}}\right)^{y_{ij}} \left(1 - z_i \frac{e^{q_{ij}'\lambda}}{1 + e^{q_{ij}'\lambda}}\right)^{1-y_{ij}}.    (2-17)
As in the probit case, we augment the likelihood with two separate sets of latent variables, in this case each having a Pólya-Gamma distribution. Augmenting the model and using the posterior in (2-7), the joint is

[z, v, w, \alpha, \lambda \mid y] \propto [\alpha][\lambda] \prod_{i=1}^N \frac{(e^{x_i'\alpha})^{z_i}}{1 + e^{x_i'\alpha}} \cosh\!\left(\frac{|x_i'\alpha|}{2}\right) \exp\!\left[-\frac{(x_i'\alpha)^2 v_i}{2}\right] g(v_i)
\times \prod_{j=1}^J \left(z_i \frac{e^{q_{ij}'\lambda}}{1 + e^{q_{ij}'\lambda}}\right)^{y_{ij}} \left(1 - z_i \frac{e^{q_{ij}'\lambda}}{1 + e^{q_{ij}'\lambda}}\right)^{1-y_{ij}} \cosh\!\left(\frac{|z_i q_{ij}'\lambda|}{2}\right) \exp\!\left[-\frac{(z_i q_{ij}'\lambda)^2 w_{ij}}{2}\right] g(w_{ij}).    (2-18)
The full conditionals for z, \alpha, v, \lambda, and w obtained from (2-18) are provided below.

1. The full conditional of z is obtained after marginalizing the latent variables and yields

f(z \mid \alpha, \lambda) = \prod_{i=1}^N f(z_i \mid \alpha, \lambda) = \prod_{i=1}^N \psi_i^{*\,z_i} (1 - \psi_i^*)^{1-z_i},
where \psi_i^* = \frac{\psi_i \prod_{j=1}^J p_{ij}^{y_{ij}} (1-p_{ij})^{1-y_{ij}}}{\psi_i \prod_{j=1}^J p_{ij}^{y_{ij}} (1-p_{ij})^{1-y_{ij}} + (1-\psi_i) \prod_{j=1}^J I(y_{ij} = 0)}.    (2-19)
2. Using the result derived in Polson et al. (2013), we have

f(v \mid z, \alpha) = \prod_{i=1}^N f(v_i \mid z_i, \alpha) = \prod_{i=1}^N \mathrm{PG}(1, x_i'\alpha).    (2-20)
3. f(\alpha \mid z, v) \propto [\alpha] \prod_{i=1}^N \exp\!\left[z_i x_i'\alpha - \frac{x_i'\alpha}{2} - \frac{(x_i'\alpha)^2 v_i}{2}\right].    (2-21)
4. By the same result as that used for v, the full conditional for w is

f(w \mid y, z, \lambda) = \prod_{i=1}^N \prod_{j=1}^J f(w_{ij} \mid y_{ij}, z_i, \lambda)
= \left(\prod_{i \in S_1} \prod_{j=1}^J \mathrm{PG}(1, |q_{ij}'\lambda|)\right) \left(\prod_{i \notin S_1} \prod_{j=1}^J \mathrm{PG}(1, 0)\right),    (2-22)

with S_1 = \{i \in \{1, 2, ..., N\} : z_i = 1\}.
5. f(\lambda \mid z, y, w) \propto [\lambda] \prod_{i \in S_1} \prod_{j=1}^J \exp\!\left[y_{ij} q_{ij}'\lambda - \frac{q_{ij}'\lambda}{2} - \frac{(q_{ij}'\lambda)^2 w_{ij}}{2}\right],    (2-23)

with S_1 as defined above.
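To show what the Pólya-Gamma updates look like in practice, here is a sketch of the two-step sampler for the occurrence component alone (logistic regression of z on x). This is our own illustration: the truncated-series PG draw is an approximation, and the flat-prior conjugate update \alpha \mid v \sim N\big((X'\Omega X)^{-1}X'\kappa, (X'\Omega X)^{-1}\big) with \kappa_i = z_i - 1/2 is the standard form of the Polson et al. (2013) result, which the unnormalized density (2-21) yields after completing the square.

```python
import numpy as np

rng = np.random.default_rng(3)

def rpg1(c, n_terms=100):
    """Approximate PG(1, c) draws (truncated sum-of-gammas, Polson et al. 2013)."""
    k = np.arange(1, n_terms + 1)
    denom = (k - 0.5) ** 2 + np.asarray(c)[..., None] ** 2 / (4.0 * np.pi ** 2)
    g = rng.exponential(size=denom.shape)
    return (g / denom).sum(axis=-1) / (2.0 * np.pi ** 2)

# Logistic regression for the occurrence indicators z_i (illustrative data)
N = 400
X = np.column_stack([np.ones(N), rng.normal(size=N)])
alpha_true = np.array([-0.5, 1.0])
z = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ alpha_true)))

kappa = z - 0.5
alpha = np.zeros(2)
draws = []
for m in range(1500):
    omega = rpg1(np.abs(X @ alpha))                  # v_i | z, alpha (cf. eq. 2-20)
    V = np.linalg.inv(X.T @ (omega[:, None] * X))    # (X' Omega X)^{-1}, flat prior
    alpha = rng.multivariate_normal(V @ (X.T @ kappa), V)   # alpha | v (cf. eq. 2-21)
    if m >= 500:
        draws.append(alpha)

alpha_hat = np.mean(draws, axis=0)
```

The detection parameters \lambda are updated in exactly the same way, restricted to the sites in S_1, per Equations 2-22 and 2-23.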
The Gibbs sampling algorithm is analogous to the one with a probit link, but with the obvious modifications to incorporate Pólya-Gamma instead of normal latent variables.

2.3 Temporal Dynamics and Spatial Structure
The uses of the single-season model are limited to very specific problems. In particular, the assumptions of the basic model may become too restrictive or unrealistic whenever the study period extends over multiple years or seasons, especially given the increasingly changing environmental conditions that most ecosystems are currently experiencing.
Among the many extensions found in the literature, one that we consider particularly relevant incorporates heterogeneous occupancy probabilities through time. Extensions of site-occupancy models that incorporate temporally varying probabilities can be traced back to Hanski (1994). The heterogeneity of occupancy probabilities through time arises from local colonization and extinction processes. MacKenzie et al. (2003) proposed an alternative to Hanski's approach in order to incorporate imperfect detection; the method is flexible enough to let detection, occurrence, survival, and colonization probabilities each depend upon its own set of covariates, using likelihood-based estimation for the model parameters.
However, the approach of MacKenzie et al. presents two drawbacks. First, the uncertainty assessment for the maximum likelihood parameter estimates relies on asymptotic results (obtained via the delta method), making it sensitive to sample size. Second, to obtain parameter estimates, the latent occupancy process is marginalized out of the likelihood, leading to the usual zero-inflated Bernoulli model. Although this is a convenient strategy for solving the estimation problem, the latent state variables (occupancy indicators) are no longer available, and as such, finite-sample estimates cannot be calculated unless an additional (and computationally expensive) parametric bootstrap step is performed (Royle & Kéry, 2007). Additionally, as the occupancy process is integrated out, the likelihood approach precludes incorporating additional structural dependence through random effects. The model therefore cannot account for spatial dependence, which plays a fundamental role in this setting.
To work around some of the shortcomings encountered when fitting dynamic occupancy models via likelihood-based methods, Royle & Kéry developed what they refer to as a dynamic occupancy state-space model (DOSS), alluding to the conceptual similarity between this model and the class of state-space models found in the time series literature. In particular, this model retains the latent process (occupancy indicators), making it possible to obtain small-sample estimates and, eventually, to generate extensions that incorporate structure in time and/or space through random effects.
The data used in the DOSS model come from standard repeated presence/absence surveys with N sampling locations (patches or sites), indexed by i = 1, 2, ..., N. Within a given season (e.g., year, month, or week, depending on the biology of the species), each sampling location is visited (surveyed) j = 1, 2, ..., J times. This process is repeated for t = 1, 2, ..., T seasons. An important assumption here is that the site-occupancy status is closed within, but not across, seasons.

As is usual in the occupancy modeling framework, two different processes are considered. The first is the detection process per site-visit-season combination, denoted y_{ijt}. The y_{ijt} are indicators that take the value 1 if the species is detected at site i, survey j, and season t, and 0 otherwise; these detection indicators are assumed to be independent within each site and season. The second response considered is the set of partially observed presence (occupancy) indicators z_{it}. These are indicator variables equal to 1 whenever y_{ijt} = 1 for one or more of the visits made to site i during season t; otherwise the values of the z_{it} are unknown. Royle & Kéry refer to these two processes as the observation (y_{ijt}) and state (z_{it}) models.
In this setting, the parameters of greatest interest are the occurrence (site-occupancy) probabilities, denoted \psi_{it}, as well as those representing the population dynamics, which are accounted for by introducing changes in occupancy status over time through local colonization and survival. That is, if a site was not occupied at season t-1, at season t it can either be colonized or remain unoccupied. On the other hand, if the site was occupied at season t-1, it can remain occupied (survival) or become abandoned (local extinction) at season t. The probabilities of survival and colonization from season t-1 to season t at the ith site are denoted \theta_{i(t-1)} and \gamma_{i(t-1)}, respectively.

During the initial season, the model for the state process is expressed in terms of the occupancy probability (Equation 2-24). For subsequent seasons, the state process is specified in terms of survival and colonization probabilities (Equation 2-25):

z_{i1} \sim \mathrm{Bernoulli}(\psi_{i1})    (2-24)
z_{it} \mid z_{i(t-1)} \sim \mathrm{Bernoulli}\big(z_{i(t-1)}\theta_{i(t-1)} + (1 - z_{i(t-1)})\gamma_{i(t-1)}\big)    (2-25)

The observation model, conditional on the latent process z_{it}, is defined by

y_{ijt} \mid z_{it} \sim \mathrm{Bernoulli}(z_{it}\, p_{ijt})    (2-26)
Royle & Kéry induce heterogeneity by site, site-season, and site-survey-season, respectively, in the occupancy, survival and colonization, and detection probabilities through the following specification:

\mathrm{logit}(\psi_{i1}) = x_1 + r_i,    r_i \sim N(0, \sigma_\psi^2),    \mathrm{logit}^{-1}(x_1) \sim \mathrm{Unif}(0, 1)
\mathrm{logit}(\theta_{it}) = a_t + u_i,    u_i \sim N(0, \sigma_\theta^2),    \mathrm{logit}^{-1}(a_t) \sim \mathrm{Unif}(0, 1)
\mathrm{logit}(\gamma_{it}) = b_t + v_i,    v_i \sim N(0, \sigma_\gamma^2),    \mathrm{logit}^{-1}(b_t) \sim \mathrm{Unif}(0, 1)
\mathrm{logit}(p_{ijt}) = c_t + w_{ij},    w_{ij} \sim N(0, \sigma_p^2),    \mathrm{logit}^{-1}(c_t) \sim \mathrm{Unif}(0, 1)    (2-27)

where x_1, a_t, b_t, and c_t are the season fixed effects for the corresponding probabilities, and (r_i, u_i, v_i) and w_{ij} are the site and site-survey random effects, respectively. Additionally, all variance components are assigned the usual inverse-gamma priors.
As the authors state, this formulation can be regarded as "suitably vague"; however, it is also restrictive, in the sense that it is not clear how to incorporate additional covariates while preserving the straightforward sampling strategy.

2.3.1 Dynamic Mixture Occupancy State-Space Model
We assume that the probabilities of occupancy, survival, colonization, and detection are all functions of linear combinations of covariates. However, our setup varies slightly from that considered by Royle & Kéry (2007). In essence, we modify the way in which the estimates of survival and colonization probabilities are attained. Our model incorporates the notion that occupancy at a site occupied during the previous season takes place through persistence, where we define persistence as a function of both survival and colonization. That is, a site occupied at time t may again be occupied at time t + 1 if the current settlers survive, if they perish and new settlers colonize simultaneously, or if both the current settlers survive and new ones colonize.
Our functional forms of choice are again the probit and logit link functions. This means that each probability of interest, which we will refer to generically as \delta, is linked to a linear combination of covariates x^T\xi through the relationship \delta = F(x^T\xi), where F(\cdot) represents the inverse link function. This particular assumption facilitates relating the data augmentation algorithms of Albert & Chib and Polson et al. to Royle & Kéry's DOSS model. We refer to this extension of Royle & Kéry's model as the Dynamic Mixture Occupancy State-Space (DYMOSS) model.
As before, let y_{ijt} be the indicator variable that marks detection of the target species on the jth survey at the ith site during the tth season, and let z_{it} be the indicator variable that denotes presence (z_{it} = 1) or absence (z_{it} = 0) of the target species at the ith site during the tth season, with i \in \{1, 2, ..., N\}, j \in \{1, 2, ..., J\}, and t \in \{1, 2, ..., T\}. Additionally, assume that the probabilities of first-season occupancy, persistence, colonization, and detection are all functions of covariates, with corresponding parameter vectors \alpha, \Delta^{(s)} = \{\delta^{(s)}_{t-1}\}_{t=2}^T, B^{(c)} = \{\beta^{(c)}_{t-1}\}_{t=2}^T, and \Lambda = \{\lambda_t\}_{t=1}^T, and covariate matrices X^{(o)}, \{X_{t-1}\}_{t=2}^T, and \{Q_t\}_{t=1}^T, respectively. Using the notation above, our
proposed dynamic occupancy model is defined by the following hierarchy.

State model:

z_{i1} \mid \alpha \sim \mathrm{Bernoulli}(\psi_{i1}), where \psi_{i1} = F(x_{(o)i}'\alpha)
z_{it} \mid z_{i(t-1)}, \delta^{(s)}_{t-1}, \beta^{(c)}_{t-1} \sim \mathrm{Bernoulli}\big(z_{i(t-1)}\theta_{i(t-1)} + (1 - z_{i(t-1)})\gamma_{i(t-1)}\big),
where \theta_{i(t-1)} = F\big(\delta^{(s)}_{t-1} + x_{i(t-1)}'\beta^{(c)}_{t-1}\big) and \gamma_{i(t-1)} = F\big(x_{i(t-1)}'\beta^{(c)}_{t-1}\big)    (2-28)

Observation model:

y_{ijt} \mid z_{it}, \lambda_t \sim \mathrm{Bernoulli}(z_{it}\, p_{ijt}), where p_{ijt} = F(q_{ijt}'\lambda_t)    (2-29)
In the hierarchical setup given by Equations 2-28 and 2-29, \theta_{i(t-1)} corresponds to the probability of persistence from time t-1 to time t at site i, and \gamma_{i(t-1)} denotes the colonization probability. Note that \theta_{i(t-1)} - \gamma_{i(t-1)} yields the survival probability from t-1 to t. The effect of survival is introduced by shifting the intercept of the linear predictor by a quantity \delta^{(s)}_{t-1}. Although in this version of the model the effect enters only through the intercept, the model can be extended to have covariates determining \delta^{(s)}_{t-1} as well. The graphical representation of the model for a single site is given in Figure 2-3.
Figure 2-3. Graphical representation of the multiseason model for a single site.
The joint posterior for the model defined by this hierarchical setting is

[z, \alpha, B^{(c)}, \Delta^{(s)}, \Lambda \mid y] = C_y \prod_{i=1}^N \left\{\psi_{i1} \prod_{j=1}^J p_{ij1}^{y_{ij1}} (1-p_{ij1})^{1-y_{ij1}}\right\}^{z_{i1}} \left\{(1-\psi_{i1}) \prod_{j=1}^J I(y_{ij1} = 0)\right\}^{1-z_{i1}} [\lambda_1][\alpha]
\times \prod_{t=2}^T \prod_{i=1}^N \left[\big(\theta_{i(t-1)}^{z_{it}} (1-\theta_{i(t-1)})^{1-z_{it}}\big)^{z_{i(t-1)}} \big(\gamma_{i(t-1)}^{z_{it}} (1-\gamma_{i(t-1)})^{1-z_{it}}\big)^{1-z_{i(t-1)}}\right] \left\{\prod_{j=1}^J p_{ijt}^{y_{ijt}} (1-p_{ijt})^{1-y_{ijt}}\right\}^{z_{it}}
\times \left\{\prod_{j=1}^J I(y_{ijt} = 0)\right\}^{1-z_{it}} [\lambda_t][\beta^{(c)}_{t-1}][\delta^{(s)}_{t-1}],    (2-30)

which, as in the single-season case, is intractable, so once again a Gibbs sampler cannot be constructed directly to sample from this joint posterior. The graphical representation of the model for one site, incorporating the latent variables, is provided in Figure 2-4.
Figure 2-4. Graphical representation of the data-augmented multiseason model.
Probit link: normal-mixture DYMOSS model
We deal with the intractability of the joint posterior distribution as before, that is, by introducing latent random variables. Each latent variable incorporates the relevant linear combination of covariates for the corresponding probability in the model; this artifact enables us to sample from the joint posterior distribution of the model parameters. For the probit link, the sets of latent random variables, respectively for first-season occupancy, for persistence and colonization, and for detection, are

• u_i \sim N(x_{(o)i}'\alpha, 1),
• v_{i(t-1)} \sim z_{i(t-1)} N\big(\delta^{(s)}_{t-1} + x_{i(t-1)}'\beta^{(c)}_{t-1}, 1\big) + (1 - z_{i(t-1)}) N\big(x_{i(t-1)}'\beta^{(c)}_{t-1}, 1\big), and
• w_{ijt} \sim N(q_{ijt}'\lambda_t, 1).
Introducing these latent variables into the hierarchical formulation yields:

State model:

u_i \mid \alpha \sim N(x_{(o)i}'\alpha, 1)
z_{i1} \mid u_i \sim \mathrm{Bernoulli}\big(I(u_i > 0)\big)

for t > 1:

v_{i(t-1)} \mid z_{i(t-1)}, \delta^{(s)}_{t-1}, \beta^{(c)}_{t-1} \sim z_{i(t-1)} N\big(\delta^{(s)}_{t-1} + x_{i(t-1)}'\beta^{(c)}_{t-1}, 1\big) + (1 - z_{i(t-1)}) N\big(x_{i(t-1)}'\beta^{(c)}_{t-1}, 1\big)
z_{it} \mid v_{i(t-1)} \sim \mathrm{Bernoulli}\big(I(v_{i(t-1)} > 0)\big)    (2-31)

Observation model:

w_{ijt} \mid \lambda_t \sim N(q_{ijt}'\lambda_t, 1)
y_{ijt} \mid z_{it}, w_{ijt} \sim \mathrm{Bernoulli}\big(z_{it}\, I(w_{ijt} > 0)\big)    (2-32)

Note that the result presented in Section 2.2 corresponds to the particular case T = 1 of the model specified by Equations 2-31 and 2-32.
As mentioned previously, model parameters are obtained using a Gibbs sampling approach. Let \phi(x \mid \mu, \sigma^2) denote the pdf of a normally distributed random variable x with mean \mu and variance \sigma^2, and let

1. W_t = (w_{1t}, w_{2t}, ..., w_{Nt}), with w_{it} = (w_{i1t}, w_{i2t}, ..., w_{iJ_{it}t}), for i = 1, 2, ..., N and t = 1, 2, ..., T;
2. u = (u_1, u_2, ..., u_N); and
3. V = (v_1, ..., v_{T-1}), with v_t = (v_{1t}, v_{2t}, ..., v_{Nt}).
For the probit link model, the joint posterior distribution is

\pi\big(z, u, V, \{W_t\}_{t=1}^T, \alpha, B^{(c)}, \Delta^{(s)}, \Lambda\big) \propto [\alpha] \prod_{i=1}^N \phi(u_i \mid x_{(o)i}'\alpha, 1)\, I(u_i > 0)^{z_{i1}}\, I(u_i \le 0)^{1-z_{i1}}
\times \prod_{t=2}^T [\beta^{(c)}_{t-1}, \delta^{(s)}_{t-1}] \prod_{i=1}^N \phi\big(v_{i(t-1)} \mid \mu^{(v)}_{i(t-1)}, 1\big)\, I(v_{i(t-1)} > 0)^{z_{it}}\, I(v_{i(t-1)} \le 0)^{1-z_{it}}
\times \prod_{t=1}^T [\lambda_t] \prod_{i=1}^N \prod_{j=1}^{J_{it}} \phi(w_{ijt} \mid q_{ijt}'\lambda_t, 1)\, \big(z_{it} I(w_{ijt} > 0)\big)^{y_{ijt}} \big(1 - z_{it} I(w_{ijt} > 0)\big)^{1-y_{ijt}},

where \mu^{(v)}_{i(t-1)} = z_{i(t-1)}\delta^{(s)}_{t-1} + x_{i(t-1)}'\beta^{(c)}_{t-1}.    (2-33)
Initialize the Gibbs sampler at \alpha^{(0)}, B^{(c)(0)}, \Delta^{(s)(0)}, and \Lambda^{(0)}. The sampler proceeds iteratively, block-sampling sequentially for each primary sampling period: first the presence process, then the latent variables from the data-augmentation step for the presence component, followed by the parameters for the presence process, then the latent variables for the detection component, and finally the parameters for the detection component. Letting [\cdot \mid \cdot] denote the full conditional probability density of a component conditional on all other unknown parameters and the observed data, for m = 1, ..., n_{sim} the sampling sequence is

[z_1^{(m)} \mid \cdot] \to [u^{(m)} \mid \cdot] \to [\alpha^{(m)} \mid \cdot] \to [W_1^{(m)} \mid \cdot] \to [\lambda_1^{(m)} \mid \cdot] \to [z_2^{(m)} \mid \cdot] \to [V_1^{(m)} \mid \cdot] \to [\beta^{(c)(m)}_1, \delta^{(s)(m)}_1 \mid \cdot] \to [W_2^{(m)} \mid \cdot] \to [\lambda_2^{(m)} \mid \cdot] \to \cdots \to [z_T^{(m)} \mid \cdot] \to [V_{T-1}^{(m)} \mid \cdot] \to [\beta^{(c)(m)}_{T-1}, \delta^{(s)(m)}_{T-1} \mid \cdot] \to [W_T^{(m)} \mid \cdot] \to [\lambda_T^{(m)} \mid \cdot]

The full conditional probability densities for this Gibbs sampling algorithm are presented in detail in Appendix A.
Logit link: Pólya-Gamma DYMOSS model
Using the same notation as before, the logit-link model resorts to the hierarchy:

State model:

z_{i1} \mid \alpha \sim \mathrm{Bernoulli}(\psi_{i1}), where \psi_{i1} = e^{x_{(o)i}'\alpha}/(1 + e^{x_{(o)i}'\alpha})
u_i \mid \alpha \sim \mathrm{PG}\big(1, |x_{(o)i}'\alpha|\big)

for t > 1:

z_{it} \mid z_{i(t-1)}, \delta^{(s)}_{t-1}, \beta^{(c)}_{t-1} \sim \mathrm{Bernoulli}\big(z_{i(t-1)}\theta_{i(t-1)} + (1 - z_{i(t-1)})\gamma_{i(t-1)}\big)
v_{i(t-1)} \mid z_{i(t-1)}, \delta^{(s)}_{t-1}, \beta^{(c)}_{t-1} \sim \mathrm{PG}\big(1, |z_{i(t-1)}\delta^{(s)}_{t-1} + x_{i(t-1)}'\beta^{(c)}_{t-1}|\big)    (2-34)

Observation model:

y_{ijt} \mid z_{it}, \lambda_t \sim \mathrm{Bernoulli}(z_{it}\, p_{ijt}), where p_{ijt} = e^{q_{ijt}'\lambda_t}/(1 + e^{q_{ijt}'\lambda_t})
w_{ijt} \mid z_{it}, \lambda_t \sim \mathrm{PG}\big(1, |z_{it} q_{ijt}'\lambda_t|\big)    (2-35)
The logit-link version of the joint posterior is given by

\pi\big(z, u, V, \{W_t\}_{t=1}^T, \alpha, B^{(c)}, \Delta^{(s)}, \Lambda\big) \propto [\alpha][\lambda_1] \prod_{i=1}^N \frac{(e^{x_{(o)i}'\alpha})^{z_{i1}}}{1 + e^{x_{(o)i}'\alpha}}\, \mathrm{PG}\big(u_i \mid 1, |x_{(o)i}'\alpha|\big)
\times \prod_{j=1}^{J_{i1}} \left(z_{i1} \frac{e^{q_{ij1}'\lambda_1}}{1 + e^{q_{ij1}'\lambda_1}}\right)^{y_{ij1}} \left(1 - z_{i1} \frac{e^{q_{ij1}'\lambda_1}}{1 + e^{q_{ij1}'\lambda_1}}\right)^{1-y_{ij1}} \mathrm{PG}\big(w_{ij1} \mid 1, |z_{i1} q_{ij1}'\lambda_1|\big)
\times \prod_{t=2}^T [\delta^{(s)}_{t-1}][\beta^{(c)}_{t-1}][\lambda_t] \prod_{i=1}^N \frac{\big(\exp[\mu^{(v)}_{i(t-1)}]\big)^{z_{it}}}{1 + \exp[\mu^{(v)}_{i(t-1)}]}\, \mathrm{PG}\big(v_{i(t-1)} \mid 1, |\mu^{(v)}_{i(t-1)}|\big)
\times \prod_{j=1}^{J_{it}} \left(z_{it} \frac{e^{q_{ijt}'\lambda_t}}{1 + e^{q_{ijt}'\lambda_t}}\right)^{y_{ijt}} \left(1 - z_{it} \frac{e^{q_{ijt}'\lambda_t}}{1 + e^{q_{ijt}'\lambda_t}}\right)^{1-y_{ijt}} \mathrm{PG}\big(w_{ijt} \mid 1, |z_{it} q_{ijt}'\lambda_t|\big),    (2-36)

with \mu^{(v)}_{i(t-1)} = z_{i(t-1)}\delta^{(s)}_{t-1} + x_{i(t-1)}'\beta^{(c)}_{t-1}.
The sampling procedure is entirely analogous to that described for the probit version. The full conditional densities derived from Expression 2-36 are described in detail in Appendix A.

2.3.2 Incorporating Spatial Dependence
In this section we describe how an additional layer of complexity, space, can also be accounted for by continuing to use the same data-augmentation framework. The method we employ to incorporate spatial dependence is a slightly modified version of the traditional approach for spatial generalized linear mixed models (GLMMs), and extends the model proposed by Johnson et al. (2013) for the single-season, closed-population occupancy model.
The traditional approach consists of using spatial random effects to induce a correlation structure among adjacent sites. This formulation, introduced by Besag et al. (1991), assumes that the spatial random effect corresponds to a Gaussian Markov Random Field (GMRF). The model, known as the spatial GLMM (SGLMM), is used to analyze areal data. It has been applied extensively, given the flexibility of its hierarchical formulation and the availability of software for its implementation (Hughes & Haran 2013).
Succinctly, the spatial dependence is accounted for in the model by adding a random vector $\boldsymbol{\eta}$, assumed to have a conditionally autoregressive (CAR) prior (also known as the Gaussian Markov random field prior). To define the prior, let the pair $G = (V, E)$ represent the undirected graph for the entire spatial region studied, where $V = (1, 2, \ldots, N)$ denotes the vertices of the graph (sites) and $E$ the set of edges between sites; $E$ is constituted by elements of the form $(i, j)$ indicating that sites $i$ and $j$ are spatially adjacent, for some $i, j \in V$. The prior for the spatial effects is then characterized by

$[\boldsymbol{\eta} \mid \tau] \propto \tau^{\mathrm{rank}(Q)/2}\exp\left[-\frac{\tau}{2}\boldsymbol{\eta}'Q\boldsymbol{\eta}\right],$  (2–37)
where $Q = \mathrm{diag}(A\mathbf{1}) - A$ is the precision matrix, with $A$ denoting the adjacency matrix. The entries of the adjacency matrix $A$ are such that $\mathrm{diag}(A) = \mathbf{0}$ and $A_{ij} = \mathbb{I}_{(i,j) \in E}$.

The matrix $Q$ is singular; hence, the probability density defined in equation 2–37 is improper, i.e., it does not integrate to 1. Regardless of the impropriety of the prior, this model can be fitted using a Bayesian approach, since even if the prior is improper the posterior for the model parameters is proper. If a constraint such as $\sum_k \eta_k = 0$ is imposed, or if the precision matrix is replaced by a positive definite matrix, the model can also be fitted using a maximum likelihood approach.
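As a small illustration of the construction above (a toy example of our own, using a chain of five sites), the precision matrix and its singularity can be checked directly:

```python
import numpy as np

# Adjacency matrix for a 5-site chain: sites i and i+1 are neighbors.
N = 5
A = np.zeros((N, N))
for i in range(N - 1):
    A[i, i + 1] = A[i + 1, i] = 1

# CAR precision from equation 2-37: Q = diag(A 1) - A.
Q = np.diag(A.sum(axis=1)) - A

# Q is singular: its rows sum to zero, so the constant vector is a
# null vector and the density in 2-37 is improper.
print(np.linalg.matrix_rank(Q))        # 4, i.e. N - 1 for a connected graph
print(np.allclose(Q @ np.ones(N), 0))  # True
```

The rank deficiency equals the number of connected components of the graph, which is why the sum-to-zero constraint mentioned above restores identifiability.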
Assuming that all but the detection process are subject to spatial correlation, and using the notation we have developed up to this point, the spatially explicit version of the DYMOSS model is characterized by the hierarchy represented by equations 2–38 and 2–39.

Hence, adding spatial structure into the DYMOSS framework described in the previous section only involves adding the steps to sample $\boldsymbol{\eta}^{(o)}$ and $\{\boldsymbol{\eta}_t\}_{t=2}^{T}$, conditional on all other parameters. Furthermore, the corresponding parameters and spatial random effects of a given component (i.e., occupancy, survival, and colonization) can be effortlessly pooled together into a single parameter vector to perform block sampling. For each of the latent variables, the only modification required is to add the corresponding spatial effect to the linear predictor, so that these retain their conditional independence given the linear combination of fixed effects and the spatial effects.
State model:
$z_{i1} \mid \boldsymbol{\alpha} \sim \mathrm{Bernoulli}(\psi_{i1})$, where $\psi_{i1} = F\left(\mathbf{x}_{(o)i}^{T}\boldsymbol{\alpha} + \eta_i^{(o)}\right)$
$[\boldsymbol{\eta}^{(o)} \mid \tau] \propto \tau^{\mathrm{rank}(Q)/2}\exp\left[-\frac{\tau}{2}\boldsymbol{\eta}^{(o)\prime}Q\boldsymbol{\eta}^{(o)}\right]$
$z_{it} \mid z_{i(t-1)}, \boldsymbol{\alpha}, \boldsymbol{\beta}_{t-1}, \boldsymbol{\lambda}_{t-1} \sim \mathrm{Bernoulli}\left(z_{i(t-1)}\theta_{i(t-1)} + \left(1 - z_{i(t-1)}\right)\gamma_{i(t-1)}\right)$,
where $\theta_{i(t-1)} = F\left(\delta_{(t-1)}^{(s)} + \mathbf{x}_{i(t-1)}^{T}\boldsymbol{\beta}_{t-1}^{(c)} + \eta_{it}\right)$ and $\gamma_{i(t-1)} = F\left(\mathbf{x}_{i(t-1)}^{T}\boldsymbol{\beta}_{t-1}^{(c)} + \eta_{it}\right)$
$[\boldsymbol{\eta}_t \mid \tau] \propto \tau^{\mathrm{rank}(Q)/2}\exp\left[-\frac{\tau}{2}\boldsymbol{\eta}_t'Q\boldsymbol{\eta}_t\right]$  (2–38)
Observed model:
$y_{ijt} \mid z_{it}, \boldsymbol{\eta}_t \sim \mathrm{Bernoulli}\left(z_{it}p_{ijt}\right)$, where $p_{ijt} = F\left(\mathbf{q}_{ijt}^{T}\boldsymbol{\lambda}_t\right)$  (2–39)
In spite of the popularity of this approach to incorporating spatial dependence, three shortcomings have been reported in the literature (Hughes & Haran 2013; Reich et al. 2006): (1) model parameters have no clear interpretation, due to spatial confounding of the predictors with the spatial effect; (2) there is variance inflation due to spatial confounding; and (3) the high dimensionality of the latent spatial variables leads to high computational costs. To avoid such difficulties, we follow the approach used by Hughes & Haran (2013), which builds upon the earlier work by Reich et al. (2006). This methodology is summarized in what follows.
Let a vector of spatial effects $\boldsymbol{\eta}$ have the CAR prior given by 2–37 above. Now consider a random vector $\boldsymbol{\zeta} \sim \mathrm{MVN}\left(\mathbf{0}, \tau K'QK\right)$, with $Q$ defined as above, where $\tau K'QK$ corresponds to the precision of the distribution (not the covariance matrix), and with the matrix $K$ satisfying $K'K = I$.
This last condition implies that the linear predictor satisfies $X\boldsymbol{\beta} + \boldsymbol{\eta} = X\boldsymbol{\beta} + K\boldsymbol{\zeta}$. With respect to how the matrix $K$ is chosen, Hughes & Haran (2013) recommend basing its construction on the spectral decomposition of operator matrices based on Moran's $I$. The Moran operator matrix is defined as $P^{\perp}AP^{\perp}$, with $P^{\perp} = I - X(X'X)^{-1}X'$, and where $A$ is the adjacency matrix previously described. The choice of the Moran operator is based on the fact that it accounts for the underlying graph while incorporating the spatial structure residual to the design matrix $X$. Both features are reflected in the spectral decomposition of the Moran operator: its eigenvalues correspond to the values of Moran's $I$ statistic (a measure of spatial autocorrelation) for spatial processes orthogonal to $X$, while its eigenvectors provide the patterns of spatial dependence residual to $X$. Thus, the matrix $K$ is chosen to be the matrix whose columns are the eigenvectors of the Moran operator for a particular adjacency matrix.
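The construction of $K$ can be sketched in a few lines. This is an illustrative toy example under our own assumptions: a chain-graph adjacency matrix, a two-column design matrix, and retention of the $q$ leading eigenvectors, which is the dimension-reduction device that addresses shortcoming (3) above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy inputs: adjacency A for a chain of N sites, design matrix X.
N, p = 30, 2
A = np.zeros((N, N))
for i in range(N - 1):
    A[i, i + 1] = A[i + 1, i] = 1
X = np.column_stack([np.ones(N), rng.normal(size=N)])

# Residual projection P_perp = I - X (X'X)^{-1} X'.
P = np.eye(N) - X @ np.linalg.solve(X.T @ X, X.T)

# Moran operator and its spectral decomposition; eigenvalues are
# (rescaled) Moran's I values of patterns orthogonal to X.
M = P @ A @ P
evals, evecs = np.linalg.eigh(M)
order = np.argsort(evals)[::-1]   # decreasing spatial autocorrelation

# K keeps the q eigenvectors with the largest eigenvalues, i.e. the
# smooth, positively autocorrelated patterns residual to X.
q = 10
K = evecs[:, order[:q]]
print(K.shape)                          # (30, 10)
print(np.allclose(K.T @ K, np.eye(q)))  # True: K'K = I, as required
```

Because the eigenvectors of a symmetric operator are orthonormal, the condition $K'K = I$ stated above holds automatically for any subset of columns.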
Using this strategy, the new hierarchical formulation of our model is simply modified by letting $\boldsymbol{\eta}^{(o)} = K^{(o)}\boldsymbol{\zeta}^{(o)}$ and $\boldsymbol{\eta}_t = K_t\boldsymbol{\zeta}_t$, with

1. $\boldsymbol{\zeta}^{(o)} \sim \mathrm{MVN}\left(\mathbf{0}, \tau^{(o)}K^{(o)\prime}QK^{(o)}\right)$, where $K^{(o)}$ is the eigenvector matrix for $P^{(o)\perp}AP^{(o)\perp}$, and

2. $\boldsymbol{\zeta}_t \sim \mathrm{MVN}\left(\mathbf{0}, \tau_t K_t'QK_t\right)$, where $K_t$ is the eigenvector matrix for $P_t^{\perp}AP_t^{\perp}$, for $t = 2, 3, \ldots, T$.
The algorithms for the probit and logit links from Section 2.3.1 can be readily adapted to incorporate the spatial structure, simply by obtaining the joint posteriors for $(\boldsymbol{\alpha}, \boldsymbol{\zeta}^{(o)})$ and $(\boldsymbol{\beta}_{t-1}^{(c)}, \delta_{t-1}^{(s)}, \boldsymbol{\zeta}_t)$ and making the obvious modification of the corresponding linear predictors to incorporate the spatial components.

2.4 Summary
With a few exceptions (Dorazio & Taylor-Rodríguez 2012; Johnson et al. 2013; Royle & Kéry 2007), recent Bayesian approaches to site-occupancy modeling with covariates have relied on model configurations (e.g., multivariate normal priors on parameters in the logit scale) that lead to unfamiliar conditional posterior distributions, thus precluding the use of a direct sampling approach. Therefore, the sampling strategies available are based on algorithms (e.g., Metropolis-Hastings) that require tuning and the knowledge to do so correctly.
In Dorazio & Taylor-Rodríguez (2012) we proposed a Bayesian specification for which a Gibbs sampler of the basic occupancy model is available, and allowed detection and occupancy probabilities to depend on linear combinations of predictors. This method, described in Section 2.2.1, is based on the data augmentation algorithm of Albert & Chib (1993), where the full conditional posteriors of the parameters of the probit regression model are cast as latent mixtures of normal random variables. The probit and the logit link yield similar results with large sample sizes; however, their results may differ when small to moderate sample sizes are considered, because the logit link function places more mass in the tails of the distribution than the probit link does. In Section 2.2.2 we adapt the method for the single-season model to work with the logit link function.
The basic occupancy framework is useful, but it assumes a single closed population with fixed probabilities through time. Hence, its assumptions may not be appropriate to address problems where the interest lies in the temporal dynamics of the population. We therefore developed a dynamic model that incorporates the notion that occupancy at a previously occupied site takes place through persistence, which depends both on survival and on habitat suitability. By this we mean that a site occupied at time $t$ may again be occupied at time $t + 1$ if (1) the current settlers survive, (2) the existing settlers perish but new settlers simultaneously colonize, or (3) current settlers survive and new ones colonize during the same season. In our current formulation of the DYMOSS, both colonization and persistence depend on habitat suitability, characterized by $\mathbf{x}_{i(t-1)}'\boldsymbol{\beta}_{t-1}^{(c)}$. They differ only in that persistence is also influenced by whether the site being occupied during season $t - 1$ enhances the suitability of the site or harms it through density dependence.
Additionally, the study of the dynamics that govern the distribution and abundance of biological populations requires an understanding of the physical and biotic processes that act upon them, and these vary in time and space. Consequently, as a final step in this chapter, we described a straightforward strategy to add spatial dependence among neighboring sites in the dynamic metapopulation model. This extension is based on the popular Bayesian spatial modeling technique of Besag et al. (1991), updated using the methods described in Hughes & Haran (2013).

Future steps along these lines are to (1) develop the software necessary to implement the tools described throughout the chapter, and (2) build a suite of additional extensions of this framework for occupancy models. The first of these will incorporate information from different sources, such as tracks, scats, surveys, and direct observations, into a single model. This can be accomplished by adding a layer to the hierarchy where the source and spatial scale of the data are accounted for. The second extension is a single-season, spatially explicit, multiple-species co-occupancy model. This model will allow studying complex interactions and testing hypotheses about species interactions at a given point in time. Lastly, this co-occupancy model will be adapted to incorporate temporal dynamics in the spirit of the DYMOSS model.
CHAPTER 3
INTRINSIC ANALYSIS FOR OCCUPANCY MODELS

Eliminate all other factors, and the one which remains must be the truth.
–Sherlock Holmes
The Sign of Four

3.1 Introduction
Occupancy models are often used to understand the mechanisms that dictate the distribution of a species. Therefore, variable selection plays a fundamental role in achieving this goal. To the best of our knowledge, "objective" Bayesian alternatives for variable selection have not been put forth for this problem, and, with a few exceptions (Hooten & Hobbs 2014; Link & Barker 2009), AIC is the method used to choose from competing site-occupancy models. In addition, the procedures currently implemented and accessible to ecologists require enumerating and estimating all the candidate models (Fiske & Chandler 2011; Mazerolle & Mazerolle 2013). In practice, this can be achieved if the model space considered is small enough, which is possible if the choice of the model space is guided by substantial prior knowledge about the underlying ecological processes. Nevertheless, many site-occupancy surveys collect large amounts of covariate information about the sampled sites. Given that the total number of candidate models grows exponentially fast with the number of predictors considered, choosing a reduced set of models guided by ecological intuition becomes increasingly difficult. This is even more so the case in the occupancy model context, where the model space is the Cartesian product of models for presence and models for detection. Given the issues mentioned above, we propose the first objective Bayesian variable selection method for the single-season occupancy model framework. This approach explores the entire model space in a principled manner. It is completely automatic, precluding the need both for tuning parameters in the sampling algorithm and for subjective elicitation of parameter prior distributions.
As mentioned above, in ecological modeling, if model selection or (less frequently) model averaging is considered, the Akaike Information Criterion (AIC) (Akaike 1983), or a version of it, is the measure of choice for comparing candidate models (Fiske & Chandler 2011; Mazerolle & Mazerolle 2013). The AIC is designed to find the model whose density is, on average, closest in Kullback-Leibler distance to the density of the true data generating mechanism; the model with the smallest AIC is selected. However, if nested models are considered, one of them being the true one, generally the AIC will not select it (Wasserman 2000). Commonly, the model selected by AIC will be more complex than the true one. The reason for this is that the AIC has a weak signal-to-noise ratio, and as such it tends to overfit (Rao & Wu 2001). Other versions of the AIC provide a bias correction that enhances the signal-to-noise ratio, leading to a stronger penalization for model complexity. Some examples are the AICc (Hurvich & Tsai 1989) and AICu (McQuarrie et al. 1997); however, these are also not consistent for selection, albeit asymptotically efficient (Rao & Wu 2001).
If we are interested in prediction, as opposed to testing, the AIC is certainly appropriate. However, when conducting inference, the use of Bayesian model averaging and selection methods is more fitting. If the true data generating mechanism is among those considered, asymptotically, Bayesian methods choose the true model with probability one. Conversely, if the true model is not among the alternatives and a suitable parameter prior is used, the posterior probability of the most parsimonious model closest to the true one tends asymptotically to one.

In spite of this, in general, for Bayesian testing, direct elicitation of prior probabilistic statements is often impeded because the problems studied may not be sufficiently well understood to make an informed decision about the priors. Alternatively, there may be a prohibitively large number of parameters, making specifying priors for each of these parameters an arduous task. In addition to this, seemingly innocuous subjective choices for the priors on the parameter space may drastically affect test outcomes. This has been a recurring argument in favor of objective Bayesian procedures, which appeal to the use of formal rules to build parameter priors that incorporate the structural information inside the likelihood while utilizing some objective criterion (Kass & Wasserman 1996).
One popular choice of "objective" prior is the reference prior (Berger & Bernardo 1992), which is the prior that maximizes the amount of signal extracted from the data. These priors have proven to be effective, as they are fully automatic and can be frequentist matching, in the sense that the posterior credible interval agrees with the frequentist confidence interval from repeated sampling, with equal coverage probability (Kass & Wasserman 1996). Reference priors, however, are improper, and while they yield reasonable posterior parameter probabilities, the derived model posterior probabilities may be ill defined. To avoid this shortcoming, Berger & Pericchi (1996) introduced the intrinsic Bayes factor (IBF) for model comparison. Moreno et al. (1998), building on the IBF of Berger & Pericchi (1996), developed a limiting procedure to generate a system of priors that yield well-defined posteriors, even though these priors may sometimes be improper. The IBF is built using a data-dependent prior to automatically generate Bayes factors; however, the extension introduced by Moreno et al. (1998) generates the intrinsic prior by taking a theoretical average over the space of training samples, freeing the prior from data dependence.
In our view, in the face of a large number of predictors, the best alternative is to run a stochastic search algorithm using good "objective" testing parameter priors and to incorporate suitable model priors. This being said, the discussion about model priors is deferred until Chapter 4; this chapter focuses on the priors on the parameter space.

The chapter is structured as follows. First, issues surrounding multimodel inference are described and insight about objective Bayesian inferential procedures is provided. Then, building on modern methods for "objective" Bayesian testing to generate priors on the parameter space, the intrinsic priors for the parameters of the occupancy model are derived. These are used in the construction of an algorithm for "objective" model selection tailored to the occupancy model framework. To assess the performance of our methods, we provide results from a simulation study in which distinct scenarios, both favorable and unfavorable, are used to determine the robustness of these tools, and we analyze the Blue Hawker data set, which has been examined previously in the ecological literature (Dorazio & Taylor-Rodríguez 2012; Kéry et al. 2010).

3.2 Objective Bayesian Inference
As mentioned before, in practice, noninformative priors arising from structural rules are an alternative to subjective elicitation of priors. Some of the rules used in defining noninformative priors include the principle of insufficient reason, parametrization invariance, maximum entropy, geometric arguments, coverage matching, and decision-theoretic approaches (see Kass & Wasserman (1996) for a discussion).

These rules reflect one of two attitudes: (1) noninformative priors either aim to convey unique representations of ignorance, or (2) they attempt to produce probability statements that may be accepted by convention. This latter attitude is in the same spirit as how weights and distances are defined (Kass & Wasserman 1996), and characterizes the way in which Bayesian reference methods are interpreted today; i.e., noninformative priors are seen as chosen by convention according to the situation.

A word of caution must be given when using noninformative priors. Difficulties arise in their implementation that should not be taken lightly. In particular, these difficulties may occur because noninformative priors are generally improper (meaning that they do not integrate or sum to a finite number) and, as such, are said to depend on arbitrary constants.

Bayes factors strongly depend upon the prior distributions for the parameters included in each of the models being compared. This can be an important limitation, considering that when noninformative priors are used, their introduction will result in the Bayes factors being functions of ratios of arbitrary constants, given that these priors are typically improper (see Jeffreys 1961; Pericchi 2005; and references therein). Many different approaches have been developed to deal with the arbitrary constants that arise when using improper priors. These include the use of partial Bayes factors (Berger & Pericchi 1996; Good 1950; Lempers 1971), setting the ratio of arbitrary constants to a predefined value (Spiegelhalter & Smith 1982), and approximations to the Bayes factor (see Haughton 1988, as cited in Berger & Pericchi 1996; Kass & Raftery 1995; Tierney & Kadane 1986).

3.2.1 The Intrinsic Methodology
Berger & Pericchi (1996) cleverly dealt with the arbitrary constants that arise when using improper priors by introducing the intrinsic Bayes factor (IBF) procedure. This solution, based on partial Bayes factors, provides the means to replace the improper priors by proper "posterior" priors. The IBF is obtained by combining the model structure with information contained in the observed data. Furthermore, they showed that, as the sample size tends to infinity, the intrinsic Bayes factor corresponds to the proper Bayes factor arising from the intrinsic priors.

Intrinsic priors, however, are not unique. The asymptotic correspondence between the IBF and the Bayes factor arising from the intrinsic prior yields two functional equations that are solved by a whole class of intrinsic priors. Because all the priors in the class produce Bayes factors that are asymptotically equivalent to the IBF, for finite sample sizes the resulting Bayes factor is not unique. To address this issue, Moreno et al. (1998) formalized the methodology through the "limiting procedure." This procedure allows one to obtain a unique Bayes factor, consolidating the method as a valid objective Bayesian model selection procedure, which we will refer to as the Bayes factor for intrinsic priors (BFIP). This result is particularly valid for nested models, although the methodology may be extended, with some caution, to nonnested models.
As mentioned before, the Bayesian hypothesis testing procedure is highly sensitive to parameter-prior specification, and not all priors that are useful for estimation are recommended for hypothesis testing or model selection. Evidence of this is provided by the Jeffreys-Lindley paradox, which states that a point null hypothesis will always be accepted when the variance of a conjugate prior goes to infinity (Robert 1993). Additionally, when comparing nested models, the null model should correspond to a substantial reduction in complexity from that of larger alternative models. Hence, priors for the larger alternative models that place probability mass away from the null model are wasteful: if the true model is "far" from the null, it will be easily detected by any statistical procedure. Therefore, the prior on the alternative models should "work harder" at selecting competitive models that are "close" to the null. This principle, known as the Savage continuity condition (Gunel & Dickey 1974), is widely recognized by statisticians.

Interestingly, the intrinsic prior in correspondence with the BFIP automatically satisfies the Savage continuity condition. That is, when comparing nested models, the intrinsic prior for the more complex model is centered around the null model and, in spite of being obtained through a limiting procedure, is not subject to the Jeffreys-Lindley paradox.

Moreover, beyond the usual pairwise consistency of the Bayes factor for nested models, Casella et al. (2009) show that the corresponding Bayesian procedure with intrinsic priors for variable selection in normal regression is consistent in the entire class of normal linear models, adding an important feature to the list of virtues of the procedure. Consistency of the BFIP for the case where the dimension of the alternative model grows with the sample size is discussed in Moreno et al. (2010).

3.2.2 Mixtures of g-Priors
As previously mentioned, in the Bayesian paradigm a model $M \in \mathcal{M}$ is defined by a sampling density and a prior distribution. The sampling density associated with model $M$ is denoted by $f(\mathbf{y} \mid \boldsymbol{\beta}_M, \sigma_M^2, M)$, where $(\boldsymbol{\beta}_M, \sigma_M^2)$ is the vector of model-specific unknown parameters. The prior for model $M$ and its corresponding set of parameters is

$\pi(\boldsymbol{\beta}_M, \sigma_M^2, M \mid \mathcal{M}) = \pi(\boldsymbol{\beta}_M, \sigma_M^2 \mid M, \mathcal{M}) \cdot \pi(M \mid \mathcal{M}).$

Objective local priors for the model parameters $(\boldsymbol{\beta}_M, \sigma_M^2)$ are achieved through modifications and extensions of Zellner's g-prior (Liang et al. 2008; Womack et al. 2014). In particular, below we focus on the intrinsic prior and provide some details for other scaled mixtures of g-priors. We defer the discussion of priors over the model space until Chapter 5, where we describe them in detail and develop a few alternatives of our own.

3.2.2.1 Intrinsic priors
An automatic choice of an objective prior is the intrinsic prior (Berger & Pericchi 1996; Moreno et al. 1998). Because $M_B \subseteq M$ for all $M \in \mathcal{M}$, the intrinsic prior for $(\boldsymbol{\beta}_M, \sigma_M^2)$ is defined as an expected posterior prior,

$\pi^{I}(\boldsymbol{\beta}_M, \sigma_M^2 \mid M) = \int p^{R}(\boldsymbol{\beta}_M, \sigma_M^2 \mid \tilde{\mathbf{y}}, M)\, m^{R}(\tilde{\mathbf{y}} \mid M_B)\, d\tilde{\mathbf{y}},$  (3–1)

where $\tilde{\mathbf{y}}$ is a minimal training sample for model $M$, $I$ denotes the intrinsic distributions, and $R$ denotes distributions derived from the reference prior $\pi^{R}(\boldsymbol{\beta}_M, \sigma_M^2 \mid M) = c_M / \sigma_M^2$. In (3–1), $m^{R}(\tilde{\mathbf{y}} \mid M) = \iint f(\tilde{\mathbf{y}} \mid \boldsymbol{\beta}_M, \sigma_M^2, M)\, \pi^{R}(\boldsymbol{\beta}_M, \sigma_M^2 \mid M)\, d\boldsymbol{\beta}_M\, d\sigma_M^2$ is the reference marginal of $\tilde{\mathbf{y}}$ under model $M$, and $p^{R}(\boldsymbol{\beta}_M, \sigma_M^2 \mid \tilde{\mathbf{y}}, M) = f(\tilde{\mathbf{y}} \mid \boldsymbol{\beta}_M, \sigma_M^2, M)\, \pi^{R}(\boldsymbol{\beta}_M, \sigma_M^2 \mid M) / m^{R}(\tilde{\mathbf{y}} \mid M)$ is the reference posterior density.
In the regression framework, the reference marginal $m^{R}$ is improper and produces improper intrinsic priors. However, the intrinsic Bayes factor of model $M$ to the base model $M_B$ is well defined and given by

$BF^{I}_{M,M_B}(\mathbf{y}) = \left(1 - R_M^2\right)^{-\frac{n - |M_B|}{2}} \times \int_0^1 \left(\frac{n + \sin^2\left(\frac{\pi}{2}\theta\right)(|M|+1)}{n + \frac{\sin^2\left(\frac{\pi}{2}\theta\right)(|M|+1)}{1 - R_M^2}}\right)^{\frac{n - |M|}{2}} \left(\frac{\sin^2\left(\frac{\pi}{2}\theta\right)(|M|+1)}{n + \frac{\sin^2\left(\frac{\pi}{2}\theta\right)(|M|+1)}{1 - R_M^2}}\right)^{\frac{|M| - |M_B|}{2}} d\theta,$  (3–2)
where $R_M^2$ is the coefficient of determination of model $M$ versus model $M_B$. The Bayes factor between two models $M$ and $M'$ is defined as $BF^{I}_{M,M'}(\mathbf{y}) = BF^{I}_{M,M_B}(\mathbf{y}) / BF^{I}_{M',M_B}(\mathbf{y})$.

The "goodness" of the model $M$ based on the intrinsic priors is given by its posterior probability

$p^{I}(M \mid \mathbf{y}, \mathcal{M}) = \frac{BF^{I}_{M,M_B}(\mathbf{y})\,\pi(M \mid \mathcal{M})}{\sum_{M' \in \mathcal{M}} BF^{I}_{M',M_B}(\mathbf{y})\,\pi(M' \mid \mathcal{M})}.$  (3–3)
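Since 3–2 is a one-dimensional integral, the Bayes factor and the posterior probabilities in 3–3 can be evaluated numerically. The sketch below is our own illustration with hypothetical inputs; the function name and the use of SciPy quadrature are assumptions, not the dissertation's software.

```python
import numpy as np
from scipy.integrate import quad

def bf_intrinsic(R2, n, pM, p0):
    """Intrinsic Bayes factor of M against the base model M_B
    (equation 3-2). R2: coefficient of determination of M vs M_B;
    n: sample size; pM = |M|; p0 = |M_B|."""
    def integrand(theta):
        s = np.sin(np.pi * theta / 2.0) ** 2 * (pM + 1)
        denom = n + s / (1.0 - R2)
        return ((n + s) / denom) ** ((n - pM) / 2.0) * \
               (s / denom) ** ((pM - p0) / 2.0)
    return (1.0 - R2) ** (-(n - p0) / 2.0) * quad(integrand, 0.0, 1.0)[0]

# Posterior probabilities (3-3) for three hypothetical models under a
# uniform model prior; each entry is (R2 vs the base model, |M|).
models = {"M1": (0.10, 3), "M2": (0.45, 4), "M3": (0.46, 8)}
n, p0 = 60, 1
bfs = {m: bf_intrinsic(R2, n, pM, p0) for m, (R2, pM) in models.items()}
post = {m: b / sum(bfs.values()) for m, b in bfs.items()}
# M2 dominates: nearly the fit of M3 with half the parameters,
# illustrating the built-in penalty for model complexity.
```

Note how the integrand vanishes at $\theta = 0$ whenever $|M| > |M_B|$, so the integral is well behaved and standard quadrature suffices.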
It has been shown that the system of intrinsic priors produces consistent model selection (Casella et al. 2009; Girón et al. 2010). In the context of well-formulated models, the true model $M_T$ is the smallest well-formulated model $M \in \mathcal{M}$ such that $\alpha \in M$ if $\beta_\alpha \neq 0$. If $M_T$ is the true model, then the posterior probability of model $M_T$ based on equation (3–3) converges to 1.

3.2.2.2 Other mixtures of g-priors
Scaled mixtures of g-priors place a reference prior on $(\boldsymbol{\beta}_{M_B}, \sigma^2)$ and a multivariate normal distribution on $\boldsymbol{\beta}_{M \setminus M_B}$, with mean $\mathbf{0}$ and precision matrix

$\frac{q_M w}{n\sigma^2} Z_M'(I - H_0)Z_M,$

where $H_0$ is the hat matrix associated with $Z_{M_B}$. The prior is completed by a prior on $w$ and a choice of scaling $q_M$, which is set at $|M| + 1$ to account for the minimal sample size of $M$. Under these assumptions, the Bayes factor for $M$ to $M_B$ is given by

$BF_{M,M_B}(\mathbf{y}) = \left(1 - R_M^2\right)^{-\frac{n - |M_B|}{2}} \int \left(\frac{n + w(|M|+1)}{n + \frac{w(|M|+1)}{1 - R_M^2}}\right)^{\frac{n - |M|}{2}} \left(\frac{w(|M|+1)}{n + \frac{w(|M|+1)}{1 - R_M^2}}\right)^{\frac{|M| - |M_B|}{2}} \pi(w)\, dw.$

We consider the following priors on $w$. The intrinsic prior is $\pi(w) = \mathrm{Beta}(w;\, 0.5, 0.5)$, which is only defined for $w \in (0, 1)$. A version of the Zellner-Siow prior is given by $w \sim \mathrm{Gamma}(0.5, 0.5)$, which produces a multivariate Cauchy distribution on $\boldsymbol{\beta}$. A family of hyper-g priors is defined by $\pi(w) \propto w^{-1/2}(\beta + w)^{-(\alpha+1)/2}$, which has Cauchy-like tails but produces more shrinkage than the Cauchy prior.
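The mixture representation above suggests a simple Monte Carlo evaluation: draw $w$ from $\pi(w)$ and average the integrand. The sketch below is our own illustration with hypothetical inputs, shown for the Zellner-Siow-type choice $w \sim \mathrm{Gamma}(0.5, \text{rate } 0.5)$.

```python
import numpy as np

rng = np.random.default_rng(3)

def bf_mixture(R2, n, pM, p0, w_draws):
    """Monte Carlo estimate of the scaled mixture-of-g-priors Bayes
    factor: average the integrand over draws of w from pi(w)."""
    s = w_draws * (pM + 1)
    denom = n + s / (1.0 - R2)
    vals = ((n + s) / denom) ** ((n - pM) / 2.0) * \
           (s / denom) ** ((pM - p0) / 2.0)
    return (1.0 - R2) ** (-(n - p0) / 2.0) * vals.mean()

# Zellner-Siow-type mixing density: w ~ Gamma(1/2) with rate 1/2,
# i.e. scale 2 in numpy's parametrization.
w = rng.gamma(0.5, scale=2.0, size=100_000)
bf_high = bf_mixture(0.5, 50, 3, 1, w)  # strong fit
bf_low = bf_mixture(0.1, 50, 3, 1, w)   # weak fit
# A model explaining more variation receives the larger Bayes factor.
```

Swapping in Beta(0.5, 0.5) draws for `w` recovers a Monte Carlo version of the intrinsic Bayes factor in 3–2.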
3.3 Objective Bayes Occupancy Model Selection

As mentioned before, Bayesian inferential approaches for ecological models are lacking. In particular, there is a need for suitable objective and automatic Bayesian testing procedures, and for software implementations that explore the model space considered thoroughly. With this goal in mind, in this section we develop an objective, intrinsic, and fully automatic Bayesian model selection methodology for single-season site-occupancy models. We refer to this method as automatic and objective given that, in its implementation, no hyperparameter tuning is required and that it is built using noninformative priors with good testing properties (e.g., intrinsic priors).
An inferential method for the occupancy problem is possible using the intrinsic approach, given that we are able to link intrinsic-Bayesian tools for the normal linear model through our probit formulation of the occupancy model. In other words, because we can represent the single-season probit occupancy model through the hierarchy

$y_{ij} \mid z_i, w_{ij} \sim \mathrm{Bernoulli}\left(z_i\mathbb{I}_{w_{ij}>0}\right)$
$w_{ij} \mid \boldsymbol{\lambda} \sim \mathrm{N}\left(\mathbf{q}_{ij}'\boldsymbol{\lambda}, 1\right)$
$z_i \mid v_i \sim \mathrm{Bernoulli}\left(\mathbb{I}_{v_i>0}\right)$
$v_i \mid \boldsymbol{\alpha} \sim \mathrm{N}\left(\mathbf{x}_i'\boldsymbol{\alpha}, 1\right),$

it is possible to solve the selection problem on the scale of the latent variables $w_{ij}$ and $v_i$, and to use those results at the level of the occupancy and detection processes.
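Because every full conditional in this hierarchy is either a truncated normal or a normal, the latent-scale formulation admits a direct Gibbs sampler in the spirit of Albert & Chib (1993). The following sketch is our own toy version for the occupancy component only, assuming a flat prior on $\boldsymbol{\alpha}$ and fully observed presence indicators.

```python
import numpy as np
from scipy.stats import norm, truncnorm

rng = np.random.default_rng(7)

def draw_v(z, mu):
    """v_i | z_i, alpha ~ N(x_i' alpha, 1), truncated to (0, inf)
    when z_i = 1 and to (-inf, 0] when z_i = 0."""
    lo = np.where(z == 1, -mu, -np.inf)  # bounds standardized by mu
    hi = np.where(z == 1, np.inf, -mu)
    return mu + truncnorm.rvs(lo, hi, random_state=rng)

def draw_alpha(X, v):
    """alpha | v ~ N((X'X)^{-1} X' v, (X'X)^{-1}) under a flat prior."""
    V = np.linalg.inv(X.T @ X)
    return rng.multivariate_normal(V @ (X.T @ v), V)

# Toy data: presences simulated from a probit occupancy process.
N, true_alpha = 400, np.array([0.3, 0.8])
X = np.column_stack([np.ones(N), rng.normal(size=N)])
z = rng.binomial(1, norm.cdf(X @ true_alpha))

alpha, draws = np.zeros(2), []
for _ in range(300):
    v = draw_v(z, X @ alpha)     # latent scores given presences
    alpha = draw_alpha(X, v)     # conjugate normal coefficient draw
    draws.append(alpha)
post_mean = np.mean(draws[100:], axis=0)  # settles near true_alpha
```

The detection component admits the same two conditional draws, with $\mathbf{q}_{ij}$, $\boldsymbol{\lambda}$, and $w_{ij}$ in place of $\mathbf{x}_i$, $\boldsymbol{\alpha}$, and $v_i$, restricted to the occupied sites.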
In what follows, first we provide some necessary notation. Then, a derivation of the intrinsic priors for the parameters of the detection and occupancy components is outlined. Using these priors, we obtain the general form of the model posterior probabilities. Finally, the results are incorporated into a model selection algorithm for site-occupancy data. Although the priors on the model space are not discussed in this chapter, the software and methods developed have different choices of model priors built in.
3.3.1 Preliminaries

The notation used in Chapter 2 will be considered in this section as well. Namely, presence is denoted by $z$, detection by $y$, their corresponding latent processes are $v$ and $w$, and the model parameters are denoted by $\boldsymbol{\alpha}$ and $\boldsymbol{\lambda}$. However, some additional notation is also necessary. Let $M_0 = \{M_{0y}, M_{0z}\}$ denote the "base" model, defined by the smallest models considered for the detection and presence processes. The base models $M_{0y}$ and $M_{0z}$ include predictors that must be contained in every model that belongs to the model space. Some examples of base models are the intercept-only model, a model with covariates related to the sampling design, and a model including some predictors important to the researcher that should be included in every model.

Furthermore, let the sets $[K_z] = \{1, 2, \ldots, K_z\}$ and $[K_y] = \{1, 2, \ldots, K_y\}$ index the covariates considered for the variable selection procedure for the presence and detection processes, respectively. That is, these sets denote the covariates that can be added to the base models in $M_0$, or removed from the largest possible models considered, $M_{Fz}$ and $M_{Fy}$, which we will refer to as the "full" models. The model space can then be represented by the Cartesian product of subsets $A_y \subseteq [K_y]$ and $A_z \subseteq [K_z]$. The entire model space is populated by models of the form $M_A = \{M_{A_y}, M_{A_z}\} \in \mathcal{M} = \mathcal{M}_y \times \mathcal{M}_z$, with $M_{A_y} \in \mathcal{M}_y$ and $M_{A_z} \in \mathcal{M}_z$.

For the presence process $z$, the design matrix for model $M_{A_z}$ is given by the block matrix $X_{A_z} = (X_0 \mid X_{r,A})$; $X_0$ corresponds to the design matrix of the base model – which is such that $M_{0z} \subseteq M_{A_z} \in \mathcal{M}_z$ for all $A_z \subseteq [K_z]$ – and $X_{r,A}$ corresponds to the submatrix that contains the covariates indexed by $A_z$. Analogously, for the detection process $y$, the design matrix is given by $Q_{A_y} = (Q_0 \mid Q_{r,A})$. Similarly, the coefficients for models $M_{A_z}$ and $M_{A_y}$ are given by $\boldsymbol{\alpha}_A = (\boldsymbol{\alpha}_0', \boldsymbol{\alpha}_{r,A}')'$ and $\boldsymbol{\lambda}_A = (\boldsymbol{\lambda}_0', \boldsymbol{\lambda}_{r,A}')'$.

With these elements in place, the model selection problem consists of finding subsets of covariates, indexed by $A = \{A_z, A_y\}$, that have a high posterior probability given the detection and occupancy processes. This is equivalent to finding models with
high posterior odds when compared to a suitable base model. These posterior odds are given by

$\frac{p(M_A \mid \mathbf{y}, \mathbf{z})}{p(M_0 \mid \mathbf{y}, \mathbf{z})} = \frac{m(\mathbf{y}, \mathbf{z} \mid M_A)\,\pi(M_A)}{m(\mathbf{y}, \mathbf{z} \mid M_0)\,\pi(M_0)} = BF_{M_A,M_0}(\mathbf{y}, \mathbf{z})\,\frac{\pi(M_A)}{\pi(M_0)}.$
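For small $K_y$ and $K_z$, the model space can be enumerated explicitly as the Cartesian product described above. A minimal illustration, with hypothetical covariate counts of our own choosing:

```python
from itertools import chain, combinations

def subsets(K):
    """All subsets of {1, ..., K}: the covariates that can be added
    to the base model for one process."""
    s = list(range(1, K + 1))
    return list(chain.from_iterable(combinations(s, r) for r in range(K + 1)))

# Model space M = M_y x M_z: each pair (A_y, A_z) of detection and
# occupancy covariate subsets indexes one candidate model M_A.
Ky, Kz = 3, 2
model_space = [(Ay, Az) for Ay in subsets(Ky) for Az in subsets(Kz)]
print(len(model_space))  # 2**Ky * 2**Kz = 32; the pair ((), ()) is M_0
```

The exponential growth of this product is precisely why, beyond a handful of covariates, a stochastic search over $\mathcal{M}$ replaces full enumeration.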
Since we are able to represent the occupancy model as a truncation of latent normal variables, it is possible to work through the occupancy model selection problem on the latent normal scale used for the presence and detection processes. We formulate two solutions to this problem: one that depends on the observed and latent components, and another that depends solely on the latent-level variables used to data-augment the problem. We will, however, focus on the latter approach, as it yields a straightforward MCMC sampling scheme. For completeness, the other alternative is described in Section 3.4.
At the root of our objective inferential procedure for occupancy models lies the conditional argument introduced by Womack et al. (work in progress) for simple probit regression. In the occupancy setting, the argument is

$p(M_A \mid \mathbf{y}, \mathbf{z}, \mathbf{w}, \mathbf{v}) = \frac{m(\mathbf{y}, \mathbf{z}, \mathbf{v}, \mathbf{w} \mid M_A)\,\pi(M_A)}{m(\mathbf{y}, \mathbf{z}, \mathbf{w}, \mathbf{v})}$
$= \frac{f_{y,z}(\mathbf{y}, \mathbf{z} \mid \mathbf{w}, \mathbf{v})\left(\int f_{v,w}(\mathbf{v}, \mathbf{w} \mid \boldsymbol{\alpha}, \boldsymbol{\lambda}, M_A)\,\pi_{\alpha,\lambda}(\boldsymbol{\alpha}, \boldsymbol{\lambda} \mid M_A)\, d(\boldsymbol{\alpha}, \boldsymbol{\lambda})\right)\pi(M_A)}{f_{y,z}(\mathbf{y}, \mathbf{z} \mid \mathbf{w}, \mathbf{v})\sum_{M^* \in \mathcal{M}}\left(\int f_{v,w}(\mathbf{v}, \mathbf{w} \mid \boldsymbol{\alpha}, \boldsymbol{\lambda}, M^*)\,\pi_{\alpha,\lambda}(\boldsymbol{\alpha}, \boldsymbol{\lambda} \mid M^*)\, d(\boldsymbol{\alpha}, \boldsymbol{\lambda})\right)\pi(M^*)}$
$= \frac{m(\mathbf{v} \mid M_{A_z})\, m(\mathbf{w} \mid M_{A_y})\,\pi(M_A)}{m(\mathbf{v})\, m(\mathbf{w})} \propto m(\mathbf{v} \mid M_{A_z})\, m(\mathbf{w} \mid M_{A_y})\,\pi(M_A),$  (3–4)
where

1. $f_{y,z}(\mathbf{y}, \mathbf{z} \mid \mathbf{w}, \mathbf{v}) = \prod_{i=1}^{N} \mathbb{I}_{v_i > 0}^{\,z_i}\, \mathbb{I}_{v_i \le 0}^{\,(1 - z_i)} \prod_{j=1}^{J_i}\left(z_i\mathbb{I}_{w_{ij}>0}\right)^{y_{ij}}\left(1 - z_i\mathbb{I}_{w_{ij}>0}\right)^{1 - y_{ij}}$,

2. $f_{v,w}(\mathbf{v}, \mathbf{w} \mid \boldsymbol{\alpha}, \boldsymbol{\lambda}, M_A) = \underbrace{\left(\prod_{i=1}^{N}\phi\left(v_i;\, \mathbf{x}_i'\boldsymbol{\alpha}_{M_{A_z}},\, 1\right)\right)}_{f(\mathbf{v} \mid \boldsymbol{\alpha}_{r,A}, \boldsymbol{\alpha}_0, M_{A_z})} \underbrace{\left(\prod_{i=1}^{N}\prod_{j=1}^{J_i}\phi\left(w_{ij};\, \mathbf{q}_{ij}'\boldsymbol{\lambda}_{M_{A_y}},\, 1\right)\right)}_{f(\mathbf{w} \mid \boldsymbol{\lambda}_{r,A}, \boldsymbol{\lambda}_0, M_{A_y})}$, and

3. $\pi_{\alpha,\lambda}(\boldsymbol{\alpha}, \boldsymbol{\lambda} \mid M_A) = \pi_{\alpha}(\boldsymbol{\alpha} \mid M_{A_z})\,\pi_{\lambda}(\boldsymbol{\lambda} \mid M_{A_y})$.
This result implies that, once the occupancy and detection indicators are conditioned on the latent processes $\mathbf{v}$ and $\mathbf{w}$, respectively, the model posterior probabilities depend only on the latent variables. Hence, in this case, the model selection problem is driven by the posterior odds

$\frac{p(M_A \mid \mathbf{y}, \mathbf{z}, \mathbf{w}, \mathbf{v})}{p(M_0 \mid \mathbf{y}, \mathbf{z}, \mathbf{w}, \mathbf{v})} = \frac{m(\mathbf{w}, \mathbf{v} \mid M_A)}{m(\mathbf{w}, \mathbf{v} \mid M_0)}\,\frac{\pi(M_A)}{\pi(M_0)},$  (3–5)

where $m(\mathbf{w}, \mathbf{v} \mid M_A) = m(\mathbf{w} \mid M_{A_y}) \cdot m(\mathbf{v} \mid M_{A_z})$, with

$m(\mathbf{v} \mid M_{A_z}) = \iint f(\mathbf{v} \mid \boldsymbol{\alpha}_{r,A}, \boldsymbol{\alpha}_0, M_{A_z})\,\pi(\boldsymbol{\alpha}_{r,A} \mid \boldsymbol{\alpha}_0, M_{A_z})\,\pi(\boldsymbol{\alpha}_0)\, d\boldsymbol{\alpha}_{r,A}\, d\boldsymbol{\alpha}_0,$  (3–6)

$m(\mathbf{w} \mid M_{A_y}) = \iint f(\mathbf{w} \mid \boldsymbol{\lambda}_{r,A}, \boldsymbol{\lambda}_0, M_{A_y})\,\pi(\boldsymbol{\lambda}_{r,A} \mid \boldsymbol{\lambda}_0, M_{A_y})\,\pi(\boldsymbol{\lambda}_0)\, d\boldsymbol{\lambda}_0\, d\boldsymbol{\lambda}_{r,A}.$  (3–7)
3.3.2 Intrinsic Priors for the Occupancy Problem
In general, the intrinsic priors as defined by Moreno et al. (1998) use the functional form of the response to inform their construction, assuming some preliminary prior distribution, proper or improper, on the model parameters. For our purposes we assume noninformative improper priors for the parameters, denoted by \pi^N(\cdot \mid \cdot). Specifically, the intrinsic priors \pi^{IP}(\theta_{M^*} \mid M^*), for a vector of parameters \theta_{M^*} corresponding to model M^* \in \{M_0, M\} \subset \mathcal{M}, for a response vector s with probability density (or mass) function f(s \mid \theta_{M^*}), are defined by

    \pi^{IP}(\theta_{M_0} \mid M_0) = \pi^N(\theta_{M_0} \mid M_0)

    \pi^{IP}(\theta_M \mid M) = \pi^N(\theta_M \mid M) \int \frac{m^N(\tilde{s} \mid M_0)}{m^N(\tilde{s} \mid M)}\, f(\tilde{s} \mid \theta_M, M)\, d\tilde{s},

where \tilde{s} is a theoretical training sample.
In what follows, whenever it is clear from the context, in an attempt to simplify the notation, M_A will be used to refer to M_{A_z} or M_{A_y}, and A will denote A_z or A_y. To derive the parameter priors involved in equations 3-6 and 3-7 using the objective intrinsic prior strategy, we start by assuming flat priors \pi^N(\alpha_A \mid M_A) \propto c_A and \pi^N(\lambda_A \mid M_A) \propto d_A, where c_A and d_A are unknown constants.
The intrinsic prior for the parameters associated with the occupancy process, \alpha_A, conditional on model M_A, is

    \pi^{IP}(\alpha_A \mid M_A) = \pi^N(\alpha_A \mid M_A) \int \frac{m(\tilde{v} \mid M_0)}{m(\tilde{v} \mid M_A)}\, f(\tilde{v} \mid \alpha_A, M_A)\, d\tilde{v},

where the marginals m(\tilde{v} \mid M_j), with j \in \{A, 0\}, are obtained by solving the analogue of equation 3-6 for the (theoretical) training sample \tilde{v}. These marginals are given by

    m(\tilde{v} \mid M_j) = c_j\,(2\pi)^{-\frac{p_{A_z} - p_j}{2}}\,|\tilde{X}_j'\tilde{X}_j|^{-\frac{1}{2}}\, e^{-\frac{1}{2}\tilde{v}'(I - \tilde{H}_j)\tilde{v}}.
The training sample \tilde{v} has dimension p_{A_z} = |M_{A_z}|, that is, the total number of parameters in model M_{A_z}. Note that, without ambiguity, we use |\cdot| to denote both the cardinality of a set and the determinant of a matrix. The design matrix \tilde{X}_A corresponds to the training sample \tilde{v} and is chosen such that \tilde{X}_A'\tilde{X}_A = \frac{p_{A_z}}{N} X_A'X_A (Leon-Novelo et al. 2012), and \tilde{H}_j is the corresponding hat matrix.
Replacing m(\tilde{v} \mid M_A) and m(\tilde{v} \mid M_0) in \pi^{IP}(\alpha_A \mid M_A) and solving the integral with respect to the theoretical training sample \tilde{v}, we have

    \pi^{IP}(\alpha_A \mid M_A) = c_A \int \left( (2\pi)^{-\frac{p_{A_z} - p_{0_z}}{2}} \left(\frac{c_0}{c_A}\right) e^{-\frac{1}{2}\tilde{v}'\left((I - \tilde{H}_0) - (I - \tilde{H}_A)\right)\tilde{v}}\, \frac{|\tilde{X}_A'\tilde{X}_A|^{1/2}}{|\tilde{X}_0'\tilde{X}_0|^{1/2}} \right) \times \left( (2\pi)^{-\frac{p_{A_z}}{2}}\, e^{-\frac{1}{2}(\tilde{v} - \tilde{X}_A\alpha_A)'(\tilde{v} - \tilde{X}_A\alpha_A)} \right) d\tilde{v}

    = c_0\,(2\pi)^{-\frac{p_{A_z} - p_{0_z}}{2}}\, |\tilde{X}_{rA}'\tilde{X}_{rA}|^{\frac{1}{2}}\, 2^{-\frac{p_{A_z} - p_{0_z}}{2}} \exp\left[-\frac{1}{2}\,\alpha_{rA}'\left(\frac{1}{2}\tilde{X}_{rA}'\tilde{X}_{rA}\right)\alpha_{rA}\right]

    = \pi^N(\alpha_0) \times N\!\left(\alpha_{rA} \,\middle|\, 0,\; 2\,(\tilde{X}_{rA}'\tilde{X}_{rA})^{-1}\right).    (3-8)
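To make the construction concrete, the covariance of the normal component in equation 3-8 can be assembled directly from the observed design matrix, since \tilde{X}_A'\tilde{X}_A = (p_{A_z}/N) X_A'X_A. The following sketch is our own illustration, not code from the dissertation; the matrix X_rA is assumed to hold the non-base columns already residualized on the base design.

```python
import numpy as np

# Build the intrinsic-prior covariance 2 * (X~'_rA X~_rA)^{-1} from (3-8),
# using the training-sample scaling X~'X~ = (p_Az / N) X'X.
def intrinsic_prior_cov(X_rA, p_Az, N):
    XtX_tilde = (p_Az / N) * (X_rA.T @ X_rA)
    return 2.0 * np.linalg.inv(XtX_tilde)

rng = np.random.default_rng(1)
N, p0, pA = 100, 1, 4                    # sites, base-model size, full-model size
X0 = np.ones((N, p0))                    # base design: intercept only
Xr = rng.normal(size=(N, pA - p0))       # candidate predictors (simulated here)
# residualize the extra columns on the base design, so that X0' X_rA = 0
X_rA = Xr - X0 @ np.linalg.lstsq(X0, Xr, rcond=None)[0]
Sigma = intrinsic_prior_cov(X_rA, pA, N)
alpha_rA = rng.multivariate_normal(np.zeros(pA - p0), Sigma)  # one prior draw
```

The unit-information-like scaling (p_{A_z}/N) makes the prior covariance grow with N/p_{A_z}, so the prior stays weakly informative as more sites are added.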
Analogously, the intrinsic prior for the parameters associated with the detection process is

    \pi^{IP}(\lambda_A \mid M_A) = d_0\,(2\pi)^{-\frac{p_{A_y} - p_{0_y}}{2}}\, |\tilde{Q}_{rA}'\tilde{Q}_{rA}|^{\frac{1}{2}}\, 2^{-\frac{p_{A_y} - p_{0_y}}{2}} \exp\left[-\frac{1}{2}\,\lambda_{rA}'\left(\frac{1}{2}\tilde{Q}_{rA}'\tilde{Q}_{rA}\right)\lambda_{rA}\right]

    = \pi^N(\lambda_0) \times N\!\left(\lambda_{rA} \,\middle|\, 0,\; 2\,(\tilde{Q}_{rA}'\tilde{Q}_{rA})^{-1}\right).    (3-9)
In short, the intrinsic priors for \alpha_A = (\alpha_0', \alpha_{rA}')' and \lambda_A = (\lambda_0', \lambda_{rA}')' are the product of a reference prior on the parameters of the base model and a normal density on the parameters indexed by A_z and A_y, respectively.

3.3.3 Model Posterior Probabilities
We now derive the expressions involved in the calculation of the model posterior probabilities. First, recall that p(M_A \mid y, z, w, v) \propto m(w, v \mid M_A)\pi(M_A). Hence, determining this posterior probability only requires calculating m(w, v \mid M_A).

Note that, since w and v are independent, obtaining the model posteriors from expression 3-4 reduces to finding closed-form expressions for the marginals m(v \mid M_{A_z}) and m(w \mid M_{A_y}), respectively, from equations 3-6 and 3-7. Therefore

    m(w, v \mid M_A) = \int\!\!\int f(v, w \mid \alpha, \lambda, M_A)\, \pi^{IP}(\alpha \mid M_{A_z})\, \pi^{IP}(\lambda \mid M_{A_y})\, d\alpha\, d\lambda.    (3-10)
For the latent variable associated with the occupancy process, plugging the parameter intrinsic prior given by 3-8 into equation 3-6 (recalling that \tilde{X}_A'\tilde{X}_A = \frac{p_{A_z}}{N} X_A'X_A) and integrating out \alpha_A yields

    m(v \mid M_A) = \int\!\!\int c_0\, N(v \mid X_0\alpha_0 + X_{rA}\alpha_{rA},\, I)\; N\!\left(\alpha_{rA} \mid 0,\, 2(\tilde{X}_{rA}'\tilde{X}_{rA})^{-1}\right) d\alpha_{rA}\, d\alpha_0

    = c_0\,(2\pi)^{-n/2} \int \left(\frac{p_{A_z}}{2N + p_{A_z}}\right)^{\frac{p_{A_z} - p_{0_z}}{2}} \exp\left[-\frac{1}{2}(v - X_0\alpha_0)'\left(I - \left(\frac{2N}{2N + p_{A_z}}\right)H_{rA_z}\right)(v - X_0\alpha_0)\right] d\alpha_0

    = c_0\,(2\pi)^{-(n - p_{0_z})/2} \left(\frac{p_{A_z}}{2N + p_{A_z}}\right)^{\frac{p_{A_z} - p_{0_z}}{2}} |X_0'X_0|^{-\frac{1}{2}} \times \exp\left[-\frac{1}{2} v'\left(I - H_{0_z} - \left(\frac{2N}{2N + p_{A_z}}\right)H_{rA_z}\right)v\right],    (3-11)
with H_{rA_z} = H_{A_z} - H_{0_z}, where H_{A_z} is the hat matrix for the entire model M_{A_z} and H_{0_z} is the hat matrix for the base model.
Similarly, the marginal distribution for w is

    m(w \mid M_A) = d_0\,(2\pi)^{-(J - p_{0_y})/2} \left(\frac{p_{A_y}}{2J + p_{A_y}}\right)^{\frac{p_{A_y} - p_{0_y}}{2}} |Q_0'Q_0|^{-\frac{1}{2}} \times \exp\left[-\frac{1}{2} w'\left(I - H_{0_y} - \left(\frac{2J}{2J + p_{A_y}}\right)H_{rA_y}\right)w\right],    (3-12)
where J = \sum_{i=1}^N J_i; in other words, J denotes the total number of surveys conducted.

Now, the marginals under the base model M_0 = \{M_{0_y}, M_{0_z}\} are

    m(v \mid M_0) = \int c_0\, N(v \mid X_0\alpha_0,\, I)\, d\alpha_0 = c_0\,(2\pi)^{-(n - p_{0_z})/2}\, |X_0'X_0|^{-1/2} \exp\left[-\frac{1}{2} v'(I - H_{0_z})v\right]    (3-13)

and

    m(w \mid M_0) = d_0\,(2\pi)^{-(J - p_{0_y})/2}\, |Q_0'Q_0|^{-1/2} \exp\left[-\frac{1}{2} w'(I - H_{0_y})w\right].    (3-14)
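Up to the common constants c_0 and d_0, equations 3-11 and 3-13 can be evaluated directly from hat matrices, which is how the posterior odds in 3-5 would be computed in practice. A minimal sketch, with function and variable names of our own choosing:

```python
import numpy as np

def hat(X):
    # projection (hat) matrix onto the column space of X
    return X @ np.linalg.solve(X.T @ X, X.T)

def log_marginal_v(v, X0, XA, N):
    """log m(v | M_A) from (3-11), dropping the constant c_0.
    Passing XA = X0 recovers the base-model marginal (3-13)."""
    n, p0, pA = len(v), X0.shape[1], XA.shape[1]
    H0, HA = hat(X0), hat(XA)
    shrink = 2 * N / (2 * N + pA)
    quad = v @ (np.eye(n) - H0 - shrink * (HA - H0)) @ v
    return (-(n - p0) / 2 * np.log(2 * np.pi)
            + (pA - p0) / 2 * np.log(pA / (2 * N + pA))
            - 0.5 * np.linalg.slogdet(X0.T @ X0)[1]
            - 0.5 * quad)

rng = np.random.default_rng(0)
N = 60
X0 = np.ones((N, 1))
XA = np.hstack([X0, rng.normal(size=(N, 2))])
v = rng.normal(size=N)
# log of the marginal ratio m(v|M_A) / m(v|M_0) entering the odds (3-5)
log_bf = log_marginal_v(v, X0, XA, N) - log_marginal_v(v, X0, X0, N)
```

When XA = X0 the shrinkage term vanishes and the quadratic form reduces to v'(I - H_{0_z})v, so one function covers both marginals.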
3.3.4 Model Selection Algorithm

Having the parameter intrinsic priors in place and knowing the form of the model posterior probabilities, it is finally possible to develop a strategy to conduct model selection for the occupancy framework.

For each of the two components of the model (occupancy and detection) the algorithm first draws the set of active predictors (i.e., A_z and A_y) together with their corresponding parameters. This is a reversible-jump step, which uses a Metropolis-Hastings correction with proposal distributions given by
    q(A_z^* \mid z_o, z_u^{(t)}, v^{(t)}, M_{A_z}) = \frac{1}{2}\left( p\!\left(M_{A_z^*} \mid z_o, z_u^{(t)}, v^{(t)},\, M_{A_z^*} \in L(M_{A_z})\right) + \frac{1}{|L(M_{A_z})|}\right)

    q(A_y^* \mid y, z_o, z_u^{(t)}, w^{(t)}, M_{A_y}) = \frac{1}{2}\left( p\!\left(M_{A_y^*} \mid y, z_o, z_u^{(t)}, w^{(t)},\, M_{A_y^*} \in L(M_{A_y})\right) + \frac{1}{|L(M_{A_y})|}\right)    (3-15)

where L(M_{A_z}) and L(M_{A_y}) denote the sets of models obtained by adding or removing one predictor at a time from M_{A_z} and M_{A_y}, respectively.
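The neighborhood L(M_A) and the mixture proposal in 3-15 are simple to realize: with probability 1/2 a neighbor is drawn proportionally to its conditional posterior probability, and with probability 1/2 uniformly. A sketch under our own conventions, where a model is encoded as a frozenset of active predictor indices and log_post stands in for the unnormalized log posterior:

```python
import numpy as np

def neighbors(model, n_pred):
    # L(M): models differing from `model` by adding or removing one predictor
    return [model ^ frozenset({j}) for j in range(n_pred)]

def propose(model, n_pred, log_post, rng):
    """Draw M* from q(. | M): a 50/50 mixture of the normalized posterior
    over L(M) and the uniform distribution on L(M), as in (3-15)."""
    nbrs = neighbors(model, n_pred)
    lp = np.array([log_post(m) for m in nbrs])
    w = np.exp(lp - lp.max())                 # stabilized posterior weights
    probs = 0.5 * w / w.sum() + 0.5 / len(nbrs)
    idx = rng.choice(len(nbrs), p=probs)
    return nbrs[idx], probs[idx]              # proposed model, its q-probability

rng = np.random.default_rng(2)
toy_log_post = lambda m: -0.5 * len(m)        # hypothetical stand-in for p(M | ...)
m_star, q_prob = propose(frozenset({0, 3}), 5, toy_log_post, rng)
```

The uniform component guarantees every neighbor has positive proposal probability, which keeps the chain irreducible even when the posterior weights nearly degenerate.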
To promote mixing, this step is followed by an additional draw from the full conditionals of \alpha and \lambda. The densities p(\alpha_0 \mid \cdot), p(\alpha_{rA} \mid \cdot), p(\lambda_0 \mid \cdot) and p(\lambda_{rA} \mid \cdot) can be sampled from directly with Gibbs steps. Using the notation a \mid \cdot to denote the random variable a conditioned on all other parameters and on the data, these densities are given by

- \alpha_0 \mid \cdot \sim N\!\left((X_0'X_0)^{-1}X_0'v,\; (X_0'X_0)^{-1}\right)

- \alpha_{rA} \mid \cdot \sim N\!\left(\mu_{\alpha_{rA}}, \Sigma_{\alpha_{rA}}\right), where the covariance matrix and the mean vector are given by \Sigma_{\alpha_{rA}} = \frac{2N}{2N + p_{A_z}}(X_{rA}'X_{rA})^{-1} and \mu_{\alpha_{rA}} = \Sigma_{\alpha_{rA}} X_{rA}'v

- \lambda_0 \mid \cdot \sim N\!\left((Q_0'Q_0)^{-1}Q_0'w,\; (Q_0'Q_0)^{-1}\right), and

- \lambda_{rA} \mid \cdot \sim N\!\left(\mu_{\lambda_{rA}}, \Sigma_{\lambda_{rA}}\right), analogously, with covariance matrix and mean given by \Sigma_{\lambda_{rA}} = \frac{2J}{2J + p_{A_y}}(Q_{rA}'Q_{rA})^{-1} and \mu_{\lambda_{rA}} = \Sigma_{\lambda_{rA}} Q_{rA}'w.
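The four conditionals above are standard multivariate normal draws; a hedged sketch (names are ours, not the dissertation's):

```python
import numpy as np

def draw_coefficients(v, X0, X_rA, N, rng):
    """One Gibbs scan over (alpha_0, alpha_rA) given the latent vector v;
    the same two draws, with (w, Q0, Q_rA, J), update (lambda_0, lambda_rA)."""
    # alpha_0 | .  ~  N((X0'X0)^{-1} X0'v, (X0'X0)^{-1})
    S0 = np.linalg.inv(X0.T @ X0)
    alpha_0 = rng.multivariate_normal(S0 @ X0.T @ v, S0)
    # alpha_rA | .  ~  N(Sigma X_rA'v, Sigma),
    # with Sigma = (2N / (2N + p_Az)) * (X_rA'X_rA)^{-1}
    p_Az = X0.shape[1] + X_rA.shape[1]
    Sigma = (2 * N / (2 * N + p_Az)) * np.linalg.inv(X_rA.T @ X_rA)
    alpha_rA = rng.multivariate_normal(Sigma @ X_rA.T @ v, Sigma)
    return alpha_0, alpha_rA

rng = np.random.default_rng(3)
N = 50
X0 = np.ones((N, 1))
X_rA = rng.normal(size=(N, 3))
v = rng.normal(size=N)
alpha_0, alpha_rA = draw_coefficients(v, X0, X_rA, N, rng)
```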
Finally, Gibbs sampling steps are also available for the unobserved occupancy indicators z_u and for the corresponding latent variables v and w. The full conditional posterior densities for z_u^{(t+1)}, v^{(t+1)} and w^{(t+1)} are those introduced in Chapter 2 for the single season probit model.

The following steps summarize the stochastic search algorithm:

1. Initialize A_y^{(0)}, A_z^{(0)}, z_u^{(0)}, v^{(0)}, w^{(0)}, \alpha_0^{(0)}, \lambda_0^{(0)}.

2. Sample the model indices and corresponding parameters:

(a) Draw simultaneously
- A_z^* \sim q(A_z \mid z_o, z_u^{(t)}, v^{(t)}, M_{A_z}),
- \alpha_0^* \sim p(\alpha_0 \mid M_{A_z^*}, z_o, z_u^{(t)}, v^{(t)}), and
- \alpha_{rA^*}^* \sim p(\alpha_{rA} \mid M_{A_z^*}, z_o, z_u^{(t)}, v^{(t)}).

(b) Accept (M_{A_z}^{(t+1)}, \alpha_0^{(t+1),1}, \alpha_{rA}^{(t+1),1}) = (M_{A_z^*}, \alpha_0^*, \alpha_{rA^*}^*) with probability

    \delta_z = \min\left(1,\; \frac{p(M_{A_z^*} \mid z_o, z_u^{(t)}, v^{(t)})}{p(M_{A_z^{(t)}} \mid z_o, z_u^{(t)}, v^{(t)})}\, \frac{q(A_z^{(t)} \mid z_o, z_u^{(t)}, v^{(t)}, M_{A_z^*})}{q(A_z^* \mid z_o, z_u^{(t)}, v^{(t)}, M_{A_z})}\right);

otherwise let (M_{A_z}^{(t+1)}, \alpha_0^{(t+1),1}, \alpha_{rA}^{(t+1),1}) = (M_{A_z^{(t)}}, \alpha_0^{(t),2}, \alpha_{rA}^{(t),2}).

(c) Draw simultaneously
- A_y^* \sim q(A_y \mid y, z_o, z_u^{(t)}, w^{(t)}, M_{A_y}),
- \lambda_0^* \sim p(\lambda_0 \mid M_{A_y^*}, y, z_o, z_u^{(t)}, w^{(t)}), and
- \lambda_{rA^*}^* \sim p(\lambda_{rA} \mid M_{A_y^*}, y, z_o, z_u^{(t)}, w^{(t)}).

(d) Accept (M_{A_y}^{(t+1)}, \lambda_0^{(t+1),1}, \lambda_{rA}^{(t+1),1}) = (M_{A_y^*}, \lambda_0^*, \lambda_{rA^*}^*) with probability

    \delta_y = \min\left(1,\; \frac{p(M_{A_y^*} \mid y, z_o, z_u^{(t)}, w^{(t)})}{p(M_{A_y^{(t)}} \mid y, z_o, z_u^{(t)}, w^{(t)})}\, \frac{q(A_y^{(t)} \mid y, z_o, z_u^{(t)}, w^{(t)}, M_{A_y^*})}{q(A_y^* \mid y, z_o, z_u^{(t)}, w^{(t)}, M_{A_y})}\right);

otherwise let (M_{A_y}^{(t+1)}, \lambda_0^{(t+1),1}, \lambda_{rA}^{(t+1),1}) = (M_{A_y^{(t)}}, \lambda_0^{(t),2}, \lambda_{rA}^{(t),2}).

3. Sample base model parameters:

(a) Draw \alpha_0^{(t+1),2} \sim p(\alpha_0 \mid M_{A_z}^{(t+1)}, z_o, z_u^{(t)}, v^{(t)}).

(b) Draw \lambda_0^{(t+1),2} \sim p(\lambda_0 \mid M_{A_y}^{(t+1)}, y, z_o, z_u^{(t)}, w^{(t)}).

4. To improve mixing, resample the model coefficients that are not in the base model but are in M_A:

(a) Draw \alpha_{rA}^{(t+1),2} \sim p(\alpha_{rA} \mid M_{A_z}^{(t+1)}, z_o, z_u^{(t)}, v^{(t)}).

(b) Draw \lambda_{rA}^{(t+1),2} \sim p(\lambda_{rA} \mid M_{A_y}^{(t+1)}, y, z_o, z_u^{(t)}, w^{(t)}).

5. Sample latent and missing (unobserved) variables:

(a) Sample z_u^{(t+1)} \sim p(z_u \mid M_{A_z}^{(t+1)}, y, \alpha_{rA}^{(t+1),2}, \alpha_0^{(t+1),2}, \lambda_{rA}^{(t+1),2}, \lambda_0^{(t+1),2}).

(b) Sample v^{(t+1)} \sim p(v \mid M_{A_z}^{(t+1)}, z_o, z_u^{(t+1)}, \alpha_{rA}^{(t+1),2}, \alpha_0^{(t+1),2}).

(c) Sample w^{(t+1)} \sim p(w \mid M_{A_y}^{(t+1)}, z_o, z_u^{(t+1)}, \lambda_{rA}^{(t+1),2}, \lambda_0^{(t+1),2}).
3.4 Alternative Formulation

Because the occupancy process is partially observed, it is reasonable to consider the posterior odds in terms of the observed responses, that is, the detections y and the presences at sites where at least one detection takes place. Partitioning the vector of presences into observed and unobserved components, z = (z_o', z_u')', and integrating out the unobserved component, the model posterior for M_A can be obtained as

    p(M_A \mid y, z_o) \propto E_{z_u}\left[m(y, z \mid M_A)\right]\, \pi(M_A).    (3-16)

Data-augmenting the model in terms of latent normal variables, à la Albert and Chib, the marginal of z and y for any model \{M_y, M_z\} = M \in \mathcal{M} inside the expectation in equation 3-16 can be expressed in terms of the latent variables

    m(y, z \mid M) = \int_{T(z)}\int_{T(y,z)} m(w, v \mid M)\, dw\, dv = \left(\int_{T(z)} m(v \mid M_z)\, dv\right)\left(\int_{T(y,z)} m(w \mid M_y)\, dw\right),    (3-17)
where T(z) and T(y, z) denote the corresponding truncation regions for v and w, which depend on the values taken by z and y, and

    m(v \mid M_z) = \int f(v \mid \alpha, M_z)\, \pi(\alpha \mid M_z)\, d\alpha    (3-18)

    m(w \mid M_y) = \int f(w \mid \lambda, M_y)\, \pi(\lambda \mid M_y)\, d\lambda.    (3-19)
The last equality in equation 3-17 is a consequence of the independence of the latent processes v and w. Using expressions 3-18 and 3-19 allows one to embed this model selection problem in the classical linear normal regression setting, where many "objective" Bayesian inferential tools are available. In particular, these expressions facilitate deriving the parameter intrinsic priors (Berger & Pericchi 1996; Moreno et al. 1998) for this problem. This approach is an extension of the one implemented in Leon-Novelo et al. (2012) for the simple probit regression problem.
Using this alternative approach, all that is left is to integrate m(v \mid M_A) and m(w \mid M_A) over their corresponding truncation regions T(z) and T(y, z), which yields m(y, z \mid M_A), and then to obtain the expectation with respect to the unobserved z's. Note, however, that two issues arise. First, such integrals are not available in closed form. Second, calculating the expectation over the limits of integration further complicates things. To address these difficulties, it is possible to express E_{z_u}[m(y, z \mid M_A)] as

    E_{z_u}\left[m(y, z \mid M_A)\right] = E_{z_u}\left[\left(\int_{T(z)} m(v \mid M_{A_z})\, dv\right)\left(\int_{T(y,z)} m(w \mid M_{A_y})\, dw\right)\right]

    = E_{z_u}\left[\left(\int_{T(z)}\int m(v \mid M_{A_z}, \alpha_0)\, \pi^{IP}(\alpha_0 \mid M_{A_z})\, d\alpha_0\, dv\right) \times \left(\int_{T(y,z)}\int m(w \mid M_{A_y}, \lambda_0)\, \pi^{IP}(\lambda_0 \mid M_{A_y})\, d\lambda_0\, dw\right)\right]

    = E_{z_u}\left[\int \underbrace{\left(\int_{T(z)} m(v \mid M_{A_z}, \alpha_0)\, dv\right)}_{g_1(T(z) \mid M_{A_z}, \alpha_0)} \pi^{IP}(\alpha_0 \mid M_{A_z})\, d\alpha_0 \times \int \underbrace{\left(\int_{T(y,z)} m(w \mid M_{A_y}, \lambda_0)\, dw\right)}_{g_2(T(y,z) \mid M_{A_y}, \lambda_0)} \pi^{IP}(\lambda_0 \mid M_{A_y})\, d\lambda_0\right]

    = c_0\, d_0 \int\!\!\int E_{z_u}\left[g_1(T(z) \mid M_{A_z}, \alpha_0)\, g_2(T(y,z) \mid M_{A_y}, \lambda_0)\right] d\alpha_0\, d\lambda_0,    (3-20)

where the last equality follows from Fubini's theorem, since m(v \mid M_{A_z}, \alpha_0) and m(w \mid M_{A_y}, \lambda_0) are proper densities. From 3-20, the posterior odds are

    \frac{p(M_A \mid y, z_o)}{p(M_0 \mid y, z_o)} = \frac{\int\!\!\int E_{z_u}\left[g_1(T(z) \mid M_{A_z}, \alpha_0)\, g_2(T(y,z) \mid M_{A_y}, \lambda_0)\right] d\alpha_0\, d\lambda_0}{\int\!\!\int E_{z_u}\left[g_1(T(z) \mid M_{0_z}, \alpha_0)\, g_2(T(y,z) \mid M_{0_y}, \lambda_0)\right] d\alpha_0\, d\lambda_0}\; \frac{\pi(M_A)}{\pi(M_0)}.    (3-21)
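Although g_1 and g_2 have no closed form, each is the probability that a proper density lands in a truncation region, so it admits a simple Monte Carlo estimate: draw v from m(v \mid M_{A_z}, \alpha_0) and average the indicator of v \in T(z). A sketch under our own naming, for a fixed model and \alpha_0:

```python
import numpy as np

def g1_mc(z, X0, X_rA, alpha_0, N, n_draws=4000, rng=None):
    """Monte Carlo estimate of g1(T(z) | M_Az, alpha_0): draw v from
    m(v | M_Az, alpha_0) and average the indicator of v in T(z), where
    T(z) = {v_i > 0 if z_i = 1, v_i <= 0 if z_i = 0}."""
    rng = rng or np.random.default_rng()
    p_Az = X0.shape[1] + X_rA.shape[1]
    # intrinsic prior on alpha_rA: N(0, 2 (X~'X~)^{-1}) = N(0, (2N/p_Az)(X'X)^{-1})
    Sigma = (2 * N / p_Az) * np.linalg.inv(X_rA.T @ X_rA)
    a = rng.multivariate_normal(np.zeros(X_rA.shape[1]), Sigma, size=n_draws)
    v = X0 @ alpha_0 + a @ X_rA.T + rng.normal(size=(n_draws, len(z)))
    in_T = np.where(z == 1, v > 0, v <= 0).all(axis=1)
    return in_T.mean()

rng = np.random.default_rng(5)
N = 20
X0 = np.ones((N, 1))
X_rA = rng.normal(size=(N, 2))
z = rng.integers(0, 2, size=N)
est = g1_mc(z, X0, X_rA, np.array([0.0]), N, rng=rng)
```

The estimator is unbiased but its variance grows quickly with the number of sites, which is one reason the latent-variable formulation of Section 3.3 is preferred in practice.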
3.5 Simulation Experiments

The proposed methodology was tested under 36 different scenarios, in which we evaluate the behavior of the algorithm by varying the number of sites, the number of surveys, the amount of signal in the predictors for the presence component and, finally, the amount of signal in the predictors for the detection component.

For each model component the base model is taken to be the intercept-only model, and the full models considered for the presence and the detection have, respectively, 30 and 20 predictors. Therefore the model space contains 2^{30} \times 2^{20} \approx 1.12 \times 10^{15} candidate models.

To control the amount of signal in the presence and detection components, values for the model parameters were purposefully chosen so that quantiles 10, 50 and 90 of the occupancy and detection probabilities match some pre-specified probabilities. Because presence and detection are binary variables, the amount of signal in each model component is associated with the spread and the center of the distribution of the occupancy and detection probabilities, respectively. Low signal levels correspond to occupancy or detection probabilities close to 0.5; high signal levels correspond to probabilities close to 0 or 1. Large spreads of the distributions of the occupancy and detection probabilities reflect greater heterogeneity among the observations collected, improving the discrimination capability of the model, and vice versa.
Therefore, for the presence component, the parameter values of the true model were chosen to set the median of the occupancy probabilities equal to 0.5. The chosen parameter values also fix quantiles 10 and 90 symmetrically about 0.5 at small (Q^z_{10} = 0.3, Q^z_{90} = 0.7), intermediate (Q^z_{10} = 0.2, Q^z_{90} = 0.8) and large (Q^z_{10} = 0.1, Q^z_{90} = 0.9) distances. For the detection component, the model parameters are obtained to reflect detection probabilities concentrated about low values (Q^y_{50} = 0.2), intermediate values (Q^y_{50} = 0.5) and high values (Q^y_{50} = 0.8), while keeping quantiles 10 and 90 fixed at 0.1 and 0.9, respectively.
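One way to carry out such a calibration (our reading of the design, not code from the dissertation): if the linear predictor x_i'\alpha is approximately normal across sites with mean \mu and standard deviation \sigma, then quantile p of the occupancy probabilities is \Phi(\mu + \sigma\Phi^{-1}(p)), so \mu and \sigma can be solved from the target quantiles and the coefficients rescaled accordingly.

```python
import numpy as np
from scipy.stats import norm

def calibrate_probs(q10, q50, q90, alpha_raw, X):
    """Rescale a raw probit linear predictor so that the induced probabilities
    Phi(eta) have (approximately) the target quantiles q10, q50, q90."""
    mu = norm.ppf(q50)                       # target median on the probit scale
    sigma = (norm.ppf(q90) - norm.ppf(q10)) / (norm.ppf(0.9) - norm.ppf(0.1))
    eta = X @ alpha_raw
    eta = mu + sigma * (eta - eta.mean()) / eta.std()   # standardize, rescale
    return norm.cdf(eta)

rng = np.random.default_rng(6)
X = rng.normal(size=(5000, 6))
psi = calibrate_probs(0.2, 0.5, 0.8, rng.normal(size=6), X)
# empirical quantiles 10/50/90 of psi land close to the targets 0.2 / 0.5 / 0.8
```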
Table 3-1. Simulation control parameters, occupancy model selector.

Parameter                      Values considered
N                              50, 100
J                              3, 5
(Q^z_10, Q^z_50, Q^z_90)       (0.3, 0.5, 0.7), (0.2, 0.5, 0.8), (0.1, 0.5, 0.9)
(Q^y_10, Q^y_50, Q^y_90)       (0.1, 0.2, 0.9), (0.1, 0.5, 0.9), (0.1, 0.8, 0.9)
There are in total 36 scenarios; these result from crossing all the levels of the simulation control parameters (Table 3-1). Under each of these scenarios, 20 data sets were generated at random. True presence and detection indicators were generated with the probit model formulation from Chapter 2, with the assumed true models M_{T_z} = \{1, x_2, x_{15}, x_{16}, x_{22}, x_{28}\} for the presence and M_{T_y} = \{1, q_7, q_{10}, q_{12}, q_{17}\} for the detection, with the predictors included in the randomly generated datasets. In this context, 1 represents the intercept term. Throughout this section we refer to predictors included in the true models as true predictors and to those absent as false predictors.

The selection procedure was conducted on each one of these data sets with two different priors on the model space: the uniform or equal probability prior, and a multiplicity correcting prior.

The results are summarized through the marginal posterior inclusion probabilities (MPIPs) for each predictor and also the five highest posterior probability models (HPM).
The MPIP for a given predictor, under a specific scenario and for a particular data set, is defined as

    p(\text{predictor is included} \mid y, z, w, v) = \sum_{M \in \mathcal{M}} I_{(\text{predictor} \in M)}\, p(M \mid y, z, w, v).    (3-22)
In addition, we compare the MPIP odds between predictors present in the true model and predictors absent from it. Specifically, we consider the minimum odds of marginal posterior inclusion probabilities between the two groups of predictors. Let \tilde{\xi} and \xi denote, respectively, a predictor in the true model M_T and a predictor absent from M_T. We define the minimum MPIP odds between the probabilities of true and false predictors as

    \text{minOdds}_{MPIP} = \frac{\min_{\tilde{\xi} \in M_T} p(I_{\tilde{\xi}} = 1 \mid \tilde{\xi} \in M_T)}{\max_{\xi \notin M_T} p(I_{\xi} = 1 \mid \xi \notin M_T)}.    (3-23)

If the variable selection procedure adequately discriminates true and false predictors, minOdds_MPIP will take values larger than one. The ability of the method to discriminate between the least probable true predictor and the most probable false predictor worsens as the indicator approaches 0.

3.5.1 Marginal Posterior Inclusion Probabilities for Model Predictors
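Both summaries are cheap to tabulate from the output of the stochastic search. The sketch below (names are ours, and the visit counts are a toy example) computes the MPIP of equation 3-22 from model visit counts and the minimum odds of equation 3-23.

```python
def mpip(visited, n_pred):
    """MPIP per (3-22): posterior mass of the models containing each predictor;
    `visited` maps a frozenset of active predictor indices to its visit count."""
    total = sum(visited.values())
    return [sum(c for m, c in visited.items() if j in m) / total
            for j in range(n_pred)]

def min_odds_mpip(p, true_set):
    # minOdds_MPIP per (3-23): worst true-predictor MPIP over best false one
    true_p = [p[j] for j in true_set]
    false_p = [p[j] for j in range(len(p)) if j not in true_set]
    return min(true_p) / max(false_p)

visited = {frozenset({0, 2}): 60, frozenset({0}): 25, frozenset({0, 2, 4}): 15}
p = mpip(visited, 5)            # [1.0, 0.0, 0.75, 0.0, 0.15]
odds = min_odds_mpip(p, {0, 2}) # 0.75 / 0.15 = 5.0
```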
For clarity, in Figures 3-1 through 3-5 only predictors in the true models are labeled, and they are emphasized with a dotted line passing through them. The left-hand-side plots in these figures contain the results for the presence component, and the ones on the right correspond to predictors in the detection component. The results obtained with the uniform model prior correspond to the black lines, and those for the multiplicity correcting prior are in red. In these figures the MPIPs have been averaged over all datasets from scenarios matching the condition observed.

In Figure 3-1 we contrast the mean MPIPs of the predictors over all datasets from scenarios with 50 sites to the mean MPIPs obtained for the scenarios with 100 sites. Similarly, Figure 3-2 compares the mean MPIPs of scenarios where 3 surveys are performed to those of scenarios having 5 surveys per site. Figures 3-4 and 3-5 show the effect of the different levels of signal considered in the occupancy probabilities and in the detection probabilities.

From these figures mainly three results can be drawn: (1) the effect of the model prior is substantial; (2) the proposed methods yield MPIPs that clearly separate true predictors from false predictors; and (3) the separation between MPIPs of true predictors and false predictors is noticeably larger in the detection component.
Regardless of the simulation scenario and model component observed, under the uniform prior false predictors obtain a relatively high MPIP. Conversely, the multiplicity correction prior strongly shrinks the MPIP of false predictors towards 0. In the presence component the MPIP of the true predictors is shrunk substantially under the multiplicity prior; however, there remains a clear separation between true and false predictors. In contrast, in the detection component the MPIP of true predictors remains relatively high (Figures 3-1 through 3-5).
Figure 3-1. Predictor MPIP averaged over scenarios with N=50 and N=100 sites, using uniform (U) and multiplicity correction (MC) priors.
Figure 3-2. Predictor MPIP averaged over scenarios with J=3 and J=5 surveys per site, using uniform (U) and multiplicity correction (MC) priors.
Figure 3-3. Predictor MPIP averaged over scenarios with the interaction between the number of sites and the surveys per site, using uniform (U) and multiplicity correction (MC) priors.
Figure 3-4. Predictor MPIP averaged over scenarios with equal signal in the occupancy probabilities, using uniform (U) and multiplicity correction (MC) priors.
Figure 3-5. Predictor MPIP averaged over scenarios with equal signal in the detection probabilities, using uniform (U) and multiplicity correction (MC) priors.
In scenarios where more sites were surveyed, the separation between the MPIP of true and false predictors grew in both model components (Figure 3-1). Increasing the number of sites has an effect on both components, given that every time a new site is included, covariate information is added to the design matrices of both the presence and the detection components.

On the other hand, increasing the number of surveys affects the MPIP of predictors in the detection component (Figures 3-2 and 3-3) but has only a marginal effect on predictors of the presence component. This may appear to be counterintuitive; however, increasing the number of surveys only increases the number of observations in the design matrix for the detection, while leaving the design matrix for the presence unaltered. The small changes observed in the MPIP of the presence predictors as J increases are exclusively a result of having additional detection indicators equal to 1 at sites that, with fewer surveys, would only have 0-valued detections.

From Figure 3-3 it is clear that, for the presence component, the effect of the number of sites dominates the behavior of the MPIP, especially when using the multiplicity correction priors. In the detection component the MPIP is influenced by both the number of sites and the number of surveys. The influence of increasing the number of surveys is larger when considering a smaller number of sites, and vice versa.

Regarding the effect of the distribution of the occupancy probabilities, we observe that mostly the detection component is affected. There is stronger discrimination between true and false predictors when the distribution has higher variability (Figure 3-4). This is consistent with intuition, since having the presence probabilities more concentrated about 0.5 implies that the predictors do not vary much from one site to the next, whereas having the occupancy probabilities more spread out has the opposite effect.
Finally, consider the effect of concentrating the detection probabilities about high or low values. For predictors in the detection component, the separation between MPIPs of true and false predictors is larger in scenarios where the distribution of the detection probability is centered about 0.2 or 0.8, when compared to those scenarios where this distribution is centered about 0.5 (where the signal of the predictors is weakest). For predictors in the presence component, having the detection probabilities centered at higher values slightly increases the inclusion probabilities of the true predictors (Figure 3-5) and reduces those of false predictors.
Table 3-2. Comparison of average minOdds_MPIP under scenarios having different numbers of sites (N=50, N=100) and different numbers of surveys per site (J=3, J=5), for the presence and detection components, using uniform and multiplicity correction priors.

                          Sites              Surveys
Comp.       π(M)      N=50     N=100      J=3      J=5
Presence    Unif      1.12      1.31      1.19     1.24
            MC        3.20      8.46      4.20     6.74
Detection   Unif      2.03      2.64      2.11     2.57
            MC       21.15     32.46     21.39    32.52
Table 3-3. Comparison of average minOdds_MPIP for the different levels of signal considered in the occupancy and detection probabilities, for the presence and detection components, using uniform and multiplicity correction priors.

                        (Q^z_10, Q^z_50, Q^z_90)                    (Q^y_10, Q^y_50, Q^y_90)
Comp.      π(M)  (0.3,0.5,0.7) (0.2,0.5,0.8) (0.1,0.5,0.9)  (0.1,0.2,0.9) (0.1,0.5,0.9) (0.1,0.8,0.9)
Presence   Unif       1.05          1.20          1.34           1.10          1.23          1.24
           MC         2.02          4.55          8.05           2.38          6.19          6.40
Detection  Unif       2.34          2.34          2.30           2.57          2.00          2.38
           MC        25.37         20.77         25.28          29.33         18.52         28.49
The separation between the MPIP of true and false predictors is even more noticeable in Tables 3-2 and 3-3, where the minimum MPIP odds between true and false predictors are shown. Under every scenario, the value of minOdds_MPIP (as defined in 3-23) was greater than 1, implying that, on average, even the lowest MPIP for a true predictor is higher than the maximum MPIP for a false predictor. In both components of the model the minOdds_MPIP are markedly larger under the multiplicity correction prior and increase with the number of sites and with the number of surveys.

For the presence component, increasing the signal in the occupancy probabilities, or having the detection probabilities concentrate about higher values, has a positive and considerable effect on the magnitude of the odds. For the detection component these odds are particularly high, especially under the multiplicity correction prior. Also, having the distribution of the detection probabilities center about low or high values increases the minOdds_MPIP.

3.5.2 Summary Statistics for the Highest Posterior Probability Model
Tables 3-4 through 3-7 show the number of true predictors that are included in the HPM (True +) and the number of false predictors excluded from it (True −). The mean proportions observed in these tables provide one clear message: the highest probability models chosen with either model prior commonly differ from the corresponding true models. The multiplicity correction prior's strong shrinkage only allows a few true predictors to be selected but, at the same time, it prevents any false predictors from being included in the HPM. On the other hand, the uniform prior includes in the HPM a larger proportion of true predictors, but at the expense of also introducing a large number of false predictors. This situation is exacerbated in the presence component, but also occurs to a minor extent in the detection component.
Table 3-4. Comparison between scenarios with 50 and 100 sites in terms of the average proportion of true positive and true negative terms in the highest probability models, for the presence and the detection components, using uniform and multiplicity correcting priors on the model space.

                        True +             True −
Comp.       π(M)     N=50    N=100      N=50    N=100
Presence    Unif     0.57     0.63      0.51     0.55
            MC       0.06     0.13      1.00     1.00
Detection   Unif     0.77     0.85      0.87     0.93
            MC       0.49     0.70      1.00     1.00
Having more sites or surveys improves the inclusion of true predictors and the exclusion of false ones in the HPM, for both the presence and detection components (Tables 3-4 and 3-5). On the other hand, if the distribution of the occupancy probabilities is more
Table 3-5. Comparison between scenarios with 3 and 5 surveys per site in terms of the proportion of true positive and true negative predictors, averaged over the highest probability models for the presence and the detection components, using uniform and multiplicity correcting priors on the model space.

                        True +             True −
Comp.       π(M)      J=3     J=5       J=3     J=5
Presence    Unif     0.59    0.61      0.52    0.54
            MC       0.08    0.10      1.00    1.00
Detection   Unif     0.78    0.85      0.87    0.92
            MC       0.50    0.68      1.00    1.00
spread out, the HPM includes more true predictors and fewer false ones in the presence component. In contrast, the effect of the spread of the occupancy probabilities on the detection HPM is negligible (Table 3-6). Finally, there is a positive relationship between the location of the median of the detection probabilities and the number of correctly classified true and false predictors for the presence. The HPM in the detection part of the model responds positively to low and high values of the median detection probability (increased signal levels) in terms of correctly classified true and false predictors (Table 3-7).
Table 3-6. Comparison between scenarios with different levels of signal in the occupancy component in terms of the proportion of true positive and true negative predictors, averaged over the highest probability models for the presence and the detection components, using uniform and multiplicity correcting priors on the model space.

                              True +                                  True −
Comp.      π(M)  (0.3,0.5,0.7) (0.2,0.5,0.8) (0.1,0.5,0.9)  (0.3,0.5,0.7) (0.2,0.5,0.8) (0.1,0.5,0.9)
Presence   Unif      0.55          0.61          0.64           0.50          0.54          0.55
           MC        0.02          0.08          0.18           1.00          1.00          1.00
Detection  Unif      0.81          0.82          0.81           0.90          0.89          0.89
           MC        0.57          0.61          0.59           1.00          1.00          1.00
3.6 Case Study: Blue Hawker Data Analysis

During 1999 and 2000, an intensive volunteer surveying effort coordinated by the Centre Suisse de Cartographie de la Faune (CSCF) was conducted in order to analyze the distribution of the blue hawker, Aeshna cyanea (Odonata: Aeshnidae), a common dragonfly in Switzerland. Given that Switzerland is a small and mountainous country,
Table 3-7. Comparison between scenarios with different levels of signal in the detection component in terms of the proportion of true positive and true negative predictors, averaged over the highest probability models for the presence and the detection components, using uniform and multiplicity correcting priors on the model space.

                              True +                                  True −
Comp.      π(M)  (0.1,0.2,0.9) (0.1,0.5,0.9) (0.1,0.8,0.9)  (0.1,0.2,0.9) (0.1,0.5,0.9) (0.1,0.8,0.9)
Presence   Unif      0.59          0.59          0.62           0.51          0.54          0.54
           MC        0.06          0.10          0.11           1.00          1.00          1.00
Detection  Unif      0.89          0.77          0.78           0.91          0.87          0.91
           MC        0.70          0.48          0.59           1.00          1.00          1.00
there is large variation in its topography and physio-geography; as such, elevation is a good candidate covariate to predict species occurrence at a large spatial scale. It can be used as a proxy for habitat type, intensity of land use, temperature, as well as some biotic factors (Kéry et al. 2010).

Repeated visits to 1-ha pixels took place to obtain the corresponding detection histories. In addition to the survey outcome, the x- and y-coordinates, the thermal level, the date of the survey and the elevation were recorded. Surveys were restricted to the known flight period of the blue hawker, which takes place between May 1 and October 10. In total, 2,572 sites were surveyed at least once during the surveying period. The number of surveys per site ranges from 1 to 22 within each survey year.
Kéry et al. (2010) summarize the results of this effort using AIC-based model comparisons: first, following a backwards elimination approach for the detection process while keeping the occupancy component fixed at the most complex model, and then, for the presence component, choosing among a group of three models while using the detection model already chosen. In our analysis of this dataset, for the detection and the presence we consider as the full models those used in Kéry et al. (2010), namely

    \Phi^{-1}(\psi) = \alpha_0 + \alpha_1\,\text{year} + \alpha_2\,\text{elev} + \alpha_3\,\text{elev}^2 + \alpha_4\,\text{elev}^3

    \Phi^{-1}(p) = \lambda_0 + \lambda_1\,\text{year} + \lambda_2\,\text{elev} + \lambda_3\,\text{elev}^2 + \lambda_4\,\text{elev}^3 + \lambda_5\,\text{date} + \lambda_6\,\text{date}^2,

where year = I_{\{year = 2000\}}.
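The two full-model design matrices can be assembled directly from the recorded covariates; a sketch (column names are hypothetical, and covariates are standardized before forming the polynomial terms, as is usual in such analyses):

```python
import numpy as np
import pandas as pd

def design_matrices(df):
    """Full-model design matrices for the presence (X) and detection (Q)
    components; assumes columns 'year', 'elev', 'date' (names are ours)."""
    std = lambda s: (s - s.mean()) / s.std()
    elev, date = std(df["elev"]), std(df["date"])
    year = (df["year"] == 2000).astype(float)      # year = I{year = 2000}
    X = np.column_stack([np.ones(len(df)), year, elev, elev**2, elev**3])
    Q = np.column_stack([np.ones(len(df)), year, elev, elev**2, elev**3,
                         date, date**2])
    return X, Q

df = pd.DataFrame({"year": [1999, 2000, 2000],
                   "elev": [450.0, 1200.0, 800.0],
                   "date": [152.0, 201.0, 230.0]})
X, Q = design_matrices(df)                          # shapes (3, 5) and (3, 7)
```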
The model spaces for these data contain 2^6 = 64 and 2^4 = 16 models, respectively, for the detection and occupancy components. That is, in total the model space contains 2^{4+6} = 1,024 models. Although this model space can be enumerated entirely, for illustration we implemented the algorithm from Section 3.3.4, generating 10,000 draws from the Gibbs sampler. Each one of the models sampled was chosen from the set of models that can be reached by changing the state of a single term in the current model (to inclusion or exclusion, accordingly). This allows a more thorough exploration of the model space because, for each of the 10,000 models drawn, the posterior probabilities of many more models are observed. Below, the labels for the predictors are followed by either "z" or "y", accordingly, to represent the component they pertain to. Finally, using the results of the model selection procedure, we conducted a validation step to determine the predictive accuracy of the HPMs and of the median probability models (MPMs). The performance of these models is then contrasted with that of the model ultimately selected by Kéry et al. (2010).

3.6.1 Results: Variable Selection Procedure
The model finally chosen for the presence component in Kéry et al. (2010) was not found among the five highest probability models under either model prior (Table 3-8). Moreover, the year indicator was never chosen under the multiplicity correcting prior, hinting that this term might correspond to a falsely identified predictor under the uniform prior. Results in Table 3-10 support this claim: the marginal posterior inclusion probability of the year predictor is 7% under the multiplicity correction prior. The multiplicity correction prior concentrates the model posterior probability mass more densely in the highest ranked models (90% of the mass is in the top five models) than the uniform prior (for which they account for 40% of the mass).

For the detection component, the HPM under both priors is the intercept-only model, which we represent in Table 3-9 with a blank label. In both cases this model obtains very
Table 3-8. Posterior probability of the five highest probability models in the presence component of the blue hawker data.

Uniform model prior
Rank   M_z selected            p(M_z|y)
1      yrz+elevz               0.10
2      yrz+elevz+elevz3        0.08
3      elevz2+elevz3           0.08
4      yrz+elevz2              0.07
5      yrz+elevz3              0.07

Multiplicity correcting model prior
Rank   M_z selected            p(M_z|y)
1      elevz+elevz3            0.53
2                              0.15
3      elevz+elevz2            0.09
4      elevz2                  0.06
5      elevz+elevz2+elevz3     0.05
high posterior probability. The terms contained in the cubic polynomial for the elevation appear to contain some relevant information; however, this conflicts with the MPIPs observed in Table 3-11, which under both model priors are relatively low (below 0.20 with the uniform prior and at most 0.04 with the multiplicity correcting prior).
Table 3-9. Posterior probability of the five highest probability models in the detection component of the blue hawker data.

Uniform model prior
Rank   M_y selected    p(M_y|y)
1                      0.45
2      elevy3          0.06
3      elevy2          0.05
4      elevy           0.05
5      yry             0.04

Multiplicity correcting model prior
Rank   M_y selected    p(M_y|y)
1                      0.86
2      elevy3          0.02
3      datey2          0.02
4      elevy2          0.02
5      yry             0.02
Finally, it is possible to use the MPIPs to obtain the median probability model, which contains the terms that have a MPIP higher than 50%. For the occupancy process (Table 3-10), under the uniform prior, the model includes the year, the elevation, and the elevation cubed. The MPM with the multiplicity correction prior coincides with the HPM from this prior. The MPM chosen for the detection component (Table 3-11) under both priors is the intercept-only model, again coinciding with the HPM.
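The MPM construction above amounts to thresholding the MPIPs at 50%. A minimal sketch of this rule, using the rounded presence-component values reported in Table 3-10 (since elevz^3 is reported as 0.50 but belongs to the MPM described here, the threshold is applied inclusively):

```python
def median_probability_model(mpip, threshold=0.5):
    """Return the predictors whose MPIP reaches the threshold.

    Table values are rounded, so elevz^3 (reported as 0.50) is kept,
    consistent with the MPM described in the text."""
    return sorted(name for name, prob in mpip.items() if prob >= threshold)

# MPIPs for the presence component, as reported in Table 3-10
mpip_uniform  = {"yrz": 0.53, "elevz": 0.51, "elevz^2": 0.45, "elevz^3": 0.50}
mpip_multcorr = {"yrz": 0.07, "elevz": 0.73, "elevz^2": 0.23, "elevz^3": 0.67}

print(median_probability_model(mpip_uniform))   # ['elevz', 'elevz^3', 'yrz']
print(median_probability_model(mpip_multcorr))  # ['elevz', 'elevz^3']
```

Both outputs match the MPMs stated above: the uniform prior keeps the year and the linear and cubic elevation terms, while the multiplicity correction prior drops the year.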
Given the outcomes of the simulation studies from Section 3.5, especially those pertaining to the detection component, the results in Table 3-11 appear to indicate that none of the predictors considered belong to the true model, especially when considering
Table 3-10. MPIP, presence component: p(predictor ∈ MTz | y, z, w, v).

Predictor   Unif   MultCorr
yrz         0.53   0.07
elevz       0.51   0.73
elevz^2     0.45   0.23
elevz^3     0.50   0.67
Table 3-11. MPIP, detection component: p(predictor ∈ MTy | y, z, w, v).

Predictor   Unif   MultCorr
yry         0.19   0.03
elevy       0.18   0.03
elevy^2     0.18   0.03
elevy^3     0.19   0.04
datey       0.16   0.03
datey^2     0.15   0.04
those derived with the multiplicity correction prior. On the other hand, for the presence component (Table 3-10), there is an indication that terms related to the cubic polynomial in elevz can explain the occupancy patterns.

3.6.2 Validation for the Selection Procedure
Approximately half of the sites were selected at random for training (i.e., for model selection and parameter estimation), and the remaining half were used as test data. In the previous section we observed that, using the marginal posterior inclusion probabilities of the predictors, our method effectively separates predictors in the true model from those that are not in it. However, in Tables 3-10 and 3-11 this separation is only clear for the presence component using the multiplicity correction prior.
Therefore, in the validation procedure we observe the misclassification rates for the detections using the following models: (1) the model ultimately recommended in Kery et al. (2010) (yrz+elevz+elevz^2+elevz^3 + elevy+elevy^2+datey+datey^2); (2) the highest probability model (HPM) with a uniform prior (yrz+elevz); (3) the HPM with a multiplicity correcting prior (elevz+elevz^3); (4) the median probability model (MPM), that is, the model including only predictors with a MPIP larger than 50%, with the uniform prior (yrz+elevz+elevz^3); and finally (5) the MPM with a multiplicity correction prior (elevz+elevz^3, the same as the HPM with multiplicity correction).
We must emphasize that the models resulting from the implementation of our model selection procedure used exclusively the training dataset. On the other hand, the model in Kery et al. (2010) was chosen to minimize the prediction error of the complete data.
Because this model was obtained from the full dataset, results derived from it can only be considered as a lower bound for the prediction errors.
The benchmark misclassification error rate for true 1's is high (close to 70%). However, the misclassification rate for true 0's, which account for most of the responses, is less pronounced (15%). Overall, the performance of the selected models is comparable: they yield considerably worse results than the benchmark for the true 1's, but achieve rates close to the benchmark for the true 0's. Pooling together the results for true ones and true zeros, the selected models with either prior have misclassification rates close to 30%, while the benchmark model has a joint misclassification error of 23% (Table 3-12).
Table 3-12. Mean misclassification rates for HPMs and MPMs using uniform and
multiplicity correction model priors.

Model                                                  True 1  True 0  Joint
Benchmark (Kery et al. 2010):
  yrz+elevz+elevz^2+elevz^3+elevy+elevy^2
  +datey+datey^2                                        0.66    0.15    0.23
HPM Unif: yrz+elevz                                     0.83    0.17    0.28
HPM MC:   elevz+elevz^3                                 0.82    0.18    0.28
MPM Unif: yrz+elevz+elevz^3                             0.82    0.18    0.29
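The entries of Table 3-12 are conditional misclassification rates of this kind; a sketch of how they could be computed from held-out detections (the vectors below are invented toy values, not the blue hawker data):

```python
def misclassification_rates(y_true, y_pred):
    """Misclassification rate among true 1's, among true 0's, and jointly."""
    n1 = sum(t == 1 for t in y_true)
    n0 = sum(t == 0 for t in y_true)
    err1 = sum(t == 1 and p != 1 for t, p in zip(y_true, y_pred)) / n1
    err0 = sum(t == 0 and p != 0 for t, p in zip(y_true, y_pred)) / n0
    joint = sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)
    return err1, err0, joint

# Toy detection history: mostly 0's, as in the blue hawker data
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [0, 1, 0, 0, 1, 0, 0, 0, 0, 0]
print(misclassification_rates(y_true, y_pred))  # (0.5, 0.125, 0.2)
```

Because the true 0's dominate, the joint rate sits much closer to the true-0 rate than to the true-1 rate, which is why the selected models remain competitive overall despite their poor performance on the true 1's.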
3.7 Discussion
In this Chapter we proposed an objective and fully automatic Bayesian methodology for the single-season site-occupancy model. The methodology is said to be fully automatic because no hyper-parameter specification is necessary in defining the parameter priors, and objective because it relies on intrinsic priors derived from noninformative priors. The intrinsic priors have been shown to have desirable properties as testing priors. We also propose a fast stochastic search algorithm to explore large model spaces using our model selection procedure.
Our simulation experiments demonstrated the ability of the method to single out the predictors present in the true model when considering the marginal posterior inclusion probabilities of the predictors. For predictors in the true model, these probabilities were comparatively larger than those for predictors absent from it. Also, the simulations indicated that the method has a greater discrimination capability for predictors in the detection component of the model, especially when using multiplicity correction priors.
Multiplicity correction priors were not described in this Chapter; however, their influence on the selection outcome is significant. This behavior was observed both in the simulation experiment and in the analysis of the blue hawker data. Model priors play an essential role: as the number of predictors grows, they are instrumental in controlling for the selection of false positive predictors. Additionally, model priors can be used to account for predictor structure in the selection process, which helps both to reduce the size of the model space and to make the selection more robust. These issues are the topic of the next Chapter.
Accounting for the polynomial hierarchy in the predictors within the occupancy context is a straightforward extension of the procedures we describe in Chapter 4; hence, our next step is to develop efficient software for it. An additional direction we plan to pursue is developing methods for occupancy variable selection in a multivariate setting. This can be used to conduct hypothesis testing in scenarios with conditions that vary through time, or in the case where multiple species are co-observed. A final variation we will investigate for this problem is that of occupancy model selection incorporating random effects.
CHAPTER 4
PRIORS ON THE MODEL SPACE AND WELL-FORMULATED MODELS

It has long been an axiom of mine that the little things are infinitely the most important.
–Sherlock Holmes, A Case of Identity
4.1 Introduction
In regression problems, if a large number of potential predictors is available, the complete model space is too large to enumerate, and automatic selection algorithms are necessary to find informative, parsimonious models. This multiple testing problem is difficult, and even more so when interactions or powers of the predictors are considered. In the ecological literature, models with interactions and/or higher order polynomial terms are ubiquitous (Johnson et al. 2013; Kery et al. 2010; Zeller et al. 2011), given the complexity and non-linearities found in ecological processes. Several model selection procedures, even in the classical normal linear setting, fail to address two fundamental issues: (1) the model selection outcome is not invariant to affine transformations when interactions or polynomial structures are found among the predictors, and (2) additional penalization is required to control for false positives as the model space grows (i.e., as more covariates are considered).
These two issues motivate the methods developed throughout this Chapter. Building on the results of Chipman (1996), we propose, investigate, and provide recommendations for three different prior distributions on the model space. These priors help control for test multiplicity while accounting for polynomial structure in the predictors. They improve upon those proposed by Chipman, first by avoiding the need to specify values for the prior inclusion probabilities of the predictors, and second by formulating principled alternatives to introduce additional structure in the model priors. Finally, we design a stochastic search algorithm that allows fast and thorough exploration of model spaces with polynomial structure.
Having structure in the predictors can determine the selection outcome. As an illustration, consider the model E[y] = β00 + β01·x2 + β20·x1^2, where the order-one term x1 is not present (this choice of subscripts for the coefficients is defined in the following section). Transforming x1 ↦ x1* = x1 + c for some c ≠ 0, the model becomes E[y] = β00 + β01·x2 + β20*·(x1*)^2. Note that, in terms of the original predictors, (x1*)^2 = x1^2 + 2c·x1 + c^2, implying that this seemingly innocuous transformation of x1 modifies the column space of the design matrix by including x1, which was not in the original model. That is, when lower order terms in the hierarchy are omitted from the model, the column space of the design matrix is not invariant to affine transformations. As the hat matrix depends on the column space, the model's predictive capability is also affected by how the covariates in the model are coded, an undesirable feature for any model selection procedure. To make model selection invariant to affine transformations, the selection must be constrained to the subset of models that respect the hierarchy (Griepentrog et al. 1982; Khuri 2002; McCullagh & Nelder 1989; Nelder 2000; Peixoto 1987, 1990). These models are known as well-formulated models (WFMs). Succinctly, a model is well-formulated if, for any predictor in the model, every lower order predictor associated with it is also in the model. The model above is not well-formulated, as it contains x1^2 but not x1.
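The non-invariance argument above can be checked numerically: for the non-well-formulated model {1, x2, x1^2}, shifting x1 changes the hat matrix, while adding the parent term x1 restores invariance. A sketch with simulated covariates (requires numpy; variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20
x1, x2 = rng.normal(size=n), rng.normal(size=n)

def hat(Z):
    """Hat matrix H = Z (Z'Z)^{-1} Z'."""
    return Z @ np.linalg.solve(Z.T @ Z, Z.T)

# Not well-formulated: contains x1^2 but not its parent x1
Z  = np.column_stack([np.ones(n), x2, x1 ** 2])
Zc = np.column_stack([np.ones(n), x2, (x1 + 1.0) ** 2])   # shift x1 by c = 1
print(np.allclose(hat(Z), hat(Zc)))    # False: column space changed

# Well-formulated: parent term x1 included
W  = np.column_stack([np.ones(n), x1, x2, x1 ** 2])
Wc = np.column_stack([np.ones(n), x1 + 1.0, x2, (x1 + 1.0) ** 2])
print(np.allclose(hat(W), hat(Wc)))    # True: hat matrix unchanged
```

In the well-formulated case, (x1 + c)^2 lies in the span of {1, x1, x1^2}, so the column space, and hence the hat matrix, is unchanged.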
WFMs exhibit strong heredity, in that all lower order terms dividing higher order terms in the model must also be included. An alternative is to require only weak heredity (Chipman 1996), which forces only some of the lower terms in the corresponding polynomial hierarchy to be in the model. However, Nelder (1998) demonstrated that the conditions under which weak heredity allows the design matrix to be invariant to affine transformations of the predictors are too restrictive to be useful in practice.
Although this topic appeared in the literature more than three decades ago (Nelder 1977), only recently have modern variable selection techniques been adapted to account for the constraints imposed by heredity. As described in Bien et al. (2013), the current literature on variable selection for polynomial response surface models can be classified into three broad groups: multi-step procedures (Brusco et al. 2009; Peixoto 1987), regularized regression methods (Bien et al. 2013; Yuan et al. 2009), and Bayesian approaches (Chipman 1996). The methods introduced in this Chapter take a Bayesian approach towards variable selection for well-formulated models, with particular emphasis on model priors.
As mentioned in previous chapters, the Bayesian variable selection problem consists of finding models with high posterior probabilities within a pre-specified model space M. The model posterior probability for M ∈ M is given by

    p(M | y, M) ∝ m(y | M) π(M | M).    (4-1)
Model posterior probabilities depend on the prior distribution on the model space, as well as on the prior distributions for the model-specific parameters, implicitly through the marginals m(y|M). Priors on the model-specific parameters have been extensively discussed in the literature (Berger & Pericchi 1996; Berger et al. 2001; George 2000; Jeffreys 1961; Kass & Wasserman 1996; Liang et al. 2008; Zellner & Siow 1980). In contrast, the effect of the prior on the model space has, until recently, been neglected. A few authors (e.g., Casella et al. (2014); Scott & Berger (2010); Wilson et al. (2010)) have highlighted the relevance of the priors on the model space in the context of multiple testing. Adequately formulating priors on the model space can both account for structure in the predictors and provide additional control on the detection of false positive terms. In addition, using the popular uniform prior over the model space may lead to the undesirable and "informative" implication of favoring models of size p/2 (where p is the total number of covariates), since this is the most abundant model size contained in the model space.
Variable selection within the model space of well-formulated polynomial models poses two challenges for automatic objective model selection procedures. First, the notion of model complexity takes on a new dimension: complexity is not exclusively a function of the number of predictors, but also depends upon the depth and connectedness of the associations defined by the polynomial hierarchy. Second, because the model space is shaped by such relationships, stochastic search algorithms used to explore the models must also conform to these restrictions.

Models without polynomial hierarchy constitute a special case of WFMs where all predictors are of order one. Hence, all the methods developed throughout this Chapter also apply to models with no predictor structure. Additionally, although our proposed methods are presented for the normal linear case to simplify the exposition, these methods are general enough to be embedded in many Bayesian selection and averaging procedures, including, of course, the occupancy framework previously discussed.
In this Chapter, first we provide the necessary definitions to characterize the well-formulated model selection problem. Then we proceed to introduce three new prior structures on the well-formulated model space and characterize their behavior with simple examples and simulations. With the model priors in place, we build a stochastic search algorithm to explore spaces of well-formulated models that relies on intrinsic priors for the model-specific parameters, though this assumption can be relaxed to use other mixtures of g-priors. Finally, we implement our procedures using both simulated and real data.
4.2 Setup for Well-Formulated Models
Suppose that the observations y_i are modeled using the polynomial regression on the covariates x_{i1}, …, x_{ip} given by

    y_i = Σ_{α ∈ N_0^p} β_{(α_1,…,α_p)} Π_{j=1}^{p} x_{ij}^{α_j} + ε_i,    (4-2)

where α = (α_1, …, α_p) belongs to N_0^p, the p-dimensional space of natural numbers including 0, with ε_i iid∼ N(0, σ^2), and only finitely many β_α are allowed to be non-zero. As an illustration, consider a model space that includes polynomial terms incorporating covariates x_{i1} and x_{i2} only. The terms x_{i2}^2 and x_{i1}^2 x_{i2} can be represented by α = (0, 2) and α = (2, 1), respectively.
The notation y = Z(X)β + ε is used to denote that the observed response y = (y_1, …, y_n)′ is modeled via a polynomial function Z of the original covariates contained in X = (x_1, …, x_p) (where x_j = (x_{1j}, …, x_{nj})′), and the coefficients of the polynomial terms are given by β. A specific polynomial model M is defined by the set of coefficients β_α that are allowed to be non-zero. This definition is equivalent to characterizing M through a collection of multi-indices α ∈ N_0^p. In particular, model M is specified by M = {α_1^M, …, α_{|M|}^M} for α_k^M ∈ N_0^p, where β_α = 0 for α ∉ M.

Any particular model M uses a subset X_M of the original covariates X to form the polynomial terms in the design matrix Z_M(X). Without ambiguity, a polynomial model Z_M(X) on X can be identified with a polynomial model Z_M(X_M) on the covariates X_M. The number of terms used by M to model the response y, denoted by |M|, corresponds to the number of columns of Z_M(X_M). The coefficient vector and error variance of model M are denoted by β_M and σ_M^2, respectively. Thus M models the data as y = Z_M(X_M)β_M + ε_M, where ε_M ∼ N(0, σ_M^2 I). Model M is said to be nested in model M′ if M ⊂ M′. M models the response of the covariates in two distinct ways: choosing the set of meaningful covariates X_M, as well as choosing the polynomial structure of these covariates, Z_M(X_M).
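Concretely, a model M is a finite set of multi-indices, and Z_M(X_M) is assembled one column per multi-index; a minimal sketch (requires numpy; the function name and tuple encoding are ours):

```python
import numpy as np

def design_matrix(X, model):
    """Assemble Z_M(X): one column per multi-index alpha in M,
    the column being prod_j x_j ** alpha_j."""
    cols = [np.prod(X ** np.asarray(alpha, dtype=float), axis=1)
            for alpha in sorted(model)]
    return np.column_stack(cols)

X = np.array([[2.0, 3.0],
              [1.0, 4.0]])               # n = 2 observations, p = 2 covariates
M = {(0, 0), (1, 0), (0, 1), (1, 1)}     # terms 1, x1, x2, x1*x2
Z = design_matrix(X, M)
print(Z)   # columns ordered by sorted(M): 1, x2, x1, x1*x2
```

Here |M| = 4 is the number of columns of Z_M(X_M), and dropping a multi-index from M simply drops the corresponding column.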
The set N_0^p constitutes a partially ordered set, or, more succinctly, a poset. A poset is a set partially ordered through a binary relation "≼". In this context, the binary relation on the poset N_0^p is defined between pairs (α, α′) by α′ ≼ α whenever α_j ≥ α′_j for all j = 1, …, p, with α′ ≺ α if, additionally, α_j > α′_j for some j. The order of a term α ∈ N_0^p is given by the sum of its elements, order(α) = Σ_j α_j. When order(α) = order(α′) + 1 and α′ ≺ α, then α′ is said to immediately precede α, which is denoted by α′ → α. The parent set of α is defined by P(α) = {α′ ∈ N_0^p : α′ → α}, and is given by the set of nodes that immediately precede the given node. A polynomial model M is said to be well-formulated if α ∈ M implies that P(α) ⊂ M. For example, any well-formulated model using x_{i1}^2 x_{i2} to model y_i must also include the parent terms x_{i1}x_{i2} and x_{i1}^2, their corresponding parent terms x_{i1} and x_{i2}, and the intercept term 1.
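The parent-set definition translates directly into code: a parent of α is obtained by lowering one positive exponent by one, and M is well-formulated when every term's parents are also in M. A sketch, encoding each term as a tuple of exponents (helper names are ours):

```python
def parents(alpha):
    """P(alpha): lower one positive exponent of alpha by 1."""
    return {alpha[:j] + (alpha[j] - 1,) + alpha[j + 1:]
            for j in range(len(alpha)) if alpha[j] > 0}

def is_well_formulated(model):
    """M is well-formulated iff every term's parent set lies in M."""
    return all(parents(a) <= model for a in model)

# x1^2 x2 needs x1*x2 and x1^2; those in turn need x1, x2 and the intercept
M_ok  = {(0, 0), (1, 0), (0, 1), (2, 0), (1, 1), (2, 1)}
M_bad = {(0, 0), (0, 1), (2, 0)}      # has x1^2 but not its parent x1
print(parents((2, 1)))                                      # {(1, 1), (2, 0)}
print(is_well_formulated(M_ok), is_well_formulated(M_bad))  # True False
```

Iterating `parents` from (2, 1) recovers the full ancestry {x1x2, x1^2, x1, x2, 1} required by the example in the text.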
The poset N_0^p can be represented by a Directed Acyclic Graph (DAG). Without ambiguity, we can identify nodes in the graph, α ∈ N_0^p, with terms in the set of covariates. The graph has directed edges to a node from its parents. Any well-formulated model M is represented by a subgraph of this DAG with the property that if node α is in the subgraph, then the nodes corresponding to P(α) are also in it. Figure 4-1 shows examples of well-formulated polynomial models, where α ∈ N_0^p is identified with Π_{j=1}^p x_j^{α_j}.
The motivation for considering only well-formulated polynomial models is compelling. Let Z_M be the design matrix associated with a polynomial model. The subspace of y modeled by Z_M, given by the hat matrix H_M = Z_M(Z_M′ Z_M)^{-1} Z_M′, is invariant to affine transformations of the matrix X_M if and only if M corresponds to a well-formulated polynomial model (Peixoto 1990).
Figure 4-1. Graphs of well-formulated polynomial models for p = 2.
For example, if p = 2 and y_i = β_{00} + β_{10}x_{i1} + β_{01}x_{i2} + β_{11}x_{i1}x_{i2} + ε_i, then the hat matrix is invariant to any covariate transformation of the form A(x_{i1}, x_{i2})′ + b, for any real-valued positive definite 2 × 2 matrix A and any real-valued vector b of dimension two. In contrast, if y_i = β_{00} + β_{20}x_{i1}^2 + ε_i, then the hat matrix formed after applying the transformation x_{i1} ↦ x_{i1} + c, for real c ≠ 0, is not the same as the hat matrix formed with the original x_{i1}.

4.2.1 Well-Formulated Model Spaces
The spaces of WFMs M considered in this paper can be characterized in terms of two WFMs: M_B, the base model, and M_F, the full model. The base model contains at least the intercept term and is nested in the full model. The model space M is populated by all well-formulated models M that nest M_B and are nested in M_F:

    M = {M : M_B ⊆ M ⊆ M_F and M is well-formulated}.

For M to be well-formulated, the entire ancestry of each node in M must also be included in M. Because of this, M ∈ M can be uniquely identified by two different sets of nodes in M_F: the set of extreme nodes and the set of children nodes. For M ∈ M,
the sets of extreme and children nodes, respectively denoted by E(M) and C(M), are defined by

    E(M) = {α ∈ M∖M_B : α ∉ P(α′) ∀ α′ ∈ M},
    C(M) = {α ∈ M_F∖M : {α} ∪ M is well-formulated}.

The extreme nodes are those nodes that, when removed from M, give rise to a WFM in M. The children nodes are those nodes that, when added to M, give rise to a WFM in M. Because M_B ⊆ M for all M ∈ M, the set of nodes E(M) ∪ M_B determines M, by beginning with this set and iteratively adding parent nodes. Similarly, the nodes in C(M) determine the set {α′ ∈ P(α) : α ∈ C(M)} ∪ {α′ ∈ E(M_F) : α ⋠ α′ for all α ∈ C(M)}, which contains E(M) ∪ M_B and thus uniquely identifies M.
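Both sets can be computed directly from these definitions; a sketch reproducing the Figure 4-2 example (tuple encoding and helper names are ours):

```python
def parents(alpha):
    """P(alpha): lower one positive exponent of alpha by 1."""
    return {alpha[:j] + (alpha[j] - 1,) + alpha[j + 1:]
            for j in range(len(alpha)) if alpha[j] > 0}

def extreme_nodes(model, base):
    """E(M): nodes outside M_B that are not a parent of any node in M."""
    return {a for a in model - base
            if not any(a in parents(b) for b in model)}

def children_nodes(model, full):
    """C(M): nodes of M_F outside M whose full parent set lies in M."""
    return {a for a in full - model if parents(a) <= model}

# The Figure 4-2 example: M = {1, x1, x1^2} inside the quadratic surface
MB = {(0, 0)}
MF = {(0, 0), (1, 0), (0, 1), (2, 0), (1, 1), (0, 2)}
M  = {(0, 0), (1, 0), (2, 0)}
print(extreme_nodes(M, MB))    # {(2, 0)}  i.e. x1^2
print(children_nodes(M, MF))   # {(0, 1)}  i.e. x2
```

Removing the extreme node x1^2 or adding the child node x2 each leaves a well-formulated model, as the definitions require.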
Figure 4-2. Extreme node set (A) and children node set (B) for the model M = {1, x1, x1^2} in the space with M_F = {1, x1, x2, x1^2, x1x2, x2^2}.

In Figure 4-2, the extreme and children sets for the model M = {1, x1, x1^2} are shown for the model space characterized by M_F = {1, x1, x2, x1^2, x1x2, x2^2}. In Figure 4-2A, the solid nodes represent nodes α ∈ M∖E(M), the dashed node corresponds to α ∈ E(M), and the dotted nodes are not in M. Solid nodes in Figure 4-2B correspond to those in M; the dashed node is the single node in C(M), and the dotted nodes are not in M ∪ C(M).

4.3 Priors on the Model Space
As discussed in Scott & Berger (2010), the Ockham's-razor effect found automatically in Bayesian variable selection through the Bayes factor does not correct for multiple testing. This penalization acts against more complex models, but does not account for the collection of models in the model space, which describes the multiplicity of the testing problem. This is where the role of the prior on the model space becomes important. As Scott & Berger explain, the multiplicity penalty is "hidden away" in the model prior probabilities π(M|M).
In what follows, we propose three different prior structures on the model space for WFMs, discuss their advantages and disadvantages, and describe reasonable choices for their hyper-parameters. In addition, we investigate how the choice of prior structure and hyper-parameter combinations affects the posterior probabilities of predictor inclusion, providing some recommendations for different situations.

4.3.1 Model Prior Definition
The graphical structure of the model space suggests a method for prior construction on M guided by the notion of inheritance. A node α is said to inherit from a node α′ if there is a directed path from α′ to α in the graph of M_F. The inheritance is said to be immediate if order(α) = order(α′) + 1 (equivalently, if α′ ∈ P(α), or if α′ immediately precedes α).

For α ∈ M_F∖M_B, let γ_α(M) be the indicator describing whether α is included in M, that is, γ_α(M) = I(α ∈ M). Denote by γ^ν(M) the set of inclusion indicators for all order-ν nodes in M_F∖M_B, and let γ^{<ν}(M) = ∪_{j=0}^{ν−1} γ^j(M) be the set of inclusion indicators for all nodes in M_F∖M_B of order less than ν. With these definitions, the prior probability of any model M ∈ M can be factored as

    π(M | M) = Π_{j=J_M^min}^{J_M^max} π(γ^j(M) | γ^{<j}(M), M),    (4-3)

where J_M^min and J_M^max are, respectively, the minimum and maximum orders of nodes in M_F∖M_B, and π(γ^{J_M^min}(M) | γ^{<J_M^min}(M), M) = π(γ^{J_M^min}(M) | M).
Prior distributions on M can be simplified by making two assumptions. First, if order(α) = order(α′) = j, then γ_α and γ_α′ are assumed to be conditionally independent given γ^{<j}, denoted by γ_α ⊥⊥ γ_α′ | γ^{<j}. Second, immediate inheritance is invoked: it is assumed that if order(α) = j, then γ_α(M) | γ^{<j}(M) = γ_α(M) | γ_{P(α)}(M), where γ_{P(α)}(M) is the inclusion indicator for the set of parent nodes of α. This indicator is one if the complete parent set of α is contained in M, and zero otherwise.

In Figure 4-3, these two assumptions are depicted with M_F being an order-two surface in two main effects. The conditional independence assumption (Figure 4-3A) implies that the inclusion indicators for x1^2, x2^2, and x1x2 are independent when conditioned on all the lower order terms. In this same space, immediate inheritance implies that the inclusion of x1^2, conditioned on the inclusion of all lower order nodes, is equivalent to conditioning on its parent set (x1 in this case).
Figure 4-3. The two simplifying assumptions on the order-two surface: (A) conditional independence, x1^2 ⊥⊥ x1x2 ⊥⊥ x2^2 given {1, x1, x2}; (B) immediate inheritance, x1^2 | {1, x1, x2} = x1^2 | {x1}.
Denote the conditional inclusion probability of node α in model M by π_α = π(γ_α(M) = 1 | γ_{P(α)}(M), M). Under the assumptions of conditional independence and immediate inheritance, the prior probability of M is

    π(M | π_M, M) = Π_{α ∈ M_F∖M_B} π_α^{γ_α(M)} (1 − π_α)^{1−γ_α(M)},    (4-4)

with π_M = {π_α : α ∈ M_F∖M_B}. Because M must be well-formulated, π_α = γ_α = 0 if γ_{P(α)}(M) = 0. Thus the product in (4-4) can be restricted to the set of nodes α ∈ (M∖M_B) ∪ C(M). Additional structure can be built into the prior on M by making assumptions about the inclusion probabilities π_α, such as equality assumptions, or by assuming a hyper-prior for these parameters. Three such prior classes are developed next, first by assigning hyper-priors on π_M assuming some structure among its elements, and then marginalizing out π_M.
Hierarchical Uniform Prior (HUP). The HUP assumes that the non-zero π_α are all equal: for a model M ∈ M, it is assumed that π_α = π for all α ∈ (M∖M_B) ∪ C(M). A complete Bayesian specification of the HUP is obtained by assuming a prior distribution for π. The choice π ∼ Beta(a, b) produces

    π_HUP(M | M, a, b) = B(|M∖M_B| + a, |C(M)| + b) / B(a, b),    (4-5)

where B is the beta function. Setting a = b = 1 gives the particular value of

    π_HUP(M | M, a = 1, b = 1) = [1 / (|M∖M_B| + |C(M)| + 1)] · (|M∖M_B| + |C(M)| choose |M∖M_B|)^{-1}.    (4-6)

The HUP assigns equal probabilities to all models for which the sets of nodes M∖M_B and C(M) have the same cardinalities. This prior provides a combinatorial penalization, but essentially fails to account for the hierarchical structure of the model space. An additional penalization for model complexity can be incorporated into the HUP by changing the values of a and b. Because π_α = π for all α, this penalization can only depend on some aspect of the entire graph of M_F, such as the total number of nodes not in the base model, |M_F∖M_B|.
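As a numerical check of Eq. (4-6): enumerating all well-formulated models between the intercept-only base model and the full quadratic surface in two covariates (the 13 models of Figure 4-4) and summing the HUP probabilities gives 1, confirming that the prior is proper. A sketch (tuple encoding and helper names are ours):

```python
from itertools import combinations
from math import comb

def parents(alpha):
    """P(alpha): lower one positive exponent by 1."""
    return {alpha[:j] + (alpha[j] - 1,) + alpha[j + 1:]
            for j in range(len(alpha)) if alpha[j] > 0}

def is_wf(model):
    """Well-formulated: every term's parent set lies in the model."""
    return all(parents(a) <= model for a in model)

MB = {(0, 0)}                                           # intercept-only base
MF = {(0, 0), (1, 0), (0, 1), (2, 0), (1, 1), (0, 2)}   # quadratic surface

def hup_11(model):
    """Eq. (4-6): HUP prior probability with a = b = 1."""
    g = len(model - MB)
    c = sum(1 for a in MF - model if parents(a) <= model)   # |C(M)|
    return 1.0 / ((g + c + 1) * comb(g + c, g))

free = sorted(MF - MB)
wfms = [MB | set(s) for k in range(len(free) + 1)
        for s in combinations(free, k) if is_wf(MB | set(s))]
print(len(wfms))                                  # 13 models, as in Figure 4-4
print(round(sum(hup_11(m) for m in wfms), 10))    # 1.0: the prior is proper
```

The base model receives probability 1/3 and the full model 1/6 under this choice, matching the HUP (1, 1) column of Figure 4-4.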
Hierarchical Independence Prior (HIP). The HIP assumes that there are no equality constraints among the non-zero π_α; each non-zero π_α is given its own prior, which is assumed to be a Beta distribution with parameters a_α and b_α. Thus the prior probability of M under the HIP is

    π_HIP(M | M, a, b) = Π_{α ∈ M∖M_B} [a_α / (a_α + b_α)] × Π_{α ∈ C(M)} [b_α / (a_α + b_α)],    (4-7)

where a product over the empty set is taken to be 1. Because the π_α are totally independent, any choice of a_α and b_α is equivalent to choosing a probability of success π_α for a given α. Setting a_α = b_α = 1 for all α ∈ (M∖M_B) ∪ C(M) gives the particular value of

    π_HIP(M | M, a = 1, b = 1) = (1/2)^{|M∖M_B| + |C(M)|}.    (4-8)

Although the prior with this choice of hyper-parameters accounts for the hierarchical structure of the model space, it provides essentially no penalization for combinatorial complexity at different levels of the hierarchy. This can be observed by considering a model space with main effects only: the exponent in (4-8) is the same for every model in the space, because each node is either in the model or in the children set.

Additional penalizations for model complexity can be incorporated into the HIP. Because each γ^j is conditioned on γ^{<j} in the prior construction, the a_α and b_α for α of order j can be conditioned on γ^{<j}. One such additional penalization utilizes the number of nodes of order j that could be added to produce a WFM conditioned on the inclusion vector γ^{<j}, denoted by ch_j(γ^{<j}). Choosing a_α = 1 and b_α(M) = ch_j(γ^{<j}) is equivalent to choosing a probability of success π_α = 1/(1 + ch_j(γ^{<j})). This penalization can drive down the false positive rate when ch_j(γ^{<j}) is large, but may produce more false negatives.
Hierarchical Order Prior (HOP). A compromise between complete equality and complete independence of the π_α is to assume equality between the π_α of a given order, and independence across the different orders. Define Γ_j(M) = {α ∈ M∖M_B : order(α) = j} and C_j(M) = {α ∈ C(M) : order(α) = j}. The HOP assumes that π_α = π_j for all α ∈ Γ_j(M) ∪ C_j(M). Assuming that π_j ∼ Beta(a_j, b_j) provides a prior probability of

    π_HOP(M | M, a, b) = Π_{j=J_M^min}^{J_M^max} B(|Γ_j(M)| + a_j, |C_j(M)| + b_j) / B(a_j, b_j).    (4-9)

The specific choice of a_j = b_j = 1 for all j gives a value of

    π_HOP(M | M, a = 1, b = 1) = Π_j [1 / (|Γ_j(M)| + |C_j(M)| + 1)] · (|Γ_j(M)| + |C_j(M)| choose |Γ_j(M)|)^{-1},    (4-10)

and produces a hierarchical version of the Scott and Berger multiplicity correction.
The HOP arises from a conditional exchangeability assumption on the indicator variables: conditioned on γ^{<j}(M), the indicators γ_α for the order-j nodes α in (M∖M_B) ∪ C(M) are assumed to be exchangeable Bernoulli random variables. By de Finetti's theorem, these arise from independent Bernoulli random variables with a common probability of success π_j having a prior distribution; our construction of the HOP assumes that this prior is a beta distribution. Additional complexity penalizations can be incorporated into the HOP in a similar fashion to the HIP. The number of possible nodes of order j that could be included while maintaining a WFM is given by ch_j(M) = ch_j(γ^{<j}(M)), the number of order-j nodes in (M∖M_B) ∪ C(M). Using a_j = 1 and b_j(M) = ch_j(M) produces a prior with two desirable properties. First, if M′ ⊂ M, then π(M) ≤ π(M′). Second, for each order j, the conditional probability of including k nodes is greater than or equal to that of including k + 1 nodes, for k = 0, 1, …, ch_j(M) − 1.

4.3.2 Choice of Prior Structure and Hyper-Parameters
Each of the priors introduced in Section 4.3.1 defines a whole family of model priors, characterized by the probability distribution assumed for the inclusion probabilities π_M. For the sake of simplicity, this paper focuses on those arising from Beta distributions, and concentrates on particular choices of hyper-parameters that can be specified automatically. First, we describe some general features of how each of the three prior structures (HUP, HIP, HOP) allocates mass to the models in the model space. Second, as there is an infinite number of ways in which the hyper-parameters can be specified, focus is placed on the default choice a = b = 1, as well as on the complexity penalizations described in Section 4.3.1. The second alternative is referred to as a = 1, b = ch, where b = ch has a slightly different interpretation depending on the prior structure. Accordingly, b = ch is given by b_j(M) = b_α(M) = ch_j(M) = |Γ_j(M) ∪ C_j(M)|, with j = order(α), for the HOP and HIP, while b = ch denotes that b = |M_F∖M_B| for the HUP. The prior behavior is illustrated for two model spaces; in both cases, the base model M_B is taken to be the intercept-only model and M_F is the DAG shown (Figures 4-4 and 4-5). The priors treat model complexity differently, and some general properties can be seen in these examples.
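The (1, 1) entries of Figure 4-4 can be reproduced directly from Eqs. (4-8) and (4-10); a sketch for the quadratic-surface space (tuple encoding and helper names are ours):

```python
from math import comb

def parents(alpha):
    """P(alpha): lower one positive exponent by 1."""
    return {alpha[:j] + (alpha[j] - 1,) + alpha[j + 1:]
            for j in range(len(alpha)) if alpha[j] > 0}

MB = {(0, 0)}
MF = {(0, 0), (1, 0), (0, 1), (2, 0), (1, 1), (0, 2)}   # quadratic surface, p = 2

def children(model):
    """C(M): addable nodes of MF whose parent sets lie in M."""
    return {a for a in MF - model if parents(a) <= model}

def hip_11(model):
    """Eq. (4-8): (1/2) ** (|M \\ MB| + |C(M)|)."""
    return 0.5 ** (len(model - MB) + len(children(model)))

def hop_11(model):
    """Eq. (4-10): a Scott-Berger-type correction applied per order."""
    prob = 1.0
    for j in {sum(a) for a in (model - MB) | children(model)}:
        g = sum(1 for a in model - MB if sum(a) == j)
        c = sum(1 for a in children(model) if sum(a) == j)
        prob *= 1.0 / ((g + c + 1) * comb(g + c, g))
    return prob

M6 = {(0, 0), (1, 0), (0, 1)}      # model 6 of Figure 4-4: {1, x1, x2}
print(hip_11(M6), hop_11(M6))      # 1/32 and 1/12, matching Figure 4-4
```

The base model illustrates the difference in spirit: the HIP gives it 1/4 (two independent coin flips for the excluded main effects), while the HOP gives it 1/3 (one Scott-Berger correction over the order-one level).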
                                     HIP              HOP              HUP
      Model                      (1,1)  (1,ch)    (1,1)  (1,ch)    (1,1)  (1,ch)
 1    1                          1/4    4/9       1/3    1/2       1/3    5/7
 2    1, x1                      1/8    1/9       1/12   1/12      1/12   5/56
 3    1, x2                      1/8    1/9       1/12   1/12      1/12   5/56
 4    1, x1, x1^2                1/8    1/9       1/12   1/12      1/12   5/168
 5    1, x2, x2^2                1/8    1/9       1/12   1/12      1/12   5/168
 6    1, x1, x2                  1/32   3/64      1/12   1/12      1/60   1/72
 7    1, x1, x2, x1^2            1/32   1/64      1/36   1/60      1/60   1/168
 8    1, x1, x2, x1x2            1/32   1/64      1/36   1/60      1/60   1/168
 9    1, x1, x2, x2^2            1/32   1/64      1/36   1/60      1/60   1/168
10    1, x1, x2, x1^2, x1x2      1/32   1/192     1/36   1/120     1/30   1/252
11    1, x1, x2, x1^2, x2^2      1/32   1/192     1/36   1/120     1/30   1/252
12    1, x1, x2, x1x2, x2^2      1/32   1/192     1/36   1/120     1/30   1/252
13    1, x1, x2, x1^2, x1x2, x2^2  1/32 1/576     1/12   1/120     1/6    1/252

Figure 4-4. Prior probabilities for the space of well-formulated models associated with
the quadratic surface in two variables, where M_B is taken to be the intercept-only
model and (a, b) ∈ {(1, 1), (1, ch)}.
First, contrast the choices of HIP, HUP, and HOP for (a, b) = (1, 1). The HIP induces a complexity penalization that accounts only for the order of the terms in the model. This is best exhibited by the model space in Figure 4-4: models including x1 and x2, models 6 through 13, are given the same prior probability, and no penalization is incurred for the inclusion of any or all of the quadratic terms. In contrast to the HIP, the
                                     HIP              HOP              HUP
      Model                      (1,1)  (1,ch)    (1,1)  (1,ch)    (1,1)  (1,ch)
 1    1                          1/8    27/64     1/4    1/2       1/4    4/7
 2    1, x1                      1/8    9/64      1/12   1/10      1/12   2/21
 3    1, x2                      1/8    9/64      1/12   1/10      1/12   2/21
 4    1, x3                      1/8    9/64      1/12   1/10      1/12   2/21
 5    1, x1, x3                  1/8    3/64      1/12   1/20      1/12   4/105
 6    1, x2, x3                  1/8    3/64      1/12   1/20      1/12   4/105
 7    1, x1, x2                  1/16   3/128     1/24   1/40      1/30   1/42
 8    1, x1, x2, x1x2            1/16   3/128     1/24   1/40      1/20   1/70
 9    1, x1, x2, x3              1/16   1/128     1/8    1/40      1/20   1/70
10    1, x1, x2, x3, x1x2        1/16   1/128     1/8    1/40      1/5    1/70

Figure 4-5. Prior probabilities for the space of well-formulated models associated with
three main effects and one interaction term, where M_B is taken to be the intercept-only
model and (a, b) ∈ {(1, 1), (1, ch)}.
HUP induces a penalization for model complexity, but it does not adequately penalize models for including additional terms. Under the HUP, models including all of the terms are given at least as much probability as any model containing any non-empty set of terms (Figures 4-4 and 4-5). This lack of penalization of the full model originates from its combinatorial simplicity (i.e., it is the only model that contains every term), and as an unfortunate consequence this model-space distribution favors the base and full models. Similar behavior is observed with the HOP with (a, b) = (1, 1). As models become more complex they are appropriately penalized for their size. However, after a sufficient number of nodes are added, the number of possible models of that particular size is considerably reduced, so combinatorial complexity is negligible for the largest models. This is best exhibited in Figure 4-5, where the HOP places more mass on the full model than on any model containing a single order-one node, highlighting an undesirable behavior of the priors with this choice of hyper-parameters.
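The conditional-independence construction behind these priors can be made concrete with a short sketch. The following is a hypothetical illustration (not the dissertation's software): it enumerates the ten well-formulated models of Figure 4-5 and recovers the HIP probabilities for (a, b) = (1, 1), under which each node whose parents are all included enters the model independently with probability a/(a + b); the DAG encoding via a `parents` dictionary is an assumption made here for illustration.

```python
from itertools import chain, combinations

# DAG for Figure 4-5: three main effects and one interaction term.
# parents maps each node to the nodes it inherits from (strong heredity).
parents = {"x1": [], "x2": [], "x3": [], "x1*x2": ["x1", "x2"]}
nodes = list(parents)

def well_formulated(model):
    """A model is well formulated if every included node's parents are included."""
    return all(set(parents[t]) <= set(model) for t in model)

# Enumerate all well-formulated models (the base model is the intercept only).
all_subsets = chain.from_iterable(combinations(nodes, k) for k in range(len(nodes) + 1))
wfms = [frozenset(s) for s in all_subsets if well_formulated(s)]

def hip_prob(model, a=1, b=1):
    """HIP: each node whose parents are all in the model is included
    independently with probability a/(a+b); other nodes are excluded."""
    p = 1.0
    for t in nodes:
        if set(parents[t]) <= model:           # node is eligible for inclusion
            p *= a / (a + b) if t in model else b / (a + b)
        elif t in model:                       # violates strong heredity
            return 0.0
    return p

probs = {m: hip_prob(m) for m in wfms}
```

Running this reproduces the HIP(1,1) column of Figure 4-5: the base model receives 1/8 and the full model 1/16, with the probabilities over the ten well-formulated models summing to one.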
In contrast, if (a, b) = (1, ch), all three priors produce strong penalization as models become more complex, both in terms of the number and the order of the nodes contained in the model. For all of the priors, adding a node α to a model M to form M′ produces p(M) ≥ p(M′). However, differences between the priors are apparent. The HIP penalizes the full model the most, with the HOP penalizing it the least and the HUP lying between them. At face value, the HOP creates the most compelling penalization of model complexity. In Figure 4-5 the penalization of the HOP is the least dramatic, producing prior odds of 20 for MB versus MF, as opposed to the HUP and HIP, which produce prior odds of 40 and 54, respectively. Similarly, the prior odds in Figure 4-4 are 60, 180, and 256 for the HOP, HUP, and HIP, respectively.

4.3.3 Posterior Sensitivity to the Choice of Prior
To determine how the proposed priors adjust the posterior probabilities to account for multiplicity, a simple simulation was performed. The goal of this exercise was to understand how the priors respond to increasing complexity. First, the priors are compared as the number of main effects p grows. Second, they are compared as the depth of the hierarchy increases, or in other words, as the order J_max increases.

The quality of a node is characterized by its marginal posterior inclusion probability, defined as p_α = Σ_{M ∈ M} I(α ∈ M) p(M | y, M) for α ∈ MF. These posteriors were obtained for the proposed priors as well as for the Equal Probability Prior (EPP) on M. For all prior structures, both the default hyper-parameters a = b = 1 and the penalizing choice of a = 1 and b = ch are considered. The results for the different combinations of MF and MT incorporated in the analysis were obtained from 100 random replications (i.e., generating at random 100 matrices of main effects and responses). The simulation proceeds as follows:

1. Randomly generate main-effects matrices X = (x1, ..., x18), with xi iid ∼ Nn(0, In), and error vectors ε ∼ Nn(0, In) for n = 60.
2. Setting all coefficient values equal to one, calculate y = Z_{MT} β + ε for the true models given by

   MT,1 = {x1, x2, x3, x1², x1x2, x2², x2x3}, with |MT,1| = 7
   MT,2 = {x1, x2, ..., x16}, with |MT,2| = 16
   MT,3 = {x1, x2, x3, x4}, with |MT,3| = 4
   MT,4 = {x1, x2, ..., x8, x1², x3x4}, with |MT,4| = 10
   MT,5 = {x1, x2, x3, x4, x1², x3x4}, with |MT,5| = 6
Table 4-1. Characterization of the full models MF and corresponding model spaces M considered in simulations.

Growing p, fixed J_max             |MF|   |M|      MT used
(x1 + x2 + x3)^2                    9      95       MT,1
(x1 + ... + x4)^2                   14     1337     MT,1
(x1 + ... + x5)^2                   20     38619    MT,1

Fixed p, growing J_max             |MF|   |M|      MT used
(x1 + x2 + x3)^2                    9      95       MT,1
(x1 + x2 + x3)^3                    19     2497     MT,1
(x1 + x2 + x3)^4                    34     161421   MT,1

Other model spaces                 |MF|   |M|      MT used
x1 + x2 + ... + x18                 18     262144   MT,2, MT,3
(x1 + ... + x4)^2 + x5 + ... + x10  20     85568    MT,4, MT,5

3. In all simulations the base model MB is the intercept-only model. The notation (x1 + ... + xp)^d is used to represent the full order-d polynomial response surface in p main effects. The model spaces, characterized by their corresponding full model MF, are presented in Table 4-1, as well as the true models used in each case.
4. Enumerate the model spaces and calculate p(M | y, M) for all M ∈ M using the EPP, HUP, HIP, and HOP, the hierarchical priors each with the two sets of hyper-parameters.

5. Count the number of true positives and false positives in each M for the different priors.
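Steps 1 and 2 above can be sketched in a few lines. This is a hypothetical illustration, not the original simulation code; it generates the main-effects matrix and a response under MT,5 = {x1, x2, x3, x4, x1², x3x4} with all coefficients equal to one.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 60, 18

# Step 1: main-effects matrix and error vector, all standard normal.
X = rng.standard_normal((n, p))
eps = rng.standard_normal(n)

# Step 2: design matrix for the true model MT,5 = {x1, x2, x3, x4, x1^2, x3*x4}
# (intercept column plus the terms of the true model), every coefficient set to 1.
Z = np.column_stack([
    np.ones(n),            # intercept (base model MB)
    X[:, 0], X[:, 1], X[:, 2], X[:, 3],
    X[:, 0] ** 2,          # x1^2
    X[:, 2] * X[:, 3],     # x3*x4
])
beta = np.ones(Z.shape[1])
y = Z @ beta + eps
```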
The true positives (TP) are defined as those nodes α ∈ MT such that p_α > 0.5. For the false positives (FP), three different cutoffs for p_α are considered, elucidating the adjustment for multiplicity induced by the model priors. These cutoffs are 0.10, 0.20, and 0.50 for α ∉ MT. The results from this exercise provide insight into the influence of the prior on the marginal posterior inclusion probabilities. In Table 4-1 the model spaces considered are described in terms of the number of models they contain and in terms of the number of nodes of MF, the full model that defines the DAG for M.
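The inclusion probabilities and the TP/FP counts at a cutoff follow directly from the definitions above. A minimal sketch, using a small hypothetical enumerated posterior (the numbers below are invented for illustration, not taken from the simulations):

```python
# Hypothetical enumerated posterior over models (sets of node labels),
# illustrating p_alpha = sum_M I(alpha in M) p(M | y, M).
posterior = {
    frozenset(): 0.05,
    frozenset({"x1"}): 0.10,
    frozenset({"x1", "x2"}): 0.55,
    frozenset({"x1", "x2", "x1*x2"}): 0.30,
}

def inclusion_probs(posterior):
    """Marginal posterior inclusion probability of every node."""
    p_alpha = {}
    for model, prob in posterior.items():
        for node in model:
            p_alpha[node] = p_alpha.get(node, 0.0) + prob
    return p_alpha

def tp_fp(p_alpha, true_model, cutoff):
    """True and false positives at a given inclusion-probability cutoff."""
    tp = sum(1 for a, p in p_alpha.items() if p > cutoff and a in true_model)
    fp = sum(1 for a, p in p_alpha.items() if p > cutoff and a not in true_model)
    return tp, fp

p_alpha = inclusion_probs(posterior)
# With true model {x1, x2}: p_x1 = 0.95, p_x2 = 0.85, p_{x1*x2} = 0.30,
# so the 0.50 cutoff yields (TP, FP) = (2, 0) and the 0.10 cutoff (2, 1).
```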
Growing number of main effects, fixed polynomial degree. This simulation investigates the posterior behavior as the number of covariates grows for a polynomial surface of degree two. The true model is assumed to be MT,1 and has 7 polynomial terms. The false positive and true positive rates are displayed in Table 4-2.
First, focus on the posterior when (a, b) = (1, 1). As p increases and the cutoff is low, the number of false positives increases for the EPP as well as for the hierarchical priors, although less dramatically for the latter. All of the priors identify all of the true positives. The false positive rate for the 0.50 cutoff is less than one for all four prior structures, with the HIP exhibiting the smallest false positive rate.

With the second choice of hyper-parameters (1, ch), the improvement of the hierarchical priors over the EPP is dramatic, and the difference in performance is more pronounced as p increases. These also considerably outperform the priors using the default hyper-parameters a = b = 1 in terms of false positives. Regarding the number of true positives, all priors discovered the 7 true predictors in MT,1 for most of the 100 random samples of data, with only minor differences observed between any of the priors considered. That being said, the means for the priors with a = 1, b = ch are slightly lower for the true positives. With a 0.50 cutoff the hierarchical priors keep tight control on the number of false positives, but in doing so discard true positives with slightly higher frequency.
Growing polynomial degree, fixed main effects. For these examples the true model is once again MT,1. When the complexity is increased by making the order of MF larger (Table 4-3), the inability of the EPP to adjust the inclusion posteriors for multiplicity becomes more pronounced: the EPP becomes less and less efficient at removing false positives when the FP cutoff is low. Among the priors with a = b = 1, as the order increases, the HIP is the best at filtering out the false positives. Using the 0.50 false positive cutoff, some false positives are included both for the EPP and for all the priors with a = b = 1, indicating that the default hyper-parameters might not be the best option to control FP. The 7 covariates in the true model all obtain a high inclusion posterior probability both with the EPP and with the a = b = 1 priors.
Table 4-2. Mean number of false and true positives in 100 randomly generated datasets as the number of main effects increases from three to five predictors in a full quadratic, under the equal probability prior (EPP), the hierarchical independence prior (HIP), the hierarchical order prior (HOP), and the hierarchical uniform prior (HUP).

                                           a = 1, b = 1         a = 1, b = ch
Cutoff      |MT|  MF                 EPP   HIP   HUP   HOP    HIP   HUP   HOP
FP(>0.10)    7    (x1+x2+x3)^2       1.78  1.78  2.00  2.00   0.11  1.31  1.06
FP(>0.20)                            0.43  0.43  2.00  1.98   0.01  0.28  0.24
FP(>0.50)                            0.04  0.04  0.97  0.36   0.00  0.03  0.02
TP(>0.50)   (MT,1)                   7.00  7.00  7.00  7.00   6.97  6.99  6.99
FP(>0.10)    7    (x1+x2+x3+x4)^2    3.62  1.94  2.33  2.45   0.10  0.63  1.07
FP(>0.20)                            1.60  0.47  2.17  2.15   0.01  0.17  0.24
FP(>0.50)                            0.25  0.06  0.35  0.36   0.00  0.02  0.02
TP(>0.50)   (MT,1)                   7.00  7.00  7.00  7.00   6.97  6.99  6.99
FP(>0.10)    7    (x1+...+x5)^2      6.00  2.16  2.60  2.55   0.12  0.43  1.15
FP(>0.20)                            2.91  0.55  2.13  2.18   0.02  0.19  0.27
FP(>0.50)                            0.66  0.11  0.25  0.37   0.00  0.03  0.01
TP(>0.50)   (MT,1)                   7.00  7.00  7.00  7.00   6.97  6.99  6.99
In contrast, any of the a = 1 and b = ch priors dramatically improve upon their a = b = 1 counterparts, consistently assigning low inclusion probabilities to the majority of the false positive terms, even for low cutoffs. As the order of the polynomial surface increases, the difference in performance between these priors and either the EPP or their default versions becomes even more clear. At the 0.50 cutoff, the hierarchical priors with complexity penalization exhibit very low false positive rates. The true positive rate decreases slightly for these priors, but not to an alarming degree.
Other model spaces. This part of the analysis considers model spaces that do not correspond to full-polynomial-degree response surfaces (Table 4-4). The first example is a model space with main effects only. The second example includes a full quadratic surface of order 2, but in addition includes six terms for which only main effects are to be modeled. Two true models are used in combination with each model space to observe how the posterior probabilities vary under the influence of the different priors for "large" and "small" true models.
Table 4-3. Mean number of false and true positives in 100 randomly generated datasets as the maximum order of MF increases from two to four in a full model with three main effects, under the equal probability prior (EPP), the hierarchical independence prior (HIP), the hierarchical order prior (HOP), and the hierarchical uniform prior (HUP).

                                        a = 1, b = 1         a = 1, b = ch
Cutoff      |MT|  MF              EPP   HIP   HUP   HOP    HIP   HUP   HOP
FP(>0.10)    7    (x1+x2+x3)^2    1.78  1.78  2.00  2.00   0.11  1.31  1.06
FP(>0.20)                         0.43  0.43  2.00  1.98   0.01  0.28  0.24
FP(>0.50)                         0.04  0.04  0.97  0.36   0.00  0.03  0.02
TP(>0.50)   (MT,1)                7.00  7.00  7.00  7.00   6.97  6.99  6.99
FP(>0.10)    7    (x1+x2+x3)^3    7.37  5.21  6.06  2.91   0.55  1.05  1.39
FP(>0.20)                         2.91  1.55  3.61  2.08   0.17  0.34  0.31
FP(>0.50)                         0.40  0.21  0.50  0.26   0.03  0.03  0.04
TP(>0.50)   (MT,1)                7.00  7.00  7.00  7.00   6.97  6.98  7.00
FP(>0.10)    7    (x1+x2+x3)^4    8.22  4.00  4.69  2.61   0.52  0.55  1.32
FP(>0.20)                         4.21  1.13  1.76  2.03   0.12  0.15  0.31
FP(>0.50)                         0.56  0.17  0.22  0.27   0.03  0.03  0.04
TP(>0.50)   (MT,1)                7.00  7.00  7.00  7.00   6.97  6.97  6.99
By construction, in model spaces with main effects only, HIP(1,1) and EPP are equivalent, as are HOP(a,b) and HUP(a,b). This accounts for the similarities observed among the results for the first two cases presented in Table 4-4, where the model space corresponds to a full model with 18 main effects and the true models are models with 16 and 4 main effects, respectively. When the number of true coefficients is large, the HUP(1,1) and HOP(1,1) do poorly at controlling false positives, even at the 0.50 cutoff. In contrast, the HIP (and thus the EPP) with the 0.50 cutoff identifies the true positives and no false positives. This result, however, does not imply that the EPP controls false positives well: the true model contains 16 out of the 18 nodes in MF, so there is little potential for false positives. The a = 1 and b = ch priors show dramatically different behavior. The HIP controls false positives well but fails to identify the true coefficients at the 0.50 cutoff. In contrast, the HOP identifies all of the true positives and has a small false positive rate at the 0.50 cutoff.
If the number of true positives is small, most terms in the full model are truly zero. The EPP includes at least one false positive in approximately 50% of the randomly sampled datasets. On the other hand, the HUP(1,1) provides some control for multiplicity, obtaining on average a lower number of false positives than the EPP. Furthermore, the proposed hierarchical priors with a = 1, b = ch are substantially better than the EPP (and the choice of a = b = 1) at controlling false positives and capturing all true positives using the marginal posterior inclusion probabilities. The two examples suggest that the HOP(1, ch) is the best default choice for model selection when the number of terms available at a given degree is large.

The third and fourth examples in Table 4-4 consider the same irregular model space, with data generated from MT,4 with ten terms and MT,5 with six terms. HIP(1,1) and EPP again behave quite similarly, incorporating a large number of false positives at the 0.10 cutoff. At the 0.50 cutoff some false positives are still included. The HUP(1,1) and HOP(1,1) behave similarly, with a slightly higher false positive rate at the 0.50 cutoff. In terms of the true positives, the EPP and the a = b = 1 priors always include all of the predictors in MT,4 and MT,5. On the other hand, the ability of the a = 1, b = ch priors to control false positives is markedly better than that of the EPP and the hierarchical priors with a = b = 1. At the 0.50 cutoff these priors identify all of the true positives and true negatives. Once again, these examples point to the hierarchical priors with additional penalization for complexity as good default priors on the model space.

4.4 Random Walks on the Model Space
When the model space M is too large to enumerate, a stochastic procedure can be used to find models with high posterior probability. In particular, an MCMC algorithm can be utilized to generate a dependent sample of models from the model posterior. The structure of the model space M both presents difficulties and provides clues on how to build algorithms to explore it. Different MCMC strategies can be adopted, two of which
Table 4-4. Mean number of false and true positives in 100 randomly generated datasets with unstructured or irregular model spaces, under the equal probability prior (EPP), the hierarchical independence prior (HIP), the hierarchical order prior (HOP), and the hierarchical uniform prior (HUP).

                                               a = 1, b = 1            a = 1, b = ch
Cutoff      |MT|  MF                    EPP    HIP    HUP    HOP     HIP   HUP    HOP
FP(>0.10)   16    x1 + x2 + ... + x18   1.93   1.93   2.00   2.00    0.03  1.80   1.80
FP(>0.20)                               0.52   0.52   2.00   2.00    0.01  0.46   0.46
FP(>0.50)                               0.07   0.07   2.00   2.00    0.01  0.04   0.04
TP(>0.50)   (MT,2)                      15.99  15.99  16.00  16.00   6.99  15.99  15.99
FP(>0.10)    4    x1 + x2 + ... + x18   13.95  13.95  9.15   9.15    0.26  1.31   1.31
FP(>0.20)                               5.45   5.45   3.03   3.03    0.05  0.45   0.45
FP(>0.50)                               0.84   0.84   0.45   0.45    0.02  0.06   0.06
TP(>0.50)   (MT,3)                      4.00   4.00   4.00   4.00    4.00  4.00   4.00
FP(>0.10)   10    (x1+...+x4)^2         9.73   9.71   10.00  5.60    0.34  2.33   2.20
FP(>0.20)         + x5 + ... + x10      2.65   2.65   8.73   3.05    0.12  0.74   0.69
FP(>0.50)                               0.35   0.35   1.36   1.68    0.02  0.11   0.12
TP(>0.50)   (MT,4)                      10.00  10.00  10.00  9.99    9.94  9.98   9.99
FP(>0.10)    6    (x1+...+x4)^2         13.52  13.52  11.06  9.94    0.44  1.63   1.96
FP(>0.20)         + x5 + ... + x10      4.22   4.21   3.60   5.01    0.15  0.48   0.68
FP(>0.50)                               0.53   0.53   0.57   0.75    0.01  0.08   0.11
TP(>0.50)   (MT,5)                      6.00   6.00   6.00   6.00    5.99  5.99   5.99
are outlined in this section. Combining the different strategies allows the model selection algorithm to explore the model space thoroughly and relatively fast.

4.4.1 Simple Pruning and Growing
This first strategy relies on small, localized jumps around the model space, turning on or off a single node at each step. The idea behind this algorithm is to grow the model by activating one node in the children set, or to prune the model by removing one node in the extreme set. At a given step in the algorithm, assume that the current state of the chain is model M. Let pG be the probability that the algorithm chooses the growth step. The proposed model M′ can either be M+ = M ∪ {α} for some α ∈ C(M), or M− = M \ {α} for some α ∈ E(M).
An example transition kernel is defined by the mixture

g(M′ | M) = pG · qGrow(M′ | M) + (1 − pG) · qPrune(M′ | M)

          = [I{M ≠ MF} / (1 + I{M ≠ MB})] · [I{α ∈ C(M)} / |C(M)|]
            + [I{M ≠ MB} / (1 + I{M ≠ MF})] · [I{α ∈ E(M)} / |E(M)|],        (4–11)
where pG has explicitly been defined as 0.5 when both C(M) and E(M) are non-empty, and as 0 (or 1) when C(M) = ∅ (or E(M) = ∅). After choosing pruning or growing, a single node is proposed for addition to or deletion from M uniformly at random.
For this simple algorithm, pruning is the reverse kernel of growing and vice versa. From this construction more elaborate algorithms can be specified. First, instead of choosing the node uniformly at random from the corresponding set, nodes can be selected using the relative posterior probability of adding or removing the node. Second, more than one node can be selected at any step, for instance by also sampling at random the number of nodes to add or remove given the size of the set. Third, the strategy could combine pruning and growing in a single step by sampling one node α ∈ C(M) ∪ E(M) and adding or removing it accordingly. Fourth, the sets of nodes from C(M) ∪ E(M) that yield well-formulated models can be added or removed. This simple algorithm produces small moves around the model space by focusing node addition or removal only on the set C(M) ∪ E(M).

4.4.2 Degree-Based Pruning and Growing
In exploring the model space it is possible to take advantage of the hierarchical structure defined between nodes of different order. One can update the vector of inclusion indicators by blocks, denoted j(M). Two flavors of this algorithm are proposed: one that separates the pruning and growing steps, and one where both are done simultaneously.

Assume that at a given step, say t, the algorithm is at M. If growing, the strategy proceeds successively by order class, going from j = Jmin up to j = Jmax, with Jmin and Jmax being the lowest and highest orders of nodes in MF \ MB, respectively. Define Mt(Jmin−1) = M and set j = Jmin. The growth kernel comprises the following steps, proceeding from j = Jmin to j = Jmax:
1) Propose a model M′ by selecting a set of nodes from Cj(Mt(j−1)) through the kernel qGrow,j(· | Mt(j−1)).

2) Compute the Metropolis-Hastings correction for M′ versus Mt(j−1). If M′ is accepted, then set Mt(j) = M′; otherwise set Mt(j) = Mt(j−1).

3) If j < Jmax, then set j = j + 1 and return to 1); otherwise proceed to 4).

4) Set Mt = Mt(Jmax).
The pruning step is defined in a similar fashion; however, it starts at order j = Jmax and proceeds down to j = Jmin. Let Ej(M′) = E(M′) ∩ j(MF) be the set of nodes of order j that can be removed from the model to produce a WFM. Define Mt(Jmax+1) = M and set j = Jmax. The pruning kernel comprises the following steps:

1) Propose a model M′ by selecting a set of nodes from Ej(Mt(j+1)) through the kernel qPrune,j(· | Mt(j+1)).

2) Compute the Metropolis-Hastings correction for M′ versus Mt(j+1). If M′ is accepted, then set Mt(j) = M′; otherwise set Mt(j) = Mt(j+1).

3) If j > Jmin, then set j = j − 1 and return to Step 1); otherwise proceed to Step 4).

4) Set Mt = Mt(Jmin).
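The growth sweep above can be sketched as follows. This is a simplified, hypothetical illustration (not the dissertation's implementation): the DAG, the node orders, and the toy log-posterior are assumptions, the proposal adds a single uniformly chosen node per order, and the acceptance step uses a symmetric-proposal Metropolis ratio in place of the full Metropolis-Hastings correction.

```python
import math
import random

# Hypothetical DAG with node orders; parents encode immediate inheritance.
PARENTS = {"x1": [], "x2": [], "x1^2": ["x1"], "x2^2": ["x2"],
           "x1*x2": ["x1", "x2"], "x1^2*x2": ["x1^2", "x1*x2"]}
ORDER = {"x1": 1, "x2": 1, "x1^2": 2, "x2^2": 2, "x1*x2": 2, "x1^2*x2": 3}

def C_j(M, j):
    """Order-j children set: order-j nodes addable under strong heredity."""
    return [t for t in PARENTS
            if t not in M and ORDER[t] == j and set(PARENTS[t]) <= M]

def grow_sweep(M, log_post, rng=random):
    """One growth sweep: visit orders j = Jmin, ..., Jmax; at each order,
    propose adding one uniformly chosen node from C_j and apply a
    simplified (symmetric-proposal) Metropolis acceptance step."""
    M = set(M)
    for j in sorted(set(ORDER.values())):
        cand = C_j(M, j)
        if not cand:
            continue
        M_prime = M | {rng.choice(cand)}
        if rng.random() < math.exp(min(0.0, log_post(M_prime) - log_post(M))):
            M = M_prime
    return frozenset(M)

# Toy log-posterior that rewards size, so every proposal is accepted:
# starting from the base model, one node is added at order 1 and one at order 2.
result = grow_sweep(frozenset(), lambda M: float(len(M)))
```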
It is clear that the growing and pruning steps are reverse kernels of each other. Pruning and growing can be combined for each j. The forward kernel proceeds from j = Jmin to j = Jmax and proposes adding or removing sets of nodes from Cj(M) ∪ Ej(M). The reverse kernel simply reverses the direction of j, proceeding from j = Jmax to j = Jmin.

4.5 Simulation Study
To study the operating characteristics of the proposed priors, a simulation experiment was designed with three goals. First, the priors are characterized by how the posterior distributions are affected by the sample size and the signal-to-noise ratio (SNR). Second, given the SNR level, the influence of the allocation of the signal across the terms in the model is investigated. Third, performance is assessed when the true model has special points in the scale (McCullagh & Nelder, 1989), i.e., when the true model has coefficients equal to zero for some lower-order terms in the polynomial hierarchy.
With these goals in mind, sets of predictors and responses are generated under various experimental conditions. The model space is defined with MB being the intercept-only model and MF being the complete order-four polynomial surface in five main effects, which has 126 nodes. The entries of the matrix of main effects are generated as independent standard normal. The response vectors are drawn from the n-variate normal distribution as y ∼ Nn(ZMT(X)β, In), where MT is the true model and In is the n × n identity matrix.
The sample sizes considered are n ∈ {130, 260, 1040}, which ensures that ZMF(X) is of full rank. The cardinality of this model space is |M| > 1.2 × 10^22, which makes enumeration of all models unfeasible. Because the value of the 2k-th moment of the standard normal distribution increases with k = 1, 2, ..., higher-order terms by construction have a larger variance than their ancestors. As such, assuming equal values for all coefficients, higher-order terms necessarily contain more "signal" than the lower-order terms from which they inherit (e.g., x1² has more signal than x1). Once a higher-order term is selected, its entire ancestry is also included. Therefore, to prevent the simulation results from being overly optimistic (because of the larger signals from the higher-order terms), sphering is used to calculate meaningful values of the coefficients, ensuring that the signal is of the magnitude intended in any given direction. Given the results of the simulations from Section 4.3.3, only the HOP with a = 1, b = ch is considered, with the EPP included for comparison.
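One way the sphering idea can be realized is via a Cholesky factor of the empirical second-moment matrix of the design. The sketch below is an assumption about the mechanics, not the dissertation's exact procedure: with Σ = Z'Z/n = LL', setting β = (L')⁻¹γ gives β'Σβ = γ'γ, so the signal variance equals the intended ‖γ‖² regardless of the larger raw scale of the higher-order columns.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 260

# Hypothetical small design: x1, x1^2, x1^3 have increasing raw variance.
x1 = rng.standard_normal(n)
Z = np.column_stack([x1, x1 ** 2, x1 ** 3])
Zc = Z - Z.mean(axis=0)                  # center the columns

# Sphere: beta = solve(L', gamma) makes beta' Sigma beta = gamma' gamma.
Sigma = Zc.T @ Zc / n
L = np.linalg.cholesky(Sigma)
gamma = np.ones(Z.shape[1])              # intended signal in each direction
beta = np.linalg.solve(L.T, gamma)

signal_var = beta @ Sigma @ beta         # equals gamma @ gamma = 3
```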
The total number of combinations of SNR, sample size, regression coefficient values, and nodes in MT amounts to 108 different scenarios. Each scenario was run with 100 independently generated datasets, and the mean behavior of the samples was observed. The results presented in this section correspond to the median probability model (MPM) from each of the 108 simulation scenarios considered. Figure 4-7 shows the comparison between the two priors for the mean number of true positive (TP) and false positive (FP) terms. Although some of the scenarios consider true models that are not well-formulated, the smallest well-formulated model that stems from MT is always the one shown in Figure 4-6.
Figure 4-6. DAG of the largest true model MT used in simulations.
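The median probability model referenced above keeps exactly the terms whose marginal posterior inclusion probability exceeds one half. A minimal sketch, with hypothetical inclusion probabilities used purely for illustration:

```python
def median_probability_model(p_alpha):
    """Median probability model: all terms with marginal posterior
    inclusion probability greater than 1/2."""
    return {alpha for alpha, p in p_alpha.items() if p > 0.5}

# Hypothetical inclusion probabilities for illustration.
mpm = median_probability_model({"x1": 0.95, "x2": 0.55, "x1*x2": 0.30})
```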
The results are summarized in Figure 4-7. Each point on the horizontal axis corresponds to the average for a given set of simulation conditions. For clarity, only labels for the SNR and sample size are included, but the results are also shown for the different values of the regression coefficients and the different true models considered. Additional details about the procedure and other results are included in the appendices.

4.5.1 SNR and Sample Size Effect
As expected, small sample sizes conditioned upon a small SNR impair the ability of the algorithm to detect true coefficients with both the EPP and HOP(1, ch), with this effect being greater when using the latter prior. However, considering the mean number of TPs jointly with the number of FPs, it is clear that although the number of TPs is especially low with HOP(1, ch), most of the few predictors that are discovered in fact belong to the true model. In comparison to the results with the EPP in terms of FPs, the HOP(1, ch) does better, and even more so when both the sample size and the SNR are
Figure 4-7. Average true positives (TP) and average false positives (FP) in all simulated scenarios for the median probability model with EPP and HOP(1, ch).
smallest. Finally, when either the SNR or the sample size is large, the performance in terms of TPs is similar between both priors, but the number of FPs is somewhat lower with the HOP.

4.5.2 Coefficient Magnitude
Three ways to allocate the amount of signal across predictors are considered. For the first choice, all coefficients contain the same amount of signal, regardless of their order. In the second, each order-one coefficient contains twice as much signal as any order-two coefficient and four times as much as any order-three coefficient. Finally, each order-one coefficient contains half as much signal as any order-two coefficient and a quarter of what any order-three coefficient has. These choices are denoted by β(1) = c(1o1, 1o2, 1o3), β(2) = c(1o1, 0.5o2, 0.25o3), and β(3) = c(0.25o1, 0.5o2, 1o3), respectively. In Figure 4-7, the first 4 scenarios correspond to simulations with β(1), the next four use β(2), the next four correspond to β(3), and then the values are cycled in
the same way. The results show that scenarios using either β(1) or β(3) behave similarly, contrasting with the negative impact of having the highest signal in the order-one terms through β(2). In Figure 4-7 the effect of using β(2) is evident, as it corresponds to the lowest values for the TPs, regardless of the sample size, the SNR, or the prior used. This is an intuitive result, since giving more signal to higher-order terms makes it easier to detect higher-order terms, and consequently, by strong heredity, the algorithm will also select the corresponding lower-order terms included in the true model.

4.5.3 Special Points on the Scale
Four true models were considered: (1) the model from Figure 4-6 (MT,1); (2) the model without the order-one terms (MT,2); (3) the model without order-two terms (MT,3); and (4) the model without x1² and x2x5 (MT,4). The last three are clearly not well-formulated. In Figure 4-7, the leftmost point on the horizontal axis corresponds to scenarios with MT,1; the next point is for scenarios with MT,2, followed by those with MT,3, then with MT,4, then MT,1, etc. In comparison to the EPP, the HOP(1, ch) tightly controls the inclusion of FPs by choosing smaller models, at the expense of also reducing the TP count, especially when there is more uncertainty about the true model (i.e., SNR = 0.25). For both prior structures, the results indicate that at low SNR levels the presence of special points has no apparent impact, as the selection behavior is similar among the four models in terms of both the TP and FP. An interesting observation is that the effect of having special points on the scale is vastly magnified whenever the coefficients that assign more weight to order-one terms (β(2)) are used.

4.6 Case Study: Ozone Data Analysis
This section uses the ozone data from Breiman & Friedman (1985) and follows the analysis performed by Liang et al. (2008), who investigated hyper-g priors. After removing observations with missing values, 330 observations remain, including daily measurements of maximum ozone concentration near Los Angeles and eight meteorological variables (Table 4-5). From the 330 observations, 165 were sampled at random without replacement and used to run the variable selection procedure; the remaining 165 were used for validation. The eight meteorological variables, their interactions, and their squared terms are used as predictors, resulting in a full model with 44 predictors. The model space assumes that the base model MB is the intercept-only model and that MF is the quadratic surface in the eight meteorological variables. The
model space contains approximately 71 billion models, and computation of all model posterior probabilities is not feasible.
Table 4-5. Variables used in the analyses of the ozone contamination dataset.

Name    Description
ozone   Daily maximum 1-hour-average ozone (ppm) at Upland, CA
vh      500 millibar pressure height (m) at Vandenberg AFB
wind    Wind speed (mph) at LAX
hum     Humidity (%) at LAX
temp    Temperature (°F) measured at Sandburg, CA
ibh     Inversion base height (ft) at LAX
dpg     Pressure gradient (mm Hg) from LAX to Daggett, CA
vis     Visibility (miles) measured at LAX
ibt     Inversion base temperature (°F) at LAX
The HOP, HUP, and HIP with a = 1 and b = ch, as well as the EPP, are considered for comparison purposes. To obtain the Bayes factors in Equation 3–3, four different mixtures of g-priors are utilized: intrinsic priors (IP) (which yield the expression in Equation 3–2), hyper-g (HG) priors (Liang et al., 2008) with hyper-parameters α = 2, β = 1 and α = β = 1, and Zellner-Siow (ZS) priors (Zellner & Siow, 1980). The results were extracted for the median probability models (MPM). Additionally, the model is estimated using the R package hierNet (Bien et al., 2013) to compare model selection results to those obtained using the hierarchical lasso (Bien et al., 2013), restricted to well-formulated models by imposing the strong heredity constraint. The procedures were assessed on the basis of their predictive accuracy on the validation dataset.
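The predictive assessment itself is a simple train/validate computation. The sketch below is a generic stand-in (synthetic data, ordinary least squares on a selected set of terms), not the dissertation's code; the 165/165 split mirrors the text.

```python
import numpy as np

def validation_rmse(Z_train, y_train, Z_valid, y_valid):
    """Fit OLS on the training half and report RMSE on the held-out half."""
    coef, *_ = np.linalg.lstsq(Z_train, y_train, rcond=None)
    resid = y_valid - Z_valid @ coef
    return float(np.sqrt(np.mean(resid ** 2)))

# Hypothetical illustration with synthetic data split 165/165, as in the text.
rng = np.random.default_rng(2)
Z = np.column_stack([np.ones(330), rng.standard_normal((330, 3))])
y = Z @ np.array([1.0, 2.0, 0.0, -1.0]) + rng.standard_normal(330)
idx = rng.permutation(330)
train, valid = idx[:165], idx[165:]
rmse = validation_rmse(Z[train], y[train], Z[valid], y[valid])
```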
Among all models, the one that yields the smallest RMSE is the median probability model obtained using the HOP and EPP with the ZS prior, and also using the HOP with both HG priors (Table 4-6). The HOP model with the intrinsic prior has all the terms contained in the lowest-RMSE model with the exception of dpg², which has a relatively high marginal inclusion probability of 0.46. This disparity between the IP and the other mixtures of g-priors is explained by the fact that the IP induces less posterior shrinkage than the ZS and HG priors. The MPMs obtained through the HUP and HIP are nested in the best model, suggesting that these model-space priors penalize complexity too much and result in false negatives. Consideration of these MPMs suggests that the HOP is best at producing true positives while controlling for false positives.
Finally, the model obtained from the hierarchical lasso (HierNet) is the largest model and produces the second-largest RMSE. All of the terms contained in any of the other models, except for vh, are nested within the hierarchical lasso model, and most of the terms that are exclusive to this model receive extremely low marginal inclusion probabilities under any of the model priors and parameter priors considered under Bayesian model selection.
Table 4-6 Median probability models (MPM) from different combinations of parameterand model priors vs model selected using the hierarchical lasso
BF Prior Model R2 RMSEIP EPP hum dpg ibt hum2 hum lowast dpg 08054 42739
hum lowast ibt dpg2 ibt2IP HIP hum ibt hum2 hum lowast ibt ibt2 07740 43396IP HOP hum dpg ibt hum2 hum lowast ibt ibt2 07848 43175IP HUP hum dpg ibt hum lowast ibt ibt2 07767 43508ZS EPP hum dpg ibt hum2 hum lowast ibt dpg2 ibt2 07896 42518ZS HIP hum ibt hum lowast ibt ibt2 07525 43505ZS HOP hum dpg ibt hum2 hum lowast ibt dpg2 ibt2 07896 42518ZS HUP hum dpg ibt hum lowast ibt ibt2 07767 43508HG11 EPP vh hum dpg ibt hum2 hum lowast ibt dpg2 07701 43049HG11 HIP hum ibt hum lowast ibt ibt2 07525 43505HG11 HOP hum dpg ibt hum2 hum lowast ibt dpg2 ibt2 07896 42518HG11 HUP hum dpg ibt hum lowast ibt ibt2 07767 43508HG21 EPP hum dpg ibt hum2 hum lowast ibt dpg2 07701 43037HG21 HIP hum dpg ibt hum lowast ibt ibt2 07767 43508HG21 HOP hum dpg ibt hum2 hum lowast ibt dpg2 ibt2 07896 42518HG21 HUP hum dpg ibt hum lowast ibt 07526 44036
HierNet hum temp ibh dpg ibt vis hum2 hum lowast ibt 07651 43680temp2 temp lowast ibt dpg2
4.7 Discussion

Scott & Berger (2010) noted that the Ockham's-razor effect found automatically in Bayesian variable selection through the Bayes factor does not correct for multiple testing. The Bayes factor penalizes the complexity of the alternative model according to the number of parameters in excess of those of the null model. Therefore, the Bayes factor controls complexity only in a pairwise fashion. If the model selection procedure uses equal prior probabilities for all M ∈ M, then these comparisons ignore the effect of the multiplicity of the testing problem. This is where the role of the prior on the model space becomes important: the multiplicity penalty is "hidden away" in the model prior probabilities π(M | M).

In addition to the multiplicity of the testing problem, disregarding the hierarchical polynomial structure in the predictors has the potential to lead to different model selection results according to how the predictors are set up (e.g., in what units the predictors are expressed).

In this chapter we investigated a solution to these two issues. We define prior structures for well-formulated models and develop random walk algorithms to traverse this type of model space. The key to understanding prior distributions on the space of WFMs is the hierarchical nature of the model space itself. The prior distributions described take advantage of that hierarchy in two ways. First, conditional independence and immediate inheritance are used to develop the HOP, HIP, and HUP structures discussed in Section 4.3. Second, the conditional nature of the priors allows for the direct incorporation of complexity penalizations. Of the priors proposed, the HOP with the hyper-parameter choice (1, ch) provides the best control of false positives while maintaining a reasonable true positive rate. Thus, this prior is recommended as the default prior on the space of WFMs.
In the near future, the software developed to carry out a Metropolis-Hastings random walk on the space of WFMs will be integrated into the R package varSelectIP. These new functions implement various local priors for the regression coefficients, including the intrinsic prior, the Zellner-Siow prior, and hyper-g priors. In addition, the software supports the computation of credible sets for each regression coefficient, conditioned on the selected model as well as under model averaging.
CHAPTER 5
CONCLUSIONS
Ecologists are now embracing the use of Bayesian methods to investigate the interactions that dictate the distribution and abundance of organisms. These tools are both powerful and flexible: they allow integrating, under a single methodology, empirical observations and theoretical process models, and they can seamlessly account for several sources of uncertainty and dependence. The estimation and testing methods proposed throughout this document will contribute to the understanding of Bayesian methods used in ecology and, hopefully, will shed light on the differences between Bayesian estimation and testing tools.

All of our contributions exploit the potential of the latent variable formulation. This approach greatly simplifies the analysis of complex models: it redirects the bulk of the inferential burden away from the original response variables and places it on the easy-to-work-with latent scale, for which several time-tested approaches are available. Our methods are distinctly classified into estimation and testing tools.
For estimation, we proposed a Bayesian specification of the single-season occupancy model for which a Gibbs sampler is available using both logit and probit link functions. This setup allows detection and occupancy probabilities to depend on linear combinations of predictors. We then developed a dynamic version of this approach, incorporating the notion that occupancy at a previously occupied site depends both on the survival of current settlers and on habitat suitability. Additionally, because these dynamics also vary in space, we suggested a strategy to add spatial dependence among neighboring sites.
Ecological inquiry usually involves competing explanations, and uncertainty surrounds the decision of choosing any one of them. Hence, a model, or a set of probable models, should be selected from all the viable alternatives. To address this testing problem, we proposed an objective and fully automatic Bayesian methodology for the single-season site-occupancy model. Our approach relies on the intrinsic prior, which prevents introducing (commonly unavailable) subjective information into the model. In simulation experiments, we observed that the methods accurately single out the predictors present in the true model using the marginal posterior inclusion probabilities of the predictors: for predictors in the true model, these probabilities were comparatively larger than those for predictors not present in the true model. The simulations also indicated that the method provides better discrimination for predictors in the detection component of the model.
In our simulations and in the analysis of the Blue Hawker data, we observed that the effect of using the multiplicity correction prior was substantial. This occurs because the Bayes factor only penalizes the complexity of the alternative model according to its number of parameters in excess of those of the null model. As the number of predictors grows, the number of models in the model space also grows, increasing the chances of making false positive decisions on the inclusion of predictors. This is where the role of the prior on the model space becomes important: the multiplicity penalty is "hidden away" in the model prior probabilities π(M|M). In addition to the multiplicity of the testing problem, disregarding the hierarchical polynomial structure of the predictors in model selection procedures can lead to different results depending on how the predictors are coded (e.g., in what units these predictors are expressed).
To confront this situation, we proposed three prior structures for well-formulated models that take advantage of the hierarchical structure of the predictors. Of the priors proposed, we recommend the HOP with the hyperparameter choice (1, ch), which provides the best control of false positives while maintaining a reasonable true positive rate.
Overall, considering the flexibility of the latent approach, several other extensions of these methods follow. Currently, we envision three future developments: (1) occupancy models that incorporate various sources of information; (2) multi-species models that make use of spatial and interspecific dependence; and (3) methods to conduct model selection for the dynamic and spatially explicit version of the model.
APPENDIX A
FULL CONDITIONAL DENSITIES: DYMOSS
In this section we introduce the full conditional probability density functions for all the parameters involved in the DYMOSS model, using probit as well as logit links.
Sampler Z

The full conditionals corresponding to the presence indicators have the same form regardless of the link used. These are derived separately for the cases t = 1, 1 < t < T, and t = T, since their corresponding probabilities take on slightly different forms. Let φ(ν|μ, σ²) represent the density of a normal random variable ν with mean μ and variance σ², and recall that ψ_i1 = F(x′_(o)i α) and p_ijt = F(q′_ijt λ_t), where F(·) is the inverse link function. The full conditional for z_it is given by:
1. For t = 1,
$$\pi(z_{i1}\mid v_{i1},\alpha,\lambda_1,\beta^c_1,\delta^s_1)=(\psi^*_{i1})^{z_{i1}}(1-\psi^*_{i1})^{1-z_{i1}}=\mathrm{Bernoulli}(\psi^*_{i1}),\qquad(A\text{-}1)$$
where
$$\psi^*_{i1}=\frac{\psi_{i1}\,\phi\!\left(v_{i1}\mid \mathbf{x}'_{i1}\beta^c_1+\delta^s_1,\,1\right)\prod_{j=1}^{J_{i1}}(1-p_{ij1})}{\psi_{i1}\,\phi\!\left(v_{i1}\mid \mathbf{x}'_{i1}\beta^c_1+\delta^s_1,\,1\right)\prod_{j=1}^{J_{i1}}(1-p_{ij1})+(1-\psi_{i1})\,\phi\!\left(v_{i1}\mid \mathbf{x}'_{i1}\beta^c_1,\,1\right)\prod_{j=1}^{J_{i1}}\mathbb{I}_{\{y_{ij1}=0\}}}.$$
2. For 1 < t < T,
$$\pi(z_{it}\mid z_{i(t-1)},z_{i(t+1)},\lambda_t,\beta^c_{t-1},\delta^s_{t-1})=(\psi^*_{it})^{z_{it}}(1-\psi^*_{it})^{1-z_{it}}=\mathrm{Bernoulli}(\psi^*_{it}),\qquad(A\text{-}2)$$
where
$$\psi^*_{it}=\frac{\kappa_{it}\prod_{j=1}^{J_{it}}(1-p_{ijt})}{\kappa_{it}\prod_{j=1}^{J_{it}}(1-p_{ijt})+\nabla_{it}\prod_{j=1}^{J_{it}}\mathbb{I}_{\{y_{ijt}=0\}}},$$
with
(a) $\kappa_{it}=F\!\left(\mathbf{x}'_{i(t-1)}\beta^c_{t-1}+z_{i(t-1)}\delta^s_{t-1}\right)\phi\!\left(v_{it}\mid\mathbf{x}'_{it}\beta^c_t+\delta^s_t,\,1\right)$ and
(b) $\nabla_{it}=\left(1-F\!\left(\mathbf{x}'_{i(t-1)}\beta^c_{t-1}+z_{i(t-1)}\delta^s_{t-1}\right)\right)\phi\!\left(v_{it}\mid\mathbf{x}'_{it}\beta^c_t,\,1\right).$
3. For t = T,
$$\pi(z_{iT}\mid z_{i(T-1)},\lambda_T,\beta^c_{T-1},\delta^s_{T-1})=(\psi^\star_{iT})^{z_{iT}}(1-\psi^\star_{iT})^{1-z_{iT}}=\mathrm{Bernoulli}(\psi^\star_{iT}),\qquad(A\text{-}3)$$
where
$$\psi^\star_{iT}=\frac{\kappa^\star_{iT}\prod_{j=1}^{J_{iT}}(1-p_{ijT})}{\kappa^\star_{iT}\prod_{j=1}^{J_{iT}}(1-p_{ijT})+\nabla^\star_{iT}\prod_{j=1}^{J_{iT}}\mathbb{I}_{\{y_{ijT}=0\}}},$$
with
(a) $\kappa^\star_{iT}=F\!\left(\mathbf{x}'_{i(T-1)}\beta^c_{T-1}+z_{i(T-1)}\delta^s_{T-1}\right)$ and
(b) $\nabla^\star_{iT}=1-F\!\left(\mathbf{x}'_{i(T-1)}\beta^c_{T-1}+z_{i(T-1)}\delta^s_{T-1}\right).$
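As a concrete check of these Bernoulli updates, the t = T case can be sketched numerically. The values below are illustrative only, and F is taken to be the probit inverse link:

```python
import math

def F(x):
    # Probit inverse link: standard normal CDF
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def psi_star_T(eta_prev, p_det, y):
    # kappa*_iT = F(x'beta^c + z delta^s); nabla*_iT = 1 - kappa*_iT
    kappa = F(eta_prev)
    nabla = 1.0 - kappa
    no_det = 1.0                     # prod_j (1 - p_ijT)
    for p in p_det:
        no_det *= (1.0 - p)
    all_zero = 1.0 if all(v == 0 for v in y) else 0.0   # prod_j I(y_ijT = 0)
    return kappa * no_det / (kappa * no_det + nabla * all_zero)

# A detection in any survey forces z_iT = 1:
print(psi_star_T(0.2, [0.4, 0.5], [0, 1]))   # 1.0
# With no detections the site may still be occupied, with positive probability:
print(psi_star_T(0.2, [0.4, 0.5], [0, 0]))
```

The indicator term reproduces the intuitive behavior of the full conditional: a single detection makes the occupancy indicator deterministic.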
Sampler u_i
$$\pi(u_i\mid z_{i1},\alpha)=\mathrm{tr}\,N\!\left(\mathbf{x}'_{(o)i}\alpha,\,1,\,\mathrm{trunc}(z_{i1})\right),\quad\text{where}\quad \mathrm{trunc}(z_{i1})=\begin{cases}(-\infty,0] & z_{i1}=0\\ (0,\infty) & z_{i1}=1,\end{cases}\qquad(A\text{-}4)$$
and tr N(μ, σ², A) denotes the pdf of a truncated normal random variable with mean μ, variance σ² and truncation region A.
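Equation (A-4) can be sampled by inversion. The sketch below is dependency-free and purely illustrative (a production implementation would use a library truncated-normal sampler); the mean and indicator values are hypothetical:

```python
import math, random

def norm_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def norm_ppf(p):
    # Bisection inverse of the standard normal CDF -- slow but fine for a sketch
    lo, hi = -12.0, 12.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if norm_cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def draw_u(mu, z, rng=random):
    # u_i | z_i1, alpha ~ N(mu, 1) truncated to (-inf, 0] if z=0, (0, inf) if z=1
    a = norm_cdf(-mu)                                   # P(N(mu, 1) <= 0)
    w = rng.uniform(0.0, a) if z == 0 else rng.uniform(a, 1.0)
    return mu + norm_ppf(w)

random.seed(7)
print(draw_u(0.3, 1) > 0, draw_u(0.3, 0) <= 0)   # True True
```

Inversion guarantees the draw lands in the correct truncation region by construction, which is what the occupancy indicator requires.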
Sampler α

$$\pi(\alpha\mid \mathbf{u})\propto[\alpha]\prod_{i=1}^N\phi\!\left(u_i;\,\mathbf{x}'_{(o)i}\alpha,\,1\right).\qquad(A\text{-}5)$$
If $[\alpha]\propto 1$, then
$$\alpha\mid\mathbf{u}\sim N\!\left(\mathbf{m}(\alpha),\,\Sigma_\alpha\right),$$
with $\mathbf{m}(\alpha)=\Sigma_\alpha X'_{(o)}\mathbf{u}$ and $\Sigma_\alpha=(X'_{(o)}X_{(o)})^{-1}$.
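In the single-predictor case, the conjugate update (A-5) reduces to a scalar normal draw. A hedged sketch with hypothetical design column x and latent utilities u:

```python
import math, random

def draw_alpha(x, u, rng=random):
    # With a flat prior [alpha] ∝ 1: alpha | u ~ N(m, s2),
    # where s2 = (x'x)^(-1) and m = s2 * x'u (scalar case of A-5)
    xtx = sum(v * v for v in x)
    s2 = 1.0 / xtx
    m = s2 * sum(v * w for v, w in zip(x, u))
    # Box-Muller draw from N(0, 1)
    z = math.sqrt(-2.0 * math.log(1.0 - rng.random())) * math.cos(2.0 * math.pi * rng.random())
    return m + math.sqrt(s2) * z

random.seed(3)
x = [1.0, 1.0, 1.0, 1.0]        # hypothetical intercept-only design
u = [0.2, 0.4, 0.1, 0.3]        # hypothetical latent utilities from the u_i sampler
draws = [draw_alpha(x, u) for _ in range(20000)]
print(sum(draws) / len(draws))  # close to the posterior mean m = 0.25
```

The same pattern, with matrix quantities, gives the draws for (β^c, δ^s) in (A-7) and for λ_t in (A-9).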
Sampler v_it

(For t > 1)
$$\pi(v_{i(t-1)}\mid z_{i(t-1)},z_{it},\beta^c_{t-1},\delta^s_{t-1})=\mathrm{tr}\,N\!\left(\mu^{(v)}_{i(t-1)},\,1,\,\mathrm{trunc}(z_{it})\right),\qquad(A\text{-}6)$$
where $\mu^{(v)}_{i(t-1)}=\mathbf{x}'_{i(t-1)}\beta^c_{t-1}+z_{i(t-1)}\delta^s_{t-1}$ and $\mathrm{trunc}(z_{it})$ denotes the corresponding truncation region, determined by $z_{it}$.
Sampler (β^c_{t-1}, δ^s_{t-1})

(For t > 1)
$$\pi(\beta^c_{t-1},\delta^s_{t-1}\mid\mathbf{v}_{t-1},\mathbf{z}_{t-1})\propto[\beta^c_{t-1},\delta^s_{t-1}]\prod_{i=1}^N\phi\!\left(v_{i(t-1)};\,\mathbf{x}'_{i(t-1)}\beta^c_{t-1}+z_{i(t-1)}\delta^s_{t-1},\,1\right).\qquad(A\text{-}7)$$
If $[\beta^c_{t-1},\delta^s_{t-1}]\propto1$, then
$$\beta^c_{t-1},\delta^s_{t-1}\mid\mathbf{v}_{t-1},\mathbf{z}_{t-1}\sim N\!\left(\mathbf{m}(\beta^c_{t-1},\delta^s_{t-1}),\,\Sigma_{t-1}\right),$$
with $\mathbf{m}(\beta^c_{t-1},\delta^s_{t-1})=\Sigma_{t-1}\tilde{X}'_{t-1}\mathbf{v}_{t-1}$ and $\Sigma_{t-1}=(\tilde{X}'_{t-1}\tilde{X}_{t-1})^{-1}$, where $\tilde{X}_{t-1}=(X_{t-1},\,\mathbf{z}_{t-1})$.
Sampler w_ijt

(For t > 1 and z_it = 1)
$$\pi(w_{ijt}\mid z_{it}=1,\,y_{ijt},\,\lambda)=\mathrm{tr}\,N\!\left(\mathbf{q}'_{ijt}\lambda_t,\,1,\,\mathrm{trunc}(y_{ijt})\right).\qquad(A\text{-}8)$$
Sampler λ_t

(For t = 1, 2, …, T)
$$\pi(\lambda_t\mid\mathbf{z}_t,\mathbf{w}_t)\propto[\lambda_t]\prod_{i:\,z_{it}=1}\ \prod_{j=1}^{J_{it}}\phi\!\left(w_{ijt};\,\mathbf{q}'_{ijt}\lambda_t,\,1\right).\qquad(A\text{-}9)$$
If $[\lambda_t]\propto1$, then
$$\lambda_t\mid\mathbf{w}_t,\mathbf{z}_t\sim N\!\left(\mathbf{m}(\lambda_t),\,\Sigma_{\lambda_t}\right),$$
with $\mathbf{m}(\lambda_t)=\Sigma_{\lambda_t}Q'_t\mathbf{w}_t$ and $\Sigma_{\lambda_t}=(Q'_tQ_t)^{-1}$, where $Q_t$ and $\mathbf{w}_t$ are, respectively, the design matrix and the vector of latent variables for surveys of sites such that $z_{it}=1$.
APPENDIX B
RANDOM WALK ALGORITHMS
Global Jump. From the current state M, the global jump is performed by drawing a model M′ at random from the model space. This is achieved by beginning at the base model and increasing the order from J^min_M to J^max_M, the minimum and maximum orders of nodes in M_F \ M_B; at each order, a set of nodes is selected at random from the prior, conditioned on the nodes already in the model. The MH correction is
$$\alpha=\min\left\{1,\;\frac{m(\mathbf{y}\mid M',\mathcal{M})}{m(\mathbf{y}\mid M,\mathcal{M})}\right\}.$$
Local Jump. From the current state M, the local jump is performed by drawing a model from the set of models L(M) = {M_α : α ∈ E(M) ∪ C(M)}, where M_α is M \ {α} for α ∈ E(M) and M ∪ {α} for α ∈ C(M). The proposal probabilities are computed as a mixture of p(M′|y, M′ ∈ L(M)) and the discrete uniform distribution; the proposal kernel is
$$q(M'\mid\mathbf{y},M,M'\in L(M))=\frac{1}{2}\left(p(M'\mid\mathbf{y},M'\in L(M))+\frac{1}{|L(M)|}\right).$$
This choice promotes moving to better models while maintaining a non-negligible probability of moving to any of the possible models. The MH correction is
$$\alpha=\min\left\{1,\;\frac{m(\mathbf{y}\mid M',\mathcal{M})}{m(\mathbf{y}\mid M,\mathcal{M})}\cdot\frac{q(M\mid\mathbf{y},M',M\in L(M'))}{q(M'\mid\mathbf{y},M,M'\in L(M))}\right\}.$$
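The local-jump kernel can be sketched as follows; the neighborhood labels and the renormalized posterior over L(M) are hypothetical placeholders for quantities the sampler would compute:

```python
import random

def proposal_probs(post, neighborhood):
    # q(M' | y, M) = 0.5 * p(M' | y, M' in L(M)) + 0.5 / |L(M)|
    n = len(neighborhood)
    return {m: 0.5 * post[m] + 0.5 / n for m in neighborhood}

def propose(post, neighborhood, rng=random):
    # Draw a neighbor from the mixture kernel and return it with its probability
    probs = proposal_probs(post, neighborhood)
    r, acc = rng.random(), 0.0
    for m, q in probs.items():
        acc += q
        if r <= acc:
            return m, q
    return m, q  # numerical guard against rounding

neighborhood = ["add x1", "add x2", "drop x3"]           # hypothetical L(M)
post = {"add x1": 0.7, "add x2": 0.2, "drop x3": 0.1}    # renormalized posterior

q = proposal_probs(post, neighborhood)
print(q["add x1"])
# The MH acceptance then uses
# alpha = min{1, [m(y|M')/m(y|M)] * q(M|y,M') / q(M'|y,M)}.
```

Mixing the posterior weights with a uniform component keeps every neighbor reachable with probability at least 1/(2|L(M)|), which is what makes the chain irreducible over the neighborhood.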
Intermediate Jump. The intermediate jump is performed by increasing or decreasing the order of the nodes under consideration, performing local proposals based on order. For a model M′, define L_j(M′) = {M′} ∪ {M′_α : α ∈ (E(M′) ∪ C(M′)) ∩ O_j(M_F)}, where O_j(M_F) denotes the nodes of order j in M_F. From a state M, the kernel chooses at random whether to increase or decrease the order. If M = M_F, then decreasing the order is chosen with probability 1, and if M = M_B, then increasing the order is chosen with probability 1; in all other cases, the probability of increasing and of decreasing the order is 1/2. The proposal kernels are given by:
Increasing order proposal kernel:

1. Set j = J^min_M − 1 and M′_j = M.
2. Draw M′_{j+1} from q^inc_{j+1}(M′ | y, M, M′ ∈ L_{j+1}(M′_j)), where
$$q^{inc}_{j+1}(M'\mid\mathbf{y},M,M'\in L_{j+1}(M'_j))=\frac{1}{2}\left(p(M'\mid\mathbf{y},M'\in L_{j+1}(M'_j))+\frac{1}{|L_{j+1}(M'_j)|}\right).$$
3. Set j = j + 1.
4. If j < J^max_M, return to step 2; otherwise, proceed to step 5.
5. Set M′ = M′_{J^max_M} and compute the proposal probability
$$q^{inc}(M'\mid\mathbf{y},M)=\prod_{j=J^{min}_M-1}^{J^{max}_M-1}q^{inc}_{j+1}(M'_{j+1}\mid\mathbf{y},M,M'\in L_{j+1}(M'_j)).\qquad(B\text{-}1)$$
Decreasing order proposal kernel:

1. Set j = J^max_M + 1 and M′_j = M.
2. Draw M′_{j−1} from q^dec_{j−1}(M′ | y, M, M′ ∈ L_{j−1}(M′_j)), where
$$q^{dec}_{j-1}(M'\mid\mathbf{y},M,M'\in L_{j-1}(M'_j))=\frac{1}{2}\left(p(M'\mid\mathbf{y},M'\in L_{j-1}(M'_j))+\frac{1}{|L_{j-1}(M'_j)|}\right).$$
3. Set j = j − 1.
4. If j > J^min_M, return to step 2; otherwise, proceed to step 5.
5. Set M′ = M′_{J^min_M} and compute the proposal probability
$$q^{dec}(M'\mid\mathbf{y},M)=\prod_{j=J^{max}_M+1}^{J^{min}_M+1}q^{dec}_{j-1}(M'_{j-1}\mid\mathbf{y},M,M'\in L_{j-1}(M'_j)).\qquad(B\text{-}2)$$
If increasing order is chosen, then the MH correction is given by
$$\alpha=\min\left\{1,\;\left(\frac{1+I(M'=M_F)}{1+I(M=M_B)}\right)\frac{q^{dec}(M\mid\mathbf{y},M')}{q^{inc}(M'\mid\mathbf{y},M)}\cdot\frac{p(M'\mid\mathbf{y},\mathcal{M})}{p(M\mid\mathbf{y},\mathcal{M})}\right\},\qquad(B\text{-}3)$$
and similarly if decreasing order is chosen.
Other Local and Intermediate Kernels. The local and intermediate kernels described here perform a kind of stochastic forwards-backwards selection. Each kernel q can be relaxed to allow more than one node to be turned on or off at each step, which could provide larger jumps for each of these kernels. The tradeoff is that the number of proposed models for such jumps could be very large, precluding the use of posterior information in the construction of the proposal kernel.
APPENDIX C
WFM SIMULATION DETAILS
Briefly, the idea is to let Z_{M_T}(X)β_{M_T} = (QR)β_{M_T} = Qη_{M_T} (i.e., β_{M_T} = R^{-1}η_{M_T}), using the QR decomposition. As such, setting all values in η_{M_T} proportional to one corresponds to distributing the signal in the model uniformly across all predictors, regardless of their order.
The (unconditional) variance of a single observation y_i is var(y_i) = var(E[y_i|z_i]) + E[var(y_i|z_i)], where z_i is the i-th row of the design matrix Z_{M_T}. Hence we take the signal-to-noise ratio for each observation to be
$$SNR(\eta)=\frac{\eta'_{M_T}R^{-T}\Sigma_z R^{-1}\eta_{M_T}}{\sigma^2},$$
where Σ_z = var(z_i). We determine how the signal is distributed across predictors up to a proportionality constant, to be able to control the signal-to-noise ratio simultaneously.
Additionally, to investigate the ability of the model to capture the hierarchical structure correctly, we specify four different 0-1 vectors that determine the predictors in the true model M_T generating the data in the different scenarios.
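Since SNR(cη) = c²·SNR(η), a target ratio k is reached by rescaling any proportionality pattern η. A small sketch of this calculation, with hypothetical R and Σ_z in place of the values from the simulated design:

```python
import math

def mat_vec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def solve_upper(R, b):
    # Back-substitution: solve R x = b for upper-triangular R (beta = R^{-1} eta)
    n = len(b)
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (b[i] - sum(R[i][j] * x[j] for j in range(i + 1, n))) / R[i][i]
    return x

def snr(eta, R, Sigma_z, sigma2):
    beta = solve_upper(R, eta)                 # beta_MT = R^{-1} eta_MT
    return dot(beta, mat_vec(Sigma_z, beta)) / sigma2

R = [[2.0, 0.5], [0.0, 1.5]]                   # hypothetical R from a QR of Z_MT
Sigma_z = [[1.0, 0.3], [0.3, 1.0]]             # hypothetical var(z_i)
eta0 = [1.0, 1.0]                              # signal spread uniformly
target_k = 4.0
c = math.sqrt(target_k / snr(eta0, R, Sigma_z, 1.0))
eta = [c * e for e in eta0]
print(snr(eta, R, Sigma_z, 1.0))               # hits the target k = 4
```

The pattern of η fixes how the signal is shared across predictors, and the scalar c alone controls the SNR.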
Table C-1. Experimental conditions, WFM simulations

Parameter          Values considered
SNR(η_MT) = k      0.25, 1, 4
η_MT ∝             (1_o1, 0.5_o2, 0.25_o3); (1_o1, 1_o2, 1_o3); (0.25_o1, 0.5_o2, 1_o3)
γ_MT               M_T; M_T without order-one terms; M_T without order-two terms; M_T without x1², x2x5
n                  130, 260, 1040
The results presented below differ somewhat from those found in the main body of the article in Section 5. They are extracted by averaging the numbers of FPs, TPs and model sizes, respectively, over the 100 independent runs and across the corresponding scenarios, for the 20 highest probability models.
SNR and Sample Size Effect
In terms of the SNR and the sample size (Figure C-1), we observe that, as expected, small sample sizes conditioned upon a small SNR impair the ability of the algorithm to detect true coefficients with both the EPP and the HOP(1, ch), with this effect more noticeable under the latter prior. However, considering the mean number of true positives (TP) jointly with the mean model size, it is clear that, although the sensitivity is low, most of the few predictors that are discovered belong to the true model. The results observed with an SNR of 0.25 and a relatively small sample size are far from impressive; however, real problems where the SNR is as low as 0.25 will yield many spurious associations under the EPP, and the fact that the HOP(1, ch) offers strong protection against false positives is commendable in itself. An SNR of 1 also represents a feeble relationship between the predictors and the response; nonetheless, the method captures approximately half of the true coefficients while including very few false positives. Following intuition, as either the sample size or the SNR increases, the algorithm's performance is greatly enhanced: either a large sample size or a large SNR yields models that contain mostly true predictors. Additionally, the HOP(1, ch) provides strong control over the number of false positives; therefore, for high SNR or larger sample sizes, the number of predictors in the top 20 models is close to the size of the true model. In general, the EPP allows the detection of more TPs, while the HOP(1, ch) provides stronger control of the number of FPs included when small sample sizes are combined with small SNRs. As either the sample size or the SNR grows, the differences between the two priors become indistinct.
Figure C-1. SNR vs. n. Average model size, average true positives and average false positives for all simulated scenarios, by model ranking according to model posterior probabilities.
Coefficient Magnitude

This part of the experiment explores the effect of how the signal is distributed across predictors. As mentioned before, sphering is used to assign the coefficient values in a manner that controls the amount of signal that goes into each coefficient. Three possible ways to allocate the signal are considered: first, each order-one coefficient contains twice as much signal as any order-two coefficient and four times as much as any order-three coefficient; second, all coefficients contain the same amount of signal, regardless of their order; and third, each order-one coefficient contains half as much signal as any order-two coefficient and a quarter of what any order-three coefficient has. In Figure C-2 these cases are denoted by β = c(1_o1, 0.5_o2, 0.25_o3), β = c(1_o1, 1_o2, 1_o3) and β = c(0.25_o1, 0.5_o2, 1_o3), respectively.
Observe that the number of FPs is insensitive to how the SNR is distributed across predictors when using the HOP(1, ch); conversely, when using the EPP, the number of FPs decreases as the SNR grows, always being slightly higher than the number obtained with the HOP. With either prior structure, the algorithm performs better whenever all coefficients are equally weighted or when those for the order-three terms have higher weights. In these two cases (i.e., with β = c(1_o1, 1_o2, 1_o3) or β = c(0.25_o1, 0.5_o2, 1_o3)), the effect of the SNR appears to be similar. In contrast, when more weight is given to order-one terms, the algorithm yields slightly worse models at any SNR level. This is an intuitive result: giving more signal to higher-order terms makes it easier to detect higher-order terms, and consequently, by strong heredity, the algorithm will also select the corresponding lower-order terms included in the true model.
Special Points on the Scale

In Nelder (1998), the author argues that the conditions under which the weak-heredity principle can be used for model selection are so restrictive that the principle is commonly not valid in practice in this context. In addition, the author states that considering only well-formulated models does not take into account the possible presence of special points on the scales of the predictors, that is, situations where omitting lower-order terms is justified due to the nature of the data. However, it is our contention that every model has an underlying well-formulated structure; whether or not some predictor has special points on its scale will be determined through the estimation of the coefficients once a valid well-formulated structure has been chosen.
To understand how the algorithm behaves whenever the true data generating mechanism has zero-valued coefficients for some lower-order terms in the hierarchy, four different true models are considered. Three of them are not well-formulated, while the remaining one is the WFM shown in Figure 4-6. The three models that have special points correspond to the same model M_T from Figure 4-6 but have, respectively, zero-valued coefficients for all the order-one terms, for all the order-two terms, and for x1² and x2x5.

Figure C-2. SNR vs. coefficient values. Average model size, average true positives and average false positives for all simulated scenarios, by model ranking according to model posterior probabilities.
As seen before, in comparison to the EPP, the HOP(1, ch) tightly controls the inclusion of FPs by choosing smaller models, at the expense of also reducing the TP count, especially when there is more uncertainty about the true model (i.e., SNR = 0.25). For both prior structures, the results in Figure C-3 indicate that at low SNR levels the presence of special points has no apparent impact, as the selection behavior is similar across the four models in terms of both TPs and FPs. As the SNR increases, the TPs and the model size are affected for true models with zero-valued lower-order terms.

Figure C-3. SNR vs. different true models M_T. Average model size, average true positives and average false positives for all simulated scenarios, by model ranking according to model posterior probabilities.

These differences, however, are not very large. Relatively smaller models are selected whenever some terms in the hierarchy are missing; but with high SNR, which is where the differences are most pronounced, the predictors included are mostly true coefficients. The impact is almost imperceptible for the true model that lacks order-one terms and for the model with zero coefficients for x1² and x2x5, and it is more visible for models without order-two terms. This last result is expected due to strong heredity: whenever the order-one coefficients are missing, the inclusion of order-two and order-three terms will force their selection, which is also the case when only a few order-two terms have zero-valued coefficients. Conversely, when all order-two predictors are removed, some order-three predictors are not selected, as their signal is attributed to the order-two predictors missing from the true model. This is especially the case for the order-three interaction term x1x2x5, which depends on the inclusion of three order-two terms (x1x2, x1x5, x2x5) in order for it to be included as well. This makes the inclusion of this term somewhat more challenging: the three order-two interactions capture most of the variation from the polynomial terms that is present when the order-three term is also included. However, special points on the scale commonly occur on a single covariate, or at most on a few covariates; a true data generating mechanism that removes all terms of a given order in the context of polynomial models is clearly not justified, and this was done here only for comparison purposes.
APPENDIX D
SUPPLEMENTARY INFORMATION FOR THE OZONE DATA ANALYSIS
The covariates considered for the ozone data analysis match those used in Liang et al. (2008); these are displayed in Table D-1 below.

Table D-1. Variables used in the analyses of the ozone contamination dataset

Name    Description
ozone   Daily max 1hr-average ozone (ppm) at Upland, CA
vh      500 millibar pressure height (m) at Vandenberg AFB
wind    Wind speed (mph) at LAX
hum     Humidity (%) at LAX
temp    Temperature (F) measured at Sandburg, CA
ibh     Inversion base height (ft) at LAX
dpg     Pressure gradient (mm Hg) from LAX to Daggett, CA
vis     Visibility (miles) measured at LAX
ibt     Inversion base temperature (F) at LAX
The marginal posterior inclusion probability corresponds to the probability of including a given term of the full model M_F after summing over all models in the model space. For each node α ∈ M_F, this probability is given by p_α = Σ_{M∈M} I(α ∈ M) p(M|y, M). In problems with a large model space, such as the one considered for the ozone concentration problem, enumeration of the entire space is not feasible; thus, these probabilities are estimated by summing over every model drawn by the random walk over the model space M.
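The Monte Carlo estimate of p_α can be sketched as follows; the model draws below are hypothetical stand-ins for the output of the random walk:

```python
from collections import Counter

# Hypothetical MCMC output: each draw is the set of terms in the visited model
draws = [frozenset({"ibt"}),
         frozenset({"ibt", "hum"}),
         frozenset({"ibt", "hum", "hum:ibt"}),
         frozenset({"ibt", "hum"})]

def inclusion_probs(draws):
    # Ergodic estimate of p_alpha = sum_M I(alpha in M) p(M | y, M):
    # the fraction of visited models containing each term alpha
    counts = Counter(term for model in draws for term in model)
    return {term: c / len(draws) for term, c in counts.items()}

print(inclusion_probs(draws))  # {'ibt': 1.0, 'hum': 0.75, 'hum:ibt': 0.25}
```

Averaging indicator variables over the chain replaces the infeasible sum over the full model space, at the cost of Monte Carlo error.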
Given that there are in total 44 potential predictors, for convenience, in Tables D-2 to D-5 below we only display the marginal posterior probabilities for the terms included under at least one of the model priors considered (EPP, HIP, HUP and HOP), for each of the parameter priors utilized (intrinsic priors, Zellner-Siow priors, Hyper-g(11) and Hyper-g(21)).
Table D-2. Marginal inclusion probabilities, intrinsic prior

         EPP   HIP   HUP   HOP
hum      0.99  0.69  0.85  0.76
dpg      0.85  0.48  0.52  0.53
ibt      0.99  1.00  1.00  1.00
hum²     0.76  0.51  0.43  0.62
hum·dpg  0.55  0.02  0.03  0.17
hum·ibt  0.98  0.69  0.84  0.75
dpg²     0.72  0.36  0.25  0.46
ibt²     0.59  0.78  0.57  0.81
Table D-3. Marginal inclusion probabilities, Zellner-Siow prior

         EPP   HIP   HUP   HOP
hum      0.76  0.67  0.80  0.69
dpg      0.89  0.50  0.55  0.58
ibt      0.99  1.00  1.00  1.00
hum²     0.57  0.49  0.40  0.57
hum·ibt  0.72  0.66  0.78  0.68
dpg²     0.81  0.38  0.31  0.51
ibt²     0.54  0.76  0.55  0.77
Table D-4. Marginal inclusion probabilities, Hyper-g(11)

         EPP   HIP   HUP   HOP
vh       0.54  0.05  0.10  0.11
hum      0.81  0.67  0.80  0.69
dpg      0.90  0.50  0.55  0.58
ibt      0.99  1.00  0.99  0.99
hum²     0.61  0.49  0.40  0.57
hum·ibt  0.78  0.66  0.78  0.68
dpg²     0.83  0.38  0.30  0.51
ibt²     0.49  0.76  0.54  0.77
Table D-5. Marginal inclusion probabilities, Hyper-g(21)

         EPP   HIP   HUP   HOP
hum      0.79  0.64  0.73  0.67
dpg      0.90  0.52  0.60  0.59
ibt      0.99  1.00  0.99  1.00
hum²     0.60  0.47  0.37  0.55
hum·ibt  0.76  0.64  0.71  0.67
dpg²     0.82  0.41  0.36  0.52
ibt²     0.47  0.73  0.49  0.75
REFERENCES

Akaike, H. (1983). Information measures and model selection. Bull. Int. Statist. Inst., 50, 277–290.

Albert, J. H., & Chib, S. (1993). Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association, 88(422), 669–679.

Berger, J., & Bernardo, J. (1992). On the development of reference priors. Bayesian Statistics 4 (pp. 35–60).
URL httpisbastatdukeedueventsvalencia1992Valencia4Refpdf

Berger, J., & Pericchi, L. (1996). The intrinsic Bayes factor for model selection and prediction. Journal of the American Statistical Association, 91(433), 109–122.
URL httpamstattandfonlinecomdoiabs10108001621459199610476668

Berger, J., Pericchi, L., & Ghosh, J. (2001). Objective Bayesian methods for model selection: introduction and comparison. In Model Selection, vol. 38 of IMS Lecture Notes Monogr. Ser. (pp. 135–207). Inst. Math. Statist.
URL httpwwwjstororgstable1023074356165

Besag, J., York, J., & Mollié, A. (1991). Bayesian image restoration, with two applications in spatial statistics. Annals of the Institute of Statistical Mathematics, 43, 1–20.

Bien, J., Taylor, J., & Tibshirani, R. (2013). A lasso for hierarchical interactions. The Annals of Statistics, 41(3), 1111–1141.
URL httpprojecteuclidorgeuclidaos1371150895

Breiman, L., & Friedman, J. (1985). Estimating optimal transformations for multiple regression and correlation. Journal of the American Statistical Association, 80, 580–598.

Brusco, M. J., Steinley, D., & Cradit, J. D. (2009). An exact algorithm for hierarchically well-formulated subsets in second-order polynomial regression. Technometrics, 51(3), 306–315.

Casella, G., Girón, F. J., Martínez, M. L., & Moreno, E. (2009). Consistency of Bayesian procedures for variable selection. The Annals of Statistics, 37(3), 1207–1228.
URL httpprojecteuclidorgeuclidaos1239369020

Casella, G., Moreno, E., & Girón, F. (2014). Cluster analysis, model selection, and prior distributions on models. Bayesian Analysis, TBA(TBA), 1–46.
URL httpwwwstatufledu~casellaPapersClusterModel-July11-Apdf
Chipman, H. (1996). Bayesian variable selection with related predictors. Canadian Journal of Statistics, 24(1), 17–36.
URL httponlinelibrarywileycomdoi1023073315687abstract

Clyde, M., & George, E. I. (2004). Model uncertainty. Statistical Science, 19(1), 81–94.
URL httpprojecteuclidorgDienstgetRecordid=euclidss1089808274

Dewey, J. (1958). Experience and Nature. New York: Dover Publications.

Dorazio, R. M., & Taylor-Rodríguez, D. (2012). A Gibbs sampler for Bayesian analysis of site-occupancy data. Methods in Ecology and Evolution, 3, 1093–1098.

Ellison, A. M. (2004). Bayesian inference in ecology. Ecology Letters, 7, 509–520.

Fiske, I., & Chandler, R. (2011). unmarked: An R package for fitting hierarchical models of wildlife occurrence and abundance. Journal of Statistical Software, 43(10).
URL httpcorekmiopenacukdownloadpdf5701760pdf

George, E. (2000). The variable selection problem. Journal of the American Statistical Association, 95(452), 1304–1308.
URL httpwwwtandfonlinecomdoiabs10108001621459200010474336

Girón, F. J., Moreno, E., Casella, G., & Martínez, M. L. (2010). Consistency of objective Bayes factors for nonnested linear models and increasing model dimension. Revista de la Real Academia de Ciencias Exactas, Físicas y Naturales, Serie A: Matemáticas, 104(1), 57–67.
URL httpwwwspringerlinkcomindex105052RACSAM201006

Good, I. J. (1950). Probability and the Weighing of Evidence. New York: Haffner.

Griepentrog, G. L., Ryan, J. M., & Smith, L. D. (1982). Linear transformations of polynomial regression models. American Statistician, 36(3), 171–174.

Gunel, E., & Dickey, J. (1974). Bayes factors for independence in contingency tables. Biometrika, 61, 545–557.

Hanski, I. (1994). A practical model of metapopulation dynamics. Journal of Animal Ecology, 63, 151–162.

Hooten, M. (2006). Hierarchical spatio-temporal models for ecological processes. Doctoral dissertation, University of Missouri-Columbia.
URL httpsmospacelibraryumsystemeduxmluihandle103554500

Hooten, M. B., & Hobbs, N. T. (2014). A guide to Bayesian model selection for ecologists. Ecological Monographs (in press).
Hughes, J., & Haran, M. (2013). Dimension reduction and alleviation of confounding for spatial generalized linear mixed models. Journal of the Royal Statistical Society, Series B: Statistical Methodology, 75, 139–159.

Hurvich, C. M., & Tsai, C.-L. (1989). Regression and time series model selection in small samples. Biometrika, 76, 297–307.
URL httpbiometoxfordjournalsorgcontent762297abstract

Jeffreys, H. (1935). Some tests of significance treated by the theory of probability. Proceedings of the Cambridge Philosophical Society, 31, 203–222.

Jeffreys, H. (1961). Theory of Probability. London: Oxford University Press, 3rd ed.

Johnson, D., Conn, P., Hooten, M., Ray, J., & Pond, B. (2013). Spatial occupancy models for large data sets. Ecology, 94(4), 801–808.
URL httpwwwesajournalsorgdoiabs10189012-05641

Kass, R., & Wasserman, L. (1995). A reference Bayesian test for nested hypotheses and its relationship to the Schwarz criterion. Journal of the American Statistical Association, 90(431).
URL httpamstattandfonlinecomdoiabs10108001621459199510476592

Kass, R. E., & Raftery, A. E. (1995). Bayes factors. Journal of the American Statistical Association, 90, 773–795.
URL httpwwwtandfonlinecomdoiabs10108001621459199510476572

Kass, R. E., & Wasserman, L. (1996). The selection of prior distributions by formal rules. Journal of the American Statistical Association, 91(435), 1343.
URL httpwwwjstororgstable2291752origin=crossref

Kéry, M. (2010). Introduction to WinBUGS for Ecologists: Bayesian Approach to Regression, ANOVA, Mixed Models and Related Analyses. Academic Press, 1st ed.

Kéry, M., Gardner, B., & Monnerat, C. (2010). Predicting species distributions from checklist data using site-occupancy models. Journal of Biogeography, 37(10), 1851–1862.

Khuri, A. (2002). Nonsingular linear transformations of the control variables in response surface models. Technical report.

Krebs, C. J. (1972). Ecology: The Experimental Analysis of Distribution and Abundance.
Lempers, F. B. (1971). Posterior Probabilities of Alternative Linear Models. Rotterdam: University of Rotterdam Press.

León-Novelo, L., Moreno, E., & Casella, G. (2012). Objective Bayes model selection in probit models. Statistics in Medicine, 31(4), 353–65.
URL httpwwwncbinlmnihgovpubmed22162041

Liang, F., Paulo, R., Molina, G., Clyde, M. A., & Berger, J. O. (2008). Mixtures of g priors for Bayesian variable selection. Journal of the American Statistical Association, 103(481), 410–423.
URL httpwwwtandfonlinecomdoiabs101198016214507000001337

Link, W., & Barker, R. (2009). Bayesian Inference with Ecological Applications. Elsevier.

MacKenzie, D., & Nichols, J. (2004). Occupancy as a surrogate for abundance estimation. Animal Biodiversity and Conservation, 1, 461–467.
URL httpcrsitbacidmediajurnalrefslandscapemackenzie2004zhpdf

MacKenzie, D., Nichols, J., & Hines, J. (2003). Estimating site occupancy, colonization, and local extinction when a species is detected imperfectly. Ecology, 84(8), 2200–2207.
URL httpwwwesajournalsorgdoiabs10189002-3090

MacKenzie, D. I., Bailey, L. L., & Nichols, J. D. (2004). Investigating species co-occurrence patterns when species are detected imperfectly. Journal of Animal Ecology, 73, 546–555.

MacKenzie, D. I., Nichols, J. D., Lachman, G. B., Droege, S., Royle, J. A., & Langtimm, C. A. (2002). Estimating site occupancy rates when detection probabilities are less than one. Ecology, 83(8), 2248–2255.

Mazerolle, M. (2013). Package 'AICcmodavg'.
URL ftpheanetarchivegnewsenseorgdisk1CRANwebpackagesAICcmodavgAICcmodavgpdf

McCullagh, P., & Nelder, J. A. (1989). Generalized Linear Models (2nd ed.). London, England: Chapman & Hall.

McQuarrie, A., Shumway, R., & Tsai, C.-L. (1997). The model selection criterion AICu.
Moreno, E., Bertolino, F., & Racugno, W. (1998). An intrinsic limiting procedure for model selection and hypotheses testing. Journal of the American Statistical Association, 93(444), 1451–1460.

Moreno, E., Girón, F. J., & Casella, G. (2010). Consistency of objective Bayes factors as the model dimension grows. The Annals of Statistics, 38(4), 1937–1952.
URL httpprojecteuclidorgeuclidaos1278861238

Nelder, J. A. (1977). A reformulation of linear models. Journal of the Royal Statistical Society, Series A: Statistics in Society, 140, 48–77.

Nelder, J. A. (1998). The selection of terms in response-surface models – how strong is the weak-heredity principle? American Statistician, 52(4), 315–318.

Nelder, J. A. (2000). Functional marginality and response-surface fitting. Journal of Applied Statistics, 27(1), 109–112.

Nichols, J., Hines, J., & Mackenzie, D. (2007). Occupancy estimation and modeling with multiple states and state uncertainty. Ecology, 88(6), 1395–1400.
URL httpwwwesajournalsorgdoipdf10189006-1474

Ovaskainen, O., Hottola, J., & Siitonen, J. (2010). Modeling species co-occurrence by multivariate logistic regression generates new hypotheses on fungal interactions. Ecology, 91(9), 2514–21.
URL httpwwwncbinlmnihgovpubmed20957941

Peixoto, J. L. (1987). Hierarchical variable selection in polynomial regression models. American Statistician, 41(4), 311–313.

Peixoto, J. L. (1990). A property of well-formulated polynomial regression models. American Statistician, 44(1), 26–30.

Pericchi, L. R. (2005). Model selection and hypothesis testing based on objective probabilities and Bayes factors. In Handbook of Statistics. Elsevier.

Polson, N. G., Scott, J. G., & Windle, J. (2013). Bayesian inference for logistic models using Pólya-Gamma latent variables. Journal of the American Statistical Association, 108, 1339–1349.
URL httpdxdoiorg101080016214592013829001

Rao, C. R., & Wu, Y. (2001). On model selection. Vol. 38 of Lecture Notes–Monograph Series (pp. 1–57). Beachwood, OH: Institute of Mathematical Statistics.
URL httpdxdoiorg101214lnms1215540960
137
Reich, B. J., Hodges, J. S., & Zadnik, V. (2006). Effects of residual smoothing on the posterior of the fixed effects in disease-mapping models. Biometrics, 62, 1197-1206.

Reiners, W., & Lockwood, J. (2009). Philosophical Foundations for the Practices of Ecology. Cambridge University Press. URL http://books.google.com/books?id=dr9cPgAACAAJ

Rigler, F., & Peters, R. (1995). Excellence in Ecology: Science and Limnology. Ecology Institute, Germany. URL http://orton.catie.ac.cr/cgi-bin/wxis.exe/?IsisScript=CIENL.xis&method=post&formato=2&cantidad=1&expresion=mfn=008268

Robert, C., Chopin, N., & Rousseau, J. (2009). Harold Jeffreys's Theory of Probability revisited. Statistical Science, 24(2), 141-179. URL https://www.newton.ac.uk/preprints/NI08021.pdf

Robert, C. P. (1993). A note on the Jeffreys-Lindley paradox. Statistica Sinica, 3, 601-608.

Royle, J. A., & Kery, M. (2007). A Bayesian state-space formulation of dynamic occupancy models. Ecology, 88(7), 1813-1823. URL http://www.ncbi.nlm.nih.gov/pubmed/17645027

Scott, J., & Berger, J. (2010). Bayes and empirical-Bayes multiplicity adjustment in the variable selection problem. The Annals of Statistics. URL http://projecteuclid.org/euclid.aos/1278861454

Spiegelhalter, D. J., & Smith, A. F. M. (1982). Bayes factors for linear and log-linear models with vague prior information. Journal of the Royal Statistical Society, Series B, 44, 377-387.

Tierney, L., & Kadane, J. B. (1986). Accurate approximations for posterior moments and marginal densities. Journal of the American Statistical Association, 81, 82-86.

Tyre, A. J., Tenhumberg, B., Field, S. A., Niejalke, D., Parris, K., & Possingham, H. P. (2003). Improving precision and reducing bias in biological surveys: estimating false-negative error rates. Ecological Applications, 13(6), 1790-1801. URL http://www.esajournals.org/doi/abs/10.1890/02-5078

Waddle, J. H., Dorazio, R. M., Walls, S. C., Rice, K. G., Beauchamp, J., Schuman, M., & Mazzotti, F. J. (2010). A new parameterization for estimating co-occurrence of interacting species. Ecological Applications, 20, 1467-1475.

Wasserman, L. (2000). Bayesian model selection and model averaging. Journal of Mathematical Psychology, 44(1), 92-107. URL http://www.ncbi.nlm.nih.gov/pubmed/10733859

Wilson, M., Iversen, E., Clyde, M. A., Schmidler, S. C., & Schildkraut, J. M. (2010). Bayesian model search and multilevel inference for SNP association studies. The Annals of Applied Statistics, 4(3), 1342-1364. URL http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3004292

Womack, A. J., Leon-Novelo, L., & Casella, G. (2014). Inference from intrinsic Bayes procedures under model selection and uncertainty. Journal of the American Statistical Association. URL http://www.tandfonline.com/doi/abs/10.1080/01621459.2014.880348

Yuan, M., Joseph, V. R., & Zou, H. (2009). Structured variable selection and estimation. The Annals of Applied Statistics, 3(4), 1738-1757. URL http://projecteuclid.org/euclid.aoas/1267453962

Zeller, K. A., Nijhawan, S., Salom-Perez, R., Potosme, S. H., & Hines, J. E. (2011). Integrating occupancy modeling and interview data for corridor identification: a case study for jaguars in Nicaragua. Biological Conservation, 144(2), 892-901.

Zellner, A., & Siow, A. (1980). Posterior odds ratios for selected regression hypotheses. In Trabajos de Estadistica y de Investigacion Operativa (pp. 585-603). URL http://www.springerlink.com/index/5300770UP12246M9.pdf
BIOGRAPHICAL SKETCH
Daniel Taylor-Rodríguez was born in Bogotá, Colombia. He earned a BS degree in economics from the Universidad de Los Andes (2004) and a Specialist degree in statistics from the Universidad Nacional de Colombia. In 2009, he traveled to Gainesville, Florida, to pursue a master's in statistics under the supervision of George Casella. Upon completion, he started a PhD in interdisciplinary ecology with a concentration in statistics, again under George Casella's supervision. After George's passing, Linda Young and Nikolay Bliznyuk continued to oversee Daniel's mentorship. He has accepted a joint postdoctoral fellowship at the Statistical and Applied Mathematical Sciences Institute and the Department of Statistical Science at Duke University.
Professor Mary Christman for her mentorship and enormous support. I would like to thank Dr. Mihai Giurcanu for spending countless hours helping me think more deeply about statistics; his insight has been instrumental to shaping my own ideas. Thanks to Dr. Claudio Fuentes for taking an interest in my work, and for his advice, support, and kind words, which helped me retain the confidence to continue.

I would like to acknowledge my friends at UF. Juan Jose Acosta, Mauricio Mosquera, Diana Falla, Salvador and Emma Weeks, and Anna Denicol: thanks for becoming my family away from home. Andreas, Tavis, Emily, Alex, Sasha, Mike, Yeonhee, and Laura: thanks for being there for me; I truly enjoyed sharing these years with you. Vitor, Paula, Rafa, Leandro, Fabio, Eduardo, Marcelo, and all the other Brazilians in the Animal Science Department: thanks for your friendship and for the many unforgettable (though blurry) weekends.

Also, I would like to thank Pablo Arboleda for believing in me. Because of him, I was able to take the first step towards fulfilling my educational goals. My gratitude to Grupo Bancolombia, Fulbright Colombia, Colfuturo, and the IGERT QSE3 program for supporting me throughout my studies. Also, thanks to Marc Kery and Christian Monnerat for providing data to validate our methods. Thanks to the staff in the Statistics Department, especially to Ryan Chance; to the staff at the HPC; and also to Karen Bray at SNRE.

Above all else, I would like to thank my wife and family. Nata, you have always been there for me, pushing me forward, believing in me, and helping me make better decisions; regardless of how hard things get, you have always managed to give me true and lasting happiness. Thank you for your love, strength, and patience. Mom, Dad, Alejandro, Alberto, Laura, Sammy, Vale, and Tommy: without your love, trust, and support, getting this far would not have been possible. Thank you for giving me so much. Gustavo, Lilia, Angelica, and Juan Pablo, thanks for taking me into your family; your words of encouragement have led the way.
TABLE OF CONTENTS
page
ACKNOWLEDGMENTS 4
LIST OF TABLES 8
LIST OF FIGURES 10
ABSTRACT 12
CHAPTER
1 GENERAL INTRODUCTION 14
1.1 Occupancy Modeling 15
1.2 A Primer on Objective Bayesian Testing 17
1.3 Overview of the Chapters 21

2 MODEL ESTIMATION METHODS 23

2.1 Introduction 23
2.1.1 The Occupancy Model 24
2.1.2 Data Augmentation Algorithms for Binary Models 26
2.2 Single Season Occupancy 29
2.2.1 Probit Link Model 30
2.2.2 Logit Link Model 32
2.3 Temporal Dynamics and Spatial Structure 34
2.3.1 Dynamic Mixture Occupancy State-Space Model 37
2.3.2 Incorporating Spatial Dependence 43
2.4 Summary 46

3 INTRINSIC ANALYSIS FOR OCCUPANCY MODELS 49

3.1 Introduction 49
3.2 Objective Bayesian Inference 52
3.2.1 The Intrinsic Methodology 53
3.2.2 Mixtures of g-Priors 54
3.2.2.1 Intrinsic priors 55
3.2.2.2 Other mixtures of g-priors 56
3.3 Objective Bayes Occupancy Model Selection 57
3.3.1 Preliminaries 58
3.3.2 Intrinsic Priors for the Occupancy Problem 60
3.3.3 Model Posterior Probabilities 62
3.3.4 Model Selection Algorithm 63
3.4 Alternative Formulation 66
3.5 Simulation Experiments 68
3.5.1 Marginal Posterior Inclusion Probabilities for Model Predictors 70
3.5.2 Summary Statistics for the Highest Posterior Probability Model 76
3.6 Case Study: Blue Hawker Data Analysis 77
3.6.1 Results: Variable Selection Procedure 79
3.6.2 Validation for the Selection Procedure 81
3.7 Discussion 82

4 PRIORS ON THE MODEL SPACE AND WELL-FORMULATED MODELS 84

4.1 Introduction 84
4.2 Setup for Well-Formulated Models 88
4.2.1 Well-Formulated Model Spaces 90
4.3 Priors on the Model Space 91
4.3.1 Model Prior Definition 92
4.3.2 Choice of Prior Structure and Hyper-Parameters 96
4.3.3 Posterior Sensitivity to the Choice of Prior 99
4.4 Random Walks on the Model Space 104
4.4.1 Simple Pruning and Growing 105
4.4.2 Degree Based Pruning and Growing 106
4.5 Simulation Study 107
4.5.1 SNR and Sample Size Effect 109
4.5.2 Coefficient Magnitude 110
4.5.3 Special Points on the Scale 111
4.6 Case Study: Ozone Data Analysis 111
4.7 Discussion 113
5 CONCLUSIONS 115
APPENDIX
A FULL CONDITIONAL DENSITIES DYMOSS 118
B RANDOM WALK ALGORITHMS 121
C WFM SIMULATION DETAILS 124
D SUPPLEMENTARY INFORMATION FOR THE OZONE DATA ANALYSIS 131
REFERENCES 133
BIOGRAPHICAL SKETCH 140
LIST OF TABLES
Table page
1-1 Interpretation of BFji when contrasting Mj and Mi 20
3-1 Simulation control parameters occupancy model selector 69
3-2 Comparison of average minOddsMPIP under scenarios having different number of sites (N=50, N=100) and under scenarios having different number of surveys per site (J=3, J=5), for the presence and detection components, using uniform and multiplicity correction priors 75

3-3 Comparison of average minOddsMPIP for different levels of signal considered in the occupancy and detection probabilities, for the presence and detection components, using uniform and multiplicity correction priors 75

3-4 Comparison between scenarios with 50 and 100 sites in terms of the average percentage of true positive and true negative terms over the highest probability models, for the presence and the detection components, using uniform and multiplicity correcting priors on the model space 76

3-5 Comparison between scenarios with 3 and 5 surveys per site in terms of the percentage of true positive and true negative predictors averaged over the highest probability models, for the presence and the detection components, using uniform and multiplicity correcting priors on the model space 77

3-6 Comparison between scenarios with different levels of signal in the occupancy component in terms of the percentage of true positive and true negative predictors averaged over the highest probability models, for the presence and the detection components, using uniform and multiplicity correcting priors on the model space 77

3-7 Comparison between scenarios with different levels of signal in the detection component in terms of the percentage of true positive and true negative predictors averaged over the highest probability models, for the presence and the detection components, using uniform and multiplicity correcting priors on the model space 78

3-8 Posterior probability for the five highest probability models in the presence component of the blue hawker data 80

3-9 Posterior probability for the five highest probability models in the detection component of the blue hawker data 80

3-10 MPIP, presence component 81

3-11 MPIP, detection component 81

3-12 Mean misclassification rate for HPMs and MPMs using uniform and multiplicity correction model priors 82

4-1 Characterization of the full models MF and corresponding model spaces M considered in simulations 100

4-2 Mean number of false and true positives in 100 randomly generated datasets as the number of main effects increases from three to five predictors when MF is a full quadratic, under the equal probability prior (EPP), the hierarchical independence prior (HIP), the hierarchical order prior (HOP), and the hierarchical uniform prior (HUP) 102

4-3 Mean number of false and true positives in 100 randomly generated datasets as the maximum order of MF increases from two to four in a full model with three main effects, under the equal probability prior (EPP), the hierarchical independence prior (HIP), the hierarchical order prior (HOP), and the hierarchical uniform prior (HUP) 103

4-4 Mean number of false and true positives in 100 randomly generated datasets with unstructured or irregular model spaces, under the equal probability prior (EPP), the hierarchical independence prior (HIP), the hierarchical order prior (HOP), and the hierarchical uniform prior (HUP) 105

4-5 Variables used in the analyses of the ozone contamination dataset 112

4-6 Median probability models (MPM) from different combinations of parameter and model priors vs. the model selected using the hierarchical lasso 113
C-1 Experimental conditions WFM simulations 124
D-1 Variables used in the analyses of the ozone contamination dataset 131
D-2 Marginal inclusion probabilities intrinsic prior 132
D-3 Marginal inclusion probabilities Zellner-Siow prior 132
D-4 Marginal inclusion probabilities Hyper-g11 132
D-5 Marginal inclusion probabilities Hyper-g21 132
LIST OF FIGURES
Figure page
2-1 Graphical representation occupancy model 25
2-2 Graphical representation occupancy model after data-augmentation 31
2-3 Graphical representation multiseason model for a single site 39
2-4 Graphical representation data-augmented multiseason model 39
3-1 Predictor MPIP averaged over scenarios with N=50 and N=100 sites, using uniform (U) and multiplicity correction (MC) priors 71

3-2 Predictor MPIP averaged over scenarios with J=3 and J=5 surveys per site, using uniform (U) and multiplicity correction (MC) priors 72

3-3 Predictor MPIP averaged over scenarios with the interaction between the number of sites and the surveys per site, using uniform (U) and multiplicity correction (MC) priors 72

3-4 Predictor MPIP averaged over scenarios with equal signal in the occupancy probabilities, using uniform (U) and multiplicity correction (MC) priors 73

3-5 Predictor MPIP averaged over scenarios with equal signal in the detection probabilities, using uniform (U) and multiplicity correction (MC) priors 73

4-1 Graphs of well-formulated polynomial models for p = 2 90

4-2 E(M) and C(M) in M defined by a quadratic surface in two main effects for model M = {1, x1, x1^2} 91

4-3 Graphical representation of assumptions on M defined by the quadratic surface in two main effects 93

4-4 Prior probabilities for the space of well-formulated models associated to the quadratic surface on two variables, where MB is taken to be the intercept-only model and (a,b) in {(1,1), (1,ch)} 97

4-5 Prior probabilities for the space of well-formulated models associated to three main effects and one interaction term, where MB is taken to be the intercept-only model and (a,b) in {(1,1), (1,ch)} 98

4-6 MT: DAG of the largest true model used in simulations 109

4-7 Average true positives (TP) and average false positives (FP) in all simulated scenarios for the median probability model with EPP and HOP(1,ch) 110

C-1 SNR vs. n: average model size, average true positives, and average false positives for all simulated scenarios, by model ranking according to model posterior probabilities 126

C-2 SNR vs. coefficient values: average model size, average true positives, and average false positives for all simulated scenarios, by model ranking according to model posterior probabilities 128

C-3 SNR vs. different true models MT: average model size, average true positives, and average false positives for all simulated scenarios, by model ranking according to model posterior probabilities 129
Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy

OBJECTIVE BAYESIAN METHODS FOR OCCUPANCY MODEL ESTIMATION AND SELECTION

By

Daniel Taylor-Rodríguez

August 2014

Chair: Linda J. Young
Cochair: Nikolay Bliznyuk
Major: Interdisciplinary Ecology

The ecological literature contains numerous methods for conducting inference about the dynamics that govern biological populations. Among these methods, occupancy models have played a leading role during the past decade in the analysis of large biological population surveys. The flexibility of the occupancy framework has brought about useful extensions for determining key population parameters, which provide insights about the distribution, structure, and dynamics of a population. However, the methods used to fit the models and to conduct inference have gradually grown in complexity, leaving practitioners unable to fully understand their implicit assumptions and increasing the potential for misuse. This motivated our first contribution: we develop a flexible and straightforward estimation method for occupancy models that provides the means to directly incorporate temporal and spatial heterogeneity, using covariate information that characterizes habitat quality and the detectability of a species.

Adding to the issue mentioned above, studies of complex ecological systems now collect large amounts of information. To identify the drivers of these systems, robust techniques that account for test multiplicity and for the structure in the predictors are necessary, but unavailable for ecological models. We develop tools to address this methodological gap. First, working in an "objective" Bayesian framework, we develop the first fully automatic and objective method for occupancy model selection, based on intrinsic parameter priors. Moreover, for the general variable selection problem, we propose three sets of prior structures on the model space that correct for multiple testing, and a stochastic search algorithm that relies on the priors on the model space to account for the polynomial structure in the predictors.
CHAPTER 1
GENERAL INTRODUCTION

As with any other branch of science, ecology strives to grasp truths about the world that surrounds us, and in particular about nature. The objective truth sought by ecology may well be beyond our grasp; however, it is reasonable to think that, at least partially, "Nature is capable of being understood" (Dewey 1958). We can observe and interpret nature to formulate hypotheses, which can then be tested against reality. Hypotheses that encounter no or little opposition when confronted with reality may become contextual versions of the truth, and may be generalized by scaling them spatially and/or temporally, accordingly, to delimit the bounds within which they are valid.

To formulate hypotheses accurately, and in a fashion amenable to scientific inquiry, not only must the point of view and the assumptions considered be made explicit, but also the object of interest, the properties of that object worthy of consideration, and the methods used in studying such properties (Reiners & Lockwood 2009; Rigler & Peters 1995). Ecology, as defined by Krebs (1972), is "the study of interactions that determine the distribution and abundance of organisms". This characterizes organisms and their interactions as the objects of interest to ecology, and prescribes distribution and abundance as relevant properties of these organisms.

With regard to the methods used to acquire ecological scientific knowledge, traditionally theoretical mathematical models (such as deterministic PDEs) have been used. However, naturally varying systems are imprecisely observed and, as such, are subject to multiple sources of uncertainty that must be explicitly accounted for. Because of this, the ecological scientific community is developing a growing interest in flexible and powerful statistical methods, and among these Bayesian hierarchical models predominate. These methods rely on empirical observations and can accommodate fairly complex relationships between empirical observations and theoretical process models, while accounting for diverse sources of uncertainty (Hooten 2006).
Bayesian approaches are now used extensively in ecological modeling; however, there are two issues of concern, one from the standpoint of ecological practitioners and another from the perspective of scientific ecological endeavors. First, Bayesian modeling tools require a considerable understanding of probability and statistical theory, leading practitioners to view them as black-box approaches (Kery 2010). Second, although Bayesian applications proliferate in the literature, in general there is a lack of awareness of the distinction between approaches specifically devised for testing and those for estimation (Ellison 2004). Furthermore, there is a dangerous unfamiliarity with the proven risks of using tools designed for estimation in testing procedures (Berger & Pericchi 1996; Berger et al. 2001; Kass & Raftery 1995; Moreno et al. 1998; Robert et al. 2009; Robert 1993), e.g., the use of flat priors in hypothesis testing.

Occupancy models have played a leading role during the past decade in large biological population surveys. The flexibility of the occupancy framework has allowed the development of useful extensions to determine several key population parameters, which provide robust notions of the distribution, structure, and dynamics of a population. In order to address some of the concerns stated in the previous paragraph, we concentrate on the occupancy framework to develop estimation and testing tools that will allow ecologists, first, to gain insight about the estimation procedure and, second, to conduct statistically sound model selection for site-occupancy data.

1.1 Occupancy Modeling

Since MacKenzie et al. (2002) and Tyre et al. (2003) introduced the site-occupancy framework, countless applications and extensions of the method have been developed in the ecological literature, as evidenced by the 438,000 hits on Google Scholar for a search of "occupancy model". This class of models acknowledges that techniques used to conduct biological population surveys are prone to detection errors: if an individual is detected, it must be present, while if it is not detected, it might or might not be. Occupancy models improve upon traditional binary regression by accounting for observed detection and partially observed presence as two separate but related components. In the site-occupancy setting, the chosen locations are surveyed repeatedly in order to reduce the ambiguity caused by the observed zeros. This approach therefore allows probabilities of both presence (occurrence) and detection to be estimated.
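The ambiguity that repeated surveys resolve can be seen in a small simulation. The sketch below is purely illustrative (all names and parameter values are hypothetical, not taken from the dissertation): it draws latent presence indicators and detection histories, and shows why the naive proportion of sites with at least one detection understates occupancy.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical single-season design: N sites, J repeat surveys per site,
# constant occupancy probability psi and detection probability p.
N, J = 1000, 5
psi, p = 0.6, 0.4

z = rng.binomial(1, psi, size=N)                   # latent presence at each site
y = rng.binomial(1, p, size=(N, J)) * z[:, None]   # detections occur only at occupied sites

# A site with >= 1 detection is certainly occupied; an all-zero history is
# ambiguous (absent, or present but missed on every survey).
naive = (y.sum(axis=1) > 0).mean()
print(round(z.mean(), 3), round(naive, 3))         # naive estimate is biased downward
```

Because an occupied site produces an all-zero history with probability (1 - p)^J, the naive estimate targets psi(1 - (1 - p)^J) rather than psi, which is precisely what motivates modeling presence and detection jointly.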
The uses of site-occupancy models are many. For example, metapopulation and island biogeography models are often parameterized in terms of site (or patch) occupancy (Hanski 1992, 1994, 1997, as cited in MacKenzie et al. (2003)), and occupancy may be used as a surrogate for abundance to answer questions regarding geographic distribution, range size, and metapopulation dynamics (MacKenzie et al. 2004; Royle & Kery 2007).

The basic occupancy framework, which assumes a single closed population with fixed probabilities through time, has proven to be quite useful; however, it might be of limited utility when addressing some problems. In particular, assumptions for the basic model may become too restrictive or unrealistic whenever the study period extends throughout multiple years or seasons, especially given the increasingly changing environmental conditions that most ecosystems are currently experiencing.

Among the extensions found in the literature, one that we consider particularly relevant incorporates heterogeneous occupancy probabilities through time. Models that incorporate temporally varying probabilities stem from important metapopulation notions provided by Hanski (1994), such as occupancy probabilities depending on local colonization and local extinction processes. In spite of the conceptual usefulness of Hanski's model, several strong and untenable assumptions (e.g., all patches being homogeneous in quality) are required for it to provide practically meaningful results.

A more viable alternative, which builds on Hanski (1994), is an extension of the single-season occupancy model of MacKenzie et al. (2003). In this model, the heterogeneity of occupancy probabilities across seasons arises from local colonization and extinction processes. This model is flexible enough to let detection, occurrence, extinction, and colonization probabilities each depend upon its own set of covariates. Model parameters are obtained through likelihood-based estimation.
Using a maximum likelihood approach presents two drawbacks. First, the uncertainty assessment for maximum likelihood parameter estimates relies on asymptotic results, which are obtained from implementation of the delta method, making it sensitive to sample size. Second, to obtain parameter estimates, the latent process (occupancy) is marginalized out of the likelihood, leading to the usual zero-inflated Bernoulli model. Although this is a convenient strategy for solving the estimation problem, after integrating out the latent state variables (occupancy indicators), they are no longer available; therefore, finite sample estimates cannot be calculated directly. Instead, a supplementary parametric bootstrapping step is necessary. Further, additional structure, such as temporal or spatial variation, cannot be introduced by means of random effects (Royle & Kery 2007).
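To make the marginalization concrete, the following sketch (a minimal illustration with hypothetical names and constant probabilities; the models in this dissertation let the probabilities depend on covariates) writes down the zero-inflated Bernoulli likelihood that results from summing the latent occupancy indicator out of the joint model, and maximizes it numerically.

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_lik(theta, y):
    """Negative log-likelihood of the zero-inflated Bernoulli model
    obtained after summing the latent occupancy indicators out."""
    psi, p = 1 / (1 + np.exp(-np.asarray(theta)))   # constant occupancy / detection probs
    J = y.shape[1]
    d = y.sum(axis=1)                               # detections per site
    det = d > 0
    # Sites with at least one detection are certainly occupied.
    ll_det = np.log(psi) + d[det] * np.log(p) + (J - d[det]) * np.log(1 - p)
    # All-zero sites: occupied but never detected, OR truly unoccupied.
    ll_zero = np.log(psi * (1 - p) ** J + (1 - psi))
    return -(ll_det.sum() + (~det).sum() * ll_zero)

rng = np.random.default_rng(42)
z = rng.binomial(1, 0.6, 1000)                      # latent presence (never seen by the fit)
y = rng.binomial(1, 0.4, (1000, 5)) * z[:, None]    # observed detection histories

fit = minimize(neg_log_lik, x0=np.zeros(2), args=(y,), method="Nelder-Mead")
psi_hat, p_hat = 1 / (1 + np.exp(-fit.x))
print(round(float(psi_hat), 2), round(float(p_hat), 2))  # roughly recovers 0.6 and 0.4
```

The drawback discussed above is visible here: after optimization only the estimates of psi and p remain; the site-level indicators z were integrated away, so site-specific finite-sample summaries require an extra step such as parametric bootstrapping.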
1.2 A Primer on Objective Bayesian Testing

With the advent of high-dimensional data, such as that found in modern problems in ecology, genetics, physics, etc., coupled with evolving computing capability, objective Bayesian inferential methods have gained increasing popularity. This, however, is by no means a new approach in the way Bayesian inference is conducted. In fact, starting with Bayes and Laplace, and continuing for almost 200 years, Bayesian analysis was primarily based on "noninformative" priors (Berger & Bernardo 1992).

Now, subjective elicitation of prior probabilities in Bayesian analysis is widely recognized as the ideal (Berger et al. 2001); however, it is often the case that the available information is insufficient to specify appropriate prior probabilistic statements. Commonly, as in model selection problems where large model spaces have to be explored, the number of model parameters is prohibitively large, preventing one from eliciting prior information for the entire parameter space. As a consequence, in practice the determination of priors through the definition of structural rules has become the alternative to subjective elicitation for a variety of problems in Bayesian testing. Priors arising from these rules are known in the literature as noninformative, objective, default, or reference priors. Many of these connotations generate controversy and are accused, perhaps rightly, of providing a false pretension of objectivity. Nevertheless, we will avoid that discussion and refer to them herein exchangeably as noninformative or objective priors, to convey the sense that no attempt to introduce an informed opinion is made in defining prior probabilities.

A plethora of "noninformative" methods has been developed in the past few decades (see Berger & Bernardo (1992); Berger & Pericchi (1996); Berger et al. (2001); Clyde & George (2004); Kass & Wasserman (1995, 1996); Liang et al. (2008); Moreno et al. (1998); Spiegelhalter & Smith (1982); Wasserman (2000), and the references therein). We find particularly interesting those derived from the model structure, in which no tuning parameters are required, especially since these can be regarded as automatic methods. Among them, methods based on the Bayes factor for intrinsic priors have proven their worth in a variety of inferential problems, given their excellent performance, flexibility, and ease of use. This class of priors is discussed in detail in Chapter 3. For now, some basic notation and notions of Bayesian inferential procedures are introduced.
Hypothesis testing and the Bayes factor

Bayesian model selection techniques that aim to find the true model, as opposed to searching for the model that best predicts the data, are fundamentally extensions of Bayesian hypothesis testing strategies. In general, this Bayesian approach to hypothesis testing and model selection relies on determining the amount of evidence found in favor of one hypothesis (or model) over the other, given an observed set of data. Approached from a Bayesian standpoint, this type of problem can be formulated in great generality using a natural, well-defined probabilistic framework that incorporates both model and parameter uncertainty.
Jeffreys (1935) first developed the Bayesian strategy for hypothesis testing and, consequently, for the model selection problem. Bayesian model selection within a model space M = (M1, M2, ..., MJ), where each model is associated with a parameter θj (which may itself be a vector of parameters), incorporates three types of probability distributions: (1) a prior probability distribution for each model, π(Mj); (2) a prior probability distribution for the parameters in each model, π(θj | Mj); and (3) the distribution of the data conditional on both the model and the model's parameters, f(x | θj, Mj). These three probability densities induce the joint distribution p(x, θj, Mj) = f(x | θj, Mj) · π(θj | Mj) · π(Mj), which is instrumental in producing model posterior probabilities. The model posterior probability is the probability that a model is true given the data. It is obtained by marginalizing over the parameter space and using Bayes rule:

    p(Mj | x) = m(x | Mj) π(Mj) / Σ_{i=1}^{J} m(x | Mi) π(Mi),   (1-1)

where m(x | Mj) = ∫ f(x | θj, Mj) π(θj | Mj) dθj is the marginal likelihood of Mj.
Given that interest lies in comparing different models, evidence in favor of one or another model is assessed through pairwise comparisons using posterior odds:

    p(Mj | x) / p(Mk | x) = [m(x | Mj) / m(x | Mk)] · [π(Mj) / π(Mk)].   (1-2)

The first term on the right-hand side of (1-2), m(x | Mj) / m(x | Mk), is known as the Bayes factor comparing model Mj to model Mk, and is denoted by BFjk(x). The Bayes factor provides a measure of the evidence in favor of either model given the data, and updates the model prior odds, π(Mj) / π(Mk), to produce the posterior odds.
Note that the model posterior probability in (1-1) can be expressed as a function of Bayes factors. To illustrate, let model M* ∈ M be a reference model, to which all other models in M are compared. Then dividing both the numerator and denominator in (1-1) by m(x | M*) π(M*) yields

    p(Mj | x) = BFj*(x) [π(Mj) / π(M*)] / (1 + Σ_{Mi ∈ M, Mi ≠ M*} BFi*(x) [π(Mi) / π(M*)]).   (1-3)
Therefore, as the Bayes factor increases, so does the posterior probability of model M_j given the data. If all models have equal prior probabilities, a straightforward criterion for selecting the best among all candidate models is to choose the model with the largest Bayes factor. As such, the Bayes factor is not only useful for identifying models favored by the data, but also provides a means to rank models in terms of their posterior probabilities.

Assuming equal model prior probabilities in (1–3), the prior odds are set equal to one and the posterior odds in (1–2) become p(M_j | x) / p(M_k | x) = BF_jk(x). Based on the Bayes factor, the evidence in favor of one or another model can be interpreted using Table 1-1, adapted from Kass & Raftery (1995).
Table 1-1. Interpretation of BF_jk when contrasting M_j and M_k

ln BF_jk    BF_jk        Evidence in favor of M_j    P(M_j | x)
0 to 2      1 to 3       Weak evidence               0.5 - 0.75
2 to 6      3 to 20      Positive evidence           0.75 - 0.95
6 to 10     20 to 150    Strong evidence             0.95 - 0.99
> 10        > 150        Very strong evidence        > 0.99
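Under equal prior model probabilities, the mapping from marginal likelihoods to Bayes factors and posterior model probabilities in (1–1) and (1–3) is a short computation. The sketch below uses hypothetical marginal likelihood values; only the arithmetic, not the numbers, comes from the text.

```python
import numpy as np

# Hypothetical marginal likelihoods m(x | M_j) for J = 3 candidate models;
# in practice each is an integral over that model's parameters.
m = np.array([2.0e-5, 6.5e-5, 1.0e-6])
prior = np.ones(3) / 3                 # equal model prior probabilities

# Posterior model probabilities, equation (1-1)
post = m * prior
post /= post.sum()

# The same probabilities via Bayes factors against a reference model M_* = M_1,
# as in equation (1-3): BF_{j*} = m(x | M_j) / m(x | M_*)
bf = m / m[0]
weights = bf * (prior / prior[0])
post_via_bf = weights / weights.sum()

print(post)                            # the second model dominates here
```

The two routes agree exactly, which is the point of (1–3): posterior model probabilities are a renormalization of prior-weighted Bayes factors.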
Bayesian hypothesis testing and model selection procedures based on Bayes factors and posterior probabilities have several desirable features. First, these methods have a straightforward interpretation, since the Bayes factor is an increasing function of model (or hypothesis) posterior probabilities. Second, these methods can yield frequentist-matching confidence bounds when implemented with good testing priors (Kass & Wasserman 1996), such as the reference priors of Berger & Bernardo (1992). Third, since the Bayes factor contains the ratio of marginal densities, it automatically penalizes complexity according to the number of parameters in each model; this property is known as Ockham's razor (Kass & Raftery 1995). Fourth, the use of Bayes factors does
not require nested hypotheses (i.e., the null hypothesis nested within the alternative), standard distributions, or regular asymptotics (e.g., convergence to normal or chi-squared distributions) (Berger et al. 2001). In contrast, this is not always the case with frequentist and likelihood ratio tests, which depend on known distributions (at least asymptotically) of the test statistic. Finally, Bayesian hypothesis testing procedures using the Bayes factor can naturally incorporate model uncertainty, using the Bayesian machinery to produce model-averaged predictions and confidence bounds (Kass & Raftery 1995). It is not clear how to account for this uncertainty rigorously in a fully frequentist approach.
1.3 Overview of the Chapters

In the chapters that follow, we develop a flexible and straightforward hierarchical Bayesian framework for occupancy models, allowing us to obtain estimates and conduct robust testing from an "objective" Bayesian perspective. Latent mixtures of random variables supply the foundation for our methodology. This approach provides a means to directly incorporate spatial dependency and temporal heterogeneity through predictors that characterize either the habitat quality of a given site or the detectability features of a particular survey conducted at a specific site. The Bayesian testing methods we propose are (1) a fully automatic and objective method for occupancy model selection, and (2) an objective Bayesian testing tool that accounts for multiple testing and for polynomial hierarchical structure in the space of predictors.
Chapter 2 introduces the methods proposed for estimation of occupancy model parameters. A simple estimation procedure for the single-season occupancy model with covariates is formulated using both probit and logit links. Based on this simple version, an extension is provided to cope with metapopulation dynamics by introducing persistence and colonization processes. Finally, given the fundamental role that spatial dependence plays in defining temporal dynamics, a strategy to seamlessly account for this feature in our framework is introduced.
Chapter 3 develops a new, fully automatic and objective method for occupancy model selection that is asymptotically consistent for variable selection and averts the use of tuning parameters. In this chapter, we first describe some issues surrounding multimodel inference and provide insight into objective Bayesian inferential procedures. Then, building on modern methods for "objective" Bayesian testing to generate priors on the parameter space, the intrinsic priors for the parameters of the occupancy model are obtained. These are used in the construction of an "objective" variable selection algorithm tailored to the occupancy model framework.
Chapter 4 touches on two important and interconnected issues in model testing that have yet to receive the attention they deserve: (1) controlling for false discovery in hypothesis testing given the size of the model space, i.e., given the number of tests performed; and (2) non-invariance to location transformations of variable selection procedures in the presence of polynomial predictor structure. Both elements depend on the definition of prior probabilities on the model space. In this chapter, a set of priors on the model space and a stochastic search algorithm are proposed. Together, these control for model multiplicity and account for the polynomial structure among the predictors.
CHAPTER 2
MODEL ESTIMATION METHODS
"Data! Data! Data!" he cried impatiently. "I can't make bricks without clay."

–Sherlock Holmes, The Adventure of the Copper Beeches
2.1 Introduction
Prior to the introduction of site-occupancy models (MacKenzie et al. 2002; Tyre et al. 2003), presence-absence data from ecological monitoring programs were used without any adjustment to assess the impact of management actions, to observe trends in species distributions through space and time, or to model the habitat of a species (Tyre et al. 2003). These efforts, however, were suspect because false-negative errors were not accounted for. False-negative errors occur whenever a species is present at a site but goes undetected during the survey.
Site-occupancy models, developed independently by MacKenzie et al. (2002) and Tyre et al. (2003), extend simple binary-regression models to account for the aforementioned errors in detection of individuals, common in surveys of animal or plant populations. Since their introduction, the site-occupancy framework has been used in countless applications, and numerous extensions of it have been proposed. Occupancy models improve upon traditional binary regression by analyzing observed detection and partially observed presence as two separate but related components. In the site-occupancy setting, the chosen locations are surveyed repeatedly in order to reduce the ambiguity caused by the observed zeros. This approach therefore allows simultaneous estimation of the probabilities of presence (occurrence) and detection.
Several extensions of the basic single-season, closed-population model are now available. The occupancy approach has been used to determine species range dynamics (MacKenzie et al. 2003; Royle & Kery 2007), to understand age/stage structure within populations (Nichols et al. 2007), and to model species co-occurrence (MacKenzie et al. 2004; Ovaskainen et al. 2010; Waddle et al. 2010). It has even been suggested as a surrogate for abundance (MacKenzie & Nichols 2004). MacKenzie et al. suggested using occupancy models to conduct large-scale monitoring programs, since this approach avoids the high costs associated with surveys designed for abundance estimation. Also, to investigate metapopulation dynamics, occupancy models improve upon incidence function models (Hanski 1994), which are often parameterized in terms of site (or patch) occupancy and assume homogeneous patches and a metapopulation at colonization-extinction equilibrium.
Nevertheless, the implementation of Bayesian occupancy models commonly resorts to sampling strategies that depend on hyperparameters, subjective prior elicitation, and relatively elaborate algorithms. From the standpoint of practitioners, these are often treated as black-box methods (Kery 2010); as such, the potential for using the methodology incorrectly is high. Commonly, these models are fitted with packages such as BUGS or JAGS. Although the ease of use of such packages has led to widespread adoption of the methods, the user may be oblivious to the assumptions underpinning the analysis.
We believe that providing straightforward and robust alternatives for implementing these methods will help practitioners gain insight into how occupancy modeling, and more generally Bayesian modeling, is performed. In this chapter, using a simple Gibbs sampling approach, we first develop a versatile method to estimate the single-season, closed-population site-occupancy model, then extend it to analyze metapopulation dynamics through time, and finally provide a further adaptation that incorporates spatial dependence among neighboring sites.

2.1.1 The Occupancy Model
In this section we first introduce our results published in Dorazio & Taylor-Rodríguez (2012) and build upon them to propose relevant extensions. Under the standard sampling protocol for collecting site-occupancy data, J > 1 independent surveys are conducted at each of N representative sample locations (sites), noting whether a species is detected or not detected during each survey. Let y_ij denote a binary random variable that indicates detection (y_ij = 1) or non-detection (y_ij = 0) during the j-th survey of site i. Without loss of generality, J may be assumed constant among all N sites to simplify description of the model; in practice, site-specific variation in J poses no real difficulties and is easily accommodated. This sampling protocol therefore yields an N × J matrix Y of detection/non-detection data.
Note that the observed process y_ij is an imperfect representation of the underlying occupancy (presence) process. Hence, letting z_i denote the presence indicator at site i, this model specification can be represented through the hierarchy

y_ij | z_i, λ ~ Bernoulli(z_i p_ij)
z_i | α ~ Bernoulli(ψ_i),   (2–1)

where p_ij is the probability of correctly classifying the i-th site as occupied during the j-th survey, and ψ_i is the presence probability at the i-th site. The graphical representation of this process is given in Figure 2-1.
process is
ψi
zi
yi
pi
Figure 2-1 Graphical representation occupancy model
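To make the two-level structure in (2–1) concrete, the following sketch simulates detection histories under constant occupancy and detection probabilities (the numeric values are our illustrative choices, not from the text); it also shows why collapsing the J surveys into a single detected/not-detected outcome understates occupancy.

```python
import numpy as np

rng = np.random.default_rng(0)
N, J = 200, 4              # sites and surveys per site (illustrative sizes)
psi, p = 0.6, 0.4          # assumed constant occupancy and detection probabilities

z = rng.binomial(1, psi, size=N)                    # latent presence z_i
y = rng.binomial(1, p * z[:, None], size=(N, J))    # detections y_ij | z_i

# Occupied sites can still produce all-zero histories (prob (1-p)^J = 0.13 here),
# so the naive estimate based on "ever detected" is biased downward.
naive = (y.sum(axis=1) > 0).mean()
print(naive, z.mean())
```

Since detection requires presence, the naive estimate can never exceed the realized occupancy rate; the gap is exactly the ambiguity the repeated surveys are meant to resolve.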
Probabilities of detection and occupancy can both be made functions of covariates, and their corresponding parameter estimates can be obtained using either a maximum likelihood or a Bayesian approach. Existing methodologies from the likelihood perspective marginalize over the latent occupancy process (z_i), making the estimation procedure depend only on the detections. Most Bayesian strategies rely on MCMC algorithms that require parameter prior specification and tuning. However, Albert & Chib (1993) proposed a longstanding strategy in the Bayesian statistical literature for modeling binary outcomes using a simple Gibbs sampler. This procedure, described in the following section, can be extrapolated to the occupancy setting, eliminating the need for tuning parameters and subjective prior elicitation.

2.1.2 Data Augmentation Algorithms for Binary Models
Probit model: data augmentation with latent normal variables
At the root of Albert & Chib's algorithm lies the idea that if the observed outcome is 0, the latent variable can be simulated from a truncated normal distribution with support (−∞, 0], and if the outcome is 1, the latent variable can be simulated from a truncated normal distribution on (0, ∞). To understand the reasoning behind this strategy, let Y ~ Bern(Φ(x'β)) and V = x'β + ε, with ε ~ N(0, 1). In such a case, note that

Pr(y = 1 | x'β) = Φ(x'β) = Pr(ε < x'β)
                = Pr(ε > −x'β)
                = Pr(v > 0 | x'β).

Thus, whenever y = 1 then v > 0, and v ≤ 0 otherwise. In other words, we may think of y as a truncated version of v. We can therefore sample iteratively, alternating between the latent variables conditioned on the model parameters and vice versa, to draw from the desired posterior densities. By augmenting the data with the latent variables, we obtain full conditional posterior distributions for the model parameters that are easy to draw from (equation 2–3 below). Further, since we may sample the latent variables, we may also sample the parameters.
Given some initial values for the model parameters, values for the latent variables can be simulated. Conditioning on the latter, it is then possible to draw samples from the parameters' posterior distributions; these samples can be used to generate new values for the latent variables, and so on. The process is iterated using a Gibbs sampling approach. After a suitably large number of iterations, it yields draws from the joint posterior distribution of the latent variables and the model parameters, conditional on the observed outcome values. We formalize the procedure below.
Assume that the outcomes Y_1, Y_2, ..., Y_n are such that Y_i | x_i, β ~ Bernoulli(q_i), where q_i = Φ(x_i'β) is the standard normal CDF evaluated at x_i'β, and where x_i and β are the p-dimensional vectors of observed covariates for the i-th observation and their corresponding parameters, respectively.

Now let y = (y_1, y_2, ..., y_n) be the vector of observed outcomes, and let [β] represent the prior distribution of the model parameters. The posterior distribution of β is then given by

[β | y] ∝ [β] ∏_{i=1}^n Φ(x_i'β)^{y_i} (1 − Φ(x_i'β))^{1−y_i},   (2–2)
which is intractable. Nevertheless, introducing latent random variables V = (V_1, ..., V_n) such that V_i ~ N(x_i'β, 1), with Y_i = 1 whenever V_i > 0 and Y_i = 0 whenever V_i ≤ 0, resolves this difficulty. This yields

[β, v | y] ∝ [β] ∏_{i=1}^n φ(v_i | x_i'β, 1) {I(v_i ≤ 0) I(y_i = 0) + I(v_i > 0) I(y_i = 1)},   (2–3)

where φ(x | µ, τ²) is the probability density function of a normal random variable x with mean µ and variance τ². The data augmentation device works because [β | y] = ∫ [β, v | y] dv; hence, if we sample from the joint posterior (2–3) and extract only the sampled values of β, they correspond to samples from [β | y].
From the expression above it is possible to obtain the full conditional distributions for V and β, so a Gibbs sampler can be proposed. For example, if we use a flat prior for β (i.e., [β] ∝ 1), the full conditionals are given by

β | V, y ~ MVN_p((X'X)^{−1} X'V, (X'X)^{−1})   (2–4)
V | β, y ~ ∏_{i=1}^n trN(x_i'β, 1, Q_i),   (2–5)

where MVN_p(µ, Σ) denotes the p-variate normal distribution with mean vector µ and variance-covariance matrix Σ, and trN(ξ, σ², Q) denotes the truncated normal distribution with mean ξ, variance σ², and truncation region Q. For each i = 1, 2, ..., n, the support of the truncated variable is Q_i = (−∞, 0] if y_i = 0 and Q_i = (0, ∞) otherwise. Note that conjugate normal priors could be used instead.

At iteration m + 1, the Gibbs sampler draws V^(m+1) conditional on β^(m) from (2–5), and then samples β^(m+1) conditional on V^(m+1) from (2–4). This process is repeated for m = 0, 1, ..., n_sim, where n_sim is the number of iterations of the Gibbs sampler.
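A minimal sketch of this two-step sampler with a flat prior on β follows; the function name, simulated data, and iteration counts are our illustrative choices, not from the text.

```python
import numpy as np
from scipy.stats import norm, truncnorm

def probit_gibbs(X, y, n_iter=1000, seed=0):
    """Albert & Chib (1993)-style Gibbs sampler for probit regression, flat prior."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    S = np.linalg.inv(X.T @ X)          # posterior covariance (X'X)^{-1}
    L = np.linalg.cholesky(S)
    beta = np.zeros(p)
    draws = np.empty((n_iter, p))
    for m in range(n_iter):
        mu = X @ beta
        # V | beta, y: N(mu, 1) truncated to (0, inf) if y = 1, (-inf, 0] if y = 0
        lo = np.where(y == 1, -mu, -np.inf)
        hi = np.where(y == 1, np.inf, -mu)
        v = mu + truncnorm.rvs(lo, hi, size=n, random_state=rng)
        # beta | V, y: MVN((X'X)^{-1} X'V, (X'X)^{-1}), equation (2-4)
        beta = S @ (X.T @ v) + L @ rng.standard_normal(p)
        draws[m] = beta
    return draws

# Simulated check: roughly recover beta = (0.5, -1) from probit data
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(500), rng.standard_normal(500)])
y = rng.binomial(1, norm.cdf(X @ np.array([0.5, -1.0])))
draws = probit_gibbs(X, y, n_iter=500)
print(draws[100:].mean(axis=0))
```

Note that no tuning is involved: both conditional draws are from standard families, which is the practical appeal of the algorithm emphasized in the text.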
Logit model: data augmentation with latent Polya-gamma variables
Recently, Polson et al. (2013) developed a novel and efficient approach to Bayesian inference for logistic models using Polya-gamma latent variables, which is analogous to the Albert & Chib algorithm. The result arises from what the authors refer to as the Polya-gamma distribution. To construct a random variable from this family, consider the infinite mixture of an iid sequence of Exp(1) random variables {E_k}_{k=1}^∞ given by

ω = (2 / π²) Σ_{k=1}^∞ E_k / (2k − 1)²,

with probability density function

g(ω) = Σ_{k=0}^∞ (−1)^k [(2k + 1) / √(2πω³)] e^{−(2k+1)²/(8ω)} I(ω ∈ (0, ∞))   (2–6)

and Laplace transform E[e^{−tω}] = cosh^{−1}(√(t/2)).
The Polya-gamma family of densities is obtained through an exponential tilting of the density g in (2–6). These densities, indexed by c ≥ 0, are characterized by

f(ω | c) = cosh(c/2) e^{−c²ω/2} g(ω).

The likelihood for the binomial logistic model can be expressed in terms of latent Polya-gamma variables as follows. Assume y_i ~ Bernoulli(δ_i), with predictors x_i' = (x_i1, ..., x_ip) and success probability δ_i = e^{x_i'β} / (1 + e^{x_i'β}). The posterior for the model parameters can then be represented as

[β | y] = [β] ∏_{i=1}^n δ_i^{y_i} (1 − δ_i)^{1−y_i} / c(y),

where c(y) is the normalizing constant.

To facilitate sampling, a data augmentation step can be performed by introducing Polya-gamma random variables ω_i ~ PG(1, x_i'β). This yields the data-augmented posterior

[β, ω | y] = (∏_{i=1}^n Pr(y_i | β)) f(ω | x'β) [β] / c(y),   (2–7)

such that [β | y] = ∫_{R_+^n} [β, ω | y] dω.
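The PG(1, c) family can be simulated naively by truncating an infinite sum of exponential variables; for general c, the k-th denominator becomes (k − 1/2)² + c²/(4π²), as in Polson et al. (2013). The sketch below is for intuition only and is our own illustration; practical implementations use exact rejection samplers rather than truncation.

```python
import numpy as np

def rpg1_naive(c, n_terms=200, rng=None):
    """One approximate draw of omega ~ PG(1, c) via a truncated infinite sum."""
    rng = np.random.default_rng(rng)
    k = np.arange(1, n_terms + 1)
    g = rng.exponential(1.0, size=n_terms)           # g_k ~ Exp(1)
    return np.sum(g / ((k - 0.5) ** 2 + (c / (2 * np.pi)) ** 2)) / (2 * np.pi ** 2)

# Monte Carlo check against the known mean E[PG(1, c)] = tanh(c/2) / (2c)
rng = np.random.default_rng(2)
c = 1.5
draws = np.array([rpg1_naive(c, rng=rng) for _ in range(20000)])
print(draws.mean(), np.tanh(c / 2) / (2 * c))
```

The agreement of the Monte Carlo mean with tanh(c/2)/(2c) is a useful sanity check on any PG(1, c) sampler.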
Thus, from the augmented model, the full conditional density for β is given by

[β | ω, y] ∝ (∏_{i=1}^n Pr(y_i | β)) f(ω | x'β) [β]
           = [β] ∏_{i=1}^n [(e^{x_i'β})^{y_i} / (1 + e^{x_i'β})] cosh(|x_i'β| / 2) exp[−(x_i'β)² ω_i / 2] g(ω_i).   (2–8)

This expression yields a normal posterior distribution if β is assigned a flat or normal prior. Hence, a two-step sampling strategy analogous to that of Albert & Chib (1993) can be used to estimate β in the occupancy framework.

2.2 Single Season Occupancy
Let p_ij = F(q_ij'λ) be the probability of correctly classifying the i-th site as occupied during the j-th survey, conditional on the site being occupied, and let ψ_i = F(x_i'α) be the presence probability at the i-th site. Here F(·) denotes the inverse link function (probit or logit) connecting the response to the predictors, and λ and α denote, respectively, the r-variate and p-variate coefficient vectors for the detection and presence probabilities. The joint posterior of the presence indicators and the model parameters is then

π*(z, α, λ | y) ∝ π_α(α) π_λ(λ) ∏_{i=1}^N F(x_i'α)^{z_i} (1 − F(x_i'α))^{1−z_i} × ∏_{j=1}^J (z_i F(q_ij'λ))^{y_ij} (1 − z_i F(q_ij'λ))^{1−y_ij}.   (2–9)
As in the simple probit regression problem, this posterior is intractable, so sampling from it directly is not possible. However, the procedures of Albert & Chib for the probit model and of Polson et al. for the logit model can be extended to generate an MCMC sampling strategy for the occupancy problem. In what follows we use this framework to develop samplers from which occupancy parameter estimates can be obtained for both probit and logit link functions. These algorithms have the added benefit that they require neither tuning parameters nor subjective prior elicitation.

2.2.1 Probit Link Model
To extend Albert & Chib's algorithm to the occupancy framework with a probit link, we first introduce two sets of latent normal variables, denoted by w_ij and v_i, used to augment the data. The corresponding hierarchy is

y_ij | z_i, w_ij ~ Bernoulli(z_i I(w_ij > 0))
w_ij | λ ~ N(q_ij'λ, 1),   λ ~ [λ]
z_i = I(v_i > 0)
v_i | α ~ N(x_i'α, 1),   α ~ [α],   (2–10)
represented by the directed graph in Figure 2-2.

Figure 2-2. Graphical representation of the occupancy model after data augmentation (nodes α → v_i → z_i → y_ij ← w_ij ← λ).
Under this hierarchical model, the joint density is given by

π*(z, v, α, w, λ | y) ∝ C_y π_α(α) π_λ(λ) ∏_{i=1}^N φ(v_i; x_i'α, 1) I(v_i > 0)^{z_i} I(v_i ≤ 0)^{1−z_i} × ∏_{j=1}^J (z_i I(w_ij > 0))^{y_ij} (1 − z_i I(w_ij > 0))^{1−y_ij} φ(w_ij; q_ij'λ, 1).   (2–11)
The full conditional densities derived from the posterior in equation (2–11) are detailed below.

1. The full conditional of z, obtained after integrating out v and w, is

f(z | α, λ) = ∏_{i=1}^N f(z_i | α, λ) = ∏_{i=1}^N (ψ*_i)^{z_i} (1 − ψ*_i)^{1−z_i},

where ψ*_i = ψ_i ∏_{j=1}^J p_ij^{y_ij} (1 − p_ij)^{1−y_ij} / [ψ_i ∏_{j=1}^J p_ij^{y_ij} (1 − p_ij)^{1−y_ij} + (1 − ψ_i) ∏_{j=1}^J I(y_ij = 0)].   (2–12)
2. f(v | z, α) = ∏_{i=1}^N f(v_i | z_i, α) = ∏_{i=1}^N trN(x_i'α, 1, A_i),

where A_i = (−∞, 0] if z_i = 0 and A_i = (0, ∞) if z_i = 1,   (2–13)

and trN(µ, σ², A) denotes the pdf of a truncated normal random variable with mean µ, variance σ², and truncation region A.
3. f(α | v) = φ_p(α; Σ_α X'v, Σ_α),   (2–14)

where Σ_α = (X'X)^{−1} and φ_k(x; µ, Σ) represents the k-variate normal density with mean vector µ and variance matrix Σ.

4. f(w | y, z, λ) = ∏_{i=1}^N ∏_{j=1}^J f(w_ij | y_ij, z_i, λ) = ∏_{i=1}^N ∏_{j=1}^J trN(q_ij'λ, 1, B_ij),

where B_ij = (−∞, ∞) if z_i = 0; (−∞, 0] if z_i = 1 and y_ij = 0; and (0, ∞) if z_i = 1 and y_ij = 1.   (2–15)

5. f(λ | w) = φ_r(λ; Σ_λ Q'w, Σ_λ),   (2–16)

where Σ_λ = (Q'Q)^{−1}.
The Gibbs sampling algorithm for the model can then be summarized as follows:

1. Initialize z, α, v, λ, and w.
2. Sample z_i ~ Bern(ψ*_i).
3. Sample v_i from a truncated normal with µ = x_i'α, σ = 1, and truncation region depending on z_i.
4. Sample α ~ N(Σ_α X'v, Σ_α), with Σ_α = (X'X)^{−1}.
5. Sample w_ij from a truncated normal with µ = q_ij'λ, σ = 1, and truncation region depending on y_ij and z_i.
6. Sample λ ~ N(Σ_λ Q'w, Σ_λ), with Σ_λ = (Q'Q)^{−1}.
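A compact sketch of steps 1 through 6 on simulated data follows; all names, sizes, the clipping guard, and the flat priors on α and λ are our illustrative choices, not prescriptions from the text.

```python
import numpy as np
from scipy.stats import norm, truncnorm

def occu_probit_gibbs(y, X, Q, n_iter=500, seed=0):
    """Gibbs sampler for the single-season probit occupancy model (steps 1-6).
    y: N x J detections; X: N x p occupancy design; Q: N x J x r detection design."""
    rng = np.random.default_rng(seed)
    N, J = y.shape
    p, r = X.shape[1], Q.shape[2]
    Qf = Q.reshape(N * J, r)                       # stacked detection design
    Sa = np.linalg.inv(X.T @ X); La = np.linalg.cholesky(Sa)
    Sl = np.linalg.inv(Qf.T @ Qf); Ll = np.linalg.cholesky(Sl)
    alpha, lam = np.zeros(p), np.zeros(r)
    detected = y.sum(axis=1) > 0
    out_a, out_l = np.empty((n_iter, p)), np.empty((n_iter, r))
    for s in range(n_iter):
        psi = np.clip(norm.cdf(X @ alpha), 1e-10, 1 - 1e-10)
        pij = norm.cdf((Qf @ lam).reshape(N, J))
        # Step 2: z_i | rest -- detected sites are occupied; otherwise use (2-12)
        num = psi * np.prod(1.0 - pij, axis=1)
        psi_star = num / (num + (1.0 - psi))
        z = np.where(detected, 1, rng.binomial(1, psi_star))
        # Step 3: v_i | z_i, truncated normal around x_i'alpha
        mu_v = X @ alpha
        lo = np.where(z == 1, -mu_v, -np.inf)
        hi = np.where(z == 1, np.inf, -mu_v)
        v = mu_v + truncnorm.rvs(lo, hi, size=N, random_state=rng)
        # Step 4: alpha | v ~ N(Sa X'v, Sa)
        alpha = Sa @ (X.T @ v) + La @ rng.standard_normal(p)
        # Step 5: w_ij | z_i, y_ij -- unconstrained when z_i = 0
        mu_w = (Qf @ lam).reshape(N, J)
        zJ = np.repeat(z[:, None], J, axis=1)
        lo_w = np.where((zJ == 1) & (y == 1), -mu_w, -np.inf)
        hi_w = np.where((zJ == 1) & (y == 0), -mu_w, np.inf)
        w = mu_w + truncnorm.rvs(lo_w, hi_w, size=(N, J), random_state=rng)
        # Step 6: lambda | w ~ N(Sl Q'w, Sl)
        lam = Sl @ (Qf.T @ w.ravel()) + Ll @ rng.standard_normal(r)
        out_a[s], out_l[s] = alpha, lam
    return out_a, out_l

# Simulated example: intercept-only detection, one occupancy covariate
rng = np.random.default_rng(3)
N, J = 300, 4
X = np.column_stack([np.ones(N), rng.standard_normal(N)])
Q = np.ones((N, J, 1))
z_true = rng.binomial(1, norm.cdf(X @ np.array([0.3, 1.0])))
y = rng.binomial(1, 0.55 * z_true[:, None], size=(N, J))
a_draws, l_draws = occu_probit_gibbs(y, X, Q, n_iter=400)
print(a_draws[100:].mean(axis=0), norm.cdf(l_draws[100:].mean()))
```

Detected sites are fixed at z_i = 1 (their ψ*_i equals one), so the Bernoulli draw is only consequential for sites with all-zero histories, exactly as (2–12) prescribes.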
2.2.2 Logit Link Model
Now turning to the logit link version of the occupancy model, again let y_ij be the indicator variable marking detection of the target species on the j-th survey at the i-th site, and let z_i be the indicator variable denoting presence (z_i = 1) or absence (z_i = 0) of the target species at the i-th site. The model is now defined by

y_ij | z_i, λ ~ Bernoulli(z_i p_ij), where p_ij = e^{q_ij'λ} / (1 + e^{q_ij'λ}),   λ ~ [λ]
z_i | α ~ Bernoulli(ψ_i), where ψ_i = e^{x_i'α} / (1 + e^{x_i'α}),   α ~ [α].
In this hierarchy, the contribution of a single site to the likelihood is

L_i(α, λ) = [(e^{x_i'α})^{z_i} / (1 + e^{x_i'α})] ∏_{j=1}^J (z_i e^{q_ij'λ} / (1 + e^{q_ij'λ}))^{y_ij} (1 − z_i e^{q_ij'λ} / (1 + e^{q_ij'λ}))^{1−y_ij}.   (2–17)
(2ndash17)
As in the probit case we data-augment the likelihood with two separate sets
of covariates however in this case each of them having Polya-gamma distribution
Augmenting the model and using the posterior in (2ndash7) the joint is
[ zαλ|y ] prop [α] [λ]
Nprodi=1
(ex
primeiα)zi
1 + exprimeiαcosh
(∣∣xprime
iα∣∣
2
)exp
[minus(x
prime
iα)2vi
2
]g(vi)times
Jprodj=1
(zi
eqprimeijλ
1 + eqprimeijλ
)yij(1minus zi
eqprimeijλ
1 + eqprimeijλ
)1minusyij
times
cosh
(∣∣ziqprimeijλ∣∣2
)exp
[minus(ziq
primeijλ)2wij
2
]g(wij)
(2ndash18)
The full conditionals for z, α, v, λ, and w obtained from (2–18) are provided below.

1. The full conditional for z, obtained after marginalizing out the latent variables, is

f(z | α, λ) = ∏_{i=1}^N f(z_i | α, λ) = ∏_{i=1}^N (ψ*_i)^{z_i} (1 − ψ*_i)^{1−z_i},

where ψ*_i = ψ_i ∏_{j=1}^J p_ij^{y_ij} (1 − p_ij)^{1−y_ij} / [ψ_i ∏_{j=1}^J p_ij^{y_ij} (1 − p_ij)^{1−y_ij} + (1 − ψ_i) ∏_{j=1}^J I(y_ij = 0)].   (2–19)
2. Using the result derived in Polson et al. (2013), we have

f(v | z, α) = ∏_{i=1}^N f(v_i | z_i, α) = ∏_{i=1}^N PG(1, x_i'α).   (2–20)

3. f(α | v) ∝ [α] ∏_{i=1}^N exp[z_i x_i'α − x_i'α / 2 − (x_i'α)² v_i / 2].   (2–21)

4. By the same result as that used for v, the full conditional for w is

f(w | y, z, λ) = ∏_{i=1}^N ∏_{j=1}^J f(w_ij | y_ij, z_i, λ) = (∏_{i∈S_1} ∏_{j=1}^J PG(1, |q_ij'λ|)) (∏_{i∉S_1} ∏_{j=1}^J PG(1, 0)),   (2–22)

with S_1 = {i ∈ {1, 2, ..., N} : z_i = 1}.
5. f(λ | z, y, w) ∝ [λ] ∏_{i∈S_1} ∏_{j=1}^J exp[y_ij q_ij'λ − q_ij'λ / 2 − (q_ij'λ)² w_ij / 2],   (2–23)

with S_1 as defined above.
The Gibbs sampling algorithm is analogous to the one with a probit link, but with the obvious modifications to incorporate Polya-gamma instead of normal latent variables.

2.3 Temporal Dynamics and Spatial Structure
The uses of the single-season model are limited to very specific problems. In particular, the assumptions of the basic model may become too restrictive or unrealistic whenever the study period extends over multiple years or seasons, especially given the increasingly changing environmental conditions that most ecosystems are currently experiencing.

Among the many extensions found in the literature, one that we consider particularly relevant incorporates heterogeneous occupancy probabilities through time. Extensions of
site-occupancy models that incorporate temporally varying probabilities can be traced back to Hanski (1994). The heterogeneity of occupancy probabilities through time arises from local colonization and extinction processes. MacKenzie et al. (2003) proposed an alternative to Hanski's approach in order to incorporate imperfect detection. Their method is flexible enough to let detection, occurrence, survival, and colonization probabilities each depend on its own set of covariates, using likelihood-based estimation for the model parameters.
However, the approach of MacKenzie et al. presents two drawbacks. First, the uncertainty assessment for maximum likelihood parameter estimates relies on asymptotic results (obtained from the delta method), making it sensitive to sample size. Second, to obtain parameter estimates, the latent occupancy process is marginalized out of the likelihood, leading to the usual zero-inflated Bernoulli model. Although this is a convenient strategy for solving the estimation problem, the latent state variables (occupancy indicators) are no longer available, and as such, finite-sample estimates cannot be calculated unless an additional (and computationally expensive) parametric bootstrap step is performed (Royle & Kery 2007). Additionally, because the occupancy process is integrated out, the likelihood approach precludes incorporating additional structural dependence through random effects. Thus, the model cannot account for spatial dependence, which plays a fundamental role in this setting.
To work around some of the shortcomings encountered when fitting dynamic occupancy models via likelihood-based methods, Royle & Kery developed what they refer to as a dynamic occupancy state-space model (DOSS), alluding to the conceptual similarity between this model and the class of state-space models found in the time series literature. In particular, this model retains the latent process (occupancy indicators), making it possible to obtain small-sample estimates and, eventually, to generate extensions that incorporate structure in time and/or space through random effects.
The data used in the DOSS model come from standard repeated presence/absence surveys with N sampling locations (patches or sites), indexed by i = 1, 2, ..., N. Within a given season (e.g., year, month, or week, depending on the biology of the species), each sampling location is visited (surveyed) j = 1, 2, ..., J times, and this process is repeated for t = 1, 2, ..., T seasons. An important assumption here is that the site occupancy status is closed within, but not across, seasons.
As is usual in the occupancy modeling framework, two different processes are considered. The first is the detection process per site-visit-season combination, denoted by y_ijt. The y_ijt are indicator variables that take the value 1 if the species is detected at site i on survey j in season t, and 0 otherwise. These detection indicators are assumed to be independent within each site and season. The second response considered is the partially observed presence (occupancy) indicator z_it. These indicators are equal to 1 whenever y_ijt = 1 for one or more of the visits made to site i during season t; otherwise the values of the z_it's are unknown. Royle & Kery refer to these two processes as the observation (y_ijt) and state (z_it) models.
In this setting, the parameters of greatest interest are the occurrence (site occupancy) probabilities, denoted by ψ_it, as well as those representing the population dynamics, which are accounted for by introducing changes in occupancy status over time through local colonization and survival. That is, if a site was not occupied at season t − 1, at season t it can either be colonized or remain unoccupied. On the other hand, if the site was occupied at season t − 1, it can remain occupied (survival) or become abandoned (local extinction) at season t. The probabilities of survival and colonization from season t − 1 to season t at the i-th site are denoted by θ_i(t−1) and γ_i(t−1), respectively.

During the initial season, the model for the state process is expressed in terms of the occupancy probability (equation 2–24); for subsequent seasons, the state process is specified in terms of survival and colonization probabilities (equation 2–25):

z_i1 ~ Bernoulli(ψ_i1)   (2–24)
z_it | z_i(t−1) ~ Bernoulli(z_i(t−1) θ_i(t−1) + (1 − z_i(t−1)) γ_i(t−1)).   (2–25)

The observation model, conditional on the latent process z_it, is defined by

y_ijt | z_it ~ Bernoulli(z_it p_ijt).   (2–26)
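Equations (2–24) through (2–26) define a simple two-state Markov chain per site. Simulating from it makes the closure-within-season assumption explicit; the constants below are illustrative and not from the text.

```python
import numpy as np

rng = np.random.default_rng(4)
N, J, T = 100, 3, 10
psi1, theta, gamma, p = 0.5, 0.8, 0.2, 0.5   # assumed constant probabilities

z = np.empty((N, T), dtype=int)
z[:, 0] = rng.binomial(1, psi1, size=N)                     # (2-24)
for t in range(1, T):
    prob = z[:, t - 1] * theta + (1 - z[:, t - 1]) * gamma  # (2-25)
    z[:, t] = rng.binomial(1, prob)
y = rng.binomial(1, p * z[:, None, :], size=(N, J, T))      # (2-26)

# Occupancy settles near the chain's stationary value gamma / (1 - theta + gamma)
print(z.mean(axis=0), gamma / (1 - theta + gamma))
```

Within each season the z_it are held fixed across the J surveys (closure), while occupancy evolves across seasons via the persistence/colonization mixture in (2–25).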
Royle & Kery induce heterogeneity by site, site-season, and site-survey-season, respectively, in the occupancy, survival and colonization, and detection probabilities through the following specification:

logit(ψ_i1) = x_1 + r_i,   r_i ~ N(0, σ²_ψ),   logit^{−1}(x_1) ~ Unif(0, 1)
logit(θ_it) = a_t + u_i,   u_i ~ N(0, σ²_θ),   logit^{−1}(a_t) ~ Unif(0, 1)
logit(γ_it) = b_t + v_i,   v_i ~ N(0, σ²_γ),   logit^{−1}(b_t) ~ Unif(0, 1)
logit(p_ijt) = c_t + w_ij,   w_ij ~ N(0, σ²_p),   logit^{−1}(c_t) ~ Unif(0, 1),   (2–27)

where x_1, a_t, b_t, and c_t are the season fixed effects for the corresponding probabilities, and r_i, u_i, v_i, and w_ij are the site and site-survey random effects, respectively. Additionally, all variance components are assigned the usual inverse gamma priors.
As the authors state, this formulation can be regarded as "suitably vague"; however, it is also restrictive in the sense that it is not clear how to incorporate additional covariates while preserving the straightforward sampling strategy.

2.3.1 Dynamic Mixture Occupancy State-Space Model
We assume that the probabilities of occupancy, survival, colonization, and detection are all functions of linear combinations of covariates. However, our setup varies slightly from that considered by Royle & Kery (2007). In essence, we modify the way in which the estimates of the survival and colonization probabilities are obtained. Our model incorporates the notion that occupancy of a site occupied during the previous season takes place through persistence, where we define persistence as a function of both survival and colonization. That is, a site occupied at time t may again be occupied at time t + 1 if the current settlers survive, if they perish and new settlers colonize simultaneously, or if both the current settlers survive and new ones colonize.

Our functional forms of choice are again the probit and logit link functions. This means that each probability of interest, which we will refer to for illustration as δ, is linked to a linear combination of covariates x'ξ through the relationship δ = F(x'ξ), where F(·) denotes the inverse link function. This particular assumption facilitates relating the data augmentation algorithms of Albert & Chib and Polson et al. to Royle & Kery's DOSS model. We refer to this extension of Royle & Kery's model as the Dynamic Mixture Occupancy State-Space (DYMOSS) model.
As before, let y_ijt be the indicator variable marking detection of the target species on the j-th survey at the i-th site during the t-th season, and let z_it be the indicator variable denoting presence (z_it = 1) or absence (z_it = 0) of the target species at the i-th site in the t-th season, with i ∈ {1, ..., N}, j ∈ {1, ..., J}, and t ∈ {1, ..., T}. Additionally, assume that the probabilities of occupancy at time t = 1, persistence, colonization, and detection are all functions of covariates, with corresponding parameter vectors α, Δ^(s) = {δ^(s)_{t−1}}_{t=2}^T, B^(c) = {β^(c)_{t−1}}_{t=2}^T, and Λ = {λ_t}_{t=1}^T, and covariate matrices X^(o), X = {X_{t−1}}_{t=2}^T, and Q = {Q_t}_{t=1}^T, respectively. Using this notation, our proposed dynamic occupancy model is defined by the following hierarchy.

State model:
z_i1 | α ~ Bernoulli(ψ_i1), where ψ_i1 = F(x'_(o)i α)
z_it | z_i(t−1), δ^(s)_{t−1}, β^(c)_{t−1} ~ Bernoulli(z_i(t−1) θ_i(t−1) + (1 − z_i(t−1)) γ_i(t−1)),
where θ_i(t−1) = F(δ^(s)_{t−1} + x'_{i(t−1)} β^(c)_{t−1}) and γ_i(t−1) = F(x'_{i(t−1)} β^(c)_{t−1}).   (2–28)

Observation model:
y_ijt | z_it, λ_t ~ Bernoulli(z_it p_ijt), where p_ijt = F(q'_{ijt} λ_t).   (2–29)
In the hierarchical setup given by equations 2–28 and 2–29, θ_i(t−1) corresponds to the probability of persistence from time t − 1 to time t at site i, and γ_i(t−1) denotes the colonization probability. Note that θ_i(t−1) − γ_i(t−1) yields the survival probability from t − 1 to t. The effect of survival is introduced by shifting the intercept of the linear predictor by a quantity δ^(s)_{t−1}. Although in this version of the model the effect is accomplished by modifying only the intercept, it can be extended to let covariates determine δ^(s)_{t−1} as well. The graphical representation of the model for a single site is given in Figure 2-3.
[Figure omitted: directed graph linking α → z_{i1} → z_{i2} → ⋯ → z_{iT}, each z_{it} → y_{it}, with detection parameters λ_t and transition parameters δ^{(s)}_{t-1}, β^{(c)}_{t-1}.]

Figure 2-3. Graphical representation of the multiseason model for a single site.
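To make the generative process of Equations 2–28 and 2–29 concrete, the following sketch simulates one realization of the state and observation processes under the probit link. The dimensions, covariate, and parameter values are hypothetical, and first-season occupancy and detection are held constant for simplicity.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
N, T, J = 100, 5, 3                  # sites, seasons, surveys per season (illustrative)
x = rng.normal(size=(N, T))          # one hypothetical site-by-season covariate
alpha, beta_c, delta_s, lam = -0.2, 0.8, 0.5, 0.3  # hypothetical parameter values
F = norm.cdf                         # probit inverse link F(.)

z = np.zeros((N, T), dtype=int)
z[:, 0] = rng.binomial(1, F(alpha), size=N)     # first-season occupancy (no covariate here)
for t in range(1, T):
    theta = F(delta_s + beta_c * x[:, t - 1])   # persistence: shifted intercept
    gamma = F(beta_c * x[:, t - 1])             # colonization
    z[:, t] = rng.binomial(1, np.where(z[:, t - 1] == 1, theta, gamma))

# detections occur only at occupied sites: y_ijt ~ Bernoulli(z_it * p_ijt)
y = rng.binomial(1, z[:, None, :] * F(lam), size=(N, J, T))
```

Note that a site unoccupied at t − 1 can still be occupied at t (through γ), which is exactly the colonization route described above.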
The joint posterior for the model defined by this hierarchical setting is

\[
\begin{aligned}
[\,\mathbf{z},\Lambda,\alpha,B^{(c)},\Delta^{(s)}\mid \mathbf{y}\,] ={}& C_{\mathbf{y}}\prod_{i=1}^{N}\left\{\Bigl[\psi_{i1}\prod_{j=1}^{J}p_{ij1}^{\,y_{ij1}}(1-p_{ij1})^{(1-y_{ij1})}\Bigr]^{z_{i1}}\Bigl[(1-\psi_{i1})\prod_{j=1}^{J}I_{\{y_{ij1}=0\}}\Bigr]^{1-z_{i1}}\right\}[\lambda_1][\alpha]\\
&\times\prod_{t=2}^{T}\prod_{i=1}^{N}\left\{\Bigl[\bigl(\theta_{i(t-1)}^{\,z_{it}}(1-\theta_{i(t-1)})^{1-z_{it}}\bigr)^{z_{i(t-1)}}\bigl(\gamma_{i(t-1)}^{\,z_{it}}(1-\gamma_{i(t-1)})^{1-z_{it}}\bigr)^{1-z_{i(t-1)}}\Bigr]\right.\\
&\qquad\left.\times\Bigl[\prod_{j=1}^{J}p_{ijt}^{\,y_{ijt}}(1-p_{ijt})^{1-y_{ijt}}\Bigr]^{z_{it}}\Bigl[\prod_{j=1}^{J}I_{\{y_{ijt}=0\}}\Bigr]^{1-z_{it}}\right\}[\lambda_t][\delta^{(s)}_{t-1}][\beta^{(c)}_{t-1}],
\end{aligned}
\]
(2–30)
which, as in the single-season case, is intractable. Once again, a Gibbs sampler cannot be constructed directly to sample from this joint posterior. The graphical representation of the model for one site, incorporating the latent variables, is provided in Figure 2-4.
[Figure omitted: the graph of Figure 2-3 augmented with the latent nodes u_{i1}, v_{i,t-1}, and w_{it}.]

Figure 2-4. Graphical representation of the data-augmented multiseason model.
Probit link: normal-mixture DYMOSS model
We deal with the intractability of the joint posterior distribution as before, that is, by introducing latent random variables. Each of the latent variables incorporates the relevant linear combination of covariates for the probabilities considered in the model. This artifact enables us to sample from the joint posterior distribution of the model parameters. For the probit link, the sets of latent random variables, respectively for first-season occupancy, persistence and colonization, and detection, are

• u_i ∼ N(x′_{(o)i}α, 1),
• v_{i(t-1)} ∼ z_{i(t-1)} N(δ^{(s)}_{t-1} + x′_{i(t-1)}β^{(c)}_{t-1}, 1) + (1 − z_{i(t-1)}) N(x′_{i(t-1)}β^{(c)}_{t-1}, 1), and
• w_{ijt} ∼ N(q′_{ijt}λ_t, 1).
Introducing these latent variables into the hierarchical formulation yields:

State model:
\[
\begin{aligned}
u_{i1}\mid\alpha &\sim N\bigl(x'_{(o)i}\alpha,\,1\bigr), \qquad z_{i1}\mid u_{i1} \sim \mathrm{Bernoulli}\bigl(I_{\{u_{i1}>0\}}\bigr),\\
\text{and for } t>1:\quad v_{i(t-1)}\mid z_{i(t-1)},\delta^{(s)}_{t-1},\beta^{(c)}_{t-1} &\sim z_{i(t-1)}N\bigl(\delta^{(s)}_{t-1}+x'_{i(t-1)}\beta^{(c)}_{t-1},\,1\bigr) + \bigl(1-z_{i(t-1)}\bigr)N\bigl(x'_{i(t-1)}\beta^{(c)}_{t-1},\,1\bigr),\\
z_{it}\mid v_{i(t-1)} &\sim \mathrm{Bernoulli}\bigl(I_{\{v_{i(t-1)}>0\}}\bigr)
\end{aligned}
\]
(2–31)

Observed model:
\[
w_{ijt}\mid\lambda_t \sim N\bigl(q'_{ijt}\lambda_t,\,1\bigr), \qquad y_{ijt}\mid z_{it},w_{ijt} \sim \mathrm{Bernoulli}\bigl(z_{it}I_{\{w_{ijt}>0\}}\bigr)
\]
(2–32)
Note that the result presented in Section 2.2 corresponds to the particular case T = 1 of the model specified by Equations 2–31 and 2–32.

As mentioned previously, model parameters are obtained using a Gibbs sampling approach. Let φ(x|μ, σ²) denote the pdf of a normally distributed random variable x with mean μ and variance σ². Also let

1. W_t = (w_{1t}, w_{2t}, ..., w_{Nt}), with w_{it} = (w_{i1t}, w_{i2t}, ..., w_{iJ_{it}t}), for i = 1, 2, ..., N and t = 1, 2, ..., T;
2. u = (u_1, u_2, ..., u_N); and
3. V = (v_1, ..., v_{T-1}), with v_t = (v_{1t}, v_{2t}, ..., v_{Nt}).
For the probit link model, the joint posterior distribution is

\[
\begin{aligned}
\pi\bigl(Z,u,V,\{W_t\}_{t=1}^{T},\alpha,B^{(c)},\Delta^{(s)},\Lambda\mid\mathbf{y}\bigr) \propto{}& [\alpha]\prod_{i=1}^{N}\phi\bigl(u_i \mid x'_{(o)i}\alpha,\,1\bigr)\,I_{\{u_i>0\}}^{\,z_{i1}}\,I_{\{u_i\le 0\}}^{\,1-z_{i1}}\\
&\times\prod_{t=2}^{T}\bigl[\beta^{(c)}_{t-1},\delta^{(s)}_{t-1}\bigr]\prod_{i=1}^{N}\phi\bigl(v_{i(t-1)}\mid \mu^{(v)}_{i(t-1)},\,1\bigr)\,I_{\{v_{i(t-1)}>0\}}^{\,z_{it}}\,I_{\{v_{i(t-1)}\le 0\}}^{\,1-z_{it}}\\
&\times\prod_{t=1}^{T}[\lambda_t]\prod_{i=1}^{N}\prod_{j=1}^{J_{it}}\phi\bigl(w_{ijt}\mid q'_{ijt}\lambda_t,\,1\bigr)\bigl(z_{it}I_{\{w_{ijt}>0\}}\bigr)^{y_{ijt}}\bigl(1-z_{it}I_{\{w_{ijt}>0\}}\bigr)^{(1-y_{ijt})},
\end{aligned}
\]
(2–33)

where μ^{(v)}_{i(t-1)} = z_{i(t-1)}δ^{(s)}_{t-1} + x′_{i(t-1)}β^{(c)}_{t-1}.
Initialize the Gibbs sampler at α^{(0)}, B^{(c)(0)}, Δ^{(s)(0)}, and Λ^{(0)}. The sampler proceeds iteratively by block sampling sequentially for each primary sampling period as follows: first the presence process, then the latent variables from the data-augmentation step for the presence component, followed by the parameters for the presence process; then the latent variables for the detection component; and finally the parameters for the detection component. Letting [·|·] denote the full conditional probability density function of a component given all other unknown parameters and the observed data, for m = 1, ..., n_sim the sampling procedure can be summarized as

\[
\bigl[z_1^{(m)}\mid\cdot\bigr]\rightarrow\bigl[u^{(m)}\mid\cdot\bigr]\rightarrow\bigl[\alpha^{(m)}\mid\cdot\bigr]\rightarrow\bigl[W_1^{(m)}\mid\cdot\bigr]\rightarrow\bigl[\lambda_1^{(m)}\mid\cdot\bigr]\rightarrow\bigl[z_2^{(m)}\mid\cdot\bigr]\rightarrow\bigl[V_1^{(m)}\mid\cdot\bigr]\rightarrow\bigl[\beta_1^{(c)(m)},\delta_1^{(s)(m)}\mid\cdot\bigr]\rightarrow\bigl[W_2^{(m)}\mid\cdot\bigr]\rightarrow\bigl[\lambda_2^{(m)}\mid\cdot\bigr]\rightarrow\cdots\rightarrow\bigl[z_T^{(m)}\mid\cdot\bigr]\rightarrow\bigl[V_{T-1}^{(m)}\mid\cdot\bigr]\rightarrow\bigl[\beta_{T-1}^{(c)(m)},\delta_{T-1}^{(s)(m)}\mid\cdot\bigr]\rightarrow\bigl[W_T^{(m)}\mid\cdot\bigr]\rightarrow\bigl[\lambda_T^{(m)}\mid\cdot\bigr].
\]
∣∣∣ middot ]The full conditional probability densities for this Gibbs sampling algorithm are
presented in detail within Appendix A
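The presence-process updates above are built from the Albert & Chib step: draw the latent normal variable truncated to the half-line dictated by the current state, then draw the coefficients from their conjugate normal full conditional. The following is a minimal sketch of that pair of updates for a single-season probit presence process with the states z treated as known; the design matrix, coefficient values, and flat prior on α are hypothetical simplifications.

```python
import numpy as np
from scipy.stats import norm, truncnorm

rng = np.random.default_rng(1)
N = 200
X = np.column_stack([np.ones(N), rng.normal(size=N)])   # hypothetical design matrix
alpha_true = np.array([-0.5, 1.0])                      # hypothetical coefficients
z = rng.binomial(1, norm.cdf(X @ alpha_true))           # presence states (taken as known)

XtX_inv = np.linalg.inv(X.T @ X)
alpha = np.zeros(2)
draws = []
for m in range(1000):
    mu = X @ alpha
    # v_i | z_i, alpha ~ N(mu_i, 1), truncated to (0, inf) if z_i = 1 and (-inf, 0] if z_i = 0
    lo = np.where(z == 1, -mu, -np.inf)                 # bounds standardized around mu
    hi = np.where(z == 1, np.inf, -mu)
    v = mu + truncnorm.rvs(lo, hi, size=N, random_state=rng)
    # alpha | v ~ N((X'X)^{-1} X'v, (X'X)^{-1}) under a flat prior on alpha
    alpha = rng.multivariate_normal(XtX_inv @ (X.T @ v), XtX_inv)
    if m >= 500:
        draws.append(alpha)
post_mean = np.mean(draws, axis=0)
```

Because both full conditionals are standard distributions, no tuning is needed; this is the property the probit formulation is exploited for throughout the chapter.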
Logit link: Polya-Gamma DYMOSS model

Using the same notation as before, the logit-link model resorts to the hierarchy given by:

State model:
\[
\begin{aligned}
u_{i1}\mid\alpha &\sim PG\bigl(1,\,\lvert x'_{(o)i}\alpha\rvert\bigr), \qquad z_{i1}\mid\alpha \sim \mathrm{Bernoulli}\Bigl(\tfrac{e^{x'_{(o)i}\alpha}}{1+e^{x'_{(o)i}\alpha}}\Bigr),\\
\text{and for } t>1:\quad v_{i(t-1)}\mid z_{i(t-1)},\delta^{(s)}_{t-1},\beta^{(c)}_{t-1} &\sim PG\bigl(1,\,\bigl\lvert z_{i(t-1)}\delta^{(s)}_{t-1}+x'_{i(t-1)}\beta^{(c)}_{t-1}\bigr\rvert\bigr), \qquad z_{it}\mid z_{i(t-1)},\delta^{(s)}_{t-1},\beta^{(c)}_{t-1} \sim \mathrm{Bernoulli}\Bigl(\tfrac{e^{\mu^{(v)}_{i(t-1)}}}{1+e^{\mu^{(v)}_{i(t-1)}}}\Bigr)
\end{aligned}
\]
(2–34)

Observed model:
\[
w_{ijt}\mid\lambda_t \sim PG\bigl(1,\,\lvert q'_{ijt}\lambda_t\rvert\bigr), \qquad y_{ijt}\mid z_{it},\lambda_t \sim \mathrm{Bernoulli}\Bigl(z_{it}\tfrac{e^{q'_{ijt}\lambda_t}}{1+e^{q'_{ijt}\lambda_t}}\Bigr)
\]
(2–35)
The logit-link version of the joint posterior is given by

\[
\begin{aligned}
\pi\bigl(Z,u,V,\{W_t\}_{t=1}^{T},\alpha,B^{(c)},\Delta^{(s)},\Lambda\mid\mathbf{y}\bigr) \propto{}& [\alpha][\lambda_1]\prod_{i=1}^{N}\frac{\bigl(e^{x'_{(o)i}\alpha}\bigr)^{z_{i1}}}{1+e^{x'_{(o)i}\alpha}}\,PG\bigl(u_i;\,1,\,\lvert x'_{(o)i}\alpha\rvert\bigr)\\
&\times\prod_{j=1}^{J_{i1}}\Bigl(z_{i1}\tfrac{e^{q'_{ij1}\lambda_1}}{1+e^{q'_{ij1}\lambda_1}}\Bigr)^{y_{ij1}}\Bigl(1-z_{i1}\tfrac{e^{q'_{ij1}\lambda_1}}{1+e^{q'_{ij1}\lambda_1}}\Bigr)^{1-y_{ij1}} PG\bigl(w_{ij1};\,1,\,\lvert z_{i1}q'_{ij1}\lambda_1\rvert\bigr)\\
&\times\prod_{t=2}^{T}[\delta^{(s)}_{t-1}][\beta^{(c)}_{t-1}][\lambda_t]\prod_{i=1}^{N}\frac{\bigl(\exp\bigl[\mu^{(v)}_{i(t-1)}\bigr]\bigr)^{z_{it}}}{1+\exp\bigl[\mu^{(v)}_{i(t-1)}\bigr]}\,PG\bigl(v_{i(t-1)};\,1,\,\bigl\lvert\mu^{(v)}_{i(t-1)}\bigr\rvert\bigr)\\
&\times\prod_{j=1}^{J_{it}}\Bigl(z_{it}\tfrac{e^{q'_{ijt}\lambda_t}}{1+e^{q'_{ijt}\lambda_t}}\Bigr)^{y_{ijt}}\Bigl(1-z_{it}\tfrac{e^{q'_{ijt}\lambda_t}}{1+e^{q'_{ijt}\lambda_t}}\Bigr)^{1-y_{ijt}} PG\bigl(w_{ijt};\,1,\,\lvert z_{it}q'_{ijt}\lambda_t\rvert\bigr),
\end{aligned}
\]
(2–36)

with μ^{(v)}_{i(t-1)} = z_{i(t-1)}δ^{(s)}_{t-1} + x′_{i(t-1)}β^{(c)}_{t-1}.
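Exact Polya-Gamma samplers are available in dedicated libraries (e.g., the `polyagamma` package). For illustration only, the sketch below approximates PG(1, c) draws by truncating the infinite sum-of-gammas representation, ω = (1/2π²) Σ_k g_k/((k − 1/2)² + c²/4π²) with g_k ∼ Gamma(1, 1), and uses them in one conjugate update of a logistic coefficient vector; the data, prior precision, and dimensions are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)

def rpg_approx(c, K=200):
    """Approximate PG(1, c) draws by truncating the sum-of-gammas representation at K terms."""
    c = np.atleast_1d(np.asarray(c, dtype=float))
    k = np.arange(1, K + 1)[:, None]                 # (K, 1)
    g = rng.gamma(1.0, 1.0, size=(K, c.size))        # g_k ~ Gamma(1, 1)
    denom = (k - 0.5) ** 2 + (c[None, :] / (2 * np.pi)) ** 2
    return (g / denom).sum(axis=0) / (2 * np.pi ** 2)

def pg_update(X, y, lam, prior_prec):
    """One Gibbs update of a logistic coefficient vector given PG draws."""
    omega = rpg_approx(X @ lam)                      # omega_i | lam ~ PG(1, |x_i' lam|)
    kappa = y - 0.5
    V = np.linalg.inv((X.T * omega) @ X + prior_prec)
    return rng.multivariate_normal(V @ (X.T @ kappa), V)

N = 50
X = np.column_stack([np.ones(N), rng.normal(size=N)])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(X @ np.array([0.2, 1.0])))))
lam = pg_update(X, y, np.zeros(2), np.eye(2) * 0.01)
```

The conditional normality of the coefficients given ω is what makes the logit-link sampler direct, mirroring the role of the truncated normals in the probit version.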
The sampling procedure is entirely analogous to that described for the probit version. The full conditional densities derived from expression 2–36 are described in detail in Appendix A.

2.3.2 Incorporating Spatial Dependence
In this section we describe how an additional layer of complexity, space, can also be accounted for by continuing to use the same data-augmentation framework. The method we employ to incorporate spatial dependence is a slightly modified version of the traditional approach for spatial generalized linear mixed models (GLMMs), and extends the model proposed by Johnson et al. (2013) for the single-season, closed-population occupancy model.

The traditional approach consists of using spatial random effects to induce a correlation structure among adjacent sites. This formulation, introduced by Besag et al. (1991), assumes that the spatial random effect corresponds to a Gaussian Markov Random Field (GMRF). The model, known as the spatial GLMM (SGLMM), is used to analyze areal data. It has been applied extensively given the flexibility of its hierarchical formulation and the availability of software for its implementation (Hughes & Haran 2013).
Succinctly, the spatial dependence is accounted for in the model by adding a random vector η assumed to have a conditionally autoregressive (CAR) prior (also known as a Gaussian Markov random field prior). To define the prior, let the pair G = (V, E) represent the undirected graph for the entire spatial region studied, where V = (1, 2, ..., N) denotes the vertices of the graph (sites) and E the set of edges between sites; E is constituted by elements of the form (i, j), indicating that sites i and j are spatially adjacent, for some i, j ∈ V. The prior for the spatial effects is then characterized by

\[
[\eta\mid\tau] \propto \tau^{\operatorname{rank}(\Omega)/2}\exp\Bigl[-\frac{\tau}{2}\,\eta'\Omega\eta\Bigr],
\]
(2–37)

where Ω = (diag(A1) − A) is the precision matrix, with A denoting the adjacency matrix. The entries of the adjacency matrix A are such that diag(A) = 0 and A_{ij} = I_{(i,j)∈E}.
The matrix Ω is singular; hence the probability density defined in Equation 2–37 is improper, i.e., it does not integrate to 1. Regardless of the impropriety of the prior, this model can be fitted using a Bayesian approach, since even if the prior is improper the posterior for the model parameters is proper. If a constraint such as Σ_k η_k = 0 is imposed, or if the precision matrix is replaced by a positive definite matrix, the model can also be fitted using a maximum likelihood approach.
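As a small illustration of the precision matrix just described, the snippet below builds Ω = diag(A1) − A for a hypothetical four-site line graph and verifies its singularity: its rows sum to zero, so its rank is N minus the number of connected components of the graph.

```python
import numpy as np

# hypothetical adjacency for four sites arranged on a line: 1-2-3-4
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]])
Omega = np.diag(A.sum(axis=1)) - A     # CAR precision: diag(A1) - A

row_sums = Omega @ np.ones(4)          # the constant vector lies in the null space
rank = np.linalg.matrix_rank(Omega)    # N - 1 here, since the graph is connected
```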
Assuming that all but the detection process are subject to spatial correlation, and using the notation we have developed up to this point, the spatially explicit version of the DYMOSS model is characterized by the hierarchy represented by Equations 2–38 and 2–39.

Hence, adding spatial structure into the DYMOSS framework described in the previous section only involves adding steps to sample η^{(o)} and {η_t}_{t=2}^{T} conditional on all other parameters. Furthermore, the corresponding parameters and spatial random effects of a given component (i.e., occupancy, survival, and colonization) can be effortlessly pooled together into a single parameter vector to perform block sampling. For each of the latent variables, the only modification required is to add the corresponding spatial effect to the linear predictor, so that these retain their conditional independence given the linear combination of fixed effects and the spatial effects.
State model:
\[
\begin{aligned}
z_{i1}\mid\alpha,\eta^{(o)} &\sim \mathrm{Bernoulli}(\psi_{i1}), \quad\text{where } \psi_{i1} = F\bigl(x'_{(o)i}\alpha + \eta^{(o)}_i\bigr)\\
\bigl[\eta^{(o)}\mid\tau\bigr] &\propto \tau^{\operatorname{rank}(\Omega)/2}\exp\Bigl[-\frac{\tau}{2}\,\eta^{(o)\prime}\Omega\eta^{(o)}\Bigr]\\
z_{it}\mid z_{i(t-1)},\delta^{(s)}_{t-1},\beta^{(c)}_{t-1},\eta_t &\sim \mathrm{Bernoulli}\bigl(z_{i(t-1)}\theta_{i(t-1)} + (1-z_{i(t-1)})\gamma_{i(t-1)}\bigr),\\
&\quad\text{where } \theta_{i(t-1)} = F\bigl(\delta^{(s)}_{t-1} + x'_{i(t-1)}\beta^{(c)}_{t-1} + \eta_{it}\bigr) \text{ and } \gamma_{i(t-1)} = F\bigl(x'_{i(t-1)}\beta^{(c)}_{t-1} + \eta_{it}\bigr)\\
[\eta_t\mid\tau] &\propto \tau^{\operatorname{rank}(\Omega)/2}\exp\Bigl[-\frac{\tau}{2}\,\eta'_t\Omega\eta_t\Bigr]
\end{aligned}
\]
(2–38)

Observed model:
\[
y_{ijt}\mid z_{it},\lambda_t \sim \mathrm{Bernoulli}(z_{it}p_{ijt}), \quad\text{where } p_{ijt} = F\bigl(q'_{ijt}\lambda_t\bigr)
\]
(2–39)
In spite of the popularity of this approach to incorporating spatial dependence, three shortcomings have been reported in the literature (Hughes & Haran 2013; Reich et al. 2006): (1) model parameters have no clear interpretation, due to spatial confounding of the predictors with the spatial effect; (2) there is variance inflation due to spatial confounding; and (3) the high dimensionality of the latent spatial variables leads to high computational costs. To avoid these difficulties, we follow the approach used by Hughes & Haran (2013), which builds upon the earlier work of Reich et al. (2006). This methodology is summarized in what follows.
Let a vector of spatial effects η have the CAR prior given by 2–37 above. Now consider a random vector ζ ∼ MVN(0, τK′ΩK), with Ω defined as above and where τK′ΩK corresponds to the precision of the distribution (and not the covariance matrix), with the matrix K satisfying K′K = I. This last condition implies that the linear predictor Xβ + η = Xβ + Kζ. With respect to how the matrix K is chosen, Hughes & Haran (2013) recommend basing its construction on the spectral decomposition of operator matrices based on Moran's I. The Moran operator matrix is defined as P⊥AP⊥, with P⊥ = I − X(X′X)⁻¹X′, and where A is the adjacency matrix previously described. The choice of the Moran operator is based on the fact that it accounts for the underlying graph while incorporating the spatial structure residual to the design matrix X. These elements are incorporated into its spectral decomposition: its eigenvalues correspond to the values of Moran's I statistic (a measure of spatial autocorrelation) for spatial processes orthogonal to X, while its eigenvectors provide the patterns of spatial dependence residual to X. Thus, the matrix K is chosen to be the matrix whose columns are the eigenvectors of the Moran operator for a particular adjacency matrix.
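The construction of K follows directly from the definitions above; in the sketch below the design matrix, the line-graph adjacency, and the truncation to the q leading eigenvectors are all hypothetical choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
N, q = 30, 5
X = np.column_stack([np.ones(N), rng.normal(size=N)])   # hypothetical design matrix

A = np.zeros((N, N))                                    # hypothetical line-graph adjacency
idx = np.arange(N - 1)
A[idx, idx + 1] = A[idx + 1, idx] = 1.0

P_perp = np.eye(N) - X @ np.linalg.inv(X.T @ X) @ X.T   # projection orthogonal to X
M = P_perp @ A @ P_perp                                 # Moran operator
evals, evecs = np.linalg.eigh(M)
order = np.argsort(evals)[::-1]                         # largest Moran's I values first
K = evecs[:, order[:q]]                                 # reduced-rank spatial basis
```

By construction the retained columns of K are orthonormal and (for nonzero eigenvalues) orthogonal to the column space of X, which is what removes the spatial confounding with the fixed effects.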
Using this strategy, the new hierarchical formulation of our model is obtained simply by letting η^{(o)} = K^{(o)}ζ^{(o)} and η_t = K_tζ_t, with

1. ζ^{(o)} ∼ MVN(0, τ^{(o)}K^{(o)′}ΩK^{(o)}), where K^{(o)} is the eigenvector matrix for P^{(o)⊥}AP^{(o)⊥}, and
2. ζ_t ∼ MVN(0, τ_tK′_tΩK_t), where K_t is the eigenvector matrix for P⊥_tAP⊥_t, for t = 2, 3, ..., T.

The algorithms for the probit and logit links from Section 2.3.1 can be readily adapted to incorporate the spatial structure, simply by obtaining the joint posteriors for (α, ζ^{(o)}) and (β^{(c)}_{t-1}, δ^{(s)}_{t-1}, ζ_t), making the obvious modification of the corresponding linear predictors to incorporate the spatial components.

2.4 Summary
With a few exceptions (Dorazio & Taylor-Rodríguez 2012; Johnson et al. 2013; Royle & Kéry 2007), recent Bayesian approaches to site-occupancy modeling with covariates have relied on model configurations (e.g., multivariate normal priors on parameters on the logit scale) that lead to unfamiliar conditional posterior distributions, thus precluding the use of a direct sampling approach. Therefore, the sampling strategies available are based on algorithms (e.g., Metropolis-Hastings) that require tuning, and the knowledge to do so correctly.

In Dorazio & Taylor-Rodríguez (2012) we proposed a Bayesian specification for which a Gibbs sampler of the basic occupancy model is available, and allowed detection and occupancy probabilities to depend on linear combinations of predictors. This method, described in Section 2.2.1, is based on the data augmentation algorithm of Albert & Chib (1993), where the full conditional posteriors of the parameters of the probit regression model are cast as latent mixtures of normal random variables. The probit and logit links yield similar results with large sample sizes; however, their results may differ when small to moderate sample sizes are considered, because the logit link function places more mass in the tails of the distribution than the probit link does. In Section 2.2.2 we adapt the method for the single-season model to work with the logit link function.
The basic occupancy framework is useful, but it assumes a single closed population with fixed probabilities through time. Hence, its assumptions may not be appropriate for problems where the interest lies in the temporal dynamics of the population. We therefore developed a dynamic model that incorporates the notion that occupancy at a previously occupied site takes place through persistence, which depends both on survival and on habitat suitability. By this we mean that a site occupied at time t may again be occupied at time t + 1 if (1) the current settlers survive, (2) the existing settlers perish but new settlers simultaneously colonize, or (3) current settlers survive and new ones colonize during the same season. In our current formulation of the DYMOSS, both colonization and persistence depend on habitat suitability, characterized by x′_{i(t-1)}β^{(c)}_{t-1}. They differ only in that persistence is also influenced by whether the site's being occupied during season t − 1 enhances its suitability or harms it through density dependence.
Additionally, the study of the dynamics that govern the distribution and abundance of biological populations requires an understanding of the physical and biotic processes that act upon them, and these vary in time and space. Consequently, as a final step in this chapter, we described a straightforward strategy to add spatial dependence among neighboring sites to the dynamic metapopulation model. This extension is based on the popular Bayesian spatial modeling technique of Besag et al. (1991), updated using the methods described in Hughes & Haran (2013).

Future steps along these lines are (1) to develop the software necessary to implement the tools described throughout the chapter, and (2) to build a suite of additional extensions of this framework for occupancy models. The first of these will incorporate information from different sources, such as tracks, scats, surveys, and direct observations, into a single model. This can be accomplished by adding a layer to the hierarchy in which the source and spatial scale of the data are accounted for. The second extension is a single-season, spatially explicit, multiple-species co-occupancy model. This model will allow studying complex interactions and testing hypotheses about species interactions at a given point in time. Lastly, this co-occupancy model will be adapted to incorporate temporal dynamics in the spirit of the DYMOSS model.
CHAPTER 3
INTRINSIC ANALYSIS FOR OCCUPANCY MODELS

Eliminate all other factors, and the one which remains must be the truth.
–Sherlock Holmes
The Sign of Four
3.1 Introduction

Occupancy models are often used to understand the mechanisms that dictate the distribution of a species. Therefore, variable selection plays a fundamental role in achieving this goal. To the best of our knowledge, "objective" Bayesian alternatives for variable selection have not been put forth for this problem, and with a few exceptions (Hooten & Hobbs 2014; Link & Barker 2009), AIC is the method used to choose from competing site-occupancy models. In addition, the procedures currently implemented and accessible to ecologists require enumerating and estimating all the candidate models (Fiske & Chandler 2011; Mazerolle & Mazerolle 2013). In practice this can be achieved if the model space considered is small enough, which is possible if the choice of the model space is guided by substantial prior knowledge about the underlying ecological processes. Nevertheless, many site-occupancy surveys collect large amounts of covariate information about the sampled sites. Given that the total number of candidate models grows exponentially fast with the number of predictors considered, choosing a reduced set of models guided by ecological intuition becomes increasingly difficult. This is even more so the case in the occupancy-model context, where the model space is the Cartesian product of models for presence and models for detection. Given the issues mentioned above, we propose the first objective Bayesian variable selection method for the single-season occupancy model framework. This approach explores the entire model space in a principled manner. It is completely automatic, precluding the need both for tuning parameters in the sampling algorithm and for subjective elicitation of parameter prior distributions.
As mentioned above, in ecological modeling, if model selection or, less frequently, model averaging is considered, the Akaike Information Criterion (AIC) (Akaike 1983), or a version of it, is the measure of choice for comparing candidate models (Fiske & Chandler 2011; Mazerolle & Mazerolle 2013). The AIC is designed to find the model whose density is, on average, closest in Kullback-Leibler distance to the density of the true data-generating mechanism; the model with the smallest AIC is selected. However, if nested models are considered, one of them being the true one, generally the AIC will not select it (Wasserman 2000). Commonly, the model selected by AIC will be more complex than the true one. The reason for this is that the AIC has a weak signal-to-noise ratio, and as such it tends to overfit (Rao & Wu 2001). Other versions of the AIC provide a bias correction that enhances the signal-to-noise ratio, leading to a stronger penalization for model complexity. Some examples are the AICc (Hurvich & Tsai 1989) and AICu (McQuarrie et al. 1997); however, these are also not consistent for selection, albeit asymptotically efficient (Rao & Wu 2001).

If we are interested in prediction, as opposed to testing, the AIC is certainly appropriate. However, when conducting inference, the use of Bayesian model averaging and selection methods is more fitting. If the true data-generating mechanism is among those considered, asymptotically Bayesian methods choose the true model with probability one. Conversely, if the true model is not among the alternatives and a suitable parameter prior is used, the posterior probability of the most parsimonious model closest to the true one tends asymptotically to one.

In spite of this, in general, for Bayesian testing, direct elicitation of prior probabilistic statements is often impeded because the problems studied may not be sufficiently well understood to make an informed decision about the priors. Conversely, there may be a prohibitively large number of parameters, making the specification of priors for each of these parameters an arduous task. In addition, seemingly innocuous subjective choices for the priors on the parameter space may drastically affect test outcomes. This has been a recurring argument in favor of objective Bayesian procedures, which appeal to the use of formal rules to build parameter priors that incorporate the structural information inside the likelihood while utilizing some objective criterion (Kass & Wasserman 1996).
One popular choice of "objective" prior is the reference prior (Berger & Bernardo 1992), which is the prior that maximizes the amount of signal extracted from the data. These priors have proven to be effective, as they are fully automatic and can be frequentist matching, in the sense that the posterior credible interval agrees with the frequentist confidence interval from repeated sampling with equal coverage probability (Kass & Wasserman 1996). Reference priors, however, are improper, and while they yield reasonable posterior parameter probabilities, the derived model posterior probabilities may be ill defined. To avoid this shortcoming, Berger & Pericchi (1996) introduced the intrinsic Bayes factor (IBF) for model comparison. Moreno et al. (1998), building on the IBF of Berger & Pericchi (1996), developed a limiting procedure to generate a system of priors that yield well-defined posteriors, even though these priors may sometimes be improper. The IBF is built using a data-dependent prior to automatically generate Bayes factors; however, the extension introduced by Moreno et al. (1998) generates the intrinsic prior by taking a theoretical average over the space of training samples, freeing the prior from data dependence.

In our view, in the face of a large number of predictors, the best alternative is to run a stochastic search algorithm using good "objective" testing parameter priors and to incorporate suitable model priors. This being said, the discussion of model priors is deferred until Chapter 4; this chapter focuses on the priors on the parameter space.

The chapter is structured as follows. First, issues surrounding multimodel inference are described and insight about objective Bayesian inferential procedures is provided.
Then, building on modern methods for "objective" Bayesian testing to generate priors on the parameter space, the intrinsic priors for the parameters of the occupancy model are derived. These are used in the construction of an algorithm for "objective" model selection tailored to the occupancy model framework. To assess the performance of our methods, we provide results from a simulation study in which distinct scenarios, both favorable and unfavorable, are used to determine the robustness of these tools, and we analyze the Blue Hawker data set, which has been examined previously in the ecological literature (Dorazio & Taylor-Rodríguez 2012; Kéry et al. 2010).

3.2 Objective Bayesian Inference
As mentioned before, in practice, noninformative priors arising from structural rules are an alternative to subjective elicitation of priors. Some of the rules used in defining noninformative priors include the principle of insufficient reason, parametrization invariance, maximum entropy, geometric arguments, coverage matching, and decision-theoretic approaches (see Kass & Wasserman (1996) for a discussion).

These rules reflect one of two attitudes: (1) noninformative priors either aim to convey unique representations of ignorance, or (2) they attempt to produce probability statements that may be accepted by convention. This latter attitude is in the same spirit as how weights and distances are defined (Kass & Wasserman 1996), and characterizes the way in which Bayesian reference methods are interpreted today; i.e., noninformative priors are seen as chosen by convention according to the situation.

A word of caution must be given when using noninformative priors. Difficulties arise in their implementation that should not be taken lightly. In particular, these difficulties may occur because noninformative priors are generally improper (meaning that they do not integrate or sum to a finite number) and, as such, are said to depend on arbitrary constants.

Bayes factors strongly depend upon the prior distributions for the parameters included in each of the models being compared. This can be an important limitation,
considering that when using noninformative priors their introduction will result in the Bayes factors being a function of the ratio of arbitrary constants, given that these priors are typically improper (see Jeffreys 1961; Pericchi 2005; and references therein). Many different approaches have since been developed to deal with the arbitrary constants that arise when using improper priors. These include the use of partial Bayes factors (Berger & Pericchi 1996; Good 1950; Lempers 1971), setting the ratio of arbitrary constants to a predefined value (Spiegelhalter & Smith 1982), and approximating the Bayes factor (see Haughton 1988, as cited in Berger & Pericchi 1996; Kass & Raftery 1995; Tierney & Kadane 1986).

3.2.1 The Intrinsic Methodology
Berger & Pericchi (1996) cleverly dealt with the arbitrary constants that arise when using improper priors by introducing the intrinsic Bayes factor (IBF) procedure. This solution, based on partial Bayes factors, provides the means to replace the improper priors by proper "posterior" priors. The IBF is obtained by combining the model structure with information contained in the observed data. Furthermore, they showed that, as the sample size tends to infinity, the intrinsic Bayes factor corresponds to the proper Bayes factor arising from the intrinsic priors.

Intrinsic priors, however, are not unique. The asymptotic correspondence between the IBF and the Bayes factor arising from the intrinsic prior yields two functional equations that are solved by a whole class of intrinsic priors. Because all the priors in the class produce Bayes factors that are asymptotically equivalent to the IBF, for finite sample sizes the resulting Bayes factor is not unique. To address this issue, Moreno et al. (1998) formalized the methodology through the "limiting procedure." This procedure allows one to obtain a unique Bayes factor, consolidating the method as a valid objective Bayesian model selection procedure, which we will refer to as the Bayes factor for intrinsic priors (BFIP). This result is particularly valid for nested models, although the methodology may be extended, with some caution, to nonnested models.
As mentioned before, the Bayesian hypothesis testing procedure is highly sensitive to parameter-prior specification, and not all priors that are useful for estimation are recommended for hypothesis testing or model selection. Evidence of this is provided by the Jeffreys-Lindley paradox, which states that a point null hypothesis will always be accepted when the variance of a conjugate prior goes to infinity (Robert 1993). Additionally, when comparing nested models, the null model should correspond to a substantial reduction in complexity from that of larger alternative models. Hence, priors for the larger alternative models that place probability mass away from the null model are wasteful: if the true model is "far" from the null, it will be easily detected by any statistical procedure. Therefore, the prior on the alternative models should "work harder" at selecting competitive models that are "close" to the null. This principle, known as the Savage continuity condition (Gunel & Dickey 1974), is widely recognized by statisticians.

Interestingly, the intrinsic prior in correspondence with the BFIP automatically satisfies the Savage continuity condition. That is, when comparing nested models, the intrinsic prior for the more complex model is centered around the null model and, in spite of being obtained through a limiting procedure, is not subject to the Jeffreys-Lindley paradox.

Moreover, beyond the usual pairwise consistency of the Bayes factor for nested models, Casella et al. (2009) showed that the corresponding Bayesian procedure with intrinsic priors for variable selection in normal regression is consistent in the entire class of normal linear models, adding an important feature to the list of virtues of the procedure. Consistency of the BFIP for the case where the dimension of the alternative model grows with the sample size is discussed in Moreno et al. (2010).

3.2.2 Mixtures of g-Priors
As previously mentioned, in the Bayesian paradigm a model M ∈ M is defined by a sampling density and a prior distribution. The sampling density associated with model M is denoted by f(y|β_M, σ²_M, M), where (β_M, σ²_M) is a vector of model-specific unknown parameters. The prior for model M and its corresponding set of parameters is

π(β_M, σ²_M, M|M) = π(β_M, σ²_M|M, M) · π(M|M).

Objective local priors for the model parameters (β_M, σ²_M) are achieved through modifications and extensions of Zellner's g-prior (Liang et al. 2008; Womack et al. 2014). In particular, below we focus on the intrinsic prior, and provide some details for other scaled mixtures of g-priors. We defer the discussion of priors over the model space until Chapter 5, where we describe them in detail and develop a few alternatives of our own.

3.2.2.1 Intrinsic priors
An automatic choice of an objective prior is the intrinsic prior (Berger & Pericchi 1996; Moreno et al. 1998). Because M_B ⊆ M for all M ∈ M, the intrinsic prior for (β_M, σ²_M) is defined as an expected posterior prior,

\[
\pi^{I}(\beta_M,\sigma^2_M\mid M) = \int p^{R}(\beta_M,\sigma^2_M\mid \tilde{y}, M)\, m^{R}(\tilde{y}\mid M_B)\, d\tilde{y},
\]
(3–1)

where ỹ is a minimal training sample for model M, I denotes the intrinsic distributions, and R denotes distributions derived from the reference prior π^R(β_M, σ²_M|M) = c_M/σ²_M. In (3–1), m^R(ỹ|M) = ∫∫ f(ỹ|β_M, σ²_M, M) π^R(β_M, σ²_M|M) dβ_M dσ²_M is the reference marginal of ỹ under model M, and p^R(β_M, σ²_M|ỹ, M) = f(ỹ|β_M, σ²_M, M) π^R(β_M, σ²_M|M) / m^R(ỹ|M) is the reference posterior density.
In the regression framework, the reference marginal m^R is improper and produces improper intrinsic priors. However, the intrinsic Bayes factor of model M to the base model M_B is well defined and given by

\[
BF^{I}_{M,M_B}(y) = \bigl(1-R^2_M\bigr)^{-\frac{n-|M_B|}{2}} \times \int_{0}^{1}\left(\frac{n+\sin^2\!\bigl(\tfrac{\pi}{2}\theta\bigr)\,(|M|+1)}{n+\frac{\sin^2(\frac{\pi}{2}\theta)\,(|M|+1)}{1-R^2_M}}\right)^{\!\frac{n-|M|}{2}} \left(\frac{\sin^2\!\bigl(\tfrac{\pi}{2}\theta\bigr)\,(|M|+1)}{n+\frac{\sin^2(\frac{\pi}{2}\theta)\,(|M|+1)}{1-R^2_M}}\right)^{\!\frac{|M|-|M_B|}{2}} d\theta,
\]
(3–2)

where R²_M is the coefficient of determination of model M versus model M_B. The Bayes factor between two models M and M′ is defined as BF^I_{M,M′}(y) = BF^I_{M,M_B}(y) / BF^I_{M′,M_B}(y).

The "goodness" of the model M based on the intrinsic priors is given by its posterior probability,

\[
p^{I}(M\mid y,\mathcal{M}) = \frac{BF^{I}_{M,M_B}(y)\,\pi(M\mid\mathcal{M})}{\sum_{M'\in\mathcal{M}} BF^{I}_{M',M_B}(y)\,\pi(M'\mid\mathcal{M})}.
\]
(3–3)
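Equations 3–2 and 3–3 are one-dimensional integrals over θ and can be evaluated numerically. The sketch below computes BF^I_{M,M_B} by quadrature and converts a small set of Bayes factors into posterior model probabilities under a uniform model prior; the sample size, model dimensions, and R² values are hypothetical.

```python
import numpy as np
from scipy.integrate import quad

def bf_intrinsic(R2, n, pM, pB):
    """Intrinsic Bayes factor of model M (|M| = pM, coefficient of determination R2)
    against the base model M_B (|M_B| = pB), evaluated by quadrature over theta."""
    def integrand(theta):
        s = np.sin(np.pi * theta / 2.0) ** 2 * (pM + 1)
        denom = n + s / (1.0 - R2)
        return ((n + s) / denom) ** ((n - pM) / 2.0) * (s / denom) ** ((pM - pB) / 2.0)
    integral, _ = quad(integrand, 0.0, 1.0)
    return (1.0 - R2) ** (-(n - pB) / 2.0) * integral

# posterior model probabilities with a uniform prior over three candidate models
bfs = {"M1": bf_intrinsic(0.30, 50, 2, 1), "M2": bf_intrinsic(0.05, 50, 2, 1), "MB": 1.0}
total = sum(bfs.values())
post = {m: b / total for m, b in bfs.items()}
```

As expected, the model with the larger R² at equal dimension receives the larger Bayes factor against the base model.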
It has been shown that the system of intrinsic priors produces consistent model selection (Casella et al. 2009; Girón et al. 2010). In the context of well-formulated models, the true model M_T is the smallest well-formulated model M ∈ M such that α ∈ M if β_α ≠ 0. If M_T is the true model, then the posterior probability of model M_T based on Equation (3–3) converges to 1.

3.2.2.2 Other mixtures of g-priors
Scaled mixtures of g-priors place a reference prior on (β_{M_B}, σ²) and a multivariate normal distribution on β in M \ M_B, that is, normal with mean 0 and precision matrix
\[
\frac{q_M\, w}{n\,\sigma^2}\, Z_M'(I - H_0) Z_M,
\]
where H_0 is the hat matrix associated with Z_{M_B}. The prior is completed by a prior on w and a choice of scaling q_M, which is set at |M| + 1 to account for the minimal sample size of M. Under these assumptions, the Bayes factor for M to M_B is given by
\[
BF_{M,M_B}(y) = (1-R^2_M)^{-\frac{n-|M_B|}{2}} \int
\left[\frac{n + w(|M|+1)}{n + \frac{w(|M|+1)}{1-R^2_M}}\right]^{\frac{n-|M|}{2}}
\left[\frac{w(|M|+1)}{n + \frac{w(|M|+1)}{1-R^2_M}}\right]^{\frac{|M|-|M_B|}{2}} \pi(w)\, dw.
\]
We consider the following priors on w. The intrinsic prior is π(w) = Beta(w; 0.5, 0.5), which is only defined for w ∈ (0, 1). A version of the Zellner–Siow prior is given by w ∼ Gamma(0.5, 0.5), which produces a multivariate Cauchy distribution on β. A family of hyper-g priors is defined by π(w) ∝ w^{−1/2}(β + w)^{−(α+1)/2}, which has Cauchy-like tails but produces more shrinkage than the Cauchy prior.
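A sketch of this computation under the Beta(0.5, 0.5) choice (the intrinsic prior on w): the Bayes factor is obtained by integrating the kernel above against π(w) over (0, 1); substituting w = sin²(πθ/2) recovers (3–2).

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import beta

def mixture_g_bf(r2, n, p_m, p_b):
    """Bayes factor of M versus M_B under the scaled mixture of g-priors,
    with the intrinsic Beta(0.5, 0.5) prior on w restricted to (0, 1)."""
    def integrand(w):
        a = w * (p_m + 1)
        denom = n + a / (1.0 - r2)
        return ((n + a) / denom) ** ((n - p_m) / 2.0) * \
               (a / denom) ** ((p_m - p_b) / 2.0) * beta.pdf(w, 0.5, 0.5)
    val, _ = quad(integrand, 0.0, 1.0)
    return (1.0 - r2) ** (-(n - p_b) / 2.0) * val
```

For the Zellner–Siow-type Gamma(0.5, 0.5) prior, the same integral would instead run over (0, ∞).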
3.3 Objective Bayes Occupancy Model Selection
As mentioned before, Bayesian inferential approaches for ecological models are lacking. In particular, there is a need for suitable objective and automatic Bayesian testing procedures, and for software implementations that explore the model space considered thoroughly. With this goal in mind, in this section we develop an objective, intrinsic, and fully automatic Bayesian model selection methodology for single-season site-occupancy models. We refer to this method as automatic and objective because its implementation requires no hyperparameter tuning, and because it is built using noninformative priors with good testing properties (e.g., intrinsic priors).
An inferential method for the occupancy problem is possible using the intrinsic approach, given that our probit formulation of the occupancy model links it to intrinsic-Bayesian tools for the normal linear model. In other words, because we can represent the single-season probit occupancy model through the hierarchy
\[
\begin{aligned}
y_{ij}\mid z_i, w_{ij} &\sim \mathrm{Bernoulli}\!\left(z_i I_{w_{ij}>0}\right), & w_{ij}\mid\lambda &\sim N\!\left(q_{ij}'\lambda,\, 1\right),\\
z_i\mid v_i &\sim \mathrm{Bernoulli}\!\left(I_{v_i>0}\right), & v_i\mid\alpha &\sim N\!\left(x_i'\alpha,\, 1\right),
\end{aligned}
\]
it is possible to solve the selection problem on the latent-scale variables w_ij and v_i, and to use those results at the level of the occupancy and detection processes.
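The hierarchy above is straightforward to simulate from, which is how synthetic data of the kind used later in Section 3.5 can be produced. The following sketch (array shapes are our own convention, not the author's code) draws one season of detection histories:

```python
import numpy as np

def simulate_probit_occupancy(X, Q, alpha, lam, seed=None):
    """Simulate from the single-season probit occupancy hierarchy.

    X: (N, p) site covariates; Q: (N, J, q) survey covariates;
    alpha, lam: presence and detection coefficients."""
    rng = np.random.default_rng(seed)
    v = X @ alpha + rng.standard_normal(X.shape[0])   # latent presence scale
    z = (v > 0).astype(int)                           # presence indicators
    w = Q @ lam + rng.standard_normal(Q.shape[:2])    # latent detection scale
    y = z[:, None] * (w > 0).astype(int)              # detection histories
    return y, z, v, w
```

By construction, detections can only occur at occupied sites (y_ij ≤ z_i).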
In what follows, first we provide some necessary notation. Then a derivation of the intrinsic priors for the parameters of the detection and occupancy components is outlined. Using these priors, we obtain the general form of the model posterior probabilities. Finally, the results are incorporated into a model selection algorithm for site-occupancy data. Although priors on the model space are not discussed in this chapter, the software and methods developed have different choices of model priors built in.
3.3.1 Preliminaries
The notation used in Chapter 2 is considered in this section as well. Namely, presence is denoted by z, detection by y, their corresponding latent processes by v and w, and the model parameters by α and λ. However, some additional notation is also necessary. Let M_0 = {M_{0y}, M_{0z}} denote the "base" model, defined by the smallest models considered for the detection and presence processes. The base models M_{0y} and M_{0z} include predictors that must be contained in every model that belongs to the model space. Some examples of base models are the intercept-only model, a model with covariates related to the sampling design, and a model including predictors important to the researcher that should be included in every model.
Furthermore, let the sets [K_z] = {1, 2, ..., K_z} and [K_y] = {1, 2, ..., K_y} index the covariates considered in the variable selection procedure for the presence and detection processes, respectively. That is, these sets index the covariates that can be added to the base models in M_0, or removed from the largest models considered, M_{Fz} and M_{Fy}, which we refer to as the "full" models. The model space can then be represented through pairs of subsets A_y ⊆ [K_y] and A_z ⊆ [K_z]; the entire model space is populated by models of the form M_A = {M_{A_y}, M_{A_z}} ∈ M = M_y × M_z, with M_{A_y} ∈ M_y and M_{A_z} ∈ M_z.
For the presence process z, the design matrix for model M_{A_z} is given by the block matrix X_{A_z} = (X_0 | X_{r,A}); X_0 corresponds to the design matrix of the base model (which is such that M_{0z} ⊆ M_{A_z} ∈ M_z for all A_z ⊆ [K_z]), and X_{r,A} corresponds to the submatrix that contains the covariates indexed by A_z. Analogously, for the detection process y, the design matrix is given by Q_{A_y} = (Q_0 | Q_{r,A}). Similarly, the coefficients for models M_{A_z} and M_{A_y} are given by α_A = (α′_0, α′_{r,A})′ and λ_A = (λ′_0, λ′_{r,A})′.
With these elements in place, the model selection problem consists of finding subsets of covariates, indexed by A = {A_z, A_y}, that have a high posterior probability given the detection and occupancy processes. This is equivalent to finding models with
high posterior odds when compared to a suitable base model. These posterior odds are given by
\[
\frac{p(M_A\mid y,z)}{p(M_0\mid y,z)} = \frac{m(y,z\mid M_A)\,\pi(M_A)}{m(y,z\mid M_0)\,\pi(M_0)} = BF_{M_A,M_0}(y,z)\,\frac{\pi(M_A)}{\pi(M_0)}.
\]
Since we are able to represent the occupancy model as a truncation of latent normal variables, it is possible to work through the occupancy model selection problem on the latent normal scale used for the presence and detection processes. We formulate two solutions to this problem: one that depends on the observed and latent components, and another that depends solely on the latent-level variables used to data-augment the problem. We will, however, focus on the latter approach, as it yields a straightforward MCMC sampling scheme. For completeness, the other alternative is described in Section 3.4.
At the root of our objective inferential procedure for occupancy models lies the conditional argument introduced by Womack et al. (work in progress) for the simple probit regression. In the occupancy setting, the argument is
\[
\begin{aligned}
p(M_A\mid y,z,w,v) &= \frac{m(y,z,v,w\mid M_A)\,\pi(M_A)}{m(y,z,w,v)}\\
&= \frac{f_{y,z}(y,z\mid w,v)\left(\int f_{v,w}(v,w\mid\alpha,\lambda,M_A)\,\pi_{\alpha,\lambda}(\alpha,\lambda\mid M_A)\, d(\alpha,\lambda)\right)\pi(M_A)}{f_{y,z}(y,z\mid w,v)\sum_{M^*\in\mathcal{M}}\left(\int f_{v,w}(v,w\mid\alpha,\lambda,M^*)\,\pi_{\alpha,\lambda}(\alpha,\lambda\mid M^*)\, d(\alpha,\lambda)\right)\pi(M^*)}\\
&= \frac{m(v\mid M_{A_z})\, m(w\mid M_{A_y})\,\pi(M_A)}{m(v)\, m(w)}\\
&\propto m(v\mid M_{A_z})\, m(w\mid M_{A_y})\,\pi(M_A), \qquad (3\text{–}4)
\end{aligned}
\]
where

1. \( f_{y,z}(y,z\mid w,v) = \prod_{i=1}^{N} I_{z_i v_i > 0}\, I_{(1-z_i) v_i \le 0} \prod_{j=1}^{J_i} (z_i I_{w_{ij}>0})^{y_{ij}} (1 - z_i I_{w_{ij}>0})^{1-y_{ij}} \),

2. \( f_{v,w}(v,w\mid\alpha,\lambda,M_A) = \underbrace{\left(\prod_{i=1}^{N} \phi(v_i;\, x_i'\alpha_{M_{A_z}},\, 1)\right)}_{f(v\mid \alpha_{r,A},\,\alpha_0,\, M_{A_z})} \underbrace{\left(\prod_{i=1}^{N}\prod_{j=1}^{J_i} \phi(w_{ij};\, q_{ij}'\lambda_{M_{A_y}},\, 1)\right)}_{f(w\mid \lambda_{r,A},\,\lambda_0,\, M_{A_y})} \), and

3. \( \pi_{\alpha,\lambda}(\alpha,\lambda\mid M_A) = \pi_{\alpha}(\alpha\mid M_{A_z})\,\pi_{\lambda}(\lambda\mid M_{A_y}) \).
This result implies that, once the occupancy and detection indicators are conditioned on the latent processes v and w, the model posterior probabilities depend only on the latent variables. Hence, in this case, the model selection problem is driven by the posterior odds
\[
\frac{p(M_A\mid y,z,w,v)}{p(M_0\mid y,z,w,v)} = \frac{m(w,v\mid M_A)}{m(w,v\mid M_0)}\,\frac{\pi(M_A)}{\pi(M_0)}, \qquad (3\text{–}5)
\]
where m(w, v | M_A) = m(w | M_{A_y}) · m(v | M_{A_z}), with
\[
m(v\mid M_{A_z}) = \int\!\!\int f(v\mid\alpha_{r,A},\alpha_0, M_{A_z})\,\pi(\alpha_{r,A}\mid\alpha_0, M_{A_z})\,\pi(\alpha_0)\, d\alpha_{r,A}\, d\alpha_0, \qquad (3\text{–}6)
\]
\[
m(w\mid M_{A_y}) = \int\!\!\int f(w\mid\lambda_{r,A},\lambda_0, M_{A_y})\,\pi(\lambda_{r,A}\mid\lambda_0, M_{A_y})\,\pi(\lambda_0)\, d\lambda_0\, d\lambda_{r,A}. \qquad (3\text{–}7)
\]
3.3.2 Intrinsic Priors for the Occupancy Problem
In general, the intrinsic priors, as defined by Moreno et al. (1998), use the functional form of the response to inform their construction, assuming some preliminary prior distribution, proper or improper, on the model parameters. For our purposes, we assume noninformative improper priors for the parameters, denoted by π^N(·|·). Specifically, the intrinsic priors π^{IP}(θ_{M*} | M*), for a vector of parameters θ_{M*} corresponding to model M* ∈ {M_0, M} ⊂ M, for a response vector s with probability density (or mass) function f(s | θ_{M*}), are defined by
\[
\begin{aligned}
\pi^{IP}(\theta_{M_0}\mid M_0) &= \pi^{N}(\theta_{M_0}\mid M_0),\\
\pi^{IP}(\theta_{M}\mid M) &= \pi^{N}(\theta_{M}\mid M) \int \frac{m^{N}(\tilde{s}\mid M_0)}{m^{N}(\tilde{s}\mid M)}\, f(\tilde{s}\mid\theta_M, M)\, d\tilde{s},
\end{aligned}
\]
where s̃ is a theoretical training sample.
In what follows, whenever it is clear from the context, and in an attempt to simplify the notation, M_A will be used to refer to M_{A_z} or M_{A_y}, and A will denote A_z or A_y. To derive the parameter priors involved in equations (3–6) and (3–7) using the objective intrinsic prior strategy, we start by assuming flat priors π^N(α_A | M_A) ∝ c_A and π^N(λ_A | M_A) ∝ d_A, where c_A and d_A are unknown constants.
The intrinsic prior for the parameters associated with the occupancy process, α_A, conditional on model M_A, is
\[
\pi^{IP}(\alpha_A\mid M_A) = \pi^{N}(\alpha_A\mid M_A) \int \frac{m(\tilde{v}\mid M_0)}{m(\tilde{v}\mid M_A)}\, f(\tilde{v}\mid\alpha_A, M_A)\, d\tilde{v},
\]
where the marginals m(ṽ | M_j), with j ∈ {A, 0}, are obtained by solving the analogue of equation (3–6) for the (theoretical) training sample ṽ. These marginals are given by
\[
m(\tilde{v}\mid M_j) = c_j\, (2\pi)^{-\frac{p_{A_z}-p_j}{2}}\, |\tilde{X}_j'\tilde{X}_j|^{-\frac{1}{2}}\, e^{-\frac{1}{2}\tilde{v}'(I-\tilde{H}_j)\tilde{v}}.
\]
The training sample ṽ has dimension p_{A_z} = |M_{A_z}|, that is, the total number of parameters in model M_{A_z}. Note that, without ambiguity, we use |·| to denote both the cardinality of a set and the determinant of a matrix. The design matrix X̃_A corresponds to the training sample ṽ and is chosen such that X̃′_A X̃_A = (p_{A_z}/N) X′_A X_A (Leon-Novelo et al., 2012), and H̃_j is the corresponding hat matrix.
Replacing m(ṽ | M_A) and m(ṽ | M_0) in π^{IP}(α_A | M_A) and solving the integral with respect to the theoretical training sample ṽ, we have
\[
\begin{aligned}
\pi^{IP}(\alpha_A\mid M_A) &= c_A \int \left((2\pi)^{-\frac{p_{A_z}-p_{0_z}}{2}} \left(\frac{c_0}{c_A}\right) e^{-\frac{1}{2}\tilde{v}'\left((I-\tilde{H}_0)-(I-\tilde{H}_A)\right)\tilde{v}}\, \frac{|\tilde{X}_A'\tilde{X}_A|^{1/2}}{|\tilde{X}_0'\tilde{X}_0|^{1/2}}\right) \left((2\pi)^{-\frac{p_{A_z}}{2}} e^{-\frac{1}{2}(\tilde{v}-\tilde{X}_A\alpha_A)'(\tilde{v}-\tilde{X}_A\alpha_A)}\right) d\tilde{v}\\
&= c_0\, (2\pi)^{-\frac{p_{A_z}-p_{0_z}}{2}}\, |\tilde{X}_{r,A}'\tilde{X}_{r,A}|^{1/2}\, 2^{-\frac{p_{A_z}-p_{0_z}}{2}} \exp\!\left[-\frac{1}{2}\alpha_{r,A}'\left(\frac{1}{2}\tilde{X}_{r,A}'\tilde{X}_{r,A}\right)\alpha_{r,A}\right]\\
&= \pi^{N}(\alpha_0) \times N\!\left(\alpha_{r,A}\,\middle|\, 0,\; 2\,(\tilde{X}_{r,A}'\tilde{X}_{r,A})^{-1}\right). \qquad (3\text{–}8)
\end{aligned}
\]
Analogously, the intrinsic prior for the parameters associated with the detection process is
\[
\begin{aligned}
\pi^{IP}(\lambda_A\mid M_A) &= d_0\, (2\pi)^{-\frac{p_{A_y}-p_{0_y}}{2}}\, |\tilde{Q}_{r,A}'\tilde{Q}_{r,A}|^{1/2}\, 2^{-\frac{p_{A_y}-p_{0_y}}{2}} \exp\!\left[-\frac{1}{2}\lambda_{r,A}'\left(\frac{1}{2}\tilde{Q}_{r,A}'\tilde{Q}_{r,A}\right)\lambda_{r,A}\right]\\
&= \pi^{N}(\lambda_0) \times N\!\left(\lambda_{r,A}\,\middle|\, 0,\; 2\,(\tilde{Q}_{r,A}'\tilde{Q}_{r,A})^{-1}\right). \qquad (3\text{–}9)
\end{aligned}
\]
In short, the intrinsic priors for α_A = (α′_0, α′_{r,A})′ and λ_A = (λ′_0, λ′_{r,A})′ are each the product of a reference prior on the parameters of the base model and a normal density on the parameters indexed by A_z and A_y, respectively.

3.3.3 Model Posterior Probabilities
We now derive the expressions involved in the calculation of the model posterior probabilities. First, recall that p(M_A | y, z, w, v) ∝ m(w, v | M_A) π(M_A). Hence, determining this posterior probability only requires calculating m(w, v | M_A).

Note that, since w and v are independent, obtaining the model posteriors from expression (3–4) reduces to finding closed-form expressions for the marginals m(v | M_{A_z}) and m(w | M_{A_y}) from equations (3–6) and (3–7), respectively. Therefore,
\[
m(w,v\mid M_A) = \int\!\!\int f(v,w\mid\alpha,\lambda,M_A)\,\pi^{IP}(\alpha\mid M_{A_z})\,\pi^{IP}(\lambda\mid M_{A_y})\, d\alpha\, d\lambda. \qquad (3\text{–}10)
\]
For the latent variable associated with the occupancy process, plugging the parameter intrinsic prior given by (3–8) into equation (3–6) (recalling that X̃′_A X̃_A = (p_{A_z}/N) X′_A X_A) and integrating out α_A yields
\[
\begin{aligned}
m(v\mid M_A) &= \int\!\!\int c_0\, N(v\mid X_0\alpha_0 + X_{r,A}\alpha_{r,A},\, I)\, N\!\left(\alpha_{r,A}\,\middle|\, 0,\; 2(\tilde{X}_{r,A}'\tilde{X}_{r,A})^{-1}\right) d\alpha_{r,A}\, d\alpha_0\\
&= c_0\, (2\pi)^{-n/2} \int \left(\frac{p_{A_z}}{2N + p_{A_z}}\right)^{\frac{p_{A_z}-p_{0_z}}{2}} \exp\!\left[-\frac{1}{2}(v - X_0\alpha_0)'\left(I - \left(\frac{2N}{2N + p_{A_z}}\right)H_{r,A_z}\right)(v - X_0\alpha_0)\right] d\alpha_0\\
&= c_0\, (2\pi)^{-(n-p_{0_z})/2} \left(\frac{p_{A_z}}{2N + p_{A_z}}\right)^{\frac{p_{A_z}-p_{0_z}}{2}} |X_0'X_0|^{-\frac{1}{2}} \exp\!\left[-\frac{1}{2} v'\left(I - H_{0_z} - \left(\frac{2N}{2N + p_{A_z}}\right)H_{r,A_z}\right)v\right], \qquad (3\text{–}11)
\end{aligned}
\]
with H_{r,A_z} = H_{A_z} − H_{0_z}, where H_{A_z} is the hat matrix for the entire model M_{A_z} and H_{0_z} is the hat matrix for the base model.
Similarly, the marginal distribution for w is
\[
m(w\mid M_A) = d_0\, (2\pi)^{-(J-p_{0_y})/2} \left(\frac{p_{A_y}}{2J + p_{A_y}}\right)^{\frac{p_{A_y}-p_{0_y}}{2}} |Q_0'Q_0|^{-\frac{1}{2}} \exp\!\left[-\frac{1}{2} w'\left(I - H_{0_y} - \left(\frac{2J}{2J + p_{A_y}}\right)H_{r,A_y}\right)w\right], \qquad (3\text{–}12)
\]
where J = Σ_{i=1}^{N} J_i; in other words, J denotes the total number of surveys conducted.
Now, the marginals under the base model M_0 = {M_{0y}, M_{0z}} are
\[
m(v\mid M_0) = \int c_0\, N(v\mid X_0\alpha_0,\, I)\, d\alpha_0 = c_0\, (2\pi)^{-(n-p_{0_z})/2}\, |X_0'X_0|^{-1/2} \exp\!\left[-\frac{1}{2} v'(I-H_{0_z})v\right] \qquad (3\text{–}13)
\]
and
\[
m(w\mid M_0) = d_0\, (2\pi)^{-(J-p_{0_y})/2}\, |Q_0'Q_0|^{-1/2} \exp\!\left[-\frac{1}{2} w'(I-H_{0_y})w\right]. \qquad (3\text{–}14)
\]
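As a numerical check (our own sketch, dropping the common constant c₀), the closed forms (3–11) and (3–13) can be coded directly; when no predictors are added to the base model, (3–11) must reduce to (3–13):

```python
import numpy as np

def log_marginal_v(v, X0, Xr=None):
    """log m(v | M) from (3-11), up to the constant c0.

    X0: base design matrix (N x p0); Xr: extra columns indexed by A_z,
    or None for the base model, in which case (3-13) is returned."""
    n, p0 = X0.shape
    H0 = X0 @ np.linalg.solve(X0.T @ X0, X0.T)
    if Xr is None:
        qf = v @ (np.eye(n) - H0) @ v
        return (-0.5 * (n - p0) * np.log(2 * np.pi)
                - 0.5 * np.linalg.slogdet(X0.T @ X0)[1] - 0.5 * qf)
    XA = np.hstack([X0, Xr])
    pA = XA.shape[1]
    HA = XA @ np.linalg.solve(XA.T @ XA, XA.T)
    Hr = HA - H0                      # H_{r,A_z}
    shrink = 2 * n / (2 * n + pA)     # 2N / (2N + p_{A_z})
    qf = v @ (np.eye(n) - H0 - shrink * Hr) @ v
    return (-0.5 * (n - p0) * np.log(2 * np.pi)
            + 0.5 * (pA - p0) * np.log(pA / (2 * n + pA))
            - 0.5 * np.linalg.slogdet(X0.T @ X0)[1] - 0.5 * qf)
```

With strong signal in the extra columns, the larger model's marginal should dominate the base model's, which is the driver of the model selection below.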
3.3.4 Model Selection Algorithm
Having the parameter intrinsic priors in place, and knowing the form of the model posterior probabilities, it is finally possible to develop a strategy to conduct model selection in the occupancy framework.

For each of the two components of the model (occupancy and detection), the algorithm first draws the set of active predictors (i.e., A_z and A_y) together with their corresponding parameters. This is a reversible-jump step, which uses a Metropolis–Hastings correction with proposal distributions given by
\[
\begin{aligned}
q(A_z^*\mid z_o, z_u^{(t)}, v^{(t)}, M_{A_z}) &= \frac{1}{2}\left(p\!\left(M_{A_z^*}\,\middle|\, z_o, z_u^{(t)}, v^{(t)}, \mathcal{M}_z,\, M_{A_z^*}\in L(M_{A_z})\right) + \frac{1}{|L(M_{A_z})|}\right)\\
q(A_y^*\mid y, z_o, z_u^{(t)}, w^{(t)}, M_{A_y}) &= \frac{1}{2}\left(p\!\left(M_{A_y^*}\,\middle|\, y, z_o, z_u^{(t)}, w^{(t)}, \mathcal{M}_y,\, M_{A_y^*}\in L(M_{A_y})\right) + \frac{1}{|L(M_{A_y})|}\right), \qquad (3\text{–}15)
\end{aligned}
\]
where L(M_{A_z}) and L(M_{A_y}) denote the sets of models obtained by adding or removing one predictor at a time from M_{A_z} and M_{A_y}, respectively.
To promote mixing, this step is followed by an additional draw from the full conditionals of α and λ. The densities p(α_0|·), p(α_{r,A}|·), p(λ_0|·), and p(λ_{r,A}|·) can be sampled from directly with Gibbs steps. Using the notation a|· to denote the random variable a conditioned on all other parameters and on the data, these densities are given by

- α_0|· ∼ N((X′_0X_0)^{−1}X′_0 v, (X′_0X_0)^{−1});
- α_{r,A}|· ∼ N(μ_{α_{r,A}}, Σ_{α_{r,A}}), where the mean vector and the covariance matrix are given by Σ_{α_{r,A}} = (2N/(2N + p_{A_z}))(X′_{r,A}X_{r,A})^{−1} and μ_{α_{r,A}} = Σ_{α_{r,A}} X′_{r,A} v;
- λ_0|· ∼ N((Q′_0Q_0)^{−1}Q′_0 w, (Q′_0Q_0)^{−1}); and
- λ_{r,A}|· ∼ N(μ_{λ_{r,A}}, Σ_{λ_{r,A}}), analogously, with mean and covariance matrix given by Σ_{λ_{r,A}} = (2J/(2J + p_{A_y}))(Q′_{r,A}Q_{r,A})^{−1} and μ_{λ_{r,A}} = Σ_{λ_{r,A}} Q′_{r,A} w.
Finally, Gibbs sampling steps are also available for the unobserved occupancy indicators z_u and for the corresponding latent variables v and w. The full conditional posterior densities for z_u^{(t+1)}, v^{(t+1)}, and w^{(t+1)} are those introduced in Chapter 2 for the single-season probit model.
The following steps summarize the stochastic search algorithm:

1. Initialize A_y^{(0)}, A_z^{(0)}, z_u^{(0)}, v^{(0)}, w^{(0)}, α_0^{(0)}, λ_0^{(0)}.

2. Sample the model indices and corresponding parameters:

(a) Draw simultaneously
- A*_z ∼ q(A_z | z_o, z_u^{(t)}, v^{(t)}, M_{A_z}),
- α*_0 ∼ p(α_0 | M_{A*_z}, z_o, z_u^{(t)}, v^{(t)}), and
- α*_{r,A*} ∼ p(α_{r,A} | M_{A*_z}, z_o, z_u^{(t)}, v^{(t)}).

(b) Accept (M_{A_z}^{(t+1)}, α_0^{(t+1),1}, α_{r,A}^{(t+1),1}) = (M_{A*_z}, α*_0, α*_{r,A*}) with probability
\[
\delta_z = \min\!\left(1,\; \frac{p(M_{A_z^*}\mid z_o, z_u^{(t)}, v^{(t)})}{p(M_{A_z^{(t)}}\mid z_o, z_u^{(t)}, v^{(t)})}\, \frac{q(A_z^{(t)}\mid z_o, z_u^{(t)}, v^{(t)}, M_{A_z^*})}{q(A_z^*\mid z_o, z_u^{(t)}, v^{(t)}, M_{A_z^{(t)}})}\right);
\]
otherwise let (M_{A_z}^{(t+1)}, α_0^{(t+1),1}, α_{r,A}^{(t+1),1}) = (M_{A_z^{(t)}}, α_0^{(t),2}, α_{r,A}^{(t),2}).

(c) Draw simultaneously
- A*_y ∼ q(A_y | y, z_o, z_u^{(t)}, w^{(t)}, M_{A_y}),
- λ*_0 ∼ p(λ_0 | M_{A*_y}, y, z_o, z_u^{(t)}, w^{(t)}), and
- λ*_{r,A*} ∼ p(λ_{r,A} | M_{A*_y}, y, z_o, z_u^{(t)}, w^{(t)}).

(d) Accept (M_{A_y}^{(t+1)}, λ_0^{(t+1),1}, λ_{r,A}^{(t+1),1}) = (M_{A*_y}, λ*_0, λ*_{r,A*}) with probability
\[
\delta_y = \min\!\left(1,\; \frac{p(M_{A_y^*}\mid y, z_o, z_u^{(t)}, w^{(t)})}{p(M_{A_y^{(t)}}\mid y, z_o, z_u^{(t)}, w^{(t)})}\, \frac{q(A_y^{(t)}\mid y, z_o, z_u^{(t)}, w^{(t)}, M_{A_y^*})}{q(A_y^*\mid y, z_o, z_u^{(t)}, w^{(t)}, M_{A_y^{(t)}})}\right);
\]
otherwise let (M_{A_y}^{(t+1)}, λ_0^{(t+1),1}, λ_{r,A}^{(t+1),1}) = (M_{A_y^{(t)}}, λ_0^{(t),2}, λ_{r,A}^{(t),2}).

3. Sample the base model parameters:
(a) Draw α_0^{(t+1),2} ∼ p(α_0 | M_{A_z^{(t+1)}}, z_o, z_u^{(t)}, v^{(t)}).
(b) Draw λ_0^{(t+1),2} ∼ p(λ_0 | M_{A_y^{(t+1)}}, y, z_o, z_u^{(t)}, w^{(t)}).

4. To improve mixing, resample the model coefficients that are not in the base model but are in M_A:
(a) Draw α_{r,A}^{(t+1),2} ∼ p(α_{r,A} | M_{A_z^{(t+1)}}, z_o, z_u^{(t)}, v^{(t)}).
(b) Draw λ_{r,A}^{(t+1),2} ∼ p(λ_{r,A} | M_{A_y^{(t+1)}}, y, z_o, z_u^{(t)}, w^{(t)}).

5. Sample the latent and missing (unobserved) variables:
(a) Sample z_u^{(t+1)} ∼ p(z_u | M_{A_z^{(t+1)}}, y, α_{r,A}^{(t+1),2}, α_0^{(t+1),2}, λ_{r,A}^{(t+1),2}, λ_0^{(t+1),2}).
(b) Sample v^{(t+1)} ∼ p(v | M_{A_z^{(t+1)}}, z_o, z_u^{(t+1)}, α_{r,A}^{(t+1),2}, α_0^{(t+1),2}).
(c) Sample w^{(t+1)} ∼ p(w | M_{A_y^{(t+1)}}, y, z_o, z_u^{(t+1)}, λ_{r,A}^{(t+1),2}, λ_0^{(t+1),2}).
3.4 Alternative Formulation
Because the occupancy process is only partially observed, it is reasonable to consider the posterior odds in terms of the observed responses, that is, the detections y and the presences at sites where at least one detection takes place. Partitioning the vector of presences into observed and unobserved components, z = (z′_o, z′_u)′, and integrating out the unobserved component, the model posterior for M_A can be obtained as
\[
p(M_A\mid y, z_o) \propto E_{z_u}\!\left[m(y, z\mid M_A)\right]\pi(M_A). \qquad (3\text{–}16)
\]
Data-augmenting the model in terms of latent normal variables à la Albert and Chib, the marginals of z and y inside the expectation in equation (3–16), for any model M = {M_y, M_z} ∈ M, can be expressed in terms of the latent variables:
\[
m(y,z\mid M) = \int_{T(z)}\int_{T(y,z)} m(w,v\mid M)\, dw\, dv = \left(\int_{T(z)} m(v\mid M_z)\, dv\right)\left(\int_{T(y,z)} m(w\mid M_y)\, dw\right), \qquad (3\text{–}17)
\]
where T(z) and T(y, z) denote the corresponding truncation regions for v and w, which depend on the values taken by z and y, and
\[
m(v\mid M_z) = \int f(v\mid\alpha, M_z)\,\pi(\alpha\mid M_z)\, d\alpha, \qquad (3\text{–}18)
\]
\[
m(w\mid M_y) = \int f(w\mid\lambda, M_y)\,\pi(\lambda\mid M_y)\, d\lambda. \qquad (3\text{–}19)
\]
The last equality in equation (3–17) is a consequence of the independence of the latent processes v and w. Using expressions (3–18) and (3–19) allows one to embed this model selection problem in the classical normal linear regression setting, where many "objective" Bayesian inferential tools are available. In particular, these expressions facilitate deriving the parameter intrinsic priors (Berger & Pericchi, 1996; Moreno et al., 1998) for this problem. This approach is an extension of the one implemented in Leon-Novelo et al. (2012) for the simple probit regression problem.
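As an illustration (our own sketch, not the author's implementation), a truncated integral of the form ∫_{T(z)} m(v | M_{A_z}, α₀) dv is simply the probability that a draw of v lands in the orthant determined by z, so it can be approximated by Monte Carlo, drawing α_{r,A} from its intrinsic normal prior N(0, (2N/p_{A_z})(X′_{r,A}X_{r,A})^{−1}):

```python
import numpy as np

def g1_mc(z, X0, Xr, alpha0, n_draws=4000, seed=0):
    """Monte Carlo estimate of the orthant probability that v drawn from
    m(v | M_{A_z}, alpha0) satisfies v_i > 0 iff z_i = 1."""
    rng = np.random.default_rng(seed)
    n = X0.shape[0]
    p_a = X0.shape[1] + Xr.shape[1]          # p_{A_z}
    cov = (2.0 * n / p_a) * np.linalg.inv(Xr.T @ Xr)
    ar = rng.multivariate_normal(np.zeros(Xr.shape[1]), cov, size=n_draws)
    mean = (X0 @ alpha0)[:, None] + Xr @ ar.T       # (n, n_draws)
    v = mean + rng.standard_normal((n, n_draws))
    match = np.all((v > 0) == (z == 1)[:, None], axis=0)
    return match.mean()
```

In practice one would pair such estimates for the presence and detection components and average over draws of z_u, which is exactly the expectation appearing in (3–16).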
Using this alternative approach, all that is left is to integrate m(v | M_A) and m(w | M_A) over their corresponding truncation regions T(z) and T(y, z), which yields m(y, z | M_A), and then to obtain the expectation with respect to the unobserved z's. Note, however, that two issues arise. First, such integrals are not available in closed form. Second, calculating the expectation over the limits of integration further complicates things. To address these difficulties, it is possible to express E_{z_u}[m(y, z | M_A)] as
\[
\begin{aligned}
E_{z_u}[m(y,z\mid M_A)] &= E_{z_u}\!\left[\left(\int_{T(z)} m(v\mid M_{A_z})\, dv\right)\left(\int_{T(y,z)} m(w\mid M_{A_y})\, dw\right)\right]\\
&= E_{z_u}\!\left[\left(\int_{T(z)}\!\int m(v\mid M_{A_z}, \alpha_0)\,\pi^{IP}(\alpha_0\mid M_{A_z})\, d\alpha_0\, dv\right) \times \left(\int_{T(y,z)}\!\int m(w\mid M_{A_y}, \lambda_0)\,\pi^{IP}(\lambda_0\mid M_{A_y})\, d\lambda_0\, dw\right)\right]\\
&= E_{z_u}\!\left[\int \underbrace{\left(\int_{T(z)} m(v\mid M_{A_z}, \alpha_0)\, dv\right)}_{g_1(T(z)\mid M_{A_z},\, \alpha_0)} \pi^{IP}(\alpha_0\mid M_{A_z})\, d\alpha_0 \times \int \underbrace{\left(\int_{T(y,z)} m(w\mid M_{A_y}, \lambda_0)\, dw\right)}_{g_2(T(y,z)\mid M_{A_y},\, \lambda_0)} \pi^{IP}(\lambda_0\mid M_{A_y})\, d\lambda_0\right]\\
&= c_0\, d_0 \int\!\!\int E_{z_u}\!\left[g_1(T(z)\mid M_{A_z}, \alpha_0)\, g_2(T(y,z)\mid M_{A_y}, \lambda_0)\right] d\alpha_0\, d\lambda_0, \qquad (3\text{–}20)
\end{aligned}
\]
where the last equality follows from Fubini's theorem, since m(v | M_{A_z}, α_0) and m(w | M_{A_y}, λ_0) are proper densities. From (3–20), the posterior odds are
\[
\frac{p(M_A\mid y, z_o)}{p(M_0\mid y, z_o)} = \frac{\int\!\!\int E_{z_u}\!\left[g_1(T(z)\mid M_{A_z}, \alpha_0)\, g_2(T(y,z)\mid M_{A_y}, \lambda_0)\right] d\alpha_0\, d\lambda_0}{\int\!\!\int E_{z_u}\!\left[g_1(T(z)\mid M_{0_z}, \alpha_0)\, g_2(T(y,z)\mid M_{0_y}, \lambda_0)\right] d\alpha_0\, d\lambda_0}\, \frac{\pi(M_A)}{\pi(M_0)}. \qquad (3\text{–}21)
\]
3.5 Simulation Experiments
The proposed methodology was tested under 36 different scenarios, in which we evaluated the behavior of the algorithm by varying the number of sites, the number of surveys, the amount of signal in the predictors for the presence component, and, finally, the amount of signal in the predictors for the detection component.

For each model component, the base model is taken to be the intercept-only model, and the full models considered for the presence and the detection components have, respectively, 30 and 20 predictors. Therefore, the model space contains 2^30 × 2^20 ≈ 1.12 × 10^15 candidate models.
To control the amount of signal in the presence and detection components, values for the model parameters were purposefully chosen so that the 10th, 50th, and 90th percentiles of the occupancy and detection probabilities match pre-specified values. Because presence and detection are binary variables, the amount of signal in each model component is associated with the spread and center of the distributions of the occupancy and detection probabilities, respectively. Low signal levels correspond to occupancy or detection probabilities close to 0.5; high signal levels correspond to probabilities close to 0 or 1. Larger spreads of the distributions of the occupancy and detection probabilities reflect greater heterogeneity among the observations collected, improving the discrimination capability of the model, and vice versa.
Therefore, for the presence component, the parameter values of the true model were chosen to set the median of the occupancy probabilities equal to 0.5. The chosen parameter values also fix the 10th and 90th percentiles symmetrically about 0.5, at small (Q^z_10 = 0.3, Q^z_90 = 0.7), intermediate (Q^z_10 = 0.2, Q^z_90 = 0.8), and large (Q^z_10 = 0.1, Q^z_90 = 0.9) distances. For the detection component, the model parameters are chosen to reflect detection probabilities concentrated about low values (Q^y_50 = 0.2), intermediate values (Q^y_50 = 0.5), and high values (Q^y_50 = 0.8), while keeping the 10th and 90th percentiles fixed at 0.1 and 0.9, respectively.
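One simple way to carry out such a calibration (a hypothetical reconstruction; the dissertation does not spell out its exact recipe): if the linear predictor η = x′α is normally distributed across sites, the occupancy probabilities Φ(η) hit the three target quantiles when the mean and scale of η are chosen as follows.

```python
from scipy.stats import norm

def linear_predictor_scale(q10, q50, q90):
    """Choose (mu, tau) so that, with eta ~ N(mu, tau^2) across sites,
    the 10th/50th/90th percentiles of Phi(eta) match the targets.

    Exact when the targets are symmetric about the median."""
    mu = norm.ppf(q50)
    tau = (norm.ppf(q90) - norm.ppf(q10)) / (norm.ppf(0.9) - norm.ppf(0.1))
    return mu, tau
```

For instance, the targets (0.3, 0.5, 0.7) give mu = 0 and a small tau (weak signal), while (0.1, 0.5, 0.9) gives a larger tau (strong signal).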
Table 3-1. Simulation control parameters, occupancy model selector.

Parameter                        Values considered
N                                50, 100
J                                3, 5
(Q^z_10, Q^z_50, Q^z_90)         (0.3, 0.5, 0.7), (0.2, 0.5, 0.8), (0.1, 0.5, 0.9)
(Q^y_10, Q^y_50, Q^y_90)         (0.1, 0.2, 0.9), (0.1, 0.5, 0.9), (0.1, 0.8, 0.9)
There are in total 36 scenarios, resulting from crossing all levels of the simulation control parameters (Table 3-1). Under each of these scenarios, 20 data sets were generated at random. True presence and detection indicators were generated with the probit model formulation from Chapter 2, using the assumed true models M_{Tz} = {1, x_2, x_15, x_16, x_22, x_28} for the presence and M_{Ty} = {1, q_7, q_10, q_12, q_17} for the detection, with the predictors included in the randomly generated data sets. In this context, 1 represents the intercept term. Throughout this section, we refer to predictors included in the true models as true predictors, and to those absent from them as false predictors.

The selection procedure was conducted on each one of these data sets with two different priors on the model space: the uniform (equal probability) prior and a multiplicity-correcting prior.
The results are summarized through the marginal posterior inclusion probabilities (MPIPs) of each predictor, and also through the five highest posterior probability models (HPMs). The MPIP for a given predictor, under a specific scenario and for a particular data set, is defined as
\[
p(\text{predictor is included}\mid y, z, w, v) = \sum_{M\in\mathcal{M}} I_{(\text{predictor}\in M)}\; p(M\mid y, z, w, v, \mathcal{M}). \qquad (3\text{–}22)
\]
In addition, we compare the MPIP odds between predictors present in the true model and predictors absent from it. Specifically, we consider the minimum odds of the marginal posterior inclusion probabilities. Let ξ̃ and ξ denote, respectively, a predictor in the true model M_T and a predictor absent from M_T. We define the minimum MPIP odds between the probabilities of true and false predictors as
\[
\text{minOdds}_{\text{MPIP}} = \frac{\min_{\tilde{\xi}\in M_T}\, p(I_{\tilde{\xi}} = 1\mid \tilde{\xi}\in M_T)}{\max_{\xi\notin M_T}\, p(I_{\xi} = 1\mid \xi\notin M_T)}. \qquad (3\text{–}23)
\]
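The two summaries, the MPIP of (3–22) and the minimum MPIP odds of (3–23), can be computed directly from MCMC draws of the model indicators; a small sketch:

```python
import numpy as np

def mpip(model_draws, K):
    """Marginal posterior inclusion probabilities, eq. (3-22).

    model_draws: list of sets of active predictor indices (one per MCMC
    draw); K: number of candidate predictors."""
    counts = np.zeros(K)
    for m in model_draws:
        for k in m:
            counts[k] += 1
    return counts / len(model_draws)

def min_odds_mpip(p, true_set):
    """Minimum MPIP odds between true and false predictors, eq. (3-23)."""
    true_min = min(p[k] for k in true_set)
    false_max = max(p[k] for k in range(len(p)) if k not in true_set)
    return true_min / false_max
```

Values larger than one indicate that even the least probable true predictor outranks every false predictor, which is the criterion used in the tables below.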
If the variable selection procedure adequately discriminates between true and false predictors, minOdds_MPIP takes values larger than one. The ability of the method to discriminate between the least probable true predictor and the most probable false predictor worsens as the indicator approaches 0.

3.5.1 Marginal Posterior Inclusion Probabilities for Model Predictors
For clarity, in Figures 3-1 through 3-5 only predictors in the true models are labeled, and they are emphasized with a dotted line passing through them. The left-hand-side plots in these figures contain the results for the presence component, and those on the right correspond to predictors in the detection component. The results obtained with the uniform model prior correspond to the black lines, and those for the multiplicity-correcting prior are in red. In these figures, the MPIPs have been averaged over all data sets from scenarios matching the condition observed.

In Figure 3-1, we contrast the mean MPIPs of the predictors over all data sets from scenarios with 50 sites to the mean MPIPs obtained for the scenarios with 100 sites. Similarly, Figure 3-2 compares the mean MPIPs of scenarios where 3 surveys are performed to those of scenarios having 5 surveys per site. Figures 3-4 and 3-5 show the effects of the different levels of signal considered in the occupancy probabilities and in the detection probabilities, respectively.
Regardless of the simulation scenario and model component observed, under the uniform prior false predictors obtain relatively high MPIPs. Conversely, the multiplicity-correcting prior strongly shrinks the MPIPs of false predictors towards 0. In the presence component, the MPIPs of the true predictors are also shrunk substantially under the multiplicity prior; however, a clear separation between true and false predictors remains. In contrast, in the detection component the MPIPs of true predictors remain relatively high (Figures 3-1 through 3-5).
[Figure 3-1. Predictor MPIP averaged over scenarios with N = 50 and N = 100 sites, using uniform (U) and multiplicity correction (MC) priors.]
[Figure 3-2. Predictor MPIP averaged over scenarios with J = 3 and J = 5 surveys per site, using uniform (U) and multiplicity correction (MC) priors.]
[Figure 3-3. Predictor MPIP averaged over scenarios with the interaction between the number of sites and the surveys per site, using uniform (U) and multiplicity correction (MC) priors.]
[Figure 3-4. Predictor MPIP averaged over scenarios with equal signal in the occupancy probabilities, using uniform (U) and multiplicity correction (MC) priors.]
[Figure 3-5. Predictor MPIP averaged over scenarios with equal signal in the detection probabilities, using uniform (U) and multiplicity correction (MC) priors.]
In scenarios where more sites were surveyed, the separation between the MPIPs of true and false predictors grew in both model components (Figure 3-1). Increasing the number of sites has an effect on both components, given that every time a new site is included, covariate information is added to the design matrices of both the presence and the detection components.

On the other hand, increasing the number of surveys affects the MPIPs of predictors in the detection component (Figures 3-2 and 3-3) but has only a marginal effect on predictors in the presence component. This may appear counterintuitive; however, increasing the number of surveys only increases the number of observations in the design matrix for the detection component, while leaving the design matrix for the presence component unaltered. The small changes observed in the MPIPs of the presence predictors as J increases are exclusively the result of additional detection indicators equal to 1 at sites that, with fewer surveys, would only have had 0-valued detections.
From Figure 3-3 it is clear that, for the presence component, the effect of the number of sites dominates the behavior of the MPIP, especially when using the multiplicity-correcting prior. In the detection component, the MPIP is influenced by both the number of sites and the number of surveys; the influence of increasing the number of surveys is larger when the number of sites is smaller, and vice versa.

Regarding the effect of the distribution of the occupancy probabilities, we observe that mostly the presence component is affected: there is stronger discrimination between true and false predictors as the distribution has higher variability (Figure 3-4). This is consistent with intuition, since having the presence probabilities more concentrated about 0.5 implies that the predictors do not vary much from one site to the next, whereas having the occupancy probabilities more spread out has the opposite effect.
Finally, consider the effect of concentrating the detection probabilities about high or low values. For predictors in the detection component, the separation between the MPIPs of true and false predictors is larger in scenarios where the distribution of the detection probability is centered about 0.2 or 0.8 than in scenarios where this distribution is centered about 0.5 (where the signal of the predictors is weakest). For predictors in the presence component, having the detection probabilities centered at higher values slightly increases the inclusion probabilities of the true predictors and reduces those of the false predictors (Figure 3-5).
Table 3-2. Comparison of average minOdds_MPIP under scenarios having different numbers of sites (N = 50, N = 100) and under scenarios having different numbers of surveys per site (J = 3, J = 5), for the presence and detection components, using uniform and multiplicity correction priors.

                          Sites              Surveys
Comp        π(M)    N=50     N=100     J=3      J=5
Presence    Unif    1.12     1.31      1.19     1.24
            MC      3.20     8.46      4.20     6.74
Detection   Unif    2.03     2.64      2.11     2.57
            MC      21.15    32.46     21.39    32.52
Table 3-3. Comparison of average minOdds_MPIP for the different levels of signal considered in the occupancy and detection probabilities, for the presence and detection components, using uniform and multiplicity correction priors.

                          (Q^z_10, Q^z_50, Q^z_90)                      (Q^y_10, Q^y_50, Q^y_90)
Comp        π(M)    (0.3,0.5,0.7)  (0.2,0.5,0.8)  (0.1,0.5,0.9)   (0.1,0.2,0.9)  (0.1,0.5,0.9)  (0.1,0.8,0.9)
Presence    Unif    1.05           1.20           1.34            1.10           1.23           1.24
            MC      2.02           4.55           8.05            2.38           6.19           6.40
Detection   Unif    2.34           2.34           2.30            2.57           2.00           2.38
            MC      25.37          20.77          25.28           29.33          18.52          28.49
The separation between the MPIPs of true and false predictors is even more evident in Tables 3-2 and 3-3, where the minimum MPIP odds between true and false predictors are shown. Under every scenario, the value of minOdds_MPIP (as defined in (3–23)) was greater than 1, implying that, on average, even the lowest MPIP for a true predictor is higher than the maximum MPIP for a false predictor. In both components of the model, the minOdds_MPIP are markedly larger under the multiplicity correction prior, and they increase with the number of sites and with the number of surveys.
For the presence component, increasing the signal in the occupancy probabilities, or having the detection probabilities concentrated about higher values, has a considerable positive effect on the magnitude of the odds. For the detection component, these odds are particularly high, especially under the multiplicity correction prior. Also, having the distribution of the detection probabilities centered about low or high values increases the minOdds_MPIP.

3.5.2 Summary Statistics for the Highest Posterior Probability Model
Tables 3-4 through 3-7 show the number of true predictors included in the HPM (True +) and the number of false predictors excluded from it (True −). The mean percentages observed in these tables convey one clear message: the highest probability models chosen with either model prior commonly differ from the corresponding true models. The strong shrinkage induced by the multiplicity correction prior allows only a few true predictors to be selected, but at the same time it prevents any false predictors from entering the HPM. On the other hand, the uniform prior includes in the HPM a larger proportion of true predictors, but at the expense of also introducing a large number of false predictors. This situation is exacerbated in the presence component, but it also occurs, to a lesser extent, in the detection component.
Table 3-4. Comparison between scenarios with 50 and 100 sites in terms of the average percentage of true positive and true negative terms over the highest probability models, for the presence and the detection components, using uniform and multiplicity correcting priors on the model space.

                          True +            True −
Comp        π(M)    N=50     N=100    N=50     N=100
Presence    Unif    0.57     0.63     0.51     0.55
            MC      0.06     0.13     1.00     1.00
Detection   Unif    0.77     0.85     0.87     0.93
            MC      0.49     0.70     1.00     1.00
Having more sites or surveys improves the inclusion of true predictors and the exclusion
of false ones in the HPM, for both the presence and detection components (Tables 3-4
and 3-5). On the other hand, if the distribution for the occupancy probabilities is more
Table 3-5. Comparison between scenarios with 3 and 5 surveys per site in terms of the
percentage of true positive and true negative predictors, averaged over the highest
probability models, for the presence and the detection components, using uniform and
multiplicity correcting priors on the model space.

                            True +              True −
Comp.       π(M)       J=3     J=5         J=3     J=5
Presence    Unif       0.59    0.61        0.52    0.54
            MC         0.08    0.10        1.00    1.00
Detection   Unif       0.78    0.85        0.87    0.92
            MC         0.50    0.68        1.00    1.00
spread out, the HPM includes more true predictors and fewer false ones in the presence
component. In contrast, the effect of the spread of the occupancy probabilities on the
detection HPM is negligible (Table 3-6). Finally, there is a positive relationship between
the location of the median of the detection probabilities and the number of correctly
classified true and false predictors for the presence component. The HPM in the detection
part of the model responds positively to low and high values of the median detection
probability (increased signal levels) in terms of correctly classified true and false
predictors (Table 3-7).
Table 3-6. Comparison between scenarios with different levels of signal in the occupancy
component in terms of the percentage of true positive and true negative predictors,
averaged over the highest probability models, for the presence and the detection
components, using uniform and multiplicity correcting priors on the model space.

                               True +                                  True −
Comp.      π(M)   (0.3,0.5,0.7) (0.2,0.5,0.8) (0.1,0.5,0.9)   (0.3,0.5,0.7) (0.2,0.5,0.8) (0.1,0.5,0.9)
Presence   Unif       0.55          0.61          0.64            0.50          0.54          0.55
           MC         0.02          0.08          0.18            1.00          1.00          1.00
Detection  Unif       0.81          0.82          0.81            0.90          0.89          0.89
           MC         0.57          0.61          0.59            1.00          1.00          1.00
3.6 Case Study: Blue Hawker Data Analysis
During 1999 and 2000, an intensive volunteer surveying effort coordinated by the
Centre Suisse de Cartographie de la Faune (CSCF) was conducted in order to analyze
the distribution of the blue hawker, Aeshna cyanea (Odonata: Aeshnidae), a common
dragonfly in Switzerland. Given that Switzerland is a small and mountainous country,
Table 3-7. Comparison between scenarios with different levels of signal in the detection
component in terms of the percentage of true positive and true negative predictors,
averaged over the highest probability models, for the presence and the detection
components, using uniform and multiplicity correcting priors on the model space.

                               True +                                  True −
Comp.      π(M)   (0.1,0.2,0.9) (0.1,0.5,0.9) (0.1,0.8,0.9)   (0.1,0.2,0.9) (0.1,0.5,0.9) (0.1,0.8,0.9)
Presence   Unif       0.59          0.59          0.62            0.51          0.54          0.54
           MC         0.06          0.10          0.11            1.00          1.00          1.00
Detection  Unif       0.89          0.77          0.78            0.91          0.87          0.91
           MC         0.70          0.48          0.59            1.00          1.00          1.00
there is large variation in its topography and physio-geography; as such, elevation is a
good candidate covariate to predict species occurrence at a large spatial scale. It can
be used as a proxy for habitat type, intensity of land use, and temperature, as well as some
biotic factors (Kery et al. 2010).
Repeated visits to 1-ha pixels took place to obtain the corresponding detection
histories. In addition to the survey outcome, the x- and y-coordinates, the thermal level, the
date of the survey, and the elevation were recorded. Surveys were restricted to the
known flight period of the blue hawker, which takes place between May 1 and October
10. In total, 2,572 sites were surveyed at least once during the surveying period. The
number of surveys per site ranges from 1 to 22 within each survey year.
Kery et al. (2010) summarize the results of this effort using AIC-based model
comparisons: first, following a backwards elimination approach for the detection
process while keeping the occupancy component fixed at the most complex model, and
then, for the presence component, choosing among a group of three models while using
the detection model already chosen. In our analysis of this dataset, for the detection and the
presence components we consider as full models those used in Kery et al. (2010), namely
Φ⁻¹(ψ) = α0 + α1 year + α2 elev + α3 elev² + α4 elev³,

Φ⁻¹(p) = λ0 + λ1 year + λ2 elev + λ3 elev² + λ4 elev³ + λ5 date + λ6 date²,
where year = I{year=2000}.
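As a concrete illustration of these two full models, the sketch below assembles the corresponding design matrices from synthetic covariates. The variable names and data here are hypothetical stand-ins for the recorded survey covariates, not the CSCF records themselves:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8  # a handful of illustrative observations

# Synthetic stand-ins for the recorded covariates
year = rng.integers(1999, 2001, size=n)  # survey year (1999 or 2000)
elev = rng.normal(size=n)                # standardized elevation
date = rng.normal(size=n)                # standardized survey date

year_ind = (year == 2000).astype(float)  # year = I{year = 2000}

# Presence (occupancy) full model: intercept, year, elev, elev^2, elev^3
Z_psi = np.column_stack([np.ones(n), year_ind, elev, elev**2, elev**3])

# Detection full model: adds date and date^2
Z_p = np.column_stack([np.ones(n), year_ind, elev, elev**2, elev**3,
                       date, date**2])

print(Z_psi.shape, Z_p.shape)  # (8, 5) (8, 7)
```

The column counts (5 and 7, including the intercepts) match the α and λ coefficients in the two linear predictors above.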
The model spaces for these data contain 2⁶ = 64 and 2⁴ = 16 models, respectively,
for the detection and occupancy components. That is, in total the model space contains
2⁴⁺⁶ = 1,024 models. Although this model space can be enumerated entirely, for
illustration we implemented the algorithm from Section 3.3.4, generating 10,000 draws
from the Gibbs sampler. Each of the models sampled was chosen from the set of
models that can be reached by changing the state of a single term in the current model
(to inclusion or exclusion, accordingly). This allows a more thorough exploration of the
model space because, for each of the 10,000 models drawn, the posterior probabilities
of many more models can be observed. Below, the labels for the predictors are followed
by either "z" or "y" to indicate the component they pertain to. Finally,
using the results from the model selection procedure, we conducted a validation step to
determine the predictive accuracy of the HPMs and of the median probability models
(MPMs). The performance of these models is then contrasted with that of the model
ultimately selected by Kery et al. (2010).

3.6.1 Results: Variable Selection Procedure
The model finally chosen for the presence component in Kery et al. (2010) was not
found among the five highest probability models under either model prior (Table 3-8). Moreover,
the year indicator was never chosen under the multiplicity correcting prior, hinting that
this term might correspond to a falsely identified predictor under the uniform prior.
Results in Table 3-10 support this claim: the marginal inclusion posterior probability for
the year predictor is 7% under the multiplicity correction prior. The multiplicity correction
prior concentrates the model posterior probability mass more densely in the highest
ranked models (90% of the mass is in the top five models) than the uniform prior (for which
the top five models account for 40% of the mass).
For the detection component, the HPM under both priors is the intercept-only model,
which we represent in Table 3-9 with a blank label. In both cases this model obtains very
Table 3-8. Posterior probability for the five highest probability models in the presence
component of the blue hawker data.

Uniform model prior
Rank   Mz selected               p(Mz|y)
1      yrz+elevz                 0.10
2      yrz+elevz+elevz3          0.08
3      elevz2+elevz3             0.08
4      yrz+elevz2                0.07
5      yrz+elevz3                0.07

Multiplicity correcting model prior
Rank   Mz selected               p(Mz|y)
1      elevz+elevz3              0.53
2                                0.15
3      elevz+elevz2              0.09
4      elevz2                    0.06
5      elevz+elevz2+elevz3       0.05
high posterior probabilities. The terms contained in the cubic polynomial for the elevation
appear to contain some relevant information; however, this conflicts with the MPIPs
observed in Table 3-11, which under both model priors are relatively low (< 20% with the
uniform and ≤ 4% with the multiplicity correcting prior).
Table 3-9. Posterior probability for the five highest probability models in the detection
component of the blue hawker data.

Uniform model prior
Rank   My selected     p(My|y)
1                      0.45
2      elevy3          0.06
3      elevy2          0.05
4      elevy           0.05
5      yry             0.04

Multiplicity correcting model prior
Rank   My selected     p(My|y)
1                      0.86
2      elevy3          0.02
3      datey2          0.02
4      elevy2          0.02
5      yry             0.02
Finally, it is possible to use the MPIPs to obtain the median probability model, which
contains the terms that have an MPIP higher than 50%. For the occupancy process
(Table 3-10), under the uniform prior, the year, the elevation, and the
elevation cubed are included. The MPM with the multiplicity correction prior coincides with
the HPM from this prior. The MPM chosen for the detection component (Table 3-11)
under both priors is the intercept-only model, coinciding again with the HPM.
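The MPM rule just described is simple to state in code. The sketch below applies the 50% threshold to the presence-component MPIPs transcribed from Table 3-10 (the function name is illustrative; ties at the rounded value 0.50 are counted as included, matching the models reported in the text):

```python
# MPIPs for the presence component, transcribed from Table 3-10
mpip_unif = {"yrz": 0.53, "elevz": 0.51, "elevz2": 0.45, "elevz3": 0.50}
mpip_mc   = {"yrz": 0.07, "elevz": 0.73, "elevz2": 0.23, "elevz3": 0.67}

def median_probability_model(mpip, threshold=0.5):
    """Keep every term whose MPIP reaches the threshold."""
    return {term for term, p in mpip.items() if p >= threshold}

print(median_probability_model(mpip_unif))  # {'yrz', 'elevz', 'elevz3'}
print(median_probability_model(mpip_mc))    # {'elevz', 'elevz3'}
```

Under the multiplicity correction prior, the MPM recovered this way is the same model as the HPM (elevz+elevz3).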
Given the outcomes of the simulation studies from Section 3.5, especially those
pertaining to the detection component, the results in Table 3-11 appear to indicate that
none of the predictors considered belong to the true model, especially when considering
Table 3-10. MPIP, presence component.

Predictor   p(predictor ∈ MTz | y, z, w, v)
            Unif     MultCorr
yrz         0.53     0.07
elevz       0.51     0.73
elevz2      0.45     0.23
elevz3      0.50     0.67

Table 3-11. MPIP, detection component.

Predictor   p(predictor ∈ MTy | y, z, w, v)
            Unif     MultCorr
yry         0.19     0.03
elevy       0.18     0.03
elevy2      0.18     0.03
elevy3      0.19     0.04
datey       0.16     0.03
datey2      0.15     0.04
those derived with the multiplicity correction prior. On the other hand, for the presence
component (Table 3-10), there is an indication that terms related to the cubic polynomial
in elevz can explain the occupancy patterns.

3.6.2 Validation for the Selection Procedure
Approximately half of the sites were selected at random for training (i.e., for model
selection and parameter estimation) and the remaining half were used as test data. In
the previous section, we observed that, using the marginal posterior inclusion probabilities
of the predictors, our method effectively separates predictors in the true model from
those that are not in it. However, in Tables 3-10 and 3-11 this separation is only clear for
the presence component using the multiplicity correction prior.
Therefore, in the validation procedure we observe the misclassification rates for the
detections using the following models: (1) the model ultimately recommended in Kery
et al. (2010) (yrz+elevz+elevz2+elevz3 + elevy+elevy2+datey+datey2); (2) the
highest probability model (HPM) with a uniform prior (yrz+elevz); (3) the HPM with a
multiplicity correcting prior (elevz+elevz3); (4) the median probability model (MPM),
i.e., the model including only predictors with an MPIP larger than 50%, with the uniform
prior (yrz+elevz+elevz3); and finally (5) the MPM with a multiplicity correction prior
(elevz+elevz3, the same as the HPM with multiplicity correction).
We must emphasize that the models resulting from the implementation of our model
selection procedure used exclusively the training dataset. In contrast, the model
in Kery et al. (2010) was chosen to minimize the prediction error of the complete data.
Because this model was obtained from the full dataset, results derived from it can only
be considered as a lower bound for the prediction errors.
The benchmark misclassification error rate for true 1's is high (close to 70%).
However, the misclassification rate for true 0's, which account for most of the
responses, is less pronounced (15%). Overall, the performance of the selected models
is comparable. They yield considerably worse results than the benchmark for the true
1's, but achieve rates close to the benchmark for the true 0's. Pooling together
the results for true ones and true zeros, the selected models with either prior have
misclassification rates close to 30%. The benchmark model performs comparably, with a
joint misclassification error of 23% (Table 3-12).
Table 3-12. Mean misclassification rate for HPMs and MPMs using uniform and
multiplicity correction model priors.

Model                                                       True 1   True 0   Joint
Benchmark (Kery et al. 2010):
  yrz+elevz+elevz2+elevz3+elevy+elevy2+datey+datey2          0.66     0.15     0.23
HPM Unif: yrz+elevz                                          0.83     0.17     0.28
HPM/MPM MC: elevz+elevz3                                     0.82     0.18     0.28
MPM Unif: yrz+elevz+elevz3                                   0.82     0.18     0.29
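The three error rates reported in Table 3-12 can be computed with a small helper of the following form. This is a generic sketch, with `misclassification_rates` a hypothetical name and the toy vectors below standing in for actual held-out detections:

```python
import numpy as np

def misclassification_rates(y_true, y_pred):
    """Error rates among the true 1's, among the true 0's, and jointly."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    ones = y_true == 1
    zeros = y_true == 0
    rate_1 = np.mean(y_pred[ones] != 1)   # misclassified true 1's
    rate_0 = np.mean(y_pred[zeros] != 0)  # misclassified true 0's
    joint = np.mean(y_pred != y_true)     # pooled misclassification
    return rate_1, rate_0, joint

# Toy example: 2 true 1's and 8 true 0's
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 1, 0, 0, 0, 0]
print(misclassification_rates(y_true, y_pred))  # (0.5, 0.125, 0.2)
```

Because true 0's dominate the responses, the joint rate is pulled toward the true-0 rate, as seen in the table.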
3.7 Discussion
In this Chapter we proposed an objective and fully automatic Bayesian methodology for
the single-season site-occupancy model. The methodology is said to be fully automatic
because no hyper-parameter specification is necessary in defining the parameter priors,
and objective because it relies on intrinsic priors derived from noninformative priors.
The intrinsic priors have been shown to have desirable properties as testing priors. We
also propose a fast stochastic search algorithm to explore large model spaces using our
model selection procedure.
Our simulation experiments demonstrated the ability of the method to single out the
predictors present in the true model when considering the marginal posterior inclusion
probabilities of the predictors. For predictors in the true model, these probabilities
were comparatively larger than those for predictors absent from it. Also, the simulations
indicated that the method has greater discrimination capability for predictors in the
detection component of the model, especially when using multiplicity correction priors.
Multiplicity correction priors were not described in this Chapter; however, their
influence on the selection outcome is significant. This behavior was observed in the
simulation experiment and in the analysis of the blue hawker data. Model priors play an
essential role: as the number of predictors grows, they are instrumental in controlling
the selection of false positive predictors. Additionally, model priors can be used to
account for predictor structure in the selection process, which helps both to reduce the
size of the model space and to make the selection more robust. These issues are the
topic of the next Chapter.
Accounting for the polynomial hierarchy in the predictors within the occupancy
context is a straightforward extension of the procedures we describe in Chapter 4.
Hence, our next step is to develop efficient software for it. An additional direction we
plan to pursue is developing methods for occupancy variable selection in a multivariate
setting. These can be used to conduct hypothesis testing in scenarios with varying
conditions through time, or in the case where multiple species are co-observed. A
final variation we will investigate for this problem is that of occupancy model selection
incorporating random effects.
CHAPTER 4
PRIORS ON THE MODEL SPACE AND WELL-FORMULATED MODELS
It has long been an axiom of mine that the little things are infinitely the most important.

–Sherlock Holmes, A Case of Identity
4.1 Introduction
In regression problems, if a large number of potential predictors is available, the
complete model space is too large to enumerate, and automatic selection algorithms are
necessary to find informative, parsimonious models. This multiple testing problem
is difficult, and even more so when interactions or powers of the predictors are
considered. In the ecological literature, models with interactions and/or higher order
polynomial terms are ubiquitous (Johnson et al. 2013; Kery et al. 2010; Zeller et al.
2011), given the complexity and non-linearities found in ecological processes. Several
model selection procedures, even in the classical normal linear setting, fail to address
two fundamental issues: (1) the model selection outcome is not invariant to affine
transformations when interactions or polynomial structures are found among the
predictors, and (2) additional penalization is required to control for false positives as the
model space grows (i.e., as more covariates are considered).
These two issues motivate the developments presented throughout this Chapter.
Building on the results of Chipman (1996), we propose, investigate, and provide
recommendations for three different prior distributions on the model space. These
priors help control for test multiplicity while accounting for polynomial structure in the
predictors. They improve upon those proposed by Chipman, first by avoiding the need
to specify values for the prior inclusion probabilities of the predictors, and second
by formulating principled alternatives to introduce additional structure in the model
priors. Finally, we design a stochastic search algorithm that allows fast and thorough
exploration of model spaces with polynomial structure.
Having structure in the predictors can determine the selection outcome. As an
illustration, consider the model E[y] = β00 + β01 x2 + β20 x1², where the order-one
term x1 is not present (this choice of subscripts for the coefficients is defined in the
following section). Transforming x1 ↦ x1* = x1 + c for some c ≠ 0, the model
becomes E[y] = β00 + β01 x2 + β20* x1*². Note that, in terms of the original predictors,
x1*² = x1² + 2c·x1 + c², implying that this seemingly innocuous transformation of x1
modifies the column space of the design matrix by including x1, which was not in the
original model. That is, when lower order terms in the hierarchy are omitted from the
model, the column space of the design matrix is not invariant to affine transformations.
As the hat matrix depends on the column space, the model's predictive capability is also
affected by how the covariates in the model are coded, an undesirable feature for any
model selection procedure. To make model selection invariant to affine transformations,
the selection must be constrained to the subset of models that respect the hierarchy
(Griepentrog et al. 1982; Khuri 2002; McCullagh & Nelder 1989; Nelder 2000;
Peixoto 1987, 1990). These models are known as well-formulated models (WFMs).
Succinctly, a model is well-formulated if, for any predictor in the model, every lower order
predictor associated with it is also in the model. The model above is not well-formulated,
as it contains x1² but not x1.
WFMs exhibit strong heredity, in that all lower order terms dividing higher order
terms in the model must also be included. An alternative is to require only weak heredity
(Chipman 1996), which forces only some of the lower terms in the corresponding
polynomial hierarchy to be in the model. However, Nelder (1998) demonstrated that the
conditions under which weak heredity allows the design matrix to be invariant to affine
transformations of the predictors are too restrictive to be useful in practice.
Although this topic appeared in the literature more than three decades ago (Nelder
1977), only recently have modern variable selection techniques been adapted to
account for the constraints imposed by heredity. As described in Bien et al. (2013),
the current literature on variable selection for polynomial response surface models
can be classified into three broad groups: multi-step procedures (Brusco et al. 2009;
Peixoto 1987), regularized regression methods (Bien et al. 2013; Yuan et al. 2009),
and Bayesian approaches (Chipman 1996). The methods introduced in this Chapter
take a Bayesian approach towards variable selection for well-formulated models, with
particular emphasis on model priors.
As mentioned in previous chapters, the Bayesian variable selection problem
consists of finding models with high posterior probabilities within a pre-specified model
space M. The model posterior probability for M ∈ M is given by

p(M | y, M) ∝ m(y | M) π(M | M).    (4–1)
Model posterior probabilities depend on the prior distribution on the model space,
as well as on the prior distributions for the model-specific parameters, implicitly through
the marginals m(y | M). Priors on the model-specific parameters have been extensively
discussed in the literature (Berger & Pericchi 1996; Berger et al. 2001; George 2000;
Jeffreys 1961; Kass & Wasserman 1996; Liang et al. 2008; Zellner & Siow 1980). In
contrast, the effect of the prior on the model space has, until recently, been neglected.
A few authors (e.g., Casella et al. (2014); Scott & Berger (2010); Wilson et al. (2010))
have highlighted the relevance of the priors on the model space in the context of multiple
testing. Adequately formulating priors on the model space can both account for structure
in the predictors and provide additional control on the detection of false positive terms.
In addition, using the popular uniform prior over the model space may lead to the
undesirable and "informative" implication of favoring models of size p/2 (where p is the
total number of covariates), since this is the most abundant model size contained in the
model space.
Variable selection within the model space of well-formulated polynomial models
poses two challenges for automatic objective model selection procedures. First, the
notion of model complexity takes on a new dimension: complexity is not exclusively
a function of the number of predictors, but also depends upon the depth and
connectedness of the associations defined by the polynomial hierarchy. Second,
because the model space is shaped by such relationships, stochastic search algorithms
used to explore the models must also conform to these restrictions.
Models without polynomial hierarchy constitute a special case of WFMs where
all predictors are of order one. Hence, all the methods developed throughout this
Chapter also apply to models with no predictor structure. Additionally, although our
proposed methods are presented for the normal linear case to simplify the exposition,
these methods are general enough to be embedded in many Bayesian selection
and averaging procedures, including, of course, the occupancy framework previously
discussed.

In this Chapter, we first provide the necessary definitions to characterize the
well-formulated model selection problem. Then we proceed to introduce three new prior
structures on the well-formulated model space and characterize their behavior with
simple examples and simulations. With the model priors in place, we build a stochastic
search algorithm to explore spaces of well-formulated models that relies on intrinsic
priors for the model-specific parameters (though this assumption can be relaxed
to use other mixtures of g-priors). Finally, we implement our procedures using both
simulated and real data.
4.2 Setup for Well-Formulated Models
Suppose that the observations yi are modeled using the polynomial regression on
the covariates xi1, ..., xip given by

yi = Σ_{α ∈ N₀^p} β_{(α1,...,αp)} ∏_{j=1}^{p} x_{ij}^{αj} + εi,    (4–2)

where α = (α1, ..., αp) belongs to N₀^p, the p-dimensional space of natural numbers
including 0, the errors are εi ~ iid N(0, σ²), and only finitely many βα are allowed to be non-zero.
As an illustration, consider a model space that includes polynomial terms incorporating
covariates xi1 and xi2 only. The terms xi2² and xi1²xi2 can be represented by α = (0, 2)
and α = (2, 1), respectively.
The notation y = Z(X)β + ε is used to denote that the observed response y =
(y1, ..., yn)′ is modeled via a polynomial function Z of the original covariates contained
in X = (x1, ..., xp) (where xj = (x1j, ..., xnj)′), and the coefficients of the polynomial
terms are given by β. A specific polynomial model M is defined by the set of coefficients
βα that are allowed to be non-zero. This definition is equivalent to characterizing M
through a collection of multi-indices α ∈ N₀^p. In particular, model M is specified by
M = {α1^M, ..., α|M|^M} for αk^M ∈ N₀^p, where βα = 0 for α ∉ M.

Any particular model M uses a subset XM of the original covariates X to form the
polynomial terms in the design matrix ZM(X). Without ambiguity, a polynomial model
ZM(X) on X can be identified with a polynomial model ZM(XM) on the covariates XM.
The number of terms used by M to model the response y, denoted by |M|, corresponds
to the number of columns of ZM(XM). The coefficient vector and error variance of
model M are denoted by βM and σM², respectively. Thus, M models the data as
y = ZM(XM)βM + εM, where εM ~ N(0, I σM²). Model M is said to be nested in model M′
if M ⊂ M′. M models the response of the covariates in two distinct ways: choosing the
set of meaningful covariates XM, as well as choosing the polynomial structure on these
covariates, ZM(XM).
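Under this notation a model is just a set of multi-indices, and the design matrix Z_M(X_M) can be assembled mechanically from them. A minimal sketch, with illustrative function names and synthetic data:

```python
import numpy as np

def design_matrix(X, M):
    """Z_M(X): one column per multi-index alpha in M, prod_j x_j^alpha_j."""
    cols = sorted(M)  # fix a column order for reproducibility
    return np.column_stack([np.prod(X ** np.array(a), axis=1) for a in cols])

rng = np.random.default_rng(2)
X = rng.normal(size=(5, 2))  # n = 5 observations, p = 2 covariates

# M = {1, x1, x2, x1*x2} written as multi-indices
M = {(0, 0), (1, 0), (0, 1), (1, 1)}
Z = design_matrix(X, M)
print(Z.shape)  # (5, 4): |M| columns
```

The multi-index (0, 0) produces the intercept column of ones, since every covariate is raised to the zero power.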
The set N₀^p constitutes a partially ordered set, or more succinctly, a poset. A poset
is a set partially ordered through a binary relation "≼". In this context, the binary relation
on the poset N₀^p is defined between pairs (α, α′) by α′ ≼ α whenever αj ≥ α′j for all
j = 1, ..., p, with α′ ≺ α if, additionally, αj > α′j for some j. The order of a term α ∈ N₀^p
is given by the sum of its elements, order(α) = Σj αj. When order(α) = order(α′) + 1
and α′ ≺ α, then α′ is said to immediately precede α, which is denoted by α′ → α.
The parent set of α is defined by P(α) = {α′ ∈ N₀^p : α′ → α}, and is given by the
set of nodes that immediately precede the given node. A polynomial model M is said to
be well-formulated if α ∈ M implies that P(α) ⊂ M. For example, any well-formulated
model using xi1²xi2 to model yi must also include the parent terms xi1xi2 and xi1², their
corresponding parent terms xi1 and xi2, and the intercept term 1.
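This definition translates directly into code. In the sketch below (helper names are illustrative), multi-indices are exponent tuples, e.g. x1²x2 is (2, 1), and the strong-heredity condition is checked by verifying that every node's parent set is contained in the model:

```python
def parents(alpha):
    """Parent set P(alpha): lower one positive exponent by 1."""
    return {tuple(a - (i == j) for i, a in enumerate(alpha))
            for j in range(len(alpha)) if alpha[j] > 0}

def is_well_formulated(model):
    """A model (a set of multi-indices) is a WFM if it contains every parent."""
    return all(parents(a) <= model for a in model)

# {1, x1, x2, x1^2, x1*x2, x1^2*x2} is well-formulated ...
wfm = {(0, 0), (1, 0), (0, 1), (2, 0), (1, 1), (2, 1)}
# ... but dropping x1*x2 violates heredity under x1^2*x2
not_wfm = wfm - {(1, 1)}
print(is_well_formulated(wfm), is_well_formulated(not_wfm))  # True False
```

For instance, `parents((2, 1))` returns {(1, 1), (2, 0)}, i.e., x1x2 and x1², exactly the parent terms named in the example above.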
The poset N₀^p can be represented by a directed acyclic graph (DAG). Without
ambiguity, we can identify nodes in the graph, α ∈ N₀^p, with terms in
the set of covariates. The graph has directed edges to a node from its parents. Any
well-formulated model M is represented by a subgraph of this DAG with the property
that if a node α is in the subgraph, then the nodes corresponding to P(α) are also in it. Figure
4-1 shows examples of well-formulated polynomial models, where α ∈ N₀^p is identified
with ∏_{j=1}^{p} xj^{αj}.
The motivation for considering only well-formulated polynomial models is
compelling. Let ZM be the design matrix associated with a polynomial model M. The
subspace of y modeled by ZM, given by the hat matrix HM = ZM(ZM′ZM)⁻¹ZM′, is
invariant to affine transformations of the matrix XM if and only if M corresponds to a
well-formulated polynomial model (Peixoto 1990).
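This invariance property is easy to verify numerically. The sketch below contrasts the well-formulated model {1, x, x²} with the non-well-formulated model {1, x²} under a shift of the covariate (synthetic data; `hat` is an illustrative helper):

```python
import numpy as np

def hat(Z):
    """Hat matrix H = Z (Z'Z)^{-1} Z'."""
    return Z @ np.linalg.solve(Z.T @ Z, Z.T)

rng = np.random.default_rng(1)
x = rng.normal(size=6)
c = 2.0  # an affine shift of the covariate

# Well-formulated {1, x, x^2}: the shift leaves the column space unchanged
Z  = np.column_stack([np.ones_like(x), x, x**2])
Zc = np.column_stack([np.ones_like(x), x + c, (x + c)**2])
print(np.allclose(hat(Z), hat(Zc)))   # True

# Not well-formulated {1, x^2}: the shift changes the hat matrix
W  = np.column_stack([np.ones_like(x), x**2])
Wc = np.column_stack([np.ones_like(x), (x + c)**2])
print(np.allclose(hat(W), hat(Wc)))   # False
```

The reason is exactly the expansion (x + c)² = x² + 2cx + c²: with x in the model the shifted columns span the same space, without it they do not.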
Figure 4-1. Graphs of well-formulated polynomial models for p = 2 (panels A and B).
For example, if p = 2 and yi = β(0,0) + β(1,0) xi1 + β(0,1) xi2 + β(1,1) xi1 xi2 + εi, then
the hat matrix is invariant to any covariate transformation of the form A (xi1, xi2)′ + b, for any
real-valued positive definite 2 × 2 matrix A and any real-valued vector b of dimension two.
In contrast, if yi = β(0,0) + β(2,0) xi1² + εi, then the hat matrix formed after applying the
transformation xi1 ↦ xi1 + c for real c ≠ 0 is not the same as the hat matrix formed with
the original xi1.

4.2.1 Well-Formulated Model Spaces
The spaces of WFMs M considered here can be characterized in terms
of two WFMs: MB, the base model, and MF, the full model. The base model contains at
least the intercept term and is nested in the full model. The model space M is populated
by all well-formulated models M that nest MB and are nested in MF:

M = {M : MB ⊆ M ⊆ MF and M is well-formulated}.
For M to be well-formulated, the entire ancestry of each node in M must also be
included in M. Because of this, M ∈ M can be uniquely identified by two different sets
of nodes in MF: the set of extreme nodes and the set of children nodes. For M ∈ M,
the sets of extreme and children nodes, respectively denoted by E(M) and C(M), are
defined by

E(M) = {α ∈ M \ MB : α ∉ P(α′) ∀ α′ ∈ M},
C(M) = {α ∈ MF \ M : {α} ∪ M is well-formulated}.

The extreme nodes are those nodes that, when removed from M, give rise to a WFM in
M. The children nodes are those nodes that, when added to M, give rise to a WFM in
M. Because MB ⊆ M for all M ∈ M, the set of nodes E(M) ∪ MB determines M, by
beginning with this set and iteratively adding parent nodes. Similarly, the nodes in C(M)
determine the set {α′ ∈ P(α) : α ∈ C(M)} ∪ {α′ ∈ E(MF) : α ⋠ α′ for all α ∈ C(M)},
which contains E(M) ∪ MB and thus uniquely identifies M.
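These two sets can be computed directly from the parent relation, and they also enumerate the legal single-term moves (delete an extreme node, add a child node) available to a stochastic search over M. A sketch for the example of Figure 4-2, with illustrative helper names:

```python
def parents(alpha):
    """Parent set P(alpha): lower one positive exponent by 1."""
    return {tuple(a - (i == j) for i, a in enumerate(alpha))
            for j in range(len(alpha)) if alpha[j] > 0}

def extreme_nodes(M, M_base):
    """E(M): non-base nodes that are parents of nothing in M."""
    return {a for a in M - M_base
            if all(a not in parents(b) for b in M)}

def children_nodes(M, M_full):
    """C(M): nodes outside M whose full parent set already lies in M."""
    return {a for a in M_full - M if parents(a) <= M}

# MF = {1, x1, x2, x1^2, x1*x2, x2^2}, MB = {1}, M = {1, x1, x1^2}
M_full = {(0, 0), (1, 0), (0, 1), (2, 0), (1, 1), (0, 2)}
M_base = {(0, 0)}
M = {(0, 0), (1, 0), (2, 0)}
print(extreme_nodes(M, M_base))   # {(2, 0)}: only x1^2 can be removed
print(children_nodes(M, M_full))  # {(0, 1)}: x2 is the single child node
```

This reproduces the sets shown in Figure 4-2: the dashed node in panel A is x1², and the dashed node in panel B is x2.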
Figure 4-2. A: Extreme node set. B: Children node set.
In Figure 4-2, the extreme and children sets for the model M = {1, x1, x1²} are shown for
the model space characterized by MF = {1, x1, x2, x1², x1x2, x2²}. In Figure 4-2A, the solid
nodes represent nodes α ∈ M \ E(M), the dashed node corresponds to α ∈ E(M), and
the dotted nodes are not in M. Solid nodes in Figure 4-2B correspond to those in M;
the dashed node is the single node in C(M), and the dotted nodes are not in M ∪ C(M).

4.3 Priors on the Model Space
As discussed in Scott & Berger (2010), the Ockham's-razor effect found
automatically in Bayesian variable selection through the Bayes factor does not correct
for multiple testing. This penalization acts against more complex models, but does not
account for the collection of models in the model space, which describes the multiplicity
of the testing problem. This is where the role of the prior on the model space becomes
important. As Scott & Berger explain, the multiplicity penalty is "hidden away" in the
model prior probabilities π(M | M).
In what follows, we propose three different prior structures on the model space
for WFMs, discuss their advantages and disadvantages, and describe reasonable
choices for their hyper-parameters. In addition, we investigate how the choice of
prior structure and hyper-parameter combinations affects the posterior probabilities for
predictor inclusion, providing some recommendations for different situations.

4.3.1 Model Prior Definition
The graphical structure of the model space suggests a method for prior
construction on M, guided by the notion of inheritance. A node α is said to inherit from
a node α′ if there is a directed path from α′ to α in the graph of MF. The inheritance
is said to be immediate if order(α) = order(α′) + 1 (equivalently, if α′ ∈ P(α), or if α′
immediately precedes α).

For convenience, write M \ MB for the set of nodes in M that are not
in the base model MB. For α ∈ MF \ MB, let γα(M) be the indicator function describing
whether α is included in M, i.e., γα(M) = I(α ∈ M). Denote by γ_j(M) the set of indicators
of inclusion in M for all order-j nodes in MF \ MB. Finally, let γ_{<ν}(M) = ∪_{j=0}^{ν−1} γ_j(M),
the set of indicators of inclusion in M for all nodes in MF \ MB of order less than ν. With
these definitions, the prior probability of any model M ∈ M can be factored as

π(M | M) = ∏_{j=J_M^min}^{J_M^max} π(γ_j(M) | γ_{<j}(M), M),    (4–3)

where J_M^min and J_M^max are, respectively, the minimum and maximum order of nodes in
MF \ MB, and π(γ_{J_M^min}(M) | γ_{<J_M^min}(M), M) = π(γ_{J_M^min}(M) | M).
Prior distributions on M can be simplified by making two assumptions. First, if
order(α) = order(α′) = j, then γα and γα′ are assumed to be conditionally independent
when conditioned on γ_{<j}, denoted by γα ⊥⊥ γα′ | γ_{<j}. Second, immediate inheritance is
invoked, and it is assumed that if order(α) = j, then γα(M) | γ_{<j}(M) = γα(M) | γ_{P(α)}(M),
where γ_{P(α)}(M) is the inclusion indicator for the set of parent nodes of α. This indicator
is one if the complete parent set of α is contained in M, and zero otherwise.

In Figure 4-3, these two assumptions are depicted with MF being an order-two
surface in two main effects. The conditional independence assumption (Figure 4-3A)
implies that the inclusion indicators for x1², x2², and x1x2 are independent when conditioned
on all the lower order terms. In this same space, immediate inheritance implies that
the inclusion of x1², conditioned on the inclusion of all lower order nodes, is equivalent to
conditioning it on its parent set (x1 in this case).
Figure 4-3. A: Conditional independence, x1² ⊥⊥ x1x2 ⊥⊥ x2² | {1, x1, x2}.
B: Immediate inheritance, x1² | {1, x1, x2} = x1² | x1.
Denote the conditional inclusion probability of node α in model M by πα =
π(γα(M) = 1 | γ_{P(α)}(M), M). Under the assumptions of conditional independence
and immediate inheritance, the prior probability of M is

π(M | πM, M) = ∏_{α ∈ MF \ MB} πα^{γα(M)} (1 − πα)^{1 − γα(M)},    (4–4)

with πM = {πα : α ∈ MF \ MB}. Because M must be well-formulated, πα = γα =
0 whenever γ_{P(α)}(M) = 0. Thus, the product in 4–4 can be restricted to the set of nodes
α ∈ (M \ MB) ∪ C(M). Additional structure can be built into the prior on M by making
assumptions about the inclusion probabilities πα, such as equality assumptions or
assumptions of a hyper-prior for these parameters. Three such prior classes are
developed next, first by assigning hyper-priors on πM assuming some structure among
its elements, and then marginalizing out πM.
Hierarchical Uniform Prior (HUP). The HUP assumes that the non-zero πα are all equal. Specifically, for a model M ∈ M it is assumed that πα = π for all α ∈ N(M) ∪ C(M). The Bayesian specification of the HUP is completed by assuming a prior distribution for π. The choice of π ~ Beta(a, b) produces

πHUP(M | M, a, b) = B(|N(M)| + a, |C(M)| + b) / B(a, b),   (4–5)

where B is the beta function. Setting a = b = 1 gives the particular value

πHUP(M | M, a = 1, b = 1) = 1 / [(|N(M)| + |C(M)| + 1) · binom(|N(M)| + |C(M)|, |N(M)|)].   (4–6)

The HUP assigns equal probabilities to all models for which the sets of nodes N(M) and C(M) have the same cardinality. This prior provides a combinatorial penalization but essentially fails to account for the hierarchical structure of the model space. An additional penalization for model complexity can be incorporated into the HUP by changing the values of a and b. Because πα = π for all α, this penalization can only depend on some aspect of the entire graph of MF, such as the total number of nodes not in the null model, |N(MF)|.
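As an illustrative sketch (not part of the dissertation's code), the HUP probability in 4–5 reduces to a ratio of beta functions of the two node counts; the values below reproduce entries of the two-variable quadratic example in Figure 4-4, where |N(MF)| = 5.

```python
from math import lgamma, exp

def log_beta(a, b):
    """log B(a, b) computed via log-gamma functions."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def hup_prior(n_nodes, n_children, a=1.0, b=1.0):
    """Equation 4-5: B(|N(M)| + a, |C(M)| + b) / B(a, b)."""
    return exp(log_beta(n_nodes + a, n_children + b) - log_beta(a, b))

# Intercept-only model in the quadratic space on two variables:
# |N(M)| = 0 and C(M) = {x1, x2}.
print(hup_prior(0, 2))        # a = b = 1 gives 1/3
print(hup_prior(0, 2, b=5))   # b = ch = |N(MF)| = 5 gives 5/7
```

The same function returns 1/12 from `hup_prior(1, 2)`, the Figure 4-4 value for either single-main-effect model under a = b = 1.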
Hierarchical Independence Prior (HIP). The HIP assumes that there are no equality constraints among the non-zero πα. Each non-zero πα is given its own prior, which is assumed to be a Beta distribution with parameters aα and bα. Thus the prior probability of M under the HIP is

πHIP(M | M, a, b) = ∏_{α ∈ N(M)} [aα / (aα + bα)] · ∏_{α ∈ C(M)} [bα / (aα + bα)],   (4–7)

where the product over the empty set is taken to be 1. Because the πα are totally independent, any choice of aα and bα is equivalent to choosing a probability of success πα for a given α. Setting aα = bα = 1 for all α ∈ N(M) ∪ C(M) gives the particular value

πHIP(M | M, a = 1, b = 1) = (1/2)^{|N(M)| + |C(M)|}.   (4–8)
Although the prior with this choice of hyper-parameters accounts for the hierarchical structure of the model space, it essentially provides no penalization for combinatorial complexity at different levels of the hierarchy. This can be observed by considering a model space with main effects only: the exponent in 4–8 is the same for every model in the space, because each node is either in the model or in the children set.

Additional penalizations for model complexity can be incorporated into the HIP. Because each γj is conditioned on γ<j in the prior construction, the aα and bα for α of order j can be conditioned on γ<j. One such additional penalization utilizes the number of nodes of order j that could be included to produce a WFM conditioned on the inclusion vector γ<j, which is denoted ch_j(γ<j). Choosing aα = 1 and bα(M) = ch_j(γ<j) is equivalent to choosing a probability of success πα = 1/(1 + ch_j(γ<j)). This penalization can drive down the false positive rate when ch_j(γ<j) is large, but may produce more false negatives.
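A minimal sketch (an illustration, not the dissertation's code) of the HIP probability in 4–7; the example reproduces the Figure 4-4 value for the model {1, x1} under the penalizing choice a = 1, b = ch.

```python
from math import prod

def hip_prior(included, children):
    """Equation 4-7: product of a/(a+b) over included nodes and of b/(a+b)
    over children; each list entry is the (a_alpha, b_alpha) pair of one node.
    An empty list contributes an empty product, i.e. the factor 1."""
    return (prod(a / (a + b) for a, b in included) *
            prod(b / (a + b) for a, b in children))

# Model {1, x1} in the two-variable quadratic space with a = 1, b = ch:
# x1 is included (ch_1 = 2); the children are x2 (ch_1 = 2) and x1^2 (ch_2 = 1).
print(hip_prior(included=[(1, 2)], children=[(1, 2), (1, 1)]))  # (1/3)(2/3)(1/2) = 1/9
```

With aα = bα = 1 throughout, the same function returns (1/2)^(|N(M)| + |C(M)|), i.e. equation 4–8.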
Hierarchical Order Prior (HOP). A compromise between complete equality and complete independence of the πα is to assume equality between the πα of a given order and independence across the different orders. Define Nj(M) = {α ∈ N(M) : order(α) = j} and Cj(M) = {α ∈ C(M) : order(α) = j}. The HOP assumes that πα = πj for all α ∈ Nj(M) ∪ Cj(M). Assuming that πj ~ Beta(aj, bj) provides a prior probability of

πHOP(M | M, a, b) = ∏_{j = J_min^M}^{J_max^M} B(|Nj(M)| + aj, |Cj(M)| + bj) / B(aj, bj).   (4–9)

The specific choice of aj = bj = 1 for all j gives a value of

πHOP(M | M, a = 1, b = 1) = ∏_j 1 / [(|Nj(M)| + |Cj(M)| + 1) · binom(|Nj(M)| + |Cj(M)|, |Nj(M)|)],   (4–10)

and produces a hierarchical version of the Scott and Berger multiplicity correction.
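The order-wise product in 4–9 can be sketched as follows (an illustration with assumed inputs, not the dissertation's code); the example gives the Figure 4-4 probability of the model {1, x1}, which equals 1/12 under both hyper-parameter choices.

```python
from math import lgamma, exp

def log_beta(a, b):
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def hop_prior(counts, a=1.0, b=None):
    """Equation 4-9: product over orders j of B(|N_j|+a_j, |C_j|+b_j) / B(a_j, b_j).
    `counts` maps j -> (|N_j(M)|, |C_j(M)|).  b=None uses the default b_j = 1,
    and b='ch' uses the penalizing choice b_j = |N_j(M)| + |C_j(M)|."""
    log_p = 0.0
    for n_j, c_j in counts.values():
        b_j = (n_j + c_j) if b == 'ch' else (1.0 if b is None else b)
        log_p += log_beta(n_j + a, c_j + b_j) - log_beta(a, b_j)
    return exp(log_p)

# Model {1, x1}: order 1 has x1 included and x2 as a child;
# order 2 has x1^2 as its only child.
counts = {1: (1, 1), 2: (0, 1)}
print(hop_prior(counts))           # a = b = 1: 1/12
print(hop_prior(counts, b='ch'))   # a = 1, b = ch: also 1/12 for this model
```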
The HOP arises from a conditional exchangeability assumption on the indicator variables. Conditioned on γ<j(M), the indicators {γα : α ∈ Nj(M) ∪ Cj(M)} are assumed to be exchangeable Bernoulli random variables. By de Finetti's theorem, these arise from independent Bernoulli random variables with common probability of success πj, with a prior distribution. Our construction of the HOP assumes that this prior is a beta distribution. Additional complexity penalizations can be incorporated into the HOP in a similar fashion to the HIP. The number of possible nodes of order j that could be included while maintaining a WFM is given by chj(M) = chj(γ<j(M)) = |Nj(M) ∪ Cj(M)|. Using aj = 1 and bj(M) = chj(M) produces a prior with two desirable properties. First, if M′ ⊂ M, then π(M) ≤ π(M′). Second, for each order j, the conditional probability of including k nodes is greater than or equal to that of including k + 1 nodes, for k = 0, 1, ..., chj(M) − 1.

4.3.2 Choice of Prior Structure and Hyper-Parameters
Each of the priors introduced in Section 4.3.1 defines a whole family of model priors, characterized by the probability distribution assumed for the inclusion probabilities πM. For the sake of simplicity, this chapter focuses on those arising from Beta distributions and concentrates on particular choices of hyper-parameters which can be specified automatically. First, we describe some general features of how each of the three prior structures (HUP, HIP, HOP) allocates mass to the models in the model space. Second, as there is an infinite number of ways in which the hyper-parameters can be specified, focus is placed on the default choice a = b = 1 as well as on the complexity penalizations described in Section 4.3.1. The second alternative is referred to as a = 1, b = ch, where b = ch has a slightly different interpretation depending on the prior structure. Accordingly, b = ch is given by bj(M) = bα(M) = chj(M) = |Nj(M) ∪ Cj(M)| for the HOP and HIP, where j = order(α), while b = ch denotes b = |N(MF)| for the HUP. The prior behavior is illustrated for two model spaces; in both cases the base model MB is taken to be the intercept-only model and MF is the full model indicated in each case (Figures 4-4 and 4-5). The priors considered treat model complexity differently, and some general properties can be seen in these examples.
                                        HIP            HOP            HUP
    Model                          (1,1)  (1,ch)  (1,1)  (1,ch)  (1,1)  (1,ch)
 1  1                               1/4    4/9    1/3    1/2     1/3    5/7
 2  1, x1                           1/8    1/9    1/12   1/12    1/12   5/56
 3  1, x2                           1/8    1/9    1/12   1/12    1/12   5/56
 4  1, x1, x1^2                     1/8    1/9    1/12   1/12    1/12   5/168
 5  1, x2, x2^2                     1/8    1/9    1/12   1/12    1/12   5/168
 6  1, x1, x2                       1/32   3/64   1/12   1/12    1/60   1/72
 7  1, x1, x2, x1^2                 1/32   1/64   1/36   1/60    1/60   1/168
 8  1, x1, x2, x1x2                 1/32   1/64   1/36   1/60    1/60   1/168
 9  1, x1, x2, x2^2                 1/32   1/64   1/36   1/60    1/60   1/168
10  1, x1, x2, x1^2, x1x2           1/32   1/192  1/36   1/120   1/30   1/252
11  1, x1, x2, x1^2, x2^2           1/32   1/192  1/36   1/120   1/30   1/252
12  1, x1, x2, x1x2, x2^2           1/32   1/192  1/36   1/120   1/30   1/252
13  1, x1, x2, x1^2, x1x2, x2^2     1/32   1/576  1/12   1/120   1/6    1/252

Figure 4-4. Prior probabilities for the space of well-formulated models associated to the quadratic surface on two variables, where MB is taken to be the intercept-only model and (a, b) ∈ {(1, 1), (1, ch)}.
First, contrast the choice of HIP, HUP, and HOP for the choice of (a, b) = (1, 1). The HIP induces a complexity penalization that only accounts for the order of the terms in the model. This is best exhibited by the model space in Figure 4-4: models including x1 and x2 (models 6 through 13) are given the same prior probability, and no penalization is incurred for the inclusion of any or all of the quadratic terms. In contrast to the HIP, the
                                        HIP            HOP            HUP
    Model                          (1,1)  (1,ch)  (1,1)  (1,ch)  (1,1)  (1,ch)
 1  1                               1/8    27/64  1/4    1/2     1/4    4/7
 2  1, x1                           1/8    9/64   1/12   1/10    1/12   2/21
 3  1, x2                           1/8    9/64   1/12   1/10    1/12   2/21
 4  1, x3                           1/8    9/64   1/12   1/10    1/12   2/21
 5  1, x1, x3                       1/8    3/64   1/12   1/20    1/12   4/105
 6  1, x2, x3                       1/8    3/64   1/12   1/20    1/12   4/105
 7  1, x1, x2                       1/16   3/128  1/24   1/40    1/30   1/42
 8  1, x1, x2, x1x2                 1/16   3/128  1/24   1/40    1/20   1/70
 9  1, x1, x2, x3                   1/16   1/128  1/8    1/40    1/20   1/70
10  1, x1, x2, x3, x1x2             1/16   1/128  1/8    1/40    1/5    1/70

Figure 4-5. Prior probabilities for the space of well-formulated models associated to three main effects and one interaction term, where MB is taken to be the intercept-only model and (a, b) ∈ {(1, 1), (1, ch)}.
HUP induces a penalization for model complexity, but it does not adequately penalize models for including additional terms. Under the HUP, models including all of the terms are given at least as much probability as any model containing any non-empty set of terms (Figures 4-4 and 4-5). This lack of penalization of the full model originates from its combinatorial simplicity (i.e., it is the only model that contains every term), and, as an unfortunate consequence, this model space distribution favors the base and full models. Similar behavior is observed with the HOP with (a, b) = (1, 1). As models become more complex they are appropriately penalized for their size. However, after a sufficient number of nodes are added, the number of possible models of that particular size is considerably reduced, so combinatorial complexity is negligible for the largest models. This is best exhibited in Figure 4-5, where the HOP places more mass on the full model than on any model containing a single order-one node, highlighting an undesirable behavior of the priors with this choice of hyper-parameters.
In contrast, if (a, b) = (1, ch), all three priors produce strong penalization as models become more complex, both in terms of the number and the order of the nodes contained in the model. For all of the priors, adding a node α to a model M to form M′ produces p(M) ≥ p(M′). However, differences between the priors are apparent: the HIP penalizes the full model the most, with the HOP penalizing it the least and the HUP lying between them. At face value, the HOP creates the most compelling penalization of model complexity. In Figure 4-5 the penalization of the HOP is the least dramatic, producing prior odds of 20 for MB versus MF, as opposed to the HUP and HIP, which produce prior odds of 40 and 54, respectively. Similarly, the prior odds in Figure 4-4 are 60, 180, and 256 for the HOP, HUP, and HIP, respectively.

4.3.3 Posterior Sensitivity to the Choice of Prior
To determine how the proposed priors adjust the posterior probabilities to account for multiplicity, a simple simulation was performed. The goal of this exercise was to understand how the priors respond to increasing complexity. First, the priors are compared as the number of main effects p grows. Second, they are compared as the depth of the hierarchy increases, or in other words, as the order J_max^M increases.

The quality of a node α is characterized by its marginal posterior inclusion probability, defined as pα = Σ_{M ∈ M} I(α ∈ M) p(M | y, M) for α ∈ MF. These posteriors were obtained for the proposed priors as well as for the Equal Probability Prior (EPP) on M. For all prior structures, both the default hyper-parameters a = b = 1 and the penalizing choice of a = 1 and b = ch are considered. The results for the different combinations of MF and MT incorporated in the analysis were obtained from 100 random replications (i.e., generating at random 100 matrices of main effects and responses). The simulation proceeds as follows:
1. Randomly generate main effects matrices X = (x1, ..., x18) with xi ~iid Nn(0, In), and error vectors ϵ ~ Nn(0, In), for n = 60.

2. Setting all coefficient values equal to one, calculate y = ZMT β + ϵ for the true models given by
   MT1 = {x1, x2, x3, x1^2, x1x2, x2^2, x2x3}, with |MT1| = 7;
   MT2 = {x1, x2, ..., x16}, with |MT2| = 16;
   MT3 = {x1, x2, x3, x4}, with |MT3| = 4;
   MT4 = {x1, x2, ..., x8, x1^2, x3x4}, with |MT4| = 10;
   MT5 = {x1, x2, x3, x4, x1^2, x3x4}, with |MT5| = 6.
Table 4-1. Characterization of the full models MF and corresponding model spaces M considered in simulations.

Growing p, fixed J_max^M:
  MF                  |MF|    |M|      MT used
  (x1 + x2 + x3)^2      9       95     MT1
  (x1 + ... + x4)^2    14     1337     MT1
  (x1 + ... + x5)^2    20    38619     MT1

Fixed p, growing J_max^M:
  MF                  |MF|    |M|      MT used
  (x1 + x2 + x3)^2      9       95     MT1
  (x1 + x2 + x3)^3     19     2497     MT1
  (x1 + x2 + x3)^4     34   161421     MT1

Other model spaces:
  MF                                  |MF|    |M|      MT used
  x1 + x2 + ... + x18                  18   262144     MT2, MT3
  (x1 + ... + x4)^2 + x5 + ... + x10   20    85568     MT4, MT5

3. In all simulations the base model MB is the intercept-only model. The notation (x1 + ... + xp)^d is used to represent the full order-d polynomial response surface in p main effects. The model spaces, characterized by their corresponding full model MF, are presented in Table 4-1, as well as the true models used in each case.

4. Enumerate the model spaces and calculate p(M | y, M) for all M ∈ M using the EPP, HUP, HIP, and HOP, the latter two each with the two sets of hyper-parameters.

5. Count the number of true positives and false positives in each M for the different priors.
The true positives (TP) are defined as those nodes α ∈ MT such that pα > 0.5. For the false positives (FP), three different cutoffs on pα are considered for α ∉ MT, elucidating the adjustment for multiplicity induced by the model priors: 0.10, 0.20, and 0.50. The results from this exercise provide insight into the influence of the prior on the marginal posterior inclusion probabilities. In Table 4-1 the model spaces considered are described in terms of the number of models they contain and in terms of the number of nodes of MF, the full model that defines the DAG for M.
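The marginal posterior inclusion probabilities defined above are plain sums over the enumerated model space; a minimal sketch with a toy (made-up) posterior:

```python
def inclusion_probabilities(posterior):
    """p_alpha = sum of p(M | y, M-space) over the models M containing alpha.
    `posterior` maps each model (a frozenset of node labels) to its posterior
    probability; the probabilities are assumed to sum to one."""
    p = {}
    for model, prob in posterior.items():
        for node in model:
            p[node] = p.get(node, 0.0) + prob
    return p

# Toy three-model posterior (illustrative numbers only):
posterior = {
    frozenset({'x1'}): 0.2,
    frozenset({'x1', 'x2'}): 0.5,
    frozenset({'x1', 'x2', 'x1^2'}): 0.3,
}
print(inclusion_probabilities(posterior))
# x1 appears in every model (p = 1.0); x2 in two (p = 0.8); x1^2 in one (p = 0.3)
```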
Growing number of main effects, fixed polynomial degree. This simulation investigates the posterior behavior as the number of covariates grows for a polynomial surface of degree two. The true model is assumed to be MT1 and has 7 polynomial terms. The false positive and true positive rates are displayed in Table 4-2.

First, focus on the posterior when (a, b) = (1, 1). As p increases and the cutoff is low, the number of false positives increases for the EPP as well as for the hierarchical priors, although less dramatically for the latter. All of the priors identify all of the true positives. The false positive rate for the 0.50 cutoff is less than one for all four prior structures, with the HIP exhibiting the smallest false positive rate.
With the second choice of hyper-parameters, (1, ch), the improvement of the hierarchical priors over the EPP is dramatic, and the difference in performance is more pronounced as p increases. These also considerably outperform the priors using the default hyper-parameters a = b = 1 in terms of the false positives. Regarding the number of true positives, all priors discovered the 7 true predictors in MT1 for most of the 100 random samples of data, with only minor differences observed between any of the priors considered. That being said, the means for the priors with a = 1, b = ch are slightly lower for the true positives. With a 0.50 cutoff, the hierarchical priors keep tight control on the number of false positives, but in doing so discard true positives with slightly higher frequency.

Growing polynomial degree, fixed main effects. For these examples the true model is once again MT1. When the complexity is increased by making the order of MF larger (Table 4-3), the inability of the EPP to adjust the inclusion posteriors for multiplicity becomes more pronounced: the EPP becomes less and less efficient at removing false positives when the FP cutoff is low. Among the priors with a = b = 1, as the order increases, the HIP is the best at filtering out the false positives. Using the 0.50 false positive cutoff, some false positives are included both for the EPP and for all the priors with a = b = 1, indicating that the default hyper-parameters might not be the best option to control FP. The 7 covariates in the true model all obtain a high inclusion posterior probability both with the EPP and with the a = b = 1 priors.
Table 4-2. Mean number of false and true positives in 100 randomly generated datasets as the number of main effects in a full quadratic MF increases from three to five, under the equal probability prior (EPP), the hierarchical independence prior (HIP), the hierarchical order prior (HOP), and the hierarchical uniform prior (HUP).

                                                a = 1, b = 1          a = 1, b = ch
Cutoff      |MT|  MF                      EPP   HIP   HUP   HOP     HIP   HUP   HOP
FP(>0.10)     7   (x1 + x2 + x3)^2       1.78  1.78  2.00  2.00    0.11  1.31  1.06
FP(>0.20)                                0.43  0.43  2.00  1.98    0.01  0.28  0.24
FP(>0.50)                                0.04  0.04  0.97  0.36    0.00  0.03  0.02
TP(>0.50)         (MT1)                  7.00  7.00  7.00  7.00    6.97  6.99  6.99
FP(>0.10)     7   (x1 + ... + x4)^2      3.62  1.94  2.33  2.45    0.10  0.63  1.07
FP(>0.20)                                1.60  0.47  2.17  2.15    0.01  0.17  0.24
FP(>0.50)                                0.25  0.06  0.35  0.36    0.00  0.02  0.02
TP(>0.50)         (MT1)                  7.00  7.00  7.00  7.00    6.97  6.99  6.99
FP(>0.10)     7   (x1 + ... + x5)^2      6.00  2.16  2.60  2.55    0.12  0.43  1.15
FP(>0.20)                                2.91  0.55  2.13  2.18    0.02  0.19  0.27
FP(>0.50)                                0.66  0.11  0.25  0.37    0.00  0.03  0.01
TP(>0.50)         (MT1)                  7.00  7.00  7.00  7.00    6.97  6.99  6.99
In contrast, any of the a = 1 and b = ch priors dramatically improve upon their a = b = 1 counterparts, consistently assigning low inclusion probabilities to the majority of the false positive terms, even for low cutoffs. As the order of the polynomial surface increases, the difference in performance between these priors and either the EPP or their default versions becomes even more clear. At the 0.50 cutoff, the hierarchical priors with complexity penalization exhibit very low false positive rates. The true positive rate decreases slightly for these priors, but not to an alarming degree.

Other model spaces. This part of the analysis considers model spaces that do not correspond to full polynomial response surfaces (Table 4-4). The first example is a model space with main effects only. The second example includes a full quadratic surface of order two and, in addition, six terms for which only main effects are modeled. Two true models are used in combination with each model space to observe how the posterior probabilities vary under the influence of the different priors for "large" and "small" true models.
Table 4-3. Mean number of false and true positives in 100 randomly generated datasets as the maximum order of MF increases from two to four in a full model with three main effects, under the equal probability prior (EPP), the hierarchical independence prior (HIP), the hierarchical order prior (HOP), and the hierarchical uniform prior (HUP).

                                                a = 1, b = 1          a = 1, b = ch
Cutoff      |MT|  MF                      EPP   HIP   HUP   HOP     HIP   HUP   HOP
FP(>0.10)     7   (x1 + x2 + x3)^2       1.78  1.78  2.00  2.00    0.11  1.31  1.06
FP(>0.20)                                0.43  0.43  2.00  1.98    0.01  0.28  0.24
FP(>0.50)                                0.04  0.04  0.97  0.36    0.00  0.03  0.02
TP(>0.50)         (MT1)                  7.00  7.00  7.00  7.00    6.97  6.99  6.99
FP(>0.10)     7   (x1 + x2 + x3)^3       7.37  5.21  6.06  2.91    0.55  1.05  1.39
FP(>0.20)                                2.91  1.55  3.61  2.08    0.17  0.34  0.31
FP(>0.50)                                0.40  0.21  0.50  0.26    0.03  0.03  0.04
TP(>0.50)         (MT1)                  7.00  7.00  7.00  7.00    6.97  6.98  7.00
FP(>0.10)     7   (x1 + x2 + x3)^4       8.22  4.00  4.69  2.61    0.52  0.55  1.32
FP(>0.20)                                4.21  1.13  1.76  2.03    0.12  0.15  0.31
FP(>0.50)                                0.56  0.17  0.22  0.27    0.03  0.03  0.04
TP(>0.50)         (MT1)                  7.00  7.00  7.00  7.00    6.97  6.97  6.99
By construction, in model spaces with main effects only, the HIP(1,1) and EPP are equivalent, as are the HOP(a,b) and HUP(a,b). This accounts for the similarities observed among the results for the first two cases presented in Table 4-4, where the model space corresponds to a full model with 18 main effects and the true models have 16 and 4 main effects, respectively. When the number of true coefficients is large, the HUP(1,1) and HOP(1,1) do poorly at controlling false positives, even at the 0.50 cutoff. In contrast, the HIP (and thus the EPP) with the 0.50 cutoff identifies the true positives and no false positives. This result, however, does not imply that the EPP controls false positives well: the true model contains 16 out of the 18 nodes in MF, so there is little potential for false positives. The a = 1 and b = ch priors show dramatically different behavior. The HIP controls false positives well but fails to identify the true coefficients at the 0.50 cutoff. In contrast, the HOP identifies all of the true positives and has a small false positive rate at the 0.50 cutoff.
If the number of true positives is small, most terms in the full model are truly zero. The EPP includes at least one false positive in approximately 50% of the randomly sampled datasets. On the other hand, the HUP(1,1) provides some control for multiplicity, obtaining on average a lower number of false positives than the EPP. Furthermore, the proposed hierarchical priors with a = 1, b = ch are substantially better than the EPP (and the choice of a = b = 1) at controlling false positives and capturing all true positives using the marginal posterior inclusion probabilities. The two examples suggest that the HOP(1, ch) is the best default choice for model selection when the number of terms available at a given degree is large.

The third and fourth examples in Table 4-4 consider the same irregular model space, with data generated from MT4 with ten terms and from MT5 with six terms. The HIP(1,1) and EPP again behave quite similarly, incorporating a large number of false positives at the 0.10 cutoff; at the 0.50 cutoff some false positives are still included. The HUP(1,1) and HOP(1,1) behave similarly, with a slightly higher false positive rate at the 0.50 cutoff. In terms of the true positives, the EPP and a = b = 1 priors always include all of the predictors in MT4 and MT5. On the other hand, the ability of the a = 1, b = ch priors to control false positives is markedly better than that of the EPP and the hierarchical priors with the choice a = b = 1. At the 0.50 cutoff these priors identify all of the true positives and true negatives. Once again, these examples point to the hierarchical priors with additional penalization for complexity as good default priors on the model space.

4.4 Random Walks on the Model Space
When the model space M is too large to enumerate, a stochastic procedure can be used to find models with high posterior probability. In particular, an MCMC algorithm can be utilized to generate a dependent sample of models from the model posterior. The structure of the model space M both presents difficulties and provides clues on how to build algorithms to explore it. Different MCMC strategies can be adopted, two of which
Table 4-4. Mean number of false and true positives in 100 randomly generated datasets with unstructured or irregular model spaces, under the equal probability prior (EPP), the hierarchical independence prior (HIP), the hierarchical order prior (HOP), and the hierarchical uniform prior (HUP).

                                                   a = 1, b = 1            a = 1, b = ch
Cutoff      |MT|  MF                        EPP    HIP    HUP    HOP     HIP    HUP    HOP
FP(>0.10)    16   x1 + x2 + ... + x18      1.93   1.93   2.00   2.00    0.03   1.80   1.80
FP(>0.20)                                  0.52   0.52   2.00   2.00    0.01   0.46   0.46
FP(>0.50)                                  0.07   0.07   2.00   2.00    0.01   0.04   0.04
TP(>0.50)         (MT2)                   15.99  15.99  16.00  16.00    6.99  15.99  15.99
FP(>0.10)     4   x1 + x2 + ... + x18     13.95  13.95   9.15   9.15    0.26   1.31   1.31
FP(>0.20)                                  5.45   5.45   3.03   3.03    0.05   0.45   0.45
FP(>0.50)                                  0.84   0.84   0.45   0.45    0.02   0.06   0.06
TP(>0.50)         (MT3)                    4.00   4.00   4.00   4.00    4.00   4.00   4.00
FP(>0.10)    10   (x1 + ... + x4)^2        9.73   9.71  10.00   5.60    0.34   2.33   2.20
FP(>0.20)           + x5 + ... + x10       2.65   2.65   8.73   3.05    0.12   0.74   0.69
FP(>0.50)                                  0.35   0.35   1.36   1.68    0.02   0.11   0.12
TP(>0.50)         (MT4)                   10.00  10.00  10.00   9.99    9.94   9.98   9.99
FP(>0.10)     6   (x1 + ... + x4)^2       13.52  13.52  11.06   9.94    0.44   1.63   1.96
FP(>0.20)           + x5 + ... + x10       4.22   4.21   3.60   5.01    0.15   0.48   0.68
FP(>0.50)                                  0.53   0.53   0.57   0.75    0.01   0.08   0.11
TP(>0.50)         (MT5)                    6.00   6.00   6.00   6.00    5.99   5.99   5.99
are outlined in this section. Combining the different strategies allows the model selection algorithm to explore the model space thoroughly and relatively fast.

4.4.1 Simple Pruning and Growing

This first strategy relies on small, localized jumps around the model space, turning a single node on or off at each step. The idea behind this algorithm is to grow the model by activating one node in the children set, or to prune the model by removing one node in the extreme set. At a given step of the algorithm, assume that the current state of the chain is model M, and let pG be the probability that the algorithm chooses the growth step. The proposed model M′ can either be M+ = M ∪ {α} for some α ∈ C(M), or M− = M \ {α} for some α ∈ E(M).

An example transition kernel is defined by the mixture

g(M′ | M) = pG · qGrow(M′ | M) + (1 − pG) · qPrune(M′ | M)
          = [I{M ≠ MF} / (1 + I{M ≠ MB})] · I{α ∈ C(M)} / |C(M)|
            + [I{M ≠ MB} / (1 + I{M ≠ MF})] · I{α ∈ E(M)} / |E(M)|,   (4–11)
where pG has explicitly been defined as 0.5 when both C(M) and E(M) are non-empty, and as 0 (or 1) when C(M) = ∅ (or E(M) = ∅). After choosing pruning or growing, a single node is proposed for addition to or deletion from M, uniformly at random.
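The grow/prune kernel above can be sketched as a Metropolis-Hastings step (a minimal illustration on the two-variable quadratic DAG; the node labels, the `PARENTS` table, and the toy posterior are hypothetical, not from the text):

```python
import math
import random

# Hypothetical DAG of the quadratic surface on two variables:
# each node maps to its parent set (the intercept-only base model is implicit).
PARENTS = {
    'x1': set(), 'x2': set(),
    'x1^2': {'x1'}, 'x2^2': {'x2'}, 'x1*x2': {'x1', 'x2'},
}

def children(model):
    """C(M): nodes outside M whose parents all lie in M."""
    return {a for a, pa in PARENTS.items() if a not in model and pa <= model}

def extreme(model):
    """E(M): nodes of M that are parents of no other node in M."""
    return {a for a in model if not any(a in PARENTS[b] for b in model)}

def p_grow(model):
    """Growth probability as in 4-11: 1 at the base model, 0 at the full model."""
    C, E = children(model), extreme(model)
    return 1.0 if not E else (0.0 if not C else 0.5)

def step(model, log_post, rng=random):
    """One grow/prune Metropolis-Hastings move; every state visited is a WFM."""
    C, E = children(model), extreme(model)
    if rng.random() < p_grow(model):
        node = rng.choice(sorted(C))
        prop = frozenset(model | {node})
        log_q_fwd = math.log(p_grow(model) / len(C))
        log_q_rev = math.log((1 - p_grow(prop)) / len(extreme(prop)))
    else:
        node = rng.choice(sorted(E))
        prop = frozenset(model - {node})
        log_q_fwd = math.log((1 - p_grow(model)) / len(E))
        log_q_rev = math.log(p_grow(prop) / len(children(prop)))
    log_alpha = log_post(prop) - log_post(model) + log_q_rev - log_q_fwd
    return prop if rng.random() < math.exp(min(0.0, log_alpha)) else model

# Toy run with a posterior that favours small models; heredity always holds.
model = frozenset()
for _ in range(200):
    model = step(model, lambda m: -float(len(m)))
assert all(PARENTS[a] <= model for a in model)
```

Because proposals are drawn only from C(M) and E(M), strong heredity is preserved by construction at every step of the chain.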
For this simple algorithm, pruning is the reverse kernel of growing, and vice-versa. From this construction, more elaborate algorithms can be specified. First, instead of choosing the node uniformly at random from the corresponding set, nodes can be selected using the relative posterior probability of adding or removing each node. Second, more than one node can be selected at any step, for instance by also sampling at random the number of nodes to add or remove given the size of the set. Third, the strategy could combine pruning and growing in a single step by sampling one node α ∈ C(M) ∪ E(M) and adding or removing it accordingly. Fourth, sets of nodes from C(M) ∪ E(M) that yield well-formulated models can be added or removed. This simple algorithm produces small moves around the model space by focusing node addition or removal only on the set C(M) ∪ E(M).

4.4.2 Degree-Based Pruning and Growing
In exploring the model space, it is possible to take advantage of the hierarchical structure defined between nodes of different order: one can update the vector of inclusion indicators in blocks, one order class at a time. Two flavors of this algorithm are proposed: one that separates the pruning and growing steps, and one where both are done simultaneously.

Assume that at a given step, say t, the algorithm is at M. If growing, the strategy proceeds successively by order class, going from j = Jmin up to j = Jmax, with Jmin and Jmax being the lowest and highest orders of nodes in MF \ MB, respectively. Define Mt,(Jmin−1) = M and set j = Jmin. The growth kernel comprises the following steps, proceeding from j = Jmin to j = Jmax:
1) Propose a model M′ by selecting a set of nodes from Cj(Mt,(j−1)) through the kernel qGrow,j(· | Mt,(j−1)).

2) Compute the Metropolis-Hastings correction for M′ versus Mt,(j−1). If M′ is accepted, then set Mt,(j) = M′; otherwise set Mt,(j) = Mt,(j−1).

3) If j < Jmax, then set j = j + 1 and return to 1); otherwise proceed to 4).

4) Set Mt = Mt,(Jmax).
The pruning step is defined in a similar fashion; however, it starts at order j = Jmax and proceeds down to j = Jmin. Let Ej(M′) = E(M′) ∩ Nj(MF) be the set of nodes of order j that can be removed from a model M′ to produce a WFM. Define Mt,(Jmax+1) = M and set j = Jmax. The pruning kernel comprises the following steps:

1) Propose a model M′ by selecting a set of nodes from Ej(Mt,(j+1)) through the kernel qPrune,j(· | Mt,(j+1)).

2) Compute the Metropolis-Hastings correction for M′ versus Mt,(j+1). If M′ is accepted, then set Mt,(j) = M′; otherwise set Mt,(j) = Mt,(j+1).

3) If j > Jmin, then set j = j − 1 and return to Step 1); otherwise proceed to Step 4).

4) Set Mt = Mt,(Jmin).
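The order-by-order sweep described above can be sketched generically as follows (function and argument names are hypothetical; the block proposal and the posterior are placeholders to be supplied by the user):

```python
import math
import random

def degree_sweep(model, orders, propose_block, log_post, direction='grow',
                 rng=random):
    """One degree-based sweep: visit order classes j in increasing order when
    growing (decreasing when pruning), propose a block update within each
    class, and apply a Metropolis-Hastings accept/reject at every stage.

    `orders` is the sorted list Jmin..Jmax; `propose_block(M, j)` returns a
    candidate model differing from M only in nodes of order j, together with
    the forward and reverse log proposal densities."""
    seq = orders if direction == 'grow' else list(reversed(orders))
    current = model
    for j in seq:
        prop, log_q_fwd, log_q_rev = propose_block(current, j)
        log_alpha = log_post(prop) - log_post(current) + log_q_rev - log_q_fwd
        if rng.random() < math.exp(min(0.0, log_alpha)):
            current = prop
    return current
```

Reversing `direction` yields the pruning sweep, which is the reverse kernel of the growth sweep, as noted in the text.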
It is clear that the growing and pruning steps are reverse kernels of each other. Pruning and growing can also be combined for each j: the forward kernel proceeds from j = Jmin to j = Jmax and proposes adding or removing sets of nodes from Cj(M) ∪ Ej(M), while the reverse kernel simply reverses the direction of j, proceeding from j = Jmax to j = Jmin.

4.5 Simulation Study
To study the operating characteristics of the proposed priors, a simulation experiment was designed with three goals. First, the priors are characterized by how the posterior distributions are affected by the sample size and the signal-to-noise ratio (SNR). Second, given the SNR level, the influence of the allocation of the signal across the terms in the model is investigated. Third, performance is assessed when the true model has special points in the scale (McCullagh & Nelder, 1989), i.e., when the true model has coefficients equal to zero for some lower-order terms in the polynomial hierarchy.
With these goals in mind, sets of predictors and responses are generated under various experimental conditions. The model space is defined with MB being the intercept-only model and MF being the complete order-four polynomial surface in five main effects, which has 126 nodes. The entries of the matrix of main effects are generated as independent standard normals. The response vectors are drawn from the n-variate normal distribution as y ~ Nn(ZMT(X) βγ, In), where MT is the true model and In is the n × n identity matrix.
The sample sizes considered are n ∈ {130, 260, 1040}, which ensures that ZMF(X) is of full rank. The cardinality of this model space is |M| > 1.2 × 10^22, which makes enumeration of all models unfeasible. Because the value of the 2k-th moment of the standard normal distribution increases with k = 1, 2, ..., higher-order terms by construction have a larger variance than their ancestors. As such, assuming equal values for all coefficients, higher-order terms necessarily contain more "signal" than the lower-order terms from which they inherit (e.g., x1^2 has more signal than x1). Once a higher-order term is selected, its entire ancestry is also included. Therefore, to prevent the simulation results from being overly optimistic (because of the larger signals from the higher-order terms), sphering is used to calculate meaningful values of the coefficients, ensuring that the signal is of the magnitude intended in any given direction. Given the results of the simulations from Section 4.3.3, only the HOP with a = 1, b = ch is considered, with the EPP included for comparison.
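The sphering step can be sketched as follows (a pure-Python illustration under assumed dimensions, not the dissertation's exact procedure): scaling the coefficient vector through the inverse Cholesky factor of Z'Z/n fixes the quadratic form β'(Z'Z/n)β at the intended signal size, regardless of the correlations among the polynomial terms.

```python
import random
from math import sqrt

def cholesky(S):
    """Lower-triangular L with L L' = S (S symmetric positive definite)."""
    n = len(S)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = sqrt(S[i][i] - s) if i == j else (S[i][j] - s) / L[j][j]
    return L

def solve_LT(L, u):
    """Back-substitution for L' beta = u."""
    n = len(L)
    beta = [0.0] * n
    for i in reversed(range(n)):
        s = sum(L[k][i] * beta[k] for k in range(i + 1, n))
        beta[i] = (u[i] - s) / L[i][i]
    return beta

# Hypothetical design: n = 260 rows, p = 3 polynomial terms.
random.seed(1)
n, p = 260, 3
Z = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]
S = [[sum(Z[r][i] * Z[r][j] for r in range(n)) / n for j in range(p)]
     for i in range(p)]                       # S = Z'Z / n
target = 1.5                                  # intended overall signal size
u = [target / sqrt(p)] * p                    # equal signal in every direction
beta = solve_LT(cholesky(S), u)               # beta = L^{-T} u ("sphered")
signal = sum(beta[i] * S[i][j] * beta[j] for i in range(p) for j in range(p))
# beta' S beta = u'u = target^2 exactly, up to rounding
```

Without this rescaling, equal raw coefficients would let the higher-moment columns dominate, which is exactly the optimism the text warns about.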
The total number of combinations of SNR, sample size, regression coefficient values, and nodes in MT amounts to 108 different scenarios. Each scenario was run with 100 independently generated datasets, and the mean behavior across the samples was observed. The results presented in this section correspond to the median probability model (MPM) from each of the 108 simulation scenarios considered. Figure 4-7 shows the comparison between the two priors for the mean number of true positive (TP) and false positive (FP) terms. Although some of the scenarios consider true models that are not well-formulated, the smallest well-formulated model that stems from MT is always the one shown in Figure 4-6.

Figure 4-6. DAG of the largest true model MT used in simulations.

The results are summarized in Figure 4-7. Each point on the horizontal axis corresponds to the average for a given set of simulation conditions. For clarity, only labels for the SNR and sample size are included, but results are also shown for the different values of the regression coefficients and the different true models considered. Additional details about the procedure and other results are included in the appendices.

4.5.1 SNR and Sample Size Effect
As expected, small sample sizes combined with a small SNR impair the ability of the algorithm to detect true coefficients under both the EPP and the HOP(1, ch), with this effect being greater under the latter prior. However, considering the mean number of TPs jointly with the number of FPs, it is clear that, although the number of TPs is especially low with the HOP(1, ch), most of the few predictors that are discovered in fact belong to the true model. In comparison to the results with the EPP, in terms of FPs the HOP(1, ch) does better, and even more so when both the sample size and the SNR are smallest.

Figure 4-7. Average true positives (TP) and average false positives (FP) in all simulated scenarios for the median probability model with EPP and HOP(1, ch).

Finally, when either the SNR or the sample size is large, the performance in terms of TPs is similar between the two priors, but the numbers of FPs are somewhat lower with the HOP.

4.5.2 Coefficient Magnitude
Three ways to allocate the amount of signal across predictors are considered. In the first, all coefficients contain the same amount of signal, regardless of their order. In the second, each order-one coefficient contains twice as much signal as any order-two coefficient, and four times as much as any order-three coefficient. In the third, each order-one coefficient contains half as much signal as any order-two coefficient, and a quarter of what any order-three coefficient has. These choices are denoted by β(1) = c(1_o1, 1_o2, 1_o3), β(2) = c(1_o1, 0.5_o2, 0.25_o3), and β(3) = c(0.25_o1, 0.5_o2, 1_o3), respectively. In Figure 4-7, the first four scenarios correspond to simulations with β(1), the next four use β(2), the next four correspond to β(3), and then the values are cycled in the same way. The results show that scenarios using either β(1) or β(3) behave similarly, contrasting with the negative impact of having the highest signal in the order-one terms through β(2). In Figure 4-7 the effect of using β(2) is evident, as it corresponds to the lowest values for the TPs, regardless of the sample size, the SNR, or the prior used. This is an intuitive result, since giving more signal to higher-order terms makes it easier to detect higher-order terms, and consequently, by strong heredity, the algorithm will also select the corresponding lower-order terms included in the true model.

4.5.3 Special Points on the Scale
Four true models were considered: (1) the model from Figure 4-6 (MT1); (2) the model without the order-one terms (MT2); (3) the model without order-two terms (MT3); and (4) the model without x1^2 and x2x5 (MT4). The last three are clearly not well-formulated. In Figure 4-7, the leftmost point on the horizontal axis corresponds to scenarios with MT1; the next point is for scenarios with MT2, followed by those with MT3, then with MT4, then MT1, etc. In comparison to the EPP, the HOP(1, ch) tightly controls the inclusion of FPs by choosing smaller models, at the expense of also reducing the TP count, especially when there is more uncertainty about the true model (i.e., SNR = 0.25). For both prior structures, the results indicate that at low SNR levels the presence of special points has no apparent impact, as the selection behavior is similar between the four models in terms of both the TP and FP. An interesting observation is that the effect of having special points on the scale is vastly magnified whenever the coefficients that assign more weight to order-one terms (β(2)) are used.

4.6 Case Study: Ozone Data Analysis
This section uses the ozone data from Breiman & Friedman (1985) and follows the analysis performed by Liang et al. (2008), who investigated hyper-g priors. After removing observations with missing values, 330 observations remain, including daily measurements of maximum ozone concentration near Los Angeles and eight meteorological variables (Table 4-5). From the 330 observations, 165 were sampled at random without replacement and used to run the variable selection procedure; the remaining 165 were used for validation. The eight meteorological variables, their interactions, and their squared terms are used as predictors, resulting in a full model with 44 predictors. The model space assumes that the base model MB is the intercept-only model and that MF is the quadratic surface in the eight meteorological variables. The
model space contains approximately 71 billion models, and computation of all model posterior probabilities is not feasible.
Table 4-5. Variables used in the analyses of the ozone contamination dataset

Name    Description
ozone   Daily maximum 1-hour average ozone (ppm) at Upland, CA
vh      500-millibar pressure height (m) at Vandenberg AFB
wind    Wind speed (mph) at LAX
hum     Humidity (%) at LAX
temp    Temperature (F) measured at Sandburg, CA
ibh     Inversion base height (ft) at LAX
dpg     Pressure gradient (mm Hg) from LAX to Daggett, CA
vis     Visibility (miles) measured at LAX
ibt     Inversion base temperature (F) at LAX
The HOP, HUP, and HIP with a = 1 and b = ch, as well as the EPP, are considered for comparison purposes. To obtain the Bayes factors in Equation 3-3, four different mixtures of g-priors are utilized: intrinsic priors (IP) (which yield the expression in Equation 3-2), hyper-g (HG) priors (Liang et al., 2008) with hyperparameters α = 2, β = 1 and α = β = 1, and Zellner-Siow (ZS) priors (Zellner & Siow, 1980). The results were extracted for the median probability models (MPM). Additionally, the model is estimated using the R package hierNet (Bien et al., 2013) to compare model selection results to those obtained using the hierarchical lasso (Bien et al., 2013), restricted to well-formulated models by imposing the strong heredity constraint. The procedures were assessed on the basis of their predictive accuracy on the validation dataset.
Among all models, the one that yields the smallest RMSE is the median probability model obtained using the HOP and EPP with the ZS prior, and also using the HOP with both HG priors (Table 4-6). The HOP model with the intrinsic prior has all the terms contained in the lowest-RMSE model with the exception of dpg^2, which has a relatively high marginal inclusion probability of 0.46. This disparity between the IP and other mixtures of g-priors is explained by the fact that the IP induces less posterior shrinkage than the ZS and HG priors. The MPMs obtained through the HUP and HIP are nested in the best model, suggesting that these model space priors penalize complexity too much and result in false negatives. Consideration of these MPMs suggests that the HOP is best at producing true positives while controlling for false positives.
Finally, the model obtained from the hierarchical lasso (HierNet) is the largest model and produces the second largest RMSE. All of the terms contained in any of the other models, except for vh, are nested within the hierarchical lasso model, and most of the terms that are exclusive to this model receive extremely low marginal inclusion probabilities under any of the model priors and parameter priors considered under Bayesian model selection.
Table 4-6. Median probability models (MPM) from different combinations of parameter and model priors vs. the model selected using the hierarchical lasso

BF     Prior    Model                                                    R^2     RMSE
IP     EPP      hum, dpg, ibt, hum^2, hum*dpg, hum*ibt, dpg^2, ibt^2     0.8054  4.2739
IP     HIP      hum, ibt, hum^2, hum*ibt, ibt^2                          0.7740  4.3396
IP     HOP      hum, dpg, ibt, hum^2, hum*ibt, ibt^2                     0.7848  4.3175
IP     HUP      hum, dpg, ibt, hum*ibt, ibt^2                            0.7767  4.3508
ZS     EPP      hum, dpg, ibt, hum^2, hum*ibt, dpg^2, ibt^2              0.7896  4.2518
ZS     HIP      hum, ibt, hum*ibt, ibt^2                                 0.7525  4.3505
ZS     HOP      hum, dpg, ibt, hum^2, hum*ibt, dpg^2, ibt^2              0.7896  4.2518
ZS     HUP      hum, dpg, ibt, hum*ibt, ibt^2                            0.7767  4.3508
HG11   EPP      vh, hum, dpg, ibt, hum^2, hum*ibt, dpg^2                 0.7701  4.3049
HG11   HIP      hum, ibt, hum*ibt, ibt^2                                 0.7525  4.3505
HG11   HOP      hum, dpg, ibt, hum^2, hum*ibt, dpg^2, ibt^2              0.7896  4.2518
HG11   HUP      hum, dpg, ibt, hum*ibt, ibt^2                            0.7767  4.3508
HG21   EPP      hum, dpg, ibt, hum^2, hum*ibt, dpg^2                     0.7701  4.3037
HG21   HIP      hum, dpg, ibt, hum*ibt, ibt^2                            0.7767  4.3508
HG21   HOP      hum, dpg, ibt, hum^2, hum*ibt, dpg^2, ibt^2              0.7896  4.2518
HG21   HUP      hum, dpg, ibt, hum*ibt                                   0.7526  4.4036
       HierNet  hum, temp, ibh, dpg, ibt, vis, hum^2, hum*ibt,           0.7651  4.3680
                temp^2, temp*ibt, dpg^2
4.7 Discussion

Scott & Berger (2010) noted that the Ockham's-razor effect found automatically in Bayesian variable selection through the Bayes factor does not correct for multiple testing. The Bayes factor penalizes the complexity of the alternative model according to the number of parameters in excess of those of the null model. Therefore, the Bayes factor only controls complexity in a pairwise fashion. If the model selection procedure uses equal prior probabilities for all M ∈ M, then these comparisons ignore the effect of the multiplicity of the testing problem. This is where the role of the prior on the model space becomes important: the multiplicity penalty is "hidden away" in the model prior probabilities π(M | M).
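To make the hidden multiplicity penalty concrete, the following minimal sketch contrasts a uniform prior over models with a multiplicity-correcting prior in the style of Scott & Berger (2010). This is an illustration only; it is not the HOP/HIP/HUP construction of this chapter.

```python
from math import comb

def uniform_model_prior(k, p):
    """Equal probability 1/2^p for every model, regardless of its size k."""
    return 1 / 2 ** p

def beta_binomial_prior(k, p):
    """Multiplicity-correcting prior: uniform over model sizes, then
    uniform over the C(p, k) models of size k."""
    return 1 / ((p + 1) * comb(p, k))

def prior_odds_of_growing(prior, k, p):
    """Prior odds of a model with k+1 predictors vs. a nested one with k:
    the 'price' the prior charges for adding one more term."""
    return prior(k + 1, p) / prior(k, p)

for p in (10, 50, 100):
    print(p,
          prior_odds_of_growing(uniform_model_prior, 2, p),
          round(prior_odds_of_growing(beta_binomial_prior, 2, p), 4))
```

Under the uniform prior the odds are always 1, so no multiplicity penalty is paid; under the size-based prior the penalty for adding a term grows with the number of candidate predictors p, which is exactly the correction "hidden away" in π(M | M).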
In addition to the multiplicity of the testing problem, disregarding the hierarchical polynomial structure of the predictors in model selection procedures has the potential to lead to different results depending on how the predictors are set up (e.g., in what units these predictors are expressed).
In this chapter we investigated a solution to these two issues. We defined prior structures for well-formulated models and developed random walk algorithms to traverse this type of model space. The key to understanding prior distributions on the space of WFMs is the hierarchical nature of the model space itself. The prior distributions described take advantage of that hierarchy in two ways. First, conditional independence and immediate inheritance are used to develop the HOP, HIP, and HUP structures discussed in Section 4.3. Second, the conditional nature of the priors allows for the direct incorporation of complexity penalizations. Of the priors proposed, the HOP with the hyperparameter choice (1, ch) provides the best control of false positives while maintaining a reasonable true positive rate. Thus, this prior is recommended as the default prior on the space of WFMs.
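The well-formulated-model constraint itself is straightforward to operationalize; here is a minimal sketch (the dictionary-of-exponents representation of a polynomial term is our own, purely illustrative, not the dissertation's code):

```python
def parents(term):
    """Immediate lower-order predecessors of a polynomial term under
    strong heredity.  A term is a dict {variable: exponent}; each parent
    lowers one exponent by one (dropping the variable at zero)."""
    out = []
    for v in term:
        p = dict(term)
        p[v] -= 1
        if p[v] == 0:
            del p[v]
        if p:  # the intercept (empty dict) is always in the model
            out.append(p)
    return out

def well_formulated(model):
    """True if every term's parents are also included in the model."""
    return all(par in model for t in model for par in parents(t))

# x1, x2, x1*x2, x1^2 is well-formulated; dropping x2 breaks heredity,
# because x1*x2 would remain without one of its parents.
M = [{'x1': 1}, {'x2': 1}, {'x1': 1, 'x2': 1}, {'x1': 2}]
print(well_formulated(M))                                  # True
print(well_formulated([t for t in M if t != {'x2': 1}]))   # False
```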
In the near future, the software developed to carry out a Metropolis-Hastings random walk on the space of WFMs will be integrated into the R package varSelectIP. These new functions implement various local priors for the regression coefficients, including the intrinsic prior, the Zellner-Siow prior, and hyper-g priors. In addition, the software supports the computation of credible sets for each regression coefficient, conditioned on the selected model as well as under model averaging.
CHAPTER 5
CONCLUSIONS
Ecologists are now embracing the use of Bayesian methods to investigate the interactions that dictate the distribution and abundance of organisms. These tools are both powerful and flexible: they allow integrating, under a single methodology, empirical observations and theoretical process models, and can seamlessly account for several sources of uncertainty and dependence. The estimation and testing methods proposed throughout this document will contribute to the understanding of Bayesian methods used in ecology and, hopefully, will shed light on the differences between Bayesian estimation and testing tools.
All of our contributions exploit the potential of the latent variable formulation. This approach greatly simplifies the analysis of complex models: it redirects the bulk of the inferential burden away from the original response variables and places it on the easy-to-work-with latent scale, for which several time-tested approaches are available. Our methods are distinctly classified into estimation and testing tools.
For estimation, we proposed a Bayesian specification of the single-season occupancy model for which a Gibbs sampler is available using both logit and probit link functions. This setup allows detection and occupancy probabilities to depend on linear combinations of predictors. We then developed a dynamic version of this approach, incorporating the notion that occupancy at a previously occupied site depends both on the survival of current settlers and on habitat suitability. Additionally, because these dynamics also vary in space, we suggested a strategy to add spatial dependence among neighboring sites.
Ecological inquiry usually involves competing explanations, and uncertainty surrounds the decision to choose any one of them. Hence, a model, or a set of probable models, should be selected from all the viable alternatives. To address this testing problem, we proposed an objective and fully automatic Bayesian methodology for the single-season site-occupancy model. Our approach relies on the intrinsic prior, which avoids introducing (commonly unavailable) subjective information into the model. In simulation experiments, we observed that the methods accurately single out the predictors present in the true model using the marginal posterior inclusion probabilities of the predictors. For predictors in the true model, these probabilities were comparatively larger than those for predictors not present in the true model. The simulations also indicated that the method provides better discrimination for predictors in the detection component of the model.
In our simulations and in the analysis of the Blue Hawker data, we observed that the effect of using the multiplicity correction prior was substantial. This occurs because the Bayes factor only penalizes the complexity of the alternative model according to its number of parameters in excess of those of the null model. As the number of predictors grows, the number of models in the model space also grows, increasing the chances of making false positive decisions on the inclusion of predictors. This is where the role of the prior on the model space becomes important: the multiplicity penalty is "hidden away" in the model prior probabilities π(M | M). In addition to the multiplicity of the testing problem, disregarding the hierarchical polynomial structure of the predictors in model selection procedures has the potential to lead to different results depending on how the predictors are coded (e.g., in what units these predictors are expressed).
To confront this situation, we proposed three prior structures for well-formulated models that take advantage of the hierarchical structure of the predictors. Of the priors proposed, we recommend the HOP with the hyperparameter choice (1, ch), which provides the best control of false positives while maintaining a reasonable true positive rate.
Overall, considering the flexibility of the latent approach, several other extensions of these methods follow. Currently, we envision three future developments: (1) occupancy models that incorporate various sources of information, (2) multi-species models that make use of spatial and interspecific dependence, and (3) methods to conduct model selection for the dynamic and spatially explicit versions of the model.
APPENDIX A
FULL CONDITIONAL DENSITIES: DYMOSS

In this section we introduce the full conditional probability density functions for all the parameters involved in the DYMOSS model, using probit as well as logit links.

Sampler Z

The full conditionals corresponding to the presence indicators have the same form regardless of the link used. These are derived separately for the cases t = 1, 1 < t < T, and t = T, since their corresponding probabilities take on slightly different forms.

Let \(\phi(\nu \mid \mu, \sigma^2)\) represent the density of a normal random variable \(\nu\) with mean \(\mu\) and variance \(\sigma^2\), and recall that \(\psi_{i1} = F(x'_{(o)i}\alpha)\) and \(p_{ijt} = F(q'_{ijt}\lambda_t)\), where \(F(\cdot)\) is the inverse link function. The full conditional for \(z_{it}\) is given by:
1. For t = 1:

\[
\pi(z_{i1} \mid v_{i1}, \alpha, \lambda_1, \beta^{c}_{1}, \delta^{s}_{1})
= (\psi^{*}_{i1})^{z_{i1}} (1 - \psi^{*}_{i1})^{1 - z_{i1}}
= \mathrm{Bernoulli}(\psi^{*}_{i1}),
\tag{A-1}
\]

where

\[
\psi^{*}_{i1} =
\frac{\psi_{i1}\,\phi(v_{i1} \mid x'_{i1}\beta^{c}_{1} + \delta^{s}_{1}, 1)\prod_{j=1}^{J_{i1}}(1 - p_{ij1})}
{\psi_{i1}\,\phi(v_{i1} \mid x'_{i1}\beta^{c}_{1} + \delta^{s}_{1}, 1)\prod_{j=1}^{J_{i1}}(1 - p_{ij1})
+ (1 - \psi_{i1})\,\phi(v_{i1} \mid x'_{i1}\beta^{c}_{1}, 1)\prod_{j=1}^{J_{i1}} I(y_{ij1} = 0)}.
\]
2. For 1 < t < T:

\[
\pi(z_{it} \mid z_{i(t-1)}, z_{i(t+1)}, \lambda_t, \beta^{c}_{t-1}, \delta^{s}_{t-1})
= (\psi^{*}_{it})^{z_{it}} (1 - \psi^{*}_{it})^{1 - z_{it}}
= \mathrm{Bernoulli}(\psi^{*}_{it}),
\tag{A-2}
\]

where

\[
\psi^{*}_{it} =
\frac{\kappa_{it}\prod_{j=1}^{J_{it}}(1 - p_{ijt})}
{\kappa_{it}\prod_{j=1}^{J_{it}}(1 - p_{ijt}) + \nabla_{it}\prod_{j=1}^{J_{it}} I(y_{ijt} = 0)},
\]

with

(a) \(\kappa_{it} = F(x'_{i(t-1)}\beta^{c}_{t-1} + z_{i(t-1)}\delta^{s}_{t-1})\,\phi(v_{it} \mid x'_{it}\beta^{c}_{t} + \delta^{s}_{t}, 1)\), and

(b) \(\nabla_{it} = \left(1 - F(x'_{i(t-1)}\beta^{c}_{t-1} + z_{i(t-1)}\delta^{s}_{t-1})\right)\phi(v_{it} \mid x'_{it}\beta^{c}_{t}, 1)\).
3. For t = T:

\[
\pi(z_{iT} \mid z_{i(T-1)}, \lambda_T, \beta^{c}_{T-1}, \delta^{s}_{T-1})
= (\psi^{\star}_{iT})^{z_{iT}} (1 - \psi^{\star}_{iT})^{1 - z_{iT}}
= \mathrm{Bernoulli}(\psi^{\star}_{iT}),
\tag{A-3}
\]

where

\[
\psi^{\star}_{iT} =
\frac{\kappa^{\star}_{iT}\prod_{j=1}^{J_{iT}}(1 - p_{ijT})}
{\kappa^{\star}_{iT}\prod_{j=1}^{J_{iT}}(1 - p_{ijT}) + \nabla^{\star}_{iT}\prod_{j=1}^{J_{iT}} I(y_{ijT} = 0)},
\]

with

(a) \(\kappa^{\star}_{iT} = F(x'_{i(T-1)}\beta^{c}_{T-1} + z_{i(T-1)}\delta^{s}_{T-1})\), and

(b) \(\nabla^{\star}_{iT} = 1 - F(x'_{i(T-1)}\beta^{c}_{T-1} + z_{i(T-1)}\delta^{s}_{T-1})\).

Sampler \(u_i\)
\[
\pi(u_i \mid z_{i1}, \alpha) = \mathrm{trN}\!\left(x'_{(o)i}\alpha,\, 1,\, \mathrm{trunc}(z_{i1})\right),
\qquad
\mathrm{trunc}(z_{i1}) =
\begin{cases}
(-\infty, 0], & z_{i1} = 0, \\
(0, \infty), & z_{i1} = 1,
\end{cases}
\tag{A-4}
\]

where \(\mathrm{trN}(\mu, \sigma^2, A)\) denotes the pdf of a truncated normal random variable with mean \(\mu\), variance \(\sigma^2\), and truncation region \(A\).
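As a small illustration of how such a truncated-normal draw can be carried out, the sketch below uses simple rejection sampling (an assumption of ours for exposition; it is not the dissertation's implementation, and rejection sampling is only efficient when the mean is moderate):

```python
import random

def draw_trunc_normal(mu, z):
    """Draw u ~ N(mu, 1) truncated to (-inf, 0] when z == 0 and to
    (0, inf) when z == 1, mirroring trunc(z_i1) in (A-4).
    Rejection sampling: redraw until the sample lands in the region."""
    while True:
        u = random.gauss(mu, 1.0)
        if (z == 1 and u > 0) or (z == 0 and u <= 0):
            return u

random.seed(7)
occupied = [draw_trunc_normal(0.4, 1) for _ in range(500)]
absent = [draw_trunc_normal(0.4, 0) for _ in range(500)]
print(min(occupied) > 0, max(absent) <= 0)  # both truncation regions respected
```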
Sampler \(\alpha\)

\[
\pi(\alpha \mid u) \propto [\alpha] \prod_{i=1}^{N} \phi(u_i \mid x'_{(o)i}\alpha, 1).
\tag{A-5}
\]

If \([\alpha] \propto 1\), then

\[
\alpha \mid u \sim N(m_{\alpha}, \Sigma_{\alpha}),
\qquad
m_{\alpha} = \Sigma_{\alpha} X'_{(o)} u,
\qquad
\Sigma_{\alpha} = (X'_{(o)} X_{(o)})^{-1}.
\]
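For intuition, the conjugate draw in (A-5) reduces, in the single-covariate case, to sampling from a normal centered at the least-squares value; a minimal sketch (illustrative only, with X'X a scalar):

```python
import random

def draw_alpha(x, u):
    """One Gibbs draw of alpha | u ~ N(m, s2) under a flat prior, with
    m = (X'X)^{-1} X'u and s2 = (X'X)^{-1}, written for a single
    covariate so that X'X is just a sum of squares."""
    xtx = sum(xi * xi for xi in x)
    m = sum(xi * ui for xi, ui in zip(x, u)) / xtx
    return random.gauss(m, (1.0 / xtx) ** 0.5)

# With many observations the draw concentrates near the least-squares value.
random.seed(3)
x = [1.0] * 5000                          # intercept-only design
u = [random.gauss(2.0, 1.0) for _ in x]   # latent scores centered at 2
print(round(draw_alpha(x, u), 2))
```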
Sampler \(v_{it}\)

1. For t > 1:

\[
\pi(v_{i(t-1)} \mid z_{i(t-1)}, z_{it}, \beta^{c}_{t-1}, \delta^{s}_{t-1})
= \mathrm{trN}\!\left(\mu^{(v)}_{i(t-1)},\, 1,\, \mathrm{trunc}(z_{it})\right),
\tag{A-6}
\]

where \(\mu^{(v)}_{i(t-1)} = x'_{i(t-1)}\beta^{c}_{t-1} + z_{i(t-1)}\delta^{s}_{t-1}\) and \(\mathrm{trunc}(z_{it})\) denotes the corresponding truncation region determined by \(z_{it}\).
Sampler \((\beta^{c}_{t-1}, \delta^{s}_{t-1})\)

1. For t > 1:

\[
\pi(\beta^{c}_{t-1}, \delta^{s}_{t-1} \mid v_{t-1}, z_{t-1})
\propto [\beta^{c}_{t-1}, \delta^{s}_{t-1}]
\prod_{i=1}^{N} \phi(v_{i(t-1)} \mid x'_{i(t-1)}\beta^{c}_{t-1} + z_{i(t-1)}\delta^{s}_{t-1}, 1).
\tag{A-7}
\]

If \([\beta^{c}_{t-1}, \delta^{s}_{t-1}] \propto 1\), then

\[
\beta^{c}_{t-1}, \delta^{s}_{t-1} \mid v_{t-1}, z_{t-1} \sim N(m_{t-1}, \Sigma_{t-1}),
\qquad
m_{t-1} = \Sigma_{t-1} \tilde{X}'_{t-1} v_{t-1},
\qquad
\Sigma_{t-1} = (\tilde{X}'_{t-1}\tilde{X}_{t-1})^{-1},
\]

where \(\tilde{X}_{t-1} = (X_{t-1},\, z_{t-1})\).
Sampler \(w_{ijt}\)

1. For t > 1 and \(z_{it} = 1\):

\[
\pi(w_{ijt} \mid z_{it} = 1, y_{ijt}, \lambda_t)
= \mathrm{trN}\!\left(q'_{ijt}\lambda_t,\, 1,\, \mathrm{trunc}(y_{ijt})\right).
\tag{A-8}
\]

Sampler \(\lambda_t\)

1. For t = 1, 2, ..., T:

\[
\pi(\lambda_t \mid z_t, w_t) \propto [\lambda_t]
\prod_{i\,:\,z_{it}=1} \prod_{j=1}^{J_{it}} \phi(w_{ijt} \mid q'_{ijt}\lambda_t, 1).
\tag{A-9}
\]

If \([\lambda_t] \propto 1\), then

\[
\lambda_t \mid w_t, z_t \sim N(m_{\lambda_t}, \Sigma_{\lambda_t}),
\qquad
m_{\lambda_t} = \Sigma_{\lambda_t} Q'_t w_t,
\qquad
\Sigma_{\lambda_t} = (Q'_t Q_t)^{-1},
\]

where \(Q_t\) and \(w_t\) are, respectively, the design matrix and the vector of latent variables for surveys of sites such that \(z_{it} = 1\).
APPENDIX B
RANDOM WALK ALGORITHMS

Global Jump. From the current state M, the global jump is performed by drawing a model M' at random from the model space. This is achieved by beginning at the base model and increasing the order from \(J^{\min}_M\) to \(J^{\max}_M\) (the minimum and maximum orders of nodes in \(M_F \setminus M_B\)); at each order, a set of nodes is selected at random from the prior, conditioned on the nodes already in the model. The MH correction is

\[
\alpha = \min\left\{1,\ \frac{m(y \mid M', \mathcal{M})}{m(y \mid M, \mathcal{M})}\right\}.
\]
Local Jump. From the current state M, the local jump is performed by drawing a model from the set of models \(L(M) = \{M_\alpha : \alpha \in E(M) \cup C(M)\}\), where \(M_\alpha\) is \(M \setminus \{\alpha\}\) for \(\alpha \in E(M)\) and \(M \cup \{\alpha\}\) for \(\alpha \in C(M)\). The proposal probabilities for the model are computed as a mixture of \(p(M' \mid y, M, M' \in L(M))\) and the discrete uniform distribution. The proposal kernel is

\[
q(M' \mid y, M, M' \in L(M)) = \frac{1}{2}\left(p(M' \mid y, M, M' \in L(M)) + \frac{1}{|L(M)|}\right).
\]

This choice promotes moving to better models while maintaining a non-negligible probability of moving to any of the possible models. The MH correction is

\[
\alpha = \min\left\{1,\ \frac{m(y \mid M', \mathcal{M})}{m(y \mid M, \mathcal{M})}\cdot
\frac{q(M \mid y, M', M \in L(M'))}{q(M' \mid y, M, M' \in L(M))}\right\}.
\]
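A minimal sketch of this mixture proposal (illustrative only; the model names and posterior masses below are toy values, not output of the dissertation's software):

```python
import random

def proposal_probs(neighbors, post):
    """Proposal distribution over the local neighborhood L(M): an equal
    mixture of the renormalized posterior masses of the neighbors and the
    discrete uniform, so every neighbor keeps probability >= 1/(2|L(M)|).
    `post` maps each candidate model to an unnormalized posterior mass."""
    total = sum(post[m] for m in neighbors)
    n = len(neighbors)
    return {m: 0.5 * post[m] / total + 0.5 / n for m in neighbors}

def draw_local(neighbors, post):
    """Sample one neighbor from the mixture kernel."""
    probs = proposal_probs(neighbors, post)
    return random.choices(list(probs), weights=list(probs.values()))[0]

# Toy neighborhood: the 'good' model dominates the proposal, but the two
# poor models each keep probability >= 1/6 and remain reachable.
q = proposal_probs(['good', 'poor1', 'poor2'],
                   {'good': 8.0, 'poor1': 1.0, 'poor2': 1.0})
print({m: round(pr, 3) for m, pr in q.items()})  # good: 0.567, poor1/poor2: 0.217
```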
Intermediate Jump. The intermediate jump is performed by increasing or decreasing the order of the nodes under consideration, performing local proposals based on order. For a model M', define \(L_j(M') = \{M'\} \cup \{M'_\alpha : \alpha \in (E(M') \cup C(M')) \cap O_j(M_F)\}\), where \(O_j(M_F)\) denotes the nodes of order j in \(M_F\). From a state M, the kernel chooses at random whether to increase or decrease the order. If \(M = M_F\), then decreasing the order is chosen with probability 1, and if \(M = M_B\), then increasing the order is chosen with probability 1; in all other cases, the probability of increasing and of decreasing the order is 1/2. The proposal kernels are given by:
Increasing order proposal kernel:

1. Set \(j = J^{\min}_M - 1\) and \(M'_j = M\).

2. Draw \(M'_{j+1}\) from \(q_{\mathrm{inc},j+1}(M' \mid y, M, M' \in L_{j+1}(M'_j))\), where

\[
q_{\mathrm{inc},j+1}(M' \mid y, M, M' \in L_{j+1}(M'_j))
= \frac{1}{2}\left(p(M' \mid y, M, M' \in L_{j+1}(M'_j)) + \frac{1}{|L_{j+1}(M'_j)|}\right).
\]

3. Set \(j = j + 1\).

4. If \(j < J^{\max}_M\), return to step 2. Otherwise, proceed to step 5.

5. Set \(M' = M'_{J^{\max}_M}\) and compute the proposal probability

\[
q_{\mathrm{inc}}(M' \mid y, M) = \prod_{j = J^{\min}_M - 1}^{J^{\max}_M - 1}
q_{\mathrm{inc},j+1}(M'_{j+1} \mid y, M, M' \in L_{j+1}(M'_j)).
\tag{B-1}
\]
Decreasing order proposal kernel:

1. Set \(j = J^{\max}_M + 1\) and \(M'_j = M\).

2. Draw \(M'_{j-1}\) from \(q_{\mathrm{dec},j-1}(M' \mid y, M, M' \in L_{j-1}(M'_j))\), where

\[
q_{\mathrm{dec},j-1}(M' \mid y, M, M' \in L_{j-1}(M'_j))
= \frac{1}{2}\left(p(M' \mid y, M, M' \in L_{j-1}(M'_j)) + \frac{1}{|L_{j-1}(M'_j)|}\right).
\]

3. Set \(j = j - 1\).

4. If \(j > J^{\min}_M\), return to step 2. Otherwise, proceed to step 5.

5. Set \(M' = M'_{J^{\min}_M}\) and compute the proposal probability

\[
q_{\mathrm{dec}}(M' \mid y, M) = \prod_{j = J^{\max}_M + 1}^{J^{\min}_M + 1}
q_{\mathrm{dec},j-1}(M'_{j-1} \mid y, M, M' \in L_{j-1}(M'_j)).
\tag{B-2}
\]
If increasing order is chosen, then the MH correction is given by

\[
\alpha = \min\left\{1,\ \frac{1 + I(M' = M_F)}{1 + I(M = M_B)}\cdot
\frac{q_{\mathrm{dec}}(M \mid y, M')}{q_{\mathrm{inc}}(M' \mid y, M)}\cdot
\frac{p(M' \mid y, \mathcal{M})}{p(M \mid y, \mathcal{M})}\right\},
\tag{B-3}
\]

and similarly if decreasing order is chosen.
Other Local and Intermediate Kernels. The local and intermediate kernels described here perform a kind of stochastic forwards-backwards selection. Each kernel q can be relaxed to allow more than one node to be turned on or off at each step, which could provide larger jumps for each of these kernels. The tradeoff is that the number of proposed models for such jumps could be very large, precluding the use of posterior information in the construction of the proposal kernel.
APPENDIX C
WFM SIMULATION DETAILS
Briefly, the idea is to let \(Z_{M_T}(X)\beta_{M_T} = (QR)\beta_{M_T} = Q\eta_{M_T}\) (i.e., \(\beta_{M_T} = R^{-1}\eta_{M_T}\)), using the QR decomposition. As such, setting all values in \(\eta_{M_T}\) proportional to one corresponds to distributing the signal in the model uniformly across all predictors, regardless of their order.

The (unconditional) variance of a single observation \(y_i\) is \(\mathrm{var}(y_i) = \mathrm{var}(E[y_i \mid z_i]) + E[\mathrm{var}(y_i \mid z_i)]\), where \(z_i\) is the i-th row of the design matrix \(Z_{M_T}\). Hence, we take the signal-to-noise ratio for each observation to be

\[
\mathrm{SNR}(\eta) = \frac{\eta'_{M_T} R^{-T} \Sigma_z R^{-1} \eta_{M_T}}{\sigma^2},
\]

where \(\Sigma_z = \mathrm{var}(z_i)\). We determine how the signal is distributed across predictors up to a proportionality constant, to be able to control the signal-to-noise ratio simultaneously.
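Concretely, once a direction \(\tilde{\eta}\) for the signal allocation is fixed, the proportionality constant needed to hit a target ratio \(\mathrm{SNR}(\eta) = k\) follows directly from the expression above:

\[
\eta_{M_T} = c\,\tilde{\eta},
\qquad
c = \sqrt{\frac{k\,\sigma^2}{\tilde{\eta}'\, R^{-T}\Sigma_z R^{-1}\,\tilde{\eta}}},
\]

since \(\mathrm{SNR}(c\,\tilde{\eta}) = c^2\,\tilde{\eta}' R^{-T}\Sigma_z R^{-1}\tilde{\eta} / \sigma^2 = k\); the regression coefficients are then recovered as \(\beta_{M_T} = R^{-1}\eta_{M_T}\).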
Additionally, to investigate the ability of the model to capture the hierarchical structure correctly, we specify four different 0-1 vectors that determine the predictors in \(M_T\), which generates the data in the different scenarios.
Table C-1. Experimental conditions, WFM simulations

Parameter                       Values considered
SNR(\(\eta_{M_T}\)) = k         0.25, 1, 4
\(\eta_{M_T}\) ∝                (1, 1_3, 1_4, 1_2); (1, 1_3, (1/2)1_4, (1/4)1_2); (1, (1/4)1_3, (1/2)1_4, 1_2)
\(\gamma_{M_T}\)                (1, 1_3, 1_4, 1_2); (1, 1_3, 1_4, 0_2); (1, 1_3, 0_4, 1_2); (1, 0_3, (0, 1, 1, 0), 1_2)
n                               130, 260, 1040
The results presented below are somewhat different from those found in the main body of the article in Section 5. They are extracted by averaging the number of FPs, TPs, and model sizes, respectively, over the 100 independent runs and across the corresponding scenarios for the 20 highest probability models.
SNR and Sample Size Effect

In terms of the SNR and the sample size (Figure C-1), we observe that, as expected, small sample sizes conditioned upon a small SNR impair the ability of the algorithm to detect true coefficients with both the EPP and the HOP(1, ch), with this effect more noticeable when using the latter prior. However, considering the mean number of true positives (TP) jointly with the mean model size, it is clear that although the sensitivity is low, most of the few predictors that are discovered belong to the true model. The results observed with an SNR of 0.25 and a relatively small sample size are far from being impressive; however, real problems where the SNR is as low as 0.25 will yield many spurious associations under the EPP. The fact that the HOP(1, ch) has a strong protection against false positives is commendable in itself. An SNR of 1 also represents a feeble relationship between the predictors and the response; nonetheless, the method captures approximately half of the true coefficients while including very few false positives. Following intuition, as either the sample size or the SNR increases, the algorithm's performance is greatly enhanced. Either having a large sample size or a large SNR yields models that contain mostly true predictors. Additionally, the HOP(1, ch) provides strong control over the number of false positives; therefore, for high SNR or larger sample sizes, the number of predictors in the top 20 models is close to the size of the true model. In general, the EPP allows the detection of more TPs, while the HOP(1, ch) provides stronger control on the number of FPs included when considering small sample sizes combined with small SNRs. As either the sample size or the SNR grows, the differences between the two priors become indistinct.
Figure C-1. SNR vs. n: average model size, average true positives, and average false positives for all simulated scenarios, by model ranking according to model posterior probabilities.
Coefficient Magnitude

This part of the experiment explores the effect of how the signal is distributed across predictors. As mentioned before, sphering is used to assign the coefficient values in a manner that controls the amount of signal that goes into each coefficient. Three possible ways to allocate the signal are considered: first, each order-one coefficient contains twice as much signal as any order-two coefficient and four times as much as any order-three coefficient; second, all coefficients contain the same amount of signal regardless of their order; and third, each order-one coefficient contains half as much signal as any order-two coefficient and a quarter of what any order-three coefficient has. In Figure C-2 these values are denoted by β = c(1_{o1}, 0.5_{o2}, 0.25_{o3}), β = c(1_{o1}, 1_{o2}, 1_{o3}), and β = c(0.25_{o1}, 0.5_{o2}, 1_{o3}), respectively.

Observe that the number of FPs is invulnerable to how the SNR is distributed across predictors using the HOP(1, ch); conversely, when using the EPP, the number of FPs decreases as the SNR grows, always being slightly higher than that obtained with the HOP. With either prior structure, the algorithm performs better whenever all coefficients are equally weighted or when those for the order-three terms have higher weights. In these two cases (i.e., with β = c(0.25_{o1}, 0.5_{o2}, 1_{o3}) or β = c(1_{o1}, 1_{o2}, 1_{o3})), the effect of the SNR appears to be similar. In contrast, when more weight is given to order-one terms, the algorithm yields slightly worse models at any SNR level. This is an intuitive result, since giving more signal to higher-order terms makes it easier to detect higher-order terms, and consequently, by strong heredity, the algorithm will also select the corresponding lower-order terms included in the true model.
Special Points on the Scale

In Nelder (1998), the author argues that the conditions under which the weak-heredity principle can be used for model selection are so restrictive that the principle is commonly not valid in practice in this context. In addition, the author states that considering only well-formulated models does not take into account the possible presence of special points on the scales of the predictors, that is, situations where omitting lower-order terms is justified due to the nature of the data. However, it is our contention that every model has an underlying well-formulated structure; whether or not some predictor has special points on its scale will be determined through the estimation of the coefficients once a valid well-formulated structure has been chosen.

To understand how the algorithm behaves whenever the true data generating mechanism has zero-valued coefficients for some lower-order terms in the hierarchy, four different true models are considered. Three of them are not well-formulated, while the remaining one is the WFM shown in Figure 4-6. The three models that have special points correspond to the same model \(M_T\) from Figure 4-6, but have, respectively, zero-valued coefficients for all the order-one terms, all the order-two terms, and for x1^2 and x2x5.

Figure C-2. SNR vs. coefficient values: average model size, average true positives, and average false positives for all simulated scenarios, by model ranking according to model posterior probabilities.
As seen before, in comparison to the EPP, the HOP(1, ch) tightly controls the inclusion of FPs by choosing smaller models, at the expense of also reducing the TP count, especially when there is more uncertainty about the true model (i.e., SNR = 0.25). For both prior structures, the results in Figure C-3 indicate that at low SNR levels the presence of special points has no apparent impact, as the selection behavior is similar among the four models in terms of both the TP and FP. As the SNR increases, the TPs and the model size are affected for true models with zero-valued lower-order terms. These differences, however, are not very large. Relatively smaller models are selected whenever some terms in the hierarchy are missing; but with high SNR, which is where the differences are most pronounced, the predictors included are mostly true coefficients. The impact is almost imperceptible for the true model that lacks order-one terms and for the model with zero coefficients for x1^2 and x2x5, and is more visible for models without order-two terms. This last result is expected due to strong heredity: whenever the order-one coefficients are missing, the inclusion of order-two and order-three terms will force their selection, which is also the case when only a few order-two terms have zero-valued coefficients. Conversely, when all order-two predictors are removed, some order-three predictors are not selected, as their signal is attributed to the order-two predictors missing from the true model. This is especially the case for the order-three interaction term x1x2x5, which depends on the inclusion of three order-two terms (x1x2, x1x5, x2x5) in order for it to be included as well. This makes the inclusion of this term somewhat more challenging: the three order-two interactions capture most of the variation of the polynomial terms that is present when the order-three term is also included. However, special points on the scale commonly occur on a single or at most on a few covariates. A true data generating mechanism that removes all terms of a given order in the context of polynomial models is clearly not justified; this was done here only for comparison purposes.

Figure C-3. SNR vs. different true models \(M_T\): average model size, average true positives, and average false positives for all simulated scenarios, by model ranking according to model posterior probabilities.
APPENDIX D
SUPPLEMENTARY INFORMATION FOR THE OZONE DATA ANALYSIS

The covariates considered for the ozone data analysis match those used in Liang et al. (2008); these are displayed in Table D-1 below.

Table D-1. Variables used in the analyses of the ozone contamination dataset

Name    Description
ozone   Daily maximum 1-hour average ozone (ppm) at Upland, CA
vh      500-millibar pressure height (m) at Vandenberg AFB
wind    Wind speed (mph) at LAX
hum     Humidity (%) at LAX
temp    Temperature (F) measured at Sandburg, CA
ibh     Inversion base height (ft) at LAX
dpg     Pressure gradient (mm Hg) from LAX to Daggett, CA
vis     Visibility (miles) measured at LAX
ibt     Inversion base temperature (F) at LAX
The marginal posterior inclusion probability corresponds to the probability of including a given term in the full model \(M_F\) after summing over all models in the model space. For each node \(\alpha \in M_F\), this probability is given by

\[
p_\alpha = \sum_{M \in \mathcal{M}} I(\alpha \in M)\, p(M \mid y, \mathcal{M}).
\]

In problems with a large model space, such as the one considered for the ozone concentration analysis, enumeration of the entire space is not feasible. Thus, these probabilities are estimated by summing over every model drawn by the random walk on the model space \(\mathcal{M}\).
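A minimal sketch of this Monte Carlo estimate (illustrative only; models are represented as frozensets of included terms, and visit frequencies stand in for \(p(M \mid y, \mathcal{M})\)):

```python
from collections import Counter

def marginal_inclusion(draws):
    """Estimate p_alpha = sum_M I(alpha in M) p(M | y) from a Monte Carlo
    sample of models: each draw is a frozenset of included terms, and
    p(M | y) is approximated by the visit frequency of M."""
    counts = Counter(draws)
    n = len(draws)
    terms = set().union(*counts)
    return {a: sum(c for M, c in counts.items() if a in M) / n for a in terms}

# Toy chain of four visited models over the terms hum, dpg, ibt.
draws = [frozenset({'hum', 'ibt'}), frozenset({'hum', 'ibt'}),
         frozenset({'hum', 'dpg', 'ibt'}), frozenset({'ibt'})]
probs = marginal_inclusion(draws)
print(probs['ibt'], probs['hum'], probs['dpg'])  # ibt in 4/4, hum in 3/4, dpg in 1/4
```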
Given that there are in total 44 potential predictors, for convenience, Tables D-2 to D-5 below display the marginal posterior probabilities only for the terms included under at least one of the model priors considered (EPP, HIP, HUP, and HOP), for each of the parameter priors utilized (intrinsic priors, Zellner-Siow priors, hyper-g(1,1), and hyper-g(2,1)).
Table D-2. Marginal inclusion probabilities, intrinsic prior

          EPP    HIP    HUP    HOP
hum      0.99   0.69   0.85   0.76
dpg      0.85   0.48   0.52   0.53
ibt      0.99   1.00   1.00   1.00
hum^2    0.76   0.51   0.43   0.62
hum*dpg  0.55   0.02   0.03   0.17
hum*ibt  0.98   0.69   0.84   0.75
dpg^2    0.72   0.36   0.25   0.46
ibt^2    0.59   0.78   0.57   0.81

Table D-3. Marginal inclusion probabilities, Zellner-Siow prior

          EPP    HIP    HUP    HOP
hum      0.76   0.67   0.80   0.69
dpg      0.89   0.50   0.55   0.58
ibt      0.99   1.00   1.00   1.00
hum^2    0.57   0.49   0.40   0.57
hum*ibt  0.72   0.66   0.78   0.68
dpg^2    0.81   0.38   0.31   0.51
ibt^2    0.54   0.76   0.55   0.77

Table D-4. Marginal inclusion probabilities, hyper-g(1,1)

          EPP    HIP    HUP    HOP
vh       0.54   0.05   0.10   0.11
hum      0.81   0.67   0.80   0.69
dpg      0.90   0.50   0.55   0.58
ibt      0.99   1.00   0.99   0.99
hum^2    0.61   0.49   0.40   0.57
hum*ibt  0.78   0.66   0.78   0.68
dpg^2    0.83   0.38   0.30   0.51
ibt^2    0.49   0.76   0.54   0.77

Table D-5. Marginal inclusion probabilities, hyper-g(2,1)

          EPP    HIP    HUP    HOP
hum      0.79   0.64   0.73   0.67
dpg      0.90   0.52   0.60   0.59
ibt      0.99   1.00   0.99   1.00
hum^2    0.60   0.47   0.37   0.55
hum*ibt  0.76   0.64   0.71   0.67
dpg^2    0.82   0.41   0.36   0.52
ibt^2    0.47   0.73   0.49   0.75
REFERENCES
Akaike H (1983) Information measures and model selection Bull Int Statist Inst 50277ndash290
Albert J H amp Chib S (1993) Bayesian-analysis of binary and polychotomousresponse data Journal of the American Statistical Association 88(422) 669ndash679
Berger J amp Bernardo J (1992) On the development of reference priors BayesianStatistics 4 (pp 35ndash60)
URL httpisbastatdukeedueventsvalencia1992Valencia4Refpdf
Berger J amp Pericchi L (1996) The intrinsic Bayes factor for model selection andprediction Journal of the American Statistical Association 91(433) 109ndash122
URL httpamstattandfonlinecomdoiabs10108001621459199610476668
Berger, J., Pericchi, L., & Ghosh, J. (2001). Objective Bayesian methods for model selection: introduction and comparison. In Model Selection, vol. 38 of IMS Lecture Notes – Monograph Series (pp. 135–207). Institute of Mathematical Statistics.

Besag, J., York, J., & Mollié, A. (1991). Bayesian image restoration, with two applications in spatial statistics. Annals of the Institute of Statistical Mathematics, 43, 1–20.

Bien, J., Taylor, J., & Tibshirani, R. (2013). A lasso for hierarchical interactions. The Annals of Statistics, 41(3), 1111–1141.

Breiman, L., & Friedman, J. (1985). Estimating optimal transformations for multiple regression and correlation. Journal of the American Statistical Association, 80, 580–598.

Brusco, M. J., Steinley, D., & Cradit, J. D. (2009). An exact algorithm for hierarchically well-formulated subsets in second-order polynomial regression. Technometrics, 51(3), 306–315.

Casella, G., Girón, F. J., Martínez, M. L., & Moreno, E. (2009). Consistency of Bayesian procedures for variable selection. The Annals of Statistics, 37(3), 1207–1228.

Casella, G., Moreno, E., & Girón, F. J. (2014). Cluster analysis, model selection, and prior distributions on models. Bayesian Analysis, TBA, 1–46.
Chipman, H. (1996). Bayesian variable selection with related predictors. Canadian Journal of Statistics, 24(1), 17–36.

Clyde, M., & George, E. I. (2004). Model uncertainty. Statistical Science, 19(1), 81–94.

Dewey, J. (1958). Experience and Nature. New York: Dover Publications.

Dorazio, R. M., & Taylor-Rodríguez, D. (2012). A Gibbs sampler for Bayesian analysis of site-occupancy data. Methods in Ecology and Evolution, 3, 1093–1098.

Ellison, A. M. (2004). Bayesian inference in ecology. Ecology Letters, 7, 509–520.

Fiske, I., & Chandler, R. (2011). unmarked: An R package for fitting hierarchical models of wildlife occurrence and abundance. Journal of Statistical Software, 43(10).

George, E. (2000). The variable selection problem. Journal of the American Statistical Association, 95(452), 1304–1308.

Girón, F. J., Moreno, E., Casella, G., & Martínez, M. L. (2010). Consistency of objective Bayes factors for nonnested linear models and increasing model dimension. Revista de la Real Academia de Ciencias Exactas, Físicas y Naturales. Serie A. Matemáticas, 104(1), 57–67.

Good, I. J. (1950). Probability and the Weighing of Evidence. New York: Haffner.

Griepentrog, G. L., Ryan, J. M., & Smith, L. D. (1982). Linear transformations of polynomial regression models. American Statistician, 36(3), 171–174.

Gunel, E., & Dickey, J. (1974). Bayes factors for independence in contingency tables. Biometrika, 61, 545–557.

Hanski, I. (1994). A practical model of metapopulation dynamics. Journal of Animal Ecology, 63, 151–162.

Hooten, M. (2006). Hierarchical spatio-temporal models for ecological processes. Doctoral dissertation, University of Missouri-Columbia.

Hooten, M. B., & Hobbs, N. T. (2014). A guide to Bayesian model selection for ecologists. Ecological Monographs (in press).
Hughes, J., & Haran, M. (2013). Dimension reduction and alleviation of confounding for spatial generalized linear mixed models. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 75, 139–159.

Hurvich, C. M., & Tsai, C.-L. (1989). Regression and time series model selection in small samples. Biometrika, 76, 297–307.

Jeffreys, H. (1935). Some tests of significance, treated by the theory of probability. Proceedings of the Cambridge Philosophical Society, 31, 203–222.

Jeffreys, H. (1961). Theory of Probability (3rd ed.). London: Oxford University Press.

Johnson, D., Conn, P., Hooten, M., Ray, J., & Pond, B. (2013). Spatial occupancy models for large data sets. Ecology, 94(4), 801–808.

Kass, R., & Wasserman, L. (1995). A reference Bayesian test for nested hypotheses and its relationship to the Schwarz criterion. Journal of the American Statistical Association, 90(431).

Kass, R. E., & Raftery, A. E. (1995). Bayes factors. Journal of the American Statistical Association, 90, 773–795.

Kass, R. E., & Wasserman, L. (1996). The selection of prior distributions by formal rules. Journal of the American Statistical Association, 91(435), 1343.

Kéry, M. (2010). Introduction to WinBUGS for Ecologists: A Bayesian Approach to Regression, ANOVA, Mixed Models and Related Analyses (1st ed.). Academic Press.

Kéry, M., Gardner, B., & Monnerat, C. (2010). Predicting species distributions from checklist data using site-occupancy models. Journal of Biogeography, 37(10), 1851–1862.

Khuri, A. (2002). Nonsingular linear transformations of the control variables in response surface models. Technical report.

Krebs, C. J. (1972). Ecology: The Experimental Analysis of Distribution and Abundance.
Lempers, F. B. (1971). Posterior Probabilities of Alternative Linear Models. Rotterdam: University of Rotterdam Press.

León-Novelo, L., Moreno, E., & Casella, G. (2012). Objective Bayes model selection in probit models. Statistics in Medicine, 31(4), 353–365.

Liang, F., Paulo, R., Molina, G., Clyde, M. A., & Berger, J. O. (2008). Mixtures of g priors for Bayesian variable selection. Journal of the American Statistical Association, 103(481), 410–423.

Link, W., & Barker, R. (2009). Bayesian Inference with Ecological Applications. Elsevier.

MacKenzie, D., & Nichols, J. (2004). Occupancy as a surrogate for abundance estimation. Animal Biodiversity and Conservation, 1, 461–467.

MacKenzie, D., Nichols, J., & Hines, J. (2003). Estimating site occupancy, colonization, and local extinction when a species is detected imperfectly. Ecology, 84(8), 2200–2207.

MacKenzie, D. I., Bailey, L. L., & Nichols, J. D. (2004). Investigating species co-occurrence patterns when species are detected imperfectly. Journal of Animal Ecology, 73, 546–555.

MacKenzie, D. I., Nichols, J. D., Lachman, G. B., Droege, S., Royle, J. A., & Langtimm, C. A. (2002). Estimating site occupancy rates when detection probabilities are less than one. Ecology, 83(8), 2248–2255.

Mazerolle, M., & Mazerolle, M. (2013). Package 'AICcmodavg'.

McCullagh, P., & Nelder, J. A. (1989). Generalized Linear Models (2nd ed.). London, England: Chapman & Hall.

McQuarrie, A., Shumway, R., & Tsai, C.-L. (1997). The model selection criterion AICu.
Moreno, E., Bertolino, F., & Racugno, W. (1998). An intrinsic limiting procedure for model selection and hypotheses testing. Journal of the American Statistical Association, 93(444), 1451–1460.

Moreno, E., Girón, F. J., & Casella, G. (2010). Consistency of objective Bayes factors as the model dimension grows. The Annals of Statistics, 38(4), 1937–1952.

Nelder, J. A. (1977). Reformulation of linear models. Journal of the Royal Statistical Society, Series A, 140, 48–77.

Nelder, J. A. (1998). The selection of terms in response-surface models: how strong is the weak-heredity principle? American Statistician, 52(4), 315–318.

Nelder, J. A. (2000). Functional marginality and response-surface fitting. Journal of Applied Statistics, 27(1), 109–112.

Nichols, J., Hines, J., & MacKenzie, D. (2007). Occupancy estimation and modeling with multiple states and state uncertainty. Ecology, 88(6), 1395–1400.

Ovaskainen, O., Hottola, J., & Siitonen, J. (2010). Modeling species co-occurrence by multivariate logistic regression generates new hypotheses on fungal interactions. Ecology, 91(9), 2514–2521.

Peixoto, J. L. (1987). Hierarchical variable selection in polynomial regression models. American Statistician, 41(4), 311–313.

Peixoto, J. L. (1990). A property of well-formulated polynomial regression models. American Statistician, 44(1), 26–30.

Pericchi, L. R. (2005). Model selection and hypothesis testing based on objective probabilities and Bayes factors. In Handbook of Statistics. Elsevier.

Polson, N. G., Scott, J. G., & Windle, J. (2013). Bayesian inference for logistic models using Pólya-Gamma latent variables. Journal of the American Statistical Association, 108, 1339–1349.

Rao, C. R., & Wu, Y. (2001). On model selection. Vol. 38 of Lecture Notes – Monograph Series (pp. 1–57). Beachwood, OH: Institute of Mathematical Statistics.
Reich, B. J., Hodges, J. S., & Zadnik, V. (2006). Effects of residual smoothing on the posterior of the fixed effects in disease-mapping models. Biometrics, 62, 1197–1206.

Reiners, W., & Lockwood, J. (2009). Philosophical Foundations for the Practices of Ecology. Cambridge University Press.

Rigler, F., & Peters, R. (1995). Science and Limnology. Excellence in Ecology. Germany: Ecology Institute.

Robert, C., Chopin, N., & Rousseau, J. (2009). Harold Jeffreys's Theory of Probability revisited. Statistical Science, 24(2), 141–179.

Robert, C. P. (1993). A note on the Jeffreys-Lindley paradox. Statistica Sinica, 3, 601–608.

Royle, J. A., & Kéry, M. (2007). A Bayesian state-space formulation of dynamic occupancy models. Ecology, 88(7), 1813–1823.

Scott, J., & Berger, J. (2010). Bayes and empirical-Bayes multiplicity adjustment in the variable selection problem. The Annals of Statistics.

Spiegelhalter, D. J., & Smith, A. F. M. (1982). Bayes factors for linear and log-linear models with vague prior information. Journal of the Royal Statistical Society, Series B, 44, 377–387.

Tierney, L., & Kadane, J. B. (1986). Accurate approximations for posterior moments and marginal densities. Journal of the American Statistical Association, 81, 82–86.

Tyre, A. J., Tenhumberg, B., Field, S. A., Niejalke, D., Parris, K., & Possingham, H. P. (2003). Improving precision and reducing bias in biological surveys: estimating false-negative error rates. Ecological Applications, 13(6), 1790–1801.

Waddle, J. H., Dorazio, R. M., Walls, S. C., Rice, K. G., Beauchamp, J., Schuman, M. J., & Mazzotti, F. J. (2010). A new parameterization for estimating co-occurrence of interacting species. Ecological Applications, 20, 1467–1475.

Wasserman, L. (2000). Bayesian model selection and model averaging. Journal of Mathematical Psychology, 44(1), 92–107.
Wilson, M., Iversen, E., Clyde, M. A., Schmidler, S. C., & Schildkraut, J. M. (2010). Bayesian model search and multilevel inference for SNP association studies. The Annals of Applied Statistics, 4(3), 1342–1364.

Womack, A. J., León-Novelo, L., & Casella, G. (2014). Inference from intrinsic Bayes procedures under model selection and uncertainty. Journal of the American Statistical Association.

Yuan, M., Joseph, V. R., & Zou, H. (2009). Structured variable selection and estimation. The Annals of Applied Statistics, 3(4), 1738–1757.

Zeller, K. A., Nijhawan, S., Salom-Pérez, R., Potosme, S. H., & Hines, J. E. (2011). Integrating occupancy modeling and interview data for corridor identification: a case study for jaguars in Nicaragua. Biological Conservation, 144(2), 892–901.

Zellner, A., & Siow, A. (1980). Posterior odds ratios for selected regression hypotheses. In Trabajos de Estadística y de Investigación Operativa (pp. 585–603).
BIOGRAPHICAL SKETCH
Daniel Taylor-Rodríguez was born in Bogotá, Colombia. He earned a B.S. degree in economics from the Universidad de Los Andes (2004) and a Specialist degree in statistics from the Universidad Nacional de Colombia. In 2009 he traveled to Gainesville, Florida, to pursue a master's in statistics under the supervision of George Casella. Upon completion, he started a Ph.D. in interdisciplinary ecology with concentration in statistics, again under George Casella's supervision. After George's passing, Linda Young and Nikolay Bliznyuk continued to oversee Daniel's mentorship. He has accepted a joint postdoctoral fellowship at the Statistical and Applied Mathematical Sciences Institute and the Department of Statistical Science at Duke University.
TABLE OF CONTENTS
page
ACKNOWLEDGMENTS 4
LIST OF TABLES 8
LIST OF FIGURES 10
ABSTRACT 12
CHAPTER
1 GENERAL INTRODUCTION 14
1.1 Occupancy Modeling 15
1.2 A Primer on Objective Bayesian Testing 17
1.3 Overview of the Chapters 21

2 MODEL ESTIMATION METHODS 23

2.1 Introduction 23
2.1.1 The Occupancy Model 24
2.1.2 Data Augmentation Algorithms for Binary Models 26
2.2 Single Season Occupancy 29
2.2.1 Probit Link Model 30
2.2.2 Logit Link Model 32
2.3 Temporal Dynamics and Spatial Structure 34
2.3.1 Dynamic Mixture Occupancy State-Space Model 37
2.3.2 Incorporating Spatial Dependence 43
2.4 Summary 46

3 INTRINSIC ANALYSIS FOR OCCUPANCY MODELS 49

3.1 Introduction 49
3.2 Objective Bayesian Inference 52
3.2.1 The Intrinsic Methodology 53
3.2.2 Mixtures of g-Priors 54
3.2.2.1 Intrinsic priors 55
3.2.2.2 Other mixtures of g-priors 56
3.3 Objective Bayes Occupancy Model Selection 57
3.3.1 Preliminaries 58
3.3.2 Intrinsic Priors for the Occupancy Problem 60
3.3.3 Model Posterior Probabilities 62
3.3.4 Model Selection Algorithm 63
3.4 Alternative Formulation 66
3.5 Simulation Experiments 68
3.5.1 Marginal Posterior Inclusion Probabilities for Model Predictors 70
3.5.2 Summary Statistics for the Highest Posterior Probability Model 76
3.6 Case Study: Blue Hawker Data Analysis 77
3.6.1 Results: Variable Selection Procedure 79
3.6.2 Validation for the Selection Procedure 81
3.7 Discussion 82

4 PRIORS ON THE MODEL SPACE AND WELL-FORMULATED MODELS 84

4.1 Introduction 84
4.2 Setup for Well-Formulated Models 88
4.2.1 Well-Formulated Model Spaces 90
4.3 Priors on the Model Space 91
4.3.1 Model Prior Definition 92
4.3.2 Choice of Prior Structure and Hyper-Parameters 96
4.3.3 Posterior Sensitivity to the Choice of Prior 99
4.4 Random Walks on the Model Space 104
4.4.1 Simple Pruning and Growing 105
4.4.2 Degree Based Pruning and Growing 106
4.5 Simulation Study 107
4.5.1 SNR and Sample Size Effect 109
4.5.2 Coefficient Magnitude 110
4.5.3 Special Points on the Scale 111
4.6 Case Study: Ozone Data Analysis 111
4.7 Discussion 113
5 CONCLUSIONS 115
APPENDIX
A FULL CONDITIONAL DENSITIES DYMOSS 118
B RANDOM WALK ALGORITHMS 121
C WFM SIMULATION DETAILS 124
D SUPPLEMENTARY INFORMATION FOR THE OZONE DATA ANALYSIS 131
REFERENCES 133
BIOGRAPHICAL SKETCH 140
LIST OF TABLES
Table page
1-1 Interpretation of BFji when contrasting Mj and Mi 20
3-1 Simulation control parameters, occupancy model selector 69

3-2 Comparison of average minOddsMPIP under scenarios having different numbers of sites (N=50, N=100) and under scenarios having different numbers of surveys per site (J=3, J=5), for the presence and detection components, using uniform and multiplicity correction priors 75

3-3 Comparison of average minOddsMPIP for different levels of signal considered in the occupancy and detection probabilities, for the presence and detection components, using uniform and multiplicity correction priors 75

3-4 Comparison between scenarios with 50 and 100 sites in terms of the average percentage of true positive and true negative terms over the highest probability models, for the presence and the detection components, using uniform and multiplicity correcting priors on the model space 76

3-5 Comparison between scenarios with 3 and 5 surveys per site in terms of the percentage of true positive and true negative predictors averaged over the highest probability models, for the presence and the detection components, using uniform and multiplicity correcting priors on the model space 77

3-6 Comparison between scenarios with different levels of signal in the occupancy component in terms of the percentage of true positive and true negative predictors averaged over the highest probability models, for the presence and the detection components, using uniform and multiplicity correcting priors on the model space 77

3-7 Comparison between scenarios with different levels of signal in the detection component in terms of the percentage of true positive and true negative predictors averaged over the highest probability models, for the presence and the detection components, using uniform and multiplicity correcting priors on the model space 78

3-8 Posterior probability for the five highest probability models in the presence component of the blue hawker data 80

3-9 Posterior probability for the five highest probability models in the detection component of the blue hawker data 80

3-10 MPIP, presence component 81

3-11 MPIP, detection component 81

3-12 Mean misclassification rate for HPMs and MPMs using uniform and multiplicity correction model priors 82
4-1 Characterization of the full models MF and corresponding model spaces M considered in simulations 100

4-2 Mean number of false and true positives in 100 randomly generated datasets as the number of main effects increases from three to five predictors in a full quadratic model, under the equal probability prior (EPP), the hierarchical independence prior (HIP), the hierarchical order prior (HOP), and the hierarchical uniform prior (HUP) 102

4-3 Mean number of false and true positives in 100 randomly generated datasets as the maximum order of MF increases from two to four in a full model with three main effects, under the equal probability prior (EPP), the hierarchical independence prior (HIP), the hierarchical order prior (HOP), and the hierarchical uniform prior (HUP) 103

4-4 Mean number of false and true positives in 100 randomly generated datasets with unstructured or irregular model spaces, under the equal probability prior (EPP), the hierarchical independence prior (HIP), the hierarchical order prior (HOP), and the hierarchical uniform prior (HUP) 105

4-5 Variables used in the analyses of the ozone contamination dataset 112

4-6 Median probability models (MPM) from different combinations of parameter and model priors vs. model selected using the hierarchical lasso 113
C-1 Experimental conditions WFM simulations 124
D-1 Variables used in the analyses of the ozone contamination dataset 131
D-2 Marginal inclusion probabilities intrinsic prior 132
D-3 Marginal inclusion probabilities Zellner-Siow prior 132
D-4 Marginal inclusion probabilities Hyper-g11 132
D-5 Marginal inclusion probabilities Hyper-g21 132
LIST OF FIGURES
Figure page
2-1 Graphical representation occupancy model 25
2-2 Graphical representation occupancy model after data-augmentation 31
2-3 Graphical representation multiseason model for a single site 39
2-4 Graphical representation data-augmented multiseason model 39
3-1 Predictor MPIP averaged over scenarios with N=50 and N=100 sites, using uniform (U) and multiplicity correction (MC) priors 71

3-2 Predictor MPIP averaged over scenarios with J=3 and J=5 surveys per site, using uniform (U) and multiplicity correction (MC) priors 72

3-3 Predictor MPIP averaged over scenarios with the interaction between the number of sites and the surveys per site, using uniform (U) and multiplicity correction (MC) priors 72

3-4 Predictor MPIP averaged over scenarios with equal signal in the occupancy probabilities, using uniform (U) and multiplicity correction (MC) priors 73

3-5 Predictor MPIP averaged over scenarios with equal signal in the detection probabilities, using uniform (U) and multiplicity correction (MC) priors 73
4-1 Graphs of well-formulated polynomial models for p = 2 90
4-2 E(M) and C(M) in M defined by a quadratic surface in two main effects, for model M = {1, x1, x1^2} 91

4-3 Graphical representation of assumptions on M defined by the quadratic surface in two main effects 93

4-4 Prior probabilities for the space of well-formulated models associated with the quadratic surface on two variables, where MB is taken to be the intercept-only model and (a, b) in {(1, 1), (1, ch)} 97

4-5 Prior probabilities for the space of well-formulated models associated with three main effects and one interaction term, where MB is taken to be the intercept-only model and (a, b) in {(1, 1), (1, ch)} 98

4-6 MT: DAG of the largest true model used in simulations 109

4-7 Average true positives (TP) and average false positives (FP) in all simulated scenarios for the median probability model, with EPP and HOP(1, ch) 110

C-1 SNR vs. n: average model size, average true positives, and average false positives for all simulated scenarios, by model ranking according to model posterior probabilities 126
C-2 SNR vs. coefficient values: average model size, average true positives, and average false positives for all simulated scenarios, by model ranking according to model posterior probabilities 128

C-3 SNR vs. different true models MT: average model size, average true positives, and average false positives for all simulated scenarios, by model ranking according to model posterior probabilities 129
Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy
OBJECTIVE BAYESIAN METHODS FOR OCCUPANCY MODEL ESTIMATION ANDSELECTION
By
Daniel Taylor-Rodríguez
August 2014
Chair: Linda J. Young
Cochair: Nikolay Bliznyuk
Major: Interdisciplinary Ecology
The ecological literature contains numerous methods for conducting inference about the dynamics that govern biological populations. Among these methods, occupancy models have played a leading role during the past decade in the analysis of large biological population surveys. The flexibility of the occupancy framework has brought about useful extensions for determining key population parameters, which provide insights about the distribution, structure, and dynamics of a population. However, the methods used to fit the models and to conduct inference have gradually grown in complexity, leaving practitioners unable to fully understand their implicit assumptions and increasing the potential for misuse. This motivated our first contribution: we develop a flexible and straightforward estimation method for occupancy models that provides the means to directly incorporate temporal and spatial heterogeneity, using covariate information that characterizes habitat quality and the detectability of a species.

Adding to the issue mentioned above, studies of complex ecological systems now collect large amounts of information. To identify the drivers of these systems, robust techniques that account for test multiplicity and for the structure in the predictors are necessary but unavailable for ecological models. We develop tools to address this methodological gap. First, working in an "objective" Bayesian framework, we develop the first fully automatic and objective method for occupancy model selection, based on intrinsic parameter priors. Moreover, for the general variable selection problem, we propose three sets of prior structures on the model space that correct for multiple testing, and a stochastic search algorithm that relies on the priors on the model space to account for the polynomial structure in the predictors.
CHAPTER 1
GENERAL INTRODUCTION

As with any other branch of science, ecology strives to grasp truths about the world that surrounds us and, in particular, about nature. The objective truth sought by ecology may well be beyond our grasp; however, it is reasonable to think that, at least partially, "Nature is capable of being understood" (Dewey, 1958). We can observe and interpret nature to formulate hypotheses, which can then be tested against reality. Hypotheses that encounter no or little opposition when confronted with reality may become contextual versions of the truth, and may be generalized by scaling them spatially and/or temporally to delimit the bounds within which they are valid.

To formulate hypotheses accurately, and in a fashion amenable to scientific inquiry, not only must the point of view and the assumptions considered be made explicit, but also the object of interest, the properties of that object worthy of consideration, and the methods used in studying such properties (Reiners & Lockwood, 2009; Rigler & Peters, 1995). Ecology, as defined by Krebs (1972), is "the study of interactions that determine the distribution and abundance of organisms". This characterizes organisms and their interactions as the objects of interest to ecology, and prescribes distribution and abundance as relevant properties of these organisms.

With regard to the methods used to acquire ecological scientific knowledge, traditionally theoretical mathematical models (such as deterministic PDEs) have been used. However, naturally varying systems are imprecisely observed, and as such are subject to multiple sources of uncertainty that must be explicitly accounted for. Because of this, the ecological scientific community is developing a growing interest in flexible and powerful statistical methods, among which Bayesian hierarchical models predominate. These methods rely on empirical observations and can accommodate fairly complex relationships between empirical observations and theoretical process models, while accounting for diverse sources of uncertainty (Hooten, 2006).
Bayesian approaches are now used extensively in ecological modeling; however, there are two issues of concern, one from the standpoint of ecological practitioners and another from the perspective of scientific ecological endeavors. First, Bayesian modeling tools require a considerable understanding of probability and statistical theory, leading practitioners to view them as black-box approaches (Kéry, 2010). Second, although Bayesian applications proliferate in the literature, in general there is a lack of awareness of the distinction between approaches specifically devised for testing and those for estimation (Ellison, 2004). Furthermore, there is a dangerous unfamiliarity with the proven risks of using tools designed for estimation in testing procedures, e.g., the use of flat priors in hypothesis testing (Berger & Pericchi, 1996; Berger et al., 2001; Kass & Raftery, 1995; Moreno et al., 1998; Robert et al., 2009; Robert, 1993).
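That risk can be made concrete with a small numerical sketch of the Jeffreys-Lindley phenomenon cited above: in a normal-mean test, flattening the prior under the alternative drives the Bayes factor toward the null for the same data. The function below (all numbers hypothetical, not taken from this work) computes BF01 for H0: mu = 0 against H1: mu ~ N(0, tau^2) in closed form.

```python
import math

def bf01_normal_mean(ybar, n, sigma=1.0, tau=1.0):
    """Bayes factor BF01 for H0: mu = 0 vs H1: mu ~ N(0, tau^2),
    given y_1..y_n ~ N(mu, sigma^2) with sample mean ybar.
    Under H1 the marginal of ybar is N(0, tau^2 + sigma^2/n)."""
    s2 = sigma ** 2 / n                       # variance of ybar under H0
    m0 = math.exp(-ybar ** 2 / (2 * s2)) / math.sqrt(2 * math.pi * s2)
    v1 = tau ** 2 + s2                        # marginal variance under H1
    m1 = math.exp(-ybar ** 2 / (2 * v1)) / math.sqrt(2 * math.pi * v1)
    return m0 / m1

# Data mildly at odds with H0 (sample mean about two standard errors from
# zero), yet BF01 increases without bound as the prior is made flatter.
for tau in (1.0, 10.0, 100.0):
    print(tau, bf01_normal_mean(ybar=0.2, n=100, tau=tau))
```

With the sample mean held fixed, BF01 grows roughly in proportion to tau, so an arbitrarily "noninformative" prior on the alternative manufactures arbitrarily strong support for the null.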
Occupancy models have played a leading role during the past decade in large biological population surveys. The flexibility of the occupancy framework has allowed the development of useful extensions to determine several key population parameters, which provide robust notions of the distribution, structure, and dynamics of a population. In order to address some of the concerns stated in the previous paragraph, we concentrate on the occupancy framework to develop estimation and testing tools that will allow ecologists, first, to gain insight about the estimation procedure and, second, to conduct statistically sound model selection for site-occupancy data.
1.1 Occupancy Modeling

Since MacKenzie et al. (2002) and Tyre et al. (2003) introduced the site-occupancy framework, countless applications and extensions of the method have been developed in the ecological literature, as evidenced by the 438,000 hits on Google Scholar for a search of "occupancy model". This class of models acknowledges that techniques used to conduct biological population surveys are prone to detection errors: if an individual is detected, it must be present, while if it is not detected, it might or might not be. Occupancy models improve upon traditional binary regression by accounting for observed detection and partially observed presence as two separate but related components. In the site-occupancy setting, the chosen locations are surveyed repeatedly in order to reduce the ambiguity caused by the observed zeros. This approach therefore allows probabilities of both presence (occurrence) and detection to be estimated.
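As a concrete sketch of this sampling design (site count, survey count, and probability values are all hypothetical), the following simulates latent presence and imperfect detection, and shows why the naive summary "detected at least once" understates occupancy:

```python
import numpy as np

rng = np.random.default_rng(42)

N, J = 100, 5        # sites and repeat surveys per site (illustrative values)
psi, p = 0.6, 0.4    # occupancy and per-survey detection probabilities

# Latent presence indicator z_i ~ Bernoulli(psi) at each site.
z = rng.binomial(1, psi, size=N)

# Detections y_ij ~ Bernoulli(p * z_i): detection is only possible where z_i = 1.
y = rng.binomial(1, p * z[:, None], size=(N, J))

# The naive occupancy estimate counts sites with at least one detection;
# since detections occur only at occupied sites, it is biased low.
naive = (y.sum(axis=1) > 0).mean()
print(naive, z.mean())
```

Because a detection implies presence, the naive estimate can never exceed the true occupied fraction; repeated surveys shrink the gap but do not remove it, which is exactly the ambiguity the occupancy model resolves by estimating psi and p jointly.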
The uses of site-occupancy models are many. For example, metapopulation and island biogeography models are often parameterized in terms of site (or patch) occupancy (Hanski 1992, 1994, 1997, as cited in MacKenzie et al. (2003)), and occupancy may be used as a surrogate for abundance to answer questions regarding geographic distribution, range size, and metapopulation dynamics (MacKenzie et al., 2004; Royle & Kéry, 2007).
The basic occupancy framework, which assumes a single closed population with fixed probabilities through time, has proven to be quite useful; however, it might be of limited utility when addressing some problems. In particular, the assumptions of the basic model may become too restrictive or unrealistic whenever the study period extends throughout multiple years or seasons, especially given the increasingly changing environmental conditions that most ecosystems are currently experiencing.

Among the extensions found in the literature, one that we consider particularly relevant incorporates heterogeneous occupancy probabilities through time. Models that incorporate temporally varying probabilities stem from important metapopulation notions provided by Hanski (1994), such as occupancy probabilities depending on local colonization and local extinction processes. In spite of the conceptual usefulness of Hanski's model, several strong and untenable assumptions (e.g., all patches being homogeneous in quality) are required for it to provide practically meaningful results.

A more viable alternative, which builds on Hanski (1994), is the extension of the single-season occupancy model by MacKenzie et al. (2003). In this model, the heterogeneity of occupancy probabilities across seasons arises from local colonization and extinction processes. The model is flexible enough to let detection, occurrence, extinction, and colonization probabilities each depend upon its own set of covariates. Model parameters are obtained through likelihood-based estimation.
Using a maximum likelihood approach presents two drawbacks. First, the uncertainty assessment for maximum likelihood parameter estimates relies on asymptotic results, obtained through the delta method, making it sensitive to sample size. Second, to obtain parameter estimates, the latent process (occupancy) is marginalized out of the likelihood, leading to the usual zero-inflated Bernoulli model. Although this is a convenient strategy for solving the estimation problem, after integrating out the latent state variables (occupancy indicators) they are no longer available; therefore, finite sample estimates cannot be calculated directly, and a supplementary parametric bootstrapping step is necessary. Further, additional structure, such as temporal or spatial variation, cannot be introduced by means of random effects (Royle & Kéry, 2007).
1.2 A Primer on Objective Bayesian Testing
With the advent of high-dimensional data, such as that found in modern problems in ecology, genetics, physics, etc., coupled with evolving computing capability, objective Bayesian inferential methods have gained increasing popularity. This, however, is by no means a new approach to the way Bayesian inference is conducted. In fact, starting with Bayes and Laplace and continuing for almost 200 years, Bayesian analysis was primarily based on "noninformative" priors (Berger & Bernardo 1992).

Subjective elicitation of prior probabilities in Bayesian analysis is now widely recognized as the ideal (Berger et al. 2001); however, it is often the case that the available information is insufficient to specify appropriate prior probabilistic statements. Commonly, as in model selection problems where large model spaces have to be explored, the number of model parameters is prohibitively large, preventing one from eliciting prior information for the entire parameter space. As a consequence, in practice,
the determination of priors through the definition of structural rules has become the alternative to subjective elicitation for a variety of problems in Bayesian testing. Priors arising from these rules are known in the literature as noninformative, objective, default, or reference priors. Many of these labels generate controversy and are accused, perhaps rightly, of conveying a false pretension of objectivity. Nevertheless, we will avoid that discussion and refer to them herein interchangeably as noninformative or objective priors, to convey the sense that no attempt to introduce an informed opinion is made in defining prior probabilities.
A plethora of "noninformative" methods has been developed in the past few decades (see Berger & Bernardo (1992); Berger & Pericchi (1996); Berger et al. (2001); Clyde & George (2004); Kass & Wasserman (1995, 1996); Liang et al. (2008); Moreno et al. (1998); Spiegelhalter & Smith (1982); Wasserman (2000); and the references therein). We find particularly interesting those derived from the model structure, in which no tuning parameters are required, especially since these can be regarded as automatic methods. Among them, methods based on the Bayes factor for intrinsic priors have proven their worth in a variety of inferential problems, given their excellent performance, flexibility, and ease of use. This class of priors is discussed in detail in Chapter 3. For now, some basic notation and notions of Bayesian inferential procedures are introduced.
Hypothesis testing and the Bayes factor
Bayesian model selection techniques that aim to find the true model, as opposed to searching for the model that best predicts the data, are fundamentally extensions of Bayesian hypothesis testing strategies. In general, this Bayesian approach to hypothesis testing and model selection relies on determining the amount of evidence found in favor of one hypothesis (or model) over the other, given an observed set of data. Approached from a Bayesian standpoint, this type of problem can be formulated in great generality using a natural, well-defined probabilistic framework that incorporates both model and parameter uncertainty.
Jeffreys (1935) first developed the Bayesian strategy for hypothesis testing and, consequently, for the model selection problem. Bayesian model selection within a model space M = {M_1, M_2, ..., M_J}, where each model M_j is associated with a parameter θ_j (which may itself be a vector of parameters), incorporates three types of probability distributions: (1) a prior probability distribution for each model, π(M_j); (2) a prior probability distribution for the parameters in each model, π(θ_j | M_j); and (3) the distribution of the data conditional on both the model and the model's parameters, f(x | θ_j, M_j). These three probability densities induce the joint distribution p(x, θ_j, M_j) = f(x | θ_j, M_j) · π(θ_j | M_j) · π(M_j), which is instrumental in producing model posterior probabilities. The model posterior probability is the probability that a model is true given the data. It is obtained by marginalizing over the parameter space and using Bayes' rule:

p(M_j | x) = m(x | M_j) π(M_j) / ∑_{i=1}^{J} m(x | M_i) π(M_i),    (1–1)

where m(x | M_j) = ∫ f(x | θ_j, M_j) π(θ_j | M_j) dθ_j is the marginal likelihood of M_j.
Given that interest lies in comparing different models, evidence in favor of one or another model is assessed with pairwise comparisons using posterior odds:

p(M_j | x) / p(M_k | x) = [m(x | M_j) / m(x | M_k)] · [π(M_j) / π(M_k)].    (1–2)

The first term on the right-hand side of (1–2), m(x | M_j) / m(x | M_k), is known as the Bayes factor comparing model M_j to model M_k, and is denoted by BF_jk(x). The Bayes factor provides a measure of the evidence in favor of either model given the data, and updates the model prior odds, π(M_j) / π(M_k), to produce the posterior odds.
Note that the model posterior probability in (1–1) can be expressed as a function of Bayes factors. To illustrate, let M_* ∈ M be a reference model against which all other models in M are compared. Dividing both the numerator and denominator in (1–1) by m(x | M_*) π(M_*) yields

p(M_j | x) = BF_j*(x) [π(M_j) / π(M_*)] / (1 + ∑_{M_i ∈ M, M_i ≠ M_*} BF_i*(x) [π(M_i) / π(M_*)]).    (1–3)
Therefore, as the Bayes factor increases, the posterior probability of model M_j given the data increases. If all models have equal prior probabilities, a straightforward criterion to select the best among all candidate models is to choose the model with the largest Bayes factor. As such, the Bayes factor is not only useful for identifying models favored by the data, but it also provides a means to rank models in terms of their posterior probabilities.

Assuming equal model prior probabilities in (1–3), the prior odds are set equal to one, and the model posterior odds in (1–2) become p(M_j | x)/p(M_k | x) = BF_jk(x). Based on the Bayes factor, the evidence in favor of one or another model can be interpreted using Table 1-1, adapted from Kass & Raftery (1995).
Table 1-1. Interpretation of BF_jk when contrasting M_j and M_k

ln BF_jk    BF_jk        Evidence in favor of M_j    P(M_j | x)
0 to 2      1 to 3       Weak evidence               0.5–0.75
2 to 6      3 to 20      Positive evidence           0.75–0.95
6 to 10     20 to 150    Strong evidence             0.95–0.99
>10         >150         Very strong evidence        >0.99
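The mapping from marginal likelihoods to posterior model probabilities in (1–1)–(1–3) and the evidence scale in Table 1-1 are straightforward to operationalize. The sketch below (function names are our own, for illustration) computes posterior model probabilities from log marginal likelihoods under equal prior probabilities, using the usual max-subtraction for numerical stability:

```python
import math

def posterior_model_probs(log_marglik, prior=None):
    """Posterior model probabilities via (1-1), computed from log marginal
    likelihoods; equal prior probabilities are assumed when prior is None."""
    J = len(log_marglik)
    prior = [1.0 / J] * J if prior is None else prior
    m = max(log_marglik)  # subtract the max to avoid overflow in exp()
    w = [math.exp(l - m) * pj for l, pj in zip(log_marglik, prior)]
    s = sum(w)
    return [wi / s for wi in w]

def evidence_category(ln_bf):
    """Kass & Raftery interpretation of ln BF_jk (Table 1-1)."""
    if ln_bf <= 2:
        return "weak"
    if ln_bf <= 6:
        return "positive"
    if ln_bf <= 10:
        return "strong"
    return "very strong"

# Two models with equal priors: ln BF_12 = 3 gives positive evidence for M_1,
# and p(M_1 | x) = e^3 / (1 + e^3).
probs = posterior_model_probs([-10.0, -13.0])
```

Note that with equal priors the posterior probability of M_1 reduces to BF_12/(1 + BF_12), which is exactly the quantity tabulated in the last column of Table 1-1.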
Bayesian hypothesis testing and model selection procedures based on Bayes factors and posterior probabilities have several desirable features. First, these methods have a straightforward interpretation, since the Bayes factor is an increasing function of model (or hypothesis) posterior probabilities. Second, these methods can yield frequentist-matching confidence bounds when implemented with good testing priors (Kass & Wasserman 1996), such as the reference priors of Berger & Bernardo (1992). Third, since the Bayes factor contains the ratio of marginal densities, it automatically penalizes complexity according to the number of parameters in each model; this property is known as Ockham's razor (Kass & Raftery 1995). Fourth, the use of Bayes factors does not require nested hypotheses (i.e., having the null hypothesis nested in the alternative), standard distributions, or regular asymptotics (e.g., convergence to normal or chi-squared distributions) (Berger et al. 2001). In contrast, this is not always the case with frequentist and likelihood ratio tests, which depend on known distributions (at least asymptotically) for the test statistic. Finally, Bayesian hypothesis testing procedures using the Bayes factor can naturally incorporate model uncertainty, using the Bayesian machinery to produce model-averaged predictions and confidence bounds (Kass & Raftery 1995); it is not clear how to account for this uncertainty rigorously in a fully frequentist approach.
1.3 Overview of the Chapters
In the chapters that follow, we develop a flexible and straightforward hierarchical Bayesian framework for occupancy models, allowing us to obtain estimates and conduct robust testing from an "objective" Bayesian perspective. Latent mixtures of random variables supply the foundation for our methodology. This approach provides a means to directly incorporate spatial dependency and temporal heterogeneity through predictors that characterize either the habitat quality of a given site or the detectability features of a particular survey conducted at a specific site. In turn, the Bayesian testing methods we propose are (1) a fully automatic and objective method for occupancy model selection, and (2) an objective Bayesian testing tool that accounts for multiple testing and for polynomial hierarchical structure in the space of predictors.
Chapter 2 introduces the methods proposed for estimation of occupancy model parameters. A simple estimation procedure for the single-season occupancy model with covariates is formulated using both probit and logit links. Building on this simple version, an extension is provided to cope with metapopulation dynamics by introducing persistence and colonization processes. Finally, given the fundamental role that spatial dependence plays in defining temporal dynamics, a strategy to seamlessly account for this feature in our framework is introduced.
Chapter 3 develops a new, fully automatic and objective method for occupancy model selection that is asymptotically consistent for variable selection and averts the use of tuning parameters. In this chapter, some issues surrounding multimodel inference are first described, and insight about objective Bayesian inferential procedures is provided. Then, building on modern methods for "objective" Bayesian testing to generate priors on the parameter space, the intrinsic priors for the parameters of the occupancy model are obtained. These are used in the construction of an algorithm for "objective" variable selection tailored to the occupancy model framework.

Chapter 4 touches on two important and interconnected issues in model testing that have yet to receive the attention they deserve: (1) controlling for false discovery in hypothesis testing given the size of the model space, i.e., given the number of tests performed; and (2) non-invariance to location transformations of variable selection procedures in the presence of polynomial predictor structure. These elements both depend on the definition of prior probabilities on the model space. In this chapter, a set of priors on the model space and a stochastic search algorithm are proposed; together, these control for model multiplicity and account for the polynomial structure among the predictors.
CHAPTER 2
MODEL ESTIMATION METHODS
"Data! Data! Data!" he cried impatiently. "I can't make bricks without clay."

–Sherlock Holmes, The Adventure of the Copper Beeches
2.1 Introduction
Prior to the introduction of site-occupancy models (MacKenzie et al. 2002; Tyre et al. 2003), presence-absence data from ecological monitoring programs were used without any adjustment to assess the impact of management actions, to observe trends in species distribution through space and time, or to model the habitat of a species (Tyre et al. 2003). These efforts, however, were suspect because false-negative errors were not accounted for. False-negative errors occur whenever a species is present at a site but goes undetected during the survey.
Site-occupancy models, developed independently by MacKenzie et al. (2002) and Tyre et al. (2003), extend simple binary-regression models to account for the aforementioned errors in detection of individuals, common in surveys of animal or plant populations. Since their introduction, the site-occupancy framework has been used in countless applications, and numerous extensions for it have been proposed. Occupancy models improve upon traditional binary regression by analyzing observed detection and partially observed presence as two separate but related components. In the site-occupancy setting, the chosen locations are surveyed repeatedly in order to reduce the ambiguity caused by the observed zeros. This approach therefore allows simultaneous estimation of the probabilities of presence (occurrence) and detection.
Several extensions to the basic single-season, closed-population model are now available. The occupancy approach has been used to determine species range dynamics (MacKenzie et al. 2003; Royle & Kery 2007), to understand age/stage structure within populations (Nichols et al. 2007), and to model species co-occurrence (MacKenzie et al. 2004; Ovaskainen et al. 2010; Waddle et al. 2010). It has even been suggested as a surrogate for abundance (MacKenzie & Nichols 2004). MacKenzie et al. suggested using occupancy models to conduct large-scale monitoring programs, since this approach avoids the high costs associated with surveys designed for abundance estimation. Also, when investigating metapopulation dynamics, occupancy models improve upon incidence function models (Hanski 1994), which are often parameterized in terms of site (or patch) occupancy and assume homogeneous patches and a metapopulation at colonization-extinction equilibrium.
Nevertheless, the implementation of Bayesian occupancy models commonly resorts to sampling strategies dependent on hyper-parameters, subjective prior elicitation, and relatively elaborate algorithms. From the standpoint of practitioners, these are often treated as black-box methods (Kery 2010); as such, the potential for using the methodology incorrectly is high. Commonly, these procedures are fitted with packages such as BUGS or JAGS. Although these packages' ease of use has led to widespread adoption of the methods, the user may be oblivious to the assumptions underpinning the analysis.
We believe providing straightforward and robust alternatives for implementing these methods will help practitioners gain insight into how occupancy modeling, and more generally Bayesian modeling, is performed. In this chapter, using a simple Gibbs sampling approach, we first develop a versatile method to estimate the single-season, closed-population site-occupancy model, then extend it to analyze metapopulation dynamics through time, and finally provide a further adaptation to incorporate spatial dependence among neighboring sites.

2.1.1 The Occupancy Model
In this section, we first introduce our results published in Dorazio & Taylor-Rodríguez (2012) and build upon them to propose relevant extensions. Under the standard sampling protocol for collecting site-occupancy data, J > 1 independent surveys are conducted at each of N representative sample locations (sites), noting whether a species is detected or not detected during each survey. Let y_ij denote a binary random variable that indicates detection (y_ij = 1) or non-detection (y_ij = 0) during the jth survey of site i. Without loss of generality, J may be assumed constant among all N sites to simplify description of the model; in practice, however, site-specific variation in J poses no real difficulties and is easily implemented. This sampling protocol therefore yields an N × J matrix Y of detection/non-detection data.
Note that the observed process y_ij is an imperfect representation of the underlying occupancy or presence process. Hence, letting z_i denote the presence indicator at site i, this model specification can be represented through the hierarchy

y_ij | z_i, λ ~ Bernoulli(z_i p_ij)
z_i | α ~ Bernoulli(ψ_i),    (2–1)

where p_ij is the probability of correctly classifying the ith site as occupied during the jth survey, and ψ_i is the presence probability at the ith site. The graphical representation of this process is given in Figure 2-1.
process is
ψi
zi
yi
pi
Figure 2-1 Graphical representation occupancy model
Probabilities of detection and occupancy can both be made functions of covariates, and their corresponding parameter estimates can be obtained using either a maximum likelihood or a Bayesian approach. Existing methodologies from the likelihood perspective marginalize over the latent occupancy process (z_i), making the estimation procedure depend only on the detections. Most Bayesian strategies rely on MCMC algorithms that require parameter prior specification and tuning. However, Albert & Chib (1993) proposed a strategy, now longstanding in the Bayesian statistical literature, that models binary outcomes using a simple Gibbs sampler. This procedure, described in the following section, can be extrapolated to the occupancy setting, eliminating the need for tuning parameters and subjective prior elicitation.

2.1.2 Data Augmentation Algorithms for Binary Models
Probit model: data augmentation with latent normal variables
At the root of Albert & Chib's algorithm lies the idea that if the observed outcome is 0, the latent variable can be simulated from a truncated normal distribution with support (−∞, 0], and if the outcome is 1, the latent variable can be simulated from a truncated normal distribution on (0, ∞). To understand the reasoning behind this strategy, let Y ~ Bern(Φ(x'β)) and V = x'β + ε, with ε ~ N(0, 1). In such a case, note that

Pr(y = 1 | x'β) = Φ(x'β) = Pr(ε < x'β) = Pr(ε > −x'β) = Pr(v > 0 | x'β).

Thus, whenever y = 1 we have v > 0, and v ≤ 0 otherwise; in other words, we may think of y as a truncated version of v. We can therefore sample iteratively, alternating between the latent variables conditioned on the model parameters and vice versa, to draw from the desired posterior densities. By augmenting the data with the latent variables, we obtain full conditional posterior distributions for the model parameters that are easy to draw from (Equation 2–3 below). Further, once we can sample the latent variables, we may also sample the parameters.
Given some initial values for all model parameters, values for the latent variables can be simulated. Conditioning on the latter, it is then possible to draw samples from the parameters' posterior distributions; these samples can be used to generate new values for the latent variables, and so on. The process is iterated using a Gibbs sampling approach. Generally, after a large number of iterations, it yields draws from the joint posterior distribution of the latent variables and the model parameters, conditional on the observed outcome values. We formalize the procedure below.
Assume that each outcome Y_1, Y_2, ..., Y_n is such that Y_i | x_i, β ~ Bernoulli(q_i), where q_i = Φ(x'_i β) is the standard normal CDF evaluated at x'_i β, and x_i and β are the p-dimensional vectors of observed covariates for the ith observation and their corresponding parameters, respectively.

Now let y = {y_1, y_2, ..., y_n} be the vector of observed outcomes, and let [β] represent the prior distribution of the model parameters. The posterior distribution of β is then given by

[β | y] ∝ [β] ∏_{i=1}^n Φ(x'_i β)^{y_i} (1 − Φ(x'_i β))^{1−y_i},    (2–2)
which is intractable. Nevertheless, introducing latent random variables V = (V_1, ..., V_n), such that V_i ~ N(x'_i β, 1), resolves this difficulty by specifying that whenever Y_i = 1 then V_i > 0, and whenever Y_i = 0 then V_i ≤ 0. This yields

[β, v | y] ∝ [β] ∏_{i=1}^n ϕ(v_i | x'_i β, 1) { I(v_i ≤ 0) I(y_i = 0) + I(v_i > 0) I(y_i = 1) },    (2–3)

where ϕ(x | μ, τ²) is the probability density function of a normal random variable x with mean μ and variance τ². The data augmentation artifact works because [β | y] = ∫ [β, v | y] dv; hence, if we sample from the joint posterior (2–3) and extract only the sampled values for β, they correspond to samples from [β | y].
From the expression above, it is possible to obtain the full conditional distributions for V and β, so a Gibbs sampler can be proposed. For example, if we use a flat prior for β (i.e., [β] ∝ 1), the full conditionals are given by

β | V, y ~ MVN_p((X'X)^{-1}(X'V), (X'X)^{-1})    (2–4)

V | β, y ~ ∏_{i=1}^n tr N(x'_i β, 1, Q_i),    (2–5)

where MVN_q(μ, Σ) represents a multivariate normal distribution with mean vector μ and variance-covariance matrix Σ, and tr N(ξ, σ², Q) stands for the truncated normal distribution with mean ξ, variance σ², and truncation region Q. For each i = 1, 2, ..., n, the support of the truncated variables is given by Q_i = (−∞, 0] if y_i = 0 and Q_i = (0, ∞) otherwise. Note that conjugate normal priors could be used alternatively.
At iteration m + 1, the Gibbs sampler draws V^(m+1) conditional on β^(m) from (2–5), and then samples β^(m+1) conditional on V^(m+1) from (2–4). This process is repeated for m = 0, 1, ..., nsim, where nsim is the number of iterations of the Gibbs sampler.
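As a concrete illustration, the two-step sampler above can be sketched in a few lines of Python. This is only a sketch under the flat-prior assumption of (2–4)–(2–5), not the dissertation's exact implementation; the function name and simulated data are our own:

```python
import numpy as np
from scipy.stats import norm, truncnorm

def probit_gibbs(y, X, n_iter=1000, rng=None):
    """Albert & Chib (1993) Gibbs sampler for probit regression with a
    flat prior on beta, alternating the draws in (2-4) and (2-5)."""
    rng = np.random.default_rng() if rng is None else rng
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = np.zeros(p)
    draws = np.empty((n_iter, p))
    for m in range(n_iter):
        mu = X @ beta
        # (2-5): v_i ~ N(mu_i, 1) truncated to (0, inf) if y_i = 1,
        # and to (-inf, 0] if y_i = 0 (bounds standardized for truncnorm).
        lo = np.where(y == 1, -mu, -np.inf)
        hi = np.where(y == 1, np.inf, -mu)
        v = mu + truncnorm.rvs(lo, hi, random_state=rng)
        # (2-4): beta | v ~ MVN((X'X)^-1 X'v, (X'X)^-1)
        beta = rng.multivariate_normal(XtX_inv @ (X.T @ v), XtX_inv)
        draws[m] = beta
    return draws

# Simulated check: recover beta on synthetic probit data.
rng = np.random.default_rng(7)
X = np.column_stack([np.ones(400), rng.normal(size=400)])
beta_true = np.array([1.0, -1.0])
y = (rng.random(400) < norm.cdf(X @ beta_true)).astype(float)
draws = probit_gibbs(y, X, n_iter=400, rng=rng)
post_mean = draws[100:].mean(axis=0)
```

With a flat prior, the posterior mean after burn-in should sit close to the maximum likelihood estimate, which is itself close to the generating coefficients for moderate sample sizes.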
Logit model: data augmentation with latent Polya-gamma variables
Recently, Polson et al. (2013) developed a novel and efficient approach to Bayesian inference for logistic models using Polya-gamma latent variables, analogous to the Albert & Chib algorithm. The result arises from what the authors refer to as the Polya-gamma distribution. To construct a random variable from this family, consider the infinite mixture of an iid sequence of Exp(1) random variables {E_k}_{k=1}^∞ given by

ω = (2/π²) ∑_{k=1}^∞ E_k / (2k − 1)²,

with probability density function

g(ω) = ∑_{k=0}^∞ (−1)^k [(2k + 1)/√(2πω³)] e^{−(2k+1)²/(8ω)} I(ω ∈ (0, ∞))    (2–6)

and Laplace transform E[e^{−tω}] = cosh^{−1}(√(t/2)).
The Polya-gamma family of densities is obtained through an exponential tilting of the density g in (2–6). These densities, indexed by c ≥ 0, are characterized by

f(ω | c) = cosh(c/2) e^{−c²ω/2} g(ω).
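For intuition, a PG(1, c) draw can be approximated by truncating the weighted sum-of-exponentials representation given in Polson et al. (2013). The sketch below is illustrative only (the function name is our own, and exact samplers should be used in practice); it checks the draws against the known mean E[ω] = tanh(c/2)/(2c):

```python
import numpy as np

def rpg1_approx(c, trunc=200, rng=None):
    """Approximate draw from PG(1, c) by truncating the representation
        omega = (1 / (2 pi^2)) * sum_k g_k / ((k - 1/2)^2 + c^2 / (4 pi^2)),
    with g_k ~ Exp(1) (Polson et al. 2013)."""
    rng = np.random.default_rng() if rng is None else rng
    k = np.arange(1, trunc + 1)
    g = rng.exponential(1.0, size=trunc)
    return np.sum(g / ((k - 0.5) ** 2 + c ** 2 / (4 * np.pi ** 2))) / (2 * np.pi ** 2)

# Monte Carlo check against the known mean of PG(1, c).
rng = np.random.default_rng(1)
c = 1.5
draws = np.array([rpg1_approx(c, rng=rng) for _ in range(20000)])
target = np.tanh(c / 2.0) / (2.0 * c)
```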
The likelihood for the binomial logistic model can be expressed in terms of latent Polya-gamma variables as follows. Assume y_i ~ Bernoulli(δ_i), with predictors x'_i = (x_i1, ..., x_ip) and success probability δ_i = e^{x'_i β}/(1 + e^{x'_i β}). The posterior for the model parameters can then be represented as

[β | y] = [β] ∏_{i=1}^n δ_i^{y_i} (1 − δ_i)^{1−y_i} / c(y),

where c(y) is the normalizing constant.

To facilitate the sampling procedure, a data augmentation step can be performed by introducing Polya-gamma random variables ω_i ~ PG(1, x'_i β). This yields the data-augmented posterior

[β, ω | y] = (∏_{i=1}^n Pr(y_i = 1 | β)) f(ω | x'β) [β] / c(y),    (2–7)

such that [β | y] = ∫_{R+} [β, ω | y] dω.
Thus, from the augmented model, the full conditional density for β is given by

[β | ω, y] ∝ (∏_{i=1}^n Pr(y_i = 1 | β)) f(ω | x'β) [β]
= ∏_{i=1}^n [(e^{x'_i β})^{y_i} / (1 + e^{x'_i β})] ∏_{i=1}^n cosh(|x'_i β|/2) exp[−(x'_i β)² ω_i / 2] g(ω_i).    (2–8)

This expression yields a normal posterior distribution if β is assigned a flat or normal prior. Hence, a two-step sampling strategy analogous to that of Albert & Chib (1993) can be used to estimate β in the occupancy framework.

2.2 Single Season Occupancy
Let p_ij = F(q'_ij λ) be the probability of correctly classifying the ith site as occupied during the jth survey, conditional on the site being occupied, and let ψ_i = F(x'_i α) correspond to the presence probability at the ith site. Further, let F^{−1}(·) denote a link function (i.e., probit or logit) connecting the response to the predictors, and denote by λ and α, respectively, the r-variate and p-variate coefficient vectors for the detection and presence probabilities. Then the joint posterior for the presence indicators and the model parameters is

π*(z, α, λ | y) ∝ π_α(α) π_λ(λ) ∏_{i=1}^N F(x'_i α)^{z_i} (1 − F(x'_i α))^{1−z_i} × ∏_{j=1}^J (z_i F(q'_ij λ))^{y_ij} (1 − z_i F(q'_ij λ))^{1−y_ij}.    (2–9)
As in the simple probit regression problem, this posterior is intractable, so sampling from it directly is not possible. However, the procedures of Albert & Chib for the probit model and of Polson et al. for the logit model can be extended to generate an MCMC sampling strategy for the occupancy problem. In what follows, we make use of this framework to develop samplers with which occupancy parameter estimates can be obtained for both probit and logit link functions. These algorithms have the added benefit that they require neither tuning parameters nor subjective elicitation of parameter priors.

2.2.1 Probit Link Model
To extend Albert & Chib's algorithm to the occupancy framework with a probit link, we first introduce two sets of latent normal variables, denoted by w_ij and v_i, used to augment the data. The corresponding hierarchy is

y_ij | z_i, w_ij ~ Bernoulli(z_i I(w_ij > 0))
w_ij | λ ~ N(q'_ij λ, 1)
λ ~ [λ]
z_i | v_i = I(v_i > 0)
v_i | α ~ N(x'_i α, 1)
α ~ [α],    (2–10)
represented by the directed graph in Figure 2-2.

Figure 2-2. Graphical representation of the occupancy model after data augmentation: α → v_i → z_i → y_i ← w_i ← λ.
Under this hierarchical model, the joint density is given by

π*(z, v, α, w, λ) ∝ C_y π_α(α) π_λ(λ) ∏_{i=1}^N ϕ(v_i; x'_i α, 1) I(v_i > 0)^{z_i} I(v_i ≤ 0)^{1−z_i} × ∏_{j=1}^J (z_i I(w_ij > 0))^{y_ij} (1 − z_i I(w_ij > 0))^{1−y_ij} ϕ(w_ij; q'_ij λ, 1).    (2–11)
The full conditional densities derived from the posterior in Equation 2–11 are detailed below.

1. The full conditional for z is obtained after integrating out v and w:

f(z | α, λ, y) = ∏_{i=1}^N f(z_i | α, λ, y_i) = ∏_{i=1}^N (ψ*_i)^{z_i} (1 − ψ*_i)^{1−z_i},

where ψ*_i = ψ_i ∏_{j=1}^J p_ij^{y_ij} (1 − p_ij)^{1−y_ij} / [ψ_i ∏_{j=1}^J p_ij^{y_ij} (1 − p_ij)^{1−y_ij} + (1 − ψ_i) ∏_{j=1}^J I(y_ij = 0)].    (2–12)

2. f(v | z, α) = ∏_{i=1}^N f(v_i | z_i, α) = ∏_{i=1}^N tr N(x'_i α, 1, A_i),

where A_i = (−∞, 0] if z_i = 0, and A_i = (0, ∞) if z_i = 1,    (2–13)

and tr N(μ, σ², A) denotes the pdf of a truncated normal random variable with mean μ, variance σ², and truncation region A.

3. f(α | v) = ϕ_p(α; Σ_α X'v, Σ_α),    (2–14)

where Σ_α = (X'X)^{−1} and ϕ_k(x; μ, Σ) represents the k-variate normal density with mean vector μ and variance matrix Σ.

4. f(w | y, z, λ) = ∏_{i=1}^N ∏_{j=1}^J f(w_ij | y_ij, z_i, λ) = ∏_{i=1}^N ∏_{j=1}^J tr N(q'_ij λ, 1, B_ij),

where B_ij = (−∞, ∞) if z_i = 0; B_ij = (−∞, 0] if z_i = 1 and y_ij = 0; and B_ij = (0, ∞) if z_i = 1 and y_ij = 1.    (2–15)

5. f(λ | w) = ϕ_r(λ; Σ_λ Q'w, Σ_λ),    (2–16)

where Σ_λ = (Q'Q)^{−1}.
The Gibbs sampling algorithm for the model can then be summarized as follows:

1. Initialize z, α, v, λ, and w.
2. Sample z_i ~ Bern(ψ*_i).
3. Sample v_i from a truncated normal with μ = x'_i α and σ = 1, with truncation region depending on z_i.
4. Sample α ~ N(Σ_α X'v, Σ_α), with Σ_α = (X'X)^{−1}.
5. Sample w_ij from a truncated normal with μ = q'_ij λ and σ = 1, with truncation region depending on y_ij and z_i.
6. Sample λ ~ N(Σ_λ Q'w, Σ_λ), with Σ_λ = (Q'Q)^{−1}.
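The steps above can be sketched directly in Python. This is a hypothetical implementation assuming flat priors on α and λ, written for clarity rather than speed; the function name and simulated data are our own:

```python
import numpy as np
from scipy.stats import norm, truncnorm

def tnorm(mu, lower, upper, rng):
    # One draw from N(mu, 1) truncated to (lower, upper).
    return truncnorm.rvs(lower - mu, upper - mu, loc=mu, random_state=rng)

def occupancy_probit_gibbs(Y, X, Q, n_iter=100, rng=None):
    """Gibbs sampler for the single-season probit occupancy model,
    following steps 1-6 above.  Y: N x J detections; X: N x p occupancy
    covariates; Q: N x J x r detection covariates."""
    rng = np.random.default_rng() if rng is None else rng
    N, J = Y.shape
    Sa = np.linalg.inv(X.T @ X)                    # Sigma_alpha = (X'X)^-1
    Qf = Q.reshape(N * J, Q.shape[2])
    Sl = np.linalg.inv(Qf.T @ Qf)                  # Sigma_lambda = (Q'Q)^-1
    alpha, lam = np.zeros(X.shape[1]), np.zeros(Q.shape[2])
    detected = Y.sum(axis=1) > 0                   # sites with >= 1 detection
    draws_a, draws_l = [], []
    for _ in range(n_iter):
        psi = norm.cdf(X @ alpha)
        p = norm.cdf(np.einsum('ijr,r->ij', Q, lam))
        # Step 2: z_i ~ Bern(psi*_i); psi*_i = 1 for detected sites (2-12).
        num = psi * np.prod(1.0 - p, axis=1)
        psi_star = num / (num + (1.0 - psi))
        z = np.where(detected, 1.0, rng.random(N) < psi_star)
        # Step 3: latent v_i, truncated according to z_i (2-13).
        v = np.array([tnorm(X[i] @ alpha, 0.0, np.inf, rng) if z[i] == 1
                      else tnorm(X[i] @ alpha, -np.inf, 0.0, rng)
                      for i in range(N)])
        # Step 4: alpha | v (2-14).
        alpha = rng.multivariate_normal(Sa @ (X.T @ v), Sa)
        # Step 5: latent w_ij, truncation region B_ij from (2-15).
        w = np.empty((N, J))
        for i in range(N):
            for j in range(J):
                mu = Q[i, j] @ lam
                if z[i] == 0:
                    w[i, j] = rng.normal(mu, 1.0)  # untruncated when z_i = 0
                elif Y[i, j] == 1:
                    w[i, j] = tnorm(mu, 0.0, np.inf, rng)
                else:
                    w[i, j] = tnorm(mu, -np.inf, 0.0, rng)
        # Step 6: lambda | w (2-16).
        lam = rng.multivariate_normal(Sl @ (Qf.T @ w.ravel()), Sl)
        draws_a.append(alpha)
        draws_l.append(lam)
    return np.array(draws_a), np.array(draws_l)

# Small simulated run to exercise the sampler.
rng = np.random.default_rng(0)
N, J = 60, 4
X = np.column_stack([np.ones(N), rng.normal(size=N)])
Q = rng.normal(size=(N, J, 2))
Q[:, :, 0] = 1.0
z_true = rng.random(N) < norm.cdf(X @ np.array([0.5, 1.0]))
Y = (rng.random((N, J)) < 0.6) * z_true[:, None].astype(float)
a_draws, l_draws = occupancy_probit_gibbs(Y, X, Q, n_iter=100, rng=rng)
```

Note how step 2 only needs a random draw for sites with no detections; a site with at least one detection is occupied with probability one under (2–12).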
2.2.2 Logit Link Model
Now turning to the logit link version of the occupancy model, again let y_ij be the indicator variable used to mark detection of the target species on the jth survey at the ith site, and let z_i be the indicator variable that denotes presence (z_i = 1) or absence (z_i = 0) of the target species at the ith site. The model is now defined by

y_ij | z_i, λ ~ Bernoulli(z_i p_ij), where p_ij = e^{q'_ij λ} / (1 + e^{q'_ij λ})
λ ~ [λ]
z_i | α ~ Bernoulli(ψ_i), where ψ_i = e^{x'_i α} / (1 + e^{x'_i α})
α ~ [α].
In this hierarchy, the contribution of a single site to the likelihood is

L_i(α, λ) = [(e^{x'_i α})^{z_i} / (1 + e^{x'_i α})] ∏_{j=1}^J (z_i e^{q'_ij λ} / (1 + e^{q'_ij λ}))^{y_ij} (1 − z_i e^{q'_ij λ} / (1 + e^{q'_ij λ}))^{1−y_ij}.    (2–17)
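For concreteness, the single-site contribution (2–17) can be evaluated on the log scale as follows (a hypothetical helper, with names of our own; note that a site with z_i = 0 but at least one detection has likelihood zero):

```python
import numpy as np

def site_loglik(alpha, lam, z_i, y_i, x_i, Q_i):
    """log L_i(alpha, lambda) from (2-17): logistic occupancy and
    detection probabilities for a single site with J surveys."""
    psi = 1.0 / (1.0 + np.exp(-(x_i @ alpha)))   # psi_i
    p = 1.0 / (1.0 + np.exp(-(Q_i @ lam)))       # p_ij, length J
    occ = z_i * np.log(psi) + (1 - z_i) * np.log1p(-psi)
    if z_i == 0:
        # (0 * p_ij)^y_ij (1 - 0 * p_ij)^(1-y_ij): zero if any detection
        det = 0.0 if np.all(y_i == 0) else -np.inf
    else:
        det = np.sum(y_i * np.log(p) + (1 - y_i) * np.log1p(-p))
    return occ + det

# With all linear predictors zero, psi = p_ij = 0.5.
val = site_loglik(np.zeros(2), np.zeros(2), 1, np.array([1, 0]),
                  np.zeros(2), np.zeros((2, 2)))
```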
As in the probit case, we augment the likelihood with two separate sets of latent variables, each in this case having a Polya-gamma distribution. Augmenting the model and using the posterior in (2–7), the joint density is

[z, α, λ, v, w | y] ∝ [α] [λ] ∏_{i=1}^N [(e^{x'_i α})^{z_i} / (1 + e^{x'_i α})] cosh(|x'_i α|/2) exp[−(x'_i α)² v_i / 2] g(v_i) × ∏_{j=1}^J (z_i e^{q'_ij λ} / (1 + e^{q'_ij λ}))^{y_ij} (1 − z_i e^{q'_ij λ} / (1 + e^{q'_ij λ}))^{1−y_ij} × cosh(|z_i q'_ij λ|/2) exp[−(z_i q'_ij λ)² w_ij / 2] g(w_ij).    (2–18)
The full conditionals for z, α, v, λ, and w obtained from (2–18) are provided below.

1. The full conditional for z is obtained after marginalizing out the latent variables and yields

f(z | α, λ, y) = ∏_{i=1}^N f(z_i | α, λ, y_i) = ∏_{i=1}^N (ψ*_i)^{z_i} (1 − ψ*_i)^{1−z_i},

where ψ*_i = ψ_i ∏_{j=1}^J p_ij^{y_ij} (1 − p_ij)^{1−y_ij} / [ψ_i ∏_{j=1}^J p_ij^{y_ij} (1 − p_ij)^{1−y_ij} + (1 − ψ_i) ∏_{j=1}^J I(y_ij = 0)].    (2–19)

2. Using the result derived in Polson et al. (2013), we have

f(v | z, α) = ∏_{i=1}^N f(v_i | z_i, α) = ∏_{i=1}^N PG(1, x'_i α).    (2–20)

3. f(α | v) ∝ [α] ∏_{i=1}^N exp[z_i x'_i α − x'_i α / 2 − (x'_i α)² v_i / 2].    (2–21)

4. By the same result as that used for v, the full conditional for w is

f(w | y, z, λ) = ∏_{i=1}^N ∏_{j=1}^J f(w_ij | y_ij, z_i, λ) = (∏_{i∈S_1} ∏_{j=1}^J PG(1, |q'_ij λ|)) (∏_{i∉S_1} ∏_{j=1}^J PG(1, 0)),    (2–22)

with S_1 = {i ∈ {1, 2, ..., N} : z_i = 1}.

5. f(λ | z, y, w) ∝ [λ] ∏_{i∈S_1} ∏_{j=1}^J exp[y_ij q'_ij λ − q'_ij λ / 2 − (q'_ij λ)² w_ij / 2],    (2–23)

with S_1 as defined above.
The Gibbs sampling algorithm is analogous to the one with a probit link, but with the obvious modifications to incorporate Polya-gamma instead of normal latent variables.

2.3 Temporal Dynamics and Spatial Structure
The uses of the single-season model are limited to very specific problems. In particular, assumptions for the basic model may become too restrictive or unrealistic whenever the study period extends across multiple years or seasons, especially given the increasingly changing environmental conditions that most ecosystems are currently experiencing.
Among the many extensions found in the literature, one that we consider particularly relevant incorporates heterogeneous occupancy probabilities through time. Extensions of site-occupancy models that incorporate temporally varying probabilities can be traced back to Hanski (1994); the heterogeneity of occupancy probabilities through time arises from local colonization and extinction processes. MacKenzie et al. (2003) proposed an alternative to Hanski's approach in order to incorporate imperfect detection. The method is flexible enough to let detection, occurrence, survival, and colonization probabilities each depend upon its own set of covariates, using likelihood-based estimation for the model parameters.
However, the approach of MacKenzie et al. presents two drawbacks. First, the uncertainty assessment for maximum likelihood parameter estimates relies on asymptotic results (obtained from implementation of the delta method), making it sensitive to sample size. Second, to obtain parameter estimates, the latent process (occupancy) is marginalized out of the likelihood, leading to the usual zero-inflated Bernoulli model. Although this is a convenient strategy to solve the estimation problem, the latent state variables (occupancy indicators) are no longer available, and as such, finite-sample estimates cannot be calculated unless an additional (and computationally expensive) parametric bootstrap step is performed (Royle & Kery 2007). Additionally, as the occupancy process is integrated out, the likelihood approach precludes incorporation of additional structural dependence through random effects; thus, the model cannot account for spatial dependence, which plays a fundamental role in this setting.
To work around some of the shortcomings encountered when fitting dynamic occupancy models via likelihood-based methods, Royle & Kery developed what they refer to as a dynamic occupancy state space model (DOSS), alluding to the conceptual similarity between this model and the class of state space models found in the time series literature. In particular, this model allows one to retain the latent process (occupancy indicators) in order to obtain small-sample estimates and, eventually, to generate extensions that incorporate structure in time and/or space through random effects.
The data used in the DOSS model come from standard repeated presence/absence surveys with N sampling locations (patches or sites), indexed by i = 1, 2, ..., N. Within a given season (e.g., year, month, or week, depending on the biology of the species), each sampling location is visited (surveyed) j = 1, 2, ..., J times, and this process is repeated for t = 1, 2, ..., T seasons. An important assumption here is that site occupancy status is closed within, but not across, seasons.
As is usual in the occupancy modeling framework, two different processes are considered. The first is the detection process per site-visit-season combination, denoted by y_ijt. The y_ijt are indicator variables that take the value 1 if the species is detected at site i, survey j, and season t, and 0 otherwise. These detection indicators are assumed to be independent within each site and season. The second response considered is the partially observed presence (occupancy) indicator z_it. These are indicator variables that equal 1 whenever y_ijt = 1 for one or more of the visits made to site i during season t; otherwise, the values of the z_it are unknown. Royle & Kery refer to these two processes as the observation (y_ijt) and state (z_it) models.
In this setting, the parameters of greatest interest are the occurrence or site
occupancy probabilities, denoted by ψ_{it}, as well as those representing the population
dynamics, which are accounted for by introducing changes in occupancy status over
time through local colonization and survival. That is, if a site was not occupied at season
t-1, at season t it can either be colonized or remain unoccupied. On the other hand,
if the site was in fact occupied at season t-1, it can remain that way (survival) or
become abandoned (local extinction) at season t. The probabilities of survival and
colonization from season t-1 to season t at the ith site are denoted by θ_{i(t-1)} and
γ_{i(t-1)}, respectively.

During the initial period (or season), the model for the state process is expressed in
terms of the occupancy probability (Equation 2-24). For subsequent periods, the state
process is specified in terms of survival and colonization probabilities (Equation 2-25); in
particular,
z_{i1} ~ Bernoulli( ψ_{i1} )    (2-24)

z_{it} | z_{i(t-1)} ~ Bernoulli( z_{i(t-1)} θ_{i(t-1)} + (1 - z_{i(t-1)}) γ_{i(t-1)} )    (2-25)
The observation model, conditional on the latent process z_{it}, is defined by

y_{ijt} | z_{it} ~ Bernoulli( z_{it} p_{ijt} )    (2-26)
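The generative process in Equations 2-24 through 2-26 can be simulated directly. The sketch below uses hypothetical, site-constant probabilities for brevity (in the model these vary by site, season, and survey):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical dimensions and probabilities (constant over sites for brevity)
N, J, T = 100, 4, 5            # sites, visits per season, seasons
psi1, theta, gamma, p = 0.5, 0.8, 0.2, 0.4

z = np.empty((N, T), dtype=int)
z[:, 0] = rng.binomial(1, psi1, size=N)                      # eq. 2-24
for t in range(1, T):                                        # eq. 2-25
    trans = z[:, t - 1] * theta + (1 - z[:, t - 1]) * gamma  # survival vs. colonization
    z[:, t] = rng.binomial(1, trans)

y = rng.binomial(1, z[:, None, :] * p, size=(N, J, T))       # eq. 2-26: detection | presence

print(z.mean(axis=0))              # realized occupancy proportion per season
print((y.max(axis=1) <= z).all())  # detections occur only at occupied sites -> True
```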
Royle & Kéry induce heterogeneity by site, site-season, and site-survey-season,
respectively, in the occupancy, survival and colonization, and detection probabilities
through the following specification:

logit(ψ_{i1}) = x_1 + r_i,    r_i ~ N(0, σ²_ψ),    logit⁻¹(x_1) ~ Unif(0, 1)
logit(θ_{it}) = a_t + u_i,    u_i ~ N(0, σ²_θ),    logit⁻¹(a_t) ~ Unif(0, 1)
logit(γ_{it}) = b_t + v_i,    v_i ~ N(0, σ²_γ),    logit⁻¹(b_t) ~ Unif(0, 1)
logit(p_{ijt}) = c_t + w_{ij},    w_{ij} ~ N(0, σ²_p),    logit⁻¹(c_t) ~ Unif(0, 1)    (2-27)
where x_1, a_t, b_t, c_t are the season fixed effects for the corresponding probabilities,
and where (r_i, u_i, v_i) and w_{ij} are the site and site-survey random effects, respectively.
Additionally, all variance components assume the usual inverse gamma priors.

As the authors state, this formulation can be regarded as "being suitably vague";
however, it is also restrictive in the sense that it is not clear what strategy to follow to
incorporate additional covariates while preserving the straightforward sampling strategy.

2.3.1 Dynamic Mixture Occupancy State-Space Model
We assume that the probabilities for occupancy, survival, colonization, and detection
are all functions of linear combinations of covariates. However, our setup varies
slightly from that considered by Royle & Kéry (2007). In essence, we modify the way in
which the estimates for survival and colonization probabilities are attained. Our model
incorporates the notion that occupancy at a site occupied during the previous season
takes place through persistence, where we define persistence as a function of both
survival and colonization. That is, a site occupied at time t may again be occupied
at time t+1 if the current settlers survive, if they perish and new settlers colonize
simultaneously, or if both current settlers survive and new ones colonize.
Our functional forms of choice are again the probit and logit link functions. This
means that each probability of interest, which we will refer to for illustration as δ, is
linked to a linear combination of covariates x'ξ through the relationship defined by
δ = F(x'ξ), where F(·) represents the inverse link function. This particular assumption
facilitates relating the data augmentation algorithms of Albert & Chib and Polson et al. to
Royle & Kéry's DOSS model. We refer to this extension of Royle & Kéry's model as the
Dynamic Mixture Occupancy State Space model (DYMOSS).
As before, let y_{ijt} be the indicator variable used to mark detection of the target
species on the jth survey at the ith site during the tth season, and let z_{it} be the indicator
variable that denotes presence (z_{it} = 1) or absence (z_{it} = 0) of the target species at the
ith site, tth season, with i ∈ {1, 2, ..., N}, j ∈ {1, 2, ..., J}, and t ∈ {1, 2, ..., T}.
Additionally, assume that the probabilities for occupancy at time t = 1, persistence,
colonization, and detection are all functions of covariates, with corresponding parameter
vectors α, Δ^{(s)} = {δ^{(s)}_{t-1}}_{t=2}^{T}, B^{(c)} = {β^{(c)}_{t-1}}_{t=2}^{T}, and Λ = {λ_t}_{t=1}^{T}, and covariate matrices
X^{(o)}, X = {X_{t-1}}_{t=2}^{T}, and Q = {Q_t}_{t=1}^{T}, respectively. Using the notation above, our
proposed dynamic occupancy model is defined by the following hierarchy:

State model:

z_{i1} | α ~ Bernoulli( ψ_{i1} ),  where ψ_{i1} = F( x'_{(o)i} α )

z_{it} | z_{i(t-1)}, δ^{(s)}_{t-1}, β^{(c)}_{t-1} ~ Bernoulli( z_{i(t-1)} θ_{i(t-1)} + (1 - z_{i(t-1)}) γ_{i(t-1)} ),

  where θ_{i(t-1)} = F( δ^{(s)}_{t-1} + x'_{i(t-1)} β^{(c)}_{t-1} )  and  γ_{i(t-1)} = F( x'_{i(t-1)} β^{(c)}_{t-1} )    (2-28)
Observed model:

y_{ijt} | z_{it}, λ_t ~ Bernoulli( z_{it} p_{ijt} ),  where p_{ijt} = F( q'_{ijt} λ_t )    (2-29)
In the hierarchical setup given by Equations 2-28 and 2-29, θ_{i(t-1)} corresponds to
the probability of persistence from time t-1 to time t at site i, and γ_{i(t-1)} denotes the
colonization probability. Note that θ_{i(t-1)} - γ_{i(t-1)} yields the survival probability from t-1
to t. The effect of survival is introduced by changing the intercept of the linear predictor
by a quantity δ^{(s)}_{t-1}. Although in this version of the model this effect is accomplished by
just modifying the intercept, it can be extended to have covariates determining δ^{(s)}_{t-1} as
well. The graphical representation of the model for a single site is given in Figure 2-3.
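A small numerical illustration of this shared-predictor structure, using the probit inverse link and hypothetical covariate and coefficient values (the numbers are for illustration only):

```python
from math import erf, sqrt

def probit_inv(eta):
    """Inverse probit link: F(eta) = standard normal CDF at eta."""
    return 0.5 * (1.0 + erf(eta / sqrt(2.0)))

def persistence_colonization(x, beta_c, delta_s):
    """theta = F(delta_s + x'beta_c) (persistence) and gamma = F(x'beta_c)
    (colonization) share the habitat-suitability predictor x'beta_c; only the
    intercept shift delta_s distinguishes them."""
    eta = sum(xi * bi for xi, bi in zip(x, beta_c))
    theta = probit_inv(delta_s + eta)   # persistence: intercept shifted by delta_s
    gamma = probit_inv(eta)             # colonization: habitat suitability alone
    return theta, gamma

# Hypothetical covariates/coefficients; a positive delta_s means prior-season
# occupancy raises the probability that the site remains occupied
theta, gamma = persistence_colonization([1.0, 0.5], [-0.3, 0.8], delta_s=1.2)
print(round(theta, 3), round(gamma, 3))  # theta > gamma whenever delta_s > 0
```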
[Figure 2-3 depicts the graphical model for a single site: each latent state z_{it} depends on z_{i(t-1)} through δ^{(s)}_{t-1} and β^{(c)}_{t-1} (with z_{i1} depending on α), and each observation y_{it} depends on z_{it} and λ_t.]

Figure 2-3. Graphical representation of the multiseason model for a single site
The joint posterior for the model defined by this hierarchical setting is

[ z, Λ, α, B^{(c)}, Δ^{(s)} | y ] = C_y ∏_{i=1}^{N} [ ψ_{i1} ∏_{j=1}^{J} p_{ij1}^{y_{ij1}} (1 - p_{ij1})^{(1-y_{ij1})} ]^{z_{i1}} [ (1 - ψ_{i1}) ∏_{j=1}^{J} I_{y_{ij1}=0} ]^{1-z_{i1}} [λ_1][α]
    × ∏_{t=2}^{T} ∏_{i=1}^{N} [ z_{i(t-1)} θ_{i(t-1)}^{z_{it}} (1 - θ_{i(t-1)})^{1-z_{it}} + (1 - z_{i(t-1)}) γ_{i(t-1)}^{z_{it}} (1 - γ_{i(t-1)})^{1-z_{it}} ] [ ∏_{j=1}^{J} p_{ijt}^{y_{ijt}} (1 - p_{ijt})^{1-y_{ijt}} ]^{z_{it}}
    × [ ∏_{j=1}^{J} I_{y_{ijt}=0} ]^{1-z_{it}} [λ_t][β^{(c)}_{t-1}][δ^{(s)}_{t-1}]    (2-30)
which, as in the single season case, is intractable. Once again, a Gibbs sampler cannot
be constructed directly to sample from this joint posterior. The graphical representation
of the model for one site, incorporating the latent variables, is provided in Figure 2-4.
[Figure 2-4 augments the graph of Figure 2-3 with the latent variables: u_{i1} between α and z_{i1}, v_{i(t-1)} between z_{i(t-1)} and z_{it}, and w_{it} between λ_t and y_{it}.]

Figure 2-4. Graphical representation of the data-augmented multiseason model
Probit link normal-mixture DYMOSS model
We deal with the intractability of the joint posterior distribution as before, that is,
by introducing latent random variables. Each of the latent variables incorporates the
relevant linear combination of covariates for the probabilities considered in the model.
This artifact enables us to sample from the joint posterior distribution of the model
parameters. For the probit link, the sets of latent random variables, respectively, for first
season occupancy, persistence and colonization, and detection are
• u_i ~ N( x'_{(o)i} α, 1 ),

• v_{i(t-1)} ~ z_{i(t-1)} N( δ^{(s)}_{t-1} + x'_{i(t-1)} β^{(c)}_{t-1}, 1 ) + (1 - z_{i(t-1)}) N( x'_{i(t-1)} β^{(c)}_{t-1}, 1 ), and

• w_{ijt} ~ N( q'_{ijt} λ_t, 1 ).
Introducing these latent variables into the hierarchical formulation yields:

State model:

u_i | α ~ N( x'_{(o)i} α, 1 )
z_{i1} | u_i ~ Bernoulli( I_{u_i > 0} )

for t > 1:
v_{i(t-1)} | z_{i(t-1)}, δ^{(s)}_{t-1}, β^{(c)}_{t-1} ~ z_{i(t-1)} N( δ^{(s)}_{t-1} + x'_{i(t-1)} β^{(c)}_{t-1}, 1 ) + (1 - z_{i(t-1)}) N( x'_{i(t-1)} β^{(c)}_{t-1}, 1 )
z_{it} | v_{i(t-1)} ~ Bernoulli( I_{v_{i(t-1)} > 0} )    (2-31)
Observed model:

w_{ijt} | λ_t ~ N( q'_{ijt} λ_t, 1 )
y_{ijt} | z_{it}, w_{ijt} ~ Bernoulli( z_{it} I_{w_{ijt} > 0} )    (2-32)
Note that the result presented in Section 2.2 corresponds to the particular case for
T = 1 of the model specified by Equations 2-31 and 2-32.
As mentioned previously, model parameters are obtained using a Gibbs sampling
approach. Let φ(x | μ, σ²) denote the pdf of a normally distributed random variable x
with mean μ and variance σ². Also let

1. W_t = (w_{1t}, w_{2t}, ..., w_{Nt}), with w_{it} = (w_{i1t}, w_{i2t}, ..., w_{iJ_{it}t}) (for i = 1, 2, ..., N and
t = 1, 2, ..., T),

2. u = (u_1, u_2, ..., u_N), and

3. V = (v_1, ..., v_{T-1}), with v_t = (v_{1t}, v_{2t}, ..., v_{Nt}).
For the probit link model, the joint posterior distribution is

π( Z, u, V, {W_t}_{t=1}^{T}, α, B^{(c)}, Δ^{(s)} | y ) ∝ [α] ∏_{i=1}^{N} φ( u_i | x'_{(o)i} α, 1 ) I^{z_{i1}}_{u_i > 0} I^{1-z_{i1}}_{u_i ≤ 0}
    × ∏_{t=2}^{T} [ β^{(c)}_{t-1}, δ^{(s)}_{t-1} ] ∏_{i=1}^{N} φ( v_{i(t-1)} | μ^{(v)}_{i(t-1)}, 1 ) I^{z_{it}}_{v_{i(t-1)} > 0} I^{1-z_{it}}_{v_{i(t-1)} ≤ 0}
    × ∏_{t=1}^{T} [λ_t] ∏_{i=1}^{N} ∏_{j=1}^{J_{it}} φ( w_{ijt} | q'_{ijt} λ_t, 1 ) ( z_{it} I_{w_{ijt} > 0} )^{y_{ijt}} ( 1 - z_{it} I_{w_{ijt} > 0} )^{(1-y_{ijt})},

where μ^{(v)}_{i(t-1)} = z_{i(t-1)} δ^{(s)}_{t-1} + x'_{i(t-1)} β^{(c)}_{t-1}.    (2-33)
Initialize the Gibbs sampler at α^{(0)}, B^{(c)(0)}, Δ^{(s)(0)}, and Λ^{(0)}. The sampler
proceeds iteratively by block sampling sequentially for each primary sampling
period as follows: first the presence process, then the latent variables from the
data-augmentation step for the presence component, followed by the parameters for
the presence process, then the latent variables for the detection component, and finally
the parameters for the detection component. Letting [ · | · ] denote the full conditional
probability density function of a component conditional on all other unknown
parameters and the observed data, for m = 1, ..., n_sim the sampling procedure can be
summarized as
[ z_1^{(m)} | · ] → [ u^{(m)} | · ] → [ α^{(m)} | · ] → [ W_1^{(m)} | · ] → [ λ_1^{(m)} | · ] → [ z_2^{(m)} | · ] → [ v_1^{(m)} | · ] → [ β^{(c)(m)}_1, δ^{(s)(m)}_1 | · ] → [ W_2^{(m)} | · ] → [ λ_2^{(m)} | · ] → · · ·
· · · → [ z_T^{(m)} | · ] → [ v_{T-1}^{(m)} | · ] → [ β^{(c)(m)}_{T-1}, δ^{(s)(m)}_{T-1} | · ] → [ W_T^{(m)} | · ] → [ λ_T^{(m)} | · ]

The full conditional probability densities for this Gibbs sampling algorithm are
presented in detail in Appendix A.
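The core move underlying each probit block is the Albert & Chib data-augmentation step: draw the latent normals truncated to the half-line implied by the current indicators, then draw the coefficients from their Gaussian full conditional. The sketch below illustrates one such sweep for a single block, assuming a hypothetical N(0, I) prior on the coefficients and simulated data; it is a minimal illustration, not the full DYMOSS sampler:

```python
import numpy as np

rng = np.random.default_rng(7)

def trunc_norm(mean, positive, rng):
    # Rejection draw from N(mean, 1) restricted to (0, inf) if positive, else (-inf, 0]
    while True:
        d = rng.normal(mean, 1.0)
        if (d > 0.0) == positive:
            return d

def albert_chib_step(X, z, alpha, rng):
    """One data-augmentation sweep for a probit regression block:
    [u | z, alpha] (truncated normals), then [alpha | u] (multivariate normal),
    under a N(0, I) prior on alpha."""
    mu = X @ alpha
    u = np.array([trunc_norm(m, bool(zi), rng) for m, zi in zip(mu, z)])
    prec = X.T @ X + np.eye(X.shape[1])       # posterior precision: X'X + prior precision
    cov = np.linalg.inv(prec)
    mean = cov @ (X.T @ u)
    alpha_new = rng.multivariate_normal(mean, cov)
    return u, alpha_new

# Hypothetical small example: N = 50 sites, intercept + one covariate
X = np.column_stack([np.ones(50), rng.normal(size=50)])
z = rng.binomial(1, 0.5, size=50)
u, alpha = albert_chib_step(X, z, alpha=np.zeros(2), rng=rng)
print(alpha)  # one posterior draw of the block's coefficients
```

In the full sampler this move is repeated for the occupancy block (α), each persistence/colonization block (β^{(c)}_{t-1}, δ^{(s)}_{t-1}), and each detection block (λ_t), interleaved with the indicator updates.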
Logit link Pólya-Gamma DYMOSS model

Using the same notation as before, the logit link model resorts to the hierarchy given
by:

State model:

u_i | α ~ PG( 1, | x'_{(o)i} α | )
z_{i1} | u_i ~ Bernoulli( I_{u_i > 0} )

for t > 1:
v_{i(t-1)} | · ~ PG( 1, | z_{i(t-1)} δ^{(s)}_{t-1} + x'_{i(t-1)} β^{(c)}_{t-1} | )
z_{it} | v_{i(t-1)} ~ Bernoulli( I_{v_{i(t-1)} > 0} )    (2-34)

Observed model:

w_{ijt} | λ_t ~ PG( 1, | q'_{ijt} λ_t | )
y_{ijt} | z_{it}, w_{ijt} ~ Bernoulli( z_{it} I_{w_{ijt} > 0} )    (2-35)
The logit link version of the joint posterior is given by

π( Z, u, V, {W_t}_{t=1}^{T}, α, B^{(c)}, Δ^{(s)} ) ∝ ∏_{i=1}^{N} ( e^{x'_{(o)i} α} )^{z_{i1}} / ( 1 + e^{x'_{(o)i} α} ) PG( u_i; 1, | x'_{(o)i} α | ) [λ_1][α]
    × ∏_{j=1}^{J_{i1}} ( z_{i1} e^{q'_{ij1} λ_1} / (1 + e^{q'_{ij1} λ_1}) )^{y_{ij1}} ( 1 - z_{i1} e^{q'_{ij1} λ_1} / (1 + e^{q'_{ij1} λ_1}) )^{1-y_{ij1}} PG( w_{ij1}; 1, | z_{i1} q'_{ij1} λ_1 | )
    × ∏_{t=2}^{T} [δ^{(s)}_{t-1}][β^{(c)}_{t-1}][λ_t] ∏_{i=1}^{N} ( exp[ μ^{(v)}_{i(t-1)} ] )^{z_{it}} / ( 1 + exp[ μ^{(v)}_{i(t-1)} ] ) PG( v_{i(t-1)}; 1, | μ^{(v)}_{i(t-1)} | )
    × ∏_{j=1}^{J_{it}} ( z_{it} e^{q'_{ijt} λ_t} / (1 + e^{q'_{ijt} λ_t}) )^{y_{ijt}} ( 1 - z_{it} e^{q'_{ijt} λ_t} / (1 + e^{q'_{ijt} λ_t}) )^{1-y_{ijt}} PG( w_{ijt}; 1, | z_{it} q'_{ijt} λ_t | ),    (2-36)

with μ^{(v)}_{i(t-1)} = z_{i(t-1)} δ^{(s)}_{t-1} + x'_{i(t-1)} β^{(c)}_{t-1}.
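The PG(1, c) draws appearing in Equation 2-36 can be generated via the infinite-sum representation of Polson, Scott & Windle, truncated at a finite number of terms. This is an illustrative approximate sampler, not necessarily the exact method one would use in practice; once the ω's are drawn, the coefficient updates are conjugate Gaussian with precision Q'diag(ω)Q plus the prior precision:

```python
import numpy as np

rng = np.random.default_rng(0)

def rpg1(c, rng, terms=200):
    """Approximate PG(1, c) draw via the truncated series representation
    omega = (1 / (2 pi^2)) * sum_k g_k / ((k - 1/2)^2 + c^2 / (4 pi^2)),
    with g_k iid Exp(1).  'terms' truncates the infinite sum."""
    k = np.arange(1, terms + 1)
    g = rng.exponential(1.0, size=terms)
    denom = (k - 0.5) ** 2 + c ** 2 / (4.0 * np.pi ** 2)
    return (g / denom).sum() / (2.0 * np.pi ** 2)

# Sanity check against the known mean E[omega] = tanh(c/2) / (2c)
c = 1.0
draws = np.array([rpg1(c, rng) for _ in range(4000)])
print(draws.mean(), np.tanh(c / 2.0) / (2.0 * c))
```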
The sampling procedure is entirely analogous to that described for the probit
version. The full conditional densities derived from Expression 2-36 are described in
detail in Appendix A.

2.3.2 Incorporating Spatial Dependence
In this section we describe how an additional layer of complexity, space, can also
be accounted for within the same data-augmentation framework. The method we
employ to incorporate spatial dependence is a slightly modified version of the traditional
approach for spatial generalized linear mixed models (GLMMs), and extends the model
proposed by Johnson et al. (2013) for the single season, closed population occupancy
model.
The traditional approach consists of using spatial random effects to induce a
correlation structure among adjacent sites. This formulation, introduced by Besag et al.
(1991), assumes that the spatial random effect corresponds to a Gaussian Markov
random field (GMRF). The model, known as the spatial GLMM (SGLMM), is used to
analyze areal data. It has been applied extensively, given the flexibility of its hierarchical
formulation and the availability of software for its implementation (Hughes & Haran
2013).
Succinctly, the spatial dependence is accounted for in the model by adding a
random vector η assumed to have a conditionally autoregressive (CAR) prior (also
known as a Gaussian Markov random field prior). To define the prior, let the pair
G = (V, E) represent the undirected graph for the entire spatial region studied, where
V = (1, 2, ..., N) denotes the vertices of the graph (sites) and E the set of edges
between sites; E is constituted by elements of the form (i, j), indicating that sites i
and j are spatially adjacent, for some i, j ∈ V. The prior for the spatial effects is then
characterized by

[ η | τ ] ∝ τ^{rank(Q)/2} exp[ -(τ/2) η' Q η ],    (2-37)
where Q = (diag(A1) - A) is the precision matrix, with A denoting the adjacency matrix.
The entries of the adjacency matrix A are such that diag(A) = 0 and A_{ij} = I_{(i,j) ∈ E}.

The matrix Q is singular; hence the probability density defined in Equation 2-37
is improper, i.e., it doesn't integrate to 1. Regardless of the impropriety of the prior, this
model can be fitted using a Bayesian approach, since even if the prior is improper, the
posterior for the model parameters is proper. If a constraint such as Σ_k η_k = 0 is
imposed, or if the precision matrix is replaced by a positive definite matrix, the model
can also be fitted using a maximum likelihood approach.
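A quick numerical illustration of this precision matrix and its rank deficiency, using a hypothetical 2×3 grid of sites with rook adjacency:

```python
import numpy as np

# Hypothetical 2x3 grid of sites, rook adjacency (sites 0-2 top row, 3-5 bottom)
A = np.zeros((6, 6))
edges = [(0, 1), (1, 2), (3, 4), (4, 5), (0, 3), (1, 4), (2, 5)]
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

Q = np.diag(A.sum(axis=1)) - A          # intrinsic CAR precision: diag(A 1) - A

print(np.linalg.matrix_rank(Q))          # N - 1 = 5 for a connected graph: Q is singular
print(np.allclose(Q @ np.ones(6), 0.0))  # the constant vector spans the null space
```

The null vector of ones is exactly why the sum-to-zero constraint Σ_k η_k = 0 restores identifiability.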
Assuming that all but the detection process are subject to spatial correlation, and
using the notation we have developed up to this point, the spatially explicit version of the
DYMOSS model is characterized by the hierarchy represented by Equations 2-38 and
2-39.
Hence, adding spatial structure into the DYMOSS framework described in the
previous section only involves adding the steps to sample η^{(o)} and {η_t}_{t=2}^{T} conditional
on all other parameters. Furthermore, the corresponding parameters and spatial
random effects of a given component (i.e., occupancy, survival, and colonization)
can be effortlessly pooled together into a single parameter vector to perform block
sampling. For each of the latent variables, the only modification required is to add the
corresponding spatial effect to the linear predictor, so that these retain their conditional
independence given the linear combination of fixed effects and the spatial effects.
State model:

z_{i1} | α ~ Bernoulli( ψ_{i1} ),  where ψ_{i1} = F( x'_{(o)i} α + η^{(o)}_i )

[ η^{(o)} | τ ] ∝ τ^{rank(Q)/2} exp[ -(τ/2) η^{(o)'} Q η^{(o)} ]

z_{it} | z_{i(t-1)}, δ^{(s)}_{t-1}, β^{(c)}_{t-1} ~ Bernoulli( z_{i(t-1)} θ_{i(t-1)} + (1 - z_{i(t-1)}) γ_{i(t-1)} ),

  where θ_{i(t-1)} = F( δ^{(s)}_{t-1} + x'_{i(t-1)} β^{(c)}_{t-1} + η_{it} )  and  γ_{i(t-1)} = F( x'_{i(t-1)} β^{(c)}_{t-1} + η_{it} )

[ η_t | τ ] ∝ τ^{rank(Q)/2} exp[ -(τ/2) η'_t Q η_t ]    (2-38)
Observed model:

y_{ijt} | z_{it}, λ_t ~ Bernoulli( z_{it} p_{ijt} ),  where p_{ijt} = F( q'_{ijt} λ_t )    (2-39)
In spite of the popularity of this approach to incorporating spatial dependence, three
shortcomings have been reported in the literature (Hughes & Haran 2013; Reich et al.
2006): (1) model parameters have no clear interpretation, due to spatial confounding
of the predictors with the spatial effect; (2) there is variance inflation due to spatial
confounding; and (3) the high dimensionality of the latent spatial variables leads to
high computational costs. To avoid such difficulties, we follow the approach used by
Hughes & Haran (2013), which builds upon the earlier work of Reich et al. (2006). This
methodology is summarized in what follows.
Let a vector of spatial effects η have the CAR prior given by Equation 2-37 above. Now
consider a random vector ζ ~ MVN( 0, τ K' Q K ), with Q defined as above, where
τ K' Q K corresponds to the precision of the distribution (and not the covariance matrix),
and where the matrix K satisfies K'K = I.
This last condition implies that the linear predictor satisfies Xβ + η = Xβ + Kζ. With
respect to how the matrix K is chosen, Hughes & Haran (2013) recommend basing its
construction on the spectral decomposition of operator matrices based on Moran's I.
The Moran operator matrix is defined as P⊥AP⊥, with P⊥ = I - X(X'X)⁻¹X', and where A
is the adjacency matrix previously described. The choice of the Moran operator is based
on the fact that it accounts for the underlying graph while incorporating the spatial
structure residual to the design matrix X. These elements are reflected in its spectral
decomposition: its eigenvalues correspond to the values of Moran's I statistic (a measure
of spatial autocorrelation) for a spatial process orthogonal to X, while its eigenvectors
provide the patterns of spatial dependence residual to X. Thus, the matrix K is chosen to
be the matrix whose columns are the eigenvectors of the Moran operator for a particular
adjacency matrix.
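The construction of K can be sketched with NumPy. The chain-graph adjacency matrix, design matrix, and number of retained eigenvectors below are all hypothetical choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
N, p, q = 30, 3, 10              # sites, fixed-effect columns, retained eigenvectors

# Hypothetical chain-graph adjacency and design matrix (intercept + 2 covariates)
A = np.diag(np.ones(N - 1), 1) + np.diag(np.ones(N - 1), -1)
X = np.column_stack([np.ones(N), rng.normal(size=(N, p - 1))])

P_perp = np.eye(N) - X @ np.linalg.inv(X.T @ X) @ X.T   # residual projection
M = P_perp @ A @ P_perp                                  # Moran operator
vals, vecs = np.linalg.eigh(M)
order = np.argsort(vals)[::-1]                           # largest Moran's I first
K = vecs[:, order[:q]]                                   # basis for the spatial effects

print(np.allclose(X.T @ K, 0.0, atol=1e-8))      # K orthogonal to the design matrix
print(np.allclose(K.T @ K, np.eye(q)))           # orthonormal columns: K'K = I
```

Keeping only the leading q eigenvectors is what delivers the dimension reduction: the N-dimensional spatial effect η is replaced by the q-dimensional ζ.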
Using this strategy, the new hierarchical formulation of our model is simply modified
by letting η^{(o)} = K^{(o)} ζ^{(o)} and η_t = K_t ζ_t, with

1. ζ^{(o)} ~ MVN( 0, τ^{(o)} K^{(o)'} Q K^{(o)} ), where K^{(o)} is the eigenvector matrix for
P^{(o)⊥} A P^{(o)⊥}, and

2. ζ_t ~ MVN( 0, τ_t K'_t Q K_t ), where K_t is the eigenvector matrix for P⊥_t A P⊥_t, for
t = 2, 3, ..., T.

The algorithms for the probit and logit link from Section 2.3.1 can be readily
adapted to incorporate the spatial structure, simply by obtaining the joint posteriors
for (α, ζ^{(o)}) and (β^{(c)}_{t-1}, δ^{(s)}_{t-1}, ζ_t), making the obvious modification of the corresponding
linear predictors to incorporate the spatial components.

2.4 Summary
With a few exceptions (Dorazio & Taylor-Rodríguez 2012; Johnson et al. 2013;
Royle & Kéry 2007), recent Bayesian approaches to site-occupancy modeling with
covariates have relied on model configurations (e.g., multivariate normal priors on
parameters in the logit scale) that lead to unfamiliar conditional posterior distributions, thus
precluding the use of a direct sampling approach. Therefore, the sampling strategies
available are based on algorithms (e.g., Metropolis-Hastings) that require tuning, and the
knowledge to do so correctly.
In Dorazio & Taylor-Rodríguez (2012) we proposed a Bayesian specification for
which a Gibbs sampler of the basic occupancy model is available, and allowed detection
and occupancy probabilities to depend on linear combinations of predictors. This
method, described in Section 2.2.1, is based on the data augmentation algorithm of
Albert & Chib (1993), where the full conditional posteriors of the parameters of the probit
regression model are cast as latent mixtures of normal random variables. The probit and
the logit link yield similar results with large sample sizes; however, their results may
differ when small to moderate sample sizes are considered, because the logit link
function places more mass in the tails of the distribution than the probit link does. In
Section 2.2.2 we adapt the method for the single season model to work with the logit link
function.
The basic occupancy framework is useful, but it assumes a single closed population
with fixed probabilities through time. Hence, its assumptions may not be appropriate to
address problems where the interest lies in the temporal dynamics of the population.
We therefore developed a dynamic model that incorporates the notion that occupancy
at a previously occupied site takes place through persistence, which depends both on
survival and habitat suitability. By this we mean that a site occupied at time t may again
be occupied at time t+1 if (1) the current settlers survive, (2) the existing settlers
perish but new settlers simultaneously colonize, or (3) current settlers survive and new
ones colonize during the same season. In our current formulation of the DYMOSS, both
colonization and persistence depend on habitat suitability, characterized by x'_{i(t-1)} β^{(c)}_{t-1}.
They only differ in that persistence is also influenced by whether the site being occupied
during season t-1 enhances the suitability of the site or harms it through density
dependence.
Additionally, the study of the dynamics that govern the distribution and abundance of
biological populations requires an understanding of the physical and biotic processes
that act upon them, and these vary in time and space. Consequently, as a final step in
this chapter, we described a straightforward strategy to add spatial dependence among
neighboring sites in the dynamic metapopulation model. This extension is based on the
popular Bayesian spatial modeling technique of Besag et al. (1991), updated using the
methods described in Hughes & Haran (2013).
Future steps along these lines are to (1) develop the software necessary to
implement the tools described throughout the chapter, and (2) build a suite of additional
extensions of this framework for occupancy models. The first of these will be used to
incorporate information from different sources, such as tracks, scats, surveys, and
direct observations, into a single model. This can be accomplished by adding a layer
to the hierarchy where the source and spatial scale of the data are accounted for. The
second extension is a single season, spatially explicit, multiple species co-occupancy
model. This model will allow studying complex interactions and testing hypotheses
about species interactions at a given point in time. Lastly, this co-occupancy model will
be adapted to incorporate temporal dynamics in the spirit of the DYMOSS model.
CHAPTER 3
INTRINSIC ANALYSIS FOR OCCUPANCY MODELS

Eliminate all other factors, and the one which remains must be the truth.
–Sherlock Holmes, The Sign of Four
3.1 Introduction
Occupancy models are often used to understand the mechanisms that dictate
the distribution of a species. Therefore, variable selection plays a fundamental role in
achieving this goal. To the best of our knowledge, "objective" Bayesian alternatives for
variable selection have not been put forth for this problem, and with a few exceptions
(Hooten & Hobbs 2014; Link & Barker 2009), AIC is the method used to choose from
competing site-occupancy models. In addition, the procedures currently implemented
and accessible to ecologists require enumerating and estimating all the candidate
models (Fiske & Chandler 2011; Mazerolle & Mazerolle 2013). In practice, this
can be achieved if the model space considered is small enough, which is possible
if the choice of the model space is guided by substantial prior knowledge about the
underlying ecological processes. Nevertheless, many site-occupancy surveys collect
large amounts of covariate information about the sampled sites. Given that the total
number of candidate models grows exponentially fast with the number of predictors
considered, choosing a reduced set of models guided by ecological intuition becomes
increasingly difficult. This is even more so the case in the occupancy model context,
where the model space is the Cartesian product of models for presence and models for
detection. Given the issues mentioned above, we propose the first objective Bayesian
variable selection method for the single-season occupancy model framework. This
approach explores the entire model space in a principled manner. It is completely
automatic, precluding the need both for tuning parameters in the sampling algorithm and
for subjective elicitation of parameter prior distributions.
As mentioned above, in ecological modeling, if model selection or (less frequently)
model averaging is considered, the Akaike information criterion (AIC) (Akaike 1983),
or a version of it, is the measure of choice for comparing candidate models (Fiske &
Chandler 2011; Mazerolle & Mazerolle 2013). The AIC is designed to find the model
whose density is, on average, closest in Kullback-Leibler distance to the density
of the true data generating mechanism; the model with the smallest AIC is selected.
However, if nested models are considered, one of them being the true one, generally the
AIC will not select it (Wasserman 2000). Commonly, the model selected by AIC will be
more complex than the true one. The reason for this is that the AIC has a weak signal to
noise ratio, and as such it tends to overfit (Rao & Wu 2001). Other versions of the AIC
provide a bias correction that enhances the signal to noise ratio, leading to a stronger
penalization for model complexity; some examples are the AICc (Hurvich & Tsai 1989)
and AICu (McQuarrie et al. 1997). However, these are also not consistent for selection,
albeit asymptotically efficient (Rao & Wu 2001).
If we are interested in prediction, as opposed to testing, the AIC is certainly
appropriate. However, when conducting inference, the use of Bayesian model averaging
and selection methods is more fitting. If the true data generating mechanism is among
those considered, asymptotically, Bayesian methods choose the true model with
probability one. Conversely, if the true model is not among the alternatives, and a
suitable parameter prior is used, the posterior probability of the most parsimonious
model closest to the true one tends asymptotically to one.
In spite of this, in general, for Bayesian testing, direct elicitation of prior probabilistic
statements is often impeded because the problems studied may not be sufficiently
well understood to make an informed decision about the priors. Alternatively, there may
be a prohibitively large number of parameters, making specifying priors for each of
these parameters an arduous task. In addition, seemingly innocuous subjective
choices for the priors on the parameter space may drastically affect test outcomes.
This has been a recurring argument in favor of objective Bayesian procedures,
which appeal to the use of formal rules to build parameter priors that incorporate the
structural information inside the likelihood while utilizing some objective criterion (Kass &
Wasserman 1996).
One popular choice of "objective" prior is the reference prior (Berger & Bernardo
1992), which is the prior that maximizes the amount of signal extracted from the
data. These priors have proven to be effective, as they are fully automatic and can
be frequentist matching, in the sense that the posterior credible interval agrees with the
frequentist confidence interval from repeated sampling with equal coverage probability
(Kass & Wasserman 1996). Reference priors, however, are improper, and while
they yield reasonable posterior parameter probabilities, the derived model posterior
probabilities may be ill defined. To avoid this shortcoming, Berger & Pericchi (1996)
introduced the intrinsic Bayes factor (IBF) for model comparison. Moreno et al. (1998),
building on the IBF of Berger & Pericchi (1996), developed a limiting procedure to
generate a system of priors that yield well-defined posteriors, even though these
priors may sometimes be improper. The IBF is built using a data-dependent prior to
automatically generate Bayes factors; however, the extension introduced by Moreno
et al. (1998) generates the intrinsic prior by taking a theoretical average over the space
of training samples, freeing the prior from data dependence.

In our view, in the face of a large number of predictors, the best alternative is to run
a stochastic search algorithm using good "objective" testing parameter priors and to
incorporate suitable model priors. This being said, the discussion about model priors is
deferred until Chapter 4; this chapter focuses on the priors on the parameter space.
The chapter is structured as follows. First, issues surrounding multimodel inference
are described, and insight about objective Bayesian inferential procedures is provided.
Then, building on modern methods for "objective" Bayesian testing to generate priors
on the parameter space, the intrinsic priors for the parameters of the occupancy model
are derived. These are used in the construction of an algorithm for "objective" model
selection tailored to the occupancy model framework. To assess the performance of our
methods, we provide results from a simulation study in which distinct scenarios, both
favorable and unfavorable, are used to determine the robustness of these tools, and we
analyze the Blue Hawker data set, which has been examined previously in the ecological
literature (Dorazio & Taylor-Rodríguez 2012; Kéry et al. 2010).

3.2 Objective Bayesian Inference
As mentioned before, in practice, noninformative priors arising from structural
rules are an alternative to subjective elicitation of priors. Some of the rules used in
defining noninformative priors include the principle of insufficient reason, parametrization
invariance, maximum entropy, geometric arguments, coverage matching, and decision
theoretic approaches (see Kass & Wasserman (1996) for a discussion).

These rules reflect one of two attitudes: (1) noninformative priors either aim to
convey unique representations of ignorance, or (2) they attempt to produce probability
statements that may be accepted by convention. This latter attitude is in the same
spirit as how weights and distances are defined (Kass & Wasserman 1996), and
characterizes the way in which Bayesian reference methods are interpreted today; i.e.,
noninformative priors are seen to be chosen by convention according to the situation.
A word of caution must be given when using noninformative priors. Difficulties arise
in their implementation that should not be taken lightly. In particular, these difficulties
may occur because noninformative priors are generally improper (meaning that they do
not integrate or sum to a finite number), and as such are said to depend on arbitrary
constants.

Bayes factors strongly depend upon the prior distributions for the parameters
included in each of the models being compared. This can be an important limitation
when using noninformative priors, considering that their introduction will result in the
Bayes factors being a function of the ratio of arbitrary constants, given that these priors
are typically improper (see Jeffreys 1961; Pericchi 2005; and references therein).
Many different approaches have since been developed to deal with the arbitrary constants
that arise when using improper priors. These include the use of partial Bayes factors
(Berger & Pericchi 1996; Good 1950; Lempers 1971), setting the ratio of arbitrary
constants to a predefined value (Spiegelhalter & Smith 1982), and approximating the
Bayes factor (see Haughton 1988, as cited in Berger & Pericchi 1996; Kass & Raftery
1995; Tierney & Kadane 1986).

3.2.1 The Intrinsic Methodology
Berger amp Pericchi (1996) cleverly dealt with the arbitrary constants that arise when
using improper priors by introducing the intrinsic Bayes factor (IBF) procedure This
solution based on partial Bayes factors provides the means to replace the improper
priors by proper ldquoposteriorrdquo priors The IBF is obtained from combining the model
structure with information contained in the observed data Furthermore they showed
that as the sample size tends to infinity the Intrinsic Bayes factor corresponds to the
proper Bayes factor arising from the intrinsic priors
Intrinsic priors, however, are not unique. The asymptotic correspondence between the IBF and the Bayes factor arising from the intrinsic prior yields two functional equations that are solved by a whole class of intrinsic priors. Because all the priors in the class produce Bayes factors that are asymptotically equivalent to the IBF, for finite sample sizes the resulting Bayes factor is not unique. To address this issue, Moreno et al. (1998) formalized the methodology through the "limiting procedure". This procedure allows one to obtain a unique Bayes factor, consolidating the method as a valid objective Bayesian model selection procedure, which we will refer to as the Bayes factor for intrinsic priors (BFIP). This result is particularly valid for nested models, although the methodology may be extended, with some caution, to nonnested models.
As mentioned before, the Bayesian hypothesis testing procedure is highly sensitive to parameter-prior specification, and not all priors that are useful for estimation are recommended for hypothesis testing or model selection. Evidence of this is provided by the Jeffreys-Lindley paradox, which states that a point null hypothesis will always be accepted when the variance of a conjugate prior goes to infinity (Robert, 1993). Additionally, when comparing nested models, the null model should correspond to a substantial reduction in complexity from that of larger alternative models. Hence, priors for the larger alternative models that place probability mass away from the null model are wasteful: if the true model is "far" from the null, it will be easily detected by any statistical procedure. Therefore, the prior on the alternative models should "work harder" at selecting competitive models that are "close" to the null. This principle, known as the Savage continuity condition (Gunel & Dickey, 1974), is widely recognized by statisticians.
Interestingly, the intrinsic prior in correspondence with the BFIP automatically satisfies the Savage continuity condition. That is, when comparing nested models, the intrinsic prior for the more complex model is centered around the null model and, in spite of being derived through a limiting procedure, it is not subject to the Jeffreys-Lindley paradox.
Moreover, beyond the usual pairwise consistency of the Bayes factor for nested models, Casella et al. (2009) show that the corresponding Bayesian procedure with intrinsic priors for variable selection in normal regression is consistent in the entire class of normal linear models, adding an important feature to the list of virtues of the procedure. Consistency of the BFIP for the case where the dimension of the alternative model grows with the sample size is discussed in Moreno et al. (2010).

3.2.2 Mixtures of g-Priors
As previously mentioned, in the Bayesian paradigm a model $M \in \mathcal{M}$ is defined by a sampling density and a prior distribution. The sampling density associated with model $M$ is denoted by $f(y \mid \beta_M, \sigma^2_M, M)$, where $(\beta_M, \sigma^2_M)$ is a vector of model-specific unknown parameters. The prior for model $M$ and its corresponding set of parameters is
\[
\pi(\beta_M, \sigma^2_M, M \mid \mathcal{M}) = \pi(\beta_M, \sigma^2_M \mid M, \mathcal{M}) \cdot \pi(M \mid \mathcal{M}).
\]
Objective local priors for the model parameters $(\beta_M, \sigma^2_M)$ are achieved through modifications and extensions of Zellner's g-prior (Liang et al., 2008; Womack et al., 2014). In particular, below we focus on the intrinsic prior and provide some details for other scaled mixtures of g-priors. We defer the discussion of priors over the model space until Chapter 5, where we describe them in detail and develop a few alternatives of our own.

3.2.2.1 Intrinsic priors
An automatic choice of an objective prior is the intrinsic prior (Berger & Pericchi, 1996; Moreno et al., 1998). Because $M_B \subseteq M$ for all $M \in \mathcal{M}$, the intrinsic prior for $(\beta_M, \sigma^2_M)$ is defined as an expected posterior prior
\[
\pi^{I}(\beta_M, \sigma^2_M \mid M) = \int p^{R}(\beta_M, \sigma^2_M \mid \vec{y}, M)\, m^{R}(\vec{y} \mid M_B)\, d\vec{y}, \tag{3–1}
\]
where $\vec{y}$ is a minimal training sample for model $M$, $I$ denotes the intrinsic distributions, and $R$ denotes distributions derived from the reference prior $\pi^{R}(\beta_M, \sigma^2_M \mid M) = c_M\, \frac{d\beta_M\, d\sigma^2_M}{\sigma^2_M}$. In (3–1), $m^{R}(\vec{y} \mid M) = \iint f(\vec{y} \mid \beta_M, \sigma^2_M, M)\, \pi^{R}(\beta_M, \sigma^2_M \mid M)\, d\beta_M\, d\sigma^2_M$ is the reference marginal of $\vec{y}$ under model $M$, and $p^{R}(\beta_M, \sigma^2_M \mid \vec{y}, M) = \frac{f(\vec{y} \mid \beta_M, \sigma^2_M, M)\, \pi^{R}(\beta_M, \sigma^2_M \mid M)}{m^{R}(\vec{y} \mid M)}$ is the reference posterior density.
In the regression framework, the reference marginal $m^{R}$ is improper and produces improper intrinsic priors. However, the intrinsic Bayes factor of model $M$ to the base model $M_B$ is well-defined and given by
\[
BF^{I}_{M,M_B}(y) = (1-R^2_M)^{-\frac{n-|M_B|}{2}} \times \int_0^1 \left( \frac{n + \sin^2\!\left(\frac{\pi}{2}\theta\right)(|M|+1)}{n + \frac{\sin^2\left(\frac{\pi}{2}\theta\right)(|M|+1)}{1-R^2_M}} \right)^{\!\frac{n-|M|}{2}} \left( \frac{\sin^2\!\left(\frac{\pi}{2}\theta\right)(|M|+1)}{n + \frac{\sin^2\left(\frac{\pi}{2}\theta\right)(|M|+1)}{1-R^2_M}} \right)^{\!\frac{|M|-|M_B|}{2}} d\theta, \tag{3–2}
\]
where $R^2_M$ is the coefficient of determination of model $M$ versus model $M_B$. The Bayes factor between two models $M$ and $M'$ is defined as $BF^{I}_{M,M'}(y) = BF^{I}_{M,M_B}(y) / BF^{I}_{M',M_B}(y)$.

The "goodness" of the model $M$ based on the intrinsic priors is given by its posterior probability
\[
p^{I}(M \mid y, \mathcal{M}) = \frac{BF^{I}_{M,M_B}(y)\, \pi(M \mid \mathcal{M})}{\sum_{M' \in \mathcal{M}} BF^{I}_{M',M_B}(y)\, \pi(M' \mid \mathcal{M})}. \tag{3–3}
\]
It has been shown that the system of intrinsic priors produces consistent model selection (Casella et al., 2009; Girón et al., 2010). In the context of well-formulated models, the true model $M_T$ is the smallest well-formulated model $M \in \mathcal{M}$ such that $\alpha \in M$ if $\beta_\alpha \neq 0$. If $M_T$ is the true model, then the posterior probability of model $M_T$ based on equation (3–3) converges to 1.

3.2.2.2 Other mixtures of g-priors
Scaled mixtures of g-priors place a reference prior on $(\beta_{M_B}, \sigma^2)$ and a multivariate normal distribution on $\beta$ in $M \setminus M_B$, that is, a normal with mean $0$ and precision matrix
\[
\frac{q_M w}{n \sigma^2}\, Z'_M (I - H_0) Z_M,
\]
where $H_0$ is the hat matrix associated with $Z_{M_B}$. The prior is completed by a prior on $w$ and a choice of scaling $q_M$, which is set at $|M|+1$ to account for the minimal sample size of $M$. Under these assumptions, the Bayes factor for $M$ to $M_B$ is given by
\[
BF_{M,M_B}(y) = (1-R^2_M)^{-\frac{n-|M_B|}{2}} \int \left( \frac{n + w(|M|+1)}{n + \frac{w(|M|+1)}{1-R^2_M}} \right)^{\!\frac{n-|M|}{2}} \left( \frac{w(|M|+1)}{n + \frac{w(|M|+1)}{1-R^2_M}} \right)^{\!\frac{|M|-|M_B|}{2}} \pi(w)\, dw.
\]
We consider the following priors on $w$. The intrinsic prior is $\pi(w) = \mathrm{Beta}(w; 0.5, 0.5)$, which is only defined for $w \in (0, 1)$. A version of the Zellner-Siow prior is given by $w \sim \mathrm{Gamma}(0.5, 0.5)$, which produces a multivariate Cauchy distribution on $\beta$. A family of hyper-g priors is defined by $\pi(w) \propto w^{-1/2}(\beta + w)^{-(\alpha+1)/2}$, which has Cauchy-like tails but produces more shrinkage than the Cauchy prior.
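Under the intrinsic prior, the substitution $w = \sin^2(\frac{\pi}{2}\theta)$ recovers exactly the one-dimensional integral in (3–2), so the Bayes factor can be evaluated by simple quadrature. The sketch below is illustrative only (the function name and inputs are ours, not part of the methodology's software):

```python
import numpy as np
from scipy import integrate

def log_bf_intrinsic(R2, n, pM, pB):
    """Log Bayes factor of model M vs. base model M_B under the intrinsic
    prior, evaluated from equation (3-2): |M| = pM, |M_B| = pB, R2 = R^2_M."""
    q = pM + 1  # scaling q_M = |M| + 1
    def integrand(theta):
        s2q = np.sin(np.pi * theta / 2.0) ** 2 * q  # sin^2(pi*theta/2) * (|M|+1)
        denom = n + s2q / (1.0 - R2)
        return ((n + s2q) / denom) ** ((n - pM) / 2.0) \
             * (s2q / denom) ** ((pM - pB) / 2.0)
    val, _ = integrate.quad(integrand, 0.0, 1.0)
    return -0.5 * (n - pB) * np.log(1.0 - R2) + np.log(val)
```

As expected, the evidence for $M$ increases with $R^2_M$: with $n = 50$, $|M| = 3$, and $|M_B| = 1$, the log Bayes factor is negative at $R^2_M = 0$ (a complexity penalty) and grows as $R^2_M$ approaches 1.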
3.3 Objective Bayes Occupancy Model Selection

As mentioned before, Bayesian inferential approaches for ecological models are lacking. In particular, there exists a need for suitable objective and automatic Bayesian testing procedures, and for software implementations that explore thoroughly the model space considered. With this goal in mind, in this section we develop an objective, intrinsic, and fully automatic Bayesian model selection methodology for single season site-occupancy models. We refer to this method as automatic and objective given that its implementation requires no hyperparameter tuning and that it is built using noninformative priors with good testing properties (e.g., intrinsic priors).
An inferential method for the occupancy problem is possible using the intrinsic approach, given that we are able to link intrinsic Bayesian tools for the normal linear model through our probit formulation of the occupancy model. In other words, because we can represent the single season probit occupancy model through the hierarchy
\[
\begin{aligned}
y_{ij} \mid z_i, w_{ij} &\sim \mathrm{Bernoulli}\left(z_i I_{w_{ij}>0}\right), & w_{ij} \mid \lambda &\sim N\left(q'_{ij}\lambda,\, 1\right), \\
z_i \mid v_i &\sim \mathrm{Bernoulli}\left(I_{v_i>0}\right), & v_i \mid \alpha &\sim N\left(x'_i\alpha,\, 1\right),
\end{aligned}
\]
it is possible to solve the selection problem on the latent-scale variables $w_{ij}$ and $v_i$, and to use those results at the level of the occupancy and detection processes.
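Since the hierarchy is fully specified, a data set can be simulated directly from it, which is also how synthetic data sets of this kind are produced. A minimal sketch, with illustrative dimensions and hypothetical coefficient values of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(1)
N, J, p, q = 100, 5, 3, 2  # sites, surveys per site, predictor counts (illustrative)

# Design matrices: X for presence (one row per site), Q for detection (one row per survey)
X = np.column_stack([np.ones(N), rng.normal(size=(N, p - 1))])
Q = np.concatenate([np.ones((N, J, 1)), rng.normal(size=(N, J, q - 1))], axis=2)

alpha = np.array([0.2, 1.0, -0.8])  # hypothetical presence coefficients
lam = np.array([0.5, 1.2])          # hypothetical detection coefficients

v = X @ alpha + rng.normal(size=N)       # latent presence scores v_i
z = (v > 0).astype(int)                  # z_i = I(v_i > 0)
w = Q @ lam + rng.normal(size=(N, J))    # latent detection scores w_ij
y = (w > 0).astype(int) * z[:, None]     # detections possible only at occupied sites
```

By construction, `y` can only be 1 where `z` is 1, mirroring the factor $z_i I_{w_{ij}>0}$ in the hierarchy.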
In what follows, first we provide some necessary notation. Then, a derivation of the intrinsic priors for the parameters of the detection and occupancy components is outlined. Using these priors, we obtain the general form of the model posterior probabilities. Finally, the results are incorporated in a model selection algorithm for site-occupancy data. Although the priors on the model space are not discussed in this chapter, the software and methods developed have different choices of model priors built in.
3.3.1 Preliminaries

The notation used in Chapter 2 will be considered in this section as well. Namely, presence is denoted by $z$, detection by $y$, their corresponding latent processes by $v$ and $w$, and the model parameters by $\alpha$ and $\lambda$. However, some additional notation is also necessary. Let $M_0 = \{M_{0y}, M_{0z}\}$ denote the "base" model, defined by the smallest models considered for the detection and presence processes. The base models $M_{0y}$ and $M_{0z}$ include predictors that must be contained in every model that belongs to the model space. Some examples of base models are the intercept-only model, a model with covariates related to the sampling design, and a model including some predictors important to the researcher that should be included in every model.
Furthermore, let the sets $[K_z] = \{1, 2, \ldots, K_z\}$ and $[K_y] = \{1, 2, \ldots, K_y\}$ index the covariates considered for the variable selection procedure for the presence and detection processes, respectively. That is, these sets denote the covariates that can be added to the base models in $M_0$, or removed from the largest possible models considered, $M_{Fz}$ and $M_{Fy}$, which we will refer to as the "full" models. The model space can then be represented by the Cartesian product of subsets $A_y \subseteq [K_y]$ and $A_z \subseteq [K_z]$. The entire model space is populated by models of the form $M_A = \{M_{A_y}, M_{A_z}\} \in \mathcal{M} = \mathcal{M}_y \times \mathcal{M}_z$, with $M_{A_y} \in \mathcal{M}_y$ and $M_{A_z} \in \mathcal{M}_z$.
For the presence process $z$, the design matrix for model $M_{A_z}$ is given by the block matrix $X_{A_z} = (X_0 \mid X_{r,A})$; $X_0$ corresponds to the design matrix of the base model (which is such that $M_{0z} \subseteq M_{A_z} \in \mathcal{M}_z$ for all $A_z \subseteq [K_z]$), and $X_{r,A}$ corresponds to the submatrix that contains the covariates indexed by $A_z$. Analogously, for the detection process $y$, the design matrix is given by $Q_{A_y} = (Q_0 \mid Q_{r,A})$. Similarly, the coefficients for models $M_{A_z}$ and $M_{A_y}$ are given by $\alpha_A = (\alpha'_0, \alpha'_{r,A})'$ and $\lambda_A = (\lambda'_0, \lambda'_{r,A})'$.
With these elements in place, the model selection problem consists of finding subsets of covariates, indexed by $A = \{A_z, A_y\}$, that have a high posterior probability given the detection and occupancy processes. This is equivalent to finding models with high posterior odds when compared to a suitable base model. These posterior odds are given by
\[
\frac{p(M_A \mid y, z)}{p(M_0 \mid y, z)} = \frac{m(y, z \mid M_A)\, \pi(M_A)}{m(y, z \mid M_0)\, \pi(M_0)} = BF_{M_A,M_0}(y, z)\, \frac{\pi(M_A)}{\pi(M_0)}.
\]
Since we are able to represent the occupancy model as a truncation of latent normal variables, it is possible to work through the occupancy model selection problem on the latent normal scale used for the presence and detection processes. We formulate two solutions to this problem: one that depends on the observed and latent components, and another that depends solely on the latent-level variables used to data-augment the problem. We will, however, focus on the latter approach, as this yields a straightforward MCMC sampling scheme. For completeness, the other alternative is described in Section 3.4.
At the root of our objective inferential procedure for occupancy models lies the conditional argument introduced by Womack et al. (work in progress) for the simple probit regression. In the occupancy setting, the argument is
\[
\begin{aligned}
p(M_A \mid y, z, w, v) &= \frac{m(y, z, v, w \mid M_A)\, \pi(M_A)}{m(y, z, w, v)} \\
&= \frac{f_{yz}(y, z \mid w, v)\left(\int f_{vw}(v, w \mid \alpha, \lambda, M_A)\, \pi_{\alpha\lambda}(\alpha, \lambda \mid M_A)\, d(\alpha, \lambda)\right)\pi(M_A)}{f_{yz}(y, z \mid w, v)\sum_{M^* \in \mathcal{M}}\left(\int f_{vw}(v, w \mid \alpha, \lambda, M^*)\, \pi_{\alpha\lambda}(\alpha, \lambda \mid M^*)\, d(\alpha, \lambda)\right)\pi(M^*)} \\
&= \frac{m(v \mid M_{A_z})\, m(w \mid M_{A_y})\, \pi(M_A)}{m(v)\, m(w)} \\
&\propto m(v \mid M_{A_z})\, m(w \mid M_{A_y})\, \pi(M_A), \tag{3–4}
\end{aligned}
\]
where

1. $f_{yz}(y, z \mid w, v) = \prod_{i=1}^{N} I_{z_i v_i > 0}\, I_{(1-z_i) v_i \leq 0} \prod_{j=1}^{J_i} \left(z_i I_{w_{ij}>0}\right)^{y_{ij}} \left(1 - z_i I_{w_{ij}>0}\right)^{1-y_{ij}}$,

2. $f_{vw}(v, w \mid \alpha, \lambda, M_A) = \underbrace{\left(\prod_{i=1}^{N} \phi(v_i;\, x'_i \alpha,\, 1)\right)}_{f(v \mid \alpha_{r,A}, \alpha_0, M_{A_z})} \underbrace{\left(\prod_{i=1}^{N} \prod_{j=1}^{J_i} \phi(w_{ij};\, q'_{ij} \lambda,\, 1)\right)}_{f(w \mid \lambda_{r,A}, \lambda_0, M_{A_y})}$, and

3. $\pi_{\alpha\lambda}(\alpha, \lambda \mid M_A) = \pi_{\alpha}(\alpha \mid M_{A_z})\, \pi_{\lambda}(\lambda \mid M_{A_y})$.
This result implies that, once the occupancy and detection indicators are conditioned on the latent processes $v$ and $w$, respectively, the model posterior probabilities depend only on the latent variables. Hence, in this case, the model selection problem is driven by the posterior odds
\[
\frac{p(M_A \mid y, z, w, v)}{p(M_0 \mid y, z, w, v)} = \frac{m(w, v \mid M_A)}{m(w, v \mid M_0)}\, \frac{\pi(M_A)}{\pi(M_0)}, \tag{3–5}
\]
where $m(w, v \mid M_A) = m(w \mid M_{A_y}) \cdot m(v \mid M_{A_z})$, with
\[
m(v \mid M_{A_z}) = \iint f(v \mid \alpha_{r,A}, \alpha_0, M_{A_z})\, \pi(\alpha_{r,A} \mid \alpha_0, M_{A_z})\, \pi(\alpha_0)\, d\alpha_{r,A}\, d\alpha_0, \tag{3–6}
\]
\[
m(w \mid M_{A_y}) = \iint f(w \mid \lambda_{r,A}, \lambda_0, M_{A_y})\, \pi(\lambda_{r,A} \mid \lambda_0, M_{A_y})\, \pi(\lambda_0)\, d\lambda_0\, d\lambda_{r,A}. \tag{3–7}
\]
3.3.2 Intrinsic Priors for the Occupancy Problem
In general, the intrinsic priors as defined by Moreno et al. (1998) use the functional form of the response to inform their construction, assuming some preliminary prior distribution, proper or improper, on the model parameters. For our purposes, we assume noninformative improper priors for the parameters, denoted by $\pi^N(\cdot \mid \cdot)$. Specifically, the intrinsic priors $\pi^{IP}(\theta_{M^*} \mid M^*)$ for a vector of parameters $\theta_{M^*}$, corresponding to model $M^* \in \{M_0, M\} \subset \mathcal{M}$, for a response vector $s$ with probability density (or mass) function $f(s \mid \theta_{M^*})$, are defined by
\[
\begin{aligned}
\pi^{IP}(\theta_{M_0} \mid M_0) &= \pi^{N}(\theta_{M_0} \mid M_0), \\
\pi^{IP}(\theta_{M} \mid M) &= \pi^{N}(\theta_{M} \mid M) \int \frac{m(\vec{s} \mid M_0)}{m(\vec{s} \mid M)}\, f(\vec{s} \mid \theta_M, M)\, d\vec{s},
\end{aligned}
\]
where $\vec{s}$ is a theoretical training sample.
In what follows, whenever it is clear from the context, in an attempt to simplify the notation, $M_A$ will be used to refer to $M_{A_z}$ or $M_{A_y}$, and $A$ will denote $A_z$ or $A_y$. To derive the parameter priors involved in equations 3–6 and 3–7 using the objective intrinsic prior strategy, we start by assuming flat priors $\pi^N(\alpha_A \mid M_A) \propto c_A$ and $\pi^N(\lambda_A \mid M_A) \propto d_A$, where $c_A$ and $d_A$ are unknown constants.
The intrinsic prior for the parameters associated with the occupancy process, $\alpha_A$, conditional on model $M_A$, is
\[
\pi^{IP}(\alpha_A \mid M_A) = \pi^{N}(\alpha_A \mid M_A) \int \frac{m(\vec{v} \mid M_0)}{m(\vec{v} \mid M_A)}\, f(\vec{v} \mid \alpha_A, M_A)\, d\vec{v},
\]
where the marginals $m(\vec{v} \mid M_j)$, with $j \in \{A, 0\}$, are obtained by solving the analogue of equation 3–6 for the (theoretical) training sample $\vec{v}$. These marginals are given by
\[
m(\vec{v} \mid M_j) = c_j\, (2\pi)^{-\frac{p_{A_z}-p_j}{2}}\, |\vec{X}'_j \vec{X}_j|^{-\frac{1}{2}}\, e^{-\frac{1}{2}\vec{v}'(I-\vec{H}_j)\vec{v}}.
\]
The training sample $\vec{v}$ has dimension $p_{A_z} = |M_{A_z}|$, that is, the total number of parameters in model $M_{A_z}$. Note that, without ambiguity, we use $|\cdot|$ to denote both the cardinality of a set and the determinant of a matrix. The design matrix $\vec{X}_A$ corresponds to the training sample $\vec{v}$ and is chosen such that $\vec{X}'_A \vec{X}_A = \frac{p_{A_z}}{N} X'_A X_A$ (Leon-Novelo et al., 2012), and $\vec{H}_j$ is the corresponding hat matrix.
Replacing $m(\vec{v} \mid M_A)$ and $m(\vec{v} \mid M_0)$ in $\pi^{IP}(\alpha_A \mid M_A)$ and solving the integral with respect to the theoretical training sample $\vec{v}$, we have
\[
\begin{aligned}
\pi^{IP}(\alpha_A \mid M_A) &= c_A \int \left( (2\pi)^{-\frac{p_{A_z}-p_{0z}}{2}} \left(\frac{c_0}{c_A}\right) e^{-\frac{1}{2}\vec{v}'\left((I-\vec{H}_0)-(I-\vec{H}_A)\right)\vec{v}}\, \frac{|\vec{X}'_A\vec{X}_A|^{1/2}}{|\vec{X}'_0\vec{X}_0|^{1/2}} \right) \times \left( (2\pi)^{-\frac{p_{A_z}}{2}}\, e^{-\frac{1}{2}(\vec{v}-\vec{X}_A\alpha_A)'(\vec{v}-\vec{X}_A\alpha_A)} \right) d\vec{v} \\
&= c_0\, (2\pi)^{-\frac{p_{A_z}-p_{0z}}{2}}\, |\vec{X}'_{r,A}\vec{X}_{r,A}|^{\frac{1}{2}}\, 2^{-\frac{p_{A_z}-p_{0z}}{2}} \exp\left[-\frac{1}{2}\alpha'_{r,A}\left(\frac{1}{2}\vec{X}'_{r,A}\vec{X}_{r,A}\right)\alpha_{r,A}\right] \\
&= \pi^{N}(\alpha_0) \times N\!\left(\alpha_{r,A} \,\middle|\, 0,\; 2\left(\vec{X}'_{r,A}\vec{X}_{r,A}\right)^{-1}\right). \tag{3–8}
\end{aligned}
\]
Analogously, the intrinsic prior for the parameters associated with the detection process is
\[
\begin{aligned}
\pi^{IP}(\lambda_A \mid M_A) &= d_0\, (2\pi)^{-\frac{p_{A_y}-p_{0y}}{2}}\, |\vec{Q}'_{r,A}\vec{Q}_{r,A}|^{\frac{1}{2}}\, 2^{-\frac{p_{A_y}-p_{0y}}{2}} \exp\left[-\frac{1}{2}\lambda'_{r,A}\left(\frac{1}{2}\vec{Q}'_{r,A}\vec{Q}_{r,A}\right)\lambda_{r,A}\right] \\
&= \pi^{N}(\lambda_0) \times N\!\left(\lambda_{r,A} \,\middle|\, 0,\; 2\left(\vec{Q}'_{r,A}\vec{Q}_{r,A}\right)^{-1}\right). \tag{3–9}
\end{aligned}
\]
In short, the intrinsic priors for $\alpha_A = (\alpha'_0, \alpha'_{r,A})'$ and $\lambda_A = (\lambda'_0, \lambda'_{r,A})'$ are the product of a reference prior on the parameters of the base model and a normal density on the parameters indexed by $A_z$ and $A_y$, respectively.

3.3.3 Model Posterior Probabilities
We now derive the expressions involved in the calculation of the model posterior probabilities. First, recall that $p(M_A \mid y, z, w, v) \propto m(w, v \mid M_A)\, \pi(M_A)$. Hence, determining this posterior probability only requires calculating $m(w, v \mid M_A)$.

Note that, since $w$ and $v$ are independent, obtaining the model posteriors from expression 3–4 reduces to finding closed-form expressions for the marginals $m(v \mid M_{A_z})$ and $m(w \mid M_{A_y})$, respectively, from equations 3–6 and 3–7. Therefore,
\[
m(w, v \mid M_A) = \iint f(v, w \mid \alpha, \lambda, M_A)\, \pi^{IP}(\alpha \mid M_{A_z})\, \pi^{IP}(\lambda \mid M_{A_y})\, d\alpha\, d\lambda. \tag{3–10}
\]
For the latent variable associated with the occupancy process, plugging the parameter intrinsic prior given by 3–8 into equation 3–6 (recalling that $\vec{X}'_A\vec{X}_A = \frac{p_{A_z}}{N}X'_AX_A$) and integrating out $\alpha_A$ yields
\[
\begin{aligned}
m(v \mid M_A) &= \iint c_0\, N\!\left(v \mid X_0\alpha_0 + X_{r,A}\alpha_{r,A},\, I\right) N\!\left(\alpha_{r,A} \mid 0,\; 2(\vec{X}'_{r,A}\vec{X}_{r,A})^{-1}\right) d\alpha_{r,A}\, d\alpha_0 \\
&= c_0 (2\pi)^{-n/2} \int \left(\frac{p_{A_z}}{2N + p_{A_z}}\right)^{\frac{p_{A_z}-p_{0z}}{2}} \exp\left[-\frac{1}{2}(v - X_0\alpha_0)'\left(I - \left(\frac{2N}{2N + p_{A_z}}\right)H_{r,A_z}\right)(v - X_0\alpha_0)\right] d\alpha_0 \\
&= c_0\, (2\pi)^{-(n-p_{0z})/2} \left(\frac{p_{A_z}}{2N + p_{A_z}}\right)^{\frac{p_{A_z}-p_{0z}}{2}} |X'_0X_0|^{-\frac{1}{2}} \exp\left[-\frac{1}{2}v'\left(I - H_{0z} - \left(\frac{2N}{2N + p_{A_z}}\right)H_{r,A_z}\right)v\right], \tag{3–11}
\end{aligned}
\]
with $H_{r,A_z} = H_{A_z} - H_{0z}$, where $H_{A_z}$ is the hat matrix for the entire model $M_{A_z}$ and $H_{0z}$ is the hat matrix for the base model.
Similarly, the marginal distribution for $w$ is
\[
m(w \mid M_A) = d_0\, (2\pi)^{-(J-p_{0y})/2} \left(\frac{p_{A_y}}{2J + p_{A_y}}\right)^{\frac{p_{A_y}-p_{0y}}{2}} |Q'_0Q_0|^{-\frac{1}{2}} \exp\left[-\frac{1}{2}w'\left(I - H_{0y} - \left(\frac{2J}{2J + p_{A_y}}\right)H_{r,A_y}\right)w\right], \tag{3–12}
\]
where $J = \sum_{i=1}^{N} J_i$, or in other words, $J$ denotes the total number of surveys conducted.
Now, the marginals for the base model $M_0 = \{M_{0y}, M_{0z}\}$ are
\[
m(v \mid M_0) = \int c_0\, N(v \mid X_0\alpha_0, I)\, d\alpha_0 = c_0 (2\pi)^{-(n-p_{0z})/2}\, |X'_0X_0|^{-1/2} \exp\left[-\frac{1}{2}\, v'(I - H_{0z})\, v\right] \tag{3–13}
\]
and
\[
m(w \mid M_0) = d_0 (2\pi)^{-(J-p_{0y})/2}\, |Q'_0Q_0|^{-1/2} \exp\left[-\frac{1}{2}\, w'(I - H_{0y})\, w\right]. \tag{3–14}
\]
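In the posterior odds 3–5, the constant $c_0$ (and the $|X'_0X_0|$ term) cancels between 3–11 and 3–13, so the presence-component contribution can be evaluated directly as a log marginal ratio. A small numerical sketch (function names are ours; the detection component is analogous, with $Q$ and $J$ in place of $X$ and $N$):

```python
import numpy as np

def hat(X):
    """Hat (projection) matrix onto the column space of X."""
    return X @ np.linalg.solve(X.T @ X, X.T)

def log_marg_ratio_presence(v, X0, Xr):
    """log m(v | M_A) - log m(v | M_0) from equations (3-11) and (3-13);
    the reference constant c_0 and the |X_0'X_0| factor cancel in the ratio."""
    N = len(v)
    p0, pA = X0.shape[1], X0.shape[1] + Xr.shape[1]
    Hr = hat(np.column_stack([X0, Xr])) - hat(X0)  # H_{r,A_z} = H_{A_z} - H_{0z}
    return (0.5 * (pA - p0) * np.log(pA / (2.0 * N + pA))
            + 0.5 * (2.0 * N / (2.0 * N + pA)) * (v @ Hr @ v))

rng = np.random.default_rng(2)
x = rng.normal(size=50)
X0, Xr = np.ones((50, 1)), x[:, None]
```

The ratio favors $M_A$ when $v$ is explained by the extra predictor (e.g., `v = 2 * x`) and penalizes the added dimension otherwise (e.g., `v` orthogonal to `x`).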
3.3.4 Model Selection Algorithm
Having the parameter intrinsic priors in place and knowing the form of the model posterior probabilities, it is finally possible to develop a strategy to conduct model selection for the occupancy framework.

For each of the two components of the model (occupancy and detection), the algorithm first draws the set of active predictors (i.e., $A_z$ and $A_y$) together with their corresponding parameters. This is a reversible jump step, which uses a Metropolis-Hastings correction with proposal distributions given by
\[
\begin{aligned}
q(A^*_z \mid z_o, z^{(t)}_u, v^{(t)}, M_{A_z}) &= \frac{1}{2}\left( p\left(M_{A^*_z} \mid z_o, z^{(t)}_u, v^{(t)},\; M_{A^*_z} \in L(M_{A_z})\right) + \frac{1}{|L(M_{A_z})|} \right), \\
q(A^*_y \mid y, z_o, z^{(t)}_u, w^{(t)}, M_{A_y}) &= \frac{1}{2}\left( p\left(M_{A^*_y} \mid y, z_o, z^{(t)}_u, w^{(t)},\; M_{A^*_y} \in L(M_{A_y})\right) + \frac{1}{|L(M_{A_y})|} \right), \tag{3–15}
\end{aligned}
\]
where $L(M_{A_z})$ and $L(M_{A_y})$ denote the sets of models obtained from adding or removing one predictor at a time from $M_{A_z}$ and $M_{A_y}$, respectively.
To promote mixing, this step is followed by an additional draw from the full conditionals of $\alpha$ and $\lambda$. The densities $p(\alpha_0 \mid \cdot)$, $p(\alpha_{r,A} \mid \cdot)$, $p(\lambda_0 \mid \cdot)$, and $p(\lambda_{r,A} \mid \cdot)$ can be sampled from directly with Gibbs steps. Using the notation $a \mid \cdot$ to denote the random variable $a$ conditioned on all other parameters and on the data, these densities are given by

• $\alpha_0 \mid \cdot \sim N\left((X'_0X_0)^{-1}X'_0v,\; (X'_0X_0)^{-1}\right)$;

• $\alpha_{r,A} \mid \cdot \sim N\left(\mu_{\alpha_{r,A}},\, \Sigma_{\alpha_{r,A}}\right)$, where the mean vector and the covariance matrix are given by $\Sigma_{\alpha_{r,A}} = \frac{2N}{2N+p_{A_z}}\left(X'_{r,A}X_{r,A}\right)^{-1}$ and $\mu_{\alpha_{r,A}} = \Sigma_{\alpha_{r,A}}\, X'_{r,A}v$;

• $\lambda_0 \mid \cdot \sim N\left((Q'_0Q_0)^{-1}Q'_0w,\; (Q'_0Q_0)^{-1}\right)$; and

• $\lambda_{r,A} \mid \cdot \sim N\left(\mu_{\lambda_{r,A}},\, \Sigma_{\lambda_{r,A}}\right)$, analogously, with mean and covariance matrix given by $\Sigma_{\lambda_{r,A}} = \frac{2J}{2J+p_{A_y}}\left(Q'_{r,A}Q_{r,A}\right)^{-1}$ and $\mu_{\lambda_{r,A}} = \Sigma_{\lambda_{r,A}}\, Q'_{r,A}w$.

Finally, Gibbs sampling steps are also available for the unobserved occupancy indicators $z_u$ and for the corresponding latent variables $v$ and $w$. The full conditional posterior densities for $z^{(t+1)}_u$, $v^{(t+1)}$, and $w^{(t+1)}$ are those introduced in Chapter 2 for the single season probit model.
The following steps summarize the stochastic search algorithm:

1. Initialize $A^{(0)}_y$, $A^{(0)}_z$, $z^{(0)}_u$, $v^{(0)}$, $w^{(0)}$, $\alpha^{(0)}_0$, $\lambda^{(0)}_0$.

2. Sample the model indices and corresponding parameters:

(a) Draw simultaneously
• $A^*_z \sim q(A_z \mid z_o, z^{(t)}_u, v^{(t)}, M_{A_z})$,
• $\alpha^*_0 \sim p(\alpha_0 \mid M_{A^*_z}, z_o, z^{(t)}_u, v^{(t)})$, and
• $\alpha^*_{r,A^*} \sim p(\alpha_{r,A} \mid M_{A^*_z}, z_o, z^{(t)}_u, v^{(t)})$.

(b) Accept $(M^{(t+1)}_{A_z}, \alpha^{(t+1),1}_0, \alpha^{(t+1),1}_{r,A}) = (M_{A^*_z}, \alpha^*_0, \alpha^*_{r,A^*})$ with probability
\[
\delta_z = \min\left(1,\; \frac{p(M_{A^*_z} \mid z_o, z^{(t)}_u, v^{(t)})}{p(M_{A^{(t)}_z} \mid z_o, z^{(t)}_u, v^{(t)})}\, \frac{q(A^{(t)}_z \mid z_o, z^{(t)}_u, v^{(t)}, M_{A^*_z})}{q(A^*_z \mid z_o, z^{(t)}_u, v^{(t)}, M_{A^{(t)}_z})}\right);
\]
otherwise, let $(M^{(t+1)}_{A_z}, \alpha^{(t+1),1}_0, \alpha^{(t+1),1}_{r,A}) = (M_{A^{(t)}_z}, \alpha^{(t),2}_0, \alpha^{(t),2}_{r,A})$.

(c) Draw simultaneously
• $A^*_y \sim q(A_y \mid y, z_o, z^{(t)}_u, w^{(t)}, M_{A_y})$,
• $\lambda^*_0 \sim p(\lambda_0 \mid M_{A^*_y}, y, z_o, z^{(t)}_u, w^{(t)})$, and
• $\lambda^*_{r,A^*} \sim p(\lambda_{r,A} \mid M_{A^*_y}, y, z_o, z^{(t)}_u, w^{(t)})$.

(d) Accept $(M^{(t+1)}_{A_y}, \lambda^{(t+1),1}_0, \lambda^{(t+1),1}_{r,A}) = (M_{A^*_y}, \lambda^*_0, \lambda^*_{r,A^*})$ with probability
\[
\delta_y = \min\left(1,\; \frac{p(M_{A^*_y} \mid y, z_o, z^{(t)}_u, w^{(t)})}{p(M_{A^{(t)}_y} \mid y, z_o, z^{(t)}_u, w^{(t)})}\, \frac{q(A^{(t)}_y \mid y, z_o, z^{(t)}_u, w^{(t)}, M_{A^*_y})}{q(A^*_y \mid y, z_o, z^{(t)}_u, w^{(t)}, M_{A^{(t)}_y})}\right);
\]
otherwise, let $(M^{(t+1)}_{A_y}, \lambda^{(t+1),1}_0, \lambda^{(t+1),1}_{r,A}) = (M_{A^{(t)}_y}, \lambda^{(t),2}_0, \lambda^{(t),2}_{r,A})$.

3. Sample base model parameters:

(a) Draw $\alpha^{(t+1),2}_0 \sim p(\alpha_0 \mid M_{A^{(t+1)}_z}, z_o, z^{(t)}_u, v^{(t)})$.
(b) Draw $\lambda^{(t+1),2}_0 \sim p(\lambda_0 \mid M_{A^{(t+1)}_y}, y, z_o, z^{(t)}_u, w^{(t)})$.

4. To improve mixing, resample the model coefficients that are not in the base model but are in $M_A$:

(a) Draw $\alpha^{(t+1),2}_{r,A} \sim p(\alpha_{r,A} \mid M_{A^{(t+1)}_z}, z_o, z^{(t)}_u, v^{(t)})$.
(b) Draw $\lambda^{(t+1),2}_{r,A} \sim p(\lambda_{r,A} \mid M_{A^{(t+1)}_y}, y, z_o, z^{(t)}_u, w^{(t)})$.

5. Sample latent and missing (unobserved) variables:

(a) Sample $z^{(t+1)}_u \sim p(z_u \mid M_{A^{(t+1)}_z}, y, \alpha^{(t+1),2}_{r,A}, \alpha^{(t+1),2}_0, \lambda^{(t+1),2}_{r,A}, \lambda^{(t+1),2}_0)$.
(b) Sample $v^{(t+1)} \sim p(v \mid M_{A^{(t+1)}_z}, z_o, z^{(t+1)}_u, \alpha^{(t+1),2}_{r,A}, \alpha^{(t+1),2}_0)$.
(c) Sample $w^{(t+1)} \sim p(w \mid M_{A^{(t+1)}_y}, y, z_o, z^{(t+1)}_u, \lambda^{(t+1),2}_{r,A}, \lambda^{(t+1),2}_0)$.
3.4 Alternative Formulation
Because the occupancy process is partially observed, it is reasonable to consider the posterior odds in terms of the observed responses, that is, the detections $y$ and the presences at sites where at least one detection takes place. Partitioning the vector of presences into observed and unobserved components, $z = (z'_o, z'_u)'$, and integrating out the unobserved component, the model posterior for $M_A$ can be obtained as
\[
p(M_A \mid y, z_o) \propto E_{z_u}\left[m(y, z \mid M_A)\right] \pi(M_A). \tag{3–16}
\]
Data-augmenting the model in terms of latent normal variables à la Albert and Chib, the marginals for any model $\{M_y, M_z\} = M \in \mathcal{M}$ of $z$ and $y$ inside of the expectation in equation 3–16 can be expressed in terms of the latent variables:
\[
m(y, z \mid M) = \int_{T(z)} \int_{T(y,z)} m(w, v \mid M)\, dw\, dv = \left(\int_{T(z)} m(v \mid M_z)\, dv\right)\left(\int_{T(y,z)} m(w \mid M_y)\, dw\right), \tag{3–17}
\]
where $T(z)$ and $T(y, z)$ denote the corresponding truncation regions for $v$ and $w$, which depend on the values taken by $z$ and $y$, and
\[
m(v \mid M_z) = \int f(v \mid \alpha, M_z)\, \pi(\alpha \mid M_z)\, d\alpha, \tag{3–18}
\]
\[
m(w \mid M_y) = \int f(w \mid \lambda, M_y)\, \pi(\lambda \mid M_y)\, d\lambda. \tag{3–19}
\]
The last equality in equation 3–17 is a consequence of the independence of the latent processes $v$ and $w$. Using expressions 3–18 and 3–19 allows one to embed this model selection problem in the classical linear normal regression setting, where many "objective" Bayesian inferential tools are available. In particular, these expressions facilitate deriving the parameter intrinsic priors (Berger & Pericchi, 1996; Moreno et al., 1998) for this problem. This approach is an extension of the one implemented in Leon-Novelo et al. (2012) for the simple probit regression problem.
Using this alternative approach, all that is left is to integrate $m(v \mid M_A)$ and $m(w \mid M_A)$ over their corresponding truncation regions $T(z)$ and $T(y, z)$, which yields $m(y, z \mid M_A)$, and then to obtain the expectation with respect to the unobserved $z$'s. Note, however, that two issues arise. First, such integrals are not available in closed form. Second, calculating the expectation over the limits of integration further complicates things. To address these difficulties, it is possible to express $E[m(y, z \mid M_A)]$ as
\[
\begin{aligned}
E_{z_u}[m(y, z \mid M_A)] &= E_{z_u}\left[\left(\int_{T(z)} m(v \mid M_{A_z})\, dv\right)\left(\int_{T(y,z)} m(w \mid M_{A_y})\, dw\right)\right] \\
&= E_{z_u}\left[\left(\int_{T(z)} \int m(v \mid M_{A_z}, \alpha_0)\, \pi^{IP}(\alpha_0 \mid M_{A_z})\, d\alpha_0\, dv\right) \times \left(\int_{T(y,z)} \int m(w \mid M_{A_y}, \lambda_0)\, \pi^{IP}(\lambda_0 \mid M_{A_y})\, d\lambda_0\, dw\right)\right] \\
&= E_{z_u}\left[\int \underbrace{\left(\int_{T(z)} m(v \mid M_{A_z}, \alpha_0)\, dv\right)}_{g_1(T(z) \mid M_{A_z}, \alpha_0)} \pi^{IP}(\alpha_0 \mid M_{A_z})\, d\alpha_0 \times \int \underbrace{\left(\int_{T(y,z)} m(w \mid M_{A_y}, \lambda_0)\, dw\right)}_{g_2(T(y,z) \mid M_{A_y}, \lambda_0)} \pi^{IP}(\lambda_0 \mid M_{A_y})\, d\lambda_0\right] \\
&= c_0\, d_0 \iint E_{z_u}\left[g_1(T(z) \mid M_{A_z}, \alpha_0)\, g_2(T(y, z) \mid M_{A_y}, \lambda_0)\right] d\alpha_0\, d\lambda_0, \tag{3–20}
\end{aligned}
\]
where the last equality follows from Fubini's theorem, since $m(v \mid M_{A_z}, \alpha_0)$ and $m(w \mid M_{A_y}, \lambda_0)$ are proper densities. From 3–20, the posterior odds are
\[
\frac{p(M_A \mid y, z_o)}{p(M_0 \mid y, z_o)} = \frac{\iint E_{z_u}\left[g_1(T(z) \mid M_{A_z}, \alpha_0)\, g_2(T(y, z) \mid M_{A_y}, \lambda_0)\right] d\alpha_0\, d\lambda_0}{\iint E_{z_u}\left[g_1(T(z) \mid M_{0z}, \alpha_0)\, g_2(T(y, z) \mid M_{0y}, \lambda_0)\right] d\alpha_0\, d\lambda_0}\, \frac{\pi(M_A)}{\pi(M_0)}. \tag{3–21}
\]
3.5 Simulation Experiments

The proposed methodology was tested under 36 different scenarios, in which we evaluate the behavior of the algorithm by varying the number of sites, the number of surveys, the amount of signal in the predictors for the presence component, and finally, the amount of signal in the predictors for the detection component.

For each model component, the base model is taken to be the intercept-only model, and the full models considered for the presence and the detection have, respectively, 30 and 20 predictors. Therefore, the model space contains $2^{30} \times 2^{20} \approx 1.12 \times 10^{15}$ candidate models.

To control the amount of signal in the presence and detection components, values for the model parameters were purposefully chosen so that quantiles 10, 50, and 90 of the occupancy and detection probabilities match some pre-specified probabilities. Because presence and detection are binary variables, the amount of signal in each model component is associated with the spread and center of the distribution of the occupancy and detection probabilities, respectively. Low signal levels correspond to occupancy or detection probabilities close to 0.5; high signal levels are associated with probabilities close to 0 or 1. Large spreads of the distributions of the occupancy and detection probabilities reflect greater heterogeneity among the observations collected, improving the discrimination capability of the model, and vice versa.
Therefore, for the presence component, the parameter values of the true model were chosen to set the median of the occupancy probabilities equal to 0.5. The chosen parameter values also fix quantiles 10 and 90 symmetrically about 0.5, at small ($Q^z_{10} = 0.3$, $Q^z_{90} = 0.7$), intermediate ($Q^z_{10} = 0.2$, $Q^z_{90} = 0.8$), and large ($Q^z_{10} = 0.1$, $Q^z_{90} = 0.9$) distances. For the detection component, the model parameters are obtained to reflect detection probabilities concentrated about low values ($Q^y_{50} = 0.2$), intermediate values ($Q^y_{50} = 0.5$), and high values ($Q^y_{50} = 0.8$), while keeping quantiles 10 and 90 fixed at 0.1 and 0.9, respectively.
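To illustrate how such parameter values can be obtained, consider the simplified case of a single standard normal covariate $x$ with $p = \Phi(\alpha_0 + \alpha_1 x)$: the quantiles of $p$ are then $\Phi(\alpha_0 + \alpha_1 \Phi^{-1}(q))$ for $\alpha_1 > 0$, so matching the 50th and 90th percentiles has a closed-form solution. This one-covariate sketch is our own simplification; the actual scenarios use many covariates:

```python
from scipy.stats import norm

def probit_coefs(q50, q90):
    """Solve Phi(a0 + a1 * Phi^{-1}(0.5)) = q50 and
    Phi(a0 + a1 * Phi^{-1}(0.9)) = q90 for (a0, a1)."""
    a0 = norm.ppf(q50)  # Phi^{-1}(0.5) = 0, so a0 is pinned down by q50
    a1 = (norm.ppf(q90) - a0) / norm.ppf(0.9)
    return a0, a1

a0, a1 = probit_coefs(0.5, 0.9)
# Implied 10th percentile of the occupancy probability:
q10 = norm.cdf(a0 + a1 * norm.ppf(0.1))
```

With the median fixed at 0.5, the implied 10th percentile comes out symmetric ($q_{10} = 0.1$ when $Q^z_{90} = 0.9$), matching the symmetric designs in Table 3-1.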
Table 3-1. Simulation control parameters, occupancy model selector

    Parameter                        Values considered
    N                                50, 100
    J                                3, 5
    (Q^z_10, Q^z_50, Q^z_90)         (0.3, 0.5, 0.7), (0.2, 0.5, 0.8), (0.1, 0.5, 0.9)
    (Q^y_10, Q^y_50, Q^y_90)         (0.1, 0.2, 0.9), (0.1, 0.5, 0.9), (0.1, 0.8, 0.9)
There are in total 36 scenarios; these result from crossing all the levels of the simulation control parameters (Table 3-1). Under each of these scenarios, 20 data sets were generated at random. True presence and detection indicators were generated with the probit model formulation from Chapter 2, with the assumed true models $M_{Tz} = \{1, x_2, x_{15}, x_{16}, x_{22}, x_{28}\}$ for the presence and $M_{Ty} = \{1, q_7, q_{10}, q_{12}, q_{17}\}$ for the detection, with the predictors included in the randomly generated data sets. In this context, 1 represents the intercept term. Throughout this section, we refer to predictors included in the true models as true predictors, and to those absent as false predictors.

The selection procedure was conducted using each one of these data sets with two different priors on the model space: the uniform, or equal probability, prior and a multiplicity correcting prior.

The results are summarized through the marginal posterior inclusion probabilities (MPIPs) for each predictor, and also through the five highest posterior probability models (HPM). The MPIP for a given predictor, under a specific scenario and for a particular data set, is defined as
\[
p(\text{predictor is included} \mid y, z, w, v) = \sum_{M \in \mathcal{M}} I_{(\text{predictor} \in M)}\, p(M \mid y, z, w, v, \mathcal{M}). \tag{3–22}
\]
In addition, we compare the MPIP odds between predictors present in the true model and predictors absent from it. Specifically, we consider the minimum odds of the marginal posterior inclusion probabilities for the predictors. Let $\tilde{\xi}$ and $\xi$ denote, respectively, a predictor in the true model $M_T$ and a predictor absent from $M_T$. We define the minimum MPIP odds between the probabilities of true and false predictors as
\[
\text{minOdds}_{\text{MPIP}} = \frac{\min_{\tilde{\xi} \in M_T}\, p(I_{\tilde{\xi}} = 1 \mid \tilde{\xi} \in M_T)}{\max_{\xi \notin M_T}\, p(I_{\xi} = 1 \mid \xi \notin M_T)}. \tag{3–23}
\]
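Given draws of the model indicators from the stochastic search, both the MPIP in (3–22) and the minimum MPIP odds in (3–23) reduce to simple frequency computations over the sampled models. An illustrative sketch (helper names are ours):

```python
import numpy as np

def mpip(model_draws, K):
    """Monte Carlo estimate of equation (3-22): inclusion frequency of each
    predictor 1..K across posterior model draws (sets of predictor indices)."""
    return np.array([np.mean([k in m for m in model_draws]) for k in range(1, K + 1)])

def min_odds_mpip(model_draws, K, true_model):
    """Equation (3-23): MPIP of the least-included true predictor divided by
    the MPIP of the most-included false predictor."""
    p = mpip(model_draws, K)
    true_p = [p[k - 1] for k in range(1, K + 1) if k in true_model]
    false_p = [p[k - 1] for k in range(1, K + 1) if k not in true_model]
    return min(true_p) / max(false_p)
```

For instance, draws `[{1, 2}, {1, 2}, {1, 3}, {1, 2}]` with true model `{1, 2}` give MPIPs (1, 0.75, 0.25) and minimum MPIP odds 0.75/0.25 = 3.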
If the variable selection procedure adequately discriminates between true and false predictors, minOdds_MPIP will take values larger than one. The ability of the method to discriminate between the least probable true predictor and the most probable false predictor worsens as the indicator approaches 0.

3.5.1 Marginal Posterior Inclusion Probabilities for Model Predictors
For clarity, in Figures 3-1 through 3-5 only the predictors in the true models are labeled, and these are emphasized with a dotted line passing through them. The left-hand-side plots in these figures contain the results for the presence component, and the ones on the right correspond to predictors in the detection component. The results obtained with the uniform model priors correspond to the black lines, and those for the multiplicity correcting prior are in red. In these figures, the MPIPs have been averaged over all data sets from the scenarios matching the condition observed.

In Figure 3-1 we contrast the mean MPIPs of the predictors over all data sets from scenarios with 50 sites to the mean MPIPs obtained for the scenarios with 100 sites. Similarly, Figure 3-2 compares the mean MPIPs of scenarios where 3 surveys are performed to those of scenarios having 5 surveys per site. Figures 3-4 and 3-5 show the effect of the different levels of signal considered in the occupancy probabilities and in the detection probabilities.

From these figures, mainly three results can be drawn: (1) the effect of the model prior is substantial; (2) the proposed methods yield MPIPs that clearly separate true predictors from false predictors; and (3) the separation between the MPIPs of true predictors and false predictors is noticeably larger in the detection component.
Regardless of the simulation scenario and model component observed, under the uniform prior false predictors obtain a relatively high MPIP. Conversely, the multiplicity correction prior strongly shrinks the MPIP for false predictors towards 0. In the presence component, the MPIP for the true predictors is shrunk substantially under the multiplicity prior; however, there remains a clear separation between true and false predictors. In contrast, in the detection component, the MPIP for true predictors remains relatively high (Figures 3-1 through 3-5).
[Figure 3-1. Predictor MPIP averaged over scenarios with N = 50 and N = 100 sites, using uniform (U) and multiplicity correction (MC) priors.]
[Figure 3-2. Predictor MPIP averaged over scenarios with J = 3 and J = 5 surveys per site, using uniform (U) and multiplicity correction (MC) priors.]
[Figure 3-3. Predictor MPIP averaged over scenarios with the interaction between the number of sites and the surveys per site, using uniform (U) and multiplicity correction (MC) priors.]
[Figure 3-4. Predictor MPIP averaged over scenarios with equal signal in the occupancy probabilities, using uniform (U) and multiplicity correction (MC) priors.]
[Figure 3-5. Predictor MPIP averaged over scenarios with equal signal in the detection probabilities, using uniform (U) and multiplicity correction (MC) priors.]
In scenarios where more sites were surveyed the separation between the MPIP of
true and false predictors grew in both model components (Figure 3-1) Increasing the
number of sites has an effect over both components given that every time a new site is
included covariate information is added to the design matrix of both the presence and
the detection components
On the other hand, increasing the number of surveys affects the MPIP of predictors in the detection component (Figures 3-2 and 3-3) but has only a marginal effect on predictors in the presence component. This may appear counterintuitive; however, increasing the number of surveys only increases the number of observations in the design matrix for the detection component, while leaving the design matrix for the presence component unaltered. The small changes observed in the MPIP for the presence predictors as J increases are exclusively a result of having additional detection indicators equal to 1 at sites that, with fewer surveys, would only have 0-valued detections.
From Figure 3-3 it is clear that, for the presence component, the effect of the number of sites dominates the behavior of the MPIP, especially when using the multiplicity correction priors. In the detection component, the MPIP is influenced by both the number of sites and the number of surveys. The influence of increasing the number of surveys is larger when considering a smaller number of sites, and vice versa.
Regarding the effect of the distribution for the occupancy probabilities, it is mostly the presence component that is affected: the discrimination between true and false predictors becomes stronger as the distribution becomes more variable (Figure 3-4; see also Table 3-3). This is consistent with intuition, since having the presence probabilities concentrated about 0.5 implies that the predictors do not vary much from one site to the next, whereas having the occupancy probabilities more spread out has the opposite effect.
Finally, consider the effect of concentrating the detection probabilities about high or low values. For predictors in the detection component, the separation between the MPIP of true and false predictors is larger in scenarios where the distribution of the detection probability is centered about 0.2 or 0.8 than in scenarios where this distribution is centered about 0.5 (where the signal of the predictors is weakest). For predictors in the presence component, having the detection probabilities centered at higher values slightly increases the inclusion probabilities of the true predictors and reduces those of the false predictors (Figure 3-5).
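As a concrete illustration (not code from the dissertation), the MPIP of a predictor is simply the proportion of posterior draws in which its inclusion indicator equals 1; a minimal sketch with hypothetical draws:

```python
import numpy as np

def mpip(indicator_draws):
    """Marginal posterior inclusion probability of each predictor: the
    fraction of posterior draws in which its inclusion indicator is 1."""
    return np.asarray(indicator_draws).mean(axis=0)

# Toy draws for three predictors (rows = Gibbs iterations): the first
# is almost always in the sampled model, the last almost never.
rng = np.random.default_rng(1)
draws = rng.binomial(1, [0.9, 0.5, 0.1], size=(5000, 3))
probs = mpip(draws)
```

With draws from an actual Gibbs run over models, `probs` would reproduce the bar heights summarized in Figures 3-1 through 3-5.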
Table 3-2. Comparison of average minOddsMPIP under scenarios having different numbers of sites (N=50, N=100) and different numbers of surveys per site (J=3, J=5), for the presence and detection components, using uniform and multiplicity correction priors.

Comp        π(M)    N=50     N=100    J=3      J=5
Presence    Unif    1.12     1.31     1.19     1.24
Presence    MC      3.20     8.46     4.20     6.74
Detection   Unif    2.03     2.64     2.11     2.57
Detection   MC      21.15    32.46    21.39    32.52
Table 3-3 Comparison of average minOddsMPIP for different levels of signal consideredin the occupancy and detection probabilities for the presence and detectioncomponents using uniform and multiplicity correction priors
(Qz10Q
z50Q
z90) (Qy
10Qy50Q
y90)
Comp π(M) (030507) (020508) (010509) (010209) (010509) (010809)
Presence Unif 105 120 134 110 123 124MC 202 455 805 238 619 640
Detection Unif 234 234 230 257 200 238MC 2537 2077 2528 2933 1852 2849
The separation between the MPIP of true and false predictors is even more evident in Tables 3-2 and 3-3, where the minimum MPIP odds between true and false predictors are shown. Under every scenario, the value of the minOddsMPIP (as defined in Equation 3-23) was greater than 1, implying that, on average, even the lowest MPIP for a true predictor is higher than the maximum MPIP for a false predictor. In both components of the model, the minOddsMPIP are markedly larger under the multiplicity correction prior and increase with the number of sites and with the number of surveys.
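Reading the minOddsMPIP statistic as the ratio of the smallest MPIP among true predictors to the largest MPIP among false predictors (our paraphrase of Equation 3-23; the values below are hypothetical), it can be computed as:

```python
import numpy as np

def min_odds_mpip(mpip, is_true):
    """Smallest MPIP among true predictors divided by the largest MPIP
    among false predictors; a value above 1 means every true predictor
    outranks every false one."""
    mpip = np.asarray(mpip, dtype=float)
    mask = np.asarray(is_true, dtype=bool)
    return mpip[mask].min() / mpip[~mask].max()

# Hypothetical MPIPs: three true predictors followed by two false ones.
odds = min_odds_mpip([0.92, 0.88, 0.81, 0.15, 0.07],
                     [True, True, True, False, False])
```

A value such as 0.81/0.15 = 5.4 corresponds to the kind of separation reported in Tables 3-2 and 3-3.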
For the presence component, increasing the signal in the occupancy probabilities, or having the detection probabilities concentrate about higher values, has a considerable positive effect on the magnitude of the odds. For the detection component, these odds are particularly high, especially under the multiplicity correction prior. Also, having the distribution for the detection probabilities center about low or high values increases the minOddsMPIP.

3.5.2 Summary Statistics for the Highest Posterior Probability Model
Tables 3-4 through 3-7 show the number of true predictors included in the HPM (True +) and the number of false predictors excluded from it (True −). The mean percentages observed in these tables deliver one clear message: the highest probability models chosen with either model prior commonly differ from the corresponding true models. The multiplicity correction prior's strong shrinkage allows only a few true predictors to be selected, but at the same time it prevents any false predictors from being included in the HPM. On the other hand, the uniform prior includes in the HPM a larger proportion of true predictors, but at the expense of also introducing a large number of false predictors. This situation is exacerbated in the presence component, but also occurs, to a lesser extent, in the detection component.
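The True + and True − summaries can be computed from the selected and true predictor sets with simple set operations; a sketch with hypothetical predictor names:

```python
def hpm_rates(hpm, true_model, candidates):
    """True + : share of true predictors included in the HPM.
    True - : share of false predictors excluded from the HPM."""
    hpm, true_model, candidates = set(hpm), set(true_model), set(candidates)
    false_preds = candidates - true_model
    true_pos = len(hpm & true_model) / len(true_model)
    true_neg = len(false_preds - hpm) / len(false_preds)
    return true_pos, true_neg

# An HPM that recovers two of four true predictors and no false ones.
tp, tn = hpm_rates(hpm={"x2", "x15"},
                   true_model={"x2", "x15", "x22", "x28"},
                   candidates={f"x{i}" for i in range(1, 31)})
```

This mirrors the pattern in Tables 3-4 through 3-7 for the multiplicity correction prior: a modest True + rate paired with a perfect True − rate.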
Table 3-4. Comparison between scenarios with 50 and 100 sites in terms of the average percentage of true positive and true negative terms over the highest probability models, for the presence and detection components, using uniform and multiplicity correcting priors on the model space.

                    True +             True −
Comp        π(M)    N=50     N=100     N=50     N=100
Presence    Unif    0.57     0.63      0.51     0.55
Presence    MC      0.06     0.13      1.00     1.00
Detection   Unif    0.77     0.85      0.87     0.93
Detection   MC      0.49     0.70      1.00     1.00
Having more sites or surveys improves the inclusion of true predictors and the exclusion of false ones in the HPM, for both the presence and detection components (Tables 3-4 and 3-5). On the other hand, if the distribution for the occupancy probabilities is more
Table 3-5. Comparison between scenarios with 3 and 5 surveys per site in terms of the percentage of true positive and true negative predictors averaged over the highest probability models, for the presence and detection components, using uniform and multiplicity correcting priors on the model space.

                    True +           True −
Comp        π(M)    J=3     J=5      J=3     J=5
Presence    Unif    0.59    0.61     0.52    0.54
Presence    MC      0.08    0.10     1.00    1.00
Detection   Unif    0.78    0.85     0.87    0.92
Detection   MC      0.50    0.68     1.00    1.00
spread out, the HPM includes more true predictors and fewer false ones in the presence component. In contrast, the effect of the spread of the occupancy probabilities on the detection HPM is negligible (Table 3-6). Finally, there is a positive relationship between the location of the median of the detection probabilities and the number of correctly classified true and false predictors for the presence component. The HPM in the detection part of the model responds positively to low and high values of the median detection probability (increased signal levels) in terms of correctly classified true and false predictors (Table 3-7).
Table 3-6. Comparison between scenarios with different levels of signal in the occupancy component in terms of the percentage of true positive and true negative predictors averaged over the highest probability models, for the presence and detection components, using uniform and multiplicity correcting priors on the model space.

                    True +                                          True −
Comp        π(M)    (0.3,0.5,0.7)  (0.2,0.5,0.8)  (0.1,0.5,0.9)    (0.3,0.5,0.7)  (0.2,0.5,0.8)  (0.1,0.5,0.9)
Presence    Unif    0.55           0.61           0.64             0.50           0.54           0.55
Presence    MC      0.02           0.08           0.18             1.00           1.00           1.00
Detection   Unif    0.81           0.82           0.81             0.90           0.89           0.89
Detection   MC      0.57           0.61           0.59             1.00           1.00           1.00
3.6 Case Study: Blue Hawker Data Analysis
During 1999 and 2000, an intensive volunteer surveying effort coordinated by the Centre Suisse de Cartographie de la Faune (CSCF) was conducted in order to analyze the distribution of the blue hawker, Aeshna cyanea (Odonata: Aeshnidae), a common dragonfly in Switzerland. Given that Switzerland is a small and mountainous country,
Table 3-7. Comparison between scenarios with different levels of signal in the detection component in terms of the percentage of true positive and true negative predictors averaged over the highest probability models, for the presence and detection components, using uniform and multiplicity correcting priors on the model space.

                    True +                                          True −
Comp        π(M)    (0.1,0.2,0.9)  (0.1,0.5,0.9)  (0.1,0.8,0.9)    (0.1,0.2,0.9)  (0.1,0.5,0.9)  (0.1,0.8,0.9)
Presence    Unif    0.59           0.59           0.62             0.51           0.54           0.54
Presence    MC      0.06           0.10           0.11             1.00           1.00           1.00
Detection   Unif    0.89           0.77           0.78             0.91           0.87           0.91
Detection   MC      0.70           0.48           0.59             1.00           1.00           1.00
there is large variation in its topography and physio-geography; as such, elevation is a good candidate covariate for predicting species occurrence at a large spatial scale. It can be used as a proxy for habitat type, intensity of land use, temperature, and some biotic factors (Kery et al. 2010).
Repeated visits to 1-ha pixels took place to obtain the corresponding detection histories. In addition to the survey outcome, the x- and y-coordinates, the thermal level, the date of the survey, and the elevation were recorded. Surveys were restricted to the known flight period of the blue hawker, which takes place between May 1 and October 10. In total, 2,572 sites were surveyed at least once during the surveying period. The number of surveys per site ranged from 1 to 22 within each survey year.
Kery et al. (2010) summarize the results of this effort using AIC-based model comparisons: first, following a backwards elimination approach for the detection process while keeping the occupancy component fixed at the most complex model, and then, for the presence component, choosing among a group of three models while using the detection model already chosen. In our analysis of this dataset, for the detection and the presence components we consider as full models those used in Kery et al. (2010), namely
Φ^{-1}(ψ) = α0 + α1·year + α2·elev + α3·elev^2 + α4·elev^3

Φ^{-1}(p) = λ0 + λ1·year + λ2·elev + λ3·elev^2 + λ4·elev^3 + λ5·date + λ6·date^2,
where year = I{year = 2000} is the indicator for surveys conducted in year 2000.
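As a sketch (not the dissertation's actual code, and with made-up covariate values), the two full design matrices could be assembled column by column as follows:

```python
import numpy as np

def full_design_matrices(year, elev, date):
    """Design matrices for the two full models: an intercept, the year
    indicator, a cubic polynomial in elevation, and (for detection only)
    a quadratic polynomial in the survey date."""
    yr = (np.asarray(year) == 2000).astype(float)   # year = I{year = 2000}
    elev = np.asarray(elev, dtype=float)
    date = np.asarray(date, dtype=float)
    ones = np.ones_like(elev)
    Z_presence = np.column_stack([ones, yr, elev, elev**2, elev**3])
    Z_detection = np.column_stack([ones, yr, elev, elev**2, elev**3,
                                   date, date**2])
    return Z_presence, Z_detection

# Three hypothetical records (elevation and date assumed centered/scaled).
Zp, Zd = full_design_matrices(year=[1999, 2000, 2000],
                              elev=[0.1, -0.3, 0.7],
                              date=[-1.0, 0.0, 1.0])
```

The 5 and 7 columns of these matrices are exactly the candidate terms whose inclusion indicators are sampled below.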
The model spaces for these data contain 2^6 = 64 and 2^4 = 16 models, respectively, for the detection and occupancy components. That is, in total the model space contains 2^{4+6} = 1,024 models. Although this model space can be enumerated entirely, for illustration we implemented the algorithm from Section 3.3.4, generating 10,000 draws from the Gibbs sampler. Each of the models sampled was chosen from the set of models that could be reached by changing the state of a single term in the current model (to inclusion or exclusion, accordingly). This allows a more thorough exploration of the model space because, for each of the 10,000 models drawn, the posterior probabilities of many more models can be observed. Below, the labels for the predictors are followed by either "z" or "y", accordingly, to represent the component they pertain to. Finally, using the results from the model selection procedure, we conducted a validation step to determine the predictive accuracy of the HPMs and of the median probability models (MPMs). The performance of these models is then contrasted with that of the model ultimately selected by Kery et al. (2010).

3.6.1 Results: Variable Selection Procedure
The model finally chosen for the presence component in Kery et al. (2010) was not found among the five highest probability models under either model prior (Table 3-8). Moreover, the year indicator was never chosen under the multiplicity correcting prior, hinting that this term might correspond to a falsely identified predictor under the uniform prior. Results in Table 3-10 support this claim: the marginal posterior inclusion probability for the year predictor is 7% under the multiplicity correction prior. The multiplicity correction prior also concentrates the model posterior probability mass more densely in the highest ranked models (90% of the mass is in the top five models) than the uniform prior does (under which the top five models account for 40% of the mass).
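The concentration-of-mass comparison can be recomputed directly from the posterior probabilities reported in Table 3-8 (a trivial check, included here only to make the 90% vs. 40% figures concrete):

```python
def top_mass(model_probs, k=5):
    """Posterior probability mass captured by the k highest-probability models."""
    return sum(sorted(model_probs, reverse=True)[:k])

# Posterior probabilities of the top five presence-component models (Table 3-8).
unif = [0.10, 0.08, 0.08, 0.07, 0.07]
mult_corr = [0.53, 0.15, 0.09, 0.06, 0.05]
```

Summing gives roughly 0.40 under the uniform prior and 0.88 (about 90%) under the multiplicity correction prior.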
For the detection component, the HPM under both priors is the intercept-only model, which we represent in Table 3-9 with a blank label. In both cases this model obtains very
Table 3-8. Posterior probability of the five highest probability models in the presence component of the blue hawker data.

Uniform model prior
Rank    Mz selected             p(Mz | y)
1       yrz+elevz               0.10
2       yrz+elevz+elevz3        0.08
3       elevz2+elevz3           0.08
4       yrz+elevz2              0.07
5       yrz+elevz3              0.07

Multiplicity correcting model prior
Rank    Mz selected             p(Mz | y)
1       elevz+elevz3            0.53
2       (intercept only)        0.15
3       elevz+elevz2            0.09
4       elevz2                  0.06
5       elevz+elevz2+elevz3     0.05
high posterior probabilities. The terms contained in the cubic polynomial for the elevation appear to carry some relevant information; however, this conflicts with the MPIPs observed in Table 3-11, which under both model priors are relatively low (< 20% with the uniform prior and ≤ 4% with the multiplicity correcting prior).
Table 3-9. Posterior probability of the five highest probability models in the detection component of the blue hawker data.

Uniform model prior
Rank    My selected         p(My | y)
1       (intercept only)    0.45
2       elevy3              0.06
3       elevy2              0.05
4       elevy               0.05
5       yry                 0.04

Multiplicity correcting model prior
Rank    My selected         p(My | y)
1       (intercept only)    0.86
2       elevy3              0.02
3       datey2              0.02
4       elevy2              0.02
5       yry                 0.02
Finally, it is possible to use the MPIPs to obtain the median probability model, which contains the terms whose MPIP is at least 50%. For the occupancy process (Table 3-10), under the uniform prior, the model including the year, the elevation, and the elevation cubed is obtained. The MPM under the multiplicity correction prior coincides with the HPM from that prior. The MPM chosen for the detection component (Table 3-11) under both priors is the intercept-only model, coinciding again with the HPM.
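The MPM rule is a one-line filter on the MPIPs; a sketch using the presence-component values of Table 3-10 (with the Barbieri-Berger convention of including a term when its MPIP is at least 1/2):

```python
def median_probability_model(mpip):
    """Terms whose marginal posterior inclusion probability is at least 1/2
    (the Barbieri-Berger median probability model rule)."""
    return {term for term, p in mpip.items() if p >= 0.5}

# Presence-component MPIPs under the uniform prior (Table 3-10).
unif_presence = {"yrz": 0.53, "elevz": 0.51, "elevz2": 0.45, "elevz3": 0.50}
mpm = median_probability_model(unif_presence)
```

This recovers yrz, elevz, and elevz3, matching the MPM reported in the text; note that elevz3 enters only because the rule is "at least" rather than "strictly greater than" 1/2.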
Given the outcomes of the simulation studies from Section 3.5, especially those pertaining to the detection component, the results in Table 3-11 appear to indicate that none of the predictors considered belongs to the true model, especially when considering those derived with the multiplicity correction prior. On the other hand, for the presence component (Table 3-10), there is an indication that terms related to the cubic polynomial in elevz can explain the occupancy patterns.

Table 3-10. MPIP, presence component.

Predictor    p(predictor ∈ MTz | y, z, w, v)
             Unif     MultCorr
yrz          0.53     0.07
elevz        0.51     0.73
elevz2       0.45     0.23
elevz3       0.50     0.67

Table 3-11. MPIP, detection component.

Predictor    p(predictor ∈ MTy | y, z, w, v)
             Unif     MultCorr
yry          0.19     0.03
elevy        0.18     0.03
elevy2       0.18     0.03
elevy3       0.19     0.04
datey        0.16     0.03
datey2       0.15     0.04

3.6.2 Validation for the Selection Procedure
Approximately half of the sites were selected at random for training (i.e., for model selection and parameter estimation) and the remaining half were used as test data. In the previous section we observed that, using the marginal posterior inclusion probabilities of the predictors, our method effectively separates predictors in the true model from those that are not in it. However, in Tables 3-10 and 3-11 this separation is only clear for the presence component using the multiplicity correction prior.
Therefore, in the validation procedure we examine the misclassification rates for the detections using the following models: (1) the model ultimately recommended in Kery et al. (2010) (yrz+elevz+elevz2+elevz3 + elevy+elevy2+datey+datey2); (2) the highest probability model (HPM) with a uniform prior (yrz+elevz); (3) the HPM with a multiplicity correcting prior (elevz+elevz3); (4) the median probability model (MPM), that is, the model including only predictors with a MPIP of at least 50%, with the uniform prior (yrz+elevz+elevz3); and finally (5) the MPM with a multiplicity correction prior (elevz+elevz3, the same as the HPM with multiplicity correction).
We must emphasize that the models resulting from the implementation of our model selection procedure used exclusively the training dataset. On the other hand, the model in Kery et al. (2010) was chosen to minimize the prediction error of the complete data. Because this model was obtained from the full dataset, results derived from it can only be considered a lower bound for the prediction errors.
The benchmark misclassification error rate for true 1's is high (close to 70%). However, the misclassification rate for true 0's, which account for most of the responses, is much lower (15%). Overall, the performance of the selected models is comparable: they yield considerably worse results than the benchmark for the true 1's, but achieve rates close to the benchmark for the true 0's. Pooling together the results for true ones and true zeros, the selected models under either prior have misclassification rates close to 30%, while the benchmark model attains a joint misclassification error of 23% (Table 3-12).
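The three error rates in Table 3-12 can be computed from the observed and predicted detections as follows (a minimal sketch with toy vectors, not the validation code itself):

```python
import numpy as np

def misclassification_rates(y_true, y_pred):
    """Error rates for true 1's, true 0's, and all responses pooled."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    rate_1 = float(np.mean(y_pred[y_true == 1] != 1))
    rate_0 = float(np.mean(y_pred[y_true == 0] != 0))
    joint = float(np.mean(y_pred != y_true))
    return rate_1, rate_0, joint

# Five toy detections: one missed 1 and one false 1.
r1, r0, rj = misclassification_rates([1, 1, 0, 0, 0], [0, 1, 0, 0, 1])
```

Because 0's dominate the responses, the joint rate is pulled toward the true-0 rate, which is why the benchmark's joint error (23%) sits much closer to its true-0 error (15%) than to its true-1 error (66%).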
Table 3-12. Mean misclassification rates for the HPMs and MPMs using uniform and multiplicity correction model priors.

Model                             Terms                                                True 1   True 0   Joint
Benchmark (Kery et al. 2010)      yrz+elevz+elevz2+elevz3+elevy+elevy2+datey+datey2    0.66     0.15     0.23
HPM Unif                          yrz+elevz                                            0.83     0.17     0.28
HPM MC (= MPM MC)                 elevz+elevz3                                         0.82     0.18     0.28
MPM Unif                          yrz+elevz+elevz3                                     0.82     0.18     0.29
3.7 Discussion
In this Chapter we proposed an objective and fully automatic Bayesian methodology for the single-season site-occupancy model. The methodology is said to be fully automatic because no hyper-parameter specification is necessary in defining the parameter priors, and objective because it relies on intrinsic priors derived from noninformative priors. The intrinsic priors have been shown to have desirable properties as testing priors. We also proposed a fast stochastic search algorithm to explore large model spaces using our model selection procedure.
Our simulation experiments demonstrated the ability of the method to single out the predictors present in the true model when considering their marginal posterior inclusion probabilities. For predictors in the true model, these probabilities were comparatively larger than for predictors absent from it. The simulations also indicated that the method has greater discrimination capability for predictors in the detection component of the model, especially when using multiplicity correction priors.
Multiplicity correction priors were not described in detail in this Chapter; however, their influence on the selection outcome is significant. This behavior was observed both in the simulation experiment and in the analysis of the blue hawker data. Model priors play an essential role: as the number of predictors grows, they are instrumental in controlling the selection of false positive predictors. Additionally, model priors can be used to account for predictor structure in the selection process, which helps both to reduce the size of the model space and to make the selection more robust. These issues are the topic of the next Chapter.
Accounting for the polynomial hierarchy in the predictors within the occupancy context is a straightforward extension of the procedures we describe in Chapter 4; hence, our next step is to develop efficient software for it. An additional direction we plan to pursue is the development of methods for occupancy variable selection in a multivariate setting. These can be used to conduct hypothesis testing in scenarios with conditions varying through time, or where multiple species are co-observed. A final variation we will investigate is occupancy model selection incorporating random effects.
CHAPTER 4
PRIORS ON THE MODEL SPACE AND WELL-FORMULATED MODELS

It has long been an axiom of mine that the little things are infinitely the most important.

– Sherlock Holmes, A Case of Identity
4.1 Introduction
In regression problems, if a large number of potential predictors is available, the complete model space is too large to enumerate, and automatic selection algorithms are necessary to find informative, parsimonious models. This multiple testing problem is difficult, and even more so when interactions or powers of the predictors are considered. In the ecological literature, models with interactions and/or higher-order polynomial terms are ubiquitous (Johnson et al. 2013; Kery et al. 2010; Zeller et al. 2011), given the complexity and non-linearities found in ecological processes. Several model selection procedures, even in the classical normal linear setting, fail to address two fundamental issues: (1) the model selection outcome is not invariant to affine transformations when interactions or polynomial structures are found among the predictors, and (2) additional penalization is required to control for false positives as the model space grows (i.e., as more covariates are considered).
These two issues motivate the developments presented throughout this Chapter. Building on the results of Chipman (1996), we propose, investigate, and provide recommendations for three different prior distributions on the model space. These priors help control for test multiplicity while accounting for polynomial structure in the predictors. They improve upon those proposed by Chipman, first by avoiding the need to specify values for the prior inclusion probabilities of the predictors, and second by formulating principled alternatives to introduce additional structure into the model priors. Finally, we design a stochastic search algorithm that allows fast and thorough exploration of model spaces with polynomial structure.
Having structure in the predictors can determine the selection outcome. As an illustration, consider the model E[y] = β00 + β01·x2 + β20·x1^2, where the order-one term x1 is not present (this choice of subscripts for the coefficients is defined in the following section). Transforming x1 → x1* = x1 + c for some c ≠ 0, the model becomes E[y] = β00 + β01·x2 + β20*·(x1*)^2. Note that, in terms of the original predictors, (x1*)^2 = x1^2 + 2c·x1 + c^2, implying that this seemingly innocuous transformation of x1 modifies the column space of the design matrix by including x1, which was not in the original model. That is, when lower order terms in the hierarchy are omitted from the model, the column space of the design matrix is not invariant to affine transformations. As the hat matrix depends on the column space, the model's predictive capability is also affected by how the covariates in the model are coded, an undesirable feature for any model selection procedure. To make model selection invariant to affine transformations, the selection must be constrained to the subset of models that respect the hierarchy (Griepentrog et al. 1982; Khuri 2002; McCullagh & Nelder 1989; Nelder 2000; Peixoto 1987, 1990). These models are known as well-formulated models (WFMs). Succinctly, a model is well-formulated if, for any predictor in the model, every lower order predictor associated with it is also in the model. The model above is not well-formulated, as it contains x1^2 but not x1.
WFMs exhibit strong heredity, in that all lower order terms dividing higher order terms in the model must also be included. An alternative is to require only weak heredity (Chipman 1996), which forces only some of the lower terms in the corresponding polynomial hierarchy to be in the model. However, Nelder (1998) demonstrated that the conditions under which weak heredity allows the design matrix to be invariant to affine transformations of the predictors are too restrictive to be useful in practice.
Although this topic appeared in the literature more than three decades ago (Nelder 1977), only recently have modern variable selection techniques been adapted to account for the constraints imposed by heredity. As described in Bien et al. (2013), the current literature on variable selection for polynomial response surface models can be classified into three broad groups: multi-step procedures (Brusco et al. 2009; Peixoto 1987), regularized regression methods (Bien et al. 2013; Yuan et al. 2009), and Bayesian approaches (Chipman 1996). The methods introduced in this Chapter take a Bayesian approach towards variable selection for well-formulated models, with particular emphasis on model priors.
As mentioned in previous chapters, the Bayesian variable selection problem consists of finding models with high posterior probabilities within a pre-specified model space ℳ. The model posterior probability for M ∈ ℳ is given by

p(M | y, ℳ) ∝ m(y | M) π(M | ℳ).   (4-1)
Model posterior probabilities depend on the prior distribution over the model space, as well as on the prior distributions for the model-specific parameters, implicitly through the marginals m(y | M). Priors on the model-specific parameters have been extensively discussed in the literature (Berger & Pericchi 1996; Berger et al. 2001; George 2000; Jeffreys 1961; Kass & Wasserman 1996; Liang et al. 2008; Zellner & Siow 1980). In contrast, the effect of the prior on the model space has, until recently, been neglected. A few authors (e.g., Casella et al. (2014); Scott & Berger (2010); Wilson et al. (2010)) have highlighted the relevance of the priors on the model space in the context of multiple testing. Adequately formulating priors on the model space can both account for structure in the predictors and provide additional control over the detection of false positive terms. In addition, using the popular uniform prior over the model space may lead to the undesirable and "informative" implication of favoring models of size p/2 (where p is the total number of covariates), since this is the most abundant model size in the model space.
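The size bias of the uniform prior follows directly from counting: there are C(p, k) models of size k, so the implied size distribution is binomial and peaks at p/2. A small sketch makes this concrete:

```python
from math import comb

def size_distribution(p):
    """Distribution of model size implied by a uniform prior over all
    2**p subsets of p candidate covariates."""
    total = 2 ** p
    return [comb(p, k) / total for k in range(p + 1)]

probs = size_distribution(10)
# The most favored model size under the "non-informative" uniform prior.
modal_size = max(range(len(probs)), key=lambda k: probs[k])
```

For p = 10 the modal size is 5, and models of size 0 or 10 receive prior probability 2^{-10} each, which is the sense in which the uniform prior is far from non-informative about model size.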
Variable selection within the model space of well-formulated polynomial models poses two challenges for automatic objective model selection procedures. First, the notion of model complexity takes on a new dimension: complexity is not exclusively a function of the number of predictors, but also depends upon the depth and connectedness of the associations defined by the polynomial hierarchy. Second, because the model space is shaped by such relationships, stochastic search algorithms used to explore the models must also conform to these restrictions.
Models without polynomial hierarchy constitute a special case of WFMs in which all predictors are of order one. Hence, all the methods developed throughout this Chapter also apply to models with no predictor structure. Additionally, although our proposed methods are presented for the normal linear case to simplify the exposition, they are general enough to be embedded in many Bayesian selection and averaging procedures, including, of course, the occupancy framework previously discussed.
In this Chapter, we first provide the necessary definitions to characterize the well-formulated model selection problem. We then introduce three new prior structures on the well-formulated model space and characterize their behavior with simple examples and simulations. With the model priors in place, we build a stochastic search algorithm to explore spaces of well-formulated models that relies on intrinsic priors for the model-specific parameters, though this assumption can be relaxed to use other mixtures of g-priors. Finally, we implement our procedures using both simulated and real data.
4.2 Setup for Well-Formulated Models
Suppose that the observations yi are modeled using the polynomial regression on the covariates xi1, ..., xip given by

yi = Σ_{α ∈ N0^p} β_α ∏_{j=1}^p xij^{αj} + εi,   (4-2)

where α = (α1, ..., αp) belongs to N0^p, the p-dimensional space of natural numbers including 0, εi ~iid N(0, σ^2), and only finitely many βα are allowed to be non-zero. As an illustration, consider a model space that includes polynomial terms incorporating the covariates xi1 and xi2 only. The terms xi2^2 and xi1^2·xi2 are represented by α = (0, 2) and α = (2, 1), respectively.
The notation y = Z(X)β + ε is used to denote that the observed response y = (y1, ..., yn)′ is modeled via a polynomial function Z of the original covariates contained in X = (x1, ..., xp) (where xj = (x1j, ..., xnj)′), and that the coefficients of the polynomial terms are given by β. A specific polynomial model M is defined by the set of coefficients βα that are allowed to be non-zero. This definition is equivalent to characterizing M through a collection of multi-indices in N0^p. In particular, model M is specified by M = {α_1^M, ..., α_|M|^M} for α_k^M ∈ N0^p, where βα = 0 for α ∉ M.

Any particular model M uses a subset XM of the original covariates X to form the polynomial terms in the design matrix ZM(X). Without ambiguity, a polynomial model ZM(X) on X can be identified with a polynomial model ZM(XM) on the covariates XM. The number of terms used by M to model the response y, denoted by |M|, corresponds to the number of columns of ZM(XM). The coefficient vector and error variance of model M are denoted by βM and σM^2, respectively. Thus M models the data as y = ZM(XM)βM + εM, where εM ~ N(0, I·σM^2). Model M is said to be nested in model M′ if M ⊂ M′. M models the response of the covariates in two distinct ways: choosing the set of meaningful covariates XM, as well as choosing the polynomial structure of these covariates, ZM(XM).
The set N0^p constitutes a partially ordered set, or, more succinctly, a poset. A poset is a set partially ordered through a binary relation "≼". In this context, the binary relation on the poset N0^p is defined between pairs (α, α′) by α′ ≼ α whenever αj ≥ α′j for all j = 1, ..., p, with α′ ≺ α if, additionally, αj > α′j for some j. The order of a term α ∈ N0^p is given by the sum of its elements, order(α) = Σj αj. When order(α) = order(α′) + 1 and α′ ≺ α, then α′ is said to immediately precede α, which is denoted by α′ → α. The parent set of α is defined by P(α) = {α′ ∈ N0^p : α′ → α}, and is given by the set of nodes that immediately precede the given node. A polynomial model M is said to be well-formulated if α ∈ M implies that P(α) ⊂ M. For example, any well-formulated model using xi1^2·xi2 to model yi must also include the parent terms xi1·xi2 and xi1^2, their corresponding parent terms xi1 and xi2, and the intercept term 1.

The poset N0^p can be represented by a directed acyclic graph (DAG), denoted by Γ(N0^p). Without ambiguity, we can identify nodes in the graph, α ∈ N0^p, with terms in the set of covariates. The graph has directed edges to a node from each of its parents. Any well-formulated model M is represented by a subgraph Γ(M) of Γ(N0^p) with the property that if node α ∈ Γ(M), then the nodes corresponding to P(α) are also in Γ(M). Figure 4-1 shows examples of well-formulated polynomial models, where α ∈ N0^p is identified with ∏_{j=1}^p xj^{αj}.
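Representing terms as multi-index tuples makes the parent-set and well-formulation definitions directly computable; a minimal sketch (the function names are ours, not the dissertation's):

```python
def parents(alpha):
    """P(alpha): the multi-indices that immediately precede alpha,
    obtained by decrementing one non-zero entry at a time."""
    return {alpha[:j] + (alpha[j] - 1,) + alpha[j + 1:]
            for j in range(len(alpha)) if alpha[j] > 0}

def is_well_formulated(model):
    """A model (a set of multi-index tuples) is well-formulated when
    every term's parent set is contained in the model."""
    return all(parents(alpha) <= model for alpha in model)

# The hierarchy above x1^2 * x2, i.e. alpha = (2, 1), down to the intercept:
full = {(0, 0), (1, 0), (0, 1), (2, 0), (1, 1), (2, 1)}
broken = {(0, 0), (0, 1), (2, 0)}  # contains x1^2 but not x1
```

Here `parents((2, 1))` yields {(1, 1), (2, 0)}, i.e. x1·x2 and x1^2, exactly the parent terms named in the example above.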
The motivation for considering only well-formulated polynomial models is compelling. Let ZM be the design matrix associated with a polynomial model M. The subspace of y modeled by ZM, given by the hat matrix HM = ZM(ZM′ZM)^{-1}ZM′, is invariant to affine transformations of the matrix XM if and only if M corresponds to a well-formulated polynomial model (Peixoto 1990).
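This invariance result can be verified numerically; the sketch below (with arbitrary simulated data and an arbitrary shift c) contrasts a non-well-formulated model {1, x^2} with the well-formulated quadratic {1, x, x^2}:

```python
import numpy as np

def hat(Z):
    """Hat matrix H = Z (Z'Z)^{-1} Z'."""
    return Z @ np.linalg.solve(Z.T @ Z, Z.T)

rng = np.random.default_rng(0)
x = rng.normal(size=8)
c = 2.5  # an arbitrary shift

# Not well-formulated: intercept and x^2 without x.
H_bad = hat(np.column_stack([np.ones(8), x ** 2]))
H_bad_shifted = hat(np.column_stack([np.ones(8), (x + c) ** 2]))

# Well-formulated quadratic: intercept, x, x^2.
H_ok = hat(np.column_stack([np.ones(8), x, x ** 2]))
H_ok_shifted = hat(np.column_stack([np.ones(8), x + c, (x + c) ** 2]))

changed = not np.allclose(H_bad, H_bad_shifted)   # shift alters the fit
invariant = np.allclose(H_ok, H_ok_shifted)       # shift leaves it unchanged
```

The well-formulated case is invariant because (x + c)^2 = x^2 + 2c·x + c^2 lies in the span of {1, x, x^2}, while in the non-well-formulated case the shift drags x into the column space.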
Figure 4-1. Graphs of well-formulated polynomial models for p = 2.
For example, if p = 2 and yi = β(0,0) + β(1,0)·xi1 + β(0,1)·xi2 + β(1,1)·xi1·xi2 + εi, then the hat matrix is invariant to any covariate transformation of the form A(xi1, xi2)′ + b, for any real-valued positive definite 2 × 2 matrix A and any real-valued vector b of dimension two. In contrast, if yi = β(0,0) + β(2,0)·xi1^2 + εi, then the hat matrix formed after applying the transformation xi1 → xi1 + c, for real c ≠ 0, is not the same as the hat matrix formed with the original xi1.

4.2.1 Well-Formulated Model Spaces
The spaces of WFMs ℳ considered in this paper can be characterized in terms of two WFMs: MB, the base model, and MF, the full model. The base model contains at least the intercept term and is nested in the full model. The model space ℳ is populated by all well-formulated models M that nest MB and are nested in MF:

ℳ = {M : MB ⊆ M ⊆ MF and M is well-formulated}.

For M to be well-formulated, the entire ancestry of each node in M must also be included in M. Because of this, M ∈ ℳ can be uniquely identified by two different sets of nodes in MF: the set of extreme nodes and the set of children nodes. For M ∈ ℳ, the sets of extreme and children nodes, respectively denoted by E(M) and C(M), are defined by

E(M) = {α ∈ M \ MB : α ∉ P(α′) for all α′ ∈ M}
C(M) = {α ∈ MF \ M : {α} ∪ M is well-formulated}.

The extreme nodes are those nodes that, when removed from M, give rise to a WFM in ℳ. The children nodes are those nodes that, when added to M, give rise to a WFM in ℳ. Because MB ⊆ M for all M ∈ ℳ, the set of nodes E(M) ∪ MB determines M, by beginning with this set and iteratively adding parent nodes. Similarly, the nodes in C(M) determine the set {α′ ∈ P(α) : α ∈ C(M)} ∪ {α′ ∈ E(MF) : α ⋠ α′ for all α ∈ C(M)}, which contains E(M) ∪ MB and thus uniquely identifies M.
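Both identifying sets are straightforward to compute from the multi-index representation; the sketch below (our notation, not the dissertation's code) uses the model M = {1, x1, x1^2} inside the full quadratic model in two covariates:

```python
def parents(alpha):
    """Multi-indices immediately preceding alpha."""
    return {alpha[:j] + (alpha[j] - 1,) + alpha[j + 1:]
            for j in range(len(alpha)) if alpha[j] > 0}

def is_wf(model):
    return all(parents(a) <= model for a in model)

def extreme_nodes(M, MB):
    """E(M): nodes outside the base model that are parents of no node in M."""
    return {a for a in M - MB if all(a not in parents(b) for b in M)}

def children_nodes(M, MF):
    """C(M): nodes of MF outside M whose addition keeps the model well-formulated."""
    return {a for a in MF - M if is_wf(M | {a})}

MB = {(0, 0)}                                           # base: intercept only
MF = {(0, 0), (1, 0), (0, 1), (2, 0), (1, 1), (0, 2)}   # full quadratic, p = 2
M = {(0, 0), (1, 0), (2, 0)}                            # model {1, x1, x1^2}
E, C = extreme_nodes(M, MB), children_nodes(M, MF)
```

For this model, the single extreme node is x1^2 (removing it leaves the WFM {1, x1}) and the single child node is x2 (the only addition that preserves well-formulation), matching the configuration depicted in Figure 4-2.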
Figure 4-2. A) Extreme node set. B) Children node set.
In Figure 4-2, the extreme and children sets for model $M = \{1, x_1, x_1^2\}$ are shown for the model space characterized by $M_F = \{1, x_1, x_2, x_1^2, x_1x_2, x_2^2\}$. In Figure 4-2A, the solid nodes represent nodes $\alpha \in M \setminus E(M)$, the dashed node corresponds to $\alpha \in E(M)$, and the dotted nodes are not in $M$. Solid nodes in Figure 4-2B correspond to those in $M$. The dashed node is the single node in $C(M)$, and the dotted nodes are not in $M \cup C(M)$.
4.3 Priors on the Model Space
As discussed in Scott & Berger (2010), the Ockham's-razor effect found automatically in Bayesian variable selection through the Bayes factor does not correct for multiple testing. This penalization acts against more complex models but does not account for the collection of models in the model space, which describes the multiplicity of the testing problem. This is where the role of the prior on the model space becomes important. As Scott & Berger explain, the multiplicity penalty is "hidden away" in the model prior probabilities $\pi(M \mid \mathcal{M})$.
In what follows, we propose three different prior structures on the model space for WFMs, discuss their advantages and disadvantages, and describe reasonable choices for their hyper-parameters. In addition, we investigate how the choice of prior structure and hyper-parameter combinations affects the posterior probabilities for predictor inclusion, providing some recommendations for different situations.
4.3.1 Model Prior Definition
The graphical structure of the model space suggests a method for prior construction on $\mathcal{M}$, guided by the notion of inheritance. A node $\alpha$ is said to inherit from a node $\alpha'$ if there is a directed path from $\alpha'$ to $\alpha$ in the graph of $M_F$. The inheritance is said to be immediate if $\text{order}(\alpha) = \text{order}(\alpha') + 1$ (equivalently, if $\alpha' \in P(\alpha)$, or if $\alpha'$ immediately precedes $\alpha$).
For convenience, define $\Delta(M) = M \setminus M_B$ to be the set of nodes in $M$ that are not in the base model $M_B$. For $\alpha \in \Delta(M_F)$, let $\gamma_\alpha(M)$ be the indicator function describing whether $\alpha$ is included in $M$, i.e., $\gamma_\alpha(M) = \mathbb{1}(\alpha \in M)$. Denote by $\gamma_\nu(M)$ the set of indicators of inclusion in $M$ for all order-$\nu$ nodes in $\Delta(M_F)$. Finally, let $\gamma_{<\nu}(M) = \bigcup_{j=0}^{\nu-1} \gamma_j(M)$, the set of indicators of inclusion in $M$ for all nodes in $\Delta(M_F)$ of order less than $\nu$. With these definitions, the prior probability of any model $M \in \mathcal{M}$ can be factored as
$$\pi(M \mid \mathcal{M}) = \prod_{j=J^{\min}_{\mathcal{M}}}^{J^{\max}_{\mathcal{M}}} \pi(\gamma_j(M) \mid \gamma_{<j}(M), \mathcal{M}), \qquad (4\text{-}3)$$
where $J^{\min}_{\mathcal{M}}$ and $J^{\max}_{\mathcal{M}}$ are, respectively, the minimum and maximum order of nodes in $\Delta(M_F)$, and $\pi(\gamma_{J^{\min}_{\mathcal{M}}}(M) \mid \gamma_{<J^{\min}_{\mathcal{M}}}(M), \mathcal{M}) = \pi(\gamma_{J^{\min}_{\mathcal{M}}}(M) \mid \mathcal{M})$.
Prior distributions on $\mathcal{M}$ can be simplified by making two assumptions. First, if $\text{order}(\alpha) = \text{order}(\alpha') = j$, then $\gamma_\alpha$ and $\gamma_{\alpha'}$ are assumed to be conditionally independent when conditioned on $\gamma_{<j}$, denoted by $\gamma_\alpha \perp\!\!\!\perp \gamma_{\alpha'} \mid \gamma_{<j}$. Second, immediate inheritance is invoked, and it is assumed that if $\text{order}(\alpha) = j$, then $\gamma_\alpha(M) \mid \gamma_{<j}(M) = \gamma_\alpha(M) \mid \gamma_{P(\alpha)}(M)$, where $\gamma_{P(\alpha)}(M)$ is the inclusion indicator for the set of parent nodes of $\alpha$. This indicator is one if the complete parent set of $\alpha$ is contained in $M$ and zero otherwise.
In Figure 4-3, these two assumptions are depicted with $M_F$ being an order-two surface in two main effects. The conditional independence assumption (Figure 4-3A) implies that the inclusion indicators for $x_1^2$, $x_2^2$, and $x_1x_2$ are independent when conditioned on all the lower-order terms. In this same space, immediate inheritance implies that the inclusion of $x_1^2$ conditioned on the inclusion of all lower-order nodes is equivalent to conditioning it on its parent set ($\{x_1\}$ in this case).
Figure 4-3. A) Conditional independence: $x_1^2 \perp\!\!\!\perp x_1x_2 \perp\!\!\!\perp x_2^2 \mid \{1, x_1, x_2\}$. B) Immediate inheritance: $x_1^2 \mid \{1, x_1, x_2\} = x_1^2 \mid \{x_1\}$.
Denote the conditional inclusion probability of node $\alpha$ in model $M$ by $\pi_\alpha = \pi(\gamma_\alpha(M) = 1 \mid \gamma_{P(\alpha)}(M), \mathcal{M})$. Under the assumptions of conditional independence and immediate inheritance, the prior probability of $M$ is
$$\pi(M \mid \pi_{\mathcal{M}}, \mathcal{M}) = \prod_{\alpha \in \Delta(M_F)} \pi_\alpha^{\gamma_\alpha(M)} (1 - \pi_\alpha)^{1 - \gamma_\alpha(M)}, \qquad (4\text{-}4)$$
with $\pi_{\mathcal{M}} = \{\pi_\alpha : \alpha \in \Delta(M_F)\}$. Because $M$ must be well-formulated, $\pi_\alpha = \gamma_\alpha(M) = 0$ if $\gamma_{P(\alpha)}(M) = 0$. Thus the product in (4-4) can be restricted to the set of nodes $\alpha \in \Delta(M) \cup C(M)$. Additional structure can be built into the prior on $\mathcal{M}$ by making assumptions about the inclusion probabilities $\pi_\alpha$, such as equality assumptions or assumptions of a hyper-prior for these parameters. Three such prior classes are developed next, first by assigning hyper-priors on $\pi_{\mathcal{M}}$ assuming some structure among its elements, and then marginalizing out the $\pi_{\mathcal{M}}$.
Hierarchical Uniform Prior (HUP). The HUP assumes that the non-zero $\pi_\alpha$ are all equal. Specifically, for a model $M \in \mathcal{M}$, it is assumed that $\pi_\alpha = \pi$ for all $\alpha \in \Delta(M) \cup C(M)$. The Bayesian specification of the HUP is completed by assuming a prior distribution for $\pi$. The choice of $\pi \sim \text{Beta}(a, b)$ produces
$$\pi_{HUP}(M \mid \mathcal{M}, a, b) = \frac{B(|\Delta(M)| + a,\ |C(M)| + b)}{B(a, b)}, \qquad (4\text{-}5)$$
where $B$ is the beta function. Setting $a = b = 1$ gives the particular value of
$$\pi_{HUP}(M \mid \mathcal{M}, a = 1, b = 1) = \frac{1}{|\Delta(M)| + |C(M)| + 1} \binom{|\Delta(M)| + |C(M)|}{|\Delta(M)|}^{-1}. \qquad (4\text{-}6)$$
The HUP assigns equal probabilities to all models for which the sets of nodes $\Delta(M)$ and $C(M)$ have the same cardinality. This prior provides a combinatorial penalization but essentially fails to account for the hierarchical structure of the model space. An additional penalization for model complexity can be incorporated into the HUP by changing the values of $a$ and $b$. Because $\pi_\alpha = \pi$ for all $\alpha$, this penalization can only depend on some aspect of the entire graph of $M_F$, such as the total number of nodes not in the null model, $|\Delta(M_F)|$.
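As a quick numerical check of Eq. 4-5 (an illustrative sketch, not the author's code), the HUP probability depends on $M$ only through the two cardinalities, so it can be evaluated with log-beta functions; the values below match the base-model row of Figure 4-4, where $|\Delta(M)| = 0$, $|C(M)| = 2$, and $|\Delta(M_F)| = 5$:

```python
from math import lgamma, exp

def log_beta(a, b):
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def hup_prior(n_delta, n_children, a=1.0, b=1.0):
    # Eq. 4-5: B(|Delta(M)| + a, |C(M)| + b) / B(a, b)
    return exp(log_beta(n_delta + a, n_children + b) - log_beta(a, b))

print(hup_prior(0, 2))           # 1/3   (a = b = 1)
print(hup_prior(0, 2, b=5.0))    # 5/7   (b = |Delta(M_F)| = 5)
```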
Hierarchical Independence Prior (HIP). The HIP assumes that there are no equality constraints among the non-zero $\pi_\alpha$. Each non-zero $\pi_\alpha$ is given its own prior, which is assumed to be a Beta distribution with parameters $a_\alpha$ and $b_\alpha$. Thus, the prior probability of $M$ under the HIP is
$$\pi_{HIP}(M \mid \mathcal{M}, \mathbf{a}, \mathbf{b}) = \prod_{\alpha \in \Delta(M)} \frac{a_\alpha}{a_\alpha + b_\alpha} \prod_{\alpha \in C(M)} \frac{b_\alpha}{a_\alpha + b_\alpha}, \qquad (4\text{-}7)$$
where the product over $\emptyset$ is taken to be 1. Because the $\pi_\alpha$ are totally independent, any choice of $a_\alpha$ and $b_\alpha$ is equivalent to choosing a probability of success $\pi_\alpha$ for a given $\alpha$. Setting $a_\alpha = b_\alpha = 1$ for all $\alpha \in \Delta(M) \cup C(M)$ gives the particular value of
$$\pi_{HIP}(M \mid \mathcal{M}, \mathbf{a} = \mathbf{1}, \mathbf{b} = \mathbf{1}) = \left(\frac{1}{2}\right)^{|\Delta(M)| + |C(M)|}. \qquad (4\text{-}8)$$
Although the prior with this choice of hyper-parameters accounts for the hierarchical structure of the model space, it essentially provides no penalization for combinatorial complexity at different levels of the hierarchy. This can be observed by considering a model space with main effects only: the exponent in (4-8) is the same for every model in the space, because each node is either in the model or in the children set.
Additional penalizations for model complexity can be incorporated into the HIP. Because each $\gamma_j$ is conditioned on $\gamma_{<j}$ in the prior construction, the $a_\alpha$ and $b_\alpha$ for $\alpha$ of order $j$ can be conditioned on $\gamma_{<j}$. One such additional penalization utilizes the number of nodes of order $j$ that could be added to produce a WFM conditioned on the inclusion vector $\gamma_{<j}$, which is denoted as $ch_j(\gamma_{<j})$. Choosing $a_\alpha = 1$ and $b_\alpha(M) = ch_j(\gamma_{<j})$ is equivalent to choosing a probability of success $\pi_\alpha = 1/\big(1 + ch_j(\gamma_{<j})\big)$. This penalization can drive down the false positive rate when $ch_j(\gamma_{<j})$ is large, but may produce more false negatives.
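Eq. 4-7 under either hyper-parameter choice reduces to a product of per-node factors. The sketch below (illustrative code) reproduces the Figure 4-4 entries for the model $\{1, x_1, x_2\}$, where $\Delta(M) = \{x_1, x_2\}$ (order-1 nodes, with $ch_1 = 2$) and $C(M) = \{x_1^2, x_1x_2, x_2^2\}$ (order-2 children, with $ch_2 = 3$):

```python
def hip_prior(delta_ab, children_ab):
    # Eq. 4-7: factor a/(a+b) for each included node in Delta(M),
    # and b/(a+b) for each node in the children set C(M).
    p = 1.0
    for a, b in delta_ab:
        p *= a / (a + b)
    for a, b in children_ab:
        p *= b / (a + b)
    return p

print(hip_prior([(1, 1)] * 2, [(1, 1)] * 3))   # (1/2)^5 = 1/32
print(hip_prior([(1, 2)] * 2, [(1, 3)] * 3))   # (1/3)^2 (3/4)^3 = 3/64
```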
Hierarchical Order Prior (HOP). A compromise between complete equality and complete independence of the $\pi_\alpha$ is to assume equality between the $\pi_\alpha$ of a given order and independence across the different orders. Define $\Delta_j(M) = \{\alpha \in \Delta(M) : \text{order}(\alpha) = j\}$ and $C_j(M) = \{\alpha \in C(M) : \text{order}(\alpha) = j\}$. The HOP assumes that $\pi_\alpha = \pi_j$ for all $\alpha \in \Delta_j(M) \cup C_j(M)$. Assuming that $\pi_j \sim \text{Beta}(a_j, b_j)$ provides a prior probability of
$$\pi_{HOP}(M \mid \mathcal{M}, \mathbf{a}, \mathbf{b}) = \prod_{j=J^{\min}_{\mathcal{M}}}^{J^{\max}_{\mathcal{M}}} \frac{B(|\Delta_j(M)| + a_j,\ |C_j(M)| + b_j)}{B(a_j, b_j)}. \qquad (4\text{-}9)$$
The specific choice of $a_j = b_j = 1$ for all $j$ gives a value of
$$\pi_{HOP}(M \mid \mathcal{M}, \mathbf{a} = \mathbf{1}, \mathbf{b} = \mathbf{1}) = \prod_j \left[\frac{1}{|\Delta_j(M)| + |C_j(M)| + 1} \binom{|\Delta_j(M)| + |C_j(M)|}{|\Delta_j(M)|}^{-1}\right] \qquad (4\text{-}10)$$
and produces a hierarchical version of the Scott and Berger multiplicity correction.
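A sketch of Eq. 4-9 (illustrative code): the HOP needs only the per-order counts $(|\Delta_j(M)|, |C_j(M)|)$. For the full quadratic model of Figure 4-4 ($\Delta_1 = \{x_1, x_2\}$, $\Delta_2 = \{x_1^2, x_1x_2, x_2^2\}$, both children sets empty), it recovers the tabulated values $1/12$ for $a_j = b_j = 1$ and $1/120$ for $a_j = 1$, $b_j = ch_j$:

```python
from math import lgamma, exp

def log_beta(a, b):
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def hop_prior(counts, ab=None):
    # Eq. 4-9: counts[j] = (|Delta_j(M)|, |C_j(M)|); ab[j] = (a_j, b_j),
    # defaulting to (1, 1) for every order j.
    logp = 0.0
    for j, (nd, nc) in counts.items():
        a, b = (1.0, 1.0) if ab is None else ab[j]
        logp += log_beta(nd + a, nc + b) - log_beta(a, b)
    return exp(logp)

counts = {1: (2, 0), 2: (3, 0)}                    # full quadratic model
print(hop_prior(counts))                           # 1/3 * 1/4 = 1/12
print(hop_prior(counts, {1: (1, 2), 2: (1, 3)}))   # 1/6 * 1/20 = 1/120
```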
The HOP arises from a conditional exchangeability assumption on the indicator variables. Conditioned on $\gamma_{<j}(M)$, the indicators $\{\gamma_\alpha : \alpha \in \Delta_j(M) \cup C_j(M)\}$ are assumed to be exchangeable Bernoulli random variables. By de Finetti's theorem, these arise from independent Bernoulli random variables with common probability of success $\pi_j$, with a prior distribution. Our construction of the HOP assumes that this prior is a beta distribution. Additional complexity penalizations can be incorporated into the HOP in a similar fashion to the HIP. The number of possible nodes of order $j$ that could be added while maintaining a WFM is given by $ch_j(M) = ch_j(\gamma_{<j}(M)) = |\Delta_j(M) \cup C_j(M)|$. Using $a_j = 1$ and $b_j(M) = ch_j(M)$ produces a prior with two desirable properties. First, if $M' \subset M$, then $\pi(M) \leq \pi(M')$. Second, for each order $j$, the conditional probability of including $k$ nodes is greater than or equal to that of including $k + 1$ nodes, for $k = 0, 1, \ldots, ch_j(M) - 1$.
4.3.2 Choice of Prior Structure and Hyper-Parameters
Each of the priors introduced in Section 4.3.1 defines a whole family of model priors, characterized by the probability distribution assumed for the inclusion probabilities $\pi_{\mathcal{M}}$. For the sake of simplicity, this paper focuses on those arising from Beta distributions and concentrates on particular choices of hyper-parameters which can be specified automatically. First, we describe some general features of how each of the three prior structures (HUP, HIP, HOP) allocates mass to the models in the model space. Second, as there is an infinite number of ways in which the hyper-parameters can be specified, focus is placed on the default choice $a = b = 1$, as well as on the complexity penalizations described in Section 4.3.1. The second alternative is referred to as $a = 1, b = ch$, where $b = ch$ has a slightly different interpretation depending on the prior structure. Accordingly, $b = ch$ is given by $b_j(M) = b_\alpha(M) = ch_j(M) = |\Delta_j(M) \cup C_j(M)|$ for the HOP and HIP, where $j = \text{order}(\alpha)$, while $b = ch$ denotes $b = |\Delta(M_F)|$ for the HUP. The prior behavior is illustrated for two model spaces. In both cases, the base model $M_B$ is taken to be the intercept-only model, and $M_F$ is the DAG shown (Figures 4-4 and 4-5). The priors considered treat model complexity differently, and some general properties can be seen in these examples.
| Model | HIP (1,1) | HIP (1,ch) | HOP (1,1) | HOP (1,ch) | HUP (1,1) | HUP (1,ch) |
| 1. {1} | 1/4 | 4/9 | 1/3 | 1/2 | 1/3 | 5/7 |
| 2. {1, x1} | 1/8 | 1/9 | 1/12 | 1/12 | 1/12 | 5/56 |
| 3. {1, x2} | 1/8 | 1/9 | 1/12 | 1/12 | 1/12 | 5/56 |
| 4. {1, x1, x1^2} | 1/8 | 1/9 | 1/12 | 1/12 | 1/12 | 5/168 |
| 5. {1, x2, x2^2} | 1/8 | 1/9 | 1/12 | 1/12 | 1/12 | 5/168 |
| 6. {1, x1, x2} | 1/32 | 3/64 | 1/12 | 1/12 | 1/60 | 1/72 |
| 7. {1, x1, x2, x1^2} | 1/32 | 1/64 | 1/36 | 1/60 | 1/60 | 1/168 |
| 8. {1, x1, x2, x1x2} | 1/32 | 1/64 | 1/36 | 1/60 | 1/60 | 1/168 |
| 9. {1, x1, x2, x2^2} | 1/32 | 1/64 | 1/36 | 1/60 | 1/60 | 1/168 |
| 10. {1, x1, x2, x1^2, x1x2} | 1/32 | 1/192 | 1/36 | 1/120 | 1/30 | 1/252 |
| 11. {1, x1, x2, x1^2, x2^2} | 1/32 | 1/192 | 1/36 | 1/120 | 1/30 | 1/252 |
| 12. {1, x1, x2, x1x2, x2^2} | 1/32 | 1/192 | 1/36 | 1/120 | 1/30 | 1/252 |
| 13. {1, x1, x2, x1^2, x1x2, x2^2} | 1/32 | 1/576 | 1/12 | 1/120 | 1/6 | 1/252 |

Figure 4-4. Prior probabilities for the space of well-formulated models associated with the quadratic surface on two variables, where $M_B$ is taken to be the intercept-only model and $(a, b) \in \{(1, 1), (1, ch)\}$.
First, contrast the choices of HIP, HUP, and HOP for $(a, b) = (1, 1)$. The HIP induces a complexity penalization that only accounts for the order of the terms in the model. This is best exhibited by the model space in Figure 4-4: models including $x_1$ and $x_2$ (models 6 through 13) are given the same prior probability, and no penalization is incurred for the inclusion of any or all of the quadratic terms. In contrast to the HIP, the
| Model | HIP (1,1) | HIP (1,ch) | HOP (1,1) | HOP (1,ch) | HUP (1,1) | HUP (1,ch) |
| 1. {1} | 1/8 | 27/64 | 1/4 | 1/2 | 1/4 | 4/7 |
| 2. {1, x1} | 1/8 | 9/64 | 1/12 | 1/10 | 1/12 | 2/21 |
| 3. {1, x2} | 1/8 | 9/64 | 1/12 | 1/10 | 1/12 | 2/21 |
| 4. {1, x3} | 1/8 | 9/64 | 1/12 | 1/10 | 1/12 | 2/21 |
| 5. {1, x1, x3} | 1/8 | 3/64 | 1/12 | 1/20 | 1/12 | 4/105 |
| 6. {1, x2, x3} | 1/8 | 3/64 | 1/12 | 1/20 | 1/12 | 4/105 |
| 7. {1, x1, x2} | 1/16 | 3/128 | 1/24 | 1/40 | 1/30 | 1/42 |
| 8. {1, x1, x2, x1x2} | 1/16 | 3/128 | 1/24 | 1/40 | 1/20 | 1/70 |
| 9. {1, x1, x2, x3} | 1/16 | 1/128 | 1/8 | 1/40 | 1/20 | 1/70 |
| 10. {1, x1, x2, x3, x1x2} | 1/16 | 1/128 | 1/8 | 1/40 | 1/5 | 1/70 |

Figure 4-5. Prior probabilities for the space of well-formulated models associated with three main effects and one interaction term, where $M_B$ is taken to be the intercept-only model and $(a, b) \in \{(1, 1), (1, ch)\}$.
HUP induces a penalization for model complexity, but it does not adequately penalize models for including additional terms. Using the HUP, models including all of the terms are given at least as much probability as any model containing any non-empty set of terms (Figures 4-4 and 4-5). This lack of penalization of the full model originates from its combinatorial simplicity (i.e., this is the only model that contains every term), and as an unfortunate consequence, this model space distribution favors the base and full models. Similar behavior is observed with the HOP with $(a, b) = (1, 1)$. As models become more complex, they are appropriately penalized for their size. However, after a sufficient number of nodes are added, the number of possible models of that particular size is considerably reduced; thus, combinatorial complexity is negligible for the largest models. This is best exhibited in Figure 4-5, where the HOP places more mass on the full model than on any model containing a single order-one node, highlighting an undesirable behavior of the priors with this choice of hyper-parameters.
In contrast, if $(a, b) = (1, ch)$, all three priors produce strong penalization as models become more complex, both in terms of the number and the order of the nodes contained in the model. For all of the priors, adding a node $\alpha$ to a model $M$ to form $M'$ produces $\pi(M) \geq \pi(M')$. However, differences between the priors are apparent. The HIP penalizes the full model the most, with the HOP penalizing it the least and the HUP lying between them. At face value, the HOP creates the most compelling penalization of model complexity. In Figure 4-5, the penalization of the HOP is the least dramatic, producing prior odds of 20 for $M_B$ versus $M_F$, as opposed to the HUP and HIP, which produce prior odds of 40 and 54, respectively. Similarly, the prior odds in Figure 4-4 are 60, 180, and 256 for the HOP, HUP, and HIP, respectively.
4.3.3 Posterior Sensitivity to the Choice of Prior
To determine how the proposed priors adjust the posterior probabilities to account for multiplicity, a simple simulation was performed. The goal of this exercise was to understand how the priors respond to increasing complexity. First, the priors are compared as the number of main effects $p$ grows. Second, they are compared as the depth of the hierarchy increases, or in other words, as the order $J^{\max}_{\mathcal{M}}$ increases.
The quality of a node is characterized by its marginal posterior inclusion probability, defined as $p_\alpha = \sum_{M \in \mathcal{M}} \mathbb{1}(\alpha \in M)\, p(M \mid \mathbf{y}, \mathcal{M})$ for $\alpha \in M_F$. These posteriors
were obtained for the proposed priors, as well as for the Equal Probability Prior (EPP) on $\mathcal{M}$. For all prior structures, both the default hyper-parameters $a = b = 1$ and the penalizing choice of $a = 1$ and $b = ch$ are considered. The results for the different combinations of $M_F$ and $M_T$ incorporated in the analysis were obtained from 100 random replications (i.e., generating at random 100 matrices of main effects and responses). The simulation proceeds as follows:
1. Randomly generate main-effects matrices $X = (x_1, \ldots, x_{18})$ with $x_i \stackrel{iid}{\sim} N_n(0, I_n)$, and error vectors $\epsilon \sim N_n(0, I_n)$, for $n = 60$.
2. Setting all coefficient values equal to one, calculate $y = Z_{M_T}\beta + \epsilon$ for the true models given by
$M_{T1} = \{x_1, x_2, x_3, x_1^2, x_1x_2, x_2^2, x_2x_3\}$, with $|M_{T1}| = 7$;
$M_{T2} = \{x_1, x_2, \ldots, x_{16}\}$, with $|M_{T2}| = 16$;
$M_{T3} = \{x_1, x_2, x_3, x_4\}$, with $|M_{T3}| = 4$;
$M_{T4} = \{x_1, x_2, \ldots, x_8, x_1^2, x_3x_4\}$, with $|M_{T4}| = 10$;
$M_{T5} = \{x_1, x_2, x_3, x_4, x_1^2, x_3x_4\}$, with $|M_{T5}| = 6$.
Table 4-1. Characterization of the full models $M_F$ and corresponding model spaces $\mathcal{M}$ considered in simulations.

Growing $p$, fixed $J^{\max}_{\mathcal{M}}$:
| $M_F$ | $|M_F|$ | $|\mathcal{M}|$ | $M_T$ used |
| (x1 + x2 + x3)^2 | 9 | 95 | M_T1 |
| (x1 + ... + x4)^2 | 14 | 1337 | M_T1 |
| (x1 + ... + x5)^2 | 20 | 38619 | M_T1 |

Fixed $p$, growing $J^{\max}_{\mathcal{M}}$:
| $M_F$ | $|M_F|$ | $|\mathcal{M}|$ | $M_T$ used |
| (x1 + x2 + x3)^2 | 9 | 95 | M_T1 |
| (x1 + x2 + x3)^3 | 19 | 2497 | M_T1 |
| (x1 + x2 + x3)^4 | 34 | 161421 | M_T1 |

Other model spaces:
| $M_F$ | $|M_F|$ | $|\mathcal{M}|$ | $M_T$ used |
| x1 + x2 + ... + x18 | 18 | 262144 | M_T2, M_T3 |
| (x1 + x2 + x3 + x4)^2 + x5 + x6 + ... + x10 | 20 | 85568 | M_T4, M_T5 |
3. In all simulations, the base model $M_B$ is the intercept-only model. The notation $(x_1 + \cdots + x_p)^d$ is used to represent the full order-$d$ polynomial response surface in $p$ main effects. The model spaces, characterized by their corresponding full model $M_F$, are presented in Table 4-1, as well as the true models used in each case.
4. Enumerate the model spaces and calculate $p(M \mid \mathbf{y}, \mathcal{M})$ for all $M \in \mathcal{M}$ using the EPP, HUP, HIP, and HOP, the latter two each with the two sets of hyper-parameters.
5. Count the number of true positives and false positives in each $\mathcal{M}$ for the different priors.
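Steps 1 and 2 can be sketched as follows (illustrative code; the random seed and the inclusion of an intercept column in $Z_{M_T}$ are assumptions), here for the true model $M_{T5}$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 60, 18

# Step 1: main-effects matrix and error vector, all standard normal.
X = rng.standard_normal((n, p))
eps = rng.standard_normal(n)

# Step 2: design for M_T5 = {x1, x2, x3, x4, x1^2, x3*x4}, with an
# intercept column and all coefficients set to one.
Z = np.column_stack([np.ones(n),
                     X[:, 0], X[:, 1], X[:, 2], X[:, 3],
                     X[:, 0] ** 2, X[:, 2] * X[:, 3]])
y = Z @ np.ones(Z.shape[1]) + eps
print(Z.shape, y.shape)    # (60, 7) (60,)
```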
The true positives (TP) are defined as those nodes $\alpha \in M_T$ such that $p_\alpha > 0.5$. For the false positives (FP), three different cutoffs are considered for $p_\alpha$ with $\alpha \notin M_T$, elucidating the adjustment for multiplicity induced by the model priors. These cutoffs are 0.10, 0.20, and 0.50. The results from this exercise provide insight into the influence of the prior on the marginal posterior inclusion probabilities. In Table 4-1, the model spaces considered are described in terms of the number of models they contain and in terms of the number of nodes of $M_F$, the full model that defines the DAG for $\mathcal{M}$.
Growing number of main effects, fixed polynomial degree. This simulation investigates the posterior behavior as the number of covariates grows for a polynomial surface of degree two. The true model is assumed to be $M_{T1}$ and has 7 polynomial terms. The false positive and true positive rates are displayed in Table 4-2.
First, focus on the posterior when $(a, b) = (1, 1)$. As $p$ increases and the cutoff is low, the number of false positives increases for the EPP as well as for the hierarchical priors, although less dramatically for the latter. All of the priors identify all of the true positives. The false positive rate for the 0.50 cutoff is less than one for all four prior structures, with the HIP exhibiting the smallest false positive rate.
With the second choice of hyper-parameters, $(1, ch)$, the improvement of the hierarchical priors over the EPP is dramatic, and the difference in performance is more pronounced as $p$ increases. These also considerably outperform the priors using the default hyper-parameters $a = b = 1$ in terms of false positives. Regarding the number of true positives, all priors discovered the 7 true predictors in $M_{T1}$ for most of the 100 random samples of data, with only minor differences observed between any of the priors considered. That being said, the means for the priors with $a = 1, b = ch$ are slightly lower for the true positives. With a 0.50 cutoff, the hierarchical priors keep tight control on the number of false positives, but in doing so discard true positives with slightly higher frequency.
Growing polynomial degree, fixed main effects. For these examples, the true model is once again $M_{T1}$. When the complexity is increased by making the order of $M_F$ larger (Table 4-3), the inability of the EPP to adjust the inclusion posteriors for multiplicity becomes more pronounced: the EPP becomes less and less efficient at removing false positives when the FP cutoff is low. Among the priors with $a = b = 1$, as the order increases, the HIP is the best at filtering out the false positives. Using the 0.50 false positive cutoff, some false positives are included both for the EPP and for all the priors with $a = b = 1$, indicating that the default hyper-parameters might not be the best option to control FP. The 7 covariates in the true model all obtain a high inclusion posterior probability both with the EPP and with the $a = b = 1$ priors.
Table 4-2. Mean number of false and true positives in 100 randomly generated datasets as the number of main effects increases from three to five predictors in a full quadratic surface, under the equal probability prior (EPP), the hierarchical independence prior (HIP), the hierarchical order prior (HOP), and the hierarchical uniform prior (HUP).

| Cutoff | $|M_T|$ | $M_F$ | EPP | HIP (1,1) | HUP (1,1) | HOP (1,1) | HIP (1,ch) | HUP (1,ch) | HOP (1,ch) |
| FP (>0.10) | 7 | (x1+x2+x3)^2 | 1.78 | 1.78 | 2.00 | 2.00 | 0.11 | 1.31 | 1.06 |
| FP (>0.20) | | | 0.43 | 0.43 | 2.00 | 1.98 | 0.01 | 0.28 | 0.24 |
| FP (>0.50) | | | 0.04 | 0.04 | 0.97 | 0.36 | 0.00 | 0.03 | 0.02 |
| TP (>0.50) | | (M_T1) | 7.00 | 7.00 | 7.00 | 7.00 | 6.97 | 6.99 | 6.99 |
| FP (>0.10) | 7 | (x1+x2+x3+x4)^2 | 3.62 | 1.94 | 2.33 | 2.45 | 0.10 | 0.63 | 1.07 |
| FP (>0.20) | | | 1.60 | 0.47 | 2.17 | 2.15 | 0.01 | 0.17 | 0.24 |
| FP (>0.50) | | | 0.25 | 0.06 | 0.35 | 0.36 | 0.00 | 0.02 | 0.02 |
| TP (>0.50) | | (M_T1) | 7.00 | 7.00 | 7.00 | 7.00 | 6.97 | 6.99 | 6.99 |
| FP (>0.10) | 7 | (x1+x2+x3+x4+x5)^2 | 6.00 | 2.16 | 2.60 | 2.55 | 0.12 | 0.43 | 1.15 |
| FP (>0.20) | | | 2.91 | 0.55 | 2.13 | 2.18 | 0.02 | 0.19 | 0.27 |
| FP (>0.50) | | | 0.66 | 0.11 | 0.25 | 0.37 | 0.00 | 0.03 | 0.01 |
| TP (>0.50) | | (M_T1) | 7.00 | 7.00 | 7.00 | 7.00 | 6.97 | 6.99 | 6.99 |
In contrast, any of the $a = 1$, $b = ch$ priors dramatically improve upon their $a = b = 1$ counterparts, consistently assigning low inclusion probabilities to the majority of the false positive terms, even for low cutoffs. As the order of the polynomial surface increases, the difference in performance between these priors and either the EPP or their default versions becomes even more clear. At the 0.50 cutoff, the hierarchical priors with complexity penalization exhibit very low false positive rates. The true positive rate decreases slightly for these priors, but not to an alarming degree.
Other model spaces. This part of the analysis considers model spaces that do not correspond to full polynomial-degree response surfaces (Table 4-4). The first example is a model space with main effects only. The second example includes a full quadratic surface of order 2 but, in addition, includes six terms for which only main effects are to be modeled. Two true models are used in combination with each model space to observe how the posterior probabilities vary under the influence of the different priors for "large" and "small" true models.
Table 4-3. Mean number of false and true positives in 100 randomly generated datasets as the maximum order of $M_F$ increases from two to four in a full model with three main effects, under the equal probability prior (EPP), the hierarchical independence prior (HIP), the hierarchical order prior (HOP), and the hierarchical uniform prior (HUP).

| Cutoff | $|M_T|$ | $M_F$ | EPP | HIP (1,1) | HUP (1,1) | HOP (1,1) | HIP (1,ch) | HUP (1,ch) | HOP (1,ch) |
| FP (>0.10) | 7 | (x1+x2+x3)^2 | 1.78 | 1.78 | 2.00 | 2.00 | 0.11 | 1.31 | 1.06 |
| FP (>0.20) | | | 0.43 | 0.43 | 2.00 | 1.98 | 0.01 | 0.28 | 0.24 |
| FP (>0.50) | | | 0.04 | 0.04 | 0.97 | 0.36 | 0.00 | 0.03 | 0.02 |
| TP (>0.50) | | (M_T1) | 7.00 | 7.00 | 7.00 | 7.00 | 6.97 | 6.99 | 6.99 |
| FP (>0.10) | 7 | (x1+x2+x3)^3 | 7.37 | 5.21 | 6.06 | 2.91 | 0.55 | 1.05 | 1.39 |
| FP (>0.20) | | | 2.91 | 1.55 | 3.61 | 2.08 | 0.17 | 0.34 | 0.31 |
| FP (>0.50) | | | 0.40 | 0.21 | 0.50 | 0.26 | 0.03 | 0.03 | 0.04 |
| TP (>0.50) | | (M_T1) | 7.00 | 7.00 | 7.00 | 7.00 | 6.97 | 6.98 | 7.00 |
| FP (>0.10) | 7 | (x1+x2+x3)^4 | 8.22 | 4.00 | 4.69 | 2.61 | 0.52 | 0.55 | 1.32 |
| FP (>0.20) | | | 4.21 | 1.13 | 1.76 | 2.03 | 0.12 | 0.15 | 0.31 |
| FP (>0.50) | | | 0.56 | 0.17 | 0.22 | 0.27 | 0.03 | 0.03 | 0.04 |
| TP (>0.50) | | (M_T1) | 7.00 | 7.00 | 7.00 | 7.00 | 6.97 | 6.97 | 6.99 |
By construction, in model spaces with main effects only, HIP(1,1) and EPP are equivalent, as are HOP($a$,$b$) and HUP($a$,$b$). This accounts for the similarities observed among the results for the first two cases presented in Table 4-4, where the model space corresponds to a full model with 18 main effects and the true models have 16 and 4 main effects, respectively. When the number of true coefficients is large, the HUP(1,1) and HOP(1,1) do poorly at controlling false positives, even at the 0.50 cutoff. In contrast, the HIP (and thus the EPP) with the 0.50 cutoff identifies the true positives and no false positives. This result, however, does not imply that the EPP controls false positives well: the true model contains 16 out of the 18 nodes in $M_F$, so there is little potential for false positives. The $a = 1$, $b = ch$ priors show dramatically different behavior. The HIP controls false positives well but fails to identify the true coefficients at the 0.50 cutoff. In contrast, the HOP identifies all of the true positives and has a small false positive rate at the 0.50 cutoff.
If the number of true positives is small, most terms in the full model are truly zero. The EPP includes at least one false positive in approximately 50% of the randomly sampled datasets. On the other hand, the HUP(1,1) provides some control for multiplicity, obtaining on average a lower number of false positives than the EPP. Furthermore, the proposed hierarchical priors with $a = 1$, $b = ch$ are substantially better than the EPP (and the choice of $a = b = 1$) at controlling false positives and capturing all true positives using the marginal posterior inclusion probabilities. These two examples suggest that the HOP(1, ch) is the best default choice for model selection when the number of terms available at a given degree is large.
The third and fourth examples in Table 4-4 consider the same irregular model space, with data generated from $M_{T4}$ with ten terms and from $M_{T5}$ with six terms. HIP(1,1) and EPP again behave quite similarly, incorporating a large number of false positives at the 0.10 cutoff; at the 0.50 cutoff, some false positives are still included. The HUP(1,1) and HOP(1,1) behave similarly, with a slightly higher false positive rate at the 0.50 cutoff. In terms of the true positives, the EPP and $a = b = 1$ priors always include all of the predictors in $M_{T4}$ and $M_{T5}$. On the other hand, the ability of the $a = 1$, $b = ch$ priors to control for false positives is markedly better than that of the EPP and of the hierarchical priors with the choice of $a = b = 1$. At the 0.50 cutoff, these priors identify all of the true positives and true negatives. Once again, these examples point to the hierarchical priors with additional penalization for complexity as being good default priors on the model space.
4.4 Random Walks on the Model Space
When the model space $\mathcal{M}$ is too large to enumerate, a stochastic procedure can be used to find models with high posterior probability. In particular, an MCMC algorithm can be utilized to generate a dependent sample of models from the model posterior. The structure of the model space $\mathcal{M}$ both presents difficulties and provides clues on how to build algorithms to explore it. Different MCMC strategies can be adopted, two of which
Table 4-4. Mean number of false and true positives in 100 randomly generated datasets with unstructured or irregular model spaces, under the equal probability prior (EPP), the hierarchical independence prior (HIP), the hierarchical order prior (HOP), and the hierarchical uniform prior (HUP).

| Cutoff | $|M_T|$ | $M_F$ | EPP | HIP (1,1) | HUP (1,1) | HOP (1,1) | HIP (1,ch) | HUP (1,ch) | HOP (1,ch) |
| FP (>0.10) | 16 | x1 + x2 + ... + x18 | 1.93 | 1.93 | 2.00 | 2.00 | 0.03 | 1.80 | 1.80 |
| FP (>0.20) | | | 0.52 | 0.52 | 2.00 | 2.00 | 0.01 | 0.46 | 0.46 |
| FP (>0.50) | | | 0.07 | 0.07 | 2.00 | 2.00 | 0.01 | 0.04 | 0.04 |
| TP (>0.50) | | (M_T2) | 15.99 | 15.99 | 16.00 | 16.00 | 6.99 | 15.99 | 15.99 |
| FP (>0.10) | 4 | x1 + x2 + ... + x18 | 13.95 | 13.95 | 9.15 | 9.15 | 0.26 | 1.31 | 1.31 |
| FP (>0.20) | | | 5.45 | 5.45 | 3.03 | 3.03 | 0.05 | 0.45 | 0.45 |
| FP (>0.50) | | | 0.84 | 0.84 | 0.45 | 0.45 | 0.02 | 0.06 | 0.06 |
| TP (>0.50) | | (M_T3) | 4.00 | 4.00 | 4.00 | 4.00 | 4.00 | 4.00 | 4.00 |
| FP (>0.10) | 10 | (x1+...+x4)^2 + x5 + ... + x10 | 9.73 | 9.71 | 10.00 | 5.60 | 0.34 | 2.33 | 2.20 |
| FP (>0.20) | | | 2.65 | 2.65 | 8.73 | 3.05 | 0.12 | 0.74 | 0.69 |
| FP (>0.50) | | | 0.35 | 0.35 | 1.36 | 1.68 | 0.02 | 0.11 | 0.12 |
| TP (>0.50) | | (M_T4) | 10.00 | 10.00 | 10.00 | 9.99 | 9.94 | 9.98 | 9.99 |
| FP (>0.10) | 6 | (x1+...+x4)^2 + x5 + ... + x10 | 13.52 | 13.52 | 11.06 | 9.94 | 0.44 | 1.63 | 1.96 |
| FP (>0.20) | | | 4.22 | 4.21 | 3.60 | 5.01 | 0.15 | 0.48 | 0.68 |
| FP (>0.50) | | | 0.53 | 0.53 | 0.57 | 0.75 | 0.01 | 0.08 | 0.11 |
| TP (>0.50) | | (M_T5) | 6.00 | 6.00 | 6.00 | 6.00 | 5.99 | 5.99 | 5.99 |
are outlined in this section. Combining the different strategies allows the model selection algorithm to explore the model space thoroughly and relatively fast.
4.4.1 Simple Pruning and Growing
This first strategy relies on small, localized jumps around the model space, turning on or off a single node at each step. The idea behind this algorithm is to grow the model by activating one node in the children set, or to prune the model by removing one node in the extreme set. At a given step in the algorithm, assume that the current state of the chain is model $M$. Let $p_G$ be the probability that the algorithm chooses the growth step. The proposed model $M'$ can either be $M^+ = M \cup \{\alpha\}$ for some $\alpha \in C(M)$, or $M^- = M \setminus \{\alpha\}$ for some $\alpha \in E(M)$.
An example transition kernel is defined by the mixture
$$g(M' \mid M) = p_G \cdot q_{\text{Grow}}(M' \mid M) + (1 - p_G) \cdot q_{\text{Prune}}(M' \mid M)$$
$$= \frac{\mathbb{1}(M \neq M_F)}{1 + \mathbb{1}(M \neq M_B)} \cdot \frac{\mathbb{1}(\alpha \in C(M))}{|C(M)|} + \frac{\mathbb{1}(M \neq M_B)}{1 + \mathbb{1}(M \neq M_F)} \cdot \frac{\mathbb{1}(\alpha \in E(M))}{|E(M)|}, \qquad (4\text{-}11)$$
where $p_G$ has explicitly been defined as 0.5 when both $C(M)$ and $E(M)$ are non-empty, and as 0 (or 1) when $C(M) = \emptyset$ (or $E(M) = \emptyset$). After choosing pruning or growing, a single node is proposed for addition to or deletion from $M$, uniformly at random.
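A single draw from this kernel can be sketched as follows (illustrative code, not the dissertation's implementation; monomials are exponent tuples, and $M_B \neq M_F$ is assumed):

```python
import random

def parents(node):
    # Immediate predecessors of a monomial: decrement one positive exponent.
    return {node[:k] + (e - 1,) + node[k + 1:]
            for k, e in enumerate(node) if e > 0}

def grow_prune_proposal(M, MB, MF, rng=random):
    # Children and extreme sets of the current model M.
    C = {a for a in MF - M if parents(a) <= M}
    E = {a for a in M - MB if all(a not in parents(b) for b in M)}
    # p_G = 1/2 when both sets are non-empty, 0 at M_F, 1 at M_B (Eq. 4-11).
    p_grow = (M != MF) / ((M != MF) + (M != MB))
    if rng.random() < p_grow:
        return M | {rng.choice(sorted(C))}      # grow: add one child node
    return M - {rng.choice(sorted(E))}          # prune: drop one extreme node

MB = {(0, 0)}
MF = {(0, 0), (1, 0), (0, 1), (2, 0), (1, 1), (0, 2)}
M = grow_prune_proposal(set(MB), MB, MF)        # from M_B the move must grow
print(len(M))                                   # 2
```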
For this simple algorithm, pruning has the reverse kernel of growing, and vice-versa. From this construction, more elaborate algorithms can be specified. First, instead of choosing the node uniformly at random from the corresponding set, nodes can be selected using the relative posterior probability of adding or removing the node. Second, more than one node can be selected at any step, for instance by also sampling at random the number of nodes to add or remove given the size of the set. Third, the strategy could combine pruning and growing in a single step by sampling one node $\alpha \in C(M) \cup E(M)$ and adding or removing it accordingly. Fourth, sets of nodes from $C(M) \cup E(M)$ that yield well-formulated models can be added or removed. This simple algorithm produces small moves around the model space by focusing node addition or removal only on the set $C(M) \cup E(M)$.
4.4.2 Degree-Based Pruning and Growing
In exploring the model space, it is possible to take advantage of the hierarchical structure defined between nodes of different order. One can update the vector of inclusion indicators by blocks, denoted $\gamma_j(M)$. Two flavors of this algorithm are proposed: one that separates the pruning and growing steps, and one where both are done simultaneously.
Assume that at a given step, say $t$, the algorithm is at $M$. If growing, the strategy proceeds successively by order class, going from $j = J^{\min}$ up to $j = J^{\max}$, with $J^{\min}$ and $J^{\max}$ being the lowest and highest orders of nodes in $M_F \setminus M_B$, respectively. Define $M_t(J^{\min} - 1) = M$ and set $j = J^{\min}$. The growth kernel comprises the following steps, proceeding from $j = J^{\min}$ to $j = J^{\max}$:
1) Propose a model $M'$ by selecting a set of nodes from $C_j(M_t(j-1))$ through the kernel $q_{\text{Grow},j}(\cdot \mid M_t(j-1))$.
2) Compute the Metropolis-Hastings correction for $M'$ versus $M_t(j-1)$. If $M'$ is accepted, then set $M_t(j) = M'$; otherwise, set $M_t(j) = M_t(j-1)$.
3) If $j < J^{\max}$, then set $j = j + 1$ and return to 1); otherwise, proceed to 4).
4) Set $M_t = M_t(J^{\max})$.
The pruning step is defined in a similar fashion; however, it starts at order $j = J^{\max}$ and proceeds down to $j = J^{\min}$. Let $E_j(M) = E(M) \cap \Delta_j(M_F)$ be the set of nodes of order $j$ that can be removed from the model $M$ to produce a WFM. Define $M_t(J^{\max} + 1) = M$ and set $j = J^{\max}$. The pruning kernel comprises the following steps:
1) Propose a model $M'$ by selecting a set of nodes from $E_j(M_t(j+1))$ through the kernel $q_{\text{Prune},j}(\cdot \mid M_t(j+1))$.
2) Compute the Metropolis-Hastings correction for $M'$ versus $M_t(j+1)$. If $M'$ is accepted, then set $M_t(j) = M'$; otherwise, set $M_t(j) = M_t(j+1)$.
3) If $j > J^{\min}$, then set $j = j - 1$ and return to Step 1); otherwise, proceed to Step 4).
4) Set $M_t = M_t(J^{\min})$.
It is clear that the growing and pruning steps are reverse kernels of each other. Pruning and growing can also be combined for each $j$: the forward kernel proceeds from $j = J^{\min}$ to $j = J^{\max}$ and proposes adding sets of nodes from $C_j(M) \cup E_j(M)$, while the reverse kernel simply reverses the direction of $j$, proceeding from $j = J^{\max}$ to $j = J^{\min}$.
4.5 Simulation Study
To study the operating characteristics of the proposed priors, a simulation experiment was designed with three goals. First, the priors are characterized by how the posterior distributions are affected by the sample size and the signal-to-noise ratio (SNR). Second, given the SNR level, the influence of the allocation of the signal across the terms in the model is investigated. Third, performance is assessed when the true model has special points in the scale (McCullagh & Nelder, 1989), i.e., when the true model has coefficients equal to zero for some lower-order terms in the polynomial hierarchy.
With these goals in mind, sets of predictors and responses are generated under various experimental conditions. The model space is defined with $M_B$ being the intercept-only model and $M_F$ being the complete order-four polynomial surface in five main effects, which has 126 nodes. The entries of the matrix of main effects are generated as independent standard normal. The response vectors are drawn from the $n$-variate normal distribution as $\mathbf{y} \sim N_n(Z_{M_T}(X)\beta_\gamma, I_n)$, where $M_T$ is the true model and $I_n$ is the $n \times n$ identity matrix.
The sample sizes considered are n ∈ {130, 260, 1040}, which ensures that Z_MF(X) is of full rank. The cardinality of this model space is |M| > 1.2 × 10^22, which makes enumeration of all models infeasible. Because the value of the 2k-th moment of the standard normal distribution increases with k = 1, 2, ..., higher-order terms by construction have a larger variance than their ancestors. As such, assuming equal values for all coefficients, higher-order terms necessarily contain more "signal" than the lower-order terms from which they inherit (e.g., x1² has more signal than x1). Once a higher-order term is selected, its entire ancestry is also included. Therefore, to prevent the simulation results from being overly optimistic (because of the larger signals from the higher-order terms), sphering is used to calculate meaningful values of the coefficients, ensuring that the signal is of the magnitude intended in any given direction. Given the results of the simulations from Section 4.3.3, only the HOP with a = 1, b = ch is considered, with the EPP included for comparison.
The total number of combinations of SNR, sample size, regression coefficient values, and nodes in MT amounts to 108 different scenarios. Each scenario was run with 100 independently generated datasets, and the mean behavior of the samples was observed. The results presented in this section correspond to the median probability model (MPM) from each of the 108 simulation scenarios considered. Figure 4-7 shows the comparison between the two priors for the mean number of true positive (TP) and false positive (FP) terms. Although some of the scenarios consider true models that are not well-formulated, the smallest well-formulated model that stems from MT is always the one shown in Figure 4-6.
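The median probability model reported here keeps every term whose marginal posterior inclusion probability exceeds 1/2. A small illustrative sketch, with hypothetical models and normalized posterior probabilities:

```python
from collections import defaultdict

def median_probability_model(models, post_probs):
    """Return the set of terms with marginal inclusion probability > 1/2,
    together with the marginal inclusion probabilities themselves."""
    incl = defaultdict(float)
    for M, p in zip(models, post_probs):
        for term in M:
            incl[term] += p  # accumulate posterior mass of models containing term
    return {t for t, p in incl.items() if p > 0.5}, dict(incl)

# toy posterior over three nested models
models = [{"x1"}, {"x1", "x2"}, {"x1", "x2", "x1*x2"}]
probs = [0.6, 0.25, 0.15]
mpm, incl = median_probability_model(models, probs)  # mpm == {"x1"}
```

In practice the inclusion probabilities would be estimated from the models visited by the random walk, weighted by their estimated posterior probabilities.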
Figure 4-6. DAG of the largest true model MT used in the simulations
The results are summarized in Figure 4-7. Each point on the horizontal axis corresponds to the average for a given set of simulation conditions. Only labels for the SNR and sample size are included for clarity, but the results are also shown for the different values of the regression coefficients and the different true models considered. Additional details about the procedure and other results are included in the appendices.

4.5.1 SNR and Sample Size Effect
As expected, small sample sizes conditioned upon a small SNR impair the ability of the algorithm to detect true coefficients with both the EPP and the HOP(1, ch), with this effect being greater when using the latter prior. However, considering the mean number of TPs jointly with the number of FPs, it is clear that although the number of TPs is especially low with the HOP(1, ch), most of the few predictors that are discovered in fact belong to the true model. In terms of FPs, the HOP(1, ch) does better than the EPP, and even more so when both the sample size and the SNR are smallest. Finally, when either the SNR or the sample size is large, the performance in terms of TPs is similar between both priors, but the number of FPs is somewhat lower with the HOP.

Figure 4-7. Average true positives (TP) and average false positives (FP) in all simulated scenarios for the median probability model with EPP and HOP(1, ch)

4.5.2 Coefficient Magnitude
Three ways to allocate the amount of signal across predictors are considered. In the first, all coefficients contain the same amount of signal regardless of their order. In the second, each order-one coefficient contains twice as much signal as any order-two coefficient and four times as much as any order-three coefficient. Finally, in the third, each order-one coefficient contains half as much signal as any order-two coefficient and a quarter of what any order-three coefficient has. These choices are denoted by β(1) = c(1_o1, 1_o2, 1_o3), β(2) = c(1_o1, 0.5_o2, 0.25_o3), and β(3) = c(0.25_o1, 0.5_o2, 1_o3), respectively. In Figure 4-7, the first four scenarios correspond to simulations with β(1), the next four use β(2), the next four correspond to β(3), and then the values are cycled in the same way. The results show that scenarios using either β(1) or β(3) behave similarly, contrasting with the negative impact of having the highest signal in the order-one terms through β(2). In Figure 4-7, the effect of using β(2) is evident, as it corresponds to the lowest values for the TPs regardless of the sample size, the SNR, or the prior used. This is an intuitive result: giving more signal to higher-order terms makes it easier to detect higher-order terms, and consequently, by strong heredity, the algorithm will also select the corresponding lower-order terms included in the true model.

4.5.3 Special Points on the Scale
Four true models were considered: (1) the model from Figure 4-6 (MT1); (2) the model without the order-one terms (MT2); (3) the model without order-two terms (MT3); and (4) the model without x1² and x2x5 (MT4). The last three are clearly not well-formulated. In Figure 4-7, the leftmost point on the horizontal axis corresponds to scenarios with MT1, the next point is for scenarios with MT2, followed by those with MT3, then those with MT4, then MT1 again, and so on. In comparison to the EPP, the HOP(1, ch) tightly controls the inclusion of FPs by choosing smaller models, at the expense of also reducing the TP count, especially when there is more uncertainty about the true model (i.e., SNR = 0.25). For both prior structures, the results indicate that at low SNR levels the presence of special points has no apparent impact, as the selection behavior is similar between the four models in terms of both the TP and FP counts. An interesting observation is that the effect of having special points on the scale is vastly magnified whenever the coefficients that assign more weight to order-one terms (β(2)) are used.

4.6 Case Study: Ozone Data Analysis
This section uses the ozone data from Breiman & Friedman (1985) and follows the analysis performed by Liang et al. (2008), who investigated hyper-g priors. After removing observations with missing values, 330 observations remain, including daily measurements of maximum ozone concentration near Los Angeles and eight meteorological variables (Table 4-5). From the 330 observations, 165 were sampled at random without replacement and used to run the variable selection procedure; the remaining 165 were used for validation. The eight meteorological variables, their interactions, and their squared terms are used as predictors, resulting in a full model with 44 predictors. The model space assumes that the base model MB is the intercept-only model and that MF is the quadratic surface in the eight meteorological variables. The model space contains approximately 7.1 billion models, and computation of all model posterior probabilities is not feasible.
Table 4-5. Variables used in the analyses of the ozone contamination dataset

Name   Description
ozone  Daily max 1-hr-average ozone (ppm) at Upland, CA
vh     500 millibar pressure height (m) at Vandenberg AFB
wind   Wind speed (mph) at LAX
hum    Humidity (%) at LAX
temp   Temperature (F) measured at Sandburg, CA
ibh    Inversion base height (ft) at LAX
dpg    Pressure gradient (mm Hg) from LAX to Daggett, CA
vis    Visibility (miles) measured at LAX
ibt    Inversion base temperature (F) at LAX
The HOP, HUP, and HIP with a = 1 and b = ch, as well as the EPP, are considered for comparison purposes. To obtain the Bayes factors in Equation 3–3, four different mixtures of g-priors are utilized: intrinsic priors (IP), which yield the expression in Equation 3–2; hyper-g (HG) priors (Liang et al. 2008) with hyperparameters α = 2, β = 1 and α = β = 1; and Zellner-Siow (ZS) priors (Zellner & Siow 1980). The results were extracted for the median probability model (MPM) in each case. Additionally, the model is estimated using the R package hierNet (Bien et al. 2013) to compare the model selection results to those obtained with the hierarchical lasso (Bien et al. 2013), restricted to well-formulated models by imposing the strong heredity constraint. The procedures were assessed on the basis of their predictive accuracy on the validation dataset.
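The predictive assessment can be sketched as follows; the random half-split and the RMSE criterion mirror the description above, with hypothetical response and prediction vectors:

```python
import math
import random

def rmse(y, yhat):
    """Root mean squared prediction error over the validation cases."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, yhat)) / len(y))

# split the 330 observation indices into estimation and validation halves
random.seed(42)  # arbitrary seed, for reproducibility only
idx = list(range(330))
random.shuffle(idx)
estimation, validation = idx[:165], idx[165:]
```

Each selected model is then fit on the estimation half and scored with `rmse` on the validation half.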
Among all models, the one that yields the smallest RMSE is the median probability model obtained using the HOP and EPP with the ZS prior, and also using the HOP with both HG priors (Table 4-6). The HOP model with the intrinsic prior has all the terms contained in the lowest-RMSE model except dpg², which has a relatively high marginal inclusion probability of 0.46. This disparity between the IP and the other mixtures of g-priors is explained by the fact that the IP induces less posterior shrinkage than the ZS and HG priors. The MPMs obtained through the HUP and HIP are nested in the best model, suggesting that these model space priors penalize complexity too much and produce false negatives. Consideration of these MPMs suggests that the HOP is best at producing true positives while controlling for false positives.
Finally, the model obtained from the hierarchical lasso (HierNet) is the largest model and produces the second-largest RMSE. All of the terms contained in any of the other models, except for vh, are nested within the hierarchical lasso model, and most of the terms that are exclusive to this model receive extremely low marginal inclusion probabilities under any of the model priors and parameter priors considered under Bayesian model selection.
Table 4-6. Median probability models (MPM) from different combinations of parameter and model priors vs. the model selected using the hierarchical lasso

BF prior  Model prior  Model                                               R2      RMSE
IP        EPP          hum, dpg, ibt, hum², hum∗dpg, hum∗ibt, dpg², ibt²   0.8054  4.2739
IP        HIP          hum, ibt, hum², hum∗ibt, ibt²                       0.7740  4.3396
IP        HOP          hum, dpg, ibt, hum², hum∗ibt, ibt²                  0.7848  4.3175
IP        HUP          hum, dpg, ibt, hum∗ibt, ibt²                        0.7767  4.3508
ZS        EPP          hum, dpg, ibt, hum², hum∗ibt, dpg², ibt²            0.7896  4.2518
ZS        HIP          hum, ibt, hum∗ibt, ibt²                             0.7525  4.3505
ZS        HOP          hum, dpg, ibt, hum², hum∗ibt, dpg², ibt²            0.7896  4.2518
ZS        HUP          hum, dpg, ibt, hum∗ibt, ibt²                        0.7767  4.3508
HG(1,1)   EPP          vh, hum, dpg, ibt, hum², hum∗ibt, dpg²              0.7701  4.3049
HG(1,1)   HIP          hum, ibt, hum∗ibt, ibt²                             0.7525  4.3505
HG(1,1)   HOP          hum, dpg, ibt, hum², hum∗ibt, dpg², ibt²            0.7896  4.2518
HG(1,1)   HUP          hum, dpg, ibt, hum∗ibt, ibt²                        0.7767  4.3508
HG(2,1)   EPP          hum, dpg, ibt, hum², hum∗ibt, dpg²                  0.7701  4.3037
HG(2,1)   HIP          hum, dpg, ibt, hum∗ibt, ibt²                        0.7767  4.3508
HG(2,1)   HOP          hum, dpg, ibt, hum², hum∗ibt, dpg², ibt²            0.7896  4.2518
HG(2,1)   HUP          hum, dpg, ibt, hum∗ibt                              0.7526  4.4036
HierNet                hum, temp, ibh, dpg, ibt, vis, hum², hum∗ibt,       0.7651  4.3680
                       temp², temp∗ibt, dpg²
4.7 Discussion

Scott & Berger (2010) noted that the Ockham's-razor effect found automatically in Bayesian variable selection through the Bayes factor does not correct for multiple testing. The Bayes factor penalizes the complexity of the alternative model according to the number of parameters in excess of those of the null model; therefore, the Bayes factor only controls complexity in a pairwise fashion. If the model selection procedure uses equal prior probabilities for all M ∈ M, then these comparisons ignore the effect of the multiplicity of the testing problem. This is where the role of the prior on the model space becomes important: the multiplicity penalty is "hidden away" in the model prior probabilities π(M|M).
In addition to the multiplicity of the testing problem, disregarding the hierarchical polynomial structure of the predictors in model selection procedures can lead to different results depending on how the predictors are set up (e.g., in what units the predictors are expressed).
In this chapter we investigated a solution to these two issues. We defined prior structures for well-formulated models and developed random walk algorithms to traverse this type of model space. The key to understanding prior distributions on the space of WFMs is the hierarchical nature of the model space itself. The prior distributions described take advantage of that hierarchy in two ways. First, conditional independence and immediate inheritance are used to develop the HOP, HIP, and HUP structures discussed in Section 4.3. Second, the conditional nature of the priors allows for the direct incorporation of complexity penalizations. Of the priors proposed, the HOP with the hyperparameter choice (1, ch) provides the best control of false positives while maintaining a reasonable true positive rate. Thus, this prior is recommended as the default prior on the space of WFMs.
In the near future, the software developed to carry out a Metropolis-Hastings random walk on the space of WFMs will be integrated into the R package varSelectIP. These new functions implement various local priors for the regression coefficients, including the intrinsic prior, Zellner-Siow prior, and hyper-g priors. In addition, the software supports the computation of credible sets for each regression coefficient, conditioned on the selected model as well as under model averaging.
CHAPTER 5
CONCLUSIONS
Ecologists are now embracing the use of Bayesian methods to investigate the interactions that dictate the distribution and abundance of organisms. These tools are both powerful and flexible: they allow integrating, under a single methodology, empirical observations and theoretical process models, and they can seamlessly account for several sources of uncertainty and dependence. The estimation and testing methods proposed throughout this document will contribute to the understanding of Bayesian methods used in ecology, and hopefully they will shed light on the differences between Bayesian estimation and testing tools.
All of our contributions exploit the potential of the latent variable formulation. This approach greatly simplifies the analysis of complex models: it redirects the bulk of the inferential burden away from the original response variables and places it on the easy-to-work-with latent scale, for which several time-tested approaches are available. Our methods fall distinctly into estimation and testing tools.
For estimation, we proposed a Bayesian specification of the single-season occupancy model for which a Gibbs sampler is available using both logit and probit link functions. This setup allows detection and occupancy probabilities to depend on linear combinations of predictors. We then developed a dynamic version of this approach, incorporating the notion that occupancy at a previously occupied site depends both on the survival of current settlers and on habitat suitability. Additionally, because these dynamics also vary in space, we suggested a strategy to add spatial dependence among neighboring sites.
Ecological inquiry usually involves competing explanations, and uncertainty surrounds the decision of choosing any one of them. Hence, a model, or a set of probable models, should be selected from all the viable alternatives. To address this testing problem, we proposed an objective and fully automatic Bayesian methodology for the single-season site-occupancy model. Our approach relies on the intrinsic prior, which avoids introducing (commonly unavailable) subjective information into the model. In simulation experiments, we observed that the method accurately singles out the predictors present in the true model using their marginal posterior inclusion probabilities. For predictors in the true model, these probabilities were comparatively larger than those for predictors not present in the true model. The simulations also indicated that the method provides better discrimination for predictors in the detection component of the model.
In our simulations and in the analysis of the Blue Hawker data, we observed that the effect of using the multiplicity correction prior was substantial. This occurs because the Bayes factor only penalizes the complexity of the alternative model according to its number of parameters in excess of those of the null model. As the number of predictors grows, the number of models in the model space also grows, increasing the chances of making false positive decisions on the inclusion of predictors. This is where the role of the prior on the model space becomes important: the multiplicity penalty is "hidden away" in the model prior probabilities π(M|M). In addition to the multiplicity of the testing problem, disregarding the hierarchical polynomial structure of the predictors in model selection procedures can lead to different results depending on how the predictors are coded (e.g., in what units the predictors are expressed).
To confront this situation, we proposed three prior structures for well-formulated models that take advantage of the hierarchical structure of the predictors. Of the priors proposed, we recommend the HOP with the hyperparameter choice (1, ch), which provides the best control of false positives while maintaining a reasonable true positive rate.
Overall, considering the flexibility of the latent approach, several other extensions of these methods follow. Currently, we envision three future developments: (1) occupancy models that incorporate various sources of information; (2) multi-species models that make use of spatial and interspecific dependence; and (3) methods to conduct model selection for the dynamic and spatially explicit version of the model.
APPENDIX A
FULL CONDITIONAL DENSITIES: DYMOSS

In this appendix we present the full conditional probability density functions for all the parameters involved in the DYMOSS model, using the probit as well as the logit links.
Sampler Z

The full conditionals corresponding to the presence indicators have the same form regardless of the link used. These are derived separately for the cases t = 1, 1 < t < T, and t = T, since their corresponding probabilities take on slightly different forms.

Let ϕ(ν|µ, σ²) represent the density of a normal random variable ν with mean µ and variance σ², and recall that ψ_i1 = F(x′_(o)i α) and p_ijt = F(q′_ijt λ_t), where F(·) is the inverse link function. The full conditional for z_it is given by:

1. For t = 1,

π(z_i1 | v_i1, α, λ_1, β^c_1, δ^s_1) = (ψ*_i1)^{z_i1} (1 − ψ*_i1)^{1 − z_i1} = Bernoulli(ψ*_i1),   (A–1)

where

ψ*_i1 = [ψ_i1 ϕ(v_i1 | x′_i1 β^c_1 + δ^s_1, 1) ∏_{j=1}^{J_i1} (1 − p_ij1)] / [ψ_i1 ϕ(v_i1 | x′_i1 β^c_1 + δ^s_1, 1) ∏_{j=1}^{J_i1} (1 − p_ij1) + (1 − ψ_i1) ϕ(v_i1 | x′_i1 β^c_1, 1) ∏_{j=1}^{J_i1} I{y_ij1 = 0}].

2. For 1 < t < T,

π(z_it | z_i(t−1), z_i(t+1), λ_t, β^c_{t−1}, δ^s_{t−1}) = (ψ*_it)^{z_it} (1 − ψ*_it)^{1 − z_it} = Bernoulli(ψ*_it),   (A–2)

where

ψ*_it = [κ_it ∏_{j=1}^{J_it} (1 − p_ijt)] / [κ_it ∏_{j=1}^{J_it} (1 − p_ijt) + ∇_it ∏_{j=1}^{J_it} I{y_ijt = 0}],

with
(a) κ_it = F(x′_{i(t−1)} β^c_{t−1} + z_i(t−1) δ^s_{t−1}) ϕ(v_it | x′_it β^c_t + δ^s_t, 1), and
(b) ∇_it = (1 − F(x′_{i(t−1)} β^c_{t−1} + z_i(t−1) δ^s_{t−1})) ϕ(v_it | x′_it β^c_t, 1).

3. For t = T,

π(z_iT | z_i(T−1), λ_T, β^c_{T−1}, δ^s_{T−1}) = (ψ⋆_iT)^{z_iT} (1 − ψ⋆_iT)^{1 − z_iT} = Bernoulli(ψ⋆_iT),   (A–3)

where

ψ⋆_iT = [κ⋆_iT ∏_{j=1}^{J_iT} (1 − p_ijT)] / [κ⋆_iT ∏_{j=1}^{J_iT} (1 − p_ijT) + ∇⋆_iT ∏_{j=1}^{J_iT} I{y_ijT = 0}],

with
(a) κ⋆_iT = F(x′_{i(T−1)} β^c_{T−1} + z_i(T−1) δ^s_{T−1}), and
(b) ∇⋆_iT = 1 − F(x′_{i(T−1)} β^c_{T−1} + z_i(T−1) δ^s_{T−1}).

Sampler u_i

1.

π(u_i | z_i1, α) = tr N(x′_(o)i α, 1, trunc(z_i1)),   (A–4)

where trunc(z_i1) = (−∞, 0] if z_i1 = 0 and (0, ∞) if z_i1 = 1, and tr N(µ, σ², A) denotes the pdf of a truncated normal random variable with mean µ, variance σ², and truncation region A.
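A minimal sketch of drawing u_i by rejection sampling under the probit-style augmentation with unit variance (a dedicated truncated-normal sampler would be preferable in practice when the mean is far from the truncation point):

```python
import random

def draw_u(mu, z):
    """Draw u ~ N(mu, 1) truncated to (-inf, 0] when z == 0
    and to (0, inf) when z == 1, by naive rejection sampling."""
    while True:
        u = random.gauss(mu, 1.0)
        if (z == 1 and u > 0.0) or (z == 0 and u <= 0.0):
            return u
```

Within the Gibbs sampler, `mu` would be the linear predictor x′_(o)i α evaluated at the current draw of α.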
Sampler α

1.

π(α|u) ∝ [α] ∏_{i=1}^{N} ϕ(u_i | x′_(o)i α, 1).   (A–5)

If [α] ∝ 1, then

α|u ∼ N(m(α), Σ_α),

with m(α) = Σ_α X′_(o) u and Σ_α = (X′_(o) X_(o))^{−1}.
Sampler v_it

1. (For t > 1)

π(v_i(t−1) | z_i(t−1), z_it, β^c_{t−1}, δ^s_{t−1}) = tr N(µ^(v)_{i(t−1)}, 1, trunc(z_it)),   (A–6)

where µ^(v)_{i(t−1)} = x′_{i(t−1)} β^c_{t−1} + z_i(t−1) δ^s_{t−1}, and trunc(z_it) denotes the corresponding truncation region determined by z_it.
Sampler (β^c_{t−1}, δ^s_{t−1})

1. (For t > 1)

π(β^c_{t−1}, δ^s_{t−1} | v_{t−1}, z_{t−1}) ∝ [β^c_{t−1}, δ^s_{t−1}] ∏_{i=1}^{N} ϕ(v_i(t−1) | x′_{i(t−1)} β^c_{t−1} + z_i(t−1) δ^s_{t−1}, 1).   (A–7)

If [β^c_{t−1}, δ^s_{t−1}] ∝ 1, then

(β^c_{t−1}, δ^s_{t−1}) | v_{t−1}, z_{t−1} ∼ N(m(β^c_{t−1}, δ^s_{t−1}), Σ_{t−1}),

with m(β^c_{t−1}, δ^s_{t−1}) = Σ_{t−1} X̃′_{t−1} v_{t−1} and Σ_{t−1} = (X̃′_{t−1} X̃_{t−1})^{−1}, where X̃_{t−1} = (X_{t−1}, z_{t−1}).
Sampler w_ijt

1. (For t = 1, ..., T and z_it = 1)

π(w_ijt | z_it = 1, y_ijt, λ_t) = tr N(q′_ijt λ_t, 1, trunc(y_ijt)).   (A–8)

Sampler λ_t

1. (For t = 1, 2, ..., T)

π(λ_t | z_t, w_t) ∝ [λ_t] ∏_{i: z_it = 1} ∏_{j=1}^{J_it} ϕ(w_ijt | q′_ijt λ_t, 1).   (A–9)

If [λ_t] ∝ 1, then

λ_t | w_t, z_t ∼ N(m(λ_t), Σ_{λ_t}),

with m(λ_t) = Σ_{λ_t} Q′_t w_t and Σ_{λ_t} = (Q′_t Q_t)^{−1}, where Q_t and w_t are, respectively, the design matrix and the vector of latent variables for surveys of sites with z_it = 1.
APPENDIX B
RANDOM WALK ALGORITHMS

Global Jump. From the current state M, the global jump is performed by drawing a model M′ at random from the model space. This is achieved by beginning at the base model and increasing the order from Jmin to Jmax, the minimum and maximum orders of nodes in MF \ MB; at each order, a set of nodes is selected at random from the prior, conditioned on the nodes already in the model. The MH correction is

α = min{1, m(y|M′, M) / m(y|M, M)}.
Local Jump. From the current state M, the local jump is performed by drawing a model from the set L(M) = {M_α : α ∈ E(M) ∪ C(M)}, where M_α is M \ {α} for α ∈ E(M) and M ∪ {α} for α ∈ C(M). The proposal probabilities are computed as a mixture of p(M′|y, M′ ∈ L(M)) and the discrete uniform distribution on L(M). The proposal kernel is

q(M′|y, M, M′ ∈ L(M)) = (1/2) [ p(M′|y, M′ ∈ L(M)) + 1/|L(M)| ].

This choice promotes moving to better models while maintaining a non-negligible probability of moving to any of the possible models. The MH correction is

α = min{1, [m(y|M′, M) / m(y|M, M)] · [q(M|y, M′, M ∈ L(M′)) / q(M′|y, M, M′ ∈ L(M))]}.
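The mixture proposal can be sketched as follows, with `neighbors` an enumeration of L(M) and `local_post` the normalized probabilities p(M′|y, M′ ∈ L(M)), both assumed precomputed:

```python
import random

def local_jump_proposal(neighbors, local_post):
    """Draw M' from the 50/50 mixture of the local posterior and the
    uniform distribution on L(M); return M' and its proposal probability."""
    k = len(neighbors)
    q = [0.5 * p + 0.5 / k for p in local_post]  # mixture proposal probabilities
    r, cum = random.random(), 0.0
    for model, qi in zip(neighbors, q):
        cum += qi
        if r < cum:
            return model, qi
    return neighbors[-1], q[-1]  # guard against floating-point rounding
```

The returned proposal probability enters the MH correction above, together with the reverse-move probability computed on L(M′).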
Intermediate Jump. The intermediate jump is performed by increasing or decreasing the order of the nodes under consideration, performing local proposals based on order. For a model M′, define L_j(M′) = {M′} ∪ {M′_α : α ∈ (E(M′) ∪ C(M′)) of order j}. From a state M, the kernel chooses at random whether to increase or decrease the order. If M = MF, then decreasing the order is chosen with probability 1, and if M = MB, then increasing the order is chosen with probability 1; in all other cases, increasing and decreasing the order each have probability 1/2. The proposal kernels are given by:
Increasing order proposal kernel:

1. Set j = Jmin_M − 1 and M′_j = M.

2. Draw M′_{j+1} from qinc_{j+1}(·|y, M, M′ ∈ L_{j+1}(M′_j)), where

qinc_{j+1}(M′|y, M, M′ ∈ L_{j+1}(M′_j)) = (1/2) [ p(M′|y, M′ ∈ L_{j+1}(M′_j)) + 1/|L_{j+1}(M′_j)| ].

3. Set j = j + 1.

4. If j < Jmax_M, return to Step 2; otherwise proceed to Step 5.

5. Set M′ = M′_{Jmax_M} and compute the proposal probability

qinc(M′|y, M) = ∏_{j = Jmin_M − 1}^{Jmax_M − 1} qinc_{j+1}(M′_{j+1}|y, M, M′ ∈ L_{j+1}(M′_j)).   (B–1)
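The chained structure of Equation B–1 can be sketched as follows; `step_kernel` is a hypothetical stand-in for qinc_{j+1}, returning the drawn model and its one-step proposal probability:

```python
import math

def increasing_order_proposal(M, j_min, j_max, step_kernel):
    """Chain order-wise draws from j_min-1 up to j_max-1, accumulating
    the log proposal probability as the product in Eq. B-1."""
    current, log_q = M, 0.0
    for j in range(j_min - 1, j_max):
        current, q = step_kernel(current, j + 1)  # draw from L_{j+1}(current)
        log_q += math.log(q)
    return current, log_q
```

The decreasing-order kernel is the mirror image, chaining draws downward and accumulating the product in Equation B–2.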
Decreasing order proposal kernel:

1. Set j = Jmax_M + 1 and M′_j = M.

2. Draw M′_{j−1} from qdec_{j−1}(·|y, M, M′ ∈ L_{j−1}(M′_j)), where

qdec_{j−1}(M′|y, M, M′ ∈ L_{j−1}(M′_j)) = (1/2) [ p(M′|y, M′ ∈ L_{j−1}(M′_j)) + 1/|L_{j−1}(M′_j)| ].

3. Set j = j − 1.

4. If j > Jmin_M, return to Step 2; otherwise proceed to Step 5.

5. Set M′ = M′_{Jmin_M} and compute the proposal probability

qdec(M′|y, M) = ∏_{j = Jmax_M + 1}^{Jmin_M + 1} qdec_{j−1}(M′_{j−1}|y, M, M′ ∈ L_{j−1}(M′_j)),   (B–2)

where the product runs over decreasing j.
If increasing order is chosen, then the MH correction is given by

α = min{1, [(1 + I(M′ = MF)) / (1 + I(M = MB))] · [qdec(M|y, M′) / qinc(M′|y, M)] · [p(M′|y, M) / p(M|y, M)]},   (B–3)

and similarly if decreasing order is chosen.
Other Local and Intermediate Kernels. The local and intermediate kernels described here perform a kind of stochastic forwards-backwards selection. Each kernel q can be relaxed to allow more than one node to be turned on or off at each step, which could provide larger jumps. The tradeoff is that the number of models proposed in such jumps could be very large, precluding the use of posterior information in the construction of the proposal kernel.
APPENDIX C
WFM SIMULATION DETAILS

Briefly, the idea is to let Z_MT(X) β_MT = (QR) β_MT = Q η_MT (i.e., β_MT = R^{−1} η_MT), using the QR decomposition. As such, setting all values in η_MT proportional to one corresponds to distributing the signal in the model uniformly across all predictors, regardless of their order.

The (unconditional) variance of a single observation y_i is var(y_i) = var(E[y_i | z_i]) + E[var(y_i | z_i)], where z_i is the i-th row of the design matrix Z_MT. Hence we take the signal-to-noise ratio for each observation to be

SNR(η) = η′_MT R^{−⊤} Σ_z R^{−1} η_MT / σ²,

where Σ_z = var(z_i). We determine how the signal is distributed across predictors only up to a proportionality constant, so that the signal-to-noise ratio can be controlled simultaneously.
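Concretely, once a direction d for η_MT is fixed, the proportionality constant follows directly from the SNR definition above (a short derivation, with d denoting the chosen direction):

```latex
% choose eta proportional to a direction d and solve for the scale c
\eta_{M_T} = c\, d, \qquad
\mathrm{SNR}(\eta_{M_T})
  = \frac{c^{2}\, d^{\top} R^{-\top} \Sigma_{z} R^{-1} d}{\sigma^{2}} = k
\;\Longrightarrow\;
c = \sqrt{\frac{k\,\sigma^{2}}{d^{\top} R^{-\top} \Sigma_{z} R^{-1} d}},
\qquad \beta_{M_T} = R^{-1}\eta_{M_T}.
```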
Additionally, to investigate the ability of the model to capture the hierarchical structure correctly, we specify four different 0-1 vectors γ_MT that determine the predictors in MT, the model that generates the data in the different scenarios.

Table C-1. Experimental conditions, WFM simulations

Parameter         Values considered
SNR(η_MT) = k     0.25, 1, 4
η_MT ∝            (1_o1, 1_o2, 1_o3); (1_o1, 0.5_o2, 0.25_o3); (0.25_o1, 0.5_o2, 1_o3)
γ_MT              MT1 (model of Figure 4-6); MT2 (no order-one terms); MT3 (no order-two terms); MT4 (no x1² and x2x5)
n                 130, 260, 1040
The results presented below are somewhat different from those found in the main body of the article in Section 5: they are obtained by averaging the numbers of FPs and TPs and the model sizes, respectively, over the 100 independent runs and across the corresponding scenarios, for the 20 highest posterior probability models.
SNR and Sample Size Effect

In terms of the SNR and the sample size (Figure C-1), we observe that, as expected, small sample sizes conditioned upon a small SNR impair the ability of the algorithm to detect true coefficients with both the EPP and the HOP(1, ch), with this effect more noticeable when using the latter prior. However, considering the mean number of true positives (TP) jointly with the mean model size, it is clear that although the sensitivity is low, most of the few predictors that are discovered belong to the true model. The results observed with an SNR of 0.25 and a relatively small sample size are far from impressive; however, real problems where the SNR is as low as 0.25 will yield many spurious associations under the EPP. The fact that the HOP(1, ch) offers strong protection against false positives is commendable in itself. An SNR of 1 also represents a feeble relationship between the predictors and the response; nonetheless, the method captures approximately half of the true coefficients while including very few false positives. Following intuition, as either the sample size or the SNR increases, the algorithm's performance is greatly enhanced. Either a large sample size or a large SNR yields models that contain mostly true predictors. Additionally, the HOP(1, ch) provides strong control over the number of false positives; therefore, for high SNR or larger sample sizes, the number of predictors in the top 20 models is close to the size of the true model. In general, the EPP allows the detection of more TPs, while the HOP(1, ch) provides stronger control on the number of FPs included when small sample sizes are combined with small SNRs. As either the sample size or the SNR grows, the differences between the two priors become indistinct.

Figure C-1. SNR vs. n. Average model size, average true positives, and average false positives for all simulated scenarios, by model ranking according to model posterior probabilities
Coefficient Magnitude

This part of the experiment explores the effect of how the signal is distributed across predictors. As mentioned before, sphering is used to assign the coefficient values in a manner that controls the amount of signal that goes into each coefficient. Three possible ways to allocate the signal are considered: first, each order-one coefficient contains twice as much signal as any order-two coefficient and four times as much as any order-three coefficient; second, all coefficients contain the same amount of signal regardless of their order; and third, each order-one coefficient contains half as much signal as any order-two coefficient and a quarter of what any order-three coefficient has. In Figure C-2 these values are denoted by β = c(1_o1, 0.5_o2, 0.25_o3), β = c(1_o1, 1_o2, 1_o3), and β = c(0.25_o1, 0.5_o2, 1_o3), respectively.

Observe that the number of FPs is insensitive to how the SNR is distributed across predictors when using the HOP(1, ch); conversely, when using the EPP, the number of FPs decreases as the SNR grows, always remaining slightly higher than that obtained with the HOP. With either prior structure, the algorithm performs better whenever all coefficients are equally weighted or when those for the order-three terms have higher weights. In these two cases (i.e., with β = c(1_o1, 1_o2, 1_o3) or β = c(0.25_o1, 0.5_o2, 1_o3)), the effect of the SNR appears to be similar. In contrast, when more weight is given to order-one terms, the algorithm yields slightly worse models at any SNR level. This is an intuitive result, since giving more signal to higher-order terms makes it easier to detect higher-order terms, and consequently, by strong heredity, the algorithm will also select the corresponding lower-order terms included in the true model.
Special Points on the Scale

In Nelder (1998), the author argues that the conditions under which the weak-heredity principle can be used for model selection are so restrictive that the principle is commonly not valid in practice. In addition, the author states that considering only well-formulated models does not take into account the possible presence of special points on the scales of the predictors, that is, situations where omitting lower-order terms is justified by the nature of the data. However, it is our contention that every model has an underlying well-formulated structure; whether or not some predictor has special points on its scale will be determined through the estimation of the coefficients once a valid well-formulated structure has been chosen.

To understand how the algorithm behaves whenever the true data-generating mechanism has zero-valued coefficients for some lower-order terms in the hierarchy, four different true models are considered. Three of them are not well-formulated, while the remaining one is the WFM shown in Figure 4-6. The three models that have special points correspond to the same model MT from Figure 4-6 but have, respectively, zero-valued coefficients for all the order-one terms, for all the order-two terms, and for x1² and x2x5.

Figure C-2. SNR vs. coefficient values. Average model size, average true positives, and average false positives for all simulated scenarios, by model ranking according to model posterior probabilities
As seen before in comparison to the EPP the HOP(1 ch) tightly controls the
inclusion FPs by choosing smaller models at the expense of also reducing the TP
count especially when there is more uncertainty about the true model (ie SNR=025)
For both prior structures the results in Figure C-3 indicate that at low SNR levels the
presence of special points has no apparent impact as the selection behavior is similar
between the four models in terms of both the TP and FP As the SNR increases the
TPs and the model size are affected for true models with zero-valued lower order
128
Figure C-3 SNR vs different true models MT Average model size average truepositives and average false positives for all simulated scenarios by modelranking according to model posterior probabilities
terms These differences however are not very large Relatively smaller models are
selected whenever some terms in the hierarchy are missing; however, at high SNR, which is where the differences are most pronounced, the predictors included are mostly true coefficients. The impact is almost imperceptible for the true model that lacks order-one terms and for the model with zero coefficients for x1^2 and x2x5, and is more visible for models without order-two terms. This last result is expected due to strong heredity: whenever the order-one coefficients are missing, the inclusion of order-two and order-three terms will force their selection, which is also the case when only a few order-two terms have zero-valued coefficients. Conversely, when all order-two predictors are removed,
some order-three predictors are not selected, as their signal is attributed to the order-two predictors missing from the true model. This is especially the case for the order-three interaction term x1x2x5, which depends on the inclusion of three order-two terms (x1x2, x1x5, x2x5) in order for it to be included as well. This makes the inclusion of this term somewhat more challenging: the three order-two interactions capture most of the variation of the polynomial terms that is present when the order-three term is also included. However, special points on the scale commonly occur on a single or at most a few covariates. A true data-generating mechanism that removes all terms of a given order is clearly not justified in the context of polynomial models; here this was done only for comparison purposes.
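The strong-heredity constraint discussed above can be made concrete with a small sketch (ours, not code from the dissertation): a term such as x1x2x5 may enter the model only if all of its lower-order parents are also included. Terms are represented as sorted tuples of variable indices, so a squared term like x1^2 is the multiset (1, 1).

```python
def parents(term):
    """Immediate lower-order terms of a monomial, given as a sorted tuple of
    variable indices, e.g. (1, 2, 5) for x1*x2*x5 or (1, 1) for x1^2."""
    out = set()
    for i in range(len(term)):
        reduced = term[:i] + term[i + 1:]
        if reduced:
            out.add(tuple(sorted(reduced)))
    return out

def is_well_formulated(model):
    """True if every term's parents are also in the model (strong heredity)."""
    terms = {tuple(sorted(t)) for t in model}
    return all(parents(t) <= terms for t in terms)

# The WFM contains x1*x2*x5 together with all of its lower-order terms...
wfm = [(1,), (2,), (5,), (1, 2), (1, 5), (2, 5), (1, 2, 5)]
# ...while removing x2*x5 breaks the hierarchy below x1*x2*x5.
bad = [(1,), (2,), (5,), (1, 2), (1, 5), (1, 2, 5)]
```

This is why, in the simulations, including the order-three interaction forces the three order-two interactions back into the selected model even when their true coefficients are zero.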
APPENDIX D
SUPPLEMENTARY INFORMATION FOR THE OZONE DATA ANALYSIS
The covariates considered for the ozone data analysis match those used in Liang
et al. (2008); they are displayed in Table D-1 below.
Table D-1. Variables used in the analyses of the ozone contamination dataset
Name   Description
ozone  Daily max 1hr-average ozone (ppm) at Upland, CA
vh     500 millibar pressure height (m) at Vandenberg AFB
wind   Wind speed (mph) at LAX
hum    Humidity (%) at LAX
temp   Temperature (F) measured at Sandburg, CA
ibh    Inversion base height (ft) at LAX
dpg    Pressure gradient (mm Hg) from LAX to Daggett, CA
vis    Visibility (miles) measured at LAX
ibt    Inversion base temperature (F) at LAX
The marginal posterior inclusion probability corresponds to the probability of including a
given term of the full model MF after summing over all models in the model space. For each
node α ∈ MF, this probability is given by p_α = Σ_{M ∈ M} I(α ∈ M) p(M | y, M). In problems
with a large model space, such as the one considered for the ozone concentration problem,
enumeration of the entire space is not feasible; thus, these probabilities are estimated by
summing over every model drawn by the random walk over the model space M.
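The estimator just described can be sketched as follows (an illustrative implementation with our own naming; the weights stand in for the posterior probabilities of the visited models, renormalized over the visited subset of the model space):

```python
from collections import defaultdict

def marginal_inclusion_probs(sampled_models, posterior_probs):
    """Estimate p_alpha = sum_M I(alpha in M) p(M | y, M) over the models
    visited by the random walk, renormalizing the posterior probabilities
    over the visited subset of the model space."""
    total = sum(posterior_probs)
    p_alpha = defaultdict(float)
    for model, prob in zip(sampled_models, posterior_probs):
        for term in model:
            p_alpha[term] += prob / total
    return dict(p_alpha)

# Toy example with three visited models and made-up posterior weights:
draws = [frozenset({"hum", "ibt"}), frozenset({"ibt"}),
         frozenset({"hum", "ibt", "dpg"})]
weights = [0.5, 0.3, 0.2]
probs = marginal_inclusion_probs(draws, weights)
# "ibt" appears in every visited model, so its estimate is 1.0
```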
Given that there are in total 44 potential predictors, for convenience, in Tables D-2 to D-5
below we display only the marginal posterior probabilities for the terms included under at least
one of the model priors considered (EPP, HIP, HUP, and HOP), for each of the parameter priors
utilized (intrinsic priors, Zellner-Siow priors, Hyper-g(11), and Hyper-g(21)).
Table D-2. Marginal inclusion probabilities, intrinsic prior
        EPP   HIP   HUP   HOP
hum     0.99  0.69  0.85  0.76
dpg     0.85  0.48  0.52  0.53
ibt     0.99  1.00  1.00  1.00
hum2    0.76  0.51  0.43  0.62
humdpg  0.55  0.02  0.03  0.17
humibt  0.98  0.69  0.84  0.75
dpg2    0.72  0.36  0.25  0.46
ibt2    0.59  0.78  0.57  0.81
Table D-3. Marginal inclusion probabilities, Zellner-Siow prior
        EPP   HIP   HUP   HOP
hum     0.76  0.67  0.80  0.69
dpg     0.89  0.50  0.55  0.58
ibt     0.99  1.00  1.00  1.00
hum2    0.57  0.49  0.40  0.57
humibt  0.72  0.66  0.78  0.68
dpg2    0.81  0.38  0.31  0.51
ibt2    0.54  0.76  0.55  0.77
Table D-4. Marginal inclusion probabilities, Hyper-g(11)
        EPP   HIP   HUP   HOP
vh      0.54  0.05  0.10  0.11
hum     0.81  0.67  0.80  0.69
dpg     0.90  0.50  0.55  0.58
ibt     0.99  1.00  0.99  0.99
hum2    0.61  0.49  0.40  0.57
humibt  0.78  0.66  0.78  0.68
dpg2    0.83  0.38  0.30  0.51
ibt2    0.49  0.76  0.54  0.77
Table D-5. Marginal inclusion probabilities, Hyper-g(21)
        EPP   HIP   HUP   HOP
hum     0.79  0.64  0.73  0.67
dpg     0.90  0.52  0.60  0.59
ibt     0.99  1.00  0.99  1.00
hum2    0.60  0.47  0.37  0.55
humibt  0.76  0.64  0.71  0.67
dpg2    0.82  0.41  0.36  0.52
ibt2    0.47  0.73  0.49  0.75
REFERENCES
Akaike, H. (1983). Information measures and model selection. Bulletin of the International Statistical Institute, 50, 277–290.

Albert, J. H., & Chib, S. (1993). Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association, 88(422), 669–679.

Berger, J., & Bernardo, J. (1992). On the development of reference priors. Bayesian Statistics, 4, 35–60.

Berger, J., & Pericchi, L. (1996). The intrinsic Bayes factor for model selection and prediction. Journal of the American Statistical Association, 91(433), 109–122.

Berger, J., Pericchi, L., & Ghosh, J. (2001). Objective Bayesian methods for model selection: introduction and comparison. In Model Selection, vol. 38 of IMS Lecture Notes - Monograph Series (pp. 135–207). Institute of Mathematical Statistics.

Besag, J., York, J., & Mollié, A. (1991). Bayesian image restoration, with two applications in spatial statistics. Annals of the Institute of Statistical Mathematics, 43, 1–20.

Bien, J., Taylor, J., & Tibshirani, R. (2013). A lasso for hierarchical interactions. The Annals of Statistics, 41(3), 1111–1141.

Breiman, L., & Friedman, J. (1985). Estimating optimal transformations for multiple regression and correlation. Journal of the American Statistical Association, 80, 580–598.

Brusco, M. J., Steinley, D., & Cradit, J. D. (2009). An exact algorithm for hierarchically well-formulated subsets in second-order polynomial regression. Technometrics, 51(3), 306–315.

Casella, G., Girón, F. J., Martínez, M. L., & Moreno, E. (2009). Consistency of Bayesian procedures for variable selection. The Annals of Statistics, 37(3), 1207–1228.

Casella, G., Moreno, E., & Girón, F. (2014). Cluster analysis, model selection, and prior distributions on models. Bayesian Analysis, TBA(TBA), 1–46.
Chipman, H. (1996). Bayesian variable selection with related predictors. Canadian Journal of Statistics, 24(1), 17–36.

Clyde, M., & George, E. I. (2004). Model uncertainty. Statistical Science, 19(1), 81–94.

Dewey, J. (1958). Experience and Nature. New York: Dover Publications.

Dorazio, R. M., & Taylor-Rodríguez, D. (2012). A Gibbs sampler for Bayesian analysis of site-occupancy data. Methods in Ecology and Evolution, 3, 1093–1098.

Ellison, A. M. (2004). Bayesian inference in ecology. Ecology Letters, 7, 509–520.

Fiske, I., & Chandler, R. (2011). unmarked: An R package for fitting hierarchical models of wildlife occurrence and abundance. Journal of Statistical Software, 43(10).

George, E. (2000). The variable selection problem. Journal of the American Statistical Association, 95(452), 1304–1308.

Girón, F. J., Moreno, E., Casella, G., & Martínez, M. L. (2010). Consistency of objective Bayes factors for nonnested linear models and increasing model dimension. Revista de la Real Academia de Ciencias Exactas, Físicas y Naturales. Serie A. Matemáticas, 104(1), 57–67.

Good, I. J. (1950). Probability and the Weighing of Evidence. New York: Haffner.

Griepentrog, G. L., Ryan, J. M., & Smith, L. D. (1982). Linear transformations of polynomial regression models. American Statistician, 36(3), 171–174.

Gunel, E., & Dickey, J. (1974). Bayes factors for independence in contingency tables. Biometrika, 61, 545–557.

Hanski, I. (1994). A practical model of metapopulation dynamics. Journal of Animal Ecology, 63, 151–162.

Hooten, M. (2006). Hierarchical spatio-temporal models for ecological processes. Doctoral dissertation, University of Missouri-Columbia.

Hooten, M. B., & Hobbs, N. T. (2014). A guide to Bayesian model selection for ecologists. Ecological Monographs (in press).
Hughes, J., & Haran, M. (2013). Dimension reduction and alleviation of confounding for spatial generalized linear mixed models. Journal of the Royal Statistical Society, Series B: Statistical Methodology, 75, 139–159.

Hurvich, C. M., & Tsai, C.-L. (1989). Regression and time series model selection in small samples. Biometrika, 76, 297–307.

Jeffreys, H. (1935). Some tests of significance treated by the theory of probability. Proceedings of the Cambridge Philosophical Society, 31, 203–222.

Jeffreys, H. (1961). Theory of Probability. London: Oxford University Press, 3rd ed.

Johnson, D., Conn, P., Hooten, M., Ray, J., & Pond, B. (2013). Spatial occupancy models for large data sets. Ecology, 94(4), 801–808.

Kass, R., & Wasserman, L. (1995). A reference Bayesian test for nested hypotheses and its relationship to the Schwarz criterion. Journal of the American Statistical Association, 90(431).

Kass, R. E., & Raftery, A. E. (1995). Bayes factors. Journal of the American Statistical Association, 90, 773–795.

Kass, R. E., & Wasserman, L. (1996). The selection of prior distributions by formal rules. Journal of the American Statistical Association, 91(435), 1343.

Kéry, M. (2010). Introduction to WinBUGS for Ecologists: A Bayesian Approach to Regression, ANOVA, Mixed Models and Related Analyses. Academic Press, 1st ed.

Kéry, M., Gardner, B., & Monnerat, C. (2010). Predicting species distributions from checklist data using site-occupancy models. Journal of Biogeography, 37(10), 1851–1862.

Khuri, A. (2002). Nonsingular linear transformations of the control variables in response surface models. Technical report.

Krebs, C. J. (1972). Ecology: The Experimental Analysis of Distribution and Abundance.
Lempers, F. B. (1971). Posterior Probabilities of Alternative Linear Models. Rotterdam: University of Rotterdam Press.

León-Novelo, L., Moreno, E., & Casella, G. (2012). Objective Bayes model selection in probit models. Statistics in Medicine, 31(4), 353–365.

Liang, F., Paulo, R., Molina, G., Clyde, M. A., & Berger, J. O. (2008). Mixtures of g priors for Bayesian variable selection. Journal of the American Statistical Association, 103(481), 410–423.

Link, W., & Barker, R. (2009). Bayesian Inference with Ecological Applications. Elsevier.

MacKenzie, D., & Nichols, J. (2004). Occupancy as a surrogate for abundance estimation. Animal Biodiversity and Conservation, 1, 461–467.

MacKenzie, D., Nichols, J., & Hines, J. (2003). Estimating site occupancy, colonization, and local extinction when a species is detected imperfectly. Ecology, 84(8), 2200–2207.

MacKenzie, D. I., Bailey, L. L., & Nichols, J. D. (2004). Investigating species co-occurrence patterns when species are detected imperfectly. Journal of Animal Ecology, 73, 546–555.

MacKenzie, D. I., Nichols, J. D., Lachman, G. B., Droege, S., Royle, J. A., & Langtimm, C. A. (2002). Estimating site occupancy rates when detection probabilities are less than one. Ecology, 83(8), 2248–2255.

Mazerolle, M. (2013). Package 'AICcmodavg'.

McCullagh, P., & Nelder, J. A. (1989). Generalized Linear Models (2nd ed.). London, England: Chapman & Hall.

McQuarrie, A., Shumway, R., & Tsai, C.-L. (1997). The model selection criterion AICu.
Moreno, E., Bertolino, F., & Racugno, W. (1998). An intrinsic limiting procedure for model selection and hypotheses testing. Journal of the American Statistical Association, 93(444), 1451–1460.

Moreno, E., Girón, F. J., & Casella, G. (2010). Consistency of objective Bayes factors as the model dimension grows. The Annals of Statistics, 38(4), 1937–1952.

Nelder, J. A. (1977). A reformulation of linear models. Journal of the Royal Statistical Society, Series A: Statistics in Society, 140, 48–77.

Nelder, J. A. (1998). The selection of terms in response-surface models: how strong is the weak-heredity principle? American Statistician, 52(4), 315–318.

Nelder, J. A. (2000). Functional marginality and response-surface fitting. Journal of Applied Statistics, 27(1), 109–112.

Nichols, J., Hines, J., & MacKenzie, D. (2007). Occupancy estimation and modeling with multiple states and state uncertainty. Ecology, 88(6), 1395–1400.

Ovaskainen, O., Hottola, J., & Siitonen, J. (2010). Modeling species co-occurrence by multivariate logistic regression generates new hypotheses on fungal interactions. Ecology, 91(9), 2514–2521.

Peixoto, J. L. (1987). Hierarchical variable selection in polynomial regression models. American Statistician, 41(4), 311–313.

Peixoto, J. L. (1990). A property of well-formulated polynomial regression models. American Statistician, 44(1), 26–30.

Pericchi, L. R. (2005). Model selection and hypothesis testing based on objective probabilities and Bayes factors. In Handbook of Statistics. Elsevier.

Polson, N. G., Scott, J. G., & Windle, J. (2013). Bayesian inference for logistic models using Pólya-Gamma latent variables. Journal of the American Statistical Association, 108, 1339–1349.

Rao, C. R., & Wu, Y. (2001). On model selection. Vol. 38 of Lecture Notes - Monograph Series (pp. 1–57). Beachwood, OH: Institute of Mathematical Statistics.
Reich, B. J., Hodges, J. S., & Zadnik, V. (2006). Effects of residual smoothing on the posterior of the fixed effects in disease-mapping models. Biometrics, 62, 1197–1206.

Reiners, W., & Lockwood, J. (2009). Philosophical Foundations for the Practices of Ecology. Cambridge University Press.

Rigler, F., & Peters, R. (1995). Science and Limnology. Excellence in Ecology. Germany: Ecology Institute.

Robert, C., Chopin, N., & Rousseau, J. (2009). Harold Jeffreys's Theory of Probability revisited. Statistical Science, 24(2), 141–179.

Robert, C. P. (1993). A note on the Jeffreys-Lindley paradox. Statistica Sinica, 3, 601–608.

Royle, J. A., & Kéry, M. (2007). A Bayesian state-space formulation of dynamic occupancy models. Ecology, 88(7), 1813–1823.

Scott, J., & Berger, J. (2010). Bayes and empirical-Bayes multiplicity adjustment in the variable-selection problem. The Annals of Statistics.

Spiegelhalter, D. J., & Smith, A. F. M. (1982). Bayes factors for linear and log-linear models with vague prior information. Journal of the Royal Statistical Society, Series B, 44, 377–387.

Tierney, L., & Kadane, J. B. (1986). Accurate approximations for posterior moments and marginal densities. Journal of the American Statistical Association, 81, 82–86.

Tyre, A. J., Tenhumberg, B., Field, S. A., Niejalke, D., Parris, K., & Possingham, H. P. (2003). Improving precision and reducing bias in biological surveys: estimating false-negative error rates. Ecological Applications, 13(6), 1790–1801.

Waddle, J. H., Dorazio, R. M., Walls, S. C., Rice, K. G., Beauchamp, J., Schuman, M. J., & Mazzotti, F. J. (2010). A new parameterization for estimating co-occurrence of interacting species. Ecological Applications, 20, 1467–1475.

Wasserman, L. (2000). Bayesian model selection and model averaging. Journal of Mathematical Psychology, 44(1), 92–107.

Wilson, M., Iversen, E., Clyde, M. A., Schmidler, S. C., & Schildkraut, J. M. (2010). Bayesian model search and multilevel inference for SNP association studies. The Annals of Applied Statistics, 4(3), 1342–1364.

Womack, A. J., León-Novelo, L., & Casella, G. (2014). Inference from intrinsic Bayes procedures under model selection and uncertainty. Journal of the American Statistical Association (in press).

Yuan, M., Joseph, V. R., & Zou, H. (2009). Structured variable selection and estimation. The Annals of Applied Statistics, 3(4), 1738–1757.

Zeller, K. A., Nijhawan, S., Salom-Pérez, R., Potosme, S. H., & Hines, J. E. (2011). Integrating occupancy modeling and interview data for corridor identification: A case study for jaguars in Nicaragua. Biological Conservation, 144(2), 892–901.

Zellner, A., & Siow, A. (1980). Posterior odds ratios for selected regression hypotheses. In Trabajos de Estadística y de Investigación Operativa (pp. 585–603).
BIOGRAPHICAL SKETCH
Daniel Taylor-Rodríguez was born in Bogotá, Colombia. He earned a BS
degree in economics from the Universidad de Los Andes (2004) and a Specialist
degree in statistics from the Universidad Nacional de Colombia. In 2009 he traveled
to Gainesville, Florida, to pursue a master's in statistics under the supervision of
George Casella. Upon completion, he started a PhD in interdisciplinary ecology with a
concentration in statistics, again under George Casella's supervision. After George's
passing, Linda Young and Nikolay Bliznyuk continued to oversee Daniel's mentorship.
He has accepted a joint postdoctoral fellowship at the Statistical and Applied
Mathematical Sciences Institute and the Department of Statistical Science at Duke
University.
3.5.2 Summary Statistics for the Highest Posterior Probability Model .... 76
3.6 Case Study: Blue Hawker Data Analysis .... 77
3.6.1 Results: Variable Selection Procedure .... 79
3.6.2 Validation for the Selection Procedure .... 81
3.7 Discussion .... 82
4 PRIORS ON THE MODEL SPACE AND WELL-FORMULATED MODELS .... 84
4.1 Introduction .... 84
4.2 Setup for Well-Formulated Models .... 88
4.2.1 Well-Formulated Model Spaces .... 90
4.3 Priors on the Model Space .... 91
4.3.1 Model Prior Definition .... 92
4.3.2 Choice of Prior Structure and Hyper-Parameters .... 96
4.3.3 Posterior Sensitivity to the Choice of Prior .... 99
4.4 Random Walks on the Model Space .... 104
4.4.1 Simple Pruning and Growing .... 105
4.4.2 Degree Based Pruning and Growing .... 106
4.5 Simulation Study .... 107
4.5.1 SNR and Sample Size Effect .... 109
4.5.2 Coefficient Magnitude .... 110
4.5.3 Special Points on the Scale .... 111
4.6 Case Study: Ozone Data Analysis .... 111
4.7 Discussion .... 113
5 CONCLUSIONS .... 115
APPENDIX
A FULL CONDITIONAL DENSITIES DYMOSS .... 118
B RANDOM WALK ALGORITHMS .... 121
C WFM SIMULATION DETAILS .... 124
D SUPPLEMENTARY INFORMATION FOR THE OZONE DATA ANALYSIS .... 131
REFERENCES .... 133
BIOGRAPHICAL SKETCH .... 140
LIST OF TABLES
Table page
1-1 Interpretation of BFji when contrasting Mj and Mi .... 20

3-1 Simulation control parameters, occupancy model selector .... 69

3-2 Comparison of average minOdds(MPIP) under scenarios having different numbers of sites (N=50, N=100) and under scenarios having different numbers of surveys per site (J=3, J=5), for the presence and detection components, using uniform and multiplicity correction priors .... 75

3-3 Comparison of average minOdds(MPIP) for different levels of signal considered in the occupancy and detection probabilities, for the presence and detection components, using uniform and multiplicity correction priors .... 75

3-4 Comparison between scenarios with 50 and 100 sites in terms of the average percentage of true positive and true negative terms over the highest probability models, for the presence and the detection components, using uniform and multiplicity correcting priors on the model space .... 76

3-5 Comparison between scenarios with 3 and 5 surveys per site in terms of the percentage of true positive and true negative predictors, averaged over the highest probability models, for the presence and the detection components, using uniform and multiplicity correcting priors on the model space .... 77

3-6 Comparison between scenarios with different levels of signal in the occupancy component in terms of the percentage of true positive and true negative predictors, averaged over the highest probability models, for the presence and the detection components, using uniform and multiplicity correcting priors on the model space .... 77

3-7 Comparison between scenarios with different levels of signal in the detection component in terms of the percentage of true positive and true negative predictors, averaged over the highest probability models, for the presence and the detection components, using uniform and multiplicity correcting priors on the model space .... 78

3-8 Posterior probability for the five highest probability models in the presence component of the blue hawker data .... 80

3-9 Posterior probability for the five highest probability models in the detection component of the blue hawker data .... 80

3-10 MPIP, presence component .... 81

3-11 MPIP, detection component .... 81

3-12 Mean misclassification rate for HPMs and MPMs using uniform and multiplicity correction model priors .... 82

4-1 Characterization of the full models MF and corresponding model spaces M considered in simulations .... 100

4-2 Mean number of false and true positives in 100 randomly generated datasets as the number of main effects increases from three to five predictors in a full quadratic model, under the equal probability prior (EPP), the hierarchical independence prior (HIP), the hierarchical order prior (HOP), and the hierarchical uniform prior (HUP) .... 102

4-3 Mean number of false and true positives in 100 randomly generated datasets as the maximum order of MF increases from two to four in a full model with three main effects, under the equal probability prior (EPP), the hierarchical independence prior (HIP), the hierarchical order prior (HOP), and the hierarchical uniform prior (HUP) .... 103

4-4 Mean number of false and true positives in 100 randomly generated datasets with unstructured or irregular model spaces, under the equal probability prior (EPP), the hierarchical independence prior (HIP), the hierarchical order prior (HOP), and the hierarchical uniform prior (HUP) .... 105

4-5 Variables used in the analyses of the ozone contamination dataset .... 112

4-6 Median probability models (MPM) from different combinations of parameter and model priors vs. the model selected using the hierarchical lasso .... 113

C-1 Experimental conditions, WFM simulations .... 124

D-1 Variables used in the analyses of the ozone contamination dataset .... 131

D-2 Marginal inclusion probabilities, intrinsic prior .... 132

D-3 Marginal inclusion probabilities, Zellner-Siow prior .... 132

D-4 Marginal inclusion probabilities, Hyper-g(11) .... 132

D-5 Marginal inclusion probabilities, Hyper-g(21) .... 132
LIST OF FIGURES
Figure page
2-1 Graphical representation, occupancy model .... 25

2-2 Graphical representation, occupancy model after data augmentation .... 31

2-3 Graphical representation, multiseason model for a single site .... 39

2-4 Graphical representation, data-augmented multiseason model .... 39

3-1 Predictor MPIP averaged over scenarios with N=50 and N=100 sites, using uniform (U) and multiplicity correction (MC) priors .... 71

3-2 Predictor MPIP averaged over scenarios with J=3 and J=5 surveys per site, using uniform (U) and multiplicity correction (MC) priors .... 72

3-3 Predictor MPIP averaged over scenarios with the interaction between the number of sites and the surveys per site, using uniform (U) and multiplicity correction (MC) priors .... 72

3-4 Predictor MPIP averaged over scenarios with equal signal in the occupancy probabilities, using uniform (U) and multiplicity correction (MC) priors .... 73

3-5 Predictor MPIP averaged over scenarios with equal signal in the detection probabilities, using uniform (U) and multiplicity correction (MC) priors .... 73

4-1 Graphs of well-formulated polynomial models for p = 2 .... 90

4-2 E(M) and C(M) in M defined by a quadratic surface in two main effects for model M = {1, x1, x1^2} .... 91

4-3 Graphical representation of assumptions on M defined by the quadratic surface in two main effects .... 93

4-4 Prior probabilities for the space of well-formulated models associated with the quadratic surface on two variables, where MB is taken to be the intercept-only model and (a,b) ∈ {(1,1), (1,ch)} .... 97

4-5 Prior probabilities for the space of well-formulated models associated with three main effects and one interaction term, where MB is taken to be the intercept-only model and (a,b) ∈ {(1,1), (1,ch)} .... 98

4-6 MT, DAG of the largest true model used in simulations .... 109

4-7 Average true positives (TP) and average false positives (FP) in all simulated scenarios for the median probability model with EPP and HOP(1,ch) .... 110

C-1 SNR vs n. Average model size, average true positives, and average false positives for all simulated scenarios, by model ranking according to model posterior probabilities .... 126

C-2 SNR vs coefficient values. Average model size, average true positives, and average false positives for all simulated scenarios, by model ranking according to model posterior probabilities .... 128

C-3 SNR vs different true models MT. Average model size, average true positives, and average false positives for all simulated scenarios, by model ranking according to model posterior probabilities .... 129
Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy

OBJECTIVE BAYESIAN METHODS FOR OCCUPANCY MODEL ESTIMATION AND SELECTION

By

Daniel Taylor-Rodríguez

August 2014

Chair: Linda J. Young
Cochair: Nikolay Bliznyuk
Major: Interdisciplinary Ecology
The ecological literature contains numerous methods for conducting inference about
the dynamics that govern biological populations. Among these methods, occupancy
models have played a leading role during the past decade in the analysis of large
biological population surveys. The flexibility of the occupancy framework has brought
about useful extensions for determining key population parameters, which provide
insights about the distribution, structure, and dynamics of a population. However, the
methods used to fit the models and to conduct inference have gradually grown in
complexity, leaving practitioners unable to fully understand their implicit assumptions
and increasing the potential for misuse. This motivated our first contribution: we develop
a flexible and straightforward estimation method for occupancy models that provides
the means to directly incorporate temporal and spatial heterogeneity using covariate
information that characterizes habitat quality and the detectability of a species.

Adding to the issue mentioned above, studies of complex ecological systems now
collect large amounts of information. To identify the drivers of these systems, robust
techniques that account for test multiplicity and for the structure in the predictors are
necessary but unavailable for ecological models. We develop tools to address this
methodological gap. First, working in an "objective" Bayesian framework, we develop
the first fully automatic and objective method for occupancy model selection, based
on intrinsic parameter priors. Moreover, for the general variable selection problem, we
propose three sets of prior structures on the model space that correct for multiple
testing, and a stochastic search algorithm that relies on the priors on the model space
to account for the polynomial structure in the predictors.
CHAPTER 1
GENERAL INTRODUCTION
As with any other branch of science, ecology strives to grasp truths about the
world that surrounds us, and in particular about nature. The objective truth sought
by ecology may well be beyond our grasp; however, it is reasonable to think that, at
least partially, "Nature is capable of being understood" (Dewey, 1958). We can observe
and interpret nature to formulate hypotheses, which can then be tested against reality.
Hypotheses that encounter no or little opposition when confronted with reality may
become contextual versions of the truth, and may be generalized by scaling them
spatially and/or temporally, accordingly, to delimit the bounds within which they are valid.

To formulate hypotheses accurately, and in a fashion amenable to scientific inquiry,
not only must the point of view and assumptions considered be made explicit, but
also the object of interest, the properties of that object worthy of consideration, and
the methods used in studying such properties (Reiners & Lockwood 2009; Rigler &
Peters 1995). Ecology, as defined by Krebs (1972), is "the study of interactions that
determine the distribution and abundance of organisms". This characterizes organisms
and their interactions as the objects of interest to ecology, and prescribes distribution
and abundance as relevant properties of these organisms.

With regard to the methods used to acquire ecological scientific knowledge,
traditionally theoretical mathematical models (such as deterministic PDEs) have been
used. However, naturally varying systems are imprecisely observed, and as such are
subject to multiple sources of uncertainty that must be explicitly accounted for. Because
of this, the ecological scientific community has developed a growing interest in flexible
and powerful statistical methods, among which Bayesian hierarchical models
predominate. These methods rely on empirical observations, and can accommodate
fairly complex relationships between empirical observations and theoretical process
models while accounting for diverse sources of uncertainty (Hooten 2006).
Bayesian approaches are now used extensively in ecological modeling; however, there are two issues of concern, one from the standpoint of ecological practitioners and another from the perspective of scientific ecological endeavors. First, Bayesian modeling tools require a considerable understanding of probability and statistical theory, leading practitioners to view them as black-box approaches (Kery 2010). Second, although Bayesian applications proliferate in the literature, in general there is a lack of awareness of the distinction between approaches specifically devised for testing and those for estimation (Ellison 2004). Furthermore, there is a dangerous unfamiliarity with the proven risks of using tools designed for estimation in testing procedures, e.g., the use of flat priors in hypothesis testing (Berger & Pericchi 1996; Berger et al. 2001; Kass & Raftery 1995; Moreno et al. 1998; Robert et al. 2009; Robert 1993).
Occupancy models have played a leading role during the past decade in large biological population surveys. The flexibility of the occupancy framework has allowed the development of useful extensions to determine several key population parameters, which provide robust notions of the distribution, structure, and dynamics of a population. In order to address some of the concerns stated in the previous paragraphs, we concentrate on the occupancy framework to develop estimation and testing tools that will allow ecologists, first, to gain insight about the estimation procedure and, second, to conduct statistically sound model selection for site-occupancy data.
1.1 Occupancy Modeling
Since MacKenzie et al. (2002) and Tyre et al. (2003) introduced the site-occupancy framework, countless applications and extensions of the method have been developed in the ecological literature, as evidenced by the 438,000 hits on Google Scholar for a search of "occupancy model". This class of models acknowledges that techniques used to conduct biological population surveys are prone to detection errors: if an individual is detected it must be present, while if it is not detected it might or might not be. Occupancy models improve upon traditional binary regression by accounting
for observed detection and partially observed presence as two separate but related components. In the site-occupancy setting, the chosen locations are surveyed repeatedly in order to reduce the ambiguity caused by the observed zeros. This approach therefore allows probabilities of both presence (occurrence) and detection to be estimated.
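The ambiguity created by the observed zeros is easy to see in simulation. Below is a minimal sketch (with illustrative parameter values, not taken from this dissertation) that generates single-season detection histories; sites with all-zero rows conflate true absence with presence that went undetected, so the naive occupancy estimate understates the true occupied fraction:

```python
import numpy as np

rng = np.random.default_rng(1)

N, J = 200, 5        # sites and repeat surveys (illustrative values)
psi, p = 0.6, 0.4    # presence and per-survey detection probabilities

z = rng.binomial(1, psi, size=N)                  # latent presence z_i
y = rng.binomial(1, p, size=(N, J)) * z[:, None]  # detections y_ij, zero where absent

# A site with an all-zero detection history is ambiguous: it may be
# unoccupied, or occupied but missed on all J surveys.
naive_occ = (y.sum(axis=1) > 0).mean()
print(naive_occ, z.mean())  # the naive estimate never exceeds the true fraction
```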
The uses of site-occupancy models are many. For example, metapopulation and island biogeography models are often parameterized in terms of site (or patch) occupancy (Hanski 1992, 1994, 1997, as cited in MacKenzie et al. (2003)), and occupancy may be used as a surrogate for abundance to answer questions regarding geographic distribution, range size, and metapopulation dynamics (MacKenzie et al. 2004; Royle & Kery 2007).
The basic occupancy framework, which assumes a single closed population with fixed probabilities through time, has proven to be quite useful; however, it might be of limited utility when addressing some problems. In particular, assumptions for the basic model may become too restrictive or unrealistic whenever the study period extends throughout multiple years or seasons, especially given the increasingly changing environmental conditions that most ecosystems are currently experiencing.
Among the extensions found in the literature, one that we consider particularly relevant incorporates heterogeneous occupancy probabilities through time. Models that incorporate temporally varying probabilities stem from important metapopulation notions provided by Hanski (1994), such as occupancy probabilities depending on local colonization and local extinction processes. In spite of the conceptual usefulness of Hanski's model, several strong and untenable assumptions (e.g., all patches being homogeneous in quality) are required for it to provide practically meaningful results.
A more viable alternative, which builds on Hanski (1994), is the extension of the single-season occupancy model proposed by MacKenzie et al. (2003). In this model, the heterogeneity of occupancy probabilities across seasons arises from local colonization and extinction processes. The model is flexible enough to allow detection, occurrence, extinction, and colonization probabilities each to depend upon its own set of covariates. Model parameters are obtained through likelihood-based estimation.
Using a maximum likelihood approach presents two drawbacks. First, the uncertainty assessment for maximum likelihood parameter estimates relies on asymptotic results, which are obtained from implementation of the delta method, making it sensitive to sample size. Second, to obtain parameter estimates, the latent process (occupancy) is marginalized out of the likelihood, leading to the usual zero-inflated Bernoulli model. Although this is a convenient strategy for solving the estimation problem, after integrating out the latent state variables (occupancy indicators) they are no longer available; therefore, finite-sample estimates cannot be calculated directly. Instead, a supplementary parametric bootstrapping step is necessary. Further, additional structure, such as temporal or spatial variation, cannot be introduced by means of random effects (Royle & Kery 2007).
1.2 A Primer on Objective Bayesian Testing
With the advent of high-dimensional data, such as that found in modern problems in ecology, genetics, physics, etc., coupled with evolving computing capability, objective Bayesian inferential methods have gained increasing popularity. This, however, is by no means a new approach to the way Bayesian inference is conducted. In fact, starting with Bayes and Laplace and continuing for almost 200 years, Bayesian analysis was primarily based on "noninformative" priors (Berger & Bernardo 1992).
Now, subjective elicitation of prior probabilities in Bayesian analysis is widely recognized as the ideal (Berger et al. 2001); however, it is often the case that the available information is insufficient to specify appropriate prior probabilistic statements. Commonly, as in model selection problems where large model spaces have to be explored, the number of model parameters is prohibitively large, preventing one from eliciting prior information for the entire parameter space. As a consequence, in practice
the determination of priors through the definition of structural rules has become the alternative to subjective elicitation for a variety of problems in Bayesian testing. Priors arising from these rules are known in the literature as noninformative, objective, default, or reference priors. Many of these labels generate controversy and are accused, perhaps rightly, of conveying a false pretension of objectivity. Nevertheless, we will avoid that discussion and refer to them herein interchangeably as noninformative or objective priors, to convey the sense that no attempt to introduce an informed opinion is made in defining prior probabilities.
A plethora of "noninformative" methods has been developed in the past few decades (see Berger & Bernardo (1992); Berger & Pericchi (1996); Berger et al. (2001); Clyde & George (2004); Kass & Wasserman (1995, 1996); Liang et al. (2008); Moreno et al. (1998); Spiegelhalter & Smith (1982); Wasserman (2000); and the references therein). We find particularly interesting those derived from the model structure, in which no tuning parameters are required, especially since these can be regarded as automatic methods. Among them, methods based on the Bayes factor for intrinsic priors have proven their worth in a variety of inferential problems, given their excellent performance, flexibility, and ease of use. This class of priors is discussed in detail in Chapter 3. For now, some basic notation and notions of Bayesian inferential procedures are introduced.
Hypothesis testing and the Bayes factor
Bayesian model selection techniques that aim to find the true model, as opposed to searching for the model that best predicts the data, are fundamentally extensions of Bayesian hypothesis testing strategies. In general, this Bayesian approach to hypothesis testing and model selection relies on determining the amount of evidence found in favor of one hypothesis (or model) over the other, given an observed set of data. Approached from a Bayesian standpoint, this type of problem can be formulated in great generality using a natural, well-defined probabilistic framework that incorporates both model and parameter uncertainty.
Jeffreys (1935) first developed the Bayesian strategy for hypothesis testing and, consequently, for the model selection problem. Bayesian model selection within a model space $\mathcal{M} = \{M_1, M_2, \ldots, M_J\}$, where each model $M_j$ is associated with a parameter $\theta_j$ (which may itself be a vector of parameters), incorporates three types of probability distributions: (1) a prior probability distribution for each model, $\pi(M_j)$; (2) a prior probability distribution for the parameters in each model, $\pi(\theta_j|M_j)$; and (3) the distribution of the data conditional on both the model and the model's parameters, $f(\mathbf{x}|\theta_j, M_j)$. These three probability densities induce the joint distribution $p(\mathbf{x}, \theta_j, M_j) = f(\mathbf{x}|\theta_j, M_j)\,\pi(\theta_j|M_j)\,\pi(M_j)$, which is instrumental in producing model posterior probabilities. The model posterior probability is the probability that a model is true given the data. It is obtained by marginalizing over the parameter space and using Bayes' rule:
\[
p(M_j|\mathbf{x}) = \frac{m(\mathbf{x}|M_j)\,\pi(M_j)}{\sum_{i=1}^{J} m(\mathbf{x}|M_i)\,\pi(M_i)}, \tag{1-1}
\]
where $m(\mathbf{x}|M_j) = \int f(\mathbf{x}|\theta_j, M_j)\,\pi(\theta_j|M_j)\,d\theta_j$ is the marginal likelihood of $M_j$.
Given that interest lies in comparing different models, evidence in favor of one or another model is assessed with pairwise comparisons using posterior odds:
\[
\frac{p(M_j|\mathbf{x})}{p(M_k|\mathbf{x})} = \frac{m(\mathbf{x}|M_j)}{m(\mathbf{x}|M_k)} \cdot \frac{\pi(M_j)}{\pi(M_k)}. \tag{1-2}
\]
The first term on the right-hand side of (1-2), $m(\mathbf{x}|M_j)/m(\mathbf{x}|M_k)$, is known as the Bayes factor comparing model $M_j$ to model $M_k$, and is denoted by $BF_{jk}(\mathbf{x})$. The Bayes factor provides a measure of the evidence in favor of either model given the data, and updates the model prior odds, $\pi(M_j)/\pi(M_k)$, to produce the posterior odds.
Note that the model posterior probability in (1-1) can be expressed as a function of Bayes factors. To illustrate, let model $M_* \in \mathcal{M}$ be a reference model; all other models in $\mathcal{M}$ are compared to this reference model. Dividing both the numerator and denominator in (1-1) by $m(\mathbf{x}|M_*)\,\pi(M_*)$ yields
\[
p(M_j|\mathbf{x}) = \frac{BF_{j*}(\mathbf{x})\,\dfrac{\pi(M_j)}{\pi(M_*)}}{1 + \sum_{M_i \in \mathcal{M},\, M_i \neq M_*} BF_{i*}(\mathbf{x})\,\dfrac{\pi(M_i)}{\pi(M_*)}}. \tag{1-3}
\]
Therefore, as the Bayes factor increases, the posterior probability of model $M_j$ given the data increases. If all models have equal prior probabilities, a straightforward criterion to select the best among all candidate models is to choose the model with the largest Bayes factor. As such, the Bayes factor is not only useful for identifying models favored by the data, but it also provides a means to rank models in terms of their posterior probabilities.

Assuming equal model prior probabilities in (1-3), the prior odds are set equal to one and the model posterior odds in (1-2) become $p(M_j|\mathbf{x})/p(M_k|\mathbf{x}) = BF_{jk}(\mathbf{x})$. Based on the Bayes factors, the evidence in favor of one or another model can be interpreted using Table 1-1, adapted from Kass & Raftery (1995).
Table 1-1. Interpretation of $BF_{jk}$ when contrasting $M_j$ and $M_k$

  ln BF_jk    BF_jk        Evidence in favor of M_j    P(M_j|x)
  0 to 2      1 to 3       Weak evidence               0.50-0.75
  2 to 6      3 to 20      Positive evidence           0.75-0.95
  6 to 10     20 to 150    Strong evidence             0.95-0.99
  >10         >150         Very strong evidence        >0.99
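Equation (1-3) and Table 1-1 translate directly into computation. The sketch below (with hypothetical helper names of our own) converts log Bayes factors against a reference model into posterior model probabilities under equal priors, and maps ln BF onto the evidence categories of Table 1-1:

```python
import numpy as np

def posterior_model_probs(log_bf_ref, prior=None):
    """Posterior model probabilities via equation (1-3), given log BF_{j*}
    of each model against a common reference (the reference has log BF = 0)."""
    log_bf_ref = np.asarray(log_bf_ref, dtype=float)
    if prior is None:
        prior = np.full(log_bf_ref.size, 1.0 / log_bf_ref.size)  # equal priors
    w = np.exp(log_bf_ref - log_bf_ref.max()) * prior  # subtract max for stability
    return w / w.sum()

def evidence_category(ln_bf):
    """Evidence in favor of M_j per Table 1-1 (Kass & Raftery scale)."""
    if ln_bf < 2:
        return "weak"
    if ln_bf < 6:
        return "positive"
    if ln_bf < 10:
        return "strong"
    return "very strong"

# Two models: the reference (BF = 1) and a competitor with BF = 3
probs = posterior_model_probs([0.0, np.log(3.0)])
print(probs)  # [0.25 0.75]
```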
Bayesian hypothesis testing and model selection procedures through Bayes factors and posterior probabilities have several desirable features. First, these methods have a straightforward interpretation, since the Bayes factor is an increasing function of model (or hypothesis) posterior probabilities. Second, these methods can yield frequentist-matching confidence bounds when implemented with good testing priors (Kass & Wasserman 1996), such as the reference priors of Berger & Bernardo (1992). Third, since the Bayes factor contains the ratio of marginal densities, it automatically penalizes complexity according to the number of parameters in each model; this property is known as Ockham's razor (Kass & Raftery 1995). Fourth, the use of Bayes factors does
not require nested hypotheses (i.e., the null hypothesis nested in the alternative), standard distributions, or regular asymptotics (e.g., convergence to normal or chi-squared distributions) (Berger et al. 2001). In contrast, this is not always the case with frequentist and likelihood ratio tests, which depend on known distributions (at least asymptotically) for the test statistic. Finally, Bayesian hypothesis testing procedures using the Bayes factor can naturally incorporate model uncertainty, using the Bayesian machinery to produce model-averaged predictions and confidence bounds (Kass & Raftery 1995). It is not clear how to account for this uncertainty rigorously in a fully frequentist approach.
1.3 Overview of the Chapters
In the chapters that follow, we develop a flexible and straightforward hierarchical Bayesian framework for occupancy models, allowing us to obtain estimates and conduct robust testing from an "objective" Bayesian perspective. Latent mixtures of random variables supply a foundation for our methodology. This approach provides a means to directly incorporate spatial dependency and temporal heterogeneity through predictors that characterize either the habitat quality of a given site or the detectability features of a particular survey conducted at a specific site. The Bayesian testing methods we propose are (1) a fully automatic and objective method for occupancy model selection, and (2) an objective Bayesian testing tool that accounts for multiple testing and for polynomial hierarchical structure in the space of predictors.
Chapter 2 introduces the methods proposed for estimation of occupancy model parameters. A simple estimation procedure for the single-season occupancy model with covariates is formulated using both probit and logit links. Based on the simple version, an extension is provided to cope with metapopulation dynamics by introducing persistence and colonization processes. Finally, given the fundamental role that spatial dependence plays in defining temporal dynamics, a strategy to seamlessly account for this feature in our framework is introduced.
Chapter 3 develops a new, fully automatic and objective method for occupancy model selection that is asymptotically consistent for variable selection and averts the use of tuning parameters. In this chapter, first, some issues surrounding multimodel inference are described and insight about objective Bayesian inferential procedures is provided. Then, building on modern methods for "objective" Bayesian testing to generate priors on the parameter space, the intrinsic priors for the parameters of the occupancy model are obtained. These are used in the construction of an algorithm for "objective" variable selection tailored to the occupancy model framework.
Chapter 4 touches on two important and interconnected issues that arise when conducting model testing and have yet to receive the attention they deserve: (1) controlling for false discovery in hypothesis testing given the size of the model space, i.e., given the number of tests performed; and (2) non-invariance to location transformations of variable selection procedures in the face of polynomial predictor structure. Both elements depend on the definition of prior probabilities on the model space. In this chapter, a set of priors on the model space and a stochastic search algorithm are proposed; together, these control for model multiplicity and account for the polynomial structure among the predictors.
CHAPTER 2
MODEL ESTIMATION METHODS

"Data! Data! Data!" he cried impatiently. "I can't make bricks without clay."

–Sherlock Holmes, The Adventure of the Copper Beeches

2.1 Introduction
Prior to the introduction of site-occupancy models (MacKenzie et al. 2002; Tyre et al. 2003), presence-absence data from ecological monitoring programs were used without any adjustment to assess the impact of management actions, to observe trends in species distribution through space and time, or to model the habitat of a species (Tyre et al. 2003). These efforts, however, were suspect due to false-negative errors not being accounted for. False-negative errors occur whenever a species is present at a site but goes undetected during the survey.
Site-occupancy models, developed independently by MacKenzie et al. (2002) and Tyre et al. (2003), extend simple binary-regression models to account for the aforementioned errors in detection of individuals, common in surveys of animal or plant populations. Since their introduction, the site-occupancy framework has been used in countless applications and numerous extensions for it have been proposed. Occupancy models improve upon traditional binary regression by analyzing observed detection and partially observed presence as two separate but related components. In the site-occupancy setting, the chosen locations are surveyed repeatedly in order to reduce the ambiguity caused by the observed zeros. This approach therefore allows simultaneous estimation of the probabilities of presence (occurrence) and detection.
Several extensions to the basic single-season closed-population model are now available. The occupancy approach has been used to determine species range dynamics (MacKenzie et al. 2003; Royle & Kery 2007), to understand age/stage structure within populations (Nichols et al. 2007), and to model species co-occurrence (MacKenzie et al. 2004; Ovaskainen et al. 2010; Waddle et al. 2010). It has even been suggested as a surrogate for abundance (MacKenzie & Nichols 2004). MacKenzie et al. suggested using occupancy models to conduct large-scale monitoring programs, since this approach avoids the high costs associated with surveys designed for abundance estimation. Also, to investigate metapopulation dynamics, occupancy models improve upon incidence function models (Hanski 1994), which are often parameterized in terms of site (or patch) occupancy and assume homogeneous patches and a metapopulation at colonization-extinction equilibrium.
Nevertheless, the implementation of Bayesian occupancy models commonly resorts to sampling strategies dependent on hyper-parameters, subjective prior elicitation, and relatively elaborate algorithms. From the standpoint of practitioners, these are often treated as black-box methods (Kery 2010); as such, the potential of using the methodology incorrectly is high. Commonly, these procedures are fitted with packages such as BUGS or JAGS. Although the packages' ease of use has led to widespread adoption of the methods, the user may be oblivious to the assumptions underpinning the analysis.
We believe providing straightforward and robust alternatives to implement these methods will help practitioners gain insight about how occupancy modeling, and more generally Bayesian modeling, is performed. In this chapter, using a simple Gibbs sampling approach, we first develop a versatile method to estimate the single-season closed-population site-occupancy model, then extend it to analyze metapopulation dynamics through time, and finally provide a further adaptation to incorporate spatial dependence among neighboring sites.

2.1.1 The Occupancy Model
In this section of the document, we first introduce our results published in Dorazio & Taylor-Rodríguez (2012) and build upon them to propose relevant extensions. Under the standard sampling protocol for collecting site-occupancy data, $J > 1$ independent surveys are conducted at each of $N$ representative sample locations (sites), noting whether a species is detected or not detected during each survey. Let $y_{ij}$ denote a binary random variable that indicates detection ($y_{ij} = 1$) or non-detection ($y_{ij} = 0$) during the $j$th survey of site $i$. Without loss of generality, $J$ may be assumed constant among all $N$ sites to simplify description of the model; in practice, site-specific variation in $J$ poses no real difficulties and is easily implemented. This sampling protocol therefore yields an $N \times J$ matrix $\mathbf{Y}$ of detection/non-detection data.
Note that the observed process $y_{ij}$ is an imperfect representation of the underlying occupancy, or presence, process. Hence, letting $z_i$ denote the presence indicator at site $i$, this model specification can be represented through the hierarchy
\[
y_{ij}|z_i, \lambda \sim \text{Bernoulli}(z_i p_{ij}), \qquad z_i|\alpha \sim \text{Bernoulli}(\psi_i), \tag{2-1}
\]
where $p_{ij}$ is the probability of correctly classifying the $i$th site as occupied during the $j$th survey, and $\psi_i$ is the presence probability at the $i$th site. The graphical representation of this
process is given in Figure 2-1.

Figure 2-1. Graphical representation of the occupancy model: $\psi_i \rightarrow z_i \rightarrow y_i \leftarrow p_i$.
Probabilities of detection and occupancy can both be made functions of covariates, and their corresponding parameter estimates can be obtained using either a maximum likelihood or a Bayesian approach. Existing methodologies from the likelihood perspective marginalize over the latent occupancy process ($z_i$), making the estimation procedure depend only on the detections. Most Bayesian strategies rely on MCMC algorithms that require parameter prior specification and tuning. However, Albert & Chib (1993) proposed a strategy, now longstanding in the Bayesian statistical literature, that models binary outcomes using a simple Gibbs sampler. This procedure, described in the following section, can be extrapolated to the occupancy setting, eliminating the need for tuning parameters and subjective prior elicitation.

2.1.2 Data Augmentation Algorithms for Binary Models
Probit model: data augmentation with latent normal variables
At the root of Albert & Chib's algorithm lies the idea that if the observed outcome is 0, the latent variable can be simulated from a truncated normal distribution with support $(-\infty, 0\,]$, and if the outcome is 1, the latent variable can be simulated from a truncated normal distribution on $(0, \infty)$. To understand the reasoning behind this strategy, let $Y \sim \text{Bern}(\Phi(\mathbf{x}^T\beta))$ and $V = \mathbf{x}^T\beta + \varepsilon$, with $\varepsilon \sim N(0, 1)$. In such a case, note that
\[
\Pr(y = 1\,|\,\mathbf{x}^T\beta) = \Phi(\mathbf{x}^T\beta) = \Pr(\varepsilon < \mathbf{x}^T\beta) = \Pr(\varepsilon > -\mathbf{x}^T\beta) = \Pr(v > 0\,|\,\mathbf{x}^T\beta).
\]
Thus, whenever $y = 1$, then $v > 0$, and $v \le 0$ otherwise. In other words, we may think of $y$ as a truncated version of $v$. We can therefore sample iteratively, alternating between the latent variables conditioned on the model parameters and vice versa, to draw from the desired posterior densities. By augmenting the data with the latent variables, we obtain full conditional posterior distributions for the model parameters that are easy to draw from (equation 2-3 below); just as we may sample the latent variables, we may also sample the parameters.
Given some initial values for all model parameters, values for the latent variables can be simulated; conditioning on the latter, it is then possible to draw samples from the parameters' posterior distributions. These samples can be used to generate new values for the latent variables, and so on. The process is iterated using a Gibbs sampling approach; generally, after a large number of iterations, it yields draws from the joint posterior distribution of the latent variables and the model parameters, conditional on the observed outcome values. We formalize the procedure below.
Assume that each outcome $Y_1, Y_2, \ldots, Y_n$ is such that $Y_i|\mathbf{x}_i, \beta \sim \text{Bernoulli}(q_i)$, where $q_i = \Phi(\mathbf{x}_i^T\beta)$ is the standard normal CDF evaluated at $\mathbf{x}_i^T\beta$, with $\mathbf{x}_i$ and $\beta$ the $p$-dimensional vectors of observed covariates for the $i$th observation and their corresponding parameters, respectively.

Now let $\mathbf{y} = (y_1, y_2, \ldots, y_n)$ be the vector of observed outcomes, and let $[\,\beta\,]$ represent the prior distribution of the model parameters. The posterior distribution of $\beta$ is then given by
\[
[\,\beta|\mathbf{y}\,] \propto [\,\beta\,] \prod_{i=1}^{n} \Phi(\mathbf{x}_i^T\beta)^{y_i}\left(1 - \Phi(\mathbf{x}_i^T\beta)\right)^{1-y_i}, \tag{2-2}
\]
which is intractable. Nevertheless, introducing latent random variables $\mathbf{V} = (V_1, \ldots, V_n)$ such that $V_i \sim N(\mathbf{x}_i^T\beta, 1)$ resolves this difficulty, by specifying that whenever $Y_i = 1$ then $V_i > 0$, and if $Y_i = 0$ then $V_i \le 0$. This yields
\[
[\,\beta, \mathbf{v}|\mathbf{y}\,] \propto [\,\beta\,] \prod_{i=1}^{n} \phi(v_i\,|\,\mathbf{x}_i^T\beta, 1)\left\{I_{v_i \le 0}\,I_{y_i=0} + I_{v_i > 0}\,I_{y_i=1}\right\}, \tag{2-3}
\]
where $\phi(x\,|\,\mu, \tau^2)$ is the probability density function of a normal random variable $x$ with mean $\mu$ and variance $\tau^2$. The data augmentation artifact works since $[\,\beta|\mathbf{y}\,] = \int [\,\beta, \mathbf{v}|\mathbf{y}\,]\,d\mathbf{v}$; hence, if we sample from the joint posterior (2-3) and extract only the sampled values for $\beta$, they will correspond to samples from $[\,\beta|\mathbf{y}\,]$.
From the expression above it is possible to obtain the full conditional distributions for $\mathbf{V}$ and $\beta$, so a Gibbs sampler can be proposed. For example, if we use a flat prior for $\beta$ (i.e., $[\,\beta\,] \propto 1$), the full conditionals are given by
\[
\beta|\mathbf{V}, \mathbf{y} \sim \text{MVN}_k\left((X^TX)^{-1}(X^T\mathbf{V}),\, (X^TX)^{-1}\right), \tag{2-4}
\]
\[
\mathbf{V}|\beta, \mathbf{y} \sim \prod_{i=1}^{n} \text{tr}\,N(\mathbf{x}_i^T\beta, 1, Q_i), \tag{2-5}
\]
where $\text{MVN}_q(\mu, \Sigma)$ represents a multivariate normal distribution with mean vector $\mu$ and variance-covariance matrix $\Sigma$, and $\text{tr}\,N(\xi, \sigma^2, Q)$ stands for the truncated normal distribution with mean $\xi$, variance $\sigma^2$, and truncation region $Q$. For each $i = 1, 2, \ldots, n$, the support of the truncated variable is $Q_i = (-\infty, 0\,]$ if $y_i = 0$ and $Q_i = (0, \infty)$ otherwise. Note that conjugate normal priors could be used alternatively.

At iteration $m + 1$, the Gibbs sampler draws $\mathbf{V}^{(m+1)}$ conditional on $\beta^{(m)}$ from (2-5), and then samples $\beta^{(m+1)}$ conditional on $\mathbf{V}^{(m+1)}$ from (2-4). This process is repeated for $m = 0, 1, \ldots, n_{sim}$, where $n_{sim}$ is the number of iterations of the Gibbs sampler.
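A minimal implementation of this two-step sampler for probit regression with a flat prior might look as follows (a sketch under our own variable naming, not code from the dissertation):

```python
import numpy as np
from scipy.stats import norm, truncnorm

def probit_gibbs(y, X, n_iter=1500, rng=None):
    """Albert & Chib (1993) Gibbs sampler for probit regression under a flat
    prior: alternate draws of V | beta, y from (2-5) and beta | V from (2-4)."""
    rng = np.random.default_rng(rng)
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    chol = np.linalg.cholesky(XtX_inv)
    beta = np.zeros(p)
    draws = np.empty((n_iter, p))
    for m in range(n_iter):
        mu = X @ beta
        # (2-5): V_i truncated to (0, inf) if y_i = 1, to (-inf, 0] if y_i = 0.
        # Bounds are standardized (loc = 0, scale = 1); the mean mu is added back.
        lo = np.where(y == 1, -mu, -np.inf)
        hi = np.where(y == 1, np.inf, -mu)
        v = mu + truncnorm.rvs(lo, hi, size=n, random_state=rng)
        # (2-4): beta | V ~ MVN((X'X)^{-1} X'V, (X'X)^{-1})
        beta = XtX_inv @ (X.T @ v) + chol @ rng.standard_normal(p)
        draws[m] = beta
    return draws

# Example: recover beta = (0.5, -1) from simulated data
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(500), rng.standard_normal(500)])
y = rng.binomial(1, norm.cdf(X @ np.array([0.5, -1.0])))
post = probit_gibbs(y, X, rng=1)[500:]   # discard burn-in
print(post.mean(axis=0))                 # close to (0.5, -1)
```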
Logit model: data augmentation with latent Polya-gamma variables
Recently, Polson et al. (2013) developed a novel and efficient approach to Bayesian inference for logistic models using Polya-gamma latent variables, which is analogous to the Albert & Chib algorithm. The result arises from what the authors refer to as the Polya-gamma distribution. To construct a random variable from this family, consider the infinite mixture of the iid sequence of $\text{Exp}(1)$ random variables $\{E_k\}_{k=1}^{\infty}$ given by
\[
\omega = \frac{2}{\pi^2}\sum_{k=1}^{\infty} \frac{E_k}{(2k-1)^2},
\]
with probability density function
\[
g(\omega) = \sum_{k=0}^{\infty} (-1)^k \frac{2k+1}{\sqrt{2\pi\omega^3}}\, e^{-\frac{(2k+1)^2}{8\omega}}\, I_{\omega \in (0,\infty)}, \tag{2-6}
\]
and Laplace transform $E[e^{-t\omega}] = \cosh^{-1}\!\left(\sqrt{t/2}\right)$, i.e., $1/\cosh\!\left(\sqrt{t/2}\right)$.
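The infinite-sum representation above suggests a simple approximate way to draw PG(1, 0) variates by truncating the series; we use it here only to check the representation numerically (an exact sampler, as in Polson et al., is preferable in practice):

```python
import numpy as np

def pg10_approx(size, K=200, rng=None):
    """Approximate PG(1, 0) draws: omega = (2/pi^2) * sum_{k<=K} E_k/(2k-1)^2,
    truncating the infinite mixture of Exp(1) variables at K terms."""
    rng = np.random.default_rng(rng)
    k = np.arange(1, K + 1)
    E = rng.exponential(1.0, size=(size, K))
    return (2.0 / np.pi**2) * (E / (2 * k - 1) ** 2).sum(axis=1)

draws = pg10_approx(200_000, rng=7)
# E[omega] = (2/pi^2) * sum_k 1/(2k-1)^2 = (2/pi^2)(pi^2/8) = 1/4
print(draws.mean())  # approximately 0.25
```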
The Polya-gamma family of densities is obtained through an exponential tilting of the density $g$ in (2-6). These densities, indexed by $c \ge 0$, are characterized by
\[
f(\omega|c) = \cosh\!\left(\tfrac{c}{2}\right) e^{-\frac{c^2\omega}{2}}\, g(\omega).
\]
The likelihood for the binomial logistic model can be expressed in terms of latent Polya-gamma variables as follows. Assume $y_i \sim \text{Bernoulli}(\delta_i)$, with predictors $\mathbf{x}_i' = (x_{i1}, \ldots, x_{ip})$ and success probability $\delta_i = e^{\mathbf{x}_i'\beta}/(1 + e^{\mathbf{x}_i'\beta})$. Hence, the posterior for the model parameters can be represented as
\[
[\,\beta|\mathbf{y}\,] = \frac{[\,\beta\,]\prod_{i=1}^{n} \delta_i^{y_i}(1-\delta_i)^{1-y_i}}{c(\mathbf{y})},
\]
where $c(\mathbf{y})$ is the normalizing constant.

To facilitate the sampling procedure, a data augmentation step can be performed by introducing Polya-gamma random variables $\omega_i \sim PG(1, \mathbf{x}_i'\beta)$. This yields the data-augmented posterior
\[
[\,\beta, \omega|\mathbf{y}\,] = \frac{\left(\prod_{i=1}^{n} \Pr(y_i = 1\,|\,\beta)\right) f(\omega\,|\,\beta)\, [\,\beta\,]}{c(\mathbf{y})}, \tag{2-7}
\]
such that $[\,\beta|\mathbf{y}\,] = \int_{\mathbb{R}_+^n} [\,\beta, \omega|\mathbf{y}\,]\, d\omega$.
Thus, from the augmented model, the full conditional density for $\beta$ is given by
\[
\begin{aligned}
[\,\beta|\omega, \mathbf{y}\,] &\propto \left(\prod_{i=1}^{n} \Pr(y_i = 1\,|\,\beta)\right) f(\omega\,|\,\beta)\,[\,\beta\,]\\
&= [\,\beta\,]\prod_{i=1}^{n} \frac{(e^{\mathbf{x}_i'\beta})^{y_i}}{1 + e^{\mathbf{x}_i'\beta}} \prod_{i=1}^{n} \cosh\!\left(\frac{|\mathbf{x}_i'\beta|}{2}\right) \exp\!\left[-\frac{(\mathbf{x}_i'\beta)^2\omega_i}{2}\right] g(\omega_i). \tag{2-8}
\end{aligned}
\]
This expression yields a normal posterior distribution if $\beta$ is assigned a flat or normal prior. Hence, a two-step sampling strategy analogous to that of Albert & Chib (1993) can be used to estimate $\beta$ in the occupancy framework.

2.2 Single-Season Occupancy
Let $p_{ij} = F(\mathbf{q}_{ij}^T\lambda)$ be the probability of correctly classifying the $i$th site as occupied during the $j$th survey, conditional on the site being occupied, and let $\psi_i = F(\mathbf{x}_i^T\alpha)$ correspond to the presence probability at the $i$th site. Here $F^{-1}(\cdot)$ denotes a link function (i.e., probit or logit) connecting the response to the predictors, and $\lambda$ and $\alpha$ denote, respectively, the $r$-variate and $p$-variate coefficient vectors for the detection and presence probabilities. Then the joint posterior for the presence indicators and the model parameters is
\[
\pi^*(\mathbf{z}, \alpha, \lambda) \propto \pi_\alpha(\alpha)\,\pi_\lambda(\lambda) \prod_{i=1}^{N} F(\mathbf{x}_i'\alpha)^{z_i}\left(1 - F(\mathbf{x}_i'\alpha)\right)^{(1-z_i)} \times \prod_{j=1}^{J} \left(z_i F(\mathbf{q}_{ij}'\lambda)\right)^{y_{ij}}\left(1 - z_i F(\mathbf{q}_{ij}'\lambda)\right)^{1-y_{ij}}. \tag{2-9}
\]
As in the simple probit regression problem, this posterior is intractable, so sampling from it directly is not possible. However, the procedures of Albert & Chib for the probit model and of Polson et al. for the logit model can be extended to generate an MCMC sampling strategy for the occupancy problem. In what follows, we make use of this framework to develop samplers with which occupancy parameter estimates can be obtained for both probit and logit link functions. These algorithms have the added benefit that they require neither tuning parameters nor subjective elicitation of parameter priors.

2.2.1 Probit Link Model
To extend Albert & Chib's algorithm to the occupancy framework with a probit link, we first introduce two sets of latent normal variables, denoted $w_{ij}$ and $v_i$, used to augment the data. The corresponding hierarchy is
\[
\begin{aligned}
y_{ij}|z_i, w_{ij} &\sim \text{Bernoulli}\left(z_i I_{w_{ij} > 0}\right), & w_{ij}|\lambda &\sim N\left(\mathbf{q}_{ij}'\lambda, 1\right), & \lambda &\sim [\,\lambda\,],\\
z_i|v_i &= I_{v_i > 0}, & v_i|\alpha &\sim N(\mathbf{x}_i'\alpha, 1), & \alpha &\sim [\,\alpha\,],
\end{aligned} \tag{2-10}
\]
represented by the directed graph in Figure 2-2.

Figure 2-2. Graphical representation of the occupancy model after data augmentation: $\alpha \rightarrow v_i \rightarrow z_i \rightarrow y_i \leftarrow w_i \leftarrow \lambda$.
Under this hierarchical model, the joint density is given by
\[
\pi^*(\mathbf{z}, \mathbf{v}, \alpha, \mathbf{w}, \lambda) \propto C_y\,\pi_\alpha(\alpha)\,\pi_\lambda(\lambda) \prod_{i=1}^{N} \phi(v_i;\, \mathbf{x}_i'\alpha, 1)\, I_{v_i > 0}^{\,z_i}\, I_{v_i \le 0}^{\,(1-z_i)} \times \prod_{j=1}^{J} \left(z_i I_{w_{ij} > 0}\right)^{y_{ij}}\left(1 - z_i I_{w_{ij} > 0}\right)^{1-y_{ij}} \phi(w_{ij};\, \mathbf{q}_{ij}'\lambda, 1). \tag{2-11}
\]
The full conditional densities derived from the posterior in equation (2-11) are detailed below.
1. The full conditional for $\mathbf{z}$, obtained after integrating $\mathbf{v}$ and $\mathbf{w}$ out of (2-11), is
\[
f(\mathbf{z}|\alpha, \lambda) = \prod_{i=1}^{N} f(z_i|\alpha, \lambda) = \prod_{i=1}^{N} {\psi_i^*}^{z_i}(1 - \psi_i^*)^{1-z_i},
\]
where
\[
\psi_i^* = \frac{\psi_i \prod_{j=1}^{J} p_{ij}^{y_{ij}}(1-p_{ij})^{1-y_{ij}}}{\psi_i \prod_{j=1}^{J} p_{ij}^{y_{ij}}(1-p_{ij})^{1-y_{ij}} + (1-\psi_i)\prod_{j=1}^{J} I_{y_{ij}=0}}. \tag{2-12}
\]
2.
\[
f(\mathbf{v}|\mathbf{z}, \alpha) = \prod_{i=1}^{N} f(v_i|z_i, \alpha) = \prod_{i=1}^{N} \text{tr}\,N(\mathbf{x}_i'\alpha, 1, A_i), \quad \text{where } A_i = \begin{cases} (-\infty, 0\,] & z_i = 0\\ (0, \infty) & z_i = 1 \end{cases} \tag{2-13}
\]
and $\text{tr}\,N(\mu, \sigma^2, A)$ denotes the pdf of a truncated normal random variable with mean $\mu$, variance $\sigma^2$, and truncation region $A$.
3.
\[
f(\alpha|\mathbf{v}) = \phi_p(\alpha;\, \Sigma_\alpha X'\mathbf{v},\, \Sigma_\alpha), \tag{2-14}
\]
where $\Sigma_\alpha = (X'X)^{-1}$ and $\phi_k(\mathbf{x};\, \mu, \Sigma)$ represents the $k$-variate normal density with mean vector $\mu$ and variance matrix $\Sigma$.
4.
\[
f(\mathbf{w}|\mathbf{y}, \mathbf{z}, \lambda) = \prod_{i=1}^{N}\prod_{j=1}^{J} f(w_{ij}|y_{ij}, z_i, \lambda) = \prod_{i=1}^{N}\prod_{j=1}^{J} \text{tr}\,N(\mathbf{q}_{ij}'\lambda, 1, B_{ij}), \quad \text{where } B_{ij} = \begin{cases} (-\infty, \infty) & z_i = 0\\ (-\infty, 0\,] & z_i = 1 \text{ and } y_{ij} = 0\\ (0, \infty) & z_i = 1 \text{ and } y_{ij} = 1 \end{cases} \tag{2-15}
\]
5.
\[
f(\lambda|\mathbf{w}) = \phi_r(\lambda;\, \Sigma_\lambda Q'\mathbf{w},\, \Sigma_\lambda), \tag{2-16}
\]
where $\Sigma_\lambda = (Q'Q)^{-1}$.
The Gibbs sampling algorithm for the model can then be summarized as:

1. Initialize $\mathbf{z}$, $\alpha$, $\mathbf{v}$, $\lambda$, and $\mathbf{w}$.
2. Sample $z_i \sim \text{Bernoulli}(\psi_i^*)$.
3. Sample $v_i$ from a truncated normal with $\mu = \mathbf{x}_i'\alpha$, $\sigma = 1$, and truncation region depending on $z_i$.
4. Sample $\alpha \sim N(\Sigma_\alpha X'\mathbf{v}, \Sigma_\alpha)$, with $\Sigma_\alpha = (X'X)^{-1}$.
5. Sample $w_{ij}$ from a truncated normal with $\mu = \mathbf{q}_{ij}'\lambda$, $\sigma = 1$, and truncation region depending on $y_{ij}$ and $z_i$.
6. Sample $\lambda \sim N(\Sigma_\lambda Q'\mathbf{w}, \Sigma_\lambda)$, with $\Sigma_\lambda = (Q'Q)^{-1}$.
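Putting steps 1-6 together, a compact implementation for the probit occupancy model with flat priors might look as follows (a sketch under our own variable naming; Y is the N x J detection matrix, X the occupancy design matrix, and Q the detection design matrix with rows stacked site by site):

```python
import numpy as np
from scipy.stats import norm, truncnorm

def occupancy_probit_gibbs(Y, X, Q, n_iter=1000, rng=None):
    """Gibbs sampler for the single-season probit occupancy model (steps 1-6),
    assuming flat priors on alpha and lambda."""
    rng = np.random.default_rng(rng)
    N, J = Y.shape
    Sa = np.linalg.inv(X.T @ X); cA = np.linalg.cholesky(Sa)
    Sl = np.linalg.inv(Q.T @ Q); cL = np.linalg.cholesky(Sl)
    alpha = np.zeros(X.shape[1]); lam = np.zeros(Q.shape[1])
    detected = Y.sum(axis=1) > 0           # a detected site is surely occupied
    out_a = np.empty((n_iter, alpha.size)); out_l = np.empty((n_iter, lam.size))
    for m in range(n_iter):
        psi = norm.cdf(X @ alpha)
        P = norm.cdf((Q @ lam).reshape(N, J))
        # step 2: z_i ~ Bernoulli(psi*_i), equation (2-12); for undetected
        # sites the numerator reduces to psi_i * prod_j (1 - p_ij)
        num = psi * (1 - P).prod(axis=1)
        psi_star = num / (num + (1 - psi))
        z = np.where(detected, 1, rng.binomial(1, psi_star))
        # step 3: v_i truncated normal with region A_i from (2-13)
        mu_v = X @ alpha
        v = mu_v + truncnorm.rvs(np.where(z == 1, -mu_v, -np.inf),
                                 np.where(z == 1, np.inf, -mu_v),
                                 size=N, random_state=rng)
        # step 4: alpha | v from (2-14)
        alpha = Sa @ (X.T @ v) + cA @ rng.standard_normal(alpha.size)
        # step 5: w_ij truncated normal with region B_ij from (2-15)
        mu_w = (Q @ lam).reshape(N, J)
        zz = np.broadcast_to(z[:, None], (N, J))
        lo = np.where((zz == 1) & (Y == 1), -mu_w, -np.inf)
        hi = np.where((zz == 1) & (Y == 0), -mu_w, np.inf)
        w = mu_w + truncnorm.rvs(lo, hi, size=(N, J), random_state=rng)
        # step 6: lambda | w from (2-16)
        lam = Sl @ (Q.T @ w.ravel()) + cL @ rng.standard_normal(lam.size)
        out_a[m] = alpha; out_l[m] = lam
    return out_a, out_l
```

Note that sites with at least one detection have $z_i = 1$ with probability one, since the indicator product in the second term of the denominator of (2-12) vanishes for them.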
2.2.2 Logit Link Model
Now turning to the logit-link version of the occupancy model, again let $y_{ij}$ be the indicator variable used to mark detection of the target species on the $j$th survey at the $i$th site, and let $z_i$ be the indicator variable that denotes presence ($z_i = 1$) or absence ($z_i = 0$) of the target species at the $i$th site. The model is now defined by
\[
\begin{aligned}
y_{ij}|z_i, \lambda &\sim \text{Bernoulli}(z_i p_{ij}), &\text{where } p_{ij} &= \frac{e^{\mathbf{q}_{ij}'\lambda}}{1 + e^{\mathbf{q}_{ij}'\lambda}}, & \lambda &\sim [\,\lambda\,],\\
z_i|\alpha &\sim \text{Bernoulli}(\psi_i), &\text{where } \psi_i &= \frac{e^{\mathbf{x}_i'\alpha}}{1 + e^{\mathbf{x}_i'\alpha}}, & \alpha &\sim [\,\alpha\,].
\end{aligned}
\]
In this hierarchy, the contribution of a single site to the likelihood is
\[
L_i(\alpha, \lambda) = \frac{(e^{\mathbf{x}_i'\alpha})^{z_i}}{1 + e^{\mathbf{x}_i'\alpha}} \prod_{j=1}^{J}\left(z_i\frac{e^{\mathbf{q}_{ij}'\lambda}}{1 + e^{\mathbf{q}_{ij}'\lambda}}\right)^{y_{ij}}\left(1 - z_i\frac{e^{\mathbf{q}_{ij}'\lambda}}{1 + e^{\mathbf{q}_{ij}'\lambda}}\right)^{1-y_{ij}}. \tag{2-17}
\]
As in the probit case, we data-augment the likelihood with two separate sets of latent variables, in this case each having a Polya-gamma distribution. Augmenting the model and using the posterior in (2-7), the joint is
\[
\begin{aligned}
[\,\mathbf{z}, \alpha, \lambda|\mathbf{y}\,] \propto [\,\alpha\,][\,\lambda\,] \prod_{i=1}^{N} &\frac{(e^{\mathbf{x}_i'\alpha})^{z_i}}{1 + e^{\mathbf{x}_i'\alpha}} \cosh\!\left(\frac{|\mathbf{x}_i'\alpha|}{2}\right)\exp\!\left[-\frac{(\mathbf{x}_i'\alpha)^2 v_i}{2}\right] g(v_i)\\
&\times \prod_{j=1}^{J}\left(z_i\frac{e^{\mathbf{q}_{ij}'\lambda}}{1 + e^{\mathbf{q}_{ij}'\lambda}}\right)^{y_{ij}}\left(1 - z_i\frac{e^{\mathbf{q}_{ij}'\lambda}}{1 + e^{\mathbf{q}_{ij}'\lambda}}\right)^{1-y_{ij}}\\
&\times \cosh\!\left(\frac{|z_i\mathbf{q}_{ij}'\lambda|}{2}\right)\exp\!\left[-\frac{(z_i\mathbf{q}_{ij}'\lambda)^2 w_{ij}}{2}\right] g(w_{ij}). \tag{2-18}
\end{aligned}
\]
The full conditionals for $\mathbf{z}$, $\alpha$, $\mathbf{v}$, $\lambda$, and $\mathbf{w}$ obtained from (2-18) are provided below.

1. The full conditional for $\mathbf{z}$, obtained after marginalizing the latent variables, is
\[
f(\mathbf{z}|\alpha, \lambda) = \prod_{i=1}^{N} f(z_i|\alpha, \lambda) = \prod_{i=1}^{N} {\psi_i^*}^{z_i}(1 - \psi_i^*)^{1-z_i},
\]
where
\[
\psi_i^* = \frac{\psi_i \prod_{j=1}^{J} p_{ij}^{y_{ij}}(1-p_{ij})^{1-y_{ij}}}{\psi_i \prod_{j=1}^{J} p_{ij}^{y_{ij}}(1-p_{ij})^{1-y_{ij}} + (1-\psi_i)\prod_{j=1}^{J} I_{y_{ij}=0}}. \tag{2-19}
\]
33
2 Using the result derived in Polson et al (2013) we have that
f (v|zα) =
Nprodi=1
f (vi |zi α) =Nprodi=1
PG(1 xprimeiα) (2ndash20)
3.

f(α | z, v) ∝ [α] ∏_{i=1}^{N} exp[ z_i x'_i α - x'_i α / 2 - (x'_i α)² v_i / 2 ].   (2–21)
4. By the same result as that used for v, the full conditional for w is

f(w | y, z, λ) = ∏_{i=1}^{N} ∏_{j=1}^{J} f(w_{ij} | y_{ij}, z_i, λ)
= ( ∏_{i ∈ S_1} ∏_{j=1}^{J} PG(1, |q'_{ij} λ|) ) ( ∏_{i ∉ S_1} ∏_{j=1}^{J} PG(1, 0) ),   (2–22)

with S_1 = { i ∈ 1, 2, ..., N : z_i = 1 }.
5.

f(λ | z, y, w) ∝ [λ] ∏_{i ∈ S_1} ∏_{j=1}^{J} exp[ y_{ij} q'_{ij} λ - q'_{ij} λ / 2 - (q'_{ij} λ)² w_{ij} / 2 ],   (2–23)

with S_1 as defined above.
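As a concrete illustration of step 2 and the resulting draw for α, Pólya-Gamma variates can be approximated by truncating the infinite sum-of-gammas representation of Polson et al. (2013). The sketch below (our own naming; a flat prior on α is assumed) shows a single update of the occupancy block:

```python
import numpy as np

rng = np.random.default_rng(0)

def rpolya_gamma(c, K=200):
    """Approximate PG(1, c) draws by truncating the sum-of-gammas
    representation of Polson et al. (2013) at K terms."""
    c = np.atleast_1d(c).astype(float)
    k = np.arange(1, K + 1)[:, None]
    g = rng.gamma(1.0, 1.0, size=(K, c.size))          # g_k ~ Gamma(1, 1)
    return (g / ((k - 0.5) ** 2 + (c / (2 * np.pi)) ** 2)).sum(axis=0) / (2 * np.pi ** 2)

def update_alpha(z, X, alpha):
    """One Gibbs cycle for the occupancy block under a flat prior:
    v_i | z, alpha ~ PG(1, x_i'alpha), then
    alpha | z, v ~ N(V X'kappa, V), kappa_i = z_i - 1/2, V = (X' diag(v) X)^{-1}."""
    v = rpolya_gamma(X @ alpha)
    V = np.linalg.inv(X.T @ (v[:, None] * X))
    return rng.multivariate_normal(V @ (X.T @ (z - 0.5)), V)
```

A quick sanity check on the truncation is the known mean E[PG(1, 0)] = 1/4, which the approximation recovers closely for moderate K.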
The Gibbs sampling algorithm is analogous to the one with a probit link, but with the obvious modifications to incorporate Pólya-Gamma instead of normal latent variables.
2.3 Temporal Dynamics and Spatial Structure
The uses of the single-season model are limited to very specific problems. In particular, assumptions for the basic model may become too restrictive or unrealistic whenever the study period extends throughout multiple years or seasons, especially given the increasingly changing environmental conditions that most ecosystems are currently experiencing.

Among the many extensions found in the literature, one that we consider particularly relevant incorporates heterogeneous occupancy probabilities through time. Extensions of site-occupancy models that incorporate temporally varying probabilities can be traced back to Hanski (1994). The heterogeneity of occupancy probabilities through time arises from local colonization and extinction processes. MacKenzie et al. (2003) proposed an alternative to Hanski's approach in order to incorporate imperfect detection. The method is flexible enough to let detection, occurrence, survival, and colonization probabilities each depend upon its own set of covariates, using likelihood-based estimation for the model parameters.
However, the approach of MacKenzie et al. presents two drawbacks. First, the uncertainty assessment for maximum likelihood parameter estimates relies on asymptotic results (obtained from implementation of the delta method), making it sensitive to sample size. Second, to obtain parameter estimates, the latent process (occupancy) is marginalized out of the likelihood, leading to the usual zero-inflated Bernoulli model. Although this is a convenient strategy to solve the estimation problem, the latent state variables (occupancy indicators) are no longer available, and as such, finite sample estimates cannot be calculated unless an additional (and computationally expensive) parametric bootstrap step is performed (Royle & Kéry 2007). Additionally, as the occupancy process is integrated out, the likelihood approach precludes incorporation of additional structural dependence using random effects. Thus the model cannot account for spatial dependence, which plays a fundamental role in this setting.
To work around some of the shortcomings encountered when fitting dynamic occupancy models via likelihood-based methods, Royle & Kéry developed what they refer to as a dynamic occupancy state space model (DOSS), alluding to the conceptual similarity between this model and the class of state space models found in the time series literature. In particular, this model allows one to retain the latent process (occupancy indicators) in order to obtain small sample estimates, and to eventually generate extensions that incorporate structure in time and/or space through random effects.
The data used in the DOSS model comes from standard repeated presence/absence surveys, with N sampling locations (patches or sites) indexed by i = 1, 2, ..., N. Within a given season (e.g., year, month, or week, depending on the biology of the species), each sampling location is visited (surveyed) j = 1, 2, ..., J times. This process is repeated for t = 1, 2, ..., T seasons. Here, an important assumption is that the site occupancy status is closed within, but not across, seasons.
As is usual in the occupancy modeling framework, two different processes are considered. The first one is the detection process per site-visit-season combination, denoted by y_{ijt}. The y_{ijt} are indicator variables that take the value 1 if the species is detected at site i, survey j, and season t, and 0 otherwise. These detection indicators are assumed to be independent within each site and season. The second response considered is the partially observed presence (occupancy) indicators z_{it}. These are indicator variables which are equal to 1 whenever y_{ijt} = 1 for one or more of the visits made to site i during season t; otherwise the values of the z_{it}'s are unknown. Royle & Kéry refer to these two processes as the observation (y_{ijt}) and the state (z_{it}) models.
In this setting, the parameters of greatest interest are the occurrence or site occupancy probabilities, denoted by ψ_{it}, as well as those representing the population dynamics, which are accounted for by introducing changes in occupancy status over time through local colonization and survival. That is, if a site was not occupied at season t - 1, at season t it can either be colonized or remain unoccupied. On the other hand, if the site was in fact occupied at season t - 1, it can remain that way (survival) or become abandoned (local extinction) at season t. The probabilities of survival and colonization from season t - 1 to season t at the ith site are denoted by θ_{i(t-1)} and γ_{i(t-1)}, respectively.

During the initial period (or season), the model for the state process is expressed in terms of the occupancy probability (Equation 2–24). For subsequent periods, the state process is specified in terms of survival and colonization probabilities (Equation 2–25); in particular,
z_{i1} ~ Bernoulli(ψ_{i1})   (2–24)

z_{it} | z_{i(t-1)} ~ Bernoulli( z_{i(t-1)} θ_{i(t-1)} + (1 - z_{i(t-1)}) γ_{i(t-1)} ).   (2–25)

The observation model, conditional on the latent process z_{it}, is defined by

y_{ijt} | z_{it} ~ Bernoulli( z_{it} p_{ijt} ).   (2–26)
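The colonization-survival dynamics in (2–25) are straightforward to simulate; the sketch below (toy parameter values of our choosing) shows site occupancy evolving as a two-state Markov chain:

```python
import numpy as np

rng = np.random.default_rng(7)
N, T = 500, 6
psi1, theta, gamma = 0.4, 0.8, 0.2   # initial occupancy, survival, colonization

z = np.empty((N, T), dtype=int)
z[:, 0] = rng.binomial(1, psi1)
for t in range(1, T):
    # eq. (2-25): occupied sites persist w.p. theta, empty sites colonize w.p. gamma
    z[:, t] = rng.binomial(1, z[:, t - 1] * theta + (1 - z[:, t - 1]) * gamma)

# the chain's stationary occupancy rate is gamma / (1 - theta + gamma) = 0.5 here
occupancy_by_season = z.mean(axis=0)
```

With these values the occupancy rate drifts from 0.4 toward its stationary level, illustrating why assuming a single fixed ψ across seasons can be too restrictive.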
Royle & Kéry induce the heterogeneity by site, site-season, and site-survey-season, respectively, in the occupancy, survival and colonization, and detection probabilities, through the following specification:

logit(ψ_{i1}) = x_1 + r_i,  r_i ~ N(0, σ²_ψ),  logit^{-1}(x_1) ~ Unif(0, 1)
logit(θ_{it}) = a_t + u_i,  u_i ~ N(0, σ²_θ),  logit^{-1}(a_t) ~ Unif(0, 1)
logit(γ_{it}) = b_t + v_i,  v_i ~ N(0, σ²_γ),  logit^{-1}(b_t) ~ Unif(0, 1)
logit(p_{ijt}) = c_t + w_{ij},  w_{ij} ~ N(0, σ²_p),  logit^{-1}(c_t) ~ Unif(0, 1)   (2–27)

where x_1, a_t, b_t, c_t are the season fixed effects for the corresponding probabilities, and (r_i, u_i, v_i) and w_{ij} are the site and site-survey random effects, respectively. Additionally, all variance components assume the usual inverse gamma priors.
As the authors state, this formulation can be regarded as "being suitably vague"; however, it is also restrictive in the sense that it is not clear what strategy to follow to incorporate additional covariates while preserving the straightforward sampling strategy.
2.3.1 Dynamic Mixture Occupancy State-Space Model
We assume that the probabilities for occupancy, survival, colonization, and detection are all functions of linear combinations of covariates. However, our setup varies slightly from that considered by Royle & Kéry (2007). In essence, we modify the way in which the estimates for survival and colonization probabilities are attained. Our model incorporates the notion that occupancy at a site occupied during the previous season takes place through persistence, where we define persistence as a function of both survival and colonization. That is, a site occupied at time t may again be occupied at time t + 1 if the current settlers survive, if they perish and new settlers colonize simultaneously, or if both current settlers survive and new ones colonize.

Our functional forms of choice are again the probit and logit link functions. This means that each probability of interest, which we will refer to for illustration as δ, is linked to a linear combination of covariates x'ξ through the relationship defined by δ = F(x'ξ), where F(·) represents the inverse link function. This particular assumption facilitates relating the data augmentation algorithms of Albert & Chib and Polson et al. to Royle & Kéry's DOSS model. We refer to this extension of Royle & Kéry's model as the Dynamic Mixture Occupancy State Space model (DYMOSS).
As before, let y_{ijt} be the indicator variable used to mark detection of the target species on the jth survey at the ith site during the tth season, and let z_{it} be the indicator variable that denotes presence (z_{it} = 1) or absence (z_{it} = 0) of the target species at the ith site, tth season, with i ∈ {1, 2, ..., N}, j ∈ {1, 2, ..., J}, and t ∈ {1, 2, ..., T}. Additionally, assume that the probabilities for occupancy at time t = 1, persistence, colonization, and detection are all functions of covariates, with corresponding parameter vectors α, Δ^{(s)} = {δ^{(s)}_{t-1}}_{t=2}^{T}, B^{(c)} = {β^{(c)}_{t-1}}_{t=2}^{T}, and Λ = {λ_t}_{t=1}^{T}, and covariate matrices X^{(o)}, X = {X_{t-1}}_{t=2}^{T}, and Q = {Q_t}_{t=1}^{T}, respectively. Using the notation above, our proposed dynamic occupancy model is defined by the following hierarchy.

State model:
z_{i1} | α ~ Bernoulli(ψ_{i1}),  where ψ_{i1} = F( x'_{(o)i} α )

z_{it} | z_{i(t-1)}, δ^{(s)}_{t-1}, β^{(c)}_{t-1} ~ Bernoulli( z_{i(t-1)} θ_{i(t-1)} + (1 - z_{i(t-1)}) γ_{i(t-1)} ),
where θ_{i(t-1)} = F( δ^{(s)}_{t-1} + x'_{i(t-1)} β^{(c)}_{t-1} ), and
γ_{i(t-1)} = F( x'_{i(t-1)} β^{(c)}_{t-1} )   (2–28)

Observed model:

y_{ijt} | z_{it}, λ_t ~ Bernoulli( z_{it} p_{ijt} ),  where p_{ijt} = F( q'_{ijt} λ_t )   (2–29)
In the hierarchical setup given by Equations 2–28 and 2–29, θ_{i(t-1)} corresponds to the probability of persistence from time t - 1 to time t at site i, and γ_{i(t-1)} denotes the colonization probability. Note that θ_{i(t-1)} - γ_{i(t-1)} yields the survival probability from t - 1 to t. The effect of survival is introduced by changing the intercept of the linear predictor by a quantity δ^{(s)}_{t-1}. Although in this version of the model this effect is accomplished by just modifying the intercept, it can be extended to have covariates determining δ^{(s)}_{t-1} as well. The graphical representation of the model for a single site is shown in Figure 2-3.
[Figure: directed graph for one site, in which z_{i1} depends on α, each z_{it} depends on z_{i(t-1)} through δ^{(s)}_{t-1} and β^{(c)}_{t-1}, and each detection node y_{it} depends on z_{it} and λ_t.]

Figure 2-3. Graphical representation of the multiseason model for a single site
The joint posterior for the model defined by this hierarchical setting is

[z, Λ, α, B^{(c)}, Δ^{(s)} | y] = C_y ∏_{i=1}^{N} [ ψ_{i1} ∏_{j=1}^{J} p_{ij1}^{y_{ij1}} (1 - p_{ij1})^{(1 - y_{ij1})} ]^{z_{i1}} [ (1 - ψ_{i1}) ∏_{j=1}^{J} I_{y_{ij1} = 0} ]^{1 - z_{i1}} [λ_1][α] ×
∏_{t=2}^{T} ∏_{i=1}^{N} [ ( θ_{i(t-1)}^{z_{it}} (1 - θ_{i(t-1)})^{1 - z_{it}} )^{z_{i(t-1)}} ( γ_{i(t-1)}^{z_{it}} (1 - γ_{i(t-1)})^{1 - z_{it}} )^{1 - z_{i(t-1)}} ] [ ∏_{j=1}^{J} p_{ijt}^{y_{ijt}} (1 - p_{ijt})^{1 - y_{ijt}} ]^{z_{it}} × [ ∏_{j=1}^{J} I_{y_{ijt} = 0} ]^{1 - z_{it}} [λ_t][β^{(c)}_{t-1}][δ^{(s)}_{t-1}],   (2–30)

which, as in the single season case, is intractable. Once again, a Gibbs sampler cannot be constructed directly to sample from this joint posterior. The graphical representation of the model for one site, incorporating the latent variables, is provided in Figure 2-4.
[Figure: the same directed graph as in Figure 2-3, augmented with the latent variables u_{i1} (first-season occupancy), v_{i(t-1)} (persistence and colonization), and w_{it} (detection), each feeding the corresponding z or y node.]

Figure 2-4. Graphical representation of the data-augmented multiseason model
Probit link normal-mixture DYMOSS model
We deal with the intractability of the joint posterior distribution as before, that is, by introducing latent random variables. Each of the latent variables incorporates the relevant linear combinations of covariates for the probabilities considered in the model. This artifact enables us to sample from the joint posterior distributions of the model parameters. For the probit link, the sets of latent random variables, respectively for first season occupancy, persistence and colonization, and detection, are

• u_i ~ N( x'_{(o)i} α, 1 ),
• v_{i(t-1)} ~ z_{i(t-1)} N( δ^{(s)}_{t-1} + x'_{i(t-1)} β^{(c)}_{t-1}, 1 ) + (1 - z_{i(t-1)}) N( x'_{i(t-1)} β^{(c)}_{t-1}, 1 ), and
• w_{ijt} ~ N( q'_{ijt} λ_t, 1 ).
Introducing these latent variables into the hierarchical formulation yields the following.

State model:

u_i | α ~ N( x'_{(o)i} α, 1 )
z_{i1} | u_i ~ Bernoulli( I_{u_i > 0} )

for t > 1:
v_{i(t-1)} | z_{i(t-1)}, δ^{(s)}_{t-1}, β^{(c)}_{t-1} ~ z_{i(t-1)} N( δ^{(s)}_{t-1} + x'_{i(t-1)} β^{(c)}_{t-1}, 1 ) + (1 - z_{i(t-1)}) N( x'_{i(t-1)} β^{(c)}_{t-1}, 1 )
z_{it} | v_{i(t-1)} ~ Bernoulli( I_{v_{i(t-1)} > 0} )   (2–31)

Observed model:

w_{ijt} | λ_t ~ N( q'_{ijt} λ_t, 1 )
y_{ijt} | z_{it}, w_{ijt} ~ Bernoulli( z_{it} I_{w_{ijt} > 0} )   (2–32)
Note that the result presented in Section 2.2 corresponds to the particular case, for T = 1, of the model specified by Equations 2–31 and 2–32.

As mentioned previously, model parameters are obtained using a Gibbs sampling approach. Let φ(x | μ, σ²) denote the pdf of a normally distributed random variable x with mean μ and variance σ². Also let

1. W_t = (w_{1t}, w_{2t}, ..., w_{Nt}), with w_{it} = (w_{i1t}, w_{i2t}, ..., w_{iJ_{it}t}) (for i = 1, 2, ..., N and t = 1, 2, ..., T),
2. u = (u_1, u_2, ..., u_N), and
3. V = (v_1, ..., v_{T-1}), with v_t = (v_{1t}, v_{2t}, ..., v_{Nt}).
For the probit link model, the joint posterior distribution is

π( z, u, V, {W_t}_{t=1}^{T}, α, B^{(c)}, Δ^{(s)}, Λ ) ∝ [α] ∏_{i=1}^{N} φ( u_i | x'_{(o)i} α, 1 ) I_{u_i > 0}^{z_{i1}} I_{u_i ≤ 0}^{1 - z_{i1}} ×
∏_{t=2}^{T} [β^{(c)}_{t-1}, δ^{(s)}_{t-1}] ∏_{i=1}^{N} φ( v_{i(t-1)} | μ^{(v)}_{i(t-1)}, 1 ) I_{v_{i(t-1)} > 0}^{z_{it}} I_{v_{i(t-1)} ≤ 0}^{1 - z_{it}} ×
∏_{t=1}^{T} [λ_t] ∏_{i=1}^{N} ∏_{j=1}^{J_{it}} φ( w_{ijt} | q'_{ijt} λ_t, 1 ) ( z_{it} I_{w_{ijt} > 0} )^{y_{ijt}} ( 1 - z_{it} I_{w_{ijt} > 0} )^{(1 - y_{ijt})},

where μ^{(v)}_{i(t-1)} = z_{i(t-1)} δ^{(s)}_{t-1} + x'_{i(t-1)} β^{(c)}_{t-1}.   (2–33)
Initialize the Gibbs sampler at α^{(0)}, B^{(c)(0)}, Δ^{(s)(0)}, and Λ^{(0)}. The sampler proceeds iteratively, block sampling sequentially for each primary sampling period, as follows: first the presence process, then the latent variables from the data-augmentation step for the presence component, followed by the parameters for the presence process, then the latent variables for the detection component, and finally the parameters for the detection component. Letting [· | ·] denote the full conditional probability density function of the component conditional on all other unknown parameters and the observed data, for m = 1, ..., n_sim, the sampling procedure can be summarized as

[z_1^{(m)} | ·] → [u^{(m)} | ·] → [α^{(m)} | ·] → [W_1^{(m)} | ·] → [λ_1^{(m)} | ·] → [z_2^{(m)} | ·] → [v_1^{(m)} | ·] → [β^{(c)(m)}_1, δ^{(s)(m)}_1 | ·] → [W_2^{(m)} | ·] → [λ_2^{(m)} | ·] → ⋯
⋯ → [z_T^{(m)} | ·] → [v_{T-1}^{(m)} | ·] → [β^{(c)(m)}_{T-1}, δ^{(s)(m)}_{T-1} | ·] → [W_T^{(m)} | ·] → [λ_T^{(m)} | ·].

The full conditional probability densities for this Gibbs sampling algorithm are presented in detail within Appendix A.
Logit link Pólya-Gamma DYMOSS model

Using the same notation as before, the logit link model resorts to the following hierarchy.

State model:

u_i | α ~ PG( 1, x'_{(o)i} α )
z_{i1} | u_i ~ Bernoulli( I_{u_i > 0} )

for t > 1:
v_{i(t-1)} | z_{i(t-1)}, δ^{(s)}_{t-1}, β^{(c)}_{t-1} ~ PG( 1, | z_{i(t-1)} δ^{(s)}_{t-1} + x'_{i(t-1)} β^{(c)}_{t-1} | )
z_{it} | v_{i(t-1)} ~ Bernoulli( I_{v_{i(t-1)} > 0} )   (2–34)

Observed model:

w_{ijt} | λ_t ~ PG( 1, q'_{ijt} λ_t )
y_{ijt} | z_{it}, w_{ijt} ~ Bernoulli( z_{it} I_{w_{ijt} > 0} )   (2–35)
The logit link version of the joint posterior is given by

π( z, u, V, {W_t}_{t=1}^{T}, α, B^{(c)}, Δ^{(s)}, Λ ) ∝ [α][λ_1] ∏_{i=1}^{N} { (e^{x'_{(o)i} α})^{z_{i1}} / (1 + e^{x'_{(o)i} α}) } PG( u_i; 1, |x'_{(o)i} α| ) ×
∏_{j=1}^{J_{i1}} ( z_{i1} e^{q'_{ij1} λ_1} / (1 + e^{q'_{ij1} λ_1}) )^{y_{ij1}} ( 1 - z_{i1} e^{q'_{ij1} λ_1} / (1 + e^{q'_{ij1} λ_1}) )^{1 - y_{ij1}} PG( w_{ij1}; 1, |z_{i1} q'_{ij1} λ_1| ) ×
∏_{t=2}^{T} [δ^{(s)}_{t-1}][β^{(c)}_{t-1}][λ_t] ∏_{i=1}^{N} { ( exp[ μ^{(v)}_{i(t-1)} ] )^{z_{it}} / (1 + exp[ μ^{(v)}_{i(t-1)} ]) } PG( v_{i(t-1)}; 1, |μ^{(v)}_{i(t-1)}| ) ×
∏_{j=1}^{J_{it}} ( z_{it} e^{q'_{ijt} λ_t} / (1 + e^{q'_{ijt} λ_t}) )^{y_{ijt}} ( 1 - z_{it} e^{q'_{ijt} λ_t} / (1 + e^{q'_{ijt} λ_t}) )^{1 - y_{ijt}} PG( w_{ijt}; 1, |z_{it} q'_{ijt} λ_t| ),   (2–36)

with μ^{(v)}_{i(t-1)} = z_{i(t-1)} δ^{(s)}_{t-1} + x'_{i(t-1)} β^{(c)}_{t-1}.
The sampling procedure is entirely analogous to that described for the probit version. The full conditional densities derived from expression 2–36 are described in detail in Appendix A.
2.3.2 Incorporating Spatial Dependence
In this section we describe how an additional layer of complexity, space, can also be accounted for by continuing to use the same data-augmentation framework. The method we employ to incorporate spatial dependence is a slightly modified version of the traditional approach for spatial generalized linear mixed models (GLMMs), and extends the model proposed by Johnson et al. (2013) for the single season, closed population occupancy model.

The traditional approach consists of using spatial random effects to induce a correlation structure among adjacent sites. This formulation, introduced by Besag et al. (1991), assumes that the spatial random effect corresponds to a Gaussian Markov Random Field (GMRF). The model, known as the Spatial GLMM (SGLMM), is used to analyze areal data. It has been applied extensively, given the flexibility of its hierarchical formulation and the availability of software for its implementation (Hughes & Haran 2013).
Succinctly, the spatial dependence is accounted for in the model by adding a random vector η, assumed to have a conditionally autoregressive (CAR) prior (also known as the Gaussian Markov random field prior). To define the prior, let the pair G = (V, E) represent the undirected graph for the entire spatial region studied, where V = (1, 2, ..., N) denotes the vertices of the graph (sites), and E the set of edges between sites. E is constituted by elements of the form (i, j), indicating that sites i and j are spatially adjacent, for some i, j ∈ V. The prior for the spatial effects is then characterized by

[η | τ] ∝ τ^{rank(Ω)/2} exp[ -(τ/2) η'Ωη ],   (2–37)
where Ω = (diag(A1) - A) is the precision matrix, with A denoting the adjacency matrix. The entries of the adjacency matrix A are such that diag(A) = 0 and A_{ij} = I_{(i,j) ∈ E}.

The matrix Ω is singular; hence the probability density defined in Equation 2–37 is improper, i.e., it does not integrate to 1. Regardless of the impropriety of the prior, this model can be fitted using a Bayesian approach, since even if the prior is improper, the posterior for the model parameters is proper. If a constraint such as ∑_k η_k = 0 is imposed, or if the precision matrix is replaced by a positive definite matrix, the model can also be fitted using a maximum likelihood approach.
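The structure of the CAR precision matrix, and its rank deficiency, are easy to verify numerically. A toy sketch for a four-site path graph (our own example):

```python
import numpy as np

# adjacency matrix for a four-site path graph 1-2-3-4 (toy example)
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]])
prec = np.diag(A.sum(axis=1)) - A      # CAR precision, diag(A1) - A

# every row sums to zero, so the vector of ones is a null vector:
null_check = prec @ np.ones(4)         # all zeros
rank = np.linalg.matrix_rank(prec)     # N - 1 = 3 for a connected graph
```

The one-dimensional null space spanned by the constant vector is exactly why the density in (2–37) is improper, and why a sum-to-zero constraint on η restores identifiability.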
Assuming that all but the detection process are subject to spatial correlation, and using the notation we have developed up to this point, the spatially explicit version of the DYMOSS model is characterized by the hierarchy represented by Equations 2–38 and 2–39.

Hence, adding spatial structure into the DYMOSS framework described in the previous section only involves adding the steps to sample η^{(o)} and {η_t}_{t=2}^{T}, conditional on all other parameters. Furthermore, the corresponding parameters and spatial random effects of a given component (i.e., occupancy, survival, and colonization) can be effortlessly pooled together into a single parameter vector to perform block sampling. For each of the latent variables, the only modification required is to add the corresponding spatial effect to the linear predictor, so that these retain their conditional independence given the linear combination of fixed effects and the spatial effects.
State model:

z_{i1} | α, η^{(o)} ~ Bernoulli(ψ_{i1}),  where ψ_{i1} = F( x'_{(o)i} α + η^{(o)}_i )
[η^{(o)} | τ] ∝ τ^{rank(Ω)/2} exp[ -(τ/2) η^{(o)'} Ω η^{(o)} ]

z_{it} | z_{i(t-1)}, δ^{(s)}_{t-1}, β^{(c)}_{t-1}, η_t ~ Bernoulli( z_{i(t-1)} θ_{i(t-1)} + (1 - z_{i(t-1)}) γ_{i(t-1)} ),
where θ_{i(t-1)} = F( δ^{(s)}_{t-1} + x'_{i(t-1)} β^{(c)}_{t-1} + η_{it} ), and
γ_{i(t-1)} = F( x'_{i(t-1)} β^{(c)}_{t-1} + η_{it} )
[η_t | τ] ∝ τ^{rank(Ω)/2} exp[ -(τ/2) η'_t Ω η_t ]   (2–38)

Observed model:

y_{ijt} | z_{it}, λ_t ~ Bernoulli( z_{it} p_{ijt} ),  where p_{ijt} = F( q'_{ijt} λ_t )   (2–39)
In spite of the popularity of this approach to incorporating spatial dependence, three shortcomings have been reported in the literature (Hughes & Haran 2013; Reich et al. 2006): (1) model parameters have no clear interpretation, due to spatial confounding of the predictors with the spatial effect; (2) there is variance inflation due to spatial confounding; and (3) the high dimensionality of the latent spatial variables leads to high computational costs. To avoid such difficulties, we follow the approach used by Hughes & Haran (2013), which builds upon the earlier work by Reich et al. (2006). This methodology is summarized in what follows.
Let a vector of spatial effects η have the CAR prior given by 2–37 above. Now consider a random vector ζ ~ MVN(0, τK'ΩK), with Ω defined as above, and where τK'ΩK corresponds to the precision of the distribution and not the covariance matrix, with the matrix K satisfying K'K = I.

This last condition implies that the linear predictor Xβ + η = Xβ + Kζ. With respect to how the matrix K is chosen, Hughes & Haran (2013) recommend basing its construction on the spectral decomposition of operator matrices based on Moran's I. The Moran operator matrix is defined as P⊥AP⊥, with P⊥ = I - X(X'X)^{-1}X', and where A is the adjacency matrix previously described. The choice of the Moran operator is based on the fact that it accounts for the underlying graph while incorporating the spatial structure residual to the design matrix X. These elements appear in the spectral decomposition of the Moran operator: its eigenvalues correspond to the values of Moran's I statistic (a measure of spatial autocorrelation) for a spatial process orthogonal to X, while its eigenvectors provide the patterns of spatial dependence residual to X. Thus, the matrix K is chosen to be the matrix whose columns are the eigenvectors of the Moran operator for a particular adjacency matrix.
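The construction of K is a few lines of linear algebra. A sketch under toy assumptions (path-graph adjacency, one covariate, and an arbitrary choice of how many eigenvectors to keep):

```python
import numpy as np

rng = np.random.default_rng(3)
N, q = 30, 5                           # sites; number of eigenvectors kept (user's choice)

# toy path-graph adjacency and a design matrix with an intercept and one covariate
A = np.diag(np.ones(N - 1), 1); A = A + A.T
X = np.column_stack([np.ones(N), rng.normal(size=N)])

P_perp = np.eye(N) - X @ np.linalg.inv(X.T @ X) @ X.T   # projection orthogonal to X
M = P_perp @ A @ P_perp                                  # Moran operator
evals, evecs = np.linalg.eigh(M)
K = evecs[:, np.argsort(evals)[::-1][:q]]  # leading patterns of spatial dependence

# the reduced spatial term is then eta = K @ zeta, with zeta of dimension q << N
```

Keeping only the leading eigenvectors both removes the confounding with X (the columns of K are orthogonal to the column space of X by construction) and reduces the dimension of the spatial effect from N to q.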
Using this strategy, the new hierarchical formulation of our model is simply modified by letting η^{(o)} = K^{(o)} ζ^{(o)} and η_t = K_t ζ_t, with

1. ζ^{(o)} ~ MVN( 0, τ^{(o)} K^{(o)'} Ω K^{(o)} ), where K^{(o)} is the eigenvector matrix for P^{(o)⊥} A P^{(o)⊥}, and
2. ζ_t ~ MVN( 0, τ_t K'_t Ω K_t ), where K_t is the eigenvector matrix for P⊥_t A P⊥_t, for t = 2, 3, ..., T.

The algorithms for the probit and logit link from Section 2.3.1 can be readily adapted to incorporate the spatial structure, simply by obtaining the joint posteriors for (α, ζ^{(o)}) and (β^{(c)}_{t-1}, δ^{(s)}_{t-1}, ζ_t), making the obvious modification of the corresponding linear predictors to incorporate the spatial components.
2.4 Summary
With a few exceptions (Dorazio & Taylor-Rodríguez 2012; Johnson et al. 2013; Royle & Kéry 2007), recent Bayesian approaches to site-occupancy modeling with covariates have relied on model configurations (e.g., multivariate normal priors on parameters in logit scale) that lead to unfamiliar conditional posterior distributions, thus precluding the use of a direct sampling approach. Therefore, the sampling strategies available are based on algorithms (e.g., Metropolis-Hastings) that require tuning and the knowledge to do so correctly.

In Dorazio & Taylor-Rodríguez (2012) we proposed a Bayesian specification for which a Gibbs sampler of the basic occupancy model is available, and allowed detection and occupancy probabilities to depend on linear combinations of predictors. This method, described in Section 2.2.1, is based on the data augmentation algorithm of Albert & Chib (1993). There, the full conditional posteriors of the parameters of the probit regression model are cast as latent mixtures of normal random variables. The probit and the logit link yield similar results with large sample sizes; however, their results may differ when small to moderate sample sizes are considered, because the logit link function places more mass in the tails of the distribution than the probit link does. In Section 2.2.2 we adapt the method for the single season model to work with the logit link function.
The basic occupancy framework is useful, but it assumes a single closed population with fixed probabilities through time. Hence, its assumptions may not be appropriate to address problems where the interest lies in the temporal dynamics of the population. We therefore developed a dynamic model that incorporates the notion that occupancy at a site previously occupied takes place through persistence, which depends both on survival and habitat suitability. By this we mean that a site occupied at time t may again be occupied at time t + 1 if (1) the current settlers survive, (2) the existing settlers perish but new settlers simultaneously colonize, or (3) current settlers survive and new ones colonize during the same season. In our current formulation of the DYMOSS, both colonization and persistence depend on habitat suitability, characterized by x'_{i(t-1)} β^{(c)}_{t-1}. They only differ in that persistence is also influenced by whether the site being occupied during season t - 1 enhances the suitability of the site or harms it through density dependence.
Additionally, the study of the dynamics that govern the distribution and abundance of biological populations requires an understanding of the physical and biotic processes that act upon them, and these vary in time and space. Consequently, as a final step in this chapter, we described a straightforward strategy to add spatial dependence among neighboring sites in the dynamic metapopulation model. This extension is based on the popular Bayesian spatial modeling technique of Besag et al. (1991), updated using the methods described in Hughes & Haran (2013).
Future steps along these lines are to (1) develop the software necessary to implement the tools described throughout the chapter, and (2) build a suite of additional extensions of this framework for occupancy models. The first of these will be used to incorporate information from different sources, such as tracks, scats, surveys, and direct observations, into a single model. This can be accomplished by adding a layer to the hierarchy where the source and spatial scale of the data are accounted for. The second extension is a single season, spatially explicit, multiple species co-occupancy model. This model will allow studying complex interactions and testing hypotheses about species interactions at a given point in time. Lastly, this co-occupancy model will be adapted to incorporate temporal dynamics in the spirit of the DYMOSS model.
CHAPTER 3
INTRINSIC ANALYSIS FOR OCCUPANCY MODELS

Eliminate all other factors, and the one which remains must be the truth.
–Sherlock Holmes
The Sign of Four
3.1 Introduction
Occupancy models are often used to understand the mechanisms that dictate the distribution of a species. Therefore, variable selection plays a fundamental role in achieving this goal. To the best of our knowledge, "objective" Bayesian alternatives for variable selection have not been put forth for this problem, and with a few exceptions (Hooten & Hobbs 2014; Link & Barker 2009), AIC is the method used to choose from competing site-occupancy models. In addition, the procedures currently implemented and accessible to ecologists require enumerating and estimating all the candidate models (Fiske & Chandler 2011; Mazerolle & Mazerolle 2013). In practice, this can be achieved if the model space considered is small enough, which is possible if the choice of the model space is guided by substantial prior knowledge about the underlying ecological processes. Nevertheless, many site-occupancy surveys collect large amounts of covariate information about the sampled sites. Given that the total number of candidate models grows exponentially fast with the number of predictors considered, choosing a reduced set of models guided by ecological intuition becomes increasingly difficult. This is even more so the case in the occupancy model context, where the model space is the Cartesian product of models for presence and models for detection. Given the issues mentioned above, we propose the first objective Bayesian variable selection method for the single-season occupancy model framework. This approach explores the entire model space in a principled manner. It is completely
automatic, precluding the need for both tuning parameters in the sampling algorithm and subjective elicitation of parameter prior distributions.

As mentioned above, in ecological modeling, if model selection, or less frequently model averaging, is considered, the Akaike Information Criterion (AIC) (Akaike 1983), or a version of it, is the measure of choice for comparing candidate models (Fiske & Chandler 2011; Mazerolle & Mazerolle 2013). The AIC is designed to find the model that has, on average, the density closest in Kullback-Leibler distance to the density of the true data generating mechanism. The model with the smallest AIC is selected. However, if nested models are considered, one of them being the true one, generally the AIC will not select it (Wasserman 2000). Commonly, the model selected by AIC will be more complex than the true one. The reason for this is that the AIC has a weak signal-to-noise ratio, and as such it tends to overfit (Rao & Wu 2001). Other versions of the AIC provide a bias correction that enhances the signal-to-noise ratio, leading to a stronger penalization for model complexity. Some examples are the AICc (Hurvich & Tsai 1989) and AICu (McQuarrie et al. 1997); however, these are also not consistent for selection, albeit asymptotically efficient (Rao & Wu 2001).
If we are interested in prediction, as opposed to testing, the AIC is certainly appropriate. However, when conducting inference, the use of Bayesian model averaging and selection methods is more fitting. If the true data generating mechanism is among those considered, asymptotically, Bayesian methods choose the true model with probability one. Conversely, if the true model is not among the alternatives and a suitable parameter prior is used, the posterior probability of the most parsimonious model closest to the true one tends asymptotically to one.

In spite of this, in general, for Bayesian testing, direct elicitation of prior probabilistic statements is often impeded because the problems studied may not be sufficiently well understood to make an informed decision about the priors. Conversely, there may be a prohibitively large number of parameters, making specifying priors for each of
these parameters an arduous task In addition to this seemingly innocuous subjective
choices for the priors on the parameter space may drastically affect test outcomes
This has been a recurring argument in favor of objective Bayesian procedures
which appeal to the use of formal rules to build parameter priors that incorporate the
structural information inside the likelihood while utilizing some objective criterion (Kass amp
Wasserman 1996)
One popular choice of ldquoobjectiverdquo prior is the reference prior (Berger amp Bernardo
1992) which is the prior that maximizes the amount of signal extracted from the
data These priors have proven to be effective as they are fully automatic and can
be frequentist matching in the sense that the posterior credible interval agrees with the
frequentist confidence interval from repeated sampling with equal coverage-probability
(Kass amp Wasserman 1996) Reference priors however are improper and while
they yield reasonable posterior parameter probabilities the derived model posterior
probabilities may be ill defined To avoid this shortcoming Berger amp Pericchi (1996)
introduced the intrinsic Bayes factor (IBF) for model comparison Moreno et al (1998)
building on the IBF of Berger amp Pericchi (1996) developed a limiting procedure to
generate a system of priors that yield well-defined posteriors even though these
priors may sometimes be improper The IBF is built using a data-dependent prior to
automatically generate Bayes factors however the extension introduced by Moreno
et al (1998) generates the intrinsic prior by taking a theoretical average over the space
of training samples freeing the prior from data dependence
In our view, in the face of a large number of predictors, the best alternative is to run
a stochastic search algorithm using good "objective" testing parameter priors and to
incorporate suitable model priors. This being said, the discussion about model priors is
deferred until Chapter 4; this Chapter focuses on the priors on the parameter space.
The Chapter is structured as follows. First, issues surrounding multimodel inference
are described, and insight about objective Bayesian inferential procedures is provided.
Then, building on modern methods for "objective" Bayesian testing to generate priors
on the parameter space, the intrinsic priors for the parameters of the occupancy model
are derived. These are used in the construction of an algorithm for "objective" model
selection tailored to the occupancy model framework. To assess the performance of our
methods, we provide results from a simulation study in which distinct scenarios, both
favorable and unfavorable, are used to determine the robustness of these tools, and we
analyze the Blue Hawker data set, which has been examined previously in the ecological
literature (Dorazio & Taylor-Rodríguez, 2012; Kéry et al., 2010).

3.2 Objective Bayesian Inference
As mentioned before, in practice, noninformative priors arising from structural
rules are an alternative to subjective elicitation of priors. Some of the rules used in
defining noninformative priors include the principle of insufficient reason, parametrization
invariance, maximum entropy, geometric arguments, coverage matching, and decision-theoretic
approaches (see Kass & Wasserman (1996) for a discussion).
These rules reflect one of two attitudes: (1) noninformative priors either aim to
convey unique representations of ignorance, or (2) they attempt to produce probability
statements that may be accepted by convention. This latter attitude is in the same
spirit as how weights and distances are defined (Kass & Wasserman, 1996) and
characterizes the way in which Bayesian reference methods are interpreted today; i.e.,
noninformative priors are seen to be chosen by convention according to the situation.
A word of caution must be given when using noninformative priors. Difficulties arise
in their implementation that should not be taken lightly. In particular, these difficulties
may occur because noninformative priors are generally improper (meaning that they do
not integrate or sum to a finite number) and, as such, are said to depend on arbitrary
constants.
Bayes factors strongly depend upon the prior distributions for the parameters
included in each of the models being compared. This can be an important limitation
considering that, when using noninformative priors, their introduction will result in the
Bayes factors being a function of the ratio of arbitrary constants, given that these priors
are typically improper (see Jeffreys, 1961; Pericchi, 2005; and references therein).
Many different approaches have since been developed to deal with the arbitrary constants
that arise when using improper priors. These include the use of partial Bayes factors
(Berger & Pericchi, 1996; Good, 1950; Lempers, 1971), setting the ratio of arbitrary
constants to a predefined value (Spiegelhalter & Smith, 1982), and approximations to the
Bayes factor (see Haughton, 1988, as cited in Berger & Pericchi, 1996; Kass & Raftery,
1995; Tierney & Kadane, 1986).

3.2.1 The Intrinsic Methodology
Berger & Pericchi (1996) cleverly dealt with the arbitrary constants that arise when
using improper priors by introducing the intrinsic Bayes factor (IBF) procedure. This
solution, based on partial Bayes factors, provides the means to replace the improper
priors by proper "posterior" priors. The IBF is obtained by combining the model
structure with information contained in the observed data. Furthermore, they showed
that, as the sample size tends to infinity, the intrinsic Bayes factor corresponds to the
proper Bayes factor arising from the intrinsic priors.
Intrinsic priors, however, are not unique. The asymptotic correspondence between
the IBF and the Bayes factor arising from the intrinsic prior yields two functional
equations that are solved by a whole class of intrinsic priors. Because all the priors
in the class produce Bayes factors that are asymptotically equivalent to the IBF, for
finite sample sizes the resulting Bayes factor is not unique. To address this issue,
Moreno et al. (1998) formalized the methodology through the "limiting procedure".
This procedure allows one to obtain a unique Bayes factor, consolidating the method
as a valid objective Bayesian model selection procedure, which we will refer to as the
Bayes factor for intrinsic priors (BFIP). This result is particularly valid for nested models,
although the methodology may be extended, with some caution, to nonnested models.
As mentioned before, the Bayesian hypothesis testing procedure is highly sensitive
to parameter-prior specification, and not all priors that are useful for estimation are
recommended for hypothesis testing or model selection. Evidence of this is provided
by the Jeffreys-Lindley paradox, which states that a point null hypothesis will always
be accepted when the variance of a conjugate prior goes to infinity (Robert, 1993).
Additionally, when comparing nested models, the null model should correspond to
a substantial reduction in complexity from that of larger alternative models. Hence,
priors for the larger alternative models that place probability mass away from the null
model are wasteful: if the true model is "far" from the null, it will be easily detected by
any statistical procedure. Therefore, the prior on the alternative models should "work
harder" at selecting competitive models that are "close" to the null. This principle, known
as the Savage continuity condition (Gunel & Dickey, 1974), is widely recognized by
statisticians.
Interestingly, the intrinsic prior in correspondence with the BFIP automatically
satisfies the Savage continuity condition. That is, when comparing nested models, the
intrinsic prior for the more complex model is centered around the null model and, in spite
of arising from a limiting procedure, it is not subject to the Jeffreys-Lindley paradox.
Moreover, beyond the usual pairwise consistency of the Bayes factor for nested
models, Casella et al. (2009) show that the corresponding Bayesian procedure with
intrinsic priors for variable selection in normal regression is consistent in the entire
class of normal linear models, adding an important feature to the list of virtues of the
procedure. Consistency of the BFIP for the case where the dimension of the alternative
model grows with the sample size is discussed in Moreno et al. (2010).

3.2.2 Mixtures of g-Priors
As previously mentioned, in the Bayesian paradigm a model M in \mathcal{M} is defined
by a sampling density and a prior distribution. The sampling density associated with
model M is denoted by f(y \mid \beta_M, \sigma^2_M, M), where (\beta_M, \sigma^2_M) is a vector of model-specific
unknown parameters. The prior for model M and its corresponding set of parameters is

\pi(\beta_M, \sigma^2_M, M \mid \mathcal{M}) = \pi(\beta_M, \sigma^2_M \mid M, \mathcal{M}) \cdot \pi(M \mid \mathcal{M}).

Objective local priors for the model parameters (\beta_M, \sigma^2_M) are achieved through
modifications and extensions of Zellner's g-prior (Liang et al., 2008; Womack et al.,
2014). In particular, below we focus on the intrinsic prior and provide some details for
other scaled mixtures of g-priors. We defer the discussion on priors over the model
space until Chapter 5, where we describe them in detail and develop a few alternatives
of our own.

3.2.2.1 Intrinsic priors
An automatic choice of an objective prior is the intrinsic prior (Berger & Pericchi,
1996; Moreno et al., 1998). Because M_B \subseteq M for all M \in \mathcal{M}, the intrinsic prior for
(\beta_M, \sigma^2_M) is defined as an expected posterior prior

\pi^{I}(\beta_M, \sigma^2_M \mid M) = \int p^{R}(\beta_M, \sigma^2_M \mid \tilde{y}, M)\, m^{R}(\tilde{y} \mid M_B)\, d\tilde{y},   (3–1)

where \tilde{y} is a minimal training sample for model M, the superscript I denotes intrinsic distributions,
and the superscript R denotes distributions derived from the reference prior
\pi^{R}(\beta_M, \sigma^2_M \mid M) = c_M / \sigma^2_M. In (3–1),

m^{R}(\tilde{y} \mid M) = \iint f(\tilde{y} \mid \beta_M, \sigma^2_M, M)\, \pi^{R}(\beta_M, \sigma^2_M \mid M)\, d\beta_M\, d\sigma^2_M

is the reference marginal of \tilde{y} under model M, and

p^{R}(\beta_M, \sigma^2_M \mid \tilde{y}, M) = \frac{f(\tilde{y} \mid \beta_M, \sigma^2_M, M)\, \pi^{R}(\beta_M, \sigma^2_M \mid M)}{m^{R}(\tilde{y} \mid M)}

is the reference posterior density.
In the regression framework, the reference marginal m^{R} is improper and produces
improper intrinsic priors. However, the intrinsic Bayes factor of model M to the base
model M_B is well defined and given by

BF^{I}_{M,M_B}(y) = (1 - R^2_M)^{-\frac{n - |M_B|}{2}} \times \int_0^1 \left( \frac{n + \sin^2\!\left(\frac{\pi\theta}{2}\right)(|M|+1)}{n + \frac{\sin^2\!\left(\frac{\pi\theta}{2}\right)(|M|+1)}{1 - R^2_M}} \right)^{\frac{n - |M|}{2}} \left( \frac{\sin^2\!\left(\frac{\pi\theta}{2}\right)(|M|+1)}{n + \frac{\sin^2\!\left(\frac{\pi\theta}{2}\right)(|M|+1)}{1 - R^2_M}} \right)^{\frac{|M| - |M_B|}{2}} d\theta,   (3–2)

where R^2_M is the coefficient of determination of model M versus model M_B. The Bayes
factor between two models M and M' is defined as BF^{I}_{M,M'}(y) = BF^{I}_{M,M_B}(y) / BF^{I}_{M',M_B}(y).
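Since (3–2) is a one-dimensional integral, it is cheap to evaluate numerically. The following sketch (ours, not part of the dissertation's software; it assumes |M| and |M_B| count all regression parameters, intercept included) computes the BFIP by trapezoidal quadrature and normalizes a collection of such Bayes factors into model posterior probabilities:

```python
import numpy as np

def intrinsic_bf(r2, n, p_m, p_b, grid=20001):
    """Bayes factor BF^I_{M,M_B}(y) of Eq. 3-2, evaluated by
    trapezoidal quadrature over theta in (0, 1)."""
    theta = np.linspace(1e-9, 1.0 - 1e-9, grid)
    s = np.sin(np.pi * theta / 2.0) ** 2 * (p_m + 1)
    denom = n + s / (1.0 - r2)
    integrand = ((n + s) / denom) ** ((n - p_m) / 2.0) * \
                (s / denom) ** ((p_m - p_b) / 2.0)
    # manual trapezoid rule (avoids version-specific numpy helpers)
    integral = np.sum((integrand[1:] + integrand[:-1]) / 2.0 * np.diff(theta))
    return (1.0 - r2) ** (-(n - p_b) / 2.0) * integral

def posterior_probs(bfs, priors):
    """Model posterior probabilities from Bayes factors against a common
    base model and prior model probabilities (cf. Eq. 3-3 below)."""
    w = np.asarray(bfs, dtype=float) * np.asarray(priors, dtype=float)
    return w / w.sum()

# A model explaining half the variance with 3 parameters overwhelms the
# base model at n = 100, while a model with no fit is penalized:
bf_good = intrinsic_bf(r2=0.5, n=100, p_m=3, p_b=1)
bf_null = intrinsic_bf(r2=0.0, n=100, p_m=3, p_b=1)
```

The same function can rank any set of nested candidates against the base model, since pairwise Bayes factors are ratios of the base-model ones.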
The "goodness" of the model M based on the intrinsic priors is given by its posterior
probability

p^{I}(M \mid y, \mathcal{M}) = \frac{BF^{I}_{M,M_B}(y)\, \pi(M \mid \mathcal{M})}{\sum_{M' \in \mathcal{M}} BF^{I}_{M',M_B}(y)\, \pi(M' \mid \mathcal{M})}.   (3–3)
It has been shown that the system of intrinsic priors produces consistent model selection
(Casella et al., 2009; Girón et al., 2010). In the context of well-formulated models, the
true model M_T is the smallest well-formulated model M \in \mathcal{M} such that \alpha \in M whenever \beta_\alpha \neq 0.
If M_T is the true model, then the posterior probability of model M_T based on equation
(3–3) converges to 1.

3.2.2.2 Other mixtures of g-priors
Scaled mixtures of g-priors place a reference prior on (\beta_{M_B}, \sigma^2) and a multivariate
normal distribution on \beta \in M \setminus M_B; that is, a normal with mean 0 and precision matrix

\frac{q_M w}{n \sigma^2}\, Z'_M (I - H_0) Z_M,

where H_0 is the hat matrix associated with Z_{M_B}. The prior is completed by a prior on w
and a choice of scaling q_M, which is set at |M| + 1 to account for the minimal sample size of
M. Under these assumptions, the Bayes factor for M to M_B is given by

BF_{M,M_B}(y) = (1 - R^2_M)^{-\frac{n - |M_B|}{2}} \int \left( \frac{n + w(|M|+1)}{n + \frac{w(|M|+1)}{1 - R^2_M}} \right)^{\frac{n - |M|}{2}} \left( \frac{w(|M|+1)}{n + \frac{w(|M|+1)}{1 - R^2_M}} \right)^{\frac{|M| - |M_B|}{2}} \pi(w)\, dw.

We consider the following priors on w. The intrinsic prior is \pi(w) = \mathrm{Beta}(w \mid 0.5, 0.5),
which is only defined for w \in (0, 1). A version of the Zellner-Siow prior is given by
w \sim \mathrm{Gamma}(0.5, 0.5), which produces a multivariate Cauchy distribution on \beta. A family
of hyper-g priors is defined by \pi(w) \propto w^{-1/2}(\beta + w)^{-(\alpha+1)/2}, which has Cauchy-like
tails but produces more shrinkage than the Cauchy prior.
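The equivalence between the Beta(0.5, 0.5) mixing density and the θ-parametrization of (3–2) can be checked numerically: if θ is uniform on (0, 1), then w = sin²(πθ/2) is exactly a Beta(1/2, 1/2) draw. The sketch below (ours; it reuses the q_M = |M| + 1 scaling from the text) estimates the mixture Bayes factor by Monte Carlo under both representations and confirms they agree:

```python
import numpy as np

def bf_integrand(w, r2, n, p_m, p_b):
    """Integrand of the scaled mixture-of-g Bayes factor, q_M = |M| + 1."""
    s = w * (p_m + 1)
    denom = n + s / (1.0 - r2)
    return ((n + s) / denom) ** ((n - p_m) / 2.0) * \
           (s / denom) ** ((p_m - p_b) / 2.0)

def mixture_bf(r2, n, p_m, p_b, w_draws):
    """Monte Carlo estimate of BF_{M,M_B} given draws w_draws ~ pi(w)."""
    prefactor = (1.0 - r2) ** (-(n - p_b) / 2.0)
    return prefactor * bf_integrand(w_draws, r2, n, p_m, p_b).mean()

rng = np.random.default_rng(0)
w_beta = rng.beta(0.5, 0.5, size=200_000)                      # intrinsic mixing density
w_sine = np.sin(np.pi * rng.uniform(size=200_000) / 2.0) ** 2  # same law via theta
bf_beta = mixture_bf(0.4, 80, 4, 1, w_beta)
bf_sine = mixture_bf(0.4, 80, 4, 1, w_sine)
```

Swapping in Gamma(0.5, 0.5) draws (truncation aside, support is (0, ∞)) gives the Zellner-Siow variant under the same integrand.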
3.3 Objective Bayes Occupancy Model Selection
As mentioned before, Bayesian inferential approaches used for ecological models
are lacking. In particular, there exists a need for suitable objective and automatic
Bayesian testing procedures, and for software implementations that thoroughly explore the
model space considered. With this goal in mind, in this section we develop an objective,
intrinsic, and fully automatic Bayesian model selection methodology for single-season
site-occupancy models. We refer to this method as automatic and objective given that,
in its implementation, no hyperparameter tuning is required and that it is built using
noninformative priors with good testing properties (e.g., intrinsic priors).
An inferential method for the occupancy problem is possible using the intrinsic
approach, given that we are able to link intrinsic Bayesian tools for the normal linear
model through our probit formulation of the occupancy model. In other words, because
we can represent the single-season probit occupancy model through the hierarchy

y_{ij} \mid z_i, w_{ij} \sim \mathrm{Bernoulli}(z_i I_{w_{ij} > 0})
w_{ij} \mid \lambda \sim N(q'_{ij}\lambda, 1)
z_i \mid v_i \sim \mathrm{Bernoulli}(I_{v_i > 0})
v_i \mid \alpha \sim N(x'_i\alpha, 1),

it is possible to solve the selection problem on the latent-scale variables w_{ij} and v_i, and
to use those results at the level of the occupancy and detection processes.
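For concreteness, the hierarchy above can be simulated directly. The sketch below uses hypothetical coefficient values and randomly generated covariates (none of which come from the dissertation's study) to produce presence indicators z and detection histories y:

```python
import numpy as np

rng = np.random.default_rng(1)
N, J = 100, 5                                   # sites, surveys per site

# Hypothetical designs and coefficients (for illustration only)
x = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])
alpha = np.array([0.0, 1.0, -1.0])              # occupancy coefficients
q = np.dstack([np.ones((N, J)), rng.normal(size=(N, J))])
lam = np.array([0.5, 1.0])                      # detection coefficients

v = x @ alpha + rng.normal(size=N)              # v_i | alpha ~ N(x_i' alpha, 1)
z = (v > 0).astype(int)                         # z_i = I(v_i > 0)
w = q @ lam + rng.normal(size=(N, J))           # w_ij | lambda ~ N(q_ij' lambda, 1)
y = (z[:, None] * (w > 0)).astype(int)          # detections only at occupied sites
```

Note that y is identically zero at unoccupied sites, which is precisely why the presence process is only partially observed.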
In what follows, first we provide some necessary notation. Then a derivation of
the intrinsic priors for the parameters of the detection and occupancy components
is outlined. Using these priors, we obtain the general form of the model posterior
probabilities. Finally, the results are incorporated in a model selection algorithm for
site-occupancy data. Although the priors on the model space are not discussed in this
Chapter, the software and methods developed have different choices of model priors
built in.
3.3.1 Preliminaries

The notation used in Chapter 2 will be considered in this section as well. Namely,
presence is denoted by z, detection by y, their corresponding latent processes are
v and w, and the model parameters are denoted by \alpha and \lambda. However, some additional
notation is also necessary. Let M_0 = \{M_{0y}, M_{0z}\} denote the "base" model, defined by
the smallest models considered for the detection and presence processes. The base
models M_{0y} and M_{0z} include predictors that must be contained in every model that
belongs to the model space. Some examples of base models are the intercept-only
model, a model with covariates related to the sampling design, and a model including
some predictors important to the researcher that should be included in every model.
Furthermore, let the sets [K_z] = \{1, 2, \ldots, K_z\} and [K_y] = \{1, 2, \ldots, K_y\} index
the covariates considered for the variable selection procedure for the presence and
detection processes, respectively. That is, these sets denote the covariates that can
be added to the base models in M_0 or removed from the largest possible models
considered, M_{Fz} and M_{Fy}, which we will refer to as the "full" models. The model space
can then be represented by the Cartesian product of subsets A_y \subseteq [K_y]
and A_z \subseteq [K_z]. The entire model space is populated by models of the form
M_A = \{M_{A_y}, M_{A_z}\} \in \mathcal{M} = \mathcal{M}_y \times \mathcal{M}_z, with M_{A_y} \in \mathcal{M}_y and M_{A_z} \in \mathcal{M}_z.
For the presence process z, the design matrix for model M_{A_z} is given by the block
matrix X_{A_z} = (X_0 | X_{r,A}); X_0 corresponds to the design matrix of the base model
(which is such that M_{0z} \subseteq M_{A_z} \in \mathcal{M}_z for all A_z \subseteq [K_z]), and X_{r,A} corresponds to the submatrix
that contains the covariates indexed by A_z. Analogously, for the detection process y, the
design matrix is given by Q_{A_y} = (Q_0 | Q_{r,A}). Similarly, the coefficients for models M_{A_z} and
M_{A_y} are given by \alpha_A = (\alpha'_0, \alpha'_{r,A})' and \lambda_A = (\lambda'_0, \lambda'_{r,A})'.
With these elements in place, the model selection problem consists of finding
subsets of covariates indexed by A = \{A_z, A_y\} that have a high posterior probability
given the detection and occupancy processes. This is equivalent to finding models with
high posterior odds when compared to a suitable base model. These posterior odds are
given by

\frac{p(M_A \mid y, z)}{p(M_0 \mid y, z)} = \frac{m(y, z \mid M_A)\, \pi(M_A)}{m(y, z \mid M_0)\, \pi(M_0)} = BF_{M_A, M_0}(y, z)\, \frac{\pi(M_A)}{\pi(M_0)}.

Since we are able to represent the occupancy model as a truncation of latent
normal variables, it is possible to work through the occupancy model selection problem
on the latent normal scale used for the presence and detection processes. We formulate
two solutions to this problem: one that depends on the observed and latent components,
and another that depends solely on the latent-level variables used to data-augment the
problem. We will, however, focus on the latter approach, as it yields a straightforward
MCMC sampling scheme. For completeness, the other alternative is described in
Section 3.4.
At the root of our objective inferential procedure for occupancy models lies the
conditional argument introduced by Womack et al. (work in progress) for the simple
probit regression. In the occupancy setting, the argument is

p(M_A \mid y, z, w, v) = \frac{m(y, z, v, w \mid M_A)\, \pi(M_A)}{m(y, z, w, v)}
= \frac{f_{yz}(y, z \mid w, v) \left( \int f_{vw}(v, w \mid \alpha, \lambda, M_A)\, \pi_{\alpha\lambda}(\alpha, \lambda \mid M_A)\, d(\alpha, \lambda) \right) \pi(M_A)}{f_{yz}(y, z \mid w, v) \sum_{M^* \in \mathcal{M}} \left( \int f_{vw}(v, w \mid \alpha, \lambda, M^*)\, \pi_{\alpha\lambda}(\alpha, \lambda \mid M^*)\, d(\alpha, \lambda) \right) \pi(M^*)}
= \frac{m(v \mid M_{A_z})\, m(w \mid M_{A_y})\, \pi(M_A)}{m(v)\, m(w)}
\propto m(v \mid M_{A_z})\, m(w \mid M_{A_y})\, \pi(M_A),   (3–4)

where

1. f_{yz}(y, z \mid w, v) = \prod_{i=1}^{N} I_{z_i v_i > 0}\, I_{(1 - z_i) v_i \le 0} \prod_{j=1}^{J_i} (z_i I_{w_{ij} > 0})^{y_{ij}} (1 - z_i I_{w_{ij} > 0})^{1 - y_{ij}},

2. f_{vw}(v, w \mid \alpha, \lambda, M_A) = \underbrace{\left( \prod_{i=1}^{N} \phi(v_i;\, x'_i \alpha_{M_{A_z}},\, 1) \right)}_{f(v \mid \alpha_{r,A}, \alpha_0, M_{A_z})} \underbrace{\left( \prod_{i=1}^{N} \prod_{j=1}^{J_i} \phi(w_{ij};\, q'_{ij} \lambda_{M_{A_y}},\, 1) \right)}_{f(w \mid \lambda_{r,A}, \lambda_0, M_{A_y})}, and

3. \pi_{\alpha\lambda}(\alpha, \lambda \mid M_A) = \pi_\alpha(\alpha \mid M_{A_z})\, \pi_\lambda(\lambda \mid M_{A_y}).
This result implies that, once the occupancy and detection indicators are
conditioned on the latent processes v and w, respectively, the model posterior
probabilities depend only on the latent variables. Hence, in this case, the model
selection problem is driven by the posterior odds

\frac{p(M_A \mid y, z, w, v)}{p(M_0 \mid y, z, w, v)} = \frac{m(w, v \mid M_A)}{m(w, v \mid M_0)}\, \frac{\pi(M_A)}{\pi(M_0)},   (3–5)

where m(w, v \mid M_A) = m(w \mid M_{A_y}) \cdot m(v \mid M_{A_z}), with

m(v \mid M_{A_z}) = \iint f(v \mid \alpha_{r,A}, \alpha_0, M_{A_z})\, \pi(\alpha_{r,A} \mid \alpha_0, M_{A_z})\, \pi(\alpha_0)\, d\alpha_{r,A}\, d\alpha_0,   (3–6)

m(w \mid M_{A_y}) = \iint f(w \mid \lambda_{r,A}, \lambda_0, M_{A_y})\, \pi(\lambda_{r,A} \mid \lambda_0, M_{A_y})\, \pi(\lambda_0)\, d\lambda_0\, d\lambda_{r,A}.   (3–7)
3.3.2 Intrinsic Priors for the Occupancy Problem

In general, the intrinsic priors as defined by Moreno et al. (1998) use the functional
form of the response to inform their construction, assuming some preliminary prior
distribution, proper or improper, on the model parameters. For our purposes, we assume
noninformative improper priors for the parameters, denoted by \pi^N(\cdot \mid \cdot). Specifically, the
intrinsic priors \pi^{IP}(\theta_{M^*} \mid M^*) for a vector of parameters \theta_{M^*} corresponding to model
M^* \in \{M_0, M\} \subset \mathcal{M}, for a response vector s with probability density (or mass) function
f(s \mid \theta_{M^*}), are defined by

\pi^{IP}(\theta_{M_0} \mid M_0) = \pi^N(\theta_{M_0} \mid M_0)
\pi^{IP}(\theta_M \mid M) = \pi^N(\theta_M \mid M) \int \frac{m(\tilde{s} \mid M_0)}{m(\tilde{s} \mid M)}\, f(\tilde{s} \mid \theta_M, M)\, d\tilde{s},

where \tilde{s} is a theoretical training sample.
In what follows, whenever it is clear from the context, in an attempt to simplify the
notation, M_A will be used to refer to M_{A_z} or M_{A_y}, and A will denote A_z or A_y. To derive
the parameter priors involved in equations 3–6 and 3–7 using the objective intrinsic prior
strategy, we start by assuming flat priors \pi^N(\alpha_A \mid M_A) \propto c_A and \pi^N(\lambda_A \mid M_A) \propto d_A, where
c_A and d_A are unknown constants.
The intrinsic prior for the parameters associated with the occupancy process, \alpha_A,
conditional on model M_A, is

\pi^{IP}(\alpha_A \mid M_A) = \pi^N(\alpha_A \mid M_A) \int \frac{m(\tilde{v} \mid M_0)}{m(\tilde{v} \mid M_A)}\, f(\tilde{v} \mid \alpha_A, M_A)\, d\tilde{v},

where the marginals m(\tilde{v} \mid M_j), with j \in \{A, 0\}, are obtained by solving the analogue of
equation 3–6 for the (theoretical) training sample \tilde{v}. These marginals are given by

m(\tilde{v} \mid M_j) = c_j\, (2\pi)^{-\frac{p_{A_z} - p_j}{2}}\, |\tilde{X}'_j \tilde{X}_j|^{-\frac{1}{2}}\, e^{-\frac{1}{2} \tilde{v}'(I - \tilde{H}_j)\tilde{v}}.

The training sample \tilde{v} has dimension p_{A_z} = |M_{A_z}|, that is, the total number of
parameters in model M_{A_z}. Note that, without ambiguity, we use |\cdot| to denote both
the cardinality of a set and the determinant of a matrix. The design matrix \tilde{X}_A
corresponds to the training sample \tilde{v} and is chosen such that \tilde{X}'_A \tilde{X}_A = \frac{p_{A_z}}{N} X'_A X_A
(León-Novelo et al., 2012), and \tilde{H}_j is the corresponding hat matrix.
Replacing m(\tilde{v} \mid M_0) and m(\tilde{v} \mid M_A) in \pi^{IP}(\alpha_A \mid M_A) and solving the integral with
respect to the theoretical training sample \tilde{v}, we have

\pi^{IP}(\alpha_A \mid M_A) = c_A \int \left( (2\pi)^{-\frac{p_{A_z} - p_{0z}}{2}} \left( \frac{c_0}{c_A} \right) e^{-\frac{1}{2}\tilde{v}'\left((I - \tilde{H}_0) - (I - \tilde{H}_A)\right)\tilde{v}}\, \frac{|\tilde{X}'_A \tilde{X}_A|^{1/2}}{|\tilde{X}'_0 \tilde{X}_0|^{1/2}} \right) \times \left( (2\pi)^{-\frac{p_{A_z}}{2}} e^{-\frac{1}{2}(\tilde{v} - \tilde{X}_A \alpha_A)'(\tilde{v} - \tilde{X}_A \alpha_A)} \right) d\tilde{v}
= c_0\, (2\pi)^{-\frac{p_{A_z} - p_{0z}}{2}}\, |\tilde{X}'_{r,A} \tilde{X}_{r,A}|^{\frac{1}{2}}\, 2^{-\frac{p_{A_z} - p_{0z}}{2}} \exp\left[ -\frac{1}{2} \alpha'_{r,A} \left( \frac{1}{2} \tilde{X}'_{r,A} \tilde{X}_{r,A} \right) \alpha_{r,A} \right]
= \pi^N(\alpha_0) \times \mathcal{N}\!\left( \alpha_{r,A} \,\middle|\, 0,\; 2\,(\tilde{X}'_{r,A} \tilde{X}_{r,A})^{-1} \right).   (3–8)
Analogously, the intrinsic prior for the parameters associated with the detection
process is

\pi^{IP}(\lambda_A \mid M_A) = d_0\, (2\pi)^{-\frac{p_{A_y} - p_{0y}}{2}}\, |\tilde{Q}'_{r,A} \tilde{Q}_{r,A}|^{\frac{1}{2}}\, 2^{-\frac{p_{A_y} - p_{0y}}{2}} \exp\left[ -\frac{1}{2} \lambda'_{r,A} \left( \frac{1}{2} \tilde{Q}'_{r,A} \tilde{Q}_{r,A} \right) \lambda_{r,A} \right]
= \pi^N(\lambda_0) \times \mathcal{N}\!\left( \lambda_{r,A} \,\middle|\, 0,\; 2\,(\tilde{Q}'_{r,A} \tilde{Q}_{r,A})^{-1} \right).   (3–9)

In short, the intrinsic priors for \alpha_A = (\alpha'_0, \alpha'_{r,A})' and \lambda_A = (\lambda'_0, \lambda'_{r,A})' are the product
of a reference prior on the parameters of the base model and a normal density on the
parameters indexed by A_z and A_y, respectively.

3.3.3 Model Posterior Probabilities
We now derive the expressions involved in the calculations of the model posterior
probabilities. First, recall that p(M_A \mid y, z, w, v) \propto m(w, v \mid M_A)\, \pi(M_A). Hence, determining
this posterior probability only requires calculating m(w, v \mid M_A).
Note that, since w and v are independent, obtaining the model posteriors from
expression 3–4 reduces to finding closed-form expressions for the marginals m(v \mid M_{A_z})
and m(w \mid M_{A_y}) from equations 3–6 and 3–7, respectively. Therefore,

m(w, v \mid M_A) = \iint f(v, w \mid \alpha, \lambda, M_A)\, \pi^{IP}(\alpha \mid M_{A_z})\, \pi^{IP}(\lambda \mid M_{A_y})\, d\alpha\, d\lambda.   (3–10)
For the latent variable associated with the occupancy process, plugging the
parameter intrinsic prior given by 3–8 into equation 3–6 (recalling that \tilde{X}'_A \tilde{X}_A = \frac{p_{A_z}}{N} X'_A X_A)
and integrating out \alpha_A yields

m(v \mid M_A) = \iint c_0\, \mathcal{N}(v \mid X_0 \alpha_0 + X_{r,A} \alpha_{r,A},\, I)\, \mathcal{N}\!\left( \alpha_{r,A} \mid 0,\, 2\,(\tilde{X}'_{r,A} \tilde{X}_{r,A})^{-1} \right) d\alpha_{r,A}\, d\alpha_0
= c_0 (2\pi)^{-n/2} \int \left( \frac{p_{A_z}}{2N + p_{A_z}} \right)^{\frac{p_{A_z} - p_{0z}}{2}} \exp\left[ -\frac{1}{2}(v - X_0 \alpha_0)'\left( I - \left( \frac{2N}{2N + p_{A_z}} \right) H_{r,A_z} \right)(v - X_0 \alpha_0) \right] d\alpha_0
= c_0\, (2\pi)^{-(n - p_{0z})/2} \left( \frac{p_{A_z}}{2N + p_{A_z}} \right)^{\frac{p_{A_z} - p_{0z}}{2}} |X'_0 X_0|^{-\frac{1}{2}} \times \exp\left[ -\frac{1}{2} v'\left( I - H_{0z} - \left( \frac{2N}{2N + p_{A_z}} \right) H_{r,A_z} \right) v \right],   (3–11)

with H_{r,A_z} = H_{A_z} - H_{0z}, where H_{A_z} is the hat matrix for the entire model M_{A_z} and H_{0z} is
the hat matrix for the base model.
Similarly, the marginal distribution for w is

m(w \mid M_A) = d_0\, (2\pi)^{-(J - p_{0y})/2} \left( \frac{p_{A_y}}{2J + p_{A_y}} \right)^{\frac{p_{A_y} - p_{0y}}{2}} |Q'_0 Q_0|^{-\frac{1}{2}} \times \exp\left[ -\frac{1}{2} w'\left( I - H_{0y} - \left( \frac{2J}{2J + p_{A_y}} \right) H_{r,A_y} \right) w \right],   (3–12)

where J = \sum_{i=1}^{N} J_i; in other words, J denotes the total number of surveys conducted.
Now, the marginals for the base model M_0 = \{M_{0y}, M_{0z}\} are

m(v \mid M_0) = \int c_0\, \mathcal{N}(v \mid X_0 \alpha_0,\, I)\, d\alpha_0 = c_0 (2\pi)^{-(n - p_{0z})/2} |X'_0 X_0|^{-1/2} \exp\left[ -\frac{1}{2} v'(I - H_{0z}) v \right]   (3–13)

and

m(w \mid M_0) = d_0 (2\pi)^{-(J - p_{0y})/2} |Q'_0 Q_0|^{-1/2} \exp\left[ -\frac{1}{2} w'(I - H_{0y}) w \right].   (3–14)
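Because (3–11) and (3–13) are available in closed form, the log Bayes factor on the latent occupancy scale is a few lines of linear algebra. A sketch (ours; it drops the common constant log c0, which cancels in the comparison, and uses simulated latent scores rather than real data):

```python
import numpy as np

def hat(X):
    """Hat (projection) matrix X (X'X)^{-1} X'."""
    return X @ np.linalg.solve(X.T @ X, X.T)

def log_m_v(v, X0, Xr, N):
    """log m(v | M_A) of Eq. 3-11, up to the additive constant log c0."""
    n = len(v)
    XA = np.column_stack([X0, Xr])
    pA, p0 = XA.shape[1], X0.shape[1]
    Hr = hat(XA) - hat(X0)                     # H_{r,A_z}
    shrink = 2 * N / (2 * N + pA)
    quad = v @ (v - hat(X0) @ v - shrink * (Hr @ v))
    return (-(n - p0) / 2 * np.log(2 * np.pi)
            + (pA - p0) / 2 * np.log(pA / (2 * N + pA))
            - 0.5 * np.linalg.slogdet(X0.T @ X0)[1]
            - 0.5 * quad)

def log_m_v_base(v, X0):
    """log m(v | M_0) of Eq. 3-13, up to log c0."""
    n, p0 = len(v), X0.shape[1]
    return (-(n - p0) / 2 * np.log(2 * np.pi)
            - 0.5 * np.linalg.slogdet(X0.T @ X0)[1]
            - 0.5 * v @ (v - hat(X0) @ v))

rng = np.random.default_rng(2)
n = 100
X0 = np.ones((n, 1))                           # intercept-only base model
Xr = rng.normal(size=(n, 2))                   # candidate predictors
v_signal = Xr @ np.array([1.0, -1.0]) + rng.normal(size=n)
v_noise = rng.normal(size=n)
log_bf_signal = log_m_v(v_signal, X0, Xr, N=n) - log_m_v_base(v_signal, X0)
log_bf_noise = log_m_v(v_noise, X0, Xr, N=n) - log_m_v_base(v_noise, X0)
```

Note the built-in penalty term ((p_A - p_0)/2) log(p_A / (2N + p_A)), which must be overcome by the fit improvement before the larger model is favored.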
3.3.4 Model Selection Algorithm

Having the parameter intrinsic priors in place and knowing the form of the model
posterior probabilities, it is finally possible to develop a strategy to conduct model
selection for the occupancy framework.
For each of the two components of the model (occupancy and detection), the
algorithm first draws the set of active predictors (i.e., A_z and A_y) together with their
corresponding parameters. This is a reversible jump step, which uses a Metropolis-Hastings
correction with proposal distributions given by

q(A^*_z \mid z_o, z^{(t)}_u, v^{(t)}, M_{A_z}) = \frac{1}{2}\left( p\!\left(M_{A^*_z} \mid z_o, z^{(t)}_u, v^{(t)},\, M_{A^*_z} \in L(M_{A_z})\right) + \frac{1}{|L(M_{A_z})|} \right)
q(A^*_y \mid y, z_o, z^{(t)}_u, w^{(t)}, M_{A_y}) = \frac{1}{2}\left( p\!\left(M_{A^*_y} \mid y, z_o, z^{(t)}_u, w^{(t)},\, M_{A^*_y} \in L(M_{A_y})\right) + \frac{1}{|L(M_{A_y})|} \right),   (3–15)

where L(M_{A_z}) and L(M_{A_y}) denote the sets of models obtained by adding or removing
one predictor at a time from M_{A_z} and M_{A_y}, respectively.
To promote mixing, this step is followed by an additional draw from the full
conditionals of \alpha and \lambda. The densities p(\alpha_0 \mid \cdot), p(\alpha_{r,A} \mid \cdot), p(\lambda_0 \mid \cdot), and p(\lambda_{r,A} \mid \cdot) can
be sampled from directly with Gibbs steps. Using the notation a \mid \cdot to denote the random
variable a conditioned on all other parameters and on the data, these densities are given
by:

• \alpha_0 \mid \cdot \sim \mathcal{N}\!\left( (X'_0 X_0)^{-1} X'_0 v,\; (X'_0 X_0)^{-1} \right)

• \alpha_{r,A} \mid \cdot \sim \mathcal{N}(\mu_{\alpha_{r,A}}, \Sigma_{\alpha_{r,A}}), where the covariance matrix and mean vector are
given by \Sigma_{\alpha_{r,A}} = \frac{2N}{2N + p_{A_z}} (X'_{r,A} X_{r,A})^{-1} and \mu_{\alpha_{r,A}} = \Sigma_{\alpha_{r,A}} X'_{r,A} v

• \lambda_0 \mid \cdot \sim \mathcal{N}\!\left( (Q'_0 Q_0)^{-1} Q'_0 w,\; (Q'_0 Q_0)^{-1} \right), and

• \lambda_{r,A} \mid \cdot \sim \mathcal{N}(\mu_{\lambda_{r,A}}, \Sigma_{\lambda_{r,A}}), analogously, with covariance matrix and mean given by
\Sigma_{\lambda_{r,A}} = \frac{2J}{2J + p_{A_y}} (Q'_{r,A} Q_{r,A})^{-1} and \mu_{\lambda_{r,A}} = \Sigma_{\lambda_{r,A}} Q'_{r,A} w
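As a sketch of these Gibbs steps for the occupancy coefficients (our illustration with simulated inputs; the detection block is identical with Q, w, and 2J/(2J + p_{A_y}) in place of X, v, and 2N/(2N + p_{A_z})):

```python
import numpy as np

def draw_alpha0(X0, v, rng):
    """alpha_0 | . ~ N((X0'X0)^{-1} X0'v, (X0'X0)^{-1})."""
    prec_inv = np.linalg.inv(X0.T @ X0)
    return rng.multivariate_normal(prec_inv @ X0.T @ v, prec_inv)

def draw_alpha_r(Xr, v, p_Az, N, rng):
    """alpha_{r,A} | . ~ N(mu, Sigma), with
    Sigma = (2N / (2N + p_Az)) (Xr'Xr)^{-1} and mu = Sigma Xr'v."""
    Sigma = (2 * N / (2 * N + p_Az)) * np.linalg.inv(Xr.T @ Xr)
    return rng.multivariate_normal(Sigma @ (Xr.T @ v), Sigma)

rng = np.random.default_rng(3)
N = 200
X0 = np.ones((N, 1))                  # intercept-only base model
Xr = rng.normal(size=(N, 2))          # extra predictors indexed by A_z
v = rng.normal(size=N)                # stand-in latent occupancy scores
a0 = draw_alpha0(X0, v, rng)
ar = draw_alpha_r(Xr, v, p_Az=3, N=N, rng=rng)
```

In the full sampler these draws alternate with the reversible jump step and the latent-variable updates described next.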
Finally, Gibbs sampling steps are also available for the unobserved occupancy
indicators z_u and for the corresponding latent variables v and w. The full conditional
posterior densities for z^{(t+1)}_u, v^{(t+1)}, and w^{(t+1)} are those introduced in Chapter 2 for the
single-season probit model.
The following steps summarize the stochastic search algorithm:

1. Initialize A^{(0)}_y, A^{(0)}_z, z^{(0)}_u, v^{(0)}, w^{(0)}, \alpha^{(0)}_0, \lambda^{(0)}_0.

2. Sample the model indices and corresponding parameters:

(a) Draw simultaneously
• A^*_z \sim q(A_z \mid z_o, z^{(t)}_u, v^{(t)}, M_{A_z}),
• \alpha^*_0 \sim p(\alpha_0 \mid M_{A^*_z}, z_o, z^{(t)}_u, v^{(t)}), and
• \alpha^*_{r,A^*} \sim p(\alpha_{r,A} \mid M_{A^*_z}, z_o, z^{(t)}_u, v^{(t)}).

(b) Accept (M^{(t+1)}_{A_z}, \alpha^{(t+1),1}_0, \alpha^{(t+1),1}_{r,A}) = (M_{A^*_z}, \alpha^*_0, \alpha^*_{r,A^*}) with probability

\delta_z = \min\left( 1,\; \frac{p(M_{A^*_z} \mid z_o, z^{(t)}_u, v^{(t)})}{p(M_{A^{(t)}_z} \mid z_o, z^{(t)}_u, v^{(t)})}\, \frac{q(A^{(t)}_z \mid z_o, z^{(t)}_u, v^{(t)}, M_{A^*_z})}{q(A^*_z \mid z_o, z^{(t)}_u, v^{(t)}, M_{A^{(t)}_z})} \right);

otherwise, let (M^{(t+1)}_{A_z}, \alpha^{(t+1),1}_0, \alpha^{(t+1),1}_{r,A}) = (M_{A^{(t)}_z}, \alpha^{(t),2}_0, \alpha^{(t),2}_{r,A}).

(c) Draw simultaneously
• A^*_y \sim q(A_y \mid y, z_o, z^{(t)}_u, w^{(t)}, M_{A_y}),
• \lambda^*_0 \sim p(\lambda_0 \mid M_{A^*_y}, y, z_o, z^{(t)}_u, w^{(t)}), and
• \lambda^*_{r,A^*} \sim p(\lambda_{r,A} \mid M_{A^*_y}, y, z_o, z^{(t)}_u, w^{(t)}).

(d) Accept (M^{(t+1)}_{A_y}, \lambda^{(t+1),1}_0, \lambda^{(t+1),1}_{r,A}) = (M_{A^*_y}, \lambda^*_0, \lambda^*_{r,A^*}) with probability

\delta_y = \min\left( 1,\; \frac{p(M_{A^*_y} \mid y, z_o, z^{(t)}_u, w^{(t)})}{p(M_{A^{(t)}_y} \mid y, z_o, z^{(t)}_u, w^{(t)})}\, \frac{q(A^{(t)}_y \mid y, z_o, z^{(t)}_u, w^{(t)}, M_{A^*_y})}{q(A^*_y \mid y, z_o, z^{(t)}_u, w^{(t)}, M_{A^{(t)}_y})} \right);

otherwise, let (M^{(t+1)}_{A_y}, \lambda^{(t+1),1}_0, \lambda^{(t+1),1}_{r,A}) = (M_{A^{(t)}_y}, \lambda^{(t),2}_0, \lambda^{(t),2}_{r,A}).

3. Sample base model parameters:
(a) Draw \alpha^{(t+1),2}_0 \sim p(\alpha_0 \mid M_{A^{(t+1)}_z}, z_o, z^{(t)}_u, v^{(t)}).
(b) Draw \lambda^{(t+1),2}_0 \sim p(\lambda_0 \mid M_{A^{(t+1)}_y}, y, z_o, z^{(t)}_u, w^{(t)}).

4. To improve mixing, resample the model coefficients that are not in the base model
but are in M_A:
(a) Draw \alpha^{(t+1),2}_{r,A} \sim p(\alpha_{r,A} \mid M_{A^{(t+1)}_z}, z_o, z^{(t)}_u, v^{(t)}).
(b) Draw \lambda^{(t+1),2}_{r,A} \sim p(\lambda_{r,A} \mid M_{A^{(t+1)}_y}, y, z_o, z^{(t)}_u, w^{(t)}).

5. Sample latent and missing (unobserved) variables:
(a) Sample z^{(t+1)}_u \sim p(z_u \mid M_{A^{(t+1)}_z}, M_{A^{(t+1)}_y}, y, \alpha^{(t+1),2}_{r,A}, \alpha^{(t+1),2}_0, \lambda^{(t+1),2}_{r,A}, \lambda^{(t+1),2}_0).
(b) Sample v^{(t+1)} \sim p(v \mid M_{A^{(t+1)}_z}, z_o, z^{(t+1)}_u, \alpha^{(t+1),2}_{r,A}, \alpha^{(t+1),2}_0).
(c) Sample w^{(t+1)} \sim p(w \mid M_{A^{(t+1)}_y}, y, z_o, z^{(t+1)}_u, \lambda^{(t+1),2}_{r,A}, \lambda^{(t+1),2}_0).
3.4 Alternative Formulation

Because the occupancy process is partially observed, it is reasonable to consider
the posterior odds in terms of the observed responses, that is, the detections y and
the presences at sites where at least one detection takes place. Partitioning the vector
of presences into observed and unobserved components, z = (z'_o, z'_u)', and integrating out the
unobserved component, the model posterior for M_A can be obtained as

p(M_A \mid y, z_o) \propto E_{z_u}[m(y, z \mid M_A)]\, \pi(M_A).   (3–16)

Data-augmenting the model in terms of latent normal variables à la Albert and Chib,
the marginals for any model M = \{M_y, M_z\} \in \mathcal{M} of z and y inside the expectation in
equation 3–16 can be expressed in terms of the latent variables:

m(y, z \mid M) = \int_{T(z)} \int_{T(y,z)} m(w, v \mid M)\, dw\, dv
= \left( \int_{T(z)} m(v \mid M_z)\, dv \right) \left( \int_{T(y,z)} m(w \mid M_y)\, dw \right),   (3–17)

where T(z) and T(y, z) denote the corresponding truncation regions for v and w, which
depend on the values taken by z and y, and

m(v \mid M_z) = \int f(v \mid \alpha, M_z)\, \pi(\alpha \mid M_z)\, d\alpha,   (3–18)

m(w \mid M_y) = \int f(w \mid \lambda, M_y)\, \pi(\lambda \mid M_y)\, d\lambda.   (3–19)

The last equality in equation 3–17 is a consequence of the independence of the
latent processes v and w. Using expressions 3–18 and 3–19 allows one to embed this
model selection problem in the classical normal linear regression setting, where many
"objective" Bayesian inferential tools are available. In particular, these expressions
facilitate deriving the parameter intrinsic priors (Berger & Pericchi, 1996; Moreno
et al., 1998) for this problem. This approach is an extension of the one implemented in
León-Novelo et al. (2012) for the simple probit regression problem.
Using this alternative approach, all that is left is to integrate m(v \mid M_A) and m(w \mid M_A)
over their corresponding truncation regions T(z) and T(y, z), which yields m(y, z \mid M_A),
and then to obtain the expectation with respect to the unobserved z's. Note, however,
that two issues arise. First, such integrals are not available in closed form. Second,
calculating the expectation over the limits of integration further complicates things. To
address these difficulties, it is possible to express E[m(y, z \mid M_A)] as

E_{z_u}[m(y, z \mid M_A)] = E_{z_u}\left[ \left( \int_{T(z)} m(v \mid M_{A_z})\, dv \right) \left( \int_{T(y,z)} m(w \mid M_{A_y})\, dw \right) \right]
= E_{z_u}\left[ \left( \int_{T(z)} \int m(v \mid M_{A_z}, \alpha_0)\, \pi^{IP}(\alpha_0 \mid M_{A_z})\, d\alpha_0\, dv \right) \times \left( \int_{T(y,z)} \int m(w \mid M_{A_y}, \lambda_0)\, \pi^{IP}(\lambda_0 \mid M_{A_y})\, d\lambda_0\, dw \right) \right]
= E_{z_u}\left[ \int \underbrace{\left( \int_{T(z)} m(v \mid M_{A_z}, \alpha_0)\, dv \right)}_{g_1(T(z) \mid M_{A_z}, \alpha_0)} \pi^{IP}(\alpha_0 \mid M_{A_z})\, d\alpha_0 \times \int \underbrace{\left( \int_{T(y,z)} m(w \mid M_{A_y}, \lambda_0)\, dw \right)}_{g_2(T(y,z) \mid M_{A_y}, \lambda_0)} \pi^{IP}(\lambda_0 \mid M_{A_y})\, d\lambda_0 \right]
= c_0\, d_0 \iint E_{z_u}\left[ g_1(T(z) \mid M_{A_z}, \alpha_0)\, g_2(T(y, z) \mid M_{A_y}, \lambda_0) \right] d\alpha_0\, d\lambda_0,   (3–20)

where the last equality follows from Fubini's theorem, since m(v \mid M_{A_z}, \alpha_0) and
m(w \mid M_{A_y}, \lambda_0) are proper densities. From (3–20), the posterior odds are

\frac{p(M_A \mid y, z_o)}{p(M_0 \mid y, z_o)} = \frac{\iint E_{z_u}\left[ g_1(T(z) \mid M_{A_z}, \alpha_0)\, g_2(T(y, z) \mid M_{A_y}, \lambda_0) \right] d\alpha_0\, d\lambda_0}{\iint E_{z_u}\left[ g_1(T(z) \mid M_{0z}, \alpha_0)\, g_2(T(y, z) \mid M_{0y}, \lambda_0) \right] d\alpha_0\, d\lambda_0}\, \frac{\pi(M_A)}{\pi(M_0)}.   (3–21)
3.5 Simulation Experiments

The proposed methodology was tested under 36 different scenarios, in which we
evaluate the behavior of the algorithm by varying the number of sites, the number of
surveys, the amount of signal in the predictors for the presence component, and, finally,
the amount of signal in the predictors for the detection component.
For each model component, the base model is taken to be the intercept-only model,
and the full models considered for the presence and the detection components have 30
and 20 predictors, respectively. Therefore, the model space contains 2^{30} \times 2^{20} \approx 1.12 \times 10^{15} candidate
models.
To control the amount of signal in the presence and detection components, values
for the model parameters were purposefully chosen so that quantiles 10, 50, and 90 of the
occupancy and detection probabilities match some pre-specified probabilities. Because
presence and detection are binary variables, the amount of signal in each model
component is associated with the spread and center of the distributions of the occupancy and
detection probabilities, respectively. Low signal levels correspond to occupancy or detection
probabilities close to 0.5; high signal levels are associated with probabilities close to 0 or 1.
Larger spreads of the distributions of the occupancy and detection probabilities reflect
greater heterogeneity among the observations collected, improving the discrimination
capability of the model, and vice versa.
Therefore, for the presence component, the parameter values of the true model
were chosen to set the median of the occupancy probabilities equal to 0.5. The chosen
parameter values also fix quantiles 10 and 90 symmetrically about 0.5, at small (Q^z_{10} =
0.3, Q^z_{90} = 0.7), intermediate (Q^z_{10} = 0.2, Q^z_{90} = 0.8), and large (Q^z_{10} = 0.1, Q^z_{90} = 0.9)
distances. For the detection component, the model parameters are obtained to reflect
detection probabilities concentrated about low values (Q^y_{50} = 0.2), intermediate values
(Q^y_{50} = 0.5), and high values (Q^y_{50} = 0.8), while keeping quantiles 10 and 90 fixed at 0.1
and 0.9, respectively.
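This calibration can be made concrete with a small computation (ours, under the simplifying assumption that the linear predictor x'α varies across sites approximately as N(m, τ²)): the q-quantile of the occupancy probability Φ(x'α) is then Φ(m + τ Φ⁻¹(q)), so m and τ can be solved from the two target quantiles.

```python
from math import erf, sqrt

def Phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def Phi_inv(p):
    """Inverse standard normal CDF by bisection (dependency-free)."""
    lo, hi = -10.0, 10.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if Phi(mid) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def calibrate(q_med, q_hi, level=0.9):
    """Choose m, tau so the median of Phi(x'alpha) equals q_med and its
    `level`-quantile equals q_hi, assuming x'alpha ~ N(m, tau^2)."""
    m = Phi_inv(q_med)
    tau = (Phi_inv(q_hi) - m) / Phi_inv(level)
    return m, tau

# Small-signal presence scenario: median 0.5, Q90 = 0.7 (so Q10 = 0.3)
m, tau = calibrate(0.5, 0.7)
```

Larger target spreads (Q90 = 0.8 or 0.9) simply yield a larger τ, i.e., more heterogeneous occupancy probabilities across sites.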
Table 3-1. Simulation control parameters, occupancy model selector

Parameter                            Values considered
N                                    50, 100
J                                    3, 5
(Q^z_{10}, Q^z_{50}, Q^z_{90})       (0.3, 0.5, 0.7), (0.2, 0.5, 0.8), (0.1, 0.5, 0.9)
(Q^y_{10}, Q^y_{50}, Q^y_{90})       (0.1, 0.2, 0.9), (0.1, 0.5, 0.9), (0.1, 0.8, 0.9)
There are in total 36 scenarios; these result from crossing all the levels of the
simulation control parameters (Table 3-1). Under each of these scenarios, 20 data sets
were generated at random. True presence and detection indicators were generated
with the probit model formulation from Chapter 2, with the assumed true models
M_{Tz} = \{1, x_2, x_{15}, x_{16}, x_{22}, x_{28}\} for the presence and M_{Ty} = \{1, q_7, q_{10}, q_{12}, q_{17}\} for
the detection, with the predictors included in the randomly generated data sets. In this
context, 1 represents the intercept term. Throughout the Section, we refer to predictors
included in the true models as true predictors and to those absent as false predictors.
The selection procedure was conducted using each one of these data sets with
two different priors on the model space: the uniform, or equal probability, prior and a
multiplicity-correcting prior.
The results are summarized through the marginal posterior inclusion probabilities
(MPIPs) for each predictor and also the five highest posterior probability models (HPM).
The MPIP for a given predictor, under a specific scenario and for a particular data set, is
defined as

p(\text{predictor is included} \mid y, z, w, v) = \sum_{M \in \mathcal{M}} I_{(\text{predictor} \in M)}\, p(M \mid y, z, w, v, \mathcal{M}).   (3–22)
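In practice, given the chain of models visited by the stochastic search, the MPIP in (3–22) is estimated by the inclusion frequency of each predictor across iterations. A toy sketch (ours, with a made-up five-iteration chain):

```python
import numpy as np

def mpip(visited_models, K):
    """Estimate marginal posterior inclusion probabilities (Eq. 3-22)
    from the sequence of models visited by the stochastic search.
    visited_models: one set of predictor indices per MCMC iteration."""
    counts = np.zeros(K)
    for model in visited_models:
        counts[list(model)] += 1
    return counts / len(visited_models)

def min_odds_mpip(p, true_predictors):
    """Minimum MPIP odds between true and false predictors (Eq. 3-23)."""
    is_true = np.zeros(len(p), dtype=bool)
    is_true[list(true_predictors)] = True
    return p[is_true].min() / p[~is_true].max()

# Toy chain over K = 4 candidate predictors, where 0 and 1 are "true":
chain = [{0, 1}, {0, 1, 2}, {0, 1}, {1}, {0, 1, 3}]
p = mpip(chain, K=4)
odds = min_odds_mpip(p, {0, 1})
```

Values of the minimum odds well above one indicate that even the weakest true predictor dominates the strongest false one.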
In addition, we compare the MPIP odds between predictors present in the true model
and predictors absent from it. Specifically, we consider the minimum odds of marginal
posterior inclusion probabilities between the two groups of predictors. Let \tilde{\xi} and \xi denote, respectively, a
predictor in the true model M_T and a predictor absent from M_T. We define the minimum
MPIP odds between the probabilities of true and false predictors as

\mathrm{minOdds}^{MPIP} = \frac{\min_{\tilde{\xi} \in M_T} p(I_{\tilde{\xi}} = 1 \mid \tilde{\xi} \in M_T)}{\max_{\xi \notin M_T} p(I_{\xi} = 1 \mid \xi \notin M_T)}.   (3–23)

If the variable selection procedure adequately discriminates between true and false predictors,
minOdds^{MPIP} will take values larger than one. The ability of the method to discriminate
between the least probable true predictor and the most probable false predictor worsens
as the indicator approaches 0.

3.5.1 Marginal Posterior Inclusion Probabilities for Model Predictors
For clarity, in Figures 3-1 through 3-5 only predictors in the true models are labeled,
and they are emphasized with a dotted line passing through them. The left-hand-side plots
in these figures contain the results for the presence component, and the ones on the
right correspond to predictors in the detection component. The results obtained with
the uniform model prior correspond to the black lines, and those for the multiplicity
correcting prior are in red. In these figures, the MPIPs have been averaged over all
datasets from the scenarios matching the condition observed.
In Figure 3-1 we contrast the mean MPIPs of the predictors over all datasets from
scenarios with 50 sites to the mean MPIPs obtained for the scenarios with 100 sites.
Similarly, Figure 3-2 compares the mean MPIPs of scenarios where 3 surveys are
performed to those of scenarios having 5 surveys per site. Figures 3-4 and 3-5 show the
effect of the different levels of signal considered in the occupancy probabilities and in the
detection probabilities.

From these figures, three main results can be drawn: (1) the effect of the model
prior is substantial; (2) the proposed methods yield MPIPs that clearly separate
true predictors from false predictors; and (3) the separation between the MPIPs of true
and false predictors is noticeably larger in the detection component.
Regardless of the simulation scenario and model component observed, under the
uniform prior false predictors obtain a relatively high MPIP. Conversely, the multiplicity
correction prior strongly shrinks the MPIPs of false predictors towards 0. In the presence
component, the MPIPs of the true predictors are also shrunk substantially under the multiplicity
prior; however, there remains a clear separation between true and false predictors. In
contrast, in the detection component the MPIPs of true predictors remain relatively high
(Figures 3-1 through 3-5).
Figure 3-1. Predictor MPIP averaged over scenarios with N=50 and N=100 sites, using
uniform (Unif) and multiplicity correction (MC) priors. [Plot: marginal inclusion probability
(0.0 to 1.0) for presence predictors x2, x15, x22, x28 and detection predictors q7, q10, q17;
one line per prior and N level.]
Figure 3-2. Predictor MPIP averaged over scenarios with J=3 and J=5 surveys per site,
using uniform (Unif) and multiplicity correction (MC) priors. [Plot: marginal inclusion
probability (0.0 to 1.0) for presence predictors x2, x15, x22, x28 and detection predictors
q7, q10, q17; one line per prior and J level.]
Figure 3-3. Predictor MPIP averaged over scenarios with the interaction between the
number of sites and the number of surveys per site, using uniform (Unif) and multiplicity
correction (MC) priors. [Plot: marginal inclusion probability (0.0 to 1.0) for presence
predictors x2, x15, x22, x28 and detection predictors q7, q10, q17; one line per combination
of prior, N ∈ {50, 100}, and J ∈ {3, 5}.]
Figure 3-4. Predictor MPIP averaged over scenarios with equal signal in the occupancy
probabilities, using uniform (U) and multiplicity correction (MC) priors. [Plot: marginal
inclusion probability (0.0 to 1.0) for presence predictors x2, x15, x22, x28 and detection
predictors q7, q10, q17; one line per prior and occupancy quantile triple (0.3, 0.5, 0.7),
(0.2, 0.5, 0.8), (0.1, 0.5, 0.9).]
Figure 3-5. Predictor MPIP averaged over scenarios with equal signal in the detection
probabilities, using uniform (U) and multiplicity correction (MC) priors. [Plot: marginal
inclusion probability (0.0 to 1.0) for presence predictors x2, x15, x22, x28 and detection
predictors q7, q10, q17; one line per prior and detection quantile triple (0.1, 0.2, 0.9),
(0.1, 0.5, 0.9), (0.1, 0.8, 0.9).]
In scenarios where more sites were surveyed, the separation between the MPIPs of
true and false predictors grew in both model components (Figure 3-1). Increasing the
number of sites has an effect on both components given that, every time a new site is
included, covariate information is added to the design matrices of both the presence and
the detection components.
On the other hand, increasing the number of surveys affects the MPIPs of predictors in the
detection component (Figures 3-2 and 3-3), but has only a marginal effect on predictors
of the presence component. This may appear to be counterintuitive; however, increasing
the number of surveys only increases the number of observations in the design matrix
for the detection while leaving the design matrix for the presence unaltered. The small
changes observed in the MPIPs of the presence predictors as J increases are exclusively
a result of having additional detection indicators equal to 1 at sites that, with fewer
surveys, would only have 0-valued detections.
From Figure 3-3 it is clear that, for the presence component, the effect of the number
of sites dominates the behavior of the MPIPs, especially when using the multiplicity
correction priors. In the detection component, the MPIPs are influenced by both the number
of sites and the number of surveys. The influence of increasing the number of surveys is
larger when considering a smaller number of sites, and vice versa.
Regarding the effect of the distribution of the occupancy probabilities, we observe
that mostly the detection component is affected. There is stronger discrimination
between true and false predictors as the distribution has higher variability (Figure
3-4). This is consistent with intuition, since having the presence probabilities more
concentrated about 0.5 implies that the predictors do not vary much from one site to
the next, whereas having the occupancy probabilities more spread out has the
opposite effect.
Finally, consider concentrating the detection probabilities about high or low values. For
predictors in the detection component, the separation between the MPIPs of true and false
predictors is larger in scenarios where the distribution of the detection probability
is centered about 0.2 or 0.8, when compared to those scenarios where this distribution
is centered about 0.5 (where the signal of the predictors is weakest). For predictors in
the presence component, having the detection probabilities centered at higher values
slightly increases the inclusion probabilities of the true predictors (Figure 3-5) and
reduces those of false predictors.
Table 3-2. Comparison of average minOdds_MPIP under scenarios having different
numbers of sites (N=50, N=100) and different numbers of surveys per site (J=3, J=5),
for the presence and detection components, using uniform and multiplicity correction priors.

                     Sites             Surveys
Comp.       π(M)    N=50    N=100     J=3     J=5
Presence    Unif    1.12    1.31      1.19    1.24
            MC      3.20    8.46      4.20    6.74
Detection   Unif    2.03    2.64      2.11    2.57
            MC      21.15   32.46     21.39   32.52
Table 3-3. Comparison of average minOdds_MPIP for different levels of signal considered
in the occupancy and detection probabilities, for the presence and detection
components, using uniform and multiplicity correction priors.

                         (Q10^z, Q50^z, Q90^z)                      (Q10^y, Q50^y, Q90^y)
Comp.      π(M)   (0.3,0.5,0.7) (0.2,0.5,0.8) (0.1,0.5,0.9)   (0.1,0.2,0.9) (0.1,0.5,0.9) (0.1,0.8,0.9)
Presence   Unif        1.05          1.20          1.34            1.10          1.23          1.24
           MC          2.02          4.55          8.05            2.38          6.19          6.40
Detection  Unif        2.34          2.34          2.30            2.57          2.00          2.38
           MC         25.37         20.77         25.28           29.33         18.52         28.49
The separation between the MPIPs of true and false predictors is even more
evident in Tables 3-2 and 3-3, where the minimum MPIP odds between true and
false predictors are shown. Under every scenario, the value of minOdds_MPIP (as
defined in 3–23) was greater than 1, implying that on average even the lowest MPIP
for a true predictor is higher than the maximum MPIP for a false predictor. In both
components of the model, the minOdds_MPIP are markedly larger under the multiplicity
correction prior, and they increase with the number of sites and with the number of surveys.

For the presence component, increasing the signal in the occupancy probabilities,
or having the detection probabilities concentrate about higher values, has a positive and
considerable effect on the magnitude of the odds. For the detection component, these
odds are particularly high, especially under the multiplicity correction prior. Also, having
the distribution of the detection probabilities centered about low or high values increases
the minOdds_MPIP.

3.5.2 Summary Statistics for the Highest Posterior Probability Model
Tables 3-4 through 3-7 show the number of true predictors that are included in
the HPM (True +) and the number of false predictors excluded from it (True −).
The mean percentages observed in these tables convey one clear message: the
highest probability models chosen with either model prior commonly differ from the
corresponding true models. The multiplicity correction prior's strong shrinkage only
allows a few true predictors to be selected, but at the same time it prevents any false
predictors from being included in the HPM. On the other hand, the uniform prior includes
in the HPM a larger proportion of true predictors, but at the expense of also introducing
a large number of false predictors. This situation is exacerbated in the presence
component, but also occurs to a lesser extent in the detection component.
Table 3-4. Comparison between scenarios with 50 and 100 sites, in terms of the average
percentage of true positive and true negative terms over the highest probability models,
for the presence and detection components, using uniform and multiplicity correcting
priors on the model space.

                       True +              True −
Comp.       π(M)    N=50    N=100     N=50    N=100
Presence    Unif    0.57    0.63      0.51    0.55
            MC      0.06    0.13      1.00    1.00
Detection   Unif    0.77    0.85      0.87    0.93
            MC      0.49    0.70      1.00    1.00
Having more sites or surveys improves the inclusion of true predictors and the exclusion
of false ones in the HPM, for both the presence and detection components (Tables 3-4
and 3-5). On the other hand, if the distribution of the occupancy probabilities is more
Table 3-5. Comparison between scenarios with 3 and 5 surveys per site, in terms of the
percentage of true positive and true negative predictors averaged over the highest
probability models, for the presence and detection components, using uniform and
multiplicity correcting priors on the model space.

                       True +            True −
Comp.       π(M)    J=3     J=5      J=3     J=5
Presence    Unif    0.59    0.61     0.52    0.54
            MC      0.08    0.10     1.00    1.00
Detection   Unif    0.78    0.85     0.87    0.92
            MC      0.50    0.68     1.00    1.00
spread out, the HPM includes more true predictors and fewer false ones in the presence
component. In contrast, the effect of the spread of the occupancy probabilities on the
detection HPM is negligible (Table 3-6). Finally, there is a positive relationship between
the location of the median of the detection probabilities and the number of correctly
classified true and false predictors for the presence. The HPM in the detection part of
the model responds positively to low and high values of the median detection probability
(increased signal levels) in terms of correctly classified true and false predictors (Table
3-7).
Table 3-6. Comparison between scenarios with different levels of signal in the occupancy
component, in terms of the percentage of true positive and true negative predictors
averaged over the highest probability models, for the presence and detection
components, using uniform and multiplicity correcting priors on the model space.

                              True +                                    True −
Comp.      π(M)   (0.3,0.5,0.7) (0.2,0.5,0.8) (0.1,0.5,0.9)   (0.3,0.5,0.7) (0.2,0.5,0.8) (0.1,0.5,0.9)
Presence   Unif        0.55          0.61          0.64            0.50          0.54          0.55
           MC          0.02          0.08          0.18            1.00          1.00          1.00
Detection  Unif        0.81          0.82          0.81            0.90          0.89          0.89
           MC          0.57          0.61          0.59            1.00          1.00          1.00
3.6 Case Study: Blue Hawker Data Analysis

During 1999 and 2000, an intensive volunteer surveying effort coordinated by the
Centre Suisse de Cartographie de la Faune (CSCF) was conducted in order to analyze
the distribution of the blue hawker, Aeshna cyanea (Odonata: Aeshnidae), a common
dragonfly in Switzerland. Given that Switzerland is a small and mountainous country,
Table 3-7. Comparison between scenarios with different levels of signal in the detection
component, in terms of the percentage of true positive and true negative predictors
averaged over the highest probability models, for the presence and detection
components, using uniform and multiplicity correcting priors on the model space.

                              True +                                    True −
Comp.      π(M)   (0.1,0.2,0.9) (0.1,0.5,0.9) (0.1,0.8,0.9)   (0.1,0.2,0.9) (0.1,0.5,0.9) (0.1,0.8,0.9)
Presence   Unif        0.59          0.59          0.62            0.51          0.54          0.54
           MC          0.06          0.10          0.11            1.00          1.00          1.00
Detection  Unif        0.89          0.77          0.78            0.91          0.87          0.91
           MC          0.70          0.48          0.59            1.00          1.00          1.00
there is large variation in its topography and physio-geography; as such, elevation is a
good candidate covariate to predict species occurrence at a large spatial scale. It can
be used as a proxy for habitat type, intensity of land use, temperature, as well as some
biotic factors (Kery et al. 2010).
Repeated visits to 1-ha pixels took place to obtain the corresponding detection
histories. In addition to the survey outcome, the x- and y-coordinates, the thermal level, the
date of the survey, and the elevation were recorded. Surveys were restricted to the
known flight period of the blue hawker, which takes place between May 1 and October
10. In total, 2,572 sites were surveyed at least once during the surveying period. The
number of surveys per site ranges from 1 to 22 within each survey year.
Kery et al. (2010) summarize the results of this effort using AIC-based model
comparisons: first, following a backwards elimination approach for the detection
process while keeping the occupancy component fixed at the most complex model, and
then, for the presence component, choosing among a group of three models while using
the detection model already chosen. In our analysis of this dataset, for the detection and the
presence we consider as the full models those used in Kery et al. (2010), namely

Φ^{-1}(ψ) = α0 + α1 year + α2 elev + α3 elev^2 + α4 elev^3

Φ^{-1}(p) = λ0 + λ1 year + λ2 elev + λ3 elev^2 + λ4 elev^3 + λ5 date + λ6 date^2
where year = I(year = 2000).
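Under the probit formulation (as in Chapter 2), the occupancy and detection probabilities are obtained by pushing these linear predictors through the standard normal CDF. A sketch with purely illustrative coefficients (not estimates from the data):

```python
from math import erf, sqrt

def probit_inv(eta):
    """Standard normal CDF: maps a linear predictor to a probability."""
    return 0.5 * (1.0 + erf(eta / sqrt(2.0)))

def psi(elev, year, alpha):
    """Occupancy probability under the full presence model (coefficients alpha)."""
    eta = (alpha[0] + alpha[1] * year + alpha[2] * elev
           + alpha[3] * elev**2 + alpha[4] * elev**3)
    return probit_inv(eta)

def p_det(elev, date, year, lam):
    """Detection probability under the full detection model (coefficients lam)."""
    eta = (lam[0] + lam[1] * year + lam[2] * elev + lam[3] * elev**2
           + lam[4] * elev**3 + lam[5] * date + lam[6] * date**2)
    return probit_inv(eta)

# Illustrative (made-up) coefficients; a zero linear predictor maps to 0.5
print(probit_inv(0.0))  # 0.5
print(0.0 < psi(0.5, 1, [0.2, 0.1, 0.4, -0.3, 0.05]) < 1.0)  # True
```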
The model spaces for this dataset contain 2^6 = 64 and 2^4 = 16 models, respectively,
for the detection and occupancy components. That is, in total the model space contains
2^{4+6} = 1,024 models. Although this model space can be enumerated entirely, for
illustration we implemented the algorithm from Section 3.3.4, generating 10,000 draws
from the Gibbs sampler. Each of the models sampled was chosen from the set of
models that can be reached by changing the state of a single term in the current model
(to inclusion or exclusion, accordingly). This allows a more thorough exploration of the
model space because, for each of the 10,000 models drawn, the posterior probabilities
of many more models can be observed. Below, the labels for the predictors are followed
by either "z" or "y", accordingly, to represent the component they pertain to. Finally,
using the results from the model selection procedure, we conducted a validation step to
determine the predictive accuracy of the HPMs and of the median probability models
(MPMs). The performance of these models is then contrasted with that of the model
ultimately selected by Kery et al. (2010).

3.6.1 Results: Variable Selection Procedure
The model finally chosen for the presence component in Kery et al. (2010) was not
found among the five highest probability models under either model prior (Table 3-8). Moreover,
the year indicator was never chosen under the multiplicity correcting prior, hinting that
this term might correspond to a falsely identified predictor under the uniform prior.
Results in Table 3-10 support this claim: the marginal posterior inclusion probability of
the year predictor is 7% under the multiplicity correction prior. The multiplicity correction
prior also concentrates the model posterior probability mass more densely in the highest
ranked models (90% of the mass is in the top five models) than the uniform prior (under
which the top five models account for 40% of the mass).
For the detection component, the HPM under both priors is the intercept-only model,
which we represent in Table 3-9 with a blank label. In both cases this model obtains very
Table 3-8. Posterior probability for the five highest probability models in the presence
component of the blue hawker data.

Uniform model prior:
Rank   Mz selected            p(Mz|y)
1      yrz+elevz              0.10
2      yrz+elevz+elevz3       0.08
3      elevz2+elevz3          0.08
4      yrz+elevz2             0.07
5      yrz+elevz3             0.07

Multiplicity correcting model prior:
Rank   Mz selected            p(Mz|y)
1      elevz+elevz3           0.53
2                             0.15
3      elevz+elevz2           0.09
4      elevz2                 0.06
5      elevz+elevz2+elevz3    0.05
high posterior probabilities. The terms contained in the cubic polynomial for the elevation
appear to contain some relevant information; however, this conflicts with the MPIPs
observed in Table 3-11, which under both model priors are relatively low (< 20% with the
uniform prior and ≤ 4% with the multiplicity correcting prior).
Table 3-9. Posterior probability for the five highest probability models in the detection
component of the blue hawker data.

Uniform model prior:
Rank   My selected   p(My|y)
1                    0.45
2      elevy3        0.06
3      elevy2        0.05
4      elevy         0.05
5      yry           0.04

Multiplicity correcting model prior:
Rank   My selected   p(My|y)
1                    0.86
2      elevy3        0.02
3      datey2        0.02
4      elevy2        0.02
5      yry           0.02
Finally, it is possible to use the MPIPs to obtain the median probability model, which
contains the terms that have a MPIP higher than 50%. For the occupancy process
(Table 3-10), under the uniform prior the year, the elevation, and the
elevation cubed are included. The MPM with the multiplicity correction prior coincides with
the HPM from this prior. The MPM chosen for the detection component (Table 3-11)
under both priors is the intercept-only model, coinciding again with the HPM.
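The MPM construction is a simple threshold on the MPIPs; a sketch with illustrative values loosely patterned after the presence component (a borderline entry is nudged just above the cutoff for the example):

```python
def median_probability_model(mpips, threshold=0.5):
    # Keep every term whose marginal posterior inclusion probability exceeds 1/2
    return {term for term, p in mpips.items() if p > threshold}

# Illustrative MPIPs (not exact values from Table 3-10)
mpips = {"yrz": 0.53, "elevz": 0.51, "elevz2": 0.45, "elevz3": 0.502}
print(sorted(median_probability_model(mpips)))  # ['elevz', 'elevz3', 'yrz']
```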
Given the outcomes of the simulation studies from Section 3.5, especially those
pertaining to the detection component, the results in Table 3-11 appear to indicate that
none of the predictors considered belong to the true model, especially when considering
Table 3-10. MPIP, presence component: p(predictor ∈ MTz | y, z, w, v).

Predictor   Unif    MultCorr
yrz         0.53    0.07
elevz       0.51    0.73
elevz2      0.45    0.23
elevz3      0.50    0.67

Table 3-11. MPIP, detection component: p(predictor ∈ MTy | y, z, w, v).

Predictor   Unif    MultCorr
yry         0.19    0.03
elevy       0.18    0.03
elevy2      0.18    0.03
elevy3      0.19    0.04
datey       0.16    0.03
datey2      0.15    0.04
those derived with the multiplicity correction prior. On the other hand, for the presence
component (Table 3-10), there is an indication that terms related to the cubic polynomial
in elevz can explain the occupancy patterns.

3.6.2 Validation for the Selection Procedure
Approximately half of the sites were selected at random for training (i.e., for model
selection and parameter estimation), and the remaining half were used as test data. In
the previous section we observed that, using the marginal posterior inclusion probabilities
of the predictors, our method effectively separates predictors in the true model from
those that are not in it. However, in Tables 3-10 and 3-11 this separation is only clear for
the presence component using the multiplicity correction prior.
Therefore, in the validation procedure we observe the misclassification rates for the
detections using the following models: (1) the model ultimately recommended in Kery
et al. (2010) (yrz+elevz+elevz2+elevz3 + elevy+elevy2+datey+datey2); (2) the
highest probability model (HPM) with a uniform prior (yrz+elevz); (3) the HPM with a
multiplicity correcting prior (elevz+elevz3); (4) the median probability model (MPM),
that is, the model including only predictors with a MPIP larger than 50%, with the uniform
prior (yrz+elevz+elevz3); and finally (5) the MPM with a multiplicity correction prior
(elevz+elevz3, the same as the HPM with multiplicity correction).
We must emphasize that the models resulting from the implementation of our model
selection procedure used exclusively the training dataset. On the other hand, the model
in Kery et al. (2010) was chosen to minimize the prediction error of the complete data.
Because this model was obtained from the full dataset, results derived from it can only
be considered a lower bound for the prediction errors.
The benchmark misclassification error rate for true 1's is high (close to 70%).
However, the misclassification rate for true 0's, which account for most of the
responses, is less pronounced (15%). Overall, the performance of the selected models
is comparable: they yield considerably worse results than the benchmark for the true
1's, but achieve rates close to the benchmark for the true zeros. Pooling together
the results for true ones and true zeros, the selected models with either prior have
misclassification rates close to 30%. The benchmark model performs comparably, with a
joint misclassification error of 23% (Table 3-12).
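The three error rates reported in Table 3-12 can be computed as below; the toy detection vectors are invented purely for illustration:

```python
def misclassification_rates(y_true, y_pred):
    """Error rate among true 1s, among true 0s, and pooled (joint)."""
    pairs = list(zip(y_true, y_pred))
    ones = [(t, p) for t, p in pairs if t == 1]
    zeros = [(t, p) for t, p in pairs if t == 0]
    rate = lambda ps: sum(t != p for t, p in ps) / len(ps)
    return rate(ones), rate(zeros), rate(pairs)

# Invented site-level detections and one model's predictions
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 1, 0]
r1, r0, joint = misclassification_rates(y_true, y_pred)
print(round(r1, 2), round(r0, 2), round(joint, 2))  # 0.67 0.2 0.38
```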
Table 3-12. Mean misclassification rates for HPMs and MPMs, using uniform and
multiplicity correction model priors.

Model                                                      True 1   True 0   Joint
Benchmark (Kery et al. 2010):
  yrz+elevz+elevz2+elevz3 + elevy+elevy2+datey+datey2      0.66     0.15     0.23
HPM Unif: yrz+elevz                                        0.83     0.17     0.28
HPM/MPM MC: elevz+elevz3                                   0.82     0.18     0.28
MPM Unif: yrz+elevz+elevz3                                 0.82     0.18     0.29
3.7 Discussion

In this Chapter we proposed an objective and fully automatic Bayesian methodology for
the single-season site-occupancy model. The methodology is said to be fully automatic
because no hyper-parameter specification is necessary in defining the parameter priors,
and objective because it relies on the intrinsic priors derived from noninformative priors.
The intrinsic priors have been shown to have desirable properties as testing priors. We
also propose a fast stochastic search algorithm to explore large model spaces using our
model selection procedure.

Our simulation experiments demonstrated the ability of the method to single out the
predictors present in the true model when considering the marginal posterior inclusion
probabilities of the predictors. For predictors in the true model, these probabilities
were comparatively larger than those for predictors absent from it. Also, the simulations
indicated that the method has a greater discrimination capability for predictors in the
detection component of the model, especially when using multiplicity correction priors.
Multiplicity correction priors were not described in detail in this Chapter; however, their
influence on the selection outcome is significant. This behavior was observed in the
simulation experiment and in the analysis of the blue hawker data. Model priors play an
essential role: as the number of predictors grows, they are instrumental in controlling
the selection of false positive predictors. Additionally, model priors can be used to
account for predictor structure in the selection process, which helps both to reduce the
size of the model space and to make the selection more robust. These issues are the
topic of the next Chapter.
Accounting for the polynomial hierarchy in the predictors within the occupancy
context is a straightforward extension of the procedures we describe in Chapter 4;
hence, our next step is to develop efficient software for it. An additional direction we
plan to pursue is developing methods for occupancy variable selection in a multivariate
setting. This can be used to conduct hypothesis testing in scenarios with varying
conditions through time, or in the case where multiple species are co-observed. A
final variation we will investigate for this problem is that of occupancy model selection
incorporating random effects.
CHAPTER 4
PRIORS ON THE MODEL SPACE AND WELL-FORMULATED MODELS

It has long been an axiom of mine that the little things are infinitely the most important.

–Sherlock Holmes, A Case of Identity
4.1 Introduction

In regression problems, if a large number of potential predictors is available, the
complete model space is too large to enumerate, and automatic selection algorithms are
necessary to find informative, parsimonious models. This multiple testing problem
is difficult, and even more so when interactions or powers of the predictors are
considered. In the ecological literature, models with interactions and/or higher order
polynomial terms are ubiquitous (Johnson et al. 2013; Kery et al. 2010; Zeller et al.
2011), given the complexity and non-linearities found in ecological processes. Several
model selection procedures, even in the classical normal linear setting, fail to address
two fundamental issues: (1) the model selection outcome is not invariant to affine
transformations when interactions or polynomial structures are found among the
predictors, and (2) additional penalization is required to control for false positives as the
model space grows (i.e., as more covariates are considered).
These two issues motivate the developments presented throughout this Chapter.
Building on the results of Chipman (1996), we propose, investigate, and provide
recommendations for three different prior distributions on the model space. These
priors help control for test multiplicity while accounting for polynomial structure in the
predictors. They improve upon those proposed by Chipman, first by avoiding the need
for specific values for the prior inclusion probabilities of the predictors, and second
by formulating principled alternatives to introduce additional structure in the model
priors. Finally, we design a stochastic search algorithm that allows fast and thorough
exploration of model spaces with polynomial structure.
Having structure in the predictors can determine the selection outcome. As an
illustration, consider the model E[y] = β00 + β01 x_2 + β20 x_1^2, where the order-one
term x_1 is not present (this choice of subscripts for the coefficients is defined in the
following section). Transforming x_1 ↦ x_1* = x_1 + c for some c ≠ 0, the model
becomes E[y] = β00 + β01 x_2 + β20* x_1*^2. Note that, in terms of the original predictors,
x_1*^2 = x_1^2 + 2c x_1 + c^2, implying that this seemingly innocuous transformation of x_1
modifies the column space of the design matrix by including x_1, which was not in the
original model. That is, when lower order terms in the hierarchy are omitted from the
model, the column space of the design matrix is not invariant to affine transformations.
As the hat matrix depends on the column space, the model's predictive capability is also
affected by how the covariates in the model are coded, an undesirable feature for any
model selection procedure. To make model selection invariant to affine transformations,
the selection must be constrained to the subset of models that respect the hierarchy
(Griepentrog et al. 1982; Khuri 2002; McCullagh & Nelder 1989; Nelder 2000;
Peixoto 1987, 1990). These models are known as well-formulated models (WFMs).
Succinctly, a model is well-formulated if, for any predictor in the model, every lower order
predictor associated with it is also in the model. The model above is not well-formulated,
as it contains x_1^2 but not x_1.
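This loss of invariance is easy to verify numerically. In the sketch below (synthetic data, least squares), the fitted values of the non-well-formulated model {1, x_2, x_1^2} change when x_1 is shifted, while the well-formulated model {1, x_2, x_1, x_1^2} is unaffected because its column space absorbs the shift:

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
x2 = rng.normal(size=50)
y = 1.0 + 2.0 * x2 + 0.5 * x1**2 + rng.normal(scale=0.1, size=50)

def fitted(cols):
    """Least-squares fitted values for a design matrix built from `cols`."""
    Z = np.column_stack(cols)
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return Z @ beta

ones = np.ones_like(x1)
c = 3.0  # affine shift applied to x1

# Non-well-formulated model: x1^2 included without x1
f_raw = fitted([ones, x2, x1**2])
f_shift = fitted([ones, x2, (x1 + c)**2])
print(np.allclose(f_raw, f_shift))   # False: the fit changes under the shift

# Well-formulated model: x1 restored to the hierarchy
g_raw = fitted([ones, x2, x1, x1**2])
g_shift = fitted([ones, x2, x1 + c, (x1 + c)**2])
print(np.allclose(g_raw, g_shift))   # True: the column space is unchanged
```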
WFMs exhibit strong heredity, in that all lower order terms dividing higher order
terms in the model must also be included. An alternative is to only require weak heredity
(Chipman 1996), which forces only some of the lower terms in the corresponding
polynomial hierarchy to be in the model. However, Nelder (1998) demonstrated that the
conditions under which weak heredity allows the design matrix to be invariant to affine
transformations of the predictors are too restrictive to be useful in practice.
Although this topic appeared in the literature more than three decades ago (Nelder
1977), only recently have modern variable selection techniques been adapted to
account for the constraints imposed by heredity. As described in Bien et al. (2013),
the current literature on variable selection for polynomial response surface models
can be classified into three broad groups: multi-step procedures (Brusco et al. 2009;
Peixoto 1987), regularized regression methods (Bien et al. 2013; Yuan et al. 2009),
and Bayesian approaches (Chipman 1996). The methods introduced in this Chapter
take a Bayesian approach towards variable selection for well-formulated models, with
particular emphasis on model priors.
As mentioned in previous chapters, the Bayesian variable selection problem
consists of finding models with high posterior probabilities within a pre-specified model
space ℳ. The model posterior probability for M ∈ ℳ is given by

p(M | y, ℳ) ∝ m(y | M) π(M | ℳ).   (4–1)
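Eq. 4–1 only specifies the posterior up to proportionality; over an enumerable model space, normalization is direct. A toy sketch with invented marginal likelihoods and a uniform model prior:

```python
def model_posteriors(marginals, priors):
    # Eq. 4-1: normalize m(y|M) * pi(M) over the model space
    unnorm = {M: marginals[M] * priors[M] for M in marginals}
    total = sum(unnorm.values())
    return {M: w / total for M, w in unnorm.items()}

# Toy three-model space with illustrative (made-up) marginal likelihoods
marginals = {"M0": 1.0, "M1": 4.0, "M2": 5.0}
uniform = {M: 1 / 3 for M in marginals}
post = model_posteriors(marginals, uniform)
print(round(post["M2"], 2))  # 0.5
```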
Model posterior probabilities depend on the prior distribution on the model space,
as well as on the prior distributions for the model-specific parameters, implicitly through
the marginals m(y|M). Priors on the model-specific parameters have been extensively
discussed in the literature (Berger & Pericchi 1996; Berger et al. 2001; George 2000;
Jeffreys 1961; Kass & Wasserman 1996; Liang et al. 2008; Zellner & Siow 1980). In
contrast, the effect of the prior on the model space has, until recently, been neglected.
A few authors (e.g., Casella et al. (2014), Scott & Berger (2010), Wilson et al. (2010))
have highlighted the relevance of the priors on the model space in the context of multiple
testing. Adequately formulating priors on the model space can both account for structure
in the predictors and provide additional control on the detection of false positive terms.
In addition, using the popular uniform prior over the model space may lead to the
undesirable and "informative" implication of favoring models of size p/2 (where p is the
total number of covariates), since this is the most abundant model size contained in the
model space.
Variable selection within the model space of well-formulated polynomial models
poses two challenges for automatic objective model selection procedures. First, the
notion of model complexity takes on a new dimension: complexity is not exclusively
a function of the number of predictors, but also depends upon the depth and
connectedness of the associations defined by the polynomial hierarchy. Second,
because the model space is shaped by such relationships, stochastic search algorithms
used to explore the models must also conform to these restrictions.

Models without polynomial hierarchy constitute a special case of WFMs where
all predictors are of order one. Hence, all the methods developed throughout this
Chapter also apply to models with no predictor structure. Additionally, although our
proposed methods are presented for the normal linear case to simplify the exposition,
these methods are general enough to be embedded in many Bayesian selection
and averaging procedures, including, of course, the occupancy framework previously
discussed.
In this Chapter, first we provide the necessary definitions to characterize the
well-formulated model selection problem. Then we proceed to introduce three new prior
structures on the well-formulated model space and characterize their behavior with
simple examples and simulations. With the model priors in place, we build a stochastic
search algorithm to explore spaces of well-formulated models that relies on intrinsic
priors for the model-specific parameters (though this assumption can be relaxed
to use other mixtures of g-priors). Finally, we implement our procedures using both
simulated and real data.
4.2 Setup for Well-Formulated Models

Suppose that the observations yi are modeled using the polynomial regression on the
covariates x_{i1}, ..., x_{ip} given by

y_i = Σ_{α ∈ N0^p} β_{(α1,...,αp)} Π_{j=1}^{p} x_{ij}^{αj} + ε_i,   (4–2)

where α = (α1, ..., αp) belongs to N0^p, the p-dimensional space of natural numbers
including 0, with ε_i ~ iid N(0, σ²), and only finitely many βα are allowed to be non-zero.
As an illustration, consider a model space that includes polynomial terms incorporating
covariates x_{i1} and x_{i2} only. The terms x_{i2}^2 and x_{i1}^2 x_{i2} can be represented by α = (0, 2)
and α = (2, 1), respectively.
The notation y = Z(X)β + ε is used to denote that the observed response y =
(y1, ..., yn)′ is modeled via a polynomial function Z of the original covariates contained
in X = (x1, ..., xp) (where xj = (x_{1j}, ..., x_{nj})′), and that the coefficients of the polynomial
terms are given by β. A specific polynomial model M is defined by the set of coefficients
βα that are allowed to be non-zero. This definition is equivalent to characterizing M
through a collection of multi-indices α ∈ N0^p. In particular, model M is specified by
M = {α^M_1, ..., α^M_{|M|}} for α^M_k ∈ N0^p, where βα = 0 for α ∉ M.

Any particular model M uses a subset XM of the original covariates X to form the
polynomial terms in the design matrix ZM(X). Without ambiguity, a polynomial model
ZM(X) on X can be identified with a polynomial model ZM(XM) on the covariates XM.
The number of terms used by M to model the response y, denoted by |M|, corresponds
to the number of columns of ZM(XM). The coefficient vector and error variance of
model M are denoted by βM and σ²M, respectively. Thus, M models the data as
y = ZM(XM)βM + εM, where εM ~ N(0, I σ²M). Model M is said to be nested in model M′
if M ⊂ M′. M models the response of the covariates in two distinct ways: choosing the
set of meaningful covariates XM, as well as choosing the polynomial structure on these
covariates, ZM(XM).
The set N_0^p constitutes a partially ordered set or, more succinctly, a poset. A poset is a set partially ordered through a binary relation "≼". In this context, the binary relation on the poset N_0^p is defined between pairs (α, α') by α' ≼ α whenever α_j ≥ α'_j for all j = 1, …, p, with α' ≺ α if, additionally, α_j > α'_j for some j. The order of a term α ∈ N_0^p is given by the sum of its elements, order(α) = Σ_j α_j. When order(α) = order(α') + 1 and α' ≺ α, then α' is said to immediately precede α, which is denoted by α' → α. The parent set of α is defined by P(α) = {α' ∈ N_0^p : α' → α} and is given by the set of nodes that immediately precede the given node. A polynomial model M is said to be well-formulated if α ∈ M implies that P(α) ⊂ M. For example, any well-formulated model using x_{i1}^2 x_{i2} to model y_i must also include the parent terms x_{i1}x_{i2} and x_{i1}^2, their corresponding parent terms x_{i1} and x_{i2}, and the intercept term 1.
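As an illustrative sketch of these definitions (the encoding of terms as exponent tuples and the function names are ours, not the dissertation's), the parent relation and the well-formulation condition can be checked directly:

```python
def parents(alpha):
    """P(alpha): the multi-indices obtained by lowering one positive exponent by 1."""
    return [alpha[:j] + (alpha[j] - 1,) + alpha[j + 1:]
            for j in range(len(alpha)) if alpha[j] > 0]

def is_well_formulated(model):
    """True if the set of multi-indices is closed under the parent relation."""
    return all(p in model for alpha in model for p in parents(alpha))

# The term x1^2 * x2 is alpha = (2, 1); a WFM containing it must also contain
# x1*x2 = (1, 1), x1^2 = (2, 0), x1 = (1, 0), x2 = (0, 1), and 1 = (0, 0).
wfm = {(0, 0), (1, 0), (0, 1), (2, 0), (1, 1), (2, 1)}
not_wfm = {(0, 0), (2, 0)}   # 1 and x1^2 without the parent x1
```

For instance, `parents((2, 1))` returns the two immediate predecessors (1, 1) and (2, 0), matching the example in the text.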
The poset N_0^p can be represented by a directed acyclic graph (DAG). Without ambiguity, we can identify nodes in the graph, α ∈ N_0^p, with terms in the set of covariates; the graph has directed edges to a node from each of its parents. Any well-formulated model M is represented by a subgraph of the DAG with the property that if a node α belongs to the subgraph, then so do the nodes corresponding to P(α). Figure 4-1 shows examples of well-formulated polynomial models, where α ∈ N_0^p is identified with ∏_{j=1}^p x_j^{α_j}.
The motivation for considering only well-formulated polynomial models is compelling. Let Z_M be the design matrix associated with a polynomial model M. The subspace of y modeled by Z_M, given by the hat matrix H_M = Z_M(Z_M'Z_M)^{-1}Z_M', is invariant to affine transformations of the matrix X_M if and only if M corresponds to a well-formulated polynomial model (Peixoto, 1990).
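This invariance is easy to verify numerically. The following sketch (our own illustration, using NumPy) contrasts a well-formulated model with a non-well-formulated one under a covariate shift:

```python
import numpy as np

def hat(Z):
    """Hat matrix H = Z (Z'Z)^{-1} Z'."""
    return Z @ np.linalg.solve(Z.T @ Z, Z.T)

rng = np.random.default_rng(1)
x1, x2 = rng.normal(size=30), rng.normal(size=30)

# Well-formulated: {1, x1, x2, x1*x2}. Shifting the covariates leaves the
# column space, and hence the hat matrix, unchanged.
Z = np.column_stack([np.ones(30), x1, x2, x1 * x2])
Zs = np.column_stack([np.ones(30), x1 + 3, x2 - 1, (x1 + 3) * (x2 - 1)])
print(np.allclose(hat(Z), hat(Zs)))    # True

# Not well-formulated: {1, x1^2}. A shift changes the fitted subspace.
W = np.column_stack([np.ones(30), x1 ** 2])
Ws = np.column_stack([np.ones(30), (x1 + 3) ** 2])
print(np.allclose(hat(W), hat(Ws)))    # False
```

In the first case (x1 + 3)(x2 − 1) = x1x2 − x1 + 3x2 − 3 stays inside the span of {1, x1, x2, x1x2}, so the projector is unchanged; in the second, (x1 + 3)² = x1² + 6x1 + 9 leaves the span of {1, x1²}.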
Figure 4-1. Graphs of well-formulated polynomial models for p = 2 (panels A and B).
For example, if p = 2 and y_i = β_{(0,0)} + β_{(1,0)}x_{i1} + β_{(0,1)}x_{i2} + β_{(1,1)}x_{i1}x_{i2} + ε_i, then the hat matrix is invariant to any covariate transformation of the form A(x_{i1}, x_{i2})' + b, for any real-valued positive definite 2 × 2 matrix A and any real-valued vector b of dimension two. In contrast, if y_i = β_{(0,0)} + β_{(2,0)}x_{i1}^2 + ε_i, then the hat matrix formed after applying the transformation x_{i1} ↦ x_{i1} + c for real c ≠ 0 is not the same as the hat matrix formed from the original x_{i1}.
4.2.1 Well-Formulated Model Spaces
The spaces of WFMs M considered in this chapter can be characterized in terms of two WFMs: M_B, the base model, and M_F, the full model. The base model contains at least the intercept term and is nested in the full model. The model space M is populated by all well-formulated models M that nest M_B and are nested in M_F:

    M = {M : M_B ⊆ M ⊆ M_F and M is well-formulated}.

For M to be well-formulated, the entire ancestry of each node in M must also be included in M. Because of this, M ∈ M can be uniquely identified by two different sets of nodes in M_F: the set of extreme nodes and the set of children nodes. For M ∈ M,
the sets of extreme and children nodes, respectively denoted by E(M) and C(M), are defined by

    E(M) = {α ∈ M \ M_B : α ∉ P(α') for all α' ∈ M},
    C(M) = {α ∈ M_F \ M : M ∪ {α} is well-formulated}.
The extreme nodes are those nodes that, when removed from M, give rise to a WFM in M. The children nodes are those nodes that, when added to M, give rise to a WFM in M. Because M_B ⊆ M for all M ∈ M, the set of nodes E(M) ∪ M_B determines M: beginning with this set, iteratively add parent nodes. Similarly, the nodes in C(M) determine the set {α' ∈ P(α) : α ∈ C(M)} ∪ {α' ∈ E(M_F) : α ⋠ α' for all α ∈ C(M)}, which contains E(M) ∪ M_B and thus uniquely identifies M.
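Under the same multi-index encoding used earlier (our illustration, not the dissertation's code), the extreme and children sets can be computed by direct translation of their definitions:

```python
def parents(alpha):
    """P(alpha): lower one positive exponent by 1."""
    return [alpha[:j] + (alpha[j] - 1,) + alpha[j + 1:]
            for j in range(len(alpha)) if alpha[j] > 0]

def extreme_nodes(M, MB):
    """E(M): nodes of M \\ M_B that are not a parent of any node in M."""
    return {a for a in M - MB if all(a not in parents(b) for b in M)}

def children_nodes(M, MF):
    """C(M): nodes of M_F \\ M whose full parent set lies in M."""
    return {a for a in MF - M if all(p in M for p in parents(a))}

# Model space of Figure 4-2: MF = {1, x1, x2, x1^2, x1x2, x2^2}.
MF = {(0, 0), (1, 0), (0, 1), (2, 0), (1, 1), (0, 2)}
MB = {(0, 0)}
M = {(0, 0), (1, 0), (2, 0)}          # M = {1, x1, x1^2}
print(extreme_nodes(M, MB))            # {(2, 0)}, i.e., x1^2
print(children_nodes(M, MF))           # {(0, 1)}, i.e., x2
```

The output matches the discussion of Figure 4-2: x1² is the single extreme node of M (x1 is a parent of x1², hence not extreme), and x2 is the single child node.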
Figure 4-2. A) Extreme node set. B) Children node set. (Both panels display the nodes 1, x1, x2, x1², x1x2, x2².)
In Figure 4-2, the extreme and children sets for the model M = {1, x1, x1²} are shown for the model space characterized by M_F = {1, x1, x2, x1², x1x2, x2²}. In Figure 4-2A, the solid nodes represent nodes α ∈ M \ E(M), the dashed node corresponds to α ∈ E(M), and the dotted nodes are not in M. Solid nodes in Figure 4-2B correspond to those in M; the dashed node is the single node in C(M), and the dotted nodes are not in M ∪ C(M).
4.3 Priors on the Model Space
As discussed in Scott & Berger (2010), the Ockham's-razor effect found automatically in Bayesian variable selection through the Bayes factor does not correct for multiple testing. This penalization acts against more complex models, but it does not account for the collection of models in the model space, which describes the multiplicity of the testing problem. This is where the role of the prior on the model space becomes important. As Scott & Berger explain, the multiplicity penalty is "hidden away" in the model prior probabilities π(M|M).
In what follows, we propose three different prior structures on the model space for WFMs, discuss their advantages and disadvantages, and describe reasonable choices for their hyper-parameters. In addition, we investigate how the choice of prior structure and hyper-parameter combinations affects the posterior probabilities of predictor inclusion, providing some recommendations for different situations.
4.3.1 Model Prior Definition
The graphical structure of the model space suggests a method for prior construction on M guided by the notion of inheritance. A node α is said to inherit from a node α' if there is a directed path from α' to α in the graph of M_F. The inheritance is said to be immediate if order(α) = order(α') + 1 (equivalently, if α' ∈ P(α), or if α' immediately precedes α).
For convenience, define Δ(M) = M \ M_B, the set of nodes in M that are not in the base model M_B. For α ∈ Δ(M_F), let γ_α(M) be the indicator function describing whether α is included in M, i.e., γ_α(M) = I(α ∈ M). Denote by γ^ν(M) the set of indicators of inclusion in M for all order-ν nodes in Δ(M_F). Finally, let γ^{<ν}(M) = ∪_{j=0}^{ν−1} γ^j(M), the set of indicators of inclusion in M for all nodes in Δ(M_F) of order less than ν. With these definitions, the prior probability of any model M ∈ M can be factored as

    π(M|M) = ∏_{j=J_M^min}^{J_M^max} π(γ^j(M) | γ^{<j}(M), M),   (4-3)

where J_M^min and J_M^max are, respectively, the minimum and maximum orders of nodes in Δ(M_F), and π(γ^{J_M^min}(M) | γ^{<J_M^min}(M), M) = π(γ^{J_M^min}(M) | M).
Prior distributions on M can be simplified by making two assumptions. First, if order(α) = order(α') = j, then γ_α and γ_{α'} are assumed to be conditionally independent given γ^{<j}, denoted by γ_α ⊥⊥ γ_{α'} | γ^{<j}. Second, immediate inheritance is invoked: it is assumed that if order(α) = j, then the conditional distribution of γ_α(M) given γ^{<j}(M) coincides with its conditional distribution given γ_{P(α)}(M), where γ_{P(α)}(M) is the inclusion indicator for the set of parent nodes of α. This indicator is one if the complete parent set of α is contained in M and zero otherwise.
In Figure 4-3, these two assumptions are depicted with M_F being an order-two surface in two main effects. The conditional independence assumption (Figure 4-3A) implies that the inclusion indicators for x1², x2², and x1x2 are independent when conditioned on all the lower-order terms. In this same space, immediate inheritance implies that the inclusion of x1², conditioned on the inclusion of all lower-order nodes, is equivalent to conditioning on its parent set (x1 in this case).
Figure 4-3. A) Conditional independence. B) Immediate inheritance.
Denote the conditional inclusion probability of node α in model M by π_α = π(γ_α(M) = 1 | γ_{P(α)}(M), M). Under the assumptions of conditional independence and immediate inheritance, the prior probability of M is

    π(M | π_M, M) = ∏_{α ∈ M_F \ M_B} π_α^{γ_α(M)} (1 − π_α)^{1−γ_α(M)},   (4-4)

with π_M = {π_α : α ∈ M_F \ M_B}. Because M must be well-formulated, π_α = γ_α(M) = 0 whenever γ_{P(α)}(M) = 0. Thus the product in (4-4) can be restricted to the set of nodes α ∈ (M \ M_B) ∪ C(M). Additional structure can be built into the prior on M by making assumptions about the inclusion probabilities π_α, such as equality assumptions or the assumption of a hyper-prior for these parameters. Three such prior classes are developed next, in each case by assigning a hyper-prior on π_M that assumes some structure among its elements, and then marginalizing out π_M.
Hierarchical Uniform Prior (HUP). The HUP assumes that the non-zero π_α are all equal. Specifically, for a model M ∈ M, it is assumed that π_α = π for all α ∈ (M \ M_B) ∪ C(M). The Bayesian specification of the HUP is completed by assuming a prior distribution for π. The choice π ~ Beta(a, b) produces

    π_HUP(M | M, a, b) = B(|M \ M_B| + a, |C(M)| + b) / B(a, b),   (4-5)

where B is the beta function. Setting a = b = 1 gives the particular value

    π_HUP(M | M, a = 1, b = 1) = 1/(|M \ M_B| + |C(M)| + 1) · binom(|M \ M_B| + |C(M)|, |M \ M_B|)^{-1}.   (4-6)
The HUP assigns equal probabilities to all models for which the sets of nodes M \ M_B and C(M) have the same cardinalities. This prior provides a combinatorial penalization, but it essentially fails to account for the hierarchical structure of the model space. An additional penalization for model complexity can be incorporated into the HUP by changing the values of a and b. Because π_α = π for all α, this penalization can only depend on some aspect of the entire graph of M_F, such as the total number of nodes not in the null model, |M_F \ M_B|.
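As a small numerical sketch of Eq. 4-5 (our code; only the counts |M \ M_B| and |C(M)| matter), the HUP mass can be computed with the beta function:

```python
from math import gamma

def beta_fn(x, y):
    """Beta function B(x, y) = Gamma(x) Gamma(y) / Gamma(x + y)."""
    return gamma(x) * gamma(y) / gamma(x + y)

def hup_prior(n_extra, n_children, a=1.0, b=1.0):
    """Eq. 4-5: HUP mass for a model with |M \\ M_B| = n_extra nodes beyond
    the base model and |C(M)| = n_children children, after marginalizing
    pi ~ Beta(a, b)."""
    return beta_fn(n_extra + a, n_children + b) / beta_fn(a, b)

# Intercept-only model in the quadratic space on (x1, x2):
# |M \ M_B| = 0 and |C(M)| = 2, giving mass 1/3 under a = b = 1.
print(hup_prior(0, 2))    # 0.333... = 1/3
```

These values reproduce, for instance, the HUP entries with a = b = 1 in Figure 4-4: 1/3 for the intercept-only model, 1/12 for {1, x1}, and 1/6 for the full quadratic model.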
Hierarchical Independence Prior (HIP). The HIP assumes that there are no equality constraints among the non-zero π_α. Each non-zero π_α is given its own prior, which is assumed to be a Beta(a_α, b_α) distribution. Thus the prior probability of M under the HIP is

    π_HIP(M | M, a, b) = ∏_{α ∈ M \ M_B} a_α/(a_α + b_α) · ∏_{α ∈ C(M)} b_α/(a_α + b_α),   (4-7)

where a product over the empty set is taken to be 1. Because the π_α are totally independent, any choice of a_α and b_α is equivalent to choosing a probability of success π_α for a given α. Setting a_α = b_α = 1 for all α ∈ (M \ M_B) ∪ C(M) gives the particular value

    π_HIP(M | M, a = 1, b = 1) = (1/2)^{|M \ M_B| + |C(M)|}.   (4-8)
Although the prior with this choice of hyper-parameters accounts for the hierarchical structure of the model space, it provides essentially no penalization for combinatorial complexity at the different levels of the hierarchy. This can be observed by considering a model space with main effects only: the exponent in (4-8) is the same for every model in the space, because each node is either in the model or in the children set.
Additional penalizations for model complexity can be incorporated into the HIP. Because each γ^j is conditioned on γ^{<j} in the prior construction, the a_α and b_α for α of order j can be conditioned on γ^{<j}. One such additional penalization utilizes the number of nodes of order j that could be added to produce a WFM conditioned on the inclusion vector γ^{<j}, which is denoted by ch_j(γ^{<j}). Choosing a_α = 1 and b_α(M) = ch_j(γ^{<j}) is equivalent to choosing a probability of success π_α = 1/(1 + ch_j(γ^{<j})). This penalization can drive down the false positive rate when ch_j(γ^{<j}) is large, but it may produce more false negatives.
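A minimal sketch of the HIP mass under shared hyper-parameters (Eq. 4-8; our code — the general form in Eq. 4-7 allows node-specific a_α, b_α):

```python
def hip_prior(n_extra, n_children, a=1.0, b=1.0):
    """Eq. 4-7 with common (a, b): each included node beyond the base model
    contributes a/(a+b), each child node contributes b/(a+b)."""
    p = a / (a + b)
    return p ** n_extra * (1.0 - p) ** n_children

# Quadratic space on (x1, x2): the base model has 2 children (x1, x2),
# and the full model has 5 nodes beyond the intercept and no children.
print(hip_prior(0, 2))    # 0.25    (= 1/4, the HIP(1,1) entry in Figure 4-4)
print(hip_prior(5, 0))    # 0.03125 (= 1/32)
```

Note how the full model receives as much mass as any other model with |M \ M_B| + |C(M)| = 5, which is the lack of combinatorial penalization discussed above.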
Hierarchical Order Prior (HOP). A compromise between complete equality and complete independence of the π_α is to assume equality between the π_α of a given order and independence across the different orders. Define Δ_j(M) = {α ∈ M \ M_B : order(α) = j} and C_j(M) = {α ∈ C(M) : order(α) = j}. The HOP assumes that π_α = π_j for all α ∈ Δ_j(M) ∪ C_j(M). Assuming that π_j ~ Beta(a_j, b_j) provides the prior probability

    π_HOP(M | M, a, b) = ∏_{j=J_M^min}^{J_M^max} B(|Δ_j(M)| + a_j, |C_j(M)| + b_j) / B(a_j, b_j).   (4-9)

The specific choice a_j = b_j = 1 for all j gives the value

    π_HOP(M | M, a = 1, b = 1) = ∏_j 1/(|Δ_j(M)| + |C_j(M)| + 1) · binom(|Δ_j(M)| + |C_j(M)|, |Δ_j(M)|)^{-1},   (4-10)

and produces a hierarchical version of the Scott and Berger multiplicity correction.
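The order-by-order product in Eq. 4-9 can be sketched as follows (our code; each entry of `layers` is the pair of counts for one order j — the number of order-j nodes in M beyond the base model, and the number of order-j children of M):

```python
from math import gamma

def beta_fn(x, y):
    """Beta function B(x, y) = Gamma(x) Gamma(y) / Gamma(x + y)."""
    return gamma(x) * gamma(y) / gamma(x + y)

def hop_prior(layers, a=1.0, b=1.0):
    """Eq. 4-9 with a_j = a, b_j = b for all orders j."""
    mass = 1.0
    for n_extra_j, n_children_j in layers:
        mass *= beta_fn(n_extra_j + a, n_children_j + b) / beta_fn(a, b)
    return mass

# Quadratic space on (x1, x2), base model = intercept only:
# M = {1, x1, x2, x1^2} has 2 order-1 nodes (no order-1 children) and
# 1 order-2 node with 2 order-2 children (x1x2, x2^2).
print(hop_prior([(2, 0), (1, 2)]))    # 0.0277... = 1/36, matching Figure 4-4
```

The same function reproduces the intercept-only value 1/3 (layers = [(0, 2)]), i.e., the Scott-and-Berger-type correction applied within each order.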
The HOP arises from a conditional exchangeability assumption on the indicator variables. Conditioned on γ^{<j}(M), the indicators γ_α for the order-j nodes α of (M \ M_B) ∪ C(M) are assumed to be exchangeable Bernoulli random variables. By de Finetti's theorem, these arise from independent Bernoulli random variables with a common probability of success π_j, which itself has a prior distribution. Our construction of the HOP assumes that this prior is a beta distribution. Additional complexity penalizations can be incorporated into the HOP in a similar fashion to the HIP. The number of nodes of order j that could be added while maintaining a WFM is given by ch_j(M) = ch_j(γ^{<j}(M)) = |{α ∈ (M \ M_B) ∪ C(M) : order(α) = j}|. Using a_j = 1 and b_j(M) = ch_j(M) produces a prior with two desirable properties. First, if M' ⊂ M, then π(M) ≤ π(M'). Second, for each order j, the conditional probability of including k nodes is greater than or equal to that of including k + 1 nodes, for k = 0, 1, …, ch_j(M) − 1.
4.3.2 Choice of Prior Structure and Hyper-Parameters
Each of the priors introduced in Section 4.3.1 defines a whole family of model priors, characterized by the probability distribution assumed for the inclusion probabilities π_M. For the sake of simplicity, this chapter focuses on those arising from Beta distributions, and concentrates on particular choices of hyper-parameters that can be specified automatically. First, we describe some general features of how each of the three prior structures (HUP, HIP, HOP) allocates mass to the models in the model space. Second, as there is an infinite number of ways in which the hyper-parameters can be specified, focus is placed on the default choice a = b = 1, as well as on the complexity penalizations described in Section 4.3.1. The second alternative is referred to as a = 1, b = ch, where b = ch has a slightly different interpretation depending on the prior structure. For the HIP and HOP, b = ch is given by b_j(M) = b_α(M) = ch_j(M), with j = order(α); for the HUP, b = ch denotes b = |M_F \ M_B|. Figures 4-4 and 4-5 illustrate the prior behavior for two model spaces. In both cases, the base model M_B is taken to be the intercept-only model and M_F is the DAG shown. The priors treat model complexity differently, and some general properties can be seen in these examples.
                                        HIP              HOP              HUP
     Model                         (1,1)   (1,ch)   (1,1)   (1,ch)   (1,1)   (1,ch)
  1  1                              1/4     4/9      1/3     1/2      1/3     5/7
  2  1, x1                          1/8     1/9      1/12    1/12     1/12    5/56
  3  1, x2                          1/8     1/9      1/12    1/12     1/12    5/56
  4  1, x1, x1^2                    1/8     1/9      1/12    1/12     1/12    5/168
  5  1, x2, x2^2                    1/8     1/9      1/12    1/12     1/12    5/168
  6  1, x1, x2                      1/32    3/64     1/12    1/12     1/60    1/72
  7  1, x1, x2, x1^2                1/32    1/64     1/36    1/60     1/60    1/168
  8  1, x1, x2, x1x2                1/32    1/64     1/36    1/60     1/60    1/168
  9  1, x1, x2, x2^2                1/32    1/64     1/36    1/60     1/60    1/168
 10  1, x1, x2, x1^2, x1x2          1/32    1/192    1/36    1/120    1/30    1/252
 11  1, x1, x2, x1^2, x2^2          1/32    1/192    1/36    1/120    1/30    1/252
 12  1, x1, x2, x1x2, x2^2          1/32    1/192    1/36    1/120    1/30    1/252
 13  1, x1, x2, x1^2, x1x2, x2^2    1/32    1/576    1/12    1/120    1/6     1/252

Figure 4-4. Prior probabilities for the space of well-formulated models associated with the quadratic surface in two variables, where M_B is taken to be the intercept-only model and (a, b) ∈ {(1, 1), (1, ch)}.
First, contrast the HIP, HUP, and HOP for the choice (a, b) = (1, 1). The HIP induces a complexity penalization that only accounts for the order of the terms in the model. This is best exhibited by the model space in Figure 4-4: models including x1 and x2 (models 6 through 13) are given the same prior probability, and no penalization is incurred for the inclusion of any or all of the quadratic terms. In contrast to the HIP, the
                                  HIP              HOP              HUP
     Model                   (1,1)   (1,ch)   (1,1)   (1,ch)   (1,1)   (1,ch)
  1  1                        1/8     27/64    1/4     1/2      1/4     4/7
  2  1, x1                    1/8     9/64     1/12    1/10     1/12    2/21
  3  1, x2                    1/8     9/64     1/12    1/10     1/12    2/21
  4  1, x3                    1/8     9/64     1/12    1/10     1/12    2/21
  5  1, x1, x3                1/8     3/64     1/12    1/20     1/12    4/105
  6  1, x2, x3                1/8     3/64     1/12    1/20     1/12    4/105
  7  1, x1, x2                1/16    3/128    1/24    1/40     1/30    1/42
  8  1, x1, x2, x1x2          1/16    3/128    1/24    1/40     1/20    1/70
  9  1, x1, x2, x3            1/16    1/128    1/8     1/40     1/20    1/70
 10  1, x1, x2, x3, x1x2      1/16    1/128    1/8     1/40     1/5     1/70

Figure 4-5. Prior probabilities for the space of well-formulated models associated with three main effects and one interaction term, where M_B is taken to be the intercept-only model and (a, b) ∈ {(1, 1), (1, ch)}.
HUP induces a penalization for model complexity, but it does not adequately penalize models for including additional terms. Under the HIP, models including all of the terms are given at least as much probability as any model containing any non-empty set of terms (Figures 4-4 and 4-5). This lack of penalization of the full model originates from its combinatorial simplicity (i.e., it is the only model that contains every term), and as an unfortunate consequence, this model space distribution favors the base and full models. Similar behavior is observed with the HOP with (a, b) = (1, 1). As models become more complex, they are appropriately penalized for their size. However, after a sufficient number of nodes are added, the number of possible models of that particular size is considerably reduced; thus, combinatorial complexity is negligible for the largest models. This is best exhibited in Figure 4-5, where the HOP places more mass on the full model than on any model containing a single order-one node, highlighting an undesirable behavior of the priors with this choice of hyper-parameters.
In contrast, if (a, b) = (1, ch), all three priors produce strong penalization as models become more complex, both in terms of the number and the order of the nodes contained in the model. For all of the priors, adding a node α to a model M to form M' produces π(M) ≥ π(M'). However, differences between the priors are apparent. The
HIP penalizes the full model the most, with the HOP penalizing it the least and the HUP lying between them. At face value, the HOP creates the most compelling penalization of model complexity. In Figure 4-5, the penalization of the HOP is the least dramatic, producing prior odds of 20 for M_B versus M_F, as opposed to the HUP and HIP, which produce prior odds of 40 and 54, respectively. Similarly, the prior odds in Figure 4-4 are 60, 180, and 256 for the HOP, HUP, and HIP, respectively.
4.3.3 Posterior Sensitivity to the Choice of Prior
To determine how the proposed priors adjust the posterior probabilities to account for multiplicity, a simple simulation was performed. The goal of this exercise is to understand how the priors respond to increasing complexity. First, the priors are compared as the number of main effects p grows. Second, they are compared as the depth of the hierarchy increases, or in other words, as the maximum order J_M^max increases.
The quality of a node is characterized by its marginal posterior inclusion probability, defined as p_α = Σ_{M ∈ M} I(α ∈ M) p(M | y, M) for α ∈ M_F. These posteriors were obtained for the proposed priors as well as for the Equal Probability Prior (EPP) on M. For all prior structures, both the default hyper-parameters a = b = 1 and the penalizing choice a = 1, b = ch are considered. The results for the different combinations of M_F and M_T incorporated in the analysis were obtained from 100 random replications (i.e., generating at random 100 matrices of main effects and responses). The simulation proceeds as follows:
1. Randomly generate main-effects matrices X = (x1, …, x18), with x_i ~iid N_n(0, I_n), and error vectors ε ~ N_n(0, I_n), for n = 60.
2. Setting all coefficient values equal to one, calculate y = Z_{M_T}β + ε for the true models given by
   M_T,1 = {x1, x2, x3, x1^2, x1x2, x2^2, x2x3}, with |M_T,1| = 7,
   M_T,2 = {x1, x2, …, x16}, with |M_T,2| = 16,
   M_T,3 = {x1, x2, x3, x4}, with |M_T,3| = 4,
   M_T,4 = {x1, x2, …, x8, x1^2, x3x4}, with |M_T,4| = 10,
   M_T,5 = {x1, x2, x3, x4, x1^2, x3x4}, with |M_T,5| = 6.
Table 4-1. Characterization of the full models M_F and corresponding model spaces M considered in simulations.

 Growing p, fixed J_M^max                    Fixed p, growing J_M^max
 M_F             |M_F|   |M|     M_T used    M_F             |M_F|   |M|      M_T used
 (x1+x2+x3)^2      9      95      M_T,1      (x1+x2+x3)^2      9      95       M_T,1
 (x1+...+x4)^2    14     1337     M_T,1      (x1+x2+x3)^3     19     2497      M_T,1
 (x1+...+x5)^2    20     38619    M_T,1      (x1+x2+x3)^4     34     161421    M_T,1

 Other model spaces
 M_F                              |M_F|   |M|      M_T used
 x1 + x2 + ... + x18               18     262144   M_T,2, M_T,3
 (x1+...+x4)^2 + x5 + ... + x10    20     85568    M_T,4, M_T,5
3. In all simulations, the base model M_B is the intercept-only model. The notation (x1 + ⋯ + xp)^d is used to represent the full order-d polynomial response surface in p main effects. The model spaces, characterized by their corresponding full models M_F, are presented in Table 4-1, along with the true models used in each case.
4. Enumerate the model spaces and calculate p(M | y, M) for all M ∈ M using the EPP, HUP, HIP, and HOP, the latter three each with the two sets of hyper-parameters.
5. Count the number of true positives and false positives in each M for the different priors.
The true positives (TP) are defined as those nodes α ∈ M_T such that p_α > 0.5. For the false positives (FP), three different cutoffs on p_α are considered for α ∉ M_T, elucidating the adjustment for multiplicity induced by the model priors: 0.10, 0.20, and 0.50. The results from this exercise provide insight into the influence of the prior on the marginal posterior inclusion probabilities. Table 4-1 describes the model spaces considered, both in terms of the number of models they contain and in terms of the number of nodes of M_F, the full model that defines the DAG for M.
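The marginal posterior inclusion probabilities p_α used throughout this exercise can be accumulated directly from an enumerated posterior. A sketch (our code, with models represented as frozensets of term labels):

```python
def inclusion_probabilities(posterior):
    """p_alpha = sum over models containing alpha of p(M | y, M).
    `posterior` maps each model (a frozenset of terms) to its posterior mass."""
    p = {}
    for model, mass in posterior.items():
        for term in model:
            p[term] = p.get(term, 0.0) + mass
    return p

# Toy enumerated posterior over three nested models:
posterior = {
    frozenset({"1"}): 0.2,
    frozenset({"1", "x1"}): 0.5,
    frozenset({"1", "x1", "x1^2"}): 0.3,
}
p = inclusion_probabilities(posterior)
print(p["x1"])      # 0.8
print(p["x1^2"])    # 0.3
```

A term is then counted as a true or false positive by comparing p_α to the chosen cutoff (0.10, 0.20, or 0.50).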
Growing number of main effects, fixed polynomial degree. This simulation investigates the posterior behavior as the number of covariates grows for a polynomial surface of degree two. The true model is taken to be M_T,1, which has 7 polynomial terms. The false positive and true positive rates are displayed in Table 4-2.
First, focus on the posteriors when (a, b) = (1, 1). As p increases and the cutoff is low, the number of false positives increases for the EPP as well as for the hierarchical priors, although less dramatically for the latter. All of the priors identify all of the true positives. The false positive rate at the 0.50 cutoff is less than one for all four prior structures, with the HIP exhibiting the smallest false positive rate.
With the second choice of hyper-parameters, (1, ch), the improvement of the hierarchical priors over the EPP is dramatic, and the difference in performance becomes more pronounced as p increases. These priors also considerably outperform their counterparts with the default hyper-parameters a = b = 1 in terms of false positives. Regarding the number of true positives, all priors discovered the 7 true predictors in M_T,1 for most of the 100 random samples of data, with only minor differences observed between any of the priors considered. That being said, the means for the priors with a = 1, b = ch are slightly lower for the true positives: with the 0.50 cutoff, the hierarchical priors keep tight control of the number of false positives, but in doing so they discard true positives with slightly higher frequency.
Growing polynomial degree, fixed main effects. For these examples, the true model is once again M_T,1. When the complexity is increased by making the order of M_F larger (Table 4-3), the inability of the EPP to adjust the inclusion posteriors for multiplicity becomes more pronounced: the EPP becomes less and less efficient at removing false positives when the FP cutoff is low. Among the priors with a = b = 1, as the order increases, the HIP is the best at filtering out the false positives. Using the 0.50 false positive cutoff, some false positives are included both for the EPP and for all of the priors with a = b = 1, indicating that the default hyper-parameters might not be the best option to control FP. The 7 covariates in the true model all obtain high posterior inclusion probabilities, both with the EPP and with the a = b = 1 priors.
Table 4-2. Mean number of false and true positives in 100 randomly generated datasets, as the number of main effects increases from three to five in a full quadratic surface, under the equal probability prior (EPP), the hierarchical independence prior (HIP), the hierarchical order prior (HOP), and the hierarchical uniform prior (HUP).

                                               a = 1, b = 1           a = 1, b = ch
 Cutoff     |M_T|  M_F               EPP     HIP    HUP    HOP      HIP    HUP    HOP
 FP(>0.10)    7    (x1+x2+x3)^2     1.78    1.78   2.00   2.00     0.11   1.31   1.06
 FP(>0.20)                           0.43    0.43   2.00   1.98     0.01   0.28   0.24
 FP(>0.50)                           0.04    0.04   0.97   0.36     0.00   0.03   0.02
 TP(>0.50)         (M_T,1)           7.00    7.00   7.00   7.00     6.97   6.99   6.99
 FP(>0.10)    7    (x1+...+x4)^2    3.62    1.94   2.33   2.45     0.10   0.63   1.07
 FP(>0.20)                           1.60    0.47   2.17   2.15     0.01   0.17   0.24
 FP(>0.50)                           0.25    0.06   0.35   0.36     0.00   0.02   0.02
 TP(>0.50)         (M_T,1)           7.00    7.00   7.00   7.00     6.97   6.99   6.99
 FP(>0.10)    7    (x1+...+x5)^2    6.00    2.16   2.60   2.55     0.12   0.43   1.15
 FP(>0.20)                           2.91    0.55   2.13   2.18     0.02   0.19   0.27
 FP(>0.50)                           0.66    0.11   0.25   0.37     0.00   0.03   0.01
 TP(>0.50)         (M_T,1)           7.00    7.00   7.00   7.00     6.97   6.99   6.99
In contrast, any of the a = 1, b = ch priors dramatically improve upon their a = b = 1 counterparts, consistently assigning low inclusion probabilities to the majority of the false positive terms, even at low cutoffs. As the order of the polynomial surface increases, the difference in performance between these priors and either the EPP or their default versions becomes even clearer. At the 0.50 cutoff, the hierarchical priors with complexity penalization exhibit very low false positive rates. The true positive rate decreases slightly for these priors, but not to an alarming degree.
Other model spaces. This part of the analysis considers model spaces that do not correspond to full polynomial response surfaces (Table 4-4). The first example is a model space with main effects only. The second example includes a full quadratic surface in four main effects, plus six additional covariates for which only main effects are modeled. Two true models are used in combination with each model space, to observe how the posterior probabilities vary under the influence of the different priors for "large" and "small" true models.
Table 4-3. Mean number of false and true positives in 100 randomly generated datasets, as the maximum order of M_F increases from two to four in a full model with three main effects, under the equal probability prior (EPP), the hierarchical independence prior (HIP), the hierarchical order prior (HOP), and the hierarchical uniform prior (HUP).

                                               a = 1, b = 1           a = 1, b = ch
 Cutoff     |M_T|  M_F               EPP     HIP    HUP    HOP      HIP    HUP    HOP
 FP(>0.10)    7    (x1+x2+x3)^2     1.78    1.78   2.00   2.00     0.11   1.31   1.06
 FP(>0.20)                           0.43    0.43   2.00   1.98     0.01   0.28   0.24
 FP(>0.50)                           0.04    0.04   0.97   0.36     0.00   0.03   0.02
 TP(>0.50)         (M_T,1)           7.00    7.00   7.00   7.00     6.97   6.99   6.99
 FP(>0.10)    7    (x1+x2+x3)^3     7.37    5.21   6.06   2.91     0.55   1.05   1.39
 FP(>0.20)                           2.91    1.55   3.61   2.08     0.17   0.34   0.31
 FP(>0.50)                           0.40    0.21   0.50   0.26     0.03   0.03   0.04
 TP(>0.50)         (M_T,1)           7.00    7.00   7.00   7.00     6.97   6.98   7.00
 FP(>0.10)    7    (x1+x2+x3)^4     8.22    4.00   4.69   2.61     0.52   0.55   1.32
 FP(>0.20)                           4.21    1.13   1.76   2.03     0.12   0.15   0.31
 FP(>0.50)                           0.56    0.17   0.22   0.27     0.03   0.03   0.04
 TP(>0.50)         (M_T,1)           7.00    7.00   7.00   7.00     6.97   6.97   6.99
By construction, in model spaces with main effects only, the HIP(1,1) and EPP are equivalent, as are the HOP(a,b) and HUP(a,b). This accounts for the similarities observed among the results for the first two cases presented in Table 4-4, where the model space corresponds to a full model with 18 main effects and the true models have 16 and 4 main effects, respectively. When the number of true coefficients is large, the HUP(1,1) and HOP(1,1) do poorly at controlling false positives, even at the 0.50 cutoff. In contrast, the HIP (and thus the EPP) with the 0.50 cutoff identifies the true positives and no false positives. This result, however, does not imply that the EPP controls false positives well: the true model contains 16 of the 18 nodes in M_F, so there is little potential for false positives. The a = 1, b = ch priors show dramatically different behavior. The HIP controls false positives well but fails to identify the true coefficients at the 0.50 cutoff. In contrast, the HOP identifies all of the true positives and has a small false positive rate at the 0.50 cutoff.
If the number of true positives is small, most terms in the full model are truly zero. The EPP includes at least one false positive in approximately 50% of the randomly sampled datasets. On the other hand, the HUP(1,1) provides some control for multiplicity, obtaining on average a lower number of false positives than the EPP. Furthermore, the proposed hierarchical priors with a = 1, b = ch are substantially better than the EPP (and than the choice a = b = 1) at controlling false positives and capturing all true positives using the marginal posterior inclusion probabilities. The two examples suggest that the HOP(1, ch) is the best default choice for model selection when the number of terms available at a given degree is large.
The third and fourth examples in Table 4-4 consider the same irregular model space, with data generated from M_T,4, with ten terms, and M_T,5, with six terms. The HIP(1,1) and EPP again behave quite similarly, incorporating a large number of false positives at the 0.10 cutoff; at the 0.50 cutoff, some false positives are still included. The HUP(1,1) and HOP(1,1) behave similarly, with a slightly higher false positive rate at the 0.50 cutoff. In terms of the true positives, the EPP and a = b = 1 priors always include all of the predictors in M_T,4 and M_T,5. On the other hand, the ability of the a = 1, b = ch priors to control false positives is markedly better than that of the EPP and of the hierarchical priors with a = b = 1. At the 0.50 cutoff, these priors identify all of the true positives and true negatives. Once again, these examples point to the hierarchical priors with additional penalization for complexity as good default priors on the model space.
4.4 Random Walks on the Model Space
When the model space M is too large to enumerate, a stochastic procedure can be used to find models with high posterior probability. In particular, an MCMC algorithm can be utilized to generate a dependent sample of models from the model posterior. The structure of the model space M both presents difficulties and provides clues on how to build algorithms to explore it. Different MCMC strategies can be adopted, two of which
Table 4-4. Mean number of false and true positives in 100 randomly generated datasets with unstructured or irregular model spaces, under the equal probability prior (EPP), the hierarchical independence prior (HIP), the hierarchical order prior (HOP), and the hierarchical uniform prior (HUP).

                                                     a = 1, b = 1            a = 1, b = ch
 Cutoff     |M_T|  M_F                    EPP      HIP     HUP     HOP      HIP     HUP     HOP
 FP(>0.10)   16    x1 + x2 + ... + x18    1.93     1.93    2.00    2.00     0.03    1.80    1.80
 FP(>0.20)                                 0.52     0.52    2.00    2.00     0.01    0.46    0.46
 FP(>0.50)                                 0.07     0.07    2.00    2.00     0.01    0.04    0.04
 TP(>0.50)         (M_T,2)                15.99    15.99   16.00   16.00     6.99   15.99   15.99
 FP(>0.10)    4    x1 + x2 + ... + x18   13.95    13.95    9.15    9.15     0.26    1.31    1.31
 FP(>0.20)                                 5.45     5.45    3.03    3.03     0.05    0.45    0.45
 FP(>0.50)                                 0.84     0.84    0.45    0.45     0.02    0.06    0.06
 TP(>0.50)         (M_T,3)                 4.00     4.00    4.00    4.00     4.00    4.00    4.00
 FP(>0.10)   10    (x1+...+x4)^2          9.73     9.71   10.00    5.60     0.34    2.33    2.20
 FP(>0.20)           + x5 + ... + x10      2.65     2.65    8.73    3.05     0.12    0.74    0.69
 FP(>0.50)                                 0.35     0.35    1.36    1.68     0.02    0.11    0.12
 TP(>0.50)         (M_T,4)                10.00    10.00   10.00    9.99     9.94    9.98    9.99
 FP(>0.10)    6    (x1+...+x4)^2         13.52    13.52   11.06    9.94     0.44    1.63    1.96
 FP(>0.20)           + x5 + ... + x10      4.22     4.21    3.60    5.01     0.15    0.48    0.68
 FP(>0.50)                                 0.53     0.53    0.57    0.75     0.01    0.08    0.11
 TP(>0.50)         (M_T,5)                 6.00     6.00    6.00    6.00     5.99    5.99    5.99
are outlined in this section Combining the different strategies allows the model selection
algorithm to explore the model space thoroughly and relatively fast441 Simple Pruning and Growing
This first strategy relies on small localized jumps around the model space turning
on or off a single node at each step The idea behind this algorithm is to grow the model
by activating one node in the children set or to prune the model by removing one node
in the extreme set At a given step in the algorithm assume that the current state of the
chain is model M Let pG be the probability that algorithm chooses the growth step The
proposed model M prime can either be M+ = M cup α for some α isin C(M) or Mminus = M α
or some α isin E(M)
An example transition kernel is defined by the mixture

g(M′ | M) = pG · qGrow(M′ | M) + (1 − pG) · qPrune(M′ | M)
          = [ I(M ≠ MF) / (1 + I(M ≠ MB)) ] · I(α ∈ C(M)) / |C(M)|
            + [ I(M ≠ MB) / (1 + I(M ≠ MF)) ] · I(α ∈ E(M)) / |E(M)|,        (4–11)

where pG has explicitly been defined as 0.5 when both C(M) and E(M) are non-empty, and as 0 (or 1) when C(M) = ∅ (or E(M) = ∅). After choosing pruning or growing, a single node is proposed for addition to or deletion from M uniformly at random.
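As a concrete illustration, the simple pruning/growing proposal can be sketched in Python. This is a minimal sketch, not the dissertation's implementation: a model is represented as a set of node labels, the base model MB is taken to be the empty set (intercept implicit), and `parents` maps each node to the nodes it inherits from, so that `children_set` and `extreme_set` play the roles of C(M) and E(M).

```python
import random

def children_set(model, parents):
    """C(M): nodes outside M all of whose parents are already in M."""
    return [a for a in parents if a not in model
            and all(p in model for p in parents[a])]

def extreme_set(model, parents):
    """E(M): nodes of M that are not a parent of any other node in M
    (assuming MB = empty set, so any such node is removable)."""
    return [a for a in model
            if not any(a in parents[b] for b in model if b != a)]

def propose(model, parents):
    """One grow/prune step: return (proposed model, forward density g(M'|M))."""
    C = children_set(model, parents)
    E = extreme_set(model, parents)
    p_grow = 0.5 if (C and E) else (1.0 if C else 0.0)
    if random.random() < p_grow:
        alpha = random.choice(C)            # grow: activate one child node
        return model | {alpha}, p_grow / len(C)
    alpha = random.choice(E)                # prune: deactivate one extreme node
    return model - {alpha}, (1 - p_grow) / len(E)

# Toy model space: quadratic surface in two main effects
parents = {'x1': [], 'x2': [], 'x1^2': ['x1'], 'x2^2': ['x2'], 'x1x2': ['x1', 'x2']}
new_model, density = propose({'x1'}, parents)
```

Each call moves to a model differing from M by exactly one node, which is the small localized jump described above.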
For this simple algorithm, pruning has the reverse kernel of growing and vice versa. From this construction, more elaborate algorithms can be specified. First, instead of choosing the node uniformly at random from the corresponding set, nodes can be selected using the relative posterior probability of adding or removing the node. Second, more than one node can be selected at any step, for instance by also sampling at random the number of nodes to add or remove given the size of the set. Third, the strategy could combine pruning and growing in a single step by sampling one node α ∈ C(M) ∪ E(M) and adding or removing it accordingly. Fourth, sets of nodes from C(M) ∪ E(M) that yield well-formulated models can be added or removed. This simple algorithm produces small moves around the model space by focusing node addition or removal only on the set C(M) ∪ E(M).

4.4.2 Degree Based Pruning and Growing
In exploring the model space, it is possible to take advantage of the hierarchical structure defined between nodes of different order. One can update the vector of inclusion indicators by blocks j(M), where j(A) denotes the set of order-j nodes of A. Two flavors of this algorithm are proposed: one that separates the pruning and growing steps, and one where both are done simultaneously.

Assume that at a given step, say t, the algorithm is at M. If growing, the strategy proceeds successively by order class, going from j = Jmin up to j = Jmax, with Jmin and Jmax being the lowest and highest orders of nodes in MF \ MB, respectively. Define Mt(Jmin−1) = M and set j = Jmin. The growth kernel comprises the following steps, proceeding from j = Jmin to j = Jmax:
1) Propose a model M′ by selecting a set of nodes from Cj(Mt(j−1)) through the kernel qGrow,j(·|Mt(j−1)).

2) Compute the Metropolis-Hastings correction for M′ versus Mt(j−1). If M′ is accepted, then set Mt(j) = M′; otherwise set Mt(j) = Mt(j−1).

3) If j < Jmax, then set j = j + 1 and return to 1); otherwise proceed to 4).

4) Set Mt = Mt(Jmax).

The pruning step is defined in a similar fashion; however, it starts at order j = Jmax and proceeds down to j = Jmin. Let Ej(M) = E(M) ∩ j(MF) be the set of nodes of order j that can be removed from the model M to produce a WFM. Define Mt(Jmax+1) = M and set j = Jmax. The pruning kernel comprises the following steps:

1) Propose a model M′ by selecting a set of nodes from Ej(Mt(j+1)) through the kernel qPrune,j(·|Mt(j+1)).

2) Compute the Metropolis-Hastings correction for M′ versus Mt(j+1). If M′ is accepted, then set Mt(j) = M′; otherwise set Mt(j) = Mt(j+1).

3) If j > Jmin, then set j = j − 1 and return to Step 1); otherwise proceed to Step 4).

4) Set Mt = Mt(Jmin).
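The degree-based growth sweep can be sketched in Python as follows. This is a hedged sketch rather than the exact kernel above: for simplicity the subset of order-j nodes is proposed uniformly, and the Metropolis-Hastings step assumes this proposal is symmetric, so only the posterior ratio appears; `C_j` and `log_post` are caller-supplied stand-ins for Cj(·) and the log model posterior.

```python
import math
import random

def grow_sweep(model, j_min, j_max, C_j, log_post, rng=random):
    """One pass of the degree-based growing kernel: for j = j_min..j_max,
    propose adding a uniformly chosen non-empty subset of C_j(M) and
    accept or reject the enlarged model order by order."""
    current = frozenset(model)
    for j in range(j_min, j_max + 1):
        pool = C_j(current, j)
        if not pool:
            continue
        k = rng.randint(1, len(pool))                      # how many nodes to add
        proposal = current | frozenset(rng.sample(pool, k))
        # Simplified MH correction under a symmetric-proposal assumption
        if math.log(rng.random()) < log_post(proposal) - log_post(current):
            current = proposal
    return current

# Toy usage: quadratic surface in two variables, posterior favoring larger models
parents = {'x1': (), 'x2': (), 'x1^2': ('x1',), 'x2^2': ('x2',), 'x1x2': ('x1', 'x2')}
order = {'x1': 1, 'x2': 1, 'x1^2': 2, 'x2^2': 2, 'x1x2': 2}

def C_j(m, j):
    return [a for a in parents
            if order[a] == j and a not in m and all(p in m for p in parents[a])]

final = grow_sweep(frozenset(), 1, 2, C_j, log_post=len, rng=random.Random(0))
```

Because each order-j pool only contains nodes whose ancestors are already active, every model visited by the sweep is well-formulated.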
It is clear that the growing and pruning steps are reverse kernels of each other. Pruning and growing can be combined for each j: the forward kernel proceeds from j = Jmin to j = Jmax and proposes adding sets of nodes from Cj(M) ∪ Ej(M); the reverse kernel simply reverses the direction of j, proceeding from j = Jmax to j = Jmin.

4.5 Simulation Study
To study the operating characteristics of the proposed priors, a simulation experiment was designed with three goals. First, the priors are characterized by how the posterior distributions are affected by the sample size and the signal-to-noise ratio (SNR). Second, given the SNR level, the influence of the allocation of the signal across the terms in the model is investigated. Third, performance is assessed when the true model has special points in the scale (McCullagh & Nelder, 1989), i.e., when the true model has coefficients equal to zero for some lower-order terms in the polynomial hierarchy.
With these goals in mind, sets of predictors and responses are generated under various experimental conditions. The model space is defined with MB being the intercept-only model and MF being the complete order-four polynomial surface in five main effects, which has 126 nodes. The entries of the matrix of main effects are generated as independent standard normals. The response vectors are drawn from the n-variate normal distribution as y ∼ Nn(ZMT(X) βMT, In), where MT is the true model and In is the n × n identity matrix.
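One replication of this data-generating step might look as follows in Python. The true model shown (an intercept plus x1, x2, x1², x1·x2, with unit coefficients) is purely illustrative; the actual study allocates coefficient values via sphering (Appendix C).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 130

# Entries of the matrix of main effects: independent standard normals
X = rng.standard_normal((n, 5))

# Design matrix Z_{M_T}(X) for an illustrative true model:
# intercept + x1 + x2 + x1^2 + x1*x2   (unit coefficients, chosen arbitrarily)
Z = np.column_stack([np.ones(n), X[:, 0], X[:, 1], X[:, 0] ** 2, X[:, 0] * X[:, 1]])
beta = np.ones(Z.shape[1])

# y ~ N_n(Z beta, I_n): add independent standard normal noise
y = Z @ beta + rng.standard_normal(n)
```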
The sample sizes considered are n ∈ {130, 260, 1040}, which ensures that ZMF(X) is of full rank. The cardinality of this model space is |M| > 1.2 × 10^22, which makes enumeration of all models unfeasible. Because the value of the 2k-th moment of the standard normal distribution increases with k = 1, 2, ..., higher-order terms by construction have a larger variance than their ancestors. As such, assuming equal values for all coefficients, higher-order terms necessarily contain more "signal" than the lower-order terms from which they inherit (e.g., x1^2 has more signal than x1). Once a higher-order term is selected, its entire ancestry is also included. Therefore, to prevent the simulation results from being overly optimistic (because of the larger signals from the higher-order terms), sphering is used to calculate meaningful values of the coefficients, ensuring that the signal is of the magnitude intended in any given direction. Given the results of the simulations from Section 4.3.3, only the HOP with a = 1, b = ch is considered, with the EPP included for comparison.
The total number of combinations of SNR, sample size, regression coefficient values, and nodes in MT amounts to 108 different scenarios. Each scenario was run with 100 independently generated datasets, and the mean behavior of the samples was observed. The results presented in this section correspond to the median probability model (MPM) from each of the 108 simulation scenarios considered. Figure 4-7 shows the comparison between the two priors for the mean number of true positive (TP) and false positive (FP) terms. Although some of the scenarios consider true models that are not well-formulated, the smallest well-formulated model that stems from MT is always the one shown in Figure 4-6.
Figure 4-6. MT: DAG of the largest true model used in simulations.
The results are summarized in Figure 4-7. Each point on the horizontal axis corresponds to the average for a given set of simulation conditions. Only labels for the SNR and sample size are included for clarity, but the results are also shown for the different values of the regression coefficients and the different true models considered. Additional details about the procedure and other results are included in the appendices.

4.5.1 SNR and Sample Size Effect
As expected, small sample sizes conditioned upon a small SNR impair the ability of the algorithm to detect true coefficients with both the EPP and HOP(1, ch), with this effect being greater when using the latter prior. However, considering the mean number of TPs jointly with the number of FPs, it is clear that although the number of TPs is especially low with HOP(1, ch), most of the few predictors that are discovered in fact belong to the true model. In comparison to the results with the EPP, in terms of FPs the HOP(1, ch) does better, and even more so when both the sample size and the SNR are smallest.

Figure 4-7. Average true positives (TP) and average false positives (FP) in all simulated scenarios for the median probability model with EPP and HOP(1, ch).

Finally, when either the SNR or the sample size is large, the performance in terms of TPs is similar between both priors, but the number of FPs is somewhat lower with the HOP.

4.5.2 Coefficient Magnitude
Three ways to allocate the amount of signal across predictors are considered. For the first choice, all coefficients contain the same amount of signal, regardless of their order. In the second, each order-one coefficient contains twice as much signal as any order-two coefficient, and four times as much as any order-three coefficient. Finally, each order-one coefficient contains half as much signal as any order-two coefficient, and a quarter of what any order-three coefficient has. These choices are denoted by β(1) = c(1o1, 1o2, 1o3), β(2) = c(1o1, 0.5o2, 0.25o3), and β(3) = c(0.25o1, 0.5o2, 1o3), respectively. In Figure 4-7, the first four scenarios correspond to simulations with β(1), the next four use β(2), the next four correspond to β(3), and then the values are cycled in the same way. The results show that scenarios using either β(1) or β(3) behave similarly, contrasting with the negative impact of having the highest signal in the order-one terms through β(2). In Figure 4-7, the effect of using β(2) is evident, as it corresponds to the lowest values for the TPs, regardless of the sample size, the SNR, or the prior used. This is an intuitive result, since giving more signal to higher-order terms makes it easier to detect higher-order terms, and consequently, by strong heredity, the algorithm will also select the corresponding lower-order terms included in the true model.

4.5.3 Special Points on the Scale
Four true models were considered: (1) the model from Figure 4-6 (MT1); (2) the model without the order-one terms (MT2); (3) the model without order-two terms (MT3); and (4) the model without x1^2 and x2x5 (MT4). The last three are clearly not well-formulated. In Figure 4-7, the leftmost point on the horizontal axis corresponds to scenarios with MT1; the next point is for scenarios with MT2, followed by those with MT3, then with MT4, then MT1, etc. In comparison to the EPP, the HOP(1, ch) tightly controls the inclusion of FPs by choosing smaller models, at the expense of also reducing the TP count, especially when there is more uncertainty about the true model (i.e., SNR = 0.25). For both prior structures, the results indicate that at low SNR levels the presence of special points has no apparent impact, as the selection behavior is similar between the four models in terms of both the TP and FP. An interesting observation is that the effect of having special points on the scale is vastly magnified whenever the coefficients that assign more weight to order-one terms (β(2)) are used.

4.6 Case Study: Ozone Data Analysis
This section uses the ozone data from Breiman & Friedman (1985) and follows the analysis performed by Liang et al. (2008), who investigated hyper-g priors. After removing observations with missing values, 330 observations remain, including daily measurements of maximum ozone concentration near Los Angeles and eight meteorological variables (Table 4-5). From the 330 observations, 165 were sampled at random without replacement and used to run the variable selection procedure; the remaining 165 were used for validation. The eight meteorological variables, their interactions, and their squared terms are used as predictors, resulting in a full model with 44 predictors. The model space assumes that the base model MB is the intercept-only model and that MF is the quadratic surface in the eight meteorological variables. The model space contains approximately 7.1 billion models, and computation of all model posterior probabilities is not feasible.
Table 4-5. Variables used in the analyses of the ozone contamination dataset

Name   Description
ozone  Daily maximum 1-hour-average ozone (ppm) at Upland, CA
vh     500-millibar pressure height (m) at Vandenberg AFB
wind   Wind speed (mph) at LAX
hum    Humidity (%) at LAX
temp   Temperature (F) measured at Sandburg, CA
ibh    Inversion base height (ft) at LAX
dpg    Pressure gradient (mm Hg) from LAX to Daggett, CA
vis    Visibility (miles) measured at LAX
ibt    Inversion base temperature (F) at LAX
The HOP, HUP, and HIP with a = 1 and b = ch, as well as the EPP, are considered for comparison purposes. To obtain the Bayes factors in equation 3–3, four different mixtures of g-priors are utilized: intrinsic priors (IP) (which yield the expression in equation 3–2), hyper-g (HG) priors (Liang et al., 2008) with hyperparameters α = 2, β = 1 and α = β = 1, and Zellner-Siow (ZS) priors (Zellner & Siow, 1980). The results were extracted for the median probability (MPM) models. Additionally, the model is estimated using the R package hierNet (Bien et al., 2013) to compare model selection results to those obtained using the hierarchical lasso (Bien et al., 2013), restricted to well-formulated models by imposing the strong heredity constraint. The procedures were assessed on the basis of their predictive accuracy on the validation dataset.

Among all models, the one that yields the smallest RMSE is the median probability model obtained using the HOP and EPP with the ZS prior, and also using the HOP with both HG priors (Table 4-6). The HOP model with the intrinsic prior has all the terms contained in the lowest-RMSE model, with the exception of dpg^2, which has a relatively high marginal inclusion probability of 0.46. This disparity between the IP and other mixtures of g-priors is explained by the fact that the IP induces less posterior shrinkage than the ZS and HG priors. The MPMs obtained through the HUP and HIP are nested in the best model, suggesting that these model space priors penalize complexity too much and result in false negatives. Consideration of these MPMs suggests that the HOP is best at producing true positives while controlling for false positives.

Finally, the model obtained from the hierarchical lasso (HierNet) is the largest model and produces the second-largest RMSE. All of the terms contained in any of the other models, except for vh, are nested within the hierarchical lasso model, and most of the terms that are exclusive to this model receive extremely low marginal inclusion probabilities under any of the model priors and parameter priors considered under Bayesian model selection.
Table 4-6. Median probability models (MPM) from different combinations of parameter and model priors vs. the model selected using the hierarchical lasso

BF     Prior    Model                                                  R^2     RMSE
IP     EPP      hum, dpg, ibt, hum^2, hum*dpg, hum*ibt, dpg^2, ibt^2   0.8054  4.2739
IP     HIP      hum, ibt, hum^2, hum*ibt, ibt^2                        0.7740  4.3396
IP     HOP      hum, dpg, ibt, hum^2, hum*ibt, ibt^2                   0.7848  4.3175
IP     HUP      hum, dpg, ibt, hum*ibt, ibt^2                          0.7767  4.3508
ZS     EPP      hum, dpg, ibt, hum^2, hum*ibt, dpg^2, ibt^2            0.7896  4.2518
ZS     HIP      hum, ibt, hum*ibt, ibt^2                               0.7525  4.3505
ZS     HOP      hum, dpg, ibt, hum^2, hum*ibt, dpg^2, ibt^2            0.7896  4.2518
ZS     HUP      hum, dpg, ibt, hum*ibt, ibt^2                          0.7767  4.3508
HG11   EPP      vh, hum, dpg, ibt, hum^2, hum*ibt, dpg^2               0.7701  4.3049
HG11   HIP      hum, ibt, hum*ibt, ibt^2                               0.7525  4.3505
HG11   HOP      hum, dpg, ibt, hum^2, hum*ibt, dpg^2, ibt^2            0.7896  4.2518
HG11   HUP      hum, dpg, ibt, hum*ibt, ibt^2                          0.7767  4.3508
HG21   EPP      hum, dpg, ibt, hum^2, hum*ibt, dpg^2                   0.7701  4.3037
HG21   HIP      hum, dpg, ibt, hum*ibt, ibt^2                          0.7767  4.3508
HG21   HOP      hum, dpg, ibt, hum^2, hum*ibt, dpg^2, ibt^2            0.7896  4.2518
HG21   HUP      hum, dpg, ibt, hum*ibt                                 0.7526  4.4036
       HierNet  hum, temp, ibh, dpg, ibt, vis, hum^2, hum*ibt,         0.7651  4.3680
                temp^2, temp*ibt, dpg^2
4.7 Discussion

Scott & Berger (2010) noted that the Ockham's-razor effect found automatically in Bayesian variable selection through the Bayes factor does not correct for multiple testing. The Bayes factor penalizes complexity of the alternative model according to the number of parameters in excess of those of the null model. Therefore, the Bayes factor only controls complexity in a pairwise fashion. If the model selection procedure uses equal prior probabilities for all M ∈ M, then these comparisons ignore the effect of the multiplicity of the testing problem. This is where the role of the prior on the model space becomes important. The multiplicity penalty is "hidden away" in the model prior probabilities π(M|M).
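The multiplicity penalty can be made concrete with the model-space prior advocated by Scott & Berger (2010), which is uniform over model sizes and then uniform over the models of each size. The sketch below (not tied to the WFM priors of this chapter) shows how the prior odds of any fixed single-predictor model against the null shrink as the number of candidate predictors p grows.

```python
from math import comb

def model_prior(k, p):
    """Scott & Berger (2010) multiplicity-correcting prior:
    uniform over sizes 0..p, then uniform over the C(p, k) size-k models."""
    return 1.0 / ((p + 1) * comb(p, k))

# Prior odds of one fixed size-1 model against the null model: 1/p
for p in (5, 20, 100):
    print(p, model_prior(1, p) / model_prior(0, p))
```

The odds decay like 1/p, so a larger pool of candidate predictors automatically demands stronger data evidence before any one of them is included.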
In addition to the multiplicity of the testing problem, disregarding the hierarchical polynomial structure of the predictors in model selection procedures has the potential to lead to different results according to how the predictors are set up (e.g., in what units these predictors are expressed).

In this chapter we investigated a solution to these two issues. We define prior structures for well-formulated models and develop random walk algorithms to traverse this type of model space. The key to understanding prior distributions on the space of WFMs is the hierarchical nature of the model space itself. The prior distributions described take advantage of that hierarchy in two ways. First, conditional independence and immediate inheritance are used to develop the HOP, HIP, and HUP structures discussed in Section 4.3. Second, the conditional nature of the priors allows for the direct incorporation of complexity penalizations. Of the priors proposed, the HOP using the hyperparameter choice (1, ch) provides the best control of false positives while maintaining a reasonable true positive rate. Thus, this prior is recommended as the default prior on the space of WFMs.
In the near future, the software developed to carry out a Metropolis-Hastings random walk on the space of WFMs will be integrated into the R package varSelectIP. These new functions implement various local priors for the regression coefficients, including the intrinsic prior, Zellner-Siow prior, and hyper g-priors. In addition, the software supports the computation of credible sets for each regression coefficient, conditioned on the selected model as well as under model averaging.
CHAPTER 5
CONCLUSIONS
Ecologists are now embracing the use of Bayesian methods to investigate the interactions that dictate the distribution and abundance of organisms. These tools are both powerful and flexible: they allow integrating, under a single methodology, empirical observations and theoretical process models, and can seamlessly account for several sources of uncertainty and dependence. The estimation and testing methods proposed throughout the document will contribute to the understanding of Bayesian methods used in ecology, and hopefully these will shed light on the differences between Bayesian estimation and testing tools.

All of our contributions exploit the potential of the latent variable formulation. This approach greatly simplifies the analysis of complex models; it redirects the bulk of the inferential burden away from the original response variables and places it on the easy-to-work-with latent scale, for which several time-tested approaches are available. Our methods are distinctly classified into estimation and testing tools.
For estimation, we proposed a Bayesian specification of the single-season occupancy model for which a Gibbs sampler is available using both logit and probit link functions. This setup allows detection and occupancy probabilities to depend on linear combinations of predictors. Then we developed a dynamic version of this approach, incorporating the notion that occupancy at a previously occupied site depends both on survival of current settlers and on habitat suitability. Additionally, because these dynamics also vary in space, we suggest a strategy to add spatial dependence among neighboring sites.
Ecological inquiry usually entails competing explanations, and uncertainty surrounds the decision of choosing any one of them. Hence, a model, or a set of probable models, should be selected from all the viable alternatives. To address this testing problem, we proposed an objective and fully automatic Bayesian methodology for the single-season site-occupancy model. Our approach relies on the intrinsic prior, which prevents introducing (commonly unavailable) subjective information into the model. In simulation experiments, we observed that the methods accurately single out the predictors present in the true model using the marginal posterior inclusion probabilities of the predictors. For predictors in the true model, these probabilities were comparatively larger than those for predictors not present in the true model. Also, the simulations indicated that the method provides better discrimination for predictors in the detection component of the model.
In our simulations and in the analysis of the Blue Hawker data, we observed that the effect of using the multiplicity correction prior was substantial. This occurs because the Bayes factor only penalizes complexity of the alternative model according to its number of parameters in excess of those of the null model. As the number of predictors grows, the number of models in the model space also grows, increasing the chances of making false positive decisions on the inclusion of predictors. This is where the role of the prior on the model space becomes important. The multiplicity penalty is "hidden away" in the model prior probabilities π(M|M). In addition to the multiplicity of the testing problem, disregarding the hierarchical polynomial structure in the predictors in model selection procedures has the potential to lead to different results according to how the predictors are coded (e.g., in what units these predictors are expressed).
To confront this situation, we propose three prior structures for well-formulated models that take advantage of the hierarchical structure of the predictors. Of the priors proposed, we recommend the HOP using the hyperparameter choice (1, ch), which provides the best control of false positives while maintaining a reasonable true positive rate.
Overall, considering the flexibility of the latent approach, several other extensions of these methods follow. Currently we envision three future developments: (1) occupancy models that incorporate various sources of information; (2) multi-species models that make use of spatial and interspecific dependence; and (3) methods to conduct model selection for the dynamic and spatially explicit version of the model.
APPENDIX A
FULL CONDITIONAL DENSITIES: DYMOSS

In this section we introduce the full conditional probability density functions for all the parameters involved in the DYMOSS model, using probit as well as logit links.
Sampler Z

The full conditionals corresponding to the presence indicators have the same form regardless of the link used. These are derived separately for the cases t = 1, 1 < t < T, and t = T, since their corresponding probabilities take on slightly different forms.

Let φ(ν | µ, σ²) represent the density for a normal random variable ν with mean µ and variance σ², and recall that ψ_{i1} = F(x′_{(o)i} α) and p_{ijt} = F(q′_{ijt} λ_t), where F(·) is the inverse link function. The full conditional for z_{it} is given by:

1. For t = 1:

π(z_{i1} | v_{i1}, α, λ_1, β^c_1, δ^s_1) = (ψ*_{i1})^{z_{i1}} (1 − ψ*_{i1})^{1−z_{i1}} = Bernoulli(ψ*_{i1}),    (A–1)

where

ψ*_{i1} = [ψ_{i1} φ(v_{i1} | x′_{i1} β^c_1 + δ^s_1, 1) ∏_{j=1}^{J_{i1}} (1 − p_{ij1})] /
          [ψ_{i1} φ(v_{i1} | x′_{i1} β^c_1 + δ^s_1, 1) ∏_{j=1}^{J_{i1}} (1 − p_{ij1}) + (1 − ψ_{i1}) φ(v_{i1} | x′_{i1} β^c_1, 1) ∏_{j=1}^{J_{i1}} I{y_{ij1} = 0}].

2. For 1 < t < T:

π(z_{it} | z_{i(t−1)}, z_{i(t+1)}, λ_t, β^c_{t−1}, δ^s_{t−1}) = (ψ*_{it})^{z_{it}} (1 − ψ*_{it})^{1−z_{it}} = Bernoulli(ψ*_{it}),    (A–2)

where

ψ*_{it} = κ_{it} ∏_{j=1}^{J_{it}} (1 − p_{ijt}) /
          [κ_{it} ∏_{j=1}^{J_{it}} (1 − p_{ijt}) + ∇_{it} ∏_{j=1}^{J_{it}} I{y_{ijt} = 0}],

with

(a) κ_{it} = F(x′_{i(t−1)} β^c_{t−1} + z_{i(t−1)} δ^s_{t−1}) φ(v_{it} | x′_{it} β^c_t + δ^s_t, 1), and
(b) ∇_{it} = (1 − F(x′_{i(t−1)} β^c_{t−1} + z_{i(t−1)} δ^s_{t−1})) φ(v_{it} | x′_{it} β^c_t, 1).

3. For t = T:

π(z_{iT} | z_{i(T−1)}, λ_T, β^c_{T−1}, δ^s_{T−1}) = (ψ*_{iT})^{z_{iT}} (1 − ψ*_{iT})^{1−z_{iT}} = Bernoulli(ψ*_{iT}),    (A–3)

where

ψ*_{iT} = κ*_{iT} ∏_{j=1}^{J_{iT}} (1 − p_{ijT}) /
          [κ*_{iT} ∏_{j=1}^{J_{iT}} (1 − p_{ijT}) + ∇*_{iT} ∏_{j=1}^{J_{iT}} I{y_{ijT} = 0}],

with

(a) κ*_{iT} = F(x′_{i(T−1)} β^c_{T−1} + z_{i(T−1)} δ^s_{T−1}), and
(b) ∇*_{iT} = 1 − F(x′_{i(T−1)} β^c_{T−1} + z_{i(T−1)} δ^s_{T−1}).
Sampler u_i

π(u_i | z_{i1}, α) = tr N(x′_{(o)i} α, 1, trunc(z_{i1})),    (A–4)

where

trunc(z_{i1}) = (−∞, 0] if z_{i1} = 0, and (0, ∞) if z_{i1} = 1,

and tr N(µ, σ², A) denotes the pdf of a truncated normal random variable with mean µ, variance σ², and truncation region A.
Sampler α

π(α | u) ∝ [α] ∏_{i=1}^{N} φ(u_i | x′_{(o)i} α, 1).    (A–5)

If [α] ∝ 1, then

α | u ∼ N(m(α), Σ_α),

with m(α) = Σ_α X′_{(o)} u and Σ_α = (X′_{(o)} X_{(o)})^{−1}.
Sampler v_it

1. (For t > 1)

π(v_{i(t−1)} | z_{i(t−1)}, z_{it}, β^c_{t−1}, δ^s_{t−1}) = tr N(µ^{(v)}_{i(t−1)}, 1, trunc(z_{it})),    (A–6)

where µ^{(v)}_{i(t−1)} = x′_{i(t−1)} β^c_{t−1} + z_{i(t−1)} δ^s_{t−1}, and trunc(z_{it}) defines the corresponding truncation region given by z_{it}.
Sampler (β^c_{t−1}, δ^s_{t−1})

1. (For t > 1)

π(β^c_{t−1}, δ^s_{t−1} | v_{t−1}, z_{t−1}) ∝ [β^c_{t−1}, δ^s_{t−1}] ∏_{i=1}^{N} φ(v_{i(t−1)} | x′_{i(t−1)} β^c_{t−1} + z_{i(t−1)} δ^s_{t−1}, 1).    (A–7)

If [β^c_{t−1}, δ^s_{t−1}] ∝ 1, then

β^c_{t−1}, δ^s_{t−1} | v_{t−1}, z_{t−1} ∼ N(m(β^c_{t−1}, δ^s_{t−1}), Σ_{t−1}),

with m(β^c_{t−1}, δ^s_{t−1}) = Σ_{t−1} X̃′_{t−1} v_{t−1} and Σ_{t−1} = (X̃′_{t−1} X̃_{t−1})^{−1}, where X̃_{t−1} = (X_{t−1}, z_{t−1}).
Sampler w_ijt

1. (For t > 1 and z_{it} = 1)

π(w_{ijt} | z_{it} = 1, y_{ijt}, λ) = tr N(q′_{ijt} λ_t, 1, trunc(y_{ijt})).    (A–8)
Sampler λ_t

1. (For t = 1, 2, ..., T)

π(λ_t | z_t, w_t) ∝ [λ_t] ∏_{i: z_{it}=1} ∏_{j=1}^{J_{it}} φ(w_{ijt} | q′_{ijt} λ_t, 1).    (A–9)

If [λ_t] ∝ 1, then

λ_t | w_t, z_t ∼ N(m(λ_t), Σ_{λ_t}),

with m(λ_t) = Σ_{λ_t} Q′_t w_t and Σ_{λ_t} = (Q′_t Q_t)^{−1}, where Q_t and w_t, respectively, are the design matrix and the vector of latent variables for surveys of sites such that z_{it} = 1.
APPENDIX B
RANDOM WALK ALGORITHMS
Global Jump. From the current state M, the global jump is performed by drawing a model M′ at random from the model space. This is achieved by beginning at the base model and increasing the order from J^min_M to J^max_M, the minimum and maximum orders of nodes in MF \ MB; at each order, a set of nodes is selected at random from the prior, conditioned on the nodes already in the model. The MH correction is

α = min{ 1, m(y | M′, M) / m(y | M, M) }.
Local Jump. From the current state M, the local jump is performed by drawing a model from the set of models L(M) = {M_α : α ∈ E(M) ∪ C(M)}, where M_α is M \ {α} for α ∈ E(M) and M ∪ {α} for α ∈ C(M). The proposal probabilities for the model are computed as a mixture of p(M′ | y, M, M′ ∈ L(M)) and the discrete uniform distribution. The proposal kernel is

q(M′ | y, M, M′ ∈ L(M)) = (1/2) ( p(M′ | y, M, M′ ∈ L(M)) + 1/|L(M)| ).

This choice promotes moving to better models while maintaining a non-negligible probability of moving to any of the possible models. The MH correction is

α = min{ 1, [m(y | M′, M) / m(y | M, M)] · [q(M | y, M′, M ∈ L(M′)) / q(M′ | y, M, M′ ∈ L(M))] }.
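A sketch of this local-jump proposal in Python, with the neighborhood L(M) and the restricted posterior supplied by the caller: here `neighbors` should return the single-node additions/deletions that remain well-formulated, and `weight` an unnormalized posterior score. Both are stand-ins, not the dissertation's implementation.

```python
import random

def local_proposal(model, neighbors, weight, rng=random):
    """Draw M' from L(M) using the mixture kernel
    q(M' | y, M) = 0.5 * p(M' | y, M' in L(M)) + 0.5 / |L(M)|."""
    L = neighbors(model)
    w = [weight(m) for m in L]                     # unnormalized posterior scores
    tot = sum(w)
    q = [0.5 * wi / tot + 0.5 / len(L) for wi in w]
    idx = rng.choices(range(len(L)), weights=q, k=1)[0]
    return L[idx], q[idx]

def mh_accept(model, proposal, q_fwd, q_rev, weight, rng=random):
    """MH correction alpha = min{1, [p(M')/p(M)] * [q(M|M') / q(M'|M)]}."""
    ratio = (weight(proposal) / weight(model)) * (q_rev / q_fwd)
    return rng.random() < min(1.0, ratio)
```

Because each proposal probability is at least 0.5/|L(M)|, every neighboring model keeps a non-negligible chance of being visited, exactly as the mixture above intends.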
Intermediate Jump. The intermediate jump is performed by increasing or decreasing the order of the nodes under consideration, performing local proposals based on order. For a model M′, define Lj(M′) = {M′} ∪ {M′_α : α ∈ (E(M′) ∪ C(M′)) ∩ j(MF)}. From a state M, the kernel chooses at random whether to increase or decrease the order. If M = MF, then decreasing the order is chosen with probability 1, and if M = MB, then increasing the order is chosen with probability 1; in all other cases, the probability of increasing and of decreasing order is 1/2. The proposal kernels are given by:
Increasing order proposal kernel:

1. Set j = J^min_M − 1 and M′_j = M.

2. Draw M′_{j+1} from q_inc,j+1(· | y, M, M′ ∈ Lj+1(M′_j)), where
   q_inc,j+1(M′ | y, M, M′ ∈ Lj+1(M′_j)) = (1/2) ( p(M′ | y, M, M′ ∈ Lj+1(M′_j)) + 1/|Lj+1(M′_j)| ).

3. Set j = j + 1.

4. If j < J^max_M, then return to 2; otherwise proceed to 5.

5. Set M′ = M′_{J^max_M} and compute the proposal probability

q_inc(M′ | y, M) = ∏_{j=J^min_M − 1}^{J^max_M − 1} q_inc,j+1(M′_{j+1} | y, M, M′ ∈ Lj+1(M′_j)).    (B–1)
Decreasing order proposal kernel:

1. Set j = J^max_M + 1 and M′_j = M.

2. Draw M′_{j−1} from q_dec,j−1(· | y, M, M′ ∈ Lj−1(M′_j)), where
   q_dec,j−1(M′ | y, M, M′ ∈ Lj−1(M′_j)) = (1/2) ( p(M′ | y, M, M′ ∈ Lj−1(M′_j)) + 1/|Lj−1(M′_j)| ).

3. Set j = j − 1.

4. If j > J^min_M, then return to 2; otherwise proceed to 5.

5. Set M′ = M′_{J^min_M} and compute the proposal probability

q_dec(M′ | y, M) = ∏_{j=J^max_M + 1}^{J^min_M + 1} q_dec,j−1(M′_{j−1} | y, M, M′ ∈ Lj−1(M′_j)).    (B–2)
If increasing order is chosen, then the MH correction is given by

α = min{ 1, [(1 + I(M′ = MF)) / (1 + I(M = MB))] · [q_dec(M | y, M′) / q_inc(M′ | y, M)] · [p(M′ | y, M) / p(M | y, M)] },    (B–3)

and similarly if decreasing order is chosen.
Other Local and Intermediate Kernels. The local and intermediate kernels described here perform a kind of stochastic forwards-backwards selection. Each kernel q can be relaxed to allow more than one node to be turned on or off at each step, which could provide larger jumps for each of these kernels. The tradeoff is that the number of proposed models for such jumps could be very large, precluding the use of posterior information in the construction of the proposal kernel.
APPENDIX C
WFM SIMULATION DETAILS
Briefly, the idea is to let Z_{MT}(X) β_{MT} = (QR) β_{MT} = Q η_{MT} (i.e., β_{MT} = R^{−1} η_{MT}), using the QR decomposition. As such, setting all values in η_{MT} proportional to one corresponds to distributing the signal in the model uniformly across all predictors, regardless of their order.

The (unconditional) variance of a single observation y_i is var(y_i) = var(E[y_i | z_i]) + E[var(y_i | z_i)], where z_i is the i-th row of the design matrix Z_{MT}. Hence, we take the signal-to-noise ratio for each observation to be

SNR(η) = η′_{MT} R^{−T} Σ_z R^{−1} η_{MT} / σ²,

where Σ_z = var(z_i). We determine how the signal is distributed across predictors up to a proportionality constant, to be able to control simultaneously the signal-to-noise ratio.
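This allocation can be sketched numerically: given a design Z, a direction for η, and a target SNR k, rescale η so that SNR(η) equals k and recover β = R⁻¹η. The function below is an illustrative reconstruction (names are ours), not the code used for the study.

```python
import numpy as np

def scale_to_snr(Z, eta_dir, snr, sigma2=1.0):
    """Rescale the signal direction eta_dir so that
    SNR(eta) = eta' R^{-T} Sigma_z R^{-1} eta / sigma2 equals `snr`,
    and return (beta, eta) with beta = R^{-1} eta."""
    _, R = np.linalg.qr(Z)
    Sigma_z = np.cov(Z, rowvar=False)
    a = np.linalg.solve(R, eta_dir)             # R^{-1} eta_dir
    base = float(a @ Sigma_z @ a) / sigma2      # SNR of the unscaled direction
    eta = np.sqrt(snr / base) * eta_dir
    return np.linalg.solve(R, eta), eta
```

Scaling η by a constant c multiplies the quadratic form by c², so a single square root achieves the target SNR exactly while preserving how the signal is split across predictors.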
Additionally, to investigate the ability of the model to capture correctly the hierarchical structure, we specify four different 0-1 vectors that determine the predictors in MT, which generates the data in the different scenarios.
Table C-1. Experimental conditions, WFM simulations

Parameter           Values considered
SNR(η_{MT}) = k     0.25, 1, 4
η_{MT} ∝            (1, 1₃, 1₄, 1₂); (1, 1₃, (1/2)1₄, (1/4)1₂); (1, (1/4)1₃, (1/2)1₄, 1₂)
γ_{MT}              (1, 1₃, 1₄, 1₂); (1, 1₃, 1₄, 0₂); (1, 1₃, 0₄, 1₂); (1, 0₃, (0, 1, 1, 0), 1₂)
n                   130, 260, 1040

Here 1_d (0_d) denotes a vector of d ones (zeros).
The results presented below are somewhat different from those found in the main body of the article in Section 5. These are extracted by averaging the number of FPs, TPs, and model sizes, respectively, over the 100 independent runs and across the corresponding scenarios for the 20 highest-probability models.
SNR and Sample Size Effect
In terms of the SNR and the sample size (Figure C-1), we observe that, as expected, small sample sizes conditioned upon a small SNR impair the ability of the algorithm to detect true coefficients with both the EPP and HOP(1, ch), with this effect more noticeable when using the latter prior. However, considering the mean number of true positives (TP) jointly with the mean model size, it is clear that although the sensitivity is low, most of the few predictors that are discovered belong to the true model. The results observed with an SNR of 0.25 and a relatively small sample size are far from being impressive; however, real problems where the SNR is as low as 0.25 will yield many spurious associations under the EPP. The fact that the HOP(1, ch) has a strong protection against false positives is commendable in itself. An SNR of 1 also represents a feeble relationship between the predictors and the response; nonetheless, the method captures approximately half of the true coefficients while including very few false positives. Following intuition, as either the sample size or the SNR increases, the algorithm's performance is greatly enhanced. Either having a large sample size or a large SNR yields models that contain mostly true predictors. Additionally, HOP(1, ch) provides a strong control over the number of false positives; therefore, for high SNR or larger sample sizes, the number of predictors in the top 20 models is close to the size of the true model. In general, the EPP allows the detection of more TPs, while the HOP(1, ch) provides a stronger control on the number of FPs included when considering small sample sizes combined with small SNRs. As either sample size or SNR grows, the differences between the two priors become indistinct.
Figure C-1. SNR vs. n: average model size, average true positives, and average false positives for all simulated scenarios, by model ranking according to model posterior probabilities.
Coefficient Magnitude
This part of the experiment explores how the signal is distributed across
predictors. As mentioned before, sphering is used to assign the coefficient values
in a manner that controls the amount of signal that goes into each coefficient. Three
possible ways to allocate the signal are considered: first, each order-one coefficient
contains twice as much signal as any order-two coefficient and four times as much
as any order-three coefficient; second, all coefficients contain the same amount of
signal regardless of their order; and third, each order-one coefficient contains half
as much signal as any order-two coefficient and a quarter as much as any order-three
coefficient. In Figure C-2 these cases are denoted by β = c(1o1, 0.5o2, 0.25o3),
β = c(1o1, 1o2, 1o3), and β = c(0.25o1, 0.5o2, 1o3), respectively.
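The three allocation schemes amount to assigning relative weights per term order and rescaling them to a fixed signal budget. The sketch below is illustrative only (the simulations themselves set coefficient values through sphering); the function name and the per-order term counts are assumptions:

```python
def allocate_signal(order_weights, n_terms_per_order, total_signal):
    """Split a signal budget across coefficient orders.

    order_weights    : relative signal weight for each order, e.g. [1, 0.5, 0.25]
    n_terms_per_order: number of coefficients of each order in the true model
    total_signal     : overall signal budget to distribute

    Returns the per-coefficient signal for each order, rescaled so the
    weighted total equals the budget.
    """
    scale = total_signal / sum(w * n for w, n in zip(order_weights, n_terms_per_order))
    return [w * scale for w in order_weights]

# beta = c(1o1, 0.5o2, 0.25o3): order-one coefficients carry twice the
# signal of order-two and four times that of order-three coefficients.
per_coef = allocate_signal([1.0, 0.5, 0.25], [3, 3, 1], total_signal=1.0)
```

With this rescaling, changing the weight vector redistributes the same total signal across orders, which is exactly the comparison explored in Figure C-2.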
Observe that, using the HOP(1, ch), the number of FPs is insensitive to how the SNR
is distributed across predictors; conversely, when using the EPP, the number of FPs
decreases as the SNR grows, always remaining slightly higher than that obtained
with the HOP. With either prior structure, the algorithm performs better whenever all
coefficients are equally weighted or when the order-three terms carry higher
weights. In these two cases (i.e., with β = c(1o1, 1o2, 1o3) or β = c(0.25o1, 0.5o2, 1o3)),
the effect of the SNR appears to be similar. In contrast, when more weight is given to
order-one terms, the algorithm yields slightly worse models at any SNR level. This is an
intuitive result: giving more signal to higher-order terms makes them easier to detect,
and consequently, by strong heredity, the algorithm will also select
the corresponding lower-order terms included in the true model.
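Strong heredity, invoked in the explanation above, requires that every included term's lower-order parents also be included. Representing each term as a tuple of exponents makes this easy to check; the following is an illustrative sketch, not the dissertation's implementation:

```python
def parents(term):
    """Lower-order parents of a term given as an exponent tuple.

    E.g. over (x1, x2, x5), the term x1*x2 is (1, 1, 0) and its
    parents are x1 = (1, 0, 0) and x2 = (0, 1, 0).
    """
    return [term[:i] + (e - 1,) + term[i + 1:]
            for i, e in enumerate(term) if e > 0]

def is_well_formulated(model):
    """True if every term's parents are in the model (strong heredity)."""
    p = len(next(iter(model)))
    terms = set(model) | {(0,) * p}   # the intercept is always present
    return all(par in terms for t in terms for par in parents(t))

# x1*x2*x5 requires x1*x2, x1*x5, x2*x5, which in turn require x1, x2, x5.
wfm = {(1, 0, 0), (0, 1, 0), (0, 0, 1),
       (1, 1, 0), (1, 0, 1), (0, 1, 1), (1, 1, 1)}
```

Under this test, dropping any order-two interaction from `wfm` while keeping the three-way interaction violates strong heredity, which is why the random walk cannot visit such models.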
Special Points on the Scale
In Nelder (1998), the author argues that the conditions under which the
weak-heredity principle can be used for model selection are so restrictive that the
principle is commonly not valid in practice. In addition, the author states
that considering only well-formulated models does not take into account the possible
presence of special points on the scales of the predictors, that is, situations where
omitting lower-order terms is justified by the nature of the data. However, it is our
contention that every model has an underlying well-formulated structure; whether or not
some predictor has special points on its scale will be determined through the estimation
of the coefficients once a valid well-formulated structure has been chosen.
To understand how the algorithm behaves whenever the true data-generating
mechanism has zero-valued coefficients for some lower-order terms in the hierarchy,
four different true models are considered. Three of them are not well formulated, while
the remaining one is the WFM shown in Figure 4-6. The three models that have special
points correspond to the same model MT from Figure 4-6 but have, respectively,
zero-valued coefficients for all the order-one terms, for all the order-two terms, and for
x1^2 and x2x5.

Figure C-2. SNR vs. coefficient values. Average model size, average true positives, and average false positives for all simulated scenarios, by model ranking according to model posterior probabilities.
As seen before, in comparison with the EPP, the HOP(1, ch) tightly controls the
inclusion of FPs by choosing smaller models, at the expense of also reducing the TP
count, especially when there is more uncertainty about the true model (i.e., SNR = 0.25).
For both prior structures, the results in Figure C-3 indicate that at low SNR levels the
presence of special points has no apparent impact, as the selection behavior is similar
across the four models in terms of both TPs and FPs. As the SNR increases, the
TPs and the model size are affected for true models with zero-valued lower-order
terms. These differences, however, are not very large. Relatively smaller models are
selected whenever some terms in the hierarchy are missing, but with high SNR, which
is where the differences are most pronounced, the predictors included are mostly true
coefficients. The impact is almost imperceptible for the true model that lacks order-one
terms and for the model with zero coefficients for x1^2 and x2x5, and is more visible for
models without order-two terms. This last result is expected due to strong heredity:
whenever the order-one coefficients are missing, the inclusion of order-two and
order-three terms will force their selection, which is also the case when only a few
order-two terms have zero-valued coefficients. Conversely, when all order-two predictors
are removed, some order-three predictors are not selected, as their signal is attributed
to the order-two predictors missing from the true model. This is especially the case for
the order-three interaction term x1x2x5, which depends on the inclusion of the three
order-two terms (x1x2, x1x5, x2x5) in order for it to be included as well. This makes the
inclusion of this term somewhat more challenging: the three order-two interactions
capture most of the variation of the polynomial terms that is present when the
order-three term is also included. However, special points on the scale commonly occur
on a single or at most a few covariates. A true data-generating mechanism that removes
all terms of a given order in the context of polynomial models is clearly not justified;
this was only done for comparison purposes.

Figure C-3. SNR vs. different true models MT. Average model size, average true positives, and average false positives for all simulated scenarios, by model ranking according to model posterior probabilities.
APPENDIX D
SUPPLEMENTARY INFORMATION FOR THE OZONE DATA ANALYSIS
The covariates considered for the ozone data analysis match those used in Liang
et al. (2008); these are displayed in Table D-1 below.
Table D-1. Variables used in the analyses of the ozone contamination dataset
Name   Description
ozone  Daily max 1-hr-average ozone (ppm) at Upland, CA
vh     500-millibar pressure height (m) at Vandenberg AFB
wind   Wind speed (mph) at LAX
hum    Humidity (%) at LAX
temp   Temperature (°F) measured at Sandburg, CA
ibh    Inversion base height (ft) at LAX
dpg    Pressure gradient (mm Hg) from LAX to Daggett, CA
vis    Visibility (miles) measured at LAX
ibt    Inversion base temperature (°F) at LAX
The marginal posterior inclusion probability is the probability that a given term of the
full model MF is included, after summing over all models in the model space. For each
node α ∈ MF, this probability is given by p_α = Σ_{M ∈ M} 1(α ∈ M) p(M | y, M). In
problems with a large model space, such as the one considered for the ozone
concentration problem, enumeration of the entire space is not feasible; thus, these
probabilities are estimated by summing over every model drawn by the random walk on
the model space M.
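Estimating p_α from the random-walk output amounts to a weighted average of inclusion indicators over the visited models; a minimal sketch, assuming each visited model is stored as a set of term labels with an unnormalized posterior weight:

```python
def marginal_inclusion(sampled_models, weights):
    """Estimate p_alpha = sum_M 1(alpha in M) p(M | y) over visited models.

    sampled_models : list of models, each represented as a set of term labels
    weights        : unnormalized posterior weights, one per model
    """
    total = sum(weights)  # normalizing constant over the visited models
    probs = {}
    for model, w in zip(sampled_models, weights):
        for term in model:
            probs[term] = probs.get(term, 0.0) + w / total
    return probs

# Toy example with hypothetical weights (not the ozone results):
models = [{"hum", "ibt"}, {"hum", "dpg", "ibt"}, {"ibt"}]
p = marginal_inclusion(models, [0.5, 0.3, 0.2])
# p["ibt"] = 1.0, p["hum"] = 0.8, p["dpg"] = 0.3
```

Tables D-2 to D-5 report estimates of exactly this form, computed over all models drawn by the random walk under each prior combination.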
Given that there are 44 potential predictors in total, for convenience Tables D-2 to D-5
below display only the marginal posterior probabilities for the terms included under at least
one of the model priors considered (EPP, HIP, HUP, and HOP), for each of the parameter
priors utilized (intrinsic priors, Zellner-Siow priors, Hyper-g(11), and Hyper-g(21)).
Table D-2. Marginal inclusion probabilities, intrinsic prior
         EPP   HIP   HUP   HOP
hum      0.99  0.69  0.85  0.76
dpg      0.85  0.48  0.52  0.53
ibt      0.99  1.00  1.00  1.00
hum^2    0.76  0.51  0.43  0.62
hum·dpg  0.55  0.02  0.03  0.17
hum·ibt  0.98  0.69  0.84  0.75
dpg^2    0.72  0.36  0.25  0.46
ibt^2    0.59  0.78  0.57  0.81
Table D-3. Marginal inclusion probabilities, Zellner-Siow prior
         EPP   HIP   HUP   HOP
hum      0.76  0.67  0.80  0.69
dpg      0.89  0.50  0.55  0.58
ibt      0.99  1.00  1.00  1.00
hum^2    0.57  0.49  0.40  0.57
hum·ibt  0.72  0.66  0.78  0.68
dpg^2    0.81  0.38  0.31  0.51
ibt^2    0.54  0.76  0.55  0.77
Table D-4. Marginal inclusion probabilities, Hyper-g(11)
         EPP   HIP   HUP   HOP
vh       0.54  0.05  0.10  0.11
hum      0.81  0.67  0.80  0.69
dpg      0.90  0.50  0.55  0.58
ibt      0.99  1.00  0.99  0.99
hum^2    0.61  0.49  0.40  0.57
hum·ibt  0.78  0.66  0.78  0.68
dpg^2    0.83  0.38  0.30  0.51
ibt^2    0.49  0.76  0.54  0.77
Table D-5. Marginal inclusion probabilities, Hyper-g(21)
         EPP   HIP   HUP   HOP
hum      0.79  0.64  0.73  0.67
dpg      0.90  0.52  0.60  0.59
ibt      0.99  1.00  0.99  1.00
hum^2    0.60  0.47  0.37  0.55
hum·ibt  0.76  0.64  0.71  0.67
dpg^2    0.82  0.41  0.36  0.52
ibt^2    0.47  0.73  0.49  0.75
REFERENCES

Akaike, H. (1983). Information measures and model selection. Bulletin of the International Statistical Institute, 50, 277–290.

Albert, J. H. & Chib, S. (1993). Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association, 88(422), 669–679.

Berger, J. & Bernardo, J. (1992). On the development of reference priors. In Bayesian Statistics 4 (pp. 35–60).

Berger, J. & Pericchi, L. (1996). The intrinsic Bayes factor for model selection and prediction. Journal of the American Statistical Association, 91(433), 109–122.

Berger, J., Pericchi, L., & Ghosh, J. (2001). Objective Bayesian methods for model selection: introduction and comparison. In Model Selection, vol. 38 of IMS Lecture Notes – Monograph Series (pp. 135–207). Institute of Mathematical Statistics.

Besag, J., York, J., & Mollié, A. (1991). Bayesian image restoration, with two applications in spatial statistics. Annals of the Institute of Statistical Mathematics, 43, 1–20.

Bien, J., Taylor, J., & Tibshirani, R. (2013). A lasso for hierarchical interactions. The Annals of Statistics, 41(3), 1111–1141.

Breiman, L. & Friedman, J. (1985). Estimating optimal transformations for multiple regression and correlation. Journal of the American Statistical Association, 80, 580–598.

Brusco, M. J., Steinley, D., & Cradit, J. D. (2009). An exact algorithm for hierarchically well-formulated subsets in second-order polynomial regression. Technometrics, 51(3), 306–315.

Casella, G., Girón, F. J., Martínez, M. L., & Moreno, E. (2009). Consistency of Bayesian procedures for variable selection. The Annals of Statistics, 37(3), 1207–1228.

Casella, G., Moreno, E., & Girón, F. (2014). Cluster analysis, model selection, and prior distributions on models. Bayesian Analysis, 1–46 (to appear).
Chipman, H. (1996). Bayesian variable selection with related predictors. Canadian Journal of Statistics, 24(1), 17–36.

Clyde, M. & George, E. I. (2004). Model uncertainty. Statistical Science, 19(1), 81–94.

Dewey, J. (1958). Experience and Nature. New York: Dover Publications.

Dorazio, R. M. & Taylor-Rodríguez, D. (2012). A Gibbs sampler for Bayesian analysis of site-occupancy data. Methods in Ecology and Evolution, 3, 1093–1098.

Ellison, A. M. (2004). Bayesian inference in ecology. Ecology Letters, 7, 509–520.

Fiske, I. & Chandler, R. (2011). unmarked: an R package for fitting hierarchical models of wildlife occurrence and abundance. Journal of Statistical Software, 43(10).

George, E. (2000). The variable selection problem. Journal of the American Statistical Association, 95(452), 1304–1308.

Girón, F. J., Moreno, E., Casella, G., & Martínez, M. L. (2010). Consistency of objective Bayes factors for nonnested linear models and increasing model dimension. Revista de la Real Academia de Ciencias Exactas, Físicas y Naturales, Serie A, Matemáticas, 104(1), 57–67.

Good, I. J. (1950). Probability and the Weighing of Evidence. New York: Haffner.

Griepentrog, G. L., Ryan, J. M., & Smith, L. D. (1982). Linear transformations of polynomial regression models. American Statistician, 36(3), 171–174.

Gunel, E. & Dickey, J. (1974). Bayes factors for independence in contingency tables. Biometrika, 61, 545–557.

Hanski, I. (1994). A practical model of metapopulation dynamics. Journal of Animal Ecology, 63, 151–162.

Hooten, M. (2006). Hierarchical spatio-temporal models for ecological processes. Doctoral dissertation, University of Missouri-Columbia.

Hooten, M. B. & Hobbs, N. T. (2014). A guide to Bayesian model selection for ecologists. Ecological Monographs (in press).
Hughes, J. & Haran, M. (2013). Dimension reduction and alleviation of confounding for spatial generalized linear mixed models. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 75, 139–159.

Hurvich, C. M. & Tsai, C.-L. (1989). Regression and time series model selection in small samples. Biometrika, 76, 297–307.

Jeffreys, H. (1935). Some tests of significance, treated by the theory of probability. Proceedings of the Cambridge Philosophical Society, 31, 203–222.

Jeffreys, H. (1961). Theory of Probability (3rd ed.). London: Oxford University Press.

Johnson, D., Conn, P., Hooten, M., Ray, J., & Pond, B. (2013). Spatial occupancy models for large data sets. Ecology, 94(4), 801–808.

Kass, R. & Wasserman, L. (1995). A reference Bayesian test for nested hypotheses and its relationship to the Schwarz criterion. Journal of the American Statistical Association, 90(431).

Kass, R. E. & Raftery, A. E. (1995). Bayes factors. Journal of the American Statistical Association, 90, 773–795.

Kass, R. E. & Wasserman, L. (1996). The selection of prior distributions by formal rules. Journal of the American Statistical Association, 91(435), 1343.

Kéry, M. (2010). Introduction to WinBUGS for Ecologists: A Bayesian Approach to Regression, ANOVA, Mixed Models and Related Analyses (1st ed.). Academic Press.

Kéry, M., Gardner, B., & Monnerat, C. (2010). Predicting species distributions from checklist data using site-occupancy models. Journal of Biogeography, 37(10), 1851–1862.

Khuri, A. (2002). Nonsingular linear transformations of the control variables in response surface models. Technical report.

Krebs, C. J. (1972). Ecology: The Experimental Analysis of Distribution and Abundance.
Lempers, F. B. (1971). Posterior Probabilities of Alternative Linear Models. Rotterdam: University of Rotterdam Press.

León-Novelo, L., Moreno, E., & Casella, G. (2012). Objective Bayes model selection in probit models. Statistics in Medicine, 31(4), 353–365.

Liang, F., Paulo, R., Molina, G., Clyde, M. A., & Berger, J. O. (2008). Mixtures of g priors for Bayesian variable selection. Journal of the American Statistical Association, 103(481), 410–423.

Link, W. & Barker, R. (2009). Bayesian Inference with Ecological Applications. Elsevier.

MacKenzie, D. & Nichols, J. (2004). Occupancy as a surrogate for abundance estimation. Animal Biodiversity and Conservation, 1, 461–467.

MacKenzie, D., Nichols, J., & Hines, J. (2003). Estimating site occupancy, colonization, and local extinction when a species is detected imperfectly. Ecology, 84(8), 2200–2207.

MacKenzie, D. I., Bailey, L. L., & Nichols, J. D. (2004). Investigating species co-occurrence patterns when species are detected imperfectly. Journal of Animal Ecology, 73, 546–555.

MacKenzie, D. I., Nichols, J. D., Lachman, G. B., Droege, S., Royle, J. A., & Langtimm, C. A. (2002). Estimating site occupancy rates when detection probabilities are less than one. Ecology, 83(8), 2248–2255.

Mazerolle, M. (2013). Package 'AICcmodavg'.

McCullagh, P. & Nelder, J. A. (1989). Generalized Linear Models (2nd ed.). London, England: Chapman & Hall.

McQuarrie, A., Shumway, R., & Tsai, C.-L. (1997). The model selection criterion AICu.
Moreno, E., Bertolino, F., & Racugno, W. (1998). An intrinsic limiting procedure for model selection and hypotheses testing. Journal of the American Statistical Association, 93(444), 1451–1460.

Moreno, E., Girón, F. J., & Casella, G. (2010). Consistency of objective Bayes factors as the model dimension grows. The Annals of Statistics, 38(4), 1937–1952.

Nelder, J. A. (1977). A reformulation of linear models. Journal of the Royal Statistical Society, Series A (Statistics in Society), 140, 48–77.

Nelder, J. A. (1998). The selection of terms in response-surface models: how strong is the weak-heredity principle? American Statistician, 52(4), 315–318.

Nelder, J. A. (2000). Functional marginality and response-surface fitting. Journal of Applied Statistics, 27(1), 109–112.

Nichols, J., Hines, J., & MacKenzie, D. (2007). Occupancy estimation and modeling with multiple states and state uncertainty. Ecology, 88(6), 1395–1400.

Ovaskainen, O., Hottola, J., & Siitonen, J. (2010). Modeling species co-occurrence by multivariate logistic regression generates new hypotheses on fungal interactions. Ecology, 91(9), 2514–2521.

Peixoto, J. L. (1987). Hierarchical variable selection in polynomial regression models. American Statistician, 41(4), 311–313.

Peixoto, J. L. (1990). A property of well-formulated polynomial regression models. American Statistician, 44(1), 26–30.

Pericchi, L. R. (2005). Model selection and hypothesis testing based on objective probabilities and Bayes factors. In Handbook of Statistics. Elsevier.

Polson, N. G., Scott, J. G., & Windle, J. (2013). Bayesian inference for logistic models using Pólya-Gamma latent variables. Journal of the American Statistical Association, 108, 1339–1349.

Rao, C. R. & Wu, Y. (2001). On model selection. Vol. 38 of Lecture Notes – Monograph Series (pp. 1–57). Beachwood, OH: Institute of Mathematical Statistics.
Reich, B. J., Hodges, J. S., & Zadnik, V. (2006). Effects of residual smoothing on the posterior of the fixed effects in disease-mapping models. Biometrics, 62, 1197–1206.

Reiners, W. & Lockwood, J. (2009). Philosophical Foundations for the Practices of Ecology. Cambridge University Press.

Rigler, F. & Peters, R. (1995). Excellence in Ecology: Science and Limnology. Germany: Ecology Institute.

Robert, C., Chopin, N., & Rousseau, J. (2009). Harold Jeffreys's Theory of Probability revisited. Statistical Science, 24(2), 141–179.

Robert, C. P. (1993). A note on the Jeffreys-Lindley paradox. Statistica Sinica, 3, 601–608.

Royle, J. A. & Kéry, M. (2007). A Bayesian state-space formulation of dynamic occupancy models. Ecology, 88(7), 1813–1823.

Scott, J. & Berger, J. (2010). Bayes and empirical-Bayes multiplicity adjustment in the variable-selection problem. The Annals of Statistics.

Spiegelhalter, D. J. & Smith, A. F. M. (1982). Bayes factors for linear and log-linear models with vague prior information. Journal of the Royal Statistical Society, Series B, 44, 377–387.

Tierney, L. & Kadane, J. B. (1986). Accurate approximations for posterior moments and marginal densities. Journal of the American Statistical Association, 81, 82–86.

Tyre, A. J., Tenhumberg, B., Field, S. A., Niejalke, D., Parris, K., & Possingham, H. P. (2003). Improving precision and reducing bias in biological surveys: estimating false-negative error rates. Ecological Applications, 13(6), 1790–1801.

Waddle, J. H., Dorazio, R. M., Walls, S. C., Rice, K. G., Beauchamp, J., Schuman, M. J., & Mazzotti, F. J. (2010). A new parameterization for estimating co-occurrence of interacting species. Ecological Applications, 20, 1467–1475.

Wasserman, L. (2000). Bayesian model selection and model averaging. Journal of Mathematical Psychology, 44(1), 92–107.

Wilson, M., Iversen, E., Clyde, M. A., Schmidler, S. C., & Schildkraut, J. M. (2010). Bayesian model search and multilevel inference for SNP association studies. The Annals of Applied Statistics, 4(3), 1342–1364.

Womack, A. J., León-Novelo, L., & Casella, G. (2014). Inference from intrinsic Bayes procedures under model selection and uncertainty. Journal of the American Statistical Association.

Yuan, M., Joseph, V. R., & Zou, H. (2009). Structured variable selection and estimation. The Annals of Applied Statistics, 3(4), 1738–1757.

Zeller, K. A., Nijhawan, S., Salom-Pérez, R., Potosme, S. H., & Hines, J. E. (2011). Integrating occupancy modeling and interview data for corridor identification: a case study for jaguars in Nicaragua. Biological Conservation, 144(2), 892–901.

Zellner, A. & Siow, A. (1980). Posterior odds ratios for selected regression hypotheses. In Trabajos de Estadística y de Investigación Operativa (pp. 585–603).
BIOGRAPHICAL SKETCH
Daniel Taylor-Rodríguez was born in Bogotá, Colombia. He earned a BS
degree in economics from the Universidad de Los Andes (2004) and a Specialist
degree in statistics from the Universidad Nacional de Colombia. In 2009 he traveled
to Gainesville, Florida, to pursue a master's in statistics under the supervision of
George Casella. Upon completion, he started a PhD in interdisciplinary ecology with
a concentration in statistics, again under George Casella's supervision. After George's
passing, Linda Young and Nikolay Bliznyuk continued to oversee Daniel's mentorship.
He has accepted a joint postdoctoral fellowship at the Statistical and Applied
Mathematical Sciences Institute and the Department of Statistical Science at Duke
University.
- ACKNOWLEDGMENTS
- TABLE OF CONTENTS
- LIST OF TABLES
- LIST OF FIGURES
- ABSTRACT
- 1 GENERAL INTRODUCTION
  - 1.1 Occupancy Modeling
  - 1.2 A Primer on Objective Bayesian Testing
  - 1.3 Overview of the Chapters
- 2 MODEL ESTIMATION METHODS
  - 2.1 Introduction
    - 2.1.1 The Occupancy Model
    - 2.1.2 Data Augmentation Algorithms for Binary Models
  - 2.2 Single Season Occupancy
    - 2.2.1 Probit Link Model
    - 2.2.2 Logit Link Model
  - 2.3 Temporal Dynamics and Spatial Structure
    - 2.3.1 Dynamic Mixture Occupancy State-Space Model
    - 2.3.2 Incorporating Spatial Dependence
  - 2.4 Summary
- 3 INTRINSIC ANALYSIS FOR OCCUPANCY MODELS
  - 3.1 Introduction
  - 3.2 Objective Bayesian Inference
    - 3.2.1 The Intrinsic Methodology
    - 3.2.2 Mixtures of g-Priors
      - 3.2.2.1 Intrinsic priors
      - 3.2.2.2 Other mixtures of g-priors
  - 3.3 Objective Bayes Occupancy Model Selection
    - 3.3.1 Preliminaries
    - 3.3.2 Intrinsic Priors for the Occupancy Problem
    - 3.3.3 Model Posterior Probabilities
    - 3.3.4 Model Selection Algorithm
  - 3.4 Alternative Formulation
  - 3.5 Simulation Experiments
    - 3.5.1 Marginal Posterior Inclusion Probabilities for Model Predictors
    - 3.5.2 Summary Statistics for the Highest Posterior Probability Model
  - 3.6 Case Study: Blue Hawker Data Analysis
    - 3.6.1 Results: Variable Selection Procedure
    - 3.6.2 Validation for the Selection Procedure
  - 3.7 Discussion
- 4 PRIORS ON THE MODEL SPACE AND WELL-FORMULATED MODELS
  - 4.1 Introduction
  - 4.2 Setup for Well-Formulated Models
    - 4.2.1 Well-Formulated Model Spaces
  - 4.3 Priors on the Model Space
    - 4.3.1 Model Prior Definition
    - 4.3.2 Choice of Prior Structure and Hyper-Parameters
    - 4.3.3 Posterior Sensitivity to the Choice of Prior
  - 4.4 Random Walks on the Model Space
    - 4.4.1 Simple Pruning and Growing
    - 4.4.2 Degree Based Pruning and Growing
  - 4.5 Simulation Study
    - 4.5.1 SNR and Sample Size Effect
    - 4.5.2 Coefficient Magnitude
    - 4.5.3 Special Points on the Scale
  - 4.6 Case Study: Ozone Data Analysis
  - 4.7 Discussion
- 5 CONCLUSIONS
- A FULL CONDITIONAL DENSITIES DYMOSS
- B RANDOM WALK ALGORITHMS
- C WFM SIMULATION DETAILS
- D SUPPLEMENTARY INFORMATION FOR THE OZONE DATA ANALYSIS
- REFERENCES
- BIOGRAPHICAL SKETCH
LIST OF TABLES

1-1 Interpretation of BFji when contrasting Mj and Mi

3-1 Simulation control parameters, occupancy model selector

3-2 Comparison of average minOddsMPIP under scenarios having different numbers of sites (N=50, N=100) and under scenarios having different numbers of surveys per site (J=3, J=5), for the presence and detection components, using uniform and multiplicity-correction priors

3-3 Comparison of average minOddsMPIP for different levels of signal considered in the occupancy and detection probabilities, for the presence and detection components, using uniform and multiplicity-correction priors

3-4 Comparison between scenarios with 50 and 100 sites in terms of the average percentage of true positive and true negative terms over the highest probability models, for the presence and detection components, using uniform and multiplicity-correcting priors on the model space

3-5 Comparison between scenarios with 3 and 5 surveys per site in terms of the percentage of true positive and true negative predictors averaged over the highest probability models, for the presence and detection components, using uniform and multiplicity-correcting priors on the model space

3-6 Comparison between scenarios with different levels of signal in the occupancy component in terms of the percentage of true positive and true negative predictors averaged over the highest probability models, for the presence and detection components, using uniform and multiplicity-correcting priors on the model space

3-7 Comparison between scenarios with different levels of signal in the detection component in terms of the percentage of true positive and true negative predictors averaged over the highest probability models, for the presence and detection components, using uniform and multiplicity-correcting priors on the model space

3-8 Posterior probability for the five highest probability models in the presence component of the blue hawker data

3-9 Posterior probability for the five highest probability models in the detection component of the blue hawker data

3-10 MPIP, presence component

3-11 MPIP, detection component

3-12 Mean misclassification rate for HPMs and MPMs using uniform and multiplicity-correction model priors

4-1 Characterization of the full models MF and corresponding model spaces M considered in simulations

4-2 Mean number of false and true positives in 100 randomly generated datasets as the number of main effects increases from three to five predictors in a full quadratic model, under the equal probability prior (EPP), the hierarchical independence prior (HIP), the hierarchical order prior (HOP), and the hierarchical uniform prior (HUP)

4-3 Mean number of false and true positives in 100 randomly generated datasets as the maximum order of MF increases from two to four in a full model with three main effects, under the equal probability prior (EPP), the hierarchical independence prior (HIP), the hierarchical order prior (HOP), and the hierarchical uniform prior (HUP)

4-4 Mean number of false and true positives in 100 randomly generated datasets with unstructured or irregular model spaces, under the equal probability prior (EPP), the hierarchical independence prior (HIP), the hierarchical order prior (HOP), and the hierarchical uniform prior (HUP)

4-5 Variables used in the analyses of the ozone contamination dataset

4-6 Median probability models (MPM) from different combinations of parameter and model priors vs. the model selected using the hierarchical lasso

C-1 Experimental conditions, WFM simulations

D-1 Variables used in the analyses of the ozone contamination dataset

D-2 Marginal inclusion probabilities, intrinsic prior

D-3 Marginal inclusion probabilities, Zellner-Siow prior

D-4 Marginal inclusion probabilities, Hyper-g(11)

D-5 Marginal inclusion probabilities, Hyper-g(21)
LIST OF FIGURES

2-1 Graphical representation, occupancy model

2-2 Graphical representation, occupancy model after data augmentation

2-3 Graphical representation, multiseason model for a single site

2-4 Graphical representation, data-augmented multiseason model

3-1 Predictor MPIP averaged over scenarios with N=50 and N=100 sites, using uniform (U) and multiplicity correction (MC) priors

3-2 Predictor MPIP averaged over scenarios with J=3 and J=5 surveys per site, using uniform (U) and multiplicity correction (MC) priors

3-3 Predictor MPIP averaged over scenarios with the interaction between the number of sites and the surveys per site, using uniform (U) and multiplicity correction (MC) priors

3-4 Predictor MPIP averaged over scenarios with equal signal in the occupancy probabilities, using uniform (U) and multiplicity correction (MC) priors

3-5 Predictor MPIP averaged over scenarios with equal signal in the detection probabilities, using uniform (U) and multiplicity correction (MC) priors

4-1 Graphs of well-formulated polynomial models for p = 2

4-2 E(M) and C(M) in M defined by a quadratic surface in two main effects for model M = {1, x1, x1^2}

4-3 Graphical representation of assumptions on M defined by the quadratic surface in two main effects

4-4 Prior probabilities for the space of well-formulated models associated with the quadratic surface on two variables, where MB is taken to be the intercept-only model and (a,b) ∈ {(1,1), (1,ch)}

4-5 Prior probabilities for the space of well-formulated models associated with three main effects and one interaction term, where MB is taken to be the intercept-only model and (a,b) ∈ {(1,1), (1,ch)}

4-6 MT, DAG of the largest true model used in simulations

4-7 Average true positives (TP) and average false positives (FP) in all simulated scenarios for the median probability model with EPP and HOP(1,ch)

C-1 SNR vs. n: average model size, average true positives, and average false positives for all simulated scenarios, by model ranking according to model posterior probabilities

C-2 SNR vs. coefficient values: average model size, average true positives, and average false positives for all simulated scenarios, by model ranking according to model posterior probabilities

C-3 SNR vs. different true models MT: average model size, average true positives, and average false positives for all simulated scenarios, by model ranking according to model posterior probabilities
Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

OBJECTIVE BAYESIAN METHODS FOR OCCUPANCY MODEL ESTIMATION AND SELECTION

By

Daniel Taylor-Rodríguez

August 2014

Chair: Linda J. Young
Cochair: Nikolay Bliznyuk
Major: Interdisciplinary Ecology
The ecological literature contains numerous methods for conducting inference about
the dynamics that govern biological populations Among these methods occupancy
models have played a leading role during the past decade in the analysis of large
biological population surveys The flexibility of the occupancy framework has brought
about useful extensions for determining key population parameters which provide
insights about the distribution structure and dynamics of a population However the
methods used to fit the models and to conduct inference have gradually grown in
complexity leaving practitioners unable to fully understand their implicit assumptions
increasing the potential for misuse This motivated our first contribution We develop
a flexible and straightforward estimation method for occupancy models that provides
the means to directly incorporate temporal and spatial heterogeneity using covariate
information that characterizes habitat quality and the detectability of a species
Adding to the issue mentioned above studies of complex ecological systems now
collect large amounts of information To identify the drivers of these systems robust
techniques that account for test multiplicity and for the structure in the predictors are
necessary but unavailable for ecological models We develop tools to address this
methodological gap First working in an ldquoobjectiverdquo Bayesian framework we develop
the first fully automatic and objective method for occupancy model selection based
on intrinsic parameter priors. Moreover, for the general variable selection problem, we propose three sets of prior structures on the model space that correct for multiple testing, and a stochastic search algorithm that relies on the priors on the model space to account for the polynomial structure in the predictors.
CHAPTER 1
GENERAL INTRODUCTION
As with any other branch of science, ecology strives to grasp truths about the world that surrounds us, and in particular about nature. The objective truth sought by ecology may well be beyond our grasp; however, it is reasonable to think that, at least partially, "Nature is capable of being understood" (Dewey, 1958). We can observe and interpret nature to formulate hypotheses, which can then be tested against reality. Hypotheses that encounter little or no opposition when confronted with reality may become contextual versions of the truth, and may be generalized by scaling them spatially and/or temporally to delimit the bounds within which they are valid.
To formulate hypotheses accurately, and in a fashion amenable to scientific inquiry, not only must the point of view and assumptions considered be made explicit, but so must the object of interest, the properties of that object worthy of consideration, and the methods used in studying such properties (Reiners & Lockwood, 2009; Rigler & Peters, 1995). Ecology, as defined by Krebs (1972), is "the study of interactions that determine the distribution and abundance of organisms." This characterizes organisms and their interactions as the objects of interest to ecology, and prescribes distribution and abundance as relevant properties of these organisms.
With regard to the methods used to acquire ecological scientific knowledge, traditionally theoretical mathematical models (such as deterministic PDEs) have been used. However, naturally varying systems are imprecisely observed and, as such, are subject to multiple sources of uncertainty that must be explicitly accounted for. Because of this, the ecological scientific community has developed a growing interest in flexible and powerful statistical methods, among which Bayesian hierarchical models predominate. These methods rely on empirical observations and can accommodate fairly complex relationships between empirical observations and theoretical process models, while accounting for diverse sources of uncertainty (Hooten, 2006).
Bayesian approaches are now used extensively in ecological modeling; however, there are two issues of concern, one from the standpoint of ecological practitioners and another from the perspective of scientific ecological endeavors. First, Bayesian modeling tools require a considerable understanding of probability and statistical theory, leading practitioners to view them as black-box approaches (Kéry, 2010). Second, although Bayesian applications proliferate in the literature, in general there is a lack of awareness of the distinction between approaches specifically devised for testing and those for estimation (Ellison, 2004). Furthermore, there is a dangerous unfamiliarity with the proven risks of using tools designed for estimation in testing procedures, e.g., the use of flat priors in hypothesis testing (Berger & Pericchi, 1996; Berger et al., 2001; Kass & Raftery, 1995; Moreno et al., 1998; Robert et al., 2009; Robert, 1993).
Occupancy models have played a leading role during the past decade in large biological population surveys. The flexibility of the occupancy framework has allowed the development of useful extensions to determine several key population parameters, which provide robust notions of the distribution, structure, and dynamics of a population. In order to address some of the concerns stated in the previous paragraph, we concentrate on the occupancy framework to develop estimation and testing tools that will allow ecologists, first, to gain insight about the estimation procedure and, second, to conduct statistically sound model selection for site-occupancy data.
1.1 Occupancy Modeling
Since MacKenzie et al. (2002) and Tyre et al. (2003) introduced the site-occupancy framework, countless applications and extensions of the method have been developed in the ecological literature, as evidenced by the 438,000 hits on Google Scholar for a search of "occupancy model". This class of models acknowledges that techniques used to conduct biological population surveys are prone to detection errors: if an individual is detected, it must be present, while if it is not detected, it might or might not be. Occupancy models improve upon traditional binary regression by accounting
for observed detection and partially observed presence as two separate but related components. In the site-occupancy setting, the chosen locations are surveyed repeatedly in order to reduce the ambiguity caused by the observed zeros. This approach therefore allows probabilities of both presence (occurrence) and detection to be estimated.
The uses of site-occupancy models are many. For example, metapopulation and island biogeography models are often parameterized in terms of site (or patch) occupancy (Hanski, 1992, 1994, 1997, as cited in MacKenzie et al. (2003)), and occupancy may be used as a surrogate for abundance to answer questions regarding geographic distribution, range size, and metapopulation dynamics (MacKenzie et al., 2004; Royle & Kéry, 2007).
The basic occupancy framework, which assumes a single closed population with fixed probabilities through time, has proven to be quite useful; however, it might be of limited utility when addressing some problems. In particular, the assumptions of the basic model may become too restrictive or unrealistic whenever the study period extends throughout multiple years or seasons, especially given the increasingly changing environmental conditions that most ecosystems are currently experiencing.
Among the extensions found in the literature, one that we consider particularly relevant incorporates heterogeneous occupancy probabilities through time. Models that incorporate temporally varying probabilities stem from important metapopulation notions provided by Hanski (1994), such as occupancy probabilities depending on local colonization and local extinction processes. In spite of the conceptual usefulness of Hanski's model, several strong and untenable assumptions (e.g., all patches being homogeneous in quality) are required for it to provide practically meaningful results.

A more viable alternative, which builds on Hanski (1994), is the extension of the single-season occupancy model due to MacKenzie et al. (2003). In this model, the heterogeneity of occupancy probabilities across seasons arises from local colonization
and extinction processes. This model is flexible enough to allow detection, occurrence, extinction, and colonization probabilities to each depend upon its own set of covariates. Model parameters are obtained through likelihood-based estimation.
Using a maximum likelihood approach presents two drawbacks. First, the uncertainty assessment for maximum likelihood parameter estimates relies on asymptotic results, which are obtained from implementation of the delta method, making it sensitive to sample size. Second, to obtain parameter estimates, the latent process (occupancy) is marginalized out of the likelihood, leading to the usual zero-inflated Bernoulli model. Although this is a convenient strategy for solving the estimation problem, after integrating out the latent state variables (occupancy indicators) they are no longer available; therefore, finite sample estimates cannot be calculated directly. Instead, a supplementary parametric bootstrapping step is necessary. Further, additional structure, such as temporal or spatial variation, cannot be introduced by means of random effects (Royle & Kéry, 2007).
1.2 A Primer on Objective Bayesian Testing
With the advent of high-dimensional data, such as that found in modern problems in ecology, genetics, physics, etc., coupled with evolving computing capability, objective Bayesian inferential methods have gained increasing popularity. This, however, is by no means a new approach to the way Bayesian inference is conducted. In fact, starting with Bayes and Laplace and continuing for almost 200 years, Bayesian analysis was primarily based on "noninformative" priors (Berger & Bernardo, 1992).
Now, subjective elicitation of prior probabilities in Bayesian analysis is widely recognized as the ideal (Berger et al., 2001); however, it is often the case that the available information is insufficient to specify appropriate prior probabilistic statements. Commonly, as in model selection problems where large model spaces have to be explored, the number of model parameters is prohibitively large, preventing one from eliciting prior information for the entire parameter space. As a consequence, in practice,
the determination of priors through the definition of structural rules has become the alternative to subjective elicitation for a variety of problems in Bayesian testing. Priors arising from these rules are known in the literature as noninformative, objective, default, or reference priors. Many of these connotations generate controversy and are accused, perhaps rightly, of providing a false pretension of objectivity. Nevertheless, we will avoid that discussion and refer to them herein interchangeably as noninformative or objective priors, to convey the sense that no attempt to introduce an informed opinion is made in defining prior probabilities.
A plethora of "noninformative" methods has been developed in the past few decades (see Berger & Bernardo (1992); Berger & Pericchi (1996); Berger et al. (2001); Clyde & George (2004); Kass & Wasserman (1995, 1996); Liang et al. (2008); Moreno et al. (1998); Spiegelhalter & Smith (1982); Wasserman (2000); and the references therein). We find particularly interesting those derived from the model structure, in which no tuning parameters are required, especially since these can be regarded as automatic methods. Among them, methods based on the Bayes factor for intrinsic priors have proven their worth in a variety of inferential problems, given their excellent performance, flexibility, and ease of use. This class of priors is discussed in detail in Chapter 3. For now, some basic notation and notions of Bayesian inferential procedures are introduced.
Hypothesis testing and the Bayes factor
Bayesian model selection techniques that aim to find the true model, as opposed to searching for the model that best predicts the data, are fundamentally extensions of Bayesian hypothesis testing strategies. In general, this Bayesian approach to hypothesis testing and model selection relies on determining the amount of evidence found in favor of one hypothesis (or model) over the other, given an observed set of data. Approached from a Bayesian standpoint, this type of problem can be formulated in great generality using a natural, well-defined probabilistic framework that incorporates both model and parameter uncertainty.
Jeffreys (1935) first developed the Bayesian strategy for hypothesis testing and, consequently, for the model selection problem. Bayesian model selection within a model space M = {M1, M2, ..., MJ}, where each model Mj is associated with a parameter θj (which may itself be a vector of parameters), incorporates three types of probability distributions: (1) a prior probability distribution for each model, π(Mj); (2) a prior probability distribution for the parameters in each model, π(θj | Mj); and (3) the distribution of the data conditional on both the model and the model's parameters, f(x | θj, Mj). These three probability densities induce the joint distribution p(x, θj, Mj) = f(x | θj, Mj) · π(θj | Mj) · π(Mj), which is instrumental in producing model posterior probabilities. The model posterior probability is the probability that a model is true given the data; it is obtained by marginalizing over the parameter space and using Bayes' rule:

$$p(M_j \mid \mathbf{x}) = \frac{m(\mathbf{x} \mid M_j)\, \pi(M_j)}{\sum_{i=1}^{J} m(\mathbf{x} \mid M_i)\, \pi(M_i)},$$  (1–1)

where m(x | Mj) = ∫ f(x | θj, Mj) π(θj | Mj) dθj is the marginal likelihood of Mj.
Given that interest lies in comparing different models, evidence in favor of one or another model is assessed with pairwise comparisons using posterior odds:

$$\frac{p(M_j \mid \mathbf{x})}{p(M_k \mid \mathbf{x})} = \frac{m(\mathbf{x} \mid M_j)}{m(\mathbf{x} \mid M_k)} \cdot \frac{\pi(M_j)}{\pi(M_k)}.$$  (1–2)

The first term on the right-hand side of (1–2), m(x | Mj)/m(x | Mk), is known as the Bayes factor comparing model Mj to model Mk, and it is denoted by BF_jk(x). The Bayes factor provides a measure of the evidence in favor of either model given the data, and updates the model prior odds, π(Mj)/π(Mk), to produce the posterior odds.
Note that the model posterior probability in (1–1) can be expressed as a function of Bayes factors. To illustrate, let M* ∈ M be a reference model, to which all other models in M are compared. Then, dividing both the numerator
and denominator in (1–1) by m(x | M*) π(M*) yields

$$p(M_j \mid \mathbf{x}) = \frac{BF_{j*}(\mathbf{x})\, \frac{\pi(M_j)}{\pi(M_*)}}{1 + \sum_{M_i \in \mathcal{M},\, M_i \neq M_*} BF_{i*}(\mathbf{x})\, \frac{\pi(M_i)}{\pi(M_*)}}.$$  (1–3)
Therefore, as the Bayes factor increases, the posterior probability of model Mj given the data increases. If all models have equal prior probabilities, a straightforward criterion to select the best among all candidate models is to choose the model with the largest Bayes factor. As such, the Bayes factor is not only useful for identifying models favored by the data, but it also provides a means to rank models in terms of their posterior probabilities.

Assuming equal model prior probabilities in (1–3), the prior odds are set equal to one, and the model posterior odds in (1–2) become p(Mj | x)/p(Mk | x) = BF_jk(x). Based on the Bayes factors, the evidence in favor of one or another model can be interpreted using Table 1-1, adapted from Kass & Raftery (1995).
Table 1-1. Interpretation of BF_jk when contrasting Mj and Mk

ln BF_jk | BF_jk     | Evidence in favor of Mj | P(Mj | x)
0 to 2   | 1 to 3    | Weak evidence           | 0.50–0.75
2 to 6   | 3 to 20   | Positive evidence       | 0.75–0.95
6 to 10  | 20 to 150 | Strong evidence         | 0.95–0.99
> 10     | > 150     | Very strong evidence    | > 0.99
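To make (1–1)–(1–3) concrete, the following sketch (with hypothetical log marginal likelihood values, not taken from this dissertation) computes posterior model probabilities via Bayes factors against a reference model; working on the log scale avoids underflow when marginal likelihoods are tiny:

```python
import numpy as np

def posterior_model_probs(log_marginals, priors=None):
    """Posterior model probabilities (Eq. 1-3) from log marginal likelihoods.

    Bayes factors are formed against a reference model (here the first one),
    which cancels the common scale and keeps the computation numerically stable."""
    log_m = np.asarray(log_marginals, dtype=float)
    J = log_m.size
    pri = np.full(J, 1.0 / J) if priors is None else np.asarray(priors, dtype=float)
    log_bf = log_m - log_m[0]            # log BF_{j*} relative to the reference M_*
    w = np.exp(log_bf) * (pri / pri[0])  # BF_{j*} times the prior odds
    return w / w.sum()

# Three candidate models with equal priors (hypothetical values)
probs = posterior_model_probs([-104.2, -101.7, -108.9])
print(probs)  # the second model dominates; ln BF_21 = 2.5 falls in the
              # "positive evidence" row of Table 1-1
```

Because only differences of log marginal likelihoods enter, the resulting probabilities do not depend on which model is chosen as the reference.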
Bayesian hypothesis testing and model selection procedures based on Bayes factors and posterior probabilities have several desirable features. First, these methods have a straightforward interpretation, since the Bayes factor is an increasing function of model (or hypothesis) posterior probabilities. Second, these methods can yield frequentist-matching confidence bounds when implemented with good testing priors (Kass & Wasserman, 1996), such as the reference priors of Berger & Bernardo (1992). Third, since the Bayes factor contains the ratio of marginal densities, it automatically penalizes complexity according to the number of parameters in each model; this property is known as Ockham's razor (Kass & Raftery, 1995). Fourth, the use of Bayes factors does
not require nested hypotheses (i.e., the null hypothesis nested within the alternative), standard distributions, or regular asymptotics (e.g., convergence to normal or chi-squared distributions) (Berger et al., 2001). In contrast, this is not always the case with frequentist and likelihood ratio tests, which depend on known distributions (at least asymptotically) for the test statistic. Finally, Bayesian hypothesis testing procedures using the Bayes factor can naturally incorporate model uncertainty, using the Bayesian machinery for model-averaged predictions and confidence bounds (Kass & Raftery, 1995). It is not clear how to account for this uncertainty rigorously in a fully frequentist approach.
1.3 Overview of the Chapters
In the chapters that follow, we develop a flexible and straightforward hierarchical Bayesian framework for occupancy models, allowing us to obtain estimates and conduct robust testing from an "objective" Bayesian perspective. Latent mixtures of random variables supply a foundation for our methodology. This approach provides a means to directly incorporate spatial dependency and temporal heterogeneity through predictors that characterize either the habitat quality of a given site or detectability features of a particular survey conducted at a specific site. The Bayesian testing methods we propose are (1) a fully automatic and objective method for occupancy model selection, and (2) an objective Bayesian testing tool that accounts for multiple testing and for polynomial hierarchical structure in the space of predictors.
Chapter 2 introduces the methods proposed for estimation of occupancy model parameters. A simple estimation procedure for the single-season occupancy model with covariates is formulated using both probit and logit links. Based on the simple version, an extension is provided to cope with metapopulation dynamics by introducing persistence and colonization processes. Finally, given the fundamental role that spatial dependence plays in defining temporal dynamics, a strategy to seamlessly account for this feature in our framework is introduced.
Chapter 3 develops a new, fully automatic and objective method for occupancy model selection that is asymptotically consistent for variable selection and averts the use of tuning parameters. In this chapter, first, some issues surrounding multimodel inference are described and insight about objective Bayesian inferential procedures is provided. Then, building on modern methods for "objective" Bayesian testing to generate priors on the parameter space, the intrinsic priors for the parameters of the occupancy model are obtained. These are used in the construction of an algorithm for "objective" variable selection tailored to the occupancy model framework.
Chapter 4 touches on two important and interconnected issues in model testing that have yet to receive the attention they deserve: (1) controlling for false discovery in hypothesis testing, given the size of the model space, i.e., given the number of tests performed; and (2) non-invariance to location transformations of variable selection procedures in the face of polynomial predictor structure. Both of these elements depend on the definition of prior probabilities on the model space. In this chapter, a set of priors on the model space and a stochastic search algorithm are proposed. Together, these control for model multiplicity and account for the polynomial structure among the predictors.
CHAPTER 2
MODEL ESTIMATION METHODS
"Data! Data! Data!" he cried impatiently. "I can't make bricks without clay!"

–Sherlock Holmes, The Adventure of the Copper Beeches
2.1 Introduction
Prior to the introduction of site-occupancy models (MacKenzie et al., 2002; Tyre et al., 2003), presence-absence data from ecological monitoring programs were used without any adjustment to assess the impact of management actions, to observe trends in species distribution through space and time, or to model the habitat of a species (Tyre et al., 2003). These efforts, however, were suspect, due to false-negative errors not being accounted for. False-negative errors occur whenever a species is present at a site but goes undetected during the survey.
Site-occupancy models, developed independently by MacKenzie et al. (2002) and Tyre et al. (2003), extend simple binary-regression models to account for the aforementioned errors in detection of individuals, common in surveys of animal or plant populations. Since their introduction, the site-occupancy framework has been used in countless applications, and numerous extensions of it have been proposed. Occupancy models improve upon traditional binary regression by analyzing observed detection and partially observed presence as two separate but related components. In the site-occupancy setting, the chosen locations are surveyed repeatedly in order to reduce the ambiguity caused by the observed zeros. This approach therefore allows simultaneous estimation of the probabilities of presence (occurrence) and detection.
Several extensions to the basic single-season, closed-population model are now available. The occupancy approach has been used to determine species range dynamics (MacKenzie et al., 2003; Royle & Kéry, 2007) and to understand age/stage
structure within populations (Nichols et al., 2007), and to model species co-occurrence (MacKenzie et al., 2004; Ovaskainen et al., 2010; Waddle et al., 2010). It has even been suggested as a surrogate for abundance (MacKenzie & Nichols, 2004); MacKenzie et al. suggested using occupancy models to conduct large-scale monitoring programs, since this approach avoids the high costs associated with surveys designed for abundance estimation. Also, to investigate metapopulation dynamics, occupancy models improve upon incidence function models (Hanski, 1994), which are often parameterized in terms of site (or patch) occupancy and assume homogeneous patches and a metapopulation that is at a colonization-extinction equilibrium.
Nevertheless, the implementation of Bayesian occupancy models commonly resorts to sampling strategies dependent on hyperparameters, subjective prior elicitation, and relatively elaborate algorithms. From the standpoint of practitioners, these are often treated as black-box methods (Kéry, 2010); as such, the potential for using the methodology incorrectly is high. Commonly, these models are fitted with packages such as BUGS or JAGS. Although these packages' ease of use has led to widespread adoption of the methods, the user may be oblivious to the assumptions underpinning the analysis.
We believe that providing straightforward and robust alternatives for implementing these methods will help practitioners gain insight about how occupancy modeling, and more generally Bayesian modeling, is performed. In this chapter, using a simple Gibbs sampling approach, we first develop a versatile method to estimate the single-season, closed-population site-occupancy model, then extend it to analyze metapopulation dynamics through time, and finally provide a further adaptation to incorporate spatial dependence among neighboring sites.

2.1.1 The Occupancy Model
In this section of the document, we first introduce our results published in Dorazio & Taylor-Rodríguez (2012) and build upon them to propose relevant extensions. Under
the standard sampling protocol for collecting site-occupancy data, J > 1 independent surveys are conducted at each of N representative sample locations (sites), noting whether a species is detected or not detected during each survey. Let y_ij denote a binary random variable that indicates detection (y_ij = 1) or non-detection (y_ij = 0) during the j-th survey of site i. Without loss of generality, J may be assumed constant among all N sites to simplify description of the model; in practice, however, site-specific variation in J poses no real difficulties and is easily implemented. This sampling protocol therefore yields an N × J matrix Y of detection/non-detection data.
Note that the observed process y_ij is an imperfect representation of the underlying occupancy, or presence, process. Hence, letting z_i denote the presence indicator at site i, this model specification can be represented through the hierarchy

$$y_{ij} \mid z_i, \lambda \sim \text{Bernoulli}(z_i\, p_{ij}), \qquad z_i \mid \alpha \sim \text{Bernoulli}(\psi_i),$$  (2–1)

where p_ij is the probability of correctly classifying site i as occupied during the j-th survey, and ψ_i is the presence probability at site i. The graphical representation of this process is shown in Figure 2-1.
[Directed graph omitted: ψ_i → z_i → y_i ← p_i]

Figure 2-1. Graphical representation of the occupancy model.
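The hierarchy in (2–1) is straightforward to simulate from. The short sketch below (with arbitrary constant ψ and p, our own illustrative values) generates a detection matrix Y and shows why the observed zeros are ambiguous: a site can be occupied yet never detected.

```python
import numpy as np

rng = np.random.default_rng(1)
N, J = 200, 5            # sites and surveys per site (illustrative values)
psi, p = 0.6, 0.4        # occupancy and detection probabilities (assumed constant)

z = rng.binomial(1, psi, size=N)                    # latent presence, Eq. (2-1)
Y = rng.binomial(1, p * z[:, None], size=(N, J))    # y_ij | z_i ~ Bern(z_i * p)

# Sites with at least one detection are certainly occupied; the remaining
# zeros mix truly empty sites with occupied-but-undetected ones.
naive = (Y.sum(axis=1) > 0).mean()
print(z.mean(), naive)   # the naive proportion understates occupancy
```

Repeated surveys shrink the ambiguous set, since the chance that an occupied site yields all zeros decays as (1 − p)^J.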
Probabilities of detection and occupancy can both be made functions of covariates, and their corresponding parameter estimates can be obtained using either a maximum
likelihood or a Bayesian approach. Existing methodologies from the likelihood perspective marginalize over the latent occupancy process z_i, making the estimation procedure depend only on the detections. Most Bayesian strategies rely on MCMC algorithms that require parameter prior specification and tuning. However, Albert & Chib (1993) proposed a longstanding strategy in the Bayesian statistical literature that models binary outcomes using a simple Gibbs sampler. This procedure, which is described in the following section, can be extrapolated to the occupancy setting, eliminating the need for tuning parameters and subjective prior elicitation.

2.1.2 Data Augmentation Algorithms for Binary Models
Probit model: data augmentation with latent normal variables
At the root of Albert & Chib's algorithm lies the idea that if the observed outcome is 0, the latent variable can be simulated from a truncated normal distribution with support (−∞, 0], and if the outcome is 1, the latent variable can be simulated from a truncated normal distribution on (0, ∞). To understand the reasoning behind this strategy, let Y ∼ Bern(Φ(x^T β)) and V = x^T β + ε, with ε ∼ N(0, 1). In such a case, note that

$$\Pr(y = 1 \mid x^T\beta) = \Phi(x^T\beta) = \Pr(\varepsilon < x^T\beta) = \Pr(\varepsilon > -x^T\beta) = \Pr(v > 0 \mid x^T\beta).$$

Thus, whenever y = 1, then v > 0, and v ≤ 0 otherwise; in other words, we may think of y as a truncated version of v. We can therefore sample iteratively, alternating between the latent variables conditioned on the model parameters and vice versa, to draw from the desired posterior densities. By augmenting the data with the latent variables, we are able to obtain full conditional posterior distributions for the model parameters that are easy to draw from (Equation 2–3 below). Further, just as we may sample the latent variables, we may also sample the parameters.
Given some initial values for all model parameters, values for the latent variables can be simulated. By conditioning on the latter, it is then possible to draw samples
from the parameters' posterior distributions. These samples can be used to generate new values for the latent variables, and so on. The process is iterated using a Gibbs sampling approach; generally, after a large number of iterations, it yields draws from the joint posterior distribution of the latent variables and the model parameters, conditional on the observed outcome values. We formalize the procedure below.
Assume that each outcome Y1, Y2, ..., Yn is such that Yi | xi, β ∼ Bernoulli(qi), where qi = Φ(x_i^T β) is the standard normal CDF evaluated at x_i^T β, and where xi and β are the p-dimensional vectors of observed covariates for the i-th observation and their corresponding parameters, respectively.

Now let y = (y1, y2, ..., yn) be the vector of observed outcomes, and let [β] represent the prior distribution of the model parameters. The posterior distribution of β is therefore given by

$$[\beta \mid \mathbf{y}] \propto [\beta] \prod_{i=1}^{n} \Phi(x_i^T\beta)^{y_i} \left(1 - \Phi(x_i^T\beta)\right)^{1-y_i},$$  (2–2)
which is intractable. Nevertheless, introducing latent random variables V = (V1, ..., Vn), such that Vi ∼ N(x_i^T β, 1) with Vi > 0 whenever Yi = 1 and Vi ≤ 0 whenever Yi = 0, resolves this difficulty. This yields

$$[\beta, \mathbf{v} \mid \mathbf{y}] \propto [\beta] \prod_{i=1}^{n} \phi(v_i \mid x_i^T\beta, 1) \left\{ I(v_i \le 0)\, I(y_i = 0) + I(v_i > 0)\, I(y_i = 1) \right\},$$  (2–3)

where φ(x | µ, τ²) is the probability density function of a normal random variable x with mean µ and variance τ². The data augmentation artifact works since [β | y] = ∫ [β, v | y] dv; hence, if we sample from the joint posterior (2–3) and extract only the sampled values for β, they correspond to samples from [β | y].
From the expression above, it is possible to obtain the full conditional distributions for V and β; thus, a Gibbs sampler can be proposed. For example, if we use a flat prior
for β (i.e., [β] ∝ 1), the full conditionals are given by

$$\beta \mid \mathbf{V}, \mathbf{y} \sim \text{MVN}_p\left((X^TX)^{-1}(X^T\mathbf{V}),\; (X^TX)^{-1}\right),$$  (2–4)

$$\mathbf{V} \mid \beta, \mathbf{y} \sim \prod_{i=1}^{n} \text{tr}\,N(x_i^T\beta,\, 1,\, Q_i),$$  (2–5)

where MVN_q(µ, Σ) denotes a multivariate normal distribution with mean vector µ and variance-covariance matrix Σ, and tr N(ξ, σ², Q) denotes the truncated normal distribution with mean ξ, variance σ², and truncation region Q. For each i = 1, 2, ..., n, the support of the truncated variable is Q_i = (−∞, 0] if y_i = 0 and Q_i = (0, ∞) otherwise. Note that conjugate normal priors could be used alternatively.

At iteration m + 1, the Gibbs sampler draws V^(m+1) conditional on β^(m) from (2–5), and then samples β^(m+1) conditional on V^(m+1) from (2–4). This process is repeated for m = 0, 1, ..., n_sim, where n_sim is the number of iterations of the Gibbs sampler.
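A minimal implementation of this two-step sampler (flat prior on β, Equations 2–4 and 2–5) might look as follows; the function name and settings are our own illustrative choices, not code from the dissertation:

```python
import numpy as np
from scipy.stats import truncnorm

def probit_gibbs(X, y, n_iter=2000, seed=0):
    """Albert & Chib-style Gibbs sampler for probit regression with a flat
    prior on beta; a minimal sketch of Equations (2-4) and (2-5)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    cov = np.linalg.inv(X.T @ X)       # posterior covariance under a flat prior
    chol = np.linalg.cholesky(cov)
    beta = np.zeros(p)
    draws = np.empty((n_iter, p))
    for m in range(n_iter):
        mu = X @ beta
        # V_i | beta, y: truncated normal on (0, inf) if y_i = 1, (-inf, 0] if y_i = 0
        lo = np.where(y == 1, 0.0, -np.inf)
        hi = np.where(y == 1, np.inf, 0.0)
        # truncnorm takes standardized bounds (lo - mu) / scale with scale = 1
        v = truncnorm.rvs(lo - mu, hi - mu, loc=mu, scale=1.0, random_state=rng)
        # beta | V: multivariate normal, Eq. (2-4)
        beta = cov @ (X.T @ v) + chol @ rng.standard_normal(p)
        draws[m] = beta
    return draws
```

Because both full conditionals are exact, no tuning or accept/reject step is needed; on simulated probit data, the post-burn-in mean of the draws recovers the generating coefficients.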
Logit model: data augmentation with latent Pólya-gamma variables
Recently, Polson et al. (2013) developed a novel and efficient approach to Bayesian inference for logistic models using Pólya-gamma latent variables, which is analogous to the Albert & Chib algorithm. The result arises from what the authors refer to as the Pólya-gamma distribution. To construct a random variable from this family, consider the infinite mixture of the i.i.d. sequence of Exp(1) random variables {E_k}, k = 1, 2, ..., given by

$$\omega = \frac{2}{\pi^2} \sum_{k=1}^{\infty} \frac{E_k}{(2k-1)^2},$$

with probability density function

$$g(\omega) = \sum_{k=0}^{\infty} (-1)^k \frac{2k+1}{\sqrt{2\pi\omega^3}}\, e^{-\frac{(2k+1)^2}{8\omega}}\, I(\omega \in (0, \infty)),$$  (2–6)

and Laplace transform E[e^{−tω}] = cosh^{−1}(√(t/2)).
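The series construction above suggests a simple (though inefficient) way to draw approximate Pólya-gamma variates: truncate the infinite sum. The sketch below uses the general gamma-mixture representation of PG(b, c), of which the Exp(1) series above is the b = 1, c = 0 special case; note that Polson et al. instead use an exact and much faster rejection sampler, so this is purely illustrative.

```python
import numpy as np

def rpg_approx(b, c, n_terms=200, rng=None):
    """Approximate PG(b, c) draw by truncating the gamma-mixture series
    omega = (1 / (2 pi^2)) * sum_k g_k / ((k - 1/2)^2 + c^2 / (4 pi^2)),
    with g_k ~ Gamma(b, 1). Illustrative only, not an exact sampler."""
    rng = rng or np.random.default_rng()
    k = np.arange(1, n_terms + 1)
    g = rng.gamma(shape=b, scale=1.0, size=n_terms)
    return (g / ((k - 0.5) ** 2 + c ** 2 / (4 * np.pi ** 2))).sum() / (2 * np.pi ** 2)

# Sanity check against the known mean E[omega] = (b / (2c)) * tanh(c / 2)
rng = np.random.default_rng(0)
draws = np.array([rpg_approx(1.0, 2.0, rng=rng) for _ in range(4000)])
print(draws.mean())   # should be close to (1/4) * tanh(1)
```

With c = 0 the mean reduces to b/4, matching the Exp(1) series, since each gamma term then has mean b and the weights sum to π²/8 for b = 1.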
The Pólya-gamma family of densities is obtained through an exponential tilting of the density g in (2–6). These densities, indexed by c ≥ 0, are characterized by

$$f(\omega \mid c) = \cosh\!\left(\frac{c}{2}\right) e^{-c^2\omega/2}\, g(\omega).$$
The likelihood for the binomial logistic model can be expressed in terms of latent Pólya-gamma variables as follows. Assume y_i ∼ Bernoulli(δ_i), with predictors x_i′ = (x_i1, ..., x_ip) and success probability δ_i = e^{x_i′β} / (1 + e^{x_i′β}). Hence the posterior for the model parameters can be represented as

$$[\beta \mid \mathbf{y}] = \frac{[\beta] \prod_{i=1}^{n} \delta_i^{y_i} (1 - \delta_i)^{1-y_i}}{c(\mathbf{y})},$$

where c(y) is the normalizing constant.
To facilitate the sampling procedure, a data augmentation step can be performed by introducing a Pólya-gamma random variable ω ∼ PG(x′β, 1). This yields the data-augmented posterior

$$[\beta, \omega \mid \mathbf{y}] = \frac{\left(\prod_{i=1}^{n} \Pr(y_i = 1 \mid \beta)\right) f(\omega \mid \mathbf{x}'\beta)\, [\beta]}{c(\mathbf{y})},$$  (2–7)

such that [β | y] = ∫_{R⁺} [β, ω | y] dω.
Thus, from the augmented model, the full conditional density for β is given by

$$[\beta \mid \omega, \mathbf{y}] \propto \left(\prod_{i=1}^{n} \Pr(y_i = 1 \mid \beta)\right) f(\omega \mid \mathbf{x}'\beta)\, [\beta]$$

$$= [\beta] \prod_{i=1}^{n} \frac{(e^{x_i'\beta})^{y_i}}{1 + e^{x_i'\beta}} \prod_{i=1}^{n} \cosh\!\left(\frac{|x_i'\beta|}{2}\right) \exp\!\left[-\frac{(x_i'\beta)^2 \omega_i}{2}\right] g(\omega_i).$$  (2–8)
This expression yields a normal posterior distribution if β is assigned a flat or normal prior. Hence, a two-step sampling strategy analogous to that of Albert & Chib (1993) can be used to estimate β in the occupancy framework.

2.2 Single Season Occupancy
Let p_ij = F(q_ij^T λ) be the probability of correctly classifying as occupied the i-th site during the j-th survey, conditional on the site being occupied, and let ψ_i = F(x_i^T α)
correspond to the presence probability at the i-th site. Further, let F^{−1}(·) denote a link function (i.e., probit or logit) connecting the response to the predictors, and denote by λ and α, respectively, the r-variate and p-variate coefficient vectors for the detection and presence probabilities. Then the following is the joint posterior for the presence indicators and the model parameters:

$$\pi^*(\mathbf{z}, \alpha, \lambda) \propto \pi_\alpha(\alpha)\, \pi_\lambda(\lambda) \prod_{i=1}^{N} F(x_i'\alpha)^{z_i} \left(1 - F(x_i'\alpha)\right)^{1-z_i} \times \prod_{j=1}^{J} \left(z_i F(q_{ij}'\lambda)\right)^{y_{ij}} \left(1 - z_i F(q_{ij}'\lambda)\right)^{1-y_{ij}}.$$  (2–9)
As in the simple probit regression problem, this posterior is intractable; consequently, sampling from it directly is not possible. However, the procedures of Albert & Chib for the probit model and of Polson et al. for the logit model can be extended to generate an MCMC sampling strategy for the occupancy problem. In what follows, we make use of this framework to develop samplers with which occupancy parameter estimates can be obtained for both probit and logit link functions. These algorithms have the added benefit that they require neither tuning parameters nor subjective elicitation of parameter priors.

2.2.1 Probit Link Model
To extend Albert & Chib's algorithm to the occupancy framework with a probit link, we first introduce two sets of latent variables, denoted w_ij and v_i, corresponding to the normal latent variables used to augment the data. The corresponding hierarchy is

$$y_{ij} \mid z_i, w_{ij} \sim \text{Bernoulli}\left(z_i\, I(w_{ij} > 0)\right), \qquad w_{ij} \mid \lambda \sim N(q_{ij}'\lambda,\, 1), \qquad \lambda \sim [\lambda],$$

$$z_i \mid v_i = I(v_i > 0), \qquad v_i \mid \alpha \sim N(x_i'\alpha,\, 1), \qquad \alpha \sim [\alpha],$$  (2–10)
represented by the directed graph found in Figure 2-2.

[Directed graph omitted: α → v_i → z_i → y_i ← w_i ← λ]

Figure 2-2. Graphical representation of the occupancy model after data augmentation.
Under this hierarchical model, the joint density is given by

\[
\pi^*(z, v, \alpha, w, \lambda) \propto C_y\, \pi_\alpha(\alpha)\, \pi_\lambda(\lambda) \prod_{i=1}^{N} \phi(v_i;\, x_i'\alpha, 1)\, I_{v_i>0}^{z_i}\, I_{v_i\le 0}^{1-z_i} \prod_{j=1}^{J} \big(z_i I_{w_{ij}>0}\big)^{y_{ij}} \big(1 - z_i I_{w_{ij}>0}\big)^{1-y_{ij}}\, \phi(w_{ij};\, q_{ij}'\lambda, 1) \quad (2\text{–}11)
\]
The full conditional densities derived from the posterior in Equation 2–11 are detailed below.
1. The full conditional for z, obtained after integrating v and w out of the joint posterior:

\[
f(z \mid \alpha, \lambda) = \prod_{i=1}^{N} f(z_i \mid \alpha, \lambda) = \prod_{i=1}^{N} (\psi_i^*)^{z_i}\,(1-\psi_i^*)^{1-z_i},
\quad \text{where } \psi_i^* = \frac{\psi_i \prod_{j=1}^{J} p_{ij}^{y_{ij}}(1-p_{ij})^{1-y_{ij}}}{\psi_i \prod_{j=1}^{J} p_{ij}^{y_{ij}}(1-p_{ij})^{1-y_{ij}} + (1-\psi_i)\prod_{j=1}^{J} I_{y_{ij}=0}} \quad (2\text{–}12)
\]
2. The full conditional for v:

\[
f(v \mid z, \alpha) = \prod_{i=1}^{N} f(v_i \mid z_i, \alpha) = \prod_{i=1}^{N} \text{trN}(x_i'\alpha,\, 1,\, A_i),
\quad \text{where } A_i = \begin{cases} (-\infty, 0] & z_i = 0 \\ (0, \infty) & z_i = 1 \end{cases} \quad (2\text{–}13)
\]

and trN(μ, σ², A) denotes the pdf of a truncated normal random variable with mean μ, variance σ², and truncation region A.
3. The full conditional for α:

\[
f(\alpha \mid v) = \phi_p\big(\alpha;\, \Sigma_\alpha X'v,\, \Sigma_\alpha\big), \quad (2\text{–}14)
\]

where Σ_α = (X'X)⁻¹ and φ_k(x; μ, Σ) represents the k-variate normal density with mean vector μ and covariance matrix Σ.
4. The full conditional for w:

\[
f(w \mid y, z, \lambda) = \prod_{i=1}^{N} \prod_{j=1}^{J} f(w_{ij} \mid y_{ij}, z_i, \lambda) = \prod_{i=1}^{N} \prod_{j=1}^{J} \text{trN}(q_{ij}'\lambda,\, 1,\, B_{ij}),
\quad \text{where } B_{ij} = \begin{cases} (-\infty, \infty) & z_i = 0 \\ (-\infty, 0] & z_i = 1 \text{ and } y_{ij} = 0 \\ (0, \infty) & z_i = 1 \text{ and } y_{ij} = 1 \end{cases} \quad (2\text{–}15)
\]
5. The full conditional for λ:

\[
f(\lambda \mid w) = \phi_r\big(\lambda;\, \Sigma_\lambda Q'w,\, \Sigma_\lambda\big), \quad (2\text{–}16)
\]

where Σ_λ = (Q'Q)⁻¹.
The Gibbs sampling algorithm for the model can then be summarized as:

1. Initialize z, α, v, λ, and w.
2. Sample z_i ~ Bern(ψ*_i).
3. Sample v_i from a truncated normal with μ = x'_iα, σ = 1, and truncation region depending on z_i.
4. Sample α ~ N(Σ_α X'v, Σ_α) with Σ_α = (X'X)⁻¹.
5. Sample w_ij from a truncated normal with μ = q'_ijλ, σ = 1, and truncation region depending on y_ij and z_i.
6. Sample λ ~ N(Σ_λ Q'w, Σ_λ) with Σ_λ = (Q'Q)⁻¹.
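The six steps above can be sketched in a few dozen lines of code. The following is a minimal numpy/scipy illustration on simulated data, assuming flat priors [α] ∝ 1 and [λ] ∝ 1 (which give the Gaussian conditionals in Equations 2–14 and 2–16); the dimensions, covariates, and "true" parameter values are invented for the example.

```python
import numpy as np
from scipy.stats import norm, truncnorm

rng = np.random.default_rng(1)

# --- simulated data; sizes and true values are assumptions for illustration ---
N, J = 300, 5
X = np.column_stack([np.ones(N), rng.normal(size=N)])         # presence design (N x p)
Q = np.stack([np.column_stack([np.ones(N), rng.normal(size=N)])
              for _ in range(J)], axis=1)                     # detection design (N x J x r)
alpha_true, lam_true = np.array([0.3, 1.0]), np.array([0.5, -0.8])
z_true = rng.random(N) < norm.cdf(X @ alpha_true)
y = (rng.random((N, J)) < norm.cdf(Q @ lam_true)) * z_true[:, None]

def gibbs_probit_occupancy(y, X, Q, n_iter=300):
    N, J = y.shape
    p, r = X.shape[1], Q.shape[2]
    alpha, lam = np.zeros(p), np.zeros(r)
    Sig_a = np.linalg.inv(X.T @ X)                            # Sigma_alpha = (X'X)^{-1}
    Qf = Q.reshape(-1, r)
    Sig_l = np.linalg.inv(Qf.T @ Qf)                          # Sigma_lambda = (Q'Q)^{-1}
    draws = np.empty((n_iter, p + r))
    for m in range(n_iter):
        # Step 2: z_i | alpha, lambda (Eq. 2-12); detected sites get z_i = 1 w.p. 1
        psi, pij = norm.cdf(X @ alpha), norm.cdf(Q @ lam)
        num = psi * np.prod(np.where(y == 1, pij, 1 - pij), axis=1)
        den = num + (1 - psi) * (y.sum(axis=1) == 0)
        z = (rng.random(N) < num / den).astype(float)
        # Step 3: v_i | z_i, alpha: truncated normal on region A_i (Eq. 2-13)
        mu_v = X @ alpha
        lo = np.where(z == 1, -mu_v, -np.inf)
        hi = np.where(z == 1, np.inf, -mu_v)
        v = mu_v + truncnorm.rvs(lo, hi, random_state=rng)
        # Step 4: alpha | v ~ N(Sigma_a X'v, Sigma_a) (Eq. 2-14, flat prior)
        alpha = rng.multivariate_normal(Sig_a @ X.T @ v, Sig_a)
        # Step 5: w_ij | y, z, lambda: truncated normal on region B_ij (Eq. 2-15)
        mu_w = Q @ lam
        occ = z[:, None] == 1
        lo_w = np.where(occ & (y == 1), -mu_w, -np.inf)
        hi_w = np.where(occ & (y == 0), -mu_w, np.inf)
        w = mu_w + truncnorm.rvs(lo_w, hi_w, random_state=rng)
        # Step 6: lambda | w ~ N(Sigma_l Q'w, Sigma_l) (Eq. 2-16, flat prior)
        lam = rng.multivariate_normal(Sig_l @ Qf.T @ w.ravel(), Sig_l)
        draws[m] = np.concatenate([alpha, lam])
    return draws

draws = gibbs_probit_occupancy(y, X, Q)
```

The truncated-normal draws implement the regions A_i and B_ij of Equations 2–13 and 2–15, and sites with at least one detection receive z_i = 1 with probability one, as Equation 2–12 requires.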
2.2.2 Logit Link Model
Now turning to the logit link version of the occupancy model, again let y_ij be the indicator variable used to mark detection of the target species on the j-th survey at the i-th site, and let z_i be the indicator variable that denotes presence (z_i = 1) or absence (z_i = 0) of the target species at the i-th site. The model is now defined by

\[
\begin{aligned}
y_{ij} \mid z_i, \lambda &\sim \text{Bernoulli}(z_i p_{ij}), \quad \text{where } p_{ij} = \frac{e^{q_{ij}'\lambda}}{1 + e^{q_{ij}'\lambda}} \\
\lambda &\sim [\lambda] \\
z_i \mid \alpha &\sim \text{Bernoulli}(\psi_i), \quad \text{where } \psi_i = \frac{e^{x_i'\alpha}}{1 + e^{x_i'\alpha}} \\
\alpha &\sim [\alpha]
\end{aligned}
\]
In this hierarchy, the contribution of a single site to the likelihood is

\[
L_i(\alpha, \lambda) = \frac{(e^{x_i'\alpha})^{z_i}}{1 + e^{x_i'\alpha}} \prod_{j=1}^{J} \left(z_i \frac{e^{q_{ij}'\lambda}}{1 + e^{q_{ij}'\lambda}}\right)^{y_{ij}} \left(1 - z_i \frac{e^{q_{ij}'\lambda}}{1 + e^{q_{ij}'\lambda}}\right)^{1-y_{ij}} \quad (2\text{–}17)
\]
As in the probit case, we data-augment the likelihood with two separate sets of latent variables, in this case each having a Polya-Gamma distribution. Augmenting the model and using the posterior in (2–7), the joint distribution is
\[
[\,z, \alpha, \lambda \mid y\,] \propto [\alpha]\,[\lambda] \prod_{i=1}^{N} \frac{(e^{x_i'\alpha})^{z_i}}{1 + e^{x_i'\alpha}} \cosh\!\left(\frac{|x_i'\alpha|}{2}\right) \exp\!\left[-\frac{(x_i'\alpha)^2 v_i}{2}\right] g(v_i) \times \prod_{j=1}^{J} \left(z_i \frac{e^{q_{ij}'\lambda}}{1 + e^{q_{ij}'\lambda}}\right)^{y_{ij}} \left(1 - z_i \frac{e^{q_{ij}'\lambda}}{1 + e^{q_{ij}'\lambda}}\right)^{1-y_{ij}} \cosh\!\left(\frac{|z_i q_{ij}'\lambda|}{2}\right) \exp\!\left[-\frac{(z_i q_{ij}'\lambda)^2 w_{ij}}{2}\right] g(w_{ij}) \quad (2\text{–}18)
\]
The full conditionals for z, α, v, λ, and w obtained from (2–18) are provided below.
1. The full conditional for z is obtained after marginalizing the latent variables and yields

\[
f(z \mid \alpha, \lambda) = \prod_{i=1}^{N} f(z_i \mid \alpha, \lambda) = \prod_{i=1}^{N} (\psi_i^*)^{z_i}\,(1-\psi_i^*)^{1-z_i},
\quad \text{where } \psi_i^* = \frac{\psi_i \prod_{j=1}^{J} p_{ij}^{y_{ij}}(1-p_{ij})^{1-y_{ij}}}{\psi_i \prod_{j=1}^{J} p_{ij}^{y_{ij}}(1-p_{ij})^{1-y_{ij}} + (1-\psi_i)\prod_{j=1}^{J} I_{y_{ij}=0}} \quad (2\text{–}19)
\]
2. Using the result derived in Polson et al. (2013), we have that

\[
f(v \mid z, \alpha) = \prod_{i=1}^{N} f(v_i \mid z_i, \alpha) = \prod_{i=1}^{N} \text{PG}(1,\, x_i'\alpha) \quad (2\text{–}20)
\]
3. The full conditional for α:

\[
f(\alpha \mid v) \propto [\alpha] \prod_{i=1}^{N} \exp\!\left[z_i x_i'\alpha - \frac{x_i'\alpha}{2} - \frac{(x_i'\alpha)^2 v_i}{2}\right] \quad (2\text{–}21)
\]
4. By the same result as that used for v, the full conditional for w is

\[
f(w \mid y, z, \lambda) = \prod_{i=1}^{N} \prod_{j=1}^{J} f(w_{ij} \mid y_{ij}, z_i, \lambda) = \left(\prod_{i \in S_1} \prod_{j=1}^{J} \text{PG}(1,\, |q_{ij}'\lambda|)\right) \left(\prod_{i \notin S_1} \prod_{j=1}^{J} \text{PG}(1,\, 0)\right), \quad (2\text{–}22)
\]

with S_1 = {i ∈ {1, 2, ..., N} : z_i = 1}.
5. The full conditional for λ:

\[
f(\lambda \mid z, y, w) \propto [\lambda] \prod_{i \in S_1} \prod_{j=1}^{J} \exp\!\left[y_{ij} q_{ij}'\lambda - \frac{q_{ij}'\lambda}{2} - \frac{(q_{ij}'\lambda)^2 w_{ij}}{2}\right], \quad (2\text{–}23)
\]

with S_1 as defined above.
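To make the Polya-Gamma updates (2–20) and (2–21) concrete, the sketch below runs the corresponding two-step Gibbs sampler for the presence component alone, treating z as observed (i.e., plain Bayesian logistic regression with a flat prior [α]). The exact PG sampler of Polson et al. is longer, so the generator used here is the truncated sum-of-gammas representation, which is only an approximation for illustration; a production implementation would use an exact sampler (e.g., the pypolyagamma package).

```python
import numpy as np

rng = np.random.default_rng(2)

def rpg(b, c, K=200):
    """Approximate PG(b, c) draws via the truncated sum-of-gammas
    representation PG(b, c) = (1/2pi^2) sum_k g_k / ((k - 1/2)^2 + c^2/(4pi^2)),
    with g_k ~ Gamma(b, 1); truncation at K terms is the approximation."""
    c = np.atleast_1d(np.asarray(c, dtype=float))
    k = np.arange(1, K + 1)
    g = rng.gamma(b, 1.0, size=(c.size, K))
    denom = (k - 0.5) ** 2 + (c[:, None] / (2 * np.pi)) ** 2
    return (g / denom).sum(axis=1) / (2 * np.pi ** 2)

# Gibbs sampler alternating (2-20) and (2-21); data and true values invented.
N = 400
X = np.column_stack([np.ones(N), rng.normal(size=N)])
alpha_true = np.array([-0.5, 1.2])
z = (rng.random(N) < 1 / (1 + np.exp(-X @ alpha_true))).astype(float)

alpha, kappa, draws = np.zeros(2), z - 0.5, []        # kappa_i = z_i - 1/2
for _ in range(400):
    v = rpg(1.0, X @ alpha)                           # v_i | z, alpha ~ PG(1, x_i'alpha)
    Sig = np.linalg.inv(X.T @ (v[:, None] * X))       # (X' diag(v) X)^{-1}
    alpha = rng.multivariate_normal(Sig @ X.T @ kappa, Sig)  # alpha | v is Gaussian
    draws.append(alpha)
draws = np.array(draws)
```

Completing the square in (2–21) under a flat prior gives the Gaussian draw in the loop: precision X' diag(v) X and mean (X' diag(v) X)⁻¹ X'κ with κ_i = z_i − 1/2.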
The Gibbs sampling algorithm is analogous to the one with a probit link, but with the obvious modifications to incorporate Polya-Gamma instead of normal latent variables.

2.3 Temporal Dynamics and Spatial Structure
The uses of the single-season model are limited to very specific problems. In particular, the assumptions of the basic model may become too restrictive or unrealistic whenever the study period extends over multiple years or seasons, especially given the increasingly changing environmental conditions that most ecosystems are currently experiencing.

Among the many extensions found in the literature, one that we consider particularly relevant incorporates heterogeneous occupancy probabilities through time. Extensions of site-occupancy models that incorporate temporally varying probabilities can be traced back to Hanski (1994), with the heterogeneity of occupancy probabilities through time arising from local colonization and extinction processes. MacKenzie et al. (2003) proposed an alternative to Hanski's approach in order to incorporate imperfect detection. The method is flexible enough to let detection, occurrence, survival, and colonization probabilities each depend upon its own set of covariates, using likelihood-based estimation for the model parameters.

However, the approach of MacKenzie et al. presents two drawbacks. First, the uncertainty assessment for maximum likelihood parameter estimates relies on asymptotic results (obtained from implementation of the delta method), making it sensitive to sample size. Second, to obtain parameter estimates, the latent (occupancy) process is marginalized out of the likelihood, leading to the usual zero-inflated Bernoulli model. Although this is a convenient strategy to solve the estimation problem, the latent state variables (occupancy indicators) are no longer available, and as such, finite sample estimates cannot be calculated unless an additional (and computationally expensive) parametric bootstrap step is performed (Royle & Kery 2007). Additionally, as the occupancy process is integrated out, the likelihood approach precludes incorporating additional structural dependence through random effects. Thus the model cannot account for spatial dependence, which plays a fundamental role in this setting.
To work around some of the shortcomings encountered when fitting dynamic occupancy models via likelihood-based methods, Royle & Kery developed what they refer to as a dynamic occupancy state space model (DOSS), alluding to the conceptual similarity between this model and the class of state space models found in the time series literature. In particular, this model allows one to retain the latent process (occupancy indicators) in order to obtain small sample estimates, and to eventually generate extensions that incorporate structure in time and/or space through random effects.

The data used in the DOSS model come from standard repeated presence/absence surveys with N sampling locations (patches or sites), indexed by i = 1, 2, ..., N. Within a given season (e.g., year, month, or week, depending on the biology of the species), each sampling location is visited (surveyed) j = 1, 2, ..., J times, and this process is repeated for t = 1, 2, ..., T seasons. An important assumption here is that the site occupancy status is closed within, but not across, seasons.

As is usual in the occupancy modeling framework, two different processes are considered. The first is the detection process per site-visit-season combination, denoted by y_ijt. The y_ijt are indicator variables that take the value 1 if the species is detected at site i, survey j, and season t, and 0 otherwise; these detection indicators are assumed to be independent within each site and season. The second response considered is the partially observed presence (occupancy) indicator z_it. These indicator variables equal 1 whenever y_ijt = 1 for one or more of the visits made to site i during season t; otherwise the values of the z_it's are unknown. Royle & Kery refer to these two processes as the observation (y_ijt) and the state (z_it) models.
In this setting, the parameters of greatest interest are the occurrence or site occupancy probabilities, denoted by ψ_it, as well as those representing the population dynamics, which are accounted for by introducing changes in occupancy status over time through local colonization and survival. That is, if a site was not occupied at season t−1, at season t it can either be colonized or remain unoccupied. On the other hand, if the site was in fact occupied at season t−1, it can remain that way (survival) or become abandoned (local extinction) at season t. The probabilities of survival and colonization from season t−1 to season t at the i-th site are denoted by θ_i(t−1) and γ_i(t−1), respectively.

During the initial period (or season), the model for the state process is expressed in terms of the occupancy probability (Equation 2–24). For subsequent periods, the state process is specified in terms of the survival and colonization probabilities (Equation 2–25):

\[
z_{i1} \sim \text{Bernoulli}(\psi_{i1}) \quad (2\text{–}24)
\]
\[
z_{it} \mid z_{i(t-1)} \sim \text{Bernoulli}\big(z_{i(t-1)}\theta_{i(t-1)} + (1 - z_{i(t-1)})\gamma_{i(t-1)}\big) \quad (2\text{–}25)
\]

The observation model, conditional on the latent process z_it, is defined by

\[
y_{ijt} \mid z_{it} \sim \text{Bernoulli}(z_{it}\, p_{ijt}) \quad (2\text{–}26)
\]
Royle & Kery induce heterogeneity by site, site-season, and site-survey-season, respectively, in the occupancy, the survival and colonization, and the detection probabilities, through the following specification:

\[
\begin{aligned}
\text{logit}(\psi_{i1}) &= x_1 + r_i, & r_i &\sim N(0, \sigma^2_\psi), & \text{logit}^{-1}(x_1) &\sim \text{Unif}(0, 1) \\
\text{logit}(\theta_{it}) &= a_t + u_i, & u_i &\sim N(0, \sigma^2_\theta), & \text{logit}^{-1}(a_t) &\sim \text{Unif}(0, 1) \\
\text{logit}(\gamma_{it}) &= b_t + v_i, & v_i &\sim N(0, \sigma^2_\gamma), & \text{logit}^{-1}(b_t) &\sim \text{Unif}(0, 1) \\
\text{logit}(p_{ijt}) &= c_t + w_{ij}, & w_{ij} &\sim N(0, \sigma^2_p), & \text{logit}^{-1}(c_t) &\sim \text{Unif}(0, 1)
\end{aligned} \quad (2\text{–}27)
\]

where x_1, a_t, b_t, and c_t are the season fixed effects for the corresponding probabilities, and (r_i, u_i, v_i) and w_ij are the site and site-survey random effects, respectively. Additionally, all variance components are assigned the usual inverse gamma priors.
As the authors state, this formulation can be regarded as "being suitably vague"; however, it is also restrictive in the sense that it is not clear what strategy to follow to incorporate additional covariates while preserving the straightforward sampling strategy.

2.3.1 Dynamic Mixture Occupancy State-Space Model
We assume that the probabilities for occupancy, survival, colonization, and detection are all functions of linear combinations of covariates. However, our setup varies slightly from that considered by Royle & Kery (2007). In essence, we modify the way in which the estimates for survival and colonization probabilities are attained. Our model incorporates the notion that occupancy at a site occupied during the previous season takes place through persistence, where we define persistence as a function of both survival and colonization. That is, a site occupied at time t may again be occupied at time t+1 if the current settlers survive, if they perish and new settlers colonize simultaneously, or if both current settlers survive and new ones colonize.

Our functional forms of choice are again the probit and logit link functions. This means that each probability of interest, which we will refer to for illustration as δ, is linked to a linear combination of covariates x'ξ through the relationship δ = F(x'ξ), where F(·) represents the inverse link function. This particular assumption facilitates relating the data augmentation algorithms of Albert & Chib and Polson et al. to Royle & Kery's DOSS model. We refer to this extension of Royle & Kery's model as the Dynamic Mixture Occupancy State Space model (DYMOSS).

As before, let y_ijt be the indicator variable used to mark detection of the target species on the j-th survey at the i-th site during the t-th season, and let z_it be the indicator variable that denotes presence (z_it = 1) or absence (z_it = 0) of the target species at the i-th site, t-th season, with i ∈ {1, 2, ..., N}, j ∈ {1, 2, ..., J}, and t ∈ {1, 2, ..., T}.
Additionally, assume that the probabilities for occupancy at time t = 1, persistence, colonization, and detection are all functions of covariates, with corresponding parameter vectors α, Δ⁽ˢ⁾ = {δ⁽ˢ⁾_{t−1}}_{t=2}^T, B⁽ᶜ⁾ = {β⁽ᶜ⁾_{t−1}}_{t=2}^T, and Λ = {λ_t}_{t=1}^T, and covariate matrices X_(o), X = {X_{t−1}}_{t=2}^T, and Q = {Q_t}_{t=1}^T, respectively. Using the notation above, our proposed dynamic occupancy model is defined by the following hierarchy.

State model:
\[
\begin{aligned}
z_{i1} \mid \alpha &\sim \text{Bernoulli}(\psi_{i1}), \quad \text{where } \psi_{i1} = F\big(x_{(o)i}'\alpha\big) \\
z_{it} \mid z_{i(t-1)}, \delta^{(s)}_{t-1}, \beta^{(c)}_{t-1} &\sim \text{Bernoulli}\big(z_{i(t-1)}\theta_{i(t-1)} + (1 - z_{i(t-1)})\gamma_{i(t-1)}\big), \\
&\quad \text{where } \theta_{i(t-1)} = F\big(\delta^{(s)}_{t-1} + x_{i(t-1)}'\beta^{(c)}_{t-1}\big) \text{ and } \gamma_{i(t-1)} = F\big(x_{i(t-1)}'\beta^{(c)}_{t-1}\big)
\end{aligned} \quad (2\text{–}28)
\]
Observation model:

\[
y_{ijt} \mid z_{it}, \lambda_t \sim \text{Bernoulli}(z_{it}\, p_{ijt}), \quad \text{where } p_{ijt} = F\big(q_{ijt}'\lambda_t\big) \quad (2\text{–}29)
\]
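A forward simulation of the state process in (2–28) makes the persistence/colonization mechanism concrete. The probit link, single habitat covariate, and parameter values in the sketch below are illustrative assumptions, not estimates.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)

# Forward simulation of the DYMOSS state process (Eq. 2-28), probit link.
N, T = 500, 6
F = norm.cdf                                   # inverse probit link
x = rng.normal(size=(N, T - 1))                # one habitat covariate per transition
beta = 0.8                                     # habitat-suitability coefficient beta^(c)
delta = 1.0                                    # persistence shift delta^(s)_{t-1}

z = np.zeros((N, T))
z[:, 0] = rng.random(N) < F(0.2)               # psi_i1 = F(x'_(o)i alpha), intercept-only
for t in range(1, T):
    theta = F(delta + beta * x[:, t - 1])      # persistence prob. theta_i(t-1)
    gamma = F(beta * x[:, t - 1])              # colonization prob. gamma_i(t-1)
    p_occ = np.where(z[:, t - 1] == 1, theta, gamma)
    z[:, t] = rng.random(N) < p_occ

# Since delta > 0, theta > gamma at every site: previously occupied sites
# are re-occupied more often; theta - gamma is the survival probability.
occupancy_by_season = z.mean(axis=0)
```

Setting delta to zero collapses persistence to colonization, recovering a model with no density dependence; a negative delta would encode a crowding penalty instead.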
In the hierarchical setup given by Equations 2–28 and 2–29, θ_i(t−1) corresponds to the probability of persistence from time t−1 to time t at site i, and γ_i(t−1) denotes the colonization probability. Note that θ_i(t−1) − γ_i(t−1) yields the survival probability from t−1 to t. The effect of survival is introduced by shifting the intercept of the linear predictor by a quantity δ⁽ˢ⁾_{t−1}. Although in this version of the model the effect is accomplished by just modifying the intercept, it can be extended to have covariates determining δ⁽ˢ⁾_{t−1} as well. The graphical representation of the model for a single site is shown in Figure 2-3.
Figure 2-3. Graphical representation of the multiseason model for a single site (nodes: α, z_it, y_it, λ_t, δ⁽ˢ⁾_{t−1}, β⁽ᶜ⁾_{t−1}).
The joint posterior for the model defined by this hierarchical setting is

\[
[\,z, \Lambda, \alpha, B^{(c)}, \Delta^{(s)} \mid y\,] = C_y \prod_{i=1}^{N} \left[\psi_{i1} \prod_{j=1}^{J} p_{ij1}^{y_{ij1}}(1-p_{ij1})^{1-y_{ij1}}\right]^{z_{i1}} \left[(1-\psi_{i1}) \prod_{j=1}^{J} I_{y_{ij1}=0}\right]^{1-z_{i1}} [\lambda_1][\alpha] \\
\times \prod_{t=2}^{T} \prod_{i=1}^{N} \left[\Big(\theta_{i(t-1)}^{z_{it}}(1-\theta_{i(t-1)})^{1-z_{it}}\Big)^{z_{i(t-1)}} \Big(\gamma_{i(t-1)}^{z_{it}}(1-\gamma_{i(t-1)})^{1-z_{it}}\Big)^{1-z_{i(t-1)}}\right] \left[\prod_{j=1}^{J} p_{ijt}^{y_{ijt}}(1-p_{ijt})^{1-y_{ijt}}\right]^{z_{it}} \left[\prod_{j=1}^{J} I_{y_{ijt}=0}\right]^{1-z_{it}} [\lambda_t][\beta^{(c)}_{t-1}][\delta^{(s)}_{t-1}] \quad (2\text{–}30)
\]
which, as in the single season case, is intractable. Once again, a Gibbs sampler cannot be constructed directly to sample from this joint posterior. The graphical representation of the model for one site, incorporating the latent variables, is provided in Figure 2-4.
Figure 2-4. Graphical representation of the data-augmented multiseason model (nodes: α, u_i, z_it, v_{i,t−1}, y_it, w_it, λ_t, δ⁽ˢ⁾_{t−1}, β⁽ᶜ⁾_{t−1}).
Probit link normal-mixture DYMOSS model
We deal with the intractability of the joint posterior distribution as before, that is, by introducing latent random variables. Each of the latent variables incorporates the relevant linear combination of covariates for the probabilities considered in the model. This artifact enables us to sample from the joint posterior distribution of the model parameters. For the probit link, the sets of latent random variables, respectively for first-season occupancy, persistence and colonization, and detection, are:

- u_i ~ N(x'_(o)i α, 1),
- v_{i(t−1)} ~ z_{i(t−1)} N(δ⁽ˢ⁾_{t−1} + x'_{i(t−1)} β⁽ᶜ⁾_{t−1}, 1) + (1 − z_{i(t−1)}) N(x'_{i(t−1)} β⁽ᶜ⁾_{t−1}, 1), and
- w_ijt ~ N(q'_ijt λ_t, 1).
Introducing these latent variables into the hierarchical formulation yields:

State model:

\[
\begin{aligned}
u_i \mid \alpha &\sim N\big(x_{(o)i}'\alpha,\, 1\big) \\
z_{i1} \mid u_i &= I_{u_i > 0} \\
\text{and, for } t > 1\text{:} \\
v_{i(t-1)} \mid z_{i(t-1)}, \beta^{(c)}_{t-1}, \delta^{(s)}_{t-1} &\sim z_{i(t-1)} N\big(\delta^{(s)}_{t-1} + x_{i(t-1)}'\beta^{(c)}_{t-1},\, 1\big) + (1 - z_{i(t-1)}) N\big(x_{i(t-1)}'\beta^{(c)}_{t-1},\, 1\big) \\
z_{it} \mid v_{i(t-1)} &= I_{v_{i(t-1)} > 0}
\end{aligned} \quad (2\text{–}31)
\]

Observation model:

\[
\begin{aligned}
w_{ijt} \mid \lambda_t &\sim N\big(q_{ijt}'\lambda_t,\, 1\big) \\
y_{ijt} \mid z_{it}, w_{ijt} &\sim \text{Bernoulli}\big(z_{it} I_{w_{ijt} > 0}\big)
\end{aligned} \quad (2\text{–}32)
\]
Note that the result presented in Section 2.2 corresponds to the particular case T = 1 of the model specified by Equations 2–31 and 2–32.
As mentioned previously, model parameters are obtained using a Gibbs sampling approach. Let φ(x | μ, σ²) denote the pdf of a normally distributed random variable x with mean μ and variance σ². Also let:

1. W_t = (w_1t, w_2t, ..., w_Nt), with w_it = (w_i1t, w_i2t, ..., w_iJ_it t), for i = 1, 2, ..., N and t = 1, 2, ..., T,
2. u = (u_1, u_2, ..., u_N), and
3. V = (v_1, ..., v_{T−1}), with v_t = (v_1t, v_2t, ..., v_Nt).
For the probit link model, the joint posterior distribution is

\[
\pi\big(z, u, V, \{W_t\}_{t=1}^{T}, \alpha, B^{(c)}, \Delta^{(s)}, \Lambda\big) \propto [\alpha] \prod_{i=1}^{N} \phi\big(u_i \mid x_{(o)i}'\alpha,\, 1\big)\, I_{u_i>0}^{z_{i1}}\, I_{u_i\le 0}^{1-z_{i1}} \\
\times \prod_{t=2}^{T} \big[\beta^{(c)}_{t-1}, \delta^{(s)}_{t-1}\big] \prod_{i=1}^{N} \phi\big(v_{i(t-1)} \mid \mu^{(v)}_{i(t-1)},\, 1\big)\, I_{v_{i(t-1)}>0}^{z_{it}}\, I_{v_{i(t-1)}\le 0}^{1-z_{it}} \\
\times \prod_{t=1}^{T} [\lambda_t] \prod_{i=1}^{N} \prod_{j=1}^{J_{it}} \phi\big(w_{ijt} \mid q_{ijt}'\lambda_t,\, 1\big)\, \big(z_{it} I_{w_{ijt}>0}\big)^{y_{ijt}} \big(1 - z_{it} I_{w_{ijt}>0}\big)^{1-y_{ijt}},
\]

where μ⁽ᵛ⁾_{i(t−1)} = z_{i(t−1)} δ⁽ˢ⁾_{t−1} + x'_{i(t−1)} β⁽ᶜ⁾_{t−1}. (2–33)
Initialize the Gibbs sampler at α⁽⁰⁾, B⁽ᶜ⁾⁽⁰⁾, Δ⁽ˢ⁾⁽⁰⁾, and Λ⁽⁰⁾. The sampler then proceeds iteratively, block sampling sequentially for each primary sampling period as follows: first the presence indicators, then the latent variables from the data-augmentation step for the presence component, followed by the parameters for the presence process, then the latent variables for the detection component, and finally the parameters for the detection component. Letting [· | ·] denote the full conditional density of a component given all other unknown parameters and the observed data, for m = 1, ..., n_sim the sampling procedure can be summarized as

\[
[z_1^{(m)} \mid \cdot\,] \to [u^{(m)} \mid \cdot\,] \to [\alpha^{(m)} \mid \cdot\,] \to [W_1^{(m)} \mid \cdot\,] \to [\lambda_1^{(m)} \mid \cdot\,] \to [z_2^{(m)} \mid \cdot\,] \to [V_1^{(m)} \mid \cdot\,] \to [\beta_1^{(c)(m)}, \delta_1^{(s)(m)} \mid \cdot\,] \to [W_2^{(m)} \mid \cdot\,] \to [\lambda_2^{(m)} \mid \cdot\,] \to \cdots \to [z_T^{(m)} \mid \cdot\,] \to [V_{T-1}^{(m)} \mid \cdot\,] \to [\beta_{T-1}^{(c)(m)}, \delta_{T-1}^{(s)(m)} \mid \cdot\,] \to [W_T^{(m)} \mid \cdot\,] \to [\lambda_T^{(m)} \mid \cdot\,]
\]

The full conditional probability densities for this Gibbs sampling algorithm are presented in detail in Appendix A.
Logit link Polya-Gamma DYMOSS model
Using the same notation as before, the logit link model resorts to the hierarchy given by:

State model:

\[
\begin{aligned}
u_i \mid \alpha &\sim \text{PG}\big(1,\; x_{(o)i}'\alpha\big) \\
z_{i1} \mid u_i &\sim \text{Bernoulli}\big(I_{u_i>0}\big) \\
\text{and, for } t > 1\text{:} \\
v_{i(t-1)} \mid z_{i(t-1)}, \delta^{(s)}_{t-1}, \beta^{(c)}_{t-1} &\sim \text{PG}\big(1,\; \big|z_{i(t-1)}\delta^{(s)}_{t-1} + x_{i(t-1)}'\beta^{(c)}_{t-1}\big|\big) \\
z_{it} \mid v_{i(t-1)} &\sim \text{Bernoulli}\big(I_{v_{i(t-1)}>0}\big)
\end{aligned} \quad (2\text{–}34)
\]

Observation model:

\[
\begin{aligned}
w_{ijt} \mid \lambda_t &\sim \text{PG}\big(1,\; q_{ijt}'\lambda_t\big) \\
y_{ijt} \mid z_{it}, w_{ijt} &\sim \text{Bernoulli}\big(z_{it} I_{w_{ijt}>0}\big)
\end{aligned} \quad (2\text{–}35)
\]
The logit link version of the joint posterior is given by

\[
\pi\big(z, u, V, \{W_t\}_{t=1}^{T}, \alpha, B^{(c)}, \Delta^{(s)}, \Lambda\big) \propto [\alpha][\lambda_1] \prod_{i=1}^{N} \frac{\big(e^{x_{(o)i}'\alpha}\big)^{z_{i1}}}{1 + e^{x_{(o)i}'\alpha}}\, \text{PG}\big(u_i;\, 1,\, |x_{(o)i}'\alpha|\big) \\
\times \prod_{j=1}^{J_{i1}} \left(z_{i1} \frac{e^{q_{ij1}'\lambda_1}}{1 + e^{q_{ij1}'\lambda_1}}\right)^{y_{ij1}} \left(1 - z_{i1} \frac{e^{q_{ij1}'\lambda_1}}{1 + e^{q_{ij1}'\lambda_1}}\right)^{1-y_{ij1}} \text{PG}\big(w_{ij1};\, 1,\, |z_{i1} q_{ij1}'\lambda_1|\big) \\
\times \prod_{t=2}^{T} \big[\delta^{(s)}_{t-1}\big]\big[\beta^{(c)}_{t-1}\big][\lambda_t] \prod_{i=1}^{N} \frac{\big(\exp\big[\mu^{(v)}_{i(t-1)}\big]\big)^{z_{it}}}{1 + \exp\big[\mu^{(v)}_{i(t-1)}\big]}\, \text{PG}\big(v_{it};\, 1,\, \big|\mu^{(v)}_{i(t-1)}\big|\big) \\
\times \prod_{j=1}^{J_{it}} \left(z_{it} \frac{e^{q_{ijt}'\lambda_t}}{1 + e^{q_{ijt}'\lambda_t}}\right)^{y_{ijt}} \left(1 - z_{it} \frac{e^{q_{ijt}'\lambda_t}}{1 + e^{q_{ijt}'\lambda_t}}\right)^{1-y_{ijt}} \text{PG}\big(w_{ijt};\, 1,\, |z_{it} q_{ijt}'\lambda_t|\big), \quad (2\text{–}36)
\]

with μ⁽ᵛ⁾_{i(t−1)} = z_{i(t−1)} δ⁽ˢ⁾_{t−1} + x'_{i(t−1)} β⁽ᶜ⁾_{t−1}.
The sampling procedure is entirely analogous to that described for the probit version. The full conditional densities derived from expression 2–36 are described in detail in Appendix A.

2.3.2 Incorporating Spatial Dependence
In this section we describe how an additional layer of complexity, space, can also be accounted for within the same data-augmentation framework. The method we employ to incorporate spatial dependence is a slightly modified version of the traditional approach for spatial generalized linear mixed models (GLMMs) and extends the model proposed by Johnson et al. (2013) for the single season, closed population occupancy model.

The traditional approach consists of using spatial random effects to induce a correlation structure among adjacent sites. This formulation, introduced by Besag et al. (1991), assumes that the spatial random effect corresponds to a Gaussian Markov Random Field (GMRF). The model, known as the Spatial GLMM (SGLMM), is used to analyze areal data. It has been applied extensively, given the flexibility of its hierarchical formulation and the availability of software for its implementation (Hughes & Haran 2013).
Succinctly, the spatial dependence is accounted for in the model by adding a random vector η assumed to have a conditionally autoregressive (CAR) prior (also known as a Gaussian Markov random field prior). To define the prior, let the pair G = (V, E) represent the undirected graph for the entire spatial region studied, where V = (1, 2, ..., N) denotes the vertices of the graph (the sites) and E the set of edges between sites; E is constituted by elements of the form (i, j) indicating that sites i and j are spatially adjacent, for i, j ∈ V. The prior for the spatial effects is then characterized by

\[
[\eta \mid \tau] \propto \tau^{\operatorname{rank}(Q)/2} \exp\Big[-\frac{\tau}{2}\, \eta' Q\, \eta\Big], \quad (2\text{–}37)
\]

where Q = (diag(A1) − A) is the precision matrix (we reuse the symbol Q here for the CAR precision, not the detection design matrices), with A denoting the adjacency matrix; the entries of A are such that diag(A) = 0 and A_ij = I_{(i,j)∈E}.

The matrix Q is singular; hence the probability density defined in Equation 2–37 is improper, i.e., it does not integrate to 1. Regardless of the impropriety of the prior, the model can be fitted using a Bayesian approach, since even with an improper prior the posterior for the model parameters is proper. If a constraint such as Σ_k η_k = 0 is imposed, or if the precision matrix is replaced by a positive definite matrix, the model can also be fitted using a maximum likelihood approach.
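The construction of the CAR precision matrix and the source of the impropriety can be checked numerically; the 3x3 rook lattice below is an illustrative graph, not one taken from the data.

```python
import numpy as np

# Construct the ICAR precision matrix Q = diag(A 1) - A for a small 3x3
# lattice with rook adjacency, and verify the rank deficiency that makes
# the prior in (2-37) improper.
n = 3
N = n * n
A = np.zeros((N, N))
for i in range(n):
    for j in range(n):
        s = i * n + j
        if i + 1 < n:                      # neighbor below
            A[s, s + n] = A[s + n, s] = 1
        if j + 1 < n:                      # neighbor to the right
            A[s, s + 1] = A[s + 1, s] = 1

Qprec = np.diag(A.sum(axis=1)) - A         # precision matrix of the CAR prior
eigvals = np.linalg.eigvalsh(Qprec)
rank = int(np.sum(eigvals > 1e-10))        # = N - 1 for a connected graph

# Every row of Qprec sums to zero, so the constant vector lies in its null
# space: the density is invariant to adding a constant to eta, hence improper.
row_sums = Qprec.sum(axis=1)
```

The rank deficiency equals the number of connected components of the graph, which is why rank(Q) rather than N appears in the exponent of (2–37).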
Assuming that all but the detection process are subject to spatial correlation, and using the notation developed up to this point, the spatially explicit version of the DYMOSS model is characterized by the hierarchy represented by Equations 2–38 and 2–39.

Hence, adding spatial structure into the DYMOSS framework described in the previous section only involves adding steps to sample η⁽ᵒ⁾ and {η_t}_{t=2}^T conditional on all other parameters. Furthermore, the corresponding parameters and spatial random effects of a given component (i.e., occupancy, survival, and colonization) can be effortlessly pooled together into a single parameter vector to perform block sampling. For each of the latent variables, the only modification required is to add the corresponding spatial effect to the linear predictor, so that these retain their conditional independence given the linear combination of fixed effects and the spatial effects.
State model:

\[
\begin{aligned}
z_{i1} \mid \alpha &\sim \text{Bernoulli}(\psi_{i1}), \quad \text{where } \psi_{i1} = F\big(x_{(o)i}'\alpha + \eta^{(o)}_i\big) \\
\big[\eta^{(o)} \mid \tau\big] &\propto \tau^{\operatorname{rank}(Q)/2} \exp\Big[-\frac{\tau}{2}\, \eta^{(o)\prime} Q\, \eta^{(o)}\Big] \\
z_{it} \mid z_{i(t-1)}, \delta^{(s)}_{t-1}, \beta^{(c)}_{t-1} &\sim \text{Bernoulli}\big(z_{i(t-1)}\theta_{i(t-1)} + (1 - z_{i(t-1)})\gamma_{i(t-1)}\big), \\
&\quad \text{where } \theta_{i(t-1)} = F\big(\delta^{(s)}_{t-1} + x_{i(t-1)}'\beta^{(c)}_{t-1} + \eta_{it}\big) \text{ and } \gamma_{i(t-1)} = F\big(x_{i(t-1)}'\beta^{(c)}_{t-1} + \eta_{it}\big) \\
[\eta_t \mid \tau] &\propto \tau^{\operatorname{rank}(Q)/2} \exp\Big[-\frac{\tau}{2}\, \eta_t' Q\, \eta_t\Big]
\end{aligned} \quad (2\text{–}38)
\]
Observation model:

\[
y_{ijt} \mid z_{it}, \lambda_t \sim \text{Bernoulli}(z_{it}\, p_{ijt}), \quad \text{where } p_{ijt} = F\big(q_{ijt}'\lambda_t\big) \quad (2\text{–}39)
\]
In spite of the popularity of this approach to incorporating spatial dependence, three shortcomings have been reported in the literature (Hughes & Haran 2013; Reich et al. 2006): (1) model parameters have no clear interpretation, due to spatial confounding of the predictors with the spatial effect; (2) there is variance inflation due to spatial confounding; and (3) the high dimensionality of the latent spatial variables leads to high computational costs. To avoid such difficulties, we follow the approach of Hughes & Haran (2013), which builds upon earlier work by Reich et al. (2006). This methodology is summarized in what follows.
Let a vector of spatial effects η have the CAR prior given by 2–37 above. Now consider a random vector ζ ~ MVN(0, τK'QK), where τK'QK corresponds to the precision of the distribution (and not the covariance matrix), with Q defined as above and the matrix K satisfying K'K = I.

This last condition implies that the linear predictor becomes Xβ + η = Xβ + Kζ. With respect to how the matrix K is chosen, Hughes & Haran (2013) recommend basing its construction on the spectral decomposition of operator matrices based on Moran's I. The Moran operator matrix is defined as P⊥AP⊥, with P⊥ = I − X(X'X)⁻¹X', and where A is the adjacency matrix previously described. The choice of the Moran operator is based on the fact that it accounts for the underlying graph while incorporating the spatial structure residual to the design matrix X. These elements are reflected in its spectral decomposition: the eigenvalues correspond to the values of Moran's I statistic (a measure of spatial autocorrelation) for a spatial process orthogonal to X, while the eigenvectors provide the patterns of spatial dependence residual to X. Thus, the matrix K is chosen to be the matrix whose columns are the eigenvectors of the Moran operator for a particular adjacency matrix.
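The construction of K can be sketched on a small lattice; the design matrix and the number of eigenvectors kept (q) below are illustrative choices, not values from the application.

```python
import numpy as np

rng = np.random.default_rng(5)

# Moran operator P_perp A P_perp for a 4x4 rook lattice, with K taken as
# its leading eigenvectors, following the Hughes & Haran (2013) construction.
n = 4
N = n * n
A = np.zeros((N, N))
for i in range(n):
    for j in range(n):
        s = i * n + j
        if i + 1 < n:
            A[s, s + n] = A[s + n, s] = 1
        if j + 1 < n:
            A[s, s + 1] = A[s + 1, s] = 1

X = np.column_stack([np.ones(N), rng.normal(size=N)])   # intercept + one covariate
P_perp = np.eye(N) - X @ np.linalg.inv(X.T @ X) @ X.T
M = P_perp @ A @ P_perp                                 # Moran operator

evals, evecs = np.linalg.eigh(M)                        # eigh returns ascending order
order = np.argsort(evals)[::-1]
q = 4                                                   # basis size, a tuning choice
K = evecs[:, order[:q]]                                 # patterns of positive dependence

# eta = K @ zeta then gives a q-dimensional spatial effect orthogonal to X,
# mitigating spatial confounding and reducing the dimension from N to q.
```

Because the retained eigenvectors have nonzero eigenvalues, they lie in the column space of P⊥ and are therefore exactly orthogonal to X, which is what removes the confounding between the fixed effects and the spatial effect.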
Using this strategy, the new hierarchical formulation of our model is obtained simply by letting η⁽ᵒ⁾ = K⁽ᵒ⁾ζ⁽ᵒ⁾ and η_t = K_t ζ_t, with:

1. ζ⁽ᵒ⁾ ~ MVN(0, τ⁽ᵒ⁾ K⁽ᵒ⁾′QK⁽ᵒ⁾), where K⁽ᵒ⁾ is the matrix of eigenvectors of P⁽ᵒ⁾⊥AP⁽ᵒ⁾⊥, and
2. ζ_t ~ MVN(0, τ_t K_t′QK_t), where K_t is the matrix of eigenvectors of P⊥_t A P⊥_t, for t = 2, 3, ..., T.

The algorithms for the probit and logit links from Section 2.3.1 can be readily adapted to incorporate the spatial structure, simply by obtaining the joint posteriors for (α, ζ⁽ᵒ⁾) and (β⁽ᶜ⁾_{t−1}, δ⁽ˢ⁾_{t−1}, ζ_t) and making the obvious modification of the corresponding linear predictors to incorporate the spatial components.

2.4 Summary
With a few exceptions (Dorazio & Taylor-Rodríguez 2012; Johnson et al. 2013; Royle & Kery 2007), recent Bayesian approaches to site-occupancy modeling with covariates have relied on model configurations (e.g., multivariate normal priors on parameters in the logit scale) that lead to unfamiliar conditional posterior distributions, thus precluding the use of a direct sampling approach. Therefore, the sampling strategies available are based on algorithms (e.g., Metropolis-Hastings) that require tuning and the knowledge to do so correctly.

In Dorazio & Taylor-Rodríguez (2012) we proposed a Bayesian specification for which a Gibbs sampler of the basic occupancy model is available, allowing detection and occupancy probabilities to depend on linear combinations of predictors. This method, described in Section 2.2.1, is based on the data augmentation algorithm of Albert & Chib (1993), in which the full conditional posteriors of the parameters of the probit regression model are cast as latent mixtures of normal random variables. The probit and the logit links yield similar results with large sample sizes; however, their results may differ when small to moderate sample sizes are considered, because the logit link function places more mass in the tails of the distribution than the probit link does. In Section 2.2.2 we adapt the method for the single season model to work with the logit link function.
The basic occupancy framework is useful, but it assumes a single closed population with fixed probabilities through time. Hence, its assumptions may not be appropriate to address problems where the interest lies in the temporal dynamics of the population. We therefore developed a dynamic model that incorporates the notion that occupancy at a previously occupied site takes place through persistence, which depends on both survival and habitat suitability. By this we mean that a site occupied at time t may again be occupied at time t+1 if (1) the current settlers survive, (2) the existing settlers perish but new settlers simultaneously colonize, or (3) current settlers survive and new ones colonize during the same season. In our current formulation of the DYMOSS, both colonization and persistence depend on habitat suitability, characterized by x'_{i(t−1)}β⁽ᶜ⁾_{t−1}. They differ only in that persistence is also influenced by whether the site's being occupied during season t−1 enhances the suitability of the site or harms it through density dependence.

Additionally, the study of the dynamics that govern the distribution and abundance of biological populations requires an understanding of the physical and biotic processes that act upon them, and these vary in time and space. Consequently, as a final step in this chapter, we described a straightforward strategy to add spatial dependence among neighboring sites in the dynamic metapopulation model. This extension is based on the popular Bayesian spatial modeling technique of Besag et al. (1991), updated using the methods described in Hughes & Haran (2013).
Future steps along these lines are to (1) develop the software necessary to implement the tools described throughout the chapter, and (2) build a suite of additional extensions of this framework for occupancy models. The first of these extensions will incorporate information from different sources, such as tracks, scats, surveys, and direct observations, into a single model; this can be accomplished by adding a layer to the hierarchy where the source and spatial scale of the data are accounted for. The second extension is a single season, spatially explicit, multiple species co-occupancy model, which will allow studying complex interactions and testing hypotheses about species interactions at a given point in time. Lastly, this co-occupancy model will be adapted to incorporate temporal dynamics in the spirit of the DYMOSS model.
CHAPTER 3
INTRINSIC ANALYSIS FOR OCCUPANCY MODELS

Eliminate all other factors, and the one which remains must be the truth.
–Sherlock Holmes, The Sign of Four

3.1 Introduction
Occupancy models are often used to understand the mechanisms that dictate the distribution of a species; therefore, variable selection plays a fundamental role in achieving this goal. To the best of our knowledge, "objective" Bayesian alternatives for variable selection have not been put forth for this problem, and with a few exceptions (Hooten & Hobbs 2014; Link & Barker 2009), AIC is the method used to choose from competing site-occupancy models. In addition, the procedures currently implemented and accessible to ecologists require enumerating and estimating all the candidate models (Fiske & Chandler 2011; Mazerolle & Mazerolle 2013). In practice this can be achieved if the model space considered is small enough, which is possible if the choice of the model space is guided by substantial prior knowledge about the underlying ecological processes. Nevertheless, many site-occupancy surveys collect large amounts of covariate information about the sampled sites. Given that the total number of candidate models grows exponentially fast with the number of predictors considered, choosing a reduced set of models guided by ecological intuition becomes increasingly difficult. This is even more so the case in the occupancy model context, where the model space is the Cartesian product of models for presence and models for detection. Given the issues mentioned above, we propose the first objective Bayesian variable selection method for the single-season occupancy model framework. This approach explores the entire model space in a principled manner. It is completely automatic, precluding the need both for tuning parameters in the sampling algorithm and for subjective elicitation of parameter prior distributions.
As mentioned above, in ecological modeling, if model selection or (less frequently) model averaging is considered, the Akaike Information Criterion (AIC) (Akaike, 1983), or a version of it, is the measure of choice for comparing candidate models (Fiske & Chandler, 2011; Mazerolle & Mazerolle, 2013). The AIC is designed to find the model whose density is, on average, closest in Kullback-Leibler divergence to the density of the true data-generating mechanism. The model with the smallest AIC is selected. However, if nested models are considered, one of them being the true one, generally the AIC will not select it (Wasserman, 2000). Commonly, the model selected by AIC will be more complex than the true one. The reason for this is that the AIC has a weak signal-to-noise ratio and, as such, it tends to overfit (Rao & Wu, 2001). Other versions of the AIC provide a bias correction that enhances the signal-to-noise ratio, leading to a stronger penalization for model complexity. Some examples are the AICc (Hurvich & Tsai, 1989) and the AICu (McQuarrie et al., 1997); however, these are also not consistent for selection, albeit asymptotically efficient (Rao & Wu, 2001).
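To make the comparison concrete, the following sketch computes AIC and its small-sample correction AICc for a Gaussian linear model; the data and the extra noise predictors are invented for illustration.

```python
import numpy as np

def gaussian_loglik(y, X):
    """Maximized Gaussian log-likelihood of a linear model fit by least squares."""
    n = len(y)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    sigma2 = rss / n  # MLE of the error variance
    return -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)

def aic_aicc(loglik, k, n):
    """AIC penalizes each of the k parameters by 2; AICc adds the bias correction."""
    aic = -2 * loglik + 2 * k
    aicc = aic + 2 * k * (k + 1) / (n - k - 1)
    return aic, aicc

rng = np.random.default_rng(1)
n = 40
X_true = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X_true @ np.array([1.0, 2.0]) + rng.normal(size=n)
# pad the true model with 5 irrelevant predictors to mimic overfitting
X_big = np.column_stack([X_true, rng.normal(size=(n, 5))])
for name, X in [("true", X_true), ("overfit", X_big)]:
    k = X.shape[1] + 1  # regression coefficients plus the variance
    print(name, aic_aicc(gaussian_loglik(y, X), k, n))
```

Because the AICc penalty grows as `k` approaches `n`, it discourages the overfit model more strongly than plain AIC does.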
If we are interested in prediction, as opposed to testing, the AIC is certainly appropriate. However, when conducting inference, the use of Bayesian model averaging and selection methods is more fitting. If the true data-generating mechanism is among those considered, asymptotically, Bayesian methods choose the true model with probability one. Conversely, if the true model is not among the alternatives and a suitable parameter prior is used, the posterior probability of the most parsimonious model closest to the true one tends asymptotically to one.

In spite of this, in general, for Bayesian testing, direct elicitation of prior probabilistic statements is often impeded because the problems studied may not be sufficiently well understood to make an informed decision about the priors. Alternatively, there may be a prohibitively large number of parameters, making specifying priors for each of these parameters an arduous task. In addition, seemingly innocuous subjective choices for the priors on the parameter space may drastically affect test outcomes. This has been a recurring argument in favor of objective Bayesian procedures, which appeal to the use of formal rules to build parameter priors that incorporate the structural information inside the likelihood while utilizing some objective criterion (Kass & Wasserman, 1996).
One popular choice of "objective" prior is the reference prior (Berger & Bernardo, 1992), which is the prior that maximizes the amount of signal extracted from the data. These priors have proven to be effective, as they are fully automatic and can be frequentist matching, in the sense that the posterior credible interval agrees with the frequentist confidence interval from repeated sampling with equal coverage probability (Kass & Wasserman, 1996). Reference priors, however, are improper, and while they yield reasonable posterior parameter probabilities, the derived model posterior probabilities may be ill-defined. To avoid this shortcoming, Berger & Pericchi (1996) introduced the intrinsic Bayes factor (IBF) for model comparison. Moreno et al. (1998), building on the IBF of Berger & Pericchi (1996), developed a limiting procedure to generate a system of priors that yield well-defined posteriors, even though these priors may sometimes be improper. The IBF is built using a data-dependent prior to automatically generate Bayes factors; however, the extension introduced by Moreno et al. (1998) generates the intrinsic prior by taking a theoretical average over the space of training samples, freeing the prior from data dependence.
In our view, in the face of a large number of predictors, the best alternative is to run a stochastic search algorithm using good "objective" testing parameter priors and to incorporate suitable model priors. This being said, the discussion about model priors is deferred until Chapter 4; this chapter focuses on the priors on the parameter space.

The chapter is structured as follows. First, issues surrounding multimodel inference are described, and insight about objective Bayesian inferential procedures is provided. Then, building on modern methods for "objective" Bayesian testing to generate priors on the parameter space, the intrinsic priors for the parameters of the occupancy model are derived. These are used in the construction of an algorithm for "objective" model selection tailored to the occupancy model framework. To assess the performance of our methods, we provide results from a simulation study in which distinct scenarios, both favorable and unfavorable, are used to determine the robustness of these tools, and we analyze the Blue Hawker data set, which has been examined previously in the ecological literature (Dorazio & Taylor-Rodríguez, 2012; Kéry et al., 2010).

3.2 Objective Bayesian Inference
As mentioned before, in practice, noninformative priors arising from structural rules are an alternative to subjective elicitation of priors. Some of the rules used in defining noninformative priors include the principle of insufficient reason, parametrization invariance, maximum entropy, geometric arguments, coverage matching, and decision-theoretic approaches (see Kass & Wasserman (1996) for a discussion).

These rules reflect one of two attitudes: (1) noninformative priors either aim to convey unique representations of ignorance, or (2) they attempt to produce probability statements that may be accepted by convention. This latter attitude is in the same spirit as how weights and distances are defined (Kass & Wasserman, 1996) and characterizes the way in which Bayesian reference methods are interpreted today; i.e., noninformative priors are seen to be chosen by convention according to the situation.

A word of caution must be given when using noninformative priors. Difficulties arise in their implementation that should not be taken lightly. In particular, these difficulties may occur because noninformative priors are generally improper (meaning that they do not integrate or sum to a finite number) and, as such, are said to depend on arbitrary constants.
Bayes factors strongly depend upon the prior distributions for the parameters included in each of the models being compared. This can be an important limitation considering that, when noninformative priors are used, the Bayes factors become functions of ratios of arbitrary constants, given that these priors are typically improper (see Jeffreys, 1961; Pericchi, 2005; and references therein). Many different approaches have since been developed to deal with the arbitrary constants that arise with improper priors. These include the use of partial Bayes factors (Berger & Pericchi, 1996; Good, 1950; Lempers, 1971), setting the ratio of arbitrary constants to a predefined value (Spiegelhalter & Smith, 1982), and approximations to the Bayes factor (see Haughton, 1988, as cited in Berger & Pericchi, 1996; Kass & Raftery, 1995; Tierney & Kadane, 1986).

3.2.1 The Intrinsic Methodology
Berger & Pericchi (1996) cleverly dealt with the arbitrary constants that arise from improper priors by introducing the intrinsic Bayes factor (IBF) procedure. This solution, based on partial Bayes factors, provides the means to replace the improper priors by proper "posterior" priors. The IBF is obtained by combining the model structure with information contained in the observed data. Furthermore, they showed that, as the sample size tends to infinity, the intrinsic Bayes factor corresponds to the proper Bayes factor arising from the intrinsic priors.

Intrinsic priors, however, are not unique. The asymptotic correspondence between the IBF and the Bayes factor arising from the intrinsic prior yields two functional equations that are solved by a whole class of intrinsic priors. Because all the priors in the class produce Bayes factors that are asymptotically equivalent to the IBF, for finite sample sizes the resulting Bayes factor is not unique. To address this issue, Moreno et al. (1998) formalized the methodology through the "limiting procedure." This procedure allows one to obtain a unique Bayes factor, consolidating the method as a valid objective Bayesian model selection procedure, which we will refer to as the Bayes factor for intrinsic priors (BFIP). This result is particularly valid for nested models, although the methodology may be extended, with some caution, to nonnested models.
As mentioned before, the Bayesian hypothesis testing procedure is highly sensitive to parameter-prior specification, and not all priors that are useful for estimation are recommended for hypothesis testing or model selection. Evidence of this is provided by the Jeffreys-Lindley paradox, which states that a point null hypothesis will always be accepted when the variance of a conjugate prior goes to infinity (Robert, 1993). Additionally, when comparing nested models, the null model should correspond to a substantial reduction in complexity from that of larger alternative models. Hence, priors for the larger alternative models that place probability mass away from the null model are wasteful: if the true model is "far" from the null, it will be easily detected by any statistical procedure. Therefore, the prior on the alternative models should "work harder" at selecting competitive models that are "close" to the null. This principle, known as the Savage continuity condition (Gunel & Dickey, 1974), is widely recognized by statisticians.

Interestingly, the intrinsic prior in correspondence with the BFIP automatically satisfies the Savage continuity condition. That is, when comparing nested models, the intrinsic prior for the more complex model is centered around the null model and, in spite of arising from a limiting procedure, it is not subject to the Jeffreys-Lindley paradox.

Moreover, beyond the usual pairwise consistency of the Bayes factor for nested models, Casella et al. (2009) show that the corresponding Bayesian procedure with intrinsic priors for variable selection in normal regression is consistent in the entire class of normal linear models, adding an important feature to the list of virtues of the procedure. Consistency of the BFIP for the case where the dimension of the alternative model grows with the sample size is discussed in Moreno et al. (2010).

3.2.2 Mixtures of g-Priors
As previously mentioned, in the Bayesian paradigm, a model $M \in \mathcal{M}$ is defined by a sampling density and a prior distribution. The sampling density associated with model $M$ is denoted by $f(y \mid \beta_M, \sigma^2_M, M)$, where $(\beta_M, \sigma^2_M)$ is a vector of model-specific unknown parameters. The prior for model $M$ and its corresponding set of parameters is

$$\pi(\beta_M, \sigma^2_M, M \mid \mathcal{M}) = \pi(\beta_M, \sigma^2_M \mid M, \mathcal{M}) \cdot \pi(M \mid \mathcal{M}).$$

Objective local priors for the model parameters $(\beta_M, \sigma^2_M)$ are achieved through modifications and extensions of Zellner's g-prior (Liang et al., 2008; Womack et al., 2014). In particular, below we focus on the intrinsic prior and provide some details for other scaled mixtures of g-priors. We defer the discussion of priors over the model space until Chapter 5, where we describe them in detail and develop a few alternatives of our own.

3.2.2.1 Intrinsic priors
An automatic choice of an objective prior is the intrinsic prior (Berger & Pericchi, 1996; Moreno et al., 1998). Because $M_B \subseteq M$ for all $M \in \mathcal{M}$, the intrinsic prior for $(\beta_M, \sigma^2_M)$ is defined as an expected posterior prior,

$$\pi^I(\beta_M, \sigma^2_M \mid M) = \int p^R(\beta_M, \sigma^2_M \mid \vec{y}, M)\, m^R(\vec{y} \mid M_B)\, d\vec{y}, \qquad (3\text{--}1)$$

where $\vec{y}$ is a minimal training sample for model $M$, $I$ denotes intrinsic distributions, and $R$ denotes distributions derived from the reference prior $\pi^R(\beta_M, \sigma^2_M \mid M) = c_M\, \frac{d\beta_M\, d\sigma^2_M}{\sigma^2_M}$. In (3--1), $m^R(\vec{y} \mid M) = \iint f(\vec{y} \mid \beta_M, \sigma^2_M, M)\, \pi^R(\beta_M, \sigma^2_M \mid M)\, d\beta_M\, d\sigma^2_M$ is the reference marginal of $\vec{y}$ under model $M$, and $p^R(\beta_M, \sigma^2_M \mid \vec{y}, M) = f(\vec{y} \mid \beta_M, \sigma^2_M, M)\, \pi^R(\beta_M, \sigma^2_M \mid M) / m^R(\vec{y} \mid M)$ is the reference posterior density.
In the regression framework, the reference marginal $m^R$ is improper and produces improper intrinsic priors. However, the intrinsic Bayes factor of model $M$ to the base model $M_B$ is well defined and given by

$$BF^{I}_{M,M_B}(y) = (1-R^2_M)^{-\frac{n-|M_B|}{2}} \int_0^1 \left( \frac{n + \sin^2\!\big(\tfrac{\pi}{2}\theta\big)(|M|+1)}{n + \frac{\sin^2(\frac{\pi}{2}\theta)(|M|+1)}{1-R^2_M}} \right)^{\!\frac{n-|M|}{2}} \left( \frac{\sin^2\!\big(\tfrac{\pi}{2}\theta\big)(|M|+1)}{n + \frac{\sin^2(\frac{\pi}{2}\theta)(|M|+1)}{1-R^2_M}} \right)^{\!\frac{|M|-|M_B|}{2}} d\theta, \qquad (3\text{--}2)$$

where $R^2_M$ is the coefficient of determination of model $M$ versus model $M_B$. The Bayes factor between two models $M$ and $M'$ is defined as $BF^I_{M,M'}(y) = BF^I_{M,M_B}(y) / BF^I_{M',M_B}(y)$.
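Equation (3--2) is a one-dimensional integral over $\theta \in (0, 1)$ and is cheap to evaluate numerically. The sketch below does so with a midpoint rule and then normalizes the resulting Bayes factors as in (3--3); the three candidate models, their $R^2$ values, and sizes are hypothetical.

```python
import numpy as np

def log_bfip(r2, n, p_m, p_b, grid=4000):
    """Intrinsic Bayes factor (3-2) of model M (p_m parameters, fit r2)
    against the base model M_B (p_b parameters), on the log scale."""
    theta = (np.arange(grid) + 0.5) / grid  # midpoint rule on (0, 1)
    s = np.sin(np.pi * theta / 2.0) ** 2 * (p_m + 1)
    denom = n + s / (1.0 - r2)
    vals = ((n + s) / denom) ** ((n - p_m) / 2.0) * (s / denom) ** ((p_m - p_b) / 2.0)
    # the interval has length one, so the mean of the midpoint values is the integral
    return -(n - p_b) / 2.0 * np.log(1.0 - r2) + np.log(vals.mean())

# posterior probabilities as in (3-3), assuming equal model priors
n, p_b = 100, 1
models = {"M1": (0.30, 3), "M2": (0.32, 6), "M3": (0.05, 2)}
log_bf = np.array([log_bfip(r2, n, p, p_b) for r2, p in models.values()])
probs = np.exp(log_bf - log_bf.max())
probs /= probs.sum()
print(dict(zip(models, probs.round(3))))
```

As a sanity check, a "model" identical to the base ($R^2 = 0$, same size) returns a log Bayes factor of zero.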
The "goodness" of the model $M$ based on the intrinsic priors is given by its posterior probability,

$$p^I(M \mid y, \mathcal{M}) = \frac{BF^I_{M,M_B}(y)\, \pi(M \mid \mathcal{M})}{\sum_{M' \in \mathcal{M}} BF^I_{M',M_B}(y)\, \pi(M' \mid \mathcal{M})}. \qquad (3\text{--}3)$$

It has been shown that the system of intrinsic priors produces consistent model selection (Casella et al., 2009; Girón et al., 2010). In the context of well-formulated models, the true model $M_T$ is the smallest well-formulated model $M \in \mathcal{M}$ that includes the predictor indexed by $\alpha$ whenever $\beta_\alpha \neq 0$. If $M_T$ is the true model, then the posterior probability of model $M_T$ based on equation (3--3) converges to 1.

3.2.2.2 Other mixtures of g-priors
Scaled mixtures of g-priors place a reference prior on $(\beta_{M_B}, \sigma^2)$ and a multivariate normal prior on the coefficients $\beta$ of the predictors in $M \setminus M_B$, with mean $0$ and precision matrix

$$\frac{q_M w}{n \sigma^2}\, Z'_M (I - H_0) Z_M,$$

where $H_0$ is the hat matrix associated with $Z_{M_B}$. The prior is completed by a prior on $w$ and a choice of scaling $q_M$, which is set to $|M| + 1$ to account for the minimal sample size of $M$. Under these assumptions, the Bayes factor of $M$ to $M_B$ is given by

$$BF_{M,M_B}(y) = (1-R^2_M)^{-\frac{n-|M_B|}{2}} \int \left( \frac{n + w(|M|+1)}{n + \frac{w(|M|+1)}{1-R^2_M}} \right)^{\!\frac{n-|M|}{2}} \left( \frac{w(|M|+1)}{n + \frac{w(|M|+1)}{1-R^2_M}} \right)^{\!\frac{|M|-|M_B|}{2}} \pi(w)\, dw.$$

We consider the following priors on $w$. The intrinsic prior corresponds to $\pi(w) = \mathrm{Beta}(w \mid 0.5, 0.5)$, which is only defined for $w \in (0, 1)$. A version of the Zellner-Siow prior is given by $w \sim \mathrm{Gamma}(0.5, 0.5)$, which produces a multivariate Cauchy distribution on $\beta$. A family of hyper-g priors is defined by $\pi(w) \propto w^{-1/2}(\beta + w)^{-(\alpha+1)/2}$, which has Cauchy-like tails but produces more shrinkage than the Cauchy prior.
3.3 Objective Bayes Occupancy Model Selection

As mentioned before, Bayesian inferential approaches used for ecological models are lacking. In particular, there exists a need for suitable objective and automatic Bayesian testing procedures, and for software implementations that explore thoroughly the model space considered. With this goal in mind, in this section we develop an objective, intrinsic, and fully automatic Bayesian model selection methodology for single-season site-occupancy models. We refer to this method as automatic and objective given that, in its implementation, no hyperparameter tuning is required and that it is built using noninformative priors with good testing properties (e.g., intrinsic priors).

An inferential method for the occupancy problem is possible using the intrinsic approach, given that we are able to link intrinsic-Bayesian tools for the normal linear model through our probit formulation of the occupancy model. In other words, because we can represent the single-season probit occupancy model through the hierarchy

$$y_{ij} \mid z_i, w_{ij} \sim \mathrm{Bernoulli}\big(z_i I_{w_{ij} > 0}\big), \qquad w_{ij} \mid \lambda \sim N\big(q'_{ij}\lambda,\, 1\big),$$
$$z_i \mid v_i \sim \mathrm{Bernoulli}\big(I_{v_i > 0}\big), \qquad v_i \mid \alpha \sim N\big(x'_i\alpha,\, 1\big),$$

it is possible to solve the selection problem on the latent-scale variables $w_{ij}$ and $v_i$, and to use those results at the level of the occupancy and detection processes.
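The hierarchy above can be simulated directly, which is also how synthetic data sets for testing the method can be built. In this sketch, the numbers of sites and surveys, the covariates, and the coefficient values are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
N, J = 100, 4                       # sites and surveys per site (illustrative)
x = np.column_stack([np.ones(N), rng.normal(size=N)])       # presence design
q = np.dstack([np.ones((N, J)), rng.normal(size=(N, J))])   # per-visit design, (N, J, 2)
alpha = np.array([0.2, 1.0])        # illustrative presence coefficients
lam = np.array([-0.3, 0.8])         # illustrative detection coefficients

v = x @ alpha + rng.normal(size=N)          # latent presence scores
z = (v > 0).astype(int)                     # true presence indicators
w = q @ lam + rng.normal(size=(N, J))       # latent detection scores
y = z[:, None] * (w > 0).astype(int)        # a detection requires presence

print("occupied fraction:", z.mean())
```

Note that `y` is zero at every unoccupied site by construction, which is exactly the truncation the latent-scale selection procedure exploits.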
In what follows, first we provide some necessary notation. Then, a derivation of the intrinsic priors for the parameters of the detection and occupancy components is outlined. Using these priors, we obtain the general form of the model posterior probabilities. Finally, the results are incorporated in a model selection algorithm for site-occupancy data. Although the priors on the model space are not discussed in this chapter, the software and methods developed have different choices of model priors built in.
3.3.1 Preliminaries

The notation used in Chapter 2 will be considered in this section as well. Namely, presence is denoted by $z$, detection by $y$, their corresponding latent processes are $v$ and $w$, and the model parameters are denoted by $\alpha$ and $\lambda$. However, some additional notation is also necessary. Let $M_0 = \{M_{0y}, M_{0z}\}$ denote the "base" model, defined by the smallest models considered for the detection and presence processes. The base models $M_{0y}$ and $M_{0z}$ include predictors that must be contained in every model that belongs to the model space. Some examples of base models are the intercept-only model, a model with covariates related to the sampling design, and a model including some predictors important to the researcher that should be included in every model.

Furthermore, let the sets $[K_z] = \{1, 2, \dots, K_z\}$ and $[K_y] = \{1, 2, \dots, K_y\}$ index the covariates considered for the variable selection procedure for the presence and detection processes, respectively. That is, these sets denote the covariates that can be added to the base models in $M_0$, or removed from the largest possible models considered, $M_{Fz}$ and $M_{Fy}$, which we will refer to as the "full" models. The model space can then be represented by the Cartesian product of subsets $A_y \subseteq [K_y]$ and $A_z \subseteq [K_z]$: the entire model space is populated by models of the form $M_A = \{M_{A_y}, M_{A_z}\} \in \mathcal{M} = \mathcal{M}_y \times \mathcal{M}_z$, with $M_{A_y} \in \mathcal{M}_y$ and $M_{A_z} \in \mathcal{M}_z$.

For the presence process $z$, the design matrix for model $M_{A_z}$ is given by the block matrix $X_{A_z} = (X_0 \mid X_{r,A})$, where $X_0$ corresponds to the design matrix of the base model (which is such that $M_{0z} \subseteq M_{A_z} \in \mathcal{M}_z$ for all $A_z \subseteq [K_z]$) and $X_{r,A}$ corresponds to the submatrix that contains the covariates indexed by $A_z$. Analogously, for the detection process $y$, the design matrix is given by $Q_{A_y} = (Q_0 \mid Q_{r,A})$. Similarly, the coefficients for models $M_{A_z}$ and $M_{A_y}$ are given by $\alpha_A = (\alpha'_0, \alpha'_{r,A})'$ and $\lambda_A = (\lambda'_0, \lambda'_{r,A})'$.

With these elements in place, the model selection problem consists of finding subsets of covariates, indexed by $A = \{A_z, A_y\}$, that have high posterior probability given the detection and occupancy processes. This is equivalent to finding models with high posterior odds when compared to a suitable base model. These posterior odds are given by

$$\frac{p(M_A \mid y, z)}{p(M_0 \mid y, z)} = \frac{m(y, z \mid M_A)\, \pi(M_A)}{m(y, z \mid M_0)\, \pi(M_0)} = BF_{M_A, M_0}(y, z)\, \frac{\pi(M_A)}{\pi(M_0)}.$$
Since we are able to represent the occupancy model as a truncation of latent normal variables, it is possible to work through the occupancy model selection problem on the latent normal scale used for the presence and detection processes. We formulate two solutions to this problem: one that depends on the observed and latent components, and another that solely depends on the latent-level variables used to data-augment the problem. We will, however, focus on the latter approach, as this yields a straightforward MCMC sampling scheme. For completeness, the other alternative is described in Section 3.4.

At the root of our objective inferential procedure for occupancy models lies the conditional argument introduced by Womack et al. (work in progress) for simple probit regression. In the occupancy setting, the argument is
$$p(M_A \mid y, z, w, v) = \frac{m(y, z, v, w \mid M_A)\, \pi(M_A)}{m(y, z, w, v)}$$
$$= \frac{f_{y,z}(y, z \mid w, v) \left( \int f_{v,w}(v, w \mid \alpha, \lambda, M_A)\, \pi_{\alpha,\lambda}(\alpha, \lambda \mid M_A)\, d(\alpha, \lambda) \right) \pi(M_A)}{f_{y,z}(y, z \mid w, v) \sum_{M^* \in \mathcal{M}} \left( \int f_{v,w}(v, w \mid \alpha, \lambda, M^*)\, \pi_{\alpha,\lambda}(\alpha, \lambda \mid M^*)\, d(\alpha, \lambda) \right) \pi(M^*)}$$
$$= \frac{m(v \mid M_{A_z})\, m(w \mid M_{A_y})\, \pi(M_A)}{m(v)\, m(w)} \;\propto\; m(v \mid M_{A_z})\, m(w \mid M_{A_y})\, \pi(M_A), \qquad (3\text{--}4)$$

where

1. $f_{y,z}(y, z \mid w, v) = \prod_{i=1}^N I_{z_i v_i > 0}\, I_{(1-z_i) v_i \le 0} \prod_{j=1}^{J_i} (z_i I_{w_{ij} > 0})^{y_{ij}} (1 - z_i I_{w_{ij} > 0})^{1 - y_{ij}}$,

2. $f_{v,w}(v, w \mid \alpha, \lambda, M_A) = \underbrace{\left( \prod_{i=1}^N \phi(v_i;\, x'_i \alpha_{M_{A_z}}, 1) \right)}_{f(v \mid \alpha_{r,A}, \alpha_0, M_{A_z})} \underbrace{\left( \prod_{i=1}^N \prod_{j=1}^{J_i} \phi(w_{ij};\, q'_{ij} \lambda_{M_{A_y}}, 1) \right)}_{f(w \mid \lambda_{r,A}, \lambda_0, M_{A_y})}$, and

3. $\pi_{\alpha,\lambda}(\alpha, \lambda \mid M_A) = \pi_\alpha(\alpha \mid M_{A_z})\, \pi_\lambda(\lambda \mid M_{A_y})$.
This result implies that, once the occupancy and detection indicators are conditioned on the latent processes $v$ and $w$, respectively, the model posterior probabilities only depend on the latent variables. Hence, in this case, the model selection problem is driven by the posterior odds

$$\frac{p(M_A \mid y, z, w, v)}{p(M_0 \mid y, z, w, v)} = \frac{m(w, v \mid M_A)}{m(w, v \mid M_0)}\, \frac{\pi(M_A)}{\pi(M_0)}, \qquad (3\text{--}5)$$

where $m(w, v \mid M_A) = m(w \mid M_{A_y}) \cdot m(v \mid M_{A_z})$, with

$$m(v \mid M_{A_z}) = \iint f(v \mid \alpha_{r,A}, \alpha_0, M_{A_z})\, \pi(\alpha_{r,A} \mid \alpha_0, M_{A_z})\, \pi(\alpha_0)\, d\alpha_{r,A}\, d\alpha_0, \qquad (3\text{--}6)$$

$$m(w \mid M_{A_y}) = \iint f(w \mid \lambda_{r,A}, \lambda_0, M_{A_y})\, \pi(\lambda_{r,A} \mid \lambda_0, M_{A_y})\, \pi(\lambda_0)\, d\lambda_0\, d\lambda_{r,A}. \qquad (3\text{--}7)$$
3.3.2 Intrinsic Priors for the Occupancy Problem

In general, the intrinsic priors, as defined by Moreno et al. (1998), use the functional form of the response to inform their construction, assuming some preliminary prior distribution, proper or improper, on the model parameters. For our purposes, we assume noninformative improper priors for the parameters, denoted by $\pi^N(\cdot \mid \cdot)$. Specifically, the intrinsic priors $\pi^{IP}(\theta_{M^*} \mid M^*)$ for a vector of parameters $\theta_{M^*}$, corresponding to model $M^* \in \{M_0, M\} \subset \mathcal{M}$, for a response vector $s$ with probability density (or mass) function $f(s \mid \theta_{M^*})$, are defined by

$$\pi^{IP}(\theta_{M_0} \mid M_0) = \pi^N(\theta_{M_0} \mid M_0),$$
$$\pi^{IP}(\theta_{M} \mid M) = \pi^N(\theta_M \mid M) \int \frac{m(\vec{s} \mid M_0)}{m(\vec{s} \mid M)}\, f(\vec{s} \mid \theta_M, M)\, d\vec{s},$$

where $\vec{s}$ is a theoretical training sample.

In what follows, whenever it is clear from the context, in an attempt to simplify the notation, $M_A$ will be used to refer to $M_{A_z}$ or $M_{A_y}$, and $A$ will denote $A_z$ or $A_y$. To derive
the parameter priors involved in equations (3--6) and (3--7) using the objective intrinsic prior strategy, we start by assuming flat priors $\pi^N(\alpha_A \mid M_A) \propto c_A$ and $\pi^N(\lambda_A \mid M_A) \propto d_A$, where $c_A$ and $d_A$ are unknown constants.

The intrinsic prior for the parameters associated with the occupancy process, $\alpha_A$, conditional on model $M_A$, is

$$\pi^{IP}(\alpha_A \mid M_A) = \pi^N(\alpha_A \mid M_A) \int \frac{m(\vec{v} \mid M_0)}{m(\vec{v} \mid M_A)}\, f(\vec{v} \mid \alpha_A, M_A)\, d\vec{v},$$

where the marginals $m(\vec{v} \mid M_j)$, with $j \in \{A, 0\}$, are obtained by solving the analogue of equation (3--6) for the (theoretical) training sample $\vec{v}$. These marginals are given by

$$m(\vec{v} \mid M_j) = c_j\, (2\pi)^{-\frac{p_{A_z} - p_j}{2}}\, |\vec{X}'_j \vec{X}_j|^{-\frac{1}{2}}\, e^{-\frac{1}{2} \vec{v}'(I - \vec{H}_j)\vec{v}}.$$

The training sample $\vec{v}$ has dimension $p_{A_z} = |M_{A_z}|$, that is, the total number of parameters in model $M_{A_z}$. Note that, without ambiguity, we use $|\cdot|$ to denote both the cardinality of a set and the determinant of a matrix. The design matrix $\vec{X}_A$ corresponds to the training sample $\vec{v}$ and is chosen such that $\vec{X}'_A\vec{X}_A = \frac{p_{A_z}}{N} X'_A X_A$ (Leon-Novelo et al., 2012), and $\vec{H}_j$ is the corresponding hat matrix.

Replacing $m(\vec{v} \mid M_A)$ and $m(\vec{v} \mid M_0)$ in $\pi^{IP}(\alpha_A \mid M_A)$ and solving the integral with respect to the theoretical training sample $\vec{v}$, we have

$$\pi^{IP}(\alpha_A \mid M_A) = c_A \int \left( (2\pi)^{-\frac{p_{A_z}-p_{0z}}{2}} \left(\frac{c_0}{c_A}\right) e^{-\frac{1}{2}\vec{v}'\left((I-\vec{H}_0)-(I-\vec{H}_A)\right)\vec{v}}\, \frac{|\vec{X}'_A\vec{X}_A|^{1/2}}{|\vec{X}'_0\vec{X}_0|^{1/2}} \right) \left( (2\pi)^{-\frac{p_{A_z}}{2}}\, e^{-\frac{1}{2}(\vec{v}-\vec{X}_A\alpha_A)'(\vec{v}-\vec{X}_A\alpha_A)} \right) d\vec{v}$$
$$= c_0\, (2\pi)^{-\frac{p_{A_z}-p_{0z}}{2}}\, |\vec{X}'_{r,A}\vec{X}_{r,A}|^{\frac{1}{2}}\, 2^{-\frac{p_{A_z}-p_{0z}}{2}} \exp\left[-\frac{1}{2}\alpha'_{r,A}\left(\frac{1}{2}\vec{X}'_{r,A}\vec{X}_{r,A}\right)\alpha_{r,A}\right]$$
$$= \pi^N(\alpha_0) \times N\!\left(\alpha_{r,A} \,\middle|\, 0,\; 2\,(\vec{X}'_{r,A}\vec{X}_{r,A})^{-1}\right). \qquad (3\text{--}8)$$
Analogously, the intrinsic prior for the parameters associated with the detection process is

$$\pi^{IP}(\lambda_A \mid M_A) = d_0\, (2\pi)^{-\frac{p_{A_y}-p_{0y}}{2}}\, |\vec{Q}'_{r,A}\vec{Q}_{r,A}|^{\frac{1}{2}}\, 2^{-\frac{p_{A_y}-p_{0y}}{2}} \exp\left[-\frac{1}{2}\lambda'_{r,A}\left(\frac{1}{2}\vec{Q}'_{r,A}\vec{Q}_{r,A}\right)\lambda_{r,A}\right] = \pi^N(\lambda_0) \times N\!\left(\lambda_{r,A} \,\middle|\, 0,\; 2\,(\vec{Q}'_{r,A}\vec{Q}_{r,A})^{-1}\right). \qquad (3\text{--}9)$$

In short, the intrinsic priors for $\alpha_A = (\alpha'_0, \alpha'_{r,A})'$ and $\lambda_A = (\lambda'_0, \lambda'_{r,A})'$ are the product of a reference prior on the parameters of the base model and a normal density on the parameters indexed by $A_z$ and $A_y$, respectively.

3.3.3 Model Posterior Probabilities
We now derive the expressions involved in the calculation of the model posterior probabilities. First, recall that $p(M_A \mid y, z, w, v) \propto m(w, v \mid M_A)\, \pi(M_A)$. Hence, determining this posterior probability only requires calculating $m(w, v \mid M_A)$.

Note that, since $w$ and $v$ are independent, obtaining the model posteriors from expression (3--4) reduces to finding closed-form expressions for the marginals $m(v \mid M_{A_z})$ and $m(w \mid M_{A_y})$ from equations (3--6) and (3--7), respectively. Therefore,

$$m(w, v \mid M_A) = \iint f(v, w \mid \alpha, \lambda, M_A)\, \pi^{IP}(\alpha \mid M_{A_z})\, \pi^{IP}(\lambda \mid M_{A_y})\, d\alpha\, d\lambda. \qquad (3\text{--}10)$$
For the latent variable associated with the occupancy process, plugging the parameter intrinsic prior given by (3--8) into equation (3--6) (recalling that $\vec{X}'_A\vec{X}_A = \frac{p_{A_z}}{N} X'_A X_A$) and integrating out $\alpha_A$ yields

$$m(v \mid M_A) = \iint c_0\, N\big(v \mid X_0\alpha_0 + X_{r,A}\alpha_{r,A},\, I\big)\, N\!\left(\alpha_{r,A} \mid 0,\; 2\,(\vec{X}'_{r,A}\vec{X}_{r,A})^{-1}\right) d\alpha_{r,A}\, d\alpha_0$$
$$= c_0\, (2\pi)^{-n/2} \int \left( \frac{p_{A_z}}{2N + p_{A_z}} \right)^{\!\frac{p_{A_z}-p_{0z}}{2}} \exp\left[ -\frac{1}{2}(v - X_0\alpha_0)'\left(I - \left(\frac{2N}{2N + p_{A_z}}\right) H_{r,A_z}\right)(v - X_0\alpha_0) \right] d\alpha_0$$
$$= c_0\, (2\pi)^{-(n-p_{0z})/2} \left( \frac{p_{A_z}}{2N + p_{A_z}} \right)^{\!\frac{p_{A_z}-p_{0z}}{2}} |X'_0X_0|^{-\frac{1}{2}} \exp\left[ -\frac{1}{2}v'\left(I - H_{0z} - \left(\frac{2N}{2N + p_{A_z}}\right) H_{r,A_z}\right)v \right], \qquad (3\text{--}11)$$

with $H_{r,A_z} = H_{A_z} - H_{0z}$, where $H_{A_z}$ is the hat matrix for the entire model $M_{A_z}$ and $H_{0z}$ is the hat matrix for the base model.

Similarly, the marginal distribution for $w$ is

$$m(w \mid M_A) = d_0\, (2\pi)^{-(J-p_{0y})/2} \left( \frac{p_{A_y}}{2J + p_{A_y}} \right)^{\!\frac{p_{A_y}-p_{0y}}{2}} |Q'_0Q_0|^{-\frac{1}{2}} \exp\left[ -\frac{1}{2}w'\left(I - H_{0y} - \left(\frac{2J}{2J + p_{A_y}}\right) H_{r,A_y}\right)w \right], \qquad (3\text{--}12)$$

where $J = \sum_{i=1}^N J_i$; in other words, $J$ denotes the total number of surveys conducted.
Now, the marginals for the base model $M_0 = \{M_{0y}, M_{0z}\}$ are

$$m(v \mid M_0) = \int c_0\, N(v \mid X_0\alpha_0,\, I)\, d\alpha_0 = c_0\, (2\pi)^{-(n-p_{0z})/2}\, |X'_0X_0|^{-\frac{1}{2}} \exp\left[-\frac{1}{2}\, v'(I - H_{0z})\, v\right] \qquad (3\text{--}13)$$

and

$$m(w \mid M_0) = d_0\, (2\pi)^{-(J-p_{0y})/2}\, |Q'_0Q_0|^{-\frac{1}{2}} \exp\left[-\frac{1}{2}\, w'\big(I - H_{0y}\big)\, w\right]. \qquad (3\text{--}14)$$
3.3.4 Model Selection Algorithm

Having the parameter intrinsic priors in place, and knowing the form of the model posterior probabilities, it is finally possible to develop a strategy to conduct model selection for the occupancy framework.

For each of the two components of the model (occupancy and detection), the algorithm first draws the set of active predictors (i.e., $A_z$ and $A_y$), together with their corresponding parameters. This is a reversible-jump step, which uses a Metropolis-Hastings correction with proposal distributions given by

$$q(A^*_z \mid z_o, z^{(t)}_u, v^{(t)}, M_{A_z}) = \frac{1}{2}\left( p\big(M_{A^*_z} \mid z_o, z^{(t)}_u, v^{(t)}, \mathcal{M}_z : M_{A^*_z} \in L(M_{A_z})\big) + \frac{1}{|L(M_{A_z})|} \right),$$
$$q(A^*_y \mid y, z_o, z^{(t)}_u, w^{(t)}, M_{A_y}) = \frac{1}{2}\left( p\big(M_{A^*_y} \mid y, z_o, z^{(t)}_u, w^{(t)}, \mathcal{M}_y : M_{A^*_y} \in L(M_{A_y})\big) + \frac{1}{|L(M_{A_y})|} \right), \qquad (3\text{--}15)$$

where $L(M_{A_z})$ and $L(M_{A_y})$ denote the sets of models obtained by adding or removing one predictor at a time from $M_{A_z}$ and $M_{A_y}$, respectively.
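The one-predictor-at-a-time neighborhood $L(M_A)$ is easy to enumerate explicitly; the sketch below does so for an active set $A$ over covariates indexed $1, \dots, K$ (the function name and representation are ours, chosen for illustration).

```python
def neighborhood(A, K):
    """All models reachable from active set A by toggling a single covariate:
    remove it if it is active, add it if it is not."""
    A = set(A)
    out = []
    for k in range(1, K + 1):
        out.append(frozenset(A - {k}) if k in A else frozenset(A | {k}))
    return out

print(neighborhood({1, 3}, 4))
```

Each neighborhood has exactly $K$ members, so the uniform term in (3--15) contributes $1/K$ per candidate.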
To promote mixing, this step is followed by an additional draw from the full conditionals of $\alpha$ and $\lambda$. The densities $p(\alpha_0 \mid \cdot)$, $p(\alpha_{r,A} \mid \cdot)$, $p(\lambda_0 \mid \cdot)$, and $p(\lambda_{r,A} \mid \cdot)$ can be sampled from directly with Gibbs steps. Using the notation $a \mid \cdot$ to denote the random variable $a$ conditioned on all other parameters and on the data, these densities are given by

- $\alpha_0 \mid \cdot \sim N\big((X'_0X_0)^{-1}X'_0v,\; (X'_0X_0)^{-1}\big)$;
- $\alpha_{r,A} \mid \cdot \sim N\big(\mu_{\alpha_{r,A}},\, \Sigma_{\alpha_{r,A}}\big)$, where the mean vector and the covariance matrix are given by $\Sigma_{\alpha_{r,A}} = \frac{2N}{2N + p_{A_z}}\,(X'_{r,A}X_{r,A})^{-1}$ and $\mu_{\alpha_{r,A}} = \Sigma_{\alpha_{r,A}}\, X'_{r,A}v$;
- $\lambda_0 \mid \cdot \sim N\big((Q'_0Q_0)^{-1}Q'_0w,\; (Q'_0Q_0)^{-1}\big)$; and
- $\lambda_{r,A} \mid \cdot \sim N\big(\mu_{\lambda_{r,A}},\, \Sigma_{\lambda_{r,A}}\big)$, analogously, with covariance matrix and mean given by $\Sigma_{\lambda_{r,A}} = \frac{2J}{2J + p_{A_y}}\,(Q'_{r,A}Q_{r,A})^{-1}$ and $\mu_{\lambda_{r,A}} = \Sigma_{\lambda_{r,A}}\, Q'_{r,A}w$.

Finally, Gibbs sampling steps are also available for the unobserved occupancy indicators $z_u$ and for the corresponding latent variables $v$ and $w$. The full conditional posterior densities for $z^{(t+1)}_u$, $v^{(t+1)}$, and $w^{(t+1)}$ are those introduced in Chapter 2 for the single-season probit model.
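A single Gibbs update of the occupancy coefficients, following the full conditionals above, can be sketched as follows; the latent scores and design blocks are illustrative stand-ins for the current state of the sampler.

```python
import numpy as np

rng = np.random.default_rng(7)

def draw_alpha(v, X0, Xr):
    """One Gibbs update of (alpha_0, alpha_rA) given the latent presence scores v."""
    n = len(v)
    p_az = X0.shape[1] + Xr.shape[1]
    # alpha_0 | . ~ N((X0'X0)^-1 X0'v, (X0'X0)^-1)
    cov0 = np.linalg.inv(X0.T @ X0)
    a0 = rng.multivariate_normal(cov0 @ (X0.T @ v), cov0)
    # alpha_rA | . : covariance shrunk by the intrinsic-prior factor 2N/(2N + p_Az)
    covr = (2.0 * n / (2.0 * n + p_az)) * np.linalg.inv(Xr.T @ Xr)
    ar = rng.multivariate_normal(covr @ (Xr.T @ v), covr)
    return a0, ar

n = 80
X0 = np.ones((n, 1))
Xr = rng.normal(size=(n, 3))
v = rng.normal(size=n)
a0, ar = draw_alpha(v, X0, Xr)
print(a0.shape, ar.shape)
```

The detection-side update for $(\lambda_0, \lambda_{r,A})$ is identical in form, with $Q_0$, $Q_{r,A}$, $w$, and $2J/(2J + p_{A_y})$ in place of their occupancy counterparts.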
The following steps summarize the stochastic search algorithm:

1. Initialize $A^{(0)}_y$, $A^{(0)}_z$, $z^{(0)}_u$, $v^{(0)}$, $w^{(0)}$, $\alpha^{(0)}_0$, $\lambda^{(0)}_0$.

2. Sample the model indices and corresponding parameters:

(a) Draw simultaneously
- $A^*_z \sim q(A_z \mid z_o, z^{(t)}_u, v^{(t)}, M_{A_z})$,
- $\alpha^*_0 \sim p(\alpha_0 \mid M_{A^*_z}, z_o, z^{(t)}_u, v^{(t)})$, and
- $\alpha^*_{r,A^*} \sim p(\alpha_{r,A} \mid M_{A^*_z}, z_o, z^{(t)}_u, v^{(t)})$.

(b) Accept $(M^{(t+1)}_{A_z}, \alpha^{(t+1),1}_0, \alpha^{(t+1),1}_{r,A}) = (M_{A^*_z}, \alpha^*_0, \alpha^*_{r,A^*})$ with probability

$$\delta_z = \min\left(1,\; \frac{p(M_{A^*_z} \mid z_o, z^{(t)}_u, v^{(t)})}{p(M_{A^{(t)}_z} \mid z_o, z^{(t)}_u, v^{(t)})}\, \frac{q(A^{(t)}_z \mid z_o, z^{(t)}_u, v^{(t)}, M_{A^*_z})}{q(A^*_z \mid z_o, z^{(t)}_u, v^{(t)}, M_{A_z})} \right);$$

otherwise, let $(M^{(t+1)}_{A_z}, \alpha^{(t+1),1}_0, \alpha^{(t+1),1}_{r,A}) = (M_{A^{(t)}_z}, \alpha^{(t),2}_0, \alpha^{(t),2}_{r,A})$.

(c) Sample simultaneously
- $A^*_y \sim q(A_y \mid y, z_o, z^{(t)}_u, w^{(t)}, M_{A_y})$,
- $\lambda^*_0 \sim p(\lambda_0 \mid M_{A^*_y}, y, z_o, z^{(t)}_u, w^{(t)})$, and
- $\lambda^*_{r,A^*} \sim p(\lambda_{r,A} \mid M_{A^*_y}, y, z_o, z^{(t)}_u, w^{(t)})$.

(d) Accept $(M^{(t+1)}_{A_y}, \lambda^{(t+1),1}_0, \lambda^{(t+1),1}_{r,A}) = (M_{A^*_y}, \lambda^*_0, \lambda^*_{r,A^*})$ with probability

$$\delta_y = \min\left(1,\; \frac{p(M_{A^*_y} \mid y, z_o, z^{(t)}_u, w^{(t)})}{p(M_{A^{(t)}_y} \mid y, z_o, z^{(t)}_u, w^{(t)})}\, \frac{q(A^{(t)}_y \mid y, z_o, z^{(t)}_u, w^{(t)}, M_{A^*_y})}{q(A^*_y \mid y, z_o, z^{(t)}_u, w^{(t)}, M_{A_y})} \right);$$

otherwise, let $(M^{(t+1)}_{A_y}, \lambda^{(t+1),1}_0, \lambda^{(t+1),1}_{r,A}) = (M_{A^{(t)}_y}, \lambda^{(t),2}_0, \lambda^{(t),2}_{r,A})$.

3. Sample the base model parameters:
(a) Draw $\alpha^{(t+1),2}_0 \sim p(\alpha_0 \mid M_{A^{(t+1)}_z}, z_o, z^{(t)}_u, v^{(t)})$.
(b) Draw $\lambda^{(t+1),2}_0 \sim p(\lambda_0 \mid M_{A^{(t+1)}_y}, y, z_o, z^{(t)}_u, w^{(t)})$.

4. To improve mixing, resample the model coefficients that are not in the base model but are in $M_A$:
(a) Draw $\alpha^{(t+1),2}_{r,A} \sim p(\alpha_{r,A} \mid M_{A^{(t+1)}_z}, z_o, z^{(t)}_u, v^{(t)})$.
(b) Draw $\lambda^{(t+1),2}_{r,A} \sim p(\lambda_{r,A} \mid M_{A^{(t+1)}_y}, y, z_o, z^{(t)}_u, w^{(t)})$.

5. Sample the latent and missing (unobserved) variables:
(a) Sample $z^{(t+1)}_u \sim p(z_u \mid M_{A^{(t+1)}_z}, y, \alpha^{(t+1),2}_{r,A}, \alpha^{(t+1),2}_0, \lambda^{(t+1),2}_{r,A}, \lambda^{(t+1),2}_0)$.
(b) Sample $v^{(t+1)} \sim p(v \mid M_{A^{(t+1)}_z}, z_o, z^{(t+1)}_u, \alpha^{(t+1),2}_{r,A}, \alpha^{(t+1),2}_0)$.
(c) Sample $w^{(t+1)} \sim p(w \mid M_{A^{(t+1)}_y}, z_o, z^{(t+1)}_u, \lambda^{(t+1),2}_{r,A}, \lambda^{(t+1),2}_0)$.
3.4 Alternative Formulation

Because the occupancy process is partially observed, it is reasonable to consider the posterior odds in terms of the observed responses, that is, the detections $y$ and the presences at sites where at least one detection takes place. Partitioning the vector of presences into observed and unobserved components, $z = (z'_o, z'_u)'$, and integrating out the unobserved component, the model posterior for $M_A$ can be obtained as

$$p(M_A \mid y, z_o) \propto E_{z_u}\big[m(y, z \mid M_A)\big]\, \pi(M_A). \qquad (3\text{--}16)$$

Data-augmenting the model in terms of latent normal variables à la Albert and Chib, the marginals of $z$ and $y$ inside the expectation in equation (3--16), for any model $M = \{M_y, M_z\} \in \mathcal{M}$, can be expressed in terms of the latent variables:

$$m(y, z \mid M) = \int_{T(z)} \int_{T(y,z)} m(w, v \mid M)\, dw\, dv = \left( \int_{T(z)} m(v \mid M_z)\, dv \right) \left( \int_{T(y,z)} m(w \mid M_y)\, dw \right), \qquad (3\text{--}17)$$

where $T(z)$ and $T(y, z)$ denote the corresponding truncation regions for $v$ and $w$, which depend on the values taken by $z$ and $y$, and

$$m(v \mid M_z) = \int f(v \mid \alpha, M_z)\, \pi(\alpha \mid M_z)\, d\alpha, \qquad (3\text{--}18)$$
$$m(w \mid M_y) = \int f(w \mid \lambda, M_y)\, \pi(\lambda \mid M_y)\, d\lambda. \qquad (3\text{--}19)$$

The last equality in equation (3--17) is a consequence of the independence of the latent processes $v$ and $w$. Using expressions (3--18) and (3--19) allows one to embed this model selection problem in the classical normal linear regression setting, where many "objective" Bayesian inferential tools are available. In particular, these expressions facilitate deriving the parameter intrinsic priors (Berger & Pericchi, 1996; Moreno et al., 1998) for this problem. This approach is an extension of the one implemented in Leon-Novelo et al. (2012) for the simple probit regression problem.
Using this alternative approach, all that is left is to integrate $m(v \mid M_A)$ and $m(w \mid M_A)$ over their corresponding truncation regions $T(z)$ and $T(y, z)$, which yields $m(y, z \mid M_A)$, and then to obtain the expectation with respect to the unobserved $z$'s. Note, however, that two issues arise. First, such integrals are not available in closed form. Second, calculating the expectation over the limits of integration further complicates things. To address these difficulties, it is possible to express $E_{z_u}[m(y, z \mid M_A)]$ as

$$E_{z_u}\big[m(y, z \mid M_A)\big] = E_{z_u}\left[ \left( \int_{T(z)} m(v \mid M_{A_z})\, dv \right) \left( \int_{T(y,z)} m(w \mid M_{A_y})\, dw \right) \right]$$
$$= E_{z_u}\left[ \left( \int_{T(z)} \int m(v \mid M_{A_z}, \alpha_0)\, \pi^{IP}(\alpha_0 \mid M_{A_z})\, d\alpha_0\, dv \right) \times \left( \int_{T(y,z)} \int m(w \mid M_{A_y}, \lambda_0)\, \pi^{IP}(\lambda_0 \mid M_{A_y})\, d\lambda_0\, dw \right) \right]$$
$$= E_{z_u}\left[ \int \underbrace{\left( \int_{T(z)} m(v \mid M_{A_z}, \alpha_0)\, dv \right)}_{g_1(T(z) \mid M_{A_z}, \alpha_0)} \pi^{IP}(\alpha_0 \mid M_{A_z})\, d\alpha_0 \times \int \underbrace{\left( \int_{T(y,z)} m(w \mid M_{A_y}, \lambda_0)\, dw \right)}_{g_2(T(y,z) \mid M_{A_y}, \lambda_0)} \pi^{IP}(\lambda_0 \mid M_{A_y})\, d\lambda_0 \right]$$
$$= c_0\, d_0 \iint E_{z_u}\left[ g_1(T(z) \mid M_{A_z}, \alpha_0)\, g_2(T(y, z) \mid M_{A_y}, \lambda_0) \right] d\alpha_0\, d\lambda_0, \qquad (3\text{--}20)$$

where the last equality follows from Fubini's theorem, since $m(v \mid M_{A_z}, \alpha_0)$ and $m(w \mid M_{A_y}, \lambda_0)$ are proper densities. From (3--20), the posterior odds are

$$\frac{p(M_A \mid y, z_o)}{p(M_0 \mid y, z_o)} = \frac{\iint E_{z_u}\left[ g_1(T(z) \mid M_{A_z}, \alpha_0)\, g_2(T(y, z) \mid M_{A_y}, \lambda_0) \right] d\alpha_0\, d\lambda_0}{\iint E_{z_u}\left[ g_1(T(z) \mid M_{0z}, \alpha_0)\, g_2(T(y, z) \mid M_{0y}, \lambda_0) \right] d\alpha_0\, d\lambda_0}\, \frac{\pi(M_A)}{\pi(M_0)}. \qquad (3\text{--}21)$$
35 Simulation Experiments
The proposed methodology was tested under 36 different scenarios where we
evaluate the behavior of the algorithm by varying the number of sites the number of
surveys the amount of signal in the predictors for the presence component and finally
the amount of signal in the predictors for the detection component
For each model component the base model is taken to be the intercept only model
and the full models considered for the presence and the detection have respectively 30
and 20 predictors Therefore the model space contains 230times220 asymp 112times1015 candidate
models
To control the amount of signal in the presence and detection components, values for the model parameters were purposefully chosen so that quantiles 10, 50, and 90 of the occupancy and detection probabilities match pre-specified probabilities. Because presence and detection are binary variables, the amount of signal in each model component is associated with the spread and center of the distribution of the occupancy and detection probabilities, respectively. Low signal levels correspond to occupancy or detection probabilities close to 0.5; high signal levels are associated with probabilities close to 0 or 1. Large spreads of the distributions of the occupancy and detection probabilities reflect greater heterogeneity among the observations collected, improving the discrimination capability of the model, and vice versa.
Therefore, for the presence component, the parameter values of the true model were chosen to set the median of the occupancy probabilities equal to 0.5. The chosen parameter values also fix quantiles 10 and 90 symmetrically about 0.5, at small (Q^z_{10} = 0.3, Q^z_{90} = 0.7), intermediate (Q^z_{10} = 0.2, Q^z_{90} = 0.8), and large (Q^z_{10} = 0.1, Q^z_{90} = 0.9) distances. For the detection component, the model parameters are obtained to reflect detection probabilities concentrated about low values (Q^y_{50} = 0.2), intermediate values (Q^y_{50} = 0.5), and high values (Q^y_{50} = 0.8), while keeping quantiles 10 and 90 fixed at 0.1 and 0.9, respectively.
Table 3-1. Simulation control parameters, occupancy model selector.

Parameter                        Values considered
N                                50, 100
J                                3, 5
(Q^z_{10}, Q^z_{50}, Q^z_{90})   (0.3, 0.5, 0.7), (0.2, 0.5, 0.8), (0.1, 0.5, 0.9)
(Q^y_{10}, Q^y_{50}, Q^y_{90})   (0.1, 0.2, 0.9), (0.1, 0.5, 0.9), (0.1, 0.8, 0.9)
There are in total 36 scenarios; these result from crossing all the levels of the simulation control parameters (Table 3-1). Under each of these scenarios, 20 data sets were generated at random. True presence and detection indicators were generated with the probit model formulation from Chapter 2, with the assumed true models MTz = {1, x2, x15, x16, x22, x28} for the presence and MTy = {1, q7, q10, q12, q17} for the detection, where the predictors are included in the randomly generated datasets. In this context, 1 represents the intercept term. Throughout this section we refer to predictors included in the true models as true predictors, and to those absent as false predictors.
The selection procedure was conducted using each of these data sets with two different priors on the model space: the uniform (equal probability) prior and a multiplicity-correcting prior.
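One such simulated dataset can be generated along the following lines. This is a hedged sketch of the probit data-generating mechanism just described: the true-model indices match MTz and MTy above, but the coefficient magnitude (0.5 here) is a placeholder rather than the quantile-matched values used in the actual experiments.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_occupancy(N=50, J=3, p_pres=30, p_det=20,
                       true_pres=(2, 15, 16, 22, 28), true_det=(7, 10, 12, 17)):
    """Sketch: one dataset under the probit occupancy formulation."""
    X = rng.normal(size=(N, p_pres))        # site-level covariates x1..x30
    Q = rng.normal(size=(N, J, p_det))      # survey-level covariates q1..q20
    alpha = np.zeros(p_pres + 1)            # alpha[0] is the intercept
    alpha[list(true_pres)] = 0.5            # placeholder signal strength
    lam = np.zeros(p_det + 1)
    lam[list(true_det)] = 0.5
    # latent presence: z_i = 1{alpha0 + x_i' alpha + e_i > 0}
    z = (alpha[0] + X @ alpha[1:] + rng.normal(size=N) > 0).astype(int)
    # detection is possible only at occupied sites
    y = ((lam[0] + Q @ lam[1:] + rng.normal(size=(N, J)) > 0).astype(int)
         * z[:, None])
    return X, Q, z, y
```

Varying N, J, and the coefficient values over the grid in Table 3-1 reproduces the 36 scenarios.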
The results are summarized through the marginal posterior inclusion probabilities (MPIPs) for each predictor, and also through the five highest posterior probability models (HPMs). The MPIP for a given predictor, under a specific scenario and for a particular data set, is defined as

p(\text{predictor is included} \mid y, z, w, v) = \sum_{M \in \mathcal{M}} I(\text{predictor} \in M)\, p(M \mid y, z, w, v).    (3-22)
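When the visited models and their posterior probabilities are stored, (3-22) is simply a weighted count; a minimal sketch, with models represented as sets of predictor labels:

```python
def mpip(models, post_probs):
    """Sketch of (3-22): marginal posterior inclusion probabilities.

    models     : list of models, each a frozenset of predictor labels
    post_probs : matching posterior probabilities p(M | data), summing to 1
    """
    predictors = set().union(*models)
    return {x: sum(p for M, p in zip(models, post_probs) if x in M)
            for x in sorted(predictors)}
```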
In addition, we compare the MPIP odds between predictors present in the true model and predictors absent from it. Specifically, we consider the minimum odds of the marginal posterior inclusion probabilities for the predictors. Let \tilde{\xi} and \xi denote, respectively, a predictor in the true model M_T and a predictor absent from M_T. We define the minimum MPIP odds between the probabilities of true and false predictors as

minOdds_{MPIP} = \frac{\min_{\tilde{\xi} \in M_T} p(I_{\tilde{\xi}} = 1 \mid \tilde{\xi} \in M_T)}{\max_{\xi \notin M_T} p(I_{\xi} = 1 \mid \xi \notin M_T)}.    (3-23)
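Given the MPIPs, the summary in (3-23) reduces to a ratio of a minimum and a maximum; the sketch below assumes the MPIPs are stored in a dict keyed by predictor label:

```python
def min_odds_mpip(mpips, true_predictors):
    """Sketch of (3-23): minimum MPIP odds between true and false predictors."""
    true_min = min(p for x, p in mpips.items() if x in true_predictors)
    false_max = max(p for x, p in mpips.items() if x not in true_predictors)
    return true_min / false_max
```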
If the variable selection procedure adequately discriminates between true and false predictors, minOdds_{MPIP} will take values larger than one. The ability of the method to discriminate between the least probable true predictor and the most probable false predictor worsens as the indicator approaches 0.

3.5.1 Marginal Posterior Inclusion Probabilities for Model Predictors
For clarity, in Figures 3-1 through 3-5 only predictors in the true models are labeled, and they are emphasized with a dotted line passing through them. The left-hand-side plots in these figures contain the results for the presence component, and those on the right correspond to predictors in the detection component. The results obtained with the uniform model prior correspond to the black lines, and those for the multiplicity-correcting prior are in red. In these figures, the MPIPs have been averaged over all datasets from the scenarios matching the indicated condition.
In Figure 3-1 we contrast the mean MPIPs of the predictors over all datasets from scenarios with 50 sites with the mean MPIPs obtained for the scenarios with 100 sites. Similarly, Figure 3-2 compares the mean MPIPs of scenarios where 3 surveys are performed with those of scenarios having 5 surveys per site. Figures 3-4 and 3-5 show the effect of the different levels of signal considered in the occupancy probabilities and in the detection probabilities.
From these figures, three main results can be drawn: (1) the effect of the model prior is substantial; (2) the proposed methods yield MPIPs that clearly separate true predictors from false predictors; and (3) the separation between the MPIPs of true and false predictors is noticeably larger in the detection component.
70
Regardless of the simulation scenario and model component observed, under the uniform prior false predictors obtain a relatively high MPIP. Conversely, the multiplicity correction prior strongly shrinks the MPIP for false predictors towards 0. In the presence component the MPIP for the true predictors is shrunk substantially under the multiplicity prior; however, there remains a clear separation between true and false predictors. In contrast, in the detection component the MPIP for true predictors remains relatively high (Figures 3-1 through 3-5).
[Figure: marginal inclusion probability by predictor (x2, x15, x22, x28; q7, q10, q17), presence and detection components]
Figure 3-1. Predictor MPIP averaged over scenarios with N=50 and N=100 sites, using uniform (U) and multiplicity correction (MC) priors.
[Figure: marginal inclusion probability by predictor, presence and detection components]
Figure 3-2. Predictor MPIP averaged over scenarios with J=3 and J=5 surveys per site, using uniform (U) and multiplicity correction (MC) priors.
[Figure: marginal inclusion probability by predictor, presence and detection components]
Figure 3-3. Predictor MPIP averaged over scenarios with the interaction between the number of sites and the surveys per site, using uniform (U) and multiplicity correction (MC) priors.
[Figure: marginal inclusion probability by predictor, presence and detection components]
Figure 3-4. Predictor MPIP averaged over scenarios with equal signal in the occupancy probabilities, using uniform (U) and multiplicity correction (MC) priors.
[Figure: marginal inclusion probability by predictor, presence and detection components]
Figure 3-5. Predictor MPIP averaged over scenarios with equal signal in the detection probabilities, using uniform (U) and multiplicity correction (MC) priors.
In scenarios where more sites were surveyed, the separation between the MPIPs of true and false predictors grew in both model components (Figure 3-1). Increasing the number of sites has an effect on both components, given that every time a new site is included, covariate information is added to the design matrix of both the presence and the detection components.
On the other hand, increasing the number of surveys affects the MPIP of predictors in the detection component (Figures 3-2 and 3-3), but has only a marginal effect on predictors of the presence component. This may appear counterintuitive; however, increasing the number of surveys only increases the number of observations in the design matrix for the detection, while leaving the design matrix for the presence unaltered. The small changes observed in the MPIP for the presence predictors as J increases are exclusively a result of having additional detection indicators equal to 1 at sites that, with fewer surveys, would only have 0-valued detections.
From Figure 3-3 it is clear that, for the presence component, the effect of the number of sites dominates the behavior of the MPIP, especially when using the multiplicity correction priors. In the detection component the MPIP is influenced by both the number of sites and the number of surveys. The influence of increasing the number of surveys is larger when considering a smaller number of sites, and vice versa.
Regarding the effect of the distribution of the occupancy probabilities, we observe that mostly the detection component is affected. There is stronger discrimination between true and false predictors as the distribution has higher variability (Figure 3-4). This is consistent with intuition, since having the presence probabilities more concentrated about 0.5 implies that the predictors do not vary much from one site to the next, whereas having the occupancy probabilities more spread out has the opposite effect.
Finally, consider concentrating the detection probabilities about high or low values. For predictors in the detection component, the separation between the MPIPs of true and false predictors is larger in scenarios where the distribution of the detection probability is centered about 0.2 or 0.8 than in scenarios where this distribution is centered about 0.5 (where the signal of the predictors is weakest). For predictors in the presence component, having the detection probabilities centered at higher values slightly increases the inclusion probabilities of the true predictors and reduces those of the false predictors (Figure 3-5).
Table 3-2. Comparison of average minOddsMPIP under scenarios having different numbers of sites (N=50, N=100) and under scenarios having different numbers of surveys per site (J=3, J=5), for the presence and detection components, using uniform and multiplicity correction priors.

                            Sites              Surveys
Comp        π(M)       N=50     N=100      J=3      J=5
Presence    Unif       1.12      1.31      1.19     1.24
            MC         3.20      8.46      4.20     6.74
Detection   Unif       2.03      2.64      2.11     2.57
            MC        21.15     32.46     21.39    32.52
Table 3-3. Comparison of average minOddsMPIP for different levels of signal considered in the occupancy and detection probabilities, for the presence and detection components, using uniform and multiplicity correction priors.

                             (Q^z_{10}, Q^z_{50}, Q^z_{90})                    (Q^y_{10}, Q^y_{50}, Q^y_{90})
Comp        π(M)   (0.3,0.5,0.7)  (0.2,0.5,0.8)  (0.1,0.5,0.9)   (0.1,0.2,0.9)  (0.1,0.5,0.9)  (0.1,0.8,0.9)
Presence    Unif       1.05           1.20           1.34            1.10           1.23           1.24
            MC         2.02           4.55           8.05            2.38           6.19           6.40
Detection   Unif       2.34           2.34           2.30            2.57           2.00           2.38
            MC        25.37          20.77          25.28           29.33          18.52          28.49
The separation between the MPIPs of true and false predictors is even more evident in Tables 3-2 and 3-3, where the minimum MPIP odds between true and false predictors are shown. Under every scenario, the value of minOddsMPIP (as defined in (3-23)) was greater than 1, implying that, on average, even the lowest MPIP for a true predictor is higher than the maximum MPIP for a false predictor. In both components of the model, the minOddsMPIP are markedly larger under the multiplicity correction prior, and they increase with the number of sites and with the number of surveys.
For the presence component, increasing the signal in the occupancy probabilities or having the detection probabilities concentrate about higher values has a positive and considerable effect on the magnitude of the odds. For the detection component these odds are particularly high, especially under the multiplicity correction prior. Also, having the distribution of the detection probabilities center about low or high values increases the minOddsMPIP.

3.5.2 Summary Statistics for the Highest Posterior Probability Model
Tables 3-4 through 3-7 show the number of true predictors included in the HPM (True +) and the number of false predictors excluded from it (True −). The mean percentages observed in these tables provide one clear message: the highest probability models chosen with either model prior commonly differ from the corresponding true models. The multiplicity correction prior's strong shrinkage allows only a few true predictors to be selected, but at the same time it prevents any false predictors from being included in the HPM. On the other hand, the uniform prior includes in the HPM a larger proportion of true predictors, but at the expense of also introducing a large number of false predictors. This situation is exacerbated in the presence component, but also occurs to a lesser extent in the detection component.
Table 3-4. Comparison between scenarios with 50 and 100 sites in terms of the average percentage of true positive and true negative terms over the highest probability models, for the presence and the detection components, using uniform and multiplicity correcting priors on the model space.

                           True +              True −
Comp        π(M)       N=50    N=100       N=50    N=100
Presence    Unif       0.57     0.63       0.51     0.55
            MC         0.06     0.13       1.00     1.00
Detection   Unif       0.77     0.85       0.87     0.93
            MC         0.49     0.70       1.00     1.00
Having more sites or surveys improves the inclusion of true predictors and the exclusion of false ones in the HPM, for both the presence and detection components (Tables 3-4 and 3-5). On the other hand, if the distribution of the occupancy probabilities is more
Table 3-5. Comparison between scenarios with 3 and 5 surveys per site in terms of the percentage of true positive and true negative predictors, averaged over the highest probability models, for the presence and the detection components, using uniform and multiplicity correcting priors on the model space.

                           True +              True −
Comp        π(M)       J=3      J=5        J=3      J=5
Presence    Unif       0.59     0.61       0.52     0.54
            MC         0.08     0.10       1.00     1.00
Detection   Unif       0.78     0.85       0.87     0.92
            MC         0.50     0.68       1.00     1.00
spread out, the HPM includes more true predictors and fewer false ones in the presence component. In contrast, the effect of the spread of the occupancy probabilities on the detection HPM is negligible (Table 3-6). Finally, there is a positive relationship between the location of the median of the detection probabilities and the number of correctly classified true and false predictors for the presence. The HPM in the detection part of the model responds positively to low and high values of the median detection probability (increased signal levels) in terms of correctly classified true and false predictors (Table 3-7).
Table 3-6. Comparison between scenarios with different levels of signal in the occupancy component in terms of the percentage of true positive and true negative predictors, averaged over the highest probability models, for the presence and the detection components, using uniform and multiplicity correcting priors on the model space.

                                   True +                                        True −
Comp        π(M)   (0.3,0.5,0.7)  (0.2,0.5,0.8)  (0.1,0.5,0.9)   (0.3,0.5,0.7)  (0.2,0.5,0.8)  (0.1,0.5,0.9)
Presence    Unif       0.55           0.61           0.64            0.50           0.54           0.55
            MC         0.02           0.08           0.18            1.00           1.00           1.00
Detection   Unif       0.81           0.82           0.81            0.90           0.89           0.89
            MC         0.57           0.61           0.59            1.00           1.00           1.00
3.6 Case Study: Blue Hawker Data Analysis
During 1999 and 2000, an intensive volunteer surveying effort coordinated by the Centre Suisse de Cartographie de la Faune (CSCF) was conducted in order to analyze the distribution of the blue hawker, Aeshna cyanea (Odonata: Aeshnidae), a common dragonfly in Switzerland. Given that Switzerland is a small and mountainous country,
Table 3-7. Comparison between scenarios with different levels of signal in the detection component in terms of the percentage of true positive and true negative predictors, averaged over the highest probability models, for the presence and the detection components, using uniform and multiplicity correcting priors on the model space.

                                   True +                                        True −
Comp        π(M)   (0.1,0.2,0.9)  (0.1,0.5,0.9)  (0.1,0.8,0.9)   (0.1,0.2,0.9)  (0.1,0.5,0.9)  (0.1,0.8,0.9)
Presence    Unif       0.59           0.59           0.62            0.51           0.54           0.54
            MC         0.06           0.10           0.11            1.00           1.00           1.00
Detection   Unif       0.89           0.77           0.78            0.91           0.87           0.91
            MC         0.70           0.48           0.59            1.00           1.00           1.00
there is large variation in its topography and physio-geography; as such, elevation is a good candidate covariate to predict species occurrence at a large spatial scale. It can be used as a proxy for habitat type, intensity of land use, temperature, as well as some biotic factors (Kery et al. 2010).
Repeated visits to 1-ha pixels took place to obtain the corresponding detection history. In addition to the survey outcome, the x- and y-coordinates, the thermal level, the date of the survey, and the elevation were recorded. Surveys were restricted to the known flight period of the blue hawker, which takes place between May 1 and October 10. In total, 2572 sites were surveyed at least once during the surveying period. The number of surveys per site ranges from 1 to 22 within each survey year.
Kery et al. (2010) summarize the results of this effort using AIC-based model comparisons: first, following a backwards elimination approach for the detection process while keeping the occupancy component fixed at the most complex model, and then, for the presence component, choosing among a group of three models while using the chosen detection model. In our analysis of this dataset, for the detection and the presence we consider as the full models those used in Kery et al. (2010), namely

\Phi^{-1}(\psi) = \alpha_0 + \alpha_1\,\text{year} + \alpha_2\,\text{elev} + \alpha_3\,\text{elev}^2 + \alpha_4\,\text{elev}^3
\Phi^{-1}(p) = \lambda_0 + \lambda_1\,\text{year} + \lambda_2\,\text{elev} + \lambda_3\,\text{elev}^2 + \lambda_4\,\text{elev}^3 + \lambda_5\,\text{date} + \lambda_6\,\text{date}^2
where year = I(year = 2000).
The model spaces for these data contain 2^6 = 64 and 2^4 = 16 models, respectively, for the detection and occupancy components. That is, in total the model space contains 2^{4+6} = 1,024 models. Although this model space can be enumerated entirely, for illustration we implemented the algorithm from Section 3.3.4, generating 10,000 draws from the Gibbs sampler. Each one of the models sampled was chosen from the set of models that can be reached by changing the state of a single term in the current model (to inclusion or exclusion, accordingly). This allows a more thorough exploration of the model space because, for each of the 10,000 models drawn, the posterior probabilities of many more models can be observed. Below, the labels for the predictors are followed by either "z" or "y", accordingly, to represent the component they pertain to. Finally, using the results from the model selection procedure, we conducted a validation step to determine the predictive accuracy of the HPMs and of the median probability models (MPMs). The performance of these models is then contrasted with that of the model ultimately selected by Kery et al. (2010).

3.6.1 Results: Variable Selection Procedure
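A minimal sketch of the single-term-toggle exploration just described, with a hypothetical `score` function standing in for the unnormalized model posterior probability (the actual sampler uses the intrinsic-prior marginals):

```python
import numpy as np

rng = np.random.default_rng(3)

def stochastic_search(score, term_labels, n_iter=10_000):
    """Toggle one term per step; record every neighbor scored along the way.

    score : placeholder for an unnormalized posterior probability p(M | data).
    Because all neighbors of the current model are scored at each step, many
    more models than the number of draws are evaluated.
    """
    terms = [frozenset({t}) for t in term_labels]
    current = frozenset()                      # start from the null model
    seen = {}
    for _ in range(n_iter):
        cand = [current ^ t for t in terms]    # toggle each term in/out
        w = np.array([score(m) for m in cand], dtype=float)
        seen.update(zip(cand, w))
        current = cand[rng.choice(len(cand), p=w / w.sum())]
    return seen
```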
The model finally chosen for the presence component in Kery et al. (2010) was not found among the five highest probability models under either model prior (Table 3-8). Moreover, the year indicator was never chosen under the multiplicity correcting prior, hinting that this term might correspond to a falsely identified predictor under the uniform prior. Results in Table 3-10 support this claim: the marginal posterior inclusion probability for the year predictor is 7% under the multiplicity correction prior. The multiplicity correction prior concentrates the model posterior probability mass more densely in the highest ranked models (90% of the mass is in the top five models) than the uniform prior (under which the top five models account for 40% of the mass).
For the detection component, the HPM under both priors is the intercept-only model, which we represent in Table 3-9 with a blank label. In both cases this model obtains very
Table 3-8. Posterior probability for the five highest probability models in the presence component of the blue hawker data.

Uniform model prior                          Multiplicity correcting model prior
Rank  Mz selected           p(Mz|y)          Rank  Mz selected           p(Mz|y)
1     yrz+elevz              0.10            1     elevz+elevz3           0.53
2     yrz+elevz+elevz3       0.08            2                            0.15
3     elevz2+elevz3          0.08            3     elevz+elevz2           0.09
4     yrz+elevz2             0.07            4     elevz2                 0.06
5     yrz+elevz3             0.07            5     elevz+elevz2+elevz3    0.05
high posterior probabilities. The terms contained in the cubic polynomial for the elevation appear to carry some relevant information; however, this conflicts with the MPIPs observed in Table 3-11, which under both model priors are relatively low (< 20% with the uniform prior and ≤ 4% with the multiplicity correcting prior).
Table 3-9. Posterior probability for the five highest probability models in the detection component of the blue hawker data.

Uniform model prior                          Multiplicity correcting model prior
Rank  My selected           p(My|y)          Rank  My selected           p(My|y)
1                            0.45            1                            0.86
2     elevy3                 0.06            2     elevy3                 0.02
3     elevy2                 0.05            3     datey2                 0.02
4     elevy                  0.05            4     elevy2                 0.02
5     yry                    0.04            5     yry                    0.02
Finally, it is possible to use the MPIPs to obtain the median probability model (MPM), which contains the terms that have an MPIP higher than 50%. For the occupancy process (Table 3-10), under the uniform prior, the year, the elevation, and the elevation cubed are included. The MPM with the multiplicity correction prior coincides with the HPM from this prior. The MPM chosen for the detection component (Table 3-11) under both priors is the intercept-only model, coinciding again with the HPM.
Given the outcomes of the simulation studies from Section 3.5, especially those pertaining to the detection component, the results in Table 3-11 appear to indicate that none of the predictors considered belong to the true model, especially when considering
Table 3-10. MPIP, presence component.

Predictor   p(predictor ∈ MTz | y, z, w, v)
            Unif     MultCorr
yrz         0.53     0.07
elevz       0.51     0.73
elevz2      0.45     0.23
elevz3      0.50     0.67
Table 3-11. MPIP, detection component.

Predictor   p(predictor ∈ MTy | y, z, w, v)
            Unif     MultCorr
yry         0.19     0.03
elevy       0.18     0.03
elevy2      0.18     0.03
elevy3      0.19     0.04
datey       0.16     0.03
datey2      0.15     0.04
those derived with the multiplicity correction prior. On the other hand, for the presence component (Table 3-10), there is an indication that terms related to the cubic polynomial in elevz can explain the occupancy patterns.

3.6.2 Validation for the Selection Procedure
Approximately half of the sites were selected at random for training (i.e., for model selection and parameter estimation), and the remaining half were used as test data. In the previous section we observed that, using the marginal posterior inclusion probabilities of the predictors, our method effectively separates predictors in the true model from those that are not in it. However, in Tables 3-10 and 3-11 this separation is only clear for the presence component using the multiplicity correction prior.
Therefore, in the validation procedure we observe the misclassification rates for the detections using the following models: (1) the model ultimately recommended in Kery et al. (2010) (yrz+elevz+elevz2+elevz3 + elevy+elevy2+datey+datey2); (2) the highest probability model (HPM) with a uniform prior (yrz+elevz); (3) the HPM with a multiplicity correcting prior (elevz+elevz3); (4) the median probability model (MPM), that is, the model including only predictors with an MPIP larger than 50%, with the uniform prior (yrz+elevz+elevz3); and, finally, (5) the MPM with a multiplicity correction prior (elevz+elevz3, the same as the HPM with multiplicity correction).
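The misclassification rates used to compare these models reduce to error rates conditional on the true response; a minimal sketch:

```python
import numpy as np

def misclassification_rates(y_true, y_pred):
    """Error rates for true 1's, true 0's, and jointly over all responses."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    err = y_true != y_pred
    return {"true1": err[y_true == 1].mean(),   # missed detections
            "true0": err[y_true == 0].mean(),   # false detections
            "joint": err.mean()}
```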
We must emphasize that the models resulting from the implementation of our model selection procedure used exclusively the training dataset. On the other hand, the model in Kery et al. (2010) was chosen to minimize the prediction error of the complete data.
Because this model was obtained from the full dataset, results derived from it can only be considered a lower bound for the prediction errors.
The benchmark misclassification error rate for true 1's is high (close to 70%). However, the misclassification rate for true 0's, which account for most of the responses, is less pronounced (15%). Overall, the performance of the selected models is comparable: they yield considerably worse results than the benchmark for the true 1's, but achieve rates close to the benchmark for the true zeros. Pooling together the results for true ones and true zeros, the selected models with either prior have misclassification rates close to 30%. The benchmark model performs comparably, with a joint misclassification error of 23% (Table 3-12).
Table 3-12. Mean misclassification rate for HPMs and MPMs using uniform and multiplicity correction model priors.

Model                                                              True 1   True 0   Joint
benchmark (Kery et al. 2010)
  yrz+elevz+elevz2+elevz3+elevy+elevy2+datey+datey2                 0.66     0.15     0.23
HPM Unif       yrz+elevz                                            0.83     0.17     0.28
HPM/MPM MC     elevz+elevz3                                         0.82     0.18     0.28
MPM Unif       yrz+elevz+elevz3                                     0.82     0.18     0.29
3.7 Discussion
In this chapter we proposed an objective and fully automatic Bayesian methodology for the single-season site-occupancy model. The methodology is said to be fully automatic because no hyperparameter specification is necessary in defining the parameter priors, and objective because it relies on the intrinsic priors derived from noninformative priors. The intrinsic priors have been shown to have desirable properties as testing priors. We also propose a fast stochastic search algorithm to explore large model spaces using our model selection procedure.
Our simulation experiments demonstrated the ability of the method to single out the predictors present in the true model when considering the marginal posterior inclusion probabilities of the predictors. For predictors in the true model, these probabilities were comparatively larger than those for predictors absent from it. Also, the simulations indicated that the method has a greater discrimination capability for predictors in the detection component of the model, especially when using multiplicity correction priors.
Multiplicity correction priors were not described in this chapter; however, their influence on the selection outcome is significant. This behavior was observed in the simulation experiment and in the analysis of the blue hawker data. Model priors play an essential role: as the number of predictors grows, they are instrumental in controlling for the selection of false positive predictors. Additionally, model priors can be used to account for predictor structure in the selection process, which helps both to reduce the size of the model space and to make the selection more robust. These issues are the topic of the next chapter.
Accounting for the polynomial hierarchy in the predictors within the occupancy context is a straightforward extension of the procedures we describe in Chapter 4; hence, our next step is to develop efficient software for it. An additional direction we plan to pursue is developing methods for occupancy variable selection in a multivariate setting. This can be used to conduct hypothesis testing in scenarios with conditions varying through time, or in the case where multiple species are co-observed. A final variation we will investigate for this problem is that of occupancy model selection incorporating random effects.
CHAPTER 4
PRIORS ON THE MODEL SPACE AND WELL-FORMULATED MODELS
It has long been an axiom of mine that the little things are infinitely the most important.
–Sherlock Holmes, A Case of Identity
4.1 Introduction
In regression problems, if a large number of potential predictors is available, the complete model space is too large to enumerate, and automatic selection algorithms are necessary to find informative, parsimonious models. This multiple testing problem is difficult, and even more so when interactions or powers of the predictors are considered. In the ecological literature, models with interactions and/or higher order polynomial terms are ubiquitous (Johnson et al. 2013; Kery et al. 2010; Zeller et al. 2011), given the complexity and non-linearities found in ecological processes. Several model selection procedures, even in the classical normal linear setting, fail to address two fundamental issues: (1) the model selection outcome is not invariant to affine transformations when interactions or polynomial structures are found among the predictors, and (2) additional penalization is required to control for false positives as the model space grows (i.e., as more covariates are considered).
These two issues motivate the methods developed throughout this chapter. Building on the results of Chipman (1996), we propose, investigate, and provide recommendations for three different prior distributions on the model space. These priors help control for test multiplicity while accounting for polynomial structure in the predictors. They improve upon those proposed by Chipman, first by avoiding the need to specify values for the prior inclusion probabilities of the predictors, and second by formulating principled alternatives to introduce additional structure in the model priors. Finally, we design a stochastic search algorithm that allows fast and thorough exploration of model spaces with polynomial structure.
Having structure in the predictors can determine the selection outcome. As an illustration, consider the model E[y] = β_{00} + β_{01}x_2 + β_{20}x_1^2, where the order-one term x_1 is not present (this choice of subscripts for the coefficients is defined in the following section). Transforming x_1 → x_1^* = x_1 + c for some c ≠ 0, the model becomes E[y] = β_{00} + β_{01}x_2 + β^*_{20}x_1^{*2}. Note that, in terms of the original predictors, x_1^{*2} = x_1^2 + 2c·x_1 + c^2, implying that this seemingly innocuous transformation of x_1 modifies the column space of the design matrix by including x_1, which was not in the original model. That is, when lower order terms in the hierarchy are omitted from the model, the column space of the design matrix is not invariant to affine transformations. As the hat matrix depends on the column space, the model's predictive capability is also affected by how the covariates in the model are coded, an undesirable feature for any model selection procedure. To make model selection invariant to affine transformations, the selection must be constrained to the subset of models that respect the hierarchy (Griepentrog et al. 1982; Khuri 2002; McCullagh & Nelder 1989; Nelder 2000; Peixoto 1987, 1990). These models are known as well-formulated models (WFMs). Succinctly, a model is well-formulated if, for any predictor in the model, every lower order predictor associated with it is also in the model. The model above is not well-formulated, as it contains x_1^2 but not x_1.
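The well-formulatedness condition can be checked mechanically once terms are written as exponent vectors. The sketch below uses an illustrative representation (terms as exponent tuples over the predictors), not the formal notation defined in the next section:

```python
from itertools import product

def is_well_formulated(model):
    """Strong-heredity check: every lower order term dividing a term in the
    model must also be in the model.

    Terms are exponent tuples; e.g., with predictors (x1, x2), the tuple
    (2, 0) is x1^2 and (1, 1) is x1*x2. The intercept (0, 0) is implicit.
    """
    terms = set(model)
    for t in model:
        # enumerate all divisors of t in the polynomial hierarchy
        for d in product(*(range(e + 1) for e in t)):
            if d != t and any(d) and d not in terms:
                return False
    return True
```

For the example in the text, {x2, x1^2} fails the check because x1 is missing, while {x1, x2, x1^2} passes.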
WFMs exhibit strong heredity, in that all lower order terms dividing higher order terms in the model must also be included. An alternative is to require only weak heredity (Chipman 1996), which forces only some of the lower order terms in the corresponding polynomial hierarchy to be in the model. However, Nelder (1998) demonstrated that the conditions under which weak heredity allows the design matrix to be invariant to affine transformations of the predictors are too restrictive to be useful in practice.
Although this topic appeared in the literature more than three decades ago (Nelder 1977), only recently have modern variable selection techniques been adapted to account for the constraints imposed by heredity. As described in Bien et al. (2013), the current literature on variable selection for polynomial response surface models can be classified into three broad groups: multi-step procedures (Brusco et al. 2009; Peixoto 1987), regularized regression methods (Bien et al. 2013; Yuan et al. 2009), and Bayesian approaches (Chipman 1996). The methods introduced in this chapter take a Bayesian approach to variable selection for well-formulated models, with particular emphasis on model priors.
As mentioned in previous chapters, the Bayesian variable selection problem consists of finding models with high posterior probabilities within a pre-specified model space \mathcal{M}. The model posterior probability for M ∈ \mathcal{M} is given by

p(M \mid y, \mathcal{M}) \propto m(y \mid M)\, \pi(M \mid \mathcal{M}).    (4-1)
Model posterior probabilities depend on the prior distribution on the model space
as well as on the prior distributions for the model specific parameters implicitly through
the marginals m(y|M) Priors on the model specific parameters have been extensively
discussed in the literature (Berger amp Pericchi 1996 Berger et al 2001 George 2000
Jeffreys 1961 Kass amp Wasserman 1996 Liang et al 2008 Zellner amp Siow 1980) In
contrast the effect of the prior on the model space has until recently been neglected
A few authors (eg Casella et al (2014) Scott amp Berger (2010) Wilson et al (2010))
have highlighted the relevance of the priors on the model space in the context of multiple
testing Adequately formulating priors on the model space can both account for structure
in the predictors and provide additional control on the detection of false positive terms
In addition, using the popular uniform prior over the model space may lead to the undesirable and "informative" implication of favoring models of size p/2 (where p is the total number of covariates), since this is the most abundant model size contained in the model space.
Variable selection within the model space of well-formulated polynomial models poses two challenges for automatic objective model selection procedures. First, the notion of model complexity takes on a new dimension: complexity is not exclusively a function of the number of predictors, but also depends upon the depth and connectedness of the associations defined by the polynomial hierarchy. Second, because the model space is shaped by such relationships, stochastic search algorithms used to explore the models must also conform to these restrictions.
Models without polynomial hierarchy constitute a special case of WFMs in which all predictors are of order one. Hence, all the methods developed throughout this chapter also apply to models with no predictor structure. Additionally, although our proposed methods are presented for the normal linear case to simplify the exposition, these methods are general enough to be embedded in many Bayesian selection and averaging procedures, including, of course, the occupancy framework previously discussed.
In this chapter, we first provide the necessary definitions to characterize the well-formulated model selection problem. Then we proceed to introduce three new prior structures on the well-formulated model space and characterize their behavior with simple examples and simulations. With the model priors in place, we build a stochastic search algorithm to explore spaces of well-formulated models that relies on intrinsic priors for the model-specific parameters, though this assumption can be relaxed to use other mixtures of g-priors. Finally, we implement our procedures using both simulated and real data.
4.2 Setup for Well-Formulated Models
Suppose that the observations y_i are modeled using the polynomial regression on the covariates x_{i1}, ..., x_{ip} given by

    y_i = ∑_{α ∈ N_0^p} β_α ∏_{j=1}^p x_{ij}^{α_j} + ε_i,                (4-2)

where α = (α_1, ..., α_p) belongs to N_0^p, the p-dimensional space of natural numbers including 0, with ε_i ~iid N(0, σ²), and only finitely many β_α are allowed to be non-zero. As an illustration, consider a model space that includes polynomial terms incorporating covariates x_{i1} and x_{i2} only. The terms x_{i2}² and x_{i1}²x_{i2} can be represented by α = (0, 2) and α = (2, 1), respectively.
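To make the multi-index representation concrete, the following sketch (in Python with NumPy, which the text itself does not use; the helper `design_matrix` is illustrative, not notation from the dissertation) builds the columns of a polynomial design matrix directly from a collection of multi-indices α.

```python
import numpy as np

def design_matrix(X, model):
    """One column per multi-index alpha: the product over j of X[:, j] ** alpha[j].

    The all-zeros multi-index yields the intercept column.
    """
    return np.column_stack(
        [np.prod(X ** np.asarray(alpha), axis=1) for alpha in model])

# The terms from the text: alpha = (0, 2) is x_{i2}^2 and
# alpha = (2, 1) is x_{i1}^2 * x_{i2}.
X = np.array([[2.0, 3.0]])                       # one observation, p = 2
Z = design_matrix(X, [(0, 0), (0, 2), (2, 1)])   # columns: 1, 3^2 = 9, 2^2 * 3 = 12
```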
The notation y = Z(X)β + ε is used to denote that the observed response y = (y_1, ..., y_n)′ is modeled via a polynomial function Z of the original covariates contained in X = (x_1, ..., x_p) (where x_j = (x_{1j}, ..., x_{nj})′), and the coefficients of the polynomial terms are given by β. A specific polynomial model M is defined by the set of coefficients β_α that are allowed to be non-zero. This definition is equivalent to characterizing M through a collection of multi-indices α ∈ N_0^p. In particular, model M is specified by M = {α_{M,1}, ..., α_{M,|M|}} for α_{M,k} ∈ N_0^p, where β_α = 0 for α ∉ M.
Any particular model M uses a subset X_M of the original covariates X to form the polynomial terms in the design matrix Z_M(X). Without ambiguity, a polynomial model Z_M(X) on X can be identified with a polynomial model Z_M(X_M) on the covariates X_M. The number of terms used by M to model the response y, denoted by |M|, corresponds to the number of columns of Z_M(X_M). The coefficient vector and error variance of the model M are denoted by β_M and σ²_M, respectively. Thus, M models the data as y = Z_M(X_M)β_M + ε_M, where ε_M ~ N(0, Iσ²_M). Model M is said to be nested in model M′ if M ⊂ M′. M models the response through the covariates in two distinct ways: choosing the set of meaningful covariates X_M, as well as choosing the polynomial structure of these covariates, Z_M(X_M).
The set N_0^p constitutes a partially ordered set or, more succinctly, a poset. A poset is a set partially ordered through a binary relation "≼". In this context, the binary relation on the poset N_0^p is defined between pairs (α, α′) by α′ ≼ α whenever α_j ≥ α′_j for all j = 1, ..., p, with α′ ≺ α if, additionally, α_j > α′_j for some j. The order of a term α ∈ N_0^p is given by the sum of its elements, order(α) = ∑ α_j. When order(α) = order(α′) + 1 and α′ ≺ α, then α′ is said to immediately precede α, which is denoted by α′ → α. The parent set of α is defined by P(α) = {α′ ∈ N_0^p : α′ → α}, and is given by the set of nodes that immediately precede the given node. A polynomial model M is said to be well-formulated if α ∈ M implies that P(α) ⊂ M. For example, any well-formulated model using x_{i1}²x_{i2} to model y_i must also include the parent terms x_{i1}x_{i2} and x_{i1}², their corresponding parent terms x_{i1} and x_{i2}, and the intercept term 1.
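The parent-set definition translates directly into code. The sketch below (Python; the function names are mine, not the dissertation's) obtains P(α) by lowering each positive exponent by one, checks closure under parents, and reproduces the x_{i1}²x_{i2} example from the text.

```python
def parents(alpha):
    """P(alpha): multi-indices obtained by lowering one positive exponent by 1."""
    return {tuple(a - (k == j) for k, a in enumerate(alpha))
            for j in range(len(alpha)) if alpha[j] > 0}

def is_well_formulated(model):
    """A model (a set of multi-index tuples) is a WFM iff it is closed under parents."""
    return all(parents(alpha) <= model for alpha in model)

# Including x1^2 * x2 forces x1*x2 and x1^2, hence x1, x2, and the intercept:
M = {(0, 0), (1, 0), (0, 1), (2, 0), (1, 1), (2, 1)}
assert is_well_formulated(M)
assert not is_well_formulated({(0, 0), (2, 0)})   # x1^2 without its parent x1
```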
The poset N_0^p can be represented by a directed acyclic graph (DAG). Without ambiguity, we can identify nodes in the graph α ∈ N_0^p with terms in the set of covariates. The graph has directed edges to a node from its parents. Any well-formulated model M is represented by a subgraph of the DAG with the property that if a node α is in the subgraph, then the nodes corresponding to P(α) are as well. Figure 4-1 shows examples of well-formulated polynomial models, where α ∈ N_0^p is identified with ∏_{j=1}^p x_j^{α_j}.
The motivation for considering only well-formulated polynomial models is compelling. Let Z_M be the design matrix associated with a polynomial model M. The subspace of y modeled by Z_M, given by the hat matrix H_M = Z_M(Z′_M Z_M)^{-1} Z′_M, is invariant to affine transformations of the matrix X_M if and only if M corresponds to a well-formulated polynomial model (Peixoto, 1990).
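This invariance is easy to verify numerically. In the sketch below (Python/NumPy; my own construction, not from the text), shifting x1 leaves the hat matrix of the WFM {1, x1, x1²} unchanged, but alters that of the non-WFM {1, x1²}.

```python
import numpy as np

def hat(Z):
    """H = Z (Z'Z)^{-1} Z'."""
    return Z @ np.linalg.solve(Z.T @ Z, Z.T)

rng = np.random.default_rng(0)
x1 = rng.standard_normal(8)
c = 3.0   # an arbitrary nonzero shift

# WFM {1, x1, x1^2}: span{1, x1 + c, (x1 + c)^2} equals span{1, x1, x1^2}
Z = np.column_stack([np.ones(8), x1, x1 ** 2])
Zc = np.column_stack([np.ones(8), x1 + c, (x1 + c) ** 2])
assert np.allclose(hat(Z), hat(Zc))

# Non-WFM {1, x1^2}: the shifted column introduces a linear term not in the span
W = np.column_stack([np.ones(8), x1 ** 2])
Wc = np.column_stack([np.ones(8), (x1 + c) ** 2])
assert not np.allclose(hat(W), hat(Wc))
```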
Figure 4-1. Graphs of well-formulated polynomial models for p = 2 (panels A and B).
For example, if p = 2 and y_i = β_(0,0) + β_(1,0) x_{i1} + β_(0,1) x_{i2} + β_(1,1) x_{i1}x_{i2} + ε_i, then the hat matrix is invariant to any covariate transformation of the form A(x_{i1}, x_{i2})′ + b, for any real-valued positive definite 2 × 2 matrix A and any real-valued vector b of dimension two. In contrast, if y_i = β_(0,0) + β_(2,0) x_{i1}² + ε_i, then the hat matrix formed after applying the transformation x_{i1} ↦ x_{i1} + c for real c ≠ 0 is not the same as the hat matrix formed from the original x_{i1}.

4.2.1 Well-Formulated Model Spaces
The spaces of WFMs M considered in this paper can be characterized in terms of two WFMs: M_B, the base model, and M_F, the full model. The base model contains at least the intercept term and is nested in the full model. The model space M is populated by all well-formulated models M that nest M_B and are nested in M_F:

    M = {M : M_B ⊆ M ⊆ M_F and M is well-formulated}.

For M to be well-formulated, the entire ancestry of each node in M must also be included in M. Because of this, M ∈ M can be uniquely identified by two different sets of nodes in M_F: the set of extreme nodes and the set of children nodes. For M ∈ M,
the sets of extreme and children nodes, respectively denoted by E(M) and C(M), are defined by

    E(M) = {α ∈ M \ M_B : α ∉ P(α′) for all α′ ∈ M}
    C(M) = {α ∈ M_F \ M : {α} ∪ M is well-formulated}.

The extreme nodes are those nodes that, when removed from M, give rise to a WFM in M. The children nodes are those nodes that, when added to M, give rise to a WFM in M. Because M_B ⊆ M for all M ∈ M, the set of nodes E(M) ∪ M_B determines M by beginning with this set and iteratively adding parent nodes. Similarly, the nodes in C(M) determine the set {α′ ∈ P(α) : α ∈ C(M)} ∪ {α′ ∈ E(M_F) : α ≼ α′ for all α ∈ C(M)}, which contains E(M) ∪ M_B and thus uniquely identifies M.
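A small sketch of these two sets (Python; `parents` lowers one exponent of a multi-index, as in the parent-set definition, and the function names are mine) reproduces the example of Figure 4-2, where M = {1, x1, x1²} sits inside the full quadratic on two covariates.

```python
def parents(alpha):
    """P(alpha): lower one positive exponent of the multi-index by 1."""
    return {tuple(a - (k == j) for k, a in enumerate(alpha))
            for j in range(len(alpha)) if alpha[j] > 0}

def extreme_nodes(model, base):
    """E(M): non-base nodes of M that are not a parent of any node in M."""
    return {a for a in model - base
            if not any(a in parents(b) for b in model)}

def children_nodes(model, full):
    """C(M): nodes of the full model, outside M, whose whole parent set lies in M."""
    return {a for a in full - model if parents(a) <= model}

base = {(0, 0)}
full = {(0, 0), (1, 0), (0, 1), (2, 0), (1, 1), (0, 2)}
M = {(0, 0), (1, 0), (2, 0)}                 # the model {1, x1, x1^2}
assert extreme_nodes(M, base) == {(2, 0)}    # only x1^2 can be dropped
assert children_nodes(M, full) == {(0, 1)}   # only x2 can be added
```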
Figure 4-2. A) Extreme node set and B) children node set for the model M = {1, x1, x1²} in the space with full model M_F = {1, x1, x2, x1², x1x2, x2²}.
In Figure 4-2, the extreme and children sets for the model M = {1, x1, x1²} are shown for the model space characterized by M_F = {1, x1, x2, x1², x1x2, x2²}. In Figure 4-2A, the solid nodes represent nodes α ∈ M \ E(M), the dashed node corresponds to α ∈ E(M), and the dotted nodes are not in M. Solid nodes in Figure 4-2B correspond to those in M; the dashed node is the single node in C(M), and the dotted nodes are not in M ∪ C(M).

4.3 Priors on the Model Space
As discussed in Scott & Berger (2010), the Ockham's-razor effect found automatically in Bayesian variable selection through the Bayes factor does not correct for multiple testing. This penalization acts against more complex models, but does not account for the collection of models in the model space, which describes the multiplicity of the testing problem. This is where the role of the prior on the model space becomes important. As Scott & Berger explain, the multiplicity penalty is "hidden away" in the model prior probabilities π(M | M).
In what follows, we propose three different prior structures on the model space for WFMs, discuss their advantages and disadvantages, and describe reasonable choices for their hyper-parameters. In addition, we investigate how the choice of prior structure and hyper-parameter combinations affects the posterior probabilities for predictor inclusion, providing some recommendations for different situations.

4.3.1 Model Prior Definition
The graphical structure of the model space suggests a method for prior construction on M guided by the notion of inheritance. A node α is said to inherit from a node α′ if there is a directed path from α′ to α in the graph of M_F. The inheritance is said to be immediate if order(α) = order(α′) + 1 (equivalently, if α′ ∈ P(α), or if α′ immediately precedes α).
For convenience, define (M) = M \ M_B to be the set of nodes in M that are not in the base model M_B. For α ∈ (M_F), let γ_α(M) be the indicator function describing whether α is included in M, i.e., γ_α(M) = I(α ∈ M). Denote by γ^ν(M) the set of indicators of inclusion in M for all order-ν nodes in (M_F). Finally, let γ^{<ν}(M) = ∪_{j=0}^{ν−1} γ^j(M), the set of indicators of inclusion in M for all nodes in (M_F) of order less than ν. With these definitions, the prior probability of any model M ∈ M can be factored as
these definitions the prior probability of any model M isin M can be factored as
π(M|M) =
JmaxMprod
j=JminM
π(γ j(M)|γltj(M)M) (4ndash3)
where JminM and Jmax
M are respectively the minimum and maximum order of nodes in
(MF ) and π(γJminM (M)|γltJmin
M (M)M) = π(γJminM (M)|M)
Prior distributions on M can be simplified by making two assumptions. First, if order(α) = order(α′) = j, then γ_α and γ_α′ are assumed to be conditionally independent given γ^{<j}, denoted by γ_α ⊥⊥ γ_α′ | γ^{<j}. Second, immediate inheritance is invoked, and it is assumed that if order(α) = j, then γ_α(M) | γ^{<j}(M) = γ_α(M) | γ_{P(α)}(M), where γ_{P(α)}(M) is the inclusion indicator for the set of parent nodes of α. This indicator is one if the complete parent set of α is contained in M, and zero otherwise.
In Figure 4-3, these two assumptions are depicted with M_F being an order-two surface in two main effects. The conditional independence assumption (Figure 4-3A) implies that the inclusion indicators for x1², x2², and x1x2 are independent when conditioned on all the lower-order terms. In this same space, immediate inheritance implies that the inclusion of x1², conditioned on the inclusion of all lower-order nodes, is equivalent to conditioning it on its parent set (x1 in this case).
Figure 4-3. A) Conditional independence and B) immediate inheritance assumptions, depicted for the full quadratic surface in two main effects.
Denote the conditional inclusion probability of node α in model M by π_α = π(γ_α(M) = 1 | γ_{P(α)}(M), M). Under the assumptions of conditional independence and immediate inheritance, the prior probability of M is

    π(M | π_M, M) = ∏_{α ∈ (M_F)} π_α^{γ_α(M)} (1 − π_α)^{1−γ_α(M)},     (4-4)

with π_M = {π_α : α ∈ (M_F)}. Because M must be well-formulated, π_α = γ_α = 0 if γ_{P(α)}(M) = 0. Thus, the product in 4-4 can be restricted to the set of nodes α ∈ (M) ∪ C(M). Additional structure can be built into the prior on M by making assumptions about the inclusion probabilities π_α, such as equality assumptions or assumptions of a hyper-prior for these parameters. Three such prior classes are developed next, first by assigning hyper-priors on π_M assuming some structure among its elements, and then marginalizing out the π_M.
Hierarchical Uniform Prior (HUP). The HUP assumes that the non-zero π_α are all equal. Specifically, for a model M ∈ M, it is assumed that π_α = π for all α ∈ (M) ∪ C(M). A complete Bayesian specification of the HUP is completed by assuming a prior distribution for π. The choice of π ~ Beta(a, b) produces

    π_HUP(M | M, a, b) = B(|(M)| + a, |C(M)| + b) / B(a, b),             (4-5)

where B is the beta function. Setting a = b = 1 gives the particular value of

    π_HUP(M | M, a = 1, b = 1) = [1 / (|(M)| + |C(M)| + 1)] · ((|(M)| + |C(M)|) choose |(M)|)^{-1}.    (4-6)

The HUP assigns equal probabilities to all models for which the sets of nodes (M) and C(M) have the same cardinality. This prior provides a combinatorial penalization, but essentially fails to account for the hierarchical structure of the model space. An additional penalization for model complexity can be incorporated into the HUP by changing the values of a and b. Because π_α = π for all α, this penalization can only depend on some aspect of the entire graph of M_F, such as the total number of nodes not in the null model, |(M_F)|.
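Equation 4-5 reduces to a ratio of beta functions, sketched below (Python standard library only; the function and argument names are mine). The values match the first row of Figure 4-4, where the null model of the quadratic space on two covariates has |(M)| = 0 and |C(M)| = 2, and the "ch" choice for the HUP sets b = |(M_F)| = 5.

```python
from math import exp, lgamma

def beta_fn(a, b):
    """The beta function B(a, b), via log-gammas for numerical stability."""
    return exp(lgamma(a) + lgamma(b) - lgamma(a + b))

def hup_prior(n_in, n_child, a=1.0, b=1.0):
    """Equation 4-5: B(|(M)| + a, |C(M)| + b) / B(a, b)."""
    return beta_fn(n_in + a, n_child + b) / beta_fn(a, b)

# Null model of the quadratic space on two covariates:
assert abs(hup_prior(0, 2) - 1 / 3) < 1e-12          # (a, b) = (1, 1)
assert abs(hup_prior(0, 2, b=5.0) - 5 / 7) < 1e-12   # (a, b) = (1, ch)
```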
Hierarchical Independence Prior (HIP). The HIP assumes that there are no equality constraints among the non-zero π_α. Each non-zero π_α is given its own prior, which is assumed to be a Beta distribution with parameters a_α and b_α. Thus, the prior probability of M under the HIP is

    π_HIP(M | M, a, b) = ∏_{α ∈ (M)} a_α/(a_α + b_α) · ∏_{α ∈ C(M)} b_α/(a_α + b_α),    (4-7)

where the product over the empty set is taken to be 1. Because the π_α are totally independent, any choice of a_α and b_α is equivalent to choosing a probability of success π_α for a given α. Setting a_α = b_α = 1 for all α ∈ (M) ∪ C(M) gives the particular value of

    π_HIP(M | M, a = 1, b = 1) = (1/2)^{|(M)| + |C(M)|}.                 (4-8)

Although the prior with this choice of hyper-parameters accounts for the hierarchical structure of the model space, it essentially provides no penalization for combinatorial complexity at different levels of the hierarchy. This can be observed by considering a model space with main effects only: the exponent in 4-8 is the same for every model in the space, because each node is either in the model or in the children set.
Additional penalizations for model complexity can be incorporated into the HIP. Because each γ^j is conditioned on γ^{<j} in the prior construction, the a_α and b_α for α of order j can be conditioned on γ^{<j}. One such additional penalization utilizes the number of nodes of order j that could be added to produce a WFM conditioned on the inclusion vector γ^{<j}, which is denoted by ch_j(γ^{<j}). Choosing a_α = 1 and b_α(M) = ch_j(γ^{<j}) is equivalent to choosing a probability of success π_α = 1/ch_j(γ^{<j}). This penalization can drive down the false positive rate when ch_j(γ^{<j}) is large, but may produce more false negatives.
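Equation 4-7 is a product of independent Bernoulli-type factors. The sketch below (Python; the dict-based interface is mine) evaluates the HIP with a = 1, b = ch for the model {1, x1} in the quadratic space on two covariates, matching the 1/9 entry of Figure 4-4.

```python
def hip_prior(a, b, included, children):
    """Equation 4-7: product over (M) of a/(a+b) times product over C(M) of b/(a+b).

    a, b : dicts mapping a multi-index alpha to a_alpha and b_alpha.
    """
    p = 1.0
    for alpha in included:          # alpha in (M)
        p *= a[alpha] / (a[alpha] + b[alpha])
    for alpha in children:          # alpha in C(M)
        p *= b[alpha] / (a[alpha] + b[alpha])
    return p

# M = {1, x1}: (M) = {x1} with ch_1 = 2; C(M) = {x2, x1^2} with ch = 2 and 1.
a = {(1, 0): 1, (0, 1): 1, (2, 0): 1}
b = {(1, 0): 2, (0, 1): 2, (2, 0): 1}
p = hip_prior(a, b, included=[(1, 0)], children=[(0, 1), (2, 0)])
assert abs(p - 1 / 9) < 1e-12   # (1/3) * (2/3) * (1/2)
```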
Hierarchical Order Prior (HOP). A compromise between complete equality and complete independence of the π_α is to assume equality between the π_α of a given order, and independence across the different orders. Define j(M) = {α ∈ (M) : order(α) = j} and C_j(M) = {α ∈ C(M) : order(α) = j}. The HOP assumes that π_α = π_j for all α ∈ j(M) ∪ C_j(M). Assuming that π_j ~ Beta(a_j, b_j) provides a prior probability of

    π_HOP(M | M, a, b) = ∏_{j=J_M^min}^{J_M^max} B(|j(M)| + a_j, |C_j(M)| + b_j) / B(a_j, b_j).    (4-9)

The specific choice of a_j = b_j = 1 for all j gives a value of

    π_HOP(M | M, a = 1, b = 1) = ∏_j [1 / (|j(M)| + |C_j(M)| + 1)] · ((|j(M)| + |C_j(M)|) choose |j(M)|)^{-1},    (4-10)

and produces a hierarchical version of the Scott and Berger multiplicity correction.
The HOP arises from a conditional exchangeability assumption on the indicator variables. Conditioned on γ^{<j}(M), the indicators {γ_α : α ∈ j(M) ∪ C_j(M)} are assumed to be exchangeable Bernoulli random variables. By de Finetti's theorem, these arise from independent Bernoulli random variables with common probability of success π_j, with a prior distribution. Our construction of the HOP assumes that this prior is a beta distribution. Additional complexity penalizations can be incorporated into the HOP in a similar fashion to the HIP. The number of possible nodes of order j that could be added while maintaining a WFM is given by ch_j(M) = ch_j(γ^{<j}(M)) = |j(M) ∪ C_j(M)|. Using a_j = 1 and b_j(M) = ch_j(M) produces a prior with two desirable properties. First, if M′ ⊂ M, then π(M) ≤ π(M′). Second, for each order j, the conditional probability of including k nodes is greater than or equal to that of including k + 1 nodes, for k = 0, 1, ..., ch_j(M) − 1.

4.3.2 Choice of Prior Structure and Hyper-Parameters
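Equation 4-9 factors over the orders. A sketch (Python standard library; the interface is mine) for M = {1, x1} in the quadratic space on two covariates with a_j = b_j = 1 recovers the 1/12 entry of Figure 4-4.

```python
from math import exp, lgamma

def beta_fn(a, b):
    """The beta function B(a, b)."""
    return exp(lgamma(a) + lgamma(b) - lgamma(a + b))

def hop_prior(per_order):
    """Equation 4-9: product over orders j of B(|j(M)| + a_j, |C_j(M)| + b_j) / B(a_j, b_j).

    per_order: list of tuples (n_in_j, n_child_j, a_j, b_j), one entry per order.
    """
    p = 1.0
    for n_in, n_child, a, b in per_order:
        p *= beta_fn(n_in + a, n_child + b) / beta_fn(a, b)
    return p

# M = {1, x1}: order 1 has one included node and one child (x2);
# order 2 has no included node and one child (x1^2).
assert abs(hop_prior([(1, 1, 1, 1), (0, 1, 1, 1)]) - 1 / 12) < 1e-12
```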
Each of the priors introduced in Section 4.3.1 defines a whole family of model priors, characterized by the probability distribution assumed for the inclusion probabilities π_M. For the sake of simplicity, this paper focuses on those arising from Beta distributions, and concentrates on particular choices of hyper-parameters, which can be specified automatically. First, we describe some general features of how each of the three prior structures (HUP, HIP, HOP) allocates mass to the models in the model space. Second, as there is an infinite number of ways in which the hyper-parameters can be specified, focus is placed on the default choice a = b = 1, as well as on the complexity penalizations described in Section 4.3.1. The second alternative is referred to as a = 1, b = ch, where b = ch has a slightly different interpretation depending on the prior structure. Accordingly, b = ch is given by b_j(M) = b_α(M) = ch_j(M) = |j(M) ∪ C_j(M)| for the HOP and HIP, where j = order(α), while b = ch denotes that b = |(M_F)| for the HUP. The prior behavior is illustrated for two model spaces. In both cases, the base model M_B is taken to be the intercept-only model, and M_F is the DAG shown (Figures 4-4 and 4-5). The priors considered treat model complexity differently, and some general properties can be seen in these examples.
    Model                                HIP              HOP              HUP
                                    (1,1)  (1,ch)   (1,1)  (1,ch)   (1,1)  (1,ch)
     1  1                           1/4    4/9      1/3    1/2      1/3    5/7
     2  1, x1                       1/8    1/9      1/12   1/12     1/12   5/56
     3  1, x2                       1/8    1/9      1/12   1/12     1/12   5/56
     4  1, x1, x1^2                 1/8    1/9      1/12   1/12     1/12   5/168
     5  1, x2, x2^2                 1/8    1/9      1/12   1/12     1/12   5/168
     6  1, x1, x2                   1/32   3/64     1/12   1/12     1/60   1/72
     7  1, x1, x2, x1^2             1/32   1/64     1/36   1/60     1/60   1/168
     8  1, x1, x2, x1x2             1/32   1/64     1/36   1/60     1/60   1/168
     9  1, x1, x2, x2^2             1/32   1/64     1/36   1/60     1/60   1/168
    10  1, x1, x2, x1^2, x1x2       1/32   1/192    1/36   1/120    1/30   1/252
    11  1, x1, x2, x1^2, x2^2       1/32   1/192    1/36   1/120    1/30   1/252
    12  1, x1, x2, x1x2, x2^2       1/32   1/192    1/36   1/120    1/30   1/252
    13  1, x1, x2, x1^2, x1x2, x2^2 1/32   1/576    1/12   1/120    1/6    1/252

Figure 4-4. Prior probabilities for the space of well-formulated models associated to the quadratic surface on two variables, where M_B is taken to be the intercept-only model and (a, b) ∈ {(1, 1), (1, ch)}.
First, contrast the choices of HIP, HUP, and HOP for (a, b) = (1, 1). The HIP induces a complexity penalization that only accounts for the order of the terms in the model. This is best exhibited by the model space in Figure 4-4: models including x1 and x2 (models 6 through 13) are given the same prior probability, and no penalization is incurred for the inclusion of any or all of the quadratic terms. In contrast to the HIP, the
    Model                        HIP               HOP              HUP
                            (1,1)  (1,ch)    (1,1)  (1,ch)   (1,1)  (1,ch)
     1  1                   1/8    27/64     1/4    1/2      1/4    4/7
     2  1, x1               1/8    9/64      1/12   1/10     1/12   2/21
     3  1, x2               1/8    9/64      1/12   1/10     1/12   2/21
     4  1, x3               1/8    9/64      1/12   1/10     1/12   2/21
     5  1, x1, x3           1/8    3/64      1/12   1/20     1/12   4/105
     6  1, x2, x3           1/8    3/64      1/12   1/20     1/12   4/105
     7  1, x1, x2           1/16   3/128     1/24   1/40     1/30   1/42
     8  1, x1, x2, x1x2     1/16   3/128     1/24   1/40     1/20   1/70
     9  1, x1, x2, x3       1/16   1/128     1/8    1/40     1/20   1/70
    10  1, x1, x2, x3, x1x2 1/16   1/128     1/8    1/40     1/5    1/70

Figure 4-5. Prior probabilities for the space of well-formulated models associated to three main effects and one interaction term, where M_B is taken to be the intercept-only model and (a, b) ∈ {(1, 1), (1, ch)}.
HUP induces a penalization for model complexity, but it does not adequately penalize models for including additional terms. Using the HIP, models including all of the terms are given at least as much probability as any model containing any non-empty set of terms (Figures 4-4 and 4-5). This lack of penalization of the full model originates from its combinatorial simplicity (i.e., this is the only model that contains every term), and as an unfortunate consequence, this model space distribution favors the base and full models. Similar behavior is observed with the HOP with (a, b) = (1, 1). As models become more complex, they are appropriately penalized for their size. However, after a sufficient number of nodes are added, the number of possible models of that particular size is considerably reduced; thus, combinatorial complexity is negligible for the largest models. This is best exhibited in Figure 4-5, where the HOP places more mass on the full model than on any model containing a single order-one node, highlighting an undesirable behavior of the priors with this choice of hyper-parameters.
In contrast, if (a, b) = (1, ch), all three priors produce strong penalization as models become more complex, both in terms of the number and the order of the nodes contained in the model. For all of the priors, adding a node α to a model M to form M′ produces π(M) ≥ π(M′). However, differences between the priors are apparent. The HIP penalizes the full model the most, with the HOP penalizing it the least and the HUP lying between them. At face value, the HOP creates the most compelling penalization of model complexity. In Figure 4-5, the penalization of the HOP is the least dramatic, producing prior odds of 20 for M_B versus M_F, as opposed to the HUP and HIP, which produce prior odds of 40 and 54, respectively. Similarly, the prior odds in Figure 4-4 are 60, 180, and 256 for the HOP, HUP, and HIP, respectively.

4.3.3 Posterior Sensitivity to the Choice of Prior
To determine how the proposed priors adjust the posterior probabilities to account for multiplicity, a simple simulation was performed. The goal of this exercise was to understand how the priors respond to increasing complexity. First, the priors are compared as the number of main effects p grows. Second, they are compared as the depth of the hierarchy increases, or in other words, as the order J_M^max increases.
The quality of a node is characterized by its marginal posterior inclusion probability, defined as p_α = ∑_{M ∈ M} I(α ∈ M) p(M | y, M) for α ∈ M_F. These posteriors were obtained for the proposed priors, as well as for the Equal Probability Prior (EPP) on M. For all prior structures, both the default hyper-parameters a = b = 1 and the penalizing choice of a = 1 and b = ch are considered. The results for the different combinations of M_F and M_T incorporated in the analysis were obtained from 100 random replications (i.e., generating at random 100 matrices of main effects and responses). The simulation proceeds as follows:
1. Randomly generate main effects matrices X = (x1, ..., x18) with x_i ~iid N_n(0, I_n), and error vectors ε ~ N_n(0, I_n), for n = 60.
2. Setting all coefficient values equal to one, calculate y = Z_{MT} β + ε for the true models given by
   MT1 = {x1, x2, x3, x1², x1x2, x2², x2x3}, with |MT1| = 7
   MT2 = {x1, x2, ..., x16}, with |MT2| = 16
   MT3 = {x1, x2, x3, x4}, with |MT3| = 4
   MT4 = {x1, x2, ..., x8, x1², x3x4}, with |MT4| = 10
   MT5 = {x1, x2, x3, x4, x1², x3x4}, with |MT5| = 6
Table 4-1. Characterization of the full models M_F and corresponding model spaces M considered in simulations.

    Growing p, fixed J_M^max:
    M_F                  |M_F|    |M|     M_T used
    (x1 + x2 + x3)^2        9       95    MT1
    (x1 + ... + x4)^2      14     1337    MT1
    (x1 + ... + x5)^2      20    38619    MT1

    Fixed p, growing J_M^max:
    M_F                  |M_F|    |M|     M_T used
    (x1 + x2 + x3)^2        9       95    MT1
    (x1 + x2 + x3)^3       19     2497    MT1
    (x1 + x2 + x3)^4       34   161421    MT1

    Other model spaces:
    M_F                               |M_F|    |M|     M_T used
    x1 + x2 + ... + x18                  18   262144   MT2, MT3
    (x1 + ... + x4)^2 + x5 + ... + x10   20    85568   MT4, MT5
3. In all simulations, the base model M_B is the intercept-only model. The notation (x1 + ... + xp)^d is used to represent the full order-d polynomial response surface in p main effects. The model spaces, characterized by their corresponding full model M_F, are presented in Table 4-1, as well as the true models used in each case.
4. Enumerate the model spaces and calculate p(M | y, M) for all M ∈ M using the EPP, HUP, HIP, and HOP, the latter two each with the two sets of hyper-parameters.
5. Count the number of true positives and false positives in each M for the different priors.
The true positives (TP) are defined as those nodes α ∈ M_T such that p_α > 0.5. For the false positives (FP), three different cutoffs are considered for p_α, elucidating the adjustment for multiplicity induced by the model priors. These cutoffs are 0.10, 0.20, and 0.50, for α ∉ M_T. The results from this exercise provide insight into the influence of the prior on the marginal posterior inclusion probabilities. In Table 4-1, the model spaces considered are described in terms of the number of models they contain, and in terms of the number of nodes of M_F, the full model that defines the DAG for M.
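The steps above can be sketched end to end on a toy space. The code below (Python/NumPy; it substitutes a BIC approximation for the intrinsic-prior marginals m(y|M) and a uniform, EPP-style model prior, so it illustrates the pipeline rather than reproducing the procedure used in the text) enumerates all WFMs of the quadratic space on two covariates, computes model posteriors, and reports the marginal inclusion probabilities p_α.

```python
import numpy as np
from itertools import chain, combinations

FULL = [(0, 0), (1, 0), (0, 1), (2, 0), (1, 1), (0, 2)]   # quadratic surface, p = 2

def parents(alpha):
    return {tuple(a - (k == j) for k, a in enumerate(alpha))
            for j in range(len(alpha)) if alpha[j] > 0}

def wfms(full):
    """All well-formulated subsets of `full` that contain the intercept."""
    rest = [a for a in full if a != (0, 0)]
    subs = chain.from_iterable(combinations(rest, r) for r in range(len(rest) + 1))
    return [set(s) | {(0, 0)} for s in subs
            if all(parents(a) <= set(s) | {(0, 0)} for a in s)]

def log_marg(y, Z):
    """-BIC/2, a rough stand-in for log m(y|M)."""
    n, k = Z.shape
    r = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
    return -0.5 * (n * np.log(r @ r / n) + k * np.log(n))

rng = np.random.default_rng(1)
X = rng.standard_normal((60, 2))
y = 1 + X[:, 0] + X[:, 0] ** 2 + rng.standard_normal(60)   # truth: {1, x1, x1^2}

models = wfms(FULL)            # 13 models, as in Figure 4-4
lm = np.array([log_marg(y, np.column_stack(
    [np.prod(X ** np.asarray(a), axis=1) for a in M])) for M in models])
post = np.exp(lm - lm.max())
post /= post.sum()             # p(M|y) under a uniform (EPP-style) model prior

# marginal inclusion probabilities p_alpha = sum over models containing alpha
p_alpha = {a: post[[a in M for M in models]].sum() for a in FULL}
```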
Growing number of main effects, fixed polynomial degree. This simulation investigates the posterior behavior as the number of covariates grows for a polynomial surface of degree two. The true model is assumed to be MT1, which has 7 polynomial terms. The false positive and true positive rates are displayed in Table 4-2.
First, focus on the posterior when (a, b) = (1, 1). As p increases and the cutoff is low, the number of false positives increases for the EPP as well as for the hierarchical priors, although less dramatically for the latter. All of the priors identify all of the true positives. The false positive rate for the 50% cutoff is less than one for all four prior structures, with the HIP exhibiting the smallest false positive rate.
With the second choice of hyper-parameters, (1, ch), the improvement of the hierarchical priors over the EPP is dramatic, and the difference in performance is more pronounced as p increases. These also considerably outperform the priors using the default hyper-parameters a = b = 1 in terms of the false positives. Regarding the number of true positives, all priors discovered the 7 true predictors in MT1 for most of the 100 random samples of data, with only minor differences observed between any of the priors considered. That being said, the means for the priors with a = 1, b = ch are slightly lower for the true positives. With a 50% cutoff, the hierarchical priors keep tight control on the number of false positives, but in doing so discard true positives with slightly higher frequency.
Growing polynomial degree, fixed main effects. For these examples, the true model is once again MT1. When the complexity is increased by making the order of M_F larger (Table 4-3), the inability of the EPP to adjust the inclusion posteriors for multiplicity becomes more pronounced: the EPP becomes less and less efficient at removing false positives when the FP cutoff is low. Among the priors with a = b = 1, as the order increases, the HIP is the best at filtering out the false positives. Using the 0.5 false positive cutoff, some false positives are included both for the EPP and for all the priors with a = b = 1, indicating that the default hyper-parameters might not be the best option to control FP. The 7 covariates in the true model all obtain a high inclusion posterior probability both with the EPP and with the a = b = 1 priors.
Table 4-2. Mean number of false and true positives in 100 randomly generated datasets, as the number of main effects increases from three to five predictors in a full quadratic, under the equal probability prior (EPP), the hierarchical independence prior (HIP), the hierarchical order prior (HOP), and the hierarchical uniform prior (HUP).

                                                   a = 1, b = 1         a = 1, b = ch
    Cutoff     |M_T|  M_F                   EPP    HIP   HUP   HOP     HIP   HUP   HOP
    FP(>0.10)    7    (x1+x2+x3)^2          1.78   1.78  2.00  2.00    0.11  1.31  1.06
    FP(>0.20)                               0.43   0.43  2.00  1.98    0.01  0.28  0.24
    FP(>0.50)                               0.04   0.04  0.97  0.36    0.00  0.03  0.02
    TP(>0.50)         (MT1)                 7.00   7.00  7.00  7.00    6.97  6.99  6.99
    FP(>0.10)    7    (x1+x2+x3+x4)^2       3.62   1.94  2.33  2.45    0.10  0.63  1.07
    FP(>0.20)                               1.60   0.47  2.17  2.15    0.01  0.17  0.24
    FP(>0.50)                               0.25   0.06  0.35  0.36    0.00  0.02  0.02
    TP(>0.50)         (MT1)                 7.00   7.00  7.00  7.00    6.97  6.99  6.99
    FP(>0.10)    7    (x1+x2+x3+x4+x5)^2    6.00   2.16  2.60  2.55    0.12  0.43  1.15
    FP(>0.20)                               2.91   0.55  2.13  2.18    0.02  0.19  0.27
    FP(>0.50)                               0.66   0.11  0.25  0.37    0.00  0.03  0.01
    TP(>0.50)         (MT1)                 7.00   7.00  7.00  7.00    6.97  6.99  6.99
In contrast, any of the a = 1 and b = ch priors dramatically improve upon their a = b = 1 counterparts, consistently assigning low inclusion probabilities to the majority of the false positive terms, even for low cutoffs. As the order of the polynomial surface increases, the difference in performance between these priors and either the EPP or their default versions becomes even clearer. At the 50% cutoff, the hierarchical priors with complexity penalization exhibit very low false positive rates. The true positive rate decreases slightly for these priors, but not to an alarming degree.
Other model spaces. This part of the analysis considers model spaces that do not correspond to full polynomial-degree response surfaces (Table 4-4). The first example is a model space with main effects only. The second example includes a full quadratic surface of order 2, but in addition includes six terms for which only main effects are to be modeled. Two true models are used in combination with each model space to observe how the posterior probabilities vary under the influence of the different priors for "large" and "small" true models.
Table 4-3. Mean number of false and true positives in 100 randomly generated datasets, as the maximum order of M_F increases from two to four in a full model with three main effects, under the equal probability prior (EPP), the hierarchical independence prior (HIP), the hierarchical order prior (HOP), and the hierarchical uniform prior (HUP).

                                                   a = 1, b = 1         a = 1, b = ch
    Cutoff     |M_T|  M_F                   EPP    HIP   HUP   HOP     HIP   HUP   HOP
    FP(>0.10)    7    (x1+x2+x3)^2          1.78   1.78  2.00  2.00    0.11  1.31  1.06
    FP(>0.20)                               0.43   0.43  2.00  1.98    0.01  0.28  0.24
    FP(>0.50)                               0.04   0.04  0.97  0.36    0.00  0.03  0.02
    TP(>0.50)         (MT1)                 7.00   7.00  7.00  7.00    6.97  6.99  6.99
    FP(>0.10)    7    (x1+x2+x3)^3          7.37   5.21  6.06  2.91    0.55  1.05  1.39
    FP(>0.20)                               2.91   1.55  3.61  2.08    0.17  0.34  0.31
    FP(>0.50)                               0.40   0.21  0.50  0.26    0.03  0.03  0.04
    TP(>0.50)         (MT1)                 7.00   7.00  7.00  7.00    6.97  6.98  7.00
    FP(>0.10)    7    (x1+x2+x3)^4          8.22   4.00  4.69  2.61    0.52  0.55  1.32
    FP(>0.20)                               4.21   1.13  1.76  2.03    0.12  0.15  0.31
    FP(>0.50)                               0.56   0.17  0.22  0.27    0.03  0.03  0.04
    TP(>0.50)         (MT1)                 7.00   7.00  7.00  7.00    6.97  6.97  6.99
By construction, in model spaces with main effects only, HIP(1,1) and EPP are equivalent, as are HOP(a,b) and HUP(a,b). This accounts for the similarities observed among the results for the first two cases presented in Table 4-4, where the model space corresponds to a full model with 18 main effects and the true models are models with 16 and 4 main effects, respectively. When the number of true coefficients is large, the HUP(1,1) and HOP(1,1) do poorly at controlling false positives, even at the 50% cutoff. In contrast, the HIP (and thus the EPP) with the 50% cutoff identifies the true positives and no false positives. This result, however, does not imply that the EPP controls false positives well: the true model contains 16 out of the 18 nodes in M_F, so there is little potential for false positives. The a = 1 and b = ch priors show dramatically different behavior. The HIP controls false positives well, but fails to identify the true coefficients at the 50% cutoff. In contrast, the HOP identifies all of the true positives and has a small false positive rate at the 50% cutoff.
If the number of true positives is small, most terms in the full model are truly zero. The EPP includes at least one false positive in approximately 50% of the randomly sampled datasets. On the other hand, the HUP(1,1) provides some control for multiplicity, obtaining on average a lower number of false positives than the EPP. Furthermore, the proposed hierarchical priors with a = 1, b = ch are substantially better than the EPP (and the choice of a = b = 1) at controlling false positives and capturing all true positives using the marginal posterior inclusion probabilities. The two examples suggest that the HOP(1, ch) is the best default choice for model selection when the number of terms available at a given degree is large.
The third and fourth examples in Table 4-4 consider the same irregular model space, with data generated from MT4, with ten terms, and MT5, with six terms. HIP(1,1) and EPP again behave quite similarly, incorporating a large number of false positives for the 0.1 cutoff; at the 0.5 cutoff, some false positives are still included. The HUP(1,1) and HOP(1,1) behave similarly, with a slightly higher false positive rate at the 50% cutoff. In terms of the true positives, the EPP and a = b = 1 priors always include all of the predictors in MT4 and MT5. On the other hand, the ability of the a = 1, b = ch priors to control false positives is markedly better than that of the EPP and the hierarchical priors with the choice of a = b = 1. At the 50% cutoff, these priors identify all of the true positives and true negatives. Once again, these examples point to the hierarchical priors with additional penalization for complexity as good default priors on the model space.

4.4 Random Walks on the Model Space
When the model space M is too large to enumerate a stochastic procedure can
be used to find models with high posterior probability In particular an MCMC algorithm
can be utilized to generate a dependent sample of models from the model posterior The
structure of the model space M both presents difficulties and provides clues on how to
build algorithms to explore it Different MCMC strategies can be adopted two of which
Table 4-4. Mean number of false and true positives in 100 randomly generated datasets
with unstructured or irregular model spaces, under the equal probability prior
(EPP), the hierarchical independence prior (HIP), the hierarchical order prior
(HOP), and the hierarchical uniform prior (HUP)

                                                 a = 1, b = 1          a = 1, b = ch
Cutoff     |MT|  MF                      EPP    HIP    HUP    HOP     HIP    HUP    HOP
FP(>0.10)   16   x1 + x2 + ... + x18     1.93   1.93   2.00   2.00    0.03   1.80   1.80
FP(>0.20)                                0.52   0.52   2.00   2.00    0.01   0.46   0.46
FP(>0.50)                                0.07   0.07   2.00   2.00    0.01   0.04   0.04
TP(>0.50)  (MT2)                        15.99  15.99  16.00  16.00    6.99  15.99  15.99
FP(>0.10)    4   x1 + x2 + ... + x18    13.95  13.95   9.15   9.15    0.26   1.31   1.31
FP(>0.20)                                5.45   5.45   3.03   3.03    0.05   0.45   0.45
FP(>0.50)                                0.84   0.84   0.45   0.45    0.02   0.06   0.06
TP(>0.50)  (MT3)                         4.00   4.00   4.00   4.00    4.00   4.00   4.00
FP(>0.10)   10   (x1 + ... + x4)^2       9.73   9.71  10.00   5.60    0.34   2.33   2.20
FP(>0.20)         + x5 + ... + x10       2.65   2.65   8.73   3.05    0.12   0.74   0.69
FP(>0.50)                                0.35   0.35   1.36   1.68    0.02   0.11   0.12
TP(>0.50)  (MT4)                        10.00  10.00  10.00   9.99    9.94   9.98   9.99
FP(>0.10)    6   (x1 + ... + x4)^2      13.52  13.52  11.06   9.94    0.44   1.63   1.96
FP(>0.20)         + x5 + ... + x10       4.22   4.21   3.60   5.01    0.15   0.48   0.68
FP(>0.50)                                0.53   0.53   0.57   0.75    0.01   0.08   0.11
TP(>0.50)  (MT5)                         6.00   6.00   6.00   6.00    5.99   5.99   5.99
are outlined in this section. Combining the different strategies allows the model selection
algorithm to explore the model space thoroughly and relatively fast.

4.4.1 Simple Pruning and Growing

This first strategy relies on small, localized jumps around the model space, turning
a single node on or off at each step. The idea behind this algorithm is to grow the model
by activating one node in the children set, or to prune the model by removing one node
in the extreme set. At a given step of the algorithm, assume that the current state of the
chain is model M. Let pG be the probability that the algorithm chooses the growth step. The
proposed model M′ can either be M+ = M ∪ {α} for some α ∈ C(M), or M− = M \ {α}
for some α ∈ E(M).

An example transition kernel is defined by the mixture

g(M′|M) = pG · qGrow(M′|M) + (1 − pG) · qPrune(M′|M)
        = [I(M ≠ MF) / (1 + I(M ≠ MB))] · I(α ∈ C(M)) / |C(M)|
          + [I(M ≠ MB) / (1 + I(M ≠ MF))] · I(α ∈ E(M)) / |E(M)|,    (4–11)
where pG has explicitly been defined as 0.5 when both C(M) and E(M) are non-empty,
and as 0 (or 1) when C(M) = ∅ (or E(M) = ∅). After choosing pruning or growing, a
single node is proposed for addition to, or deletion from, M uniformly at random.
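The single-node step can be sketched as follows, representing a model as a set of nodes in the polynomial DAG; `children_set` and `extreme_set` stand in for C(M) and E(M), and all names are illustrative rather than taken from the dissertation's software:

```python
import random

def children_set(model, dag_parents):
    """C(M): nodes outside M whose parents are all in M, so that adding
    one preserves strong heredity (the model stays well-formulated)."""
    return [a for a, parents in dag_parents.items()
            if a not in model and parents <= model]

def extreme_set(model, dag_parents):
    """E(M): nodes of M that are not a parent of any other node in M,
    so that removing one preserves strong heredity."""
    return [a for a in model
            if not any(a in dag_parents[b] for b in model if b != a)]

def propose(model, dag_parents):
    """One grow-or-prune proposal, with pG = 1/2 when both moves exist."""
    grow = children_set(model, dag_parents)
    prune = extreme_set(model, dag_parents)
    if grow and (not prune or random.random() < 0.5):
        return model | {random.choice(grow)}
    return model - {random.choice(prune)}

# Toy DAG: the interaction x1:x2 has parents x1 and x2.
dag = {"x1": set(), "x2": set(), "x1:x2": {"x1", "x2"}}
m = {"x1", "x2"}
print(propose(m, dag))   # a neighboring well-formulated model
```

The proposal would then be accepted or rejected with the Metropolis-Hastings correction built from the kernel in Equation 4–11.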
For this simple algorithm, pruning has the reverse kernel of growing and vice versa.
From this construction, more elaborate algorithms can be specified. First, instead of
choosing the node uniformly at random from the corresponding set, nodes can be
selected using the relative posterior probability of adding or removing the node. Second,
more than one node can be selected at any step, for instance by also sampling at
random the number of nodes to add or remove given the size of the set. Third, the
strategy could combine pruning and growing in a single step by sampling one node
α ∈ C(M) ∪ E(M) and adding or removing it accordingly. Fourth, sets of nodes from
C(M) ∪ E(M) that yield well-formulated models can be added or removed. This simple
algorithm produces small moves around the model space by focusing node addition or
removal only on the set C(M) ∪ E(M).

4.4.2 Degree-Based Pruning and Growing
In exploring the model space, it is possible to take advantage of the hierarchical
structure defined between nodes of different order. One can update the vector of
inclusion indicators by blocks of common order j. Two flavors of this algorithm are
proposed: one that separates the pruning and growing steps, and one where both
are done simultaneously.

Assume that at a given step, say t, the algorithm is at M. If growing, the strategy
proceeds successively by order class, going from j = Jmin up to j = Jmax, with Jmin
and Jmax being the lowest and highest orders of nodes in MF \ MB, respectively. Define
Mt(Jmin−1) = M and set j = Jmin. The growth kernel comprises the following steps,
proceeding from j = Jmin to j = Jmax:
1) Propose a model M′ by selecting a set of nodes from Cj(Mt(j−1)) through the
kernel qGrow,j(·|Mt(j−1)).

2) Compute the Metropolis-Hastings correction for M′ versus Mt(j−1). If M′ is
accepted, then set Mt(j) = M′; otherwise set Mt(j) = Mt(j−1).

3) If j < Jmax, then set j = j + 1 and return to Step 1); otherwise proceed to Step 4).

4) Set Mt = Mt(Jmax).
The pruning step is defined in a similar fashion; however, it starts at order j = Jmax
and proceeds down to j = Jmin. Let Ej(M) be the set of nodes of order j that can be
removed from the model M to produce a WFM. Define Mt(Jmax+1) = M and set
j = Jmax. The pruning kernel comprises the following steps:

1) Propose a model M′ by selecting a set of nodes from Ej(Mt(j+1)) through the
kernel qPrune,j(·|Mt(j+1)).

2) Compute the Metropolis-Hastings correction for M′ versus Mt(j+1). If M′ is
accepted, then set Mt(j) = M′; otherwise set Mt(j) = Mt(j+1).

3) If j > Jmin, then set j = j − 1 and return to Step 1); otherwise proceed to Step 4).

4) Set Mt = Mt(Jmin).
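The growth sweep above can be sketched as a loop over order classes. In this sketch, `candidates_of_order` stands in for Cj(·) and the acceptance test is simplified to a symmetric-proposal Metropolis-Hastings step; all names are illustrative, not the dissertation's software:

```python
import math
import random

def degree_growth_sweep(model, j_min, j_max, candidates_of_order, log_post):
    """Sweep order classes j = j_min..j_max; at each order, propose adding
    one node of that order and accept or reject it with a (simplified,
    symmetric-proposal) Metropolis-Hastings test on the log posterior."""
    current = set(model)
    for j in range(j_min, j_max + 1):
        adds = candidates_of_order(current, j)     # stands in for C_j(M_t(j-1))
        if not adds:
            continue
        proposal = current | {random.choice(adds)}
        if math.log(1.0 - random.random()) < log_post(proposal) - log_post(current):
            current = proposal
    return current

# Toy run: two main effects (order 1) and one interaction (order 2) that
# requires both parents; the posterior strongly favors larger models.
nodes = {1: ["x1", "x2"], 2: ["x1:x2"]}
cand = lambda m, j: [a for a in nodes[j] if a not in m and (j == 1 or {"x1", "x2"} <= m)]
grown = degree_growth_sweep(set(), 1, 2, cand, lambda m: 50.0 * len(m))
print(grown)   # one main effect is added; the interaction must wait for both parents
```

The pruning sweep is the mirror image, iterating from j_max down to j_min over Ej(·).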
It is clear that the growing and pruning steps are reverse kernels of each other.
Pruning and growing can also be combined for each j: the forward kernel proceeds from
j = Jmin to j = Jmax and proposes adding or removing sets of nodes from Cj(M) ∪ Ej(M);
the reverse kernel simply reverses the direction of j, proceeding from j = Jmax to j = Jmin.

4.5 Simulation Study
To study the operating characteristics of the proposed priors, a simulation
experiment was designed with three goals. First, the priors are characterized by how
the posterior distributions are affected by the sample size and the signal-to-noise ratio
(SNR). Second, given the SNR level, the influence of the allocation of the signal across
the terms in the model is investigated. Third, performance is assessed when the true
model has special points in the scale (McCullagh & Nelder, 1989), i.e., when the true
model has coefficients equal to zero for some lower-order terms in the polynomial
hierarchy.

With these goals in mind, sets of predictors and responses are generated under
various experimental conditions. The model space is defined with MB being the
intercept-only model and MF being the complete order-four polynomial surface in five
main effects, which has 126 nodes. The entries of the matrix of main effects are generated
as independent standard normal. The response vectors are drawn from the n-variate
normal distribution as y ∼ Nn(ZMT(X)βMT, σ²In), where MT is the true model and In is the
n × n identity matrix.

The sample sizes considered are n ∈ {130, 260, 1040}, which ensures that
ZMF(X) is of full rank. The cardinality of this model space is |M| > 1.2 × 10^22, which
makes enumeration of all models unfeasible. Because the value of the 2k-th moment
of the standard normal distribution increases with k = 1, 2, ..., higher-order terms by
construction have a larger variance than their ancestors. As such, assuming equal
values for all coefficients, higher-order terms necessarily contain more "signal" than
the lower-order terms from which they inherit (e.g., x1² has more signal than x1). Once a
higher-order term is selected, its entire ancestry is also included. Therefore, to prevent
the simulation results from being overly optimistic (because of the larger signals from the
higher-order terms), sphering is used to calculate meaningful values of the coefficients,
ensuring that the signal is of the magnitude intended in any given direction. Given
the results of the simulations from Section 4.3.3, only the HOP with a = 1, b = ch is
considered, with the EPP included for comparison.
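The sphering device can be sketched with a QR decomposition: choosing the coefficients through β = R⁻¹η makes the realized signal equal to Qη, so equal entries of η put equal signal in every direction regardless of term order. A minimal numeric illustration of this identity (variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 130
x = rng.standard_normal((n, 2))
# Design with two main effects and a higher-variance squared term.
Z = np.column_stack([x[:, 0], x[:, 1], x[:, 0] ** 2])

# Sphering: Z = QR, so beta = R^{-1} eta gives Z beta = Q eta, and the
# columns of Q are orthonormal -- equal eta entries mean equal signal.
Q, R = np.linalg.qr(Z)
eta = np.array([1.0, 1.0, 1.0])            # uniform signal allocation
beta = np.linalg.solve(R, eta)

signal = Z @ beta
print(np.allclose(signal, Q @ eta))              # True
print(np.allclose(signal @ signal, eta @ eta))   # True: |signal|^2 = |eta|^2
```

Without sphering, the same β applied to the raw columns would load more signal onto the higher-order term, which is precisely the optimism the design avoids.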
The total number of combinations of SNR, sample size, regression coefficient
values, and nodes in MT amounts to 108 different scenarios. Each scenario was run
with 100 independently generated datasets, and the mean behavior across the samples
was observed. The results presented in this section correspond to the median probability
model (MPM) from each of the 108 simulation scenarios considered. Figure 4-7 shows
the comparison between the two priors of the mean number of true positive (TP) and
false positive (FP) terms. Although some of the scenarios consider true models that are
not well-formulated, the smallest well-formulated model that stems from MT is always
the one shown in Figure 4-6.
Figure 4-6. DAG of the largest true model MT used in simulations
The results are summarized in Figure 4-7. Each point on the horizontal axis
corresponds to the average for a given set of simulation conditions. Only labels for the
SNR and sample size are included for clarity, but the results are also shown for the
different values of the regression coefficients and the different true models considered.
Additional details about the procedure and other results are included in the appendices.

4.5.1 SNR and Sample Size Effect
As expected, small sample sizes conditioned upon a small SNR impair the ability
of the algorithm to detect true coefficients under both the EPP and the HOP(1, ch), with this
effect being greater when using the latter prior. However, considering the mean number
of TPs jointly with the number of FPs, it is clear that although the number of TPs is
especially low with the HOP(1, ch), most of the few predictors that are discovered in fact
belong to the true model. In comparison to the results with the EPP, in terms of FPs the
HOP(1, ch) does better, and even more so when both the sample size and the SNR are
Figure 4-7. Average true positives (TP) and average false positives (FP) in all simulated
scenarios for the median probability model with EPP and HOP(1, ch)
smallest. Finally, when either the SNR or the sample size is large, the performance in
terms of TPs is similar between the two priors, but the number of FPs is somewhat lower
with the HOP.

4.5.2 Coefficient Magnitude
Three ways to allocate the amount of signal across predictors are considered. For
the first choice, all coefficients contain the same amount of signal, regardless of their
order. In the second, each order-one coefficient contains twice as much signal as any
order-two coefficient, and four times as much as any order-three coefficient. Finally, in
the third, each order-one coefficient contains half as much signal as any order-two coefficient,
and a quarter of what any order-three coefficient has. These choices are denoted by
β(1) = c(1_o1, 1_o2, 1_o3), β(2) = c(1_o1, 0.5_o2, 0.25_o3), and β(3) = c(0.25_o1, 0.5_o2, 1_o3),
respectively. In Figure 4-7, the first four scenarios correspond to simulations with β(1), the
next four use β(2), the next four correspond to β(3), and then the values are cycled in
the same way. The results show that scenarios using either β(1) or β(3) behave similarly,
contrasting with the negative impact of having the highest signal in the order-one terms
through β(2). In Figure 4-7, the effect of using β(2) is evident, as it corresponds to the
lowest values for the TPs, regardless of the sample size, the SNR, or the prior used. This
is an intuitive result, since giving more signal to higher-order terms makes it easier to
detect higher-order terms, and consequently, by strong heredity, the algorithm will also
select the corresponding lower-order terms included in the true model.

4.5.3 Special Points on the Scale
Four true models were considered: (1) the model from Figure 4-6 (MT1), (2)
the model without the order-one terms (MT2), (3) the model without order-two terms
(MT3), and (4) the model without x1² and x2x5 (MT4). The last three are clearly not
well-formulated. In Figure 4-7, the leftmost point on the horizontal axis corresponds to
scenarios with MT1; the next point is for scenarios with MT2, followed by those with MT3,
then those with MT4, then MT1 again, and so on. In comparison to the EPP, the HOP(1, ch)
tightly controls the inclusion of FPs by choosing smaller models, at the expense of also
reducing the TP count, especially when there is more uncertainty about the true model
(i.e., SNR = 0.25). For both prior structures, the results indicate that at low SNR levels
the presence of special points has no apparent impact, as the selection behavior is
similar among the four models in terms of both TPs and FPs. An interesting observation
is that the effect of having special points on the scale is vastly magnified whenever the
coefficients that assign more weight to order-one terms (β(2)) are used.

4.6 Case Study: Ozone Data Analysis
This section uses the ozone data from Breiman & Friedman (1985) and follows the analysis performed by Liang et al. (2008), who investigated hyper-g priors. After removing observations with missing values, 330 observations remain, including daily measurements of maximum ozone concentration near Los Angeles and eight meteorological variables (Table 4-5). From the 330 observations, 165 were sampled at random without replacement and used to run the variable selection procedure; the remaining 165 were used for validation. The eight meteorological variables, their interactions, and their squared terms are used as predictors, resulting in a full model with 44 predictors. The model space assumes that the base model MB is the intercept-only model and that MF is the quadratic surface in the eight meteorological variables. The
model space contains approximately 7.1 billion models, and computation of all model posterior probabilities is not feasible.
Table 4-5. Variables used in the analyses of the ozone contamination dataset

Name    Description
ozone   Daily maximum 1-hour-average ozone (ppm) at Upland, CA
vh      500-millibar pressure height (m) at Vandenberg AFB
wind    Wind speed (mph) at LAX
hum     Humidity (%) at LAX
temp    Temperature (F) measured at Sandburg, CA
ibh     Inversion base height (ft) at LAX
dpg     Pressure gradient (mm Hg) from LAX to Daggett, CA
vis     Visibility (miles) measured at LAX
ibt     Inversion base temperature (F) at LAX
The HOP, HUP, and HIP with a = 1 and b = ch, as well as the EPP, are considered for comparison purposes. To obtain the Bayes factors in Equation 3–3, four different mixtures of g-priors are utilized: intrinsic priors (IP) (which yield the expression in Equation 3–2), hyper-g (HG) priors (Liang et al., 2008) with hyper-parameters α = 2, β = 1 and α = β = 1, and Zellner-Siow (ZS) priors (Zellner & Siow, 1980). The results were extracted for the median probability model (MPM). Additionally, the model is estimated using the R package hierNet (Bien et al., 2013) to compare the model selection results to those obtained using the hierarchical lasso (Bien et al., 2013), restricted to well-formulated models by imposing the strong heredity constraint. The procedures were assessed on the basis of their predictive accuracy on the validation dataset.

Among all models, the one that yields the smallest RMSE is the median probability model obtained using the HOP and the EPP with the ZS prior, and also using the HOP with both HG priors (Table 4-6). The HOP model with the intrinsic prior has all the terms contained in the lowest-RMSE model with the exception of dpg², which has a relatively high marginal inclusion probability of 0.46. This disparity between the IP and the other mixtures of g-priors is explained by the fact that the IP induces less posterior shrinkage than the ZS and HG priors. The MPMs obtained through the HUP and HIP are nested in the best model, suggesting that these model space priors penalize complexity too much and result in false negatives. Consideration of these MPMs suggests that the HOP is best at producing true positives while controlling for false positives.

Finally, the model obtained from the hierarchical lasso (hierNet) is the largest model and produces the second-largest RMSE. All of the terms contained in any of the other models, except for vh, are nested within the hierarchical lasso model, and most of the terms that are exclusive to this model receive extremely low marginal inclusion probabilities under any of the model priors and parameter priors considered under Bayesian model selection.
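The predictive assessment used here is plain out-of-sample RMSE: fit the selected terms on the training half and score the held-out half. A generic sketch of that computation (names are illustrative; the actual split and terms are those described above):

```python
import numpy as np

def holdout_rmse(Z_train, y_train, Z_test, y_test):
    """Least-squares fit on the training half, RMSE on the validation half."""
    coef, *_ = np.linalg.lstsq(Z_train, y_train, rcond=None)
    resid = y_test - Z_test @ coef
    return float(np.sqrt(np.mean(resid ** 2)))

# Toy check: a noiseless linear truth gives holdout RMSE of essentially zero.
rng = np.random.default_rng(0)
Z = np.column_stack([np.ones(20), rng.standard_normal(20)])
y = Z @ np.array([1.0, 2.0])
print(round(holdout_rmse(Z[:10], y[:10], Z[10:], y[10:]), 6))   # 0.0
```

With a real split, larger selected models can fit the training half better yet score worse on the validation half, which is how the hierarchical lasso model ends up with the second-largest RMSE despite being the largest model.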
Table 4-6. Median probability models (MPM) from different combinations of parameter
and model priors vs. the model selected using the hierarchical lasso

BF     Prior    Model                                              R²      RMSE
IP     EPP      hum, dpg, ibt, hum², hum*dpg, hum*ibt, dpg², ibt²  0.8054  4.2739
IP     HIP      hum, ibt, hum², hum*ibt, ibt²                      0.7740  4.3396
IP     HOP      hum, dpg, ibt, hum², hum*ibt, ibt²                 0.7848  4.3175
IP     HUP      hum, dpg, ibt, hum*ibt, ibt²                       0.7767  4.3508
ZS     EPP      hum, dpg, ibt, hum², hum*ibt, dpg², ibt²           0.7896  4.2518
ZS     HIP      hum, ibt, hum*ibt, ibt²                            0.7525  4.3505
ZS     HOP      hum, dpg, ibt, hum², hum*ibt, dpg², ibt²           0.7896  4.2518
ZS     HUP      hum, dpg, ibt, hum*ibt, ibt²                       0.7767  4.3508
HG1,1  EPP      vh, hum, dpg, ibt, hum², hum*ibt, dpg²             0.7701  4.3049
HG1,1  HIP      hum, ibt, hum*ibt, ibt²                            0.7525  4.3505
HG1,1  HOP      hum, dpg, ibt, hum², hum*ibt, dpg², ibt²           0.7896  4.2518
HG1,1  HUP      hum, dpg, ibt, hum*ibt, ibt²                       0.7767  4.3508
HG2,1  EPP      hum, dpg, ibt, hum², hum*ibt, dpg²                 0.7701  4.3037
HG2,1  HIP      hum, dpg, ibt, hum*ibt, ibt²                       0.7767  4.3508
HG2,1  HOP      hum, dpg, ibt, hum², hum*ibt, dpg², ibt²           0.7896  4.2518
HG2,1  HUP      hum, dpg, ibt, hum*ibt                             0.7526  4.4036
       HierNet  hum, temp, ibh, dpg, ibt, vis, hum², hum*ibt,      0.7651  4.3680
                temp², temp*ibt, dpg²
4.7 Discussion

Scott & Berger (2010) noted that the Ockham's-razor effect found automatically in Bayesian variable selection through the Bayes factor does not correct for multiple testing. The Bayes factor penalizes the complexity of the alternative model according to the number of parameters in excess of those of the null model. Therefore, the Bayes factor only controls complexity in a pairwise fashion. If the model selection procedure uses equal prior probabilities for all M ∈ M, then these comparisons ignore the effect of the multiplicity of the testing problem. This is where the role of the prior on the model space becomes important: the multiplicity penalty is "hidden away" in the model prior probabilities π(M|M).

In addition to the multiplicity of the testing problem, disregarding the hierarchical polynomial structure of the predictors in model selection procedures can lead to different results according to how the predictors are set up (e.g., in what units the predictors are expressed).

In this chapter we investigated a solution to these two issues. We define prior structures for well-formulated models and develop random walk algorithms to traverse this type of model space. The key to understanding prior distributions on the space of WFMs is the hierarchical nature of the model space itself. The prior distributions described take advantage of that hierarchy in two ways. First, conditional independence and immediate inheritance are used to develop the HOP, HIP, and HUP structures discussed in Section 4.3. Second, the conditional nature of the priors allows for the direct incorporation of complexity penalizations. Of the priors proposed, the HOP using the hyperparameter choice (1, ch) provides the best control of false positives while maintaining a reasonable true positive rate. Thus, this prior is recommended as the default prior on the space of WFMs.
In the near future, the software developed to carry out a Metropolis-Hastings random walk on the space of WFMs will be integrated into the R package varSelectIP. These new functions implement various local priors for the regression coefficients, including the intrinsic prior, the Zellner-Siow prior, and hyper-g priors. In addition, the software supports the computation of credible sets for each regression coefficient, conditioned on the selected model as well as under model averaging.
CHAPTER 5
CONCLUSIONS
Ecologists are now embracing the use of Bayesian methods to investigate the
interactions that dictate the distribution and abundance of organisms. These tools are
both powerful and flexible: they allow integrating, under a single methodology, empirical
observations and theoretical process models, and they can seamlessly account for several
sources of uncertainty and dependence. The estimation and testing methods proposed
throughout this document will contribute to the understanding of Bayesian methods used
in ecology and, hopefully, will shed light on the differences between Bayesian estimation
and testing tools.

All of our contributions exploit the potential of the latent variable formulation. This
approach greatly simplifies the analysis of complex models; it redirects the bulk of
the inferential burden away from the original response variables and places it on the
easy-to-work-with latent scale, for which several time-tested approaches are available.
Our methods are distinctly classified into estimation and testing tools.
For estimation, we proposed a Bayesian specification of the single-season
occupancy model for which a Gibbs sampler is available using both logit and probit
link functions. This setup allows detection and occupancy probabilities to depend
on linear combinations of predictors. We then developed a dynamic version of this
approach, incorporating the notion that occupancy at a previously occupied site depends
both on the survival of current settlers and on habitat suitability. Additionally, because these
dynamics also vary in space, we suggest a strategy to add spatial dependence among
neighboring sites.
Ecological inquiry usually involves competing explanations, and uncertainty
surrounds the decision of choosing any one of them. Hence, a model, or a set of
probable models, should be selected from all the viable alternatives. To address this
testing problem, we proposed an objective and fully automatic Bayesian methodology
for the single-season site-occupancy model. Our approach relies on the intrinsic prior,
which avoids introducing (commonly unavailable) subjective information
into the model. In simulation experiments, we observed that the methods accurately
single out the predictors present in the true model using the marginal posterior inclusion
probabilities of the predictors. For predictors in the true model, these probabilities were
comparatively larger than those for predictors not present in the true model. Also, the
simulations indicated that the method provides better discrimination for predictors in the
detection component of the model.
In our simulations and in the analysis of the Blue Hawker data, we observed that the
effect of using the multiplicity correction prior was substantial. This occurs because
the Bayes factor only penalizes the complexity of the alternative model according to its
number of parameters in excess of those of the null model. As the number of predictors
grows, the number of models in the model space also grows, increasing the chances
of making false positive decisions on the inclusion of predictors. This is where the role
of the prior on the model space becomes important: the multiplicity penalty is "hidden
away" in the model prior probabilities π(M|M). In addition to the multiplicity of the
testing problem, disregarding the hierarchical polynomial structure of the predictors in
model selection procedures can lead to different results according to how the predictors
are coded (e.g., in what units the predictors are expressed).
To confront this situation, we propose three prior structures for well-formulated
models that take advantage of the hierarchical structure of the predictors. Of the priors
proposed, we recommend the HOP using the hyperparameter choice (1, ch), which
provides the best control of false positives while maintaining a reasonable true positive
rate.
Overall, considering the flexibility of the latent approach, several other extensions of
these methods follow. Currently, we envision three future developments: (1) occupancy
models that incorporate various sources of information, (2) multi-species models that make
use of spatial and interspecific dependence, and (3) methods to conduct
model selection for the dynamic and spatially explicit version of the model.
APPENDIX A
FULL CONDITIONAL DENSITIES, DYMOSS
In this appendix we introduce the full conditional probability density functions for all
the parameters involved in the DYMOSS model, using probit as well as logit links.

Sampler Z

The full conditionals corresponding to the presence indicators have the same form
regardless of the link used. These are derived separately for the cases t = 1, 1 < t < T,
and t = T, since their corresponding probabilities take on slightly different forms.

Let ϕ(ν|μ, σ²) represent the density of a normal random variable ν with mean μ and
variance σ², and recall that ψ_i1 = F(x′_(o)i α) and p_ijt = F(q′_ijt λ_t), where F(·) is the
inverse link function. The full conditional for z_it is given by:
1. For t = 1:

π(z_i1 | v_i1, α, λ_1, β^c_1, δ^s_1) = (ψ*_i1)^{z_i1} (1 − ψ*_i1)^{1−z_i1} = Bernoulli(ψ*_i1),    (A–1)

where

ψ*_i1 = [ψ_i1 ϕ(v_i1 | x′_i1 β^c_1 + δ^s_1, 1) ∏_{j=1}^{J_i1} (1 − p_ij1)]
        / [ψ_i1 ϕ(v_i1 | x′_i1 β^c_1 + δ^s_1, 1) ∏_{j=1}^{J_i1} (1 − p_ij1)
           + (1 − ψ_i1) ϕ(v_i1 | x′_i1 β^c_1, 1) ∏_{j=1}^{J_i1} I(y_ij1 = 0)].
2. For 1 < t < T:

π(z_it | z_i(t−1), z_i(t+1), λ_t, β^c_{t−1}, δ^s_{t−1}) = (ψ*_it)^{z_it} (1 − ψ*_it)^{1−z_it} = Bernoulli(ψ*_it),    (A–2)

where

ψ*_it = κ_it ∏_{j=1}^{J_it} (1 − p_ijt)
        / [κ_it ∏_{j=1}^{J_it} (1 − p_ijt) + ∇_it ∏_{j=1}^{J_it} I(y_ijt = 0)],

with

(a) κ_it = F(x′_i(t−1) β^c_{t−1} + z_i(t−1) δ^s_{t−1}) ϕ(v_it | x′_it β^c_t + δ^s_t, 1), and

(b) ∇_it = (1 − F(x′_i(t−1) β^c_{t−1} + z_i(t−1) δ^s_{t−1})) ϕ(v_it | x′_it β^c_t, 1).
3. For t = T:

π(z_iT | z_i(T−1), λ_T, β^c_{T−1}, δ^s_{T−1}) = (ψ⋆_iT)^{z_iT} (1 − ψ⋆_iT)^{1−z_iT} = Bernoulli(ψ⋆_iT),    (A–3)

where

ψ⋆_iT = κ⋆_iT ∏_{j=1}^{J_iT} (1 − p_ijT)
        / [κ⋆_iT ∏_{j=1}^{J_iT} (1 − p_ijT) + ∇⋆_iT ∏_{j=1}^{J_iT} I(y_ijT = 0)],

with

(a) κ⋆_iT = F(x′_i(T−1) β^c_{T−1} + z_i(T−1) δ^s_{T−1}), and

(b) ∇⋆_iT = 1 − F(x′_i(T−1) β^c_{T−1} + z_i(T−1) δ^s_{T−1}).

Sampler u_i
π(u_i | z_i1, α) = tr N(x′_(o)i α, 1, trunc(z_i1)),    (A–4)

where trunc(z_i1) = (−∞, 0] if z_i1 = 0 and (0, ∞) if z_i1 = 1, and tr N(μ, σ², A)
denotes the pdf of a truncated normal random variable with mean μ, variance σ², and
truncation region A.
Sampler α

π(α | u) ∝ [α] ∏_{i=1}^{N} ϕ(u_i | x′_(o)i α, 1).    (A–5)

If [α] ∝ 1, then

α | u ∼ N(m(α), Σ_α),

with m(α) = Σ_α X′_(o) u and Σ_α = (X′_(o) X_(o))^{−1}.
Sampler v_it

For t > 1:

π(v_i(t−1) | z_i(t−1), z_it, β^c_{t−1}, δ^s_{t−1}) = tr N(μ^(v)_i(t−1), 1, trunc(z_it)),    (A–6)

where μ^(v)_i(t−1) = x′_i(t−1) β^c_{t−1} + z_i(t−1) δ^s_{t−1}, and trunc(z_it) defines the
corresponding truncation region given by z_it.
Sampler (β^c_{t−1}, δ^s_{t−1})

For t > 1:

π(β^c_{t−1}, δ^s_{t−1} | v_{t−1}, z_{t−1}) ∝ [β^c_{t−1}, δ^s_{t−1}] ∏_{i=1}^{N} ϕ(v_it | x′_i(t−1) β^c_{t−1} + z_i(t−1) δ^s_{t−1}, 1).    (A–7)

If [β^c_{t−1}, δ^s_{t−1}] ∝ 1, then

β^c_{t−1}, δ^s_{t−1} | v_{t−1}, z_{t−1} ∼ N(m(β^c_{t−1}, δ^s_{t−1}), Σ_{t−1}),

with m(β^c_{t−1}, δ^s_{t−1}) = Σ_{t−1} X̃′_{t−1} v_{t−1} and Σ_{t−1} = (X̃′_{t−1} X̃_{t−1})^{−1},
where X̃_{t−1} = (X_{t−1}, z_{t−1}).
Sampler w_ijt

For z_it = 1:

π(w_ijt | z_it = 1, y_ijt, λ_t) = tr N(q′_ijt λ_t, 1, trunc(y_ijt)).    (A–8)
Sampler λ_t

For t = 1, 2, ..., T:

π(λ_t | z_t, w_t) ∝ [λ_t] ∏_{i: z_it = 1} ∏_{j=1}^{J_it} ϕ(w_ijt | q′_ijt λ_t, 1).    (A–9)

If [λ_t] ∝ 1, then

λ_t | w_t, z_t ∼ N(m(λ_t), Σ_{λ_t}),

with m(λ_t) = Σ_{λ_t} Q′_t w_t and Σ_{λ_t} = (Q′_t Q_t)^{−1}, where Q_t and w_t, respectively,
are the design matrix and the vector of latent variables for surveys of sites such that z_it = 1.
APPENDIX B
RANDOM WALK ALGORITHMS
Global Jump. From the current state M, the global jump is performed by drawing
a model M′ at random from the model space. This is achieved by beginning at the base
model and increasing the order from J^min_M to J^max_M, the minimum and maximum orders
of nodes in MF \ MB; at each order, a set of nodes is selected at random from
the prior, conditioned on the nodes already in the model. The MH correction is

α = min{1, m(y|M′, M) / m(y|M, M)}.
Local Jump. From the current state M, the local jump is performed by drawing a
model from the set of models L(M) = {M_α : α ∈ E(M) ∪ C(M)}, where M_α is M \ {α}
for α ∈ E(M) and M ∪ {α} for α ∈ C(M). The proposal probabilities for the model are
computed as a mixture of p(M′|y, M, M′ ∈ L(M)) and the discrete uniform distribution.
The proposal kernel is

q(M′|y, M, M′ ∈ L(M)) = (1/2) ( p(M′|y, M, M′ ∈ L(M)) + 1/|L(M)| ).

This choice promotes moving to better models while maintaining a non-negligible
probability of moving to any of the possible models. The MH correction is

α = min{1, [m(y|M′, M) / m(y|M, M)] · [q(M|y, M′, M ∈ L(M′)) / q(M′|y, M, M′ ∈ L(M))]}.
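The mixture proposal can be sketched directly; `log_marginal` stands in for m(y|·), and the posterior weights p(M′|y, M, M′ ∈ L(M)) are normalized over the neighborhood L(M) only (all names are illustrative):

```python
import numpy as np

def local_proposal_probs(neighbors, log_marginal):
    """Mixture of the neighborhood-restricted posterior and a uniform:
    q(M') = 0.5 * p(M' | y, neighborhood) + 0.5 / |L(M)|."""
    lm = np.array([log_marginal(m) for m in neighbors])
    p = np.exp(lm - lm.max())
    p /= p.sum()                         # posterior restricted to L(M)
    return 0.5 * p + 0.5 / len(neighbors)

# Toy neighborhood of three candidate models with known log marginals.
neigh = ["M+a", "M+b", "M-c"]
lm = {"M+a": -1.0, "M+b": -1.0, "M-c": -1.0}
q = local_proposal_probs(neigh, lm.get)
print(np.allclose(q, 1.0 / 3.0))   # True: equal marginals give a uniform kernel
```

The uniform component keeps every neighbor reachable even when the restricted posterior concentrates on one model, which is what makes the chain's moves non-degenerate.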
Intermediate Jump. The intermediate jump is performed by increasing or
decreasing the order of the nodes under consideration, performing local proposals based
on order. For a model M′, define Lj(M′) = {M′} ∪ {M′_α : α ∈ (E(M′) ∪ C(M′)), α of order j}.
From a state M, the kernel chooses at random whether to increase or decrease the
order. If M = MF, then decreasing the order is chosen with probability 1, and if M = MB,
then increasing the order is chosen with probability 1; in all other cases, the probability of
increasing and of decreasing the order is 1/2. The proposal kernels are given below.
Increasing order proposal kernel:

1. Set j = J^min_M − 1 and M′_j = M.

2. Draw M′_{j+1} from q_{inc,j+1}(M′|y, M, M′ ∈ L_{j+1}(M′_j)), where

   q_{inc,j+1}(M′|y, M, M′ ∈ L_{j+1}(M′_j)) = (1/2) ( p(M′|y, M, M′ ∈ L_{j+1}(M′_j)) + 1/|L_{j+1}(M′_j)| ).

3. Set j = j + 1.

4. If j < J^max_M, then return to Step 2; otherwise proceed to Step 5.

5. Set M′ = M′_{J^max_M} and compute the proposal probability

   q_inc(M′|y, M) = ∏_{j=J^min_M − 1}^{J^max_M − 1} q_{inc,j+1}(M′_{j+1}|y, M, M′ ∈ L_{j+1}(M′_j)).    (B–1)
Decreasing order proposal kernel:

1. Set j = J^max_M + 1 and M′_j = M.

2. Draw M′_{j−1} from q_{dec,j−1}(M′|y, M, M′ ∈ L_{j−1}(M′_j)), where

   q_{dec,j−1}(M′|y, M, M′ ∈ L_{j−1}(M′_j)) = (1/2) ( p(M′|y, M, M′ ∈ L_{j−1}(M′_j)) + 1/|L_{j−1}(M′_j)| ).

3. Set j = j − 1.

4. If j > J^min_M, then return to Step 2; otherwise proceed to Step 5.

5. Set M′ = M′_{J^min_M} and compute the proposal probability

   q_dec(M′|y, M) = ∏_{j=J^max_M + 1}^{J^min_M + 1} q_{dec,j−1}(M′_{j−1}|y, M, M′ ∈ L_{j−1}(M′_j)).    (B–2)
If increasing order is chosen, then the MH correction is given by

α = min{ 1, [(1 + I(M′ = MF)) / (1 + I(M = MB))] · [q_dec(M|y, M′) / q_inc(M′|y, M)] · [p(M′|y, M) / p(M|y, M)] },    (B–3)

and similarly if decreasing order is chosen.
Other Local and Intermediate Kernels. The local and intermediate kernels
described here perform a kind of stochastic forwards-backwards selection. Each kernel
q can be relaxed to allow more than one node to be turned on or off at each step, which
could provide larger jumps for each of these kernels. The tradeoff is that the number of
proposed models for such jumps could be very large, precluding the use of posterior
information in the construction of the proposal kernel.
APPENDIX C
WFM SIMULATION DETAILS
Briefly, the idea is to let Z_MT(X) β_MT = (QR) β_MT = Q η_MT (i.e., β_MT = R^{−1} η_MT),
using the QR decomposition. As such, setting all values in η_MT proportional to one
corresponds to distributing the signal in the model uniformly across all predictors,
regardless of their order.
The (unconditional) variance of a single observation $y_i$ is $\operatorname{var}(y_i) = \operatorname{var}(E[y_i \mid z_i]) + E[\operatorname{var}(y_i \mid z_i)]$, where $z_i$ is the $i$-th row of the design matrix $Z_{M_T}$. Hence, we take the signal-to-noise ratio for each observation to be
$$
\operatorname{SNR}(\eta) = \frac{\eta_{M_T}^T R^{-T} \Sigma_z R^{-1} \eta_{M_T}}{\sigma^2},
$$
where $\Sigma_z = \operatorname{var}(z_i)$. We determine how the signal is distributed across predictors up to a proportionality constant, to be able to simultaneously control the signal-to-noise ratio.
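As a concrete sketch of this construction (dimensions and the simulated design matrix are hypothetical), the following distributes the signal uniformly via the QR decomposition and then rescales $\eta$ so the quadratic form hits a target SNR:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 5                        # hypothetical sample size and number of predictors
Z = rng.normal(size=(n, p))          # stand-in for the design matrix Z_{M_T}

Q, R = np.linalg.qr(Z)               # Z = QR, so Z @ beta = Q @ eta with beta = R^{-1} eta
eta = np.ones(p)                     # eta proportional to one: signal spread uniformly
sigma2 = 1.0
Sigma_z = np.cov(Z, rowvar=False)    # empirical stand-in for Sigma_z = var(z_i)
Rinv = np.linalg.inv(R)

def snr(eta_vec):
    # SNR(eta) = eta' R^{-T} Sigma_z R^{-1} eta / sigma^2
    return eta_vec @ Rinv.T @ Sigma_z @ Rinv @ eta_vec / sigma2

k = 4.0                              # target SNR, one of the values used in Table C-1
eta_k = eta * np.sqrt(k / snr(eta))  # the quadratic form scales quadratically in eta
beta = Rinv @ eta_k                  # regression coefficients implied by the scaled eta
```

Because the quadratic form is homogeneous of degree two in $\eta$, the rescaling achieves the target SNR exactly, while the direction of $\eta$ (and hence how the signal is shared among predictors) is unchanged.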
Additionally, to investigate the ability of the method to correctly capture the hierarchical structure, we specify four different 0-1 vectors that determine the predictors in $M_T$, the model that generates the data in the different scenarios.
Table C-1. Experimental conditions, WFM simulations

Parameter                 Values considered
SNR(η_{M_T}) = k          0.25, 1, 4
η_{M_T} ∝                 (1_{o1}, 0.5_{o2}, 0.25_{o3}); (1_{o1}, 1_{o2}, 1_{o3}); (0.25_{o1}, 0.5_{o2}, 1_{o3})
γ_{M_T}                   four 0-1 inclusion vectors: the WFM of Figure 4-6, and versions of it with zero coefficients for all order-one terms, for all order-two terms, and for x_1^2 and x_2x_5
n                         130, 260, 1040
The results presented below differ somewhat from those found in the main body of the article in Section 5. They are obtained by averaging the number of FPs, TPs, and model sizes, respectively, over the 100 independent runs and across the corresponding scenarios, for the 20 highest posterior probability models.
SNR and Sample Size Effect
In terms of the SNR and the sample size (Figure C-1), we observe that, as expected, small sample sizes combined with a small SNR impair the ability of the algorithm to detect true coefficients under both the EPP and the HOP(1, ch), with this effect more pronounced under the latter prior. However, considering the mean number of true positives (TPs) jointly with the mean model size, it is clear that although the sensitivity is low, most of the few predictors that are discovered belong to the true model. The results observed with an SNR of 0.25 and a relatively small sample size are far from impressive; however, real problems where the SNR is as low as 0.25 will yield many spurious associations under the EPP. The fact that the HOP(1, ch) offers strong protection against false positives is commendable in itself. An SNR of 1 also represents a feeble relationship between the predictors and the response; nonetheless, the method captures approximately half of the true coefficients while including very few false positives. Following intuition, as either the sample size or the SNR increases, the algorithm's performance is greatly enhanced. Either a large sample size or a large SNR yields models that contain mostly true predictors. Additionally, the HOP(1, ch) provides strong control over the number of false positives; therefore, for high SNR or larger sample sizes, the number of predictors in the top 20 models is close to the size of the true model. In general, the EPP allows the detection of more TPs, while the HOP(1, ch) provides stronger control over the number of FPs included when small sample sizes are combined with small SNRs. As either the sample size or the SNR grows, the differences between the two priors become negligible.
Figure C-1. SNR vs. n. Average model size, average true positives, and average false positives for all simulated scenarios, by model ranking according to model posterior probabilities.
Coefficient Magnitude

This part of the experiment explores the effect of how the signal is distributed across predictors. As mentioned before, sphering is used to assign the coefficient values in a manner that controls the amount of signal that goes into each coefficient. Three possible ways to allocate the signal are considered: first, each order-one coefficient contains twice as much signal as any order-two coefficient and four times as much as any order-three coefficient; second, all coefficients contain the same amount of signal, regardless of their order; and third, each order-one coefficient contains half as much signal as any order-two coefficient and a quarter as much as any order-three coefficient. In Figure C-2 these values are denoted by $\beta = c(1_{o1}, 0.5_{o2}, 0.25_{o3})$, $\beta = c(1_{o1}, 1_{o2}, 1_{o3})$, and $\beta = c(0.25_{o1}, 0.5_{o2}, 1_{o3})$, respectively.
Observe that the number of FPs is insensitive to how the SNR is distributed across predictors when using the HOP(1, ch); conversely, when using the EPP, the number of FPs decreases as the SNR grows, always remaining slightly higher than that obtained with the HOP. With either prior structure, the algorithm performs better whenever all coefficients are equally weighted or when those for the order-three terms have higher weights. In these two cases (i.e., with $\beta = c(1_{o1}, 1_{o2}, 1_{o3})$ or $\beta = c(0.25_{o1}, 0.5_{o2}, 1_{o3})$), the effect of the SNR appears to be similar. In contrast, when more weight is given to order-one terms, the algorithm yields slightly worse models at any SNR level. This is an intuitive result, since giving more signal to higher-order terms makes them easier to detect, and consequently, by strong heredity, the algorithm will also select the corresponding lower-order terms included in the true model.
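The strong-heredity constraint invoked here is easy to check mechanically: a model is well formulated only when every lower-order parent of every included term is itself included. A small illustrative sketch (the tuple encoding of terms is an assumption for this example, not the dissertation's notation):

```python
def parents(term):
    """Lower-order parents of a term encoded as a sorted tuple of variable
    indices with multiplicity, e.g. (1, 1) for x1^2 or (2, 5) for x2*x5."""
    seen = set()
    for i in range(len(term)):
        p = term[:i] + term[i + 1:]     # drop one factor to go one order down
        if p:
            seen.add(tuple(sorted(p)))
    return seen

def is_well_formulated(model):
    """Strong heredity: every parent of every included term is included."""
    model = {tuple(sorted(t)) for t in model}
    return all(par in model for t in model for par in parents(t))
```

For instance, including the interaction x1*x2*x5 requires the three order-two terms x1*x2, x1*x5, and x2*x5, which in turn require x1, x2, and x5.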
Special Points on the Scale

In Nelder (1998), the author argues that the conditions under which the weak-heredity principle can be used for model selection are so restrictive that the principle is commonly not valid in practice in this context. In addition, the author states that considering only well-formulated models fails to account for the possible presence of special points on the scales of the predictors, that is, situations where omitting lower-order terms is justified by the nature of the data. However, it is our contention that every model has an underlying well-formulated structure; whether or not some predictor has special points on its scale will be determined through the estimation of the coefficients, once a valid well-formulated structure has been chosen.
To understand how the algorithm behaves whenever the true data-generating mechanism has zero-valued coefficients for some lower-order terms in the hierarchy, four different true models are considered. Three of them are not well formulated, while the remaining one is the WFM shown in Figure 4-6. The three models that have special points correspond to the same model $M_T$ from Figure 4-6, but have, respectively, zero-valued coefficients for all the order-one terms, for all the order-two terms, and for $x_1^2$ and $x_2x_5$.

Figure C-2. SNR vs. coefficient values. Average model size, average true positives, and average false positives for all simulated scenarios, by model ranking according to model posterior probabilities.
As seen before, in comparison to the EPP, the HOP(1, ch) tightly controls the inclusion of FPs by choosing smaller models, at the expense of also reducing the TP count, especially when there is more uncertainty about the true model (i.e., SNR = 0.25). For both prior structures, the results in Figure C-3 indicate that at low SNR levels the presence of special points has no apparent impact, as the selection behavior is similar across the four models in terms of both TPs and FPs. As the SNR increases, the TPs and the model size are affected for true models with zero-valued lower-order terms. These differences, however, are not very large.

Figure C-3. SNR vs. different true models $M_T$. Average model size, average true positives, and average false positives for all simulated scenarios, by model ranking according to model posterior probabilities.

Relatively smaller models are
selected whenever some terms in the hierarchy are missing, but at high SNR, which is where the differences are most pronounced, the predictors included are mostly true coefficients. The impact is almost imperceptible for the true model that lacks order-one terms and for the model with zero coefficients for $x_1^2$ and $x_2x_5$, and is more visible for models without order-two terms. This last result is expected due to strong heredity: whenever the order-one coefficients are missing, the inclusion of order-two and order-three terms forces their selection, which is also the case when only a few order-two terms have zero-valued coefficients. Conversely, when all order-two predictors are removed, some order-three predictors are not selected, as their signal is attributed to the order-two predictors missing from the true model. This is especially the case for the order-three interaction term $x_1x_2x_5$, which depends on the inclusion of three order-two terms ($x_1x_2$, $x_1x_5$, $x_2x_5$) in order for it to be included as well. This makes the inclusion of this term somewhat more challenging: the three order-two interactions capture most of the variation of the polynomial terms that is present when the order-three term is also included. However, special points on the scale commonly occur for a single or at most a few covariates. A true data-generating mechanism that removes all terms of a given order is clearly not justified in the context of polynomial models; this was done here only for comparison purposes.
APPENDIX D
SUPPLEMENTARY INFORMATION FOR THE OZONE DATA ANALYSIS
The covariates considered for the ozone data analysis match those used in Liang et al. (2008); they are displayed in Table D-1 below.

Table D-1. Variables used in the analyses of the ozone contamination dataset
Name    Description
ozone   Daily maximum 1-hour-average ozone (ppm) at Upland, CA
vh      500-millibar pressure height (m) at Vandenberg AFB
wind    Wind speed (mph) at LAX
hum     Humidity (%) at LAX
temp    Temperature (°F) measured at Sandburg, CA
ibh     Inversion base height (ft) at LAX
dpg     Pressure gradient (mm Hg) from LAX to Daggett, CA
vis     Visibility (miles) measured at LAX
ibt     Inversion base temperature (°F) at LAX
The marginal posterior inclusion probability corresponds to the probability of including a given term of the full model $M_F$, after summing over all models in the model space. For each node $\alpha \in M_F$, this probability is given by $p_\alpha = \sum_{M \in \mathcal{M}} I(\alpha \in M)\, p(M \mid y, \mathcal{M})$. In problems with a model space as large as the one considered for the ozone concentration problem, enumeration of the entire space is not feasible. Thus, these probabilities are estimated by summing over every model drawn by the random walk over the model space $\mathcal{M}$.
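A sketch of that estimator: renormalize the (unnormalized) posterior mass over the distinct models visited by the chain and accumulate it term by term. The model representation and weights here are hypothetical stand-ins:

```python
from collections import defaultdict

def marginal_inclusion(visited):
    """Estimate marginal posterior inclusion probabilities p_alpha by
    renormalizing posterior mass over the distinct visited models.
    `visited` maps a model (frozenset of term labels) to its unnormalized
    posterior weight p(M | y), known up to a constant."""
    total = sum(visited.values())
    probs = defaultdict(float)
    for model, weight in visited.items():
        for term in model:          # I(alpha in M) contributes this model's share
            probs[term] += weight / total
    return dict(probs)
```

Terms never included in any visited model receive an estimated probability of zero, which is the implicit behavior of summing only over drawn models.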
Given that there are in total 44 potential predictors, for convenience, Tables D-2 to D-5 below display the marginal posterior inclusion probabilities only for the terms included under at least one of the model priors considered (EPP, HIP, HUP, and HOP), for each of the parameter priors utilized (intrinsic priors, Zellner-Siow priors, Hyper-g(11), and Hyper-g(21)).
Table D-2. Marginal inclusion probabilities, intrinsic prior

           EPP    HIP    HUP    HOP
hum        0.99   0.69   0.85   0.76
dpg        0.85   0.48   0.52   0.53
ibt        0.99   1.00   1.00   1.00
hum^2      0.76   0.51   0.43   0.62
hum·dpg    0.55   0.02   0.03   0.17
hum·ibt    0.98   0.69   0.84   0.75
dpg^2      0.72   0.36   0.25   0.46
ibt^2      0.59   0.78   0.57   0.81
Table D-3. Marginal inclusion probabilities, Zellner-Siow prior

           EPP    HIP    HUP    HOP
hum        0.76   0.67   0.80   0.69
dpg        0.89   0.50   0.55   0.58
ibt        0.99   1.00   1.00   1.00
hum^2      0.57   0.49   0.40   0.57
hum·ibt    0.72   0.66   0.78   0.68
dpg^2      0.81   0.38   0.31   0.51
ibt^2      0.54   0.76   0.55   0.77
Table D-4. Marginal inclusion probabilities, Hyper-g(11)

           EPP    HIP    HUP    HOP
vh         0.54   0.05   0.10   0.11
hum        0.81   0.67   0.80   0.69
dpg        0.90   0.50   0.55   0.58
ibt        0.99   1.00   0.99   0.99
hum^2      0.61   0.49   0.40   0.57
hum·ibt    0.78   0.66   0.78   0.68
dpg^2      0.83   0.38   0.30   0.51
ibt^2      0.49   0.76   0.54   0.77
Table D-5. Marginal inclusion probabilities, Hyper-g(21)

           EPP    HIP    HUP    HOP
hum        0.79   0.64   0.73   0.67
dpg        0.90   0.52   0.60   0.59
ibt        0.99   1.00   0.99   1.00
hum^2      0.60   0.47   0.37   0.55
hum·ibt    0.76   0.64   0.71   0.67
dpg^2      0.82   0.41   0.36   0.52
ibt^2      0.47   0.73   0.49   0.75
REFERENCES
Akaike, H. (1983). Information measures and model selection. Bulletin of the International Statistical Institute, 50, 277–290.

Albert, J. H., & Chib, S. (1993). Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association, 88(422), 669–679.

Berger, J., & Bernardo, J. (1992). On the development of reference priors. In Bayesian Statistics 4 (pp. 35–60).

Berger, J., & Pericchi, L. (1996). The intrinsic Bayes factor for model selection and prediction. Journal of the American Statistical Association, 91(433), 109–122.

Berger, J., Pericchi, L., & Ghosh, J. (2001). Objective Bayesian methods for model selection: Introduction and comparison. In Model Selection, Vol. 38 of IMS Lecture Notes – Monograph Series (pp. 135–207). Institute of Mathematical Statistics.

Besag, J., York, J., & Mollié, A. (1991). Bayesian image restoration, with two applications in spatial statistics. Annals of the Institute of Statistical Mathematics, 43, 1–20.

Bien, J., Taylor, J., & Tibshirani, R. (2013). A lasso for hierarchical interactions. The Annals of Statistics, 41(3), 1111–1141.

Breiman, L., & Friedman, J. (1985). Estimating optimal transformations for multiple regression and correlation. Journal of the American Statistical Association, 80, 580–598.

Brusco, M. J., Steinley, D., & Cradit, J. D. (2009). An exact algorithm for hierarchically well-formulated subsets in second-order polynomial regression. Technometrics, 51(3), 306–315.

Casella, G., Girón, F. J., Martínez, M. L., & Moreno, E. (2009). Consistency of Bayesian procedures for variable selection. The Annals of Statistics, 37(3), 1207–1228.

Casella, G., Moreno, E., & Girón, F. (2014). Cluster analysis, model selection, and prior distributions on models. Bayesian Analysis, TBA(TBA), 1–46.
Chipman, H. (1996). Bayesian variable selection with related predictors. Canadian Journal of Statistics, 24(1), 17–36.

Clyde, M., & George, E. I. (2004). Model uncertainty. Statistical Science, 19(1), 81–94.

Dewey, J. (1958). Experience and Nature. New York: Dover Publications.

Dorazio, R. M., & Taylor-Rodríguez, D. (2012). A Gibbs sampler for Bayesian analysis of site-occupancy data. Methods in Ecology and Evolution, 3, 1093–1098.

Ellison, A. M. (2004). Bayesian inference in ecology. Ecology Letters, 7, 509–520.

Fiske, I., & Chandler, R. (2011). unmarked: An R package for fitting hierarchical models of wildlife occurrence and abundance. Journal of Statistical Software, 43(10).

George, E. (2000). The variable selection problem. Journal of the American Statistical Association, 95(452), 1304–1308.

Girón, F. J., Moreno, E., Casella, G., & Martínez, M. L. (2010). Consistency of objective Bayes factors for nonnested linear models and increasing model dimension. Revista de la Real Academia de Ciencias Exactas, Físicas y Naturales. Serie A. Matemáticas, 104(1), 57–67.

Good, I. J. (1950). Probability and the Weighing of Evidence. New York: Haffner.

Griepentrog, G. L., Ryan, J. M., & Smith, L. D. (1982). Linear transformations of polynomial regression models. American Statistician, 36(3), 171–174.

Gunel, E., & Dickey, J. (1974). Bayes factors for independence in contingency tables. Biometrika, 61, 545–557.

Hanski, I. (1994). A practical model of metapopulation dynamics. Journal of Animal Ecology, 63, 151–162.

Hooten, M. (2006). Hierarchical spatio-temporal models for ecological processes. Doctoral dissertation, University of Missouri-Columbia.

Hooten, M. B., & Hobbs, N. T. (2014). A guide to Bayesian model selection for ecologists. Ecological Monographs (in press).

Hughes, J., & Haran, M. (2013). Dimension reduction and alleviation of confounding for spatial generalized linear mixed models. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 75, 139–159.

Hurvich, C. M., & Tsai, C.-L. (1989). Regression and time series model selection in small samples. Biometrika, 76, 297–307.

Jeffreys, H. (1935). Some tests of significance, treated by the theory of probability. Proceedings of the Cambridge Philosophical Society, 31, 203–222.

Jeffreys, H. (1961). Theory of Probability (3rd ed.). London: Oxford University Press.

Johnson, D., Conn, P., Hooten, M., Ray, J., & Pond, B. (2013). Spatial occupancy models for large data sets. Ecology, 94(4), 801–808.

Kass, R., & Wasserman, L. (1995). A reference Bayesian test for nested hypotheses and its relationship to the Schwarz criterion. Journal of the American Statistical Association, 90(431).

Kass, R. E., & Raftery, A. E. (1995). Bayes factors. Journal of the American Statistical Association, 90, 773–795.

Kass, R. E., & Wasserman, L. (1996). The selection of prior distributions by formal rules. Journal of the American Statistical Association, 91(435), 1343.

Kéry, M. (2010). Introduction to WinBUGS for Ecologists: A Bayesian Approach to Regression, ANOVA, Mixed Models and Related Analyses (1st ed.). Academic Press.

Kéry, M., Gardner, B., & Monnerat, C. (2010). Predicting species distributions from checklist data using site-occupancy models. Journal of Biogeography, 37(10), 1851–1862.

Khuri, A. (2002). Nonsingular linear transformations of the control variables in response surface models. Technical report.

Krebs, C. J. (1972). Ecology: The Experimental Analysis of Distribution and Abundance.
Lempers, F. B. (1971). Posterior Probabilities of Alternative Linear Models. Rotterdam: University of Rotterdam Press.

León-Novelo, L., Moreno, E., & Casella, G. (2012). Objective Bayes model selection in probit models. Statistics in Medicine, 31(4), 353–365.

Liang, F., Paulo, R., Molina, G., Clyde, M. A., & Berger, J. O. (2008). Mixtures of g priors for Bayesian variable selection. Journal of the American Statistical Association, 103(481), 410–423.

Link, W., & Barker, R. (2009). Bayesian Inference with Ecological Applications. Elsevier.

MacKenzie, D., & Nichols, J. (2004). Occupancy as a surrogate for abundance estimation. Animal Biodiversity and Conservation, 1, 461–467.

MacKenzie, D., Nichols, J., & Hines, J. (2003). Estimating site occupancy, colonization, and local extinction when a species is detected imperfectly. Ecology, 84(8), 2200–2207.

MacKenzie, D. I., Bailey, L. L., & Nichols, J. D. (2004). Investigating species co-occurrence patterns when species are detected imperfectly. Journal of Animal Ecology, 73, 546–555.

MacKenzie, D. I., Nichols, J. D., Lachman, G. B., Droege, S., Royle, J. A., & Langtimm, C. A. (2002). Estimating site occupancy rates when detection probabilities are less than one. Ecology, 83(8), 2248–2255.

Mazerolle, M., & Mazerolle, M. (2013). Package 'AICcmodavg'.

McCullagh, P., & Nelder, J. A. (1989). Generalized Linear Models (2nd ed.). London, England: Chapman & Hall.

McQuarrie, A., Shumway, R., & Tsai, C.-L. (1997). The model selection criterion AICu.
Moreno, E., Bertolino, F., & Racugno, W. (1998). An intrinsic limiting procedure for model selection and hypotheses testing. Journal of the American Statistical Association, 93(444), 1451–1460.

Moreno, E., Girón, F. J., & Casella, G. (2010). Consistency of objective Bayes factors as the model dimension grows. The Annals of Statistics, 38(4), 1937–1952.

Nelder, J. A. (1977). A reformulation of linear models. Journal of the Royal Statistical Society, Series A (Statistics in Society), 140, 48–77.

Nelder, J. A. (1998). The selection of terms in response-surface models: How strong is the weak-heredity principle? American Statistician, 52(4), 315–318.

Nelder, J. A. (2000). Functional marginality and response-surface fitting. Journal of Applied Statistics, 27(1), 109–112.

Nichols, J., Hines, J., & MacKenzie, D. (2007). Occupancy estimation and modeling with multiple states and state uncertainty. Ecology, 88(6), 1395–1400.

Ovaskainen, O., Hottola, J., & Siitonen, J. (2010). Modeling species co-occurrence by multivariate logistic regression generates new hypotheses on fungal interactions. Ecology, 91(9), 2514–2521.

Peixoto, J. L. (1987). Hierarchical variable selection in polynomial regression models. American Statistician, 41(4), 311–313.

Peixoto, J. L. (1990). A property of well-formulated polynomial regression models. American Statistician, 44(1), 26–30.

Pericchi, L. R. (2005). Model selection and hypothesis testing based on objective probabilities and Bayes factors. In Handbook of Statistics. Elsevier.

Polson, N. G., Scott, J. G., & Windle, J. (2013). Bayesian inference for logistic models using Pólya-Gamma latent variables. Journal of the American Statistical Association, 108, 1339–1349.

Rao, C. R., & Wu, Y. (2001). On model selection. Vol. 38 of Lecture Notes – Monograph Series (pp. 1–57). Beachwood, OH: Institute of Mathematical Statistics.

Reich, B. J., Hodges, J. S., & Zadnik, V. (2006). Effects of residual smoothing on the posterior of the fixed effects in disease-mapping models. Biometrics, 62, 1197–1206.

Reiners, W., & Lockwood, J. (2009). Philosophical Foundations for the Practices of Ecology. Cambridge University Press.

Rigler, F., & Peters, R. (1995). Excellence in Ecology: Science and Limnology. Germany: Ecology Institute.

Robert, C., Chopin, N., & Rousseau, J. (2009). Harold Jeffreys's Theory of Probability revisited. Statistical Science, 24(2), 141–179.

Robert, C. P. (1993). A note on the Jeffreys-Lindley paradox. Statistica Sinica, 3, 601–608.

Royle, J. A., & Kéry, M. (2007). A Bayesian state-space formulation of dynamic occupancy models. Ecology, 88(7), 1813–1823.

Scott, J., & Berger, J. (2010). Bayes and empirical-Bayes multiplicity adjustment in the variable-selection problem. The Annals of Statistics.

Spiegelhalter, D. J., & Smith, A. F. M. (1982). Bayes factors for linear and log-linear models with vague prior information. Journal of the Royal Statistical Society, Series B, 44, 377–387.

Tierney, L., & Kadane, J. B. (1986). Accurate approximations for posterior moments and marginal densities. Journal of the American Statistical Association, 81, 82–86.

Tyre, A. J., Tenhumberg, B., Field, S. A., Niejalke, D., Parris, K., & Possingham, H. P. (2003). Improving precision and reducing bias in biological surveys: Estimating false-negative error rates. Ecological Applications, 13(6), 1790–1801.

Waddle, J. H., Dorazio, R. M., Walls, S. C., Rice, K. G., Beauchamp, J., Schuman, M. J., & Mazzotti, F. J. (2010). A new parameterization for estimating co-occurrence of interacting species. Ecological Applications, 20, 1467–1475.

Wasserman, L. (2000). Bayesian model selection and model averaging. Journal of Mathematical Psychology, 44(1), 92–107.

Wilson, M., Iversen, E., Clyde, M. A., Schmidler, S. C., & Schildkraut, J. M. (2010). Bayesian model search and multilevel inference for SNP association studies. The Annals of Applied Statistics, 4(3), 1342–1364.

Womack, A. J., León-Novelo, L., & Casella, G. (2014). Inference from intrinsic Bayes procedures under model selection and uncertainty. Journal of the American Statistical Association.

Yuan, M., Joseph, V. R., & Zou, H. (2009). Structured variable selection and estimation. The Annals of Applied Statistics, 3(4), 1738–1757.

Zeller, K. A., Nijhawan, S., Salom-Pérez, R., Potosme, S. H., & Hines, J. E. (2011). Integrating occupancy modeling and interview data for corridor identification: A case study for jaguars in Nicaragua. Biological Conservation, 144(2), 892–901.

Zellner, A., & Siow, A. (1980). Posterior odds ratios for selected regression hypotheses. In Trabajos de Estadística y de Investigación Operativa (pp. 585–603).
BIOGRAPHICAL SKETCH

Daniel Taylor-Rodríguez was born in Bogotá, Colombia. He earned a BS degree in economics from the Universidad de Los Andes (2004) and a Specialist degree in statistics from the Universidad Nacional de Colombia. In 2009 he traveled to Gainesville, Florida, to pursue a master's in statistics under the supervision of George Casella. Upon completion, he started a PhD in interdisciplinary ecology with a concentration in statistics, again under George Casella's supervision. After George's passing, Linda Young and Nikolay Bliznyuk continued to oversee Daniel's mentorship. He has accepted a joint postdoctoral fellowship at the Statistical and Applied Mathematical Sciences Institute and the Department of Statistical Science at Duke University.