Spatial statistics presentation Texas A&M Census RDC
DESCRIPTION
The purpose of this workshop is twofold. A primary goal is to provide researchers with a basic overview of spatial analysis. A secondary goal is to give attention to issues in GIS and spatial analysis that may be relevant to researchers planning to work with location data and unique geographies in confidential data sets in the Texas Census Research Data Center. The workshop will consist of three sessions. Each session will be led by Dr. Corey Sparks, Assistant Professor at UTSA's College of Public Policy. Dr. Sparks's research focuses on statistical demography, Geographic Information Systems and the application of modern statistical methods to problems in demography and health. His teaching interests focus on the use and application of advanced statistical techniques, including hazards analysis, multivariate methods and spatial statistics, in human population analysis.
TRANSCRIPT
![Page 1: Spatial statistics presentation Texas A&M Census RDC](https://reader036.vdocuments.us/reader036/viewer/2022062514/55839cb0d8b42aea578b4910/html5/thumbnails/1.jpg)
An Introduction to Spatial Analysis in Socio-Economic and Health Research
Corey S. Sparks, PhD
Department of Demography
The University of Texas at San Antonio
![Page 2: Spatial statistics presentation Texas A&M Census RDC](https://reader036.vdocuments.us/reader036/viewer/2022062514/55839cb0d8b42aea578b4910/html5/thumbnails/2.jpg)
Outline
1) What’s special about spatial?
  Good news
  Bad news
2) Modes of thinking about spatial analysis
  Macro and Micro
3) Concepts of space
4) Spatial analysis is NOT statistics, but spatial statistics needs spatial analysis
5) Using this for something meaningful
![Page 3: Spatial statistics presentation Texas A&M Census RDC](https://reader036.vdocuments.us/reader036/viewer/2022062514/55839cb0d8b42aea578b4910/html5/thumbnails/3.jpg)
What’s special about spatial?
Spatial data have more information than ordinary data
Think of them as a triplet: Y, X, and Z, where Y is the variable of interest, X is some other information that influences Y, and Z is the geographic location where Y occurred
If our data aren’t spatial, we don’t have Z
Spatial information is a key attribute of behavioral data
This adds a potentially interesting attribute to any data we collect
![Page 4: Spatial statistics presentation Texas A&M Census RDC](https://reader036.vdocuments.us/reader036/viewer/2022062514/55839cb0d8b42aea578b4910/html5/thumbnails/4.jpg)
What’s special about spatial?
Spatial data monkey with models
Most analytical models have assumptions, and spatial structure can violate these assumptions
We typically want to jump into modeling, but if spatial structure is not acknowledged or handled directly, it can make our models meaningless
Some of the problems are...
![Page 5: Spatial statistics presentation Texas A&M Census RDC](https://reader036.vdocuments.us/reader036/viewer/2022062514/55839cb0d8b42aea578b4910/html5/thumbnails/5.jpg)
The ecological fallacy
The tendency for aggregate data on a concept to show correlations when individual data on a concept do not.
More generally, this is the effect of aggregation bias, whereby those studying macro-level data try to draw conclusions or make statements about individual-level behavior
This is also felt when you analyze data at a specific level: if you analyze counties, say, your results are only generalizable at that level, not at the level of congressional districts, MSAs or states.
The often-arbitrary nature of aggregate units also needs to be considered in such analysis.
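The aggregation problem can be seen in a small sketch. The three areas and their values below are invented purely for illustration, and `pearson` is a hypothetical helper, not part of any package used in the workshop. The area means line up perfectly, while inside every area the individual-level relationship runs the other way.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical individual-level data for three areas (made up for illustration).
areas = {
    "A": {"x": [1, 2, 3], "y": [3, 2, 1]},
    "B": {"x": [4, 5, 6], "y": [6, 5, 4]},
    "C": {"x": [7, 8, 9], "y": [9, 8, 7]},
}

# Aggregate view: correlation between the area means.
mean_x = [sum(a["x"]) / len(a["x"]) for a in areas.values()]
mean_y = [sum(a["y"]) / len(a["y"]) for a in areas.values()]
aggregate_r = pearson(mean_x, mean_y)

# Individual view: correlation among the people inside each area.
within_r = {name: pearson(a["x"], a["y"]) for name, a in areas.items()}

print("area-level r:", aggregate_r)
print("within-area r:", within_r)
```

A conclusion about individuals drawn from the area-level correlation alone would have the wrong sign here.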
![Page 6: Spatial statistics presentation Texas A&M Census RDC](https://reader036.vdocuments.us/reader036/viewer/2022062514/55839cb0d8b42aea578b4910/html5/thumbnails/6.jpg)
The modifiable areal unit problem (MAUP)
This is akin to the ecological fallacy and the notion of aggregation bias.
The MAUP occurs when inferences about data change when the spatial scale of observation is modified.
e.g. at the county level there may be a significant association between income and health, but at the state or national level this may become insignificant; likewise, at the individual level we may see the relationship disappear.
This problem also exists when we suspect that a characteristic of an aggregate unit is influencing an individual behavior, but because of the level at which aggregate data are available, we are unable to properly measure the variable at the aggregate level.
e.g. we suspect that neighborhood crime rates will influence the recidivism hazard for a parolee, but we can only get crime rates at the census tract or county level, so we cannot really measure the effect we want.
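A minimal sketch of the MAUP, assuming six made-up small units along a line and two hypothetical ways of grouping them into zones (`zoning_a`, `zoning_b`). The same underlying values give a different correlation at each level of aggregation and under each set of boundaries.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def zone_means(values, zones):
    """Average the unit-level values within each zone (a list of unit indices)."""
    return [sum(values[i] for i in zone) / len(zone) for zone in zones]

# Hypothetical values for six small units arranged along a line.
x = [1, 2, 3, 4, 5, 6]
y = [2, 1, 4, 3, 6, 5]

zoning_a = [[0, 1], [2, 3], [4, 5]]    # three zones of two units each
zoning_b = [[0], [1, 2], [3, 4], [5]]  # same units, different boundaries

r_units = pearson(x, y)
r_zoning_a = pearson(zone_means(x, zoning_a), zone_means(y, zoning_a))
r_zoning_b = pearson(zone_means(x, zoning_b), zone_means(y, zoning_b))

print(round(r_units, 3), round(r_zoning_a, 3), round(r_zoning_b, 3))
```

Nothing about the units changed between the three numbers; only the scale and the placement of zone boundaries did.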
![Page 7: Spatial statistics presentation Texas A&M Census RDC](https://reader036.vdocuments.us/reader036/viewer/2022062514/55839cb0d8b42aea578b4910/html5/thumbnails/7.jpg)
Spatial structure
Structure is the idea that your data have an organization to them that has a specific spatial dimension
Think of a square grid
Each cell in the grid can be thought of as being a neighbor of other cells based on their proximity, distance, direction, etc.
This structure generally influences data by making them non-independent of one another
At best, you can have a correlation with your neighbor
At worst, your characteristics are a linear or nonlinear function of your neighbors
![Page 8: Spatial statistics presentation Texas A&M Census RDC](https://reader036.vdocuments.us/reader036/viewer/2022062514/55839cb0d8b42aea578b4910/html5/thumbnails/8.jpg)
Spatial heterogeneity
Spatial heterogeneity is the idea that characteristics of a population or a sample vary by location
This can manifest itself by generating clusters of like observations
Statistically, this is bad because many models assume constant variance, but if like observations are spatially co-incident, then variance is not constant
![Page 9: Spatial statistics presentation Texas A&M Census RDC](https://reader036.vdocuments.us/reader036/viewer/2022062514/55839cb0d8b42aea578b4910/html5/thumbnails/9.jpg)
Modes of thinking about spatial analysis
Macro
Observations are areas, or aggregates of individuals
Processes generally involve interaction among these areas
We can’t really get at variation within these areas
  Can’t see the people within the county
  All we have are counts of some variable of interest
  Think of your stereotypical Census tract, or county
In social science, we often refer to such analyses as “ecological”
  Analyzing places
  Looking for trends and associations at the population level
![Page 10: Spatial statistics presentation Texas A&M Census RDC](https://reader036.vdocuments.us/reader036/viewer/2022062514/55839cb0d8b42aea578b4910/html5/thumbnails/10.jpg)
Macro models
An example
I think the infant mortality rate in US counties is a function of the socioeconomic status of the county residents
Furthermore, I expect that characteristics of the built environment in counties will further influence the infant mortality rate
I might hypothesize that areas that are more “built up” may have higher rates of infant mortality than less “built up” places
My outcome is a rate
My predictors are other rates
Everything is measured at the county level
![Page 11: Spatial statistics presentation Texas A&M Census RDC](https://reader036.vdocuments.us/reader036/viewer/2022062514/55839cb0d8b42aea578b4910/html5/thumbnails/11.jpg)
Modes of thinking in spatial analysis
Micro
Observations are individuals
Processes may still involve interactions between observations, but we often go beyond that
Look at behaviors of people
Why do we do what we do?
![Page 12: Spatial statistics presentation Texas A&M Census RDC](https://reader036.vdocuments.us/reader036/viewer/2022062514/55839cb0d8b42aea578b4910/html5/thumbnails/12.jpg)
Multi-level models
The concept that individuals are nested within a hierarchy and the behavior of interest is determined by both the characteristics of the individual and the level of nesting.
This can be thought of broadly as individuals within any hierarchy, such as a city block, an organization, a village, or a community.
The spatial idea of multi-level analysis is that the individual is nested within a distinct geographic unit.
Substantively, something about that area is thought to influence our behavior of interest.
Sometimes we have a specific idea of how that happens; other times we have a weak conceptual model
Statistically these methods correct for the common occurrence of dependence within aggregate units (i.e. people from the same area may be more alike than expected by random chance alone) which violates the assumptions of most linear regression models (iid residuals)
![Page 13: Spatial statistics presentation Texas A&M Census RDC](https://reader036.vdocuments.us/reader036/viewer/2022062514/55839cb0d8b42aea578b4910/html5/thumbnails/13.jpg)
Concepts of space
To start, let’s think about how to define space as a concept
i.e. what levels of space are important for human behavior?
  Neighborhoods
  Villages/towns
  Social networks > social “space”
  Households
  Climatic zones
  Political/administrative boundaries
![Page 14: Spatial statistics presentation Texas A&M Census RDC](https://reader036.vdocuments.us/reader036/viewer/2022062514/55839cb0d8b42aea578b4910/html5/thumbnails/14.jpg)
What we really want to get at
What concept of space is important in my work?
That’s for you to decide
More often than not, as social scientists, we will undertake a macro analysis or some form of nested micro analysis
At the macro level, we have to conceptualize the space of our units
  Where are counties relative to one another?
  Is there spatial heterogeneity in my outcome at the county level?
At the micro level, we have to justify why we believe the level of nesting we can measure is in fact the right level in terms of our outcome
  Just knowing what MSA a child lives in doesn’t tell me much about the neighborhood in which they grew up
![Page 15: Spatial statistics presentation Texas A&M Census RDC](https://reader036.vdocuments.us/reader036/viewer/2022062514/55839cb0d8b42aea578b4910/html5/thumbnails/15.jpg)
Spheres of social integration
Individuals exist within formal or, more often, informal spheres of interaction.
You can think of these as areas where common ideas unite people, or as the range (meaning distance) of social norms.
Common ideology often unites individuals into groups and groups into larger aggregates; look at the concept of nationalism, for example.
They can likewise have a strictly a-spatial representation, such as a belief network.
![Page 16: Spatial statistics presentation Texas A&M Census RDC](https://reader036.vdocuments.us/reader036/viewer/2022062514/55839cb0d8b42aea578b4910/html5/thumbnails/16.jpg)
Some parting thoughts about space
Realization that human behavior is a dynamic process, not a static phenomenon
Use of the cross-section as opposed to the cohort (longitudinal data) or with respect to changes in the neighborhood.
Defining “meaningful” neighborhoods, not just available sampling areas: how do individuals interact within these areas?
How do individuals interact with their neighborhood?
Ideas of agency and how an individual is an agent of change
Do people choose neighborhoods? Or are some chosen by the neighborhoods they are born into, or only the ones they can afford?
![Page 17: Spatial statistics presentation Texas A&M Census RDC](https://reader036.vdocuments.us/reader036/viewer/2022062514/55839cb0d8b42aea578b4910/html5/thumbnails/17.jpg)
Context or Composition?
Is it the places where people live that are important? (Context)
Or
Is it the nature of the people that live in the place? (Composition)
Again, when justifying a spatially oriented analysis, these things need to be addressed
![Page 18: Spatial statistics presentation Texas A&M Census RDC](https://reader036.vdocuments.us/reader036/viewer/2022062514/55839cb0d8b42aea578b4910/html5/thumbnails/18.jpg)
Spatial analysis is NOT statistics
Enter the GIS
GIS is the manager of spatial information
  Goes beyond the realm of relational databases by incorporating the spatial context of all data
  Uses this as the overarching infrastructure for organizing information
The use of space as a tool
  The GIS allows us to edit, merge, split, add, delete all types of information based purely on the spatial location of the data
But, it is not statistics!
![Page 19: Spatial statistics presentation Texas A&M Census RDC](https://reader036.vdocuments.us/reader036/viewer/2022062514/55839cb0d8b42aea578b4910/html5/thumbnails/19.jpg)
GIS operations as spatial analysis
We use the GIS to help us manage and visualize spatial data
There are thousands of specific tools offered in a standard GIS
We can organize the tools of spatial analysis by what type of data they use:
  Point patterns
  Areas (polygons)
  Images (raster)
  Interactions
  Networks
  I’m sure I’m missing something
![Page 20: Spatial statistics presentation Texas A&M Census RDC](https://reader036.vdocuments.us/reader036/viewer/2022062514/55839cb0d8b42aea578b4910/html5/thumbnails/20.jpg)
Spatial statistics needs spatial analysis
1) Without spatial analysis, we would miss out on the interactions between our spatial information
  Unable to construct important variables for statistical analysis
2) Without spatial analysis, we would be limited in our visualization of our data, and our results
  From maps to 3-D visualizations
![Page 21: Spatial statistics presentation Texas A&M Census RDC](https://reader036.vdocuments.us/reader036/viewer/2022062514/55839cb0d8b42aea578b4910/html5/thumbnails/21.jpg)
Wrap Up
Spatial data are special
  For better or for worse
Ignoring spatial structure can invalidate models
![Page 22: Spatial statistics presentation Texas A&M Census RDC](https://reader036.vdocuments.us/reader036/viewer/2022062514/55839cb0d8b42aea578b4910/html5/thumbnails/22.jpg)
![Page 23: Spatial statistics presentation Texas A&M Census RDC](https://reader036.vdocuments.us/reader036/viewer/2022062514/55839cb0d8b42aea578b4910/html5/thumbnails/23.jpg)
Exploratory Spatial Data Analysis
Corey S. Sparks, PhD
Department of Demography
The University of Texas at San Antonio
![Page 24: Spatial statistics presentation Texas A&M Census RDC](https://reader036.vdocuments.us/reader036/viewer/2022062514/55839cb0d8b42aea578b4910/html5/thumbnails/24.jpg)
Exploratory Spatial Data Analysis
What is ESDA?
In exploratory data analysis, we are looking for trends and patterns in data
We are working under the assumption that the more one knows about the data, the more effectively it may be used to develop, test and refine theory
This generally requires we follow two principles: skepticism and openness
![Page 25: Spatial statistics presentation Texas A&M Census RDC](https://reader036.vdocuments.us/reader036/viewer/2022062514/55839cb0d8b42aea578b4910/html5/thumbnails/25.jpg)
One should be skeptical of measures that summarize data, since they can conceal or misrepresent the most informative aspects of data,
and..
We must be open to patterns in the data that we did not expect to find, because these can often be the most revealing outcomes of the analysis.
We must avoid the temptation to automatically jump to the confirmatory model of data analysis.
![Page 26: Spatial statistics presentation Texas A&M Census RDC](https://reader036.vdocuments.us/reader036/viewer/2022062514/55839cb0d8b42aea578b4910/html5/thumbnails/26.jpg)
The confirmatory model says: “Do my data confirm the hypothesis that X causes Y?”
We would normally fit a linear model (or something close to it) and use summary measures (means and variances) to test if the pattern we observe in the data is real.
The exploratory model, to the contrary, says: “What do the data I have tell me about the relationship between X and Y?”
This lends itself to a much more open range of alternative explanations
![Page 27: Spatial statistics presentation Texas A&M Census RDC](https://reader036.vdocuments.us/reader036/viewer/2022062514/55839cb0d8b42aea578b4910/html5/thumbnails/27.jpg)
The principle of EDA is explained best in the simple model:
Data = smooth + rough
The smooth bit is the underlying simplified structure of a set of observations. This is the straight line that we expect a relationship to follow in linear regression for example, or the smooth curve describing the distribution of data: the pattern or regularity in the data.
You can also think of the smooth as a central parameter of a distribution, the mean or median for example
The rough is the difference between our actual data, and the smooth. This can be measured as the deviation of each point from the mean.
Outliers, for example, are very rough data: they lie far from the central tendency of a distribution, where all the other data points tend to clump
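The decomposition can be sketched in a few lines, assuming the median stands in for the smooth (one of the options named above). The observations are invented, with one deliberately extreme value.

```python
from statistics import median

# Hypothetical observations; the last one is deliberately extreme.
data = [4, 5, 5, 6, 7, 30]

smooth = median(data)               # a single central value as the smooth
rough = [x - smooth for x in data]  # rough = data - smooth

# The observation with the largest absolute rough value is the candidate outlier.
roughest = max(data, key=lambda x: abs(x - smooth))

print("smooth:", smooth)
print("rough:", rough)
print("roughest observation:", roughest)
```

Adding the smooth back to each rough value recovers the original data, which is all the "Data = smooth + rough" identity says.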
![Page 28: Spatial statistics presentation Texas A&M Census RDC](https://reader036.vdocuments.us/reader036/viewer/2022062514/55839cb0d8b42aea578b4910/html5/thumbnails/28.jpg)
Doing some ESDA
We will use new tools (namely GeoDa) to do some Spatial EDA.
GeoDa differs from traditional ways of viewing data (on paper, or with summary statistics) in that it goes beyond presenting the data to visualizing the data.
Data visualization is a dynamic process that incorporates multiple views of the same information, e.g. we can examine a histogram that is linked to a scatter plot that is linked to a map to visualize the distribution of a single variable, how it is related to another and how it is patterned over space.
We can also do what is called brushing the data: we select subsets of the information and see if there are distinct subsets of the data that have different relational or spatial properties. This allows us to really visualize how the processes we study unfold over space and to see the locations of potentially influential observations.
![Page 29: Spatial statistics presentation Texas A&M Census RDC](https://reader036.vdocuments.us/reader036/viewer/2022062514/55839cb0d8b42aea578b4910/html5/thumbnails/29.jpg)
Some examples of ESDA items
Histograms
Boxplots
Scatter plots
Parallel coordinate plots
Area statistics/Local statistics
![Page 30: Spatial statistics presentation Texas A&M Census RDC](https://reader036.vdocuments.us/reader036/viewer/2022062514/55839cb0d8b42aea578b4910/html5/thumbnails/30.jpg)
Histograms
Histograms are useful tools for representing the distribution of data
They can be used to judge central tendency (mean, median, mode), variation (standard deviation, variance), modality (unimodal or multi-modal), and graphical assessment of distributional assumptions (Normality)
![Page 31: Spatial statistics presentation Texas A&M Census RDC](https://reader036.vdocuments.us/reader036/viewer/2022062514/55839cb0d8b42aea578b4910/html5/thumbnails/31.jpg)
Box Plots
Box plots (or box and whisker plots) are another useful tool for examining the distribution of a variable
You can visualize the 5 number summary of a variable: minimum, lower quartile, median, upper quartile, and maximum
[Figure: box plot with the maximum, upper quartile, median, lower quartile, minimum, and IQR labeled]
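The numbers behind a box plot can be computed directly. This is a sketch using Python's `statistics.quantiles` with the "inclusive" method; other software may use a slightly different quartile convention, so results can differ a little on small samples. The values are made up.

```python
from statistics import quantiles

def five_number_summary(values):
    """Return (minimum, lower quartile, median, upper quartile, maximum)."""
    ordered = sorted(values)
    q1, q2, q3 = quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, q2, q3, ordered[-1]

data = [9, 1, 8, 2, 7, 3, 6, 4, 5]  # hypothetical values
minimum, q1, med, q3, maximum = five_number_summary(data)
iqr = q3 - q1  # interquartile range: the height of the box

print(minimum, q1, med, q3, maximum, iqr)
```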
![Page 32: Spatial statistics presentation Texas A&M Census RDC](https://reader036.vdocuments.us/reader036/viewer/2022062514/55839cb0d8b42aea578b4910/html5/thumbnails/32.jpg)
Scatter Plots
Scatterplots show bivariate relationships
This can give you a visual indication of an association between two variables
Positive association (positive correlation)
Negative association (negative correlation)
Also allows you to see potential outliers (abnormal observations) in the data
[Figure: scatter plot showing a slight positive association and a potential outlier]
![Page 33: Spatial statistics presentation Texas A&M Census RDC](https://reader036.vdocuments.us/reader036/viewer/2022062514/55839cb0d8b42aea578b4910/html5/thumbnails/33.jpg)
Parallel Coordinate Plots
Parallel coordinate plots allow for visualization of the association between multiple variables (multivariate)
Each variable is plotted according to its “coordinates” or values
![Page 34: Spatial statistics presentation Texas A&M Census RDC](https://reader036.vdocuments.us/reader036/viewer/2022062514/55839cb0d8b42aea578b4910/html5/thumbnails/34.jpg)
Local Statistics
Local statistics allow you to see how the mean, or variation, of a variable varies over space
You can generate a local statistic by weighting an ordinary statistic by some kind of spatial weight
[Figure: example map of property crime in San Antonio, TX]
![Page 35: Spatial statistics presentation Texas A&M Census RDC](https://reader036.vdocuments.us/reader036/viewer/2022062514/55839cb0d8b42aea578b4910/html5/thumbnails/35.jpg)
Global vs. Local Statistics
By global we imply that one statistic is used to adequately summarize the data
i.e. the mean or median
Or, a regression model that is suitable for all areas in the data
Local statistics are useful when the process you are studying varies over space, i.e. different areas have different local values that might cluster together to form a local deviation from the overall mean
Or a regression model that accounts for the same level of variation in the outcome in all locations
![Page 36: Spatial statistics presentation Texas A&M Census RDC](https://reader036.vdocuments.us/reader036/viewer/2022062514/55839cb0d8b42aea578b4910/html5/thumbnails/36.jpg)
Autocorrelation
This can occur in either space or time
Really boils down to the non-independence between neighboring values
The values of our independent variable (or our dependent variables) may be similar because our values occur
  closely in time (temporal autocorrelation)
  closely in space (spatial autocorrelation)
![Page 37: Spatial statistics presentation Texas A&M Census RDC](https://reader036.vdocuments.us/reader036/viewer/2022062514/55839cb0d8b42aea578b4910/html5/thumbnails/37.jpg)
Preliminaries to assessing autocorrelation
Basic Assessment of Spatial Dependency
Before we can model the dependency in spatial data, we must first cover the ideas of creating and modeling neighborhoods in our data.
By neighborhoods, I mean the clustering or connectedness of observations
The exploratory methods we will cover depend on us knowing how our data are arranged in space, who is next to who.
This is important (as we will see later) because most correlation in spatial data tends to die out as we get further away from a specific location (Tobler’s law)
![Page 38: Spatial statistics presentation Texas A&M Census RDC](https://reader036.vdocuments.us/reader036/viewer/2022062514/55839cb0d8b42aea578b4910/html5/thumbnails/38.jpg)
Tobler's first law of geography
Waldo Tobler (1970) jokingly suggested the “first law of geography”:
“Everything is related to everything else, but near things are more related than distant things”
We can see this better in graphical form: We expect the correlation between the attributes of two points to diminish as the distance between them grows
![Page 39: Spatial statistics presentation Texas A&M Census RDC](https://reader036.vdocuments.us/reader036/viewer/2022062514/55839cb0d8b42aea578b4910/html5/thumbnails/39.jpg)
One thing about autocorrelation
Autocorrelation is typically a local process
Meaning it typically dies out as the distance between observations increases
plot(exp(-0.05 * (1:100)), type = "l", lwd = 2, ylab = "Correlation", xlab = "Distance")
![Page 40: Spatial statistics presentation Texas A&M Census RDC](https://reader036.vdocuments.us/reader036/viewer/2022062514/55839cb0d8b42aea578b4910/html5/thumbnails/40.jpg)
W
To identify which observations are close to one another, we use W
W is the spatial weight matrix
  Square n by n matrix with 1/0 entries identifying if any two observations are spatial neighbors
How is this constructed?
There are two typical ways in which we measure spatial relationships
Distance and contiguity
![Page 41: Spatial statistics presentation Texas A&M Census RDC](https://reader036.vdocuments.us/reader036/viewer/2022062514/55839cb0d8b42aea578b4910/html5/thumbnails/41.jpg)
Distance based neighbors
In a distance based connectivity method, features (generally points) are considered to be contiguous if they are within a given radius of another point. The radius is really left up to the researcher to decide.
Likewise, we can calculate the distance matrix between a set of points
This is usually measured using the standard Euclidean distance

$$d_{ij} = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}$$

where x and y are the coordinates (lat/long) of the point or polygon in question; this is the “as the crow flies” distance
![Page 42: Spatial statistics presentation Texas A&M Census RDC](https://reader036.vdocuments.us/reader036/viewer/2022062514/55839cb0d8b42aea578b4910/html5/thumbnails/42.jpg)
More on distances
There are lots of distance measures
Manhattan distances

$$d_{ij} = |x_i - x_j| + |y_i - y_j|$$
These are city-block distances
We can use these distances to create a neighborhood of points using certain criteria
![Page 43: Spatial statistics presentation Texas A&M Census RDC](https://reader036.vdocuments.us/reader036/viewer/2022062514/55839cb0d8b42aea578b4910/html5/thumbnails/43.jpg)
Who is whose neighbor
There are many different criteria for deciding if two observations are neighbors
Generally two observations must be within a critical distance, d, to be considered neighbors.
This is the minimum distance criterion, and it is very popular.
This will generate a matrix of binary variables describing the neighborhood.
We can also describe the neighborhoods in a continuous weighting scheme based on the distance between them
Inverse Distance Weight

$$w_{ij} = \frac{1}{d_{ij}}$$

or
Inverse Squared Distance Weight

$$w_{ij} = \frac{1}{d_{ij}^2}$$
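The distance criteria above can be sketched as follows. The function names, the four points, and the critical distance are all hypothetical. Coordinates are treated as planar, and the points are assumed to be distinct so that no inverse weight divides by zero.

```python
from math import hypot

def euclidean(p, q):
    """Straight-line ("as the crow flies") distance between two (x, y) points."""
    return hypot(p[0] - q[0], p[1] - q[1])

def manhattan(p, q):
    """City-block distance between two (x, y) points."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def distance_band_weights(points, d, dist=euclidean):
    """Binary W: w_ij = 1 if i != j and distance(i, j) <= d, else 0."""
    n = len(points)
    return [[1 if i != j and dist(points[i], points[j]) <= d else 0
             for j in range(n)] for i in range(n)]

def inverse_distance_weights(points, power=1, dist=euclidean):
    """Continuous W: w_ij = 1 / d_ij**power, with zeros on the diagonal."""
    n = len(points)
    return [[0.0 if i == j else 1.0 / dist(points[i], points[j]) ** power
             for j in range(n)] for i in range(n)]

points = [(0, 0), (3, 4), (6, 8), (0, 1)]  # made-up coordinates
w_band = distance_band_weights(points, d=5.5)
w_inv = inverse_distance_weights(points)
w_inv_sq = inverse_distance_weights(points, power=2)

for row in w_band:
    print(row)
```

Passing `dist=manhattan` to either function swaps in city-block distances without changing anything else.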
![Page 44: Spatial statistics presentation Texas A&M Census RDC](https://reader036.vdocuments.us/reader036/viewer/2022062514/55839cb0d8b42aea578b4910/html5/thumbnails/44.jpg)
Contiguity/Polygon Adjacency
Polygons are contiguous if they share common topology, like an edge (line segment) or a vertex (point)
Neighborhoods are created based on which observations are judged “contiguous”
This is generally the best way to treat polygon features
Distances aren’t typically used for polygons for several reasons
  Spacing of observations
  What centroid? Is it a good measure?
![Page 45: Spatial statistics presentation Texas A&M Census RDC](https://reader036.vdocuments.us/reader036/viewer/2022062514/55839cb0d8b42aea578b4910/html5/thumbnails/45.jpg)
• Rook adjacency
• Neighbors must share a line segment
• Queen adjacency
• Neighbors must share a vertex or a line segment
• If polygons share these boundaries (based on the specific definition: rook or queen), they are given a weight of 1 (adjacent); otherwise they are given a weight of 0 (nonadjacent)
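On a regular grid the two definitions are easy to write down. This is a sketch with a hypothetical helper, `grid_neighbors`. Real polygon data would need a GIS or spatial library to detect shared edges and vertices.

```python
def grid_neighbors(n_rows, n_cols, rule="rook"):
    """Map each cell (row, col) of a regular grid to its first-order neighbors."""
    if rule == "rook":
        # Shared edge only: up, down, left, right.
        offsets = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    elif rule == "queen":
        # Shared edge or shared corner: all eight surrounding cells.
        offsets = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                   if (dr, dc) != (0, 0)]
    else:
        raise ValueError("rule must be 'rook' or 'queen'")
    neighbors = {}
    for r in range(n_rows):
        for c in range(n_cols):
            neighbors[(r, c)] = [
                (r + dr, c + dc) for dr, dc in offsets
                if 0 <= r + dr < n_rows and 0 <= c + dc < n_cols
            ]
    return neighbors

rook = grid_neighbors(3, 3, "rook")
queen = grid_neighbors(3, 3, "queen")

print("center cell, rook:", len(rook[(1, 1)]), "neighbors")
print("center cell, queen:", len(queen[(1, 1)]), "neighbors")
```

Queen adjacency always contains the rook neighbors and adds the cells that touch only at a corner.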
![Page 46: Spatial statistics presentation Texas A&M Census RDC](https://reader036.vdocuments.us/reader036/viewer/2022062514/55839cb0d8b42aea578b4910/html5/thumbnails/46.jpg)
Order of adjacency
[Figure: first and second order rook adjacency and first and second order queen adjacency, each marking the observation of interest, its first order neighbors, and its second order neighbors]
![Page 47: Spatial statistics presentation Texas A&M Census RDC](https://reader036.vdocuments.us/reader036/viewer/2022062514/55839cb0d8b42aea578b4910/html5/thumbnails/47.jpg)
What does a spatial weight matrix look like?
[Figure: a set of polygons and the corresponding spatial weight matrix, W]
![Page 48: Spatial statistics presentation Texas A&M Census RDC](https://reader036.vdocuments.us/reader036/viewer/2022062514/55839cb0d8b42aea578b4910/html5/thumbnails/48.jpg)
Standardized W
Most times, in analytical work, W is standardized
Typically this is done by dividing each 1/0 element in the row by the sum of the row, which would yield this matrix:

| Polygon | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| 1 | 0 | 0.5 | 0.5 | 0 |
| 2 | 0.5 | 0 | 0 | 0.5 |
| 3 | 0.5 | 0 | 0 | 0.5 |
| 4 | 0 | 0.5 | 0.5 | 0 |
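Row standardization takes only a few lines. In this sketch the binary matrix `w_binary` is inferred from the standardized values shown on this slide (each polygon has two neighbors), and the function name is hypothetical.

```python
def row_standardize(w):
    """Divide each row of W by its row sum; rows with no neighbors stay all zero."""
    result = []
    for row in w:
        total = sum(row)
        result.append([value / total if total else 0.0 for value in row])
    return result

# Binary W for four polygons, consistent with the standardized matrix on the slide.
w_binary = [
    [0, 1, 1, 0],
    [1, 0, 0, 1],
    [1, 0, 0, 1],
    [0, 1, 1, 0],
]
w_std = row_standardize(w_binary)

for row in w_std:
    print(row)
```

After standardization every row with at least one neighbor sums to 1, so multiplying W by a variable gives a neighborhood average.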
![Page 49: Spatial statistics presentation Texas A&M Census RDC](https://reader036.vdocuments.us/reader036/viewer/2022062514/55839cb0d8b42aea578b4910/html5/thumbnails/49.jpg)
Forms of autocorrelation
Positive autocorrelation
This means that a feature is positively associated with the values of the surrounding area (as defined by the spatial weight matrix): high values occur with high values, low with low
Negative autocorrelation
This means that a feature is negatively associated with the values of the surrounding area (as defined by the spatial weight matrix): high with low, low with high
The (probably) most popular global autocorrelation statistic is Moran’s I
![Page 50: Spatial statistics presentation Texas A&M Census RDC](https://reader036.vdocuments.us/reader036/viewer/2022062514/55839cb0d8b42aea578b4910/html5/thumbnails/50.jpg)
Measuring spatial autocorrelation: Common measures
Moran’s I
Geary’s C
Getis-Ord G
All typically give similar results
![Page 51: Spatial statistics presentation Texas A&M Census RDC](https://reader036.vdocuments.us/reader036/viewer/2022062514/55839cb0d8b42aea578b4910/html5/thumbnails/51.jpg)
Moran's I
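Measure of standardized spatial autocovariance

$$I = \frac{n}{S_0} \, \frac{\sum_i \sum_j w_{ij} (x_i - \bar{x})(x_j - \bar{x})}{\sum_i (x_i - \bar{x})^2}$$

with $x_i$ being the value of the attribute at location i, $x_j$ being the value of the attribute at location j, $\bar{x}$ the mean of the attribute, and n the number of locations
$S_0$ is the sum of all spatial weights
$w_{ij}$ is the weight for the pair of locations i and j (0 if they are not neighbors, 1 otherwise)
Approximately bounded on -1, 1, with an interpretation similar to that of a Pearson or Spearman correlation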
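A minimal sketch of the calculation, assuming the usual form of the statistic: n over the sum of the weights, times the weighted cross-products of deviations from the mean, divided by the sum of squared deviations. The six units on a line, their values, and the helper names are hypothetical.

```python
def chain_weights(n):
    """Binary W for n units on a line; each unit neighbors the units beside it."""
    return [[1 if abs(i - j) == 1 else 0 for j in range(n)] for i in range(n)]

def morans_i(x, w):
    """Global Moran's I for values x and an n-by-n spatial weight matrix w."""
    n = len(x)
    mean = sum(x) / n
    z = [value - mean for value in x]        # deviations from the mean
    s0 = sum(sum(row) for row in w)          # sum of all spatial weights
    cross = sum(w[i][j] * z[i] * z[j] for i in range(n) for j in range(n))
    return (n / s0) * cross / sum(zi ** 2 for zi in z)

w = chain_weights(6)
i_trend = morans_i([1, 2, 3, 4, 5, 6], w)        # similar values sit side by side
i_alternating = morans_i([1, 6, 1, 6, 1, 6], w)  # neighbors are always unlike

print(round(i_trend, 3), round(i_alternating, 3))
```

The smooth trend gives a positive value and the alternating pattern a negative one, matching the correlation-like reading of the statistic.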
![Page 52: Spatial statistics presentation Texas A&M Census RDC](https://reader036.vdocuments.us/reader036/viewer/2022062514/55839cb0d8b42aea578b4910/html5/thumbnails/52.jpg)
Geary’s C
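Measure of spatial covariance

$$C = \frac{n - 1}{2 S_0} \, \frac{\sum_i \sum_j w_{ij} (x_i - x_j)^2}{\sum_i (x_i - \bar{x})^2}$$

with $x_i$ being the value of the attribute at location i, $x_j$ being the value of the attribute at location j, $\bar{x}$ the mean of the attribute, and n the number of locations
$S_0$ is the sum of all spatial weights
$w_{ij}$ is the weight for the pair of locations i and j (0 if they are not neighbors, 1 otherwise)
Approximately bounded on 0, 2
  0 to 1 = positive autocorrelation
  1 = no autocorrelation
  1 to 2 = negative autocorrelation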
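Geary's C can be sketched the same way, assuming its usual form: (n - 1) over twice the sum of the weights, times the weighted squared differences between neighbors, divided by the sum of squared deviations from the mean. The data and helper names are again hypothetical.

```python
def chain_weights(n):
    """Binary W for n units on a line; each unit neighbors the units beside it."""
    return [[1 if abs(i - j) == 1 else 0 for j in range(n)] for i in range(n)]

def gearys_c(x, w):
    """Global Geary's C for values x and an n-by-n spatial weight matrix w."""
    n = len(x)
    mean = sum(x) / n
    s0 = sum(sum(row) for row in w)  # sum of all spatial weights
    squared_diffs = sum(w[i][j] * (x[i] - x[j]) ** 2
                        for i in range(n) for j in range(n))
    return ((n - 1) / (2 * s0)) * squared_diffs / sum((v - mean) ** 2 for v in x)

w = chain_weights(6)
c_trend = gearys_c([1, 2, 3, 4, 5, 6], w)        # neighbors alike: C below 1
c_alternating = gearys_c([1, 6, 1, 6, 1, 6], w)  # neighbors unlike: C above 1

print(round(c_trend, 3), round(c_alternating, 3))
```

Because C is built from differences between neighbors rather than cross-products, small values signal positive autocorrelation, the reverse of Moran's I.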
![Page 53: Spatial statistics presentation Texas A&M Census RDC](https://reader036.vdocuments.us/reader036/viewer/2022062514/55839cb0d8b42aea578b4910/html5/thumbnails/53.jpg)
Getis – Ord G
Unlike I and C, the interpretation of G focuses on whether high values of x tend to cluster with other high values of x
The measure is directional
Likewise, low with low
![Page 54: Spatial statistics presentation Texas A&M Census RDC](https://reader036.vdocuments.us/reader036/viewer/2022062514/55839cb0d8b42aea578b4910/html5/thumbnails/54.jpg)
Local Moran’s I
We can also describe the local trends in autocorrelation in our data by constructing local versions of our autocorrelation statistics
Each has a local version
This is useful to see where the autocorrelation is high or low within our data
  Goes beyond the global statistics to help us visualize where in our data associations are strong
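One common way to write the local statistic is sketched below: each location's deviation from the mean, scaled by the average squared deviation, times the weighted sum of its neighbors' deviations. This scaling is an assumption (software packages differ in the constant they use), and the data are made up.

```python
def chain_weights(n):
    """Binary W for n units on a line; each unit neighbors the units beside it."""
    return [[1 if abs(i - j) == 1 else 0 for j in range(n)] for i in range(n)]

def local_morans_i(x, w):
    """Local Moran's I_i = (z_i / m2) * sum_j w_ij * z_j, with z = x - mean(x)."""
    n = len(x)
    mean = sum(x) / n
    z = [v - mean for v in x]
    m2 = sum(zi ** 2 for zi in z) / n  # average squared deviation
    return [z[i] / m2 * sum(w[i][j] * z[j] for j in range(n)) for i in range(n)]

# Hypothetical data: a run of high values next to a run of low values.
x = [9, 8, 9, 1, 2, 1]
w = chain_weights(len(x))
local = local_morans_i(x, w)

# With this scaling the local values sum to S0 times the global Moran's I.
s0 = sum(sum(row) for row in w)
global_i = sum(local) / s0

print([round(v, 2) for v in local], round(global_i, 3))
```

Locations inside either run get positive local values. The two locations where the high run meets the low run get negative ones.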
![Page 55: Spatial statistics presentation Texas A&M Census RDC](https://reader036.vdocuments.us/reader036/viewer/2022062514/55839cb0d8b42aea578b4910/html5/thumbnails/55.jpg)
How do you determine if the data are in a cluster?
This is done via the Moran scatterplot.
Labels give the observation first, then its neighbors
Top right quadrant = high-high
Bottom left quadrant = low-low
Top left quadrant = low-high
Bottom right quadrant = high-low
![Page 56: Spatial statistics presentation Texas A&M Census RDC](https://reader036.vdocuments.us/reader036/viewer/2022062514/55839cb0d8b42aea578b4910/html5/thumbnails/56.jpg)
Moran scatterplot (univariate)
It is sometimes useful to visualize the relationship between the actual values of the dependent variable and their lagged values. This is the so-called Moran scatterplot
Lagged values are the average value of the surrounding neighborhood around location i

$$\mathrm{lag}(x_i) = \sum_j w_{ij} x_j$$

or Wx in matrix terms
The Moran scatterplot shows the association between the observation of interest and its neighborhood's average value
The variables are generally plotted as z-scores, to avoid scaling issues
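The pieces of the Moran scatterplot can be sketched without any plotting: z-scores on the horizontal axis, their spatial lag on the vertical axis, and a quadrant label for each observation. The row-standardized weights for six units on a line, the data, and the function names are hypothetical. In each label the first word describes the observation and the second its neighbors.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Standardize values to mean 0 and standard deviation 1."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]

def spatial_lag(values, w):
    """(Wx)_i = sum_j w_ij * x_j; a neighborhood average if w is row-standardized."""
    return [sum(wij * xj for wij, xj in zip(row, values)) for row in w]

def moran_quadrants(x, w):
    """Label each observation by its Moran scatterplot quadrant."""
    z = z_scores(x)
    lag = spatial_lag(z, w)
    labels = []
    for zi, li in zip(z, lag):
        own = "high" if zi >= 0 else "low"        # horizontal axis: the observation
        neighbors = "high" if li >= 0 else "low"  # vertical axis: its neighbors
        labels.append(f"{own}-{neighbors}")
    return labels

# Row-standardized W for six units on a line (each unit neighbors adjacent units).
n = 6
w_binary = [[1 if abs(i - j) == 1 else 0 for j in range(n)] for i in range(n)]
w = [[v / sum(row) for v in row] for row in w_binary]

x = [9, 8, 9, 1, 2, 1]  # made-up values: a high run next to a low run
labels = moran_quadrants(x, w)
print(labels)
```

Plotting the z-scores against their lags would give the scatterplot itself. With row-standardized weights, the slope of a line fitted through those points is the global Moran's I.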
![Page 57: Spatial statistics presentation Texas A&M Census RDC](https://reader036.vdocuments.us/reader036/viewer/2022062514/55839cb0d8b42aea578b4910/html5/thumbnails/57.jpg)
Moran scatterplot for San Antonio neighborhood deprivation
I = .688
High Positive autocorrelation
![Page 58: Spatial statistics presentation Texas A&M Census RDC](https://reader036.vdocuments.us/reader036/viewer/2022062514/55839cb0d8b42aea578b4910/html5/thumbnails/58.jpg)
Local Moran cluster map
Significant high neighborhood deprivation on the west, east and south sides
Significant low neighborhood deprivation on the north side
![Page 59: Spatial statistics presentation Texas A&M Census RDC](https://reader036.vdocuments.us/reader036/viewer/2022062514/55839cb0d8b42aea578b4910/html5/thumbnails/59.jpg)
Spatially lagged values
If we have a value $x_i$ at location i and a spatial weight matrix $w_{ij}$ describing the spatial neighborhood around location i, we can find the lagged value of the variable by:
$(Wx)_i = \sum_j w_{ij} x_j$
This calculates what is, effectively, the neighborhood average value in locations around location i (when the rows of W sum to 1), often written $x_{-i}$
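A small worked example may help. Assume (hypothetically) that location i has three neighbors with values 4, 6 and 8, and that the weights are row-standardized so each neighbor gets weight 1/3:

```latex
(Wx)_i = \sum_j w_{ij} x_j
       = \tfrac{1}{3}(4) + \tfrac{1}{3}(6) + \tfrac{1}{3}(8)
       = 6
```

The lagged value is simply the mean of the neighboring values. With unstandardized binary weights the same sum would give 18, a neighborhood total rather than an average.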
![Page 60: Spatial statistics presentation Texas A&M Census RDC](https://reader036.vdocuments.us/reader036/viewer/2022062514/55839cb0d8b42aea578b4910/html5/thumbnails/60.jpg)
![Page 61: Spatial statistics presentation Texas A&M Census RDC](https://reader036.vdocuments.us/reader036/viewer/2022062514/55839cb0d8b42aea578b4910/html5/thumbnails/61.jpg)
Moran scatterplot (Bivariate)
We can also compare the lagged value of one variable versus the value of another variable
Wy versus x
This is the so-called “multivariate Moran scatterplot”
It is often useful in so-called space-time analysis, when we compare the value of one variable measured at two different time points. This shows how values at one time point correlate with neighboring values at the other.
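A sketch of the bivariate version in plain Python, assuming row-standardized weights and z-scored variables. The statistic is the average product of each area's x z-score and the neighborhood average of the y z-scores. The data are hypothetical; in a space-time setting x and y would be the same variable at two time points.

```python
def zscores(values):
    n = len(values)
    m = sum(values) / n
    sd = (sum((v - m) ** 2 for v in values) / n) ** 0.5
    return [(v - m) / sd for v in values]

def bivariate_moran(x, y, neighbors):
    """Association between x at each area and the lag of y around it."""
    zx, zy = zscores(x), zscores(y)
    n = len(x)
    lag_y = [sum(zy[j] for j in neighbors[i]) / len(neighbors[i]) for i in range(n)]
    return sum(a * b for a, b in zip(zx, lag_y)) / n

# Hypothetical toy data: four areas arranged along a line.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 1.0, 4.0, 3.0]
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
i_xy = bivariate_moran(x, y, neighbors)
```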
![Page 62: Spatial statistics presentation Texas A&M Census RDC](https://reader036.vdocuments.us/reader036/viewer/2022062514/55839cb0d8b42aea578b4910/html5/thumbnails/62.jpg)
Positive autocorrelation between a tract’s deprivation and the average minority concentration in the neighboring tracts
I = .537
![Page 63: Spatial statistics presentation Texas A&M Census RDC](https://reader036.vdocuments.us/reader036/viewer/2022062514/55839cb0d8b42aea578b4910/html5/thumbnails/63.jpg)
What these methods tell you
Moran's I is a descriptive statistic
It simply indicates if there is spatial association/autocorrelation in a variable
Local Moran's I tells you if there is significant localized clustering of the variable
Where spatial clusters are and what type of cluster is present
![Page 64: Spatial statistics presentation Texas A&M Census RDC](https://reader036.vdocuments.us/reader036/viewer/2022062514/55839cb0d8b42aea578b4910/html5/thumbnails/64.jpg)
![Page 65: Spatial statistics presentation Texas A&M Census RDC](https://reader036.vdocuments.us/reader036/viewer/2022062514/55839cb0d8b42aea578b4910/html5/thumbnails/65.jpg)
Two examples of spatial analysis and statistics
![Page 66: Spatial statistics presentation Texas A&M Census RDC](https://reader036.vdocuments.us/reader036/viewer/2022062514/55839cb0d8b42aea578b4910/html5/thumbnails/66.jpg)
Goals
To show how you can use R, software that is free and available in the RDC, to:
1) Perform interesting spatially-oriented analysis
2) Merge spatial data from multiple sources (spatial join)
3) Visualize results from a spatial analysis much as you would in a GIS environment
Without a GIS!
![Page 67: Spatial statistics presentation Texas A&M Census RDC](https://reader036.vdocuments.us/reader036/viewer/2022062514/55839cb0d8b42aea578b4910/html5/thumbnails/67.jpg)
Residential Segregation and Crime in San Antonio, TX
Four data sources:
American Community Survey 5 year summary file (Census tract level)
Census 2000 Summary File 1 (Block level)
San Antonio Police Department call data (point data based on geocoded addresses; all calls received by the SAPD in a year)
Census TIGER file for San Antonio tracts
![Page 68: Spatial statistics presentation Texas A&M Census RDC](https://reader036.vdocuments.us/reader036/viewer/2022062514/55839cb0d8b42aea578b4910/html5/thumbnails/68.jpg)
Methods
1) Merge ACS calculated fields to shapefile based on tract FIPS code
2) Perform spatial join of crimes to tracts: count # of crimes in each tract
3) Create a thematic map of the crime rate using quantile breaks
4) Create tract-level indices of residential segregation by aggregating up from the block level, then merge these to the shapefile
5) Perform ESDA using Local Moran statistics on the segregation index and crime rate
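Step 2, the spatial join, amounts to a point-in-polygon test followed by a count. The sketch below shows that idea in plain Python with a ray-casting test. It is an illustration only, not the workshop's R code; the tract shapes and call locations are hypothetical.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test; polygon is a list of (x, y) vertices."""
    inside = False
    n = len(polygon)
    for k in range(n):
        x1, y1 = polygon[k]
        x2, y2 = polygon[(k + 1) % n]
        # Does a horizontal ray going right from (x, y) cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def count_points(points, tracts):
    """Count how many points fall inside each tract polygon."""
    counts = {tract_id: 0 for tract_id in tracts}
    for x, y in points:
        for tract_id, polygon in tracts.items():
            if point_in_polygon(x, y, polygon):
                counts[tract_id] += 1
                break  # a point belongs to at most one tract
    return counts

# Hypothetical tracts (two adjacent squares) and geocoded calls.
tracts = {
    "A": [(0, 0), (2, 0), (2, 2), (0, 2)],
    "B": [(2, 0), (4, 0), (4, 2), (2, 2)],
}
calls = [(0.5, 0.5), (1.5, 1.0), (3.0, 1.0), (5.0, 1.0)]
counts = count_points(calls, tracts)
```

The last call lies outside both tracts and is not counted. Dividing each count by the tract population would give the crime rate mapped in step 3.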
![Page 69: Spatial statistics presentation Texas A&M Census RDC](https://reader036.vdocuments.us/reader036/viewer/2022062514/55839cb0d8b42aea578b4910/html5/thumbnails/69.jpg)
Thematic Map/Histogram
![Page 70: Spatial statistics presentation Texas A&M Census RDC](https://reader036.vdocuments.us/reader036/viewer/2022062514/55839cb0d8b42aea578b4910/html5/thumbnails/70.jpg)
Local Moran Cluster Maps
![Page 71: Spatial statistics presentation Texas A&M Census RDC](https://reader036.vdocuments.us/reader036/viewer/2022062514/55839cb0d8b42aea578b4910/html5/thumbnails/71.jpg)
Spatial Disease Mapping
Application to crime data, but principles remain the same
Trying to locate areas of excess disease risk
Finding spatial clusters of disease: Kulldorff and Nagarwalla’s Spatial Scan Statistic (DCluster library)
Modeling spatial relative risk: Bayesian Hierarchical Regression, posterior model summaries (INLA library)
INLA: Approximate Bayesian inference by Integrated Nested Laplace Approximation
Great for Gaussian random field models
![Page 72: Spatial statistics presentation Texas A&M Census RDC](https://reader036.vdocuments.us/reader036/viewer/2022062514/55839cb0d8b42aea578b4910/html5/thumbnails/72.jpg)
Methods
Spatial Scan Statistic
Constructs a grid over the area, centered on each observation
Using a set radius (defined based on a proportion of the population size at risk), a circle is created on each point
A likelihood function is evaluated for each circle, and the area with the maximum value of the function is the most likely cluster of cases
Allows for ranked clusters (primary, secondary, etc.)
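The scan procedure can be sketched as follows. This is a simplified illustration in plain Python, not the DCluster implementation. It assumes the commonly used Poisson form of the log likelihood ratio, grows each circle by adding the nearest areas until a population share cap is reached, and skips the Monte Carlo step normally used to assess significance. The data and field names are hypothetical.

```python
from math import hypot, log

def poisson_llr(c, e, total):
    """Poisson log likelihood ratio for c observed vs e expected cases."""
    if c <= e:
        return 0.0  # only scan for excess risk
    llr = c * log(c / e)
    if total - c > 0:
        llr += (total - c) * log((total - c) / (total - e))
    return llr

def most_likely_cluster(areas, max_pop_share=0.5):
    """Return (llr, member ids) of the circle with the largest llr."""
    total_cases = sum(a["cases"] for a in areas)
    total_pop = sum(a["pop"] for a in areas)
    best = (0.0, [])
    for center in areas:
        ordered = sorted(
            areas,
            key=lambda a: hypot(a["x"] - center["x"], a["y"] - center["y"]),
        )
        cases, pop, members = 0, 0, []
        for a in ordered:
            if (pop + a["pop"]) / total_pop > max_pop_share:
                break
            cases += a["cases"]
            pop += a["pop"]
            members.append(a["id"])
            expected = total_cases * pop / total_pop
            llr = poisson_llr(cases, expected, total_cases)
            if llr > best[0]:
                best = (llr, sorted(members))
    return best

# Hypothetical data: four equally sized areas along a line.
areas = [
    {"id": "A", "x": 0, "y": 0, "pop": 100, "cases": 10},
    {"id": "B", "x": 1, "y": 0, "pop": 100, "cases": 9},
    {"id": "C", "x": 2, "y": 0, "pop": 100, "cases": 1},
    {"id": "D", "x": 3, "y": 0, "pop": 100, "cases": 0},
]
llr, members = most_likely_cluster(areas)
```

Secondary clusters could be ranked by keeping every candidate circle and sorting by the ratio instead of retaining only the maximum.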
![Page 73: Spatial statistics presentation Texas A&M Census RDC](https://reader036.vdocuments.us/reader036/viewer/2022062514/55839cb0d8b42aea578b4910/html5/thumbnails/73.jpg)
Scan Statistic
You know where this is? The Riverwalk
Take home message: Watch your wallet!
![Page 74: Spatial statistics presentation Texas A&M Census RDC](https://reader036.vdocuments.us/reader036/viewer/2022062514/55839cb0d8b42aea578b4910/html5/thumbnails/74.jpg)
Basics of Bayesian Modeling
Some statistical ideas: Likelihood
Usually we have some data that we assume follow some distribution
For counts, maybe Y ~ Poisson(λ)
L(y | λ) is the probability distribution of the data given the parameters
The model likelihood is the product of this distribution for all observations
We get the estimate of λ that is the most likely to have generated our data (y)
Maximum likelihood estimate of λ
This is traditional statistics
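To make the likelihood idea concrete, the sketch below evaluates the Poisson log-likelihood over a grid of candidate λ values and picks the largest. The counts are hypothetical. For the Poisson model the maximum likelihood estimate is known to be the sample mean, so the grid search should land on the same value.

```python
from math import lgamma, log

def poisson_loglik(lam, y):
    """Sum of log Poisson probabilities of the counts y given rate lam."""
    return sum(yi * log(lam) - lam - lgamma(yi + 1) for yi in y)

y = [2, 0, 3, 1, 4]                      # hypothetical counts
grid = [k / 10 for k in range(1, 61)]    # candidate rates 0.1 .. 6.0
best = max(grid, key=lambda lam: poisson_loglik(lam, y))
sample_mean = sum(y) / len(y)
```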
![Page 75: Spatial statistics presentation Texas A&M Census RDC](https://reader036.vdocuments.us/reader036/viewer/2022062514/55839cb0d8b42aea578b4910/html5/thumbnails/75.jpg)
Bayesian statistics
Now, let's consider the case where we have some knowledge about λ, say we believe it comes from a certain distribution, so: λ ~ Gamma(α, β)
And we can either assume some value for these parameters (α = β = 1) or allow them to also come from some distribution
α ~ Exp(u), β ~ Exp(p)
Why a Gamma distribution? Because a Gamma variable is > 0, and we know λ is > 0 because it is a relative risk (written θ in the slides that follow)
![Page 76: Spatial statistics presentation Texas A&M Census RDC](https://reader036.vdocuments.us/reader036/viewer/2022062514/55839cb0d8b42aea578b4910/html5/thumbnails/76.jpg)
Combining information
When we combine the likelihood and the prior, we form what is called the posterior distribution
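Thus we have Bayes Theorem which in the continuous case is:
$p(\theta \mid y) = \dfrac{p(y \mid \theta)\, p(\theta)}{\int p(y \mid \theta)\, p(\theta)\, d\theta}$
Which states that the posterior distribution of θ, conditional on y, is the product of the likelihood and the prior distribution of θ, divided by a normalizing constant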
![Page 77: Spatial statistics presentation Texas A&M Census RDC](https://reader036.vdocuments.us/reader036/viewer/2022062514/55839cb0d8b42aea578b4910/html5/thumbnails/77.jpg)
The denominator in Bayes theorem is a constant, so this is generally written as:
$p(\theta \mid y) \propto p(y \mid \theta)\, p(\theta)$
Which says the posterior is proportional to the likelihood times the prior
![Page 78: Spatial statistics presentation Texas A&M Census RDC](https://reader036.vdocuments.us/reader036/viewer/2022062514/55839cb0d8b42aea578b4910/html5/thumbnails/78.jpg)
A simple Poisson – Gamma model
A common relative risk model is the Poisson - Gamma model
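$y_i \sim \mathrm{Pois}(e_i \theta)$
$\theta \sim \mathrm{Gamma}(\alpha, \beta)$
The posterior distribution of θ (the relative risk) is
$[\theta \mid y, \alpha, \beta] \propto L(y \mid \theta)\, p(\theta \mid \alpha, \beta)$
which is fully written:
$[\theta \mid y, \alpha, \beta] = \dfrac{(\beta^*)^{\alpha^*}}{\Gamma(\alpha^*)}\, \theta^{\alpha^* - 1} \exp(-\theta \beta^*)$
where $\alpha^* = \sum_i y_i + \alpha$ and $\beta^* = \sum_i e_i + \beta$
Which, barring the normalizing constant, is the kernel of a Gamma distribution:
$\theta \mid y, \alpha, \beta \sim \mathrm{Gamma}\left(\sum_i y_i + \alpha,\ \sum_i e_i + \beta\right)$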
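Because the Gamma prior is conjugate to the Poisson likelihood, the posterior can be computed by simple addition. The sketch below does this in plain Python with hypothetical observed counts y and expected counts e. It uses the shape and rate parameterization, so the posterior mean is the shape divided by the rate.

```python
def poisson_gamma_posterior(y, e, alpha=1.0, beta=1.0):
    """Posterior Gamma shape, rate and mean for a common relative risk."""
    shape = sum(y) + alpha   # observed cases plus prior shape
    rate = sum(e) + beta     # expected cases plus prior rate
    return shape, rate, shape / rate

y = [5, 8, 2]            # hypothetical observed counts
e = [4.0, 6.0, 3.0]      # hypothetical expected counts
shape, rate, post_mean = poisson_gamma_posterior(y, e)
```

Here the raw ratio of observed to expected cases is 15/13, and the prior with α = β = 1 pulls the posterior mean slightly toward 1.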
![Page 79: Spatial statistics presentation Texas A&M Census RDC](https://reader036.vdocuments.us/reader036/viewer/2022062514/55839cb0d8b42aea578b4910/html5/thumbnails/79.jpg)
Simple Models
Let’s now consider a simple realistic model
$y_i \sim \mathrm{Poisson}(e_i \theta_i)$
Assume the usual log link function
$\log(\theta_i) = C_1 + C_2$
$C_1 = a_0 + x'\beta$: linear Poisson regression with intercept
$C_2$ = other terms
Could add a random effect -> this would be like an individual frailty model, or a group frailty model or a random intercept model
Generally this is called a Generalized Linear Mixed Model
Now we just need to code it up!
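Before fitting anything, it can help to see how the pieces of the model combine into a Poisson mean. The sketch below only evaluates the model for given parameter values. It is written in plain Python rather than R, and the covariates, coefficients and random effects are hypothetical. The function name and argument layout are assumptions for illustration.

```python
from math import exp, log

def expected_counts(e, x, a0, beta, v):
    """Poisson means e_i * theta_i with log(theta_i) = a0 + x_i'beta + v_i."""
    mu = []
    for e_i, x_i, v_i in zip(e, x, v):
        # Linear predictor: intercept, covariate effects, random effect.
        eta = a0 + sum(b * x_ij for b, x_ij in zip(beta, x_i)) + v_i
        mu.append(e_i * exp(eta))
    return mu

e = [2.0, 10.0]          # hypothetical expected counts
x = [[1.0], [0.0]]       # one covariate per area
beta = [log(1.5)]        # relative risk of 1.5 per unit of x
v = [0.0, 0.1]           # area-level random effects
mu = expected_counts(e, x, a0=0.0, beta=beta, v=v)
```

An estimation routine would search over a0, beta and the random effects for values that make the observed counts most probable under these means.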
![Page 80: Spatial statistics presentation Texas A&M Census RDC](https://reader036.vdocuments.us/reader036/viewer/2022062514/55839cb0d8b42aea578b4910/html5/thumbnails/80.jpg)
Mapping and spatial modeling
We just achieved a better, more accurate SMR map with a simple Poisson model with a random effect
This model “drew strength” from counties with good data to help smooth counties with poor data, but it didn’t account for correlation between neighboring counties
We can also build a model that includes a spatially correlated random effect between counties
Spatially smoothed rates
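The "drawing strength" idea can be illustrated with a simple shrinkage estimate. The sketch below swaps in a Poisson-Gamma smoother with fixed, hypothetical prior values as a stand-in for the fitted random effect model, so it shows the behavior rather than the model actually used. Counties with small expected counts are pulled strongly toward the prior mean, while counties with plenty of data barely move.

```python
def raw_smr(y, e):
    """Observed over expected cases for each county."""
    return [y_i / e_i for y_i, e_i in zip(y, e)]

def smoothed_relative_risk(y, e, alpha, beta):
    """Posterior mean relative risk under a Gamma(alpha, beta) prior."""
    return [(y_i + alpha) / (e_i + beta) for y_i, e_i in zip(y, e)]

y = [0, 3, 50]            # hypothetical observed cases
e = [0.5, 1.0, 40.0]      # hypothetical expected cases
smr = raw_smr(y, e)
smooth = smoothed_relative_risk(y, e, alpha=2.0, beta=2.0)
```

The first two counties have unstable raw ratios of 0 and 3 that shrink toward 1, while the third, with 40 expected cases, stays close to its raw ratio of 1.25.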
![Page 81: Spatial statistics presentation Texas A&M Census RDC](https://reader036.vdocuments.us/reader036/viewer/2022062514/55839cb0d8b42aea578b4910/html5/thumbnails/81.jpg)
Building a Spatial Model
Take a Poisson-Lognormal model
$y_i \sim \mathrm{Pois}(e_i \theta_i)$
$\log(\theta_i) = \alpha + x'\beta + u_i$
What if we introduce more interesting structure on u?
A common form of spatial correlation structure is the CAR model, also called the Besag model
This is: $u_i \mid u_{j \neq i} \sim N(\bar{u}_j, \tau / n_j)$
Where $\bar{u}_j$ is the mean of the random effects of the neighboring j observations, and $n_j$ is the number of neighboring observations
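The conditional distribution above is easy to compute for a single area once the neighbors are known. The sketch below does this in plain Python, treating τ as a variance parameter as in the formula above. Some software parameterizes the model with a precision instead, so this is an assumption. The random effect values and adjacency list are hypothetical.

```python
def car_conditional(u, neighbors, tau, i):
    """Conditional mean and variance of u[i] given its neighbors."""
    nbrs = neighbors[i]
    n_nbrs = len(nbrs)
    cond_mean = sum(u[j] for j in nbrs) / n_nbrs   # neighbor average
    cond_var = tau / n_nbrs                        # shrinks with more neighbors
    return cond_mean, cond_var

u = [0.2, -0.1, 0.4, 0.0]                          # hypothetical random effects
neighbors = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
cond_mean, cond_var = car_conditional(u, neighbors, tau=0.5, i=0)
```

Areas with more neighbors get a tighter conditional distribution, which is what produces the spatial smoothing.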
![Page 82: Spatial statistics presentation Texas A&M Census RDC](https://reader036.vdocuments.us/reader036/viewer/2022062514/55839cb0d8b42aea578b4910/html5/thumbnails/82.jpg)
More models
Another model that is commonly used in practice is the convolution model, or the Besag, York and Mollié model
$y_i \sim \mathrm{Pois}(e_i \theta_i)$
$\log(\theta_i) = \alpha + x'\beta + u_i + v_i$
Where now $u_i$ is a correlated heterogeneity (CH) term and $v_i$ is an uncorrelated heterogeneity (UH) term
![Page 83: Spatial statistics presentation Texas A&M Census RDC](https://reader036.vdocuments.us/reader036/viewer/2022062514/55839cb0d8b42aea578b4910/html5/thumbnails/83.jpg)
Posterior Parameter Estimates
![Page 84: Spatial statistics presentation Texas A&M Census RDC](https://reader036.vdocuments.us/reader036/viewer/2022062514/55839cb0d8b42aea578b4910/html5/thumbnails/84.jpg)
Why Use R?
R is free (So are GeoDa and QGIS)
R is available in the RDC (So are GeoDa and QGIS)
R is extremely capable and flexible (So is QGIS)
R is a scripting and statistical language (Neither GeoDa nor QGIS is)
R is one stop shopping for many geospatial techniques
Spatial joins, projections, data merging, raster analysis, vector operations