Exploring Spatial Patterns in Your Data - MIT Libraries

EXPLORING SPATIAL PATTERNS IN YOUR DATA

OBJECTIVES

Learn how to examine your data using the Geostatistical Analysis tools in ArcMap.

Learn how to use descriptive statistics in ArcMap and GeoDa to analyze data.

Be able to identify Geostatistical Analysis tools that can be used for further analysis.

WHY EXPLORE YOUR DATA?

It allows you to better select an appropriate tool to analyze your data.

If you skip exploring your data, you may miss key information about it that may lead to incorrect conclusions and decisions.

GEODA VS. ARCMAP

GeoDa: free, open-source, simple software designed specifically for statistical analysis

ArcMap: proprietary GIS software that can perform statistical analysis along with hundreds of other analyses

GEODA VS. ARCMAP

With ArcMap you can view several data layers at once.

In GeoDa, you view only one data layer.

Some tools are found in both programs, while some are found in only one.

EXPLORE THE LOCATION OF YOUR DATA

Explore:

size of the study area

mean

median

direction data are oriented

You will see where data are clustered relative to the rest of the data.

MEAN CENTER

The geographic center for a set of features.

Constructed from the average x and y values for the input feature centroids (middle points, if input features are polygons).
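The averaging described above can be sketched in a few lines of Python. This is a minimal illustration, not ArcMap's implementation; the function name `mean_center` and the sample centroids are made up for the example.

```python
def mean_center(points):
    """Return the mean center (average x, average y) of a list of (x, y) points."""
    n = len(points)
    if n == 0:
        raise ValueError("need at least one point")
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    return mean_x, mean_y


# Hypothetical feature centroids (corners of a 4 x 2 rectangle).
centroids = [(0.0, 0.0), (4.0, 0.0), (4.0, 2.0), (0.0, 2.0)]
center = mean_center(centroids)
```

Because every point counts equally, a single distant point pulls the mean center toward it.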

MEDIAN CENTER

Median Center is robust to outliers.

Uses an iterative algorithm to find the point that minimizes the total travel distance from it to all other features in the dataset.

At each step (t) in the algorithm, a candidate Median Center (Xt, Yt) is found and refined until it represents the location that minimizes the Euclidean distance d to all features (i) in the dataset.
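One common way to carry out this kind of iterative refinement is Weiszfeld's algorithm. The sketch below assumes that approach; the slides do not say which algorithm ArcMap uses, and the names `median_center`, `tol`, and `max_iter` are illustrative.

```python
import math


def median_center(points, tol=1e-9, max_iter=1000):
    """Approximate the point minimizing total Euclidean distance to all points."""
    # Start from the mean center as the first candidate (Xt, Yt).
    x = sum(p[0] for p in points) / len(points)
    y = sum(p[1] for p in points) / len(points)
    for _ in range(max_iter):
        num_x = num_y = denom = 0.0
        for px, py in points:
            d = math.hypot(px - x, py - y)
            if d < 1e-12:
                # Candidate sits on a data point; skip it to avoid dividing by zero.
                continue
            w = 1.0 / d  # nearer features get larger weights
            num_x += w * px
            num_y += w * py
            denom += w
        if denom == 0.0:
            break
        new_x, new_y = num_x / denom, num_y / denom
        moved = math.hypot(new_x - x, new_y - y)
        x, y = new_x, new_y
        if moved < tol:
            break
    return x, y


# Hypothetical data: four clustered points plus one distant outlier.
pts = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (100.0, 100.0)]
mx, my = median_center(pts)
```

With this data the mean center is pulled out to (20.4, 20.4), while the median center stays inside the cluster, which illustrates the robustness to outliers noted above.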

DIRECTION DISTRIBUTION (STANDARD DEVIATIONAL ELLIPSE)

Standard deviational ellipses summarize the spatial characteristics of geographic features: central tendency, dispersion, and directional trends.

The ellipse allows you to see if the distribution of features is elongated and hence has a particular orientation.

When the underlying spatial pattern of features is concentrated in the center with fewer features toward the periphery (a spatial normal distribution),

a one standard deviation ellipse polygon will cover approximately 68 percent of the features

two standard deviations will contain approximately 95 percent of the features

three standard deviations will cover approximately 99 percent of the features
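As a rough sketch of how an orientation and axis lengths can be derived, the code below uses the covariance of the x and y coordinates. This is an assumption made for illustration: it is not claimed to match ArcMap's exact formula, scaling, or angle convention, and the function name is hypothetical.

```python
import math


def deviational_ellipse(points):
    """Return (center, angle_degrees, major_sd, minor_sd) for (x, y) points.

    The angle is measured counterclockwise from the x-axis to the major axis.
    """
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    # Population variances and covariance of the coordinates.
    sxx = sum((x - mx) ** 2 for x, _ in points) / n
    syy = sum((y - my) ** 2 for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points) / n
    # Direction of greatest spread (principal axis of the covariance matrix).
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    half_trace = (sxx + syy) / 2.0
    spread = math.hypot((sxx - syy) / 2.0, sxy)
    major_sd = math.sqrt(half_trace + spread)
    minor_sd = math.sqrt(max(half_trace - spread, 0.0))
    return (mx, my), math.degrees(theta), major_sd, minor_sd


# Hypothetical points lying along a diagonal line.
center, angle, major_sd, minor_sd = deviational_ellipse(
    [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]
)
```

A major axis much longer than the minor axis indicates an elongated distribution with a clear orientation.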

EXPLORE THE VALUES OF YOUR DATA

NORMAL DISTRIBUTION

Some analysis tools assume a normal distribution:

Mean and median are similar

Data are symmetrical
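A quick numeric check of these two properties might look like the sketch below. The skewness measure (third standardized moment) and the function name are assumptions added for illustration; the slides only mention comparing mean and median and checking symmetry.

```python
import statistics


def symmetry_summary(values):
    """Return mean, median, and skewness (third standardized moment)."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        skew = 0.0
    else:
        skew = sum((v - mean) ** 3 for v in values) / (len(values) * sd ** 3)
    return {"mean": mean, "median": median, "skewness": skew}


# Hypothetical datasets: one symmetric, one with a long right tail.
symmetric = symmetry_summary([1.0, 2.0, 3.0, 4.0, 5.0])
skewed = symmetry_summary([1.0, 1.0, 2.0, 2.0, 50.0])
```

A mean far from the median, or a skewness far from zero, suggests the data may need a transformation before using tools that assume normality.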

DATA FREQUENCY USING HISTOGRAMS

DATA DISTRIBUTION USING A QQ PLOT

[Figure: three example plots labeled "A normally distributed dataset", "Many characteristics of a normal dataset", and "Not normal".]

A normal QQ plot shows the relationship of your data to a normal distribution line.
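The points of a normal QQ plot can be computed as in the sketch below. The plotting position (i - 0.5) / n is one common convention and is an assumption here, as is the function name; ArcMap and GeoDa may use a different convention.

```python
from statistics import NormalDist, mean, pstdev


def normal_qq_pairs(values):
    """Return (theoretical normal quantile, standardized data value) pairs."""
    data = sorted(values)
    n = len(data)
    m, s = mean(data), pstdev(data)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    pairs = []
    for i, v in enumerate(data, start=1):
        p = (i - 0.5) / n  # assumed plotting position
        pairs.append((std_normal.inv_cdf(p), (v - m) / s))
    return pairs


# Hypothetical data values.
qq = normal_qq_pairs([3.0, 1.0, 5.0, 2.0, 4.0])
```

If the data are close to normal, the pairs fall near the line y = x; tail points that drift away from that line are worth a closer look.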

BOX PLOT

Displays the median and the interquartile range (IQR), which spans the 25th to 75th percentiles

Hinge = multiple of the interquartile range

MAPS

For examining data values and frequencies:

Quantile Map

Natural breaks

Equal intervals

For finding outliers:

Percentile Map

Box Map

Standard Deviation Map

QUANTILE MAP

Displays the distribution of values in categories with an equal number of observations in each category.

EQUAL INTERVAL MAP

Sets the value ranges in each category equal in size.

The entire range of data values is divided equally into however many categories have been chosen.
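The difference between quantile and equal interval classification can be seen in a small sketch. The function names are illustrative, and the quantile version ignores ties for simplicity, which real mapping software handles more carefully.

```python
def equal_interval_breaks(values, k):
    """Return the k - 1 interior class breaks that split the data range evenly."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + width * i for i in range(1, k)]


def quantile_classes(values, k):
    """Assign each value a class 0..k-1 so classes hold nearly equal counts."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    classes = [0] * len(values)
    for rank, idx in enumerate(order):
        classes[idx] = rank * k // len(values)
    return classes


# Hypothetical data with one extreme value.
data = [1, 2, 3, 4, 5, 6, 7, 100]
breaks = equal_interval_breaks(data, 4)
classes = quantile_classes(data, 4)
```

With this data, equal intervals put seven of the eight values into the lowest class, while quantiles place two values in each class.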

NATURAL BREAKS MAP

Seeks to reduce the variance within classes and maximize the variance between classes.

OTHER EXPLORATORY METHODS

Scatter Plot (2 variables)

Parallel coordinate plot (a pattern of lines is drawn that connects the coordinates of each observation across the variables on parallel x-axes)

DETECT OUTLIERS

OUTLIERS

Outliers can reveal mistakes, unusual occurrences, and shift points in data patterns (a valley in a mountain range).

You should use more than one method to find outliers because some techniques will only highlight data values near the two ends of your range.

PERCENTILE MAP

Groups ranked data into 6 categories

Lowest and highest 1% are potential outliers

BOX MAP

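Groups data into 4 categories (quartiles), plus 2 outlier categories, one at each end.

Data are outliers if they lie more than 1.5 or 3 times the interquartile range (IQR) below the first quartile or above the third quartile.

Detects outliers with more certainty than a percentile map.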
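The outlier rule used by a box map (and a box plot) can be sketched as follows. The quartile method (`"inclusive"`) is an assumption; different programs compute quartiles slightly differently, so fences may not match GeoDa or ArcMap exactly.

```python
from statistics import quantiles


def box_map_outliers(values, hinge=1.5):
    """Return (lower_fence, upper_fence, low_outliers, high_outliers)."""
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - hinge * iqr
    upper_fence = q3 + hinge * iqr
    low = [v for v in values if v < lower_fence]
    high = [v for v in values if v > upper_fence]
    return lower_fence, upper_fence, low, high


# Hypothetical data with one suspiciously large value.
values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
result = box_map_outliers(values, hinge=1.5)
```

Raising `hinge` to 3 widens the fences, so only more extreme values are flagged.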

STANDARD DEVIATION MAP

Displays data 3 standard deviations above and below the mean.

As a parametric map, it is sensitive to outliers.

SEMIVARIOGRAM CLOUD

When points closer together have greater differences in their values, this may indicate an outlier in the data.

The selected points may be outliers.
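A semivariogram cloud plots, for every pair of sample points, the distance between them against half the squared difference of their values. The sketch below computes those pairs; the function name and sample data are made up for the example.

```python
import math
from itertools import combinations


def semivariogram_cloud(samples):
    """samples: list of (x, y, value). Return (distance, semivariance, i, j) tuples."""
    cloud = []
    for (i, a), (j, b) in combinations(enumerate(samples), 2):
        distance = math.hypot(a[0] - b[0], a[1] - b[1])
        semivariance = 0.5 * (a[2] - b[2]) ** 2
        cloud.append((distance, semivariance, i, j))
    return cloud


# Hypothetical samples: the last point has a value unlike its close neighbors.
samples = [(0, 0, 10.0), (1, 0, 11.0), (0, 1, 10.5), (1, 1, 40.0)]
cloud = semivariogram_cloud(samples)

# Among close pairs (distance <= 1), find the one with the largest semivariance.
close_pairs = [c for c in cloud if c[0] <= 1.0]
worst = max(close_pairs, key=lambda c: c[1])
```

A sample index that keeps appearing in high-semivariance pairs at short distances, such as index 3 here, is a candidate outlier.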

VORONOI MAP

Cluster Voronoi maps show spatial outliers in your data; simple Voronoi maps can pinpoint data values that are many class breaks removed from surrounding polygons.

The gray polygons may be outliers.

HISTOGRAM

Values in the last bars to the left or right, if far removed from the adjacent values, may indicate outliers.

NORMAL QQ PLOT

Values at the tails of a normal QQ plot can also be outliers. This can happen when the tail values do not fall along the reference line.

BOXPLOT

Points outside the hinges (represented by the black, horizontal lines) may be outliers.

EXPLORE SPATIAL RELATIONSHIPS IN YOUR DATA

SPATIAL AUTOCORRELATION

Everything is related, but objects closer together are more related than objects farther apart.

Explore using a semivariogram graph or cloud

Can also be explored using Moran’s I and Getis-Ord G statistics

Height (sill) = variation between data values.

Range = distance between points at which the semivariogram flattens out.

As the distance between points increases, the height should increase: points farther away from each other are less related, so there should be more variation between them.

If a semivariogram is a horizontal line, there is no spatial autocorrelation.

VARIATION IN YOUR DATA

Many spatial statistics analysis techniques assume your data are stationary, meaning the relationship between two points and their values depends on the distance between them, not their exact location.

Explore variation using a Voronoi map.

A Voronoi map is created by defining Thiessen polygons around each point in your dataset.

Any location inside a polygon represents the area closer to that data point than to any other data point.

This allows you to explore the variation of each sample point based on its relationship to surrounding sample points.

A SIMPLE VORONOI MAP

A simple Voronoi map shows the data value at each location. The map is symbolized using a geometrical interval classification. This will show the variation in data values across your entire dataset.

Green = little local variation

Orange and Red = greater local variation

TYPES OF VORONOI MAPS

Simple: The value assigned to a polygon is the value recorded at the sample point within that polygon.

Mean: The value assigned to a polygon is the mean value that is calculated from the polygon and its neighbors.

Mode: All polygons are categorized using five class intervals. The value assigned to a polygon is the mode (most frequently occurring class) of the polygon and its neighbors.

Cluster: All polygons are categorized using five class intervals. If the class interval of a polygon is different from each of its neighbors, the polygon is colored gray and put into a sixth class to distinguish it from its neighbors.

Entropy: All polygons are categorized using five classes based on a natural grouping of data values (smart quantiles). The value assigned to a polygon is the entropy that is calculated from the polygon and its neighbors.

Entropy = -Σ (p_i * log p_i), where p_i is the proportion of polygons assigned to each class.
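The cluster and entropy rules can be illustrated with a small sketch. It assumes each polygon already has a class number and a list of neighboring polygons (the hypothetical `classes` and `neighbors` dictionaries); building the Voronoi polygons themselves is not shown. The logarithm base is not stated in the slides, so base 2 is assumed here.

```python
import math
from collections import Counter


def is_cluster_outlier(cell, classes, neighbors):
    """True if the cell has neighbors and its class differs from every one of them."""
    nbrs = neighbors[cell]
    return bool(nbrs) and all(classes[n] != classes[cell] for n in nbrs)


def local_entropy(cell, classes, neighbors):
    """Entropy of class membership among the cell and its neighbors (log base 2)."""
    group = [classes[cell]] + [classes[n] for n in neighbors[cell]]
    total = len(group)
    counts = Counter(group)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical polygons: "d" is in class 5, everything else in class 1.
classes = {"a": 1, "b": 1, "c": 1, "d": 5}
neighbors = {"a": ["b", "c", "d"], "b": ["a", "c"], "c": ["a", "b"], "d": ["a"]}
```

Here `"d"` would be colored gray on a cluster map, and entropy is zero wherever a polygon and all its neighbors share one class.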

EXPLORE TRENDS IN YOUR DATA

TREND ANALYSIS

You can use the trend analysis tool in ArcMap to visually compare the trend lines with any patterns in your data.

When exploring trends, your data locations are mapped along the x- and y-axes. The values of each data location are mapped as height (z-axis).

Trends are analyzed based on direction and on the order of the line that fits the trend. The trend line is a mathematical function, or polynomial, that describes the variation in the data.

These polynomials show a clear curve, indicating a second-order trend in the data.

You can determine whether the order of the polynomial fits your data based on the shape created by the line.

A second-order polynomial will appear as an upward or a downward curve (known as a parabola).
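To see what fitting a trend line of a given order means numerically, the sketch below fits a polynomial to values plotted against one coordinate (for example, z against x) by ordinary least squares. It is a plain-Python illustration with a hypothetical function name, not the ArcMap tool's implementation.

```python
def fit_polynomial(t, z, order=2):
    """Least-squares coefficients [c0, c1, ..., c_order] for z ~ sum(c_k * t**k)."""
    size = order + 1
    # Normal equations (A^T A) c = A^T z, where A[i][k] = t[i] ** k.
    ata = [[sum(ti ** (r + c) for ti in t) for c in range(size)] for r in range(size)]
    atz = [sum(zi * ti ** r for ti, zi in zip(t, z)) for r in range(size)]
    aug = [row + [rhs] for row, rhs in zip(ata, atz)]
    # Gaussian elimination with partial pivoting.
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, size):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, size + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution.
    coeffs = [0.0] * size
    for r in range(size - 1, -1, -1):
        known = sum(aug[r][c] * coeffs[c] for c in range(r + 1, size))
        coeffs[r] = (aug[r][size] - known) / aug[r][r]
    return coeffs


# Hypothetical values following z = 2 - 3x + 0.5x^2 along the x-axis.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
zs = [2.0, -0.5, -2.0, -2.5, -2.0]
c0, c1, c2 = fit_polynomial(xs, zs, order=2)
```

A second-order coefficient (`c2`) clearly different from zero corresponds to the curved, parabola-shaped trend line described above.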

SELECTING AN ANALYSIS TECHNIQUE

Each of the following techniques is a type of interpolation. Interpolation creates surfaces based on spatially continuous data.

Each surface uses the values and locations of your points to create (or interpolate) the values for the remaining points in the surface.

GEOSTATISTICAL INTERPOLATION

Creates surfaces using the relationships between your data locations and their values.

Predicts values based on your existing data.

Assumptions:

Data is not clustered. (Simple kriging technique has a declustering option.)

Data is normally distributed. (Transformation options are available.)

Data is stationary (no local variation).

Data is autocorrelated.

Data has no local trends. (You can remove trends from data as part of the interpolation process.)

GLOBAL DETERMINISTIC INTERPOLATION

Creates surfaces using the existing values at each location.

Uses your entire dataset to create your surface.

Assumptions:

Outliers have been removed from the data.

Global trends exist in the data.

LOCAL DETERMINISTIC INTERPOLATION

Uses several subsets, or neighborhoods, within an entire dataset to create the different components of the surface.

Assumption:

Data is normally distributed.

INVERSE DISTANCE WEIGHTED INTERPOLATION (IDW)

A type of local deterministic interpolation.

Assumptions:

Data is not clustered.

Data is autocorrelated.
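The core of IDW is a weighted average in which nearer samples count more. The sketch below is a simplified illustration: it uses every sample rather than a search neighborhood, and the power of 2 and the function name are assumptions, not settings taken from the slides.

```python
import math


def idw_predict(samples, x, y, power=2.0):
    """Predict a value at (x, y) from samples given as (x, y, value) tuples."""
    num = denom = 0.0
    for sx, sy, value in samples:
        d = math.hypot(sx - x, sy - y)
        if d == 0.0:
            return value  # prediction at a sample location equals the sample
        w = 1.0 / d ** power  # weight falls off with distance
        num += w * value
        denom += w
    return num / denom


# Hypothetical samples on a line.
samples = [(0.0, 0.0, 10.0), (2.0, 0.0, 20.0)]
midpoint = idw_predict(samples, 1.0, 0.0)
near_first = idw_predict(samples, 0.5, 0.0)
```

Because the weights depend only on distance, tightly clustered samples can dominate a prediction, which is one reason the method assumes data are not clustered.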

OTHER SPATIAL STATISTICAL TESTS

Tests for spatial autocorrelation

Getis-Ord General G and Global Moran’s I (to determine overall clustering and dispersion of values)

Hot Spot Analysis (Getis-Ord Gi*) and Anselin’s Local Moran’s I (to determine specific clusters of high and low values)
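As an illustration of a global test, Global Moran's I can be computed directly from its standard formula. The sketch assumes a hypothetical `weights` dictionary holding a spatial weight for each pair of neighboring features; it returns only the index and does not compute the z-score or p-value that the software reports.

```python
def morans_i(values, weights):
    """Global Moran's I; weights maps (i, j) index pairs to spatial weight w_ij."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    s0 = sum(weights.values())  # sum of all spatial weights
    cross = sum(w * dev[i] * dev[j] for (i, j), w in weights.items())
    total = sum(d * d for d in dev)
    return (n / s0) * (cross / total)


# Hypothetical: four features in a row, each adjacent pair is a neighbor.
adjacency = {(0, 1): 1, (1, 0): 1, (1, 2): 1, (2, 1): 1, (2, 3): 1, (3, 2): 1}
clustered = morans_i([1.0, 1.0, 5.0, 5.0], adjacency)
alternating = morans_i([1.0, 5.0, 1.0, 5.0], adjacency)
```

Positive values indicate that similar values sit near each other (clustering); negative values indicate that neighbors tend to differ (dispersion).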

Regression

Used to evaluate relationships between two or more feature attributes. Are location, crime rates, racial makeup, and income related to housing values in a census tract?