exploring spatial correlation in rivers by joshua french

Post on 29-Mar-2015

EXPLORING SPATIAL CORRELATION IN RIVERS

by Joshua French

Introduction

A city is required to extend its sewage pipelines farther into its bay to meet EPA requirements.

How far should the pipelines be extended?

The city doesn’t want to spend any more money than necessary to extend the pipelines. It needs to find a way to make predictions for the waste levels at different sites in the bay.

With the passage of the Clean Water Act in the 1970s, spatial analysis of aquatic data has become even more important.

Section 305(b) requires state governments to provide “a description of the water quality of all navigable waters in such State. . .”

It is not physically or financially possible to make measurements at all sites. Some sort of spatial interpolation will need to be used.

Usually we might try to fit some sort of linear model to the data to make predictions. Usually we assume observations are independent.

For spatial data, however, we intuitively know that two sampling sites close together will probably be similar.

We would expect that two sites in close proximity would be more similar than two sites separated by a great distance.

We can use the correlation between sampling sites to make better predictions with our model.

The Ohio River

The Road Ahead

- Methods
  - Introduction to the Variogram
  - Exploratory Analysis
  - Sample Variogram
  - Modeling the Variogram
- Analysis
  - 3 types of results
- Conclusions
- Future Work

Introduction to the Variogram

Spatial data is often viewed as a stochastic process.

For each point x, a specific property Z(x) is viewed as a random variable with mean µ, variance σ², higher-order moments, and a cumulative distribution function.

Each individual Z(xi) is assumed to have its own distribution, and the set {Z(x1),Z(x2),…} is a stochastic process.

The data values in a given data set are simply a realization of the stochastic process.

We want to measure the relationship between different points. Define the covariance for Z(xj) and Z(xk) to be:

Cov(Z(xj),Z(xk))=E[{Z(xj)-µ(xj)} {Z(xk)-µ(xk)}]

where µ(xj) and µ(xk) are the means of Z at the respective locations.

However, we have a problem. We don’t know the means at each point because we only have one realization.

To solve this, we must assume some sort of stationarity – certain features of the distribution are identical everywhere.

We will work with data that satisfies second-order stationarity.

Second-order stationarity means that the mean is the same everywhere: i.e. E[Z(xj)]=µ for all points xj.

It also implies that Cov(Z(xj),Z(xk)) depends only on the distance between xj and xk.

Thus,

Cov(Z(xj),Z(xk)) = Cov(Z(x),Z(x+h))

= Cov(h)

where h measures the distance between two points.

We can then derive that

Cov(Z(x),Z(x+h)) = E[(Z(x)-µ)(Z(x+h)-µ)]

= E[Z(x)Z(x+h)] - µ²

Sometimes it is clear that our data is not second-order stationary.

Georges Matheron solved this problem in 1965 by establishing his “intrinsic hypothesis”.

For small distances h, Matheron held that

E[Z(x)-Z(x+h)]=0

Looking at the variance of differences, this leads to

Var[Z(x)-Z(x+h)] = E[(Z(x)-Z(x+h))²]

= 2γ(h)

Intrinsic stationarity is useful because analysis may be conducted even if second-order stationarity is violated. Unfortunately, the covariance function need not be defined under intrinsic stationarity alone.

For this reason, we will work with data that is second-order stationary. If the original data violates second-order stationarity, we will apply additional procedures (such as removing a trend or transforming the variable) so that the data analyzed is second-order stationary.

Note that second-order stationarity implies intrinsic stationarity, so the variogram equation is still defined.

Under second-order stationarity, γ(h)=Cov(0)-Cov(h).
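This identity follows in a few lines from the definitions already given. A sketch of the derivation, assuming second-order stationarity so that Var[Z(x)] = Cov(0) at every point x:

```latex
\begin{align*}
2\gamma(h) &= \mathrm{Var}[Z(x) - Z(x+h)] \\
           &= \mathrm{Var}[Z(x)] + \mathrm{Var}[Z(x+h)] - 2\,\mathrm{Cov}(Z(x), Z(x+h)) \\
           &= \mathrm{Cov}(0) + \mathrm{Cov}(0) - 2\,\mathrm{Cov}(h) \\
\gamma(h)  &= \mathrm{Cov}(0) - \mathrm{Cov}(h)
\end{align*}
```

So when the covariance decays to zero at large lags, the variogram levels off at Cov(0), the variance of the process.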

γ(h) is known as the semi-variogram. In practice, however, it is usually referred to simply as the variogram.

Things to know about variograms:

1. γ(h)= γ(-h). Because it is an even function, usually only positive lag distances are shown.

2. Nugget effect - by definition, γ(0)= 0. In practice, however, sample variograms often have a positive value at lag 0. This is called the “nugget effect”.

3. Variograms tend to increase monotonically with lag distance

4. Sill – the maximum variance of the variogram

5. Range – the lag distance at which the sill is reached

The following figure shows these features

[Figure: Variogram Example. Variance versus lag distance, with the nugget, sill, and range labeled.]

Exploratory Analysis

Before we model variograms, we should explore the data.

We need to make sure that the data analyzed satisfies second-order stationarity.

We need to check for outliers.

We need to make sure that the data is not too badly skewed (a skewness coefficient G1>1 signals a problem).

We can look at the river data as a one-dimensional linear system. It is fairly easy to check for stationarity using a scatter plot.

[Figure: scatter plot of Square Root of Percent Invertivore versus RMI.]

If there is an obvious trend in the data, we should remove it and analyze the residuals.
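A minimal sketch of trend removal along the river, assuming the trend is a straight line in RMI fit by ordinary least squares. The function names `fit_line` and `detrend` and the data values are hypothetical, chosen only for illustration.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = a + b*x. Returns (a, b)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def detrend(x, y):
    """Residuals of y after subtracting the fitted straight line."""
    a, b = fit_line(x, y)
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]


# Made-up values for illustration only (not ORSANCO data).
rmi = [0.0, 100.0, 200.0, 300.0, 400.0]
value = [10.0, 12.5, 14.0, 17.5, 20.0]

a, b = fit_line(rmi, value)
residuals = detrend(rmi, value)
```

The residuals, rather than the raw values, would then be carried forward into the variogram analysis.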

If the variance increases or decreases with lag distance, then we should transform the variable to correct this.

To check for outliers, we may use a typical boxplot.

If the data contains outliers, we should do analysis both with and without outliers present.

[Figure: boxplot used to check for outliers.]
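The boxplot check can also be done numerically. The sketch below assumes the common convention of flagging points more than 1.5 interquartile ranges beyond the quartiles; the talk does not say which rule was used, and plotting packages differ slightly in how they compute quartiles. The name `boxplot_outliers` and the sample values are hypothetical.

```python
import statistics


def boxplot_outliers(values, k=1.5):
    """Return the values lying more than k*IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Made-up values for illustration only.
sample = [4.1, 4.8, 5.0, 5.2, 5.3, 5.5, 5.9, 6.0, 8.9]
flagged = boxplot_outliers(sample)
```

Running the analysis once on `sample` and once with `flagged` removed mirrors the with-and-without-outliers comparison described above.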

If G1>1, then we should transform the data to approximate normality if possible. To check approximate normality, the standard qqplot can be used.

[Figure: normal Q-Q plot. Observed values versus quantiles of the standard normal.]
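A rough sketch of the skewness check. The talk does not define G1, so this assumes it is the adjusted Fisher-Pearson sample skewness coefficient; the function name `skewness_g1` and the counts are hypothetical.

```python
import math


def skewness_g1(values):
    """Adjusted Fisher-Pearson sample skewness (assumed definition of G1)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    g1 = m3 / m2 ** 1.5
    return math.sqrt(n * (n - 1)) / (n - 2) * g1


# Made-up right-skewed counts for illustration only.
counts = [12, 15, 20, 22, 30, 41, 55, 90, 400, 2500]

raw_skew = skewness_g1(counts)
log_skew = skewness_g1([math.log(c) for c in counts])
```

For right-skewed data like this, the logarithm pulls in the long upper tail, so `log_skew` comes out smaller than `raw_skew`.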

The Sample Variogram

One of the previous definitions of semivariance is:

$$\gamma(h) = \frac{1}{2} E\left[ \{Z(x) - Z(x+h)\}^2 \right]$$

The logical estimator is:

$$\hat{\gamma}(h) = \frac{1}{2N(h)} \sum_{j=1}^{N(h)} \left[ z(x_j) - z(x_j + h) \right]^2$$

where N(h) is the number of pairs of observations associated with that lag.
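The estimator can be sketched for one-dimensional river data as follows. Because sampling sites are rarely spaced at exact lags, this sketch assumes pairs are grouped into lag bins of fixed width; the binning scheme, the function name `sample_variogram`, and the data are assumptions for illustration, not the procedure used in the talk.

```python
def sample_variogram(positions, values, bin_width, max_lag):
    """Return (mean lag, semivariance, pair count) for each non-empty lag bin.

    Bin k collects pairs whose separation lies in [k*bin_width, (k+1)*bin_width).
    """
    n_bins = int(max_lag / bin_width)
    sq_sums = [0.0] * n_bins   # sum of squared differences per bin
    lag_sums = [0.0] * n_bins  # sum of pair separations per bin
    counts = [0] * n_bins      # N(h): number of pairs per bin
    for i in range(len(values)):
        for j in range(i + 1, len(values)):
            lag = abs(positions[i] - positions[j])
            k = int(lag / bin_width)
            if k >= n_bins:
                continue  # pair is farther apart than max_lag
            sq_sums[k] += (values[i] - values[j]) ** 2
            lag_sums[k] += lag
            counts[k] += 1
    return [
        (lag_sums[k] / counts[k], sq_sums[k] / (2 * counts[k]), counts[k])
        for k in range(n_bins)
        if counts[k] > 0
    ]


# Made-up data with a pure linear trend, for illustration only.
rmi = [0.0, 1.0, 2.0, 3.0, 4.0]
z = [1.0, 2.0, 3.0, 4.0, 5.0]
variogram = sample_variogram(rmi, z, bin_width=1.0, max_lag=4.0)
```

With this trending toy data the semivariance keeps growing with lag instead of leveling off at a sill, which is one reason a trend should be removed before the variogram is modeled.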

[Figure: Sample Variogram Example. Variance versus lag distance.]

Modeling the Variogram

Our goal is to estimate the true variogram of the data.

Four variogram models were fit to the sample variogram: the spherical, Gaussian, exponential, and Matérn models.

[Figure: Variogram Models. Variance versus lag distance for the exponential, spherical, Gaussian, and Matérn models.]
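For reference, three of the four model shapes can be written down directly. This is a sketch under stated assumptions: `sill` is the total sill (nugget included), `rng` is the range or scale parameter, and the exponential and Gaussian forms use one common textbook parameterization. Software packages differ in how they scale the range for these two models, and the talk does not say which convention was used. The Matérn model is omitted because it needs a modified Bessel function that is not in the Python standard library.

```python
import math


def spherical(h, nugget, sill, rng):
    """Spherical model: reaches the sill exactly at h = rng."""
    if h == 0:
        return 0.0
    if h >= rng:
        return sill
    r = h / rng
    return nugget + (sill - nugget) * (1.5 * r - 0.5 * r ** 3)


def exponential(h, nugget, sill, rng):
    """Exponential model: approaches the sill asymptotically."""
    if h == 0:
        return 0.0
    return nugget + (sill - nugget) * (1.0 - math.exp(-h / rng))


def gaussian(h, nugget, sill, rng):
    """Gaussian model: flat near the origin, approaches the sill asymptotically."""
    if h == 0:
        return 0.0
    return nugget + (sill - nugget) * (1.0 - math.exp(-(h / rng) ** 2))


# Made-up parameters for illustration only.
lags = [0.0, 1.0, 2.5, 5.0, 10.0]
curves = {
    "spherical": [spherical(h, 0.2, 1.0, 5.0) for h in lags],
    "exponential": [exponential(h, 0.2, 1.0, 5.0) for h in lags],
    "gaussian": [gaussian(h, 0.2, 1.0, 5.0) for h in lags],
}
```

Each function returns 0 at lag 0 and jumps to the nugget immediately afterward, matching the definition of the nugget effect given earlier.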

The spherical model is fit by least squares.

The exponential, Gaussian, and Matérn models are fit by maximum likelihood.

The spherical model is fit to get an estimate of the sill, nugget, and range.

These estimates will be used to fit the other three models.

The “best model” will be the model that minimizes the AICC statistic.
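A sketch of the model comparison step, assuming the standard small-sample corrected AIC formula, -2 log L + 2k + 2k(k+1)/(n-k-1). The talk does not state the exact form of the AICC statistic it used, and the log-likelihoods and parameter counts below are hypothetical placeholders, not results from the analysis.

```python
def aicc(log_likelihood, n_params, n_obs):
    """Small-sample corrected AIC (assumed form): -2 log L + 2k + 2k(k+1)/(n-k-1)."""
    k, n = n_params, n_obs
    if n - k - 1 <= 0:
        raise ValueError("need n_obs > n_params + 1")
    return -2.0 * log_likelihood + 2.0 * k + 2.0 * k * (k + 1) / (n - k - 1)


# Hypothetical (log-likelihood, number of parameters) pairs, NOT results from the talk.
candidates = {
    "exponential": (-101.3, 3),
    "gaussian": (-103.9, 3),
    "matern": (-101.1, 4),
}
n_obs = 200

scores = {name: aicc(ll, k, n_obs) for name, (ll, k) in candidates.items()}
best = min(scores, key=scores.get)  # model with the smallest AICC
```

In this made-up example the Matérn fit has a slightly higher likelihood, but the penalty for its extra parameter leaves the exponential model with the smaller score.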

Analysis

The data analyzed is a set of particle size and biological variables for the Ohio River.

The data was collected by the Ohio River Valley Water Sanitation Commission, better known as ORSANCO.

ORSANCO data collection

There were between 190 and 235 unique sampling sites, depending on the variable.

Some sites had more than one observation. In these situations, the average value for the site was used for analysis.
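The averaging step is simple to sketch, assuming each record is a (site, value) pair with the site identified by its river mile. The function name `average_by_site` and the records are hypothetical.

```python
from collections import defaultdict


def average_by_site(records):
    """Collapse repeated observations to one mean value per site."""
    grouped = defaultdict(list)
    for site, value in records:
        grouped[site].append(value)
    return {site: sum(vals) / len(vals) for site, vals in grouped.items()}


# Made-up (river mile, value) records for illustration only.
obs = [(12.5, 30.0), (12.5, 34.0), (40.0, 18.0)]
site_means = average_by_site(obs)
```

After this step every site contributes exactly one value, so no pair of observations has a lag of zero.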

[Figure: Ohio River Sampling Sites. Latitude (NAD27) versus longitude (NAD27), with Pittsburgh, PA; Cincinnati, OH; Louisville, KY; and Cairo, IL marked.]

There were two main types of data: particle size data and biological levels.

The particle size data measured percent gravel, percent sand, percent fines, percent hardpan, percent boulder, and percent cobble.

The biological data measured:
- Number of individuals at a site
- Number of species at a site
- Percent tolerant fish
- Percent simple lithophilic fish (fish that lay eggs on rocks)
- Percent non-native fish
- Percent detritivore fish (fish that eat mostly decomposed plants or animals)
- Percent invertivore fish (fish that eat mostly invertebrate animals)
- Percent piscivore fish (fish that eat mostly other fish)

The results of the analysis fell into three main groups:

- Sample variogram fit well

- Sample variogram did not fit well

- Analysis not reasonable

Good Results: Number of Individuals at a site

The skewness coefficient of the data is 8.16. This is much too high.

The data is transformed using the natural logarithm.

The new skewness coefficient is reduced to .56. Not perfect, but much less skewed.

[Figure: Check Normality of log(Num Individuals). Normal Q-Q plot of log(Number of Individuals) versus quantiles of the standard normal.]

[Figure: Check Second-Order Stationarity of log(Num Individuals). Scatter plot of log(Number of Individuals) versus RMI.]

[Figure: Check for outliers of log(Num Individuals). Boxplot.]

There are a number of outliers for the transformed variable.

We should do the analysis with and without the outliers present.

[Figure: log(Num Individuals) Sample Variogram with outliers. Variance versus lag distance (mi).]

[Figure: Check normality of log(Num Individuals) without outliers. Normal Q-Q plot versus quantiles of the standard normal.]

[Figure: log(Num Individuals) Sample Variogram without outliers. Variance versus lag distance (mi).]

We were not able to model the sample variogram perfectly, but we were able to detect some amount of spatial correlation in the data, especially when the outliers were removed.

For the transformed variable without outliers, the exponential model estimated the nugget to be .20, the sill to be .2709, and the range to be 37.7 miles.
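One way to read these estimates is to compare the nugget with the sill. A short worked calculation, assuming the reported sill is the total sill (nugget included), as in the definition of the sill given earlier:

```latex
\begin{align*}
\text{sill} - \text{nugget} &= 0.2709 - 0.20 = 0.0709 \\
\frac{\text{nugget}}{\text{sill}} &= \frac{0.20}{0.2709} \approx 0.74
\end{align*}
```

Under that assumption, roughly three quarters of the sill is nugget and about one quarter is spatially structured, which is consistent with detecting only some amount of spatial correlation.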

Poor Results: Percent Sand

The skewness coefficient is only .18, so skewness is not a major factor.

Check second-order stationarity using a scatter plot.

[Figure: Check Stationarity of Percent Sand. Scatter plot of Percent Sand versus RMI.]

There appears to be a trend in the data.

After removing the trend, the data appears to be second-order stationary.

The residuals are also approximately normal.

[Figure: Check stationarity of percent sand residuals. Scatter plot of Percent Sand Residuals versus RMI.]

[Figure: Check normality of percent sand residuals. Normal Q-Q plot of the residuals versus quantiles of the standard normal.]

[Figure: Sample Variogram of percent sand residuals. Variance versus lag distance (mi).]

The sample variogram does not really increase monotonically with distance.

Our variogram models cannot fit this very well.

Though we can obtain estimates of the nugget, sill, and range, the estimates cannot be trusted.

No results: Percent Hardpan

This variable was so badly skewed that analysis was not reasonable.

The skewness coefficient is 12.38. This is extremely high.

[Figure: QQplot of Percent Hardpan. Percent Hardpan versus quantiles of the standard normal.]

[Figure: Scatter plot of Percent Hardpan. Percent Hardpan versus RMI.]

The data is nearly all zeros!

There is also an erroneous data value. A percentage cannot be greater than 100%.

Data analysis does not seem reasonable. Our data does not meet the conditions necessary to use the spatial methods discussed.

Conclusions

Able to fit sample variogram reasonably well – percent gravel, number of individuals, number of species

Not able to fit sample variogram well – percent sand, percent detritivore, percent simple lithophilic individuals, percent invertivore

No results – remaining variables

Summary of Results

| Response | Transformation | Trend Removed | Model | Nugget | Sill | Range (miles) |
|---|---|---|---|---|---|---|
| Percent Gravel | | | Exponential | 286.09 | 335.53 | 72.9 |
| Percent Sand | | 38.1082+.0330x | Gaussian | 520.88 | 658.32 | 71.67 |
| Percent Cobble | | | | | | |
| Percent Hardpan | | | | | | |
| Percent Fines | | | | | | |
| Percent Boulder | | | | | | |
| Number of Individuals | Natural Log | | Gaussian | 0.29 | 0.39 | 44.19 |
| Number of Individuals | Natural Log (no outliers) | | Exponential | 0.2 | 0.27 | 37.69 |
| Number of Native Species | | 17.7849-.0042x | Gaussian | 10.1 | 11.87 | 39.93 |
| Percent Tolerant Individuals | | | | | | |
| Percent Lithophilic Individuals | Square Root | 15.5364-.0023x | Matern | 0.92 | 2.76 | 44.02 |
| Percent Nonnative Individuals | | | | | | |
| Percent Detritivore | Square Root | | Exponential | 1.09 | 1.57 | 24.08 |
| Percent Detritivore | Square Root (no outliers) | | Exponential | 0.94 | 1.4 | 19.17 |
| Percent Invertivore | Square Root | 6.5207-.0039x | Exponential | 1.4 | 2.97 | 13.43 |
| Percent Piscivore | | | | | | |

Future Work

Data set involving three streams in Norfolk, Virginia. Each stream has 25 observations. Collected by researchers at Old Dominion University.

Difficulties to overcome
- What is the best way to measure distance between points?
- Few observations
- Overlapping points after coordinate conversion

Problem: What is the best way to measure distance between points?

There is some aspect of two-dimensionality to the data, but it is still really a one-dimensional problem.
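One candidate answer is to measure distance along the stream channel rather than in a straight line. The sketch below is only an illustration of that idea, not a method from the talk: it assumes the channel is available as a polyline of projected coordinates and that sampling sites sit at its vertices. The name `cumulative_distance` and the coordinates are hypothetical.

```python
import math


def cumulative_distance(path):
    """Cumulative along-channel distance at each vertex of a polyline."""
    total = [0.0]
    for (x0, y0), (x1, y1) in zip(path, path[1:]):
        total.append(total[-1] + math.hypot(x1 - x0, y1 - y0))
    return total


# Made-up channel centerline in projected coordinates, for illustration only.
channel = [(0.0, 0.0), (3.0, 4.0), (3.0, 10.0), (0.0, 14.0)]
along = cumulative_distance(channel)

# Distance between the first and last sites, measured two ways.
stream_dist = along[3] - along[0]
straight_dist = math.hypot(channel[3][0] - channel[0][0],
                           channel[3][1] - channel[0][1])
```

Along-channel distance reduces the problem to the same one-dimensional setting used for the Ohio River, at the cost of ignoring any correlation that acts across a bend rather than along it.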

[Figure: Paradise Creek Region of Interest.]

[Figure: Paradise Creek Sampling Sites. UTM Y versus UTM X.]

Problem: 25 observations per stream is considered the minimum number of points needed to create a variogram.

- the sample variogram will be very rough

- our variogram model estimates will probably be bad

To correct this, we will explore the possibility of combining the data from the three streams.

Problem: Overlapping points after conversion

- Original data in longitude/latitude coordinates

- Convert to UTM coordinates so that Euclidean distance makes sense

- Converted UTM coordinates often result in overlapping sites (and even fewer unique sampling sites)

[Figure: Stream Sampling Sites (Lat/Long). Latitude (NAD27) versus longitude (NAD27).]

[Figure: Stream Sampling Sites (UTM). UTM Y versus UTM X.]

Acknowledgments

- My committee: Dr. Urquhart, Dr. Wang, and Dr. Theobald

- Dr. Davis and Dr. Reich for answering my spatial questions and letting me use their S-Plus spatial library

Concluding Thought

Before you criticize someone, you should walk a mile in their shoes. That way, when you criticize them, you’re a mile away and you have their shoes.

- Jack Handey
