Spatial Data Mining: Three Case Studies

Presented by: Chang-Tien Lu

Spatial Database Lab
Department of Computer Science
University of Minnesota

[email protected]
http://www.cs.umn.edu/research/shashi-group

Group Members: Shashi Shekhar, Weili Wu, Yan Huang, C.T. Lu


Outline

• Introduction
• Case 1: Location Prediction
• Case 2: Spatial Association: Co-location
• Case 3: Spatial Outlier Detection
• Conclusion and Future Directions


Introduction: Spatial Data Mining

Spatial databases are too large to analyze manually
• NASA Earth Observation System (EOS)
• National Institute of Justice: crime mapping
• Census Bureau, Dept. of Commerce: census data

Spatial data mining
• Discover frequent and interesting spatial patterns for post-processing (knowledge discovery)
• Pattern examples: spatial outliers, location prediction, clustering, spatial association, trends, ...

Historical example
• London, 1854: cholera and the water pump


Framework

• Problem statement: capture special needs
• Data exploration: maps
• Try reusing classical methods: data mining, spatial statistics
• Invent new methods if reuse is not applicable
• Develop efficient algorithms
• Validation, performance tuning


Case 1: Location Prediction

Problem: predict nesting sites in marshes
• Given vegetation, water depth, distance to edge, etc.

Data: maps of nests and attributes
• Spatially clustered nests, spatially smooth attributes

Classical methods: logistic regression, decision trees, Bayesian classifier
• But the independence assumption is violated: these methods miss spatial auto-correlation

Spatial auto-regression (SAR)
• Open issue: spatial accuracy vs. classification accuracy
• Open issue: performance (SAR learning is slow)


Given:
1. Spatial framework: $S = \{s_1, \ldots, s_n\}$
2. Explanatory functions: $f_{X_k} : S \rightarrow \mathbb{R}$
3. A dependent class: $f_C : S \rightarrow C = \{c_1, \ldots, c_M\}$
4. A family of function mappings: $\mathbb{R} \times \cdots \times \mathbb{R} \rightarrow C$

Find: classification model $\hat{f}_C$

Objective: maximize classification_accuracy$(\hat{f}_C, f_C)$

Constraints: spatial autocorrelation exists

Location Prediction

[Figure: maps of nest locations, distance to open water, vegetation durability, and water depth]


Evaluation: Change Model

Linear regression
• $y = X\beta + \epsilon$

Spatial Autoregression Model (SAR)
• $y = \rho W y + X\beta + \epsilon$
• $W$ models neighborhood relationships, $\rho$ models the strength of spatial dependencies, $\epsilon$ is the error vector

Mixed Spatial Autoregression Model (MSAR)
• $y = \rho W y + X\beta + W X \gamma + \epsilon$
• Considers the impact of the explanatory variables from the neighboring observations (coefficient vector $\gamma$)
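To make the SAR equation concrete, the sketch below generates $y$ for a given $\rho$ by solving $(I - \rho W)y = X\beta + \epsilon$ with NumPy. The 4-cell strip, the row-normalized W, and the values of X, beta, and rho are illustrative assumptions, not data from the nesting study.

```python
import numpy as np

def simulate_sar(W, X, beta, rho, eps):
    """Solve y = rho*W*y + X*beta + eps for y, i.e. (I - rho*W) y = X*beta + eps."""
    n = W.shape[0]
    A = np.eye(n) - rho * W
    # Dense solve; cost grows cubically with the number of locations.
    return np.linalg.solve(A, X @ beta + eps)

# Hypothetical 4-cell strip: each cell neighbors the cells next to it.
adjacency = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
], dtype=float)
W = adjacency / adjacency.sum(axis=1, keepdims=True)  # row-normalized weights

X = np.array([[1.0, 0.2], [1.0, 0.4], [1.0, 0.6], [1.0, 0.8]])  # intercept + one attribute
beta = np.array([0.5, 2.0])
eps = np.zeros(4)  # noise switched off so the two models can be compared directly

y_linear = simulate_sar(W, X, beta, rho=0.0, eps=eps)  # reduces to y = X*beta
y_sar = simulate_sar(W, X, beta, rho=0.5, eps=eps)     # neighbors now influence y

print(y_linear)
print(y_sar)
```

With rho set to 0 the model collapses to ordinary linear regression, which is one way to see SAR as a generalization of the classical model.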


Measure: ROC Curve

Classification accuracy: confusion matrix

Receiver Operating Characteristic (ROC)
• ROC curve: locus of the pair (TPR, FPR) for each cut-off probability
• TPR = AnPn / (AnPn + AnPnn)
• FPR = AnnPn / (AnnPn + AnnPnn)
• An / Ann: actual nest / actual no-nest; Pn / Pnn: predicted nest / predicted no-nest
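The TPR and FPR definitions can be traced with a small sketch that sweeps a few cut-off probabilities and counts the four confusion-matrix cells. The labels and predicted probabilities below are made up for illustration, and the function name `roc_points` is hypothetical.

```python
def roc_points(actual, prob, cutoffs):
    """Return (cutoff, TPR, FPR) triples; actual[i] is True where a nest exists."""
    points = []
    for c in cutoffs:
        predicted = [p >= c for p in prob]
        pairs = list(zip(actual, predicted))
        an_pn = sum(a and p for a, p in pairs)                # actual nest, predicted nest
        an_pnn = sum(a and not p for a, p in pairs)           # actual nest, predicted no-nest
        ann_pn = sum((not a) and p for a, p in pairs)         # actual no-nest, predicted nest
        ann_pnn = sum((not a) and (not p) for a, p in pairs)  # actual no-nest, predicted no-nest
        tpr = an_pn / (an_pn + an_pnn)
        fpr = ann_pn / (ann_pn + ann_pnn)
        points.append((c, tpr, fpr))
    return points

# Illustrative data: six locations, three of which hold nests.
actual = [True, True, False, True, False, False]
prob = [0.9, 0.7, 0.6, 0.4, 0.3, 0.1]

for cutoff, tpr, fpr in roc_points(actual, prob, [0.0, 0.5, 1.0]):
    print(f"cutoff={cutoff:.1f}  TPR={tpr:.2f}  FPR={fpr:.2f}")
```

The sketch assumes both classes are present in `actual`; otherwise one of the denominators is zero.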


Evaluation: Change Model

• Linear regression: $y = X\beta + \epsilon$
• Spatial regression: $y = \rho W y + X\beta + \epsilon$
• The spatial model is better


Solution Procedures

• Spatial Autoregression Model (SAR): $y = \rho W y + X\beta + \epsilon$
• Solutions: $\rho$ and $\beta$ can be estimated using maximum likelihood theory or Bayesian statistics
  • e.g., the spatial econometrics package uses a Bayesian approach based on a sampling-based Markov Chain Monte Carlo (MCMC) method
• Maximum likelihood-based estimation requires $O(n^3)$ operations


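Evaluation: Change Measure

Spatial accuracy (map similarity)

New measure: ADNP, the average distance to nearest prediction

$\mathrm{ADNP}(A, P) = \frac{1}{K} \sum_{k=1}^{K} \mathrm{dist}(A_k, A_k.\mathrm{nearest}(P))$

where $A_k$ are the $K$ actual locations, $P$ is the set of predicted locations, and $A_k.\mathrm{nearest}(P)$ is the predicted location closest to $A_k$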
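A minimal sketch of ADNP follows, assuming locations are 2-D points and distance is Euclidean (the slides do not fix a distance function). The two predicted maps are invented: both miss every actual site, so they have the same classification accuracy, yet one is spatially much closer.

```python
import math

def adnp(actual, predicted):
    """Average distance from each actual site to its nearest predicted site."""
    total = 0.0
    for a in actual:
        total += min(math.dist(a, p) for p in predicted)
    return total / len(actual)

# Illustrative coordinates, not taken from the marsh dataset.
actual = [(0, 0), (4, 0)]
pred_near = [(0, 1), (4, 1)]  # each prediction is one cell away from a nest
pred_far = [(0, 3), (4, 3)]   # each prediction is three cells away

print(adnp(actual, pred_near))
print(adnp(actual, pred_far))
```

A lower ADNP means the predicted map is spatially closer to the actual one, which a confusion matrix alone cannot express.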


Predicting Location using Map Similarity


PLUMS components
• Map similarity: average distance to nearest prediction (ADNP), ...
• Search algorithm: greedy, gradient descent
• Function family: generalized linear (GL) (logit, probit), non-linear, GL with auto-correlation
• Discretization of parameter space: uniform, non-uniform, multi-resolution, ...


Association Rule

Supermarket shelf management
• Goal: identify items that are bought together by sufficiently many customers
• Approach: process the point-of-sale data collected with barcode scanners to find dependencies among items (transaction data)

A classic rule
• If a customer buys diapers and milk, then he is very likely to buy beer
• So don't be surprised if you find six-packs of beer stacked next to diapers!


Association Rules: Support and Confidence

• Item set $I = \{i_1, i_2, \ldots, i_k\}$
• Transactions $T = \{t_1, t_2, \ldots, t_n\}$
• Association rule: $A \rightarrow B$
• Support S: A and B occur together in at least S percent of the transactions, $P(A \cup B)$
• Confidence C: of all the transactions in which A occurs, at least C percent contain B, $P(B \mid A)$
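Support and confidence can be computed directly from a list of transactions. The sketch below uses a made-up basket list built around the diaper, milk, and beer rule; the function names are hypothetical.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimate P(consequent | antecedent) as support(A and B) / support(A)."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Illustrative transactions, not real point-of-sale data.
transactions = [
    {"diaper", "milk", "beer"},
    {"diaper", "milk", "beer"},
    {"diaper", "milk"},
    {"milk", "bread"},
    {"beer"},
]

s = support(transactions, {"diaper", "milk", "beer"})
c = confidence(transactions, {"diaper", "milk"}, {"beer"})
print(f"support={s:.2f} confidence={c:.2f}")
```

Here the rule {diaper, milk} -> {beer} holds in 2 of 5 baskets, and in 2 of the 3 baskets that contain both diapers and milk.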


Case 2: Spatial Association Rule

Problem: given a set of Boolean spatial features, find subsets of co-located features
• e.g., (fire, drought, vegetation)

Data: continuous space, partition not natural

Classical data mining approach: association rules
• But there are no transactions, and hence no support measure

Approach: work with continuous data without transactionizing it
• Participation index (support): minimum fraction of instances of a feature in the join result
• Confidence = Pr[fire at s | drought in N(s) and vegetation in N(s)]
• New algorithm using spatial joins


Co-location

Can you find co-location patterns from the following sample dataset?

[Figure: sample dataset, shown with and without the answers]


Co-location

Spatial co-location: a set of features that are frequently co-located

Given
• A set T of K Boolean spatial feature types, $T = \{f_1, f_2, \ldots, f_K\}$
• A set P of N locations, $P = \{p_1, \ldots, p_N\}$, in a spatial framework S; each $p_i \in P$ is of some spatial feature in T
• A neighbor relation R over locations in S

Find
• $T_c$ = subsets of T that are frequently co-located

Objective
• Correctness
• Completeness
• Efficiency

Constraints
• R is symmetric and reflexive
• Monotonic prevalence measure

[Figure: reference feature centric, window centric, and event centric models]


Co-location: Comparison with Association Rules

Participation index
• Participation index = $\min_i \{pr(f_i, c)\}$
• Participation ratio $pr(f_i, c)$ of feature $f_i$ in co-location $c = \{f_1, f_2, \ldots, f_k\}$: the fraction of instances of $f_i$ with features $\{f_1, \ldots, f_{i-1}, f_{i+1}, \ldots, f_k\}$ nearby

                                   Association rules       Co-location rules
Underlying space                   discrete sets           continuous space
Item-types                         item-types              events / Boolean spatial features
Collections                        transactions            neighborhoods
Prevalence (A -> B)                support: P(A ∪ B)       participation index
Conditional probability (A -> B)   confidence: P(B | A)    P[B in N(L) | A at L]


Spatial Co-location Patterns

• Spatial features A, B, C and their instances
• Possible associations are (A, B), (B, C), etc.
• The neighbor relationship includes the following pairs:
  • A1, B1
  • A2, B1
  • A2, B2
  • B1, C1
  • B2, C2

[Figure: dataset]



Partition approach [Yasuhiko, KDD 2001]

• Spatial features A, B, C and their instances
• One partitioning gives Support(A,B) = 2 and Support(B,C) = 2; another gives Support(A,B) = 1 and Support(B,C) = 2
• Support is not well defined, i.e., it is not independent of the execution trace
• Has a fast heuristic that is hard to analyze for correctness/completeness

[Figure: dataset under two partitionings]




Reference feature approach [Han SSD 95]

• Use C as the reference feature to get transactions
• Transactions: (B1), (B2)
• Support(A,B) = ∅
• Note: the neighbor relationship includes the following pairs:
  • A1, B1
  • A2, B1
  • A2, B2
  • B1, C1
  • B2, C2

[Figure: dataset]
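The reference feature result can be reproduced with a few lines. This is only a sketch under assumed conventions: instance ids such as "A1" carry their feature type as the first character, and `transactions_around` is a hypothetical helper, not the cited authors' algorithm.

```python
def transactions_around(reference, neighbor_pairs):
    """Build one transaction per instance of the reference feature: its set of neighbors."""
    tx = {}
    for u, v in neighbor_pairs:
        # The neighbor relation is symmetric, so try both orientations of each pair.
        for ref, other in ((u, v), (v, u)):
            if ref.startswith(reference):
                tx.setdefault(ref, set()).add(other)
    return tx

# Neighbor pairs listed on the slide.
pairs = [("A1", "B1"), ("A2", "B1"), ("A2", "B2"), ("B1", "C1"), ("B2", "C2")]

tx = transactions_around("C", pairs)
print(tx)

# Count transactions that contain both an A instance and a B instance.
support_ab = sum(
    1 for items in tx.values()
    if any(i[0] == "A" for i in items) and any(i[0] == "B" for i in items)
)
print(support_ab)
```

Because no A instance neighbors a C instance, A never enters a transaction, and the (A, B) pattern is invisible from C's point of view even though A and B are neighbors three times.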




Our approach (Event Centric)

• Neighborhoods instead of transactions
• Spatial join on the neighbor relationship
• Support
  • Participation index = min(p_ratio)
  • p_ratio(A, (A,B)) = fraction of instances of A participating in join(A, B, neighbor)
• Examples
  • Support(A,B) = min(2/2, 2/2) = 1
  • Support(B,C) = min(2/2, 2/2) = 1

[Figure: dataset]
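The event-centric support can be sketched as a join followed by a count of distinct participating instances. The code assumes two instances per feature (A1, A2, B1, B2, C1, C2) and the neighbor pairs listed on the earlier slide; the function names and the convention that an id's first character is its feature type are hypothetical. Each instance is counted once no matter how many join rows it appears in, so a participation ratio never exceeds 1.

```python
def spatial_join(neighbor_pairs, f1, f2):
    """Return neighbor pairs (u, v) with u of feature f1 and v of feature f2."""
    rows = []
    for u, v in neighbor_pairs:
        if u[0] == f1 and v[0] == f2:
            rows.append((u, v))
        elif u[0] == f2 and v[0] == f1:
            rows.append((v, u))  # the neighbor relation is symmetric
    return rows

def participation_index(instances, neighbor_pairs, f1, f2):
    """Minimum over both features of the fraction of instances that appear in the join."""
    rows = spatial_join(neighbor_pairs, f1, f2)
    ratio_f1 = len({u for u, _ in rows}) / len(instances[f1])
    ratio_f2 = len({v for _, v in rows}) / len(instances[f2])
    return min(ratio_f1, ratio_f2)

# Assumed instance sets; the pairs come from the slide.
instances = {"A": {"A1", "A2"}, "B": {"B1", "B2"}, "C": {"C1", "C2"}}
pairs = [("A1", "B1"), ("A2", "B1"), ("A2", "B2"), ("B1", "C1"), ("B2", "C2")]

for f1, f2 in [("A", "B"), ("B", "C"), ("A", "C")]:
    print(f1, f2, participation_index(instances, pairs, f1, f2))
```

Unlike the partition and reference feature approaches, the result depends only on the neighbor relation, not on how space is cut or which feature is chosen as the reference.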




• Partition approach: Support(A,B) = 2 and Support(B,C) = 2 under one partitioning; Support(A,B) = 1 and Support(B,C) = 2 under another
• Reference feature approach: C as the reference feature; transactions: (B1), (B2); Support(A,B) = ∅
• Our approach: Support(A,B) = min(2/2, 2/2) = 1; Support(B,C) = min(2/2, 2/2) = 1

[Figure: dataset]


Case 3: Spatial Outlier Detection

Spatial outlier: a data point that is extreme relative to its neighbors


Application Domain: Traffic Data


Spatial Outlier Detection

Given
• A spatial framework SF consisting of locations $s_1, s_2, \ldots, s_n$
• An attribute function $f : s_i \rightarrow \mathbb{R}$ ($\mathbb{R}$: set of real numbers)
• A neighborhood relationship $N \subseteq SF \times SF$
• A neighborhood aggregation function $f^N_{aggr} : \mathbb{R}^N \rightarrow \mathbb{R}$
• A difference function $F_{diff} : \mathbb{R} \times \mathbb{R} \rightarrow \mathbb{R}$
• A statistic test function $ST : \mathbb{R} \rightarrow \{True, False\}$
  • The test is based on $F_{diff}(f, f^N_{aggr}(f, N))$

Find
• $O = \{s_i \mid s_i \in SF,\ s_i \text{ is a spatial outlier}\}$

Objective
• Correctness: the attribute value of $s_i$ is extreme compared with its neighbors
• Computational efficiency


An example of Spatial outlier


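Spatial Outlier Detection: $Z_{s(x)}$ approach

Function: $S(x) = f(x) - \frac{1}{k} \sum_{y \in N(x)} f(y)$

Declare x as a spatial outlier if $Z_{s(x)} = \left| \frac{S(x) - \mu_s}{\sigma_s} \right| > \theta$

where $k$ is the number of neighbors of x, $\mu_s$ and $\sigma_s$ are the mean and standard deviation of $S(x)$ over all locations, and $\theta$ is the test threshold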
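The test can be sketched on a toy line of stations. The station values, the left/right neighborhood, and the threshold theta = 2 are assumptions chosen for illustration; the traffic dataset itself is not reproduced here.

```python
import statistics

def spatial_outliers(f, neighbors, theta=2.0):
    """Flag x when |S(x) - mean(S)| / std(S) exceeds theta, with S(x) = f(x) - neighbor average."""
    s = {x: f[x] - sum(f[y] for y in neighbors[x]) / len(neighbors[x]) for x in f}
    values = list(s.values())
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)  # population standard deviation of S
    if sigma == 0:
        return []  # every location matches its neighborhood equally well
    return [x for x in f if abs((s[x] - mu) / sigma) > theta]

# Hypothetical data: ten stations on a line, station 5 reports an unusual value.
n = 10
f = {i: 10.0 for i in range(n)}
f[5] = 50.0
neighbors = {i: [j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)}

print(spatial_outliers(f, neighbors, theta=2.0))
```

Stations 4 and 6 also deviate from their neighborhood averages because station 5 pulls those averages up, but only station 5 crosses the assumed threshold.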


Evaluation of Statistical Assumption

• The distribution of the traffic station attribute f(x) looks normal
• The distribution of $S(x) = f(x) - \frac{1}{k} \sum_{y \in N(x)} f(y)$ looks normal too


Different Spatial Outlier Tests

• Spatial statistic approach
• Scatter plot approach (Luc Anselin '94)
• Moran scatter plot approach (Luc Anselin '95)
• Variogram cloud approach (graphic)


Scatter plot approach

Given
• An attribute function f(x)
• A neighborhood relationship N(x)
• An aggregation function $f^N_{aggr} : E(x) = \frac{1}{k} \sum_{y \in N(x)} f(y)$
• A difference function $F_{diff} : \epsilon = E(x) - (m f(x) + b)$

Detect spatial outliers by
• Statistic test function $ST$, applied to $\epsilon$


Graphical Spatial Outlier Test


Original Data

Graphical Spatial Tests


A Unified Algorithm

Separate two phases
• Model building
• Testing (a node or a set of nodes)

Computation structure of model building
• Key insights:
  • Spatial self-join using the N(x) relationship
  • Algebraic aggregate functions can be computed in one disk scan of the spatial join

Computation structure of testing
• Single node: spatial range query
  • Get-All-Neighbors(x) operation


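An example: Scatter plot

Model building
• An attribute function f(x)
• Neighborhood aggregate function: $E(x) = \frac{1}{k} \sum_{y \in N(x)} f(y)$
• Distributive aggregate functions
  • $\sum f(x),\ \sum E(x),\ \sum f(x)E(x),\ \sum f^2(x),\ \sum E^2(x)$
• Algebraic aggregate functions
  • $m = \frac{n \sum f(x)E(x) - \sum f(x) \sum E(x)}{n \sum f^2(x) - (\sum f(x))^2}$
  • $b = \frac{\sum f^2(x) \sum E(x) - \sum f(x) \sum f(x)E(x)}{n \sum f^2(x) - (\sum f(x))^2}$
  • $\sigma_\epsilon = \sqrt{\frac{S_{yy} - m^2 S_{xx}}{n - 2}}$, where $S_{xx} = \sum f^2(x) - \frac{(\sum f(x))^2}{n}$ and $S_{yy} = \sum E^2(x) - \frac{(\sum E(x))^2}{n}$

Testing
• Difference function
  • $\epsilon = E(x) - (m f(x) + b)$, where $E(x) = \frac{1}{k} \sum_{y \in N(x)} f(y)$
• Statistic test function
  • The residual $\epsilon$ is standardized by $\sigma_\epsilon$ and compared with a threshold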
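The two phases can be sketched in a few lines: the five distributive sums are accumulated in one pass, the algebraic aggregates m, b, and sigma are derived from them, and each station is then tested against the fitted line. The station values, the left/right neighborhood, and the test form |epsilon / sigma| > theta with theta = 2 are assumptions made for illustration.

```python
import math

def scatter_plot_test(f, neighbors, theta=2.0):
    """Fit E(x) = m*f(x) + b from running sums and flag large standardized residuals."""
    E = {x: sum(f[y] for y in neighbors[x]) / len(neighbors[x]) for x in f}

    # Model building: distributive aggregates accumulated in a single pass.
    n = len(f)
    sum_f = sum_e = sum_fe = sum_f2 = sum_e2 = 0.0
    for x in f:
        sum_f += f[x]
        sum_e += E[x]
        sum_fe += f[x] * E[x]
        sum_f2 += f[x] ** 2
        sum_e2 += E[x] ** 2

    # Algebraic aggregates derived from the sums above.
    denom = n * sum_f2 - sum_f ** 2
    m = (n * sum_fe - sum_f * sum_e) / denom
    b = (sum_f2 * sum_e - sum_f * sum_fe) / denom
    s_xx = sum_f2 - sum_f ** 2 / n
    s_yy = sum_e2 - sum_e ** 2 / n
    sigma = math.sqrt((s_yy - m * m * s_xx) / (n - 2))

    # Testing: standardized residual of each station (assumed test form).
    z = {x: (E[x] - (m * f[x] + b)) / sigma for x in f}
    flagged = [x for x in f if abs(z[x]) > theta]
    return m, b, sigma, z, flagged

# Hypothetical data: a ramp of 12 stations with one unusual reading at station 6.
f = {i: float(i) for i in range(12)}
f[6] = 30.0
neighbors = {i: [j for j in (i - 1, i + 1) if 0 <= j < 12] for i in range(12)}

m, b, sigma, z, flagged = scatter_plot_test(f, neighbors)
print(m, b, sigma, flagged)
```

Because m, b, and sigma depend only on the five sums, new stations can be folded in by updating the sums rather than rescanning the data.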


Outlier Stations Detected


Outlier Station Detected


Conclusion and Future Directions

Spatial domains may not satisfy the assumptions of classical methods
• Data: auto-correlation, continuous geographic space
• Patterns: global vs. local, e.g., outliers vs. spatial outliers
• Data exploration: maps and albums

Open issues
• Patterns: hot-spots, spatial trends, ...
• Metrics: spatial accuracy (predicted locations), spatial contiguity (clusters)
• Spatio-temporal datasets: spatio-temporal outliers
• Scale and resolution sensitivity of patterns
• Geo-statistical confidence measures for mined patterns


References

1. S. Shekhar and Y. Huang, "Discovering Spatial Co-location Patterns: a Summary of Results", In Proc. of 7th International Symposium on Spatial and Temporal Databases (SSTD01), July 2001.

2. S. Shekhar, C.T. Lu, P. Zhang, "Detecting Graph-based Spatial Outliers: Algorithms and Applications", the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2001.

3. S. Shekhar, C.T. Lu, P. Zhang, "Detecting Graph-based Spatial Outlier", Intelligent Data Analysis, to appear in Vol. 6(3), 2002.

4. S. Chawla, S. Shekhar, W. Wu and U. Ozesmi, "Extending Data Mining for Spatial Applications: A Case Study in Predicting Nest Locations", Proc. Int. Conf. on 2000 ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery (DMKD 2000), Dallas, TX, May 14, 2000.

5. S. Chawla, S. Shekhar, W. Wu and U. Ozesmi, "Modeling Spatial Dependencies for Mining Geospatial Data", First SIAM International Conference on Data Mining, 2001.

6. S. Shekhar, Y. Huang, W. Wu, C.T. Lu, "What's Spatial about Spatial Data Mining: Three Case Studies", chapter in Data Mining for Scientific and Engineering Applications, V. Kumar, R. Grossman, C. Kamath, R. Namburu (eds.), Kluwer Academic Pub., 2001, ISBN 1-4020-0033-2.

7. S. Shekhar and Y. Huang, "Multi-resolution Co-location Miner: a New Algorithm to Find Co-location Patterns in Spatial Datasets", Fifth Workshop on Mining Scientific Datasets (SIAM 2nd Data Mining Conference), April 2002.


http://www.cs.umn.edu/research/shashi-group

Thank you!
