1
Cascading spatio-temporal pattern discovery: A summary of results
Pradeep Mohan¹, Shashi Shekhar¹, James A. Shine², James P. Rogers²
¹University of Minnesota, Twin-Cities, {mohan,shekhar}@cs.umn.edu
²Engineering Research and Development Center, Alexandria, VA, {James.A.Shine, James.P.Rogers.II}@usace.army.mil
2
Outline
Introduction
Motivation
Problem Statement
Related Work
Contributions
Conclusion and Future Work
Interest Measure
CSTP Miner Algorithm
Evaluation and Case Study
3
Motivation: Public Safety
Cascading spatio-temporal pattern (CSTP): partially ordered subsets of ST event types that are located together in space and occur in stages over time.
Stages: Bar Closing, Assault, Drunk Driving, Hurricane, Climate change, etc.
Example CSTP: Bar Closing, Assault, Drunk Driving.
Other applications: climate change, epidemiology, evacuation planning.
[Figure: instances of Bar Closing (B.1, B.2), Assault (A.1 to A.4), and Drunk Driving (C.1 to C.4) at time steps T1, T2, and T3, followed by their aggregate over (T1, T2, T3).]
4
Problem Definition
Input: a) ST framework, b) directed ST neighbor relation R, c) interest measure threshold
Objective: minimize computation costs while discovering statistically meaningful CSTPs
Output: a set of CSTPs with interestingness >= threshold
Constraints: correctness and completeness
Example: ST join with R = {0.5 miles, 2 min.}, threshold = 0.5
[Figure: a candidate CSTP over event types A, B, and C, evaluated on the aggregate (T1, T2, T3) of instances A.1 to A.4, B.1 to B.2, and C.1 to C.4.]
5
Challenges and Contributions
Challenges
Space and time are continuous: many overlapping ST neighborhoods; neighborhood enumeration is computationally challenging.
Conflicting requirements, e.g., statistical interpretation vs. computational scalability.
Exponential candidate space, e.g., candidate CSTPs are exponential in the number of event types.
Contributions
Interest measures: statistical interpretation, computational structure.
CSTP Miner algorithm: filtering strategies.
Evaluation: experimental evaluation, case study.
6
Limitations of Related Work: ST Data Mining
Limitations
[ST Co-occurrence] Treats space and time independently. Absence of partial order.
[ST Sequence] Does not account for multiply connected (e.g., nonlinear) patterns; misses non-linear semantics. No ST statistical interpretation.

Related Work                     ST Sequences       ST Subsets
Partial order                    √                  X
Multiply connected               X                  √
Multiple patterns                √                  √
ST statistical interpretation    X (only spatial)   X
7
Interest Measures
Cascade Participation Ratio (CPR):
$$\mathrm{CPR}(\mathrm{CSTP}, M_j) = \frac{\#\ \text{instances of event type } M_j \text{ in CSTP}}{\#\ \text{instances of event type } M_j \text{ in dataset}}$$
[Conditional probability of observing an instance of the CSTP having seen an instance of A]
Cascade Participation Index (CPI):
$$\mathrm{CPI}(\mathrm{CSTP}) = \min_j \frac{\#\ \text{instances of event type } M_j \text{ in CSTP}}{\#\ \text{instances of event type } M_j \text{ in dataset}}$$
Lower bound on the conditional probability of observing an instance of the CSTP having seen an instance of A, B, or C
[Figure: candidate CSTP over event types A, B, and C on the aggregate (T1, T2, T3) of instances A.1 to A.4, B.1 to B.2, and C.1 to C.4.]
CPR(CSTP, A) = 2/4 = 0.5
CPR(CSTP, B) = 1/2 = 0.5
CPR(CSTP, C) = 2/4 = 0.5
CPI = min{CPR(CSTP, C), CPR(CSTP, A), CPR(CSTP, B)} = 0.5
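The two measures can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names and the `(participating, total)` count layout are assumptions, and the counts are the ones read off the example above.

```python
from fractions import Fraction

def cascade_participation_ratio(participating: int, total: int) -> Fraction:
    """CPR(CSTP, M_j): share of M_j's instances that take part in the CSTP."""
    if total <= 0:
        raise ValueError("event type has no instances in the dataset")
    return Fraction(participating, total)

def cascade_participation_index(counts: dict[str, tuple[int, int]]) -> Fraction:
    """CPI(CSTP): minimum CPR over all event types of the pattern."""
    return min(cascade_participation_ratio(p, t) for p, t in counts.values())

# (participating, total) instance counts per event type, from the example.
example_counts = {"A": (2, 4), "B": (1, 2), "C": (2, 4)}
cpi = cascade_participation_index(example_counts)
print(float(cpi))  # 0.5
```

With a threshold of 0.5, this candidate would be reported as prevalent because its CPI meets the threshold.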
8
Interest Measures: Statistical Interpretation
Values for the example configurations:
ST K-Function    2/9    3/9 = 1/3    9/9 = 1
CPI              2/3    1            1
[Plot axes: X axis, Y axis, time axis]
Spatial Statistics: ST K-Function (Diggle et al. 1995)
$$\hat{K}_{AB}(h, t) = \frac{1}{\lambda_A \lambda_B} \cdot \frac{1}{S\,T} \sum_i \sum_j I_{h,t}\big(d(A_i, B_j),\, t_d(A_i, B_j)\big)$$
The Cascade Participation Index (CPI) is an upper bound on the ST K-function.
Example:
[Figure: example panels of A and B instances.]
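The upper-bound relationship can be checked numerically on small cases. The sketch below assumes 3 instances of A and 3 of B and normalizes the neighbor-pair count by the 9 possible pairs, which matches the x/9 values quoted for the ST K-function on this slide; the specific pair sets are hypothetical, chosen only to reproduce those values.

```python
from fractions import Fraction

def normalized_pair_count(pairs, n_a, n_b):
    """Fraction of all possible (A, B) pairs that are ST neighbors."""
    return Fraction(len(set(pairs)), n_a * n_b)

def cpi_of_pairs(pairs, n_a, n_b):
    """CPI of the two-type pattern A -> B given its neighbor pairs."""
    a_part = {a for a, _ in pairs}
    b_part = {b for _, b in pairs}
    return min(Fraction(len(a_part), n_a), Fraction(len(b_part), n_b))

# Hypothetical neighbor pairs (A index, B index) for 3 A and 3 B instances.
configs = {
    "two pairs": [(0, 0), (1, 1)],
    "three pairs": [(0, 0), (1, 1), (2, 2)],
    "all pairs": [(a, b) for a in range(3) for b in range(3)],
}
results = {}
for name, pairs in configs.items():
    k = normalized_pair_count(pairs, 3, 3)
    cpi = cpi_of_pairs(pairs, 3, 3)
    results[name] = (k, cpi)
    print(name, k, cpi)
```

In each case the normalized pair count stays at or below the CPI, because the number of neighbor pairs can never exceed the product of the participating instance counts.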
9
CSTP Miner Algorithm: Overview
Stages: Upper Bound Filter, Candidate Generation*, Multi-resolution Filter, Cycle Checking, Compute CPI, Prune CSTP
Inputs: R, CPI threshold, filtering choice
Intermediate results: cycles removed, pruned CSTPs
Output: prevalent CSTPs
*Using the same strategy as [Kuramochi and Karypis '04]
CPI computation involves an ST join, which is the computational bottleneck.
ST join: sort-merge over time, nested loop over space.
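The ST join described above can be sketched as follows. This is an illustrative version under stated assumptions, not the CSTP Miner code: the `Event` type and `st_join` name are hypothetical, distance is Euclidean, and the directed relation is taken to mean that the target occurs no earlier than the source and at most `max_lag` time units later.

```python
import math
from typing import NamedTuple

class Event(NamedTuple):
    x: float  # e.g. miles
    y: float  # e.g. miles
    t: float  # e.g. minutes

def st_join(sources, targets, max_dist, max_lag):
    """Return index pairs (i, j) where targets[j] follows sources[i] within
    max_lag time units and lies within max_dist of it."""
    src = sorted(range(len(sources)), key=lambda i: sources[i].t)
    tgt = sorted(range(len(targets)), key=lambda j: targets[j].t)
    pairs = []
    start = 0  # first target that is not earlier than the current source
    for i in src:
        a = sources[i]
        # Merge step: both lists are time-sorted, so `start` only moves forward.
        while start < len(tgt) and targets[tgt[start]].t < a.t:
            start += 1
        # Nested loop over space, restricted to the time window.
        k = start
        while k < len(tgt) and targets[tgt[k]].t <= a.t + max_lag:
            b = targets[tgt[k]]
            if math.hypot(a.x - b.x, a.y - b.y) <= max_dist:
                pairs.append((i, tgt[k]))
            k += 1
    return pairs

# Hypothetical instances: two bar closings and three assaults.
bars = [Event(0.0, 0.0, 0), Event(5.0, 5.0, 0)]
assaults = [Event(0.3, 0.0, 1), Event(0.1, 0.1, 5), Event(5.2, 5.0, 2)]
joined = st_join(bars, assaults, max_dist=0.5, max_lag=2)
print(joined)
```

The spatial check still compares every source against every target inside its time window, which is why the join dominates the running time when ST neighborhoods overlap heavily.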
10
Filtering strategies
Enhance savings: filter non-prevalent CSTPs before CPI computation.
Before candidate generation: upper bound (UB) filter
Key idea: CPI has an anti-monotone upper bound.
After candidate generation: multi-resolution ST (MST) filter
Key idea: there exists a low-dimensional embedding in space and time. Overestimate CPI by coarsening the ST dataset. If overestimate(CPI) < threshold: pruned.
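The coarsening idea can be illustrated for a two-type pattern. The sketch below is an assumption-laden simplification rather than the MST filter itself: events are `(x, y, t)` tuples, grid cells are exactly as wide as the spatial neighborhood and as long as the time window, and the function names are hypothetical. Under those assumptions every true neighbor pair falls in adjacent cells, so the coarse count can only overestimate participation.

```python
import math
from itertools import product

def cell(event, space_size, time_size):
    """Map an (x, y, t) event to its coarse grid cell."""
    x, y, t = event
    return (math.floor(x / space_size), math.floor(y / space_size),
            math.floor(t / time_size))

def coarse_cpi(sources, targets, max_dist, max_lag):
    """Overestimate the CPI of the pattern source -> target on a grid whose
    cells are max_dist wide in space and max_lag long in time."""
    src_cells = {cell(e, max_dist, max_lag) for e in sources}
    tgt_cells = {cell(e, max_dist, max_lag) for e in targets}

    def may_participate(c, others, time_steps):
        # A true neighbor is at most one cell away in x and y, and zero or
        # one cell away in time (later for targets, earlier for sources).
        cx, cy, ct = c
        return any((cx + dx, cy + dy, ct + dt) in others
                   for dx, dy, dt in product((-1, 0, 1), (-1, 0, 1), time_steps))

    src_hits = sum(may_participate(cell(e, max_dist, max_lag), tgt_cells, (0, 1))
                   for e in sources)
    tgt_hits = sum(may_participate(cell(e, max_dist, max_lag), src_cells, (0, -1))
                   for e in targets)
    return min(src_hits / len(sources), tgt_hits / len(targets))

def survives_mst_filter(sources, targets, max_dist, max_lag, threshold):
    """Keep the candidate only if its overestimated CPI reaches the threshold."""
    return coarse_cpi(sources, targets, max_dist, max_lag) >= threshold

# Hypothetical instances: two bar closings and two assaults.
bars = [(0.0, 0.0, 0), (5.0, 5.0, 0)]
assaults = [(0.3, 0.0, 1), (9.0, 9.0, 50)]
print(coarse_cpi(bars, assaults, 0.5, 2))
```

A candidate that fails this cheap test cannot reach the threshold on the full-resolution data either, so the expensive ST join is skipped for it.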
11
Evaluation
Real dataset: City of Lincoln, Nebraska, year 2007
Platform: MATLAB 7.0, X5355 2.66 GHz with 16 GB main memory and Linux OS
Events within an interval of 10 minutes were assigned the same time stamp.
Goals
a. What is the effect of the number of event types on execution time?
b. What is the effect of the CPI threshold?
c. Other experiments: effect of neighborhood size, dataset size, grid parameters
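The time-stamp assignment mentioned above amounts to bucketing event times into fixed 10-minute intervals. A minimal sketch follows; the function name and the choice of midnight on 1 January 2007 as the origin are assumptions, since the slides do not say how the intervals were aligned.

```python
from datetime import datetime

def time_stamp(moment: datetime, origin: datetime, interval_minutes: int = 10) -> int:
    """Index of the fixed-width interval that contains `moment`."""
    elapsed_seconds = (moment - origin).total_seconds()
    return int(elapsed_seconds // (interval_minutes * 60))

origin = datetime(2007, 1, 1)  # assumed start of the observation period
first = time_stamp(datetime(2007, 1, 1, 0, 3), origin)
second = time_stamp(datetime(2007, 1, 1, 0, 9, 59), origin)
third = time_stamp(datetime(2007, 1, 1, 0, 10), origin)
print(first, second, third)  # 0 0 1
```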
12
Experimental Analysis
Questions
a. What is the effect of the number of event types?
b. What is the effect of the CPI threshold?
Trends:
a. Pattern size is exponential in the number of event types.
b. The MST filter enhances computational savings.
Fixed parameters (question a): CPI threshold = 0.2; time neighborhood = 1750 time stamps.
Fixed parameters (question b): number of event types = 5; time neighborhood = 1750 time stamps.
13
Lincoln, NE crime dataset: Case study
Questions
Is bar closing a generator for crime-related CSTPs?
Are there other generators (e.g., Saturday nights)?
Observation: Crime peaks around bar closing!
[Figure: bar locations in Lincoln, NE]
Bar closing: increase in larceny, vandalism, and assaults
Saturday night: increase in larceny, vandalism, and assaults
K-S test: Saturday night is significantly different from normal-day bar closing (p-value = 1.249 × 10⁻⁷, K = 0.41)
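For readers unfamiliar with the K-S test, the statistic K is the largest gap between the empirical distribution functions of the two samples. The sketch below computes only that statistic, not the p-value, and the incident counts are hypothetical values that do not come from the Lincoln dataset.

```python
def ks_statistic(sample_1, sample_2):
    """Two-sample K-S statistic: largest gap between the empirical CDFs."""
    xs = sorted(set(sample_1) | set(sample_2))
    n1, n2 = len(sample_1), len(sample_2)
    gap = 0.0
    for x in xs:
        cdf_1 = sum(v <= x for v in sample_1) / n1
        cdf_2 = sum(v <= x for v in sample_2) / n2
        gap = max(gap, abs(cdf_1 - cdf_2))
    return gap

# Hypothetical nightly incident counts, for illustration only.
normal_nights = [3, 4, 4, 5, 6]
saturday_nights = [5, 7, 8, 8, 9]
k_value = ks_statistic(normal_nights, saturday_nights)
print(k_value)
```

A larger K means the two distributions differ more; the reported K = 0.41 would correspond to a maximum gap of 0.41 between the two empirical CDFs.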
14
Conclusions
Cascading ST patterns are useful in applications like public safety and climate change science.
Statistically interpretable interest measure.
Complementary filtering strategies.
ST multi-resolution filtering enhances computational performance.
Future work
New interest measure alternatives.
Qualitative comparison with graphical models (e.g., dynamic Bayes nets, hidden Markov models).
15
Acknowledgment
Members of the Spatial Database and Data Mining Research Group, University of Minnesota, Twin-Cities.
This work was supported by grants from the U.S. Army and NSF.
Thank You for your Questions, Comments and Patience!
Crime Report Schema Alignment
University of Texas at Dallas
Overview
Washington DC Incidents Reported
NID    CCN      Offence    …   Long          Latitude
3768   571398   Arson          38.87010181   -76.9822237
3787   519110   Theft          38.88852      -76.9370033
3779   519097   Burglary       38.95143      -77.0238048

Lincoln_Nebraska Incidents Reported
INC_    Time_   Date_        …   Team_Area
45111   21:24   11-17-2007       Northwest Team
41000   18:22   12-2-2007        Center Team
•Two different tables from two different data sources. Our goal is to align attributes between two tables.
Code Crime
45111 Arson
41000 Auto-theft
41000 Unauthorized use of motor vehicle
Dataset ER Diagram
[Figure: ER diagrams of the Washington DC and Lincoln datasets, with entities Incident_2007_reported, Bars, Football Match, and Crime_type, linked by "located" and "crime" relationships.]
• Heterogeneity: Crime is an attribute in the Washington DC dataset, while it is a table in the Lincoln dataset.
Schema Alignment
– Syntactic Matching: keyword-based matching on crime name
• Lincoln.CrimeType.IncidentClassification = “Robbery”
• Washington.Crime = “Robbery”
– Semantic Matching: semantically relevant
A. Specialization vs. Generalization
– Lincoln.CrimeType.IncidentClassification = “Death”
– Washington.Crime = “Homicide”
– Death is a superclass of Homicide
B. Finding Semantic Matching
I. Definition of Crimes: use shared words to determine similarity
II. Relevant Words: find relevant words using K-medoid clustering and Normalized Google Distance (NGD) *
* Jeffrey Partyka, Latifur Khan, Bhavani Thuraisingham, “Geographically-Typed Semantic Schema Matching,” In Proc. of ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems (ACM GIS 2009), Seattle, Washington, USA, November 2009. Extended Version Submitted to Journal of Web Semantics, Springer.
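As background for the NGD step, the Normalized Google Distance between two terms is computed from search hit counts. The sketch below uses the standard NGD formula; the function name, the handling of terms that never co-occur, and the hit counts are illustrative assumptions and are not taken from the cited work.

```python
import math

def normalized_google_distance(f_x, f_y, f_xy, n_pages):
    """NGD from hit counts f(x), f(y), f(x, y) and the index size N."""
    if f_xy == 0:
        return math.inf  # the terms never co-occur
    log_x, log_y, log_xy = math.log(f_x), math.log(f_y), math.log(f_xy)
    return (max(log_x, log_y) - log_xy) / (math.log(n_pages) - min(log_x, log_y))

# Hypothetical hit counts for two terms and for pages containing both.
always_together = normalized_google_distance(1000, 1000, 1000, 10**6)
rarely_together = normalized_google_distance(1000, 1000, 10, 10**6)
print(always_together, rarely_together)
```

Smaller values mean the terms tend to appear on the same pages, which is what makes NGD usable as the distance inside K-medoid clustering.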
I. Finding Semantic Matching using Definition of Crime
• Finding shared words to determine similarity
• Larceny-Theft: Unlawful taking, carrying, leading, or riding away of property from the possession or constructive possession of another; attempts to do these acts are included in the definition. [1]
• Theft: Illegal taking of another person's property without that person's freely-given consent. [2]
• Assault: An act that causes another to apprehend an immediate harmful contact. [3]
Red keywords are common words in crime definitions, while blue keywords are not common.
[1] http://www.fbi.gov/ucr/cius_04/offenses_reported/property_crime/larceny-theft.html
[2] http://en.wikipedia.org/wiki/Theft
[3] http://en.wikipedia.org/wiki/Assult
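One simple way to turn shared words into a similarity score is Jaccard overlap of the keyword sets. The slides do not name the measure, so Jaccard, the small stop-word list, and the function names below are all assumptions made for illustration; the definitions are the ones quoted above, with the Larceny-Theft text shortened to its first clause.

```python
import re

STOP_WORDS = {"a", "an", "the", "of", "or", "to", "that", "in", "from",
              "are", "these", "do", "without"}

def keywords(definition):
    """Lower-cased words of a definition, minus a small stop-word list."""
    return set(re.findall(r"[a-z']+", definition.lower())) - STOP_WORDS

def shared_word_similarity(def_1, def_2):
    """Jaccard overlap of the two definitions' keyword sets."""
    k1, k2 = keywords(def_1), keywords(def_2)
    return len(k1 & k2) / len(k1 | k2) if k1 | k2 else 0.0

larceny = ("Unlawful taking, carrying, leading, or riding away of property "
           "from the possession or constructive possession of another")
theft = ("Illegal taking of another person's property without that "
         "person's freely-given consent")
assault = "An act that causes another to apprehend an immediate harmful contact"

print(shared_word_similarity(larceny, theft))
print(shared_word_similarity(larceny, assault))
```

Larceny-Theft and Theft share "taking", "property", and "another", so they score higher than Larceny-Theft and Assault, which share only "another".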
II. K-medoid + NGD Instance Similarity
Legend: Column 1 (C1), Column 2 (C2)
Washington DC: “Arson”, “Theft”, “Burglary”, …
Lincoln: “Arson”, “Theft”, “Northwest”, …
Step 1: Extract distinct keywords from the compared columns: C1 U C2. Keywords extracted from columns = {Arson, Theft, Stolen, …}
Step 2: Group distinct keywords together into semantic clusters.
Step 3: Calculate similarity: Similarity = H(C|T) / H(C)

Washington DC
Offence    Long          Latitude
Arson      38.87010181   -76.9822237
Theft      38.88852      -76.9370033
Burglary   38.95143      -77.0238048

Lincoln
INC_    Team_Area
Arson   Northwest Team
Theft   Center Team
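Step 3 can be sketched once every keyword instance carries a cluster label C and a column label T. The code below is a minimal illustration of the ratio H(C|T) / H(C); the function names and the toy cluster assignments are assumptions, and the clustering of Steps 1 and 2 is taken as given.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def entropy_similarity(assignments):
    """H(C|T) / H(C) for (cluster, column) pairs, one per keyword instance."""
    clusters = [c for c, _ in assignments]
    h_c = entropy(clusters)
    if h_c == 0:
        raise ValueError("H(C) is zero: all keywords fall in one cluster")
    n = len(assignments)
    h_c_given_t = 0.0
    for column, count in Counter(t for _, t in assignments).items():
        in_column = [c for c, t in assignments if t == column]
        h_c_given_t += count / n * entropy(in_column)
    return h_c_given_t / h_c

# Hypothetical cluster assignments: (cluster id, column id).
same_mix = [(0, 1), (1, 1), (0, 2), (1, 2)]  # both columns cover both clusters
disjoint = [(0, 1), (0, 1), (1, 2), (1, 2)]  # each column has its own cluster
print(entropy_similarity(same_mix), entropy_similarity(disjoint))
```

When both columns spread over the clusters in the same way, knowing the column tells nothing about the cluster and the ratio is 1; when each column maps to its own cluster, the ratio drops to 0.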