Post on 27-Dec-2015
Towards situational awareness systems for disaster response
Naveen Ashish
Calit2@UC-Irvine
Bell Labs India, Bangalore,
04/23/07
Organization
Introduction to SAMI
Selected research areas
Technology transition
Discussion
RESCUE
NSF-funded “large-ITR” project to advance information technologies for disaster response
5-year project: Oct 2003 to Oct 2008
Institutions: 6 universities (UCI, UCSD, UIUC, BYU, U-Colorado, U-Maryland) and 1 company (ImageCat)
Active and formal community partners: City of LA, OCFA, Irvine Police, ...
People: Director Sharad Mehrotra; ~25 researchers and staff, ~40 students
Web: http://www.itr-rescue.org
The SAMI TEAM
Students: Stella Chen, Chaitanya Desai, Vibhav Gogate, Jon Hutchinson, Ram Hariharan, Shengyue Ji, Yiming Ma, Rabia Nuray-Turan, Dawit Seid, Shankar Shivappa
Staff: Jay Lickfett, Chris Davison
Collaborators: Charles Huyck, Ron Eguchi, Shubharoop Ghosh
Faculty, Scientists and Post-docs: Dmitri Kalashnikov, Rajesh Hegde, Sharad Mehrotra, Sangho Park
Slide Aggregator (aka Project Leader): Naveen Ashish
RESCUE Mission
The mission of RESCUE is to enhance the ability of
emergency response organizations and the public to mitigate
crises, save lives, and prevent secondary and indirect human
and economic loss by radically transforming ways in which
these organizations gather, process, manage, use and
disseminate information during man-made and natural
catastrophes.
Motivation: Transform the Ability of First Responders to Mitigate Crisis
Observation: Right Information to the Right Person at the Right Time can result in dramatically better response
Response Effectiveness: lives & property saved, damage prevented, cascades avoided
Quality & Timeliness of Information
Situational Awareness: incidences, resources, victims, needs
Quality of Decisions: first responders, consequence planners, public
RESCUE Objectives
Develop technologies to dramatically improve situational awareness of first-responders, response organizations, and the public by providing them with timely access to accurate, reliable and actionable information about the disaster.
Develop technologies that enable seamless information sharing and collective decision making across highly dynamic virtual organizations consisting of diverse entities (government, private sector, NGOs, individuals).
Develop robust communication systems that continue to operate in crisis situations despite partial/total failure of infrastructure and increased communication demands.
Develop technologies that can be used for timely and customized dissemination of crisis information that informs the public at large, thus enhancing the abilities of the affected populations to take appropriate self-protective actions.
Explore the privacy challenges that emerge as a result of infusing technology to improve information flow in crisis response networks and the public.
Promote interdisciplinary education at all levels (graduate, undergraduate, K-12) and across diverse student groups to expose the future community of citizens to issues in emergency management and homeland security, an area of global and national importance.
RESCUE Research Projects
SAMI: Situational Awareness from Multi-Modal Input (Project Lead: N. Ashish, UCI)
PISA: Policy-driven Information Sharing Architecture (Project Lead: M. Winslett, UIUC)
Customized Dissemination in the Large (Project Leads: K. Tierney, UC-B & N. Venkatasubramanian, UCI)
Privacy Implications of Technology Adoption (Project Lead: S. Mehrotra, UCI)
Robust Networking and Information Collection (Project Lead: BS Manoj, UCSD)
A Situational Awareness Application
Information: Reports, Responders, News, Weather, Traffic
Applications: Damage Assessment, Evacuation Planning, Situational Dashboard, Simulations, Reconnaissance System
Architecture: Situational data management, Analysis, Extraction and synthesis
Events as fundamental abstraction units
Areas
Situational awareness systems
Extraction and synthesis: semantic extraction from text, audio-visual extraction
Data management: E event model, SAT-ware, spatial indexing
Analysis: graph analysis, geospatial, predictive modeling, damage assessment
Extraction and Synthesis
Semantic extraction from text
Audio event extraction
Visual event extraction
Why do we need “Data Cleaning”?
An actual excerpt from a person’s CV (sanitized for privacy):
“... In June 2004, I was listed as the 1000th most cited author in computer science (of 100,000 authors) by CiteSeer, available at http://citeseer.nj.nec.com/allcited.html. ...”
Such statements are quite common in CVs, etc.
This particular person argues he is good because his work is well cited.
But there is a problem with using the CiteSeer ranking: in general, it is not valid (in CVs). Let’s see why...
Suspicious entries
Let us go to the DBLP website, which stores bibliographic entries of many CS authors.
Let us check who “A. Gupta” and “L. Zhang” are.
What is the problem in the example?
[Figure: the CiteSeer list of the top-k most cited authors, next to the two DBLP pages.]
Comparing raw and cleaned CiteSeer
Rank            Author              Location        # citations
1 (100.00%)     douglas schmidt     cs@wustl        5608
2 (100.00%)     rakesh agrawal      almaden@ibm     4209
3 (100.00%)     hector garciamolina @               4167
4 (100.00%)     sally floyd         @aciri          3902
5 (100.00%)     jennifer widom      @stanford       3835
6 (100.00%)     david culler        cs@berkeley     3619
6 (100.00%)     thomas henzinger    eecs@berkeley   3752
7 (100.00%)     rajeev motwani      @stanford       3570
8 (100.00%)     willy zwaenepoel    cs@rice         3624
9 (100.00%)     van jacobson        lbl@gov         3468
10 (100.00%)    rajeev alur         cis@upenn       3577
11 (100.00%)    john ousterhout     @pacbell        3290
12 (100.00%)    joseph halpern      cs@cornell      3364
13 (100.00%)    andrew kahng        @ucsd           3288
14 (100.00%)    peter stadler       tbi@univie      3187
15 (100.00%)    serge abiteboul     @inria          3060
CiteSeer top-k
Cleaned CiteSeer top-k
What is the lesson?
Data should be cleaned first, e.g., determine the (unique) real authors of publications.
Solving such challenges is not always “easy”; that explains a large body of work on data cleaning.
Note: CiteSeer is aware of the problem with its ranking. There are more issues with CiteSeer, many not related to data cleaning.
“Garbage in, garbage out” principle: making decisions based on bad data can lead to wrong results.
High-level view of the problem
[Figure: a raw dataset contains ambiguous references such as "J. Smith", which may be John Smith (Intel Inc.) or Jane Smith (MIT). Extraction (uncertainty, duplicates, ...) turns it into a normalized dataset, e.g. rows (John Smith, Intel) and (Jane Smith, MIT), to which data analysis techniques can now be applied.]
The normalized dataset = Attributed Relational Graph (ARG)
(nodes, edges can have labels) (for any objects, not only people)
Traditional Domain-Independent DC Methods
Feature-based similarity (FBS): compare object X and object Y feature by feature (feature1, feature2, feature3).
FBS with context: a new feature (feature4) is derived from context and compared together with f1, f2, f3.
RelDC = Traditional FBS + Relationship Analysis (enhance the core)
[Figure: X and Y connected in the ARG through intermediate nodes A to F; traditional techniques (FBS-based) form the core.]
What is “Reference Disambiguation”?
Author table (clean):
A1, ‘Dave White’, ‘Intel’
A2, ‘Don White’, ‘CMU’
A3, ‘Susan Grey’, ‘MIT’
A4, ‘John Black’, ‘MIT’
A5, ‘Joe Brown’, unknown
A6, ‘Liz Pink’, unknown
Publication table (to be cleaned):
P1, ‘Databases . . .’, ‘John Black’, ‘Don White’
P2, ‘Multimedia . . .’, ‘Sue Grey’, ‘D. White’
P3, ‘Title3 . . .’, ‘Dave White’
P4, ‘Title5 . . .’, ‘Don White’, ‘Joe Brown’
P5, ‘Title6 . . .’, ‘Joe Brown’, ‘Liz Pink’
P6, ‘Title7 . . .’, ‘Liz Pink’, ‘D. White’
Analysis (‘D. White’ in P2, our approach):
1. ‘Don White’ has a paper with ‘John Black’ @ MIT
2. ‘Dave White’ is not connected to MIT in any way
3. ‘Sue Grey’ is a coauthor of P2 too, and @ MIT
Thus: ‘D. White’ in P2 is probably Don (since we know he collaborates with MIT people)
Analysis (‘D. White’ in P6, our approach):
1. ‘Don White’ has a paper (P4) with Joe Brown; Joe has a paper (P5) with Liz Pink; Liz Pink is a coauthor of P6.
2. ‘Dave White’ does not have papers with Joe or Liz
Thus: ‘D. White’ in P6 is probably Don (since co-author networks often form clusters)
Attributed Relational Graph (ARG)
View dataset as a graph
Nodes for entities: papers, authors, organizations, e.g., P2, Susan, MIT
Edges for relationships: “writes”, “affiliated with”, e.g. Susan → P2 (“writes”)
“Choice” nodes for uncertain relationships: mutual exclusion, “1” and “2” in the figure
Analysis can be viewed as application of the “Context AP” to this graph (defined next...)
[Figure: the ARG for the example, with nodes P1 to P6, Dave White, Don White, Susan Grey, John Black, Joe Brown, Liz Pink, Intel, CMU, and MIT; choice nodes 1 and 2 carry unknown option weights such as w1 = ? and w3 = ?.]
Q: How come domain-independent?
In designing the RelDC approach, our goal was to use CAP as an axiom and then solve the problem formally, without heuristics.
Context Attraction Principle (CAP)
If reference r, made in the context of entity x, refers to an entity yj, but the description provided by r matches multiple entities y1, …, yj, …, yN, then x and yj are likely to be more strongly connected to each other via chains of relationships than x and yk (k = 1, 2, …, N; k ≠ j).
[Figure: example with the reference “J. Smith” in publication P1 and the entities John E. Smith (SSN = 123), Joe A. Smith, and Jane Smith.]
Analyzing paths: linking entities and contexts
D. White is a reference in the context of P2, P6
Can link P2, P6 to Don
Cannot link P2, P6 to Dave
More complex paths in general
[Figure: the example ARG with choice nodes 1 and 2 and unknown option weights w1 = ?, w3 = ?.]
Analysis (‘D. White’ in P2): path P2→Don
1. ‘Don White’ has a paper with ‘John Black’ @ MIT
2. ‘Dave White’ is not connected to MIT in any way
3. ‘Sue Grey’ is a coauthor of P2 too, and @ MIT
Thus: ‘D. White’ is probably Don White
Analysis (‘D. White’ in P6): path P6→Don
1. ‘Don White’ has a paper (P4) with Joe Brown; Joe has a paper (P5) with Liz Pink; Liz Pink is a coauthor of P6.
2. ‘Dave White’ does not have papers with Joe or Liz
Thus: ‘D. White’ is probably Don White
Questions to answer
1. Does the CAP principle hold over real datasets? That is, if we disambiguate references based on it, will the
references be correctly disambiguated?
2. Can we design a generic solution to exploiting relationships for disambiguation?
Problem formalization
Notation: Meaning
X = {x1, x2, ... , xN}: the set of all entities in the database
xi.rk: the k-th reference of entity xi (e.g., the name of the k-th author of paper xi, such as ‘J. Smith’)
a reference: a description of an object, with multiple attributes
d[xi.rk]: the “answer” for xi.rk, i.e. the real entity xi.rk refers to (unknown, the goal is to find it); e.g., the true k-th author of paper xi
CS[xi.rk]: the “choice set” for xi.rk, i.e. the set of all entities matching the description provided by xi.rk; e.g., ‘John A. Smith’, ‘Jane B. Smith’, ...
y1, y2, ... , yN: the “options” for xi.rk, i.e. the elements in CS[xi.rk]
v[xi]: the node in the graph for entity xi
Handling References: Linking (references correspond to relationships)
if |CS[xi.rk]| = 1 then
  we know the answer d[xi.rk]
  link xi and d[xi.rk] directly, w = 1
else
  the answer is uncertain for xi.rk
  create a “choice” node, link it
  “option-weights”, w1 + ... + wN = 1
  option-weights are variables
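The linking rule can be sketched in Python on the author and publication tables from the reference disambiguation example. This is only an illustration under stated assumptions: the name-matching test `matches` and the `NICKNAMES` table are hypothetical stand-ins for whatever feature-based matching a real system uses, and `None` is used to mark an option-weight that is still a variable.

```python
# Author table (clean) and publication author lists (to be cleaned), from the example.
AUTHORS = {
    "A1": "Dave White", "A2": "Don White", "A3": "Susan Grey",
    "A4": "John Black", "A5": "Joe Brown", "A6": "Liz Pink",
}
PUBLICATIONS = {
    "P1": ["John Black", "Don White"],
    "P2": ["Sue Grey", "D. White"],
    "P3": ["Dave White"],
    "P4": ["Don White", "Joe Brown"],
    "P5": ["Joe Brown", "Liz Pink"],
    "P6": ["Liz Pink", "D. White"],
}
# Hypothetical nickname table, added only so that 'Sue' can match 'Susan'.
NICKNAMES = {"Sue": "Susan"}


def matches(reference, full_name):
    """Hypothetical feature-based test: does the reference fit this author name?"""
    ref_first, ref_last = reference.rsplit(" ", 1)
    first, last = full_name.rsplit(" ", 1)
    if ref_last != last:
        return False
    if ref_first.endswith("."):  # an initial such as 'D.'
        return first.startswith(ref_first[:-1])
    return NICKNAMES.get(ref_first, ref_first) == first


def choice_set(reference):
    """CS[xi.rk]: ids of all authors whose name matches the reference."""
    return [aid for aid, name in AUTHORS.items() if matches(reference, name)]


def link_references():
    """Return (certain_edges, choice_nodes) for every author reference."""
    certain, choices = [], {}
    for pid, refs in PUBLICATIONS.items():
        for k, ref in enumerate(refs):
            options = choice_set(ref)
            if len(options) == 1:
                certain.append((pid, options[0], 1.0))  # direct edge, w = 1
            else:
                # Choice node: option-weights are variables (None = unknown yet).
                choices[(pid, k)] = {aid: None for aid in options}
    return certain, choices


certain, choices = link_references()
for ref_id, options in choices.items():
    print(ref_id, sorted(options))
```

On this data only the two 'D. White' references end up behind choice nodes; every other reference is linked directly with weight 1. Empty choice sets are not handled in this sketch.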
Entity-Relationship Graph
RelDC views the dataset as an undirected graph
Nodes for entities: don’t have weights
Edges for relationships: have weights, a real number in [0,1], the confidence that the relationship exists
[Figure: the ARG for the example, with choice nodes 1 and 2 and unknown option weights w1 = ?, w3 = ?.]
[Figure: node v[xi] is linked by edge e0 (w0 = 1) to the choice node cho[xi.rk], which is linked to the N nodes v[y1], v[y2], ..., v[yN] for the entities in CS[xi.rk]; e.g. “J. Smith” in P1 with options “Jane Smith” and “John Smith”.]
Objective of Reference Disambiguation
Definition: To resolve a reference xi.rk means to pick one yj from CS[xi.rk] as d[xi.rk].
Graph interpretation: among w1, w2, ... , wN, assign wj = 1 to one wj; this means yj is chosen as the answer d[xi.rk].
Definition: Reference xi.rk is resolved correctly if the chosen yj = d[xi.rk].
Definition: Reference xi.rk is unresolved or uncertain if not yet resolved...
Goal: Resolve all uncertain references as correctly as possible.
Formalizing the CAP
CAP is based on “connection strength”
c(u,v) for entities u and v measures how strongly u and v are connected to each other via relationships
e.g. c(u,v) > c(u,z) in the figure; c(u,v) will be formalized later
Context Attraction Principle (CAP): if c(xi, yj) ≥ c(xi, yk) then wj ≥ wk (most of the time)
[Figure: example graph with nodes u, v, z and intermediate nodes A to H; choice node cho[xi.rk] linking v[xi] to v[y1], v[y2], ..., v[yN].]
We use proportionality:
c(xi, yj) ∙ wk = c(xi, yk) ∙ wj
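Combined with the constraint w1 + ... + wN = 1 on option-weights, the proportionality gives each weight directly in terms of connection strengths. A short derivation, writing c_j for c(xi, yj) and assuming at least one c_j is positive:

```latex
\begin{align*}
c_j \, w_k &= c_k \, w_j \quad \text{for all } j, k \\
\sum_{k=1}^{N} c_j \, w_k &= \sum_{k=1}^{N} c_k \, w_j \\
c_j \sum_{k=1}^{N} w_k &= w_j \sum_{k=1}^{N} c_k \\
c_j &= w_j \sum_{k=1}^{N} c_k \qquad \Bigl(\text{since } \sum_{k=1}^{N} w_k = 1\Bigr) \\
w_j &= \frac{c_j}{\sum_{k=1}^{N} c_k}
\end{align*}
```

Because each c_j may itself depend on the option-weights of other references, this is an equation to be solved rather than a value to be read off.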
RelDC approach
Input: the ARG for the dataset
1. Computing connection strengths
  − for each unresolved reference xi.rk, determine equations for all (i.e., N) c(xi, yj)’s
  − c(xi, yj) = gij(w), a function of other option-weights
2. Determining equations for option-weights
  − use CAP to relate all wj’s and connection strengths
  − since c(xi, yj) = gij(w), hence wij = fij(w)
3. Computing option-weights
  − solve the system of equations from Step 2
4. Resolving references
  − use the interpretation procedure to resolve weights
Computing connection strength (Step 1)
Computation of c(u,v) consists of two phases
Phase 1: Discover connections
  all L-short simple paths between u and v
  bottleneck optimizations, not in SDM05
Phase 2: Measure the strength in the discovered connections
  many c(u,v) models exist
  we use the random walks in graphs model
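The two phases can be sketched as follows. This is a simplified illustration, not the authors' exact model: it enumerates L-short simple paths by depth-first search and scores each path by the probability that a random walk follows it, where each step picks an edge with probability proportional to its weight among all edges at the current node (an assumption made here for brevity). The toy graph is the publication example with option edges left out and all other weights set to 1.

```python
from collections import defaultdict


def build_graph(edges):
    """Undirected weighted graph as a dict of dicts."""
    graph = defaultdict(dict)
    for a, b, weight in edges:
        graph[a][b] = weight
        graph[b][a] = weight
    return graph


def simple_paths(graph, src, dst, max_len):
    """Phase 1: yield all simple paths from src to dst with at most max_len edges."""
    stack = [(src, [src])]
    while stack:
        node, path = stack.pop()
        if node == dst:
            yield path
            continue
        if len(path) - 1 == max_len:
            continue  # path-length limit L reached
        for nxt in graph[node]:
            if nxt not in path:
                stack.append((nxt, path + [nxt]))


def path_probability(graph, path):
    """Phase 2: probability that a random walk from path[0] follows this path."""
    prob = 1.0
    for here, nxt in zip(path, path[1:]):
        prob *= graph[here][nxt] / sum(graph[here].values())
    return prob


def connection_strength(graph, u, v, max_len=7):
    return sum(path_probability(graph, p) for p in simple_paths(graph, u, v, max_len))


# Toy ARG from the example (certain edges only, all with weight 1).
EDGES = [
    ("P1", "John Black", 1.0), ("P1", "Don White", 1.0), ("P2", "Susan Grey", 1.0),
    ("P3", "Dave White", 1.0), ("P4", "Don White", 1.0), ("P4", "Joe Brown", 1.0),
    ("P5", "Joe Brown", 1.0), ("P5", "Liz Pink", 1.0), ("P6", "Liz Pink", 1.0),
    ("Dave White", "Intel", 1.0), ("Don White", "CMU", 1.0),
    ("Susan Grey", "MIT", 1.0), ("John Black", "MIT", 1.0),
]
ARG = build_graph(EDGES)
c_don = connection_strength(ARG, "P2", "Don White")
c_dave = connection_strength(ARG, "P2", "Dave White")
print(c_don, c_dave)
```

In this toy graph P2 reaches Don White only through the 5-edge path via Susan Grey, MIT, John Black, and P1, and does not reach Dave White at all, so c(P2, Don) > c(P2, Dave). With a path-length limit below 5 the connection to Don would not be found either.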
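[Figure: graph in which v[xi] is connected to v[y1], v[y2], ..., v[yN] through paths between nodes u and v.]
Measuring connection strength
[Figure: a path from Source to Destination through nodes v1, v2, ..., vk, with edge weights w1,0, w2,0, ..., wk-1,0 along the path (e.g. edge E1,0) and n1, n2, ..., nk other edges at each node, with weights such as w1,i, w2,i, ..., wk,i.]
[Figure: example graph with nodes u, v, z and intermediate nodes A to H.]
Note:
– c(u,v) returns an equation
– because paths can go via various option-edges
– cuv = c(u,v) = guv(w)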
Equations for option-weights (Step 2)
CAP (proportionality):
System (over-constrained):
Add slack:
Solving the system (Steps 3 and 4)
Step 3: Solve the system of equations
  1. use a math solver, or
  2. iterative method (approx. solution), or
  3. bounding-interval-based method (tech. report).
Step 4: Interpret option-weights to determine the answer for each reference
  pick yj with the largest weight as the answer
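A minimal sketch of the iterative option for Step 3, followed by the Step 4 interpretation. It assumes the weight equations take the normalized form w_j = c_j / (c_1 + ... + c_N), which follows from the proportionality when the option-weights sum to 1. The strength functions in `STRENGTH_FNS` are hypothetical toy functions, not derived from a real dataset; each takes the current weights and returns a connection strength.

```python
def normalize(strengths):
    """w_j = c_j / sum_k c_k; equal weights if every strength is 0 (an assumption)."""
    total = sum(strengths.values())
    if total == 0:
        return {y: 1.0 / len(strengths) for y in strengths}
    return {y: c / total for y, c in strengths.items()}


def solve_iteratively(strength_fns, max_rounds=200, tol=1e-12):
    """Fixed-point iteration over the option-weights of all unresolved references."""
    # Start from equal option-weights for every reference.
    weights = {ref: normalize({y: 0.0 for y in fns}) for ref, fns in strength_fns.items()}
    for _ in range(max_rounds):
        updated = {
            ref: normalize({y: g(weights) for y, g in fns.items()})
            for ref, fns in strength_fns.items()
        }
        change = max(
            abs(updated[ref][y] - weights[ref][y]) for ref in weights for y in weights[ref]
        )
        weights = updated
        if change < tol:
            break
    return weights


def interpret(weights):
    """Step 4: pick the option with the largest weight for each reference."""
    return {ref: max(w, key=w.get) for ref, w in weights.items()}


# Hypothetical toy system: two references whose strengths depend on each other.
STRENGTH_FNS = {
    "r1": {"a": lambda w: 0.2 + w["r2"]["a"], "b": lambda w: 0.1 + w["r2"]["b"]},
    "r2": {"a": lambda w: 0.2 + w["r1"]["a"], "b": lambda w: 0.1 + w["r1"]["b"]},
}
weights = solve_iteratively(STRENGTH_FNS)
answers = interpret(weights)
print(weights, answers)
```

For this toy system the iteration converges to weight 2/3 for option "a" in both references, so both are resolved to "a". Convergence is not guaranteed for arbitrary strength functions; the sketch simply stops after `max_rounds`.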
Experimental Setup
Parameters: when looking for L-short simple paths, L = 7 (L is the path-length limit)
RealPub dataset: CiteSeer + HPSearch
  publications (255K), authors (176K), organizations (13K), departments (25K)
  ground truth is not known; accuracy...
SynPub datasets: many datasets of two types, emulation of RealPub
  publications (5K), authors (1K), organizations (25K), departments (125K)
  ground truth is known
RealMov:
  movies (12K)
  people (22K): actors, directors, producers
  studios (1K): producing, distributing
Sample Publication Data
CiteSeer: publication records
HPSearch: author records
Efficiency and Long paths
Non-exponential cost Longer paths do help
Web Disambiguation
Music Composer
Football Player
UCSD Professor
Comedian
Botany Professor @ Idaho
Web Disambiguation
Extract key information such as mentions of entities (persons, names, locations) and other information such as hyperlinks and email addresses from Web pages
Cast as a relationship analysis problem
Prototype at: http://opteron.calit2.uci.edu:1977/Diamond/people_search.jsp
Extraction and Synthesis
Semantic extraction from text
Audio event extraction
Visual event extraction
Information extraction from text
Many systems and techniques
May benefit from semantics
Limitations: all-or-nothing extraction
Towards probabilistic extraction systems
Leads
Disambiguation and data cleaning: Dmitri Kalashnikov, Stella Chen, Rabia Nuray-Turan
Information extraction: Naveen Ashish, Sharad Mehrotra
Extraction and Synthesis
Semantic extraction from text
Audio event extraction
Visual event extraction
Multi-microphone speech processing: speaker identification, noise reduction
Audio-visual speech recognition: combine visual features (visemes) with audio
Speech recognition on light-weight devices
Team: Rajesh Hegde, Bhaskar Rao, Shankar Shivappa (UCSD)
Extraction and Synthesis
Semantic extraction from text
Audio event extraction
Visual event extraction
Combine views from multiple cameras
Homomorphic transformations
Multi-perspective “view-binding”
Team: Sangho Park, Mohan Trivedi (UCSD)
Situational Data Management
Spatial Indexing
Event data model
SAT-Ware
Outline
Overall goal
Use examples to illustrate: different approaches in modeling and querying; advantage of our approach
Extracting spatial expression
Building model for spatial expression
Experiments
Conclusion
Overall Goal
Goal: Situation Awareness from Textual Sources
[Figure: textual reports feeding a database.]
Info about events that constitute a crisis is often available as text.
Textual data during crisis: transcribed 911 calls, first responder communications
Textual data after crisis: first responders’ reports, Internet sources for post factum analysis
Motivating Examples
Two reports filed by first responders after the 9/11 attack:
“…the PAPD Mobile Command Post was located on West St. north of WTC …”
“…a PAPD Command Truck parked on the west side of Broadway St. and north of Vesey St….”
Query: Retrieve Events around WTC
Goal: Both events should be retrieved with high scores attached.
Approach 1: Using IR approach
Direct keyword retrieval: only one report mentions the keyword “WTC”
Query expansion based on nearby spatial objects, e.g. nearby streets and buildings…
Ad hoc, and the objects might not be bounded
Approach 2: Mapping Using Uncertain Region
Query: Near WTC
Report 1: West St. north of WTC
Report 2: west side of Broadway St. and north of Vesey St
Rank based on the ratio of intersection
Problem: the rank score is not accurate, because it rests on uniform assumptions
Our Approach
Step 1: Converting Text to Spatial Expression
S-expression: has a well-defined functional form
Near WTC → Near(WTC)
West St. north of WTC → On(West St.) ∧ North(WTC)
west side of Broadway St. and north of Vesey St → West(Broadway St.) ∧ North(Vesey St.)
Our Approach
Step 2: Mapping S-expression to probability density function (PDF)
[Figure: PDFs for Near(A) and for On(West St.) ∧ North(WTC).]
Answering Range Query
Given a query region, retrieve objects based on the degree of belonging
[Figure: PDFs for On(West St.) ∧ North(WTC) and West(Broadway St.) ∧ North(Vesey St.) against the query region.]
Consider location as a random variable
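To make the idea concrete, here is a small sketch of answering a range query over a probabilistic event location. Everything in it is an assumption made for illustration: the 4 × 4 km grid, the coordinates of the landmark, the Gaussian model for "near", the half-plane model for "north", and the cell-wise product used to combine two descriptors. The degree of belonging is taken to be the probability mass that falls inside the query rectangle.

```python
import math


def grid(size_km, step_km):
    """Cell centers of a square grid covering [0, size_km] x [0, size_km]."""
    n = round(size_km / step_km)
    return [((i + 0.5) * step_km, (j + 0.5) * step_km) for i in range(n) for j in range(n)]


def near(landmark, sigma):
    """Unnormalized Gaussian density around a landmark (a modeling assumption)."""
    lx, ly = landmark
    return lambda x, y: math.exp(-((x - lx) ** 2 + (y - ly) ** 2) / (2 * sigma ** 2))


def north(landmark):
    """Uniform over the half-plane north of the landmark (a modeling assumption)."""
    return lambda x, y: 1.0 if y > landmark[1] else 0.0


def to_pmf(density, cells):
    """Discretize a density into probabilities per grid cell."""
    mass = [density(x, y) for x, y in cells]
    total = sum(mass)
    return [m / total for m in mass]


def combine(pmf_a, pmf_b):
    """Combine two descriptors by cell-wise product, then renormalize."""
    product = [a * b for a, b in zip(pmf_a, pmf_b)]
    total = sum(product)
    return [p / total for p in product]


def range_probability(pmf, cells, region):
    """Probability that the event location falls inside the query rectangle."""
    (x0, y0), (x1, y1) = region
    return sum(p for p, (x, y) in zip(pmf, cells) if x0 <= x <= x1 and y0 <= y <= y1)


CELLS = grid(4.0, 0.1)
WTC = (2.0, 2.0)  # hypothetical coordinates in km
event = combine(to_pmf(near(WTC, 0.3), CELLS), to_pmf(north(WTC), CELLS))
p_north = range_probability(event, CELLS, ((1.5, 2.0), (2.5, 3.0)))
p_south = range_probability(event, CELLS, ((1.5, 1.0), (2.5, 2.0)))
print(p_north, p_south)
```

A query rectangle just north of the landmark receives most of the probability mass, while the rectangle just south of it receives none, so events can be ranked by these scores instead of by keyword overlap.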
Advantages of Our Approach
More explicit spatial mapping removes the need for keyword expansion (IR approach)
Probabilistic representation is more formal and accurate than the uncertain region (UR) approach
Decouples the extraction and modeling modules: better extraction and modeling modules can easily be plugged in
Extracting Spatial Expression
Step 1: Discovering landmarks (buildings, roads, intersections)
Step 2: Generating s-descriptors
  Use spatial relations to connect the landmarks
  Spatial relations: near, behind, between
  in the format D(L1, L2, ... , Ln)
Step 3: Generating s-expressions
  compositions of s-descriptors, e.g. near(A) ∧ near(B)
Step 1: Discovering landmarks
  Mark up the text with the landmarks
  Using gazetteers (incorporated into the information extractor, GATE)
  Note: not only the “name” is marked up; features are also attached
  Examples of Landmark
Step 2: Generating s-descriptors
  Discover spatial relations around the landmarks
  Dictionary approach (convert spatial relations to potential words)
  Machine learning techniques can also be used
  Examples of s-descriptors
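The dictionary approach can be illustrated with a few lines of Python. This is a toy sketch, not the GATE-based extractor the slides refer to: `GAZETTEER` and `RELATIONS` are hypothetical, hand-written dictionaries that cover only the two motivating reports, and a relation is recognized only when its phrase directly precedes a landmark name.

```python
import re

# Hypothetical gazetteer of landmark names and dictionary of relation phrases.
GAZETTEER = ["WTC", "West St.", "Broadway St.", "Vesey St."]
RELATIONS = {"north of": "North", "west side of": "West", "on": "On", "near": "Near"}


def extract_s_descriptors(text):
    """Return s-descriptors such as 'North(WTC)' found in the text."""
    landmark = "|".join(re.escape(name) for name in GAZETTEER)
    # Longer phrases first, so 'west side of' is preferred over shorter ones.
    relation = "|".join(re.escape(r) for r in sorted(RELATIONS, key=len, reverse=True))
    pattern = re.compile(rf"\b({relation})\s+({landmark})", re.IGNORECASE)
    return [f"{RELATIONS[rel.lower()]}({lm})" for rel, lm in pattern.findall(text)]


report_1 = "the PAPD Mobile Command Post was located on West St. north of WTC"
report_2 = ("a PAPD Command Truck parked on the west side of Broadway St. "
            "and north of Vesey St.")
print(extract_s_descriptors(report_1))
print(extract_s_descriptors(report_2))
```

The descriptors found in one report would then be composed into a single s-expression for that report.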
Modeling S-expression
Goal: generating a reasonable probabilistic representation for an s-expression
Step 1: Modeling s-descriptors
Step 2: Combining s-descriptors
Modeling S-descriptors
Modeling templates, e.g. uniform or normal distribution
Using parameter learning techniques
Generating s-expression
In an s-expression, we assume the s-descriptors are conditionally independent.
If an s-expression has 2 descriptors, S1, S2
It can be generalized to n descriptors, S1…Sn
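One way the conditional-independence assumption can be turned into a combination rule is sketched below. This is a reading under stated assumptions, not necessarily the formula on the original slide: f_i denotes the PDF of the location x modeled for descriptor S_i alone, and the prior over locations is taken to be uniform over the region of interest.

```latex
\begin{align*}
f(x \mid S_1, S_2) &\propto P(S_1, S_2 \mid x)\, f(x)
    && \text{(Bayes' rule)} \\
  &= P(S_1 \mid x)\, P(S_2 \mid x)\, f(x)
    && \text{(conditional independence)} \\
  &\propto f_1(x)\, f_2(x)
    && \text{(uniform prior, so } P(S_i \mid x) \propto f_i(x)\text{)} \\[4pt]
f(x \mid S_1, \dots, S_n) &= \frac{\prod_{i=1}^{n} f_i(x)}{\int \prod_{i=1}^{n} f_i(x')\, dx'}
    && \text{(generalization to } n \text{ descriptors)}
\end{align*}
```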
Generating s-expression
[Figure: PDFs for Near(A), Outdoor(), and their composition Outdoor() ∧ Near(WTC).]
Experimental Setup
Domain: real geographic dataset
  Manhattan, NY, near WTC
  buildings, streets, roads
  4 × 4 km²
Data: based on 164 reports by police officers, participants of 9/11
  s-expressions: near(A), on(A), outdoor
  intersections, buildings, street
  construct 2359 pdfs
Queries: 50 range queries
Simulate the Errors
Extraction errors: with human supervision, the error is small.
Modeling errors: even with supervision, model parameters can still be away from the ideal settings, e.g., the mean and variance settings for the Gaussian model.
We simulate two types of modeling errors for the analysts:
  Overly confident: the estimated model is too “tight” (by reducing the variance of the “ideal” Gaussian model)
  Not confident: the estimated model is too “loose” (by increasing the variance of the “ideal” Gaussian model)
Results
Even with large errors, probabilistic models are still better than bounding region methods
Conclusions
Ongoing work: database aspects of the problem, more types of queries
Future work: spatio-temporal aspects, better modeling (text to PDF)
Novel in this work: approach for mapping text to PDF
query requirements for SA apps
query design issues
representation of PDFs
Spatial Awareness from Textual Sources
Lead, spatial awareness: Yiming Ma
Situational Data Management
Spatial Indexing Event data model SAT-Ware
Analysis
Analysis and Visualization
Graph analysis
GIS
Predictive modeling
Damage assessment
Graph Analysis
SEMANTIC METADATA
DESCRIBED DATA
Semantic Graphs (Attributed graphs)
Entity-Relationship Schemas
Relations
Document Repositories
Taxonomies (“Reference Data”)
Ontologies (“Semantic Models”)
DBMS
Graph Pattern-Based Querying
Ranked Graph Pattern Matching
Multi-dimensional Analysis [For Documents]
Relationship Summarization/Exploration [Relations]
Graph Data Model (Entity-Attribute-Value Model)
Graph (edge sets, aka triple sets), e.g.:
  (&dawit ns:studentAt &UCI)
  (&UCI ns:type &university)
  (ns:university ns:subClassOf ns:organization)
Two kinds of nodes: object-ids and literals (e.g. integer, string, etc.)
Blank nodes, e.g. (&dawit :studentAt _)
Directed edges (aka predicates or properties): there exists only one edge with a given label between a pair of nodes
Symmetric representation of metadata + data
  Nodes: object classes or link classes
  Links: predicates on classes:
    (:studentAt :domain :person)
    (:studentAt :range :organization)
    (:university :subclassOf :organization)
Object identity + relationship identity: objects and relationships have unique ids (called URIs)
[Figure: edge ns:studentAt from &dawit to &UCI.]
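A minimal Python sketch of this triple-set model, showing that schema triples and data triples can sit in the same graph and be traversed the same way. The helper names (`objects`, `superclasses`, `types_of`) are made up for this illustration, and the class node is written `ns:university` throughout, which is an assumption about the intended naming.

```python
# One set of triples holds both data and schema (metadata).
GRAPH = {
    ("&dawit", "ns:studentAt", "&UCI"),                     # data
    ("&UCI", "ns:type", "ns:university"),                   # data -> schema
    ("ns:university", "ns:subClassOf", "ns:organization"),  # schema
    ("ns:studentAt", "ns:domain", "ns:person"),             # schema
    ("ns:studentAt", "ns:range", "ns:organization"),        # schema
}


def objects(graph, subject, predicate):
    """All o such that (subject, predicate, o) is in the graph."""
    return {o for s, p, o in graph if s == subject and p == predicate}


def superclasses(graph, cls):
    """cls plus every class reachable from it via ns:subClassOf edges."""
    seen, frontier = {cls}, [cls]
    while frontier:
        current = frontier.pop()
        for parent in objects(graph, current, "ns:subClassOf"):
            if parent not in seen:
                seen.add(parent)
                frontier.append(parent)
    return seen


def types_of(graph, node):
    """Declared types of a node, closed under ns:subClassOf (simple inference)."""
    result = set()
    for cls in objects(graph, node, "ns:type"):
        result |= superclasses(graph, cls)
    return result


print(types_of(GRAPH, "&UCI"))
print(objects(GRAPH, "ns:studentAt", "ns:range"))
```

The same `objects` lookup answers a data question (where does &dawit study) and a schema question (what is the range of ns:studentAt), which is the symmetry the slide points to.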
Graphs for actual data storage - beyond data modeling
Graphs are normally used for conceptual data modeling: the entity-relationship (ER) model
What is different? Using graphs for actual (minimally structured) data representation.
Why?
  Store/represent and query data without schema
  Symmetrically store/query both schema (ontology) and data
  Graph traversal based query + reasoning (inference)
  Multi-schema queries on the same graph
  Query unstructured data annotated with taxonomies/ontologies using traditional (structured) query operators
[Figure: MODEL and INSTANCE graphs. MODEL: (a) a publication schema with classes organization, researcher (author, editor), and publication (book, article, proceeding, chapter), and properties affiliates, produces (writesBook, writesArticle, editsProc, editsBook), inProceeding, refersTo, title, year, name, pages, org_name; (b) a book schema with properties price, list_price, rating; and a topic ontology (Comp. Sc, Info. Sys., Data, Interfaces, DB, IR, Encrypt., Data Struct., Online services, D. Lib., Systems, Languages, Distributed DB, Multimedia DB). INSTANCE: organizations &o1 and &o2 (org_name IBM, UCI); researchers &r1 to &r4 (names John, Alex, Sara); books &b1 (price 90, year 2003), &b2 (price 110, year 1998), &b3 (price 100, year 1998); article &a1; proceeding &p1. LEGEND: rdf:type, subClassOf/subPropertyOf; &o organization, &r researcher, &b book, &p proceeding, &a article.]
Graph Pattern based Querying
SELECT *
WHERE { ?org :affiliates ?aut .
        ?aut :produces ?b .
        ?b :type :book .
        ?b :price ?p .
        ?b ?pred ?x . }
Annotations:
  ?org, ?aut, ?b, ...: variables
  each line of the WHERE clause: a triple pattern
  ?org :affiliates ?aut queries schema (a)
  :produces is a super-class of writesBook
  ?b :price ?p uses schema (b)
  ?pred: variable on predicates, matches all applicable predicates
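How a graph pattern is matched against a triple set can be sketched in a few lines of Python. This is a naive nested-loop illustration, not the system described in the slides: it uses `:writesBook` directly instead of `:produces` (so no schema reasoning is needed), drops the `:type` and `?pred` patterns, and the instance triples in `GRAPH` are read off the example tables.

```python
GRAPH = [
    ("&o1", ":affiliates", "&r1"), ("&o1", ":affiliates", "&r2"),
    ("&o1", ":affiliates", "&r3"), ("&o2", ":affiliates", "&r2"),
    ("&r1", ":writesBook", "&b1"), ("&r2", ":writesBook", "&b1"),
    ("&r2", ":writesBook", "&b2"), ("&r3", ":writesBook", "&b3"),
    ("&b1", ":price", 90), ("&b2", ":price", 110), ("&b3", ":price", 100),
]


def is_var(term):
    return isinstance(term, str) and term.startswith("?")


def match_triple(pattern, triple, binding):
    """Extend binding so that pattern matches triple, or return None."""
    extended = dict(binding)
    for p, t in zip(pattern, triple):
        if is_var(p):
            if extended.setdefault(p, t) != t:
                return None  # variable already bound to a different value
        elif p != t:
            return None      # constant does not match
    return extended


def match(patterns, graph):
    """All variable bindings under which every triple pattern matches the graph."""
    bindings = [{}]
    for pattern in patterns:
        next_bindings = []
        for binding in bindings:
            for triple in graph:
                extended = match_triple(pattern, triple, binding)
                if extended is not None:
                    next_bindings.append(extended)
        bindings = next_bindings
    return bindings


QUERY = [("?org", ":affiliates", "?aut"), ("?aut", ":writesBook", "?b"), ("?b", ":price", "?p")]
rows = match(QUERY, GRAPH)
for row in rows:
    print(row)
```

Each resulting binding corresponds to one row of a result relation over ?org, ?aut, ?b, ?p.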
Graph Pattern based Querying
Enumerative semantics: the result is a relation (graph set)
SELECT *
WHERE { ?org :affiliates ?aut .
        ?aut :produces ?b .
        ?b :type :book .
        ?b :price ?p .
        ?b ?pred ?x . }
org   aut   book   price   year
&o1   &r1   &b1    90      2003
&o1   &r2   &b1    90      2003
&o1   &r2   &b2    110     1998
&o1   &r3   &b3    100     1998
&o2   &r2   &b1    90      2003
&o2   &r2   &b2    110     1998
Extractive semantics: the result is a graph
CONSTRUCT *
WHERE { ?org :affiliates ?aut .
        ?aut :produces ?b .
        ?b :type :book .
        ?b :price ?p .
        ?b ?pred ?x . }
&o1 :affiliates &r1
&r1 :produces &b1
&b1 :price 90
&b1 :year 2003
&o1 :affiliates &r2
...
Enumerative Algebra
Enumerative algebra: algebra over sets of variable bindings
Triple patterns: ?org :affiliates ?aut, ?aut :produces ?b, …
Bindings (per triple pattern):
  org   aut
  &o1   &r1
  &o1   &r2
  &o1   &r3

  aut   b
  &r1   &b1
  &r2   &b1
  &r2   &b2
Joinable bindings: same variable, same value.
Enumerative Algebra (ctd.)
Given two sets of bindings T1 and T2, and r denoting a binding:
T1 ∪ T2 = {r | r ∈ T1 or r ∈ T2}
T1 ⋈ T2 = {r1 ∪ r2 | r1 ∈ T1 and r2 ∈ T2 and r1 and r2 are joinable}
[Figure: result tables over ?org, ?aut, ?b for the example bindings; the join yields (&o1, &r1, &b1), (&o1, &r2, &b1), (&o1, &r2, &b2).]
Enumerative Algebra (ctd.)
match[P](G): matches the graph pattern P to graph G
Given P = {p1, p2, …, pm}:
match[P](G) = match[p1](G) ⋈ match[p2](G) ⋈ … ⋈ match[pm](G)
Result: sets of sets (tuples) of bindings
Enumerative Algebra (ctd.)
Other operators:
Difference: T1 \ T2 = {r ∈ T1 | for all r’ ∈ T2, r and r’ are not joinable}
Filter: σφ(T) evaluates the Boolean condition φ on T, e.g. φ is: ?p > 100.
Outer Join: T1 ⟕ T2 = (T1 ⋈ T2) ∪ (T1 \ T2)
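The operators of the enumerative algebra translate almost directly into Python when a binding is a dict from variable to value and a set of bindings is a list of such dicts. This is an illustrative sketch; the function names are chosen here, and bindings with no shared variable are treated as joinable, which is an assumption about the intended definition.

```python
def joinable(r1, r2):
    """Same variable, same value: the bindings agree on every shared variable."""
    return all(r2[var] == val for var, val in r1.items() if var in r2)


def union(t1, t2):
    return t1 + [r for r in t2 if r not in t1]


def join(t1, t2):
    return [{**r1, **r2} for r1 in t1 for r2 in t2 if joinable(r1, r2)]


def difference(t1, t2):
    return [r for r in t1 if all(not joinable(r, r2) for r2 in t2)]


def outer_join(t1, t2):
    return union(join(t1, t2), difference(t1, t2))


# Bindings of the two triple patterns from the example.
T1 = [{"?org": "&o1", "?aut": "&r1"}, {"?org": "&o1", "?aut": "&r2"},
      {"?org": "&o1", "?aut": "&r3"}]                      # ?org :affiliates ?aut
T2 = [{"?aut": "&r1", "?b": "&b1"}, {"?aut": "&r2", "?b": "&b1"},
      {"?aut": "&r2", "?b": "&b2"}]                        # ?aut :produces ?b

print(join(T1, T2))
print(difference(T1, T2))
print(outer_join(T1, T2))
```

The join drops &r3, who produced nothing; the outer join keeps that binding with ?b left unbound.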
Extractive Algebra
Given two graphs G1 and G2, and t denoting a triple:
G1 ∪ G2 = {t | t ∈ G1 or t ∈ G2}
Example: the union of the triples matching ?org :affiliates ?aut and ?aut :produces ?b
  &o1 :aff &r1        &r1 :prod &b1
  &o1 :aff &r2        &r2 :prod &b1
  &o1 :aff &r3        &r2 :prod &b2
• Matching retains structure
• More compact representation during implementation
Extractive Algebra (ctd.)
G1 ⋈˄p G2 = {G1 ∪ G2 | conditions 1 and 2 hold}, where p = (p1, p2), i.e. a pair of triple patterns:
1. For all t1 ∈ G1, either there exists t2 ∈ G2 such that t1 and t2 are joinable by p, or t1 does not match p1 ∈ p.
2. For all t2 ∈ G2, either there exists t1 ∈ G1 such that t2 and t1 are joinable by p, or t2 does not match p2 ∈ p.
[Figure: ⋈((?org :affiliates ?aut),(?aut :produces ?b)) applied to the triples &o1 :aff &r1, &o1 :aff &r2, &o1 :aff &r3 and &r1 :prod &b1, &r2 :prod &b1, &r2 :prod &b2.]
Extractive Algebra (ctd.)
Graph pattern: ?org :affiliates ?aut . ?aut :produces ?b . ?b :price ?p . ?b ?pred ?x
[Figure: ⋈((?aut :produces ?b),(?b :price ?p)) applied to the triples &o1 :aff &r1, &o1 :aff &r2, &r1 :prod &b1, &r2 :prod &b1, &r2 :prod &b2, &b1 :price 90, &b3 :price 110, &b1 :year 2003, &b3 :year 1998. The triples &r2 :prod &b2 and &b3 :price 110 have no match and are dropped.]
Extractive Algebra (ctd.)
[Figure: ⋈((?aut :produces ?b),(?b ?pred ?x)) applied to the same input. In addition, &b3 :year 1998 is dropped, leaving &o1 :aff &r1, &o1 :aff &r2, &r1 :prod &b1, &r2 :prod &b1, &b1 :price 90, &b1 :year 2003.]
Extractive Algebra (ctd.)
extract[P](G): matches the graph pattern P
Given P = {p1, p2, …, pm}:
extract[P](G) = match[p1](G) ⋈˄ match[p2](G) ⋈˄ … ⋈˄ match[pm](G)
Result: a graph
Extractive Algebra (ctd.)
Other operations:
Difference: G1 \ G2 = {t | t ∈ G1 and t ∉ G2}
Filter: σφ(G) = G \ {t | φ(t) ≠ true}
Implementing Extract – Naïve/Join-split
As a post-process of enumerative matching:
  Do enumerative matching (produces a joined relation)
  Vertically split the join result into triples
IO cost for a pair of triple sets:
  2 reads of triple sets
  + 1 write of joined result
  + 2 reads of join result (one for each split/projection)
  + 2 writes of projected result
  + 2 reads of the projected triple sets
  + 1 write of unioned result
  Total: 6 reads and 4 writes (4 reads and 3 writes if no union)
Implementing Extract – 2-way semi-joins
Use 2-way semi-joins.
Given two joinable triple sets A and B:
[Figure: A and B are reduced to A’ and B’, which are combined by ⋃]
IO cost:
  2 reads of the triple sets (first semi-join)
  1 write of the result to the union (writes the smaller table)
  2 reads to perform the next semi-join (1 read is on the smaller table)
  1 write of the result to the union
Total: 4 reads and 2 writes.
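A possible in-memory sketch of the two semi-joins in Python is shown below. The second semi-join runs against the already reduced set, mirroring the "1 read is on smaller table" remark. The key functions (which position of a triple carries the join value) are assumptions introduced for the example.

```python
def semijoin(left, right, left_key, right_key):
    """Triples of left whose join key also occurs in right."""
    keys = {right_key(t) for t in right}
    return {t for t in left if left_key(t) in keys}

def extract_pair(a, b, a_key, b_key):
    a_reduced = semijoin(a, b, a_key, b_key)          # A' = A semi-join B
    b_reduced = semijoin(b, a_reduced, b_key, a_key)  # B' = B semi-join A'
    return a_reduced | b_reduced                      # A' union B'

a = {("&r1", ":prod", "&b1"), ("&r2", ":prod", "&b1"), ("&r2", ":prod", "&b2")}
b = {("&b1", ":price", 90), ("&b3", ":price", 110)}

# Join the object of a :prod triple with the subject of a :price triple.
result = extract_pair(a, b, a_key=lambda t: t[2], b_key=lambda t: t[0])
```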
Implementing Extract – 2-stream operator
Scan each input and produce the triples that have at least one match in the other.
A high-level operator that can be implemented via hashing or sort-merge.
[Figure: A ⋈˄ B produces A’ and B’]
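As an illustration of the sort-merge option, the Python sketch below sorts both inputs on their join key and merges them once, emitting two output streams A’ and B’. It assumes in-memory lists and comparable keys; the function name and key functions are hypothetical.

```python
def two_stream_sort_merge(a, b, a_key, b_key):
    """Return (A', B'): triples of each input with at least one match in the other."""
    a_sorted = sorted(a, key=a_key)
    b_sorted = sorted(b, key=b_key)
    a_out, b_out = [], []
    i = j = 0
    while i < len(a_sorted) and j < len(b_sorted):
        ka, kb = a_key(a_sorted[i]), b_key(b_sorted[j])
        if ka < kb:
            i += 1
        elif ka > kb:
            j += 1
        else:
            # Emit the whole run of equal keys from both inputs.
            while i < len(a_sorted) and a_key(a_sorted[i]) == ka:
                a_out.append(a_sorted[i])
                i += 1
            while j < len(b_sorted) and b_key(b_sorted[j]) == ka:
                b_out.append(b_sorted[j])
                j += 1
    return a_out, b_out

a = [("&r1", ":prod", "&b1"), ("&r2", ":prod", "&b1"), ("&r2", ":prod", "&b2")]
b = [("&b1", ":price", 90), ("&b3", ":price", 110)]

a_out, b_out = two_stream_sort_merge(a, b, lambda t: t[2], lambda t: t[0])
```

Unlike the two separate semi-joins, both outputs come from a single pass over each sorted input.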
Grouping and Aggregation: Flatten-and-Aggregate Approach
SELECT ?org, sum(?p) as totalPrice
WHERE { ?org :affiliates ?aut .
        ?aut :writesBook ?b .
        ?b :price ?p }
GROUP BY ?org
[Figure: example graph with organizations &o1 and &o2, authors &r1, &r2, &r3 and books &b1, &b2, &b3, linked by affiliates and writesBook edges; book prices 90, 110, 100]
Flattened match result:
org   aut   book   price   year
&o1   &r1   &b1    90      2003
&o1   &r2   &b1    90      2003
&o1   &r2   &b2    110     1998
&o1   &r3   &b3    100     1998
This is how Oracle supports aggregation over graph data! See also [Hung, Deng, and Subrahmanian, ICDE 2005].
Group and aggregate the enumerative-match results.
Result: 390. WRONG! (The price of &b1 is counted twice, once for each of its authors.)
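The wrong total can be reproduced directly from the flattened rows. The Python sketch below sums the price column per organization, as a relational GROUP BY would, and contrasts it with a sum that counts each book once. The second function is only an illustration of the intended answer, not the graph-based method the following slides describe.

```python
from collections import defaultdict

# Rows of the flattened match result: (org, aut, book, price, year).
rows = [
    ("&o1", "&r1", "&b1", 90, 2003),
    ("&o1", "&r2", "&b1", 90, 2003),
    ("&o1", "&r2", "&b2", 110, 1998),
    ("&o1", "&r3", "&b3", 100, 1998),
]

def flatten_and_aggregate(rows):
    """Sum price per org over the flattened rows (double-counts shared books)."""
    totals = defaultdict(int)
    for org, _aut, _book, price, _year in rows:
        totals[org] += price
    return dict(totals)

def aggregate_distinct_books(rows):
    """Sum price per org, counting each (org, book) pair only once."""
    seen = set()
    totals = defaultdict(int)
    for org, _aut, book, price, _year in rows:
        if (org, book) not in seen:
            seen.add((org, book))
            totals[org] += price
    return dict(totals)

flat_total = flatten_and_aggregate(rows)
distinct_total = aggregate_distinct_books(rows)
```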
Group By
Should be based on extractive matching (graphs).
What should group-by mean on graphs? Collapse a set of triples into a single triple.
Use Bag nodes.
[Figure: the example graph after grouping; the writesBook edge of each author &r1, &r2, &r3 points to a node of type Bag whose members (:1, :2) are that author's books]
CONSTRUCT *
WHERE { ?org :affiliates ?aut .
        ?aut :writesBook ?b .
        ?b :price ?p }
GROUP BY ?aut ON :writesBook
[Annotations on the GROUP BY clause: Grouping Target, Grouping Basis]
Aggregation
Two types (modes) of aggregation on graphs:
  Branch-wise: aggregate a set of values adjacent to a node type
  Path-wise: aggregate over a path in the graph (not discussed here)
Branch-wise example:
SELECT ?b, branch sum(:price) as totalPrice
WHERE { ?org :affiliates ?aut .
        ?aut :writesBook ?b .
        ?b :price ?p }
[Annotations on the SELECT clause: Anchor, Mode, Aggregation basis label]
[Figure: books &b1, &b2, &b3, each with a price edge (90, 110, 100) and a year edge (2003, 1998, 1998)]
Aggregation – revisit example
SELECT ?org, branch sum(:price) as totalPrice
WHERE { ?org :affiliates ?aut .
        ?aut :writesBook ?b .
        ?b :price ?p }
GROUP BY ?org
[Annotations: Anchor, Mode, Aggregation basis label; the GROUP BY clause is marked Optional]
Anchor and aggregation basis are not adjacent!
[Figure: the example graph with organizations &o1 and &o2, authors &r1, &r2, &r3 and books &b1, &b2, &b3, linked by affiliates and writesBook edges, with price edges 90, 110, 100 on the books]
Aggregation – solution
[Figure: the example graph with the nodes between the organizations &o1, &o2 and the prices 90, 110, 100 collapsed into Bag nodes (members :1, :2)]
RULE: All nodes between anchor and aggregation basis should be bags!
If anchor and aggregation basis are adjacent, push the aggregation into the group-by.
Otherwise, iteratively perform graph grouping with edge propagation, making each intermediary node an aggregation target.
Result: &o1, 300; &o2, 200.
Lead: Dawit Yimam Seid
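A rough Python sketch of the bag-based rule follows, under two loud assumptions: a bag is modelled as a set of node identifiers (so a book shared by two authors appears once), and "edge propagation" is modelled as replacing each bag member by the members of its own bag. Only the &o1 edges can be recovered from the flattened table in the transcript, so the &o2 result is not reproduced. All function names are hypothetical.

```python
# Edges recoverable from the flattened match table (only &o1 appears there).
affiliates = {"&o1": ["&r1", "&r2", "&r3"]}
writes_book = {"&r1": ["&b1"], "&r2": ["&b1", "&b2"], "&r3": ["&b3"]}
price = {"&b1": 90, "&b2": 110, "&b3": 100}

def group_into_bags(edges):
    """Collapse each source node's outgoing edges into one bag of targets."""
    return {src: frozenset(targets) for src, targets in edges.items()}

def propagate(outer_bags, inner_bags):
    """Replace each member of an outer bag by the members of its inner bag."""
    return {
        src: frozenset(x for m in members for x in inner_bags.get(m, ()))
        for src, members in outer_bags.items()
    }

book_bags = group_into_bags(writes_book)       # author -> bag of books
author_bags = group_into_bags(affiliates)      # org -> bag of authors
org_books = propagate(author_bags, book_bags)  # org -> bag of books

# Anchor (?org) and aggregation basis (:price) are now adjacent via the bag.
totals = {org: sum(price[b] for b in books) for org, books in org_books.items()}
```

With every intermediate node turned into a bag, &b1 reaches &o1 once, giving 300 instead of the flattened 390.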
Analysis and Visualization
Graph analysis, GIS, predictive modeling, damage assessment
Ram Hariharan (with Sharad Mehrotra and Chen Li) Searching (open source) GIS data and datasets
Metadata Compression
Vibhav Gogate and Jon Hutchinson (with Padhraic Smyth)
Activity monitoring and prediction Anomalous event detection
ImageCat Inc (Ron Eguchi, Charles Huyck) INLET, MetaSIM
Artifacts
Many Communities – Many Disaster Portals
Contents of sites are administered by the respective city emergency management.
Easily customized to meet the needs of different communities.
Regional summarization capabilities built in (e.g. county/state-level summary view).
Objectives of the Disaster Portal project are to provide:
An integrated platform for RESCUE team members to develop, test, and demonstrate their research projects in real-life scenarios.
Next-generation capabilities to first responders and the public.
Key development partner: City of Ontario
The Disaster Portal is a suite of web applications for disseminating information and providing situational awareness to the general public during a disaster.
Disaster Portal
Community Deployment of Disaster Portal
Applications selected from Disaster Portal suite.
Portal framework providing situation summary page, custom look-and-feel
http://www.disasterportal.org:8380/Ontario/
Applications Available in Disaster Portal Suite, with related Research Topics
Crisis Alerts: Key contacts at companies/organizations can sign up for customized information updates via web or phone. Research topic: scalable rapid dissemination.
Donation Management: Individuals and organizations post needs and donations; helps coordinate the matching process. Research topic: complex publish-subscribe systems.
Family Reunification: Search for contact info of a displaced family member. Research topic: information extraction & data cleaning.
Shelter Information: Announcements and status information for open emergency shelters.
Travel Planning: Current and predicted traffic conditions. Research topic: activity modeling algorithms.
Disaster-Oriented Web Search: Find information not already included in the site. Research topic: multidimensional analysis algorithms.
[Legend: Included in Ontario Pilot Disaster Portal]
Disaster Portal
SAMI: Situational awareness systems
Extraction and synthesis: semantic extraction from text, audio-visual extraction
Data management: E event model, SAT-ware, spatial indexing
Analysis: graph analysis, geospatial, predictive modeling, damage assessment
Conclusions
Situational data management
Semantics
Synergies
Integrated demonstration
Thank you !
ashish@ics.uci.edu