1
QSX: Querying Social Graphs
Querying big graphs
Parallel query processing
Boundedly evaluable queries
Query-preserving graph compression
Query answering using views
Bounded incremental query evaluation
2
How to make big graphs small
Input: A class Q of queries
Question: Can we effectively find, given a query Q ∈ Q and any (possibly big) graph G, a small graph GQ such that
Q(G) = Q(GQ)?
Effective methods for making big graphs small
Distributed query processing
Boundedly evaluable graph queries
Query preserving graph compression
Query answering using views
Bounded incremental evaluation
[Figure: Q(G) evaluated on G versus Q(GQ) evaluated on GQ, where GQ is much smaller than G]
Graph pattern matching by graph simulation
Input: A directed graph G, and a graph pattern Q Output: the maximum simulation relation R
Using views? Incremental?
3
Maximum simulation relation: always exists and is unique
• If a match relation exists, then there exists a maximum one
• Otherwise, it is the empty set – still maximum
Complexity: O((|V| + |VQ|)(|E| + |EQ|))
The output is a unique relation, possibly of size |Q||V|
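A minimal sketch of graph simulation as a fixpoint computation, to make the definition concrete. It is a naive refinement loop rather than the O((|V| + |VQ|)(|E| + |EQ|)) algorithm quoted above, and the function name, the dictionary-based graph encoding and the sample data are assumptions made for illustration.

```python
def max_simulation(q_labels, q_edges, g_labels, g_edges):
    """Return the maximum simulation relation as a set of (pattern node, graph node) pairs.

    q_labels / g_labels map node ids to labels; q_edges / g_edges are sets of (src, dst).
    """
    g_succ = {v: set() for v in g_labels}
    for v, w in g_edges:
        g_succ[v].add(w)
    # Start from all label-compatible pairs.
    sim = {u: {v for v in g_labels if g_labels[v] == q_labels[u]} for u in q_labels}
    changed = True
    while changed:
        changed = False
        for u, u2 in q_edges:
            for v in list(sim[u]):
                # v needs at least one successor that still matches u2.
                if not (g_succ[v] & sim[u2]):
                    sim[u].discard(v)
                    changed = True
    if any(not matches for matches in sim.values()):
        return set()  # some pattern node has no match: the relation is empty
    return {(u, v) for u in sim for v in sim[u]}


# Hypothetical data: pattern CTO -> DB, graph with two CTO nodes.
q_labels = {"cto": "CTO", "db": "DB"}
q_edges = {("cto", "db")}
g_labels = {"Ann": "CTO", "Don": "CTO", "Pat": "DB"}
g_edges = {("Ann", "Pat")}
relation = max_simulation(q_labels, q_edges, g_labels, g_edges)
```

Here Don is dropped because it has no DB successor, so the relation keeps only Ann and Pat.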
Graph pattern matching using views
4
Answering queries using views
The complexity is no longer a function of |G|
can we compute Q(G) without accessing G, i.e., independent of |G|?
The cost of query processing: f(|G|, |Q|)
Query answering using views: given a query Q in a language L
and a set V of views, find another query Q’ such that
Q and Q’ are equivalent
Q’ only accesses V(G )
for any G, Q(G) = Q’(G)
Answering graph pattern queries on big social graphs:
Regardless of how big G is – the cost is “independent” of G
V(G ) is often much smaller than G (4% -- 12% on real-life data)
[Figure: Q(G) evaluated on G versus Q’(V(G)) evaluated on V(G)]
Querying collaborative network
6
[Figure: a collaborative pattern over project manager, customer and developer nodes (query 1 and query 2), matched against a collaborative (chat) network with PM 1, PM 2, customers 1 to n, developers 1 to k and a tester. Matching directly on the network is expensive!]
Detecting Coordination Problems in Collaborative Software Development Environments,
Amrit Chintan et al, Information System management, 2010
views
Answering query using views
7
[Figure: query Q on database D gives query result Q(D); query A on the database views V(D) gives query result A(V)]
[Figure: timeline, 1995 to 2011, of work on answering queries using views: relational algebra, regular path queries, XPath, tree pattern queries, XML, RDF/SPARQL, and graph pattern queries via simulation]
When possible?
What to choose?
How to evaluate?
A classical technique, but still in its infancy for graphs
When a pattern can be matched using views
Pattern containment: a characterization
A necessary and sufficient condition
Pattern containment
9
[Figure: a query over customer, developer and project manager; View 1 (customer, developer, project manager) and View 2 (customer, developer)]
(customer, developer): {(customer 2, developer 2), (customer 3, developer 3)}
(developer, customer): {(developer 2, customer 2), (developer 2, customer 3), (developer 3, customer 2)}
(project manager, developer): {(PM 1, developer 2), (PM 2, developer 3)}
(project manager, customer): {(PM 1, customer 2), (PM 2, customer 2)}
(project manager, developer): (PM 1, developer 2)
(project manager, customer): (PM 1, customer 2)
(developer, customer): (developer 2, customer 2)
(customer, developer): (customer 2, developer 2)
Query result
How to determine the existence of λ?
Determining Pattern containment
10
NP-complete for relational conjunctive queries, undecidable for relational algebra
A practical characterization: patterns are small in practice
Pattern containment: example
11
[Figure: View 1 (customer, developer, project manager) and View 2 (customer, developer) are matched against the query treated as a “data graph”; the view matches give the mapping λ]
V: the set of views; Q: query
Query containment: given Q and Q’, determine whether for any graph G, Q(G) is contained in Q’(G). A classical problem. What is its complexity for pattern queries?
efficient
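The characterization above can be sketched as follows: treat the query as a data graph, compute the match of each view on it by simulation, and check that the view matches together cover every query edge. This is a hedged illustration only. The function names, the (labels, edges) encoding, and the edge sets of the query and of the two views are assumptions, not taken from the slides.

```python
def simulate(p_labels, p_edges, g_labels, g_edges):
    """Naive maximum simulation of pattern p in graph g; returns None if p has no match."""
    sim = {u: {v for v in g_labels if g_labels[v] == p_labels[u]} for u in p_labels}
    changed = True
    while changed:
        changed = False
        for u, u2 in p_edges:
            for v in list(sim[u]):
                if not any((v, w) in g_edges for w in sim[u2]):
                    sim[u].discard(v)
                    changed = True
    return None if any(not s for s in sim.values()) else sim


def view_match(view, query):
    """Query edges covered by the view, each mapped to the view edges that cover it."""
    (v_labels, v_edges), (q_labels, q_edges) = view, query
    sim = simulate(v_labels, v_edges, q_labels, q_edges)  # query plays the data graph
    covered = {}
    if sim is not None:
        for x, y in v_edges:
            for a, b in q_edges:
                if a in sim[x] and b in sim[y]:
                    covered.setdefault((a, b), []).append((x, y))
    return covered


def contained(query, views):
    """Return a mapping lambda (query edge -> [(view name, view edge)]) or None."""
    lam = {}
    for name, view in views.items():
        for edge, view_edges in view_match(view, query).items():
            lam.setdefault(edge, []).extend((name, ve) for ve in view_edges)
    return lam if set(lam) == set(query[1]) else None


# Assumed shapes for the collaborative-pattern example.
query = ({"pm": "PM", "dev": "developer", "cust": "customer"},
         {("pm", "dev"), ("pm", "cust"), ("dev", "cust"), ("cust", "dev")})
views = {
    "view1": ({"p": "PM", "d": "developer", "c": "customer"}, {("p", "d"), ("p", "c")}),
    "view2": ({"x": "customer", "y": "developer"}, {("x", "y"), ("y", "x")}),
}
lam = contained(query, views)
```

With both views every query edge is covered, so a mapping is returned; with View 1 alone the two customer/developer edges stay uncovered and the result is None.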
Test: Pattern query containment
Pattern query
[Figure: a pattern query over PM, DBA and PRG nodes; View 1 with edges e1 and e2; View 2 with edges e3 and e4]
It takes 0.5 seconds to check containment of large cyclic patterns
Query evaluation using views
13
Input: pattern query Q, graph G, a set of views V and extensions in G, and a mapping λ
Output: Find the query result Q(G)
Algorithm
◦ Collect edge matches for each query edge e and λ(e)
◦ Iteratively remove non-matches until no change happens
◦ Return Q(G)
Q(G) can be evaluated in O(|Q||V(G)| + |V(G)|^2) time
Recall simulation algorithm
More efficient. Why?
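The three steps of the algorithm can be sketched as below. This is an illustrative version under assumptions: λ maps each query edge to a single view edge, view extensions are given as sets of graph edges, and all names are hypothetical. The data is the collaborative example from these slides with abbreviated node names. Note that the graph G itself never appears as an input.

```python
def eval_with_views(q_edges, lam, extensions):
    """Compute edge matches of Q from view extensions only (G is never read)."""
    # Step 1: collect candidate matches for each query edge e from lambda(e).
    matches = {e: set(extensions[lam[e]]) for e in q_edges}

    def has_witnesses(u, v):
        # v is a valid match of u if it has an outgoing match for every out-edge of u.
        return all(any(src == v for src, _ in matches[e]) for e in q_edges if e[0] == u)

    # Step 2: iteratively remove non-matches until nothing changes.
    changed = True
    while changed:
        changed = False
        for u, u2 in q_edges:
            for v, w in list(matches[(u, u2)]):
                if not (has_witnesses(u, v) and has_witnesses(u2, w)):
                    matches[(u, u2)].discard((v, w))
                    changed = True
    # Step 3: return Q(G); it is empty if some query edge has no match.
    return {} if any(not m for m in matches.values()) else matches


q_edges = [("pm", "dev"), ("pm", "cust"), ("dev", "cust"), ("cust", "dev")]
lam = {e: e for e in q_edges}  # assumed: view edges carry the same names
extensions = {
    ("cust", "dev"): {("c2", "d2"), ("c3", "d3")},
    ("dev", "cust"): {("d2", "c2"), ("d2", "c3"), ("d3", "c2")},
    ("pm", "dev"): {("PM1", "d2"), ("PM2", "d3")},
    ("pm", "cust"): {("PM1", "c2"), ("PM2", "c2")},
}
result = eval_with_views(q_edges, lam, extensions)
```

The loop only ever touches the view extensions, which is why the cost depends on |V(G)| and |Q| rather than on |G|.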
Query evaluation using views
14
[Figure: a query over customer, developer and project manager; View 1 (customer, developer, project manager) and View 2 (customer, developer)]
(customer, developer): {(customer 2, developer 2), (customer 3, developer 3)}
(developer, customer): {(developer 2, customer 2), (developer 2, customer 3), (developer 3, customer 2)}
(project manager, developer): {(PM 1, developer 2), (PM 2, developer 3)}
(project manager, customer): {(PM 1, customer 2), (PM 2, customer 2)}
(project manager, developer): {(PM 1, developer 2), (PM 2, developer 3)}
(project manager, customer): {(PM 1, customer 2), (PM 2, customer 2)}
(developer, customer): {(developer 2, customer 2), (developer 2, customer 3), (developer 3, customer 2)}
(customer, developer): {(customer 2, developer 2), (customer 3, developer 3)}
Query result
“bottom-up” strategy
Without accessing the underlying big graph G
4% -- 12% of G
Are we done yet?
What views to choose?
15
[Figure: a query over customer, developer, project manager, software and tester nodes, with six candidate views (view 1 to view 6)]
choose all?
Why do we care?
efficiency
Minimum containment
16
Minimum containment is NP-complete
◦ APX-hard as optimization
What can we do?
Give two options
A log|Ep|-approximation
17
Idea: greedily select views V that “cover” more query edges
Ec: already covered
To decide whether to include a particular view V
Approximation: performance guarantees
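The greedy idea can be sketched as a set-cover loop over view matches. The selection metric used here, the number of query edges a view newly covers, is an assumption; the slides do not spell out the metric. The function name and the sample data are hypothetical.

```python
def greedy_minimum_containment(ep, view_matches):
    """Greedily pick views until every query edge in ep is covered.

    view_matches maps a view name to the set of query edges it covers.
    Returns the chosen view names, or [] if the views cannot cover ep.
    """
    covered, chosen = set(), []  # covered plays the role of Ec
    while covered != ep:
        # Pick the view that covers the most edges not yet in Ec.
        best = max(view_matches, key=lambda v: len(view_matches[v] - covered), default=None)
        gain = (view_matches[best] - covered) if best is not None else set()
        if not gain:
            return []  # no view extends Ec: Q is not contained in the views
        chosen.append(best)
        covered |= gain
    return chosen


ep = {"e1", "e2", "e3", "e4"}
view_matches = {"v1": {"e1", "e2"}, "v2": {"e3"}, "v3": {"e1", "e2", "e3"}, "v4": {"e4"}}
selection = greedy_minimum_containment(ep, view_matches)
```

On this input the loop picks v3 first (three new edges) and then v4, giving two views instead of all four.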
Minimum containment: example
18
[Figure: the query and candidate views 1 to 6 over customer, developer, project manager, software and tester nodes; Ec marks the query edges already covered]
Greedy: based on the metric
Minimal containment
19
Algorithm
◦ Computes the view match for each view
◦ Iteratively selects a view that extends Ec
◦ Repeats until Ec = Ep, or returns the empty set
O(|Q|^2 card(V) + |V|^2 + |Q||V|) time
new addition
Minimal containment is in PTIME
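A sketch of the minimal-containment steps: add any view that extends Ec, then drop views that have become redundant. The order in which views are scanned and the function and variable names are assumptions for illustration.

```python
def minimal_containment(ep, view_matches):
    """Return a subset of views covering ep such that no chosen view can be dropped."""
    covered, chosen = set(), []  # covered plays the role of Ec
    for view, match in view_matches.items():
        if match - covered:  # the view extends Ec
            chosen.append(view)
            covered |= match
        if covered >= ep:
            break
    if not covered >= ep:
        return []  # the views do not contain the query
    # Eliminate redundant views: drop a view if the others still cover ep.
    for view in list(chosen):
        rest = set().union(*(view_matches[w] for w in chosen if w != view))
        if rest >= ep:
            chosen.remove(view)
    return chosen


ep = {"e1", "e2", "e3"}
view_matches = {"v1": {"e1"}, "v2": {"e1", "e2"}, "v3": {"e3"}}
selection = minimal_containment(ep, view_matches)
```

Here v1 is selected first but becomes redundant once v2 is added, so the elimination pass removes it. The result is minimal, not necessarily minimum.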
Minimal containment: example
20
[Figure: the query and candidate views 1 to 6 over customer, developer, project manager, software and tester nodes]
Eliminate redundant views
Putting together
21
Problem               Complexity                             Algorithm
containment           PTIME                                  O(card(V)|Q|^2 + |V|^2 + |Q||V|)
minimum containment   NP-c/APX-hard, log|Ep|-approximable    O(card(V)|Q|^2 + |V|^2 + |Q||V| + |Q|card(V)^(3/2))
minimal containment   PTIME                                  O(card(V)|Q|^2 + |V|^2 + |Q||V|)
evaluation            PTIME                                  O(|Q||V(G)| + |V(G)|^2)
characterization: a sufficient and necessary condition for deciding whether a query can be answered using a set of views
evaluation: how to evaluate queries using views
view selection: what views to choose for answering queries
The study is still in its infancy for graph queries
Subgraph isomorphism?
View maintenance?
Improvement: 23 times faster
Bounded incremental graph pattern matching
22
Incremental query answering
Minimizing unnecessary recomputation
Incremental query processing:
Input: Q, G, Q(G), ∆G
Output: ∆M such that Q(G⊕∆G) = Q(G) ⊕ ∆M
New output: Q(G⊕∆G); old output: Q(G)
Changes to the input: ∆G; changes to the output: ∆M
When changes ∆G to the graph G are small, typically so are the
changes ∆M to the output Q(G⊕∆G)
Changes ∆G are typically small
Compute Q(G) once, and then incrementally maintain it
Real-life data is dynamic – constantly changes, ∆G
Re-compute Q(G⊕∆G) starting from scratch?
5%/week in
Web graphs
Complexity of incremental problems
Bounded: the cost is expressible as f(|CHANGED|, |Q|)?
Optimal: in O(|CHANGED| + |Q|)?
Complexity analysis in terms of the size of changes
Incremental query answering
Input: Q, G, Q(G), ∆G
Output: ∆M such that Q(G⊕∆G) = Q(G) ⊕ ∆M
The cost of query processing: a function of |G| and |Q|
incremental algorithms: |CHANGED|, the size of changes in
• the input: ∆G, and
• the output: ∆M
The updating cost that is inherent to the incremental problem itself
The amount of work absolutely necessary to perform for any incremental algorithm
Incremental algorithms?
Incremental graph simulation: bounded
G. Ramalingam, Thomas W. Reps: On the Computational Complexity of Dynamic Graph Problems. TCS 158(1&2), 1996
24
Why study incremental query answering?
View maintenance: in response to changes to the underlying
graph
Compressed graphs: maintenance in the presence of changes
Indexing structure: 2-hop covers
An important issue
Incremental query answering
Input: Q, G, Q(G), ∆G
Output: ∆M such that Q(G⊕∆G) = Q(G) ⊕ ∆M
E-commerce systems: a fixed set of (parameterized) queries
– Repeatedly invoked and evaluated
One of the important issues for querying big graphs
|CHANGED|: the affected area
Result graphs: Gr = (Vr, Er) for graph simulation
26
[Figure: pattern Q and the result graph obtained by simulation over Ann (CTO), Pat (DB), John (DB), Bill (Bio) and Mat (Bio)]
Vr : the nodes in G that match pattern nodes in Q Er: the paths in G that match edges in Q
Affected Area (AFF)
• the difference between Gr and Gr’
• the size of changes in the output
The complexity and boundedness analyses of incremental matching
the result graph of Q(G⊕∆G)
|CHANGED| = |∆G| + |AFF|
the result graph of Q(G)
Incremental graph pattern matching
27
[Figure: pattern Q over CTO, DB and Bio nodes; data graph G and its result graph Gr; ∆G inserts edges e1 to e5 one at a time]
Comparing the cost of incremental matching with its batch counterpart
affected area
27
Incremental simulation matching
Input: Q, G, Q(G), ∆G
Output: ∆M such that Q(G ⊕ ∆G) = Q(G) ⊕ ∆M
2 times faster than its batch counterpart for changes up to 10%
Optimal, in O(|AFF|) time, for
– single-edge deletions and general patterns
– single-edge insertions and DAG patterns
General patterns and graphs; batch updates: unbounded
Incremental simulation is in O(|∆G|(|Q||AFF| + |AFF|^2)) time
Batch updates
Semi-boundedness
Incremental simulation is in
29
Semi-boundedness is good enough!
Independent of | G |
Semi-bounded: the cost is a PTIME function f(|CHANGED|, |Q|)
| Q | is small
O(|∆G|(|Q||AFF| + |AFF|^2)) time
for batch updates and general patterns
unit deletions and general patterns: Algorithm IncMatch
optimal with the size of changes
[Figure: pattern Q over CTO, DB and Bio nodes; data graph G and result graph Gr; deleting edge e6 yields the affected area ∆Gr]
1. identify s-s edges
2. find invalid match
3. propagate affected area and refine matches
Incremental Simulation: optimal results
ss edge: e = (v, v’), where v and v’ are both matches
Use a stack, upward propagation
Linear time wrt. the size of changes
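The three deletion steps can be sketched with an explicit stack, as hinted above. This is not the IncMatch algorithm itself, only a simplified illustration under assumptions: the match relation is stored per pattern node, adjacency is kept as successor and predecessor sets, and all names and the sample graph are hypothetical.

```python
def inc_delete(q_succ, q_pred, succ, pred, sim, edge):
    """Delete one graph edge and return the removed (pattern node, graph node) pairs.

    q_succ / q_pred: pattern adjacency; succ / pred: graph adjacency (mutated);
    sim: pattern node -> set of matching graph nodes (mutated).
    """
    v, w = edge
    succ[v].discard(w)
    pred[w].discard(v)
    removed = set()

    def valid(u, x):
        # x still matches u if it keeps a matching successor for every out-edge of u.
        return all(succ.get(x, set()) & sim[u2] for u2 in q_succ.get(u, ()))

    # Steps 1-2: only matches of the source node v can become invalid directly.
    stack = [(u, v) for u in sim if v in sim[u] and not valid(u, v)]
    # Step 3: propagate upward through predecessors of removed matches.
    while stack:
        u, x = stack.pop()
        if x not in sim[u]:
            continue
        sim[u].discard(x)
        removed.add((u, x))
        for up in q_pred.get(u, ()):
            for y in pred.get(x, set()) & sim[up]:
                if not valid(up, y):
                    stack.append((up, y))
    if any(not s for s in sim.values()):
        # Some pattern node lost all matches: the whole relation becomes empty.
        removed |= {(u, x) for u in sim for x in sim[u]}
        for u in sim:
            sim[u].clear()
    return removed


q_succ = {"cto": {"db"}, "db": {"bio"}}
q_pred = {"db": {"cto"}, "bio": {"db"}}
succ = {"Ann": {"Pat", "Dan"}, "Pat": {"Bill"}, "Dan": {"Mat"}}
pred = {"Pat": {"Ann"}, "Dan": {"Ann"}, "Bill": {"Pat"}, "Mat": {"Dan"}}
sim = {"cto": {"Ann"}, "db": {"Pat", "Dan"}, "bio": {"Bill", "Mat"}}
delta = inc_delete(q_succ, q_pred, succ, pred, sim, ("Pat", "Bill"))
```

Only nodes whose match status is in question are visited, which is the intuition behind a cost measured in |AFF| rather than |G|.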
unit insertion and DAG patterns: Algorithm IncMatch
optimal with the size of changes
[Figure: pattern Q over CTO, DB and Bio nodes; data graph G and result graph Gr with candidate nodes; inserting edge e7]
1. identify cs and cc edges
2. find new valid matches
3. propagate affected area and refine matches
Linear time wrt. the size of changes
Incremental Simulation: optimal results
cs edge: e = (v, v’), where v’ is a match and v is a candidate
cc edge: e = (v, v’), where v and v’ are both candidates
Incremental subgraph isomorphism
Input: Q, G, Miso(Q, G), ∆G
Output: ∆M such that Miso (Q, G⊕∆G) = Miso(Q, G) ⊕ ∆M
Boundedness and complexity
• Incremental matching via subgraph isomorphism is unbounded even for unit updates over DAG graphs for path patterns
• Incremental subgraph isomorphism is NP-complete even when G is fixed
Neither bounded nor semi-bounded (not semi-bounded unless P = NP)
Input: Q, G, M(Q, G), ∆G
Question: whether there exists a subgraph in G⊕∆G that is isomorphic to Q
What should we do?
Compress G by leveraging the equivalence relation
Equivalence relation:
• reachability relation Re: a node pair (u,v) ∈ Re iff they have the same set of ancestors and descendants in G.
• for any graph G, there is a unique maximum Re, i.e., the reachability equivalence relation of G
Recall reachability queries
Reachability
• Input: A directed graph G, and a pair of nodes s and t in G
• Question: Does there exist a path from s to t in G?
O(|V| + |E|) time
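To make the reachability equivalence relation Re concrete, the sketch below groups nodes that share the same ancestors and the same descendants. It is a naive illustration that runs one traversal per node, not an efficient compression algorithm; the function name and the sample graph are assumptions.

```python
from collections import defaultdict


def reachability_classes(nodes, edges):
    """Partition nodes into classes with identical ancestor and descendant sets."""
    succ, pred = defaultdict(set), defaultdict(set)
    for a, b in edges:
        succ[a].add(b)
        pred[b].add(a)

    def closure(start, adj):
        # All nodes reachable from start via adj, using paths of length >= 1.
        seen, stack = set(), list(adj[start])
        while stack:
            x = stack.pop()
            if x not in seen:
                seen.add(x)
                stack.extend(adj[x])
        return frozenset(seen)

    groups = defaultdict(set)
    for v in nodes:
        groups[(closure(v, pred), closure(v, succ))].add(v)
    return {frozenset(g) for g in groups.values()}


nodes = ["a", "b", "c", "d"]
edges = [("a", "c"), ("b", "c"), ("a", "d"), ("b", "d")]
classes = reachability_classes(nodes, edges)
```

Each class becomes one node of the compressed graph, so a reachability query on G can be answered on the smaller graph.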
33
Incremental Reachability Preserving Compression
Incremental reachability preserving compression (RCM)
– unbounded even for unit updates, i.e., a single edge insertion or deletion
RCM is solvable in O(|AFF||Gc|) time without decompressing Gc
Reduction from single source reachability problem
[Figure: graph G over nodes FA1, FA2, C1 and C2, with compressed graphs Gr, Gr’ and Gr’’]
1. Update topological ranking, initialize AFF
2. (iteratively) split/merge nodes and update Gc
Without decompressing Gc
Graph pattern matching by graph simulation
Input: A directed graph G, and a graph pattern Q Output: the maximum simulation relation R
35
Bisimulation: a binary relation B over V of G, such that for each node pair (u,v) ∈ B,
• L(u) = L(v)
• for each edge (u,u’) ∈ E, there exists (v,v’) ∈ E, s.t. (u’,v’) ∈ B
• for each edge (v,v’) ∈ E, there exists (u,u’) ∈ E, s.t. (u’,v’) ∈ B
Equivalence relation Rb: the unique maximum bisimulation relation
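The maximum bisimulation Rb can be computed by partition refinement. The sketch below is a naive signature-based version, shown only to illustrate the definition; the function name and the sample graph are assumptions.

```python
def bisimulation_classes(labels, edges):
    """Return the equivalence classes of the maximum bisimulation.

    labels maps each node to its label; edges is an iterable of (src, dst).
    """
    succ = {v: set() for v in labels}
    for a, b in edges:
        succ[a].add(b)
    block = dict(labels)  # start with one block per label
    while True:
        # A node's signature: its current block plus the blocks of its successors.
        sig = {v: (block[v], frozenset(block[w] for w in succ[v])) for v in labels}
        ids = {}
        new_block = {v: ids.setdefault(sig[v], len(ids)) for v in labels}
        if len(ids) == len(set(block.values())):
            break  # no block was split: the partition is stable
        block = new_block
    classes = {}
    for v, b in block.items():
        classes.setdefault(b, set()).add(v)
    return {frozenset(c) for c in classes.values()}


labels = {"a1": "A", "a2": "A", "a3": "A", "b1": "B", "b2": "B"}
edges = [("a1", "b1"), ("a2", "b2")]
classes = bisimulation_classes(labels, edges)
```

Nodes a1 and a2 end up in one class, while a3 is split off because it has no B successor.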
Compress G by leveraging the equivalence relation
Incremental simulation Preserving Compression
[Figure: graph G over nodes BSA1, BSA2, MSA1, MSA2, FA1 to FA4 and C1 to C4, and its compressed graph Gq]
Incremental pattern preserving compression (PCM) is unbounded even for unit updates
PCM is solvable in O(|AFF|^2 + |Gc|) time without decompressing Gc
1. Update node ranking, initialize AFF
2. Iteratively split/merge nodes in Gc and update AFF
Affected area
Incremental compression without recomputation
Incremental graph compression
Input: G, Gc = R(G), ∆G
Output: ∆Gc such that R(G ⊕ ∆G) = R(G) ⊕ ∆Gc
Compressed once and incrementally maintained
No need to decompress Gc
Gc is computed once for all queries Q in L
Boundedness and complexity
• unbounded even for unit updates
• in O(|AFF|^2 + |Gc|) time
37
Putting together
38
Prove (semi-)boundedness: develop a (semi-)bounded incremental algorithm
Disprove (semi-)boundedness: by contradiction or reduction
Semi-bounded incremental algorithms for querying big data
Bounded and semi-bounded incremental algorithms
Incremental graph simulation: semi-bounded
– Cyclic patterns and graphs
– Batch updates
Optimal for
– single-edge deletions and general patterns
– single-edge insertions and DAG patterns
Summing up
39
40
Making big data small
Yes, it is doable!
Parallel query processing: divide and conquer
Boundedly evaluable queries: dynamic reduction
Query preserving compression: convert big data to small data
Query answering using views: make big data small
Bounded incremental query answering: depending on the size of the changes rather than the size of the original big data
. . .
Combinations of these are more effective
Including but not limited to graph queries
MapReduce not the only way, and it is not the best way!
5.28 years * 365 * 24 * 3600 (EB) 24 second!
Improvement:
28587 times (bounded evaluability)
60%55 times (parallel processing via partial evaluation)
23 times (query answering using views)
2.3 times faster (compression)
2 times faster for changes up to 10% (incremental)
41
Summary and review
What is query answering using views?
What is query containment? What is the complexity of deciding query
containment for relations? For XML? Graph pattern queries via graph
simulation?
What questions do we have to answer for answering graph queries
using views?
What is incremental query evaluation? What are the benefits?
What is a unit update? Batch updates?
When can we say that an incremental problem is bounded? Semi-
bounded?
How to show that an incremental problem is bounded? How to disprove
it?
42
Project (1)
42
Develop a characterization (a sufficient and necessary condition) for deciding whether subgraph queries can be answered using views.
Develop an algorithm for determining whether a subgraph query can be answered using views, based on your characterization.
Develop an algorithm that, given a graph G, a set V of views and a subgraph query Q that can be answered using the views, computes Q(G) by using the views in V.
Give correctness and complexity analyses of your algorithms.
Experimentally evaluate your algorithms, especially their scalability with the size of graphs.
A research and development project
Recall graph pattern matching via subgraph isomorphism (Lecture 3), referred to as subgraph queries in the sequel.
43
Project (2)
43
Study incremental maintenance of 2-hop covers, in response to
• node insertion
• node deletion
• edge insertion
• edge deletion
Develop an incremental algorithm in each of these settings.
Is the incremental problem bounded in each of the settings? If so, show that your incremental algorithm is bounded; otherwise disprove the boundedness of the incremental problem.
Implement your algorithms, and prove their correctness.
Experimentally evaluate your algorithms, especially their scalability.
A research and development project
Recall 2-hop covers for reachability queries (Lecture 2): for each node v in G, maintain 2hop(v) = (Lin(v), Lout(v)) such that a node s can reach t if and only if Lout(s) ∩ Lin(t) ≠ ∅
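As a reminder of how a 2-hop cover answers reachability, here is a minimal sketch: s reaches t exactly when the two label sets share a hop node. The labels below are hand-built for the path a → b → c and assume that each node appears in its own Lin and Lout; the function name is hypothetical.

```python
def reaches(s, t, l_out, l_in):
    """s can reach t iff Lout(s) and Lin(t) have a common hop node."""
    return bool(l_out[s] & l_in[t])


# Hand-built labels for the path a -> b -> c, using b as the hop node.
l_out = {"a": {"a", "b"}, "b": {"b"}, "c": {"c"}}
l_in = {"a": {"a"}, "b": {"b"}, "c": {"b", "c"}}
a_reaches_c = reaches("a", "c", l_out, l_in)
c_reaches_a = reaches("c", "a", l_out, l_in)
```

An incremental algorithm for the project has to keep these label sets correct when nodes or edges are inserted or deleted.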
44
Project (3)
44
Study incremental maintenance of SCC, in response to
• node insertion
• node deletion
• edge insertion
• edge deletion
Develop an incremental algorithm in each of these settings.
Is the incremental problem bounded in each of the settings? If so, show that your incremental algorithm is bounded; otherwise disprove the boundedness of the incremental problem.
Implement your algorithms, and prove their correctness.
Experimentally evaluate your algorithms, especially their scalability.
A research and development project
Recall strongly connected components (SCC, Lecture 2).
45
• W. Le, S. Duan, A. Kementsietsidis, F. Li, and M. Wang. Rewriting queries on SPARQL views. In WWW, 2011.
http://www.cs.fsu.edu/~lifeifei/papers/rdfview.pdf
• D. Saha. An incremental bisimulation algorithm. In FSTTCS, 2007.
http://cs.famaf.unc.edu.ar/~rfervari/sites/all/files/readings/incremental-bis-07.pdf
• S. K. Shukla, E. K. Shukla, D. J. Rosenkrantz, H. B. H. Iii, and R. E. Stearns. The polynomial time decidability of simulation relations for finite state processes: A HORNSAT based approach. In DIMACS Ser. Discrete, 1997. (search Google Scholar)
• W. Fan, X. Wang, and Y. Wu. Answering Graph Pattern Queries using Views, ICDE 2014. (query answering using views)
• W. Fan, X. Wang, and Y. Wu. Incremental Graph Pattern Matching, TODS 38(3), 2013. (bounded incremental query answering)
Papers for you to review