Download - 1 QSX: Querying Social Graphs Querying big graphs Parallel query processing Boundedly evaluable queries Query-preserving graph compression Query answering


QSX: Querying Social Graphs

Querying big graphs

Parallel query processing

Boundedly evaluable queries

Query-preserving graph compression

Query answering using views

Bounded incremental query evaluation


How to make big graphs small

Input: A class Q of queries

Question: Can we effectively find, given a query Q in the class Q and any (possibly big) graph G, a small graph GQ such that

Q(G) = Q(GQ)?

Effective methods for making big graphs small

• Distributed query processing
• Boundedly evaluable graph queries
• Query preserving graph compression
• Query answering using views
• Bounded incremental evaluation

[Figure: Q(G) computed on G versus Q(GQ) computed on GQ, which is much smaller than G]

Graph pattern matching by graph simulation

Input: A directed graph G, and a graph pattern Q
Output: the maximum simulation relation R

Using views? Incremental?

Maximum simulation relation: always exists and is unique
• If a match relation exists, then there exists a maximum one
• Otherwise, it is the empty set – still maximum

Complexity: O((|V| + |VQ|)(|E| + |EQ|))

The output is a unique relation, possibly of size |Q||V|
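The definition above can be sketched as a naive fixpoint computation. This is an illustrative sketch only, not the algorithm that achieves the stated bound: it starts from all label-compatible pairs and repeatedly drops graph nodes that lack a matching successor. The function name `max_simulation` and the dictionary-based encoding of nodes and edges are assumptions made for this example.

```python
from collections import defaultdict

def max_simulation(q_nodes, q_edges, g_nodes, g_edges):
    """q_nodes / g_nodes map node -> label; q_edges / g_edges are (src, dst) pairs."""
    q_succ = defaultdict(set)
    for u, u2 in q_edges:
        q_succ[u].add(u2)
    g_succ = defaultdict(set)
    for v, v2 in g_edges:
        g_succ[v].add(v2)

    # Start from every label-compatible pair (pattern node, graph node).
    sim = {u: {v for v, lab in g_nodes.items() if lab == q_nodes[u]}
           for u in q_nodes}

    # Refine: v stays a match of u only if, for every pattern edge (u, u2),
    # v has at least one successor that still matches u2.
    changed = True
    while changed:
        changed = False
        for u in q_nodes:
            for u2 in q_succ[u]:
                keep = {v for v in sim[u] if g_succ[v] & sim[u2]}
                if keep != sim[u]:
                    sim[u] = keep
                    changed = True

    # If some pattern node has no match, the maximum relation is the empty set.
    if any(not matches for matches in sim.values()):
        return set()
    return {(u, v) for u, matches in sim.items() for v in matches}
```

The returned set holds at most |Q||V| pairs, in line with the size bound on the output.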

Graph pattern matching using views


Answering queries using views

The complexity is no longer a function of |G|

can we compute Q(G) without accessing G, i.e., independent of |G|?

The cost of query processing: f(|G|, |Q|)

Query answering using views: given a query Q in a language L and a set V of views, find another query Q’ such that
• Q and Q’ are equivalent: for any G, Q(G) = Q’(G)
• Q’ only accesses V(G)

Answering graph pattern queries on big social graphs:

Regardless of how big G is – the cost is “independent” of G

V(G ) is often much smaller than G (4% -- 12% on real-life data)

[Figure: Q evaluated on G versus Q’ evaluated on V(G)]

Querying collaborative network

[Figure: query 1 and query 2 form a collaborative pattern over customer, developer, project manager, and tester nodes; it is matched against a collaborative (chat) network with PM 1, PM 2, customers 1 to n, and developers 1 to k. Evaluating the pattern directly on the network is expensive!]

Detecting Coordination Problems in Collaborative Software Development Environments,

Amrit Chintan et al, Information System management, 2010

views

Answering query using views

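[Figure: a query Q on database D yields the query result Q(D); a query A on the database views V(D) yields the query result A(V)]

[Figure: timeline of query answering using views (marked years: 1995, 1998, 2000, 2002, 2006, 2007, 2011), covering relational algebra, regular path queries, XPath, tree pattern queries, XML, RDF/SPARQL, and graph pattern queries via simulation]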

When possible?

What to choose?

How to evaluate?

A classical technique, but still in its infancy for graphs

When a pattern can be matched using views

Pattern containment: a characterization

A necessary and sufficient condition

Pattern containment

[Figure: a pattern query over customer, developer, and project manager, with View 1 (customer, developer, project manager) and View 2 (customer, developer)]

View extensions
• (customer, developer): {(customer 2, developer 2), (customer 3, developer 3)}
• (developer, customer): {(developer 2, customer 2), (developer 2, customer 3), (developer 3, customer 2)}
• (project manager, developer): {(PM 1, developer 2), (PM 2, developer 3)}
• (project manager, customer): {(PM 1, customer 2), (PM 2, customer 2)}

Query result
• (project manager, developer): (PM 1, developer 2)
• (project manager, customer): (PM 1, customer 2)
• (developer, customer): (developer 2, customer 2)
• (customer, developer): (customer 2, developer 2)

How to determine the existence of λ?

Determining Pattern containment


NP-complete for relational conjunctive queries, undecidable for relational algebra

A practical characterization: patterns are small in practice

Pattern containment: example

[Figure: View 1 (customer, developer, project manager) and View 2 (customer, developer) are matched against the query, which is treated as a “data graph”; the resulting view matches and the mapping λ are shown]

V: the set of views; Q: query

Query containment: given Q and Q’, determine whether for any graph G, Q(G) is contained in Q’(G). A classical problem. What is its complexity for pattern queries?

efficient

Test: Pattern query containment

[Figure: a pattern query over PM, DBA, and PRG nodes, with View 1 (edges e1, e2) and View 2 (edges e3, e4)]

It takes 0.5 seconds to check containment of large cyclic patterns

Query evaluation using views

Input: pattern query Q, graph G, a set of views V and their extensions in G, and a mapping λ

Output: the query result Q(G)

Algorithm

◦ Collect edge matches for each query edge e and λ(e)
◦ Iteratively remove non-matches until no change happens
◦ Return Q(G)

Q(G) can be evaluated in O(|Q||V(G)| + |V(G)|^2) time
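The three steps of the algorithm can be sketched as follows. This is a simplified illustration under assumptions: the function name `eval_with_views`, the encoding of λ as a dictionary from query edges to lists of view-edge identifiers, and the naive fixpoint loop are all hypothetical choices for this example, and the loop does not reproduce the stated time bound. Note that the graph G itself is never read; only the view extensions are.

```python
def eval_with_views(q_edges, view_ext, lam):
    """q_edges: list of query edges (u, u2).
    view_ext: dict view-edge id -> set of graph edges (v, v2) matching it.
    lam: dict query edge -> list of view-edge ids that cover it."""
    # Step 1: collect candidate matches of each query edge from the views only.
    S = {e: set().union(*(view_ext[ve] for ve in lam[e])) for e in q_edges}

    # Query edges leaving each query node (empty list for sink nodes).
    out = {}
    for (u, u2) in q_edges:
        out.setdefault(u, []).append((u, u2))
        out.setdefault(u2, [])

    def is_match(v, u):
        # v matches u if it keeps a candidate pair for every query edge leaving u.
        return all(any(a == v for a, _ in S[e]) for e in out[u])

    # Step 2: iteratively remove non-matches until no change happens.
    changed = True
    while changed:
        changed = False
        for (u, u2) in q_edges:
            keep = {(v, v2) for (v, v2) in S[(u, u2)]
                    if is_match(v, u) and is_match(v2, u2)}
            if keep != S[(u, u2)]:
                S[(u, u2)] = keep
                changed = True

    # Step 3: an empty match set for any query edge means there is no match.
    if any(not pairs for pairs in S.values()):
        return {}
    return S
```

On the view extensions of the collaborative-network example, no pair is pruned, so the result for each query edge equals the corresponding view extension.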

Recall simulation algorithm

More efficient. Why?

Query evaluation using views

[Figure: the query over customer, developer, and project manager, with View 1 (customer, developer, project manager) and View 2 (customer, developer)]

View extensions
• (customer, developer): {(customer 2, developer 2), (customer 3, developer 3)}
• (developer, customer): {(developer 2, customer 2), (developer 2, customer 3), (developer 3, customer 2)}
• (project manager, developer): {(PM 1, developer 2), (PM 2, developer 3)}
• (project manager, customer): {(PM 1, customer 2), (PM 2, customer 2)}

Query result
• (project manager, developer): {(PM 1, developer 2), (PM 2, developer 3)}
• (project manager, customer): {(PM 1, customer 2), (PM 2, customer 2)}
• (developer, customer): {(developer 2, customer 2), (developer 2, customer 3), (developer 3, customer 2)}
• (customer, developer): {(customer 2, developer 2), (customer 3, developer 3)}

“bottom-up” strategy

Without accessing the underlying big graph G

4% -- 12% of G

Are we done yet?

What views to choose?

[Figure: a query over customer, developer, project manager, software, and tester nodes, together with six candidate views (view 1 to view 6)]

choose all?

Why do we care?

efficiency

Minimum containment

Minimum containment is NP-complete
◦ APX-hard as optimization

What can we do?

Give two options

A log|Ep|-approximation

Idea: greedily select views V that “cover” more query edges

Ec: already covered

To decide whether to include a particular view V

Approximation: performance guarantees
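The greedy idea can be sketched in the style of greedy set cover. This is a sketch under assumptions: the slide's selection metric is not recoverable from the text, so the example simply picks the view that covers the most query edges not yet in Ec. The name `greedy_min_containment` and the input `view_matches` (each view mapped to the set of query edges its view match covers, assumed precomputed) are hypothetical.

```python
def greedy_min_containment(query_edges, view_matches):
    """Return a list of views whose matches cover all query edges, or None."""
    target = set(query_edges)          # Ep: all edges of the pattern query
    covered, chosen = set(), []        # Ec: edges covered so far
    while covered != target:
        # Pick the view that adds the most uncovered query edges (assumed metric).
        best = max(view_matches,
                   key=lambda v: len((view_matches[v] & target) - covered),
                   default=None)
        if best is None or not (view_matches[best] & target) - covered:
            return None                # no view extends Ec: Q is not contained
        chosen.append(best)
        covered |= view_matches[best] & target
    return chosen
```

Greedy covering of this kind is what gives a logarithmic approximation ratio for set-cover-style problems; the exact guarantee depends on the metric actually used.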

Minimum containment: example

[Figure: the query over customer, developer, project manager, software, and tester nodes, with candidate views 1 to 6 and the covered edge set Ec marked]

Greedy: based on the metric

Minimal containment

Algorithm
◦ Computes view match for each view
◦ Iteratively selects a view that extends Ec
◦ Repeats until Ec = Ep or returns the empty set

O(|Q|^2 card(V) + |V|^2 + |Q||V|) time

new addition

Minimal containment is in PTIME
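A minimal (not minimum) set of views can be found with one pass over the views followed by a pruning pass. The sketch below assumes the view matches are already computed and given as `view_matches` (view name mapped to the set of query edges it covers); the function name and this input format are hypothetical, introduced only for illustration.

```python
def minimal_containment(query_edges, view_matches):
    """Return a minimal list of views covering all query edges, or [] if none."""
    target = set(query_edges)              # Ep
    covered, chosen = set(), []            # Ec and the selected views
    for view, edges in view_matches.items():
        new = (edges & target) - covered
        if new:                            # the view extends Ec
            chosen.append(view)
            covered |= new
        if covered == target:
            break
    if covered != target:
        return []                          # Q is not contained in the views

    # Eliminate redundant views: drop a view if the remaining ones still cover Ep.
    for view in list(chosen):
        rest = set().union(*(view_matches[w] for w in chosen if w != view))
        if rest & target == target:
            chosen.remove(view)
    return chosen
```

Each step is a polynomial-time scan, so no search over subsets of views is needed; the price is that the result need not be the smallest possible set.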

Minimal containment: example

[Figure: the query over customer, developer, project manager, software, and tester nodes, with candidate views 1 to 6]

Eliminate redundant views

Putting together

Problem               Complexity                              Algorithm
containment           PTIME                                   O(card(V)|Q|^2 + |V|^2 + |Q||V|)
minimum containment   NP-c/APX-hard, log|Ep|-approximable     O(card(V)|Q|^2 + |V|^2 + |Q||V| + |Q|card(V)^(3/2))
minimal containment   PTIME                                   O(card(V)|Q|^2 + |V|^2 + |Q||V|)
evaluation            PTIME                                   O(|Q||V(G)| + |V(G)|^2)

characterization: sufficient and necessary condition for deciding

whether a query can be answered using a set of views

evaluation: how to evaluate queries using views

view selection: what views to choose for answering queries

The study is still in its infancy for graph queries

Subgraph isomorphism?

View maintenance?

Improvement: 23 times faster

Bounded incremental graph pattern matching


Incremental query answering

Minimizing unnecessary recomputation

Incremental query processing:

Input: Q, G, Q(G), ∆G (Q(G): the old output; ∆G: changes to the input)

Output: ∆M such that Q(G⊕∆G) = Q(G) ⊕ ∆M (Q(G⊕∆G): the new output; ∆M: changes to the output)

When changes ∆G to the graph G are small, typically so are the changes ∆M to the output Q(G⊕∆G)
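The contract Q(G⊕∆G) = Q(G) ⊕ ∆M can be illustrated on a deliberately trivial query. Everything in this sketch is an assumption made for illustration: the toy query `q` ("nodes with at least one outgoing edge"), the encoding of a delta as a pair of sets (inserted, deleted), and the auxiliary out-degree counter that lets `inc_q` inspect only the nodes touched by ∆G. Inserted edges are assumed to be new and deleted edges are assumed to exist in G.

```python
from collections import Counter

def apply_delta(s, delta):
    """s ⊕ delta, where delta = (inserted, deleted) is a pair of sets."""
    inserted, deleted = delta
    return (s - deleted) | inserted

def q(edges):
    """Toy batch query: nodes with at least one outgoing edge."""
    return {u for u, _ in edges}

def inc_q(out_deg, old_result, delta_g):
    """Compute delta_m from delta_g by visiting only the touched source nodes."""
    inserted, deleted = delta_g
    for u, _ in deleted:
        out_deg[u] -= 1
    for u, _ in inserted:
        out_deg[u] += 1
    added, removed = set(), set()
    for u in {u for u, _ in inserted | deleted}:
        if out_deg[u] > 0 and u not in old_result:
            added.add(u)
        elif out_deg[u] == 0 and u in old_result:
            removed.add(u)
    return added, removed
```

The work done by `inc_q` depends on the number of edges in ∆G, not on the size of G, which is the point of incremental evaluation.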

Changes ∆G are typically small

Compute Q(G) once, and then incrementally maintain it

Real-life data is dynamic – constantly changes, ∆G

Re-compute Q(G⊕∆G) starting from scratch?

5%/week in Web graphs

Complexity of incremental problems

Bounded: the cost is expressible as f(|CHANGED|, |Q|)?

Optimal: in O(|CHANGED| + |Q|)?

Complexity analysis in terms of the size of changes




The cost of query processing: a function of |G| and |Q|

incremental algorithms: |CHANGED|, the size of changes in
• the input: ∆G, and
• the output: ∆M

The updating cost that is inherent to the incremental problem itself

The amount of work absolutely necessary to perform for any incremental algorithm

Incremental algorithms?

Incremental graph simulation: bounded

G. Ramalingam, Thomas W. Reps: On the Computational Complexity of Dynamic Graph Problems. TCS 158(1&2), 1996


Why study incremental query answering?

View maintenance: in response to changes to the underlying graph

Compressed graphs: maintenance in the presence of changes

Indexing structure: 2-hop covers

An important issue




E-commerce systems: a fixed set of (parameterized) queries

– Repeatedly invoked and evaluated

One of the important issues for querying big graphs

|CHANGED|: the affected area

Result graphs: Gr = (Vr, Er) for graph simulation

[Figure: pattern Q and the result of simulation on a graph with nodes Ann (CTO), Pat (DB), John (DB), Bill (Bio), and Mat (Bio)]

Vr: the nodes in G that match pattern nodes in Q
Er: the paths in G that match edges in Q

Affected Area (AFF)
• the difference between Gr and Gr’, where Gr is the result graph of Q(G) and Gr’ is the result graph of Q(G⊕∆G)
• the size of changes in the output

The complexity and boundedness analyses of incremental matching

|CHANGED| = |∆G| + |AFF|

Incremental graph pattern matching

[Figure: pattern Q (CTO, DB, Bio), graph G, result graph Gr, and updates ∆G that insert edges e1 to e5; the affected area is marked]

Comparing the cost of incremental matching with its batch counterpart

Incremental simulation matching


Input: Q, G, Q(G), ∆G

Output: ∆M such that Q(G ⊕ ∆G) = Q(G) ⊕ ∆M

2 times faster than its batch counterpart for changes up to 10%

Optimal, in O(|AFF|) time, for
– single-edge deletions and general patterns
– single-edge insertions and DAG patterns

General patterns and graphs; batch updates: unbounded

Batch updates: incremental simulation is in O(|∆G|(|Q||AFF| + |AFF|^2)) time

Semi-boundedness

Semi-bounded: the cost is a PTIME function f(|CHANGED|, |Q|)

| Q | is small, and the cost is independent of | G |

Incremental simulation is in O(|∆G|(|Q||AFF| + |AFF|^2)) time for batch updates and general patterns

Semi-boundedness is good enough!

Incremental Simulation: optimal results

Unit deletions and general patterns: Algorithm IncMatch-

optimal with the size of changes

[Figure: pattern Q (CTO, DB, Bio), graph G, and result graph Gr; edge e6 is deleted and the affected area / ∆Gr is marked]

1. identify s-s edges: e = (v, v’), if v and v’ are matches
2. find invalid match
3. propagate affected area and refine matches (use a stack, upward propagation)

Linear time wrt. the size of changes
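The three steps for a single-edge deletion can be sketched as below. This is a simplified illustration, not the IncMatch algorithm itself: the name `inc_delete`, the adjacency-dictionary inputs, and the support check by set intersection are assumptions, and the sketch does not claim the optimal O(|AFF|) bound. The current match sets `sim` are updated in place, and the function returns the removed pairs (the deletions in ∆M). Whether the whole match becomes empty once some pattern node loses all matches is left to the caller.

```python
def inc_delete(sim, q_succ, g_succ, g_pred, edge):
    """sim: pattern node -> set of matching graph nodes (valid before the update).
    q_succ: pattern node -> list of successor pattern nodes.
    g_succ / g_pred: graph node -> set of successors / predecessors.
    edge: the graph edge (v, v2) to delete; assumed to exist in the graph."""
    v, v2 = edge
    g_succ[v].discard(v2)
    g_pred[v2].discard(v)

    # Predecessors in the pattern, needed for upward propagation.
    q_pred = {}
    for u, outs in q_succ.items():
        for u2 in outs:
            q_pred.setdefault(u2, set()).add(u)

    def unsupported(x, u, u2):
        # x has no remaining successor that matches u2.
        return not (g_succ.get(x, set()) & sim[u2])

    # Steps 1-2: only the source v of the deleted edge can become invalid directly.
    stack = [(u, v) for u in sim if v in sim[u]
             and any(unsupported(v, u, u2) for u2 in q_succ.get(u, ()))]

    # Step 3: propagate upward with a stack and refine the matches.
    removed = set()
    while stack:
        u, x = stack.pop()
        if x not in sim[u]:
            continue
        sim[u].discard(x)
        removed.add((u, x))
        for u0 in q_pred.get(u, ()):
            for p in g_pred.get(x, ()):
                if p in sim[u0] and unsupported(p, u0, u):
                    stack.append((u0, p))
    return removed
```

Only nodes that lose a match, and their predecessors, are ever visited, so the work is tied to the affected area rather than to the whole graph.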

Incremental Simulation: optimal results

Unit insertions and DAG patterns: Algorithm IncMatch+

optimal with the size of changes

[Figure: pattern Q (CTO, DB, Bio), graph G, and result graph Gr with candidate nodes; edge e7 is inserted]

1. identify cs and cc edges
   • cs edge: e = (v, v’), if v’ is a match and v a candidate
   • cc edge: e = (v, v’), if v’ and v are candidates
2. find new valid matches
3. propagate affected area and refine matches

Linear time wrt. the size of changes

Incremental subgraph isomorphism

Input: Q, G, Miso(Q, G), ∆G

Output: ∆M such that Miso (Q, G⊕∆G) = Miso(Q, G) ⊕ ∆M

Boundedness and complexity
• Incremental matching via subgraph isomorphism is unbounded even for unit updates over DAG graphs for path patterns
• Incremental subgraph isomorphism is NP-complete even when G is fixed; hence it is not semi-bounded unless P = NP
  Input: Q, G, M(Q, G), ∆G
  Question: whether there exists a subgraph in G⊕∆G that is isomorphic to Q

Neither bounded nor semi-bounded

What should we do?

Compress G by leveraging the equivalence relation

Equivalence relation:
• reachability relation Re: a node pair (u,v) ∈ Re iff they have the same set of ancestors and descendants in G.
• for any graph G, there is a unique maximum Re, i.e., the reachability equivalence relation of G

Recall reachability queries

Reachability
• Input: A directed graph G, and a pair of nodes s and t in G
• Question: Does there exist a path from s to t in G?

O(|V| + |E|) time
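The reachability equivalence relation can be computed naively by grouping nodes that share the same ancestor set and the same descendant set. This is an illustrative batch sketch under assumptions (the name `reach_equiv_classes` and the input encoding are hypothetical); it runs one traversal per node and is not an incremental or optimized method.

```python
def reach_equiv_classes(nodes, edges):
    """Group nodes with identical (ancestors, descendants) sets."""
    succ = {n: set() for n in nodes}
    pred = {n: set() for n in nodes}
    for u, v in edges:
        succ[u].add(v)
        pred[v].add(u)

    def closure(start, nbrs):
        # All nodes reachable from start by following nbrs (start itself only via a cycle).
        seen, stack = set(), [start]
        while stack:
            x = stack.pop()
            for y in nbrs[x]:
                if y not in seen:
                    seen.add(y)
                    stack.append(y)
        return frozenset(seen)

    classes = {}
    for n in nodes:
        key = (closure(n, pred), closure(n, succ))   # (ancestors, descendants)
        classes.setdefault(key, set()).add(n)
    return list(classes.values())
```

Each class can then be merged into a single node of the compressed graph, since all of its members answer every reachability query in the same way.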


Incremental Reachability Preserving Compression

Incremental reachability preserving compression (RCM)
– unbounded even for unit updates, i.e., a single edge insertion or deletion

RCM is solvable in O(|AFF||Gc|) time without decompressing Gc

Reduction from single source reachability problem

[Figure: graph G over nodes FA1, FA2, C1, and C2, and its compressed graphs Gr, Gr’, and Gr’’ after successive updates]

1. Update topological ranking, initialize AFF
2. (iteratively) split/merge nodes and update Gc

Without decompressing Gc

Graph pattern matching by graph simulation

Input: A directed graph G, and a graph pattern Q
Output: the maximum simulation relation R

Bisimulation: a binary relation B over V of G, such that for each node pair (u,v) ∈ B,
• L(u) = L(v)
• for each edge (u,u’) ∈ E, there exists (v,v’) ∈ E, s.t. (u’,v’) ∈ B
• for each edge (v,v’) ∈ E, there exists (u,u’) ∈ E, s.t. (u’,v’) ∈ B

Equivalence relation Rb: the unique maximum bisimulation relation
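The maximum bisimulation can be computed by naive partition refinement: start with one block per label and keep splitting blocks whose members disagree on the blocks of their successors. The sketch below is a batch illustration under assumptions (the function name and the signature-based refinement are choices made for this example, not the incremental method discussed here).

```python
def max_bisimulation_blocks(labels, edges):
    """labels: node -> label; edges: (src, dst) pairs. Returns the equivalence classes."""
    succ = {n: set() for n in labels}
    for u, v in edges:
        succ[u].add(v)

    block = dict(labels)                     # initial block id of a node = its label
    while True:
        # Signature: own block plus the set of blocks reached by outgoing edges.
        sig = {n: (block[n], frozenset(block[m] for m in succ[n])) for n in labels}
        ids = {}
        new_block = {n: ids.setdefault(sig[n], len(ids)) for n in labels}
        if len(ids) == len(set(block.values())):
            break                            # no block was split: partition is stable
        block = new_block

    groups = {}
    for n, b in block.items():
        groups.setdefault(b, set()).add(n)
    return list(groups.values())
```

Nodes in the same class can be merged into one node of the compressed graph, which is the idea behind compressing G with Rb.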

Compress G by leveraging the equivalence relation

Incremental simulation Preserving Compression

[Figure: graph G with nodes BSA1, BSA2, MSA1, MSA2, FA1 to FA4, and C1 to C4, and its compressed graph Gq]

Incremental pattern preserving compression (PCM) is unbounded even for unit update

PCM is solvable in O(|AFF|^2 + |Gc|) time without decompressing Gc

1. Update node ranking, initialize AFF
2. Iteratively split/merge nodes in Gc and update AFF

Affected area

Incremental compression without recomputation

Incremental graph compression

Input: G, Gc = R(G), ∆G

Output: ∆Gc such that R(G ⊕ ∆G) = R(G) ⊕ ∆Gc

Compressed once and incrementally maintained

No need to decompress Gc

Gc is computed once for all queries Q in L

Boundedness and complexity

• unbounded even for unit updates
• in O(|AFF|^2 + |Gc|) time

Putting together

Prove (semi-)boundedness: develop a (semi-)bounded incremental algorithm

Disprove (semi-)boundedness: by contradiction or reduction

Semi-bounded incremental algorithms for querying big data

Bounded and semi-bounded incremental algorithms

Incremental graph simulation: semi-bounded
– Cyclic patterns and graphs
– Batch updates

Optimal for
– single-edge deletions and general patterns
– single-edge insertions and DAG patterns

Summing up


Making big data small

Yes, it is doable!

Parallel query processing: divide and conquer

Boundedly evaluable queries: dynamic reduction

Query preserving compression: convert big data to small data

Query answering using views: make big data small

Bounded incremental query answering: depending on the size of the changes rather than the size of the original big data

. . .

Combinations of these are more effective

Including but not limited to graph queries

MapReduce not the only way, and it is not the best way!

5.28 years * 365 * 24 * 3600 (EB) 24 second!

Improvement:
• 28587 times (bounded evaluability), 60%
• 55 times (parallel processing via partial evaluation)
• 23 times (query answering using views)
• 2.3 times faster (compression)
• 2 times faster for changes up to 10% (incremental)

Summary and review

What is query answering using views?

What is query containment? What is the complexity of deciding query containment for relations? For XML? Graph pattern queries via graph simulation?

What questions do we have to answer for answering graph queries using views?

What is incremental query evaluation? What are the benefits?

What is a unit update? Batch updates?

When can we say that an incremental problem is bounded? Semi-bounded?

How to show that an incremental problem is bounded? How to disprove it?

Project (1)


Develop a characterization (a sufficient and necessary condition) for deciding whether subgraph queries can be answered using views.

Develop an algorithm for determining whether a subgraph query can be answered using views, based on your characterization.

Develop an algorithm that, given a graph G, a set V of views and a subgraph query Q that can be answered using the views, computes Q(G) by using views in V

Give correctness and complexity analyses of your algorithms.

Experimentally evaluate your algorithms, especially their scalability with the size of graphs.

A research and development project

Recall graph pattern matching via subgraph isomorphism (Lecture 3), referred to as subgraph queries in the sequel.

Project (2)

Study incremental maintenance of 2-hop covers, in response to
• node insertion
• node deletion
• edge insertion
• edge deletion

Develop an incremental algorithm in each of these settings.
Is the incremental problem bounded in each of the settings? If so, show that your incremental algorithm is bounded; otherwise disprove the boundedness of the incremental problem.
Implement your algorithms, and prove their correctness.
Experimentally evaluate your algorithms, especially their scalability.

Recall 2-hop covers for reachability queries (Lecture 2): for each node v in G, maintain 2hop(v) = (Lin(v), Lout(v)) such that a node s can reach t if and only if Lout(s) ∩ Lin(t) ≠ ∅
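Answering a reachability query from a 2-hop cover reduces to one set intersection. The sketch below assumes the labels are already built and that every node lists itself in both of its own labels, a common convention that is an assumption here; the function name `reaches` and the dictionary inputs are hypothetical.

```python
def reaches(s, t, l_out, l_in):
    """s can reach t iff some hop node lies in both Lout(s) and Lin(t)."""
    return bool(l_out[s] & l_in[t])

# Example labels for the path a -> b -> c, using b as the shared hop node.
L_OUT = {"a": {"a", "b"}, "b": {"b"}, "c": {"c"}}
L_IN = {"a": {"a"}, "b": {"b"}, "c": {"b", "c"}}
```

For the path above, `reaches("a", "c", L_OUT, L_IN)` holds through hop node b, while `reaches("c", "a", L_OUT, L_IN)` does not. Incremental maintenance has to keep this property true after each node or edge update.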


Project (3)

Study incremental maintenance of SCC, in response to
• node insertion
• node deletion
• edge insertion
• edge deletion

Develop an incremental algorithm in each of these settings.
Is the incremental problem bounded in each of the settings? If so, show that your incremental algorithm is bounded; otherwise disprove the boundedness of the incremental problem.
Implement your algorithms, and prove their correctness.
Experimentally evaluate your algorithms, especially their scalability.

Recall strongly connected components (SCC, Lecture 2).

Papers for you to review

• W. Le, S. Duan, A. Kementsietsidis, F. Li, and M. Wang. Rewriting queries on SPARQL views. In WWW, 2011. http://www.cs.fsu.edu/~lifeifei/papers/rdfview.pdf

• D. Saha. An incremental bisimulation algorithm. In FSTTCS, 2007. http://cs.famaf.unc.edu.ar/~rfervari/sites/all/files/readings/incremental-bis-07.pdf

• S. K. Shukla, E. K. Shukla, D. J. Rosenkrantz, H. B. Hunt III, and R. E. Stearns. The polynomial time decidability of simulation relations for finite state processes: A HORNSAT based approach. In DIMACS Ser. Discrete, 1997. (search Google Scholar)

• W. Fan, X. Wang, and Y. Wu. Answering Graph Pattern Queries using Views. In ICDE, 2014. (query answering using views)

• W. Fan, X. Wang, and Y. Wu. Incremental Graph Pattern Matching. TODS 38(3), 2013. (bounded incremental query answering)
