![Page 1: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/1.jpg)
Large Graph Processing
Jeffrey Xu Yu (于旭)
Department of Systems Engineering and Engineering Management
The Chinese University of Hong Kong
[email protected], http://www.se.cuhk.edu.hk/~yu
![Page 2: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/2.jpg)
The Chinese University of Hong Kong, Shatin, NT, Hong Kong
2
![Page 3: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/3.jpg)
Social Networks
3
![Page 4: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/4.jpg)
Social Networks
4
![Page 5: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/5.jpg)
Facebook Social Network
In 2011, 721 million users, 69 billion friendship links. The degree of separation is 4. (Four Degrees of Separation by Backstrom, Boldi, Rosa, Ugander, and Vigna, 2012)
5
![Page 6: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/6.jpg)
The Scale/Growth of Social Networks
Facebook statistics
  829 million daily active users on average in June 2014
  1.32 billion monthly active users as of June 30, 2014
  81.7% of daily active users are outside the U.S. and Canada
  22% increase in Facebook users from 2012 to 2013
Facebook activities (every 20 minutes on Facebook)
  1 million links shared
  2 million friends requested
  3 million messages sent
Sources: http://newsroom.fb.com/company-info/ and http://www.statisticbrain.com/facebook-statistics/
6
![Page 7: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/7.jpg)
The Scale/Growth of Social Networks
Twitter statistics
  271 million monthly active users in 2014
  135,000 new users signing up every day
  78% of Twitter active users are on mobile
  77% of accounts are outside the U.S.
Twitter activities
  500 million Tweets are sent per day
  9,100 Tweets are sent per second
Sources: https://about.twitter.com/company and http://www.statisticbrain.com/twitter-statistics/
7
![Page 8: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/8.jpg)
Location Based Social Networks
8
![Page 9: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/9.jpg)
Financial Networks
"We borrow £1.7 trillion, but we're lending £1.8 trillion. Confused? Yes, inter-nation finance is complicated..."
![Page 10: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/10.jpg)
US Social Commerce -- Statistics and Trends
10
![Page 11: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/11.jpg)
Activities on Social Networks When all functions are integrated ….
11
![Page 12: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/12.jpg)
Graph Mining/Querying/Searching
We have been working on many graph problems.
  Keyword search in databases
  Reachability query over large graphs
  Shortest path query over large graphs
  Large graph pattern matching
  Graph clustering
  Graph processing on Cloud
  ……
12
![Page 13: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/13.jpg)
Part I: Social Networks
13
![Page 14: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/14.jpg)
Some Topics
  Ranking over trust networks
  Influence on social networks
  Influenceability estimation in social networks
  Random-walk domination
  Diversified ranking
  Top-k structural diversity search
14
![Page 15: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/15.jpg)
Ranking over Trust Networks
15
![Page 16: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/16.jpg)
Reputation-based Ranking
Real rating systems (users and objects)
  Online shopping websites (Amazon) www.amazon.com
  Online product review websites (Epinions) www.epinions.com
  Paper review system (Microsoft CMT)
  Movie rating (IMDB)
  Video rating (Youtube)
16
![Page 17: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/17.jpg)
The Bipartite Rating Network
Two entities: users and objects
Users can give ratings to objects
If we take the average as the ranking score of an object, o1 and o3 are the top.
If we consider the user’s reputation, e.g., u4, …
[Figure: bipartite network of users and objects connected by ratings.]
17
![Page 18: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/18.jpg)
Reputation-based Ranking
Two fundamental problems
  How to rank objects using the ratings?
  How to evaluate users’ rating reputation?
Algorithmic challenges
  Robustness: robust to the spamming users
  Scalability: scalable to large networks
  Convergence: convergent to a unique and fixed ranking vector
18
![Page 19: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/19.jpg)
Signed/Unsigned Trust Networks
Signed Trust Social Networks (users): A user can express trust/distrust in others by a positive/negative trust score.
  Epinions (www.epinions.com)
  Slashdot (www.slashdot.org)
Unsigned Trust Social Networks (users): A user can only express trust.
  Advogato (www.advogato.org)
  Kaitiaki (www.kaitiaki.org.nz)
Unsigned Rating Networks (users and objects)
  Question-Answer systems
  Movie-rating systems (IMDB)
  Video rating systems in Youtube
19
![Page 20: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/20.jpg)
The Trustworthiness of a User
The final trustworthiness of a user is determined by how users trust each other in a global context and is measured by bias.
  The bias of a user reflects the extent to which his/her opinions differ from others.
  If a user has a zero bias, then his/her opinions are 100% unbiased and 100% taken.
  Such a user has high trustworthiness.
The trustworthiness, the trust score, of a user is 1 − his/her bias score.
20
![Page 21: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/21.jpg)
An Existing Approach
MB [Mishra and Bhattacharya, WWW’11]
The trustworthiness of a user computed by MB cannot be trusted, because MB measures the bias of a user by the relative differences between the user and others.
If a user gives all his/her friends a much higher trust score than the average of others, and gives all his/her foes a much lower trust score than the average of others, such differences cancel out. Under MB, this user has zero bias and is treated as 100% trusted.
21
![Page 22: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/22.jpg)
An Example
Node 5 gives a low trust score to node 1, while node 2 and node 3 give a high trust score to node 1.
Node 5 is different from the others (biased): 0.1 − 0.8.
22
![Page 23: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/23.jpg)
MB Approach The bias of a node is .
The prestige score of node is .
The iterative system is
23
![Page 24: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/24.jpg)
An Example
Consider 5→1, 2→1, 3→1. The trust score difference is 0.1 − 0.8 = −0.7.
Consider 2→3, 4→3, 5→3. The trust score difference is 0.9 − 0.2 = 0.7.
The two differences cancel out, so node 5 has zero bias.
[Table: the bias scores by MB.]
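The cancellation can be checked with a few lines of arithmetic. This is a minimal sketch under a simplifying assumption: bias is taken as the plain average of the signed differences between what a user gave and what the others gave. It is not the exact MB formula, and the name `relative_bias` is hypothetical.

```python
def relative_bias(differences):
    """Average signed difference between a user's scores and the scores given by others.

    Simplified stand-in for a relative-difference bias measure (an assumption,
    not the exact MB definition).
    """
    return sum(differences) / len(differences)

# Node 5 in the example: 0.1 given vs 0.8 from others (towards node 1),
# and 0.9 given vs 0.2 from others (towards node 3).
diffs_node5 = [0.1 - 0.8, 0.9 - 0.2]
bias_node5 = relative_bias(diffs_node5)

# Trust score as defined earlier in the slides: 1 - bias (absolute value used here).
trust_node5 = 1 - abs(bias_node5)
```

Although node 5 disagrees strongly with the others on both ratings, the signed differences sum to (numerically) zero, so a relative-difference measure reports no bias.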
24
![Page 25: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/25.jpg)
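Our Approach
To address it, consider a contraction mapping.
Given a metric space $(X, d)$ with a distance function $d$, a mapping $T$ from $X$ to $X$ is a contraction mapping if there exists a constant $c$, where $0 \le c < 1$, such that $d(T(x), T(y)) \le c \cdot d(x, y)$ for all $x, y \in X$.
In a complete metric space, such a $T$ has a unique fixed point.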
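As an illustration of why a contraction gives a well-defined answer, the sketch below iterates a one-dimensional contraction from two very different starting points. The map `T(x) = 0.5 * x + 1` is an arbitrary example chosen for this note (contraction constant 0.5, fixed point 2); it is not taken from the slides.

```python
def fixed_point(T, x0, tol=1e-12, max_iter=1000):
    """Iterate x <- T(x) until successive values differ by less than tol."""
    x = x0
    for _ in range(max_iter):
        nxt = T(x)
        if abs(nxt - x) < tol:
            return nxt
        x = nxt
    return x

def T(x):
    # |T(x) - T(y)| = 0.5 * |x - y|, so T is a contraction with c = 0.5.
    return 0.5 * x + 1.0

from_zero = fixed_point(T, 0.0)
from_far = fixed_point(T, 100.0)
```

Both runs end at the same value, and the distance to the fixed point halves at every step, which is the exponential convergence that a contraction guarantees.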
25
![Page 26: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/26.jpg)
Our Approach We use two vectors, and , for bias and prestige. The denotes the bias of node , where is the prestige
vector of the nodes, and is a vector-valued contractive function. denotes the -th element of vector .
Let , and For any , the function is a vector-valued contractive
function if the following condition holds, where and denotes the infinity norm.
26
![Page 27: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/27.jpg)
The Framework
Use a vector-valued contractive function, which is a generalization of the contraction mapping in fixed-point theory.
MB is a special case in our framework.
The iterative system converges to a unique fixed prestige and bias vector at an exponential rate of convergence.
We can handle both unsigned and signed trust social networks.
27
![Page 28: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/28.jpg)
Influence on Social Networks
28
![Page 29: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/29.jpg)
Diffusion in Networks
We care about the decisions made by friends and colleagues.
Why imitate the behavior of others?
  Informational effects: the choices made by others can provide indirect information about what they know.
  Direct-benefit effects: there are direct payoffs from copying the decisions of others.
Diffusion: how new behaviors, practices, opinions, conventions, and technologies spread through a social network.
29
![Page 30: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/30.jpg)
A Real World Example
Hotmail’s viral climb to the top spot (90’s): 8 million users in 18 months!
Far more effective than conventional advertising by rivals and far cheaper too!
30
![Page 31: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/31.jpg)
Stochastic Diffusion Model
Consider a directed graph $G = (V, E)$.
The diffusion of information (or influence) proceeds in discrete time steps, with time $t = 0, 1, 2, \dots$.
Each node $v$ has two possible states, inactive and active.
Let $S_t$ be the set of active nodes at time $t$ (the active set at time $t$). $S_0$ is the seed set (the seeds of influence diffusion).
A stochastic diffusion model (with discrete time steps) for a social graph $G$ specifies the randomized process of generating the active sets $S_t$ for all $t \ge 1$ given the initial $S_0$.
A progressive model is a model in which $S_{t-1} \subseteq S_t$ for all $t \ge 1$.
31
![Page 32: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/32.jpg)
Influence Spread
Let $\Phi(S_0)$ be the final active set (the eventually stable active set), where $S_0$ is the initial seed set.
$\Phi(S_0)$ is a random set determined by the stochastic process of the diffusion model.
The goal is to maximize the expected size of the final active set.
Let $\mathbb{E}[X]$ denote the expected value of a random variable $X$.
The influence spread of seed set $S_0$ is defined as $\sigma(S_0) = \mathbb{E}[|\Phi(S_0)|]$. Here the expectation is taken over all random events leading to $\Phi(S_0)$.
32
![Page 33: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/33.jpg)
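Independent Cascade Model (IC)
IC takes $G = (V, E)$, the influence probability $p(u, v)$ on all edges, and the initial seed set $S_0$ as the input, and generates the active sets $S_t$ for all $t \ge 1$.
At every time step $t \ge 1$, first set $S_t = S_{t-1}$.
Next, for every inactive node $v \notin S_{t-1}$, every in-neighbor $u$ of $v$ that became active at time $t-1$ (that is, $u \in S_{t-1} \setminus S_{t-2}$) executes an activation attempt with success probability $p(u, v)$.
If successful, $v$ is added into $S_t$ and it is said that $u$ activates $v$ at time $t$.
If multiple nodes activate $v$ at time $t$, the end effect is the same.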
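The process above can be simulated directly. The following is a minimal sketch, assuming the graph is given as an adjacency dictionary and the edge probabilities as a dictionary keyed by `(u, v)`; the function name `simulate_ic` and the toy graph are illustrative choices, not part of the slides.

```python
import random

def simulate_ic(out_edges, prob, seeds, rng):
    """Run one independent cascade and return the final active set.

    out_edges: dict mapping a node to the list of its out-neighbours
    prob:      dict mapping an edge (u, v) to its activation probability
    seeds:     iterable of initially active nodes (the seed set)
    rng:       a random.Random instance
    """
    active = set(seeds)
    frontier = set(seeds)  # nodes that became active in the previous step
    while frontier:
        newly_active = set()
        for u in frontier:
            for v in out_edges.get(u, []):
                # Each newly active u gets exactly one attempt on each inactive v.
                if v in active or v in newly_active:
                    continue
                if rng.random() < prob[(u, v)]:
                    newly_active.add(v)
        active |= newly_active
        frontier = newly_active
    return active

out_edges = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "e": ["a"]}
always = {(u, v): 1.0 for u, vs in out_edges.items() for v in vs}
never = {(u, v): 0.0 for u, vs in out_edges.items() for v in vs}

spread_all = simulate_ic(out_edges, always, ["a"], random.Random(0))
spread_none = simulate_ic(out_edges, never, ["a"], random.Random(0))
```

With probability 1 on every edge the cascade reaches everything reachable from the seed; with probability 0 it stays at the seed. Averaging `len(simulate_ic(...))` over many runs gives a Monte-Carlo estimate of the influence spread.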
33
![Page 34: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/34.jpg)
An Example
34
![Page 35: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/35.jpg)
Another Example
35
![Page 36: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/36.jpg)
Influenceability Estimation in Social Networks
Applications
  Influence maximization for viral marketing
  Influential nodes discovery
  Online advertisement
The fundamental issue
  How to evaluate the influenceability of a given node in a social network?
36
![Page 37: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/37.jpg)
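Reconsider IC Model
The independent cascade model:
  Each node has an independent probability of influencing its neighbors.
  It can be modeled by a probabilistic graph, called an influence network, $G = (V, E, P)$.
  A possible graph $G_P = (V, E_P)$ has probability $\Pr[G_P] = \prod_{e \in E_P} p_e \prod_{e \in E \setminus E_P} (1 - p_e)$.
  There are $2^{|E|}$ possible graphs (the set $\Omega$).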
37
![Page 38: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/38.jpg)
An Example
38
![Page 39: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/39.jpg)
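The Problem
Independent cascade model.
Given a probabilistic graph $G = (V, E, P)$ and a node $s$, estimate the expected number of nodes that are reachable from $s$:
$F_s(G) = \sum_{G_P} \Pr[G_P] \, f_s(G_P)$,
where the sum ranges over all possible graphs $G_P$ and $f_s(G_P)$ is the number of nodes that are reachable from the seed node $s$ in $G_P$.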
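For a tiny graph the expectation can be computed exactly by enumerating every possible graph, which also shows why this does not scale: the loop runs $2^{|E|}$ times. This is a sketch under two assumptions made for this note: edges are independent, and the seed itself is not counted among the reachable nodes. The function names are hypothetical.

```python
from itertools import product

def count_reachable(edges, s):
    """Number of nodes other than s reachable from s via the given directed edges."""
    out = {}
    for u, v in edges:
        out.setdefault(u, []).append(v)
    seen = {s}
    stack = [s]
    while stack:
        u = stack.pop()
        for v in out.get(u, []):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return len(seen) - 1

def exact_expected_reach(edges, prob, s):
    """Sum Pr[G_P] * f_s(G_P) over all 2^|E| possible graphs."""
    total = 0.0
    for mask in product([False, True], repeat=len(edges)):
        p = 1.0
        present = []
        for keep, e in zip(mask, edges):
            p *= prob[e] if keep else 1.0 - prob[e]
            if keep:
                present.append(e)
        total += p * count_reachable(present, s)
    return total

edges = [("s", "a"), ("a", "b")]
prob = {("s", "a"): 0.5, ("a", "b"): 0.5}
exact = exact_expected_reach(edges, prob, "s")
```

On this two-edge path, node `a` is reached with probability 0.5 and node `b` with probability 0.25, so the expectation is 0.75.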
39
![Page 40: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/40.jpg)
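Reduce the Variance
The accuracy of an approximate algorithm is measured by the mean squared error of its estimate $\hat{F}$ of the true value $F$: $\mathbb{E}[(\hat{F} - F)^2]$.
By the variance-bias decomposition: $\mathbb{E}[(\hat{F} - F)^2] = \mathrm{Var}(\hat{F}) + (\mathbb{E}[\hat{F}] - F)^2$.
If the estimator is unbiased, the 2nd term vanishes.
It then remains to make the variance as small as possible.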
40
![Page 41: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/41.jpg)
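Naïve Monte-Carlo (NMC)
Sample $N$ possible graphs $G_1, \dots, G_N$.
For each sampled possible graph $G_i$, compute $f_s(G_i)$, the number of nodes that are reachable from $s$.
Estimator: the average of the number of reachable nodes over the $N$ possible graphs, $\hat{F}_{NMC} = \frac{1}{N} \sum_{i=1}^{N} f_s(G_i)$.
$\hat{F}_{NMC}$ is an unbiased estimator of the expected number of reachable nodes $F_s(G)$, since $\mathbb{E}[\hat{F}_{NMC}] = F_s(G)$.
NMC is the only existing algorithm used in the influence maximization literature.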
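A minimal sketch of the NMC estimator, assuming independent edges and counting reachable nodes other than the seed (both are assumptions of this note). The names `nmc_estimate` and `count_reachable` are hypothetical.

```python
import random

def count_reachable(edges, s):
    """Number of nodes other than s reachable from s via the given directed edges."""
    out = {}
    for u, v in edges:
        out.setdefault(u, []).append(v)
    seen = {s}
    stack = [s]
    while stack:
        u = stack.pop()
        for v in out.get(u, []):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return len(seen) - 1

def nmc_estimate(edges, prob, s, n_samples, rng):
    """Average reachability over n_samples independently sampled possible graphs."""
    total = 0
    for _ in range(n_samples):
        # Keep each edge independently with its own probability.
        sampled = [e for e in edges if rng.random() < prob[e]]
        total += count_reachable(sampled, s)
    return total / n_samples

edges = [("s", "a"), ("a", "b")]
half = {e: 0.5 for e in edges}
certain = {e: 1.0 for e in edges}

estimate = nmc_estimate(edges, half, "s", 20000, random.Random(0))
deterministic = nmc_estimate(edges, certain, "s", 10, random.Random(0))
```

The exact expectation for this path with probability 0.5 per edge is 0.75, and the estimate fluctuates around it; the size of that fluctuation is the variance the later estimators aim to reduce.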
41
![Page 42: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/42.jpg)
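Naïve Monte-Carlo (NMC)
Estimator: the average of the number of reachable nodes over the $N$ sampled possible graphs, $\hat{F}_{NMC} = \frac{1}{N} \sum_{i=1}^{N} f_s(G_i)$.
$\hat{F}_{NMC}$ is an unbiased estimator of $F_s(G)$ since $\mathbb{E}[\hat{F}_{NMC}] = F_s(G)$.
The variance of $\hat{F}_{NMC}$ is $\mathrm{Var}(\hat{F}_{NMC}) = \frac{1}{N} \left( \sum_{G_P} \Pr[G_P] \, f_s(G_P)^2 - F_s(G)^2 \right)$.
Computing the variance exactly is extremely expensive, because it needs to enumerate all the possible graphs.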
42
![Page 43: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/43.jpg)
Naïve Monte-Carlo (NMC) In practice, it resorts to an unbiased estimator of . The variance of is
But, may be very large, because fall into the interval . The variance can be up to .
43
![Page 44: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/44.jpg)
Stratified Sampling
Stratification divides a set of data items into subsets before sampling.
A stratum is a subset.
The strata should be mutually exclusive, and should include all data items in the set.
Stratified sampling can be used to reduce variance.
44
![Page 45: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/45.jpg)
A Recursive Estimator [Jin et al. VLDB’11]
Randomly select 1 edge to partition the probability space (the set of all possible graphs) into 2 strata (2 subsets).
  The possible graphs in the first subset include the selected edge.
  The possible graphs in the second subset do not include the selected edge.
Sample possible graphs in each stratum with a sample size proportional to the probability of that stratum.
Recursively apply the same idea in each stratum.
45
![Page 46: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/46.jpg)
A Recursive Estimator [Jin et al. VLDB’11]
Advantages:
  Unbiased estimator with a smaller variance.
Limitations:
  Only one edge is selected for stratification, which is not enough to significantly reduce the variance.
  Edges are selected randomly, which may result in a large variance.
46
![Page 47: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/47.jpg)
More Effective Estimators
Four Stratified Sampling (SS) Estimators
  Type-I basic SS estimator (BSS-I)
  Type-I recursive SS estimator (RSS-I)
  Type-II basic SS estimator (BSS-II)
  Type-II recursive SS estimator (RSS-II)
All are unbiased and their variances are significantly smaller than the variance of NMC.
Time and space complexity of all are the same as NMC.
47
![Page 48: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/48.jpg)
Type-I Basic Estimator (BSS-I)
Select $r$ edges to partition the probability space (all the possible graphs) into $2^r$ strata.
Each stratum corresponds to a probability subspace (a set of possible graphs).
Let $\pi_i$ be the probability of stratum $i$.
How to select the $r$ edges: BFS or random.
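The idea can be sketched as follows: fix the status (present or absent) of the chosen edges to define a stratum, sample the remaining edges as in NMC, and combine the per-stratum averages weighted by the stratum probabilities. This is an illustrative sketch, not the authors' implementation; the allocation `max(1, round(N * pi))`, the function names, and the convention that the seed is not counted are assumptions of this note.

```python
import random
from itertools import product

def count_reachable(edges, s):
    """Number of nodes other than s reachable from s via the given directed edges."""
    out = {}
    for u, v in edges:
        out.setdefault(u, []).append(v)
    seen = {s}
    stack = [s]
    while stack:
        u = stack.pop()
        for v in out.get(u, []):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return len(seen) - 1

def bss1_estimate(edges, prob, s, chosen, n_samples, rng):
    """Stratify on the status of the chosen edges (2^r strata), then sample the rest."""
    rest = [e for e in edges if e not in chosen]
    estimate = 0.0
    for status in product([False, True], repeat=len(chosen)):
        pi = 1.0           # probability of this stratum
        fixed = []         # chosen edges that are present in this stratum
        for keep, e in zip(status, chosen):
            pi *= prob[e] if keep else 1.0 - prob[e]
            if keep:
                fixed.append(e)
        if pi == 0.0:
            continue
        n_i = max(1, round(n_samples * pi))  # proportional allocation
        total = 0
        for _ in range(n_i):
            sampled = fixed + [e for e in rest if rng.random() < prob[e]]
            total += count_reachable(sampled, s)
        estimate += pi * total / n_i
    return estimate

edges = [("s", "a"), ("a", "b")]
prob = {e: 0.5 for e in edges}

one_edge = bss1_estimate(edges, prob, "s", [("s", "a")], 20000, random.Random(0))
all_edges = bss1_estimate(edges, prob, "s", edges, 20000, random.Random(0))
```

When every edge is used for stratification, each stratum is a single possible graph and the estimate equals the exact expectation (0.75 here). With fewer stratifying edges, only the randomness inside each stratum remains.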
48
![Page 49: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/49.jpg)
Type-I BSS-I Estimator
[Figure: BSS-I splits the probability space into $2^r$ strata; stratum $i$ receives sample size $N_i = N \pi_i$.]
49
![Page 50: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/50.jpg)
Type-I Recursive Estimator (RSS-I)
Recursively apply BSS-I to each stratum, until the sample size reaches a given threshold.
RSS-I is unbiased and its variance is smaller than that of BSS-I.
Time and space complexity are the same as NMC.
[Figure: RSS-I applies BSS-I recursively inside each stratum; stratum $i$ receives sample size $N_i = N \pi_i$.]
50
![Page 51: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/51.jpg)
Type-II Basic Estimator (BSS-II)
Select $r$ edges to partition the probability space (all the possible graphs) into $r+1$ strata.
Similarly, each stratum corresponds to a probability subspace (a set of possible graphs).
How to select the $r$ edges: BFS or random.
![Page 52: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/52.jpg)
Type-II Estimators
[Figure: BSS-II and RSS-II split the probability space into $r+1$ strata; stratum $i$ receives sample size $N_i = N \pi_i$.]
52
![Page 53: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/53.jpg)
Random-walk Domination
53
![Page 54: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/54.jpg)
Social Browsing
Social browsing: a process in which users in a social network find information along their social ties.
  Photo-sharing on Flickr, online advertisements
Two issues:
  Problem-I: How to place items on users in a social network so that the other users can easily discover them by social browsing?
    To minimize the expected number of hops for every node to hit the target set.
  Problem-II: How to place items on users so that as many users as possible can discover them by social browsing?
    To maximize the expected number of nodes that hit the target set.
54
![Page 55: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/55.jpg)
The Random Walk
Both problems are random-walk problems.
$L$-length random walk model, where the path length of random walks is bounded by a nonnegative number $L$.
  A random walk in general can be considered as $L = \infty$.
Let $Z_u^t$ be the position of an $L$-length random walk, starting from node $u$, at discrete time $t$.
Let $T_{uv}^L$ be a random walk variable: $T_{uv}^L = \min\{\min\{t : Z_u^t = v, t \ge 0\}, L\}$.
The hitting time $h_{uv}^L$ can be defined as the expectation of $T_{uv}^L$.
55
![Page 56: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/56.jpg)
The Hitting Time
Sarkar and Moore (UAI’07) define the hitting time of the $L$-length random walk in a recursive manner.
Our hitting time can be computed by the recursive procedure.
Let $d_u$ be the degree of node $u$ and $N(u)$ be the set of neighbor nodes of $u$. $p_{uw} = 1/d_u$ is the transition probability for $w \in N(u)$, and $p_{uw} = 0$ otherwise.
56
![Page 57: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/57.jpg)
The Random-Walk Domination
Consider a set of nodes $S$. If a random walk from $u$ reaches $S$ by an $L$-length random walk, we say $S$ dominates $u$ by an $L$-length random walk.
Generalized hitting time over a set of nodes, $S$: the hitting time $h_{uS}^L$ can be defined as the expectation of a random walk variable $T_{uS}^L$.
It can be computed recursively.
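One way to realise the recursive computation is a small dynamic program over the walk length. The sketch below assumes the truncated form of the hitting time: zero for nodes in the target set, otherwise one hop plus the average of the neighbours' values for the walk that is one step shorter, with every value equal to zero at length 0. The function name `hitting_times` and the uniform-neighbour transition are assumptions of this note, and every node is assumed to have at least one neighbour.

```python
def hitting_times(adj, targets, L):
    """Expected number of hops, capped at L, for a walk from each node to hit targets.

    adj:     dict mapping a node to the list of its neighbours (undirected graph)
    targets: set of target nodes
    L:       maximum walk length
    """
    h = {u: 0.0 for u in adj}  # values for walks of length 0
    for _ in range(L):
        nxt = {}
        for u in adj:
            if u in targets:
                nxt[u] = 0.0
            else:
                # One hop now, then continue from a uniformly chosen neighbour
                # with one step less remaining.
                nxt[u] = 1.0 + sum(h[w] for w in adj[u]) / len(adj[u])
        h = nxt
    return h

# Path graph a - b - c with target set {a}.
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
h3 = hitting_times(adj, {"a"}, 3)
```

On this path with `L = 3`, a walk from `c` either reaches `a` in 2 hops (probability 0.5) or runs out of steps, so its capped hitting time is `0.5 * 2 + 0.5 * 3 = 2.5`. Each of the `L` rounds touches every edge once, so the cost is proportional to `L` times the graph size.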
57
![Page 58: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/58.jpg)
How to place items on users in a social network so that the other users can easily discover by social browsing?
To minimize the total expected number of hops of which every node hits the target set.
Problem-I
or
58
![Page 59: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/59.jpg)
Problem-II
How to place items on users so that as many users as possible can discover them by social browsing?
To maximize the expected number of nodes that hit the target set.
Let $X_{uS}^L$ be an indicator random variable such that, by an $L$-length random walk, $X_{uS}^L = 1$ if $u$ hits any node in $S$, and $X_{uS}^L = 0$ otherwise.
Let $p_{uS}^L$ be the probability of the event that an $L$-length random walk starting from $u$ hits a node in $S$.
Then, $\mathbb{E}[X_{uS}^L] = p_{uS}^L$.
59
![Page 60: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/60.jpg)
Influence Maximization vs Problem II
Influence maximization is to select $k$ nodes to maximize the expected number of nodes that are reachable from the nodes selected.
  Independent cascade model
  Probabilities associated with the edges are independent
  A target node can influence multiple immediate neighbors at a time.
Problem II is to select $k$ nodes to maximize the expected number of nodes that reach a node in the nodes selected.
  $L$-length random walk model
60
![Page 61: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/61.jpg)
Submodular Function Maximization
Submodular set function maximization subject to a cardinality constraint is NP-hard.
The greedy algorithm
  There is a $(1 - 1/e)$-approximation algorithm.
  Linear time and space complexity w.r.t. the size of the graph.
Submodularity: $F(S)$ is submodular and non-decreasing.
  Non-decreasing: $F(S) \le F(T)$ for $S \subseteq T \subseteq V$.
  Submodular: Let $\rho_u(S) = F(S \cup \{u\}) - F(S)$ be the marginal gain. Then $\rho_u(S) \ge \rho_u(T)$ for $S \subseteq T \subseteq V$ and $u \in V \setminus T$.
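The greedy algorithm repeatedly adds the element with the largest marginal gain. The sketch below is generic and uses a toy coverage function as the submodular objective; the coverage data and the names `greedy_max` and `coverage` are illustrative and do not come from the slides.

```python
def greedy_max(ground, k, F):
    """Pick up to k elements, each time adding the one with the largest marginal gain."""
    S = set()
    for _ in range(k):
        best, best_gain = None, 0.0
        for u in ground:
            if u in S:
                continue
            gain = F(S | {u}) - F(S)  # marginal gain of adding u to S
            if best is None or gain > best_gain:
                best, best_gain = u, gain
        if best is None:
            break
        S.add(best)
    return S

# Toy submodular, non-decreasing function: number of items covered by the chosen sets.
covers = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5, 6}, "d": {1}}

def coverage(S):
    covered = set()
    for u in S:
        covered |= covers[u]
    return len(covered)

chosen = greedy_max(["a", "b", "c", "d"], 2, coverage)
```

Here the first pick is `a` (gain 3) and the second is `c` (gain 3), covering all six items. For a non-decreasing submodular function, this greedy choice is what yields the $(1 - 1/e)$ guarantee.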
61
![Page 62: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/62.jpg)
Submodular set function maximization subject to a cardinality constraint is NP-hard.
Both Problem I and Problem II use a submodular set function. Problem-I: Problem-II:
Submodular Function Maximization
62
![Page 63: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/63.jpg)
The Algorithm
Let It implies dynamic programming (DP) is needed to
compute the marginal gain.
Marginal gain
63
![Page 64: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/64.jpg)
Diversified Ranking
64
![Page 65: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/65.jpg)
Diversified Ranking [Li et al., TKDE’13]
Why diversified ranking?
  Diversity of information requirements
  Incomplete queries
![Page 66: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/66.jpg)
Problem Statement
The goal is to find K nodes in a graph that are relevant to the query node and also dissimilar to each other.
Main applications
  Ranking nodes in social networks, ranking papers, etc.
66
![Page 67: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/67.jpg)
Challenges
Diversity measures
  No widely accepted diversity measures on graphs in the literature.
Scalability
  Most existing methods do not scale to large graphs.
Lack of intuitive interpretation.
67
![Page 68: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/68.jpg)
Grasshopper/ManiRank
The main idea
  Work in an iterative manner.
  Select a node at one iteration by random walk.
  Set the selected node to be an absorbing node, and perform random walk again to select the second node.
  Perform the same process for K iterations to get K nodes.
No diversity measure
  Achieving diversity only by intuition and experiments.
Cannot scale to large graph (time complexity O())
68
![Page 69: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/69.jpg)
Grasshopper/ManiRank Initial random walk with no absorbing states
Absorbing random walk after ranking the first item
69
![Page 70: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/70.jpg)
Our Approach
The main idea
  Relevance of the top-K nodes (denoted by a set S) is achieved by large (Personalized) PageRank scores.
  Diversity of the top-K nodes is achieved by a large expansion ratio.
  A larger expansion ratio of the set of nodes S implies better diversity.
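To make the intuition concrete, the sketch below computes an expansion ratio under an assumed definition: the number of nodes in S together with their direct neighbours, divided by the total number of nodes. The slide's own formula did not survive extraction, so this definition and the name `expansion_ratio` should be read as assumptions.

```python
def expansion_ratio(adj, S):
    """Fraction of all nodes that are in S or adjacent to a node in S (assumed definition)."""
    expanded = set(S)
    for u in S:
        expanded.update(adj[u])
    return len(expanded) / len(adj)

# A triangle {1, 2, 3}, a separate edge {4, 5}, and an isolated node 6.
adj = {1: [2, 3], 2: [1, 3], 3: [1, 2], 4: [5], 5: [4], 6: []}

clustered = expansion_ratio(adj, {1, 2})  # both picks sit in the same triangle
spread = expansion_ratio(adj, {1, 4})     # picks come from different regions
```

Two nodes from the same dense region cover only that region (ratio 0.5), while two nodes from different regions cover more of the graph (ratio 5/6), which matches the claim that a larger expansion ratio indicates better diversity.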
70
![Page 71: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/71.jpg)
Submodular Function Maximization
Submodular set function maximization subject to a cardinality constraint is NP-hard.
The greedy algorithm
  There is a $(1 - 1/e)$-approximation algorithm.
  Linear time and space complexity w.r.t. the size of the graph.
Submodularity: $F(S)$ is submodular and non-decreasing.
  Non-decreasing: $F(S) \le F(T)$ for $S \subseteq T \subseteq V$.
  Submodular: Let $\rho_u(S) = F(S \cup \{u\}) - F(S)$ be the marginal gain. Then $\rho_u(S) \ge \rho_u(T)$ for $S \subseteq T \subseteq V$ and $u \in V \setminus T$.
71
![Page 72: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/72.jpg)
Top-k Structural Diversity Search
72
![Page 73: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/73.jpg)
Social Contagion
Social contagion is a process of information (e.g. fads, news, opinions) diffusion in online social networks.
  In the traditional biological contagion model, the probability of being affected depends on degree.
[Figure labels: Marketing, Opinions, Diffusion, Social Network]
73
![Page 74: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/74.jpg)
Facebook Study [Ugander et al., PNAS’12]
Case study: the process by which a user joins Facebook in response to an invitation email from an existing Facebook user.
Social contagion is not like biological contagion.
74
![Page 75: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/75.jpg)
Structural Diversity
Structural diversity of an individual is the number of connected components in one’s neighborhood.
The problem: Find individuals with the highest structural diversity.
[Figure: connected components in the neighborhood of the “white center” node.]
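The definition translates directly into code: take the neighbours of a node, keep only the edges among them, and count connected components. This is a straightforward sketch for small graphs, not the search algorithm from the talk; the function names and the toy graph are illustrative.

```python
def structural_diversity(adj, v):
    """Number of connected components in the subgraph induced by v's neighbours."""
    nbrs = set(adj[v])
    seen = set()
    components = 0
    for start in nbrs:
        if start in seen:
            continue
        components += 1
        seen.add(start)
        stack = [start]
        while stack:
            x = stack.pop()
            for y in adj[x]:
                # Stay inside the neighbourhood of v (v itself is excluded).
                if y in nbrs and y not in seen:
                    seen.add(y)
                    stack.append(y)
    return components

def top_k_structural_diversity(adj, k):
    """Nodes with the k highest structural diversity, by brute force."""
    return sorted(adj, key=lambda v: structural_diversity(adj, v), reverse=True)[:k]

# Node 0 has four neighbours forming two separate pairs: 1-2 and 3-4.
adj = {0: [1, 2, 3, 4], 1: [0, 2], 2: [0, 1], 3: [0, 4], 4: [0, 3]}
top = top_k_structural_diversity(adj, 1)
```

Node 0 has structural diversity 2 (its neighbours fall into two social contexts), while every other node has diversity 1.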
75
![Page 76: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/76.jpg)
Part II: I/O Efficiency
76
![Page 77: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/77.jpg)
Big Data: The Volume
Consider a dataset $D$ of 1 PetaByte ($10^{15}$ bytes). A linear scan of $D$ takes 46 hours with the fastest Solid State Drive (SSD), at a speed of 6 GB/s.
  PTIME queries do not always serve as a good yardstick for tractability (“Big Data with Preprocessing” by Fan et al., PVLDB’13).
Consider a function $f(G)$. One possible way is to make $G$ small to be $G'$, and find the answers from $G'$, as the query can be answered by $f'(G')$ with $f(G) \approx f'(G')$. There are many ways we can explore.
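The 46-hour figure for a linear scan follows from simple arithmetic, assuming decimal units (1 PB = 10^15 bytes and 6 GB/s = 6 × 10^9 bytes per second):

```python
DATASET_BYTES = 10**15                 # 1 PetaByte, decimal units (assumption)
SCAN_BYTES_PER_SECOND = 6 * 10**9      # 6 GB/s, decimal units (assumption)

scan_seconds = DATASET_BYTES / SCAN_BYTES_PER_SECOND
scan_hours = scan_seconds / 3600
```

That is roughly 166,667 seconds, a little over 46 hours, for a single pass that does nothing but read the data.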
77
![Page 78: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/78.jpg)
Big Data: The Volume
Consider a function $f(G)$. One possible way is to make $G$ small to be $G'$, and find the answers from $G'$, as the query can be answered by $f'(G')$ with $f(G) \approx f'(G')$.
There are many ways we can explore.
Make data simple and small
  Graph sampling, graph compression
  Graph sparsification, graph simplification
  Graph summary
  Graph clustering
  Graph views
78
![Page 79: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/79.jpg)
More Work on Big Data
We also believe that there are many things we need to do on Big Data.
We are planning to explore many directions.
  Make data simple and small
    Graph sampling, graph simplification, graph summary, graph clustering, graph views.
  Explore different computing approaches
    Parallel computing, distributed computing, streaming computing, semi-external/external computing.
79
![Page 80: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/80.jpg)
I/O Efficient Graph Computing
I/O Efficient: Computing SCCs in Massive Graphs, by Zhiwei Zhang, Jeffrey Xu Yu, Lu Qin, Lijun Chang, and Xuemin Lin, SIGMOD’13.
Contract & Expand: I/O Efficient SCCs Computing, by Zhiwei Zhang, Lu Qin, and Jeffrey Xu Yu.
Divide & Conquer: I/O Efficient Depth-First Search, by Zhiwei Zhang, Jeffrey Xu Yu, Lu Qin, and Zechao Shang.
80
![Page 81: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/81.jpg)
Reachability Query
Two possible but infeasible solutions:
  Traverse the graph to answer a reachability query
    Low query performance (long query time)
  Precompute and store the transitive closure
    Fast query processing
    Large storage requirement
The labeling approaches:
  Assign labels to nodes in a preprocessing step offline.
  Answer a query online using the labels assigned.
81
![Page 82: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/82.jpg)
Make a Graph Small and Simple
Any directed graph $G$ can be represented as a DAG (Directed Acyclic Graph) by taking every SCC (Strongly Connected Component) of $G$ as a node of the DAG.
An SCC of a directed graph $G$ is a maximal set of nodes $S$ such that for every pair of nodes $u$ and $v$ in $S$, $u$ and $v$ are reachable from each other.
[Figure: example directed graph on nodes A, B, C, D, F, G]
82
![Page 83: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/83.jpg)
The Reachability Queries
Reachability queries can be answered by the DAG.
[Figure: example directed graph on nodes A to I]
83
![Page 84: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/84.jpg)
The Issue and the Challenge
A massive directed graph $G$ needs to be converted into a DAG in order to process it efficiently, because $G$ cannot be held in main memory and the DAG can be much smaller.
The existing works assume that this conversion can be done.
But the conversion itself needs a large main memory.
84
![Page 85: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/85.jpg)
The Issue and the Challenge
The Dataset uk-2007
Nodes: 105,896,555
Edges: 3,738,733,648
Average degree: 35
Memory: 400 MB for nodes, and 28 GB for edges.
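The memory figures above can be roughly reproduced with a short calculation. This is only a sketch: it assumes 4-byte node identifiers and edges stored as pairs of identifiers, which the slide does not state.

```python
# Back-of-the-envelope memory estimate for uk-2007.
# Assumption (not from the slide): 32-bit node IDs, one edge = two IDs.
NODES = 105_896_555
EDGES = 3_738_733_648
BYTES_PER_ID = 4

node_bytes = NODES * BYTES_PER_ID
edge_bytes = EDGES * 2 * BYTES_PER_ID

node_mib = node_bytes / 2**20   # mebibytes needed for one value per node
edge_gib = edge_bytes / 2**30   # gibibytes needed for the edge list
avg_degree = EDGES / NODES

print(f"nodes: {node_mib:.0f} MiB, edges: {edge_gib:.1f} GiB, "
      f"average degree: {avg_degree:.1f}")
```

Under these assumptions the node array is on the order of 400 MB and the edge list on the order of 28 GB, which is why keeping only per-node data in memory is attractive.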
85
![Page 86: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/86.jpg)
In Memory Algorithm?
In Memory Algorithm: Scan $G$ twice.
  DFS($G$) to obtain a decreasing order for each node of $G$.
  Reverse every edge to obtain the reversed graph, and DFS on the reversed graph according to the same decreasing order to find all SCCs.
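The two scans can be sketched in a few lines of Python. This is a minimal in-memory illustration of the classic two-pass scheme (often attributed to Kosaraju), not the implementation used in the talk; the function name `kosaraju_scc` and the tiny example graph are made up for illustration.

```python
from collections import defaultdict

def kosaraju_scc(nodes, edges):
    out_adj, in_adj = defaultdict(list), defaultdict(list)
    for u, v in edges:
        out_adj[u].append(v)
        in_adj[v].append(u)   # in_adj is the adjacency of the reversed graph

    # Pass 1: iterative DFS on G, recording nodes in postorder.
    visited, postorder = set(), []
    for s in nodes:
        if s in visited:
            continue
        visited.add(s)
        stack = [(s, iter(out_adj[s]))]
        while stack:
            u, it = stack[-1]
            for v in it:
                if v not in visited:
                    visited.add(v)
                    stack.append((v, iter(out_adj[v])))
                    break
            else:               # all out-neighbors explored
                postorder.append(u)
                stack.pop()

    # Pass 2: DFS on the reversed graph in decreasing postorder.
    comp = {}
    for s in reversed(postorder):
        if s in comp:
            continue
        comp[s] = s
        stack = [s]
        while stack:
            u = stack.pop()
            for v in in_adj[u]:
                if v not in comp:
                    comp[v] = s
                    stack.append(v)

    groups = defaultdict(set)
    for u, root in comp.items():
        groups[root].add(u)
    return list(groups.values())

sccs = kosaraju_scc(["A", "B", "C", "D"],
                    [("A", "B"), ("B", "C"), ("C", "A"), ("C", "D")])
```

Both passes follow adjacency lists in an order dictated by the graph structure, which is the access pattern that becomes random I/O once the graph no longer fits in memory.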
86
![Page 87: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/87.jpg)
In Memory Algorithm?
DFS($G$) to obtain a decreasing order for each node of $G$.
[Figure: example graph on nodes 1 to 9]
87
![Page 88: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/88.jpg)
In Memory Algorithm?
Reverse every edge to obtain the reversed graph.
[Figure: two drawings of the example graph on nodes 1 to 9]
88
![Page 89: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/89.jpg)
In Memory Algorithm?
DFS on the reversed graph according to the same decreasing order to find all SCCs. (A subtree, in black edges, forms an SCC.)
[Figure: example graph on nodes 1 to 9]
89
![Page 90: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/90.jpg)
(Semi)-External Algorithms
In Memory Algorithm: Scan twice.
The in-memory algorithm cannot handle a large graph that cannot be held in memory. Why? No locality: a large number of random I/Os.
Consider external algorithms and/or semi-external algorithms. Let $|M|$ be the size of main memory.
  External algorithm: $|M| < |G|$.
  Semi-external algorithm: $k|V| \le |M| < |G|$. It assumes that a tree can be held in memory.
90
![Page 91: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/91.jpg)
Contraction Based External Algorithm (1)
Load in a subgraph and merge the SCCs in it in main memory in every iteration [Cosgaya-Lozano et al. SEA'09].
[Figure: example graph on nodes A to I, with a subgraph on nodes A, B, C, F, G loaded into main memory]
91
![Page 92: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/92.jpg)
Contraction Based External Algorithm (2)
[Figure: example graph on nodes A to I, with contracted subgraphs held in main memory]
92
![Page 93: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/93.jpg)
Contraction Based External Algorithm (3)
Cannot always find all SCCs!
The graph in main memory is a DAG, and memory is full. Node “I” cannot be loaded into memory.
[Figure: example graph on nodes A to I, with the contracted graph filling main memory]
93
![Page 94: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/94.jpg)
DFS Based Semi-External Algorithm
Find a DFS-tree without forward-cross-edges [Sibeyn et al. SPAA’02].
For a forward-cross-edge $(u, v)$, delete the old tree edge to $v$, and set $(u, v)$ as a new tree edge.
[Figure: spanning tree on nodes 1 to 9; legend: tree-edge, forward-edge, backward-edge, forward-cross-edge, backward-cross-edge, deleted old tree edge, new tree edge]
94
![Page 95: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/95.jpg)
DFS Based Approaches: Cost-1
DFS-SCC uses sequential I/Os. DFS-SCC needs to traverse a graph twice using DFS to
compute all SCCs. In each DFS, in the worst case it needs the number of
I/Os, where is the block size.
95
![Page 96: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/96.jpg)
DFS Based Approaches: Cost-2
Partial SCCs cannot be contracted to save space while constructing a DFS tree.
Why?
  DFS-SCC needs to traverse a graph $G$ twice using DFS to compute all SCCs.
  DFS-SCC uses a total order of nodes (decreasing postorder) in the second DFS, which is computed in the first DFS.
  SCCs cannot be partially contracted in the first DFS.
  SCCs can be partially contracted in the second DFS, but we have to remember which nodes belong to which SCCs with extra space. Not free!
96
![Page 97: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/97.jpg)
DFS Based Approaches: Cost-3 High CPU cost for reshaping a DFS-tree, when it attempts
to reduce the number of forward-cross-edges.
97
![Page 98: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/98.jpg)
Our New Approach [SIGMOD’13]
We propose a new two-phase algorithm, 2P-SCC: Tree-Construction and Tree-Search.
  In the Tree-Construction phase, we construct a tree-like structure.
  In the Tree-Search phase, we scan the graph only once.
We further propose a new algorithm, 1P-SCC, to combine Tree-Construction and Tree-Search with new optimization techniques, using a tree:
  Early-Acceptance
  Early-Rejection
  Batch Edge Reduction
A joint work by Zhiwei Zhang, Jeffrey Xu Yu, Lu Qin, Lijun Chang, and Xuemin Lin
98
![Page 99: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/99.jpg)
A New Weak Order
The total order used in DFS-SCC is too strong, and there is no obvious relationship between the total order and the SCCs per se. The total order cannot help to reduce I/O costs.
We introduce a new weak order.
  For an SCC, there must exist at least one cycle.
  While constructing a tree $T$ for $G$, a cycle will appear to contain at least one edge $(u, v)$ that links to a higher-level node in $T$.
There are two cases for such an edge $(u, v)$:
  A cycle: $v$ is an ancestor of $u$ in $T$.
  Not a cycle (up-edge): $v$ is not an ancestor of $u$ in $T$.
We reduce the number of up-edges iteratively.
99
![Page 100: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/100.jpg)
The Weak Order: drank
For a node $u$, consider the set of nodes including $u$ and the nodes that $u$ can reach by a tree $T$ of $G$.
depth($u$): the length of the longest simple path from the root to $u$.
drank is used as the weak order!
Nodes do not need to have a unique order.
Example: depth(B) = 1, drank(B) = 1, dlink(B) = B; depth(E) = 3, drank(E) = 1, dlink(E) = B.
[Figure: example tree on nodes A to I]
100
![Page 101: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/101.jpg)
2P-SCC
To reduce Cost-1, we construct a BR+-tree in the Tree-Construction phase, and in the Tree-Search phase we compute all SCCs by traversing $G$ only once using the constructed BR+-tree.
To reduce Cost-3, we have shown that we only need to update the depth of nodes locally.
101
![Page 102: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/102.jpg)
BR-Tree and BR+-Tree
BR-Tree is a spanning tree of $G$.
BR+-Tree is a BR-Tree plus some additional edges $(u, v)$ such that $v$ is an ancestor of $u$.
In memory: black edges.
[Figure: example BR+-Tree on nodes A to I]
102
![Page 103: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/103.jpg)
Tree-Construction: Up-edge
An edge $(u, v)$ is an up-edge on the conditions:
  $v$ is not an ancestor of $u$ in $T$
Up-edges violate the existing order.
Example: drank(I) = 1, drank(H) = 2.
[Figure: example tree on nodes A to I with an up-edge]
103
![Page 104: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/104.jpg)
Tree-Construction (Push-Down)
When there is a violating up-edge, modify $T$:
  Delete the old tree edge.
  Set the up-edge as a new tree edge.
Graph reconstruction: no I/O cost, low CPU cost.
[Figure: example tree on nodes A to I]
104
![Page 105: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/105.jpg)
Tree-Construction (Graph Reconstruction)
Tree edges and one extra edge in the BR+-Tree form a part of an SCC.
For an up-edge $(u, v)$, if dlink($v$) is an ancestor of $u$ in $T$, delete $(u, v)$ and add $(u, \text{dlink}(v))$ as a new edge.
In Tree-Search, scan the graph only once to find all SCCs, which reduces I/O costs.
Example: drank(E) = 1, dlink(E) = B; drank(F) = 1.
[Figure: tree on nodes A, B, D, E, F with an up-edge and the new edge]
105
![Page 106: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/106.jpg)
Tree-Construction
When a BR+-tree is completely constructed, there are no up-edges. There are only two kinds of edges in $G$:
  The BR+-tree edges, and
  The remaining edges, which are not up-edges.
Such remaining edges do not play any role in determining an SCC.
106
![Page 107: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/107.jpg)
Tree-Search
In memory for each node $u$: TreeEdge($u$), dlink($u$), drank($u$). In total: $3 \times |V|$.
Search procedure: if an edge $(u, v)$ points to an ancestor, merge all nodes from $v$ to $u$ in the tree.
Only need to scan the graph once.
[Figure: example BR+-Tree on nodes A to I]
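The merge step of the search procedure can be illustrated with a union-find structure over a tree given by parent pointers. This is only a sketch under assumptions: the function `tree_search`, the dictionary layout, and the toy tree are hypothetical, and the BR+-tree construction itself is not shown. On an arbitrary spanning tree a single pass like this can miss merges depending on the edge order; the single-scan claim on the slide relies on the properties of a fully constructed BR+-tree.

```python
def tree_search(parent, edges):
    """parent maps each node to its tree parent (None for the root)."""
    # Union-find; the representative is always the topmost node of a merged group.
    rep = {u: u for u in parent}

    def find(u):
        while rep[u] != u:
            rep[u] = rep[rep[u]]   # path halving
            u = rep[u]
        return u

    for u, v in edges:             # one scan over the edge list
        ru, rv = find(u), find(v)
        if ru == rv:
            continue
        # Climb from u's group towards the root, looking for v's group.
        path, x = [], ru
        while x is not None and x != rv:
            path.append(x)
            p = parent[x]
            x = find(p) if p is not None else None
        if x == rv:                # v is an ancestor of u: merge the tree path
            for y in path:
                rep[y] = rv

    groups = {}
    for u in parent:
        groups.setdefault(find(u), set()).add(u)
    return list(groups.values())

toy_parent = {"A": None, "B": "A", "C": "B", "D": "A"}
merged = tree_search(toy_parent, [("C", "A"), ("D", "B")])
```

In the toy tree, the edge (C, A) points to an ancestor and merges A, B, C; the edge (D, B) then points into that merged group, which is an ancestor of D, so D joins it as well.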
107
![Page 108: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/108.jpg)
From 2P-SCC To 1P-SCC
With 2P-SCC:
  In the Tree-Construction phase, we construct a tree by an approach similar to DFS-SCC, and
  In the Tree-Search phase, we scan the graph only once.
  The memory used for the BR+-tree is $3 \times |V|$.
With 1P-SCC:
  We combine Tree-Construction and Tree-Search with new optimization techniques to reduce Cost-2 and Cost-3: Early-Acceptance, Early-Rejection, Batch Edge Reduction.
  Only need to use a BR-tree with memory of $2 \times |V|$.
108
![Page 109: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/109.jpg)
Early-Acceptance and Early Rejection
Early acceptance: we contract a partial SCC into a node in an early stage while constructing a BR-tree.
Early rejection: we identify necessary conditions to remove nodes that will not participate in any SCCs while constructing a BR-tree.
109
![Page 110: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/110.jpg)
Early-Acceptance and Early Rejection Consider an example. The three nodes on the left can be contracted into a node on the right. The node “a” and the subtrees, C and D, can be rejected.
110
![Page 111: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/111.jpg)
Modify Tree + Early Acceptance
Memory: $2 \times |V|$. Reduce I/O cost.
[Figure: example tree on nodes A to K, annotated with two “Up-edge: Modify Tree” steps and two “Early-Acceptance” steps]
111
![Page 112: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/112.jpg)
DFS-Based vs. Our Approaches
DFS-based approaches:
  I/O cost for DFS is high.
  Use a total order.
  Cannot merge SCCs when found earlier: the total order cannot be changed; large # of I/Os.
  Cannot prune non-SCC nodes: the total order cannot be changed.
Our approaches:
  Smaller I/O cost.
  Use a weaker order.
  Merge SCCs as early as possible: merge nodes with the same order; small size, small # of I/Os.
  Prune non-SCC nodes as early as possible: the weaker order is flexible.
112
![Page 113: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/113.jpg)
Optimization: Batch Edge Reduction
With 1P-SCC, CPU cost is still high. In order to determine whether $(u, v)$ is a backward-edge/up-edge, it needs to check the ancestor relationships between $u$ and $v$ over a tree.
  The tree is frequently updated.
  The average depth of nodes in the tree becomes larger with frequent push-down operations.
113
![Page 114: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/114.jpg)
Optimization: Batch Edge Reduction
When memory can hold more edges, there is no need to contract partial SCCs edge by edge. Find all SCCs in the main memory at the same time:
  Read all edges that can be read into memory.
  Construct a graph with the edges of the tree constructed in memory already plus the edges newly read into memory.
  Construct a DAG in memory using the existing in-memory algorithm, which finds all SCCs in memory.
  Reconstruct the BR-Tree according to the DAG.
114
![Page 115: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/115.jpg)
Performance Studies
Implemented using Visual C++ 2005.
Tested on a PC with an Intel Core2 Quad 2.66GHz CPU and 3.43GB memory running Windows XP.
Disk Block Size: Memory Size:
115
![Page 116: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/116.jpg)
Four Real Data Sets

| Data set | \|V\| | \|E\| | Average Degree |
|---|---|---|---|
| cit-patent | 3,774,768 | 16,518,947 | 4.70 |
| go-uniprot | 6,967,956 | 34,770,235 | 4.99 |
| citeseerx | 6,540,399 | 15,011,259 | 2.30 |
| WEBSPAM-UK2007 | 105,896,555 | 3,738,733,568 | 35.00 |
116
![Page 117: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/117.jpg)
Synthetic Data Sets

| Parameter | Range | Default |
|---|---|---|
| Node Size | 30M - 70M | 30M |
| Average Degree | 3 - 7 | 5 |
| Size of Massive SCCs | 200K – 600K | 400K |
| Size of Large SCCs | 4K - 12K | 8K |
| Size of Small SCCs | 20 - 60 | 40 |
| # of Massive SCCs | 1 | 1 |
| # of Large SCCs | 30 - 70 | 50 |
| # of Small SCCs | 6K – 14K | 10K |

We construct a graph $G$ by (1) randomly selecting all nodes in SCCs first, (2) adding edges among the nodes in an SCC until all nodes form an SCC, and (3) randomly adding nodes/edges to the graph.
117
![Page 118: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/118.jpg)
Performance Studies

| | 1PB-SCC | 1P-SCC | 2P-SCC | DFS-SCC |
|---|---|---|---|---|
| cit-patent (s) | 24s | 22s | 701s | 840s |
| go-uniprot (s) | 22s | 21s | 301s | 856s |
| citeseerx (s) | 10s | 8s | 517s | 669s |
| cit-patent (I/O) | 16,031 | 13,331 | 133,467 | 667,530 |
| go-uniprot (I/O) | 26,034 | 47,947 | 471,540 | 619,969 |
| citeseerx (I/O) | 15,472 | 13,482 | 104,611 | 392,659 |
118
![Page 119: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/119.jpg)
WEBSPAM-UK2007: Vary Node Size
119
![Page 120: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/120.jpg)
WEBSPAM-UK2007: Vary Memory
120
![Page 121: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/121.jpg)
Synthetic Data Sets: Vary SCC Sizes
121
![Page 122: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/122.jpg)
Synthetic Data Sets: Vary # of SCCs
122
![Page 123: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/123.jpg)
From Semi-External to External
Existing semi-external solutions work under the condition that a tree can be held in main memory, $k|V| \le |M|$, and generate a large number of I/Os.
We study an external algorithm by removing the condition $k|V| \le |M|$.
A joint work by Zhiwei Zhang, Lu Qin, and Jeffrey Xu Yu
123
![Page 124: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/124.jpg)
The New Approach: The Main Idea
DFS based approaches generate random accesses.
The contraction based semi-external approach reduces $|V|$ and $|E|$ together at the same time. Cannot find all SCCs.
The main idea of our external algorithm:
  Work on a small graph $G'$ of $G$ by reducing $V$, because $V'$ can be small.
  Find all SCCs in $G'$.
  Add removed nodes back to find SCCs in $G$.
124
![Page 125: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/125.jpg)
The New Approach: The Property
Reducing the given graph: if $u$ can reach $v$ in $G$, $u$ can also reach $v$ in the reduced graph $G'$.
Maintaining this property may generate a large number of random I/O accesses.
Reason: several nodes on the path from $u$ to $v$ may be removed from $G$ in the previous iterations.
125
![Page 126: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/126.jpg)
The New Approach: The Approach
We introduce a new Contraction & Expansion approach.
Contraction phase:
  Reduce nodes iteratively, $G_1, G_2, \dots, G_l$.
  It decreases $|V_i|$, but may increase $|E_i|$.
Expansion phase:
  In the reverse order of the contraction phase, $G_l, G_{l-1}, \dots, G_1$.
  Find all SCCs in $G_l$ using a semi-external algorithm. The semi-external algorithm deals with $|E_l|$ edges.
  Expand $G_l$ back to $G_1$.
126
![Page 127: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/127.jpg)
The Contraction
In the Contraction phase, graphs $G_1, G_2, \dots$ are generated.
$G_{i+1}$ is generated by removing a batch of nodes from $G_i$.
It stops when the semi-external approach can be applied.
[Figure: $G_1$, $G_2$, $G_3$]
127
![Page 128: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/128.jpg)
The Expansion
In the Expansion phase, removed nodes are added back.
Addition is in the reverse order of their removal in the contraction phase.
[Figure: $G_1$, $G_2$, $G_3$]
128
![Page 129: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/129.jpg)
The Contraction Phase
Compared with $G_i$, $G_{i+1}$ should have the following properties:
Contractable:
SCC-Preservable:
Recoverable:
129
![Page 130: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/130.jpg)
Contract $V_{i+1}$
Recoverable:
$G_{i+1}$ is recoverable if and only if $V_{i+1}$ is a vertex cover of $G_i$.
Under this condition, we can determine which SCCs the nodes in $V_i \setminus V_{i+1}$ belong to by scanning $G_i$ once.
For each edge, we select the node with a higher degree or a higher order.
130
![Page 131: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/131.jpg)
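Contract $V_{i+1}$
For each edge, we select the node with a higher degree or a higher order.
Edges on disk:

| ID1 | ID2 | Deg1 | Deg2 |
|---|---|---|---|
| a | b | 3 | 3 |
| a | d | 3 | 4 |
| b | c | 3 | 2 |
| c | d | 2 | 4 |
| d | e | 4 | 4 |
| d | g | 4 | 4 |
| e | b | 4 | 3 |
| e | g | 4 | 4 |
| f | g | 2 | 4 |
| g | h | 4 | 2 |
| h | i | 2 | 2 |
| i | f | 2 | 2 |

[Figure: example graph on nodes a to i]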
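The selection rule can be tried directly on the rows of the on-disk table. The sketch below assumes that "higher order" means the alphabetically larger label when degrees tie; that tie-breaking rule and the names `ROWS` and `select_cover` are illustrative assumptions, not taken from the slides.

```python
# Rows (ID1, ID2, Deg1, Deg2) copied from the slide's table.
ROWS = [
    ("a", "b", 3, 3), ("a", "d", 3, 4), ("b", "c", 3, 2), ("c", "d", 2, 4),
    ("d", "e", 4, 4), ("d", "g", 4, 4), ("e", "b", 4, 3), ("e", "g", 4, 4),
    ("f", "g", 2, 4), ("g", "h", 4, 2), ("h", "i", 2, 2), ("i", "f", 2, 2),
]

def select_cover(rows):
    """One sequential pass: keep, per edge, the endpoint with the higher
    degree, breaking ties by the larger label (assumed 'order')."""
    cover = set()
    for u, v, deg_u, deg_v in rows:
        cover.add(max((deg_u, u), (deg_v, v))[1])
    return cover

kept = select_cover(ROWS)
```

Because every edge contributes one endpoint, the result is a vertex cover by construction, and it is computed in a single sequential scan of the edge file.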
131
![Page 132: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/132.jpg)
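Construct $E_{i+1}$
SCC-Preservable:
If $v$ is a removed node with edges $(u, v)$ and $(v, w)$ in $E_i$, remove $(u, v)$ and $(v, w)$ and add $(u, w)$.
Although $|E_{i+1}|$ may be larger, $|V_{i+1}|$ is sure to be smaller.
A smaller $|V_{i+1}|$ implies that the semi-external approach can be applied.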
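A small sketch of this edge construction, under the assumption that the rule is "bypass each removed node by connecting its in-neighbors to its out-neighbors". The edge list is the twelve-edge example as it survives in this transcript and the kept node set is {b, d, e, g, i}; the function name `contract_edges` is hypothetical.

```python
EDGES = [("a", "b"), ("a", "d"), ("b", "c"), ("c", "d"), ("d", "e"), ("d", "g"),
         ("e", "b"), ("e", "g"), ("f", "g"), ("g", "h"), ("h", "i"), ("i", "f")]
KEPT = {"b", "d", "e", "g", "i"}   # assumed vertex cover of the example graph

def contract_edges(edges, kept):
    # Edges with both endpoints kept survive unchanged.
    existing = {(u, v) for u, v in edges if u in kept and v in kept}
    preds, succs = {}, {}
    for u, v in edges:
        if v not in kept:
            preds.setdefault(v, set()).add(u)   # in-neighbors of a removed node
        if u not in kept:
            succs.setdefault(u, set()).add(v)   # out-neighbors of a removed node
    new = set()
    for x in preds.keys() & succs.keys():
        for u in preds[x]:
            for w in succs[x]:
                # Skip self-loops and edges that already exist.
                if u != w and (u, w) not in existing:
                    new.add((u, w))
    return existing, new

existing_edges, new_edges = contract_edges(EDGES, KEPT)
```

Since the kept nodes form a vertex cover, every neighbor of a removed node is itself kept, so each new edge connects two surviving nodes. With the twelve edges above, the sketch yields the existing edges (d, e), (d, g), (e, b), (e, g) and the new edges (b, d), (i, g), (g, i); the slide additionally lists (e, d), which these twelve edges alone do not produce.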
132
![Page 133: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/133.jpg)
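Construct $E_{i+1}$
If $v$ is a removed node with edges $(u, v)$ and $(v, w)$, remove $(u, v)$ and $(v, w)$ and add $(u, w)$.
Edges on disk:
  Existing edges: (d, e), (d, g), (e, b), (e, g)
  New edges: (e, d), (b, d), (i, g), (g, i)
[Figure: example graph on nodes a to i]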
133
![Page 134: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/134.jpg)
The Expansion Phase
For any removed node $v$, the SCC of $v$ in $G_i$ can be computed using the in-neighbors and out-neighbors of $v$ only.
[Figure: nodes a, b, c]
134
![Page 135: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/135.jpg)
Expansion Phase
Edges on disk (ID1, ID2): (a, b), (a, d), (b, c), (c, d), (d, e), (d, g), (e, b), (e, g), (f, g), (g, h), (h, i), (i, f)
[Figure: example graph on nodes a to i]
135
![Page 136: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/136.jpg)
Performance Studies
Implemented using Visual C++ 2005.
Tested on a PC with an Intel Core2 Quad 2.66GHz CPU and 3.5GB memory running Windows XP.
Disk Block Size: 64KB
Default memory Size:
136
![Page 137: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/137.jpg)
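Data Set
Real data set:

| Data set | \|V\| | \|E\| | Average Degree |
|---|---|---|---|
| WEBSPAM-UK2007 | 105,896,555 | 3,738,733,568 | 35.00 |

Synthetic data:

| Parameter | Range |
|---|---|
| Node Size | 25M – 100M |
| Average Degree | 2 - 6 |
| Size of SCCs | 20 – 600K |
| Number of SCCs | 1 – 14 K |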
137
![Page 138: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/138.jpg)
Performance Studies
Vary Memory Size
138
![Page 139: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/139.jpg)
DFS [SIGMOD’15]
Given a graph $G$, depth-first search is to search $G$ following the depth-first order.
[Figure: example graph on nodes A to J]
A joint work by Zhiwei Zhang, Jeffrey Xu Yu, Lu Qin, and Zechao Shang
139
![Page 140: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/140.jpg)
The Challenge
It needs to DFS a massive directed graph $G$, but it is possible that $G$ cannot be entirely held in main memory.
Our work only keeps the nodes in memory, which is much smaller.
140
![Page 141: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/141.jpg)
The Issue and the Challenge (1) Consider all edges from , like . Suppose DFS
searches from to . It is hard to estimate when it will visit .
It is hard to know when C/D will be visited even though they are near A and B.
It is hard to design the format of the graph on disk.
[Figure: example graph on nodes A to E]
141
![Page 142: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/142.jpg)
The Issue and the Challenge (2)
A small part of the graph can change the DFS a lot. Even if almost the entire graph can be kept in memory, it still costs a lot to find the DFS.
(E, D) will change the existing DFS significantly.
A large number of iterations is needed even if memory keeps a large portion of the graph.
[Figure: example graph on nodes A to G]
142
![Page 143: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/143.jpg)
Problem Statement
We study a semi-external algorithm that computes a DFS-Tree by which DFS can be obtained.
The limited memory is $k|V| \le |M|$, where $k$ is a small constant number.
143
![Page 144: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/144.jpg)
DFS-Tree & Edge Type
A DFS of $G$ forms a DFS-Tree.
A DFS procedure can be obtained from a DFS-Tree.
[Figure: example DFS-Tree on nodes A to J]
144
![Page 145: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/145.jpg)
DFS-Tree & Edge Type
Given a spanning tree $T$, there exist 4 types of non-tree edges: forward edge, forward-cross edge, backward-cross edge, and backward edge.
[Figure: spanning tree on nodes A to J with the non-tree edge types labeled]
145
![Page 146: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/146.jpg)
DFS-Tree & Edge Type
An ordered spanning tree is a DFS-Tree if it does not have any forward-cross edges.
[Figure: spanning tree on nodes A to J with the non-tree edge types labeled]
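The four non-tree edge types, and the DFS-Tree test that follows from them, can be sketched with preorder and postorder numbers of an ordered spanning tree. The function `classify_edges`, the dictionary-of-children representation, and the toy tree are illustrative assumptions; the order of each children list is taken as the visiting order.

```python
def classify_edges(children, root, edges):
    """Classify edges against an ordered spanning tree given as {node: [children]}."""
    pre, post, clock = {root: 0}, {}, 1
    stack = [(root, iter(children.get(root, [])))]
    while stack:                       # iterative traversal of the tree only
        u, it = stack[-1]
        for c in it:
            pre[c] = clock
            clock += 1
            stack.append((c, iter(children.get(c, []))))
            break
        else:
            post[u] = clock
            clock += 1
            stack.pop()

    tree = {(p, c) for p, cs in children.items() for c in cs}

    def is_ancestor(a, d):             # interval containment test
        return pre[a] <= pre[d] and post[d] <= post[a]

    kinds = {}
    for u, v in edges:
        if (u, v) in tree:
            kinds[(u, v)] = "tree"
        elif is_ancestor(u, v):
            kinds[(u, v)] = "forward"
        elif is_ancestor(v, u):
            kinds[(u, v)] = "backward"
        elif pre[u] < pre[v]:          # u is visited before v, no ancestry
            kinds[(u, v)] = "forward-cross"
        else:
            kinds[(u, v)] = "backward-cross"
    return kinds

toy_tree = {"A": ["B", "E"], "B": ["C"]}
toy_edges = [("A", "B"), ("A", "C"), ("C", "A"), ("B", "E"), ("E", "C")]
kinds = classify_edges(toy_tree, "A", toy_edges)
is_dfs_tree = "forward-cross" not in kinds.values()
```

In the toy example the edge (B, E) is a forward-cross edge, so this ordered tree is not a DFS-Tree: a real DFS would have followed (B, E) before returning to A.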
146
![Page 147: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/147.jpg)
Existing Solutions
Iteratively remove the forward-cross edges.
Procedure: if there exists a forward-cross edge, construct a new $T$ by conducting DFS over the graph in memory.
147
![Page 148: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/148.jpg)
Existing Solutions
Construct a new $T$ by conducting DFS over the graph in memory until no forward-cross edges exist.
[Figure: spanning tree on nodes A to J with a forward-cross edge]
148
![Page 149: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/149.jpg)
The Drawbacks
D-1: A total order in $V(G)$ needs to be maintained in the whole process.
D-2: A large number of I/Os is produced. Need to scan all edges in every iteration.
D-3: A large number of iterations is needed. The possibility of grouping the edges near each other in DFS is not considered.
149
![Page 150: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/150.jpg)
Why Divide & Conquer
We aim at dividing the graph $G$ into several subgraphs $G_i$ with possible overlaps among them.
Goal: the DFS-Tree for $G$ can be computed by the DFS-Trees for all $G_i$.
The Divide & Conquer approach can overcome the existing drawbacks.
150
![Page 151: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/151.jpg)
Why Divide & Conquer
To address D-1: a total order in $V(G)$ needs to be maintained in the whole process.
After dividing the graph $G$ into subgraphs $G_i$, we only need to maintain the total order in $V(G_i)$.
151
![Page 152: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/152.jpg)
Why Divide & Conquer
To address D-2: a large number of I/Os is produced. It needs to scan all edges in each iteration.
After dividing the graph $G$ into subgraphs $G_i$, we only need to scan the edges in $G_i$ to eliminate forward-cross edges.
152
![Page 153: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/153.jpg)
Why Divide & Conquer
To address D-3: a large number of iterations is needed. It cannot group together the edges that are near each other in the DFS visiting sequence.
After dividing the graph $G$ into subgraphs $G_i$, the DFS procedure can be applied to each $G_i$ independently.
153
![Page 154: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/154.jpg)
Valid Division
The left is not a DFS-tree. The right is a DFS-tree.
[Figure: two trees on nodes A to F, each divided into subgraphs $G_1$ and $G_2$]
154
![Page 155: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/155.jpg)
Invalid Division
An example: no matter how the DFS-Trees for $G_1$ and $G_2$ are ordered, the merged tree cannot be a DFS-Tree for $G$.
[Figure: graph on nodes A to F divided into subgraphs $G_1$ and $G_2$]
155
![Page 156: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/156.jpg)
How to Cut: Challenges
Challenge-1: it is not easy to check whether a division is valid. Need to make sure that a DFS-Tree for a divided subgraph will not affect the DFS-Trees of others.
Challenge-2: finding a good division is non-trivial. The edge types between different subgraphs are complicated.
Challenge-3: the merge procedure needs to make sure that the result is the DFS-Tree for $G$.
156
![Page 157: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/157.jpg)
Our New Approach
To address Challenge-1: compute a light-weight summary graph (S-graph), and check whether a division is valid by searching the S-graph.
To address Challenge-2: recursively divide & conquer.
To address Challenge-3: the DFS-Tree for $G$ is computed only from the DFS-Trees for the subgraphs.
157
![Page 158: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/158.jpg)
Four Division Properties Node-Coverage: Contractible: Independence: any pair of nodes in
are consistent. and can be dealt with independently
( and are DFS-Tree for and ) DFS-Preservable: there exists a DFS-Tree for
graph such that and DFS-Tree for can be computed by
158
![Page 159: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/159.jpg)
DFS-Preservable Property DFS-Tree for can be computed by .
-Tree: A spanning tree with the same edge set of a DFS-Tree (without order).
Suppose the independence property is satisfied, then the DFS-preservable property is satisfied if and only if the spanning tree T with and is a -Tree.
159
![Page 160: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/160.jpg)
Independence Property Any pair of nodes in are consistent
( and are DFS-Tree for and ). , can be dealt with independently. This may not hold: is an ancestor of in , but is a
sibling in . Theorem:
Given a division of , the independence property is satisfied if and only if for any subgraphs and , .
160
![Page 161: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/161.jpg)
Independence Property
[Figure: graph on nodes A to F divided into subgraphs $G_1$, $G_2$, and $G_3$]
161
![Page 162: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/162.jpg)
DFS-Preservable Example
The DFS-preservable property is not satisfied.
The DFS-Tree for $G$ does not exist given the DFS-Tree for each subgraph.
Forward-cross edges always exist.
[Figure: example graph on nodes A to G]
162
![Page 163: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/163.jpg)
Our Approach Root based division: independence is satisfied.
For each , it has a spanning tree . For a division , …, , . is the root of and the leaf of
[Figure: $G_0$, $G_i$, $G_j$]
163
![Page 164: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/164.jpg)
Our Approach
We expand $G_0$ to capture the relationship between different $G_i$ and call it the S-graph.
The S-graph is used to check whether the current division is valid (that is, whether the DFS-preservable property is satisfied).
[Figure: S-graph over $G_0$, $G_i$, and $G_j$]
164
![Page 165: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/165.jpg)
S-edge
S-edge: given a spanning tree of a graph, the S-edge of an edge connects two nodes of the tree, one an ancestor of each endpoint of that edge, where both nodes are children of the lowest common ancestor of the two endpoints in the tree.
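One possible reading of the S-edge definition can be sketched as follows. The tree is assumed to be stored as a `parent` map (the root maps to `None`), a node is assumed to count as its own ancestor, and the names `path_to_root` and `s_edge` are hypothetical, not taken from the slides.

```python
def path_to_root(parent, node):
    """Return the tree path from node up to the root, inclusive."""
    path = [node]
    while parent[node] is not None:
        node = parent[node]
        path.append(node)
    return path


def s_edge(parent, u, v):
    """Return the S-edge (u', v') of edge (u, v), or None if one endpoint
    is an ancestor of the other (then (u, v) is not a cross edge)."""
    path_u = path_to_root(parent, u)
    path_v = path_to_root(parent, v)
    position_on_v = {node: i for i, node in enumerate(path_v)}
    # The first node on u's path that also lies on v's path is the
    # lowest common ancestor of u and v.
    i = next(k for k, node in enumerate(path_u) if node in position_on_v)
    j = position_on_v[path_u[i]]
    if i == 0 or j == 0:
        return None
    # The nodes just below the lowest common ancestor on each path.
    return (path_u[i - 1], path_v[j - 1])
```

Walking both root paths keeps the sketch short; a real implementation on a large tree would likely use a dedicated lowest-common-ancestor structure instead.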
165
![Page 166: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/166.jpg)
S-edge Example
[Figure: graph on nodes A to K with 𝐺0 marked; legend: cross edge, S-edge.]
166
![Page 167: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/167.jpg)
S-graph
For a division , …, and is the DFS-Tree for , S-graph is constructed as follows:
Remove all backward and forward edges w.r.t.
Replace all cross-edges with their corresponding S-edge if the S-edge is between nodes in ,
For edge , if and , add edge and do the same for .
167
![Page 168: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/168.jpg)
S-graph Example
[Figure: graph on nodes A to K with 𝐺0 marked; legend: cross edge, S-edge, link.]
168
![Page 169: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/169.jpg)
S-graph Example
[Figure: graph on nodes A to K with 𝐺0 marked; legend: cross edge, S-edge, link, -graph.]
169
![Page 170: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/170.jpg)
Division Theorem
Consider a division , …, and suppose is the DFS-Tree for . The division is DFS-preservable if and only if the S-graph is a DAG.
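The DAG test that the theorem relies on can be sketched as below. Kahn's algorithm is used here as one standard way to detect a cycle; the slides do not say which method the authors use, and the function name `is_dag` and the node/edge representation are assumptions for illustration.

```python
from collections import deque


def is_dag(nodes, edges):
    """Return True if the directed graph (nodes, edges) has no cycle."""
    indegree = {n: 0 for n in nodes}
    successors = {n: [] for n in nodes}
    for src, dst in edges:
        successors[src].append(dst)
        indegree[dst] += 1
    # Repeatedly remove nodes that have no remaining incoming edges.
    queue = deque(n for n in nodes if indegree[n] == 0)
    removed = 0
    while queue:
        n = queue.popleft()
        removed += 1
        for m in successors[n]:
            indegree[m] -= 1
            if indegree[m] == 0:
                queue.append(m)
    # Any node left over lies on a cycle or behind one.
    return removed == len(nodes)
```

If `is_dag` returns `False` for an S-graph, the division is not DFS-preservable as it stands and has to be adjusted first.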
170
![Page 171: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/171.jpg)
Divide-Star Algorithm
Divide according to the children of the root of .
If the corresponding S-graph is a DAG, each subgraph can be computed independently.
Deal with strongly connected components:
Modify : add a virtual node RS representing an SCC S.
Modify : For any edge in S-graph , if and , add edge . Do the same for . Remove all nodes in S and corresponding edges.
Modify Division: create a new tree rooted at the virtual root RS and connect to the roots in the SCC.
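The SCC handling step can be illustrated with a small sketch that contracts every multi-node SCC of an S-graph into one virtual node. Tarjan's algorithm is an assumed choice (the slides do not name one), the recursive form suits only a small in-memory S-graph, and the virtual-node naming (joining member names, e.g. "DF") and the example edges are hypothetical.

```python
def strongly_connected_components(nodes, edges):
    """Tarjan's algorithm; returns a list of components (lists of nodes)."""
    successors = {n: [] for n in nodes}
    for src, dst in edges:
        successors[src].append(dst)
    index, low = {}, {}
    stack, on_stack, components = [], set(), []

    def visit(n):
        index[n] = low[n] = len(index)
        stack.append(n)
        on_stack.add(n)
        for m in successors[n]:
            if m not in index:
                visit(m)
                low[n] = min(low[n], low[m])
            elif m in on_stack:
                low[n] = min(low[n], index[m])
        if low[n] == index[n]:
            # n is the first-visited node of its component: pop it off.
            component = []
            while True:
                m = stack.pop()
                on_stack.discard(m)
                component.append(m)
                if m == n:
                    break
            components.append(component)

    for n in nodes:
        if n not in index:
            visit(n)
    return components


def contract_sccs(nodes, edges):
    """Replace each SCC with more than one node by a single virtual node."""
    label = {n: n for n in nodes}
    for component in strongly_connected_components(nodes, edges):
        if len(component) > 1:
            virtual = "".join(sorted(component))  # hypothetical naming
            for n in component:
                label[n] = virtual
    new_nodes = sorted(set(label.values()))
    # Redirect edges to the virtual node and drop edges inside an SCC.
    new_edges = sorted({(label[s], label[d]) for s, d in edges
                        if label[s] != label[d]})
    return new_nodes, new_edges
```

After contraction the graph has no multi-node SCC left, so it contains no cycle through two or more nodes.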
171
![Page 172: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/172.jpg)
Divide-Star Algorithm
[Figure: graph on nodes A to K and its -graph, with an SCC marked.]
Add a virtual root DF
172
![Page 173: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/173.jpg)
Divide-Star Algorithm
[Figure: graph on nodes A to K and its -graph, with the virtual node DF.]
-graph is a DAG
173
![Page 174: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/174.jpg)
Divide-Star Algorithm
[Figure: the graph with virtual node DF divided into 𝐺0, 𝐺1, 𝐺2, and 𝐺3; nodes B, C, DF, and H are labeled.]
Divide the graph into 4 parts
174
![Page 175: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/175.jpg)
Divide-TD Algorithm
Divide-Star algorithm divides the graph according to the children of the root.
The depth of is 1.
The max number of subgraphs after dividing will not be larger than the number of children.
Divide-TD algorithm enlarges and the corresponding S-graph.
It can result in more subgraphs than Divide-Star can provide.
175
![Page 176: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/176.jpg)
Divide-TD Algorithm
Divide-TD algorithm enlarges to a Cut-Tree.
Cut-Tree: Given a tree with root , a cut-tree is a subtree of which satisfies two conditions.
The root of is .
For any node with child nodes , if , then either is a leaf node or a node in with all child nodes .
With such conditions, for any S-edge , only two situations exist.
176
![Page 177: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/177.jpg)
Cut-Tree Construction
Given a tree T with root .
Initially contains only the root .
Iteratively pick a leaf node in and all the child nodes of in .
The process stops when the memory cannot hold it after adding the next nodes.
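The construction steps above can be sketched as follows. The memory limit is modeled as a simple node count `max_nodes`, and leaves are expanded in breadth-first order; both are assumptions, since the slides do not specify how memory is measured or which leaf is picked next. The names `build_cut_tree` and `children` are hypothetical.

```python
from collections import deque


def build_cut_tree(children, root, max_nodes):
    """Grow a cut-tree from the root until the node budget would be exceeded.

    children maps each node of the full tree to a list of its child nodes.
    Returns the set of nodes in the cut-tree.
    """
    cut_tree = {root}
    frontier = deque([root])  # cut-tree leaves that have not been expanded
    while frontier:
        leaf = frontier.popleft()
        kids = children.get(leaf, [])
        if not kids:
            continue  # a leaf of the full tree has nothing to add
        if len(cut_tree) + len(kids) > max_nodes:
            break  # adding the next nodes would not fit: stop
        # Add all children at once, so every cut-tree node is either a
        # cut-tree leaf or has all of its children in the cut-tree.
        cut_tree.update(kids)
        frontier.extend(kids)
    return cut_tree
```

Adding all children of a leaf in one step is what keeps the second cut-tree condition true after every iteration.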
177
![Page 178: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/178.jpg)
Divide-TD Algorithm
[Figure: graph on nodes A to K with the Cut-Tree marked.]
178
![Page 179: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/179.jpg)
Divide-TD Algorithm
[Figure: graph on nodes A to K with the Cut-Tree and an SCC marked.]
Add a virtual node DF
179
![Page 180: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/180.jpg)
Divide-TD Algorithm
[Figure: the graph with virtual node DF divided into 𝐺0, 𝐺1, 𝐺2, 𝐺3, and 𝐺4; nodes B, C, DF, H, I, and K are labeled.]
S-Graph is a DAG
Divide the graph into 5 parts
180
![Page 181: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/181.jpg)
Merge Algorithm
According to the properties, if the DFS-Trees for the subgraphs are ,…,, there exists a DFS-Tree T with and .
We only need to organize in the merged tree such that the resulting tree is a DFS-Tree.
Since the S-graph is a DAG in the division procedure, we can topologically sort and organize according to the topological order.
Remove virtual nodes and add edges from the parent of to the children of .
It can be proven that the resulting tree is a DFS-Tree.
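The ordering and virtual-node removal steps might look like the sketch below. It uses Python's standard `graphlib.TopologicalSorter` for the topological sort; the tree representation (`children` as ordered lists), the function names, and the example S-graph edges are all hypothetical, and the sketch does not cover how each subgraph's own DFS-Tree is computed.

```python
from graphlib import TopologicalSorter


def order_subtrees(nodes, s_graph_edges):
    """Return the nodes in a topological order of the S-graph.

    Raises graphlib.CycleError if the S-graph is not a DAG.
    """
    sorter = TopologicalSorter()
    for n in nodes:
        sorter.add(n)
    for src, dst in s_graph_edges:
        sorter.add(dst, src)  # src must come before dst
    return list(sorter.static_order())


def remove_virtual_node(children, parent, virtual):
    """Splice out a virtual node: its children take its place under parent."""
    siblings = children[parent]
    pos = siblings.index(virtual)
    siblings[pos:pos + 1] = children.pop(virtual, [])


# Hypothetical example: subtree roots B, C, DF, H under root A,
# where DF is a virtual node standing for D and F.
roots = ["B", "C", "DF", "H"]
edges = [("B", "DF"), ("DF", "H"), ("C", "H")]
children = {"A": order_subtrees(roots, edges), "DF": ["D", "F"]}
remove_virtual_node(children, "A", "DF")
```

Several topological orders can exist for the same S-graph, so the exact child order under A may differ between runs of different implementations; any of them respects the S-graph edges.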
181
![Page 182: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/182.jpg)
Merge Algorithm
[Figure: 𝐺0 with virtual node DF and subgraphs 𝐺1, 𝐺2, 𝐺3; nodes B, C, DF, and H are labeled.]
Topological sort
Remove S-edges and find the DFS-Tree
182
![Page 183: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/183.jpg)
Merge Algorithm
[Figure: trees 𝑇0, 𝑇1, 𝑇2, and 𝑇3 with virtual node DF; nodes B, C, DF, and H are labeled.]
Merge trees according to the order
183
![Page 184: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/184.jpg)
Merge Algorithm
[Figure: the merged tree on nodes A to K, built from 𝑇0, 𝑇1, 𝑇2, and 𝑇3.]
184
![Page 185: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/185.jpg)
Performance Studies
Implemented using Visual C++ 2010
Tested on a PC with an Intel Core2 Quad 2.66GHz CPU and 4GB memory running Windows 7 Enterprise
Disk Block Size: 64KB
185
![Page 186: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/186.jpg)
Four Real Data Sets

| Data set | \|V\| | \|E\| | Average Degree |
|---|---|---|---|
| Wikilinks | 25,942,246 | 601,038,301 | 23.16 |
| Arabic-2005 | 22,744,080 | 639,999,458 | 28.14 |
| Twitter-2010 | 41,652,230 | 1,468,365,182 | 35.25 |
| WEBGRAPH-UK2007 | 105,895,908 | 3,738,733,568 | 35.00 |
186
![Page 187: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/187.jpg)
Web-graph Results
Memory size 2GB
Varying node size percentage
187
![Page 188: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/188.jpg)
Conclusion
We study I/O efficient DFS algorithms for a large graph.
We analyze the drawbacks of the existing semi-external DFS algorithm.
We discuss the challenges and four properties needed to find a divide & conquer approach.
Based on the properties, we design two novel graph division algorithms and a merge algorithm to reduce the cost of DFS on the graph.
We have conducted extensive performance studies to confirm the efficiency of our algorithms.
188
![Page 189: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/189.jpg)
Some Concluding Remarks
We also believe that there are many things we need to do on large graphs or big graphs.
We know what we have known on graph processing.
We do not know yet what we do not know on graph processing.
We need to explore many directions such as
parallel computing
distributed computing
streaming computing
semi-external/external computing.
189
![Page 190: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/190.jpg)
I/O Cost Minimization
If there does not exist node for that = can be removed from .
For a node , if ,can be removed from .
The I/O complexity is
190
![Page 191: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/191.jpg)
Another Example
[Figure: graph on nodes A to I. In memory: black edges. On disk: red edges. Annotation: this edge makes all nodes in a partial SCC the same order.]
Keep tree structure edges in memory.
Only the depth of reachable nodes matters, not their exact positions.
Early-acceptance: merging SCCs partially whenever possible does not affect the order of others.
Early-rejection: prune non-SCC nodes when possible. Prune the node “A”.
191
![Page 192: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/192.jpg)
Optimization: Early Acceptance
[Figure: graph on nodes A to K; two steps are marked "Up-edge: Modify Tree", each followed by "Early-Acceptance".]
No need to remember .
Merge nodes of the same order when an edge is found, where is an ancestor of in .
Smaller graph size, smaller I/O Cost
Memory:
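The early-acceptance idea can be illustrated with a small sketch: when an edge from a node to one of its tree ancestors is found, every node on the tree path between them lies on a common cycle and can be merged into one group. A union-find style `rep` map is used here as an assumed data structure; the slides do not describe the authors' actual representation, and the names `find` and `merge_up_edge` are hypothetical.

```python
def find(rep, x):
    """Return the representative of x's group, compressing the path."""
    while rep[x] != x:
        rep[x] = rep[rep[x]]
        x = rep[x]
    return x


def merge_up_edge(parent, rep, u, v):
    """Merge all nodes on the tree path from v down to u into v's group.

    Assumes v is an ancestor of u in the tree given by parent, so the
    tree path v -> ... -> u plus the edge (u, v) forms a cycle.
    Each group's representative is always its topmost tree node.
    """
    top = find(rep, v)
    node = find(rep, u)
    while node != top:
        rep[node] = top
        node = find(rep, parent[node])
```

Each merge shrinks the number of distinct groups, which matches the slide's point that a smaller graph means a smaller I/O cost.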
192
![Page 193: Jeffrey xu yu large graph processing](https://reader031.vdocuments.us/reader031/viewer/2022012923/587e4f671a28abeb1a8b5be1/html5/thumbnails/193.jpg)
Performance Studies
Vary Degree
193