the shortest path is not always a straight line
TRANSCRIPT
THE SHORTEST PATH IS NOT ALWAYS A STRAIGHT LINE
leveraging semi-metricity in large-scale graph analysis
Vasiliki Kalavri ([email protected]) KTH Royal Institute of Technology
Tiago Simas ([email protected]) Telefonica Research
Dionysios Logothetis ([email protected]) Facebook
Weighted graphs capture relationship strength
[Figure: Alice, Bob, and Max, with edges labeled "42 likes" and "3 likes"]
Edge weights can represent: distance, similarity, social proximity, rating, preference
Applications: influential nodes, optimal propagation paths, communities, recommendations
Sparsification techniques reduce the graph size and still give exact or good approximate results
G → G’, f(G) ~ f(G’)
THE METRIC BACKBONE
Reduces the graph size while maintaining relevant structure
The minimum subgraph of a weighted graph that preserves the shortest paths of the original graph
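[Figure: example graph on vertices A, B, C, D, E with edge weights 2, 3, 10, 4, 2, 1, and its metric backbone with edge weights 2, 3, 2, 1]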
WHAT CAN WE USE IT FOR?
• Exact computations
  • any algorithm that depends on the shortest paths
  • reachability, connectivity
  • betweenness centrality, closeness centrality
• Approximation
  • PageRank, random walks
  • eigenvector centrality
  • community detection, clustering
Improves community detection modularity and recommender systems accuracy
IMPACT ON LARGE-SCALE SYSTEMS
• Graph Databases
  • fewer edges => smaller path search space
• Batch Graph Processing
  • CPU and memory requirements depend on #messages
  • #messages proportional to #edges
  • fewer edges => improved analysis performance
• Graph Compression
  • fewer edges => storage reduction
BACKGROUND
SEMI-METRICITY
In a weighted graph, an edge is semi-metric if there exists a shorter indirect path between its endpoints
[Figure: example graph on vertices A, B, C, D, E with edge weights 2, 3, 10, 4, 2, 1]
CE is 1st-order semi-metric: C-D-E is a shorter 2-hop path
AD is 2nd-order semi-metric: A-B-C-D is a shorter 3-hop path
AB, BC, CD, DE are metric
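The definition above can be checked directly for a single edge. The following is a minimal sketch, not the authors' code: it runs Dijkstra between the endpoints while ignoring the direct edge, and compares the result with the edge weight. The weights assigned to AB, BC, CD, and DE are an assumption (only CE = 4 and AD = 10 can be inferred from the slides), and the helper names are hypothetical.

```python
import heapq
from collections import defaultdict

# Assumed weights for the example graph. Only CE = 4 and AD = 10 can be
# inferred from the slides; the other four assignments are a guess.
EDGES = {
    ("A", "B"): 3, ("B", "C"): 2, ("C", "D"): 1,
    ("D", "E"): 2, ("C", "E"): 4, ("A", "D"): 10,
}

def build_adjacency(edges):
    """Undirected adjacency map: vertex -> {neighbor: weight}."""
    adj = defaultdict(dict)
    for (u, v), w in edges.items():
        adj[u][v] = w
        adj[v][u] = w
    return adj

def shortest_indirect_distance(adj, u, v):
    """Dijkstra from u to v that never traverses the direct edge u-v."""
    dist = {u: 0}
    heap = [(0, u)]
    while heap:
        d, x = heapq.heappop(heap)
        if x == v:
            return d
        if d > dist.get(x, float("inf")):
            continue  # stale heap entry
        for y, w in adj[x].items():
            if {x, y} == {u, v}:
                continue  # skip the direct edge itself
            nd = d + w
            if nd < dist.get(y, float("inf")):
                dist[y] = nd
                heapq.heappush(heap, (nd, y))
    return float("inf")

def is_semi_metric(adj, u, v):
    """An edge is semi-metric if a strictly shorter indirect path exists."""
    return shortest_indirect_distance(adj, u, v) < adj[u][v]

adj = build_adjacency(EDGES)
labels = {
    edge: "semi-metric" if is_semi_metric(adj, *edge) else "metric"
    for edge in EDGES
}
```

Under the assumed weights, `labels` marks CE and AD as semi-metric and the remaining four edges as metric, matching the slide.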
BACKBONE ALGORITHM
BACKBONE CALCULATION
• Calculating the backbone:
  • find all semi-metric edges: 1 BFS per edge?
  • compute APSP and store O(N²) paths
Can we calculate or approximate the backbone without solving APSP?
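For reference, the APSP baseline can be sketched as follows. This is an illustrative single-machine version using Floyd-Warshall, chosen here for brevity and not necessarily what the authors ran; it makes the O(N²) distance table explicit. The edge weights for AB, BC, CD, and DE are assumed, and `backbone_via_apsp` is a hypothetical name.

```python
from itertools import product

# Assumed weights for the example graph. Only CE = 4 and AD = 10 can be
# inferred from the slides; the other four assignments are a guess.
EDGES = {
    ("A", "B"): 3, ("B", "C"): 2, ("C", "D"): 1,
    ("D", "E"): 2, ("C", "E"): 4, ("A", "D"): 10,
}

def backbone_via_apsp(edges):
    """Keep the edges whose weight equals the shortest-path distance."""
    vertices = sorted({x for edge in edges for x in edge})
    inf = float("inf")
    # The O(N^2) table of pairwise distances.
    dist = {(a, b): (0 if a == b else inf)
            for a, b in product(vertices, repeat=2)}
    for (u, v), w in edges.items():
        dist[u, v] = min(dist[u, v], w)
        dist[v, u] = min(dist[v, u], w)
    # Floyd-Warshall: the intermediate vertex k is the outermost loop.
    for k, i, j in product(vertices, repeat=3):
        if dist[i, k] + dist[k, j] < dist[i, j]:
            dist[i, j] = dist[i, k] + dist[k, j]
    # An edge is metric when no path is strictly shorter than the edge.
    return {edge: w for edge, w in edges.items() if w == dist[edge]}

backbone = backbone_via_apsp(EDGES)
```

On the assumed example, `backbone` keeps AB, BC, CD, and DE and drops CE and AD. The cubic running time and quadratic memory are what make this approach impractical at scale.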
ORDER OF SEMI-METRICITY
Most semi-metric edges are 1st-order semi-metric
A 3-PHASE BACKBONE ALGORITHM
1. Find 1st-order semi-metric edges: only look at triangles
Scalable & practical for large graphs
EXAMPLE
Phase 1
[Figure: the example graph before and after Phase 1; the edge of weight 4 is removed, leaving edge weights 2, 3, 10, 2, 1]
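Phase 1 on the example can be sketched as a triangle check: for each edge, look only at the common neighbors of its endpoints and test whether the 2-hop detour is shorter. This is a single-machine illustration under assumed weights (only CE = 4 and AD = 10 follow from the slides), and `first_order_semi_metric` is a hypothetical name.

```python
from collections import defaultdict

# Assumed weights for the example graph. Only CE = 4 and AD = 10 can be
# inferred from the slides; the other four assignments are a guess.
EDGES = {
    ("A", "B"): 3, ("B", "C"): 2, ("C", "D"): 1,
    ("D", "E"): 2, ("C", "E"): 4, ("A", "D"): 10,
}

def first_order_semi_metric(edges):
    """Return the edges that have a strictly shorter 2-hop path."""
    adj = defaultdict(dict)
    for (u, v), w in edges.items():
        adj[u][v] = w
        adj[v][u] = w
    found = set()
    for (u, v), w in edges.items():
        # Each common neighbor x closes a triangle u-x-v.
        for x in adj[u].keys() & adj[v].keys():
            if adj[u][x] + adj[x][v] < w:
                found.add((u, v))
                break
    return found

removed = first_order_semi_metric(EDGES)
remaining = {e: w for e, w in EDGES.items() if e not in removed}
```

Here `removed` contains only CE. AD is not caught because A and D share no neighbor, which is why it needs a later phase.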
A 3-PHASE BACKBONE ALGORITHM
1. Find 1st-order semi-metric edges: only look at triangles
2. Identify metric edges in 2-hop paths
Scalable & practical for large graphs
Most semi-metric edges have been removed
EXAMPLE
Phase 2
[Figure: the example graph after Phase 1, with edge weights 2, 3, 10, 2, 1; four edges are labeled M (metric) and one edge is marked "?" (still unlabeled)]
The lowest-weight edge of every vertex is metric
[Figure: vertices u and v, with edge weights 2, 4, 2, 1]
Any indirect path from u to v would have larger weight
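The lowest-weight rule is easy to sketch: with positive weights, any indirect path out of a vertex must start with one of its other edges, which already costs at least as much as the lowest-weight edge. The snippet below applies only this rule to the graph left after Phase 1; the weights for AB, BC, CD, and DE are assumed, and `lowest_weight_edges` is a hypothetical helper. Any further 2-hop checks the full algorithm may perform in this phase are not shown.

```python
# Example graph after Phase 1 (CE already removed). The weights of
# AB, BC, CD, and DE are an assumption; only AD = 10 follows from the slides.
EDGES_AFTER_PHASE1 = {
    ("A", "B"): 3, ("B", "C"): 2, ("C", "D"): 1,
    ("D", "E"): 2, ("A", "D"): 10,
}

def lowest_weight_edges(edges):
    """Return the lowest-weight incident edge of every vertex.

    Assumes positive weights. On ties only the first edge seen is kept,
    which is conservative: a tied edge is also metric but stays unlabeled.
    """
    best = {}  # vertex -> (weight, edge)
    for edge, w in edges.items():
        for vertex in edge:
            if vertex not in best or w < best[vertex][0]:
                best[vertex] = (w, edge)
    return {edge for _, edge in best.values()}

metric = lowest_weight_edges(EDGES_AFTER_PHASE1)
unlabeled = set(EDGES_AFTER_PHASE1) - metric
```

With these assumed weights, `metric` contains AB, BC, CD, and DE, and `unlabeled` contains only AD, which matches the four M labels and the single "?" on the slide.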
A 3-PHASE BACKBONE ALGORITHM
1. Find 1st-order semi-metric edges: only look at triangles!
2. Identify metric edges in 2-hop paths
3. Run a BFS for remaining unlabeled edges.
Scalable & practical for large graphs!
Most semi-metric edges have been removed
1%-9% edges
EXAMPLE
Phase 3
[Figure: the example graph after Phase 2, with edge weights 2, 3, 10, 2, 1 and four edges labeled M; a BFS is run for the remaining unlabeled edge]
BFS: explore paths with shorter distances only
If the BFS arrives at the target, the edge is semi-metric
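A pruned search of this kind might look like the sketch below. It is a label-correcting BFS that abandons any partial path whose length already reaches the weight of the edge being tested, and reports success as soon as it arrives at the target. This is an illustration under assumed weights (only AD = 10 follows from the slides), not the distributed implementation; `has_shorter_path` is a hypothetical name.

```python
from collections import defaultdict, deque

# Example graph after Phase 1 (CE already removed). The weights of
# AB, BC, CD, and DE are an assumption; only AD = 10 follows from the slides.
EDGES_AFTER_PHASE1 = {
    ("A", "B"): 3, ("B", "C"): 2, ("C", "D"): 1,
    ("D", "E"): 2, ("A", "D"): 10,
}

def build_adjacency(edges):
    """Undirected adjacency map: vertex -> {neighbor: weight}."""
    adj = defaultdict(dict)
    for (u, v), w in edges.items():
        adj[u][v] = w
        adj[v][u] = w
    return adj

def has_shorter_path(adj, u, v):
    """True if an indirect u-v path is strictly shorter than edge u-v."""
    limit = adj[u][v]
    best = {u: 0}  # best known distance from u
    queue = deque([u])
    while queue:
        x = queue.popleft()
        d = best[x]
        for y, w in adj[x].items():
            if x == u and y == v:
                continue  # never take the direct edge
            nd = d + w
            # Prune paths that are not shorter than the edge, or not
            # better than what is already known for y.
            if nd >= limit or nd >= best.get(y, float("inf")):
                continue
            if y == v:
                return True  # arrived at the target: semi-metric
            best[y] = nd
            queue.append(y)
    return False

adj = build_adjacency(EDGES_AFTER_PHASE1)
ad_is_semi_metric = has_shorter_path(adj, "A", "D")
```

For AD the search follows A-B-C-D with total length 6 under the assumed weights, so the edge is labeled semi-metric and dropped.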
EXAMPLE
[Figure: the resulting metric backbone on vertices A, B, C, D, E, with edge weights 2, 3, 2, 1]
Metric Backbone
DISTRIBUTED IMPLEMENTATION
code available: http://grafos.ml/okapi.html#analytics
Implementation in the vertex-centric model
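To give a feel for the vertex-centric style, here is a hypothetical single-process simulation of how the triangle check of Phase 1 could be expressed in two supersteps. It is an assumption for illustration only and does not reproduce the Okapi code or the Giraph API; the edge weights for AB, BC, CD, and DE are also assumed.

```python
from collections import defaultdict

# Assumed weights for the example graph. Only CE = 4 and AD = 10 can be
# inferred from the slides; the other four assignments are a guess.
EDGES = {
    ("A", "B"): 3, ("B", "C"): 2, ("C", "D"): 1,
    ("D", "E"): 2, ("C", "E"): 4, ("A", "D"): 10,
}

def vertex_centric_phase1(edges):
    """Simulate two supersteps of a Pregel-style triangle check."""
    adj = defaultdict(dict)
    for (u, v), w in edges.items():
        adj[u][v] = w
        adj[v][u] = w

    # Superstep 0: every vertex sends its weighted neighbor list
    # to each of its neighbors (one message per edge direction).
    inbox = defaultdict(list)
    for sender, neighbors in adj.items():
        for receiver in neighbors:
            inbox[receiver].append((sender, dict(neighbors)))

    # Superstep 1: vertex u checks each triangle u - sender - v
    # using only its own edges and the received messages.
    semi_metric = set()
    for u, messages in inbox.items():
        for sender, sender_neighbors in messages:
            for v, w_sv in sender_neighbors.items():
                if v in adj[u] and adj[u][sender] + w_sv < adj[u][v]:
                    semi_metric.add(frozenset((u, v)))
    return semi_metric

result = vertex_centric_phase1(EDGES)
```

In this simulation superstep 0 sends one message per edge direction, so message volume grows with the number of edges.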
EVALUATION
EVALUATION GOALS
• How does our algorithm compare to APSP?
• Are large, real-world graphs semi-metric?
• Can we improve graph analysis performance?
COMPARISON TO APSP
Computing APSP in Giraph
• multiple SSSPs: in the order of months for million-edge graphs
• multiple MSSPs, i.e. SSSPs from several sources in parallel: in the order of days for million-edge graphs
Our algorithm is 120-180x faster than SSSP and 11-14x faster than MSSP: order of hours for million-edge graphs
ALGORITHM PHASES
Phase 1: Very fast and scalable. Removes up to 90% of semi-metric edges
Phase 2: Moderately fast. Labels up to 60% of the unlabeled edges
Phase 3: Slow. Labels 1%-9% of the total edges
Phase 1 is the fastest and most useful phase
PHASE 1 SCALABILITY
almost linear scalability
<200s on a billion-edge graph
SEMI-METRICITY IN REAL GRAPHS
Graph        |V|    |E|    metric                    semi-metricity
Facebook     190M   49.9B  custom                    26.5%
Twitter      40M    1.5B   jaccard                   39%
Tuenti       12M    685M   jaccard                   59%
Livejournal  4.8M   34M    jaccard                   40%
NotreDame    0.3M   1.5M   jaccard, adamic           45%-29%
DBLP         318K   1M     jaccard, adamic           23%-9%
Twitter-ego  81K    1.7M   jaccard, adamic           57%-39%
Movielens    1.6K   1.9M   jaccard                   88%
Facebook     1K     143K   #messages, message size   78%-77%
US-Airports  0.5K   6K     #passengers               72%
C-Elegans    0.3K   2.3K   #connections              17%
% 1st-order semi-metric edges => reduction in memory and communication
QUERY SPEEDUP ON NEO4J
6.7x speedup
APACHE GIRAPH SPEEDUP
Including the time to calculate the backbone
4x speedup
APACHE GIRAPH SPEEDUP
6x speedup
COMMUNICATION REDUCTION
Up to 70% for highly semi-metric graphs
BEST PRACTICES
When to use the backbone?
• semi-metric weighting schemes, e.g. neighborhood similarity
• we can amortize the overhead: e.g. many algorithms on the same graph, multiple distance queries
• lossy compression is ok
When not to use the backbone?
• for metric weighting schemes
• we need to run a one-off analysis
• we need lossless compression
RECAP: MAIN CONTRIBUTIONS
• An algorithm for computing the metric backbone without solving APSP
• An open-source distributed implementation
• Graph query and graph analytics speedup on Neo4j and Apache Giraph