Shortest paths on large graphs: Systems, Algorithms, Applications
Andrey Gubichev
TU München
January 2012
Andrey Gubichev Shortest paths on large graphs 1 / 53
Outline
Introduction
Systems
Algorithms
Applications: Semantic Web, Social Search
Everything is a graph
[Figures: Internet graph (Richardson); Web graph; social network; Wikipedia (Tulip); proteins (Bordalier Inst)]
RDF: format for graph data
[Figure: RDF graph. Nodes: Marie Curie, U Paris, Warsaw, Poland, 1867, 1934, Maria Sklodowska, Nobel Prize Chemistry, Pierre Curie, Nobel Prize Physics, Henri Becquerel. Edge labels: bornIn, marriedTo, bornOn, diedOn, bornAs, in, hasWon, almamater, adviser]
RDF:
(id1, Name, "Marie Curie")
(id1, bornOn, 1867)
(id1, bornIn, id2)
(id2, Name, "Warsaw")
(id2, locatedIn, id3)
(id3, Name, "Poland")
(G.Weikum, WSDM’09)
• pay-as-you-go: schema-agnostic, schema-later
• RDF triples form ER graph
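A minimal sketch of how the triples on this slide form a labeled graph: each triple becomes one edge from subject to object, labeled with the predicate. The helper name build_graph is an illustrative assumption, not part of any RDF library.

```python
from collections import defaultdict

# Triples from the slide, written as (subject, predicate, object).
TRIPLES = [
    ("id1", "Name", "Marie Curie"),
    ("id1", "bornOn", "1867"),
    ("id1", "bornIn", "id2"),
    ("id2", "Name", "Warsaw"),
    ("id2", "locatedIn", "id3"),
    ("id3", "Name", "Poland"),
]


def build_graph(triples):
    """Return {subject: [(predicate, object), ...]}, one labeled edge per triple.

    build_graph is a hypothetical helper used only for illustration.
    """
    graph = defaultdict(list)
    for subject, predicate, obj in triples:
        graph[subject].append((predicate, obj))
    return graph


graph = build_graph(TRIPLES)
for predicate, obj in graph["id1"]:
    print("id1", predicate, obj)
```

Because no schema is declared up front, adding a new predicate only means adding new edges, which is the "schema-later" property mentioned above.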
RDF: a lot of data out there
Linked Data Project, linkeddata.org
Linked Data: extract explicit knowledge (ER-oriented facts) from the world's best information sources (Wikipedia, Web, Web 2.0)
SPARQL: a query language
Select ?c
Where
{
?p isa scientist.
?p bornIn ?t.
?p hasWon ?a.
?t locatedIn ?c.
?a Name NobelPrize.
}
• SQL-like syntax
• triple patterns
• common variables form joins
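To illustrate how triple patterns with common variables behave as joins, here is a small sketch that evaluates the query from this slide over a handful of made-up triples. The data, the function names (match, evaluate) and the nested-loop strategy are assumptions for illustration; a real engine would use indexes and a query optimizer.

```python
# Toy data: identifiers are hypothetical, chosen to fit the query on the slide.
TRIPLES = [
    ("curie", "isa", "scientist"),
    ("curie", "bornIn", "warsaw"),
    ("curie", "hasWon", "award1"),
    ("warsaw", "locatedIn", "poland"),
    ("award1", "Name", "NobelPrize"),
    ("berlin", "locatedIn", "germany"),
]


def is_var(term):
    return term.startswith("?")


def match(pattern, triple, binding):
    """Extend binding so that pattern equals triple; return None on a conflict."""
    new = dict(binding)
    for p, t in zip(pattern, triple):
        if is_var(p):
            # A variable already bound to a different value is a failed join.
            if new.setdefault(p, t) != t:
                return None
        elif p != t:
            return None
    return new


def evaluate(patterns, triples):
    """Nested-loop evaluation: shared variables act as join conditions."""
    bindings = [{}]
    for pattern in patterns:
        bindings = [
            extended
            for b in bindings
            for t in triples
            if (extended := match(pattern, t, b)) is not None
        ]
    return bindings


QUERY = [
    ("?p", "isa", "scientist"),
    ("?p", "bornIn", "?t"),
    ("?p", "hasWon", "?a"),
    ("?t", "locatedIn", "?c"),
    ("?a", "Name", "NobelPrize"),
]

countries = [b["?c"] for b in evaluate(QUERY, TRIPLES)]
print(countries)  # ['poland']
```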
SPARQL: a query language for RDF
Select ?c
Where
{
?p isa scientist.
?p bornIn ?t.
?p hasWon ?a.
?t locatedIn ?c.
?a Name NobelPrize.
?p bornOn ?b.
Filter (?b < 1900)
}
• SQL-like syntax
• triple patterns
• common variables form joins
• filter predicates
SPARQL: a query language
Select Distinct ?c
Where
{
?p ?r1 ?t.
?t ?r2 ?c.
?c isa Country.
?p bornOn ?b.
Filter (?b > 1945)
}
• SQL-like syntax
• triple patterns
• common variables form joins
• filter predicates
• wildcard joins
RDF & SPARQL Engines
Giant triples table: Sesame/OpenRDF, YARS2 (DERI)
  S    P       O
  id1  Name    Marie Curie
  id1  bornOn  1867
  id1  bornIn  id2
  id2  Name    Warsaw
  ...

Clustered property tables: Jena (HP Labs), Oracle RDF MATCH
  Person
  S    Name     bornOn  bornIn  ...
  id1  Marie C  1867    id3     ...
  id2  Henri B  1852    id9     ...
  ...
  Town
  S    Name    Country
  id3  Warsaw  id11
  ...

Property table: C-Store (MIT), MonetDB (CWI)
  bornOn
  S    O
  id1  1867
  id5  1852
  ...
  Advisor
  S    O
  id1  id5
  ...

Why a new engine?
Three main things in database design:
1. Performance
2. Performance
3. Performance
Scalable Semantic Web: RDF-3X Engine [T. Neumann et al: VLDB’08]
• tuning-free system architecture: giant triple table
• map literals into ids (dictionary) and precompute exhaustive indexing for SPO triples: SPO, SOP, OPS, OSP, PSO, POS, SP*, SO*, OS*, PO*, OP*, S*, P*, O*
  very high compression, index-only store
• directly store indexes into clustered B+ trees
• can choose any order for scan and join
• also store two mapping indexes: literal → id, id → literal
• efficient merge joins with order-preservation
Triples before the ID mapping:
  S    P       O
  id1  Name    Marie Curie
  id1  bornOn  1867
  id1  bornIn  id2
  id2  Name    Warsaw
  ...
Triples after mapping literals to IDs:
  S  P  O
  1  3  4
  1  5  6
  1  7  2
  2  3  8
  ...
The same triples in POS order:
  P  O  S
  3  4  1
  3  8  2
  5  6  1
  7  2  1
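The dictionary mapping and the permuted indexes can be sketched as follows. This is only an illustration under stated assumptions: the ID assignment is copied from the slide, a sorted Python list stands in for a clustered B+ tree, and the function names (encode, permutation_index, prefix_scan) are hypothetical rather than RDF-3X's API.

```python
import bisect

# Literal -> id dictionary, using the ID assignment shown on the slide.
DICTIONARY = {
    "id1": 1, "id2": 2, "Name": 3, "Marie Curie": 4,
    "bornOn": 5, "1867": 6, "bornIn": 7, "Warsaw": 8,
}

TRIPLES = [
    ("id1", "Name", "Marie Curie"),
    ("id1", "bornOn", "1867"),
    ("id1", "bornIn", "id2"),
    ("id2", "Name", "Warsaw"),
]


def encode(triples, dictionary):
    """Replace every literal by its integer id."""
    return [tuple(dictionary[term] for term in triple) for triple in triples]


def permutation_index(encoded, order):
    """Sorted copy of the triples with columns rearranged, e.g. order='POS'.

    The sorted list is a stand-in for a clustered B+ tree.
    """
    column = {"S": 0, "P": 1, "O": 2}
    cols = [column[c] for c in order]
    return sorted(tuple(t[c] for c in cols) for t in encoded)


def prefix_scan(index, prefix):
    """Return all index entries that start with prefix (binary search, then scan)."""
    start = bisect.bisect_left(index, prefix)
    result = []
    for entry in index[start:]:
        if entry[:len(prefix)] != prefix:
            break
        result.append(entry)
    return result


spo = encode(TRIPLES, DICTIONARY)
pos = permutation_index(spo, "POS")
print(pos)                     # [(3, 4, 1), (3, 8, 2), (5, 6, 1), (7, 2, 1)]
print(prefix_scan(pos, (3,)))  # all triples with predicate Name
```

Since every permutation is kept sorted, a pattern with any combination of bound positions can be answered by a range scan on a suitable index, and the results come out in an order that merge joins can use directly.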
RDF-3X Query Optimization [T. Neumann et al: VLDB’08]
• bottom-up dynamic programming for plan enumeration
• exploit numerous indexes, order-preservation
• cost model based on selectivity estimation
Evaluation [T. Neumann et al: SIGMOD’09]
• Queries like: find a Polish scientist with a French advisor, both of whom won awards
• YAGO knowledge base: 40 million triples
• Billion Triple dataset, UniProt (845 million): similar results
Try it out!
RDF-3X is freely available: http://code.google.com/p/rdf3x/
What is missing?
What kinds of queries CAN we answer?
• Find the latitude and longitude of the Eiffel Tower
• Find politicians who are also scientists
What kinds of queries can we NOT answer?
• Find what Angela Merkel and Arnold Schwarzenegger have in common
• Find all European-born Nobel prize winners
Why?
They require path traversals over the RDF graph.
Why is SPARQL not enough?
Sometimes we need to form join chains with unknown length (e.g., we need the transitive closure of the predicate).
Example Triples
Humboldt bornIn Berlin.
Berlin locatedIn Germany.
Example Triples
Einstein bornIn Ulm.
Ulm locatedIn Baden-Wurttemberg.
Baden-Wurttemberg locatedIn Germany.
Were they both born in Germany? Yes.
How to figure that out?
[Figure: Einstein -bornIn-> Ulm -locatedIn-> Baden-Wurttemberg -locatedIn-> Germany; Humboldt -bornIn-> Berlin -locatedIn-> Germany]
How to find all scientists that were born in Germany?
SPARQL
?person bornIn ?place. ?place locatedIn Germany.
UNION
?person bornIn ?place. ?place locatedIn ?place1. ?place1 locatedIn Germany.
UNION
...
SPARQL with paths
?person bornIn ?place. ?place ??path Germany.
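A small sketch of what the path pattern has to compute: all places connected to Germany by a locatedIn chain of any length, found here by breadth-first search over reversed locatedIn edges. The function names are illustrative assumptions; the triples are the ones from the example above.

```python
from collections import deque

TRIPLES = [
    ("Humboldt", "bornIn", "Berlin"),
    ("Berlin", "locatedIn", "Germany"),
    ("Einstein", "bornIn", "Ulm"),
    ("Ulm", "locatedIn", "Baden-Wurttemberg"),
    ("Baden-Wurttemberg", "locatedIn", "Germany"),
]


def located_in_closure(triples, target):
    """All places that reach target via one or more locatedIn edges."""
    contains = {}  # region -> places directly located in it
    for s, p, o in triples:
        if p == "locatedIn":
            contains.setdefault(o, []).append(s)
    seen, queue = set(), deque([target])
    while queue:
        region = queue.popleft()
        for place in contains.get(region, []):
            if place not in seen:
                seen.add(place)
                queue.append(place)
    return seen


def born_in(triples, target):
    """People whose birthplace lies somewhere inside target."""
    places = located_in_closure(triples, target)
    return sorted(s for s, p, o in triples if p == "bornIn" and o in places)


print(born_in(TRIPLES, "Germany"))  # ['Einstein', 'Humboldt']
```

The search does not need to know the chain length in advance, which is exactly what the UNION formulation cannot express.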
SPARQL with path variables
Introduced by K. Anyanwu et al. (WWW’07)
• Example: select ??p ?place where { ?place ??p Germany } (path triple)
• ??p: there exists a path from ?place to Germany in the RDF graph
• we consider only shortest paths
• we can specify filters (conditions) on ??p
• we can join such path patterns with regular patterns
Example
select ?name where {
  ?m type Mountain.
  ?m hasName ?name.
  ?m ??location Europe.
  filter(ContainsOnly(??location, locatedIn))
}
How to execute SPARQL with path variables? [A. Gubichev et al: WebDB’11]
We build upon RDF-3X. Two goals:
• Query Optimization: How to estimate the cardinality of path triples?
• Physical Level: How to perform a path scan efficiently?
Can we do better?
• Dijkstra’s algorithm is fine, but let’s consider approximate algorithms (trade quality for speed)
• Let’s change the setting for now: shortest paths on a social network
Social network:
• a set of people
• a social relationship linking them
Problem Statement
Exact shortest path:
• V: users, E: "friend of" relationships
• Graph G(V, E): directed, unweighted, static
• Given u, v ∈ V, find the shortest path from u to v
Approximate shortest path:
• Graph is disk-resident
• Offline step: do some precomputation, store on disk
• Online step: for u, v ∈ V quickly find some path from u to v
• Approximation error: (|approximate| − |exact|) / |exact|
Different approaches
Exact SP
• Dijkstra: very slow
• A*: works well for road networks, slow for online social networks (OSN)
• Hierarchy-based decomposition: works well for road networks, slow for OSN
Approximate SP
• Different types of preprocessing: keep distances from all nodes to a small subset of nodes (random, with high degree or centrality)
• Poor results for OSN: average error is ≥ 10%
• Find just the distance, not the path itself
Precomputation
Step 1: Set r = ⌊log |V|⌋
Step 2: Sample r + 1 sets of nodes (uniformly at random) of sizes 1, 2, 2^2, 2^3, ..., 2^r
Step 3: For every u ∈ V and for every set S
1. Find the closest nodes to u in S (landmarks):
landmark h ∈ S : dist(u, h) = dist(u, S)
landmark h′ ∈ S : dist(h′, u) = dist(S, u)
2. Find the distance from u to h and from h′ to u
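The precomputation steps can be sketched in memory as follows. This is an illustration under assumptions, not the disk-based implementation from the talk: the logarithm is taken base 2, each sampled set is processed with one multi-source BFS per direction, and all function names are hypothetical.

```python
import math
import random
from collections import deque


def closest_landmarks(adj, seeds):
    """Multi-source BFS: {node: (nearest seed, distance from that seed along adj)}."""
    best = {s: (s, 0) for s in seeds}
    queue = deque(seeds)
    while queue:
        v = queue.popleft()
        landmark, dist = best[v]
        for w in adj.get(v, ()):
            if w not in best:
                best[w] = (landmark, dist + 1)
                queue.append(w)
    return best


def reverse(adj):
    """Graph with every edge flipped."""
    rev = {}
    for u, neighbours in adj.items():
        rev.setdefault(u, [])
        for v in neighbours:
            rev.setdefault(v, []).append(u)
    return rev


def precompute(adj, rng):
    """Return (to_landmark, from_landmark) sketches for every node.

    to_landmark[u]   holds pairs (h, dist(u, h))
    from_landmark[u] holds pairs (h', dist(h', u))
    """
    nodes = sorted(set(adj) | {v for ns in adj.values() for v in ns})
    r = math.floor(math.log2(len(nodes)))  # Step 1 (base 2 is an assumption)
    rev = reverse(adj)
    to_landmark = {u: [] for u in nodes}
    from_landmark = {u: [] for u in nodes}
    for i in range(r + 1):                 # Step 2: sets of size 1, 2, ..., 2^r
        seeds = rng.sample(nodes, 2 ** i)
        # Step 3: BFS on reversed edges yields dist(u, h) in the original graph.
        for u, entry in closest_landmarks(rev, seeds).items():
            to_landmark[u].append(entry)
        for u, entry in closest_landmarks(adj, seeds).items():
            from_landmark[u].append(entry)
    return to_landmark, from_landmark


# Tiny directed cycle as example input.
example = {0: [1], 1: [2], 2: [3], 3: [0]}
to_l, from_l = precompute(example, random.Random(0))
print(to_l[0], from_l[0])
```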
Precomputation - WSDM’10 approach [A. Das Sarma et al: WSDM’10]
[Figure: node u with distances 2, 3, ..., 1 to its landmarks h1 ∈ S1, h2 ∈ S2, ..., hr ∈ Sr]
Sketch in RDF:
〈u〉 〈2〉 〈h1〉
〈u〉 〈3〉 〈h2〉
· · ·
〈u〉 〈1〉 〈hr〉
Precomputation - our approach [A. Gubichev et al: CIKM’10]
[Figure: node u with paths through intermediate nodes x, y, ... to its landmarks h1 ∈ S1, h2 ∈ S2, ..., hr ∈ Sr]
Sketch in RDF:
〈u〉 〈x〉 〈h1〉
〈u〉 〈x y〉 〈h2〉
· · ·
〈u〉 〈 〉 〈hr〉
Precomputation
Step 1: Set r = ⌊log |V|⌋
Step 2: Sample r + 1 sets of nodes (uniformly at random) of sizes 1, 2, 2^2, 2^3, ..., 2^r
Step 3: For every u ∈ V and for every set S
1. Find the closest nodes to u in S (landmarks):
landmark h ∈ S : dist(u, h) = dist(u, S)
landmark h′ ∈ S : dist(h′, u) = dist(S, u)
2. Find the path from u to h and from h′ to u
3. Store the paths (RDF): 〈u〉 〈path〉 〈h〉, 〈h′〉 〈path′〉 〈u〉
Step 4: Repeat Steps 2-3 k times (we use k = 2).
Sketch
Sketch for a node u consists of
1. Landmarks h_1, ..., h_{kr}
2. Paths from u to landmarks
3. Paths from landmarks to u
Sketch for u consists of two trees (u is the root)
We keep sketches for every u ∈ V
SKETCH algorithm: online part [A. Das Sarma et al: WSDM’10]
Input: nodes s, d ∈ V
1. Load all the distances from s
2. Load all the distances to d
3. Find common landmarks
4. Construct the paths
5. Select the shortest distance
Output: distance from s to d
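A minimal sketch of the online step for distances: intersect the landmarks stored for s and d and take the smallest sum. The input format (lists of landmark and distance pairs), the function name and the numbers are assumptions for illustration.

```python
def estimate_distance(sketch_from_s, sketch_to_d):
    """Estimate dist(s, d) through common landmarks.

    sketch_from_s: list of (h, dist(s, h))
    sketch_to_d:   list of (h, dist(h, d))
    Returns the minimum of dist(s, h) + dist(h, d) over common landmarks h,
    or None when the two sketches share no landmark.
    """
    to_landmark = {}
    for h, dist in sketch_from_s:
        to_landmark[h] = min(dist, to_landmark.get(h, dist))
    best = None
    for h, dist in sketch_to_d:
        if h in to_landmark:
            total = to_landmark[h] + dist
            if best is None or total < best:
                best = total
    return best


# Illustrative values: two common landmarks, h3 is known only to d.
s_sketch = [("h1", 3), ("h2", 2)]
d_sketch = [("h1", 4), ("h2", 3), ("h3", 1)]
print(estimate_distance(s_sketch, d_sketch))  # 5
```

The estimate is always an upper bound on the true distance, because each candidate corresponds to a real path through a landmark.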
[Figure: s and d linked through common landmarks, with distance labels 3, 4, 2, 3]
SKETCH algorithm with paths [A. Gubichev et al: CIKM’10]
Input: nodes s, d ∈ V
1. Load all the paths from s
2. Load all the paths to d
3. Find common landmarks
4. Construct the paths
5. Select the shortest path
Output: path from s to d: 〈s x y h z d〉
[Figure: path from s to d through the nodes x, y, h, z]
Datasets
• Slashdot: 77K nodes, undirected
• YouTube: 1.1M nodes
• Flickr: 1.7M nodes
• WikiTalk: 2.2M nodes
• Twitter: 2.4M nodes
• Orkut: 3M nodes, undirected
Sources: Stanford, MPI, Telefonica Research
Approximation error of the Sketch algorithm
Error = (|approximate| − |exact|) / |exact|
Dataset (#nodes)   Sketch error
Slashdot (77K)     46%
YouTube (1.1M)     30%
Flickr (1.7M)      28%
WikiTalk (2.2M)    55%
Twitter (2.4M)     51%
Orkut (3M)         71%
First modification
We find the path, not just the distance!
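Are there cycles?
Construct a shorter path
[Figure: a path from s to d that passes through node a twice is shortened to a path that passes through a once]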
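A sketch of the cycle check: a path glued together from two sketch paths may visit a node twice, and everything between the two visits can be dropped. The function name and the sample path are illustrative assumptions.

```python
def remove_cycles(path):
    """Cut out every loop: when a node reappears, drop everything since its first visit."""
    result = []
    position = {}  # node -> index in result
    for node in path:
        if node in position:
            cut = position[node]
            for dropped in result[cut + 1:]:
                del position[dropped]
            del result[cut + 1:]
        else:
            position[node] = len(result)
            result.append(node)
    return result


# s reaches the landmark h via a, and h reaches d via a again.
print(remove_cycles(["s", "a", "x", "h", "y", "a", "d"]))  # ['s', 'a', 'd']
```

Each node is appended and dropped at most once, so the check is linear in the path length.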
Approximation error of the first modification
No time overhead!
Dataset (#nodes)   Sketch error   Sketch I error
Slashdot (77K)     46%            26%
YouTube (1.1M)     30%            12%
Flickr (1.7M)      28%            11%
WikiTalk (2.2M)    55%            31%
Twitter (2.4M)     51%            38%
Orkut (3M)         71%            48%
Second modification
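Are there any "hidden" connections?
If yes, construct a shorter path
[Figure: two nodes on the path from s to d may be directly connected ("?"); the direct edge yields a shorter path from s to d]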
Second modification
How to check it?
1. For every node in the path, load the list of friends from the original dataset
2. For every pair of nodes from the path, check whether they are friends
The number of nodes in the path is usually small!
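One way to use these checks is sketched below: run a BFS that is restricted to the nodes of the path but uses the friend lists from the original graph, so any direct edge between two path nodes becomes a shortcut. Running a BFS over the pairs is an assumption about how to combine them; the function name and the toy friend lists are hypothetical.

```python
from collections import deque


def shorten_with_shortcuts(path, friends):
    """Shortest route from path[0] to path[-1] using only nodes of path.

    friends maps a node to the set of nodes it points to in the original graph.
    Falls back to the input path if the friend lists do not connect the ends.
    """
    on_path = set(path)
    source, target = path[0], path[-1]
    parent = {source: None}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if u == target:
            break
        for v in friends.get(u, ()):
            if v in on_path and v not in parent:
                parent[v] = u
                queue.append(v)
    if target not in parent:
        return list(path)
    route = []
    node = target
    while node is not None:
        route.append(node)
        node = parent[node]
    return route[::-1]


# The sketch path s x y h z d, plus a hidden edge x -> z in the original graph.
friends = {"s": {"x"}, "x": {"y", "z"}, "y": {"h"}, "h": {"z"}, "z": {"d"}}
print(shorten_with_shortcuts(["s", "x", "y", "h", "z", "d"], friends))
# ['s', 'x', 'z', 'd']
```

Because the path has only a handful of nodes, the extra work is a few friend-list lookups per query.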
Approximation error of the second modification
Dataset (#nodes)   Sketch error   Sketch I error   Sketch II error
Slashdot (77K)     46%            26%              0.6%
YouTube (1.1M)     30%            12%              0.6%
Flickr (1.7M)      28%            11%              0.3%
WikiTalk (2.2M)    55%            31%              0.2%
Twitter (2.4M)     51%            38%              0.8%
Orkut (3M)         71%            48%              0.6%
Tree algorithm
Paths from a node to landmarks form a tree
[Figure: tree rooted at s whose paths end in landmarks]
Tree algorithm
• Load paths from s and to d
• Start BFS from s and d
• For every visited node, load a list of friends
• For every pair of visited nodes, check:
  1. are they equal? (s3, d1)
  2. are they friends? (s1, d)
• Form a new path and put it into the queue Q
• Don’t go too deep: terminate if level_s + level_d > Q.top.length
[Figure: sketch trees rooted at s (nodes s1, ..., s5) and at d (nodes d1, ..., d4); s3 and d1 are the same node and s1 is a friend of d; the search stops once level_s + level_d = 4 > 2]
Approximation error of the Tree algorithm
Dataset    Sketch error   Sketch I error   Sketch II error   Tree error
Slashdot   46%            26%              0.6%              0
YouTube    30%            12%              0.6%              0.06%
Flickr     28%            11%              0.3%              0.04%
WikiTalk   55%            31%              0.2%              0
Twitter    51%            38%              0.8%              0.03%
Orkut      71%            48%              0.6%              0.1%
Experimental setup
• Pick 100 nodes (uniformly at random) from the OSN.
• For each node, compute the shortest-path tree (Dijkstra)
• The result is {(x, y, dist) | x, y ∈ V, dist = dist(x, y)}
• Group triples by distance and randomly choose 50 triples from every group
• For every chosen triple (x, y, dist): find approximate shortest paths from x to y and compare their lengths with dist
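The sampling procedure can be sketched as follows. As an assumption, plain BFS replaces Dijkstra (the graphs are unweighted, so both give the same distances), and the function names and default parameters are illustrative.

```python
import random
from collections import defaultdict, deque


def bfs_distances(adj, source):
    """Exact hop distances from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def sample_test_pairs(adj, nodes, rng, n_sources=100, per_group=50):
    """Return {distance: sampled list of (x, y, dist)} following the setup above."""
    groups = defaultdict(list)
    for x in rng.sample(nodes, min(n_sources, len(nodes))):
        for y, d in bfs_distances(adj, x).items():
            if d > 0:
                groups[d].append((x, y, d))
    return {d: rng.sample(ts, min(per_group, len(ts))) for d, ts in groups.items()}


def approximation_error(approx_length, exact_length):
    """(|approximate| - |exact|) / |exact|, the error measure used in the talk."""
    return (approx_length - exact_length) / exact_length
```

Grouping by distance before sampling keeps rare long-distance pairs in the test set; sampling pairs directly would be dominated by the most common distances.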
Implementation details
• Datasets in RDF: 〈user1〉 〈friend-of〉 〈user2〉
• Precomputed paths in RDF:
〈u〉 〈path〉 〈h〉
〈h′〉 〈path′〉 〈u〉
• RDF-3X for datasets and precomputed data
• C++
• Laptop: 2.0 GHz Intel Core 2 Duo, 4 GB RAM, 3 MB L2 cache
Time
Dataset (#nodes)   Sketch (sec)   Sketch II (sec)   Tree (sec)   Dijkstra (sec)   Dijkstra (queue)
Flickr (1.7M)      1.2            2.1               1.9          73               696K
WikiTalk (2.2M)    0.7            1.4               1.7          101              2M
Twitter (2.4M)     1.9            3.9               4.0          119              1.1M
Orkut (3M)         1.1            2.6               2.7          503              2.5M
Disk space
Disk space for precomputed data, GB
Dataset    Dataset size   Sketch with distances   Sketch with paths
Flickr     0.57           2.3                     4.4
WikiTalk   0.22           1.9                     2.1
Twitter    1.3            3.4                     6.1
Orkut      5.6            6.0                     7.4
Number of shortest paths
We find several shortest paths:
Dataset (#nodes)   Sketch II   Tree
Flickr (1.7M)      33.3        55.6
WikiTalk (2.2M)    18.6        50.7
Twitter (2.4M)     45.5        92
Orkut (3M)         9.5         30
Application #1: Semantic Web
• SPARQL 1.1: SPARQL + path traversal
• Querying the DB of entire human knowledge (everything that Wikipedia knows)
Small World
Milgram 1967
• People are given letters and asked to forward them to one friend
• Source: random residents of Omaha; Target: a stockbroker in Sharon, MA
• Completed chains averaged 6 hops to reach the target
Shortest paths on Social Networks
Shortest paths are interesting...
• per se:
  • what is the distance between you and Angela Merkel?
  • for geeks: Erdős number
• as an important primitive for
  • social network analysis (diameter, centrality, etc.)
  • social search
• Of course, we can do one-to-many shortest paths algo
[Figure: social search example. John searches for Mary. Ranking: 1. Mary A, 2. Mary B, 3. Mary C. Source: M. Potamias et al. CIKM 2009]
Acknowledgements
• Srikanta Bedathur
• Gerhard Weikum
• Josep M. Pujol
• Thomas Neumann
• Sihem Amer-Yahia