Compressing Social Networks The Minimum Logarithmic Arrangement Problem Chad Waters School of Computing Clemson University [email protected] March 4, 2013

Page 1: Compressing Social Networks - Clemson University

Compressing Social Networks: The Minimum Logarithmic Arrangement Problem

Chad Waters

School of Computing, Clemson University

[email protected]

March 4, 2013

Introduction Compression Schemes Orderings Heuristic Conclusion

Motivation

Determine the extent to which social networks can be compressed.

As with the Web graph, typical queries seek the neighbors of a node.

Need a model that facilitates efficient adjacency queries.

Social networks are not random graphs.

Exhibit distinctive local properties, such as degree sequences.

Compressibility ≈ “Randomness”

C. Waters Compressing Social Networks

Web Graphs

Web graphs are a special case of a social network.

Known to be highly compressible.

Exploit lexicographic locality: order URLs naturally.

Proximal pages have similar neighborhoods.

Lexicographic ordering produces the following properties:

Similarity: Proximal pages have similar sets of neighbors.

Locality: Many links are in the same domain, and share neighbors.

No obvious natural ordering for general social networks.

Web Graphs

Need to leverage a different feature.

Lexicographic ordering not reasonable for true social networks.

Need an ordering that rewards link reciprocity.

Boldi-Vigna Compression Scheme

Incorporates three main ideas

1. If many nodes have similar neighborhoods, express a neighborhood in terms of a node with a similar neighborhood.

2. If destinations of edges exhibit locality, small integers can be used to encode them.

3. Use gap encodings to store a sequence of edge destinations.
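
The second and third ideas can be illustrated with a small Python sketch. The helper names `gaps` and `bits` are hypothetical, and counting the bits of the binary expansion is only a rough stand-in for whatever integer code an actual implementation uses.

```python
def gaps(source, neighbors):
    """Turn an out-neighbor list into gaps: the first gap is relative to the
    source node, every later gap is relative to the previous neighbor."""
    result = []
    prev = source
    for v in sorted(neighbors):
        result.append(v - prev)  # the first gap can be negative if v < source
        prev = v
    return result


def bits(x):
    """Bits in the binary expansion of |x| (at least 1): a rough cost proxy."""
    return max(1, abs(x).bit_length())


neighbors = [1000, 1002, 1003, 1010]
print(gaps(998, neighbors))                       # [2, 2, 1, 7]
print(sum(bits(v) for v in neighbors),            # cost of absolute IDs
      sum(bits(g) for g in gaps(998, neighbors)))  # cost of gaps: 40 8
```

When the neighbors of a node sit close to it in the ordering, the gaps are small integers, which is why the ordering matters so much for this family of schemes.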

Boldi-Vigna Compression Scheme

BV scheme for compressing Web graphs.

Nodes are Web pages, edges are hyperlinks between them.

Web pages are ordered lexicographically by URL.

Let w be a window parameter (BV recommends w = 8)

Boldi-Vigna Compression Scheme

Consider a Web page v. First, a copy phase:

Examine the list out(v) of v’s out-neighbors.

Check if it is similar to one of the w − 1 preceding Web pages.

Call this prototype page u, if it exists.

Followed by an encoding phase:

If we have a prototype, encode the offset with log w bits.

Then write the changes between out(v) and out(u).

Otherwise, write log w zeroes, followed by the list out(v).

Benefits from a natural ordering with locality on the Web pages.
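
The copy and encoding phases might be sketched as follows. This is a simplified illustration, not the actual BV implementation: the function name `encode_node`, the record layout, and the choice of "largest overlap" as the similarity test are all assumptions made for the example. Nodes are assumed to be integers 0..n-1 in the chosen ordering, with sorted out-neighbor lists.

```python
def encode_node(v, out, w=8):
    """Encode out[v] against the best of the w-1 preceding nodes.

    Returns (offset, copy_bits, extras):
      offset    -- 0 if no prototype was found, else v - u for prototype u
      copy_bits -- one flag per element of out[u]: is it also in out[v]?
      extras    -- sorted neighbors of v that are not copied from u
    """
    target = set(out[v])
    best_u, best_shared = None, 0
    # Assumption: "similar" means "shares the most out-neighbors".
    for u in range(max(0, v - (w - 1)), v):
        shared = len(target & set(out[u]))
        if shared > best_shared:
            best_u, best_shared = u, shared
    if best_u is None:
        # No prototype: offset 0, followed by the full list out(v).
        return 0, [], sorted(target)
    copy_bits = [x in target for x in out[best_u]]
    extras = sorted(target - set(out[best_u]))
    return v - best_u, copy_bits, extras


out = [[1, 2, 3], [2, 3, 4], [7]]
print(encode_node(1, out))  # (1, [False, True, True], [4])
print(encode_node(2, out))  # (0, [], [7])
```

The offset is always smaller than w, which is why log w bits suffice for it.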

Boldi-Vigna Compression Scheme

This scheme has two nice properties:

1. It depends only on the locality given by a canonical ordering.

2. Adjacency queries can be served extremely quickly.

To answer a query, decode the neighbors backwards through the prototype chain.
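
A minimal sketch of such a query, assuming each node is stored as a hypothetical record `(offset, copy_bits, extras)` where `offset` is 0 when there is no prototype and `copy_bits` refers to the prototype's sorted neighbor list. The record layout is an assumption for illustration, not the on-disk format of any real library.

```python
def decode(v, records, cache=None):
    """Recover the sorted out-neighbors of v by walking the prototype chain."""
    if cache is None:
        cache = {}
    if v in cache:
        return cache[v]
    offset, copy_bits, extras = records[v]
    result = list(extras)
    if offset:
        # The prototype must be decoded first; it may itself have a prototype.
        prototype = decode(v - offset, records, cache)
        result += [x for x, keep in zip(prototype, copy_bits) if keep]
    cache[v] = sorted(result)
    return cache[v]


records = [
    (0, [], [1, 2, 3]),               # node 0: stored explicitly
    (1, [False, True, True], [4]),    # node 1: copies 2, 3 from node 0
    (1, [True, False, True], [5]),    # node 2: copies 2, 4 from node 1
]
print(decode(2, records))  # [2, 4, 5]
```

The cost of a query grows with the length of the chain, so bounding that length keeps queries fast.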

Backlinks Compression Scheme

A modification to the BV compression scheme leveraging observed properties of social networks, specifically link reciprocity.

Backlinks Compression Scheme

Consider a node v. First, a copy phase:

Again, encode v as changes between it and a prototype u.

If no prototype can be found, no copying occurs.

One bit is added for each element of out(u), indicating whether it is also in out(v).

Then we record the residual edges:

Let $v_1 \le \dots \le v_k$ be the out-neighbors of v.

Encode the gaps: $|v_1 - v|, |v_2 - v_1|, \dots, |v_k - v_{k-1}|$.

Finally, remove the redundancy of reciprocal edges:

Encode the reciprocal out-neighbors of v.

For each v′ ∈ out(v) such that v′ > v, discard (v′, v) if v ∈ out(v′) (the edge is reciprocal).
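
The residual and reciprocal steps could look roughly like this in Python. The sketch leaves out the copy phase entirely, and the function name `backlinks_records` and its record layout are hypothetical. It keeps the first gap signed, on the assumption that a decoder needs to know on which side of v the first neighbor lies.

```python
def backlinks_records(out):
    """out: dict mapping each node (an integer position in the ordering)
    to the set of its out-neighbors. Returns node -> (gaps, reciprocal)."""
    remaining = {v: set(ns) for v, ns in out.items()}
    records = {}
    for v in sorted(out):
        nbrs = sorted(remaining[v])
        reciprocal = []
        for x in nbrs:
            is_recip = x > v and v in out.get(x, ())
            reciprocal.append(is_recip)
            if is_recip:
                # (x, v) is implied by the flag on (v, x), so drop it from x.
                remaining[x].discard(v)
        gaps = [b - a for a, b in zip([v] + nbrs, nbrs)]
        records[v] = (gaps, reciprocal)
    return records


out = {0: {1, 2}, 1: {0, 2}, 2: {0}}
for node, record in backlinks_records(out).items():
    print(node, record)
```

In this toy graph only three of the five directed edges are stored explicitly; the other two are recovered from the reciprocity flags.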

Backlinks Compression Scheme

This method potentially outperforms BV in terms of compression.

Redundant edges are compressed out.

Adjacency queries may be slower than with BV:

BL does not bound the length of prototype chains.

If performance degrades, a limit can be implemented.

Compression-Friendly Ordering

Performance of both the Boldi-Vigna and Backlinks compression schemes depends heavily on the ordering of nodes. We wish to find an ordering that is practical for graphs that may not admit an obvious natural ordering.

We wish to find an ordering that provides both locality and similarity, just as lexicographic ordering gave us with the Web graphs. This leads us to the minimum logarithmic arrangement problem.

MinLogA Problem

MinLogA

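Find a permutation $\pi : V \to [n]$ such that

$$\sum_{(u,v) \in E} \log |\pi(u) - \pi(v)|$$

is minimized.

The cost $\log |\pi(u) - \pi(v)|$ is roughly the number of bits needed to write down the distance between u and v, so an ordering that minimizes this sum represents the information-theoretic optimal ordering.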
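
As a concrete reading of the objective, the following sketch evaluates the MinLogA cost of a given ordering and finds an optimum by exhaustive search. The names `minloga_cost` and `brute_force_minloga` are hypothetical, base-2 logarithms are an assumption, self-loops are assumed absent, and the brute force is only usable on a handful of nodes.

```python
import math
from itertools import permutations


def minloga_cost(edges, pi):
    """Sum of log2 |pi(u) - pi(v)| over all edges (no self-loops)."""
    return sum(math.log2(abs(pi[u] - pi[v])) for u, v in edges)


def brute_force_minloga(nodes, edges):
    """Try every permutation of positions 1..n; feasible only for tiny graphs."""
    positions = range(1, len(nodes) + 1)
    best = min(permutations(positions),
               key=lambda p: minloga_cost(edges, dict(zip(nodes, p))))
    return dict(zip(nodes, best))


triangle = [("a", "b"), ("b", "c"), ("a", "c")]
print(minloga_cost(triangle, {"a": 1, "b": 2, "c": 3}))  # 1.0

path = [("a", "b"), ("b", "c"), ("c", "d")]
best = brute_force_minloga(["a", "b", "c", "d"], path)
print(minloga_cost(path, best))  # 0.0
```

A path can be laid out so that every edge has length 1, which gives cost 0; a triangle cannot, because one of its edges must span at least two positions.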

MinLogA Problem

Replacing the cost in this objective with simply |π(u) − π(v)|, we get the minimum linear arrangement problem (MinLinA).

Known to be NP-hard.

Not much is known about its approximability.

The best known algorithm achieves an approximation ratio of $O(\sqrt{\log n}\,\log\log n)$.

Ambühl et al. rule out the existence of a PTAS.

It is a fundamentally different problem from MinLogA: in the general case, a solution to one is not necessarily a solution to the other.

MinLogGapA Problem

It is always more efficient to compress the gaps induced by the neighbors of a node.

Suppose $u < v_1 < v_2$ and $(u, v_1), (u, v_2) \in E$.

Compressing $v_1 - u$ and $v_2 - v_1$ is always less expensive than compressing $v_1 - u$ and $v_2 - u$, since $v_2 - v_1 < v_2 - u$.

Therefore, the problem we really want to solve is:

MinLogGapA

Find a permutation $\pi : V \to [n]$ such that

$$\sum_{u \in V} \sum_{i=1}^{k} \log |\pi(u_i) - \pi(u_{i-1})|$$

is minimized, where $u_0 = u$ and $\{u_1, \dots, u_k\} = \mathrm{out}(u)$, indexed in increasing order of $\pi$.
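
The difference between the two objectives shows up even on a single node with two out-neighbors. The sketch below evaluates the gap-based cost; `minloggapa_cost` is a hypothetical name, base-2 logarithms are an assumption, and out-neighbors are visited in increasing order of position.

```python
import math


def minloggapa_cost(out, pi):
    """Sum over nodes of log2 of the gaps between consecutive out-neighbors,
    with the first gap measured from the node itself."""
    total = 0.0
    for u, nbrs in out.items():
        prev = pi[u]
        for p in sorted(pi[v] for v in nbrs):
            total += math.log2(abs(p - prev))
            prev = p
    return total


out = {"u": ["v1", "v2"], "v1": [], "v2": []}
pi = {"u": 1, "v1": 5, "v2": 9}

# Gap cost: log2(5 - 1) + log2(9 - 5)
print(minloggapa_cost(out, pi))                              # 4.0
# Edge-length cost of the same ordering: log2(5 - 1) + log2(9 - 1)
print(sum(math.log2(abs(pi["u"] - pi[v])) for v in out["u"]))  # 5.0
```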

Other Orderings

Lindstrom suggests a few more ordering metrics:

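p-sum (equivalent to MinLinA when p = 1):

$$\min_{\pi} \sigma_p, \qquad \sigma_p = \lim_{q \to p} \Big( \sum_{uv \in E} |\pi(u) - \pi(v)|^q \Big)^{1/q}$$

p-mean:

$$\min_{\pi} \mu_p, \qquad \mu_p = \lim_{q \to p} \Big( \frac{1}{|E|} \sum_{uv \in E} |\pi(u) - \pi(v)|^q \Big)^{1/q}$$

which, by way of the geometric mean, is the minimum edge product problem for $\mu_0$:

$$\mu_0 = \exp\Big( \frac{1}{|E|} \sum_{uv \in E} \log |\pi(u) - \pi(v)| \Big) = \Big( \prod_{uv \in E} |\pi(u) - \pi(v)| \Big)^{1/|E|}$$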
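
A quick numerical check of the p = 0 case, written as a sketch with hypothetical helper names `p_mean` and `mu0`. It takes a list of edge lengths |π(u) − π(v)| and compares the exponential of the mean logarithm, the geometric mean, and the p-mean for a very small q.

```python
import math


def p_mean(lengths, q):
    """Power mean of the edge lengths for a nonzero exponent q."""
    return (sum(d ** q for d in lengths) / len(lengths)) ** (1.0 / q)


def mu0(lengths):
    """Limit of the p-mean as q -> 0, written as exp of the mean logarithm."""
    return math.exp(sum(math.log(d) for d in lengths) / len(lengths))


lengths = [1, 2, 4, 8]
geometric_mean = math.prod(lengths) ** (1 / len(lengths))

print(mu0(lengths), geometric_mean, p_mean(lengths, 1e-6))
```

Because the exponential is monotone, minimizing μ0 over orderings selects the same orderings as minimizing the sum of log edge lengths, which is the MinLogA objective.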

Shingle Ordering Heuristic

We wish to find a practical heuristic for both MinLogA and MinLogGapA. It is based on the concept of obtaining a “fingerprint” of the out-neighbors of each node and sorting on this fingerprint.

Jaccard Coefficient

A notion of similarity between two sets, defined as

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

Shingle Ordering Heuristic

Let σ be a random permutation of the elements in A ∪ B, and let $M_\sigma(A) = \sigma^{-1}(\min_{a \in A} \sigma(a))$, the first element of A in the order given by σ. Then

$$\Pr[M_\sigma(A) = M_\sigma(B)] = \frac{|A \cap B|}{|A \cup B|} = J(A, B)$$
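
The identity holds because the first element of A ∪ B under σ is equally likely to be any element of the union, and the two minima agree exactly when that element lies in A ∩ B. A small Monte Carlo sketch can confirm this empirically; the function names, trial count, and seed below are arbitrary choices for the example.

```python
import random


def jaccard(a, b):
    """Jaccard coefficient of two non-empty sets."""
    return len(a & b) / len(a | b)


def min_element(s, rank):
    """M_sigma(s): the element of s that comes first under the permutation."""
    return min(s, key=rank.__getitem__)


def estimate_collision(a, b, trials=20000, seed=0):
    """Fraction of random permutations under which both sets share a minimum."""
    rng = random.Random(seed)
    universe = sorted(a | b)
    hits = 0
    for _ in range(trials):
        order = universe[:]
        rng.shuffle(order)
        rank = {x: i for i, x in enumerate(order)}
        hits += min_element(a, rank) == min_element(b, rank)
    return hits / trials


A, B = {1, 2, 3, 4}, {3, 4, 5, 6}
print(jaccard(A, B))             # exact value: 2/6
print(estimate_collision(A, B))  # empirical estimate, close to the above
```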

Shingle Ordering Heuristic

Treat the out-neighbors of each u ∈ V as a set.

Compute the shingle Mσ(out(u)) for a suitable σ.

Nodes in V can now be ordered by their shingles.

If two nodes share a large number of out-neighbors, with high probability, they will have the same shingle and will be near each other in the shingle ordering.

Therefore, shingle ordering maintains the properties of locality and similarity.
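
One way to realize this heuristic is sketched below. As an assumption for illustration, a salted SHA-256 hash stands in for the random permutation σ; the function names, the salt, the tie-breaking by node ID, and the placement of nodes without out-neighbors at the end are choices made for this example rather than details taken from the talk.

```python
import hashlib


def sigma(x, salt="s1"):
    """Stand-in for a random permutation: rank elements by a salted hash."""
    return hashlib.sha256(f"{salt}:{x}".encode()).hexdigest()


def shingle(neighbors, salt="s1"):
    """M_sigma(neighbors): the neighbor that comes first under sigma."""
    if not neighbors:
        return None
    return min(neighbors, key=lambda x: sigma(x, salt))


def shingle_order(out, salt="s1"):
    """Sort nodes by the sigma-rank of their shingle."""
    def key(u):
        s = shingle(out[u], salt)
        # Nodes without out-neighbors go last; ties are broken by node ID.
        return (s is None, "" if s is None else sigma(s, salt), u)
    return sorted(out, key=key)


out = {0: [10, 11, 12], 1: [50], 2: [10, 11, 12], 3: []}
print(shingle_order(out))
```

Nodes 0 and 2 have identical out-neighbor sets, so they receive the same shingle and end up next to each other regardless of where node 1 lands.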

Compression Performance

Shingle ordering and double-shingle ordering perform better than almost all baseline orderings:

Random order

Natural order - The most obvious ordering. For Web and host graphs, this is lexicographic order on URL. For social networks, crawl order.

DFS/BFS order - Performs almost as poorly as random ordering.

Gray code order - v precedes w if the row of v precedes the row of w in the binary adjacency matrix, according to the canonical Gray code.

Geographic order - Liben-Nowell et al. show that ∼70% of social network edges “arise from geographical proximity”.

Why does Shingle Ordering Work?

For most graphs, we increase the number of small gaps, which can be encoded succinctly. The notable exception is with the LiveJournal graph, where the crawl ordering performed marginally better.

Minimizing the size of the gaps and increasing the number of small gaps (within some threshold window) equates to a performance boost for both BV and BL.

Latent Incompressibility

Why might a graph be hard to compress?

Which compresses better: dense or sparse portions of a graph?

The low-degree nodes in social networks are to blame.

We can compress k-cores separately (down to virtual nodes).
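
To separate dense from sparse portions, one option is the standard peeling procedure for core numbers, sketched here on an undirected view of the graph. The talk does not say how the k-cores are obtained or how virtual nodes are formed, so this only illustrates the decomposition step; `core_numbers` is a hypothetical name and the quadratic-time loop favors clarity over speed.

```python
def core_numbers(adj):
    """adj: dict node -> set of neighbors (undirected, no self-loops).
    Returns node -> core number, by repeatedly removing a minimum-degree node."""
    degree = {v: len(ns) for v, ns in adj.items()}
    remaining = set(adj)
    core = {}
    k = 0
    while remaining:
        v = min(remaining, key=degree.__getitem__)
        k = max(k, degree[v])   # core numbers never decrease during peeling
        core[v] = k
        remaining.remove(v)
        for w in adj[v]:
            if w in remaining:
                degree[w] -= 1
    return core


# A triangle a-b-c with a pendant node d attached to a.
adj = {"a": {"b", "c", "d"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": {"a"}}
print(core_numbers(adj))
```

The pendant node gets core number 1 and the triangle gets core number 2, so low-degree nodes fall out of the dense cores early.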

Conclusion

The analogous structure of Web graphs and social networks is unmistakable. Social networks do not, however, admit a natural ordering that consistently performs as well as lexicographic ordering does for Web graphs. Shingle ordering does beat most other baseline orderings, and highlights the value of link reciprocity.
