Sampling Massive Online Graphs: Challenges, Techniques, and Applications to Facebook
Maciej Kurant (UC Irvine)
Joint work with:
Minas Gjoka (UC Irvine), Athina Markopoulou (UC Irvine),
Carter T. Butts (UC Irvine), Patrick Thiran (EPFL).
14 Nov, 2011, KTH
Why study Online Social Networks (OSNs)?
Engineering
• Search engine accuracy
• Better spam filters
• Efficient data centers
• New apps/Third party services
• Offload 3G operators
• …
Social Media
• Predict the spread and importance of information
• Social filters
• …
Social Sciences
• Great source of data for studying the structure of the society, online behavior, …
Marketing
• Influential users
• Recommendations
• Ad placement
• …
Large scale data mining
• Understand user communication patterns, community structure
• “Human sensors”
Privacy
…
OSNs cover 50% of the world’s Internet users
More than 1 billion users (October 2011)
[Figure: active users per network (network logos not preserved): 800 million, 200 million, 200 million, 66 million, 50 million, 34 million.]
Facebook:
• 800+M users
• 150 friends each (on average)
• 8 bytes (64 bits) per user ID
The raw connectivity data, with no attributes:
• 800M x 150 x 8B = 960 GB
To get this data, one would have to download:
• 200 TB of HTML data!
This is neither feasible nor practical. Solution: Sampling!
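The size estimate on this slide can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the three inputs are the figures quoted on the slide, the variable names are made up for illustration, and decimal gigabytes (10^9 bytes) are assumed.

```python
# Inputs quoted on the slide; variable names are illustrative only.
users = 800_000_000      # 800+ million users
avg_friends = 150        # average number of friends per user
bytes_per_id = 8         # one 64-bit user ID

# Every user stores one ID per friend.
raw_bytes = users * avg_friends * bytes_per_id
raw_gb = raw_bytes / 10**9   # decimal gigabytes (assumption)

print(f"{raw_gb:.0f} GB")
```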
Sampling
• OSNs
• P2P, distributed systems
• WWW
• “Offline” social networks
Objective:
• Node attributes
• Topology
• Graph size
• Evolution in time
• Random node selection
• …
What:
• Nodes
• Edges
• Subgraphs
How:
• Directly (often not possible)
• Exploration
Related work
Random Walks in graph sampling:
• WWW [Henzinger et al. 2000, Baykan et al. 2009]
• P2P [Gkantsidis et al. 2004, Stutzbach et al. 2006, Rasti et al. 2009]
• OSN [Rasti et al. 2008, Krishnamurthy et al. 2008]
• “Offline” social networks [Salganik et al. 2004, Volz et al. 2008]
Random Walks mixing improvements:
• Random jumps [Henzinger et al. 2000, Avrachenkov et al. 2010]
• Fastest Mixing Markov Chain [Boyd et al. 2004]
• Multiple dependent walks [Ribeiro et al. 2010]
BFS and other traversals in graph sampling:
• Najork et al. 2001, Achlioptas et al. 2005, Leskovec et al. 2006, Mislove et al. 2007, Cha 2007, Ahn et al. 2007, Wilson et al. 2009, Viswanath 2009, Ye et al. 2010, Gile and Handcock 2011
Measurement/Characterization studies of OSNs:
• Cyworld, Orkut, Myspace, Flickr, Youtube [Mislove et al. 2007, …]
• Facebook [Krishnamurthy et al. 2008, Wilson et al. 2009, …]
Independence sampling:
• Hansen-Hurwitz estimator [Hansen and Hurwitz 1943]
• Stratified sampling [Neyman 1934]
Outline
Introduction
Sampling with replacements (Random Walks):
• MHRW vs RWRW
• Multigraph Sampling
• Stratified Weighted Random Walk (S-WRW)
Sampling without replacements (Traversals):
• The bias of BFS (and of DFS/RDS/…)
Estimation from a sample
Conclusion and Future Directions
Random Walk in Facebook
Pr(sampling v) ~ kv, where kv is the degree of node v
[Figure: qk, the observed node degree distribution, versus pk, the real node degree distribution.]
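The degree bias of a simple random walk can be illustrated on a toy graph. The sketch below is not from the talk: the five-node graph, the function names, and the walk length are assumptions chosen for illustration. It compares the empirical visit frequencies with the degree-proportional distribution kv / (2|E|).

```python
import random
from collections import Counter

# Hypothetical toy graph (undirected, connected, contains a triangle).
GRAPH = {
    "A": ["B", "C", "D"],
    "B": ["A", "C"],
    "C": ["A", "B", "D"],
    "D": ["A", "C", "E"],
    "E": ["D"],
}

def random_walk(graph, start, steps, rng):
    """Simple random walk: move to a uniformly chosen neighbor at each step."""
    node, visits = start, Counter()
    for _ in range(steps):
        node = rng.choice(graph[node])
        visits[node] += 1
    return visits

def degree_proportional(graph):
    """Distribution k_v / (2|E|), i.e. degree divided by the sum of degrees."""
    total = sum(len(nbrs) for nbrs in graph.values())
    return {v: len(nbrs) / total for v, nbrs in graph.items()}

rng = random.Random(0)
steps = 200_000
visits = random_walk(GRAPH, "A", steps, rng)
expected = degree_proportional(GRAPH)
for v in sorted(GRAPH):
    # Columns: node, empirical visit frequency, k_v / (2|E|).
    print(v, round(visits[v] / steps, 3), round(expected[v], 3))
```

High-degree nodes such as A are visited about three times as often as the degree-1 node E, which is the bias that qk shows relative to pk.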
How to get an unbiased sample?
Metropolis-Hastings Random Walk (MHRW):
[Figure: example graph with nodes A to N; sample sequence D, A, A, C, …]
S = asymptotically uniform
Re-Weighted Random Walk (RWRW):
Collect a classic (biased) RW sample…
Then apply the Hansen-Hurwitz estimator to correct for the bias.
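Both approaches can be sketched on a toy graph. The code below is a minimal illustration, not the authors' implementation: the graph, function names, and walk length are assumptions. It uses the standard Metropolis-Hastings acceptance rule for a uniform target (accept a move from u to w with probability min(1, ku/kw)) and a Hansen-Hurwitz style re-weighting of a plain random walk, where each draw is weighted by 1/degree.

```python
import random

# Hypothetical toy graph (undirected adjacency lists), not from the slides.
GRAPH = {
    "A": ["B", "C", "D"],
    "B": ["A", "C"],
    "C": ["A", "B", "D"],
    "D": ["A", "C", "E"],
    "E": ["D"],
}

def degree(graph, v):
    return len(graph[v])

def mhrw(graph, start, steps, rng):
    """Metropolis-Hastings walk whose stationary distribution is uniform."""
    node, sample = start, []
    for _ in range(steps):
        candidate = rng.choice(graph[node])
        # Accept with probability min(1, k_node / k_candidate); otherwise stay.
        if rng.random() < degree(graph, node) / degree(graph, candidate):
            node = candidate
        sample.append(node)  # a rejected move repeats the current node
    return sample

def rw(graph, start, steps, rng):
    """Classic (degree-biased) random walk."""
    node, sample = start, []
    for _ in range(steps):
        node = rng.choice(graph[node])
        sample.append(node)
    return sample

def rwrw_mean(sample, graph, f):
    """Re-weighted mean of f: each draw is weighted by 1 / degree."""
    num = sum(f(v) / degree(graph, v) for v in sample)
    den = sum(1 / degree(graph, v) for v in sample)
    return num / den

rng = random.Random(0)
steps = 100_000
deg = lambda v: degree(GRAPH, v)
mhrw_sample = mhrw(GRAPH, "A", steps, rng)
rw_sample = rw(GRAPH, "A", steps, rng)

true_mean = sum(map(deg, GRAPH)) / len(GRAPH)      # 12 / 5 = 2.4
mhrw_est = sum(map(deg, mhrw_sample)) / steps      # plain mean over MHRW sample
naive_est = sum(map(deg, rw_sample)) / steps       # uncorrected RW mean (biased)
rwrw_est = rwrw_mean(rw_sample, GRAPH, deg)        # corrected RW mean
print(true_mean, round(mhrw_est, 2), round(naive_est, 2), round(rwrw_est, 2))
```

The uncorrected RW mean overestimates the average degree, while the MHRW mean and the re-weighted RW mean both land close to the true value of 2.4.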
Facebook results
[Figure: results for the Metropolis-Hastings Random Walk (MHRW) and the Re-Weighted Random Walk (RWRW).]
Also corrects for the bias of all other metrics.
[Figure: the same metrics when not corrected.]
MHRW or RWRW? [1]
RWRW is better than MHRW
• RWRW requires 1.5 to 7 times fewer samples to achieve the same accuracy
• Intuition?
However:
• Pathological counter-examples exist.
• MHRW is easier to use (it does not require reweighting)
[1] Minas Gjoka, Maciej Kurant, Carter T. Butts and Athina Markopoulou, “Walking in Facebook: A Case Study of Unbiased Sampling of OSNs”, INFOCOM 2010.
Online Convergence Diagnostics
• Inferences assume that samples are drawn from the stationary distribution
• No ground truth is available in practice
• The MCMC literature provides online diagnostics
Acceptable convergence is reached between 500 and 3000 iterations (depending on the property of interest)
[1] Minas Gjoka, Maciej Kurant, Carter T. Butts and Athina Markopoulou, “Walking in Facebook: A Case Study of Unbiased Sampling of OSNs”, INFOCOM 2010.
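As an illustration of what an online diagnostic looks like, here is a sketch of the Gelman-Rubin potential scale reduction factor, a standard tool from the MCMC literature. The slide does not say which diagnostics were used, so the choice of this one is an assumption, and the function name and toy inputs are hypothetical. Each chain would hold one scalar property (for example, node degree) recorded along one walk started from a different node.

```python
from statistics import mean, variance

def gelman_rubin(chains):
    """Potential scale reduction factor for equal-length scalar chains."""
    n = len(chains[0])
    if any(len(c) != n for c in chains):
        raise ValueError("all chains must have the same length")
    chain_means = [mean(c) for c in chains]
    within = mean([variance(c) for c in chains])   # W: mean within-chain variance
    between = n * variance(chain_means)            # B: between-chain variance
    pooled = (n - 1) / n * within + between / n    # pooled variance estimate
    return (pooled / within) ** 0.5

# Two toy chains that agree, and two that sit in different regions.
agree = gelman_rubin([[1, 2, 3, 4], [1, 2, 3, 4]])
disagree = gelman_rubin([[1, 2, 3, 4], [11, 12, 13, 14]])
print(round(agree, 3), round(disagree, 3))
```

Values close to 1 suggest that the walks have forgotten their starting points; values well above 1 (as for the second pair) suggest that more iterations are needed.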
E.g., in LastFM
[Figure: three graphs over the same set of nodes A to N, one per relation: Friends, Events, Groups.]
Multigraph sampling
G* = Friends + Events + Groups
(G* is a multigraph)
[Figure: the union multigraph G* over nodes A to N.]
Efficient implementation (saves bandwidth), with the walk currently at node H:
1) Select relation graph Gi with probability deg(H, Gi) / deg(H, G*)
2) Within Gi, choose an edge uniformly at random, i.e., with probability 1/deg(H, Gi).
Applied to LastFM:
- better coverage of previously isolated nodes
- better estimates of distributions and means
[2] Minas Gjoka, Carter T. Butts, Maciej Kurant, Athina Markopoulou, “Multigraph Sampling of Online Social Networks”, JSAC 2011.
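The two-step selection can be sketched as follows. The relation graphs, node names, and function name are hypothetical; only the two steps themselves come from the slide. Combining them, every edge of G* at the current node is chosen with probability deg(H, Gi)/deg(H, G*) x 1/deg(H, Gi) = 1/deg(H, G*), so the walk is a simple random walk on the multigraph without ever materialising it.

```python
import random

# Hypothetical relation graphs over the same users (adjacency lists).
RELATIONS = {
    "friends": {"H": ["G", "L"], "G": ["H"], "L": ["H"], "K": []},
    "events":  {"H": ["K"], "G": [], "L": [], "K": ["H"]},
    "groups":  {"H": ["G", "K"], "G": ["H"], "L": [], "K": ["H"]},
}

def multigraph_step(node, relations, rng):
    """One random-walk step on the union multigraph G*."""
    names = list(relations)
    degrees = [len(relations[name][node]) for name in names]
    if sum(degrees) == 0:
        raise ValueError(f"{node!r} has no neighbors in any relation")
    # 1) Select relation Gi with probability deg(node, Gi) / deg(node, G*).
    chosen = rng.choices(names, weights=degrees, k=1)[0]
    # 2) Within Gi, follow one of the node's edges uniformly at random.
    return chosen, rng.choice(relations[chosen][node])

rng = random.Random(0)
draws = [multigraph_step("H", RELATIONS, rng)[1] for _ in range(50_000)]
# K is reachable through 2 of the 5 edges of H in G* (events and groups).
freq_k = draws.count("K") / len(draws)
print(round(freq_k, 2))
```

Only the degrees of the current node in each relation are needed for step 1, and only one neighbor list is used in step 2.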
Not all nodes are equal
Node categories (e.g., China, Sweden):
• irrelevant
• important
• (equally) important
Stratification under Weighted Independence Sampler (WIS)
[Figure: node size is proportional to its sampling probability.]
But graph exploration techniques have to follow the links!
Assumption: On sampling a node, we learn the categories of its neighbors.
Enforcing WIS weights may lead to slow (or no) convergence
Trade-off between
• ideal (WIS) sampling weights
• fast convergence
See also: Fastest Mixing Markov Chain [Boyd et al. 2004]
Stratified Weighted Random Walk (S-WRW)
E.g., compare the size of red and green categories.
1) Measurement objective
2) Category weights optimal under WIS
• Stratified sampling theory + information collected by a pilot RW
3) Modified category weights
• Problem 1: Poor or no connectivity. Solution: Small weight > 0 for irrelevant categories. f* is the fraction of time we plan to spend in irrelevant nodes (e.g., 1%).
• Problem 2: “Black holes”. Solution: Limit the weight of tiny relevant categories. Γ is the maximal factor by which we can increase edge weights (e.g., 100 times).
4) Edge weights in G
• Target edge weights use category volumes such as vol(green), estimated from the pilot RW. [Figure: worked example with the values 20, 22 and 4.]
• Resolve conflicts: arithmetic mean, geometric mean, max, …
5) WRW sample
6) Final result
• Hansen-Hurwitz estimator
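The last three stages of this pipeline (edge weights, weighted walk, re-weighting) can be sketched on a toy graph. Everything concrete below is an assumption made for illustration: the four-node graph, the category weights, and the choice of the arithmetic mean to resolve conflicts. The way S-WRW actually derives and modifies the category weights (stratified sampling theory, pilot RW, f*, Γ) is not reproduced. The sketch relies on one standard fact: a weighted random walk visits node v with stationary probability proportional to the sum of the weights of its edges.

```python
from fractions import Fraction

# Hypothetical toy input: node categories, edges, and per-category weights.
CATEGORY = {"a": "red", "b": "red", "c": "green", "d": "irrelevant"}
EDGES = [("a", "b"), ("b", "c"), ("c", "d"), ("a", "d")]
CATEGORY_WEIGHT = {"red": 4, "green": 4, "irrelevant": 1}

def edge_weights(edges, category, cat_weight):
    """Resolve endpoint conflicts with the arithmetic mean of category weights."""
    return {
        (u, v): Fraction(cat_weight[category[u]] + cat_weight[category[v]], 2)
        for u, v in edges
    }

def node_strengths(weights):
    """Sum of incident edge weights for every node."""
    strength = {}
    for (u, v), w in weights.items():
        strength[u] = strength.get(u, 0) + w
        strength[v] = strength.get(v, 0) + w
    return strength

def stationary(strength):
    """Stationary distribution of the weighted random walk."""
    total = sum(strength.values())
    return {v: s / total for v, s in strength.items()}

def reweighted_fraction(sample, strength, predicate):
    """Hansen-Hurwitz style estimate of the fraction of nodes matching predicate."""
    num = sum(Fraction(1) / strength[v] for v in sample if predicate(v))
    den = sum(Fraction(1) / strength[v] for v in sample)
    return num / den

weights = edge_weights(EDGES, CATEGORY, CATEGORY_WEIGHT)
strength = node_strengths(weights)
pi = stationary(strength)
# Idealised sample in which each node appears in proportion to its strength.
ideal_sample = [v for v in strength for _ in range(int(2 * strength[v]))]
red_fraction = reweighted_fraction(
    ideal_sample, strength, lambda v: CATEGORY[v] == "red"
)
print(pi["d"], red_fraction)
```

The irrelevant node d keeps a small but non-zero visiting probability, and dividing by the node strength in the estimator undoes the deliberate oversampling of the relevant categories.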
Colleges in Facebook
[Figure: versions of S-WRW compared with Random Walk (RW).]
Samples in colleges: 86% of S-WRW samples, 9% of RW samples.
This is because S-WRW avoids irrelevant categories.
The difference is larger (100x) for small colleges. This is due to S-WRW’s stratification.
RW required 10-15 times more samples than S-WRW to achieve the same accuracy.
[3] Maciej Kurant, Minas Gjoka, Carter T. Butts and Athina Markopoulou, “Walking on a Graph with a Magnifying Glass”, SIGMETRICS 2011.
Sampling with replacements: Summary
RWRW is 1.5-7 times more efficient than MHRW
• counter-examples exist
Multigraph Sampling
• walking on multiple relations improves efficiency
Stratified Weighted Random Walk
• oversamples relevant regions, undersamples irrelevant regions
• 10-15 fold gains in sampling costs
Online Convergence Diagnostics
[1] Minas Gjoka, Maciej Kurant, Carter T. Butts and Athina Markopoulou, “Walking in Facebook: A Case Study of Unbiased Sampling of OSNs”, INFOCOM 2010.
[2] Minas Gjoka, Carter T. Butts, Maciej Kurant, Athina Markopoulou, “Multigraph Sampling of Online Social Networks”, JSAC 2011.
[3] Maciej Kurant, Minas Gjoka, Carter T. Butts and Athina Markopoulou, “Walking on a Graph with a Magnifying Glass”, SIGMETRICS 2011.
Sampling without replacements (Traversals)
Examples:
• BFS (Breadth-First Search)
• DFS (Depth-First Search)
• Forest Fire
• RDS (Respondent-Driven Sampling)
• Snowball sampling
• …
Why sample with BFS?
• BFS is a well-known textbook technique
• A BFS sample is a nice-looking graph
• It is used in practice [Ahn et al. 2007, Mislove et al. 2007, Wilson et al. 2009]
BFS in Facebook
BFS (Breadth First Search) with f=0.5% of nodes sampled
[Figure: observed degree distribution qk versus real degree distribution pk; annotation: 338 for RW.]
This bias has been empirically observed in the past [Najork et al. 2001].
Our goals:
• Formally analyze the bias of BFS (challenging due to dependencies)
• Correct for this bias.
• (no new sampling method proposed)
Goal: Analyze the bias of BFS
Graph traversals on RG(pk):
• BFS: qk(f) = ?
Notation:
• ⟨k⟩: real average node degree
• ⟨k²⟩: real average squared node degree
[Figure: bias of BFS as a function of f, still unknown at this point (“?”); reference level: true average node degree.]
Graph model RG(pk)
• Random graph RG(pk) with a given node degree distribution pk (or degree sequence)
• Can be generated by the configuration model
[Figure: example in which the ‘stubs’ of the nodes are matched at random.]
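The stub-matching construction of the configuration model can be sketched in a few lines. This is a generic textbook version, not the authors' code: the function name and the example degree sequence are assumptions, and self-loops and multi-edges are left in rather than rejected.

```python
import random

def configuration_model(degrees, rng):
    """Pair up 'stubs' uniformly at random; node v owns degrees[v] stubs."""
    if sum(degrees) % 2:
        raise ValueError("the degree sequence must have an even sum")
    # One list entry per stub, labelled with the node that owns it.
    stubs = [v for v, k in enumerate(degrees) for _ in range(k)]
    rng.shuffle(stubs)
    # Consecutive stubs in the shuffled list form an edge.
    return list(zip(stubs[0::2], stubs[1::2]))

degrees = [3, 2, 2, 1]                      # hypothetical degree sequence
edges = configuration_model(degrees, random.Random(1))

# Every node ends up with exactly its prescribed degree
# (a self-loop contributes 2 to the degree of its node).
realised = [0] * len(degrees)
for u, v in edges:
    realised[u] += 1
    realised[v] += 1
print(edges, realised)
```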
Approach 1: Brute force
Generate all possible graphs, and ... No way!
Remedy: “The Principle of Deferred Decisions”
So we can generate the graph ‘on the fly’, while exploring it!
Approach 2: The Principle of Deferred Decisions
[Figure: step-by-step exploration of a small graph with nodes u, v and w, annotated with the probabilities Pr(X_i = ·), where X_i is the i-th sampled node.]
This does not scale! (because of dependencies between stubs)
* we assumed that the generated graph is connected
Approach 2b: Breaking the stub dependencies
[Figure: the stubs of nodes v1 to v4 placed on a time axis t running from 0 to 1.]
Originally proposed in: J. H. Kim, “Poisson cloning model for random graphs,” International Congress of Mathematicians (ICM), 2006 (preprint in 2004).
Developed in: D. Achlioptas, A. Clauset, D. Kempe, and C. Moore, “On the bias of traceroute sampling: or, power-law degree distributions in regular graphs,” in STOC, 2005.
(both in a different context)
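Approach 2b: Breaking the stub dependencies

Node v of degree k is not sampled before time t:

$$\Pr(v \text{ not sampled before } t) = (1-t)^k$$

Node v of degree k is sampled before time t:

$$\Pr(v \text{ sampled before } t) = 1-(1-t)^k$$

Expected number of nodes of degree k sampled before time t, where $|V|\,p_k$ is the number of nodes of degree k:

$$|V|\,p_k\left(1-(1-t)^k\right)$$

Degree distribution expected to be observed:

$$q_k(t) = \frac{p_k\left(1-(1-t)^k\right)}{\sum_l p_l\left(1-(1-t)^l\right)}$$

Expected fraction of sampled nodes (so that $t(f)$ is well defined):

$$f(t) = 1-\sum_k p_k\,(1-t)^k$$

Corrected node degree distribution:

$$\hat{p}_k = \frac{\hat{q}_k}{1-(1-t(f))^k}\left(\sum_l \frac{\hat{q}_l}{1-(1-t(f))^l}\right)^{-1}$$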
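These formulas can be evaluated numerically. The sketch below is an illustration under stated assumptions, not the authors' published Python code: the degree distribution `p` is hypothetical, t(f) is obtained by plain bisection, and the correction step is given the value of t directly. It computes the degree distribution a traversal is expected to observe at a sampled fraction f, and then inverts it.

```python
def sampled_fraction(p, t):
    """f(t) = 1 - sum_k p_k (1 - t)^k."""
    return 1 - sum(pk * (1 - t) ** k for k, pk in p.items())

def time_for_fraction(p, f, tol=1e-12):
    """Invert f(t) by bisection; f(t) is increasing when all degrees are >= 1."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if sampled_fraction(p, mid) < f:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def expected_observed(p, t):
    """q_k(t): degree distribution expected in the sample at time t."""
    raw = {k: pk * (1 - (1 - t) ** k) for k, pk in p.items()}
    total = sum(raw.values())
    return {k: x / total for k, x in raw.items()}

def corrected(q_hat, t):
    """Corrected distribution: divide by 1 - (1 - t)^k and renormalise."""
    raw = {k: qk / (1 - (1 - t) ** k) for k, qk in q_hat.items()}
    total = sum(raw.values())
    return {k: x / total for k, x in raw.items()}

def mean_degree(dist):
    return sum(k * x for k, x in dist.items())

p = {1: 0.5, 2: 0.3, 10: 0.2}    # hypothetical degree distribution
for f in (1e-4, 0.1, 0.5, 0.999999):
    t = time_for_fraction(p, f)
    print(f, round(mean_degree(expected_observed(p, t)), 3))

# Round trip: correcting the expected observation recovers p.
t_half = time_for_fraction(p, 0.5)
p_back = corrected(expected_observed(p, t_half), t_half)
```

For this `p`, the true mean degree is 3.1 and the random-walk value ⟨k²⟩/⟨k⟩ is 7.0; the printed means move from close to 7.0 at very small f down to close to 3.1 as f approaches 1.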
Main results
Graph traversals on RG(pk):
• For small sample size (for f→0), BFS has the same bias as RW.
• This bias monotonically decreases with f. We found analytically the shape of this curve.
• For large sample size (for f→1), BFS becomes unbiased.
• Under RG(pk), all traversals are subject to exactly the same bias.
[Figure: bias as a function of f, with labels MHRW, RWRW and RDS, and the true average node degree as the reference level.]
[4] Maciej Kurant, Athina Markopoulou, Patrick Thiran, “On the Bias of BFS”, JSAC 2011.
Python code available at: http://mkurant.com/publications
What if the graph is not random? [4]
[Figure: degree distributions in non-random graphs: expected versus sampled, and true versus corrected.]
Sampling without replacements: Summary [4]
A difficult problem
• Dependencies between samples
We computed analytically the bias of BFS in RG(pk)
• Initial bias (for f→0) is the same as that of RW
• Same bias for all traversals (BFS, DFS, RDS, …) under RG(pk)
• A bias correction procedure
• Works well for real-life graphs
If possible, prefer methods with replacements.
1) Local properties
Node properties:
• Community membership information
• Privacy settings
• Names
• …
Local topology properties:
• Node degree distribution
• Assortativity
• Clustering coefficient
• …
![Page 60: 1 Sampling Massive Online Graphs Challenges, Techniques, and Applications to Facebook Maciej Kurant (UC Irvine) Joint work with: Minas Gjoka (UC Irvine),](https://reader030.vdocuments.us/reader030/viewer/2022033106/56649e835503460f94b84c79/html5/thumbnails/60.jpg)
1) Local properties
Example: Privacy Awareness in Facebook ’09
Privacy Awareness (PA) = fraction of users that change the default privacy settings.
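A minimal sketch of how a fraction such as PA could be estimated from a random-walk sample. A simple random walk visits nodes with probability proportional to their degree, so each sampled user is re-weighted by 1/degree (the idea behind RWRW). The function name and the toy sample are hypothetical, not data from the talk.

```python
def estimate_fraction_rwrw(samples):
    """Re-weighted estimate of the fraction of users with a binary property.

    samples: list of (degree, has_property) pairs collected by a random walk.
    Each sample is weighted by 1/degree to undo the degree bias of the walk.
    """
    numerator = sum(1.0 / degree for degree, flag in samples if flag)
    denominator = sum(1.0 / degree for degree, _ in samples)
    return numerator / denominator


# Made-up sample: (node degree, changed default privacy settings?)
walk_sample = [(1, True), (2, False), (2, True), (4, False)]

naive = sum(flag for _, flag in walk_sample) / len(walk_sample)
reweighted = estimate_fraction_rwrw(walk_sample)
```

In this toy sample the naive average is 0.5, while the re-weighted estimate is 2/3 because the low-degree users, which the walk under-samples, count for more.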
![Page 61: 1 Sampling Massive Online Graphs Challenges, Techniques, and Applications to Facebook Maciej Kurant (UC Irvine) Joint work with: Minas Gjoka (UC Irvine),](https://reader030.vdocuments.us/reader030/viewer/2022033106/56649e835503460f94b84c79/html5/thumbnails/61.jpg)
2) Estimating the graph size
• Counts repeated nodes: “Reversed Birthday Paradox”
• Work in progress
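The "reversed birthday paradox" idea can be sketched as follows, assuming nodes are drawn uniformly and independently with replacement (an assumption made here for simplicity; a random-walk sample would also need degree-based weights, and the slide marks the authors' own estimator as work in progress). With n draws from N nodes, the expected number of colliding pairs is n(n-1)/(2N), so N can be estimated from the observed collisions. The function name is hypothetical.

```python
from collections import Counter


def estimate_graph_size(sample):
    """Estimate the number of nodes from repeats in a uniform sample.

    Returns None when no node repeats, because the estimate is then undefined.
    """
    n = len(sample)
    counts = Counter(sample)
    # Number of pairs of draws that hit the same node.
    collisions = sum(c * (c - 1) // 2 for c in counts.values())
    if collisions == 0:
        return None
    return (n * (n - 1) / 2) / collisions


# Made-up sample of node IDs: "a" and "b" each appear twice.
size_estimate = estimate_graph_size(["a", "b", "a", "c", "d", "b"])
no_repeats = estimate_graph_size(["a", "b", "c"])
```

Here 6 draws give 15 pairs and 2 collisions, so the estimate is 7.5 nodes. The estimate is very noisy when collisions are this rare.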
![Page 62: 1 Sampling Massive Online Graphs Challenges, Techniques, and Applications to Facebook Maciej Kurant (UC Irvine) Joint work with: Minas Gjoka (UC Irvine),](https://reader030.vdocuments.us/reader030/viewer/2022033106/56649e835503460f94b84c79/html5/thumbnails/62.jpg)
3) Coarse-grained topology
[Figure: categories A and B, with the estimator formula]
Probability that a random node in A is a neighbor of a random node in B
From a randomly sampled set of nodes we infer a valid topology!
[5] M. Kurant, M. Gjoka, Y. Wang, Z. W. Almquist, C. T. Butts, A. Markopoulou, “Coarse-Grained Topology Estimation”, arXiv:1105.5488
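A minimal sketch of how the probability above could be estimated from a node sample. It assumes the nodes are sampled uniformly, are distinct, and that the edges among the sampled nodes are observable. This is not necessarily the estimator used in [5], and the function name and toy data are hypothetical.

```python
def estimate_category_weight(sample, category, edges, a, b):
    """Estimate P(random node in a is a neighbor of random node in b), a != b.

    sample:   list of distinct sampled node IDs
    category: dict mapping node ID -> category label
    edges:    iterable of undirected edges (u, v) among the sampled nodes
    """
    nodes_a = [v for v in sample if category[v] == a]
    nodes_b = [v for v in sample if category[v] == b]
    if not nodes_a or not nodes_b:
        return None  # no sampled node in one of the categories
    edge_set = {frozenset(e) for e in edges}
    # Count sampled (A, B) pairs that are connected by an edge.
    hits = sum(1 for u in nodes_a for v in nodes_b if frozenset((u, v)) in edge_set)
    return hits / (len(nodes_a) * len(nodes_b))


# Toy data: two nodes in category "A", three in category "B".
category = {1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
edges = [(1, 3), (2, 3), (2, 5), (1, 2)]
weight_ab = estimate_category_weight([1, 2, 3, 4, 5], category, edges, "A", "B")
```

In the toy data, 3 of the 6 possible A-B pairs are connected, so the estimated weight is 0.5. Repeating this for every pair of categories gives a weighted category graph.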
![Page 63: 1 Sampling Massive Online Graphs Challenges, Techniques, and Applications to Facebook Maciej Kurant (UC Irvine) Joint work with: Minas Gjoka (UC Irvine),](https://reader030.vdocuments.us/reader030/viewer/2022033106/56649e835503460f94b84c79/html5/thumbnails/63.jpg)
geosocialmap.com
![Page 64: 1 Sampling Massive Online Graphs Challenges, Techniques, and Applications to Facebook Maciej Kurant (UC Irvine) Joint work with: Minas Gjoka (UC Irvine),](https://reader030.vdocuments.us/reader030/viewer/2022033106/56649e835503460f94b84c79/html5/thumbnails/64.jpg)
Public and private colleges in the USA
geosocialmap.com
![Page 65: 1 Sampling Massive Online Graphs Challenges, Techniques, and Applications to Facebook Maciej Kurant (UC Irvine) Joint work with: Minas Gjoka (UC Irvine),](https://reader030.vdocuments.us/reader030/viewer/2022033106/56649e835503460f94b84c79/html5/thumbnails/65.jpg)
geosocialmap.com
The world according to Facebook
![Page 66: 1 Sampling Massive Online Graphs Challenges, Techniques, and Applications to Facebook Maciej Kurant (UC Irvine) Joint work with: Minas Gjoka (UC Irvine),](https://reader030.vdocuments.us/reader030/viewer/2022033106/56649e835503460f94b84c79/html5/thumbnails/66.jpg)
Strong clusters among Middle Eastern countries
[Figure labels: Egypt, Saudi Arabia, United Arab Emirates, Lebanon, Jordan, Israel]
![Page 67: 1 Sampling Massive Online Graphs Challenges, Techniques, and Applications to Facebook Maciej Kurant (UC Irvine) Joint work with: Minas Gjoka (UC Irvine),](https://reader030.vdocuments.us/reader030/viewer/2022033106/56649e835503460f94b84c79/html5/thumbnails/67.jpg)
Summary
![Page 68: 1 Sampling Massive Online Graphs Challenges, Techniques, and Applications to Facebook Maciej Kurant (UC Irvine) Joint work with: Minas Gjoka (UC Irvine),](https://reader030.vdocuments.us/reader030/viewer/2022033106/56649e835503460f94b84c79/html5/thumbnails/68.jpg)
Random Walks (with replacements)
• RWRW > MHRW [1]
• Convergence Diagnostics
• Multigraph sampling [2]
• Stratified WRW [3]
References
[1] M. Gjoka, M. Kurant, C. T. Butts and A. Markopoulou, “Walking in Facebook: A Case Study of Unbiased Sampling of OSNs”, INFOCOM 2010.
[2] M. Gjoka, C. T. Butts, M. Kurant and A. Markopoulou, “Multigraph Sampling of Online Social Networks”, JSAC 2011.
[3] M. Kurant, M. Gjoka, C. T. Butts and A. Markopoulou, “Walking on a Graph with a Magnifying Glass”, SIGMETRICS 2011.
[4] M. Kurant, A. Markopoulou and P. Thiran, “On the bias of BFS (Breadth First Search)”, JSAC 2011.
[5] M. Kurant, M. Gjoka, Y. Wang, Z. W. Almquist, C. T. Butts and A. Markopoulou, “Coarse-Grained Topology Estimation”, arXiv:1105.5488.
[6] Datasets available from: http://odysseas.calit2.uci.edu/osn
![Page 69: 1 Sampling Massive Online Graphs Challenges, Techniques, and Applications to Facebook Maciej Kurant (UC Irvine) Joint work with: Minas Gjoka (UC Irvine),](https://reader030.vdocuments.us/reader030/viewer/2022033106/56649e835503460f94b84c79/html5/thumbnails/69.jpg)
![Page 70: 1 Sampling Massive Online Graphs Challenges, Techniques, and Applications to Facebook Maciej Kurant (UC Irvine) Joint work with: Minas Gjoka (UC Irvine),](https://reader030.vdocuments.us/reader030/viewer/2022033106/56649e835503460f94b84c79/html5/thumbnails/70.jpg)
Random Walks (with replacements)
• RWRW > MHRW [1]
• Convergence Diagnostics
• Multigraph sampling [2]
• Stratified WRW [3]
Traversals (no replacements)
• [Figure legend: Graph traversals on RG(pk); MHRW, RWRW] [4]
Coarse-grained topologies [5]
Thank you
mkurant.com