unbiased-sampling.ppt
TRANSCRIPT
-
Minas Gjoka, UC IrvineWalking in Facebook*Walking in Facebook:A Case Study of Unbiased Sampling of OSNs
Minas Gjoka, Maciej Kurant , Carter Butts, Athina Markopoulou UC Irvine, EPFL
Walking in Facebook
-
Minas Gjoka, UC IrvineWalking in Facebook*OutlineMotivation and Problem StatementSampling MethodologyData AnalysisConclusion
Walking in Facebook
-
Minas Gjoka, UC IrvineWalking in Facebook*Online Social Networks (OSNs) A network of declared friendships between users
Allows users to maintain relationships
Many popular OSNs with different focusFacebook, LinkedIn, Flickr,
Social Graph
Walking in Facebook
C
A
E
G
F
B
D
H
-
Minas Gjoka, UC IrvineWalking in Facebook*Why Sample OSNs?Representative samples desirablestudy propertiestest algorithms
Obtaining complete dataset difficultcompanies usually unwilling to share data tremendous overhead to measure all (~100TB for Facebook)
Walking in Facebook
-
Minas Gjoka, UC IrvineWalking in Facebook*Problem statementObtain a representative sample of users in a given OSN by exploration of the social graph. in this work we sample Facebook (FB)explore graph using various crawling techniques
Walking in Facebook
-
Minas Gjoka, UC IrvineWalking in Facebook*Related WorkGraph traversal (BFS)A. Mislove et al, IMC 2007Y. Ahn et al, WWW 2007C. Wilson, Eurosys 2009Random walks (MHRW, RDS)M. Henzinger et al, WWW 2000D. Stutbach et al, IMC 2006A. Rasti et al, Mini Infocom 2009
Walking in Facebook
-
Minas Gjoka, UC IrvineWalking in Facebook*Our ContributionsCompare various crawling techniques in FBs social graph Breadth-First-Search (BFS)Random Walk (RW)Metropolis-Hastings Random Walk (MHRW)
Practical recommendationsonline convergence diagnostic testsproper use of multiple parallel chainsmethods that perform better and tradeoffs
Uniform sample of Facebook userscollection and analysismade available to researchers
Walking in Facebook
-
Minas Gjoka, UC IrvineWalking in Facebook*OutlineMotivation and Problem StatementSampling Methodologycrawling methodsdata collectionconvergence evaluationmethod comparisonsData AnalysisConclusion
Walking in Facebook
-
Minas Gjoka, UC IrvineWalking in Facebook*(1) Breadth-First-Search (BFS)Starting from a seed, explores all neighbor nodes. Process continues iteratively without replacement.
BFS leads to bias towards high degree nodes Lee et al, Statistical properties of Sampled Networks, Phys Review E, 2006
Early measurement studies of OSNs use BFS as primary sampling techniquei.e [Mislove et al], [Ahn et al], [Wilson et al.]
Walking in Facebook
C
A
E
G
F
B
D
H
Unexplored
Explored
Visited
-
Minas Gjoka, UC IrvineWalking in Facebook*(2) Random Walk (RW)Explores graph one node at a time with replacement
In the stationary distribution
Degree of node Number of edges
Walking in Facebook
C
A
E
G
F
B
D
H
Next candidate
Current node
1/3
1/3
1/3
-
Minas Gjoka, UC IrvineWalking in Facebook*(3) Re-Weighted Random Walk (RWRW) Hansen-Hurwitz estimator Corrects for degree bias at the end of collection
Without re-weighting, the probability distribution for node property A is:
Re-Weighted probability distribution :
Subset of sampled nodes with value iAll sampled nodesDegree of node u
Walking in Facebook
-
Minas Gjoka, UC IrvineWalking in Facebook*(4) Metropolis-Hastings Random Walk (MHRW)Explore graph one node at a time with replacement
In the stationary distribution
Walking in Facebook
C
A
E
G
F
B
D
H
1/3
1/5
1/3
Next candidate
Current node
2/15
-
Minas Gjoka, UC IrvineWalking in Facebook*Uniform userID Sampling (UNI)As a basis for comparison, we collect a uniform sample of Facebook userIDs (UNI)rejection sampling on the 32-bit userID space
UNI not a general solution for sampling OSNsuserID space must not be sparsenames instead of numbers
Walking in Facebook
-
Minas Gjoka, UC IrvineWalking in Facebook*Summary of Datasets
Egonets for a subsample of MHRW- local properties of nodes
Datasets available at: http://odysseas.calit2.uci.edu/research/osn.html
Sampling methodMHRWRWBFSUNI#Valid Users28x81K28x81K28x81K984K# Unique Users957K2.19M2.20M984K
Walking in Facebook
-
Minas Gjoka, UC IrvineWalking in Facebook*Data CollectionBasic Node InformationWhat information do we collect for each sampled node u?
Walking in Facebook
The height of the text box and its associated braces increases or decreases as you add text. To change the width of the comment, drag one of the side handles.
Drag the side handles to change the width of the text block.
UserIDName NetworksPrivacy settings
UserIDName NetworksPrivacy settings
Friend ListUserIDNameNetworksPrivacy Settings
UserIDName NetworksPrivacy settings
u
1
1
1
1
Profile Photo
Add as Friend
Regional School/Workplace
View Friends
Send Message
-
Minas Gjoka, UC IrvineWalking in Facebook*Detecting Convergence
Number of samples (iterations) to loose dependence from starting points?
Walking in Facebook
-
Minas Gjoka, UC IrvineWalking in Facebook*Online Convergence DiagnosticsGewekeDetects convergence for a single walk. Let X be a sequence of samples for metric of interest.
J. Geweke, Evaluating the accuracy of sampling based approaches to calculate posterior moments in Bayesian Statistics 4, 1992Xa
Xb
Walking in Facebook
-
Minas Gjoka, UC IrvineWalking in Facebook*Online Convergence DiagnosticsGelman-RubinDetects convergence for m>1 walks
A. Gelman, D. Rubin, Inference from iterative simulation using multiple sequences in Statistical Science Volume 7, 1992Walk 1Walk 2Walk 3Between walksvarianceWithin walksvariance
Walking in Facebook
-
Minas Gjoka, UC IrvineWalking in Facebook*When do we reach equilibrium?Burn-in determined to be 3K
Node Degree
Walking in Facebook
-
Minas Gjoka, UC IrvineWalking in Facebook*Methods ComparisonNode DegreePoor performance for BFS, RW
MHRW, RWRW produce good estimates per chainoverall28 crawls
Walking in Facebook
-
Minas Gjoka, UC IrvineWalking in Facebook*Sampling BiasBFSLow degree nodes under-represented by two orders of magnitude
BFS is biased
Walking in Facebook
-
Minas Gjoka, UC IrvineWalking in Facebook*Sampling Bias MHRW, RW, RWRWDegree distribution identical to UNI (MHRW,RWRW)RW as biased as BFS but with smaller variance in each walk
Walking in Facebook
-
Minas Gjoka, UC IrvineWalking in Facebook*Practical Recommendations for Sampling MethodsUse MHRW or RWRW. Do not use BFS, RW.
Use formal convergence diagnosticsassess convergence onlineuse multiple parallel walks
MHRW vs RWRWRWRW slightly better performanceMHRW provides a ready-to-use sample
Walking in Facebook
-
Minas Gjoka, UC IrvineWalking in Facebook*OutlineMotivation and Problem StatementSampling MethodologyData AnalysisConclusion
Walking in Facebook
-
Minas Gjoka, UC IrvineWalking in Facebook*FB Social GraphDegree DistributionDegree distribution not a power law
a2=3.38a1=1.32
Walking in Facebook
-
Minas Gjoka, UC IrvineWalking in Facebook*FB Social GraphTopological Characteristics
Our MHRW sample Assortativity Coefficient = 0.233range of Clustering Coefficient [0.05, 0.35]
[Wilson et al, Eurosys 09] Assortativity Coefficient = 0.17range of Clustering Coefficient [0.05, 0.18]
More details in our paper and technical report
Walking in Facebook
-
Minas Gjoka, UC IrvineWalking in Facebook*ConclusionCompared graph crawling methodsMHRW, RWRW performed remarkably wellBFS, RW lead to substantial bias
Practical recommendationscorrect for biasusage of online convergence diagnosticsproper use of multiple chains
Datasets publicly availablehttp://odysseas.calit2.uci.edu/research/osn.html
Walking in Facebook
-
Minas Gjoka, UC IrvineWalking in Facebook*
Thank you Questions?
Walking in Facebook
*****