unbiased-sampling.ppt

28
Minas Gjoka, UC Irvine Walking in Facebook 1 Walking in Facebook: A Case Study of Unbiased Sampling of OSNs Minas Gjoka, Maciej Kurant ‡, Carter Butts, Athina Markopoulou UC Irvine, EPFL ‡

Upload: sreenupes

Post on 18-Dec-2015


TRANSCRIPT

  • Walking in Facebook: A Case Study of Unbiased Sampling of OSNs

    Minas Gjoka, Maciej Kurant ‡, Carter Butts, Athina Markopoulou (UC Irvine, EPFL ‡)

  • Outline

    Motivation and Problem Statement

    Sampling Methodology

    Data Analysis

    Conclusion

  • Online Social Networks (OSNs)

    An OSN is a network of declared friendships between users.

    It allows users to maintain relationships.

    There are many popular OSNs with different focus, e.g. Facebook, LinkedIn, Flickr.

    [Figure: Social Graph, an example graph with nodes A through H]

  • Why Sample OSNs?

    Representative samples are desirable: to study properties; to test algorithms.

    Obtaining a complete dataset is difficult: companies are usually unwilling to share data; measuring everything carries tremendous overhead (~100TB for Facebook).

  • Problem statement

    Obtain a representative sample of users in a given OSN by exploration of the social graph.

    In this work we sample Facebook (FB) and explore the graph using various crawling techniques.

  • Related Work

    Graph traversal (BFS): A. Mislove et al., IMC 2007; Y. Ahn et al., WWW 2007; C. Wilson et al., Eurosys 2009.

    Random walks (MHRW, RDS): M. Henzinger et al., WWW 2000; D. Stutzbach et al., IMC 2006; A. Rasti et al., Mini Infocom 2009.

  • Our Contributions

    Compare various crawling techniques in FB's social graph: Breadth-First-Search (BFS); Random Walk (RW); Metropolis-Hastings Random Walk (MHRW).

    Practical recommendations: online convergence diagnostic tests; proper use of multiple parallel chains; methods that perform better, and their tradeoffs.

    Uniform sample of Facebook users: collection and analysis; made available to researchers.

  • Outline

    Motivation and Problem Statement

    Sampling Methodology: crawling methods; data collection; convergence evaluation; method comparisons

    Data Analysis

    Conclusion

  • (1) Breadth-First-Search (BFS)

    Starting from a seed, BFS explores all neighbor nodes. The process continues iteratively, without replacement.

    BFS leads to bias towards high-degree nodes (Lee et al., Statistical properties of sampled networks, Phys. Review E, 2006).

    Early measurement studies of OSNs use BFS as the primary sampling technique, e.g. [Mislove et al.], [Ahn et al.], [Wilson et al.].

    [Figure: example graph with nodes A through H; legend: Unexplored, Explored, Visited]
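    The BFS crawl described on this slide can be sketched as follows. This is a minimal illustration, not the crawler used in the study: the function name `bfs_sample`, the `max_nodes` budget, and the toy adjacency-list graph are all assumptions made for the example.

```python
from collections import deque

def bfs_sample(graph, seed, max_nodes):
    """Return nodes in BFS discovery order, stopping once max_nodes are collected.

    graph is a dict mapping each node to a list of its neighbors (hypothetical
    in-memory stand-in for friend lists fetched from an OSN).
    """
    visited = [seed]      # discovery order; no node appears twice (without replacement)
    seen = {seed}
    queue = deque([seed])
    while queue and len(visited) < max_nodes:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                visited.append(neighbor)
                queue.append(neighbor)
                if len(visited) == max_nodes:
                    break
    return visited

# Hypothetical toy graph; it is not the graph drawn on the slide.
TOY_GRAPH = {
    "A": ["B", "C", "D"],
    "B": ["A", "C"],
    "C": ["A", "B", "E"],
    "D": ["A"],
    "E": ["C"],
}
```

    Because a crawl that stops early has only covered the region around the seed, high-degree nodes (which are reachable through many edges) tend to be discovered sooner, which is one way to see the bias noted above.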

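  • (2) Random Walk (RW)

    Explores the graph one node at a time, with replacement.

    In the stationary distribution, a node is visited with probability proportional to its degree:

    $\pi_v = \frac{k_v}{2|E|}$, where $k_v$ is the degree of node $v$ and $|E|$ is the number of edges.

    [Figure: example graph with nodes A through H; from the current node, each of the three next candidates is chosen with probability 1/3]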
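    A short sketch of the simple random walk and of its degree-proportional stationary distribution, k_v / (2|E|). The helper names (`random_walk`, `rw_stationary`, `visit_frequencies`) and the toy graph are assumptions for illustration only.

```python
import random
from collections import Counter

def random_walk(graph, seed, steps, rng):
    """Simple random walk: at every step move to a uniformly chosen neighbor."""
    node = seed
    walk = [node]
    for _ in range(steps):
        node = rng.choice(graph[node])  # each neighbor is picked with probability 1/k_v
        walk.append(node)
    return walk

def rw_stationary(graph):
    """Stationary probability k_v / (2|E|) of every node."""
    two_e = sum(len(nbrs) for nbrs in graph.values())  # sum of degrees equals 2|E|
    return {v: len(nbrs) / two_e for v, nbrs in graph.items()}

def visit_frequencies(walk):
    """Fraction of walk positions spent at each visited node."""
    counts = Counter(walk)
    return {v: c / len(walk) for v, c in counts.items()}

# Hypothetical toy graph (connected and containing a triangle, so the walk converges).
TOY_GRAPH = {
    "A": ["B", "C", "D"],
    "B": ["A", "C"],
    "C": ["A", "B", "E"],
    "D": ["A"],
    "E": ["C"],
}
```

    On a long walk, `visit_frequencies(random_walk(TOY_GRAPH, "D", 100000, random.Random(0)))` should land close to `rw_stationary(TOY_GRAPH)`, i.e. the degree-3 nodes are visited about three times as often as the degree-1 nodes.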

  • (3) Re-Weighted Random Walk (RWRW)

    Uses the Hansen-Hurwitz estimator. Corrects for degree bias at the end of collection.

    Without re-weighting, the probability distribution for node property A is:

    $p(A_i) = \frac{\sum_{u \in A_i} 1}{\sum_{u \in V} 1}$

    Re-weighted probability distribution:

    $\hat{p}(A_i) = \frac{\sum_{u \in A_i} 1/k_u}{\sum_{u \in V} 1/k_u}$

    where $A_i$ is the subset of sampled nodes with value $i$, $V$ is the set of all sampled nodes, and $k_u$ is the degree of node $u$.
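    The re-weighting step can be sketched as below: every sampled node is weighted by the inverse of its degree, which cancels the degree-proportional visiting probability of the plain random walk. The function names and the dict-based inputs are hypothetical; only the 1/degree weighting comes from the slide.

```python
def naive_estimate(samples, value):
    """Unweighted share of samples per property value (biased for a plain RW)."""
    counts = {}
    for u in samples:
        counts[value[u]] = counts.get(value[u], 0) + 1
    return {i: c / len(samples) for i, c in counts.items()}

def rwrw_estimate(samples, degree, value):
    """Hansen-Hurwitz style estimate: weight each sampled node by 1/degree.

    samples: nodes visited by the walk (repeats allowed)
    degree:  dict node -> degree k_u
    value:   dict node -> value i of the property of interest
    """
    weighted = {}
    total = 0.0
    for u in samples:
        w = 1.0 / degree[u]
        weighted[value[u]] = weighted.get(value[u], 0.0) + w
        total += w
    return {i: w / total for i, w in weighted.items()}
```

    For example, with samples A, A, A, B, B, D and degrees 3, 2, 1, the naive estimate gives A a share of 1/2, while the re-weighted estimate gives each of the three nodes 1/3.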

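  • (4) Metropolis-Hastings Random Walk (MHRW)

    Explores the graph one node at a time, with replacement.

    In the stationary distribution, every node is visited with the same probability (the distribution is uniform over the nodes).

    [Figure: example graph with nodes A through H, showing the current node and its next candidates, labeled with probabilities 1/3, 1/5, 1/3 and 2/15]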
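    A minimal sketch of the standard Metropolis-Hastings walk targeting the uniform distribution: propose a uniformly chosen neighbor w of the current node v and accept with probability min(1, k_v/k_w), otherwise stay at v. The function names and the toy graph are assumptions; the graph was chosen so that node "v" has degree 3 and one neighbor of degree 5, which yields the same values as the labels in the slide's figure (1/3, 1/3, 1/5, and 2/15 for staying).

```python
import random
from fractions import Fraction

def mhrw_transition_probs(graph, v):
    """Exact transition probabilities out of v, including the self-loop."""
    k_v = len(graph[v])
    probs = {}
    for w in graph[v]:
        k_w = len(graph[w])
        # propose w with probability 1/k_v, accept with probability min(1, k_v/k_w)
        probs[w] = Fraction(1, k_v) * min(Fraction(1), Fraction(k_v, k_w))
    probs[v] = 1 - sum(probs.values())  # rejected proposals keep the walk at v
    return probs

def mhrw(graph, seed, steps, rng):
    """Metropolis-Hastings random walk with a uniform target distribution."""
    node = seed
    walk = [node]
    for _ in range(steps):
        candidate = rng.choice(graph[node])
        if rng.random() < len(graph[node]) / len(graph[candidate]):
            node = candidate            # accept the move
        walk.append(node)               # on rejection the current node is sampled again
    return walk

# Hypothetical toy graph; it is not the graph drawn on the slide.
TOY_GRAPH = {
    "v": ["a", "b", "c"],
    "a": ["v"],
    "b": ["v", "c"],
    "c": ["v", "b", "x", "y", "z"],
    "x": ["c"],
    "y": ["c"],
    "z": ["c"],
}
```

    Unlike RWRW, the resulting sample needs no re-weighting: over a long walk each of the seven toy nodes should appear in roughly 1/7 of the positions.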

  • Uniform userID Sampling (UNI)

    As a basis for comparison, we collect a uniform sample of Facebook userIDs (UNI), using rejection sampling on the 32-bit userID space.

    UNI is not a general solution for sampling OSNs: the userID space must not be sparse; some OSNs use names instead of numbers.
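    Rejection sampling on an ID space can be sketched as follows: draw an ID uniformly from the whole space and keep it only if it belongs to an existing user. The `exists` predicate is hypothetical (in a real crawl it would be a lookup against the OSN), as are the function name and its parameters.

```python
import random

def uniform_userid_sample(exists, n, rng, id_bits=32):
    """Collect n user IDs, each uniform over the valid IDs (repeats are possible).

    exists:  hypothetical predicate telling whether an ID belongs to a real user
    id_bits: width of the ID space; 32 matches the space mentioned on the slide
    """
    sample = []
    while len(sample) < n:
        candidate = rng.randrange(2 ** id_bits)  # uniform over the full ID space
        if exists(candidate):                    # reject IDs with no user behind them
            sample.append(candidate)
    return sample
```

    The expected number of draws per accepted ID is the inverse of the fraction of the space that is occupied, which is why a sparse ID space makes this approach impractical.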

  • Summary of Datasets

    Sampling method    MHRW      RW        BFS       UNI
    # Valid users      28x81K    28x81K    28x81K    984K
    # Unique users     957K      2.19M     2.20M     984K

    Egonets for a subsample of MHRW: local properties of nodes.

    Datasets available at: http://odysseas.calit2.uci.edu/research/osn.html

  • Data Collection: Basic Node Information

    What information do we collect for each sampled node u?

    For u itself: UserID, Name, Networks, Privacy settings.

    For each user in u's Friend List: UserID, Name, Networks, Privacy settings.

    [Figure: node u and its friend list; labels: Profile Photo, Add as Friend, Regional School/Workplace, View Friends, Send Message; bit values 1 1 1 1]

  • Detecting Convergence

    How many samples (iterations) are needed to lose dependence on the starting points?

  • Online Convergence Diagnostics: Geweke

    Detects convergence for a single walk. Let X be a sequence of samples for the metric of interest.

    [Figure: the sequence X, with labels Xa and Xb]

    J. Geweke, Evaluating the accuracy of sampling based approaches to calculate posterior moments, in Bayesian Statistics 4, 1992.
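    A simplified Geweke-style check, as a sketch: compare the mean of an early segment Xa of the walk with the mean of a late segment Xb and express the difference as a z-score. Two things here are assumptions rather than content of the slide: the segment sizes (first 10% and last 50%, the usual defaults for this diagnostic) and the use of the plain sample variance for the standard errors, which ignores autocorrelation (Geweke's original test uses spectral density estimates instead).

```python
import math
import statistics

def geweke_z(x, first=0.1, last=0.5):
    """z-score comparing the early and late parts of one sequence of samples.

    A value near 0 (commonly |z| < 1 or so) suggests the two segments agree;
    a large |z| suggests the walk has not yet forgotten its starting point.
    """
    n = len(x)
    xa = x[: int(first * n)]          # early segment Xa
    xb = x[n - int(last * n):]        # late segment Xb
    # naive standard errors of the two means (assumes independent samples)
    se2_a = statistics.variance(xa) / len(xa)
    se2_b = statistics.variance(xb) / len(xb)
    return (statistics.mean(xa) - statistics.mean(xb)) / math.sqrt(se2_a + se2_b)
```

    Run online, the score would be recomputed as the walk grows (for instance on the node-degree sequence), and the walk is considered converged once the score settles near zero.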

  • Online Convergence Diagnostics: Gelman-Rubin

    Detects convergence for m>1 walks.

    [Figure: Walk 1, Walk 2, Walk 3; between-walks variance and within-walks variance]

    A. Gelman, D. Rubin, Inference from iterative simulation using multiple sequences, in Statistical Science, Volume 7, 1992.
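    The comparison of between-walks and within-walks variance can be sketched with the basic form of the Gelman-Rubin statistic. This is an assumed textbook variant, not necessarily the exact expression used in the talk; the function name is hypothetical and all walks are assumed to have the same length.

```python
import math
import statistics

def gelman_rubin(chains):
    """Potential scale reduction factor R for m > 1 equally long walks.

    R close to 1 means the walks agree with each other; values well above 1
    mean the between-walks variance still dominates.
    """
    m = len(chains)
    n = len(chains[0])
    means = [statistics.mean(c) for c in chains]
    grand_mean = statistics.mean(means)
    # B: between-walks variance, W: average within-walk variance
    b = n / (m - 1) * sum((mu - grand_mean) ** 2 for mu in means)
    w = statistics.mean([statistics.variance(c) for c in chains])
    pooled = (n - 1) / n * w + b / n
    return math.sqrt(pooled / w)
```

    A common rule of thumb is to keep walking until R drops below a threshold slightly above 1 for every metric of interest; the threshold itself is a choice left to the analyst.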

  • When do we reach equilibrium?

    Burn-in determined to be 3K.

    [Figure: Node Degree]

  • Methods Comparison: Node Degree

    Poor performance for BFS, RW.

    MHRW, RWRW produce good estimates: per chain; overall.

    [Figure: 28 crawls]

  • Sampling Bias: BFS

    Low-degree nodes are under-represented by two orders of magnitude.

    BFS is biased.

  • Sampling Bias: MHRW, RW, RWRW

    Degree distribution is identical to UNI (MHRW, RWRW).

    RW is as biased as BFS, but with smaller variance in each walk.

  • Practical Recommendations for Sampling Methods

    Use MHRW or RWRW. Do not use BFS or RW.

    Use formal convergence diagnostics: assess convergence online; use multiple parallel walks.

    MHRW vs RWRW: RWRW has slightly better performance; MHRW provides a ready-to-use sample.

  • Outline

    Motivation and Problem Statement

    Sampling Methodology

    Data Analysis

    Conclusion

  • FB Social Graph: Degree Distribution

    The degree distribution is not a power law.

    [Figure: degree distribution, annotated with a1=1.32 and a2=3.38]

  • FB Social Graph: Topological Characteristics

    Our MHRW sample: Assortativity Coefficient = 0.233; range of Clustering Coefficient [0.05, 0.35].

    [Wilson et al., Eurosys 09]: Assortativity Coefficient = 0.17; range of Clustering Coefficient [0.05, 0.18].

    More details in our paper and technical report.

  • Conclusion

    Compared graph crawling methods: MHRW, RWRW performed remarkably well; BFS, RW lead to substantial bias.

    Practical recommendations: correct for bias; use online convergence diagnostics; make proper use of multiple chains.

    Datasets publicly available: http://odysseas.calit2.uci.edu/research/osn.html

  • Thank you. Questions?