An overview of sampling methods for graphs - Application to Twitter

Antoine Rebecq

Université Paris X - INSEE

6/15/16



1 Statistics and networks: motivations and methods
    Graphs and stats
    Methods

2 Survey sampling
    Estimates
    Sampling design

3 Extending the sampling design
    Snowball sampling
    Adaptive sampling

4 Results and future work
    Results
    Sample size
    Future work

Section 1

Statistics and networks: motivations and methods


Subsection 1

Graphs and stats


Examples of statistics on graphs

Official statistics : measuring “hidden populations”


Rise of “big graphs”


Statistics of interest

Degree

Centrality

Clustering

Communities

. . .


Subsection 2

Methods


Methods for graph statistics

Algorithms (computer science, “big data”)

Model-based estimation

Sampling (“Design-based estimation”)


Computer science methods

Efficient algorithms (speed / memory).


Big data begets big graph

Twitter in 2013

Image from [1]


Computer science methods

Efficient algorithms (speed / memory).

Sometimes require sampling.


Model-based estimation

Famous graph models:

Erdős-Rényi

Price / Barabási-Albert (heavy-tailed degree distribution)

Watts-Strogatz / “small-world” (short path lengths)

Stochastic block models (communities)

Images from [9]

Model-based estimation: Erdős-Rényi (“random graphs”)

Model-based estimation: Barabási-Albert (“preferential attachment”)

Model-based estimation: Watts-Strogatz (“small world”)

Model-based estimation: Stochastic Block Models

Example: Star Wars: The Force Awakens

How many (real) users are behind these tweets?

Let’s write:

$y_k$ = number of tweets @starwars by user $k$ between 10/29/15, 7:48 and 10:48 PM EST

$z_k = \mathbf{1}\{y_k \geq 1\}$

Goal: estimate $N_C = T(Z)$, the number of users who tweeted at least once

Additionally, we write: $n_C = \sum_{k \in s} z_k$
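The notation above can be illustrated on a toy population. This is only a sketch: the six users, their tweet counts and the sample are made up for the example and do not come from the study.

```python
# Made-up tweet counts y_k for six hypothetical users (illustration only).
y = {"u1": 0, "u2": 3, "u3": 1, "u4": 0, "u5": 2, "u6": 0}

# z_k = 1{y_k >= 1}: 1 if user k tweeted at least once, else 0.
z = {k: int(count >= 1) for k, count in y.items()}

# N_C = T(Z): number of users in the whole population who tweeted.
N_C = sum(z.values())

# n_C: the same count restricted to a hypothetical sample s.
s = ["u2", "u4", "u5"]
n_C = sum(z[k] for k in s)

print(N_C, n_C)
```

In practice only $n_C$ is observed; $N_C$ is the quantity the sampling designs try to estimate.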

The Twitter graph

The Twitter graph ([6]):

Is directed

Has a heavy-tailed degree distribution

Has short path lengths

Sampling / Design-based estimation

Sampling: select a few vertices/edges and compute estimators using sample data. Very little exists about design-based statistical inference on networks (Kolaczyk 2009, [5]).

We try survey sampling methods used in official statistics institutes to make design-based inference about “big graphs”.


Section 2

Survey sampling


Subsection 1

Estimates


Horvitz-Thompson estimator

Population $U$: vertices of the Twitter graph. Assign all $k \in U$ an inclusion probability $P(k \in s) = \pi_k$.

Classic unbiased estimator for totals and means: Horvitz-Thompson

$$\hat{T}(Y)_{HT} = \sum_{k \in s} \frac{y_k}{\pi_k}$$

$$\hat{\bar{y}} = \frac{1}{N} \sum_{k \in s} \frac{y_k}{\pi_k}$$
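A minimal sketch of these two estimators, assuming the sampled values $y_k$ and their inclusion probabilities $\pi_k$ are available as parallel lists. The function names and the numbers are illustrative only.

```python
def horvitz_thompson_total(y_sample, pi_sample):
    """Estimate the population total: sum of y_k / pi_k over sampled units."""
    return sum(y_k / pi_k for y_k, pi_k in zip(y_sample, pi_sample))


def horvitz_thompson_mean(y_sample, pi_sample, N):
    """Estimate the population mean: the estimated total divided by N."""
    return horvitz_thompson_total(y_sample, pi_sample) / N


# Made-up sample of three users: tweet counts and inclusion probabilities.
y_s = [2, 0, 1]
pi_s = [0.5, 0.25, 0.1]

total = horvitz_thompson_total(y_s, pi_s)        # 2/0.5 + 0/0.25 + 1/0.1
mean = horvitz_thompson_mean(y_s, pi_s, N=100)   # N = 100 is hypothetical
print(total, mean)
```

Each sampled unit is weighted by $1/\pi_k$, so a unit that was unlikely to be drawn stands in for many unsampled ones.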

Variance of the Horvitz-Thompson estimator depends on the first- and second-order inclusion probabilities:

$\pi_k = P(k \in s)$

$\pi_{kl} = P(k, l \in s)$

$$V(\hat{T}(Y)_{HT}) = \sum_{k \in U} \sum_{l \in U} (\pi_{kl} - \pi_k \pi_l) \frac{y_k}{\pi_k} \frac{y_l}{\pi_l}$$
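As a worked special case, assume a design in which units are selected independently of each other, so that $\pi_{kl} = \pi_k \pi_l$ for $k \neq l$ and $\pi_{kk} = \pi_k$. Under that assumption every off-diagonal term vanishes and the double sum reduces to:

```latex
V\big(\hat{T}(Y)_{HT}\big)
  = \sum_{k \in U} (\pi_k - \pi_k^2)\,\frac{y_k^2}{\pi_k^2}
  = \sum_{k \in U} \frac{1 - \pi_k}{\pi_k}\, y_k^2
```

With a constant probability $\pi_k = p$, this becomes $\left(\frac{1}{p} - 1\right) \sum_{k \in U} y_k^2$.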


Calibrated estimator

Deville-Särndal, 1992 ([2]). Modification of the Horvitz-Thompson estimator to take auxiliary information into account. For example:

$T(Y)$ = Number of tweets @StarWars

N = Number of users in scope

Structure of number of followers

Number of verified users

. . .

Very similar to empirical likelihood methods ([8]).


Subsection 2

Sampling design


Sampling design : Bernoulli

Poisson sampling: for each $k \in U$, run a Bernoulli experiment with success probability $\pi_k$ to decide whether to include unit $k$ in the sample.

Bernoulli sampling: $\forall k,\ \pi_k = p$

The sample size of this design is not fixed. We set the expected sample size to 20000.
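The two designs can be sketched in a few lines. This is an illustration under assumptions: the population size `N` is hypothetical (much smaller than Twitter), and the seed is fixed only to make the run reproducible.

```python
import random


def poisson_sample(pi, rng):
    """Keep each unit k independently with probability pi[k]."""
    return [k for k, p_k in pi.items() if rng.random() < p_k]


def bernoulli_sample(units, p, rng):
    """Bernoulli sampling: Poisson sampling with a constant rate p."""
    return poisson_sample({k: p for k in units}, rng)


rng = random.Random(2016)   # fixed seed, for reproducibility only
N = 200_000                 # hypothetical population size
p = 20_000 / N              # rate giving an expected sample size of 20000
s = bernoulli_sample(range(N), p, rng)
print(len(s))               # random: close to 20000, but not fixed
```

The realized sample size varies from one draw to the next, which is what "non-fixed sample size" means here.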


Sampling design : Stratified Bernoulli

More efficient estimators/design : use of external information.

We write: $U = U_1 \oplus U_2$ (the $U_h$, $h = 1, 2$, being called “strata”) and draw two independent Bernoulli samples in $U_1$ and $U_2$.

Here:

$U_1$ = Followers of the official @starwars account

$U_2$ = Rest of Twitter users
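A sketch of the stratified draw, assuming each stratum is available as a list of user identifiers. The stratum sizes and the rates $p_1$, $p_2$ are made up for the example.

```python
import random


def stratified_bernoulli(strata, rates, rng):
    """Draw one independent Bernoulli sample per stratum.

    strata: dict mapping stratum label h to the list of its units
    rates:  dict mapping stratum label h to its sampling rate p_h
    """
    return {h: [k for k in units if rng.random() < rates[h]]
            for h, units in strata.items()}


rng = random.Random(0)                 # fixed seed, for reproducibility only
strata = {
    1: list(range(0, 10_000)),         # stand-in for followers of @starwars
    2: list(range(10_000, 110_000)),   # stand-in for the rest of the users
}
rates = {1: 0.5, 2: 0.05}              # made-up rates p_1 and p_2
samples = stratified_bernoulli(strata, rates, rng)
print(len(samples[1]), len(samples[2]))
```

Sampling the small stratum at a much higher rate concentrates the sample where users in scope are expected to be.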


Sampling design : Neyman allocation

The optimal variance of the Horvitz-Thompson estimator is obtained for (Neyman, [7]):

$$n_h = \frac{N_h S_h^2}{\sum_h N_h S_h^2}$$

Given the expected values, we set:

$n_1 = 9700$

$n_2 = 10300$

Sampling design: Stratified Bernoulli

Estimators for the two “simple” designs:

$$\hat{N}_{C1} = \frac{n_C}{p}$$

$$\hat{N}_{C2} = \frac{N_1}{n_1} n_{C1} + \frac{N - N_1}{n_2} n_{C2}$$
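Both estimators are one-line computations. The sketch below uses hypothetical function names and made-up counts, with $n_{C1}$ and $n_{C2}$ taken to be the numbers of sampled users in scope in each stratum.

```python
def estimate_bernoulli(n_C, p):
    """N_C1: sampled users in scope, scaled up by the sampling rate p."""
    return n_C / p


def estimate_stratified(n_C1, n_C2, N1, N, n1, n2):
    """N_C2: each stratum's count is scaled by its own expansion factor."""
    return (N1 / n1) * n_C1 + ((N - N1) / n2) * n_C2


# Made-up numbers, for illustration only.
est_1 = estimate_bernoulli(n_C=40, p=0.01)
est_2 = estimate_stratified(n_C1=30, n_C2=3, N1=1000, N=10000, n1=100, n2=90)
print(est_1, est_2)
```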


Variance estimators

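$$\hat{V}(\hat{T}(Y))_1 = \frac{1}{p}\left(\frac{1}{p} - 1\right) \sum_{k \in s} y_k^2$$

$$\hat{V}(\hat{T}(Y))_2 = \sum_{h=1}^{2} \frac{1}{p_h}\left(\frac{1}{p_h} - 1\right) \sum_{k \in s_h} y_k^2$$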
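These variance estimators translate directly into code. The following is a sketch with hypothetical names and made-up sample values; the stratified version simply adds the per-stratum Bernoulli terms.

```python
def var_bernoulli(y_sample, p):
    """(1/p)(1/p - 1) times the sum of squared y_k over the sample."""
    return (1 / p) * (1 / p - 1) * sum(y_k ** 2 for y_k in y_sample)


def var_stratified(y_by_stratum, rate_by_stratum):
    """Sum of the Bernoulli variance estimates computed within each stratum."""
    return sum(var_bernoulli(y_by_stratum[h], rate_by_stratum[h])
               for h in y_by_stratum)


# Made-up sampled values, for illustration only.
v1 = var_bernoulli([1, 0, 2], p=0.5)
v2 = var_stratified({1: [1, 0, 2], 2: [1, 1]}, {1: 0.5, 2: 0.25})
print(v1, v2)
```

Dividing the square root of such a variance by the point estimate gives a coefficient of variation.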


Section 3

Extending the sampling design


Snowball sampling

From now on, our sampling designs will include extensions: $s = s_0 \cup s_{ext}$

$s_0$ is still selected using stratified Bernoulli, but with an expected sample size of 1000, so that the expected sample size of $s$ is more or less 20000.


Subsection 1

Snowball sampling

Population $U$

Initial sample $s_0$

One-stage snowball extension $s = A(s_0)$

Formally, we write:

$B_i = \{i\} \cup \{j \in V,\ E_{ji} \neq \emptyset\}$

$A_i = \{i\} \cup \{j \in V,\ E_{ij} \neq \emptyset\}$

$s = A(s_0) = \bigcup_{i \in s_0} A_i$
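The sets $A_i$, $B_i$ and the extension $A(s_0)$ can be sketched on a small directed graph. The code assumes that $E_{ij}$ denotes the edges going from $i$ to $j$, that the graph is stored as a dict of out-neighbor lists with one key per vertex, and that $A(s_0)$ is the union of the $A_i$ for $i$ in $s_0$. The toy graph is made up.

```python
def snowball_extension(s0, out_neighbors):
    """One-stage snowball: union of A_i = {i} plus the out-neighbors of i."""
    s = set(s0)
    for i in s0:
        s.update(out_neighbors.get(i, ()))
    return s


def in_neighborhoods(out_neighbors):
    """B_i = {i} plus every j that has an edge j -> i."""
    B = {i: {i} for i in out_neighbors}
    for j, targets in out_neighbors.items():
        for i in targets:
            B.setdefault(i, {i}).add(j)
    return B


# Toy directed graph: vertex -> list of vertices it points to.
out_neighbors = {1: [2, 3], 2: [3], 3: [], 4: [1]}

s = snowball_extension({1}, out_neighbors)
B = in_neighborhoods(out_neighbors)
print(s, B)
```

A vertex $i$ ends up in $s$ exactly when at least one member of $B_i$ is in the initial sample, which is why the $B_i$ drive the inclusion probabilities.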

$$\hat{N}_{C3} = \sum_{i \in s} \frac{z_i}{1 - \pi(B_i)}$$

where $\pi(B_i)$ is the probability that no unit of $B_i$ is selected in the initial sample $s_0$:

$$\pi(B_i) = \prod_{k \in B_i} (1 - P(k \in s_0)) = q_{S_1}^{\#(B_i \cap U_1)} \cdot q_{S_2}^{\#(B_i \cap U_2)}$$

with $q_{S_h} = 1 - p_h$ the probability that a unit of stratum $h$ is left out of $s_0$.
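A sketch of this estimator, taking $\pi(B_i)$ to be the product over the members of $B_i$ of the probability of not being drawn in the initial stratified Bernoulli sample. The sets $B_i$, the strata, the rates and the $z_i$ values are all made up.

```python
def prob_missed(B_i, stratum_of, q):
    """pi(B_i): product of q_h over the members of B_i (none of them drawn)."""
    prob = 1.0
    for k in B_i:
        prob *= q[stratum_of[k]]
    return prob


def snowball_estimate(s, z, B, stratum_of, q):
    """Sum over the final sample of z_i / (1 - pi(B_i))."""
    return sum(z[i] / (1.0 - prob_missed(B[i], stratum_of, q)) for i in s)


# Toy inputs, for illustration only.
B = {1: {1, 4}, 2: {1, 2}, 3: {1, 2, 3}, 4: {4}}   # in-neighborhoods B_i
stratum_of = {1: 1, 2: 1, 3: 2, 4: 2}              # stratum of each vertex
q = {1: 0.5, 2: 0.9}                               # q_h = 1 - p_h
z = {1: 1, 2: 0, 3: 1, 4: 1}                       # 1 if the user tweeted
s = {1, 2, 3}                                      # final snowball sample

estimate = snowball_estimate(s, z, B, stratum_of, q)
print(estimate)
```

Here $1 - \pi(B_i)$ plays the role of the inclusion probability of unit $i$ in the Horvitz-Thompson formula.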

$$\hat{V}(\hat{N}_{C3}) = \sum_{i \in s} \sum_{j \in s} \frac{z_i z_j}{\pi(B_i \cup B_j)} \gamma'_{ij}$$

where:

$$\gamma'_{ij} = \frac{\pi(B_i \cup B_j) - \pi(B_i)\pi(B_j)}{[1 - \pi(B_i)][1 - \pi(B_j)]}$$


Subsection 2

Adaptive sampling

Adaptive sampling (Thompson, [10]):

Used in official statistics to measure the number of drug users or HIV-positive people

Sampling design often compared to the video game “Minesweeper”

Image from [11]

Once a unit bearing the characteristic of interest (i.e. a user who tweeted about the Star Wars trailer) is found, its whole network (i.e. its friends, friends of friends, etc. who have tweeted about Star Wars) is included in the sample.

Estimator:

$$\hat{N}_{C4} = \sum_{k=1}^{K} \frac{n^*_{Ck} J_k}{\pi_{gk}}$$

where:

$K$ = number of networks

$y^*_k$ = total of $Y$ in network $k$

$n^*_{Ck}$ = number of people with $y_k \geq 1$ in network $k$

$J_k = \mathbf{1}\{k \in C\}$

$\pi_{gk}$ = probability that the initial sample intersects network $k$
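A sketch of this estimator under a simplifying assumption: the initial sample is a plain (unstratified) Bernoulli sample with rate $p$, so that a network of a given size is intersected with probability $1 - (1-p)^{\text{size}}$. This matches the $(1-p)^{n_{gk}}$ terms of the variance slide, but the function names and the toy networks are made up.

```python
def intersection_probability(network_size, p):
    """Probability that a Bernoulli(p) initial sample hits the network."""
    return 1.0 - (1.0 - p) ** network_size


def adaptive_estimate(networks, p):
    """Sum over networks of n*_Ck * J_k / pi_gk.

    networks: list of (n_star_C, size, J) tuples, one per network k.
    """
    return sum(n_star_C * J / intersection_probability(size, p)
               for n_star_C, size, J in networks)


# Toy networks: (users in scope, network size, indicator J_k).
networks = [(3, 3, 1), (1, 1, 1), (5, 5, 0)]
estimate = adaptive_estimate(networks, p=0.1)
print(estimate)
```

Large networks are almost certain to be reached, so they get weights close to 1, while isolated users keep the full $1/p$ weight.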

When using an adaptive design, it is often better to use the Rao-Blackwellized version of the previous estimator. It has a very simple closed form in the case of the stratified adaptive design.

$$\hat{N}_{C5} = n_0 + \sum_{r=1}^{K} \frac{n_r}{1 - (1-p)^{n_r}}$$

where: $n_0 = \#s_0$ and $s_0 = \cup_r \{k \in s,\ \delta(k, C) = 1\}$ is the union of the sides of $C$.

Adaptive sampling - Variance

$$V(\hat{N}_{C4}) = \sum_{k=1}^{K} \sum_{k'=1}^{K} \frac{y_k y_{k'}}{\pi_{gkk'}} \left(\frac{\pi_{gkk'}}{\pi_{gk}\pi_{gk'}} - 1\right)$$

where $\pi_{gkk'}$ is the probability that the initial sample intersects both networks $k$ and $k'$, $n_{gk}$ being the number of units in network $k$:

$$\pi_{gkk'} = 1 - (1 - \pi_{gk}) - (1 - \pi_{gk'}) + (1-p)^{n_{gk} + n_{gk'}}$$

Variance estimation for the Rao-Blackwell estimator can be done by selecting $m$ samples:

$$\hat{V}(\hat{N}_{C5}) = \hat{V}(\hat{N}_{C4}) - \frac{1}{m-1} \sum_{i=1}^{m} (\hat{N}_{C5i} - \hat{N}_{C4})^2$$


Section 4

Results and future work


Subsection 1

Results


Results

Design       n        n_scope   n_0    N_C      CV      Deff
Bernoulli    20013    3946      -      354121   0.231   1.04
Stratified   20094    9832      -      316889   0.097   0.68
1-snowball   159957   73570     1000   331097   0.031   0.60

Mean number of tweets @StarWars per user: 1.18 ± 0.07

This suggests that bots are not responsible for this very large number of tweets (see [4], [3])!


Subsection 2

Sample size


Snowball sampling - sample size

Expected sample size ≈ 20000.

Actual sample size : > 150000 !


Adaptive sampling

With our test subject (tweets @AmericanIdol), the average network size was no greater than a few units (≈ 10000 tweets in the scope).

With Star Wars (≈ 300000 tweets in the scope, with far fewer tweets per person), we couldn’t get to the end of every network!


Subsection 3

Future work


Future work

Control sample size

Estimates and calibration for other statistics (centrality, clustering coefficients, path length, etc.)

Take advantage of graph description using models


Auxiliary information for the Barabási-Albert model:

Degree Centrality Local clustering Mean path Max path
Degree ++ - - - -
Centrality - - - -
Local clustering + +
Mean path ++
Max path


Conclusion

Thank you !

http://nc233.com/isnps2016

@nc233

[1] Paul Burkhardt and Chris Waring. An NSA big graph experiment. Presentation at the Carnegie Mellon University SDI/ISTC Seminar, Pittsburgh, PA, 2013.

[2] Jean-Claude Deville and Carl-Erik Särndal. Calibration estimators in survey sampling. Journal of the American Statistical Association, 87(418):376–382, 1992.

[3] Emilio Ferrara. “Manipulation and abuse on social media” by Emilio Ferrara with Ching-man Au Yeung as coordinator. SIGWEB Newsl., (Spring):4:1–4:9, April 2015.

[4] Emilio Ferrara, Onur Varol, Clayton Davis, Filippo Menczer, and Alessandro Flammini. The rise of social bots. arXiv preprint arXiv:1407.5225, 2014.

[5] Eric D. Kolaczyk. Statistical analysis of network data. Springer, 2009.

[6] Seth A. Myers, Aneesh Sharma, Pankaj Gupta, and Jimmy Lin. Information network or social network?: the structure of the Twitter follow graph. In Proceedings of the companion publication of the 23rd international conference on World Wide Web companion, pages 493–498. International World Wide Web Conferences Steering Committee, 2014.

[7] Jerzy Neyman. On the two different aspects of the representative method: the method of stratified sampling and the method of purposive selection. Journal of the Royal Statistical Society, pages 558–625, 1934.

[8] Art B. Owen. Empirical likelihood. CRC Press, 2010.

[9] Tiago P. Peixoto. The graph-tool python library. figshare, 2014.

[10] Steven K. Thompson. Adaptive cluster sampling. Journal of the American Statistical Association, 85(412):1050–1059, 1990.

[11] Steven K. Thompson. Stratified adaptive cluster sampling. Biometrika, pages 389–397, 1991.