Efficient degree estimation

How many people do you know?: Efficiently estimating personal network size

Tian Zheng

Department of Statistics

Columbia University

April 22nd, 2009

1 / 34


Acknowledgements

- Collaborators
  - Tyler McCormick (Statistics, Columbia University)
  - Matt Salganik (Sociology, Princeton University)

- NSF for research support.

2 / 34


Introduction

Social network analysis

Social networks

- Social network analysis is the study of the structure of relationships between individuals.

- It is studied widely in the social sciences and is gaining popularity in the physical sciences and in government.

- We concentrate on one aspect of social network analysis: estimating personal network sizes.

3 / 34



Why estimate personal network size?

- Personal network size, or degree, is the number of people known by a particular individual.

- Degree is of interest in its own right to social scientists, and it can also help explain other social phenomena. For example:
  - Conley (2004) wanted to know whether siblings who knew more people tended to be more successful.
  - Degree helps in studying the dynamics of social processes such as the spread of diseases and the evolution of group behavior.

- There are few cases where it is practical to survey all actors in a network.

4 / 34



Killworth, McCarty et al.’s “how many X’s do you know” surveys

- McCarty et al. (2001). Comparing two methods for estimating network size. Human Organization 60, 28-39.

- Telephone surveys of 1370 Americans.

- Responses from 32 questions of the form “how many X’s do you know”.

- Definition of “knowing someone”: you know them and they know you by sight or by name, you could contact them, they live within the United States, and there has been some contact (in person, by telephone, or by mail) in the past two years. (Symmetric “knowing”.)

- Example: how many people do you know who are currently homeless?

5 / 34



Killworth, McCarty et al.’s “how many X’s do you know” surveys

- (Female names) Stephanie, Jacqueline, Kimberly, Nicole, Christina, Jennifer

- (Male names) Christopher, David, Anthony, Robert, James, Michael

- (Occupations) Commercial pilot, gun dealer, postal worker

- (Ethnicity) American Indian

- (Experiences) Twin, woman who adopted a child in the past year, gave birth in the past year, widow(er) under 65, opened a business in the past year, homicide victim, suicide in the past year, died in an auto accident, diabetic, kidney dialysis, AIDS, HIV-positive, rape victim, male in prison, homeless, member of Jaycees

6 / 34



Previous attempts to estimate degree with “how many X’s” data

The scale-up method (Killworth et al. 1998)

- Suppose you know 1 person named Nicole.

- At the time of the survey, 358,000 out of 280 million Americans were named Nicole.

- Assume the 1 Nicole represents 358,000 / 280,000,000 ≈ 0.13% of your acquaintances.

- Estimate: you know about 1/0.0013 ≈ 770 people.
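The Nicole arithmetic can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the function name `scale_up_degree` and its arguments are hypothetical, and it pools counts over all subpopulations asked about, which reduces to the single-name calculation when only one name is used. At full precision the estimate is about 782; the 770 on the slide comes from rounding the fraction to 0.13% first.

```python
def scale_up_degree(known_counts, subpop_sizes, total_population):
    """Scale-up estimate: (people known in the subpopulations) divided by
    (share of the population that belongs to those subpopulations)."""
    if len(known_counts) != len(subpop_sizes):
        raise ValueError("need one subpopulation size per reported count")
    population_share = sum(subpop_sizes) / total_population
    return sum(known_counts) / population_share


# One respondent who knows 1 Nicole, using the figures from the slide.
nicole_estimate = scale_up_degree([1], [358_000], 280_000_000)
print(round(nicole_estimate))  # 782
```

With several names, the same call takes one count and one subpopulation size per name.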

7 / 34




Assumptions held by the scale-up methods

- Killworth’s method assumes that all people in the population are equally likely to know members of a given subpopulation (i.e., random mixing in social space).

- It also assumes that respondents can accurately recall their acquaintances during a survey.

8 / 34




Previous attempts to estimate degree with “how many X’s” data

The Zheng et al. (2006) overdispersed model

- $y_{ik}$ = number of persons in group $k$ known by person $i$.

- $y_{ik} \sim \text{Poisson}(a_i b_k g_{ik})$

- For each group $k$, we let $g_{ik}$ follow a gamma distribution with mean 1 and shape parameter $1/(\omega_k - 1)$. Thus, $y_{ik} \sim \text{Negative Binomial}(\text{mean} = a_i b_k,\ \text{overdispersion} = \omega_k)$.

- $a_i = e^{\alpha_i}$: the “gregariousness” of person $i$, i.e., degree.

- $b_k = e^{\beta_k}$: the size of group $k$ in the social network.

- $\omega_k$ is the overdispersion of group $k$.
  - $\omega_k = 1$ means no overdispersion (Poisson model).
  - Higher values of $\omega_k$ show overdispersion.

- Overdispersion represents social structure.

9 / 34
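To make the role of the overdispersion parameter concrete, here is a small simulation sketch in plain Python. It rests on stated assumptions: "overdispersion" is read as the variance-to-mean ratio, and the gamma-distributed Poisson rate is parameterized so that this ratio equals ω exactly, which is a convenient choice and may differ in detail from the parameterization on the slide. The helper names `poisson_draw` and `overdispersed_draw` and the values 2.0 and 3.0 are hypothetical.

```python
import math
import random


def poisson_draw(rate, rng):
    """Draw one Poisson(rate) count with Knuth's multiplication method."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def overdispersed_draw(mean, omega, rng):
    """Draw a count with expectation `mean` and variance `omega * mean`."""
    if omega <= 1.0:
        return poisson_draw(mean, rng)  # omega = 1 is the plain Poisson case
    # Gamma rate with E[rate] = mean and Var[rate] = mean * (omega - 1),
    # so the mixed count has variance mean + mean * (omega - 1) = omega * mean.
    rate = rng.gammavariate(mean / (omega - 1.0), omega - 1.0)
    return poisson_draw(rate, rng)


rng = random.Random(0)
draws = [overdispersed_draw(2.0, 3.0, rng) for _ in range(50_000)]
sample_mean = sum(draws) / len(draws)
sample_var = sum((y - sample_mean) ** 2 for y in draws) / len(draws)
print(f"mean = {sample_mean:.2f}, variance / mean = {sample_var / sample_mean:.2f}")
```

The printed ratio should sit near 3, whereas a plain Poisson sample would give a ratio near 1.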




Zheng et al. (2006) overdispersed model

- The model captures the overall extent of non-random mixing (social structure) but does not correct the degree estimates for it.

- It does not address imperfect recall.

10 / 34


Three sources of errors for degree estimation


- Barrier effect: non-random mixing is common.

- Transmission errors: the possibility that a respondent knows someone in a category without realizing it.
  - Solution: we use first names as our X’s to eliminate transmission errors.

- Imperfect recall.

11 / 34



Non-random mixing

Figure: An acquaintanceship network.

12 / 34



Biases in the degree estimates based on first names due to non-random mixing

Figure: Age distributions of male names.

13 / 34



Effects of the under-recall of large subpopulations

Figure: The calibration curve.

14 / 34



Effects of the under-recall of large subpopulations

Figure: Twelve names without the calibration curve. (Plot titled “Developing the Calibration Curve”: estimated beta against actual beta, both on the log scale from −7.0 to −4.0, for Michael, Christina, Christopher, Jacqueline, James, Jennifer, Anthony, Kimberly, Robert, Stephanie, David, and Nicole. Legend: y = x, least squares fit, calibration curve.)

15 / 34



Calibration curve


- Barrier effect: non-random mixing is common.

- Transmission errors: the possibility that a respondent knows someone in a category without realizing it.
  - Solution: we use first names as our X’s to eliminate transmission errors.

- Imperfect recall.
  - Solution 1: Use a calibration curve in the model to adjust for under-recall of large groups.
  - Solution 2: Use first names that each make up about 0.1-0.2 percent of the population, where the calibration curve is close to the y = x line.

16 / 34



Calibration curve

Correcting for Under-Recall: calibration curve f(·)

Figure: The calibration curve. (Plot titled “Applying the Calibration Curve”: estimated beta against actual beta, both on the log scale from −7.0 to −4.0, for the same twelve names, shown with and without the recall adjustment. Legend: y = x, least squares fit, calibration curve, no recall function.)

17 / 34


The Latent Non-Random Mixing Model


Modeling goals and data

- Account for non-random mixing due to gender and age.

- We use data from McCarty et al. (2001). There are 1370 respondents and twelve names.

18 / 34



Modeling “how many X’s do you know” with non-random mixing and imperfect recall

$$y_{ik} \sim \text{Neg-Binom}(\mu_{ike}, \omega_k), \quad \text{where } \mu_{ike} = d_i \, f\!\left(\sum_{a=1}^{A} m(e,a) \frac{N_{ak}}{N_a}\right).$$

- $d_i$ is the degree of person $i$.

- $m(e,a)$ is a matrix of mixing coefficients: the proportion of respondent (ego) group $e$’s network made up of alter group $a$.
  - We use 8 alter groups (4 age categories by gender) and 6 ego groups (3 age categories by gender).

- $N_{ak}/N_a$ is the proportion of alter group $a$ made up of people with name $k$; it is assumed known (available from the SSA).

- $\omega_k$ is the overdispersion.

- $f(\cdot)$ is the calibration curve.

19 / 34
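As a rough illustration of how the mean of this model is assembled, the sketch below computes $\mu_{ike}$ for one respondent and one name. Everything numeric here is made up for the example, the function name `expected_count` is hypothetical, and the calibration curve defaults to the identity because its functional form is not given on these slides.

```python
def expected_count(degree, mixing_row, name_share, calibrate=lambda share: share):
    """mu_ike = d_i * f( sum_a m(e, a) * N_ak / N_a ).

    mixing_row[a] is m(e, a) for the respondent's ego group e;
    name_share[a] is N_ak / N_a for the name k in question;
    calibrate is the recall adjustment f (identity by default, an assumption).
    """
    if len(mixing_row) != len(name_share):
        raise ValueError("need one name share per alter group")
    weighted_share = sum(m * s for m, s in zip(mixing_row, name_share))
    return degree * calibrate(weighted_share)


# Hypothetical respondent with degree 600 and four alter groups.
mixing_row = [0.1, 0.4, 0.3, 0.2]          # sums to 1 across alter groups
name_share = [0.001, 0.002, 0.003, 0.001]  # share of each alter group with the name
mu = expected_count(600, mixing_row, name_share)
print(f"expected number known: {mu:.2f}")
```

Passing a different `calibrate` function is where an under-recall adjustment would enter.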



Priors and Hyperparameters

- Parameters are estimated using a multilevel model and Bayesian inference.

- The network size and mixing coefficients are modeled with log-normal distributions.

- This parameterization is based on previous research on the degree distribution of the acquaintanceship network (see McCarty et al. 2001).

- Hyperparameters are assigned noninformative uniform priors.

20 / 34


Results

Results-Estimated Degree

Figure: Estimated degree, shown in two panels (Male, Female); horizontal axis: estimated degree, 0 to 6000.

21 / 34



Results-Estimated Degree

Figure: Histograms of the estimated degree distribution. Red lines show the estimated median. (Six panels: Males and Females by Youth (18−24), Adult (25−64), and Senior (65+); horizontal axis from 0 to 4000. Median values printed in the panels: 562, 472, 482, 461, 524, 382.)

22 / 34



Results-Mixing Matrix

Figure: The mixing matrix estimates +/- 2 standard errors. (Two panels, Female Egos and Male Egos. For each ego group (Youth, Adult, Senior), bars show the fraction of the network, from 0 to 0.4 on each side, made up of male alters and female alters in the alter age groups 0−20, 21−40, 41−60, and 61+.)

23 / 34




- Barrier effect: non-random mixing is common.
  - Solution: the latent non-random mixing model.

- Transmission errors: the possibility that a respondent knows someone in a category without realizing it.
  - Solution: we use first names as our X’s to eliminate transmission errors.




24 / 34


Design for future surveys

Simple Estimates of Degree

- Our model addresses
  - non-random mixing
  - recall bias

- Our full model requires some expertise and time to fit.

- A simple design strategy would allow social scientists to obtain scale-up degree estimates that are equivalent to those of our method.

25 / 34



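Deriving the Simple Estimate

Recall the expectation of our model, here without the recall adjustment $f(\cdot)$:

$$\mu_{ike} = E(y_{ike}) = d_i \sum_{a=1}^{A} m(e,a) \frac{N_{ak}}{N_a}.$$

We then proceed by taking the sum over all $K$ names:

$$
\begin{aligned}
\mu_{i \cdot e} = E\left(\sum_{k=1}^{K} y_{ike}\right)
&= \sum_{k=1}^{K} \left[ d_i \sum_{a=1}^{A} m(e,a) \frac{N_{ak}}{N_a} \right] \\
&= d_i \sum_{a=1}^{A} m(e,a) \left[ \sum_{k=1}^{K} \frac{N_{ak}}{N_a} \right] \\
&= d_i c \sum_{a=1}^{A} m(e,a) \qquad \left(\text{if } \sum_{k=1}^{K} \frac{N_{ak}}{N_a} = c, \text{ a constant}\right) \\
&= d_i c .
\end{aligned}
$$

The last step uses the fact that the mixing proportions $m(e,a)$ sum to one over the alter groups $a$.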
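A quick numerical check of this derivation, as a sketch with invented numbers: when every alter group has the same combined name share c, the expected total count is the degree times c regardless of the mixing row, so dividing the total by c recovers the degree. The function name `expected_name_total` and all values below are hypothetical.

```python
def expected_name_total(degree, mixing_row, shares_by_alter_group):
    """E[sum_k y_ik] = sum_k d_i * sum_a m(e, a) * N_ak / N_a.

    shares_by_alter_group[a][k] holds N_ak / N_a.
    """
    n_names = len(shares_by_alter_group[0])
    return sum(
        degree * sum(m * shares[k] for m, shares in zip(mixing_row, shares_by_alter_group))
        for k in range(n_names)
    )


# Three alter groups, two names; each row sums to c = 0.004 (balanced names).
balanced_shares = [[0.001, 0.003], [0.002, 0.002], [0.003, 0.001]]
c = 0.004

total_a = expected_name_total(500, [0.2, 0.5, 0.3], balanced_shares)
total_b = expected_name_total(500, [0.7, 0.2, 0.1], balanced_shares)  # different mixing
print(f"{total_a:.4f} {total_b:.4f}")  # both equal degree * c up to rounding
print(f"simple estimate of degree: {total_a / c:.1f}")
```

The two mixing rows give the same total, which is the point of the condition: the mixing matrix drops out of the estimate.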

26 / 34



The scale-down condition for unbiased scale-up estimates.

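- The scale-down condition: the names are selected so that

$$\sum_{k=1}^{K} \frac{N_{ak}}{N_a} = \text{constant} \quad \text{for every alter group } a.$$

- That is, the set of names is balanced: taken together, the names have the same distribution as the population on the demographic variables that define the alter groups.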
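One way to check a candidate name list against this condition is sketched below. The counts are invented, the function names are hypothetical, and the 5% relative tolerance is an arbitrary illustrative threshold rather than a recommendation from the talk.

```python
def combined_name_shares(name_counts_by_group, group_sizes):
    """For each alter group a, return sum_k N_ak / N_a."""
    return [sum(counts) / size for counts, size in zip(name_counts_by_group, group_sizes)]


def satisfies_scale_down(shares, rel_tol=0.05):
    """True when the combined shares are equal across groups up to rel_tol."""
    return max(shares) <= (1.0 + rel_tol) * min(shares)


group_sizes = [1_000_000, 2_000_000, 1_500_000]            # N_a for three alter groups
balanced = [[1_000, 3_000], [4_000, 4_000], [4_500, 1_500]]  # N_ak, two names per group
skewed = [[1_000, 3_000], [4_000, 4_000], [500, 500]]        # names rare in the third group

print(satisfies_scale_down(combined_name_shares(balanced, group_sizes)))  # True
print(satisfies_scale_down(combined_name_shares(skewed, group_sizes)))    # False
```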

27 / 34



Selecting Names

Suggestions for Selecting Subpopulations (X’s):

1. Using first names eliminates barrier and transmission effects.

2. Using rare names (we suggest 0.1–0.2 percent of the population) minimizes recall bias.

3. Select names with complementary age profiles.
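Suggestion 2 amounts to a simple filter on population share, sketched here. Only the Nicole count and the 280 million population figure come from the earlier scale-up slide; the other two entries and the function name `names_in_share_range` are hypothetical.

```python
def names_in_share_range(name_counts, population, low=0.001, high=0.002):
    """Keep names held by between `low` and `high` of the population."""
    return sorted(
        name for name, count in name_counts.items()
        if low <= count / population <= high
    )


candidates = {
    "Nicole": 358_000,                # about 0.13% of 280 million
    "HypotheticalCommon": 4_000_000,  # about 1.4%, too common
    "HypotheticalRare": 50_000,       # about 0.02%, too rare
}
print(names_in_share_range(candidates, 280_000_000))  # ['Nicole']
```

This covers only the rarity criterion; balancing the age profiles of the chosen names is a separate step.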

28 / 34



Selecting Complementary Name-Age Profiles

Figure: Selecting complementary name-age profiles.

29 / 34



Simulated Data Experiment

Figure: Comparison of full and simple model performance.

30 / 34




- Barrier effect: non-random mixing is common.
  - Solution 1: the latent non-random mixing model.
  - Solution 2: select names that satisfy the scale-down condition.

- Transmission errors: the possibility that a respondent knows someone in a category without realizing it.
  - Solution: we use first names as our X’s to eliminate transmission errors.




31 / 34



Limitations and Future Directions

- We account for non-random mixing from age and gender, but other factors could still be present. The Census Bureau collects other information on first names but does not release such data.

- The behavior of the calibration curve f(·) on larger groups is not well established.

32 / 34



Conclusions

- We propose a model to estimate network size based on aggregated count data.

- This model accounts for non-random mixing based on gender and age and corrects for under-recall.

- We also show that the ‘scale-up’ approach gives a reasonable approximation to our full model under given conditions.

33 / 34



THANK YOU!

34 / 34
