How many people do you know?: Efficiently estimating personal network...
TRANSCRIPT
Efficient degree estimation
How many people do you know?: Efficiently estimating personal network size
Tian Zheng
Department of Statistics
Columbia University
April 22nd, 2009
Acknowledgements
- Collaborators
  - Tyler McCormick (Statistics, Columbia University)
  - Matt Salganik (Sociology, Princeton University)
- NSF for research support.
Introduction
Social network analysis
Social networks
- Social network analysis is the study of the structure of relationships between individuals.
- It is studied widely in the social sciences and is gaining popularity in the physical sciences and in government.
- We concentrate on one aspect of social network analysis: estimating personal network sizes.
Why estimate personal network size?
- Personal network size, or degree, is the number of people known by a particular individual.
- Degree is a topic of interest in its own right to social scientists, and it can also help explain other social phenomena. For example:
  - Conley (2004) wanted to know whether siblings who knew more people tended to be more successful.
  - Studying the dynamics of social processes such as the spread of diseases and the evolution of group behavior.
- There are few cases where it is practical to survey all actors in a network.
Killworth, McCarty et al.’s “how many X’s do you know” surveys
- McCarty et al. (2001). Comparing two methods for estimating network size. Human Organization 60, 28-39.
- Telephone surveys of 1370 Americans.
- Responses from 32 questions of the form “how many X’s do you know”.
- Define “knowing someone”: you know them and they know you by sight or by name, you could contact them, they live within the United States, and there has been some contact (either in person, by telephone, or by mail) in the past two years. (Symmetric “knowing”.)
- Example: how many people do you know who are currently homeless?
- (Female names) Stephanie, Jacqueline, Kimberly, Nicole, Christina, Jennifer
- (Male names) Christopher, David, Anthony, Robert, James, Michael
- (Occupations) Commercial pilot, gun dealer, postal worker
- (Ethnicity) American Indian
- (Experiences) Twin, woman adopted kid in past year, gave birth in past year, widow(er) under 65, opened business in past year, homicide victim, suicide in past year, died in auto accident, diabetic, kidney dialysis, AIDS, HIV-positive, rape victim, male in prison, homeless, member of Jaycees
Previous attempts to estimate degree with “how many X’s” data
The scale-up method (Killworth et al. 1998)
- Suppose you know 1 person named Nicole.
- At the time of the survey, 358,000 out of 280 million Americans are named Nicole.
- Assume the 1 Nicole represents 0.13% = 358,000 / 280,000,000 of your acquaintances.
- Estimate: you know 1/0.0013 ≈ 770 people.
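The arithmetic on this slide can be sketched as a small function. This is only an illustration of the single-group calculation shown above; the function name `scale_up_degree` is made up for this sketch and is not taken from the talk.

```python
def scale_up_degree(known_in_group: float, group_size: float, population_size: float) -> float:
    """Scale-up estimate of degree from one subpopulation.

    Divides the number of group members the respondent knows by the
    group's share of the whole population.
    """
    if group_size <= 0 or population_size <= 0:
        raise ValueError("group_size and population_size must be positive")
    prevalence = group_size / population_size  # share of the population in the group
    return known_in_group / prevalence


# Numbers from the slide: 1 known Nicole, 358,000 Nicoles, 280 million Americans.
estimate = scale_up_degree(1, 358_000, 280_000_000)
print(round(estimate))
```

Without rounding the prevalence to 0.13%, the estimate comes out near 782 rather than 770; the difference is only due to rounding on the slide.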
Assumptions held by the scale-up methods
- Killworth’s method assumes that all people in the population are equally likely to know members of a given subpopulation (i.e., random mixing in social space).
- It also assumes that respondents will be able to accurately recall their acquaintances during a survey.
The Zheng et al. (2006) overdispersed model
- $y_{ik}$ = number of persons in group $k$ known by person $i$.
- $y_{ik} \sim \text{Poisson}(a_i b_k g_{ik})$
- For each group $k$, we let $g_{ik}$ follow a gamma distribution with mean 1 and shape parameter $1/(\omega_k - 1)$. Thus, $y_{ik} \sim \text{Negative Binomial}(\text{mean} = a_i b_k, \text{overdispersion} = \omega_k)$.
- $a_i = e^{\alpha_i}$, “gregariousness” of person $i$ (degree).
- $b_k = e^{\beta_k}$, size of group $k$ in the social network
- $\omega_k$ is the overdispersion of group $k$
  - $\omega_k = 1$ is no overdispersion (Poisson model)
  - Higher values of $\omega_k$ show overdispersion
- Overdispersion represents social structure
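To make the role of the overdispersion parameter concrete, the gamma-Poisson mixture stated above can be simulated with the standard library alone. This is a rough sketch, not the authors' code: the helper names, the seed, and the values of a_i and b_k are hypothetical, and the Poisson sampler is a simple textbook method that is only suitable for small rates.

```python
import math
import random
import statistics


def sample_poisson(rate: float, rng: random.Random) -> int:
    """Draw one Poisson variate with Knuth's multiplication method (adequate for small rates)."""
    threshold = math.exp(-rate)
    count = 0
    product = rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def sample_count(a_i: float, b_k: float, omega_k: float, rng: random.Random) -> int:
    """Draw y_ik ~ Poisson(a_i * b_k * g_ik), with g_ik gamma distributed with mean 1."""
    rate = a_i * b_k
    if omega_k > 1.0:
        shape = 1.0 / (omega_k - 1.0)
        # random.gammavariate(alpha, beta) has mean alpha * beta,
        # so a scale of 1 / shape gives g_ik a mean of 1.
        rate *= rng.gammavariate(shape, 1.0 / shape)
    return sample_poisson(rate, rng)


rng = random.Random(0)
a_i, b_k = 600.0, 0.002  # hypothetical degree and group share, so a_i * b_k = 1.2
poisson_draws = [sample_count(a_i, b_k, 1.0, rng) for _ in range(20_000)]
overdispersed_draws = [sample_count(a_i, b_k, 3.0, rng) for _ in range(20_000)]

for label, draws in (("omega = 1", poisson_draws), ("omega = 3", overdispersed_draws)):
    print(label, "mean:", statistics.mean(draws), "variance:", statistics.variance(draws))
```

Both settings should give a sample mean close to a_i * b_k, while the variance is visibly larger when omega_k exceeds 1, which is the extra spread the model attributes to social structure.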
Zheng et al (2006) overdispersed model
- Models the overall extent of non-random mixing (social structure) but does not correct the degree estimates for it.
- Does not address the issue of imperfect recall.
Three sources of errors for degree estimation
- Barrier effect: non-random mixing is common.
- Transmission errors: the possibility that a respondent knows someone in a category without realizing it.
  - Solution: we use first names as our X’s to eliminate transmission errors.
- Imperfect recall.
Non-random mixing
Figure: An acquaintanceship network.
Biases in the degree estimates based on first names due to non-random mixing
Figure: Age distributions of male names.
Effects of the under-recall of large subpopulations
Figure: The calibration curve.
Figure: Twelve names without the calibration curve. Plot title: Developing the Calibration Curve. Axes: actual beta (log scale, x) and estimated beta (log scale, y), both from −7.0 to −4.0. Legend: y = x, least squares fit, calibration curve. Points are labeled with the twelve names (Michael, Christina, Christopher, Jacqueline, James, Jennifer, Anthony, Kimberly, Robert, Stephanie, David, Nicole).
Calibration curve
Three sources of errors for degree estimation
- Barrier effect: non-random mixing is common.
- Transmission errors: the possibility that a respondent knows someone in a category without realizing it.
  - Solution: we use first names as our X’s to eliminate transmission errors.
- Imperfect recall.
  - Solution 1: Use the calibration curve in the model to adjust for under-recall of large groups.
  - Solution 2: Use first names that each make up about 0.1-0.2 percent of the population, where the calibration curve is close to the y = x line.
Correcting for Under-Recall: calibration curve f(·)
Figure: The Calibration Curve. Plot title: Applying the Calibration Curve. Axes: actual beta (log scale, x) and estimated beta (log scale, y), both from −7.0 to −4.0. Legend: y = x, least squares fit, calibration curve, no recall function. Each of the twelve names (Michael, Christina, Christopher, Jacqueline, James, Jennifer, Anthony, Kimberly, Robert, Stephanie, David, Nicole) is labeled twice in the plot.
The Latent Non-Random Mixing Model
The latent non-random mixing model
Modeling goals and data
- Account for non-random mixing due to gender and age.
- We use data from McCarty et al. (2001). There are 1370 respondents and twelve names.
Modeling “How many X’s do you know” with non-random mixing and imperfect recall
\[
y_{ik} \sim \text{Neg-Binom}(\mu_{ike}, \omega_k), \quad \text{where } \mu_{ike} = d_i \, f\!\left(\sum_{a=1}^{A} m(e,a) \frac{N_{ak}}{N_a}\right).
\]
- $d_i$ is the degree of person $i$
- $m(e,a)$ is a matrix of mixing coefficients, the proportion of respondent group $e$’s network made up of alter group $a$
  - We use 8 alter groups (4 age categories and gender) and 6 ego groups (3 age categories and gender).
- $N_{ak}/N_a$ is the proportion of alter group $a$ made up of people with name $k$ and is assumed known (available from SSA).
- $\omega_k$ is the overdispersion
- $f(\cdot)$ is the calibration curve.
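As a rough illustration of how the mean in this model is assembled, the sketch below computes mu_ike for one respondent and one name. Everything numeric here is hypothetical, the function name `expected_count` is invented for the sketch, and the calibration curve defaults to the identity because its functional form is not given on these slides.

```python
from typing import Callable, Sequence


def expected_count(
    degree: float,
    mixing_row: Sequence[float],
    name_share: Sequence[float],
    calibration: Callable[[float], float] = lambda x: x,
) -> float:
    """Compute mu_ike = d_i * f(sum_a m(e, a) * N_ak / N_a).

    mixing_row holds m(e, a) for the respondent's ego group e (one entry per alter group a).
    name_share holds N_ak / N_a for name k (one entry per alter group a).
    calibration is f; the identity default is a placeholder assumption.
    """
    if len(mixing_row) != len(name_share):
        raise ValueError("mixing_row and name_share must cover the same alter groups")
    weighted_share = sum(m * share for m, share in zip(mixing_row, name_share))
    return degree * calibration(weighted_share)


# Hypothetical inputs for 8 alter groups (4 male age groups, then 4 female age groups).
mixing_row = [0.05, 0.25, 0.15, 0.05, 0.05, 0.25, 0.15, 0.05]  # sums to 1
name_share = [0.001, 0.004, 0.002, 0.0005, 0.0, 0.0, 0.0, 0.0]  # a male name
mu = expected_count(600.0, mixing_row, name_share)
print(mu)
```

The weighted sum replaces the single population share used in the basic scale-up calculation: a respondent whose network leans toward the age groups where the name is common is expected to know more people with that name.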
Priors and Hyperparameters
- Parameters are estimated using a multilevel model and Bayesian inference.
- The network size and mixing coefficients are modeled with log-normal distributions.
- This parameterization is based on previous research about the degree distribution of the acquaintanceship network (see McCarty et al. 2001).
- Hyperparameters are assigned noninformative uniform priors.
Results
Results: Estimated Degree
Figure: Distributions of estimated degree for male and female respondents (x-axis: estimated degree, 0 to 6000).
Figure: Histograms of the estimated degree distribution. Red lines show the estimated median. Panels: males and females by age group (youth 18-24, adult 25-64, senior 65+); x-axis: estimated degree, 0 to 4000. Values printed on the panels: 562, 472, 482, 461, 524, 382.
Results: Mixing Matrix
Figure: The mixing matrix estimates +/- 2 standard errors. Panels: female egos and male egos, each by ego group (youth, adult, senior). Bars show the fraction of the network (0 to 0.4) made up of male alters and female alters in the alter age groups 0-20, 21-40, 41-60, and 61+.
Three sources of errors for degree estimation
- Barrier effect: non-random mixing is common.
  - Solution: the latent non-random mixing model.
- Transmission errors: the possibility that a respondent knows someone in a category without realizing it.
  - Solution: we use first names as our X’s to eliminate transmission errors.
- Imperfect recall.
  - Solution 1: Use the calibration curve in the model to adjust for under-recall of large groups.
  - Solution 2: Use first names that each make up about 0.1-0.2 percent of the population, where the calibration curve is close to the y = x line.
Design for future surveys
Simple Estimates of Degree
- Our model addresses
  - non-random mixing
  - recall bias
- Our full model requires some expertise and time to fit.
- A simple design strategy would allow social scientists to obtain scale-up degree estimates that are equivalent to those of our method.
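Deriving the Simple Estimate
Recall the expectation of our model (written here without the calibration curve f):
\[
\mu_{ike} = E(y_{ike}) = d_i \sum_{a=1}^{A} m(e,a) \frac{N_{ak}}{N_a}.
\]
We then proceed by taking the sum over all K names:
\[
\begin{aligned}
\mu_{i \cdot e} = E\left(\sum_{k=1}^{K} y_{ike}\right)
&= \sum_{k=1}^{K} \left[ d_i \sum_{a=1}^{A} m(e,a) \frac{N_{ak}}{N_a} \right] \\
&= d_i \sum_{a=1}^{A} m(e,a) \left[ \sum_{k=1}^{K} \frac{N_{ak}}{N_a} \right] \\
&= d_i c \sum_{a=1}^{A} m(e,a) \qquad \left( \text{if } \sum_{k=1}^{K} \frac{N_{ak}}{N_a} = c, \text{ a constant, for every alter group } a \right) \\
&= d_i c.
\end{aligned}
\]
The last step holds because the mixing coefficients m(e, a) are proportions of ego group e's network, so they sum to one over the alter groups a.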
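The derivation can be checked numerically, and it suggests a simple moment-style estimate: divide a respondent's total count over the K names by c. The sketch below is an illustration under made-up numbers (two alter groups, three names); the function names are hypothetical and not from the talk.

```python
from typing import Sequence


def total_expected_count(
    degree: float,
    mixing_row: Sequence[float],
    name_shares: Sequence[Sequence[float]],
) -> float:
    """Sum of d_i * sum_a m(e, a) * N_ak / N_a over all names k.

    name_shares[k][a] holds N_ak / N_a for name k and alter group a.
    """
    return sum(
        degree * sum(m * share for m, share in zip(mixing_row, shares))
        for shares in name_shares
    )


def simple_degree_estimate(counts: Sequence[float], c: float) -> float:
    """Estimate d_i as (sum_k y_ik) / c, which matches E(sum_k y_ik) = d_i * c."""
    return sum(counts) / c


# Hypothetical shares: in each alter group the three names add up to c = 0.006.
name_shares = [
    [0.003, 0.001],
    [0.001, 0.003],
    [0.002, 0.002],
]
c = 0.006

# Two very different mixing patterns give the same expected total, d_i * c.
total_a = total_expected_count(500.0, [0.8, 0.2], name_shares)
total_b = total_expected_count(500.0, [0.3, 0.7], name_shares)
print(total_a, total_b)

# A respondent reporting 1, 0 and 2 acquaintances for the three names.
print(simple_degree_estimate([1, 0, 2], c))
```

Because the expected total does not depend on the mixing row when the shares are balanced, the simple estimate does not require knowing the mixing matrix at all.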
The scale-down condition for unbiased scale-up estimates.
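- The scale-down condition: the names are selected so that
\[
\sum_{k=1}^{K} \frac{N_{ak}}{N_a} = \text{constant} \quad \text{for every alter group } a.
\]
- That is, the names are balanced: taken together, they have the same distribution as the population on the demographic variables that define the alter groups.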
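One way to screen a candidate set of names against this condition is to add up the name shares within each alter group and see how far the totals are from being equal. The sketch below is only an assumption about how such a check might look: the function names, the 5 percent tolerance, and the example shares are all hypothetical.

```python
from typing import List, Sequence


def group_totals(name_shares: Sequence[Sequence[float]]) -> List[float]:
    """Return sum_k N_ak / N_a for each alter group a.

    name_shares[k][a] holds N_ak / N_a for name k and alter group a.
    """
    n_groups = len(name_shares[0])
    return [sum(shares[a] for shares in name_shares) for a in range(n_groups)]


def satisfies_scale_down(name_shares: Sequence[Sequence[float]], rel_tol: float = 0.05) -> bool:
    """Check whether every group total is within rel_tol of the average total."""
    totals = group_totals(name_shares)
    average = sum(totals) / len(totals)
    return all(abs(total - average) <= rel_tol * average for total in totals)


# Hypothetical name sets over three alter groups.
balanced = [
    [0.003, 0.002, 0.001],
    [0.001, 0.002, 0.003],
]
skewed = [
    [0.003, 0.002, 0.001],
    [0.003, 0.001, 0.001],
]
print(group_totals(balanced), satisfies_scale_down(balanced))
print(group_totals(skewed), satisfies_scale_down(skewed))
```

In the balanced set, a name that is common among one alter group is offset by another name that is common in a different group, so every group total is the same.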
Selecting Names
Suggestions for Selecting Subpopulations (X’s):
1. Using first names eliminates barrier and transmission effects
2. Using rare names (suggest 0.1–0.2 percent of the population) minimizes recall bias
3. Select names with complementary age profiles
Selecting Complementary Name-Age Profiles
Figure: Selecting complementary name-age profiles.
Simulated Data Experiment
Figure: Comparison of full and simple model performance.
Three sources of errors for degree estimation
- Barrier effect: non-random mixing is common.
  - Solution 1: the latent non-random mixing model.
  - Solution 2: select names that satisfy the scale-down condition.
- Transmission errors: the possibility that a respondent knows someone in a category without realizing it.
  - Solution: we use first names as our X’s to eliminate transmission errors.
- Imperfect recall.
  - Solution 1: Use the calibration curve in the model to adjust for under-recall of large groups.
  - Solution 2: Use first names that each make up about 0.1-0.2 percent of the population, where the calibration curve is close to the y = x line.
Limitations and Future Directions
- We account for non-random mixing from age and gender, but other factors could still be present. The Census Bureau collects other information on first names but does not release such data.
- The behavior of the calibration curve f(·) on larger groups is not well established.
Conclusions
- We propose a model to estimate network size based on aggregated count data.
- This model accounts for non-random mixing based on gender and age, and corrects for under-recall.
- We also show that the “scale-up” estimate is a reasonable approximation to our full model under the given conditions.
THANK YOU!