Tutorial: Analyzing real network data
1) Creating data from survey


Upload: awena

Post on 25-Feb-2016

You can download all of the needed files from here: http://www.soc.duke.edu/~jmoody77/rwj/wsfiles.htm

-This is (modified) data from one of the Add Health schools. I’ve changed the data some for security reasons. We’ll walk through some of the data coding issues, creating measures & figures, and then running peer influence & structural models on the network.

-Outline:
  -From survey to analysis files
  -Exploring the network: visualization
  -Network Behavior & Peer Influence Models
    -Network structure as indep variable
    -Peer influence models
    -Dyad similarity models
  -Network Structure analyses
    -Clustering for peer groups
    -Block models
  -Statistical Models for networks (STATNET)

This is what students filled out in the Add Health in-school survey: one set of items for male friends, another for female friends.

This is the foundation of our data….

The result is a nomination data file that looks something like this (actual numbers changed).

We want to turn this file into something PAJEK, UCINET, etc. can read.

Open “netcreate.sas” & walk through the logic of the file.

Netcreate.sas uses files from SPAN to create PAJEK files. PAJEK files have a fixed structure that is easy to program for. See the PAJEK support files for details.

There are programs that convert Excel or text files to PAJEK format, and UCINET (and STATNET, sort of) can read PAJEK .NET files.
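To make the “fixed structure” concrete, here is a minimal Python sketch (not the netcreate.sas logic itself) that writes a list of nominations as a PAJEK .net file: a `*Vertices` block numbering the nodes 1..n, then an `*Arcs` block with one directed tie per line. The function name, the toy IDs, and the output file name are all hypothetical.

```python
import os
import tempfile

def write_pajek_net(nominations, path):
    """Write directed nominations [(sender_id, receiver_id), ...] as a Pajek .net file."""
    # Pajek numbers vertices 1..n, so map each survey ID to a position.
    ids = sorted({sid for pair in nominations for sid in pair})
    index = {sid: k for k, sid in enumerate(ids, start=1)}
    with open(path, "w") as f:
        f.write(f"*Vertices {len(ids)}\n")
        for sid in ids:
            f.write(f'{index[sid]} "{sid}"\n')   # vertex number, then label
        f.write("*Arcs\n")
        for sender, receiver in nominations:
            f.write(f"{index[sender]} {index[receiver]} 1\n")  # from, to, weight
    return index

# Hypothetical nominations: (respondent ID, nominated friend ID)
nominations = [(1001, 1002), (1002, 1001), (1001, 1005), (1005, 1002)]
net_path = os.path.join(tempfile.gettempdir(), "toy_school.net")
index = write_pajek_net(nominations, net_path)
```

The returned `index` is worth keeping: it is the only link between PAJEK’s vertex numbers and the original survey IDs.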

Tutorial: Analyzing real network data
2) Exploring the network graphically

I think it’s extremely useful to simply “play” with the network in various ways and get a sense of the shape of the network. This is perhaps PAJEK’s most useful feature.

-- Load a network and work through good/bad plots.

Once you have a network, how do you create a print-ready image?

a) Screen shots (good for .ppt)
b) Export to .ps or FLASH and edit in Illustrator

Tutorial: Analyzing real network data
3) Network Behavior & Peer Influence

We often want to know how some simple features of the network position affect students. These are “network behavior” models, where some indicator measure of network position is used to predict an outcome.

One should think carefully about a theoretical model here. Cause is often very difficult to disentangle. Here we’ll leave those questions aside and simply look for correlates of network position in behavior.

We’ll look at:
a) network volume (degree)
b) centrality (closeness)
c) local reciprocity (proportion of ties ego sends that are returned)

We can get most of these from either SAS or PAJEK, though I’m not sure PAJEK can give you node-level reciprocity rates…

Paj_nodestatread.sas is the SAS file…
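As a rough illustration of the three measures (not a port of the SAS file), the Python sketch below computes in- and out-degree, a closeness score, and node-level reciprocity on a toy directed network. The closeness convention used here (nodes reached divided by the sum of distances along out-ties) is an assumption; PAJEK and UCINET each offer their own normalizations. The arc list is invented.

```python
from collections import deque

def node_stats(arcs):
    """Degree, closeness, and local reciprocity for each node of a directed net."""
    nodes = sorted({v for arc in arcs for v in arc})
    out_nbrs = {v: set() for v in nodes}
    in_nbrs = {v: set() for v in nodes}
    for sender, receiver in arcs:
        out_nbrs[sender].add(receiver)
        in_nbrs[receiver].add(sender)

    def closeness(start):
        # Breadth-first search along out-ties gives geodesic distances from start.
        dist = {start: 0}
        queue = deque([start])
        while queue:
            v = queue.popleft()
            for w in out_nbrs[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        total = sum(dist.values())
        return (len(dist) - 1) / total if total else 0.0

    stats = {}
    for v in nodes:
        sent = out_nbrs[v]
        # Share of ego's sent ties that come back; undefined if ego sent none.
        recip = len(sent & in_nbrs[v]) / len(sent) if sent else None
        stats[v] = {
            "outdegree": len(sent),
            "indegree": len(in_nbrs[v]),
            "closeness": closeness(v),
            "reciprocity": recip,
        }
    return stats

# Toy nominations (sender, receiver)
arcs = [("a", "b"), ("b", "a"), ("a", "c"), ("c", "d"), ("d", "a")]
stats = node_stats(arcs)
```

Here node "a" sends two ties and gets one of them back, so its reciprocity is 0.5.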

After running some models we get:

Open nodestats1.sas to see how to code these same stats, plus a few, in SAS…

QAP (the quadratic assignment procedure) is an alternative, permutation-based method that doesn’t make as many strong assumptions about the model.

To use QAP, we can run in SAS (but it’s slow and basic), or export to UCINET (which is fast, sophisticated and all that jazz).

The “qapstats.sas” file moves the data for us….
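To show the idea behind QAP, here is a minimal Python sketch of a QAP correlation between two N x N matrices. It is not the qapstats.sas or UCINET implementation: it simply correlates the off-diagonal cells, then re-correlates after relabeling the rows and columns of one matrix together many times, and reports how often the permuted correlation is at least as large in absolute value as the observed one. The two toy matrices are invented.

```python
import random

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def offdiag(m, order):
    """Off-diagonal cells of m after relabeling nodes by `order`."""
    n = len(m)
    return [m[order[i]][order[j]] for i in range(n) for j in range(n) if i != j]

def qap_correlation(a, b, n_perm=1000, seed=0):
    rng = random.Random(seed)
    identity = list(range(len(a)))
    b_cells = offdiag(b, identity)
    observed = pearson(offdiag(a, identity), b_cells)
    hits = 0
    for _ in range(n_perm):
        perm = identity[:]
        rng.shuffle(perm)
        # Rows and columns move together, so the structure of `a` is preserved
        # and only the matching of nodes between the two matrices is scrambled.
        if abs(pearson(offdiag(a, perm), b_cells)) >= abs(observed):
            hits += 1
    return observed, hits / n_perm

# Toy data: friendship ties and a same-grade indicator for 5 students
friends = [[0, 1, 1, 0, 0],
           [1, 0, 1, 0, 0],
           [1, 1, 0, 0, 0],
           [0, 0, 0, 0, 1],
           [0, 0, 0, 1, 0]]
same_grade = [[0, 1, 0, 0, 0],
              [1, 0, 0, 0, 0],
              [0, 0, 0, 1, 1],
              [0, 0, 1, 0, 1],
              [0, 0, 1, 1, 0]]
r_obs, p_value = qap_correlation(friends, same_grade)
```

Because the permutations keep each matrix’s internal dependence intact, the reference distribution does not rely on the cells being independent observations, which is the assumption an ordinary regression on dyads would make.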

We can also estimate the network autocorrelation model directly. We can get “QAD” estimates just by adding the W*Y term to the base model, which typically performs fairly well.

Open peerinfl1.sas to see this routine.
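The W*Y term is just each student’s peer average on the outcome. The Python sketch below (not the peerinfl1.sas routine) builds it from an adjacency matrix, assuming W is the row-normalized friendship matrix; other weight matrices are possible (see Leenders 2002), and setting the term to 0 for students who name no friends is also an assumption. The toy data are invented.

```python
def peer_mean(adj, y):
    """Return W*y where W is the row-normalized adjacency matrix."""
    wy = []
    for row in adj:
        n_friends = sum(row)
        if n_friends == 0:
            wy.append(0.0)  # assumption: no nominated friends -> peer term of 0
        else:
            wy.append(sum(w * yj for w, yj in zip(row, y)) / n_friends)
    return wy

# Toy data: adj[i][j] = 1 if student i nominates student j
adj = [[0, 1, 1, 0],
       [1, 0, 0, 0],
       [0, 1, 0, 1],
       [0, 0, 0, 0]]
fights = [2.0, 0.0, 1.0, 4.0]
wy = peer_mean(adj, fights)
```

The resulting vector can then be added as one more column in the base regression of Y on the covariates.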

Alternatively, UCINET calculates a simple network correlation between any vector (Nx1) and any matrix (NxN) to estimate the bivariate peer effect, and Carter Butts’ lnam routine in R (part of the sna package) lets you run a full linear network autocorrelation model.

For stats details:
Leenders, T.Th.A.J. (2002) “Modeling Social Influence Through Network Autocorrelation: Constructing the Weight Matrix.” Social Networks, 24(1), 21-47.

Anselin, L. (1988) Spatial Econometrics: Methods and Models. Norwell, MA: Kluwer

To run the R version, we need to export the data. We can get started using the send2r.mac routine and reshape some of the files.

The SAS program “sas2r_peerinfl.sas” creates the needed external files.

The R script to run is “lname_example.r”.

Run the example models….

    Call:
    lnam(y = fights, x = cv, W1 = w1, W2 = clbs)

    Residuals:
        Min      1Q  Median      3Q     Max
    -1.3138 -0.7955 -0.3844  0.3147  3.6792

    Coefficients:
             Estimate  Std. Error  Z value  Pr(>|z|)
    FEMALE  -0.292433    0.144148   -2.029  0.042489 *
    WHITE    0.160314    0.149228    1.074  0.282692
    S3       0.061595    0.014843    4.150  3.33e-05 ***
    rho1.1   0.379421    0.103426    3.669  0.000244 ***
    rho2.1   0.001573    0.003954    0.398  0.690870
    ---

Results with “fights” as Y, friendship as W1, and club overlap as W2.

Getting measures from PAJEK.

PAJEK has no direct ID link to files. These are simply text files, so sort order matters.

The basic routine to get any measure in PAJEK is to create the measure using the dropdown menus, then save the files and read them into SAS, SPSS or whatever stats program you use.

Open the PAJEK files and create in-degree, out-degree, closeness centrality, & reciprocity.
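Because the saved files carry no IDs, the merge back to the survey data rests entirely on row order. The Python sketch below reads a single-column PAJEK vector file (a `*Vertices n` header followed by one value per line) and pairs the values with an ID list kept in the same order used when the network was written. The IDs and values are invented, and the function name is hypothetical.

```python
def read_pajek_vector(lines):
    """Parse a single-column Pajek .vec file given as an iterable of lines."""
    rows = [ln.strip() for ln in lines if ln.strip()]
    header = rows[0].split()
    if header[0].lower() != "*vertices":
        raise ValueError("expected a *Vertices header")
    n = int(header[1])
    values = [float(v) for v in rows[1:]]
    if len(values) != n:
        # A length mismatch means the positional merge below cannot be trusted.
        raise ValueError(f"header says {n} values, found {len(values)}")
    return values

# IDs in the same order as the vertices in the .net file (hypothetical)
ids = [1001, 1002, 1005]
vec_text = "*Vertices 3\n2\n1\n0\n"
indegree = dict(zip(ids, read_pajek_vector(vec_text.splitlines())))
```

Checking the count against the header is a cheap guard; it will not catch a re-sorted file, so the ID list has to be saved alongside the network.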

Tutorial: Analyzing real network data
4) Network Structure: Clustering the network

As part of the description, we often want to identify significant clusters in the network. There are lots of ways to do this; we’ll sample a few.

a) Using UCINET’s routines
b) Clustering a distance matrix (SAS)
c) The “Jiggle” routine (SAS, Moody)
d) The “Crowds” algorithm
e) Using PAJEK’s blockmodel routine to fine-tune a peer group model

Clustering in UCINET

-I find it simplest to read PAJEK files into UCINET. The best “general” routine is then FACTIONS, though it is slow for large nets (100s of nodes). It is very effective for small nets.

-In a pinch, CONCOR will often yield reasonable peer groups, and it’s faster in UCINET.

Clustering in SAS

-We can often get a quick starting point by simply running a hierarchical clustering on the distance matrix. This is a fair place to start for nets in the 100s of nodes.

-Two algorithms that work fairly well are “Jiggle” for large nets and “Crowds” for smaller nets. Both work by extending the RNM approach of Moody (2001), but Jiggle is faster for large nets, while Crowds includes more checks for particular structures (like biconnected sets) and thus is slower.
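The “quick starting point” can be sketched in Python as follows. This is not the SAS code: it treats ties as undirected, computes geodesic distances by breadth-first search, and then merges clusters by average linkage until k groups remain. The choice of average linkage, the handling of unreachable pairs (distance set to n), and the toy network are all assumptions made for illustration.

```python
from collections import deque

def geodesic_distances(n, edges):
    """All-pairs shortest path lengths, treating edges as undirected."""
    nbrs = {v: set() for v in range(n)}
    for a, b in edges:
        nbrs[a].add(b)
        nbrs[b].add(a)
    dist = [[n] * n for _ in range(n)]  # unreachable pairs keep distance n
    for s in range(n):
        dist[s][s] = 0
        seen = {s}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            for w in nbrs[v]:
                if w not in seen:
                    seen.add(w)
                    dist[s][w] = dist[s][v] + 1
                    queue.append(w)
    return dist

def average_link(dist, k):
    """Agglomerative clustering: repeatedly merge the two closest clusters."""
    clusters = [[v] for v in range(len(dist))]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                total = sum(dist[a][b] for a in clusters[i] for b in clusters[j])
                d = total / (len(clusters[i]) * len(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]

# Toy net: two triangles joined by a single bridging tie (2-3)
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
groups = average_link(geodesic_distances(6, edges), k=2)
```

On this toy net the two triangles come out as the two groups. The pairwise search is slow, which is fine for a sketch but is one reason purpose-built routines are used on larger nets.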

Clustering in PAJEK

Pajek doesn’t have a dedicated clustering routine for finding peer groups in nets. But you can coerce the blockmodel routine to find block-diagonal structures (slow) or use some of its neighboring partitions.

Keep an eye on this, as I bet they implement Newman’s algorithm soon…

Let’s try running some of these….

Sample results

This is the resulting graph from a “Jiggle” run on the school net. Note this is a randomized algorithm, so you will get different results from different runs.

This is the resulting graph from a “Crowds” run on the school net. We end up with smaller clusters, and a larger “background” set. By construction, the clusters must be bi-connected, so they are “rounder” than in the prior algorithm.

Tutorial: Analyzing real network data
4) Network Structure: Block modeling a network

Sample results

The most commonly used blockmodel routine is CONCOR, which is simple and fast. The result is a set of nested “splits” down to some pre-specified depth.

Here I apply that routine to the school net, working to a depth of 3 splits.

Split 1

Split 2

Note that the 2nd split in the bottom half captures a “periphery” position

Split 3
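For readers who want to see what a single CONCOR split does, here is a minimal Python sketch under simple assumptions; it is not UCINET’s implementation. Each actor’s profile is its row of ties sent stacked on its column of ties received; the profiles are correlated, the correlation matrix is itself correlated repeatedly until the entries approach +1 or -1, and the sign pattern defines the two blocks. Treating zero-variance profiles as uncorrelated (0.0) is an assumption, and the toy network is invented. Deeper splits would repeat the routine within each block.

```python
def correlation_matrix(profiles):
    """Pearson correlations between every pair of equal-length profiles."""
    def corr(x, y):
        n = len(x)
        mx, my = sum(x) / n, sum(y) / n
        sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
        sxx = sum((a - mx) ** 2 for a in x)
        syy = sum((b - my) ** 2 for b in y)
        # Assumption: a constant profile is treated as uncorrelated with others.
        return sxy / (sxx * syy) ** 0.5 if sxx > 0 and syy > 0 else 0.0
    return [[corr(x, y) for y in profiles] for x in profiles]

def concor_split(adj, max_iter=100, tol=1e-6):
    """One CONCOR split of a directed adjacency matrix into two blocks."""
    n = len(adj)
    # Profile of actor i: ties sent (row i) followed by ties received (column i).
    profiles = [adj[i] + [adj[j][i] for j in range(n)] for i in range(n)]
    m = correlation_matrix(profiles)
    for _ in range(max_iter):
        if all(abs(abs(v) - 1.0) < tol for row in m for v in row):
            break
        m = correlation_matrix(m)  # correlate the correlations
    block_a = [i for i in range(n) if m[0][i] > 0]
    block_b = [i for i in range(n) if m[0][i] <= 0]
    return block_a, block_b

# Toy net: two separate cliques of three students each
adj = [[0, 1, 1, 0, 0, 0],
       [1, 0, 1, 0, 0, 0],
       [1, 1, 0, 0, 0, 0],
       [0, 0, 0, 0, 1, 1],
       [0, 0, 0, 1, 0, 1],
       [0, 0, 0, 1, 1, 0]]
blocks = concor_split(adj)
```

Note that CONCOR always returns two blocks per split whether or not the data support a division, so the depth of splitting is a judgment call.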

More in keeping with the spirit of the original block modeling papers, “regular equivalence” models are less likely to generate block-diagonal models.

A simple positional model is the “core-periphery” model. This searches for a single “core” in the net. Since we know this net is split in two “wings”, we’ll just look within one of them.

Another simple way to get at positions in a network is to compare nodes across a vector of triad-positions. In a directed network, the vector giving the count of which positions an actor is part of nicely summarizes the type of role the actor plays in the net.

Triadic Position Census: 36 positions within 16 directed triads

003
012_S, 012_E, 012_I
102_D, 102_I
021D_S, 021D_E
021U_S, 021U_E
021C_S, 021C_B, 021C_E
111D_S, 111D_B, 111D_E
111U_S, 111U_B, 111U_E
030T_S, 030T_B, 030T_E
030C
201_S, 201_B
120D_S, 120D_E
120U_S, 120U_E
120C_S, 120C_B, 120C_E
210_S, 210_B, 210_E
300

Tutorial: Analyzing real network data
5) Statistical Models for Networks

The exponential random graph model (ERGM) class is designed to let you model an observed network as a function of local-network, node, and dyad-level features.

These models take the form:

$$p(X = x) = \frac{\exp\{\theta \sum_{i,j} x_{ij}\}}{\kappa(\theta)}$$

http://csde.washington.edu/statnet/Sunbelt2006/ergmssunbeltxxviintroduction.ppt

From Handcock (2006): http://csde.washington.edu/statnet/Sunbelt2006/ergmssunbeltxxviergmclass.pdf

Note this is a very simple “dyad independence” model.

The dyad-independence model has been extended to other “node” features.

Lots of other structural features can be included, though not all imply reasonable models.

The STATNET statistical package in R is the best way to estimate these models.

We will:
• Walk through exporting our school friendship data from SAS and bringing it into R
• Specify some simple models
• Demonstrate getting goodness of fit stats on these models
• Demonstrate simulating from a model

The set of stats one can add to a model is growing quickly.

Open “statnet_datawrite.sas” to see how to create data for export.

Results from a model (takes too long to run in real time!):

    Summary of model fit
    ==========================

    Formula: s_friends ~ edges + mutual + ttriad + nodematch("S3") + nodematch("WHITE") + edgecov(s_clubs, "ovlpec")

    Newton-Raphson iterations: 87
    MCMC sample of size 10000

    Monte Carlo MLE Results:
                            estimate       s.e.   p-value  MCMC s.e.
    edges                    -6.0927  0.1590376   < 1e-04   3.054007
    mutual                    1.7009  0.3217789   < 1e-04   0.716237
    ttriad                    0.4666  0.0003942   < 1e-04   0.006069
    nodematch.S3              1.4469  0.1719817   < 1e-04   0.597009
    nodematch.WHITE           0.9567  0.2931915   0.00110   2.890984
    edgecov.s_clubs.ovlpec    0.2689  0.1585942   0.09001   0.555580

        Null Deviance: 85606.4 on 61752 degrees of freedom
    Residual Deviance:  6867.4 on 61746 degrees of freedom
             Deviance: 78739.0 on     6 degrees of freedom

    AIC: 6879.4   BIC: 6933.6
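One way to read these estimates: ERGM coefficients are conditional log-odds, so exponentiating gives the multiplicative change in the odds of a tie for a one-unit change in the corresponding statistic, holding the rest of the network fixed. The short Python sketch below does that arithmetic on the estimates copied from the output above; the variable names are mine, and the “baseline” probability is only the logistic transform of the edges term (a tie that changes no other statistic).

```python
import math

# Estimates copied from the model summary above
estimates = {
    "edges": -6.0927,
    "mutual": 1.7009,
    "ttriad": 0.4666,
    "nodematch.S3": 1.4469,
    "nodematch.WHITE": 0.9567,
    "edgecov.s_clubs.ovlpec": 0.2689,
}

# exp(coefficient) = conditional odds ratio for a one-unit change in the statistic
odds_ratios = {term: math.exp(b) for term, b in estimates.items()}

# Probability of a tie that adds one edge and changes nothing else
baseline_p = 1.0 / (1.0 + math.exp(-estimates["edges"]))

for term, ratio in odds_ratios.items():
    print(f"{term:24s} odds ratio = {ratio:.3f}")
print(f"baseline tie probability = {baseline_p:.5f}")
```

For example, the mutual term implies that a tie which reciprocates an existing nomination has roughly 5.5 times the odds of a tie that does not, other things equal.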
