Statistical Inference for Networks
Graduate Lectures, Hilary Term 2009
Prof. Gesine Reinert



Overview

1: Network summaries. What are networks? Some examples from social science and from biology. The need to summarise networks. Clustering coefficient, degree distribution, shortest path length, motifs, betweenness, second-order summaries. Roles in networks, derived from these summary statistics, and modules in networks. Directed and weighted networks. The choice of summary should depend on the research question.

2: Models of random networks. Models would provide further insight into the network structure. Classical Erdos-Renyi (Bernoulli) random graphs and their random mixtures, Watts-Strogatz small worlds and the modification by Newman, Barabasi-Albert scale-free networks, exponential random graph models.


3: Fitting a model: parametric methods. Deriving the distribution of summary statistics. Parametric tests based on the theoretical distribution of the summary statistics (only available for some of the models).

4: Statistical tests for model fit: nonparametric methods. Quantile-quantile plots and other visual methods. Monte-Carlo tests based on shuffling edges with the number of edges fixed, or fixing the node degree distribution, or fixing some other summary. The particular issue of testing for power-law dependence. Subsampling issues. Tests carried out on the same network are not independent.


Suggested reading

1. U. Alon: An Introduction to Systems Biology: Design Principles of Biological Circuits. Chapman and Hall 2007.

2. S.N. Dorogovtsev and J.F.F. Mendes: Evolution of Networks. Oxford University Press 2003.

3. R. Durrett: Random Graph Dynamics. Cambridge University Press 2007.

4. W. de Nooy, A. Mrvar and V. Batagelj: Exploratory Social Network Analysis with Pajek. Cambridge University Press 2005.

5. S. Wasserman and K. Faust: Social Network Analysis. Cambridge University Press 1994.

6. D. Watts: Small Worlds. Princeton University Press 1999.


Further reading:

1. L. da F. Costa, F.A. Rodrigues, P.R. Villas Boas and G. Travieso (2007). Characterization of complex networks: a survey of measurements. Advances in Physics 56, 167–242.

2. J.-J. Daudin, F. Picard and S. Robin (2006). A mixture model for random graphs. Preprint.

3. O. Frank and D. Strauss (1986). Markov graphs. Journal of the American Statistical Association 81, 832–842.

4. K. Nowicki and T. Snijders (2001). Estimation and prediction for stochastic blockstructures. Journal of the American Statistical Association 96, 1077–1087.

5. S. Milgram (1967). The small world problem. Psychology Today 2, 60–67.


6. A.D. Barbour and G. Reinert (2001). Small worlds. Random Structures and Algorithms 19, 54–74.

7. A.D. Barbour and G. Reinert (2006). Discrete small world networks. Electronic Journal of Probability 11, 1234–1283.

8. G. Barnard (1963). Contribution to the discussion of Bartlett's paper. J. Roy. Statist. Soc. B, 294.

9. F. Chung and L. Lu (2001). The diameter of sparse random graphs. Advances in Applied Math. 26, 257–279.

10. M. Dwass (1957). Modified randomization tests for nonparametric hypotheses. Ann. Math. Stat. 28, 181–187.
A. Fronczak, P. Fronczak and J.A. Holyst (2003). Mean-field theory for clustering coefficients in Barabasi-Albert networks. Phys. Rev. E 68, 046126.

11. A. Fronczak, P. Fronczak and J.A. Holyst (2004). Average path length in random networks. Phys. Rev. E 70, 056110.


12. P. Hall and D.M. Titterington (1989). The effect of simulation order on level accuracy and power of Monte Carlo tests. J. Roy. Statist. Soc. B, 459.

13. K. Lin (2007). Motif counts, clustering coefficients, and vertex degrees in models of random networks. D.Phil. dissertation, Oxford.

14. F. Marriott (1979). Barnard's Monte Carlo tests: how many simulations? Appl. Statist. 28, 75–77.

15. M.E.J. Newman (2005). Power laws, Pareto distributions and Zipf's law. Contemporary Physics 46, 323–351.

16. M.P.H. Stumpf, C. Wiuf and R.M. May (2005). Subnets of scale-free networks are not scale-free: sampling properties of networks. Proc. Natl. Acad. Sci. USA 102, 4221–4224.


17. C.J. Anderson, S. Wasserman and B. Crouch (1999). A p* primer: logit models for social networks. Social Networks 21, 37–66.

18. P. Chen, C. Deane and G. Reinert (2007). A statistical approach using network structure in the prediction of protein characteristics. Bioinformatics 23, 2314–2321.

19. P. Chen, C. Deane and G. Reinert (2008). Predicting and validating protein interactions using network structure. Submitted.

20. J.-D.J. Han, N. Bertin, T. Hao, D.S. Goldberg, G.F. Berriz, L.V. Zhang, D. Dupuy, A.J.M. Walhout, M.E. Cusick, F.P. Roth and M. Vidal (2004). Evidence for dynamically organized modularity in the yeast protein-protein interaction network. Nature 430, 88–93.

21. M.S. Handcock, A.E. Raftery and J.M. Tantrum (2007). Model-based clustering for social networks. Journal of the Royal Statistical Society, Series A 170.


22. P.W. Holland and S. Leinhardt (1981). An exponential family of probability distributions for directed graphs (with discussion). Journal of the American Statistical Association 76, 33–65.

23. F. Kepes, ed. (2007). Biological Networks. Complex Systems and Interdisciplinary Science Vol. 3. World Scientific, Singapore.

24. M.E.J. Newman and M. Girvan (2004). Finding and evaluating community structure in networks. Physical Review E 69, 026113.

25. G. Palla, A.-L. Barabasi and T. Vicsek (2007). Quantifying social group evolution. Nature 446, 664–667.

26. F. Picard, J.-J. Daudin, M. Koskas, S. Schbath and S. Robin (2008). Assessing the exceptionality of network motifs. Journal of Computational Biology 15, 1–20.

27. E. Ravasz, A.L. Somera, D.A. Mongru, Z.N. Oltvai and A.-L. Barabasi (2002). Hierarchical organization of modularity in metabolic networks. Science 297, 1551.

28. G.L. Robins, T.A.B. Snijders, P. Wang, M. Handcock and P. Pattison (2007). Recent developments in exponential random graph (p*) models for social networks. Social Networks 29, 192–215. http://www.stats.ox.ac.uk/~snijders/siena/RobinsSnijdersWangHandcockPattison2007.pdf

29. G.L. Robins, P. Pattison, Y. Kalish and D. Lusher (2007). An introduction to exponential random graph (p*) models for social networks. Social Networks 29, 173–191.


1 Network summaries

1.1 What are networks?

Networks are just graphs. Often one would think of a network as a connected graph, but not always. In these lectures we shall use network and graph interchangeably.

Here are some examples of networks (graphs).


[Figure: a network on 16 nodes, labelled 0–15.]

Marriage relations between Florentine families. Node 8 is the Medici family; node 14 the Strozzi family.


Yeast: A plot of a connected subset of Yeast protein interactions.

[Figure: the Yeast protein interaction subnetwork, nodes labelled 0–2360.]

Networks arise in a multitude of contexts, such as

- metabolic networks
- protein-protein interaction networks
- spread of epidemics
- the neural network of C. elegans
- social networks
- collaboration networks (Erdos numbers ... )
- membership of management boards
- the World Wide Web
- the power grid of the Western US


The study of networks has a long tradition in social science, where it is called Social Network Analysis; see also Krista Gile's talks. The networks under consideration there are typically fairly small. In contrast, starting around 1997, statistical physicists turned their attention to large-scale properties of networks. These lectures will try to give a glimpse of both approaches.


Research questions include

• How do these networks work? Where could we best manipulate a network in order to prevent, say, tumor growth?

• How did biological networks evolve? Could mutation affect whole parts of the network at once?

• How similar are networks? If we study some organisms very well, how much does that tell us about other organisms?

• How are networks interlinked?

• What are the building principles of these networks? How is resilience achieved, and how is flexibility achieved? Could we learn from real-life networks to build efficient man-made networks?


From a statistical viewpoint, questions include

• How to best describe networks?

• How to infer characteristics of nodes in the network?

• How to infer missing links, and how to check whether existing links are false positives?

• How to compare networks from related organisms?

• How to predict functions from networks?

• How to find relevant sub-structures of a network?

Statistical inference relies on the assumption that there is some randomness in the data. Before we turn our attention to modelling such randomness, let's look at how to describe networks, or graphs, in general.


1.2 What are graphs?

A graph consists of nodes (sometimes also called vertices) and edges (sometimes also called links). We typically think of the nodes as actors, or proteins, or genes, or metabolites, and we think of an edge as an interaction between the two nodes at either end of the edge. Sometimes nodes may possess characteristics which are of interest (such as the structure of a protein, or the function of a protein). Edges may possess different weights, depending on the strength of the interaction. For now we just assume that all edges have the same weight, which we set as 1.


Mathematically, we abbreviate a graph G as G = (V, E), where V is the set of nodes and E is the set of edges. We use the notation |S| to denote the number of elements in the set S. Then |V| is the number of nodes, and |E| is the number of edges in the graph G. If u and v are two nodes and there is an edge from u to v, then we write (u, v) ∈ E, and we say that v is a neighbour of u.

If both endpoints of an edge are the same, then the edge is a loop. For now we exclude self-loops, as well as multiple edges between two nodes.

Edges may be directed or undirected. A directed graph, or digraph, is a graph where all edges are directed. The underlying graph of a digraph is the graph that results from turning all directed edges into undirected edges. Here we shall mainly deal with undirected graphs.


Two nodes are called adjacent if they are joined by an edge. A graph can be described by its adjacency matrix A = (a_{u,v}). This is a square |V| × |V| matrix. Each entry is either 0 or 1;

a_{u,v} = 1 if and only if (u, v) ∈ E.

As we assume that there are no self-loops, all elements on the diagonal of the adjacency matrix are 0. If the edges of the graph are undirected, then the adjacency matrix will be symmetric.
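As a concrete illustration, here is a small Python sketch that builds such an adjacency matrix from an edge list; the 4-node graph is a made-up toy example, not one from the lectures:

```python
def adjacency_matrix(n, edges):
    """Return the n x n adjacency matrix (list of lists) of an undirected
    graph on nodes 0..n-1, with no self-loops or multiple edges."""
    A = [[0] * n for _ in range(n)]
    for u, v in edges:
        A[u][v] = 1
        A[v][u] = 1  # undirected graph: the matrix is symmetric
    return A

# Toy graph: a triangle 0-1-2 with a pendant node 3 attached to node 2.
edges = [(0, 1), (1, 2), (2, 0), (2, 3)]
A = adjacency_matrix(4, edges)
# The diagonal is all zeros (no self-loops) and A equals its transpose.
```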


The adjacency matrix entries tell us for every node v which nodes are within distance 1 of v. If we take the matrix product A² = A × A, the entry for (u, v) with u ≠ v is

a^(2)(u, v) = ∑_{w ∈ V} a_{u,w} a_{w,v}.

If a^(2)(u, v) ≠ 0 then u can be reached from v within two steps; u is within distance 2 of v. Higher powers can be interpreted similarly.

A complete graph is a graph such that every pair of nodes is joined by an edge. Its adjacency matrix has entry 0 on the diagonal, and 1 everywhere else.
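The matrix-power interpretation can be checked directly in Python; this sketch reuses the toy 4-node graph (a triangle with a pendant node) as an illustrative assumption:

```python
def matmul(A, B):
    """Multiply two square matrices given as lists of lists."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

# Toy graph: triangle 0-1-2 plus pendant node 3 attached to node 2.
A = [[0, 1, 1, 0], [1, 0, 1, 0], [1, 1, 0, 1], [0, 0, 1, 0]]
A2 = matmul(A, A)
# Off-diagonal A2[u][v] counts walks of length 2 from u to v: node 3 is
# not adjacent to node 0, but A2[0][3] != 0, so it is within distance 2.
# The diagonal entry A2[v][v] is the degree of v (each walk v-u-v).
```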


A bipartite graph is a graph where the node set V is decomposed into two disjoint subsets, U and W, say, such that there are no edges between any two nodes in U, and also there are no edges between any two nodes in W; all edges have one endpoint in U and the other endpoint in W. An example is a network of co-authorship and articles; U could be the set of authors, W the set of articles, and an author is connected to an article by an edge if the author is a co-author of that article. The adjacency matrix A can then be arranged such that it is of the block form

A = ( 0    A1 )
    ( A2   0  ).


1.3 Network summaries

The degree deg(v) of a node v is the number of edges which involve v as an endpoint. The degree is easily calculated from the adjacency matrix A;

deg(v) = ∑_u a_{u,v}.

The average degree of a graph is then the average of its node degrees.

(For directed graphs we would define the in-degree as the number of edges directed at the node, and the out-degree as the number of edges that go out from that node.)
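In code, the degree is exactly the column sum of the adjacency matrix; a minimal sketch on the assumed toy 4-node graph:

```python
# Toy graph: triangle 0-1-2 plus pendant node 3 attached to node 2.
A = [[0, 1, 1, 0], [1, 0, 1, 0], [1, 1, 0, 1], [0, 0, 1, 0]]

def degree(A, v):
    """deg(v) as the column sum of the adjacency matrix."""
    return sum(A[u][v] for u in range(len(A)))

degs = [degree(A, v) for v in range(len(A))]
avg_degree = sum(degs) / len(degs)  # average degree of the graph
```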


The clustering coefficient of a node v is, intuitively, the proportion of its "friends" who are friends themselves. Mathematically, it is the proportion of pairs of neighbours of v which are themselves neighbours. In adjacency matrix notation,

C(v) = ( ∑_{u,w ∈ V} a_{u,v} a_{w,v} a_{u,w} ) / ( ∑_{u,w ∈ V} a_{u,v} a_{w,v} ).

The (average) clustering coefficient is defined as

C = (1/|V|) ∑_{v ∈ V} C(v).


Note that

∑_{u,w ∈ V} a_{u,v} a_{w,v} a_{u,w}

is the number of triangles involving v in the graph. Similarly,

∑_{u,w ∈ V} a_{u,v} a_{w,v}

is the number of 2-stars centred around v in the graph. The clustering coefficient is thus the ratio between the number of triangles and the number of 2-stars. The clustering coefficient describes how "locally dense" a graph is. Sometimes the clustering coefficient is also called the transitivity.

The clustering coefficient in the Florentine family example is 0.1914894; the average clustering coefficient in the Yeast data is 0.1023149.
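The triangles-over-2-stars ratio translates directly into code; a sketch on the assumed toy 4-node graph, with the sums taken over ordered pairs u ≠ w (the double-counting cancels in the ratio):

```python
def clustering(A, v):
    """C(v): triangles through v divided by 2-stars centred at v.
    Both sums run over ordered pairs u != w, so each triangle and each
    2-star is counted twice; the ratio is unaffected."""
    n = len(A)
    triangles = sum(A[u][v] * A[w][v] * A[u][w]
                    for u in range(n) for w in range(n))
    two_stars = sum(A[u][v] * A[w][v]
                    for u in range(n) for w in range(n) if u != w)
    return triangles / two_stars if two_stars else 0.0

# Toy graph: triangle 0-1-2 plus pendant node 3 attached to node 2.
A = [[0, 1, 1, 0], [1, 0, 1, 0], [1, 1, 0, 1], [0, 0, 1, 0]]
C = sum(clustering(A, v) for v in range(len(A))) / len(A)  # average C
```

Node 2 has neighbours {0, 1, 3} but only the pair (0, 1) is joined, so C(2) = 1/3, while nodes 0 and 1 have C(v) = 1.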


In a graph, a path from node v0 to node vn is an alternating sequence of nodes and edges, (v0, e1, v1, e2, . . . , vn−1, en, vn), such that the endpoints of ei are vi−1 and vi, for i = 1, . . . , n. A graph is called connected if there is a path between any pair of nodes in the graph; otherwise it is called disconnected. The distance ℓ(u, v) between two nodes u and v is the length of a shortest path joining them; this path does not have to be unique.

We can calculate the distance ℓ(u, v) from the adjacency matrix A as the smallest power p of A such that the (u, v)-element of A^p is not zero.


In a connected graph, the average shortest path length is defined as

ℓ = (1 / (|V|(|V| − 1))) ∑_{u ≠ v ∈ V} ℓ(u, v).

The average shortest path length describes how "globally connected" a graph is.
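The average shortest path length is usually computed with one breadth-first search per node rather than via matrix powers; a sketch on the assumed toy 4-node graph:

```python
from collections import deque

def bfs_distances(A, source):
    """Distances from source to every node by breadth-first search."""
    n = len(A)
    dist = [None] * n
    dist[source] = 0
    q = deque([source])
    while q:
        u = q.popleft()
        for v in range(n):
            if A[u][v] and dist[v] is None:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

# Toy graph: triangle 0-1-2 plus pendant node 3 attached to node 2.
A = [[0, 1, 1, 0], [1, 0, 1, 0], [1, 1, 0, 1], [0, 0, 1, 0]]
n = len(A)
# Sum over all ordered pairs; the zero self-distances contribute nothing.
total = sum(d for s in range(n) for d in bfs_distances(A, s))
ell = total / (n * (n - 1))
```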

Example: comparison of the H. pylori and Yeast protein interaction networks:

              n      ℓ          C
  H. pylori   686    4.137637   0.016
  Yeast       2361   4.376182   0.1023149


Node degree, clustering coefficient, and shortest path length are the most common summaries of networks. Other popular summaries, to name but a few, are: the betweenness of an edge, the proportion of shortest paths between any two nodes which pass through this edge; similarly, the betweenness of a node, the proportion of shortest paths between any two nodes which pass through this node; and the connectivity of a connected graph, the smallest number of edges whose removal results in a disconnected graph.


In addition to considering these general summary statistics, it has proven fruitful to describe networks in terms of motifs; these are building-block patterns of networks, such as a feed-forward loop, see the book by Alon. Here we think of a motif as a subgraph with a fixed number of nodes and with a given topology. In biological networks, it turns out that motifs seem to be conserved across species. They seem to reflect functional units which combine to regulate the cellular behaviour as a whole.


Decomposing a network into communities, small subgraphs which are highly connected internally but not so highly connected to the remaining graph, can reveal some structure of the network. Identifying roles in networks singles out specific nodes with special properties, such as hub nodes, which are nodes with high degree.


Summaries based on spectral properties of the adjacency matrix

If the λ_i are the eigenvalues of the adjacency matrix A, then the spectral density of the graph is defined as

ρ(λ) = (1/n) ∑_i δ(λ − λ_i),

where δ(x) is the delta function. For Bernoulli random graphs, if p is constant as n → ∞, then ρ(λ) converges to a semicircle.


The eigenvalues can be used to compute the kth moments,

M_k = (1/n) ∑_i (λ_i)^k = (1/n) ∑_{i1,i2,...,ik} a_{i1,i2} a_{i2,i3} ⋯ a_{ik,i1}.

The quantity nM_k is the number of paths returning to the same node in the graph, passing through k edges, where these paths may contain nodes that were already visited. Because in a tree-like graph a return path is only possible going back through already visited nodes, nonzero odd moments are an indicator for the presence of cycles in the graph.
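The identity nM_k = tr(A^k) can be verified numerically; this sketch uses the assumed toy 4-node graph and the fact that each triangle contributes six closed walks of length 3 (two directions times three starting nodes):

```python
def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def trace(M):
    return sum(M[i][i] for i in range(len(M)))

# Toy graph: triangle 0-1-2 plus pendant node 3 attached to node 2.
A = [[0, 1, 1, 0], [1, 0, 1, 0], [1, 1, 0, 1], [0, 0, 1, 0]]

A3 = matmul(matmul(A, A), A)
n = len(A)
M3 = trace(A3) / n           # third moment: (1/n) * sum_i lambda_i^3
triangles = trace(A3) / 6    # each triangle yields 6 closed 3-walks
```

A nonzero M_3 immediately signals the presence of a cycle; for a tree, tr(A^3) would be 0.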


The subgraph centrality

SC_i = ∑_{k=0}^{∞} (A^k)_{i,i} / k!

measures the "centrality" of a node based on the number of subgraphs in which the node takes part. It can be computed as

SC_i = ∑_{j=1}^{n} v_j(i)² e^{λ_j},

where v_j(i) is the ith element of the jth eigenvector.


Entropy-type summaries

The structure of a network is related to its reliability and to its speed of information propagation. If a random walk starts on node i heading for node j, the probability that it follows a given shortest path π(i, j) between these vertices is

P(π(i, j)) = (1/d(i)) Π_{b ∈ N(π(i,j))} 1/(d(b) − 1),

where d(i) is the degree of node i, and N(π(i, j)) is the set of nodes in the path π(i, j) excluding i and j. The search information is the total information needed to identify one of all the shortest paths between i and j,

S(i, j) = − log₂ Σ_{π(i,j)} P(π(i, j)).
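The formula can be evaluated on a tiny example. The sketch below (the helper functions `shortest_paths` and `search_information` are ours) uses the 4-cycle, where between opposite nodes both shortest paths together are found with probability 1, so the search information is 0 bits:

```python
# Search information S(i, j) on the 4-cycle 0-1-2-3-0.
from math import log2
from collections import deque

adj = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}  # cycle C4

def shortest_paths(adj, s, t):
    """Enumerate all shortest paths from s to t via breadth-first search."""
    dist, preds, queue = {s: 0}, {s: []}, deque([s])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w], preds[w] = dist[u] + 1, [u]
                queue.append(w)
            elif dist[w] == dist[u] + 1:
                preds[w].append(u)
    def build(v):
        if v == s:
            return [[s]]
        return [path + [v] for u in preds[v] for path in build(u)]
    return build(t)

def search_information(adj, i, j):
    total = 0.0
    for path in shortest_paths(adj, i, j):
        p = 1.0 / len(adj[i])               # first step: 1/d(i)
        for b in path[1:-1]:                # interior nodes of the path
            p *= 1.0 / (len(adj[b]) - 1)    # never step back the way we came
        total += p
    return -log2(total)

print(search_information(adj, 0, 2))  # 0.0 bits: two paths, each with prob 1/2
print(search_information(adj, 0, 1))  # 1.0 bit: one path with prob 1/2
```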


The above network summaries provide a first pass at describing networks. Specific networks may require specific concepts. In protein interaction networks, for example, it matters whether a protein can interact with two other proteins simultaneously (party hub) or only sequentially (date hub). In addition, the research question may suggest other summaries. For example, fungal networks contain hardly any triangles, so the clustering coefficient does not make much sense for these networks.


Excursion: Milgram and the small world effect.

In 1967 the American sociologist Milgram reported a series of experiments of the following type. A number of people from a remote US state (Nebraska, say) were asked to have a letter (or package) delivered to a certain person in Boston, Massachusetts (such as the wife of a divinity student). The catch was that the letter could only be sent to someone whom the current holder knew on a first-name basis. Milgram kept track of how many intermediaries were required until the letters arrived; he reported a median of six; see for example http://www.uaf.edu/northern/bigworld.html. This led him to coin the notion of six degrees of separation, often interpreted as everyone being six handshakes away from the President. While the experiments were somewhat flawed (in the first experiment only 3 letters arrived), the concept of six degrees of separation has stuck.


2 Models of random networks

In order to judge whether a network summary is "unusual" or whether a motif is "frequent", there is an underlying assumption of randomness in the network.


Network data are subject to various errors, which can create randomness, such as:

• There may be missing edges in the network. Perhaps a node was absent (social network) or has not been studied yet (protein interaction network).

• Some edges may be reported as present when that recording is a mistake. Depending on the method of determining protein interactions, the number of such false positive interactions can be substantial, around 1/3 of all interactions.

• There may be transcription errors in the data.

• There may be bias in the data: some part of the network may have received more attention than another part of the network.


Often network data are snapshots in time, while the network might undergo dynamical changes.

In order to understand mechanisms which could explain the formation of networks, mathematical models have been suggested.


2.1 Bernoulli (Erdos-Renyi) random graphs

The most standard random graph model is that of Erdos and Renyi (1959). The (finite) node set V is given, say |V| = n, and an edge between two nodes is present with probability p, independently of all other edges. As there are

(n choose 2) = n(n − 1)/2

potential edges, the expected number of edges is then (n choose 2) p.

Each node has n − 1 potential neighbours, and each of these n − 1 edges is present with probability p, and so the expected degree of a node is (n − 1)p. As the expected degree of a node is the same for all nodes, the average degree is (n − 1)p.


Similarly, the average number of triangles in the graph is

(n choose 3) p³ = [n(n − 1)(n − 2)/6] p³,

and the average number of 2-stars is

(n choose 3) p².

Thus, with a bit of handwaving, we would expect an average clustering coefficient of about

(n choose 3) p³ / ((n choose 3) p²) = p.


In a Bernoulli random graph, your friends are no more likely to be friends with each other than two complete strangers would be. This model is clearly not a good one for social networks. Below is an example from scientific collaboration networks (N. Boccara, Modeling Complex Systems, Springer 2004, p. 283). We can estimate p as the ratio of the average node degree to n − 1; this estimate would also be an estimate of the clustering coefficient in a Bernoulli random graph.


Network              n          ave degree   C       C_Bernoulli
Los Alamos archive   52,909     9.7          0.43    0.00018
MEDLINE              1,520,251  18.1         0.066   0.000011
NCSTRL               11,994     3.59         0.496   0.0003
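As a quick check of the last column, C_Bernoulli is simply the average degree divided by n − 1, the edge probability a Bernoulli random graph would need to match the observed average degree (the dictionary below just restates the table):

```python
# C_Bernoulli = (average degree) / (n - 1) for each collaboration network.
networks = {
    "Los Alamos archive": (52909, 9.7),
    "MEDLINE": (1520251, 18.1),
    "NCSTRL": (11994, 3.59),
}
p_hats = {name: avg_deg / (n - 1) for name, (n, avg_deg) in networks.items()}
for name, p_hat in p_hats.items():
    print(f"{name}: C_Bernoulli ~ {p_hat:.2g}")
```

The estimates match the table, and are orders of magnitude below the observed clustering coefficients C.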

In real-world graphs, moreover, the shortest path length is often much shorter than expected from a Bernoulli random graph with the same average node degree. The phenomenon of short paths, often coupled with a high clustering coefficient, is called the small world phenomenon. Remember the Milgram experiments!


2.2 The Watts-Strogatz model

Watts and Strogatz (1998) published a ground-breaking paper with a new model for small worlds; the version currently most used is as follows. Arrange the n nodes of V on a lattice. Then hard-wire each node to its k nearest neighbours on each side of it on the lattice, where k is small. Thus there are nk edges in this hard-wired lattice. Now introduce random shortcuts between nodes which are not hard-wired; the shortcuts are chosen independently, all with the same probability.
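The construction can be sketched in a few lines; the function name and parameter values below are ours, and edges are stored as unordered pairs:

```python
# Sketch of the Watts-Strogatz construction: a ring lattice plus shortcuts.
import random

def watts_strogatz(n, k, phi, seed=0):
    rng = random.Random(seed)
    edges = set()
    for v in range(n):                         # hard-wire the ring lattice
        for step in range(1, k + 1):
            edges.add(frozenset((v, (v + step) % n)))
    lattice_size = len(edges)                  # equals n*k when 2k < n
    for u in range(n):                         # independent random shortcuts
        for v in range(u + 1, n):
            if frozenset((u, v)) not in edges and rng.random() < phi:
                edges.add(frozenset((u, v)))
    return edges, lattice_size

edges, lattice_size = watts_strogatz(100, 2, 0.01)
print(lattice_size)  # 200 hard-wired edges = n*k
```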


If there are no shortcuts, then the average distance between two randomly chosen nodes is of the order of n, the number of nodes. But as soon as there are just a few shortcuts, the average distance between two randomly chosen nodes has an expectation of order log n. Thinking of an epidemic on a graph: just a few shortcuts dramatically increase the speed at which the disease spreads.

It is possible to approximate the node degree distribution, the clustering coefficient, and the shortest path length reasonably well mathematically; we may come back to these approximations later.


While the Watts-Strogatz model is able to replicate a wide range of clustering coefficients and shortest path lengths simultaneously, it falls short of producing the observed types of node degree distributions. It is often observed that nodes tend to attach to "popular" nodes; popularity is attractive.


2.3 "The" Barabasi-Albert model

In 1999, Barabasi and Albert noticed that the actor collaboration graph and the World Wide Web had degree distributions of the type

Prob(degree = k) ∼ C k^{−γ}

for k → ∞. Such behaviour is called power-law behaviour; the constant γ is called the power-law exponent. Subsequently a number of networks have been identified which show this type of behaviour. They are also called scale-free random graphs. To explain this behaviour, Barabasi and Albert introduced the preferential attachment model for network growth.


Suppose that the process starts at time 1 with 2 nodes linked by m (parallel) edges. At every time t ≥ 2 we add a new node with m edges that link the new node to nodes already present in the network. We assume that the probability π_i that the new node will be connected to a node i depends on the degree deg(i) of i, so that

π_i = deg(i) / Σ_j deg(j).

To be precise, when we add a new node we add its edges one at a time, with the second and subsequent edges doing preferential attachment using the updated degrees.
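A minimal sketch of this growth process, using the standard trick of sampling from a list in which each node appears once per unit of degree (function and variable names are ours). Because the updated degrees are used within a step, as in the precise construction above, occasional self-loops can occur:

```python
# Preferential attachment: node i is chosen with probability deg(i)/sum_j deg(j).
import random

def preferential_attachment(T, m, seed=0):
    rng = random.Random(seed)
    edges = [(0, 1)] * m            # time 1: two nodes joined by m parallel edges
    degree_list = [0, 1] * m        # node i appears deg(i) times in this list
    for t in range(2, T + 1):       # nodes 2, 3, ..., T arrive one at a time
        for _ in range(m):
            target = rng.choice(degree_list)   # preferential choice of endpoint
            edges.append((t, target))
            degree_list += [t, target]         # degrees update between draws
    return edges

edges = preferential_attachment(1000, 2)
print(len(edges))  # m edges at time 1 plus m per later node: m*T = 2000
```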


This model indeed has the property that the degree distribution is approximately power law, with exponent γ = 3. Other exponents can be achieved by varying the probability for choosing a given node.

Unfortunately the above construction will not result in any triangles at all. It is possible to modify the construction, adding more than one edge at a time, so that any distribution of triangles can be achieved.


2.4 Erdos-Renyi Mixture Graphs

An intermediate model with not quite so many degrees of freedom is given by the Erdos-Renyi mixture model, also known as a latent block model in social science (Nowicky and Snijders (2001)). Here we assume that nodes are of different types; say there are L different types. Edges are then constructed independently, such that the probability for an edge depends only on the types of the nodes at its endpoints. Robin et al. have shown that this model is very flexible and is able to fit many real-world networks reasonably well. It does not, however, produce a power-law degree distribution.
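The sampling scheme is simple to sketch: a type for each node, and an L × L matrix of edge probabilities (the function name, type assignment, and probabilities below are all illustrative choices of ours):

```python
# Erdos-Renyi mixture: edge probability depends only on the endpoint types.
import random

def er_mixture(types, P, seed=0):
    """types[i] in {0, ..., L-1}; P is an L x L symmetric matrix of edge probs."""
    rng = random.Random(seed)
    n = len(types)
    edges = []
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < P[types[u]][types[v]]:
                edges.append((u, v))
    return edges

types = [0] * 10 + [1] * 10          # two types, 10 nodes each
P = [[0.8, 0.05],
     [0.05, 0.8]]                    # dense within types, sparse between types
edges = er_mixture(types, P)
within = sum(types[u] == types[v] for u, v in edges)
print(within, len(edges) - within)   # far more within-type than between-type edges
```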


2.5 Exponential random graph (p∗) models

Networks have been analysed for "ages" in the social science literature; see for example the book by Wasserman and Faust. Here usually digraphs are studied, and typical research questions are:

• Is there a tendency in friendship towards transitivity; are friends of friends my friends?

• What is the role of explanatory variables such as income on the position in the network?

• What is the role of friendship in creating behaviour (such as smoking)?

• Is there a hierarchy in the network?

• Is the network influenced by other networks for which the membership overlaps?


Exponential random graph (p∗) models model the whole adjacency matrix of a graph simultaneously, making it easy to incorporate dependence. Suppose that X is our random adjacency matrix. The general form of the model is

Prob(X = x) = (1/κ) exp{Σ_B λ_B z_B(x)},

where the summation is over all subsets B of the set of potential edges,

z_B(x) = Π_{(i,j)∈B} x_{i,j}

is the network statistic corresponding to the subset B, κ is a normalising quantity so that the probabilities sum to 1, and the parameter λ_B = 0 unless all the variables in B are mutually dependent.


The simplest such model takes the probability of any edge to be constant across all possible edges, i.e. the Bernoulli graph, for which

Prob(X = x) = (1/κ) exp{λ L(x)},

where L(x) is the number of edges in the network x and λ is a parameter.


For social networks, Frank and Strauss (1986) introduced Markov dependence, whereby two possible edges are assumed to be conditionally dependent if they share a node. For non-directed networks, the resulting model has parameters relating only to configurations of stars of various types, and to triangles. If the number L(x) of edges, the number S₂(x) of two-stars, the number S₃(x) of three-stars, and the number T(x) of triangles are included, then the model reads

Prob(X = x) = (1/κ) exp{λ₁L(x) + λ₂S₂(x) + λ₃S₃(x) + λ₄T(x)}.

By setting the parameters to particular values and then simulating the distribution, we can examine global properties of the network.
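The four sufficient statistics are easy to compute from an adjacency matrix; the sketch below (helper name ours) counts k-stars at a node of degree d as (d choose k), and checks the result on a triangle graph:

```python
# Network statistics L, S2, S3, T of a graph given as an adjacency matrix.
from math import comb

def ergm_stats(A):
    n = len(A)
    deg = [sum(row) for row in A]
    L = sum(deg) // 2                           # number of edges
    S2 = sum(comb(d, 2) for d in deg)           # number of two-stars
    S3 = sum(comb(d, 3) for d in deg)           # number of three-stars
    T = sum(A[u][v] and A[v][w] and A[u][w]     # number of triangles
            for u in range(n)
            for v in range(u + 1, n)
            for w in range(v + 1, n))
    return L, S2, S3, T

A = [[0, 1, 1],
     [1, 0, 1],
     [1, 1, 0]]
print(ergm_stats(A))  # (3, 3, 0, 1) for the triangle
```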


2.6 Specific models for specific networks

Depending on the research question, it may make sense to build a specific network model. For example, a gene duplication model has been suggested which would result in a power-law like node degree distribution. For metabolic pathways, a number of Markov models have been introduced. When thinking of flows through networks, it may be a good idea to use weighted networks; the weights could themselves be random.


3 Fitting a model: parametric methods

Bernoulli (Erdos-Renyi) random graphs

In the random graph model of Erdos and Renyi (1959), the (finite) node set V is given, say |V| = n. We denote the set of all potential edges by E; thus |E| = (n choose 2). An edge between two nodes is present with probability p, independently of all other edges. Here p is an unknown parameter.

In classical (frequentist) statistics we often estimate unknownparameters via the method of maximum likelihood.


Example: Bernoulli random graphs.

Our data is the network we see. We describe the data using the adjacency matrix, denoted by x here because it is the realisation of a random adjacency matrix X. Recall that

x_{u,v} = 1 if and only if there is an edge between u and v.

The likelihood of p being the true value of the edge probability when we see x is

L(p; x) = (1 − p)^{|E|} (p/(1 − p))^{Σ_{(i,j)∈E} x_{i,j}}.

If t = Σ_{(i,j)∈E} x_{i,j} is the total number of edges in the random graph, then

p̂ = t/|E|

is our maximum-likelihood estimator.
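The estimator is one line of code; the sketch below (function name ours) applies it to the edge and node counts of the Florentine family data used as an example later in the notes:

```python
# Maximum-likelihood estimator p_hat = t / (n choose 2) for a Bernoulli graph.
from math import comb

def bernoulli_mle(n_nodes, n_edges):
    return n_edges / comb(n_nodes, 2)   # t / |E|

p_hat = bernoulli_mle(16, 20)           # 16 families, 20 marriage ties
print(p_hat)  # 20/120 = 1/6 ≈ 0.1667
```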


Maximum-likelihood estimation also works well in Erdos-Renyi mixture graphs when the number of types is known, and it works well in Watts-Strogatz small world networks when the number k of nearest neighbours we connect to is known. When the number of types, or the number of nearest neighbours, is unknown, then things become messy.

In Barabasi-Albert models, the parameter would be the power exponent for the node degree, as occurring in the probability for an incoming node to connect to some node i already in the network.


In exponential random graphs, unless the network is very small, maximum-likelihood estimation quickly becomes numerically unfeasible. Even in a simple model like

Prob(X = x) = (1/κ) exp{λ₁L(x) + λ₂S₂(x) + λ₃S₃(x) + λ₄T(x)},

the calculation of the normalising constant κ becomes numerically impossible very quickly.


3.1 Markov Chain Monte Carlo estimation

A Markov chain is a stochastic process where the state at time n depends only on the state at time n − 1, plus some independent randomness. It is irreducible if any state can be reached from any other state in a finite number of moves, and it is reversible if you cannot tell whether it is running forwards in time or backwards in time. A distribution is stationary for the Markov chain if, when you start in the stationary distribution, after one step you cannot tell whether you made a step or not: the distribution of the chain looks just the same.

If a Markov chain is irreducible and reversible, then it will have a unique stationary distribution, and no matter in which state you start the chain, it will eventually converge to this stationary distribution.


We make use of this fact by looking at our target distribution, such as the distribution of X in an exponential random graph model, as the stationary distribution of a Markov chain.

This Markov chain lives on graphs, and moves are adding or deleting edges, as well as adding types or reducing types. Finding suitable Markov chains is an active area of research.

The ergm package has MCMC implemented for parameter estimation. We need to be aware that there is no guarantee that the Markov chain has reached its stationary distribution. Also, if the stationary distribution is not unique, then the results can be misleading. Unfortunately in exponential random graph models it is known that in some small parameter regions the stationary distribution is not unique.


3.2 Assessing the model fit

Suppose that we have estimated our parameters in our model of interest. We can now use this model to see whether it does actually fit the data.

To that purpose we study the (asymptotic) distributions of our summary statistics: node degree, clustering coefficient, and shortest path length. Then we see whether our observed values are plausible under the estimated model.


3.3 The distribution of summary statistics in Bernoulli random graphs

In a Bernoulli random graph on n nodes, with edge probability p, the network summaries are pretty well understood.

3.3.1 The degree of a random node

Pick a node v, and denote its degree by D(v). The degree is the number of neighbours of this node. Each of the other n − 1 nodes is connected to our node v with probability p, independently of all other nodes. Thus the distribution of D(v) is Binomial with parameters n − 1 and p, for each node v.


Typically we look at relatively sparse graphs, and so a Poisson approximation applies. If X denotes the random adjacency matrix then, in distribution,

D(v) = Σ_{u: u ≠ v} X_{u,v} ≈ Poisson((n − 1)p).

Note that the node degrees in a graph are not independent. Also, D(v) does not stand for the average node degree.


How about the average degree of a node? Denote it by D. Note that the average does not only take integer values, so it would certainly not be Poisson distributed. But

D = (1/n) Σ_{v=1}^n D(v) = (2/n) Σ_{v=1}^n Σ_{u<v} X_{u,v},

noting that each edge gets counted twice. As the X_{u,v} are independent, we can use a Poisson approximation again, giving that

Σ_{v=1}^n Σ_{u<v} X_{u,v} ≈ Poisson(n(n − 1)p/2)

and so, in distribution,

D ≈ (2/n) Z,

where Z ∼ Poisson(n(n − 1)p/2).


3.3.2 The clustering coefficient of a random node

Here it gets a little tricky already. Recall that the clustering coefficient of a node v is

C(v) = Σ_{u,w∈V} X_{u,v} X_{w,v} X_{u,w} / Σ_{u,w∈V} X_{u,v} X_{w,v}.

The ratio of two random sums is not easy to evaluate. If we just look at

Σ_{u,w∈V} X_{u,v} X_{w,v} X_{u,w}

then we see that we have a sum of dependent random variables.


Most 3-tuples (u, w, v) and (r, s, t), though, will not share an index, and hence X_{u,v}X_{w,v}X_{u,w} and X_{r,s}X_{s,t}X_{r,t} will be independent. The dependence among the random variables overall is hence weak, so that a Poisson approximation applies. As

E Σ_{u,w,v∈V} X_{u,v} X_{w,v} X_{u,w} = (n choose 3) p³,

we obtain that, in distribution,

Σ_{u,w,v∈V} X_{u,v} X_{w,v} X_{u,w} ≈ Poisson((n choose 3) p³).

Similarly,

E Σ_{u,w∈V} X_{u,v} X_{w,v} = (n choose 2) p².


K. Lin (2007) showed that, for the average clustering coefficient

C = (1/n) Σ_v C(v),

it is also true that, in distribution,

C ≈ [1/(n (n choose 2) p²)] Z,

where Z ∼ Poisson((n choose 3) p³).


Example. In the Florentine family data, we observe a total number of 20 edges, an average node degree of 2.5, and an average clustering coefficient of 0.1914894, with 16 nodes in total. Under the null hypothesis that the data come from a Bernoulli random graph we estimate

p̂ = 20/(16 choose 2) = (20 × 2)/(16 × 15) = 1/6,

and the average node degree would be

D ≈ (1/8) Z,

where Z ∼ Poisson(20). The probability under the null hypothesis that D ≥ 2.5 would then be

P(Z ≥ 2.5 × 8) = P(Z ≥ 20) ≈ 0.53,

so there is no reason to reject the null hypothesis.
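The tail probability in this example can be computed directly from the Poisson probability mass function; the helper name below is ours:

```python
# Tail probability P(Z >= m) for Z ~ Poisson(lam), via the pmf.
from math import exp, factorial

def poisson_tail(lam, m):
    """P(Z >= m) = 1 - P(Z <= m - 1) for Z ~ Poisson(lam)."""
    return 1.0 - sum(exp(-lam) * lam ** k / factorial(k) for k in range(m))

tail = poisson_tail(20, 20)
print(tail)  # ≈ 0.53
```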


3.3.3 Shortest paths: Connectivity in Bernoulli random graphs

Erdos and Renyi (1960) showed the following "phase transition" for the connectedness of a Bernoulli random graph.

If p = p(n) = (log n)/n + c/n + o(1/n), then the probability that a Bernoulli graph on n nodes with edge probability p, denoted by G(n, p), is connected converges to e^{−e^{−c}}.


The diameter of a graph is the maximum diameter of its connected components; the diameter of a connected component is the longest shortest path length in that component.

Chung and Lu (2001) showed that, if np ≥ 1 then, asymptotically, the ratio between the diameter and log n / log(np) is at least 1, and remains bounded above as n → ∞.

If np → ∞ then the diameter of the graph is (1 + o(1)) log n / log(np). If np/log n → ∞, then the diameter is concentrated on at most two values.

In the physics literature, the value log n / log(np) is used for the average shortest path length in a Bernoulli random graph. This hence has to be taken with more than a grain of salt.


While we have some idea about how the diameter (and, relatedly, the shortest path length) behaves, it is an inconvenient statistic for Bernoulli random graphs, because the graph need not be connected.


3.4 The distribution of summary statistics in Watts-Strogatz small worlds

Recall that in this model we arrange the n nodes of V on a lattice, then hard-wire each node to its k nearest neighbours on each side of it on the lattice, where k is small. Thus there are nk edges in this hard-wired lattice. Now introduce random shortcuts between nodes which are not hard-wired; the shortcuts are chosen independently, all with the same probability φ.

Thus the shortcuts behave like a Bernoulli random graph, but the graph will necessarily be connected. The degree D(v) of a node v in the Watts-Strogatz small world is hence distributed as

D(v) = 2k + Binomial(n − 2k − 1, φ),

taking the fixed lattice into account. Again we can derive a Poisson approximation when φ is small; see K. Lin (2007) for the details.


For the clustering coefficient there is a problem: triangles in the graph may now appear in clusters. Each shortcut between nodes u and v which are a distance of k + a ≤ 2k apart on the circle creates k − a − 1 triangles automatically.

Thus a Poisson approximation will not be suitable; instead we use a compound Poisson distribution. A compound Poisson distribution arises as the distribution of a Poisson number of clusters, where the cluster sizes are independent and have some distribution themselves. In general there is no closed form for a compound Poisson distribution.


The compound Poisson distribution also has to be used when approximating the number of 4-cycles in the graph, or the number of other small subgraphs which have the clumping property.

It is also worth noting that in the joint distribution of the number of triangles and the number of 4-cycles, these counts are not independent, not even in the limit; a bivariate compound Poisson approximation with dependent components is required. See Lin (2007) for details.


3.4.1 The shortest path length

Let D denote the shortest distance between two randomly chosen points, and abbreviate ρ = 2kφ. Then Barbour and Reinert show that, uniformly in |x| ≤ (1/4) log(nρ),

P(D > (1/2) log(nρ) + x) = ∫_0^∞ e^{−y}/(1 + e^{2x} y) dy + O((nρ)^{−1/5} log²(nρ))

if the probability of shortcuts is small. If the probability of shortcuts is relatively large, then D will be concentrated on one or two points.

Note that D is the shortest distance between two randomly chosen points, not the average shortest path. Again the difference can be considerable (Computer exercise).


Dependent sampling

Our data are usually just one graph, and we calculate all shortest paths. But there can be much overlap between shortest paths, creating dependence.


Simulation: n = 500, k = 1, φ = 0.01


Simulation: dependent vs independent

We simulate 100 replicas, and calculate the average shortest path length in each network. We compare this distribution to the theoretical approximate distribution; we carry out 100 chi-square tests:

n     k  φ      E.no  mean p-value  max p-value
300   1  0.01      3  1.74E-09      8.97E-08
300   1  0.167    50  0.1978        0.8913
300   2  0.01      6  0             0
1000  1  0.003     3  1.65E-13      3.30E-12
1000  1  0.05     50  0.0101        0.1124
1000  2  0.03     60  0.0146        0.2840


Thus the two statistics are close if the expected number E.no of shortcuts is large (or very small); otherwise they are significantly different.


3.5 The distribution of summary statistics in Barabasi-Albert models

The node degree distribution is given by the model directly, as that is how it is designed.

The clustering coefficient depends highly on the chosen model. In the original Barabasi-Albert model, when only one new edge is created at any single time, there will be no triangles (beyond those from the initial graph). The model can be extended to match any clustering coefficient, but even if only two edges are attached at the same time, the distribution of the clustering coefficient is unknown to date.


The expected value, however, can be approximated. Fronczak et al.(2003) studied the models where the network starts to grwo froman initial cluster of m fully connected nodes. Each new node that isadded to the network created m edges which connect it topreviously added nodes. The probability of a new edge to beconnected to a node v is proportional to the degree d(v) of thisnode. If both the number of nodes, n, and m are large, then theexpected average clustering coefficient is

EC = ((m − 1)/8) · (log n)²/n.

The average path length ` increases approximately logarithmically with network size. If γ = 0.5772 denotes the Euler constant, then Fronczak et al. (2004) show for the mean average shortest path length that

E` ∼ (log n − log(m/2) − 1 − γ)/(log log n + log(m/2)) + 3/2.

The asymptotic distribution is not understood.
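The two approximations above are easy to evaluate numerically. The following sketch (the function names are ours, not from the lecture) computes both for a Barabasi-Albert network with n nodes and m edges per new node; recall that both formulas are only meant as large-n, large-m approximations.

```python
import math

def expected_clustering(n, m):
    # EC = ((m - 1)/8) * (log n)^2 / n   (Fronczak et al. 2003)
    return (m - 1) / 8 * math.log(n) ** 2 / n

def expected_path_length(n, m):
    # El ~ (log n - log(m/2) - 1 - gamma) / (log log n + log(m/2)) + 3/2
    # (Fronczak et al. 2004), with gamma the Euler constant
    gamma = 0.5772
    num = math.log(n) - math.log(m / 2) - 1 - gamma
    den = math.log(math.log(n)) + math.log(m / 2)
    return num / den + 1.5

print(expected_clustering(10**4, 3))   # small: clustering vanishes as n grows
print(expected_path_length(10**4, 3))  # grows roughly like log n / log log n
```

Note that EC decreases in n, in line with the observation that preferential attachment alone does not produce the high clustering seen in many real networks.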

3.6 The distribution of summary statistics in exponential random graph models

The distribution of the node degree, clustering coefficient, and the shortest path length is poorly studied in these models. One reason is that these models are designed to predict missing edges and to infer characteristics of nodes, but the topology itself has not often been of interest.

The summary statistics appearing in the model try to push the random networks towards certain behaviour with respect to these statistics, depending on the sign and the size of their factors θ.

When only the average node degree and the clustering coefficient are included in the model, then a strange phenomenon happens. For many combinations of parameter values the model produces networks that are either full (every edge exists) or empty (no edge exists) with probability close to 1. Even for parameters which do not produce this phenomenon, the distribution of networks produced by the model is often bimodal: one mode is sparsely connected and has a high number of triangles, while the other mode is densely connected but with a low number of triangles. Again: active research.

4 Statistical tests for model fit: nonparametric methods

What if we do not have a suitable test statistic for which we know the distribution? We need some handle on the distribution, so here we assume that we can simulate random samples from our null distribution. There are a number of methods available. It is often a good idea to use plots to visually assess the fit first, via a quantile-quantile plot. Then what?

4.1 Monte-Carlo tests

The Monte Carlo test, attributed to Dwass (1957) and Barnard (1963), is an exact procedure of virtually universal application, and is correspondingly widely used.

Suppose that we would like to base our test on the statistic T0. We only need to be able to simulate a random sample T01, T02, . . . from the distribution, call it F0, determined by the null hypothesis. We assume that F0 is continuous, and, without loss of generality, that we reject the null hypothesis H0 for large values of T0. Then, provided that α = m/(n + 1) is rational, we can proceed as follows.

1. Observe the actual value t∗ of T0, calculated from the data.

2. Simulate a random sample t01, . . . , t0n of size n from F0.

3. Order the set {t∗, t01, . . . , t0n}.

4. Reject H0 if the rank of t∗ in this set (in decreasing order) is ≤ m.

The basis of this test is that, under H0, the random variable T∗ has the same distribution as the remainder of the set and so, by symmetry,

P(t∗ is among the largest m values) = m/(n + 1).
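A minimal sketch of this procedure in Python (the null distribution and the statistic below are placeholder choices of ours, not from the notes): with n = 99 simulated values and m = 5, the test has exact level α = 5/100 = 0.05.

```python
import random

def monte_carlo_test(t_star, simulate_null, n=99, m=5, rng=random):
    """Exact Monte Carlo test at level m / (n + 1): reject H0 if the
    observed value t_star ranks among the m largest of the pooled set."""
    sample = [simulate_null(rng) for _ in range(n)]
    # rank of t_star in decreasing order within {t_star, t_01, ..., t_0n}
    rank = 1 + sum(1 for t in sample if t > t_star)
    return rank <= m

# illustrative null distribution: standard normal; reject for large values
rng = random.Random(42)
print(monte_carlo_test(3.5, lambda r: r.gauss(0.0, 1.0), rng=rng))  # extreme t*: reject
print(monte_carlo_test(0.0, lambda r: r.gauss(0.0, 1.0), rng=rng))  # typical t*: retain
```

The exactness of the level does not depend on n; as the notes say next, n only affects the power.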

The procedure is exact however small n might be; increasing n increases the power of the test. The question of how large n should be is discussed by Marriott (1979); see also Hall and Titterington (1989).

An alternative view of the procedure is to count the number M of simulated values > t∗. Then M/n estimates the true significance level P achieved by the data, i.e.

P = P(T0 > t∗ | H0).

In discrete data, we will typically observe ties. We can break ties randomly; then the above procedure will still be valid.

Unfortunately this test does not lead directly to confidence intervals.

For random graphs, Monte Carlo tests often use shuffling edges with the number of edges fixed, or fixing the node degree distribution, or fixing some other summary.

Suppose we want to see whether our observed clustering coefficient is "unusual" for the type of network we would like to consider. Then we may draw many networks uniformly at random from all networks having the same node degree sequence, say. We count how often a clustering coefficient at least as extreme as ours occurs, and we use that to test the hypothesis.

In practice these types of test are the most widely used tests in network analysis. They are called conditional uniform graph tests.
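As an illustration (assuming the networkx package; the reference graph, the swap counts, and the number of replicas are ad-hoc choices of ours), a conditional uniform graph test for the clustering coefficient, with the degree sequence held fixed via degree-preserving double-edge swaps, might look like:

```python
import networkx as nx

# observed network and its global clustering coefficient (transitivity)
observed = nx.karate_club_graph()
t_obs = nx.transitivity(observed)

# null sample: graphs with the same degree sequence, approximated by
# many degree-preserving double-edge swaps of the observed graph
null_stats = []
for i in range(99):
    G = observed.copy()
    nx.double_edge_swap(G, nswap=10 * G.number_of_edges(),
                        max_tries=10**5, seed=i)
    null_stats.append(nx.transitivity(G))

# Monte Carlo p-value: how often is a rewired graph at least as clustered?
p = (1 + sum(t >= t_obs for t in null_stats)) / (1 + len(null_stats))
print("observed transitivity:", round(t_obs, 3), " p-value:", round(p, 3))
```

A small p-value would suggest that the observed clustering is not explained by the degree sequence alone. Note that long runs of double-edge swaps only approximate a uniform draw; this is exactly the caveat discussed below.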

Some caveats:

In Bernoulli random graphs, the number of edges asymptotically determines the number of triangles when the number of edges is moderately large. Thus conditioning on the number of edges (or the node degrees, which determine the number of edges) gives degenerate results. More generally, we have seen that node degrees and clustering coefficient (and other subgraph counts) are not independent, nor are they independent of the shortest path length. By fixing one summary we may not know exactly what we are testing against.

"Drawing uniformly at random" from complex networks is not as easy as it sounds. Algorithms may not explore the whole data set.

"Drawing uniformly at random", conditional on some summaries being fixed, is related to sampling from exponential random graphs. We have seen already that in exponential random graphs there may be more than one stationary distribution for the Markov chain Monte Carlo algorithm; this algorithm is similar to the one used for drawing at random, and so we may have to expect similar phenomena.

4.2 Scale-free networks

Barabasi and Albert introduced networks such that the distribution of node degrees is of the type

Prob(degree = k) ∼ C k^(−γ)

for k → ∞. Such behaviour is called power-law behaviour; the constant γ is called the power-law exponent. The networks are also called scale-free:

If α > 0 is a constant, then

Prob(degree = αk) ∼ C(αk)^(−γ) = Cα^(−γ) k^(−γ) = C′ k^(−γ),

where C′ is just a new constant. That is, scaling the argument in the distribution changes the constant of proportionality as a function of the scale change, but preserves the shape of the distribution itself.

If we take logarithms on both sides:

log Prob(degree = k) ∼ log C − γ log k

log Prob(degree = αk) ∼ log C − γ log α − γ log k;

scaling the argument results in a linear shift of the log probabilities only. This equation also leads to the suggestion to plot the log relfreq(degree = k) of the empirical relative degree frequencies against log k. Such a plot is called a log-log plot. If the model is correct, then we should see a straight line; the slope would be our estimate of −γ.

Example: Yeast data.

[Figure: two log-log plots for the yeast data. Left panel: log P(k) against log k; right panel: log P(greater than k) against log k.]

These plots have a lot of noise in the tails. As an alternative, Newman (2005) suggests plotting the log of the empirical cumulative distribution function instead, or, equivalently, our estimate for

log Prob(degree ≥ k).

If the model is correct, then one can calculate that

log Prob(degree ≥ k) ∼ C′′ − (γ − 1) log k.

Thus a log-log plot should again give a straight line, but with a shallower slope. The tails are somewhat less noisy in this plot.
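A quick numerical check of this shallower slope (an illustrative sketch; the Pareto sampler and all parameter values are our own choices): we simulate a continuous power-law sample with γ = 2.5 by inverse transform, so that Prob(X ≥ x) = (x/x_min)^(−(γ−1)), and fit the log-log slope of the empirical complementary CDF; it should come out near −(γ − 1) = −1.5 rather than −γ.

```python
import math
import random

rng = random.Random(0)
gamma, x_min, n = 2.5, 1.0, 20000
# inverse-transform sampling from a continuous Pareto distribution
xs = sorted(x_min * (1 - rng.random()) ** (-1 / (gamma - 1)) for _ in range(n))

# empirical CCDF: a fraction (n - i) / n of the sample is >= xs[i]
log_x = [math.log(x) for x in xs[:-1]]
log_ccdf = [math.log((n - i) / n) for i in range(n - 1)]

# least-squares slope of log CCDF against log x
mx = sum(log_x) / len(log_x)
my = sum(log_ccdf) / len(log_ccdf)
slope = (sum((a - mx) * (b - my) for a, b in zip(log_x, log_ccdf))
         / sum((a - mx) ** 2 for a in log_x))
print(round(slope, 2))  # should be close to -(gamma - 1) = -1.5
```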

In both cases, the slope is estimated by least-squares regression: for our observations y(k) (which could be log probabilities or log cumulative probabilities, for example) we find the line a + bk such that

∑k (y(k) − a − bk)²

is as small as possible.

As a measure of fit, the sample correlation R² is computed. For general observations y(k) and x(k), for k = 0, 1, . . . , n, with averages ȳ and x̄, it is defined as

R = ∑k (x(k) − x̄)(y(k) − ȳ) / √( (∑k (x(k) − x̄)²)(∑k (y(k) − ȳ)²) ).

It measures the strength of the linear relationship.
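A direct transcription of this formula as code (the data points below are made-up illustrative values lying roughly on a line of slope −1):

```python
def r_squared(x, y):
    # squared sample correlation R^2 of the points (x(k), y(k))
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)

x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [0.1, -1.0, -2.1, -2.9, -4.1]   # roughly on a line of slope -1
print(round(r_squared(x, y), 3))    # prints 0.997
```

Even this near-perfect fit only just clears the rule of thumb for log-log plots stated next.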

In linear regression, R² > 0.9 would be rather impressive. However, the rule of thumb for log-log plots is that

1. R² > 0.99;

2. the observed data (degrees) should cover at least 3 orders of magnitude.

Examples include the World Wide Web at some stage, when it had around 10^9 nodes. The criteria are not often matched.

A final issue for scale-free networks: it has been shown (Stumpf et al., 2005) that when the underlying real network is scale-free, then a subsample on fewer nodes from the network will not be scale-free. Thus if our subsample looks scale-free, the underlying real network will not be scale-free.

In biological network analysis, it is debated how useful the concept of "scale-free" behaviour is, as many biological networks contain relatively few nodes.

For Bernoulli random graphs we have a reasonable grasp on the distribution of our network summaries; we can use maximum-likelihood estimation and we can use the distribution of the summaries for testing hypotheses. For Watts-Strogatz small worlds some results are available. A main observation is that, in contrast to Bernoulli random graphs, even for counting triangles a compound Poisson approximation is needed, rather than a Poisson approximation. The underlying issue is that triangles (and also other motifs) occur in clumps.

For random graphs, Monte Carlo tests often use shuffling edges with the number of edges fixed, or fixing the node degree distribution, or fixing some other summary. Then we use a Monte Carlo test to see whether our test statistic is unusual, compared to graphs drawn at random with the same number of edges, or the same node degree distribution, say. The results have to be interpreted carefully.
