
Large graph analysis

Paola Vocca - Università della Tuscia

Outline

• Large graphs: Social graphs, web graphs …

• Extremal measures: diameter, centrality, eccentricity, average distance, separation degree

• Exact and approximate algorithms and data structures

o Exact

o Sampling

o Data stream model: probabilistic estimators

3/16/2017 Big Data: Tecnologie, metodologie e applicazioni per l'analisi dei dati massivi

Graphs

• Graphs represent relations between «things» or entities

• Nodes or vertices represent the entities

• Edges represent the relation


Relation: Who is the master of whom?

Bow tie structure of the Web Graph


• An AltaVista crawl of 200 million pages and 1.5 billion links.

• A giant strongly connected component containing 28% of the nodes.

A. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins, and J. Wiener. Graph structure in the Web: experiments and models. Computer Networks, 33(1–6):309–320, 2000.

Graphs

Dolphin interactions


Graphs example


Graph Datasets

Hyperlinks (the Web)

Social graphs (Facebook, Twitter, LinkedIn,…)

Email logs, phone call logs, messages

Commerce transactions (Amazon purchases)

Road networks

Communication networks

Protein interactions

…


Properties

Directed/Undirected

Snapshot or with time dimension (dynamic)

One or more types of entities (people, pages, products)

Metadata associated with nodes or edges (labels)

Some graphs are really large: billions of edges for Facebook and Twitter graphs


Mining the graph

Connected/Strongly connected components

Eccentricity e(v) (the maximum distance d(v, u) over all nodes u)

Radius r(G): the minimum eccentricity of the nodes. A node u is central if e(u) = r(G), and the center of G is the set of all central nodes.

Diameter (the longest shortest s-t path)

Effective diameter (90th percentile of the pairwise distances)

Distance distribution (number of pairs within each distance)

Average distance

Degree distribution

Clustering coefficient: ratio of the number of closed triplets (three nodes forming a triangle) to the number of all connected triplets, open or closed.

Centrality

…..
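The eccentricity, radius, center and diameter listed above can be sketched with one BFS per node. This is a minimal illustration, assuming a connected, undirected, unweighted graph stored as an adjacency dictionary; the toy path graph and the function names are hypothetical and not part of the slides.

```python
from collections import deque

def bfs_distances(adj, source):
    """Return {node: hop distance from source} for every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def eccentricities(adj):
    """Eccentricity of every node (assumes the graph is connected)."""
    return {u: max(bfs_distances(adj, u).values()) for u in adj}

# Hypothetical toy graph: the path a - b - c - d - e
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "e"], "e": ["d"]}

ecc = eccentricities(adj)
radius = min(ecc.values())        # minimum eccentricity
diameter = max(ecc.values())      # maximum eccentricity
center = sorted(u for u, e in ecc.items() if e == radius)
```

On this path the middle node is the only central node. The cost is one BFS per node, which is what becomes prohibitive on very large graphs.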


…Mining the link structure

• Centrality (who are the most important nodes?)

• Similarity of nodes (link prediction, targeted ads, friend/product recommendations, metadata completion)

• Communities: sets of nodes that are more tightly related to each other than to others

• Cover: a set of nodes with good coverage (facility location, influence maximization)


Connected components

Number of connected components: 2
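Counting connected components can be sketched with a simple graph traversal. The example below assumes an undirected graph given as an adjacency dictionary; the toy graph (a triangle plus a separate edge) is hypothetical and is not the graph shown on the slide.

```python
def connected_components(adj):
    """Return the connected components of an undirected graph as sorted lists."""
    seen = set()
    components = []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        stack = [start]
        component = []
        while stack:                      # iterative depth-first traversal
            u = stack.pop()
            component.append(u)
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        components.append(sorted(component))
    return components

# Hypothetical toy graph: triangle {1, 2, 3} and edge {4, 5}
adj = {1: [2, 3], 2: [1, 3], 3: [1, 2], 4: [5], 5: [4]}
components = connected_components(adj)
number_of_components = len(components)
```

Each node and edge is visited once, so the traversal is linear in the size of the graph.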


Eccentricity


[Figure: example graph with each node labelled by its eccentricity; two nodes have Ecc = 2 and the other labelled nodes have Ecc = 3.]

Diameter

Diameter is 3


Distance distribution

#nodes: 13

#edges: 27

#pairs: 78

Distance 1: 27 (number of edges)

Distance 2: 33

Distance 3: 18

[Figure: example graph on nodes 1 to 13.]

Average distance

#nodes: 13

#edges: 27

#pairs: 78

$\mathrm{AvgDist} = \frac{1}{78}\,(27 \cdot 1 + 33 \cdot 2 + 18 \cdot 3) \approx 1.88$

[Figure: example graph on nodes 1 to 13.]
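The arithmetic behind the average distance can be checked directly from the distance distribution of the example (27 pairs at distance 1, 33 at distance 2, 18 at distance 3). The sketch below also derives an effective diameter, assuming the simple convention "smallest distance that covers at least 90% of the pairs"; other conventions (for instance with interpolation) exist, and the function name is hypothetical.

```python
# Distance distribution of the 13-node example: distance -> number of pairs
distance_counts = {1: 27, 2: 33, 3: 18}

pairs = sum(distance_counts.values())                      # 78 = 13 * 12 / 2
average_distance = sum(d * c for d, c in distance_counts.items()) / pairs

def effective_diameter(counts, quantile=0.9):
    """Smallest distance d such that at least `quantile` of the pairs are within d."""
    total = sum(counts.values())
    cumulative = 0
    for d in sorted(counts):
        cumulative += counts[d]
        if cumulative >= quantile * total:
            return d

diameter = max(distance_counts)
eff_diameter = effective_diameter(distance_counts)
```

Here the average is 147 / 78, about 1.88, and both the diameter and the 90% effective diameter equal 3, because only 60 of the 78 pairs lie within distance 2.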

Triangles


[Figure: example graph on nodes 1 to 13 with a closed triangle highlighted.]

• Social graphs have many more closed triangles than random graphs

• “Communities” have more closed triangles
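Triangle counting and a clustering coefficient can be sketched as follows. The sketch assumes the global (transitivity) definition, that is, closed connected triplets divided by all connected triplets; the toy graph and the function name are hypothetical.

```python
from itertools import combinations

def triangles_and_triplets(adj):
    """Count triangles and connected triplets of an undirected graph.

    adj maps each node to the set of its neighbours.
    """
    closed = 0     # connected triplets whose two end nodes are also adjacent
    triplets = 0   # all connected triplets, counted at their middle node
    for u, neighbours in adj.items():
        for v, w in combinations(sorted(neighbours), 2):
            triplets += 1
            if w in adj[v]:
                closed += 1
    return closed // 3, triplets   # each triangle closes three triplets

# Hypothetical toy graph: triangle 1-2-3 with a pendant node 4 attached to 3
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}

triangles, triplets = triangles_and_triplets(adj)
clustering = 3 * triangles / triplets
```

The toy graph has one triangle and five connected triplets, so three out of five triplets are closed.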

Communities


[Figure: example graph on nodes 1 to 13 split into two communities, labelled "Star Wars" and "Ninjago".]

Communities in Les Miserables Network


The network of interactions between major characters in the novel Les Miserables by Victor Hugo

Centrality

• Which are the most important nodes?

– It depends on the criteria and on what we want to model

• Degree (in/out): largest number of followers or friends. Easy to compute locally. Spammable.

• PageRank: your importance/reputation depends recursively on that of your friends

• Betweenness: your value as a “hub”, being on a shortest path between many pairs.

• Closeness: centrally located, able to quickly reach/infect many nodes; usually defined as the inverse of the average distance to the other nodes (the inverse of the eccentricity is a related measure)
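Two of these criteria are easy to sketch on a small graph. The example assumes a connected, undirected graph and takes closeness as (n - 1) divided by the sum of the distances to all other nodes, which is one common convention; the star-shaped toy graph and the function names are hypothetical.

```python
from collections import deque

def bfs_distances(adj, source):
    """Return {node: hop distance from source} for every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def degree_centrality(adj):
    """Number of neighbours of each node."""
    return {u: len(neighbours) for u, neighbours in adj.items()}

def closeness_centrality(adj):
    """(n - 1) / (sum of distances to all other nodes), connected graphs only."""
    n = len(adj)
    return {u: (n - 1) / sum(bfs_distances(adj, u).values()) for u in adj}

# Hypothetical toy graph: a star with one hub and three leaves
adj = {"hub": ["a", "b", "c"], "a": ["hub"], "b": ["hub"], "c": ["hub"]}

degree = degree_centrality(adj)
closeness = closeness_centrality(adj)
```

In the star both criteria agree that the hub is the most important node; on real graphs the rankings can differ.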

Centrality: example


[Figure: example graph on nodes 1 to 13.]

• Central nodes with respect to all criteria

Random graphs and real graphs

• Experiments show that statistical measures of real complex networks differ significantly from those of randomly generated graphs

• Biological networks, social networks, the Internet and the Web have similar measures.


• High aggregation degree: clustering coefficient

• Low separation degree: average distance

Degree distribution

• In a random graph (each link is created according to a uniform probability distribution) all nodes have the same importance: the probability that $v$ is connected to $v'$ is the same as the probability that it is connected to $v''$.

• The probability that a node has degree exactly $k$ is

$P(k) = \frac{\text{number of nodes with degree } k}{\text{number of nodes of the graph}}$
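The empirical degree distribution P(k) can be computed in a single pass over the adjacency lists. This is a minimal sketch on a hypothetical star-shaped toy graph; the function name is an assumption.

```python
from collections import Counter

def degree_distribution(adj):
    """Return {k: fraction of nodes whose degree is exactly k}."""
    n = len(adj)
    counts = Counter(len(neighbours) for neighbours in adj.values())
    return {k: c / n for k, c in sorted(counts.items())}

# Hypothetical toy graph: a star with one hub and three leaves
adj = {"hub": ["a", "b", "c"], "a": ["hub"], "b": ["hub"], "c": ["hub"]}
P = degree_distribution(adj)
```

Three of the four nodes have degree 1 and one node has degree 3, and the values of P sum to 1.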


Random graphs vs. real graphs

• In random graphs, nodes with a low or a high degree are rare

• Most of the nodes have a degree close to the average

Degree distribution

• Hubs are shortcuts on the paths, and so they influence the separation degree

Small-world effect: very few nodes with many connections, which ensure the speed of transmission of information (or gossip ...) in the network

• Social networks: a few individuals with many friends (celebrities)

• Web: a few websites with lots of links

• Metabolic networks: a few metabolites participating in many metabolic processes

• Internet: it works well because any two computers are connected by no more than 10 or 20 "hops" (physical links)


The small-world effect is crucial:

• in the study of epidemic spreading

• in the service-optimization problem in telecommunications networks

• in the study of neuronal interconnections in the brain

• in ecological networks, etc.

Small-world effect

• A network evolves over time, and each new node joining the network does not have the same chance of connecting to every existing node

• It is more likely that the new node connects to an already connected node than to an isolated node: preferential attachment
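Preferential attachment can be sketched with a very small growth model. The version below is an assumption made for illustration (one edge per new node, target chosen with probability proportional to its current degree); it is not necessarily the exact model behind the slide, and all names are hypothetical.

```python
import random

def preferential_attachment_graph(n, seed=0):
    """Grow a graph on nodes 0..n-1, one edge per new node.

    Each new node links to an existing node picked with probability
    proportional to that node's current degree.
    """
    rng = random.Random(seed)
    edges = [(1, 0)]          # start from a single edge between nodes 1 and 0
    endpoints = [0, 1]        # every node appears once per incident edge
    for new in range(2, n):
        target = rng.choice(endpoints)   # degree-proportional choice
        edges.append((new, target))
        endpoints.extend([new, target])
    return edges

def degrees(edges):
    """Return {node: degree} for an edge list."""
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    return deg

edges = preferential_attachment_graph(1000)
deg = degrees(edges)
largest_degree = max(deg.values())
```

Picking uniformly from the `endpoints` list is what makes the choice proportional to degree. Comparing `largest_degree` with the average degree (just under 2 here) gives a feeling for how pronounced the hubs are.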


Why?

Hubs are:

• a reason for network resiliency: if a random node fails, it is rarely a hub, so connectivity remains "guaranteed"

• but they are also the target of targeted attacks!

Small world effect


Kevin Bacon number    # of people
0                     1
1                     3303
2                     381495
3                     1383150
4                     356429
5                     30815
6                     3640
7                     584
8                     116
9                     26
10                    1

Total number of linkable actors: 2159560

Weighted total of linkable actors: 6522634

Average Kevin Bacon number: 3.020
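The totals of the Kevin Bacon table can be recomputed from its rows: the weighted total is the sum of (Bacon number × number of people) and the average is the weighted total divided by the number of linkable actors. The variable names below are hypothetical.

```python
# Bacon number -> number of people, copied from the table
bacon_counts = {
    0: 1, 1: 3303, 2: 381495, 3: 1383150, 4: 356429, 5: 30815,
    6: 3640, 7: 584, 8: 116, 9: 26, 10: 1,
}

linkable_actors = sum(bacon_counts.values())
weighted_total = sum(number * people for number, people in bacon_counts.items())
average_bacon_number = weighted_total / linkable_actors
```

This is the average distance from a single source (Kevin Bacon), not the average over all pairs of actors.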

How good a center is Kevin Bac

Facebook

• Boldi, Rosa, and Vigna. HyperANF: approximating the neighbourhood function of very large graphs on a budget. In WWW 2011.

• The average distance of Facebook (721.1M nodes and 68.7G edges) is 4.7 and the diameter is 41.


Neighbourhood function

• The neighbourhood function $N(t)$ of a graph returns, for each $t \in \mathbb{N}$, the number of pairs of nodes $(x, y)$ such that $y$ is reachable from $x$ in at most $t$ steps.

• All the previous measures on graphs can be derived from the computation of $N(t)$.
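How the measures follow from the neighbourhood function can be sketched on a tiny graph. The code assumes the "at most t steps" convention of the HyperANF paper, counts ordered pairs (including each node with itself), and computes N(t) exactly with one BFS per node; the toy path graph and the names are hypothetical.

```python
from collections import deque

def neighbourhood_function(adj):
    """Return [N(0), N(1), ...]: ordered pairs (x, y) with y within t steps of x."""
    exact = {}   # exact[d] = number of ordered pairs at distance exactly d
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        for d in dist.values():
            exact[d] = exact.get(d, 0) + 1
    N, running = [], 0
    for d in range(max(exact) + 1):
        running += exact.get(d, 0)
        N.append(running)
    return N

# Hypothetical toy graph: the path a - b - c
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}

N = neighbourhood_function(adj)
pairs_at = [N[t] - N[t - 1] for t in range(1, len(N))]   # distance distribution
diameter = len(N) - 1                                    # last t at which N grows
average_distance = sum(t * c for t, c in enumerate(pairs_at, start=1)) / sum(pairs_at)
```

The differences N(t) - N(t-1) give the distance distribution, from which the diameter, the effective diameter and the average distance follow.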


Real graphs are huge

The Internet 2003


Graph colors:

Asia Pacific – Red

Europe/Middle East/Central Asia/Africa – Green

North America – Blue

Latin America and Caribbean – Yellow

RFC1918 IP Addresses – Cyan

Unknown – White

Figure by the Opte Project (www.opte.org).

The vertices are “class C subnets”: groups of computers with similar Internet addresses, usually managed by a single organization. The connections represent the routes taken by data packets as they hop between subnets. The geometric positions of the vertices are chosen simply to give a pleasing layout and are not related to geographic position.

Internet 2010


Internet 2015


Graph colors:

North America (ARIN)

Europe (RIPE)

Latin America (LACNIC)

Asia Pacific (APNIC)

Africa (AFRINIC)

“Backbone” (highly connected networks)

Exact computations

How can one compute the distance distribution?

• Weighted graphs:

o Dijkstra (single-source): $O(n^2)$

o Floyd-Warshall (all-pairs): $O(n^3)$

• Unweighted graphs:

o a single BFS solves the single-source version of the problem: $O(m)$

o if we repeat it from every source: $O(mn)$

• Matrix multiplication: still too expensive.

o $O(n^{(3+\omega)/2} \log n)$, where $\omega$ is the exponent of matrix multiplication.

o U. Zwick. All pairs shortest paths using bridging sets and rectangular matrix multiplication. J. ACM, 49(3):289–317, 2002.


Computing on Very Large Graphs

Sampling

Approximation

Probabilistic counter

General algorithm design principles:

Keep total computation/communication/storage “linear” in the size of the data

Parallelize (minimize chains of dependencies)

Localize dependencies


Sampling

• Strategy

o Sample at random a source x

o Compute a full BFS from x

• It is an unbiased estimator only for undirected and connected graphs

o It still uses BFS...

o ...not cache friendly

o ...not compression friendly

The average distance can be obtained by sampling, for each node, only $O\left(\frac{\log n}{\varepsilon^2}\right)$ random nodes rather than all $n$ nodes, with an error of $\varepsilon$, reducing the time complexity to $O\left(\frac{\log n}{\varepsilon^2}\,(n \log n + m)\right)$.

David Eppstein and Joseph Wang. 2001. Fast approximation of centrality. In Proceedings of the twelfth annual ACM-SIAM symposium on Discrete algorithms (SODA '01). Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 228-229.
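The sampling strategy can be sketched as follows: pick a few random sources, run a full BFS from each, and average the distances found. This is an illustration only, assuming a connected, undirected, unweighted graph; the sample size k is left as a free parameter rather than tied to the bound above, and the function names and the cycle-shaped toy graph are hypothetical.

```python
import random
from collections import deque

def bfs_distances(adj, source):
    """Return {node: hop distance from source} for every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def sampled_average_distance(adj, k, seed=0):
    """Estimate the average distance from k BFS runs started at random sources."""
    rng = random.Random(seed)
    nodes = list(adj)
    total = count = 0
    for _ in range(k):
        source = rng.choice(nodes)           # sources are drawn with replacement
        dist = bfs_distances(adj, source)
        total += sum(dist.values())
        count += len(dist) - 1               # exclude the source itself
    return total / count

# Hypothetical toy graph: a cycle on 6 nodes
n = 6
adj = {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}
estimate = sampled_average_distance(adj, k=3)
```

On a cycle every node sees the same distances (1, 1, 2, 2, 3), so any sample gives exactly 9 / 5 = 1.8. On real graphs the estimate varies with the sample, which is why the number of sampled sources matters.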


Sampling

Sampling algorithms

• Sampling by random node selection (RN): RN has been shown not to retain the power-law degree distribution

o Vertex choice can be made proportional to its PageRank

o Random Degree Node (RDN) sampling has even more bias towards high-degree nodes.

• Sampling by random edge selection: sampled graphs will be very sparsely connected, will thus have a large diameter, and will not respect the community structure.

• Sampling by exploration

o Random Node Neighbor (RNN)

o Random Walk (RW)

o Random Jump (RJ)

Jure Leskovec and Christos Faloutsos. 2006. Sampling from large graphs. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining (KDD '06). ACM, New York, NY, USA, 631-636


Diffusion

Basic idea

• ANF: Christopher R. Palmer, Phillip B. Gibbons, and Christos Faloutsos. 2002. ANF: a fast and scalable tool for data mining in massive graphs. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining (KDD '02). ACM, New York, NY, USA, 81-90.

• HyperANF: Paolo Boldi, Marco Rosa, and Sebastiano Vigna. 2011. HyperANF: approximating the neighbourhood function of very large graphs on a budget. In Proceedings of the 20th international conference on World wide web (WWW '11). ACM, New York, NY, USA, 625-634.

• Let $B_t(x)$ be the ball of radius $t$ about $x$ (the set of nodes at distance at most $t$ from $x$)

• Clearly $B_0(x) = \{x\}$

• Moreover $B_{t+1}(x) = \bigcup_{x \to y} B_t(y) \cup \{x\}$

• So computing $B_{t+1}$ starting from $B_t$ just needs a single (sequential) scan of the graph
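The diffusion step can be sketched with exact Python sets standing in for the balls. This assumes the ball of radius t contains all nodes within distance t, and that the graph is given as successor lists (x -> y); the toy path graph and the function name are hypothetical. ANF and HyperANF replace the exact sets with small probabilistic counters.

```python
def diffuse_balls(adj, steps):
    """Compute B_t(x) for all x by repeated scans of the successor lists.

    Returns the final balls and the list [N(0), ..., N(steps)],
    where N(t) is the sum of |B_t(x)| over all nodes x.
    """
    balls = {x: {x} for x in adj}                       # B_0(x) = {x}
    sizes = [sum(len(ball) for ball in balls.values())]
    for _ in range(steps):
        new_balls = {}
        for x, successors in adj.items():               # one scan of the graph
            ball = {x}
            for y in successors:
                ball |= balls[y]                        # union of B_t(y) over x -> y
            new_balls[x] = ball
        balls = new_balls
        sizes.append(sum(len(ball) for ball in balls.values()))
    return balls, sizes

# Hypothetical toy graph: the path a - b - c, stored with arcs in both directions
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
balls, sizes = diffuse_balls(adj, steps=2)
```

Each iteration reads every arc once and needs only the balls of the previous step, which is what makes the scheme sequential-scan friendly.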


Example


[Figure: node $x$ with successors $x_1$, $x_2$, $x_3$; the ball $B_2(x)$ is built from $B_1(x_1)$, $B_1(x_2)$ and $B_1(x_3)$.]

Easy but expensive

• Every set requires $O(n)$ bits, hence $O(n^2)$ bits overall

• Too many!

• What about using approximated sets?

• We need probabilistic counters, with just two primitives: add and size


ANF and HyperANF

• ANF: uses the probabilistic counter of Flajolet & Martin (1985); implemented in the SNAP framework

• HyperANF: uses HyperLogLog counters [Flajolet et al., 2007]; implemented in the WebGraph framework to study the web graph.

o With 40 bits you can count up to 4 billion with a standard deviation of 6%

https://snap.stanford.edu/snap/doc/snapuser-ref/index.html

http://webgraph.di.unimi.it/docs/overview-summary.html

Probabilistic counter: Streaming model

Sequence of elements from some domain

<x1, x2, x3, x4, ..... >

Bounded storage:

working memory << stream size

usually $O(\log^k n)$ or $O(n^\alpha)$ for $\alpha < 1$

Fast processing time per stream element


Counting Distinct Elements

Keys occur multiple times, we want to count the number of

distinct keys in the stream

In this example:

Number of distinct keys is $n = 6$

Number of stream elements is 11

32, 12, 14, 32, 7, 12, 32, 7, 6, 12, 4

Distinct Elements: Approximate Counting

Exact counting of $n$ distinct elements requires a structure of size $\Omega(n)$

We are often happy with an approximate count obtained using a

small-size working memory.

We want to be able to compute and maintain a small sketch $s(N)$ of the set $N$ of distinct items seen so far: $N = \{32, 12, 14, 7, 6, 4\}$

32, 12, 14, 32, 7, 12, 32, 7, 6, 12, 4

Wanted: Distinct Elements Sketch

Small size: $|s(N)| \ll |N| = n$

Can query $s(N)$ to get a good estimate $\hat{n}(s)$ of $n$ (small relative error)

Streaming: for a new element $x$, it is easy to compute $s(N \cup \{x\})$ from $s(N)$ and $x$

Mergeability: if $N_1$ and $N_2$ are (possibly overlapping) sets, then we can compute the union sketch $s(N_1 \cup N_2)$ from $s(N_1)$ and $s(N_2)$


MinHash Sketch: [Flajolet & Martin 85, …]

32, 12, 14, 32, 7, 12, 32, 7, 6, 12, 4

$h(x) \sim U[0,1]$: $h$ is a random hash function from keys to uniform random numbers in $[0,1]$

Maintain the Min-Hash value $y$:

Initialize $y \leftarrow 1$

Processing an element with key $x$: $y \leftarrow \min\{y, h(x)\}$



x      32    12    14    32    7     12    32    7     6     12    4
n      1     2     3     3     4     4     4     4     5     5     6
h(x)   0.45  0.35  0.74  0.45  0.21  0.35  0.45  0.21  0.14  0.35  0.92
y      0.45  0.35  0.35  0.35  0.21  0.21  0.21  0.21  0.14  0.14  0.14

The minimum hash value $y = \min_x h(x)$ is non-increasing and unaffected by repeated elements.

Precise relation: $\mathbf{E}[y] = \frac{1}{n+1}$



How does the minimum hash 𝑦 give information on the number of distinct elements 𝑛 ?

[Figure: the interval from 0 to 1 with the minimum hash value marked.]

The expectation of the minimum is $\mathbf{E}[\min_x h(x)] = \frac{1}{n+1}$

A single value gives only limited information. To boost information, we maintain 𝒌 ≥ 𝟏 values
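One way to keep k values is a bottom-k sketch, which stores the k smallest distinct hash values seen so far. The slides do not say which variant is meant, so this choice is an assumption, as are the class and function names, the SHA-256-based stand-in for a random hash function, and the estimator (k - 1) / y_k, where y_k is the k-th smallest hash value.

```python
import hashlib
import heapq

def h(key):
    """Deterministic stand-in for a random hash from keys to [0, 1]."""
    digest = hashlib.sha256(str(key).encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

class BottomKSketch:
    """Keeps the k smallest distinct hash values seen so far (illustrative only)."""

    def __init__(self, k):
        self.k = k
        self.values = set()

    def add(self, key):
        # Repeated keys hash to the same value, so they never change the sketch.
        self.values.add(h(key))
        if len(self.values) > self.k:
            self.values.remove(max(self.values))

    def merge(self, other):
        """Sketch of the union of the two underlying sets (same k assumed)."""
        merged = BottomKSketch(self.k)
        merged.values = set(heapq.nsmallest(self.k, self.values | other.values))
        return merged

    def size(self):
        if len(self.values) < self.k:
            return float(len(self.values))       # fewer than k distinct keys: exact
        return (self.k - 1) / max(self.values)   # estimate from the k-th smallest value

stream = [32, 12, 14, 32, 7, 12, 32, 7, 6, 12, 4]
sketch = BottomKSketch(k=16)
for key in stream:
    sketch.add(key)
distinct_estimate = sketch.size()
```

With only 6 distinct keys and k = 16 the sketch holds every hash value, so the count is exact. For longer streams the relative error shrinks roughly like 1 / sqrt(k). The `add`, `size` and `merge` methods correspond to the streaming, query and mergeability requirements listed earlier.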


Advantages

• Scalability: a minimum of 20 bytes per node

• On a 2 TiB machine, 100 billion nodes

• The algorithms can be implemented on scalable architectures for big data (Hadoop, Spark)


Comparing approaches


THANK YOU!!

