Large graph analysis
Paola Vocca - Università della Tuscia
Outline
• Large graphs: Social graphs, web graphs …
• Extremal measures: Diameter, centrality, eccentricity, average
distance, separation degree;
• Exact and approximate algorithms and data structures
o Exact
o Sampling
o Data stream model: Probabilistic estimator
3/16/2017 Big Data: Tecnologie, metodologie e applicazioni per l'analisi dei dati massivi
Graphs
• A graph represents relations between «things» or
entities
• Nodes or vertices represent the entities
• Edges represent the relations
Relation: Who is the master of whom?
Bow tie structure of the Web Graph
• An AltaVista crawl of 200
million pages and 1.5 billion
links.
• A giant strongly connected
component containing 28%
of the nodes.
A. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins,
and J. Wiener. Graph structure in the Web: experiments and models. Computer
Networks, 33(1–6):309–320, 2000.
Graphs
Dolphin interactions
Graphs example
Graph Datasets
Hyperlinks (the Web)
Social graphs (Facebook, Twitter, LinkedIn,…)
Email logs, phone call logs, messages
Commerce transactions (Amazon purchases)
Road networks
Communication networks
Protein interactions
…
Properties
Directed/Undirected
Snapshot or with time dimension (dynamic)
One or more types of entities (people, pages, products)
Meta data associated with nodes or edges (labels)
Some graphs are really large: billions of edges for Facebook and Twitter graphs
Mining the graph
Connected/Strongly connected components
Eccentricity e(v): the maximum distance d(v, u) over all nodes u.
Radius r(G): the minimum eccentricity of the nodes. A node u is central if e(u) = r(G), and the center of G is the set of all central nodes.
Diameter (the longest shortest s-t path)
Effective diameter (90th percentile of the pairwise distances)
Distance distribution (number of pairs at each distance)
Average distance
Degree distribution
Clustering coefficient: ratio of the number of closed triplets (three nodes that form a triangle) to the number of all connected triplets, open or closed.
Centrality
…..
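The distance-based definitions above (eccentricity, radius, center, diameter) can be sketched with one BFS per node. This is a minimal illustration on a hypothetical 5-node path graph, not the graph used on the slides; the function names are assumptions introduced here, and the graph is assumed to be undirected and connected.

```python
from collections import deque

def bfs_distances(adj, source):
    """Return {node: hop distance from source} for every node reachable from source."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def eccentricities(adj):
    """e(v) = maximum distance from v to any other node (connected graph assumed)."""
    return {v: max(bfs_distances(adj, v).values()) for v in adj}

# Hypothetical path graph a - b - c - d - e.
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "e"], "e": ["d"]}

ecc = eccentricities(adj)
radius = min(ecc.values())                                  # minimum eccentricity
diameter = max(ecc.values())                                # maximum eccentricity
center = sorted(v for v, e in ecc.items() if e == radius)   # all central nodes
print(ecc, radius, diameter, center)
```

On this path the middle node is the only central node, and the diameter is the distance between the two endpoints.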
…Mining the link structure
• Centrality (who are the most important nodes?)
• Similarity of nodes (link prediction, targeted ads,
friend/product recommendations, Meta-Data completion)
• Communities: set of nodes that are more tightly related to
each other than to others
• “Cover”: set of nodes with good coverage (facility location,
influence maximization)
Connected components
Number of connected components: 2
Eccentricity
[Figure: example graph labelled with node eccentricities; two nodes have Ecc = 2 and ten nodes have Ecc = 3]
Diameter
Diameter is 3
Distance distribution
#nodes: 13
#edges: 27
#pairs: 78
Distance 1: 27 (number of edges)
[Figure: example graph on nodes 1–13]
Distance distribution
#nodes: 13
#edges: 27
#pairs: 78
Distance 1: 27 (number of edges)
Distance 2: 33
[Figure: example graph on nodes 1–13]
Distance distribution
#nodes: 13
#edges: 27
#pairs: 78
Distance 1: 27 (number of edges)
Distance 2: 33
Distance 3: 18
[Figure: example graph on nodes 1–13]
Average distance
#nodes: 13
#edges: 27
#pairs: 78
$\mathrm{AvgDist} = \frac{1}{78}\,(27 \cdot 1 + 33 \cdot 2 + 18 \cdot 3) \approx 1.88$
[Figure: example graph on nodes 1–13]
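The average-distance arithmetic for the 13-node example can be checked directly from the distance distribution given on the slides (27, 33 and 18 pairs at distances 1, 2 and 3). The variable names are illustrative assumptions.

```python
from fractions import Fraction

# Distance distribution from the slides: distance -> number of unordered pairs.
distance_counts = {1: 27, 2: 33, 3: 18}

n_nodes = 13
n_pairs = n_nodes * (n_nodes - 1) // 2            # 13 * 12 / 2 unordered pairs
assert sum(distance_counts.values()) == n_pairs   # every pair of nodes is connected

# Sum of all pairwise distances: 27*1 + 33*2 + 18*3.
total_distance = sum(d * c for d, c in distance_counts.items())
avg_dist = Fraction(total_distance, n_pairs)
print(total_distance, n_pairs, round(float(avg_dist), 2))  # 147 78 1.88
```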
Triangles
[Figure: example graph on nodes 1–13 with a closed triangle highlighted]
• Social graphs have many more closed triangles than random graphs
• “Communities” have more closed triangles
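One common way to quantify "how many triangles are closed" is the global clustering coefficient (transitivity): three times the number of triangles divided by the number of connected triples. The sketch below uses that definition, which is an assumption about the intended measure, on a hypothetical 4-node graph.

```python
from itertools import combinations

def triangles_and_triples(adj):
    """Count triangles and connected triples (paths of length 2) in an undirected graph.

    adj maps each node to the set of its neighbours.
    """
    closed_at_centre = 0   # neighbour pairs of a centre node that are themselves linked
    triples = 0            # all neighbour pairs of a centre node
    for nbrs in adj.values():
        for a, b in combinations(sorted(nbrs), 2):
            triples += 1
            if b in adj[a]:
                closed_at_centre += 1
    triangles = closed_at_centre // 3   # each triangle is seen once from each corner
    return triangles, triples

def global_clustering(adj):
    triangles, triples = triangles_and_triples(adj)
    return 3 * triangles / triples if triples else 0.0

# Hypothetical graph: triangle 1-2-3 plus a pendant node 4 attached to node 3.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
print(triangles_and_triples(adj), global_clustering(adj))  # (1, 5) 0.6
```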
Communities
[Figure: example graph on nodes 1–13 divided into two communities, labelled “Star Wars” and “Ninjago”]
Communities in Les Miserables Network
The network of interactions between major characters in the novel Les Miserables by Victor Hugo
Centrality
• Which are the most important nodes?
– Depends on the criteria and what we want to model
• Degree (in/out): largest number of followers, friends. Easy to
compute locally. Spammable.
• PageRank: Your importance/reputation recursively depends
on that of your friends
• Betweenness: Your value as a “hub” -- being on a shortest
path between many pairs.
• Closeness: Centrally located, able to quickly reach/infect many nodes. Usually defined as the inverse of the average distance to the other nodes (the inverse of the eccentricity is a related, worst-case variant).
Centrality: example
[Figure: example graph on nodes 1–13]
• Central nodes with respect to all criteria
Random graphs and real graphs
• Experiments show that statistical measures of real complex networks differ significantly from those of randomly
generated graphs
• Biological networks, social networks, Internet, Web have similar
measures.
• High aggregation degree: clustering coefficient
• Low separation degree: average distance
• Degree distribution
• In a random graph (a link is created according to a uniform probability distribution) all nodes have the same importance: the probability that $v$ is connected to $v'$ is the same as the probability that it is connected to $v''$.
• The probability that a node has exactly degree $k$ is
$P(k) = \frac{\text{number of nodes with degree } k}{\text{number of nodes of the graph}}$
Random graphs vs. real graphs
In random graphs:
• Nodes with a low or a high degree are rare
• Most of the nodes have a degree close to the average
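The empirical degree distribution P(k) is just a normalised histogram of node degrees. A minimal sketch on a hypothetical star graph (one hub and four leaves); the function name is an assumption introduced for illustration.

```python
from collections import Counter

def degree_distribution(adj):
    """Return {k: P(k)}: the fraction of nodes whose degree is exactly k."""
    n = len(adj)
    degree_counts = Counter(len(nbrs) for nbrs in adj.values())
    return {k: count / n for k, count in sorted(degree_counts.items())}

# Hypothetical star graph: hub 0 linked to leaves 1..4.
adj = {0: {1, 2, 3, 4}, 1: {0}, 2: {0}, 3: {0}, 4: {0}}
P = degree_distribution(adj)
print(P)  # {1: 0.8, 4: 0.2}
```

Even this tiny example has the shape associated with hubs: most nodes have low degree and a single node has a much higher one.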
Degree distribution
• Hubs act as shortcuts on paths, thereby lowering the
separation degree
Small-world effect: very few nodes with many connections, which ensure the
speed of transmission of information (or
gossip ...) in the network
• Social networks: a few individuals with many friends (celebrities)
• Web: few websites with lots of links
• Metabolic networks: a few metabolites participating in many metabolic processes
• Internet: it works well because any two computers are connected by no more than 10 or 20 "hops" (physical links)
The small-world effect is crucial
• in the study of epidemic spreading,
• in the service-optimization problem in telecommunications networks,
• in the study of neuronal interconnections in the brain,
• in ecological networks, etc.
Small-world effect
• A network evolves over time, and a new node joining the network
is not equally likely to connect to every existing node;
• the new node is more likely to connect to an already well-connected
node than to an isolated one: preferential attachment
Why?
Hubs are
• a reason for network resiliency: when a random node fails it is rarely
a hub, so the connectivity remains "guaranteed"
• but they are also the natural target of deliberate attacks!
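Preferential attachment can be simulated in a few lines. This is a simplified sketch, assuming each new node adds exactly one edge and picks its target with probability proportional to the target's current degree; the function names and the single-edge rule are assumptions made for illustration, not a model specified on the slides.

```python
import random

def preferential_attachment(n, seed=0):
    """Grow an undirected graph on nodes 0..n-1, adding one node and one edge at a time."""
    rng = random.Random(seed)
    edges = [(0, 1)]       # start from a single edge
    endpoints = [0, 1]     # node v appears in this list once per incident edge
    for new in range(2, n):
        # Choosing uniformly from the endpoint list is a degree-proportional choice.
        target = rng.choice(endpoints)
        edges.append((new, target))
        endpoints.extend([new, target])
    return edges

def degrees(edges):
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    return deg

edges = preferential_attachment(1000)
deg = degrees(edges)
print("max degree:", max(deg.values()), "average degree:", sum(deg.values()) / len(deg))
```

The average degree stays just below 2, while the maximum degree is typically much larger: a few early nodes become hubs.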
Small world effect
Kevin Bacon Number   # of People
0                    1
1                    3303
2                    381495
3                    1383150
4                    356429
5                    30815
6                    3640
7                    584
8                    116
9                    26
10                   1
Total number of linkable actors: 2159560
Weighted total of linkable actors: 6522634
Average Kevin Bacon number: 3.020
How good a center isKevin Bac
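The totals and the average in the Kevin Bacon table can be recomputed from its rows: the average Bacon number is the weighted total divided by the number of linkable actors. The dictionary below copies the table; the variable names are illustrative.

```python
# Bacon number -> number of actors, copied from the table.
bacon_counts = {0: 1, 1: 3303, 2: 381495, 3: 1383150, 4: 356429,
                5: 30815, 6: 3640, 7: 584, 8: 116, 9: 26, 10: 1}

linkable = sum(bacon_counts.values())                     # total number of linkable actors
weighted = sum(k * c for k, c in bacon_counts.items())    # sum of all Bacon numbers
average = weighted / linkable
print(linkable, weighted, round(average, 3))  # 2159560 6522634 3.02
```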
• Boldi, Rosa, and Vigna. HyperANF: approximating the neighbourhood function of very large graphs on a budget. In
WWW 2011.
• The average distance of Facebook (721.1M nodes and 68.7G
edges) is 4.7 and the diameter is 41.
Neighbourhood function
• The neighbourhood function $N(t)$ of a graph returns, for each $t \in \mathbb{N}$, the number of pairs of nodes $(x, y)$ such that $y$ is reachable from
$x$ in at most $t$ steps.
• All the previous distance measures on graphs can be derived from the
computation of $N(t)$.
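A sketch of how distance measures fall out of the neighbourhood function. It assumes N(t) counts ordered pairs (x, y), with x = y included, such that y is reachable from x in at most t steps, and it uses a simple integer version of the effective diameter (no interpolation). The values of N are those of the 13-node example: 13 pairs at distance 0, then twice 27, 33 and 18 ordered pairs.

```python
def distance_counts(N):
    """Ordered pairs at distance exactly t, for every t >= 1 where N(t) grows."""
    return {t: N[t] - N[t - 1] for t in range(1, len(N)) if N[t] > N[t - 1]}

def average_distance(N):
    counts = distance_counts(N)
    return sum(t * c for t, c in counts.items()) / sum(counts.values())

def diameter(N):
    """Largest t at which new pairs are still being reached."""
    return max(distance_counts(N))

def effective_diameter(N, fraction=0.9):
    """Smallest t such that at least `fraction` of the reachable pairs are within t steps."""
    return next(t for t in range(len(N)) if N[t] >= fraction * N[-1])

# N(0), N(1), N(2), N(3) for the 13-node example graph.
N = [13, 13 + 2 * 27, 13 + 2 * 27 + 2 * 33, 13 + 2 * 27 + 2 * 33 + 2 * 18]
print(distance_counts(N), round(average_distance(N), 2), diameter(N), effective_diameter(N))
```

The average distance agrees with the value computed from the distance distribution (about 1.88), since counting ordered instead of unordered pairs doubles both numerator and denominator.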
Real graphs are huge
The Internet 2003
Graph Colors:
Asia Pacific – Red
Europe/Middle East/Central Asia/Africa – Green
North America – Blue
Latin American and Caribbean – Yellow
RFC1918 IP Addresses – Cyan
Unknown – White
Figure by the Opte Project (www.opte.org).
The vertices are “class C subnets”: groups of computers with similar Internet addresses, usually managed by a single organization. The connections represent the routes taken by data packets as they hop between subnets. The geometric positions of the vertices are chosen simply to give a pleasing layout and are not related to geographic position.
Internet 2010
Internet 2015
Graph Colors:
North America (ARIN)
Europe (RIPE)
Latin America (LACNIC)
Asia Pacific (APNIC)
Africa (AFRINIC)
“Backbone” (highly connected networks)
Exact computations
How can one compute the distance distribution?
• Weighted graphs:
o Dijkstra (single-source: $O(n^2)$),
o Floyd-Warshall (all-pairs: $O(n^3)$)
• Unweighted graphs:
o a single BFS solves the single-source version of the problem: $O(m)$
o if we repeat it from every source: $O(mn)$
• Matrix multiplication:
o $O(n^{(3+\omega)/2} \log n)$, where $\omega$ is the exponent of matrix
multiplication.
o U. Zwick. All pairs shortest paths using bridging sets and rectangular matrix
multiplication. J. ACM, 49(3):289–317, 2002.
Still too expensive.
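For unweighted graphs, the exact distance distribution can be obtained with one BFS per source, for O(mn) time overall. A minimal sketch on a hypothetical 6-node cycle (not the slides' graph); the function names are assumptions, and the graph is assumed undirected.

```python
from collections import Counter, deque

def bfs_distances(adj, source):
    """Single-source hop distances in O(m) time."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def distance_distribution(adj):
    """Number of unordered pairs at each distance, using one BFS per source."""
    counts = Counter()
    for s in adj:
        for d in bfs_distances(adj, s).values():
            if d > 0:
                counts[d] += 1
    # Every unordered pair was counted twice, once from each endpoint.
    return {d: c // 2 for d, c in sorted(counts.items())}

# Hypothetical cycle 0-1-2-3-4-5-0.
n = 6
adj = {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}
dist = distance_distribution(adj)
print(dist)  # {1: 6, 2: 6, 3: 3}
```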
Computing on Very Large Graphs
Sampling
Approximation
Probabilistic counter
General algorithm design principles:
keep total computation/communication/storage “linear” in the size of the data
Parallelize (minimize chains of dependencies)
Localize dependencies
Sampling
• Strategy
o Sample a source x at random
o Compute a full BFS from x
• It is an unbiased estimator only for undirected and connected graphs
o It still uses BFS...
o ...not cache friendly
o ...not compression friendly
The average distance can be obtained by computing, for each node, the distance to only $O\left(\frac{\log n}{\varepsilon^2}\right)$
random nodes rather than to all $n$ nodes, with an error of $\varepsilon$, reducing the time complexity to
$O\left(\frac{\log n}{\varepsilon^2}\,(n \log n + m)\right)$.
David Eppstein and Joseph Wang. 2001. Fast approximation of centrality. In Proceedings of the twelfth annual ACM-SIAM symposium on Discrete algorithms (SODA '01). Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 228-229.
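The sampling strategy can be sketched as follows: run a full BFS from k random sources and average the distances found. The sample size is taken as ceil(log n / ε²); the constant factor 1 and the function names are assumptions for illustration, and the graph is assumed undirected and connected.

```python
import math
import random
from collections import deque

def bfs_distances(adj, source):
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def sampled_average_distance(adj, k, seed=0):
    """Estimate the average distance from full BFS runs started at k random sources."""
    rng = random.Random(seed)
    nodes = list(adj)
    total = 0
    for _ in range(k):
        source = rng.choice(nodes)                    # sample a source uniformly at random
        total += sum(bfs_distances(adj, source).values())
    return total / (k * (len(nodes) - 1))             # each BFS reaches n - 1 other nodes

# Hypothetical 6-node cycle: every source sees the distances 1, 1, 2, 2, 3.
n = 6
adj = {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}
epsilon = 0.5
k = math.ceil(math.log(n) / epsilon ** 2)   # order log(n) / eps^2; constant 1 is an assumption
estimate = sampled_average_distance(adj, k)
print(k, estimate)
```

On the cycle every source has the same average distance, so the estimate equals the exact value whatever sources are drawn; on irregular graphs the estimate varies with the sample.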
Sampling
Sampling algorithms
• Sampling by random node selection: Random Node (RN) sampling has been shown not to retain a power-law degree distribution
o The probability of choosing a vertex can be made proportional to its PageRank
o Random Degree Node (RDN) sampling has even more bias towards high degree nodes.
• Sampling by random edge selection: sampled graphs will be very sparsely connected, will thus have a large diameter, and will not respect community structure.
• Sampling by exploration
o Random Node Neighbor (RNN)
o Random Walk (RW)
o Random Jump (RJ)
Jure Leskovec and Christos Faloutsos. 2006. Sampling from large graphs. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining (KDD '06). ACM, New York, NY, USA, 631-636
Diffusion
Basic idea
• ANF: Christopher R. Palmer, Phillip B. Gibbons, and Christos Faloutsos. 2002. ANF: a fast and scalable tool for data mining in massive graphs. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining (KDD '02). ACM, New York, NY, USA, 81-90.
• HyperANF: Paolo Boldi, Marco Rosa, and Sebastiano Vigna. 2011. HyperANF: approximating the neighbourhood function of very large graphs on a budget. In Proceedings of the 20th international conference on World wide web (WWW '11). ACM, New York, NY, USA, 625-634.
• Let $B_t(x)$ be the ball of radius $t$ about $x$ (the set of nodes at distance $\le t$ from $x$)
• Clearly $B_0(x) = \{x\}$
• Moreover $B_{t+1}(x) = \bigcup_{x \to y} B_t(y) \cup \{x\}$
• So computing $B_{t+1}$ starting from $B_t$ just needs a single (sequential) scan of the graph
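The diffusion recurrence can be run exactly with one explicit set per node. This is a sketch of the idea only, on a hypothetical directed path: it takes the ball of radius t to contain all nodes at distance at most t, it is not ANF or HyperANF themselves, and the function name is an assumption.

```python
def neighbourhood_function(succ):
    """Exact N(t) by diffusion: B_{t+1}(x) = {x} plus the union of B_t(y) over arcs x -> y.

    succ maps each node to the list of its successors. One explicit set is stored
    per node, which is the memory cost that probabilistic counters avoid.
    """
    balls = {x: {x} for x in succ}                   # B_0(x) = {x}
    N = [sum(len(b) for b in balls.values())]
    while True:
        new_balls = {}
        for x, ys in succ.items():                   # one sequential scan of the graph
            ball = {x}
            for y in ys:
                ball |= balls[y]
            new_balls[x] = ball
        total = sum(len(b) for b in new_balls.values())
        if total == N[-1]:                           # no ball grew: N(t) has stabilised
            return N
        balls = new_balls
        N.append(total)

# Hypothetical directed path a -> b -> c -> d.
succ = {"a": ["b"], "b": ["c"], "c": ["d"], "d": []}
print(neighbourhood_function(succ))  # [4, 7, 9, 10]
```

Only two operations are applied to each ball: adding elements (set union) and asking for its size.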
Example
[Figure: node $x$ with successors $x_1$, $x_2$, $x_3$; the ball $B_2(x)$ contains the balls $B_1(x_1)$, $B_1(x_2)$ and $B_1(x_3)$]
Easy but expensive
• Every set requires 𝑶(𝒏) bits, hence 𝑶(𝒏𝟐) bits overall
• Too many!
• What about using approximated sets?
• We need probabilistic counters, with just two primitives: add
and size?
ANF and HyperANF
• ANF: uses the probabilistic counters of Flajolet & Martin (1985);
implemented in the SNAP framework
• HyperANF: uses HyperLogLog counters [Flajolet et al., 2007];
implemented in the WebGraph framework to study the
web graph.
o With 40 bits you can count up to 4 billion with a standard deviation of 6%
Probabilistic counter: Streaming model
Sequence of elements from some domain
<x1, x2, x3, x4, ..... >
Bounded storage:
working memory << stream size
usually $O(\log^k n)$ or $O(n^\alpha)$ for $\alpha < 1$
Fast processing time per stream element
Counting Distinct Elements
Keys occur multiple times, we want to count the number of
distinct keys in the stream
In this example:
32, 12, 14, 32, 7, 12, 32, 7, 6, 12, 4
Number of distinct keys is $n = 6$
Number of stream elements is 11
Distinct Elements: Approximate Counting
Exact counting of $n$ distinct elements requires a structure of size
$\Omega(n)$
We are often happy with an approximate count obtained using a
small-size working memory.
We want to be able to compute and maintain a small sketch 𝒔(𝑵) of the set 𝑁 of distinct items seen so far 𝑵 = {𝟑𝟐, 𝟏𝟐, 𝟏𝟒, 𝟕, 𝟔, 𝟒}
Wanted: Distinct Elements Sketch
Small size: $|s(N)| \ll |N| = n$
Can query $s(N)$ to get a good estimate $\hat{n}(s)$ of $n$ (small relative
error)
Streaming: for a new element $x$, it is easy to compute $s(N \cup \{x\})$ from
$s(N)$ and $x$
Mergeability: if $N_1$ and $N_2$ are (possibly overlapping) sets, then we
can compute the union sketch $s(N_1 \cup N_2)$ from
$s(N_1)$ and $s(N_2)$
MinHash Sketch: [Flajolet & Martin 85, …]
32, 12, 14, 32, 7, 12, 32, 7, 6, 12, 4
$h(x) \sim U[0,1]$: $h$ is a random hash function from keys to uniform random numbers in $[0,1]$
Maintain the Min-Hash value $y$:
Initialize $y \leftarrow 1$
Processing an element with key $x$:
$y \leftarrow \min\{y, h(x)\}$
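The Min-Hash update rule can be sketched in a few lines. A truly random hash function is not available in practice, so this illustration substitutes a deterministic SHA-256-based function mapped to [0, 1]; that choice, the `salt` parameter and the class and method names are assumptions made here, not part of the slides.

```python
import hashlib

def h(key, salt=0):
    """Deterministic stand-in for a random hash function from keys to [0, 1]."""
    digest = hashlib.sha256(f"{salt}:{key}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2 ** 64

class MinHashSketch:
    def __init__(self, salt=0):
        self.salt = salt
        self.y = 1.0                                 # initialise y <- 1

    def add(self, key):
        self.y = min(self.y, h(key, self.salt))      # y <- min{y, h(x)}

    def merge(self, other):
        """Sketch of the union of the two underlying sets (same salt required)."""
        merged = MinHashSketch(self.salt)
        merged.y = min(self.y, other.y)
        return merged

stream = [32, 12, 14, 32, 7, 12, 32, 7, 6, 12, 4]
sketch = MinHashSketch()
for x in stream:
    sketch.add(x)
print(sketch.y)
```

Because repeated keys hash to the same value, feeding the stream or only its distinct keys gives the same `y`, and two sketches built on different parts of the stream merge into the sketch of the whole stream.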
Distinct Elements: Approximate Counting
x   32    12    14    32    7     12    32    7     6     12    4
n   1     2     3     3     4     4     4     4     5     5     6
y   0.45  0.35  0.35  0.35  0.21  0.21  0.21  0.21  0.14  0.14  0.14
Hash values: h(32) = 0.45, h(12) = 0.35, h(7) = 0.21, h(6) = 0.14; the two remaining values, 0.92 and 0.74, belong to keys 14 and 4.
The minimum hash value $y = \min h(x)$ is non-increasing and unaffected by repeated elements.
Precise relation: $\mathrm{E}[y] = \frac{1}{n+1}$
Distinct Elements: Approximate Counting
How does the minimum hash $y$ give information on the number of distinct elements $n$?
[Figure: the interval [0, 1] with the minimum hash value marked]
The expectation of the minimum is $\mathrm{E}[\min h(x)] = \frac{1}{n+1}$
A single value gives only limited information. To boost information, we maintain $k \ge 1$ values
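One simple way to use k values is to keep the minimum under k independent hash functions and invert the relation E[y] = 1/(n+1) on their average. This moment-based estimator is an illustrative assumption, not necessarily the estimator intended on the slides; the salted SHA-256 hash and the function names are also assumptions.

```python
import hashlib

def h(key, salt):
    """Deterministic stand-in for the salt-th random hash function into [0, 1]."""
    digest = hashlib.sha256(f"{salt}:{key}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2 ** 64

def k_mins_sketch(stream, k):
    """Minimum hash value under each of k independent hash functions (salts 0..k-1)."""
    ys = [1.0] * k
    for x in stream:
        for i in range(k):
            ys[i] = min(ys[i], h(x, i))
    return ys

def estimate_distinct(ys):
    """Invert E[y] = 1/(n+1) using the average of the k minima."""
    mean_y = sum(ys) / len(ys)
    return 1.0 / mean_y - 1.0

stream = [32, 12, 14, 32, 7, 12, 32, 7, 6, 12, 4]
ys = k_mins_sketch(stream, k=64)
print(estimate_distinct(ys))
```

The sketch uses k numbers regardless of the stream length; a larger k reduces the variance of the estimate at the cost of more memory and more hashing per element.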
Advantages
• Scalability: a minimum of 20 bytes per node
• On a 2TiB machine, 100 billion nodes
• The algorithms can be implemented on scalable architectures
for big data (Hadoop, Spark)
Comparing approaches
THANK YOU!!