15-388/688 -practical data science: graph and network processing · 2019. 12. 6. · networks are...
TRANSCRIPT
![Page 1: 15-388/688 -Practical Data Science: Graph and network processing · 2019. 12. 6. · Networks are the systems of interrelated objects (in the real world) Graphs are the mathematical](https://reader035.vdocuments.us/reader035/viewer/2022071008/5fc5970d1fde8307790e0d40/html5/thumbnails/1.jpg)
15-388/688 - Practical Data Science:Graph and network processing
J. Zico KolterCarnegie Mellon University
Fall 2019
1
![Page 2: 15-388/688 -Practical Data Science: Graph and network processing · 2019. 12. 6. · Networks are the systems of interrelated objects (in the real world) Graphs are the mathematical](https://reader035.vdocuments.us/reader035/viewer/2022071008/5fc5970d1fde8307790e0d40/html5/thumbnails/2.jpg)
OutlineNetworks and graph
Representing graphs
Graph algorithms
Graph libraries
2
![Page 3: 15-388/688 -Practical Data Science: Graph and network processing · 2019. 12. 6. · Networks are the systems of interrelated objects (in the real world) Graphs are the mathematical](https://reader035.vdocuments.us/reader035/viewer/2022071008/5fc5970d1fde8307790e0d40/html5/thumbnails/3.jpg)
OutlineNetworks and graph
Representing graphs
Graph algorithms
Graph libraries
3
![Page 4: 15-388/688 -Practical Data Science: Graph and network processing · 2019. 12. 6. · Networks are the systems of interrelated objects (in the real world) Graphs are the mathematical](https://reader035.vdocuments.us/reader035/viewer/2022071008/5fc5970d1fde8307790e0d40/html5/thumbnails/4.jpg)
Networks vs. graphs?Our terminology (fairly standard, though some use them differently):
Networks are the systems of interrelated objects (in the real world)Graphs are the mathematical model for representing networks
This lecture is largely about representations and algorithms for graphs
But of course, in data science we use these algorithms to answer questions about networks
4
![Page 5: 15-388/688 -Practical Data Science: Graph and network processing · 2019. 12. 6. · Networks are the systems of interrelated objects (in the real world) Graphs are the mathematical](https://reader035.vdocuments.us/reader035/viewer/2022071008/5fc5970d1fde8307790e0d40/html5/thumbnails/5.jpg)
Graphs modelsA graph is a collection of vertices (nodes) and edges 𝐺 = (𝑉 ,𝐸)
𝑉 = 𝐴,𝐵,𝐶,𝐷,𝐸,𝐹
𝐸 = 𝐴,𝐵 , 𝐴,𝐶 , 𝐵,𝐶 , 𝐶,𝐷 , 𝐷,𝐸 , 𝐷,𝐹 , 𝐸,𝐹
5
A C
B
D F
E
![Page 6: 15-388/688 -Practical Data Science: Graph and network processing · 2019. 12. 6. · Networks are the systems of interrelated objects (in the real world) Graphs are the mathematical](https://reader035.vdocuments.us/reader035/viewer/2022071008/5fc5970d1fde8307790e0d40/html5/thumbnails/6.jpg)
Directed vs. undirected graphs
6
UndirectedE.g. paper co-authorship
DirectedE.g. web links
A C
B D
A C
B D
![Page 7: 15-388/688 -Practical Data Science: Graph and network processing · 2019. 12. 6. · Networks are the systems of interrelated objects (in the real world) Graphs are the mathematical](https://reader035.vdocuments.us/reader035/viewer/2022071008/5fc5970d1fde8307790e0d40/html5/thumbnails/7.jpg)
Weighted vs. unweighted graphs
7
UnweightedE.g. friends on social network
WeightedE.g. travel distance
between cities
A C
B D
A C
B D
1
4 1 3
![Page 8: 15-388/688 -Practical Data Science: Graph and network processing · 2019. 12. 6. · Networks are the systems of interrelated objects (in the real world) Graphs are the mathematical](https://reader035.vdocuments.us/reader035/viewer/2022071008/5fc5970d1fde8307790e0d40/html5/thumbnails/8.jpg)
Some example graphs
8
PA road network:1M nodes, 3M edges
Patent citations:3.7M nodes, 16.5M edges
Internet topology (in 2005)1.6M nodes, 11M edges
LiveJournal social network4.8M nodes, 69M edges
Graphs from http://snap.stanford.edu, visualizations from http://www.cise.ufl.edu/research/sparse/matrices/SNAP/
![Page 9: 15-388/688 -Practical Data Science: Graph and network processing · 2019. 12. 6. · Networks are the systems of interrelated objects (in the real world) Graphs are the mathematical](https://reader035.vdocuments.us/reader035/viewer/2022071008/5fc5970d1fde8307790e0d40/html5/thumbnails/9.jpg)
OutlineNetworks and graph
Representing graphs
Graph algorithms
Graph libraries
9
![Page 10: 15-388/688 -Practical Data Science: Graph and network processing · 2019. 12. 6. · Networks are the systems of interrelated objects (in the real world) Graphs are the mathematical](https://reader035.vdocuments.us/reader035/viewer/2022071008/5fc5970d1fde8307790e0d40/html5/thumbnails/10.jpg)
Representations of graphsThere are a few different ways that graphs can be represented in a program, which one you choose depends on your use case
E.g., are you going to be modifying the graph dynamically (adding/removing nodes/edges), just analyzing a static graph, etc?
Three main types we will consider:1. Adjacency list2. Adjacency dictionary3. Adjacency matrix
10
![Page 11: 15-388/688 -Practical Data Science: Graph and network processing · 2019. 12. 6. · Networks are the systems of interrelated objects (in the real world) Graphs are the mathematical](https://reader035.vdocuments.us/reader035/viewer/2022071008/5fc5970d1fde8307790e0d40/html5/thumbnails/11.jpg)
Adjacency listFor each node, store an array of the nodes that it connects to
Pros: easy to get all outgoing links from a given node, fast to add new edges (without checking for duplicates)
Cons: deleting edges or checking existence of an edge requires scan through given node’s full adjacency array
11
Node EdgesA [B]B [C]C [A,D]D []
A C
B D
![Page 12: 15-388/688 -Practical Data Science: Graph and network processing · 2019. 12. 6. · Networks are the systems of interrelated objects (in the real world) Graphs are the mathematical](https://reader035.vdocuments.us/reader035/viewer/2022071008/5fc5970d1fde8307790e0d40/html5/thumbnails/12.jpg)
Adjacency dictionaryFor each node, store a dictionary of the nodes that it connects to
Pros: easy to add/remove/query edges (requires two dictionary lookups, so a 𝑂(1) operation)
Cons: overhead of using a dictionary over array
12
Node (key) EdgesA {B:1.0}B {C:1.0}C {A:1.0,D:1.0}D {}
A C
B D
![Page 13: 15-388/688 -Practical Data Science: Graph and network processing · 2019. 12. 6. · Networks are the systems of interrelated objects (in the real world) Graphs are the mathematical](https://reader035.vdocuments.us/reader035/viewer/2022071008/5fc5970d1fde8307790e0d40/html5/thumbnails/13.jpg)
Poll: complexity of graph operationsSuppose we have a directed graph with 𝑛 nodes where each node has fewer than some constant 𝑘 ≪ 𝑛 ingoing or outgoing edges. In the adjacency dictionary representation, which of the following operations are constant time (𝑂 1 ≡
𝑂(𝑘))?1. Checking if there is a link between two nodes 𝐴→ 𝐵
2. Finding all the outgoing edges of a node 𝐴3. Finding all the incoming edges of a node 𝐴4. Deleting all outgoing and incoming edges of a node 𝐴5. Deleting the link between two nodes 𝐴→ 𝐵
6. Adding a new node 𝑍 to the graph and adding links 𝐴→ 𝑍, 𝑍 → 𝐵
13
![Page 14: 15-388/688 -Practical Data Science: Graph and network processing · 2019. 12. 6. · Networks are the systems of interrelated objects (in the real world) Graphs are the mathematical](https://reader035.vdocuments.us/reader035/viewer/2022071008/5fc5970d1fde8307790e0d40/html5/thumbnails/14.jpg)
Adjacency matrixStore the connectivity of the graph as a matrix
In virtually all cases, you will want to store this as a sparse matrix
Pros/cons depend on which sparse matrix format you use, but most operations on a static graph will but much faster using the right format
Connection between adjacency list and sparse CSC format14
A C
B D
𝐴 =
0 0
1 0
0 1
0 0
1 0
0 0
0 0
1 0
(From)A B C D
ABCD
(To)
![Page 15: 15-388/688 -Practical Data Science: Graph and network processing · 2019. 12. 6. · Networks are the systems of interrelated objects (in the real world) Graphs are the mathematical](https://reader035.vdocuments.us/reader035/viewer/2022071008/5fc5970d1fde8307790e0d40/html5/thumbnails/15.jpg)
OutlineNetworks and graph
Representing graphs
Graph algorithms
Graph libraries
15
![Page 16: 15-388/688 -Practical Data Science: Graph and network processing · 2019. 12. 6. · Networks are the systems of interrelated objects (in the real world) Graphs are the mathematical](https://reader035.vdocuments.us/reader035/viewer/2022071008/5fc5970d1fde8307790e0d40/html5/thumbnails/16.jpg)
Graph algorithmsAlgorithms for graphs could be (in fact, is) an entire course on its own
We’re going to briefly highlight just three algorithms that address different problem classes in graphs
1. Finding shortest paths in a graph – Dijkstra’s algorithm2. Finding important nodes in a graph – PageRank3. Finding communities in a graph – Girvan-Newman
16
![Page 17: 15-388/688 -Practical Data Science: Graph and network processing · 2019. 12. 6. · Networks are the systems of interrelated objects (in the real world) Graphs are the mathematical](https://reader035.vdocuments.us/reader035/viewer/2022071008/5fc5970d1fde8307790e0d40/html5/thumbnails/17.jpg)
Shortest path problemClassical graph problem: find the shortest path between two nodes
Some important distinctions or modificationsWeighted vs. unweighted, directed vs. undirected, negative weightsSingle-source shortest path (we’ll do this one)All-pairs shortest path
17
![Page 18: 15-388/688 -Practical Data Science: Graph and network processing · 2019. 12. 6. · Networks are the systems of interrelated objects (in the real world) Graphs are the mathematical](https://reader035.vdocuments.us/reader035/viewer/2022071008/5fc5970d1fde8307790e0d40/html5/thumbnails/18.jpg)
Dijkstra’s algorithmAlgorithm for single-source shortest path
Basic idea: dynamic programming algorithm, at each node maintain an upper bound on distance to source, iteratively expand node with smallest upper bound (updating bounds of its neighbors)
18
Given: Graph 𝐺 = (𝑉 ,𝐸), Source 𝑠Initialize:
𝐷 𝑠 ← 0, 𝐷 𝑖 ≠ 𝑠 ←∞
𝑄← 𝑉
Repeat until 𝑄 empty:𝑖 ← Remove element from 𝑄 with smallest 𝐷For all 𝑗 such that 𝑖, 𝑗 ∈ 𝐸:
𝐷 𝑗 = min 𝐷 𝑗 ,𝐷 𝑖 + 1
![Page 19: 15-388/688 -Practical Data Science: Graph and network processing · 2019. 12. 6. · Networks are the systems of interrelated objects (in the real world) Graphs are the mathematical](https://reader035.vdocuments.us/reader035/viewer/2022071008/5fc5970d1fde8307790e0d40/html5/thumbnails/19.jpg)
Dijkstra’s algorithm exampleInitialization: source 𝐴
𝐷 = 0,∞,∞,∞
𝑄 = 𝐴,𝐵,𝐶,𝐷
Step 1: Pop node A𝑄 = 𝐵,𝐶,𝐷
𝐷 = 0,1,1,∞
Step 2: Pop node 𝐵𝑄 = 𝐶,𝐷
𝐷 = 0,1,1,∞
Step 3: Pop node 𝐶𝑄 = 𝐷
𝐷 = 0,1,1,2
Step 4: Pop node 𝐷𝑄 =
𝐷 = [0,1,1,2] 19
A C
B D
![Page 20: 15-388/688 -Practical Data Science: Graph and network processing · 2019. 12. 6. · Networks are the systems of interrelated objects (in the real world) Graphs are the mathematical](https://reader035.vdocuments.us/reader035/viewer/2022071008/5fc5970d1fde8307790e0d40/html5/thumbnails/20.jpg)
“Important” nodesWhat are the important nodes in the following network?
Unlike shortest path, there is not correct answer here, depends on how you define importance
20
![Page 21: 15-388/688 -Practical Data Science: Graph and network processing · 2019. 12. 6. · Networks are the systems of interrelated objects (in the real world) Graphs are the mathematical](https://reader035.vdocuments.us/reader035/viewer/2022071008/5fc5970d1fde8307790e0d40/html5/thumbnails/21.jpg)
PageRank algorithmThe algorithm that started Google
Perspective on importance: consider a random walk on the graphWe start at a random nodeWe repeatedly jump to a random neighboring nodeIf the node has no outgoing edges (in directed graph), jump to a random node(Optionally) also jump to a random node with probability 𝑑
Node importance is the probability that we will be at a given node when following the above procedure
21
![Page 22: 15-388/688 -Practical Data Science: Graph and network processing · 2019. 12. 6. · Networks are the systems of interrelated objects (in the real world) Graphs are the mathematical](https://reader035.vdocuments.us/reader035/viewer/2022071008/5fc5970d1fde8307790e0d40/html5/thumbnails/22.jpg)
PageRank algorithm
For those who have heard these terms, this algorithm is creating a Markov chain over the graph, and finding the stationary distribution (largest eigenvector) of this Markov chain
22
Given: Graph 𝐺 = 𝑉 ,𝐸 , restart probability 𝑑, iteration count 𝑇Initialize:
𝐴← Adjacency_Matrix 𝐺
𝑃 ← replace zero columns of 𝐴 with 1, and normalize columns𝑃 ← 1 − 𝑑 𝑃 +
V
W11
X
𝑥 ←1
|W |1
Repeat 𝑇 times:𝑥 ← 𝑃𝑥
![Page 23: 15-388/688 -Practical Data Science: Graph and network processing · 2019. 12. 6. · Networks are the systems of interrelated objects (in the real world) Graphs are the mathematical](https://reader035.vdocuments.us/reader035/viewer/2022071008/5fc5970d1fde8307790e0d40/html5/thumbnails/23.jpg)
PageRank example
23
A C
B D
𝐴 =
0 0
1 0
0 1
0 0
1 0
0 0
0 0
1 0
𝑃 =
0 0
1 0
0 1
0 0
0.5 0.25
0 0.25
0 0.25
0.5 0.25
𝑃 =
0.025 0.025
0.925 0.025
0.025 0.925
0.025 0.025
0.475 0.25
0.025 0.25
0.025 0.25
0.475 0.25
𝑑 = 0.1
𝑥 →
0.21
0.26
0.31
0.21
![Page 24: 15-388/688 -Practical Data Science: Graph and network processing · 2019. 12. 6. · Networks are the systems of interrelated objects (in the real world) Graphs are the mathematical](https://reader035.vdocuments.us/reader035/viewer/2022071008/5fc5970d1fde8307790e0d40/html5/thumbnails/24.jpg)
Poll: PageRankWhat would happen if we did not replace the all-zero columns in 𝐴 with all-ones?
1. The algorithm would take longer to run before it converged
2. Pages with no outgoing edges would increase in importance
3. Pages with no outgoing edges would decrease in importance
4. The sum of all probabilities (𝑥) would no longer sum to one
24
![Page 25: 15-388/688 -Practical Data Science: Graph and network processing · 2019. 12. 6. · Networks are the systems of interrelated objects (in the real world) Graphs are the mathematical](https://reader035.vdocuments.us/reader035/viewer/2022071008/5fc5970d1fde8307790e0d40/html5/thumbnails/25.jpg)
Community detectionCommunity: subgraphs where nodes are densely connected to each other, but sparsely connected to other nodes
A “soft” version of a clique (a fully connected subgraph)
A fundamental concept in e.g. social networks25
![Page 26: 15-388/688 -Practical Data Science: Graph and network processing · 2019. 12. 6. · Networks are the systems of interrelated objects (in the real world) Graphs are the mathematical](https://reader035.vdocuments.us/reader035/viewer/2022071008/5fc5970d1fde8307790e0d40/html5/thumbnails/26.jpg)
Girvan-Newman AlgorithmPublished in 2002 (Girvan and Newman, 2002), one of the first methods of “modern” community detection
Basic idea: Recursively partition the network by removing edges, groups that are last to be partitioned are “communities”
1. Compute “betweenness” of edges in the network = number of shortest paths that pass through each edge
2. Remove edge(s) with highest betweenness, if this breaks the graph into subgraphs, recursively partition each one
3. Result is a hierarchical partitioning of the graph
Challenge is efficiently computing betweenness as we partition graph (we will not cover this) 26
![Page 27: 15-388/688 -Practical Data Science: Graph and network processing · 2019. 12. 6. · Networks are the systems of interrelated objects (in the real world) Graphs are the mathematical](https://reader035.vdocuments.us/reader035/viewer/2022071008/5fc5970d1fde8307790e0d40/html5/thumbnails/27.jpg)
1
23
46
5
7
9
8
10
11
1213
14
Algorithmic illustration
27
49
33
121
![Page 28: 15-388/688 -Practical Data Science: Graph and network processing · 2019. 12. 6. · Networks are the systems of interrelated objects (in the real world) Graphs are the mathematical](https://reader035.vdocuments.us/reader035/viewer/2022071008/5fc5970d1fde8307790e0d40/html5/thumbnails/28.jpg)
1
23
46
5
7
9
8
10
11
1213
14
Algorithmic illustration
28
![Page 29: 15-388/688 -Practical Data Science: Graph and network processing · 2019. 12. 6. · Networks are the systems of interrelated objects (in the real world) Graphs are the mathematical](https://reader035.vdocuments.us/reader035/viewer/2022071008/5fc5970d1fde8307790e0d40/html5/thumbnails/29.jpg)
Algorithmic illustration
29
1
23
46
5
7
9
8
10
11
1213
14
![Page 30: 15-388/688 -Practical Data Science: Graph and network processing · 2019. 12. 6. · Networks are the systems of interrelated objects (in the real world) Graphs are the mathematical](https://reader035.vdocuments.us/reader035/viewer/2022071008/5fc5970d1fde8307790e0d40/html5/thumbnails/30.jpg)
Resulting hierarchy (dendrogram)
30
Communities can be extracted by looking at the grouping at different levels of the tree
May want to threshold on things like community size, etc
1 2 3 4 5 6 7 8 9 10 11 12 13 14
![Page 31: 15-388/688 -Practical Data Science: Graph and network processing · 2019. 12. 6. · Networks are the systems of interrelated objects (in the real world) Graphs are the mathematical](https://reader035.vdocuments.us/reader035/viewer/2022071008/5fc5970d1fde8307790e0d40/html5/thumbnails/31.jpg)
OutlineNetworks and graph
Representing graphs
Graph algorithms
Graph libraries
31
![Page 32: 15-388/688 -Practical Data Science: Graph and network processing · 2019. 12. 6. · Networks are the systems of interrelated objects (in the real world) Graphs are the mathematical](https://reader035.vdocuments.us/reader035/viewer/2022071008/5fc5970d1fde8307790e0d40/html5/thumbnails/32.jpg)
NetworkXNetworkX: Python library for dealing with (medium-sized) graphs
https://networkx.github.io/
Simple Python interface for constructing graph, querying information about the graph, and running a large suite of algorithms
Not suitable for very large graphs (all native Python, using adjacency dictionary representation)
32
![Page 33: 15-388/688 -Practical Data Science: Graph and network processing · 2019. 12. 6. · Networks are the systems of interrelated objects (in the real world) Graphs are the mathematical](https://reader035.vdocuments.us/reader035/viewer/2022071008/5fc5970d1fde8307790e0d40/html5/thumbnails/33.jpg)
Creating graphsCreate an undirected or directed graph
Add and remove nodes/edges
33
import networkx as nx
G = nx.Graph() # undirected graphG = nx.DiGraph() # directed graph
# add and remove edgesG.add_edges_from([("A","B"), ("B","C"), ("C","A"), ("C","D")])G.remove_edge("A","B") G.add_edge("A","B")G.remove_edges_from([("A","B"), ("B","C")])G.add_edges_from([("A","B"), ("B","C")])# also add_node(), remove_node(), add_nodes_form(), remove_nodes_from()
![Page 34: 15-388/688 -Practical Data Science: Graph and network processing · 2019. 12. 6. · Networks are the systems of interrelated objects (in the real world) Graphs are the mathematical](https://reader035.vdocuments.us/reader035/viewer/2022071008/5fc5970d1fde8307790e0d40/html5/thumbnails/34.jpg)
Nodes/edges and propertiesNetworkX uses adjacency dictionary format internally
Iterate over nodes and edges
Get and set node/edge properties
34
for i in G.nodes(): # loop over nodesprint i
for i,j in G.edges(): # loop over edgesprint i,j
G.node["A"]["node_property"] = "node_value"G.edge["A"]["B"]["edge_property"] = "edge_value"G.nodes(data=True) # iterator over nodes returning propertiesG.edges(data=True) # iterator over edges returning properties
print G["C"]# {'A': {}, 'D': {}}
![Page 35: 15-388/688 -Practical Data Science: Graph and network processing · 2019. 12. 6. · Networks are the systems of interrelated objects (in the real world) Graphs are the mathematical](https://reader035.vdocuments.us/reader035/viewer/2022071008/5fc5970d1fde8307790e0d40/html5/thumbnails/35.jpg)
Drawing and node propertiesDraw a graph using matplotlib (not the best visualization)
35
import matplotlib.pyplot as plt%matplotlib inlinenx.draw(G,with_labels=True)plt.savefig("mpl_graph.pdf")
![Page 36: 15-388/688 -Practical Data Science: Graph and network processing · 2019. 12. 6. · Networks are the systems of interrelated objects (in the real world) Graphs are the mathematical](https://reader035.vdocuments.us/reader035/viewer/2022071008/5fc5970d1fde8307790e0d40/html5/thumbnails/36.jpg)
AlgorithmsAlmost all the (medium scale) algorithms you could want
36
nx.shortest_path_length(G,source="A") # iterater over path lengths
nx.pagerank(G,alpha=0.9) # dictionary of node ranks
# NOTE: this requires Networkx 2.0nx.girvan_newman(G) # iterator over partitions at
different hierarchy levels