TRANSCRIPT
-
Bruno Ribeiro, Assistant Professor
Department of Computer Science
Purdue University
Graph Representation Learning:
Where Probability Theory, Data Mining, and
Neural Networks Meet
Joint work with R. Murphy*, B. Srinivasan*, V. Rao
GrAPL Workshop @ IPDPS
May 20th, 2019
Sponsors: Army Research Lab Network Science CTA
-
What is the most powerful+ graph model / representation?
How can we make model learning$ tractable*?
• How can we make model learning$ scalable?
+ powerful ≡ expressive   * tractable ≡ works on small graphs   $ learning and inference
-
Examples: social graphs, biological graphs, molecules, the Web, ecological graphs
G = (V, E)
-
Adjacency Matrix
S.V.N. Vishwanathan: Graph Kernels, Page 3
[Figure: undirected graph G(V, E) with vertices/nodes 1–8 and its edges]

A =
⎡ 0 1 0 0 0 0 1 0 ⎤
⎢ 1 0 1 0 0 0 0 0 ⎥
⎢ 0 1 0 1 0 0 0 1 ⎥
⎢ 0 0 1 0 1 1 0 1 ⎥
⎢ 0 0 0 1 0 1 0 0 ⎥
⎢ 0 0 0 1 1 0 1 0 ⎥
⎢ 1 0 0 0 0 1 0 1 ⎥
⎣ 0 0 1 1 0 0 1 0 ⎦
sub-matrix of A = a subgraph of G
P(A): the probability of sampling A (this graph)
Arbitrary node labels
(rows A_{1·}, A_{2·} and columns A_{·1}, A_{·2} mark the highlighted sub-matrix)
-
Consider a sequence of n random variables X_1, …, X_n with joint probability distribution P.
Sequence example: "The quick brown fox jumped over the lazy dog"
P(X_1 = the, X_2 = quick, …, X_9 = dog)
The joint probability is just a function P: Ω^n → [0,1], with Ω countable
• P takes an ordered sequence and outputs a value between zero and one (w/ normalization)
-
Consider a set of n random variables X_1, …, X_n (representing a multiset):
how should we define their joint probability distribution?
Recall: a probability function P: Ω^n → [0,1] is order-dependent
Definition: For multisets, the probability function P is such that
P(X_1, …, X_n) = P(X_{π(1)}, …, X_{π(n)})
is true for any permutation π of (1,…,n)
Useful references:
(Diaconis, Synthese 1977) Finite forms of de Finetti's theorem on exchangeability
(Murphy et al., ICLR 2019) Janossy Pooling: Learning Deep Permutation-Invariant Functions for Variable-Size Inputs
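A minimal sketch of this idea in code: averaging an order-sensitive function over all orderings of its input yields a permutation-invariant (Janossy-style) function. The function f below is a made-up example, not from the talk.

```python
import itertools

def janossy_invariant(f, xs):
    """Make an order-sensitive function f permutation-invariant by
    averaging it over every ordering of xs (tractable only for small n)."""
    perms = list(itertools.permutations(xs))
    return sum(f(p) for p in perms) / len(perms)

# An order-sensitive f: a position-weighted sum.
f = lambda seq: sum((i + 1) * x for i, x in enumerate(seq))

print(janossy_invariant(f, (1.0, 2.0, 3.0)))  # same value...
print(janossy_invariant(f, (3.0, 1.0, 2.0)))  # ...for any input ordering
```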
-
Examples of multisets: point clouds (Lidar maps), bags of words, our friends, the neighbors of a node in a graph
Extension: set-of-sets (Meng et al., KDD 2019)
-
Consider an array of n² random variables X_11, X_12, …, X_nn with X_ij ∈ Ω,
and P: Ω^{n×n} → [0,1] such that
P(X_11, X_12, X_21, …, X_nn) = P(X_{π(1)π(1)}, X_{π(1)π(2)}, X_{π(2)π(1)}, …, X_{π(n)π(n)})
for any permutation π of (1,…,n)
Then, P is a model of a graph with n vertices, where X_ij ∈ Ω are edge attributes (e.g., weights)
• For each graph, P assigns a probability
• Trivial to add node attributes to the definition
If Ω = {0,1}, then P is a probability distribution over adjacency matrices. Most statistical graph models can be represented this way.
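To make the invariance concrete, here is a small sketch that checks P(A) = P(A_{ππ}) numerically for a toy graph model. The i.i.d.-edge model below is an assumed example, chosen because its likelihood depends only on the edge count and is therefore permutation-invariant by construction.

```python
import numpy as np

def log_p(A, p=0.3):
    """Toy i.i.d.-edge graph model: log P(A) depends only on the edge count."""
    n = A.shape[0]
    m = np.triu(A, 1).sum()                 # number of edges
    pairs = n * (n - 1) / 2
    return m * np.log(p) + (pairs - m) * np.log(1 - p)

rng = np.random.default_rng(0)
A = (rng.random((5, 5)) < 0.4).astype(int)
A = np.triu(A, 1); A = A + A.T              # symmetric, zero diagonal

pi = rng.permutation(5)
A_pp = A[np.ix_(pi, pi)]                    # A_{pi pi}: relabel the nodes
assert np.isclose(log_p(A), log_p(A_pp))    # P(A) = P(A_{pi pi})
```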
-
Adjacency Matrix
S.V.N. Vishwanathan: Graph Kernels, Page 3
[Figure: the same undirected graph G(V, E) and 8×8 adjacency matrix A shown earlier]
sub-matrix of A = a subgraph of G
Arbitrary node labels: e.g., for π = (2,1,3,4,5,6,7,8),
P(A) = P(A_{ππ})
The graph model is invariant to permutations π
-
Invariances have deep implications in nature
• Noether's (first) theorem (1918): invariances → laws of conservation
  e.g.: time and space translation invariance → energy conservation
The study of probabilistic invariances (symmetries) has a long history
• Laplace's "rule of succession" dates to 1774 (Kallenberg, 2005)
• Maxwell's work in statistical mechanics (1875) (Kallenberg, 2005)
• Permutation invariance for infinite sets: de Finetti's theorem (de Finetti, 1930)
  A special case of the ergodic decomposition theorem, related to integral decompositions (see Orbanz and Roy (2015) for a good overview)
• Kallenberg (2005) and (2017): the de-facto references on probabilistic invariances
-
Aldous, D. J. Representations for partially exchangeable arrays of random variables. J.
Multivar. Anal., 1981.
-
Consider an infinite set of random variables X_11, X_12, … with X_ij ∈ Ω, such that
P(X_11, X_12, …) = P(X_{π(1)π(1)}, X_{π(1)π(2)}, …)
is true for any permutation π of the positive integers
Then,
P(X_11, X_12, …) = ∫_{U_1 ∈ [0,1]} ⋯ ∫ ∏_{ij} f(X_ij | U_i, U_j) dU
is a mixture model over the uniform variables U_1, U_2, … ∼ Uniform(0,1)
(The Aldous-Hoover representation is sufficient only for infinite graphs)
-
Relationship between deterministic functions and probability distributions
Noise outsourcing:
• A tool from measure theory
• Any conditional probability P(Y|X) can be represented as Y = f(X, ε), ε ∼ Uniform(0,1), where f is a deterministic function
• The randomness is entirely outsourced to ε
Representation S(X):
• S(X): a deterministic function that makes Y independent of X given S(X)
• Then, there exists f′ such that (Y, X) =* (f′(S(X), ε), X), ε ∼ Uniform(0,1)
* equality holds almost surely (a.s.)
We call S(X) a representation of X
Representations are generalizations of "embeddings"
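Noise outsourcing is an existence result, but for simple distributions the function f can be written down explicitly via the inverse-CDF trick. The exponential example below is an assumed illustration, not from the slides.

```python
import numpy as np

def f(x, eps):
    """Deterministic f in Y = f(X, eps) for Y | X ~ Exponential(rate=X):
    the inverse CDF of the exponential applied to uniform noise eps."""
    return -np.log(1.0 - eps) / x

rng = np.random.default_rng(1)
x = 2.0
samples = f(x, rng.random(100_000))   # all randomness lives in eps ~ U(0,1)
print(samples.mean())                 # approx. 1/x = 0.5
```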
-
Gaussian Linear Model:
Node i vector U_{i·} ∼ Normal(0, σ_U² I)   (each node i is represented by a random vector)
Adjacency matrix: A_ij ∼ Normal(U_{i·}ᵀ U_{j·}, σ²)
Q: For a given A, what is the most likely U?
Answer: U* = argmax_U P(A|U), a.k.a. maximum likelihood
Equivalent optimization, minimizing the negative log-likelihood:
U* = argmin_U ‖A − UUᵀ‖₂² + (σ²/σ_U²) ‖U‖₂²
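Spelling out the step from the likelihood to the regularized objective (the Gaussian prior on U supplies the regularizer, so strictly speaking the argmax is a MAP estimate); a sketch of the derivation:

```latex
% Negative log joint density of the Gaussian linear model, constants dropped:
%   A_ij ~ Normal(U_i^T U_j, sigma^2),   U_i ~ Normal(0, sigma_U^2 I)
\begin{align*}
-\log P(A, U)
  &= \frac{1}{2\sigma^2} \sum_{i,j} \bigl(A_{ij} - U_{i\cdot}^\top U_{j\cdot}\bigr)^2
   + \frac{1}{2\sigma_U^2} \sum_i \lVert U_{i\cdot} \rVert_2^2 + \text{const}
\end{align*}
% Multiplying by 2\sigma^2 leaves the minimizer unchanged:
\begin{align*}
U^\star
  &= \operatorname*{arg\,min}_U \;
     \lVert A - U U^\top \rVert_2^2
   + \frac{\sigma^2}{\sigma_U^2} \lVert U \rVert_2^2
\end{align*}
```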
-
That will turn out to be the same
-
Embedding of adjacency matrix A
A ≈ U_{·1} U_{·1}ᵀ + U_{·2} U_{·2}ᵀ + ⋯ = UUᵀ
U_{·i} = the i-th column vector of U
-
Matrix factorization can be used to compute a low-rank representation of A
A reconstruction problem: find A ≈ UUᵀ, where U has k columns*, by optimizing
min_U ‖A − UUᵀ‖₂² + λ‖U‖₂²
(sum of squared errors + L2 regularization with regularization strength λ)
* sometimes we will force orthogonal columns in U
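A minimal numpy sketch of this reconstruction problem, using plain gradient descent on the objective above; in practice one would use a truncated eigendecomposition or a specialized solver instead.

```python
import numpy as np

def low_rank_embedding(A, k, lam=0.0, steps=2000, lr=1e-3, seed=0):
    """Gradient descent on ||A - U U^T||_2^2 + lam * ||U||_2^2 (A symmetric)."""
    rng = np.random.default_rng(seed)
    U = 0.1 * rng.standard_normal((A.shape[0], k))
    for _ in range(steps):
        R = A - U @ U.T                       # reconstruction residual
        U -= lr * (-4 * R @ U + 2 * lam * U)  # gradient of the objective
    return U

A = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], float)
U = low_rank_embedding(A, k=2)
print(np.round(U @ U.T, 2))   # approximate low-rank reconstruction of A
```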
-
The Weisfeiler-Lehman (WL) isomorphism test (Shervashidze et al. 2011)

Initialize: f_{0,v} ← attribute vector of vertex v ∈ G (if no attribute, assign 1); i = 0

function WL-fingerprints(G):
    while vertex attributes change do:
        i ← i + 1
        for all vertices v ∈ G do:
            f_{i,v} ← hash(f_{i−1,v}, {f_{i−1,u} : ∀u ∈ Neighbors(v)})   # neighbors of node v
    return {f_{i,v} : ∀v ∈ G}

A recursive algorithm to determine whether two graphs are isomorphic:
• A valid isomorphism test for most graphs (Babai and Kucera, 1979)
• Cai et al., 1992 show examples that it cannot distinguish
• Belongs to the class of color refinement algorithms, which iteratively update vertex "colors" (hash values) until they converge to unique assignments of hashes to vertices
• The final hash values encode the structural roles of vertices inside a graph
• Often fails for graphs with a high degree of symmetry, e.g. chains, complete graphs, tori, and stars
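A runnable sketch of the pseudocode above, assuming networkx as the graph library (an implementation choice, not from the talk). The final example shows the symmetry failure mode: a 6-cycle and two disjoint triangles receive identical fingerprints.

```python
import networkx as nx

def wl_fingerprints(G):
    """WL color refinement: re-hash each vertex's (own color, sorted multiset
    of neighbor colors) until the number of color classes stops growing."""
    colors = {v: G.nodes[v].get("attr", 1) for v in G}  # init: attribute or 1
    while True:
        new = {v: hash((colors[v], tuple(sorted(colors[u] for u in G[v]))))
               for v in G}
        if len(set(new.values())) == len(set(colors.values())):
            return new                                  # refinement converged
        colors = new

C6 = nx.cycle_graph(6)
two_triangles = nx.disjoint_union(nx.cycle_graph(3), nx.cycle_graph(3))
f1, f2 = wl_fingerprints(C6), wl_fingerprints(two_triangles)
print(sorted(f1.values()) == sorted(f2.values()))  # True: WL cannot tell them apart
```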
-
The hardest task for graph representation is:
• Give different tags to different graphs
  (isomorphic graphs should have the same tag)
• Task: Given adjacency matrix A, predict the tag
Goal: Find a representation S(A) such that tag = g(S(A), ε)
• Then, S(A) must give:
  the same representation to isomorphic graphs
  different representations to non-isomorphic graphs
-
Main idea of Graph Neural Networks (GNNs): use the WL algorithm to compute representations that are related to a task

Initialize h_{0,v} = the attribute of node v

function f⃗(A, W_1, …, W_K, b_1, …, b_K):
    while i < K do:   # K layers
        i ← i + 1
        for all vertices v ∈ V do:
            h_{i,v} = σ(W_i [h_{i−1,v}, A_{v·} H_{i−1}] + b_i)   # the neighbor aggregation is a permutation-invariant function; could be another permutation-invariant function (see Murphy et al. ICLR 2019)
    return {h_{K,v} : ∀v ∈ V}

Example supervised task: predict the label y_i of graph G_i represented by A_i
Optimization for loss L: let θ = (W_1, …, W_K, b_1, …, b_K, W_agg, b_agg)
θ* = argmin_θ Σ_{i∈Data} L(y_i, W_agg Pooling(f⃗(A_i, W_1, …, W_K, b_1, …, b_K)) + b_agg)
(Pooling is a permutation-invariant function; see Murphy et al. ICLR 2019)
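A small numpy sketch of this layer (a sum-aggregation variant, one plausible instantiation of the update above), including a check that the output is permutation-equivariant:

```python
import numpy as np

def gnn_forward(A, H0, Ws, bs):
    """K message-passing layers: H = tanh([H, A H] W_i + b_i)."""
    H = H0
    for W, b in zip(Ws, bs):
        M = A @ H                               # sum over each vertex's neighbors
        H = np.tanh(np.hstack([H, M]) @ W + b)  # combine self and neighbor info
    return H                                    # rows are {h_{K,v}}

rng = np.random.default_rng(0)
n, d = 5, 4
A = (rng.random((n, n)) < 0.5).astype(float); A = np.triu(A, 1); A = A + A.T
H0 = rng.standard_normal((n, d))
Ws = [0.1 * rng.standard_normal((2 * d, d)) for _ in range(3)]
bs = [np.zeros(d) for _ in range(3)]

# Permuting the nodes permutes the rows of the output (equivariance);
# sum-pooling the rows then gives a permutation-invariant graph representation.
pi = rng.permutation(n)
H_pi = gnn_forward(A[np.ix_(pi, pi)], H0[pi], Ws, bs)
print(np.allclose(H_pi, gnn_forward(A, H0, Ws, bs)[pi]))  # True
```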
-
GNN representations can be as expressive as the Weisfeiler-Lehman (WL) isomorphism test (Xu et al., ICLR 2019)
But the WL test can sometimes fail…
E.g., in a family of circulant graphs
By construction, the GNN representation f⃗ is guaranteed to be permutation-invariant (equivariant at the node level):
f⃗(A) = f⃗(A_{ππ})
-
The multilayer perceptron (MLP) is a universal function approximator (Hornik et al. 1989)
• What about using f⃗_MLP(vec(A))?
No! It is permutation-sensitive*:
f⃗_MLP(vec(A)) ≠ f⃗_MLP(vec(A_{ππ})) for some permutation π
* unless the neuron weights are nearly all the same (Maron et al., 2018)
-
Extension to variable-size graphs: a graph model P is defined over arrays of n² random variables X_ij ∈ Ω for any n > 1, with P: Ω^{∪×} → [0,1], where Ω^{∪×} ≡ ∪_{n=2}^∞ Ω^{n×n}, such that
P(X_11, X_12, X_21, …, X_nn) = P(X_{π(1)π(1)}, X_{π(1)π(2)}, X_{π(2)π(1)}, …, X_{π(n)π(n)})
for any value of n and any permutation π of (1,…,n)

(Murphy et al., ICML 2019) insight: P is the average of an unconstrained probability function f over the group generated by the permutation operator π
• The average is invariant to the group action of a permutation π (see Bloem-Reddy & Teh, 2019)
• Works for variable-size graphs
-
A is a tensor encoding the adjacency matrix & edge attributes
X^(v) encodes the node attributes
Π is the set of all permutations of (1,…,|V|), where |V| is the number of vertices
f⃗ is any permutation-sensitive function

Theorem 2.1: A necessary and sufficient representation of finite graphs
f̄(A) = E_π[f⃗(A_{ππ}, X^(v)_π)] = (1/n!) Σ_{π∈Π} f⃗(A_{ππ}, X^(v)_π)
(the average over π ∼ Uniform(Π))
• (details) If f⃗ is a universal approximator (MLP, RNN), then f̄(A) is the most expressive representation of A
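A brute-force sketch of this average, with a deliberately permutation-sensitive toy f (a made-up example); exact Relational Pooling enumerates all n! permutations, which is why the approximations on the next slide matter.

```python
import itertools
import numpy as np

def rp_exact(f, A, X):
    """Exact Relational Pooling: average a permutation-sensitive f over all
    n! joint reorderings of the adjacency matrix and node features."""
    n = A.shape[0]
    vals = [f(A[np.ix_(p, p)], X[list(p)])
            for p in itertools.permutations(range(n))]
    return np.mean(vals, axis=0)

# A permutation-sensitive toy f: a position-weighted sum of rows.
f = lambda A, X: np.arange(1, A.shape[0] + 1) @ (A.sum(1) + X.sum(1))

rng = np.random.default_rng(0)
A = (rng.random((4, 4)) < 0.5).astype(float); A = np.triu(A, 1); A = A + A.T
X = rng.standard_normal((4, 2))
pi = rng.permutation(4)
print(np.isclose(rp_exact(f, A, X),
                 rp_exact(f, A[np.ix_(pi, pi)], X[pi])))  # True: invariant
```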
-
Three ways to make f̄(A) = (1/n!) Σ_{π∈Π} f⃗(A_{ππ}, X^(v)_π) tractable:
1. Canonical orientation (some order of the vertices), so that canonical(A) = canonical(A_{ππ})
2. k-ary dependencies:
   • Nodes are k-by-k independent in f⃗
   • f⃗ considers only the first k nodes of any permutation π
3. Stochastic optimization (proposes π-SGD)
-
Order nodes with a sort function
• E.g.: order the nodes by PageRank
Arrange A with sort(A) (assuming no ties)
• Note that sort(A) = sort(A_{ππ}) for any permutation π

RP: f̄(A) ≈ (1/|Π|) Σ_{π∈Π} f⃗(A_{ππ}, X^(v)_π)
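A sketch of the PageRank-based canonical orientation, again assuming networkx; it works only when the PageRank scores are untied, matching the slide's caveat.

```python
import networkx as nx
import numpy as np

def canonical_orientation(A):
    """Reorder nodes by descending PageRank so every relabeling of the same
    graph maps to one canonical adjacency matrix (assuming no ties)."""
    G = nx.from_numpy_array(A)
    pr = nx.pagerank(G)
    order = sorted(G.nodes, key=lambda v: -pr[v])
    return A[np.ix_(order, order)]

rng = np.random.default_rng(0)
A = (rng.random((6, 6)) < 0.4).astype(float); A = np.triu(A, 1); A = A + A.T
pi = rng.permutation(6)
print(np.allclose(canonical_orientation(A),
                  canonical_orientation(A[np.ix_(pi, pi)])))  # True if no ties
```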
-
Adjacency Matrix
S.V.N. Vishwanathan: Graph Kernels, Page 3
[Figure: the same undirected graph G(V, E) and 8×8 adjacency matrix A shown earlier; a highlighted k×k sub-matrix of A = a subgraph of G]
k-ary dependencies:
• Nodes are k-by-k independent in f⃗
• f⃗ considers only the first k nodes of any permutation π (n_k permutations)
Σ_{π∈Π} f⃗(A_{ππ}) = f⃗([k×k block under π = (1,2,3,4,5,6,7,8)]) + f⃗([k×k block under π = (2,1,3,4,5,6,7,8)]) + ⋯ + f⃗([k×k block under π = (2,1,3,4,5,6,8,7)]) + ⋯

RP: f̄(A) ≈ (1/|Π|) Σ_{π∈Π} f⃗(A_{ππ}, X^(v)_π)
-
SGD: standard Stochastic Gradient Descent
1. Sample a batch of n training examples
2. Compute gradients (backpropagation using the chain rule)
3. Update the model following the negative gradient (one gradient descent step)
4. GOTO 1

π-SGD (as fast as SGD per gradient step):
1. Sample a batch of training examples
2. For each example x^(i) in the batch, sample one permutation π^(i)
3. Perform a forward pass over the examples with the single sampled permutation
4. Compute gradients (backpropagation using the chain rule)
5. Update the model following the negative gradient (one gradient descent step)
6. GOTO 1
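A sketch of one π-SGD pass; `model.loss_and_grads` and `model.step` are hypothetical stand-ins for whatever permutation-sensitive model and optimizer are in use.

```python
import numpy as np

def pi_sgd_epoch(model, data, lr, rng=np.random.default_rng()):
    """pi-SGD: like SGD, but each example is seen under a single freshly
    sampled node permutation, so each step costs the same as plain SGD."""
    for A, X, y in data:                       # 1. (batch size 1 for brevity)
        pi = rng.permutation(A.shape[0])       # 2. sample one permutation
        A_pi, X_pi = A[np.ix_(pi, pi)], X[pi]  # 3. forward pass uses only
        loss, grads = model.loss_and_grads(A_pi, X_pi, y)  #    this permutation
        model.step(grads, lr)                  # 4-5. one gradient descent step
    # 6. repeat (call again for the next pass over the data)
```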
-
Proposition 2.1 (Murphy et al., ICML 2019)
• π-SGD behaves just like SGD
• If the loss is MSE, cross-entropy, or negative log-likelihood, then π-SGD minimizes an upper bound of the loss
• However, the solution π-SGD converges to is not the solution of SGD; it is still a valid graph representation
-
Consider a GNN f⃗, e.g., the GIN of (Xu et al., ICLR 2019)
• By definition, f⃗ is insensitive to permutations
Let's make f⃗ sensitive to permutations by adding node ids (labels) as unique node features
• And use RP to make the entire representation insensitive to permutations (learned approximately via π-SGD)
Task: classify circulant graphs
Task: molecular classification
-
f⃗ can be a logistic model (logistic regression)
f⃗ can be a Recurrent Neural Network (RNN)
f⃗ can be a Convolutional Neural Network (CNN)
• Treat A as an image
These are all valid graph representations in Relational Pooling (RP)
-
The Relational Pooling (RP) framework gives a new class of graph representations and models
• Until now, f⃗ had to be hand-designed to be permutation-invariant
• (Murphy et al., ICML 2019) With RP, f⃗ can be permutation-sensitive, which allows more expressive models
• Trade-off: the representation can only be learned approximately

Permutation-invariant f⃗ without permutation averaging: learns the exact f⃗; f⃗ is always permutation-invariant
"Any" f⃗ made permutation-invariant by RP: learns f̄(A) approximately via π-SGD; f̄(A) inference with Monte Carlo is approximately permutation-invariant

RP: f̄(A) ≈ (1/|Π|) Σ_{π∈Π} f⃗(A_{ππ}, X^(v)_π)

Thank You!
@brunofmr
-
1. Murphy, R. L., Srinivasan, B., Rao, V., and Ribeiro, B. Janossy pooling: Learning deep permutation-invariant functions for variable-size inputs. ICLR 2019.
2. Murphy, R. L., Srinivasan, B., Rao, V., and Ribeiro, B. Relational pooling for graph representations. ICML 2019.
3. Meng, C., Yang, J., Ribeiro, B., and Neville, J. HATS: A hierarchical sequence-attention framework for inductive set-of-sets embeddings. KDD 2019.
4. de Finetti, B. Funzione caratteristica di un fenomeno aleatorio. Mem. R. Acc. Lincei, 1930.
5. Aldous, D. J. Representations for partially exchangeable arrays of random variables. J. Multivar. Anal., 1981.
6. Diaconis, P. and Janson, S. Graph limits and exchangeable random graphs. Rend. di Mat. e delle sue Appl. Ser. VII, 28:33–61, 2008.
7. Kallenberg, O. Probabilistic Symmetries and Invariance Principles. Springer, 2005.
8. Kallenberg, O. Random Measures, Theory and Applications. Springer International Publishing, 2017.
9. Diaconis, P. Finite forms of de Finetti's theorem on exchangeability. Synthese, 1977.
10. Orbanz, P. and Roy, D. M. Bayesian models of graphs, arrays and other exchangeable random structures. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(2):437–461, 2015.
11. Bloem-Reddy, B. and Teh, Y. W. Probabilistic symmetry and invariant neural networks. arXiv:1901.06082, 2019.
12. Robbins, H. and Monro, S. A stochastic approximation method. The Annals of Mathematical Statistics, pp. 400–407, 1951.
13. Hornik, K., Stinchcombe, M., and White, H. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359–366, 1989.
14. Cai, J.-Y., Fürer, M., and Immerman, N. An optimal lower bound on the number of variables for graph identification. Combinatorica, 12(4):389–410, 1992.
15. Shervashidze, N., Schweitzer, P., van Leeuwen, E. J., Mehlhorn, K., and Borgwardt, K. M. Weisfeiler-Lehman graph kernels. Journal of Machine Learning Research, 2011.
16. Maron, H., Ben-Hamu, H., Shamir, N., and Lipman, Y. Invariant and equivariant graph networks. arXiv:1812.09902, 2018.
17. Maron, H., Fetaya, E., Segol, N., and Lipman, Y. On the universality of invariant networks. arXiv:1901.09342, 2019.
18. Xu, K., Hu, W., Leskovec, J., and Jegelka, S. How powerful are graph neural networks? ICLR 2019.
19. Babai, L. and Kučera, L. Canonical labelling of graphs in linear average time. FOCS 1979.