Graph Attention Networks
Petar Veličković 1,2, Guillem Cucurull 2,3, Arantxa Casanova 2,3, Adriana Romero 2,4, Pietro Liò 1, Yoshua Bengio 2
1 Department of Computer Science and Technology, University of Cambridge; 2 Montreal Institute for Learning Algorithms; 3 Centre de Visió per Computador, UAB; 4 Facebook AI Research
github.com/PetarV-/GAT
Abstract
We present graph attention networks (GATs), novel neural network architectures that operate on graph-structured data, leveraging masked self-attentional layers to address the shortcomings of prior methods based on graph convolutions or their approximations.
By stacking layers in which nodes are able to attend over their neighbourhoods' features, we enable (implicitly) specifying different weights to different nodes in a neighbourhood, and remove dependence on the global graph structure.
This addresses several key challenges of spectral-based graph neural networks simultaneously, and is readily applicable to inductive as well as transductive problems (setting or matching state-of-the-art).
GAT Architecture
The input to our layer is a set of node features, $\mathbf{h} = \{\vec{h}_1, \vec{h}_2, \ldots, \vec{h}_N\}$, $\vec{h}_i \in \mathbb{R}^F$, where $N$ is the number of nodes and $F$ is the number of features in each node. The layer produces a new set of node features, $\mathbf{h}' = \{\vec{h}'_1, \vec{h}'_2, \ldots, \vec{h}'_N\}$, $\vec{h}'_i \in \mathbb{R}^{F'}$, as its output.
We perform masked self-attention on the nodes: a shared attentional mechanism $a : \mathbb{R}^F \times \mathbb{R}^F \to \mathbb{R}$ computes attention coefficients $\alpha_{ij}$ over neighbourhoods of nodes, $\mathcal{N}_i$:

$$e_{ij} = a(\vec{h}_i, \vec{h}_j), \qquad \alpha_{ij} = \mathrm{softmax}_j(e_{ij}) = \frac{\exp(e_{ij})}{\sum_{k \in \mathcal{N}_i} \exp(e_{ik})}$$
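As a concrete illustration, here is a minimal NumPy sketch of the masked softmax step, assuming the raw scores $e_{ij}$ have already been computed and the boolean adjacency includes self-loops (all names below are illustrative, not from the poster):

```python
import numpy as np

def masked_softmax(e, adj):
    """Normalise raw attention scores e_ij over each node's neighbourhood.

    e:   (N, N) raw scores, e[i, j] = a(h_i, h_j)
    adj: (N, N) boolean, adj[i, j] is True iff j is in N_i
         (self-loops included, so every row has at least one True entry)
    """
    e = np.where(adj, e, -np.inf)           # mask out non-neighbours
    e = e - e.max(axis=1, keepdims=True)    # numerical stability
    exp_e = np.exp(e)                       # masked entries become exp(-inf) = 0
    return exp_e / exp_e.sum(axis=1, keepdims=True)
```

The masking is what makes the mechanism graph-aware: each row of the result sums to 1 over the neighbourhood $\mathcal{N}_i$ and is exactly zero everywhere else.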
Once obtained, the attention coefficients are used to compute a linear combination of the features corresponding to them, to serve as the final output features for every node (potentially applying a nonlinearity, $\sigma$):

$$\vec{h}'_i = \sigma\!\left( \sum_{j \in \mathcal{N}_i} \alpha_{ij} \mathbf{W} \vec{h}_j \right).$$
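Combining the two equations gives a complete single-head layer. The NumPy sketch below is a toy implementation under stated assumptions: it uses the paper's instantiation of $a$ (a single-layer feedforward network with a LeakyReLU, applied to the concatenated transformed features $\mathbf{W}\vec{h}$) and ELU for $\sigma$, but any valid $a$ and $\sigma$ would do:

```python
import numpy as np

def gat_layer(h, W, a, adj, slope=0.2):
    """Single-head GAT layer: h'_i = sigma(sum_{j in N_i} alpha_ij W h_j).

    h:   (N, F)   input node features
    W:   (F, Fp)  shared weight matrix
    a:   (2*Fp,)  attention vector (single-layer mechanism, applied
                  to the transformed features W h)
    adj: (N, N)   boolean adjacency with self-loops
    """
    Wh = h @ W                                        # (N, Fp)
    Fp = Wh.shape[1]
    # e_ij = LeakyReLU(a^T [W h_i || W h_j]), computed for all pairs at once
    e = (Wh @ a[:Fp])[:, None] + (Wh @ a[Fp:])[None, :]
    e = np.where(e > 0, e, slope * e)                 # LeakyReLU
    e = np.where(adj, e, -np.inf)                     # restrict to N_i
    alpha = np.exp(e - e.max(axis=1, keepdims=True))
    alpha = alpha / alpha.sum(axis=1, keepdims=True)  # softmax over j
    out = alpha @ Wh                                  # sum_j alpha_ij W h_j
    return np.where(out > 0, out, np.expm1(out))      # ELU as sigma

# Toy usage: 4 nodes, F = 3 -> Fp = 2
rng = np.random.default_rng(0)
h = rng.normal(size=(4, 3))
W = rng.normal(size=(3, 2))
a = rng.normal(size=(4,))
adj = np.array([[1, 1, 0, 0],
                [1, 1, 1, 0],
                [0, 1, 1, 1],
                [0, 0, 1, 1]], dtype=bool)
print(gat_layer(h, W, a, adj).shape)  # (4, 2)
```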
For stability, we employ multi-head attention, parallelising this process across $K$ independent attention heads:

$$\vec{h}'_i = \big\Vert_{k=1}^{K} \; \sigma\!\left( \sum_{j \in \mathcal{N}_i} \alpha^k_{ij} \mathbf{W}^k \vec{h}_j \right)$$
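A sketch of the multi-head version, again in NumPy with illustrative names: each head $k$ gets its own $\mathbf{W}^k$ and $a^k$, and the $K$ head outputs are concatenated along the feature axis (the paper averages instead of concatenating on the final prediction layer):

```python
import numpy as np

def multi_head_gat(h, Ws, As, adj, slope=0.2, average=False):
    """K-head GAT layer: concatenate (or, for an output layer, average)
    the per-head aggregations sigma(sum_j alpha^k_ij W^k h_j).

    Ws: list of K (F, Fp) weight matrices W^k
    As: list of K (2*Fp,) attention vectors a^k
    """
    heads = []
    for W, a in zip(Ws, As):              # each head is fully independent
        Wh = h @ W
        Fp = Wh.shape[1]
        e = (Wh @ a[:Fp])[:, None] + (Wh @ a[Fp:])[None, :]
        e = np.where(e > 0, e, slope * e)             # LeakyReLU
        e = np.where(adj, e, -np.inf)                 # mask to N_i
        alpha = np.exp(e - e.max(axis=1, keepdims=True))
        alpha = alpha / alpha.sum(axis=1, keepdims=True)
        out = alpha @ Wh
        heads.append(np.where(out > 0, out, np.expm1(out)))  # ELU
    return np.mean(heads, axis=0) if average else np.concatenate(heads, axis=1)
```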
Additional optimisations, such as dropout on the $\alpha_{ij}$ values, further stabilise and regularise the model.
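Dropout on the attention coefficients can be sketched as standard inverted dropout applied to the $\alpha$ matrix at training time, so that each node aggregates over a stochastically sampled subset of its neighbours (the rate below is illustrative):

```python
import numpy as np

def attention_dropout(alpha, rate, rng):
    """Inverted dropout on attention coefficients (training time only).

    Randomly zeroes entries of alpha and rescales the survivors, exposing
    each node to a stochastically sampled neighbourhood per forward pass.
    """
    keep = rng.random(alpha.shape) >= rate
    return alpha * keep / (1.0 - rate)

# Illustrative rate, applied to an alpha matrix from the sketches above
rng = np.random.default_rng(0)
alpha = np.full((4, 4), 0.25)
print(attention_dropout(alpha, rate=0.6, rng=rng))
```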
GAT Layer
[Figure: an illustration of multi-head attention (with $K = 3$ heads) by node 1 on its neighbourhood $\{\vec{h}_1, \ldots, \vec{h}_6\}$, with coefficients $\vec{\alpha}_{11}, \ldots, \vec{\alpha}_{16}$. Different arrow styles and colours denote independent attention computations. The aggregated features from each head are concatenated or averaged to obtain $\vec{h}'_1$.]
Properties
Computationally efficient: the attention computation can be parallelised across all edges of the graph, and the aggregation across all nodes!
Storage efficient: a sparse version does not require storing more than O(V + E) entries anywhere (see the sketch after this list);
Fixed number of parameters (dependent only on the desired feature count, not on the node count);
Trivially localised (as we aggregate only over neighbourhoods);
Allows for (implicitly) assigning different importances to different neighbours, through its attentional mechanism;
Readily applicable to inductive problems (as it is a shared edge-wise mechanism that does not depend on the global graph structure)!
Satisfies all of the major requirements for a graph convolutional layer simultaneously.
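The first two properties can be made concrete with an edge-list formulation: scores and coefficients are computed per edge, and the softmax normalisation and aggregation become segment operations over each node's incoming edges, so nothing larger than O(V + E) is ever materialised. A NumPy sketch, under the same illustrative assumptions as the earlier blocks:

```python
import numpy as np

def sparse_gat_head(h, W, a, src, dst, slope=0.2):
    """Single-head GAT over an edge list: O(V + E) storage.

    src, dst: (E,) int arrays; edge e means src[e] is in N_{dst[e]}
              (self-loop edges i -> i should be included)
    """
    n = h.shape[0]
    Wh = h @ W                                  # (N, Fp)
    Fp = Wh.shape[1]
    # One raw score per edge: e = LeakyReLU(a^T [W h_dst || W h_src])
    e = Wh[dst] @ a[:Fp] + Wh[src] @ a[Fp:]     # (E,)
    e = np.where(e > 0, e, slope * e)
    # Segment softmax over each destination node's incoming edges
    e_max = np.full(n, -np.inf)
    np.maximum.at(e_max, dst, e)                # per-node max, for stability
    exp_e = np.exp(e - e_max[dst])
    denom = np.zeros(n)
    np.add.at(denom, dst, exp_e)
    alpha = exp_e / denom[dst]                  # (E,) one coefficient per edge
    # Weighted aggregation: scatter alpha_e * W h_src into the destination
    out = np.zeros((n, Fp))
    np.add.at(out, dst, alpha[:, None] * Wh[src])
    return np.where(out > 0, out, np.expm1(out))  # ELU
```

The per-edge score and softmax steps parallelise across edges, and the scatter-adds across nodes, matching the efficiency claims above.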
Results
Transductive (classification accuracy)

Method        Cora           Citeseer       Pubmed
MLP           55.1%          46.5%          71.4%
ManiReg       59.5%          60.1%          70.7%
SemiEmb       59.0%          59.6%          71.7%
LP            68.0%          45.3%          63.0%
DeepWalk      67.2%          43.2%          65.3%
ICA           75.1%          69.1%          73.9%
Planetoid     75.7%          64.7%          77.2%
Chebyshev     81.2%          69.8%          74.4%
GCN           81.5%          70.3%          79.0%
MoNet         81.7 ± 0.5%    —              78.8 ± 0.3%
GCN-64*       81.4 ± 0.5%    70.9 ± 0.5%    79.0 ± 0.3%
GAT (ours)    83.0 ± 0.7%    72.5 ± 0.7%    79.0 ± 0.3%
Inductive (micro-averaged F1)

Method              PPI
Random              0.396
MLP                 0.422
GraphSAGE-LSTM      0.612
GraphSAGE*          0.768
Const-GAT (ours)    0.934 ± 0.006
GAT (ours)          0.973 ± 0.002
[Figure: t-SNE + attention coefficients on Cora]