
Graph Attention Networks
Petar Veličković 1,2 · Guillem Cucurull 2,3 · Arantxa Casanova 2,3 · Adriana Romero 2,4 · Pietro Liò 1 · Yoshua Bengio 2

1 Department of Computer Science and Technology, University of Cambridge · 2 Montréal Institute for Learning Algorithms · 3 Centre de Visió per Computador, UAB · 4 Facebook AI Research

github.com/PetarV-/GAT

Abstract

We present graph attention networks (GATs), novel neural network architectures that operate on graph-structured data, leveraging masked self-attentional layers to address the shortcomings of prior methods based on graph convolutions or their approximations.

By stacking layers in which nodes are able to attend over their neighbourhoods' features, we enable (implicitly) specifying different weights to different nodes in a neighbourhood, and remove dependence on the global graph structure.

This addresses several key challenges of spectral-based graph neural networks simultaneously, and is readily applicable to inductive as well as transductive problems (setting or matching the state of the art).

GAT Architecture

The input to our layer is a set of node features, $\mathbf{h} = \{\vec{h}_1, \vec{h}_2, \ldots, \vec{h}_N\}$, $\vec{h}_i \in \mathbb{R}^F$, where $N$ is the number of nodes and $F$ is the number of features in each node. The layer produces a new set of node features, $\mathbf{h}' = \{\vec{h}'_1, \vec{h}'_2, \ldots, \vec{h}'_N\}$, $\vec{h}'_i \in \mathbb{R}^{F'}$, as its output.

We perform masked self-attention on the nodes. A shared attentional mechanism $a : \mathbb{R}^F \times \mathbb{R}^F \to \mathbb{R}$ computes attention coefficients $\alpha_{ij}$ over neighbourhoods of nodes, $\mathcal{N}_i$:

$$e_{ij} = a(\vec{h}_i, \vec{h}_j) \qquad \alpha_{ij} = \mathrm{softmax}_j(e_{ij}) = \frac{\exp(e_{ij})}{\sum_{k \in \mathcal{N}_i} \exp(e_{ik})}$$

Once obtained, the attention coefficients are used to compute a linear combination of the corresponding features, to serve as the final output features for every node (potentially applying a nonlinearity, $\sigma$):

$$\vec{h}'_i = \sigma\!\left( \sum_{j \in \mathcal{N}_i} \alpha_{ij} \mathbf{W} \vec{h}_j \right).$$
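A minimal NumPy sketch of a single attention head implementing the attention-coefficient and aggregation equations above. The concrete form of the shared mechanism $a$ (a learnable vector applied to the concatenated pair of feature vectors, followed by a LeakyReLU) is an illustrative assumption, as are all names and shapes in the snippet; the poster only requires $a : \mathbb{R}^F \times \mathbb{R}^F \to \mathbb{R}$.

```python
# Minimal single-head sketch of the equations above (illustrative only).
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def gat_head(h, adj, W, a_vec, sigma=np.tanh):
    """h: (N, F) node features; adj: (N, N) binary adjacency with self-loops;
    W: (F, F_out) shared linear transform; a_vec: (2 * F,) attention parameters;
    sigma: output nonlinearity (tanh as a stand-in)."""
    F = h.shape[1]
    # e_ij = a(h_i, h_j): decompose a_vec . [h_i || h_j] into two per-node terms
    s = h @ a_vec[:F]                                   # (N,) contribution of node i
    t = h @ a_vec[F:]                                   # (N,) contribution of node j
    e = leaky_relu(s[:, None] + t[None, :])             # (N, N) raw scores e_ij
    # Masked softmax over each neighbourhood N_i (non-edges get zero weight)
    e = np.where(adj > 0, e, -np.inf)
    alpha = np.exp(e - e.max(axis=1, keepdims=True))
    alpha = alpha / alpha.sum(axis=1, keepdims=True)
    # h'_i = sigma( sum_{j in N_i} alpha_ij W h_j )
    return sigma(alpha @ (h @ W))                       # (N, F_out)

# Toy usage: a 4-node path graph (edges 0-1, 1-2, 2-3) with self-loops
rng = np.random.default_rng(0)
adj = np.eye(4) + np.array([[0, 1, 0, 0],
                            [1, 0, 1, 0],
                            [0, 1, 0, 1],
                            [0, 0, 1, 0]])
h = rng.normal(size=(4, 5))                             # N = 4, F = 5
W = rng.normal(size=(5, 8))                             # F' = 8
a_vec = rng.normal(size=(10,))
h_prime = gat_head(h, adj, W, a_vec)                    # (4, 8)
```

The dense (N, N) mask is fine for small graphs; a sparse, edge-wise variant is sketched under Properties below.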

For stability, we employ multi-head attention, parallelising this process across $K$ independent attention heads:

$$\vec{h}'_i = \big\Vert_{k=1}^{K} \, \sigma\!\left( \sum_{j \in \mathcal{N}_i} \alpha^k_{ij} \mathbf{W}^k \vec{h}_j \right)$$
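A short sketch of the concat/average combination across heads, assuming the gat_head function and the toy variables (h, adj, rng) from the previous snippet are in scope; the choice of K and all parameter shapes are illustrative. Concatenation is the usual choice for hidden layers; for a final prediction layer the heads are typically averaged, with the nonlinearity applied after averaging, a detail this simplified sketch glosses over.

```python
# Multi-head combination, reusing the gat_head sketch above (assumed in scope).
import numpy as np

def gat_multi_head(h, adj, Ws, a_vecs, mode="concat"):
    """Ws, a_vecs: length-K lists of independent per-head parameters."""
    outs = [gat_head(h, adj, W, a) for W, a in zip(Ws, a_vecs)]
    if mode == "concat":
        return np.concatenate(outs, axis=-1)    # (N, K * F_out), hidden layers
    return np.mean(np.stack(outs), axis=0)      # (N, F_out), typical final layer

# K = 3 heads on the toy graph from the previous snippet
K = 3
Ws = [rng.normal(size=(5, 8)) for _ in range(K)]
a_vecs = [rng.normal(size=(10,)) for _ in range(K)]
h_multi = gat_multi_head(h, adj, Ws, a_vecs)    # (4, 24)
```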

Additional optimisations, such as dropout on the $\alpha_{ij}$ values, further stabilise and regularise the model.
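A hedged sketch of that regularisation: inverted dropout applied directly to the normalised coefficients $\alpha_{ij}$, so that during training each node aggregates over a randomly sampled subset of its neighbourhood. The drop rate and placement are illustrative assumptions; the step is skipped at evaluation time.

```python
# Sketch of dropout applied to attention coefficients (illustrative only).
import numpy as np

def attention_dropout(alpha, p=0.6, rng=None):
    """alpha: (N, N) attention coefficients; p: drop probability (assumed value)."""
    if rng is None:
        rng = np.random.default_rng()
    mask = rng.random(alpha.shape) > p          # keep each coefficient with prob 1 - p
    return alpha * mask / (1.0 - p)             # rescale so expectations match
```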

GAT Layer

[Figure] An illustration of multi-head attention (with $K = 3$ heads) by node 1 on its neighbourhood: nodes $\vec{h}_1, \ldots, \vec{h}_6$, attention coefficients $\alpha_{11}, \ldots, \alpha_{16}$, and a concat/avg aggregation step. Different arrow styles and colours denote independent attention computations. The aggregated features from each head are concatenated or averaged to obtain $\vec{h}'_1$.

Properties

Computationally efficient: attention computation can be parallelised across all edges of the graph, and aggregation across all nodes (see the edge-wise sketch after this list)!

Storage efficient: a sparse version does not require storing more than O(V + E) entries anywhere;

Fixed number of parameters (dependent only on the desired feature count, not on the node count);

Trivially localised (as we aggregate only over neighbourhoods);

Allows for (implicitly) specifying different importances to different neighbours, through its attentional mechanism;

Readily applicable to inductive problems (as it is a shared edge-wise mechanism that does not depend on the global graph structure)!

Satisfies all of the major requirements for a graph convolutional layer simultaneously.
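To make the efficiency claims concrete, here is a minimal sketch of the sparse, edge-wise view: every stored quantity is per node or per edge (O(V + E) entries, up to the feature dimension), raw scores are computed independently per edge, and the softmax and aggregation reduce over edges grouped by destination node. The edge-list format, the LeakyReLU form of $a$, and all names are illustrative assumptions, not the released implementation.

```python
# Sparse, edge-wise sketch of a single attention head (illustrative only).
# Only O(V + E) entries are ever materialised.
import numpy as np

def gat_head_sparse(h, edges, W, a_vec, sigma=np.tanh):
    """h: (N, F) features; edges: (E, 2) int array of directed edges (src -> dst),
    meaning dst attends over src; self-loops must be included for every node."""
    N, F = h.shape
    src, dst = edges[:, 0], edges[:, 1]
    # One raw score per edge, computable independently (and hence in parallel)
    scores = h[dst] @ a_vec[:F] + h[src] @ a_vec[F:]    # (E,) e_ij terms
    scores = np.where(scores > 0, scores, 0.2 * scores) # LeakyReLU (assumed form of a)
    # Softmax over the edges sharing a destination node (segment-style reduction)
    scores = np.exp(scores - scores.max())
    denom = np.zeros(N)
    np.add.at(denom, dst, scores)
    alpha = scores / denom[dst]                         # (E,) attention coefficients
    # Aggregate transformed neighbour features per destination node
    Wh = h @ W                                          # (N, F_out)
    out = np.zeros((N, W.shape[1]))
    np.add.at(out, dst, alpha[:, None] * Wh[src])
    return sigma(out)

# Toy usage: the same 4-node path graph, now as an explicit edge list
rng = np.random.default_rng(0)
h = rng.normal(size=(4, 5))                             # N = 4, F = 5
W = rng.normal(size=(5, 8))
a_vec = rng.normal(size=(10,))
edges = np.array([[0, 0], [1, 1], [2, 2], [3, 3],       # self-loops
                  [0, 1], [1, 0], [1, 2], [2, 1],
                  [2, 3], [3, 2]])                      # path 0-1-2-3, both directions
h_prime = gat_head_sparse(h, edges, W, a_vec)           # (4, 8)
```

The two np.add.at reductions are the only places where edges interact; everything else is computed independently per edge, which is what makes the attention computation parallelisable across all edges.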

Results

Transductive

Method       Cora          Citeseer      Pubmed
MLP          55.1%         46.5%         71.4%
ManiReg      59.5%         60.1%         70.7%
SemiEmb      59.0%         59.6%         71.7%
LP           68.0%         45.3%         63.0%
DeepWalk     67.2%         43.2%         65.3%
ICA          75.1%         69.1%         73.9%
Planetoid    75.7%         64.7%         77.2%
Chebyshev    81.2%         69.8%         74.4%
GCN          81.5%         70.3%         79.0%
MoNet        81.7 ± 0.5%   —             78.8 ± 0.3%
GCN-64*      81.4 ± 0.5%   70.9 ± 0.5%   79.0 ± 0.3%
GAT (ours)   83.0 ± 0.7%   72.5 ± 0.7%   79.0 ± 0.3%

Inductive

Method             PPI (micro-F1)
Random             0.396
MLP                0.422
GraphSAGE-LSTM     0.612
GraphSAGE*         0.768
Const-GAT (ours)   0.934 ± 0.006
GAT (ours)         0.973 ± 0.002

[Figure] t-SNE + attention coefficients on Cora.