Graph Embeddings
Alicia Frame, PhD
October 10, 2019
What’s an embedding?
How do these work?
- Motivating Example - Word2Vec
- Motivating Example - DeepWalk
Graph embeddings overview
Graph embedding techniques
Graph embeddings with Neo4j
Overview
What does the internet say?
- Google: “An embedding is a relatively low-dimensional space into which you can translate high-dimensional vectors”
- Wikipedia: “In mathematics, an embedding is one instance of some mathematical structure contained within another instance, such as a group that is a subgroup.”
TL;DR - what’s an embedding?
A way of mapping something (a document, an image, a graph) into a fixed length vector (or matrix) that captures key features while reducing the dimensionality
Graph embeddings are a specific type of embedding that translate graphs, or parts of graphs, to fixed length vectors (or tensors)
So what’s a graph embedding?
An embedding translates something complex into something a machine can work with:
- Represents the important features of the input object in a compact, low-dimensional format
- The embedded representation can be used as a feature for ML, for direct comparisons, or as an input representation for a DL model

Embeddings typically learn what’s important in an unsupervised, generalizable way.
But why bother?
Motivating Examples
How can I represent words in a way that I can use them mathematically?
- How similar are two words?
- Can I use the representation of a word in a model?

Naive approach - how similar are the strings?
- Hand-engineered rules?
- How many of each letter?

CAT = [10100000000000000001000000]
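The slide's toy encoding can be reproduced in a few lines. This is a minimal sketch marking which of the 26 letters appear in a word (the `letter_vector` helper is illustrative, not from the talk):

```python
# Naive word representation: a 26-slot vector, one slot per letter a-z,
# marking whether that letter appears in the word.

def letter_vector(word):
    """Return a 26-character string: '1' if the letter occurs, else '0'."""
    word = word.lower()
    return "".join("1" if chr(ord("a") + i) in word else "0" for i in range(26))

print(letter_vector("CAT"))  # 10100000000000000001000000
```

The slots for a, c, and t are set, matching the vector on the slide; the obvious weakness is that anagrams ("cat", "act") get identical vectors.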
Motivating example: Word Embeddings
Frequency matrix:
Motivating example: Word Embeddings
Weighted term frequency (TF-IDF): can we use documents to encode words?

Word order probably matters too: words that occur together have similar contexts.
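As a rough illustration of weighted term frequency, here is a minimal TF-IDF sketch using the plain `tf * log(N / df)` variant; the toy corpus and helper function are made up, and real libraries use smoothed variants of the formula:

```python
import math

# TF-IDF: weight a term's frequency in a document by how rare the term
# is across the whole corpus, so ubiquitous words get low weight.
docs = [
    "tylenol is a pain reliever",
    "paracetamol is a pain reliever",
    "the graph is large",
]

def tf_idf(term, doc, docs):
    words = doc.split()
    tf = words.count(term) / len(words)          # term frequency in this doc
    df = sum(term in d.split() for d in docs)    # documents containing the term
    idf = math.log(len(docs) / df)               # rarity across the corpus
    return tf * idf

print(tf_idf("tylenol", docs[0], docs))  # rare term: positive weight
print(tf_idf("is", docs[0], docs))       # appears in every doc: weight 0
```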
Motivating example: Word Embeddings
- “Tylenol is a pain reliever” and “Paracetamol is a pain reliever” have the same context
- Co-occurrence: how often do two words appear in the same context window?
- Context window: a specific number of words and direction
He is not lazy. He is intelligent. He is smart.
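Counting co-occurrences over the slide's toy sentences can be sketched as follows; the window size of one word on each side is an assumption:

```python
from collections import Counter

# Build word co-occurrence counts from a sliding context window.
sentences = ["he is not lazy", "he is intelligent", "he is smart"]
window = 1  # number of neighbors on each side (assumed)

cooc = Counter()
for sentence in sentences:
    words = sentence.split()
    for i, w in enumerate(words):
        for j in range(max(0, i - window), min(len(words), i + window + 1)):
            if i != j:
                cooc[tuple(sorted((w, words[j])))] += 1

# each unordered pair is counted once from each side, so halve the count
he_is = cooc[("he", "is")] // 2
print(he_is)  # "he" and "is" co-occur in all three sentences
```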
Why not stop here?
- You need more documents to really understand context... but the more documents you have, the bigger your matrix is
- Giant sparse matrices or vectors are cumbersome and uninformative

We need to reduce the dimensionality of our matrix.
Motivating example: Word Embeddings
Count-based methods: linear algebra to the rescue?

Pros: preserves semantic relationships, accurate, known methods
Cons: huge memory requirements, not trained for a specific task
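One classic count-based method is a truncated SVD of the co-occurrence matrix: keep only the top-k singular directions as dense embeddings. A minimal sketch, with an illustrative made-up matrix (not from the slides):

```python
import numpy as np

# Factor a toy word-word co-occurrence matrix and keep the top-2
# singular directions as 2-dimensional word embeddings.
vocab = ["he", "is", "not", "lazy", "smart"]
M = np.array([
    [0, 3, 0, 0, 0],
    [3, 0, 1, 0, 1],
    [0, 1, 0, 1, 0],
    [0, 0, 1, 0, 0],
    [0, 1, 0, 0, 0],
], dtype=float)

U, S, Vt = np.linalg.svd(M)
k = 2
embeddings = U[:, :k] * S[:k]  # one k-dimensional row per vocabulary word

print(embeddings.shape)  # (5, 2)
```

This trades a giant sparse matrix for a small dense one, at the cost of the memory needed to hold and factor the full matrix in the first place.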
Motivating Example: Word Embeddings
Predictive Methods: learn an embedding for a specific task
The SkipGram model learns a vector representation for each word that maximizes the probability of that word given the previous words
Motivating Example: Word Embeddings
Input word: one-hot encoded vector
Output prediction: the probability, for each word in the corpus, that it is the next word
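The input and output shapes described above can be sketched as a single forward pass. The weights here are random rather than trained, and all names and sizes are illustrative:

```python
import numpy as np

# One skip-gram forward pass: a one-hot input selects a row of the
# hidden-layer weight matrix (the embedding); the output layer yields a
# probability for every word in the corpus via a softmax.
rng = np.random.default_rng(0)
vocab_size, dim = 10, 4

W_in = rng.normal(size=(vocab_size, dim))   # one row per word: the embeddings
W_out = rng.normal(size=(dim, vocab_size))

def forward(word_idx):
    h = W_in[word_idx]             # embedding lookup (one_hot @ W_in)
    scores = h @ W_out
    probs = np.exp(scores - scores.max())
    return probs / probs.sum()     # softmax over the vocabulary

p = forward(3)
print(p.shape)  # (10,): one probability per corpus word
```

Training adjusts `W_in` and `W_out` by gradient descent so the observed context words get high probability; `W_in` is what you keep.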
Motivating Example: Word Embeddings
The hidden layer is a weight matrix with one row per word, and one column per neuron -- this is the embedding!
Maximize the probability that the next word is w_t given h:
Train model by maximizing the log-likelihood over the training set:
Skipgram model calculates:
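The equations on this slide were images in the original deck and did not survive extraction. In the standard word2vec notation (an assumption here, following the usual skip-gram formulation), they are:

```latex
% Probability of the target word w_t given the context h
P(w_t \mid h) = \mathrm{softmax}\big(\mathrm{score}(w_t, h)\big)
              = \frac{\exp\big(\mathrm{score}(w_t, h)\big)}
                     {\sum_{w' \in V} \exp\big(\mathrm{score}(w', h)\big)}

% Training objective: maximize the log-likelihood over the training set
J = \sum_{t} \log P(w_t \mid h_t)
```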
(if we really want to get into the math)
Motivating Example: Word Embeddings
Word embeddings condense representations of the words while preserving context:
Cool, but what’s this got to do with graphs?
Motivating example: DeepWalk
How do we represent a node in a graph mathematically? Can we adapt word2vec?
- Each node is like a word
- The neighborhood around the node is the context window
Extract the context for each node by sampling random walks from the graph:
For every node in the graph, take n fixed length random walks (equivalent to sentences)
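Sampling the fixed-length walks described above might look like the sketch below; the toy graph, walk length, and number of walks per node are all assumptions:

```python
import random

# DeepWalk-style sampling: from every node, take several fixed-length
# uniform random walks; each walk plays the role of a "sentence".
graph = {
    "a": ["b", "c"],
    "b": ["a", "c"],
    "c": ["a", "b", "d"],
    "d": ["c"],
}

def random_walk(graph, start, length, rng=random):
    walk = [start]
    for _ in range(length - 1):
        walk.append(rng.choice(graph[walk[-1]]))  # hop to a random neighbor
    return walk

# n = 10 walks of length 5 per node
walks = [random_walk(graph, node, 5) for node in graph for _ in range(10)]
print(len(walks))  # 40
```

These walk lists can then be fed to any skip-gram implementation exactly as if they were sentences.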
Motivating example: DeepWalk
Once we have our sentences, we can extract the context windows and learn weights using the same skip-gram model
(Objective is to predict neighboring nodes given the target node)
Motivating example: DeepWalk
Embeddings are the hidden layer weights from the skipgram model
Note: there are also graph equivalents of the matrix factorization and hand-engineered approaches we talked about for words
Motivating example: DeepWalk
Graph Embeddings Overview
There are lots of graph embeddings...
What type of graph are you trying to create an embedding for?
- Monopartite graphs (DeepWalk is designed for these)
- Multipartite graphs (e.g. knowledge graphs)

What aspect of the graph are you trying to represent?
- Vertex embeddings: describe the connectivity of each node
- Path embeddings: traversals across the graph
- Graph embeddings: encode an entire graph into a single vector

Most techniques consist of:
- A similarity function that measures the similarity between nodes
- An encoder function that generates the node embedding
- A decoder function that reconstructs pairwise similarity
- A loss function that measures how good your reconstruction is
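Those four ingredients can be sketched end-to-end for a shallow embedding. Every choice here (adjacency as the similarity function, lookup-table encoder, dot-product decoder, squared-error loss, plain gradient descent) is illustrative, not a specific published method:

```python
import numpy as np

# similarity function: the adjacency matrix of a 4-node toy graph
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

rng = np.random.default_rng(1)
n_nodes, dim = 4, 2
Z = rng.normal(scale=0.1, size=(n_nodes, dim))  # encoder: embedding lookup table

def decoder(Z):
    return Z @ Z.T                        # pairwise dot-product similarity

def loss(Z, A):
    return ((decoder(Z) - A) ** 2).sum()  # reconstruction error

init_loss = loss(Z, A)
for _ in range(200):                      # plain gradient descent on Z
    grad = 4 * (decoder(Z) - A) @ Z
    Z -= 0.01 * grad

print(loss(Z, A) < init_loss)
```

After training, each row of `Z` is a node embedding whose dot products approximate the graph's adjacency structure.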
Node embedding overview

Shallow Graph Embedding Techniques

Shallow: the encoder function is an embedding lookup.

Matrix factorization:
- These techniques all rely on an adjacency matrix as input
- Matrix factorization is applied either directly to the input or to some transformation of the input
- Drawbacks: massive memory footprint, computationally intense

Random walk:
- Obtain node co-occurrence via random walks
- Learn weights to optimize a similarity measure
- Drawbacks: local-only perspective; assumes similar nodes are close together
Why not stick with these?
- Shallow embeddings are inefficient - no parameters are shared between nodes
- They can't leverage node attributes
- They only generate embeddings for nodes present when the embedding was trained - problematic for large, evolving graphs

Newer methodologies compress information:
- Neighborhood autoencoder methods
- Neighborhood aggregation
- Convolutional autoencoders
Shallow Embeddings
Autoencoder methods
Using Graph Embeddings
Why are we going to all this trouble?
Visualization & pattern discovery:
- Leverage lots of existing approaches: t-SNE plots, PCA

Clustering and community detection:
- Apply generic tabular-data approaches (e.g. k-means), while capturing both functional and structural roles
- KNN graphs based on embedding similarity

Node classification / semi-supervised learning:
- Predict missing node attributes

Link prediction:
- Predict edges not present in the graph
- Either using similarity measures/heuristics or ML pipelines
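As a small example of the similarity-heuristic route to link prediction, candidate pairs can be ranked by cosine similarity of their node embeddings; the embedding values below are made up for illustration:

```python
import numpy as np

# Rank candidate (non-)edges by cosine similarity of node embeddings:
# the most similar pair is the most plausible missing link.
embeddings = {
    "a": np.array([1.0, 0.1]),
    "b": np.array([0.9, 0.2]),
    "c": np.array([-0.2, 1.0]),
}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

pairs = [("a", "b"), ("a", "c"), ("b", "c")]
ranked = sorted(pairs,
                key=lambda p: cosine(embeddings[p[0]], embeddings[p[1]]),
                reverse=True)
print(ranked[0])  # ('a', 'b'): the near-parallel pair ranks highest
```

The same similarity scores also drive the KNN-graph construction mentioned above: connect each node to its k most similar neighbors in embedding space.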
Why are we going to all this trouble?
Embeddings can make the graph algorithm library even more powerful!
Graph Embeddings in Neo4j
Two prototype implementations from Labs: DeepWalk & DeepGL
- DeepGL is more similar to a "hand crafted" embedding
- Uses graph algorithms to generate features
- Diffusion of values across edges, dimensionality reduction

Neither is ready for production use - but lessons learned!
- Lots of demand
- Memory intensive and not tuned for performance
- Deep learning is not easy in Java

Python is easy to get started with for experimentation, but doesn't perform at scale.
Neo4j Labs Implementations
We're actively exploring the best ways to implement graph embeddings at scale, so please stay tuned.
...So what’s next?
1. A graph embedding is a fixed length vector of
   a. Numbers
   b. Letters
   c. Nodes

2. An embedding is a ______________ representation of your data
   a. Human readable
   b. Lower dimensional
   c. Binary

3. What's the name of the graph embedding we walked through in this presentation?
Hunger Games!