Scalable Methods for Graph-Based Unsupervised and Semi-Supervised Learning
Frank LinLanguage Technologies Institute
School of Computer ScienceCarnegie Mellon University
PhD Thesis ProposalOctober 8, 2010, Pittsburgh, PA, USA
Thesis CommitteeWilliam W. Cohen (Chair), Christos Faloutsos, Tom Mitchell, Xiaojin Zhu
Motivation
• Graph data is everywhere, and some of it is very big
• We want to do machine learning on it
• For non-graph data, it often makes sense to represent it as a graph for unsupervised and semi-supervised learning: these methods often require a similarity measure between data points, which is naturally represented as edges between nodes
• We want to do clustering and semi-supervised learning on large graphs
• We need scalable methods that provide quality results
Thesis Goal
• Make contributions toward fast, space-efficient, effective, and simple graph-based learning methods that scale up to large datasets.
Road Map

Basic learning methods (Graph, Induced Graph) and applications to a web-scale knowledge base (Applications, Support):

| | Graph | Induced Graph | Applications | Support |
|---|---|---|---|---|
| Unsupervised / Clustering | Power Iteration Clustering | PIC with Path Folding | Learning Synonym NP's; Learning Polyseme NP's; Finding New Types and Relations | Distributed Implementation; Avoiding Collision; Hierarchical |
| Semi-Supervised | MultiRankWalk | MRW with Path Folding | Noun Phrase Categorization | Distributed Implementation |
Talk Outline

• Prior Work
  – Power Iteration Clustering (PIC)
  – PIC with Path Folding
  – MultiRankWalk (MRW)
• Proposed Work
  – MRW with Path Folding
  – MRW for Populating a Web-Scale Knowledge Base
    • Noun Phrase Categorization
  – PIC for Extending a Web-Scale Knowledge Base
    • Learning Synonym NP's
    • Learning Polyseme NP's
    • Finding New Ontology Types and Relations
  – PIC Extensions
  – Distributed Implementations
Power Iteration Clustering
• Spectral clustering methods are nice, and a natural choice for graph data
• But they are rather expensive (slow)

Power iteration clustering (PIC) can provide a similar solution at a very low cost (fast)!
Background: Spectral Clustering
• Idea: instead of clustering data points in their original (Euclidean) space, cluster them in the space spanned by the “significant” eigenvectors of a (Laplacian) similarity matrix
A popular spectral clustering method: normalized cuts (NCut).
Background: Spectral Clustering

[Figure: a dataset and its normalized cut results. The 2nd and 3rd smallest eigenvectors are plotted as value vs. index, and in the resulting clustering space the points of clusters 1, 2, and 3 are well separated.]
Background: Spectral Clustering
• Normalized Cut algorithm (Shi & Malik 2000):
1. Choose k and a similarity function s
2. Derive the affinity matrix A from s; let W = I − D⁻¹A, where I is the identity matrix and D is a diagonal matrix with D_ii = Σ_j A_ij
3. Find the eigenvectors and corresponding eigenvalues of W
4. Pick the eigenvectors of W with the 2nd through kth smallest eigenvalues as the "significant" eigenvectors
5. Project the data points onto the space spanned by these vectors
6. Run k-means on the projected data points

Finding the eigenvectors and eigenvalues of a matrix is very slow in general. Can we find a similar low-dimensional embedding for clustering without eigenvectors?
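To make the six steps concrete, here is a minimal NumPy/SciPy sketch (mine, not from the proposal); the RBF similarity function, the `sigma` value, and the dense eigendecomposition are illustrative choices, and the O(n³) `eig` call is exactly the bottleneck the following slides attack:

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def ncut_cluster(X, k, sigma=1.0, seed=0):
    """Cluster the rows of X into k groups following the NCut recipe."""
    # Step 2: affinity A from an RBF similarity; W = I - D^{-1} A.
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    A = np.exp(-sq / (2.0 * sigma**2))
    np.fill_diagonal(A, 0.0)
    W = np.eye(len(X)) - A / A.sum(axis=1, keepdims=True)
    # Step 3: full eigendecomposition (dense, O(n^3): the bottleneck).
    vals, vecs = np.linalg.eig(W)
    order = np.argsort(vals.real)
    # Step 4: the eigenvectors with the 2nd through kth smallest eigenvalues.
    embed = vecs[:, order[1:k]].real
    # Steps 5-6: k-means in the embedded space.
    _, labels = kmeans2(embed, k, seed=seed, minit="++")
    return labels

# Two well-separated Gaussian blobs should come back as two clusters.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(5, 0.3, (20, 2))])
labels = ncut_cluster(X, k=2)
```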
The Power Iteration
• The power iteration, or power method, is a simple iterative method for finding the dominant eigenvector of a matrix:

v^(t+1) = cWv^t

where W is a square matrix, v^t is the vector at iteration t (v^0 is typically a random vector), and c is a normalizing constant that keeps v^t from getting too large or too small.

Typically converges quickly; fairly efficient if W is a sparse matrix.
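In code, the update is just repeated multiplication with rescaling. A small NumPy sketch; the L1 normalization used for c and the 2-by-2 row-stochastic example matrix are my own choices:

```python
import numpy as np

def power_iteration(W, iters=100, seed=0):
    """Find the dominant eigenvector of W by repeated multiplication."""
    rng = np.random.default_rng(seed)
    v = rng.random(W.shape[0])      # v^0: a random starting vector
    for _ in range(iters):
        v = W @ v
        v = v / np.abs(v).sum()     # c: keep v from growing or shrinking
    return v

# For a row-stochastic W the dominant eigenvalue is 1 and the dominant
# (right) eigenvector is constant, so the iterate flattens out.
W = np.array([[0.9, 0.1],
              [0.2, 0.8]])
v = power_iteration(W)
```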
The Power Iteration
• The power method again:

v^(t+1) = cWv^t

What if we let W = D⁻¹A (like Normalized Cut)? Then W is the row-normalized similarity matrix.
The Power Iteration

[Figure-only slide.]
Power Iteration Clustering
• The 2nd to kth eigenvectors of W=D-1A are roughly piece-wise constant with respect to the underlying clusters, each separating a cluster from the rest of the data (Meila & Shi 2001)
• The linear combination of piece-wise constant vectors is also piece-wise constant!
Spectral Clustering

[Figure: the same dataset and normalized cut results as before: 2nd and 3rd smallest eigenvectors (value vs. index) and the clustering space separating clusters 1, 2, and 3.]
[Figure: a · (piecewise-constant vector) + b · (piecewise-constant vector) = (piecewise-constant vector).]
Power Iteration Clustering

[Figure: the dataset and PIC results, clustering on v^t.]

Key idea: to do clustering, we may not need all the information in a full spectral embedding (e.g., distances between clusters in a k-dimensional eigenspace); we just need the clusters to be separated in some space.
When to Stop
Recall:

v^t = c_1 λ_1^t e_1 + c_2 λ_2^t e_2 + … + c_k λ_k^t e_k + c_(k+1) λ_(k+1)^t e_(k+1) + … + c_n λ_n^t e_n

Then:

v^t / (c_1 λ_1^t) = e_1 + (c_2/c_1)(λ_2/λ_1)^t e_2 + … + (c_k/c_1)(λ_k/λ_1)^t e_k + (c_(k+1)/c_1)(λ_(k+1)/λ_1)^t e_(k+1) + … + (c_n/c_1)(λ_n/λ_1)^t e_n
Because they are raised to the power t, the eigenvalue ratios determine how fast v converges to e_1.

At the beginning, v changes fast ("accelerating") as it converges locally: the "noise terms" (k+1 … n), which have small λ, die out quickly.

When the noise terms have gone to zero, v changes slowly ("constant speed"), because only the larger-λ terms (2 … k) are left, whose eigenvalue ratios are close to 1.
Power Iteration Clustering
• A basic power iteration clustering (PIC) algorithm:

Input: a row-normalized affinity matrix W and the number of clusters k
Output: clusters C_1, C_2, …, C_k

1. Pick an initial vector v^0
2. Repeat:
   • Set v^(t+1) ← Wv^t
   • Set δ^(t+1) ← |v^(t+1) − v^t|
   • Increment t
   • Stop when |δ^t − δ^(t−1)| ≈ 0
3. Use k-means to cluster points on v^t and return clusters C_1, C_2, …, C_k
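The algorithm above can be sketched in a few lines of NumPy/SciPy. The degree-based v^0, the L1 renormalization, and the eps/n stopping threshold below are choices loosely following the PIC paper rather than details fixed by this slide:

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def pic(A, k, max_iter=1000, eps=1e-5, seed=0):
    """Basic power iteration clustering on an affinity matrix A."""
    n = len(A)
    W = A / A.sum(axis=1, keepdims=True)       # row-normalize: W = D^{-1} A
    v = A.sum(axis=1) / A.sum()                # v^0: degree-based start vector
    delta_prev = np.inf
    for _ in range(max_iter):
        v_next = W @ v
        v_next = v_next / np.abs(v_next).sum() # keep v from shrinking to zero
        delta = np.abs(v_next - v).sum()
        v = v_next
        if abs(delta_prev - delta) < eps / n:  # stop when "acceleration" ~ 0
            break
        delta_prev = delta
    _, labels = kmeans2(v[:, None], k, seed=seed, minit="++")
    return labels

# Two blocks with strong within-block and weak cross-block affinity.
A = np.block([[np.full((5, 5), 1.0), np.full((5, 5), 0.01)],
              [np.full((5, 5), 0.01), np.full((5, 5), 0.5)]])
labels = pic(A, k=2)
```

Note that the early stop matters: run to full convergence and v flattens into the constant dominant eigenvector, losing the cluster structure.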
PIC Runtime

[Figure: runtime comparison of PIC against Normalized Cut and a faster Normalized Cut implementation, which ran out of memory (24GB).]
PIC Accuracy on Network Datasets
[Figure: accuracy scatter plot. Points in the upper triangle mean PIC does better; points in the lower triangle mean NCut or NJW does better.]
Clustering Text Data
• Spectral clustering methods are nice
• We want to use them for clustering (a lot of) text data
The Problem with Text Data
• Documents are often represented as feature vectors of words:
The importance of a Web page is an inherently subjective matter, which depends on the readers…
In this paper, we present Google, a prototype of a large-scale search engine which makes heavy use…
You're not cool just because you have a lot of followers on twitter, get over yourself…
| | cool | web | search | make | over | you |
|---|---|---|---|---|---|---|
| doc 1 | 0 | 4 | 8 | 2 | 5 | 3 |
| doc 2 | 0 | 8 | 7 | 4 | 3 | 2 |
| doc 3 | 1 | 0 | 0 | 0 | 1 | 2 |
The Problem with Text Data
• Feature vectors are often sparse, but the similarity matrix is not!

The feature matrix is mostly zeros: any document contains only a small fraction of the vocabulary.

| | cool | web | search | make | over | you |
|---|---|---|---|---|---|---|
| doc 1 | 0 | 4 | 8 | 2 | 5 | 3 |
| doc 2 | 0 | 8 | 7 | 4 | 3 | 2 |
| doc 3 | 1 | 0 | 0 | 0 | 1 | 2 |

The similarity matrix is mostly non-zero: any two documents are likely to have a word in common.

| | doc 1 | doc 2 | doc 3 |
|---|---|---|---|
| doc 1 | – | 125 | 27 |
| doc 2 | 125 | – | 23 |
| doc 3 | 27 | 23 | – |
The Problem with Text Data
• A similarity matrix is the input to many clustering methods, including spectral clustering
• Spectral clustering requires computing the eigenvectors of the similarity matrix: in general O(n³), and approximation methods are still not very fast
• The similarity matrix itself takes O(n²) time to construct, O(n²) space to store, and more than O(n²) time to operate on

Too expensive! Does not scale up to big datasets!
The Problem with Text Data
• We want to use the similarity matrix for clustering (like spectral clustering), but:
– without calculating eigenvectors
– without constructing or storing the similarity matrix

Solution: Power Iteration Clustering + Path Folding
Path Folding
• Recall the basic power iteration clustering (PIC) algorithm:

Input: a row-normalized affinity matrix W and the number of clusters k
Output: clusters C_1, C_2, …, C_k

1. Pick an initial vector v^0
2. Repeat:
   • Set v^(t+1) ← Wv^t (the key operation in PIC: a matrix-vector multiplication!)
   • Set δ^(t+1) ← |v^(t+1) − v^t|
   • Increment t
   • Stop when |δ^t − δ^(t−1)| ≈ 0
3. Use k-means to cluster points on v^t and return clusters C_1, C_2, …, C_k

Okay, we have a fast clustering method, but there's the W that requires O(n²) storage space and construction and operation time!
Path Folding
• What's so good about matrix-vector multiplication?
• If we can decompose the matrix:

Wv^t = (ABC)v^t

• then we arrive at the same solution doing a series of matrix-vector multiplications:

v^(t+1) = A(B(Cv^t))

Isn't this more expensive? Well, not if W is dense and A, B, and C are sparse!
Path Folding
• As long as we can decompose the matrix into a series of sparse matrices, we can turn a dense matrix-vector multiplication into a series of sparse matrix-vector multiplications.

This means we can turn an operation that requires O(n²) storage and runtime into one that requires O(n) storage and runtime! This is exactly the case for text data, and many other kinds of data as well.
Path Folding
• Example: inner product similarity:

W = D⁻¹FFᵀ

where F is the original feature matrix (given; storage: just use F), Fᵀ is the feature matrix transposed, and D is a diagonal matrix that normalizes W so its rows sum to 1 (construction and storage: O(n)).
Path Folding
• Example: inner product similarity:

W = D⁻¹FFᵀ

• Iteration update:

v^(t+1) = D⁻¹(F(Fᵀv^t))

Construction: O(n). Storage: O(n). Operation: O(n).

Okay… how about a similarity function we actually use for text data?
Path Folding
• Example: cosine similarity:

W = D⁻¹NFFᵀN

where N is the diagonal cosine-normalizing matrix.

• Iteration update:

v^(t+1) = D⁻¹(N(F(Fᵀ(Nv^t))))

Construction: O(n). Storage: O(n). Operation: O(n).

Compact storage: we don't need a cosine-normalized version of the feature vectors.
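As a concrete check of the inner-product case, v^(t+1) = D⁻¹(F(Fᵀv^t)), the sketch below performs one folded update with two sparse matrix-vector products, never forming FFᵀ; the 3-by-4 document-term matrix is made up for illustration:

```python
import numpy as np
import scipy.sparse as sp

def folded_update(F, v):
    """One PIC update with W = D^{-1} F F^T, without materializing W."""
    s = F @ (F.T @ v)                    # F (F^T v): two sparse mat-vecs
    # Row sums of F F^T (the diagonal of D), also via folded mat-vecs.
    d = F @ (F.T @ np.ones(F.shape[0]))
    return s / d

# Tiny sparse document-term matrix: 3 documents, 4 terms.
F = sp.csr_matrix(np.array([[2., 1., 0., 0.],
                            [1., 2., 0., 0.],
                            [0., 0., 1., 3.]]))
v = np.array([0.2, 0.3, 0.5])
v_new = folded_update(F, v)
# Matches applying the dense W = D^{-1} F F^T to v directly.
```

In practice the diagonal D would be computed once up front rather than on every update.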
Path Folding
• We refer to this technique as path folding due to its connections to “folding” a bipartite graph into a unipartite graph.
Results
• An accuracy result:

[Figure: scatter plot where each point is the accuracy on a 2-cluster text dataset; the upper triangle means we win, the lower triangle means spectral clustering wins. Most datasets fall on the diagonal (tied): no statistically significant difference, same accuracy.]
Results
• A scalability result:

[Figure: algorithm runtime (log scale) vs. data size (log scale); spectral clustering (red and blue) follows a quadratic curve, our method (green) a linear curve.]
MultiRankWalk
• Classification labels are expensive to obtain
• Semi-supervised learning (SSL) learns from labeled and unlabeled data for classification
• When it comes to network data, what is a general, simple, and effective method that requires very few labels?

Our answer: MultiRankWalk (MRW)
Random Walk with Restart
• Imagine a network; starting at a specific node, you follow the edges randomly.
• But with some probability, you "jump" back to the starting node (restart!).

If you recorded the number of times you land on each node, what would that distribution look like?
Random Walk with Restart
[Figure: the walk distribution over the network. What if we start at a different node?]
Random Walk with Restart
• The walk distribution r satisfies a simple equation:

r = (1 − d)u + dWr

where u indicates the start node(s), W is the transition matrix of the network, (1 − d) is the restart probability, and d is the "keep-going" probability (damping factor).
Random Walk with Restart
• Random walk with restart (RWR) can be solved simply and efficiently with an iterative procedure:

r^(t+1) = (1 − d)u + dWr^t
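A direct implementation of this iteration; the 3-node path graph, the column-stochastic convention for W, and d = 0.85 are illustrative assumptions, not details from the proposal:

```python
import numpy as np

def rwr(W, u, d=0.85, iters=200):
    """Iterate r <- (1 - d)u + d W r to get the walk distribution."""
    r = u.copy()
    for _ in range(iters):
        r = (1 - d) * u + d * (W @ r)
    return r

# Path graph 0 - 1 - 2; W[i, j] = P(step to i | currently at j).
W = np.array([[0.0, 0.5, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 0.5, 0.0]])
u = np.array([1.0, 0.0, 0.0])   # restart at node 0
r = rwr(W, u)
# r sums to 1, and the start node is landed on more than the far node.
```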
RWR for Classification
• Simple idea: use RWR for classification
• Run RWR with the start nodes being the labeled points in class A, and again with the start nodes being the labeled points in class B
• Nodes frequented more by RWR(A) belong to class A; otherwise they belong to class B
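A sketch of this per-class scheme: one RWR per class, seeded on that class's labeled nodes, then an argmax over the scores. The two-clique toy graph, the uniform restart over each class's seeds, and d = 0.85 are my assumptions; MRW as proposed also considers seed preference and weighting:

```python
import numpy as np

def mrw_classify(W, seeds_by_class, d=0.85, iters=200):
    """Label nodes by comparing per-class RWR scores (seeds as start nodes)."""
    n = W.shape[0]
    scores = []
    for seeds in seeds_by_class:
        u = np.zeros(n)
        u[seeds] = 1.0 / len(seeds)        # uniform restart over the seeds
        r = u.copy()
        for _ in range(iters):
            r = (1 - d) * u + d * (W @ r)  # random walk with restart
        scores.append(r)
    return np.argmax(np.vstack(scores), axis=0)

# Two 3-node cliques joined by one weak bridge; one seed per class.
A = np.array([[0, 1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 1],
              [0, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]], dtype=float)
W = A / A.sum(axis=0)                      # column-stochastic transitions
labels = mrw_classify(W, seeds_by_class=[[0], [5]])
```

Each clique ends up labeled with the class of its own seed, bridge nodes included.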
Results

Accuracy: MRW vs. the harmonic functions method (HF)

[Figure: accuracy scatter plots of MRW vs. HF; point color and style roughly indicate the number of labeled examples. With a larger number of labels, both methods do well; MRW does much better with only a few labels!]
Results

• How much better is MRW using authoritative seed preference?

[Figure: y-axis: MRW F1 score minus wvRN F1 score; x-axis: number of seed labels per class. The gap between MRW and wvRN narrows with authoritative seeds, but differences are still prominent on some datasets with a small number of seed labels.]
MRW with Path Folding
• MRW is based on random walk with restart (RWR):

r^(t+1) = (1 − d)u + dWr^t

• The core operation is a matrix-vector multiplication, so "path folding" can be applied to MRW as well, to efficiently learn from induced graph data! (Most other methods preprocess the induced graph data to create a sparse similarity matrix.)
• Path folding can also be applied to iterative implementations of the harmonic functions method.
On to real, large, difficult applications…
• We now have a basic set of tools for efficient unsupervised and semi-supervised learning on graph data
• We want to apply them to cool, challenging problems…
A Web-Scale Knowledge Base
• Read the Web (RtW) project: build a never-ending system that learns to extract information from unstructured web pages, resulting in a knowledge base of structured information.
Noun Phrase and Context Data
• As part of the RtW project, two kinds of noun phrase (NP) and context co-occurrence data were produced:
– NP-context co-occurrence data
– NP-NP co-occurrence data
• These datasets can be treated as graphs
Noun Phrase and Context Data
• NP-context data:

… know that drinking pomegranate juice may not be a bad …

[Figure: NP-context graph. The NP "pomegranate juice" is linked to its before context "know that drinking _" (count 3) and its after context "_ may not be a bad" (count 2); other NPs such as "JELL-O" and "Jagermeister" link to contexts like "_ is made from" and "_ promotes responsible", with co-occurrence counts as edge weights.]
Noun Phrase and Context Data
• NP-NP data:

… French Constitution Council validates burqa ban …

[Figure: NP-NP graph. The NPs "French Constitution Council" and "burqa ban" co-occur in a sentence; they link to other NPs such as "French Court", "veil", and "hot pants". The shared context can be used for weighting edges or for making a more complex graph.]
Noun Phrase Categorization
• We propose using MRW (with path folding) on the NP-context data to categorize NPs, given a handful of seed NPs.
• Challenges:
– Large, noisy dataset (10m NPs and 8.6m contexts from 500m web pages)
– What's the right function for NP-NP categorical similarity?
– Which learned category assignments should we "promote" to the knowledge base?
– How do we evaluate it?
Noun Phrase Categorization
• Preliminary experiment:
  – Small subset of the NP-context data
    • 88k NPs
    • 99k contexts
  – Find category “city”
    • Start with a handful of seeds
  – Ground truth set of 2,404 city NPs created using
![Page 59: Scalable Methods for Graph-Based Unsupervised and Semi-Supervised Learning Frank Lin Language Technologies Institute School of Computer Science Carnegie](https://reader030.vdocuments.us/reader030/viewer/2022032800/56649d2c5503460f94a02700/html5/thumbnails/59.jpg)
59
Noun Phrase Categorization
• Preliminary result: a precision-recall plot (recall 10%-100% on the x-axis, precision 0.0-1.0 on the y-axis) comparing HF, coEM, MRW, and MRWb.
![Page 61: Scalable Methods for Graph-Based Unsupervised and Semi-Supervised Learning Frank Lin Language Technologies Institute School of Computer Science Carnegie](https://reader030.vdocuments.us/reader030/viewer/2022032800/56649d2c5503460f94a02700/html5/thumbnails/61.jpg)
61
Learning Synonym NP’s
• Currently the knowledge base sees NP’s with different strings as different entities
• It would be useful to know when two different surface forms refer to the same semantic entity
Natural thing to try: use PIC to cluster NPs
But do we get synonyms?
![Page 62: Scalable Methods for Graph-Based Unsupervised and Semi-Supervised Learning Frank Lin Language Technologies Institute School of Computer Science Carnegie](https://reader030.vdocuments.us/reader030/viewer/2022032800/56649d2c5503460f94a02700/html5/thumbnails/62.jpg)
62
Learning Synonym NP’s

NP-context graph yields categorical clusters…
President Obama
Senator Obama
Barack Obama
Senator McCain
George W. Bush
Nicolas Sarkozy
Tony Blair
NP-NP graph yields semantically related clusters…
President Obama
Senator Obama
Barack Obama
Senator McCain
African American
Democrat
Bailout
But together they may help us to identify synonyms!
Other methods may help to further refine results:
President Obama
Senator Obama
Barack Obama
BarackObama.com
Obamamania
Anti-Obama
Michelle Obama
e.g., string similarity
![Page 64: Scalable Methods for Graph-Based Unsupervised and Semi-Supervised Learning Frank Lin Language Technologies Institute School of Computer Science Carnegie](https://reader030.vdocuments.us/reader030/viewer/2022032800/56649d2c5503460f94a02700/html5/thumbnails/64.jpg)
64
Learning Polyseme NP’s
• Currently a noun phrase found on a web page can only be mapped to a single entity in the knowledge base
• It would be useful to know when two different entities share the same surface form.
• Example:
Jordan
?
![Page 65: Scalable Methods for Graph-Based Unsupervised and Semi-Supervised Learning Frank Lin Language Technologies Institute School of Computer Science Carnegie](https://reader030.vdocuments.us/reader030/viewer/2022032800/56649d2c5503460f94a02700/html5/thumbnails/65.jpg)
65
Learning Polyseme NP’s

• Proposal: Cluster the NP’s neighbors:
  – Cluster contexts in the NP-context graph and NPs in the NP-NP graph
  – Identify salient clusters as references to different entities
• Results from clustering neighbors of “Jordan” in the NP-NP graph:
Tru, morgue, Harrison, Jackie, Rick, Davis, Jack, Chicago, episode, sights, six NBA championships, glass, United Center, brother, restaurant, mother, show, Garret, guest star, dead body, woman, Khouri, Loyola, autopsy, friend, season, release, corpse, Lantern, NBA championships, Maternal grandparents, Sox, Starr, medical examiner, Ivers, Hotch, maiden name, NBA titles, cousin John, Scottie Pippen, guest appearance, Peyton, Noor, night, name, Dr . Cox, third pick, Phoebe, side, third season, EastEnders, Tei, Chicago White Sox, trumpeter, day, Chicago Bulls, products, couple, Pippen, Illinois, All-NBA First Team, Dennis Rodman, first retirement
1948-1967, 500,000 Iraqis, first Arab country, Palestinian Arab state, Palestinian President, Arab-states, al-Qaeda, Arab World, countries, ethics, Faynan, guidelines, Malaysia, methodology, microfinance, militias, Olmert, peace plan, Strategic Studies, million Iraqi refugees, two Arab countries, two Arab states, Palestinian autonomy, Muslim holy sites, Jordan algebras, Jordanian citizenship, Arab Federation, Hashemite dynasty, Jordanian citizens, Gulf Cooperation, Dahabi, Judeh, Israeli-occupied West Bank, Mashal, 700,000 refugees, Yarmuk River, Palestine Liberation, Sunni countries, 2 million Iraqi, 2 million Iraqi refugees, Israeli borders, moderate Arab states
![Page 67: Scalable Methods for Graph-Based Unsupervised and Semi-Supervised Learning Frank Lin Language Technologies Institute School of Computer Science Carnegie](https://reader030.vdocuments.us/reader030/viewer/2022032800/56649d2c5503460f94a02700/html5/thumbnails/67.jpg)
67
Finding New Types and Relations
• The proposed methods for finding synonyms and polysemes identify particular nodes and relationships between nodes based on the characteristics of the graph structure, using PIC
Can we generalize these methods to find any unary or binary relationships between nodes in the graph?

Different relationships may require different kinds of graphs and similarity functions.

Can we learn effective similarity functions for clustering efficiently?

Admittedly the most open-ended part of this proposal…

Tom’s suggestion: Given a set of NPs, can you find k categories and l relationships that cover the most NPs?
![Page 69: Scalable Methods for Graph-Based Unsupervised and Semi-Supervised Learning Frank Lin Language Technologies Institute School of Computer Science Carnegie](https://reader030.vdocuments.us/reader030/viewer/2022032800/56649d2c5503460f94a02700/html5/thumbnails/69.jpg)
69
PIC Extension: Avoiding Collisions
• One robustness question for vanilla PIC as data size and complexity grows:
• How many (noisy) clusters can you fit in one dimension without them “colliding”?
Cluster signals cleanly separated
A little too close for comfort?
![Page 70: Scalable Methods for Graph-Based Unsupervised and Semi-Supervised Learning Frank Lin Language Technologies Institute School of Computer Science Carnegie](https://reader030.vdocuments.us/reader030/viewer/2022032800/56649d2c5503460f94a02700/html5/thumbnails/70.jpg)
70
PIC Extension: Avoiding Collisions
• We propose two general solutions:
  1. Precondition the starting vector; instead of random:
     • Degree
     • Skewed (1, 2, 4, 8, 16, etc.)
     • May depend on particular properties of the data
  2. Run PIC d times with different random starts and construct a d-dimensional embedding:
     • Unlikely two clusters collide on all d dimensions
     • We can afford it because PIC is fast and space-efficient!
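A minimal sketch of the second solution, in my own toy code: run the PIC power iteration d times from different random starts and stack the one-dimensional embeddings into an n×d matrix for k-means. The tolerance and the simple change-of-change stopping test stand in for the exact acceleration criterion; both are illustrative assumptions.

```python
import numpy as np

def pic_embedding(A, d=4, max_iter=1000, eps=1e-5, seed=0):
    # Run PIC's power iteration d times from different random start
    # vectors; each run yields a 1-d embedding, stacked into n x d.
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    W = A / A.sum(axis=1, keepdims=True)      # row-normalized affinity matrix
    cols = []
    for _ in range(d):
        v = rng.random(n)
        v /= np.abs(v).sum()                  # L1-normalize the start vector
        delta_prev = np.full(n, np.inf)
        for _ in range(max_iter):
            v_new = W @ v
            v_new /= np.abs(v_new).sum()
            delta = np.abs(v_new - v)
            v = v_new
            # stop early when the change-of-change ("acceleration") is tiny
            if np.max(np.abs(delta - delta_prev)) < eps / n:
                break
            delta_prev = delta
        cols.append(v)
    return np.column_stack(cols)              # feed this matrix to k-means

# Toy usage: a small 4-node graph, embedded in 3 dimensions.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
E = pic_embedding(A, d=3)
```

Each column costs only O(e) per iteration, which is why running a handful of extra starts is affordable.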
![Page 71: Scalable Methods for Graph-Based Unsupervised and Semi-Supervised Learning Frank Lin Language Technologies Institute School of Computer Science Carnegie](https://reader030.vdocuments.us/reader030/viewer/2022032800/56649d2c5503460f94a02700/html5/thumbnails/71.jpg)
71
PIC Extension: Avoiding Collisions
• Preliminary results on network classification datasets:
RED: PIC embedding with a random start vector
GREEN: PIC using a degree start vector
BLUE: PIC using 4 random start vectors

x-axis: dataset (k = # of clusters)

1-dimensional PIC embeddings lose on accuracy at higher k’s compared to NCut and NJW, but using 4 random vectors instead helps! Note that the # of vectors << k.
![Page 72: Scalable Methods for Graph-Based Unsupervised and Semi-Supervised Learning Frank Lin Language Technologies Institute School of Computer Science Carnegie](https://reader030.vdocuments.us/reader030/viewer/2022032800/56649d2c5503460f94a02700/html5/thumbnails/72.jpg)
72
PIC Extension: Avoiding Collisions
• Preliminary results on name disambiguation datasets:
Again, using 4 random vectors seems to work!

Again, note that the # of vectors << k.
![Page 73: Scalable Methods for Graph-Based Unsupervised and Semi-Supervised Learning Frank Lin Language Technologies Institute School of Computer Science Carnegie](https://reader030.vdocuments.us/reader030/viewer/2022032800/56649d2c5503460f94a02700/html5/thumbnails/73.jpg)
73
PIC Extension: Hierarchical Clustering
• Real, large-scale data may not have a “flat” clustering structure
• A hierarchical view may be more useful
Good news: the dynamics of a PIC embedding display a hierarchically convergent behavior!
![Page 74: Scalable Methods for Graph-Based Unsupervised and Semi-Supervised Learning Frank Lin Language Technologies Institute School of Computer Science Carnegie](https://reader030.vdocuments.us/reader030/viewer/2022032800/56649d2c5503460f94a02700/html5/thumbnails/74.jpg)
74
PIC Extension: Hierarchical Clustering
• Why? Recall the PIC embedding at time t:

$v^{t} = c_1 \lambda_1^{t} e_1 + c_2 \lambda_2^{t} e_2 + c_3 \lambda_3^{t} e_3 + \cdots + c_n \lambda_n^{t} e_n$

where the $e_i$ are eigenvectors (structure), ordered from big to small eigenvalues $\lambda_i$.

Less significant eigenvectors / structures go away first, one by one; more salient structures stick around.

There may not be a clear eigengap, just a gradient of cluster saliency.
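A toy numeric sketch of why this happens (my own example graph, not from the talk): expand a random start vector in the eigenbasis of W = D⁻¹A and watch the coefficients of v^t; components along smaller-|λ| eigenvectors (finer structure) shrink fastest, so continuing to iterate merges clusters hierarchically.

```python
import numpy as np

# Two triangles joined by one edge; W = D^-1 A is row-stochastic.
A = np.array([[0, 1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 1],
              [0, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]], dtype=float)
W = A / A.sum(axis=1, keepdims=True)

lam, E = np.linalg.eig(W)
order = np.argsort(-np.abs(lam))            # sort by |eigenvalue|, big to small
lam, E = lam[order], E[:, order]

# Write the start vector as v0 = sum_i c_i e_i.
c = np.linalg.solve(E, np.random.default_rng(0).random(6))

for t in (1, 5, 20):
    coeffs = np.abs(c * lam**t)             # |c_i lambda_i^t| at time t
    print(t, np.round(coeffs / coeffs[0], 4))  # trailing terms vanish first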
![Page 75: Scalable Methods for Graph-Based Unsupervised and Semi-Supervised Learning Frank Lin Language Technologies Institute School of Computer Science Carnegie](https://reader030.vdocuments.us/reader030/viewer/2022032800/56649d2c5503460f94a02700/html5/thumbnails/75.jpg)
75
PIC Extension: Hierarchical Clustering
PIC already converged to 8 clusters…
But let’s keep on iterating…
“N” still a part of the “2009” cluster…

Similar behavior is also noted in matrix-matrix power methods (diffusion maps, mean-shift, multi-resolution spectral clustering)

Same dataset you’ve seen

Yes (it might take a while)
![Page 77: Scalable Methods for Graph-Based Unsupervised and Semi-Supervised Learning Frank Lin Language Technologies Institute School of Computer Science Carnegie](https://reader030.vdocuments.us/reader030/viewer/2022032800/56649d2c5503460f94a02700/html5/thumbnails/77.jpg)
77
Distributed / Parallel Implementations
• Distributed / parallel implementations of learning methods are necessary to support large-scale data given the direction of hardware development
• PIC, MRW, and their path folding variants have at their core sparse matrix-vector multiplications
• Sparse matrix-vector multiplication lends itself well to a distributed / parallel computing framework
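To make the claim concrete, here is a minimal sketch (toy matrix of my own) of that core primitive, and of the row-partitioned form a distributed framework would compute: each worker owns a block of rows and multiplies it against the shared vector.

```python
import numpy as np
from scipy.sparse import csr_matrix

# The per-iteration core of PIC/MRW is one sparse matrix-vector
# multiplication, O(e) in the number of edges.
rows = [0, 0, 1, 2, 2, 3]
cols = [1, 2, 0, 0, 3, 2]
vals = [1., 1., 1., 1., 1., 1.]
W = csr_matrix((vals, (rows, cols)), shape=(4, 4))
v = np.ones(4) / 4

# Single-machine matvec:
full = W @ v

# Row-partitioned version: each worker multiplies its own row block,
# and the results are concatenated (roughly what a MapReduce or
# Pregel-style job would do per superstep).
parts = [W[0:2] @ v, W[2:4] @ v]
```

The row blocks touch disjoint output entries, so no coordination is needed beyond broadcasting v each iteration.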
• We propose to use• Alternatives:
Existing graph analysis tool:
![Page 78: Scalable Methods for Graph-Based Unsupervised and Semi-Supervised Learning Frank Lin Language Technologies Institute School of Computer Science Carnegie](https://reader030.vdocuments.us/reader030/viewer/2022032800/56649d2c5503460f94a02700/html5/thumbnails/78.jpg)
78
Talk Outline
• Prior Work• Power Iteration Clustering (PIC)• PIC with Path Folding• MultiRankWalk (MRW)
• Proposed Work• MRW with Path Folding• MRW for Populating a Web-Scale Knowledge Base
• Noun Phrase Categorization
• PIC for Extending a Web-Scale Knowledge Base• Learning Synonym NP’s• Learning Polyseme NP’s• Finding New Ontology Types and Relations
• PIC Extensions• Distributed Implementations
![Page 79: Scalable Methods for Graph-Based Unsupervised and Semi-Supervised Learning Frank Lin Language Technologies Institute School of Computer Science Carnegie](https://reader030.vdocuments.us/reader030/viewer/2022032800/56649d2c5503460f94a02700/html5/thumbnails/79.jpg)
Questions?
![Page 80: Scalable Methods for Graph-Based Unsupervised and Semi-Supervised Learning Frank Lin Language Technologies Institute School of Computer Science Carnegie](https://reader030.vdocuments.us/reader030/viewer/2022032800/56649d2c5503460f94a02700/html5/thumbnails/80.jpg)
80
Additional Information
![Page 81: Scalable Methods for Graph-Based Unsupervised and Semi-Supervised Learning Frank Lin Language Technologies Institute School of Computer Science Carnegie](https://reader030.vdocuments.us/reader030/viewer/2022032800/56649d2c5503460f94a02700/html5/thumbnails/81.jpg)
PIC: Some “Powering” Methods at a Glance

| Method | W | Iterate | Stopping | Final |
|---|---|---|---|---|
| Tishby & Slonim 2000 | W=D⁻¹A | Wᵗ⁺¹=WᵗW | rate of information loss | information bottleneck method |
| Zhou & Woodruff 2004 | W=A | Wᵗ⁺¹=WᵗW | a small t | a threshold ε |
| Carreira-Perpinan 2006 | W=D⁻¹A | Xᵗ⁺¹=WX | entropy | a threshold ε |
| PIC | W=D⁻¹A | vᵗ⁺¹=Wvᵗ | acceleration | k-means |

How far can we go with a one- or low-dimensional embedding?
![Page 82: Scalable Methods for Graph-Based Unsupervised and Semi-Supervised Learning Frank Lin Language Technologies Institute School of Computer Science Carnegie](https://reader030.vdocuments.us/reader030/viewer/2022032800/56649d2c5503460f94a02700/html5/thumbnails/82.jpg)
82
PIC: Versus Popular Fast Sparse Eigencomputation Methods

| For Symmetric Matrices | For General Matrices | Improvement |
|---|---|---|
| Power Method | Power Method | Basic; numerically unstable, can be slow |
| Lanczos Method | Arnoldi Method | More stable, but requires lots of memory |
| Implicitly Restarted Lanczos Method (IRLM) | Implicitly Restarted Arnoldi Method (IRAM) | More memory-efficient |

(each successive row is an improvement)

| Method | Time | Space |
|---|---|---|
| IRAM | (O(m³)+(O(nm)+O(e))×O(m−k))×(# restarts) | O(e)+O(nm) |
| PIC | O(e)×(# iterations) | O(e) |

Randomized sampling methods are also popular
![Page 83: Scalable Methods for Graph-Based Unsupervised and Semi-Supervised Learning Frank Lin Language Technologies Institute School of Computer Science Carnegie](https://reader030.vdocuments.us/reader030/viewer/2022032800/56649d2c5503460f94a02700/html5/thumbnails/83.jpg)
83
PIC: Related Clustering Work

• Spectral Clustering (Roxborough & Sen 1997, Shi & Malik 2000, Meila & Shi 2001, Ng et al. 2002)
• Kernel k-Means (Dhillon et al. 2007)
• Modularity Clustering (Newman 2006)
• Matrix Powering
  – Markovian relaxation & the information bottleneck method (Tishby & Slonim 2000)
  – matrix powering (Zhou & Woodruff 2004)
  – diffusion maps (Lafon & Lee 2006)
  – Gaussian blurring mean-shift (Carreira-Perpinan 2006)
• Mean-Shift Clustering
  – mean-shift (Fukunaga & Hostetler 1975, Cheng 1995, Comaniciu & Meer 2002)
  – Gaussian blurring mean-shift (Carreira-Perpinan 2006)
![Page 84: Scalable Methods for Graph-Based Unsupervised and Semi-Supervised Learning Frank Lin Language Technologies Institute School of Computer Science Carnegie](https://reader030.vdocuments.us/reader030/viewer/2022032800/56649d2c5503460f94a02700/html5/thumbnails/84.jpg)
84
PICwPF: Results

Each point represents the accuracy result from a dataset. Lower triangle: k-means wins; upper triangle: PIC wins.
![Page 85: Scalable Methods for Graph-Based Unsupervised and Semi-Supervised Learning Frank Lin Language Technologies Institute School of Computer Science Carnegie](https://reader030.vdocuments.us/reader030/viewer/2022032800/56649d2c5503460f94a02700/html5/thumbnails/85.jpg)
85
PICwPF: Results

The two methods have almost the same behavior; overall, neither method is statistically significantly better than the other.
![Page 86: Scalable Methods for Graph-Based Unsupervised and Semi-Supervised Learning Frank Lin Language Technologies Institute School of Computer Science Carnegie](https://reader030.vdocuments.us/reader030/viewer/2022032800/56649d2c5503460f94a02700/html5/thumbnails/86.jpg)
86
PICwPF: Results
Not sure why NCUTiram did not work as well as NCUTevd.

Lesson: approximate eigencomputation methods may require expertise to work well.
87
PICwPF: Results
• PIC is O(n) per iteration and the runtime curve looks linear…
• But I don’t like eyeballing curves, and perhaps the number of iterations increases with the size or difficulty of the dataset?

Correlation plot; correlation statistic (0 = none, 1 = correlated)
![Page 88: Scalable Methods for Graph-Based Unsupervised and Semi-Supervised Learning Frank Lin Language Technologies Institute School of Computer Science Carnegie](https://reader030.vdocuments.us/reader030/viewer/2022032800/56649d2c5503460f94a02700/html5/thumbnails/88.jpg)
88
PICwPF: Results
• Linear run-time implies a constant number of iterations.
• Number of iterations to “acceleration-convergence” is hard to analyze:
  – Faster than a single complete run of power iteration to convergence.
  – On our datasets:
    • 10-20 iterations is typical
    • 30-35 is exceptional
![Page 89: Scalable Methods for Graph-Based Unsupervised and Semi-Supervised Learning Frank Lin Language Technologies Institute School of Computer Science Carnegie](https://reader030.vdocuments.us/reader030/viewer/2022032800/56649d2c5503460f94a02700/html5/thumbnails/89.jpg)
89
PICwPF: Related Work
• Faster spectral clustering
  – Approximate eigendecomposition (Lanczos, IRAM)
  – Sampled eigendecomposition (Nyström)
• Sparser matrix
  – Sparse construction
    • k-nearest-neighbor graph
    • k-matching
  – graph sampling / reduction

These are not O(n) time or O(n) space methods, and still require O(n²) construction in general.
![Page 90: Scalable Methods for Graph-Based Unsupervised and Semi-Supervised Learning Frank Lin Language Technologies Institute School of Computer Science Carnegie](https://reader030.vdocuments.us/reader030/viewer/2022032800/56649d2c5503460f94a02700/html5/thumbnails/90.jpg)
90
PICwPF: Results
![Page 91: Scalable Methods for Graph-Based Unsupervised and Semi-Supervised Learning Frank Lin Language Technologies Institute School of Computer Science Carnegie](https://reader030.vdocuments.us/reader030/viewer/2022032800/56649d2c5503460f94a02700/html5/thumbnails/91.jpg)
MRW: RWR for Classification
We refer to this method as MultiRankWalk: it classifies data with multiple rankings using random walks
![Page 92: Scalable Methods for Graph-Based Unsupervised and Semi-Supervised Learning Frank Lin Language Technologies Institute School of Computer Science Carnegie](https://reader030.vdocuments.us/reader030/viewer/2022032800/56649d2c5503460f94a02700/html5/thumbnails/92.jpg)
MRW: Seed Preference
• Obtaining labels for data points is expensive
• We want to minimize the cost of obtaining labels
• Observations:
  – Some labels are inherently more useful than others
  – Some labels are easier to obtain than others

Question: “Authoritative” or “popular” nodes in a network are typically easier to obtain labels for. But are these labels also more useful than others?
![Page 93: Scalable Methods for Graph-Based Unsupervised and Semi-Supervised Learning Frank Lin Language Technologies Institute School of Computer Science Carnegie](https://reader030.vdocuments.us/reader030/viewer/2022032800/56649d2c5503460f94a02700/html5/thumbnails/93.jpg)
Seed Preference
• Consider the task of giving a human expert (or posting jobs on Amazon Mechanical Turk) a list of data points to label
• The list (seeds) can be generated uniformly at random, or we can have a seed preference, according to simple properties of the unlabeled data
• We consider 3 preferences:
  – Random
  – Link Count (nodes with the highest counts make the list)
  – PageRank (nodes with the highest scores make the list)
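The three preferences above can be sketched in a few lines; this is my own toy code, and the damping factor 0.85 and iteration count for PageRank are illustrative assumptions rather than the thesis's settings.

```python
import numpy as np

def seed_candidates(A, k, preference="link_count"):
    # Propose the k nodes to hand to a labeler, under one of the
    # three seed preferences: random, link count (degree), PageRank.
    n = A.shape[0]
    if preference == "random":
        return np.random.default_rng().permutation(n)[:k]
    if preference == "link_count":
        score = A.sum(axis=1)                     # degree
    else:                                         # "pagerank"
        W = A / A.sum(axis=0, keepdims=True)      # column-stochastic
        score = np.ones(n) / n
        for _ in range(50):                       # plain PageRank, d = 0.85
            score = 0.85 * (W @ score) + 0.15 / n
    return np.argsort(-score)[:k]                 # highest scores make the list

# Toy star graph: the hub (node 0) tops both non-random preferences.
A = np.zeros((5, 5))
A[0, 1:] = 1
A[1:, 0] = 1
```

Unlike active learning, this is a one-shot batch ranking: the whole list is generated before any labels come back.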
![Page 94: Scalable Methods for Graph-Based Unsupervised and Semi-Supervised Learning Frank Lin Language Technologies Institute School of Computer Science Carnegie](https://reader030.vdocuments.us/reader030/viewer/2022032800/56649d2c5503460f94a02700/html5/thumbnails/94.jpg)
MRW: The Question

• What really makes MRW and wvRN different?
• Network-based SSL often boils down to label propagation.
• MRW and wvRN represent two general propagation methods; note that they are called by many names:

| MRW | wvRN |
|---|---|
| Random walk with restart | Reverse random walk |
| Regularized random walk | Random walk with sink nodes |
| Personalized PageRank | Hitting time |
| Local & global consistency | Harmonic functions on graphs |
| | Iterative averaging of neighbors |

Great… but we still don’t know why they behave differently on these network datasets!
![Page 95: Scalable Methods for Graph-Based Unsupervised and Semi-Supervised Learning Frank Lin Language Technologies Institute School of Computer Science Carnegie](https://reader030.vdocuments.us/reader030/viewer/2022032800/56649d2c5503460f94a02700/html5/thumbnails/95.jpg)
MRW: The Question
• It’s difficult to answer exactly why MRW does better with a smaller number of seeds.
• But we can gather probable factors from their propagation models:
| | MRW | wvRN |
|---|---|---|
| 1 | Centrality-sensitive | Centrality-insensitive |
| 2 | Exponential drop-off / damping factor | No drop-off / damping |
| 3 | Propagation of different classes done independently | Propagation of different classes interact |
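The contrast can be made concrete with a toy side-by-side of the two update rules (my own example: a 4-node path graph with one seed per class; the damping 0.85 and iteration counts are illustrative assumptions, and the wvRN step here is the standard clamped neighbor-averaging, not code from the thesis).

```python
import numpy as np

# Path graph 0-1-2-3; node 0 seeded as class 0, node 3 as class 1.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
n = A.shape[0]
seeds = {0: 0, 3: 1}                       # node -> class

# MRW-style: one damped random walk with restart per class (classes
# propagate independently; scores drop off exponentially from seeds).
mrw = np.zeros((2, n))
W = A / A.sum(axis=0, keepdims=True)       # column-stochastic
for cls in (0, 1):
    u = np.zeros(n)
    for node, c in seeds.items():
        if c == cls:
            u[node] = 1.0
    r = u.copy()
    for _ in range(200):
        r = 0.85 * (W @ r) + 0.15 * u
    mrw[cls] = r

# wvRN-style: iterative averaging of neighbors with clamped seeds
# (no damping; the classes compete for probability mass).
wvrn = np.full((2, n), 0.5)
Wrow = A / A.sum(axis=1, keepdims=True)    # row-stochastic
for _ in range(200):
    wvrn = wvrn @ Wrow.T                   # each node averages its neighbors
    for node, c in seeds.items():          # re-clamp the seed labels
        wvrn[:, node] = 0.0
        wvrn[c, node] = 1.0

labels_mrw = mrw.argmax(axis=0)
labels_wvrn = wvrn.argmax(axis=0)
```

On this symmetric toy both methods agree on the labels; the differences in the table show up in the score profiles (seed scores, drop-off with distance), which matter at small seed counts.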
![Page 96: Scalable Methods for Graph-Based Unsupervised and Semi-Supervised Learning Frank Lin Language Technologies Institute School of Computer Science Carnegie](https://reader030.vdocuments.us/reader030/viewer/2022032800/56649d2c5503460f94a02700/html5/thumbnails/96.jpg)
MRW: The Question
• An example from a political blog dataset – MRW vs. wvRN scores for how much a blog is politically conservative:
1.000 neoconservatives.blogspot.com; 1.000 strangedoctrines.typepad.com; 1.000 jmbzine.com; 0.593 presidentboxer.blogspot.com; 0.585 rooksrant.com; 0.568 purplestates.blogspot.com; 0.553 ikilledcheguevara.blogspot.com; 0.540 restoreamerica.blogspot.com; 0.539 billrice.org; 0.529 kalblog.com; 0.517 right-thinking.com; 0.517 tom-hanna.org; 0.514 crankylittleblog.blogspot.com; 0.510 hasidicgentile.org; 0.509 stealthebandwagon.blogspot.com; 0.509 carpetblogger.com; 0.497 politicalvicesquad.blogspot.com; 0.496 nerepublican.blogspot.com; 0.494 centinel.blogspot.com; 0.494 scrawlville.com; 0.493 allspinzone.blogspot.com; 0.492 littlegreenfootballs.com; 0.492 wehavesomeplanes.blogspot.com; 0.491 rittenhouse.blogspot.com; 0.490 secureliberty.org; 0.488 decision08.blogspot.com; 0.488 larsonreport.com

0.020 firstdownpolitics.com; 0.019 neoconservatives.blogspot.com; 0.017 jmbzine.com; 0.017 strangedoctrines.typepad.com; 0.013 millers_time.typepad.com; 0.011 decision08.blogspot.com; 0.010 gopandcollege.blogspot.com; 0.010 charlineandjamie.com; 0.008 marksteyn.com; 0.007 blackmanforbush.blogspot.com; 0.007 reggiescorner.blogspot.com; 0.007 fearfulsymmetry.blogspot.com; 0.006 quibbles-n-bits.com; 0.006 undercaffeinated.com; 0.005 samizdata.net; 0.005 pennywit.com; 0.005 pajamahadin.com; 0.005 mixtersmix.blogspot.com; 0.005 stillfighting.blogspot.com; 0.005 shakespearessister.blogspot.com; 0.005 jadbury.com; 0.005 thefulcrum.blogspot.com; 0.005 watchandwait.blogspot.com; 0.005 gindy.blogspot.com; 0.005 cecile.squarespace.com; 0.005 usliberals.about.com; 0.005 twentyfirstcenturyrepublican.blogspot.com
Seed labels underlined
1. Centrality-sensitive: seeds have different scores, and not necessarily the highest.

2. Exponential drop-off: much less sure about nodes further away from seeds.

3. Classes propagate independently: charlineandjamie.com is both very likely a conservative and a liberal blog (good or bad?)
We still don’t really understand it yet.
![Page 97: Scalable Methods for Graph-Based Unsupervised and Semi-Supervised Learning Frank Lin Language Technologies Institute School of Computer Science Carnegie](https://reader030.vdocuments.us/reader030/viewer/2022032800/56649d2c5503460f94a02700/html5/thumbnails/97.jpg)
MRW: Related Work
• MRW is very much related to:
  – “Local and global consistency” (Zhou et al. 2004): similar formulation, different view
  – “Web content categorization using link information” (Gyongyi et al. 2006): RWR ranking as features to SVM
  – “Graph-based semi-supervised learning as a generative model” (He et al. 2007): random walk without restart, heuristic stopping
• Seed preference is related to the field of active learning
  – Active learning chooses which data point to label next based on previous labels; the labeling is interactive
  – Seed preference is a batch labeling method
  – Authoritative seed preference is a good baseline for active learning on network data!