leonid e. zhukov - leonid zhukov · evaluation metrics precision and recall, f-measure precision =...
Post on 26-May-2020
3 Views
Preview:
TRANSCRIPT
Link prediction
Leonid E. Zhukov
School of Data Analysis and Artificial IntelligenceDepartment of Computer Science
National Research University Higher School of Economics
Structural Analysis and Visualization of Networks
Leonid E. Zhukov (HSE) Lecture 16 21.05.2016 1 / 23
Lecture outline
1 Link prediction problem
2 Proximity measures
3 Prediction by supervised learning
4 Other methods
Leonid E. Zhukov (HSE) Lecture 16 21.05.2016 2 / 23
Link prediction
Link prediction. A network is changing over time. Given a snapshotof a network at time t, predict edges added in the interval (t, t ′)
Link completion (missing links identification). Given a network, inferlinks that are consistent with the structure, but missing (findunobserved edges)
Link reliability. Estimate the reliability of given links in the graph.
Predictions: link existence, link weight, link type
Leonid E. Zhukov (HSE) Lecture 16 21.05.2016 3 / 23
Link prediction
Graph G(V,E)
Number of ”missing edges”: |V |(|V | − 1)/2− |E |In sparse graphs |E | � |V |2, Prob. of correct random guess O( 1
|V |2 )
Leonid E. Zhukov (HSE) Lecture 16 21.05.2016 4 / 23
Scoring algorithm
Link prediction by proximity scoring
1 For each pair of nodes compute proximity (similarity) score c(v1, v2)
2 Sort all pairs by the decreasing score
3 Select top n pairs (or above some threshold) as new links
Leonid E. Zhukov (HSE) Lecture 16 21.05.2016 5 / 23
Scoring functions
Local neighborhood of vi and vj
Number of common neighbors:
|N (vi ) ∩N (vj)|
Jaccard’s coefficient:|N (vi ) ∩N (vj)||N (vi ) ∪N (vj)|
Adamic/Adar: ∑v∈N (vi )∩N (vj )
1
log |N (v)|
Liben-Nowell and Kleinberg, 2003
Leonid E. Zhukov (HSE) Lecture 16 21.05.2016 6 / 23
Scoring functions
Paths and ensembles of paths between vi and vj
Shortest path:−min
s{pathsij > 0}
Katz score:
∞∑l=1
βl |paths(l)(vi , vj)| =∞∑l=1
(βA)lij = (I − βA)−1 − I
Personalized (rooted) PageRank:
PR = α(D−1A)TPR + (1− α)|
Liben-Nowell and Kleinberg, 2003
Leonid E. Zhukov (HSE) Lecture 16 21.05.2016 7 / 23
Scoring functions
Expected number of random walk steps:– hitting time: −Hij
– commute time −(Hij + Hji )– normalized hitting/commute time (Hijπj + Hjiπi )
SimRank:
SimRank(vi , vj) =C
|N (vi )| · |N (vj)|∑
m∈N (vi )
∑n∈N (vj )
SimRank(m, n)
Liben-Nowell and Kleinberg, 2003
Leonid E. Zhukov (HSE) Lecture 16 21.05.2016 8 / 23
Vertex feature aggregations
Preferential attachment:
ki · kj = |N (vi )| · |N (vj)|
orki + kj = |N (vi )|+ |N (vj)|
Clustering coefficient:CC (vi ) · CC (vj)
orCC (vi ) + CC (vj)
Leonid E. Zhukov (HSE) Lecture 16 21.05.2016 9 / 23
Low-rank approximations
Low-rank approximation (truncated SVD)
A ≈∑k
UkSkVTk
Leonid E. Zhukov (HSE) Lecture 16 21.05.2016 10 / 23
Link prediction
Liben-Nowell and Kleinberg, 2007
Leonid E. Zhukov (HSE) Lecture 16 21.05.2016 11 / 23
Link prediction
Liben-Nowell and Kleinberg, 2007
Leonid E. Zhukov (HSE) Lecture 16 21.05.2016 12 / 23
Binary classification
Challenging classification problem:
Computational cost of evaluating of very large number of possibleedges (quadratic in number of nodes)
Highly imbalanced class distribution: number of positive examples(existing edges) grows linearly and negative quadratically withnumber on nodes
Leonid E. Zhukov (HSE) Lecture 16 21.05.2016 13 / 23
Prediction difficulty
Actual and possible collaborations between DBLP authors
Extreme class imbalance
from Rattigan and Jensen, 2005
Leonid E. Zhukov (HSE) Lecture 16 21.05.2016 14 / 23
Link prediction with supervised learning
Supervised learning:1 Features generation2 Model training3 Testing (model application)
Features:
Topological proximity features
Aggregated features
Content based node proximity features
Leonid E. Zhukov (HSE) Lecture 16 21.05.2016 15 / 23
Features
Discriminative abilities of features
from Rattigan and Jensen, 2005
Leonid E. Zhukov (HSE) Lecture 16 21.05.2016 16 / 23
Simple evaluation
Simple ”hold out set” evaluation
Leonid E. Zhukov (HSE) Lecture 16 21.05.2016 17 / 23
Training and testing
Evaluation for evolving networks
image from Y. Yang et.al, 2014
Leonid E. Zhukov (HSE) Lecture 16 21.05.2016 18 / 23
Evaluation metrics
Precision and Recall, F-measure
Precision =TP
TP + FP, Recall =
TP
TP + FN
F =2 · Precision · RecallPrecision + Recall
True positive rate (TPR), False positive rate (FPR), ROC curve, AUC
TPR =TP
TP + FN, FPR =
FP
FP + TN
Leonid E. Zhukov (HSE) Lecture 16 21.05.2016 19 / 23
Performance of classification algorithms
BIOBASE database (research publications)
DBLP dataset (research publications)
from M. Al Hasan, 2010
Leonid E. Zhukov (HSE) Lecture 16 21.05.2016 20 / 23
ROC curves
from Lichtenwalter, 2010
Leonid E. Zhukov (HSE) Lecture 16 21.05.2016 21 / 23
Probabilistic models
Local model, Markov random fields [Wang, 2007]
Hierarchical probabilistic model [Clauset, 2008]
Probabilistic relations models:– Bayesian networks [Getoor, 2002]– relational Markov networks [Tasker, 2003, 2007]
Leonid E. Zhukov (HSE) Lecture 16 21.05.2016 22 / 23
References
D. Liben-Nowell and J. Kleinberg. The link prediction problem forsocial networks. Journal of the American Society for InformationScience and Technology, 58(7):1019?1031, 2007
R. Lichtenwalter, J.Lussier, and N. Chawla. New perspectives andmethods in link prediction. KDD 10: Proceedings of the 16th ACMSIGKDD, 2010
M. Al Hasan, V. Chaoji, S. Salem, M. Zaki, Link prediction usingsupervised learning. Proceedings of SDM workshop on link analysis,2006
M. Rattigan, D. Jensen. The case for anomalous link discovery. ACMSIGKDD Explorations Newsletter. v 7, n 2, pp 41-47, 2005
M. Al. Hasan, M. Zaki. A survey of link prediction in social networks.In Social Networks Data Analytics, Eds C. Aggarwal, 2011.
Leonid E. Zhukov (HSE) Lecture 16 21.05.2016 23 / 23
top related