
Link prediction

Leonid E. Zhukov

School of Data Analysis and Artificial Intelligence
Department of Computer Science

National Research University Higher School of Economics

Structural Analysis and Visualization of Networks

Leonid E. Zhukov (HSE) Lecture 16 21.05.2016 1 / 23


Lecture outline

1 Link prediction problem

2 Proximity measures

3 Prediction by supervised learning

4 Other methods


Link prediction

Link prediction. A network is changing over time. Given a snapshot of a network at time t, predict edges added in the interval (t, t′)

Link completion (missing links identification). Given a network, infer links that are consistent with the structure, but missing (find unobserved edges)

Link reliability. Estimate the reliability of given links in the graph.

Predictions: link existence, link weight, link type


Link prediction

Graph G(V,E)

Number of "missing edges": $|V|(|V| - 1)/2 - |E|$

In sparse graphs $|E| \ll |V|^2$, so the probability of a correct random guess is $O(1/|V|^2)$
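
To get a feel for the scale, the number of candidate pairs can be evaluated directly. This is only an illustrative sketch: the function name `missing_edge_count` and the graph sizes are assumptions, not values from the lecture.

```python
def missing_edge_count(n_nodes: int, n_edges: int) -> int:
    """Unordered node pairs without an edge in a simple undirected graph."""
    return n_nodes * (n_nodes - 1) // 2 - n_edges

# Illustrative sizes (assumed): a sparse graph with 1000 nodes and 5000 edges.
n_nodes, n_edges = 1000, 5000
candidates = missing_edge_count(n_nodes, n_edges)
print(candidates)      # 494500
print(1 / candidates)  # chance that one uniformly random guess hits one specific pair
```

Even for this small graph there are nearly half a million candidate pairs, which is why random guessing is a very weak baseline.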


Scoring algorithm

Link prediction by proximity scoring

1 For each pair of nodes compute proximity (similarity) score c(v1, v2)

2 Sort all pairs by decreasing score

3 Select top n pairs (or above some threshold) as new links
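
The three steps above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the graph is stored as a dict mapping each node to its set of neighbors, pairs that are already linked are skipped, and the names `predict_links` and `common_neighbors` as well as the toy graph are hypothetical.

```python
from itertools import combinations

def common_neighbors(adj, u, v):
    """Example proximity score: number of shared neighbors."""
    return len(adj[u] & adj[v])

def predict_links(adj, score, n):
    """Score every non-adjacent pair, sort by decreasing score, keep the top n."""
    candidates = [
        (u, v) for u, v in combinations(sorted(adj), 2) if v not in adj[u]
    ]
    ranked = sorted(candidates, key=lambda p: score(adj, *p), reverse=True)
    return ranked[:n]

# Toy undirected graph (illustrative only).
adj = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b", "d"},
    "d": {"b", "c", "e"},
    "e": {"d"},
}
print(predict_links(adj, common_neighbors, 2))  # [('a', 'd'), ('b', 'e')]
```

Any other scoring function with the same signature can be passed in place of `common_neighbors`.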


Scoring functions

Local neighborhood of $v_i$ and $v_j$

Number of common neighbors:

$|N(v_i) \cap N(v_j)|$

Jaccard's coefficient:

$\frac{|N(v_i) \cap N(v_j)|}{|N(v_i) \cup N(v_j)|}$

Adamic/Adar:

$\sum_{v \in N(v_i) \cap N(v_j)} \frac{1}{\log |N(v)|}$

Liben-Nowell and Kleinberg, 2003
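
The three local scores translate almost literally into set operations. The sketch below assumes a dict-of-sets graph representation; the function names and the toy graph are illustrative and not from the lecture.

```python
import math

def common_neighbors(adj, u, v):
    return len(adj[u] & adj[v])

def jaccard(adj, u, v):
    union = adj[u] | adj[v]
    return len(adj[u] & adj[v]) / len(union) if union else 0.0

def adamic_adar(adj, u, v):
    # A common neighbor w is adjacent to both u and v, so |N(w)| >= 2
    # and log|N(w)| > 0 whenever u != v.
    return sum(1.0 / math.log(len(adj[w])) for w in adj[u] & adj[v])

# Toy undirected graph (illustrative only).
adj = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b", "d"},
    "d": {"b", "c", "e"},
    "e": {"d"},
}
print(common_neighbors(adj, "a", "d"))       # 2
print(round(jaccard(adj, "a", "d"), 4))      # 0.6667
print(round(adamic_adar(adj, "a", "d"), 4))  # 1.8205
```

Adamic/Adar down-weights common neighbors with high degree, so a shared hub contributes less than a shared low-degree node.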


Scoring functions

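Paths and ensembles of paths between $v_i$ and $v_j$

Shortest path:

$-\min_s \{\text{paths}^{(s)}_{ij} > 0\}$

Katz score:

$\sum_{l=1}^{\infty} \beta^l |\text{paths}^{(l)}(v_i, v_j)| = \sum_{l=1}^{\infty} (\beta A)^l_{ij} = \left[(I - \beta A)^{-1} - I\right]_{ij}$

Personalized (rooted) PageRank:

$PR = \alpha (D^{-1}A)^T PR + (1 - \alpha)\,\mathbf{1}$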

Liben-Nowell and Kleinberg, 2003
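
A small numerical sketch may help connect the Katz series to its closed form and show one way to iterate the rooted PageRank equation. Assumptions made here, not taken from the slide: NumPy is available, the toy adjacency matrix and the values of `beta` and `alpha` are illustrative, the restart vector is the indicator of the root node, and the names `katz_scores` and `rooted_pagerank` are hypothetical.

```python
import numpy as np

# Toy undirected graph (illustrative); nodes 0..4 with edges
# 0-1, 0-2, 1-2, 1-3, 2-3, 3-4.
A = np.array([
    [0, 1, 1, 0, 0],
    [1, 0, 1, 1, 0],
    [1, 1, 0, 1, 0],
    [0, 1, 1, 0, 1],
    [0, 0, 0, 1, 0],
], dtype=float)

def katz_scores(A, beta):
    """Closed form (I - beta*A)^{-1} - I; valid when beta < 1 / spectral radius of A."""
    I = np.eye(A.shape[0])
    return np.linalg.inv(I - beta * A) - I

def rooted_pagerank(A, root, alpha=0.85, iters=200):
    """Power iteration for PR = alpha * (D^{-1} A)^T PR + (1 - alpha) * e_root.

    Assumes every node has at least one neighbor, so D is invertible.
    """
    P = A / A.sum(axis=1, keepdims=True)  # row-stochastic D^{-1} A
    e = np.zeros(A.shape[0])
    e[root] = 1.0                         # assumed restart vector: the root only
    pr = e.copy()
    for _ in range(iters):
        pr = alpha * (P.T @ pr) + (1 - alpha) * e
    return pr

beta = 0.1  # assumed value; the maximum degree is 3, so beta < 1/3 ensures convergence
K = katz_scores(A, beta)
# The closed form should agree with the truncated power series.
series = sum(np.linalg.matrix_power(beta * A, l) for l in range(1, 60))
pr = rooted_pagerank(A, root=0)
print(np.allclose(K, series))
print(np.round(pr, 4))
```

Row `K[i]` gives Katz scores from node `i` to all other nodes, and `pr` gives proximity of every node to the chosen root; either can be used as the score in the ranking step.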


Scoring functions

Expected number of random walk steps:
– hitting time: $-H_{ij}$
– commute time: $-(H_{ij} + H_{ji})$
– normalized hitting/commute time: $-(H_{ij}\pi_j + H_{ji}\pi_i)$

SimRank:

$\text{SimRank}(v_i, v_j) = \frac{C}{|N(v_i)| \cdot |N(v_j)|} \sum_{m \in N(v_i)} \sum_{n \in N(v_j)} \text{SimRank}(m, n)$

Liben-Nowell and Kleinberg, 2003
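
Because SimRank is defined recursively, it is usually computed by fixed-point iteration. The sketch below is one naive way to do that, under assumptions not stated on the slide: the base case SimRank(v, v) = 1, a decay constant C = 0.8, a fixed number of iterations, and a dict-of-sets graph. The function name `simrank` and the toy graph are hypothetical.

```python
def simrank(adj, C=0.8, iters=10):
    """Naive iterative SimRank on an undirected graph given as dict of sets."""
    nodes = sorted(adj)
    # Start from the assumed base case: a node is fully similar to itself.
    sim = {(u, v): 1.0 if u == v else 0.0 for u in nodes for v in nodes}
    for _ in range(iters):
        new = {}
        for u in nodes:
            for v in nodes:
                if u == v:
                    new[(u, v)] = 1.0
                elif adj[u] and adj[v]:
                    total = sum(sim[(m, n)] for m in adj[u] for n in adj[v])
                    new[(u, v)] = C * total / (len(adj[u]) * len(adj[v]))
                else:
                    new[(u, v)] = 0.0  # no neighbors, nothing to average
        sim = new
    return sim

# Toy undirected graph (illustrative only).
adj = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b", "d"},
    "d": {"b", "c", "e"},
    "e": {"d"},
}
sim = simrank(adj)
print(round(sim[("a", "d")], 4), round(sim[("a", "e")], 4))
```

Each iteration touches every pair of nodes and every pair of their neighbors, so this version is only practical for small graphs.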


Vertex feature aggregations

Preferential attachment:

$k_i \cdot k_j = |N(v_i)| \cdot |N(v_j)|$

or

$k_i + k_j = |N(v_i)| + |N(v_j)|$

Clustering coefficient:

$CC(v_i) \cdot CC(v_j)$

or

$CC(v_i) + CC(v_j)$
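
These aggregations combine a per-node feature of the two endpoints. A minimal sketch, assuming a dict-of-sets graph and the usual local clustering coefficient (fraction of neighbor pairs that are themselves linked, taken as 0 for nodes with fewer than two neighbors); the function names and the toy graph are illustrative.

```python
from itertools import combinations

def clustering(adj, v):
    """Local clustering coefficient of v; assumed 0 when v has < 2 neighbors."""
    k = len(adj[v])
    if k < 2:
        return 0.0
    links = sum(1 for a, b in combinations(adj[v], 2) if b in adj[a])
    return 2.0 * links / (k * (k - 1))

def preferential_attachment(adj, u, v, combine="product"):
    ku, kv = len(adj[u]), len(adj[v])
    return ku * kv if combine == "product" else ku + kv

def clustering_score(adj, u, v, combine="product"):
    cu, cv = clustering(adj, u), clustering(adj, v)
    return cu * cv if combine == "product" else cu + cv

# Toy undirected graph (illustrative only).
adj = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b", "d"},
    "d": {"b", "c", "e"},
    "e": {"d"},
}
print(preferential_attachment(adj, "a", "d"))         # 6
print(preferential_attachment(adj, "a", "d", "sum"))  # 5
print(round(clustering_score(adj, "a", "d"), 4))      # 0.3333
```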


Low-rank approximations

Low-rank approximation (truncated SVD)

$A \approx \sum_k U_k S_k V_k^T$
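
One way to use the truncated SVD for link prediction is to read the entries of the rank-k reconstruction as scores for the corresponding node pairs. The sketch below assumes NumPy is available; the helper name `low_rank_scores`, the toy adjacency matrix, and the choice k = 2 are illustrative assumptions.

```python
import numpy as np

def low_rank_scores(A, k):
    """Rank-k reconstruction of A from its k largest singular triplets."""
    U, S, Vt = np.linalg.svd(A)
    return U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]

# Toy undirected graph (illustrative); nodes 0..4 with edges
# 0-1, 0-2, 1-2, 1-3, 2-3, 3-4.
A = np.array([
    [0, 1, 1, 0, 0],
    [1, 0, 1, 1, 0],
    [1, 1, 0, 1, 0],
    [0, 1, 1, 0, 1],
    [0, 0, 0, 1, 0],
], dtype=float)

A2 = low_rank_scores(A, k=2)
# Entries at positions where A is 0 can be read as scores for missing links.
print(np.round(A2, 2))
```

Keeping all singular values reproduces A exactly, so the smoothing that makes this useful comes entirely from discarding the smaller ones.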


Link prediction

Liben-Nowell and Kleinberg, 2007




Binary classification

Challenging classification problem:

Computational cost of evaluating a very large number of possible edges (quadratic in the number of nodes)

Highly imbalanced class distribution: the number of positive examples (existing edges) grows linearly and the number of negative examples quadratically with the number of nodes


Prediction difficulty

Actual and possible collaborations between DBLP authors

Extreme class imbalance

from Rattigan and Jensen, 2005


Link prediction with supervised learning

Supervised learning:

1 Feature generation

2 Model training

3 Testing (model application)

Features:

Topological proximity features

Aggregated features

Content based node proximity features
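
The feature generation step can be sketched as building one feature vector and one label per candidate pair. Everything here is an assumption for illustration: the choice of three topological features, the dict-of-sets graph, the names `pair_features` and `build_dataset`, and the toy "future" edge. The resulting `X` and `y` could then be passed to any binary classifier for the training and testing steps.

```python
from itertools import combinations

def pair_features(adj, u, v):
    """Topological features: common neighbors, Jaccard, preferential attachment."""
    common = adj[u] & adj[v]
    union = adj[u] | adj[v]
    jaccard = len(common) / len(union) if union else 0.0
    return [len(common), jaccard, len(adj[u]) * len(adj[v])]

def build_dataset(adj, future_edges):
    """One row per currently unlinked pair; label 1 if the pair links later."""
    future = {frozenset(e) for e in future_edges}
    pairs, X, y = [], [], []
    for u, v in combinations(sorted(adj), 2):
        if v in adj[u]:
            continue  # already linked, not a candidate
        pairs.append((u, v))
        X.append(pair_features(adj, u, v))
        y.append(1 if frozenset((u, v)) in future else 0)
    return pairs, X, y

# Toy snapshot at time t and one edge assumed to appear later (illustrative).
adj = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b", "d"},
    "d": {"b", "c", "e"},
    "e": {"d"},
}
pairs, X, y = build_dataset(adj, future_edges=[("a", "d")])
print(pairs)  # [('a', 'd'), ('a', 'e'), ('b', 'e'), ('c', 'e')]
print(y)      # [1, 0, 0, 0]
```

Even in this tiny example negatives outnumber positives, which mirrors the class imbalance that makes the problem hard on real networks.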


Features

Discriminative abilities of features

from Rattigan and Jensen, 2005


Simple evaluation

Simple "hold out set" evaluation
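
The slide's figure did not survive extraction, so the following is only a sketch of the usual hold-out procedure, assumed rather than taken from the lecture: hide a random fraction of the observed edges, compute scores on the remaining graph, and check how many hidden edges are recovered. The function name `hold_out_split`, the 20% fraction, and the seed are hypothetical choices.

```python
import random

def hold_out_split(edges, fraction=0.2, seed=0):
    """Split undirected edges into (training edges, held-out edges)."""
    rng = random.Random(seed)
    # Normalize each edge so (u, v) and (v, u) are treated as the same edge.
    edges = sorted({tuple(sorted(e)) for e in edges})
    rng.shuffle(edges)
    n_test = max(1, int(len(edges) * fraction))
    return edges[n_test:], edges[:n_test]

# Toy edge list (illustrative only).
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("b", "d"), ("c", "d"), ("d", "e")]
train, held_out = hold_out_split(edges)
print(len(train), len(held_out))  # 5 1
```

Scores are then computed from `train` only, and `held_out` plays the role of the positive class during evaluation.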


Training and testing

Evaluation for evolving networks

image from Y. Yang et al., 2014


Evaluation metrics

Precision and Recall, F-measure

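$\text{Precision} = \frac{TP}{TP + FP}, \quad \text{Recall} = \frac{TP}{TP + FN}$

$F = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$

True positive rate (TPR), False positive rate (FPR), ROC curve, AUC

$TPR = \frac{TP}{TP + FN}, \quad FPR = \frac{FP}{FP + TN}$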
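
The metrics above can be computed directly from sets of node pairs. In this sketch `predicted` and `actual` are assumed to be subsets of `candidates` (the pairs that were scored), ratios with a zero denominator are reported as 0, and AUC is computed as the probability that a true link is scored above a non-link, with ties counted as one half. The function names and toy values are illustrative.

```python
def classification_metrics(predicted, actual, candidates):
    tp = len(predicted & actual)
    fp = len(predicted - actual)
    fn = len(actual - predicted)
    tn = len(candidates) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0  # recall equals TPR
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return {"precision": precision, "recall": recall, "f": f,
            "tpr": recall, "fpr": fpr}

def auc(scores, actual):
    """Pairwise AUC; needs at least one positive and one negative pair."""
    pos = [s for p, s in scores.items() if p in actual]
    neg = [s for p, s in scores.items() if p not in actual]
    wins = sum(1.0 if sp > sn else 0.5 if sp == sn else 0.0
               for sp in pos for sn in neg)
    return wins / (len(pos) * len(neg))

# Toy scores for four candidate pairs (illustrative only).
scores = {("a", "d"): 2, ("b", "e"): 1, ("c", "e"): 1, ("a", "e"): 0}
actual = {("a", "d")}
predicted = {("a", "d"), ("b", "e")}  # top-2 pairs by score
m = classification_metrics(predicted, actual, set(scores))
print(m["precision"], m["recall"], round(m["f"], 4), round(m["fpr"], 4))
# 0.5 1.0 0.6667 0.3333
print(auc(scores, actual))  # 1.0
```

Precision, recall, and F depend on where the ranking is cut, whereas AUC summarizes the whole ranking, which is one reason both kinds of metric are reported for link prediction.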


Performance of classification algorithms

BIOBASE database (research publications)

DBLP dataset (research publications)

from M. Al Hasan, 2010


ROC curves

from Lichtenwalter, 2010


Probabilistic models

Local model, Markov random fields [Wang, 2007]

Hierarchical probabilistic model [Clauset, 2008]

Probabilistic relational models:
– Bayesian networks [Getoor, 2002]
– relational Markov networks [Taskar, 2003, 2007]


References

D. Liben-Nowell and J. Kleinberg. The link prediction problem for social networks. Journal of the American Society for Information Science and Technology, 58(7):1019–1031, 2007

R. Lichtenwalter, J. Lussier, and N. Chawla. New perspectives and methods in link prediction. KDD '10: Proceedings of the 16th ACM SIGKDD, 2010

M. Al Hasan, V. Chaoji, S. Salem, and M. Zaki. Link prediction using supervised learning. Proceedings of SDM workshop on link analysis, 2006

M. Rattigan and D. Jensen. The case for anomalous link discovery. ACM SIGKDD Explorations Newsletter, 7(2):41–47, 2005

M. Al Hasan and M. Zaki. A survey of link prediction in social networks. In Social Networks Data Analytics, Ed. C. Aggarwal, 2011.
