web structure mining - irityoann.pitarch/docs/m2stats/... · evaluation metrics 3 - prediction by...
Web Structure Mining
Link Prediction
Outline
1. Link prediction problem
2. Proximity measures
3. Prediction by supervised learning
Link prediction
‣ Link prediction. Given a snapshot of a dynamic network at time t, predict the edges added in the interval (t, t’)
‣ Link completion. Given a network, infer links that are consistent with its structure but missing
‣ Link reliability. Estimate the reliability of given links in the network
‣ What to predict?
  ‣ Link existence
  ‣ Link weight
  ‣ Link type
1 - Link prediction
Link prediction
‣ Number of missing edges = $|V|(|V| - 1)/2 - |E|$
‣ In sparse graphs, $|E| \ll |V|^2$
‣ Probability of a correct random guess: $O(1/|V|^2)$
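The counts above can be checked with a few lines of Python. This is a minimal sketch for a simple undirected graph; the function names and the example sizes (10,000 nodes, 50,000 edges) are illustrative assumptions, not figures from the slides.

```python
def missing_edge_count(num_nodes: int, num_edges: int) -> int:
    """Number of unconnected node pairs in a simple undirected graph."""
    possible_pairs = num_nodes * (num_nodes - 1) // 2
    return possible_pairs - num_edges


def random_guess_success(num_nodes: int, num_edges: int, num_new_edges: int) -> float:
    """Chance that one uniformly drawn unconnected pair is one of the new edges."""
    return num_new_edges / missing_edge_count(num_nodes, num_edges)


# Hypothetical sparse graph: 10,000 nodes and 50,000 edges.
candidates = missing_edge_count(10_000, 50_000)
print(candidates)                                   # number of candidate pairs
print(random_guess_success(10_000, 50_000, 1_000))  # tiny success probability
```

Even with a thousand true new edges, a random guess in this hypothetical graph succeeds with probability of roughly 2 in 100,000, which is why a ranking by proximity score is needed.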
[Figure: example graph with nodes labelled 1 to 8]
Scoring algorithm
‣ Link prediction by proximity scoring
  1. For each pair of nodes, compute a proximity score c(v, v’)
  2. Sort all pairs by decreasing score
  3. Select the top n pairs (or those above some threshold) as new links
‣ Many metrics are summarised in: David Liben-Nowell and Jon Kleinberg. 2007. The link-prediction problem for social networks. J. Am. Soc. Inf. Sci. Technol. 58, 7 (May 2007), 1019-1031.
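The three steps of the scoring algorithm can be sketched as follows. This is an illustration under assumptions: the graph is stored as a dictionary of neighbour sets, only pairs that are not yet linked are scored, and `predict_links`, `common_neighbours` and `toy_graph` are hypothetical names introduced here.

```python
from itertools import combinations


def common_neighbours(adjacency, u, v):
    """Example proximity score c(u, v): number of shared neighbours."""
    return len(adjacency[u] & adjacency[v])


def predict_links(adjacency, score, top_n):
    """Score every unlinked pair, sort by decreasing score, keep the top n."""
    # Step 1: candidate pairs are the node pairs with no edge yet.
    candidates = [
        (u, v)
        for u, v in combinations(sorted(adjacency), 2)
        if v not in adjacency[u]
    ]
    # Step 2: sort by decreasing proximity score.
    ranked = sorted(candidates, key=lambda pair: score(adjacency, *pair), reverse=True)
    # Step 3: the top n pairs are the predicted new links.
    return ranked[:top_n]


# Hypothetical undirected toy graph (node -> set of neighbours).
toy_graph = {
    1: {2, 3},
    2: {1, 3, 4},
    3: {1, 2, 4},
    4: {2, 3, 5},
    5: {4},
}

print(predict_links(toy_graph, common_neighbours, top_n=1))
```

Any other proximity measure with the same `(adjacency, u, v)` signature can be passed as `score` without changing the ranking step.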
Scoring functions
‣ Based on the local neighbourhood of $v_i$ and $v_j$
  ‣ Number of common neighbours: $|N(v_i) \cap N(v_j)|$
  ‣ Jaccard’s coefficient: $\frac{|N(v_i) \cap N(v_j)|}{|N(v_i) \cup N(v_j)|}$
  ‣ Adamic / Adar: $\sum_{v \in N(v_i) \cap N(v_j)} \frac{1}{\log |N(v)|}$
2 - Proximity measures
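The three neighbourhood-based scores on this slide can be written directly from their definitions. A minimal sketch, assuming a simple undirected graph stored as a dictionary of neighbour sets; the function names and `toy_graph` are hypothetical.

```python
import math


def common_neighbours(adj, i, j):
    """|N(i) ∩ N(j)|"""
    return len(adj[i] & adj[j])


def jaccard(adj, i, j):
    """|N(i) ∩ N(j)| / |N(i) ∪ N(j)|, taken as 0 when both nodes are isolated."""
    union = adj[i] | adj[j]
    return len(adj[i] & adj[j]) / len(union) if union else 0.0


def adamic_adar(adj, i, j):
    """Sum of 1 / log|N(v)| over the common neighbours v of i and j."""
    # A common neighbour of two distinct nodes has degree >= 2, so log > 0.
    return sum(1.0 / math.log(len(adj[v])) for v in adj[i] & adj[j])


# Hypothetical undirected toy graph (node -> set of neighbours).
toy_graph = {
    1: {2, 3},
    2: {1, 3, 4},
    3: {1, 2, 4},
    4: {2, 3, 5},
    5: {4},
}

for name, score in [("common", common_neighbours), ("jaccard", jaccard), ("adamic_adar", adamic_adar)]:
    print(name, score(toy_graph, 1, 4))
```

Adamic / Adar weights each shared neighbour by the inverse log of its degree, so a shared low-degree neighbour counts for more than a shared hub.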
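Scoring functions
‣ Based on paths and ensembles of paths between $v_i$ and $v_j$
  ‣ Shortest path: $-\min\{\mathrm{path}_{ij} > 0\}$ (the negated length of the shortest path between $v_i$ and $v_j$)
  ‣ Katz score: $\sum_{l=1}^{\infty} \beta^{l} \, |\mathrm{paths}^{(l)}_{ij}|$
  ‣ Personalized (rooted) PageRank: $PR = \alpha (D^{-1}A)^{T} PR + (1 - \alpha)$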
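Two of the path-based scores can be approximated iteratively. The sketch below makes several assumptions that are not on the slide: the Katz sum is truncated at `max_length` and counts walks (paths that may revisit nodes, which is what adjacency-matrix powers count), and rooted PageRank sends the restart mass $(1 - \alpha)$ to the root node only. Function names, default parameters and the example graph are hypothetical.

```python
def katz_scores(adj, source, beta=0.05, max_length=5):
    """Truncated Katz score from `source` to every node."""
    # walks[v] = number of walks of the current length from source to v
    walks = {v: 0 for v in adj}
    walks[source] = 1
    scores = {v: 0.0 for v in adj}
    for length in range(1, max_length + 1):
        next_walks = {v: 0 for v in adj}
        for u, count in walks.items():
            if count:
                for v in adj[u]:
                    next_walks[v] += count
        walks = next_walks
        for v, count in walks.items():
            scores[v] += beta ** length * count
    return scores


def rooted_pagerank(adj, root, alpha=0.85, iterations=100):
    """Power iteration for a random walk that restarts at `root`."""
    rank = {v: 0.0 for v in adj}
    rank[root] = 1.0
    for _ in range(iterations):
        nxt = {v: 0.0 for v in adj}
        nxt[root] = 1.0 - alpha  # restart mass goes to the root (assumption)
        for u in adj:
            if adj[u]:  # nodes without neighbours would lose their mass here
                share = alpha * rank[u] / len(adj[u])
                for v in adj[u]:
                    nxt[v] += share
        rank = nxt
    return rank


# Hypothetical path graph 1 - 2 - 3.
path_graph = {1: {2}, 2: {1, 3}, 3: {2}}
print(katz_scores(path_graph, source=1, beta=0.5, max_length=2))
print(rooted_pagerank(path_graph, root=1))
```

A small `beta` makes long walks contribute very little, so the truncated sum is usually dominated by its first few terms.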
Scoring functions
‣ Expected number of random walk steps:
  ‣ Hitting time: $-H_{ij}$
  ‣ Commute time: $-(H_{ij} + H_{ji})$
  ‣ Normalized hitting / commute time: $-(H_{ij}\pi_j + H_{ji}\pi_i)$, where $\pi_i$ (resp. $\pi_j$) is the stationary probability of $v_i$ (resp. $v_j$)
‣ SimRank: $\mathrm{SimRank}(v_i, v_j) = \gamma \cdot \frac{\sum_{a \in N_i} \sum_{b \in N_j} \mathrm{SimRank}(a, b)}{|N_i| \cdot |N_j|}$
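The recursive SimRank definition can be evaluated by fixed-point iteration. A minimal sketch, assuming the usual base case SimRank(v, v) = 1 (not shown on the slide), a score of 0 when either node has no neighbours, and a decay factor passed as `gamma`; the function name, the defaults and the star graph are hypothetical.

```python
def simrank(adj, gamma=0.8, iterations=10):
    """Iterate the SimRank recursion over all node pairs."""
    nodes = list(adj)
    # Start from the identity: a node is only similar to itself.
    sim = {(a, b): 1.0 if a == b else 0.0 for a in nodes for b in nodes}
    for _ in range(iterations):
        new_sim = {}
        for a in nodes:
            for b in nodes:
                if a == b:
                    new_sim[(a, b)] = 1.0
                elif not adj[a] or not adj[b]:
                    new_sim[(a, b)] = 0.0
                else:
                    total = sum(sim[(x, y)] for x in adj[a] for y in adj[b])
                    new_sim[(a, b)] = gamma * total / (len(adj[a]) * len(adj[b]))
        sim = new_sim
    return sim


# Hypothetical star graph: centre 0 with leaves 1 and 2.
star = {0: {1, 2}, 1: {0}, 2: {0}}
scores = simrank(star)
print(scores[(1, 2)], scores[(0, 1)])
```

This all-pairs version costs time quadratic in the number of nodes per iteration, multiplied by the neighbourhood sizes, so it is only practical for small graphs.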
Scoring functions
‣ Preferential attachment (2 alternative versions)
  ‣ $k_i \cdot k_j = |N_i| \cdot |N_j|$
  ‣ $k_i + k_j = |N_i| + |N_j|$
‣ Clustering coefficient
  ‣ $CC(v_i) \cdot CC(v_j)$
  ‣ $CC(v_i) + CC(v_j)$
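Both families of scores combine a per-node quantity by product or by sum. The sketch below assumes the standard local clustering coefficient (the fraction of a node's neighbour pairs that are themselves linked), which the slide does not define; the function names, the `combine` parameter and `toy_graph` are hypothetical.

```python
from itertools import combinations


def preferential_attachment(adj, i, j, combine="product"):
    """k_i * k_j or k_i + k_j, where k is the node degree."""
    ki, kj = len(adj[i]), len(adj[j])
    return ki * kj if combine == "product" else ki + kj


def clustering_coefficient(adj, v):
    """Fraction of pairs of neighbours of v that are linked to each other."""
    neighbours = sorted(adj[v])
    k = len(neighbours)
    if k < 2:
        return 0.0  # convention for nodes with fewer than two neighbours
    links = sum(1 for a, b in combinations(neighbours, 2) if b in adj[a])
    return 2.0 * links / (k * (k - 1))


def clustering_score(adj, i, j, combine="product"):
    """CC(i) * CC(j) or CC(i) + CC(j)."""
    ci, cj = clustering_coefficient(adj, i), clustering_coefficient(adj, j)
    return ci * cj if combine == "product" else ci + cj


# Hypothetical undirected toy graph (node -> set of neighbours).
toy_graph = {
    1: {2, 3},
    2: {1, 3, 4},
    3: {1, 2, 4},
    4: {2, 3, 5},
    5: {4},
}

print(preferential_attachment(toy_graph, 1, 4))
print(clustering_score(toy_graph, 1, 4))
```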
Some results
Source: The link-prediction problem for social networks
Take away message
‣ Node-based topological similarity measures (common neighbours, Jaccard, Adamic/Adar, preferential attachment) perform best but do not scale well
‣ Path-based topological similarity measures (Katz, hitting time, rooted PageRank) are to be preferred when dealing with relatively big networks (>10K vertices)
Binary classification
‣ A challenging classification problem:
  ‣ A very large number of possible edges (quadratic in the number of nodes)
  ‣ Highly unbalanced class distribution
    ‣ Positive examples: linear growth with the number of nodes
    ‣ Negative examples: quadratic growth with the number of nodes
3 - Prediction by supervised learning
A very challenging problem
Source: M. Rattigan, D. Jensen. The case for anomalous link discovery. ACM SIGKDD Explorations Newsletter, vol. 7, no. 2, pp. 41-47, 2005.
Link prediction by supervised learning
‣ Supervised learning process
  1. Feature generation
  2. Model training
  3. Testing
‣ Features
  ‣ Topological proximity features
  ‣ Aggregated features
  ‣ Content-based node proximity features
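The feature-generation step can be sketched as turning every unlinked pair of the training graph into a feature vector plus a binary label. This is an illustration only: the three topological features chosen here, the `build_dataset` name, and the toy data are assumptions, and model training is left to whichever binary classifier is preferred.

```python
from itertools import combinations


def build_dataset(train_adj, future_edges):
    """Return (pair, features, label) rows for every unlinked pair."""
    future = {frozenset(edge) for edge in future_edges}
    rows = []
    for u, v in combinations(sorted(train_adj), 2):
        if v in train_adj[u]:
            continue  # already linked in the training graph
        common = train_adj[u] & train_adj[v]
        union = train_adj[u] | train_adj[v]
        features = [
            len(common),                                 # common neighbours
            len(common) / len(union) if union else 0.0,  # Jaccard's coefficient
            len(train_adj[u]) * len(train_adj[v]),       # preferential attachment
        ]
        label = 1 if frozenset((u, v)) in future else 0  # 1 = link appears later
        rows.append(((u, v), features, label))
    return rows


# Hypothetical training graph and later-observed edges.
toy_graph = {
    1: {2, 3},
    2: {1, 3, 4},
    3: {1, 2, 4},
    4: {2, 3, 5},
    5: {4},
}
dataset = build_dataset(toy_graph, future_edges=[(1, 4)])
for pair, features, label in dataset:
    print(pair, features, label)
```

Even in this tiny example only one of the four candidate pairs is positive, which hints at the class imbalance that dominates real networks.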
Evaluation
‣ Simple "hold-out set" evaluation
‣ A more sophisticated evaluation method (e.g. cross-validation) is preferable
[Figure: two copies of the example graph on nodes 1 to 8, labelled "Whole graph" and "Training graph"]
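A hold-out evaluation can be sketched by hiding a random subset of edges from the whole graph and keeping the rest as the training graph. The 20% split, the fixed seed, the function name and the toy graph are assumptions made for illustration.

```python
import random


def hold_out_split(adj, test_fraction=0.2, seed=0):
    """Split the edges of an undirected graph into a training graph and held-out edges."""
    # Each undirected edge is listed once as a sorted tuple.
    edges = sorted({tuple(sorted((u, v))) for u in adj for v in adj[u]})
    rng = random.Random(seed)
    rng.shuffle(edges)
    n_test = max(1, int(len(edges) * test_fraction))
    test_edges = edges[:n_test]
    # The training graph keeps all nodes but only the remaining edges.
    train_adj = {u: set() for u in adj}
    for u, v in edges[n_test:]:
        train_adj[u].add(v)
        train_adj[v].add(u)
    return train_adj, test_edges


# Hypothetical whole graph (node -> set of neighbours), six edges in total.
toy_graph = {
    1: {2, 3},
    2: {1, 3, 4},
    3: {1, 2, 4},
    4: {2, 3, 5},
    5: {4},
}
train_graph, held_out = hold_out_split(toy_graph)
print(held_out)
```

Scores are then computed on `train_graph` only, and a predicted pair counts as correct when it appears in `held_out`. Repeating the split with different seeds gives a rough analogue of cross-validation.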
Evaluation metrics
‣ Precision, recall, F-measure
‣ True positive rate (TPR), false positive rate (FPR), ROC curve, AUC
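$\mathrm{Precision} = \frac{TP}{TP + FP}$, $\mathrm{Recall} = \frac{TP}{TP + FN}$
$F = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$
$TPR = \frac{TP}{TP + FN}$, $FPR = \frac{FP}{FP + TN}$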
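These metrics can be computed from the confusion counts with a few lines of Python. A minimal sketch, assuming binary labels (1 = link, 0 = no link); the AUC is computed through its pairwise interpretation, the probability that a random positive pair is scored above a random negative pair, with ties counted as one half. Function names and the example labels and scores are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, FN, TN) for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def precision_recall_f(tp, fp, fn):
    """Precision, recall and F-measure, with 0 used when a denominator is 0."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


def tpr_fpr(tp, fp, fn, tn):
    """One point of the ROC curve."""
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return tpr, fpr


def auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Hypothetical labels, thresholded predictions and raw scores.
y_true = [1, 0, 1, 0, 0]
y_pred = [1, 1, 0, 0, 0]
y_score = [0.9, 0.8, 0.4, 0.3, 0.1]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
print(precision_recall_f(tp, fp, fn))
print(tpr_fpr(tp, fp, fn, tn))
print(auc(y_true, y_score))
```

Because negatives vastly outnumber positives in link prediction, FPR can look small even when most predicted links are wrong, so precision and recall are worth reporting alongside the ROC curve.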