link prediction: a perspective

LINK PREDICTION: A PERSPECTIVE

Nitesh ChawlaStudents: Ryan Lichtenwalter, Jake LussierDepartment of computer Science and Engineering,iCeNSA, University of Notre Dame

“Ideas that connect”

Emergence of network structure• Certain nodes are more likely to develop a connection

between them• Certain connections more or less likely to dissolve over

time• What are the local and global phenomena that might

impact this process? What is the predictability in this process?

At the level of a whole network, link formation or decay may indicate changes in community structure

From an ego-centric viewpoint, it may indicate volatility, changes in peer groups

The link prediction problem• Prediction

• Given a graph G = (V,E) at time t, predict the set of edges EL’ ∉ Ethat appear in t’ and EL’c∉ E that do not appear in t’ based on either topological relationships between pairs Vi and Vj, attributes of the vertices Vi and Vj, or both of these.

Number of applications in social networks

• Security: Probable relationships in malicious networks

• Social: efficiently and effectively find companions, colleagues

• Anomalous link discovery: discover surprising links, given sparse networks

A perspective• Network types

• Different size, attributes

• Predictors• Unsupervised and supervised

• Network and predictors characteristics• Highly imbalanced• Saturation effects• Geodesic distance and temporality

• Evaluation• Skewed representations

Data sets

Common Unsupervised Predictors• Degree

• in• out

• Neighbors• Jaccard’s Coefficient• Adamic/Adar

• Path• PropFlow• Rooted PageRank• Katz• Hitting Time• Commute Time

s s t t

s

t1

t2

s

t1

t2

PropFlow• Represents the probability of

information transfer from i to jbased on random transmission along all paths using edge weights.

Supervised Learning

F1 > t1

F2 < t2 F3 > t3

Link Forms Link Doesn’t Form

Typical Prediction Model

Actual Negative

Actual Positive

Predict Negative

TN FN

Predict Positive

FP TP

Positives

Challenge: High Imbalance

The class imbalance ratio for link prediction in a sparse network islower-bounded by the number of vertices in the network.

Challenge: Network Saturation

Calls SMS

DBLP-Cite DBLP-Collab

Challenge: Evaluation

• Extreme sparsity• If I predict no new links, I will be very accurate

• Use ROC curves• Threshold agnostic

Challenge: Local vs Global Phenomena

Supervised learning: High Performance Link Predictor: HPLP

Feature vector (ego/node level, dyad level, and triad level)

• Sampling• Bagged Hellinger

Distance Decision Trees

Classifier

Available as part of comprehensive software suite:

LpMade

DBLP-Collab

DBLP-Cite

Patents-Cite

With increasing geodesic distance

Our method, HPLP (column G), outperforms other state-of-the-art methods.

Summary• Links form according to a competition between local and

global influences, and the dominant factor depends upon the network

• Our proposed supervised learning method, HPLP, outperforms unsupervised methods by as much as 30% AUROC

• Resulting data sets present a whole new world of imbalanced data challenges

• Combine the node and edge attributes with topology

Acknowledgements• Research sponsored by:

• Army Research Laboratory (ARL) Cooperative Agreement Number W911NF-09-2-0053

• National Science Foundation (NSF) Grant BCS-0826958

For papers and software, please visit

http://www.nd.edu/~dialhttp://icensa.nd.edu

Email: [email protected]

link prediction: a perspective

Documents