Lise Getoor, "Link Mining"
TRANSCRIPT
Link Mining
Lise Getoor, University of Maryland, College Park
August 22, 2012
Alternate Title…
What Machine Learning/Statistics/Data Mining can do for YOU!
What are some common machine learning algorithms?
Supervised Learning:
1. Predict future values
2. Fill in missing values
3. Identify anomalies
Unsupervised Learning:
4. Find patterns
5. Identify clusters
So, what’s Link Mining??? Machine learning when you have graphs (or networks)
Nodes are entities: People, Places, Organizations, Text
Links are relationships: Friends, MemberOf, LivesIn, Tweeted, Posted
e.g., heterogeneous multi-relational data, multimodal data …
Ex: Social Media Relationships
User-User: Friends, Collaborators, Family, Fan/Follower, Replies, Co-Edits, Co-Mentions, etc.
User-Doc: Comments, Edits, etc.
User-Query-Click: (User, Query, URL)
User-Tag-Doc: (User, Tag, Doc)
Link Mining Tasks: Node Labeling, Link Prediction, Entity Resolution, Group Detection
Node Labeling
What is Harry's political persuasion?
Harry
Natasha
Link Prediction
Friends?
Entity Resolution. Aka: deduplication, co-reference resolution, record linkage, reference consolidation, etc.
Abstract Problem Statement: Real World vs. Digital World (Records / Mentions)
Deduplication Problem Statement: Cluster the records/mentions that correspond to the same entity.
Intensional Variant: Compute a cluster representative.
Record Linkage Problem Statement: Link records that match across databases A and B.
Reference Matching Problem: Match noisy records to clean records in a reference table.
InfoVis Co-Author Network Fragment (before and after entity resolution)
Group Detection
Link Mining Algorithms: Node Labeling, Link Prediction, Entity Resolution, Group Detection
Link Mining Algorithms: Node Labeling
1. Relational Classifiers
2. Collective Classifiers
Relational Classifiers
Given: a graph of entities with attributes and relations [figure: small example graph]
Task: Predict the attribute of some of the entities
Alternate task: Predict the existence of a relationship between entities
Features used:
- local features: attributes of the entity itself
- relational features: same-attribute-value, number of neighbors, avg value of neighbors, number of shared neighbors, participation in a relation
Relational Classifiers
- Values are represented as a fixed-length feature vector
- Instances are treated independently of each other
- Relational features are computed by aggregating over related entities
- Any classification or regression model can be used for learning and prediction
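A minimal sketch (not from the talk) of the aggregation described above: each entity gets a fixed-length vector of local features plus relational features aggregated over its neighbors. The toy graph, the "age" and "city" attributes, and the function name are all illustrative assumptions.

```python
def relational_features(graph, attrs, node):
    """Local attributes plus aggregates over related entities."""
    neighbors = graph.get(node, [])
    n = len(neighbors)
    avg_age = sum(attrs[v]["age"] for v in neighbors) / n if n else 0.0
    same_city = sum(attrs[v]["city"] == attrs[node]["city"] for v in neighbors)
    #            local feature   relational features (fixed-length)
    return [attrs[node]["age"], n, avg_age, same_city]

graph = {"a": ["b", "c"], "b": ["a"], "c": ["a"]}
attrs = {"a": {"age": 30, "city": "DC"},
         "b": {"age": 40, "city": "DC"},
         "c": {"age": 20, "city": "NY"}}
print(relational_features(graph, attrs, "a"))  # [30, 2, 30.0, 1]
```

Because the vector has a fixed length regardless of how many neighbors an entity has, any off-the-shelf classifier or regressor can consume it, as the slide notes.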
Application Case Studies
Two example applications that use relational classifiers; the focus is on the types of relational features used.
Case Study 1: Predicting click-through rate of search result ads
Case Study 2: Predicting friendships in a social network
Case Study 1: Predicting Ad Click-Through Rate
Task: Predict the click-through rate (CTR) of an online ad, given that it is seen by the user, where the ad is described by: the URL to which the user is sent when clicking on the ad, the bid terms used to determine when to display the ad, and the title and text of the ad.
Our description is based on the approach by [Richardson et al., WWW07].
Relational Features Used
[figure: ads Ad1–Ad6 linked to bid terms BT1–BT6; features are the average CTR and count over related ads]
- contains-bid-term
- related-bid-term (containing subsets or supersets of the term; according to search engine)
- queried-bid-term
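An illustrative sketch (not the actual Richardson et al. pipeline): one relational feature for a new ad is the average CTR, and the count, of existing ads that share a bid term with it. The ads, terms, and CTR values are made-up assumptions.

```python
ads = {  # ad id -> (bid terms, observed CTR)
    "ad1": ({"camera", "lens"}, 0.04),
    "ad2": ({"camera"}, 0.02),
    "ad3": ({"tripod"}, 0.01),
}

def related_ctr(new_terms, ads):
    """Average CTR and count over ads sharing at least one bid term."""
    related = [ctr for terms, ctr in ads.values() if terms & new_terms]
    count = len(related)
    avg = sum(related) / count if count else 0.0
    return avg, count

avg, count = related_ctr({"camera"}, ads)
print(count)  # 2 ads (ad1, ad2) contain the bid term; avg CTR is ~0.03
```

The same aggregation pattern extends to the related-term and queried-term variants by swapping in a different notion of "related".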
Case Study 2: Predicting Friendships
Task: Predict new friendships among users, based on their descriptive attributes, their existing friendships, and their family ties.
Our description is based on the approach by [Zheleva et al., SNAKDD08].
Relational Features Used
"Petworks": social networks of pets
[figure: pets P1–P11 with friendship and family links; candidate pair "Friends?"]
Features: count, density, proportion; same-breed; in-family; Jaccard coefficient
Key Idea: Feature Construction
Feature informativeness is key to the success of a relational classifier.
Features can be: attributes of the entity/entities; a match predicate on attributes of entities; attributes of related entities; structural features; features based on overlap in sets.
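A sketch of the set-overlap features above: the Jaccard coefficient of two users' neighbor (friend) sets. The user ids are hypothetical.

```python
def jaccard(a, b):
    """|A & B| / |A | B|: overlap between two neighbor sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

friends = {"u1": {"u2", "u3", "u4"}, "u5": {"u3", "u4", "u6"}}
# shared neighbors {u3, u4}; union {u2, u3, u4, u6} -> 2/4
print(jaccard(friends["u1"], friends["u5"]))  # 0.5
```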
Link Mining Algorithms: Node Labeling
1. Relational Classifiers
2. Collective Classifiers
Collective Classification
[Neville & Jensen, SRL00; Lu & Getoor, ICML03; Sen et al., AI Mag08]
Extends relational classifiers by allowing relational features to be functions of predicted attributes/relations of neighbors.
At training time, these features are computed based on observed values in the training set.
At inference time, the algorithm iterates, computing relational features based on the current prediction for any unobserved attributes.
In the first, bootstrap, iteration, only local features are used.
CC: Learning
[figure: fully labeled training network P1–P10]
Learn models (local and relational) from a fully labeled training set.
CC: Inference (1)
[figure: unlabeled network P1–P5]
Step 1: Bootstrap using entity attributes only.
CC: Inference (2)
[figure: network P1–P5 with updated labels]
Step 2: Iteratively update the category of each entity, based on related entities' categories.
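A minimal sketch of the two inference steps above, in the spirit of iterative classification: bootstrap labels from local evidence, then repeatedly relabel each unobserved node from its neighbors' current labels. The tiny graph, the default label "B", and majority vote as the stand-in "relational classifier" are all illustrative assumptions.

```python
from collections import Counter

graph = {"p1": ["p2", "p3"], "p2": ["p1", "p3"],
         "p3": ["p1", "p2", "p4"], "p4": ["p3"]}
observed = {"p1": "A", "p2": "A"}   # labels known at inference time

# Step 1: bootstrap using entity attributes only (here: a default guess).
labels = {v: observed.get(v, "B") for v in graph}

# Step 2: iteratively update each unobserved entity's category from its
# related entities' current categories (incremental updates, fixed rounds).
for _ in range(5):
    for v in graph:
        if v not in observed:
            votes = Counter(labels[u] for u in graph[v])
            labels[v] = votes.most_common(1)[0][0]

print(labels)  # the "A" evidence propagates: p3 and p4 both end up "A"
```

Note how p4, which has no observed neighbor, still gets the right label once p3 has been updated; that propagation through predicted labels is exactly what distinguishes collective from plain relational classification.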
CC Key Idea
Rather than make predictions independently, begin with a relational classifier, and then 'propagate' classifications.
Variations: propagate probabilities rather than the mode (related to Gibbs sampling); batch vs. incremental updates; ordering strategies.
Active area of research: active learning, semi-supervised learning, more principled joint probabilistic models, etc.
Link Mining Algorithms: Node Labeling, Link Prediction, Entity Resolution, Group Detection
The Entity Resolution Problem
[figure: entities John Smith, James Smith, Jonathan Smith and their noisy mentions: "John Smith", "Jim Smith", "James Smith", "J Smith", "Jon Smith", "J Smith", "Jonthan Smith"]
Issues:
1. Identification
2. Disambiguation
Relational Identification
Very similar names; added evidence from shared co-authors.
Relational Disambiguation
Very similar names, but no shared collaborators.
Collective Entity Resolution
One resolution provides evidence for another => joint resolution.
P1: "JOSTLE: Partitioning of Unstructured Meshes for Massively Parallel Machines", C. Walshaw, M. Cross, M. G. Everett, S. Johnson
P2: "Partitioning & Mapping of Unstructured Meshes to Parallel Machine Topologies", C. Walshaw, M. Cross, M. G. Everett, S. Johnson, K. McManus
P3: "Dynamic Mesh Partitioning: A Unified Optimisation and Load-Balancing Algorithm", C. Walshaw, M. Cross, M. G. Everett
P4: "Code Generation for Machines with Multiregister Operations", Alfred V. Aho, Stephen C. Johnson, Jeffrey D. Ullman
P5: "Deterministic Parsing of Ambiguous Grammars", A. Aho, S. Johnson, J. Ullman
P6: "Compilers: Principles, Techniques, and Tools", A. Aho, R. Sethi, J. Ullman
Relational Clustering (RC-ER)
[figure, stepwise merging of the author references:]
P1: C. Walshaw, M. Cross, M. G. Everett, S. Johnson
P2: K. McManus, C. Walshaw, M. Cross, M. Everett, S. Johnson
P4: Alfred V. Aho, Stephen C. Johnson, Jeffrey D. Ullman
P5: A. Aho, S. Johnson, J. Ullman
Cut-based Formulation of RC-ER
[figure: two candidate clusterings of the Johnson / Aho / Everett references]
Left: good separation of attributes, but many cluster-cluster relationships (Aho-Johnson1, Aho-Johnson2, Everett-Johnson1, Everett-Johnson2).
Right: worse in terms of attributes, but fewer cluster-cluster relationships (Aho-Johnson1, Everett-Johnson2).
Objective Function
Minimize: Σ_{i,j} [ w_A · sim_A(c_i, c_j) + w_R · sim_R(c_i, c_j) ]
where w_A is the weight for attributes, w_R the weight for relations, sim_A(c_i, c_j) the similarity of attributes, and sim_R(c_i, c_j) the similarity based on relational edges between c_i and c_j.
Greedy clustering algorithm: merge the cluster pair with the maximum reduction in the objective function, scored by
sim(c_i, c_j) = w_A · sim_A(c_i, c_j) + w_R · |N(c_i) ∩ N(c_j)|
where sim_A(c_i, c_j) is the similarity of attributes and |N(c_i) ∩ N(c_j)| is the common cluster neighborhood.
Relational Clustering Algorithm
1. Find similar references using 'blocking'
2. Bootstrap clusters using attributes and relations
3. Compute similarities for cluster pairs and insert into a priority queue
4. Repeat until the priority queue is empty:
5.   Find the 'closest' cluster pair
6.   Stop if similarity is below threshold
7.   Merge to create a new cluster
8.   Update similarity for 'related' clusters
O(n k log n) algorithm with an efficient implementation
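A toy sketch of the greedy loop above, under stated simplifications: it keeps per-reference neighbor sets fixed and skips step 8 (updating similarities of related clusters after a merge). The scoring follows sim(c_i, c_j) = w_A·sim_A + w_R·|shared neighbors|; the references, weights, and threshold are invented.

```python
import heapq

def greedy_cluster(refs, sim_attr, neighbors, w_a=0.5, w_r=0.5, thresh=0.6):
    clusters = {r: {r} for r in refs}          # start with singletons

    def score(a, b):
        return w_a * sim_attr(a, b) + w_r * len(neighbors[a] & neighbors[b])

    # priority queue of candidate pairs (max-heap via negated scores)
    heap = [(-score(a, b), a, b)
            for i, a in enumerate(refs) for b in refs[i + 1:]]
    heapq.heapify(heap)

    while heap:
        neg, a, b = heapq.heappop(heap)
        if -neg < thresh:                      # closest pair below threshold
            break
        if a in clusters and b in clusters:    # skip stale (merged) entries
            clusters[a] |= clusters.pop(b)     # merge to create new cluster
    return list(clusters.values())

refs = ["r1", "r2", "r3"]
neigh = {"r1": {"x"}, "r2": {"x"}, "r3": {"y"}}
def sim_attr(a, b):
    return 1.0 if {a, b} == {"r1", "r2"} else 0.0

print(greedy_cluster(refs, sim_attr, neigh))  # r1 and r2 merge; r3 alone
```

The real algorithm's O(n k log n) bound comes from the pieces this sketch hand-waves: blocking to limit candidate pairs and incremental similarity updates after each merge.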
Evaluation Datasets
CiteSeer: 1,504 citations to machine learning papers (Lawrence et al.); 2,892 references to 1,165 author entities
arXiv: 29,555 publications from High Energy Physics (KDD Cup '03); 58,515 references to 9,200 authors
Elsevier BioBase: 156,156 Biology papers (IBM KDD Challenge '05); 831,991 author references; keywords, topic classifications, language, country and affiliation of corresponding author, etc.
Baselines
A: Pair-wise duplicate decisions with attributes only (names: Soft-TFIDF with Levenshtein, Jaro, Jaro-Winkler; other textual attributes: TF-IDF)
A*: Transitive closure over A
A+N: Add attribute similarity of co-occurring references
A+N*: Transitive closure over A+N
We evaluate pair-wise decisions over references using the F1 measure (harmonic mean of precision and recall).
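A sketch of the pairwise evaluation above: every pair of references is a yes/no "same entity" decision, and F1 is the harmonic mean of precision and recall over those decisions. The clusterings here are illustrative.

```python
from itertools import combinations

def pairs(clustering):
    """Set of co-clustered reference pairs."""
    return {frozenset(p) for c in clustering
            for p in combinations(sorted(c), 2)}

def pairwise_f1(predicted, truth):
    pred, gold = pairs(predicted), pairs(truth)
    if not pred or not gold:
        return 0.0
    precision = len(pred & gold) / len(pred)
    recall = len(pred & gold) / len(gold)
    return 2 * precision * recall / (precision + recall)

truth = [{"r1", "r2", "r3"}, {"r4"}]
predicted = [{"r1", "r2"}, {"r3"}, {"r4"}]
print(pairwise_f1(predicted, truth))  # precision 1.0, recall 1/3 -> F1 0.5
```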
ER over Entire Dataset

Algorithm | CiteSeer | arXiv | BioBase
A         | 0.980    | 0.976 | 0.568
A*        | 0.990    | 0.971 | 0.559
A+N       | 0.973    | 0.938 | 0.710
A+N*      | 0.984    | 0.934 | 0.753
RC-ER     | 0.995    | 0.985 | 0.818

RC-ER outperforms the baselines on all datasets; collective resolution is better than naive relational resolution.
CiteSeer: near-perfect resolution; 22% error reduction. arXiv: 6,500 additional correct resolutions; 20% error reduction. BioBase: biggest improvement over the baselines.
Flipside….
Privacy breaches in OSNs
Identity disclosure: a mapping from a record to a specific individual ("Who is ___?")
Attribute disclosure: finding an attribute value that the user intended to keep private ("Is ___ liberal?")
Social link disclosure: participation in a sensitive relationship or communication ("Friends?")
Affiliation link disclosure: participation in a group revealing a sensitive attribute value ("Supports gay marriage")
Other Linqs Projects
- Key Opinion Leader Identification
- Active Surveying in Social Networks
- Ontology Alignment and Folksonomy Construction
- Label Acquisition & Active Learning in Network Data
- Inference & Search in Camera Networks
- Identifying Roles in Social Networks
- Group Recommendation in Social Networks
- Social Search
- Analysis of Dynamic Networks: loyalty, stability, diversity
- Ranking and Retrieval in Biological Networks
- Discourse-level Sentiment Analysis
- Bilingual Word Sense Disambiguation
- Visual Analytics: D-Dupe, C-Group, G-Pare
- Others …
http://www.cs.umd.edu/linqs
Conclusion
Link mining algorithms can be useful tools for social media.
We need algorithms that can handle the multi-modal, multi-relational, temporal nature of social media.
Collective algorithms make use of structure to define features and propagate information, which allows us to improve overall accuracy.
While there are important pitfalls to take into account (confidence and privacy), there are many potential benefits and payoffs (improved personalization and context-aware predictions!)
http://www.cs.umd.edu/linqs
Work sponsored by the National Science Foundation, Maryland Industrial Partners (MIPS), National Geospatial Agency, Air Force Research Laboratory, DARPA, Google, Microsoft, and Yahoo!