Lise Getoor, "Link Mining"
TRANSCRIPT
Link Mining
Lise Getoor, University of Maryland, College Park
August 22, 2012
Alternate Title…
What Machine Learning/Statistics/Data Mining can do for YOU!
What are some common machine learning algorithms?
Supervised Learning:
1. Predict future values
2. Fill in missing values
3. Identify anomalies
Unsupervised Learning:
4. Find patterns
5. Identify clusters
So, what’s Link Mining??? Machine learning when you have graphs (or networks)
Nodes are entities: People, Places, Organizations, Text
Links are relationships: Friends, MemberOf, LivesIn, Tweeted, Posted
e.g., heterogeneous multi-relational data, multimodal data …
Ex: Social Media Relationships
User-User: Friends, Collaborators, Family, Fan/Follower, Replies, Co-Edits, Co-Mentions, etc.
User-Doc: Comments, Edits, etc.
User-Query-Click: (User, Query, URL)
User-Tag-Doc: (User, Tag, Doc)
Link Mining Tasks: Node Labeling, Link Prediction, Entity Resolution, Group Detection
Node Labeling
What is Harry's political persuasion?
Harry
Natasha
Link Prediction
Friends?
Entity Resolution. Aka: deduplication, co-reference resolution, record linkage, reference consolidation, etc.
Abstract Problem Statement: Real World vs. Digital World (Records / Mentions)
Deduplication Problem Statement: Cluster the records/mentions that correspond to the same entity.
Intensional Variant: Compute a cluster representative.
Record Linkage Problem Statement: Link records that match across databases A and B.
Reference Matching Problem: Match noisy records to clean records in a reference table.
InfoVis Co-Author Network Fragment (before and after entity resolution)
Group Detection
Link Mining Algorithms: Node Labeling, Link Prediction, Entity Resolution, Group Detection
Link Mining Algorithms: Node Labeling
1. Relational Classifiers
2. Collective Classifiers
Relational Classifiers
Given: a graph of entities with attributes and relations [figure: small example graph]
Task: Predict the attribute of some of the entities
Alternate task: Predict the existence of a relationship between entities
Features used:
- local features: attributes of the entity itself
- relational features: same-attribute-value, number of neighbors, avg value of neighbors, number of shared neighbors, participation in a relation
Relational Classifiers
- Values are represented as a fixed-length feature vector
- Instances are treated independently of each other
- Relational features are computed by aggregating over related entities
- Any classification or regression model can be used for learning and prediction
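A minimal sketch (not from the talk) of the aggregation described above: each entity gets a fixed-length vector of local features plus relational features aggregated over its neighbors. The toy graph, the "age" and "city" attributes, and the function name are all illustrative assumptions.

```python
def relational_features(graph, attrs, node):
    """Local attributes plus aggregates over related entities."""
    neighbors = graph.get(node, [])
    n = len(neighbors)
    avg_age = sum(attrs[v]["age"] for v in neighbors) / n if n else 0.0
    same_city = sum(attrs[v]["city"] == attrs[node]["city"] for v in neighbors)
    #            local feature   relational features (fixed-length)
    return [attrs[node]["age"], n, avg_age, same_city]

graph = {"a": ["b", "c"], "b": ["a"], "c": ["a"]}
attrs = {"a": {"age": 30, "city": "DC"},
         "b": {"age": 40, "city": "DC"},
         "c": {"age": 20, "city": "NY"}}
print(relational_features(graph, attrs, "a"))  # [30, 2, 30.0, 1]
```

Because the vector has a fixed length regardless of how many neighbors an entity has, any off-the-shelf classifier or regressor can consume it, as the slide notes.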
Application Case Studies
Two example applications that use relational classifiers; the focus is on the types of relational features used.
Case Study 1: Predicting click-through rate of search result ads
Case Study 2: Predicting friendships in a social network
Case Study 1: Predicting Ad Click-Through Rate
Task: Predict the click-through rate (CTR) of an online ad, given that it is seen by the user, where the ad is described by: the URL to which the user is sent when clicking on the ad, the bid terms used to determine when to display the ad, and the title and text of the ad.
Our description is based on the approach by [Richardson et al., WWW07].
Relational Features Used
[figure: ads Ad1–Ad6 linked to bid terms BT1–BT6; features are the average CTR and count over related ads]
- contains-bid-term
- related-bid-term (containing subsets or supersets of the term; according to search engine)
- queried-bid-term
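An illustrative sketch (not the actual Richardson et al. pipeline): one relational feature for a new ad is the average CTR, and the count, of existing ads that share a bid term with it. The ads, terms, and CTR values are made-up assumptions.

```python
ads = {  # ad id -> (bid terms, observed CTR)
    "ad1": ({"camera", "lens"}, 0.04),
    "ad2": ({"camera"}, 0.02),
    "ad3": ({"tripod"}, 0.01),
}

def related_ctr(new_terms, ads):
    """Average CTR and count over ads sharing at least one bid term."""
    related = [ctr for terms, ctr in ads.values() if terms & new_terms]
    count = len(related)
    avg = sum(related) / count if count else 0.0
    return avg, count

avg, count = related_ctr({"camera"}, ads)
print(count)  # 2 ads (ad1, ad2) contain the bid term; avg CTR is ~0.03
```

The same aggregation pattern extends to the related-term and queried-term variants by swapping in a different notion of "related".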
Case Study 2: Predicting Friendships
Task: Predict new friendships among users, based on their descriptive attributes, their existing friendships, and their family ties.
Our description is based on the approach by [Zheleva et al., SNAKDD08].
Relational Features Used
"Petworks": social networks of pets
[figure: pets P1–P11 with friendship and family links; candidate pair "Friends?"]
Features: count, density, proportion; same-breed; in-family; Jaccard coefficient
Key Idea: Feature Construction
Feature informativeness is key to the success of a relational classifier.
Features can be: attributes of the entity/entities; a match predicate on attributes of entities; attributes of related entities; structural features; features based on overlap in sets.
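A sketch of the set-overlap features above: the Jaccard coefficient of two users' neighbor (friend) sets. The user ids are hypothetical.

```python
def jaccard(a, b):
    """|A & B| / |A | B|: overlap between two neighbor sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

friends = {"u1": {"u2", "u3", "u4"}, "u5": {"u3", "u4", "u6"}}
# shared neighbors {u3, u4}; union {u2, u3, u4, u6} -> 2/4
print(jaccard(friends["u1"], friends["u5"]))  # 0.5
```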
Link Mining Algorithms: Node Labeling
1. Relational Classifiers
2. Collective Classifiers
Collective Classification
[Neville & Jensen, SRL00; Lu & Getoor, ICML03; Sen et al., AI Mag08]
Extends relational classifiers by allowing relational features to be functions of predicted attributes/relations of neighbors.
At training time, these features are computed based on observed values in the training set.
At inference time, the algorithm iterates, computing relational features based on the current prediction for any unobserved attributes.
In the first, bootstrap, iteration, only local features are used.
CC: Learning
[figure: fully labeled training network P1–P10]
Learn models (local and relational) from a fully labeled training set.
CC: Inference (1)
[figure: unlabeled network P1–P5]
Step 1: Bootstrap using entity attributes only.
CC: Inference (2)
[figure: network P1–P5 with updated labels]
Step 2: Iteratively update the category of each entity, based on related entities' categories.
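A minimal sketch of the two inference steps above, in the spirit of iterative classification: bootstrap labels from local evidence, then repeatedly relabel each unobserved node from its neighbors' current labels. The tiny graph, the default label "B", and majority vote as the stand-in "relational classifier" are all illustrative assumptions.

```python
from collections import Counter

graph = {"p1": ["p2", "p3"], "p2": ["p1", "p3"],
         "p3": ["p1", "p2", "p4"], "p4": ["p3"]}
observed = {"p1": "A", "p2": "A"}   # labels known at inference time

# Step 1: bootstrap using entity attributes only (here: a default guess).
labels = {v: observed.get(v, "B") for v in graph}

# Step 2: iteratively update each unobserved entity's category from its
# related entities' current categories (incremental updates, fixed rounds).
for _ in range(5):
    for v in graph:
        if v not in observed:
            votes = Counter(labels[u] for u in graph[v])
            labels[v] = votes.most_common(1)[0][0]

print(labels)  # the "A" evidence propagates: p3 and p4 both end up "A"
```

Note how p4, which has no observed neighbor, still gets the right label once p3 has been updated; that propagation through predicted labels is exactly what distinguishes collective from plain relational classification.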
CC Key Idea
Rather than make predictions independently, begin with a relational classifier, and then 'propagate' classifications.
Variations: propagate probabilities rather than the mode (related to Gibbs sampling); batch vs. incremental updates; ordering strategies.
Active area of research: active learning, semi-supervised learning, more principled joint probabilistic models, etc.
Link Mining Algorithms: Node Labeling, Link Prediction, Entity Resolution, Group Detection
The Entity Resolution Problem
[figure: entities John Smith, James Smith, Jonathan Smith and their noisy mentions: "John Smith", "Jim Smith", "James Smith", "J Smith", "Jon Smith", "J Smith", "Jonthan Smith"]
Issues:
1. Identification
2. Disambiguation
Relational Identification
Very similar names; added evidence from shared co-authors.
Relational Disambiguation
Very similar names, but no shared collaborators.
Collective Entity Resolution
One resolution provides evidence for another => joint resolution.
P1: "JOSTLE: Partitioning of Unstructured Meshes for Massively Parallel Machines", C. Walshaw, M. Cross, M. G. Everett, S. Johnson
P2: "Partitioning & Mapping of Unstructured Meshes to Parallel Machine Topologies", C. Walshaw, M. Cross, M. G. Everett, S. Johnson, K. McManus
P3: "Dynamic Mesh Partitioning: A Unified Optimisation and Load-Balancing Algorithm", C. Walshaw, M. Cross, M. G. Everett
P4: "Code Generation for Machines with Multiregister Operations", Alfred V. Aho, Stephen C. Johnson, Jeffrey D. Ullman
P5: "Deterministic Parsing of Ambiguous Grammars", A. Aho, S. Johnson, J. Ullman
P6: "Compilers: Principles, Techniques, and Tools", A. Aho, R. Sethi, J. Ullman
Relational Clustering (RC-ER)
[figure, stepwise merging of the author references:]
P1: C. Walshaw, M. Cross, M. G. Everett, S. Johnson
P2: K. McManus, C. Walshaw, M. Cross, M. Everett, S. Johnson
P4: Alfred V. Aho, Stephen C. Johnson, Jeffrey D. Ullman
P5: A. Aho, S. Johnson, J. Ullman
Cut-based Formulation of RC-ER
[figure: two candidate clusterings of the Johnson / Aho / Everett references]
Left: good separation of attributes, but many cluster-cluster relationships (Aho-Johnson1, Aho-Johnson2, Everett-Johnson1, Everett-Johnson2).
Right: worse in terms of attributes, but fewer cluster-cluster relationships (Aho-Johnson1, Everett-Johnson2).
Objective Function
Minimize: Σ_{i,j} [ w_A · sim_A(c_i, c_j) + w_R · sim_R(c_i, c_j) ]
where w_A is the weight for attributes, w_R the weight for relations, sim_A(c_i, c_j) the similarity of attributes, and sim_R(c_i, c_j) the similarity based on relational edges between c_i and c_j.
Greedy clustering algorithm: merge the cluster pair with the maximum reduction in the objective function, scored by
sim(c_i, c_j) = w_A · sim_A(c_i, c_j) + w_R · |N(c_i) ∩ N(c_j)|
where sim_A(c_i, c_j) is the similarity of attributes and |N(c_i) ∩ N(c_j)| is the common cluster neighborhood.
Relational Clustering Algorithm
1. Find similar references using 'blocking'
2. Bootstrap clusters using attributes and relations
3. Compute similarities for cluster pairs and insert into a priority queue
4. Repeat until the priority queue is empty:
5.   Find the 'closest' cluster pair
6.   Stop if similarity is below threshold
7.   Merge to create a new cluster
8.   Update similarity for 'related' clusters
O(n k log n) algorithm with an efficient implementation
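A toy sketch of the greedy loop above, under stated simplifications: it keeps per-reference neighbor sets fixed and skips step 8 (updating similarities of related clusters after a merge). The scoring follows sim(c_i, c_j) = w_A·sim_A + w_R·|shared neighbors|; the references, weights, and threshold are invented.

```python
import heapq

def greedy_cluster(refs, sim_attr, neighbors, w_a=0.5, w_r=0.5, thresh=0.6):
    clusters = {r: {r} for r in refs}          # start with singletons

    def score(a, b):
        return w_a * sim_attr(a, b) + w_r * len(neighbors[a] & neighbors[b])

    # priority queue of candidate pairs (max-heap via negated scores)
    heap = [(-score(a, b), a, b)
            for i, a in enumerate(refs) for b in refs[i + 1:]]
    heapq.heapify(heap)

    while heap:
        neg, a, b = heapq.heappop(heap)
        if -neg < thresh:                      # closest pair below threshold
            break
        if a in clusters and b in clusters:    # skip stale (merged) entries
            clusters[a] |= clusters.pop(b)     # merge to create new cluster
    return list(clusters.values())

refs = ["r1", "r2", "r3"]
neigh = {"r1": {"x"}, "r2": {"x"}, "r3": {"y"}}
def sim_attr(a, b):
    return 1.0 if {a, b} == {"r1", "r2"} else 0.0

print(greedy_cluster(refs, sim_attr, neigh))  # r1 and r2 merge; r3 alone
```

The real algorithm's O(n k log n) bound comes from the pieces this sketch hand-waves: blocking to limit candidate pairs and incremental similarity updates after each merge.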
Evaluation Datasets
CiteSeer: 1,504 citations to machine learning papers (Lawrence et al.); 2,892 references to 1,165 author entities
arXiv: 29,555 publications from High Energy Physics (KDD Cup '03); 58,515 references to 9,200 authors
Elsevier BioBase: 156,156 Biology papers (IBM KDD Challenge '05); 831,991 author references; keywords, topic classifications, language, country and affiliation of corresponding author, etc.
Baselines
A: Pair-wise duplicate decisions with attributes only (names: Soft-TFIDF with Levenshtein, Jaro, Jaro-Winkler; other textual attributes: TF-IDF)
A*: Transitive closure over A
A+N: Add attribute similarity of co-occurring references
A+N*: Transitive closure over A+N
We evaluate pair-wise decisions over references using the F1 measure (harmonic mean of precision and recall).
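A sketch of the pairwise evaluation above: every pair of references is a yes/no "same entity" decision, and F1 is the harmonic mean of precision and recall over those decisions. The clusterings here are illustrative.

```python
from itertools import combinations

def pairs(clustering):
    """Set of co-clustered reference pairs."""
    return {frozenset(p) for c in clustering
            for p in combinations(sorted(c), 2)}

def pairwise_f1(predicted, truth):
    pred, gold = pairs(predicted), pairs(truth)
    if not pred or not gold:
        return 0.0
    precision = len(pred & gold) / len(pred)
    recall = len(pred & gold) / len(gold)
    return 2 * precision * recall / (precision + recall)

truth = [{"r1", "r2", "r3"}, {"r4"}]
predicted = [{"r1", "r2"}, {"r3"}, {"r4"}]
print(pairwise_f1(predicted, truth))  # precision 1.0, recall 1/3 -> F1 0.5
```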
ER over Entire Dataset

Algorithm | CiteSeer | arXiv | BioBase
A         | 0.980    | 0.976 | 0.568
A*        | 0.990    | 0.971 | 0.559
A+N       | 0.973    | 0.938 | 0.710
A+N*      | 0.984    | 0.934 | 0.753
RC-ER     | 0.995    | 0.985 | 0.818

RC-ER outperforms the baselines on all datasets; collective resolution is better than naive relational resolution.
CiteSeer: near-perfect resolution; 22% error reduction. arXiv: 6,500 additional correct resolutions; 20% error reduction. BioBase: biggest improvement over the baselines.
Flipside….
Privacy breaches in OSNs
Identity disclosure: a mapping from a record to a specific individual ("Who is ___?")
Attribute disclosure: finding an attribute value that the user intended to keep private ("Is ___ liberal?")
Social link disclosure: participation in a sensitive relationship or communication ("Friends?")
Affiliation link disclosure: participation in a group revealing a sensitive attribute value ("Supports gay marriage")
Other Linqs Projects
- Key Opinion Leader Identification
- Active Surveying in Social Networks
- Ontology Alignment and Folksonomy Construction
- Label Acquisition & Active Learning in Network Data
- Inference & Search in Camera Networks
- Identifying Roles in Social Networks
- Group Recommendation in Social Networks
- Social Search
- Analysis of Dynamic Networks: loyalty, stability, diversity
- Ranking and Retrieval in Biological Networks
- Discourse-level Sentiment Analysis
- Bilingual Word Sense Disambiguation
- Visual Analytics: D-Dupe, C-Group, G-Pare
- Others …
http://www.cs.umd.edu/linqs
Conclusion
Link mining algorithms can be useful tools for social media.
We need algorithms that can handle the multi-modal, multi-relational, temporal nature of social media.
Collective algorithms make use of structure to define features and propagate information, which allows us to improve overall accuracy.
While there are important pitfalls to take into account (confidence and privacy), there are many potential benefits and payoffs (improved personalization and context-aware predictions!)
http://www.cs.umd.edu/linqs
Work sponsored by the National Science Foundation, Maryland Industrial Partners (MIPS), National Geospatial Agency, Air Force Research Laboratory, DARPA, Google, Microsoft, and Yahoo!