Extracting Keyphrases to Represent Relations in Social
Networks from Web
Junichiro Mori and Mitsuru Ishizuka, University of Tokyo
Yutaka Matsuo, National Institute of Advanced Industrial Science and Technology
IJCAI-07
Abstract
• The goal is to extract the underlying relations between entities embedded in social networks.
• The algorithm automatically extracts labels that describe the relations among entities.
• The algorithm
– clusters similar entity pairs
– obtains the underlying relations between entities from the clustering results
Introduction
• Social networks for AI and the Semantic Web
– trust estimation
– ontology construction
– end-user ontology
• Building social networks
– automatic extraction of social networks from various sources of information
• Flink: Web pages, e-mail messages, and publications
• Polyphonet [www06]
Introduction
• Explore underlying relations
• Most automatic extraction methods take a superficial approach
• Co-occurrence analysis
• Non-profound assessment
– Flink: provides a clue to the strength of relations
– Polyphonet: defines four kinds of relations
• C5
• Co-Author, Co-Lab, Co-Proj, Co-Conf
Related Work
• Supervised methods
– need large annotated corpora
– require gathering domain-specific knowledge
– require defining the relations to extract a priori
• Ontology population (semantic annotation)
– pattern-based approaches
– context-based approaches
• The Web is highly heterogeneous and unstructured
– In this paper
• context-based approach
• a bag-of-words representation of context [Turney, 2005]
Method - Concept (1/4)
• The social network was extracted according to co-occurrence of entities on the Web.
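The slide does not say which co-occurrence measure was used. As a minimal sketch, assuming a Jaccard coefficient over search hit counts and a fixed threshold (both are illustrative choices, and `single_hits`, `pair_hits`, and `build_network` are hypothetical names), an edge could be added whenever two entities co-occur often enough:

```python
from itertools import combinations


def jaccard(hits_a: int, hits_b: int, hits_ab: int) -> float:
    """Jaccard coefficient computed from hit counts (an assumed measure)."""
    union = hits_a + hits_b - hits_ab
    return hits_ab / union if union > 0 else 0.0


def build_network(single_hits, pair_hits, threshold):
    """Return (entity_a, entity_b, score) edges whose score reaches the threshold.

    single_hits: entity -> number of pages mentioning it (hypothetical data)
    pair_hits:   (entity_a, entity_b) -> number of pages mentioning both
    """
    edges = []
    for a, b in combinations(sorted(single_hits), 2):
        score = jaccard(single_hits[a], single_hits[b], pair_hits.get((a, b), 0))
        if score >= threshold:
            edges.append((a, b, score))
    return edges


# Made-up counts for illustration only.
single_hits = {"A": 100, "B": 80, "C": 50}
pair_hits = {("A", "B"): 60, ("A", "C"): 2}
edges = build_network(single_hits, pair_hits, threshold=0.3)
print(edges)
```

With these made-up counts only the pair (A, B) clears the threshold, so the network has a single edge.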
Method - Concept (2/4)
• Given entity pairs in the social network
– discover relevant keyphrases
• analyze the surrounding local context in which the pair co-occurs on the Web
• keyword extraction
Method - Concept (3/4)
• The keywords are ordered according to TF-IDF-based scoring
Method - Concept (4/4)
• Hypothesis:
– if the local contexts of entity pairs on the Web are similar, the entity pairs share a similar relation.
– [Harris, 1968; Schutze, 1998]: words are similar to the extent that their contextual representations are similar.
• According to that hypothesis
– the method clusters entity pairs according to the similarity of their collective contexts.
– each cluster represents a different relation, and each entity pair in a cluster is an instance of a similar relation.
Method - Procedure
Method - Context Model and Similarity Calculation
• C_{i,j}(n, m) = t_1, ..., t_N
– A context model C_{i,j} of an entity pair (e_i, e_j)
– N terms t_1, ..., t_N that are extracted from the context of the entity pair
– m is the number of intervening terms between e_i and e_j
– n is the number of words to the left and right of either entity
– the feature weight of t_i: TF-IDF
• TF: term frequency of term t_i in the contexts
• IDF: log(|C| / df(t_i)) + 1
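A minimal sketch of the context model and its weighting follows. It makes several assumptions: a co-occurrence counts only when at most m tokens separate the two entities, |C| is read as the number of context models, and df(t) as the number of context models containing t. The function names `extract_context` and `tfidf` are hypothetical.

```python
import math
from collections import Counter


def extract_context(tokens, ei, ej, n, m):
    """Collect terms around co-occurrences of ei and ej in one token list."""
    terms = []
    pos_i = [k for k, t in enumerate(tokens) if t == ei]
    pos_j = [k for k, t in enumerate(tokens) if t == ej]
    for a in pos_i:
        for b in pos_j:
            lo, hi = min(a, b), max(a, b)
            if hi - lo - 1 > m:  # more than m intervening tokens: skip
                continue
            # n tokens left of the first entity through n tokens right of the second
            window = tokens[max(0, lo - n): hi + n + 1]
            terms.extend(t for t in window if t not in (ei, ej))
    return terms


def tfidf(contexts):
    """contexts: entity pair -> list of terms. Returns pair -> {term: weight}."""
    df = Counter()
    for terms in contexts.values():
        df.update(set(terms))  # count each term once per context model
    total = len(contexts)  # assumed meaning of |C|
    weights = {}
    for pair, terms in contexts.items():
        tf = Counter(terms)
        weights[pair] = {t: tf[t] * (math.log(total / df[t]) + 1) for t in tf}
    return weights


tokens = "the mayor Smith was elected in Boston last year".split()
context = extract_context(tokens, "Smith", "Boston", n=1, m=3)

# Toy context models (invented data) to show the weighting.
weights = tfidf({
    ("Smith", "Boston"): ["mayor", "mayor", "elected"],
    ("Lee", "Tokyo"): ["governor", "elected"],
})
ranked = sorted(weights[("Smith", "Boston")], key=weights[("Smith", "Boston")].get, reverse=True)
print(context, ranked)
```

Sorting the terms of one context model by weight gives the ordered keyword list for that entity pair; a term shared by every pair (here "elected") keeps a weight of TF × 1 because of the "+1" in the IDF.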
Method - Clustering and Label Selection
• TF-IDF-based cosine similarity
• Hierarchical agglomerative clustering
– complete linkage
– the similarity between clusters CL_1 and CL_2 is evaluated by considering their two most dissimilar elements
• A cluster CL’s labels l_1, ..., l_n are scored according to term relevancy; an entity pair (e_i, e_j) that belongs to CL can be regarded as holding the relations described by l_1, ..., l_n.
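The clustering step can be sketched in plain Python as below. The slide names neither the stopping rule nor the term-relevancy score, so two assumptions are made here: merging stops at a similarity threshold (`min_similarity`), and labels are ranked by TF-IDF weight summed over the cluster members. All function names are hypothetical.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as {term: weight} dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def complete_linkage(vectors, min_similarity):
    """Agglomerative clustering; cluster similarity = least similar member pair."""
    clusters = [[key] for key in vectors]

    def linkage(c1, c2):
        return min(cosine(vectors[a], vectors[b]) for a in c1 for b in c2)

    while len(clusters) > 1:
        best, bi, bj = -1.0, -1, -1
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = linkage(clusters[i], clusters[j])
                if s > best:
                    best, bi, bj = s, i, j
        if best < min_similarity:  # assumed stopping rule
            break
        clusters[bi] = clusters[bi] + clusters[bj]
        del clusters[bj]
    return clusters


def cluster_labels(cluster, vectors, top_k):
    """Rank candidate labels by summed weight (an assumed relevancy score)."""
    totals = Counter()
    for key in cluster:
        totals.update(vectors[key])
    return [term for term, _ in totals.most_common(top_k)]


# Invented TF-IDF vectors for three entity pairs.
vectors = {
    "p1": {"mayor": 2.0, "elected": 1.0},
    "p2": {"mayor": 1.0, "elected": 1.0},
    "p3": {"born": 2.0},
}
clusters = complete_linkage(vectors, min_similarity=0.5)
labels = cluster_labels(clusters[0], vectors, top_k=1)
print(clusters, labels)
```

Complete linkage takes the minimum pairwise similarity between two clusters, so a merge happens only when every member of one cluster is close to every member of the other; this tends to produce compact clusters.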
Experiment – 1/3
• Test data
– 143 distinct entity pairs from a political social network
• pairs of a politician and a geo-political entity
– 421 entity pairs from a researcher network
• pairs of Japanese AI researchers
• Context model of each entity pair
– 100 Web pages
– noun phrases (NP) and nouns, selected by part of speech (POS)
– stop words excluded
Experiment – 2/3
• Clustering
– complete-linkage agglomerative
• five distinct clusters for the political social network
• twelve distinct clusters for the researcher network
• Two human subjects
– assigned three or fewer possible labels to each pair
– a cluster label
• the most frequent term among the manually assigned relation labels of the entity pairs in the cluster
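The choice of a reference label for each cluster can be sketched as a simple frequency count. The data below is invented, `cluster_label` is a hypothetical name, and ties are broken by first occurrence, which the slide does not specify.

```python
from collections import Counter


def cluster_label(cluster, manual_labels):
    """Most frequent manually assigned label among the pairs in a cluster."""
    counts = Counter(label for pair in cluster for label in manual_labels[pair])
    return counts.most_common(1)[0][0]


# Each pair carries up to three manually assigned labels (toy data).
manual_labels = {
    "p1": ["mayor", "birthplace"],
    "p2": ["mayor"],
    "p3": ["governor"],
}
label = cluster_label(["p1", "p2", "p3"], manual_labels)
print(label)
```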
Experiment – 3/3
Evaluation
• For each cluster cl
– EP_{cl,correct}: the number of entity pairs in cluster cl whose manually assigned relation labels include the label of cluster cl
– EP_{cl,total}: the number of entity pairs in cluster cl
• For each relation l
– EP_{l,correct}: the number of entity pairs with relation label l that belong to a cluster whose label is l
– EP_{l,total}: the number of entity pairs that have relation label l
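The slide gives the counts but not how they are combined. A minimal sketch, assuming precision is the sum of the per-cluster correct counts divided by the sum of the per-cluster totals, and recall is the analogous ratio over relations (the `evaluate` function and all data are hypothetical):

```python
def evaluate(clusters, cluster_labels, manual_labels):
    """Return (precision, recall) under the assumed micro-averaged definitions.

    clusters:       cluster id -> list of entity pairs
    cluster_labels: cluster id -> label of that cluster
    manual_labels:  entity pair -> set of manually assigned relation labels
    """
    # Per-cluster counts: pairs whose manual labels include the cluster label.
    correct_cl = total_cl = 0
    for cid, pairs in clusters.items():
        total_cl += len(pairs)
        correct_cl += sum(1 for p in pairs if cluster_labels[cid] in manual_labels[p])

    # Per-relation counts: pairs with label l that sit in a cluster labelled l.
    pair_cluster = {p: cid for cid, pairs in clusters.items() for p in pairs}
    correct_l = total_l = 0
    all_labels = {l for labels in manual_labels.values() for l in labels}
    for label in all_labels:
        holders = [p for p, labels in manual_labels.items() if label in labels]
        total_l += len(holders)
        correct_l += sum(
            1 for p in holders
            if p in pair_cluster and cluster_labels[pair_cluster[p]] == label
        )

    precision = correct_cl / total_cl if total_cl else 0.0
    recall = correct_l / total_l if total_l else 0.0
    return precision, recall


# Toy data for illustration only.
clusters = {0: ["p1", "p2"], 1: ["p3"]}
cluster_labels = {0: "mayor", 1: "governor"}
manual_labels = {"p1": {"mayor", "birthplace"}, "p2": {"mayor"}, "p3": {"senator"}}
precision, recall = evaluate(clusters, cluster_labels, manual_labels)
print(precision, recall)
```

In the toy data, two of the three clustered pairs carry their cluster's label, and two of the four (pair, label) assignments are recovered by a cluster with the same label.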
Evaluation
Conclusions
• Automatically extracting labels
– relations between entities in social networks
– unsupervised and domain-independent
• Utilizing the Web to obtain the collective contexts
– Semantic Web
– Web mining
• Future work
– other types of social networks
– enriching social networks