Seung-won Hwang: Entity Graph Mining and Matching
Post on 22-Oct-2014
DESCRIPTION
This talk introduces the problem of matching web-scale entity graphs, such as multilingual name graphs and social network graphs, to solve difficult problems such as name translation or social ID finding. Existing approaches rely on either textual (or phonetic) similarity or Web co-occurrences; this approach combines the strengths of the two and significantly outperforms the state of the art. We present evaluation results on real-life entity graphs.
TRANSCRIPT
Information & Database Systems Lab
Entity Graph Mining and Matching
Seung-won Hwang
Associate Professor
Department of Computer Science and Engineering
POSTECH, Korea
Mining Human Intelligence from the Web: Click Graph
Language-agnostic/data-intensive: e.g., Arabic corpus?
Are q1 and q2 similar?
Are u3 and u4 similar?
Mining at Finer Granularity: Named Entity (NE) Graph
  Person name, Place name, Organization name, Product name
  Newspapers, Web sites, TV programs, …
  [Figure: NE graph with nodes Jobs, Gates, MS, Apple, and Mac, and edge labels “complicated”, “co-founder”, and “tenure”]
Case I: Matching names with Twitter accounts [EDBT11]
Case II: Entity Translation [EMNLP10, CIKM11]
  What are the features?
  How are the features combined? (using translation as an application scenario)
  [Figure: NE graph Ge = (Ve, Ee) extracted from the English corpus and NE graph Gc = (Vc, Ec) extracted from the Chinese corpus]
NE Translation
  Goal: find, for an NE in the source language, its counterpart NE in the target language
    Ex) “Obama” (English) → “奥巴马” (Chinese)
  Resources: comparable corpora
  [Figure: features are extracted for each English NE (NE_E) from Xinhua News Agency (English) and for each Chinese NE (NE_C) from Xinhua News Agency (Chinese); the task is to find the matching (NE_E, NE_C) pairs]
NE Translation Similarity Features
  Entity Name Similarity (E): S. Wan [1], L. Haizhou [2], K. Knight [3]
    Pronunciation similarity between named entities
    Ex) “Obama” and “奥巴马” (pronounced Aobama)
  Entity Context Similarity (EC): M. Diab [4], H. Ji [5], K. Yu [6]
    Contextual word similarity between named entities
    Ex) The president (总统) Obama (奥巴马)
  Relationship Similarity (R): G.-w. You [7]
    Co-occurrence similarity between pairs of named entities
    Ex) (“Jackie Chan”, “Bill Gates”) vs. (“成龙”, “比尔·盖茨”)
“As president, Obama signed economic stimulus legislation …”
Motivation: Taxonomy Table

                           Entity        Relationship
  Using Entity Names       E [1,2,3]     R [7]
  Using Textual Context    EC [4,5,6]    ? (RC)

  Existing combinations: Shao [8] (E + EC), You [7] (E + R)
  Research questions: Why is RC not used? Can all four categories be combined?
In this paper…
  We propose a new NE translation similarity feature
    Relationship Context similarity (RC): contextual word similarity between pairs of named entities
    Ex) the pair (“Barack”, “Michelle”) has the context “spouse”
  We propose new holistic approaches
    Combining all of E, EC, R, and RC
  We validate our proposed approach using extensive experiments
Our Framework
  We abstract this problem as graph matching of two NE relationship graphs extracted from comparable corpora
  [Figure: NE relationship graph Ge = (Ve, Ee) extracted from the English corpus and NE relationship graph Gc = (Vc, Ec) extracted from the Chinese corpus]
  Populate a decision matrix R, a |Ve|-by-|Vc| matrix
Our Framework Overview – 3 Steps
  1. Initialization
    Construct NE relationship graphs
    Build an initial pairwise similarity matrix R0
    Use Entity (E) and Entity Context (EC) similarities
  2. Iterative reinforcement
    Build a final pairwise similarity matrix R∞
    Use Relationship (R) and Relationship Context (RC) similarities
  3. Matching
    Find 1:1 matching from R∞
    Build a binary hard decision matrix R*
  [Figure: two example similarity matrices with rows Obama, Jackie Chan and columns 奥巴马, 成龙, …; the Obama row reads .99, .1, .2 in both, and the Jackie Chan entry shown changes from .1 in the first matrix to .99 in the second]
Initialization
  Constructing NE relationship graphs G = (N, E)
    Extract NEs using an entity tagger for each document in each corpus
    Regard NEs that appear more than δ times as nodes
    Connect two nodes when they co-occur more than δ times
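The graph construction steps can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the entity tagger has already been run, that "co-occur" means "appear in the same document", and that one threshold `delta` is used for both nodes and edges as the slide suggests. The function name `build_ne_graph` is hypothetical.

```python
from collections import Counter
from itertools import combinations


def build_ne_graph(docs, delta):
    """Build an NE relationship graph from tagged documents.

    docs:  list of documents, each a list of NE strings (tagger output, assumed given).
    delta: frequency threshold for both nodes and edges.
    Returns (nodes, edges); each edge is an alphabetically sorted pair of node names.
    """
    # Count every NE mention across the corpus.
    mention_count = Counter()
    for nes in docs:
        mention_count.update(nes)
    nodes = {ne for ne, count in mention_count.items() if count > delta}

    # ASSUMPTION: two NEs co-occur when they appear in the same document.
    pair_count = Counter()
    for nes in docs:
        present = sorted(set(nes) & nodes)
        pair_count.update(combinations(present, 2))
    edges = {pair for pair, count in pair_count.items() if count > delta}
    return nodes, edges
```

The same function would be run once per corpus to obtain the English and the Chinese graph.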
  Initializing R0
    Computing entity similarity matrix SE
      Use Edit-Distance (ED) between ei and the Pinyin representation of cj
      Ex) ED(“Obama”, “奥巴马”) = ED(“Obama”, “Aobama”)
      $S^{E}_{ij} = 1 - \frac{ED(e_i, PY_{c_j})}{Len(e_i) + Len(PY_{c_j})}$
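A small sketch of the name similarity, under stated assumptions: the edit distance is plain Levenshtein distance, it is normalized by the summed lengths of the two strings, and comparison is case-insensitive. The Pinyin string is assumed to be supplied by the caller; no particular romanization library is implied. `name_similarity` is a hypothetical name.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution or match
        prev = cur
    return prev[-1]


def name_similarity(english_name, pinyin):
    """1 - ED / (summed lengths). ASSUMPTION: lowercase both strings before comparing."""
    e, p = english_name.lower(), pinyin.lower()
    if not e and not p:
        return 1.0  # two empty names are treated as identical
    return 1.0 - edit_distance(e, p) / (len(e) + len(p))
```

For the slide's example, "Obama" and "Aobama" differ by one inserted letter, so the distance is 1 over a combined length of 11.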
Initialization
  Initializing R0
    Computing entity context similarity matrix SEC
      Context word
        Ex) “As president, Obama signed economic stimulus legislation …”
      Context window
        $CW(NE_i, d) = \{w_{i-l/2}, w_{i-l/2+1}, \ldots, (NE_i), \ldots, w_{i+l/2-1}, w_{i+l/2}\}$
      Correlation between an NE and a context word: log-odds ratios
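The context window can be illustrated with a short sketch. It assumes the document is already tokenized, that the NE occupies a single token position `i`, and that the window holds `l // 2` words on each side; `context_window` is a hypothetical helper, and the log-odds scoring of each context word is not shown because the slide does not spell it out.

```python
def context_window(tokens, i, l):
    """Return the words within a window of size l centered on the NE at position i.

    The NE itself is excluded; the window is clipped at document boundaries.
    """
    half = l // 2
    left = tokens[max(0, i - half):i]
    right = tokens[i + 1:i + 1 + half]
    return left + right


tokens = ["As", "president", "Obama", "signed", "economic", "stimulus", "legislation"]
window = context_window(tokens, tokens.index("Obama"), 4)
```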
Initialization
  Initializing R0
    Computing entity context similarity matrix SEC
      Projected Context Association Vector
      [Figure: the context association vector of “Obama” (e.g. President: 0.9) and that of “奥巴马” (e.g. 总统: 0.85) are aligned through a bilingual dictionary (e.g. (President, 总统)); example dimensions: president/总统, USA/美國]
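One way to picture the projection step is the sketch below. It assumes context association vectors are dictionaries from context word to score, that the bilingual dictionary maps an English word to a list of Chinese translations, and that words without a dictionary entry are dropped. Keeping the maximum score when two words share a translation is an arbitrary choice made for this illustration; `project` is a hypothetical name.

```python
def project(context_vector, dictionary):
    """Project an English context association vector into the Chinese word space.

    context_vector: {english_word: association_score}
    dictionary:     {english_word: [chinese_translation, ...]}
    """
    projected = {}
    for word, score in context_vector.items():
        for translation in dictionary.get(word, ()):
            # ASSUMPTION: on collisions keep the larger score.
            projected[translation] = max(projected.get(translation, 0.0), score)
    return projected


obama_vector = {"president": 0.9, "signed": 0.4}
bilingual = {"president": ["总统"], "usa": ["美國"]}
projected = project(obama_vector, bilingual)
```

After projection, the English vector and the Chinese vector share dimensions and can be compared directly.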
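Initialization
  Initializing R0
    Computing entity context similarity matrix SEC
      Context similarity between ei and cj
      Compute cosine similarity between the two context association vectors
      $S^{EC}_{ij} = \frac{CA_{e_i} \cdot CA_{c_j}}{\lVert CA_{e_i} \rVert \, \lVert CA_{c_j} \rVert}$
    Merging SE and SEC
      Min-max normalization into the range [0, 1]
      Merge: $R^{0}_{ij}$ is obtained by merging $S^{E}_{ij}$ and $S^{EC}_{ij}$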
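The context similarity and the merge can be sketched as below. Vectors are sparse dictionaries, matrices are lists of rows, and the merge operator is an assumption: the slide only says "Merge", so this sketch simply adds the two min-max-normalized matrices. All function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors given as {dimension: weight}."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def min_max(matrix):
    """Rescale all entries of a matrix into [0, 1]."""
    flat = [x for row in matrix for x in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        return [[0.0 for _ in row] for row in matrix]
    return [[(x - lo) / (hi - lo) for x in row] for row in matrix]


def merge(s_e, s_ec):
    """ASSUMPTION: merge by element-wise sum of the normalized matrices."""
    norm_e, norm_ec = min_max(s_e), min_max(s_ec)
    return [[a + b for a, b in zip(row_e, row_ec)]
            for row_e, row_ec in zip(norm_e, norm_ec)]
```

A weighted sum or a product would fit the same interface; only the body of `merge` would change.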
Reinforcement
  Intuition
    Two NEs with a strong relationship
      Co-occur frequently → have an edge
      Share similar context → have similar relationship context
    1. Align neighbors using relationship (R) and relationship context (RC) similarity
    2. Update the similarity score
  [Figure: node X and its neighbor NEs in the English NE graph, node Y and its neighbor NEs in the Chinese NE graph; each edge carries a relationship context]
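Reinforcement
  Iterative Approach
    $R^{t+1}_{ij} = \lambda \sum_{(u,v)_k \in B^{t}(i,j,\theta)} \frac{R^{t}_{uv} + \gamma \, S^{RC}_{(i,u),(j,v)}}{2^{k}} + (1-\lambda) \, R^{0}_{ij}$
    $R^{0}_{ij}$: entity-based similarity (E & EC)
    The summation term: relationship-based similarity (R & RC)
    $B^{t}(i,j,\theta)$: ordered set of aligned neighbor pairs of (i, j) at iteration t
    $S^{RC}_{(i,u),(j,v)}$: Relationship Context (RC) similarity between relation pairs (i, u) and (j, v)
    $R^{t}_{uv}$: Relationship (R) similarity of i's neighbor u and j's neighbor v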
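The reinforcement loop can be sketched as follows. Several details are assumptions made for illustration only: neighbors are aligned greedily one-to-one by descending combined score, pairs scoring below `theta` are dropped, the k-th aligned pair is discounted by 2^k, and the relational term is blended with the initial score using weight `lam`. The callback `s_rc(i, u, j, v)` stands in for the relationship context similarity, and all names are hypothetical.

```python
def aligned_neighbor_scores(i, j, R, nbr_e, nbr_c, s_rc, gamma, theta):
    """Greedily align neighbors of i with neighbors of j; return scores in rank order."""
    candidates = sorted(
        ((R[u][v] + gamma * s_rc(i, u, j, v), u, v)
         for u in nbr_e[i] for v in nbr_c[j]),
        key=lambda c: c[0], reverse=True)
    used_u, used_v, scores = set(), set(), []
    for score, u, v in candidates:
        if score < theta:
            break  # remaining candidates are weaker still
        if u not in used_u and v not in used_v:
            used_u.add(u)
            used_v.add(v)
            scores.append(score)
    return scores


def reinforce(R0, nbr_e, nbr_c, s_rc, lam=0.15, gamma=0.0, theta=0.0, iterations=5):
    """Iteratively update pairwise similarities from aligned neighbor pairs."""
    R = [row[:] for row in R0]
    for _ in range(iterations):
        updated = []
        for i, row0 in enumerate(R0):
            new_row = []
            for j, r0 in enumerate(row0):
                scores = aligned_neighbor_scores(i, j, R, nbr_e, nbr_c, s_rc, gamma, theta)
                relational = sum(s / 2 ** k for k, s in enumerate(scores, 1))
                new_row.append(lam * relational + (1 - lam) * r0)
            updated.append(new_row)
        R = updated  # synchronous update: every entry reads the previous iteration
    return R
```

With `gamma=0` the relationship context term drops out and only neighbor similarity is propagated.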
Matching
  Finding 1:1 matching from R∞ using a greedy algorithm
  Steps
    1. Find the translation pair with the highest final similarity score
    2. Select the pair and remove the corresponding row and column from R∞
    3. Repeat 1 and 2 until the similarity score < threshold
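The greedy steps can be sketched in a few lines. Instead of physically deleting rows and columns, this illustration sorts all cells once and skips any cell whose row or column is already taken, which selects the same pairs. `greedy_match` is a hypothetical name and the matrix is a list of rows.

```python
def greedy_match(R, threshold):
    """Return 1:1 (row, column) matches in descending order of similarity."""
    cells = sorted(((score, i, j)
                    for i, row in enumerate(R)
                    for j, score in enumerate(row)),
                   key=lambda c: c[0], reverse=True)
    used_rows, used_cols, matches = set(), set(), []
    for score, i, j in cells:
        if score < threshold:
            break  # every remaining cell is below the threshold
        if i not in used_rows and j not in used_cols:
            used_rows.add(i)
            used_cols.add(j)
            matches.append((i, j))
    return matches
```

Setting the matched cells to 1 and all others to 0 gives the binary hard decision matrix.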
Experiments
  Dataset
    English Gigaword Corpus: Xinhua News Agency, 2008.01~2008.12, 100,746 news documents
    Chinese Gigaword Corpus: Xinhua News Agency, 2008.01~2008.12, 88,029 news documents
  Approaches
    EC: consider Entity Context similarity feature only
    E: consider Entity Name similarity feature only
    Shao (E+EC): combine Entity Name & Entity Context similarities
    You (E+R): combine Entity Name & Relationship similarities
    Ours: E+EC+R (when γ = 0), E+EC+R+RC
  Measure
    Precision, Recall, and F1-score
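For reference, the three measures can be computed over sets of translation pairs as in this sketch. It assumes predictions and gold answers are given as (English NE, Chinese NE) pairs; `precision_recall_f1` is a hypothetical helper and the pairs below are made-up toy data, not results from the talk.

```python
def precision_recall_f1(predicted, gold):
    """Compute precision, recall, and F1 over sets of (english, chinese) pairs."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Toy data for illustration only.
gold_pairs = {("Obama", "奥巴马"), ("Jackie Chan", "成龙"), ("Bill Gates", "比尔·盖茨")}
predicted_pairs = {("Obama", "奥巴马"), ("Jackie Chan", "比尔·盖茨")}
p, r, f1 = precision_recall_f1(predicted_pairs, gold_pairs)
```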
Experiments
  Effectiveness of overall framework
    500 person named entities
    Set λ = 0.15
    5-fold cross-validation for threshold parameter learning
  Other types of NE (100 location named entities)
Directions
  Graph matching
  Graph cleansing [VLDB11]
  Scalable entity search
  [Figure: example entity names under “US Presidents”: Bill Clinton, William J Clinton, George W. Bush, George H.W. Bush, Dubya]
Thanks. Questions?
Visit: www.postech.ac.kr/~swhwang for these papers