Post on 14-Jul-2015
Public profile matching
Urzhumtcev Oleg, SkolTech, ITMO¹
Instructor: Raymond Chi-Wing Wong, HKUST
27 November 2012, Hong Kong
¹ http://en.qdinvest.ru
Problem
There are many objects in the world
Some of them are named entities, and some of those are people
They may have different representations (profiles)
Generalization
Part of the summarization problem
More precisely, its first step: accurate data collection
Caused by homonymy
Approaches
• User Identification Across Multiple Social Networks, by Jan Vosecky, Dan Hong, Vincent Y. Shen
• Features:
• Direct matching (nearest neighbor)
• Vector-Based Comparison Algorithm
• Fuzzy string field matching
• Weighted parameters
Approaches
Vector-based comparison:
profile {
  :id = “50b2c847e3b24cf21400000”
  :username = “darikcr”
  :type “twitter”
  :source “http://twitter.com/darikcr”
  :name “NetBUG”
  :lang “ru-RU”
  :birthday nil
  :email nil
  :about “Linguist, programmer, also have some XPrience in making startups. Groaning for active shiny people to do business together”
  :status “Две новые станции метро в Петербурге - "Бухарестскую" и "Международную" - откроют 27 декабря. Б... http://vk.cc/15wd4M ”
  :tags nil
}
profile {
  :id = “50b2c843e3b24cf214000005”
  :username = “oleg.urzhumtsev”
  :type “facebook”
  :source “http://facebook.com/oleg.urzhumtsev”
  :name “Oleg Urzhumtcev”
  :alias “NetBUG”
  :lang “ru-RU”
  :birthday 1989/10/19
  :email “darikcr@gmail.cm”
  :about “”
  :status “Checked in at HKUST Bus Station”
  :tags nil
  :university [“HKUST” “SkolTech” “ITMO” “SPbSU”]
  :job [“ProMT JSC” “Israeli Embassy”]
  :interests [“Linguistics” “motoschool” “programming” “startups”]
}
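The vector-based comparison of the two profiles above can be sketched roughly as follows. This is a minimal illustration, not the implementation from the cited paper: the field list, the weights in `WEIGHTS`, and the exact-match scoring in `field_score` are all assumptions made for the example.

```ruby
# Hypothetical field weights; the slides do not give the actual values.
WEIGHTS = { username: 0.4, name: 0.3, email: 0.2, lang: 0.1 }.freeze

# Per-field score: 1.0 on a case-insensitive exact match, 0.0 on a mismatch,
# nil when either profile lacks the field.
def field_score(a, b)
  return nil if a.nil? || b.nil? || a.empty? || b.empty?

  a.casecmp?(b) ? 1.0 : 0.0
end

# Comparison vector: one score per weighted field.
def comparison_vector(p1, p2)
  WEIGHTS.keys.to_h { |field| [field, field_score(p1[field], p2[field])] }
end

# Weighted average over the fields that are present in both profiles.
def weighted_similarity(p1, p2)
  scored = comparison_vector(p1, p2).compact
  return 0.0 if scored.empty?

  total_weight = scored.keys.sum { |field| WEIGHTS[field] }
  scored.sum { |field, score| WEIGHTS[field] * score } / total_weight
end

twitter  = { username: 'darikcr', name: 'NetBUG', lang: 'ru-RU', email: nil }
facebook = { username: 'oleg.urzhumtsev', name: 'Oleg Urzhumtcev',
             lang: 'ru-RU', email: 'darikcr@gmail.cm' }

# Only :lang agrees, so the score is low (about 0.125 with these weights).
puts weighted_similarity(twitter, facebook)
```

With exact matching, the link between the Twitter name “NetBUG” and the Facebook alias “NetBUG” is missed entirely, because the two values sit in different fields.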
Approaches
Fuzzy matching (VMN algorithm*):
• Partial matching
• Word swapping tolerance
*Vosecky, Hong, Shen 2009
# | String pair | VMN | SDS | SD
1 | “Jan Vosecky”, “J Vosecky” | 0.66 | 0.82 | 2.0
2 | “Jan Vosecky”, “Vosecky Jan” | 1.0 | 0.55 | 5.0
3 | “Jan Vosecky”, “Honza vosecky” | 0.5 | 0.36 | 7.0
4 | “Jan Vosecky”, “Robert Vosecky” | 0.5 | 0.55 | 5.0
5 | “Jan Vosecky”, “Jan Smith” | 0.5 | 0.45 | 6.0
6 | “Jan Vosecky”, “Jack Vondracek” | 0.0 | 0.27 | 8.0
Table 1. String Match Functions Comparison
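The two properties listed for fuzzy matching, partial matching and word swapping tolerance, can be illustrated with a simplified token matcher. This is a stand-in written for illustration, not the VMN algorithm itself, and it does not reproduce the VMN column of Table 1. The 0.5 score for a prefix match and the function names are assumptions.

```ruby
# Token-level score: 1.0 for equal tokens, 0.5 when one token is a prefix
# of the other (e.g. an initial), 0.0 otherwise. The 0.5 is an assumption.
def token_score(a, b)
  return 1.0 if a == b
  return 0.5 if a.start_with?(b) || b.start_with?(a)

  0.0
end

# Greedily pairs each token of name1 with its best unused token of name2,
# ignoring word order, then normalizes by the longer token count.
def fuzzy_name_score(name1, name2)
  tokens1 = name1.downcase.split
  tokens2 = name2.downcase.split
  return 0.0 if tokens1.empty? || tokens2.empty?

  remaining = tokens2.dup
  total = tokens1.sum do |token|
    best = remaining.max_by { |candidate| token_score(token, candidate) }
    next 0.0 if best.nil?

    score = token_score(token, best)
    remaining.delete_at(remaining.index(best)) if score.positive?
    score
  end
  total / [tokens1.size, tokens2.size].max
end

puts fuzzy_name_score('Jan Vosecky', 'Vosecky Jan')    # swapped words: 1.0
puts fuzzy_name_score('Jan Vosecky', 'J Vosecky')      # initial only: 0.75
puts fuzzy_name_score('Jan Vosecky', 'Jack Vondracek') # no shared token: 0.0
```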
Approaches
Drawbacks:
1. Suitable only for profiles that overlap well
2. Poor for discovery
3. No cross-parameter search
Approaches
• Identifying Users Across Social Tagging Systems, by Tereza Iofciu, Peter Fankhauser, Fabian Abel, Kerstin Bischoff
• Tagged entities
• ‘Bag-of-words’ document model
• Only basic matching
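A ‘bag-of-words’ model treats each profile as an unordered multiset of its tags. A minimal sketch of comparing two such bags is given below. Cosine similarity is chosen here as one common option; the slides only say “basic matching”, so the measure and the function names are assumptions.

```ruby
# Counts how often each (lower-cased) tag occurs in a profile.
def bag_of_words(tags)
  tags.each_with_object(Hash.new(0)) { |tag, bag| bag[tag.downcase] += 1 }
end

# Cosine similarity between two term-count hashes; 0.0 if either is empty.
def cosine_similarity(bag1, bag2)
  dot = bag1.sum { |term, count| count * bag2.fetch(term, 0) }
  norm1 = Math.sqrt(bag1.values.sum { |c| c * c })
  norm2 = Math.sqrt(bag2.values.sum { |c| c * c })
  return 0.0 if norm1.zero? || norm2.zero?

  dot / (norm1 * norm2)
end

user_a = bag_of_words(%w[Linguistics programming startups])
user_b = bag_of_words(%w[programming startups ruby music])

# Two of the tags are shared, so the similarity lies between 0 and 1.
puts cosine_similarity(user_a, user_b)
```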
Proposition
1. A profile is a non-uniform document with features of different types
2. Parameters split into ‘unique’ and ‘frequent’
   ‘username’ is unique
   ‘surname’ is unique, although homonymy may occur
   ‘interests’ is frequent (shared by many people)
Proposition
3. Use a combined model:
   1. Initial matching as in [1] (vector-based)
   2. If that fails, continue to weight-based unique attribute matching
   3. If that fails, continue to clustering and all-attribute nearest-neighbor prediction
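The fallback order of the combined model can be sketched as a cascade of matchers, where each stage runs only if the previous one found nothing. The three stage bodies in `STAGES` are deliberately simplistic placeholders (exact username, name or alias equality, shared interests) and are assumptions for the example, not the matchers used in the project.

```ruby
# Number of interests two profiles have in common.
def interest_overlap(a, b)
  (Array(a[:interests]) & Array(b[:interests])).size
end

# Hypothetical stages, ordered from strictest to loosest.
STAGES = {
  direct: ->(profile, candidates) {
    candidates.find { |c| c[:username] == profile[:username] }
  },
  unique: ->(profile, candidates) {
    candidates.find { |c| [c[:name], c[:alias]].compact.include?(profile[:name]) }
  },
  nearest: ->(profile, candidates) {
    best = candidates.max_by { |c| interest_overlap(c, profile) }
    best if best && interest_overlap(best, profile).positive?
  }
}.freeze

# Runs the stages in order and returns [stage_name, match] for the first
# stage that produces a match, or [nil, nil] if every stage fails.
def cascade_match(profile, candidates, stages = STAGES)
  stages.each do |name, matcher|
    match = matcher.call(profile, candidates)
    return [name, match] unless match.nil?
  end
  [nil, nil]
end

twitter  = { username: 'darikcr', name: 'NetBUG' }
facebook = { username: 'oleg.urzhumtsev', name: 'Oleg Urzhumtcev', alias: 'NetBUG' }

stage, match = cascade_match(twitter, [facebook])
puts "#{stage}: #{match[:username]}" # the direct stage fails, the unique stage matches
```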
Proposition
Weight-based unique attribute matching
similarity = (
  this.unique_attrs.sum { |id, attr| other.unique_attrs[id] == attr ? weight_unique[id] : 0 } +
  this.frequent_attrs.count { |attr| other.frequent_attrs.include?(attr) }
) / this.frequent_attrs.count { |attr| !other.frequent_attrs.include?(attr) }.to_f
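One way to read the formula above: matching unique attributes contribute their weight, shared frequent attributes contribute one point each, and the total is divided by the number of frequent attributes that are not shared. A runnable sketch under that reading follows. The hash layout (`:unique`, `:frequent`), the example weights, and the guard that keeps the denominator at 1 or more are assumptions added for the example.

```ruby
# this/other: { unique: { id => value }, frequent: [values] }
# weight_unique: { id => weight }; ids without a weight default to 1.0.
def weighted_attribute_similarity(this, other, weight_unique)
  unique_score = this[:unique].sum(0.0) do |id, value|
    next 0.0 if value.nil? # a missing value never counts as a match

    other[:unique][id] == value ? weight_unique.fetch(id, 1.0) : 0.0
  end

  shared   = (this[:frequent] & other[:frequent]).size
  unshared = (this[:frequent] - other[:frequent]).size

  # Assumption: clamp the denominator to 1 so that fully overlapping
  # frequent attributes do not cause a division by zero.
  (unique_score + shared) / [unshared, 1].max
end

this_profile = {
  unique: { username: 'netbug', surname: 'Urzhumtcev' },
  frequent: %w[linguistics programming startups]
}
other_profile = {
  unique: { username: 'darikcr', surname: 'Urzhumtcev' },
  frequent: %w[programming startups music]
}
weights = { username: 3.0, surname: 2.0 } # hypothetical weights

# Surname matches (2.0), two interests are shared, one is not: (2.0 + 2) / 1
puts weighted_attribute_similarity(this_profile, other_profile, weights)
```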
Proposition
Clustering
Hierarchical: the distribution seems to be even
• Distance: non-numeric parameter conversion
• Merging:
• surface features shared by at least 30% of cluster members (for vector-like attributes)
• Slow
• Reliable
• Probabilistic for singular features
Curse of dimensionality
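Two pieces of the clustering step can be sketched in a few lines: a distance for non-numeric, list-valued attributes, and the merge rule that keeps features shared by at least 30% of a cluster's members. Jaccard distance is used here as one plausible conversion of set-valued attributes to a number; the slides do not name the distance actually used, so it and the function names are assumptions.

```ruby
# Jaccard distance between two lists treated as sets:
# 0.0 for identical sets, 1.0 for disjoint ones.
def jaccard_distance(a, b)
  union = a | b
  return 0.0 if union.empty?

  1.0 - (a & b).size.to_f / union.size
end

# Features of a list-valued attribute that occur in at least `threshold`
# (default 30%) of the cluster members.
def shared_features(members, attribute, threshold = 0.3)
  counts = Hash.new(0)
  members.each do |member|
    Array(member[attribute]).uniq.each { |feature| counts[feature] += 1 }
  end
  counts.select { |_, n| n.to_f / members.size >= threshold }.keys
end

cluster = [
  { interests: %w[programming linguistics] },
  { interests: %w[programming] },
  { interests: %w[startups] },
  { interests: %w[programming motoschool] }
]

# "programming" appears in 3 of 4 members; every other interest in only 1 of 4.
p shared_features(cluster, :interests) # => ["programming"]
```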
Technical work
1. Data fetching:
   1. About.me
   2. Facebook
   3. Twitter
2. Tools:
   1. Ruby
   2. Document-oriented NoSQL database: MongoDB
3. Implementation of vector-based weighted comparison
4. Implementation of VMN algorithm
Testing
Data | Direct nearest neighbor matching | Unique parameter matching | Combined (direct + unique + clustering) | LDA document-based model (experimental)
Completeness (%) | 51% | 56% | 74% | 46%
Basic set | 53 | 58 | 78 | 95
Accuracy (%) | 100% | 98% | 95% | 51%
of them false positive | 0 | 1 | 3 | 42
False negative | 51 | 46 | 29 | 9
Extended set | 56 | 62 | 127 | N/T
Accuracy (%) | 98% | 54% | 70% | N/T
of them false positive | 2 | 5 | N/T
False negative | 51 (basic set) | 37 (basic set) | 28 (basic set) | N/T
Future work
1. Attempt to convert all parameters to numeric format and apply SVM for clustering
2. Add semantic word similarity via WordNet distance
3. Named Entity Recognition in text fields
4. Wrap the developed algorithms in a single sleek Rails web application and open it for public testing
Conclusion
1. All approaches studied had a strong mathematical background but were poorly adapted to real applications
2. An intuitive fusion of approaches suited to different situations may improve results
3. Further work is necessary to develop the best approach