Public profile matching

Urzhumtcev Oleg, SkolTech, ITMO¹

Instructor: Raymond Chi-Wing Wong, HKUST

27 November 2012, Hong Kong

¹ http://en.qdinvest.ru

Problem

Generalization

Approach studied

Proposition

Testing

Conclusion


Problem

There are many objects in the world

Some of them are named entities; among these are people

They may have different representations (profiles)


Generalization

Part of the summarization problem

More precisely, its first step: accurate data collection

Caused by homonymy


Approaches

• User Identification Across Multiple Social Networks, by Jan Vosecky, Dan Hong, Vincent Y. Shen

• Features:

• Direct matching (nearest neighbor)

• Vector-Based Comparison Algorithm

• Fuzzy string field matching

• Weighted parameters


Approaches

Vector-based comparison:

profile {
  :id = “50b2c847e3b24cf21400000”
  :username = “darikcr”
  :type “twitter”
  :source “http://twitter.com/darikcr”
  :name “NetBUG”
  :lang “ru-RU”
  :birthday nil
  :email nil
  :about “Linguist, programmer, also have some XPrience in making startups. Groaning for active shiny people to do business together”
  :status “Две новые станции метро в Петербурге - "Бухарестскую" и "Международную" - откроют 27 декабря. Б... http://vk.cc/15wd4M ”
  :tags nil
}

profile {
  :id = “50b2c843e3b24cf214000005”
  :username = “oleg.urzhumtsev”
  :type “facebook”
  :source “http://facebook.com/oleg.urzhumtsev”
  :name “Oleg Urzhumtcev”
  :alias “NetBUG”
  :lang “ru-RU”
  :birthday 1989/10/19
  :email “[email protected]”
  :about “”
  :status “Checked in at HKUST Bus Station”
  :tags nil
  :university [“HKUST” “SkolTech” “ITMO” “SPbSU”]
  :job [“ProMT JSC” “Israeli Embassy”]
  :interests [“Linguistics” “motoschool” “programming” “startups”]
}
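A minimal sketch of what a weighted vector-based comparison of two such profiles could look like. The weights, the exact-match field comparison, and the rule of skipping fields that are missing on either side are assumptions made for illustration; they are not taken from the slides or from the original paper.

```ruby
# Hypothetical per-field weights; the slides do not give real values.
WEIGHTS = { username: 0.3, name: 0.4, lang: 0.1, email: 0.2 }.freeze

# Crude field comparison: 1.0 on a case-insensitive exact match, else 0.0.
def field_score(a, b)
  a.to_s.casecmp?(b.to_s) ? 1.0 : 0.0
end

# Builds the per-field similarity vector.
# Assumption: a field that is nil in either profile is left out entirely.
def similarity_vector(p1, p2, fields = WEIGHTS.keys)
  fields.each_with_object({}) do |field, vec|
    next if p1[field].nil? || p2[field].nil?
    vec[field] = field_score(p1[field], p2[field])
  end
end

# Weighted average over the fields that could actually be compared.
def weighted_similarity(p1, p2, weights = WEIGHTS)
  vec = similarity_vector(p1, p2, weights.keys)
  total = vec.keys.sum { |field| weights[field] }
  return 0.0 if total.zero?
  vec.sum { |field, score| weights[field] * score } / total
end

# Trimmed-down versions of the two profiles shown above.
twitter  = { username: "darikcr", name: "NetBUG", lang: "ru-RU", email: nil }
facebook = { username: "oleg.urzhumtsev", name: "Oleg Urzhumtcev",
             alias: "NetBUG", lang: "ru-RU" }

score = weighted_similarity(twitter, facebook)
puts format("similarity: %.3f", score)
```

With these assumed weights only `lang` matches, so the two profiles of the same person score low: the Twitter `name` equals the Facebook `alias`, but a field-by-field comparison never looks across fields.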

Approaches

Fuzzy matching (VMN algorithm*):

• Partial matching

• Word swapping tolerance

* Vosecky, Hong, Shen 2009

#  String Pair                        VMN   SDS   SD
1  “Jan Vosecky”, “J Vosecky”         0.66  0.82  2.0
2  “Jan Vosecky”, “Vosecky Jan”       1.0   0.55  5.0
3  “Jan Vosecky”, “Honza vosecky”     0.5   0.36  7.0
4  “Jan Vosecky”, “Robert Vosecky”    0.5   0.55  5.0
5  “Jan Vosecky”, “Jan Smith”         0.5   0.45  6.0
6  “Jan Vosecky”, “Jack Vondracek”    0.0   0.27  8.0

Table 1. String Match Functions Comparison

Approaches

Drawbacks:

1. Suitable for profiles with well-overlapping fields

2. Poor for discovery

3. No cross-parameter search

Approaches

Awareness of missing data:

Approaches

• Identifying Users Across Social Tagging Systems, by Tereza Iofciu, Peter Fankhauser, Fabian Abel, Kerstin Bischof

• Tagged entities

• ‘Bag-of-words’ document model

• Only basic matching


Proposition

1. A profile is a non-uniform document with different features of different types

2. Parameters split into ‘unique’ and ‘frequent’

‘username’ is unique

‘surname’ is unique although homonymy may occur

‘interests’ is frequent (shared by many people)


Proposition

3. Use a combined model:

1. Initial matching as in [1] (vector-based)

2. If that fails, continue to weight-based unique attribute matching

3. If that fails, continue to clustering and all-attribute nearest-neighbor prediction
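The three-step fallback can be sketched as a simple cascade in which each stage is tried only when the previous one returns nothing. The stage bodies below are hypothetical placeholders standing in for the real matchers, and all names are assumptions made for illustration.

```ruby
# Tries each matcher in order; the first stage that returns a candidate wins.
def cascade_match(profile, candidates, stages)
  stages.each do |name, matcher|
    match = matcher.call(profile, candidates)
    return { stage: name, match: match } if match
  end
  nil
end

# Placeholder stages (hypothetical): each returns a candidate or nil.
STAGES = {
  # Stand-in for direct vector-based matching.
  vector: ->(p, cs) { cs.find { |c| c[:username] == p[:username] } },
  # Stand-in for unique-attribute matching: a name may equal an alias.
  unique: ->(p, cs) { cs.find { |c| [c[:name], c[:alias]].compact.include?(p[:name]) } },
  # Stand-in for the clustering step: most shared interests, if any.
  cluster: lambda do |p, cs|
    overlap = ->(c) { (Array(c[:interests]) & Array(p[:interests])).size }
    best = cs.max_by(&overlap)
    best if best && overlap.call(best).positive?
  end
}.freeze

twitter  = { username: "darikcr", name: "NetBUG" }
facebook = { username: "oleg.urzhumtsev", name: "Oleg Urzhumtcev", alias: "NetBUG" }

result = cascade_match(twitter, [facebook], STAGES)
puts result ? "matched at stage #{result[:stage]}" : "no match"
```

Ordering the stages from strictest to loosest keeps the precise matcher in charge whenever it succeeds and uses the looser ones only as fallbacks.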


Proposition

Weight-based unique attribute matching

Similarity = (
  this.unique_attrs.sum { |id, attr| other.unique_attrs[id] == attr ? weight_unique[id] : 0 } +
  this.frequent_attrs.count { |attr| other.frequent_attrs.include?(attr) }
).to_f / this.frequent_attrs.count { |attr| !other.frequent_attrs.include?(attr) }
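A runnable sketch of one reading of the similarity formula above: the weighted sum of matching unique attributes plus the number of shared frequent attributes, divided by the number of frequent attributes that are not shared. The data layout, the default weight, and the guard against division by zero are assumptions added for illustration.

```ruby
# this / other: { unique: { id => value }, frequent: [values] }
# weight_unique: { id => weight } (hypothetical weights).
def attr_similarity(this, other, weight_unique)
  # Weighted sum over unique attributes that are present and equal.
  unique_score = this[:unique].sum(0.0) do |id, value|
    matched = !value.nil? && other[:unique][id] == value
    matched ? weight_unique.fetch(id, 1.0) : 0.0
  end

  shared   = (this[:frequent] & other[:frequent]).size
  unshared = (this[:frequent] - other[:frequent]).size

  # Assumption: treat "no unshared attributes" as a denominator of 1.
  (unique_score + shared) / [unshared, 1].max
end

a = { unique: { username: "netbug", surname: "Urzhumtcev" },
      frequent: %w[linguistics programming startups] }
b = { unique: { username: "oleg.urzhumtsev", surname: "Urzhumtcev" },
      frequent: %w[programming startups music] }

weights = { username: 3.0, surname: 2.0 }
puts attr_similarity(a, b, weights)
```

Because the denominator counts only the unshared attributes of the first profile, this reading is not symmetric: swapping the arguments can change the score.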


Proposition

Clustering

Hierarchical: the distribution seems to be even

• Distance: non-numeric parameter conversion

• Merging:

• for vector-like attributes, surface the features shared by 30% or more of the members

• Slow

• Reliable

• Probabilistic for singular features

Curse of dimensionality
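The 30% merging rule for vector-like attributes can be sketched as follows. The function name and the representation of a cluster as an array of profile hashes are assumptions made for illustration.

```ruby
# Returns the values of a list-valued attribute (e.g. :interests) that occur
# in at least min_share of the cluster members.
def merged_features(members, key, min_share = 0.3)
  counts = Hash.new(0)
  members.each do |member|
    # uniq: a value repeated inside one profile still counts once.
    Array(member[key]).uniq.each { |value| counts[value] += 1 }
  end
  counts.select { |_, n| n.to_f / members.size >= min_share }.keys
end

cluster = [
  { interests: %w[programming linguistics] },
  { interests: %w[programming] },
  { interests: %w[startups] },
  { interests: %w[programming chess] }
]

puts merged_features(cluster, :interests).inspect
```

In this four-member example only `programming` (3 of 4 members) clears the threshold; every other interest appears in a single profile, a 25% share.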

Technical work

1. Data fetching:

1. About.me

2. Facebook

3. Twitter

2. Tools:

1. Ruby

2. Document-oriented NoSQL database: MongoDB

3. Implementation of vector-based weighted comparison

4. Implementation of VMN algorithm


Testing

Data                       Direct nearest      Unique parameter   Combined (direct +     LDA document-based
                           neighbor matching   matching           unique + clustering)   model (experimental)
Completeness (%)           51%                 56%                74%                    46%
Basic set                  53                  58                 78                     95
  Accuracy (%)             100%                98%                95%                    51%
  of them false positive   0                   1                  3                      42
  False negative           51                  46                 29                     9
Extended set               56                  62                 127                    N/T
  Accuracy (%)             98%                 54%                70%                    N/T
  of them false positive   2                   5                                         N/T
  False negative           51 (basic set)      37 (basic set)     28 (basic set)         N/T

Future work

1. Attempt to convert all parameters to numeric format and apply SVM for clustering

2. Add semantic word similarity via WordNet distance

3. Named Entity Recognition in text fields

4. Wrap the developed algorithms into a single Rails web application and open it for public testing


Conclusion

1. All approaches studied had a strong mathematical background but were poorly adapted to real applications

2. An intuitive fusion of approaches suited to different situations may improve results

3. Further work is necessary to develop the best approach


Thank you!

Questions?

Slides available at http://n3r.ru/c4

Demo and code available at http://n3r.ru/c5

Feel free to contact me: [email protected], about.me/netbug

...and enlarge your soft skills!
