
Public profile matching

Urzhumtcev Oleg, SkolTech, ITMO¹

Instructor: Raymond Chi-Wing Wong, HKUST

27 November 2012, Hong Kong

¹ http://en.qdinvest.ru

Problem

Generalization

Approach studied

Proposition

Testing

Conclusion

Problem

There are many objects in the world

Some of them are named entities, and some of those are people

A person may have several different representations (profiles)


Generalization

Part of the summarization problem

More precisely, its first step: accurate data collection

The difficulty is caused by homonymy


Approaches

• User Identification Across Multiple Social Networks, by Jan Vosecky, Dan Hong, Vincent Y. Shen

• Features:

  • Direct matching (nearest neighbor)

  • Vector-Based Comparison Algorithm

  • Fuzzy string field matching

  • Weighted parameters

Approaches

Vector-based comparison:

profile {
  :id = “50b2c847e3b24cf21400000”
  :username = “darikcr”
  :type “twitter”
  :source “http://twitter.com/darikcr”
  :name “NetBUG”
  :lang “ru-RU”
  :birthday nil
  :email nil
  :about “Linguist, programmer, also have some XPrience in making startups. Groaning for active shiny people to do business together”
  :status “Две новые станции метро в Петербурге - "Бухарестскую" и "Международную" - откроют 27 декабря. Б... http://vk.cc/15wd4M ”
  :tags nil
}

profile {
  :id = “50b2c843e3b24cf214000005”
  :username = “oleg.urzhumtsev”
  :type “facebook”
  :source “http://facebook.com/oleg.urzhumtsev”
  :name “Oleg Urzhumtcev”
  :alias “NetBUG”
  :lang “ru-RU”
  :birthday 1989/10/19
  :email “darikcr@gmail.com”
  :about “”
  :status “Checked in at HKUST Bus Station”
  :tags nil
  :university [“HKUST” “SkolTech” “ITMO” “SPbSU”]
  :job [“ProMT JSC” “Israeli Embassy”]
  :interests [“Linguistics” “motoschool” “programming” “startups”]
}
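The vector-based comparison can be illustrated with a minimal sketch: each shared field yields one similarity score, and the scores are combined with per-field weights. The weights, the choice of fields, and the helper names `field_score`, `similarity_vector` and `weighted_similarity` are assumptions made for illustration; the slides do not give the actual values, and exact matching stands in for the fuzzy field matching of the original work.

```ruby
# Hypothetical field weights; the slides do not give the actual values.
WEIGHTS = { username: 0.4, name: 0.4, lang: 0.2 }.freeze

# Exact, case-insensitive field comparison: 1.0 on a match, 0.0 otherwise.
# Missing (nil or empty) values score 0.0.
def field_score(a, b)
  return 0.0 if a.nil? || b.nil? || a.empty? || b.empty?

  a.casecmp?(b) ? 1.0 : 0.0
end

# Builds the per-field similarity vector for two profiles (plain hashes).
# Each field is compared only with the same field of the other profile.
def similarity_vector(p1, p2)
  WEIGHTS.keys.to_h { |field| [field, field_score(p1[field], p2[field])] }
end

# Weighted sum of the similarity vector.
def weighted_similarity(p1, p2)
  similarity_vector(p1, p2).sum { |field, score| WEIGHTS[field] * score }
end

twitter  = { username: "darikcr", name: "NetBUG", lang: "ru-RU" }
facebook = { username: "oleg.urzhumtsev", name: "Oleg Urzhumtcev",
             alias: "NetBUG", lang: "ru-RU" }

puts weighted_similarity(twitter, facebook).round(2)
```

With these assumed weights the two example profiles score only 0.2, because just `lang` matches: the Twitter `name` equals the Facebook `alias`, but a same-field comparison never looks there.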

Approaches

Fuzzy matching (VMN algorithm*):

• Partial matching

• Word swapping tolerance

* Vosecky, Hong, Shen 2009

| # | String pair | VMN | SDS | SD |
| 1 | “Jan Vosecky”, “J Vosecky” | 0.66 | 0.82 | 2.0 |
| 2 | “Jan Vosecky”, “Vosecky Jan” | 1.0 | 0.55 | 5.0 |
| 3 | “Jan Vosecky”, “Honza vosecky” | 0.5 | 0.36 | 7.0 |
| 4 | “Jan Vosecky”, “Robert Vosecky” | 0.5 | 0.55 | 5.0 |
| 5 | “Jan Vosecky”, “Jan Smith” | 0.5 | 0.45 | 6.0 |
| 6 | “Jan Vosecky”, “Jack Vondracek” | 0.0 | 0.27 | 8.0 |

Table 1. String Match Functions Comparison
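The two properties listed for fuzzy matching, partial matching and word swapping tolerance, can be shown with a small token-based sketch. This is not the published VMN algorithm: the scoring rule (full credit for equal tokens, prefix-length ratio for abbreviations) and the names `token_score` and `name_similarity` are assumptions chosen for illustration.

```ruby
# Score for a pair of tokens: 1.0 if equal, a partial score if one token is
# a prefix of the other (for example an initial), 0.0 otherwise.
def token_score(a, b)
  return 1.0 if a == b

  short, long = [a, b].minmax_by(&:length)
  long.start_with?(short) ? short.length.to_f / long.length : 0.0
end

# Averages, over the tokens of the first name, the best score against any
# token of the second name. Token order is ignored, so swapped words match.
def name_similarity(name_a, name_b)
  a = name_a.downcase.split
  b = name_b.downcase.split
  return 0.0 if a.empty? || b.empty?

  a.sum { |ta| b.map { |tb| token_score(ta, tb) }.max } / a.length
end

puts name_similarity("Jan Vosecky", "Vosecky Jan")    # word order ignored
puts name_similarity("Jan Vosecky", "J Vosecky")      # initial gets partial credit
puts name_similarity("Jan Vosecky", "Jack Vondracek") # no token matches
```

For the six pairs of Table 1 this simple rule gives values close to the VMN column, but that agreement should be read as a coincidence of the sketch, not as a reimplementation.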

Approaches

Drawbacks:

1. Suitable only for profiles whose fields overlap well

2. Bad for discovery

3. No cross-parameter search

Approaches

Awareness of missing data:

Approaches

• Identifying Users Across Social Tagging Systems, by Tereza Iofciu, Peter Fankhauser, Fabian Abel, Kerstin Bischof

• Tagged entities

• ‘Bag-of-words’ document model

• Only basic matching


Proposition

1. A profile is a non-uniform document whose features have different types

2. Parameters are split into ‘unique’ and ‘frequent’

  ‘username’ is unique

  ‘surname’ is unique, although homonymy may occur

  ‘interests’ is frequent (shared by many people)

Proposition

3. Use a combined model:

  1. Initial matching as in [1] (the vector-based comparison of Vosecky, Hong, Shen)

  2. If that fails, continue with weight-based unique-attribute matching

  3. If that fails, continue with clustering and nearest-neighbor prediction over all attributes
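The fallback order of the combined model can be sketched as a simple cascade: each stage is tried in turn, and the first one that returns a match wins. The stage bodies below are placeholders and every name (`cascade_match`, the `stages` hash, the lambdas) is hypothetical; the real stages would be the three matchers named in the list.

```ruby
# Runs the matchers in order and returns the first non-nil result together
# with the name of the stage that produced it, or nil if every stage fails.
def cascade_match(profile, candidates, stages)
  stages.each do |name, matcher|
    match = matcher.call(profile, candidates)
    return { stage: name, match: match } if match
  end
  nil
end

# Placeholder stages (hypothetical): each returns a candidate or nil.
stages = {
  vector:  ->(p, cs) { cs.find { |c| c[:username] == p[:username] } },
  unique:  ->(p, cs) { cs.find { |c| c[:alias] && c[:alias] == p[:name] } },
  cluster: ->(_p, _cs) { nil } # the clustering stage is not sketched here
}

profile    = { username: "darikcr", name: "NetBUG" }
candidates = [{ username: "oleg.urzhumtsev", name: "Oleg Urzhumtcev", alias: "NetBUG" }]

result = cascade_match(profile, candidates, stages)
puts result[:stage]
```

Keeping the stages in an ordered hash makes it easy to report which stage produced a match, which is the kind of breakdown a per-method comparison needs.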

Proposition

Weight-based unique attribute matching

Similarity = (unique_score + frequent_shared) / frequent_unshared

where

unique_score = this.unique_attrs.sum { |id, attr| other.unique_attrs[id] == attr ? weight_unique[id] : 0 }

frequent_shared = this.frequent_attrs.count { |attr| other.frequent_attrs.include?(attr) }

frequent_unshared = this.frequent_attrs.count { |attr| !other.frequent_attrs.include?(attr) }
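A runnable reading of the similarity formula might look like the sketch below. The weights, the profile layout (`:unique` hash plus `:frequent` array) and the function name `attribute_similarity` are assumptions; the slide also does not say what happens when no frequent attribute is unshared, so the denominator is clamped to 1 here.

```ruby
# Hypothetical weights for the 'unique' attributes.
WEIGHT_UNIQUE = { username: 3.0, surname: 2.0 }.freeze

# this / other: { unique: { id => value }, frequent: [values] }
def attribute_similarity(this, other, weights = WEIGHT_UNIQUE)
  # Weighted count of unique attributes that have the same value in both.
  unique_score = this[:unique].sum do |id, value|
    other[:unique][id] == value ? weights.fetch(id, 1.0) : 0.0
  end
  shared   = (this[:frequent] & other[:frequent]).size
  unshared = (this[:frequent] - other[:frequent]).size
  # Assumption: clamp the denominator so a full overlap does not divide by 0.
  (unique_score + shared) / [unshared, 1].max
end

a = { unique: { username: "netbug", surname: "urzhumtcev" },
      frequent: %w[linguistics programming startups] }
b = { unique: { username: "netbug", surname: "urzhumtsev" },
      frequent: %w[programming startups music] }

puts attribute_similarity(a, b)
```

The score is not normalised to the range 0 to 1, so any decision threshold has to be chosen relative to the weights in use.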

Proposition

Clustering

Hierarchical: the distribution seems to be even

• Distance: non-numeric parameter conversion

• Merging:

• surface features shared by 30% of members or more, for vector-like attributes

• Slow

• Reliable

• Probabilistic for singular features

Curse of dimensionality
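The merging rule for vector-like attributes (keep a feature when at least 30% of the cluster members share it) can be sketched as follows. The function name `shared_features`, the `min_share` parameter and the sample cluster are hypothetical, and the sketch covers only this one rule, not the clustering itself.

```ruby
# Returns the values of a vector-like attribute that occur in at least
# `min_share` of the cluster members, sorted alphabetically.
def shared_features(members, attribute, min_share: 0.3)
  counts = Hash.new(0)
  # Count each value at most once per member.
  members.each { |m| Array(m[attribute]).uniq.each { |v| counts[v] += 1 } }
  counts.select { |_value, n| n.to_f / members.size >= min_share }.keys.sort
end

cluster = [
  { interests: %w[programming startups linguistics] },
  { interests: %w[programming music] },
  { interests: %w[startups programming] },
  { interests: %w[chess] }
]

p shared_features(cluster, :interests)
```

In this sample, `programming` (3 of 4 members) and `startups` (2 of 4) pass the threshold, while values held by a single member (1 of 4, or 25%) are dropped from the merged cluster description.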

Technical work

1. Data fetching:

  1. About.me

  2. Facebook

  3. Twitter

2. Tools:

  1. Ruby

  2. Document-oriented NoSQL database: MongoDB

3. Implementation of vector-based weighted comparison

4. Implementation of VMN algorithm


Testing

| Data | Direct nearest neighbor matching | Unique parameter matching | Combined (direct + unique + clustering) | LDA document-based model (experimental) |
| Completeness (%) | 51% | 56% | 74% | 46% |
| Basic set | 53 | 58 | 78 | 95 |
| Accuracy (%) | 100% | 98% | 95% | 51% |
| of them false positive | 0 | 1 | 3 | 42 |
| False negative | 51 | 46 | 29 | 9 |
| Extended set | 56 | 62 | 127 | N/T |
| Accuracy (%) | 98% | 54% | 70% | N/T |
| of them false positive | 2 | 5 | | N/T |
| False negative | 51 (basic set) | 37 (basic set) | 28 (basic set) | N/T |

Future work

1. Attempt to convert all parameters to numeric format and apply SVM for clustering

2. Add semantic word similarity via WordNet distance

3. Named Entity Recognition in text fields

4. Wrap the algorithms developed into a single sleek Rails web application and test it publicly

Conclusion

1. All approaches studied had a strong mathematical background but were poorly adapted to real applications

2. An intuitive fusion of approaches suited to different situations may improve results

3. Further work is necessary to develop the best approach

Thank you!

Questions?

Slides available at http://n3r.ru/c4

Demo & code available at http://n3r.ru/c5

Feel free to contact me: darikcr@gmail.com

http://about.me/netbug

...and enlarge your soft skills!
