An Unsupervised Approach to Discovering and Disambiguating Social Media Profiles
Mining Data Semantics Workshop 2011
Carlton Northern
Old Dominion University
8/25/2011


Background

• Digital Preservation
  – How are students using social media as a digital preservation strategy?
  – Evaluating Personal Archiving Strategies for Internet-based Information - Marshall, McCown, Nelson http://www.cs.odu.edu/~mln/pubs/archiving-2007/eval-personal-arch-strat-archiving07.pdf


Goal

• Ascertain the set of social media profiles for ODU CS students.


What's out there already?


Intelius


Wink / MyLife


Google


Requirements and Assumptions

• Approach must be automated: no human interaction except for a search query consisting of:
  • location
  • organization
  • profession/education domain
• Achieve precision of 0.7 or higher and f-measure of 0.5 or higher, comparable to human performance on the same task
• Must find profiles not indexed by search engines
• Can use any means available, including search engines, page scraping, web service APIs, etc.
• Only publicly declared identities; do not expose obfuscated identities
  – e.g., “Bruce Wayne” -> “Batman”
• Find profiles from 25 pre-defined sites (next slide)
• Approach must be extensible
  – i.e., new social media sites can be added with minimal changes.


Social Media Sites


Approach


Algorithm

Discovery Phase
• Generate Usernames
• Check Sites for Profiles
• Check Social Graph
• Check Google and Yahoo
• Check Sites for Profiles
• Check Rapportive

Disambiguation Phase
• Assign Points for Keywords, Email, Me and Friend Links
• Remove Duplicates

*Run multiple times


Discovery Phase


Starting Information

• Given:
  – Full name, e.g. Carlton Northern
  – CS username, e.g. cnorther
  – CS email, e.g. [email protected]
  – .forward files -> [email protected]
  – CS profile URI, e.g. http://www.cs.odu.edu/~cnorther

• Inferred:
  – School affiliation, e.g. Old Dominion
  – Approximate location, e.g. Norfolk, Hampton Roads
  – Computer Science affiliation, e.g. software engineer


Username Generation

• Generate usernames from full name derivatives, e.g. for “Carlton Northern” we have:
  • cnorthern
  • northernc
  • carlton.northern
  • carlton_northern
  • carlton-northern
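
The derivation above can be sketched in a few lines of Python. This is a minimal illustration, not the presenter's code: the function name `generate_usernames` is hypothetical, it assumes a name of at least two whitespace-separated parts, and it assumes the last variant on the slide is the hyphenated full name.

```python
def generate_usernames(full_name: str) -> list[str]:
    """Derive candidate usernames from a full name (illustrative patterns only)."""
    parts = full_name.lower().split()
    first, last = parts[0], parts[-1]  # middle names are ignored in this sketch
    return [
        f"{first[0]}{last}",   # first initial + last name
        f"{last}{first[0]}",   # last name + first initial
        f"{first}.{last}",     # dotted full name
        f"{first}_{last}",     # underscored full name
        f"{first}-{last}",     # hyphenated full name (assumed variant)
    ]


candidates = generate_usernames("Carlton Northern")
```

The known CS username (cnorther) would presumably be added to this list as a further candidate.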


Poll Sites

• Issue an HTTP GET to determine if a profile exists with a generated username
  – Create site templates for links:
    • http://www.facebook.com/’username here’
    • http://www.stumbleupon.com/stumbler/’username here’
    • https://picasaweb.google.com/’username here’
  – 2016 students, 6 usernames, 25 sites = 302k requests
• GET http://www.facebook.com/carlton.northern HTTP/1.1
  – If the response is 200 OK, the profile exists; otherwise it doesn’t.
  – Soft 404s can be somewhat problematic but can be handled.
  – Some sites detect robots and will present a CAPTCHA, which is also problematic.
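
A minimal sketch of the polling step, using only the Python standard library. The names `SITE_TEMPLATES`, `http_status`, `candidate_urls` and `existing_profiles` are hypothetical, and the `{username}` placeholder stands in for the slide's ’username here’. Note that `urlopen` follows redirects, so this sketch does nothing about soft 404s or CAPTCHA pages; those would need site-specific checks.

```python
import urllib.error
import urllib.request

# Three of the 25 site templates, taken from the slide.
SITE_TEMPLATES = {
    "facebook": "http://www.facebook.com/{username}",
    "stumbleupon": "http://www.stumbleupon.com/stumbler/{username}",
    "picasaweb": "https://picasaweb.google.com/{username}",
}


def http_status(url, timeout=10.0):
    """Issue a GET and return the HTTP status code, or None on a network error."""
    request = urllib.request.Request(url, method="GET")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # e.g. 404 when no such profile exists
    except urllib.error.URLError:
        return None      # DNS failure, refused connection, and so on


def candidate_urls(usernames):
    """Expand every username against every site template."""
    return [
        template.format(username=username)
        for username in usernames
        for template in SITE_TEMPLATES.values()
    ]


def existing_profiles(usernames, fetch_status=http_status):
    """Keep only the candidate URLs that answer with 200."""
    return [url for url in candidate_urls(usernames) if fetch_status(url) == 200]
```

Passing the status function as a parameter keeps the sketch testable offline and leaves room for a throttled or cached fetcher, which matters at roughly 302k requests.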


Google’s Social Graph API

• Run existing profile URLs through Google Social Graph to find “me” links.


“Me” Links

• A “me” link is a link in Friend of a Friend (FOAF) or XHTML Friends Network (XFN) markup stating that the linking page and the linked page belong to the same identity

• For example, a “me” link from my CS profile page to Twitter:

<html>
  <head>
    <title>Carlton Northern's CS Home Page</title>
  </head>
  <body>
    stuff here ...
    <a href="http://twitter.com/carltonnorthern" rel="me">My Twitter</a>
  </body>
</html>
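
The talk relies on Google's Social Graph API to find these links. As a rough illustration of what a “me” link looks like to a crawler, the sketch below pulls `rel="me"` anchors out of a page with Python's standard `html.parser`. The names `MeLinkParser` and `extract_me_links` are hypothetical, and this is not how the Social Graph API itself works.

```python
from html.parser import HTMLParser


class MeLinkParser(HTMLParser):
    """Collect href values of <a> tags whose rel attribute contains 'me'."""

    def __init__(self):
        super().__init__()
        self.me_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        # rel may hold several space-separated values, e.g. rel="me friend"
        rel_values = (attributes.get("rel") or "").lower().split()
        href = attributes.get("href")
        if "me" in rel_values and href:
            self.me_links.append(href)


def extract_me_links(html):
    parser = MeLinkParser()
    parser.feed(html)
    parser.close()
    return parser.me_links


page = """<html><head><title>Carlton Northern's CS Home Page</title></head>
<body>stuff here ...
<a href="http://twitter.com/carltonnorthern" rel="me">My Twitter</a>
<a href="http://example.com/other">Not a me link</a>
</body></html>"""
me_links = extract_me_links(page)
```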


Rapportive

• Rapportive is a contact relationship management (CRM) tool that sits on top of Gmail
• Uses AJAX and JSON to serve up content to their Gmail widget.
• Mined .forward files on the CS departmental server
  – Found only 24 email addresses out of 2016 students
• Run CS and non-CS email addresses through Rapportive’s not-so-public API to access their results.
  – Produced 15.9% of our truth set profile results with only 1.6% being unique to Rapportive
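
Mining the .forward files amounts to pulling external addresses out of small text files. The sketch below is one hedged way to do that; the function name `forward_addresses` and the deliberately simple address pattern are assumptions, and the Rapportive lookup itself is not shown because its API is not documented in the talk.

```python
import re

# Simplified address pattern; good enough for a sketch, not a full RFC 5322 parser.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def forward_addresses(forward_text):
    """Return the distinct email addresses found in the text of a .forward file."""
    addresses = []
    for line in forward_text.splitlines():
        line = line.strip()
        if not line or line.startswith("|"):
            continue  # skip blank lines and entries that pipe mail to a program
        for match in EMAIL_PATTERN.findall(line):
            address = match.lower()
            if address not in addresses:
                addresses.append(address)
    return addresses
```

Entries without an `@`, such as a local account name, are ignored because they add no new address to look up.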


Google and Yahoo

• Query Google and Yahoo using their respective APIs.
  – "carlton northern" AND norfolk
  – "carlton northern" AND "computer science"
  – "carlton northern" AND "old dominion"
  – "carlton northern" site:http://www.facebook.com

• GeoNames could be used to derive nearby cities to automatically form search queries

• The same could be done with WordNet to derive profession or education terms
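
The query patterns above can be generated mechanically from the search terms. The sketch below is an assumption about how that might look; `build_queries` and its parameters are hypothetical names, and the term lists are supplied by hand here rather than derived from GeoNames or WordNet.

```python
def build_queries(full_name, terms, sites):
    """Build search-engine queries in the style shown on the slide."""
    quoted_name = f'"{full_name.lower()}"'

    def quote_if_phrase(term):
        # Multi-word terms are quoted so they are matched as a phrase.
        return f'"{term}"' if " " in term else term

    queries = [f"{quoted_name} AND {quote_if_phrase(term)}" for term in terms]
    queries += [f"{quoted_name} site:{site}" for site in sites]
    return queries


queries = build_queries(
    "Carlton Northern",
    terms=["norfolk", "computer science", "old dominion"],
    sites=["http://www.facebook.com"],
)
```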


Google and Yahoo

• Calls to Google and Yahoo need to be limited because of API restrictions.
  – Google restricts use to about 1,000 requests per hour

• Furthermore, the best results are in positions 1 to 8 of the result set
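
One simple way to respect such a limit is to space calls evenly, which at 1,000 requests per hour means at least 3.6 seconds between calls. The sketch below assumes that policy; `Throttle` and `top_results` are hypothetical names, and the clock and sleep functions are injectable only so the behaviour can be checked without waiting.

```python
import time


class Throttle:
    """Space calls so that no more than max_per_hour are issued per hour."""

    def __init__(self, max_per_hour=1000, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 3600.0 / max_per_hour
        self._clock = clock
        self._sleep = sleep
        self._last_call = None

    def wait(self):
        """Block until enough time has passed since the previous call."""
        now = self._clock()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_call = now


def top_results(results, limit=8):
    """Keep only the first few hits, where the slide says the best results are."""
    return list(results)[:limit]
```

Calling `throttle.wait()` before each API request and passing the response through `top_results` keeps both the request rate and the amount of data to disambiguate bounded.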


Disambiguation Phase


• From a public Facebook profile you can (sometimes) get a person’s full name, city/area, friends and picture


Personally Identifiable Information Poor Profile


Personally Identifiable Information Rich Profile


Point System

• Simple point system:
  – Keyword matching
  – Link community structure analysis
  – Extraction of semantic and feature data from profiles
• A profile that reaches 11 points is considered validated.
• Points can range from a negative total to about 50.


Keyword Matching

• 1 point for weak indicators
  – one-word terms like “programmer” or “student”
• 4 points for stronger indicators
  – terms of two or more words like “computer science” or “software engineer”
• 7 points for very strong indicators
  – locations, e.g. “norfolk” or “portsmouth”
  – Localized advertisements can be problematic
• 2 points for first name or given name
• 4 points for last name
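
The keyword weights above translate directly into a scoring function. The sketch below is an illustration under stated assumptions: each term is counted at most once per page, matching is case-insensitive on whole words, and the term lists contain only the examples from the slide. The names `keyword_score`, `WEAK_TERMS`, `STRONG_TERMS` and `LOCATION_TERMS` are hypothetical.

```python
import re

WEAK_TERMS = ["programmer", "student"]                   # 1 point each
STRONG_TERMS = ["computer science", "software engineer"]  # 4 points each
LOCATION_TERMS = ["norfolk", "portsmouth"]                # 7 points each


def _normalize(text):
    """Lowercase, keep alphanumeric words, and pad with spaces for whole-word matching."""
    return " " + " ".join(re.findall(r"[a-z0-9]+", text.lower())) + " "


def _contains(haystack, term):
    return _normalize(term) in haystack


def keyword_score(page_text, first_name, last_name):
    """Sum the keyword points for one profile page."""
    haystack = _normalize(page_text)
    score = 0
    score += 1 * sum(_contains(haystack, term) for term in WEAK_TERMS)
    score += 4 * sum(_contains(haystack, term) for term in STRONG_TERMS)
    score += 7 * sum(_contains(haystack, term) for term in LOCATION_TERMS)
    if _contains(haystack, first_name):
        score += 2
    if _contains(haystack, last_name):
        score += 4
    return score


score = keyword_score(
    "Carlton Northern, computer science student in Norfolk", "Carlton", "Northern"
)
```

In this example the page earns 1 (student) + 4 (computer science) + 7 (norfolk) + 2 (first name) + 4 (last name) = 18 points.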


Name Matching

• Facebook, LinkedIn, Google, and Twitter use real names, so:
  – 2 points for a first name or diminutive/nickname
  – 5 points for a last name
  – Subtract 21 points if neither a nickname or diminutive nor a last name is found
• Watch out for diminutives/nicknames!
  – http://code.google.com/p/nickname-and-diminutive-names-lookup/
• LinkedIn provides location
  – add or subtract 7 points
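
A hedged sketch of the name rule for real-name sites. It reads the penalty as applying only when neither the first name (or a nickname) nor the last name appears in the profile's display name, which is one interpretation of the slide. The function `name_score`, the tiny `NICKNAMES` table and the sample names are hypothetical; a real run would load the nickname lookup list linked above.

```python
import re

# Hypothetical sample entry; the real table would come from a nickname lookup list.
NICKNAMES = {"michael": {"mike", "mikey"}}


def name_score(display_name, first_name, last_name, nicknames=NICKNAMES):
    """Score a profile's display name against the person's real name."""
    tokens = set(re.findall(r"[a-z]+", display_name.lower()))
    first = first_name.lower()
    first_variants = {first} | nicknames.get(first, set())
    first_found = bool(tokens & first_variants)
    last_found = last_name.lower() in tokens
    if not first_found and not last_found:
        return -21  # assumed reading: penalty only when nothing matches
    return (2 if first_found else 0) + (5 if last_found else 0)
```

With a validation threshold of 11, a penalty of 21 is large enough to cancel a strong keyword score, which is presumably why a missing name weighs so heavily on sites that require real names.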


Link Community Structure Analysis

• Retrieve all links in a page and check whether they point to other validated profiles in the data set; if so, assign 5 points

Figure legend: Validated Profile / Not-Validated Profile

Assign 5 points to Michael’s Twitter profile
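
As a rough sketch of this check, the code below collects every link on a page and awards the 5 points if any of them resolves to an already validated profile. Awarding the points once per page rather than once per link is an assumption, as is the light URL normalization; `LinkCollector`, `normalize` and `link_structure_points` are hypothetical names.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit

LINK_POINTS = 5


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def normalize(url):
    """Reduce a URL to (host without 'www.', path without trailing slash)."""
    parts = urlsplit(url.strip())
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    return host, parts.path.rstrip("/")


def link_structure_points(page_html, validated_urls):
    """Return 5 if the page links to any validated profile, else 0."""
    collector = LinkCollector()
    collector.feed(page_html)
    collector.close()
    validated = {normalize(url) for url in validated_urls}
    linked = any(normalize(href) in validated for href in collector.hrefs)
    return LINK_POINTS if linked else 0
```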


Me Links and Email Matching

• 10 points if a profile is found from Rapportive
• 10 points if a profile has a me link from an already validated profile

Figure legend: Validated Profile / Not-Validated Profile

Assign 10 points to Carlton’s Twitter profile
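
Because me-link points depend on which profiles are already validated, the scoring has to be repeated until nothing changes. The sketch below shows one way that propagation might work. It assumes the 11-point threshold means "11 or more", that a profile receives the me-link bonus at most once, and that base scores from keyword and name matching are computed beforehand; `validate_profiles` and its parameters are hypothetical.

```python
VALIDATION_THRESHOLD = 11
ME_LINK_POINTS = 10
RAPPORTIVE_POINTS = 10


def validate_profiles(base_scores, me_links, rapportive_urls):
    """Propagate me-link points from validated profiles until no new profile validates.

    base_scores:     profile URL -> points from keyword and name matching
    me_links:        profile URL -> set of URLs it declares with rel="me"
    rapportive_urls: set of profile URLs returned by the email lookup
    """
    scores = {
        url: points + (RAPPORTIVE_POINTS if url in rapportive_urls else 0)
        for url, points in base_scores.items()
    }
    validated = {url for url, points in scores.items() if points >= VALIDATION_THRESHOLD}
    rewarded = set()            # profiles that already received the me-link bonus
    frontier = list(validated)  # validated profiles whose links are still unchecked
    while frontier:
        source = frontier.pop()
        for target in me_links.get(source, ()):
            if target not in scores or target in rewarded:
                continue
            rewarded.add(target)
            scores[target] += ME_LINK_POINTS
            if target not in validated and scores[target] >= VALIDATION_THRESHOLD:
                validated.add(target)
                frontier.append(target)
    return scores, validated


scores, validated = validate_profiles(
    base_scores={"cs-home": 18, "twitter": 4, "blog": 0},
    me_links={"cs-home": {"twitter"}, "twitter": {"blog"}},
    rapportive_urls=set(),
)
```

Here the CS home page validates on its own, its me link lifts the Twitter profile from 4 to 14 points, and the blog ends at 10 points, one short of the threshold.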


Experiment


Dataset

• 2016 students from our departmental server
  – 142 graduate
  – 1874 undergraduate
  – Generated 9GB worth of data

• Truth set: 20 graduate students and 2 professors from our research group, Web Science and Digital Libraries

• Use information retrieval metrics of precision, recall and f-measure to assess our truth set
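
For reference, the three metrics can be computed from the set of profiles the system validated and the hand-built truth set. The sketch below uses the standard definitions with the balanced F-measure (harmonic mean); whether the talk's figures were pooled over all profiles or averaged per person is not stated, so the function name and the pooled treatment are assumptions, and the sample data is made up.

```python
def precision_recall_f(found, truth):
    """Return (precision, recall, f_measure) for found vs. true profile URLs."""
    found, truth = set(found), set(truth)
    true_positives = len(found & truth)
    precision = true_positives / len(found) if found else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Made-up example: four profiles reported, two of them correct, two true profiles in total.
precision, recall, f_measure = precision_recall_f(
    found={"a", "b", "c", "d"}, truth={"a", "b"}
)
```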


Truth Set Results Summary


Social Media Web Site Results


Whole Set Service Graph


Truth Set User Graph


Whole Set User Graph


Whole Set User Graph Without Blogger Links


Closeup


Future Work

• Facial recognition
• Better link community structure analysis
• Perform quantitative social media digital preservation study
• Remove social media sites that produced few or no results (unpopular) and add new ones (foursquare.com)?


Potential Impacts/Uses

• Open source intelligence gathering
  – “Open source” as in publicly available information
• Social media research
• Measure the social health of an organization


Conclusions

• Completely automated, with the only human interaction being the creation of the search query
• Precision 0.863, recall 0.526, f-measure 0.632
• The approach uses non-traditional search mechanisms to achieve its goals
• Only publicly available information was used


Carlton Northern [email protected]

http://carlton-northern.com/