finding event-specific influencers in dynamic social networks

FINDING EVENT-SPECIFIC INFLUENCERS IN DYNAMIC SOCIAL NETWORKS
Master's Thesis – Chris Schenk
December 1st, 2010

OUTLINE
  Problem overview
    Influencers, reputation, validation and security
    Summary of analysis methods
  Boulder fire data
    Twitter data: API, formats, collection and data limitations
    Statistics
  Finding event-specific influencers – Rankings
    Stats
    Hyperlink-Induced Topic Search (HITS)
    Context-specific in-degree (original work)
  Conclusions and Future Work

PROBLEM OVERVIEW

INFLUENCERS
  Social dynamics vs online social dynamics
  Social network features
    Search, friends, re-tweets
  Influencers and sheep
    What is meant by influence?
  Understanding the data
    Sampling and baseline statistics
    Similarity measures, clustering
    Semantics, intent (NLP)
  Baseline activity

INFLUENCERS – NETWORK STRUCTURE
  Betweenness/Closeness centrality
  PageRank/TwitterRank/TunkRank
  Local/Global hierarchical clustering
  K-core decomposition
  K-clique percolation
  Nearest Neighbor Networks
  Assortative mixing
  HITS
  Activity Network

TWITTER DATA STATS – BOULDER FIRE
  Tweets
    First day – September 6th, 2010 10:00am to September 7th, 2010 10:00am, Mountain time
    First week – September 6th, 2010 10:00am to September 13th, 2010 10:00am, Mountain time
  Social graph
    Five one-day snapshots beginning September 7th, 2010 12:40pm, Mountain time
  Tweet example
    RT @garytx: Article on Twitter's use during #eqnz, #boulderfire, and #sanbrunofire: http://bit.ly/cwI1fi
    kate30_CU - 2010-09-13 15:29:24+00:00
  Keywords: boulder, boulderfire, fourmilefire, fourmilecanyon, 4milefire
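The tweet example above can be broken into the features counted later in the deck (hashtags, mentions, re-tweets, addressed messages). The following is a minimal sketch, not the thesis code: the regular expressions, the function name `tweet_features`, the substring keyword match, and the fixed UTC-6 offset for Mountain Daylight Time are all assumptions made for illustration.

```python
import re
from datetime import datetime, timedelta, timezone

# Assumption: Mountain Daylight Time (UTC-6) applies in September 2010.
MDT = timezone(timedelta(hours=-6))
FIRST_WEEK_START = datetime(2010, 9, 6, 10, 0, tzinfo=MDT)
FIRST_WEEK_END = datetime(2010, 9, 13, 10, 0, tzinfo=MDT)

# Keywords listed on the slide.
KEYWORDS = {"boulder", "boulderfire", "fourmilefire", "fourmilecanyon", "4milefire"}

HASHTAG_RE = re.compile(r"#(\w+)")
MENTION_RE = re.compile(r"@(\w+)")
RETWEET_RE = re.compile(r"^RT @(\w+)")


def tweet_features(text, created_at):
    """Extract simple features from one tweet (hypothetical helper)."""
    retweet = RETWEET_RE.match(text)
    lowered = text.lower()
    return {
        "hashtags": [h.lower() for h in HASHTAG_RE.findall(text)],
        "mentions": MENTION_RE.findall(text),
        "retweet_of": retweet.group(1) if retweet else None,
        # Assumption: an addressed message starts with "@username".
        "addressed": text.startswith("@"),
        # Assumption: a keyword matches if it appears anywhere in the text.
        "matched_keywords": sorted(k for k in KEYWORDS if k in lowered),
        "in_first_week": FIRST_WEEK_START <= created_at < FIRST_WEEK_END,
    }


example = tweet_features(
    "RT @garytx: Article on Twitter's use during #eqnz, #boulderfire, "
    "and #sanbrunofire: http://bit.ly/cwI1fi",
    datetime(2010, 9, 13, 15, 29, 24, tzinfo=timezone.utc),
)
print(example)
```

The timestamp 15:29:24 UTC corresponds to 09:29:24 Mountain time, so this tweet falls just inside the first-week window that ends at 10:00am on September 13th.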

QUALITATIVELY INFLUENTIAL USERS
  Sixteen users gathered by Jo White
  Used as “ground truth” data for ranking comparison
  epiccolorado laurasrecipes HumaneBoulder fishnette suzanbond CampSteve ConnectColorad
  metroseen palen sophiabliu Mediamum Tanukun eadvocate kate30_CU BoulderChannel
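One way a ranking could be compared against the ground-truth users is to count how many of them appear in its top k positions. The slides do not state which comparison measure was used, so the sketch below is only an assumption: the helper names `ground_truth_hits` and `precision_at_k` are hypothetical, and the ranking is made-up illustration data, not a result from the thesis.

```python
# Subset of the ground-truth usernames from the slide, lowercased for matching.
GROUND_TRUTH = {
    "epiccolorado", "laurasrecipes", "humaneboulder",
    "fishnette", "kate30_cu", "boulderchannel",
}


def ground_truth_hits(ranking, ground_truth, k):
    """Return the ground-truth users found in the top k of a ranking."""
    top_k = [name.lower() for name in ranking[:k]]
    return [name for name in top_k if name in ground_truth]


def precision_at_k(ranking, ground_truth, k):
    """Fraction of the top k ranked users that are ground-truth users."""
    return len(ground_truth_hits(ranking, ground_truth, k)) / k


# Hypothetical ranking: "user_a" and "user_b" are placeholders.
ranking = ["user_a", "fishnette", "kate30_CU", "user_b", "epiccolorado"]
found = ground_truth_hits(ranking, GROUND_TRUTH, 4)
score = precision_at_k(ranking, GROUND_TRUTH, 4)
print(found, score)
```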

TWITTER API AND DATA COLLECTION
  Search+Track+REST
    Unique users for a given event
  Profiles
    Periodic collection
  Friends/Followers
    Periodic collection
  Tweets
    One-time collection
  Limitations
    Rate limits, multi-threading
    Improper SQL query

TWEET STATS
Stat                       First Day       First Week
# Tweets (total)           12,147          2,314,700
# Users                    398             13,955
Avg. Tweets/user           30.5            165.9
Med. Tweets/user           9.0             38.0
# Hashtags (total)         7,422           756,785
# Hashtags (unique)        895             66,765
Avg. Hashtag occurrence    8.3             11.3
Med. Hashtag occurrence    1.0             1.0
# Mentions (total)         7,877           1,224,851
Avg. Mentions/User         19.9            87.8
Med. Mentions/User         1.0             1.0
# Users mentioning others  308 (77.39%)    11,036 (79.08%)

TWEET STATS (CONT.)
Stat                         First Day       First Week
# Addressed Msgs.            2,291 (18.85%)  368,047 (15.90%)
# Users addressing msgs.     227 (57.04%)    8,404 (60.22%)
# Re-tweet Msgs.             3,994 (32.88%)  504,836 (21.81%)
# Users re-tweeted (global)  1,456           134,204
# Users re-tweeted (fire)    356 (24.45%)    2,085 (1.55%)
# URLs (unique)              4,105           1,200,927
# Source applications        85              1,026
# Users giving location      30 (7.53%)      858 (6.14%)
# Tweets with location       172 (1.42%)     17,093 (0.77%)
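A few of the derived figures in the tweet statistics can be reproduced from the raw counts. The sketch below recomputes three first-day percentages and the average tweets per user; the choice of denominators is inferred from the reported values rather than stated on the slides, and the helper `pct` and the toy list are assumptions for illustration only.

```python
from statistics import mean, median


def pct(part, whole):
    """Percentage rounded to two decimals, as reported in the tables."""
    return round(100.0 * part / whole, 2)


# First-day counts copied from the tables.
tweets, users = 12147, 398
users_mentioning = 308
retweet_msgs = 3994
users_retweeted_global, users_retweeted_fire = 1456, 356

avg_tweets_per_user = round(tweets / users, 1)
pct_users_mentioning = pct(users_mentioning, users)
pct_retweet_msgs = pct(retweet_msgs, tweets)
pct_retweeted_fire = pct(users_retweeted_fire, users_retweeted_global)

# Toy per-user tweet counts (not thesis data): one heavy user pulls the
# mean far above the median, the same pattern as 30.5 vs 9.0 in the table.
toy_counts = [1, 2, 9, 9, 130]
toy_mean, toy_median = mean(toy_counts), median(toy_counts)

print(avg_tweets_per_user, pct_users_mentioning, pct_retweet_msgs,
      pct_retweeted_fire, toy_mean, toy_median)
```

The gap between average and median values throughout the tables suggests that a small number of very active users account for much of the activity.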

GRAPH STATS
Timezone: Mountain
Snapshot             Users (fire)  Users (all)  Edges (fire)  Edges (all)
2010-09-07 12:40:01  448           821,609      3,142         1,510,036
2010-09-08 12:40:01  1,631         2,292,929    25,193        5,361,650
2010-09-09 12:40:01  1,623         2,295,885    25,484        5,370,451
2010-09-10 12:40:01  1,622         2,300,838    25,664        5,372,597
2010-09-11 15:10:01  4,093         4,075,573    87,539        30,458,948

LOCATION DATA – U.S.

LOCATION DATA – DENVER METRO

LOCATION DATA – BOULDER, LONGMONT, BROOMFIELD

USER “FISHNETTE” DATA - AGGREGATE HOURLY TWEET COUNTS

USER “FISHNETTE” DATA – AGGREGATE MONTHLY TWEET COUNTS

HASHTAG COUNTS

ADDRESSED MESSAGES

RE-TWEETS

FINDING INFLUENCERS - RANKINGS
  Tweets
    Number of tweets
    Username mentions
    Number of re-tweets
  Graph
    In-degree
    HITS
      All users (sorted by frequency)
      Active users
      Mentions
      Addressed messages (replies)
    Context-specific in-degree
      Global followers count
      Active edges (pre-existing network)
      New Edges
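The tweet-based rankings listed above (number of tweets, username mentions, number of re-tweets) amount to counting per user and sorting in descending order. A minimal sketch follows, using made-up tweets; the usernames, the regular expressions, and the decision to count a re-tweet's "RT @user" as a mention as well are assumptions, not details taken from the thesis.

```python
import re
from collections import Counter

# Hypothetical (author, text) pairs for illustration.
tweets = [
    ("alice", "Evacuation map posted #boulderfire"),
    ("bob", "RT @alice: Evacuation map posted #boulderfire"),
    ("carol", "@alice is the shelter open? #boulderfire"),
    ("bob", "Thanks @carol and @alice"),
]

tweet_counts = Counter()    # tweets written by each user
mention_counts = Counter()  # times each user is mentioned
retweet_counts = Counter()  # times each user is re-tweeted

for author, text in tweets:
    tweet_counts[author] += 1
    retweet = re.match(r"RT @(\w+)", text)
    if retweet:
        retweet_counts[retweet.group(1).lower()] += 1
    for name in re.findall(r"@(\w+)", text):
        mention_counts[name.lower()] += 1


def rank(counter):
    """Usernames sorted by count descending, ties broken alphabetically."""
    return [user for user, _ in sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))]


print(rank(tweet_counts), rank(mention_counts), rank(retweet_counts))
```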

RANKINGS - NUMBER OF TWEETS

RANKINGS – USERNAME MENTIONS

RANKINGS – RE-TWEETS

RANKINGS – IN-DEGREE (FOLLOWERS)

HYPERLINK-INDUCED TOPIC SEARCH (HITS)
  Hubs
    Those that link to many authorities
  Authorities
    Those that are linked to by many hubs
  Process
    Calculate the principal eigenvector of two matrices
      Followers adjacency matrix (authorities)
      Friends adjacency matrix (hubs)
    Iterative
    Rankings by highest value descending in eigenvectors
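The iterative process described above can be sketched as the standard HITS power iteration, in which hub and authority scores are updated from each other and normalized until they settle. This is a generic textbook version, not the thesis implementation: the function name `hits`, the fixed iteration count, the edge direction (follower, followed), and the tiny example graph are all assumptions.

```python
import math


def hits(edges, iterations=50):
    """Iterative HITS on a list of (follower, followed) edges."""
    nodes = sorted({node for edge in edges for node in edge})
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of a user's followers.
        new_auth = {n: 0.0 for n in nodes}
        for follower, followed in edges:
            new_auth[followed] += hub[follower]
        norm = math.sqrt(sum(v * v for v in new_auth.values())) or 1.0
        auth = {n: v / norm for n, v in new_auth.items()}
        # Hub score: sum of the authority scores of the users one follows.
        new_hub = {n: 0.0 for n in nodes}
        for follower, followed in edges:
            new_hub[follower] += auth[followed]
        norm = math.sqrt(sum(v * v for v in new_hub.values())) or 1.0
        hub = {n: v / norm for n, v in new_hub.items()}
    return hub, auth


# Hypothetical follow graph: a, b and d follow c; a and b also follow e.
edges = [("a", "c"), ("b", "c"), ("d", "c"), ("a", "e"), ("b", "e")]
hub, auth = hits(edges)

# Rank by highest value, descending, as on the slide.
authority_ranking = sorted(auth, key=lambda n: (-auth[n], n))
hub_ranking = sorted(hub, key=lambda n: (-hub[n], n))
print(authority_ranking, hub_ranking)
```

In this toy graph, user c ranks first as an authority because it is followed by the most hubs, while a and b tie as the strongest hubs because they follow both authorities.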

RANKINGS – HITS – ALL USERS

RANKINGS – HITS – ACTIVE USERS

RANKINGS – HITS – MENTIONS

RANKINGS – HITS – ADDRESSED MSGS.

CONTEXT-SPECIFIC IN-DEGREE RANKING
  Global followers count
    Periodically download user profiles
    Calculate change in followers count for each snapshot
    Rank based on overall change, descending
  Active edges (includes pre-existing edges)
    Periodically download friend/follower lists
    Calculate change in followers count for each snapshot
    Rank based on overall change, descending
  New Edges
    Periodically download friend/follower lists
    Calculate change in followers count for each snapshot
      Do not count edges that existed prior to the start of the event
    Rank based on overall change, descending
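Two of the variants above can be sketched from periodic snapshots. The code below is an illustration under stated assumptions, not the thesis implementation: snapshots are plain Python dicts and sets, the overall change is taken as last snapshot minus first (the sum of per-snapshot changes), the first edge snapshot stands in for the pre-event network, and only edges between event-active users are counted. The function names and data are hypothetical.

```python
from collections import Counter


def rank_by_follower_change(profile_snapshots):
    """Rank users by overall change in global followers count.

    profile_snapshots: list of {user: followers_count}, oldest first.
    """
    first, last = profile_snapshots[0], profile_snapshots[-1]
    change = {user: last[user] - first[user] for user in last if user in first}
    return sorted(change, key=lambda u: (-change[u], u))


def rank_by_new_edges(edge_snapshots, active_users):
    """Rank users by followers gained since the baseline snapshot.

    edge_snapshots: list of sets of (follower, followed), oldest first.
    Edges already present in the first snapshot are not counted.
    """
    new_edges = edge_snapshots[-1] - edge_snapshots[0]
    gained = Counter()
    for follower, followed in new_edges:
        if follower in active_users and followed in active_users:
            gained[followed] += 1
    return sorted(gained, key=lambda u: (-gained[u], u))


# Hypothetical data: a globally popular account and a local one.
profiles = [
    {"celebrity": 100000, "local": 200},
    {"celebrity": 100050, "local": 260},
]
edges = [
    {("x", "celebrity"), ("y", "celebrity")},
    {("x", "celebrity"), ("y", "celebrity"),
     ("x", "local"), ("y", "local"), ("z", "celebrity")},
]
active = {"x", "y", "z", "celebrity", "local"}

follower_ranking = rank_by_follower_change(profiles)
new_edge_ranking = rank_by_new_edges(edges, active)
print(follower_ranking, new_edge_ranking)
```

Ranking on change rather than on absolute followers count lets the local account outrank the far larger one, which is the point of measuring in-degree within the context of the event.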

RANKINGS – GLOBAL FOLLOWERS COUNT

RANKINGS – ACTIVE EDGES

RANKINGS – NEW EDGES

LIMITATIONS AND MODIFICATIONS
  On-going influence
    Can only measure when a user becomes influential
  Global popularity masking local influence
    User “andrewhyde”
  News and bot activity
    Extra data needed to ignore these users
  Large events
    Data collection limitations
  How important is a de-follow?
    Can identify individual user activity
  Identifying the sheep
    Can equivalently count friends (out-links) created

CONCLUSIONS
  Notions of influence and interaction are heavily dependent on social network features
  No agreement on definitions
  Influence measured by features not 100% in use
  Or features not used in the same way by everyone
  Composability problem
  HITS ranking no better than global in-degree
  Context-specific in-degree ranking good!
    Needs to be tested on multiple events of varying sizes

FUTURE WORK
  Understanding “baseline” behavior
    For users active (using keywords) during an event
    Calculate all given statistics for a user (Klout.com?)
  Lots of ways to cut the data
    Composable factors/measures/attributes
  Explaining new links created
    Models for searching, re-tweeting, hashtags, #ff, etc.
  Incorporating blogs, forums, news websites
  Real-time vs not
  Informing algorithms with other techniques
    NLP and more automation
    Qualitative analysis (crowdsourcing?)

THANKS! QUESTIONS?

REPUTATION
  Definitions?
  Scores
    Composability
  Explicit reputation
    Ratings, votes
  Implicit reputation
    Client
    Server

VALIDATION
  Ground truth
    Authorities
    Armies of grad students
    Crowd-sourcing?
  More data
    Cross-referencing
      News websites
      Blogs
      Public health and safety (or other)

SECURITY
  Malicious users
    Inflation of reputation
    Sybil attacks
  Reporting
    Audience?
    Anonymization
