
Post on 10-Aug-2015


Our Data, Ourselves

Research on Online Digital Cultures — Community Extraction from Twitter Networks by Markov Clustering

Department of Digital Humanities

Giles Greenway

Tobias Blanke

Jenifer Pybus

Mark Cote

A “mobile-data commons”?

• Can we write an app to capture the data-trails that smartphones transmit to third parties and make them available?

• NO.

• This would require rooting the phones. An Android phone is a Linux system, where the end user typically doesn't have admin rights.

• If the app reaches a mass audience, we cannot expect users to root their phones. Some rooting software contains malware, and we cannot ensure that users root their devices safely: http://tinyurl.com/weidmandroid

A “mobile-data commons”?

20 young coders from Young RewiredState (YRS) were issued with Android smartphones.

The phones were pre-loaded with our “MobileMiner” app, which logs app network traffic, GSM cells, app notifications and Wi-Fi network connections.

The data is logged by a CKAN server, and also made available to users on their devices.

Twitter accounts were also scraped.

Is net activity a proxy for app usage?

Sometimes...

Is net activity a proxy for app usage?

Sometimes not...

Some apps use analytics / ad services continually. This provoked a workshop on app reverse-engineering and network traffic capture: http://kingsbsd.github.io/DroidDestructionKit

Notifications as a proxy for social network usage.

[Figure: Twitter Network Degree vs Notifications. Horizontal axis: number of notifications (0 to 1200). Vertical axis: friends / followers count (0 to 1200). Series: Friends, Followers.]

Twitter sends notifications based on the people you follow: the more notifications, the more friends.

Questions to ask of Twitter

● How many different “tribes” does the average teenage hacker have?

● What do they tweet about?

● Do they use it conversationally? What's the distribution of lengths of chains of tweets and replies?
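The last question can be made concrete with a small sketch. It assumes each tweet records the id of the tweet it replies to (Twitter's v1.1 API exposes this as `in_reply_to_status_id`); the `reply_to` mapping and its ids below are made up for illustration and are not project data.

```python
from collections import Counter

# Hypothetical data: tweet id -> id of the tweet it replies to (None if not a reply).
reply_to = {1: None, 2: 1, 3: 2, 4: None, 5: 4, 6: None, 7: 3}

def chain_length(tweet_id, reply_to):
    """Count the tweets from this one back to the start of its conversation."""
    length = 1
    while reply_to.get(tweet_id) is not None:
        tweet_id = reply_to[tweet_id]
        length += 1
    return length

# A chain ends at a tweet that nobody in the data set replied to.
replied_to = {parent for parent in reply_to.values() if parent is not None}
leaves = [t for t in reply_to if t not in replied_to]

# Maps chain length -> number of chains of that length.
distribution = Counter(chain_length(t, reply_to) for t in leaves)
print(sorted(distribution.items()))
```

Measuring from the last tweet of each chain avoids counting every prefix of a long conversation as a separate chain.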

Need a community-detection algorithm:

● Easy to implement.

● Can be explained to non-technical cultural-studies academics in three slides!

● Returns realistic communities.

Markov Clustering - MCL

● There are clusters of Twitter users with densely connected networks of friend/follower relationships.

• If you take a random walk around the network, you are likely to stay within the cluster you started in.

http://www.micans.org/mcl/
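The random-walk intuition can be checked with a short simulation. This is a sketch on a made-up toy graph (two triangles joined by a single edge), not on the Twitter data; the `neighbours` table, the walk length of three steps and the fixed seed are all assumptions chosen for illustration.

```python
import random

# Hypothetical toy graph: triangles 0-1-2 and 3-4-5 joined by the edge 2-3.
neighbours = {
    0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
    3: [2, 4, 5], 4: [3, 5], 5: [3, 4],
}
home_cluster = {0, 1, 2}

def walk(start, steps, rng):
    """Follow `steps` uniformly random edges from `start`; return the end node."""
    node = start
    for _ in range(steps):
        node = rng.choice(neighbours[node])
    return node

rng = random.Random(0)  # fixed seed so the run is repeatable
trials = 10_000
stayed = sum(walk(0, 3, rng) in home_cluster for _ in range(trials))
fraction_stayed = stayed / trials
print(f"fraction of walks ending in the starting cluster: {fraction_stayed:.2f}")
```

For this toy graph roughly four walks in five end inside the triangle they started in, because only one of the seven edges leaves it.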

MCL - A Trivial Example

1: Build an adjacency matrix for the graph.

2: Normalize the columns to produce transition probabilities.
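Steps 1 and 2 might look like this in NumPy. The example graph from the slide is not preserved in the transcript, so the edge list below is a hypothetical stand-in (two triangles joined by one edge). Adding self-loops is a common MCL convention and is an assumption here, not something the slide states.

```python
import numpy as np

# Hypothetical toy graph: triangles 0-1-2 and 3-4-5 joined by the edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
n = 6

# Step 1: adjacency matrix (undirected, so symmetric).
A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

# Assumption: add self-loops, as MCL implementations commonly do.
A += np.eye(n)

# Step 2: divide every column by its sum, so M[i, j] is the
# probability of stepping from node j to node i.
M = A / A.sum(axis=0, keepdims=True)
print(M.round(2))
```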

MCL - A Trivial Example

3: Square the matrix to get probabilities after two steps.
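A minimal illustration of why squaring gives two-step probabilities, using a hypothetical three-node path graph 0-1-2 (without self-loops, to keep the arithmetic easy to follow by hand):

```python
import numpy as np

# Column-stochastic transition matrix for the path 0 - 1 - 2.
# Column j holds the probabilities of moving from node j to each node.
M = np.array([
    [0.0, 0.5, 0.0],
    [1.0, 0.0, 1.0],
    [0.0, 0.5, 0.0],
])

# Matrix product, not element-wise: sums over every intermediate node.
M2 = M @ M

# Starting at node 0, the walker must go to node 1 and then to
# node 0 or node 2 with equal probability.
print(M2[:, 0])
```

Each column of `M2` still sums to 1, so it remains a valid set of transition probabilities.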

MCL - A Trivial Example

4: Element-wise square the matrix and re-normalize.

5: Rinse and repeat until convergence.

After convergence the matrix entries will be 0 or 1. Interpret rows as: “If I'm in this row node, which column nodes are credible start-points?”
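Putting the five steps together, a minimal dense-matrix sketch could look like the following. It is not the project's own implementation: the function names `mcl` and `clusters_from`, the self-loops, the tolerance and the toy graph are all assumptions made for illustration.

```python
import numpy as np

def mcl(adjacency, inflation=2.0, max_iter=100, tol=1e-9):
    """Sketch of Markov Clustering on a dense adjacency matrix."""
    # Assumption: self-loops are added, a common MCL convention.
    M = adjacency + np.eye(len(adjacency))
    M = M / M.sum(axis=0, keepdims=True)           # steps 1-2: normalize columns
    for _ in range(max_iter):
        expanded = M @ M                           # step 3: two-step probabilities
        inflated = expanded ** inflation           # step 4: element-wise power...
        inflated /= inflated.sum(axis=0, keepdims=True)  # ...and re-normalize
        if np.allclose(inflated, M, atol=tol):     # step 5: stop when nothing changes
            break
        M = inflated
    return M

def clusters_from(M, threshold=0.01):
    """Read each non-empty row as one cluster: its members are the columns."""
    found = {tuple(int(j) for j in np.flatnonzero(row > threshold))
             for row in M if row.max() > threshold}
    return sorted(list(c) for c in found)

# Hypothetical toy graph: triangles 0-1-2 and 3-4-5 joined by the edge 2-3.
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]:
    A[i, j] = A[j, i] = 1.0

clusters = clusters_from(mcl(A))
print(clusters)
```

With `inflation=2.0` step 4 is exactly the element-wise square from the slide; larger values tend to produce more, smaller clusters. Dense matrix products scale poorly with network size, which is consistent with the remark in the conclusions that MCL is slow.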

MCL - Does it work?

MCL was applied to two Twitter accounts of digital culture researchers with ~7000 once-removed friend-follower relationships.

Gephi's “OpenOrd” layout is meant to emphasise clusters. Are nodes in the same cluster close together?

Compare with Gephi's own “modularity algorithm”, the Louvain method.

MCL - Does it work?

[Figure: two panels, labelled MCL and Louvain.]

MCL - Does it work?

Louvain: Twitter accounts in the same cluster are placed close together.

MCL: Accounts in the same cluster are scattered.

This suggests that Louvain performed better than MCL.

WRONG!

MCL - Does it work?

Why did Gephi/Louvain put these two in the same modularity class / cluster?

Researchers rated clusters for both methods:

Rating                                                 MCL   Louvain
Cluster is identifiable and relevant.                  20%   0% (!)
Cluster is not identifiable, but possibly relevant.    37%
Cluster is neither identifiable nor relevant.          43%

MCL - Does it work?

Why did the Louvain method perform poorly?

- The Louvain method works by combining smaller clusters to maximize modularity. Does the very high degree of Twitter networks harm its performance? One wrongly-placed Twitter account pulls in many others.

Why was the OpenOrd layout misleading?

- Both OpenOrd and Louvain work by combining smaller clusters. Both are vulnerable to the same problems.
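For reference, modularity (the quantity the Louvain method tries to maximize) can be computed directly for any proposed partition. The sketch below uses the standard definition Q = (1/2m) Σ_ij [A_ij - k_i k_j / 2m] δ(c_i, c_j) on a hypothetical toy graph; it only scores partitions and is not an implementation of the Louvain method itself.

```python
import numpy as np

def modularity(A, labels):
    """Newman modularity of a partition of an undirected, unweighted graph."""
    k = A.sum(axis=1)                      # node degrees
    two_m = k.sum()                        # twice the number of edges
    same = labels[:, None] == labels[None, :]   # True where i and j share a cluster
    return ((A - np.outer(k, k) / two_m) * same).sum() / two_m

# Hypothetical toy graph: triangles 0-1-2 and 3-4-5 joined by the edge 2-3.
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]:
    A[i, j] = A[j, i] = 1.0

q_good = modularity(A, np.array([0, 0, 0, 1, 1, 1]))   # the two triangles
q_bad = modularity(A, np.array([0, 0, 1, 1, 1, 1]))    # node 2 misplaced
print(q_good, q_bad)
```

On this small graph a single misplaced node lowers the score noticeably. Whether the same effect explains the poor Louvain results on high-degree Twitter networks remains the open question posed above.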

MCL - Does it work?

MCL can suggest plausible Twitter communities. Can it find pre-existing ones?

Repeat for the YRS volunteers:

[Figure: two panels, labelled MCL and Louvain.]

MCL - Does it work?

Do the Twitter accounts of 9 YRS volunteers end up in the same cluster?

MCL: Mostly...

Cluster size   20  26   6   6   5  45   5  319   6   5  14  14   5
YRS accounts    0   0   0   0   0   1   0    8   0   0   0   0   0

Louvain: Not so much...

Cluster size   15  78   7  43  168  67  55  230  24
YRS accounts    0   1   0   0    0   3   2    3   0

[ ~4% probability of allocating 8 Twitter users to the largest MCL cluster by chance. ]
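The transcript does not say how the ~4% figure was derived. One reading that reproduces it, offered here as an assumption, is to place 8 accounts at random, each landing in a cluster with probability proportional to that cluster's size:

```python
from math import comb

# MCL cluster sizes from the table above.
mcl_cluster_sizes = [20, 26, 6, 6, 5, 45, 5, 319, 6, 5, 14, 14, 5]
total_accounts = sum(mcl_cluster_sizes)
largest = max(mcl_cluster_sizes)

# Assumption 1: each of the 8 accounts is placed independently.
p_independent = (largest / total_accounts) ** 8

# Assumption 2: 8 distinct accounts are drawn without replacement.
p_without_replacement = comb(largest, 8) / comb(total_accounts, 8)

print(f"{p_independent:.3f} {p_without_replacement:.3f}")
```

Both variants come out at roughly 4%, so the choice between them does not change the conclusion.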

Is inferring from layouts always problematic?

-Of course not!

Theban scribes with common contracting parties

Source: Silke Vanbeselaere http://tinyurl.com/thebanscribes

What do the clusters tweet about?

Top tags for the MCL clusters (tag, then count):

Cluster size 6, YRS accounts 0:
dotnetnotts 18, TechNott 10, NottsTest 8, JavaScript 2, hack24 2, ukbestworkplace 2

Cluster size 45, YRS accounts 1:
GE2015 78, Eurovision2015 58, leadersdebate 33, bdw2015 24, BattleForNumber10 21, BBCQT 18, GBR 15, bbcqt 14, eurovision 14, NHTG15 13, FoC2015 12, YRSAmbassadors 11, depop 11, BBCFreeSpeech 10, VoteConverative 9, YRS2014 9, DimblebyLecture 9, endpointcon 9

Cluster size 319, YRS accounts 8:
GE2015 275, tech 214, jobs 207, YRS2014 185, Haunted 183, ghosts 183, YRSFoc 181, hackmcr 167, yrs2014 156, Arduino 149, FoC2015 141, Norwich 133, gamedev 132, TG 130, BigData 112, linux 111, YRSHyperlocal 105, design 99
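A tag ranking like this can be produced with a simple counter. The sketch below is one possible approach, not the project's code: the tweet texts, the cluster names and the `top_tags` helper are invented for illustration. Tags are counted case-sensitively, which matches the way BBCQT and bbcqt appear as separate entries above.

```python
import re
from collections import Counter

# Hypothetical tweets, grouped by the cluster their author was assigned to.
tweets_by_cluster = {
    "A": ["Voting today #GE2015", "Watching #BBCQT #GE2015", "#bbcqt again"],
    "B": ["Building a robot #Arduino #YRS2014", "#Arduino meetup tonight"],
}

HASHTAG = re.compile(r"#(\w+)")

def top_tags(tweets, n=3):
    """Return the n most frequent hashtags as (tag, count) pairs."""
    counts = Counter(tag for text in tweets for tag in HASHTAG.findall(text))
    return counts.most_common(n)

for cluster, tweets in tweets_by_cluster.items():
    print(cluster, top_tags(tweets))
```

Lower-casing tags before counting would merge variants such as YRS2014 and yrs2014, at the cost of no longer matching the figures listed above.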

Conclusions:

● Acquire Twitter data with Twython/Celery/Redis/RabbitMQ.

● Store Twitter data with Neo4j/Py2Neo.

● Perform MCL with NumPy.

● Export to Gephi with NetworkX.

● Gephi and the Louvain method are fine tools; use them carefully!

● MCL is very effective (if slow) at extracting Twitter communities.

● Numerical techniques should be easy to justify and validate.

● Visualizations are powerful, persuasive, and sometimes misleading! (“Beware of geeks bearing .gifs!”)

The tools:

Download our app: http://kingsbsd.github.io/MobileMiner

Follow us on Twitter: @KingsBSD

Read our blog: http://big-social-data.net/

Read about our data: http://tinyurl.com/miningmobileyouth

Slideshare: http://www.slideshare.net/kingsBSD/