Social Computing Research with Apache Spark

Social Computing Research with Apache Spark
Hadoop and Big Data Meetup Manchester, July 2015
Dr. Matthew Rowe
Lecturer in Social Computing | M.Sc. Data Science Director
http://www.lancaster.ac.uk/staff/rowem/ | [email protected] | @mrowebot

Social Computing Research

The investigation of how and why social behaviour occurs in computational systems

https://prinsayn.files.wordpress.com/2013/01/tile.jpg

Social Computing Team

Small team comprised of myself + 3 Ph.D. students
Researching a range of social computing topics:
•  Churn prediction
•  User engagement in social systems
•  Recommender systems
•  Information diffusion
•  Digital Accountability

Common theme: investigating and applying data mining techniques to large-scale data

'Culture' Parallel Processing Cluster

Contains 12 rack-mounted Dell servers:
•  1 x 4 core with 64 GB RAM
•  1 x 6 core with 64 GB RAM
•  10 x 2 core with 16 GB RAM

Cloudera + Apache Mesos installed on the cluster to provide access to:
•  Apache Hadoop Stack (HDFS, HBase)
•  Apache Spark
•  RabbitMQ (for custom distributed processing apps), e.g. parallelised parameter tuning for Recommender Systems

Language Innovation Diffusion on Social Media

Diffusion of Language Innovation

Language innovation can take various forms:
•  Neologisms (e.g. brah)
•  Word blends (e.g. downvoting, cooldown)
•  Shortening (e.g. ur)

Studying the adoption of such innovations is hard:
•  Interviews with different communities
•  Travelling between different locations
•  Relies on understanding the agent, the social structure + their interplay

Social media allows language spread to be investigated at scale:
•  To understand who influences whom
•  To understand the language of brands' audiences

Computing Language Innovation Diffusion

1.  Variation in Term Frequency: probability of a term being used in a context, where a context is a time (week) and a community (e.g. subreddit)

2.  Variation in Term Form: probability of a term having a suffix or prefix added (as a word blend)

Goal: Look for significant increases & decreases in term frequency and form:
(i) globally in a system
(ii) locally in communities
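The two measures above can be sketched in plain Python. This is a minimal illustration, not the method used in the talk: the toy posts, the whitespace tokenisation, the substring test for term form, and the function names `term_probabilities` and `form_probability` are all assumptions. The slides do not say which significance test was applied, so the sketch stops at the raw probabilities.

```python
from collections import Counter

# Hypothetical toy posts: (week, community, text). Not data from the talk.
posts = [
    (1, "gaming", "that cooldown is long brah"),
    (1, "gaming", "long queue today"),
    (2, "gaming", "cooldown again brah brah"),
    (2, "news", "cool story but downvoting is common"),
]

def term_probabilities(posts):
    """Return P(term | context), where context = (week, community)."""
    term_counts = Counter()     # (term, context) -> occurrences
    context_totals = Counter()  # context -> total number of tokens
    for week, community, text in posts:
        context = (week, community)
        for token in text.lower().split():
            term_counts[(token, context)] += 1
            context_totals[context] += 1
    return {key: n / context_totals[key[1]] for key, n in term_counts.items()}

def form_probability(posts, stem):
    """Share of tokens containing `stem` that carry an added prefix or suffix."""
    variants = [
        token
        for _, _, text in posts
        for token in text.lower().split()
        if stem in token
    ]
    if not variants:
        return 0.0
    return sum(token != stem for token in variants) / len(variants)

probs = term_probabilities(posts)

# Week-on-week change of "brah" within one community (a local context).
change = probs[("brah", (2, "gaming"))] - probs[("brah", (1, "gaming"))]
print(change)
print(form_probability(posts, "cool"))
```

Dropping the community from the context key would give the global, system-wide variant of the same probability.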

The Role of Apache Spark

1.  Collect datasets from Twitter and Reddit
2.  Identify innovations
3.  Compute frequency and form values
4.  Write significant increases + decreases to HDFS

1.  Point to TSV file in HDFS
2.  Map: return <<term, context>, value> pairs as RDD
3.  ReduceByKey: merge <<term, context>, value> pairs as RDD
4.  Identify significant contextual increases + decreases

Note: the key here is a tuple
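The map and reduce-by-key steps can be emulated in plain Python to show why a tuple key works. This is a sketch under stated assumptions: the TSV column layout, the sample rows, and the helper names `to_pair` and `reduce_by_key` are hypothetical, and the real job would run on a Spark RDD rather than a local list. In PySpark the equivalent shape would be roughly `sc.textFile(path).map(to_pair).reduceByKey(add)`.

```python
from operator import add

# Hypothetical TSV rows: term <TAB> week <TAB> community <TAB> count
tsv_lines = [
    "brah\t1\tgaming\t3",
    "brah\t1\tgaming\t2",
    "brah\t2\tgaming\t9",
    "ur\t1\tnews\t4",
]

def to_pair(line):
    """Map step: parse one row into a ((term, context), value) pair."""
    term, week, community, count = line.split("\t")
    return ((term, (int(week), community)), int(count))

def reduce_by_key(pairs, func):
    """Merge the values of all pairs sharing a key, as reduceByKey does."""
    merged = {}
    for key, value in pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return merged

totals = reduce_by_key(map(to_pair, tsv_lines), add)

for key, value in sorted(totals.items()):
    print(key, value)
```

Tuples are hashable, so a nested `(term, (week, community))` key groups values exactly like a simple string key would.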

Increase in Frequency

Increase in Form

Investigating UK Web Filters (A Data Science Approach)

UK Web Filtering: Default-on

Collateral Filtering

How accurate are the filters? What is being overblocked and underblocked?

Censorship Monitoring Project

Open Rights Group constructed a system of probes to check URLs across ISPs for blocking.
Lancaster’s goal: build a system to gauge filters’ accuracy and categories of blocks.

Computing per-ISP accuracy

Broadcasting DMOZ Category RDD <URL, topics>

Pseudo-classifiers in Apache Spark

https://github.com/openrightsgroup/cmp-analysis

1.  Point to DMOZ JSON file in HDFS
2.  Map: return <URL, topics> pairs as RDD
3.  ReduceByKey: merge <URL, topics> as RDD
4.  Point to Adult DMOZ JSON file in HDFS
5.  Map: return <URL, topics> pairs as RDD
6.  ReduceByKey: merge <URL, topics> as RDD
7.  Take union of RDD objects
8.  Broadcast RDD to cluster
9.  Point to probe request file in HDFS
10.  Retrieve DMOZ RDD from broadcast
11.  Map: return <ISP, Result> RDD w/ pseudo-classifiers
12.  ReduceByKey: merge <ISP, Result> as RDD
13.  Collect map and compute per-ISP accuracy
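The per-ISP accuracy computation can be sketched in plain Python. This is an illustration under assumptions, not the project's code: it treats each ISP filter as a classifier and assumes a URL should be blocked when its DMOZ topics include "Adult". The sample URLs, ISP names, outcome labels, and the helpers `classify`, `merge_counts`, and `per_isp_accuracy` are hypothetical. The dictionary `dmoz_topics` stands in for the broadcast <URL, topics> lookup; in Spark such a lookup would typically be collected to the driver as a map before being broadcast.

```python
# Hypothetical stand-in for the broadcast <URL, topics> lookup built from DMOZ.
dmoz_topics = {
    "adult-example.test": {"Adult"},
    "news-example.test": {"News"},
    "health-example.test": {"Health"},
}

# Hypothetical probe results: (ISP, URL, blocked?)
probe_results = [
    ("isp_a", "adult-example.test", True),
    ("isp_a", "news-example.test", False),
    ("isp_a", "health-example.test", True),   # overblock
    ("isp_b", "adult-example.test", False),   # underblock
    ("isp_b", "news-example.test", False),
]

def classify(record, topics_by_url):
    """Map step: label one probe result against the assumed ground truth."""
    isp, url, blocked = record
    should_block = "Adult" in topics_by_url.get(url, set())
    if blocked and should_block:
        outcome = "correct_block"
    elif blocked:
        outcome = "overblock"
    elif should_block:
        outcome = "underblock"
    else:
        outcome = "correct_allow"
    return (isp, {outcome: 1})

def merge_counts(a, b):
    """Reduce step: add two outcome-count dictionaries key by key."""
    return {k: a.get(k, 0) + b.get(k, 0) for k in set(a) | set(b)}

def per_isp_accuracy(records, topics_by_url):
    """Emulate map + reduceByKey, then compute the share of correct outcomes."""
    merged = {}
    for isp, counts in (classify(r, topics_by_url) for r in records):
        merged[isp] = merge_counts(merged[isp], counts) if isp in merged else counts
    return {
        isp: (c.get("correct_block", 0) + c.get("correct_allow", 0)) / sum(c.values())
        for isp, c in merged.items()
    }

accuracy = per_isp_accuracy(probe_results, dmoz_topics)
print(accuracy)
```

Keeping the four outcome counts separate until the final step means overblock and underblock rates can be read from the same merged counts as accuracy.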

What did we find?

For examples of overblocks & underblocks: https://github.com/openrightsgroup/cmp-analysis/tree/master/data/output

•  30% to 82% of sites are underblocked
•  2% to 6% of sites are overblocked

M.Sc. Data Science

Questions?

Web: http://www.lancaster.ac.uk/staff/rowem/ (for publications and current projects)
Code: https://github.com/mattroweshow/
Email: [email protected]
Twitter: @mrowebot
