Social Computing Research with Apache Spark
Social Computing Research with Apache Spark
Hadoop and Big Data Meetup Manchester, July 2015
Dr. Matthew Rowe
Lecturer in Social Computing | M.Sc. Data Science Director
http://www.lancaster.ac.uk/staff/rowem/ | [email protected] | @mrowebot
Social Computing Research
The investigation of how and why social behaviour occurs in computational systems
https://prinsayn.files.wordpress.com/2013/01/tile.jpg
Social Computing Team
Small team comprising myself + 3 Ph.D. students
Researching a range of social computing topics:
• Churn prediction
• User engagement in social systems
• Recommender systems
• Information diffusion
• Digital accountability
Common theme: investigating and applying data mining techniques to large-scale data
'Culture' Parallel Processing Cluster
Contains 12 rack-mounted Dell servers:
• 1 x 4 core with 64 GB RAM
• 1 x 6 core with 64 GB RAM
• 10 x 2 core with 16 GB RAM
Cloudera + Apache Mesos installed on the cluster to provide access to:
• Apache Hadoop Stack (HDFS, HBase)
• Apache Spark
• RabbitMQ (for custom distributed processing apps)
• E.g. Parallelised parameter tuning for Recommender Systems
Diffusion of Language Innovation
Language innovation can take various forms:
• Neologisms (e.g. brah)
• Word blends (e.g. downvoting, cooldown)
• Shortening (e.g. ur)
Studying the adoption of such innovations is hard:
• Interviews with different communities
• Travelling between different locations
• Relies on understanding the agent, the social structure + their interplay
Social media allows language spread to be investigated at scale:
• To understand who influences whom
• To understand the language of brands' audiences
Computing Language Innovation Diffusion
1. Variation in Term Frequency: probability of a term being used in a context, where a context is a time (week) and a community (e.g. subreddit)
2. Variation in Term Form: probability of a term having a suffix or prefix added (as a word blend)
Goal: Look for significant increases & decreases in term frequency and form:
(i) globally in a system (ii) locally in communities
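The frequency measure above can be illustrated with a small plain-Python sketch. It estimates the probability of a term within each (week, community) context from raw counts, then flags week-over-week changes. The slides do not say how significance is tested, so the relative-change threshold here is only a stand-in assumption, and the function names, the community name and the counts are all hypothetical.

```python
from collections import defaultdict

def term_probabilities(counts):
    """Turn {(term, week, community): n} into P(term | week, community)."""
    totals = defaultdict(int)
    for (_term, week, community), n in counts.items():
        totals[(week, community)] += n
    return {
        (term, week, community): n / totals[(week, community)]
        for (term, week, community), n in counts.items()
    }

def flag_changes(probs, threshold=0.5):
    """Flag terms whose probability moved by at least `threshold`
    (relative to the previous week) within the same community.
    The threshold is an illustrative assumption, not the slides' test."""
    flagged = {}
    for (term, week, community), p in probs.items():
        prev = probs.get((term, week - 1, community))
        if not prev:  # no previous week to compare against
            continue
        change = (p - prev) / prev
        if abs(change) >= threshold:
            flagged[(term, week, community)] = change
    return flagged

# Hypothetical token counts for one community over two weeks.
counts = {
    ("brah", 1, "r/fitness"): 2, ("the", 1, "r/fitness"): 98,
    ("brah", 2, "r/fitness"): 10, ("the", 2, "r/fitness"): 90,
}
probs = term_probabilities(counts)
changes = flag_changes(probs)
```

Dropping the community from the key gives the global (system-wide) variant; keeping it gives the local, per-community variant.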
The Role of Apache Spark
Collect datasets from Twitter and Reddit
Identify innovations
Compute frequency and form values
Write significant increases + decreases to HDFS
Point to TSV file in HDFS
Map: return <<term, context>, value> pairs as RDD
ReduceByKey: merge <<term, context>, value> pairs as RDD
Identify significant contextual increases + decreases
Note: the key here is a tuple
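The Map and ReduceByKey steps above can be sketched locally in plain Python, without a Spark installation. The TSV column layout (term, week, community, value), the sample rows and the helper names are assumptions made for illustration; `reduce_by_key` is a local stand-in for Spark's `reduceByKey`, not the talk's code.

```python
from operator import add

def parse_line(line):
    """Map step: 'term<TAB>week<TAB>community<TAB>value' (assumed layout)
    becomes ((term, context), value), where context is itself a tuple."""
    term, week, community, value = line.rstrip("\n").split("\t")
    return (term, (int(week), community)), float(value)

def reduce_by_key(pairs, func):
    """Local stand-in for reduceByKey: merge values that share a key."""
    merged = {}
    for key, value in pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return merged

# Hypothetical rows standing in for the TSV file in HDFS.
lines = [
    "brah\t1\tr/fitness\t1",
    "brah\t1\tr/fitness\t2",
    "ur\t1\tr/gaming\t5",
]
pairs = map(parse_line, lines)
totals = reduce_by_key(pairs, add)
```

Because tuples are hashable, a nested tuple such as `(term, (week, community))` works directly as a key. In PySpark the same shape would be roughly `sc.textFile(path).map(parse_line).reduceByKey(add)`, where `path` is a placeholder for the TSV location.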
Censorship Monitoring Project
Open Rights Group constructed a system of probes to check URLs across ISPs for blocking
Lancaster's goal: build a system to gauge filters' accuracy and categories of blocks
Computing per-ISP accuracy
Broadcasting DMOZ Category RDD <URL, topics>
Pseudo-classifiers in Apache Spark
https://github.com/openrightsgroup/cmp-analysis
Point to DMOZ JSON file in HDFS
Map: return <URL, topics> pairs as RDD
ReduceByKey: merge <URL, topics> as RDD
Point to Adult DMOZ JSON file in HDFS
Map: return <URL, topics> pairs as RDD
ReduceByKey: merge <URL, topics> as RDD
Take union of RDD objects
Broadcast RDD to cluster
Point to probe request file in HDFS
Retrieve DMOZ RDD from broadcast
Map: return <ISP, Result> RDD w/ pseudo-classifiers
ReduceByKey: merge <ISP, Result> as RDD
Collect map and compute per-ISP accuracy
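The flow above can be approximated in plain Python to show its shape: build a URL-to-topics lookup from the two DMOZ sources, judge each probe result with a pseudo-classifier, and aggregate per ISP. The slides do not define the pseudo-classifiers, so the rule used here (a URL should be blocked if any of its topics starts with "Adult") is an assumption, as are the record layouts, ISP names, URLs and function names. The project's actual analysis code is in the repository linked above.

```python
from collections import defaultdict

ADULT_PREFIX = "Adult"  # assumption: adult DMOZ topics start with this

def build_lookup(*datasets):
    """Union several (url, topics) datasets and merge topics per URL."""
    lookup = defaultdict(set)
    for dataset in datasets:
        for url, topics in dataset:
            lookup[url].update(topics)
    return dict(lookup)

def should_block(topics):
    """Hypothetical pseudo-classifier: block if any topic is adult."""
    return any(topic.startswith(ADULT_PREFIX) for topic in topics)

def judge(probe, lookup):
    """Map step: (isp, url, blocked) becomes (isp, result), or None
    when the URL has no DMOZ entry to compare against."""
    isp, url, blocked = probe
    if url not in lookup:
        return None
    expected = should_block(lookup[url])
    if blocked == expected:
        result = "correct"
    elif blocked:
        result = "overblock"
    else:
        result = "underblock"
    return isp, result

def per_isp_accuracy(probes, lookup):
    """Reduce step: count results per ISP, then compute accuracy."""
    counts = defaultdict(lambda: defaultdict(int))
    for probe in probes:
        judged = judge(probe, lookup)
        if judged is not None:
            isp, result = judged
            counts[isp][result] += 1
    return {isp: c["correct"] / sum(c.values()) for isp, c in counts.items()}

# Hypothetical data standing in for the DMOZ files and the probe results.
general = [("http://example.org/news", ["News/Media"])]
adult = [("http://example.org/adult", ["Adult/Image_Galleries"])]
lookup = build_lookup(general, adult)
probes = [
    ("ISP-A", "http://example.org/news", False),
    ("ISP-A", "http://example.org/adult", False),
    ("ISP-B", "http://example.org/news", True),
    ("ISP-B", "http://example.org/adult", True),
    ("ISP-B", "http://example.org/unknown", True),
]
accuracy = per_isp_accuracy(probes, lookup)
```

In Spark, the lookup would typically be collected to the driver (for example with `collectAsMap`) and shared with `sc.broadcast`, so that each worker can read it inside the map step without a join.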
What did we find?
For examples of overblocks & underblocks: https://github.com/openrightsgroup/cmp-analysis/tree/master/data/output
30% to 82% of sites are underblocked
2% to 6% of sites are overblocked
Questions?
Web: http://www.lancaster.ac.uk/staff/rowem/ (for publications and current projects)
Code: https://github.com/mattroweshow/
Email: [email protected]
Twitter: @mrowebot