Social Computing Research with Apache Spark

Social Computing Research with Apache Spark
Hadoop and Big Data Meetup Manchester, July 2015
Dr. Matthew Rowe
Lecturer in Social Computing | M.Sc. Data Science Director
http://www.lancaster.ac.uk/staff/rowem/ | [email protected] | @mrowebot

Upload: matthew-rowe

Post on 17-Aug-2015



Social Computing Research

The investigation of how and why social behaviour occurs in computational systems

https://prinsayn.files.wordpress.com/2013/01/tile.jpg

Social Computing Team

Small team comprising myself + 3 Ph.D. students
Researching a range of social computing topics:
•  Churn prediction
•  User engagement in social systems
•  Recommender systems
•  Information diffusion
•  Digital accountability

Common theme: investigating and applying data mining techniques to large-scale data

'Culture' Parallel Processing Cluster

Contains 12 rack-mounted Dell servers:
•  1 x 4 core with 64 GB RAM
•  1 x 6 core with 64 GB RAM
•  10 x 2 core with 16 GB RAM
Cloudera + Apache Mesos installed on the cluster to provide access to:
•  Apache Hadoop stack (HDFS, HBase)
•  Apache Spark
•  RabbitMQ (for custom distributed processing apps), e.g. parallelised parameter tuning for recommender systems

Language Innovation Diffusion on Social Media

Diffusion of Language Innovation

Language innovation can take various forms:
•  Neologisms (e.g. brah)
•  Word blends (e.g. downvoting, cooldown)
•  Shortening (e.g. ur)
Studying the adoption of such innovations is hard:
•  Interviews with different communities
•  Travelling between different locations
•  Relies on understanding the agent, the social structure + their interplay

Social media allows language spread to be investigated at scale:
•  To understand who influences whom
•  To understand the language of brands' audiences

Computing Language Innovation Diffusion

1.  Variation in Term Frequency: probability of a term being used in a context, i.e. a time (week) and a community (e.g. subreddit)

2.  Variation in Term Form: probability of a term having a suffix or prefix added (as a word blend)

Goal: look for significant increases & decreases in term frequency and form:
(i) globally in a system
(ii) locally in communities
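One way to write the two probabilities down, as a hedged sketch rather than the definitions used in the talk: the notation below (n, n_blend, and the maximum-likelihood form of each estimate) is an assumption, not something taken from the slides.

```latex
% Assumed notation (not from the slides):
%   c = (week, community) is a context
%   n(t, c)        counts all uses of term t in context c, bare or blended
%   n_blend(t, c)  counts the uses of t in c that carry an added prefix or suffix
P_{\text{freq}}(t \mid c) = \frac{n(t, c)}{\sum_{t'} n(t', c)}
\qquad
P_{\text{form}}(t \mid c) = \frac{n_{\text{blend}}(t, c)}{n(t, c)}
```

Under these assumptions, a "significant increase" for a term would be a large jump in either quantity between consecutive weeks, measured over the whole system (global) or within one community (local).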

The Role of Apache Spark

Overall workflow:
•  Collect datasets from Twitter and Reddit
•  Identify innovations
•  Compute frequency and form values
•  Write significant increases + decreases to HDFS

Spark job:
•  Point to TSV file in HDFS
•  Map: return <<term, context>, value> pairs as RDD
•  ReduceByKey: merge <<term, context>, value> pairs as RDD
•  Identify significant contextual increases + decreases

Note: the key here is a tuple
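The map and reduce-by-key steps with a tuple key can be sketched in plain Python, with no cluster required. This is an illustration under stated assumptions, not the code from the talk: the TSV column layout, the sample rows, the `reduce_by_key` helper (a stand-in for Spark's `reduceByKey`) and the ratio threshold used to call a change "significant" are all hypothetical.

```python
# Hypothetical TSV rows: term <TAB> week <TAB> community <TAB> count
TSV_LINES = [
    "brah\t1\tr/fitness\t2",
    "brah\t1\tr/fitness\t3",
    "brah\t2\tr/fitness\t20",
    "ur\t1\tr/gaming\t10",
    "ur\t2\tr/gaming\t9",
]


def to_pair(line):
    """Map step: parse one TSV row into a ((term, context), value) pair."""
    term, week, community, count = line.split("\t")
    # The key is a tuple: (term, (week, community)).
    return ((term, (int(week), community)), int(count))


def reduce_by_key(pairs, merge):
    """Plain-Python stand-in for reduceByKey: merge values sharing a key."""
    merged = {}
    for key, value in pairs:
        merged[key] = merge(merged[key], value) if key in merged else value
    return merged


def significant_changes(totals, ratio=2.0):
    """Flag week-over-week changes by an assumed ratio threshold."""
    changes = []
    for (term, (week, community)), value in sorted(totals.items()):
        previous = totals.get((term, (week - 1, community)))
        if not previous:  # no baseline for the previous week
            continue
        if value >= ratio * previous:
            changes.append((term, community, week, "increase"))
        elif value * ratio <= previous:
            changes.append((term, community, week, "decrease"))
    return changes


totals = reduce_by_key(map(to_pair, TSV_LINES), lambda a, b: a + b)
changes = significant_changes(totals)
print(changes)
```

In PySpark the first two steps would look roughly like `sc.textFile(path).map(to_pair).reduceByKey(lambda a, b: a + b)`, where `path` points at the TSV file in HDFS. Tuples work as keys because they are hashable.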

Increase in Frequency

Increase in Form

Investigating UK Web Filters (A Data Science Approach)

UK Web Filtering: Default-on

Collateral Filtering

How accurate are the filters?
What is being overblocked and underblocked?

Censorship Monitoring Project

Open Rights Group constructed a system of probes to check URLs across ISPs for blocking
Lancaster's goal: build a system to gauge filters' accuracy and categories of blocks

Pseudo-classifiers in Apache Spark

https://github.com/openrightsgroup/cmp-analysis

Broadcasting DMOZ Category RDD <URL, topics>:
•  Point to DMOZ JSON file in HDFS
•  Map: return <URL, topics> pairs as RDD
•  ReduceByKey: merge <URL, topics> as RDD
•  Point to Adult DMOZ JSON file in HDFS
•  Map: return <URL, topics> pairs as RDD
•  ReduceByKey: merge <URL, topics> as RDD
•  Take union of RDD objects
•  Broadcast RDD to cluster

Computing per-ISP accuracy:
•  Point to probe request file in HDFS
•  Retrieve DMOZ RDD from broadcast
•  Map: return <ISP, Result> RDD w/ pseudo-classifiers
•  ReduceByKey: merge <ISP, Result> as RDD
•  Collect map and compute per-ISP accuracy
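The per-ISP accuracy step can be sketched in plain Python. This is an interpretation, not the project's code (which lives in the repository linked from this slide). It assumes a "pseudo-classifier" treats each ISP's block decision as a prediction and the DMOZ topics as ground truth, with a site expected to be blocked only when it carries an "Adult" topic. The sample URLs, ISP names, record layout and `reduce_by_key` helper are all hypothetical.

```python
from collections import Counter

# Hypothetical lookup standing in for the broadcast <URL, topics> data.
DMOZ_TOPICS = {
    "http://example.org/news": {"News"},
    "http://example.org/adult": {"Adult"},
    "http://example.org/shop": {"Shopping"},
}

# Hypothetical probe results: (isp, url, blocked)
PROBES = [
    ("isp_a", "http://example.org/news", False),
    ("isp_a", "http://example.org/adult", True),
    ("isp_a", "http://example.org/shop", True),
    ("isp_b", "http://example.org/adult", False),
    ("isp_b", "http://example.org/news", False),
    ("isp_b", "http://example.org/unknown", True),
]


def to_result(probe, topics_by_url):
    """Map step: label one probe as correct, overblock or underblock."""
    isp, url, blocked = probe
    topics = topics_by_url.get(url)
    if topics is None:  # no DMOZ label, so the probe cannot be scored
        return None
    should_block = "Adult" in topics  # assumed ground-truth rule
    if blocked == should_block:
        outcome = "correct"
    elif blocked:
        outcome = "overblock"
    else:
        outcome = "underblock"
    return (isp, Counter({outcome: 1}))


def reduce_by_key(pairs, merge):
    """Plain-Python stand-in for reduceByKey: merge values sharing a key."""
    merged = {}
    for key, value in pairs:
        merged[key] = merge(merged[key], value) if key in merged else value
    return merged


scored = (to_result(p, DMOZ_TOPICS) for p in PROBES)
results = reduce_by_key((s for s in scored if s is not None), lambda a, b: a + b)
accuracy = {isp: c["correct"] / sum(c.values()) for isp, c in results.items()}
print(accuracy)
```

In Spark the lookup would typically be collected to the driver, shared with `sc.broadcast(...)`, and read inside the map function through the broadcast variable's `.value`.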

What did we find?

For examples of overblocks & underblocks: https://github.com/openrightsgroup/cmp-analysis/tree/master/data/output

30% to 82% of sites are underblocked
2% to 6% of sites are overblocked

M.Sc. Data Science

Questions?

Web: http://www.lancaster.ac.uk/staff/rowem/ (for publications and current projects)
Code: https://github.com/mattroweshow/
Email: [email protected]
Twitter: @mrowebot