Large Scale Social Analytics on Wikipedia, Delicious, and Twitter (presented at IBM NPUC 2010)


Upload: ed-chi

Post on 14-Jun-2015


DESCRIPTION

Ed H. Chi, Palo Alto Research Center

Large-Scale Social Analytics in Wikipedia, Delicious, and Twitter

Abstract

We will illustrate an analytical research approach in social computing. Our research in Augmented Social Cognition is aimed at enhancing the ability of a group of people to remember, think, and reason. The drive to build models and theories for social computing research should further our understanding of how network science, behavioral economics, and evolutionary theories could explain how social systems work. Here we will summarize the published research we conducted on large-scale social analytics in Wikipedia, Delicious, and Twitter, and point out how social analytics can help us understand the intricacies of large social systems.

About the Speaker

Ed H. Chi is area manager and principal scientist at Palo Alto Research Center's Augmented Social Cognition Group. He leads the group in understanding how Web 2.0 and social computing systems help groups of people to remember, think, and reason. Ed completed his three degrees (B.S., M.S., and Ph.D.) in 6.5 years at the University of Minnesota and has been doing research on user interface software systems since 1993. He has been featured and quoted in the press, including the Economist, Time Magazine, the LA Times, and the Associated Press. With 20 patents and over 70 research articles, he has won awards for both teaching and research. In his spare time, Ed is an avid Taekwondo martial artist, photographer, and snowboarder.

TRANSCRIPT

Page 1: Large Scale Social Analytics on Wikipedia, Delicious, and Twitter (presented at IBM NPUC 2010)

Image from: http://www.flickr.com/photos/ourcommon/480538715/

Ed H. Chi, Principal Scientist and Area Manager

Peter Pirolli, Lichan Hong, Bongwon Suh, Les Nelson, Gregorio Convertino, Sharoda Paul

Interns: Sanjay Kairam, Jilin Chen, Brent Hecht, Michael Bernstein
Alumni: Raluca Budiu, Bryan Pendleton, Niki Kittur, Todd Mytkowicz, Terrell Russell, Brynn Evans, Bryan Chan, KMRC students

Augmented Social Cognition Area
Palo Alto Research Center

Page 2: Large Scale Social Analytics on Wikipedia, Delicious, and Twitter (presented at IBM NPUC 2010)

2010-10-22 IBM NPUC 2010 2

To: [email protected]
From: Brad Barrish <brad@…removed.for.privacy….com>
Subject: Pancreatic cancer
Date: Thu, 1 Feb 2007 21:37:55 PST

Hey Ed. I'm a fellow del.icio.us user and noticed you bookmark a lot of pancreatic cancer stuff. I'm at home with my dad who was diagnosed a little over a year ago and is now at the tale end of things. I've learned a lot through his treatments and about what's out there. I dunno if it's something you or a family member has, but just wanted to drop you an email. Be well.

Brad  

Page 3: Large Scale Social Analytics on Wikipedia, Delicious, and Twitter (presented at IBM NPUC 2010)

  Cognition: the ability to remember, think, and reason; the faculty of knowing.

  Social Cognition: the ability of a group to remember, think, and reason; the construction of knowledge structures by a group.
    – (not quite the same as in the branch of psychology that studies the cognitive processes involved in social interaction, though included)

  Augmented Social Cognition: Supported by systems, the enhancement of the ability of a group to remember, think, and reason; the system-supported construction of knowledge structures by a group.

Citation: Chi, IEEE Computer, Sept 2008

Page 4: Large Scale Social Analytics on Wikipedia, Delicious, and Twitter (presented at IBM NPUC 2010)

Kudos to Todd Mytkowicz and Rowan Nairn

Page 5: Large Scale Social Analytics on Wikipedia, Delicious, and Twitter (presented at IBM NPUC 2010)

[Diagram labels: Topics, Concepts; Users, Documents; Tags T1…Tn; Encoding, Decoding; Noise]

Page 6: Large Scale Social Analytics on Wikipedia, Delicious, and Twitter (presented at IBM NPUC 2010)

H(Tag): shows tag saturation

H(Doc | Tag): browsability

Page 7: Large Scale Social Analytics on Wikipedia, Delicious, and Twitter (presented at IBM NPUC 2010)

I(Doc; Tag): mutual information

Rise in average tags per bookmark
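The three quantities named on these slides, H(Tag), H(Doc | Tag), and I(Doc; Tag), can be estimated from raw bookmark data. The following is a minimal sketch, assuming each bookmark is reduced to a (document, tag) pair and probabilities are plain relative frequencies; the function names and the toy data are illustrative and are not taken from the talk.

```python
import math
from collections import Counter

def entropy(counts):
    """Shannon entropy in bits of a distribution given as raw counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def tagging_information(bookmarks):
    """Return (H(Tag), H(Doc|Tag), I(Doc;Tag)) for a list of (doc, tag) pairs."""
    tag_counts = Counter(tag for _, tag in bookmarks)
    doc_counts = Counter(doc for doc, _ in bookmarks)
    pair_counts = Counter(bookmarks)
    h_tag = entropy(tag_counts.values())
    h_doc = entropy(doc_counts.values())
    h_joint = entropy(pair_counts.values())
    h_doc_given_tag = h_joint - h_tag          # chain rule: H(Doc,Tag) = H(Tag) + H(Doc|Tag)
    mutual_info = h_doc - h_doc_given_tag      # I(Doc;Tag) = H(Doc) - H(Doc|Tag)
    return h_tag, h_doc_given_tag, mutual_info

# Hypothetical toy data: four bookmarks over three documents and two tags.
bookmarks = [
    ("d1", "python"), ("d1", "tutorial"),
    ("d2", "python"), ("d3", "tutorial"),
]
h_tag, h_doc_given_tag, mi = tagging_information(bookmarks)
print(h_tag, h_doc_given_tag, mi)
```

A lower H(Doc | Tag) means a tag narrows the set of candidate documents more, which is one way to read the slide's "browsability" label.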

Page 8: Large Scale Social Analytics on Wikipedia, Delicious, and Twitter (presented at IBM NPUC 2010)

[Figure: Semantic Similarity Graph. Tag nodes shown: Guide, Web, Howto, Tips, Help, Tools, Tip, Tricks, Tutorial, Tutorials, Reference]

Page 9: Large Scale Social Analytics on Wikipedia, Delicious, and Twitter (presented at IBM NPUC 2010)

  Spreading Activation in a bi-graph
  Computation over a very large data set
    – 150 Million+ bookmarks

[Diagram: bipartite graph of Tags and URLs, with edges labeled P(URL|Tag) and P(Tag|URL)]
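Spreading activation over the tag/URL bi-graph can be sketched as alternating hops weighted by P(URL|Tag) and P(Tag|URL). The code below is a small in-memory illustration, assuming both conditional probabilities are estimated from co-occurrence counts of (url, tag) pairs; the function names are hypothetical, and the production computation over 150 million bookmarks would need a distributed implementation rather than this one.

```python
from collections import defaultdict

def build_bigraph(bookmarks):
    """Estimate P(URL|Tag) and P(Tag|URL) from (url, tag) co-occurrence counts."""
    tag_to_url = defaultdict(lambda: defaultdict(float))
    url_to_tag = defaultdict(lambda: defaultdict(float))
    for url, tag in bookmarks:
        tag_to_url[tag][url] += 1.0
        url_to_tag[url][tag] += 1.0
    # Normalize each row so it sums to 1 (a conditional distribution).
    for table in (tag_to_url, url_to_tag):
        for row in table.values():
            total = sum(row.values())
            for key in row:
                row[key] /= total
    return tag_to_url, url_to_tag

def step(activation, transition):
    """Push activation one hop across the bipartite graph."""
    spread = defaultdict(float)
    for node, weight in activation.items():
        for neighbor, prob in transition.get(node, {}).items():
            spread[neighbor] += weight * prob
    return dict(spread)

def spread_from_tag(tag, bookmarks, rounds=2):
    """Start with all activation on one tag; alternate tag->URL and URL->tag hops."""
    tag_to_url, url_to_tag = build_bigraph(bookmarks)
    tags, urls = {tag: 1.0}, {}
    for _ in range(rounds):
        urls = step(tags, tag_to_url)
        tags = step(urls, url_to_tag)
    return tags, urls

# Hypothetical toy data.
demo = [("u1", "howto"), ("u1", "tutorial"), ("u2", "tutorial"), ("u2", "guide")]
related_tags, related_urls = spread_from_tag("howto", demo, rounds=2)
print(related_tags, related_urls)
```

After two rounds, activation that started on "howto" has reached "guide" through the shared tag "tutorial", which is the kind of indirect association a semantic similarity graph of tags relies on.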

Page 10: Large Scale Social Analytics on Wikipedia, Delicious, and Twitter (presented at IBM NPUC 2010)


Page 11: Large Scale Social Analytics on Wikipedia, Delicious, and Twitter (presented at IBM NPUC 2010)

Kudos to Bongwon Suh, Niki Kittur

Page 12: Large Scale Social Analytics on Wikipedia, Delicious, and Twitter (presented at IBM NPUC 2010)

What  drives  contributions  to  Wikipedia?  

  Conflict drives most of the contributions to Wikipedia.
    – How do we measure conflict?

  Conflict causes coordination costs to go up.
    – Measuring coordination costs

Page 13: Large Scale Social Analytics on Wikipedia, Delicious, and Twitter (presented at IBM NPUC 2010)


Page 14: Large Scale Social Analytics on Wikipedia, Delicious, and Twitter (presented at IBM NPUC 2010)

[Figure: editor groups labeled Mediators; Sympathetic to parents; Sympathetic to husband; Anonymous (vandals/spammers)]

Page 15: Large Scale Social Analytics on Wikipedia, Delicious, and Twitter (presented at IBM NPUC 2010)

  Counting ‘Controversial’ labels
  5x cross-validation, R² = 0.897

[Scatter plot: predicted controversial revisions (x-axis, 0 to 10,000) vs. actual controversial revisions (y-axis, 0 to 10,000)]
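For readers unfamiliar with how a cross-validated R² of this kind is obtained, here is a minimal sketch. It assumes a single hypothetical page feature and ordinary least squares as a stand-in; the regression model and features actually used in the study are not stated on the slide.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def cross_validated_r2(xs, ys, folds=5):
    """Predict every point from a model trained on the other folds, then score R^2."""
    predictions = [0.0] * len(xs)
    for k in range(folds):
        test_idx = set(range(k, len(xs), folds))   # every folds-th point is held out
        train = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i not in test_idx]
        slope, intercept = fit_line([x for x, _ in train], [y for _, y in train])
        for i in test_idx:
            predictions[i] = slope * xs[i] + intercept
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, predictions))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot

# Hypothetical, perfectly linear data: held-out predictions are exact.
xs = list(range(20))
ys = [3 * x + 2 for x in xs]
r2 = cross_validated_r2(xs, ys, folds=5)
print(r2)
```

Because every prediction comes from a model that never saw that point, the resulting R² measures how well predicted counts track actual counts on unseen articles.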

Page 16: Large Scale Social Analytics on Wikipedia, Delicious, and Twitter (presented at IBM NPUC 2010)

Number of Articles (Log Scale)

http://en.wikipedia.org/wiki/Wikipedia:Modelling_Wikipedia’s_growth


Page 17: Large Scale Social Analytics on Wikipedia, Delicious, and Twitter (presented at IBM NPUC 2010)

Monthly Edits


Page 18: Large Scale Social Analytics on Wikipedia, Delicious, and Twitter (presented at IBM NPUC 2010)

Monthly Edits


Page 19: Large Scale Social Analytics on Wikipedia, Delicious, and Twitter (presented at IBM NPUC 2010)

Monthly Active Editors (in thousands)

Page 20: Large Scale Social Analytics on Wikipedia, Delicious, and Twitter (presented at IBM NPUC 2010)

Monthly Active Editors (in thousands)

Page 21: Large Scale Social Analytics on Wikipedia, Delicious, and Twitter (presented at IBM NPUC 2010)

  Preferential Attachment: Edits beget edits
    – the more previous edits, the more new edits

Growth rate depends on:
  N = current population
  r = per-capita growth rate of the population

$$\frac{dN}{dt} = r \cdot N$$

where dN/dt is the growth rate of the population and N is the current population.

$$N(t) = N_0 \cdot e^{rt}$$
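The closed form follows from the growth equation on this slide by separation of variables. A short derivation using the slide's symbols, with N' and τ introduced only as integration variables:

```latex
\frac{dN}{dt} = rN
\;\Longrightarrow\;
\int_{N_0}^{N(t)} \frac{dN'}{N'} = \int_{0}^{t} r \, d\tau
\;\Longrightarrow\;
\ln\frac{N(t)}{N_0} = rt
\;\Longrightarrow\;
N(t) = N_0 \, e^{rt}
```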

Page 22: Large Scale Social Analytics on Wikipedia, Delicious, and Twitter (presented at IBM NPUC 2010)

  Ecological population growth model
    – Growth also depends on environmental conditions
    – K, carrying capacity (due to resource limitation)

$$\frac{dN}{dt} = rN\left(1 - \frac{N}{K}\right)$$
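To see how the carrying capacity K changes the shape of growth, the logistic equation can be evaluated directly. The sketch below gives the standard closed-form solution and a forward-Euler check; the parameter values are hypothetical and are not fitted to Wikipedia data.

```python
import math

def logistic_population(t, n0, r, k):
    """Closed-form solution of dN/dt = r*N*(1 - N/K) with N(0) = n0."""
    return k / (1.0 + ((k - n0) / n0) * math.exp(-r * t))

def euler_logistic(n0, r, k, t_end, dt=0.001):
    """Integrate the same equation numerically with forward Euler steps."""
    n = n0
    for _ in range(int(round(t_end / dt))):
        n += dt * r * n * (1.0 - n / k)
    return n

# Hypothetical parameters: start at 10, growth rate 0.5, carrying capacity 1000.
for t in (0, 5, 10, 20, 40):
    print(t, round(logistic_population(t, n0=10, r=0.5, k=1000), 1))
```

Early on the curve is close to the exponential N0·e^(rt); as N approaches K the factor (1 − N/K) shrinks toward zero and growth levels off.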

Page 23: Large Scale Social Analytics on Wikipedia, Delicious, and Twitter (presented at IBM NPUC 2010)

  Follows a logistic growth curve

[Plot label: New Article]

Page 24: Large Scale Social Analytics on Wikipedia, Delicious, and Twitter (presented at IBM NPUC 2010)

  Biological system
    – Competition increases as the population hits the limits of the ecology
    – Advantage goes to members of the population that have competitive dominance over others

  Analogy
    – Limited opportunities to make novel contributions
    – Increased patterns of conflict and dominance

Page 25: Large Scale Social Analytics on Wikipedia, Delicious, and Twitter (presented at IBM NPUC 2010)

Monthly Ratio of Reverted Edits


Page 26: Large Scale Social Analytics on Wikipedia, Delicious, and Twitter (presented at IBM NPUC 2010)


Page 27: Large Scale Social Analytics on Wikipedia, Delicious, and Twitter (presented at IBM NPUC 2010)

Kudos to Brent Hecht, Jilin Chen, Bongwon Suh, Lichan Hong

Page 28: Large Scale Social Analytics on Wikipedia, Delicious, and Twitter (presented at IBM NPUC 2010)
Page 29: Large Scale Social Analytics on Wikipedia, Delicious, and Twitter (presented at IBM NPUC 2010)
Page 30: Large Scale Social Analytics on Wikipedia, Delicious, and Twitter (presented at IBM NPUC 2010)

n = 10,000 users with 5 or more tweets

All Users Who Manually Specified Location

Page 31: Large Scale Social Analytics on Wikipedia, Delicious, and Twitter (presented at IBM NPUC 2010)

n = 3,311 users with 5 or more tweets

Users w/ No Useful Location Information Manually Entered

Page 32: Large Scale Social Analytics on Wikipedia, Delicious, and Twitter (presented at IBM NPUC 2010)

Schrute Farms User ID 39111154

User ID 75135928

NONE YA BISNESS!!

User ID 57987417

in jail...smh

not tellin you User ID 130681147

Page 33: Large Scale Social Analytics on Wikipedia, Delicious, and Twitter (presented at IBM NPUC 2010)

wherever justin wants me to be

User ID 71097545

User ID 77503970

Justin Biebers heart!

User ID 134222427

Jonasbieberland3

Bieber Island User ID 91705969

Page 34: Large Scale Social Analytics on Wikipedia, Delicious, and Twitter (presented at IBM NPUC 2010)

n = 10,000 users with 5 or more tweets

All Twitter Users

Page 35: Large Scale Social Analytics on Wikipedia, Delicious, and Twitter (presented at IBM NPUC 2010)

n = 2,965 users with 5 or more tweets

Users w/ Informative Location in the United States

Page 36: Large Scale Social Analytics on Wikipedia, Delicious, and Twitter (presented at IBM NPUC 2010)

California User ID 125271323

User ID 92455577

Skinny Jeans City, IL

User ID 92455577

Bieberville, California

East Jesus Nowhere, Indiana

User ID 26526957

Page 37: Large Scale Social Analytics on Wikipedia, Delicious, and Twitter (presented at IBM NPUC 2010)

All 1,698 Fake Locations Yahoo! Geocoder

Justin Biebers heart!

Page 38: Large Scale Social Analytics on Wikipedia, Delicious, and Twitter (presented at IBM NPUC 2010)

All 1,698 Fake Locations Yahoo! Geocoder

Justin Biebers heart!

Lat = 36.328785 Lon = -91.700189

Page 39: Large Scale Social Analytics on Wikipedia, Delicious, and Twitter (presented at IBM NPUC 2010)

Location of Justin Bieber’s Heart (Don’t Tell Your Teenage Daughters)

Page 40: Large Scale Social Analytics on Wikipedia, Delicious, and Twitter (presented at IBM NPUC 2010)
Page 41: Large Scale Social Analytics on Wikipedia, Delicious, and Twitter (presented at IBM NPUC 2010)

Country-scale

Multinomial naive Bayes classifier, 10-fold cross-validation

2.4x better than random

Page 42: Large Scale Social Analytics on Wikipedia, Delicious, and Twitter (presented at IBM NPUC 2010)

State-scale

Multinomial naive Bayes classifier, 20% held-out test set

2.2x better than random
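A multinomial naive Bayes classifier of the kind named on these two slides can be sketched in a few lines. This version assumes bag-of-words features from tweet text and add-one (Laplace) smoothing; the class name, the smoothing choice, and the toy training data are illustrative assumptions, not details from the talk.

```python
import math
from collections import Counter, defaultdict

class MultinomialNaiveBayes:
    """Multinomial naive Bayes over bag-of-words features with add-one smoothing."""

    def fit(self, documents, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        self.vocabulary = set()
        for words, label in zip(documents, labels):
            self.word_counts[label].update(words)
            self.vocabulary.update(words)
        return self

    def predict(self, words):
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            total_words = sum(self.word_counts[label].values())
            score = math.log(doc_count / total_docs)          # log prior
            for word in words:
                if word in self.vocabulary:                   # skip unseen words
                    count = self.word_counts[label][word]
                    score += math.log((count + 1) / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Hypothetical training tweets (already tokenized) labeled by state.
train_docs = [
    ["go", "lakers", "traffic", "la"], ["lakers", "beach"],
    ["cubs", "wind", "chicago"], ["cubs", "bears"],
]
train_labels = ["California", "California", "Illinois", "Illinois"]
model = MultinomialNaiveBayes().fit(train_docs, train_labels)
print(model.predict(["lakers", "game"]), model.predict(["cubs"]))
```

Comparing accuracy against a random-guess baseline, as the slides do with "2.4x" and "2.2x better than random", shows how much location signal the tweet text carries.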

Page 43: Large Scale Social Analytics on Wikipedia, Delicious, and Twitter (presented at IBM NPUC 2010)

  Which tweet features are associated with retweets?

  Retweet Model
    – # Retweet ~ function(f1, f2, …, fn), where fi are simple features extracted from a tweet

  74M tweets from Twitter Stream API
    – Characterization
    – 2~3% sample
    – Hadoop / HBase / MapReduce

Page 44: Large Scale Social Analytics on Wikipedia, Delicious, and Twitter (presented at IBM NPUC 2010)

Two Types of Features

Contextual Features
  # Followees: 395
  # Followers: 1,400
  # Favorite: 1,657
  # Day: (since June 17, 2008)
  # Past tweets: 21,000

Content Features
  URL
  Hashtag
  Mention
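The two feature types can be illustrated with a small extractor. This is a sketch under assumptions: the regular expressions are simplified approximations of how URLs, hashtags, and mentions appear in tweet text, and the function and field names are hypothetical rather than taken from the study's pipeline.

```python
import re

URL_PATTERN = re.compile(r"https?://\S+")
HASHTAG_PATTERN = re.compile(r"(?<!\w)#\w+")
MENTION_PATTERN = re.compile(r"(?<!\w)@\w+")

def content_features(text):
    """Binary content features: does the tweet contain a URL, hashtag, or mention?"""
    # Remove URLs first so a '#fragment' inside a link is not counted as a hashtag.
    without_urls = URL_PATTERN.sub(" ", text)
    return {
        "has_url": bool(URL_PATTERN.search(text)),
        "has_hashtag": bool(HASHTAG_PATTERN.search(without_urls)),
        "has_mention": bool(MENTION_PATTERN.search(without_urls)),
    }

def tweet_features(text, followers, followees, favorites, account_age_days, past_tweets):
    """Combine content features of the tweet with contextual features of its author."""
    features = content_features(text)
    features.update({
        "followers": followers,
        "followees": followees,
        "favorites": favorites,
        "account_age_days": account_age_days,
        "past_tweets": past_tweets,
    })
    return features

# Contextual counts echo the example on the slide; the account age is a made-up value.
example = tweet_features(
    "Reading http://example.com #hci with @edchi",
    followers=1400, followees=395, favorites=1657,
    account_age_days=800, past_tweets=21000,
)
print(example)
```

Feature dictionaries like this one would be the f1…fn inputs to a retweet model of the form "# Retweet ~ function(f1, f2, …, fn)".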

Page 45: Large Scale Social Analytics on Wikipedia, Delicious, and Twitter (presented at IBM NPUC 2010)

[Plot axes: Contextual Factor (x-axis), Content Factor (y-axis)]

Page 46: Large Scale Social Analytics on Wikipedia, Delicious, and Twitter (presented at IBM NPUC 2010)

Information Streams => Information Overload

ASC Social Recommender Engine

Page 47: Large Scale Social Analytics on Wikipedia, Delicious, and Twitter (presented at IBM NPUC 2010)

Recommendation Algorithm: Combining Sources and Models

[Diagram labels:
  Sources: My Friends’ URLs; Popular URLs
  Social Ranking Model, with input: My Friends’ Network and Tweeting Pattern
  Topic Relevance Model, with inputs: My Tweets; My Friends’ Tweets
  Output: Recommendations]
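One way to combine a topic relevance model with a social ranking model is a weighted sum of two scores per candidate URL. The sketch below is an assumption-laden illustration: it uses cosine similarity between word counts for topic relevance and the share of friends who posted a URL for social rank. The scoring functions, the weight, and the data layout are hypothetical and are not the algorithm behind the recommender described in the talk.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def recommend(my_tweets, candidates, top_n=3, social_weight=0.5):
    """Rank candidate URLs.

    candidates maps url -> {"words": [...], "friends": set of friends who posted it}.
    """
    profile = Counter(w for tweet in my_tweets for w in tweet.lower().split())
    max_friends = max((len(c["friends"]) for c in candidates.values()), default=0) or 1
    scored = []
    for url, info in candidates.items():
        topic = cosine(profile, Counter(info["words"]))     # topic relevance score
        social = len(info["friends"]) / max_friends         # social ranking score
        scored.append(((1 - social_weight) * topic + social_weight * social, url))
    scored.sort(reverse=True)
    return [url for _, url in scored[:top_n]]

# Hypothetical data.
my_tweets = ["wikipedia editing conflict", "social tagging research"]
candidates = {
    "a": {"words": ["wikipedia", "conflict", "study"], "friends": {"f1", "f2"}},
    "b": {"words": ["celebrity", "gossip"], "friends": {"f1"}},
    "c": {"words": ["tagging", "research"], "friends": set()},
}
ranking = recommend(my_tweets, candidates)
print(ranking)
```

Here URL "a" ranks first because it is both on-topic and shared by two friends, while "c" outranks "b" on topic relevance alone.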

Page 48: Large Scale Social Analytics on Wikipedia, Delicious, and Twitter (presented at IBM NPUC 2010)

  Hadoop Compute Cluster
    – 50 nodes, depending on project requirement
    – ~40TB storage capacity
    – Experience with HBase, Pig; interaction with Lucene, MySQL

  Large-scale crawling and analytics experience with
    – Wikipedia (all edits up to 2009)
    – Delicious data set (200M bookmarks)
    – Twitter (70M+ Tweets)

  Experience with Large Scale Social Analytics
    – Example 1: Visual analytics in Wikipedia (wikidashboard.com)
    – Example 2: Search engines for social bookmarks (mrtaggy.com)
    – Example 3: Recommenders for Twitter news (zerozero88.com)

Page 49: Large Scale Social Analytics on Wikipedia, Delicious, and Twitter (presented at IBM NPUC 2010)


Page 50: Large Scale Social Analytics on Wikipedia, Delicious, and Twitter (presented at IBM NPUC 2010)

Image from: http://www.flickr.com/photos/ourcommon/480538715/

  Research Vision: Understand how social computing systems can enhance the ability of a group of people to remember, think, and reason.

  Understand and support Collective Intelligence by modeling social group behaviors and testing prototype tools in Living Labs

http://asc-parc.blogspot.com
http://www.edchi.net
[email protected]