Large-Scale Social Analytics on Wikipedia, Delicious, and Twitter (presented at IBM NPUC 2010)
DESCRIPTION
Ed H. Chi, Palo Alto Research Center
Large-Scale Social Analytics in Wikipedia, Delicious, and Twitter
Abstract
We will illustrate an analytical research approach in social computing. Our research in Augmented Social Cognition is aimed at enhancing the ability of a group of people to remember, think, and reason. The drive to build models and theories for social computing research should further our understanding of how network science, behavioral economics, and evolutionary theories could explain how social systems work. Here we will summarize the published research we conducted on large-scale social analytics in Wikipedia, Delicious, and Twitter, and point out how social analytics can help us understand the intricacies of large social systems.
About the Speaker
Ed H. Chi is area manager and principal scientist at Palo Alto Research Center's Augmented Social Cognition Group. He leads the group in understanding how Web2.0 and Social Computing systems help groups of people to remember, think and reason. Ed completed his three degrees (B.S., M.S., and Ph.D.) in 6.5 years from University of Minnesota, and has been doing research on user interface software systems since 1993. He has been featured and quoted in the press, such as the Economist, Time Magazine, LA Times, and the Associated Press. With 20 patents and over 70 research articles, he has won awards for both teaching and research. In his spare time, Ed is an avid Taekwondo martial artist, photographer, and snowboarder.
TRANSCRIPT
Image from: http://www.flickr.com/photos/ourcommon/480538715/
Ed H. Chi, Principal Scientist and Area Manager
Peter Pirolli, Lichan Hong, Bongwon Suh, Les Nelson, Gregorio Convertino, Sharoda Paul
Interns: Sanjay Kairam, Jilin Chen, Brent Hecht, Michael Bernstein
Alumni: Raluca Budiu, Bryan Pendleton, Niki Kittur, Todd Mytkowicz, Terrell Russell, Brynn Evans, Bryan Chan, KMRC students
Augmented Social Cognition Area Palo Alto Research Center
2010-10-22 IBM NPUC 2010 2
To: [email protected] From: Brad Barrish <brad@…removed.for.privacy….com> Subject: Pancreatic cancer Date: Thu, 1 Feb 2007 21:37:55 PST
Hey Ed. I'm a fellow del.icio.us user and noticed you bookmark a lot of pancreatic cancer stuff. I'm at home with my dad who was diagnosed a little over a year ago and is now at the tale end of things. I've learned a lot through his treatments and about what's out there. I dunno if it's something you or a family member has, but just wanted to drop you an email. Be well.
Brad
Cognition: the ability to remember, think, and reason; the faculty of knowing.
Social Cognition: the ability of a group to remember, think, and reason; the construction of knowledge structures by a group.
– (not quite the same as in the branch of psychology that studies the cognitive processes involved in social interaction, though included)
Augmented Social Cognition: Supported by systems, the enhancement of the ability of a group to remember, think, and reason; the system-supported construction of knowledge structures by a group.
Citation: Chi, IEEE Computer, Sept 2008
Kudos to Todd Mytkowicz and Rowan Nairn
[Figure: tagging as a noisy communication channel. Labels: Topics, Concepts, Users, Documents, Tags (T1…Tn), Encoding, Decoding, Noise]
H(Tag): shows tag saturation
H(Doc | Tag): browsability
I(Doc; Tag): mutual information
Rise in average tags per bookmark
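The three quantities named on these slides, H(Tag), H(Doc | Tag), and I(Doc; Tag), can be estimated from raw bookmark records. The following is a minimal sketch, assuming each bookmark is reduced to a (document, tag) pair and that probabilities are plain relative frequencies; the function names and the toy data are hypothetical and are not taken from the talk.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy in bits of a distribution given as raw counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def tag_document_information(bookmarks):
    """Return H(Tag), H(Doc), H(Doc|Tag), I(Doc;Tag) for (doc, tag) pairs."""
    tag_counts = Counter(tag for _, tag in bookmarks)
    doc_counts = Counter(doc for doc, _ in bookmarks)
    pair_counts = Counter(bookmarks)

    h_tag = entropy(tag_counts.values())
    h_doc = entropy(doc_counts.values())
    h_joint = entropy(pair_counts.values())

    # Chain rule: H(Doc | Tag) = H(Doc, Tag) - H(Tag)
    h_doc_given_tag = h_joint - h_tag
    # I(Doc; Tag) = H(Doc) - H(Doc | Tag)
    mutual_info = h_doc - h_doc_given_tag
    return {
        "H(Tag)": h_tag,
        "H(Doc)": h_doc,
        "H(Doc|Tag)": h_doc_given_tag,
        "I(Doc;Tag)": mutual_info,
    }


# Hypothetical toy data: four documents, two tags, each tag used on two documents.
bookmarks = [("d1", "python"), ("d2", "python"), ("d3", "news"), ("d4", "news")]
stats = tag_document_information(bookmarks)
```

In this toy case a tag narrows four documents down to two, so one bit of uncertainty about the document remains after seeing the tag. A higher H(Doc | Tag) means a tag leaves more documents to browse through.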
Semantic Similarity Graph
[Figure: graph of related tags. Nodes: Guide, Web, Howto, Tips, Help, Tools, Tip, Tricks, Tutorial, Tutorials, Reference]
Spreading Activation in a bi-graph
Computation over a very large data set
– 150 Million+ bookmarks
[Figure: bipartite graph of Tags and URLs, with edges weighted by P(URL|Tag) and P(Tag|URL)]
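One way to read the bi-graph slide is that activation starts at a tag, flows to URLs in proportion to P(URL|Tag), and flows back to tags in proportion to P(Tag|URL). The sketch below shows a single round trip under that reading. It assumes the conditional probabilities are simple co-occurrence frequencies; the function names and toy bookmarks are hypothetical, and the production system behind the talk may differ.

```python
from collections import defaultdict


def build_conditionals(bookmarks):
    """Estimate P(URL|Tag) and P(Tag|URL) from (url, tag) co-occurrence counts."""
    tag_to_url = defaultdict(lambda: defaultdict(int))
    url_to_tag = defaultdict(lambda: defaultdict(int))
    for url, tag in bookmarks:
        tag_to_url[tag][url] += 1
        url_to_tag[url][tag] += 1

    def normalize(table):
        result = {}
        for key, row in table.items():
            total = sum(row.values())
            result[key] = {other: count / total for other, count in row.items()}
        return result

    return normalize(tag_to_url), normalize(url_to_tag)


def spread(activation, transitions):
    """Push activation one step across the bipartite graph."""
    out = defaultdict(float)
    for node, amount in activation.items():
        for neighbor, probability in transitions.get(node, {}).items():
            out[neighbor] += amount * probability
    return dict(out)


def related_tags(seed_tag, p_url_given_tag, p_tag_given_url):
    """Tag -> URLs -> tags: one round trip of spreading activation."""
    url_activation = spread({seed_tag: 1.0}, p_url_given_tag)
    return spread(url_activation, p_tag_given_url)


# Hypothetical toy bookmarks as (url, tag) pairs.
toy = [("u1", "howto"), ("u1", "tutorial"), ("u2", "howto"),
       ("u2", "tips"), ("u3", "news")]
p_url_given_tag, p_tag_given_url = build_conditionals(toy)
scores = related_tags("howto", p_url_given_tag, p_tag_given_url)
```

Tags that share URLs with the seed ("tutorial", "tips") receive activation, while "news" receives none, which is the kind of signal a semantic similarity graph of tags can be built from.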
Kudos to Bongwon Suh, Niki Kittur
What drives contributions to Wikipedia?
Conflicts drive most of the contributions to Wikipedia.
– How do we measure conflicts?
Conflicts cause coordination costs to go up.
– Measuring coordination costs
Mediators
Sympathetic to parents
Sympathetic to husband
Anonymous (vandals/spammers)
Counting ‘Controversial’ labels
5x cross-validation, R² = 0.897
[Figure: scatter plot of actual controversial revisions (vertical axis, 0 to 10,000) against predicted controversial revisions (horizontal axis, 0 to 10,000)]
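The reported figure is a cross-validated R² between predicted and actual counts of controversial revisions. As an illustration of how such a score is computed, here is a minimal sketch of 5-fold cross-validation for a one-feature least-squares line. The single feature, the fold assignment by index, and the synthetic data are all assumptions made for the example; the talk does not state which model or features produced 0.897.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


def cross_validated_r2(xs, ys, folds=5):
    """Fit on folds-1 parts, predict the held-out part, pool all predictions."""
    actual, predicted = [], []
    for k in range(folds):
        train = [i for i in range(len(xs)) if i % folds != k]
        test = [i for i in range(len(xs)) if i % folds == k]
        slope, intercept = fit_line([xs[i] for i in train],
                                    [ys[i] for i in train])
        actual += [ys[i] for i in test]
        predicted += [slope * xs[i] + intercept for i in test]
    return r_squared(actual, predicted)


# Synthetic, perfectly linear data (hypothetical) so the score is easy to check.
xs = list(range(20))
ys = [3 * x + 2 for x in xs]
score = cross_validated_r2(xs, ys)
```

Because every prediction is made on data the model did not see during fitting, the pooled R² is a less optimistic estimate than the R² of a single fit on all data.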
Number of Articles (Log Scale)
http://en.wikipedia.org/wiki/Wikipedia:Modelling_Wikipedia’s_growth
Monthly Edits
Monthly Active Editors (in thousands)
Preferential Attachment: Edits beget edits
– The more previous edits, the more new edits
Growth rate depends on:
– N = current population
– r = growth rate of the population
$$\frac{dN}{dt} = r N$$
$$N(t) = N_0 e^{rt}$$
Ecological population growth model
– Also depends on environmental conditions
– K, carrying capacity (due to resource limitation)
$$\frac{dN}{dt} = r N \left(1 - \frac{N}{K}\right)$$
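The difference between the unconstrained model and the model with a carrying capacity is easiest to see numerically. The sketch below evaluates the exponential solution, the standard closed-form solution of the logistic equation, and a simple Euler integration of dN/dt = rN(1 − N/K) as a cross-check. The parameter values are hypothetical and are not fitted to Wikipedia data.

```python
import math


def exponential(n0, r, t):
    """Closed-form solution of dN/dt = r * N."""
    return n0 * math.exp(r * t)


def logistic(n0, r, k, t):
    """Closed-form solution of dN/dt = r * N * (1 - N / K)."""
    return k / (1 + ((k - n0) / n0) * math.exp(-r * t))


def logistic_euler(n0, r, k, t_end, dt=0.001):
    """Numerically integrate the logistic equation with fixed Euler steps."""
    n = n0
    for _ in range(int(round(t_end / dt))):
        n += dt * r * n * (1 - n / k)
    return n


# Hypothetical parameters: start at 10, growth rate 0.5, carrying capacity 1000.
N0, R, K = 10, 0.5, 1000
for t in (0, 5, 10, 20, 40):
    print(t, round(exponential(N0, R, t)), round(logistic(N0, R, K, t)))
```

Early on the two curves are nearly identical; the logistic curve then flattens as N approaches K, while the exponential curve keeps growing without bound.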
Follows a logistic growth curve
New Article
Biological system
– Competition increases as the population hits the limits of the ecology
– Advantage goes to members of the population that have competitive dominance over others
Analogy
– Limited opportunities to make novel contributions
– Increased patterns of conflict and dominance
Monthly Ratio of Reverted Edits
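A monthly ratio of reverted edits needs some rule for deciding that an edit was reverted. The sketch below uses one common rule, stated here as an assumption rather than as the method behind the slide: if a revision restores text identical to an earlier revision of the same article, every revision in between counts as reverted. Identical text is detected by comparing content hashes. The function name and the toy history are hypothetical.

```python
import hashlib
from collections import defaultdict


def monthly_revert_ratio(revisions):
    """revisions: chronological list of (month, text) for one article.

    Returns {month: reverted edits / all edits} under the identity-revert rule.
    """
    last_seen = {}      # content hash -> index of latest revision with that text
    reverted = set()    # indices of revisions undone by a later identity revert
    for index, (_, text) in enumerate(revisions):
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in last_seen:
            # Everything between the earlier copy and this one was undone.
            reverted.update(range(last_seen[digest] + 1, index))
        last_seen[digest] = index

    totals, undone = defaultdict(int), defaultdict(int)
    for index, (month, _) in enumerate(revisions):
        totals[month] += 1
        if index in reverted:
            undone[month] += 1
    return {month: undone[month] / totals[month] for month in totals}


# Hypothetical article history: the second edit is undone by the third.
history = [
    ("2007-01", "A"),
    ("2007-01", "A spam"),
    ("2007-01", "A"),
    ("2007-02", "A B"),
]
ratios = monthly_revert_ratio(history)
```

This rule misses partial reverts, where an editor undoes only part of a change by hand, so it gives a lower bound on reverting activity.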
Kudos to Brent Hecht, Jilin Chen, Bongwon Suh, Lichan Hong
n = 10,000 users with 5 or more tweets
All Users Who Manually Specified Location
n = 3,311 users with 5 or more tweets
Users w/ No Useful Location Information Manually Entered
Schrute Farms User ID 39111154
User ID 75135928
NONE YA BISNESS!!
User ID 57987417
in jail...smh
not tellin you User ID 130681147
wherever justin wants me to be
User ID 71097545
User ID 77503970
Justin Biebers heart!
User ID 134222427
Jonasbieberland3
Bieber Island User ID 91705969
n = 10,000 users with 5 or more tweets
All Twitter Users
n = 2,965 users with 5 or more tweets
Users w/ Informative Location in the United States
California User ID 125271323
User ID 92455577
Skinny Jeans City, IL
User ID 92455577
Bieberville, California
East Jesus Nowhere, Indiana
User ID 26526957
All 1,698 fake locations were passed to the Yahoo! Geocoder
Example: "Justin Biebers heart!" is geocoded to Lat = 36.328785, Lon = -91.700189
Location of Justin Bieber’s Heart (Don’t Tell Your Teenage Daughters)
Country-scale
– Multinomial naive Bayes classifier, 10-fold cross-validation
– 2.4x better than random
State-scale
– Multinomial naive Bayes classifier, 20% test set
– 2.2x better than random
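To make the classifier named on these slides concrete, here is a minimal multinomial naive Bayes sketch that predicts a location label from the words a user tweets. It assumes whitespace tokenization and Laplace smoothing; the class name, the smoothing constant, and the four training tweets are hypothetical, and the study's actual features and preprocessing are not given on the slides.

```python
import math
from collections import Counter, defaultdict


class MultinomialNB:
    """Multinomial naive Bayes over whitespace tokens with Laplace smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for doc, label in zip(docs, labels):
            words = doc.lower().split()
            self.word_counts[label].update(words)
            self.vocab.update(words)
        return self

    def predict(self, doc):
        # Words never seen in training carry no evidence and are skipped.
        words = [w for w in doc.lower().split() if w in self.vocab]
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            denom = (sum(self.word_counts[label].values())
                     + self.alpha * len(self.vocab))
            score = math.log(n_docs / total_docs)  # log prior
            for w in words:
                score += math.log((self.word_counts[label][w] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical training tweets labeled with a state.
tweets = [
    "go giants love the bay",
    "traffic on the 101 again",
    "cubs game at wrigley tonight",
    "deep dish and the cubs",
]
states = ["California", "California", "Illinois", "Illinois"]
model = MultinomialNB().fit(tweets, states)
```

A "better than random" figure compares the classifier's accuracy with the accuracy of guessing a label uniformly at random, which is 1 divided by the number of classes.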
Which tweet features are associated with retweeting?
Retweet Model
– # Retweet ~ function(f1, f2, …, fn), where fi are simple features extracted from a tweet
74M tweets from Twitter Stream API
– Characterization
– 2~3% sample
– Hadoop / HBase / MapReduce
Contextual Features
– # Followees: 395
– # Followers: 1,400
– # Favorite: 1,657
– # Day: (since June 17, 2008)
– # Past tweets: 21,000
Content Features
– URL
– Hashtag
– Mention
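The content features listed here (URL, hashtag, mention) can be pulled from tweet text with a few regular expressions and then joined with contextual features from the author's profile to form the inputs f1 … fn of a retweet model. The sketch below is one simple way to do that; the regular expressions, the feature names, and the profile dictionary keys are assumptions made for the example.

```python
import re

URL_RE = re.compile(r"https?://\S+")
# The lookbehind keeps e-mail addresses and mid-word symbols from matching.
HASHTAG_RE = re.compile(r"(?<!\w)#\w+")
MENTION_RE = re.compile(r"(?<!\w)@\w+")


def content_features(text):
    """Extract URL, hashtag, and mention features from the tweet text."""
    # Strip URLs first so a fragment such as page#section is not a hashtag.
    without_urls = URL_RE.sub("", text)
    return {
        "has_url": bool(URL_RE.search(text)),
        "num_hashtags": len(HASHTAG_RE.findall(without_urls)),
        "num_mentions": len(MENTION_RE.findall(without_urls)),
    }


def feature_vector(text, profile):
    """Join content features with contextual features from the author profile."""
    features = content_features(text)
    for key in ("followees", "followers", "favorites", "past_tweets"):
        features[key] = profile.get(key, 0)
    return features


# Profile counts from the slide; the tweet text itself is hypothetical.
profile = {"followees": 395, "followers": 1400, "favorites": 1657,
           "past_tweets": 21000}
example = feature_vector(
    "Slides from my #NPUC talk http://example.com/a#b cc @edchi", profile)
```

Each tweet then becomes one row of features, and the retweet model relates these rows to whether, or how often, the tweet was retweeted.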
Two Types of Features
[Figure: plot with Contextual Factor on the horizontal axis and Content Factor on the vertical axis]
Information Streams => Information Overload
ASC Social Recommender Engine
Recommendation Algorithm: Combining Sources and Models
– Candidate sources: My Friends’ URLs, Popular URLs
– Social Ranking Model: uses My Friends’ Network and Tweeting Pattern
– Topic Relevance Model: uses My Tweets and My Friends’ Tweets
– Output: Recommendations
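The recommender slide combines a topic relevance model with a social ranking model. The sketch below shows one plausible way to combine them, stated as an assumption: topic relevance is the cosine similarity between a bag-of-words profile built from tweets and the text of a candidate URL, the social score is the fraction of friends who posted the URL, and the two are mixed with a linear weight. All names, the weight, and the toy data are hypothetical; the actual engine may rank differently.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def topic_relevance(profile_tweets, url_text):
    """How well the URL's text matches the words used in the profile tweets."""
    profile = Counter(" ".join(profile_tweets).lower().split())
    return cosine(profile, Counter(url_text.lower().split()))


def social_score(url, friend_posts):
    """Fraction of friends who posted the URL."""
    if not friend_posts:
        return 0.0
    votes = sum(1 for urls in friend_posts.values() if url in urls)
    return votes / len(friend_posts)


def recommend(candidates, profile_tweets, friend_posts, weight=0.5):
    """Rank candidate URLs by a weighted mix of topic and social scores."""
    scored = []
    for url, text in candidates.items():
        score = (weight * topic_relevance(profile_tweets, text)
                 + (1 - weight) * social_score(url, friend_posts))
        scored.append((score, url))
    return [url for _, url in sorted(scored, reverse=True)]


# Hypothetical candidates, profile, and friend activity.
candidates = {
    "http://example.com/hadoop": "hadoop cluster analytics",
    "http://example.com/cats": "funny cat pictures",
}
my_tweets = ["running hadoop analytics jobs"]
friend_posts = {
    "alice": {"http://example.com/hadoop"},
    "bob": {"http://example.com/hadoop", "http://example.com/cats"},
}
ranked = recommend(candidates, my_tweets, friend_posts)
```

Swapping `my_tweets` for friends' tweets changes whose interests the topic profile represents, which corresponds to the two inputs shown for the Topic Relevance Model.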
Hadoop Compute Cluster
– 50 nodes, depending on project requirement
– ~40TB storage capacity
– Experience with HBase, Pig, Interaction with Lucene, MySQL
Large-scale crawling and analytics experience with
– Wikipedia (all edits up to 2009)
– Delicious data set (200M bookmarks)
– Twitter (70M+ Tweets)
Experience with Large Scale Social Analytics
– Example 1: Visual analytics in Wikipedia (wikidashboard.com)
– Example 2: Search engines for social bookmarks (mrtaggy.com)
– Example 3: Recommenders for Twitter news (zerozero88.com)
Image from: http://www.flickr.com/photos/ourcommon/480538715/
Research Vision: Understand how social computing systems can enhance the ability of a group of people to remember, think, and reason.
Understand and support Collective Intelligence by modeling social group behaviors and testing prototype tools in Living Labs
http://asc-parc.blogspot.com
http://www.edchi.net
[email protected]