Harith Alani's presentation at SSSW 2011
TRANSCRIPT
Live Social Semantics
& online community monitoring
1
Harith Alani, Knowledge Media Institute, The Open University, UK
Semantic Web Summer School, Cercedilla, Spain, 2011
http://twitter.com/halani http://delicious.com/halani http://www.linkedin.com/pub/harith-alani/9/739/534
Market value of Web Analytics
2
Tag-Along Marketing The New York Times, November 6, 2010
“Everything is in place for location-based social networking to be the next big thing. Tech companies are building the platforms, venture capitalists are providing the cash and marketers are eager to develop advertising. “
Location, Sensors, & Social Networking
3
Location, Sensors, & Social Networking
4
The Canine Twitterer
“Having my daily workout. Already did 15 leg lifts!”
Monitoring online/offline social activity
5
Where is everybody?
Monitoring online/offline social activity
• Generating opportunities for F2F networking
6
Tracking of F2F contact networks
7
TraceEncounters, 2004
Sociometer, MIT, 2002:
- F2F and productivity
- F2F dynamics
- Who are the key players?
- F2F and office distance
8
SocioPatterns platform
http://www.sociopatterns.org/
9
Convergence with online social networks
• Claim: digital social networking increases physical social isolation
• Effects attributed to that isolation:
  – Genetic alterations
  – Weakened immune system
  – Less resistance to cancer
  – Higher risk of heart disease
  – Higher blood pressure
  – Faster dementia
  – Narrower arteries
Aric Sigman, “Well Connected? The Biological Implications of 'Social Networking’”, Biologist, 56(1), 2009
10
Online vs. offline social networking
• Digital networking increases social interaction:
  – Creates more opportunities to network
  – Supports and increases F2F contact
  – Stronger offline social ties → more online communication
  – Stronger offline social ties → more diverse online communications
  – F2F is the medium of choice in weaker social ties
Barry Wellman, "The Glocal Village: Internet and Community", Idea's: The Arts & Science Review, University of Toronto, 1(1), 2004
Offline + online social networking
11 ESWC2010
Where should I go?
Where have I met this guy?
Anyone I know here?
Who should I talk to?
<?xml version="1.0"?>
<rdf:RDF
  xmlns="http://tagora.ecs.soton.ac.uk/schemas/tagging#"
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
  xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
  xmlns:owl="http://www.w3.org/2002/07/owl#"
  xml:base="http://tagora.ecs.soton.ac.uk/schemas/tagging">
  <owl:Ontology rdf:about=""/>
  <owl:Class rdf:ID="Post"/>
  <owl:Class rdf:ID="TagInfo"/>
  <owl:Class rdf:ID="GlobalCooccurrenceInfo"/>
  <owl:Class rdf:ID="DomainCooccurrenceInfo"/>
  <owl:Class rdf:ID="UserTag"/>
  <owl:Class rdf:ID="UserCooccurrenceInfo"/>
  <owl:Class rdf:ID="Resource"/>
  <owl:Class rdf:ID="GlobalTag"/>
  <owl:Class rdf:ID="Tagger"/>
  <owl:Class rdf:ID="DomainTag"/>
  <owl:ObjectProperty rdf:ID="hasPostTag">
    <rdfs:domain rdf:resource="#TagInfo"/>
  </owl:ObjectProperty>
  <owl:ObjectProperty rdf:ID="hasDomainTag">
    <rdfs:domain rdf:resource="#UserTag"/>
  </owl:ObjectProperty>
  <owl:ObjectProperty rdf:ID="isFilteredTo">
    <rdfs:range rdf:resource="#GlobalTag"/>
    <rdfs:domain rdf:resource="#GlobalTag"/>
  </owl:ObjectProperty>
  <owl:ObjectProperty rdf:ID="hasResource">
    <rdfs:domain rdf:resource="#Post"/>
    <rdfs:range =…
Live Social Semantics (LSS): RFIDs + Social Web + Semantic Web
• Integration of physical presence and online information
• Semantic user profile generation
• Logging of face-to-face contact
• Social network browsing
• Analysis of online vs. offline social networks
Live Social Semantics: architecture
13
[Architecture diagram: RFID badges and RFID readers feed real-world contact data to a local server. An extractor daemon and Connect API pull users' networks and interests from web-based systems (Delicious, Flickr, LastFM, Facebook). A profile builder, with tag disambiguation and tag-to-URI services backed by the TAGora Sense Repository, maps tags and MusicBrainz ids to URIs (mbid → dbpedia URI, tag → dbpedia URI). An aggregator with an RDF cache collects publications, social tagging, social networks, contacts, and communities of practice from semanticweb.org, rkbexplorer.com, dbpedia.org and dbtune.org. Everything is stored in a JXT triple store and exposed through a web interface, linked data, and visualizations.]
[Data-flow diagram: contacts data, tags, URIs and social semantics feed the web interface, linked data and visualization; publications, co-authorship networks and SW resources come from data.semanticweb.org and rkbexplorer.com.]
14
[Community-of-practice (CoP) diagram linking authors, chairs, and proceedings chairs to a conference.]
http://data.semanticweb.org/
http://www.rkbexplorer.com/
15
Social and information networks
16
Merging social networks
FOAF
Distinct, Separated Identity Management
http://tagora.ecs.soton.ac.uk/delicious/halani
http://tagora.ecs.soton.ac.uk/flickr/69749885@N00
http://tagora.ecs.soton.ac.uk/lastfm/halani
http://tagora.ecs.soton.ac.uk/facebook/568493878
Harith Alani
http://data.semanticweb.org/person/harith-alani/
http://southampton.rkbexplorer.com/id/person-05877
http://tagora.ecs.soton.ac.uk/LiveSocialSemantics/eswc2009/1139
http://tagora.ecs.soton.ac.uk/LiveSocialSemantics/eswc2009/foaf/2
Delicious Tagging and Network
Flickr Tagging and Contacts
Last.fm favourite artists and friends
Facebook contacts
RFID Contact Data
Conference Publication Data
Past Publications, Projects, Communities of Practice
18
Tag Filtering Service
Semantic modeling Semantic analysis Collective intelligence Statistical analysis Syntactical analysis
19
Tag Filtering Service
20
From Tags to Semantics
20
21
Tags to User Interests
21
22
From raw tags and social relations to Structured Data
User raw data
Structured data
Collective intelligence
ontologies
Semantic data
23
RFIDs for tracking social contact
23
People contact → RFID → RDF triples
24
F2FContact model:
• hasContact: foaf:Person1 → F2FContact
• contactWith: F2FContact → foaf:Person2
• contactDate: F2FContact → XMLSchema#date
• contactDuration: F2FContact → XMLSchema#time
• contactPlace: F2FContact → Place
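The mapping above can be sketched in code. A minimal sketch that serialises one RFID-logged contact event as N-Triples, using the slide's property names (hasContact, contactWith, contactDate, contactDuration, contactPlace); the namespace URI is a placeholder assumption, and the duration is simplified to seconds typed as xsd:int rather than the slide's xsd:time:

```python
# Sketch: one RFID contact event as N-Triples strings.
# "http://example.org/lss#" is a placeholder namespace, not the
# project's actual one.
LSS = "http://example.org/lss#"
FOAF = "http://xmlns.com/foaf/0.1/"
XSD = "http://www.w3.org/2001/XMLSchema#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

def contact_to_ntriples(contact_id, person1, person2, day, seconds, place):
    """Serialise a face-to-face contact event as a list of N-Triples lines."""
    c = f"<{LSS}{contact_id}>"
    return [
        f"{c} <{RDF_TYPE}> <{LSS}F2FContact> .",
        f"<{FOAF}{person1}> <{LSS}hasContact> {c} .",
        f"{c} <{LSS}contactWith> <{FOAF}{person2}> .",
        f'{c} <{LSS}contactDate> "{day}"^^<{XSD}date> .',
        f'{c} <{LSS}contactDuration> "{seconds}"^^<{XSD}int> .',
        f"{c} <{LSS}contactPlace> <{LSS}{place}> .",
    ]
```

Each event thus yields six triples that a triple store can ingest directly.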
25
26
Real-time F2F networks with SNS links
http://www.vimeo.com/6590604
27
Deployed at:
Live Social Semantics
Data analysis
• Face-to-face interactions across scientific conferences
• Networking behaviour of frequent users
• Correlations between scientific seniority and social networking
• Comparison of F2F contact network with Twitter and Facebook
• Social networking with online and offline friends
Characteristics of F2F contact network
• Degree: number of people with whom the person had at least one F2F contact
• Strength: total time the person spent in F2F contact
• Edge weight: total time spent by a pair of users in F2F contact
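These three measures can be computed directly from the raw contact log. A minimal sketch with toy data (the event tuples and names are illustrative, not from the conference deployments):

```python
from collections import defaultdict

# Compute degree, strength, and edge weight from contact events,
# each event being (person_a, person_b, duration_in_minutes).
def network_measures(events):
    weight = defaultdict(float)              # total F2F time per pair of users
    for a, b, dur in events:
        weight[frozenset((a, b))] += dur
    neighbours = defaultdict(set)
    strength = defaultdict(float)
    for pair, w in weight.items():
        a, b = tuple(pair)
        neighbours[a].add(b); neighbours[b].add(a)
        strength[a] += w;     strength[b] += w
    degree = {p: len(ns) for p, ns in neighbours.items()}
    return degree, dict(strength), dict(weight)

events = [("ann", "bob", 2), ("ann", "bob", 3), ("ann", "cat", 1)]
degree, strength, weight = network_measures(events)
# ann met 2 distinct people for 6 minutes; the ann-bob edge weighs 5
```

Note that weight is per edge while strength is per node, which is why the two are accumulated in separate passes.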
28
Network characteristics

                       ESWC 2009   HT 2009   ESWC 2010
Number of users              175       113         158
Average degree                54        39          55
Avg. strength (mn)           143       123         130
Avg. weight (mn)            2.65      3.15        2.35
Weights ≤ 1 mn               70%       67%         74%
Weights ≤ 5 mn               90%       89%         93%
Weights ≤ 10 mn              95%       94%         96%
Characteristics of F2F contact events

                             ESWC 2009   HT 2009   ESWC 2010
Number of contact events         16258      9875       14671
Average contact length (s)          46        42          42
Contacts ≤ 1 mn                    87%       89%         88%
Contacts ≤ 2 mn                    94%       96%         95%
Contacts ≤ 5 mn                    99%       99%         99%
Contacts ≤ 10 mn                 99.8%     99.8%       99.8%
F2F contact pattern is very similar for all three conferences
F2F contacts of returning users
[Log-log scatter plots comparing ESWC2009 (x-axis) with ESWC2010 (y-axis) for returning users: degree, total interaction time, and links' weights.]
30
• Degree: number of other participants with whom an attendee has interacted
• Total time: total time an attendee spent in interaction
• Link weight: total time spent in F2F interaction by a pair of returning attendees in 2010, versus the same quantity measured in 2009
Time spent on F2F networking by frequent users is stable, even when the list of people they networked with changed
ESWC 2009 & ESWC 2010         Pearson correlation
Degree                                      0.37
Total F2F interaction time                  0.76
Link weight                                 0.75
Average seniority of neighbours in F2F networks
[Scatter plot: seniority (number of papers, 0-10, x-axis) vs. average seniority of neighbours (0-5, y-axis), for the unweighted (sen_n), weighted (sen_n,w) and strongest-link (sen_n,max) averages.]
31
• No clear pattern is observed if the unweighted average over all neighbours in the aggregated network is considered
• A correlation is observed when each neighbour is weighted by the time spent with the main person
• The correlation becomes much stronger when considering for each individual only the neighbour with whom the most time was spent
Avg. seniority of the neighbours, with weighted averages
Seniority of the user with the strongest link
Conference attendees tend to network with others of similar levels of scientific seniority
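The three neighbour-seniority aggregates compared above can be sketched for a single attendee as follows; the function names and toy numbers are mine, not the study's:

```python
# Three ways to aggregate neighbour seniority for one attendee.
# `contacts` maps neighbour -> total F2F time (minutes);
# `seniority` maps person -> number of papers. Toy values.
def unweighted_avg(contacts, seniority):
    """Plain average over all neighbours (no clear pattern observed)."""
    return sum(seniority[n] for n in contacts) / len(contacts)

def weighted_avg(contacts, seniority):
    """Each neighbour weighted by time spent with the attendee."""
    total = sum(contacts.values())
    return sum(t * seniority[n] for n, t in contacts.items()) / total

def strongest_link(contacts, seniority):
    """Seniority of the single neighbour with the most shared time."""
    return seniority[max(contacts, key=contacts.get)]

contacts = {"bea": 30, "carl": 5}
seniority = {"bea": 8, "carl": 2}
# unweighted: 5.0; weighted: (30*8 + 5*2)/35 ≈ 7.14; strongest link: 8
```

Weighting pulls the aggregate towards the neighbours one actually spends time with, which is why the correlation strengthens from the first measure to the third.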
Presence of Attendees, HT2009
Offline networking vs online networking
33
• People who have a large number of friends on Twitter and/or Facebook do not appear to be the most socially active in the offline world, compared to other SNS users
Users with Facebook and Twitter accounts in ESWC 2010
Twitterers                        Spearman correlation (ρ)
Tweets – F2F degree                                 -0.15
Tweets – F2F strength                               -0.15
Twitter following – F2F degree                      -0.21
No strong correlation between amount of F2F contact activity and size of online social networks
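Spearman's ρ, as reported in the table above, can be computed with the standard library alone by ranking both variables (averaging ranks over ties) and taking the Pearson correlation of the ranks:

```python
# Spearman rank correlation, standard library only.
def _ranks(xs):
    """1-based ranks; tied values share the average rank of their group."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Pearson correlation of the rank vectors of xs and ys."""
    rx, ry = _ranks(xs), _ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```

Perfectly opposite rankings give ρ = -1, matching the sign convention of the negative values in the table.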
Scientific seniority vs Twitter followers
34
• Comparison between people’s scientific seniority and the number of people following them on Twitter
People who have the highest number of Twitter followers are not necessarily the most scientifically senior, although they do have high visibility and experience
Twitter users                  Correlation
H-index – Twitter followers           0.32
H-index – Tweets                     -0.13
Conference Chairs

                        avg. degree   avg. strength   avg. weight   avg. events per edge
All participants 2009            55            8590           159                   3.44
Chairs 2009                    77.7           19590           500                      8
All participants 2010            54            7807           141                   3.37
Chairs 2010                    77.6           22520           674                     12
• Conf chairs interact with more distinct people (larger average degree)
• Conf chairs spend more time in F2F interaction (almost three times as much as a random participant)
Networking with online and offline 'friends'

Characteristics                        all users   coauthors   Facebook friends   Twitter followers
Average contact duration (s)                  42          75                 63                  72
Average edge weight (s)                      141        4470                830                1010
Average number of events per edge           3.37          60                 13                  14
• Individuals sharing an online or professional social link meet much more often than other individuals
• Average number of encounters, and total time spent in interaction, is highest for co-authors
F2F contacts with Facebook and Twitter friends were respectively 50% and 71% longer, and 286% and 315% more frequent, than with others. Participants spent 79% more time in F2F contact with their co-authors, and met them 1680% more often than non-co-authors.
Twitterers vs Non-Twitterers
• Time spent in conference rooms: Twitter users spent on average 11.4% more time in the conference rooms than non-Twitter users (mean is 26% higher)
• Number of people met F2F during the conference: Twitter users met on average 9% more people F2F (mean 8% higher)
• Duration of F2F contacts: Twitter users spent on average 63% more time in F2F contact than non-Twitter users (mean is 20% higher)
37
Behaviour of individuals – micro level analysis
38
[Annotated per-user plot of normalised F2F degree and F2F strength (legend partially garbled in extraction): healthy scientific and social profiles (chairs); good scientific and social signals; shy scientist?; outsider, high profile; students, developers; who is the next star researcher?]
Behaviour analysis
Jeffrey Chan, Conor Hayes, and Elizabeth Daly. Decomposing discussion forums using common user roles. In Proc. Web Science Conf. (WebSci10), Raleigh, NC: US, 2010
Role Skeleton
Encoding Rules in Ontologies with SPIN
Approach for inferring User Roles
42
• Features: structural, social network, reciprocity, persistence, participation
• Feature levels change with the dynamics of the community
• Associate roles with a collection of feature-to-level mappings, e.g. in-degree → high, out-degree → high
• Run the rules over each user's features and derive the role composition
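The pipeline above might be sketched as follows; the role definitions and the median-based binning of features into levels are illustrative assumptions, not the paper's actual role skeleton or SPIN rules:

```python
from statistics import median

# Illustrative role rules: each role is a set of required
# feature -> level mappings (assumed, not from the paper).
ROLES = {
    "popular initiator": {"in-degree": "high", "out-degree": "high"},
    "lurker":            {"in-degree": "low",  "out-degree": "low"},
}

def to_levels(user_features, community):
    """Bin each raw feature value to 'high'/'low' relative to the
    community median, so levels track the community's dynamics."""
    return {f: ("high" if v >= median(community[f]) else "low")
            for f, v in user_features.items()}

def infer_roles(user_features, community):
    """Return every role whose feature-to-level rules all hold."""
    levels = to_levels(user_features, community)
    return [role for role, rules in ROLES.items()
            if all(levels.get(f) == lvl for f, lvl in rules.items())]

community = {"in-degree": [1, 2, 50, 60], "out-degree": [0, 3, 40, 70]}
# a user well above both medians matches "popular initiator"
```

Because levels are computed against the current community distribution, the same raw value can map to different levels as the community evolves, which is the point of the second bullet above.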
Data from Boards.ie
• Forum 246 (Commuting and Transport): demonstrates a clear increase in activity over time
• Forum 388 (Rugby): exhibits periodic increases and decreases in activity, and hence provides good examples of healthy/unhealthy evolutions
• Forum 411 (Mobile Phones and PDAs): increase in activity over time with some fluctuation, i.e. reductions and increases over various time windows
• Time period: 2004-01 to 2006-12
Results
• Correlation of individual features in each of the three forums
Commuting and Transport Rugby Mobile Phones and PDAs
Results
[Activity plots: (a) Forum 246: Commuting and Transport; (b) Forum 388: Rugby; (c) Forum 411: Mobile Phones and PDAs]
• Variation in behaviour composition & activity
• Behaviour composition in/stability influences forum activity
Prediction analysis – preliminary results!
• Predicting rise/fall in post submission numbers
• Binary classification
• Features : Community composition, roles and percentages of users associated with each
• Cross-community predictions are less reliable than individual community analysis due to the idiosyncratic behaviour observed in each individual community
Forum P R F1 ROC
246 0.799 0.769 0.780 0.800
388 0.603 0.615 0.605 0.775
411 0.765 0.692 0.714 0.617
All 0.583 0.667 0.607 0.466
Rise and fall of social networks
47
Predicting engagement
• Which posts will receive a reply? – What are the most influential features here?
• How much discussion will it generate? – What are the key factors of lengthy discussions?
48
Common online community features
49
initial tweet that generates a reply. Features which describe seed posts can be divided into two sets: user features, attributes that define the user making the post; and content features, attributes that are based solely on the post itself. We wish to explore the application of such features in identifying seed posts; to do this we train several machine learning classifiers and report on our findings. However, we first describe the features used.
4.1 Feature Extraction
The likelihood of posts eliciting replies depends upon popularity, a highly subjective term influenced by external factors. Properties influencing popularity include user attributes, describing the reputation of the user, and attributes of a post's content, generally referred to as content features. In Table 1 we define user and content features and study their influence on the discussion "continuation".
Table 1. User and Content Features

User features:
  In Degree: number of followers of U (#)
  Out Degree: number of users U follows (#)
  List Degree: number of lists U appears on; lists group users by topic (#)
  Post Count: total number of posts the user has ever posted (#)
  User Age: number of minutes from user join date (#)
  Post Rate: posting frequency of the user (PostCount / UserAge)

Content features:
  Post Length: length of the post in characters (#)
  Complexity: cumulative entropy of the unique words in post p, with total word length n and p_i the frequency of each word: Σ_{i∈[1,n]} p_i (log n − log p_i)
  Uppercase Count: number of uppercase words (#)
  Readability: Gunning fog index using average sentence length (ASL) and the percentage of complex words (PCW) [7]: 0.4 (ASL + PCW)
  Verb Count: number of verbs (#)
  Noun Count: number of nouns (#)
  Adjective Count: number of adjectives (#)
  Referral Count: number of @user mentions (#)
  Time in the Day: normalised time in the day, measured in minutes (#)
  Informativeness: terminological novelty of the post w.r.t. other posts, the cumulative tf-idf value of each term t in post p: Σ_{t∈p} tfidf(t, p)
  Polarity: cumulation of polar term weights in p (using the SentiWordNet lexicon), normalised by polar term count: (Po + Ne) / |terms|
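Two of the content features above can be sketched in code. The complex-word test used for the fog index here (words of 8+ letters standing in for the usual 3+ syllables) is a rough assumption for illustration:

```python
import math

def complexity(post):
    """Cumulative entropy of the unique words in a post:
    sum over unique words of p_i * (log n - log p_i),
    with p_i the word's count and n the total word count."""
    words = post.lower().split()
    n = len(words)
    freqs = {w: words.count(w) for w in set(words)}
    return sum(p * (math.log(n) - math.log(p)) for p in freqs.values())

def gunning_fog(post):
    """Gunning fog index 0.4 * (ASL + PCW), where ASL is average
    sentence length and PCW the percentage of complex words
    (approximated here as words of 8+ letters)."""
    sentences = [s for s in post.replace("!", ".").replace("?", ".").split(".")
                 if s.strip()]
    words = post.split()
    asl = len(words) / len(sentences)
    pcw = 100 * sum(1 for w in words if len(w) >= 8) / len(words)
    return 0.4 * (asl + pcw)
```

A repetitive post scores low on complexity, while long sentences full of long words drive the fog index up.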
4.2 Experiments

Experiments are intended to test the performance of different classification models in identifying seed posts. We therefore used four classifiers: the discriminative classifiers Perceptron and SVM, the generative classifier Naive Bayes, and the decision-tree classifier J48. For each classifier we used three feature settings: user features, content features, and user+content features.

Datasets. For our experiments we used two datasets of tweets available on the Web: the Haiti earthquake tweets and the State of the Union Address tweets.

http://infochimps.com/datasets/twitter-haiti-earthquake-data
http://infochimps.com/datasets/tweets-during-state-of-the-union-address
• How do all these features influence activity generation in an online community?
  – Such knowledge leads to better use and management of the community
Experiment for identifying Twitter seed posts
• Twitter data on the Haiti earthquake, and the Union Address
• Evaluated a binary classification task – Is this post a seed post or not?
50
Dataset Users Tweets Seeds Non-seeds Replies
Haiti 44,497 65,022 1,405 60,686 2,931
Union Address 66,300 80,272 7,228 55,169 17,875
Identifying seeds with different type of features
51
use the F-measure, defined in Equation 1 as the weighted harmonic mean between precision and recall, setting β = 1 to weight precision and recall equally. We also plot the Receiver Operator Curve of our trained models to show graphical comparisons of performance.

F_β = (1 + β²) · P · R / (β² · P + R)    (1)

For our experiments we divided each dataset into 3 sets: a training set, a validation set and a testing set, using a 70/20/10 split. We trained our classification models using the training split and then applied them to the validation set, labelling the posts within this split. From these initial results we performed model selection by choosing the best performing model, based on maximising the F1 score, and used this model together with the best performing features, using a ranking heuristic, to classify posts contained within our test split. We first report the results obtained from our model selection phase, before moving on to our results from using the best model with the top-k features.
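Equation 1 in code; with β = 1 it reduces to the familiar harmonic mean of precision and recall:

```python
# F-measure: weighted harmonic mean of precision p and recall r.
# beta > 1 favours recall, beta < 1 favours precision.
def f_measure(p, r, beta=1.0):
    return (1 + beta**2) * p * r / (beta**2 * p + r)

# with p = r, the F-measure equals that common value
```

Because it is a harmonic mean, a model cannot score well by maximising only one of the two quantities, which is why it is used for model selection here.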
Table 3. Results from the classification of seed posts using varying feature sets and classification models

(a) Haiti dataset
                  P      R      F1     ROC
User     Perc   0.794  0.528  0.634  0.727
         SVM    0.843  0.159  0.267  0.566
         NB     0.948  0.269  0.420  0.785
         J48    0.906  0.679  0.776  0.822
Content  Perc   0.875  0.077  0.142  0.606
         SVM    0.552  0.727  0.627  0.589
         NB     0.721  0.638  0.677  0.769
         J48    0.685  0.705  0.695  0.711
All      Perc   0.794  0.528  0.634  0.726
         SVM    0.483  0.996  0.651  0.502
         NB     0.962  0.280  0.434  0.852
         J48    0.824  0.775  0.798  0.836

(b) Union Address dataset
                  P      R      F1     ROC
User     Perc   0.658  0.697  0.677  0.673
         SVM    0.510  0.946  0.663  0.512
         NB     0.844  0.086  0.157  0.707
         J48    0.851  0.722  0.782  0.830
Content  Perc   0.467  0.698  0.560  0.457
         SVM    0.650  0.589  0.618  0.638
         NB     0.762  0.212  0.332  0.649
         J48    0.740  0.533  0.619  0.736
All      Perc   0.630  0.762  0.690  0.672
         SVM    0.499  0.990  0.664  0.506
         NB     0.874  0.212  0.341  0.737
         J48    0.890  0.810  0.848  0.877
4.3 Results
Our findings from Table 3 demonstrate the effectiveness of using solely user features for identifying seed posts. In both the Haiti and Union Address datasets, training a classification model using user features shows improved performance over the same models trained using content features. In the case of the Union dataset we are able to achieve an F1 score of 0.782, coupled with high precision, when using the J48 decision-tree classifier, where the latter figure (precision) indicates conservative estimates using only user features. We also achieve similarly high levels of precision when using the same classifier on the Haiti dataset. The plots of the Receiver Operator Characteristic (ROC) curves in Figure 2 show similar levels of performance for each classifier over the two corpora. When using solely user features, J48 is shown to dominate the ROC space, subsuming the plots of the other models. A similar behaviour is exhibited by the Naive Bayes classifier, where SVM and Perceptron are each outperformed. The plots also demonstrate the poor recall levels when using only content features, where each model fails to match the performance achieved with user features alone.
• User features are most important in Twitter
• But combining user & content features gives best results
Impact of different features in Twitter
• What features have the highest impact on identification of seed posts?
• Rank features by information gain ratio wrt seed post class label
52
which we found to be 0.674, indicating a good correlation between the two lists and their respective ranks.

Table 4. Features ranked by Information Gain Ratio w.r.t. the Seed Post class label. Each feature name is paired with its IG in brackets.

Rank  Haiti                             Union Address
1     user-list-degree (0.275)          user-list-degree (0.319)
2     user-in-degree (0.221)            content-time-in-day (0.152)
3     content-informativeness (0.154)   user-in-degree (0.133)
4     user-num-posts (0.111)            user-num-posts (0.104)
5     content-time-in-day (0.089)       user-post-rate (0.075)
6     user-post-rate (0.075)            user-out-degree (0.056)
7     content-polarity (0.064)          content-referral-count (0.030)
8     user-out-degree (0.040)           user-age (0.015)
9     content-referral-count (0.038)    content-polarity (0.015)
10    content-length (0.020)            content-length (0.010)
11    content-readability (0.018)       content-complexity (0.004)
12    user-age (0.015)                  content-noun-count (0.002)
13    content-uppercase-count (0.012)   content-readability (0.001)
14    content-noun-count (0.010)        content-verb-count (0.001)
15    content-adj-count (0.005)         content-adj-count (0.0)
16    content-complexity (0.0)          content-informativeness (0.0)
17    content-verb-count (0.0)          content-uppercase-count (0.0)

Fig. 3. Contributions of top-5 features to identifying non-seeds (N) and seeds (S). Upper plots are for the Haiti dataset, lower plots for the Union Address dataset.

The top-most ranks from each dataset are dominated by user features, including list-degree, in-degree, num-of-posts and post-rate. Such features describe a user's reputation, where higher values are associated with seed posts. Figure 3 shows the contributions of each of the top-5 features to class decisions in the training set, where the list-degree and in-degree of the user are seen to correlate heavily with seed posts. Using these rankings, our next experiment explored the effects of training a classification model using only the top-k features, observing
Positive/negative impact of features
• What is the correlation between seed posts and features?
53
[Repeat of Table 4 and Fig. 3, per dataset: Haiti (upper), Union Address (lower)]
Predicting discussion activity on Twitter
• Reply rates: Haiti 1-74 responses, Union Address 1-75 responses
• Compare rankings: ground truth vs. predicted
• Experiments:
  – Using the Haiti and Union Address datasets
  – Evaluate predicted rank k, where k = {1, 5, 10, 20, 50, 100}
  – Support Vector Regression with user, content, and user+content features
54
Dataset          Training size   Test size   Test vol. mean   Test vol. SD
Haiti                      980         210            1.664          3.017
Union Address            5,067       1,161            1.761          2.342
Predicting discussion activity on Twitter
55
Haiti dataset Union Address dataset
• Content features are key for top ranks
• User features are more important for higher ranks
56
Identifying seed posts in Boards.ie
• Used the same features as before:
  – User features: in-degree, out-degree, post count, user age, post rate
  – Content features: post length, complexity, readability, referral count, time in day, informativeness, polarity
• New features designed to capture user affinity:
  – Forum entropy: concentration of forum activity; higher entropy = larger forum spread
  – Forum likelihood: likelihood of a forum post given the user's history; combines post history with incoming data
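A straightforward reading of the two affinity features, sketched from a user's posting history as a list of forum ids (the paper's exact estimators may differ):

```python
import math
from collections import Counter

def forum_entropy(history):
    """Shannon entropy of the user's posts over forums:
    0 if all posts are in one forum, higher the wider the spread."""
    counts = Counter(history)
    n = len(history)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def forum_likelihood(history, forum):
    """Maximum-likelihood estimate of P(forum | user's history)."""
    return history.count(forum) / len(history)

history = ["rugby"] * 8 + ["commuting"] * 2
# a user concentrated in one forum has low entropy and a high
# likelihood of posting there again
```

A user who posts a lot but spreads across many forums gets high entropy and a low likelihood for any single forum, which is the distinction the two features are meant to capture.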
57
Experiment for identifying seed posts
• Used all posts from Boards.ie in 2006
• Built features using a 6-month window prior to seed post date
• Evaluated a binary classification task:
  – Is this post a seed post or not?
  – Precision, Recall, F1 and Accuracy
  – Tested: user, content, focus features, and their combinations
Posts Seeds Non-Seeds Replies Users
1,942,030 90,765 21,800 1,829,465 29,908
58
Identifying seeds with different type of features
activity levels, and because it has already been used in other investigations (e.g., [14]).

Boards.ie does not provide explicit social relations between community members, unlike for example Facebook and Twitter. We followed the same strategy proposed in [3] for extracting social networks from Digg, and built the Boards.ie social network for users, weighting edges cumulatively by the number of replies between any two users.
TABLE I. DESCRIPTION OF THE BOARDS.IE DATASET

Posts       Seeds    Non-Seeds   Replies     Users
1,942,030   90,765   21,800      1,829,465   29,908
In order to derive our features we required a window of n days from which the social graph can be compiled and relevant measurements taken. Based on previous work over the same dataset in [14], we used a similar window of 188 days (roughly 6 months) prior to the post date of a given seed or non-seed post. For instance, if a seed post p is made at time t, then the window from which the features (i.e., user and focus features) are derived is from t − 188 to t − 1. In using this heuristic we ensure that the features compiled for each post are independent of future outcomes and will not bias our predictions; for example, a user may increase their activity following the seed post, which would not be a true indicator of their behaviour at the time the post was made. Table I summarises the dataset and the number of posts (seeds, non-seeds and replies) and users contained within.
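The windowing heuristic can be sketched as a simple predicate; the dates are illustrative:

```python
from datetime import date, timedelta

# For a post made at time t, only activity in [t - 188 days, t - 1 day]
# contributes to its features, so nothing that happens after (or on)
# the posting day can leak into them.
def in_feature_window(post_date, event_date, window_days=188):
    return (post_date - timedelta(days=window_days)
            <= event_date
            <= post_date - timedelta(days=1))

t = date(2006, 7, 1)
```

Excluding the posting day itself is what prevents, say, a reply-driven burst of activity on that day from inflating the poster's apparent engagement.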
V. CLASSIFICATION: DETECTING SEED POSTS
Predicting discussion activity levels is often hindered by including posts that yield no replies. We alleviate this problem by differentiating between seed posts and non-seeds through a binary classification task. Once seed posts have been identified, we then attempt to predict the level of discussion that such posts will generate. To this end, we look for the best classifier for identifying seed and non-seed posts and then search for the features that played key roles in distinguishing seed posts from non-seeds, thereby observing key features that are associated with discussions.
A. Experimental Setup

For our experiments we are using the previously described dataset collected from Boards.ie, containing both seeds and non-seeds throughout 2006. For our collection of posts we built the content, user, and focus features listed in Section III from the past 6 months of data leading up to the date on which the post was published, thereby ensuring no bias from future events in our dataset. We split the dataset into 3 sets using a 70/20/10% random split, providing a training set, a validation set and a test set.

Our first task was to perform model selection by testing four different classifiers: SVM, Naive Bayes, Maximum Entropy and J48 decision tree, when trained on various individual feature sets and their combinations: user features, content features and focus features. This model selection phase was performed by training each classifier, together with the combination of features, using the 70% training split and labelling instances in the held-out 20% validation split.

Once we had identified the best performing model, i.e., the classifier and combination of feature sets that produces the highest F1 value, our second task was to perform feature assessment, thereby identifying key features that contribute significantly to seed post prediction accuracy. For this we trained the best performing model from the model selection phase over the training split and tested its classification accuracy over the 10% test split, dropping individual features from the model and recording the reduction in accuracy following the omission of a given feature. Given that we are performing a binary classification task, we use the standard performance measures for such a scenario: precision, recall and F-measure, setting β = 1 for an equal weighting of precision and recall. We also measure the area under the Receiver Operator Characteristic curve to gauge the relationship between recall and fallout, i.e., the false positive rate.
TABLE II. RESULTS FROM THE CLASSIFICATION OF SEED POSTS USING VARYING FEATURE SETS AND CLASSIFICATION MODELS

                                 P      R      F1     ROC
User            SVM            0.775  0.810  0.774  0.581
                Naive Bayes    0.691  0.767  0.719  0.540
                Max Ent        0.776  0.806  0.722  0.556
                J48            0.778  0.809  0.734  0.582
Content         SVM            0.739  0.804  0.729  0.511
                Naive Bayes    0.730  0.794  0.740  0.616
                Max Ent        0.758  0.806  0.730  0.678
                J48            0.795  0.822  0.783  0.617
Focus           SVM            0.649  0.805  0.719  0.500
                Naive Bayes    0.710  0.737  0.722  0.588
                Max Ent        0.649  0.805  0.719  0.586
                J48            0.649  0.805  0.719  0.500
User + Content  SVM            0.790  0.808  0.727  0.509
                Naive Bayes    0.712  0.772  0.732  0.593
                Max Ent        0.767  0.807  0.734  0.671
                J48            0.795  0.821  0.779  0.675
User + Focus    SVM            0.776  0.810  0.776  0.583
                Naive Bayes    0.699  0.778  0.724  0.585
                Max Ent        0.771  0.806  0.722  0.607
                J48            0.777  0.810  0.742  0.617
Content + Focus SVM            0.750  0.805  0.729  0.511
                Naive Bayes    0.732  0.787  0.746  0.658
                Max Ent        0.762  0.807  0.731  0.692
                J48            0.798  0.823  0.787  0.662
All             SVM            0.791  0.808  0.727  0.510
                Naive Bayes    0.724  0.780  0.740  0.637
                Max Ent        0.768  0.808  0.733  0.688
                J48            0.798  0.824  0.792  0.692
B. Results: Model Selection
1) Model Selection with Individual Features: The results from our first experiments are shown in Table II. Looking first at individual feature sets, e.g., SVM together with user features, we see that content features yield improved predictive performance over user and focus features. On discussion forums, content appears to play a more central role
Positive/negative impact of features on Boards.ie
• What are the most important features for predicting seed posts?
• Correlations: – Referral counts (non-seeds) – Forum likelihood (seeds) – Informativeness (non-seeds) – Readability (seeds) – User age (non-seeds)
59
TABLE III. REDUCTION IN F1 LEVELS AS INDIVIDUAL FEATURES ARE DROPPED FROM THE J48 CLASSIFIER

Feature Dropped       F1
-                     0.815
Post Count            0.815
In-Degree             0.811 *
Out-Degree            0.811 *
User Age              0.807 ***
Post Rate             0.815
Forum Entropy         0.815
Forum Likelihood      0.798 ***
Post Length           0.810 **
Complexity            0.811 **
Readability           0.802 ***
Referral Count        0.793 ***
Time in Day           0.810 **
Informativeness       0.801 ***
Polarity              0.808 ***

Signif. codes: p-value < 0.001 '***', 0.01 '**', 0.05 '*', 0.1 '.'
hyperlinks (e.g., ads and spam). This contrasts with work on Twitter, which found that tweets containing many links were more likely to be 'retweeted' [11].
The boxplot for Forum Likelihood shows a correlation between seed posts and higher values of the likelihood measure, suggesting that users who frequently post in the same forums are more likely to start a discussion. Also, if a user often posts in discussion forums while concentrating on only a few select forums, then the likelihood that a new post is within one of those forums is high.
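A minimal reading of this measure, assuming forum likelihood is simply the empirical fraction of a user's past posts made in a given forum (the paper's exact definition is in an earlier section not reproduced here):

```python
from collections import Counter

def forum_likelihood(user_post_forums, forum):
    """Fraction of a user's past posts made in `forum` (illustrative definition)."""
    counts = Counter(user_post_forums)
    total = sum(counts.values())
    return counts[forum] / total if total else 0.0

# A user concentrated in a few forums gives new posts there a high likelihood.
history = ["politics", "politics", "politics", "movies"]
print(forum_likelihood(history, "politics"))  # 0.75
```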
Fig. 3. Boxplots showing the correlation of feature values with seed and non-seed posts within the training split
VI. REGRESSION: PREDICTING DISCUSSION ACTIVITY
Early detection of lengthy discussions helps analysts and managers focus attention on where activity and topical debates are about to occur. In this section we predict the level of discussion activity that seed posts will generate, and identify which features are key indicators of lengthy discussions. We use regression models that induce a function describing the relationship between the level of discussion activity and our user, content and focus features. By learning such a function we can identify patterns in the data and correlations between our dependent variable and the range of predictor variables that we have.
Fig. 4. Discussion Activity Length Distribution
A. Experimental Setup
Forecasting the exact number of replies (discussion activity) is limited if the distribution of known reply lengths is heavily skewed towards either the minimum or the maximum. For predicting popular tweets, Hang et al. [12] adopted a multiclass classification setting to deal with the large skew in their dataset by predicting retweet count ranges. We have a similar scenario in our Boards.ie dataset, where a large number of seed posts yield fewer than 20 replies (Figure 4). In such cases, utilising standard regression error measures such as Relative Absolute Error produces inaccurate assessments of the predictions, because the simple baseline predictor is based on the mean of the target variable.
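The binning strategy borrowed from [12] — turning raw reply counts into count-range classes to cope with the skew — can be sketched as follows; the bin edges here are illustrative, not taken from either paper:

```python
import numpy as np

# Illustrative reply-count ranges: 0 replies, 1-4, 5-19, 20-99, 100+
edges = [1, 5, 20, 100]
replies = np.array([0, 2, 7, 150, 19, 0, 3])
classes = np.digitize(replies, edges)  # class index = number of edges <= count
print(classes.tolist())  # [0, 1, 2, 4, 2, 0, 1]
```

A multiclass classifier can then be trained on these range labels instead of the raw, heavily skewed counts.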
In our experiments we instead use the Normalised Discounted Cumulative Gain (nDCG) at varying rank positions, looking at the performance of our predictions over the top-k documents where k = {1, 5, 10, 20, 50, 100}. nDCG is derived by dividing the Discounted Cumulative Gain (DCG) of the predicted ranking by the DCG of the ideal ranking (iDCG). DCG is well suited to our setting, given that we wish to predict the most popular posts and then expand that selection across growing ranks, as the measure penalises elements that appear lower in the ranking than they should. We define DCG formally, based on the definition from [9], as:
\[ \mathrm{DCG}_k = \sum_{i=1}^{k} \frac{rel_i}{\log_2(1 + i)} \qquad (5) \]
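Equation (5) translates directly into code; nDCG@k then divides this DCG by the DCG of the ideal, relevance-sorted ranking (iDCG):

```python
import math

def dcg_at_k(rels, k):
    """DCG_k = sum over i=1..k of rel_i / log2(1 + i), following Eq. (5)."""
    return sum(rel / math.log2(1 + i) for i, rel in enumerate(rels[:k], start=1))

def ndcg_at_k(predicted_rels, k):
    """Normalise by the DCG of the ideal (descending-relevance) ranking."""
    ideal = dcg_at_k(sorted(predicted_rels, reverse=True), k)
    return dcg_at_k(predicted_rels, k) / ideal if ideal > 0 else 0.0

# Relevance = actual reply counts, listed in the model's predicted rank order.
print(round(ndcg_at_k([10, 3, 7, 0, 1], k=5), 3))
```

A perfectly ordered ranking scores 1.0; misplacing highly replied-to posts lower in the ranking is penalised by the logarithmic discount.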
For our experiments we first identify the best performing regression model before moving on to analysing the coefficients of that model and the patterns in the data that lead to increased discussion activity. For our model selection phase we test three regression models: Linear regression, Isotonic
• What impact do features have on discussion length?
  – Assessed Linear Regression model with focus and content features
  – Forum Likelihood (pos)
  – Content Length (+/neutral)
  – Complexity (pos)
  – Readability (+/neutral)
  – Referral Count (neg)
  – Time in Day (+/neutral)
  – Informativeness (-/neutral)
  – Polarity (neg)
Predicting Discussion Activity in Boards.ie
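The slide above reads positive/negative effects off the signed coefficients of a fitted linear regression. A synthetic sketch of that step, with made-up data whose true weights match the reported signs for four of the features:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
features = ["Forum Likelihood", "Complexity", "Referral Count", "Polarity"]
X = rng.normal(size=(300, len(features)))
# Synthetic reply counts: positive weight on the first two features, negative on the last two.
y = 2.0 * X[:, 0] + 1.0 * X[:, 1] - 1.5 * X[:, 2] - 0.5 * X[:, 3] + rng.normal(size=300)

model = LinearRegression().fit(X, y)
for name, coef in zip(features, model.coef_):
    print(f"{name}: {'pos' if coef > 0 else 'neg'} ({coef:+.2f})")
```

On real Boards.ie data the same read-out yields the (pos)/(neg)/(neutral) labels in the slide.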
Stay tuned
• More communities
  – SAP, IBM, StackOverflow, Reddit
  – Compare impact of features on their dynamics
• Better behaviour analysis
  – Fewer features, more forums/communities, more graphs!
  – Healthy? posts, reciprocation, discussions, sentiment mixture
• Churn analysis
  – Correlation of features/behaviour to 'bounce rate' (WebSci11 best paper)
• Intervention!
  – Opportunities and mechanisms to influence behaviour
Upcoming events
Intelligent Web Services Meet Social Computing AAAI Spring Symposium 2012,
March 26-28, Stanford, California
http://vitvar.com/events/aaai-ss12 Deadline: October 7, 2011
Social Object Networks IEEE Social Computing, 2011 October 9-10, Boston, USA
http://ir.ii.uam.es/socialobjects2011/ Deadline: August 5, 2011
My social semantics team
Acknowledgement
Sofia Angeletou Research Associate
Matthew Rowe Research Associate
Alain Barrat CPT Marseille & ISI
Martin Szomszor CeRC, City University, UK
Wouter van Den Broeck ISI, Turin
Ciro Cattuto ISI, Turin
Live Social Semantics team
Gianluca Correndo, Uni Southampton Ivan Cantador, UAM, Madrid
STI International ESWC09/10 & HT09 chairs and organisers
All LSS participants