Harith Alani's presentation at SSSW 2011
TRANSCRIPT
Live Social Semantics
& online community monitoring
1
Harith Alani, Knowledge Media Institute, The Open University, UK
Semantic Web Summer School, Cercedilla, Spain, 2011
http://twitter.com/halani http://delicious.com/halani http://www.linkedin.com/pub/harith-alani/9/739/534
Market value of Web Analytics
2
Tag-Along Marketing The New York Times, November 6, 2010
“Everything is in place for location-based social networking to be the next big thing. Tech companies are building the platforms, venture capitalists are providing the cash and marketers are eager to develop advertising. “
Location, Sensors, & Social Networking
3
Location, Sensors, & Social Networking
4
The Canine Twitterer
“Having my daily workout. Already did 15 leg lifts!”
Monitoring online/offline social activity
5
Where is everybody?
Monitoring online/offline social activity
• Generating opportunities for F2F networking
6
Tracking of F2F contact networks
7
TraceEncounters, 2004
Sociometer, MIT, 2002:
- F2F and productivity
- F2F dynamics
- Who are the key players?
- F2F and office distance
8
SocioPatterns platform
http://www.sociopatterns.org/
9
Convergence with online social networks
• Claim: digital social networking increases physical social isolation
• Effects attributed to that isolation:
  – Genetic alterations
  – Weakened immune system
  – Less resistance to cancer
  – Higher risk of heart disease
  – Higher blood pressure
  – Faster dementia
  – Narrower arteries
Aric Sigman, “Well Connected? The Biological Implications of 'Social Networking’”, Biologist, 56(1), 2009
10
Online vs. offline social networking
• Digital networking increases social interaction:
  – Creates more opportunities to network
  – Supports and increases F2F contact
  – Stronger offline social ties → more online communication
  – Stronger offline social ties → more diverse online communications
  – F2F is the medium of choice in weaker social ties
Barry Wellman, "The Glocal Village: Internet and Community", Idea's: The Arts & Science Review, University of Toronto, 1(1), 2004
Offline + online social networking
11 ESWC2010
Where should I go?
Where have I met this guy?
Anyone I know here?
Who should I talk to?
<?xml version="1.0"?>
<rdf:RDF
  xmlns="http://tagora.ecs.soton.ac.uk/schemas/tagging#"
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
  xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
  xmlns:owl="http://www.w3.org/2002/07/owl#"
  xml:base="http://tagora.ecs.soton.ac.uk/schemas/tagging">
  <owl:Ontology rdf:about=""/>
  <owl:Class rdf:ID="Post"/>
  <owl:Class rdf:ID="TagInfo"/>
  <owl:Class rdf:ID="GlobalCooccurrenceInfo"/>
  <owl:Class rdf:ID="DomainCooccurrenceInfo"/>
  <owl:Class rdf:ID="UserTag"/>
  <owl:Class rdf:ID="UserCooccurrenceInfo"/>
  <owl:Class rdf:ID="Resource"/>
  <owl:Class rdf:ID="GlobalTag"/>
  <owl:Class rdf:ID="Tagger"/>
  <owl:Class rdf:ID="DomainTag"/>
  <owl:ObjectProperty rdf:ID="hasPostTag">
    <rdfs:domain rdf:resource="#TagInfo"/>
  </owl:ObjectProperty>
  <owl:ObjectProperty rdf:ID="hasDomainTag">
    <rdfs:domain rdf:resource="#UserTag"/>
  </owl:ObjectProperty>
  <owl:ObjectProperty rdf:ID="isFilteredTo">
    <rdfs:range rdf:resource="#GlobalTag"/>
    <rdfs:domain rdf:resource="#GlobalTag"/>
  </owl:ObjectProperty>
  <owl:ObjectProperty rdf:ID="hasResource">
    <rdfs:domain rdf:resource="#Post"/>
    <rdfs:range =…
Live Social Semantics (LSS): RFIDs + Social Web + Semantic Web
• Integration of physical presence and online information
• Semantic user profile generation
• Logging of face-to-face contact
• Social network browsing
• Analysis of online vs. offline social networks
Live Social Semantics: architecture
13
[Architecture diagram: RFID badges and RFID readers feed real-world contact data to a local server. An extractor daemon and Connect API pull users' networks and interests from web-based systems (Delicious, Flickr, LastFM, Facebook). A profile builder, with tag disambiguation and tag-to-URI services backed by the TAGora Sense Repository, maps tags and MusicBrainz ids to URIs (mbid → dbpedia URI, tag → dbpedia URI). An aggregator with an RDF cache collects publications, social tagging, social networks, contacts, and communities of practice from semanticweb.org, rkbexplorer.com, dbpedia.org and dbtune.org. Everything is stored in a JXT triple store and exposed through a web interface, linked data, and visualizations.]
[Data-flow diagram: contacts data, tags, URIs and social semantics feed the web interface, linked data and visualization; publications, co-authorship networks and SW resources come from data.semanticweb.org and rkbexplorer.com.]
14
[Community-of-practice (CoP) diagram linking authors, chairs, and proceedings chairs to a conference.]
http://data.semanticweb.org/
http://www.rkbexplorer.com/
15
Social and information networks
16
Merging social networks
FOAF
Distinct, Separated Identity Management
http://tagora.ecs.soton.ac.uk/delicious/halani
http://tagora.ecs.soton.ac.uk/flickr/69749885@N00
http://tagora.ecs.soton.ac.uk/lastfm/halani
http://tagora.ecs.soton.ac.uk/facebook/568493878
Harith Alani
http://data.semanticweb.org/person/harith-alani/
http://southampton.rkbexplorer.com/id/person-05877
http://tagora.ecs.soton.ac.uk/LiveSocialSemantics/eswc2009/1139
http://tagora.ecs.soton.ac.uk/LiveSocialSemantics/eswc2009/foaf/2
Delicious Tagging and Network
Flickr Tagging and Contacts
Last.fm favourite artists and friends
Facebook contacts
RFID Contact Data
Conference Publication Data
Past Publications, Projects, Communities of Practice
18
Tag Filtering Service
Semantic modeling Semantic analysis Collective intelligence Statistical analysis Syntactical analysis
19
Tag Filtering Service
20
From Tags to Semantics
20
21
Tags to User Interests
21
22
From raw tags and social relations to Structured Data
User raw data
Structured data
Collective intelligence
ontologies
Semantic data
23
RFIDs for tracking social contact
23
People contact → RFID → RDF triples
24
F2FContact model:
• hasContact: foaf:Person1 → F2FContact
• contactWith: F2FContact → foaf:Person2
• contactDate: F2FContact → XMLSchema#date
• contactDuration: F2FContact → XMLSchema#time
• contactPlace: F2FContact → Place
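The mapping above can be sketched in code. A minimal sketch that serialises one RFID-logged contact event as N-Triples, using the slide's property names (hasContact, contactWith, contactDate, contactDuration, contactPlace); the namespace URI is a placeholder assumption, and the duration is simplified to seconds typed as xsd:int rather than the slide's xsd:time:

```python
# Sketch: one RFID contact event as N-Triples strings.
# "http://example.org/lss#" is a placeholder namespace, not the
# project's actual one.
LSS = "http://example.org/lss#"
FOAF = "http://xmlns.com/foaf/0.1/"
XSD = "http://www.w3.org/2001/XMLSchema#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

def contact_to_ntriples(contact_id, person1, person2, day, seconds, place):
    """Serialise a face-to-face contact event as a list of N-Triples lines."""
    c = f"<{LSS}{contact_id}>"
    return [
        f"{c} <{RDF_TYPE}> <{LSS}F2FContact> .",
        f"<{FOAF}{person1}> <{LSS}hasContact> {c} .",
        f"{c} <{LSS}contactWith> <{FOAF}{person2}> .",
        f'{c} <{LSS}contactDate> "{day}"^^<{XSD}date> .',
        f'{c} <{LSS}contactDuration> "{seconds}"^^<{XSD}int> .',
        f"{c} <{LSS}contactPlace> <{LSS}{place}> .",
    ]
```

Each event thus yields six triples that a triple store can ingest directly.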
25
26
Real-time F2F networks with SNS links
http://www.vimeo.com/6590604
27
Deployed at:
Live Social Semantics
Data analysis
• Face-to-face interactions across scientific conferences
• Networking behaviour of frequent users
• Correlations between scientific seniority and social networking
• Comparison of F2F contact network with Twitter and Facebook
• Social networking with online and offline friends
Characteristics of F2F contact network
• Degree: number of people with whom the person had at least one F2F contact
• Strength: total time the person spent in F2F contact
• Edge weight: total time spent by a pair of users in F2F contact
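These three measures can be computed directly from the raw contact log. A minimal sketch with toy data (the event tuples and names are illustrative, not from the conference deployments):

```python
from collections import defaultdict

# Compute degree, strength, and edge weight from contact events,
# each event being (person_a, person_b, duration_in_minutes).
def network_measures(events):
    weight = defaultdict(float)              # total F2F time per pair of users
    for a, b, dur in events:
        weight[frozenset((a, b))] += dur
    neighbours = defaultdict(set)
    strength = defaultdict(float)
    for pair, w in weight.items():
        a, b = tuple(pair)
        neighbours[a].add(b); neighbours[b].add(a)
        strength[a] += w;     strength[b] += w
    degree = {p: len(ns) for p, ns in neighbours.items()}
    return degree, dict(strength), dict(weight)

events = [("ann", "bob", 2), ("ann", "bob", 3), ("ann", "cat", 1)]
degree, strength, weight = network_measures(events)
# ann met 2 distinct people for 6 minutes; the ann-bob edge weighs 5
```

Note that weight is per edge while strength is per node, which is why the two are accumulated in separate passes.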
28
Network characteristics

                       ESWC 2009   HT 2009   ESWC 2010
Number of users              175       113         158
Average degree                54        39          55
Avg. strength (mn)           143       123         130
Avg. weight (mn)            2.65      3.15        2.35
Weights ≤ 1 mn               70%       67%         74%
Weights ≤ 5 mn               90%       89%         93%
Weights ≤ 10 mn              95%       94%         96%
Characteristics of F2F contact events

                             ESWC 2009   HT 2009   ESWC 2010
Number of contact events         16258      9875       14671
Average contact length (s)          46        42          42
Contacts ≤ 1 mn                    87%       89%         88%
Contacts ≤ 2 mn                    94%       96%         95%
Contacts ≤ 5 mn                    99%       99%         99%
Contacts ≤ 10 mn                 99.8%     99.8%       99.8%
F2F contact pattern is very similar for all three conferences
F2F contacts of returning users
[Log-log scatter plots comparing ESWC2009 (x-axis) with ESWC2010 (y-axis) for returning users: degree, total interaction time, and links' weights.]
30
• Degree: number of other participants with whom an attendee has interacted
• Total time: total time an attendee spent in interaction
• Link weight: total time spent in F2F interaction by a pair of returning attendees in 2010, versus the same quantity measured in 2009
Time spent on F2F networking by frequent users is stable, even when the list of people they networked with changed
ESWC 2009 & ESWC 2010         Pearson correlation
Degree                                      0.37
Total F2F interaction time                  0.76
Link weight                                 0.75
Average seniority of neighbours in F2F networks
[Scatter plot: seniority (number of papers, 0-10, x-axis) vs. average seniority of neighbours (0-5, y-axis), for the unweighted (sen_n), weighted (sen_n,w) and strongest-link (sen_n,max) averages.]
31
• No clear pattern is observed if the unweighted average over all neighbours in the aggregated network is considered
• A correlation is observed when each neighbour is weighted by the time spent with the main person
• The correlation becomes much stronger when considering for each individual only the neighbour with whom the most time was spent
Avg. seniority of the neighbours, with weighted averages
Seniority of the user with the strongest link
Conference attendees tend to network with others of similar levels of scientific seniority
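The three neighbour-seniority aggregates compared above can be sketched for a single attendee as follows; the function names and toy numbers are mine, not the study's:

```python
# Three ways to aggregate neighbour seniority for one attendee.
# `contacts` maps neighbour -> total F2F time (minutes);
# `seniority` maps person -> number of papers. Toy values.
def unweighted_avg(contacts, seniority):
    """Plain average over all neighbours (no clear pattern observed)."""
    return sum(seniority[n] for n in contacts) / len(contacts)

def weighted_avg(contacts, seniority):
    """Each neighbour weighted by time spent with the attendee."""
    total = sum(contacts.values())
    return sum(t * seniority[n] for n, t in contacts.items()) / total

def strongest_link(contacts, seniority):
    """Seniority of the single neighbour with the most shared time."""
    return seniority[max(contacts, key=contacts.get)]

contacts = {"bea": 30, "carl": 5}
seniority = {"bea": 8, "carl": 2}
# unweighted: 5.0; weighted: (30*8 + 5*2)/35 ≈ 7.14; strongest link: 8
```

Weighting pulls the aggregate towards the neighbours one actually spends time with, which is why the correlation strengthens from the first measure to the third.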
Presence of Attendees, HT2009
Offline networking vs online networking
33
• People who have a large number of friends on Twitter and/or Facebook do not appear to be the most socially active in the offline world, compared to other SNS users
Users with Facebook and Twitter accounts in ESWC 2010
Twitterers                        Spearman correlation (ρ)
Tweets – F2F degree                                 -0.15
Tweets – F2F strength                               -0.15
Twitter following – F2F degree                      -0.21
No strong correlation between amount of F2F contact activity and size of online social networks
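Spearman's ρ, as reported in the table above, can be computed with the standard library alone by ranking both variables (averaging ranks over ties) and taking the Pearson correlation of the ranks:

```python
# Spearman rank correlation, standard library only.
def _ranks(xs):
    """1-based ranks; tied values share the average rank of their group."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Pearson correlation of the rank vectors of xs and ys."""
    rx, ry = _ranks(xs), _ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```

Perfectly opposite rankings give ρ = -1, matching the sign convention of the negative values in the table.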
Scientific seniority vs Twitter followers
34
• Comparison between people’s scientific seniority and the number of people following them on Twitter
People who have the highest number of Twitter followers are not necessarily the most scientifically senior, although they do have high visibility and experience
Twitter users                  Correlation
H-index – Twitter followers           0.32
H-index – Tweets                     -0.13
Conference Chairs

                        avg. degree   avg. strength   avg. weight   avg. events per edge
All participants 2009            55            8590           159                   3.44
Chairs 2009                    77.7           19590           500                      8
All participants 2010            54            7807           141                   3.37
Chairs 2010                    77.6           22520           674                     12
• Conf chairs interact with more distinct people (larger average degree)
• Conf chairs spend more time in F2F interaction (almost three times as much as a random participant)
Networking with online and offline 'friends'

Characteristics                        all users   coauthors   Facebook friends   Twitter followers
Average contact duration (s)                  42          75                 63                  72
Average edge weight (s)                      141        4470                830                1010
Average number of events per edge           3.37          60                 13                  14
• Individuals sharing an online or professional social link meet much more often than other individuals
• Average number of encounters, and total time spent in interaction, is highest for co-authors
F2F contacts with Facebook and Twitter friends were respectively 50% and 71% longer, and 286% and 315% more frequent, than with others. Participants spent 79% more time in F2F contact with their co-authors, and met them 1680% more often than non-co-authors.
Twitterers vs Non-Twitterers
• Time spent in conference rooms: Twitter users spent on average 11.4% more time in the conference rooms than non-Twitter users (mean is 26% higher)
• Number of people met F2F during the conference: Twitter users met on average 9% more people F2F (mean 8% higher)
• Duration of F2F contacts: Twitter users spent on average 63% more time in F2F contact than non-Twitter users (mean is 20% higher)
37
Behaviour of individuals – micro level analysis
38
[Annotated per-user plot of normalised F2F degree and F2F strength (legend partially garbled in extraction): healthy scientific and social profiles (chairs); good scientific and social signals; shy scientist?; outsider, high profile; students, developers; who is the next star researcher?]
Behaviour analysis
Jeffrey Chan, Conor Hayes, and Elizabeth Daly. Decomposing discussion forums using common user roles. In Proc. Web Science Conf. (WebSci10), Raleigh, NC: US, 2010
Role Skeleton
Encoding Rules in Ontologies with SPIN
Approach for inferring User Roles
42
• Features: structural, social network, reciprocity, persistence, participation
• Feature levels change with the dynamics of the community
• Associate roles with a collection of feature-to-level mappings, e.g. in-degree → high, out-degree → high
• Run the rules over each user's features and derive the role composition
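The pipeline above might be sketched as follows; the role definitions and the median-based binning of features into levels are illustrative assumptions, not the paper's actual role skeleton or SPIN rules:

```python
from statistics import median

# Illustrative role rules: each role is a set of required
# feature -> level mappings (assumed, not from the paper).
ROLES = {
    "popular initiator": {"in-degree": "high", "out-degree": "high"},
    "lurker":            {"in-degree": "low",  "out-degree": "low"},
}

def to_levels(user_features, community):
    """Bin each raw feature value to 'high'/'low' relative to the
    community median, so levels track the community's dynamics."""
    return {f: ("high" if v >= median(community[f]) else "low")
            for f, v in user_features.items()}

def infer_roles(user_features, community):
    """Return every role whose feature-to-level rules all hold."""
    levels = to_levels(user_features, community)
    return [role for role, rules in ROLES.items()
            if all(levels.get(f) == lvl for f, lvl in rules.items())]

community = {"in-degree": [1, 2, 50, 60], "out-degree": [0, 3, 40, 70]}
# a user well above both medians matches "popular initiator"
```

Because levels are computed against the current community distribution, the same raw value can map to different levels as the community evolves, which is the point of the second bullet above.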
Data from Boards.ie
• Forum 246 (Commuting and Transport): demonstrates a clear increase in activity over time
• Forum 388 (Rugby): exhibits periodic increases and decreases in activity, and hence provides good examples of healthy/unhealthy evolutions
• Forum 411 (Mobile Phones and PDAs): increase in activity over time with some fluctuation, i.e. reductions and increases over various time windows
• Time period: 2004-01 to 2006-12
Results
• Correlation of individual features in each of the three forums
Commuting and Transport Rugby Mobile Phones and PDAs
Results
[Activity plots: (a) Forum 246: Commuting and Transport; (b) Forum 388: Rugby; (c) Forum 411: Mobile Phones and PDAs]
• Variation in behaviour composition & activity
• Behaviour composition in/stability influences forum activity
Prediction analysis – preliminary results!
• Predicting rise/fall in post submission numbers
• Binary classification
• Features : Community composition, roles and percentages of users associated with each
• Cross-community predictions are less reliable than individual community analysis due to the idiosyncratic behaviour observed in each individual community
Forum P R F1 ROC
246 0.799 0.769 0.780 0.800
388 0.603 0.615 0.605 0.775
411 0.765 0.692 0.714 0.617
All 0.583 0.667 0.607 0.466
Rise and fall of social networks
47
Predicting engagement
• Which posts will receive a reply? – What are the most influential features here?
• How much discussion will it generate? – What are the key factors of lengthy discussions?
48
Common online community features
49
initial tweet that generates a reply. Features which describe seed posts can be divided into two sets: user features, attributes that define the user making the post; and content features, attributes that are based solely on the post itself. We wish to explore the application of such features in identifying seed posts; to do this we train several machine learning classifiers and report on our findings. However, we first describe the features used.
4.1 Feature Extraction
The likelihood of posts eliciting replies depends upon popularity, a highly subjective term influenced by external factors. Properties influencing popularity include user attributes, describing the reputation of the user, and attributes of a post's content, generally referred to as content features. In Table 1 we define user and content features and study their influence on the discussion "continuation".
Table 1. User and Content Features

User features:
  In Degree: number of followers of U (#)
  Out Degree: number of users U follows (#)
  List Degree: number of lists U appears on; lists group users by topic (#)
  Post Count: total number of posts the user has ever posted (#)
  User Age: number of minutes from user join date (#)
  Post Rate: posting frequency of the user (PostCount / UserAge)

Content features:
  Post Length: length of the post in characters (#)
  Complexity: cumulative entropy of the unique words in post p, with total word length n and p_i the frequency of each word: Σ_{i∈[1,n]} p_i (log n − log p_i)
  Uppercase Count: number of uppercase words (#)
  Readability: Gunning fog index using average sentence length (ASL) and the percentage of complex words (PCW) [7]: 0.4 (ASL + PCW)
  Verb Count: number of verbs (#)
  Noun Count: number of nouns (#)
  Adjective Count: number of adjectives (#)
  Referral Count: number of @user mentions (#)
  Time in the Day: normalised time in the day, measured in minutes (#)
  Informativeness: terminological novelty of the post w.r.t. other posts, the cumulative tf-idf value of each term t in post p: Σ_{t∈p} tfidf(t, p)
  Polarity: cumulation of polar term weights in p (using the SentiWordNet lexicon), normalised by polar term count: (Po + Ne) / |terms|
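Two of the content features above can be sketched in code. The complex-word test used for the fog index here (words of 8+ letters standing in for the usual 3+ syllables) is a rough assumption for illustration:

```python
import math

def complexity(post):
    """Cumulative entropy of the unique words in a post:
    sum over unique words of p_i * (log n - log p_i),
    with p_i the word's count and n the total word count."""
    words = post.lower().split()
    n = len(words)
    freqs = {w: words.count(w) for w in set(words)}
    return sum(p * (math.log(n) - math.log(p)) for p in freqs.values())

def gunning_fog(post):
    """Gunning fog index 0.4 * (ASL + PCW), where ASL is average
    sentence length and PCW the percentage of complex words
    (approximated here as words of 8+ letters)."""
    sentences = [s for s in post.replace("!", ".").replace("?", ".").split(".")
                 if s.strip()]
    words = post.split()
    asl = len(words) / len(sentences)
    pcw = 100 * sum(1 for w in words if len(w) >= 8) / len(words)
    return 0.4 * (asl + pcw)
```

A repetitive post scores low on complexity, while long sentences full of long words drive the fog index up.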
4.2 Experiments

Experiments are intended to test the performance of different classification models in identifying seed posts. We therefore used four classifiers: the discriminative classifiers Perceptron and SVM, the generative classifier Naive Bayes, and the decision-tree classifier J48. For each classifier we used three feature settings: user features, content features, and user+content features.

Datasets. For our experiments we used two datasets of tweets available on the Web: the Haiti earthquake tweets and the State of the Union Address tweets.

http://infochimps.com/datasets/twitter-haiti-earthquake-data
http://infochimps.com/datasets/tweets-during-state-of-the-union-address
• How do all these features influence activity generation in an online community?
  – Such knowledge leads to better use and management of the community
Experiment for identifying Twitter seed posts
• Twitter data on the Haiti earthquake, and the Union Address
• Evaluated a binary classification task – Is this post a seed post or not?
50
Dataset Users Tweets Seeds Non-seeds Replies
Haiti 44,497 65,022 1,405 60,686 2,931
Union Address 66,300 80,272 7,228 55,169 17,875
Identifying seeds with different type of features
51
use the F-measure, defined in Equation 1 as the weighted harmonic mean between precision and recall, setting β = 1 to weight precision and recall equally. We also plot the Receiver Operator Curve of our trained models to show graphical comparisons of performance.

F_β = (1 + β²) · P · R / (β² · P + R)    (1)

For our experiments we divided each dataset into 3 sets: a training set, a validation set and a testing set, using a 70/20/10 split. We trained our classification models using the training split and then applied them to the validation set, labelling the posts within this split. From these initial results we performed model selection by choosing the best performing model, based on maximising the F1 score, and used this model together with the best performing features, using a ranking heuristic, to classify posts contained within our test split. We first report the results obtained from our model selection phase, before moving on to our results from using the best model with the top-k features.
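Equation 1 in code; with β = 1 it reduces to the familiar harmonic mean of precision and recall:

```python
# F-measure: weighted harmonic mean of precision p and recall r.
# beta > 1 favours recall, beta < 1 favours precision.
def f_measure(p, r, beta=1.0):
    return (1 + beta**2) * p * r / (beta**2 * p + r)

# with p = r, the F-measure equals that common value
```

Because it is a harmonic mean, a model cannot score well by maximising only one of the two quantities, which is why it is used for model selection here.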
Table 3. Results from the classification of seed posts using varying feature sets and classification models

(a) Haiti dataset
                  P      R      F1     ROC
User     Perc   0.794  0.528  0.634  0.727
         SVM    0.843  0.159  0.267  0.566
         NB     0.948  0.269  0.420  0.785
         J48    0.906  0.679  0.776  0.822
Content  Perc   0.875  0.077  0.142  0.606
         SVM    0.552  0.727  0.627  0.589
         NB     0.721  0.638  0.677  0.769
         J48    0.685  0.705  0.695  0.711
All      Perc   0.794  0.528  0.634  0.726
         SVM    0.483  0.996  0.651  0.502
         NB     0.962  0.280  0.434  0.852
         J48    0.824  0.775  0.798  0.836

(b) Union Address dataset
                  P      R      F1     ROC
User     Perc   0.658  0.697  0.677  0.673
         SVM    0.510  0.946  0.663  0.512
         NB     0.844  0.086  0.157  0.707
         J48    0.851  0.722  0.782  0.830
Content  Perc   0.467  0.698  0.560  0.457
         SVM    0.650  0.589  0.618  0.638
         NB     0.762  0.212  0.332  0.649
         J48    0.740  0.533  0.619  0.736
All      Perc   0.630  0.762  0.690  0.672
         SVM    0.499  0.990  0.664  0.506
         NB     0.874  0.212  0.341  0.737
         J48    0.890  0.810  0.848  0.877
4.3 Results
Our findings from Table 3 demonstrate the effectiveness of using solely user features for identifying seed posts. In both the Haiti and Union Address datasets, training a classification model using user features shows improved performance over the same models trained using content features. In the case of the Union dataset we are able to achieve an F1 score of 0.782, coupled with high precision, when using the J48 decision-tree classifier, where the latter figure (precision) indicates conservative estimates using only user features. We also achieve similarly high levels of precision when using the same classifier on the Haiti dataset. The plots of the Receiver Operator Characteristic (ROC) curves in Figure 2 show similar levels of performance for each classifier over the two corpora. When using solely user features, J48 is shown to dominate the ROC space, subsuming the plots of the other models. A similar behaviour is exhibited by the Naive Bayes classifier, where SVM and Perceptron are each outperformed. The plots also demonstrate the poor recall levels when using only content features, where each model fails to match the performance achieved with user features alone.
• User features are most important in Twitter
• But combining user & content features gives best results
Impact of different features in Twitter
• What features have the highest impact on identification of seed posts?
• Rank features by information gain ratio wrt seed post class label
52
which we found to be 0.674, indicating a good correlation between the two lists and their respective ranks.

Table 4. Features ranked by Information Gain Ratio w.r.t. the Seed Post class label. Each feature name is paired with its IG in brackets.

Rank  Haiti                             Union Address
1     user-list-degree (0.275)          user-list-degree (0.319)
2     user-in-degree (0.221)            content-time-in-day (0.152)
3     content-informativeness (0.154)   user-in-degree (0.133)
4     user-num-posts (0.111)            user-num-posts (0.104)
5     content-time-in-day (0.089)       user-post-rate (0.075)
6     user-post-rate (0.075)            user-out-degree (0.056)
7     content-polarity (0.064)          content-referral-count (0.030)
8     user-out-degree (0.040)           user-age (0.015)
9     content-referral-count (0.038)    content-polarity (0.015)
10    content-length (0.020)            content-length (0.010)
11    content-readability (0.018)       content-complexity (0.004)
12    user-age (0.015)                  content-noun-count (0.002)
13    content-uppercase-count (0.012)   content-readability (0.001)
14    content-noun-count (0.010)        content-verb-count (0.001)
15    content-adj-count (0.005)         content-adj-count (0.0)
16    content-complexity (0.0)          content-informativeness (0.0)
17    content-verb-count (0.0)          content-uppercase-count (0.0)

Fig. 3. Contributions of top-5 features to identifying non-seeds (N) and seeds (S). Upper plots are for the Haiti dataset, lower plots for the Union Address dataset.

The top-most ranks from each dataset are dominated by user features, including list-degree, in-degree, num-of-posts and post-rate. Such features describe a user's reputation, where higher values are associated with seed posts. Figure 3 shows the contributions of each of the top-5 features to class decisions in the training set, where the list-degree and in-degree of the user are seen to correlate heavily with seed posts. Using these rankings, our next experiment explored the effects of training a classification model using only the top-k features, observing
Positive/negative impact of features
• What is the correlation between seed posts and features?
53
[Repeat of Table 4 and Fig. 3, per dataset: Haiti (upper), Union Address (lower)]
Predicting discussion activity on Twitter
• Reply rates: Haiti 1-74 responses, Union Address 1-75 responses
• Compare rankings: ground truth vs. predicted
• Experiments:
  – Using the Haiti and Union Address datasets
  – Evaluate predicted rank k, where k = {1, 5, 10, 20, 50, 100}
  – Support Vector Regression with user, content, and user+content features
54
Dataset          Training size   Test size   Test vol. mean   Test vol. SD
Haiti                      980         210            1.664          3.017
Union Address            5,067       1,161            1.761          2.342
Predicting discussion activity on Twitter
55
Haiti dataset Union Address dataset
• Content features are key for top ranks
• User features are more important for higher ranks
56
Identifying seed posts in Boards.ie
• Used the same features as before:
  – User features: in-degree, out-degree, post count, user age, post rate
  – Content features: post length, complexity, readability, referral count, time in day, informativeness, polarity
• New features designed to capture user affinity:
  – Forum entropy: concentration of forum activity; higher entropy = larger forum spread
  – Forum likelihood: likelihood of a forum post given the user's history; combines post history with incoming data
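A straightforward reading of the two affinity features, sketched from a user's posting history as a list of forum ids (the paper's exact estimators may differ):

```python
import math
from collections import Counter

def forum_entropy(history):
    """Shannon entropy of the user's posts over forums:
    0 if all posts are in one forum, higher the wider the spread."""
    counts = Counter(history)
    n = len(history)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def forum_likelihood(history, forum):
    """Maximum-likelihood estimate of P(forum | user's history)."""
    return history.count(forum) / len(history)

history = ["rugby"] * 8 + ["commuting"] * 2
# a user concentrated in one forum has low entropy and a high
# likelihood of posting there again
```

A user who posts a lot but spreads across many forums gets high entropy and a low likelihood for any single forum, which is the distinction the two features are meant to capture.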
57
Experiment for identifying seed posts
• Used all posts from Boards.ie in 2006
• Built features using a 6-month window prior to seed post date
• Evaluated a binary classification task:
  – Is this post a seed post or not?
  – Precision, Recall, F1 and Accuracy
  – Tested: user, content, focus features, and their combinations
Posts Seeds Non-Seeds Replies Users
1,942,030 90,765 21,800 1,829,465 29,908
58
Identifying seeds with different type of features
activity levels, and because it has already been used in other investigations (e.g., [14]).

Boards.ie does not provide explicit social relations between community members, unlike for example Facebook and Twitter. We followed the same strategy proposed in [3] for extracting social networks from Digg, and built the Boards.ie social network for users, weighting edges cumulatively by the number of replies between any two users.
TABLE I. DESCRIPTION OF THE BOARDS.IE DATASET

Posts       Seeds    Non-Seeds   Replies     Users
1,942,030   90,765   21,800      1,829,465   29,908
In order to derive our features we required a window of n days from which the social graph can be compiled and relevant measurements taken. Based on previous work over the same dataset in [14], we used a similar window of 188 days (roughly 6 months) prior to the post date of a given seed or non-seed post. For instance, if a seed post p is made at time t, then the window from which the features (i.e., user and focus features) are derived is from t − 188 to t − 1. In using this heuristic we ensure that the features compiled for each post are independent of future outcomes and will not bias our predictions; for example, a user may increase their activity following the seed post, which would not be a true indicator of their behaviour at the time the post was made. Table I summarises the dataset and the number of posts (seeds, non-seeds and replies) and users contained within.
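The windowing heuristic can be sketched as a simple predicate; the dates are illustrative:

```python
from datetime import date, timedelta

# For a post made at time t, only activity in [t - 188 days, t - 1 day]
# contributes to its features, so nothing that happens after (or on)
# the posting day can leak into them.
def in_feature_window(post_date, event_date, window_days=188):
    return (post_date - timedelta(days=window_days)
            <= event_date
            <= post_date - timedelta(days=1))

t = date(2006, 7, 1)
```

Excluding the posting day itself is what prevents, say, a reply-driven burst of activity on that day from inflating the poster's apparent engagement.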
V. CLASSIFICATION: DETECTING SEED POSTS
Predicting discussion activity levels is often hindered by including posts that yield no replies. We alleviate this problem by differentiating between seed posts and non-seeds through a binary classification task. Once seed posts have been identified, we then attempt to predict the level of discussion that such posts will generate. To this end, we look for the best classifier for identifying seed and non-seed posts and then search for the features that played key roles in distinguishing seed posts from non-seeds, thereby observing key features that are associated with discussions.
A. Experimental Setup

For our experiments we are using the previously described dataset collected from Boards.ie, containing both seeds and non-seeds throughout 2006. For our collection of posts we built the content, user, and focus features listed in Section III from the past 6 months of data leading up to the date on which the post was published, thereby ensuring no bias from future events in our dataset. We split the dataset into 3 sets using a 70/20/10% random split, providing a training set, a validation set and a test set.

Our first task was to perform model selection by testing four different classifiers: SVM, Naive Bayes, Maximum Entropy and J48 decision tree, when trained on various individual feature sets and their combinations: user features, content features and focus features. This model selection phase was performed by training each classifier, together with the combination of features, using the 70% training split and labelling instances in the held-out 20% validation split.

Once we had identified the best performing model, i.e., the classifier and combination of feature sets that produces the highest F1 value, our second task was to perform feature assessment, thereby identifying key features that contribute significantly to seed post prediction accuracy. For this we trained the best performing model from the model selection phase over the training split and tested its classification accuracy over the 10% test split, dropping individual features from the model and recording the reduction in accuracy following the omission of a given feature. Given that we are performing a binary classification task, we use the standard performance measures for such a scenario: precision, recall and F-measure, setting β = 1 for an equal weighting of precision and recall. We also measure the area under the Receiver Operator Characteristic curve to gauge the relationship between recall and fallout, i.e., the false positive rate.
TABLE II. RESULTS FROM THE CLASSIFICATION OF SEED POSTS USING VARYING FEATURE SETS AND CLASSIFICATION MODELS

                                 P      R      F1     ROC
User            SVM            0.775  0.810  0.774  0.581
                Naive Bayes    0.691  0.767  0.719  0.540
                Max Ent        0.776  0.806  0.722  0.556
                J48            0.778  0.809  0.734  0.582
Content         SVM            0.739  0.804  0.729  0.511
                Naive Bayes    0.730  0.794  0.740  0.616
                Max Ent        0.758  0.806  0.730  0.678
                J48            0.795  0.822  0.783  0.617
Focus           SVM            0.649  0.805  0.719  0.500
                Naive Bayes    0.710  0.737  0.722  0.588
                Max Ent        0.649  0.805  0.719  0.586
                J48            0.649  0.805  0.719  0.500
User + Content  SVM            0.790  0.808  0.727  0.509
                Naive Bayes    0.712  0.772  0.732  0.593
                Max Ent        0.767  0.807  0.734  0.671
                J48            0.795  0.821  0.779  0.675
User + Focus    SVM            0.776  0.810  0.776  0.583
                Naive Bayes    0.699  0.778  0.724  0.585
                Max Ent        0.771  0.806  0.722  0.607
                J48            0.777  0.810  0.742  0.617
Content + Focus SVM            0.750  0.805  0.729  0.511
                Naive Bayes    0.732  0.787  0.746  0.658
                Max Ent        0.762  0.807  0.731  0.692
                J48            0.798  0.823  0.787  0.662
All             SVM            0.791  0.808  0.727  0.510
                Naive Bayes    0.724  0.780  0.740  0.637
                Max Ent        0.768  0.808  0.733  0.688
                J48            0.798  0.824  0.792  0.692
B. Results: Model Selection
1) Model Selection with Individual Features: The results from our first experiments are shown in Table II. Looking first at individual feature sets, e.g., SVM together with user features, we see that content features yield improved predictive performance over user and focus features. On discussion forums, content appears to play a more central role
Positive/negative impact of features on Boards.ie
• What are the most important features for predicting seed posts?
• Correlations: – Referral counts (non-seeds) – Forum likelihood (seeds) – Informativeness (non-seeds) – Readability (seeds) – User age (non-seeds)
59
TABLE III. REDUCTION IN F1 LEVELS AS INDIVIDUAL FEATURES ARE DROPPED FROM THE J48 CLASSIFIER

Feature Dropped       F1
-                     0.815
Post Count            0.815
In-Degree             0.811 *
Out-Degree            0.811 *
User Age              0.807 ***
Post Rate             0.815
Forum Entropy         0.815
Forum Likelihood      0.798 ***
Post Length           0.810 **
Complexity            0.811 **
Readability           0.802 ***
Referral Count        0.793 ***
Time in Day           0.810 **
Informativeness       0.801 ***
Polarity              0.808 ***

Signif. codes: p-value < 0.001 '***', 0.01 '**', 0.05 '*', 0.1 '.'
hyperlinks (e.g., ads and spam). This contrasts with work on Twitter, which found that tweets containing many links were more likely to be 'retweeted' [11].
The boxplot for Forum Likelihood shows a correlation between seed posts and higher values of the likelihood measure, suggesting that users who frequently post in the same forums are more likely to start a discussion. Also, if a user often posts in discussion forums while concentrating on only a few select forums, then the likelihood that a new post is within one of those forums is high.
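A minimal reading of this measure, assuming forum likelihood is simply the empirical fraction of a user's past posts made in a given forum (the paper's exact definition is in an earlier section not reproduced here):

```python
from collections import Counter

def forum_likelihood(user_post_forums, forum):
    """Fraction of a user's past posts made in `forum` (illustrative definition)."""
    counts = Counter(user_post_forums)
    total = sum(counts.values())
    return counts[forum] / total if total else 0.0

# A user concentrated in a few forums gives new posts there a high likelihood.
history = ["politics", "politics", "politics", "movies"]
print(forum_likelihood(history, "politics"))  # 0.75
```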
Fig. 3. Boxplots showing the correlation of feature values with seed and non-seed posts within the training split
VI. REGRESSION: PREDICTING DISCUSSION ACTIVITY
Early detection of lengthy discussions helps analysts and managers focus attention on where activity and topical debates are about to occur. In this section we predict the level of discussion activity that seed posts will generate, and identify which features are key indicators of lengthy discussions. We use regression models that induce a function describing the relationship between the level of discussion activity and our user, content and focus features. By learning such a function we can identify patterns in the data and correlations between our dependent variable and the range of predictor variables that we have.
Fig. 4. Discussion Activity Length Distribution
A. Experimental Setup
Forecasting the exact number of replies (discussion activity) is limited if the distribution of known reply lengths is heavily skewed towards either the minimum or the maximum. For predicting popular tweets, Hang et al. [12] adopted a multiclass classification setting to deal with the large skew in their dataset by predicting retweet count ranges. We have a similar scenario in our Boards.ie dataset, where a large number of seed posts yield fewer than 20 replies (Figure 4). In such cases, utilising standard regression error measures such as Relative Absolute Error produces inaccurate assessments of the predictions, because the simple baseline predictor is based on the mean of the target variable.
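The binning strategy borrowed from [12] — turning raw reply counts into count-range classes to cope with the skew — can be sketched as follows; the bin edges here are illustrative, not taken from either paper:

```python
import numpy as np

# Illustrative reply-count ranges: 0 replies, 1-4, 5-19, 20-99, 100+
edges = [1, 5, 20, 100]
replies = np.array([0, 2, 7, 150, 19, 0, 3])
classes = np.digitize(replies, edges)  # class index = number of edges <= count
print(classes.tolist())  # [0, 1, 2, 4, 2, 0, 1]
```

A multiclass classifier can then be trained on these range labels instead of the raw, heavily skewed counts.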
In our experiments we instead use the Normalised Discounted Cumulative Gain (nDCG) at varying rank positions, looking at the performance of our predictions over the top-k documents where k = {1, 5, 10, 20, 50, 100}. nDCG is derived by dividing the Discounted Cumulative Gain (DCG) of the predicted ranking by the DCG of the ideal ranking (iDCG). DCG is well suited to our setting, given that we wish to predict the most popular posts and then expand that selection across growing ranks, as the measure penalises elements that appear lower in the ranking than they should. We define DCG formally, based on the definition from [9], as:
\[ \mathrm{DCG}_k = \sum_{i=1}^{k} \frac{rel_i}{\log_2(1 + i)} \qquad (5) \]
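Equation (5) translates directly into code; nDCG@k then divides this DCG by the DCG of the ideal, relevance-sorted ranking (iDCG):

```python
import math

def dcg_at_k(rels, k):
    """DCG_k = sum over i=1..k of rel_i / log2(1 + i), following Eq. (5)."""
    return sum(rel / math.log2(1 + i) for i, rel in enumerate(rels[:k], start=1))

def ndcg_at_k(predicted_rels, k):
    """Normalise by the DCG of the ideal (descending-relevance) ranking."""
    ideal = dcg_at_k(sorted(predicted_rels, reverse=True), k)
    return dcg_at_k(predicted_rels, k) / ideal if ideal > 0 else 0.0

# Relevance = actual reply counts, listed in the model's predicted rank order.
print(round(ndcg_at_k([10, 3, 7, 0, 1], k=5), 3))
```

A perfectly ordered ranking scores 1.0; misplacing highly replied-to posts lower in the ranking is penalised by the logarithmic discount.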
For our experiments we first identify the best performing regression model before moving on to analysing the coefficients of that model and the patterns in the data that lead to increased discussion activity. For our model selection phase we test three regression models: Linear regression, Isotonic
• What impact do features have on discussion length?
  – Assessed Linear Regression model with focus and content features
  – Forum Likelihood (pos)
  – Content Length (+/neutral)
  – Complexity (pos)
  – Readability (+/neutral)
  – Referral Count (neg)
  – Time in Day (+/neutral)
  – Informativeness (-/neutral)
  – Polarity (neg)
Predicting Discussion Activity in Boards.ie
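The slide above reads positive/negative effects off the signed coefficients of a fitted linear regression. A synthetic sketch of that step, with made-up data whose true weights match the reported signs for four of the features:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
features = ["Forum Likelihood", "Complexity", "Referral Count", "Polarity"]
X = rng.normal(size=(300, len(features)))
# Synthetic reply counts: positive weight on the first two features, negative on the last two.
y = 2.0 * X[:, 0] + 1.0 * X[:, 1] - 1.5 * X[:, 2] - 0.5 * X[:, 3] + rng.normal(size=300)

model = LinearRegression().fit(X, y)
for name, coef in zip(features, model.coef_):
    print(f"{name}: {'pos' if coef > 0 else 'neg'} ({coef:+.2f})")
```

On real Boards.ie data the same read-out yields the (pos)/(neg)/(neutral) labels in the slide.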
Stay tuned
• More communities
  – SAP, IBM, StackOverflow, Reddit
  – Compare impact of features on their dynamics
• Better behaviour analysis
  – Fewer features, more forums/communities, more graphs!
  – Healthy? posts, reciprocation, discussions, sentiment mixture
• Churn analysis
  – Correlation of features/behaviour to 'bounce rate' (WebSci11 best paper)
• Intervention!
  – Opportunities and mechanisms to influence behaviour
Upcoming events
Intelligent Web Services Meet Social Computing AAAI Spring Symposium 2012,
March 26-28, Stanford, California
http://vitvar.com/events/aaai-ss12 Deadline: October 7, 2011
Social Object Networks IEEE Social Computing, 2011 October 9-10, Boston, USA
http://ir.ii.uam.es/socialobjects2011/ Deadline: August 5, 2011
My social semantics team
Acknowledgement
Sofia Angeletou Research Associate
Matthew Rowe Research Associate
Alain Barrat CPT Marseille & ISI
Martin Szomszor CeRC, City University, UK
Wouter van Den Broeck ISI, Turin
Ciro Cattuto ISI, Turin
Live Social Semantics team
Gianluca Correndo, Uni Southampton Ivan Cantador, UAM, Madrid
STI International ESWC09/10 & HT09 chairs and organisers
All LSS participants