
Reconstruction and Analysis of User Context Graphs in Large Social Networks

Svitlana Volkova
Center for Language and Speech Processing
Johns Hopkins University
[email protected]
November 30, 2012

Abstract

The goal of this paper is to describe both the challenges of and solutions for constructing user context graphs in large social networks like Twitter. We first describe the data collection process and present a straightforward user sampling technique that reduces data size without any loss in performance. Next, we define the user political orientation classification task and present two types of user context graphs: user-follower and user-friend graphs. We claim that when little or no data is available for a user, the user’s context can help predict user attributes such as political orientation. We show that follower and friend graphs improve performance by 4% and 5.5%, respectively. Finally, we review other work that investigates conversation graphs in Twitter.

1 Introduction

Communication over social networks has grown dramatically over the last decade. There has been an extensive amount of work in social network analysis, including information propagation [Cogan et al., 2012, Kwak et al., 2010, Cha et al., 2009], link prediction [Liben-Nowell & Kleinberg, 2003], trend detection [Vakali et al., 2012], and user attribute prediction [Rao et al., 2010, Burger et al., 2011, Rao et al., 2011]. At the same time, data collection from large social networks like Facebook, Twitter, or MySpace has become increasingly difficult due to privacy issues and company strategies of generating revenue from data rather than releasing it publicly for free (e.g., the Twitter API accepts at most 150 requests per hour from one IP). Moreover, studying informal language in social media in general, and in Twitter in particular, is very challenging compared to many other types of data because tweets are short and include abbreviations, misspellings, URLs, informal words, and repeated punctuation.

Despite all of the challenges described above, we found Twitter to be an attractive resource for studying both language (content) and graph structure (context) on a large scale due to its volume¹, dynamic nature, and diverse population. We choose to investigate the contribution of both content and context to predicting user political orientation (democrat vs. republican). We aim to test the hypothesis that when little or no content is available from a user, the user’s local context (followers or friends) can help predict user attributes such as political orientation.

1 As of June 2012, Twitter had passed 500M users, 140M of them in the US, with 340M tweets daily.


Figure 1: Sampled users of interest from democrat and republican hubs: users of interest (blue - democrats, red - republicans), hubs, and other users in Twitter user space.

2 Data Collection and Sampling

To get a representative sample of republican and democrat profiles we use Tweepy², a library for accessing the Twitter API. We randomly sample $n = 516$ democrat user profiles (users of interest), with 200 tweets per user, who follow both the @BarackObama ($B$) and @JoeBiden ($J$) profiles, such that $u_d \in (B \cap J)$ and $u_d \notin (M \cup P)$, and $m = 515$ republican profiles who follow both the @MittRomney ($M$) and @RepPaulRyan ($P$) profiles, such that $u_r \in (M \cap P)$ and $u_r \notin (B \cup J)$³. We show democrat and republican user hubs with sampled users of interest in Figure 1. We label users of interest as democrats if they exclusively follow democrat hubs, and as republicans if they exclusively follow republican hubs. We decode the JSON format that Twitter uses to store user information and tweets using Java libraries including json-simple⁴ and google-gson⁵. Finally, we save user information and tweets to MongoDB⁶, a scalable, high-performance, open source NoSQL database.
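The labeling rule above can be illustrated with plain set operations. This is a minimal sketch in Python, not the collection code used in the paper: the function name `label_users`, the `followers_of` dictionary, and the toy user IDs are assumptions made for illustration, and no Twitter API calls are shown.

```python
def label_users(followers_of, dem_hubs, rep_hubs):
    """Label users who follow every hub of one party and no hub of the other."""
    dem_all = set.intersection(*(followers_of[h] for h in dem_hubs))  # B ∩ J
    rep_all = set.intersection(*(followers_of[h] for h in rep_hubs))  # M ∩ P
    dem_any = set.union(*(followers_of[h] for h in dem_hubs))         # B ∪ J
    rep_any = set.union(*(followers_of[h] for h in rep_hubs))         # M ∪ P
    labels = {u: "democrat" for u in dem_all - rep_any}
    labels.update({u: "republican" for u in rep_all - dem_any})
    return labels


# Toy follower sets with made-up user IDs (not data from the paper).
followers_of = {
    "BarackObama": {1, 2, 3, 7},
    "JoeBiden": {1, 2, 7},
    "MittRomney": {4, 5, 7},
    "RepPaulRyan": {4, 5, 6},
}
labels = label_users(
    followers_of,
    dem_hubs=["BarackObama", "JoeBiden"],
    rep_hubs=["MittRomney", "RepPaulRyan"],
)
```

In this toy input, user 7 follows hubs of both parties and user 3 follows only one democrat hub, so neither receives a label.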

To investigate different types of communication relationships between the users we collect data to encode the relationships as graph structures. For instance, for user-follower and user-friend relationships we download followers⁷ and friends⁸ for the $n$ democrat and $m$ republican users of interest. The distributions of follower and friend counts for the users of interest are given in Figures 2a and 2b, respectively.

Fig. 2: True follower and friend distributions for the users of interest: (a) user followers, (b) user friends.

2 http://tweepy.github.com/
3 As of Oct 12, 2012, the total numbers of followers for the Obama, Biden, Romney and Ryan profiles, from which we sample 1,301K users, are 2M, 168K, 1.3M and 267K, respectively.
4 code.google.com/p/json-simple/
5 code.google.com/p/google-gson/
6 www.mongodb.org/
7 A follower is someone who subscribes to or follows the tweets of another Twitter user.
8 A friend is someone who follows another Twitter user who follows them back.


The distributions of follower and friend counts per user of interest are very similar for democrat and republican users, and only 1-2% of users have fewer than 10 followers or friends. We could download tweets for all 1.37M followers and 0.83M friends shown in Figure 2, but because of the limit on the number of requests the Twitter API accepts daily, we take a sample instead: from the downloaded follower and friend IDs we randomly sample $k$ followers and $k$ friends per user of interest ($k = 10$). We compare the original (true) and sampled follower and friend graphs in Figures 3a and 3b, respectively.
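The per-user sampling step can be sketched as follows. This is an illustrative Python sketch rather than the original implementation: the function name `sample_neighbors`, the fixed seed, and the decision to keep all neighbors for users with fewer than $k$ of them are assumptions.

```python
import random


def sample_neighbors(neighbor_ids, k=10, seed=0):
    """Uniformly sample at most k neighbor IDs for every user of interest."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sampled = {}
    for user, ids in neighbor_ids.items():
        ids = sorted(ids)  # sort first so the result does not depend on set order
        # Assumption: users with k or fewer neighbors keep all of them.
        sampled[user] = ids if len(ids) <= k else rng.sample(ids, k)
    return sampled


# Toy input with made-up IDs: one user with 100 followers, one with 3.
neighbor_ids = {"u1": set(range(100)), "u2": {5, 6, 7}}
sampled = sample_neighbors(neighbor_ids, k=10)
```

Sampling without replacement keeps every sampled neighbor distinct, so each user of interest contributes at most $k$ edges to the sampled graph.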

Fig. 3: Sampled follower and friend graphs: (a) user-follower graph, (b) user-friend graph (overlapping vertices on the outer circle can be further scaled to the balanced stars centered in user vertices).

To compare the original and sampled graphs, we report their statistics in Table 1. We consider the following graph statistics:

- graph size $G_{size} = |E| = |E(G)|$ - the number of edges in the graph;

- graph order $G_{order} = |U| = |U(G)|$ - the number of vertices in the graph;

- graph maximum degree $G_{degree} = \Delta(G)$ - the largest number of edges incident to any single vertex, with loops counted twice.

Graph type                 G_order      G_size       G_degree
Original follower graph    1,185,105    1,371,557    292,999
Sampled follower graph     10,499       9,894        10
Original friend graph      4,622,884    836,771      50,344
Sampled friend graph       9,206        9,772        10

Tab. 1: User-follower and user-friend graph statistics.
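The three statistics reported in Table 1 can be computed directly from an edge list. The sketch below assumes the graph is stored as a list of (source, target) pairs; the function name `graph_stats` and the toy edges are illustrative and not taken from the paper.

```python
from collections import Counter


def graph_stats(edges):
    """Return order (vertex count), size (edge count) and maximum degree."""
    vertices = set()
    degree = Counter()
    for u, v in edges:
        vertices.update((u, v))
        degree[u] += 1
        degree[v] += 1  # for a loop (u == v) this counts the edge twice
    return {
        "order": len(vertices),
        "size": len(edges),
        "max_degree": max(degree.values(), default=0),
    }


# Toy follower edges (follower -> user of interest), made up for illustration.
edges = [("f1", "u1"), ("f2", "u1"), ("f3", "u1"), ("f3", "u2")]
stats = graph_stats(edges)
```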


Finally, we report additional statistics for the sampled user-follower graph $G_f$ and user-friend graph $G_r$ only: diameter ($G_f = 38$, $G_r = 22$), average degree ($G_f = 0.949$, $G_r = 2.123$), average path length ($G_f = 13.584$, $G_r = 9.117$), and number of weakly connected components ($G_f = 653$, $G_r = 161$).
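Of these statistics, the number of weakly connected components is the simplest to reproduce: edge direction is ignored and vertices are merged with a union-find structure. The sketch below is one possible way to compute it, not necessarily the tool used in the paper; the function name and the toy edges are assumptions.

```python
def weakly_connected_components(edges):
    """Count components of a directed graph when edge direction is ignored."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for u, v in edges:
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_u] = root_v  # merge the two components
    return len({find(x) for x in list(parent)})


# Two stars; the second is joined to a third user through shared follower f4.
edges = [("f1", "u1"), ("f2", "u1"), ("f3", "u2"), ("f4", "u3"), ("f4", "u2")]
n_components = weakly_connected_components(edges)
```

Shared followers or friends merge the stars of different users of interest into one component, which is why the component count is lower than the number of users of interest.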

3 User Context Graphs

In this section we give a general definition of a Twitter context graph and discuss the differences between two context graph types: user-follower and user-friend graphs.

Definition 1. A context graph is a directed graph $G = (U, E)$, where two disjoint subsets of vertices represent users of interest $u_i$ of two types {democrats, republicans} and their local context $u_i^c$, such that $u_i, u_i^c \in U$ and $u_i \cap u_i^c = \emptyset$; a set of edges $E \subset U^{(2)}$ represents communication relationships between the users of interest $u_i$ and other users from the space of all Twitter users $U \setminus \{u_i \cup u_i^c\}$, such that $e_k^r, e_k^f \in E$ correspond to $k$ sampled friends and followers per user of interest, respectively.

As shown in Figure 3a, our sampled user-follower graph $G_f$ contains only the subset of edges of type $e^f \in E$ with corresponding pairs of vertices $u_i, u_i^f \in U$. It can be visualized as balanced disjoint stars, centered at democrat and republican user vertices, with $k$ sampled followers per user of interest and incoming follower edges. Some followers are shared between users of interest; they form a connected component concentrated at the center of the graph.

As shown in Figure 3b, our sampled user-friend graph $G_r$ contains only the subset of edges of type $e^r \in E$ with corresponding pairs of vertices $u_i, u_i^r \in U$. It is similar to the user-follower graph except that the edges are bidirectional: if two users follow each other, they are friends. Some friends are shared between users of interest and, as in the user-follower graph, are concentrated at the center of the graph, but the density is much higher than in the follower graph. This means that even randomly sampled friends are shared more often between users of interest than randomly sampled followers.
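One way to hold such a graph in memory is a small class that stores labels for the users of interest and a set of directed edges, with a friend edge stored in both directions. This is a hypothetical data structure written for illustration; the class name `ContextGraph` and its methods are not from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class ContextGraph:
    """Directed graph over users of interest and their sampled local context."""
    labels: dict = field(default_factory=dict)  # user of interest -> party label
    edges: set = field(default_factory=set)     # directed (source, target) pairs

    def add_follower(self, user, follower):
        # A follower edge points from the follower to the user of interest.
        self.edges.add((follower, user))

    def add_friend(self, user, friend):
        # Friends follow each other, so the edge is stored in both directions.
        self.edges.add((friend, user))
        self.edges.add((user, friend))

    def context(self, user):
        """Return the users with an edge pointing at the given user."""
        return {source for source, target in self.edges if target == user}


# Separate graphs for the two relationship types, with made-up IDs.
follower_graph = ContextGraph(labels={"u1": "democrat"})
follower_graph.add_follower("u1", "f1")
friend_graph = ContextGraph(labels={"u1": "democrat"})
friend_graph.add_friend("u1", "r1")
```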

In the next section we describe how we apply the different types of user context graphs to predict political orientation for Twitter users of interest.

4 Experimental Setup and Results

To demonstrate the influence of a user’s local context on the political orientation prediction task, we design several experiments:

(A) to show the influence of the local context exclusively, we assume that we have neither tweets $u_i(T)$ nor biographies (bios) $u_i(B)$ for the users of interest, but we do have follower/friend tweets $u_i^c(T)$, their bios $u_i^c(B)$, or both $u_i^c(B) + u_i^c(T)$; we represent users of interest through their local context by sampling $m = n/10$ tweets from each of 10 followers/friends;


(B) to show the influence of the local context combined with a limited amount of user data, we assume that we have only biographies $u_i(B)$ for the users of interest and combine them with the follower/friend tweets $u_i(B) + u_i^c(T)$, their bios $u_i(B) + u_i^c(B)$, or both $u_i(B) + u_i^c(B) + u_i^c(T)$;

(C) to get an upper bound on accuracy, we assume that we know the bios and tweets of the users of interest, and use $n$ tweets $u_i(T)$, bios $u_i(B)$, or both $u_i(B + T)$ from the users of interest themselves.
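The context-only representation in experiment (A) can be sketched as pooling a fixed number of tweets from each sampled neighbor. The sketch below reads $m = n/10$ as "take $m$ tweets from each neighbor so that about $n$ tweets are pooled in total", which is an interpretation; the function name `context_document`, the default `n=200`, and the toy tweets are assumptions.

```python
import random


def context_document(neighbor_tweets, n=200, seed=0):
    """Pool about n tweets for a user by sampling equally from each neighbor."""
    if not neighbor_tweets:
        return []
    rng = random.Random(seed)
    m = n // len(neighbor_tweets)  # with 10 neighbors this is m = n / 10
    pooled = []
    for neighbor in sorted(neighbor_tweets):
        tweets = neighbor_tweets[neighbor]
        # Neighbors with fewer than m tweets contribute everything they have.
        pooled.extend(rng.sample(tweets, min(m, len(tweets))))
    return pooled


# Toy input: 10 neighbors with 30 made-up tweets each.
neighbor_tweets = {
    f"f{i}": [f"tweet {j} by f{i}" for j in range(30)] for i in range(10)
}
doc = context_document(neighbor_tweets, n=200)
```

The pooled list can then be treated exactly like a user's own tweets when extracting features, which is what lets the same classifier be trained on context instead of content.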

We use the LIBLINEAR implementation [Fan et al., 2008] integrated in the Jerboa toolkit [Van Durme, 2012] to train⁹ a binary log-linear classifier with word unigram features [Smith, 2004] that predicts user political orientation. We remove all retweets from the set of tweets for every user¹⁰ and end up with 150 tweets per user on average. We report the results for user political orientation classification in Table 2. In addition, we consider two simple baselines: (1) searching user bios for an explicit mention of political orientation gives an accuracy of 0.028; (2) labeling all users with the majority class gives an accuracy of 0.5.
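The preprocessing around the classifier can be sketched in a few lines of Python. This is not the Jerboa or LIBLINEAR pipeline: the helper names are hypothetical, retweets are detected by the common "RT @" prefix (an assumption about how they are marked), and tokenization is plain whitespace splitting.

```python
import random
from collections import Counter


def remove_retweets(tweets):
    """Drop tweets that start with the conventional retweet prefix."""
    return [t for t in tweets if not t.startswith("RT @")]


def unigram_features(tweets):
    """Count lowercased whitespace-separated tokens over all tweets of a user."""
    return Counter(token for t in tweets for token in t.lower().split())


def split_users(users, seed=0):
    """Shuffle users and split them 70% train, 10% development, 20% test."""
    users = sorted(users)
    random.Random(seed).shuffle(users)
    n_train = len(users) * 7 // 10
    n_dev = len(users) // 10
    return users[:n_train], users[n_train:n_train + n_dev], users[n_train + n_dev:]


tweets = ["RT @someone: big rally tonight", "Vote today", "vote early"]
features = unigram_features(remove_retweets(tweets))
train, dev, test = split_users(range(100))
```

Splitting by user rather than by tweet keeps all tweets of one user in the same partition, so no user appears in both training and test data.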

Context                   Users of interest*, u_i   User-follower G_f   User-friend G_r
u_i^c(B)                  0.508                     0.667               0.545
u_i^c(T)                  0.580                     0.674               0.704
u_i^c(B + T)              0.675                     0.704               0.717
u_i(B) + u_i^c(B)         –                         0.704               0.704
u_i(B) + u_i^c(T)         –                         0.662               0.713
u_i(B) + u_i^c(B + T)     –                         0.714               0.730

Tab. 2: User political orientation classification results using local context only ($u_i^c(B)$ - context bios, $u_i^c(T)$ - context tweets) and local context in combination with a limited amount of user data ($u_i(B)$ - user bios). *For users of interest we use the user bios $u_i(B)$, the user tweets $u_i(T)$, or both, but no context.

As can be seen from Table 2, models trained using user-follower and user-friend graphs produce even better results than models trained using the tweets or biographies of the users of interest themselves. For example, the best model trained using follower tweets and bios in combination with user-of-interest bios (0.714) outperforms the model trained using tweets and bios of the users of interest (0.675) by 4% absolute; the best model trained using friend tweets and bios (0.730) gives an improvement of 5.5%.
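The two improvements can be checked against the accuracies in Table 2, reading them as absolute differences between the best context model in each column and the $u_i(B + T)$ model for the users of interest:

```latex
\begin{align*}
\text{follower context: } & 0.714 - 0.675 = 0.039 \approx 4\% \\
\text{friend context: }   & 0.730 - 0.675 = 0.055 = 5.5\%
\end{align*}
```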

5 Related Work

Coppersmith and Priebe investigate both graph structure (context) and edge attributes (content) to explore vertex nomination (finding interesting vertices) in the Enron email corpus. Their results demonstrate that a joint model of context and content improves performance over either model alone [Coppersmith & Priebe, 2012].

9 We split the user data into 3 parts: 70% train, 10% development and 20% test.
10 We remove retweets because we want to use the original tweets generated by a user rather than retweeted information from a different user.

Cogan et al. study the dynamics of how information propagates in large social networks by examining the underlying conversation graph structure [Cogan et al., 2012]. They investigate conversations formed by users replying to other users’ tweets, retweeting, or mentioning other users. To construct conversation graphs they use (a) a keyword filtering approach and (b) following user reply trees to the root. Interestingly, in their large corpus of reply tweets they found that 60% of reply trees are paths. Their results demonstrate that, with the keyword filtering approach, Twitter conversation graphs for reply tweets are represented by stars, while graphs constructed from user mentions have a more complicated structure, which is clearly not a tree.
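To make the "reply trees are paths" observation concrete: a reply tree is a path exactly when no tweet in it has more than one direct reply. The check below is an illustrative sketch, not code from Cogan et al.; it assumes the input really is a tree, given as a hypothetical `parent_of` mapping from each reply to the tweet it answers.

```python
from collections import Counter


def is_path(parent_of):
    """Return True if the reply tree has no tweet with two or more replies."""
    replies_per_tweet = Counter(parent_of.values())
    return all(count <= 1 for count in replies_per_tweet.values())


chain = {"b": "a", "c": "b"}  # a <- b <- c: each tweet has one reply
star = {"b": "a", "c": "a"}   # b and c both reply to a
chain_is_path = is_path(chain)
star_is_path = is_path(star)
```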

Other interesting studies that investigate the underlying graph structure of large social networks like Twitter include [Kwak et al., 2010] and [Cha et al., 2009]. The authors demonstrate how information flow depends on the structural properties of user interactions.

Furthermore, there are some useful tools for massive social network analysis, for example the Graph Characterization Toolkit (GraphCT), developed for the massively multithreaded Cray XMT architecture [Ediger et al., 2010].

6 Conclusions

We have demonstrated an effective approach to constructing user context graphs for a random sample of users in a large social network. Moreover, we showed that these randomly sampled graph structures can be effectively used to predict user political orientation with a limited amount of user information or without any user information. By randomly sampling user followers and friends we reduced the size of the data, saved storage space, reduced the number of calls to the Twitter API needed to download user tweets, and decreased model training time without any loss in classification performance.

References

[Burger et al., 2011] Burger, J. D., Henderson, J. C., Kim, G., & Zarrella, G. (2011). Discriminating gender on Twitter. In Proceedings of EMNLP.

[Cha et al., 2009] Cha, M., Mislove, A., & Gummadi, K. P. (2009). A measurement-driven analysis of information propagation in the Flickr social network. In Proceedings of the 18th international conference on World wide web, WWW ’09 (pp. 721–730). New York, NY, USA: ACM.

[Cogan et al., 2012] Cogan, P., Andrews, M., Bradonjic, M., Kennedy, W. S., Sala, A., & Tucci, G. (2012). Reconstruction and analysis of Twitter conversation graphs. In Proceedings of the First ACM International Workshop on Hot Topics on Interdisciplinary Social Networks Research, HotSocial ’12 (pp. 25–31). New York, NY, USA.

[Coppersmith & Priebe, 2012] Coppersmith, G. A. & Priebe, C. E. (2012). Vertex nomination via content and context. CoRR, abs/1201.4118.


[Ediger et al., 2010] Ediger, D., Jiang, K., Riedy, J., Bader, D. A., & Corley, C. (2010). Massive social network analysis: Mining Twitter for social good. In Proceedings of the 2010 39th International Conference on Parallel Processing, ICPP ’10 (pp. 583–593). Washington, DC, USA: IEEE Computer Society.

[Fan et al., 2008] Fan, R.-E., Chang, K.-W., Hsieh, C.-J., Wang, X.-R., & Lin, C.-J. (2008). LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9, 1871–1874.

[Kwak et al., 2010] Kwak, H., Lee, C., Park, H., & Moon, S. (2010). What is Twitter, a social network or a news media? In Proceedings of the 19th international conference on World wide web, WWW ’10 (pp. 591–600). New York, NY, USA: ACM.

[Liben-Nowell & Kleinberg, 2003] Liben-Nowell, D. & Kleinberg, J. (2003). The link prediction problem for social networks. In Proceedings of the twelfth international conference on Information and knowledge management, CIKM ’03 (pp. 556–559). New York, NY, USA: ACM.

[Rao et al., 2011] Rao, D., Paul, M., Fink, C., Yarowsky, D., Oates, T., & Coppersmith, G. (2011). Hierarchical Bayesian models for latent attribute detection in social media. In Proceedings of ICWSM.

[Rao et al., 2010] Rao, D., Yarowsky, D., Shreevats, A., & Gupta, M. (2010). Classifying latent user attributes in Twitter. In Proceedings of the 2nd International Workshop on Search and Mining User-generated Contents.

[Smith, 2004] Smith, N. A. (2004). Log-linear models.

[Vakali et al., 2012] Vakali, A., Giatsoglou, M., & Antaris, S. (2012). Social networking trends and dynamics detection via a cloud-based framework design. In Proceedings of the 21st international conference companion on World Wide Web, WWW ’12 Companion (pp. 1213–1220). New York, NY, USA: ACM.

[Van Durme, 2012] Van Durme, B. (2012). Jerboa: A Toolkit for Randomized and Streaming Algorithms. Technical report, Human Language Technology Center of Excellence.