Twitter Data Clustering and Visualization

Andrei Sechelea, Tien Do Huu, Evangelos Zimos, and Nikos Deligiannis

Department of Electronics and Informatics, Vrije Universiteit Brussel, Brussels, Belgium

iMinds, Gaston Crommenlaan 8 (b102), 9050 Gent, Belgium

E-mail: {atsechel, ezimos, ndeligia}@etro.vub.ac.be, [email protected]

Abstract—We present a system for the acquisition, analysis and visualisation of Twitter data. Twitter messages are harvested and stored in a distributed cluster, and the data is processed using algorithms implemented in a MapReduce framework. We present a clustering algorithm capable of identifying the main topics of interest in a tweet data set. We also designed a visualization method that makes it possible to follow the intensity of Twitter activity at a given geographical location.

Index Terms—Twitter, social media, big data, clustering, visualization

I. INTRODUCTION

Social media platforms offer users the possibility to communicate personal facts and opinions, news and discussions. All of these carry information, and their sum represents a massive, large-scale data set. Fortunately, all major social networks offer programming interfaces through which data can be harvested. In light of the ever-increasing importance and ubiquity of digital communications, computational methods for the acquisition, processing and visualisation of big data are becoming of paramount importance for researchers and engineers alike.

Twitter is a micro-blogging platform which has over time become a major form of communication. While being a social network service that provides individuals with the means of connecting with family and friends, Twitter has evolved into a social phenomenon of global scale. It has been used in various scenarios ranging from the coverage of major sporting events to the organisation of nationwide civil protest actions. Implicitly, tweets reflect those events as seen by the individuals tweeting, and can be aggregated to form the basis of event exploration and visualization [1]–[4].

Twitter-generated data is extremely heterogeneous in terms of content, and very large in size: hundreds of thousands of messages per day. Since late 2009, Twitter has allowed each message to be geo-tagged, i.e., associated with the latitude and longitude of a specific location¹. Users can therefore become authentic life-sensors which capture information from their regions and spread real-time geo-tagged data. According to recent surveys², as many as 20 percent of the users give their location to an accuracy of street level or better. For non-geo-tagged messages, results in inferring the geo-location from available meta-information are promising [5]–[7].

¹ https://blog.twitter.com/2009/location-location-location

² http://geosocialfootprint.com

Fig. 1. Schematic illustration of the system.

In this paper we demonstrate Twitter’s capacity for identifying opinions, trends and events in a specific geographic location. Users’ tweets are gathered, processed, analyzed and visualized in a 3D geospatial representation. The ultimate goal of our system is to enable the interactive display, in a 3D environment, of the hottest existing topics as seen by Twitter users in a particular geographic location. Such a framework can be used in practical applications targeting early detection of unusual events or prediction of trends, e.g., social unrest analysis [8], [9] or viral marketing [10], [11].

Our solution addresses two problems that are inherent to such systems. The first contribution consists in handling the sheer volume of data potentially flooding the system (acquisition, filtering and preprocessing) in a distributed fashion. The second contribution is the data analysis and visualisation: the tweets are clustered according to major themes, then a topic-tagged heat-map is overlaid on a real geographical map of the considered region.

The paper is structured as follows: Section II describes the generic architecture of the system, while Section III presents the practical implementation details. Section IV presents the experimental results of the clustering and visualisation, and Section V draws the conclusions of our work.

II. SYSTEM ARCHITECTURE

The overall architecture of the system is depicted in Fig. 1. It consists of separate data acquisition, preprocessing, clustering and visualisation modules, which are discussed in turn below.

A. Data Acquisition

In line with modern big data requirements, we have set up our own distributed Apache Hadoop cluster capable of running distributed MapReduce tasks. The data set is composed of tweets collected by a harvesting machine using the Twitter API. The harvester runs a Python script that queries the Twitter server and parses the content of each response. The raw stream is filtered by imposing restrictions on the geo-location coordinates; more specifically, for every region of interest we can define the coordinates of a rectangular bounding box. The relevant text messages are stored in the distributed file system.
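The bounding-box restriction can be sketched as follows. This is a minimal illustration rather than the harvester's actual code: the `BoundingBox` class, the `filter_tweets` helper and the dictionary layout of a tweet (`lat`, `lon`, `text` keys) are assumptions made for the example. The Brussels coordinates are the ones reported in Section IV.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundingBox:
    """Rectangular region of interest in decimal degrees (hypothetical helper)."""
    lower_lat: float
    lower_lon: float
    upper_lat: float
    upper_lon: float

    def contains(self, lat: float, lon: float) -> bool:
        return (self.lower_lat <= lat <= self.upper_lat
                and self.lower_lon <= lon <= self.upper_lon)


# Brussels box, using the coordinates listed in Section IV.
BRUSSELS = BoundingBox(50.771208, 4.211197, 50.924679, 4.521561)


def filter_tweets(tweets, box):
    """Keep only geo-tagged tweets whose coordinates fall inside the box.

    Each tweet is assumed to be a dict with optional 'lat' and 'lon' keys.
    """
    kept = []
    for tweet in tweets:
        lat, lon = tweet.get("lat"), tweet.get("lon")
        if lat is None or lon is None:
            continue  # not geo-tagged, cannot be placed on the map
        if box.contains(lat, lon):
            kept.append(tweet)
    return kept


sample = [
    {"text": "Grand Place at night", "lat": 50.8467, "lon": 4.3525},
    {"text": "Big Ben", "lat": 51.5007, "lon": -0.1246},
    {"text": "no location"},
]
inside = filter_tweets(sample, BRUSSELS)
```

Only the first sample tweet survives the filter; the second lies outside the box and the third carries no coordinates.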

B. Data Preprocessing

As is often the case, together with useful information expressed coherently, tweets contain considerable noise, and the data needs to be cleaned. To start with, re-tweets are removed, as they convey already existing information multiple times. By applying classical tokenization, the resulting raw text is split into components separated by white-spaces; colloquial abbreviations and emoticons are interpreted, and all non-alphanumeric characters, URLs and punctuation signs are removed. Words are transformed to their respective lowercase version. Then stop words are removed. Some of the more frequently used stop words for English include ’a’, ’of’, ’the’, ’I’, ’it’, ’you’, and ’and’, and they are generally regarded as functional words which do not carry meaning. When assessing the content of a message, the meaning can be conveyed more clearly by ignoring the functional words. Hence it is practical to remove those words which appear too often but offer no true information.

Another practical simplification carried out on text data sets is stemming, i.e., the process of reducing derived words to their stem, or root form. For example, ’developed’, ’development’ and ’developing’ are reduced to the stem ’develop’. Lookup-based stemming algorithms look up the inflected form of a word in a lookup table; this approach is simple and fast, but has the disadvantage that all inflected forms must be explicitly listed in a pre-stored lookup table.

All those steps lead to the forming of a dictionary, by taking into account tokens with a certain minimal reoccurrence count and ignoring words that are used rarely. Typically, tokens that appeared fewer than 20 times were discarded. Messages are transformed such that they contain only elements from the dictionary. Hashtags and coordinates are stored as well, in a structure that links them to the corresponding original message.
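The cleaning steps for a single message might look like the sketch below. It is illustrative only: the stop-word set and the stemming table are tiny stand-ins, the rule that a re-tweet starts with `RT ` is an assumed convention, and the interpretation of abbreviations and emoticons is left out. All function and constant names are hypothetical.

```python
import re

# Tiny illustrative resources; a real system would use much larger lists.
STOP_WORDS = {"a", "of", "the", "i", "it", "you", "and"}
STEM_TABLE = {  # lookup-table stemmer: every inflected form is listed explicitly
    "developed": "develop",
    "development": "develop",
    "developing": "develop",
}

URL_RE = re.compile(r"https?://\S+")
HASHTAG_RE = re.compile(r"#(\w+)")
NON_ALNUM_RE = re.compile(r"[^a-z0-9\s]")


def is_retweet(text):
    """Assumed convention: re-tweets start with 'RT '."""
    return text.startswith("RT ")


def preprocess(text):
    """Return (tokens, hashtags) for one tweet, or None for a re-tweet."""
    if is_retweet(text):
        return None
    # Hashtags are kept separately so they stay linked to the message.
    hashtags = [h.lower() for h in HASHTAG_RE.findall(text)]
    cleaned = URL_RE.sub(" ", text).lower()   # drop URLs, then lowercase
    cleaned = NON_ALNUM_RE.sub(" ", cleaned)  # drop punctuation and symbols
    tokens = [STEM_TABLE.get(tok, tok)        # stem when the form is listed
              for tok in cleaned.split()      # whitespace tokenization
              if tok not in STOP_WORDS]
    return tokens, hashtags


result = preprocess("I love the Development of #OpenData! http://example.com/x")
```

For the sample message this leaves the tokens `love`, `develop` and `opendata`, with `opendata` also recorded as a hashtag.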

All data analysis algorithms are implemented in Python using the MapReduce paradigm, and they run on the Hadoop cluster.
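In MapReduce terms, building the dictionary amounts to a token count followed by a threshold. The sketch below emulates the map, shuffle and reduce phases locally in plain Python; it does not use Hadoop or any MapReduce library, and the function names and the `min_count` parameter are assumptions chosen for the example (the default of 20 mirrors the threshold mentioned in the text).

```python
from collections import defaultdict


def mapper(tokens):
    """Map phase: emit a (token, 1) pair for every token of one message."""
    for token in tokens:
        yield token, 1


def shuffle(pairs):
    """Group emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reducer(token, counts):
    """Reduce phase: sum the partial counts for one token."""
    return token, sum(counts)


def build_dictionary(messages, min_count=20):
    """Keep tokens that occur at least `min_count` times over all messages."""
    pairs = (pair for tokens in messages for pair in mapper(tokens))
    totals = dict(reducer(k, v) for k, v in shuffle(pairs).items())
    return {tok for tok, n in totals.items() if n >= min_count}


def restrict(tokens, dictionary):
    """Rewrite a message so that it contains only dictionary elements."""
    return [tok for tok in tokens if tok in dictionary]


messages = [["job", "hire", "london"], ["job", "happy"], ["job", "hire"]]
dictionary = build_dictionary(messages, min_count=2)
reduced = [restrict(m, dictionary) for m in messages]
```

With the toy threshold of 2, only `job` and `hire` enter the dictionary and the rarer tokens are dropped from every message.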

C. Generating the Topics

The cluster analysis closely follows the method presented in [12]. We implemented a clustering method for finding dominant themes, the main goal being to develop an algorithm that combines the k-means algorithm [13], the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm [14] and a consensus matrix.

k-means is a well-known clustering algorithm which aims to partition a number of observations into k clusters in which each observation belongs to the cluster with the nearest mean. DBSCAN is another common clustering algorithm. It uses a similarity metric, usually in the form of a distance, to group data points together; it also marks points as noise, allowing for the removal of unnecessary entries in the data set. Running k-means and DBSCAN on the same data set with multiple parameter settings yields different clusterings. Nevertheless, they can be combined using consensus clustering. To this end, a consensus matrix is used. It is a square matrix whose dimension equals the number of entries in the data set. Over the different runs of the clustering algorithms, each time data points i and j are clustered together, a 1 is added to the consensus matrix at positions ij and ji. In the end, the use of the consensus matrix is twofold: it can be used to remove noisy entries, which have a low correlation with other entries in the data set, and it can also be used to establish the number of clusters output by our algorithm.
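The accumulation of the consensus matrix can be illustrated with a few hand-written label vectors standing in for real k-means and DBSCAN runs. How the final clusters are derived from the matrix is not specified in the text; the thresholded connected-components step below is one plausible reading and should be taken as an assumption, as should the function names and the use of `-1` as the noise label.

```python
def consensus_matrix(labelings, noise_label=-1):
    """Count, for every pair (i, j), in how many runs i and j share a cluster.

    `labelings` is a list of runs; each run is a list with one cluster label
    per data point. Points carrying `noise_label` are never counted as
    clustered with anything.
    """
    n = len(labelings[0])
    matrix = [[0] * n for _ in range(n)]
    for labels in labelings:
        for i in range(n):
            if labels[i] == noise_label:
                continue
            for j in range(i + 1, n):
                if labels[j] == labels[i]:
                    matrix[i][j] += 1  # position ij
                    matrix[j][i] += 1  # position ji
    return matrix


def consensus_clusters(matrix, threshold):
    """Connected components of the graph linking pairs with count >= threshold.

    Points left without any such link are treated as noise and dropped.
    """
    n = len(matrix)
    seen, clusters = set(), []
    for start in range(n):
        if start in seen:
            continue
        stack, component = [start], []
        seen.add(start)
        while stack:
            i = stack.pop()
            component.append(i)
            for j in range(n):
                if j not in seen and matrix[i][j] >= threshold:
                    seen.add(j)
                    stack.append(j)
        if len(component) > 1:
            clusters.append(sorted(component))
    return clusters


runs = [
    [0, 0, 1, 1, 2],    # stand-in for a k-means run with k = 3
    [0, 0, 0, 1, 1],    # stand-in for a k-means run with k = 2
    [0, 0, 1, 1, -1],   # stand-in for a DBSCAN run; -1 marks noise
]
M = consensus_matrix(runs)
clusters = consensus_clusters(M, threshold=2)
```

In this toy case points 0 and 1 agree in all three runs and points 2 and 3 in two of them, so two clusters emerge and point 4 is discarded as noise.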

D. Visualization

We have considered two distinct visualisation scenarios. The first one concerns the structure of the harvested tweets, namely, the separation into clusters as obtained by the clustering algorithm in Section II-C. The second one concerns the spatial density distribution of the data set. More specifically, the goal is to visualize on a real-world geographical map information about the number of tweets originating in specific micro-regions.

For the visualisation of the clusters we used the graphing software Gephi³ to represent in a diagram how close topics are to one another. If tweets are clustered together frequently, they are closer together in the graph and form a topic, represented by a color in the graph. On the other hand, the unconnected nodes are tweets that are not clustered with any other tweet a sufficient number of times. The unconnected nodes are ignored.

The 3D visualisation is created from the superposition of two components: a high-resolution texture of the region, and a so-called height-map. The height-map is a grayscale 2D image in which each pixel corresponds to a coordinate on the texture. The intensity of a pixel corresponds to the number of tweets traceable to that specific coordinate, by means of Twitter’s geolocation information. The process includes 4 steps: generation of the texture, generation of

³ http://gephi.org/

Fig. 2. Cluster architecture

the grayscale image, editing of the images, and the joining of both in the 3D environment. We wanted to use a texture which makes the city shown in the visualisation easily recognizable. As such, we made use of OpenStreetMap⁴, a public interface which supplies free maps of the world in vector graphic form. The maps can be rendered at arbitrary resolutions, and also converted to raster graphic format to be used in our application. The maps are therefore formed by the union of a multitude of 256x256 pixel tiles which are stitched together into a single texture by a simple Python script. Also, as a visualization alternative, the Google Maps API⁵ offers support for integrating heatmap layers in web and mobile applications. The grayscale image is the result of mapping the normalized number of harvested tweets in a region to an intensity value from 0 (no tweets present) to 255 (the maximum observed number of tweets in a region). The two obtained images are subsequently edited for better visual output and superposed.
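How many 256x256 tiles have to be stitched depends on the zoom level. The sketch below assumes that the tiles follow the standard Web Mercator tile numbering used by OpenStreetMap raster tile servers; the text does not state the tiling scheme or the zoom level, so `tile_index`, `tiles_for_box` and `zoom=13` are illustrative assumptions. The bounding box is the Brussels box given in Section IV.

```python
import math

TILE_SIZE = 256  # pixels per tile edge, as stated in the text


def tile_index(lat, lon, zoom):
    """Web Mercator tile numbering: (x, y) of the tile containing a point."""
    n = 2 ** zoom
    x = int((lon + 180.0) / 360.0 * n)
    y = int((1.0 - math.asinh(math.tan(math.radians(lat))) / math.pi) / 2.0 * n)
    return x, y


def tiles_for_box(lower_lat, lower_lon, upper_lat, upper_lon, zoom):
    """Inclusive tile ranges covering a bounding box, and the stitched size."""
    x_min, y_min = tile_index(upper_lat, lower_lon, zoom)  # north-west corner
    x_max, y_max = tile_index(lower_lat, upper_lon, zoom)  # south-east corner
    cols, rows = x_max - x_min + 1, y_max - y_min + 1
    return (x_min, x_max), (y_min, y_max), (cols * TILE_SIZE, rows * TILE_SIZE)


# Brussels box from Section IV; the zoom level is an arbitrary choice.
xs, ys, texture_size = tiles_for_box(
    50.771208, 4.211197, 50.924679, 4.521561, zoom=13)
```

The stitched texture is then `texture_size` pixels wide and high, and each tile at column `x` and row `y` is pasted at pixel offset `((x - xs[0]) * 256, (y - ys[0]) * 256)`.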

III. SYSTEM IMPLEMENTATION

The architecture used to process the harvested tweets is a cluster of 4 nodes. The software run on this cluster (CDH5.5) is provided by Cloudera⁶, an IT company that provides software, support, services and training for Apache Hadoop, an implementation of Google’s MapReduce framework for processing big data in a distributed fashion. We use the YARN service to do the processing. The specification of the infrastructure is as follows: there are three physical machines with four virtual machines running on them. All the machines have two processors. Two of the three machines have Intel Xeon X5650 @ 2.67 GHz processors, while the other one has two Intel Xeon E5-2670 @ 2.60 GHz processors. The 4 virtual machines (2 vCPUs each) have 8 GB of RAM each, except for the master, which has 16 GB of RAM. Each node in the cluster runs the Ubuntu 14.04 operating system. One of the nodes acts as a master and assigns tasks to the other nodes. The full architecture can be seen in Fig. 2. To use this architecture, mrjob⁷ is

⁴ https://www.openstreetmap.org/

⁵ https://developers.google.com/maps/

⁶ https://cloudera.com/

⁷ https://pythonhosted.org/mrjob/

Fig. 3. Key-words given by the clustering algorithm

installed to run the Python code that processes the harvested tweets. mrjob is a Python package that helps with writing and running MapReduce jobs in Python on any Hadoop implementation.

Data harvesting and preprocessing run on the Hadoop system. All clustering and visualisation methods are implemented in Matlab and C++.

IV. DATA VISUALISATION

We harvested Twitter data for a period of one month during the winter holiday season, in two different locations: Brussels and London. Only geo-tagged tweets were considered. This resulted in a data set comprising roughly 250,000 messages for London and 15,000 messages for Brussels. For illustration purposes, we ran the clustering algorithm of Section II-C on the more extensive London data, while applying the 3D visualisation to the Brussels data.

The result of the clustering algorithm for a chosen number of 5 clusters, run on a sample of 1000 randomly selected tweets, can be seen in Fig. 3. The hottest topics could be clustered under five key words: ’Job’, ’Time’, ’Happy’, ’Christmas’ and ’Hire’. The relevance of the results of the algorithm can be improved by a more thorough pruning of all words that do not carry meaningful information in the context of a particular application.

In the 3D visualization case, a number of intermediary steps had to be taken before reaching the desired output. The coordinates defining the box corresponding to the Brussels region are: Lower LAT = 50.771208, Lower LONG = 4.211197, Upper LAT = 50.924679, Upper LONG = 4.521561. It is worth noting that, at this latitude, one degree of latitude spans considerably more ground distance than one degree of longitude: roughly 1.6 times as much, since the length of a longitude degree shrinks with the cosine of the latitude.
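The relation between degrees and ground distance for this box can be checked with a quick spherical-Earth estimate. The constant of 111.32 km per degree and the variable names are assumptions of this sketch, which ignores the ellipsoidal shape of the Earth.

```python
import math

KM_PER_DEG_LAT = 111.32  # approximate length of one degree of latitude

lower_lat, lower_lon = 50.771208, 4.211197
upper_lat, upper_lon = 50.924679, 4.521561

# A degree of longitude shrinks with the cosine of the latitude.
mid_lat = (lower_lat + upper_lat) / 2.0
km_per_deg_lon = KM_PER_DEG_LAT * math.cos(math.radians(mid_lat))

# How much longer a latitude degree is than a longitude degree here.
ratio = KM_PER_DEG_LAT / km_per_deg_lon

# Extent of the box on the ground and in degrees.
height_km = (upper_lat - lower_lat) * KM_PER_DEG_LAT
width_km = (upper_lon - lower_lon) * km_per_deg_lon
degree_aspect = (upper_lon - lower_lon) / (upper_lat - lower_lat)
```

Under this approximation a latitude degree is about 1.6 times as long as a longitude degree at Brussels, while the box spans about twice as many degrees of longitude as of latitude. The box is therefore only moderately wider than it is tall on the ground (roughly 22 km by 17 km).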

To generate the grayscale image, we translate the MapReduce values into a two-dimensional array, which is filled with integer values that describe the frequency of tweets at the given grid point. The size of the grid is given by the

(a) Original density map

(b) Re-scaled density map

(c) Superposition with the map of Brussels

(d) Superposition with a region of Brussels

Fig. 4. Steps in the visualization chain

dimensions of the map, in our Brussels case 1000 × 2000. While the texture map is nearly a square, this grid is clearly not: the bounding box spans around twice as many longitude minutes as latitude minutes. Out of this array we generate the grayscale image, with every pixel being one data point. The value per pixel can range from 0, which is black, to 255, which is pure white. We have different ways of representing the correspondence between the number of tweets and the image intensity:

• Maximum values: the maximum value of all data points is the maximum white value; all other values are scaled accordingly.

• Capped: one value is chosen to be the maximum; all values above it are indistinguishable, and all values below are scaled accordingly.

• Linear scale: all values are scaled linearly from 0 to the max value.

• Logarithmic: all values are scaled logarithmically from 0 to the max value.
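The grid construction and the intensity mappings listed above can be sketched in a few lines. The function names, the convention that row 0 is the northern edge, and the use of `log1p` for the logarithmic variant are assumptions of this example; the toy counts are invented and are not taken from the harvested data.

```python
import math


def bin_tweets(points, lower_lat, lower_lon, upper_lat, upper_lon, rows, cols):
    """Count tweets per grid cell; row 0 is the northern edge of the box."""
    grid = [[0] * cols for _ in range(rows)]
    for lat, lon in points:
        if not (lower_lat <= lat <= upper_lat and lower_lon <= lon <= upper_lon):
            continue
        r = min(int((upper_lat - lat) / (upper_lat - lower_lat) * rows), rows - 1)
        c = min(int((lon - lower_lon) / (upper_lon - lower_lon) * cols), cols - 1)
        grid[r][c] += 1
    return grid


def scale_linear(count, max_count):
    """Linear map from [0, max_count] to [0, 255]."""
    if max_count <= 0:
        return 0
    return round(255 * min(count, max_count) / max_count)


def scale_capped(count, cap):
    """Linear below `cap`; every count at or above `cap` becomes white."""
    return scale_linear(min(count, cap), cap)


def scale_log(count, max_count):
    """Logarithmic map; log1p keeps a count of 0 at intensity 0."""
    if max_count <= 0:
        return 0
    return round(255 * math.log1p(min(count, max_count)) / math.log1p(max_count))


def to_grayscale(grid, scale, reference):
    """Apply one of the scaling functions to every cell of the grid."""
    return [[scale(c, reference) for c in row] for row in grid]


# Binning demo on a unit box with a 2 x 4 grid.
demo = bin_tweets([(0.9, 0.1), (0.1, 0.9), (5.0, 5.0)], 0.0, 0.0, 1.0, 1.0, 2, 4)

# Scaling demo on invented counts with one extreme outlier.
counts = [[0, 1, 5], [10, 200, 2000]]
global_max = max(max(row) for row in counts)
by_max = to_grayscale(counts, scale_linear, global_max)
capped = to_grayscale(counts, scale_capped, 10)
logarithmic = to_grayscale(counts, scale_log, global_max)
```

With a single extreme outlier, scaling to the global maximum renders a cell holding one tweet as black, whereas the capped and logarithmic variants keep it visible.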

To decide which representation to choose, we took a look at preliminary data. In the considered data set we had the following rough distribution: out of all data points which had more than 0 tweets, 10% had a value of 200-2000 tweets in the considered one-month period; roughly 20% had 2-10 tweets and all the rest had 1 tweet. As such, we decided to omit the extreme outliers and cap the data to a maximum of 10 with a linear scale. For a data set collected over a longer period of time this cap has to be raised, and a logarithmic scale might work better to show very small values.

After being generated, the density image can be transformed into a heatmap, and needs to be resized from the grid resolution to the texture resolution. The two images, now seen as layers of the same size, can be overlapped. The resulting images at each step can be observed in Fig. 4. In Fig. 4(a) we can see the original density map, as resulting from the direct correspondence between the number of tweets and the geographical coordinates. Fig. 4(b) is the re-scaled version of the density map for a specific considered region. Fig. 4(c) shows the superposition between the whole map of the Brussels region and the resulting densities transformed into a heatmap, while Fig. 4(d) shows a street-level view of the central square of Brussels with the corresponding heatmap. Here, green spots correspond to observed Twitter activity.

Moreover, as we targeted a high-end 3D visualization application, we made use of an HV721RC72 lightfield display developed by Holografika⁸. This is a unique hardware component, offering the possibility to evaluate visual quality in 3D video by avoiding any possible visual artifacts introduced by the default lossy compression of the computed lightfields. The heatmap was generated using 3D meshes, and superimposed on the existing texture. In Fig. 5(a) we can see a view from the top of the map of Brussels and the corresponding

⁸ http://www.holografika.com/Products/HoloVizio-722RC.html

(a) 2D Map

(b) 3D Map

Fig. 5. 3D visualization

tweet densities, with dark values corresponding to low densities, and bright values corresponding to high densities. The same image can be rotated by any desired angle and viewed as a 3D image, as can be seen in Fig. 5(b).

V. CONCLUSIONS AND FUTURE WORK

In this paper we have presented a system for the acquisition, analysis and visualisation of Twitter data. Twitter messages are harvested and stored in a distributed cluster, and the data is processed using algorithms implemented in a MapReduce framework. We presented a clustering algorithm capable of identifying hot topics of interest in a tweet data set. We also designed a visualization method which makes it possible to follow the density of Twitter activity in a given geographical location. The system is a prototype and was meant to present the potential use of a social media platform as a source of large-scale spatio-temporal information. It represents the building ground for future social media related applications with high social impact, such as emergency situation management, risk and damage assessment, and even social unrest analysis.

REFERENCES

[1] J. Kaye, A. Lillie, D. Jagdish, J. Walkup, R. Parada, and K. Mori, “Nokia internet pulse: a long term deployment and iteration of a twitter visualization,” in CHI’12 Extended Abstracts on Human Factors in Computing Systems. ACM, 2012, pp. 829–844.

[2] A. Marcus, M. S. Bernstein, O. Badar, D. R. Karger, S. Madden, and R. C. Miller, “Twitinfo: aggregating and visualizing microblogs for event exploration,” in Proceedings of the SIGCHI conference on Human factors in computing systems. ACM, 2011, pp. 227–236.

[3] B. Meyer, K. Bryan, Y. Santos, and B. Kim, “Twitterreporter: Breaking news detection and visualization through the geo-tagged twitter network.” in CATA, 2011, pp. 84–89.

[4] D. Thom, H. Bosch, S. Koch, M. Worner, and T. Ertl, “Spatiotemporal anomaly detection through visual analysis of geolocated twitter messages,” in Visualization Symposium (PacificVis), 2012 IEEE Pacific. IEEE, 2012, pp. 41–48.

[5] M. Graham, S. A. Hale, and D. Gaffney, “Where in the world are you? geolocation and language identification in twitter,” The Professional Geographer, vol. 66, no. 4, pp. 568–578, 2014.

[6] B. Han, P. Cook, and T. Baldwin, “Text-based twitter user geolocation prediction,” Journal of Artificial Intelligence Research, pp. 451–500, 2014.

[7] M. Cha, Y. Gwon, and H. Kung, “Twitter geolocation and regional classification via sparse coding,” in Proceedings of the 9th International Conference on Weblogs and Social Media (ICWSM 2015), 2015, pp. 582–585.

[8] R. Compton, C.-K. Lee, T.-C. Lu, L. de Silva, and M. Macy, “Detecting future social unrest in unprocessed twitter data: emerging phenomena and big data,” in Intelligence and Security Informatics (ISI), 2013 IEEE International Conference On. IEEE, 2013, pp. 56–60.

[9] T. Hua, C.-T. Lu, N. Ramakrishnan, F. Chen, J. Arredondo, D. Mares, and K. Summers, “Analyzing civil unrest through social media,” Computer, no. 12, pp. 80–84, 2013.

[10] J. Berger and K. L. Milkman, “What makes online content viral?” Journal of marketing research, vol. 49, no. 2, pp. 192–205, 2012.

[11] S. Alhabash and A. R. McAlister, “Redefining virality in less broad strokes: Predicting viral behavioral intentions from motivations and uses of facebook and twitter,” new media & society, 2014.

[12] D. Godfrey, C. Johns, C. Meyer, S. Race, and C. Sadek, “A case study in text mining: Interpreting twitter data from world cup tweets,” arXiv preprint arXiv:1408.5427, 2014.

[13] J. A. Hartigan, “Clustering algorithms,” 1975.

[14] D. Birant and A. Kut, “St-dbscan: An algorithm for clustering spatial–temporal data,” Data & Knowledge Engineering, vol. 60, no. 1, pp. 208–221, 2007.
