

FACULTY OF SCIENCE

Analyzing and visualizing news spread based on images in Social Media networks

Author: Fernando Flores García

UvA ID: 10408134

Degree: MSc in Artificial Intelligence

Supervisor: Marcel Worring

Date: January 2015


Contents

1 Abstract

2 Introduction

3 Related work
  3.1 Data analysis
  3.2 Visualization

4 Data analysis
  4.1 Twitter
  4.2 Images

5 Model
  5.1 Assumptions
  5.2 Metrics and parameters
    5.2.1 Metrics
    5.2.2 Parameters
  5.3 Network diffusion
  5.4 Burstiness and speed
    5.4.1 Speed
    5.4.2 Burstiness
  5.5 General equation
    5.5.1 General speed equation
    5.5.2 General burstiness equation
  5.6 Measurements

6 Results
  6.1 Speed
  6.2 Burstiness
  6.3 Comparison of speed and burstiness

7 Visualization
  7.1 Speed and burstiness in the visualization
  7.2 Additional elements

8 Conclusion


1 Abstract

We present a new approach to analyzing how news items spread in Social Media networks. Our goal is to gain insight into this phenomenon, where insight is defined, following the Merriam-Webster Dictionary, as "the capacity to discern the true nature of a situation" or "the act or outcome of grasping the inward or hidden nature of things or of perceiving in an intuitive manner" [1]. We consider a combination of analysis and visualization the best way to gain this insight.

The data analysis relies on a model that uses the intrinsic characteristics of news items and returns their potential for success based on two features: their burstiness and their speed of propagation. To obtain these two metrics, we have defined a model that returns the joint success probability of a news item based on the two parameters.

This data analysis is supported by a visualization. The main layout of the visualization is based on a set of tweets that form a topological network. A series of statistics describing how images spread throughout the network is also displayed. This visualization is the means by which the outcome of the model is specified and measured.

The questions we want to answer are: How does news spread through Social Media networks? Is it possible to find any type of pattern in this process? Is it possible to foresee how future news items will be treated based on their similarity to past news items?

Based on the outcomes obtained, we can affirm that news items spread fastest in the first part of their spread process, where they have a high burstiness potential, and that this potential diminishes as the spread continues. Speed shows a similar pattern. We have also found that the news items with the highest thriving potential are those with international scope and those whose topic is related to Entertainment or Sports. The latter category is also the one with the highest burstiness potential.


2 Introduction

During the last decade the distribution of news has changed considerably. In the last century information broadcasting was slow due to technological limitations. Since the boom of the Internet, supported by the emergence of smartphones, users can access real-time information wherever they are with almost no limit. But this flow of real-time information is not unidirectional: users are now also able to spread whatever news items they consider significant. Social Media networks such as Twitter or Facebook have made users a primary player in news spread, a player with the capacity to boost any news item. Today, both Social Media and social networks exist under the same web environment; that is, it is possible to collect massive online or offline Social Media data and at the same time capture the effects of Social Media as well as the influence based on the activity of social networks. This new relation allows experts to study the diffusion processes in the network.

A key aspect and main contribution of this new approach is the use of images as input instead of textual information. Images are easy to recall and understand, they may portray the content of a news item, and they let the viewer grasp what is happening at a glance. This is crucial nowadays, when being the first to transmit a news item is often more important than the quality of the item itself. This is the reason we are interested in investigating how news spreads in Social Media networks based on the attached images and not on the textual information. By doing so we will try to find characteristic patterns for this particular type of news and, if possible, more generic patterns relating the different types of news. We deal with images in both analysis and visualization by searching for any image attached to a tweet and checking whether this image has been retweeted, with the idea of constructing a network that makes it easy to follow the transit of this image over the world.

Our motivation is to gain insight into how news items spread in Social Media networks. This is interesting because these Social Media networks have become a main news source worldwide, and it seems this trend will continue in future years. That is why understanding how news is broadcast throughout networks is of vital importance for certain population groups. To gain this insight we consider a combination of data analysis and visualization the best option. As a secondary target group, companies might be interested in how users interact with each other on Twitter as a basis for their campaigns. Among companies, those that might find it most useful to have the data displayed in such a visualization are marketing companies, as their business is partly based on understanding how people interact. Besides companies, two main groups might find this application useful. The first are journalists, who are more focused on news: how it spreads and which types of news spread faster or grow the most. The second group is composed of social scientists, who are focused on users and on how users interact depending on a series of parameters such as age or gender. It is also useful for social scientists to know which types of news thrive in a certain demographic group.

So how does information flow in Social Media networks? This question has been studied from different standpoints. Cha et al. in [2] define three major roles in how news items about both major international and minor events spread throughout Social Media networks: mass media sources, grassroots and evangelists. The first group is able to reach most users despite making up only 0.01% of them, whereas the grassroots are the standard users, quite passive but numerous, and the evangelists (politicians, celebrities, etc.) play a major role in spreading very specific information in either small or large circles. Users also behave differently in a social network than in real life. Wilcox and Stephen argue in [3] that social networks enhance self-esteem in users who are focused on close friends, something that is reflected in their behavior, including when broadcasting information. A user focused on his friends will "take care" of them and will try to keep them as informed as possible by, for example, retweeting the news items he considers important.


The number of followers a user has is of vital importance for certain users such as companies or News Media agencies because it may indicate their popularity and prestige. We pay special attention to News Media agencies, to how news spreads throughout the Twitter network and, in particular, to how the images attached to these news items propagate via Twitter. In certain cases, such as pictures, a piece of information may be shared many times: a user uploads the content and shares it with his group of friends, some of these friends share the image with their own friends, and so on, creating a sharing cascade [4].

Among all the Social Media networks that have emerged during the last decade, Twitter has become one of the most influential, mainly because it is easy to use and because its topological characteristics and its usability make it a perfect tool for broadcasting information [5]. Nowadays thousands of celebrities publish their personal and professional information on this social network, but this is only a part of the appeal of this microblogging tool. It currently has more than 200 million users and is available in more than 30 languages.

The main reason for Twitter's success is its simplicity: users send text messages limited to 140 characters. Users may subscribe to others' tweets, that is, follow them. Part of the success of Twitter stems from the appearance of smartphones, which allow users to update their statuses wherever they are. Another key feature is the so-called retweet, that is, the action of sharing a tweet. This is the way information spreads in Social Networks. Finally, another important feature is the hashtag, a way to define keywords that is useful for keeping track of a certain topic.

Twitter has become a popular channel for propagating information in recent years, and because it is an always-on tool that may be used anywhere, it lets news items spread faster than in other media such as newspapers or radio. User interaction is also important on Twitter, as users are the ones who spread messages through their follower lists, creating a network whose characteristics define the success of a news item. Society nowadays also demands immediate access to news, and user opinion on platforms such as Twitter has become a significant way to measure the popular sentiment of a country. Twitter has thus become a major source for the study of society.

Bearing this information in mind, we have defined a model whose data is supported by a visualization and whose motivation is to find out whether the differences that characterize news items can be expressed visually, using Twitter as the information source. Visualizations provide better insight for this particular task, and geographical ones are the most useful as they express in a topological way the structure of the network referred to above. This topological structure is interesting for our purposes because it represents the network data structure while at the same time preserving geographical information. The analysis of this data is also important for gaining insight into Twitter data: the relevant data is sparse among the vast amount of tweets, so data mining and subsequent analysis of the retrieved tweets are mandatory.

The chosen approach is to create a joint layout for the two main target groups, journalists and social scientists, where both can find useful information at a glance. The visualization should provide a series of tools such as a different color for each type of news, statistical panels with the values of the main parameters of a news item, and global statistics per type of news. As all this information is computed dynamically, end users can gain insight into how news items evolve during their spread.


The attempt to display Twitter information is not new; others have done it in the past from different points of view. Most of them used the idea of an underlying network of elements, but not all of them displayed this network in geographical terms, as many were only interested in the relations between Twitter users. This first group includes the user networks created by Blancs in [6] and Arikan in [7].

Other applications were created to take advantage of geotagged data retrieved from Twitter by means of a geographical visualization. Most of these visualizations are part of statistical and network monitoring tools such as tweetPing [8] or A world of tweets [9].

Another group of applications uses interesting methods such as element clustering in Google Maps [10] or OpenStreetMap [11]. Other tools that perform their own data analysis to retrieve structured information are Just Landed [12] and Languages on Twitter [13].

Finally, another group of visualizations that make use of Twitter perform tasks other than finding tweets, and these are the most widespread. This is the case of, for instance, the prediction of major events in the days before they occur, as in the work of Kallus [14], who claimed that it was possible to foresee via Twitter the Egyptian people's discontent with the Government in the days preceding the 2013 coup d'état in Egypt. Others, like Rojas in [15], assert that Social Media may predict the outcome of future elections, but this idea still has to be developed and proven.

In the next section, related work is discussed from both the analysis and the visualization sides. The subsequent sections describe the data analysis and then define the model used. The last sections explain the methods used and report the results obtained.


3 Related work

Others have previously proposed solutions to the problems we address.

3.1 Data analysis

By data analysis we refer to the process of taking raw data from a data source and processing it so that it meets the requirements to be used as input for the visualization. Others have carried out this task in the past by different means.

The use of networks to represent Twitter relations is not new. Early approaches such as Blancs's in [6] define a network of users whose depth depends on the number of levels between a user node and the central one (1 for friends, 2 for friends of friends, etc.). The main problem of this type of network is intrinsic: it does not scale well to a large number of friends and has trouble showing relations of level 4 and above. Arikan defines in [7] a so-called Twitter social network that uses the idea of a network of tweets in which edges represent follow actions. Finally, the tool mentionMap [16] uses this idea of a network of tweets as a basis for the visualization but, once again, uses no geographical information; in this project the relations among users are the only way to provide positional information.

None of the projects described so far, although they introduce the idea of a network of tweets, uses geographically located data taken from the tweets themselves, something essential for a geo visualization. The Internet is by definition a position-free network, i.e., users are able to access all the sites in the network no matter where they are. This might seem to conflict with the idea of geotagging user positions, but News Media tools take advantage of this feature to retrieve topological information on where people access the site from and how they use it. In our approach we also use this geographical information in the form of latitude and longitude values, but this is not the first time this has been done. There are also visualizations and tools that use some sort of map layout to display geographical information; some of them are production applications and others are scientific visualizations whose code is not available on the Internet.

The possibility of tracking information in real time or pseudo real time has been considered in the past. One example is Twittervision [17], a tool created back in 2007 that is still highly popular today. It is defined as a web mashup of Twitter and Google Maps and keeps track of a Twitter user's activity in real time. An implicit drawback of this application, due to its real-time nature, is that it does not keep a history of tweets. It does not define a network of tweets either; instead it uses an event-like approach, showing what is happening as a pop-up on the screen, and it only shows the tweets of the user and of the accounts he is following.

Twitter data is used for statistical purposes in visualizations such as A world of tweets [9], a real-time visualization of geolocated tweets around the world. Here statistical and historical data on the whole Twitter network are used, based on a continuously collected set of statuses. The visualization tweetPing [8] also provides a good set of statistical data, checking Twitter activity in real time and showing a large amount of information that is refreshed every second. This information is not stored; the counters are reset every time the site is reloaded. TweetPing returns partial statistics per continent, with tweet, word and character counters for each continent, as well as the latest mentions and hashtags. The visualization Just landed [12], in contrast, stores a history of "flights" but does not run in real time.


Another interesting data analysis technique used in these visualizations is clustering, that is, the task of grouping a set of elements with similar properties. The web tool tweepsMap [18] provides an interesting clustering method that returns the percentage of total followers in a certain area, which is larger or smaller depending on the zoom level, defining these percentages for provinces, cities and countries. As a drawback, no historical data is stored, so only real-time information can be obtained.

Regarding the model, others have studied how to measure the success of a news item in Social Media networks using different parameters such as speed, a metric that is easy to measure but not widely defined as such in related work. For instance, Xu and Liu in [19] resort to speed to detect a series of implicit dynamics in Social Media networks with the goal of detecting rumors spread on them. To do so they define a model that has the speed at which news spreads as one of its input variables.

The second of the outcome parameters that measure the success potential of a news item spread on a Social Media network is burstiness. This parameter is more difficult to measure than speed due to its inherent characteristics. Kim et al. in [20] propose a way to measure the burstiness potential of a keyword or topic on Twitter; their model is very robust and able to handle issues such as abbreviations or typing and spacing errors. Another approach to bursty words on Twitter is the one taken by Mathioudakis and Koudas in [21], reflected in the tool TwitterMonitor. This tool is able to identify emerging topics on Twitter in real time and, by defining certain criteria, can measure and rank the topics by their bursting potential.

3.2 Visualization

As in the case of data analysis, the visualization of Social Media data, and especially Twitter data, has a long history. User relations have been visualized by means of a network in visualizations like [7], where the author created three visualizations (one per week) keeping track of how these links grow and create new interactions. In this visualization no geographical information is taken into consideration; only topics and followers matter. Networks have also been used to show how people interact on Twitter in the tool mentionMap [16], discussed above. This network exploration tool is clear and easy to use, yet powerful, and retrieves all the relevant information necessary to summarize the relations of a user on Twitter. An interesting feature of this project is its interactive nature, which gives the user total freedom of movement throughout the network.

In [22], Ríos defines a neat and clear visualization of all the geotagged tweets from 2009 to 2013. The author of this visualization is part of the Twitter Developer team, so he had access to billions of tweets for this project, something not feasible for other teams. In this case no further information from the tweets is used, only the latitude and longitude.

Just landed [12] extracts travel information from tweets and plots the journeys on a map. The map itself remains two-dimensional but the "flights" are visualized as three-dimensional curves. As an important feature, the chronological ordering of the tweets makes it possible to review a certain time period. This visualization is impressive and clear at the same time but, although it uses edges, it is far from defining a Twitter network. This visualization is shown in figure 1.


Figure 1: Layout of Just landed [12]

A world of tweets [9] also uses geographical information, showing where people have been tweeting from during the past hour via a heatmap in which the more tweets there are from a specific region, the "hotter" or redder it becomes. According to the authors in [9], through the activity of Twitter users it is possible to tailor a new map of the world that evolves during the day according to the time zones and the spread of mobile technologies. tweetPing [8] also provides a robust and appealing visualization, showing a map on which every position where a tweet is sent is highlighted, yielding a heatmap of the areas where Twitter is most used. The main drawback of these two visualizations is that they do not show media information; instead they display statistical data. Finally, conceived as a tool for uploading geographical information, mapsData [23] is a good example of what can be done by merging Twitter and a geographical visualization. The output is a visualization of this data that provides interesting features such as heatmaps, clusters, markers or bubble maps. This tool does not support tweet networks or image tracking.

Another good tool offering a powerful and useful visualization is trendsMap [24]. The idea behind this tool is that Twitter is a network of "trends", which may be words, users or tags, and all of them are shown. These trends are defined worldwide, but by means of the zoom and pan features regional and local trends can also be shown. As an outstanding feature, the tool stores up to 7 days of historical data but, as in the previous projects, media-related information such as images is not used.

Another interesting approach to showing information obtained from a Social Media network is the one taken by Dou et al. with the method called LeadLine, defined in [25]. LeadLine is an interactive Visual Analytics tool that automatically identifies important events in news and Social Media networks. This visualization includes interesting features besides a map, such as a streamgraph to represent the temporal evolution of a topic, a good way to detect temporal events related to that topic.

As for the layout of Twitter statistics, TweetPing [8] presents all the basic information returned by the application in an appealing and clever visualization. This statistics layout is shown in figure 2.


Figure 2: Statistics layout of TweetPing [8]

Other visualizations have studied how images spread in Social Media networks. For instance, Itoh et al. in [26] define a system to analyze social behavior by recognizing changes in trends in people's ideas, experiences or interests, using as input both images and text obtained from different media sources such as Japanese newspapers or Twitter. The difference between our approach and this one is that the authors in [26] created a three-dimensional visualization of histograms of stacked images on a timeline, where the third dimension represents the different topics. The authors also overlay line charts on the histograms to make them easier to compare. The layout of this visualization is shown in figure 3:

Figure 3: Visualization created by Masahiko Itoh et al. [26]

We have found some concepts missing in related work that we want to address. The main one is the use of images as input: although others have used them, as in [26], it has not been widely defined as a research task. A second concept, used more often than images but in different contexts, is the creation of a topological network of tweets based on users' locations. Finally, we miss a way to define the success of a news item in Social Media networks.


4 Data analysis

To perform the data analysis, we first need to obtain information about the messages and images, as well as the location of the messages. We use as much information as possible from Twitter. Because the geographical information is sometimes not stored in Twitter, we also use an external gazetteer resource called GeoNames [27], which provides the geographical longitude and latitude based on the location provided by the user on Twitter.

4.1 Twitter

Due to a limitation of the Twitter API (version 1.1 [28]), it is only possible to create graphs of at most two layers, the first layer being the original message and the second the retweets of this original message. If someone retweets a retweet, only the original message is stored, so the intermediate information is lost. As we are interested in the spatial location of both the tweet and the retweet, the Twitter graph of a news item spread might look like the one shown in figure 4:

Figure 4: Spatial appearance of a Twitter spread graph.
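
The two-layer structure described above can be illustrated with a minimal sketch. It assumes that each status is a dictionary shaped like a Twitter API v1.1 status object, in which a retweet carries the original message under the `retweeted_status` key. The function name `build_spread_graph` and the sample ids are hypothetical and serve only as illustration.

```python
from collections import defaultdict


def build_spread_graph(statuses):
    """Group retweets under the original tweet they point to.

    `statuses` is a list of dicts shaped like Twitter API v1.1 status
    objects; only the `id` and `retweeted_status` keys are used here.
    Returns a dict mapping each original tweet id to its retweet ids.
    """
    graph = defaultdict(list)
    for status in statuses:
        original = status.get("retweeted_status")
        if original is None:
            # An original message: register it as a root of a spread.
            graph.setdefault(status["id"], [])
        else:
            # A retweet, even a retweet of a retweet, references the
            # original message, so the graph never exceeds two layers.
            graph[original["id"]].append(status["id"])
    return dict(graph)


# Hypothetical sample data: tweet 3 was made from retweet 2,
# but it still points to the original tweet 1.
statuses = [
    {"id": 1},
    {"id": 2, "retweeted_status": {"id": 1}},
    {"id": 3, "retweeted_status": {"id": 1}},
    {"id": 4},
]
spread = build_spread_graph(statuses)
```

In this sketch every spread is a star: one root and a flat list of retweets, which matches the loss of intermediate information mentioned above.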

In the Twitter API 1.1 there are four different types of objects defined: Tweets, Users, Entities (which provide metadata and additional contextual information about content posted on Twitter) and, if the user has filled it out, Places (specific, named locations with corresponding geo coordinates; tweets associated with a place are not necessarily issued from that location). Of these four objects, the two we are most interested in for obtaining a good data mapping are Tweets and Places, which give us the metadata of the tweet, the information of the retweet and the location from which the tweet was sent.

As we are keeping track of how images spread on Twitter, we crawl Twitter looking for tweets with an attached image. For each tweet and its possible retweets, we use the set of attributes listed in table 1.


Field name            Definition
Id                    Unique id of the tweet
Username              Username of the user who sends the tweet
Location              Location from which the user sends the tweet
Timestamp             Time at which the tweet is sent
Country name          Country from which the tweet is sent
Country population    Population of the country from which the tweet is sent
City name             City from which the tweet is sent
City population       Population of the city from which the tweet is sent
Number of retweets    Number of times the tweet was retweeted
Text                  Actual text of the status update
Latitude              Geographical latitude from where the tweet was sent
Longitude             Geographical longitude from where the tweet was sent
Tags                  Hashtags mentioned in the text
User mentions         Usernames mentioned in the text

Table 1: Fields extracted from the Twitter API
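
As an illustration, the fields of table 1 could be held in a small record type. This is only a sketch under our own assumptions: the class name `TweetRecord`, the attribute names and the choice of optional fields are hypothetical and do not describe the actual storage format.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class TweetRecord:
    """One row of table 1, stored for a tweet or for one of its retweets."""
    tweet_id: int
    username: str
    location: str                # free-text location given by the user
    timestamp: datetime
    country_name: Optional[str] = None
    country_population: Optional[int] = None
    city_name: Optional[str] = None
    city_population: Optional[int] = None
    retweet_count: int = 0
    text: str = ""
    latitude: Optional[float] = None   # may be missing if not geotagged
    longitude: Optional[float] = None
    tags: List[str] = field(default_factory=list)
    user_mentions: List[str] = field(default_factory=list)

    def has_position(self) -> bool:
        """True when both coordinates are known."""
        return self.latitude is not None and self.longitude is not None


# Hypothetical example record without geotagging information.
record = TweetRecord(
    tweet_id=1,
    username="example_user",
    location="Amsterdam",
    timestamp=datetime(2015, 1, 1, 12, 0),
)
```

Keeping the coordinates optional makes explicit that they are not always available from Twitter and may have to be filled in from another source.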

Besides these attributes, which are stored for both the tweet and its subsequent retweets, we also store two more generic attributes that are shared by the tweet and its retweets, or belong to the tweet alone when it is the origin of a new spread. These are the URL of the attached image and the URL of the message.

Not all of this information can be obtained directly from Twitter. Sometimes the user does not use the geotagging function of the Twitter smartphone application, or uses Twitter from a desktop PC or laptop. Under these circumstances the latitude and longitude are not obtainable from Twitter, so another resource must be used. We have used the external gazetteer GeoNames [27], which is able to accurately return the longitude and latitude of a place based only on the city or country name.
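
The fallback just described can be sketched as follows. The sketch assumes a status dictionary shaped like a Twitter API v1.1 status object, whose `coordinates` field, when present, is a GeoJSON point in (longitude, latitude) order. The function name `resolve_position` and the `gazetteer` callable are hypothetical; the callable stands in for a GeoNames lookup and is replaced by a plain dictionary in the example.

```python
from typing import Callable, Optional, Tuple

LatLon = Tuple[float, float]


def resolve_position(
    status: dict,
    gazetteer: Callable[[str], Optional[LatLon]],
) -> Optional[LatLon]:
    """Return (latitude, longitude) for a status, or None if unknown."""
    coords = status.get("coordinates")
    if coords and coords.get("type") == "Point":
        # GeoJSON stores the pair as [longitude, latitude].
        lon, lat = coords["coordinates"]
        return (lat, lon)
    # No geotag: fall back to the free-text profile location.
    location = ((status.get("user") or {}).get("location") or "").strip()
    if location:
        return gazetteer(location)
    return None


# Stand-in for the gazetteer, with an illustrative entry only.
fake_gazetteer = {"Amsterdam": (52.37, 4.89)}.get

geotagged = {"coordinates": {"type": "Point", "coordinates": [4.9, 52.4]}}
profile_only = {"coordinates": None, "user": {"location": "Amsterdam"}}
unknown = {"user": {"location": None}}

positions = [
    resolve_position(geotagged, fake_gazetteer),
    resolve_position(profile_only, fake_gazetteer),
    resolve_position(unknown, fake_gazetteer),
]
```

Passing the gazetteer as a parameter keeps the geotag check separate from the external lookup, so the lookup can be swapped or cached without touching this logic.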

Unfortunately, the accuracy of GeoNames, although high, might not be enough depending on the location the user has entered. First, the user may enter an invented or wrong location, which returns either an error or the wrong coordinates. The user may also misspell the location. This is not a critical issue, as we use a fuzzy factor when calling GeoNames, so the resource itself is able to overcome the misspelling of one or two letters. Finally, the resource might return a wrong location because there may be more than one city with the same name.
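
A minimal sketch of such a call is given below. It only builds the request URL for the GeoNames search web service, using its `q`, `maxRows`, `fuzzy` and `username` parameters, and performs no network access. The fuzzy value of 0.8, the account name and the rule of preferring the most populous candidate among cities with the same name are assumptions made for illustration, not the settings used in this work; the candidate entries are illustrative values as well.

```python
from urllib.parse import urlencode, urlparse, parse_qs

GEONAMES_SEARCH = "http://api.geonames.org/searchJSON"


def geonames_search_url(place, username, fuzzy=0.8, max_rows=5):
    """Build a GeoNames search URL that tolerates small misspellings."""
    params = {
        "q": place,
        "maxRows": max_rows,
        "fuzzy": fuzzy,        # assumed value; lower means more tolerant
        "username": username,  # GeoNames account name (hypothetical here)
    }
    return GEONAMES_SEARCH + "?" + urlencode(params)


def pick_most_populous(candidates):
    """Assumed tie-break for homonymous cities: take the largest one.

    `candidates` mimics entries of the `geonames` list in the JSON
    response, where `lat` and `lng` are returned as strings.
    """
    if not candidates:
        return None
    best = max(candidates, key=lambda c: int(c.get("population", 0)))
    return (float(best["lat"]), float(best["lng"]))


url = geonames_search_url("Amsterdm", "my_account")
query = parse_qs(urlparse(url).query)

# Illustrative candidates for an ambiguous city name.
candidates = [
    {"name": "Paris", "countryCode": "US", "population": 25000,
     "lat": "33.66", "lng": "-95.55"},
    {"name": "Paris", "countryCode": "FR", "population": 2138551,
     "lat": "48.85341", "lng": "2.3488"},
]
chosen = pick_most_populous(candidates)
```

Preferring the most populous candidate does not remove the ambiguity; it only makes the choice deterministic, so a wrong location remains possible for tweets sent from the smaller city.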

4.2 Images

As explained in the previous section, we intend to create a Twitter graph based on the images attached to the messages. To do this, we have to keep track of the field inside the tweet metadata that contains the URL of the image. So, rather than checking the text that the user has written, we check the image that the user has shared with his or her contacts. This process might lead to some errors, mainly because the image may not match the topic and text of the news item.

As in the case of the longitude and latitude, we cannot be sure that the attached image and the text match each other, as the two may have different contexts. It is possible to find, for instance, a quote from a celebrity in a tweet that is about a different topic and, as we are not using an image classifier, we cannot discard this tweet even though it is clear that the image of the celebrity and the topic of the tweet have nothing to do with each other. In order to minimize these issues in an unsupervised setting, the best option is to rely only on trustworthy users and companies, or to take into consideration only tweets that have been retweeted a large number of times. By doing this, the chance of a real match between the attached image and the text of the news item increases significantly.

In this case we are making the assumption that the more people retweet an image, the more reliable the original poster is. Moreover, we will label the spreads by taking into consideration both the content of the news item and the attached image, avoiding possible mismatches between the two. The news item and its attached image will be labeled according to different parameters, among them the topic of the news item. In this manner, the evaluation will be more reliable.

Problems might arise when crawling Twitter without subsequent human supervision or image classification, since it will not be possible to discard all the wrong matches. In that case, some heuristics might be followed, such as relying only on trustworthy sources such as News Media agencies or notable celebrities, though for the latter some errors are still possible.

Although potentially serious, mismatches between images and topics are, in our experience, not very common because, even when the image seems unrelated, some elements of the image itself agree with the topic, such as a quote or a non-visual reference that is detected only by humans and not by classifiers. This is why a prospective mismatching error will not heavily affect the final outcomes of the model.

5 Model

To gain better insight into how news spreads on Twitter, we need to define a set of intrinsic characteristics of the news items. Based on these inputs, a set of parameters will be obtained in order to define the potential of a particular news item. This potential will be defined in terms of burstiness and spread speed, which are the outcomes the model returns to characterize the success of a particular news item.

5.1 Assumptions

In order to define our model, let us first state some assumptions that need to be taken into consideration:

• The more followers a user has, the more influential he is.

• Generally speaking, accounts of companies and celebrities will have a large number of followers.

• The population in the area where a news item occurs affects the speed and spread potential of the item itself.

• The place where a news item takes place affects its spread.

• The more time has passed since a news item occurred, the lower the chance of a large spread through the network.

The first assumption refers to the fact that when a user is followed by a large number of other users his influence will be higher, and thus the probability that his tweets are retweeted more often and faster increases. Consequently, a user with a low number of followers will have a lower chance of having his posts shared. The same applies to the accounts of companies and celebrities: their potential to reach more users is higher than that of so-called standard users. This is because they transfer the fame they have outside the network, in real life, to the Social Media network, which leads to an expected increase in the number of followers of such users in the network.

The population and the number of users are not always interconnected, so the third and fourth assumptions should be seen as two interconnected halves. In the third assumption the population is linked to the success potential of a news item, that is, the more people there are in a certain area, the higher the chance that a news item that happened there is uploaded to a Social Media network and spread by someone. But this is not always true, because not all regions of the planet have the same characteristics: standard of living, Internet access and literacy level depend on the region. A news item occurring in the New Delhi Metropolitan Area is not the same as one occurring in New York, even though both have approximately the same population.

The time elapsed since the event that originated the news item also heavily affects its spread in the network. Here it is assumed that a news item that does not spread fast has a high probability of not spreading widely.

5.2 Metrics and parameters

5.2.1 Metrics

Having defined the assumptions of the project, it is now time to define the parameters the model takes and its metrics or outcomes. Two output metrics indicate how successful a Twitter news spread is:

• Speed of the news item.

• Burstiness of the news item.

Speed means the velocity with which a news item spreads through the network. This metric is crucial when related to the second output metric, burstiness. Burstiness is not as easy to understand as speed, but a good definition is the one provided by Renaud Lambiotte and Lionel Tabourier: burstiness is the set of intermittent increases and decreases in activity or frequency of an event, and distributions of bursty processes or events are characterised by heavy, or fat, tails [29]. Such a broad definition fits the field of Social Media networks well if we take the event as the origin of the spread and the increases and decreases as the temporal activity of the network towards the event that generates the news item.

5.2.2 Parameters

The model needs two input parameters:

• Scope of the news item.

• Topic of the news item.

By scope we mean the geographical range of the news item, which we classify into four types, from smallest to largest. The smallest scope is the personal or friend group, the second is the local one, the third is the national one, and the largest is the international one. By definition, every group contains the previous ones, as illustrated in figure 5:

Figure 5: Types of scope

Depending on the type of scope, we assume that news items will behave differently, that is, they will have different potential burstiness and speed. For the personal scope we consider the burstiness potential to be mid-low due to the limited number of individuals involved but, given the bonds presumed within a group of friends or acquaintances, the speed potential is high. In the case of the local scope the burstiness potential is medium because it is partially limited in space, but the speed potential remains high due to the presumed proximity among the users. For national news the burstiness potential may be considered medium to high, and its speed potential ranges from medium to high depending on the topic. Finally, for international news both burstiness and speed are considered the highest among all news items. Our hypotheses are shown in table 2.

Scope type      Burstiness potential   Speed potential
Personal        Mid-low                Mid-high
Local           Mid                    High
National        Mid-high               Mid-high
International   Mid-high to high       High

Table 2: Burstiness and speed potential based on the scope

The second parameter used to define the model is the topic, namely the type of news. We have separated news items into five main topic classes: Sports, Economics, Politics, "Science, Technology and Culture" and Entertainment. For Sports, burstiness and speed should be mid or mid-high, but this also depends on the sport itself. In Economics, burstiness and speed tend to be average, whereas Politics news should have burstiness and speed above average. Science, Technology and Culture news is not usually widespread, except for big events such as the release of a new smartphone, so its burstiness and speed tend to be below average. Finally, the behaviour of Entertainment news depends strongly on the news item itself: it is rare for an Entertainment news item to spread fast, but it might burst high, as many of the trending topics in networks like Twitter are based on gossip or TV shows. The speed and burstiness potential associated with each type of news are summarized in table 3:

Topic type                        Burstiness potential        Speed potential
Sports                            Mid                         Mid-high
Economics                         Mid                         Mid
Politics                          Mid-high                    Mid-high
Science, Technology and Culture   Mid-low                     Mid-low
Entertainment                     Depends on the news item    Depends on the news item

Table 3: Burstiness and speed potential based on the topic

5.3 Network diffusion

To find a model of how information is diffused in a network, let us consider some methods that have been defined in the literature. There are two ways in which information spreads in the network and reaches one or more other nodes: through the connections established among the users, and through the influence of external sources, such as News Media agencies, relatives or friends. We now elaborate on both.

The first models used to describe information diffusion in Social Media networks considered only the relations among users. In recent years, new models such as [30], which take both approaches into consideration, have arisen. According to this new type of model, information "jumps" across the network, and the explanation offered by the authors is that there is an unobservable external influence on the network. According to the authors, about 71% of the information volume in Social Media networks such as Twitter can be attributed to network diffusion, and the remaining 29% is due to external events and factors outside the network.

Information spreads through a Social Media network in different ways, but the best approximation, and to a certain extent a generalization, appears to be the one shown in figure 6. In this particular case, we build a link from one node to another if the latter mentions the former in a tweet that contains a topic that the original node had talked about earlier. As there is no mechanism such as explicit threading in networks like Twitter, this is the best approximation of the path the original user employs to diffuse a topic. Figure 6 also shows how a diffusion network is created. All posts that contain the topic, identified by a keyword or hashtag in Social Media networks like Twitter, are labeled with timestamps, and the diffusion links are created as explained above. The blue nodes are inside the network, while the yellow ones are those that mentioned the topic but are not linked to any other message:

Figure 6: How a topic is diffused (left) and network structure (right) [31]
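The link-building rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: posts are plain dictionaries with the hypothetical keys `user`, `time`, `text` and `mentions`, and a topic is matched by a simple substring test on the text. It is not the implementation used in [31].

```python
def build_diffusion_links(posts, topic):
    """Build diffusion links for one topic.

    posts: list of dicts with the (assumed) keys 'user', 'time', 'text'
    and 'mentions' (a list of user names).
    Returns (links, isolated): links is a list of (source, target) pairs,
    isolated is the set of users who posted about the topic but are not
    part of any link (the yellow nodes in figure 6).
    """
    topic = topic.lower()
    # Keep only posts that contain the topic, ordered by timestamp.
    on_topic = sorted(
        (p for p in posts if topic in p["text"].lower()),
        key=lambda p: p["time"],
    )
    first_seen = {}  # user -> time of that user's first on-topic post
    links = []
    for post in on_topic:
        for mentioned in post["mentions"]:
            # Link only if the mentioned user talked about the topic earlier.
            if mentioned in first_seen and first_seen[mentioned] < post["time"]:
                links.append((mentioned, post["user"]))
        first_seen.setdefault(post["user"], post["time"])
    linked = {user for link in links for user in link}
    isolated = {p["user"] for p in on_topic} - linked
    return links, isolated
```

A post that mentions a user who never wrote about the topic creates no link, which matches the rule that the original node must have talked about the topic earlier.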

With this structure in mind, it is possible to develop a model based on a series of local dynamics that measure the information spread along the following three dimensions:

• Speed: velocity of the spread.

• Scale: number of affected nodes in the network.

• Range: how far the diffusion continues.

In the case of networks like Twitter, the range can measure different magnitudes. As will be explained later, in our case the range measures the geographical distance between the node that generates the information and the node that receives that information and rediffuses it. These three dynamics are visually explained in figure 7:

Figure 7: Local dynamics of a Social Media network [31].

External influence on the network is of vital importance for information diffusion. On Social Media networks such as Twitter, users often post links to various websites, mainly links to news articles, videos or images. In cases such as ours the external influence is especially significant, as we are evaluating the effect of the attached images, which draw the attention of the user. Therefore, in order to create a feasible model it is important to know the effect of attaching images to a node.

5.4 Burstiness and speed

Two main characteristics are the most influential in defining the success of a sharing cascade or spread. The first is speed, which here defines how fast the spread is. The second is burstiness, used here as a way to define the spread potential in terms of frequency over the total population, speed and, in the case of geographically-based networks, area [32]. A spread is more bursty when more users in a certain area participate in the network and when less time is taken to cover the area.

Is it really possible to generalize about how these networks are created and spread? It is hard to give a fully trustworthy answer to this question, because the chances of creating extensive message networks in Social Media applications such as Twitter, with thousands or even millions of messages, are low. Only for global events is it possible to obtain reliable information. Even so, it is not yet clear whether the behaviour of Social Media networks depends on the size of the network itself.

5.4.1 Speed

Speed is important in determining the success of a news item spread, as we consider that the faster a news item spreads, the more successful it is; potentially successful news will therefore be shared faster than less successful news. This concept has been developed in depth by authors such as Yang and Counts in [31], who defined a model based on hazard regression models [33] to quantify the degree to which a number of features of both users and messages predict the speed of diffusion to the first-degree offspring. This constraint also applies to networks based on tweets because, so far, it is not possible to create networks with more than two levels (a message and its corresponding retweets).

Aspects such as the author, his activity and other intrinsic characteristics will lead to a faster or slower diffusion of his messages. For instance, messages created by a big company or by an influential figure will have a higher probability of spreading faster in a network. It is also worth noting that the author does not necessarily have to be a famous figure; even in smaller circles such as friend groups there are always more and less influential individuals.

Other characteristics that may influence the success of a message are the inclusion of text or media links, the text formatting (not available in all networks) and mentions of other users (a message with a high number of user mentions will have a high probability of being shared quickly and widely).

Empirical experiments, such as the one carried out by the authors of [31] on how long it takes for a message on Twitter to be shared depending on its topic, show that when the author is more active in posting and is mentioned more often, the current message will be shared within a short period of time. Also, when the message is a mention, it has a higher chance of continuing its diffusion. Finally, in the case of events, the messages created at an early stage of the event are more likely to be retweeted within a short time. On the other hand, it seems that the presence or absence of media links does not affect the speed at which a message is shared.

All in all, speed is a determining factor for the success of content in a network. The owner of the information, along with other aspects such as the inclusion of media links or the previous activity of the poster, will increase or reduce the speed of its spread.

5.4.2 Burstiness

The effect of burstiness in Social Media systems has not been widely studied since the boom of such networks. The burstiness of a post in a Social Media network might depend on a series of factors.

The first factor is formed by the statistical properties that define the microscopic evolution of a social network. These properties tend to vary depending on the network, but according to the authors of [34] it is possible to generalize and extract a few common properties, shared by all Social Media networks, on how new nodes are added to them. The authors determined that most new edges span short distances, representing people close to, or relatives of, the person represented by the node.

Another influential factor of burstiness is the dynamics of edge creation inside the network. The fact that link creation is a bursty process, that is, not homogeneous in time, was demonstrated in [35]. For each user the authors created an event time series in which an event is the creation of an edge. The assumption is that if the edge creation process is homogeneous, the time elapsed between two events has an exponential distribution. A bursty behaviour, on the other hand, comprises many short time intervals, which together form a peak of activity or burst, and relatively fewer but longer periods of low or even no activity.
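The contrast between a homogeneous and a bursty event series can be made concrete with a short sketch. It computes the inter-event times of a series and a commonly used burstiness coefficient, (σ − μ) / (σ + μ), where μ and σ are the mean and standard deviation of the inter-event times. This coefficient is a swapped-in illustration and is not the measure used in [35] or in our model: it is −1 for perfectly regular events, close to 0 for exponentially distributed intervals (for which σ = μ) and positive for bursty series. The function names are hypothetical.

```python
import statistics


def inter_event_times(timestamps):
    """Return the time elapsed between consecutive events."""
    ordered = sorted(timestamps)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]


def burstiness_coefficient(timestamps):
    """Compute (sigma - mu) / (sigma + mu) over the inter-event times.

    Needs at least two events at different times; raises
    statistics.StatisticsError if there are fewer than two events.
    """
    gaps = inter_event_times(timestamps)
    mu = statistics.mean(gaps)
    sigma = statistics.pstdev(gaps)
    return (sigma - mu) / (sigma + mu)
```

For example, events at times 0, 10, 20 and 30 give a coefficient of −1, whereas two tight groups of events separated by a long silence give a positive value.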

The outcome of the experiment, with the averaged values for the different time series, appears in figure 8:

Figure 8: Schematic representation of speed in a news item spread [35]

In figure 8 it is possible to see a higher probability when the variable takes values close to 1, and the distribution also has a long tail, which is related to the age of the node: the more time has passed since the node was created, the less bursty it will be.

Other factors that might affect burstiness, and that also influence speed as explained above, are those related to the user, such as popularity, the number of related users or the content of the message. In particular Social Media networks such as Twitter, the number of people who follow or unfollow the user who creates or shares the message affects the burstiness.

Some interesting findings about burstiness in Twitter networks were made by Seth A. Myers and Jure Leskovec in [36]. The first is that the users who follow another user during a burst tend to be more similar to that user than those gained outside a burst, where similarity is defined on the TF-IDF weighted word vectors of the two users' aggregated tweet documents. This suggests that users become more related during a burst. This does not hold as a rule for the followers of big News Media agencies with a wide range of users, as those followers might strongly disagree with most of the messages of the agency.
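A similarity of this kind can be sketched as the cosine between TF-IDF vectors built from each user's aggregated tweets. The sketch below is a minimal pure-Python illustration, assuming already tokenized documents and one common smoothed IDF variant; the exact weighting used in [36] may differ, and the function names are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """documents: dict mapping user -> list of tokens (aggregated tweets).

    Returns a dict mapping user -> {term: tf-idf weight}, using the
    smoothed idf log((1 + N) / (1 + df)) + 1 (an assumed variant).
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents.values():
        doc_freq.update(set(tokens))
    vectors = {}
    for user, tokens in documents.items():
        counts = Counter(tokens)
        length = len(tokens) or 1  # avoid division by zero for empty users
        vectors[user] = {
            term: (count / length)
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        }
    return vectors


def cosine_similarity(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)
```

Two users who tweet about the same match share weighted terms and obtain a positive similarity, while users with no terms in common obtain exactly 0.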

Another finding is that not only the topic of the message affects burstiness; even the use of certain words might lead to a burst of follows or unfollows. Emotive words related to an event could lead to a burst of messages and follows whereas, on the other hand, keywords such as "free", "sale" or "download" increase the probability of an unfollow burst.

5.5 General equation

After defining the metrics and parameters involved in the model, it is time to explain how they all interact, so that a general equation for the model can be defined. This general equation appears below in equation 1:

$$(Sp_{spread}, B_{spread}) \leftarrow Spread(S, T) \tag{1}$$

where "S" is the scope, "T" the topic, "Sp" the speed and "B" the burstiness. As explained before, the model returns the speed and burstiness of the spread as outcomes.

5.5.1 General speed equation

Speed is a physical vector magnitude that represents the displacement of an object per unit of time. In our case, speed can be used to measure how fast the news item spreads across an area, regardless of how big this area is. Given a set of related timestamps, we can order them and obtain a sequence of events. As we also have the geographical position from which each tweet is sent, we can measure the physical distance between the position of the tweet and that of the retweet, and therefore obtain the speed for each time interval, as shown in figure 9:

Figure 9: Representation of speed in a news item spread

Measuring the shortest geographical distance between two points based on their longitude and latitude is not a trivial task. As longitude and latitude indicate relative positions on a sphere, the haversine distance is needed to know the distance that separates them, together with the radius of the sphere being used, in this case the Earth. This can be seen in figure 10.

Figure 10: Graphical representation of the haversine distance [37]

With this in mind, it is possible to compute the haversine distance d between two points on the surface of the Earth by applying equation 2:

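$$
\begin{aligned}
d &= R \cdot c, \\
c &= 2 \cdot \operatorname{atan2}\left(\sqrt{a}, \sqrt{1 - a}\right), \\
a &= \sin^2(\Delta\phi/2) + \cos\phi_1 \cdot \cos\phi_2 \cdot \sin^2(\Delta\lambda/2)
\end{aligned}
\tag{2}
$$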

for point p1 = (φ1, λ1) and point p2 = (φ2, λ2), with φ latitude and λ longitude, ∆φ = φ2 − φ1, ∆λ = λ2 − λ1, and R being the radius of the Earth, 6,371 kilometers.
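As an illustration, equation 2 translates almost directly into code. The following is a minimal sketch using only the Python standard library; the function name `haversine_km` is hypothetical, and the inputs are assumed to be given in decimal degrees, as returned by Twitter or a gazetteer.

```python
import math

EARTH_RADIUS_KM = 6371.0  # radius R used in equation 2


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points in degrees."""
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    delta_phi = math.radians(lat2 - lat1)
    delta_lambda = math.radians(lon2 - lon1)
    a = (
        math.sin(delta_phi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(delta_lambda / 2) ** 2
    )
    a = min(1.0, a)  # guard against rounding slightly above 1
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
    return EARTH_RADIUS_KM * c
```

A quick sanity check is that two points on the equator separated by 90 degrees of longitude are a quarter of a great circle apart, that is, R · π/2.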

The last step is to return the Weighted Arithmetic Mean of the speed as the final outcome. This metric is chosen because the spread has a different speed over time, so we measure the speed of each section and weigh it according to its duration relative to the total duration of the spread. To do this we apply equation 3:

$$Sp = \frac{\sum_{i=1}^{n} w_i \, Sp_i}{\sum_{i=1}^{n} w_i} \tag{3}$$

where Spi are the speed values of each of the sections of the spread and wi is the weight, or relative duration, of the section with respect to the total time of the spread.
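Equation 3 can be sketched as follows, assuming that each section of the spread is given as a hypothetical pair (distance in metres, duration in seconds) with a positive duration. The function names are illustrative only.

```python
def weighted_mean(values, weights):
    """Weighted Arithmetic Mean of values with the given weights."""
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("the weights must not sum to zero")
    return sum(w * v for v, w in zip(values, weights)) / total_weight


def mean_speed(sections):
    """sections: list of (distance_m, duration_s) pairs, duration_s > 0.

    Each section speed is weighted by the share of the total time it covers.
    """
    total_time = sum(duration for _, duration in sections)
    speeds = [distance / duration for distance, duration in sections]
    weights = [duration / total_time for _, duration in sections]
    return weighted_mean(speeds, weights)
```

Note that with weights proportional to duration, the weighted mean reduces to the total distance divided by the total time: a section of 1000 m in 10 s followed by one of 3000 m in 90 s gives 0.1 · 100 + 0.9 · 33.3 = 40 m/s, which equals 4000 m / 100 s.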

5.5.2 General burstiness equation

The other outcome of the model is burstiness, that is, the increases and decreases in the activity of an event over time. In the model, burstiness measures the potential number of people who have had access to the news item over time. Therefore, the magnitudes to relate are population and time, as shown in figure 11. This magnitude often, but not always, increases with time in a non-uniform way. Thus, in order to obtain the burstiness Bi in an interval i, that is, the increase of the population in that interval, we need an approximation to the derivative of the population p with respect to the time t, as in equation 4:

$$B_i = \frac{\partial p}{\partial t} \tag{4}$$

Figure 11: Representation of burstiness in a news item spread

Finally, as in the case of speed, we will compute the Weighted Arithmetic Mean of the partial burstiness of all the intervals of the spread. In order to obtain the final burstiness B we will apply equation 5:

$$B = \frac{\sum_{i=1}^{n} w_i \, B_i}{\sum_{i=1}^{n} w_i} \tag{5}$$

where Bi are the partial burstiness values for each of the intervals and wi is the weight of each interval, that is, the ratio of the duration of the interval to the total duration of the spread.
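Equations 4 and 5 can be sketched together. The sketch assumes, for illustration, that the spread is given as strictly increasing timestamps (in seconds) and the cumulative potential population reached at each timestamp; the derivative of equation 4 is approximated by a finite difference over each interval. The function names are hypothetical.

```python
def partial_burstiness(times, populations):
    """Finite-difference approximation of dp/dt for each interval.

    times: strictly increasing timestamps in seconds.
    populations: cumulative potential population reached at each timestamp.
    """
    return [
        (p1 - p0) / (t1 - t0)
        for (t0, p0), (t1, p1) in zip(
            zip(times, populations), zip(times[1:], populations[1:])
        )
    ]


def mean_burstiness(times, populations):
    """Weighted Arithmetic Mean of the partial burstiness values.

    Each interval is weighted by its duration relative to the total time.
    """
    rates = partial_burstiness(times, populations)
    total_time = times[-1] - times[0]
    weights = [(t1 - t0) / total_time for t0, t1 in zip(times, times[1:])]
    return sum(w * b for w, b in zip(weights, rates)) / sum(weights)
```

For instance, reaching 1000 people in the first 10 seconds and 600 more in the following 30 seconds gives partial values of 100 and 20 people per second and a final burstiness of 0.25 · 100 + 0.75 · 20 = 40 people per second.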

5.6 Measurements

In order to measure the two outcomes of the model, speed and burstiness, it is first necessary to define which units are used for these measurements. For the speed at which a news item spreads through the network we use metres per second (m/s), as defined by the International System of Units: distance in metres divided by time in seconds. For burstiness, the approximation of the derivative of the population with respect to time is used. For each step of the spread, that is, for each new message added, a new partial burstiness and speed are computed, and an average value of both speed and burstiness is also computed by taking the Weighted Arithmetic Mean of all the connections created up to that point. How these metrics are obtained is shown in figure 12:

Figure 12: Partial and average values of speed (left) and burstiness (right)

The final burstiness and speed of the whole spread will be the average metrics at the end of the spread.
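The step-by-step averaging can be sketched as a running Weighted Arithmetic Mean that is updated each time a new connection is added. The sketch assumes that the partial values (speed or burstiness) and the durations of their intervals are already available; the function name is hypothetical.

```python
def running_weighted_means(partial_values, durations):
    """Return the average metric after each step of the spread.

    partial_values: partial speed or burstiness of each new connection.
    durations: positive duration of the interval of each connection,
    used as its weight.
    """
    means = []
    weighted_sum = 0.0
    total_duration = 0.0
    for value, duration in zip(partial_values, durations):
        weighted_sum += value * duration
        total_duration += duration
        means.append(weighted_sum / total_duration)
    return means
```

The last element of the returned list is the average at the end of the spread, that is, the final value of the metric.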

6 Results

In order to evaluate which type of information has the fastest propagation speed and bursts the most, we crawled Twitter for a total of 3 days, from August 4th to August 6th 2014. As we are not interested in any specific topic, we used the generic keyword "the" to make sure that the maximum number of tweets was processed. Within the constraints of the Twitter crawl limits this resulted in a dataset of 176911 tweets. For our experiments we need tweets with attached images along with some type of geographical information, such as the user location or the coordinates from which the message was sent. Out of the 176911 tweets, 27709 contain an image and 12955 have a valid geolocation. If we apply both constraints we get 1604 tweets, which are the starting point for our retweet analysis. The ones without retweets all have a speed and burstiness of 0.0, so we are left with 1604 tweets that contain both geolocation and an image and that are retweeted at least once.
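The selection criteria can be expressed as a simple filter. The sketch below assumes that each crawled tweet has been flattened into a dictionary with the hypothetical keys `image_url`, `lat`, `lon` and `retweet_count`; these names are illustrative and are not the field names of the Twitter API.

```python
def select_candidates(tweets):
    """Keep tweets with an image, a valid geolocation and at least one retweet."""
    selected = []
    for tweet in tweets:
        has_image = bool(tweet.get("image_url"))
        # Compare with None so that a coordinate of 0.0 is still valid.
        has_geo = tweet.get("lat") is not None and tweet.get("lon") is not None
        is_retweeted = tweet.get("retweet_count", 0) >= 1
        if has_image and has_geo and is_retweeted:
            selected.append(tweet)
    return selected
```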

The 1604 tweets comprise 36 independent spreads. For the analysis we need ground truth, so we manually labeled each spread according to two parameters. The first is the topic of each news item, defining what it is about. There are five topic labels:

• Sports: all types of sport events.

• Economy: economic news items.

• Society: fashion, style and celebrities related information.

• Entertainment: jokes, graphic art and fun posts.

• Politics: news items related to political information.

The other input is scope, the geographical reach of the item. This parameter is labeled as follows:

• Friends: information shared among users that belong to a group of friends.

• Local: local news items.

• National: information with scope involving a nation.

• International: worldwide news items.

6.1 Speed

In order to support our reasoning, we show and comment on a series of plots with the outcomes divided as explained above, followed by a final remark summarizing the plots. As the time each item needs to spread differs, time is normalized.
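One simple way to make spreads of different duration comparable is min-max normalization, which maps the first message of a spread to 0 and the last one to 1. The text does not state which normalization was used, so the following sketch is an assumption for illustration, and the function name is hypothetical.

```python
def normalize_times(timestamps):
    """Map the timestamps of one spread linearly onto the interval [0, 1]."""
    start, end = min(timestamps), max(timestamps)
    span = end - start
    if span == 0:
        # All messages share the same timestamp.
        return [0.0 for _ in timestamps]
    return [(t - start) / span for t in timestamps]
```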

Figure 13 shows the speed of the different news items based on the topic of the image itself. The items with the highest speed are those related to Entertainment topics, with one in particular that stands out due to its success. Politics, Society and Sports related items show an average speed, whereas the speed of the news items with an Economy topic is the lowest.

Figure 13: Instantaneous speed per topic for Entertainment (a), Politics (b), Society (c), Sports (d) and Economy (e)

We can observe the spread graphs of all 36 spreads used as the dataset in figure 14. There we can see that part of the reason why Entertainment news items have a high speed is that one of them has a high number of retweets, which produces a high speed as this particular spread has worldwide coverage. It is also interesting to see how speed starts high and decreases with time. We can see that Sports and Society news items have a good start and progressively slow down, and that Economy and Politics spreads are not fast at any point of the spread. This is because the dataset contains no big Politics event; for such an event it would be reasonable to expect a high speed, as is usual in global events:

Figure 14: Instantaneous speed per topic and spread (a) with zoom (b)

We now comment on the results based on the scope of the news items, which are displayed in figure 15. The speed of the news items whose scope is a group of friends is not especially high, but is considerable taking into account the number of people involved in this type of spread. This happens because nowadays it is normal to use Social Media networks to communicate with friends who live abroad, which produces this medium average speed. The speed of locally scoped news items is especially low compared with the other scopes. For national and international items, the bigger the scope, the higher the speed and the higher the number of messages per spread. The speed of internationally scoped items is the highest among the scopes discussed, because this type of news has the highest potential number of users all around the world, which affects the speed of the related spreads.

Figure 15: Instantaneous speed for friends list (a), local (b), national (c) and international (d) scope

Figure 16 shows how the news items behave according to their scope. International news items are clearly the ones that spread the fastest, but there are also some national events with a high speed. We can observe a proportionality between the two quantities: the larger the extent of the scope, the higher the speed. The exception is the friends group, which may be explained by the fact that nowadays it is possible to have friends all around the world, making this type of news item fast in comparison with local ones, which are circumscribed to a small area:

Figure 16: Instantaneous speed per scope and spread (a) with zoom (b)

6.2 Burstiness

Figure 17 displays the burstiness of the items in the dataset per topic. Entertainment related news items show a burstiness that is not very high during their spread. The same applies to the news items related to Politics: their burstiness is not especially high either, but it peaks at the end of their spreads, possibly because they reached a highly populated area with a high number of potential viewers. Concerning Society related news items, only one of them shows a medium burstiness value, the rest being low. All the news items related to Sports have a high burstiness because of the nature of these events, which produce a strong burst at the moment they happen and end quickly, such as a goal in a football match. Finally, the burstiness of Economy related items is especially low, suggesting that this type of news is not very popular in Social Media networks:

Figure 17: Instantaneous burstiness per topic for Entertainment (a), Politics (b), Society (c), Sports (d) and Economy (e)

Figure 18 shows all the spreads in the dataset. Here we observe how sports related news items have a high burst at the beginning of their spread and keep it high throughout, while the remaining topics have a low burstiness during their whole spread:

Figure 18: Instantaneous burstiness per topic and spread

Finally, we comment on the burstiness results by scope, shown in figure 19. The burstiness of the spreads within groups of friends is the lowest among all the types of scope because, by definition, the potential number of users in a friends group is the lowest. The burstiness of local news spreads is slightly higher than that of groups of friends but significantly lower than that of national and international news items. National news items exhibit a higher burstiness than friends and local items, with one particular item having an outstanding start that decreases over time. International items have the highest burstiness, and also include an outstanding element that maintains a high burstiness during its whole spread process:

Figure 19: Instantaneous burstiness for friends list (a), local (b), national (c) and international (d) scope

In figure 20 burstiness is measured per scope label. As in the case of speed, burstiness is directly affected by the scope of the news item: international spreads have the highest burstiness potential, and this potential decreases towards the local and friends group scopes. It is also interesting to observe that the burstiness of international news items remains high during the whole spread, because by definition the potential of international news is higher than that of any other scope, as their potential population is the whole globe. We also see how friends group and local news behave with respect to burstiness. Friends group burstiness is the lowest, as the potential population is smaller than a city:

Figure 20: Instantaneous burstiness per scope and spread (a) with zoom (b)

6.3 Comparison of speed and burstiness

Figure 21 compares the speed and burstiness of the labeled spreads based on their topic. The high burstiness of sports related spreads is visible, due to the initial burst they all have:

Figure 21: Compared speed versus burstiness per topic

Figure 22 shows the compared speed and burstiness for all the spreads. Here we can see that sports news items generally have a high burstiness and a medium speed, while entertainment related items show the highest speed:

Figure 22: Compared speed versus burstiness per topic and spread

The comparison of speed and burstiness per label based on scope appears in figure 23. International news items have both the highest speed and the highest burstiness because, due to their nature, such news spreads worldwide and tends to be bursty. News items within groups of friends are also fast, as it is common to have friends all around the world who comment on or share the item. Finally, national news is bursty and fast to a lesser degree than international news, while local news is the least fast and bursty, as its geographical extent is reduced:

Figure 23: Compared speed versus burstiness per scope

Figure 24 compares all the spreads in the dataset based on their scope. Although a few national elements show a high burstiness, the average of the label exhibits a lower burstiness, as shown in figure 23. On the other hand, both national and international news items have the highest speeds, but overall the international news items are the most widely propagated ones:

Figure 24: Compared speed versus burstiness per scope and spread

The last parameter used to determine which type of news is the most successful is the average number of items per spread. For the spread topic, the results are shown in table 4. The entertainment related spreads stand out: it seems clear that users tend to share this type of content the most:

Entertainment  Politics  Society  Economy  Sports
58             4         6        3        8

Table 4: Average number of elements in a spread based on topic

In table 5 the same comparison is done based on scope. International news items are on average the most shared, while the other scopes seem to have a similar sharing rate:

Friends  Local  National  International
4        7      5         72

Table 5: Average number of elements in a spread based on scope
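
The averages in tables 4 and 5 can be reproduced with a simple grouping step. The following is a minimal sketch in Python, assuming each labeled spread is stored as a (topic, scope, number of messages) tuple; the record layout, the function name and the sample values are hypothetical and are not taken from the dataset:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical labeled spreads: (topic, scope, number of messages in the spread).
# These values are illustrative only and are not taken from the dataset.
spreads = [
    ("Entertainment", "International", 120),
    ("Entertainment", "National", 6),
    ("Sports", "National", 8),
    ("Politics", "Local", 4),
    ("Society", "Friends", 4),
]

def average_size_per_label(spreads, label_index):
    """Return the mean number of messages per spread for each label value."""
    sizes = defaultdict(list)
    for record in spreads:
        sizes[record[label_index]].append(record[2])
    return {label: mean(values) for label, values in sizes.items()}

by_topic = average_size_per_label(spreads, 0)  # grouped by topic label
by_scope = average_size_per_label(spreads, 1)  # grouped by scope label
```

Note that a plain mean is sensitive to a few very large spreads, so a single spread with hundreds of messages can dominate the average of its label.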

With all this information we can conclude that the news items that spread the fastest in Social Media networks are mainly related to entertainment, jokes and Internet memes. Sports news also seems to be very popular and really bursty. It is also possible to conclude that there is a direct relation between the scope and the success of a news item: the wider the scope, the more successful the item is. To obtain more reliable results it would be useful to have access to a bigger dataset, as the one used here contained only 36 labeled spreads.

7 Visualization

7.1 Speed and burstiness in the visualization

The main layout of the visualization shows a world map and a graph representing the spread of the currently selected news item, along with some other elements such as a slidebar, an image selector and different statistics panels. Figure 25 shows this main layout:

Figure 25: Main visualization layout

Speed and burstiness are displayed in the visualization as two line graphs, each showing the current value of the metric and its averaged value up to the selected timestamp. When hovering over one of these lines, a popup with the value of the metric is shown to the right of the line graph. These graphs are shown in figure 26:

Figure 26: Compared speed versus burstiness per scope and spread
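
The two curves drawn for each metric, the current value and the value averaged up to the selected timestamp, can be illustrated with a running average. The sketch below is only an assumption about how such an average may be computed; the sample values and the helper names `running_average` and `values_at` are hypothetical:

```python
from itertools import accumulate

def running_average(values):
    """Average of values[0..i] for every position i."""
    totals = accumulate(values)
    return [total / count for count, total in enumerate(totals, start=1)]

# Hypothetical instantaneous metric samples, one per time step.
speed = [2.0, 4.0, 6.0, 0.0]
averaged = running_average(speed)

def values_at(step):
    """Current and averaged value at the time step selected with the slidebar."""
    return speed[step], averaged[step]
```

Moving the slidebar then amounts to calling `values_at` with a different step, which yields the pair of numbers shown for that timestamp.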

Also, at the top of the layout, a panel shows the current and averaged speed and burstiness. These values change when moving the slidebar placed at the bottom of the visualization. This statistics panel appears in figure 27:

Figure 27: Statistics panel of the visualization

7.2 Additional elements

Besides the elements that provide insight into how speed and burstiness are measured, the visualization contains some auxiliary visual components that provide a better understanding of the spread.

The main component of the layout is the graph that represents the spread process, where a series of nodes share the information uploaded by an initial user, as shown in figure 25. The graph contains two types of nodes: those that contain only one message and those that contain more than one. When hovering over a node that contains one message, information related to that particular message is shown, as in figure 28:

Figure 28: Popup shown when hovering over a single node

On the other hand, when hovering over a node with more than one message, a popup with the number of messages currently clustered in the node appears, as shown in figure 29. After clicking on such a node, a list containing information on all the messages grouped in the selected node appears, also shown in figure 29:

Figure 29: Popup and menu shown when hovering or clicking over a clustered node
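
The distinction between single and clustered nodes can be sketched as follows. This is an illustrative Python example, not the code of the visualization: the message fields, the grouping key (here the place of the message) and the function names are assumptions made for the example:

```python
from collections import defaultdict

# Hypothetical messages; the field names and the grouping key ("place")
# are assumptions made for this illustration.
messages = [
    {"user": "user_a", "place": "Amsterdam", "text": "first post"},
    {"user": "user_b", "place": "Amsterdam", "text": "shared"},
    {"user": "user_c", "place": "Madrid", "text": "shared again"},
]

def build_nodes(messages):
    """Group messages into nodes, one node per distinct place."""
    nodes = defaultdict(list)
    for message in messages:
        nodes[message["place"]].append(message)
    return dict(nodes)

def hover_text(node_messages):
    """Text of the popup shown when hovering over a node."""
    if len(node_messages) == 1:
        message = node_messages[0]
        return f"{message['user']}: {message['text']}"
    return f"{len(node_messages)} messages"

def click_list(node_messages):
    """Entries listed after clicking on a clustered node."""
    return [f"{m['user']}: {m['text']}" for m in node_messages]

nodes = build_nodes(messages)
```

A node with one message returns the details of that message on hover, while a clustered node returns only a count and exposes the full list on click.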

Another important element of the visualization is an image selector that switches between the different spreads based on the shared image of each one of them. After clicking on ”previous” or ”next”, a new image is loaded and the graph is reloaded with the spread of the new image. This image selector appears in figure 30:

Figure 30: Appearance of the image selector

8 Conclusion

Thanks to the results obtained, it is now possible to give reasonable answers to the questions proposed. Although there is no ”success formula” in Social Media networks, it seems reasonable to state that the contents with the highest chances of going viral are those based on entertainment and those that represent graphical jokes. It is also possible to determine that, apart from worldwide events, topics such as politics or economy tend to produce only a modest burst and do not seem to be especially popular among users of Social Media networks.

Regarding scope and speed, we have found a direct relation between both: internationally scoped news has the highest intrinsic speed and local news the lowest. It is also especially interesting that news spreads shared by a small group of users who form a friends group tend to be faster than local news items, as people usually have friends (or followers) all around the world, whereas locally scoped news has a limited range of action, making it slow compared with the other groups.

Concerning burstiness, sports related news items show a high burstiness, mainly due to their instantaneous nature, as most of the time they are based on a single, momentary event such as a goal. Entertainment related items also have a good burstiness potential, partly due to the high average number of tweets of this particular type of news. Out of all the spreads analyzed, the ones related to politics and economy maintained a low burstiness potential, mainly because this type of information does not receive much attention from users except when a global event occurs.

Finally, we also found a relation between scope and burstiness: internationally scoped news items are those with the highest burstiness, and friends group scoped ones those with the lowest burstiness potential.

We also measured the number of shares for each label, and it was clear that both internationally scoped and entertainment related news items have an overwhelming success based on their shares. This is partly due to the effect of ”outliers”, particularly thriving spreads with hundreds of shares. We considered removing them but eventually decided to keep them, because they also provide especially interesting information on how successful posts are shared in networks and how they may affect other spreads.

All in all, according to these results, we may conclude that the success of a news item depends on its scope, since the broader it is the more chances of success it will have, and also on its topic: if it is about entertainment it will have a high chance of becoming viral. This could also partially happen with sports news items, but their burstiness potential will be limited in time.

Three possible points of improvement for the future are worth highlighting. The main one is that, due to the difficulty of crawling useful data, the dataset is small, so the appearance of outliers affects the outcome of the experiment; it would be interesting to repeat the test with a higher number of news spreads. We have also found an error in the results returned by the gazetteer, which sometimes does not return the right location because there are several places with the same name. Finally, it would be interesting to analyze the content of the images used in order to obtain additional information about the image itself, to analyze sets of spreads based on the content of the images shared, or even to automate the process of data labeling.

We consider that the information obtained may be used in the future by others to build on, as the results may be extended in different ways. It provides some basic information that can be used in further experiments, and it also opens a new field on how to apply image data analysis to Social Media networks.

References

[1] http://www.merriam-webster.com/dictionary/insight.

[2] H. Haddadi, K. Gummadi, M. Cha, F. Benevenuto. The world of connections and information flow in twitter. Systems, Man and Cybernetics, Part A: Systems and Humans, IEEE Transactions, 2012.

[3] A.T. Stephen, K. Wilcox. Are close friends the enemy? online social networks, self-esteem, and self-control. Journal of Consumer Research, Inc., 2012.

[4] P. A. Dow, J. Cheng, L. A. Adamic. Can cascades be predicted? WWW ’14 Proceedings of the 23rd international conference on World wide web, 2014.

[5] C. Brock, A. George, M. Planck, I.L. Pollard. Initial indicators of topic success in twitter: Using topology entropy to predict the success of twitter hashtags. Network Science Workshop (NSW), 2013 IEEE 2nd, 2013.

[6] Y. Blanc. Yoan blanc’s twitter network visualization. http://yoan.dosimple.ch/blog/2007/05/17, 2007.

[7] B. Arikan. Growth of a twitter graph. http://burak-arikan.com/growth-of-a-twitter-graph, 2008.

[8] Tweetping. http://tweetping.net.

[9] A world of tweets. http://aworldoftweets.frogdesign.com.

[10] https://maps.google.com.

[11] http://www.openstreetmap.org.

[12] J. Thorp. Just landed. http://blog.blprnt.com/blog/blprnt/just-landed-processing-twitter-metacarta-hidden-data, 2009.

[13] https://www.mapbox.com/labs/twitter-gnip/languages.

[14] N. Kallus. Predicting crowd behavior with big public data. Proceedings of the companion publication of the 23rd international conference on World wide web companion, 2014.

[15] F. Rojas. How twitter can help predict an election. The Washington Post, 2013.

[16] Mentionmapp. http://mentionmapp.com.

[17] D. Troy. Twittervision. http://twittervision.com, 2007.

[18] Tweepsmap. http://tweepsmap.com.

[19] L. Lui, B. Xu. Information diffusion through online social networks. Emergency Management and Management Sciences (ICEMMS), 2010 IEEE International Conference, 2010.

[20] E. Hwang, D. Kim, S. Rho. Detecting trend and bursty keywords using characteristics of twitter stream data. International Journal of Smart Home, 2013.

[21] N. Koudas, M. Mathioudakis. Twittermonitor: Trend detection over the twitter stream. 2010 ACM SIGMOD International Conference on Management of data, 2010.

[22] M. Ríos. The geography of tweets. https://blog.twitter.com/2013/the-geography-of-tweets, 2013.

[23] Mapsdata. http://mapsdata.co.uk.

[24] Trendsmap. http://trendsmap.com.

[25] D. Skau, W. Ribarsky, M.X. Zhou, W. Dou, X. Wang. Leadline: Interactive visual analysis of text data through event identification and exploration. Visual Analytics Science and Technology (VAST), 2012 IEEE Conference, 2012.

[26] M. Kitsuregawa, M. Itoh, M. Toyoda. Visualizing time-varying topics via images and texts for inter-media analysis. Information Visualisation (IV), 2013 17th International Conference, 2013.

[27] http://www.geonames.org.

[28] https://dev.twitter.com/docs/platform-objects/tweets.

[29] L. Tabourier, R. Lambiotte. Burstiness and spreading on temporal networks. The European Physical Journal B, 2013.

[30] J. Leskovec, S. Myers, C. Zhu. Information diffusion and external influence in networks. KDD ’12 Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining, 2012.

[31] S. Counts, J. Yang. Predicting the speed, scale, and range of information diffusion in twitter. Proceedings of the Fourth International AAAI Conference on Weblogs and Social Media, 2010.

[32] W. Feng, S. Ayyorgun. A deterministic definition of burstiness for network traffic characterization. International Conference on Computer Communications and Networks, 2004.

[33] D. Oakes, D.R. Cox. Analysis of Survival Data. Chapman Hall, 1984.

[34] R. Kumar, A. Tomkins, J. Leskovec, L. Backstrom. Microscopic evolution of social networks. KDD ’08 Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, 2008.

[35] G. P. Rossi, A. Sala, X. Wang, H. Zheng, B. Y. Zhao, S. Gaito, M. Zignani. On the bursty evolution of online social networks. HotSocial ’12 Proceedings of the First ACM International Workshop on Hot Topics on Interdisciplinary Social Networks Research, 2012.

[36] J. Leskovec, S. Myers. The bursty dynamics of the twitter information network. WWW ’14 Proceedings of the 23rd international conference on World wide web, 2014.

[37] http://blog.karmona.com/wp-content/uploads/2010/10/Image34.gif.
