Twitter Media Outlet Analysis
Hanyu Huang, Harel Kopelman,
Ruchi Patidar and Shiqi Wang
Web Analytics
BYGB7978002201610
Professor Yilu Zhou
12/15/15
An Analysis of How Media Outlets Do and Should Utilize Twitter to Disseminate Content
Executive Summary
News outlets utilize social media handles such as Facebook and Twitter to disseminate
their content, engage consumers and increase traffic to their websites. As of 2011, a Pew
Research Center poll found that media outlets used multiple Twitter accounts (some as many as
98, the average using just 41) to spread their content, and a full 93% of their tweets had external
links which would bring users back to the outlet’s site (Pew).
It therefore behooves news organizations to analyze the way they share their content on
this platform, because not all social media strategies disperse content equally well. Our team
offers data-based solutions to media
companies by scrutinizing the tweets of sundry media outlets on Twitter and analyzing which
ones create the most engagement in the form of retweets, favorites and comments.
The team first focused on making internal comparisons among a company’s tweets. In
order to make our results generalizable, we also compared stories from different outlets which
are similar to each other and cover the same topic, in order to see which topics in general were
most popular. Our analysis also includes an in-depth look at the single most popular trend, the
Paris attacks.
The process of creating these analyses entailed scraping tweets from the last six months
off of a brand’s Twitter profile, comparing its tweets to each other and then comparing tweets
across handles. Tweepy was utilized for the scraping, and the output file was a CSV which we
spent considerable time cleaning. We then utilized a keyword generation program called
TerMine to find the most popular keywords from the text of the tweets, and then used Excel to
perform horizontal lookups to match those generated keywords to the correct tweets.
In our analysis, the data team learned which keywords and topics resulted in the highest
engagement. We had predicted that stories involving celebrities or high-profile persons of
interest (politicians, sports players, etc.) would generate the most engagement, and found this to
be precisely the case for certain outlets (Fox News). We also discovered that breaking news was
the most popular for some outlets, and more popular in general across all outlets; hence tweets
about the Paris attacks generated significant engagement.
The gem we uncovered, perhaps incidentally, was each media outlet’s specific
Twitter strategy. Comparing what each outlet chose to tweet about using hashtags with
the generated keywords that were actually most popular gave us insight into how each handle
portrays itself. For example, CNN’s most popular tweets and highest-ranking keywords
pertained to breaking news. Fox News’ strategy was focused more on branding the parent
company and quoting other officials and personalities on Twitter.
Business Goal Analysis
The question of how to maximize Twitter engagement is best answered by
using web analytics software, visualization techniques and Excel. There is no better way to look
at what a handle can do to get more retweets and favorites per tweet than to analyze the
popularity of each tweet and see which topics, keywords and hashtags generate the most
engagement. By analyzing each keyword’s and hashtag’s popularity, it becomes possible to
recommend what kind of content outlets should pump out to increase engagement.
There are multitudes of Twitter analysis services available for handles to utilize. Some,
like HootSuite and TweetDeck, help manage the output and simple viewability of tweets. There
are more sophisticated tools such as TweepsMap which can display the geographical distribution
of a handle’s followers and recommend the best times to tweet. Twitonomy is another service
which gives users a robust dashboard displaying a handle’s average daily tweet count, how often
tweets get retweeted or favorited and which tweets are most popular.
Our line of work diverges from these popular platforms’ roles by providing more
nuanced, focused analysis of which keywords or phrases an outlet should use in its text body.
While services predicting which time is most optimal for tweeting are undoubtedly valuable, we
concluded that giving handles the leg up by analyzing which topics and words were most likely
to generate significant engagement was the best way to go forward. In the media industry,
content is king, and so rather than just help brands play around with tweet timing and
visualization, we help them tailor their content to what users want.
Ultimately, brands want to expand their market share. When it comes to media
specifically, market share means eyeballs, and eyeballs on Twitter are gained by increasing a
handle’s follower count. Especially by looking into which keywords and topics are most popular
across different handles, our analysis could help a respective handle broaden its follower base.
Or, if the brand wished to stay very niche and target the same kinds of users it already has on its
followers list, our analysis can provide a potent tool for recommending what type of content
existing followers liked most.
Dataset Description
The dataset utilized for this project was the tweets scraped off of the media handles we
wished to analyze. We ended up choosing eleven media handles (CNN, Fox News, BBC
Breaking News, The New York Times, The Washington Post, The Wall Street Journal, CBS
News, The Financial Times, Reuters News and Yahoo! News) and scraping approximately 3,000
tweets off of each one’s handle. This was the maximum number Tweepy permitted us to
scrape at a time, and we thought it was an adequately large sample size. The tweets spanned
about six months’ time. We made sure to run the scrapes all at the same time in order to ensure
generalizability.
System Design
To achieve our goal of scrutinizing sundry outlets and providing reasonable suggestions
on how to maximize Twitter engagement, we divide our system into four steps: Data Collection,
Cleaning, Analysis and Visualization.
Step 1. Data Collection
As we are focusing on the media industry, we picked eleven Twitter accounts that have more
than 100,000 followers and are influential in the social network. To get these outlets’ tweets and
their metadata, we utilized Tweepy to scrape each outlet’s latest 3,000 tweets. In total, we had
more than 30,000 tweets in the final dataset. Those tweets all contained handle name, date and
time stamp, text of the tweet and retweet and favorite counts. This metadata was vital for the
Data Analysis step.
Step 2. Data Cleaning
Outputs from Tweepy were rendered in CSV files. These files contained unidentified characters
and symbols due to encoding issues. We made significant efforts to remove these meaningless
characters to avoid inaccuracies in keyword generation and analysis. We also deleted all the
links the tweet bodies contained by using Excel functions.
Step 3. Data Analysis
After cleaning each tweet’s text body, we used TerMine, a keyword generation program developed
by the University of Manchester, to find the most popular keywords from the body of the tweet
texts. We then used Excel functions to match the most popular keywords to each tweet by using a
horizontal lookup function. We also identified all of the hashtags that a handle used by using
Excel functions. Finally, we assigned a category to each tweet in order to enable category analysis.
To complete our analysis, we also applied event analysis in the data analysis stage.
Step 4. Data Visualization
Excel and Tableau were used to generate data analysis and the ensuing visualization.
Three types of analysis were involved: category analysis, event analysis and keywords versus
hashtags analysis. In the category analysis, we used Excel to generate clustered column lines to
display the number of original tweets, retweets and followers for each account, and a bar chart to
display the categories of the top ten retweeted and favorited tweets of each account. For event
analysis, we used Tableau to generate graphs displaying the number and timing of each account’s
tweets and retweets about the Paris attacks. In the keyword and hashtag analysis, we used
Tableau to generate bubble charts and treemaps displaying the most popular keywords and
hashtags. A simple bar graph was also used to easily discover each outlet’s most popular
individual tweet.
System Implementation
The first step in implementing our system design was to select the appropriate Twitter
handles to scrape.
For CNN Breaking News, the highlighted text in this screenshot is the handle’s screen name,
while ‘CNN Breaking News’ (above it) is the handle’s name. It is important to note that the two
are different, a discrepancy generated to help users find the outlet easily.
The next step was installing Tweepy in Python and utilizing a Python script to crawl
and scrape data from each media outlet. The script we utilized was found on GitHub and allowed
us to scrape 3,000 tweets at a time. Initially, this script did not run properly, and so we
had to edit it in the Python IDLE editor.
Each teammate used their API token credentials to access the data. For each outlet,
we would scroll down to the end of this code and replace the highlighted text with the screen
name of the outlet’s Twitter account.
Running the script for each media handle generated a CSV file. This file contained the
text of the outlet’s latest 3,000 tweets, along with their creation times, their number of
retweets and the number of favorites each one received.
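The GitHub script itself is not reproduced in this report. As a minimal sketch of the collection step, assuming an already authenticated `tweepy.API` object and Twitter's v1.1 user timeline endpoint, the code below pages through a handle's timeline and writes the four fields named above to CSV. The function names `tweet_to_row`, `fetch_timeline` and `write_tweets_csv`, the column names and the stand-in tweet are all hypothetical.

```python
import csv
import io
from types import SimpleNamespace

CSV_HEADER = ["created_at", "text", "retweet_count", "favorite_count"]


def tweet_to_row(tweet):
    """Flatten one status object into the four columns described in the report."""
    return [tweet.created_at, tweet.text, tweet.retweet_count, tweet.favorite_count]


def fetch_timeline(api, screen_name, limit=3000):
    """Page through a user's timeline using an authenticated tweepy.API object.

    Assumption: `api` was created elsewhere with valid credentials.
    """
    import tweepy  # imported lazily so the helpers work without Tweepy installed

    cursor = tweepy.Cursor(api.user_timeline, screen_name=screen_name, count=200)
    return list(cursor.items(limit))


def write_tweets_csv(tweets, handle):
    """Write a header row plus one row per tweet to an open text file object."""
    writer = csv.writer(handle)
    writer.writerow(CSV_HEADER)
    writer.writerows(tweet_to_row(tweet) for tweet in tweets)


# Stand-in object with made-up values, so the sketch runs without network access.
demo_tweet = SimpleNamespace(
    created_at="2015-12-01 12:00:00",
    text="example tweet text",
    retweet_count=10,
    favorite_count=4,
)

buffer = io.StringIO()
write_tweets_csv([demo_tweet], buffer)
csv_text = buffer.getvalue()
```

With real credentials, the result of `fetch_timeline(api, "some_screen_name")` could be passed to `write_tweets_csv` along with a file opened using `newline=""`.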
Each tweet was then cleaned. Almost all the tweets contained links to external websites,
so the first thing we did was remove the short URLs, as their occurrence in the body of the tweet
would interfere with keyword generation and analysis. We then removed unidentifiable characters
and certain symbols such as the ‘@’, which interfered with data manipulation (because Excel
reads it as a function). Once the data was cleaned completely, we used TerMine to find each
outlet’s most powerful keywords from the contents of the tweets.
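The cleaning above was done with Excel functions. The same three steps (dropping links, dropping unreadable characters, dropping the ‘@’ symbol) can be sketched in Python instead. The function name `clean_tweet_text` and the sample string are hypothetical, and the sketch assumes that every non-ASCII character is encoding debris, which would also strip legitimate accented letters.

```python
import re

# Matches http/https links, including shortened ones.
URL_PATTERN = re.compile(r"https?://\S+")


def clean_tweet_text(text):
    """Return tweet text with links, non-ASCII characters and '@' removed."""
    text = URL_PATTERN.sub("", text)
    # Assumption: anything outside ASCII is encoding debris and can be dropped.
    text = text.encode("ascii", errors="ignore").decode("ascii")
    text = text.replace("@", "")
    # Collapse the extra whitespace left behind by the removals.
    return " ".join(text.split())


# Made-up sample containing a mention, mis-encoded characters and a short URL.
sample = "Breaking: @POTUS speaks on #Paris \u00e2\u20ac\u00a6 https://t.co/abc123"
cleaned = clean_tweet_text(sample)
print(cleaned)
```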
We utilized approximately 310 TerMine keywords. Each outlet had different numbers of keywords,
and we did not want to use all the thousands of keywords the program generated for every single
outlet. Some of them only appeared once or twice and were therefore not very meaningful.
We then utilized a horizontal lookup in Excel to match the keywords to each text where a keyword
appeared. Each tweet’s keywords were concatenated and separated with a space.
The final, resulting Excel file contained a keyword column along with a hashtag column, which
we had generated earlier using simple Excel functions.
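The keyword and hashtag columns were built with Excel lookups. A rough Python equivalent, offered only as a sketch, is shown below. The function names and the sample keyword list are hypothetical, and case-insensitive substring matching is an assumption about how a keyword counts as appearing in a tweet.

```python
import re


def match_keywords(text, keywords):
    """Return the keywords found in the tweet text, joined by single spaces."""
    lowered = text.lower()
    found = [keyword for keyword in keywords if keyword.lower() in lowered]
    return " ".join(found)


def extract_hashtags(text):
    """Return every #hashtag in the tweet text, joined by single spaces."""
    return " ".join(re.findall(r"#\w+", text))


# Made-up keyword list and tweet text for illustration.
keywords = ["paris attacks", "donald trump", "death toll"]
tweet = "Death toll rises after #Paris attacks, officials say"

keyword_column = match_keywords(tweet, keywords)
hashtag_column = extract_hashtags(tweet)
print(keyword_column)
print(hashtag_column)
```

Joining with a space follows the description above, though it makes multi-word keywords hard to split apart again; a different delimiter would avoid that.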
These Excel files were what we uploaded to Tableau for easy analysis. We initially
utilized treemaps to see which keywords and hashtags were most popular by text, retweet and
favorite count. Tableau was also utilized to show which tweet was most popular for each outlet.
Another significant part of our final analysis was to find the top ten keywords and
hashtags of select outlets. We used, for example, a packed bubbles diagram for The New York
Times, Wall Street Journal and BBC to do a side-by-side comparison of what an outlet chose
to tweet about consciously versus what its users found to be the most interesting topics.
Evaluation
Category Analysis
The prime goal of maintaining social media accounts is to expand their influence in the social
network. In the case of news outlets, this influence is meant mostly to strengthen brand appeal,
disseminate content and bring in traffic to the outlet’s main website. The easiest way to
achieve these results is to attract more followers so that an outlet’s news will be dispersed
to a larger group of people. Our first step, therefore, was to correlate the number of followers
with the number of original tweets and retweets and the number of tweets posted per day. The
following two graphs represent this analysis and our findings.
There appears to be no strong correlation in the trends presented in the
graphs. However, we could still glean a few insights. In the first graph, we can see that those
handles with a higher proportion of retweets to original tweets enjoyed a larger number of
followers, e.g. NBC News. When we turn to the second graph, things change. While we had
originally thought that handles which post fewer tweets per day would have a lower follower
count, BBC Breaking News and CNN Breaking News did not follow this pattern. Nevertheless,
when we associate the number of tweets posted per day with the proportion of retweets to
original tweets, we can see that these handles have a high proportion of retweets to original
tweets. This may indicate that all these factors function together.
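The correlations above were judged from the graphs. One way to put a number on such a relationship, offered as a sketch and not as the method used in this report, is Pearson's correlation coefficient between follower counts and another per-handle measure. The function name `pearson` and the toy values are hypothetical and are not taken from our dataset.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariance / sqrt(spread_x * spread_y)


# Toy values only: followers per handle and tweets per day per handle.
followers = [1.0, 2.0, 3.0, 4.0]
tweets_per_day = [2.0, 4.0, 6.0, 8.0]
r = pearson(followers, tweets_per_day)
print(round(r, 3))
```

A value near 1 or -1 would indicate a strong linear relationship and a value near 0 a weak one. With only eleven handles, any such figure should be read cautiously.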
SPSS Modeler Client was utilized to generate a Neural Network Model which would
analyze whether there is correlation among all the factors. The number of original tweets,
retweets and average tweets posted per day were set as inputs, and the number of followers was
set as the model’s target. The following graph is the modeling result. The number of original
tweets ranks highest in the predictor importance, though the accuracy is as low as 6% because of
limitations stemming from sample size.
We then analyzed the top ten retweeted and favorited categories by dividing the tweets into
six categories. The graph below displays the category distribution for the top ten most retweeted
tweets. As the Paris attacks coincidentally happened in this period, most of the top ten most
retweeted tweets are about the Paris attacks, followed by news related to the presidential
election. When looking into the retweets’ content and responses, we find that controversial
topics in society and politics which relate to daily life and international politics most easily
generate intense debates. People tend to express their opinions and debate these issues with others.
When analyzing the top ten most favorited tweets, results change. The category
distribution becomes more diversified. People like to click favorites for tweets about
heartwarming stories in society, or news related to celebrities in sports or entertainment. For
example, the royal baby and One Direction get more favorites even if the tweet is a
one-sentence bit of news.
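The category distributions described above come down to ranking an account's tweets by one metric, keeping the top ten and counting categories. A minimal sketch follows. The function name `top_category_counts`, the field names, the category labels and the toy rows are all hypothetical.

```python
from collections import Counter


def top_category_counts(tweets, metric="retweet_count", top_n=10):
    """Count the categories of the top_n tweets ranked by the given metric."""
    ranked = sorted(tweets, key=lambda tweet: tweet[metric], reverse=True)
    return Counter(tweet["category"] for tweet in ranked[:top_n])


# Toy rows only; the categories and counts are made up for illustration.
rows = [
    {"category": "breaking news", "retweet_count": 900, "favorite_count": 300},
    {"category": "politics", "retweet_count": 400, "favorite_count": 150},
    {"category": "entertainment", "retweet_count": 120, "favorite_count": 800},
    {"category": "breaking news", "retweet_count": 700, "favorite_count": 200},
]

by_retweets = top_category_counts(rows, metric="retweet_count", top_n=3)
by_favorites = top_category_counts(rows, metric="favorite_count", top_n=2)
print(by_retweets)
print(by_favorites)
```

Running the same function with both metrics makes it easy to see when the retweet and favorite rankings favor different categories.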
Our recommendation from this portion of the analysis is that media outlets should keep up
with the latest trends and know the hottest topics in
each field. While outlets will need to find the correct balance between journalism and
transforming their outlets into digital-first platforms, we believe that the correct combination of
the two can help an outlet grow in size and influence and even expand its journalistic platform.
This is a model which several media outlets, most notably BuzzFeed, have built their own
expansive journalistic enterprises upon.
Hashtag vs Keyword Analysis
We found that there was often a significant gap between what a media outlet tweeted
about and what its followers were most interested in. Early on in our analysis we realized that we
could easily discover this gap by looking at two metrics: popular keywords versus popular
hashtags.
A hashtag is the designated topic the media outlet chooses to tweet about and “tag” its
tweets with. We discovered these hashtags through Excel functions. Not all outlets used
hashtags; The Washington Post, for example, hardly used them at all, and therefore we didn’t
find any actionable discrepancies in how the outlet used hashtags versus keywords.
The keywords which we generated using TerMine, on the other hand, represent the topic
of the tweet which the outlet did not choose to tag the text with explicitly. “#Paris attacks rock
continent” would come up as a hashtag count in our analysis, whereas a tweet like “Paris attacks
rock continent” would come up as a keyword count. We normalized each keyword by the
number of times a tweet containing it was sent out, and found that there were certainly topics
users were more interested in than the outlets were interested in tweeting about.
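The normalization described above divides the total engagement for a keyword by the number of tweets that contained it. A minimal sketch is below. The function name `retweets_per_tweet` and the toy rows are hypothetical, and each row is assumed to pair one keyword with the retweet count of one tweet containing it.

```python
from collections import defaultdict


def retweets_per_tweet(rows):
    """Map each keyword to (tweet count, average retweets per tweet)."""
    totals = defaultdict(lambda: [0, 0])  # keyword -> [tweet count, retweet sum]
    for keyword, retweets in rows:
        totals[keyword][0] += 1
        totals[keyword][1] += retweets
    return {
        keyword: (count, total / count)
        for keyword, (count, total) in totals.items()
    }


# Toy rows only: (keyword, retweets of one tweet containing that keyword).
rows = [("topic a", 100), ("topic a", 300), ("topic b", 50)]
normalized = retweets_per_tweet(rows)
print(normalized)
```

Ranking keywords by the average instead of the raw total keeps a topic that was tweeted often from looking more popular than one that drew more engagement per tweet.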
This was especially pronounced for some news outlets, such as CNN and Yahoo! News,
where the top hashtags utilized and retweeted or favorited did not match up well with the most
popular keywords we found for each outlet’s tweets. CNN thought that the #IranDeal and
the #Chattanooga shooting were very important topics to tweet about, and they did garner
significant engagement; but the brand’s most important keyword, Paris Attacks, which had
roughly the same tweet count (about 30) as its hashtags, was almost twice as popular. This was
followed by keywords pertaining to famous personalities such as Donald Trump, Pope Francis,
POTUS (President of the United States) and Hillary Clinton, in that order.
Yahoo! News provides an interesting contrast, as it seemed to utilize the exact opposite
strategy. Its top keyword was “Katie Couric,” a branding effort aimed at promoting the outlet’s
major personality. The next top hashtags were also media personalities which Yahoo! News
presumably wishes to brand and sell to its followers (see treemap above).
But the Paris Attacks trend proved to be its most popular hashtag, and it generated more
retweets and favorites than did “Katie Couric,” with fewer than half the tweets using that
keyword phrasing. The Paris attacks have
already been shown to be a potent tool for engaging users, but what’s impressive here is that
Yahoo! News managed to organically generate such high results even while maintaining a
blatant branding push.
Our team also utilized other visualization techniques to analyze the differences between
how an outlet chose to market its tweets and what were actually its most popular results.
When we applied this technique to The New York Times, it quickly became obvious that the
outlet was far more interested in discussing the #GOPDebate, whereas its most popular topic
by far was #ParisAttacks. #GOPDebate tallied up 3,827 favorites and 4,610 retweets; Paris
tallied up 25,436 favorites and 31,048 retweets, with a normalized retweet rate of
approximately 360 per tweet.
The Times chose to send #GOPDebate-branded tweets 33 times the night of the debate, with each
tweet receiving approximately 140 retweets. And yet its conscious hashtag choices when it came
to tweets about the Paris attacks were limited to a mere seven tweets, and each one received a
record 400 retweets. This lack of attention to hashtags is staggering, seeing as the outlet
tweeted almost 330 individual times about Paris. In
the future, outlets should pay close attention to which tweets receive the most retweets per
individual tweet, and use hashtags accordingly.
A hyper-focus on debates and elections by media outlets is certainly understandable.
These events are, to a certain extent, the lifeblood of these companies. CNN and Fox News net
record viewership numbers during these events, and the debates and their elections are generally
discussed widely on social media. A scheduled, anticipatable event like that simply cannot be
ignored by outlets.
There is also certainly a spirit of intellectual and even moral obligation to analyze these
events. Journalists at least hope that their coverage and analysis of the political process
informs the public and helps it make the correct decision in electing its representatives. In
this sense, these media outlets aren’t merely businesses; they wish to perform a service of
educating the public.
BBC seemed to have the same issues. The outlet demonstrated the same “bias” as did The New
York Times: its most popular hashtag by tweet count pertained to elections. If anything, the
results for BBC were even more imbalanced than they were for The New York Times.
BBC tweeted about the general elections using the hashtag #GE2015 63 times; each
tweet received an average of 700 retweets. By contrast, the handle’s actually most popular
topic, #RoyalBaby, garnered 6,800 retweets per tweet. BBC elected to tweet only six
times about this apparently momentous event. Paris was also more popular than the elections:
each tweet received an average of 1,370 retweets, and BBC elected to use that hashtag only half
as many times as it did the elections.
What is most interesting about BBC’s keyword and hashtag analysis is that the results for
its election-relevant tweets are the same whether or not the handle used a hashtag. #GE2015
garnered 700 retweets a tweet, whereas the organic keyword “David Cameron” received 690
retweets each (another variation of “David Cameron” received approximately 500 retweets per
tweet). But its most popular organic keyword was the Germanwings flight 9525 crash, which
received 1,300 retweets a tweet. The next most popular keyword by normalized retweet count was
“death toll,” which garnered 830 retweets.
Clearly, BBC and The New York Times need to reconsider their Twitter strategies. There are
topics and trends which are clearly more popular than other ones, and jumping on them will help
these outlets expand their influence significantly. In general, the interest followers express
when it comes to events like Paris and Germanwings is far more intense than it is in scheduled
events like elections and political debates, important as those two ought to be. Consider that
The New York Times’ most popular keyword was actually “Breaking News”; people seem to engage
the most with this genre of news. If an outlet wishes to grow its following, we would recommend
it tag tweets with breaking news topics.
These considerations will need to be weighed against an outlet’s desire to grow its follower
count and expand its influence. These theoretically “corporate” (vis-à-vis the aforementioned
desires to keep the public informed) interests are the ones best served by acting upon our
analysis mentioned here. While an outlet might want to keep its journalistic, civic-duty
purposes at the forefront of its operations, we believe it is important to combine the two. BBC
might find it distasteful to tweet too often about the Royal Baby, but that seems to be what its
follower base wants to see.
Finally, we found that some outlets did not have a real focus for their followers, and that
affected their popularity significantly. TIME’s top organic keywords pertained to Paris and a
trend the outlet was
circulating called “influential teen.” The outlet did not use many hashtags, but the few it did
use garnered significant attention: #PorteOuverte (“open door” in French) and #RefugeesWelcome
each garnered well over 3,500 retweets, but the outlet only tweeted each hashtag once. The next
most popular hashtag was #DemDebate, which had only 174 retweets for each tweet, and yet the
outlet tweeted that hashtag 15 times.
While TIME seemed to face the same issues as did BBC and The New York Times with
a hyper-focus on the debates and scheduled political events, the main weakness of its strategy
was a lack of any cohesion. TIME’s most popular stories were the feel-good ones about Syrian
refugees and the Paris attacks. Its followers clearly care the most about these topics, and not as
much about political events. While politics certainly interested some followers, we believe based
on this analysis that TIME should make sure to use hashtags which appeal to its audience in
earnest.
Event Analysis: Paris
The Paris attacks were by far the most popular topic we found news outlets to be tweeting
about. On the right we visualized the tweets output on the day of the attacks themselves. It is
easy to see the tweets spike just before 11 PM, when the BBC (in blue) tweets that France had
closed its borders in the wake of the terrorist attacks. The New York Times quickly followed by
tweeting President Obama’s statement that the terrorism was “an attack on all humanity.”
These two most popular tweets from two different outlets indicate a lot about how and why
outlets covered these events. The BBC, a European company, had followers more interested in
tracking developments on the continent itself. Therefore, news that France had sealed off its
borders attracted significant attention. The New York Times, on the other hand, an American
outlet, caters to an audience perhaps more removed from the immediate results of the attack,
and its readership was more concerned with the humanitarian aspects of the attack.
The next day (shown on the left) showed a heavy outpouring of tweets as well; these, however,
had less to do with developments in Paris and more to do with shows of solidarity across the
world. The BBC’s most popular tweet concerned the One World Trade Center lighting up in the
French tricolor. Its next most popular tweet was about the Sydney Opera House doing the same.
The last, most interesting piece of analysis on the Paris attacks pertains to what happened
not on the day of the attacks, but rather the political reverberations they held for Americans.
To the right is a graph detailing tweet count (thickness of the line) and retweet volume
(y-axis) over the days before and after the attacks (x-axis). Understandably, the highest
retweet counts and volume were on the fourteenth, the day of the attack. But what’s interesting
to see is a sudden spike in The New York Times’ retweet count on the seventeenth.
That spike was attributable to just one tweet, which garnered over 12,000 retweets, detailing
that none of the attackers in Paris was a Syrian refugee. This seemingly innocuous bit of
information was in fact hugely relevant to a raging political debate underway in the United
States about whether or not to keep admitting 10,000 Syrian refugees annually into the United
States. The conservative-liberal divide ran along the lines of security and compassion; the
Times’ tweet that none of the attackers was a Syrian refugee gave strength to supporters of the
argument that admitting more refugees into the States would pose a minimal security risk and
continue America’s ongoing commitment to helping those in need.
While breaking news was certainly the most popular tweet type, a savvy social media
handle knows how to manipulate such news later on. Instead of merely discussing the events and
what led to them, the Times had an acute grasp on what its readers wanted to know in the wake
of these events. It isn’t enough to merely report on real-time events: those events’ political
reverberations can be just as important to followers.
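The per-day tweet counts and retweet volumes charted in this event analysis were produced in Tableau. The same tabulation can be sketched in Python as follows. The function names, the outlet labels and the toy rows are hypothetical, and the timestamp format is an assumption about how the scraped CSV stores creation times.

```python
from collections import defaultdict
from datetime import date, datetime

TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed CSV timestamp layout


def daily_retweets(rows):
    """Count tweets and sum retweets per (outlet, calendar day)."""
    summary = defaultdict(lambda: {"tweets": 0, "retweets": 0})
    for outlet, created_at, retweets in rows:
        day = datetime.strptime(created_at, TIMESTAMP_FORMAT).date()
        bucket = summary[(outlet, day)]
        bucket["tweets"] += 1
        bucket["retweets"] += retweets
    return dict(summary)


def peak_day(summary, outlet):
    """Return the day on which the given outlet's retweet total was highest."""
    days = {day: stats["retweets"] for (name, day), stats in summary.items() if name == outlet}
    return max(days, key=days.get)


# Toy rows only: (outlet, creation time, retweet count).
rows = [
    ("outlet_a", "2015-11-14 01:00:00", 10),
    ("outlet_a", "2015-11-14 23:30:00", 20),
    ("outlet_a", "2015-11-17 09:00:00", 500),
    ("outlet_b", "2015-11-14 02:00:00", 5),
]

summary = daily_retweets(rows)
print(peak_day(summary, "outlet_a"))
```

A day with few tweets but a very high retweet total, like the toy spike here, is the kind of pattern that points to a single standout tweet.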
Conclusion
There is much work for news outlets to do if they wish to capitalize upon their existing
handles and increase engagement. Media outlets should first and foremost look at what their
followers care about the most in the form of categories, and tailor content accordingly. Outlets
should also look at what keywords generated the most engagement in their efforts to remake
their content. This shift in strategy will require a delicate rebalancing act as outlets strive to
maintain a journalistic integrity without falling prey to more clickbait-esque content in their
handles. Outlets can also ensure their feeds attract new users and increased engagement by
building upon past events and ensuring they analyze not simply what happened, but also those
events’ political reverberations and implications for future policy.
For future research, our team would like to focus on one outlet’s multiple handles to see
if using unique handles for different content is a truly effective strategy. CNN has a breaking
news handle which tweets only Associated Press wire updates, for example, and Fox has different
handles for political, sports and celebrity news. A more comprehensive review of Twitter
strategy would encompass these handles’ influence as well.
It would also be interesting to look beyond media and discover what other industries do
to engage with their followers. Twitter is, of course, an informational micro-blogging platform,
but we believe that there is significant research to be done as to how to optimize strategies in
other industries which aren’t solely information-based. What keywords and topics interest
followers of the oil industry? Or fast-food restaurants? Much corporate Twitter content focuses
on engaging with customers and putting on a friendly face for them. Analyzing what these
companies do to personalize themselves with regard to customers could yield meaningful
strategies for new companies just entering the Twitter-sphere.
Works Cited
“A Script to Download All of a User’s Tweets into a CSV.” Yanofsky. GitHub. Web. 7
Oct. 2015.
“How Mainstream Media Outlets Use Twitter.” Pew Research Center’s Journalism
Project. Pew Research Center, 13 Nov. 2011. Web. 16 Dec. 2015.