data-mining twitter for political science -hickman, alfredo - honors thesis


DATA-MINING TWITTER FOR POLITICAL SCIENCE: A PROJECT BASED

METHODOLOGICAL APPROACH

by

ALFREDO HICKMAN JR

THESIS

Presented to the Faculty of the

Honors College

The University of Texas at San Antonio

In Partial Fulfillment

Of the Requirements

For the Degree of

BACHELOR OF ARTS IN POLITICAL SCIENCE

WITH HIGHEST HONORS IN THE HONORS COLLEGE

THE UNIVERSITY OF TEXAS AT SAN ANTONIO

College of Liberal and Fine Arts

Department of Political Science and Geography

May 2015


DATA-MINING TWITTER FOR POLITICAL SCIENCE: A PROJECT BASED

METHODOLOGICAL APPROACH

PREPARED BY:

________________________________________

Alfredo Hickman Jr

APPROVED BY:

________________________________________

Bryan Gervais, Ph.D., Thesis Advisor

________________________________________

Ritu Mathur, Ph.D., Thesis Reader

________________________________________

Walter Wilson, Ph.D., Thesis Reader

Accepted: _________________________________________

Richard Diem, Ph.D., Dean of the Honors College

Received by the Honors College:

______________________


ACKNOWLEDGEMENTS

First and foremost, I would like to acknowledge and thank God. I would like to

acknowledge my parents. Had it not been for the sacrifice and efforts of my parents, I would not

exist or be the man that I am today. I would like to thank and acknowledge my wife, Crystal. My

wife’s support throughout this project has been a blessing. I would like to thank and

acknowledge the faculty and staff at the University of Texas at San Antonio, its Honors College,

and its College of Liberal and Fine Arts. Dr. Bryan T. Gervais, Ph.D., has been a great source of

knowledge, experience, and wisdom, and is a trusted mentor and advisor. Dr. Ann Eisenberg,

Ph.D., has also been a great source of encouragement and support, in not only the development

of this Thesis and supporting research, but also my all around educational development while at

the University of Texas at San Antonio. In addition, I would like to thank my thesis readers, Dr.

Ritu Mathur, Ph.D., and Dr. Walter Wilson, Ph.D.

Ultimately, I would like to thank and acknowledge the academics, researchers, and

software developers that have contributed to the base of knowledge, information, and software

that exist in the realms of Political Science, Data Science, and Information Systems. In

particular, I would like to thank JetBrains for the development software, Ubuntu and the Linux

community for the platform and support, MongoDB for the database, Robomongo for the

database administration software, GitHub for hosting the open-source code repositories, and

Guillermo Del Fresno, on GitHub, for developing twitterstream-to-mongodb. The work that I

present in this Thesis is an amalgam of the fields and technologies mentioned, and it builds

on the effort, intellect, and sacrifice of those that have come before me; they are truly the giants

on whose shoulders I stand.

May 2015


ABSTRACT

DATA-MINING TWITTER FOR POLITICAL SCIENCE: A PROJECT BASED

METHODOLOGICAL APPROACH

Alfredo Hickman Jr, B.A. The University of Texas at San Antonio, 2015

Supervising Professor: Bryan Gervais, Ph.D.

This thesis will examine the creation and use of a data-mining system to extract, process,

and analyze Twitter “tweets” for Political Science. By providing a free and open platform for

rapidly sharing and exchanging ideas, Twitter has become the most popular microblogging site

and system in the world. Twitter allows its users to disclose their actual names, or post tweets

anonymously; this has fostered an environment that allows people to discuss and comment on

politics with a scope, liberty, and candor that has never before existed. Twitter can be an

invaluable tool for political scientists who wish to better understand the motives, thoughts,

sentiments, and social networks of people as they pertain to politics and social phenomena.

During the course of my research, I have built and maintained an information system that

collects and processes selected Twitter data live. In conjunction with ps_proj, an authenticated

application I created on Twitter’s Developers Site, I use Twitter’s Streaming Application

Programming Interface (API) to collect streaming data on a randomly selected list of 279

Members of Congress (MCs). Once the tweet data set is captured, I will analyze the messages,

and the accompanying metadata and data. I expect the data, once analyzed, will produce insights

into the American political being, and allow political scientists to create information products

critical to understanding social and political behavior.


TABLE OF CONTENTS

ACKNOWLEDGEMENTS

ABSTRACT

ACRONYMS AND DEFINITIONS

CHAPTER 1: INTENT AND ETHICAL CONSIDERATIONS

CHAPTER 2: INTRODUCTION

CHAPTER 3: THESIS STATEMENT

CHAPTER 4: METHODS AND APPROACH

CHAPTER 5: DATA PROCESSING AND ANALYSIS

CHAPTER 6: POTENTIAL APPLICATIONS

CHAPTER 7: CONCLUSION

REFERENCES

APPENDICES


ACRONYMS AND DEFINITIONS

API: Application Programming Interface – A programmatic specification and mechanism for

interfacing with software components.

Back-end: The mechanism that allows data to be collected and stored in a distributed

computational system.

Client: Software or hardware system that requires services from another platform.

Cloud computing: Computational services hosted on remote, networked, and distributed

information systems that are consumed like a commodity.

Front-End: The mechanism that allows a distributed computational system to input, process,

and transmit data.

Host: Software or hardware system that provides a platform for other systems.

IP Address: The identifying value assigned to a device participating in an Internet Protocol

network.

iSCSI: Internet Small Computer System Interface – A protocol used to facilitate the use and connection

of storage resources on computer networks.

JSON: JavaScript Object Notation – A language-independent, human-readable text standard used for

transmitting data between computer systems.

Linux: A free and open source operating system base.

LUN: Logical Unit Number – The identification mechanism used to identify a networked storage

resource in an iSCSI storage model.

MC(s): Member(s) of Congress

MongoDB: A NoSQL document oriented database that uses JSON to provide flexible schemas.

NoSQL: The concept of non-structured storage and retrieval in non-relational databases.

Operating System: The suite of software that provides functionality to client computer software

and host hardware.

Python: A popular, multi-purpose, high-level computer programming language.

Server: A networked computer whose function is to provide services to client computers.

Ubuntu: A Linux based operating system.

Vagrant: A tool for building configurable, portable, and reproducible computational work environments.

VirtualBox: A software platform for virtualizing computer operating systems.


CHAPTER 1: INTENT AND ETHICAL CONSIDERATIONS

The intent of this thesis is not to delve into a theoretically normative discourse on the

pros, cons, or applications of data-mining and analytics in general. However, I will briefly

explore some of the politically theoretical and normative literature that influenced me in the

development of my data-mining and analytics system, and the accompanying research. Rather,

my goal is to display and share the empirical and methodological development and application of

a data-mining and analytics information system for the benefit of Political Science. I would be

remiss and negligent if I did not acknowledge and share some of my concerns about the potentially

harmful applications and consequences of data-mining and analytics systems such as the one that

I have created, and of the much more sophisticated systems that are in production or under

development now, or will be in the future.

Before delving into the internals and potential applications for a data-mining system such

as the one I present in this thesis, I believe it is crucial to explore some of the ethical

considerations involved with mining data from the public at-large. In the relatively short amount

of time since the Internet was created and made available for public use (by the American

defense and academic communities), people from all over the world have come to depend on the

technology for an ever increasing amount of daily activity. The Internet has revolutionized the

ways in which we live, communicate, create and consume information and generate data,

metadata, and knowledge. With the rapid development of the Internet and peripheral

technologies, humanity has not only been able to share existing knowledge and information, but

has also created and distributed more new information, data, and metadata than at any prior time

in the human experience.


With the astronomic amounts of public and private information, data, and metadata that

have been created and shared on the Internet, have come new possibilities, opportunities, and

derivative technologies. For example, the nascent industries of electronic business intelligence,

data-mining, and data-analytics have emerged in the belief that vast amounts of value can be

generated from the information and data that the public creates and shares on the Internet.

Technologists, by collecting vast amounts of public and private data and metadata, can track,

analyze, and predict human behavior and generate potentially valuable information, products,

and services. With this information, public and private interests can create and construct products

and services that leverage, and potentially manipulate, human behavior in manners never before

possible. With the ability to track, monitor, and potentially manipulate human individuals and

populations at-large, have come many concerns about how electronic information and data are

used and abused.

Massive data breaches, mostly driven by organized criminal and state actors, of the

world’s largest and most powerful private and public institutions and businesses have rattled

many individuals, firms, and governments into questioning how and why electronic data is being

collected, processed, stored, and secured (Rosenzweig, 2013). Revelations of governments

demanding data and metadata from Internet service and data providers, whether legally, illegally, or

otherwise unethically, have alarmed many people in the human rights and civil liberties

communities. Instances such as Yahoo’s complicity in China’s persecution of political dissidents

have alarmed many state and non-state actors into demanding reforms and regulations for how,

and for what purposes data and metadata on people are collected, used, and consumed (Ruggie,

2013).


Issues of ownership over the data and metadata that the public creates and consumes have

also been raised. At the time of this writing, the status quo and de facto standard operate under

the assumption that public data and metadata are mostly commodities to be processed,

sold, bought, and consumed, so long as the provider’s “general terms and conditions”

apply and have been accepted.

In addition, at the time of this writing, the revelations by the former National Security

Agency (NSA) contractor Edward Snowden are still fresh in many minds. Edward Snowden

alleged that the United States and other governments are collecting massive amounts of public

and private data and metadata, sometimes illegally, in the name of national security and other

interests (Greenwald, 2014). Since the Snowden revelations, many of the allegations made

against the United States were publicly and officially substantiated, and some reforms were

initiated.

With all the potential applications of data-mining and analytics, one must question and

query the potential public and private benefits and harms that can arise in the age of instant

communications and “big-data.” With the enormous amounts of data and metadata that are being

created and consumed daily, we, as a society, can choose to use the information, products, and

services they yield for the benefit or harm of our fellow man, and our shared environment and

communities.


CHAPTER 2: INTRODUCTION

Since Twitter’s creation and website launch in 2006, it has become the largest and fastest

growing micro-blogging site and system on the planet (Farhi, 2009). Twitter’s ability to cater to

people’s innate curiosity and need for information and interaction has resulted in close to 1

billion registered users and 271 million monthly active users as of October 2014. Because of the

roughly 140-character limit per tweet, the format forces people to

construct their messages succinctly. Contributing to the success of the Twitter

platform are the abilities to post messages anonymously or not, follow other users, retweet other

users’ tweets, and be followed, along with a myriad of other features

that allow people to communicate, associate, and express themselves in ways never before

possible.

Because of Twitter’s popularity, use, and innate features, the site has fostered a

community of opinion and dialog unlike any system that has existed before it. The results of

Twitter’s system and operations are more extensive social networks, contexts, and information,

some of which are new to humanity at large and the social sciences in particular. Because of

Twitter’s success and proliferation, many social and political scientists have researched the

communications posted on the site in an effort to understand the intent, motivation, sentiment,

behavior, and other sociological factors of the Twitter users who create them. In the context of

the social sciences, the vast majority of scholarly work done in the realm of Internet-based social

networking has come in the form of direct collection and analysis of social networking messages

and data. While the conventional methods used in the social sciences for collecting and

analyzing the data are valid, I believe the methodology leaves a crucial factor out of the equation:

the metadata.


However, before delving into the world of Twitter architecture, metadata, data, and

information, and their potential value to the social sciences, I will define some key terms. I will

then briefly explore some of the relevant work that has come before, and how that helped frame

this research and its intent.

In the course of this thesis, I will use the following terms in this manner:

1. Metadata (um): The underlying information about the data being referenced that can

serve to provide enriched functionality, context, network, and potential meaning to

the information and data generated. In essence, the metadata is the glue and pointers

that bind and direct the individual message into the larger social network and

information ecosystem.

2. Data (um): The qualitative or quantitative dynamic values or value that make up

information, and which are structured or unstructured (raw) in a manner conducive to

mechanical and/or biological processing means and methods.

3. Information: The qualitative or quantitative product of a causal relationship between

data components in a system and its environment. Information can be transmitted and

consumed via message, observation, perception, or other biological or mechanical

processes. Information is what we want from a data-mining and analytics system, and it is often,

but not always, of value.
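
To make the distinction between data and metadata concrete, the following minimal Python sketch separates the message text of a tweet from everything that surrounds it. The record is a hypothetical, heavily trimmed example: the field names (created_at, id_str, text, user, entities) follow the layout of tweet objects in version 1.1 of Twitter's API, but all values are invented for illustration, and the helper split_tweet is an assumed name that is not part of the system described in this thesis.

```python
import json

# Hypothetical, heavily trimmed tweet. Field names follow the v1.1 tweet
# layout; every value here is invented for illustration.
raw = """
{
  "created_at": "Mon May 04 15:00:00 +0000 2015",
  "id_str": "1",
  "text": "Town hall tonight at 7pm. #TX20",
  "lang": "en",
  "retweet_count": 3,
  "user": {
    "screen_name": "example_mc",
    "followers_count": 1200,
    "location": "San Antonio, TX"
  },
  "entities": {
    "hashtags": [{"text": "TX20"}],
    "user_mentions": []
  }
}
"""

tweet = json.loads(raw)


def split_tweet(tweet):
    """Return (data, metadata): the message text versus everything else."""
    data = {"text": tweet.get("text", "")}
    metadata = {key: value for key, value in tweet.items() if key != "text"}
    return data, metadata


data, metadata = split_tweet(tweet)
print(data["text"])       # the message itself
print(sorted(metadata))   # the keys that describe and contextualize it
```

Even in this trimmed record, one key holds the message while six hold context about it; a complete tweet carries many more such keys.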

When shaping the idea for this project, I wanted to not only describe how to build a

functional social media collection and processing system, but also to explore how new

technologies like the Internet and social media can provide insights into the way people create

and consume political data and information. The spark that ignited my interest in the potential

value of social media to political science was my interpretation of Diana Mutz's


Hearing the Other Side: Deliberative versus Participatory Democracy. Mutz (2006) argues that

exposure to multiple political views decreases participation in political activities and highlights

the potential conflict between deliberative and participatory democracy. Furthermore, Mutz

argues that the context and network in which political discussion takes place does matter, and

that they can serve to either facilitate or hamper political learning and action. Due to the social

norms that govern interpersonal communication and association, people often self-censor their

public political opinions and views in order to avoid conflict, ridicule, rejection, or a wide variety

of other social consequences.

Because tight social fabrics can stifle public political expression of dissenting opinions

and views, the observation of political expression in a medium as open as the Internet can be of

value to the political scientist and psychologist. Since much of the interpersonal communication

that can occur on the Internet and social media is free from the social norms and consequences of

live political expression and association, the observation and analysis of such behavior can

render valuable insights into the uncensored political mind (Gervais, 2014).

Research on political discourse and deliberation can be greatly enriched by using data

and metadata driven analysis of political discussion in the context of social networks on the

Internet. By capturing, collecting, processing, and analyzing tweets and their corresponding

metadata, researchers can understand how people create, consume, and share data and

information on the Internet. From these observations, researchers can better understand what

political topics are important to people, and where these topics are important in both the physical

and virtual worlds. By collecting and analyzing social media communications and their

corresponding metadata, researchers can identify political association and behavior as it occurs

in the context of social networks on the Internet and in real life.


Another field of study within the social sciences that can be advanced with the use of

social media data and metadata collection and analysis is the study of the communication

between government and its constituencies. In research done during the 111th Congress,

Matthew Eric Glassman and others looked into the way that government officials used Twitter to

communicate and inform people on a variety of topics of political importance. What Glassman

and his partners discovered was that MCs in the minority party tended to use social media at

higher rates than those of the majority party, and that the information was constructed to fulfill

requirements of information within functional contexts. The contexts ranged from district and

state constituencies, official political action groups, personal communications, replies to other

comments or questions, and position taking. The implications for out-groups having a larger

voice when not in power or when disenfranchised from society are something of critical value

for the political outlier. This value is even more evident when the communications occur in

contexts where speech and political dissent are commonly self-repressed, such as in certain

physical social settings and on traditional media.

What Glassman discovered was that social media allowed people to communicate and be

informed by their representatives in a more direct and unfiltered manner than was possible using

traditional media channels, such as television, radio news, and press conferences. With regard to

this type of study, data-mining and analytics could support the normative and theoretical bodies

of political science by providing new information on issues such as constituent-representative

relations, political communication and association on the Internet, and the potential for social

media and the Internet to encourage plebiscitary politics. As such, analysis into the social

networks that are created in the physical and virtual worlds when people create, consume, and

share electronic communications on the Internet could provide potential insights into the


“political being.” With this in mind, my research will explore the evolving data and metadata

trails that are created when such actions occur within Twitter. However, the possibilities extend

much further than any one website or platform.

I hypothesize that if the Internet and social media provide new networks for

communication and association, along with the potential for social consequences and action, then

the data and metadata that are created and consumed when those actions occur, when analyzed,

can be of value to Political Science. However, it may prove difficult to please strict political

theorists in regards to defining what constitutes political communication, deliberation, and

participation in the context of the Internet and social media, and what that looks like. For example,

Jane Mansbridge (1999) argues that everyday political talk can be useful in promoting political

deliberation and participation if it meets certain stringent criteria. However, if all political

dialogue were held to this standard, very few discussions would ever be considered true political

deliberation. Nevertheless, if we loosen Mansbridge’s standard and apply the social norms of the

Internet then we can see that the analysis of political communications and social networks can be

of value.

While many of the political communications that take place via social media on the

Internet may not meet all of Mansbridge’s standards, the collection of the communications and

the information that can be derived from the underlying metadata can be of significant value to

the political scientist. As I have mentioned before, there is more to a tweet than just the text of

the message. The majority of what constitutes a tweet is actually a vast construct of metadata and

data structures that serve to provide enriched functionality and value to a tweet and its creators,

distributors, and consumers. I will elaborate on this in later chapters. However, what this implies

is that by collecting and analyzing tweets in their entirety, a political scientist can not only study


the content of the message field, but he or she can also construct sophisticated data models that

describe the locations, sentiments, interests, behaviors, and associations of the people that create,

consume, or share those tweets. So, when taking into account that social media is a primary

medium by which people communicate and act on matters of politics on the Internet, data-

mining and analytics can be invaluable to the political scientist.
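
As a rough illustration of such a data model, the sketch below reduces one tweet to a small record of author, self-reported location, topics, and associations. It is an example built on stated assumptions, not the model used in this project: the TweetSummary class and the summarize function are hypothetical names, the sample values are invented, and the field names (user.screen_name, user.location, entities.hashtags, entities.user_mentions, retweeted_status) are assumed to follow the version 1.1 tweet layout.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TweetSummary:
    """Hypothetical per-tweet record kept for later analysis."""
    author: str
    profile_location: Optional[str] = None
    hashtags: List[str] = field(default_factory=list)
    mentions: List[str] = field(default_factory=list)
    retweet_of: Optional[str] = None


def summarize(tweet: dict) -> TweetSummary:
    """Pull location, topic, and association fields out of one tweet dict."""
    user = tweet.get("user", {})
    entities = tweet.get("entities", {})
    original = tweet.get("retweeted_status")  # present only on retweets
    return TweetSummary(
        author=user.get("screen_name", ""),
        profile_location=user.get("location"),
        hashtags=[h["text"].lower() for h in entities.get("hashtags", [])],
        mentions=[m["screen_name"] for m in entities.get("user_mentions", [])],
        retweet_of=original["user"]["screen_name"] if original else None,
    )


# Invented sample: one account retweeting another.
sample = {
    "text": "RT @example_senator: Proud to support the #Veterans bill.",
    "user": {"screen_name": "example_mc", "location": "San Antonio, TX"},
    "entities": {
        "hashtags": [{"text": "Veterans"}],
        "user_mentions": [{"screen_name": "example_senator"}],
    },
    "retweeted_status": {"user": {"screen_name": "example_senator"}},
}

summary = summarize(sample)
print(summary)
```

A collection of such records could then be grouped by author, location, or hashtag, or turned into a list of who-retweets-whom pairs for network analysis.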

To provide contrast to the research I present in this thesis, I found a research study

conducted at The University of Maryland College of Information Studies, in which Jennifer

Golbeck, Justin M. Grimes, and Anthony Rogers (2009) collected and analyzed over 6,000

tweets posted by various MCs. The conclusion of that study was that the tweets MCs created and

shared “tend not to provide new insights into government or the legislative process, or to

improve transparency, rather they are vehicles for self-promotion.”

In response to that, this thesis will not attempt to prove or disprove that information

collected from social media can serve to be the end-all-be-all of insight into the political mind.

Rather, the intent of my research is to display the development and potential applications of a

data-mining and analytics information system that can yield data and information valuable to

political science. A data-mining and analytics system, like the one I present in this thesis, can be

used to collect and process social media data and metadata, and then to create a framework for

future political studies. In essence, I will support the idea around which the entire “big-data” and

data-mining and analytics industries have emerged: that there is potentially

significant value in the information produced when the underlying structures that are created

as people create, share, and consume information on the Internet and social media are

analyzed and operationalized.


This thesis also attempts to highlight that leaving the overwhelming majority of

what constitutes a tweet (or most other electronic messages) out of an analysis leaves a

huge factor out of the research: the metadata (see Appendix 1: what a tweet really looks

like). Golbeck, Grimes and Rogers go on to state, “We have chosen not to study the underlying

social network (followers, following, and friends), but this is a rich space for future work.” I will

attempt to fill some of that space with my research and system. I will also support the idea that

the underlying social constructs enumerated in the metadata, of which the actual message is

only a minimal component, can be of significant value to political science.

By collecting, processing, and analyzing tweets in their entirety, metadata and all,

political scientists can develop a more robust understanding of people’s locations, sentiments,

interests, behaviors, and associations as they relate to matters of political interest and activity on

the Internet and in the “real world”. Perhaps, by exploring this new medium for electronic

communication and association, innovative methodologies can be developed to leverage the

Internet and social media, and help bridge the gap between political normative theory and

empirical quantitative analysis…even if only a bit. Enjoy!


CHAPTER 3: THESIS STATEMENT

Contemporary political science research of social media communications involves

collecting data, analyzing the data for components, creating variables, coding the variables,

operationalizing the variables, and attempting to produce an intellectual product of significance

and meaning. What I believe is left out of much of the Internet and social media based research

done in Political Science is the leveraging of information systems to facilitate a more robust

collection and analysis of electronic communications and social networks. As a result, in the

past, much collection and analysis of social media communications has left out one of the

most crucial and potentially valuable components of political communications and social

networks on the Internet: the metadata.

By using data-mining and analytics, political scientists can programmatically collect,

process, and analyze social media and other Internet communications automatically and

perpetually. By using these systems, political scientists can collect and operationalize massive

Internet-derived data sets, and craft a practically unlimited number and variety of queries and analytics to

create potentially valuable information products. These information products can then be

used to describe the political sentiments, interests, behaviors, and associations of practically

anyone using the Internet. With that, these information products can then be used to create new

bodies of political knowledge and information. By utilizing data-mining and analytics, political

scientists can produce information products that detail valuable information such as topics of

political interest, and overcome some of the challenges that occur when tackling complex

collection based projects on the Internet with reduced resources.
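
One simple example of such a query is a count of hashtags across a set of collected tweets, used as a crude indicator of topics of political interest. The sketch below runs against an in-memory list rather than a database. The three tweets are invented, the function name hashtag_counts is hypothetical, and the entities.hashtags layout is assumed to follow the version 1.1 tweet format; in a production system the same question would more likely be put to the database directly.

```python
from collections import Counter


def hashtag_counts(tweets):
    """Count case-insensitive hashtag occurrences across tweet dicts."""
    counts = Counter()
    for tweet in tweets:
        for tag in tweet.get("entities", {}).get("hashtags", []):
            counts[tag["text"].lower()] += 1
    return counts


# Invented sample of collected tweets, trimmed to the fields used here.
collected = [
    {"user": {"screen_name": "example_mc_1"},
     "entities": {"hashtags": [{"text": "Budget"}]}},
    {"user": {"screen_name": "example_mc_2"},
     "entities": {"hashtags": [{"text": "budget"}, {"text": "Veterans"}]}},
    {"user": {"screen_name": "example_mc_3"},
     "entities": {"hashtags": []}},
]

counts = hashtag_counts(collected)
print(counts.most_common(1))
```

Because the collection runs continuously, a query like this can be rerun at any time against the growing data set without any additional manual coding of messages.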


As such, I will attempt to convey the value and possibilities of employing data-mining

and analytics information systems for the benefit of Political Science by exploring the following

topics:

1. How to build a data-mining and analytics information system.

2. How to capture, transfer, and store tweets in their entirety (metadata and all).

3. What exactly is a tweet, and why is it potentially valuable (we will explore a

dissected tweet and identify and explain its composition).

4. Potential applications for data-mining and analytics systems in political science.


CHAPTER 4: METHODS AND APPROACH

Information System and Data Collection

In this section, I will detail the creation and composition of the data-mining and analytics

information system used for this project. The software and hardware used during the course of

the project are flexible and can be adapted or scaled as necessary. In addition, with the development

and proliferation of relatively inexpensive and accessible cloud computing services, the

information system I detail here can be adapted and ported over to a cloud provider and scaled as

needed.

The platform I created for this project is comprised of the following components:

1. A physical server computer to host the operating system and client software: this can be a

virtual server if running from the cloud or another networked computer. I chose to use a

dedicated PC that I loaded with a server operating system. For production

purposes, I recommend a dedicated physical, virtual, or cloud based server or a cluster of

servers if you really want to scale.

2. An operating system: I chose to use Ubuntu Linux Server as my operating system. I

chose to use Ubuntu Server because it is a free and open source, enterprise capable server

operating system. In addition, Ubuntu is well maintained, documented, and enjoys a

broad user and technical support base on the Internet.

3. Physical computer storage: I chose to create a storage area network (SAN) for my server

to utilize. For this, I used a 4-terabyte network attached storage appliance, created a

virtual disk pool, partitioned an iSCSI logical unit number (LUN) from the pool, and

assigned it as virtual storage for my Ubuntu Server via a routed virtual local area network

(VLAN).

4. A terminal to connect to your server: The terminal can be a physical monitor console, a

web browser, or a software terminal emulator. I chose to use a Secure Shell (SSH)

terminal emulator to securely connect to my server from anywhere.

5. A database: I chose to use MongoDB. MongoDB is an excellent fit for a data-mining and

analytics information system because it stores documents in BSON, a binary form of JavaScript

Object Notation (JSON), the format native to much social media communication.

6. Database administration software: I chose to use Robomongo because it is a free,

secure, and feature rich database administration suite.

7. Programming Language and interpreter (if required): I chose to use the Python

Programming Language, because of its broad documentation, ease of use, clear syntax,

broad support base, rich software library pool, and open source nature.

8. The data-mining software engine: I chose to use twitterstream-to-mongodb by Guillermo

Del Fresno on GitHub (2014), because it is free, open source, licensed for general use

(GNU GPL), and it is written in my favorite programming language, Python.

Once you have acquired the necessary components, you will need to assemble, install, and

deploy your information system. 1

1 Reference the installation and deployment instructions particular to your software, hardware, and

operating system components.

Once the data-mining system is set up, the next step is to create an authenticated

application on the Twitter Developer’s Website at https://dev.twitter.com/ (this can be done

either before or after the previous step). The Twitter application you create in this step will allow

your data-mining and analytics system to make authenticated requests to Twitter’s APIs, and is

required for the data-mining portion of this system. From the Twitter developer’s website, you

can create an account, log in, and create a Twitter App (read-only access will suffice, unless you

want your system to publish information on behalf of your application). Once you have created

the Twitter app, you will need to record and safeguard the following values: “consumer key,

consumer secret, access token, and the access token secret.”2

2 Reference the screenshot in Appendix 2: Twitter App, for what the authentication and authorization variables look

like.

Now that the system is established, and the authenticated Twitter app is created, the next

step is to populate your system with the files and values it needs in order to data-mine Twitter. In

this step, you will need to login to your server and navigate to the directory in which your

twitterstream-to-mongodb script is located. Once you are in the correct directory, you will need

to create the following files and populate them with the following values.

1. Create a file named “oauth.json”: In this file, enter the following terms and values as

such:

{

"consumer_key" : "enter your consumer key here",

"consumer_secret" : "enter your consumer secret here",

"access_token" : "enter your access token here",

"access_token_secret" : "enter your access token secret here"

}

OAuth is the open standard that Twitter uses to allow for programmatic authentication

and access to their APIs.

2. Create a file named “objects.txt”: In this file, enter the objects you wish to track. Each

individual object must be separated by one space, and the file cannot exceed 400 objects (the

400-object maximum is a Twitter API limitation). The objects can include the following

types of values: #example, @example, and example.3

3 Version 1 of the system I present in this thesis utilizes the Twitter Streaming API’s “track” feature. As such, the

system will only collect tweets that contain the values listed in the objects file. This particular API limitation will

exclude tweets that are created by a value listed in the objects file. However, the system will collect every tweet that

references a value listed in the objects file. In version 2 of this system, I will incorporate the Twitter Streaming

API’s “follow” feature, which will permit the collection of tweets that are directly created or shared by a value listed

in the objects file.
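The two configuration files described above can be checked before a collection run is started. The following is a minimal sketch in Python and is not part of twitterstream-to-mongodb. The function names load_oauth and load_track_terms are hypothetical, and the sketch assumes that the terms in objects.txt are separated by whitespace (spaces or line breaks).

```python
import json

# The four credential names used in the oauth.json example above.
REQUIRED_KEYS = ("consumer_key", "consumer_secret", "access_token", "access_token_secret")
# The 400-object maximum stated in the text.
MAX_TRACK_TERMS = 400


def load_oauth(path):
    """Read the credentials file and confirm all four values are present and non-empty."""
    with open(path, encoding="utf-8") as handle:
        credentials = json.load(handle)
    missing = [key for key in REQUIRED_KEYS if not credentials.get(key)]
    if missing:
        raise ValueError("oauth file is missing: " + ", ".join(missing))
    return credentials


def load_track_terms(path):
    """Read the tracking terms (assumed whitespace-separated) and enforce the limit."""
    with open(path, encoding="utf-8") as handle:
        terms = handle.read().split()
    if len(terms) > MAX_TRACK_TERMS:
        raise ValueError("objects file lists %d terms; the limit is %d" % (len(terms), MAX_TRACK_TERMS))
    return terms
```

Calling load_oauth("oauth.json") and load_track_terms("objects.txt") from the script directory would raise an error for a missing credential or an oversized term list before any connection to Twitter is attempted.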

Once these files are created and populated, the next step is to initiate the script

and data collection. Initiate the script in the following manner, and from the directory that

contains the oauth, objects, and twitterstreamtomongodb.py files:

1. From the terminal, enter the following command and parameters (if on a Windows

system, disregard the “sudo”):4

sudo python twitterstreamtomongodb.py --oauth=oauth.json --server=127.0.0.1 --port=27017 --database=“insert DB name here” --track=objects.txt

4 1. Sudo is used on Linux-based systems to invoke the context of another account, typically one with elevated or

administrative privileges. The “python” command calls the Python interpreter to interpret the script (the file

immediately after it, ending in the .py extension).

2. --oauth, is the parameter that passes your Twitter app’s credentials, stored in the file, to the program for

authentication and authorization to Twitter’s APIs.

3. --server, denotes the Internet Protocol (IP) address of your server, this value can also be a resolved host

name if you are using an externally provided hosting platform, or have otherwise resolved the IP address to

a host name. In this example, I am using the local host address of 127.0.0.1, which indicates that I am

running the program directly from the local machine. The tweets you collect will be routed or directed to

the IP address you place into this parameter. On this system, I am using a SAN, which has its own set of IP

addresses. However, the SAN is providing virtual storage that is logically attached to the host server, which

is why I am using the local host address.

4. --port, denotes the software endpoint that facilitates application or protocol specific communication. In

this case, port 27017 is the default listening port for MongoDB core services.

5. --database, is the parameter that references the database that will house the incoming Twitter data. The

program will automatically create the database on the server hosting MongoDB services that is referenced

in the --server parameter.

6. --track, is the parameter that references the text file that contains the objects the system will track and

collect (one object per line with a maximum of 400 objects).
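To make the role of each parameter concrete, the following sketch shows how a script might read the same five flags with Python's standard argparse module. It is an illustration only and is not the source code of twitterstreamtomongodb.py; the function name build_parser, the help strings, and the default values are assumptions. The database name DMTPS is taken from the sample statistics output in Chapter 5.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the flags used in the command above (hypothetical)."""
    parser = argparse.ArgumentParser(description="Illustrative stand-in for the collector's command line.")
    parser.add_argument("--oauth", required=True, help="path to the JSON file holding the Twitter app credentials")
    parser.add_argument("--server", default="127.0.0.1", help="IP address or host name of the MongoDB server")
    parser.add_argument("--port", type=int, default=27017, help="MongoDB listening port")
    parser.add_argument("--database", required=True, help="database that will hold the incoming tweets")
    parser.add_argument("--track", required=True, help="text file listing the objects to track")
    return parser


# Parse the same values that appear in the command above.
args = build_parser().parse_args([
    "--oauth=oauth.json",
    "--server=127.0.0.1",
    "--port=27017",
    "--database=DMTPS",
    "--track=objects.txt",
])
print(args.server, args.port, args.database)
```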

Once the program is initiated, the data-mining begins, and the tweets will start pouring in

as soon as they are created or distributed. From this point, how long you collect tweets is up to

you, and is limited only by the resources you allocate to the data-mining system and Twitter’s

rate limitation protocol. Once you have collected an acceptable data set, the next thing to do is to

analyze the data and generate an information product of potential value. However, before

detailing the analysis portion of this project, I believe that it is crucial to explore the composition

of a tweet, and explore why a collection of tweets can be valuable.

So what exactly is a tweet? The common conception is that a tweet is a roughly 140

character message that, on its face, is only able to communicate the most minimal of information.

However, as I have alluded to throughout this thesis, there is more to a tweet than meets the eye:

much more. At the heart of a tweet lies a rich metadata architecture that binds and directs the

individual tweet into the larger social network and information ecosystem. In essence, the tweet

metadata provides defined fields, which can then be populated by personally identifying and

descriptive data pertaining to the creator, distributor, and consumer of the tweet.

The data and metadata associated with a tweet can then be used to create information

constructs such as location mosaics, “webs-of-association,” behavior-pattern analyses, and

sentiment analyses. Information constructs derived from the underlying data and metadata

contained within a tweet can then detail how individual creators and consumers of a tweet relate

in the broader Twitter social network, and even in the real world. What this means is that an

individual, or an automated information system, can use Twitter data and metadata, or most other

metadata, to create information models that detail human behavior and association. While there

are numerous ways to depict Twitter metadata, I believe the most accessible manner is through a

visual aid with descriptions of the various components.

The tweet metadata depicted in the following screenshot is a graphical representation

and may be difficult to view. Appendix 1 depicts a tweet’s metadata in its native textual

representation of JavaScript Object Notation (JSON).

The following list details some of the potentially significant metadata fields associated

with tweets and describes their functions:

1. _id: This provides a unique alphanumerical identifier for the stored tweet document (assigned by MongoDB).

2. Contributors: This lists the IDs of users who have contributed to the tweet.

3. Text: The actual message field of the tweet; this is what most people usually see when

a tweet is created or consumed.

4. In_reply_to_status_id: If the tweet is a reply to another tweet, this field will provide

the integer representation of the original tweet’s ID.

5. Favorite_count: How many times the tweet has been “favorited” by other Twitter

users.

6. Source: The generating source of the tweet (such as the Twitter for the iPhone App).

7. Coordinates: The longitude and latitude of the tweet’s generating source.

8. Entities: This field contains the following sub fields: hashtags, any hashtags

referenced in the tweet; user_mentions, any Twitter users mentioned in the tweet;

symbols, any symbols listed in the tweet; media, the resource locators for any

pictures, videos, or other media files associated with the tweet; and urls,

the universal resource locators provided in the tweet.

9. Retweet_count: The number of times the tweet has been retweeted.

10. Retweeted_status: Within retweeted_status, are contained the following descriptive

and identifying data fields, which are associated with the creator of the retweeted

tweet: contributors, id, favorite_count, source, retweeted, coordinates, and entities.

11. User: Within the user field exist data and metadata that identify and describe the

primary composer of the tweet. It contains the following fields: id, the unique

identifier of the user account that created the tweet; verified, whether or not the user’s

Twitter account is verified; friends_count, the number of accounts the tweet creator

follows; location, the location the user has entered in the account profile; geo_enabled,

whether the user account has geo-tracking enabled; name, the name of the Twitter account;

lang, the language setting of the user account; favourites_count, the number of tweets that

the user has marked as favorite; screen_name, the screen name of the Twitter user;

created_at, the date-time stamp of the account’s creation; contributors_enabled, indicates

whether or not the Twitter user has permitted the use of authenticated contributors;

and time_zone, the time zone the user has declared for the account.

The metadata fields I just described are only a few of the total fields available in the ever-

evolving Twitter system. As you can see, there are many more metadata fields depicted in the

graphical representation, and many more in the textual representation illustrated in Appendix 1.

The potential applications for deriving value from these metadata and data points are limited only

by the creativity, ability, resources, and access of the individual or system that captures,

processes, and analyzes them.
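The fields described above can be read directly once a tweet is loaded as a Python dictionary. The following sketch is illustrative only: the function name summarize_tweet is hypothetical, and the sample is an abridged copy of the tweet reproduced in Appendix 1, keeping only the fields the function reads.

```python
def summarize_tweet(tweet):
    """Pull a few of the metadata fields discussed above out of one tweet document."""
    entities = tweet.get("entities") or {}
    user = tweet.get("user") or {}
    return {
        "text": tweet.get("text"),
        "author": user.get("screen_name"),
        "author_friends": user.get("friends_count"),
        "mentions": [mention["screen_name"] for mention in entities.get("user_mentions", [])],
        "hashtags": [tag["text"] for tag in entities.get("hashtags", [])],
        "coordinates": tweet.get("coordinates"),  # None when the tweet is not geotagged
    }


# Abridged from the tweet shown in Appendix 1.
sample = {
    "text": "✖ @AustinScottGA08 Silence Is Complicity #MSSen #RememberMississippi #MakeDCListen",
    "coordinates": None,
    "entities": {
        "user_mentions": [{"screen_name": "AustinScottGA08", "name": "Rep. Austin Scott"}],
        "hashtags": [{"text": "MSSen"}, {"text": "RememberMississippi"}, {"text": "MakeDCListen"}],
    },
    "user": {"screen_name": "freedomsfool", "friends_count": 3389},
}

summary = summarize_tweet(sample)
print(summary["author"], summary["mentions"], summary["hashtags"])
```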

At this stage of the operation, the data-mining system should have collected a database

composed of collections, which will contain every tweet referencing an object listed in your

“objects” file. Now that I have detailed the creation of a data-mining system, created an

authenticated Twitter app, and dissected and explored a tweet’s metadata structure, we can move

on to the methods and approach I used to process and analyze Twitter data and metadata.5

5 Reference Appendix 3 for database backup and restore instructions.

CHAPTER 5: DATA PROCESSING AND ANALYSIS

The following examples are queries and information products I created using data

captured, collected, and processed by my data-mining and analytics system. During my

collection period, beginning on 8 October 2014 at 2000 hrs., and ending on 25 October 2014 at

2000 hrs., I collected almost every tweet referencing a randomly selected list of 279 MCs. In

total, my data-mining and analytics system collected 472,395 tweets, including all the

corresponding metadata, automatically. Now, I will move on to the analysis portion of this

project.

One of the most approachable methods to analyze social media information, without

initially being too bogged down in the intricacies of metadata analysis, is to create a table

analysis. In this example, I select a sample-set of collected Twitter objects, in this case the

Twitter handles of certain MCs, and assign them variables. The variables correspond to the MC’s

name, age, party, chamber, state, district, district competitiveness (DC), and the number of

tweets associated with that MC.6

6 District competitiveness is defined with an “S” for safe, or an “N” for not safe. The district competitiveness

information was collected from Sabato’s Crystal Ball at http://www.centerforpolitics.org/crystalball/

Handle             Name                 Age  Party  Chamber  State  District  DC  Tweets
@SpeakerBoehner    Boehner, John        64   R      H        OH     8         S   44590
@SteveScalise      Scalise, Steve       49   R      H        LA     1         S   4278
@WhipHoyer         Hoyer, Steny         75   D      H        MD     5         S   1963
@McConnellPress    McConnell, Mitch     72   R      S        KY     -         S   4978
@SenatorDurbin     Durbin, Richard      69   D      S        IL     -         S   1857
@SenFeinstein      Feinstein, Dianne    81   D      S        CA     -         S   2567
@JoaquinCastrotx   Castro, Joaquin      40   D      H        TX     20        S   1636
@RepCuellar        Cuellar, Henry       59   D      H        TX     28        S   403
@SenSanders        Sanders, Bernie      73   D      S        VT     -         S   10657
@SenJohnMcCain     McCain, John         78   R      S        AZ     -         S   20726
@SenTedCruz        Cruz, Ted            43   R      S        TX     -         S   32307
@SenSchumer        Schumer, Chuck       63   D      S        NY     -         S   2633
@RepBetoORourke    O’Rourke, Beto       42   D      H        TX     16        S   884
@RepWestmoreland   Westmoreland, Lynn   64   R      H        GA     3         S   1166
@RepTomPrice       Price, Tom           60   R      H        GA     6         S   777
@repjohnbarrow     Barrow, John         59   D      H        GA     12        S   368
@LEETERRYNE        Terry, Lee           52   R      H        NE     2         N   2671
@RepNickRahall     Rahall, Nick         65   D      H        VA     3         N   610
@CongMikeSimpson   Simpson, Mike        64   R      H        ID     2         N   530
@RepBera           Bera, Ami            49   D      H        CA     7         N   836
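Once the table is in this form, simple summaries can be computed from it. The following Python sketch totals the tweet counts by any column; it uses only the first three rows of the table, and the function name tweets_by and the tuple layout are illustrative assumptions rather than part of the system described in this thesis.

```python
from collections import defaultdict

# Column order follows the table header:
# Handle, Name, Age, Party, Chamber, State, District, DC, Tweets
ROWS = [
    ("@SpeakerBoehner", "Boehner, John", 64, "R", "H", "OH", "8", "S", 44590),
    ("@SteveScalise", "Scalise, Steve", 49, "R", "H", "LA", "1", "S", 4278),
    ("@WhipHoyer", "Hoyer, Steny", 75, "D", "H", "MD", "5", "S", 1963),
]

PARTY_COLUMN = 3
TWEETS_COLUMN = 8


def tweets_by(rows, column_index):
    """Sum the Tweets column for each distinct value found in the chosen column."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[column_index]] += row[TWEETS_COLUMN]
    return dict(totals)


party_totals = tweets_by(ROWS, PARTY_COLUMN)
print(party_totals)
```

The same function can group by chamber, state, or district competitiveness by passing a different column index.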

Now that descriptive and identifying attributes have been associated with the MCs’

Twitter handles, the next step is to run some basic analytic queries against the collections in the

database and extract some potentially useful information. For the next step, I will query the

collection associated with each of the MCs listed and extract the number of Tweets that

referenced the MC in the “Twitterverse,” during the collection period.7

7 To capture the total number of documents (tweets) contained within a collection (MC) in the database, use the

Mongo shell or a GUI database management console. From the command-line terminal, enter “mongo” to open the

Mongo shell, then enter: use “enter db name”. Once in the database, enter the following command:

db['@enter-tracking-object-name-here'].stats()

Once the query executes, the system will produce statistics from the queried collection

and output them to the terminal in a JSON representation. The output will look like this:

{

"ns" : "DMTPS.@SpeakerBoehner",

"count" : 44590,

"size" : 346754336,

"avgObjSize" : 7776,

"storageSize" : 460861440,

"numExtents" : 14,

"nindexes" : 1,

"lastExtentSize" : 124993536,

"paddingFactor" : 1,

"systemFlags" : 1,

"userFlags" : 1,

"totalIndexSize" : 1455328,

"indexSizes" : {

"_id_" : 1455328

},

"ok" : 1

}

For the purpose of this query, the important value to extract is the “count” field, which is

the total number of documents, tweets in this case, that referenced a particular MC during the

collection period. In the following examples, I will construct more advanced, metadata driven,

queries that will extract identifying and associative data from the collection database.
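The same count can be retrieved from Python rather than from the Mongo shell. The following sketch assumes the third-party pymongo driver is installed and that MongoDB is listening on the address and port used earlier; the function names tweet_count and connect are hypothetical. It issues the collstats database command, which returns a document like the statistics output shown above.

```python
def tweet_count(db, collection_name):
    """Return the number of documents (tweets) stored in one collection."""
    stats = db.command("collstats", collection_name)
    return stats["count"]


def connect(database_name, host="127.0.0.1", port=27017):
    """Open a connection and return a handle to the named database."""
    # pymongo is imported here so the counting function can be read
    # and exercised without the driver installed.
    from pymongo import MongoClient
    return MongoClient(host, port)[database_name]
```

With a running server, a call such as tweet_count(connect("DMTPS"), "@SpeakerBoehner") would return the “count” value for that collection, where DMTPS is the database name that appears in the sample output.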

In the following examples, I have anonymized any personally identifying information my

queries and analytics produced for privacy reasons. For these queries and analytics, I use the

MongoDB Aggregation Framework to query the collection associated with a Member of

Congress, and then to find the following information for every tweet that references the specified

MC:

1. The text of the tweet referencing a specific MC.

2. The Twitter users referenced in the tweet (the intended audience).

3. The user screen name and “real name” of the Twitter account holder that created the

tweet.

4. The number of friends that the Twitter user has.

5. The location and country in which the tweet was created.

6. The geographic coordinates of the location where the tweet was created.

In order to query the collection and extract the pertinent information, the following query

must be run against the collection you wish to analyze using the MongoDB Aggregation

Framework.

{

$group: {

_id: {

text: "$text",

entities_user_mentions_screen_name: "$entities.user_mentions.screen_name",

user_name: "$user.name",

user_screen_name: "$user.screen_name",

user_friends_count: "$user.friends_count",

place_full_name: "$place.full_name",

geo_coordinates: "$geo.coordinates"

}

}

}

Once the query executes, a document will be created that contains the information you

extracted from the data. The following output is a real example of an information product the

query generated:

{

"_id":{

"text":"These are #Ukraine war crimes. #ukrainevotes jail the Kiev criminals. http://t.co/uIW6KxqzG4\"\n@WhiteHouse \n@BarackObama \n@SpeakerBoehner",

"entities_user_mentions_screen_name":[

"WhiteHouse",

"BarackObama",

"SpeakerBoehner"

],

"user_name":"Pattys4Putin-USA",

"user_screen_name":"PattyDs50",

"user_friends_count":1598,

"place_full_name":"New Hampshire, US",

"geo_coordinates":[

42.908474,

-71.841744

]

}

}
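The grouping stage shown above can also be submitted from a Python script. The following is a sketch under stated assumptions: the collection argument is taken to be a pymongo Collection object (for example db["@SpeakerBoehner"]), and the names GROUP_FIELDS, build_pipeline, and run_pipeline are hypothetical. The field paths are copied from the query above.

```python
# Output names mapped to the tweet fields they are drawn from (copied from the query above).
GROUP_FIELDS = {
    "text": "$text",
    "entities_user_mentions_screen_name": "$entities.user_mentions.screen_name",
    "user_name": "$user.name",
    "user_screen_name": "$user.screen_name",
    "user_friends_count": "$user.friends_count",
    "place_full_name": "$place.full_name",
    "geo_coordinates": "$geo.coordinates",
}


def build_pipeline():
    """Return the aggregation pipeline as a list containing the single $group stage."""
    return [{"$group": {"_id": dict(GROUP_FIELDS)}}]


def run_pipeline(collection, limit=5):
    """Run the pipeline and return up to `limit` of the grouped records."""
    results = []
    for document in collection.aggregate(build_pipeline()):
        results.append(document["_id"])
        if len(results) >= limit:
            break
    return results
```

Because the pipeline is an ordinary Python list, further stages can be appended to it before it is run.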

For the following example, I queried an MC’s collection in order to find all the tweets

referencing the MC that were written in Spanish during my collection period. The query also

found the screen name, real name, city, and state of the Twitter account holder when

he or she created the tweet, and the geographical coordinates of the exact location where the tweet was

created. The following query also employs the MongoDB Aggregation Framework:

{

$group: {

_id: {

text: "$text",

lang: "$lang",

user_screen_name: "$user.screen_name",

user_name: "$user.name",

place_full_name: "$place.full_name",

geo_coordinates: "$geo.coordinates"

}

}

},

{

$match: {

"_id.lang": "es"

}

}

Once the query executes, a document will be created that contains the information you

extracted from the data. The following output is a real example of an information product the

query generated:

{

"_id":{

"text":"Gracias @JoaquinCastrotx por apoyar la #ReformaMigratoria. Por favor sigue luchando por #CIR. #TimeIsNow http://t.co/p5F9Y54Ac2 vía @FWD_us",

"lang":"es",

"user_screen_name":"DguezVd",

"user_name":"Vaneza Dominguez",

"place_full_name":"Dallas, TX",

"geo_coordinates":[

32.900652,

-96.871544

]

}

}
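A variation on the language query is sketched below in Python. It differs from the query above in one respect: the $match stage is placed before the $group stage and filters on the tweet's top-level lang field, so that only tweets in the requested language reach the grouping step. Because lang is part of the grouping key, the two orderings should select the same records. The function name language_pipeline is hypothetical, and the resulting list is intended to be passed to the aggregate method of a pymongo collection.

```python
def language_pipeline(lang_code):
    """Build a pipeline that keeps tweets in one language, then groups the fields of interest."""
    return [
        # Filter first so later stages only see tweets in the requested language.
        {"$match": {"lang": lang_code}},
        {"$group": {"_id": {
            "text": "$text",
            "lang": "$lang",
            "user_screen_name": "$user.screen_name",
            "user_name": "$user.name",
            "place_full_name": "$place.full_name",
            "geo_coordinates": "$geo.coordinates",
        }}},
    ]


spanish_pipeline = language_pipeline("es")
print(spanish_pipeline[0])
```

Passing a different language code, such as "en", reuses the same grouping for another language.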

The three examples I documented here only scratch the surface of what is possible by

incorporating a metadata driven approach to social media data-mining and analytics. By

leveraging the robust Twitter metadata architecture, I was able to collect a vast and nearly

complete dataset referencing 279 MCs and comprising almost half a million tweets. I was then

able to query the individual collections corresponding to the MCs, and then create potentially

valuable information products. In the example queries, I was able to identify the tweet frequencies

associated with particular MCs, the Twitter handles of users referenced in a tweet, the number of

friends the tweet generator has, the intended audiences of the tweets, the physical location of the

tweet generators, and even to filter tweets by language.
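The author and mention fields extracted by these queries are also the raw material for a simple “web-of-association.” The following Python sketch counts how often each author mentions each account. It is an illustration under assumptions, not an analysis performed in this project: the function name mention_edges is hypothetical, and the sample tweets use placeholder author names.

```python
from collections import Counter


def mention_edges(tweets):
    """Count (author, mentioned account) pairs across a list of tweet documents."""
    edges = Counter()
    for tweet in tweets:
        author = (tweet.get("user") or {}).get("screen_name")
        mentions = (tweet.get("entities") or {}).get("user_mentions") or []
        for mention in mentions:
            target = mention.get("screen_name")
            if author and target:
                edges[(author, target)] += 1
    return edges


# Placeholder tweets; "user_a" and "user_b" are invented names for illustration.
SAMPLE_TWEETS = [
    {"user": {"screen_name": "user_a"},
     "entities": {"user_mentions": [{"screen_name": "SpeakerBoehner"}, {"screen_name": "WhiteHouse"}]}},
    {"user": {"screen_name": "user_a"},
     "entities": {"user_mentions": [{"screen_name": "SpeakerBoehner"}]}},
    {"user": {"screen_name": "user_b"},
     "entities": {"user_mentions": []}},
]

edges = mention_edges(SAMPLE_TWEETS)
for (author, target), weight in edges.most_common():
    print(author, "->", target, weight)
```

Each counted pair can be treated as a weighted edge in a graph, which a network analysis library could then draw or analyze.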

However, the queries I provided here are only the beginning. Truly, the possibilities for

generating valuable information products by leveraging data-mining and analytics are

limited only by the creativity, skill, access, resources, and time of the new political data-miner. In the

following section, I will expand upon some of the possible applications of using data-mining and

analytics for the benefit of Political Science.

CHAPTER 6: POTENTIAL APPLICATIONS

While the potential applications for data-mining and analytics in Political Science are

limited only by the creativity of the data-miner, I wanted to provide a hypothetical example of a

political science activity that could benefit from such an approach.

In this hypothetical scenario, a research team is given the task to collect all the tweets

created by, or referencing, all congressional candidates during a particular election cycle. Once

the election cycle is over, the research team is to analyze the tweets and generate an information

product that investigates how social media campaigns affect the creation and behavior of

political associations on social media and in the real world.

In this scenario, the task would be difficult, if not impossible, to do with conventional

social media collection and analysis tools. The research team could decide to comb the web for

tweet collection websites, and to manually collect and operationalize the tweets using

spreadsheets and the like. However, this method would be very labor and time intensive, and

would only yield the message field of the tweet and some minimally identifying and descriptive

data. In this method, the research team would not be able to construct a web-of-association that

would identify the congressional districts in which the tweets were created, consumed, or shared.

However, the research team could employ another option. The research team could reach

out to a tweet vendor and purchase all the tweets created by or referencing particular

congressional candidates, and then run queries and analytics against those data sets. However,

this method is expensive and does not lend itself to dynamically adjusting the collection of

tweets, as a research team might do during their project. Nevertheless, if a research team has

access to significant funding, employing a tweet vendor could be a simple method to collect

sizable Twitter data sets. However, if you want the metadata associated with the tweet, which is

often more valuable than the message itself, it will cost significantly more.

In this scenario, employing a data-mining and analytics system like the one I created and

detailed in this project is ideal. The information system I created for this project uses all free and

open-source technologies that are readily available and well documented on the Internet. The

support communities for all the technologies required for building and operating this type of

system are highly robust, typically friendly, and usually able to help most anyone troubleshoot or

navigate a particular technology. In addition, the software required to build and operate this type

of system can be run on most commodity hardware, ranging from small desktop computers to

massive clusters of networked servers, and even on the cloud. Another benefit of this type of

system is that you can securely access, monitor, and maintain the system remotely from virtually

anywhere with an Internet connection. From your computer at home, your tablet on vacation, or

your smartphone on the road, you can update your object collection list, create new databases,

and write new analytic queries.

A further benefit of employing and maintaining a data-mining and analytics system is

that once it is established, the system can continue to collect information indefinitely. The system

can also be used or replicated by others, who then can use the system as is, or expand the system

and add new functionality and features to it. The possibilities of using open-source technologies

for data-mining and analytics to the benefit of Political Science are almost limitless.

CHAPTER 7: CONCLUSION

When I started this project, I wanted to create an information system that could pave the

way for Political Science researchers to explore new technologies and methods in order to make

their work easier, more innovative, and more productive. I had a strong background in

Information Systems and Cybersecurity, but I had never before created a data-mining and

analytics information system. I thought the process would be fun and challenging. However, I

had no idea how fun and challenging the process would actually be. I knew that there were

significant implications for data-mining and analytics in Political Science, but I was not sure how

to bridge the gap between the disciplines.

After much study, research, trial, and error, I created an information system that can mine

data from the Internet easily, automatically, and perpetually. The next challenge was in the

analytics. When I started this project, I did not have much knowledge or experience in “data-

analytics.” I had, of course, analyzed data before, but not in the context of a formal data-mining

and data-science initiative. Throughout this project, I taught myself a great deal about data-

mining, data-science, and data-analytics. As such, I was able to produce some basic information

products that I am sure will pique the interest of the more adventurous political scientists. The

data-mining and analytics system I created and detailed here is a basic system, but one that I

hope will serve as the foundation for further development and study.

With this system, I was able to capture a relatively large data set of tweets of political

interest relatively easily and automatically. I was then able to collect the tweets into a database

capable of storing unstructured data from practically anywhere in the digital world. Furthermore,

I was able to manipulate, transform, and query the tweets to produce information products with the

capacity to advance normative political theory and quantitative political analysis. In the end, I

was able to provide a roadmap for future “political data-miners” to get started in constructing

their own data-mining and analytics information systems for the benefit of Political Science.

REFERENCES

Chodorow, Kristina. 2013. MongoDB: The Definitive Guide. Sebastopol: O’Reilly.

Del Fresno, Guillermo. 2014. “twitterstream-to-mongodb” [Software]. GitHub: Retrieved from

https://github.com/gdelfresno/twitterstream-to-mongodb

Farhi, Paul. 2009. “The Twitter Explosion.” American Journalism Review 31(3): 26–31.

http://search.ebscohost.com/login.aspx?direct=true&db=ufh&AN=41877978&site=ehost-

live (February 19, 2010).

Gervais, Bryan T. 2014. “Incivility Online: Affective and Behavioral Reactions to Uncivil

Political Posts in a Web-based Experiment.” Journal of Information Technology &

Politics (Forthcoming)

Golbeck, Jennifer, Justin M. Grimes, and Anthony Rogers. 2010. “Twitter Use by the U.S.

Congress.” Journal of the American Society for Information Science and Technology

61(8): 1612–21.

Greenwald, Glenn. 2014. No Place to Hide: Edward Snowden, the NSA, and the U.S.

Surveillance State. New York: Metropolitan Books.

Mansbridge, Jane. 1999. “Everyday Talk in the Deliberative System” In Deliberative Politics:

Essays on Democracy and Disagreement, ed Stephen Macedo: Oxford University Press,

1 – 211.

McKinney, Wes. 2013. Python for Data Analysis 2nd ed. Sebastopol: O’Reilly.

Mutz, Diana C. 2006. Hearing the Other Side: Deliberative Versus Participatory Democracy.

New York: Cambridge University Press.

Provost, Foster, & Tom Fawcett. 2013. Data Science for Business: What You Need to Know

About Data Mining and Data-Analytic Thinking. Sebastopol: O’Reilly.

Rosenzweig, Paul. 2013. Cyber Warfare: How Conflicts in Cyberspace Are Challenging

America and Changing the World. Santa Barbara: Praeger.

Ruggie, John G. 2013. Just Business: Multinational Corporations and Human Rights. New

York: W. W. Norton & Company.

Russell, Matthew A. 2014. Mining the Social Web: Data Mining Facebook, Twitter, LinkedIn,

Google+, GitHub, and More 2nd ed. Sebastopol: O’Reilly.

APPENDICES

Appendix 1: What a Tweet Really Looks Like in Its Native JSON

NOTE: I highlighted the “text” field, which contains the actual message portion of a tweet.
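As a rough illustration of how the fields in this appendix might be read programmatically, the following Python sketch pulls the message text, hashtags, and mentions out of one tweet document. It is not part of the collection software used in this project. The function name summarize_tweet and the abbreviated tweet dictionary are hypothetical; the dictionary copies only a few fields from document /* 0 */ in this appendix, and the sketch assumes the tweet has already been loaded into a plain Python dictionary.

```python
def summarize_tweet(tweet):
    """Collect the fields most often needed for analysis from one tweet document."""
    entities = tweet.get("entities", {})
    return {
        "text": tweet.get("text"),
        "author": tweet.get("user", {}).get("screen_name"),
        "hashtags": [h["text"] for h in entities.get("hashtags", [])],
        "mentions": [m["screen_name"] for m in entities.get("user_mentions", [])],
        # Retweets carry the original tweet in a nested "retweeted_status" document.
        "is_retweet": "retweeted_status" in tweet,
    }


# Hypothetical, abbreviated copy of document /* 0 */ from this appendix.
tweet = {
    "text": "\u2716 @AustinScottGA08 Silence Is Complicity "
            "#MSSen #RememberMississippi #MakeDCListen",
    "entities": {
        "user_mentions": [{"screen_name": "AustinScottGA08", "indices": [2, 18]}],
        "hashtags": [
            {"indices": [41, 47], "text": "MSSen"},
            {"indices": [48, 68], "text": "RememberMississippi"},
            {"indices": [69, 82], "text": "MakeDCListen"},
        ],
    },
    "user": {"screen_name": "freedomsfool"},
}

summary = summarize_tweet(tweet)
print(summary["hashtags"])

# The "indices" pair marks where an entity sits inside the "text" field.
start, end = tweet["entities"]["hashtags"][0]["indices"]
first_hashtag = tweet["text"][start:end]
print(first_hashtag)
```

The "indices" values give character offsets into the “text” field, so slicing the text with them recovers the hashtag together with its leading “#”.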

/* 0 */

{

"_id" : ObjectId("54361be43b811434f9a21da4"),

"contributors" : null, "truncated" : false,

"text" : "✖ @AustinScottGA08 Silence Is Complicity #MSSen #RememberMississippi #MakeDCListen", "in_reply_to_status_id" : null, "id" : NumberLong(520082176352980993), "favorite_count" : 0, "source" : "<a href=\"http://tweetadder.com\" rel=\"nofollow\">TweetAdder v4</a>", "retweeted" : false, "coordinates" : null, "timestamp_ms" : "1412832228168", "entities" : { "user_mentions" : [ { "id" : 234797704, "indices" : [ 2, 18 ], "id_str" : "234797704", "screen_name" : "AustinScottGA08", "name" : "Rep. Austin Scott" } ], "symbols" : [], "trends" : [], "hashtags" : [ { "indices" : [ 41, 47 ], "text" : "MSSen" }, { "indices" : [ 48, 68 ], "text" : "RememberMississippi" }, { "indices" : [ 69, 82 ], "text" : "MakeDCListen" } ], "urls" : [] }, "in_reply_to_screen_name" : null, "id_str" : "520082176352980993",

"retweet_count" : 0, "in_reply_to_user_id" : null, "favorited" : false, "user" : { "follow_request_sent" : null, "profile_use_background_image" : true, "default_profile_image" : false, "id" : 265658805, "verified" : false, "profile_image_url_https" : "https://pbs.twimg.com/profile_images/455915260524769280/ClR7foxv_normal.png", "profile_sidebar_fill_color" : "DDEEF6", "profile_text_color" : "333333", "followers_count" : 3559, "profile_sidebar_border_color" : "000000", "id_str" : "265658805", "profile_background_color" : "000000", "listed_count" : 57, "profile_background_image_url_https" : "https://pbs.twimg.com/profile_background_images/845237718/447b881c8b774ed9199f6bf5505beb66.jpeg", "utc_offset" : -14400, "statuses_count" : 97286, "description" : "A Declaration Conservative: That 2 secure these (unalienable) rights, Govts R instituted among Men, deriving their just powers from the consent of the governed", "friends_count" : 3389, "location" : "Western Pennsylvania", "profile_link_color" : "000000", "profile_image_url" : "http://pbs.twimg.com/profile_images/455915260524769280/ClR7foxv_normal.png", "following" : null, "geo_enabled" : false, "profile_banner_url" : "https://pbs.twimg.com/profile_banners/265658805/1397533565", "profile_background_image_url" : "http://pbs.twimg.com/profile_background_images/845237718/447b881c8b774ed9199f6bf5505beb66.jpeg", "name" : "Freedoms Fool", "lang" : "en", "profile_background_tile" : false, "favourites_count" : 94, "screen_name" : "freedomsfool", "notifications" : null, "url" : null, "created_at" : "Sun Mar 13 23:26:25 +0000 2011", "contributors_enabled" : false, "time_zone" : "Eastern Time (US & Canada)", "protected" : false, "default_profile" : false, "is_translator" : false }, "geo" : null, "in_reply_to_user_id_str" : null, "possibly_sensitive" : false, "lang" : "en", "created_at" : "Thu Oct 09 05:23:48 +0000 2014", "filter_level" : "medium", "in_reply_to_status_id_str" : null, "place" : null } /* 1 */ { "_id" : 
ObjectId("5435dcae3b811434f9a1ff12"), "contributors" : null, "truncated" : false, "text" : "RT @FreeTheMarine: GA @AustinScottGA08 Pls support #HRes620 assisting our #MarineHeldInMexico. He needs treatment for PTSD ASAP #BringBackO…", "in_reply_to_status_id" : null, "id" : NumberLong(520014305681768449), "favorite_count" : 0,

"source" : "<a href=\"http://twitter.com/download/iphone\" rel=\"nofollow\">Twitter for iPhone</a>", "retweeted" : false, "coordinates" : null, "timestamp_ms" : "1412816046572", "entities" : { "user_mentions" : [ { "id" : NumberLong(2476804154), "indices" : [ 3, 17 ], "id_str" : "2476804154", "screen_name" : "FreeTheMarine", "name" : "Free Sgt Tahmooressi" }, { "id" : 234797704, "indices" : [ 22, 38 ], "id_str" : "234797704", "screen_name" : "AustinScottGA08", "name" : "Rep. Austin Scott" } ], "symbols" : [], "trends" : [], "hashtags" : [ { "indices" : [ 51, 59 ], "text" : "HRes620" }, { "indices" : [ 74, 93 ], "text" : "MarineHeldInMexico" }, { "indices" : [ 128, 140 ], "text" : "BringBackOurMarine" } ], "urls" : [] }, "in_reply_to_screen_name" : null, "id_str" : "520014305681768449", "retweet_count" : 0, "in_reply_to_user_id" : null, "favorited" : false, "retweeted_status" : { "contributors" : null, "truncated" : false,

"text" : "GA @AustinScottGA08 Pls support #HRes620 assisting our #MarineHeldInMexico. He needs treatment for PTSD ASAP #BringBackOurMarine", "in_reply_to_status_id" : null,

"id" : NumberLong(519365396366110720), "favorite_count" : 12, "source" : "<a href=\"http://twitter.com/download/android\" rel=\"nofollow\">Twitter for Android</a>", "retweeted" : false, "coordinates" : null, "entities" : { "user_mentions" : [ { "id" : 234797704, "indices" : [ 3, 19 ], "id_str" : "234797704", "screen_name" : "AustinScottGA08", "name" : "Rep. Austin Scott" } ], "symbols" : [], "trends" : [], "hashtags" : [ { "indices" : [ 32, 40 ], "text" : "HRes620" }, { "indices" : [ 55, 74 ], "text" : "MarineHeldInMexico" }, { "indices" : [ 109, 128 ], "text" : "BringBackOurMarine" } ], "urls" : [] }, "in_reply_to_screen_name" : null, "id_str" : "519365396366110720", "retweet_count" : 29, "in_reply_to_user_id" : null, "favorited" : false, "user" : { "follow_request_sent" : null, "profile_use_background_image" : false, "default_profile_image" : false, "id" : NumberLong(2476804154), "verified" : false, "profile_image_url_https" : "https://pbs.twimg.com/profile_images/509417608936845312/OX6Pm-8B_normal.jpeg", "profile_sidebar_fill_color" : "DDEEF6", "profile_text_color" : "333333", "followers_count" : 3542, "profile_sidebar_border_color" : "000000", "id_str" : "2476804154", "profile_background_color" : "000000", "listed_count" : 59, "profile_background_image_url_https" : "https://abs.twimg.com/images/themes/theme1/bg.png", "utc_offset" : -25200,

"statuses_count" : 3706, "description" : "OFFICIAL Tahmooressi Family Account. Please visit: http://t.co/8PyH5q0uWE | #MarineHeldInMexico #HRes620 | Media Requests: [email protected]", "friends_count" : 243, "location" : "www.andrewfreedomfund.com", "profile_link_color" : "134673", "profile_image_url" : "http://pbs.twimg.com/profile_images/509417608936845312/OX6Pm-8B_normal.jpeg", "following" : null, "geo_enabled" : false, "profile_banner_url" : "https://pbs.twimg.com/profile_banners/2476804154/1403332310", "profile_background_image_url" : "http://abs.twimg.com/images/themes/theme1/bg.png", "name" : "Free Sgt Tahmooressi", "lang" : "en", "profile_background_tile" : false, "favourites_count" : 12637, "screen_name" : "FreeTheMarine", "notifications" : null, "url" : "http://www.facebook.com/freethemarine", "created_at" : "Sun May 04 12:12:23 +0000 2014", "contributors_enabled" : false, "time_zone" : "Arizona", "protected" : false, "default_profile" : false, "is_translator" : false }, "geo" : null, "in_reply_to_user_id_str" : null, "possibly_sensitive" : false, "lang" : "en", "created_at" : "Tue Oct 07 05:55:34 +0000 2014", "filter_level" : "low", "in_reply_to_status_id_str" : null, "place" : null }, "user" : { "follow_request_sent" : null, "profile_use_background_image" : true, "default_profile_image" : false, "id" : 959017200, "verified" : false, "profile_image_url_https" : "https://pbs.twimg.com/profile_images/509521711671558144/oqRiNGin_normal.jpeg", "profile_sidebar_fill_color" : "DDEEF6", "profile_text_color" : "333333", "followers_count" : 145, "profile_sidebar_border_color" : "C0DEED", "id_str" : "959017200", "profile_background_color" : "C0DEED", "listed_count" : 2, "profile_background_image_url_https" : "https://abs.twimg.com/images/themes/theme1/bg.png", "utc_offset" : null, "statuses_count" : 4451, "description" : null, "friends_count" : 263, "location" : "", "profile_link_color" : "0084B4", "profile_image_url" : 
"http://pbs.twimg.com/profile_images/509521711671558144/oqRiNGin_normal.jpeg", "following" : null, "geo_enabled" : false, "profile_background_image_url" : "http://abs.twimg.com/images/themes/theme1/bg.png", "name" : "MomOrWhatever", "lang" : "en", "profile_background_tile" : false, "favourites_count" : 2426, "screen_name" : "MomOrWhatever", "notifications" : null, "url" : null,

"created_at" : "Tue Nov 20 00:24:19 +0000 2012", "contributors_enabled" : false, "time_zone" : null, "protected" : false, "default_profile" : true, "is_translator" : false }, "geo" : null, "in_reply_to_user_id_str" : null, "possibly_sensitive" : false, "lang" : "en", "created_at" : "Thu Oct 09 00:54:06 +0000 2014", "filter_level" : "medium", "in_reply_to_status_id_str" : null, "place" : null } /* 2 */ { "_id" : ObjectId("5435e9183b811434f9a204d8"), "contributors" : null, "truncated" : false, "text" : "RT @fenolj: @AustinScottGA08 When physical abuse #Tahmooressi endured comes 2 light YOU will be accountable. Co-sponsor #HRes620 #BringBa…", "in_reply_to_status_id" : null, "id" : NumberLong(520027633044946944), "favorite_count" : 0, "source" : "<a href=\"http://twitter.com/download/iphone\" rel=\"nofollow\">Twitter for iPhone</a>", "retweeted" : false, "coordinates" : null, "timestamp_ms" : "1412819224039", "entities" : { "user_mentions" : [ { "id" : NumberLong(2680673774), "indices" : [ 3, 10 ], "id_str" : "2680673774", "screen_name" : "fenolj", "name" : "Jackie Fenolio" }, { "id" : 234797704, "indices" : [ 12, 28 ], "id_str" : "234797704", "screen_name" : "AustinScottGA08", "name" : "Rep. Austin Scott" } ], "symbols" : [], "trends" : [], "hashtags" : [ { "indices" : [ 49, 61 ], "text" : "Tahmooressi" }, { "indices" : [

122, 130 ], "text" : "HRes620" }, { "indices" : [ 131, 140 ], "text" : "BringBackOurMarine" } ], "urls" : [] }, "in_reply_to_screen_name" : null, "id_str" : "520027633044946944", "retweet_count" : 0, "in_reply_to_user_id" : null, "favorited" : false, "retweeted_status" : { "contributors" : null, "truncated" : false, "text" : "@AustinScottGA08 When physical abuse #Tahmooressi endured comes 2 light YOU will be accountable. Co-sponsor #HRes620 #BringBackOurMarine.", "in_reply_to_status_id" : null, "id" : NumberLong(519999120233467904), "favorite_count" : 1, "source" : "<a href=\"http://twitter.com\" rel=\"nofollow\">Twitter Web Client</a>", "retweeted" : false, "coordinates" : null, "entities" : { "user_mentions" : [ { "id" : 234797704, "indices" : [ 0, 16 ], "id_str" : "234797704", "screen_name" : "AustinScottGA08", "name" : "Rep. Austin Scott" } ], "symbols" : [], "trends" : [], "hashtags" : [ { "indices" : [ 37, 49 ], "text" : "Tahmooressi" }, { "indices" : [ 110, 118 ], "text" : "HRes620" }, { "indices" : [ 119, 138 ],

"text" : "BringBackOurMarine" } ], "urls" : [] }, "in_reply_to_screen_name" : "AustinScottGA08", "id_str" : "519999120233467904", "retweet_count" : 2, "in_reply_to_user_id" : 234797704, "favorited" : false, "user" : { "follow_request_sent" : null, "profile_use_background_image" : true, "default_profile_image" : false, "id" : NumberLong(2680673774), "verified" : false, "profile_image_url_https" : "https://pbs.twimg.com/profile_images/519337504214753280/xZ6DzFeB_normal.jpeg", "profile_sidebar_fill_color" : "DDEEF6", "profile_text_color" : "333333", "followers_count" : 125, "profile_sidebar_border_color" : "C0DEED", "id_str" : "2680673774", "profile_background_color" : "C0DEED", "listed_count" : 1, "profile_background_image_url_https" : "https://abs.twimg.com/images/themes/theme1/bg.png", "utc_offset" : null, "statuses_count" : 9624, "description" : null, "friends_count" : 58, "location" : "", "profile_link_color" : "0084B4", "profile_image_url" : "http://pbs.twimg.com/profile_images/519337504214753280/xZ6DzFeB_normal.jpeg", "following" : null, "geo_enabled" : false, "profile_background_image_url" : "http://abs.twimg.com/images/themes/theme1/bg.png", "name" : "Jackie Fenolio", "lang" : "en", "profile_background_tile" : false, "favourites_count" : 39, "screen_name" : "fenolj", "notifications" : null, "url" : null, "created_at" : "Fri Jul 25 22:56:37 +0000 2014", "contributors_enabled" : false, "time_zone" : null, "protected" : false, "default_profile" : true, "is_translator" : false }, "geo" : null, "in_reply_to_user_id_str" : "234797704", "possibly_sensitive" : false, "lang" : "en", "created_at" : "Wed Oct 08 23:53:46 +0000 2014", "filter_level" : "low", "in_reply_to_status_id_str" : null, "place" : null }, "user" : { "follow_request_sent" : null, "profile_use_background_image" : true, "default_profile_image" : false, "id" : 981285295, "verified" : false, "profile_image_url_https" : "https://pbs.twimg.com/profile_images/517386303441481728/RVa6gyU1_normal.jpeg", 
"profile_sidebar_fill_color" : "DDEEF6",

Appendix 2: Twitter App

Appendix 3: MongoDB Database Backup, Restore, and Initialization

1. Log on to your data-mining system.

2. From the terminal, suspend the data collection by pressing “Ctrl + z”.

3. Enter the MongoDB administrative shell by entering the command “mongo”

4. From the MongoDB administrative shell, enter the command “use admin” - This will

switch you to the administrative database.

5. From the administrative database, enter the command “db.shutdownServer()” - This will

shut down the MongoDB service.

6. Navigate to a folder designated to hold the database backups (you can create the folder

locally, or on any other storage medium that is logically connected).

7. From the backup directory, enter the command “sudo mongodump --dbpath

/path/to/your/mongodb” - The backup may take some time to complete, depending on the

size of the databases. The backup will contain every collection in your databases, with

the documents stored in BSON (Binary JSON) and the metadata for each collection

stored in JSON.

8. Once the backup is done, enter the following command to initiate the MongoDB service

and resume logging: “sudo mongod --dbpath /path/to/your/mongodb --fork --logpath

/var/log/mongodb.log”
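Steps 7 and 8 can also be scripted. The following Python sketch is only an illustration under stated assumptions: the function names backup_commands and run_backup are hypothetical, the MongoDB service is assumed to be shut down already (steps 2 through 5), and the installed mongodump is assumed to accept the --dbpath option exactly as used in step 7. By default the sketch only prints the commands; nothing is executed unless dry_run is set to False.

```python
import subprocess


def backup_commands(dbpath, logpath="/var/log/mongodb.log"):
    """Build the commands from steps 7 and 8 as argument lists."""
    dump = ["sudo", "mongodump", "--dbpath", dbpath]
    restart = ["sudo", "mongod", "--dbpath", dbpath,
               "--fork", "--logpath", logpath]
    return dump, restart


def run_backup(dbpath, backup_dir, dry_run=True):
    """Dump the databases into backup_dir, then restart the MongoDB service."""
    commands = list(backup_commands(dbpath))
    for cmd in commands:
        if dry_run:
            # Show what would be run without touching the system.
            print(" ".join(cmd))
        else:
            # Run from the backup directory, as step 7 instructs;
            # check_call raises an error if a command fails.
            subprocess.check_call(cmd, cwd=backup_dir)
    return commands


# Placeholder paths; substitute the ones appropriate to your system.
planned = run_backup("/path/to/your/mongodb", "/path/to/backups")
```

Keeping the commands as argument lists rather than one shell string avoids quoting problems if a path contains spaces.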

Once the MongoDB service is initiated, enter the following command to resume collection, if

need be:

“sudo python twitterstreamtomongodb.py --oauth=oauth.json --server=127.0.0.1 --port=27017 --database=“insert DB name here” --track=objects.txt”8

8 Insert the variables appropriate to your system in the server, port, and database fields.