The Pennsylvania State University
The Graduate School
College of Information Sciences and Technology

DATA MINING ON MICROBLOGGED INFORMATION: GENDER RECOGNITION AND SUICIDE PREVENTION

A Thesis in Information Sciences and Technology
by Hyun-Woo Kim
© 2010 Hyun-Woo Kim
Submitted in Partial Fulfillment of the Requirements for the Degree of Master of Science
August 2010

The thesis of Hyun-Woo Kim was reviewed and approved* by the following:

John Yen
Thesis Advisor

Thesis Co-Advisor

Madhu C. Reddy
Director of Graduate Programs in Information Sciences and Technology
Abstract

With the prosperity of Web 2.0 technologies, microblogging has become one of the most popular services on the Internet. Twitter is currently the most popular microblogging service in the world. Millions of people's thoughts, opinions, and emotions melt into billions of short posts, or tweets, on Twitter. Most tweets are accessible through the Web and Twitter's application programming interface (API). Thanks to its enormous volume of tweets, Twitter has become a living history and a repository of human thoughts. Analyzing microblogged messages is therefore helpful in understanding and predicting human behavior; computational data mining techniques can be used to recognize common patterns of certain groups of people.
Support Vector Machine (SVM) is a nonparametric supervised learning method that constructs a hyper-plane in a high-dimensional space with the largest functional margin to achieve a good separation of multiple groups of microbloggers. In general, a larger margin results in a lower generalization error of the classifier. This thesis adopts Support Vector Machine together with several feature selection methods, including the SVM-RFE (SVM Recursive Feature Elimination), Relief, and InfoGain algorithms, to systematically recognize the genders of microbloggers; it shows what the gender-specific features are and how the feature selection process affects overall classification accuracy.
The gender classifier can be helpful in preventing a serious social problem: suicide. It is known that risk factors for suicidal thoughts vary with gender and age. Suicide is the third leading cause of death among people ages 10 to 24 according to the Centers for Disease Control and Prevention (CDC), and 15 percent of high school students have seriously considered suicide. There is a consistent finding that more than 90% of people who committed suicide had a diagnosable psychiatric disorder. Mental health services can provide relief to people at high risk for suicide. However, we cannot rely solely on mental treatment to solve this problem, given that two-thirds of the people who committed suicide had not received any appropriate treatment.
The tweets posted over the year before their authors committed suicide clearly show that they were suffering from profound depression. Moreover, some of them posted their final messages to the world on their microblogs. We study how a theoretical model describes the nature of many types of suicide, investigate the microblogging behavior of teenagers who killed themselves, and discuss how future research may build a statistical model to measure the degree of one's depression that could be used for suicide prevention.
Table of Contents

1.3. Relevant Research
2.1. Learning Approaches
2.3. Feature selection
2.4. Research Questions
3.1. Feeders
3.2. Controller
4.1. Datasets
4.2. Preprocessing
5.2. Classification Evaluation
6.1. Background
6.3. Cyberbullying
6.4. Possibility
Appendix B Classification Accuracy Details
Appendix C Feature Selection Effect Details
Appendix D Feature Rank Details
List of Figures

Figure 1. The number of tweets about Michael Jackson's death per hour over time
Figure 2. Examples of gender-preferential words
Figure 3. An illustration of support vectors, optimal margin and hyperplane
Figure 4. A linearly separable hyper-plane
Figure 5. A SVM-RFE flow chart
Figure 6. Feature selection accuracy comparison
Figure 7. Components of a tweet crawler
Figure 8. Information flow from the feeders to the controller
Figure 9. A dynamic assignment of Twitter credentials to a remote crawler
Figure 10. An example of a dynamic job assignment
Figure 11. Crawling Tweets
Figure 12. OAuth Authentication Flow
Figure 13. An illustration of the preprocessing tasks
Figure 14. Unigram Occurrences
Figure 15. A high-level overview of the gender classification processes
Figure 16. Classification accuracy of SVM with a linear kernel
Figure 17. Classification accuracy of Naïve Bayes
Figure 18. Classification accuracy of Bayesian Logistic Regression
Figure 19. Classification accuracy of Random Forest
Figure 20. SVM-RFE feature selection algorithm evaluation
Figure 21. Unigrams selected by SVM-RFE
Figure 22. Ranks of selected features
Figure 23. InfoGain feature selection and other classifiers
Figure 24. Relief feature selection and other classifiers
Figure 25. U.S. high school students who seriously considered/attempted suicide in 2009
List of Tables

Table 1. Examples of Social Media
Table 2. Types of Microblogged Information Production
Table 3. Characteristics of Random Forest
Table 4. Who has the most followers/friends on Twitter?
Table 5. Size of the datasets
Table 6. A minimal set of stop words
Table 7. Top 30 frequent unigrams
Table 8. A confusion matrix
Table 9. Classification performance measures
Table 10. Classification details and the top 20 features selected by SVM-RFE
Table 11. A bad case: SVM with all the Boolean features
Table 12. Evaluation of classification with Naïve Bayes
Table 13. Evaluation of classification with Bayesian Logistic Regression
Table 14. Evaluation of classification with Random Forest
Table 15. Classification details and top 20 features selected by InfoGain
Table 16. Classification details and top 20 features selected by Relief
Table 17. Selected tweets posted by AR before her suicide
Table 18. Selected tweets posted by AL before her suicide
Table 19. AR's Frequent Unigrams and Bigrams
Table 20. AL's Frequent Unigrams and Bigrams
Table 21. Frequencies of 5 frequently used unigrams
1.1. Social Media, Blogging and Microblogging
Web-based technologies have created social media in response to the desire for online publication and social interaction; people share what they feel and think freely and instantly on social media. Social media serve many purposes, including collaboration, communication, and data sharing; examples are shown in Table 1. Social technology adoption has increased tremendously over the last decade with the wide use of Web 2.0 capabilities facilitating interactive collaboration and information sharing on the Internet. Forrester, a market research company, reports that 75% of U.S. adults who use the Internet utilized social media in 2008, compared with 56% in 2007 (Bernoff, Pflaum et al. 2008).
Purpose          Application             Services
Collaboration                            CiteULike, Delicious, Google Reader; Digg, Mixx, NowPublic, Reddit; MediaWiki
Communication    Blog                    Blogger, MovableType, WordPress
                 Microblogging           Twitter, Yammer
                 Social Networking       Facebook, LinkedIn, MySpace
Data Sharing     Photo Sharing           Flickr, Picasa
                 Video Sharing           YouTube
                 Presentation Sharing    Scribd
                                         Epinions, MouthShut, Yelp; Google Answers, Yahoo! Answers; Second Life

Table 1. Examples of Social Media
Blogging is defined as a form of journalism for ordinary people to document their lives, provide opinions, express emotions, articulate ideas outside the mass media, and form community forums (Nardi, Schiano et al. 2004). It is part of the allure of blogging that bloggers can easily move between the personal and the professional. Microblogging is defined as a form of blogging promoting shorter posts and frequent updates, thereby fulfilling the need for a faster mode of content generation (Java, Song et al. 2007). Microbloggers often share their daily lives as well as information and news with others. Several types of microblogged information production have been identified by researchers at the University of Colorado while they were analyzing microblogged information on hazard threats: generative, synthetic, and derivative information production (Starbird, Palen et al. 2010). They are briefly described in Table 2. Microblogging is also commonly used at work as an informal communication tool because its brevity, broadcast nature and pervasive access facilitate collaborative work and provide mobility (Zhao and Rosson 2009).
Generative Information Production: Production of an autobiographical narrative or raw material adapting outside information that is to be shaped into more meaningful information through discussion.

Synthetic Information Production: Production of shaped information by synthesizing general facts and outside knowledge.

Derivative Information Production: A user-driven cycle of shaping or re-shaping shared information, involving adding comments, providing links to other web resources, and passing on the information.

Table 2. Types of Microblogged Information Production (Starbird, Palen et al. 2010)
Twitter is a microblogging service providing real-time transmission and retrieval of short text messages. A short message posted on Twitter is referred to as a tweet. A tweet consists of up to 140 characters regardless of language locale. A tweet is either public or private, and public tweets can be viewed by anyone on the Internet. On April 14th, 2010 at Chirp, the official Twitter developer conference, Twitter representatives reported that the number of registered users had reached 105 million, and that new users had been signing up at the rate of about 300,000 per day (Yarow 2010). As far as the volume of data is concerned, the total number of tweets has already passed 19 billion according to GigaTweet's estimation. Twitter is a rich source of human thoughts given its popularity and abundance of data. Therefore, it can be utilized to understand the profound structure of empirical reality and to mine common patterns behind people's everyday lives (Barabási 2010). Twitter has released an official announcement that the U.S. Library of Congress will house every tweet since the inception of the Twitter service on March 21, 2006 for preservation, and will make them available to patrons (Twitter 2010).

1.2. Sentiment Analysis and Gender Recognition
Twitter does not obtain gender information from its users; it asks users to provide only minimal information when they sign up: a full name, account name, password and e-mail address. Due to the absence of gender information, researchers have not yet considered the differences between men and women when performing sentiment analyses on Twitter. However, obtaining gender data for randomly selected users is not difficult. Unlike the web pages themselves, all tweets are shown together with user pictures. Although not all tweets come with real pictures of the writers, it is easily noticed that a significant number of the pictures are real, and fortunately many of them are sufficient to manually determine the genders for training purposes.
Facebook is another social networking service that can be a good source of people's thoughts, and it maintains a much greater variety of user information, including gender. However, access to a user's space on Facebook is mostly restricted to the user's acquaintances; in most cases the owner's approval is needed. For this reason, a systematic collection of conversations on Facebook is not practical, and modeling people's behavior through Facebook data is very difficult unless Facebook releases its data.
The gender recognition of Twitter users is especially valuable when analyzing sentiments about certain objects. For example, marketers may be interested in what brands of goods are preferred or disregarded, and what causes such effects. Indeed, about 20% of tweets are about brands (Spinelle and Messer 2009); these contain requests for product information or responses to such requests. A recent study defined action codes of tweets toward brands, such as question/answer, positive/negative comment, recommendation, and suggestion, as well as brand sentiment such as great, swell, so-so, bad, and wretched (Jansen, Zhang et al. 2009). The authors chose representative brands from different industry sectors, such as Banana Republic and H&M for apparel and Honda and Toyota for automotive, and measured changes in sentiment toward the selected brands over time on Twitter. They pioneered the way to analyze brand sentiments on online social media. However, their work did not consider the differences of preference and opinion between genders; men and women may have different thoughts and opinions regarding a particular brand. It is feasible not only to detect each gender's preferred or disregarded brands, but also to distinguish the changes in brand sentiment between genders over time in response to corporate efforts to improve customer satisfaction.
Gender recognition may also be useful in modeling the dynamics of information propagation in social networks. The SIR model, an epidemiological model consisting of susceptible, infective and removed (or recovered) states, was first formulated in the 1920s to model the spread of epidemic animal diseases, and was refined later on (Newman 2002). In early 2004, Wu et al. reformulated the SIR model into a general model of information flow in social groups and tested it by analyzing the data of incoming and outgoing e-mail through the e-mail server at HP Labs (Wu, Huberman et al. 2004). Their model predicts how long an occurrence of information will persist and how widely it will be spread. The susceptible state of the SIR model corresponds to an occurrence of particular information; the infective state is a state where information is being actively spread through computer networks; and the removed state is a condition where information is no longer spread. The Web Ecology Project, an interdisciplinary research group whose website is available at http://www.webecologyproject.org, conducted a relevant sentiment analysis to detect sadness in tweets in response to Michael Jackson's death on June 25th, 2009 (Kim, Gilbert et al. 2009).
Figure 1. The number of tweets about Michael Jackson’s death per hour over time
(Kim, Gilbert et al. 2009)
They found that the tweets about Michael Jackson's death proceeded at a peak rate of 78 tweets per second, and about three quarters of such tweets contained the word sad. The graph above also indicates that the grief did not last more than several days. They counted the occurrences of the words that appeared in the dataset but did not use any model to explain the spread of the sadness. Treating the sadness as a kind of epidemic disease, one may divide the implicated people into three groups: those in the susceptible, infected and removed states, respectively. Once again, no information-spread model so far has considered gender-specific features. It cannot be overemphasized that how people react to information often depends on their emotional sensitivity, especially when the information is related to human matters such as death.
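To make the three states concrete, here is a minimal sketch that simulates the classic SIR dynamics with a simple Euler discretization. It is written in Python; the transmission rate (beta), removal rate (gamma), time step, and initial conditions are illustrative assumptions, not values from Wu et al. or the Web Ecology Project.

    # Minimal SIR sketch: s, i, r are population fractions in the
    # susceptible, infective and removed states.
    def simulate_sir(s=0.99, i=0.01, r=0.0, beta=0.5, gamma=0.1, dt=0.1, steps=1000):
        history = []
        for _ in range(steps):
            new_infections = beta * s * i * dt   # S -> I transitions
            new_removals = gamma * i * dt        # I -> R transitions
            s -= new_infections
            i += new_infections - new_removals
            r += new_removals
            history.append((s, i, r))
        return history

    history = simulate_sir()
    print("peak infective fraction: %.3f" % max(i for _, i, _ in history))

The rise and decay of the infective fraction in such a run mirrors the burst and fade of grief-laden tweets shown in Figure 1.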
1.3. Relevant Research
It has been shown by Corney and his colleagues that machine learning techniques can be used to perform gender classification (Corney, Vel et al. 2002). They used an algorithm called Support Vector Machine (SVM) to classify e-mails according to the authors' genders. SVM is a supervised learning algorithm suitable for classification and regression which constructs a hyper-plane or set of hyper-planes in a high-dimensional space. When training a SVM classifier, constructing a hyper-plane that has the largest distance to the nearest input data points of any class is the key to achieving a good degree of separation, which generally leads to lower generalization error. Getting back to Corney and his colleagues' experiment, they manually defined a set of 222 features comprising document-based, word-based, character-based and suffix-based features. Their document-based features are the ratio of the number of blank lines to the total number of lines, and the average number of words in a sentence, for each email. Their word-based features include the average word length, the number of function words and the number of short words. One of their character-based features is the number of punctuation characters divided by the total number of characters in an email. Their suffix-based features are the occurrences of nine suffixes, including -able, -al, -ful, -ible, -less, -ly and -ous, intended to reveal whether any of these is gender-preferential. They specially grouped these nine suffix-based features and the occurrences of two words, sorry and apology, as gender-preferential language attributes, even though the effects of these attributes were not evaluated. They achieved an accuracy of up to 70% when classifying 800 emails with a SVM classifier. One thing they could have done is adopt a sophisticated feature selection method to filter out irrelevant features from their manually defined feature set and obtain more useful features from their training data.
Figure 2. Examples of gender-preferential words (Cheong and Lee 2009)
Other research has shown that some words, such as coffee and greys, are more frequently used by females whereas h1n1 and nizar are more frequently used by males (Cheong and Lee 2009). The word greys is part of the name of the American television show Grey's Anatomy, and nizar refers to a politician involved in a constitutional crisis in Malaysia.
A similar approach will be tried in this thesis research. However, the concentration will be on constructing feature vectors using gender-preferential words rather than counting specific parts of speech or function words. The reason is, as discussed in the previous section, that tweets are often a self-expression of thoughts and opinions. They also often discuss the goods around us. Intuitively, women are more likely to talk about H&M whereas men are more likely to talk about Nike when discussing apparel.

Anticipated problems are those related to the exponential growth of hypervolume as a function of dimensionality, which is referred to as the curse of dimensionality (Bellman 1957; Bellman 1961). If every unique word used on Twitter were used for constructing feature vectors, the number of features would easily run into the thousands. It is likely that a large portion of such a feature set is irrelevant to the labels (e.g. genders). In this case, a classifier would use almost all its resources to represent the irrelevant features and would behave relatively badly. More seriously, the classification task may terminate abnormally due to insufficient memory. Consequently, it is important to carefully select features that are influential in reaching a better performance.
2.1. Learning Approaches

There are three major types of machine learning approaches: unsupervised learning, reinforcement learning, and supervised learning. Unsupervised learning lets a computer find patterns in the input data even though the algorithm does not receive any explicit feedback; clustering is one of the most common unsupervised learning tasks. Reinforcement learning (Sutton and Barto 1998) lets a computer learn from a series of trials and errors; discovering effective actions in order to minimize exploration is important in reinforcement learning. In supervised learning, a computer observes input-output pairs and builds a mathematical function mapping input vectors to output vectors. Supervised learning can be used for both classification and regression tasks. A classification task assigns each input to a discrete category, whereas a regression task computes a desired output comprising one or more continuous variables. In this chapter, we will briefly explore four popular supervised learning algorithms that are known to be suitable for text categorization: Support Vector Machine (SVM), Naïve Bayes, Bayesian Logistic Regression, and Random Forest. Also covered are three feature selection algorithms to reduce the dimensionality of the input data space: Information Gain, Relief, and SVM Recursive Feature Elimination (SVM-RFE).
2.2. Supervised Learning for Text Categorization
In text categorization, each word, or unigram, appearing in a document is a feature for classification. Thus, the number of features can be very large, causing high dimensionality. The challenge is how to classify in a computationally efficient way while avoiding overfitting. Support Vector Machine (SVM), Naïve Bayes, and Bayesian Logistic Regression are known as suitable supervised learning algorithms for text categorization.

SVM is currently one of the most popular supervised learning algorithms (Russell and Norvig 2009), and has three main properties. First, it constructs a maximum margin separator, a decision boundary with the largest possible distance to example points. Second, it creates a linearly separating hyper-plane. Using the kernel trick, it embeds input data into a high-dimensional space: even when input data are not linearly separable in the original input space, they may be linearly separable in a higher-dimensional space, as shown in Figure 4. In other words, a linear separator in a higher-dimensional space is usually a nonlinear separator in the original space. Third, it is a nonparametric learning algorithm which uses all input data to make each prediction.
Figure 3. An illustration of support vectors, optimal margin and hyperplane
The idea behind kernel machines was introduced in the 1960s (Aizerman, Braverman et al. 1964) but was not fully developed until Boser et al. introduced the kernel trick in 1992 (Boser, Guyon et al. 1992). SVM has proven effective for text categorization (Joachims 2001) and handwritten digit recognition (DeCoste and Schölkopf 2002). Nevertheless, SVMs have some limitations stemming from the curse of dimensionality and global templates (Bengio and LeCun 2007).
Naïve Bayes (Hilden 1984) is a simple probabilistic model based on Bayes' theorem. It assumes that the presence or absence of a feature is unrelated to the presence or absence of any other feature of the input data; that is, it assumes that all features are conditionally independent. In spite of its simplicity, it has shown good performance on many complex problems (Domingos and Pazzani 1997; Hand and Yu 2001; Rish 2001). Moreover, it is an efficient algorithm because it can estimate all the parameters needed for classification from a small amount of training data.
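Written out for this setting, the conditional independence assumption yields the standard Naïve Bayes decision rule below, with gender class $c \in \{M, F\}$ and unigram features $x_1, \ldots, x_n$ (the notation here is added for clarity and is not taken from the thesis text):

    P(c \mid x_1, \ldots, x_n) \propto P(c) \prod_{i=1}^{n} P(x_i \mid c),
    \qquad \hat{c} = \operatorname{argmax}_{c \in \{M, F\}} P(c) \prod_{i=1}^{n} P(x_i \mid c)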
Bayesian Logistic Regression is a logistic regression using sparse Bayesian models and
Laplace approximation (Genkin, Lewis et al. 2007; Nemes, Jonasson et al. 2009). Logistic
regression is a generalized linear model to predict the probability of an event occurrence by
fitting input data to a logistic curve.
Even though a certain model may be well suited to some problems, it may not work well when applied in the context of other problems, and it is sometimes hard to find a good model for a specific problem. An ensemble method is a supervised learning method combining multiple models to achieve better predictive performance while promoting diversity among the models (Opitz and Maclin 1999). Random Forest (Breiman 2001) is an ensemble classifier that builds multiple decision trees with controlled variation; its output is the mode of the results from the individual trees. It assures controlled variation by combining the bagging idea (Breiman 1996) with random selection of features. The major advantages and disadvantages of Random Forest are summarized in Table 3.
Advantages:
- Produces a highly accurate classifier
- Handles a large number of features
- Estimates the importance of each feature

Disadvantages:
- Prone to overfitting
- Incompetent at dealing with a large number of irrelevant features
- Computationally expensive

Table 3. Characteristics of Random Forest
2.3. Feature selection
Feature selection is a process to reduce the size of a feature set. In many supervised learning problems, feature selection is important for several reasons, including generalization performance and running-time requirements. More importantly, a large number of features is problematic due to high dimensionality. The main idea is to select the features that are influential in the classification task. When there are many irrelevant or less important features, feature selection may improve classification accuracy at less computational cost.

SVM-RFE (SVM Recursive Feature Elimination) is a patented linear multivariate feature selection method (U.S. Patent No. 7,117,188) that was originally developed to find genes biologically related to cancer tissues, and was tested on a colon cancer dataset (Guyon, Weston et al. 2002). It begins with all the features of a given dataset and then recursively removes useless features from the feature set until there is no performance degradation, as depicted in Figure 5. The SVM-RFE authors experimentally demonstrated that SVM-RFE outperforms other feature selection algorithms in terms of classification accuracy. Figure 6 compares SVM-RFE with other gene selection algorithms and shows that SVM-RFE outperforms the others. Note that SVM was used as the classifier in all cases except the one denoted by a dashed blue curve, where the baseline feature selection and the baseline classifier were tried. The evaluation utilized the Leave-One-Out (LOO) cross validation technique.
Figure 5. A SVM-RFE flow chart. Adapted from (Guyon 2007)

Figure 6. Feature selection accuracy comparison. Adapted from (Guyon, Weston et al. 2002)
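The elimination loop itself is compact. The sketch below uses Python with scikit-learn, which is an assumption of convenience rather than the toolkit behind the thesis experiments; it removes one feature per round, ranked by the squared weight of a linear SVM, and the synthetic data are purely illustrative.

    import numpy as np
    from sklearn.svm import SVC

    def svm_rfe(X, y, n_keep):
        """Recursively eliminate the feature with the smallest linear-SVM weight."""
        remaining = list(range(X.shape[1]))
        while len(remaining) > n_keep:
            clf = SVC(kernel="linear", C=1.0).fit(X[:, remaining], y)
            ranking = clf.coef_[0] ** 2              # SVM-RFE ranking criterion
            del remaining[int(np.argmin(ranking))]   # drop the weakest feature
        return remaining

    # Illustrative run: only features 3 and 7 determine the labels.
    rng = np.random.RandomState(0)
    X = rng.randn(200, 50)
    y = (X[:, 3] + X[:, 7] > 0).astype(int)
    print(svm_rfe(X, y, n_keep=2))   # expected to retain indices 3 and 7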
The Relief algorithm (Kira and Rendell 1992) is another feature selection algorithm, inspired by instance-based learning, that uses a simple statistical method rather than heuristic search. It ranks all the features according to their relevance to the targets. It is suitable for two-class classification and needs sufficient training instances. Another simple feature selection algorithm, called InfoGain feature selection, uses the measure of information gain (Kullback and Leibler 1951). Like the Relief algorithm, it ranks all the features, in this case according to their information gain values.
2.4. Research Questions
The classification and feature selection algorithms described here will be used in the subsequent experiments to answer the following research questions.

Q1. Are there gender-preferential features in microblogged information?
Q2. Is gender classification with machine learning methods achievable?
Q3. Do traditional text classification methods work well with microblogged information?
Q4. Is a feature selection process also important in gender classification?

Gender-preferential words can be excellent features of microblogged information for gender classification. If there are many, the next step would be choosing a suitable machine learning method to build a gender classification model. We will first see whether it works well with SVM, which is known as a state-of-the-art classification method. Another interesting question is whether any traditional machine learning algorithm performs gender classification reasonably well. Last but not least, we want to determine how the feature selection process affects classification accuracy.
Chapter 3 Tweet Crawlers
Most tweets are open to the public on the Internet. Twitter provides an API (Application Programming Interface) for easy access to tweets. Hence, it is not necessary to implement a crawler that parses messy HTML (Hypertext Markup Language) code to extract microblogged information from Twitter web pages. However, there is a strict traffic limit to follow when using the API: the number of requests to Twitter must not exceed 150 per hour. In other words, a normal tweet crawler can send at most 3,600 requests a day. The main reason for such a limit is that Twitter does not want a mass collection of its data. However, a group of distributed tweet crawlers can gather more data per unit time. If ten simultaneous crawlers are running, each assigned a different IP address and all connected to one central database management system, they can process 36,000 requests a day, or 252,000 requests per week. Figure 7 depicts the components of the distributed tweet crawlers; a sketch of the per-crawler pacing follows.
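The following Python fragment keeps a single crawler under the 150-requests-per-hour limit; the fetch and store callables are hypothetical stand-ins for the Twitter API call and the central database, not the thesis implementation.

    import time

    MAX_REQUESTS_PER_HOUR = 150                  # Twitter API traffic limit
    DELAY = 3600.0 / MAX_REQUESTS_PER_HOUR       # 24 seconds between requests

    def crawl(usernames, fetch, store):
        """Fetch each assigned user's tweets, pacing requests under the limit."""
        for name in usernames:
            started = time.time()
            store(name, fetch(name))             # one API request, then hand off
            elapsed = time.time() - started
            if elapsed < DELAY:
                time.sleep(DELAY - elapsed)

    # Illustrative use with stub callables.
    crawl(["BarackObama", "britneyspears"],
          fetch=lambda name: ["...tweet text..."],
          store=lambda name, tweets: print(name, len(tweets)))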
3.1. Feeders
There are two feeders in the system: the username feeder and the account/password feeder. The username feeder gets the list of usernames on Twitter and distributes it to the tweet crawlers running on different machines with different IPs. Table 4 shows how many followers some celebrities have. Given the large numbers of followers and friends of the celebrities, it is not difficult to get a list of at least several million users from Barack Obama, Britney Spears, etc.
Account (screen name)             Location          Followers    Friends    Updates
Barack Obama (BarackObama)        Washington, DC    2,587,692    751,161    402
Whole Foods Market (WholeFoods)   Austin, TX        1,552,072    535,498    5,098
DowningStreet (DowningStreet)
Britney Spears (britneyspears)    Los Angeles, CA   3,703,122    430,604    312
Zappos.com CEO Tony (zappos)      Las Vegas         1,491,167    396,625    1,878

Table 4. Who has the most followers/friends on Twitter?
Retrieved from http://twitterholic.com/top100/friends/
The other feeder is the account/password feeder. Since the whole system consists of multiple tweet crawlers, and each individual crawler runs remotely on an independent machine, the crawler controller needs to maintain all Twitter accounts and corresponding passwords so that it can dynamically assign a pair to a crawler. Crawlers use this information to sign into Twitter whenever they begin to collect tweets. Adding, removing, and modifying an account/password pair are all done through the account/password feeder.
Figure 8. Information flow from the feeders to the controller
3.2. Controller
The crawler controller manages the concurrent crawlers running on remote machines. The controller assigns a username to a crawler according to the crawlers' availability. The controller also dynamically assigns a Twitter account and password pair to each crawler when it begins to run. The dynamic assignment makes the system scalable: we can register additional accounts with the controller and simply run more crawlers to increase data collection capacity.
Figure 9. A dynamic assignment of Twitter credentials to a remote crawler
After the controller dynamically assigns a Twitter account and its corresponding password to a newly launched crawler, the crawler tries to sign in with that account and password. If the sign-in is successful, the crawler acknowledges it to the controller, which then assigns a job to the crawler as shown in Figure 10.
Figure 10. An example of a dynamic job assignment
3.3. Distributed Crawlers
The number of tweet crawlers may vary. Figure 11 depicts the crawling process. The crawler controller keeps its connections to all crawlers alive and sends orders to each crawler. For example, the controller asks crawler #1 to get all tweets of Barack Obama while asking crawler #2 to get all tweets of Britney Spears. Each crawler then sweeps over all of the tweets of the designated person through the Twitter API and sends them to the central database, which stores the data in a searchable form for later retrieval.
Figure 11. Crawling Tweets
Tweet collection previously required frequent authentication with raw user credentials, creating the potential for credential leaks. Twitter now requires all of its clients to follow an authentication protocol called OAuth for secure and efficient communication (Doty 2009). The authentication flow of the OAuth protocol is shown in Figure 12.
Figure 12. OAuth Authentication Flow (Atwood, Conlan et al. 2007)
4.1. Datasets
There are many social groups of people who may have different interests. To make sure that the experimental results are not biased by a certain group of people, microblogged information is collected from three different groups of Twitter users: faculty and students at Penn State University, celebrities, and active microbloggers, denoted by PSU, CEL, and RND respectively for convenience. Active microbloggers are Twitter users who have posted at least 100 tweets. The members of the PSU and CEL groups are manually chosen, whereas the members of the RND group are randomly chosen by a computer program from all active Twitter users whose language is English. There are 30 people in each group, of which 15 are male and 15 are female. In addition to the three groups, one special group containing all of the people of the three groups is maintained and denoted by ALL; thus, there are 90 people in the ALL group. A set of tweets accumulated over time from each individual is collected by the tweet crawlers. For the PSU and CEL groups, up to 100 tweets are collected from each individual; for the RND group, given its diversity, up to 400 tweets are collected from each individual. The number of tweets, words and characters in each tweet dataset is shown in Table 5.
Table 5. Size of the datasets
In each dataset, the volume of microblogged information from males is larger than that from females. Some data balancing may become necessary if classifiers give too much importance to one gender; at this point, however, we proceed to the next step without balancing.
4.2. Preprocessing
Several refining tasks are performed in the preprocessing step; Figure 13 illustrates them. First of all, it is necessary to remove stop words, links to web resources, user screen names prefixed with the symbol @ (e.g. @obama), hashtags prefixed with the symbol # (e.g. #health), RT (retweet) tags, and any HTML tags that come with the texts. A stop word is a word that is very commonly used but is not meaningful; it is often a function word (e.g. the definite or indefinite article) or a number. The reason for excluding stop words is that they are not gender-preferential words. To make matters worse, they unnecessarily and significantly increase the dimension of the input feature vectors. A minimal set of stop words is shown in Table 6 as an example. A total of 579 stop words are used in the experiments for this thesis.
Figure 13. An illustration of the preprocessing tasks
I a about an are as at be by com de en for from how
in is it la of on or that the this to was what when where
who will with und the www
Table 6. A minimal set of stop words
(Source: http://www.ranks.nl/resources/stopwords.html)
Second, word stemming, a process of eliminating common inflectional endings from words, is performed to merge the various forms of a word with the same base. For example, the Porter stemming algorithm (Porter 1980) transforms loves, loved and loving into love. Consider the following sentence as an example.

The way to the bridge building on the Penn State campus is long and hard when walking in the rain.

There are seven general stop words in the sentence: the, to, on, is, and, when and in. Stop word removal and word stemming transform the original sentence into:

wai bridg build penn state campu long hard walk rain
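A minimal sketch of these two steps follows, in Python. NLTK's Porter stemmer is an assumed stand-in for whatever implementation the thesis used, and the stop-word set here is only a toy subset of the 579 actually used.

    import re
    from nltk.stem.porter import PorterStemmer

    STOP_WORDS = {"the", "to", "on", "is", "and", "when", "in"}   # toy subset
    # The original 1980 algorithm reproduces stems like way -> wai, happy -> happi.
    stemmer = PorterStemmer(mode=PorterStemmer.ORIGINAL_ALGORITHM)

    def preprocess(tweet):
        tweet = re.sub(r"https?://\S+", " ", tweet)   # links to web resources
        tweet = re.sub(r"[@#]\w+", " ", tweet)        # @screen_names and #hashtags
        tweet = re.sub(r"\bRT\b", " ", tweet)         # retweet tags
        words = re.findall(r"[a-z]+", tweet.lower())
        return [stemmer.stem(w) for w in words if w not in STOP_WORDS]

    print(preprocess("The way to the bridge building on the Penn State campus "
                     "is long and hard when walking in the rain."))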
4.3. Corpus Building
A corpus of 18,708 unigrams from all the tweets has been built for generating feature vectors; every unigram has a unique ID. Unigrams are sorted by occurrences in descending order, so a lower unigram ID is associated with a more frequent unigram. The top 30 unigrams by occurrence are shown below. Note that all unigrams are stemmed by the Porter stemming algorithm. The most frequent unigram, dai, appears 1,471 times across all tweets; a unigram may appear more than once in a tweet. The unigram dai originates from the words day and days.
IDs 1–15 (occurrences: 1471, 1400, 1299, 1226, 1117, 1052, 932, 874, 760, 659, 614, 588, 584, 574, 557); unigrams captured: dai (day), good, love, watch, great, lol, peopl (people), night, twitter, tonight, show, …
IDs 16–30 (occurrences: 498, 486, 471, 470, 456, 456, 451, 446, 441, 440, 438, 433, 426, 421, 417); unigrams captured: happi (happy), morn (morning), tomorrow, week, gui (guy), word, fun, …

Table 7. Top 30 frequent unigrams
Figure 14 shows how the occurrence of the nth most frequent word changes as n increases. The curve follows a power-law distribution: a few top unigrams appear extensively, whereas the 9,384 unigrams in the tail appear just once.

Figure 14. Unigram Occurrences
4.4. Feature Vector Building
A feature vector has a label (gender: M or F) and tweet feature values. The number of elements equals the number of features selected by a feature selection process. Each tweet is represented as a feature vector. Three different types of feature vectors are generated for comparison, since the feature type may affect classification accuracy. The first uses Boolean feature values (0 or 1), so that only the presence or absence of the selected features is used as input data. The second uses count values, where each feature value represents how many times the associated unigram appears in the tweet. The third uses TF-IDF (term frequency–inverse document frequency) values; a high TF-IDF value results from a high term frequency and a low document frequency.

A tweet dataset $D$ of $n$ tweets is defined as $D = \{d_1, d_2, \ldots, d_n\}$, where each element of the set represents an individual tweet that consists of terms. The TF (term frequency) of term $i$ in a given tweet $j$ is the number of times the term appears in that tweet, normalized by the tweet's length:

    tf_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}

where $n_{i,j}$ is the number of occurrences of term $i$ in tweet $j$, and the denominator $\sum_k n_{k,j}$ is the sum of the numbers of occurrences of all terms in tweet $j$. The IDF (inverse document frequency) measures how important a term is. It is formally defined as

    idf_i = \log \frac{|D|}{|\{j : t_i \in d_j\}|}

where $|D|$ represents the total number of tweets, and $|\{j : t_i \in d_j\}|$ the document frequency of term $i$ (the number of tweets in which the term appears). The TF-IDF of term $i$ residing in tweet $j$ is then defined as

    tfidf_{i,j} = tf_{i,j} \times idf_i
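The following short Python sketch computes these quantities exactly as defined above; it is a toy illustration, not the thesis pipeline, and the example tweets are invented.

    import math
    from collections import Counter

    def tf_idf(tweets):
        """tweets: list of token lists; returns one {term: tf-idf} dict per tweet."""
        n = len(tweets)
        doc_freq = Counter(t for tweet in tweets for t in set(tweet))
        vectors = []
        for tweet in tweets:
            counts = Counter(tweet)
            total = sum(counts.values())                         # sum_k n_{k,j}
            vectors.append({t: (c / total) * math.log(n / doc_freq[t])
                            for t, c in counts.items()})
        return vectors

    docs = [["dai", "good", "love"], ["dai", "tea"], ["tie", "polit"]]
    print(tf_idf(docs)[1])   # tea outscores dai: equally frequent here, rarer overall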
Chapter 5 Gender Classification
In Chapter 2, we explored four supervised learning algorithms suitable for text classification (Support Vector Machine, Naïve Bayes, Bayesian Logistic Regression, and Random Forest) and feature selection algorithms that work with them (SVM Recursive Feature Elimination, InfoGain, and Relief). This chapter is dedicated to comprehensive analyses of the pairs of classification and feature selection algorithms. We will see how accurately each pair of algorithms classifies genders.
Figure 15. A high-level overview of the gender classification processes
5.1. Classification Performance Measures
A confusion matrix (Provost and Kohavi 1998) for a two-class classifier is a 2-by-2 matrix whose elements represent information about actual and predicted classifications. The following table shows a confusion matrix for a two-class classifier.

                    Classified as Negative   Classified as Positive
Actually Negative   a (true negative)        b (false positive)
Actually Positive   c (false negative)       d (true positive)

Table 8. A confusion matrix

Each element in the confusion matrix is interpreted as follows: a (true negative) is the number of negative instances correctly classified, b (false positive) is the number of negative instances incorrectly classified, c (false negative) is the number of positive instances incorrectly classified, and d (true positive) is the number of positive instances correctly classified.
The accuracy (AC) is the proportion of the total number of correctly classified instances to all the instances:

    AC = \frac{a + d}{a + b + c + d}

The recall, or true positive rate (TP), is the proportion of positive instances correctly classified to all the positive instances:

    TP = \frac{d}{c + d}

The false positive rate (FP) is the proportion of negative instances incorrectly classified as positive to the number of all the negative instances:

    FP = \frac{b}{a + b}

The true negative rate (TN) is the proportion of negative instances correctly classified to the number of all the negative instances:

    TN = \frac{a}{a + b}

The false negative rate (FN) is the proportion of positive instances incorrectly classified as negative to the number of all the positive instances:

    FN = \frac{c}{c + d}

The precision (P) is the proportion of positive instances correctly classified (true positives) to the number of all the instances classified as positive:

    P = \frac{d}{b + d}

The F-Measure (Lewis and Gale 1994), also known as the balanced F-score, is an accuracy measure ranging from zero (worst) to one (best). It is the harmonic mean of precision and recall, formally defined as

    F = \frac{2 \times P \times TP}{P + TP}

When all positive instances are incorrectly classified, the F-measure has a value of zero. A summary of these measures follows.

TP Rate = d/(c+d)   FP Rate = b/(a+b)   Precision = d/(b+d)   Recall = d/(c+d)   F-Measure = 2·P·R/(P+R)

Table 9. Classification performance measures
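A small Python helper computes every measure above from the four confusion-matrix cells; the cell names follow Table 8, and the sample values are those of Table 11.

    def measures(a, b, c, d):
        """a = true negatives, b = false positives, c = false negatives, d = true positives."""
        precision = d / (b + d)
        recall = d / (c + d)                          # true positive rate
        return {
            "accuracy": (a + d) / (a + b + c + d),
            "fp_rate": b / (a + b),
            "tn_rate": a / (a + b),
            "fn_rate": c / (c + d),
            "precision": precision,
            "recall": recall,
            "f_measure": 2 * precision * recall / (precision + recall),
        }

    # Table 11's confusion matrix: a=28, b=17, c=10, d=35.
    print(measures(28, 17, 10, 35))   # accuracy 0.70, F-measure about 0.722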
5.2. Classification Evaluation
Support Vector Machine
Figure 16 shows that the SVM-RFE feature selection method works very well with a SVM classifier. There are many parameters to set when using SVM; specifically, the C-SVC type of SVM with a linear kernel and a cost parameter (C value) of 1.0 is used to generate the figure. When a small number of selected features (roughly fewer than 2,000) are used and the features are Boolean values, it is surprising that the combination of SVM and SVM-RFE yields an accuracy very close or equal to 100% on average under 10-fold cross validation. It is also clear that using all the features is not a good idea, even when they fit into memory and the desired algorithms run. NONE feature selection is not actually a feature selection method: it simply takes the top n frequent unigrams from the corpus as features, and is used to show how the real feature selection methods differ from this baseline. Note that NONE begins with 50–70% accuracy and then constantly yields about 70%. It is also worth noting that SVM-RFE takes a long time when there are many features to select: it took about 1–2 hours on a 2GHz Linux machine to rank all the features, whereas InfoGain accomplished the same task in a minute. Even though SVM-RFE outperforms the others, InfoGain or Relief should be considered when there is a time constraint on classification.
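As a hedged sketch of this setup in Python with scikit-learn (an assumed library choice, not necessarily the thesis toolkit), with random stand-ins for the real Boolean feature vectors and gender labels of Section 4.4:

    import numpy as np
    from sklearn.svm import SVC
    from sklearn.model_selection import cross_val_score

    # Stand-ins: 90 users, 500 Boolean unigram features, 45 male / 45 female labels.
    rng = np.random.RandomState(0)
    X = rng.randint(0, 2, size=(90, 500)).astype(float)
    y = np.array([0] * 45 + [1] * 45)

    clf = SVC(kernel="linear", C=1.0)            # C-SVC, linear kernel, C = 1.0
    scores = cross_val_score(clf, X, y, cv=10)   # 10-fold cross validation
    print("mean accuracy: %.3f" % scores.mean())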
Table 10 (right) shows the top 20 features selected by SVM-RFE. Red (underlined) unigrams are more likely to be used by women, whereas blue unigrams (without underlines) are more likely to be used by men. The unigram tie is a good feature for gender classification because it never appears in the women's tweets while appearing 4 times in the tweets of 3 of the 15 men. The unigram yai, a stemmed form of yay, appears twice in the tweets of only 1 man while appearing 22 times in the tweets of 9 women. The unigram tea shows a similar pattern: it never appears in the men's tweets whereas it appears 8 times in the tweets of 4 women. These statistically strong gender-preferential features likely helped SVM classify much more accurately.
Figure 16. Classification accuracy of SVM with a linear kernel
(Dataset: ALL, Cross-validation: 10-fold)
SVM (linear kernel), Dataset: ALL, Cross-validation: 10-fold
Correctly Classified Instances: 90 (100%); Incorrectly Classified Instances: 0 (0%)
Confusion Matrix:
             classified as M   classified as F
Actually M   45                0
Actually F   0                 45

Top 20 features selected by SVM-RFE (rank, unigram ID, unigram; entries as recovered):
1 1922 tie; 2 2783 whoop; 3 154 yai; … write …; 14 273 tea; 15 393 polit; 16 1122 fantasi; 17 1193 main; 18 562 impress; 19 114 person; 20 1711 adventur

Table 10. Classification details and the top 20 features selected by SVM-RFE
SVM (linear), Dataset: ALL (Boolean), 10-fold cross validation, # of features: 18,708 (all)
Correctly Classified Instances: 63 (70%); Incorrectly Classified Instances: 27 (30%)

TP Rate   FP Rate   Precision   Recall   F-Measure   Class
0.622     0.222     0.737       0.622    0.675       M
0.778     0.378     0.673       0.778    0.722       F

Confusion Matrix:
                        classified as M   classified as F
Actually M (negative)   28                17
Actually F (positive)   10                35

Table 11. A bad case: SVM with all the Boolean features
A bad case is observed when all features are used. In such a case, feature selection methods are of no help because no features are discarded. As the confusion matrix in Table 11 shows, the classifier leans slightly toward the female group; the false negative rate is much higher than the false positive rate.
Naïve Bayes
Despite the extreme simplicity of the Naïve Bayes algorithm, it performs well with SVM-RFE, especially when a small number of TF-IDF features are used.
Figure 17. Classification accuracy of Naïve Bayes
(Dataset: ALL, Cross-validation: 10-fold)
Upper Left: Boolean features, Upper Right: Count features, Bottom: TF-IDF features
Naïve Bayes, Dataset: ALL, 10-fold CV, # of features: 10,000 (one case selected by SVM-RFE, the other by InfoGain)

First case:
Correctly Classified Instances: 76 (84.4%); Incorrectly Classified Instances: 14 (15.6%)
TP Rate   FP Rate   Precision   Recall   F-Measure   Class
0.822     0.133     0.86        0.822    0.841       M
0.867     0.178     0.83        0.867    0.848       F
Confusion Matrix:
                        classified as M   classified as F
Actually M (negative)   38                7
Actually F (positive)   3                 42

Second case:
Correctly Classified Instances: 62 (68.9%); Incorrectly Classified Instances: 28 (31.1%)
TP Rate   FP Rate   Precision   Recall   F-Measure   Class
0.667     0.289     0.698       0.667    0.682       M
0.711     0.333     0.681       0.711    0.696       F
Confusion Matrix:
                        classified as M   classified as F
Actually M (negative)   30                15
Actually F (positive)   13                32

Table 12. Evaluation of classification with Naïve Bayes
Table 12 shows details of two cases. 10,000 features are used in both, but one set is selected by SVM-RFE and the other by the InfoGain feature selection algorithm; all other conditions are kept the same. Apparently, even with the same number of features, there may be a huge difference in classification accuracy depending on the feature selection method.
Bayesian Logistic Regression
Bayesian Logistic Regression works best with SVM-RFE regardless of the feature type used. However, its accuracy is much worse when TF-IDF features are supplied.
Figure 18. Classification accuracy of Bayesian Logistic Regression (Dataset: ALL, Cross-validation: 10-fold). Upper Left: Boolean features, Upper Right: Count features, Bottom: TF-IDF features
Bayesian Logistic Regression, Dataset: ALL (Boolean), 10-fold CV, # of features: 50

Selected by SVM-RFE:
Correctly Classified Instances: 90 (100.0%); Incorrectly Classified Instances: 0 (0.0%)
TP Rate   FP Rate   Precision   Recall   F-Measure   Class
1         0         1           1        1           M
1         0         1           1        1           F
Confusion Matrix:
                        classified as M   classified as F
Actually M (negative)   45                0
Actually F (positive)   0                 45

Selected by InfoGain:
Correctly Classified Instances: 75 (83.3%); Incorrectly Classified Instances: 15 (16.7%)
TP Rate   FP Rate   Precision   Recall   F-Measure   Class
0.822     0.156     0.841       0.822    0.831       M
0.844     0.178     0.826       0.844    0.835       F
Confusion Matrix:
                        classified as M   classified as F
Actually M (negative)   37                8
Actually F (positive)   7                 38

Table 13. Evaluation of classification with Bayesian Logistic Regression
Random Forest
A Random Forest classifier with 30 decision trees is used to generate the results. SVM-RFE has been the best feature selection method for the other classification algorithms, but it does not work well with Random Forest on any of the datasets. The best result (93.3% accuracy) is achieved by InfoGain feature selection with TF-IDF features.
Figure 19. Classification accuracy of Random Forest (Dataset: ALL, Cross-validation: 10-fold). Upper Left: Boolean features, Upper Right: Count features, Bottom: TF-IDF features
Random Forest (30 decision trees), Dataset: ALL (TF-IDF), 10-fold CV, # of features: 4,000

Selected by InfoGain:
Correctly Classified Instances: 84 (93.3%); Incorrectly Classified Instances: 6 (6.7%)
TP Rate   FP Rate   Precision   Recall   F-Measure   Class
0.867     0         1           0.867    0.929       M
1         0.133     0.882       1        0.938       F
Confusion Matrix:
                        classified as M   classified as F
Actually M (negative)   39                6
Actually F (positive)   0                 45

Selected by Relief:
Correctly Classified Instances: 73 (81.1%); Incorrectly Classified Instances: 17 (18.9%)
TP Rate   FP Rate   Precision   Recall   F-Measure   Class
0.8       0.178     0.818       0.8      0.809       M
0.822     0.2       0.804       0.822    0.813       F
Confusion Matrix:
                        classified as M   classified as F
Actually M (negative)   36                9
Actually F (positive)   8                 37

Table 14. Evaluation of classification with Random Forest
When using a Random Forest classifier, choosing the right number of decision trees is an important factor. Although a search for a good number of trees was not attempted in this research, it is worthwhile when other classifiers underperform. Nevertheless, using a very large number is not a good idea: empirically, once the number of trees passes a certain point, accuracy does not increase anymore. Finding a reasonable number is therefore a key to achieving both computational efficiency and classification accuracy; a sketch of such a search follows.
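A short sketch of searching over the number of trees, again in Python with scikit-learn and synthetic stand-in data (an illustrative assumption; the thesis fixed the forest at 30 trees):

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    rng = np.random.RandomState(0)
    X = rng.randn(90, 200)
    y = (X[:, 0] + X[:, 1] > 0).astype(int)    # synthetic stand-in for gender labels

    for n_trees in (10, 30, 100, 300):
        clf = RandomForestClassifier(n_estimators=n_trees, random_state=0)
        acc = cross_val_score(clf, X, y, cv=10).mean()
        print(n_trees, round(acc, 3))          # accuracy typically plateaus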
5.3. Feature selection Evaluation
SVM Recursive Feature Elimination
From the figures in Sections 5.1 through 5.4, we have seen that feature selection algorithms play a key role in improving classification accuracy. SVM-RFE works especially well with every classification algorithm except Random Forest when Boolean features are supplied. For count features, SVM is the best when a relatively small number of features (e.g. 100) is used.
Figure 20. SVM-RFE feature selection algorithm evaluation (Left: Boolean features, Right: Count features; curves: SVM (Linear), Naive Bayes, Bayesian Logistic Regression, Random Forest (N=30))
Figure 21 shows which unigrams are selected by SVM-RFE when 100 and 1,000 features are selected, respectively, from the ALL dataset. Note that the selected features are ordered, but the order is disregarded here. Both plots show that SVM-RFE selects features from a wide range of unigrams, meaning that the algorithm will select a feature even when it is not frequently used. However, each plot also shows a strong locality of selected features on the leftmost side, where the most frequent unigrams are located. Every feature selection algorithm selects most of its features from unigrams whose IDs are lower than 2,000 when 100 features are selected, and lower than 4,000 when 1,000 are selected.
Figure 21. Unigrams selected by SVM-RFE (Dataset: ALL; x-axis: nth unigram). Left: 100 features, Right: 1,000 features
In Figure 22, the plots are generated by ranking all unigrams with the four different feature selection approaches. We can see where the nth feature is located and how its location differs across feature selection algorithms. In every approach, roughly the 500 most highly ranked features are likely to be among the most frequently appearing unigrams of the dataset. Nevertheless, SVM-RFE again selects features from the widest spectrum. The SVM-RFE and Relief algorithms often rank less frequent unigrams highly and rank a large number of frequent unigrams low. The difference between the two algorithms is that Relief puts most of the highly frequent unigrams in either the head or the tail, whereas SVM-RFE scatters them over the whole range. This means that SVM-RFE is the least affected by unigram frequency, which is an advantage: in gender classification, we are interested in finding gender-preferential features regardless of their popularity.
Figure 22. Ranks of selected features
InfoGain Evaluation
The InfoGain feature selection algorithm can be an alternative when SVM-RFE is not
suitable for a given dataset. Its accuracy is not as good as that of SVM-RFE, but it is notably
faster. It works well with Naïve Bayes and Bayesian Logistic Regression, yielding 90% or higher
accuracy when 500 to 1,000 Boolean features are supplied.
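Since information gain is the mutual information between a feature and the class label, a comparable selector can be sketched in scikit-learn as follows (X and y are again hypothetical placeholders for the feature matrix and gender labels):

    # A minimal InfoGain-style selector: score each unigram by its mutual
    # information with the gender label and keep the k highest-scoring ones.
    from sklearn.feature_selection import SelectKBest, mutual_info_classif

    def infogain_select(X, y, k=500):
        selector = SelectKBest(score_func=mutual_info_classif, k=k)
        X_reduced = selector.fit_transform(X, y)   # reduced feature matrix
        return X_reduced, selector.get_support()   # plus the Boolean feature mask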
Figure 23. InfoGain feature selection and other classifiers
Top 20 Features Selected by InfoGain (Rank: Unigram ID, Unigram)
1: 1934, owner   2: 838, steve   3: 858, qualiti   4: 51, haha   5: 1563, …
6: 2024, …   7: 1922, …   8: 690, …   9: 154, …   10: 339, wife
11: 83, care   12: 920, sourc   13: 393, polit   14: 1564, quickli   15: 851, sec
16: 2488, cheaper   17: 2750, xoxo   18: 1989, champion   19: 426, …   20: 1501, …
Table 15. Classification details and top 20 features selected by InfoGain
[Figure 23 panels (INFOGAIN - Boolean, INFOGAIN - Count): classification accuracy versus number of selected features for SVM (Linear), Naive Bayes, Bayesian Logistic Regression, and Random Forest (N=30).]
Relief Evaluation
The Relief feature selection algorithm helps classification algorithms yield less
fluctuating accuracy: classification is least affected by the number of features when
Relief-selected features are supplied. It works well with Bayesian Logistic Regression and Boolean
features.
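The core weight update of Relief (Kira and Rendell 1992) can be sketched for the binary gender case as follows; this simplified version assumes features scaled to [0, 1] and is an illustration, not the exact implementation evaluated in this research:

    import numpy as np

    # Simplified binary Relief: for each sampled instance, reward features
    # that separate it from the nearest miss (other class) and penalize
    # features that separate it from the nearest hit (same class).
    def relief_weights(X, y, n_samples=100, seed=0):
        rng = np.random.default_rng(seed)
        w = np.zeros(X.shape[1])
        idx = rng.choice(len(X), size=min(n_samples, len(X)), replace=False)
        for i in idx:
            dists = np.abs(X - X[i]).sum(axis=1)                   # L1 distances
            dists[i] = np.inf                                      # exclude self
            hit = np.argmin(np.where(y == y[i], dists, np.inf))    # nearest hit
            miss = np.argmin(np.where(y != y[i], dists, np.inf))   # nearest miss
            w += np.abs(X[i] - X[miss]) - np.abs(X[i] - X[hit])
        return w / len(idx)   # higher weight = more gender-discriminative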
Top 20 Features Selected by Relief (Rank: Unigram ID, Unigram)
1: 137, turn   2: 339, wife   3: 169, join   4: 313, issu   5: 132, team
6: 234, monei   7: 32, call   8: 187, pai   9: 180, beauti   10: 83, care
11: 562, impress   12: 107, open   13: 392, catch   14: 122, write   15: 10, lol
16: 689, inform   17: 427, mention   18: 190, worri   19: 51, haha   20: 275, media
Table 16. Classification details and top 20 features selected by Relief
[Figure 24 panels (RELIEF - Boolean, RELIEF - Count): classification accuracy versus number of selected features for SVM (Linear), Naive Bayes, Bayesian Logistic Regression, and Random Forest (N=30).]
Chapter 6 Suicide Prevention
We have discussed various machine learning techniques and how they can be used to mine
information from social media. It is now time to consider how such advances in computational
intelligence may improve the quality of human lives and help solve social problems. In this
chapter, we discuss the possibility of incorporating machine learning and social computing
techniques to prevent potential suicides as future work, beginning with background on suicide
and suicide prevention efforts.
6.1. Background
Suicide is a serious societal problem. The Centers for Disease Control and Prevention
(CDC) reports that suicide is the third leading cause of death among people ages 10 to 24,
and that 13.8 percent of U.S. high school students seriously considered attempting suicide
during 2009 (CDC 2010). These statistics reveal that female high school students in the U.S. are
a high-risk group. As shown in Figure 25, 17.4% of female students seriously considered
attempting suicide, while 10.5% of male students did so. Among the three female groups (White,
Black, and Hispanic), Hispanic female students were at the highest risk (20.2%), while Black
male students were at the lowest (7.8%).
Figure 25. U.S. high school students who seriously considered/attempted suicide in 2009 (CDC 2010)
[Data shown (% who seriously considered attempting suicide): White — male 10.5, female 16.1, total 13.1; Black — male 7.8, female 18.1, total 13.0; Hispanic — male 10.7, female 20.2, total 15.4; All students — male 10.5, female 17.4, total 13.8.]
There is a consistent finding that more than 90% of people who committed suicide had
shown a psychiatric disorder (Grøholt, Ekeberg et al. 1997). The role of mental health services is
therefore important for those who are aware of their mental illness and proactive in seeking
treatment. However, we cannot rely on mental treatment alone to solve the problem: more than
two thirds of the people who committed suicide had not received any appropriate treatment
(Marttunen, Aro et al. 1992; Brent, Perper et al. 1993; Shaffer, Gould et al. 1996; Grøholt,
Ekeberg et al. 1997; Isacsson 2001), and it is impossible for clinicians to reach all of the people
in need. Moreover, teenagers who have previously attempted suicide are at greater risk of
completed suicide, particularly within the subsequent 12 months (Joiner 2005). Suicide
prevention efforts are therefore essential.
In the United States, there have been suicide prevention efforts since the late 1950s,
beginning with the opening of the Los Angeles Suicide Prevention Center (Litman, Shneidman
et al. 1961). As the number of young people who committed suicide grew dramatically in the
1970s, the call for suicide prevention efforts grew louder. In 1989, the U.S. Department of
Health and Human Services
published a report on youth suicide prevention strategies (Rosenberg and Baer 1989). The
growth of the suicide prevention movement accelerated afterwards. In 2004, the Garrett Lee
Smith Memorial Act (PL 108-355) was signed into law, making it the first youth suicide
prevention bill. The act authorizes funding for grants in three areas: the development and
implementation of early suicide intervention and prevention strategies, the development and
implementation of suicide prevention programs for public and private institutions of higher
education, and the establishment of a suicide prevention technical assistance center to support
the first two areas.
6.2. Theoretical Models of Suicide
There are many kinds of theoretical models of suicide: sociological, psychological,
psychosocial, biological, family, and biopsychosocial models. One of the best-known models is
the sociological one developed by Emile Durkheim in the 19th century, which categorizes
suicide into four types: egoistic, altruistic, anomic, and fatalistic (Durkheim 1951). According to
the model, an egoistic suicide occurs due to isolation from the community, which may be caused
by a failure to form meaningful friendships or to engage in social activities. The second type,
altruistic suicide, stems from a sacrifice for the good of society; commandments of religious
sacrifice can be a reason for this type of suicide. The third type, anomic suicide, happens when
one is unable to cope with a personal crisis; it is also related to satisfaction regulated by society,
and an acute anomie often comes from a failure to maintain the well-being of one's family. The
fourth type, fatalistic suicide, is carried out when one is excessively regulated by society; a
young person who commits suicide in prison is an example of this kind of suicide.
6.3. Cyberbullying
Suicide prevention is also essential in the cyber world. We are living in the Information
Age, and most adolescents are digital natives. As online social networking services became very
popular (Bernoff, Pflaum et al. 2008; Gartner 2010), a new problem emerged: cyberbullying.
Cyberbullying is usually accompanied by the disclosure of victims' personal information in
public online spaces for the purpose of defaming or ridiculing the victims. Sending harassing
and threatening messages to victims through email and instant messengers is also a common
cyberbullying activity.
Cyberbullying often leads to youth suicide. Megan Meier (November 6, 1992 – October
17, 2006) was one of its victims; her suicide is attributed to cyberbullying through MySpace, an
online social networking service. She received a message from a boy soon after opening her
MySpace account and thought he was attractive. They became online friends, though they never
met in person, and Megan began to exchange messages with him via MySpace. The peace did
not last long, however. On October 15, 2006, she received a hoax message from him saying,
“I don’t know if I want to be friends with you anymore because I’ve heard that you are not very
nice to your friends.” Other people could read this message on her MySpace page and write
comments. The boy discussed the hoax with another person, who later added a comment to the
original message saying, “Everybody in O’Fallon knows how you are. You are a bad person and
everybody hates you. Have a shitty rest of your life. The world would be a better place without
you.” Meier responded with a message reading “You’re the kind of boy a girl would kill herself
over,” and hanged herself in her bedroom closet twenty minutes later.
Unfortunately, cyberbullying was not recognized as a grave crime at that time. A broader
consensus on the seriousness of bullying formed in early 2010 after the suicide of Phoebe Prince
(November 24, 1994 – January 14, 2010) in South Hadley, Massachusetts. She had suffered
months of constant bullying by at least nine teenagers from her high school. Her tragedy brought
calls for more stringent, specific anti-bullying laws. As a result, a state anti-bullying committee
was set up in Massachusetts in March 2010; the measure was signed into Massachusetts law on
May 3, 2010, and inspired New York State to introduce similar legislation.
Actress Demi Moore is one of the few people who have helped stop potential suicides in
the cyber world. She uses Twitter for counseling and has more than 2.5 million followers, and
her rapid responses to people in need of immediate counseling have actually saved several lives.
Actress Nia Vardalos has also worked to prevent tragedies on Twitter: she once saw a desperate
tweet and alerted Florida authorities, and the Seminole County Sheriff, after receiving the
suicide threat report, took an uninjured young man to a local hospital. Such efforts are invaluable
and need to continue. Even though many organizations and individuals are working on suicide
prevention both online and offline, it would be better if we had a way to detect suicidal thoughts
on social media systematically.
AR, a pseudonym, was a fifteen-year-old girl when she hanged herself in her bedroom
closet in April 2010. Cyberbullying was a direct cause; she had received hurtful text messages
from her classmates the day before her death. However, that was not all: her parents said that she
had been hospitalized for suicidal thoughts in previous years. AR's case is worth intensive
analysis because her inner world is well described in the messages she posted to herself on
Twitter over the preceding year. The tweets in the following table show her unstable mental
state; it appears that she had been suffering from a manic-depressive illness for a long time.
god why cant people just be nice. today fucking sucks
you know, life really sucks
WHAT THE FUCK. Why are people such assholes?
Im really getting tired of all these tears. its starting to mess with my sleep and
my eyes have a permanent red rin…
Ive Never Felt So Alone
what? what what what? im so lost all the time
i want to shoot myself
really, really fuckin depressed
Mmm. Im happy. This is kinda weird...
SOOOOOOOO FUCKING SORE.
Table 17. Selected tweets posted by AR before her suicide
HN, a pseudonym, is another person whose suicide is confirmed. She left 23 public
messages on Twitter before she killed herself. Her final message was “Rich get richer, poor get
poorer, families on the street, govt doesn't care. God bless the usa, but can He save it?”
AL, a pseudonym, had been an active Twitter user. Her suicide is not confirmed but is
strongly suspected. She was known to be transgender. She left more than 700 messages on
Twitter, and it seems that she wanted her suicide to receive broad attention: she used the hashtag
#tears to make her messages more accessible to other people.
5/3/2010
My last day in this world.
I will be taking care of all of you. I will be in a better place.
Goodbye World
My pain will go away soon thank god.
I'm gonna call my cell phone provider and cancel my line. Before I commit
suicide. If u have my # just erase it I will no longer have it.
If I don't tweet that means that I'm with granma and god. Love you all. Have
everything ready #tears I'm scared
Ok I'm goning now! GOODBYE! #tears #tears
Table 18. Selected tweets posted by AL before her suicide
6.4. Possibility
Corpora of 30 Randomly Chosen People
We compare the frequencies of certain words used by the three people above with those
of the people in the RND group. The group consists of 15 males and 15 females, all of whom are
active Twitter users whose language is English and who have posted at least 100 tweets on
Twitter.
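Such a comparison can be sketched as follows; tokenization is reduced to lowercase word matching, and the function and variable names are hypothetical placeholders rather than the exact procedure used in this research:

    from collections import Counter
    import re

    # Relative unigram frequencies over all of one user's tweets.
    def unigram_rates(tweets):
        words = [w for t in tweets for w in re.findall(r"[a-z']+", t.lower())]
        total = len(words) or 1
        return {w: c / total for w, c in Counter(words).items()}

    # How many times more often one user employs `word` than the average
    # member of the comparison (RND) group does.
    def usage_ratio(word, user_tweets, rnd_group_tweets_by_user):
        user_rate = unigram_rates(user_tweets).get(word, 0.0)
        group = [unigram_rates(t).get(word, 0.0) for t in rnd_group_tweets_by_user]
        avg = sum(group) / len(group)
        return user_rate / avg if avg else float("inf")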
AR’s unigrams and bigrams are shown below. It is evident that she frequently used
vulgar words; in particular, the word fuck(ing) was used far more often than average.
Rank Unigrams Bigrams
Table 19. AR’s Frequent Unigrams and Bigrams
AL’s corpus does not seem to have special characteristics. Like typical female users, she
frequently used the word love. This may be because she did not have a chronic illness; only her
most recent tweets showed suicidal thoughts.
Rank Unigrams Bigrams
Table 21. Frequencies of 5 frequently used unigrams
Chapter 7 Conclusion
Now that we have seen results from various combinations of classification and feature
selection algorithms, we can reflect upon the ideas and concepts presented in this research. It is
shown that the selection of features is crucial for accurate gender classification. Without
appropriate feature selection, even with a state-of-the-art classifier, classification accuracy ranged
from 50% to 70% in many cases; with appropriate feature selection, accuracy reached 90% to
100%. This answers the research question “Q4. Is a feature selection process also important in
gender classification?” Yes, it is very important.
It turned out that SVM-RFE is the best of the four feature selection approaches when
dealing with a small feature set, although, because of the recursive nature of its feature
elimination, it requires a large amount of memory as well as a long computation time. Choosing
the right feature type is also important: Boolean and Count types were observed to work well
with microblogged information, especially when the dataset is small. With a good feature set,
SVM and Bayesian Logistic Regression performed perfectly in this experiment, especially when
the number of features was less than a thousand.
Some gender-preferential features were found through an analysis of the outcomes
produced by the feature selection algorithms, and we have seen in which group each highly
ranked feature is prevalent. This answers the research question Q1, “Are there gender-preferential
features in microblogged information?” Also answered is the question “Is gender classification
with machine learning methods achievable?”: gender classification is achievable with a good
combination of machine learning techniques. The answer to the research question “Q3. Do
traditional text classification methods work well with microblogged information?” may vary
with the characteristics of a dataset; we have seen that the performance of Naïve Bayes is as
good as that of SVM.
In the suicide prevention chapter, we examined recent trends and the seriousness of
youth suicide. A comprehensive understanding of the theoretical models describing the nature of
suicide, together with an investigation into the actual microblogging behavior of people who
killed themselves, is essential to a potential computer-based suicide prevention system capable
of identifying suicidal thoughts in personal microblogs. One simple way to build a statistical
suicide model is to define two sets of words, a positive set and a negative set. The negative set
would contain words relevant to suicidal thoughts, each with a weight between 0 and 1; a
negative word with a high weight is likely to be used when one is thinking about suicide. The
positive set would contain words not relevant to suicidal thoughts, also weighted between 0 and
1. When measuring how close a person may be to suicidal behavior, it would make more sense
to group words or phrases that have similar meanings or express similar moods into one set and
to analyze the use of the terms in that group.
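As a very rough sketch of such a model, a depression score for a stream of posts could be computed as the weighted share of negative-set terms; the word lists and weights below are invented placeholders for illustration, not validated clinical indicators:

    # A toy sketch of the proposed two-set model. NEGATIVE and POSITIVE are
    # hypothetical lexicons with weights in [0, 1]; a real system would learn
    # them from labeled data and validate them clinically.
    NEGATIVE = {"alone": 0.7, "depressed": 0.9, "goodbye": 0.8}
    POSITIVE = {"happy": 0.8, "love": 0.6, "thanks": 0.4}

    def depression_score(tweets):
        neg = pos = 0.0
        for tweet in tweets:
            for word in tweet.lower().split():
                neg += NEGATIVE.get(word, 0.0)
                pos += POSITIVE.get(word, 0.0)
        total = neg + pos
        return neg / total if total else 0.0  # 0 = no signal, 1 = only negative terms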
Bibliography
Rosenberg, M. L. and K. Baer, Eds. (1989). Report of the Secretary's Task Force on Youth Suicide. Volume 4: Strategies for the Prevention of Youth Suicide. Government Printing Office (DHHS Pub. ADM 89-1624).
Aizerman, M., E. Braverman, et al. (1964). "Theoretical foundations of the potential function method in pattern recognition learning." Automation and Remote Control 25(6): 821-837.
Barabási, A.-L. (2010). Bursts: The Hidden Pattern Behind Everything We Do. New York, NY, Dutton.
Bellman, R. (1961). Adaptive Control Processes: A Guided Tour. Princeton, NJ, Princeton University Press.
Bengio, Y. and Y. LeCun (2007). Scaling learning algorithms towards AI. Large-Scale Kernel Machines. L. Bottou, O. Chapelle, D. DeCoste and J. Weston, Eds. Cambridge, MA, MIT Press: 321-358.
Bernoff, J., C. N. Pflaum, et al. (2008). The Growth of Social Technology Adoption. Forrester Research.
Boser, B. E., I. M. Guyon, et al. (1992). A training algorithm for optimal margin classifiers. 5th Annual Workshop on Computational Learning Theory, Pittsburgh, PA. New York, NY, ACM.
Breiman, L. (1996). "Bagging predictors." Machine Learning 24(2): 123-140.
Breiman, L. (2001). "Random forests." Machine Learning 45(1): 5-32.
Brent, D. A., J. A. Perper, et al. (1993). "Psychiatric risk factors for adolescent suicide: a case-control study." Journal of the American Academy of Child and Adolescent Psychiatry 32(3): 521.
CDC (2010). "Youth risk behavior surveillance - United States, 2009." MMWR Surveillance Summaries 59(SS-5).
Cheong, M. and V. Lee (2009). Integrating web-based intelligence retrieval and decision-making from the twitter trends knowledge base. Proceedings of the 2nd ACM Workshop on Social Web Search and Mining. Hong Kong, China, ACM: 1-8.
Corney, M., O. d. Vel, et al. (2002). Gender-preferential text mining of e-mail discourse. Proceedings of the 18th Annual Computer Security Applications Conference. IEEE Computer Society: 282.
DeCoste, D. and B. Schölkopf (2002). "Training invariant support vector machines." Machine Learning 46: 161-190.
Domingos, P. and M. Pazzani (1997). "On the optimality of the simple Bayesian classifier under zero-one loss." Machine Learning 29(2-3): 103-130.
Durkheim, E. (1951). Suicide. Glencoe, Ill., Free Press.
Gartner (2010). Gartner Top End User Predictions for 2010: Coping with the New Balance of Power. Gartner.
Genkin, A., D. D. Lewis, et al. (2007). "Large-scale Bayesian logistic regression for text categorization." Technometrics 49(3): 291-304.
Grøholt, B., O. Ekeberg, et al. (1997). "Youth suicide in Norway, 1990-1992: A comparison between children and adolescents completing suicide and age- and gender-matched controls." Suicide and Life-Threatening Behavior 27: 250-263.
Guyon, I., J. Weston, et al. (2002). "Gene selection for cancer classification using support vector machines." Machine Learning 46(1): 389-422.
Hand, D. and K. Yu (2001). "Idiot's Bayes: Not so stupid after all?" International Statistical Review 69(3): 385-398.
Hilden, J. (1984). "Statistical diagnosis based on conditional independence does not require it." Computers in Biology and Medicine 14(4): 429-435.
Isacsson, G. (2001). "Suicide prevention - a medical breakthrough?" Acta Psychiatrica Scandinavica 102(2): 113-117.
Jansen, B., M. Zhang, et al. (2009). "Twitter power: Tweets as electronic word of mouth." Journal of the American Society for Information Science and Technology 60: 1-20.
Java, A., X. Song, et al. (2007). Why we twitter: understanding microblogging usage and communities. Proceedings of the 9th WebKDD and 1st SNA-KDD 2007 Workshop on Web Mining and Social Network Analysis. San Jose, California, ACM: 56-65.
Joachims, T. (2001). A statistical learning model of text classification for support vector machines. 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, New Orleans, LA. New York, NY, ACM.
Joiner, T. E. (2005). Why People Die by Suicide. Cambridge, Mass., Harvard University Press.
Kim, E., S. Gilbert, et al. (2009). "Detecting Sadness in 140 Characters: Sentiment Analysis and Mourning Michael Jackson on Twitter." Retrieved 08/18/2009, from http://www.webecologyproject.org/2009/08/detecting-sadness-in-140-characters/.
Kira, K. and L. A. Rendell (1992). A practical approach to feature selection. Proceedings of the Ninth International Workshop on Machine Learning. Aberdeen, Scotland, United Kingdom, Morgan Kaufmann Publishers Inc.: 249-256.
Kullback, S. and R. A. Leibler (1951). "On information and sufficiency." The Annals of Mathematical Statistics 22(1): 79-86.
Lewis, D. and W. Gale (1994). A sequential algorithm for training text classifiers. Springer-Verlag New York, Inc.
Litman, R. E., E. S. Shneidman, et al. (1961). "Los Angeles suicide prevention center." American Journal of Psychiatry 117(12): 1084.
Marttunen, M. J., H. M. Aro, et al. (1992). "Adolescent suicide: endpoint of long-term difficulties." Journal of the American Academy of Child and Adolescent Psychiatry 31(4): 649.
Nardi, B. A., D. J. Schiano, et al. (2004). "Why we blog." Communications of the ACM 47(12): 41-46.
Nemes, S., J. Jonasson, et al. (2009). "Bias in odds ratios by logistic regression modelling and sample size." BMC Medical Research Methodology 9(56).
Newman, M. (2002). "Spread of epidemic disease on networks." Physical Review E 66(1): 16128.
Opitz, D. and R. Maclin (1999). "Popular ensemble methods: An empirical study." Journal of Artificial Intelligence Research 11: 169-198.
Porter, M. F. (1980). "An algorithm for suffix stripping." Program 14(3): 130-137.
Provost, F. and R. Kohavi (1998). "Guest editors' introduction: On applied research in machine learning." Machine Learning 30(2): 127-132.
Rish, I. (2001). An empirical study of the naive Bayes classifier. IJCAI 2001 Workshop on Empirical Methods in Artificial Intelligence: 41-46.
Russell, S. J. and P. Norvig (2009). Artificial Intelligence: A Modern Approach. Prentice Hall.
Shaffer, D., M. S. Gould, et al. (1996). "Psychiatric diagnosis in child and adolescent suicide." Archives of General Psychiatry 53(4): 339.
Spinelle, J. and A. Messer (2009). "Tweeting is more than just self-expression." Retrieved 09/10/2009, from http://live.psu.edu/story/41446.
Starbird, K., L. Palen, et al. (2010). Chatter on the red: what hazards threat reveals about the social life of microblogged information. 22nd ACM Conference on Computer Supported Cooperative Work, Savannah, GA. New York, NY, ACM.
Sutton, R. and A. Barto (1998). Reinforcement Learning: An Introduction. The MIT Press.
Yarow, J. (2010). "Twitter Finally Reveals All Its Secret Stats." Retrieved 05/09/2010, from http://www.businessinsider.com/twitter-stats-2010-4.
Zhao, D. and M. B. Rosson (2009). How and why people Twitter: the role that micro-blogging plays in informal communication at work. Proceedings of the ACM 2009 International Conference on Supporting Group Work. Sanibel Island, Florida, USA, ACM: 243-252.