
Latent Semantic Analysis and Classification Modeling in Applications for Social

Movement Theory

Judith E. Spomer

A Thesis

Submitted in Partial Fulfillment of the

Requirements for the Degree of

Master of Science in Data Mining

Department of Mathematical Sciences

Central Connecticut State University

New Britain, Connecticut

March 2009

Thesis Advisor

Dr. Roger Bilisoly

Department of Mathematical Sciences


Latent Semantic Analysis and Classification Modeling in Applications for Social

Movement Theory

Judith E. Spomer

An Abstract of a Thesis

Submitted in Partial Fulfillment of the

Requirements for the Degree of

Master of Science in Data Mining

Department of Mathematical Sciences

Central Connecticut State University

New Britain, Connecticut

March 2009

Thesis Advisor

Dr. Roger Bilisoly

Department of Mathematical Sciences

Key Words: Social Movement Theory, Collective Action, Framing, Linguistics,

Latent Semantic Analysis, Text Mining, Data Mining


ABSTRACT

Social Movement Theory (SMT) is an area of study in Sociology and Political

Science that provides an analytical framework for understanding the factors involved in

organized social action. A social movement develops in response to an injustice or issue

about which people rally in an effort to solve the problem. In recent years, the threat of

terrorism has accelerated research in SMT. Much of this research has focused on

understanding the framing process, whereby a Social Movement Organization (SMO)

issues communications intended to influence perceptions and enlist help from the

members of a community or general population.

The Internet has become a primary medium for SMOs to distribute electronic text

to describe an issue, place blame, identify victims, propose solutions, and ask readers to

take action on an issue. Texts such as these are framing documents. The research

presented in this paper introduces the application of statistical methods in text analytics

as a means to extend research involving the framing process. This thesis proposes that

Latent Semantic Analysis techniques combined with classification modeling algorithms

result in models that are able to discover small numbers of framing documents scattered

among thousands of text documents. The models themselves provide insight into the

character of framing documents.

Global warming was selected as the social movement upon which to base this

study. Global warming framing documents were collected from Internet sites, and were

combined with other documents that address global warming, but are not framing in

nature. This corpus served to train and test statistical models that not only detected

framing documents, but further classified these by framing task with high accuracy.


These methods can be implemented with commercial software and serve as a resource for

the study of both SMT and active social movements.


TABLE OF CONTENTS

ABSTRACT
DEDICATION
ACKNOWLEDGEMENTS
INTRODUCTION
    SOCIAL MOVEMENTS
    FRAMING
    GLOBAL WARMING
    OBJECTIVES
    METHODOLOGY
RELATED RESEARCH
METHODS
    COLLECTION OF ELECTRONIC TEXT DOCUMENTS
    PREPROCESSING OF TEXT DOCUMENTS
        Document Classification
        Removal of Personal Identifying Information
        Parsing the Text
        Term Weighting
        Singular Value Decomposition
    EXPLORATORY DATA ANALYSIS
    PREPARATION FOR CLASSIFICATION MODELING
        Training and Test Data Sets
        Balancing the Training Data Set
        Derivation of Dummy Variables
    PROFILING SELECTED SVD VARIABLES
        SVD_2
        SVD_6
    MODELING ALGORITHMS
        CART Algorithm
        Logistic Regression Algorithm
        Neural Network Algorithm
        Combination Models
    EVALUATION METRICS
    MODEL 1: FRAMING/NON-FRAMING CLASSIFICATION
        CART Model 1
        Logistic Regression Model 1
        Neural Network Model 1
        Voting Combination Model 1
        Mean Model Response Probability Combination Model 1
        Selection of Final Model 1
    MODEL 2: FRAMING TASK CLASSIFICATION
        CART Model 2
        Logistic Regression Model 2
        Neural Network Model 2
        Combination Model 2
        Selection of Final Model 2
DISCUSSION
    COMPARISON OF MODEL ALGORITHMS TO K-NEAREST NEIGHBORS
    IMPORTANT PREDICTOR VARIABLES
    THE DIFFICULTY OF CLASSIFICATION
CONCLUSION
    FUTURE WORK
REFERENCES
BIOGRAPHICAL STATEMENT
APPENDIX A: REPRESENTATIVE GLOBAL WARMING DOCUMENTS
    NON-FRAMING DOCUMENT
    DIAGNOSTIC DOCUMENT
    PROGNOSTIC DOCUMENT
    MOTIVATIONAL DOCUMENT
APPENDIX B: CLUSTER RESULTS FOR ENTIRE CORPUS
APPENDIX C: DUMMY VARIABLES FOR FRAMING/NON-FRAMING MODELS
APPENDIX D: DUMMY VARIABLES FOR DIAGNOSTIC/PROGNOSTIC/MOTIVATIONAL MODELS
APPENDIX E: TERMS ASSOCIATED WITH THE HIGHEST SVD_6 VALUES


DEDICATION

This thesis is dedicated to my husband, Philip, and to my dear children Kathryn,

Jenna, Alexander, and Nicaea. Your encouragement, love, and support have given me

the strength and enthusiasm to pursue a Master’s degree in a fascinating field and to

complete this final effort in the program.


ACKNOWLEDGEMENTS

I would like to thank the members of my thesis committee, Professor Roger

Bilisoly, thesis advisor and text mining mentor, and Professors Daniel Larose and

Zdravko Markov for serving on my committee and holding me to a high standard in

writing this thesis. I want to express my sincere gratitude to my academic advisor,

Professor Daniel Larose, for his guidance and instruction and for his efforts in creating

this unique degree program.

Words cannot express the gratitude that I feel for my friends Deborah Hoy, Lisa

Kennicott, Cindy Kleist, Lydia Koch, Janet Price, Sue Robinson, and so many others.

Your patient listening, encouragement, and prayers kept me going. In addition, Lydia

Koch got in the trenches with me to scour the Internet for global warming framing

documents. She also painstakingly proofread this document. I must extend a fervent

thank you to my friend and colleague, Randall LaViolette, PhD, for his insight, tireless

reviews, and insistence that I make this a scholarly work.

My fellow students have truly made my classes a pleasure, especially Don

Wedding, who saved me from procrastination, Kathleen Alber, who is an angel of

kindness, and Lucia Lake, who inspired me to do my best.

Above all, I thank my parents, Don and Marge Fisk, for their unconditional love,

for encouraging me to always pursue and enjoy learning, and for setting an exemplary

example of integrity.


INTRODUCTION

The explosive popularity of the Internet since the 1990s has resulted in a flood of

text that can be stored in electronic form. Email messages, news reports, technical papers,

word processor documents, even the text on the web pages themselves are rich sources of

information. Analysts are bombarded with more text than they can possibly read or

absorb. In response, research into the processing and analysis of text has blossomed.

The need to find information on the Internet has fueled the development of information

retrieval. The need to discover meaning or themes in a corpus of documents has led to

the development of algorithms that parse words from text and represent words and

documents in a numeric form for subsequent processing. Raw text is unstructured, that

is, it is not neatly organized into a set of observations each of which is described by a set

of variables. Once text has been processed and represented in numeric form, it is

structured and data mining tools can be brought to bear in the analysis.

The discipline of data mining has generated algorithms and processes by which an

experienced practitioner can discover patterns and characteristics within structured data.

Models can be developed that categorize potential business customers by the likelihood

that they will respond to an advertisement. Building a statistical model to perform such a

task is classification modeling. This study makes use of algorithms that convert text into

a structured and meaningful format and then applies classification modeling methods.

The entire process, however, is guided by a theory that originated in an entirely different

discipline: Social Science.

The theoretical underpinnings of this study have parallels in a well-established

practice known as credit score modeling. A hundred years ago, banks relied on


accumulated knowledge to make lending decisions. That knowledge was solidly based

on thousands of years of lending experience honed by the incentive to turn a profit. In

ancient Rome, money lenders knew it was unwise to lend money to a man who did not

repay his debts. That is still true today. Bankers then, and now, have conducted their

business under theories that have been confirmed by experience. With the advent of

computers came the ability to develop and implement statistical models based on the

foundation of lending theory. Today, credit institutions develop credit score models from

historical data characteristics and the known financial behavior of many customers. The

trained model provides a score for a new loan applicant based on the applicant‟s

historical data characteristics. A higher score is associated with a higher likelihood that

this applicant will repay the loan.

When a credit bureau declares that a loan applicant is unlikely to repay a loan due

to a long history of poor fiscal responsibility, that declaration is not based on capriciously

discovered data patterns. It rests solidly on demonstrated theories from observing the

behavior of millions of similar consumers and translating that behavior into models. In

other words, to a finance professional, the model makes sense. There will be exceptions,

but most often the credit score is an excellent indicator of whether a lending institution

can expect to recover the money it loans to an applicant and make a profit. The accuracy

with which credit score models classify loan applicants validates the theory that past

fiscal behavior is indicative of future fiscal behavior.

The study described in this paper also uses theory to guide classification of text

documents, not long established theory, but a newer, emerging theory. The theory of

lending rests upon knowledge gained by untold numbers of practitioners over thousands


of years with copious data sources. It has been validated by millions of successful

decisions from credit score models. The theory that guides the efforts in this study has

been developed in modern times by relatively few, but dedicated, social scientists who

have pored over evidence from events that occur quite rarely in comparison to the

frequency with which loans are made. The theory that inspired this study is Social

Movement Theory (SMT), which is an area of study in Social Science and Political

Science that provides an analytical framework for understanding the factors involved in

organized social action. Organized social action could be mild, but when it becomes

disruptive, it captures the attention of government and law enforcement agencies. Will

the actions simply snarl city traffic or result in deaths and injuries?

A key element of SMT is the framing process, whereby communications are

prepared with intent to influence perceptions and enlist help from others in order to

address a social problem. The discovery of framing communication is an essential

element in anticipating social activist events. These communications are often

disseminated via the Internet. If we simply troll the Internet, looking for impending

social violence, the odds for success are low. However, if SMT is correct in its

assumptions of the process whereby people are influenced, recruited, and moved to

action, then we have a template to guide our search for evidence of this framing process.

Can the process be disrupted or altered to prevent violence? The answer to that question

is well beyond the scope of this effort. Instead, the research presented in this paper

focuses on establishing a method to find the evidence, in the form of disseminated

writings, of framing processes. The result is a set of highly accurate models that

effectively discover and classify texts that perform framing functions. SMT guided and


permeated this effort. In return, the results of this effort contribute to the validation of

SMT assumptions regarding the characteristics of framing communication.

Social Movements

Social movements (Della Porta & Diani, 1999; McAdam, McCarthy, & Zald,

1988) spring from the efforts of persons who become concerned about a societal problem,

whether real or perceived. These persons form groups, known as Social Movement

Organizations (SMOs) in order to more effectively address the problem. SMOs articulate

and publicize their chosen issue in a manner designed to elicit support and involvement

from others. SMOs often adopt the stance that solutions to their issue may be brought to

fruition through collective social action.

The collective nature of these actions magnifies the result when compared to the

actions of just one individual. An environmental SMO may encourage persons, through

direct contact, to recycle plastic bottles. The SMO may also ask these persons to recruit

friends and acquaintances to join the recycling effort. The objective is to engage enough

participants to make a measurable improvement in the environment. One may argue

that this type of collective action is harmless and cannot help but improve the

environment to some degree. However, other actions may be more threatening than

merely recycling plastic bottles.

Protests and demonstrations can disrupt personal and business activities, involve

dangerous actions, or turn violent. Climbing a smokestack to unfurl a banner that decries

greenhouse gas emissions not only disrupts business, but also raises the specter of

possible injuries to protestors or workers, or damage to equipment. The following text,


obtained from an environmental SMO web site encourages readers to take this type of

action in an effort to publicize the causes of global warming:

X marks the spot: Take your banner drop to the source: hang it on a power station,

smokestack, at an import terminal, or the roof of a head office and it’s likely to

get loads of attention. The harder it is to get up, the harder it will be for them to

get down! (Rising Tide, 2008)

Framing

SMOs employ framing to craft the manner in which others interpret events

relative to the issue of concern. Framing may be described as the method by which an

individual organizes and categorizes events, situations, and personal experiences

(Goffman, 1974). These “frames,” through which one observes life, can be influenced by

persuasive rhetoric. Framing provides the means for SMOs to inform others of the issue

at hand, change the manner in which others think about the issue, and invite participation

to act on the issue. In this context, framing refers to these actions of SMOs. Their goal is

to change the frames through which others view life events and, ultimately, to change the

manner in which others act upon an issue. Frames that promote joining together with

others to take action on a social issue are known as Collective Action Frames (CAF).

The CAF process can be broken into three key tasks (Snow & Benford, 1988):

1. Diagnostic, which defines the problem, often places blame, and may describe

how innocent victims are affected;

2. Prognostic, which presents solutions or steps to resolve the issue; and

3. Motivational, which states an urgent need for action to address the problem,

and invites others to join in ameliorative collective social actions.


This definition of the core framing tasks is fundamental to the research described in this

paper. This study hinged upon developing a methodology to characterize and discover

evidence of these three framing tasks via processing of writings obtained from the

Internet.

An example of motivational framing found on the Internet is shown in Figure 1.

This web page, obtained from the Greenpeace website and reproduced here with

permission, asks the reader to join an action to halt the expansion of Heathrow Airport.

Greenpeace, along with some celebrities, purchased a plot of land in the middle of the

proposed new runway at Heathrow. The reader is asked to sign up as an owner on the

legal deed of trust. Greenpeace wants to demonstrate the breadth of public support for its

position by obtaining as many owners as possible for this plot of land. Notice that,

toward the bottom of Figure 1, there is a link titled “Invite your friends to join.” This is

an effort to recruit more adherents to the cause. Figure 1 contains both text and images.

Images were not processed in this study, but the text can be extracted. The extracted text

then becomes a “document” which is subsequently processed and analyzed. The text in

Figure 1 is simply presented as an example and was not part of the corpus of documents

that were used in this study.


Figure 1. An example of an Internet motivational framing document. From

Greenpeace UK website, http://www.greenpeace.org.uk/climate/airplot, viewed

February 2, 2009. Used with permission.

Framing, in the context of social movements, has moved beyond academic

research in recent times. Framing theory is now being actively studied and put into

practice. For instance, the FrameWorks Institute is a nonprofit think tank that has been in

existence for ten years and focuses solely on framing public issues. Its mission is “to

advance the nonprofit sector’s communications capacity by identifying, translating and

modeling relevant scholarly research for framing the public discourse about social

problems” (FrameWorks, 1999). FrameWorks has assisted the Climate Message Project,


a coalition of environmental SMOs, in determining how to reframe the issue of global

warming (FrameWorks, n.d.).

Global Warming

Global warming has been selected as the social issue on which to base this study.

Global warming, sometimes referred to as climate change, is a contested topic. Various

factions debate whether or not the Earth is truly warming. Those that agree that the Earth

is experiencing an unprecedented period of warming argue among themselves over the

cause of that warming, the timing and effects of warming, and viable solutions to the

threat.

Concerns over the presumed effects of global warming have spawned social

movements that span cultural, religious, and geographical boundaries. This issue has

support from odd bedfellows like the Communist Party, which has published “Global

Warming – The Communist Solution” (Communist Party USA, 2008), and the Southern

Baptist Convention, which touts its own measures to combat global warming (Southern

Baptist Convention, 2007). From Australia (Climate Action Network Australia, 2008) to

Saudi Arabia (New Europe, 2008) the debate continues and global warming SMOs

abound.

Objectives

This study demonstrates a method to build classification models that can sift

through a corpus of documents, all of which are written on the topic of global warming,

and discover the small proportion of texts that are framing in nature. The purpose behind

these framing texts can range from attempts to sway public opinion regarding the issue to


recruiting persons to join organized efforts, such as protests or demonstrations, in order to

bring about desired change. The ability to deploy a model that can detect signs of such

activity, for instance by observing public Internet postings, could provide indications of

impending social conflict.

The social actions espoused by these framing documents could be harmless,

mildly disruptive, or in some cases could lead to violence. Global warming protests are

generally peaceful. In some cases, though, global warming protests have turned

disruptive or violent. At the EU-US Summit in June 2001, U.S. opposition to the Kyoto

Protocol set off protests in which environmentalists and anti-globalization activists threw

bottles and stones at Swedish riot police (BBC News, 2001). On two days in July 2008,

environmental protesters brought operations at the world’s largest coal terminal to a

standstill by chaining themselves to a conveyor belt (Reuters, 2008). Regardless of

whether these social actions are peaceful or violent, early warning can aid communities

and law enforcement agencies in efforts to minimize the negative effects of expressed

social unrest.

This study does not address the issue itself, nor does it take a stand on the

controversy. Rather, this study takes advantage of the abundance of related documents

that have been produced in electronic form. Some of these are scientific publications,

studies, and news articles that are, or should be, objective and non-framing in nature.

Increasingly, the Internet is employed as the medium of choice for disseminating social

activist views to the general public. The World Wide Web is a rich source of framing

documents that have been produced with the intent of influencing opinion on global

warming or recruiting others to join the efforts of the movement. For example, the


following motivational framing text was obtained from a site promoting a July 2008

climate rally in Australia.

Get serious! NO DESALINATION PLANT -- PHASE OUT COAL

NO NEW FREEWAY TUNNEL -- NO BAY DREDGING

YES to renewable energy, public transport & urgent action to stop global

warming.

We are calling for Victorians to join the Climate Emergency Rally

on July 5. We want to send a wake-up call to state and federal

governments that they are heading in the wrong direction. New coal, new

freeways and desalination plants increase our use of and reliance on fossil

fuels dramatically at a time when we must be cutting our use even more

dramatically. We are calling on governments to implement sustainable

alternatives to these irresponsible and expensive projects.

We call on all community groups and individuals to join us to send

this important message to the government. We are going to form a 140-

metre-long human sign to spell the words ‘Climate Emergency’.

Please organise your group to send endorsement, tell everyone you

know, and come on the day wearing something red to symbolise

emergency. (Climate Rally, 2008)

The Climate Rally organizers were successful. The event was held in Melbourne,

Australia on July 5, 2008, with approximately 1,500 (police estimate) to 3,000-5,000


(organizers’ estimate) in attendance. After listening to speakers and conducting a

peaceful march, the “Climate Emergency” sign (Figure 2) was formed by rally

participants (Courtice, 2008). No violence was involved in this demonstration.

Figure 2. Photograph demonstrating the success of Climate Rally 2008 in

obtaining participants. (Campbell, 2008)

Methodology

This study employs a combination of Latent Semantic Analysis techniques and

statistical modeling algorithms (logistic regression, decision trees, and neural networks)

to produce models that accurately classify new, unseen text documents.

Latent Semantic Analysis (LSA) is a well-established information retrieval

methodology that returns pertinent documents in response to a query (Deerwester,

Dumais, Furnas, Landauer, & Harshman, 1990). Perhaps the best known examples of

information retrieval applications are Internet search engines. LSA parses documents

from a corpus and represents the corpus as a matrix, most often with a row for each term

(word or phrase), a column for each document, and term counts or weights populating the


cells. Some analysts construct the matrix with rows representing documents and columns

representing terms, but in this study, the former representation is employed. The matrix

is known as a term-document matrix. It is sparse, meaning it has a large number of cells

with zero values for terms that do not appear in a particular document. The structure is a

high dimensional matrix, meaning there may be thousands of columns and tens of

thousands of rows.

LSA deals with the complexities of this large sparse matrix by employing singular

value decomposition (SVD) to reduce the dimensionality while retaining most of the

information in the corpus. SVD enables the calculation of a series of numerical values

for each text document. These calculated values can serve as input to classification

algorithms resulting in a tool that can accurately identify specific types of text, for

instance, the influential documents that are indicative of social action.

The first task in this study is to train a model to correctly classify framing and

non-framing documents. Second, a more specific classification model is developed to

further classify framing documents as belonging to one of the three main framing tasks:

diagnostic, prognostic, or motivational. Implementation of these techniques may open

the door to expanded research applications. For example, such applications might

monitor activist Internet postings and provide ongoing input for social scientists’ study of

the dynamics of social movements.


RELATED RESEARCH

Employing LSA to generate predictive document attributes for classification

models is not new. For example, LSA has been applied in concert with the k-Nearest

Neighbors (kNN) algorithm to perform classification of topics in Reuters international

news reports (Naohiro, Murai, Yamada, & Bao, 2006). Also, the use of kNN and LSA

for document classification is not restricted to English. The same methods were used in a

study that classified Bulgarian news articles (Nakov, Valchanova, & Angelova, 2003). A

disadvantage of kNN is that it requires storage and processing of the training data to

accomplish classification of each new observation (Larose, 2005, p. 104). Rather than

using kNN, this study explores other classification algorithms which do not require a

large data store. Decision trees, logistic regression, and neural networks, as well as

ensemble modeling are considered. These algorithms can be trained to recognize a new

observation as belonging to one of a set of defined classes without requiring maintenance

of a large data store.

Classification of documents by framing tasks is a more difficult problem than

classifying news articles by subject. In this study, all documents address the same topic.

The models must detect more nebulous attributes such as motivation and intent. This

effort may be likened to classification by ideology. Different ideologies may be present

in documents that are written about a single topic. Ideological classification has been

successfully performed using singular value decomposition and a naïve Bayes classifier

to determine the party affiliation (Democrat or Republican) of Senators based on the text

of speeches made in the United States Senate (Morrow, Bader, Chew, & Speed, 2008).


Attributes that distinguish framing texts have been discussed extensively in Social

Movement Theory literature. A common approach is to develop a list of framing

keywords based on the most frequently occurring terms that are found in a collection of

framing documents (Triandafyllidou & Fotiou, 1998; Semetko & Valkenburg, 2000).

Computer-assisted qualitative data analysis software (CAQDAS) in conjunction with

word maps (electronic lists of words linked by associations) is another method that has

been proposed for identification of framing text (Koenig, 2005). Laborious processes

have also been used to characterize framing texts, such as manual extraction of words and

phrases which are then assigned codes for further analysis (Cooper, 2002). This is

effective, but expensive in terms of time and finances. It is also prone to issues of

human-induced bias and error.

The aforementioned methods as applied to framing texts have been utilized to

analyze processes and features of frame construction, rather than as a means to produce

input for classification models, which is the objective of this study. Examination of such

models can reveal additional insight into social movement frames. But, more notably, the

ability to develop framing classification models may extend theory into practice by

providing the means to monitor texts from various sources for indications of emerging or

escalating collective social actions.


METHODS

The processing of text for this study, including importation of documents, parsing,

singular value decomposition, and exploratory clustering, was performed using SAS®

Text Miner software (SAS® Text Miner, 2003-2005), a component of SAS® Enterprise

Miner™ (SAS® Enterprise Miner™, 2003-2005). Portions of the analysis utilized

SAS® and SAS/STAT® software (SAS® Software, 2002-2003). Additional analysis and

classification modeling used SPSS Clementine® (SPSS Clementine®, 2007).

The primary data mining task is classification of text documents, resulting in two

models. The first is dichotomous, classifying text documents as being either framing or

non-framing. The second, a polychotomous model, classifies text documents as one of

four types: diagnostic, prognostic, motivational, or non-framing. SAS® Text Miner

converts the information in the text into a structured form which can then be fed into

Clementine decision tree, logistic regression, and neural network algorithms for the

purpose of classification.
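
For illustration only, the overall two-model structure can be sketched outside the SAS® and Clementine® tools, for example in Python with scikit-learn. The random stand-in data, library calls, and parameter settings below are assumptions made for this sketch; they are not the implementation used in this study.

    # Rough sketch of the two classification tasks on stand-in SVD inputs.
    # Not the SAS Text Miner / Clementine workflow used in the thesis.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 100))        # stand-in for 100 SVD dimension values per document
    is_framing = rng.integers(0, 2, 500)   # Model 1 target: framing (1) vs. non-framing (0)
    task = rng.choice(["non-framing", "diagnostic", "prognostic", "motivational"], 500)

    model_1 = LogisticRegression(max_iter=1000).fit(X, is_framing)   # dichotomous model
    model_2 = DecisionTreeClassifier(max_depth=5).fit(X, task)       # polychotomous model
    print(model_1.predict(X[:3]), model_2.predict(X[:3]))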

Collection of Electronic Text Documents

Publicly available text documents in electronic form, all addressing the topic of

global warming, were collected. Abstracts from technical papers, conference

presentations, and reviews (ISI Web of Knowledge, 2008) were assumed to be non-

framing documents. Framing texts were gathered from web sites that support various

social movements focused on the global warming issue. The framing texts were

annotated with the source web site URL, the date of access, and, as available, the Web

page date. A total of 6,531 framing and non-framing text documents were collected.

Examples of each type of document are shown in Appendix A.


Preprocessing of Text Documents

Document Classification

The documents that were analyzed in this study were obtained from abstracts of

journal papers, conference proceedings, news reports, text from web pages, or text

downloaded from web pages in the form of a PDF, a word processing document, or text

contained within a spreadsheet. All documents in the corpus were classified by the

author as framing or non-framing. The framing documents were further classified, again

by the author, as belonging to one of three core framing tasks: diagnostic, prognostic, or

motivational (Snow & Benford, 1988). This classification was necessarily subjective;

however, every effort was made to faithfully adhere to the definitions of the three

framing tasks.

Some documents contained elements of all three framing tasks. An example of

this could be a web page that first mentions the dangers of global warming (diagnostic),

then goes on to say that legislation is needed to counteract the causes of global warming

(prognostic), and finally asks the reader to come and protest in front of the building

where legislators are preparing to vote on such legislation (motivational). When more

than one framing task was evident in a document, the document was classified by the task

that dominated the text. Distributions of documents by classification are shown in

Figures 3 and 4.

Figure 3. Distribution of documents by framing classification.


Figure 4. Distribution of documents by core framing task.

Removal of Personal Identifying Information

Names and all other personal identifying information were removed from the

framing documents since the focus of this study is on the analysis of text and not the

persons mentioned in the text.

Parsing the Text

Humans can process (e.g. read) text data in its raw, unstructured form. Processing

text by a computer, however, requires a series of steps to convert the words into a

numeric representation. The most basic representation is a count of the number of times

each word occurs in each document. Before the words can be counted, they must be

extracted from the documents. Parsing generally uses spaces and punctuation to separate

text into individual words.

After parsing out all the words that are found in a corpus, the term list may

contain tens of thousands of terms, some of which provide little value to the analysis.

Therefore parsing may also incorporate algorithms to exclude these extraneous terms.

The parsing process may be further refined by defining “terms.” A term is a distinct item

consisting of either a single word (e.g., “atmosphere,” “enact,” “important”) or a phrase

consisting of two or more words (e.g., “sea level,” “greenhouse gas emission,” “polar ice

cap”). SAS® Text Miner software provides a variety of parsing options. The options


selected for this study were: part of speech tagging, stemming, stop word list, and noun

phrases.

Part of Speech Tagging

Some words that are spelled identically may have different meanings depending

upon the part of speech. For example, consider the word “rose.” As a noun it refers to a

flower. As a verb it is the past tense of a word that means “to ascend.” As an adjective it

is a color that is pale red. Rose may also be a proper name for a woman. For this reason,

“rose” as a noun should be considered distinct from “rose” as a verb, and so on. Part of

speech tagging allows each of these four forms of “rose” to be processed individually in

order to maintain those distinctions. The number of occurrences of the verb “rose” in

each document is generated independently of the occurrences of the noun “rose” and each

is listed in the term list for the document collection. Without the part of speech

qualification, “rose” would appear once in the term list and the number of occurrences

would be the sum of occurrences of all forms of “rose.”

Stemming

In contrast to part of speech tagging, which keeps some words distinct even when

they are spelled identically, stemming combines all of the grammatical forms of a word

into one canonical form. In effect, stemming reduces the number of terms in the term list

and increases the accuracy by ensuring that multiple forms of a single word are not listed

separately.

Verbs are most often the target of stemming. The verb "go" takes different

forms for its tenses, such as "gone," "went," and "going." Stemming combines all


forms of this verb into the canonical form, “go.” SAS® Text Miner software precedes

the canonical form with a plus sign to indicate the presence of other forms.

Singular and plural forms of a noun are stemmed into the singular form. As with

verbs, SAS® Text Miner software indicates the presence of other forms by preceding the

singular noun with a plus sign.

Removal of Selected Terms

Some parts of speech are considered to be non-informative. For example,

conjunctions, such as “but,” “and,” “or,” are often placed in this category. While

grammatically useful, these words contribute little meaning to the text and can be

removed from the list of terms. The following parts of speech were removed from the

Global Warming corpus: Conjunction, Determiner, Auxiliary or Modal, Preposition,

Pronoun, Participle, Interjection, and Number. This leaves the following informative

parts of speech: Noun, Proper Noun, Verb, Adjective, Adverb, and Abbreviation.

Stemming and the removal of non-informative parts of speech are mechanized

methods that transform the list of terms into a smaller, more meaningful set. A stop word

list allows the analyst to manually specify additional deletions from the list of terms.

Stop words are terms that do not contribute meaning in the context of the analysis that is

being conducted. The determination of stop words should be carefully conducted in

concert with the goals of the researcher (Bilisoly, 2008, p. 245). A basic stop word list

was applied in this analysis. It contained 154 terms, such as “it,” “either,” and “this,” as

well as the individual letters of the alphabet.


Noun Phrases

Phrases are small groups of words that express a single idea. When certain

phrases occur repeatedly throughout the corpus of documents, the ideas represented by

those phrases may be captured by treating the entire phrase as a “term.” Counting

occurrences of “polar bear” and “polar ice cap” can be of more value in the analysis of

the corpus than counting the individual occurrences of “polar,” “bear,” “ice,” and “cap.”

The option in SAS® Text Miner software to identify noun phrases was selected in this

study.
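
In this study, these parsing steps were carried out in SAS® Text Miner. As a rough open-source approximation of the same sequence (tokenization, part of speech tagging, removal of non-informative parts of speech and stop words, and reduction of words to a canonical form), the following Python sketch uses NLTK; the example sentence, tag filter, lemmatizer, and stop word list are illustrative assumptions, and noun phrase extraction is omitted.

    # Approximate parsing: tokenize, tag parts of speech, drop non-informative parts
    # of speech and stop words, and reduce each word to a canonical form.
    # Not the SAS Text Miner parser; noun phrase extraction is omitted here.
    # Assumes the NLTK 'punkt', 'averaged_perceptron_tagger', 'stopwords', and
    # 'wordnet' resources have been downloaded, e.g. via nltk.download('punkt').
    import nltk
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer

    text = "Polar ice caps are melting as greenhouse gas emissions rise."
    tagged = nltk.pos_tag(nltk.word_tokenize(text))   # [(word, part-of-speech tag), ...]

    keep_tags = ("NN", "VB", "JJ", "RB")              # nouns, verbs, adjectives, adverbs
    stop_words = set(stopwords.words("english"))
    lemmatizer = WordNetLemmatizer()                  # crude stand-in for stemming

    terms = [lemmatizer.lemmatize(word.lower())
             for word, tag in tagged
             if tag.startswith(keep_tags) and word.lower() not in stop_words]
    print(terms)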

Term Weighting

A term-document matrix, with rows representing terms and columns representing

documents, was constructed. Each cell in the matrix was populated with the log-entropy

weighted term frequency (SAS Institute, Inc., 2003) as follows:

\[ w_{ij} = \log_2\left(f_{ij}+1\right)\left(1 + \frac{\sum_{j=1}^{n} p_{ij}\,\log_2 p_{ij}}{\log_2 n}\right), \qquad p_{ij} = \frac{f_{ij}}{g_i} \tag{1} \]

where

fij is the frequency of term i in document j

gi is the number of times that term i appears in the entire corpus

n is the number of documents in the corpus.

Using weights, rather than raw frequencies, results in a more realistic

representation of the importance of the terms (Manning & Schütze, 1999, pp. 541-543).

If the term “dog” appears once in one document and five times in a second document,

one may surmise that the second document is more likely than the first to be focused on

the topic of dogs. But, if “dog” occurs fifty times in one document and one hundred


times in another document, those extra fifty occurrences in the second document do not

necessarily mean that the second document is twice as likely to be about dogs. In this

admittedly extreme case, one would tend to state merely that both documents are

definitely about dogs. Logarithmic scaling of the term frequencies dampens the effect of

the higher counts, thus imparting a more reasonable measure of the term relevance.

Another important relation is obtained by incorporating the global frequency of

the term in the calculation of term weight. If the term “dog” appears frequently in many,

or all, of the documents, then that term will not be useful in distinguishing the documents

from one another. This is reflected in a lower term weight. This could be the case when

all documents in the corpus are about dog obedience training. If, however, the entire

corpus is about veterinary care for small pets and “dog” appears frequently in a small

number of documents, then “dog” will have a higher term weight. In this case, “dog” can

be of value when separating the documents by types of pets.
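
A small numerical sketch of the log-entropy weighting in equation (1) is given below. The toy term-document counts are invented for illustration, and the code is not the SAS® Text Miner implementation.

    # Log-entropy weighting of a toy term-document count matrix
    # (rows = terms, columns = documents), following equation (1).
    import numpy as np

    counts = np.array([[1.0, 5.0, 0.0],      # f_ij: frequency of term i in document j
                       [0.0, 2.0, 3.0]])
    n = counts.shape[1]                      # number of documents in the corpus
    g = counts.sum(axis=1, keepdims=True)    # g_i: frequency of term i in the entire corpus

    p = counts / g                           # p_ij = f_ij / g_i
    with np.errstate(divide="ignore", invalid="ignore"):
        plogp = np.where(p > 0, p * np.log2(p), 0.0)
    entropy = 1.0 + plogp.sum(axis=1, keepdims=True) / np.log2(n)   # global entropy weight

    weights = np.log2(counts + 1.0) * entropy                       # w_ij from equation (1)
    print(weights)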

Singular Value Decomposition

This corpus of 6,531 documents contains over 23,000 terms after selecting only

the most informative parts of speech, applying a stop word list, and performing

stemming. The term-document matrix is quite sparse, meaning most cells contain zeroes.

This sparse, high-dimensional matrix cannot be processed efficiently or effectively.

Thus, singular value decomposition (SVD) is performed to transform the matrix into a

lower dimensional, compact form while still retaining most of the information

represented by the original matrix.

SVD decomposes a rectangular matrix into three matrices, which we shall refer to

as U, D, and V. The original matrix can be reconstructed by multiplication as UDVᵀ. A


term-document matrix is more often than not rectangular since there are typically many

more terms than there are documents. The matrix U describes the original rows (terms)

as vectors of derived factor values. V describes the original columns (documents)

similarly. These factor values will be referred to as dimensions. D is a diagonal matrix

containing singular scaling values ordered from largest to smallest. In text mining, the

dimensionality is typically reduced by eliminating dimensions from U, D, and V,

beginning with the smallest values in D. When the dimensionality is reduced in this

manner, the reconstructed matrix, UDVᵀ, is a least-squares best fit of the original matrix

(Landauer, Foltz, & Laham, 1998).

For this study, only the first one hundred dimensions were calculated, giving a

truncated decomposition of the term-document matrix. Truncating to one hundred

dimensions is the default software setting and generally provides more than enough

information for classification modeling. In the truncated singular value decomposition of

the term-document matrix, the matrix Vᵀ contains a row for each document and a column

for each of the one hundred SVD dimensions. Now, rather than representing each

document as a vector of weights for tens of thousands of terms, each document is

represented as a vector in a space of one hundred dimensions. These one hundred SVD

dimension values for each document become the input variables for the classification

models.
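
The truncated decomposition can be reproduced with standard numerical libraries. The SciPy call and the toy matrix sizes below are illustrative stand-ins for the SAS® Text Miner computation on the full corpus.

    # Truncated SVD of a sparse (weighted) term-document matrix A, keeping k dimensions.
    # The random sparse matrix is a toy stand-in for the real terms-by-documents matrix.
    import numpy as np
    from scipy.sparse import random as sparse_random
    from scipy.sparse.linalg import svds

    A = sparse_random(2000, 300, density=0.01, random_state=0).tocsc()   # terms x documents

    k = 100                                  # number of SVD dimensions retained
    U, s, Vt = svds(A, k=k)                  # A is approximated by U @ diag(s) @ Vt

    order = np.argsort(s)[::-1]              # svds returns singular values in ascending order
    U, s, Vt = U[:, order], s[order], Vt[order, :]

    doc_vectors = Vt.T                       # one row per document, k SVD dimension values
    print(doc_vectors.shape)                 # (300, 100)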

The popularity of SVD in the field of text analytics is due to more than just its

ability to reduce dimensionality. The truncation of the decomposition addresses, at least

in part, the problem of synonymy (Manning, Raghavan, & Schütze, 2008, pp. 378-382).

Synonymy occurs when two or more different words have the same meaning, such as


“road” and “street.” Suppose we compare two document vectors that are composed of

term weights. One document contains the term “road” but not “street” and the other

document mentions “street” but not “road.” The term weight for “street” is zero in the

first document, as is the term weight for “road” in the second document. A calculation of

similarity between the two documents could rate the documents less similar than would a

human reader. Truncated SVD reflects similar co-occurrences of terms in the dimension

values and thus approximates the manner in which a human perceives similarity between

words (Landauer, Foltz, & Laham, 1998, p. 4).


Exploratory Data Analysis

The end result of the text preprocessing is a data set that contains a row for each

document and a column for each of the one hundred SVD dimensions. Each document is

now represented as a series of continuous numerical values. The first step in exploring

the corpus is to cluster the documents on the basis of the SVD dimension values. The

second step is exploration of the individual SVD dimensions as candidates for predictor

variables in the classification models.

SAS® Text Miner software provides two algorithms for clustering documents:

Expectation Maximization and Hierarchical. The documents in the Global Warming

corpus were clustered using Expectation Maximization with the SVD dimension values

serving as input variables. Twenty-one clusters were selected. Selection was an iterative

process beginning with thirteen clusters determined by the software. The documents

were then reclustered with the number of clusters specified by the analyst until the

twenty-one clusters were chosen. This set of clusters was the smallest set that clearly

represented distinct concepts.

For each cluster, SAS® Text Miner software returns a list of descriptive terms,

the number and proportion of documents, and the root mean squared standard deviation.

The descriptive terms for each cluster are the terms with the highest binomial

probabilities (SAS Institute, Inc., 2003). This calculation is defined in equation (2). The

clusters were profiled by consideration of the descriptive terms and occasional browsing

of individual documents that were assigned to each cluster. As a result of profiling, a

name was assigned to each cluster to identify its contents. The clusters are described in

detail in Appendix B.


\[ \mathrm{Prob} = F\!\left(k,\; N,\; \frac{t}{n}\right) \tag{2} \]

where

F is the binomial cumulative distribution function

k is the number of times the term appears in cluster j

N is the number of documents in cluster j

t is the total number of times the term appears in all clusters

n is the number of documents in the corpus.
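
As an illustration only, the sketch below pairs a generic Gaussian-mixture implementation of Expectation Maximization clustering with a binomial-CDF descriptive-term score following one reading of equation (2). Neither is the SAS® Text Miner implementation, and the document vectors and counts are toy values.

    # Expectation Maximization clustering of documents on their SVD values (via a
    # Gaussian mixture) and a binomial-CDF descriptive-term score as in equation (2).
    import numpy as np
    from scipy.stats import binom
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)
    doc_vectors = rng.normal(size=(500, 100))        # toy SVD values, one row per document

    em = GaussianMixture(n_components=21, covariance_type="diag", random_state=0)
    cluster_id = em.fit_predict(doc_vectors)         # 21 clusters, as selected in this study

    # Descriptive-term score for one term in one cluster (one reading of equation (2)):
    k, N = 12, 40       # term occurrences in cluster j; documents in cluster j
    t, n = 30, 500      # term occurrences in all clusters; documents in the corpus
    score = binom.cdf(k, N, t / n)
    print(cluster_id[:10], round(score, 4))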

The proportions of framing and non-framing documents in each cluster (Figure 5)

reveal that the framing documents are primarily present in six clusters. The fact that the

framing documents are clustered together indicates there are detectable attributes, as

described by the SVD dimension values, which are shared by the framing documents.

Figure 5. Proportion of framing documents in clusters.


Likewise, the proportions of framing documents by task (Figure 6) demonstrate a

tendency for these documents to cluster together, although not as cleanly as framing vs.

non-framing. Note that some diagnostic documents, and to a lesser degree the prognostic

and motivational documents, are found in clusters that are primarily non-framing (e.g.,

Effect of GW on Human Populations). This suggests that the diagnostic framing

documents may be the most difficult to model since they have some commonality with

documents of other classes.

Figure 6. Proportion of framing documents by framing task in clusters.


Preparation for Classification Modeling

Training and Test Data Sets

The corpus of documents was randomly split into a training data set of 4,358

documents and a test data set of 2,173 documents. The two-thirds, one-third split was

chosen to provide a sufficient number of documents to train the models while retaining an

adequate set of documents to assess model performance. Random selection was within

document class in order to maintain class proportions for both data sets (Figure 7).

Figure 7. Proportions of the four target classes within the training and test data sets.

The training data set was processed as described previously in “Preprocessing of

Text Documents” to obtain SVD dimension values for each document. This

preprocessing, starting with the raw text documents, was performed only on the training

set documents to ensure that the SVD values for the training set were not influenced by

the documents in the test data set.


Balancing the Training Data Set

The training data set was balanced by random removal of non-framing documents

until the proportion of framing documents reached approximately 20%. Balancing to

20% provides the model with a sufficient number of target observations for training

(Larose, 2006, pp. 298-299; Pyle, 2003, p. 396). Class proportions in the balanced

training data set are shown in Figure 8.

Figure 8. Class proportions within the balanced training data set.
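
The stratified split and the balancing step can be sketched as follows. The DataFrame, its doc_class column, and the class counts are invented for illustration; this is not the procedure used to build the actual training and test sets.

    # Stratified two-thirds / one-third split, then balance the training set by
    # randomly removing non-framing documents until roughly 20% of it is framing.
    # The toy corpus and the 'doc_class' column name are illustrative assumptions.
    import pandas as pd
    from sklearn.model_selection import train_test_split

    def balance_training_set(train, target_rate=0.20, seed=0):
        framing = train[train["doc_class"] != "non-framing"]
        non_framing = train[train["doc_class"] == "non-framing"]
        keep_n = int(len(framing) * (1 - target_rate) / target_rate)
        kept = non_framing.sample(n=min(keep_n, len(non_framing)), random_state=seed)
        return pd.concat([framing, kept]).sample(frac=1, random_state=seed)

    corpus = pd.DataFrame({
        "doc_class": ["non-framing"] * 300 + ["diagnostic"] * 20
                     + ["prognostic"] * 15 + ["motivational"] * 10,
    })
    train, test = train_test_split(corpus, test_size=1/3,
                                   stratify=corpus["doc_class"], random_state=0)
    balanced = balance_training_set(train)
    print(balanced["doc_class"].value_counts(normalize=True))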

Derivation of Dummy Variables

The predictor variables in this study are the continuous SVD dimension values.

In the case of logistic regression, interpretation of the coefficient for a continuous

variable requires the assumption that the logit is linear in this variable (Hosmer &

Lemeshow, 2000, p. 63). If this does not hold, then various transformations may be

applied. In general, the process of transforming variables can be laborious, but is rather

straightforward when the target variable is dichotomous. When the target variable is

polychotomous, as it is for the framing task classification models, meeting the linearity

assumption can be impossible since a particular variable may require different

transformations for each possible target value.

A solution to this dilemma is found in the creation of a set of one or more

dichotomous dummy variables for each continuous predictor variable (Larose, 2006, p.


176). Each dummy variable is assigned a value of one if the predictor variable is within a

certain range, and zero otherwise. A form of bivariate analysis was employed to define

the number of dummy variables and their associated ranges for each continuous

predictor. This analysis reveals ranges of values for each SVD variable that are

positively or negatively associated with target variable values. This analysis also

reveals ranges of SVD values that display no relationship to the target variable values.

For some predictors, the bivariate analysis revealed little or no relationship between

any values of the predictor and the target variable. In those cases, the predictor variables

were removed from consideration.

At this point, it should be noted that SVD variables are ordered such that SVD_1

explains more of the variance in the term-document matrix than SVD_2, and so on. Thus,

one may expect that the higher ranked SVD variables, such as SVD_1 and SVD_2, will be

more effective predictors in a classification model. This was evident in the bivariate

analysis where SVD variables beyond SVD_35 displayed little relationship, positive or

negative, with the target variables, so this analysis was discontinued after SVD_35.

The training data set was used for the bivariate analysis, to avoid the influence of

the documents that were set aside for testing. Initially, for each SVD variable, all

documents were binned into five percent intervals, meaning each interval encompassed a

range of SVD values such that approximately five percent of the documents in the corpus

had values within that range. This analysis was performed by coding a SAS® software

program which produced a table illustrating the relationship of each predictor variable

with the target variable.


A set of these tables were produced for dichotomous target variables representing

each possible class: non-framing, diagnostic, prognostic, and motivational. Table 1

illustrates the bivariate analysis for the non-framing target variable and the SVD_23

continuous predictor variable, which is representative of the bivariate analysis performed

for all combinations of target and predictor variables. The ratios in the table are defined

as:

If (% of F) > (% of NF) then Ratio = (% of F) / (% of NF)

If (% of F) < (% of NF) then Ratio = - (% of NF) / (% of F)

where

NF is the number of non-framing documents in the 5% interval

F is the number of framing documents in the 5% interval

% of NF is the percent of non-framing documents from the training data set

(1,551 as shown in Figure 8) that are in the 5% interval.

% of F is the percent of framing documents from the training data set (396

as shown in Figure 8) that are in the 5% interval.
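For illustration, the binning and ratio calculation just described can be sketched in a few lines of Python. The thesis performed this step with a SAS program, so the names used here (bivariate_table, svd_col, target_col) are illustrative assumptions rather than the original code.

import numpy as np
import pandas as pd

def bivariate_table(df, svd_col, target_col, n_bins=20):
    """Bin one SVD variable into twenty ~5% intervals and compute the F/NF ratio.

    df         : one row per training document
    svd_col    : name of the continuous SVD column, e.g. "SVD_23"
    target_col : 1 for framing documents, 0 for non-framing documents
    """
    # qcut places roughly five percent of the documents in each interval
    bins = pd.qcut(df[svd_col], q=n_bins, duplicates="drop")
    counts = df.groupby(bins, observed=True)[target_col].agg(F="sum", N="count")
    counts["NF"] = counts["N"] - counts["F"]
    counts["pct_F"] = counts["F"] / counts["F"].sum()
    counts["pct_NF"] = counts["NF"] / counts["NF"].sum()
    # Ratio as defined above: positive when framing documents are over-represented
    # in the interval, negative when they are under-represented. Intervals with no
    # framing documents produce an infinite ratio, shown as "." in the tables.
    counts["ratio"] = np.where(counts["pct_F"] >= counts["pct_NF"],
                               counts["pct_F"] / counts["pct_NF"],
                               -counts["pct_NF"] / counts["pct_F"])
    return counts[["NF", "F", "pct_NF", "pct_F", "ratio"]]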

The horizontal solid lines define dummy variable ranges and were added to the

table by the analyst, as were the numbers “1” through “4” and the letters “N” which can

be seen on the right-hand side of Table 1. Four dummy variables were created for

SVD_23, one for each range of values labeled “1” through “4.” The letter “N” represents

a neutral interval, described in more detail below.


Table 1

Bivariate Analysis of SVD_23 for Non-Framing (NF) vs. Framing (F) Classification

NF     F     % of NF     % of F     5% Interval           Ratio     Dummy Variable Range

41 57 2.64% 14.39% LOW -< -.1431 5.45 1

74 22 4.77% 5.55% -.1431 -< -.1047 1.16

77 20 4.96% 5.05% -.1047 -< -.0838 1.02 N

88 11 5.67% 2.77% -.0838 -< -.0674 -2.04 2

86 12 5.54% 3.03% -.0674 -< -.0552 -1.83

83 16 5.35% 4.04% -.0552 -< -.0458 -1.32

83 11 5.35% 2.77% -.0458 -< -.0363 -1.93

83 14 5.35% 3.53% -.0363 -< -.0278 -1.51

83 15 5.35% 3.78% -.0278-<-.0172 -1.41

79 18 5.09% 4.54% -.0172 -< -.0073 -1.12

83 15 5.35% 3.78% -.0073 -< .0019 -1.41

85 13 5.48% 3.28% .0019 -< .0116 -1.67

86 10 5.54% 2.52% .0116 -< .0206 -2.20

83 15 5.35% 3.78% .0206 -< .0300 -1.41

85 14 5.48% 3.53% .0300 -< .0416 -1.55

82 13 5.28% 3.28% .0416 -< .0545 -1.61

77 21 4.96% 5.30% .0545 -< .0700 1.07 N

76 21 4.90% 5.30% .0700-< .0885 1.08

72 26 4.64% 6.56% .0885 -< .1127 1.41 3

45 52 2.90% 13.13% .1127 - HIGH 4.53 4


The dummy variables are designed to capture ranges of SVD_23 that exhibit solid

positive (or negative) ratios between the two possible target values. Ratios within

approximately ±1.10 may be considered neutral. These intervals are labeled “N” and do

not require dummy variables. Neutral intervals are also used to separate adjacent positive

and negative intervals.

The SVD_23 dummy variables were calculated in Clementine “Derive” nodes and

were named SVD23_01, SVD23_02, SVD23_03, and SVD23_04. For example, the

derivation of SVD23_02 is:

if (SVD_23 >= -0.0838) and (SVD_23 < 0.0545) then
    1
else
    0
endif
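For readers working outside Clementine, the same derivation can be sketched in Python. The boundaries below are the SVD23_02 range given above; the helper name derive_dummy is an illustrative assumption, not part of the thesis workflow.

import pandas as pd  # df below is assumed to be a pandas DataFrame of SVD values

def derive_dummy(df, source_col, new_col, lower, upper):
    """Half-open interval dummy: 1 if lower <= value < upper, else 0."""
    df[new_col] = ((df[source_col] >= lower) & (df[source_col] < upper)).astype(int)
    return df

# Equivalent of the Clementine "Derive" node shown above:
# df = derive_dummy(df, "SVD_23", "SVD23_02", -0.0838, 0.0545)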

In addition to allowing for the assumption of linearity in logistic regression, these

dummy variables generalize the information that is obtained from the original predictor

variable, thus reducing the risk of over-fitting a model. Rather than following the process

just described, dummy variables could be created from binning the predictor variables

into equal-sized bins. Many software packages, including Clementine, provide

convenient tools to do so. The method employed here, adapted from the author's

personal experience in credit risk modeling, requires additional time and effort, but

results in more meaningful dummy variables. Raymond Anderson, in his book on credit


scoring methods (Anderson, 2007, p. 358), outlines a similar process for defining dummy

variables in retail credit scoring. He recommends first creating fine classes consisting of

small, equal ranges for each predictor variable, and then combining those classes into

logical groupings that display similar risk. The fine and grouped classes correspond to

the 5% intervals and subsequent dummy variable ranges that were incorporated in this

analysis. Anderson further explains (2007, p. 359) the necessity for at least one neutral

interval containing classes that are near average risk, that have insufficient data, or that

do not logically fit with any of the defined dummy variables.

Two sets of dummy variables were calculated. The first set, listed in Appendix C,

was derived from the bivariate analysis of the SVD values and the non-framing target

variable. These dummy variables are intended for use in the first model, framing versus

non-framing classification.

The second set of dummy variables, described in Appendix D, was derived from

simultaneous consideration of the bivariate analysis for each of the framing classes. The

dummy variables in this second set are to be employed in the finer classification of

framing documents as belonging to one of the three framing tasks: diagnostic,

prognostic, or motivational. For this reason, intervals for each SVD variable were

defined to accommodate all three target variables.


Profiling Selected SVD Variables

The SVD variables are assigned rather nondescript names, SVD_1, SVD_2, etc.,

by the decomposition software. With some effort, an analyst can gain insight into the

nature of each SVD variable. In the course of developing classification models for this

study, two SVD variables were singled out as being particularly effective in the

classification models: SVD_2 and SVD_6. SVD_2 appeared to be very significant in

models which separate framing and non-framing documents. SVD_6 was important when

classifying documents by framing task. After model development was completed, these

two variables were profiled in order to understand why they were so important in the

final models. The profiling of these two variables is presented at this point in the paper in

order to provide the reader with additional understanding prior to the description of the

modeling process.

In the truncated singular value decomposition of the term-document matrix, the

matrix U contains a row for each term and a column for each of the one hundred SVD

dimensions. Thus, utilizing the U matrix, SVD dimension values are available for the

terms. These values, along with the bivariate analysis for each variable, are now used to

examine SVD_2 and SVD_6 in detail.


SVD_2

Analysis of SVD_2 clearly demonstrates a strong relationship between this

variable and the separation of framing and non-framing documents. The bivariate

analysis of documents for SVD_2 vs. the framing/non-framing target variable (Table 2)

shows that low values of SVD_2 (less than -0.0051) have a heavily negative association

with the framing class. From -0.0051 to 0.0687 SVD_2 is neutral. Above 0.0687 SVD_2

becomes increasingly positively associated with the framing class, except for the highest

values which are neutral.

So we may deduce that documents with a low SVD_2 value are more likely to be

non-framing and documents with higher SVD_2 values are likely to be framing. The

SVD_2 values for the documents appear to be an excellent indicator for framing vs. non-

framing documents.


Table 2

Bivariate Analysis of SVD_2 for Non-Framing (NF) vs. Framing (F) Classification

NF     F     % of NF     % of F     5% Interval           Ratio

96 0 6.18% 0.0% LOW -< -.4183 .

98 0 6.31% 0.0% -.4183 -< -.3868 .

98 0 6.31% 0.0% -.3868 -< -.3636 .

98 0 6.31% 0.0% -.3636 -< -.3433 .

96 0 6.18% 0.0% -.3433 -< -.3204 .

99 0 6.38% 0.0% -.3204 -< -.3028 .

96 0 6.18% 0.0% -.3028 -< -.2815 .

97 0 6.25% 0.0% -.2815 -< -.2613 .

98 0 6.31% 0.0% -.2613 -< -.2378 .

97 0 6.25% 0.0% -.2378 -< -.2122 .

99 0 6.38% 0.0% -.2122 -< -.1812 .

94 3 6.06% 0.75% -.1812 -< -.1394 -8.00

93 3 5.99% 0.75% -.1394 -< -.1054 -7.91

91 7 5.86% 1.76% -.1054 -< -.0617 -3.32

84 13 5.41% 3.28% -.0617 -< -.0051 -1.65

79 19 5.09% 4.79% -.0051 -< .0687 -1.06

31 66 1.99% 16.66% .0687 -< .1420 8.34

4 93 0.25% 23.48% .1420 -< .1945 91.06

101 94 24.93% 23.73% .1945 - HIGH -1.05

Note. There are 1,551 non-framing documents and 396 framing documents.


In order to discover the concepts represented by SVD_2, the terms associated with

SVD_2 must be inspected. Table 3 contains the terms that are most positively and most

negatively associated with SVD_2. A plus sign in front of a term indicates that term has

been stemmed. The terms associated with lower SVD_2 values appear to be analytic

(study, investigate, analysis), factual (result, data, observed, quantitative), and related to

climate change (condition, temporal, variation, climatic, temperature). None of the

terms in the list of the lowest SVD_2 values indicates passion or social involvement.

The terms associated with high SVD_2 values are quite different from those

associated with the lower values. These terms are social (people, own, personal),

emotional (care, heart, justice), and above all, these terms are action-oriented (action,

work, hear, call, do, encourage, bring, join). The objects of the actions are also evident

(business, public, government).


Table 3

Terms Associated with SVD_2

25 Terms with Lowest SVD_2 Values 25 Terms with Highest SVD_2 Values

Term POS Value Term POS Value

+ result Noun -0.5918 + people Noun 0.4497

+ study Noun -0.5804 own Adjective 0.4485

data Noun -0.5780 + thing Noun 0.4333

+ condition Noun -0.5730 just Adverb 0.4222

+ indicate Verb -0.5649 + action Noun 0.4138

temporal Adjective -0.5496 + live Verb 0.4110

+ sensitivity Noun -0.5361 + work Verb 0.4107

observed Adjective -0.5324 + business Noun 0.4092

+ investigate Verb -0.5322 + hear Verb 0.4076

+ variation Noun -0.5308 + call Verb 0.3941

+ region Noun -0.5299 + care Verb 0.3938

climatic Adjective -0.5298 personal Adjective 0.3927

northern Adjective -0.5277 + do Verb 0.3901

+ period Noun -0.5181 + encourage Verb 0.3863

+ analysis Noun -0.5174 + heart Noun 0.3842

+ factor Noun -0.5170 public Adjective 0.3837

sensitive Adjective -0.5087 + call Noun 0.3822

quantitative Adjective -0.4974 + government Noun 0.3804

potential Adjective -0.4915 + bring Verb 0.3773

+ suggest Verb -0.4910 nationally Adverb 0.3751

dynamics Noun -0.4910 + happen Verb 0.3745

spatial Adjective -0.4885 enough Adverb 0.3744

+ temperature Noun -0.4884 justice Noun 0.3727

+ model Noun -0.4867 + join Verb 0.3709

+ surface Noun -0.4808 accessible Adjective 0.3692


SVD_6

SVD_6 was singled out as the most important predictor variable by the two

models that were trained to discriminate between the three core framing tasks. The

bivariate analysis for SVD_6 suggests that diagnostic documents are negatively

associated with low values of SVD_6 and positively associated with high values of

SVD_6 (Table 4). In contrast, prognostic documents are positively associated with low

values of SVD_6 and negatively associated with high values of SVD_6 (Table 5).

Motivational documents are negatively associated with low values and positively

associated with high values of SVD_6 (Table 6).


Table 4

Bivariate Analysis of SVD_6 for Non-Diagnostic (ND) vs. Diagnostic (D) Classification

ND     D     % of ND     % of D     5% Interval           Ratio

95 2 5.11% 2.24% LOW -< -.2289 -2.28

96 2 5.16% 2.24% -.2289 -< -.1562 -2.30

96 2 5.16% 2.24% -.1562 -< -.1183 -2.30

95 1 5.11% 1.12% -.1183 -< -.0959 -4.55

96 0 5.16% 0.00% -.0959 -< -.0788 .

98 1 5.27% 1.12% -.0788 -< -.0641 -4.69

97 1 5.22% 1.12% -.0641 -< -.0507 -4.65

92 4 4.95% 4.49% -.0507 -< -.0375 -1.10

96 2 5.16% 2.24% -.0375 -< -.0245 -2.30

95 3 5.11% 3.37% -.0245 -< -.0121 -1.52

96 1 5.16% 1.12% -.0121 -< -.0007 -4.60

95 2 5.11% 2.24% -.0007 -< .0106 -2.28

96 2 5.16% 2.24% .0106 -< .0203 -2.30

93 3 5.00% 3.37% .0203 -< .0326 -1.48

95 3 5.11% 3.37% .0326 -< .0451 -1.52

94 2 5.05% 2.24% .0451 -< .0594 -2.25

95 4 5.11% 4.49% .0594 -< .0775 -1.14

86 13 4.62% 14.60% .0775 -< .1051 3.16

79 16 4.25% 17.97% .1051 -< .1417 4.23

73 25 3.92% 28.08% .1417 - HIGH 7.15

Note. There are 1,858 non-diagnostic documents and 89 diagnostic documents.


Table 5

Bivariate Analysis of SVD_6 for Non-Prognostic (NP) vs. Prognostic (P) Classification

NP     P     % of NP     % of P     5% Interval           Ratio

30 67 1.65% 50.75% LOW -< -.2289 30.71

80 18 4.40% 13.63% -.2289 -< -.1562 3.09

86 12 4.73% 9.09% -.1562 -< -.1183 1.92

93 3 5.12% 2.27% -.1183 -< -.0959 -2.25

95 1 5.23% 0.75% -.0959 -< -.0788 -6.91

96 3 5.28% 2.27% -.0788 -< -.0641 -2.33

97 1 5.34% 0.75% -.0641 -< -.0507 -7.05

95 1 5.23% 0.75% -.0507 -< -.0375 -6.91

96 2 5.28% 1.51% -.0375 -< -.0245 -3.49

98 0 5.39% 0.00% -.0245 -< -.0121 .

96 1 5.28% 0.75% -.0121 -< -.0007 -6.98

94 3 5.17% 2.27% -.0007 -< .0106 -2.28

96 2 5.28% 1.51% .0106 -< .0203 -3.49

94 2 5.17% 1.51% .0203 -< .0326 -3.42

95 3 5.23% 2.27% .0326 -< .0451 -2.30

94 2 5.17% 1.51% .0451 -< .0594 -3.42

97 2 5.34% 1.51% .0594 -< .0775 -3.53

98 1 5.39% 0.75% .0775 -< .1051 -7.13

90 5 4.95% 3.78% .1051 -< .1417 -1.31

95 3 5.23% 2.27% .1417 - HIGH -2.30

Note. There are 1,815 non-prognostic documents and 132 prognostic documents.


Table 6

Bivariate Analysis of SVD_6 for Non-Motivational (NM) vs. Motivational (M) Classification

NM     M     % of NM     % of M     5% Interval           Ratio

96 1 5.41% 0.57% LOW -< -.2289 -9.48

96 2 5.41% 1.14% -.2289 -< -.1562 -4.74

94 4 5.30% 2.28% -.1562 -< -.1183 -2.32

92 4 5.19% 2.28% -.1183 -< -.0959 -2.27

95 1 5.36% 0.57% -.0959 -< -.0788 -9.38

93 6 5.24% 3.42% -.0788 -< -.0641 -1.53

95 3 5.36% 1.71% -.0641 -< -.0507 -3.13

92 4 5.19% 2.28% -.0507 -< -.0375 -2.27

94 4 5.30% 2.28% -.0375 -< -.0245 -2.32

96 2 5.41% 1.14% -.0245 -< -.0121 -4.74

88 9 4.96% 5.14% -.0121 -< -.0007 1.04

95 2 5.36% 1.14% -.0007 -< .0106 -4.69

96 2 5.41% 1.14% .0106 -< .0203 -4.74

92 4 5.19% 2.28% .0203 -< .0326 -2.27

92 6 5.19% 3.42% .0326 -< .0451 -1.51

88 8 4.96% 4.57% .0451 -< .0594 -1.09

80 19 4.51% 10.85% .0594 -< .0775 2.40

78 21 4.40% 12.00% .0775 -< .1051 2.73

65 30 3.66% 17.14% .1051 -< .1417 4.67

55 43 3.10% 24.57% .1417 - HIGH 7.92

Note. There are 1,772 non-motivational documents and 175 motivational documents.


Investigation of the terms associated with SVD_6 (Table 7) should give more

insight into its relationship with the framing task classifications. The bivariate analyses

of SVD_6 showed that prognostic documents are positively associated with low SVD_6

values, in contrast with diagnostic and motivational documents, which are negatively

associated with low SVD_6 values. This association is quite apparent from observing the

twenty-five terms with the lowest SVD_6 values. These terms are indicative of solutions

to global warming such as reducing home energy consumption and options to reduce

driving one's personal vehicle.

The positive association of high SVD_6 values with diagnostic and motivational

documents, as indicated by the bivariate analyses, is not immediately apparent from

perusal of the twenty-five terms that have the highest SVD_6 values. The terms protest

and bandwagon, which could occur in motivational documents, are present in this list.

Climate changing and Guatemala could be indicative of diagnostic documents.

Apparently, more than just twenty-five terms should be analyzed for the high SVD_6

values. The SVD_6 bivariate analyses show that diagnostic and motivational documents

are most positively associated with SVD_6 values that are 0.1417 and greater. There are

over 600 terms in that interval, which are listed in Appendix E.

Terms highlighted in yellow in Appendix E appear to be motivational. These

terms include types of actions: tough action, global action, urgent action, real action,

international action, and future action. There are verbs and phrases defining the activity:

lobby, act, commit, send, fight, gather, and win. Events can be found in this list: protest,

meeting, training, strategy sessions, and rally. The emotional appeal for action is also

evident: exciting, urgently, anger, and alarm. And finally, the hallmark of a


motivational document is the emphasis on people gathering together to take action:

bandwagon, group, movement, organize, mobilize, global movement, and friends. The

presence of these terms in the list of terms associated with high SVD_6 values gives

credence to the usefulness of SVD_6 in distinguishing motivational framing documents.

Terms that seem to be diagnostic are highlighted in green in Appendix E.

Diagnostic documents define a problem, often place blame and identify victims and

consequences. The problem definition is seen in the presence of terms such as: climate

change, climate-changing, climate crisis, rising sea levels, danger, devastating, drastic

increase, environmental destruction, and dangerous climate change. Placing blame is

indicated by the terms: polluter, rich countries, oil giant, foreign oil, aviation emissions,

interest-group, and corporation. Victims are also abundant in the list of terms: aquatic

life, women, coastal regions, human health, low-income, poor, amazon, mangrove forest,

wildlife, and rainforest. The association of these terms with high SVD_6 values validates

the usefulness of SVD_6 for identifying diagnostic framing documents.


Table 7

Terms Associated with SVD_6

25 Terms with Lowest SVD_6 Values 25 Terms with Highest SVD_6 Values

Term POS Value Term POS Value

+ appliance Noun -0.4575 increased instances Noun Group 0.6423

+ pound Noun -0.4562 + giant Noun 0.6374

+ save Verb -0.4530 + protest Verb 0.6362

+ thermostat Noun -0.4521 climate-changing Adjective 0.6341

+ saving Noun -0.4259 bbc Prop 0.6305

heating Noun -0.4237 + cite Verb 0.6304

compact Adjective -0.4190 + cooperative Noun 0.6302

+ install Verb -0.4142 guatemala Prop 0.6299

saving Adjective -0.4095 world economy Noun Group 0.6293

+ home Noun -0.4089 other biofuels Noun Group 0.6282

+ cost Verb -0.4071 + proponent Noun 0.6271

electricity Noun -0.4064 booming Adjective 0.6260

carpooling Verb -0.4056 bandwagon Noun 0.6251

properly Adverb -0.4041 massive amounts Noun Group 0.6251

energy use Noun Group -0.4020 corn ethanol Noun Group 0.6251

american household Noun Group -0.3983 political Prop 0.6248

+ household Noun -0.3982 + price Verb 0.6237

+ heater Noun -0.3976 useless Adjective 0.6232

+ ton Noun -0.3941 + hill Prop 0.6223

energy bill Noun Group -0.3930 + acre Noun 0.6223

+ window Noun -0.3917 + commission Verb 0.6219

bonus Noun -0.3910 consolidation Noun 0.6213

telecommuting Prop -0.3860 + herbicide Noun 0.6212

telework Prop -0.3860 + breed Noun 0.6208

+ utility Noun -0.3832 corn Verb 0.6203


Modeling Algorithms

Several modeling algorithms are incorporated in this study. CART, logistic

regression, and neural network models were developed for both the framing/non-framing

model (Model 1) and the non-framing/diagnostic/prognostic/motivational model (Model

2). In addition, the results of these modeling algorithms were incorporated into

combination (ensemble) models. A brief discussion of each modeling method follows.

CART Algorithm

Classification and Regression Trees (CART) are a type of decision tree, which

begin with a root node which contains all observations in the data set. The predictor

variables are tested to determine the best manner in which to split the root node

observations into two or more nodes that distinguish the classes. These nodes are placed

below the root node and the process is repeated, building a structure that looks more like

a root system than a tree. The final nodes are called leaf nodes.

The CART method (Breiman, Friedman, Olshen, & Stone, 1984) is a decision tree

algorithm in which each node is split into just two branches. The upper node is referred

to as the parent node and the two nodes beneath each parent node are called child nodes.

The target variable must be discrete, consisting of a finite set of classes. The predictor

variables may be categorical or continuous. Beginning with the root node, the tree grows

by successively splitting each node into two child nodes until a prescribed stopping

criterion is met. The objective is to end up with leaf nodes that are as pure as possible,

meaning each of these nodes contains as high a proportion of observations belonging to a single class as possible.


Breiman et al. (1984) proposed the Gini index of diversity to measure the purity

of class homogeneity in child nodes. CART chooses the splitting criterion such that the

Gini index is minimized. This index is calculated as:

Gini(t_L) = 1 - Σ_j (n_jL / n_L)²    (3)

Gini(t_R) = 1 - Σ_j (n_jR / n_R)²    (4)

Gini_split = (n_L / n) Gini(t_L) + (n_R / n) Gini(t_R)    (5)

where

c is the number of target classes (the sums run over j = 1, …, c)

n is the number of observations in the parent node

n_L is the number of observations in the left child node

n_R is the number of observations in the right child node

n_jL is the number of class j observations in the left child node

n_jR is the number of class j observations in the right child node
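A minimal Python sketch of this calculation, assuming the class counts for a candidate split are already known, is given below; it illustrates equations (3) through (5) and is not Clementine's internal CART implementation.

import numpy as np

def gini(class_counts):
    """Gini index of one node: 1 minus the sum of squared class proportions."""
    p = np.asarray(class_counts, dtype=float)
    p = p / p.sum()
    return 1.0 - np.sum(p ** 2)

def gini_split(left_counts, right_counts):
    """Weighted Gini index of a candidate binary split (equation 5)."""
    n_left, n_right = sum(left_counts), sum(right_counts)
    n = n_left + n_right
    return (n_left / n) * gini(left_counts) + (n_right / n) * gini(right_counts)

# Example with the balanced training set (396 framing, 1,551 non-framing) and a
# hypothetical dummy variable that sends most framing documents to the left node:
# gini_split(left_counts=[300, 100], right_counts=[96, 1451])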

Stopping criteria prevent the tree from being grown to the point where the model

is overfit. An overfit model will classify training observations with very high precision,

but is not generalized enough to perform well when classifying new data. To prevent

overfitting, the decision trees in this study are limited to six levels below the root node,

and require a minimum of 2% of all observations in a parent node and 1% in a child

node.


Logistic Regression Algorithm

Logistic regression describes the relationship between one or more predictor

variables and a categorical response. The response, or target variable, can be

dichotomous, having two values, or polychotomous, having more than two values.

Model 1 in this study has a dichotomous target variable with the two values “Framing”

and “Non-Framing.” Model 2 has a polychotomous target variable with four values:

“Non-Framing,” “Diagnostic,” “Prognostic,” and “Motivational.” Simple logistic

regression incorporates one predictor, whereas multiple logistic regression incorporates two or more. The SVD values that have been calculated from the corpus of

documents in this study provide up to one hundred possible predictor variables.

Therefore, multiple logistic regression is used.

For a multiple logistic regression model with response Y and p independent

predictor variables described by the vector x’ = (x1, x2, …, xp), the conditional mean of Y

given x is denoted as:

E(Y | x) = π(x).    (6)

Logistic regression models this using:

π(x) = e^{g(x)} / (1 + e^{g(x)})    (7)

where g(x) is a linear function of the parameters:

g(x) = β0 + β1x1 + β2x2 + … + βpxp.    (8)


Note that equation (7) implies that

g(x) = ln[ π(x) / (1 - π(x)) ] = β0 + β1x1 + … + βpxp.    (9)

This is called the logit transformation. The logistic regression parameters, β’ = (β0, β1,

…, βp) are estimated by maximum likelihood estimation. The log likelihood function is

expressed as:

L(β | x) = Σ_{i=1}^{n} { y_i ln[π(x_i)] + (1 - y_i) ln[1 - π(x_i)] }    (10)

Differentiating L(β|x) with respect to each parameter, setting the result equal to zero, and

solving generates the maximum likelihood estimator.
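As a concrete illustration, the sketch below fits such a model by maximum likelihood with the statsmodels library. The thesis used Clementine's stepwise logistic regression node, so this code, and names such as dummy_cols and train_df, are assumptions made for illustration only.

import statsmodels.api as sm

def fit_logit(train_df, dummy_cols, target_col="NON_FRAMING"):
    """Fit logit(P(non-framing)) = b0 + b1*x1 + ... + bp*xp by maximum likelihood."""
    X = sm.add_constant(train_df[dummy_cols])   # adds the intercept term b0
    result = sm.Logit(train_df[target_col], X).fit(disp=0)
    # result.params holds the estimated betas; exponentiating them gives odds ratios
    return result

# Example (dummy variable names drawn from Appendix C):
# result = fit_logit(train_df, ["SVD1_01", "SVD2_02", "SVD12_03"])
# print(result.summary())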

Neural Network Algorithm

Artificial Neural Networks, or simply neural networks, are simplified models of

biological nervous systems such as the human brain. A biological neuron collects

information from other neurons through dendrites. The neuron processes the information

and fires when a threshold is attained, sending information through an axon to other

neurons.

A neural network is composed of layers that correspond to the neuron functions.

The input layer consists of a node for each predictor, which is connected to one or more

hidden layers, the last of which connects to an output layer having one or more nodes. If

the target variable is dichotomous or consists of ordered classes, then one output node

may be tested against threshold values to determine the classification. When the target

variable consists of multiple unordered classes, the output layer will contain one node for


each class. The output node with the highest value for an observation determines the

classification of that observation.

The nodes in each layer are connected to all nodes in the next layer, but are not

connected to each other. The connections are weighted, initially with random weights. A

hidden layer node produces a single value from a linear combination of the inputs to the

node and their associated weights. This value is fed into a nonlinear activation function,

which mimics the firing of an actual neuron. In neural networks, the activation function

is most often the sigmoid function (Larose, 2005, p. 133):

y = 1 / (1 + e^{-x}).    (11)

The data for each observation moves through the network, producing an output

that is compared to the true target value. The error is used to adjust the connection

weights and the process is repeated, gradually improving the model results.
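The following is a minimal sketch of a single forward pass through such a network, using the sigmoid activation of equation (11); it illustrates the mechanism only and is not the Clementine network trained later in this study.

import numpy as np

def sigmoid(x):
    """Sigmoid activation function (equation 11)."""
    return 1.0 / (1.0 + np.exp(-x))

def forward_pass(x, w_hidden, b_hidden, w_out, b_out):
    """Input layer -> one hidden layer -> single output node.

    x        : vector of normalized predictor values for one document
    w_hidden : (n_inputs, n_hidden) connection weights
    w_out    : (n_hidden, 1) connection weights to the output node
    """
    hidden = sigmoid(x @ w_hidden + b_hidden)
    return sigmoid(hidden @ w_out + b_out)   # compared to a threshold to classify

# Example with random initial weights, 30 inputs, and 3 hidden nodes:
# rng = np.random.default_rng(0)
# output = forward_pass(rng.random(30), rng.normal(size=(30, 3)), np.zeros(3),
#                       rng.normal(size=(3, 1)), np.zeros(1))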

Combination Models

Combining the results of two or more models can capitalize on the strong points

and minimize the weak spots of the individual models. Two methods of combining

models are employed in this study: Voting and Mean Model Response Probabilities.

Voting is akin to conducting an election. Each model “votes” for the

classification that it has calculated for a particular observation. The winning

classification can be selected by a majority vote, or by other voting rules. Another way to

tally the votes is to classify an observation as “X” only if all models vote for “X.” In


addition, the classification of “X” could require just one model to vote for “X.” (Larose,

2006, pp. 304-306)

Mean Model Response Probability is an alternative method to combine the results

of several models. This method brings into play the confidences for the decisions made

by the models. For each contributing model, the Model Response Probability (MRP) is

calculated using the confidence variable that Clementine provides for a scored

observation. For the dichotomous Model 1 response, the MRP is calculated as:

if classification = "Framing" then
    MRP = 0.5 + (reported confidence) / 2
else
    MRP = 0.5 - (reported confidence) / 2
endif

The Mean MRP for all models is then calculated as the sum of the individual

model MRP values divided by the number of models. A normalized histogram of the

Mean MRP overlaid with the target variable is produced. This histogram provides

guidance in determining a cutoff value of the Mean MRP that separates the classes.

(Larose, 2006, pp. 308-312)
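A small Python sketch of the MRP and Mean MRP calculation, assuming each model's classification and confidence are already available, follows; the cutoff value is chosen from the histogram rather than hard-coded.

def model_response_probability(predicted_class, confidence):
    """MRP for one dichotomous model, as defined above."""
    if predicted_class == "Framing":
        return 0.5 + confidence / 2.0
    return 0.5 - confidence / 2.0

def mean_mrp(predictions):
    """predictions: (predicted_class, confidence) pairs, one per contributing model."""
    return sum(model_response_probability(c, conf)
               for c, conf in predictions) / len(predictions)

# Example for one document scored by three models; the cutoff separating the
# classes is read from the normalized histogram:
# score = mean_mrp([("Framing", 0.91), ("Framing", 0.74), ("Non-Framing", 0.55)])
# classification = "Framing" if score > cutoff else "Non-Framing"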


Evaluation Metrics

The models in this study are designed to discover small numbers of interesting

(framing) documents that are buried within a collection of mostly uninteresting (non-

framing) documents. For traditional measures of error, the large number of non-framing

documents will have a disproportionate influence in communicating the effectiveness of

the model. Intuitively one does not mind if a few uninteresting documents are included

when a model plucks framing documents from a large mass of texts, as long as most of

the interesting documents are found. However, if those few uninteresting documents

become numerous, they thwart the intentions behind developing a classification model in

the first place. Four measures, precision, recall, F1 measure, and accuracy, reflect these

points of view and are often used to evaluate models that deal with text (Manning,

Raghavan, & Schütze, 2008, pp. 142-144). These are the metrics that will serve to

evaluate the models that are developed in this study.

For the framing/non-framing classification, precision is the proportion of framing

documents that truly are framing out of all documents classified as framing by the model.

Recall is the proportion of framing documents that were correctly identified by the model

out of all framing documents that exist in the data set. The balanced F1 measure is the

equally weighted harmonic mean of precision and recall. Accuracy is the proportion of

correctly classified documents. These metrics are defined as:

Precision = TP / (TP + FP)    (12)

Recall = TP / (TP + FN)    (13)

F1 = (2 × Precision × Recall) / (Precision + Recall)    (14)

Accuracy = (TP + TN) / (TP + TN + FP + FN)    (15)

where TP, FP, FN, and TN are the numbers of true positive, false positive, false negative, and true negative classifications, respectively.
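As a check on these definitions, the short Python sketch below computes all four measures from the counts of a single 2×2 confusion matrix; the values in the comment reproduce the CART dummy-variable results reported later in Table 10.

def binary_metrics(tp, fp, fn, tn):
    """Precision, recall, F1, and accuracy from one 2x2 confusion matrix."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall, "F1": f1, "accuracy": accuracy}

# Example using the CART Model 1 dummy-variable confusion matrix (Table 8):
# binary_metrics(tp=196, fp=30, fn=17, tn=1930)
# matches Table 10: precision 0.8673, recall 0.9202, F1 0.8929, accuracy 0.9784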

The non-framing/diagnostic/prognostic/motivational model requires a slight

modification to the four metrics. Precision, recall, F1 measure, and accuracy are

calculated individually for each of the four classes as if there were a separate confusion

matrix for each class (e.g. motivational versus not motivational). Overall precision,

recall, F1 measure, and accuracy are then calculated.

For a polychotomous target variable, there are two methods for calculating overall

precision, recall, F1 measure, and accuracy. Macro-averaging calculates the average of

the individual evaluation measures over all classes. For example, the macro-averaged

precision is equal to the sum of the non-framing precision, the diagnostic precision, the

prognostic precision, and the motivational precision, divided by four. Micro-averaging

creates one large confusion matrix by pooling the per-class tables, and then calculates the evaluation measures over that pooled table. In

situations where the classes contain similar numbers of observations, the results from

these two methods are comparable. When the number of documents varies greatly

between classes, as it does in this study (non-framing is much larger than the other three

classes), the large classes dominate micro-averaging results. To avoid this effect, macro-

averaging will be employed to evaluate overall model precision, recall, F1 measure, and

accuracy for the polychotomous target variables.
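A macro-average is simply the unweighted mean of the per-class values; a minimal sketch, assuming binary_metrics() from the previous example has been applied one-vs-rest to each class, is shown below.

def macro_average(per_class_metrics):
    """Average each metric over the per-class (one-vs-rest) results."""
    metric_names = next(iter(per_class_metrics.values())).keys()
    n_classes = len(per_class_metrics)
    return {m: sum(v[m] for v in per_class_metrics.values()) / n_classes
            for m in metric_names}

# Example with illustrative values for two of the four classes:
# macro_average({"Diagnostic":   {"precision": 0.16, "recall": 0.63},
#                "Motivational": {"precision": 0.82, "recall": 0.84}})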


Model 1: Framing/Non-Framing Classification

The goal of Model 1 is to accurately separate framing texts from non-framing

texts. A variety of modeling algorithms (CART, logistic regression, and neural network)

were employed using continuous or dummy predictor variables. Two target variables

were calculated for Model 1: Framing_Name, a dichotomous string variable with the

values “Framing” or “Non-Framing,” and NON_FRAMING, a dichotomous integer

variable with the value 1 indicating a non-framing document or 0 indicating a framing

document. These target variables portray the same information, but provide the option of

using a string or numeric target at the discretion of the modeler.

CART Model 1

Training: CART Model 1

Decision trees do not require a linear relationship between the predictors and the target. Thus,

both the continuous SVD variables and the calculated SVD dummy variables are

candidates for CART predictors. A CART model can use categorical target variables that

are either numeric or string. Either of the Model 1 target variables would be acceptable

to the CART algorithm and would generate equivalent results. Framing_Name was

chosen as the target variable because it is more descriptive.

Two CART models were developed. One used the SVD dummy variables listed

in Appendix C as predictor variables. The other used the continuous SVD variables,

SVD_1 through SVD_100 as predictor variables. Both models were trained using the

balanced training set of documents.


Validation: CART Model 1

The unbalanced test data set was scored by each model. The resulting confusion

matrices are shown in Tables 8 and 9. The rows are the true framing and non-framing

classifications for the documents in the test data set. The columns are the classifications

generated by the CART models. Each cell contains the cross-tabulated number of

documents.

Table 8

CART Model 1 Dummy Variables Confusion Matrix

True Classification     Model: Framing     Model: Non-Framing     Total

Framing 196 17 213

Non-Framing 30 1,930 1,960

Total 226 1,947 2,173

Table 9

CART Model 1 SVD Variables Confusion Matrix

True Classification     Model: Framing     Model: Non-Framing     Total

Framing 208 5 213

Non-Framing 107 1,853 1,960

Total 315 1,858 2,173


For the purpose of calculating evaluation measures for this model, a classification

of “Framing” is considered to be “positive,” and a classification of “Non-Framing” is

“negative.” The CART model that used dummy variables has thirty false positives

(documents that are actually negative, but were classified as positive by the model) and

seventeen false negatives (documents that are positive, but were classified as negative by

the model) in the confusion matrix. The evaluation measures for these two CART

models are shown in Table 10.

Table 10

CART Model 1 Evaluation

Evaluation Metric DV Model SVD Model

Precision 0.8673 0.6603

Recall 0.9202 0.9765

F1 Measure 0.8929 0.7879

Accuracy 0.9784 0.9485

Selection: CART Model 1

The dummy variable model's accuracy of 0.9784 is higher than the accuracy of

0.9485 for the SVD variable model. The SVD variable model had high recall, but low

precision. It found all but five of the framing documents in the test data set, but one-third

of the documents that it identified as framing were not actually framing. The dummy variable model had

slightly lower recall than the SVD variable model, but returned a small number of false

positives resulting in higher precision. The differences in precision and recall for these

two models are reflected in the F1 measure, which is higher for the dummy variable


model. Therefore, the dummy variable model is selected as the best model. Figure 9

illustrates the structure of the decision tree for the dummy variable CART Model 1. This

model employs just three of the SVD dummy variables. The most important split is on

SVD2_02, which has a value of 1 when SVD_2 ≥ 0.0687. The second most important

split is on SVD12_03, which has a value of 1 when SVD_12 ≥ 0.0596. The third split is

on SVD8_04, which has a value of 1 when SVD_8 ≥ 0.0329.

Figure 9. The decision tree generated by CART Model 1.


Variable Importance: CART Model 1

The option to calculate variable importance was selected in the CART model

node. The variable importance values convey the relative importance that each variable

contributes in making the classification for this particular model. These values for the

predictor variables sum to 1.0. Figure 10 displays the importance of predictor variables

as determined by the CART model. This graph indicates that SVD2_02 is by far the most

important variable in terms of classifying a document as either framing or non-framing.

Figure 10. CART Model 1 variable importance.


Logistic Regression Model 1

Training: Logistic Regression Model 1

The continuous SVD variables were not used as predictor variables for the logistic

regression model since they require consideration of the linearity assumption. The input

variables for Logistic Regression Model 1 were the framing/non-framing dummy

variables listed in Appendix C. Since the dummy variables have just two possible values,

0 and 1, linearity is not a problem. An additional advantage to using dummy variables

rises from the fact that these variables are based on ranges of their associated SVD

variables. This assists in reducing the risk of over-fitting the model. In effect, they force

a decision on a larger range of values and prevent a detailed decision, perhaps to several

decimal places, that could perfectly separate the training data classes, but result in inaccurate

classification of new, unseen data.

The target variable for this model was NON_FRAMING, a dichotomous integer

variable with the value 1 indicating a non-framing document and 0 indicating a framing

document. Stepwise variable selection was chosen in Clementine's logistic regression

node.

The estimated logistic regression equation for Model 1 is:

g(x) = -7.798 + 1.625 I(SVD1_01=0) + 1.009 I(SVD2_01=0) + 8.038 I(SVD2_02=0) - 1.836 I(SVD3_01=0)
       + 1.829 I(SVD5_05=0) + 2.500 I(SVD5_06=0) - 1.224 I(SVD6_02=0) + 1.390 I(SVD6_03=0)
       + 1.700 I(SVD6_05=0) - 1.766 I(SVD8_04=0) - 1.396 I(SVD9_01=0) + 1.402 I(SVD11_01=0)
       - 1.952 I(SVD12_03=0) + 1.341 I(SVD22_01=0)    (16)

where I(SVDk_m=0) equals one when the corresponding dummy variable is zero and zero otherwise; the coefficients are those reported in Table 11.


The logistic regression parameter estimates and other statistics as provided by

Clementine are shown in Table 11. The coefficients of the parameters are found in the

column labeled “B” as well as in the estimated logistic regression equation. The

information in Table 11 will be referenced in the interpretation of the effect of the

predictors on the response.

Table 11

Results of Logistic Regression for Model 1

                                                                        95.0% Confidence Interval for Exp(B)
NON_FRAMINGa        B     Std. Error     Wald     df     Sig.     Exp(B)     Lower Bound     Upper Bound

Intercept -7.798 1.726 20.422 1 0.000

SVD1_01=0 1.625 0.368 19.520 1 0.000 5.080 2.470 10.448

SVD2_01=0 1.009 0.464 4.736 1 0.030 2.743 1.105 6.804

SVD2_02=0 8.038 0.566 201.561 1 0.000 3095.514 1020.534 9389.402

SVD3_01=0 -1.836 0.410 20.047 1 0.000 0.159 0.071 0.356

SVD5_05=0 1.829 0.551 11.035 1 0.001 6.230 2.117 18.334

SVD5_06=0 2.500 0.575 18.943 1 0.000 12.188 3.953 37.580

SVD6_02=0 -1.224 0.361 11.499 1 0.001 0.294 0.145 0.596

SVD6_03=0 1.390 0.605 5.268 1 0.022 4.013 1.225 13.148

SVD6_05=0 1.700 0.652 6.808 1 0.009 5.475 1.527 19.636

SVD8_04=0 -1.766 0.381 21.453 1 0.000 0.171 0.081 0.361

SVD9_01=0 -1.396 0.535 6.803 1 0.009 0.248 0.087 0.707

SVD11_01=0 1.402 0.417 11.279 1 0.001 4.063 1.793 9.208

SVD12_03=0 -1.952 0.410 22.608 1 0.000 0.142 0.064 0.318

SVD22_01=0 1.341 0.364 13.591 1 0.000 3.823 1.874 7.799

aThe reference category is 0.


Effect of the Predictors on the Response

All fourteen predictor variables in Logistic Regression Model 1 have two possible values, 0 or 1. The odds ratio (OR) for each of these variables is in the column labeled “Exp(B)” in Table 11 and is calculated from the coefficient B as:

OR = e^B    (18)

The odds ratio compares the odds that a document is non-framing when one of these dichotomous predictors equals zero with the odds when it equals one. An odds ratio of one means that the predictor has no effect on the odds of a non-framing classification. If it is greater than one, a value of zero for the predictor increases the odds that the document is non-framing. Conversely, an odds ratio that is less than one means that a value of zero makes the document more likely to be framing.

For example, the odds ratio of 5.080 for SVD1_01 can be interpreted as “if SVD1_01 for a particular document is equal to zero, then the odds that the document is non-framing are more than five times the odds for a document with SVD1_01 equal to one.” In the case of SVD2_02, with an odds ratio of 3095.514, one may say that “a value of zero for SVD2_02 multiplies the odds of a non-framing classification by more than 3,000.” Apparently, SVD2_02 is a powerful predictor of the framing status of a document.

The Wald test for the significance of each of the parameters can be found in Table

11 under the column labeled “Wald.” This statistic is calculated as the squared ratio of the coefficient estimate to its standard error, and the p-value for each variable, obtained by comparing the Wald statistic to a chi-square distribution with one degree of freedom, is in the column labeled “Sig.” Each of the dichotomous variables has a p-value


less than 0.05 which implies that these variables are useful in the model for predicting

framing vs. non-framing.

The 95% confidence intervals for the odds ratios, e^β, can be found in the last two columns of Table 11. The value one is not contained in any of the intervals for the predictor

variables in consideration. So, with 95% confidence, one can state that the odds ratio for

each of these variables is not one. Thus, all fourteen predictor variables are significant in

this model.

Validation: Logistic Regression Model 1

The unbalanced test data set was scored by Logistic Regression Model 1. Table

12 shows the confusion matrix. As with the CART models, the true classifications are in

the rows and the model classifications are in the columns. Each cell contains the cross-

tabulated number of documents.

Table 12

Logistic Regression Model 1 Confusion Matrix

True Classification     Model: Framing     Model: Non-Framing     Total

Framing 203 10 213

Non-Framing 44 1,916 1,960

Total 247 1,926 2,173

The evaluation metrics for Logistic Regression Model 1 are shown in Table 13.

This model discovered seven more framing documents than did CART Model 1, which is

reflected in the higher recall. The precision for Logistic Regression Model 1 is lower


than the precision for CART Model 1, due to the higher number of false positives

produced by the logistic regression model. Overall, CART Model 1 (F1 measure =

0.8929 and accuracy = 0.9784) slightly outperformed Logistic Regression Model 1 (F1

measure = 0.8826 and accuracy = 0.9751).

Table 13

Logistic Regression Model 1 Evaluation

Evaluation Metric Value

Precision 0.8219

Recall 0.9531

F1 Measure 0.8826

Accuracy 0.9751

Variable Importance: Logistic Regression Model 1

The option to calculate variable importance was selected in the logistic regression

model node, and Figure 11 displays the importance of the predictor variables in the

logistic regression model. This assessment of variable importance agrees in part with the

CART variable importance, namely that SVD2_02 is by far the most important variable in

regards to classifying a document as framing or non-framing.


Figure 11. Logistic Regression Model 1 variable importance.


Neural Network Model 1

Training: Neural Network Model 1

The target variable for this model is NON_FRAMING, a variable with 1 indicating

a non-framing document and 0 indicating a framing document. All predictor variables

for a neural network model must be normalized to values between zero and one (Larose,

2005, p. 129). To meet this requirement, min-max normalization was applied to the

continuous SVD variables, and the resulting variables were used in Neural Network

Model 1. The formula to normalize SVDx is:

SVD_x* = (SVD_x - min(SVD_x)) / (max(SVD_x) - min(SVD_x))    (17)
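A one-line Python sketch of this normalization, applied column by column to the SVD variables and shown for illustration only, is:

import pandas as pd

def min_max_normalize(series):
    """Rescale one continuous SVD variable to the [0, 1] range (equation 17)."""
    return (series - series.min()) / (series.max() - series.min())

# Example: df["SVD_2_mm"] = min_max_normalize(df["SVD_2"])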

The neural network model was trained using the balanced training set of

documents. Clementine reported an estimated accuracy of 99.026% for this model. The

numbers of neurons were: thirty for the input layer, three for the hidden layer, and one

for the output layer.

Validation: Neural Network Model 1

The unbalanced test data set was scored by the neural network model. The

resulting confusion matrix is displayed in Table 14, with rows and columns arranged as

for the CART and logistic regression models.


Table 14

Neural Network Model 1 Confusion Matrix

True Classification     Model: Framing     Model: Non-Framing     Total

Framing 206 7 213

Non-Framing 3 1,957 1,960

Total 209 1,964 2,173

The evaluation metrics for Neural Network Model 1 are shown in Table 15. This

model surpassed the previous models in all four metrics. The F1 measure, 0.9763,

outstrips both the CART and logistic regression models, whose F1 measures were 0.8929 and 0.8826,

respectively. The accuracy for this model, 0.9954, is much better than the accuracy for

either CART Model 1 (0.9784) or Logistic Regression Model 1 (0.9751).

Table 15

Neural Network Model 1 Evaluation

Evaluation Metric Value

Precision 0.9856

Recall 0.9671

F1 Measure 0.9763

Accuracy 0.9954


Variable Importance: Neural Network Model 1

Calculation of variable importance was selected in the neural network model

node. The resulting list of variables is displayed in Figure 12, which singles out SVD_2

as the most important variable. This is in line with the variable importance results from

the CART and logistic regression models where the SVD2_02 dummy variable was

established as most important.

Figure 12. Neural Network Model 1 variable importance.


Voting Combination Model 1

Three voting models were created to combine the results of the CART, logistic

regression, and neural network models:

(a) Classify the observation as “Framing” if one or more of the models

generated a “Framing” classification.

(b) Classify the observation as “Framing” if two or more of the models

generated a “Framing” classification.

(c) Classify the observation as “Framing” only if all three of the models

generated a “Framing” classification.
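The three rules can be sketched in a few lines of Python; the function name vote and the rule labels are illustrative assumptions rather than part of the Clementine streams.

def vote(classifications, rule):
    """Combine the Framing/Non-Framing votes of the three contributing models.

    rule: "any"      -> Framing if at least one model votes Framing   (Model 1a)
          "majority" -> Framing if two or more models vote Framing    (Model 1b)
          "all"      -> Framing only if every model votes Framing     (Model 1c)
    """
    votes_for_framing = sum(1 for c in classifications if c == "Framing")
    needed = {"any": 1, "majority": 2, "all": len(classifications)}[rule]
    return "Framing" if votes_for_framing >= needed else "Non-Framing"

# Example: CART, logistic regression, and neural network votes for one document
# vote(["Framing", "Non-Framing", "Framing"], rule="all")  # -> "Non-Framing"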

The confusion matrices for the three voting models are shown in Tables 16

through 18, and the evaluation metrics for each of the voting models are in Table 19.

Table 16

Confusion Matrix for Voting Model 1a: 1 or More Models = “Framing”

True Classification     Model: Framing     Model: Non-Framing     Total

Framing 209 4 213

Non-Framing 52 1,908 1,960

Total 261 1,912 2,173


Table 17

Confusion Matrix for Voting Model 1b: 2 or More Models = “Framing”

True Classification     Model: Framing     Model: Non-Framing     Total

Framing 203 10 213

Non-Framing 23 1,937 1,960

Total 226 1,947 2,173

Table 18

Confusion Matrix for Voting Model 1c: All 3 Models = “Framing”

True Classification     Model: Framing     Model: Non-Framing     Total

Framing 193 20 213

Non-Framing 2 1,958 1,960

Total 195 1,978 2,173

Voting Model 1c had much higher precision, due to just two false positives, but

lower recall than the other two voting models. Both the F1 measure and accuracy for

Voting Model 1c were higher than those of the other two models, resulting in its selection as the

best overall performer of the three voting models.


Table 19

Voting Combination Model 1 Evaluation

Evaluation Metric Voting 1a Voting 1b Voting 1c

Precision 0.8008 0.8982 0.9897

Recall 0.9812 0.9531 0.9061

F1 Measure 0.8819 0.9248 0.9461

Accuracy 0.9742 0.9848 0.9899


Mean Model Response Probability Combination Model 1

The Mean Model Response Probability (MMRP) for Model 1 was calculated as

the sum of Model Response Probabilities for the CART, logistic regression, and neural

network models divided by three. Figure 13 displays a histogram of the MMRP with a

yellow band indicating the cutoff that was chosen to separate framing from non-framing.

MMRP values greater than 0.346 were classified as framing and values less than or equal

to 0.346 were classified as non-framing.

Figure 13. The Mean Model Response Probability histogram for Model 1.

The confusion matrix for the MMRP Combination Model 1 is shown in Table 20.

There are seven false positives and fifteen false negatives. The associated evaluation measures are in Table 21. The F1 measure, 0.9474, is slightly higher than that for Voting Model 1c, which had an F1 measure of 0.9461. The accuracy is 0.9899, which is exactly

equivalent to Voting Model 1c.


Table 20

Mean MRP Combination Model 1 Confusion Matrix

True Classification     Model: Framing     Model: Non-Framing     Total

Framing 198 15 213

Non-Framing 7 1,953 1,960

Total 205 1,968 2,173

Table 21

Mean MRP Combination Model Evaluation

Evaluation Metric Value

Precision 0.9659

Recall 0.9296

F1 Measure 0.9474

Accuracy 0.9899


Selection of Final Model 1

The selection of the model which best performs for the framing/non-framing

classification model is based upon the evaluation measures generated from classifying the

test data set. The evaluation measures for all candidate models are listed in Table 22.

Table 22

Model 1 Candidates by Decreasing Accuracy

Model Precision Recall F1 Measure Accuracy

Neural Network 0.9856 0.9671 0.9763 0.9954

Mean MRP 0.9659 0.9296 0.9474 0.9899

Voting 1c 0.9897 0.9061 0.9461 0.9899

Voting 1b 0.8982 0.9531 0.9248 0.9848

CART (Dummy Variables) 0.8673 0.9202 0.8929 0.9784

Logistic Regression 0.8219 0.9531 0.8826 0.9751

Voting 1a 0.8008 0.9812 0.8819 0.9742

The neural network model has the highest F1 measure, 0.9763, and the highest

accuracy, 0.9954, and is thus, in general, the best performer for the framing/non-framing

model. In practice, though, the selection of a model will hinge upon the intended use,

which could result in the selection of a different model. If one is interested in ensuring

that all instances of framing documents are found and would accept a high false positive

rate, then the Voting 1a Model would be chosen. This model correctly flagged 209 out of

the 213 framing documents in the test data set as being framing giving the highest recall

of 0.9812, but it also included 52 false positives in the set of documents that it classified


as framing, resulting in the lowest precision, 0.8008. If one requires high precision in

correctly identifying framing documents, then the Voting 1c Model would be chosen.

This model classified 195 documents as framing and only two of those documents were

in reality non-framing, giving a precision of 0.9897.


Model 2: Framing Task Classification

Model 1 classified global warming documents as being either framing or non-

framing. Model 2 expands upon the role of Model 1 by further classifying framing

documents as belonging to one of the three core framing tasks: diagnostic, prognostic, or

motivational. CART, logistic regression, neural network, and combination models were

created and evaluated to determine the best classifier. The predictor variables were the

same variables used for Model 1: either the SVD variables generated by text mining or

the dummy variables derived from the SVD variables. The target variable for Model 2 is

CAF_Name, a polychotomous string variable with four possible values: “Non-Framing,”

“Diagnostic,” “Prognostic,” or “Motivational.”

Two general approaches will be employed. One is to train a model on the entire

training data set using CAF_Name as the target variable. The second approach is to train

a model to classify just the three core framing tasks using only the framing documents

from the training data set and then combine the resulting model with Neural Network

Model 1, the best overall performer for the framing versus non-framing model, to classify

documents into one of the four classes.

CART Model 2

Both the continuous SVD variables and the calculated SVD dummy variables are

candidates for CART predictors since, as mentioned previously, decision trees do not require a linear relationship between the predictors and the target. Two CART models were

developed. CART Model 2a used the continuous SVD variables, SVD_1 through

SVD_100 as predictor variables. CART Model 2b was a combination of two models: a


CART model trained to classify only the framing tasks and Neural Network Model 1

which provided classification for non-framing documents.

Training: CART Model 2a

All one hundred SVD variables were used as predictor variables in training

CART Model 2a. The resulting decision tree (Figure 14) uses just four of the SVD

variables: SVD_2, SVD_6, SVD_7, and SVD_11.

Figure 14. The decision tree generated for CART Model 2a.


Validation: CART Model 2a

After scoring the test data set with CART Model 2a, a confusion matrix was

produced (Table 23).

Table 23

CART Model 2a Confusion Matrix

True Classification     Model: Non-Framing     Model: Diagnostic     Model: Prognostic     Model: Motivational     Total

Non-Framing 1,853 81 25 1 1,960

Diagnostic 4 20 0 8 32

Prognostic 1 10 47 12 70

Motivational 0 15 3 93 111

Total 1,858 126 75 114 2,173

This model involves four possible document classes. A set of evaluation

measures was calculated for each class, as well as the macro average for each measure

across all classes. The evaluation measures (Table 24) indicate good performance for

CART Model 2a in classifying the motivational framing task, and fair performance for

prognostic. The precision for the diagnostic class was particularly disappointing. Less than 16% of the documents that were classified as diagnostic by the model were correct. The macro-averaged F1 measure, 0.6747, is low due to the model's

performance for the diagnostic class. The macro-averaged accuracy for this model was

0.9632.


Table 24

CART Model 2a Evaluation

Evaluation Metric     Non-Framing     Diagnostic     Prognostic     Motivational     Macro-Average

Precision 0.9973 0.1587 0.6267 0.8158 0.6496

Recall 0.9454 0.6250 0.6714 0.8378 0.7699

F1 Measure 0.9707 0.2532 0.6483 0.8267 0.6747

Accuracy 0.9485 0.9457 0.9765 0.9821 0.9632

Variable Importance: CART Model 2a

The SVD_2 variable was flagged as the most important predictor variable in this

model (Figure 15). Three other SVD variables were also identified as important:

SVD_11, SVD_6, and SVD_7.

Figure 15. CART Model 2a variable importance.


CART Model 2b

CART Model 2b is a combination of a CART model that classifies the framing

documents by core task and Neural Network Model 1 that classifies documents by

framing vs. non-framing. The first step is the development of the CART model for

Diagnostic/Prognostic/Motivational (DPM) classification. There are two possible sets of

predictor variables for this model: the continuous SVD variables and the SVD DPM

dummy variables (Appendix D). Two CART models were developed, one for each set of

predictor variables. These models are labeled CART DPM SVD, using continuous SVD

variables, and CART DPM DV, using dummy predictor variables. The models were

trained using only the framing documents from the training data set.

The framing documents from the test data set were scored with each CART DPM

model. The resultant confusion matrices are shown in Table 25 (CART DPM SVD) and

Table 26 (CART DPM DV).

Table 25

CART DPM SVD Confusion Matrix

True Classification     Model: Diagnostic     Model: Prognostic     Model: Motivational     Total

Diagnostic 20 3 9 32

Prognostic 7 50 13 70

Motivational 7 5 99 111

Total 34 58 121 213


Table 26

CART DPM DV Confusion Matrix

True Classification     Model: Diagnostic     Model: Prognostic     Model: Motivational     Total

Diagnostic 14 4 14 32

Prognostic 7 48 15 70

Motivational 4 4 103 111

Total 25 56 132 213

The evaluation measures (Table 27) were derived from the confusion matrices.

For each model, the macro-averaged recall, precision, F1 measure, and accuracy were

calculated as an indication of overall performance. The only metric in which the DV model

outperformed the SVD model was recall for the motivational class. The CART DPM

SVD model had the highest macro-averaged metrics for all four measures, and was thus

chosen as the model to be combined with the Neural Network Model 1. The decision tree

for CART DPM SVD is illustrated in Figure 16.

Table 27

Evaluation for CART DPM Models

Diagnostic Prognostic Motivational Macro-Average

Metric SVD DV SVD DV SVD DV SVD DV

Precision 0.5882 0.5600 0.8621 0.8571 0.8182 0.7803 0.7562 0.7325

Recall 0.6250 0.4375 0.7143 0.6857 0.8919 0.9279 0.7437 0.6837

F1 Measure 0.6061 0.4912 0.7813 0.7619 0.8534 0.8477 0.7469 0.7003

Accuracy 0.8779 0.8638 0.8685 0.8592 0.8404 0.8263 0.8623 0.8498


Figure 16. The decision tree generated for CART DPM SVD.


Variable Importance for the CART DPM SVD model is shown in Figure 17. Up

to this point, SVD_2 has consistently topped the lists of variable importance. For this

model, SVD_2 sinks to second in importance, being displaced by SVD_6. This chart also

introduces a newcomer to the most important variables, SVD_62. This is the first model

that is aimed solely at classifying the framing tasks. It appears that, for the CART

algorithm, SVD_6 is better at distinguishing the core framing tasks than at the more

general separation of framing and non-framing documents.

Figure 17. CART DPM SVD variable importance.


The CART DPM SVD model was combined with Neural Network Model 1 to

create a new model, CART Model 2b, that can identify all four classes: non-framing,

diagnostic, prognostic, and motivational. The final classification was determined by the

following logic:

if CART_DPM_SVD_conf >= NN_NF_conf then

CART_DPM_SVD

elseif NN_NF = “Framing” then

CART_DPM_SVD

else

"Non-Framing"

endif

In essence, if the CART DPM SVD model has higher confidence in its decision

than Neural Network Model 1, then the observation is classified according to the CART

DPM SVD model. This means the observation is a framing document and is identified as

one of the core framing tasks. If Neural Network Model 1 has the higher confidence and

classified the observation as “Framing,” then the CART DPM SVD model's

classification is used to provide classification by framing task. Finally, if Neural

Network Model 1 has a higher confidence and has classified the observation as “Non-

Framing,” then that will be the final classification for the observation.

In addition to determining the classification of the combined model, the

associated confidence is also assigned to the decision. If the observation was classified

as “Non-Framing,” then the combined model confidence is set to the confidence of


Neural Network Model 1. Otherwise, the confidence of CART DPM SVD becomes the

confidence of the final decision. Carrying forward the confidence in this manner is

required for the final step in developing Model 2, namely the creation of a combined

model. After this logic was applied to define CART Model 2b, a confusion matrix was

generated from scoring the test data set (Table 28). The evaluation metrics for this model

may be found in Table 29.
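For clarity, the combination logic above can also be written as a small function. The following is a minimal Python sketch (illustrative only; the argument names are stand-ins for the class and confidence fields produced by the Clementine models):

    def combine_cart_2b(cart_dpm_class, cart_dpm_conf, nn_class, nn_conf):
        """CART Model 2b logic: combine CART DPM SVD with Neural Network Model 1.

        cart_dpm_class : 'Diagnostic', 'Prognostic', or 'Motivational'
        nn_class       : 'Framing' or 'Non-Framing'
        Returns (final_class, final_confidence).
        """
        if cart_dpm_conf >= nn_conf:
            # CART DPM SVD is more confident: keep its framing-task classification.
            return cart_dpm_class, cart_dpm_conf
        if nn_class == "Framing":
            # The network is more confident but agrees the document is framing;
            # the framing task still comes from CART DPM SVD, as does the confidence.
            return cart_dpm_class, cart_dpm_conf
        # The network is more confident and calls the document non-framing.
        return "Non-Framing", nn_conf

    print(combine_cart_2b("Prognostic", 0.82, "Non-Framing", 0.91))  # ('Non-Framing', 0.91)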

Table 28

CART Combination Model 2b Confusion Matrix

                      Model Classification
True Classification   Non-Framing   Diagnostic   Prognostic   Motivational   Total
Non-Framing           1,935         10           13           2              1,960
Diagnostic            2             19           2            9              32
Prognostic            3             6            48           13             70
Motivational          0             7            5            99             111
Total                 1,940         42           68           123            2,173

CART Model 2b performed quite well for non-framing documents. Its performance on
diagnostic documents was less impressive: it found fewer than 60% of them in the test
data set, and fewer than half of the documents that it tagged as
diagnostic were truly diagnostic. CART Model 2b's performance in classifying

prognostic documents was better. Classification of motivational documents was fairly

good. The model found nearly 90% of the motivational documents in the test data set

and over 80% of the documents that it designated as motivational were correct. The


macro-averaged F1 measure for CART Model 2b is 0.7619 and the macro-averaged

accuracy is 0.9834. Both of these metrics are distinct improvements over the macro-

averaged F1 measure, 0.6747, and macro-averaged accuracy, 0.9632, for CART Model

2a. Therefore, CART Model 2b is selected as the best CART model for Model 2.

Table 29

CART Model 2b Evaluation

Metric        Non-Framing   Diagnostic   Prognostic   Motivational   Macro-Average
Precision     0.9974        0.4524       0.7059       0.8049         0.7401
Recall        0.9872        0.5938       0.6857       0.8919         0.7897
F1 Measure    0.9923        0.5135       0.6957       0.8462         0.7619
Accuracy      0.9862        0.9834       0.9807       0.9834         0.9834


Logistic Regression Model 2

The continuous SVD variables require consideration of the linearity assumption

for logistic regression. Therefore, the calculated Diagnostic/Prognostic/Motivational

(DPM) SVD dummy variables were chosen as the predictor variables for a multinomial

logistic regression model that classifies framing documents by core framing task, and the

model was trained on the framing documents from the balanced training data set. The

target variable was the polychotomous string variable CAF_Name. The resultant logistic

regression model was subsequently combined with Neural Network Model 1 to produce a

model that classifies a new document as being non-framing, diagnostic, prognostic, or

motivational.

Multinomial logistic regression with stepwise variable selection was chosen in

Clementine's logistic regression node. Diagnostic was the reference class. The estimated

logistic regression equation for the prognostic class is:

g_Prognostic(x) = 0.544 − 0.407·[DPM_SVD2_02 = 0] − 1.580·[DPM_SVD3_01 = 0]
                  − 1.596·[DPM_SVD5_01 = 0] + 0.689·[DPM_SVD5_03 = 0]
                  − 3.088·[DPM_SVD6_01 = 0] + 1.722·[DPM_SVD6_03 = 0]
                  − 0.440·[DPM_SVD8_01 = 0] + 1.090·[DPM_SVD8_03 = 0]
                  + 1.711·[DPM_SVD9_02 = 0] + 0.760·[DPM_SVD10_01 = 0]
                  − 0.372·[DPM_SVD11_02 = 0] + 0.113·[DPM_SVD27_01 = 0],    (19)

where [·] is an indicator equal to 1 when the condition in brackets holds and 0 otherwise.


The estimated logistic regression equation for the motivational class is:

g_Motivational(x) = 0.572 − 2.309·[DPM_SVD2_02 = 0] − 1.530·[DPM_SVD3_01 = 0]
                    − 1.315·[DPM_SVD5_01 = 0] + 1.841·[DPM_SVD5_03 = 0]
                    + 0.681·[DPM_SVD6_01 = 0] + 0.682·[DPM_SVD6_03 = 0]
                    − 1.819·[DPM_SVD8_01 = 0] + 1.596·[DPM_SVD8_03 = 0]
                    + 1.087·[DPM_SVD9_02 = 0] + 1.506·[DPM_SVD10_01 = 0]
                    − 1.510·[DPM_SVD11_02 = 0] − 1.248·[DPM_SVD27_01 = 0].    (20)

The logistic regression parameter estimates and other statistics as provided by

Clementine are shown in Table 30. The coefficients, which are maximum likelihood

estimates of the parameters, are found in the column labeled “B.”
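For reference, the two logits above map to class probabilities through the standard baseline-category form with Diagnostic as the reference class. The sketch below only illustrates that computation; the logit values passed in are assumed to have been evaluated from equations (19) and (20):

    import math

    def dpm_probabilities(g_prog, g_motiv):
        """Baseline-category logistic probabilities with Diagnostic as reference.

        g_prog, g_motiv: values of the estimated logits (19) and (20) for a document.
        Returns (P(Diagnostic), P(Prognostic), P(Motivational)).
        """
        denom = 1.0 + math.exp(g_prog) + math.exp(g_motiv)
        return 1.0 / denom, math.exp(g_prog) / denom, math.exp(g_motiv) / denom

    # Example with illustrative logit values, not values from the study:
    print(dpm_probabilities(0.5, 1.2))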


Table 30

Results of Logistic Regression DPM for Model 2

CAF_Name (a)         B        Std. Error   Wald     df   Sig.    Exp(B)   95% CI for Exp(B)
                                                                          Lower     Upper
Motivational
  Intercept          0.572    1.385        0.171    1    0.680
  DPM_SVD2_02=0     -2.309    0.445        26.925   1    0.000   0.099    0.042     0.238
  DPM_SVD3_01=0     -1.530    0.565        7.325    1    0.007   0.217    0.071     0.656
  DPM_SVD5_01=0     -1.315    0.478        7.566    1    0.006   0.269    0.105     0.685
  DPM_SVD5_03=0      1.841    0.701        6.890    1    0.009   6.303    1.594     24.921
  DPM_SVD6_01=0      0.681    0.760        0.803    1    0.370   1.976    0.445     8.769
  DPM_SVD6_03=0      0.682    0.524        1.692    1    0.193   1.977    0.708     5.522
  DPM_SVD8_01=0     -1.819    0.509        12.758   1    0.000   0.162    0.060     0.440
  DPM_SVD8_03=0      1.596    0.621        6.613    1    0.010   4.932    1.462     16.641
  DPM_SVD9_02=0      1.087    0.469        5.377    1    0.020   2.966    1.183     7.434
  DPM_SVD10_01=0     1.506    0.439        11.764   1    0.001   4.511    1.907     10.669
  DPM_SVD11_02=0    -1.510    0.473        10.191   1    0.001   0.221    0.087     0.558
  DPM_SVD27_01=0    -1.248    0.502        6.189    1    0.013   0.287    0.107     0.767
Prognostic
  Intercept          0.544    1.329        0.167    1    0.682
  DPM_SVD2_02=0     -0.407    0.466        0.761    1    0.383   0.666    0.267     1.661
  DPM_SVD3_01=0     -1.580    0.662        5.694    1    0.017   0.206    0.056     0.754
  DPM_SVD5_01=0     -1.596    0.523        9.320    1    0.002   0.203    0.073     0.565
  DPM_SVD5_03=0      0.689    0.624        1.217    1    0.270   1.991    0.586     6.767
  DPM_SVD6_01=0     -3.088    0.622        24.617   1    0.000   0.046    0.013     0.154
  DPM_SVD6_03=0      1.722    0.564        9.335    1    0.002   5.598    1.854     16.900
  DPM_SVD8_01=0     -0.440    0.541        0.663    1    0.416   0.644    0.223     1.859
  DPM_SVD8_03=0      1.090    0.616        3.127    1    0.077   2.974    0.889     9.956
  DPM_SVD9_02=0      1.711    0.534        10.258   1    0.001   5.534    1.942     15.767
  DPM_SVD10_01=0     0.760    0.460        2.739    1    0.098   2.139    0.869     5.265
  DPM_SVD11_02=0    -0.372    0.482        0.597    1    0.440   0.689    0.268     1.773
  DPM_SVD27_01=0     0.113    0.555        0.041    1    0.839   1.119    0.377     3.325

(a) The reference category is: Diagnostic.


Effect of the Predictors on the Response

All of the predictor variables in this model are dummy variables. The odds ratios

(OR) for each of these variables are in the column labeled “Exp(B)” in Table 30. With a

polychotomous target variable, the odds ratio is interpreted as with a binary target. For

each logit function, the odds ratio expresses the odds of that outcome as compared to the

reference outcome. For the motivational logit, the odds ratio expresses the odds that a

document is motivational when it has a value of zero for the dichotomous predictors. For

example, the odds ratio of 2.966 for DPM_SVD9_02 can be interpreted as “if

DPM_SVD9_02 for a particular document is equal to zero, then the odds of that

document being motivational are 2.966 times greater than the odds of that document

being diagnostic.”
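The odds ratio and its confidence interval in Table 30 follow directly from the coefficient and its standard error. The short sketch below reproduces the DPM_SVD9_02 row of the motivational logit; it is an illustration only, and the function name is a hypothetical convenience rather than Clementine output:

    import math

    def odds_ratio_ci(b, se, z=1.96):
        """Odds ratio exp(B) and its 95% CI, exp(B - z*SE) to exp(B + z*SE)."""
        return math.exp(b), math.exp(b - z * se), math.exp(b + z * se)

    # DPM_SVD9_02 in the motivational logit: B = 1.087, SE = 0.469 (Table 30)
    print(odds_ratio_ci(1.087, 0.469))  # approximately (2.97, 1.18, 7.43)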

The Wald value for each parameter can be found in Table 30 under the column

labeled “Wald.” The p-value, P(|z| > Wald), for each variable is in the column labeled

“Sig.” According to the Wald test for significance, there are a number of predictor

variables that do not meet the α = 0.05 significance level in this model: DPM_SVD6_01

and DPM_SVD6_03 for the motivational logit, and DPM_SVD2_02, DPM_SVD5_03,

DPM_SVD8_01, DPM_SVD8_03, DPM_SVD10_01, DPM_SVD11_02, and

DPM_SVD27_01 for the prognostic logit. Does this mean that those predictor variables

are not significant in this model? No, it does not. Hosmer and Lemeshow (2000, p. 270)

point out that for a multinomial logistic regression model, the likelihood ratio test should

be used to assess significance of the predictor variables. Table 31 lists the likelihood

ratio test values from the Clementine output for this model, which determines that all

predictor variables in this model are significant at the α = 0.05 level of significance.


Table 31

Logistic Regression DPM Likelihood Ratio Tests

Effect            -2 Log Likelihood of Reduced Model   Chi-Square   df   Sig.
Intercept         369.378                              0            0    .
DPM_SVD2_02       408.469                              39.091       2    0.000
DPM_SVD3_01       379.080                              9.703        2    0.008
DPM_SVD5_01       380.880                              11.502       2    0.003
DPM_SVD5_03       377.161                              7.784        2    0.020
DPM_SVD6_01       442.855                              73.478       2    0.000
DPM_SVD6_03       379.969                              10.591       2    0.005
DPM_SVD8_01       386.327                              16.949       2    0.000
DPM_SVD8_03       376.484                              7.107        2    0.029
DPM_SVD9_02       381.435                              12.057       2    0.002
DPM_SVD10_01      381.926                              12.548       2    0.002
DPM_SVD11_02      382.716                              13.338       2    0.001
DPM_SVD27_01      379.202                              9.825        2    0.007

Note. The chi-square statistic is the difference in -2 log-likelihoods between the final

model and a reduced model. The reduced model is formed by omitting an effect from the

final model. The null hypothesis is that all parameters of that effect are 0.
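The likelihood ratio chi-square values in Table 31 can be reproduced from the -2 log-likelihood values in the same table. A minimal sketch follows; the use of SciPy here is an assumption for illustration, since the thesis relied on Clementine's output directly:

    from scipy.stats import chi2

    def lr_test(neg2ll_reduced, neg2ll_full, df):
        """Likelihood ratio test: chi-square is the difference in -2 log-likelihoods."""
        chi_sq = neg2ll_reduced - neg2ll_full
        return chi_sq, chi2.sf(chi_sq, df)

    # DPM_SVD2_02 (Table 31): reduced-model -2LL = 408.469, final-model -2LL = 369.378
    print(lr_test(408.469, 369.378, df=2))  # chi-square ~39.091, p < 0.001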

The 95% confidence intervals for the odds ratios, e^β, can be found in the last two
columns of Table 30. The value one is not contained in any of these intervals for the predictor
variables in this model. So, with 95% confidence, one can state that the odds ratio for


each of these predictor variables is not one. Thus, all of the predictor variables are

significant in this model.

Validation: Logistic Regression DPM Model

The framing records from the test data set were scored by the Logistic Regression

DPM Model. The resulting confusion matrix is shown in Table 32.

Table 32

Logistic Regression DPM Confusion Matrix

                      Model Classification
True Classification   Diagnostic   Prognostic   Motivational   Total
Diagnostic            20           1            11             32
Prognostic            5            52           13             70
Motivational          3            2            106            111
Total                 28           55           130            213

The evaluation measures for the Logistic Regression DPM Model are shown in

Table 33. Recall was lowest for the diagnostic class with 62.5% of the actual diagnostic

documents correctly flagged by the model. The model correctly discovered 95.5% of the

motivational documents in the test data set. The prognostic class had the highest value

for precision: 94.6% of the documents classified as prognostic were correct. The macro-

averaged F1 measure, 0.7928, bested the F1 metric of 0.7469 for the CART DPM Model.

Likewise, the macro-averaged accuracy for the Logistic Regression DPM Model, 0.8905,

was higher than the 0.8623 macro-averaged accuracy for the CART DPM Model.


Table 33

Logistic Regression DPM Evaluation

Metric Diagnostic Prognostic Motivational Macro-Averaged

Precision 0.7143 0.9455 0.8154 0.8250

Recall 0.6250 0.7429 0.9550 0.7743

F1 Measure 0.6667 0.8320 0.8797 0.7928

Accuracy 0.9061 0.9014 0.8638 0.8905

Variable Importance as determined by the Logistic Regression DPM Model

(Figure 18) agrees with the CART DPM Model that SVD_2 is not as important for

classifying framing tasks as it is for distinguishing framing from non-framing documents.

The top two variables in Figure 18 are dummy variables from SVD_6, which was the

most important variable for the CART DPM Model.

Figure 18. Logistic Regression DPM Model variable importance.


The Neural Network Model 1 and Logistic Regression DPM models were

combined to create Logistic Regression Model 2, to identify all four document classes:

non-framing, diagnostic, prognostic, and motivational. The final classification is

determined by the following logic:

if LogReg_Diagnostic_Conf >= NN_NF_conf then

LogReg_Classification

elseif LogReg_Prognostic_Conf >= NN_NF_conf then

LogReg_Classification

elseif LogReg_Motivational_Conf >= NN_NF_conf then

LogReg_Classification

elseif NN_NF = “Non-Framing” then

"Non-Framing"

else

LogReg_Classification

endif

If the Logistic Regression DPM Model has higher confidence in any of its

classifications than the Neural Network Model 1, then the observation is classified

according to the logistic regression model. If the Neural Network Model 1 has a higher

confidence and that model classified the observation as “Non-Framing,” then the “Non-

Framing” classification is made for this combination model. If the Neural Network

Model 1 has a higher confidence and has classified the observation as “Framing,” then

the logistic regression classification by core framing task becomes the final classification.


The associated confidence for the model that determined the final classification is

designated as the confidence for the combined model.

After this logic was applied to find the classification of Logistic Regression

Model 2, a confusion matrix was generated from scoring the test data set (Table 34). The

evaluation measures calculated from this confusion matrix are in Table 35.
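As with CART Model 2b, the combination rule can be stated compactly in code. The sketch below is illustrative; it assumes the confidence carried forward for a framing decision is the logistic regression model's confidence in the winning class, and the argument names stand in for the Clementine output fields:

    def combine_logreg_model_2(lr_class, lr_confs, nn_class, nn_conf):
        """Combine the Logistic Regression DPM model with Neural Network Model 1.

        lr_class : framing-task class from the logistic regression model
        lr_confs : its confidences, keyed 'Diagnostic', 'Prognostic', 'Motivational'
        nn_class : 'Framing' or 'Non-Framing' from Neural Network Model 1
        Returns (final_class, final_confidence).
        """
        if any(conf >= nn_conf for conf in lr_confs.values()):
            # Any logistic regression confidence at or above the network's wins.
            return lr_class, lr_confs[lr_class]
        if nn_class == "Non-Framing":
            return "Non-Framing", nn_conf
        # The network is more confident but says framing: keep the framing task.
        return lr_class, lr_confs[lr_class]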

Table 34

Logistic Regression Model 2 Confusion Matrix

                      Model Classification
True Classification   Non-Framing   Diagnostic   Prognostic   Motivational   Total
Non-Framing           1,940         6            11           3              1,960
Diagnostic            2             18           1            11             32
Prognostic            4             3            50           13             70
Motivational          0             3            2            106            111
Total                 1,946         30           64           133            2,173

The diagnostic class proved once again to be the most difficult to identify. The

model found 56.3% of the diagnostic documents and 60% of those classified as

diagnostic were accurate (Table 35). Over 95% of the motivational documents in the test

data set were found by the model, but 27 of the 133 documents classified as motivational

were false positives. The macro-averaged F1 Measure, 0.7973, is higher than the same

metric, 0.7619 for CART Model 2b. The macro-averaged accuracy for Logistic

Regression Model 2 is 0.9864, which bests the 0.9834 macro-averaged accuracy for

CART Model 2b.


Table 35

Logistic Regression Model 2 Evaluation

Metric        Non-Framing   Diagnostic   Prognostic   Motivational   Macro-Average
Precision     0.9969        0.6000       0.7813       0.7970         0.7938
Recall        0.9898        0.5625       0.7143       0.9550         0.8054
F1 Measure    0.9933        0.5806       0.7463       0.8689         0.7973
Accuracy      0.9880        0.9880       0.9844       0.9853         0.9864


Neural Network Model 2

Training: Neural Network Model 2

The target variable for Neural Network Model 2 was CAF_Name, the

polychotomous string variable used in the other Model 2 models. The predictor variables

for this model were the first thirty-five continuous SVD variables, normalized as for

Neural Network Model 1. Recall that the bivariate analysis of the SVD variables versus

the target variables revealed little value in the variables beyond SVD_35. This model was

trained using the balanced training set of documents.

Clementine reported an estimated accuracy of 95.105% for the neural network

model. The numbers of neurons were: thirty-five for the input layer, three for the hidden

layer, and four for the output layer.
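The network itself was built in Clementine. Purely as an illustration of the reported 35-3-4 architecture, a comparable model could be specified as sketched below; this is an assumption for illustration only, not the software or the exact normalization used in this study:

    from sklearn.neural_network import MLPClassifier
    from sklearn.preprocessing import MinMaxScaler
    from sklearn.pipeline import make_pipeline

    # 35 normalized SVD inputs -> 3 hidden neurons -> 4 output classes
    # (non-framing, diagnostic, prognostic, motivational).
    nn_model_2 = make_pipeline(
        MinMaxScaler(),                                   # normalize the SVD inputs
        MLPClassifier(hidden_layer_sizes=(3,), max_iter=2000, random_state=1),
    )
    # nn_model_2.fit(X_train[:, :35], y_train)        # balanced training set (hypothetical arrays)
    # predictions = nn_model_2.predict(X_test[:, :35])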

Validation: Neural Network Model 2

The unbalanced test data set was scored by Neural Network Model 2. The

resulting confusion matrix is displayed in Table 36 and the evaluation measures are in

Table 37.

Once again, the diagnostic class has the lowest precision and recall. The

prognostic class also has a fairly low recall, returning 64.3% of the prognostic documents

in the test data set. For the motivational class, the model returned over 96% of the

motivational documents in the test data set and 82% of the documents classified as

motivational were correct. The macro-averaged F1 measure for the Neural Network

Model 2 is 0.8221, and the accuracy is 0.9892.


Table 36

Neural Network Model 2 Confusion Matrix

                      Model Classification
True Classification   Non-Framing   Diagnostic   Prognostic   Motivational   Total
Non-Framing           1,954         2            3            1              1,960
Diagnostic            4             20           2            6              32
Prognostic            5             4            45           16             70
Motivational          0             2            2            107            111
Total                 1,963         28           52           130            2,173

Table 37

Neural Network Model 2 Evaluation

Metric        Non-Framing   Diagnostic   Prognostic   Motivational   Macro-Average
Precision     0.9954        0.7143       0.8654       0.8231         0.8495
Recall        0.9969        0.6250       0.6429       0.9640         0.8072
F1 Measure    0.9962        0.6667       0.7377       0.8880         0.8221
Accuracy      0.9931        0.9908       0.9853       0.9876         0.9892


Variable Importance: Neural Network Model 2

The neural network assessment of variable importance (Figure 19) singled out

SVD_2 as the most important variable. SVD_7 and SVD_6 rank second and third in

importance.

Figure 19. Neural Network Model 2 variable importance.


Combination Model 2

Thus far three models have been developed to classify documents as non-framing,

diagnostic, prognostic, or motivational. CART and logistic regression models have been

combined with Neural Network Model 1 to perform this task, and the third model is a

neural network model. These three models were incorporated into a fourth model by

weighted voting.

Judging by the macro-averaged accuracy, Neural Network Model 2 is most

accurate with an accuracy of 0.9892. Logistic Regression Model 2 follows with an

accuracy of 0.9864 and CART Model 2 has an accuracy of 0.9834. A simple weighting

scheme was added to a voting model to weight the neural network model higher than the

logistic regression model which is weighted higher than the CART model. The vote tally

for each class for a document is calculated as:

Vote_c = w_CART · CVote_c + w_LR · LVote_c + w_NN · NVote_c,  with w_NN > w_LR > w_CART,    (21)

where

c is the class (non-framing, diagnostic, prognostic, motivational),
Vote_c is the vote tally for class c,
CVote_c is 1 if CART Model 2 classified the observation as c, and is 0 otherwise,
LVote_c is 1 if Logistic Regression Model 2 classified the observation as c, and is 0 otherwise,
NVote_c is 1 if Neural Network Model 2 classified the observation as c, and is 0 otherwise,

and w_CART, w_LR, and w_NN are the model weights.


If the vote tally generates more than one classification for a document or no

classifications for a document, then the model with the highest confidence determines the

final classification for that document. The confusion matrix for Combination Model 2 is

in Table 38 and the associated evaluation measures are in Table 39.
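A minimal sketch of this weighted voting scheme follows. The specific weight values are assumptions (only their ordering, neural network above logistic regression above CART, is stated above), and ties fall back to the single most confident model as described:

    def combination_model_2(preds, confs, weights=None):
        """Weighted voting across the three Model 2 candidates (a sketch).

        preds   : class predictions keyed 'CART', 'LogReg', 'NN'
        confs   : the corresponding model confidences
        weights : per-model weights; the default values are assumptions that
                  respect the stated ordering w_NN > w_LR > w_CART.
        """
        if weights is None:
            weights = {"CART": 1, "LogReg": 2, "NN": 3}
        tallies = {}
        for model, cls in preds.items():
            tallies[cls] = tallies.get(cls, 0) + weights[model]
        top = max(tallies.values())
        winners = [cls for cls, tally in tallies.items() if tally == top]
        if len(winners) == 1:
            return winners[0]
        # Tie: the most confident model determines the final classification.
        best_model = max(confs, key=confs.get)
        return preds[best_model]

    # Example: the weighted vote favors the two models that agree.
    print(combination_model_2(
        {"CART": "Diagnostic", "LogReg": "Motivational", "NN": "Motivational"},
        {"CART": 0.70, "LogReg": 0.65, "NN": 0.80}))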

Table 38

Combination Model 2 Confusion Matrix

                      Model Classification
True Classification   Non-Framing   Diagnostic   Prognostic   Motivational   Total
Non-Framing           1,948         5            5            2              1,960
Diagnostic            2             22           2            6              32
Prognostic            4             4            47           15             70
Motivational          0             1            2            108            111
Total                 1,954         32           56           131            2,173

Table 39

Combination Model 2 Evaluation

Metric        Non-Framing   Diagnostic   Prognostic   Motivational   Macro-Average
Precision     0.9969        0.6875       0.8393       0.8244         0.8370
Recall        0.9939        0.6875       0.6714       0.9730         0.8314
F1 Measure    0.9954        0.6875       0.7460       0.8926         0.8304
Accuracy      0.9917        0.9908       0.9853       0.9880         0.9890


Selection of Final Model 2

The F1 measures and accuracies, both by class and macro-averaged overall, which

resulted from classifying the test data set with each model are listed in Table 40. The

macro-averaged F1 measures for the four models range from 0.7619 to 0.8304 and the

accuracies range from 0.9834 to 0.9892. The macro-averaged F1 measure is highest for

Combination Model 2, followed by Neural Network Model 2. The Neural Network

Model 2 had the highest macro-averaged accuracy, but it barely edged out Combination

Model 2 by just 0.0002. These two models merit closer examination.

Table 40

Model 2 F1 Measure and Accuracy Metrics

Document Class               CART 2b   Logistic Regression   Neural Network   Combination

F1 Measure
Non-Framing                  0.9923    0.9933                0.9962           0.9954
Diagnostic                   0.5135    0.5806                0.6667           0.6875
Prognostic                   0.6957    0.7463                0.7377           0.7460
Motivational                 0.8462    0.8689                0.8880           0.8926
Macro-Averaged F1 Measure    0.7619    0.7973                0.8221           0.8304

Accuracy
Non-Framing                  0.9862    0.9880                0.9931           0.9917
Diagnostic                   0.9834    0.9880                0.9908           0.9908
Prognostic                   0.9807    0.9844                0.9853           0.9853
Motivational                 0.9834    0.9853                0.9876           0.9880
Macro-Averaged Accuracy      0.9834    0.9864                0.9892           0.9890


The confusion matrices for both models are displayed together in Table 41 for

comparison. An additional column, “% Found,” has been added to the matrices. This is

the recall measure expressed as a percentage. As discussed in the consideration of Model

1, the intended use for the model should guide model selection. If overall accuracy is of

paramount importance, then Neural Network Model 2 is best although it barely edges out

its competition. If the purpose of the model is to filter these three types of framing

documents from a flood of Internet posts and present the results to humans who assess

risks associated with collective action, then Combination Model 2 discovered higher

proportions of all three types of framing documents than Neural Network Model 2, but at

a slight cost. The false positive rates from Combination Model 2 are higher than Neural

Network Model 2 for the diagnostic and prognostic classes. The macro-averaged F1

measures reflect these differences between the two models. The analyst may be willing

to tolerate additional false positives rather than risk losing one motivational document

that completes the picture. In that case, Combination Model 2 would be chosen as the

final model.


Table 41

Model 2 Comparison of Neural Network and Combination Models

                                          Model Classification
Model              True Classification    Non-Framing   Diagnostic   Prognostic   Motivational   Total   % Found

Neural Network 2   Non-Framing            1,954         2            3            1              1,960   99.7%
                   Diagnostic             4             20           2            6              32      62.5%
                   Prognostic             5             4            45           16             70      64.3%
                   Motivational           0             2            2            107            111     96.4%
                   Total                  1,963         28           52           130            2,173

Combination 2      Non-Framing            1,948         5            5            2              1,960   99.4%
                   Diagnostic             2             22           2            6              32      68.8%
                   Prognostic             4             4            47           15             70      67.1%
                   Motivational           0             1            2            108            111     97.3%
                   Total                  1,954         32           56           131            2,173


DISCUSSION

Comparison of Model Algorithms to k-Nearest Neighbors

The review of literature for this thesis cited publications that reported success in

utilizing LSA methods to provide predictors for a k-Nearest Neighbors (kNN) model that

performs document classification (Naohiro et al., 2006; Nakov et al., 2003). The

methods used in this study were compared to a kNN model. The Memory-Based

Reasoning node in SAS Enterprise Miner uses the kNN algorithm to classify a new

observation according to the known classifications of the k most similar observations

from the training data set, where the analyst selects the value of k. The SVD values in

the training data set served as input to train kNN models for both Model 1 and Model 2

classification tasks. Models were trained for each of four values of k: 5, 10, 15, and 20.

For both Model 1 and Model 2, the k = 5 models had the lowest error rates and these

were used for the following comparison.
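As an illustration only (the study used the Memory-Based Reasoning node in SAS Enterprise Miner, not scikit-learn), the selection of k could be sketched as follows, fitting one kNN model per candidate k on the SVD values and keeping the lowest-error model:

    from sklearn.neighbors import KNeighborsClassifier

    def best_knn(X_train, y_train, X_test, y_test, k_values=(5, 10, 15, 20)):
        """Train kNN on the SVD values for each k and return the lowest-error model."""
        best = None
        for k in k_values:
            model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
            error = 1.0 - model.score(X_test, y_test)
            if best is None or error < best[1]:
                best = (model, error, k)
        return best  # (fitted model, error rate, chosen k); k = 5 won in this study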

The confusion matrix for the kNN Model 1 is in Table 42. Table 43 provides a

comparison of evaluation measures for the kNN Model 1 and the Model 1 candidates in

this study. The kNN Model 1 performed admirably. It misclassified nine framing

documents and twenty-one non-framing documents for a total of thirty misclassifications.

Both in terms of the F1 measure and accuracy, the kNN Model 1 lagged behind

the neural network model and both combination models. The recall for the kNN model

reflects the large proportion of framing documents discovered by the model, just two

fewer than were discovered by the neural network model. However, the relatively large

number of false positives returned by the kNN model resulted in a lower precision.


Table 42

kNN Model 1 Confusion Matrix

                      Model Classification
True Classification   Framing   Non-Framing   Total
Framing               204       9             213
Non-Framing           21        1,939         1,960
Total                 225       1,948         2,173

Table 43

Comparison of kNN and Model 1 Candidates, Ranked by Decreasing Accuracy

Model Precision Recall F1 Measure Accuracy

Neural Network 0.9856 0.9671 0.9763 0.9954

Mean MRP 0.9659 0.9296 0.9474 0.9899

Voting 1C 0.9897 0.9061 0.9461 0.9899

k-Nearest Neighbors 0.9067 0.9577 0.9315 0.9862

Voting 1b 0.8982 0.9531 0.9248 0.9848

CART (Dummy Variables) 0.8673 0.9202 0.8929 0.9784

Logistic Regression 0.8219 0.9531 0.8826 0.9751

Voting 1a 0.8008 0.9812 0.8819 0.9742

The kNN Model 2 also performed well, but could not best the Model 2 candidates

in this study. The macro-averaged F1 measure in Table 45 shows that the kNN model,
at 0.7045, falls below that of every other model. The macro-averaged accuracy


measure for the kNN model was the same, 0.9834, as the macro-averaged accuracy for

CART Model 2. The other Model 2 candidates in this study had higher macro-averaged

accuracies as compared to the kNN model. The kNN precision and recall measures,

0.4000 and 0.2500 respectively, for the diagnostic class were disappointing. The reason

for these low measures can be seen in the confusion matrix for kNN Model 2 (Table 44).

This model returned very few true positives for the diagnostic class and resulted in more

false positives than true positives. In addition, the kNN model performed poorly, relative

to the other four models, in classifying motivational documents.

Table 44

kNN Model 2 Confusion Matrix

                      Model Classification
True Classification   Non-Framing   Diagnostic   Prognostic   Motivational   Total
Non-Framing           1,950         1            4            5              1,960
Diagnostic            7             8            1            16             32
Prognostic            4             5            49           12             70
Motivational          0             6            11           94             111
Total                 1,961         20           65           127            2,173


Table 45

Comparison of Evaluation for kNN and Model 2 Candidates

Document Class               kNN      CART 2b   Logistic Regression   Neural Network   Combination

F1 Measure
Non-Framing                  0.9946   0.9923    0.9933                0.9962           0.9954
Diagnostic                   0.3077   0.5135    0.5806                0.6667           0.6875
Prognostic                   0.7259   0.6957    0.7463                0.7377           0.7460
Motivational                 0.7899   0.8462    0.8689                0.8880           0.8926
Macro-Averaged F1 Measure    0.7045   0.7619    0.7973                0.8221           0.8304

Accuracy
Non-Framing                  0.9903   0.9862    0.9880                0.9931           0.9917
Diagnostic                   0.9834   0.9834    0.9880                0.9908           0.9908
Prognostic                   0.9830   0.9807    0.9844                0.9853           0.9853
Motivational                 0.9770   0.9834    0.9853                0.9876           0.9880
Macro-Averaged Accuracy      0.9834   0.9834    0.9864                0.9892           0.9890

The kNN algorithm did not outperform the best models in this study, but it is

gratifying to see that kNN did return strong results as seen in the evaluation measures.

There is, however, a reason to hesitate when considering a kNN model for

implementation in text classification. Hastie, Tibshirani, and Friedman (2001, pp. 22-27)

point out that the use of a local method, such as kNN, in high dimensions will fall prey to

the curse of dimensionality (Bellman, 1961). SVD was employed to reduce the

dimensionality of the data, but there are still N = 1,943 training documents distributed

over p = 100 SVD values, which certainly places this data set in the high-dimension

category.


Important Predictor Variables

Earlier in this paper, two SVD variables, SVD_2 and SVD_6, were profiled. Now,

after the classification models have been presented, the reason for selecting these two

variables for profiling is explained. Table 46 lists the four most important predictors for

each model. For Model 1, SVD_2 is consistently the most important predictor variable.

SVD_2 is also the most important predictor variable for CART Model 2a and Neural

Network Model 2, both of which classified documents into the non-framing, diagnostic,

prognostic, or motivational classes.

Table 46

Four Most Important Predictor Variables by Model

             Model 1                                 Model 2
CART         Logistic      Neural       CART        CART DPM    Logistic           Neural
             Regression    Network                              Regression DPM     Network
SVD2_02      SVD2_02       SVD_2        SVD_2       SVD_6       DPM_SVD6_01        SVD_2
SVD12_03     SVD3_01       SVD_7        SVD_11      SVD_2       DPM_SVD6_03        SVD_7
SVD12_02     SVD22_01      SVD_12       SVD_6       SVD_7       DPM_SVD8_01        SVD_6
SVD1_01      SVD11_01      SVD_14       SVD_7       SVD_62      DPM_SVD2_02        SVD_12

Recall that the CART DPM SVD and Logistic Regression DPM models were

trained to classify only the diagnostic, prognostic, and motivational classes. For both of

those models, SVD_6 was the most important predictor. SVD_6 also appears in the list of

the most important predictors for the other Model 2 models, but not at the top of the list.

SVD_6 seems to be effective in distinguishing the framing tasks while SVD_2 separates


framing and non-framing texts. The profiling of these two variables provided evidence to

back up this assumption.

The Difficulty of Classification

In this study, all of the documents address one topic, Global Warming, and the

task is to detect those documents that were written for the purpose of influencing

perceptions and actions regarding the topic. This task is made more challenging when

one considers the fact that both non-framing and diagnostic framing texts can define

climate change and its effect on our planet. The subtle difference is that the non-framing

document may have been written for the purpose of educating the reader, while the

diagnostic framing document is intended to not only educate, but also influence the

reader's perception of events and personal experience. In the same manner, elements of

prognostic framing text may logically be found in non-framing text. Morrow et al.

(2008), in discussing their research involving classifying Senate speeches by political

party, mention the challenging nature of ideological classification, “… it seems that a

more ideologically-based classification might be a more difficult problem than

classifying by author – often, Democrats and Republicans use the same words but are

discussing very different ideas” (p. 8).


CONCLUSION

The accuracy of the methods employed in this study was excellent. For the model

that distinguished framing from non-framing documents, the accuracies ranged from

97.5% to 99.5%. Likewise, the polychotomous models, which identified non-framing

versus diagnostic versus prognostic versus motivational documents, had accuracies

ranging from 98.3% to 98.9%. To place these results in perspective, the literature review

identified a study that had a similar goal of classifying ideology in text documents. That

paper reported best accuracies for a dichotomous target variable in the 85% to 92%

range, and other model accuracies ranging from 70% to 90% as being acceptable

(Morrow et al., 2008, Figure 4, p. 7). A polychotomous model was not addressed in that

study.

The fact that this study has been successful in demonstrating that framing

documents can be accurately distinguished from non-framing documents lends credence

to the theory that framing involves distinct and identifiable language characteristics.

Moreover, the successful classification of framing documents by core framing task, with

high accuracy, provides the means to measure these fine distinctions. From the results

seen here, one may presume that social scientists can use these techniques to further the

study, measurement, and validation of current thinking regarding the framing efforts of

Social Movement Organizations.

Latent Semantic Analysis techniques were shown to be effective in providing

robust predictor variables for the classification models. The neural network modeling

algorithm performed well for both models, but it was a combination model that excelled

in the more difficult problem of finding documents belonging to specific framing tasks.


This study could have failed to accomplish the goal of developing classification

models to discover framing documents. The problem is difficult. Moreover, the tenets of

Social Movement Theory upon which this thesis rests have been developed through

observation and analysis with less emphasis on quantification. This is understandable,

indeed necessary, when one considers the topic is dependent upon human nature, not

physical science. Major Jennifer Chandler, USAF, notes that “Research has

predominately focused on understanding why and how frames generate resonance”

(Chandler, 2005). She then explains the need for studies that better define the
mechanisms of framing processes.

Future Work

The corpus of documents was split into training and test data sets. When data are

plentiful, partitioning the corpus into training, validation, and test data sets is the accepted

practice. In that case, the error on the validation data set aids in model selection, and the

test data set provides the estimation of predictive error on new data (Hastie et al., 2001, p.

196). The small number of framing documents per core task in this corpus

necessitated the use of just training and test data sets. Cross-validation and bootstrap

methods are designed to estimate prediction error when the environment is not data rich.

Incorporating one or both of these methods may present a more realistic estimate of the

accuracy of the models.
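A minimal sketch of such a cross-validation estimate is given below; the classifier choice, the feature matrix X_svd, and the target vector y_caf are hypothetical placeholders rather than artifacts of this study:

    from sklearn.model_selection import StratifiedKFold, cross_val_score
    from sklearn.neural_network import MLPClassifier

    # Stratified folds preserve the scarce framing classes in every fold.
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
    clf = MLPClassifier(hidden_layer_sizes=(3,), max_iter=2000, random_state=1)
    # scores = cross_val_score(clf, X_svd, y_caf, cv=cv, scoring="f1_macro")
    # print(scores.mean(), scores.std())   # cross-validated macro F1 estimate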

Some writers of framing texts are becoming more sophisticated in their frame

construction by adopting a reasonable rather than a rhetorical tone (Benjamin, 2007).

Benjamin posits that a rhetorical tone is patronizing, dogmatic, and biased. Her research

indicates that people respond to a rhetorical tone with skepticism and resistance. In


contrast, she paints a reasonable tone as non-argumentative, optimistic, and based upon

widely accepted values. In this case, Benjamin theorizes that the reader is more likely to

be encouraged and to start thinking about solving the issues. Future work can be

undertaken to adopt the approach that was demonstrated in this study for the

identification of tone in framing documents, thus singling out those that are more apt to

be successful in recruiting followers.


REFERENCES

Anderson, R. (2007). The credit scoring toolkit: Theory and practice for retail credit risk

management and decision automation. USA: Oxford University Press.

BBC News. (2001). Summit fails to solve climate dispute. Retrieved September 8, 2008,

from http://news.bbc.co.uk/1/hi/world/europe/1387667.stm

Bellman, R. E. (1961). Adaptive control processes. Princeton University Press.

Benjamin, D. (2007, December). Finding a reasonable tone. Retrieved January 5, 2009,

from FrameWorks Institute: http://www.frameworksinstitute.org/framebytes.html

Bilisoly, R. (2008). Practical text mining with Perl. Hoboken, NJ: John Wiley & Sons,

Inc.

Breiman, L., Friedman, J., Olshen, R., & Stone, C. (1984). Classification and regression

trees. Boca Raton, FL: Chapman & Hall/CRC Press.

Campbell, P. (2008, July 6). Image:2008-07 climate rally human sign 2.jpg. Retrieved

November 17, 2009, from Greenlivingpedia:

http://www.greenlivingpedia.org/Image:2008-

07_climate_rally_human_sign_2.jpg

Chandler, J. (2005, May). The explanatory value of social movement theory. Strategic

Insights, IV(5).

Climate Action Network Australia. (2008). Mission. Retrieved September 9, 2008, from

http://www.cana.net.au/index.php?site_var=12

Climate Camp. (2008). Camp for climate action Australia. Retrieved May 19, 2008, from

http://www.climatecamp.org.au/


Climate Rally. (2008). Climate emergency rally. Retrieved May 20, 2008, from

http://climaterally.blogspot.com/

Communist Party USA. (2008). Global warming - the communist solution. Retrieved

September 9, 2008, from http://www.cpusa.org/article/view/933/

Cooper, A. (2002). Media framing and social movement mobilization: German peace

protest against INF missiles, the Gulf War, and NATO peace enforcement in

Bosnia. European Journal of Political Research, 41, 37-80.

Courtice, B. (2008, July 5). Thank you and well done. Retrieved November 17, 2008,

from CLIMATE CRIMINALS TOUR OF MELBOURNE: Climate Emergency

Rally:

http://climaterally.blogspot.com/search/label/Climate%20Emergency%20Rally

Deerwester, S., Dumais, S., Furnas, G., Landauer, T., & Harshman, R. (1990). Indexing

by latent semantic analysis. Journal of the American Society for Information

Science, 41(6), 391-407.

Della Porta, D., & Diani, M. (1999). Social movements: An introduction. Oxford:

Blackwell Publishers.

FrameWorks. (n.d.). FrameWorks issues: Global warming. Retrieved January 5, 2009,

from The FrameWorks Institute Web Site:

http://www.frameworksinstitute.org/globalwarming.html

FrameWorks. (1999). Mission of the FrameWorks Institute. Retrieved January 5, 2009,

from The FrameWorks Institute Web Site:

http://www.frameworksinstitute.org/mission.html


Goffman, E. (1974). Frame analysis: An essay on the organization of experience. New

York, NY: Harper & Row.

Hastie, T., Tibshirani, R., & Friedman, J. (2001). The Elements of Statistical Learning.

New York, NY: Springer-Verlag.

Hosmer, D. W., & Lemeshow, S. (2000). Applied logistic regression (2 ed.). Hoboken,

NJ: John Wiley & Sons, Inc.

ISI Web of Knowledge. (2008). Thomson Reuters.

Koenig, T. (2005). Routinizing frame analysis. Proceedings of the ISA RC-33

Methodology Conference. Leverkusen: Leske & Budrich.

Landauer, T. K., Foltz, P. W., & Laham, D. (1998). An introduction to latent semantic

analysis. Discourse Processes, 25, 259-284.

Larose, D. (2006). Data mining methods and models. Hoboken, NJ: John Wiley & Sons.

Larose, D. (2005). Discovering knowledge in data. Hoboken, NJ: John Wiley & Sons.

Manning, C. D., & Schütze, H. (1999). Foundations of statistical natural language

processing. Cambridge, MA: The MIT Press.

Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to information

retrieval. New York, NY: Cambridge University Press.

McAdam, D., McCarthy, J., & Zald, M. (1988). Social movements. In N. Smelser (Ed.),

Handbook of sociology. Thousand Oaks, CA: Sage Publications.

Morrow, J., Bader, B., Chew, P., & Speed, A. (2008). Ideological determination using

small amounts of text. International Studies Association 49th Annual Convention.

San Francisco, CA.


Nakov, P., Valchanova, E., & Angelova, G. (2003). Towards deeper understanding of the

LSA performance. Proc. Recent Advances in Natural Language, (pp. 311-318).

Borovetz, Bulgaria.

Naohiro, I., Murai, T., Yamada, T., & Bao, Y. (2006). Text classification by combining

grouping, LSA and kNN. Proceedings from 5th IEEE/ACIS ICIS-COMSAR '06.

Los Alamitos, CA: IEEE Computer Society.

New Europe. (2008, June 16). Saudi Arabia joins global warming fight scheme.

Retrieved September 9, 2008, from http://www.neurope.eu/articles/87683.php

Pyle, D. (2003). Business modeling and data mining. San Francisco, CA: Morgan

Kaufmann.

Reuters. (2008). Protesters disrupt loading at Australian coal port. Retrieved September

8, 2008, from

http://www.reuters.com/article/rbssMiningMetalsSpecialty/idUSSYD1146922008

0714

Rising Tide. (2008, February 26). Topple the fossil fuel empire. Retrieved June 9, 2008,

from risingtide.org.uk:

http://risingtide.org.uk/files/rt/15%20Actions%20to%20Topple%20the%20Fossil

%20Fuel%20Empire%20-%20Web%20Version.pdf

SAS Institute, Inc. (2003). Descriptive terms of clusters. Text Miner Node . Cary, NC.

SAS Institute, Inc. (2003). Weighting methods. Text Miner Node . Cary, NC.


SAS® Enterprise Miner™. (2003-2005). Version 2.3 of the SAS System for Windows,

copyright © 2003 - 2005 SAS Institute Inc. SAS and all other SAS Institute Inc.

product or service names are registered trademarks or trademarks of SAS

Institute Inc., Cary, NC, USA.

SAS® Software. (2002-2003). Version 9.1.3 of the SAS System for Windows, copyright ©

2002 - 2003 SAS Institute Inc. SAS and all other SAS Institute Inc. product or

service names are registered trademarks or trademarks of SAS Institute Inc.,

Cary, NC, USA.

SAS® Text Miner. (2003-2005). Version 2.3 of the SAS System for Windows, copyright

© 2003 - 2005 SAS Institute Inc. SAS and all other SAS Institute Inc. product or

service names are registered trademarks or trademarks of SAS Institute Inc.,

Cary, NC, USA.

Semetko, H., & Valkenburg, P. (2000). Framing European politics: a content analysis of

press and television news. Journal of Communication, 50(2), 93-109.

Shao, G. (1994). Potential impacts of climate change on a mixed broadleaved-Korean

pine forest stand: A gap model approach. International-Geosphere-Biosphere-

Program Workshop on the Application of Forest-Stand-Models-to-Global-

Change-Issues. Apeldoorn Netherlands: Kluwer Academic Publ.

Sierra Club. (2008). Global warming policy solutions. Retrieved May 14, 2008, from

http://www.sierraclub.org/energy/energypolicy/

Snow, D., & Benford, R. (1988). Ideology, frame resonance and participant mobilization.

International Social Movement Research , 1, 197-219.


Southern Baptist Convention. (2007). SBC resolutions: On global warming. Retrieved

September 9, 2008, from

http://www.sbc.net/resolutions/amResolution.asp?ID=1171

SPSS Clementine®. (2007). Rel. 12.0.1 SPSS, Incorporated. Chicago, IL.

Triandafyllidou, A., & Fotiou, A. (1998). Sustainability and modernity in the European

Union: A frame theory approach on policy-making. Sociological Research Online, 3(1).

World Development Movement. (2008). No new coal - stop Kingsnorth. Retrieved May

19, 2008, from World Development Movement Campaigns:

http://www.wdm.org.uk/campaigns/climate/action/kingsnorth.htm


BIOGRAPHICAL STATEMENT

Judith Spomer is a Senior Member of Technical Staff at Sandia National

Laboratories1 in Albuquerque, New Mexico. She holds a B.S. in Computer Science from

Indiana University of Pennsylvania. During her career she has worked as a process

control engineer, software engineer, and credit risk modeler in the chemical and financial

services industries. Mrs. Spomer is married with four children and makes her home in

Tijeras, NM.

1 Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin

Company, for the United States Department of Energy's National Nuclear Security Administration under

Contract DE-AC04-94AL85000.


Appendix A: Representative Global Warming Documents

Non-Framing Document

A gap-typed forest dynamic model KOPIDE was used to assess the dynamic

responses of a mixed broadleaved-Korean pine forest stand to climate change in

northeastern China. The GFDL climate change scenario was applied to derive the

changes in environmental variables, such as 10 degrees C based DEGD and PET/P,

which were used to implement the model. The simulation result suggests that the climate

change would cause important changes in stand structure. Korean pine, the dominant

species in the area under current climate conditions, would disappear under the GFDL

equilibrium scenario. Oak and elm would become the dominant species replacing

Korean pine, ash and basswood. Such a potential change in forest structure would

require different strategies for forest management in northeastern China. (Shao, 1994)

Diagnostic Document

No new coal – Stop Kingsnorth. In April 2008 the government will decide

whether Kingsnorth in Kent will have the first new coal-fired power station in the UK for

decades. Of all fuels, coal is the most polluting - even worse than burning oil or gas.

Kingsnorth power station alone will release more CO2 each year than Ghana. It will not

use carbon capture and storage technology, and so will contribute to climate change that

is already hitting the world's poor first and hardest. For the UK to be encouraging the

development of new coal-fired power stations, instead of promoting the switch to a low

carbon future, is madness in an era of impending climate crisis. (World Development

Movement, 2008)


Prognostic Document

Reduce emissions to avoid dangerous global warming: Scientists tell us that we

must cut greenhouse gas emissions by at least 80% by 2050 to prevent global

temperatures from rising more than 2º C over pre-industrial averages. Not only must

global warming policy require such emissions reductions, but it must also ensure the U.S.

adheres to this mandate by requiring periodic scientific review of progress toward

sufficient emission reductions that will meet this goal. Legislation should direct EPA to

adjust its regulatory process based on future scientific study and review of climate change

to ensure that we meet measurable, intermittent emission reduction benchmarks between

now and 2050 that will prevent a rise in global temperatures above dangerous levels.

(Sierra Club, 2008)

Motivational Document

Welcome to Climate Camp Australia. The camp for climate action will be five

days of inspiring workshops & direct action aimed at shutting down the world's largest

coal port in Newcastle, just north of Sydney. If you are concerned about climate change,

and want real action instead of more hot air, then we encourage you to come, bring your

friends and family and get involved. Whether you are old or young, a seasoned protestor

or if you've never been to a protest in your life, if you share our passion for climate

action, then climate camp is for you! We'd love for you to get involved and help make

the camp as big, bold and effective as possible. Whatever your background, there is a

role for you. Find out more about how you can get involved. (Climate Camp, 2008)


Appendix B: Cluster Results for Entire Corpus

Cluster Name                 Descriptive Terms                                        No. of Docs   % of Docs   RMS Std

Atmospheric

Observations &

Measurements

+cloud, +sensor, +observation, +technique,

+instrument, +aerosol, +parameter, +mission,

+satellite, earth, +measure, +provide, +resolution,

data, atmospheric, +measurement, +accuracy,

+present, +surface, +study

280 4.3% 0.0903

Atmospheric

Variation

+variability, +record, +variation, atmospheric,

+circulation, +mechanism, +temperature,

+atmosphere, solar, past, +surface, +activity,

+ocean, last, +cycle, +forcing, +show, +time,

+scale, +warming

294 4.5% 0.1033

Climate Models

+climate, +estimate, +result, +water, data,

+assess, +present, +simulation, model, +scenario,

+condition, +impact, +study, hydrological, +use,

+change, +method, +base, future, +scale

538 8.2% 0.0996

Direct Action,

Protest

+people, direct action, +day, +come, +coal,

+workshop, +station, +action, +want, +join,

+group, +protest, +stop, camp, +camp, direct,

+expansion, +take, action, +movement

49 0.8% 0.0891



Faith-Based

Response

+care, +tradition, +creation, +man, god, +live,

faith, +thing, +responsibility, +life, +see,

+protect, +call, +earth, +do, +way, just, +come,

+world, +community

17 0.3% 0.0838

Forests

+carbon, +increase, +forest, +rate, +effect,

+increase, +management, +concentration,

+response, +growth, atmospheric, +tree, +soil,

+specie, +ecosystem, +plant, +model, potential,

+area, +high

624 9.6% 0.1123

Fossil Fuels

fossil fuels, +paper, renewable, +emission,

+production, +resource, +gas, +technology,

+power, +plant, +generation, +efficiency, global,

+development, fossil, +reduction, +fuel, +energy,

+warming, +source

465 7.1% 0.0925

Friends & Group

Actions

+friend, +join, +do, +send, +know, +school,

+way, +help, +action, +make, +write, +group,

just, +take, +start, +see, +idea, +want, +people,

+good

84 1.3% 0.0867



GHGs / Ozone

+gas, warming, +use, +process, +atmosphere,

+emission, +high, ozone, environmental, +system,

+warming, carbon dioxide, +method, +product,

+potential, +application, global, global warming,

+problem, +low

476 7.3% 0.1013

Glaciers

+snow, +extent, +sea, +balance, +glacier,

+surface, +accumulation, +summer, +cover, ice,

+temperature, +area, +record, +show, +indicate,

+variability, +region, +year, +period, +trend

247 3.8% 0.0962

Government &

Corporate

Response to GW

+help, +send, +state, +clean, +government,

renewable energy, +take, +stop, now, +efficiency,

+invest, +reduce, renewable, +company, +create,

+action, +solution, +energy, +power, +do

107 1.6% 0.0927

Effect of GW

on Human

Populations

human, +world, +food, health, +people, +country,

+population, +problem, +affect, +do, +cause,

environmental, +environment, +increase, +make,

+warming, global warming, more, +part, other

311 4.8% 0.1164



Challenges &

Strategies to

Address GW

+challenge, +impact, +develop, +environment,

information, +assessment, +ecosystem,

+management, +approach, +resource, +strategy,

+policy, climate change, +research, +paper,

+issue, +system, +address, environmental,

+problem

570 8.7% 0.0990

Habitats &

Populations

+community, +response, +diversity, +range,

+habitat, +pattern, +population, genetic,

+distribution, +suggest, +specie, +plant, +predict,

climatic, environmental, +rate, +environment,

+analysis, +condition, +change

332 5.1% 0.0899

Holocene

Period

+indicate, holocene, +core, +right, +period,

+evidence, +lake, bp, climatic, last, +record,

+sequence, glacial, all, +occur, +record, +suggest,

+basin

323 4.9% 0.0952

International

GW Actions

+community, climate, +leader, +create, +build,

+do, +country, +action, +group, +take, +people,

+student, +join, +world, +see, +solution,

+government, action, international, +way

102 1.6% 0.0944



International

GW Policy

international, +technology, +gas, convention,

+paper, +emission, +sector, +cost, climate, kyoto,

framework, +greenhouse, +carbon, +reduce,

+energy, +reduction, +policy, economic, change,

ghg

365 5.6% 0.0965

Lifestyle

Changes

+appliance, +recycle, +reduce, +big, +drive, +do,

+save, carbon dioxide, +pound, +take, +make,

+home, +energy, +bulb, +car, +replace, money,

+buy, +help, +use

123 1.9% 0.0909

Precipitation

Variation

+year, +variation, +precipitation, +region,

+increase, climatic, +area, +degree, +temperature,

+show, +period, +trend, +land, data, +analysis,

+analyze, mean, annual, +vegetation, +result

487 7.5% 0.1082

Sea Level

+sea level, +coast, +risk, coastal, +river,

+frequency, +storm, +rise, +event, +area,

extreme, +flood, recent, +scenario, future,

+change, +impact, +large, +paper, climate change

273 4.2% 0.1044

Water

Ecosystems

+specie, +temperature, +water, +surface,

+ecosystem, +fish, +ocean, +low, +lake, +river,

+effect, +increase, +population, +increase, +high,

+affect, +change, +large, +suggest, +region

464 7.1% 0.1123


Appendix C: Dummy Variables for Framing/Non-Framing Models

SVD Dimension Variable Dummy Variable Condition for Value = 1

SVD_1

SVD1_01 SVD_1 < 0.4229

SVD1_02 SVD_1 >= 0.5168

SVD_2

SVD2_01 (SVD_2 >= -0.1812) and

(SVD_2 < -0.0051)

SVD2_02 SVD_2 >= 0.0687

SVD_3

SVD3_01 SVD_3 < -0.0108

SVD3_02 (SVD_3 >= -0.0108) and

(SVD_3 < 0.0091)

SVD3_03 (SVD_3 >= 0.0396) and

(SVD_3 < 0.0589)

SVD3_04 (SVD_3 >= 0.0589) and

(SVD_3 < 0.1601)

SVD3_05 SVD_3 >= 0.1601

SVD_4

SVD4_01 (SVD_4 >= -0.0928) and

(SVD_4 < -0.0350)

SVD4_02 (SVD_4 >= -0.0350) and

(SVD_4 < 0.0529)

SVD4_03 (SVD_4 >= 0.0727) and

(SVD_4 < 0.1340)

SVD4_04 (SVD_4 >= 0.1340) and

(SVD_4 < 0.1687)

SVD4_05 SVD_4 >= 0.1687



SVD_5

SVD5_01 (SVD_5 >= -0.2667) and

(SVD_5 < -0.1719)

SVD5_02 (SVD_5 >= -0.1719) and

(SVD_5 < -0.1173)

SVD5_03 (SVD_5 >= -0.0966) and

(SVD_5 < -0.0582)

SVD5_04 (SVD_5 >= -0.0582) and

(SVD_5 < 0.0773)

SVD5_05 (SVD_5 >= 0.1000) and

(SVD_5 < 0.1389)

SVD5_06 SVD_5 >= 0.1389

SVD_6

SVD6_01 SVD_6 < -0.2289

SVD6_02 (SVD_6 >= -0.1183) and

(SVD_6 < 0.0594)

SVD6_03 (SVD_6 >= 0.0775) and

(SVD_6 < 0.1051)

SVD6_04 (SVD_6 >= 0.1051) and

(SVD_6 < 0.1417)

SVD6_05 SVD_6 >= 0.1417

SVD_8

SVD8_01 SVD_8 < -0.1989

SVD8_02 (SVD_8 >= -0.1989) and

(SVD_8 < -0.1622)

SVD8_03 (SVD_8 >= -0.1622) and

(SVD_8 < -0.0279)

SVD8_04 SVD_8 >= 0.0329



SVD_9

SVD9_01 SVD_9 < -0.0789

SVD9_02 SVD_9 >= 0.1197

SVD_10

SVD10_01 SVD_10 < -0.0792

SVD10_02 (SVD_10 >= -0.0582) and

(SVD_10 < 0.0356)

SVD_11 SVD11_01 SVD_11 < -0.1174

SVD_12

SVD12_01 SVD_12 < -0.1686

SVD12_02 (SVD_12 >= -0.1038) and

(SVD_12 < 0.0447)

SVD12_03 SVD_12 >= 0.0596

SVD_22

SVD22_01 SVD_22 < 0.0516

SVD22_02 SVD_22 > 0.1076

SVD_23

SVD23_01 SVD_23 < -0.1047

SVD23_02 (SVD_23 >= -0.0838) and

(SVD_23 < 0.0545)

SVD23_03 (SVD_23 >= 0.0885) and

(SVD_23 < 0.1127)

SVD23_04 SVD_23 >= 0.2011

SVD_27

SVD27_01 SVD_27 < -0.1047

SVD27_02 SVD_27 >= 0.0277


Appendix D: Dummy Variables for Diagnostic/Prognostic/Motivational Models

SVD Dimension Variable Dummy Variable Condition for Value = 1

SVD_1

DPM_SVD1_01 SVD_1 < 0.3574

DPM_SVD1_02 (SVD_1 >= 0.3574) and

(SVD_1 < 0.4342)

DPM_SVD1_03 SVD_1 >= 0.5168

SVD_2 DPM_SVD2_01

(SVD_2 >= -0.0617) and

(SVD_2 < 0.1945)

DPM_SVD2_02 SVD_2 >= 0.1945

SVD_3 DPM_SVD3_01 SVD_3 >= 0.0091

SVD_4

DPM_SVD4_01 SVD_4 < -0.0350

DPM_SVD4_02 SVD_4 >= 0.0727

SVD_5

DPM_SVD5_01 SVD_5 < -0.0966

DPM_SVD5_02 (SVD_5 >= -0.0582) and

(SVD_5 < 0.0458)

DPM_SVD5_03 SVD_5 >= 0.1000

SVD_6

DPM_SVD6_01 SVD_6 < -0.1183

DPM_SVD6_02 (SVD_6 >= -0.0959) and

(SVD_6 < 0.0451)

DPM_SVD6_03 SVD_6 >= 0.0594

SVD_8

DPM_SVD8_01 SVD_8 < -0.0500

DPM_SVD8_02 (SVD_8 >= -0.0061) and

(SVD_8 < 0.1149)

DPM_SVD8_03 SVD_8 >= 0.1149

SVD_9 DPM_SVD9_01 SVD_9 < 0.0124

DPM_SVD9_02 SVD_9 >= 0.1009



SVD_10

DPM_SVD10_01 (SVD_10 >= -0.0792) and

(SVD_10 < 0.0230)

DPM_SVD10_02 SVD_10 >= 0.0509

SVD_11

DPM_SVD11_01 SVD_11 < -0.0961

DPM_SVD11_02 SVD_11 >= 0.0022

SVD_12 DPM_SVD12_01 (SVD_12 >= -0.1270) and

(SVD_12 < 0.0291)

SVD_23

DPM_SVD23_01 SVD_23 < -0.0838

DPM_SVD23_02 (SVD_23 >= -0.0458) and

(SVD_23 < 0.0300)

DPM_SVD23_03 SVD_23 >= 0.0885

SVD_27 DPM_SVD27_01 SVD_27 >= 0.1030


Appendix E: Terms Associated with the Highest SVD_6 Values

Terms highlighted in yellow are associated with motivational framing text and

terms highlighted in green are associated with diagnostic framing text.

Term Value Term Value

increased instances 0.6423 + death 0.2601

+ giant 0.6374 entire 0.2590

+ protest 0.6362 + hold 0.2572

climate-changing 0.6341 + exacerbate 0.2558

bbc 0.6305 simply 0.2558

+ cite 0.6304 particularly 0.2550

+ cooperative 0.6302 genetic 0.2515

guatemala 0.6299 foreign 0.2506

world economy 0.6293 used 0.2490

other biofuels 0.6282 + head 0.2487

+ proponent 0.6271 + differ 0.2476

booming 0.6260 gulf 0.2468

bandwagon 0.6251 negligent 0.2462

massive amounts 0.6251 halt 0.2446

corn ethanol 0.6251 + reveal 0.2445

political 0.6248 + chemical 0.2390

+ price 0.6237 operational 0.2376

useless 0.6232 + commit 0.2375

+ hill 0.6223 + argue 0.2362

+ acre 0.6223 vulnerable 0.2349

+ commission 0.6219 biofuels 0.2342

consolidation 0.6213 meat 0.2332

+ herbicide 0.6212 + nation 0.2327

+ breed 0.6208 + movement 0.2321


corn 0.6203 + percentage 0.2306

processing plants 0.6198 + face 0.2299

contaminating 0.6195 + solve 0.2283

saltwater 0.6195 + people 0.2278

cargill 0.6195 food 0.2278

imported grain 0.6186 + send 0.2271

liver 0.6183 exciting 0.2269

specific technology 0.6177 + want 0.2257

energy resource 0.6175 global action 0.2249

new breed 0.6173 + story 0.2248

+ mutation 0.6173 + rise 0.2243

agriculture 0.6165 + group 0.2224

food crops 0.6141 manufacturing 0.2201

+ team 0.6135 + standing 0.2187

content 0.6127 urgently 0.2181

midwest 0.6122 growing 0.2180

+ spark 0.6120 nations 0.2174

+ toxin 0.6119 + close 0.2165

minnesota 0.6116 social scientists 0.2155

recent study 0.6107 + producer 0.2153

+ suit 0.6107 united 0.2150

+ kernel 0.6101 in 0.2142

public concern 0.6099 classic 0.2137

ethanol production 0.6098 epa 0.2136

tilman 0.6091 + accept 0.2129

dairy 0.6086 + stand 0.2104

biotechnology 0.6079 though 0.2103

+ carcinogen 0.6077 + crop 0.2103

+ shock 0.6063 + country 0.2103

+ pen 0.6052 switch 0.2102


adm. 0.6051 + consequence 0.2102

+ tout 0.6049 national 0.2083

+ preserve 0.6046 + clear 0.2078

oversight 0.6038 + point 0.2078

statistic 0.6029 + speak 0.2077

corn-based 0.6008 is 0.2074

animal feed 0.6008 + group 0.2073

truth 0.6005 + zone 0.2071

+ adult 0.6000 + leader 0.2069

+ heighten 0.5995 + organization 0.2068

corn 0.5988 + begin 0.2067

possible increases 0.5987 eco-systems 0.2058

+ soybean 0.5986 + mobilize 0.2053

public health 0.5982 wipe out 0.2044

inconvenient 0.5964 + deal 0.2035

aquatic life 0.5964 + scientist 0.2032

alarm 0.5963 elsewhere 0.2020

poor air quality 0.5960 poland 0.2017

farmer-owned 0.5957 + investigation 0.2013

processed 0.5951 safety 0.2012

safety 0.5940 + polluter 0.1992

energy intensive 0.5926 proclaim 0.1990

in addition 0.5919 human health 0.1987

emissions reduction 0.5891 anger 0.1983

+ squeeze 0.5885 + step 0.1980

+ consolidate 0.5872 + summit 0.1976

while 0.5869 statistics 0.1974

coastal regions 0.5855 + citizen 0.1966

stanford 0.5837 prime 0.1959

answer 0.5833 failure 0.1953


radio 0.5826 + prospect 0.1949

ahead 0.5763 chair 0.1946

unsafe 0.5700 + dacca 0.1938

preferable 0.5656 square 0.1935

experts 0.5654 tackling climate change 0.1934

+ crop 0.5649 + science 0.1924

+ curtail 0.5648 leading scientists 0.1918

+ analyst 0.5644 + modify 0.1918

nutritional 0.5635 + look 0.1917

encouraging 0.5598 + percent 0.1909

ethanol 0.5587 accounting 0.1909

+ well 0.5547 + mind 0.1905

stranglehold 0.5513 + undermine 0.1905

+ contaminate 0.5512 + organize 0.1902

+ hectare 0.5460 + coalition 0.1895

usda 0.5459 shame 0.1894

back 0.5452 + share 0.1891

+ magazine 0.5411 yesterday 0.1890

+ export 0.5401 + environmentalist 0.1882

nonprofit 0.5355 + see 0.1879

new york times 0.5322 + reality 0.1878

+ import 0.5293 + risk 0.1861

high probability 0.5284 + link 0.1854

+ violation 0.5275 increasingly 0.1850

soy 0.5247 domination 0.1846

independence 0.5241 biodiversity 0.1842

+ engineer 0.5230 rich countries 0.1838

sustainable agriculture 0.5228 + manufacturer 0.1837

processing 0.5208 rising sea levels 0.1836

food supply 0.5202 talks 0.1834


+ barrel 0.5201 + grass 0.1827

+ toxicity 0.5188 tackling 0.1827

converting 0.5181 director 0.1825

genetically 0.5177 + agenda 0.1822

+ satisfy 0.5171 + report 0.1819

selling 0.5170 entire world 0.1815

foreign oil 0.5146 fight 0.1813

+ competitor 0.5132 + language 0.1806

sugarcane 0.5096 + host 0.1797

negate 0.5093 + food 0.1788

grave 0.5082 funding 0.1785

food prices 0.5061 speech 0.1781

+ benchmark 0.5054 funding 0.1778

proud 0.5038 + mosque 0.1772

several times 0.4992 vast majority 0.1771

+ hospitalization 0.4978 oil giant 0.1769

+ alga 0.4974 + leave 0.1767

modified 0.4930 + meeting 0.1766

high 0.4905 clean 0.1764

one 0.4871 + far 0.1763

engineering 0.4863 forward 0.1762

maize 0.4860 likely 0.1759

+ grain 0.4857 socio-cultural 0.1759

+ tie 0.4857 + training 0.1757

+ alumnus 0.4834 + convert 0.1755

dead 0.4830 + skill 0.1752

public 0.4811 financial support 0.1751

+ concur 0.4795 + negotiate 0.1748

+ pose 0.4783 + combat 0.1743

traditionally 0.4772 + month 0.1735


applied 0.4734 extra pressure 0.1735

energy-intensive 0.4727 political parties 0.1726

+ ingredient 0.4712 + test 0.1723

yellow 0.4690 + plan 0.1721

profit 0.4671 political 0.1720

climate crisis 0.4653 in addition 0.1717

+ rank 0.4633 + flagship 0.1717

birth 0.4633 + election 0.1713

childhood 0.4627 university 0.1709

+ hurt 0.4607 + create 0.1706

+ jump 0.4594 + supporter 0.1704

+ sign 0.4580 global movement 0.1704

significant amount 0.4575 women 0.1698

+ hit 0.4575 coastal 0.1697

administration 0.4538 strategy sessions 0.1695

+ danger 0.4525 + review 0.1692

lobby 0.4520 + move 0.1690

+ founder 0.4491 adaptation 0.1685

sustainable 0.4472 + whale 0.1682

intensive 0.4463 lifespan 0.1680

+ process 0.4437 + hand 0.1678

ceres 0.4420 + ramification 0.1678

agricultural land 0.4398 protection 0.1677

club 0.4389 alternative 0.1673

devastating 0.4388 climate change 0.1665

truly 0.4378 bali 0.1664

resistant 0.4366 aspirational targets 0.1662

+ settlement 0.4364 + gathering 0.1662

likewise 0.4338 + thinker 0.1662

mexico 0.4335 above 0.1660


+ cropland 0.4333 + future 0.1659

+ sell 0.4323 turn out 0.1657

large part 0.4320 city/town 0.1648

health 0.4320 location 0.1648

small-scale 0.4299 + fact 0.1646

institute 0.4296 keep up 0.1641

heavily 0.4289 political leaders 0.1639

commonly 0.4282 + young 0.1639

fuels 0.4281 + shift 0.1634

+ note 0.4260 friends 0.1633

+ farmer 0.4243 capitalism 0.1632

+ pesticide 0.4230 + gather 0.1630

+ score 0.4209 climate talks 0.1628

ideal 0.4187 aviation emissions 0.1628

in. 0.4185 + industrialize 0.1626

+ factor 0.4162 observer 0.1618

steam 0.4156 + belief 0.1615

hunger 0.4132 + fertilizer 0.1615

hardly 0.4124 environmental impacts 0.1615

+ threaten 0.4120 extinct 0.1614

unintended 0.4107 environmental groups 0.1605

standard 0.4022 historic 0.1605

policy 0.4014 + win 0.1604

+ equal 0.3973 + activist 0.1600

sierra 0.3964 urgent 0.1594

+ lung 0.3955 + chance 0.1593

+ nitrate 0.3937 mangrove forest 0.1591

financing 0.3928 environmental destruction 0.1587

disclosure 0.3909 advisory 0.1586

+ intend 0.3877 + talent 0.1585


co-op 0.3871 + washington, d.c. 0.1585

brazil 0.3853 + billion 0.1585

low-income 0.3836 drastic increase 0.1585

local 0.3805 + member 0.1584

+ sound 0.3770 + register 0.1583

amazon 0.3760 debilitating 0.1580

+ infrastructure 0.3746 + part 0.1578

+ researcher 0.3734 rigorous 0.1575

back 0.3702 climate 0.1573

+ classify 0.3700 executive director 0.1571

due 0.3697 interest-group 0.1570

+ corporation 0.3696 politics 0.1569

dependence 0.3626 bold solutions 0.1565

clear 0.3622 + rally 0.1561

+ warn 0.3620 risk 0.1560

+ feed 0.3614 rio 0.1560

asthma 0.3600 + exceed 0.1559

leading 0.3593 melting 0.1558

+ harvest 0.3586 + like 0.1558

+ opponent 0.3563 real action 0.1557

+ instance 0.3539 + back 0.1557

nothing 0.3531 + funder 0.1556

doubt 0.3528 last 0.1554

rising 0.3517 ministers 0.1553

organization 0.3517 + interview 0.1548

+ grow 0.3498 + russia 0.1545

never 0.3495 + interview 0.1545

everyday 0.3492 urgent action 0.1544

+ force 0.3487 + conference 0.1539

+ gain 0.3479 public 0.1537


meaningful 0.3473 + articulate 0.1536

+ tank 0.3453 music 0.1535

+ million 0.3427 optimism 0.1535

+ board 0.3410 + warning 0.1534

greenpeace 0.3390 + culminate 0.1527

+ player 0.3390 real changes 0.1527

act 0.3363 youth 0.1526

+ crisis 0.3359 today 0.1521

climate-friendly 0.3327 future action 0.1519

+ subsidy 0.3324 + surprise 0.1519

research 0.3280 reception 0.1515

+ rise 0.3268 rising 0.1512

+ representative 0.3262 deforestation 0.1506

wrong direction 0.3239 + debate 0.1504

+ price 0.3220 + opinion 0.1504

+ feed 0.3203 vibrant 0.1502

top 0.3161 history 0.1500

large scale 0.3116 + bell 0.1499

+ hope 0.3115 + chart 0.1499

+ sit 0.3101 + put 0.1495

environmental 0.3099 alarm 0.1492

supporting 0.3092 + advantage 0.1492

wildlife 0.3065 coastal resources 0.1484

+ publish 0.3046 vice 0.1479

smog 0.3040 clear signal 0.1478

+ finding 0.3039 + issue 0.1478

international 0.3037 dangerous climate change 0.1477

america 0.2992 tough action 0.1476

+ air 0.2984 summit 0.1474

+ law 0.2970 rational 0.1474


+ breed 0.2936 + cut 0.1473

animal 0.2887 + like 0.1471

continuing 0.2881 saturday 0.1470

development 0.2863 constantly 0.1470

+ dominate 0.2863 brown 0.1465

+ confirm 0.2855 + pledge 0.1465

aquatic 0.2854 + tourist 0.1463

+ respond 0.2841 + arrangement 0.1462

+ ban 0.2831 legislative 0.1462

imported 0.2821 dry up 0.1461

center 0.2818 + sector 0.1459

+ attempt 0.2794 even 0.1459

+ note 0.2792 amazing 0.1457

+ expert 0.2773 description 0.1457

+ negotiator 0.2771 unstoppable 0.1454

+ organism 0.2768 serious environmental issues 0.1453

+ politician 0.2763 climate justice 0.1453

corporate 0.2760 inaction 0.1452

+ talk 0.2760 minister 0.1451

+ kill 0.2759 + frame 0.1451

+ hard 0.2757 + wave 0.1447

water quality 0.2751 cultural 0.1446

+ support 0.2741 environmental community 0.1445

novel 0.2736 necessary 0.1445

may 0.2717 + total 0.1444

+ rainforest 0.2699 + field 0.1443

impossible 0.2685 international action 0.1442

specifically 0.2678 presidential candidates 0.1442

able 0.2678 ready 0.1436

+ conclude 0.2677 peer-reviewed 0.1434


massive 0.2676 good bet 0.1432

+ world 0.2675 + content 0.1431

oil 0.2665 + struggle 0.1431

strategic 0.2656 + artist 0.1429

+ farmland 0.2654 st. 0.1426

+ probability 0.2642 emerging 0.1424

+ continue 0.2624 up 0.1423

potentially 0.2611 specific 0.1422

+ blend 0.2609 + session 0.1421

indeed 0.2607 presidential 0.1420

+ poor 0.2604