computer assisted text analysis

Upload: terry-college-of-business

Post on 08-Jan-2016

60 views

Category:

Documents


0 download

DESCRIPTION

Laura K. NelsonKellogg School of ManagementAugust 7, 2015Vancouver, BC

TRANSCRIPT

Computer-Assisted Text Analysis: An Overview and Guide

Laura K. NelsonKellogg School of ManagementContent Analysis PWDAcademy of Management Annual ConferenceAugust 7, 2015Vancouver, BCComputer-Assisted Text Analysis: An Overview and GuideGoalsDescribe the wide range of computer-assisted text analysis techniques availableHint: more than dictionariesProvide some guidance about how to choose your methodsGive empirical examples of these methods in practiceWhy Use Computer-Assisted Techniques?SpeedHumans are slowText is becoming largeReliability / ReproducibilityValidity (this is controversial)Expanded memoryUnburdened by bias

Does not remove the need for interpretation!Reliability / Reproducibiity: Computers will code (almost) the same way every time. Its quite difficult to get humans to do that.3Overview:Types of Automated Text AnalysisUnsupervised exploration (hypothesis forming/inductive)Topic modelingLexical selectionHuman-Guided Categorical Analysis (traditional content analysis deductive hypothesis testing)Supervised machine learningDictionariesNatural Language Processing (guided inductive/hypothesis testing)Part-of-Speech TaggingNamed Entity RecognitionConcordancesSentiment analysis

There are many types of automated text analysis (automated text analysis and computer-assisted text analysis are exchangeable, Ill use them both in this presentation. They can be used for both deductive and inductive. There is unsupervised exploration. This is similar to the grounded theory approach to traditional content analysis, but the computer is guiding it. Human-guided categorical analysis is akin to traditional content analysis, but the computer speeds up the process and in some ways makes it more accurate, reliable, and reproducible. Natural Language Processing gets deeper into the structure and language, allowing for further analysis.4

I will be following this general decision tree to provide some guidance to how to choose out of the wide range of methods avaiable.5Question 1:Do you want to inductively explore the text?First question: do you want to explore the text, allow categories to emerge from the data?6

Unsupervised Exploration: The GoalInformativeGroups of Words

The goal of this is to turn data into a bag of words, and from there into informative groups of words, that the researcher can then use to form hypotheses about their data and further analyze them. First you have to prepare your data.8Set-Up: Document-Term Matrix*ambitpovertipeoplefullDocument14200Document21370Document32001Document49142Document50026*Cells can be word frequencies or weighted word scoresQuestion 2: Themes or Style?If themes:

Question 3: Multiple Categories?Single Category per Text: Clustering

Clustering methods partition your documents into mutually exclusive groups. You can then relate these clusters to other variables, or examine changes in the clusters over time. The most frequent words in the cluster suggest what the cluster is about.If you have multiple topics per document11Multiple Categories: Topic Modeling

Topic modeling allows documents to contain multiple categories. 12But Which Algorithm?Is the order of the documents important?Yes? Structural Topic Modeling (STM)Are the topics correlated?Yes? Correlated Topic Modeling (CTM)Order is relatively arbitrary, topics may not be related?Latent Dirichlet Allocation (LDA)Which algorithm for clustering? Its complicated, but right now K-Means is consistently better on text-based data. Use it.13

Question 2: Themes or Style?If styleStyle: Lexical SelectionGoal: find words that are distinctive to different groups of textOne solution: Difference of ProportionsThese lists of distinguishing words can give you a basic idea about what the text is about, and when used in comparison, can suggest the way in which themes or categories are treated within and across texts, and allows you to directly compare different categories within the text. Different organizations, publications, or parties, or geographical areas, for example, may discuss the same topic differently. For example, republicans versus democrats.16Difference of ProportionsChicagoDoPchicago5.31children4.59center4.34union3.61school3.48abort3.19nixon2.93day2.86vietnam2.57people2.50city2.44hospital2.38cwlu2.37New York CityDoPmovement12.54women11.34feminist8.91radical8.56liberation7.69political5.81history5.68feminine 3.85male3.52left2.96revolution2.58consciousnessraising2.45oppress2.41

ConcreteAbstract17Here I noticed a pattern, Chicago was using concrete words, and NYC was using abstract words. So I turned back to LDA.Question 1:Do you want to test a hypothesis?If yes:

Question 2: Themes or Styles?If themes

Supervised Machine Learning

This replicates traditional content analysis, but instead of training a team of human researchers, you use algorithms to train a machine to code your text.20Which Algorithm?You want individual documents coded:Document Classification (e.g. SVM, Naive Bayes)You want proportion of documents in each category:ReadMe (R package)Dictionary MethodsStandardized DictionariesLIWC (can be used for sentiment analysis)Custom DictionaryThis can also be done using dictionaries. I will get to an empirical example using dictionaries later in the talk. 22Question 2: Themes or Styles?If styles

Natural Language ProcessingTakes into account features of words, relationships between words, grammatical structures, etc.

Now we finally get out of the bag-of-words and retain the original structure of the text. This also takes into account the relationship between words, and features of words themselves. 25Examples: Use NLP to test hypothesesHypothesis: Author A is more descriptive than Author B.Test: Part-of-Speech tagger, extract adjectives, count and compare.Hypothesis: Organizations in New York City are more internationally focused than organizations in Silicon Valley.Test: Named Entity Recognition, compare against lists of corporations, places, and people.If you want to get fancy, use Wikipedia to identify whether the named entities are international versus U.S.26Example: Use NLP to test hypothesesHypothesis: the word disruptive is used in a positive way, and has a different meaning, for Silicon Valley organizations compared to Wall Street organizations.Test: ConcordancesOr, there may be certain important concepts that you think are treated differently by different groups.27NLP: Concordancesvery heartily so exceedingly remarkably as vast a great amazingly extremely good sweet Mostly positivemean part maddens doleful gamesome subtly uncommon careful untoward exasperate loving passing mouldy christian few true mystifying imperial modifies contemptibleMostly negativeong the former , one was of a most monstrous size . ... This came towards us , ON OF THE PSALMS . " Touching that monstrous bulk of the whale or ork we have r ll over with a heathenish array of monstrous clubs and spears . Some were thick d as you gazed , and wondered what monstrous cannibal and savage could ever hav that has survived the flood ; most monstrous and most mountainous ! That Himmal they might scout at Moby Dick as a monstrous fable , or still worse and more de th of Radney .'" CHAPTER 55 Of the monstrous Pictures of Whales . I shall ere l ing Scenes . In connexion with the monstrous pictures of whales , I am strongly ere to enter upon those still more monstrous stories of them which are to be fo ght have been rummaged out of this monstrous cabinet there is no telling . But So you would conclude the word monstrous is used positively by Melville, and negatively by Austen. Perhaps this suggests a gendered different, which could then be tested.28Example: NLP and WordNetHypothesis: Womens movement organizations in New York City approach politics more abstractly compared to those in Chicago, who have a more concrete approach to politics.

One question that often comes up is whether texts are concrete or abstract. There are many questions you can examine with this. My hypothesis suggests the organizations approached politics through fundamentally different understandings. I wrote a program to count the number of hypernyms for each noun and verb (which required I extracted nouns and verbs first, using PoS tagger), and then averaged this over the text for different organizations. There are a lot of NLP techniques, Ive gone through only a few here.29Tactics and Issues Over TimeStructural Topic Models (structured on year)Used R package stm (Roberts, Stewart, and Tingly)Further grouped the 40 topics into 7 topic categoriesPython NLTK, extracted verbs/verb phrasesHand identified tactics, created dictionaries of tactical categories

Most prevalent topic, internal business, which is not surprising given the data source. Next most prevalent is regulations, followed by direct environmental improvement and then environmental problems.31Tactics by Year

Tactics by Year

Is the environmental movement shifting away from collective strategy, toward placing the burden on individuals? Shift away from strategy of confrontation and punishment, toward non-confrontation and cooperation

34Conclusion: Research design is key! Good data is critical!Match your method to your question and data. Be purposeful, not trendyUse multiple methods, including qualitative, to verify the analysisLearn a programming languageOff-the-shelf tools box you in (see point 2). I recommend Python, R is also goodRead NLP and machine learning literature

Like any research project ever research design is key, and good data it critical. Never ever substitute sexy methods for a strong research design and solid data collection. I cant emphasize this enough. The methods are only as good as the design and questions. Then match your methods to your question. Read literature outside your discipline to stay up to date. Computational linguistics, computer science, and now, data science.35Laura K. [email protected]

Happy text analyzing!Tactical Categories*Direct Environmental Protection: build, improve, protect, recycleNon-Disruptive Protest: chant, demonstrate, organize, petition, protestDisruptive Protest: blockade, chain, prevent, damage, sabotagePolitical: campaign, donate, elect, endorse, regulateJuridical: audit, enforce, inspect, represent, testifyVerbal Statements: advocate, comment, criticize, explain, refuteBusiness: boycott, buy, invest, purchase, sponsorEducation/Raising Awareness: editorial, outreach, publish, report, tweetOrganization/Movement Building: fund-raise, initiate, launch, participateNegotiations: deal, discuss, engage, listen, persuade*Categories are not mutually exclusive