Political Text Analysis 1
Political Text Analysis: Lecture 7
Kohei Watanabe
Advanced techniques
Part-of-speech tagging (1)
• POS taggers are tools that extract more information about words based on syntactic parsing
  • Part-of-speech
    • Category of words (e.g. noun, verb, adjective, adverb, pronoun, preposition, conjunction, etc.)
  • Lemma
    • Base form of words (e.g. ‘eat’ is the lemma of ‘ate’, ‘eats’ and ‘eating’)
  • Named-entity
    • Words referring to entities (e.g. people, places, organizations etc.)
• POS taggers use language models trained on a manually annotated corpus
   Doc ID      Sentence ID  Token ID  Token          Lemma          POS    Entity
1  2013-Obama  1            1         Vice           vice           PROPN
2  2013-Obama  1            2         President      president      PROPN
3  2013-Obama  1            3         Biden          biden          PROPN  PERSON_B
4  2013-Obama  1            4         ,              ,              PUNCT
5  2013-Obama  1            5         Mr.            mr.            PROPN
6  2013-Obama  1            6         Chief          chief          PROPN
7  2013-Obama  1            7         Justice        justice        PROPN
8  2013-Obama  1            8         ,              ,              PUNCT
9  2013-Obama  1            9         Members        member         NOUN
10 2013-Obama  1            10        of             of             ADP
11 2013-Obama  1            11        the            the            DET    ORG_B
12 2013-Obama  1            12        United         united         PROPN  ORG_I
13 2013-Obama  1            13        States         states         PROPN  ORG_I
14 2013-Obama  1            14        Congress       congress       PROPN  ORG_I
15 2013-Obama  1            15        ,              ,              PUNCT
16 2013-Obama  1            16        distinguished  distinguished  ADJ
17 2013-Obama  1            17        guests         guest          NOUN
18 2013-Obama  1            18        ,              ,              PUNCT
19 2013-Obama  1            19        and            and            CCONJ
20 2013-Obama  1            20        fellow         fellow         ADJ
Part-of-speech tagging (2)
• POS information can be included in tokens
## [1] "Vice/PROPN" "President/PROPN" "Biden/PROPN"
## [4] ",/PUNCT" "Mr./PROPN" "Chief/PROPN"
## [7] "Justice/PROPN" ",/PUNCT" "member/NOUN"
## [10] "of/ADP" "the/DET" "United/PROPN"
## [13] "States/PROPN" "Congress/PROPN" ",/PUNCT"
## [16] "distinguished/ADJ" "guest/NOUN" ",/PUNCT"
## [19] "and/CCONJ" "fellow/ADJ" "citizen/NOUN"
## [22] ":/PUNCT" "\n\n/SPACE" "each/DET"
## [25] "time/NOUN" "-PRON-/PRON" "gather/VERB"
## [28] "to/PART" "inaugurate/VERB" "a/DET"
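A minimal sketch of how POS information can be attached to tokens, assuming the tagger output is already available as (token, lemma, POS) rows. The rows are copied from the table above; the function name `tag_tokens` is hypothetical and is not part of any tagger's API.

```python
# A few (token, lemma, POS) rows copied from the tagger output table.
parsed = [
    ("Vice", "vice", "PROPN"),
    ("President", "president", "PROPN"),
    ("Biden", "biden", "PROPN"),
    (",", ",", "PUNCT"),
    ("Members", "member", "NOUN"),
]

def tag_tokens(rows, use_lemma=False):
    """Join each token (or its lemma) with its POS tag as 'token/POS'."""
    return [f"{lemma if use_lemma else token}/{pos}" for token, lemma, pos in rows]

tagged = tag_tokens(parsed)
# Tagged tokens can then be filtered by POS, e.g. to keep nouns only.
nouns = [t for t in tagged if t.endswith("/NOUN")]
```

Keeping the tag inside the token lets later steps select or remove features by part of speech with simple pattern matching.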
Part-of-speech tagging (3)
• European languages (POS tagger)
  • spaCy
    • spacyr (on CRAN)
  • Universal Dependencies
    • udpipe (on CRAN)
• Asian languages (morphological analysis tools)
  • MeCab
    • RMeCab, RcppMeCab (on CRAN)
  • Jieba
    • jiebaR (on CRAN)
  • HanNanum
    • KoNLP (on CRAN)
Word embeddings (1)
• Word embedding is a technique to represent the meanings of words in a vector space
  • Each word has a numeric “word vector” that represents its meaning
  • Similarity between word vectors reflects their semantic similarity
• Some of the models have “linear substructure”
  • e.g. ‘man’ – ‘woman’ ∝ ‘king’ – ‘queen’
• Word embedding models
  • LSA, Word2vec, GloVe etc.
  • Use sentences (LSA) or word windows (Word2vec, GloVe)
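The two properties above (similarity between vectors, and linear substructure) can be illustrated with a small sketch. The 3-dimensional vectors are made up for illustration and do not come from a trained model; real embeddings have hundreds of dimensions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def difference(u, v):
    """Element-wise difference u - v."""
    return [a - b for a, b in zip(u, v)]

# Hypothetical word vectors (assumption: invented values, not model output).
vectors = {
    "man":   [1.0, 0.0, 0.2],
    "woman": [1.0, 1.0, 0.2],
    "king":  [3.0, 0.0, 0.9],
    "queen": [3.0, 1.0, 0.9],
    "apple": [0.1, 0.5, 3.0],
}

# Semantic similarity: 'king' is closer to 'queen' than to 'apple'.
sim_royal = cosine(vectors["king"], vectors["queen"])
sim_fruit = cosine(vectors["king"], vectors["apple"])

# Linear substructure: the two difference vectors point in the same direction.
parallel = cosine(difference(vectors["man"], vectors["woman"]),
                  difference(vectors["king"], vectors["queen"]))
```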
Word embeddings (2)
• Words that appear in the same contexts have similar meanings, according to the distributional hypothesis
• Syntagmatic associates
  • Words that neighbor each other (first-order collocations)
  • e.g. I will go to see a lawyer to seek legal advice
    • “lawyer” and “legal” are words in the same topic
  • e.g. I met a bad doctor yesterday
    • “bad” is a modifier of “doctor”
• Paradigmatic parallels (synonyms)
  • Words that have similar neighbors (second-order collocations)
  • e.g. I will go to see a lawyer/barrister to seek legal advice
  • e.g. I met a bad/terrible doctor yesterday
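The difference between first-order and second-order collocations can be sketched with the example sentences above. The window size of 3 and the use of Jaccard overlap are assumptions chosen for this illustration; embedding models use weighted counts rather than sets.

```python
from collections import defaultdict

sentences = [
    "i will go to see a lawyer to seek legal advice",
    "i will go to see a barrister to seek legal advice",
    "i met a bad doctor yesterday",
    "i met a terrible doctor yesterday",
]

def context_sets(sentences, window=3):
    """Collect, for each word, the set of words within `window` positions of it."""
    contexts = defaultdict(set)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            contexts[word].update(tokens[lo:i] + tokens[i + 1:hi])
    return contexts

def jaccard(a, b):
    """Share of neighbors that two words have in common."""
    return len(a & b) / len(a | b)

contexts = context_sets(sentences)

# First order: 'legal' is a direct neighbor of 'lawyer'.
first_order = "legal" in contexts["lawyer"]
# Second order: 'lawyer' and 'barrister' never co-occur but share all neighbors.
second_order = jaccard(contexts["lawyer"], contexts["barrister"])
```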
Latent Semantic Analysis
• Latent Semantic Analysis/Indexing (LSA) helps us to identify synonyms
  • Proposed for information retrieval systems by Deerwester et al. (1990)
  • Applied to automatically extract synonyms from a corpus by Turney and Littman (2003)
• Essentially, LSA is factor analysis of document-feature matrices (DFMs)
  • Simplest form of word embeddings where documents are sentences
  • Use SVD for reducing document dimensions to 200-300 factors
Singular Value Decomposition (SVD)
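$$X \approx \hat{X} = D S T'$$

• $X$ ($m \times n$): DFM with documents in rows and terms in columns
• $D$ ($m \times k$): left-singular vectors (documents)
• $S$ ($k \times k$): singular values
• $T'$ ($k \times n$): right-singular vectors (terms)
• Smoothing of DFM: $\hat{X}$ is the rank-$k$ approximation of $X$
• Using right-singular vectors $T S'$, we reduce documents to $k$ factors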
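A minimal sketch of the smoothing effect, assuming $k = 1$ and a toy DFM with two sentences and three terms. It finds only the leading singular triplet by power iteration to stay within the standard library; in practice a truncated SVD routine from a numerical library would be used. The function names are hypothetical.

```python
import math

def matvec(A, v):
    """Multiply matrix A (a list of rows) by vector v."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def top_singular_triplet(X, n_iter=100):
    """Leading (d, s, t) of X by power iteration on X'X.

    Assumes X is a non-zero matrix of non-negative counts, so the all-ones
    starting vector is not orthogonal to the leading right-singular vector.
    """
    Xt = [list(col) for col in zip(*X)]
    t = [1.0] * len(X[0])
    for _ in range(n_iter):
        t = matvec(Xt, matvec(X, t))
        norm = math.sqrt(sum(x * x for x in t))
        t = [x / norm for x in t]
    Xt_prod = matvec(X, t)
    s = math.sqrt(sum(x * x for x in Xt_prod))   # singular value
    d = [x / s for x in Xt_prod]                 # left-singular vector
    return d, s, t

def smooth_rank1(X):
    """Rank-1 approximation of X, i.e. d * s * t' with k = 1."""
    d, s, t = top_singular_triplet(X)
    return [[di * s * tj for tj in t] for di in d]

# Toy DFM: rows are two sentences, columns are lawyer, barrister, legal.
X = [[1, 0, 1],
     [0, 1, 1]]
X_hat = smooth_rank1(X)
```

In the smoothed matrix, the first sentence receives a non-zero value for ‘barrister’ although the word never occurs in it, because ‘lawyer’ and ‘barrister’ share the neighbor ‘legal’.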
Semi-supervised models
Semi-supervised learning
• Semi-supervised models for text analysis use ‘seed words’ as weak supervision
• Semi-supervised learning is aimed at balancing cost and control
  • Cost: the amount of manual input users must provide to perform the analysis
  • Control: the degree of users’ control over the results of the analysis
• Different from semi-supervised models developed by computer scientists
  • Typically, they expand the training set by (unsupervised) clustering
Newsmap
• Dictionary-based semi-supervised document classifier
  • Construct a training set using a seed dictionary
    • e.g. geographical dictionary, topic dictionary
• When
  • $P(f \mid C)$ is the probability of feature $f$ in documents with label $C$
  • $P(f \mid \hat{C})$ is the probability of feature $f$ in documents with labels other than $C$
• Association $s$ of word $f$ for category $C$ is a log-likelihood ratio

$$s = \log \frac{P(f \mid C)}{P(f \mid \hat{C})}$$

• Available as an R package
  • newsmap (on CRAN) with multi-lingual geographical dictionary
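The log-likelihood ratio above can be sketched as follows. This is an illustration and not the newsmap package's code: the add-one smoothing (`smooth`) is an assumption made here to avoid taking the log of zero, and the three labelled documents are invented, as if they had been labelled by a seed dictionary.

```python
import math
from collections import Counter

def association_scores(labelled_docs, smooth=1.0):
    """Return {label: {feature: log-likelihood ratio}} from (label, tokens) pairs."""
    counts = {}
    for label, tokens in labelled_docs:
        counts.setdefault(label, Counter()).update(tokens)
    vocab = sorted({f for c in counts.values() for f in c})
    scores = {}
    for label, inside in counts.items():
        # Pool the feature counts of all documents with other labels.
        outside = Counter()
        for other, c in counts.items():
            if other != label:
                outside.update(c)
        n_in = sum(inside.values()) + smooth * len(vocab)
        n_out = sum(outside.values()) + smooth * len(vocab)
        scores[label] = {
            f: math.log(((inside[f] + smooth) / n_in) / ((outside[f] + smooth) / n_out))
            for f in vocab
        }
    return scores

# Hypothetical training set; labels follow from seed words such as Kiev or Baghdad.
docs = [
    ("UA", ["Kiev", "protest", "Maidan"]),
    ("UA", ["Ukraine", "Maidan", "talks"]),
    ("IQ", ["Baghdad", "attack", "talks"]),
]
scores = association_scores(docs)
```

‘Maidan’ is not a seed word, yet it receives a positive score for UA because it co-occurs with the seed words. ‘talks’ appears under both labels and so stays close to zero.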
Latent Semantic Scaling (LSS)
• Semi-supervised document scaling based on LSA
  1. Smooth a DFM by SVD with $k = 300$
    • Documents should be sentences
  2. Weight features by proximity to seed words
    • Compute cosine similarity of features to seed words
    • For example, sentiment seed words are
      • Positive {good, nice, excellent, positive, fortunate, correct, superior}
      • Negative {bad, nasty, poor, negative, unfortunate, wrong, inferior}
  3. Scale documents by the feature weights in the same way as Wordscores
• Available as an R package
  • LSS (https://github.com/koheiw/LSS)
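Steps 2 and 3 can be sketched as below. This is a simplified illustration, not the LSS package's implementation: the 2-dimensional word vectors are invented (in LSS they would come from the SVD-smoothed DFM), only one seed word per pole is used, and averaging the signed similarities over seeds is an assumption of this sketch.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def feature_weights(vectors, seeds):
    """Weight each feature by its mean polarity-signed similarity to the seeds."""
    weights = {}
    for feature, vec in vectors.items():
        sims = [polarity * cosine(vec, vectors[seed]) for seed, polarity in seeds.items()]
        weights[feature] = sum(sims) / len(sims)
    return weights

def scale_document(tokens, weights):
    """Frequency-weighted mean of feature weights, in the manner of Wordscores."""
    known = [weights[t] for t in tokens if t in weights]
    return sum(known) / len(known) if known else 0.0

# Hypothetical word vectors (assumption: invented values).
vectors = {
    "good":   [1.0, 0.1],
    "bad":    [-1.0, 0.1],
    "stable": [0.9, 0.3],
    "chaos":  [-0.8, 0.4],
}
seeds = {"good": 1, "bad": -1}   # positive and negative seed words

weights = feature_weights(vectors, seeds)
score_positive = scale_document(["stable", "stable", "chaos"], weights)
score_negative = scale_document(["chaos"], weights)
```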
Watanabe 2017
Measuring news bias: Russia’s official news agency ITAR-TASS’ coverage of the Ukraine crisis
Research question
• How biased was TASS’s news coverage of the Ukraine crisis?
• News bias is one of the most important concepts in media studies
  • However, it is very difficult to measure news bias
• The Ukraine crisis was a significant geo-political event in recent years
  • Yanukovych's sudden pro-Russian policy change triggered anti-government protests in Kiev in November 2013
  • Russia swiftly annexed Crimea in March 2014 through a ‘referendum’
  • Russia continues to support pro-Russian separatists in east Ukraine
Data
• English news articles about Ukraine published in 2013-2014 by news agencies
  • TASS: Russia’s state-owned news agency (n=87,725)
  • Interfax: Russian commercial news agency (n=103,236)
  • Reuters: Western international news agency (n=21,718)
Analysis
• LSS
  • Train two models, for democracy and sovereignty, with sentiment seed words
    • Use collocations of “democra*” or “sovereign*” as terms of the models
    • Seed words were generic positive-negative words
  • Predict sentiment of all news articles by the LSS models
    • Use Interfax as an independent benchmark
• Newsmap
  • Train and apply a geographical classifier to select news about Ukraine
• Regression
  • Fit regression models to estimate the impact of desirable or undesirable events for the Russian government
Key events in the Ukraine crisis
Framing of democracy
Framing of sovereignty
Estimated state-ownership bias
• State-ownership bias is estimated by interaction between TASS and time dummies
Sentiment in stories with quotes
• There is a non-linear association between quotes and sentiment in sovereignty
• It suggests that TASS news coverage is negative because of
  • Description of events
    • Low proportion of quotes
  • Comments by Russian officials
    • High proportion of quotes
Validation
• Compute correlation between human and machine coding
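A minimal sketch of this check, assuming human coders and the model scored the same five documents. The scores are invented for illustration and are not results from the study.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

# Hypothetical scores for the same documents (assumption: made-up values).
human = [-2, -1, 0, 1, 2]                # manual coding on a 5-point scale
machine = [-0.8, -0.3, 0.1, 0.2, 0.9]    # model-predicted sentiment

r = pearson(human, machine)
```

The two scales do not need to match, because the correlation coefficient is unaffected by linear rescaling of either variable.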
Conclusions
• LSS is a useful tool for domain-specific sentiment analysis
  • Passed face validation (trend) and criterion validation (correlation)
• The Russian government utilized TASS for its international propaganda
  • TASS’s news coverage changed in the directions predicted from Russia’s interests
    • Positive shift after desirable events
    • Negative shift after undesirable events
• Typology of news bias in Western media does not always apply
Watanabe 2018
Newsmap: A semi-supervised approach to geographical news classification
Problems
• Simple dictionary matching does not work well
  • A dictionary cannot perform single-membership classification well
  • Place names are too ambiguous
    • Geo/non-geo ambiguity
      • “Nice” can be either Nice in France or an English adjective
    • Geo/geo ambiguity
      • “London” refers to either the United Kingdom’s capital or a city in Canada’s Ontario
  • There are too many place indicators besides place names
    • People (e.g. David Cameron) or organizations (e.g. the Pentagon)
Existing approaches
• Disambiguation techniques
  • Knowledge-based approach
    • Prioritize places with larger populations or physical areas
  • Map-based approach
    • Choose places in closer proximity when place names are ambiguous
  • Data-driven approach
    • Extract associations between place names from the corpus
Solution
• Develop a geographical news classifier to identify the most strongly associated countries
  • Extract both place and non-place features from the corpus
  • Construct a small seed dictionary with only the names of countries and capital cities, for example:
    • Ukraine: {Ukraine, Ukrainian*, Kiev}
    • Iraq: {Iraq, Iraqi*, Baghdad}
  • Only include capitalized words to reduce geo/non-geo ambiguity
  • Words with geo/geo ambiguity should have small weight in the model
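A sketch of how such a seed dictionary could assign provisional labels to documents, using the two example entries above. The function name `seed_labels` is hypothetical, and treating the `*` in the dictionary as a shell-style wildcard is an assumption of this illustration.

```python
from fnmatch import fnmatchcase

# Seed dictionary with the two example entries from the slide.
seed_dictionary = {
    "Ukraine": ["Ukraine", "Ukrainian*", "Kiev"],
    "Iraq": ["Iraq", "Iraqi*", "Baghdad"],
}

def seed_labels(tokens, dictionary):
    """Return the countries whose seed patterns match a capitalized token."""
    # Only capitalized tokens are considered, to reduce geo/non-geo ambiguity.
    capitalized = [t for t in tokens if t[:1].isupper()]
    return sorted(
        country
        for country, patterns in dictionary.items()
        if any(fnmatchcase(t, p) for t in capitalized for p in patterns)
    )

labels = seed_labels("Ukrainians protest in Kiev".split(), seed_dictionary)
```

Documents labelled this way form the training set from which the association scores of all other features, including non-place features, are learned.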
Data
• News summaries collected in 2014
  • Training set
    • Yahoo News US edition (n=156,980)
  • Test set
    • From newspapers from different parts of the world
      • The Times (United Kingdom)
      • The New York Times
      • The Australian
      • The Nation (Kenya)
      • The Times of India
    • 5,000 summaries manually classified by at least three coders
      • Used the crowdsourcing platform Prolific Academic (Oxford-based online recruiting platform)
Experiment
• Compare Newsmap’s classification accuracy with three tools
  • Open Calais and Geoparser.io are commercial tools
  • Gazetteer is a larger dictionary of place names
    • 27,678 place names in 255 countries collected by US government agencies
Evaluation: precision
• Precision is low in GB, ZA, SS and JP due to ambiguous place names in the dictionary
Evaluation: recall
• Recall is high in Newsmap because it exploits more non-place features
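Per-country precision and recall can be computed as in the sketch below, assuming one gold label and one predicted label per news summary. The six label pairs are invented for illustration and are not the experiment's data.

```python
def precision_recall(gold, predicted, label):
    """Precision and recall of the predictions for one country label."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == label and p == label for g, p in pairs)   # correctly assigned
    fp = sum(g != label and p == label for g, p in pairs)   # wrongly assigned
    fn = sum(g == label and p != label for g, p in pairs)   # missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical labels (assumption: made-up values).
gold      = ["GB", "GB", "US", "JP", "US", "GB"]
predicted = ["GB", "US", "US", "JP", "JP", "GB"]

gb = precision_recall(gold, predicted, "GB")
jp = precision_recall(gold, predicted, "JP")
```

An ambiguous place name lowers precision, as with JP here, because it attracts summaries that belong to other countries. A classifier that misses documents lacking explicit place names has low recall, as with GB here.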
Ambiguous place names
Place name      Country       Misclassified cases                           Ambiguity type
“South” region  Cameroon      South Africa, South Sudan, South Dakota etc.  Geo/geo
“Unity” state   South Sudan   Software products                             Geo/non-geo
“Nigel” town    South Africa  Nigel Farage (British politician)             Geo/non-geo
“Obama” city    Japan         Barack Obama (American president)             Geo/non-geo
“Gay” city      Russia        LGBT rights                                   Geo/non-geo
“Morsi” city    India         Mohamed Morsi (Egyptian president)            Geo/non-geo
Issues
• Entity recognition is more difficult in non-English languages
  • Proper adjectives (e.g. American) are not capitalized in many European languages
  • Tokenization of multi-word expressions (e.g. New York) is necessary for high classification accuracy
• Newsmap fails to classify sports news
  • Athletes and sports events move quickly between countries
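Tokenization of multi-word expressions can be sketched as a compounding step applied after ordinary tokenization. The function `compound_tokens` and the underscore joiner are assumptions of this illustration; the phrase list would normally come from collocation analysis or a dictionary.

```python
def compound_tokens(tokens, phrases, joiner="_"):
    """Merge known multi-word expressions into single tokens, longest match first."""
    phrase_set = {tuple(p.split()) for p in phrases}
    max_len = max((len(p) for p in phrase_set), default=1)
    out, i = [], 0
    while i < len(tokens):
        # Try the longest candidate phrase starting at position i first.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            if tuple(tokens[i:i + n]) in phrase_set:
                out.append(joiner.join(tokens[i:i + n]))
                i += n
                break
        else:
            # No phrase starts here, so keep the single token.
            out.append(tokens[i])
            i += 1
    return out

tokens = "He flew from New York to New Delhi".split()
compounded = compound_tokens(tokens, ["New York", "New Delhi"])
```

Without this step, ‘New’ would be a single feature shared by both places and could not be associated with either country.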
Recapitulation
Summary of the course
What is quantitative text analysis? (1)
• Methodology to understand social processes through textual data
• Pre-processing is about low-level linguistic phenomena
  • Morphological and syntactical
    • POS tagging, syntactical parsing, lemmatization (stemming)
  • Lexical
    • Dictionary lookup, collocation analysis
• Main analysis is about high-level linguistic phenomena
  • Discourse
    • Frequency analysis, scaling or classification
  • Pragmatics
    • Ideology of authors, preference of actors, bias in documents, etc.
What is quantitative text analysis? (2)
• There are symbolic and statistical approaches
  • Symbolic approach
    • Easy to make analysis consistent with theory
    • Requires much manual input but little computation (e.g. dictionary)
  • Statistical approach
    • Difficult to make analysis consistent with theory
    • Requires little manual input but much computation (e.g. topic models)
• However, we must connect concepts (symbols) and numbers (statistics) to make sense of the results
• We connect concepts and numbers by
  • Symbolic approach: pre-definition
  • Statistical approach: post-hoc interpretation
Data collection (1)
• Political document corpora are distributed publicly
  • UN General Assembly Debate Corpus
    https://github.com/sjankin/UnitedNations
  • Manifesto Corpus
    https://manifesto-project.wzb.eu
  • Various political corpora
    https://github.com/quanteda/quanteda.corpora
Data collection (2)
• News article corpora are usually not publicly available
  • Commercial databases
    • Nexis, Factiva and ProQuest
  • News websites
    • RSS feeds
    • Scraping
      • rvest and RSelenium (on CRAN)
  • Web API
    • New York Times API
      • See https://koheiw.net/?p=643
Data management
• Save documents in UTF-8 to avoid data corruption
  • HTML or XML is better than plain text or CSV
• Import data into R using existing tools
  • Text, Word, PDF files
    • readtext (on CRAN)
  • Excel spreadsheet
    • rio or readxl (on CRAN)
  • HTML from newspaper databases
    • newspapers (https://github.com/koheiw/newspapers)
Statistical analysis (1)
• Types of statistical analysis
  • Positional (string-of-words)
    • Collocation (contiguous/non-contiguous)
    • Word embeddings
  • Non-positional (bag-of-words)
    • Frequency (absolute/relative)
    • Document/feature similarity
    • Naive Bayes
    • Wordscores
    • Wordfish
    • Correspondence analysis
    • Topic models
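The distinction between the two types can be sketched with one short text: bag-of-words statistics ignore word order, while string-of-words statistics depend on it. The example sentence is chosen only for illustration.

```python
from collections import Counter

tokens = "we the people of the united states".split()

# Non-positional (bag-of-words): absolute and relative frequency.
absolute = Counter(tokens)
relative = {word: count / len(tokens) for word, count in absolute.items()}

# Positional (string-of-words): contiguous collocations (bigrams).
bigrams = Counter(zip(tokens, tokens[1:]))

# Reordering the tokens leaves the bag-of-words counts unchanged,
# but it changes the collocations.
shuffled = sorted(tokens)
same_bag = Counter(shuffled) == absolute
same_bigrams = Counter(zip(shuffled, shuffled[1:])) == bigrams
```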
Statistical analysis (2)
• Types of machine learning
  • Supervised models
    • Naive Bayes
    • Wordscores
  • Unsupervised models
    • Wordfish
    • Correspondence analysis
    • Topic models
    • Word embeddings
  • Semi-supervised models
    • Newsmap
    • LSS
    • Seeded LDA
Examples (1)
• Dictionary
  • Measure historical geo-political risks (Caldara & Iacoviello 2018)
  • Dynamics between ruling and opposition parties in European legislatures (Proksch et al., forthcoming)
• Document similarity
  • Adoption of bills from other US states (Jansa et al. 2018)
• Collocation
  • Representation of Muslims in newspapers (Baker et al. 2012)
• Naive Bayes classifier
  • Identify jihadist clerics (Nielsen 2017)
Examples (2)
• Wordscores
  • Policy positions of Irish and British parties (Benoit & Laver 2003)
  • Foreign policies of UN member countries (Baturo et al. 2017)
• Wordfish
  • Policy positions of German parties (Slapin & Proksch 2008)
• Correspondence analysis
  • Themes in US presidential election (Schonhardt-Bailey 2005)
• LDA
  • Climate change denial by think tanks (Boussalis & Coan 2016)
  • Predict occurrence of political violence (Mueller & Rauh 2018)
Types of validity and validation
• Content validity
  • Measurement captures all the relevant aspects
    • Cannot check empirically
• Criterion validity
  • Results of the measurement are correlated with a known measurement
    • Compare the results with existing data or a manually created dataset (“gold standard” or “ground truth”)
• Face validity
  • Results of the measurement are consistent with our knowledge
    • Show the results to your supervisors or colleagues (“eyeball test”)
Interpretation
• Focus on discourse and pragmatics
  • Discourse
    • What are the tones of the documents?
    • What are the topics of the documents?
    • What are the frames of the documents?
  • Pragmatics
    • Who are the readers of the documents?
    • Who are the authors of the documents?
    • Why did the authors write the documents?
    • Why were the documents published at that particular time?
Further materials
• Quanteda
  • Reference and example
    https://quanteda.io
  • Stack Overflow
    https://stackoverflow.com/tags/quanteda
  • Bug report
    https://github.com/quanteda/quanteda/issues
• Text analysis in Asian languages
  • Obstruction to Asian-language text analysis
    https://koheiw.net/?p=766
  • Quantitative Text Analysis in Japanese
    https://koheiw.net/?p=831
International conference at Waseda
• POLTEXT 2019
  • See https://www.poltextconference.org
  • An international conference on quantitative text analysis
    • The next one will be in Innsbruck, Austria in 2020
  • Will be held in this building
    • 13 September
      • Pre-conference event (tutorial sessions)
    • 14-15 September
      • Main conference (presentation)
References
Deerwester, S. C., Dumais, S. T., Landauer, T. K., Furnas, G. W., & Harshman, R. A. (1990). Indexing by latent semantic analysis. JASIS, 41(6), 391–407.
Turney, P. D., & Littman, M. L. (2003). Measuring Praise and Criticism: Inference of Semantic Orientation from Association. ACM Transactions on Information Systems, 21(4), 315–346. https://doi.org/10.1145/944012.944013
Turney, P. D., & Pantel, P. (2010). From Frequency to Meaning: Vector Space Models of Semantics. Journal of Artificial Intelligence Research, 37(1), 141–188.
Watanabe, K. (2017). Measuring news bias: Russia’s official news agency ITAR-TASS’ coverage of the Ukraine crisis. European Journal of Communication, 32(3), 224–241. https://doi.org/10.1177/0267323117695735
Watanabe, K. (2018). Newsmap: A semi-supervised approach to geographical news classification. Digital Journalism, 6(3), 294–309. https://doi.org/10.1080/21670811.2017.1293487