Political Text Analysis: Lecture 7

Kohei Watanabe

Advanced techniques


Part-of-speech tagging (1)

• POS taggers are tools that extract more information about words based on syntactic parsing
  • Part-of-speech
    • Category of words (e.g. noun, verb, adjective, adverb, pronoun, preposition, conjunction, etc.)
  • Lemma
    • Base form of words (e.g. ‘eat’ is the lemma of ‘ate’, ‘eats’ and ‘eating’)
  • Named entity
    • Words referring to entities (e.g. people, places, organizations etc.)

• POS taggers use language models trained on a manually annotated corpus



    Doc ID      Sentence ID  Token ID  Token          Lemma          POS    Entity
1   2013-Obama  1            1         Vice           vice           PROPN
2   2013-Obama  1            2         President      president      PROPN
3   2013-Obama  1            3         Biden          biden          PROPN  PERSON_B
4   2013-Obama  1            4         ,              ,              PUNCT
5   2013-Obama  1            5         Mr.            mr.            PROPN
6   2013-Obama  1            6         Chief          chief          PROPN
7   2013-Obama  1            7         Justice        justice        PROPN
8   2013-Obama  1            8         ,              ,              PUNCT
9   2013-Obama  1            9         Members        member         NOUN
10  2013-Obama  1            10        of             of             ADP
11  2013-Obama  1            11        the            the            DET    ORG_B
12  2013-Obama  1            12        United         united         PROPN  ORG_I
13  2013-Obama  1            13        States         states         PROPN  ORG_I
14  2013-Obama  1            14        Congress       congress       PROPN  ORG_I
15  2013-Obama  1            15        ,              ,              PUNCT
16  2013-Obama  1            16        distinguished  distinguished  ADJ
17  2013-Obama  1            17        guests         guest          NOUN
18  2013-Obama  1            18        ,              ,              PUNCT
19  2013-Obama  1            19        and            and            CCONJ
20  2013-Obama  1            20        fellow         fellow         ADJ


• POS information can be included in tokens


##  [1] "Vice/PROPN"        "President/PROPN"   "Biden/PROPN"
##  [4] ",/PUNCT"           "Mr./PROPN"         "Chief/PROPN"
##  [7] "Justice/PROPN"     ",/PUNCT"           "member/NOUN"
## [10] "of/ADP"            "the/DET"           "United/PROPN"
## [13] "States/PROPN"      "Congress/PROPN"    ",/PUNCT"
## [16] "distinguished/ADJ" "guest/NOUN"        ",/PUNCT"
## [19] "and/CCONJ"         "fellow/ADJ"        "citizen/NOUN"
## [22] ":/PUNCT"           "\n\n/SPACE"        "each/DET"
## [25] "time/NOUN"         "-PRON-/PRON"       "gather/VERB"
## [28] "to/PART"           "inaugurate/VERB"   "a/DET"
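As a rough illustration of how POS information can be attached to tokens, here is a minimal Python sketch. The rows are copied by hand from the tagged inaugural speech; the helper names `attach_pos` and `keep_pos` are hypothetical and do not belong to spaCy, spacyr or any other package.

```python
# A few (token, lemma, POS) rows copied by hand from the tagged speech.
tagged = [
    ("Vice", "vice", "PROPN"),
    ("President", "president", "PROPN"),
    ("Biden", "biden", "PROPN"),
    (",", ",", "PUNCT"),
    ("Members", "member", "NOUN"),
]

def attach_pos(rows, use_lemma=False):
    """Return 'form/POS' strings, using the lemma instead of the token if requested."""
    return [f"{lemma if use_lemma else token}/{pos}" for token, lemma, pos in rows]

def keep_pos(rows, keep):
    """Keep only the rows whose POS tag is in the set `keep`."""
    return [row for row in rows if row[2] in keep]

tokens = attach_pos(tagged)
nouns = attach_pos(keep_pos(tagged, {"NOUN"}), use_lemma=True)
print(tokens)
print(nouns)
```

Tagged tokens of this kind can then be filtered by tag (for example, keeping only nouns) before a document-feature matrix is built.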


• European languages (POS tagger)
  • spaCy
    • spacyr (on CRAN)
  • Universal Dependencies
    • udpipe (on CRAN)

• Asian languages (morphological analysis tools)
  • MeCab
    • RMeCab, RcppMeCab (on CRAN)
  • Jieba
    • jiebaR (on CRAN)
  • HanNanum
    • KoNLP (on CRAN)


Word embeddings (1)

• Word embedding is a technique to represent meanings of words in a vector space
  • Each word has a numeric “word vector” that represents its meaning
  • Similarity between word vectors is their semantic similarity
  • Some of the models have “linear substructure”
    • e.g. ‘man’ – ‘woman’ ∝ ‘king’ – ‘queen’

• Word embedding models
  • LSA, Word2vec, GloVe etc.
  • Use sentences (LSA) or word windows (Word2vec, GloVe) as contexts
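The two properties of word vectors mentioned here, similarity and linear substructure, can be illustrated with a small Python sketch. The three-dimensional vectors are invented for illustration and do not come from a trained model; the similarity measure is cosine similarity.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def diff(a, b):
    """Element-wise difference a - b."""
    return [x - y for x, y in zip(a, b)]

# Toy vectors invented for illustration (assumption, not model output).
vectors = {
    "man":   [1.0, 0.0, 0.2],
    "woman": [1.0, 1.0, 0.2],
    "king":  [1.0, 0.0, 0.9],
    "queen": [1.0, 1.0, 0.9],
    "apple": [0.0, 0.1, -1.0],
}

# Linear substructure: the two difference vectors point in the same direction.
offset_similarity = cosine(diff(vectors["man"], vectors["woman"]),
                           diff(vectors["king"], vectors["queen"]))

# Semantic similarity: 'king' is closer to 'queen' than to 'apple'.
sim_king_queen = cosine(vectors["king"], vectors["queen"])
sim_king_apple = cosine(vectors["king"], vectors["apple"])
print(offset_similarity, sim_king_queen, sim_king_apple)
```

With real embeddings the offsets are only approximately parallel, so the first value would be high but below 1.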


Word embeddings (2)

• Words that occur in similar contexts have similar meanings according to the distributional hypothesis

• Syntagmatic associates
  • Words that neighbor each other (first-order collocations)
    • e.g. I will go to see a lawyer to seek legal advice
      • “lawyer” and “legal” are words in the same topic
    • e.g. I met a bad doctor yesterday
      • “bad” is a modifier of “doctor”

• Paradigmatic parallels (synonyms)
  • Words that have similar neighbors (second-order collocations)
    • e.g. I will go to see a lawyer/barrister to seek legal advice
    • e.g. I met a bad/terrible doctor yesterday
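The difference between first-order and second-order collocations can be sketched in Python using the lawyer/barrister example. The window size of 3 and the use of Jaccard overlap to compare neighbor sets are assumptions made for this illustration only.

```python
from collections import defaultdict

sentences = [
    "i will go to see a lawyer to seek legal advice".split(),
    "i will go to see a barrister to seek legal advice".split(),
]

def neighbors(sentences, window=3):
    """Map each word to the set of words occurring within `window` positions of it."""
    result = defaultdict(set)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    result[word].add(tokens[j])
    return result

def jaccard(a, b):
    """Share of neighbors that two words have in common."""
    return len(a & b) / len(a | b)

nb = neighbors(sentences)

# Syntagmatic (first order): 'legal' occurs near 'lawyer'.
first_order = "legal" in nb["lawyer"]

# Paradigmatic (second order): 'lawyer' and 'barrister' never co-occur,
# but they share the same neighbors.
co_occur = "barrister" in nb["lawyer"]
second_order = jaccard(nb["lawyer"], nb["barrister"])
print(first_order, co_occur, second_order)
```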


Latent Semantic Analysis

• Latent Semantic Analysis/Indexing helps us to identify synonyms
  • Proposed for information retrieval systems by Deerwester et al. (1990)
  • Applied to automatically extract synonyms from a corpus by Turney and Littman (2003)

• Essentially, LSA is factor analysis of DFMs
  • Simplest form of word embeddings where documents are sentences
  • Use SVD for reducing document dimensions to 200-300 factors


Singular Value Decomposition (SVD)


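$$X \approx \hat{X} = D S T'$$

• $X$ and $\hat{X}$: $m \times n$ (documents × terms)
• $D$: $m \times k$
• $S$: $k \times k$ (singular values)
• $T'$: $k \times n$

Smoothing of DFM

• Using the right-singular vectors scaled by the singular values, $S T'$ ($k \times n$), we reduce documents to $k$ factors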
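To make the smoothing step concrete, here is a minimal pure-Python sketch of a truncated SVD computed by power iteration with deflation. It is meant only for tiny toy matrices; real analyses would use an optimized SVD routine. The function names, the fixed start vector and the toy DFM are all assumptions made for illustration.

```python
import math

def matvec(A, v):
    """Multiply matrix A (a list of rows) by vector v."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def truncated_svd(X, k, iters=500):
    """Return up to k (sigma, u, v) triplets of X, largest singular value first."""
    R = [row[:] for row in X]                 # residual, deflated after each factor
    n = len(X[0])
    triplets = []
    for _ in range(k):
        Rt = [list(col) for col in zip(*R)]   # transpose of the residual
        start = [float(j + 1) for j in range(n)]   # arbitrary non-uniform start vector
        v = [x / norm(start) for x in start]
        for _ in range(iters):
            w = matvec(Rt, matvec(R, v))      # one power-iteration step on R'R
            size = norm(w)
            if size < 1e-12:
                break
            v = [x / size for x in w]
        u = matvec(R, v)
        sigma = norm(u)
        if sigma < 1e-12:                     # residual is numerically zero
            break
        u = [x / sigma for x in u]
        triplets.append((sigma, u, v))
        R = [[R[i][j] - sigma * u[i] * v[j] for j in range(n)] for i in range(len(R))]
    return triplets

def smooth(X, k):
    """Rank-k approximation of X rebuilt from its leading singular triplets."""
    m, n = len(X), len(X[0])
    triplets = truncated_svd(X, k)
    return [[sum(s * u[i] * v[j] for s, u, v in triplets) for j in range(n)] for i in range(m)]

# Toy DFM: 4 documents (rows) by 4 terms (columns), rank 2.
X = [[1, 1, 0, 0],
     [1, 1, 0, 0],
     [0, 0, 1, 1],
     [0, 0, 2, 2]]

singular_values = [s for s, _, _ in truncated_svd(X, 2)]
X_hat_2 = smooth(X, 2)   # k equals the rank, so X is recovered
X_hat_1 = smooth(X, 1)   # only the dominant pattern (documents 3 and 4) survives
```

Choosing k below the rank of the matrix discards the weaker patterns, which is the smoothing effect; with k equal to the rank, the matrix is reproduced.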


Semi-supervised models


Semi-supervised learning

• Semi-supervised models for text analysis use ‘seed words’ as weak supervision

• Semi-supervised learning is aimed at balancing cost and control
  • Cost: the amount of users’ manual input to perform analysis
  • Control: the degree of users’ control over results of analysis

• Different from semi-supervised models developed by computer scientists
  • Typically, they expand the training set by (unsupervised) clustering


Newsmap

• Dictionary-based semi-supervised document classifier
  • Construct training set using seed dictionary
    • e.g. geographical dictionary, topic dictionary

• When
  • $P(f \mid C)$ is the probability of feature $f$ in documents with label $C$
  • $P(f \mid \hat{C})$ is the probability of feature $f$ in documents with labels not $C$

• Association $s$ of word $f$ for category $C$ is a log-likelihood ratio

$$s = \log \frac{P(f \mid C)}{P(f \mid \hat{C})}$$

• Available as an R package
  • newsmap (on CRAN) with multi-lingual geographical dictionary
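The association score can be sketched in a few lines of Python. This is not the newsmap package's implementation: the toy documents, the two-country seed dictionary, the add-one smoothing and the rule of classifying by the highest sum of scores are all assumptions made for illustration.

```python
import math
from collections import Counter

# Hypothetical seed dictionary and tokenized documents.
seed_dictionary = {"UA": {"ukraine", "kiev"}, "IQ": {"iraq", "baghdad"}}
documents = [
    ["ukraine", "protest", "maidan"],
    ["kiev", "maidan", "protest"],
    ["iraq", "baghdad", "mosul"],
    ["baghdad", "mosul", "bomb"],
]

def label_by_seeds(tokens, dictionary):
    """Labels whose seed words occur in the document (the weak supervision step)."""
    return {label for label, seeds in dictionary.items() if seeds & set(tokens)}

def association_scores(docs, dictionary, smooth=1.0):
    """s = log(P(f|C) / P(f|not C)) for every feature f and label C."""
    vocab = sorted({t for d in docs for t in d})
    scores = {}
    for label in dictionary:
        inside, outside = Counter(), Counter()
        for tokens in docs:
            target = inside if label in label_by_seeds(tokens, dictionary) else outside
            target.update(tokens)
        # Add-one smoothing (assumption) avoids taking the log of zero.
        n_in = sum(inside.values()) + smooth * len(vocab)
        n_out = sum(outside.values()) + smooth * len(vocab)
        scores[label] = {
            f: math.log(((inside[f] + smooth) / n_in) / ((outside[f] + smooth) / n_out))
            for f in vocab
        }
    return scores

def classify(tokens, scores):
    """Pick the label with the highest sum of association scores (assumed rule)."""
    totals = {label: sum(s.get(t, 0.0) for t in tokens) for label, s in scores.items()}
    return max(totals, key=totals.get)

scores = association_scores(documents, seed_dictionary)
predicted = classify(["maidan", "protest"], scores)
print(scores["UA"]["maidan"], scores["UA"]["mosul"], predicted)
```

Note that “maidan” is not a seed word, yet it receives a positive score for UA because it co-occurs with the seeds; this is how non-place features enter the classifier.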


Latent Semantic Scaling (LSS)

• Semi-supervised document scaling based on LSA
  1. Smooth a DFM by SVD with $k = 300$
     • Documents should be sentences
  2. Weight features by proximity to seed words
     • Compute cosine similarity of features to seed words
     • For example, sentiment seed words are
       • Positive {good, nice, excellent, positive, fortunate, correct, superior}
       • Negative {bad, nasty, poor, negative, unfortunate, wrong, inferior}
  3. Scale documents by the feature weights in the same way as Wordscores

• Available as an R package
  • LSS (https://github.com/koheiw/LSS)
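Steps 2 and 3 can be sketched in Python as follows. This is not the LSS package's code: the two-dimensional word vectors are invented (in LSS they would come from the SVD-smoothed DFM), only one seed word per pole is used, and averaging the signed similarities is an assumption about how proximity to several seeds is combined.

```python
import math
from collections import Counter

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Toy word vectors invented for illustration (assumption).
word_vectors = {
    "good":   [1.0, 0.1],
    "bad":    [-1.0, 0.1],
    "reform": [0.8, 0.6],
    "crisis": [-0.7, 0.7],
    "the":    [0.0, 1.0],
}
seeds = {"good": 1.0, "bad": -1.0}   # seed word -> polarity

def feature_weights(vectors, seeds):
    """Step 2: weight each feature by its signed cosine similarity to the seed words."""
    return {
        word: sum(p * cosine(vec, vectors[s]) for s, p in seeds.items()) / len(seeds)
        for word, vec in vectors.items()
    }

def scale_document(tokens, weights):
    """Step 3: frequency-weighted mean of feature weights, as in Wordscores."""
    counts = Counter(t for t in tokens if t in weights)
    total = sum(counts.values())
    if total == 0:
        return None   # no known features in the document
    return sum(weights[t] * n for t, n in counts.items()) / total

weights = feature_weights(word_vectors, seeds)
score_positive = scale_document(["the", "reform", "reform"], weights)
score_negative = scale_document(["the", "crisis"], weights)
print(weights, score_positive, score_negative)
```

A function word such as “the” is equally close to both poles and therefore gets a weight of zero, so it does not move the document score in either direction.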



Watanabe 2017
Measuring news bias: Russia’s official news agency ITAR-TASS’ coverage of the Ukraine crisis


Research question

• How biased was TASS’s news coverage of the Ukraine crisis?

• News bias is one of the most important concepts in media studies
  • However, it is very difficult to measure news bias

• The Ukraine crisis was a significant geo-political event in recent years
  • Yanukovych's sudden pro-Russian policy change triggered anti-government protests in Kiev in November 2013
  • Russia swiftly annexed Crimea in March 2014 through a ‘referendum’
  • Russia continues to support pro-Russian separatists in east Ukraine


Data

• English news articles about Ukraine published in 2013-2014 by news agencies
  • TASS: Russia’s state-owned news agency (n=87,725)
  • Interfax: Russia’s commercial news agency (n=103,236)
  • Reuters: Western international news agency (n=21,718)


Analysis

• LSS
  • Train two models for democracy and sovereignty with sentiment seed words
    • Use collocations of “democra*” or “sovereign*” as terms of the models
    • Seed words were generic positive-negative words
  • Predict sentiment of all news articles by the LSS models
    • Use Interfax as an independent benchmark

• Newsmap
  • Train and apply geographical classifier to select news about Ukraine

• Regression
  • Fit regression models to estimate the impact of desirable or undesirable events for the Russian government


Key events in the Ukraine crisis


Framing of democracy


Framing of sovereignty


Estimated state-ownership bias

• State-ownership bias is estimated by interaction between TASS and time dummies


Sentiment in stories with quotes

• There is a non-linear association between quotes and sentiment in sovereignty

• It suggests that TASS news coverage is negative because of
  • Description of events
    • Low quotes proportion
  • Comments by Russian officials
    • High quotes proportion


Validation

• Compute correlation between human and machine coding
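The correlation between human and machine coding can be computed as a Pearson coefficient. In the Python sketch below, both score lists are hypothetical numbers invented for illustration; they are not results from the study.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equally long lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

# Hypothetical scores for five documents (not data from the paper).
human_coding = [1, 2, 3, 4, 5]                 # e.g. coders' mean ratings
machine_coding = [-0.8, -0.1, 0.0, 0.6, 0.9]   # e.g. predicted sentiment scores

r = pearson(human_coding, machine_coding)
print(round(r, 3))
```

The two scales do not need to match, because the correlation depends only on how the scores co-vary across documents.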


Conclusions

• LSS is a useful tool for domain-specific sentiment analysis
  • Passed face validation (trend) and criterion validation (correlation)

• Russian government utilized TASS for its international propaganda

• TASS’s news coverage changed in directions predicted based on Russia’s interest
  • Positive shift after desirable events
  • Negative shift after undesirable events

• Typology of news bias in western media does not always apply


Watanabe 2018
Newsmap: A semi-supervised approach to geographical news classification


Problems

• Simple dictionary matching does not work well
  • Dictionary cannot perform single membership classification well
  • Place names are too ambiguous
    • Geo/non-geo ambiguity
      • “Nice” can be either Nice in France or an English adjective
    • Geo/geo ambiguity
      • “London” refers to either the United Kingdom’s capital or a city in Canada’s Ontario
  • Too many place indicators besides place names
    • People (e.g. David Cameron) or organizations (e.g. the Pentagon)


Existing approaches

• Disambiguation techniques
  • Knowledge-based approach
    • Prioritize places with larger population or physical areas
  • Map-based approach
    • Choose places in closer proximity when place names are ambiguous
  • Data-driven approach
    • Extract association between place names from the corpus


Solution

• Develop a geographical news classifier to identify most strongly associated countries
  • Extract both place and non-place features from the corpus
  • Construct a small seed dictionary only with names of countries and capital cities, for example:
    • Ukraine: {Ukraine, Ukrainian*, Kiev}
    • Iraq: {Iraq, Iraqi*, Baghdad}
  • Only include capitalized words to reduce geo/non-geo ambiguity
  • Words with geo/geo ambiguity should have small weight in the model


Data

• News summaries collected in 2014
  • Training set
    • Yahoo News US edition (n=156,980)
  • Test set
    • Newspapers from different parts of the world
      • The Times (United Kingdom)
      • The New York Times
      • The Australian
      • The Nation (Kenya)
      • The Times of India
    • 5,000 summaries manually classified by at least three coders
      • Used the crowdsourcing platform Prolific Academic (an Oxford-based online recruiting platform)


Experiment

• Compare Newsmap’s classification accuracy with three tools
  • Open Calais and Geoparser.io are commercial tools
  • Gazetteer is a larger dictionary of place names
    • 27,678 place names in 255 countries collected by US government agencies


Evaluation: precision


• Precision is low in GB, ZA, SS and JP due to ambiguity in place names in dictionary

Evaluation: recall


• Recall is high in Newsmap because it exploits more non-place features
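For reference, per-country precision and recall can be computed as in the Python sketch below. The gold and predicted labels are hypothetical and do not reproduce the experiment's data; each document is assumed to carry a single country label.

```python
def precision_recall(gold, predicted, label):
    """Precision and recall for one label; None when the denominator is zero."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if p == label and g == label)   # true positives
    fp = sum(1 for g, p in pairs if p == label and g != label)   # false positives
    fn = sum(1 for g, p in pairs if p != label and g == label)   # false negatives
    precision = tp / (tp + fp) if tp + fp else None
    recall = tp / (tp + fn) if tp + fn else None
    return precision, recall

# Hypothetical manual (gold) and automatic (predicted) country codes.
gold = ["GB", "GB", "US", "JP", "US", "GB"]
predicted = ["GB", "US", "US", "GB", "US", "GB"]

gb = precision_recall(gold, predicted, "GB")
us = precision_recall(gold, predicted, "US")
jp = precision_recall(gold, predicted, "JP")
print(gb, us, jp)
```

Low precision for a country means that many documents were wrongly assigned to it (as happens with ambiguous place names), while low recall means that many of its documents were missed.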

Ambiguous place names


Place name       Country       Misclassified cases                            Ambiguity type
“South” region   Cameroon      South Africa, South Sudan, South Dakota etc.   Geo/geo
“Unity” state    South Sudan   Software products                              Geo/non-geo
“Nigel” town     South Africa  Nigel Farage (British politician)              Geo/non-geo
“Obama” city     Japan         Barack Obama (American president)              Geo/non-geo
“Gay” city       Russia        LGBT rights                                    Geo/non-geo
“Morsi” city     India         Mohamed Morsi (Egyptian president)             Geo/non-geo

Issues

• Entity recognition is more difficult in non-English languages
  • Proper adjectives (e.g. American) are not capitalized in many European languages
  • Tokenization of multi-word expressions (e.g. New York) is necessary for high classification accuracy

• Newsmap fails to classify sports news
  • Athletes and sport events move quickly between countries


Recapitulation: Summary of the course


What is quantitative text analysis? (1)

• Methodology to understand social processes through textual data

• Pre-processing is about low-level linguistic phenomena
  • Morphological and syntactical
    • POS tagging, syntactical parsing, lemmatization (stemming)
  • Lexical
    • Dictionary lookup, collocation analysis

• Main analysis is about high-level linguistic phenomena
  • Discourse
    • Frequency analysis, scaling or classification
  • Pragmatics
    • Ideology of authors, preference of actors, bias in documents, etc.


What is quantitative text analysis? (2)

• There are symbolic and statistical approaches
  • Symbolic approach
    • Easy to make analysis consistent with theory
    • Requires much manual input but little computation (e.g. dictionary)
  • Statistical approach
    • Difficult to make analysis consistent with theory
    • Requires little manual input but much computation (e.g. topic models)

• However, we must connect concepts (symbols) and numbers (statistics) to make sense

• We connect concepts and numbers
  • Symbolic approach: pre-definition
  • Statistical approach: post-hoc interpretation


Data collection (1)

• Political document corpora are distributed publicly
  • UN General Assembly Debate Corpus
    https://github.com/sjankin/UnitedNations
  • Manifesto Corpus
    https://manifesto-project.wzb.eu
  • Various political corpora
    https://github.com/quanteda/quanteda.corpora



Data collection (2)

• News article corpora are usually not available
  • Commercial databases
    • Nexis, Factiva and ProQuest
  • News websites
    • RSS feeds
    • Scraping
      • rvest and RSelenium (on CRAN)
  • Web API
    • New York Times API
      • See https://koheiw.net/?p=643



Data management

• Save documents in UTF-8 to avoid data corruption
  • HTML or XML are better than plain text or CSV

• Import data into R using existing tools
  • Text, Word, PDF files
    • readtext (on CRAN)
  • Excel spreadsheet
    • rio or readxl (on CRAN)
  • HTML from newspaper databases
    • newspapers (https://github.com/koheiw/newspapers)



Statistical analysis (1)

• Types of statistical analysis
  • Positional (string-of-words)
    • Collocation (contiguous/non-contiguous)
    • Word embeddings
  • Non-positional (bag-of-words)
    • Frequency (absolute/relative)
    • Document/feature similarity
    • Naive Bayes
    • Wordscores
    • Wordfish
    • Correspondence analysis
    • Topic models


Statistical analysis (2)

• Types of machine learning
  • Supervised models
    • Naive Bayes
    • Wordscores
  • Unsupervised models
    • Wordfish
    • Correspondence analysis
    • Topic models
    • Word embeddings
  • Semi-supervised models
    • Newsmap
    • LSS
    • Seeded LDA


Examples (1)

• Dictionary
  • Measure historical geo-political risks (Caldara & Iacoviello 2018)
  • Dynamics between ruling and opposition parties in European legislatures (Proksch et al., forthcoming)

• Document similarity
  • Adoption of bills from other US states (Jansa et al. 2018)

• Collocation
  • Representation of Muslims in newspapers (Baker et al. 2012)

• Naive Bayes classifier
  • Identify jihadist clerics (Nielsen 2017)


Examples (2)

• Wordscores
  • Policy positions of Irish and British parties (Benoit & Laver 2003)
  • Foreign policies of UN member countries (Baturo et al. 2017)

• Wordfish
  • Policy positions of German parties (Slapin & Proksch 2008)

• Correspondence analysis
  • Themes in US presidential election (Schonhardt-Bailey 2005)

• LDA
  • Climate change denial by think tanks (Boussalis & Coan 2016)
  • Predict occurrence of political violence (Mueller & Rauh 2018)


Types of validity and validation

• Content validity
  • Measurement captures all the relevant aspects
    • Cannot check empirically

• Criterion validity
  • Results of measurement are correlated with a known measurement
    • Compare the results with existing data or a manually created dataset (“gold standard” or “ground truth”)

• Face validity
  • Results of the measurement are consistent with our knowledge
    • Show the results to your supervisors or colleagues (“eyeball test”)


Interpretation

• Focus on discourse and pragmatics
  • Discourse
    • What are the tones of the documents?
    • What are the topics of the documents?
    • What are the frames of the documents?
  • Pragmatics
    • Who are the readers of the documents?
    • Who are the authors of the documents?
    • Why did the authors write the documents?
    • Why were the documents published at that particular time?


Further materials

• Quanteda
  • Reference and example
    https://quanteda.io
  • Stack Overflow
    https://stackoverflow.com/tags/quanteda
  • Bug report
    https://github.com/quanteda/quanteda/issues

• Text analysis in Asian languages
  • Obstruction to Asian-language text analysis
    https://koheiw.net/?p=766
  • Quantitative Text Analysis in Japanese






International conference at Waseda

• POLTEXT 2019
  • See https://www.poltextconference.org
  • An international conference on quantitative text analysis
    • The next will be in Innsbruck, Austria, in 2020
  • Will be held in this building
    • 13 September
      • Pre-conference event (tutorial sessions)
    • 14-15 September
      • Main conference (presentations)



References

Deerwester, S. C., Dumais, S. T., Landauer, T. K., Furnas, G. W., & Harshman, R. A. (1990). Indexing by latent semantic analysis. JASIS, 41(6), 391–407.

Turney, P. D., & Littman, M. L. (2003). Measuring Praise and Criticism: Inference of Semantic Orientation from Association. ACM Transactions on Information Systems, 21(4), 315–346. https://doi.org/10.1145/944012.944013

Turney, P. D., & Pantel, P. (2010). From Frequency to Meaning: Vector Space Models of Semantics. Journal of Artificial Intelligence Research, 37(1), 141–188.

Watanabe, K. (2017). Measuring news bias: Russia’s official news agency ITAR-TASS’ coverage of the Ukraine crisis. European Journal of Communication, 32(3), 224–241. https://doi.org/10.1177/0267323117695735

Watanabe, K. (2018). Newsmap: A semi-supervised approach to geographical news classification. Digital Journalism, 6(3), 294–309. https://doi.org/10.1080/21670811.2017.1293487

