Political Text Analysis

Lecture 7

Kohei Watanabe


Advanced techniques


Part-of-speech tagging (1)

• POS taggers are tools to extract more information about words based on syntactical parsing
  • Part-of-speech
    • Category of words (e.g. noun, verb, adjective, adverb, pronoun, preposition, conjunction, etc.)
  • Lemma
    • Base form of words (e.g. ‘eat’ is the lemma of ‘ate’, ‘eats’ and ‘eating’)
  • Named-entity
    • Words referring to entities (e.g. people, places, organizations etc.)

• POS taggers use language models trained on a manually annotated corpus


    Doc ID      Sentence ID  Token ID  Token          Lemma          POS    Entity
 1  2013-Obama  1             1        Vice           vice           PROPN
 2  2013-Obama  1             2        President      president      PROPN
 3  2013-Obama  1             3        Biden          biden          PROPN  PERSON_B
 4  2013-Obama  1             4        ,              ,              PUNCT
 5  2013-Obama  1             5        Mr.            mr.            PROPN
 6  2013-Obama  1             6        Chief          chief          PROPN
 7  2013-Obama  1             7        Justice        justice        PROPN
 8  2013-Obama  1             8        ,              ,              PUNCT
 9  2013-Obama  1             9        Members        member         NOUN
10  2013-Obama  1            10        of             of             ADP
11  2013-Obama  1            11        the            the            DET    ORG_B
12  2013-Obama  1            12        United         united         PROPN  ORG_I
13  2013-Obama  1            13        States         states         PROPN  ORG_I
14  2013-Obama  1            14        Congress       congress       PROPN  ORG_I
15  2013-Obama  1            15        ,              ,              PUNCT
16  2013-Obama  1            16        distinguished  distinguished  ADJ
17  2013-Obama  1            17        guests         guest          NOUN
18  2013-Obama  1            18        ,              ,              PUNCT
19  2013-Obama  1            19        and            and            CCONJ
20  2013-Obama  1            20        fellow         fellow         ADJ


Part-of-speech tagging (2)

• POS information can be included in tokens

## [1] "Vice/PROPN" "President/PROPN" "Biden/PROPN"
## [4] ",/PUNCT" "Mr./PROPN" "Chief/PROPN"
## [7] "Justice/PROPN" ",/PUNCT" "member/NOUN"
## [10] "of/ADP" "the/DET" "United/PROPN"
## [13] "States/PROPN" "Congress/PROPN" ",/PUNCT"
## [16] "distinguished/ADJ" "guest/NOUN" ",/PUNCT"
## [19] "and/CCONJ" "fellow/ADJ" "citizen/NOUN"
## [22] ":/PUNCT" "\n\n/SPACE" "each/DET"
## [25] "time/NOUN" "-PRON-/PRON" "gather/VERB"
## [28] "to/PART" "inaugurate/VERB" "a/DET"
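The idea of including POS information in tokens can be sketched in a few lines of Python. This is only an illustration: the `parsed` rows are copied by hand from the tagger table, and the helper `tag_tokens` is a hypothetical name, not part of spaCy or quanteda.

```python
# Rows of (token, lemma, POS), copied from a few lines of the tagger output.
parsed = [
    ("Vice", "vice", "PROPN"),
    ("President", "president", "PROPN"),
    ("Biden", "biden", "PROPN"),
    (",", ",", "PUNCT"),
    ("Members", "member", "NOUN"),
]

def tag_tokens(rows, use_lemma=False):
    """Join each token (or its lemma) with its POS tag, e.g. 'Biden/PROPN'."""
    return [f"{lemma if use_lemma else token}/{pos}" for token, lemma, pos in rows]

tagged = tag_tokens(parsed)
# Tagged tokens can then be filtered by POS, e.g. to keep only common nouns.
nouns = [t for t in tagged if t.endswith("/NOUN")]
```

Once the tag is part of the token, ordinary token selection is enough to keep or drop whole word classes.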


Part-of-speech tagging (3)

• European languages (POS tagger)
  • spaCy
    • spacyr (on CRAN)
  • Universal Dependencies
    • udpipe (on CRAN)

• Asian languages (morphological analysis tools)
  • MeCab
    • RMeCab, RcppMeCab (on CRAN)
  • Jieba
    • jiebaR (on CRAN)
  • HanNanum
    • KoNLP (on CRAN)


Word embeddings (1)

• Word embedding is a technique to represent meanings of words in a vector space
  • Each word has a numeric “word vector” that represents its meaning
  • Similarity between word vectors reflects their semantic similarity

• Some of the models have “linear substructure”
  • e.g. ‘man’ – ‘woman’ ∝ ‘king’ – ‘queen’

• Word embedding models
  • LSA, Word2vec, GloVe etc.
  • Use sentences (LSA) or word windows (Word2vec, GloVe)
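A minimal Python sketch of these two properties follows. The three-dimensional vectors are invented for illustration and do not come from any trained model; real embeddings typically have a few hundred dimensions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Toy word vectors (made up for this example).
vectors = {
    "man":   [1.0, 0.0, 0.2],
    "woman": [1.0, 1.0, 0.2],
    "king":  [1.0, 0.0, 0.9],
    "queen": [1.0, 1.0, 0.9],
}

def offset(a, b):
    """Difference vector between two words."""
    return [x - y for x, y in zip(vectors[a], vectors[b])]

# Semantic similarity: 'king' is closer to 'queen' than to 'woman'.
sim_king_queen = cosine(vectors["king"], vectors["queen"])
sim_king_woman = cosine(vectors["king"], vectors["woman"])

# Linear substructure: the two difference vectors point in the same direction.
sim_offsets = cosine(offset("man", "woman"), offset("king", "queen"))
```

In these toy vectors the two offsets are exactly parallel; in trained models the relationship holds only approximately.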


Word embeddings (2)

• Words that occur in similar contexts have similar meanings according to the distributional hypothesis

  • Syntagmatic associates
    • Words that neighbor each other (first-order collocations)
      • e.g. I will go to see a lawyer to seek legal advice
        • “lawyer” and “legal” are words in the same topic
      • e.g. I met a bad doctor yesterday
        • “bad” is a modifier of “doctor”

  • Paradigmatic parallels (synonyms)
    • Words that have similar neighbors (second-order collocations)
      • e.g. I will go to see a lawyer/barrister to seek legal advice
      • e.g. I met a bad/terrible doctor yesterday
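The difference between first-order and second-order collocations can be sketched with the lawyer/barrister example. The window size of 3 and the use of Jaccard overlap are assumptions made for this illustration only.

```python
from collections import defaultdict

def neighbours(sentences, window=3):
    """Collect, for each word, the set of words seen within `window` positions."""
    context = defaultdict(set)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, token in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            context[token].update(tokens[j] for j in range(lo, hi) if j != i)
    return context

def jaccard(a, b):
    """Share of neighbours that two words have in common."""
    return len(a & b) / len(a | b)

sentences = [
    "I will go to see a lawyer to seek legal advice",
    "I will go to see a barrister to seek legal advice",
]
context = neighbours(sentences)

# First order: 'legal' occurs near 'lawyer' (syntagmatic associates).
first_order = "legal" in context["lawyer"]
# Second order: 'lawyer' and 'barrister' never co-occur, but share neighbours
# (paradigmatic parallels).
co_occur = "barrister" in context["lawyer"]
overlap = jaccard(context["lawyer"], context["barrister"])
```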


Latent Semantic Analysis

• Latent Semantic Analysis/Indexing helps us to identify synonyms
  • Proposed for information retrieval systems by Deerwester et al. (1990)
  • Applied to automatically extract synonyms from a corpus by Turney and Littman (2003)

• Essentially, LSA is factor analysis of document-feature matrices (DFMs)
  • Simplest form of word embeddings where documents are sentences
  • Use SVD for reducing document dimensions to 200-300 factors


Singular Value Decomposition (SVD)

$$X \approx \hat{X} = D S T'$$

• $X$ and $\hat{X}$: $m \times n$ (document × term)
• $D$: $m \times k$ (document)
• $S$: $k \times k$ (singular values)
• $T'$: $k \times n$ (term)


Smoothing of DFM

• Using the right-singular vectors ($S T'$), we reduce documents to $k$ factors
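The smoothing effect of X ≈ X̂ = DST′ can be sketched on a toy DFM. To stay within the Python standard library, this example keeps only k = 1 factor and finds it by power iteration rather than a full SVD routine; both choices are simplifications for illustration, and the function names are hypothetical.

```python
import math

def matvec(A, v):
    """Multiply matrix A (list of rows) by vector v."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def rank1_svd(X, iterations=200):
    """Largest singular triple (d, s, t) of X via power iteration on X'X."""
    Xt = [list(col) for col in zip(*X)]
    t = [1.0] * len(X[0])                      # starting guess for the term vector
    for _ in range(iterations):
        t = matvec(Xt, matvec(X, t))           # multiply by X'X
        length = norm(t)
        t = [x / length for x in t]
    Xv = matvec(X, t)                          # X t = s d
    s = norm(Xv)
    d = [x / s for x in Xv]
    return d, s, t

def smooth(X):
    """Rank-1 approximation of X: outer product d * s * t'."""
    d, s, t = rank1_svd(X)
    return [[di * s * tj for tj in t] for di in d]

# Toy DFM: rows are sentences, columns are lawyer, barrister, legal, doctor.
X = [
    [1, 0, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 1],
]
X_hat = smooth(X)
```

After smoothing, the first sentence has a weight of about 0.5 for “barrister” even though the word never appears in it, because “lawyer” and “barrister” share the neighbor “legal”.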


Semi-supervised models


Semi-supervised learning

• Semi-supervised models for text analysis use ‘seed words’ as weak supervision

• Semi-supervised learning is aimed at balancing cost and control
  • Cost: the amount of users’ manual input to perform analysis
  • Control: the degree of users’ control over results of analysis

• Different from semi-supervised models developed by computer scientists
  • Typically, they expand the training set by (unsupervised) clustering


Newsmap

• Dictionary-based semi-supervised document classifier
  • Construct training set using seed dictionary
    • e.g. geographical dictionary, topic dictionary
  • When
    • $P(f \mid C)$ is the probability of feature $f$ in documents with label $C$
    • $P(f \mid \bar{C})$ is the probability of feature $f$ in documents with labels not $C$
  • Association $s$ of word $f$ for category $C$ is a log-likelihood ratio

$$s = \log \frac{P(f \mid C)}{P(f \mid \bar{C})}$$

• Available as an R package
  • newsmap (on CRAN) with multi-lingual geographical dictionary
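The log-likelihood ratio can be sketched in Python as below. This is not the newsmap package's code: the add-one smoothing, the toy documents and the function name `association_scores` are assumptions made for illustration.

```python
import math
from collections import Counter

def association_scores(docs, labels, target, smooth=1.0):
    """Return log P(f|C) / P(f|not C) for every feature, with additive smoothing."""
    in_c, not_c = Counter(), Counter()
    for tokens, label in zip(docs, labels):
        (in_c if label == target else not_c).update(tokens)
    vocab = set(in_c) | set(not_c)
    total_in = sum(in_c.values()) + smooth * len(vocab)
    total_not = sum(not_c.values()) + smooth * len(vocab)
    return {
        f: math.log(((in_c[f] + smooth) / total_in) / ((not_c[f] + smooth) / total_not))
        for f in vocab
    }

# Toy training set; labels are assumed to come from seed dictionary matches.
docs = [
    ["Kiev", "protest", "Maidan"],
    ["Ukraine", "Maidan", "crisis"],
    ["Baghdad", "oil", "crisis"],
    ["Iraq", "oil", "export"],
]
labels = ["UA", "UA", "IQ", "IQ"]
scores = association_scores(docs, labels, "UA")
```

“Maidan” is not a seed word, yet it receives a positive score for Ukraine because it occurs only in documents labelled by the seeds, while “crisis” scores zero because it is equally frequent in both groups.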


Latent Semantic Scaling (LSS)

• Semi-supervised document scaling based on LSA
  1. Smooth a DFM by SVD with $k = 300$
    • Documents should be sentences
  2. Weight features by proximity to seed words
    • Compute cosine similarity of features to seed words
    • For example, sentiment seed words are
      • Positive {good, nice, excellent, positive, fortunate, correct, superior}
      • Negative {bad, nasty, poor, negative, unfortunate, wrong, inferior}
  3. Scale documents by the feature weights in the same way as Wordscores

• Available as an R package
  • LSS (https://github.com/koheiw/LSS)
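Steps 2 and 3 can be sketched in Python as follows. The two-dimensional word vectors are invented stand-ins for the SVD output of step 1, only one seed word per pole is used, and the function names are hypothetical; this is not the LSS package's implementation.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def feature_weights(vectors, seeds):
    """Step 2: mean cosine similarity to seed words, signed by seed polarity."""
    return {
        feature: sum(polarity * cosine(vec, vectors[seed])
                     for seed, polarity in seeds.items()) / len(seeds)
        for feature, vec in vectors.items()
    }

def scale_document(tokens, weights):
    """Step 3: frequency-weighted mean of feature weights, as in Wordscores."""
    scored = [weights[t] for t in tokens if t in weights]
    return sum(scored) / len(scored) if scored else 0.0

# Toy word vectors (assumed to be the result of smoothing the DFM).
vectors = {
    "good":   [1.0, 0.1],
    "bad":    [-1.0, 0.1],
    "stable": [0.8, 0.3],
    "crisis": [-0.7, 0.4],
}
seeds = {"good": +1, "bad": -1}

weights = feature_weights(vectors, seeds)
score_a = scale_document(["stable", "stable", "crisis"], weights)
score_b = scale_document(["crisis"], weights)
```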


Watanabe 2017

Measuring news bias: Russia’s official news agency ITAR-TASS’ coverage of the Ukraine crisis


Research question

• How much was TASS’s news coverage of the Ukraine crisis biased?

• News bias is one of the most important concepts in media studies
  • However, it is very difficult to measure news bias

• The Ukraine crisis was a significant geopolitical event in recent years
  • Yanukovych's sudden pro-Russian policy change triggered anti-government protests in Kiev in November 2013
  • Russia swiftly annexed Crimea in March 2014 through a ‘referendum’
  • Russia continues to support pro-Russian separatists in east Ukraine


Data

• English news articles about Ukraine published in 2013-2014 by news agencies

  • TASS: Russia’s state-owned news agency (n=87,725)
  • Interfax: Russian commercial news agency (n=103,236)
  • Reuters: Western international news agency (n=21,718)


Analysis

• LSS
  • Train two models for democracy and sovereignty with sentiment seed words
    • Use collocations of “democra*” or “sovereign*” as terms of the models
    • Seed words were generic positive-negative words
  • Predict sentiment of all news articles by the LSS models
    • Use Interfax as an independent benchmark

• Newsmap
  • Train and apply geographical classifier to select news about Ukraine

• Regression
  • Fit regression models to estimate the impact of desirable or undesirable events for the Russian government


Key events in the Ukraine crisis


Framing of democracy


Framing of sovereignty


Estimated state-ownership bias

• State-ownership bias is estimated by interaction between TASS and time dummies


Sentiment in stories with quotes

• There is a non-linear association between quotes and sentiment in sovereignty

• It suggests that TASS news coverage is negative because of
  • Description of events
    • Low quotes proportion
  • Comments by Russian officials
    • High quotes proportion


Validation

• Compute correlation between human and machine coding
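As a sketch of this validation step, Pearson's correlation between human and machine scores can be computed in plain Python. The scores below are invented for illustration and are not the study's data.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

# Hypothetical scores for five documents.
human = [1, 2, 3, 4, 5]                 # manual coding on a 5-point scale
machine = [-0.8, -0.3, 0.1, 0.2, 0.9]   # model-predicted sentiment

r = pearson(human, machine)
```

The two measures need not share a scale, since the correlation only captures how closely they move together.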


Conclusions

• LSS is a useful tool for domain-specific sentiment analysis
  • Passed face validation (trend) and criterion validation (correlation)

• The Russian government utilized TASS for its international propaganda
  • TASS’s news coverage changed in directions predicted based on Russia’s interest
    • Positive shift after desirable events
    • Negative shift after undesirable events

• Typology of news bias in western media does not always apply


Watanabe 2018

Newsmap: A semi-supervised approach to geographical news classification


Problems

• Simple dictionary matching does not work well
  • Dictionary cannot perform single membership classification well
  • Place names are too ambiguous
    • Geo/non-geo ambiguity
      • “Nice” can be either Nice in France or an English adjective
    • Geo/geo ambiguity
      • “London” refers to either the United Kingdom’s capital or a city in Canada’s Ontario
  • Too many place indicators other than place names
    • People (e.g. David Cameron) or organizations (e.g. the Pentagon)


Existing approaches

• Disambiguation techniques
  • Knowledge-based approach
    • Prioritize places with larger population or physical areas
  • Map-based approach
    • Choose places in closer proximity when place names are ambiguous
  • Data-driven approach
    • Extract association between place names from the corpus


Solution

• Develop a geographical news classifier to identify most strongly associated countries
  • Extract both place and non-place features from the corpus
  • Construct small seed dictionary only with names of countries and capital cities, for example:
    • Ukraine: {Ukraine, Ukrainian*, Kiev}
    • Iraq: {Iraq, Iraqi*, Baghdad}
  • Only include capitalized words to reduce geo/non-geo ambiguity
  • Words with geo/geo ambiguity should have small weight in the model
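How a seed dictionary could label training documents can be sketched as below. The two dictionary entries come from the slide; the country codes used as keys, the glob matching via `fnmatchcase`, and the function name `seed_labels` are assumptions for illustration rather than the newsmap package's behavior.

```python
from fnmatch import fnmatchcase

# Seed dictionary: country code -> glob patterns for country and capital names.
seed_dictionary = {
    "UA": ["Ukraine", "Ukrainian*", "Kiev"],
    "IQ": ["Iraq", "Iraqi*", "Baghdad"],
}

def seed_labels(tokens, dictionary):
    """Count seed matches per country, looking only at capitalized tokens."""
    capitalized = [t for t in tokens if t[:1].isupper()]
    counts = {
        country: sum(any(fnmatchcase(t, p) for p in patterns) for t in capitalized)
        for country, patterns in dictionary.items()
    }
    return {country: n for country, n in counts.items() if n > 0}

labels_ua = seed_labels("Ukrainians protest in Kiev".split(), seed_dictionary)
labels_none = seed_labels("nice weather in the south".split(), seed_dictionary)
```

Documents that receive labels in this way form the training set from which weights for all other features, place names or not, are estimated.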


Data

• News summaries collected in 2014
  • Training set
    • Yahoo News US edition (n=156,980)
  • Test set
    • From newspapers from different parts of the world
      • The Times (United Kingdom)
      • The New York Times
      • The Australian
      • The Nation (Kenya)
      • The Times of India
    • 5,000 summaries manually classified by at least three coders
    • Coders recruited through the crowdsourcing platform Prolific Academic (Oxford-based online recruiting platform)


Experiment

• Compare Newsmap’s classification accuracy with three tools
  • Open Calais and Geoparser.io are commercial tools
  • Gazetteer is a larger dictionary of place names
    • 27,678 place names in 255 countries collected by US government agencies


Evaluation: precision


• Precision is low in GB, ZA, SS and JP due to ambiguity in place names in dictionary


Evaluation: recall


• Recall is high in Newsmap because it exploits more non-place features
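For reference, per-country precision and recall can be computed as in this Python sketch. The gold and predicted labels are invented, and the single-label setup is a simplification for illustration.

```python
def precision_recall(gold, predicted, country):
    """Precision and recall of one country label against manual coding."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == country and p == country for g, p in pairs)  # correctly assigned
    fp = sum(g != country and p == country for g, p in pairs)  # wrongly assigned
    fn = sum(g == country and p != country for g, p in pairs)  # missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical labels for five news summaries.
gold      = ["GB", "GB", "JP", "US", "US"]
predicted = ["GB", "US", "JP", "JP", "US"]

jp = precision_recall(gold, predicted, "JP")
gb = precision_recall(gold, predicted, "GB")
```

Here a US story wrongly assigned to Japan lowers precision for JP, much as an ambiguous dictionary entry would, while the missed British story lowers recall for GB.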


Ambiguous place names

Place names     Country       Misclassified cases                            Ambiguity type
“South” region  Cameroon      South Africa, South Sudan, South Dakota etc.   Geo/geo
“Unity” state   South Sudan   Software products                              Geo/non-geo
“Nigel” town    South Africa  Nigel Farage (British politician)              Geo/non-geo
“Obama” city    Japan         Barack Obama (American president)              Geo/non-geo
“Gay” city      Russia        LGBT rights                                    Geo/non-geo
“Morsi” city    India         Mohamed Morsi (Egyptian president)             Geo/non-geo


Issues

• Entity recognition is more difficult in non-English languages
  • Proper adjectives (e.g. American) are not capitalized in many European languages
  • Tokenization of multi-word expressions (e.g. New York) is necessary for high classification accuracy

• Newsmap fails to classify sports news
  • Athletes and sport events move quickly between countries


Recapitulation

Summary of the course


What is quantitative text analysis? (1)

• Methodology to understand social process through textual data
  • Pre-processing is about low-level linguistic phenomena
    • Morphological and syntactical
      • POS tagging, syntactical parsing, lemmatization (stemming)
    • Lexical
      • Dictionary lookup, collocation analysis
  • Main analysis is about high-level linguistic phenomena
    • Discourse
      • Frequency analysis, scaling or classification
    • Pragmatics
      • Ideology of authors, preference of actors, bias in documents, etc.


What is quantitative text analysis? (2)

• There are symbolic and statistical approaches
  • Symbolic approach
    • Easy to make analysis consistent with theory
    • Requires much manual input but little computation (e.g. dictionary)
  • Statistical approach
    • Difficult to make analysis consistent with theory
    • Requires little manual input but much computation (e.g. topic models)

• However, we must connect concepts (symbols) and numbers (statistics) to make sense

• We connect concepts and numbers
  • Symbolic approach: pre-definition
  • Statistical approach: post-hoc interpretation


Data collection (1)

• Political document corpora are distributed publicly
  • UN General Assembly Debate Corpus
    https://github.com/sjankin/UnitedNations
  • Manifesto Corpus
    https://manifesto-project.wzb.eu
  • Various political corpora
    https://github.com/quanteda/quanteda.corpora


Data collection (2)

• News article corpora are usually not available
  • Commercial databases
    • Nexis, Factiva and ProQuest
  • News websites
    • RSS feeds
    • Scraping
      • rvest and RSelenium (on CRAN)
  • Web API
    • New York Times API
      • See https://koheiw.net/?p=643


Data management

• Save documents in UTF-8 to avoid data corruption
  • HTML or XML are better than plain text or CSV

• Import data into R using existing tools
  • Text, Word, PDF files
    • readtext (on CRAN)
  • Excel spreadsheet
    • rio or readxl (on CRAN)
  • HTML from newspaper databases
    • newspapers (https://github.com/koheiw/newspapers)


Statistical analysis (1)

• Types of statistical analysis
  • Positional (string-of-words)
    • Collocation (contiguous/non-contiguous)
    • Word embeddings
  • Non-positional (bag-of-words)
    • Frequency (absolute/relative)
    • Document/feature similarity
    • Naive Bayes
    • Wordscores
    • Wordfish
    • Correspondence analysis
    • Topic models


Statistical analysis (2)

• Types of machine learning
  • Supervised models
    • Naive Bayes
    • Wordscores
  • Unsupervised models
    • Wordfish
    • Correspondence analysis
    • Topic models
    • Word embeddings
  • Semi-supervised models
    • Newsmap
    • LSS
    • Seeded LDA


Examples (1)

• Dictionary
  • Measure historical geo-political risks (Caldara & Iacoviello 2018)
  • Dynamics between ruling and opposition parties in European legislatures (Proksch et al., forthcoming)

• Document similarity
  • Adoption of bills from other US states (Jansa et al. 2018)

• Collocation
  • Representation of Muslims in newspapers (Baker et al. 2012)

• Naive Bayes classifier
  • Identify jihadist clerics (Nielsen 2017)


Examples (2)

• Wordscores
  • Policy positions of Irish and British parties (Benoit & Laver 2003)
  • Foreign policies of UN member countries (Baturo et al. 2017)

• Wordfish
  • Policy positions of German parties (Slapin & Proksch 2008)

• Correspondence analysis
  • Themes in US presidential election (Schonhardt-Bailey 2005)

• LDA
  • Climate change denial by think tanks (Boussalis & Coan 2016)
  • Predict occurrence of political violence (Mueller & Rauh 2018)


Types of validity and validation

• Content validity
  • Measurement captures all the relevant aspects
    • Cannot check empirically

• Criterion validity
  • Results of the measurement are correlated with a known measurement
    • Compare the results with existing data or a manually created dataset (“gold standard” or “ground truth”)

• Face validity
  • Results of the measurement are consistent with our knowledge
    • Show the results to your supervisors or colleagues (“eyeball test”)


Interpretation

• Focus on discourse and pragmatics
  • Discourse
    • What are the tones of the documents?
    • What are the topics of the documents?
    • What are the frames of the documents?
  • Pragmatics
    • Who are the readers of the documents?
    • Who are the authors of the documents?
    • Why did the authors write the documents?
    • Why were the documents published at that particular time?


Further materials

• Quanteda
  • Reference and example
    https://quanteda.io
  • Stackoverflow
    https://stackoverflow.com/tags/quanteda
  • Bug report
    https://github.com/quanteda/quanteda/issues

• Text analysis in Asian languages
  • Obstruction to Asian-language text analysis
    https://koheiw.net/?p=766
  • Quantitative Text Analysis in Japanese
    https://koheiw.net/?p=831


International conference at Waseda

• POLTEXT 2019
  • See https://www.poltextconference.org
  • An international conference on quantitative text analysis
    • The next one will be in Innsbruck, Austria, in 2020
  • Will be held in this building
    • 13 September
      • Pre-conference event (tutorial sessions)
    • 14-15 September
      • Main conference (presentation)


References

Deerwester, S. C., Dumais, S. T., Landauer, T. K., Furnas, G. W., & Harshman, R. A. (1990). Indexing by latent semantic analysis. JASIS, 41(6), 391–407.

Turney, P. D., & Littman, M. L. (2003). Measuring Praise and Criticism: Inference of Semantic Orientation from Association. ACM Transactions on Information Systems, 21(4), 315–346. https://doi.org/10.1145/944012.944013

Turney, P. D., & Pantel, P. (2010). From Frequency to Meaning: Vector Space Models of Semantics. Journal of Artificial Intelligence Research, 37(1), 141–188.

Watanabe, K. (2017). Measuring news bias: Russia’s official news agency ITAR-TASS’ coverage of the Ukraine crisis. European Journal of Communication, 32(3), 224–241. https://doi.org/10.1177/0267323117695735

Watanabe, K. (2018). Newsmap: A semi-supervised approach to geographical news classification. Digital Journalism, 6(3), 294–309. https://doi.org/10.1080/21670811.2017.1293487
