

Day 10: Text classification and scaling

ME314: Introduction to Data Science and Big Data Analytics

LSE Methods Summer Programme 2019

13 August 2019


Day 10 Outline

Overview of Supervised vs. Unsupervised Learning
- Dictionary Methods
- Scaling

Text Classification
- Naive Bayes
- Regularized Regression
- Support Vector Machines

Scaling
- Wordscores
- Wordfish
- Correspondence Analysis


Overview of Machine Learning Methods for Text Analysis


Supervised machine learning

Goal: classify documents into pre-existing categories,
e.g. authors of documents, sentiment of tweets, ideological position of parties based on manifestos, tone of movie reviews...

What we need:
- A hand-coded (labeled) dataset, to be split into:
  - Training set: used to train the classifier
  - Validation/test set: used to validate the classifier
- A method to extrapolate from the hand coding to unlabeled documents (the classifier): Naive Bayes, regularized regression, SVM, k-nearest neighbors, BART, ensemble methods...
- An approach to validate the classifier: cross-validation
- Performance metrics to choose the best classifier and avoid overfitting: confusion matrix, accuracy, precision, recall...


Creating a labeled set

How do we obtain a labeled set?

- External sources of annotation
  - Disputed authorship of the Federalist Papers, estimated based on the known authors of other documents
  - Party labels for election manifestos
  - Legislative proposals by think tanks (text reuse)
- Expert annotation
  - "Canonical" dataset in the Comparative Manifesto Project
  - In most projects, undergraduate students (expertise comes from training)
- Crowd-sourced coding
  - Wisdom of crowds: aggregated judgments of non-experts converge to the judgments of experts, at much lower cost (Benoit et al, 2016)
  - Easy to implement with CrowdFlower or MTurk


Crowd-sourced text analysis (Benoit et al, 2016 APSR)



Evaluating the quality of a labeled set

Any labeled set should be tested and reported for its inter-rater reliability, at three different standards:

Type             Test Design    Causes of Disagreements              Strength
Stability        test-retest    intraobserver inconsistencies        weakest
Reproducibility  test-test      intraobserver inconsistencies +
                                interobserver disagreements          medium
Accuracy         test-standard  intraobserver inconsistencies +
                                interobserver disagreements +
                                deviations from a standard           strongest


Measures of agreement

- Percent agreement. Very simple:
  (number of agreeing ratings) / (total ratings) × 100%
- Correlation
  - (usually) Pearson's r, aka the product-moment correlation
  - Formula: r_AB = 1/(n−1) Σ_{i=1}^{n} ((Ai − Ā)/sA)((Bi − B̄)/sB)
  - May also be ordinal, such as Spearman's rho or Kendall's tau-b
  - Range is [−1, 1]
- Agreement measures
  - Take into account not only observed agreement, but also agreement that would have occurred by chance
  - Cohen's κ is most common
  - Krippendorff's α is a generalization of Cohen's κ
  - Both equal 1 under perfect agreement and 0 under chance-level agreement


Reliability data matrices

Example here uses binary data (from Krippendorff):

Article:  1  2  3  4  5  6  7  8  9  10
Coder A:  1  1  0  0  0  0  0  0  0  0
Coder B:  0  1  1  0  0  1  0  1  0  0

- A and B agree on 60% of the articles: 60% agreement
- Correlation is (approximately) 0.10
- Observed disagreement: D_o = 4
- Expected disagreement (by chance): D_e = 4.4211
- Krippendorff's α = 1 − D_o/D_e = 1 − 4/4.4211 = 0.095
- Cohen's κ is (nearly) identical (see the sketch below)
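These quantities are easy to reproduce; a minimal base-R sketch using the coder vectors from the table above (only the two codings are assumed inputs):

```r
a <- c(1, 1, 0, 0, 0, 0, 0, 0, 0, 0)   # Coder A
b <- c(0, 1, 1, 0, 0, 1, 0, 1, 0, 0)   # Coder B

p_o <- mean(a == b)                    # observed agreement: 0.60

# chance agreement from each coder's marginal rates of coding 1 and 0
p_e <- mean(a) * mean(b) + (1 - mean(a)) * (1 - mean(b))   # 0.56

kappa <- (p_o - p_e) / (1 - p_e)       # Cohen's kappa: ~0.09, close to alpha = 0.095
```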


Basic principles of supervised learning

- Generalization: a classifier or regression algorithm learns to correctly predict output from given inputs not only in previously seen samples but also in previously unseen samples
- Overfitting: a classifier or regression algorithm learns to correctly predict output from given inputs in previously seen samples, but fails to do so in previously unseen samples; this causes poor prediction/generalization
- The goal is to maximize the frontier of precise identification of the true condition with accurate recall


Measuring performance

- The classifier is trained to maximize in-sample performance
- But generally we want to apply the method to new data
- Danger: overfitting
  - The model is too complex and describes noise rather than signal (the bias-variance trade-off)
  - It focuses on features that perform well in the labeled data but may not generalize (e.g. "inflation" in the 1980s)
  - In-sample performance is better than out-of-sample performance
- Solutions? (sketched below)
  - Randomly split the dataset into training and test sets
  - Cross-validation
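A minimal sketch of the random-split solution in base R, assuming a document-feature matrix `x` and a binary label vector `y` (hypothetical names) and a toy logistic classifier:

```r
set.seed(42)
n     <- nrow(x)
train <- sample(n, floor(0.8 * n))        # hold out 20% as a test set

df   <- data.frame(y = y, x)
fit  <- glm(y ~ ., data = df[train, ], family = binomial)
pred <- predict(fit, newdata = df[-train, ], type = "response")
mean((pred > 0.5) == df$y[-train])        # out-of-sample accuracy

folds <- sample(rep(1:5, length.out = n)) # or: fold ids for 5-fold cross-validation
```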


Supervised v. unsupervised methods compared

- The goal (in text analysis) is to differentiate documents from one another, treating them as "bags of words"
- Different approaches:
  - Supervised methods require a training set that exemplifies the contrasting classes, identified by the researcher
  - Unsupervised methods scale documents based on patterns of similarity from the term-document matrix, without requiring a training step
- Relative advantage of supervised methods: you already know the dimension being scaled, because you set it in the training stage
- Relative disadvantage of supervised methods: you must already know the dimension being scaled, because you have to feed it good sample documents in the training stage


Supervised v. unsupervised methods: Examples

- General examples:
  - Supervised: Naive Bayes, regularized regression, Support Vector Machines (SVM)
  - Unsupervised: topic models, IRT models, correspondence analysis, factor-analytic approaches
- Political science applications:
  - Supervised: Wordscores (LBG 2003); SVMs (Yu, Kaufman and Diermeier 2008); Naive Bayes (Evans et al 2007)
  - Unsupervised: structural topic model (Roberts et al 2014); "Wordfish" (Slapin and Proksch 2008); two-dimensional IRT (Monroe and Maeda 2004)


Supervised learning v. dictionary methods

- Dictionary methods:
  - Advantage: not corpus-specific, so the cost of applying them to a new corpus is trivial
  - Disadvantage: not corpus-specific, so performance on a new corpus is unknown (domain shift)
- Supervised learning can be conceptualized as a generalization of dictionary methods, where the features associated with each category (and their relative weights) are learned from the data
- By construction, supervised methods will outperform dictionary methods in classification tasks, as long as the training sample is large enough


Dictionaries vs supervised learning

Source: Gonzalez-Bailon and Paltoglou (2015)


Dictionaries vs supervised learning

Application: sentiment analysis of NYTimes articles

[Bar chart of accuracy (proportion of articles correctly predicted) for five methods: 9-Word Method (Hopkins, 2010); Dictionary (LexiCoder); Dictionary (SentiStrength); ML with directional terms; ML with crowd-sourced coding. Results shown against two ground truths, undergraduate coders and the crowd, with accuracies ranging from about 36% to 74%.]

Source: Barbera et al (2017)


Dictionaries vs supervised learning

Application: sentiment analysis of NYTimes articles

[Dot plot of the correlation between estimated tone and four economic time series (unemployment rate, negative; change in unemployment rate, negative; Conference Board leading indicator; Michigan index of consumer confidence) for the same five methods, with correlations ranging from about -0.1 to 0.6.]

Source: Barbera et al (2017)


Classification v. scaling methods compared

- Machine learning focuses on identifying classes (classification), while social science is typically interested in locating things on latent traits (scaling)
- But the two methods overlap and can be adapted; we will demonstrate this later using the Naive Bayes classifier
- Applying lessons from machine learning to supervised scaling, we can:
  - Apply classification methods to scaling
  - Improve scaling using lessons from machine learning


Text Classification


Types of classifiers

General thoughts:
- Trade-off between accuracy and interpretability
- Parameters need to be cross-validated

Frequently used classifiers:
- Naive Bayes
- Regularized regression
- SVM
- Others: k-nearest neighbors, tree-based methods, etc.
- Ensemble methods


Multinomial Bayes model of Class given a Word

Consider J word types distributed across N documents, each assigned one of K classes.

At the word level, Bayes' theorem tells us that:

P(ck | wj) = P(wj | ck) P(ck) / P(wj)

For two classes, this can be expressed as:

P(ck | wj) = P(wj | ck) P(ck) / [ P(wj | ck) P(ck) + P(wj | c¬k) P(c¬k) ]   (1)


Multinomial Bayes model of Class given a Word: Class-conditional word likelihoods

P(ck | wj) = P(wj | ck) P(ck) / [ P(wj | ck) P(ck) + P(wj | c¬k) P(c¬k) ]

- P(wj | ck) is the word likelihood within class
- The maximum likelihood estimate is simply the proportion of times that word j occurs in class k, but it is more common to use Laplace smoothing, adding 1 to each observed count within class (a sketch follows below)


Multinomial Bayes model of Class given a Word: Word probabilities

P(ck | wj) = P(wj | ck) P(ck) / P(wj)

- P(wj) represents the word probability from the training corpus
- Usually uninteresting, since it is constant for the training data, but needed to compute posteriors on a probability scale


Multinomial Bayes model of Class given a Word: Class prior probabilities

P(ck | wj) = P(wj | ck) P(ck) / [ P(wj | ck) P(ck) + P(wj | c¬k) P(c¬k) ]

- P(ck) represents the class prior probability
- Machine learning typically takes this as the document frequency of the class in the training set


Multinomial Bayes model of Class given a Word: Class posterior probabilities

P(ck | wj) = P(wj | ck) P(ck) / [ P(wj | ck) P(ck) + P(wj | c¬k) P(c¬k) ]

- P(ck | wj) represents the posterior probability of membership in class k for word j
- Key for the classifier: in new documents, we only observe word distributions and want to predict class


Moving to the document level

- The "Naive" Bayes model of a joint document-level class posterior assumes conditional independence, multiplying the word likelihoods from a "test" document to produce:

  P(c | d) = P(c) ∏_j P(wj | c) / P(wj)

  P(c | d) ∝ P(c) ∏_j P(wj | c)

- This is why we call it "naive": it (wrongly) assumes:
  - conditional independence of word counts
  - positional independence of word counts
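Continuing the sketch from above, the document-level posterior multiplies many small likelihoods, so it is computed in logs; `lik` is the assumed K x J likelihood matrix, `prior` a length-K class prior, and `doc` one test document's word counts:

```r
logpost <- log(prior) + as.vector(log(lik) %*% doc)  # log P(c) + sum_j n_j log P(w_j | c)
post    <- exp(logpost - max(logpost))               # exponentiate without underflow
post    <- post / sum(post)                          # P(c | d) for each class
which.max(post)                                      # predicted class
```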


Naive Bayes Classification Example

(From Manning, Raghavan and Schutze, Introduction to Information Retrieval)

Table 13.1  Data for parameter estimation examples.

              docID  words in document                     in c = China?
training set  1      Chinese Beijing Chinese               yes
              2      Chinese Chinese Shanghai              yes
              3      Chinese Macao                         yes
              4      Tokyo Japan Chinese                   no
test set      5      Chinese Chinese Chinese Tokyo Japan   ?

Table 13.2  Training and test times for NB.

mode      time complexity
training  Θ(|D| L_ave + |C||V|)
testing   Θ(L_a + |C| M_a) = Θ(|C| M_a)

We have now introduced all the elements we need for training and applying an NB classifier. The complete algorithm is described in Figure 13.2.

Example 13.1: For the example in Table 13.1, the multinomial parameters we need to classify the test document are the priors P(c) = 3/4 and P(¬c) = 1/4 and the following conditional probabilities:

P(Chinese | c)  = (5 + 1)/(8 + 6) = 6/14 = 3/7
P(Tokyo | c)    = P(Japan | c)  = (0 + 1)/(8 + 6) = 1/14
P(Chinese | ¬c) = (1 + 1)/(3 + 6) = 2/9
P(Tokyo | ¬c)   = P(Japan | ¬c) = (1 + 1)/(3 + 6) = 2/9

The denominators are (8 + 6) and (3 + 6) because the lengths of text_c and text_¬c are 8 and 3, respectively, and because the constant B in Equation (13.7) is 6, as the vocabulary consists of six terms.

We then get:

P(c | d5)  ∝ 3/4 · (3/7)^3 · 1/14 · 1/14 ≈ 0.0003
P(¬c | d5) ∝ 1/4 · (2/9)^3 · 2/9 · 2/9 ≈ 0.0001

Thus, the classifier assigns the test document to c = China. The reason for this classification decision is that the three occurrences of the positive indicator Chinese in d5 outweigh the occurrences of the two negative indicators Japan and Tokyo.

What is the time complexity of NB? The complexity of computing the parameters is Θ(|C||V|), because the set of parameters consists of |C||V| conditional probabilities and |C| priors. The preprocessing necessary for computing the parameters (extracting the vocabulary, counting terms, etc.) can be done in one pass through the training data.
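The numbers in Example 13.1 can be checked directly in R:

```r
p_c <- 3/4; p_nc <- 1/4                # priors: 3 of the 4 training docs are in class China

# Laplace-smoothed likelihoods: (count + 1) / (tokens in class + |V|), with |V| = 6
chinese_c <- (5 + 1) / (8 + 6)         # 3/7
tokyo_c   <- (0 + 1) / (8 + 6)         # 1/14, and P(Japan | c) is the same
chinese_n <- (1 + 1) / (3 + 6)         # 2/9
tokyo_n   <- (1 + 1) / (3 + 6)         # 2/9, and P(Japan | not-c) is the same

# test document d5 = Chinese Chinese Chinese Tokyo Japan
p_c  * chinese_c^3 * tokyo_c * tokyo_c # ~0.0003
p_nc * chinese_n^3 * tokyo_n * tokyo_n # ~0.0001
```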



Regularized regression

Assume we have:
- i = 1, 2, ..., N documents
- Each document i is in class yi = 0 or yi = 1
- j = 1, 2, ..., J unique features
- And xij as the count of feature j in document i

We could build a linear regression model as a classifier, using the values of β0, β1, ..., βJ that minimize:

RSS = Σ_{i=1}^{N} ( yi − β0 − Σ_{j=1}^{J} βj xij )²

But can we?
- If J > N, OLS does not have a unique solution
- Even with N > J, OLS has low bias/high variance (overfitting)


Regularized regression

What can we do? Add a penalty for model complexity, such that we now minimize:

Σ_{i=1}^{N} ( yi − β0 − Σ_{j=1}^{J} βj xij )² + λ Σ_{j=1}^{J} βj²   → ridge regression

or

Σ_{i=1}^{N} ( yi − β0 − Σ_{j=1}^{J} βj xij )² + λ Σ_{j=1}^{J} |βj|   → lasso regression

where λ is the penalty parameter (to be estimated)


Regularized regression

Why the penalty (shrinkage)?
- Reduces the variance
- Identifies the model if J > N
- Some coefficients become zero (feature selection)

The penalty can take different forms:
- Ridge regression: λ Σ_{j=1}^{J} βj², with λ > 0; when λ = 0 this becomes OLS
- Lasso: λ Σ_{j=1}^{J} |βj|, where some coefficients become exactly zero
- Elastic net: λ1 Σ_{j=1}^{J} βj² + λ2 Σ_{j=1}^{J} |βj| (best of both worlds?)

How do we find the best value of λ? Cross-validation (a sketch follows below).

Evaluation: regularized regression is easy to interpret, but often outperformed by more complex methods.
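A sketch using the glmnet package, assuming a document-feature matrix `x` and binary labels `y`; `alpha` moves between the ridge and lasso penalties, and cv.glmnet selects λ by cross-validation:

```r
library(glmnet)
ridge <- cv.glmnet(x, y, alpha = 0,   family = "binomial")  # L2 penalty only
lasso <- cv.glmnet(x, y, alpha = 1,   family = "binomial")  # L1 penalty only
enet  <- cv.glmnet(x, y, alpha = 0.5, family = "binomial")  # elastic net mix

coef(lasso, s = "lambda.min")   # sparse coefficients: many are exactly zero
```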


SVM

Intuition: find the classification boundary that best separates observations of different classes.

Harder to visualize in more than two dimensions (hyperplanes).


Support Vector Machines

With no perfect separation, the goal is to minimize distances to the marginal points, conditioning on a tuning parameter C that indicates the tolerance to errors (and controls the bias-variance trade-off).


SVM

In the previous examples the boundaries were linear, but we can try different kernels (polynomial, radial); a sketch follows below.

And of course we can have multiple vectors within the same classifier.
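A sketch using the e1071 package (one common R interface to libsvm); `x_train`, `y_train`, and `x_test` are assumed names, `cost` is the tuning parameter C, and `kernel` selects the boundary shape:

```r
library(e1071)
fit  <- svm(x = x_train, y = as.factor(y_train),
            kernel = "radial",   # or "linear", "polynomial"
            cost   = 1)          # C: tolerance to misclassified points
pred <- predict(fit, x_test)     # predicted classes for new documents
```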


Ensemble methods

Intuition:
- Fit multiple classifiers, of different types
- Test how well they perform in the test set
- For new observations, produce a prediction by aggregating the predictions of the individual classifiers
- How to aggregate predictions?
  - Pick the best classifier
  - Average the predicted probabilities
  - Weighted average (weights proportional to classification error)
- Implemented in the SuperLearner package in R; a minimal manual sketch follows below
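A minimal manual version of the aggregation step, assuming `p_nb`, `p_ridge`, and `p_svm` are predicted class-1 probabilities on the test set from three already-fitted classifiers:

```r
p_avg  <- (p_nb + p_ridge + p_svm) / 3                # simple average
w      <- c(0.5, 0.3, 0.2)                            # e.g. weights from validation error
p_wtd  <- w[1] * p_nb + w[2] * p_ridge + w[3] * p_svm # weighted average
y_pred <- as.integer(p_wtd > 0.5)                     # aggregated prediction
```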


Scaling


Supervised scaling methods

Wordscores method (Laver, Benoit & Garry, 2003):
- Two sets of texts:
  - Reference texts: texts about which we know something (a scalar dimensional score)
  - Virgin texts: texts about which we know nothing (but whose dimensional score we'd like to know)
- These are analogous to a "training set" and a "test set" in classification
- Basic procedure:
  1. Analyze reference texts to obtain word scores
  2. Use word scores to score virgin texts


Wordscores Procedure

The Wordscore Procedure (Using the UK 1997-2001 Example)

Step 1: Obtain reference texts with a priori known positions (setref)
Step 2: Generate word scores from reference texts (wordscore)
Step 3: Score each virgin text using word scores (textscore)
Step 4: (optional) Transform virgin text scores to original metric

[Figure: flow from the reference texts, through the scored word list, to the scored virgin texts.
Reference texts: Labour 1992 = 5.35, Liberals 1992 = 8.21, Cons. 1992 = 17.21.
Scored word list (excerpt): drugs 15.66, corporation 15.66, inheritance 15.48, successfully 15.26, markets 15.12, motorway 14.96, nation 12.44, single 12.36, pensionable 11.59, management 11.56, monetary 10.84, secure 10.44, minorities 9.95, women 8.65, cooperation 8.64, transform 7.44, representation 7.42, poverty 6.87, waste 6.83, unemployment 6.76, contributions 6.68.
Scored virgin texts: Labour 1997 = 9.17 (.33), Liberals 1997 = 5.00 (.36), Cons. 1997 = 17.18 (.32).]


Wordscores mathematically: Reference texts

- Start with a set of I reference texts, represented by an I × J document-feature matrix Cij, where i indexes the document and j indexes the J total word types
- Each text will have an associated "score" ai, which is a single number locating this text on a single dimension of difference
  - This can be on a scale metric, such as 1-20
  - Or can use arbitrary endpoints, such as -1, 1
- We normalize the document-feature matrix within each document, converting Cij into a relative document-feature matrix by dividing Cij by its word total marginals:

  Fij = Cij / Ci·   (2)

  where Ci· = Σ_{j=1}^{J} Cij


Wordscores mathematically: Word scores

- Compute an I × J matrix of relative document probabilities Pij for each word in each reference text:

  Pij = Fij / Σ_{i=1}^{I} Fij   (3)

- This tells us the probability that, given the observation of a specific word j, we are reading a text of a certain reference document i


Wordscores mathematically: Word scores (example)

- Assume we have two reference texts, A and B
- The word "choice" is used 10 times per 1,000 words in Text A and 30 times per 1,000 words in Text B
- So F_{i,"choice"} = {.010, .030}
- If we know only that we are reading the word "choice" in one of the two reference texts, then the probability is 0.25 that we are reading Text A, and 0.75 that we are reading Text B:

  P_{A,"choice"} = .010 / (.010 + .030) = 0.25   (4)

  P_{B,"choice"} = .030 / (.010 + .030) = 0.75   (5)


Wordscores mathematically: Word scores

- Compute a J-length "score" vector S, where each word j's score is the average of each document i's score ai, weighted by the word's Pij:

  Sj = Σ_{i=1}^{I} ai Pij   (6)

- In matrix algebra: S (1 × J) = a (1 × I) · P (I × J)
- This procedure yields a single "score" for every word, reflecting the balance of the scores of the reference documents, weighted by the relative document frequency of the word's normalized term frequency


Wordscores mathematically: Word scores

- Continuing with our example:
  - We "know" (from independent sources) that Reference Text A has a position of −1.0 and Reference Text B has a position of +1.0
  - The score of the word "choice" is then 0.25(−1.0) + 0.75(+1.0) = −0.25 + 0.75 = +0.50


Wordscores mathematically: Scoring “virgin” texts

- Here the objective is to obtain a single score for any new text, relative to the reference texts
- We do this by taking the mean of the scores of its words, weighted by their term frequency
- So the score vk of a virgin document k consisting of the J word types is (sketched below):

  vk = Σ_j ( Fkj · sj )   (7)

  where Fkj = Ckj / Ck·, as in the reference-document relative word frequencies

- Note that new words outside the set J may appear in the K virgin documents; these are simply ignored (because we have no information on their scores)
- Note also that nothing prohibits reference documents from also being scored as virgin documents
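A sketch of the scoring step, assuming `s` is the word-score vector over the J reference word types (NA for words never seen in the reference texts) and `ck` a virgin document's counts over the same vocabulary:

```r
scored <- !is.na(s)                     # drop words with no score
f_k    <- ck[scored] / sum(ck[scored])  # F_kj: relative frequencies of scored words
v_k    <- sum(f_k * s[scored])          # v_k: frequency-weighted mean of word scores
```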


Wordscores mathematically: Rescaling raw text scores

- Because of overlapping or non-discriminating words, the raw text scores will be dragged to the interior of the reference scores (we will see this shortly in the results)
- Some procedures can be applied to rescale them, either to a unit normal metric or to a more "natural" metric
- Martin and Vanberg (2008) have proposed alternatives to the LBG (2003) rescaling


Computing confidence intervals

- The score vk of any text represents a weighted mean
- LBG (2003) used this logic to develop a standard error for this mean, using a weighted variance of the scores in the virgin text
- Given some assumptions about the scores being fixed (and the words being conditionally independent), this yields approximately normally distributed errors for each vk
- An alternative is to bootstrap the textual data prior to constructing Cij and Ckj; see Lowe and Benoit (2012)


Pros and Cons of the Wordscores approach

- Estimates unknown positions on a priori scales; hence no inductive scaling with a posteriori interpretation of an unknown policy space
- Very dependent on correct identification of:
  - appropriate reference texts
  - appropriate reference scores


Suggestions for choosing reference texts

- Texts need to contain information representing a clearly dimensional position
- The dimension must be known a priori. Sources might include:
  - Survey scores or manifesto scores
  - Arbitrarily defined scales (e.g. -1.0 and 1.0)
- Should be as discriminating as possible: extreme texts on the dimension of interest, to provide reference anchors
- Need to be from the same lexical universe as the virgin texts
- Should contain lots of words


Suggestions for choosing reference values

- Must be "known" through some trusted external source
- For any pair of reference values, all scores are simply linear rescalings, so we might as well use (-1, 1)
- The "middle point" will not be the midpoint, however, since this depends on the relative word frequencies of the reference documents
- Reference texts, if scored as virgin texts, will have document scores more extreme than other virgin texts
- With three or more reference values, the mid-point is mapped onto a multi-dimensional simplex. The values now matter, but only in relative terms (we are still investigating this fully)


Multinomial Bayes model of Class given a Word: Class posterior probabilities

P(ck | wj) = P(wj | ck) P(ck) / [ P(wj | ck) P(ck) + P(wj | c¬k) P(c¬k) ]

- This represents the posterior probability of membership in class k for word j
- Under certain conditions, this is identical to what LBG (2003) called Pwr
- Under those conditions, the LBG "wordscore" is the linear difference between P(ck | wj) and P(c¬k | wj)


“Certain conditions”

- The LBG approach required the identification not only of texts for each training class, but also "reference" scores attached to each training class
- Consider two "reference" scores s1 and s2 attached to two classes k = 1 and k = 2. Taking P1 as the posterior P(k = 1 | w = j) and P2 as P(k = 2 | w = j), a generalised score s*_j for the word j is then:

  s*_j = s1 P1 + s2 P2
       = s1 P1 + s2 (1 − P1)
       = s1 P1 + s2 − s2 P1
       = P1 (s1 − s2) + s2


“Certain conditions”: More than two reference classes

- For more than two reference classes, if the reference scores are ordered such that s1 < s2 < ... < sK, then:

  s*_j = s1 P1 + s2 P2 + ... + sK PK
       = s1 P1 + s2 P2 + ... + sK (1 − Σ_{k=1}^{K−1} Pk)
       = Σ_{k=1}^{K−1} Pk (sk − sK) + sK


A simpler formulation: use reference scores such that s1 = −1.0, sK = 1.0

- From the above equations, it should be clear that any set of reference scores can be linearly rescaled to the endpoints −1.0, 1.0
- This simplifies the "simple word score" to:

  s*_j = (1 − 2 P1) + Σ_{k=2}^{K−1} Pk (sk − 1)

- which simplifies, with just two reference classes, to (one-line sketch below):

  s*_j = 1 − 2 P1
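In the two-class case the whole mapping from word posterior to word score is one line of R:

```r
wordscore <- function(p1) 1 - 2 * p1   # s*_j given P1 = P(k = 1 | w = j), scores (-1, +1)
wordscore(0.25)                        # +0.5: the "choice" example on a (-1, 1) scale
```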


Implications

- LBG's "word scores" come from a linear combination of class posterior probabilities from a Bayesian model of class conditional on words
- We might as well always anchor reference scores at −1.0, 1.0
- There is a special role for reference classes in between −1.0 and 1.0, as they balance between "pure" classes; more in a moment
- There are alternative scaling models, such as that used in Beauchamp's (2012) "Bayesscore", which is simply the difference in logged class posteriors at the word level. For s1 = −1.0, s2 = 1.0:

  sB_j = −log P1 + log P2 = log[ (1 − P1) / P1 ]


Moving to the document level

- The "Naive" Bayes model of a joint document-level class posterior assumes conditional independence, multiplying the word likelihoods from a "test" document to produce:

  P(c | d) = P(c) ∏_j P(wj | c) / P(wj)

- So we could consider a document-level relative score, e.g. 1 − 2 P(c1 | d) (for a two-class problem)
- But this turns out to be useless, since the predictions of class are highly separated


Moving to the document level

- A better solution is to score a test document as the arithmetic mean of the scores of its words
- This is exactly the solution proposed by LBG (2003)
- Beauchamp (2012) proposes a "Bayesscore", the arithmetic mean of the log-difference word scores in a document, which yields extremely similar results

And now for some demonstrations with data...


Application 1: Dail speeches from LBG (2003)

[Figure, four panels:
(a) NB speech scores by party (Lab, FG, DL, Green, Ind, FF, PD, FFMin), smooth=0, imbalanced priors;
(b) document scores from NB v. classic Wordscores, R = 1;
(c) NB speech scores by party, smooth=1, uniform class priors;
(d) document scores from NB v. classic Wordscores, R = 0.995.]

- three reference classes (Opposition, Opposition, Government) at {-1, -1, 1}
- no smoothing


Application 1: Dail speeches from LBG (2003)

[Same four-panel figure as on the previous slide.]

- two reference classes (Opposition+Opposition, Government) at {-1, 1}
- Laplace smoothing


Application 2: Classifying legal briefs (Evans et al 2007)
Wordscores v. Bayesscore

[Scatter plot of document-level Wordscores versus Bayesscore estimates for the legal briefs.]

●●●

●●●●●

●●●●●

●●●●●

●●●●●

●●●●

●●●●

●●●●●●●●●

●●

●●●●

●●●●●●●●●●●●●

●●

●●

●●

●●●●●●

●●

●●●●●●●●●

●●●

●●

●●

●●●

●●●●

●●

●●●●●●

●●●

●●

●●●●●●●●

●●●●●●●

●●●●●●●●

●●●

●●●●●

●●●●●●●●●●●●●●●●

●●●●

●●●●

●●●●

●●●●

●●●●●●●●●●●●●

●●●

●●●●●●●●

●●

●●●

●●

●●●

●●●●●

●●●●●●

●●●●

●●●●●

●●●

●●●

●●

●●

●●

●●●●●●●●

●●●●●●

●●●

●●●

●●●●●●●●

●●●

●●

●●●●●●●●●●●●●●●●●●●●

●●●●●●●

●●

●●●●●

●●●●

●●●●●

●●

●●●●

●●

●●●●●●●●●

●●●●●●●

●●

●●

●●

●●●●●●●●●●●●

●●●●●●●●

●●●●

●●

●●●●●

●●●●●●

●●

●●●●●●

●●●●●●●●●●

●●

●●●●●

●●

●●

●●●●●●●●●

●●

●●●●●●●●

●●●

●●●●●●●●●●●●●●●

●●●●●

●●●

●●●●●●●●●●●●

●●

●●

●●●●●●

●●●●●●●●●●

●●

●●

●●

●●●

●●●

●●●●

●●●●●

●●●●●●●●●

●●●●

●●●●

●●●

●●●

●●●●●●●●●●●

●●●●●●

●●

●●●●

●●

●●●

●●

●●●

●●

●●

●●●●

●●●●

●●●●●●●●

●●

●●●

●●●●

●●●

●●●●●●●●●●●●

●●●

●●●

●●●●

●●●●●●●●●●

●●●

●●●●

●●

●●

●●●

●●

●●●

●●●●●●●●●●●●●●●

●●●

●●●●●●●●●●●●

●●●●●●

●●

●●

●●●●●●

●●

●●●●●●●●●●

●●

●●●●●

●●●●●●

●●

●●●●●●●●

●●●●●●●●

●●

●●

●●●

●●●●

●●●●●

●●●●

●●●●

●●●●●

●●●●●●●●●●●●●●

●●●●

●●●●●●●●●

●●

●●

●●

●●

●●●●●●●●

●●●

●●●●●●●●

●●

●●●●●●

●●

●●●

●●

●●●●

●●

●●

●●●●●●

●●●●●●●●●●●●●●●●●●

●●

●●●●●●●●

●●●●●●●●●●

●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●

●●●●●●●●

●●

●●

●●●●●

●●●

●●●

●●●

●●●●●●●●●

●●

●●●●

●●

●●●●

●●●●●●

●●●●

●●●●●●●●

●●

●●

●●●

●●●●●●●●●●●●

●●●●

●●●

●●

●●●●●●●●●●●●●

●●

●●●

●●●●●●●●●●●●

●●●●●●●●●●●●●●

●●●

●●●●●●●●●

●●●●●

●●●●●

●●

●●●●●

●●●●●●

●●●●

●●●●●●●●●●

●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●

●●●●●

●●●●●●

●●●

●●●●

●●

●●●●●●●●

●●

●●●

●●

●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●

−1.0 −0.5 0.0 0.5

−3−2

−10

12

3

(a) Word level

Wordscores − linear difference

Wor

dsco

res −

log

diffe

renc

e

●●

●●

● ●

●●

●●

●●

●●●

●●

−0.05 0.00 0.05 0.10

−0.2

−0.1

0.0

0.1

0.2

0.3

(b) Document level

Text scores − linear difference

Text

sco

res −

log

diffe

renc

e


I Training set: Petitioner and Respondent litigant briefs from Grutter/Gratz v. Bollinger (a U.S. Supreme Court case)

I Test set: 98 amicus curiae briefs (whose P or R class is known)

Page 64: Day 10: Text classification and scaling

Application 2: Classifying legal briefs (Evans et al 2007)
Posterior class prediction from NB versus log wordscores

[Figure: posterior P(class = Petitioner | document) from Naive Bayes (y axis, 0.0 to 1.0) against the mean log wordscore for each document (x axis, −0.3 to 0.2); points marked Predicted Petitioner or Predicted Respondent]

Page 65: Day 10: Text classification and scaling

Application 3: Scaling environmental interest groups (Kluver 2009)

I Dataset: text of online consultation on EU environmental regulations

I Reference texts: most extreme pro- and anti-regulation groups

Page 66: Day 10: Text classification and scaling

Application 4: LBG’s British manifestos
More than two reference classes

[Figure: word scores on the Taxes−Spending dimension (x axis, 6 to 16) against word scores on the Decentralization dimension (y axis, 6 to 16); word clouds centred on Con, Lab and LD, with overlapping regions of Lab−LD, LD−Con and Lab−Con shared words]

I x-axis: reference scores of {5.35, 8.21, 17.21} for Lab, LD, Conservatives

I y-axis: reference scores of {10.21, 5.26, 15.61}

Page 67: Day 10: Text classification and scaling

Unsupervised methods scale distance

I Text gets converted into a quantitative matrix of features

I words, typically

I could be dictionary entries, or parts of speech

I Documents are scaled based on similarity or distance in feature use

I Fundamental problem: distance on which scale?

I Ideally, something we care about, e.g. policy positions, ideology, preferences, sentiment

I But often other dimensions (language, rhetorical style, authorship) are more predictive

I First dimension in unsupervised scaling will capture the main source of variation, whatever that is

I Unlike supervised models, validation comes after estimating the model

Page 68: Day 10: Text classification and scaling

Unsupervised scaling methods

Two main approaches

I Parametric methods model feature occurrence according to some stochastic distribution, typically in the form of a measurement model

I for instance, model words as a multi-level Bernoulli distribution, or a Poisson distribution

I word effects and “positional” effects are unobserved parameters to be estimated

I e.g. Wordfish (Slapin and Proksch 2008) and Wordshoal (Lauderdale and Herzog 2016)

I Non-parametric methods typically based on the Singular Value Decomposition of a matrix

I correspondence analysis

I factor analysis

I other (multi)dimensional scaling methods

Page 69: Day 10: Text classification and scaling

Wordfish (Slapin and Proksch 2008)

I Goal: unsupervised scaling of ideological positions

I The frequency with which politician i uses word k is drawn from a Poisson distribution:

w_{ik} ∼ Poisson(λ_{ik})

λ_{ik} = exp(α_i + ψ_k + β_k × θ_i)

I with latent parameters:

α_i is the “loquaciousness” of politician i
ψ_k is the frequency of word k
β_k is the discrimination parameter of word k
θ_i is the politician’s ideological position

I Key intuition: controlling for document length and word frequency, words with negative β_k will tend to be used more often by politicians with negative θ_i (and vice versa)
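
To make that intuition concrete, a toy computation (the word and politician values below are invented for illustration):

```r
# Sign intuition: with alpha and psi set to 0, lambda = exp(beta * theta)
beta  <- -1.5                     # a hypothetical "left" word (negative beta)
theta <- c(left = -1, right = 1)  # two hypothetical politicians
exp(beta * theta)
# the left politician's expected rate is exp(3), about 20 times the right's
```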

Page 70: Day 10: Text classification and scaling

Wordfish (Slapin and Proksch 2008)

Why Poisson?

I Poisson-distributed variables are bounded on [0, ∞) and take on only the discrete values 0, 1, 2, . . .

I Exponential transformation: word counts are a function of log document length and word frequency

w_{ik} ∼ Poisson(λ_{ik})

λ_{ik} = exp(α_i + ψ_k + β_k × θ_i)

log(λ_{ik}) = α_i + ψ_k + β_k × θ_i

Page 71: Day 10: Text classification and scaling

How to estimate this model

Conditional maximum likelihood estimation:

I If we knew ψ and β (the word parameters) then we would have a Poisson regression model

I If we knew α and θ (the party / politician / document parameters) then we would have a Poisson regression model too!

I So we alternate them and hope to converge to reasonable estimates for both

I Implemented in the quanteda package as textmodel_wordfish

An alternative is MCMC with a Bayesian formulation, or variational inference using an Expectation-Maximization algorithm (Imai et al 2016)
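
For reference, a minimal usage sketch with quanteda (the corpus and the dir choice follow the package’s documentation example; treat the details as illustrative):

```r
library(quanteda)

# Fit Wordfish to the Irish 2010 budget debates bundled with quanteda.
# dir = c(6, 5) fixes the global left-right direction using two documents.
dfmat <- dfm(data_corpus_irishbudget2010)
wf <- textmodel_wordfish(dfmat, dir = c(6, 5))

summary(wf)  # estimated document positions (theta) with standard errors
```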

Page 72: Day 10: Text classification and scaling

Conditional maximum likelihood for wordfish

Start by guessing the parameters (some guesses are better than others, e.g. from an SVD)

Algorithm:

1. Assume the current legislator parameters are correct and fit as a Poisson regression model

2. Assume the current word parameters are correct and fit as a Poisson regression model

3. Normalize the θs to mean 0 and variance 1

Iterate until convergence (change in values falls below a certain threshold)
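
A toy implementation of this alternation, to fix ideas (a sketch of the idea, not the quanteda code; it leans on R’s glm for the two Poisson regressions and invents its own starting values):

```r
# Toy conditional-ML Wordfish: W is a documents-by-words matrix of counts.
wordfish_cml <- function(W, tol = 1e-6, max_iter = 100) {
  nd <- nrow(W); nw <- ncol(W)
  # starting values from an SVD of roughly centred log counts
  L <- log(W + 0.1)
  sv <- svd(L - mean(L))
  theta <- as.numeric(scale(sv$u[, 1]))  # document positions
  alpha <- rowMeans(L) - mean(L)         # document "loquaciousness"
  psi   <- colMeans(L)                   # word frequency effects
  beta  <- sv$v[, 1]                     # word discrimination
  ll_old <- -Inf
  for (iter in seq_len(max_iter)) {
    # step 1: word parameters, holding document parameters fixed
    for (k in seq_len(nw)) {
      fit <- glm(W[, k] ~ theta, family = poisson, offset = alpha)
      psi[k] <- coef(fit)[1]; beta[k] <- coef(fit)[2]
    }
    # step 2: document parameters, holding word parameters fixed
    for (i in seq_len(nd)) {
      fit <- glm(W[i, ] ~ beta, family = poisson, offset = psi)
      alpha[i] <- coef(fit)[1]; theta[i] <- coef(fit)[2]
    }
    # step 3: identify the scale
    theta <- (theta - mean(theta)) / sd(theta)
    # iterate until the log-likelihood stops changing
    ll <- sum(dpois(W, exp(outer(alpha, psi, "+") + outer(theta, beta)),
                    log = TRUE))
    if (abs(ll - ll_old) < tol) break
    ll_old <- ll
  }
  list(theta = theta, alpha = alpha, psi = psi, beta = beta)
}
```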

Page 73: Day 10: Text classification and scaling

Identification

The scale and direction of θ is undetermined, as in most models with latent variables. To identify the model in Wordfish:

I Fix one α to zero to specify the left-right direction (Wordfish option 1)

I Fix the θs to mean 0 and variance 1 to specify the scale (Wordfish option 2)

I Fix two θs to specify the direction and scale (Wordfish option 3, and Wordscores)

Note: fixing two reference scores does not specify the policy domain, it just identifies the model

Page 74: Day 10: Text classification and scaling

“Features” of the parametric scaling approach

I Standard (statistical) inference about parameters

I Uncertainty accounting for parameters

I Distributional assumptions are made explicit (as part of the data generating process motivating the choice of stochastic distribution)

I conditional independence

I stochastic process (e.g. E(Y_{ij}) = Var(Y_{ij}) = λ_{ij})

I Permits hierarchical reparameterization (to add covariates)

I Generative model: given the estimated parameters, we could generate a document for any specified length
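
As a quick illustration of the generative claim, one can draw a synthetic document from parameter values (everything below is invented for illustration):

```r
# Simulate one document's word counts from (invented) Wordfish parameters
set.seed(42)
nw    <- 1000
psi   <- rnorm(nw, -1, 1)    # word frequency effects
beta  <- rnorm(nw, 0, 0.5)   # word discrimination effects
theta <- 0.8                 # chosen position of the document
alpha <- 0.5                 # sets the expected document length
w <- rpois(nw, exp(alpha + psi + beta * theta))
sum(w)                       # realized document length
```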

Page 75: Day 10: Text classification and scaling

Some reasons why this model is wrong

I Violations of conditional independence:

I Words occur in sequence (serial correlation)

I Words occur in combinations (e.g. as collocations): “carbon tax” / “income tax” / “inheritance tax” / “capital gains tax” / “bank tax”

I Legislative speech uses rhetoric that contains frequent synonyms and repetition for emphasis (e.g. “Yes we can!”)

I Heteroskedastic errors (variance not constant and equal to the mean):

I overdispersion when “informative” words tend to cluster together

I underdispersion could (possibly) occur when words of high frequency are uninformative and have relatively low between-text variation (once length is considered)

Page 76: Day 10: Text classification and scaling

Overdispersion in German manifesto data
(data taken from Slapin and Proksch 2008)

[Figure: average absolute value of residuals (y axis, 0.5 to 2.5) against log word frequency (x axis, 2 to 6)]

Page 77: Day 10: Text classification and scaling

One solution to model overdispersion

Negative binomial model (Lo, Proksch, and Slapin 2014):

w_{ik} ∼ NB(r_i, λ_{ik} / (λ_{ik} + r_i))

λ_{ik} = exp(α_i + ψ_k + β_k × θ_i)

where r_i is a variance inflation parameter that varies across documents.

It can have a substantive interpretation (ideological ambiguity), e.g. when a party emphasizes an issue but fails to mention key words associated with it that a party with similar ideology mentions.
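
A quick numeric illustration of the variance inflation (values invented; this uses R’s size/mu parameterization of the negative binomial, under which Var = μ + μ²/r):

```r
# Poisson vs negative binomial counts with the same mean
set.seed(1)
lambda <- 5; r <- 2
var(rpois(1e5, lambda))                   # about 5: variance equals the mean
var(rnbinom(1e5, size = r, mu = lambda))  # about 17.5: lambda + lambda^2 / r
```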

Page 78: Day 10: Text classification and scaling

Example from Slapin and Proksch 2008

Page 79: Day 10: Text classification and scaling

Example from Slapin and Proksch 2008

Page 80: Day 10: Text classification and scaling

Example from Slapin and Proksch 2008

Page 81: Day 10: Text classification and scaling

Example from Slapin and Proksch 2008

Page 82: Day 10: Text classification and scaling

Wordshoal (Lauderdale and Herzog 2016)

Two key limitations of Wordfish applied to legislative text:

I Word discrimination parameters are assumed to be constant across debates (unrealistic; think e.g. “debt”)

I May capture topic variation rather than left-right ideology

Slapin and Proksch partially avoid these issues by scaling different types of debates separately.

But the resulting estimates are confined to the set of speakers who spoke on each topic.

Wordshoal solution: aggregate debate-specific ideal points into a reduced number of scales.

Page 83: Day 10: Text classification and scaling

Wordshoal (Lauderdale and Herzog 2016)

I The frequency with which politician i uses word k in debate j is drawn from a Poisson distribution:

w_{ijk} ∼ Poisson(λ_{ijk})

λ_{ijk} = exp(α_{ij} + ψ_{jk} + β_{jk} × θ_{ij})

θ_{ij} ∼ N(ν_j + κ_j μ_i, τ_i)

I with latent parameters:

α_{ij} is the “loquaciousness” of politician i in debate j
ψ_{jk} is the frequency of word k in debate j
β_{jk} is the discrimination parameter of word k in debate j
θ_{ij} is the politician’s ideological position in debate j
ν_j is the baseline ideological position of debate j
κ_j is the correlation of debate j with the common dimension
μ_i is the overall ideological position of politician i

I Intuition: debate-specific estimates are aggregated into a single position using dimensionality reduction
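
The aggregation step can be pictured with a small simulation (a sketch of the intuition only: the paper’s second stage is a Bayesian factor model, for which plain PCA stands in here; all values are invented):

```r
# Sketch of the Wordshoal second stage: recover a common dimension mu from
# debate-specific positions theta[i, j]. PCA stands in for the paper's
# Bayesian factor model.
set.seed(1)
n_pol <- 50; n_deb <- 20
mu    <- rnorm(n_pol)            # true overall positions
nu    <- rnorm(n_deb, 0, 0.5)    # debate baselines
kappa <- runif(n_deb, 0.2, 1)    # debate loadings on the common dimension
theta <- sapply(seq_len(n_deb),
                function(j) nu[j] + kappa[j] * mu + rnorm(n_pol, 0, 0.3))
mu_hat <- prcomp(theta)$x[, 1]   # first principal component per politician
cor(mu, mu_hat)                  # near +/-1 (the sign is not identified)
```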

Page 84: Day 10: Text classification and scaling

Wordshoal (Lauderdale and Herzog 2016)

New quantities of interest to estimate:

I Politicians’ overall position vs debate-specific positions

I Strength of association between debate scales and the general ideological scale

I Association of words with general scales, and stability of word discrimination parameters across debates

Page 85: Day 10: Text classification and scaling

Example from Lauderdale and Herzog 2016

Page 86: Day 10: Text classification and scaling

Example from Lauderdale and Herzog 2016

Page 87: Day 10: Text classification and scaling

Example from Lauderdale and Herzog 2016

Page 88: Day 10: Text classification and scaling

Example from Lauderdale and Herzog 2016

Page 89: Day 10: Text classification and scaling

Example from Lauderdale and Herzog 2016

Page 90: Day 10: Text classification and scaling

Non-parametric methods

I Non-parametric methods are algorithmic, involving no “parameters” in the procedure that are estimated

I Hence there is no uncertainty accounting given distributional theory

I Advantage: don’t have to make assumptions

I Disadvantages:

I cannot leverage probability conclusions given distributional assumptions and statistical theory

I results highly fit to the data

I not really assumption-free, if we are honest

Page 91: Day 10: Text classification and scaling

Correspondence Analysis

I CA is like factor analysis for categorical data

I Following normalization of the marginals, it uses Singular Value Decomposition to reduce the dimensionality of the document-feature matrix

I This allows projection of the positioning of the words as well as the texts into multi-dimensional space

I The number of dimensions – as in factor analysis – can be decided based on the eigenvalues from the SVD

Page 92: Day 10: Text classification and scaling

Singular Value Decomposition

I A matrix X_{n×k} can be represented in a dimensionality equal to its rank d as:

X_{n×k} = U_{n×d} Σ_{d×d} V′_{d×k}    (1)

I The U, Σ, and V matrices “relocate” the elements of X onto new coordinate vectors in d-dimensional Euclidean space

I Row variables of X become points on the U column coordinates, and the column variables of X become points on the V column coordinates

I The coordinate vectors are perpendicular (orthogonal) to each other and are normalized to unit length
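
A quick check of the decomposition in R:

```r
# Exact reconstruction of a matrix from its SVD
X <- matrix(rnorm(5 * 4), nrow = 5)
s <- svd(X)
max(abs(X - s$u %*% diag(s$d) %*% t(s$v)))  # ~ 1e-15: X = U Sigma V'
```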

Page 93: Day 10: Text classification and scaling

Correspondence analysis

1. Compute the matrix of standardized residuals, S:

S = D_r^{−1/2} (P − r c^T) D_c^{−1/2}

where P = Y / Σ_{ij} y_{ij}; r, c are the row/column masses, e.g. r_i = Σ_j p_{ij}; D_r = diag(r), D_c = diag(c)

2. Calculate the SVD of S

3. Project rows and columns onto the low-dimensional space:

θ = D_r^{−1/2} U for rows (documents)

φ = D_c^{−1/2} V for columns (words)

Mathematically close to a log-linear Poisson regression model (Lowe, 2008)
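
The three steps above translate almost line for line into R (a minimal sketch; in practice one would use an existing implementation, e.g. the ca package):

```r
# Correspondence analysis from scratch, following steps 1-3 above.
# Y: document-feature count matrix.
ca_scale <- function(Y, d = 2) {
  P <- Y / sum(Y)                           # correspondence matrix
  rmass <- rowSums(P); cmass <- colSums(P)  # row and column masses
  # step 1: standardized residuals
  S <- diag(1 / sqrt(rmass)) %*% (P - rmass %o% cmass) %*%
       diag(1 / sqrt(cmass))
  sv <- svd(S)                              # step 2
  list(documents = diag(1 / sqrt(rmass)) %*% sv$u[, 1:d],  # step 3 (rows)
       words     = diag(1 / sqrt(cmass)) %*% sv$v[, 1:d],  # step 3 (columns)
       singular_values = sv$d)
}
```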

Page 94: Day 10: Text classification and scaling

Example: Schonhardt-Bailey (2008) - speakers


appears to be unique to that bill – i.e., the specific procedural measures, the constitutionality of the absent health exception, and the gruesome medical details of the procedure are all unique to the PBA ban as defined in the 2003 bill. Hence, to ignore the content of the debates by focusing solely on the final roll-call vote is to miss much of what concerned senators about this particular bill. To see this more clearly, we turn to Figure 3, in which the results from ALCESTE’s classification are represented in correspondence space.

Fig. 3. Correspondence analysis of classes and tags from Senate debates on Partial-Birth Abortion Ban Act

Page 95: Day 10: Text classification and scaling

Example: Schonhardt-Bailey (2008) - words

Fig. 4. Senate debates on Partial-Birth Abortion Ban Act – word distribution in correspondence space