
Page 1

Bridging the Native Language and Language Variety Identification Tasks

Marc Franco-Salvador, Greg Kondrak, and Paolo Rosso

KES 2017 – IS13:

Supervised versus Unsupervised Methods for Intelligent Text Processing

September 7th, Marseille, France

Page 2

Outline

• Introduction
  • Tasks
  • Motivation
• Methods
  • String kernels
  • Word embeddings
  • Combination
• Evaluation
• Conclusions


Page 4

Native Language Identification (NLI)

The task of NLI (Koppel et al., 2005) is to determine the native language of the author of a text that he or she wrote in another language.

Example: deciding whether an English essay was written by a Spanish or a German student.

Spanish student: “the people here is very friendly”

German student: “I like it to go swimming”

Page 5

Language Variety Identification (LVI)

LVI (van Bezooijen & Gooskens, 1999) aims at classifying texts of different varieties of a single language.

Example: distinguishing between American and British English.

EN-UK: “touch wood”

EN-US: “knock on wood”

Page 6

Bridging the NLI and LVI tasks - Motivation

Page 7

Bridging the NLI and LVI tasks - Motivation

Input:

NLI & LVI: a text written in a known language L.

Objective:

LVI: determine the language variety Li, where Li ∈ L

NLI: determine the author’s native language X, X ≠ L

Our insight:

The way that native speakers of X write texts in L constitutes a particular variety Lx of L, where Lx ∈ L. In this way, we reduce both tasks to the identification of the language variety of L.

Page 8

Testing the hypothesis

We test our hypothesis by designing an approach intended to work on both tasks without any task-specific adaptation.

The approach combines two distinct methods:

• String kernels

• Word embeddings

Page 9

Outline

• Introduction
  • Tasks
  • Motivation
• Methods
  • String kernels
  • Word embeddings
  • Combination
• Evaluation
• Conclusions

Page 10

String kernels (SK)

String kernels are functions that measure the similarity of string pairs at the lexical level. General form of a p-grams kernel:

k_p(s, t) = Σ_{v ∈ L^p} f(num_v(s), num_v(t))

where num_v(s) is the number of occurrences of the p-gram v in the string s.

Three variants of the kernel differ in the definition of the function f(x, y):

1. f(x, y) = x · y in the p-spectrum kernel;
2. f(x, y) = sgn(x) · sgn(y) in the p-grams presence bits kernel;
3. f(x, y) = min(x, y) in the p-grams intersection bits kernel.

* Following Ionescu et al. (2014), our kernels combine sizes p = [5, 8] and are classified with kernel discriminant analysis (Friedman et al., 2001).
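A minimal sketch of the three variants for a single p-gram size (our own illustration, not the authors' implementation; the combination over p = [5, 8] and the kernel discriminant analysis classifier are omitted):

```python
from collections import Counter

def pgram_counts(s, p):
    """Count every character p-gram occurring in the string s."""
    return Counter(s[i:i + p] for i in range(len(s) - p + 1))

def string_kernel(s, t, p, variant="presence"):
    """k_p(s, t): sum of f(num_v(s), num_v(t)) over character p-grams v."""
    cs, ct = pgram_counts(s, p), pgram_counts(t, p)
    total = 0
    for v in cs.keys() & ct.keys():  # p-grams absent from either string contribute 0
        x, y = cs[v], ct[v]
        if variant == "spectrum":       # f(x, y) = x * y
            total += x * y
        elif variant == "presence":     # f(x, y) = sgn(x) * sgn(y)
            total += 1
        else:                           # "intersection": f(x, y) = min(x, y)
            total += min(x, y)
    return total
```

For example, `string_kernel("banana", "bandana", 2)` returns 3, the number of distinct character 2-grams shared by the two strings.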

Page 11

Word embeddings (WE)

We explore two alternatives based on the continuous Skip-gram model (Mikolov et al., 2013) to obtain the vector e of a text d:

1. Average the vectors of the words w_i ∈ d:

   e = (1/n) Σ_{w_i ∈ d} w_i

   where n is the number of words in d.

2. Use the Skip-gram Sentence Vectors (SenVec) variant (Le and Mikolov, 2014).

* Following Franco-Salvador et al. (2015a,2015b) we test the logistic regression and SVM classifiers.
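A minimal sketch of alternative 1, assuming the trained Skip-gram model is exposed as a plain word-to-vector dictionary (the toy vocabulary below is a stand-in for a real model):

```python
import numpy as np

def average_embedding(tokens, word_vectors, dim):
    """Vector e of a text: the mean of the Skip-gram vectors of its words.

    Out-of-vocabulary tokens are skipped; an empty text maps to the zero vector.
    """
    known = [word_vectors[w] for w in tokens if w in word_vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

# Toy stand-in vocabulary; a real model would come from Skip-gram training.
toy = {"touch": np.array([1.0, 0.0]), "wood": np.array([0.0, 1.0])}
e = average_embedding(["touch", "wood"], toy, dim=2)  # array([0.5, 0.5])
```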

Page 12

Classifier combination

Linear interpolation: α · x + (1 − α) · y

Logarithmic interpolation: x^α · y^(1−α)

Ranking interpolation: α · rank(x) + (1 − α) · rank(y)

Meta-learning: 10-fold cross-validation over the training set to derive an SVM model that combines all the class probabilities.

* x and y stand for the conditional probabilities obtained from string kernels and word embeddings, respectively.
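Taking x and y to be the per-class probability vectors of the two classifiers, the three interpolation schemes can be sketched as follows (function names and the rank convention, 1 = least probable class, are our own):

```python
import numpy as np

def linear_interp(x, y, alpha):
    """alpha * x + (1 - alpha) * y, element-wise over classes."""
    return alpha * x + (1 - alpha) * y

def log_interp(x, y, alpha):
    """x**alpha * y**(1 - alpha): a weighted geometric mean of the probabilities."""
    return x ** alpha * y ** (1 - alpha)

def rank_interp(x, y, alpha):
    """Interpolate class ranks instead of raw probabilities."""
    rank = lambda v: np.argsort(np.argsort(v)) + 1  # 1 = least probable class
    return alpha * rank(x) + (1 - alpha) * rank(y)

# x, y: class probabilities from string kernels and word embeddings, respectively
x = np.array([0.7, 0.2, 0.1])
y = np.array([0.5, 0.4, 0.1])
predicted = int(np.argmax(linear_interp(x, y, alpha=0.6)))  # class 0
```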

Page 13

Outline

• Introduction
  • Tasks
  • Motivation
• Methods
  • String kernels
  • Word embeddings
  • Combination
• Evaluation
• Conclusions

Page 14

Evaluation – Dataset statistics

Task              |       NLI       |               LVI
Dataset           | TOEFL11 |  ICLE | DSLCC 2.0 | DSLCC 3.0 | HispaBlogs
# languages       |      11 |     7 |        14 |        12 |          5
Train texts       |    9900 |   770 |      252k |      216k |       2250
Dev texts         |    1100 |     - |       28k |       24k |          -
Test texts        |    1100 |     - |       14k |       12k |       1000
Avg. text length  |     243 |   689 |        37 |        43 |       3168

Page 15

Evaluation – Development experiments

Task                |       NLI       |               LVI
Dataset             | TOEFL11 |  ICLE | DSLCC 2.0 | DSLCC 3.0 | HispaBlogs
SK (p-spectrum)     |    83.5 |  85.7 |      94.5 |      86.8 |       74.3
SK (intersection)   |    85.4 |  89.1 |      94.6 |      86.9 |       75.9
SK (presence)       |    86.4 |  89.2 |      94.6 |      87.2 |       76.1
WE-avg (logistic)   |    65.6 |  59.4 |      93.5 |      84.1 |       73.8
WE-avg (SVM-linear) |    60.9 |  59.0 |      93.1 |      83.9 |       73.6
SenVec (logistic)   |    64.1 |  58.6 |      91.2 |      82.7 |       71.5
SenVec (SVM-linear) |    55.3 |  58.6 |      91.1 |      82.4 |       70.9
Linear inter.       |    87.4 |  90.6 |      94.8 |      87.4 |       78.3
Log inter.          |    87.5 |  90.1 |      94.7 |      87.3 |       77.9
Rank inter.         |    86.4 |  89.2 |      94.6 |      87.2 |       73.8
Meta-learning       |    87.3 |  90.3 |      94.6 |      87.2 |       78.0

Page 16

Evaluation – Test partition results

Task            |       NLI       |               LVI
Dataset         | TOEFL11 |  ICLE | DSLCC 2.0 | DSLCC 3.0 | HispaBlogs
Baseline        |    60.1 |  78.2 |      90.3 |      81.9 |       52.7
String kernels  |    82.8 |  89.2 |      94.4 |      88.2 |       74.9
Word embed.     |    66.1 |  59.4 |      92.1 |      85.3 |       72.2
Combination     |    83.8 |  90.6 |      94.7 |      88.3 |       76.7
Previous work   |    83.6 |  90.1 |      95.5 |      89.4 |       71.1

• The baseline is a bag of words (BOW) with the 10k most frequent words represented as binary features; we used an SVM with a linear kernel.

• We compare to the best-performing systems on the respective shared tasks: Jarvis et al. (2013) for TOEFL11; Malmasi and Dras (2015) for DSLCC 2.0; and Çöltekin and Rama (2016) for DSLCC 3.0.

• The ICLE results are from Tetreault et al. (2012). For HispaBlogs, we report the results of Rangel et al. (2016).
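The baseline's feature extraction can be sketched as below (function names are ours; the linear-kernel SVM itself, e.g. from a standard library, is omitted):

```python
from collections import Counter

def build_vocab(train_texts, size=10000):
    """Index the `size` most frequent whitespace-separated words in the training texts."""
    counts = Counter(w for text in train_texts for w in text.lower().split())
    return {w: i for i, (w, _) in enumerate(counts.most_common(size))}

def bow_binary(text, vocab):
    """Binary BOW feature vector: 1 if the vocabulary word occurs in the text, else 0."""
    present = {vocab[w] for w in text.lower().split() if w in vocab}
    return [1 if i in present else 0 for i in range(len(vocab))]
```

Each text becomes one such vector, which is then classified with the linear-kernel SVM.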

Page 17

Evaluation – Discussion: SK vs WE

• String kernels are effective at capturing the lexical peculiarities of a given language variety.
  • E.g., the Spanish word coger "to take" is used frequently in European news, but not in Latin America, where it has acquired a taboo meaning.

• Word embeddings are less effective at leveraging individual word tokens like coger; the contribution of a single word vector may be insufficient for a correct classification. However, the method has the potential to take into account the frequencies of words.
  • E.g., a high frequency of the English pronoun he in TOEFL essays is more indicative of Turkish than of Arabic native speakers.

Page 18

Evaluation – Discussion: text characteristics impact

• The length of the text affects classifier performance. Word embeddings excelled on datasets with longer texts, e.g., HispaBlogs.

• Small training datasets do not allow training representative embeddings (e.g., ICLE).

Page 19

Evaluation – Discussion: text characteristics impact

• Word embeddings work better with a high number of named entities. Named entities account for 5-7% of tokens in the NLI datasets vs. 11-12% in the LVI ones.

• DSLCC training and test instances may overlap with respect to the text authors, resulting in over-fitting to a particular author's writing style.

Short n-gram sizes have been used in the past for authorship attribution (Koppel & Schler, 2003). Our string kernels used n-grams of size [5, 8], which may be too large.

Page 20

Outline

• Introduction
  • Tasks
  • Motivation
• Methods
  • String kernels
  • Word embeddings
  • Combination
• Evaluation
• Conclusions

Page 21

Conclusions

• String kernels work very well on the LVI task.

• Experiments on five datasets show that the combination-based approach achieves results that are close to the state of the art on both the NLI and LVI tasks.

• We interpret this as empirical evidence for our hypothesis concerning the similarity of the two identification tasks.

• We hypothesize that our approach may be similarly effective on other author profiling tasks that can be framed as the identification of a particular language variety, e.g., gender identification.

Page 22

Bridging the Native Language and Language Variety Identification Tasks

Marc Franco-Salvador, Greg Kondrak, and Paolo Rosso

KES 2017 – IS13:

Supervised versus Unsupervised Methods for Intelligent Text Processing

September 7th, Marseille, France

Thank you for your time

Questions?

[email protected]

Page 23

References I

Çöltekin, Ç., & Rama, T. (2016). Discriminating Similar Languages with Linear SVMs and Neural Networks. In Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3) (pp. 15-24).

Franco-Salvador, M., Rangel, F., Rosso, P., Taulé, M., & Martí, M. A. (2015a). Language Variety Identification Using Distributed Representations of Words and Documents. In Experimental IR Meets Multilinguality, Multimodality, and Interaction (pp. 28-40). Springer International Publishing.

Franco-Salvador, M., Rosso, P., & Rangel, F. (2015b). Distributed Representations of Words and Documents for Discriminating Similar Languages. In Proceedings of the Joint Workshop on Language Technology for Closely Related Languages, Varieties and Dialects (LT4VarDial), Hissar, Bulgaria.

Friedman, J., Hastie, T., & Tibshirani, R. (2001). The Elements of Statistical Learning (Vol. 1, pp. 241-249). New York: Springer Series in Statistics.

Ionescu, R. T., Popescu, M., & Cahill, A. (2014). Can characters reveal your native language? A language-independent approach to native language identification. In EMNLP (pp. 1363-1373).

Jarvis, S., Bestgen, Y., & Pepper, S. (2013, June). Maximizing Classification Accuracy in Native Language Identification. In BEA@NAACL-HLT (pp. 111-118).

Page 24

References II

Koppel, M., & Schler, J. (2003). Exploiting stylistic idiosyncrasies for authorship attribution. In Proceedings of the IJCAI'03 Workshop on Computational Approaches to Style Analysis and Synthesis (Vol. 69, p. 72).

Koppel, M., Schler, J., & Zigdon, K. (2005). Automatically determining an anonymous author's native language. Intelligence and Security Informatics, 41-76.

Le, Q. V., & Mikolov, T. (2014). Distributed representations of sentences and documents. arXiv preprint arXiv:1405.4053.

Malmasi, S., & Dras, M. (2015, September). Language identification using classifier ensembles. In Proceedings of the Joint Workshop on Language Technology for Closely Related Languages, Varieties and Dialects (LT4VarDial) (pp. 35-43).

Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems (pp. 3111-3119).

Rangel, F., Franco-Salvador, M., & Rosso, P. (2016). A low dimensionality representation for language variety identification. arXiv preprint arXiv:1705.10754.

Tetreault, J., Blanchard, D., Cahill, A., & Chodorow, M. (2012). Native tongues, lost and found: Resources and empirical evaluations in native language identification. Proceedings of COLING 2012, 2585-2602.

Van Bezooijen, R., & Gooskens, C. (1999). Identification of language varieties: The contribution of different linguistic levels. Journal of Language and Social Psychology, 18(1), 31-48.