
Page 1

Bridging the Native Language and Language Variety Identification Tasks

Marc Franco-Salvador, Greg Kondrak, and Paolo Rosso

KES 2017 – IS13:

Supervised versus Unsupervised Methods for Intelligent Text Processing

September 7th, Marseille, France

Page 2

Outline

• Introduction
  • Tasks
  • Motivation
• Methods
  • String kernels
  • Word embeddings
  • Combination
• Evaluation
• Conclusions


Page 4

Native Language Identification (NLI)

The task of NLI (Koppel et al., 2005) is to determine the native language of the author of a text that he or she wrote in another language.

Example: deciding whether an English essay was written by a Spanish or a German student.

Spanish student: “the people here is very friendly”

German student: “I like it to go swimming”

Page 5

Language Variety Identification (LVI)

LVI (van Bezooijen & Gooskens, 1999) aims at classifying texts of different varieties of a single language.

Example: distinguishing between American and British English.

EN-UK: “touch wood”

EN-US: “knock on wood”

Page 6

Bridging the NLI and LVI tasks - Motivation

Page 7

Bridging the NLI and LVI tasks - Motivation

Input:

NLI & LVI: a text written in a known language L.

Objective:

LVI: determine the language variety Li, where Li ∈ L

NLI: determine the author’s native language X, X ≠ L

Our insight:

The way that native speakers of X write texts in L constitutes a particular variety Lx of L, where Lx ∈ L. In this way, we reduce both tasks to the identification of the language variety of L.

Page 8

Testing the hypothesis

We test our hypothesis by designing an approach intended to work on both tasks without any task-specific adaptation.

The approach combines two distinct methods:

• String kernels

• Word embeddings

Page 9

Outline

• Introduction
  • Tasks
  • Motivation
• Methods
  • String kernels
  • Word embeddings
  • Combination
• Evaluation
• Conclusions

Page 10

String kernels (SK)

String kernels are functions that measure the similarity of string pairs at the lexical level. General form of a p-grams kernel:

k_p(s, t) = Σ_{v ∈ L^p} f(num_v(s), num_v(t))

where num_v(s) is the number of occurrences of the p-gram v in the string s.

Three variants of the kernel differ in the definition of the function f(x, y):

1. f(x, y) = x · y in the p-spectrum kernel;
2. f(x, y) = sgn(x) · sgn(y) in the p-grams presence bits kernel;
3. f(x, y) = min(x, y) in the p-grams intersection bits kernel.

* Following Ionescu et al. (2014), our kernels combine sizes p = [5, 8] and are classified with kernel discriminant analysis (Friedman et al., 2001).
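A minimal sketch of the three variants for a single p-gram size (our own illustration, not the authors' implementation; the combination over p = [5, 8] and the kernel discriminant analysis classifier are omitted):

```python
from collections import Counter

def pgram_counts(s, p):
    """Count every character p-gram occurring in the string s."""
    return Counter(s[i:i + p] for i in range(len(s) - p + 1))

def string_kernel(s, t, p, variant="presence"):
    """k_p(s, t): sum of f(num_v(s), num_v(t)) over character p-grams v."""
    cs, ct = pgram_counts(s, p), pgram_counts(t, p)
    total = 0
    for v in cs.keys() & ct.keys():  # p-grams absent from either string contribute 0
        x, y = cs[v], ct[v]
        if variant == "spectrum":       # f(x, y) = x * y
            total += x * y
        elif variant == "presence":     # f(x, y) = sgn(x) * sgn(y)
            total += 1
        else:                           # "intersection": f(x, y) = min(x, y)
            total += min(x, y)
    return total
```

For example, `string_kernel("banana", "bandana", 2)` returns 3, the number of distinct character 2-grams shared by the two strings.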

Page 11

Word embeddings (WE)

We explore two alternatives based on the continuous Skip-gram model (Mikolov et al., 2013) to obtain the vector e of a text d:

1. Average the vectors of the words w_i ∈ d:

   e = (1/n) Σ_{w_i ∈ d} w_i

   where n is the number of words in d.

2. Use the Skip-gram Sentence Vectors (SenVec) variant (Le and Mikolov, 2014).

* Following Franco-Salvador et al. (2015a,2015b) we test the logistic regression and SVM classifiers.
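A minimal sketch of alternative 1, assuming the trained Skip-gram model is exposed as a plain word-to-vector dictionary (the toy vocabulary below is a stand-in for a real model):

```python
import numpy as np

def average_embedding(tokens, word_vectors, dim):
    """Vector e of a text: the mean of the Skip-gram vectors of its words.

    Out-of-vocabulary tokens are skipped; an empty text maps to the zero vector.
    """
    known = [word_vectors[w] for w in tokens if w in word_vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

# Toy stand-in vocabulary; a real model would come from Skip-gram training.
toy = {"touch": np.array([1.0, 0.0]), "wood": np.array([0.0, 1.0])}
e = average_embedding(["touch", "wood"], toy, dim=2)  # array([0.5, 0.5])
```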

Page 12

Classifier combination

Linear interpolation: α · x + (1 − α) · y

Logarithmic interpolation: x^α · y^(1−α)

Ranking interpolation: α · rank(x) + (1 − α) · rank(y)

Meta-learning: 10-fold cross-validation over the training set to derive an SVM model that combines all the class probabilities.

* x and y stand for the conditional probabilities obtained from string kernels and word embeddings, respectively.
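Taking x and y to be the per-class probability vectors of the two classifiers, the three interpolation schemes can be sketched as follows (function names and the rank convention, 1 = least probable class, are our own):

```python
import numpy as np

def linear_interp(x, y, alpha):
    """alpha * x + (1 - alpha) * y, element-wise over classes."""
    return alpha * x + (1 - alpha) * y

def log_interp(x, y, alpha):
    """x**alpha * y**(1 - alpha): a weighted geometric mean of the probabilities."""
    return x ** alpha * y ** (1 - alpha)

def rank_interp(x, y, alpha):
    """Interpolate class ranks instead of raw probabilities."""
    rank = lambda v: np.argsort(np.argsort(v)) + 1  # 1 = least probable class
    return alpha * rank(x) + (1 - alpha) * rank(y)

# x, y: class probabilities from string kernels and word embeddings, respectively
x = np.array([0.7, 0.2, 0.1])
y = np.array([0.5, 0.4, 0.1])
predicted = int(np.argmax(linear_interp(x, y, alpha=0.6)))  # class 0
```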

Page 13

Outline

• Introduction
  • Tasks
  • Motivation
• Methods
  • String kernels
  • Word embeddings
  • Combination
• Evaluation
• Conclusions

Page 14

Evaluation – Dataset statistics

Task              |       NLI       |               LVI
Dataset           | TOEFL11 |  ICLE | DSLCC 2.0 | DSLCC 3.0 | HispaBlogs
# languages       |      11 |     7 |        14 |        12 |          5
Train texts       |    9900 |   770 |      252k |      216k |       2250
Dev texts         |    1100 |     - |       28k |       24k |          -
Test texts        |    1100 |     - |       14k |       12k |       1000
Avg. text length  |     243 |   689 |        37 |        43 |       3168

Page 15

Evaluation – Development experiments

Task                |       NLI       |               LVI
Dataset             | TOEFL11 |  ICLE | DSLCC 2.0 | DSLCC 3.0 | HispaBlogs
SK (p-spectrum)     |    83.5 |  85.7 |      94.5 |      86.8 |       74.3
SK (intersection)   |    85.4 |  89.1 |      94.6 |      86.9 |       75.9
SK (presence)       |    86.4 |  89.2 |      94.6 |      87.2 |       76.1
WE-avg (logistic)   |    65.6 |  59.4 |      93.5 |      84.1 |       73.8
WE-avg (SVM-linear) |    60.9 |  59.0 |      93.1 |      83.9 |       73.6
SenVec (logistic)   |    64.1 |  58.6 |      91.2 |      82.7 |       71.5
SenVec (SVM-linear) |    55.3 |  58.6 |      91.1 |      82.4 |       70.9
Linear inter.       |    87.4 |  90.6 |      94.8 |      87.4 |       78.3
Log inter.          |    87.5 |  90.1 |      94.7 |      87.3 |       77.9
Rank inter.         |    86.4 |  89.2 |      94.6 |      87.2 |       73.8
Meta-learning       |    87.3 |  90.3 |      94.6 |      87.2 |       78.0

Page 16

Evaluation – Test partition results

Task            |       NLI       |               LVI
Dataset         | TOEFL11 |  ICLE | DSLCC 2.0 | DSLCC 3.0 | HispaBlogs
Baseline        |    60.1 |  78.2 |      90.3 |      81.9 |       52.7
String kernels  |    82.8 |  89.2 |      94.4 |      88.2 |       74.9
Word embed.     |    66.1 |  59.4 |      92.1 |      85.3 |       72.2
Combination     |    83.8 |  90.6 |      94.7 |      88.3 |       76.7
Previous work   |    83.6 |  90.1 |      95.5 |      89.4 |       71.1

• The baseline is a bag of words (BOW) with the 10k most frequent words represented as binary features; we used an SVM with a linear kernel.

• We compare to the best-performing systems on the respective shared tasks: Jarvis et al. (2013) for TOEFL11; Malmasi and Dras (2015) for DSLCC 2.0; and Çöltekin and Rama (2016) for DSLCC 3.0.

• The ICLE results are from Tetreault et al. (2012). For HispaBlogs, we report the results of Rangel et al. (2016).
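The baseline's feature extraction can be sketched as below (function names are ours; the linear-kernel SVM itself, e.g. from a standard library, is omitted):

```python
from collections import Counter

def build_vocab(train_texts, size=10000):
    """Index the `size` most frequent whitespace-separated words in the training texts."""
    counts = Counter(w for text in train_texts for w in text.lower().split())
    return {w: i for i, (w, _) in enumerate(counts.most_common(size))}

def bow_binary(text, vocab):
    """Binary BOW feature vector: 1 if the vocabulary word occurs in the text, else 0."""
    present = {vocab[w] for w in text.lower().split() if w in vocab}
    return [1 if i in present else 0 for i in range(len(vocab))]
```

Each text becomes one such vector, which is then classified with the linear-kernel SVM.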

Page 17

Evaluation – Discussion: SK vs WE

• String kernels are effective at capturing the lexical peculiarities of a given language variety.
  • E.g., the Spanish word coger "to take" is used frequently in European news, but not in Latin America, where it has acquired a taboo meaning.

• Word embeddings are less effective at leveraging individual word tokens like coger; the contribution of a single word vector may be insufficient for a correct classification. However, the method has the potential to take into account the frequencies of words.
  • E.g., a high frequency of the English pronoun he in TOEFL essays is more indicative of Turkish than of Arabic native speakers.

Page 18

Evaluation – Discussion: text characteristics impact

• The length of the text affects classifier performance. Word embeddings excelled on datasets with longer texts, e.g., HispaBlogs.

• Small training datasets do not allow training representative embeddings (e.g., ICLE).

Page 19

Evaluation – Discussion: text characteristics impact

• Word embeddings work better with a high number of named entities. Named entities account for 5-7% of tokens in the NLI datasets vs. 11-12% in the LVI ones.

• DSLCC training and test instances may overlap with respect to the text authors, resulting in over-fitting to a particular author's writing style.

Short n-gram sizes have been used in the past for authorship attribution (Koppel & Schler, 2003). Our string kernels used n-grams of size [5, 8], which may be too large.

Page 20

Outline

• Introduction
  • Tasks
  • Motivation
• Methods
  • String kernels
  • Word embeddings
  • Combination
• Evaluation
• Conclusions

Page 21

Conclusions

• String kernels work very well on the LVI task.

• Experiments on five datasets show that the combination-based approach achieves results that are close to the state of the art on both the NLI and LVI tasks.

• We interpret this as empirical evidence for our hypothesis concerning the similarity of the two identification tasks.

• We hypothesize that our approach may be similarly effective on other author profiling tasks that can be framed as the identification of a particular language variety, e.g., gender identification.

Page 22

Bridging the Native Language and Language Variety Identification Tasks

Marc Franco-Salvador, Greg Kondrak, and Paolo Rosso

KES 2017 – IS13:

Supervised versus Unsupervised Methods for Intelligent Text Processing

September 7th, Marseille, France

Thank you for your time

Questions?

[email protected]

Page 23

References I

Çöltekin, Ç., & Rama, T. (2016). Discriminating Similar Languages with Linear SVMs and Neural Networks. In Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3) (pp. 15-24).

Franco-Salvador, M., Rangel, F., Rosso, P., Taulé, M., & Martí, M. A. (2015a). Language Variety Identification Using Distributed Representations of Words and Documents. In Experimental IR Meets Multilinguality, Multimodality, and Interaction (pp. 28-40). Springer International Publishing.

Franco-Salvador, M., Rosso, P., & Rangel, F. (2015b). Distributed Representations of Words and Documents for Discriminating Similar Languages. In Proceedings of the Joint Workshop on Language Technology for Closely Related Languages, Varieties and Dialects (LT4VarDial), Hissar, Bulgaria.

Friedman, J., Hastie, T., & Tibshirani, R. (2001). The Elements of Statistical Learning (Vol. 1, pp. 241-249). New York: Springer Series in Statistics.

Ionescu, R. T., Popescu, M., & Cahill, A. (2014). Can characters reveal your native language? A language-independent approach to native language identification. In EMNLP (pp. 1363-1373).

Jarvis, S., Bestgen, Y., & Pepper, S. (2013, June). Maximizing Classification Accuracy in Native Language Identification. In BEA@NAACL-HLT (pp. 111-118).

Page 24

References II

Koppel, M., & Schler, J. (2003). Exploiting stylistic idiosyncrasies for authorship attribution. In Proceedings of the IJCAI'03 Workshop on Computational Approaches to Style Analysis and Synthesis (Vol. 69, p. 72).

Koppel, M., Schler, J., & Zigdon, K. (2005). Automatically determining an anonymous author's native language. Intelligence and Security Informatics, 41-76.

Le, Q. V., & Mikolov, T. (2014). Distributed representations of sentences and documents. arXiv preprint arXiv:1405.4053.

Malmasi, S., & Dras, M. (2015, September). Language identification using classifier ensembles. In Proceedings of the Joint Workshop on Language Technology for Closely Related Languages, Varieties and Dialects (LT4VarDial) (pp. 35-43).

Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems (pp. 3111-3119).

Rangel, F., Franco-Salvador, M., & Rosso, P. (2016). A low dimensionality representation for language variety identification. arXiv preprint arXiv:1705.10754.

Tetreault, J., Blanchard, D., Cahill, A., & Chodorow, M. (2012). Native tongues, lost and found: Resources and empirical evaluations in native language identification. Proceedings of COLING 2012, 2585-2602.

Van Bezooijen, R., & Gooskens, C. (1999). Identification of language varieties: The contribution of different linguistic levels. Journal of Language and Social Psychology, 18(1), 31-48.