Language Identification Related Work Open Issues Task Description Dataset Baseline Results Conclusion References

ALTW 2010 Shared Task:

Multilingual Language Identification

Marco Lui & Tim Baldwin

NICTA VRL

Department of Computer Science and Software Engineering

University of Melbourne, VIC 3010, Australia

University of Melbourne

10 December 2010


What is Language Identification?

Source(s): Wikipedia

Can you LangID?

Source(s): Wikipedia

Basic Assumptions

Monolingual

Homogeneous

Closed World

Narrow Scope


Cavnar & Trenkle - Dataset

• 3478 samples from the soc.culture newsgroup hierarchy

• 8 languages:

Language    Samples
English     1208
Spanish     697
German      481
Italian     316
French      273
Dutch       235
Portuguese  151
Polish      117

Reference(s): Cavnar and Trenkle, 1994


Cavnar & Trenkle - Techniques

Data Representation

• keep only letters, apostrophes, whitespace

• union over byte-level N-grams (N = 1 ... 5)

Examples

language identification

1-gram l, a, n, g, u . . .

2-gram la, an, gu, ua, ag . . .

3-gram lan, ang, gua, uag, age . . .

4-gram lang, angu, guag, uage, age_ . . .

5-gram langu, angua, guage, uage_, age_i . . .

(_ marks a space)

Reference(s): Cavnar and Trenkle, 1994
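
The data representation on this slide can be sketched in a few lines of Python. This is only an illustration, not Cavnar and Trenkle's implementation: the names `normalise` and `byte_ngrams` are hypothetical, UTF-8 is assumed as the encoding, and discarded characters are simply dropped.

```python
def normalise(text: str) -> str:
    """Keep only letters, apostrophes and whitespace; drop everything else."""
    return "".join(ch for ch in text if ch.isalpha() or ch == "'" or ch.isspace())


def byte_ngrams(text: str, n_min: int = 1, n_max: int = 5) -> list:
    """Return every byte-level n-gram of the normalised text, n_min <= n <= n_max."""
    data = normalise(text).encode("utf-8")  # assumption: UTF-8 encoded input
    return [
        data[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(data) - n + 1)
    ]


grams = byte_ngrams("language identification")
print(grams[:5])          # the first few 1-grams
print(b"age i" in grams)  # a 5-gram that spans the word boundary
```

Taking the union over all orders means a single feature space holds both very frequent short n-grams and more distinctive long ones.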


Cavnar & Trenkle - Techniques

Feature Selection

• N-Gram Frequency Profile

• Top X (X = 100 . . . 400)

Examples

X = 3

from a:20 b:15 c:10 ab:12 ac:8 . . .

select a, b, ab

Reference(s): Cavnar and Trenkle, 1994
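
The X = 3 example on this slide can be reproduced with a frequency counter. This is a minimal sketch; `top_x` is a hypothetical helper name, and ties (none occur here) would be broken by insertion order, which is an assumption rather than something the slide specifies.

```python
from collections import Counter


def top_x(profile: Counter, x: int) -> list:
    """Return the x most frequent n-grams, most frequent first."""
    return [gram for gram, _count in profile.most_common(x)]


# Frequencies copied from the slide's example.
profile = Counter({"a": 20, "b": 15, "c": 10, "ab": 12, "ac": 8})
selected = top_x(profile, 3)
print(selected)
```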


Cavnar & Trenkle - Techniques

Classification Algorithm

• nearest prototype

• 1 prototype per language

• sum of term frequencies across all instances

• out-of-place distance metric

Examples

doc1 a:10 b:15 c:2

doc2 a:2 b:3 c:1

doc3 a:25 b:20 c:15

prototype a:37 b:38 c:18

Reference(s): Cavnar and Trenkle, 1994
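
The prototype in the example is the element-wise sum of the three document profiles. A small sketch of that step, with `build_prototype` as a hypothetical helper name:

```python
from collections import Counter


def build_prototype(docs) -> Counter:
    """Sum term frequencies across all training documents of one language."""
    prototype = Counter()
    for doc in docs:
        prototype.update(doc)  # Counter.update adds counts rather than replacing them
    return prototype


# Term frequencies copied from the slide's example.
docs = [
    Counter(a=10, b=15, c=2),
    Counter(a=2, b=3, c=1),
    Counter(a=25, b=20, c=15),
]
prototype = build_prototype(docs)
print(prototype)
```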


Cavnar & Trenkle - Techniques

Out-of-Place distance metric

Reference(s): Cavnar and Trenkle, 1994
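
The figure for this slide did not survive extraction. As a rough sketch of the idea: each profile is a list of n-grams ranked by frequency, and the distance is the sum, over the document's n-grams, of how far each one sits from its rank in the language profile, with a fixed maximum penalty for n-grams the language profile lacks. The function names, the sample profiles, and the choice of the profile length as the penalty are assumptions made for illustration.

```python
def out_of_place(doc_profile, lang_profile, max_penalty=None) -> int:
    """Sum of rank differences between two frequency-ranked n-gram lists."""
    if max_penalty is None:
        max_penalty = len(lang_profile)  # assumption: penalty for a missing n-gram
    lang_rank = {gram: rank for rank, gram in enumerate(lang_profile)}
    return sum(
        abs(rank - lang_rank[gram]) if gram in lang_rank else max_penalty
        for rank, gram in enumerate(doc_profile)
    )


def classify(doc_profile, prototypes) -> str:
    """Nearest prototype: the language whose profile is least out of place."""
    return min(prototypes, key=lambda lang: out_of_place(doc_profile, prototypes[lang]))


# Hypothetical ranked profiles, most frequent n-gram first.
doc = ["TH", "ING", "ON", "ER", "AND", "ED"]
prototypes = {
    "A": ["TH", "ER", "ON", "LE", "ING", "AND"],
    "B": ["ZZ", "QQ", "ED", "TH", "ON", "ER"],
}
print(out_of_place(doc, prototypes["A"]), classify(doc, prototypes))
```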


Cavnar & Trenkle - Results

• 98.6% accuracy for articles ≤300 bytes

• 99.8% accuracy for articles > 300 bytes

• A solved problem?

Reference(s): Cavnar and Trenkle, 1994

Baldwin & Lui - Task Description

Corpus Docs Langs Encs Document Length (bytes)

EuroGOV 1500 10 1 17460.5±39353.4

TCL 3174 60 12 2623.2±3751.9

Wikipedia 4963 67 1 1480.8±4063.9

Reference(s): Baldwin and Lui, 2010

Baldwin & Lui - Method

• 10-fold cross-validation on each dataset

• 42 distinct classifiers

model (×7): nearest-neighbour (Cos1NN, Skew1NN, OOP1NN), nearest-prototype (CosAM, SkewAM), Naive Bayes, SVM

tokenisation (×2): byte, codepoint

n-gram (×3): 1-gram, 2-gram, 3-gram

Reference(s): Baldwin and Lui, 2010
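
The count of 42 is the product 7 × 2 × 3. The sketch below enumerates that grid and shows how byte and codepoint tokenisation diverge on non-ASCII text. The `tokenise` helper and the UTF-8 encoding are assumptions made for illustration; the model names are just labels, not implementations.

```python
from itertools import product

MODELS = ["Cos1NN", "Skew1NN", "OOP1NN", "CosAM", "SkewAM", "NaiveBayes", "SVM"]
TOKENISATIONS = ["byte", "codepoint"]
NGRAM_ORDERS = [1, 2, 3]

# Every (model, tokenisation, n-gram order) combination.
configs = list(product(MODELS, TOKENISATIONS, NGRAM_ORDERS))
print(len(configs))


def tokenise(text: str, tokenisation: str, n: int) -> list:
    """Slide an n-wide window over bytes (UTF-8 assumed) or over codepoints."""
    units = text.encode("utf-8") if tokenisation == "byte" else text
    return [units[i:i + n] for i in range(len(units) - n + 1)]


word = "caf\u00e9"  # 4 codepoints, but 5 bytes in UTF-8
print(tokenise(word, "byte", 2))
print(tokenise(word, "codepoint", 2))
```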


Baldwin & Lui - Techniques

Skew Divergence

$D(x \parallel y) = \sum_i x_i (\log_2 x_i - \log_2 y_i)$

$\mathrm{skew}_\alpha(x, y) = D(x \parallel \alpha y + (1 - \alpha) x)$

• variant of Kullback-Leibler divergence

• linear interpolation between x and y with smoothing factor α

• α typically 0.99

Reference(s): Lee, 2001
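
A direct transcription of the two formulas into Python, as a sketch under the assumption that x and y are probability vectors of equal length. The function names are hypothetical. Terms with x_i = 0 are skipped, following the usual convention that 0 · log 0 = 0.

```python
import math


def kl_divergence(x, y) -> float:
    """D(x || y) in bits; terms with x_i == 0 contribute nothing."""
    return sum(
        xi * (math.log2(xi) - math.log2(yi))
        for xi, yi in zip(x, y)
        if xi > 0
    )


def skew_divergence(x, y, alpha: float = 0.99) -> float:
    """skew_alpha(x, y) = D(x || alpha * y + (1 - alpha) * x)."""
    mixture = [alpha * yi + (1 - alpha) * xi for xi, yi in zip(x, y)]
    return kl_divergence(x, mixture)


# Plain KL divergence would be undefined here because y has a zero where x does not.
print(skew_divergence([1.0, 0.0], [0.0, 1.0]))
```

Mixing a small share of x into y keeps the second argument non-zero wherever x is non-zero, so the divergence stays finite even for n-grams the prototype never saw.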


Baldwin & Lui - Results

Tokenization: Choice of n-gram order (Wikipedia)

Reference(s): Baldwin and Lui, 2010

Baldwin & Lui - Results

Tokenization: Bytes vs Codepoints (2-gram)

Reference(s): Baldwin and Lui, 2010

Baldwin & Lui - Results

Performance vs Time Taken

Reference(s): Baldwin and Lui, 2010

Baldwin & Lui - Results

The Long Tail

• Wikipedia

• byte bigram

• Skew Divergence

• Nearest Prototype

Language   N     P      R      F
Tamil      6     1.000  1.000  1.000
Japanese   219   0.990  0.992  0.955
English    1629  0.972  0.899  0.934
. . .
Italian    202   0.735  0.906  0.812
Danish     37    0.710  0.595  0.647
Icelandic  10    0.188  0.300  0.231

Reference(s): Baldwin and Lui, 2010


Baldwin & Lui - Results

Confusion Pairs

• Wikipedia

• byte bigram

• Skew Divergence

• Nearest Prototype

From        To          Proportion
Indonesian  Malay       0.405
Malay       Indonesian  0.214
Danish      Norwegian   0.270
Norwegian   Danish      0.043
Russian     Ukrainian   0.090
Ukrainian   Russian     0.043

Reference(s): Baldwin and Lui, 2010


Open Issues

Supporting Minority Languages

Open Class Language Identification

Sparse or Impoverished Training Data

Multilingual Documents

Standard Evaluation Corpora

Performance Evaluation Criteria

Reference(s): Hughes et al., 2006

ALTW 2010 Shared Task

• multiclass text categorization task

• select 2 languages from a closed set of 74

• addresses a number of open issues:
  • Sparse or Impoverished Training Data
  • Multilingual Documents
  • Standard Evaluation Corpora
  • Performance Evaluation Criteria


ALTW 2010 Shared Task Dataset

• 10000 synthetic bilingual documents in 74 languages

• randomly partitioned into
  • 8000 training documents
  • 1000 development documents
  • 1000 test documents

• compiled from static dumps of language-specific Wikipedias

• downloaded between 9 June and 1 August 2008

• selected languages with > 1000 articles


Generating a synthetic bilingual document

• semantic linkage

• language-links: [[<language-prefix>:<page title>]]

1. select primary document

2. select secondary document via language-link

3. normalize: remove redirects, language-links and templates

4. chunk: split on two consecutive paragraphs

5. retain top 50% of paragraphs from primary, bottom 50% from secondary
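
Steps 3 to 5 might look roughly like the sketch below. It is not the organisers' script. The regular expressions are deliberately crude (non-nested templates only), paragraphs are assumed to be separated by blank lines, the rounding for odd paragraph counts is an arbitrary choice, and redirect handling and the selection in steps 1 and 2 are left out. All function names are hypothetical.

```python
import math
import re

LANG_LINK = re.compile(r"\[\[[a-z-]+:[^\]]*\]\]")  # e.g. [[de:Sprache]]
TEMPLATE = re.compile(r"\{\{[^{}]*\}\}")           # non-nested {{...}} only


def normalise(wikitext: str) -> str:
    """Strip language-links and simple templates (step 3, simplified)."""
    return TEMPLATE.sub("", LANG_LINK.sub("", wikitext))


def chunk(text: str) -> list:
    """Split into paragraphs (step 4), assuming blank lines separate them."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def make_bilingual(primary: str, secondary: str) -> str:
    """Top half of the primary article followed by the bottom half of the secondary (step 5)."""
    first = chunk(normalise(primary))
    second = chunk(normalise(secondary))
    top = first[: math.ceil(len(first) / 2)]  # assumption: round up on odd counts
    bottom = second[len(second) // 2:]
    return "\n\n".join(top + bottom)


print(make_bilingual("A1\n\nA2\n\nA3\n\nA4", "B1\n\nB2\n\nB3\n\nB4"))
```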


Evaluation Metrics

Multi-class Text Categorization

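• IR-style performance metrics:

$\mathrm{precision} = \frac{TP}{TP + FP}$

$\mathrm{recall} = \frac{TP}{TP + FN}$

$\text{f-score} = \frac{2 \times \mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$

• macroaveraging vs microaveraging

• competition metric: micro-averaged f-score

Reference(s): Sebastiani, 2002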
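
The difference between the two averaging schemes is easiest to see on a toy example. In the sketch below the per-class counts are invented, the function names are hypothetical, and macro-averaged f-score is taken as the mean of per-class f-scores, which is one common convention and an assumption here.

```python
def prf(tp: int, fp: int, fn: int):
    """Precision, recall and f-score from raw counts (0.0 when undefined)."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


def micro_average(counts: dict):
    """Pool TP/FP/FN over all classes, then compute the metrics once."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    return prf(tp, fp, fn)


def macro_average(counts: dict):
    """Compute the metrics per class, then take the unweighted mean."""
    scores = [prf(*c) for c in counts.values()]
    return tuple(sum(s[i] for s in scores) / len(scores) for i in range(3))


# Hypothetical (TP, FP, FN) counts: one frequent class, one rare class.
counts = {"en": (8, 2, 0), "de": (1, 0, 3)}
print("micro:", micro_average(counts))
print("macro:", macro_average(counts))
```

Micro-averaging is dominated by the frequent class, while macro-averaging gives the rare class equal weight.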


Majority-class baseline

• most common classes
  • en(3330) de(747) fr(747) ja(442)

• most common pairs
  • en-de(1283) en-fr(1053) en-ja(606) en-it(479)

Baseline  PM    RM    FM    Pµ    Rµ    Fµ
en        .011  .015  .012  .701  .350  .467
en+de     .014  .030  .018  .458  .458  .458

(M = macro-averaged, µ = micro-averaged)
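
The micro-averaged columns of this table follow from simple counting. Every document has two gold languages, so a baseline that predicts one label per document has recall equal to half its precision, and one that predicts two labels has precision equal to recall. The sketch below checks the arithmetic; the 1000-document split size and the hit counts (701 and 916) are assumptions back-derived from the reported Pµ values, not figures stated on the slide.

```python
def micro_prf(hits: int, n_predicted: int, n_gold: int):
    """Micro-averaged precision, recall and f-score from pooled counts."""
    p = hits / n_predicted
    r = hits / n_gold
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


N_DOCS = 1000        # assumption: size of the evaluation split
N_GOLD = 2 * N_DOCS  # every document carries exactly two gold languages

# Always predict "en": one prediction per document (hit count is an assumption).
p1, r1, f1 = micro_prf(hits=701, n_predicted=N_DOCS, n_gold=N_GOLD)

# Always predict "en" and "de": two predictions per document (hit count is an assumption).
p2, r2, f2 = micro_prf(hits=916, n_predicted=2 * N_DOCS, n_gold=N_GOLD)

print(f"en:    P={p1:.3f} R={r1:.4f} F={f1:.3f}")
print(f"en+de: P={p2:.3f} R={r2:.4f} F={f2:.3f}")
```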


Nearest-prototype benchmark

Skew Divergence with Arithmetic Mean Language Prototypes

N-Gram  Multiclass  PM    RM    FM    Pµ    Rµ    Fµ
1       single      .440  .274  .295  .264  .132  .176
2       single      .540  .376  .413  .583  .291  .389
3       single      .564  .412  .453  .814  .407  .543
1       stratified  .412  .458  .414  .629  .622  .625
2       stratified  .460  .448  .435  .775  .768  .771
3       stratified  .497  .467  .464  .833  .826  .829
1       binarised   .115  .786  .155  .057  .878  .107
2       binarised   .171  .705  .221  .114  .885  .202
3       binarised   .227  .686  .292  .259  .903  .402


Source(s): Google Translate

Questions?


Reference

Timothy Baldwin and Marco Lui. Language identification: The long and the short of the matter. In Proceedings of Human Language Technologies: The 11th Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL HLT 2010), pages 229–237, Los Angeles, USA, 2010.

William B. Cavnar and John M. Trenkle. N-gram-based text categorization. In Proceedings of the Third Symposium on Document Analysis and Information Retrieval, Las Vegas, USA, 1994.

Baden Hughes, Timothy Baldwin, Steven Bird, Jeremy Nicholson, and Andrew MacKinlay. Reconsidering language identification for written language resources. In Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC 2006), pages 485–488, Genoa, Italy, 2006.

Lillian Lee. On the effectiveness of the skew divergence for statistical language analysis. In Proceedings of Artificial Intelligence and Statistics 2001 (AISTATS 2001), pages 65–72, Key West, USA, 2001.

Fabrizio Sebastiani. Machine learning in automated text categorization. ACM Computing Surveys (CSUR), 34(1):1–47, 2002.
