Multilingual Text Classification using Ontologies
DESCRIPTION
In this paper, we investigate strategies for automatically classifying documents in different languages thematically, geographically or according to other criteria. A novel linguistically motivated text representation scheme is presented that can be used with machine learning algorithms in order to learn classifications from pre-classified examples and then automatically classify documents that might be provided in entirely different languages. Our approach makes use of ontologies and lexical resources but goes beyond a simple mapping from terms to concepts by fully exploiting the external knowledge manifested in such resources and mapping to entire regions of concepts. For this, a graph traversal algorithm is used to explore related concepts that might be relevant. Extensive testing has shown that our methods lead to significant improvements compared to existing approaches.
TRANSCRIPT
Introduction
Techniques
Overview and Summary
Multilingual Text Classification
Gerard de Melo, Stefan Siersdorfer
Max Planck Institute for Computer Science, Saarbrücken, Germany
2007-04-04
G. de Melo, S. Siersdorfer, Max-Planck-Institut Informatik Multilingual Text Classification
Text Classification
task: automatically assign text documents to classes (e.g. thematically, geographically)
machine learning algorithms, e.g. SVM, can learn from pre-classified training documents
multilingual case: documents in multiple languages
applications: news wire filtering, library management, e-mail, etc.
Machine Translation for Multilingual TC
idea: simply translate all documents into a single language L_I
(prior work by Jalam 2002, Rigutini et al. 2005)
shortcomings of this approach:
lexical variety in L_I (English: huge vocabulary, many synonyms)
variety of expression in source languages
lexical ambiguity in L_I (unnecessary introduction of additional ambiguity)
Spanish coche ↦ car
French voiture ↦ automobile
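The synonym problem can be made concrete with a small sketch. The translation table below is a hypothetical stand-in for a machine translation system, and cosine similarity over term counts is only one assumed way of comparing documents; neither is taken from the talk.

```python
from collections import Counter
from math import sqrt

# Hypothetical word-by-word translations into English (the pivot language L_I).
TRANSLATIONS = {
    ("es", "coche"): "car",
    ("fr", "voiture"): "automobile",
}

def term_vector(lang, tokens):
    """Translate each token and count the resulting English terms."""
    return Counter(TRANSLATIONS.get((lang, t), t) for t in tokens)

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

spanish_doc = term_vector("es", ["coche"])
french_doc = term_vector("fr", ["voiture"])

# Both documents are about the same thing, yet their term vectors share
# no feature, because the translations landed on different synonyms.
similarity = cosine(spanish_doc, french_doc)
```

Here `similarity` is zero although both source words denote the same object, which is the lexical-variety shortcoming listed above.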
Semantic Concepts
Idea
map all words to semantic concepts (e.g. WordNet synsets), thus distinguishing different senses of a word while identifying synonyms
disambiguate using context information
construct feature vectors by counting occurrences of concepts rather than terms
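A minimal sketch of the three steps above, under loud assumptions: the lexicon, the concept IDs and the cue words are invented for illustration (they are not WordNet synset identifiers), and the disambiguation rule is a simple context-overlap heuristic, not necessarily the one used by the authors.

```python
from collections import Counter

# Hypothetical lexicon: (language, word) -> candidate concepts, each paired
# with cue words expected nearby. All IDs and cue words are made up.
LEXICON = {
    ("en", "bank"): [("bank.institution", {"money", "loan"}),
                     ("bank.riverside", {"river", "water"})],
    ("en", "car"): [("car.vehicle", set())],
    ("en", "automobile"): [("car.vehicle", set())],
    ("en", "loan"): [("loan.debt", set())],
}

def disambiguate(candidates, context):
    """Pick the candidate whose cue words overlap the context most (first wins ties)."""
    return max(candidates, key=lambda c: len(c[1] & context))[0]

def concept_vector(lang, tokens):
    """Count occurrences of concepts rather than terms."""
    counts = Counter()
    for i, token in enumerate(tokens):
        candidates = LEXICON.get((lang, token))
        if not candidates:
            continue  # word not covered by the lexicon
        context = set(tokens[:i] + tokens[i + 1:])
        counts[disambiguate(candidates, context)] += 1
    return counts

doc_a = concept_vector("en", ["car", "loan", "bank"])
doc_b = concept_vector("en", ["automobile", "river", "bank"])
```

In this toy setup the synonyms "car" and "automobile" fall onto the same feature, while the two occurrences of "bank" are counted as different concepts because their contexts differ.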
Problems
understemming
polysemy: highly related senses are treated as distinct
incongruent concepts between languages
variety of expression
lexical lacunae

| Language | Sentence | Literal English gloss |
|---|---|---|
| English | I have a headache | I have a headache |
| Spanish | Me duele la cabeza | \*It hurts the head to me |
| French | J’ai mal à la tête | \*I have pain at the head |
Weight Propagation
propagate weight from original concepts to related concepts
choose path to c′ maximizing its weight
Dijkstra-like algorithm in order to assign maximal possible weight to a concept
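One way such a Dijkstra-like search could look is sketched below. Several details are assumptions rather than facts from the slides: edge weights lie in (0, 1], the weight of a path is the product of its edge weights, and propagation stops below a hypothetical `threshold`. The graph and its weights are invented.

```python
import heapq

# Hypothetical concept graph: edge weights in (0, 1] express relatedness.
GRAPH = {
    "car": [("motor_vehicle", 0.8), ("wheel", 0.5)],
    "motor_vehicle": [("vehicle", 0.8), ("truck", 0.5)],
    "wheel": [("vehicle", 0.2)],
    "vehicle": [],
    "truck": [],
}

def propagate(graph, source, initial_weight=1.0, threshold=0.1):
    """Return the maximal weight reaching each concept from `source`.

    A path's weight is the initial weight times the product of its edge
    weights. Since no edge weight exceeds 1, weights never grow along a
    path, so the greedy Dijkstra argument carries over.
    """
    best = {}
    heap = [(-initial_weight, source)]  # max-heap via negated weights
    while heap:
        neg_w, concept = heapq.heappop(heap)
        if concept in best:
            continue  # already settled with a weight at least as large
        best[concept] = -neg_w
        for neighbour, edge_w in graph.get(concept, []):
            w = -neg_w * edge_w
            if neighbour not in best and w >= threshold:
                heapq.heappush(heap, (-w, neighbour))
    return best

weights = propagate(GRAPH, "car")
```

In this example "vehicle" is reachable along two paths; the search settles it via "motor_vehicle", the path with the larger product, and ignores the weaker route through "wheel".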
Overview and Summary
Ontology Region Mapping
1 optionally translate the documents – or use a multilingual lexical resource (aligned wordnets)
2 map terms to concepts
3 search for highly related concepts
entire regions of concepts are relevant, so propagate a part of the concept’s weight to related concepts
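The effect of mapping to regions can be illustrated with a deliberately simplified sketch of steps 2 and 3. It assumes a hypothetical aligned lexicon and relatedness graph (all IDs and weights are invented), skips disambiguation, and passes weight to direct neighbours only instead of searching for the best path.

```python
from collections import defaultdict

# Hypothetical aligned lexicon and relatedness graph (IDs and weights made up).
LEXICON = {
    ("en", "headache"): "headache",
    ("es", "duele"): "pain",
    ("es", "cabeza"): "head",
}
RELATED = {
    "headache": [("pain", 0.8), ("head", 0.6)],
    "pain": [("headache", 0.5)],
    "head": [("headache", 0.4)],
}

def region_vector(lang, tokens):
    """Map tokens to concepts, then pass part of each weight to direct neighbours."""
    vector = defaultdict(float)
    for token in tokens:
        concept = LEXICON.get((lang, token))
        if concept is None:
            continue  # word not covered by the lexicon
        vector[concept] += 1.0
        for neighbour, relatedness in RELATED.get(concept, []):
            vector[neighbour] += relatedness
    return dict(vector)

english = region_vector("en", ["i", "have", "a", "headache"])
spanish = region_vector("es", ["me", "duele", "la", "cabeza"])

# Concepts that receive weight in both documents.
shared = sorted(english.keys() & spanish.keys())
```

With a plain term-to-concept mapping the English sentence would only activate "headache" and the Spanish one only "pain" and "head", leaving no common feature. After propagation all three concepts carry weight in both vectors, so a classifier trained on one language has something to match in the other.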