multilingual text classification using ontologies

18
Introduction Techniques Overview and Summary Multilingual Text Classification Gerard de Melo, Stefan Siersdorfer Max Planck Institute for Computer Science Saarbr¨ ucken, Germany 2007-04-04 G. de Melo, S. Siersdorfer, Max-Planck-Institut Informatik Multilingual Text Classification

Upload: gerard-de-melo

Post on 19-Jun-2015

206 views

Category:

Data & Analytics


4 download

DESCRIPTION

In this paper, we investigate strategies for automatically classifying documents in different languages thematically, geographically or according to other criteria. A novel linguistically motivated text representation scheme is presented that can be used with machine learning algorithms in order to learn classifications from pre-classified examples and then automatically classify documents that might be provided in entirely different languages. Our approach makes use of ontologies and lexical resources but goes beyond a simple mapping from terms to concepts by fully exploiting the external knowledge manifested in such resources and mapping to entire regions of concepts. For this, a graph traversal algorithm is used to explore related concepts that might be relevant. Extensive testing has shown that our methods lead to significant improvements compared to existing approaches.

TRANSCRIPT

Page 1: Multilingual Text Classification using Ontologies

IntroductionTechniques

Overview and Summary

Multilingual Text Classification

Gerard de Melo, Stefan Siersdorfer

Max Planck Institute for Computer ScienceSaarbrucken, Germany

2007-04-04

G. de Melo, S. Siersdorfer, Max-Planck-Institut Informatik Multilingual Text Classification

Page 2: Multilingual Text Classification using Ontologies

IntroductionTechniques

Overview and SummaryText Classification

Text Classification

task: automatically assign textdocuments to classes (e.g.thematically, geographically)

machine learning algorithms, e.g.SVM, can learn from pre-classifiedtraining documents

multilingual case: documents inmultiple languages

applications: news wire filtering,library management, e-mail, etc.

G. de Melo, S. Siersdorfer, Max-Planck-Institut Informatik Multilingual Text Classification

Page 3: Multilingual Text Classification using Ontologies

IntroductionTechniques

Overview and SummaryText Classification

Text Classification

task: automatically assign textdocuments to classes (e.g.thematically, geographically)

machine learning algorithms, e.g.SVM, can learn from pre-classifiedtraining documents

multilingual case: documents inmultiple languages

applications: news wire filtering,library management, e-mail, etc.

G. de Melo, S. Siersdorfer, Max-Planck-Institut Informatik Multilingual Text Classification

Page 4: Multilingual Text Classification using Ontologies

IntroductionTechniques

Overview and SummaryText Classification

Text Classification

task: automatically assign textdocuments to classes (e.g.thematically, geographically)

machine learning algorithms, e.g.SVM, can learn from pre-classifiedtraining documents

multilingual case: documents inmultiple languages

applications: news wire filtering,library management, e-mail, etc.

G. de Melo, S. Siersdorfer, Max-Planck-Institut Informatik Multilingual Text Classification

Page 5: Multilingual Text Classification using Ontologies

IntroductionTechniques

Overview and SummaryText Classification

Text Classification

task: automatically assign textdocuments to classes (e.g.thematically, geographically)

machine learning algorithms, e.g.SVM, can learn from pre-classifiedtraining documents

multilingual case: documents inmultiple languages

applications: news wire filtering,library management, e-mail, etc.

G. de Melo, S. Siersdorfer, Max-Planck-Institut Informatik Multilingual Text Classification

Page 6: Multilingual Text Classification using Ontologies

IntroductionTechniques

Overview and Summary

Machine TranslationMapping to Semantic ConceptWeight Propagation

Machine Translation for Multilingual TC

idea: simply translate all documents into a single language LI

(prior work by Jalam 2002, Rigutini et al. 2005)

shortcomings of this approach

lexical variety in LI (English: huge vocabulary, many synonyms)variety of expression in source languageslexical ambiguity in LI (unnecessary introduction of additionalambiguity)

G. de Melo, S. Siersdorfer, Max-Planck-Institut Informatik Multilingual Text Classification

Page 7: Multilingual Text Classification using Ontologies

IntroductionTechniques

Overview and Summary

Machine TranslationMapping to Semantic ConceptWeight Propagation

Machine Translation for Multilingual TC

idea: simply translate all documents into a single language LI

(prior work by Jalam 2002, Rigutini et al. 2005)

shortcomings of this approach

lexical variety in LI (English: huge vocabulary, many synonyms)variety of expression in source languageslexical ambiguity in LI (unnecessary introduction of additionalambiguity)

G. de Melo, S. Siersdorfer, Max-Planck-Institut Informatik Multilingual Text Classification

Page 8: Multilingual Text Classification using Ontologies

IntroductionTechniques

Overview and Summary

Machine TranslationMapping to Semantic ConceptWeight Propagation

Machine Translation for Multilingual TC

idea: simply translate all documents into a single language LI

(prior work by Jalam 2002, Rigutini et al. 2005)

shortcomings of this approach

lexical variety in LI (English: huge vocabulary, many synonyms)variety of expression in source languageslexical ambiguity in LI (unnecessary introduction of additionalambiguity)

Spanish coche 7−→ car

French voiture 7−→ automobile

G. de Melo, S. Siersdorfer, Max-Planck-Institut Informatik Multilingual Text Classification

Page 9: Multilingual Text Classification using Ontologies

IntroductionTechniques

Overview and Summary

Machine TranslationMapping to Semantic ConceptWeight Propagation

Semantic Concepts

Idea

map all words to semantic concepts (e.g. WordNet synsets),thus distinguishing different senses of a word while identifyingsynonyms

disambiguate using context information

construct feature vectors by counting occurrences of conceptsrather than terms

G. de Melo, S. Siersdorfer, Max-Planck-Institut Informatik Multilingual Text Classification

Page 10: Multilingual Text Classification using Ontologies

IntroductionTechniques

Overview and Summary

Machine TranslationMapping to Semantic ConceptWeight Propagation

Semantic Concepts

Idea

map all words to semantic concepts (e.g. WordNet synsets),thus distinguishing different senses of a word while identifyingsynonyms

disambiguate using context information

construct feature vectors by counting occurrences of conceptsrather than terms

G. de Melo, S. Siersdorfer, Max-Planck-Institut Informatik Multilingual Text Classification

Page 11: Multilingual Text Classification using Ontologies

IntroductionTechniques

Overview and Summary

Machine TranslationMapping to Semantic ConceptWeight Propagation

Semantic Concepts

Idea

map all words to semantic concepts (e.g. WordNet synsets),thus distinguishing different senses of a word while identifyingsynonyms

disambiguate using context information

construct feature vectors by counting occurrences of conceptsrather than terms

G. de Melo, S. Siersdorfer, Max-Planck-Institut Informatik Multilingual Text Classification

Page 12: Multilingual Text Classification using Ontologies

IntroductionTechniques

Overview and Summary

Machine TranslationMapping to Semantic ConceptWeight Propagation

Semantic Concepts

Problems

understemming

polysemy: highly related senses are treated as distinct

incongruent concepts between languages

variety of expression

lexical lacunae

English I have a headache I have a headache

Spanish Me duele la cabeza *It hurts the head to me

French J’ai mal a la tete *I have pain at the head

G. de Melo, S. Siersdorfer, Max-Planck-Institut Informatik Multilingual Text Classification

Page 13: Multilingual Text Classification using Ontologies

IntroductionTechniques

Overview and Summary

Machine TranslationMapping to Semantic ConceptWeight Propagation

Weight Propagation

propagate weight from originalconcepts to related concepts

choose path to c ′ maximizingits weight

Dijkstra-like algorithm in orderto assign maximal possibleweight to a concept

G. de Melo, S. Siersdorfer, Max-Planck-Institut Informatik Multilingual Text Classification

Page 14: Multilingual Text Classification using Ontologies

IntroductionTechniques

Overview and Summary

Machine TranslationMapping to Semantic ConceptWeight Propagation

Weight Propagation

propagate weight from originalconcepts to related concepts

choose path to c ′ maximizingits weight

Dijkstra-like algorithm in orderto assign maximal possibleweight to a concept

G. de Melo, S. Siersdorfer, Max-Planck-Institut Informatik Multilingual Text Classification

Page 15: Multilingual Text Classification using Ontologies

IntroductionTechniques

Overview and Summary

Machine TranslationMapping to Semantic ConceptWeight Propagation

Weight Propagation

propagate weight from originalconcepts to related concepts

choose path to c ′ maximizingits weight

Dijkstra-like algorithm in orderto assign maximal possibleweight to a concept

G. de Melo, S. Siersdorfer, Max-Planck-Institut Informatik Multilingual Text Classification

Page 16: Multilingual Text Classification using Ontologies

IntroductionTechniques

Overview and SummaryOverview and Summary

Overview and Summary

Ontology Region Mapping

1 optionally translate the documents – or use a multilinguallexical resource (aligned wordnets)

2 map terms to concepts

3 search for highly related concepts

G. de Melo, S. Siersdorfer, Max-Planck-Institut Informatik Multilingual Text Classification

Page 17: Multilingual Text Classification using Ontologies

IntroductionTechniques

Overview and SummaryOverview and Summary

Overview and Summary

Ontology Region Mapping

1 optionally translate the documents – or use a multilinguallexical resource (aligned wordnets)

2 map terms to concepts

3 search for highly related concepts

G. de Melo, S. Siersdorfer, Max-Planck-Institut Informatik Multilingual Text Classification

Page 18: Multilingual Text Classification using Ontologies

IntroductionTechniques

Overview and SummaryOverview and Summary

Overview and Summary

Ontology Region Mapping

1 optionally translate the documents – or use a multilinguallexical resource (aligned wordnets)

2 map terms to concepts

3 search for highly related concepts

entire regions of concepts arerelevant, so propagate a partof the concept’s weight torelated concepts

G. de Melo, S. Siersdorfer, Max-Planck-Institut Informatik Multilingual Text Classification