
Multilingual support to a proposed Semantic Web architecture

Andrea Ferrato
TOP-UIC MS Thesis, 2003/’04

Advisor: Laura Farinetti


Purpose of this work

- Design and (partially) implement multilingual support on a pre-existing Semantic Web platform
- Provide an approach that is as generic as possible
- Exploit features of the pre-existing architecture
- Cope with the typically chaotic structure of the resources currently available


Outline

- Semantic Web
- Multilinguality
- The DOSE platform
- Proposed solution
- Given implementation
- Experimental results
- Conclusions


Semantic Web

- The next evolutionary stage for the WWW
- Goal: make network data usable by intelligent agents
- Deployable only on top of existing infrastructure
- Two pressing tasks
  - Transform existing contents to include semantics
  - Set up ad hoc user agents to work on them


Transform existing contents

- Basic data units: resources
  - Every single information entity that can be semantically isolated
- Features to be given
  - Identification: URI
  - Structure: XML
  - Meaning: RDF
  - Knowledge: ontologies


Set up ad hoc user agents

- Major players in Semantic Web deployment
- Invoked by users, can proceed autonomously
- Key facilities to be supported
  - Logic
  - Proof
  - Trust


Semantic Web: layer cake view (Berners-Lee)

[Figure: layer cake diagram. Layers, bottom to top: Unicode and URI; XML + NS + XML Schema; RDF + RDF Schema; Ontology vocabulary; Logic; Proof; Trust. Side labels: Digital signatures, Self-describing document, Data, Rules.]


Multilinguality

- The extension to multiple languages of tasks already performed in a monolingual context
- Typical issues from cross-language mapping
  - Lexical gaps
  - Role of the context
  - Lack of pre-acquired knowledge


Multilinguality and Semantic Web

- A problem of Text Retrieval in multiple languages (NLP)
  - Start from popular approaches (Controlled Vocabulary, Free text, etc.)
- Two main requirements
  - Recognize language ID of resources
  - Map contents independently from language


Language ID retrieval

- Two possible scenarios
  - Retrieve a given ID via resource parsing
  - Recreate the ID via resource analysis
- When retrieving a declared language attribute, conform to existing language specification standards


Language ID specification

- Content-language
- CSS-level declarations
- “lang” attribute
- Language inheritance
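As a rough illustration of retrieving a declared language ID by parsing a resource, the sketch below collects two kinds of declaration from an HTML page: a Content-Language value given through a `<meta http-equiv>` element and any `lang` attributes on elements. It uses Python's standard `html.parser` and is not the module described in this work; the class name `LangCollector` and the sample page are hypothetical.

```python
from html.parser import HTMLParser


class LangCollector(HTMLParser):
    """Collects declared language IDs while parsing an HTML document."""

    def __init__(self):
        super().__init__()
        self.content_language = None  # from <meta http-equiv="Content-Language">
        self.lang_attributes = []     # (tag, lang) pairs, in document order

    def handle_starttag(self, tag, attrs):
        # html.parser passes tag and attribute names already lowercased.
        attributes = dict(attrs)
        http_equiv = (attributes.get("http-equiv") or "").lower()
        if tag == "meta" and http_equiv == "content-language":
            self.content_language = attributes.get("content")
        if attributes.get("lang"):
            self.lang_attributes.append((tag, attributes["lang"]))


collector = LangCollector()
collector.feed(
    '<html><head><meta http-equiv="Content-Language" content="it"></head>'
    '<body><h1 lang="fr">Le Bilboquet</h1><p>Un passe-temps</p></body></html>'
)
print(collector.content_language)  # it
print(collector.lang_attributes)   # [('h1', 'fr')]
```

Language inheritance and CSS-level declarations are left out of this sketch; they would need the element tree and the style sheets, respectively.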


Language-independent contents mapping

- Investigate the form/meaning relationship
- Ontology design is crucial
- Three main requirements

1. Consistency (based on linguistic evidence)

2. Flexibility (meaningful for all languages)

3. Extendibility (easy addition of new languages)


Ontology models

- Conceptual: founded upon general knowledge
- Language-based: built on a particular language
- Interlingua: a combination of the above two
- None is definitely superior for multilinguality


The DOSE platform

Distributed Open Semantic Elaboration platform

- Key features
  - Modularity
  - Scalability
  - Semantic integration
- Main functionalities offered
  - Annotation
  - Search


DOSE: layered view

[Figure: DOSE layered architecture. Modules shown: Indexer, Search Engine, Semantic Mapper, Fragment Retriever, Substructure Extractor, Annotation Repository, Ontology, Synset. Layers shown: Service layer, Back-end layer, Front-end layer.]


DOSE: distributed view

[Figure: DOSE distributed view. The Ontology, Synset, Fragment Retriever, Substructure Extractor, Semantic Mapper, Annotation Repository, Indexer and Search Engine modules are connected through an XML-RPC infrastructure.]


DOSE: annotation

[Figure: annotation workflow with numbered steps (1 to 11) involving the Indexer, the Fragment Retriever, the Substructure Extractor, the Semantic Mapper and the Web.]


DOSE: search

[Figure: search workflow with numbered steps (1 to 8) involving the Search Engine, the Semantic Mapper, the Fragment Retriever and the Web.]


DOSE and multilinguality

- Traditionally: a new ontology for each different language
- DOSE: the ontology language is totally independent of the synset language
  - Use synsets to store lexical representations only
  - Let the ontology focus on knowledge modeling


Practical requirements for multilinguality

- Indexing
  - Recognize the language of resources and set up the system accordingly
  - Store language IDs with annotations
- Search
  - Interpret user queries coming in natural languages
  - Allow for cross-language search tasks


Extension to language

- Proposed approach: one ontology, many synsets
  - A concept is expressed by a different synset for each supported language
  - Each synset contains multiple lexical representations of a related concept in a single language
- Separate semantic and textual layers


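Extension to language (cont’d)

[Figure: the concept “job” linked to three synsets (one concept, three synsets)]
- Italian synset: lavoro, stipendio, datore di lavoro, …
- English synset: salary, job, employment, …
- French synset: travail, chomeur, …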
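The "one ontology, many synsets" idea can be pictured as a mapping from each concept to one word set per language. The following is a minimal sketch under that assumption, not the RDF representation used by the platform; the names `SYNSETS`, `concepts_for` and `add_language` are hypothetical, and the Spanish words are made up for the example.

```python
# One ontology concept, one synset per supported language.
# The word lists for "job" come from the slide; everything else is illustrative.
SYNSETS = {
    "job": {
        "it": {"lavoro", "stipendio", "datore di lavoro"},
        "en": {"salary", "job", "employment"},
        "fr": {"travail", "chomeur"},
    },
}


def concepts_for(word, lang, synsets=SYNSETS):
    """Return the concepts whose synset for `lang` contains `word`."""
    word = word.lower()
    return sorted(
        concept
        for concept, by_lang in synsets.items()
        if word in by_lang.get(lang, set())
    )


def add_language(concept, lang, words, synsets=SYNSETS):
    """Supporting a new language only adds a synset; the concept is untouched."""
    synsets[concept][lang] = {w.lower() for w in words}


print(concepts_for("stipendio", "it"))  # ['job']
print(concepts_for("salary", "en"))     # ['job']
print(concepts_for("salary", "it"))     # []

add_language("job", "es", ["trabajo", "sueldo"])  # hypothetical Spanish synset
print(concepts_for("sueldo", "es"))     # ['job']
```

Words in different languages resolve to the same concept identifier, which is what keeps the semantic layer separate from the textual one.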


Advantages

- Reduced implementation requirements
  - Ontology design
  - Resource occupation
- Simplicity (in ontology management)
- Flexibility
  - A new language just brings a new bag of synsets
- Expansion of indexing word set


Language recognition

- Proposed approach
  - Retrieve language IDs whenever present
  - Otherwise, recognize language(s)
- Design constraints
  - To be activated in the annotation phase
  - Refined at the document substructure level
  - Has to deal with the typically low authoring quality of Web documents


Language recognition (cont’d)

1. Validate explicit request

2. Retrieve “lang” value

3. Guess via heuristics

4. Retrieve from ancestor

5. Accept default

Examples:
- <P lang=“ru”>: Russian
- “There was an Old Man of Coblenz, The length of whose legs was immense…”: English
- default = “it”: Italian
- <H1 lang=“fr”>Le Bilboquet</H1><P>C’était un vieux passe-temps…: <P> is French
- Hindi: Hindi synset?
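The five steps can be read as a fallback chain, tried in the order listed. The sketch below follows that reading, which is an assumption about the slide's numbering rather than a description of the actual Language Detector module. The stop-word heuristic, the `SUPPORTED` set and all function names are hypothetical stand-ins.

```python
SUPPORTED = {"it", "en"}  # languages with a synset available (assumption)

# Tiny stop-word lists standing in for a real heuristic (assumption).
STOPWORDS = {
    "en": {"the", "of", "was", "an", "whose", "there"},
    "it": {"il", "la", "di", "che", "un", "era"},
}


def guess_language(text, min_hits=2):
    """Step 3 stand-in: language with the most stop-word hits, or None."""
    words = [w.strip(".,;:!?") for w in text.lower().split()]
    hits = {lang: sum(w in sw for w in words) for lang, sw in STOPWORDS.items()}
    best = max(hits, key=hits.get)
    return best if hits[best] >= min_hits else None


def recognize(text, requested=None, lang_attr=None, ancestor_lang=None, default="it"):
    """Try each source in turn and return the first language ID found."""
    if requested in SUPPORTED:      # 1. validate explicit request
        return requested
    if lang_attr:                   # 2. retrieve "lang" value
        return lang_attr
    guessed = guess_language(text)  # 3. guess via heuristics
    if guessed:
        return guessed
    if ancestor_lang:               # 4. retrieve from ancestor
        return ancestor_lang
    return default                  # 5. accept default


limerick = "There was an Old Man of Coblenz, The length of whose legs was immense"
print(recognize("", lang_attr="ru"))                                 # ru
print(recognize(limerick))                                           # en
print(recognize("C'était un vieux passe-temps", ancestor_lang="fr"))  # fr
print(recognize("Gomuku"))                                           # it
print(recognize("", requested="hi"))                                 # it
```

In the last call the explicit request for Hindi fails validation because no Hindi synset is assumed to exist, so the chain falls through to the default.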


Current implementation

- A new English synset to couple with a disability ontology (~500 concepts)
- A set of 20 bilingual documents (Italian, English) on disability
- A basic Language Detector XML-RPC module implemented in Java
- Testing scenarios
  - Parallel annotation
  - Language recognition


Implementation work

- Language Detector module (Java, ~1000 lines of code)
- Additions to pre-existing modules (Java, ~1000 lines of code)
- English synset (RDF, ~3500 lines of code)
- ~24 Mb of annotations produced
- Simulation results analysis (a 600x40 .XLS for <BODY>, a 925x250 .XLS for <Hx>)


Multilingual DOSE in action


Parallel annotation

- Two parallel documents have
  - The same structure elements with the same contents
  - Two different languages of expression
- Goal: demonstrate that two sets of parallel documents are (almost) symmetrically mapped to the same concepts (“parallel annotation”)
- Both sets indexed separately, with language explicitly specified


Parallel annotation (cont’d)

- Test methodology: “Vector Space Model”
- Document fragments described as vectors
  - Dimensions are ontology concepts
  - Components are weighted (tf/idf) occurrences of such concepts
- The correlation between two fragments is quantified as the cosine of the angle between their vectors
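A minimal sketch of the correlation measure, assuming each fragment is stored as a sparse mapping from concept to weight. The concept names and weights below are hypothetical, and the tf/idf weighting itself is taken as already applied.

```python
import math


def cosine(a, b):
    """Cosine of the angle between two sparse vectors given as {concept: weight}."""
    dot = sum(weight * b.get(concept, 0.0) for concept, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for an empty fragment (assumption)
    return dot / (norm_a * norm_b)


# Hypothetical weighted concept occurrences for three fragments.
fragment_it = {"Employment": 2.0, "Pension": 1.0}
fragment_en = {"Employment": 4.0, "Pension": 2.0}  # same direction, different length
fragment_other = {"Wheelchair": 3.0}               # no concept in common

print(round(cosine(fragment_it, fragment_en), 3))     # 1.0
print(round(cosine(fragment_it, fragment_other), 3))  # 0.0
```

Because the cosine ignores vector length, two fragments that mention the same concepts in the same proportions score 1 even if one of them is much longer.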


Parallel annotation (cont’d)

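[Figure: the Italian and English fragment vectors plotted on the X and Y concept axes; the correlation is the cosine of the angle between them]
- Italian, IT/html/body/p[3]: X: Part-time job (2.5), Y: Retirement (0)
- English, EN/html/body/p[3]: X: Part-time job (1.5), Y: Retirement (1.5)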
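With the weights of this example, the correlation between the two fragments works out as follows (the vector names are introduced here only for the calculation):

```latex
\[
\mathbf{v}_{IT} = (2.5,\ 0), \qquad \mathbf{v}_{EN} = (1.5,\ 1.5)
\]
\[
\cos\theta
= \frac{\mathbf{v}_{IT}\cdot\mathbf{v}_{EN}}{\lVert\mathbf{v}_{IT}\rVert\,\lVert\mathbf{v}_{EN}\rVert}
= \frac{2.5\cdot 1.5 + 0\cdot 1.5}{2.5\cdot\sqrt{1.5^{2}+1.5^{2}}}
= \frac{3.75}{2.5\cdot 1.5\sqrt{2}}
= \frac{1}{\sqrt{2}} \approx 0.71
\]
```

The angle between the two vectors is therefore 45 degrees.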


Parallel annotation results at <BODY> level

[Figure: distribution of the correlation factor at <BODY> level. Horizontal axis: Correlation factor (0 to 1). Vertical axis: Normalized frequency. Series: Parallel fragments, Others.]


Correlation results at <BODY> level

[Figure: correlation factor for each pair of Italian and English pages at <BODY> level. Axes: Italian pages, English pages, Correlation factor. Legend bands: 0-0,2; 0,2-0,4; 0,4-0,6; 0,6-0,8.]


Correlation results at <BODY> level (alt)

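[Figure: alternative view of the correlation factor for each pair of Italian and English pages at <BODY> level. Axes: Italian pages, English pages, Correlation factor. Legend bands: 0-0,2; 0,2-0,4; 0,4-0,6; 0,6-0,8.]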


Parallel annotation results at <Hx> level

[Figure: distribution of the correlation factor at <Hx> level. Horizontal axis: Correlation factor (0 to 1). Vertical axis: Normalized frequency. Series: Parallel fragments, Others.]


Parallel annotation: notes

- Parallel and nonparallel pairs can be grouped as two different distributions (i.e. Gaussian distributions)
- Average values of the two distributions are clearly separated, both for <BODY> and <Hx> levels
  - This shows that the indexing system is able to annotate relevant document fragments independently from language


Language recognition

- Separate testing on the same document set
- Italian and English documents are alternated in batch processing
  - Avoid reuse of default settings for contiguous documents of the same language
- Two ways to retrieve ancestor language
  - Via Annotation Repository (acceptable)
  - Via a “Language Stack” (still inefficient)


Annotation Repository vs. Language Stack

<BODY lang="en">

<H1 lang="it">Passatempi</H1>

<H2 lang="en">Board Games</H2>

<P>Gomuku</P><P>Dama</P>…

- All cyan, underlined words are to be annotated (they are included in the synsets)
- Language Stack: Dama is ignored (language “en” inherited from <H2>)
- Annotation Repository: Dama is annotated (language “it” inherited from <H1>, which was annotated)
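One possible reading of the difference between the two strategies is sketched below: the Language Stack takes the most recently declared language, while the Annotation Repository takes the language of the nearest element that was actually annotated. This is an interpretation of the slide, not the documented behavior of either component. Which elements count as annotated, the synset contents and all names in the code are assumptions chosen to reproduce the outcome stated above.

```python
# Elements preceding <P>Dama</P>, outermost first: (tag, declared lang, annotated?).
# The "annotated" flags are assumptions made to match the slide's outcome.
ENCLOSING = [
    ("BODY", "en", False),
    ("H1", "it", True),   # "Passatempi" annotated in Italian
    ("H2", "en", False),  # assumed to have produced no annotation
]

# Hypothetical synset contents.
SYNSET_WORDS = {"it": {"passatempi", "dama"}, "en": {"hobbies", "checkers"}}


def lang_via_stack(enclosing):
    """Language Stack: take the most recently declared language."""
    return enclosing[-1][1]


def lang_via_repository(enclosing):
    """Annotation Repository: take the language of the nearest annotated element."""
    for _tag, lang, annotated in reversed(enclosing):
        if annotated:
            return lang
    return None


def is_annotated(word, lang):
    """A word is annotated only if the synset for `lang` contains it."""
    return word.lower() in SYNSET_WORDS.get(lang, set())


stack_lang = lang_via_stack(ENCLOSING)
repo_lang = lang_via_repository(ENCLOSING)
print(stack_lang, is_annotated("Dama", stack_lang))  # en False
print(repo_lang, is_annotated("Dama", repo_lang))    # it True
```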


Language recognition results (via Annotation Repository)

[Figure: recognition percentage for each analyzed page. Horizontal axis: Analyzed pages. Vertical axis: Recognition percentage. Series: Hit percentage (%), Hit average (%).]


Conclusions

- Typical issues discussed
- Overall validity of the approach shown
- Further work and improvements
  - Synset composition
  - Annotation testing with more languages
  - Optimize proposed language recognition techniques, add new ones


Thank you…

Questions?


Language recognition (2)

[Figure: recognition percentage for each analyzed page, comparing the Annotation Repository and the Language Stack. Horizontal axis: Analyzed pages. Vertical axis: Recognition percentage. Series: Percentage Anno, Percentage Stack, Average Anno, Average Stack.]
