a pivot xml-based architecture for … 032-391.pdf · avancée, formations doctorales, colloques et...

* [email protected], CLIPS, GETA, IMAG-CAMPUS, BP 53, 38041 Grenoble Cedex 9, France§ [email protected], CLIPS, GETA, IMAG-CAMPUS, BP 53, 38041 Grenoble Cedex 9, France

©Convergences ’03 1 December 2 – 6, 2003, Alexandria, EGYPT

International Conference on the Convergence of Knowledge, Culture, Language and Information Technologies

A "PIVOT" XML-BASED ARCHITECTURE FOR MULTILINGUAL,MULTIVERSION DOCUMENTS : PARALLEL MONOLINGUAL DOCUMENTSALIGNED THROUGH A CENTRAL CORRESPONDENCE DESCRIPTOR AND

POSSIBLE USE OF UNL

Najeh Hajlaoui* and Christian Boitet§

Abstract -- We propose a structure for multilingual,multiversion documents, built on the model of the web-oriented, cooperative lexical multilingual data basePAPILLON: a document is represented by a collection ofmonolingual XML "volumes" interlinked by a central volumeof "interlingual links". Here, the links relate subdocuments(XML trees) corresponding to each other in monolingual"volumes". We are developing a Java application to enabledirect editing of a multilingual document through the web, atthe level of monolingual volumes as well as through bilingualor trilingual interfaces inspired by those of commercial"translation workbenches". Another goal is easy integrationwith machine translation and multilingual generation tools.For this, we add a special UNL volume.

Index Terms -- Multilingual Parallel documents (MPD),multilinguization, internationalisation, fine-graineddocument alignment, pivot architecture, MLD.dtd, UNL,XML, UNL-xml, UNL-html.

INTRODUCTIONDue to Internet, the number of available documents growsdramatically. There is a strategic need for companies tocontrol information written in more than 30 languages (HP,IBM, MS, Caterpillar). This requires the installation ofpowerful and effective management tools of multilingual"synchronized" documents.

There are techniques of large-grained linking (on the level ofHTML pages). However, there are no techniques forstructuring multilingual documents so as to allow fine-grained synchronization (at paragraph or sentence level) andeven less permitting editability through the Web.

The interest to synchronize at least on the level of thesentences is double:

• for the translation and human revision with the assistanceof techniques of HTHM (Human Translation Helped byMachine) and in particular of translation memory.

• for the increase in the number of languages of amultilingual document, it would be useful to synchronizethe versions of multilingual documents with arepresentation such as the multilingual UNL documentformat, allowing to increase the number of languages ofthe document in an economic way by calling distantdeconverters.

This paper is organized as follows.

In the first part, we put our research in perspective with theUNL project (Universal Networking Language). We showthe advantage and the limits of the UNL format, and discusssome aspects related to the management of the informationsystems.

FIGURE 1: EXAMPLE OF ALIGNMENT

In the second part, we present a possible solution to managethe correspondences between the linguistic versions of amultilingual document: it consists in splitting a document inUNL format in several monolingual documents.

L’institut IMAG IMAG instituteest une fédération de 7 unités de recherche du CNRS (FR 0071), del’Institut National Polytechnique de Grenoble (INPG) et de l’UniversitéJoseph Fourier (UJF). L’IMAG représente une communauté de 650personnes (dont la moitié de doctorants) qui se consacre à la formation et àla recherche en informatique et mathématiques appliquées.

The Computer Science and Applied Mathematics Institute of Grenoble(IMAG) accounts for most of the academic research in these domains inGrenoble. IMAG is a federation of seven laboratories, comprising about650 people, jointly established in Grenoble by CNRS, INPG and UJF.

Depuis 1988, l’Institut IMAG est l’interlocuteur des tutelles, descollectivités territoriales et des industriels ou institutions avec lesquels ilmène des partenariats pluriannuels ; coordonne et anime la vie scientifiqueinter et supra-laboratoires : mise en évidence de projets de recherchesoulignant les axes scientifiques de l’Institut, projets d’expérimentationavancée, formations doctorales, colloques et écoles!; gère les ressourcescommunes aux différents laboratoires : réseau et moyens informatiques,médiathèque, services électronique et infographie, cellules communicationet multimédia, affaires internationales.

These laboratories have a long standing tradition of cooperation withindustry and of active participation in European programs. They may becredited with an indisputable ability to apply their results and transfer theirknow-how from research to industry.

Un enseignement supérieur de pointe Top level university trainingLes scientifiques de l’Institut IMAG participent à la formation de plus de 1000 étudiants de second et troisième cycle de l’ENSIMAG (école del’INPG) et de l’UFR d’Informatique et Mathématiques Appliquées (UJF).

IMAG university training higher education is given each year to 1500students by members of IMAG (professors and researchers) in oneEngineering School of INPG (ENSIMAG), one University Department ofUJF (UFRIMA), and in six other joint graduate schools..

© Convergences ’03 2 December 2 – 6, 2003, Alexandria, EGYPT


The third part is devoted to the reconstitution of linksbroken between the documents, and to a mockup andprototypes of interfaces.

In the conclusion, we show the flexibility of such a structureof multilingual, multiversion documents, and itsapplicability in several domains.

PROBLEMS

Situation of the problem

There are many multilingual documents, which aremodified separately (leaflets, booklet...). After a certaintime, we wish to make them coherent [1]. That meansfinding the correspondences (alignments) and reconstitutinga complete and coherent (monolingual) "source" document.For this, modifications in target languages have to betranslated into the source language. A. Assimi, in the workof his PhD, treated the case of the non-centralizedmanagement of the evolution of multilingual paralleldocuments.In the industry, it is frequent that documents are managedon the same platform without being linked at a fine-grainedlevel like that of sentences or paragraphs. For example,technical documents are usually aligned at the level ofHTML pages. Generally, free modification by readers (finalusers) is not authorized (whereas it is usually permitted forleaflets in Word).

Several problems appear in real life:

1. As shown in Figure 1, alignment (based on sentencesconsidered to be exact mutual translations of eachother) may be quite sparse, even with only 2 languages,after only one batch of modifications in one language.

2. There is no explicit link between the monolingual (real)documents constituting the (virtual) multilingualdocument.

3. In some contexts like the European Heritage web site, aUNL document is also built in parallel, as a simple listof UNL-graphs, with no document structure. Theproblem can then be abstracted as in Table 1 below.

Language 1(FR)

Language2 (EN)

……

UNL

jFR1 jEN

1 g1

jFR2 f g2

f jEN3 g3

…

jFRn jEN

n gm

…

jFRN-FR jEN

N- EN gM

TABLE 1: CORRESPONDENCES BETWEEN SENTENCES

jli = sentence with identifier i in language l

gm= UNL graph representing the meaning of one(occurrence of a) sentence

A very simple idea is to seek an identifier for a set ofequivalent sentences, with

j1 @ j2 if and only if UNL (j1) = UNL (j2)

• jli @ jl’

i’ if and only if s (jli) = s (jl’

i’)

s is the equivalence of the intuitive means buttestable by a human translation

• jli @ jl’

i’ if and only if r(jli) = r (jl’

i’)

r is defined in a restrictive and operational way.Here, r = UNL.

A first problem is to calculate links from the UNL graphs tothe sentences in each monolingual document. They may bemodeled as a function

P : 1..M ¥ L Æ N or as a relation in 1..M ¥ L ¥ N.

If we choose the first possibility, a UNL graph (in theparallel UNL document) cannot be linked by P to morethan 1 sentence in any language, which implies that 2identical UNL graphs can appear in the central list. The ideais that, after some reordering and duplication, the list ofUNL graphs can be linked to the list of sentences (theterminal nodes) of the xml structure of each monolingualdocument, with "no crossing". In other words, P is thenmonotonically increasing in its first component.

We might also choose the second possibility, where P is arelation, so that all occurrences of sentences with the samemeaning could be linked to the same UNL graph. Then, theparallel UNL file should represent a set of UNL graphs,with no possible repetition.

However, both these possibilities lead to problems. Let usshow it on the first only (P is a function). Then,

P (m, l) = n if and only if

1. d (gm, l) ª jln where d stands for "deconversion"

(from UNL)

2. l (jln) = gm where l stands for "enconversion"

(into UNL)

P (m, l) = nil otherwise (gm does not correspond to anysentence).

To establish the links between the UNL graphs and thesentences implies then to call all deconverters on all graphs,and to compare the results with the actual sentences. Butdeconverters are constantly updated, may be unavailable atsome time, and sentences may also be modified by hand.Hence, with all probability, only very few links will beestablished. What would be needed is a process to comparethe meaning of a sentence present in a document with thatof a sentence produced by deconversion "on-the-fly". Butthat is a hard and perhaps harder problem!

We can also attack the problem from the other side, that is,we can try to establish links from the sentences to the UNLgraphs. This linking is the inverse y of P. Again, y can bea function or a relation. In the UNL format, it is a function,which implies that, if a sentence is truly ambiguous andcorresponds to several different UNL graphs, one of themhas to be chosen in the representation. Let us adopt thisrestriction.



We have then y : N ¥ L Æ N, and

y (n, l) = m if and only if P (m, l) = n.

We encounter a similar problem: to compute y, we have to"enconvert" each sentence, and compare the result with theUNL graphs in the list. But (1) enconversion is harder thandeconversion, and (2) the UNL language allows for morethan one way of representing a given interpretation of asentence.

We should then develop techniques to test the synonymy of2 UNL graphs… but it is quite certain that any proposedsolution will be incomplete, because the problem ofdeciding whether 2 formal expressions have the samemeaning is undecidable as soon as the considered formulaspertain to a rich enough formal system. For example, it isundecidable whether 2 java programs compute the samefunction.

This shows that the solution consisting in putting someUNL-related or UNL-like representation as a centralstructure leads to problems. It also imposes the addeddifficulty to build a correct and complete UNL-xmldocument.

Hence, our solution will be to design a specific centralstructure linked to all sentences of all monolingualdocuments, and to the UNL graphs. A separate problemwill be to determine whether some intersection or union ofthe monolingual document structures should be reflected inthe central structure or not.

ß Evolution of the versions of a multilingual document

We introduce the term "polyphrase" to denote a set ofsentences in several languages and UNL graphs, formedfrom an initial set of such elements, the "kernel" of thepolyphrase, deemed to be semantically equivalent.

In most cases, the kernel is simply one sentence in a givenlanguage, all other sentences are obtained by translation orcorrections, and the UNL graphs by enconversion and thendirect edition or coedition.

While the kernel corresponds to exactly one intendedmeaning, the evolution of the polyphrase may introducenew meanings. To trace them, we need to add a notion ofversion to the elements of a polyphrase, and by extension toall parts of a multilingual document.

The passage to a new version can happen in many cases:

• correction of errors.

• human revision.

• addition of another language.

• change of order of linguistic objects.

• addition of new polyphrases.

The preceding points are important factors, which influencethe increase in the number of versions of a multilingualdocument, and the unalignment rate of these versions.

ß Coherence of the versions

The coherence of the versions is directly related to theconcept of alignment. 2 versions in 2 languages will said to

be "coherent" if their aligned documents are mutualtranslations of each other. Alignments should go at least tothe level of sentences. In our first mockup (see below), westop there, but finer units such as segments and words maybe quite useful to help human translators or posteditors.

The coherence of the versions of the database is distinctfrom that of an environment of translation; the graph ofdependence is fixed and the ascending translation processrespecting alignment generates a coherent version.

A new version then traverses a development cycle until itbecomes frozen and/or validated, before entering in a stateof "public" availability. It can then be used in a translationmemory.

Advantages of the UNL language and limits of the UNLformat

We choose UNL [2] as our interlingua for various reasons:

1. it is specifically designed for linguistic and semanticmachine processing,

2. it derives with many improvements from H.!Uchida'spivot used in ATLAS-II (Fujitsu), still judged as the bestquality MT system for English-Japanese, with a largecoverage (586,000 lexical entries in each language in2001),

3. participants of the UNL project1 have built"deconverters" from UNL into about 12 languages, andat least the Arabic, Indonesian, Italian, French, Russian,Spanish, and Thai deconverters were accessible forexperimentation through a web interface in spring 2003,

4. although formal, UNL graphs (see below) are quite easyto understand with little training and may be presentedin a "localized" way to naive users by translating UNLsymbols (semantic relations, attributes) and lexemes(UWs) into symbols and lexemes of their language,

5. the UNL project has defined a format embedded in htmlfor files containing a complete multilingual documentaligned at the level of utterances, and produced a"visualizer" transforming a UNL file into as many htmlfiles as languages, and sending them to any webbrowser.

The UNL representation of a text is a list of "semanticgraphs", each expressing the meaning of a natural languageutterance. Nodes contain lexical units and attributes; arcsbear semantic relations. Connex subgraphs may be definedas "scopes", so that a UNL graph may be a hypergraph.

The lexical units, called Universal Words (UW2), represent(sets of) word meanings, something less ambitious thanconcepts. Their denotations are built to be intuitivelyunderstood by developers knowing English, that is, by alldevelopers in NLP. A UW is an English term or specialsymbol (number…) possibly completed by semanticrestrictions: the UW "process" represents all word meaningsof that lemma, seen as citation form (verb or noun here),

1 http://unl.ias.unu.edu2 Universal Word, or Unit of Virtual Vocabulary



and "process(icl>do, agt>person)" covers only the meaningsof processing, working on, etc.

The attributes are the (semantic) number, genre, time,aspect, modality, etc., and the 40 or so semantic relationsare traditional "deep cases" such as agent, (deep) object,location, goal, time, etc.

One way of looking at a UNL graph corresponding to anutterance in language L is to say that it represents theabstract structure of an equivalent English utterance "seenfrom L", that is, where semantic attributes not necessarilyexpressed in L may be absent (e.g., aspect coming fromFrench, determination or number from Japanese, etc.).

The UNL format, whether UNL-html or UNL-xml, givesfor the moment a simple solution: a multilingual documentis only one large file where the alignment of the variousversions (languages and revisions) is done at the level ofeach sentence. But, in general, two parallel documents intwo different languages cannot be aligned at this level.Indeed, a sentence in L1 can correspond to two or threesentences in L2 and conversely (m-n possibility).

Moreover, the order of a list of sentences or paragraphs canvary from one language to another (for example because ofa lexicographical sorting). Thus, the idea from where weleft in the introduction is good, but must be refined.

Aspects related to the management of informationsystems

The problem of management of correspondences andcoherence of MPDs (Multilingual Parallel Documents) stillremains open: there is no adequate concrete solution, indeedthere is a lack of tools, methods, practices and models todescribe, maintain and refine the correspondences betweenversions of the same document in several languages.

An important point is that the suggested techniques must beusable in practice and as practical as possible in the knowninformation systems. Let us see how the problem is posedon this level.

ß Centralized management

In the case of centralized management of documents, theproblem is easier to solve as soon as (1) a unique XMLformat is used for exchanging and storing data, and (2) thereis a central place to describe and control thecorrespondences between linguistic versions. Thedisadvantage, however, is that the freedom to modifyindividual versions is limited.

FIGURE 2: CORRESPONDENCE BETWEEN CENTRALIZEDDOCUMENTS (XML FORMATS)

Indeed, the life cycle of a multilingual document organizedin this way has to be controlled from the start using certainmechanisms of observation and protection of thecorrespondences.

ß Non-centralized management

There are many cases where the various versions of adocument are not centralized, for instance because theyhave to be processed with different tools on differentplatforms. To realign them after a series of modificationshave been done is quite difficult, and to rebuild a coherentcomplete original is even more difficult.

On the formatting side, there are m-n possibility ofcorrespondences for several distributed documents, ndifferent formats and 2n filters.

A. Assimi [1] analyzed and solved a part of these and otherproblems posed by the management of the non-centralizedevolution of multilingual parallel documents. He used astructuring of the multilingual texts by a multicolumn table,which is not practicable for documents of big size (technicaldocumentation, catalogues...). In his thesis, he reports thatthis simple solution worked for certain needs of customers,but was limited to the management of small documents suchas the brochure of the IMAG Institute (Informatics andApplied Mathematics at Grenoble), which containsapproximately 2000 words, that is 8 standard pages oftranslation, or 4 pages of Word.

ß Principe of solution

Starting from the study made in the two preceding cases, wesee the need for designing tools and methods allowingpractical management of large multilingual documents. Inparticular, it is necessary to describe and to maintainlinguistic correspondences at a very fine level between nversions in m languages, while allowing new versions toappear in any language independently of others.

For that, the idea is to represent the correspondencesbetween the structural trees of n parallel monolingualdocuments by a separate structure, of a different type,connecting fragments of trees with as few constraints aspossible, as is done for the macrostructure of themultilingual lexical data base PAPILLON.

THE VERSIONING PROBLEM AND AFIRST SOLUTION

We simply adopt the solution implemented in PAPILLON(storage of the modifications on standby in the form ofXSLT transformations in the private space of eachcontributor) and draw from our preliminary experiment inmanagement of versions for XML documents representingvirtual electronic components.

In order to manage the successive versions of a multilingualdocument, we introduce the concept of status of a version.

Status of documents and versions

The status of any part of a document can be:

• modifiable : when its contents can still undergomodifications.

XMLdocument

2

Correspondence

XMLdocument

3

XMLdocument

1

XMLdocument

4

Word

Interleaf

Anotherformat

• frozen: when its contents cannot be modified but arenot yet validated.

• validated: when its contents have been validated. Avalidated part may be put on some sharable referencespace.

We define the order: modifiable < frozen < validated.

Suppose a multilingual document has content in n languages(including UNL if present).

The “last version” of any part of this document is the n-upleconsisting of the maximum version number of all itspolyphrases.

A “version” of a document is any n-uple of version numbersless or equal to the last version (component by component).

The status of a version of a document is the minimum of thestatuses of the subdocument corresponding to that version.

From a multilingual document to several monolingualdocuments

The basic idea is to separate the monolingual documentsand to represent their correspondences in an autonomous"pivot" structure. It was also the idea of A. Assimi, but weuse it here in a context where the formats to besynchronized are standard XML formats. We find it too inPAPILLON, where each dictionary of lexies (wordmeanings or monolingual acceptions) is represented by anXML file, as well as the "pivot" or “hub” formed by theaxies (links between lexies).

In addition, more and more annotations are introduced intodocuments for various applications (IR, summary,categorization...). They can be annotations related to thelanguage (like GDA of K. Hashida) or annotations onlyrelated to the contents (graphs UNL, semantic categories...).

At this point, we consider two ways of separating themonolingual documents: partial separation and totalseparation.

ß Partial Separation

Let us suppose for the moment that we have a multilingualdocument in UNL-xml format aligned on the level of thesentence. Suppose we want to switch to the non-centralizedmanagement situation, for example to let 15 persons edit thesame document in 15 languages. The idea of partialseparation is then to split the UNL-xml representation into15 monolingual documents enriched by the original content(source language) and its UNL representation as shown inthe following example.

FIGURE 3: PARTIAL SEPARATION OF A MULTILINGUALDOCUMENT

This makes it possible to make local modifications in eachlanguage and thus to introduce different versions. Here, forexample, the sentence "He eats fruits" becomes "He iseating fruits" with the corresponding modification of UNL-xml format, and a second version of the English documentappears (figure 4).

FIGURE 4: EVOLUTION OF MONOLINGUAL DOCUMENT

ß Total Separation

Here, we split the UNL-xml representation in severalmonolingual documents by considering the fact that theoriginal is also a monolingual document as well as its UNLrepresentation.

<?xml version="1.0" ?><unl:D unl:dn="d1" unl:on="Najeh"unl:mid="[email protected]" ><unl:P unl:number="1"><unl:S unl:number="1"><unl:org unl:lang="fr">Il mange les fruits.</unl:org><unl:unl>agt(eat.@entry, he)obj(eat.@entry, fruit.@pl)</unl:unl><unl:GS unl:lang="es">come los frutos.</unl:GS><en>He eats fruits.</en></unl:S></unl:D>

<?xml version="1.0" ?><unl:D unl:dn="d1" unl:on="Najeh"unl:mid="[email protected]" ><unl:P unl:number="1"><unl:S unl:number="1"><unl:org unl:lang="fr">Il mange les fruits.</unl:org><unl:unl>agt(eat.@entry, he)obj(eat.@entry, fruit.@pl)</unl:unl><en>He eats fruits.</en></unl:S></unl:D>

Monolingualdocument 2(EN)

Monolingualdocument 1

(Es)

<?xml version="1.0" ?><unl:D unl:dn="d1" unl:on="Najeh"unl:mid="[email protected]" ><unl:P unl:number="1"><unl:S unl:number="1"><unl:org unl:lang="fr">Il mange les fruits.</unl:org><unl:unl>agt(eat.@entry, he)obj(eat.@entry, fruit.@pl)</unl:unl><unl:GS unl:lang="es">come los frutos.</unl:GS></unl:S></unl:D>

Multilingual document

<?xml version="1.0" ?><unl:D unl:dn="d1" unl:on="Najeh"unl:mid="[email protected]" ><unl:P unl:number="1"><unl:S unl:number="1"><unl:org unl:lang="fr">Il mange les fruits.</unl:org><unl:unl>agt(eat.@entry, he)obj(eat.@entry, fruit@pl)</unl:unl><en v=1 >He eats fruits.</en></unl:S></unl:D>

Monolingual document 2 (EN)Version 1

<?xml version="1.0" ?><unl:D unl:dn="d1" unl:on="Najeh"unl:mid="[email protected]" ><unl:P unl:number="1"><unl:S unl:number="1"><unl:org unl:lang="fr">Il mange les fruits.</unl:org><unl:unl>agt(eat.@entry@progress, he)obj(eat.@entry@progress, fruit@pl)</unl:unl><en v=2 >He is eating fruits.</en></unl:S></unl:D>

Monolingual document 2 (EN)Version 2

FIGURE 5: TOTAL SEPARATION OF A MULTILINGUALDOCUMENT

This separation of UNL-xml representation can beimproved by gathering technical information common to themonolingual documents in the same document ofdescription. That is possible using XML facilities forcreating and managing metadata.

Discussion

In the first technique of separation

• the autonomous evolution of each linguistic version ispossible; that constitutes an important advantage forhuman revision.

• the source language, the target language and the UNLrepresentation are in the same file, which allows thesimple reuse of tools and interfaces of "traditional"MAHT (Machine-Aided Human Translation) systems,there must be a source text and a target text.

• There exist "local" UNL tools which begin to be reallyused in practice.

In the second technique and since we have only one UNL-xml file, controlled and centralized at the level of sentences,this last file cannot remain strictly parallel with eachlinguistic version; it has to some extent to reflectmodifications. For example, if we replace in the French filea sentence by two sentences, it will be necessary to leavethe UNL graph for the old large sentence in the UNL fileand to add 2 new UNL graphs.

Consequently, the two preceding techniques are notsatisfactory and there remains the problem of themaintenance of the correspondences.

If modifications are done in all the versions, we cannot usethe UNL file as "center" also serving to memorize thesemodifications.

The principle of our solution is inherited from the area oftechnical document management and from the PAPILLONproject. This solution is based on two important points:

• Monotony: never erase anything in any "volume" (anXML file) but add new evolutionary versions.

• Modularity: represent the correspondences in a separateway.

We propose the following diagram:

FIGURE 6: CORRESPONDENCE BETWEEN SEVERALDOCUMENTS

SECOND SOLUTION: A CENTRALREPRESENTATION OF ALL

CORRESPONDENCES BETWEENMONOLINGUAL AND UNL CONTENT

Logical view

The idea is to represent the correspondences between thevarious linguistic versions in the form of links in a centralstructure. These links can be numbers of sentences in thecase of a simple local structure such as a large XML file,which includes all the data, the URLs of XML and DTDfiles representing the versions of each language. It is tosome extent a question of following the life cycle of eachversion, of conserving a complete history of themodifications and applying thereafter the list of themodifications made to the parallel versions to keepalignment.

When a new revision is created, it is necessary to keep atrace identifying the reason for this modification. Moreover,information to be annotated on the object to be replaced inthe document is predefined: author, date of operation,optional comment describing the cause of operation.

In what follows, we propose a representation of thecorrespondence between the linguistic versions whichhighlights the dependence of the data.

In the next figure, indicates a correspondence linkXi

<?xml version="1.0" ?><unl:D unl:dn="d1" unl:on="Najeh"unl:mid="[email protected]" ><unl:P unl:number="1"><unl:S unl:number="1"><unl:org unl:lang="fr">Il mange les fruits.</unl:org> </unl:S></unl:D>

<?xml version="1.0" ?><unl:D unl:dn="d1" unl:on="Najeh"unl:mid="[email protected]" ><unl:P unl:number="1"><unl:S unl:number="1"> <unl:GS unl:lang="es">come los frutos.</unl:GS></unl:S></unl:D>

<?xml version="1.0" ?><unl:D unl:dn="d1" unl:on="Najeh"unl:mid="[email protected]" ><unl:P unl:number="1"><unl:S unl:number="1"><en>He eats fruits.</en></unl:S></unl:D>

<?xml version="1.0" ?><unl:D unl:dn="d1"unl:on="Najeh"unl:mid="[email protected]" ><unl:P unl:number="1"><unl:S unl:number="1"><unl:org unl:lang="fr">Il mange les fruits.</unl:org><unl:unl>agt(eat.@entry, he)obj(eat.@entry, fruit@pl)</unl:unl><unl:GS unl:lang="es">come los frutos.</unl:GS><en>He eats fruits.</en></unl:S></unl:D>

<?xml version="1.0" ?><unl:D unl:dn="d1" unl:on="Najeh"unl:mid="[email protected]" ><unl:P unl:number="1"><unl:S unl:number="1"> <unl:unl>agt(eat.@entry, he)obj(eat.@entry, fruit@pl)</unl:unl></unl:S></unl:D>

UNL format

Original document

Multilingual document

Monolingualdocument 1 (Es)

Monolingualdocument 2 (EN)

XMLdocument 2

Correspondence

XMLdocument 3

XMLdocument 1

XMLdocument 4

Word

Interleaf

Anotherformat

UNL

Link of nature possiblydifferent from the linkswhich exists between aDocument and thecorrespondence

FIGURE 7: TREE XML AND REPRESENTATION OFCORRESPONDENCES

The XML tree conforms to the MLD.dtd (MultilingualLanguage Document). Xn represents a link between thelinguistic versions. For example, X000001 is a link betweenthe French version « Il mange les fruits » and the Englishversion "He eats fruits" constituting the first polyphrase. Westore the set of these links in the XML file, as well as thehistory of the modifications made to each version.

Physical view

Data will be stored on a central server in two ways:

• A “Postgres” data base

• File descriptors written in XML, and conforming to acertain DTD, by default our MLD.dtd (MultilingualLanguage Documents).

The data stored in the database comprises all that relates tothe effective management of XML files and access rights onthe server.

Data tables gather the following information:

• Correspondence between the linguistic versions, theirfiles descriptors in XML, and their DTD.

• URLs of various files (XML, DTD) in order to allowsearching and handling them on the server.

• Total information on the level of a version formanaging it, and for checking access rights: name,version, author, creation date, planning date in the caseof versions under development, translation systemused, and some comments.

• Access and modification rights (import and export).

A version can have two states:

• Private Version : the version is stored on the userworkstation, this version can be reloaded and modified.

• Published Version : the version is stored on the server.It results from the decision to publish a Private Version.

A first mockup (TraCorpEx project)

After having studied possible architectures and datastructures, we have started practical experiments in theframework of the TraCorpEx project. Two parallel corporain Japanese-English are available [3]. The first comprises

162000 sentences from the CSTAR project and the second214000 sentences from the PAPILLON project.

To easily manage these corpora using XML, we defined aDTD, MLD.dtd, corresponding to the general structure ofmultilingual documents. MLD (Multilingual LanguageDocuments) is evolutionary and allows to add otherlanguages to these corpora.

ß MLD (MultiLingual Documents)

FIGURE 8 : MLD (MULTILINGUAL DOCUMENTS)

This DTD respects the tree structure of the corpora, as wellas the dependencies which arise from the translationprocess, as we go down the tree towards the contents. Itdescribes a format for multilingual, multiversion documentswith m languages and n versions, n>m, and represents at thesame time the correspondences between the parallel parts.

A multilingual document is a set of organizationalinformation (name of the document, creation date, lastmodification date, numbers of languages, numbers ofpolyphrase) plus a set of polyphrases.

A polyphrase is the set of linguistic versions of the samesegment, which have one attribute in common, a uniquenumber. They are also identifiable by other attributes: thelanguage, and for each language the version of the content.In these corpora, the level of alignment is the sentence, butit can go down to a finer level of segments and words. Inother corpora, we might go up to the level of paragraph, ifsentences are not perfectly aligned.

ß Interfaces

At this point, the storage format adopted in TraCorpEx is anXML file, which respects MLD.dtd. Upper levels concernthe division into corpora, then into sections (import files),then into sentences. Further levels give a hierarchicalstructure to a polyphrase: language, original and versions,distances, administrative information for tracing etc. Ateach level, some information is encoded as XML attributes.

To add French to these corpora, we have begun to use thecommercial MT system Systran-Pro/EF and to revise theresults. We plan to run other MT systems and to chooseautomatically the "best" translation using the distancesbetween the retrotranslations and the orignal English. In

<!ELEMENT document (information, polyphrase*) ><!ELEMENT information (#PCDATA) ><!ATTLIST information document-name CDATA#REQUIRED><!ATTLIST information creation-date CDATA#IMPLIED><!ATTLIST information modification-date CDATA#IMPLIED><!ATTLIST information number-language CDATA#IMPLIED><!ATTLIST information number-polyphrase CDATA#IMPLIED>

<!ELEMENT polyphrase (linguistic-version*) ><!ATTLIST polyphrase number CDATA#REQUIRED>

<!ELEMENT linguistic-version(segment) ><!ATTLIST linguistic-version language CDATA#REQUIRED><!ELEMENT segment(#PCDATA) >

Document

Information

Document-name = «!corpus »Creation-date =! «!!»

Modification-dateNumber-language =!«!2!»Number-package =! «!211997!»

Polyphrase

Number = «!000001!»Linguistic-version

Language = «!FR!»segment =!«!Il mange les fruits!»

Linguistic-version

Language =! «!EN!»

segment = «!He eats fruits »

Polyphrase

Polyphrase

Number =! «!000002!»!Linguistic-version

X000001X000002

….Xi......Xn

case of conflict, we will also use distances between thetranslations, to group them, and between translations andoriginal, to detect those with more unknown words, leftuntranslated.

Last but not least, further elaboration, again using stringdistances, will provide various feedbacks to the developersof the MT systems thus used.

A java program has been developed to calculate the distancebetween two character strings and to post the result in theform of a matrix and an XML file directly presentable inWord “Track changes” format. Prototypes of two interfaceshave also been produced. The “preparation” interface allowsto submit the English sentences to two or three EF MTsystems and to compute the “best” translation of eachsentence. It also computes distances between Englishoriginal sentences, so that the document can be used as atranslation memory in the following step.

FIGURE 9 : INTERFACE 1 “PREPARATION”

The second interface is for human revision of the bestsuggestion using an English zone: we can correct words orexpressions and use the translation memory which is in thiscase the multilingual document itself.

FIGURE 10 : INTERFACE 2 “REVISION”

A third interface will be built for the preparation offeedbacks to the developers of the MT systems used. It will

allow to calculate and validate the words unknown or badlytranslated by each system, and to provide translationsuggestions from “reference” translations obtained afterhuman revision. It will also provide comparisons betweenthe various systems used, always thanks to the computationof distances at the level of the characters or words.

CONCLUSIONThe proposed structure of multilingual multiversiondocuments is technically flexible and modifiable on theinitiative of the administrator. It is declared in a hierarchicalway in the form of an XML DTD and can be tailored toeach corpus of multilingual structured documents aligned atthe level of sentences. The hope is that it can contribute tothe standardization of multilingual documents, needed tofacilitate their management and evolution.

The approach presented here is quite flexible and allowsany description of file and directory by XML tags, formultiple applications, among which multilingualinformation retrieval, multilingual summary, multilingualcategorization and of course all types of translation.

REFERENCES[1] Al-Assimi A.-B. (2000) Gestion de l'évolution non centralisée dedocuments parallèles multilingues. Nouvelle thèse, UJF, Grenoble,31/10/00, 200 p.[2] Blanc É. & Sérasset G. (2001) From Graph to Tree: Processing UNLGraphs using an Existing MT System. Proc. The first UNL openConference, Suzhou, China, 22-26 November 2001, UNDL.[3] Boitet C. (2003) Approaches to enlarge bilingual corpora of examplesentences to more languages. Proc. Papillon-03 seminar, Hookaidouniversity, Sapporo, 3-5 July 2003, 13 p.[4] <url> Emmanuel Planas Ph.D. Thesis.http://bibliotheque.imag.fr/theses/1998/Planas.Emmanuel/these.dir/[5] <url> Multilingual corpora (ITC, Italy).http://tcc.itc.it/people/forner/multilingualcorpora.html[6] <url> UNL project (Universal Networking Langage).http://www.undl.org/[7] <url> W3-TR/REC. http://www.w3.org/TR/REC-xml[8] Al-Assimi A.-B. & Boitet C. (2001) Management of Non-CentralizedEvolution of Parallel Multilingual Documents. Proc. InternationalizationTrack, 10th International World Wide Web Conference, Hong Kong, May1-5, 2001, 7 p.[9] Boguslavsky I., Frid N., Iomdin L., Kreidlin L., Sagalova I. & SizovV. (2000) Creating a Universal Networking Language Module within anAdvanced NLP System. Proc. COLING-2000, Saarbrücken,31/7—3/8/2000, ACL & Morgan Kaufmann, vol. 1/2, pp. 83-89.[10] Boitet C. & Tsai W.-J. (2002) Coedition to share text revision acrosslanguages. Proc. COLING-02 WS on MT, Taipeh, 1/9/2002, 8 p.[11] Hajlaoui N. (2002) Gestion des versions des composantséléctroniques virtuels. Rapport de DEA, CSI, INPG, juin 2002, 80 p.[12] Sérasset G. & Boitet C. (1999) UNL-French deconversion astransfer & generation from an interlingua with possible qualityenhancement through offline human interaction. Proc. MT Summit VII,Singapore, 13-17 September 1999, Asia Pacific Ass. for MT, J.-I. Tsujiied., pp. 220—228.[13] Sérasset G. & Boitet C. (2000) On UNL as the future "html of thelinguistic content" & the reuse of existing NLP components in UNL-relatedapplications with the example of a UNL-French deconverter. Proc.COLING-2000, Saarbrücken, 31/7—3/8/2000, ACL, 7 p.[14] Tomokiyo M., Al-Assimi A.-B. & Boitet C. (2001) Multilingualdocuments management by using Universal Networking Language UNL onAlignment Gestion Tool OGA. Proc. PACLING'01, Fukuoka, 7 p.[15] Tsai W.-J. (2001) SWIIVRE- a web site for the Initiation,Information, Validation, Research and Experimentation on UNL. Proc.First UNL Open Conference - Building Global Knowledge with UNL,Suzhou, China, 18-20 Nov. 2001, 8 p.