Metadata for multilingual content management: a practical experience with the SARE-Bi system

Díaz, Abaitua, Jacob, Quintana [1] and Araolaza [2]. DELi (Universidad de Deusto) [1], www.deli.deusto.es; CodeSyntax [2], www.codesyntax.com. Translating and the Computer 25.



Page 1: Metadata for multilingual content management A practical experience with the SARE-Bi system

DELi (Universidad de Deusto) [1], CodeSyntax [2]

www.deli.deusto.es www.codesyntax.com

Translating and the Computer 25

Metadata for multilingual content management
A practical experience with the SARE-Bi system

Díaz, Abaitua, Jacob, Quintana [1] and Araolaza [2]

Page 2: Metadata for multilingual content management A practical experience with the SARE-Bi system

T&tC 25 (2003) 2 DELi (UD)

Problem description

• Goal: rapid multilingual delivery of publishable documents
  • still a challenge, because
  • automatically translated text usually needs post-translation processing

• Multilingual document publication
  • is not only translation
    – requires more functions than those that MT offers
  • text quality is a must in some environments

Page 3: Metadata for multilingual content management A practical experience with the SARE-Bi system

Case study

• University of Deusto (Bilbao, Spain)
  • generates a high number of administrative docs
  • most of them in Spanish and Basque (euskara), official languages of the Basque Country
  • some also in English, French, Italian...

• Administrative documents
  • big (statutes, regulations, reports...)
  • small (calls, announcements, minutes, letters...)
  • one sentence (“Please, do not smoke here”)

Page 4: Metadata for multilingual content management A practical experience with the SARE-Bi system

Case study

• Who reads the documents?
  • a Department (e.g. 20 people)
  • the employees (a thousand people)
  • the students (20,000 people)

• Document quality is a concern
  • independent of the number of people who will read the document
  • independent of the importance/size of the doc.
  • “politically incorrect” to publish a bad document, either in Spanish or in Basque

Page 5: Metadata for multilingual content management A practical experience with the SARE-Bi system

Case study: fieldwork

• Procedure (almost fixed)
  • a “writer” writes the original document (in one language)
  • he sends it to a “translator”
  • the “translator” writes the other language version
  • she sends it back to the “writer”
  • he publishes the multilingual document

• Almost 100% of original writing is in Spanish
  • Basque: a minority language
  • many can read/understand, only a few can write

Page 6: Metadata for multilingual content management A practical experience with the SARE-Bi system

Case study: fieldwork

• Cost of translation
  • mainly an economic concern (institution can only afford to translate “important” documents)
  • but also a problem of time (urgent documents)

• Key: many documents follow a “template”
  • short letters, calls, invitations...
  • repeated weekly, monthly, yearly...
  • small changes (date, place, name...)
    – “writers” take advantage of this, REUSING
    – but “translators” CAN NOT REUSE

Page 7: Metadata for multilingual content management A practical experience with the SARE-Bi system

How can MT help?

• Goal: to increase the number of multilingual documents generated in our University

• No Spanish to Basque MT tool yet
  • although a big research effort is being made
  • anyway, quality?
  • translation is an important step, but not the only one

• Translators use some MAT tools
  • term base
  • translation memories evaluated, still not in operation

Page 8: Metadata for multilingual content management A practical experience with the SARE-Bi system

Solution (1): a document management system

• Organising the documents
  • cumulative document repository
  • classified under several criteria

• Multilingual functionality
  • showing explicitly the textual correspondence between parts (segments) of documents

• Collaborative system
  • writers and translators share the documents
  • allows other stages of the publication procedure to be implemented

Page 9: Metadata for multilingual content management A practical experience with the SARE-Bi system

Solution (2): translation memories

• Experience of DELi
  • automatic extraction of translation memories from bilingual (es-eu) docs (XTRA-Bi project, 2000-2001)
  • several gigabytes of TMX files
  • unorganised chunks of text segments

• Multilingual segmented document system
  • not only the document as a whole
  • if we show the correspondence of multilingual segments
  • then the system is also a translation memory (TMX) repository
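
The idea that a store of aligned segments doubles as a translation memory repository can be sketched as follows. This is a minimal illustration, not the SARE-Bi exporter: the function name `segments_to_tmx`, the header values and the sample sentence pair are assumptions made for the example. Only the TMX 1.4 element names (`tmx`, `header`, `body`, `tu`, `tuv`, `seg`) come from the TMX format itself.

```python
import xml.etree.ElementTree as ET

# Qualified name that ElementTree serialises as the attribute xml:lang.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def segments_to_tmx(aligned_segments, srclang="es"):
    """Build a TMX 1.4 tree from a list of {language: text} dicts.

    Hypothetical helper: each dict is one aligned segment and becomes
    one translation unit (<tu>) with one variant (<tuv>) per language.
    """
    tmx = ET.Element("tmx", version="1.4")
    ET.SubElement(tmx, "header", {
        "creationtool": "sketch",        # illustrative values only
        "creationtoolversion": "0.1",
        "segtype": "paragraph",          # the slides say segments are paragraphs
        "o-tmf": "none",
        "adminlang": "en",
        "srclang": srclang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for segment in aligned_segments:
        tu = ET.SubElement(body, "tu")
        for lang, text in segment.items():
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text
    return tmx


# Sample data invented for the example (es-en rather than es-eu).
aligned = [
    {"es": "Por favor, no fumen aquí", "en": "Please, do not smoke here"},
]
tmx_root = segments_to_tmx(aligned)
tmx_text = ET.tostring(tmx_root, encoding="unicode")
print(tmx_text)
```

Because every stored segment already carries its counterparts in the other languages, exporting a translation memory needs no extra alignment step, only serialisation.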

Page 10: Metadata for multilingual content management A practical experience with the SARE-Bi system

Solution (3): metadata

• Chaotic accumulation of contents
  • difficult management, search, retrieval...

• Metadata
  • document = content + metacontent
  • semantic web, ontologies, content syndication...
  • XML technology as architecture

• TEI (Text Encoding Initiative)
  • not so much for the purpose of linguistic mark-up
  • for structural and metadata aspects (TEI header)

Page 11: Metadata for multilingual content management A practical experience with the SARE-Bi system

SARE-Bi: a first tour

• SARE-Bi
  – multilingual document management system
  – allows incremental compilation of documents
  – allows users to work collaboratively
  – uses metadata as a conceptual mechanism
  – can also be seen as a memory-based machine translation system

• Demo

Page 12: Metadata for multilingual content management A practical experience with the SARE-Bi system

SARE-Bi: functions

• Retrieving docs.
  – filtering
    • based on metadata
  – searching
    • free text
    • any language
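
The two retrieval modes above can be sketched as follows. This is a minimal in-memory illustration, assuming a hypothetical `Document` class and invented metadata keys and sample texts. The slides do not describe how SARE-Bi implements filtering or search.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """Hypothetical document record: metadata plus aligned segments."""
    title: str
    metadata: dict                                 # e.g. {"genre": "certificate"}
    segments: list = field(default_factory=list)   # each segment: {lang: text}


def filter_documents(documents, **criteria):
    """Filtering: keep documents whose metadata match every key/value pair."""
    return [d for d in documents
            if all(d.metadata.get(k) == v for k, v in criteria.items())]


def search_segments(documents, query):
    """Free-text search in any language.

    Returns (document, aligned segment) pairs, so a hit in one language
    also shows the corresponding text in the other languages.
    """
    needle = query.casefold()
    return [(d, seg) for d in documents for seg in d.segments
            if any(needle in text.casefold() for text in seg.values())]


# Sample data invented for the example.
docs = [
    Document("No smoking notice",
             {"genre": "notice", "state": "validated"},
             [{"es": "Por favor, no fumen aquí",
               "en": "Please, do not smoke here"}]),
    Document("Attendance certificate",
             {"genre": "certificate", "state": "non-validated"},
             [{"es": "Certificado de asistencia",
               "en": "Certificate of attendance"}]),
]

validated = filter_documents(docs, state="validated")
hits = search_segments(docs, "asistencia")
print([d.title for d in validated])
print([seg["en"] for _, seg in hits])
```

Returning the whole aligned segment for each hit is what makes the search behave like translation memory browsing.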

Page 13: Metadata for multilingual content management A practical experience with the SARE-Bi system

SARE-Bi: filtering results

• One document per row
  – visualisation link
  – modification link

Page 14: Metadata for multilingual content management A practical experience with the SARE-Bi system

SARE-Bi: visualisation

• Export tool
  – TEI & TMX

• Complete doc.
  • useful for copying

• Segmented doc.
  • useful to see language correspondence

Page 15: Metadata for multilingual content management A practical experience with the SARE-Bi system

SARE-Bi: search results

• Found segments
  – in all document languages
  – equivalent to translation memories browsing

• visualisation link

Page 16: Metadata for multilingual content management A practical experience with the SARE-Bi system

SARE-Bi: adding a document (first step)

• User supplies:
  – non-automatic metadata (almost all)
  – document languages (may be only one)

Page 17: Metadata for multilingual content management A practical experience with the SARE-Bi system

SARE-Bi: adding a document (second step)

• User input, segmentation and alignment
  – user can verify that these tasks have been carried out correctly

• Same page for document modification

Page 18: Metadata for multilingual content management A practical experience with the SARE-Bi system

SARE-Bi: components (general)

• Corpus of multilingual documents
  • annotated (TEI-like), segmented, and aligned
  • segments are paragraphs
  • automatic processes

• Metadata associated with each document
  • guidelines of the TEI header
  • usual data: title, dates, author, place, centre...
    – most important metadata: category, state, visibility
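
Paragraph-level segmentation and alignment can be illustrated with a small sketch. It assumes the simplest possible strategy: paragraphs are separated by blank lines and are paired by position, with a flag raised for manual review when the language versions have different paragraph counts. The slides only state that these processes are automatic, so the function names and the strategy are assumptions.

```python
import re
from itertools import zip_longest


def split_paragraphs(text):
    """Segment a text into paragraphs at blank lines."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def align_by_position(versions):
    """Pair the n-th paragraph of every language version.

    versions: {language: full text}. Returns (aligned, needs_review), where
    aligned is a list of {language: paragraph} dicts and needs_review is True
    when the versions differ in paragraph count (missing cells are "").
    """
    langs = list(versions)
    columns = [split_paragraphs(versions[lang]) for lang in langs]
    aligned = [dict(zip(langs, row))
               for row in zip_longest(*columns, fillvalue="")]
    needs_review = len({len(column) for column in columns}) > 1
    return aligned, needs_review


# Sample text invented for the example.
versions = {
    "es": "Estimado profesor:\n\nLa reunión será el lunes.",
    "en": "Dear professor,\n\nThe meeting will be on Monday.",
}
aligned, needs_review = align_by_position(versions)
for segment in aligned:
    print(segment)
print("needs review:", needs_review)
```

The `needs_review` flag corresponds to the verification step in which the user checks that segmentation and alignment came out correctly.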

Page 19: Metadata for multilingual content management A practical experience with the SARE-Bi system

SARE-Bi: metadata (categorisation of documents)

• Hierarchical taxonomy of several levels
  – 3 functions, 25 genres, and 256 topics (UD)
  – e.g. a certificate of attendance at a short course has:
    • 1-function informative
    • 2-genre certificate
    • 3-topic attendance

30000/ inquirir (inquire)
  31100/ ficha (record form)
    31101/ aceptación o renuncia de beca (acceptance or waiver of grant)
    31102/ boletín de inscripción (registration form)
    31103/ datos de viaje (travel details)
    31104/ modelo de pago (payment form)
    31105/ relación de coordinadores departamentales (list of departmental coordinators)
    31106/ planificación actividad de profesores (planning of teaching staff activity)
    31107/ prácticas (practical training)
    31108/ datos estadísticos (statistical data)
    31109/ boletín subscripción revista (journal subscription form)
  31200/ impreso (printed form)
    31201/ de solicitud de beca (grant application)
    31202/ de solicitud de expediente (academic record request)
    31203/ de solicitud de admisión (admission application)
    31204/ de solicitud de alojamiento (accommodation request)
    31205/ de programa Sócrates (Socrates programme)
    31206/ de matrícula (enrolment)
    31207/ factura (invoice)
    31208/ recibí (receipt)
    31209/ petición de fotocopias (photocopy request)
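
The category codes in the listing suggest that the three taxonomy levels are packed into one five-digit code. The sketch below rests on that reading, which is an inference from the sample and not something the slides state: the first digit is taken as the function, the next two digits as the genre, and the last two as the topic. `TAXONOMY` holds a few entries copied from the listing, and `category_path` is a hypothetical helper.

```python
# A few entries copied from the listing above (Spanish labels as in the system).
TAXONOMY = {
    "30000": "inquirir",
    "31100": "ficha",
    "31101": "aceptación o renuncia de beca",
    "31200": "impreso",
    "31207": "factura",
}


def category_path(code):
    """Return the (code, label) chain function > genre > topic for a code.

    Assumed encoding: digit 1 = function, digits 2-3 = genre,
    digits 4-5 = topic. Codes that name a function or a genre yield a
    shorter chain because the higher levels coincide with the code itself.
    """
    function_code = code[0] + "0000"
    genre_code = code[:3] + "00"
    chain = []
    for level_code in (function_code, genre_code, code):
        if level_code not in chain:
            chain.append(level_code)
    return [(c, TAXONOMY[c]) for c in chain]


print(category_path("31207"))
print(category_path("31100"))
```

Under this encoding the parent categories of a document can be derived from its code alone, which would make filtering by function or genre a simple prefix comparison.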

Page 20: Metadata for multilingual content management A practical experience with the SARE-Bi system

SARE-Bi: metadata (state and visibility)

• Dynamic behaviour
  • users change state/visibility during the edition cycle
  • to show the composition/multilingual situation of the document
  • metadata other than these are static (fixed values)

• State
  • non-validated, validated, normative

• Visibility
  • rough draft, confidential, shared, public

Page 21: Metadata for multilingual content management A practical experience with the SARE-Bi system

SARE-Bi: components (users)

• Mainly associated with tasks in the system
  – guests, writers, translators, administrators

• But also related to permissions
  – document owner: user that added it

• Complex set of permissions
  – a rule for each task, that involves:
    • owner
    • metadatum state
    • metadatum visibility
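
A rule-per-task permission scheme of this kind can be sketched as below. The state and visibility values and the four user roles are taken from the slides. The two concrete rules (`can_view`, `can_translate`) are invented for illustration, since the slides do not list the actual SARE-Bi rules.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    NON_VALIDATED = "non-validated"
    VALIDATED = "validated"
    NORMATIVE = "normative"


class Visibility(Enum):
    ROUGH_DRAFT = "rough draft"
    CONFIDENTIAL = "confidential"
    SHARED = "shared"
    PUBLIC = "public"


@dataclass
class User:
    name: str
    role: str  # "guest", "writer", "translator" or "administrator"


@dataclass
class Document:
    owner: str  # name of the user who added the document
    state: State
    visibility: Visibility


def can_view(user, doc):
    """Hypothetical rule: depends on owner and visibility."""
    if user.role == "administrator" or user.name == doc.owner:
        return True
    if doc.visibility is Visibility.PUBLIC:
        return True
    if doc.visibility is Visibility.SHARED:
        return user.role in ("writer", "translator")
    return False


def can_translate(user, doc):
    """Hypothetical rule: depends on role, state and visibility."""
    return (user.role == "translator"
            and doc.visibility is Visibility.SHARED
            and doc.state is State.NON_VALIDATED)


# One rule per task, as the slide describes.
RULES = {"view": can_view, "translate": can_translate}


def is_allowed(task, user, doc):
    return RULES[task](user, doc)


# Invented users and document for the example.
ana = User("ana", "writer")
miren = User("miren", "translator")
visitor = User("visitor", "guest")
doc = Document(owner="ana", state=State.NON_VALIDATED,
               visibility=Visibility.ROUGH_DRAFT)

print(is_allowed("view", ana, doc), is_allowed("view", visitor, doc))
doc.visibility = Visibility.SHARED
print(is_allowed("translate", miren, doc))
```

Keeping the rules in a table keyed by task means a new task only needs a new function that reads the same three inputs: owner, state and visibility.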

Page 22: Metadata for multilingual content management A practical experience with the SARE-Bi system

SARE-Bi: typical edition cycle

1 A writer adds a monolingual document
  • on creation: visibility draft, state non-validated
  • on finish: visibility shared (for example)
  • he calls the translator

2 A translator does the translation
  • assigns state as validated
  • she calls back the writer

3 The writer retrieves the bilingual document
  • and publishes it
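
The three steps above can be traced as changes to the two dynamic metadata. This is a sketch under stated assumptions: the class and function names are hypothetical, publication is modelled as setting visibility to public, and only validated documents may be published. The slides give the state and visibility values but not these details.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """Hypothetical record holding the two dynamic metadata."""
    owner: str
    texts: dict                       # {language: text}
    state: str = "non-validated"      # value on creation, per the slides
    visibility: str = "rough draft"   # value on creation, per the slides
    history: list = field(default_factory=list)

    def log(self, who, action):
        self.history.append((who, action, self.state, self.visibility))


def writer_finishes(doc, visibility="shared"):
    """Step 1, on finish: the writer opens the document to the translator."""
    doc.visibility = visibility
    doc.log(doc.owner, "finished")


def translator_validates(doc, translator, lang, text):
    """Step 2: the translator adds the other version and validates."""
    doc.texts[lang] = text
    doc.state = "validated"
    doc.log(translator, "translated")


def writer_publishes(doc):
    """Step 3: assumed here to mean setting visibility to public."""
    if doc.state != "validated":
        raise ValueError("this sketch only publishes validated documents")
    doc.visibility = "public"
    doc.log(doc.owner, "published")


# Invented users and text for the example.
doc = Document(owner="ana", texts={"es": "Por favor, no fumen aquí"})
writer_finishes(doc)
translator_validates(doc, "miren", "en", "Please, do not smoke here")
writer_publishes(doc)
for entry in doc.history:
    print(entry)
```

The history list makes the composition and multilingual situation of the document visible at each stage, which is the role the slides give to state and visibility.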

Page 23: Metadata for multilingual content management A practical experience with the SARE-Bi system

SARE-Bi: edition cycle variations

• Bilingual writer
  • could develop bilingual document
  • translator work is greatly simplified: she only has to revise translation

• Normative document
  • model or template in its category
  • state normative assigned by the translator
  • a bilingual writer could use it for a new document without translator intervention
  • frequent in administrative environment

Page 24: Metadata for multilingual content management A practical experience with the SARE-Bi system

SARE-Bi: implementation

• Web application (based on Zope server)
  • multilingual (es-eu-en localised) web interface
  • optimal information/contents management
  • complex system of user management

• Object-oriented database
  • classes: documents, subdocuments, segments
  • attributes: metadata (managed in disjoint sets)

• Full XML functionality
  • allows export to the TEI and TMX formats

Page 25: Metadata for multilingual content management A practical experience with the SARE-Bi system

SARE-Bi: conclusions

• In full experimental use since May 2003
  • six writers / two translators
  • no quantitative measures, but
    • sustained increase in the number of documents
    • mostly positive comments from the users

• Improving the system (X-Flow project)
  • automation of the workflow tasks
  • document versioning (XLIFF)
  • integration of linguistic engineering technologies

Page 26: Metadata for multilingual content management A practical experience with the SARE-Bi system

SARE-Bi: conclusions

• SARE-Bi has been funded by:
  – Autonomous Basque Government
    • Dept. of Industry (project X-Flow, 2002-2003)
    • Dept. of Education, Universities, and Research (project XML-Bi, PI1999-72, 2000-2001)
  – CodeSyntax (Eibar, Spain)

• Acknowledgements
  – Josu Gómez, Arantza Domínguez (DELi, UD)
  – Luistxo Fernández (CodeSyntax)

Page 27: Metadata for multilingual content management A practical experience with the SARE-Bi system

DELi (Universidad de Deusto) [1], CodeSyntax [2]

www.deli.deusto.es www.codesyntax.com

Translating and the Computer 25

Metadata for multilingual content management
A practical experience with the SARE-Bi system

Díaz, Abaitua, Jacob, Quintana [1] and Araolaza [2]