Dealing with Lexicon Acquired from Comparable Corpora: Post-edition and Exchange
Material presented at the TKE (Terminology and Knowledge Engineering) Conference 2010, Dublin, Ireland. The paper can be downloaded at http://hal.archives-ouvertes.fr/hal-00544403. Institutions: Laboratoire d'Informatique de Nantes Atlantique (LINA), Lingua et Machina.
Dealing with Lexicon Acquired from Comparable Corpora
Post-edition and Exchange
Estelle Delpech, Lingua et Machina
Béatrice Daille, U. de Nantes - LINA
1/23
Working with lexicon acquired from comparable corpora
I. Terminology acquisition from comparable corpora : quick overview
II. A tool for terminology post-edition
III. Data exchange : a TBX variant for automatically acquired lexicons
IV. Future work
2/23
Part I
Terminology Acquisition from Comparable Corpora
3/23
Terminology acquisition from comparable corpora
Comparable corpora:
“Two corpora, in two languages l1 and l2 respectively, are said to be ‘comparable’ if there exists a substantial part of the vocabulary of the corpus in language l1 whose translation can be found in the corpus in language l2.”
(my translation of [Déjean and Gaussier, 2002])
Advantages:
  Availability
  Real usages
4/23
Terminology acquisition from comparable corpora
Terminology extraction: a contextual analysis
  Compare the contexts of source and target terms
  If the contexts are similar, there is a good chance that the source and target terms are translations of each other, e.g.:
mastectomy : reconstruction, prophylactic, treat, undergo, removal
mastectomie : reconstruction, prophylactique, traiter, subir, ablation
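The comparison of contexts can be sketched roughly as follows. This is a minimal illustration and not the method used by the authors: the seed dictionary SEED_DICT, the helper names, the unrelated context and the choice of cosine similarity are all assumptions made for the example. Only the two context lists for "mastectomy" and "mastectomie" come from the slide.

```python
from collections import Counter
from math import sqrt

# Hypothetical seed bilingual dictionary (English -> French), assumed for this sketch.
SEED_DICT = {
    "reconstruction": "reconstruction",
    "prophylactic": "prophylactique",
    "treat": "traiter",
    "undergo": "subir",
    "removal": "ablation",
}

def translate_context(context, seed_dict):
    """Map source context words to the target language; unknown words are dropped."""
    return Counter(seed_dict[word] for word in context if word in seed_dict)

def cosine(a, b):
    """Cosine similarity between two bag-of-words context vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Context words taken from the slide.
source_context = ["reconstruction", "prophylactic", "treat", "undergo", "removal"]
target_context = Counter(["reconstruction", "prophylactique", "traiter", "subir", "ablation"])
# Invented context of an unrelated French word, for contrast only.
unrelated_context = Counter(["voiture", "route", "subir"])

translated = translate_context(source_context, SEED_DICT)
score_good = cosine(translated, target_context)
score_bad = cosine(translated, unrelated_context)
print(score_good > score_bad)
```

The higher the similarity between the translated source context and a target context, the more plausible the target term is as a translation candidate.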
5/23
Terminology acquisition from comparable corpora
Results
  Not as good as acquisition from parallel corpora!
  Fung (1997): 30% accuracy on the Top 20 candidates
  Morin et al. (2004): for complex terms, the correct translation is usually ranked 34th
Example of scored candidates for "mastectomy": ablation (0.92), mastectomie (0.89), opération (0.48)
6/23
Outputs one-to-many alignments
Evaluation: precision on the Top N best alignments
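As an illustration of this evaluation, the sketch below computes precision on the top N candidates. It assumes one common reading of the measure: a source term counts as correct if a reference translation appears among its N best-scored candidates. The function name top_n_precision and the gold reference are assumptions; the scores are the ones shown for "mastectomy".

```python
def top_n_precision(candidates, gold, n):
    """Share of source terms with a reference translation among their n best candidates."""
    if not candidates:
        return 0.0
    hits = 0
    for source, scored in candidates.items():
        # Sort candidates by decreasing score and keep the n best targets.
        ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
        top_targets = {target for target, _ in ranked[:n]}
        if gold.get(source, set()) & top_targets:
            hits += 1
    return hits / len(candidates)

# One-to-many alignment with the scores shown for "mastectomy".
candidates = {
    "mastectomy": [("ablation", 0.92), ("mastectomie", 0.89), ("opération", 0.48)],
}
# Assumed reference translation, used only for this example.
gold = {"mastectomy": {"mastectomie"}}

print(top_n_precision(candidates, gold, 1))
print(top_n_precision(candidates, gold, 2))
```

With these values the reference translation is ranked second, so it is missed at N = 1 and found at N = 2, which is why a post-editor needs to see several candidates per source term.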
Part II
A Tool for Post-edition
7/23
A tool for post-edition
Existing tools:
  iView (Merkel and Foo, 2007)
  ArayaTermExtractor (Waldhör, 2006)
  Xerox Terminology Suite ®
Our needs:
  Deal with one-to-many alignments
  Non-aligned contexts
  Allow non-binary annotation
  Display useful information to help find the right candidate in the corpus
8/23
“Useful” information
→ Knowledge that helps capture the in vivo behavior of terms
→ Text-driven, term-oriented approach
Useful information:
  Variants
  Collocations
  Distributional neighbors
  Contexts
→ To be harvested during the term extraction / alignment process
9/23
Useful information : example
Mastectomy / Mastectomie
EN: risk-reducing ~, simple ~
FR: ~ préventive, ~ simple
EN: Tumorectomy, Lumpectomy, Oophorectomy
FR: Tumorectomie, Ablation, Opération
...patient may choose to have risk-reducing bilateral mastectomy if they have a strong family history of breast cancer...
...la mastectomie préventive pourrait supprimer la grande majorité du risque de développer un cancer...
10/23
Post-edition interface
http://80.82.238.151/Metricc/InterfaceValidation (user “test”, no password)
11/23
Part III
Data Exchange: a TBX variant for automatically acquired lexicons
12/23
Quick introduction to TBX (1)
TBX: TermBase eXchange
  Open, XML-based standard for exchanging structured terminological data
  Approved as an international standard by LISA and ISO (ISO 30042)
  Maps to the TMF data model
  Subset of MARTIF
  Designed for various use cases
  Customizable
13/23
Quick introduction to TBX (2)
2 components:
  Structure: core structure based on the TMF metamodel
  Content: formalism to express data categories and their constraints
[Figure adapted from ISO 30042:2008, Fig. 4, p. 30. The core DTD/Schema (form) is combined with an XCS file (content): the default XCS gives the default TBX, and XCS1 ... XCSn give TBX variant 1 ... TBX variant n.]
14/23
Quick introduction to TBX (3)
[Figure taken from ISO 30042:2008, Fig. 1, p. 9. Data categories shown: responsibility, respPerson, termType, usageNote, corpusTrace, reliabilityCode, partOfSpeech. Form is defined in the DTD; content is defined in the XCS.]
15/23
TBX variant for lexicon acquired from comparable corpora
Default TBX data categories:
  termType: entryTerm, variant
  externalCrossReference, usageNote
  partOfSpeech, frequency, reliabilityCode...
  transactionType, responsibility
+ Customized data categories:
  occurrences, occurrenceCount
  relatedTerm
  termDefinition, definitionRelevance
  ntigReference
16/23
TBX variant : A term entry
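For readers without the original figure, the sketch below builds a small TBX-style term entry with Python's standard xml.etree.ElementTree module. It is only an approximation: the element names (termEntry, langSet, ntig, termGrp, term, termNote) follow the usual TBX core structure, while the entry id, the helper name build_term_entry and the sample values are assumptions, and the customized data categories of the variant presented here are not reproduced.

```python
import xml.etree.ElementTree as ET

def build_term_entry(entry_id, terms):
    """Build a TBX-style termEntry from (language, term, part of speech) triples."""
    entry = ET.Element("termEntry", {"id": entry_id})
    for lang, term, part_of_speech in terms:
        # One langSet per language, holding one ntig per term.
        lang_set = ET.SubElement(entry, "langSet", {"xml:lang": lang})
        ntig = ET.SubElement(lang_set, "ntig")
        term_grp = ET.SubElement(ntig, "termGrp")
        ET.SubElement(term_grp, "term").text = term
        note = ET.SubElement(term_grp, "termNote", {"type": "partOfSpeech"})
        note.text = part_of_speech
    return entry

# Illustrative values only; the id "e1" is hypothetical.
entry = build_term_entry(
    "e1",
    [("en", "mastectomy", "noun"), ("fr", "mastectomie", "noun")],
)
xml_text = ET.tostring(entry, encoding="unicode")
print(xml_text)
```

Because TBX groups all languages under one concept-level termEntry, a source term and each of its candidate translations end up as sibling langSet elements, which is the structure the one-to-many alignments have to fit into.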
17/23
TBX variant : 1-to-n alignments
18/23
TBX variant : approved alignment
19/23
Feedback on TBX
TBX is made for stable terminologies with little uncertainty about the status of translations, not for machine-generated lexicons of “candidate translations”:
  difficult to separate a term and its properties from its alignments
  no data category specific to automatically estimated reliability
Difficult to make text-driven, term-oriented knowledge fit in a concept-oriented format:
  no definition category that would apply to a single term and not to the whole concept
Conclusion
Future work
21/23
Future work
Integration of the prototype in Libellex:
  TBX import / export
  editing of linguistic properties
User testing (ergonomics)
Evaluation of added value for translation
Explore new ways of:
  aligning terms
  selecting contexts
22/23
References
Post-edition prototype online : http://80.82.238.151/Metricc/InterfaceValidation/ (user “test”, no password)
Metricc project : http://www.metricc.com/
Lingua et Machina : http://www.lingua-et-machina.com/
Comparable corpora : Déjean, H., Gaussier, É. (2002) : “Une nouvelle approche à l'extraction de lexiques bilingues à partir de corpus comparables”, In Lexicometrica, Alignement Lexical dans les corpus multilingues, pp.1-22.
ArayaTermExtractor : http://www.heartsome.de
Xerox Terminology Suite : http://www.temis.com/
iView : Nyström, M., Merkel, M., Ahrenberg, L., Zweigenbaum, P., Petersson, H. and Åhlfeldt, H. (2006) : “Creating a medical English-Swedish dictionary using interactive word alignment”, In BMC Medical Informatics and Decision Making, 2006, pp. 6-35
TMF : ISO 16642 - Terminological markup framework
TBX : ISO 30042 - Systems to manage terminology, knowledge and content -- TermBase eXchange (TBX)
Data categories : ISO 12620 - Terminology and other language and content resources -- Specification of data categories and management of a Data Category Registry for language resources