Terminology Management
Angelika Zerfass
Introduction
•Degree in translation (Chinese/Japanese into German), Computational Linguistics
•Worked for Trados in Japan, Germany, USA
•Since 2000, independent trainer and consultant for translation tools (TM tools and terminology management tools)
•2 employees for technical support and terminology management
Agenda
•What kind of terminology work are you doing?
•What is terminology?
•Collecting terminology
•Terminology lists and term bases
•Terminology checking
WHAT KIND OF TERMINOLOGY WORK ARE YOU DOING?
WHEN TERMINOLOGY WORK GETS STARTED…
Terminology Work
• Terminology projects are often initiated during the translation process.
• During the creation of a bilingual terminology list, the need for source language standardization arises
– Allowed and forbidden terms, additional information like definitions…
• Standardized terminology can now be used for source language checking
• The source language term lists are then complemented by the target languages (if there is a relay language, this needs to be completed first)
– Relay language example: source language Japanese, relay language English; further translation starts from English, not from Japanese
• Bilingual term lists can be used for translation and terminology checks (in the Translation Memory tools) in the target language
• The content of a term base can be published on the intranet/internet, or access can be granted to users other than translators
What is terminology?
• In our industry, terminology usually means
– product and company names
– product-specific terms
– company-specific terms
– subject matter-specific terms
– special abbreviations and acronyms
– Terminology that is prescribed by norms and standards
– our terminology versus the terminology of our competitors
Multilingual Workflow Management
Where do you find terminology?
• Product development, text creation, specifications
• Marketing
• Sales
• Contracts, agreements
• Product manuals / documentation
• …
How is terminology represented?
• Dictionary
– Lists all the meanings of one term
– Bank = financial institution, elevation, bench…
• Terminology Database for translation
– One entry per meaning of a term
• bank (= financial institution)
• bank (= bench)
• bank (= elevation like a sand bank)
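The difference between the two representations can be sketched as a small data structure. This is only an illustration: the class name `TermEntry`, its fields and the helper `to_term_base` are hypothetical and not taken from any terminology tool.

```python
from dataclasses import dataclass, field

# Dictionary style: one headword, all meanings listed under it.
dictionary = {
    "bank": ["financial institution", "bench", "elevation like a sand bank"],
}

@dataclass
class TermEntry:
    """Term base style: one entry per meaning of a term (hypothetical layout)."""
    concept: str                               # the meaning this entry covers
    terms: dict = field(default_factory=dict)  # language code -> term

def to_term_base(dictionary):
    """Split every dictionary headword into one entry per meaning."""
    entries = []
    for headword, meanings in dictionary.items():
        for meaning in meanings:
            entries.append(TermEntry(concept=meaning, terms={"en": headword}))
    return entries

term_base = to_term_base(dictionary)
for entry in term_base:
    print(f"{entry.terms['en']} (= {entry.concept})")
```

Because each meaning has its own entry, a target-language equivalent can later be attached to exactly one meaning instead of to the ambiguous headword.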
[Figure: a dictionary entry for "Bank" (source: Thefreedictionary.com) next to a term base with separate entries: Bank (financial institution), Bank (elevation), Bank (bench), Bank (cloud bank), Bank (to bank)…]
Creation of term lists – Term extraction / setting up a term list
• Ideally, term lists with new terms and company/product-specific terms are collected where the terms are created
– Product development / engineering
– (Technical) authoring
– Marketing / Sales
• Ideally these groups follow rules on how to create terms
– air flow sensor vs. air-flow sensor vs. Air Flow sensor vs. sensor airflow…
– Körperwärmesensor, Sensor für Körperwärme, Sensor Körperwärme (German variants of "body heat sensor")
Creation of term lists – Term extraction / setting up a term list
• During translation
– Asking translators to either create new entries in the term base or fill in the target language equivalents of the source language terms
• Extraction of terms from existing (monolingual) documents or bilingual translation resources (TMs), after a translation project or independent of any translation project
– manually
– tool-assisted
– For up to 20,000 words, manual and tool-assisted extraction take about the same time for reading/checking the segments.
– About 3% of all words or 20-30% of all terms extracted by a tool can be considered real term candidates.
Terminology Extraction
• Tool-assisted
– Manual extraction assisted by translation tools
– Concordance tools (monolingual): create a list of all terms in the document
– Statistical extraction tools (monolingual and bilingual): create a list of term candidates according to frequencies
– Linguistic extraction tools (monolingual and bilingual): create a list of term candidates according to rules (e.g. noun phrases of up to 3 elements)
Manual extraction
• Some Translation Memory tools offer a way to collect term pairs from the alignment of two documents and send them directly to a term base (memoQ)
Extraction of source language terms
• Some Translation Memory tools will create a list of the source language terms of the documents that are imported into the TM system (e.g. across).
• Workflows allow for term translation as a first step
– Without context, this might not be feasible…
• Terminology goes into the term base directly (across, memoQ, SDL Trados via a separate extraction tool)
Monolingual Extraction
• Extraction of terms from documents in one language.
– Creation of term lists…
• important terms
– Who defines what is important?
– How can a tool "know" what is important?
• frequent terms
– What is frequent? 3 times / 10 times…
– Are frequent terms also important?
• new terms
– According to whose level of subject matter knowledge?
– Compared to which term list / term database?
Concordance lists
• List of all terms in a document
• Frequency of each term
• Extraction can be controlled through stop word lists
• Terms in context
• Simple Concordance Program (SCP), freeware (European languages and Arabic only)
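What a concordance list contains (every word, its frequency, a stop word filter, and the word in context) can be sketched in a few lines. This is a minimal illustration, not how SCP works internally; the stop word list, the context width and the sample text are assumptions made for the example.

```python
import re
from collections import Counter

# Tiny illustrative stop word list; real lists are much longer.
STOP_WORDS = {"the", "a", "an", "of", "to", "is", "and", "in"}

def concordance(text, stop_words=STOP_WORDS, width=20):
    """Return word frequencies and, for each word, the contexts it occurs in."""
    freq = Counter()
    contexts = {}
    for match in re.finditer(r"\w+", text):
        word = match.group().lower()
        if word in stop_words:
            continue  # stop words are excluded from the list
        freq[word] += 1
        # Keep `width` characters on each side as the context.
        start = max(0, match.start() - width)
        end = min(len(text), match.end() + width)
        contexts.setdefault(word, []).append(text[start:end])
    return freq, contexts

sample = ("The air flow sensor measures the air flow. "
          "Replace the sensor if the air flow is low.")
freq, contexts = concordance(sample)
for word, count in freq.most_common(3):
    print(count, word, "|", contexts[word][0])
```

The list contains single words only, which is why multi-word terms such as "air flow sensor" still have to be picked out by a person reading the contexts.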
Term candidates – statistical extraction
Statistical extraction
• Monolingual or bilingual
• Suitable for every language / language combination (for example from a translation memory)
• The larger the collection of extraction material, the better the extracted lists
• Stop word lists
• Context sentences
• In theory for all languages, but in practice Asian languages need more selection work than others
Term candidates – statistical extraction
Statistical extraction
• Settings are focused on words surrounded by spaces as delimiters
• Result: "stupid" term candidates
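A frequency-based extraction over space-delimited words can be sketched as follows. This is a simplified illustration of the general idea, not the algorithm of any particular product; the stop word list, the thresholds `max_len` and `min_freq`, and the sample text are assumptions.

```python
from collections import Counter

STOP_WORDS = {"the", "a", "of", "to", "is", "if", "and"}

def ngram_candidates(text, max_len=3, min_freq=2, stop_words=STOP_WORDS):
    """Count word sequences of 1..max_len words and keep the frequent ones.

    Sequences that start or end with a stop word are skipped.
    """
    # Spaces are the only delimiter; punctuation is stripped from each token.
    tokens = [t.strip(".,;:!?").lower() for t in text.split()]
    tokens = [t for t in tokens if t]
    counts = Counter()
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            if gram[0] in stop_words or gram[-1] in stop_words:
                continue
            counts[" ".join(gram)] += 1
    return {gram: c for gram, c in counts.items() if c >= min_freq}

sample = ("Check the air flow sensor. The air flow sensor is new. "
          "Clean the sensor housing.")
for gram, count in sorted(ngram_candidates(sample).items()):
    print(count, gram)
```

Besides "air flow sensor", the result also contains fragments such as "flow sensor" and "flow", because the method only counts and has no idea what the words mean. Since `text.split()` separates on whitespace only, a Japanese or Chinese sentence written without spaces would come back as a single token.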
This is how a text looks to a statistical extraction tool…
Vot gnig harengoga fuor tok gnig nor shewerginhatz. Mirhon bortup tip trewshu gnig batbo loqtet. Bortup ter, bortup nofdas, semsel nih furpo ayano bliktreptat. Mirhon granbevtrov driktopret grig go wasbrekit mut mirkep taptro gnig suf. Aktrep zitpek nitnit bortup mil. Setrimb ak troptan bur metlatkento.
Bilingual Extraction
• Term extraction from bilingual sources like translation memory files or bilingual translation files
– Creation of parallel lists of terms and their translation(s)
• All forms of the term and all its translations
• Only basic form
• Most frequent translation of source term
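How a tool might arrive at the "most frequent translation of a source term" can be sketched with simple co-occurrence counting over translation units. The scoring below uses the Dice coefficient, which is a technique chosen for this illustration and not necessarily what any commercial tool uses; the function name and the three sample units are hypothetical.

```python
import re
from collections import Counter

def best_translations(tm_units, source_term, top=3):
    """Rank target words by how strongly they co-occur with source_term."""
    with_term = 0            # units whose source contains the term
    target_freq = Counter()  # units containing each target word
    cooc = Counter()         # units containing both the term and the target word
    for source, target in tm_units:
        words = set(re.findall(r"\w+", target.lower()))
        target_freq.update(words)
        if source_term.lower() in source.lower():
            with_term += 1
            cooc.update(words)
    # Dice coefficient: 2 * joint count / (count of term + count of word)
    scores = {w: 2 * c / (with_term + target_freq[w]) for w, c in cooc.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top]

tm = [
    ("Turn off the monitor.", "Schalten Sie den Bildschirm aus."),
    ("Clean the monitor.", "Reinigen Sie den Bildschirm."),
    ("Turn off the printer.", "Schalten Sie den Drucker aus."),
]
for word, score in best_translations(tm, "monitor"):
    print(f"{score:.2f} {word}")
```

Function words such as "sie" and "den" also score highly because they appear in almost every unit, which is one reason the extracted lists need manual cleanup and a stop word list for the target language.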
Statistical Extraction with SDL MultiTerm Extract
List of source terms
List of possible translations
Context sentences from TM
Notes field, definition field; add context sentences to entry in term base
Linguistic Extraction Tool
• Tool knows about the structure of the language
• Extracted terms can be reduced to their basic form with the help of dictionaries and rules
• User can define the rules used for extraction – like noun phrases up to 3 words…
• Monolingual or bilingual extraction
• Extraction limited to supported languages (mostly Western European)
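A rule such as "noun phrases up to 3 words" can be sketched like this. A real linguistic tool relies on dictionaries and a part-of-speech tagger for each supported language; here a tiny hand-written lexicon (`POS`, purely hypothetical) stands in for that knowledge so the example stays self-contained.

```python
# Hypothetical mini lexicon standing in for a real part-of-speech tagger.
POS = {
    "the": "DET", "a": "DET", "new": "ADJ", "digital": "ADJ",
    "air": "NOUN", "flow": "NOUN", "sensor": "NOUN", "housing": "NOUN",
    "measures": "VERB", "protects": "VERB",
}

def noun_phrases(sentence, max_words=3):
    """Collect runs of adjectives/nouns that end in a noun, at most max_words long."""
    tokens = [t.strip(".,").lower() for t in sentence.split()]
    phrases, run = [], []
    for token in tokens + [""]:  # the empty sentinel flushes the last run
        if POS.get(token) in ("ADJ", "NOUN"):
            run.append(token)
            continue
        # Drop trailing adjectives so the phrase ends in a noun.
        while run and POS[run[-1]] != "NOUN":
            run.pop()
        if run:
            # The head noun comes last, so keep the last max_words words.
            phrases.append(" ".join(run[-max_words:]))
        run = []
    return phrases

print(noun_phrases("The new sensor measures the air flow."))
print(noun_phrases("The digital air flow sensor protects the housing."))
```

Words missing from the lexicon simply end a phrase, which mirrors the limitation named above: the extraction only works for languages (and vocabulary) the tool knows about.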
Results of a linguistic extraction with context (TerminologyWizard)
Linguistic Extraction with SDL PhraseFinder
List of source terms
List of possible translations
Grammar, comment fields
Context sentences from TM
Test Results…
• Statistical extraction "works" for all language pairs, but the tools are better with European languages than Asian languages
• For bilingual extraction Asian languages work better as target languages
• Concordance tools are limited in file formats that they can extract from
• Manual extraction might not be the fastest method, but it is the best one if you have a specific goal (company-specific terms…)
• Cleaning up these extracted lists takes TIME!
Online Extraction Japanese/Chinese
• Gensen web (uses a POS tagger)
• Test with Japanese Yahoo website, homepage
• Test with China Daily news article
Term extraction issues
• Terminology extraction is a highly individual process
– Goal of extraction, subject matter expertise, available time
• Tools use different methods for terminology extraction
– Concordance, statistics, linguistics
• Tools support different file formats for extraction and export
– Monolingual, bilingual, export formats
• Tools sometimes don’t show the context from which the term was taken
Example – Bilingual extraction from a TM, 60,000 units
– Several statistical extractions with varying settings to produce lists with more or less "noise" (ca. 10 min / extraction)
– General stop word list
– 800 extracted terms (medical) English-German took us 5-6 hours
– Deleting terms that were too general, product names that stayed the same, cleaning up rubbish strings…
– Looking up context to determine which of the translations should be marked as term candidates
– Export to Excel (+ sorting from short to long, number of words in a term… 0.5 hours)
– And now the list needs to be checked by experts!
ADDITIONAL INFORMATION
What else belongs to a term?
• A term and its equivalent in other languages is often not enough. More information is needed…
– Source (where did you find the term?)
– What product, business area… does the term belong to?
– Definition (how do you write a good definition?)
– Images
– What is allowed and what is forbidden
– Additional notes, comments, examples…
A set of questions to create good definitions
(as suggested by Kurt Hilgenberg)
• Object-oriented questions
– What is XXX?
– Where does XXX appear?
• Functional questions
– How does XXX work?
– What are the characteristics of XXX?
– How does XXX differ from YYY?
• Questions on reasons and conditions
– Under what conditions does XXX appear?
• Instrumental questions
– What are the objectives of XXX?
– How is XXX used?
Example for the term DIALOG
Object
– What is a dialog?
• A part of the user interface
Function
– What are the characteristics of a dialog?
• A dialog has checkboxes, radio buttons, input fields, dropdown menus or other ways of selecting or inputting values
– How does a dialog differ from a window?
• A dialog is used for activating / deactivating settings
• A window is used for showing the result of the settings
Condition / Reason
– Under what conditions does a dialog appear?
• The dialog appears when the user selects a menu item ending with "…"
Instrumental / Usage
– What is the dialog used for?
• The dialog is used to activate / deactivate the options and input values
Example for the term SWITCH
Object
– What is a switch?
• A piece of hardware
• A selection item on a dialog in a software user interface
Function
– What are the characteristics of a switch?
• The switch can be turned to different positions
• The switch is a square box that can contain a mark or be empty
Condition / Reason
– What is the reason for the switch?
• The switch allows the changing of the settings
Instrumental / Usage
– How is the switch used?
• The switch is turned clockwise or counter-clockwise
• The switch box is clicked and then shows a mark; it is clicked again to take the mark off again
Tips on terminology work
• As much useful information as possible
– Definition, context example, source information, graphics, status, customer
• ...but only as much as you need to decide if the term is to be used in a certain situation or not
• Enter the term as it will appear in the text
– Incorrect: Screen (Monitor)
– Correct: Screen Monitor
• Enter base / singular form of the term
• For Asian languages, which are more contextual than others, add longer expressions
EXCHANGING TERMINOLOGY
Exchanging Terminology – Working with a client
• Ask for term lists
• Create a template for the customer to add terms and answer questions
• Define clear rules and color coding for who is allowed to do what (nobody adds rows or columns; everyone fills in only the column indicated for them…)
Exchanging Terminology – Working with translators
• Usually the first step is an Excel table
• Translators will either import it into their tool of choice or will consult the list manually during translation
• It is hard to check whether the terminology was used consistently if the translation process has no term base with term checking features
• If you know what software your translators are using, you can send them a pre-defined term base to attach to their translation system
Developments in Terminology Work
• If possible
– All users of terminology work on an online system
• Web-based version of the terminology component of the translation tool (WebTerm, MultiTerm Online, qTerm, crossTerm Web…)
• Self-developed web-based interface for terminology work by a translation vendor (Lookup by DOG, TermXplorer…)
• Web-based terminology system like TermWeb (Interverbum)
• Connection to a TM system is a plus
Terminology Work
• Terminology takes a lot of time
– Limit the exchanges on terminology questions to a certain amount of communication back and forth
– Someone needs to have the power to decide what to use (even if the client is not responsive)
– Terminology work needs good documentation
– Terminology is a work in progress!
What about the TBX standard exchange format?
• Not all tools are able to read this format yet, and files exported from different tools may differ
– TBX is quite complex and needs a lot of attention to detail
– Some examples of TBX support:
• MultiTerm 2009/2011 (export; import only with conversion)
• memoQ qTerm (yes), internal term base (no)
• MultiTrans (TBX import and export)
• across (TBX import and export); TBX exchange between across and MultiTerm works well in our experience
TERMINOLOGY LISTS/ TERM BASES
Content for a term list / term base
• Subject matter or company/product-specific terms
– ProductName, Company - Name
– "compatibility across platforms"
– view settings: a dialog with settings for viewing something, or a button to view the settings of something?
• Collect terms, synonyms, abbreviations
– screen, scr., monitor
• Base form of the term
• Decide on term status field
– forbidden, deprecated, pending, confirmed by…, used by us, used by competitor…
Terminology Retrieval during Translation
Known terms marked in the text in blue
Additional Information from the term base (examples, explanations…)
Blue = allowed term; Black = forbidden term
Metadata in term lists / term bases
• Depending on the intended usage
– Definitions, status, source, pictures for translators
– Gender, grammar, examples for non-translators
– More grammatical information for use in machine translation systems
– Explanations with examples and pictures for educational purposes
Levels on the Entry
• A terminology entry usually has 2 to 3 levels where you can add information
– Entry level: Information on the entry as a whole, like a picture, a general note, a product name…
– Language level: Information that applies to all terms in one language (definition…); this level is not often used
– Term level: Information that applies to a certain term, like status forbidden, source of the term, context example, links to other, similar terms, gender information…
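The three levels can be pictured as nested records. The sketch below is a generic illustration, not the data model of MultiTerm or any other tool; all class and field names are hypothetical, and the sample entry reuses the monitor / Bildschirm example from the term check slides.

```python
from dataclasses import dataclass, field

@dataclass
class Term:
    text: str
    fields: dict = field(default_factory=dict)  # term level: status, source, context...

@dataclass
class LanguageSection:
    language: str
    fields: dict = field(default_factory=dict)  # language level: e.g. a definition
    terms: list = field(default_factory=list)

@dataclass
class Entry:
    fields: dict = field(default_factory=dict)  # entry level: picture, note, product...
    languages: list = field(default_factory=list)

entry = Entry(
    fields={"note": "hypothetical entry-level note"},
    languages=[
        LanguageSection("en", terms=[Term("monitor", {"status": "allowed"})]),
        LanguageSection("de", terms=[
            Term("Bildschirm", {"status": "allowed"}),
            Term("Monitor", {"status": "forbidden"}),
        ]),
    ],
)

def forbidden_terms(entry):
    """List (language, term) pairs whose term-level status is 'forbidden'."""
    return [
        (section.language, term.text)
        for section in entry.languages
        for term in section.terms
        if term.fields.get("status") == "forbidden"
    ]

print(forbidden_terms(entry))
```

Keeping the status on the term level is what makes it possible to forbid one synonym while allowing another inside the same entry.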
Term Base Entry (MultiTerm)
Term level information (free text)
Synonyms
Term level information (categories)
Entry level information
Term list -> Term Base
• Getting the data into a term base
– Import format for a term database depends on the tool that is going to be used to maintain the terminology
• Row layout – 1 row = 1 terminology entry
• ID layout – several rows with the same ID = 1 terminology entry
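The relationship between the two layouts can be shown by converting one into the other. This is a minimal sketch: the column names (`en`, `de`, `id`, `language`, `term`) and the sample rows are assumptions, and the import format a given tool expects has to be taken from that tool's documentation.

```python
import csv
import io

# Row layout: 1 row = 1 terminology entry, one column per language.
ROW_LAYOUT = (
    "en\tde\n"
    "monitor\tBildschirm\n"
    "selector key\tAuswahltaste\n"
)

def rows_to_id_layout(text, languages=("en", "de")):
    """Turn each row into several rows that share one entry ID."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    id_rows = []
    for entry_id, row in enumerate(reader, start=1):
        for lang in languages:
            id_rows.append({"id": entry_id, "language": lang, "term": row[lang]})
    return id_rows

for row in rows_to_id_layout(ROW_LAYOUT):
    print(row["id"], row["language"], row["term"], sep="\t")
```

The ID layout needs more rows, but it can hold any number of synonyms per language without adding columns.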
Row-based term list (one row = one terminology entry)
ID-based term list (all rows with one ID belong to the same entry)
Term Bases of TM systems
• Some examples
– SDL MultiTerm is a freely configurable term base system. It can be used online as well.
– memoQ can import TMX files or delimited text files (saved from Excel) directly, but the user interface is fixed and the number of fields is limited
– memoQ qTerm (web-based term base) is similar to MultiTerm in that it is freely configurable and can accommodate as many fields as you like
– TermStar/WebTerm offer a pre-defined interface where custom fields can be added
Import routines
• Preparation steps to import a list (most often Excel) into a term base:
– memoQ: save Excel as Unicode TXT
– MultiTerm: convert Excel to MultiTerm import format (XML) through the MultiTerm Convert tool.
Structure of a memoQ entry
List of entries in the term base
The 2 selected languages for display in the list
Metadata for term in language 1
Metadata for term in language 2
Metadata for the entire entry (all languages)
Excel for memoQ
Entry level data
Term level data
In memoQ, the string NonTerm can be used to mark a term as forbidden. The corresponding checkbox will be set during import.
Import settings (mapping Excel to memoQ structure)
Columns are mapped to the corresponding field in memoQ
The status column, which contains the NonTerm information, goes into the Term information field.
Excel for MultiTerm
Create as many fields as you need and name them as you like
Hyperlinks will be available as links in MultiTerm as well
Import settings (mapping Excel to MultiTerm structure)
Columns are mapped to the field types in MultiTerm
Metadata columns are connected to the correct level (entry level or term level)
Imported term base in MultiTerm
Maintaining terminology within the term base
• Terminology is an ongoing process
– Adding new terms
– Adding new languages
– Adding additional information (pictures…)
– Changing translations / incorporating feedback from users
– Changing the status of a term (forbidden, allowed, deprecated, pending…)
– Separating or combining terminology resources
– Converting terminology resources for use within other tools
TERMINOLOGY CHECKING
Checking terminology during translation and the authoring process
• Tools that integrate with the authoring environment to check the source documents for
– Forbidden terms, use of synonyms, correct spelling of the words, grammatical or structural errors
• Tools inside translation memory systems that check the source language-target language sentence pairs against a term base for
– Forbidden terms
– Terms that are in the term base but whose translation has not been used
– Missing terms in the term base
• Stand-alone QA Tools that offer a terminology checking component for bilingual translation documents
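The checks a TM tool runs on a sentence pair can be sketched as follows. This is a simplified illustration with naive substring matching (no handling of inflected forms); the term base layout, the issue labels and the sample sentences are assumptions, built around the selector key / Auswahltaste and monitor / Bildschirm examples used on the term check slides.

```python
# Hypothetical miniature term base: source term, approved target, forbidden targets.
TERM_BASE = [
    {"source": "monitor", "target": "Bildschirm", "forbidden": ["Monitor"]},
    {"source": "selector key", "target": "Auswahltaste", "forbidden": []},
    {"source": "menu", "target": None, "forbidden": []},  # no translation yet
]

def check_segment(source, target, term_base=TERM_BASE):
    """Return a list of (issue, term) pairs for one source/target sentence pair."""
    issues = []
    for entry in term_base:
        if entry["source"].lower() not in source.lower():
            continue  # term does not occur in this segment
        if entry["target"] is None:
            issues.append(("no target", entry["source"]))
            continue
        for bad in entry["forbidden"]:
            if bad in target:
                issues.append(("forbidden term", bad))
        if entry["target"] not in target:
            issues.append(("term not used", entry["target"]))
    return issues

print(check_segment("Press the selector key on the monitor.",
                    "Drücken Sie den Auswahlschalter am Monitor."))
print(check_segment("Open the menu.", "Öffnen Sie das Menü."))
```

Which of these checks a real tool can run depends on what the term base stores: without a forbidden status or a list of approved targets there is nothing to compare against.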
SDL Author Assistant
Acrolinx
Term Check inside a TM tool environment (SDL Trados 2009/11)
- Forbidden term (phrase)
- Wrong term (set instead of sentence)
17-20 Nov. Buenos Aires, Argentina
Term Check Results
Wrong translation
(selector key = Auswahltaste not Auswahlschalter)
Wrong translation
(monitor = Bildschirm, not Monitor)
Term Check Results
Wrong translation
Several translations
Missing translation in term database
- Forbidden term (Monitor)
- Wrong term (Auswahlschalter)
- No target (menu)
Summary
• Each checking routine checks only some of the possibilities; none checks the whole range
– Term from the term base was not used
– Missing translation in term list / term base
– Term with several translations
– Missing source term (reverse check)
• The setup possibilities of the term base define the range of checks
Term base setup with forbidden terms (SDL MultiTerm)
memoQ term base with wildcards ( * or | ) for terminology matching
WHAT ELSE…
Using terminology outside of the translation process
• What else is terminology good for
– Company dictionary
• Help new employees / sub-contractors to understand the products
– Training machine translation systems (additional grammatical information needed)
– Training search engines (to search for synonyms for the term the user entered)
– Setting a standard by publishing the terminology (see Microsoft glossaries)
Terminology Processes
[Diagram: term lists passed between the process steps authoring, terminology extraction, terminology approval, import into the terminology database, translation and terminology check (returning term translations, new terms and change requests), terminology approval, import of translations, term check during authoring, and online publication of the term database on the intranet/internet (returning new terms and change requests for terminology approval)]
From terms back to sentences
• If terminology is used for source language checking then sentence checking can be added as well
• A list of standard sentences is compared against the document to see where the author deviated from the standard sentence structures
• Standard sentences can be extracted (manually) from translation memory systems
• Authoring memory systems then use these sentences for source language checking
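Comparing what an author writes against a list of standard sentences can be sketched with a generic string similarity measure. The example uses `difflib.SequenceMatcher` from the Python standard library as a stand-in; authoring memory systems use their own matching methods, so the percentages produced here will not correspond to those of a real tool. The function name `rank_matches` is hypothetical.

```python
from difflib import SequenceMatcher

# Standard sentences, e.g. extracted manually from a translation memory.
STANDARD_SENTENCES = [
    'Select the first menu point "open current project"',
    "Select the second option – show open projects",
    "Select the view menu and click on project",
]

def rank_matches(written, standards=STANDARD_SENTENCES):
    """Return (standard sentence, similarity 0..1) pairs, best match first."""
    scored = [
        (standard, SequenceMatcher(None, written.lower(), standard.lower()).ratio())
        for standard in standards
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

written = "Select the second menu point – open recent project"
for standard, score in rank_matches(written):
    print(f"{score:.0%}  {standard}")
```

A score of 100% means the author used a standard sentence exactly; anything lower shows where the wording deviates and which standard sentence could be used instead.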
Sentence Check
• Authoring Memory Systems to check consistent use of standard sentences
[Screenshot: "The text you write…" is compared with stored standard sentences, shown with match values of 95%, 92% and 90%. Sentences visible in the screenshot:]
– Select the second menu point – open recent project
– Select the first menu point "open current project"
– Select the second option – show open projects
– Select the view menu and click on project
Online terminology
• Terminology management used to be done either in a standalone terminology management tool or inside a translation memory system.
• Not everybody who needs to work on the terms owns the corresponding term management system or wants to work with one.
• Online systems allow you to work on terminology together with clients and translators through a browser interface.
• Users have certain rights on the entries (making comments, adding entries, deleting terms…)
MultiTerm online
memoQ qTerm
Interverbum
QUESTIONS?
Some Terminology Extraction Tools
– Concordance tools (freeware)
• Simple Concordance Program (SCP), http://www.textworld.com/scp/
• ExtPhr32, http://publish.uwo.ca/~craven/freeware.htm
– Term extraction tools / components of translation memory tools
– Online term extraction for Japanese, Chinese
• http://gensen.dl.itc.u-tokyo.ac.jp/gensenweb_eng.html
– Statistical Extraction
• MultiTerm Extract, Déjà Vu Lexicon, Heartsome Dictionary Editor, across
• TermiDOG (www.dog-gmbh.de), Chamblon Terminology Extractor (http://www.chamblon.com/terminologyextractor.htm), Terminotix Synchroterm (http://www.terminotix.com )…
– Linguistic Extraction
• Synthema Terminology Wizard (http://www.synthema.it/english/servizi/traduzioni.html); does not seem to be available commercially anymore
• SDL PhraseFinder…