bibusiness sis trategies for b ildib uilding …...title microsoft powerpoint - asia online - lrc-...
TRANSCRIPT
B i S i f B ildiBusiness Strategies for Building Strategic Advantage and Revenue
from Machine Translation
Dion WigginsChief Executive OfficerChief xecutive [email protected]
Copyright © 2011, Asia Online Pte LtdCopyright © 2011, Asia Online Pte Ltd
Cloud is one of many delivery models, not a quality model
Copyright © 2011, Asia Online Pte LtdCopyright © 2011, Asia Online Pte Ltd
Copyright © 2011, Asia Online Pte Ltd
Copyright © 2011, Asia Online Pte Ltd
Near HumanNew techniques mature
Hybrid MT Platforms
Businesses start to consider MTLSPs start to adopt MT
New skills develop in editing MTGood Enough Threshold
Google drivesProcessors became powerful enoughLarge volumes of digital data available
Gist
New skills develop in editing MT
Early SMTVendors
Good Enough Threshold
MT acceptance
Qua
lity
g g
BabelfishGoogle switches
from Systran to SMT
911 ‐> Research fundingQuality plateau as RBMT reached its limits in many languages – only marginal improvement.y g g y g p
Experimental
Copyright © 2011, Asia Online Pte Ltd
Early RBMT improved rapidly as new techniques were discovered.
CorporateCorporate Brochures
Product Brochures
2,000
10 000
Example WordsHuman
User Interface
Products Product Brochures
Software Products
M l / O li H l
10,000
50,000
200 000
Enterprise Information
User Documentation Manuals / Online Help
HR / Training / Reports
200,000
500,000
Existing Markets $31.4B
New Markets
Support / Knowledge Base
Communications 10,000,000
20,000,000+
Email / IM
Call Center / Help Desk
User Generated Content 50,000,000+Blogs / ReviewsMachine
Problem: Only 0.5% of what needs to be translated today is being translated due to cost and time constraints.
Solution: Humans cost too much and take too long. The average human translates at 2,000‐3,000 words/day.Near‐human quality machine translation offers an alternative to human that is faster and lower cost and can deliver huge volumes of content quickly – millions of words/day. M hi T l i (MT) li i d h f Wh bi d
Copyright © 2011, Asia Online Pte Ltd
Machine Translation (MT) quality is now good enough for many new purposes. When combined with human post editing MT can deliver gold standard quality in a fraction of the time.Many business cases previously not viable due to time and cost constraints now make sense.
Wait for a Project That Requires MTWait for a Project That Requires MT
Create a Product For Resale
Create a Product For Resale
• Opportunistic approach• Many LSPs are interested in MT, but not
• Proactive approach• Leverages existing translation assets• Can be sold to many clients
That Requires MTThat Requires MT For ResaleFor Resale
willing to take the plunge without a paying client.
• Limited to one clientH d t ll l l l
• Can be sold to many clients• Easier to sell ‐ test and show• Can sell multiple language pairs at the same
time• Harder to sell – longer sales cycle• Often build one language pair to try, before
committing to others
• Generally a higher Return On Investment (ROI)
Revenue RevenueRecurring revenues from words translatedOne time revenues from resale of customization
Recurring revenues from words translatedPreparing source dataOne time revenues from resale of customization
Post editingPreparing source dataTerminology definition
Preparing source dataPost editingRuntime glossary preparationNon‐translatable terms definition
Copyright © 2011, Asia Online Pte Ltd
Non‐Translatable termsUnknown and high‐frequency phrase resolution
Reduce Project Costs Reduce Project Costs Faster Translation Delivery
Faster Translation DeliveryDelivery Delivery
• Helps to manage margin squeeze:– compete with competitors using cheaper (perhaps
lower quality) resources or competitors using MT
• New projects that could not have been delivered on due to time and resource constraintsq y) p g
• Helps to cost justify business cases that may not be viable using a human only approach
• Can be used behind the scenes (like a
constraints• Helps clients that want to simultaneously
ship product in multiple languages• New clients in research, analysis, data
translation memory) or disclosed to client• More client work in other areas as a result of
left over translation budget.
ymining and discovery markets
• New clients that need real‐time or near real‐time translation
Revenue RevenueDepending on project or product model, revenues will vary See previous slide
Preparing source dataRuntime glossary preparationrevenues will vary. See previous slide.
Additional revenues from client getting more ROI and willing to invest in new languages.
Runtime glossary preparationNon‐translatable terms definitionPost editingRecurring revenues from words translated
Copyright © 2011, Asia Online Pte Ltd
Expand ExistingRelationships
Expand ExistingRelationships
Added Functionality
Added FunctionalityRelationshipsRelationships FunctionalityFunctionality
• Opportunities to translate additional material for markets that may not have
• Expand service offerings with new features such as multilingual customer y
been cost viable with a human only approach
• Reuse custom MT for multiple purposes
gsupport
• Integrate machine translation into existing client technologies, products p p p
• Enable clients to better compete in markets that were only partially addressed due to cost and time
and services
Revenue Revenue
addressed due to cost and time
Preparing source dataTerminology definition
Same as for Expanding Existing RelationsTerminology definitionNon‐Translatable termsUnknown and high‐frequency phrase resolutionPost editing
Additionally able to charge various service fees relating to the new services offered. For example, translating common Q&A for
Copyright © 2011, Asia Online Pte Ltd
Recurring revenues from words translatedOne time revenues from resale of customization
customer support and a commission on integrated multilingual support products.
• Chat• Discussions• Translation ofTranslation of other content
Copyright © 2011, Asia Online Pte Ltd
Copyright © 2011, Asia Online Pte LtdCopyright © 2011, Asia Online Pte Ltd
Whether human or machine, where there is translation, there will be errorsWhether human or machine ‐
, ,where there is translation, there will be errors
Machine translation k l herrors make me laugh.
BUTHuman translation
Copyright © 2011, Asia Online Pte Ltd
errors make me cry!
Engine OutputSystran La TA no es no majican en tekt de la salsa del baady jGoogle MT no hay majican en salsa de tekt baadAsia Online TA ain't no tekt baad majican en salsa
Quality Input Delivers Quality OutputInput: MT cannot perform magic on bad source textInput: MT cannot perform magic on bad source text.Output: TA no puede hacer magia en el texto de fuente mala.
GIGOGIGOGarbage In
Copyright © 2011, Asia Online Pte Ltd
Garbage Out
Bad source text will result in bad translations.GIGO G b I G b O
E amples of Common Errors
GIGO – Garbage In, Garbage Out
• Bad end of sentence formatting (merged or missing periods)
Examples of Common Errors
...jumped over the lazy dog.Mary had a little lamb...
...jumped over the lazy dog Mary had a little lamb...
• Bad spelling and grammar (often from OCR or typing without spell check)
The quiak brown faux jumped over the lazy fog.
• Sentences broken by line breaks
Twas the the night christmas, when all threw the house
The quick brown
Copyright © 2011, Asia Online Pte Ltd
y e qu c b ofox jumped over thelazy dog.
If you wanted to evaluate a sports car, which of the following vehicles would you test drive?
Ag y
B C
Traditional Consumer Mainstream High Performance
Systran, ProMT Microsoft, Google Asia Online
Gi tGi t PublicationPublicationGistDesigned to give a general understanding of a document. Some
basic customization is available by adding dictionaries and l b t th i l t l f t l
GistDesigned to give a general understanding of a document. Some
basic customization is available by adding dictionaries and l b t th i l t l f t l
PublicationDesigned to deliver translated output that requires the least
amounts of edits in order to publish
PublicationDesigned to deliver translated output that requires the least
amounts of edits in order to publishsamples, but there is no real control of style or grammar.samples, but there is no real control of style or grammar. amounts of edits in order to publish.amounts of edits in order to publish.
Machine translation systems cannot magically understand who your customer is, what vocabulary you prefer or what writing style you desire
Copyright © 2011, Asia Online Pte Ltd
, y y p f g y y– all are required in order to publish.To achieve this goal, a MT system must be customized to match.
ESTranslated text can be stylized based on the style of the Monolingual data.
News paper ENpaper article
Bilingual Data Monolingual Data
Text written in the style of
business Millions of Sentence Pairs Business News
The EconomistNew York TimesForbes
news
ENChildren’s BooksHarry PotterRupert the BearFamous Five
Text written in the style of children’s Possible Vocabulary Famous Five
Spanish Original Before Translation:
Se necesitó una gran maniobra política muy prudente a fin de facilitar una cita de los dos enemigos históricos
booksy
Writing Style & Grammar
Before Translation: cita de los dos enemigos históricos.
Business News After Translation:
Significant amounts of cautious political maneuvering were required in order to facilitate a rendezvous between the two bitter historical opponents.
Copyright © 2011, Asia Online Pte Ltd
Children’s Books After Translation:
A lot of care was taken to not upset others when organizing the meeting between the two long time enemies.
Pre‐Processing RulesPre‐Processing Rules
Language Studio™ Hybrid Rules and SMT Engine ModelStatistical Machine TranslationStatistical Machine Translation Post‐Processing RulesPost‐Processing RulesPre Processing RulesPre Processing Rules
• Open Process• Fully CustomizableC l t C t l
• Open Process• Fully CustomizableC l t C t l
• Open Process• Fully CustomizableC l t C t l
Statistical Machine TranslationStatistical Machine Translation Post Processing RulesPost Processing Rules
• Complete Control• No Limits
• Complete Control• No Limits
• Complete Control• No Limits
Rules prepare the source text for better translation
Rules fine tune the SMT output and offer
Competitors “Hybrid” Rules and Corrective Statistical Engine Model
and guide the SMT processp
additional flexibility
Competitors Hybrid Rules and Corrective Statistical Engine ModelGet it right the first time!!• Black Box • Black Box
Statistical Correction of Rules ErrorsStatistical Correction of Rules ErrorsTranslation RulesTranslation Rulesf
Why treat the symptoms when
• Black Box• Where “Magic” Happens• Limited Control
• Black Box• Where “Magic” Happens• Limited Control
• Phrases Only• Dictionaries Only
Copyright © 2011, Asia Online Pte Ltd
y pyou can treat the cause?Band‐aid to fix a broken solution
Phrases Only• 40,000 Phrase Limit
Dictionaries Only• 20,000 Term Limit
Data CleaningData Preparation
Translate
Data Collections
Training
CollectionsDiagnostics and Fine Tuning
QualityAssurance
Language Pair Foundation Data
Copyright © 2011, Asia Online Pte Ltd
Original Translation Sources
g g
Domain Foundation Data
If you wanted to write an article on global monetary policy for Forbes, which specialist would your hire?
A B C
John Smith, Financial Analyst
Bruno Mann, Ex‐Pro Footballer
Kerry Adams, Mechanical Engineer
You would choose the specialist with skills in the domainIf you wanted to translate the same article into Chinese,
Mother tongue
which translator would you hire?A B C
Domain experience
John Smith, American, Financial Analyst, Learning Chinese
Brenda Jones, British, Studying Chinese History
Faye Wong, Chinese,Graduated Business & Finance at Harvard
Copyright © 2011, Asia Online Pte Ltd
High Quality Translation – Human and Machine Both Require Specialization
Financial Analyst, Learning ChineseStudying Chinese History Graduated Business & Finance at Harvard
Clean and Consistent DataA statistical engine learns from data in the training corpus. Language Studio Pro™ contains many tools to help ensure that the data is scrubbed clean prior to training.
Controlled DataFewer translation options for the same source segment, and “clean” translations lead to better foundation patternsclean translations lead to better foundation patterns.
Common DataHi h d t l i th bj t i f t ti ti lHigher data volume, in the same subject area, reinforces statistical relationships. Slight variations of the same information add robustness to the systems.
Current DataEnsure that the most current TM is used in the training data.
Copyright © 2011, Asia Online Pte Ltd
Outdated high frequency TM can have an undue negative impact on the translation output and should be normalized to current style
Copyright © 2011, Asia Online Pte Ltd
• Data– Gathered from as many sources as
possible
Dirty Data SMT ModelDirty Data SMT Modelpossible.
– Domain of knowledge does not matter.– Data quality is not important. – Data quantity is important.q y p
• Theory – Good data will be more statistically
relevant.relevant.
• DataGathered from a small number of
Clean Data SMT ModelClean Data SMT Model – Gathered from a small number of trusted quality sources.
– Domain of knowledge must match target
– Data quality is very important.– Data quantity is less important.
• Theory
Copyright © 2011, Asia Online Pte Ltd
y– Bad or undesirable patterns cannot be
learned if they don’t exist in the data.
• Bad translations• Out of domain text
U b l d / Bi d• Unbalanced / Biased– Too much text from other domains
• Mixed / Wrong language• Junk and noise• Junk and noise• Broken HTML• Mixed Encoding• Missing diacritics• Missing diacritics
– café vs. cafe• OCR Text• Machine translated textMachine translated text• Anything that is not high
quality and in domain
Put Simply: If a bad pattern does not exist in your training data you cannot generate such a
Copyright © 2011, Asia Online Pte Ltd
your training data, you cannot generate such a bad pattern as translation output.
English Source Human Translation Google Translation Google ContextI went to the bank Fui al banco Fui al banco Bank as in financeI went to the bank to Fui al banco para depositar Fui al banco a depositar Bank as in financeI went to the bank to deposit money
Fui al banco para depositardinero
Fui al banco a depositarel dinero
Bank as in finance
I went to the bank of h i
Fui en coche a la i li ió d l l
Fui a la orilla de la vueltai h
Bank as in river bankthe turn in my car inclinación de la vuelta en mi coche
I put my car into the bank of the turn
Puse mi coche en la inclinación de la vuelta.
Pongo mi coche en el banco de la vuelta
Bank as in finance
I swam to the bank of the river
Nadé en la orilla del río Nadé hasta la orilla del río
Bank as in river bank
I bankedmy money Depositémi dinero Yo depositadomi dinero Banked as in financey y p pI bankedmy car into the turn
Inclinémi coche en la vuelta
Yo depositadomi cocheen la vuelta
Banked as in finance
I bankedmy plane into Inclinémi avión en para Yo depositado en mi Banked as in financeI bankedmy plane into a steep dive
Inclinémi avión en parauna zambullida.
Yo depositado en mi avión en picada
Banked as in finance
The above examples show that Google is biased towards the banking and finance domainIssue:
Copyright © 2011, Asia Online Pte Ltd
There is much more multilingual banking and finance data available to learn from than there is aeronautical or water sports data available.Cause:
Mixing a focused set of Mixing a wide variety ofNot RecommendedRecommended
Mixing a focused set of bilingual domain data together
• Different sources are ok
Mixing a wide variety of bilingual domain data together
Do not mistake somewhat Providing large enough monolingual data to support grammar structures
related content as content in the same domain
• E g Anti virus is more in thegrammar structures • E.g. Anti‐virus is more in the security domain than the IT domain
Example of translated output influenced by anti‐virus
( i d i ) i d i IT d i Providing insufficient monolingual data to support grammar structures
Protect your documents
text (security domain) mixed into IT domain
grammar structures• The more variety in bilingual data, the more monolingual data ill b i d
Copyright © 2011, Asia Online Pte Ltd
will be required. Chroni komputer przed dokumentów(Protects your computer from documents)
Copyright © 2011, Asia Online Pte Ltd
• Typically about 10‐20 examples for each clean word of phrase.
• Each correction has statistical relevance and
• Large volumes of dirty data prohibits manual correction.
• Individual corrections are not statistically• Each correction has statistical relevance and impact can be clearly seen.
• Corrections usually involve adding data to fill gaps.
• Individual corrections are not statistically relevant.
• Manual corrections must compete against 1,000’s of bad examples. Impractical to create g p
• Far less correction of actual errors.• Clean data means cause of errors can be
understood and corrected.
, p penough examples manually.
• Understanding the cause of errors is difficult.• Slows training and overall processing time.
• Concordance used to create unbiased examples/phrases and ensure scope covered.
Requires more resources to process excess data.
• Only solution is to acquire more dirty data d h bl i fi d B
Copyright © 2011, Asia Online Pte Ltd
and hope problem is fixed. But may get worse or cause new errors.
Competitors Statistical MT
Language Studio Allows:• Automated identification of areas of weakness• Post Editing Feedback focusing directly on areas of
weakness• Get more dirty data• Human translate more data• Automated error pattern analysis and correction
• Analysis and Resolution of Unknown Words• Determination and resolution of high frequency
phrases
• Human translate more data
Competitors Rule Base MTp
• Terminology Extraction• Balancing Bilingual Phrases against Monolingual Data• Runtime glossary• Runtime spelling dictionaryRuntime spelling dictionary• Pattern handling and adjustments• Incremental Improvement Training• Automated Quality Measurement• Human Quality Measurement
• Add dictionary entries (limit 20K words)
Copyright © 2011, Asia Online Pte Ltd
• Human Quality Measurement • Quality Confidence Scores for each segment • Train a language model to fix broken
rules output (limit 40K phrases)
Example of Training DataMore initial data provided for training results in greater vocabulary and grammatical coverage above the
Sufficient Data Threshold
Volume
Sufficient Data Threshold and less post editing feedback required.
Data Shortfall
Post Edited Feedback andGenerated Data to Fill Gaps
Data V
T i i d f h i d f d i h
Gaps in Topic Coverage
• Training data can often have gaps in coverage and an excess of data in other areas.• Gaps in coverage reduce translation quality. • Gaps can quickly be filled via post editing the machine translated output and submitting
the data back to the system for further learningthe data back to the system for further learning.• Many gaps can be filled with monolingual data only.• Further gaps can be identified and resolved by analyzing the text that is to be translated
for high frequency terms and unknown words
Copyright © 2011, Asia Online Pte Ltd
g q y• In some cases incorrect data may be statistically more relevant. Post editing will raise the
relevance of the correct grammar.
Before Machine TranslationBefore Machine TranslationSource text is processed and modifiedPre‐Translation Corrections (PTC)‐ A list of terms that adjust the source text fixing common issues and
Source text is processed and modified.
text fixing common issues and making it more suitable for translation.Runtime Glossary (RTG)A list of bilingual terms that are used to
After Machine TranslationAfter Machine Translation‐ A list of bilingual terms that are used to ensure terminology is translated a specific way.Non Translatable Terms (NTT)
Target text is processed and modified.
Post Translation Adjustment (PTA)A list of terms in the target language thatNon‐Translatable Terms (NTT)
‐ A list of monolingual terms that are used to ensure key terms are not translated
‐ A list of terms in the target language that modify the translated output. This is very useful for normalization of target terms.
translated.
Each of the above runtime customizations can be applied in 2 forms:D f l A li d ll j b
Copyright © 2011, Asia Online Pte Ltd
Default: Applied to all jobs.Job Specific: A different set of customizations can be applied for different clients.
Initial System put into production p
Changes are collected and added to initial corpus to drive continuous retraining
Trained Internal Experts begin initial clean up and correction process
All users allowed to suggest changes which goes through vetting process
Expert Users also allowed to make changesg p g
Publication Quality Target
Post Editing EffortQua
lity
Raw MT Quality
PostPost‐‐editing effort and cost can be managed by editing effort and cost can be managed by improving the quality and performance of the improving the quality and performance of the MT engine via corrective linguistic feedback MT engine via corrective linguistic feedback
Copyright © 2011, Asia Online Pte LtdEngine Learning Iteration
1 2 543 6
CorrectMi l i
Human Feedback
Targeted Corrections
Key
MistranslationSyntax/GrammarTerminology
S lli d
Targeted Corrections of Bad Learning
CorrectgySpellingPunctuation
Spelling and Terminology
Correct
Initial SystemCorrect
Correct
Correct
Copyright © 2011, Asia Online Pte Ltd
Human Feedback can raise the raw output to previously unseen quality levels
Human OnlyHuman Only MT + HumanMT + HumanMT OnlyMT Only
• Terminology Definition• Non‐Translatable Terms• Historical Translations
l id
• Terminology Definition• Non‐Translatable Terms• Historical Translations
S l G id **
• Terminology Definition• Non‐Translatable Terms• Historical Translations
St l G id **• Style Guide• Quality Requirements• Translate
Edit
• Style Guide**• Quality Requirements• Customize MT• Translate
• Style Guide**• Quality Requirements• Customize MT• Translate• Edit
• Proof• Project Management
• Translate• Edit• Proof• Project Management
• Translate• Edit• Proof• Project Management • Project Management
When preparing for a high quality human translation project, many core steps are performed in order to ensure that the writing style
• Project Management
Almost identical information and data is required in order to
core steps are performed in order to ensure that the writing style and vocabulary are designed for a customers target audience.
Copyright © 2011, Asia Online Pte Ltd
Almost identical information and data is required in order to customize a high quality machine translation system
Sajan's experience with MT customizationIñ ki H á d LIñaki Hernández‐Lasa
Copyright © 2011, Asia Online Pte LtdCopyright © 2011, Asia Online Pte Ltd
B i S i f B ildiBusiness Strategies for Building Strategic Advantage and Revenue
from Machine Translation
Dion WigginsChief Executive OfficerChief xecutive [email protected]
Copyright © 2011, Asia Online Pte LtdCopyright © 2011, Asia Online Pte Ltd