taus mt showcase, strategies for building competitive advantage and revenue from machine...
DESCRIPTION
This presentation is a part of the MosesCore project that encourages the development and usage of open source machine translation tools, notably the Moses statistical MT toolkit. MosesCore is supported by the European Commission Grant Number 288487 under the 7th Framework Programme. For the latest updates, follow us on Twitter - #MosesCore
TRANSCRIPT
TAUS MACHINE TRANSLATION SHOWCASE
Strategies for Building Competitive Advantage and Revenue from Machine Translation
14:40 – 15:00, Wednesday, 10 April 2013
Dion Wiggins, Asia Online
Copyright © 2013, Asia Online Pte Ltd
Dion Wiggins, Chief Executive Officer [email protected]
Business Strategies for Building Strategic Advantage and Revenue from Machine Translation
• Human Resources
  – Linguistic
    • Language / Translation
    • Natural Language Processing (NLP)
  – Technical
    • Operating System
    • Software installation and support
  – Programming
    • Tailoring to needs of the business
    • Integration with other tools and platforms
• Infrastructure
  – Hardware
    • Hosted, purchased
  – Software
    • Licensed, Hosted, Open Source
• Data Requirements
  – Third party
    • Free, Commercial
  – Internal data
  – Data manufacturing
  – Clean vs. Dirty Data SMT
  – Rules vs. SMT vs. Hybrid
• Skill Development
  – Hosted - basic skills
  – Onsite Moses – comprehensive
• TMS / Workflow Integration
  – Pre-built, custom development
• Document Format Support
  – Wide, limited
• Translation Costs
  – Monthly fee, per word, human resources
• Customization Costs
  – Up front, embedded in translation costs, human resources
• Management Costs
  – Oversight, improvement
• Control
  – Extensive, limited
• Data Security
  – Contract, internal
• Project Type
  – Language Pair
  – Domain
• Risk
  – Managed by expert
  – Managed by your team
  – Likelihood of failure
• Time to Quality
  – Trained by professionals, learned skills
• Cost of Post Editing
  – Higher quality MT should result in lower cost of editing
Machine Translation (MT): 50 Years of eMpTy Promises
Q: Why does an industry that has spent 50 years failing to deliver on its promises still exist?
A: An infinite demand – a well-defined and growing problem that has always been looking for a solution – what was missing was …
Quality
Control
Focus
1. Customize: Create a new custom engine using foundation data and your own language assets.
2. Measure: Measure the quality of the engine for rating and future improvement comparisons.
3. Improve: Provide corrective feedback, removing potential for translation errors.
4. Manage: Manage translation projects while generating corrective data for quality improvement.
Quality requires an understanding of the data.
There is no exception to this rule.
1. Click Training Data tab.
2. Click on Upload and select TMX files.
3. Click Training Data tab.
4. Click Build
Some even brag that it is this simple.
“Seriously, that’s it!”
Perhaps it should have been
“Seriously, that’s it????”
Flaws in the One Button Instant MT Approach
• Simply upload your data and magic happens to create a custom MT engine in hours/minutes.
• Seriously, that’s it!
• MT cannot read your mind.
• It cannot determine which writing style, target audience, formats, vocabulary or capitalization you want.
• It cannot determine what is missing and whether your data is suitable for your goal.
• You don’t know which is the right data
Just Add Water Upload Data
If it was really this easy, don’t you think custom MT success stories would be everywhere?
• Moses case study that describes the effort in detail: http://slidesha.re/KwkdUH
• Summary:
  – Needs expert programmer, expert project manager
  – Requires very powerful hardware
  – Large amounts of software development
  – TAUS Data Association membership EUR 15,000 for data
  – 360 man hours to set up first pilot
  – Multi-year effort with considerable funding required
  – Translation quality close to that of Bing
“The ready availability of the Moses MT engine under an open source license enables everybody to create statistical MT engines from parallel data with a moderate amount of effort.”
“With self-serve MT, clients without the necessary MT and computing expertise to install Moses themselves, have for the first time the ability to build an MT system based on their own user requirements pretty much instantly.”
• Do-it-yourself Moses and Self Service Moses primarily target and solve the engineering complexity of deploying a basic Moses system
• There are many other technical and data requirements necessary
• Many additional technology components are needed. Some have not yet been developed, such as TMS integration, XML tag handling etc.
For a good blog entry and discussion on this topic see http://bit.ly/rWAxG7
1. What is the right data to upload for my MT system?
2. How should I prepare my data?
3. What cleaning can I do that the magic 1 click button does not do?
4. What impact will my data have on the MT system?
5. Will the data I upload improve or decrease quality?
6. What will mixing data from multiple domains do to my MT system?
7. Should I add some or all of the TAUS data to my system?
8. Once I have a system, how can I make it better?
9. When I see an error in my MT output, how can I know the cause of the error?
10. When I see an error in my MT output, how can I fix the error?
11. …
• Definition
  – Domain
  – Target Audience
  – Preferred Writing Style
  – Glossaries, Non-Translatable Terms, Preferred Capitalization
  – Special Formatting Requirements
  – Quality Requirements
• Data Gathering
  – Source data in domain
  – Bilingual data to support domain
  – Monolingual data to support domain
• Data Analysis
  – Gap analysis
  – High frequency terms
  – Term extraction
• Data Generation
  – Supporting grammar structures
  – Source Data Analysis
• Cleaning of Data
• Tuning and Test Set Preparation
• Diagnostic Engine
  – Fine tuning
Provided by client and gathered from third parties.
• Near human quality automated translation designed for the professional translation industry
  – Many customers have achieved quality levels where more than 50% of raw machine translation requires no editing at all
  – Case studies of customers that have achieved 3 x margin with 1/3 the human resources
  – Regularly replacing competitors' pre-existing installations
• Machine + Human approach delivers higher quality than a human only approach
  – More consistent writing style and more accurate terminology
• Rapid ongoing translation quality improvement
  – Post edited machine translation is fed back to the engine, which learns from its previous errors by analyzing the corrections
  – Live feedback as new content is published
• Enable clients to control preferred terminology, vocabulary and writing style
“We found that 52% of the raw original output from Asia Online had no errors at all – which is great for an initial engine.” – Kevin Nelson, Managing Director, Omnilingua Worldwide
Complete Stylistic Control
Two different output styles for the same input sentence
Spanish Original Before Translation:
Se necesitó una gran maniobra política muy prudente a fin de facilitar una cita de los dos enemigos históricos.
Business News After Translation:
Significant amounts of cautious political maneuvering were required in order to facilitate a rendezvous between the two bitter historical opponents.
Children’s Books After Translation:
A lot of care was taken to not upset others when organizing the meeting between the two long time enemies.
[Diagram: engine hierarchy by language pair (LP), top-level domain and engines/sub-domains. Example shown: EN-ES, Automotive, with Honda and Toyota engines; sub-domain labels include Cars, Motorbikes, Marketing, Service Reports, User Manuals, Engineering and Service Manuals.]
Dirty Data SMT Model
• Data
  – Gathered from as many sources as possible.
  – Domain of knowledge does not matter.
  – Data quality is not important.
  – Data quantity is important.
• Theory
  – Good data will be more statistically relevant.
Clean Data SMT Model
• Data
  – Gathered from a small number of trusted quality sources.
  – Domain of knowledge must match target.
  – Data quality is very important.
  – Data quantity is less important.
• Theory
  – Bad or undesirable patterns cannot be learned if they don’t exist in the data.
• There is no magic in MT; human effort is required.
• The quality of the output and suitability for purpose is directly in proportion to the amount of human effort.
• Without human direction, MT will cost more in the long term and is more likely to fail.

• Bad translations
• Out of domain text
• Unbalanced / Biased
  – Too much text from other domains
• Mixed / Wrong language
• Junk and noise
• Broken HTML
• Mixed Encoding
• Missing diacritics
  – café vs. cafe
• OCR Text
• Machine translated text
• Anything that is not high quality and in domain
Put Simply: If a bad pattern does not exist in your training data, you cannot generate such a bad pattern as translation output.
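Some of the problems on this checklist can be caught with simple automated filters before any human review. The following is a minimal Python sketch, not Asia Online's cleaning process: the function names, the 3:1 length-ratio threshold and the heuristics themselves are assumptions chosen for illustration, and they cover only three of the listed problems (broken HTML, mixed encoding, grossly mismatched segments).

```python
import re

# Typical mojibake: UTF-8 bytes decoded as Latin-1 (e.g. "é" shown as "Ã©").
MOJIBAKE_RE = re.compile("Ã[\u0080-\u00BF]")


def has_broken_html(text):
    """Crude check: unbalanced angle brackets suggest broken markup."""
    return text.count("<") != text.count(">")


def has_mixed_encoding(text):
    """Flag the Unicode replacement character or typical mojibake."""
    return "\ufffd" in text or MOJIBAKE_RE.search(text) is not None


def length_ratio_ok(src, tgt, max_ratio=3.0):
    """Reject empty segments and pairs whose word counts differ wildly."""
    s, t = len(src.split()), len(tgt.split())
    if s == 0 or t == 0:
        return False
    return max(s, t) / min(s, t) <= max_ratio


def keep_pair(src, tgt):
    """Return True if a bilingual segment pair passes every filter."""
    if has_broken_html(src) or has_broken_html(tgt):
        return False
    if has_mixed_encoding(src) or has_mixed_encoding(tgt):
        return False
    return length_ratio_ok(src, tgt)


pairs = [
    ("I went to the bank", "Fui al banco"),
    ("Click <b here", "Haga clic aquí"),   # broken HTML
    ("coffee shop", "cafÃ©"),              # mixed encoding
    ("", "vacío"),                         # empty source
]
clean = [p for p in pairs if keep_pair(*p)]
```

Filters like these only remove obvious noise. Judging whether text is in domain, or whether a translation is simply bad, still needs the human analysis the slides argue for.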
English Source | Human Translation | Google Translation | Google Context
I went to the bank | Fui al banco | Fui al banco | Bank as in finance
I went to the bank to deposit money | Fui al banco para depositar dinero | Fui al banco a depositar el dinero | Bank as in finance
I went to the bank of the turn in my car | Fui en coche a la inclinación de la vuelta | Fui a la orilla de la vuelta en mi coche | Bank as in river bank
I put my car into the bank of the turn | Puse mi coche en la inclinación de la vuelta. | Pongo mi coche en el banco de la vuelta | Bank as in finance
I swam to the bank of the river | Nadé en la orilla del río | Nadé hasta la orilla del río | Bank as in river bank
I banked my money | Deposité mi dinero | Yo depositado mi dinero | Banked as in finance
I banked my car into the turn | Incliné mi coche en la vuelta | Yo depositado mi coche en la vuelta | Banked as in finance
I banked my plane into a steep dive | Incliné mi avión en para una zambullida. | Yo depositado en mi avión en picada | Banked as in finance
Issue: The above examples show that Google is biased towards the banking and finance domain.
Cause: There is much more multilingual banking and finance data available to learn from than there is aeronautical or water sports data available.
• Competitors require 20% or more additional data than the initial training data to show notable improvements.
  – This could take years for most LSPs
  – This is the dirty little secret of the Dirty Data SMT approach that is frequently acknowledged.
• Asia Online has reference customers that have had notable improvements with just 1 day's work of post editing.
  – Only possible with Clean Data SMT
< 0.1% Improvements daily based on edits
Typical Dirty Data SMT engines will have between 2 million and 20 million sentences in the initial training data.
Language Studio Allows:
• Automated identification of areas of weakness
• Post Editing Feedback focusing directly on areas of weakness
• Automated error pattern analysis and correction
• Analysis and Resolution of Unknown Words
• Determination and resolution of high frequency phrases
• Terminology Extraction
• Balancing Bilingual Phrases against Monolingual Data
• Runtime glossary
• Runtime spelling dictionary
• Pattern handling and adjustments
• Incremental Improvement Training
• Automated Quality Measurement
• Human Quality Measurement
• Quality Confidence Scores for each segment
Competitors Statistical MT
• Get more dirty data
• Human translate more data
Competitors Rule Base MT
• Add dictionary entries (limit 20K words)
• Train a language model to fix broken rules output (limit 40K phrases)
• Typically about 10-20 examples for each clean word or phrase.
• Each correction has statistical relevance and impact can be clearly seen.
• Corrections usually involve adding data to fill gaps.
• Far less correction of actual errors.
• Clean data means cause of errors can be understood and corrected.
• Concordance used to create unbiased examples/phrases and ensure scope covered.
• Large volumes of dirty data prohibit manual correction.
• Individual corrections are not statistically relevant.
• Manual corrections must compete against 1,000’s of bad examples. Impractical to create enough examples manually.
• Understanding the cause of errors is difficult.
• Slows training and overall processing time. Requires more resources to process excess data.
• Only solution is to acquire more dirty data and hope problem is fixed. But may get worse or cause new errors.
[Timeline graphic: 1960’s, 1980’s, 1990’s, 2012]
Before Machine Translation
Source text is processed and modified.
• Pre-Translation JavaScript (JS) - Complex pre-processing can be customized via JavaScript.
• Pre-Translation Corrections (PTC) - A list of terms that adjust the source text, fixing common issues and making it more suitable for translation.
• Non-Translatable Terms (NTT) - A list of monolingual terms that are used to ensure key terms are not translated.
• Runtime Glossary (GLO) - A list of bilingual terms that are used to ensure terminology is translated a specific way.
After Machine Translation
Target text is processed and modified.
• Post Translation Adjustment (PTA) - A list of terms in the target language that modify the translated output. This is very useful for normalization of target terms.
• Post Translation JavaScript (JS) - Complex post-processing can be customized via JavaScript.
Runtime customizations can be applied in 2 forms:
• Default: Applied to all jobs.
• Job Specific: A different set of customizations can be applied for different clients.
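The processing order described on this slide can be sketched as a small pipeline. This is an illustration only: `apply_term_list`, `protect_terms` and `run_job` are hypothetical names, `translate` is a stand-in for whatever MT engine is called, and the placeholder-substitution trick is an assumption about how NTT and GLO entries might be enforced, not a description of how Language Studio implements them. The JavaScript hooks are omitted.

```python
import re


def apply_term_list(text, replacements):
    """Apply whole-word replacements (used here for both PTC and PTA lists)."""
    for old, new in replacements.items():
        pattern = r"\b" + re.escape(old) + r"\b"
        text = re.sub(pattern, lambda _m, new=new: new, text)
    return text


def protect_terms(text, fixed_output):
    """Swap protected source terms for placeholders.

    fixed_output maps a source term to the text that must appear in the
    target: the term itself for NTT entries, the approved translation for
    GLO entries. Returns the modified text and a placeholder restore map.
    """
    restore = {}
    for i, (term, target) in enumerate(fixed_output.items()):
        placeholder = f"__TERM{i}__"
        pattern = r"\b" + re.escape(term) + r"\b"
        text, count = re.subn(pattern, placeholder, text)
        if count:
            restore[placeholder] = target
    return text, restore


def run_job(source, translate, ptc, ntt, glossary, pta):
    """Run one segment through pre-processing, MT and post-processing."""
    text = apply_term_list(source, ptc)            # Pre-Translation Corrections
    fixed = {term: term for term in ntt}           # Non-Translatable Terms
    fixed.update(glossary)                         # Runtime Glossary
    text, restore = protect_terms(text, fixed)
    target = translate(text)                       # MT engine (stand-in)
    for placeholder, value in restore.items():
        target = target.replace(placeholder, value)
    return apply_term_list(target, pta)            # Post Translation Adjustment


# Toy run: str.upper stands in for the MT engine so the data flow is visible.
result = run_job(
    "Pls restart Language Studio before the export",
    translate=str.upper,
    ptc={"Pls": "Please"},
    ntt=["Language Studio"],
    glossary={"export": "exportación"},
    pta={"PLEASE": "POR FAVOR"},
)
```

Passing a different set of lists per call corresponds to the "Job Specific" form; a shared default set corresponds to "Default".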
[Chart: Words Per Day Per Translator, axis from 0 to 28,000, comparing Human Translation, Typical MT + Post Editing Speed and Fastest MT + Post Editing Speed reported by clients.]
Average person reads 200-250 words per minute. 96,000-120,000 in 8 hours. ~35 times faster than human translation.
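The reading-speed figures follow from simple arithmetic, reproduced in the short Python sketch below. The function name is hypothetical, and the "implied" human translation rate is an inference obtained by dividing by the slide's "~35 times" factor; it is not a number stated on the slide.

```python
def words_read_per_day(words_per_minute, hours=8):
    """Words read in a working day at a constant reading speed."""
    return words_per_minute * 60 * hours


low = words_read_per_day(200)    # 200 words/min * 480 min = 96,000
high = words_read_per_day(250)   # 250 words/min * 480 min = 120,000

# Dividing by the slide's "~35 times faster" factor gives the human
# translation throughput that the comparison implies (an inference).
implied_low = low / 35
implied_high = high / 35

print(f"Reading: {low:,} to {high:,} words per day")
print(f"Implied human translation: {implied_low:,.0f} to {implied_high:,.0f} words per day")
```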
Metrics That Really Count
• Productivity – Words per day per human resource
• Margin – 2-3 times the profit margin is commonplace
• Consistency – Writing style and terminology
  ✓ MT + Human delivers higher quality than a human only approach
• Deals
  ✓ New deals not accessible with a human only approach
  ✓ Deals where you could offer a more competitive bid due to MT than your competitors
  ✓ Deals that would have been lost to a competitor without the advantages that MT offers
Productivity is the Best Quality Metric
Raw MT often has a greater number of errors than first pass human translation.
However: Language Studio™ MT is stylised to a specific domain, customer and target audience, so quality is considerably higher than other MT systems.
This means that:
1. MT errors are easy to see and easy to fix (i.e. simple grammar).
2. MT provides more accurate and consistent terminology than human translators, especially when more than 1 human works on a project.
3. Human errors may be fewer, but harder to see and harder to fix.
Counting the number of errors alone offers no value as a metric, as the complexity of the error is not taken into account.
MT with more errors is often faster to edit and fix than first pass human translations with fewer errors.
Examples of other “Useful” Quality Indicators
Automated Metrics (Good indicators, but not absolute)
• BLEU (Bilingual Evaluation Understudy)
• NIST
• F-Measure (F1 Score or F-Score)
• METEOR (Metric for Evaluation of Translation with Explicit ORdering)
Manual Quality Metrics (Most not designed for MT, more for HT)
• Edit Distance (Does not take into account complexity of edit)
• SAE-J2450 (Industry specific)
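To make the Edit Distance entry concrete, here is a minimal Python sketch of word-level Levenshtein distance between raw MT output and its post-edited version. The function name is hypothetical, and tokenizing on whitespace is a simplifying assumption. The example sentences are the Google and human translations from the bank example earlier in this presentation.

```python
def word_edit_distance(hypothesis, reference):
    """Minimum number of word insertions, deletions and substitutions
    needed to turn hypothesis into reference (Levenshtein distance)."""
    hyp, ref = hypothesis.split(), reference.split()
    # prev[j] holds the distance between the words of hyp processed so far
    # and the first j words of ref.
    prev = list(range(len(ref) + 1))
    for i, hyp_word in enumerate(hyp, 1):
        cur = [i]
        for j, ref_word in enumerate(ref, 1):
            cost = 0 if hyp_word == ref_word else 1
            cur.append(min(prev[j] + 1,          # delete hyp_word
                           cur[j - 1] + 1,       # insert ref_word
                           prev[j - 1] + cost))  # keep or substitute
        prev = cur
    return prev[-1]


mt_output = "Fui al banco a depositar el dinero"
post_edit = "Fui al banco para depositar dinero"
distance = word_edit_distance(mt_output, post_edit)  # 2: a -> para, drop "el"
```

As the slide notes, the count treats every edit as equal: a one-word terminology fix and a one-word change that requires rethinking the whole sentence both score 1.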
Time Margin
Wait for a Project That Requires MT
• Opportunistic approach
• Many LSPs are interested in MT, but not willing to take the plunge without a paying client.
• Limited to one client
• Harder to sell – longer sales cycle
• Often build one language pair to try, before committing to others
Create a Product For Resale
• Proactive approach
• Leverages existing translation assets
• Can be sold to many clients
• Easier to sell - test and show
• Can sell multiple language pairs at the same time
• Generally a higher Return On Investment (ROI)
Revenue
• Recurring revenues from words translated
• One time revenues from resale of customization
• Post editing
• Preparing source data
• Terminology definition
• Non-Translatable terms
• Unknown and high-frequency phrase resolution
Revenue
• Recurring revenues from words translated
• Preparing source data
• Post editing
• Runtime glossary preparation
• Non-translatable terms definition
Reduce Project Costs
• Helps to manage margin squeeze:
  – compete with competitors using cheaper (perhaps lower quality) resources or competitors using MT
• Helps to cost justify business cases that may not be viable using a human only approach
• Can be used behind the scenes (like a translation memory) or disclosed to client
• More client work in other areas as a result of left over translation budget.
Faster Translation Delivery
• New projects that could not have been delivered on due to time and resource constraints
• Helps clients that want to simultaneously ship product in multiple languages
• New clients in research, analysis, data mining and discovery markets
• New clients that need real-time or near real-time translation
Revenue
Depending on project or product model, revenues will vary. See previous slide. Additional revenues from client getting more ROI and willing to invest in new languages.
Revenue
• Preparing source data
• Runtime glossary preparation
• Non-translatable terms definition
• Post editing
• Recurring revenues from words translated
Expand Existing Relationships
• Opportunities to translate additional material for markets that may not have been cost viable with a human only approach
• Reuse custom MT for multiple purposes
• Enable clients to better compete in markets that were only partially addressed due to cost and time
Revenue
• Preparing source data
• Terminology definition
• Non-Translatable terms
• Unknown and high-frequency phrase resolution
• Post editing
• Recurring revenues from words translated
• One time revenues from resale of customization
Added Functionality
• Expand service offerings with new features such as multilingual customer support
• Integrate machine translation into existing client technologies, products and services
Revenue
Same as for Expanding Existing Relationships. Additionally able to charge various service fees relating to the new services offered. For example, translating common Q&A for customer support and a commission on integrated multilingual support products.
Post Editing Effort Reduces Over Time
[Chart: Quality against Engine Learning Iteration (1 to 6), showing Raw MT Quality, Post Editing Effort and the Publication Quality Target.]
The post editing and cleanup effort gets easier as the MT engine improves. Initial efforts should focus on error analysis and correction of a representative sample data set. Each successive project should get easier and more efficient.
Post Editing Cost
[Chart: Cost Per Word against Engine Learning Iteration (1 to 6), comparing Post Editing (Human Translation) with MT Post Editing.]
MT learns from post editing feedback and quality of translation constantly improves. Cost of post editing progressively reduces as MT quality increases after each engine learning iteration.
How Omnilingua Measures Quality
– Triangulate to find the data
– Raw MT J2450 v. Historical Human Quality J2450
– Time Study Measurements
– OmniMT EffortScore™
Everything must be measured by effort first
– All other metrics support effort metrics
– Productivity is key
∆ Effort > MT System Cost + Value Chain Sharing
• Built as a Human Assessment System:
  – Provides 7 defined and actionable error classifications.
  – 2 severity levels to identify severe and minor errors.
• Provides a Measurement Score Between 1 and 0:
  – A lower score indicates fewer errors.
  – Objective is to achieve a score as close to 0 (no errors/issues) as possible.
• Provides Scores at Multiple Levels:
  – Composite scores across an entire set of data.
  – Scores for logical units such as sentences and paragraphs.
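A score of this shape (weighted human-annotated errors, normalized so that 0 means no errors, reported per unit and across a whole data set) can be sketched as follows. This is a hedged illustration, not the metric's published definition: the severity weights are placeholders, the category labels in the example are invented, normalizing by word count and capping at 1.0 are assumptions made so the result stays between 0 and 1, and all function names are hypothetical.

```python
# Placeholder weights per severity level; the slide gives no actual values.
SEVERITY_WEIGHTS = {"severe": 1.0, "minor": 0.5}


def weighted_errors(errors, weights=SEVERITY_WEIGHTS):
    """Sum severity weights over a list of (category, severity) annotations."""
    return sum(weights[severity] for _category, severity in errors)


def segment_score(errors, word_count, weights=SEVERITY_WEIGHTS):
    """Weighted errors per word for one unit (sentence or paragraph)."""
    if word_count <= 0:
        raise ValueError("word_count must be positive")
    return min(1.0, weighted_errors(errors, weights) / word_count)


def composite_score(segments, weights=SEVERITY_WEIGHTS):
    """Weighted errors per word across a whole set of (errors, word_count)."""
    total_words = sum(word_count for _errors, word_count in segments)
    if total_words <= 0:
        raise ValueError("segments must contain at least one word")
    total = sum(weighted_errors(errors, weights) for errors, _wc in segments)
    return min(1.0, total / total_words)


# Three annotated segments: one severe error, none, two minor errors.
segments = [
    ([("wrong term", "severe")], 10),
    ([], 10),
    ([("punctuation", "minor"), ("punctuation", "minor")], 5),
]
per_segment = [segment_score(errors, wc) for errors, wc in segments]
overall = composite_score(segments)
```

Keeping the per-segment and composite calculations separate mirrors the "scores at multiple levels" point: the composite is weighted by segment length rather than being an average of segment scores.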
Factor | Asia Online v. Competing MT System
Total Raw J2450 Errors | 2x Fewer
Raw J2450 Score | 2x Better
Total PE J2450 Errors | 5.3x Fewer
PE J2450 Score | 4.8x Better
PE Rate | 32% Faster
“We found that 52% of the raw original output from Asia Online had no errors at all – which is great for an initial engine.”
“There were far fewer errors produced by the Language Studio™ custom MT engine than the competitor's legacy MT engine. Notably there were fewer wrong meanings, structural errors and wrong terms in the Language Studio™ custom MT engine, that were "typical SMT problems" in the competitor's legacy MT engine.”
“The final translation quality after post-editing was better with the new Language Studio™ custom MT engine than the competitor's legacy MT engine and also better than a human only translation approach. Terminology was more consistent with a combined Language Studio™ custom MT engine plus human post editing approach.”
– Kevin Nelson, Managing Director, Omnilingua Worldwide
• LSP: Sajan
• End Client Profile:
  – Large global multinational corporation in the IT domain.
  – Has its own proprietary MT system, developed over many years.
• Project Goals
  – Eliminate the need for full translation and limit it to MT + Post-editing
• Language Pair:
  – English -> Simplified Chinese.
  – English -> European Spanish.
  – English -> European French.
• Domain: IT
• 2nd Iteration of Customized Engine
  – Customized initial engine, followed by an incremental improvement based on client feedback.
• Data
  – Client provided ~3,000,000 phrase pairs.
  – 26% were rejected in cleaning process as unsuitable for SMT training.
• Measurements:
  – Cost
  – Timeframe
  – Quality
• Quality
  – Client performed their own metrics
  – Asia Online Language Studio™ was considerably better than the client's own MT solution.
  – Significant quality improvement after providing feedback – 65 BLEU score.
  – Chinese scored better than first pass human translation as per client’s feedback and was faster and easier to edit.
• Result
  – Client extremely impressed with result, especially when compared to the output of their own MT engine.
  – Client has commissioned Sajan to work with more languages
70% Time Saving
60% Cost Saving
LRC have uploaded Sajan’s slides and video presentation from the recent LRC conference:
Slides: http://bit.ly/r6BPkT
Video: http://bit.ly/trsyhg
Travel & Leisure Vertical
English to Spanish Language Pair
Custom MT engines built and programmatically consumed
A human post edit step was included in workflow and measurement
Scientific measures of productivity for all phases of process
Base training materials provided and catalogued
Asia Online trained the engine and released it to a diagnostic stage
First pass of new content through diagnostic engine yielded positive results
Asia Online provided advanced data generation technologies to the diagnostic engine through monolingual data crawling, application of runtime rules, and pre-translation adjustments
Even further progress achieved from extracting and applying an industry-specific high frequency term list from the source
58% of segments required no edits
Post Edit Productivity Analysis
Productivity Percentage: 328% Increase
Productivity Rate: 8,208 words a day