GALA Webinar, September 2013
DESCRIPTION
Pangea Machine Translation platform from Pangeanic. A product presentation by Manuel Herranz, Elia Yuste and Andi Frank showcasing the best of automated cleaning cycles, automated engine retraining and machine translation engine creation.
TRANSCRIPT
PangeaMT Manuel Herranz – Elia Yuste – Alex Helle – Andi Frank
User-Empowering, Data-Driven, In-Domain Machine Translation
#pangeanic E: [email protected]
AGENDA
• Industry reflections
• Pangeanic PangeaMT
• Customization as Key Initial Servicing Step of our MT Offering
• All about the PangeaMT Platform
– Featuring Highlights and Demo
– API: CAT Environment Integration (Demo)
• Q&A Round
• GALA Marketplace Offer
COST OF TRANSLATION (price/w) vs DEMAND
[Chart: cost of translation (price per word) against demand, in 10-year steps: 1995, 2005, 2015]
• Price per word a valid model?
• Is there an explanation?
• What can we do about it? Is there a future for the Language Industry?
• Unique to this industry?
MASSIVE AMOUNTS OF DATA – IS LANGUAGE BUSINESS MANAGEABLE?
[Chart: world's data in TB / EB against typical translation volume, 1990 to 2015]
Why Machine Translation?
As of May 2009: 487 billion gigabytes, or 1,000,000,000 * 487,000,000,000 = 4.87 x 10^20 bytes
Estimates: up 50% a year (Oracle); doubles every 11 hours (IBM)
Humankind has stored more than 295 billion gigabytes (or 295 exabytes) of data since 1986 (ComputerWorld, 2011)
Researchers at the University of California, Berkeley, found that the amount of data generated from the dawn of time through 2002 was about 5 exabytes.
Why Machine Translation? The Data Deluge
As Content Volume Explodes, Machine Translation Becomes an Inevitable Part of Global Content Strategy http://ow.ly/jVuhZ
In 2011, it took about two days for the world to create the same 5 exabytes of data that it had taken humankind eons to generate.
In 2013, it took the world just 10 minutes to create 5 exabytes.
Eric Schmidt: Every 2 Days We Create As Much Information As We Did Up To 2003 (TechCrunch, 2010)
The sixth power of 1,000 = 10^18
1 EB = 1,000,000,000,000,000,000 B = 10^18 bytes = 1,000 petabytes = 1 billion gigabytes.
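The unit arithmetic on these slides can be checked with a few lines of Python. This is only a sketch: the constant and function names are made up for illustration, and decimal (SI) units are assumed throughout, as on the slide.

```python
# Decimal (SI) storage units, as used on the slide.
GIGABYTE = 10**9    # bytes
PETABYTE = 10**15   # bytes
EXABYTE = 1000**6   # bytes, the sixth power of 1,000


def gigabytes_to_bytes(gigabytes: int) -> int:
    """Convert a size in gigabytes to bytes."""
    return gigabytes * GIGABYTE


def bytes_to_exabytes(size_in_bytes: int) -> int:
    """Convert a size in bytes to whole exabytes (integer division)."""
    return size_in_bytes // EXABYTE


# May 2009 figure from the slide: 487 billion gigabytes.
may_2009_bytes = gigabytes_to_bytes(487 * 10**9)

print(may_2009_bytes)                     # total size in bytes
print(bytes_to_exabytes(may_2009_bytes))  # the same size in exabytes
print(EXABYTE // PETABYTE)                # petabytes per exabyte
print(EXABYTE // GIGABYTE)                # gigabytes per exabyte
```

Integer arithmetic is used on purpose so that values of this size are not rounded by floating point.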
Where is data stored?
What can I do with MT?
Machine Translation application, NEW usage and success depend on:
• Output format
• Sports, Politics, Social, etc.
MT for assimilation: “gisting” or “understanding”
• Practically unlimited demand; but free web-based services reduce incentive to improve technology
• Coverage more important; instant quality
MT for dissemination: “publication”
• Publishable quality that can only be achieved by humans. MT & tools a productivity booster
MT for direct communication
• Current R&D; military uses systems for spoken MT; first applications for smartphones, online help, multilingual chat systems
Short history
• Pangeanic: LSP. Major clients in Asia, European localization, increasing number of languages
• Need to produce translation faster, cheaper…
• Experimenting with some RBMT systems
• TAUS & TDA founding members
• Partnering with Valencia's Computer Science Institute & Prof. F. Casacuberta / E. Vidal Research Team
• Commercial implementations of PangeaMT systems at client side: SONY EUROPE, SYBASE, LSPs…
Milestones
• EU Post-editing contract 2007 (RBMT output)
• Euromatrix mention
• AMTA 2010
• AAMT 2011/12 (JP Hybridization and MT DIY)
• 1st commercial platform 2010
• DIY 2011 (automated re-training cycles)
• SaaS Power, LocWorld Paris 2012
• Improved automated cleaning cycles, online automated training
• Regional EU R&D Funds (“Feder” x 3: 2009-2011) & Marie Curie EXPERT Project
Customization by the PangeaMT Team
Key to achieving better qualitative results later
• Top-notch human and automated service
• Focused on the Client from day one!
• Prior to 1st-time Engine Delivery, prior to Platform Deployment (production)
• Customization concentrates on data and best engine consultancy
• Data cleaning and enhancement
• The impact of glossaries (in-domain, client-/product-specific…)
• Reporting (your data was like this… now let’s do this)
• Training
Pangeanic tests all the development features in-house at a TRANSLATION DEPARTMENT BEFORE RELEASE.
Getting the data right: Automated cleaning and preparation
• TMX data: bilingual XML with inline tags/markup
• Cleanup: Entities (XML entities like © etc.)
• Cleanup: Characters (invalid characters)
• Cleanup: Tags (remove: <ph> etc.)
• Conversion: two plain text files
• Moses Cleanup: Segments (empty lines, sentence ratio wrong)
• Tokenization (example: "By default," → "By default ,")
• Lower-casing (example: "HOUSE" → "house")
• Result: two aligned text files, no tags, lower-cased
• MT engine training
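The per-segment steps above (entities, characters, tags, tokenization, lower-casing) could be sketched roughly as follows. This is not the PangeaMT implementation: the function name, the regular expressions and the crude tokenizer are assumptions made for illustration, and tags are stripped before entities are decoded so that an escaped "&lt;" is not mistaken for markup.

```python
import html
import re
import unicodedata

# Inline TMX elements whose content is formatting code, not translatable text
# (hypothetical selection: bpt, ept, ph, it).
INLINE_RE = re.compile(r"<(bpt|ept|ph|it)\b[^>]*>.*?</\1>", re.DOTALL)
# Any other leftover tag.
ANY_TAG_RE = re.compile(r"<[^>]+>")
# Crude tokenizer: runs of word characters, or single punctuation marks.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def clean_segment(segment: str) -> str:
    """Return a tag-free, tokenized, lower-cased version of one segment."""
    # Cleanup: Tags (inline elements are dropped together with their content).
    text = INLINE_RE.sub(" ", segment)
    text = ANY_TAG_RE.sub(" ", text)
    # Cleanup: Entities (decode &amp;, &copy; and similar).
    text = html.unescape(text)
    # Cleanup: Characters (drop control characters, keep whitespace).
    text = "".join(
        ch for ch in text
        if unicodedata.category(ch) != "Cc" or ch.isspace()
    )
    # Tokenization, then lower-casing.
    tokens = TOKEN_RE.findall(text)
    return " ".join(tokens).lower()


print(clean_segment("By default, <ph>{1}</ph>the HOUSE &amp; garden"))
```

A production pipeline would normally use a dedicated tokenizer; the regular expression here splits numbers such as 20.1 into three tokens, which is usually not what you want.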
Don’t forget data cleaning!!!
<tu srclang="en-GB"><tuv xml:lang="EN-GB"><seg>A system for recovering the methane that is emitted from the manure so that it does not leak into the atmosphere.</seg></tuv><tuv xml:lang="FR-FR"><seg>Système permettant de r€ pérer le méthane qui se dégage de l'engrais naturel d'origine animale de sorte qu'il ne se dissipe pas dans l'atmosphère.</seg></tuv>
<tu creationdate="20090817T114430Z" creationid="APIACCESS" changedate="20110617T141159Z" changeid=“pat"><tuv xml:lang="EN-US"><seg>Overall heigtht –<bpt i="1">{\f43 </bpt> <ept i="1">}</ept>25"; width –<bpt i="2">{\f43 </bpt> <ept i="2">}</ept>20.1".</seg></tuv><tuv xml:lang="ES-EM"><seg><bpt i="1">{\f2 </bpt>Altura total - 25"; anchura <ept i="1">}</ept>–<bpt i="2">{\f43 </bpt> <ept i="2">}</ept><bpt i="3">{\f2 </bpt>20,1".<ept i="3">}</ept></seg></tuv></tu>
<tuv xml:lang=“EN-US"><seg>On 22nd May we decided not to join the group.</seg><tuv xml:lang=“DE-DE"><seg>Am 22. </seg>
<tu srclang="en-GB"><tuv xml:lang="EN-GB"><seg>The President of the United States visited Costa Rica.</seg></tuv><tuv xml:lang=“ES-ES"><seg>El Presidente de los Estados Unidos, el señor Obama y su esposa la señora Michelle, visitaron Costa Rica el pasado sábado.</seg></tuv>
<tuv xml:lang=“JP"><seg>同書は「通訳・翻訳キャリアガイド」の 2011-2012 年度版。英字新聞のジャパンタイムズ社が強みとするジャーナリスティックな視点で、通訳や翻訳という仕事が持つ魅力ややりがい、プロに要求されるスキルおよび意識の持ち方などを紹介。また通訳者・翻訳者になるための道すじから、実際の仕事の現場にいたるまで、今日の通訳・翻訳業界の実像を包括的に紹介。 </seg><tuv xml:lang=“EN-US"><seg>It is a journalistic point of view and strengths of the English-language newspaper Japan Times. It includes a description of the exciting and rewarding work of translation and interpretation, as well as the introduction of consciousness and how to acquire the required professional skills. The road to becoming a translator and interpreter also down to the actual work site, a comprehensive guide to interpreting the reality of today'stranslation industry. </seg>
Engine training with clean data
Having approved, terminologically sound, clean data improves engine accuracy and performance even with small sets of data.
DATA CLEANING CYCLE (AUTOMATED)
Parallel text extraction / Translation input / Post-edited material
This often comes from CAT tools, document alignments or crawling.
Data Cleaning (in-lines)
Remove all non-translation data.
Data cleaning modules
Remove any “suspects”:
• Sentences that are too long
• Mismatches (of many kinds!)
• Terminological inaccuracies
• Non-useful segments, etc.
TMX Human approval
Some of this material may actually be OK for training. It is then input in the training set.
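A minimal sketch of how "suspect" segment pairs might be filtered out automatically, covering only empty segments, over-long sentences and length mismatches. The function names and the thresholds (80 tokens, ratio 3.0) are illustrative assumptions, not values taken from PangeaMT.

```python
from typing import Iterable, List, Tuple


def keep_pair(source: str, target: str,
              max_tokens: int = 80, max_ratio: float = 3.0) -> bool:
    """Return True if a source/target pair looks usable for training."""
    src_tokens = source.split()
    tgt_tokens = target.split()
    # Empty segment on either side.
    if not src_tokens or not tgt_tokens:
        return False
    # Sentences that are too long.
    if len(src_tokens) > max_tokens or len(tgt_tokens) > max_tokens:
        return False
    # Length mismatch: one side is much longer than the other.
    longer = max(len(src_tokens), len(tgt_tokens))
    shorter = min(len(src_tokens), len(tgt_tokens))
    return longer / shorter <= max_ratio


def filter_corpus(pairs: Iterable[Tuple[str, str]],
                  max_tokens: int = 80,
                  max_ratio: float = 3.0) -> List[Tuple[str, str]]:
    """Keep only the pairs that pass keep_pair; the rest are 'suspects'."""
    return [(s, t) for s, t in pairs if keep_pair(s, t, max_tokens, max_ratio)]


corpus = [
    ("On 22nd May we decided not to join the group.", "Am 22."),
    ("By default", "Por defecto"),
    ("", "Texto sin origen"),
]
print(filter_corpus(corpus))
```

The truncated German segment from the earlier slide is rejected by the ratio check (10 tokens against 2). A purely length-based filter will not catch every mismatch, for example a translation that adds information but stays within the ratio, which is one reason the cycle includes a human approval step for rejected material.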
A Success Story
Sony Professional Europe, Salomé Lopez-Lavado
- Needs: improve publication (French, Italian, Spanish)
- 8M words training set
- Time-to-market: from 3 days down to 1.5 days (html, InDesign)
- Outsourcing cost: -20%
- Volume: 1.5M words/year
Japanese Automotive manufacturer
- Spanish
- 8M words/year
- Time to market reduced by 2-3 weeks: from 8 weeks to 6 or 5 weeks
- Team of 17 freelancers down to 4-7 post-editors
- Outsourcing cost: -30%
Spanish LSP working for banking sector
- Spanish
- 1-2M words/year
- Time to market: 1 week to 2 days!!!!
- Docx, html, tmx
- Down from 2-3 in-house staff and 2-3 freelancers to 2 in-house!!!
http://ow.ly/peuFD
Successfully applied (3rd-party applications/beneficiaries)
Use Case -
✔Even with small data sets!!
• PangeaMT can be self-hosted when data security is critical (all processes internal to the organization)
- commercially sensitive data
- financial, legal, institutional
- intelligence, knowledge-gathering
- product pre-release, etc.
• Control Panel + full system statistics
• Re-trainings and updates by the client for data privacy / more accuracy
Potential Uses of Machine Translation
• Information discovery: patents, unknown documents
• Automatic, on-demand creation of foreign language versions / web apps – keyword testing
• Multilingual crawling, data discovery
• Pre-translation
Other Potential Uses of Machine Translation
Polling Questions to Audience
Platform overview
• 24/7 control over your data and engines
• Secure, robust and scalable
• User focused (permissions and empowering capabilities)
• API linked, if need be
• Enabled us to offer an extraordinarily flexible business model
- SaaS
- SaaS Power (online DIY, re-trainings included)
- Full Power (PLATFORM OWNERSHIP)
PangeaMT System – Domain Creation
PangeaMT System – Data Cleaning
PangeaMT System – Engine Creation
PangeaMT System – Engine Training
PangeaMT API – SDL Plugin Demo Time (Video file)
Myth: MT will never be as good as humans
“We cannot solve the problem using the same tools and the way of thinking that created it” A. Einstein
uhmmm, it is going to get really good...
1st stage: We are creating usable engines, first PE experiences. 2009-2015 or 2020
2nd stage: PE material and more data make engines even more predictable. More specialist engines
3rd stage: Beyond 2030... no predictions
GALA Marketplace Offer
Free Consultancy and Custom Engine Piloting Period
October-November 2013