
Page 1: Independent evaluation of commercial machine translation

July 2020

In partnership with

Independent evaluation of commercial machine translation engines
Domain: COVID
Language pair: EN-FR

Page 2: Independent evaluation of commercial machine translation

July 2020 © Intento, Inc.

Disclaimer

• The systems used in this report were trained and evaluated from May 8 to May 26, 2020. They may have changed many times since then.
• This report demonstrates the performance of those systems exclusively on the dataset used for this report (English-French, TAUS Corona Crisis Corpus).
• We have run multiple evaluations in the same domain for other language pairs and observed different rankings of the MT systems.
• There is no “best” MT system. Performance depends on how similar your data is to what a provider used to train their baseline models, and on their algorithms.
• Remember: each case is different. There is no one size fits all.

Page 3: Independent evaluation of commercial machine translation


About

The purpose of this research is to help improve the quality of translated COVID-related content across the world. The COVID challenge is global and the correctness of disseminated information is crucial.

Using our expertise in machine translation evaluation, we have assessed several machine translation engines to identify the ones that work best for COVID-related content in different language pairs.

This study is independent and available to all.

The research has been performed on the Corona Crisis Corpus made available by TAUS.

Intento is a vendor-agnostic platform that helps global companies evaluate, procure and utilize the best-fit machine translation.

Page 4: Independent evaluation of commercial machine translation


Intento Enterprise MT Hub
One place to evaluate and manage MT

• Universal API to all MT engines
• Single MT dashboard
• Connects to many CAT, TMS and CMS
• Works with files of any size
• Smart Routing with retries and failovers
• May be deployed on private cloud

Get your API key at inten.to

Page 5: Independent evaluation of commercial machine translation


Overview

1. Benchmark description
2. Reference-based scores
3. Human LQA results
4. How much does it cost?
5. How safe is my data?

• 4 domain-adaptive NMT engines
• 12 stock NMT engines
• 1 language pair: en-fr
• 1 domain: Healthcare

Page 6: Independent evaluation of commercial machine translation


Machine Translation Engines Evaluated (en → fr)

Custom NMT:
• Google Cloud AutoML Translation
• IBM Watson Language Translator v3
• Microsoft Translator API v3
• ModernMT Enterprise API

Stock NMT:
• Alibaba E-commerce Edition
• Amazon Translate
• Baidu Translate API
• DeepL API
• Google Cloud Translation API
• IBM Cloud Language Translator v3
• Microsoft Translator Text API v3
• ModernMT Enterprise API
• PROMT Cloud API
• SYSTRAN PNMT
• Tencent TMT API
• Yandex Translate API

Page 7: Independent evaluation of commercial machine translation


Evaluation methodology

• Clean the data, extract 4,000 segments as a test set, and use the rest for training customized engines.
• Use different metrics to compute the similarity between reference translations and the MT output for stock and customized engines on the 4,000 test segments. Identify the top-performing engines.
• Select a set of typical segments translated by the top engines. Perform human expert LQA of these translations.
• Perform human expert analysis of each engine's weak spots: segments where an engine can fail.
• Perform human expert LQA of machine translations of an entire document about COVID-19.
• Analyze the results of the LQA and choose the most suitable engines for translation.

Page 8: Independent evaluation of commercial machine translation


Dataset
English-French, Healthcare

TAUS Corona Crisis Corpus
• Original dataset volume: 891,926 segments
• We clean the data and remove bad segments
• Next, we run Machine Translation Quality Estimation to remove segments where the source text and the translation are likely to be mismatched

A lot of medical terminology:

“The percentage positive for the other respiratory viruses in week 44 remained low, although RSV and parainfluenza increased slightly compared to week 43: RSV 2.4%; parainfluenza 3.3%; adenovirus 1.1%; hMPV 0.2%; and coronavirus 1.3%.”

Page 9: Independent evaluation of commercial machine translation


Dataset
Over 200,000 segments were removed in cleaning

• Over 130,000 segments were duplicates.
• About 22,000 segments had different digits in source and target.
• More than 9,000 segments were too short (4 tokens or less).
• And 36 more filters.

Example:

source: “1 to 2 tablets a day before or during the meal, or as recommended from your health professional.”

reference translation: “Édulcorant: Saccharinate de sodium.1 comprimé/jour, ou comme recommandé par votre professionnel de la santé.”
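As an illustration only (the report applies 39 filters in total; this sketch covers just the three rules named above, with hypothetical helper names), rule-based cleaning of (source, target) pairs might look like this in Python:

```python
import re

def digits(text):
    """Return the sorted list of digit runs in a string."""
    return sorted(re.findall(r"\d+", text))

def clean(pairs, min_tokens=5):
    """Apply simple rule-based filters: deduplication, digit mismatch, too-short segments."""
    seen = set()
    kept = []
    for src, tgt in pairs:
        key = (src.strip(), tgt.strip())
        if key in seen:                 # duplicate segment pair
            continue
        seen.add(key)
        if digits(src) != digits(tgt):  # different digits in source and target
            continue
        if len(src.split()) < min_tokens or len(tgt.split()) < min_tokens:  # 4 tokens or less
            continue
        kept.append((src, tgt))
    return kept
```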

Page 10: Independent evaluation of commercial machine translation


Dataset
Over 7,000 segments were removed after filtering with BERT/LASER embeddings

For example, segments where the source and the reference translations did not match:

source: “How much does it cost?”

reference translation: “Combien coûte l’utilisation des labos de tests ADC ?”
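A minimal sketch of this kind of cross-lingual embedding filter, using the sentence-transformers LaBSE model as a stand-in for the BERT/LASER embeddings mentioned above (the 0.6 similarity threshold is illustrative, not the report's value):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Multilingual sentence encoder used here as a stand-in for LASER/BERT embeddings.
model = SentenceTransformer("sentence-transformers/LaBSE")

def filter_mismatched(pairs, threshold=0.6):
    """Keep only pairs whose source and reference embeddings are similar enough."""
    src_vecs = model.encode([s for s, _ in pairs], normalize_embeddings=True)
    tgt_vecs = model.encode([t for _, t in pairs], normalize_embeddings=True)
    sims = np.sum(src_vecs * tgt_vecs, axis=1)  # cosine similarity (unit-normalized vectors)
    return [pair for pair, sim in zip(pairs, sims) if sim >= threshold]
```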

Page 11: Independent evaluation of commercial machine translation


Dataset
Other issues in the data

Wrong domain:

source: “18 For Christ also suffered once for sins, the righteous for the unrighteous, that he might bring us to God, being put to death in the flesh but made alive in the spirit,”

reference: “18 En effet, Christ aussi est mort une seule fois pour les péchés, lui juste pour des injustes, afin de vous amener à Dieu. Mis à mort selon la chair, il a été rendu vivant selon l’Esprit.”

Wrong language in source or reference:

source: “Treatment of glycogen storage disease type II (Pompe’s disease)”

reference: “Treatment of glycogen storage disease type II ( Pompe’ s disease )”

Page 12: Independent evaluation of commercial machine translation


Dataset
Splitting into training and testing sets

TAUS Corona Crisis Corpus
• After cleaning, 685,606 segments remain
• We extract 4,000 segments as a test set
• We train custom MT models on a training set of about 660,000 segments
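A minimal sketch of the hold-out split, assuming the cleaned corpus is a list of (source, target) pairs; the sampling strategy and seed below are illustrative, as the report does not specify them:

```python
import random

def split_corpus(pairs, test_size=4000, seed=20):
    """Shuffle the cleaned corpus and hold out a fixed-size test set."""
    rng = random.Random(seed)
    shuffled = list(pairs)
    rng.shuffle(shuffled)
    return shuffled[test_size:], shuffled[:test_size]  # (train_set, test_set)
```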

Page 13: Independent evaluation of commercial machine translation


https://www.slideshare.net/AaronHanLiFeng/lepor-an-augmented-machine-translation-evaluation-metric-thesis-ppt

https://github.com/aaronlifenghan/aaron-project-lepor

Evaluation Metrics

• We use three metrics for automatic evaluation of machine translations: hLEPOR, TER, and SacreBLEU.
• The scores are mostly well-correlated, and we rely on hLEPOR to build an initial ranking of MT engines and select the top runners.
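For illustration, corpus-level BLEU and TER can be computed with the sacrebleu library as sketched below; this is not the report's exact evaluation code, and hLEPOR has its own reference implementation (linked above):

```python
from sacrebleu.metrics import BLEU, TER

def score_engine(hypotheses, references):
    """Corpus-level SacreBLEU and TER for one MT engine on the 4,000-segment test set."""
    bleu = BLEU().corpus_score(hypotheses, [references])
    ter = TER().corpus_score(hypotheses, [references])
    return {"sacrebleu": bleu.score, "ter": ter.score}

# Toy usage with a single segment:
print(score_engine(
    ["Le virus se propage rapidement ."],
    ["Le virus se propage très rapidement ."],
))
```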

Page 14: Independent evaluation of commercial machine translation


Different metrics are correlated

Page 15: Independent evaluation of commercial machine translation


Average hLEPOR Scores

[Chart: average hLEPOR per engine, customized models vs. stock models, top engines highlighted]

Page 16: Independent evaluation of commercial machine translation


Average TER Scores
Same top runners

We use hLEPOR for the analysis below.

[Chart: average TER per engine, customized models vs. stock models, top engines highlighted]

Page 17: Independent evaluation of commercial machine translation


Customization analysis
Corpus scores

• We have taken a closer look at the customized models and how they compare to stock models.
• The custom ModernMT model shows the most score improvement over stock.
• The Microsoft custom model has a slightly lower average score than stock.

Page 18: Independent evaluation of commercial machine translation


Customization analysis

• In order to understand where a custom model differs from a stock model, we need to look at specific translations, not the whole test set.
• We have developed a method of selecting segments that have significantly different scores in the stock and custom translations, so that we can compare the translations side by side. A minimal sketch of this selection is given below.
• This allows us to see how the custom model has improved over stock and where it fails.
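A minimal sketch of such a selection, assuming per-segment hLEPOR scores for the stock and custom variants of one engine (the 0.2 gap is a hypothetical threshold, not the report's value):

```python
def diverging_segments(stock_scores, custom_scores, min_gap=0.2):
    """Indices of segments where the custom and stock hLEPOR scores differ markedly,
    so a linguist can compare the two translations directly."""
    return [i for i, (s, c) in enumerate(zip(stock_scores, custom_scores))
            if abs(c - s) >= min_gap]

# e.g. diverging_segments([0.55, 0.70, 0.40], [0.58, 0.95, 0.10]) -> [1, 2]
```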

Page 19: Independent evaluation of commercial machine translation


ModernMT custom vs. stock

• The custom model has higher scores for a third of all segments.
• Where the stock model leaves words untranslated, the custom model provides normal translations.
• There are some omissions in the custom translations, for example:

source: “There are plenty of other health problems, other than lung problems, that are under suspicion. Children who are exposed to second-hand smoke are at risk of developing a variety of health problems.”

custom MT: “Il y a une plénitude de problèmes de santé, autres que les problèmes pulmonaires, qui sont liés à la fumée secondaire.”

Page 20: Independent evaluation of commercial machine translation


Google custom vs. stock

• Translations with improved scores are not always better; often they are just closer to the reference translation.
• There are some omissions in the custom translations, for example:

source: “Standard doses of voriconazole and standard doses of efavirenz must not be coadministered Steady-state efavirenz ( 400 mg orally once daily ) decreased the steady state Cmax and AUCτ of voriconazole by an average of 61 % and 77 % , respectively , in healthy subjects .”

custom MT: “Les doses standards de voriconazole et les doses standards d’éfavirenz ne doivent pas être co-administrées.”

Page 21: Independent evaluation of commercial machine translation


IBM custom vs. stock

• Several stock translations that contain many untranslated words have been translated normally by the custom model, for example:

source: “The Alberta survey report: a report from Nutrition Canada by the Bureau of Nutritional Sciences, Department of National Health and Welfare.”

stock MT: “The Alberta survey report: a report from Nutrition Canada by the Bureau of Nutritional Sciences, Department of National Health and Welfare.”

custom MT: “Le rapport d'enquête de l'Alberta: un rapport de Nutrition Canada réalisé par le Bureau des sciences de la nutrition, ministère de la Santé nationale et du Bien-être social.”

Page 22: Independent evaluation of commercial machine translation


Microsoft custom* vs. stock

• Many custom translations have lower scores because of punctuation issues, for example:

source: “Clinics and hospitals in Sonoma, California - Amarillascalifornia.com”

custom MT: “Cliniques et hôpitaux à Sonoma (Californie)-Amarillascalifornia.com”

* We trained the model in the Medicine domain

Page 23: Independent evaluation of commercial machine translation


Reference-Based Scores and Customization
Discussion

• Ranking engines by different metrics provides similar results.
• The ModernMT engine trained on our dataset has the highest scores. There are some omissions in the custom translations; however, more than a third of all segments have improved scores.
• The customized Google and IBM engines show modest improvement over stock. The custom IBM model corrects many segments that the stock model has left untranslated.
• Although the Microsoft custom model's corpus score has fallen slightly compared to the stock engine, translation quality has not degraded significantly.
• Yandex has the highest scores of all stock engines. Overall, most engine scores are close; only Baidu lags behind.

Page 24: Independent evaluation of commercial machine translation


Human Linguistic Quality Analysis

• For our evaluation, we picked the seven top-performing engines.
• Engines provide similar translations for some segments, but disagree on others.
• Corpus-level scores show quality for median segments.
• To perform human review, we need to select median segments as well as those that will demonstrate the differences between MTs.

[Chart: number of segments by segment difficulty (hard to easy) and MT engine agreement (low to high)]

Page 25: Independent evaluation of commercial machine translation


Extracting groups of segments for review

• We calculated average hLEPOR scores for all test segments across the top-performing engines.
• Median segments are those whose average hLEPOR and variance are close to the median.
• Weak spots are segments that most engines handled well, but one or more engines translated badly. These segments can spotlight a particular engine’s weaknesses.
• These groups of segments were analysed by linguists with expertise in the biomedical domain and in both English and French. A minimal sketch of the selection logic is given below.

[Chart: per-segment score distribution with median segments and weak spots highlighted]
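The following sketch illustrates one way to implement the two selection rules described above; the group size and the weak-spot gap are hypothetical parameters, not the report's values:

```python
import numpy as np

def select_segments(scores, n_median=200, weak_gap=0.3):
    """scores: array of shape (n_segments, n_engines) with per-segment hLEPOR scores.

    Returns indices of 'median' segments (average score and variance close to the
    corpus median) and, per engine, 'weak spots' (one engine far below the rest).
    """
    scores = np.asarray(scores)
    mean = scores.mean(axis=1)
    var = scores.var(axis=1)
    # Distance of each segment from the median behaviour of the corpus.
    dist = np.abs(mean - np.median(mean)) + np.abs(var - np.median(var))
    median_idx = np.argsort(dist)[:n_median]

    weak_spots = {}
    for e in range(scores.shape[1]):
        others = np.delete(scores, e, axis=1).mean(axis=1)
        weak_spots[e] = np.where(others - scores[:, e] >= weak_gap)[0]
    return median_idx, weak_spots
```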

Page 26: Independent evaluation of commercial machine translation


Human Linguistic Quality Analysis

• We continue our evaluation with human review of three kinds of texts: typical segments from the test set, weak spots (non-typical translations), and a whole document on COVID-19.
• We are enormously grateful to our LSP partners who performed the LQA: e2f and ITC Translations.

Page 27: Independent evaluation of commercial machine translation


Segments Review
Blind within-subjects review

• LQA was performed by our two partners: e2f and ITC Translations. The experts received the source segments and all translations (including the human reference) without labels.
• Experts rated the estimated effort to post-edit every segment. For all segments where post-editing makes sense (i.e. neither perfect nor useless segments), reviewers were also asked to provide a suggested translation.
• In our analysis, we explore the level of reviewer agreement and provide two models to rank the engines (based on the PE distance and on the estimated cost saving compared to human translation).

Page 28: Independent evaluation of commercial machine translation


Segments Review
Editing effort ratings, validated by reviewers


Editing Effort | Description | Estimated Effort Saving*

PERFECT | There is absolutely nothing to improve. The translation sounds like it was produced by a professional human translator who understands the context in which the source segment appears. | 90%
GOOD | The translation conveys the meaning of the source accurately and does not contain any grammatical errors, but it does not sound quite natural. Style and tone need some improvement. | 75%
FAIR | The translation adequately conveys the meaning of the source sentence. There are some mistakes that are easy to fix; the effort is similar to reviewing human translation or fuzzy TM matches. | 50%
BAD | The translation adequately conveys the meaning of the source sentence. There are mistakes of different severity. Fixing these mistakes requires careful examination of the source sentence and significant effort. However, the machine translation still provides a speed-up. | 20%
USELESS | The translation is completely irrelevant to the source; it is either useless or misleading, the meaning of the source sentence is lost, and it should be translated from scratch. | 0%
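As a worked illustration of how these ratings can be turned into a per-engine effort-saving estimate (the report's exact aggregation is not shown; the mapping below simply reuses the percentages from the table above):

```python
# Estimated effort saving per rating, taken from the table above.
SAVING = {"PERFECT": 0.90, "GOOD": 0.75, "FAIR": 0.50, "BAD": 0.20, "USELESS": 0.00}

def average_saving(ratings):
    """Average estimated effort saving over one engine's reviewed segments."""
    return sum(SAVING[r] for r in ratings) / len(ratings)

# Example: average_saving(["PERFECT", "GOOD", "GOOD", "FAIR"]) == 0.725
```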

Page 29: Independent evaluation of commercial machine translation


Reviewer agreement
One reviewer has more PERFECT segments


Page 30: Independent evaluation of commercial machine translation


Reviewer agreement
Ratings are mostly well-correlated

• According to one reviewer, slightly more effort is needed for some segments.
• To deal with reviewer disagreement, we compute average ratings across reviewers.

Page 31: Independent evaluation of commercial machine translation


Segments Review
MT model average ranking

[Chart: share of perfect, good, fair, bad, and useless segments per engine]

Page 32: Independent evaluation of commercial machine translation


Segments Review
Why does Human Translation rank last?

• Reviewers mention that the reference translations are heavily rephrased and too localized, while the MTs are literal.
• Reviewers edit reference translations to make them more literal and closer to the source.

Page 33: Independent evaluation of commercial machine translation


Segments Review
Issues in the reference translations

Reference is heavily re-phrased:

source: “Similar effects can also result from exposure to chemicals that influence nervous system development, but have no known action on the endocrine system.”

reference: “L'exposition à des substances qui influent sur le développement neurologique, mais n'ont pas d'action connue sur le système endocrinien, peut également provoquer des effets similaires.”

Reference does not quite match source:

source: “As a role model, your actions and reactions can influence how youth relate to each other. Source: Public Safety Canada”

reference: “En tant que modèle, vos gestes et vos réactions peuvent influer sur la façon dont les jeunes réagissent entre eux.”

Page 34: Independent evaluation of commercial machine translation


Segments Review
DeepL leads

• DeepL is the leader of the averaged ranking.
• Human Translation has the lowest rating.

Page 35: Independent evaluation of commercial machine translation


Ranking by post-editing distance
The same leader

• DeepL is the leader again.
• Google stock, Amazon, and ModernMT custom follow; other engines are further behind.
• Human translation has by far the largest edit distance from the suggested translations. A minimal sketch of a per-word edit distance is given below.
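The sketch below shows one possible reading of a "PE distance per word" metric: the character-level edit distance from the MT output to the reviewer's suggested translation, normalized by the length of the suggestion in words. The report does not spell out its exact definition, so treat this as illustrative:

```python
def levenshtein(a, b):
    """Character-level edit distance between two strings (standard DP)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]

def pe_distance_per_word(mt_output, post_edited):
    """Edit distance to the reviewer's suggested translation, per word of the suggestion."""
    return levenshtein(mt_output, post_edited) / max(1, len(post_edited.split()))
```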

Page 36: Independent evaluation of commercial machine translation


Segments Review
Discussion

• DeepL is in first place, with very high potential effort saving and the lowest editing distance to the suggested translations.
• Google custom and stock, as well as ModernMT custom, are the closest followers. Amazon stock has a similar average rating but has some USELESS segments.
• The human translation was rated the worst by both reviewers. The reference translations are significantly re-phrased compared to the source texts. Machine translations, as well as the translations suggested by the reviewers, are usually literal.

Page 37: Independent evaluation of commercial machine translation


Segments Review
Discussion

• Custom translations, which are closer to the reference translations and thus have higher scores, are not rated the best in LQA for the same reason that the reference translations received a low rating.
• Stock translations are already high-quality, so training models on more data does not produce dramatic gains.
• Moreover, the training data, though it comes from the same domain, is rather heterogeneous. Custom models usually perform radically better than stock when they are trained on more standardized data.
• Our initial ranking was based on comparisons to imperfect reference translations. This explains why human LQA produces a somewhat different ranking.

Page 38: Independent evaluation of commercial machine translation


MT Weak Spots Analysis

• Weak spots are segments for which a particular engine received a low score, while the average score across all top-performing providers is high.
• Weak spots showcase the contexts where an engine can fail.
• Some weak spots are valid alternative translations; see the next slide.

[Chart: weak spots per engine, split into minor and major issues]

Page 39: Independent evaluation of commercial machine translation


MT Weak Spots Analysis
Based on data from two reviewers

Issue types observed per engine (columns: omission, mistranslation, untranslated words, style issues, MT is better than reference):

Amazon (stock): X
DeepL (stock): X X X X
Google (custom): X X X
Google (stock): X X X
ModernMT (custom): X X X X
ModernMT (stock): X X X
Yandex (stock): X

Page 40: Independent evaluation of commercial machine translation


MT Weak Spots Analysis
Examples

Amazon — untranslated words:

source: “Vaccine response was defined as an hSBA titre of ≥1:8 in subjects initially seronegative (hSBA titre <1:4) and as a 4-fold increase in titre in subjects initially seropositive (hSBA titre >1:4).”

MT: “La réponse vaccinale a été définie comme un titre HsBA ≥ 1:8 chez les sujets initialement séronégatifs (HsBA titre < 1:4) and as a 4-fold increase in titre in subjects initially seropositive (hSBA titre > 1:4).”

DeepL — omission:

source: “56 healthy men and women were recruited for a placebo-controlled study by Dr.Jodi Miller and colleagues of Solvay Pharmaceuticals, Marietta, Georgia (USA), to compare the pharmacokinetics of oral dronabinol against a nebulized formulation administered in a propylene glycol/ethanol/water solution.”

MT: “Jodi Miller et ses collègues de Solvay Pharmaceuticals, Marietta, Géorgie (USA), pour comparer la pharmacocinétique du dronabinol oral à une formulation nébulisée administrée dans une solution de propylène glycol/éthanol/eau.”

Page 41: Independent evaluation of commercial machine translation


MT Weak Spots Analysis
Discussion

• Results of human review of the weak spots show that many of them are actually better than the reference translations. However, the engines have some real weak spots too.
• DeepL has an especially large number of segments where the MT is better than the reference. However, there are also some omissions and untranslated words.
• Amazon leaves some words untranslated in one segment and has minor issues in several others.
• ModernMT stock leaves several segments fully or mostly untranslated; several translations are wrong or too literal.

Page 42: Independent evaluation of commercial machine translation


HOLISTIC REVIEW RESULTS
Blind within-subjects review

• We have translated a whole document on COVID-19 with the seven top-running engines.
• Experts have reviewed the translations' readability, adequacy, and consistency, and commented on any issues they found.
• We have analyzed the results and computed a rating for each engine's translation.

Page 43: Independent evaluation of commercial machine translation


HOLISTIC REVIEW RESULTS
Fragment of the test document

Covid-19 is caused by SARS-CoV-2, a member of the coronavirus family. Its closest relatives are the SARS-CoV virus, with which it shares roughly 79% genomic similarity, and MERS-CoV virus, with 50% similarity. They are enveloped viruses with a positive-sense single-stranded RNA genome and a nucleocapsid of helical symmetry. The genome size of coronaviruses ranges from approximately 27 to 34 kilobases (29.9 for SARS-CoV-2), the largest among known RNA viruses.

Compared to the seasonal flu virus, SARS-CoV-2 is characterised by both higher infectivity (basic reproductive number 2.0–2.5 vs 1.3 for flu) and higher disease severity, both in hospitalization rate (~20% vs ~2%) and case fatality rate (~3% vs ~0.1%). SARS-CoV-2 also starkly contrasts with its closest relatives in these regards, causing much less severe symptoms than both SARS-CoV and MERS-CoV (with fatality rates around 10% and 35% respectively).

Page 44: Independent evaluation of commercial machine translation


HOLISTIC REVIEW
Criteria

• untranslated elements: number of words left untranslated
• translation adequacy: the translation matches the source
• translation consistency: neighboring sentences are translated consistently
• readability: the translation is easy to read for a human
• specific issues: formatting, punctuation, lists, links, added junk

Page 45: Independent evaluation of commercial machine translation


HOLISTIC REVIEW RESULTS
Average of two reviewers’ ratings

# | MT Engine | Untranslated segments | Readability | Translation adequacy | Translation consistency | Specific issues | Reviewer comments | Total points
1 | Amazon | Few / None | Good / Perfect | Good / Perfect | Good / Perfect | Few | good punctuation, handling of terminology, and style | 17
2 | ModernMT stock | None | Good | Good / Perfect | Good | Average | easy-to-fix grammar and punctuation issues | 15.5
3 | DeepL | Few | Good | Good / Perfect | Good | Average | easy-to-fix terminology and punctuation issues | 14.5
4 | Google custom | Few / None | Understandable / Good | Good | Easy to fix / Good | Average | easy-to-fix grammar and punctuation issues | 13.5
5 | ModernMT custom | Few / None | Understandable / Good | Good | Easy to fix | Average | incorrect punctuation and handling of acronyms | 13
6 | Google stock | Few / None | Understandable / Good | Good | Easy to fix | Average | good style but incorrect punctuation and handling of acronyms | 13
7 | Yandex | Few / None | Understandable / Good | Good | Easy to fix | Average | below-average grammar, punctuation, and terminology | 13

Page 46: Independent evaluation of commercial machine translation


HOLISTIC REVIEW RESULTS
Discussion

• A new leader emerges: Amazon has the highest overall average rating from the two reviewers. One of the reviewers praises the translation's style and handling of terminology.
• ModernMT stock and the leader of the Segments Review, DeepL, show good results too: adequate, readable, and consistent translations with few or no untranslated elements and issues that are easy to fix.
• For DeepL, one reviewer mentions issues with terminology, which is important in this domain. However, according to the reviewer, they are easy to fix.

Page 47: Independent evaluation of commercial machine translation


LINGUISTIC QUALITY ANALYSIS
Summary

# | MT Engine | Holistic review points (whole document) | Effort Saving (typical segments) | PE Distance per word (typical segments) | Weak spots (non-typical segments) | Further improvement
1 | DeepL | 14.5 | 79% (Human Translation: 57%) | 0.83 (Human Translation: 2.48) | omissions, untranslated words |
2 | Amazon | 17 | 69% | 1.1 | untranslated words, minor issues |
3 | ModernMT stock | 15.5 | 66% | 1.36 | omissions, untranslated words, one critical mistranslation |
4 | Google stock | 13 | 72% | 1.08 | some mistranslations and style issues |
5 | ModernMT custom | 13 | 69% | 1.12 | some omissions and minor issues | further improvement on better data is possible
6 | Google custom | 13.5 | 69% | 1.32 | some omissions, a few mistranslations | further improvement on better data and a glossary is possible
7 | Yandex | 13 | 66% | 1.43 | too literal translations |

Page 48: Independent evaluation of commercial machine translation


LINGUISTIC QUALITY ANALYSIS
Conclusions

• We see a discrepancy between reference-based scores and human ratings. The leaders of the reference-based ranking are the customized ModernMT and Google engines. They are not the best according to the human reviewers.
• Since the dataset has some issues (Canadian French in the reference, overly localized reference translations), customized engines do not always learn the right things, and they output translations that are closer to the reference but not great from the human point of view.
• DeepL shows the best results on segments from the test set. Its weak spots are some omissions.
• Amazon outperforms all engines on the holistic review of an entire document, handling terminology and style well. However, its performance in the segments review was not perfect.

Page 49: Independent evaluation of commercial machine translation


4 HOW MUCH DOES IT COST?

4.1 Price Comparison - Training
4.2 Price Comparison - Maintenance
4.3 Price Comparison - Translation
4.4 Total Cost of Ownership
4.5 Discussion

Page 50: Independent evaluation of commercial machine translation


Price Comparison

Page 51: Independent evaluation of commercial machine translation


Price Comparison

Page 52: Independent evaluation of commercial machine translation


5 HOW SAFE IS MY DATA?

• Data protected by ToS: Google (link).
• Data protected by Data Protection and Privacy Policy: ModernMT (link).
• Amazon may store and use input data to improve its technologies (link).
• DeepL does not store input data and uses it only to provide the translation (link).
• Yandex does not store input data and uses it only to provide the translation (link).

Page 53: Independent evaluation of commercial machine translation


Intento Web Demo
End-to-End | Fast and Safe | Trusted

• Get a portfolio of Machine Translation engines optimal for your language pairs, domains, and available training data.
• 4-5 weeks from assorted TMs and glossaries to winning MT engines with ROI estimation for Post-Editing and Real-Time Machine Translation.
• We run 15-20 MT Procurement projects per month for global retail, travel, and technology companies under strict Security, Quality and Data Protection requirements. ISO 27001 certified.

REACH US at [email protected]

Page 54: Independent evaluation of commercial machine translation


Intento Plugins and Connectors

• Microsoft Office (Outlook, Word, Excel)
• Google Chrome and Microsoft Edge (extension)
• memoQ (included in 9.4, also private plugin)
• SDL Trados (SDL AppStore)
• XTM (XLIFF API Connector)
• MateCat (private plugin)
• Any Enterprise TMS via XLIFF connector
• Missing a connector? Reach us at [email protected]!

Page 55: Independent evaluation of commercial machine translation


Any questions? Get in touch!

If you’d like to have a closer look at the data or to reproduce the results, feel free to contact us at [email protected]. The following data assets are available:

• this report in PDF
• the training set
• the test set
• test set translations by MT engines
• segments for human review, commented by the reviewers
• MT engines’ weak spots, commented by the reviewers

Page 56: Independent evaluation of commercial machine translation



ITC TRANSLATIONS
YOUR LANGUAGE SERVICE PARTNER
ITC GROUP

… LINGUISTS | … LANGUAGE COMBINATIONS
… CONTINENTS, SEVERAL TIME ZONES AT YOUR SERVICE
… YEARS IN BUSINESS
MULTILINGUAL PROJECT MANAGEMENT CAPABILITY

TRANSLATION / DTP
Translation by native translators who are experts in their field, proofreading and editing, machine translation and post-editing, transcreation, multilingual DTP for ready-to-use files.

CONTENT WRITING
Writing content in the languages of your choice: SEO-optimized web content, blog articles, technical manuals, white papers, etc.

WEBSITE TRANSLATION / SEO
Website translation and localization, SEO translation optimized for search engines.

E-LEARNING / SOFTWARE LOCALIZATION
Localization of information technology programs, localization of online training programs and materials, distance learning.

REMOTE AND ON-SITE INTERPRETING
Consecutive or simultaneous interpreting, remotely by video or telephone, on-site with soundproof booth or bidule (microtransmitter and receivers).

AUDIO / VIDEO
Localization and audiovisual production: transcription, subtitling, dubbing, voice-over, voice casting, synthetic voice.

YOUR NEEDS: TALENT, TECHNOLOGY, AGILITY, TERMINOLOGY
• A team of passionate experts and linguists specialized in your industry.
• A dedicated customer portal and the most advanced tools in the industry.
• An agile structure offering a range of customized solutions.
• A database specific to your organization that improves consistency and offers cost savings from project to project.

Industries: Medical • Healthcare • Legal • Technical • Life Science • Marketing • IT • Software • Communications • Tourism • Finance • Business • Engineering • Technology • Gaming • Education • Luxury • Insurance • Retail • Entertainment • And more.

… E Indiantown Rd, Jupiter, FL, USA
www.itcglobaltranslations.com
contact@itcglobaltranslations.com

Page 57: Independent evaluation of commercial machine translation


Intento, Inc.
[email protected]
Shattuck Ave, Berkeley, CA 94704

https://inten.to