Assumptions, Expectations and Outliers in Post-Editing. Lena Marg, Laura Casanellas, Language Tools Team @ EAMT Summit Dubrovnik, Croatia, June 2014

Upload: welocalize

Post on 19-Nov-2014


DESCRIPTION

Presentation given at the EAMT (European Association for Machine Translation) Summit in Dubrovnik, Croatia, June 2014: Assumptions, Expectations and Outliers in Post-Editing, by Lena Marg and Laura Casanellas of the Welocalize Language Tools Team. It discusses machine translation and quality processes, and presents results from a Welocalize study by the Language Tools Team covering scoring, human evaluations and productivity tests. Based on the data shown, and from the point of view of consistent results versus outliers: for Human Evaluations of raw MT output, inter-annotator agreement was consistent in terms of scores (same test set and language), whereas metrics based on human effort (productivity delta) are less consistent and may vary significantly between individuals (same test set and language).

TRANSCRIPT

Page 1: Welocalize EAMT 2014 Presentation Assumptions, Expectations and Outliers in Post-Editing

Assumptions, Expectations and Outliers in Post-Editing

Lena Marg, Laura Casanellas

Language Tools Team

@ EAMT Summit Dubrovnik, Croatia, June 2014

Page 2:

Background on MT Programs @ Welocalize

MT programs vary with regard to:

- Scope
- Locales
- Maturity
- System Setup & Ownership
- MT Solution used
- Key Objective of using MT
- Final Quality Requirements
- Source Content

Page 3:

MT Quality Evaluation @ Welocalize

1. Automatic Scores
- Provided by the MT system (typically BLEU)
- Provided by our internal scoring tool, weScore (range of metrics)

2. Human Evaluation
- Adequacy, scores 1-5
- Fluency, scores 1-5

3. Productivity Tests
- Post-Editing versus Human Translation in iOmegaT, validated through final Quality Assessments
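The slides do not define how the productivity delta is computed. As a minimal sketch, assuming it is the relative change in throughput (words per hour) when post-editing compared with translating from scratch, it could be calculated as follows. The function names and the sample figures are hypothetical and do not come from the study.

```python
def words_per_hour(words, seconds):
    """Throughput in words per hour for one timed session."""
    if seconds <= 0:
        raise ValueError("session duration must be positive")
    return words / (seconds / 3600.0)


def productivity_delta(ht_words, ht_seconds, pe_words, pe_seconds):
    """Relative throughput change of post-editing (PE) versus human
    translation (HT), in percent. Assumed definition, not the study's."""
    ht = words_per_hour(ht_words, ht_seconds)
    pe = words_per_hour(pe_words, pe_seconds)
    return (pe - ht) / ht * 100.0


# Hypothetical session: 500 words translated and 750 words post-edited,
# each in one hour.
delta = productivity_delta(500, 3600, 750, 3600)
print(f"Productivity delta: {delta:.1f}%")
```

Under this assumed definition, a positive value means the linguist was faster when post-editing, and a negative value means post-editing was slower than translating.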

Page 4:

The Database

Objective: Establish correlations between these 3 evaluation approaches to
- draw conclusions on predicting productivity gains in advance
- see how & when to use the different metrics best

Contents:
- Content Type
- Language Pair (English into XX)
- MT engine provider & owner (i.e. who owns training & maintenance)
- Metrics (BLEU & PE Distance, Adequacy & Fluency, Productivity deltas)
- MT error analysis
- Final QA scores
- Level of experience of resource doing productivity test

Data from 2013
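PE Distance is listed among the metrics but not defined in the slides. A common way to measure it is an edit distance between the raw MT output and its post-edited version. The sketch below assumes a word-level Levenshtein distance normalised by the longer segment; the tokenisation (whitespace split) and the normalisation are assumptions, not the method used by weScore.

```python
from __future__ import annotations


def levenshtein(a: list[str], b: list[str]) -> int:
    """Minimum number of token insertions, deletions and substitutions
    needed to turn sequence a into sequence b."""
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1,          # delete tok_a
                            curr[j - 1] + 1,      # insert tok_b
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def pe_distance(raw_mt: str, post_edited: str) -> float:
    """Normalised edit distance: 0.0 means the MT output was left
    untouched, 1.0 means every token was changed."""
    mt_tokens = raw_mt.split()
    pe_tokens = post_edited.split()
    longest = max(len(mt_tokens), len(pe_tokens))
    if longest == 0:
        return 0.0
    return levenshtein(mt_tokens, pe_tokens) / longest


# Hypothetical segment pair, for illustration only.
print(pe_distance("the cat sat on mat", "the cat sat on the mat"))
```

With a definition like this, a higher PE distance means more editing was needed, which is consistent with the negative correlations against BLEU and Fluency reported in the deck.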

Page 5:

The Database: Data Used

27 locales in total, with varying amounts of available data

5 different MT systems (SMT / Hybrid

Page 6:

Assumptions: Results

General assumptions around best performing languages and content types were confirmed.

Page 7:

Assumptions: Results, II

Interesting results on the correlation between productivity when translating and productivity when post-editing:

Not all the resources improve equally (or at all) when changing activities from translation to post-editing.

Page 8:

Correlation Results: Summary

Pearson's r | Variables | Strength of Correlation | Tests (N) | Locales | Statistical Significance (p value <)
0.82 | Adequacy & Fluency | Very strong positive relationship | 182 | 22 | 0.0001
0.77 | Adequacy & P Delta | Very strong positive relationship | 23 | 9 | 0.0001
0.71 | Fluency & P Delta | Very strong positive relationship | 23 | 9 | 0.00015
0.55 | Cognitive Effort Rank & PE Distance | Strong positive relationship | 16 | 10 | 0.027
0.41 | Fluency & BLEU | Strong positive relationship | 146 | 22 | 0.0001
0.26 | Adequacy & BLEU | Weak positive relationship | 146 | 22 | 0.0015
0.24 | BLEU & P Delta | Weak positive relationship | 106 | 26 | 0.012
0.13 | Numbers of Errors & PE Distance | No or negligible relationship | 16 | 10 | ns
-0.30 | Predominant Error & BLEU | Moderate negative relationship | 63 | 13 | 0.017
-0.32 | Cognitive Effort Rank & PE Delta | Moderate negative relationship | 20 | 10 | ns
-0.41 | Numbers of Errors & BLEU | Strong negative relationship | 63 | 20 | 0.00085
-0.41 | Adequacy & PE Distance | Strong negative relationship | 38 | 13 | 0.011
-0.42 | PE Distance & P Delta | Strong negative relationship | 72 | 27 | 0.00024
-0.70 | Fluency & PE Distance | Very strong negative relationship | 38 | 13 | 0.0001
-0.81 | BLEU & PE Distance | Very strong negative relationship | 75 | 27 | 0.0001
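For readers who want to reproduce figures of this kind on their own data, the following is a minimal sketch of Pearson's r in plain Python, together with a helper that maps a coefficient to the strength labels used in the summary. The band boundaries (0.2, 0.3, 0.4, 0.7) are an assumption inferred from the labels in the summary; the slides do not state the cut-offs, and the sample data is hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson's correlation coefficient for two equal-length samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length, at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


def strength_label(r):
    """Map r to a verbal label. Cut-offs are assumed, not from the study."""
    size = abs(r)
    if size < 0.2:
        return "No or negligible relationship"
    if size < 0.3:
        band = "Weak"
    elif size < 0.4:
        band = "Moderate"
    elif size < 0.7:
        band = "Strong"
    else:
        band = "Very strong"
    direction = "positive" if r > 0 else "negative"
    return f"{band} {direction} relationship"


# Hypothetical adequacy scores (1-5) and productivity deltas (percent).
adequacy = [2.8, 3.1, 3.5, 3.9, 4.2]
p_delta = [5.0, 12.0, 10.0, 30.0, 35.0]
r = pearson_r(adequacy, p_delta)
print(f"r = {r:.2f}: {strength_label(r)}")
```

The sketch does not compute a p value; that would need the t distribution (for example from a statistics library) applied to t = r * sqrt((N - 2) / (1 - r^2)) with N - 2 degrees of freedom.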

Page 9:

Correlations: Takeaways

The strongest correlations were found between:

- Adequacy & Fluency
- BLEU and PE Distance
- Adequacy & Productivity Delta
- Fluency & Productivity Delta
- Fluency & PE Distance

The Human Evaluations come out as stronger indicators for potential post-editing productivity gains than Automatic metrics.

Page 10:

Looking at subsets

Adequacy and Fluency versus BLEU

[Figure: Adequacy, Fluency & BLEU Correlation – Select Locales. Pearson's r (vertical axis, -1.00 to 1.00) for Adequacy & BLEU and for Fluency & BLEU, per locale: da_DK, de_DE, es_ES, es_LA, fr_CA, fr_FR, it_IT, ja_JP, ko_KR, pt_BR, ru_RU, zh_CN]

Adequacy, Fluency and BLEU correlation for locales with 4 or more test sets*

Although the test sets here are too small to be statistically relevant, the correlations seem to vary significantly between locales. Would this be maintained with more data, and what are the reasons for the differences?
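A per-locale breakdown like the one on this slide can be sketched as follows: group the test sets by locale, keep only locales with at least 4 test sets (the threshold named on the slide), and compute Pearson's r within each group. The record layout `(locale, human_score, bleu)` and all values are hypothetical.

```python
import math
from collections import defaultdict


def pearson_r(xs, ys):
    """Pearson's r; returns NaN when either sample has zero variance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


def per_locale_correlation(rows, min_tests=4):
    """rows: iterable of (locale, human_score, bleu) tuples.
    Returns {locale: r} for locales with at least min_tests test sets."""
    grouped = defaultdict(list)
    for locale, human_score, bleu in rows:
        grouped[locale].append((human_score, bleu))
    result = {}
    for locale, pairs in grouped.items():
        if len(pairs) < min_tests:
            continue  # too few test sets to report for this locale
        scores, bleus = zip(*pairs)
        result[locale] = pearson_r(scores, bleus)
    return result


# Hypothetical records: fr_FR has only 3 test sets and is skipped.
rows = [
    ("de_DE", 3.0, 30.0), ("de_DE", 3.5, 35.0),
    ("de_DE", 4.0, 40.0), ("de_DE", 4.5, 45.0),
    ("fr_FR", 4.0, 50.0), ("fr_FR", 3.0, 20.0), ("fr_FR", 2.0, 45.0),
]
for locale, r in per_locale_correlation(rows).items():
    print(f"{locale}: r = {r:.2f}")
```

With only a handful of test sets per locale, coefficients computed this way swing widely, which matches the slide's caution about statistical relevance.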

Page 11:

Looking at subsets, II

Adequacy and Fluency versus PE Distance

Fluency and PE distance across all locales have a cumulative Pearson’s r of -0.70, a very strong negative relationship

Adequacy and PE distance across all locales have a cumulative Pearson’s r of -0.41, a strong negative relationship

[Figure: Adequacy, Fluency and PE Distance Correlation. Pearson's r (vertical axis, -1.00 to 1.00) for Adequacy & PE Distance and for Fluency & PE Distance, per locale: de_DE, es_ES/LA, fr_FR/CA, it_IT, pt_BR]

Looking at a few select locales with the highest numbers of tests, it looks more varied again.

Page 12:

Outliers: Results

Based on some of the data shown previously, and from the point of view of consistent results versus outliers:

• For Human Evaluations of raw MT output, the inter-annotator agreement was consistent in terms of scores (same test set and language)

• Metrics based on human effort (productivity delta) are less consistent and might include significant variations between individuals (same test set and language)
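One simple way to look at this kind of consistency is to summarise the spread of the values that different people produced for the same test set and language. The sketch below is only an illustration with hypothetical numbers; it is not the agreement measure used in the study.

```python
from statistics import mean, pstdev


def spread(values):
    """Summarise how much a set of measurements for one test set varies."""
    return {
        "mean": mean(values),
        "stdev": pstdev(values),            # population standard deviation
        "range": max(values) - min(values),
    }


# Hypothetical numbers for one test set and language (not from the study).
adequacy_by_evaluator = [3.4, 3.5, 3.6]      # scores on the 1-5 scale
delta_by_post_editor = [-10.0, 5.0, 40.0]    # productivity delta in percent

for name, values in [("adequacy", adequacy_by_evaluator),
                     ("productivity delta", delta_by_post_editor)]:
    stats = spread(values)
    print(f"{name}: mean={stats['mean']:.2f} "
          f"stdev={stats['stdev']:.2f} range={stats['range']:.2f}")
```

Because the two metrics use different scales, each spread should be judged against its own scale (for example, the 4-point width of the 1-5 scale) rather than compared directly.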

Page 13:

Further Questions

Based on the premise that there are significant variations between different post-editors…

… and with the aim of learning from individual behaviors and predicting future productivity gains, we ask ourselves two questions:

• What circumstances or variables most reliably facilitate good-quality, highly productive post-editing?

• Do conditions and parameters outside the post-editor’s control facilitate or hamper his or her success?

Page 14:

Survey

- Q1: What is your primary target language?
- Q2: What is your background?
- Q3: How many years experience?
- Q4: How is your work environment?
- Q5: Which of the following CAT tools have you worked with?
- Q6: What is your level of proficiency on the CAT tool(s) you use?
- Q7: What is your translation methodology?
- Q8: How do you primarily enter text?
- Q9: What are your quality assurance and automation processes?
- Q10: What do you consider most important in your assignments?

5 languages (DE, FR, JP, PTBR, HU)

38 linguists (belonging to 14 different teams)

Page 15:

Q3: How many years experience do you have?

Probably less surprising…
- Except for 1 respondent, all respondents have more experience with translation than with post-editing
- The overall correlation between translation experience and post-editing experience is “strong”

However, looking at correlations by locale:
- German: very strong
- French: weak
- Japanese: weak
- PTBR: strong
- Hungarian: weak

This suggests that for German and Brazilian Portuguese only, the overall experience as a professional translator (whether junior or senior) gives us insights into how much post-editing experience to expect. For the other 3 locales, profiles are more varied.

Page 16:

Q5: Which of the following CAT tools have you worked with? Please select all that apply

The choice of CAT tool is to some extent dependent on the client requirement, but what the data shows is that all locales & respondents are using a broad range of CAT tools for their work.

On average, respondents use / are familiar with 6-8 different CAT tools.

There is a slight trend that junior translators use / are familiar with more CAT tools than senior translators.

All respondents claim to be proficient and / or expert in their most frequently used CAT tool. Those calling themselves “Experts”:
- 6 out of 8 Hungarian respondents
- 3 out of 8 Germans
- 4 out of 9 French
- 1 out of 7 PTBR
- None of the Japanese respondents (despite on average most translation experience)

Page 17:

Q7: Please evaluate the following statements on translation methodology

Of the 5 locales, the French respondents stand out as a very homogeneous group:
- Rarely making use of any pre-processing steps
- Never using free MT tools
- Never using internal MT tools

The Japanese, Brazilian and Hungarian respondents are more likely to perform pre-processing steps.

Japanese translators appear to copy to Word more than any other locale.

Hungarian translators were the only group in which almost half of the respondents never do a draft translation first, but work segment by segment.

Page 18:

Q7: Please evaluate the following statements on translation methodology (continued)

Looking at respondents who Always / Frequently perform any of the 5 proposed actions:
- There was no clear trend with regard to years of translation experience
- There was no clear trend with regard to background
- There was no clear trend between resources working in an office / at home etc.

With regard to text input methods,

French and German translators seem to make more use of CAT tool shortcuts.

Japanese requires the use of Input Method Editors.

Page 19:

Final Conclusions

• Romance languages are the best performers on MT.

• User Assistance is the most suitable content (apart from UGC).

• Translators do not improve homogeneously when moving to post-editing (some of them do not improve at all).

• It is more difficult to foresee post-editing effort than to assess the quality of raw MT. The human effort is still the most variable aspect.

• In some locales (Germany, Brazil) “senior translators” accept post-editing as much as junior translators might do.

• Our French linguists seem to use less automation in their processes.

Page 20:

Next Projects

White Papers: Two white papers elaborating on the approach and results of the Analysis of the Database will be published in the near future.

www.welocalize.com

More research: We continue adding data to our Database; we have also included the survey in our hand-off material when doing productivity tests, with the aim of gaining more insights into the post-editors' background.

Page 21:

THANK YOU