The Tipping Point

Andrzej Zydroń, CTO, XTM Intl

Localization World 2014 Vancouver

The Tipping Point

OCR analogy:

• 1978 Kurzweil Computer Products launches OCR

• Initial quality varied, averaging up to 90%

- Still quicker and cheaper to retype and proof

• Gradual improvements including extensive use of dictionaries

- 1990 quality up to 97%

• 1990s

- Better algorithms, faster processors, cheaper RAM, extensive use of dictionaries, dynamic training, multiple script support

• 2000 – quality up to 99%
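To see why early OCR was still slower than retyping, it helps to turn the accuracy figures above into errors per page. The sketch below assumes the percentages are per-character accuracy and that a page holds about 2,000 characters; both are illustrative assumptions, not figures from the talk.

```python
# Assumptions (not from the talk): accuracy is measured per character,
# and one page holds roughly 2,000 characters.
CHARS_PER_PAGE = 2000

def errors_per_page(accuracy_percent: int, chars: int = CHARS_PER_PAGE) -> int:
    """Expected number of wrong characters on one page."""
    return chars * (100 - accuracy_percent) // 100

for year, accuracy in [(1978, 90), (1990, 97), (2000, 99)]:
    print(year, f"{accuracy}%", errors_per_page(accuracy), "errors per page")
```

Under these assumptions a 90% engine leaves about 200 corrections per page, against about 20 at 99%, which is roughly where proofreading the OCR output becomes cheaper than retyping.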

The Tipping Point

Language

Global Demand

12% pa growth

Average Price Paradox

• Automation

• More competition

• More resources

• Better technology

• Machine translation

The Translation Puzzle


Project Manager requirements:

– Real-time projects

• Creation

• Tracking

• Communication

– Translation assets – TM, Terminology

– Financial management


Client / Requestor requirements:

– Project creation

– Cost confirmation

– Project tracking

– Quality review

– Translation pick up


Linguist requirements:

– Work effectively as a team

– Access to the most up to date assets

– Ensure translation quality

– WYSIWYG preview of target files

– Meet deadlines

Putting the Pieces Together

Swift collaboration of all the project contributors with real-time data sharing and tracking.

Machine Translation

In a nutshell:

– 1950s: IBM/Washington University/Georgetown University

• Transfer systems

• ALPAC Report – 1966

– More expensive, slower, less accurate

– Ambiguity/Complexity of language

– Context

– 1970s/1980s

• Systran (USAF, Xerox, Caterpillar, European Commission), Canadian METEO system

– Statistical Machine Translation (SMT): 2000s

• EU funded research: Moses

• Statistical/Example based translation (Och, Ney, Koehn, Marcu)

– Big Data: 1 million+ aligned sentences

SMT

A great success:

– Google Translate

– Microsoft Translator

– Asia Online

– Safaba

– Tauyou

– DoMY

– Etc.

SMT

Cannot overemphasise the contribution:

– European Union

– Academic institutions:

• University of Edinburgh

• Carnegie Mellon

• Princeton University

• Johns Hopkins University

• University of Pennsylvania

• CNGL

– Dublin City University

– Trinity College

– University of Limerick

SMT

In a nutshell:

– Based on: Information Theory

• Bayesian theory:

• Translation model

– Probability that the source string is the translation of the target string

– Given enough data we can calculate the probability that word ‘A’ is translation for word ‘X’
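The idea of estimating word translation probabilities from data can be sketched with IBM Model 1, a standard word-based model from the SMT literature. This is a simplification: Moses works with phrases rather than single words, and the three-sentence corpus below is invented purely for illustration. The function name `train_ibm_model1` is hypothetical.

```python
from collections import defaultdict

# Toy parallel corpus (invented for illustration): (German source, English target).
corpus = [
    ("das haus".split(), "the house".split()),
    ("das buch".split(), "the book".split()),
    ("ein buch".split(), "a book".split()),
]

def train_ibm_model1(pairs, iterations=20):
    """Estimate t(source_word | target_word) with EM, as in IBM Model 1."""
    src_vocab = {s for src, _ in pairs for s in src}
    # Start from a uniform distribution over source words.
    t = defaultdict(lambda: 1.0 / len(src_vocab))
    for _ in range(iterations):
        count = defaultdict(float)  # expected (source, target) co-occurrences
        total = defaultdict(float)  # expected occurrences of each target word
        for src, tgt in pairs:
            for s in src:
                # E-step: share one unit of s among the target words in this pair.
                norm = sum(t[(s, e)] for e in tgt)
                for e in tgt:
                    frac = t[(s, e)] / norm
                    count[(s, e)] += frac
                    total[e] += frac
        # M-step: renormalise expected counts into probabilities.
        for (s, e), c in count.items():
            t[(s, e)] = c / total[e]
    return t

t = train_ibm_model1(corpus)
for s, e in [("haus", "house"), ("das", "house"), ("buch", "book")]:
    print(f"t({s} | {e}) = {t[(s, e)]:.3f}")
```

Even on three sentence pairs, the probability mass for "house" shifts from "das" towards "haus", because "das" is better explained by "the" elsewhere in the corpus. With real data the same mechanism needs far more aligned sentences to be reliable.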

SMT

Limitations:

– You need an awful lot of data

– Probabilities are at best a ‘guess’

– Word order issues

• English and German

• English and Japanese

– Morphology difficulties

• Impoverished to rich, e.g. English to Polish

– Terminology

– Workflow

– Real time retraining

SMT

Limitations:

– Currently these are an impediment to further SMT adoption

FALCON:

– EU FP7 funded project

– Federated Active Linguistic data CuratiON

– Members

• Dublin City University

• Trinity College Dublin

• Easyling

• Interverbum

• XTM International

– Currently halfway into a 2-year project

– Tight integration

• Easyling

• TermWeb

• XTM

– L3Data

• Linked Language and Localisation Data

• SPARQL linking and curation of language resources

– Advances in SMT

• Adding Babelnet – Lexical Big Data

• Dynamic retraining

• Optimal segment translation sequence

• Forcing terminology (forced decoding)

• Workflow integration

• L3Data curation and sharing

Lays a golden egg

Babelnet: http://www.babelnet.org

• Lexical Big Data

• Sapienza Università di Roma

– Roberto Navigli

– ERC funded project

• Princeton WordNet

• Wikipedia

• Wiktionary

• DBPedia

• Google

• 9.5 million entries

• Equivalents in 50 languages

Moses + Babelnet:

Moses: SMT Big Data

Babelnet: Lexical Big Data

Babelnet + Moses =

much improved SMT

Babelnet + Segment Alignment =

much improved alignment

Dynamic retraining:

– New feature

– Moses learns on the fly as translation/post-editing happens

– Immediate benefits from translator output
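The effect of learning on the fly can be illustrated with a toy phrase table that updates its relative frequencies each time a translator confirms a segment. This is only a sketch of the idea under simple assumptions; it is not how Moses or FALCON implement retraining, and the class `IncrementalPhraseTable` and its methods are hypothetical names.

```python
from collections import Counter, defaultdict

class IncrementalPhraseTable:
    """Toy phrase table whose relative frequencies update after every confirmed segment."""

    def __init__(self):
        # source phrase -> Counter of target phrases seen so far
        self.counts = defaultdict(Counter)

    def learn(self, source_phrase: str, target_phrase: str) -> None:
        """Record one post-edited phrase pair; no batch retraining step is needed."""
        self.counts[source_phrase][target_phrase] += 1

    def probability(self, source_phrase: str, target_phrase: str) -> float:
        """Relative frequency of target_phrase among translations seen for source_phrase."""
        seen = self.counts.get(source_phrase)
        if not seen:
            return 0.0
        return seen[target_phrase] / sum(seen.values())

    def best(self, source_phrase: str):
        """Most frequent translation so far, or None if the phrase is unseen."""
        seen = self.counts.get(source_phrase)
        return seen.most_common(1)[0][0] if seen else None

table = IncrementalPhraseTable()
table.learn("kleines Haus", "small house")
print(table.best("kleines Haus"))
# The post-editor prefers a different rendering in the next two segments.
table.learn("kleines Haus", "small dwelling")
table.learn("kleines Haus", "small dwelling")
print(table.best("kleines Haus"))
```

The second lookup already reflects the translator's corrections, which is the "immediate benefit" the slide refers to.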

Optimal translation sequence:

– Prioritize translation for dynamic retraining
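One plausible way to prioritize segments is a greedy ordering: translate first the segment whose not-yet-covered words recur most often in the rest of the document, so that each confirmed translation helps as many later segments as possible. The talk does not describe the criterion FALCON uses, so the heuristic and the function name `order_for_retraining` below are assumptions.

```python
from collections import Counter

def order_for_retraining(segments):
    """Greedy order: pick next the segment whose unseen words recur most in the document."""
    tokenised = [seg.lower().split() for seg in segments]
    # For each word, the number of segments that contain it.
    doc_freq = Counter(w for toks in tokenised for w in set(toks))
    remaining = list(range(len(segments)))
    seen, order = set(), []

    def gain(i):
        # Value of segment i: how often its uncovered words appear in other segments.
        return sum(doc_freq[w] - 1 for w in set(tokenised[i]) - seen)

    while remaining:
        best = max(remaining, key=gain)  # ties go to the earliest segment
        order.append(best)
        seen.update(tokenised[best])
        remaining.remove(best)
    return [segments[i] for i in order]

segments = [
    "Close the cover",
    "Open the printer cover",
    "Remove the paper tray",
    "Open the paper tray",
]
for seg in order_for_retraining(segments):
    print(seg)
```

In this example "Open the paper tray" comes first because every one of its words appears in other segments, so post-editing it informs the most remaining work.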

Forced decoding:

– Terminology system integration

– Prompt the Moses decoder to use a specific term

– Immediate benefits for translator

das ist ein kleines <term translation="dwelling">Haus</term>
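Markup like the example above can be produced automatically from a terminology list before a segment is sent to the decoder. The sketch below assumes a simple glossary dictionary and reuses the `term` element and `translation` attribute from the slide; the function name `mark_terms` is hypothetical, and how the decoder consumes the markup is outside the scope of this example.

```python
import re
from xml.sax.saxutils import escape, quoteattr

def mark_terms(segment: str, glossary: dict) -> str:
    """Wrap each glossary term in <term translation="..."> and escape the rest as XML text."""
    if not glossary:
        return escape(segment)
    # Longest terms first so multi-word entries win over their substrings.
    terms = sorted(glossary, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, terms)) + r")\b")
    out, pos = [], 0
    for m in pattern.finditer(segment):
        out.append(escape(segment[pos:m.start()]))
        out.append(
            f"<term translation={quoteattr(glossary[m.group(1)])}>{escape(m.group(1))}</term>"
        )
        pos = m.end()
    out.append(escape(segment[pos:]))
    return "".join(out)

print(mark_terms("das ist ein kleines Haus", {"Haus": "dwelling"}))
```

Matching here is case-sensitive and works on surface forms only; a production terminology system would also need to handle inflected forms, which is one of the morphology difficulties listed earlier.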

Workflow integration:

– Making SMT part of an integrated TMS workflow

• Terminology: forced decoding

• Babelnet input

• Translation Memory

• Browser based Translator Workbench

• Dynamic retraining

• Optimal sequence

• Always up to date SMT engines

L3Data curation and sharing:

[Diagram: three linked lifecycles (lex-concept, bitext, content). Labelled steps: publish; correct & refine; discover & use; discover data; (re)train MT; revise and annotate; I18n & source QA; automated translation; post-edit; trans QA; consume; create.]

Limits of current technology

– We are making significant progress

• Big Data generated dictionaries

– 9.5 million+ entries

• Phrase based alignment/translation

• Syntax based translation

• Hierarchical phrase based translation

– Marker/Function words

Limits of current technology

– There are limits with current technology

• Syntax

• Morphology

• Grammar

• Statistical anomalies

• Data dilution

• Idioms

• Out of Vocabulary words

– Computers can never ‘understand’ the text

– Next generation systems need a completely different approach

John Searle’s Chinese Room

Defining Intelligence

Human vs Computer

• Human 200 OPS

• Computer 82,300,000,000 OPS

How the brain works

30 billion cells, 100 trillion synapses

How the brain works

• Trajectory

• Velocity

• Angle

• Wind speed

• Direction

Human Intelligence

Jeff Hawkins: On Intelligence (2004), ISBN 0-8050-7456-2

• Understanding cannot be measured by external behavior

• Understanding is an internal metric of how the brain remembers things to make predictions

• AI programs do not simulate brains and are not intelligent

• All intelligence is concentrated in the neocortex and the synapses that connect different parts of the brain

• Intelligence is primarily based on hierarchical pattern matching starting with an ‘invariant form’: house, animal, dog

• All animals exploit patterns in nature

Simulating Human Intelligence

Beyond Turing

Biological intelligence

Neocortical architecture

Numenta

Cortical theory

Sparse distributed architecture

Pattern matching

Hierarchical Temporal Memory

Simulating Human Intelligence

Hierarchical Temporal Sequence Memory:

Regions

• Learn sequences of common spatial patterns

• Pass stable representations up hierarchy

• Unfold sequences going down hierarchy

Hierarchy

• Reduces memory and training time

• Provides means of generalization

Question and Answer session

Better Translation Technology

Contact Details

XTM International

www.xtm-intl.com

Register for future Webinar sessions

www.xtm-intl.com/demos

Contact

[email protected]

+44 (0) 1753 480 479
