pangea v3 - when engine search meets machine translation, manuel herranz, pangeanic

Manuel Herranz

When search engine [techniques] meet machine translation

Introducing Machine Translation

Unrelated tech changes the game…

ElasticTM

• TM database built from TMX files
• Based on a state-of-the-art full-text search engine
• Extremely fast indexing, search and retrieval
• Supports advanced text retrieval techniques (fuzzy match, regular expressions)
• Easily scalable
• Role-based security
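As a rough illustration of the "built from TMX files" step, the sketch below reads translation units from a TMX string with Python's standard library. The sample document and the function name `read_translation_units` are made up for this example; how ElasticTM actually ingests TMX is not described in the slides.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical two-unit TMX sample, used only for illustration.
SAMPLE_TMX = """<tmx version="1.4">
  <header srclang="en" datatype="plaintext" segtype="sentence"
          adminlang="en" creationtool="example" creationtoolversion="1"
          o-tmf="example"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>I have 3 cats</seg></tuv>
      <tuv xml:lang="es"><seg>Yo tengo 3 gatos</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>I am very happy</seg></tuv>
      <tuv xml:lang="es"><seg>Estoy muy feliz</seg></tuv>
    </tu>
  </body>
</tmx>"""


def read_translation_units(tmx_text):
    """Return one {language: segment text} dict per <tu> element."""
    root = ET.fromstring(tmx_text)
    units = []
    for tu in root.iter("tu"):
        unit = {}
        for tuv in tu.findall("tuv"):
            lang = tuv.get(XML_LANG)
            seg = tuv.find("seg")
            if lang and seg is not None:
                # itertext() also collects text nested inside inline tags.
                unit[lang.lower()] = "".join(seg.itertext()).strip()
        units.append(unit)
    return units
```

Each returned dictionary can then be indexed per language, which fits the monolingual-index design presented later in the deck.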

ElasticTM - Search Engine

• Considered Lucene-based search engines: Solr and ElasticSearch
○ Mature open source projects
○ Have similar capabilities & performance
• ElasticSearch was picked mainly because of:
○ Out-of-the-box scalability
○ Powerful Query DSL (query language)
○ Role-based security (via plugin)
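To give a feel for the Query DSL, here is a minimal sketch that builds two request bodies as plain Python dictionaries: a `match` query with `fuzziness` and a `regexp` query. The field name `segment` and the helper names are assumptions for illustration only; the slides do not show ElasticTM's actual mappings or queries.

```python
import json


def fuzzy_match_query(text, field="segment", size=5):
    """Body for a full-text match query that tolerates small edit distances."""
    return {
        "size": size,
        "query": {
            "match": {
                field: {
                    "query": text,
                    "fuzziness": "AUTO",  # edit distance scales with term length
                }
            }
        },
    }


def regexp_query(pattern, field="segment"):
    """Body for a regular-expression query (term-level, so it is applied
    to individual indexed terms rather than to the whole segment text)."""
    return {"query": {"regexp": {field: {"value": pattern}}}}


if __name__ == "__main__":
    print(json.dumps(fuzzy_match_query("I have 2 cats"), indent=2))
    print(json.dumps(regexp_query("cat[s]?"), indent=2))
```

The bodies would be sent to a per-language index through whichever client is in use; no client call is shown here because the index names and deployment are unknown.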

ElasticTM - Design

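[Diagram: ElasticTM combines a search engine holding one index per language (EN, ES, FR, ..., NL) with a map DB holding language-pair mappings (EN <-> ES, FR <-> ES, FR <-> NL, ...).]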

ElasticTM - Design (cont’d)

• Monolingual indices
○ Memory-efficient
○ Implicit transitive language pairs
• Bilingual mappings
○ Fast bidirectional id <-> id mapping
• Role-based security system
○ Admin, project admin, user, etc.
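One possible reading of "implicit transitive language pairs" is that a segment stored once per language can reach a third language by following id <-> id links through a shared pivot. The toy in-memory class below sketches that idea; `TinyTM` and its methods are hypothetical names, not ElasticTM's API.

```python
from collections import defaultdict, deque


class TinyTM:
    """Toy stand-in: one segment store per language plus id <-> id links."""

    def __init__(self):
        self.indices = defaultdict(dict)  # lang -> {segment_id: text}
        self.links = defaultdict(set)     # (lang, id) -> {(lang, id), ...}

    def add_segment(self, lang, seg_id, text):
        self.indices[lang][seg_id] = text

    def link(self, a, b):
        """Record a bidirectional mapping between two (lang, id) keys."""
        self.links[a].add(b)
        self.links[b].add(a)

    def translations(self, lang, seg_id, target_lang):
        """Follow links breadth-first, so EN -> FR can be found via ES."""
        start = (lang, seg_id)
        seen = {start}
        queue = deque([start])
        found = []
        while queue:
            node = queue.popleft()
            for nxt in sorted(self.links.get(node, ())):
                if nxt in seen:
                    continue
                seen.add(nxt)
                queue.append(nxt)
                if nxt[0] == target_lang:
                    found.append(self.indices[target_lang][nxt[1]])
        return found


tm = TinyTM()
tm.add_segment("en", 1, "I have a cat")
tm.add_segment("es", 10, "Yo tengo un gato")
tm.add_segment("fr", 20, "J'ai un chat")
tm.link(("en", 1), ("es", 10))   # EN <-> ES was imported
tm.link(("fr", 20), ("es", 10))  # FR <-> ES was imported
```

With only the EN <-> ES and FR <-> ES pairs loaded, `tm.translations("en", 1, "fr")` still reaches the French segment, which is the memory saving the slide hints at: no separate EN <-> FR copy of the text is stored.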

ElasticTM - Map

§ Mapping source language segments to a target language
§ Bidirectional map (id to id)
§ Supports quick bulk incremental updates
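The map requirements (bidirectional id-to-id lookup, bulk incremental updates, no duplicate entries) can be sketched with two dictionaries kept in sync. This is an in-memory illustration under a one-to-one assumption; the class name and methods are hypothetical, and a real store could allow several targets per source.

```python
class BidirectionalMap:
    """Keeps source_id -> target_id and target_id -> source_id in sync."""

    def __init__(self):
        self._forward = {}   # source id -> target id
        self._backward = {}  # target id -> source id

    def upsert(self, source_id, target_id):
        """Insert a pair, replacing any stale pair that uses either id."""
        old_target = self._forward.pop(source_id, None)
        if old_target is not None:
            self._backward.pop(old_target, None)
        old_source = self._backward.pop(target_id, None)
        if old_source is not None:
            self._forward.pop(old_source, None)
        self._forward[source_id] = target_id
        self._backward[target_id] = source_id

    def bulk_upsert(self, pairs):
        """Apply many (source_id, target_id) pairs in one call."""
        for source_id, target_id in pairs:
            self.upsert(source_id, target_id)

    def target_of(self, source_id):
        return self._forward.get(source_id)

    def source_of(self, target_id):
        return self._backward.get(target_id)

    def __len__(self):
        return len(self._forward)
```

Re-importing the same pairs leaves the map unchanged, and importing a changed pair overwrites the old one, which is the upsert behaviour that bulk incremental updates rely on.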

Alternatives?

§ NoSQL key-value databases
○ MongoDB, CouchDB, Redis, ElasticSearch, many others …
○ Lack of upsert support for bulk updates
○ Handling duplicate entries
○ Scalability
§ SQL databases
○ MySQL, PostgreSQL ✗✗

ElasticTM - Map - Benchmarks

[Charts: time (s) and memory (MB) per backend. The lower, the better.]

               ElasticSearch   MongoDB   CouchDB   Redis
Add (47K)      83s             432s      67s       458s
Add (440K)     858s            6112s     644s      621s
Query (10K)    51s             187s      458s      72s
Query (440K)   1400s           6451s     19647s    1210s
Memory         252M            549M      771M      148M

ElasticTM - Scaling

[Diagram: two cluster layouts. 1) A cluster with indices EN, ES, ... and maps EN-ES1, EN-ES2. 2) A cluster with index shards EN1, EN2, ES1, ES2, ... and maps EN-ES1, EN-ES2.]

Translation Memory (TM)

• Pre-translations stored in a database and offered as suggestions
• A matching algorithm proposes relevant translations
○ Exact match and fuzzy match
○ Segment similarity based on characters or tokens
• NLP improves the matching algorithm
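The difference between character-based and token-based similarity can be sketched with Python's standard `difflib`. This is a generic illustration, not the scoring formula used by ElasticTM, so its scores need not agree with the percentages shown on the slides; the function names are made up for the example.

```python
from difflib import SequenceMatcher


def char_similarity(a, b):
    """Similarity in [0, 1] computed over characters."""
    return SequenceMatcher(None, a, b).ratio()


def token_similarity(a, b):
    """Similarity in [0, 1] computed over lowercased whitespace tokens."""
    return SequenceMatcher(None, a.lower().split(), b.lower().split()).ratio()


def best_match(query, tm_sources, scorer=token_similarity):
    """Return (score, source) for the highest-scoring TM source segment."""
    return max(((scorer(query, s), s) for s in tm_sources), key=lambda p: p[0])


tm_sources = ["I have 3 cats", "The weather is nice"]
score, source = best_match("I have 2 cats", tm_sources)
```

Character-level scoring barely penalises a one-digit difference, while token-level scoring treats the changed number as a whole mismatching word, so the choice of unit changes which segments clear a fuzzy-match threshold.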

Approach

• Statistical Machine Translation (SMT)
• Computer-Aided Translation (CAT) environment

Run maintenance
• Search and replace
• Update TM entries
• Import & export entries

Translation Memory
TM processing
ElasticTM: full-text search engine + NLP techniques

Basic examples of TM Matching & processing

Input:
{"input_source": "I have 2 cats", "output_target": " "}

Original TM (fuzzy match):
{"source_TM": "I have 3 cats", "target_TM": "Yo tengo 3 gatos", "score": "80%"}

TM after preprocessing (perfect match by substitution):
{"source_TM": "I have <number> cats", "target_TM": "Yo tengo <number> gatos", "score": "100%"}

• URLs
• Emails
• Dates
• Units
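The number example can be sketched as follows: numbers are replaced with a `<number>` placeholder on both the input and the TM entry, and when the templates agree, the input's numbers are written back into the target template. The regular expression and function names are assumptions for illustration, and the sketch assumes numbers appear in the same order in source and target.

```python
import re

NUMBER = re.compile(r"\d+(?:[.,]\d+)?")
PLACEHOLDER = "<number>"


def generalize(text):
    """Replace each number with a placeholder; return (template, numbers)."""
    numbers = NUMBER.findall(text)
    return NUMBER.sub(PLACEHOLDER, text), numbers


def restore(template, numbers):
    """Fill placeholders left to right with the given numbers."""
    values = iter(numbers)
    return re.sub(re.escape(PLACEHOLDER), lambda _: next(values), template)


def translate_by_substitution(input_source, tm_source, tm_target):
    """Return the target with the input's numbers, or None if no match."""
    input_template, input_numbers = generalize(input_source)
    source_template, _ = generalize(tm_source)
    target_template, target_numbers = generalize(tm_target)
    if input_template != source_template:
        return None  # not a perfect match even after generalization
    if len(target_numbers) != len(input_numbers):
        return None  # cannot map numbers one to one
    return restore(target_template, input_numbers)


result = translate_by_substitution(
    "I have 2 cats", "I have 3 cats", "Yo tengo 3 gatos"
)
```

The same pattern would extend to URLs, emails, dates and units by adding one placeholder type per category.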

Basic examples of TM Matching & processing

Input:
{"input_source": "I have a cat", "output_target": " "}

Original TM (fuzzy match):
{"source_TM": "I have a cat and I am very happy", "target_TM": "Yo tengo un gato y estoy muy feliz", "score": "44%"}

TM after preprocessing (perfect match by substitution):
{"source_TM": "I have a cat and I am very happy", "target_TM": "Yo tengo un gato y estoy muy feliz"}
{"source_TM": "I have a cat", "target_TM": "Yo tengo un gato", "score": "100%"}
{"source_TM": "I am very happy", "target_TM": "Estoy muy feliz"}
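The cat example suggests that a long TM entry is split into sub-segments so that a shorter input can match one of them exactly. A deliberately naive sketch, assuming tiny hand-written conjunction lists and purely positional alignment of source and target parts (both are simplifications made for this illustration):

```python
import re

# Hypothetical, deliberately tiny conjunction lists per language.
CONJUNCTIONS = {"en": ["and"], "es": ["y"]}


def split_on_conjunctions(text, lang):
    """Split text at standalone conjunctions and capitalize each part."""
    words = "|".join(re.escape(w) for w in CONJUNCTIONS[lang])
    parts = re.split(rf"\s+(?:{words})\s+", text, flags=re.IGNORECASE)
    return [p[0].upper() + p[1:] for p in (q.strip() for q in parts) if p]


def expand_entry(source, target, src_lang="en", tgt_lang="es"):
    """Return the original pair plus positionally aligned sub-segment pairs."""
    src_parts = split_on_conjunctions(source, src_lang)
    tgt_parts = split_on_conjunctions(target, tgt_lang)
    entries = [(source, target)]
    # Only trust the split when both sides yield the same number of parts.
    if len(src_parts) > 1 and len(src_parts) == len(tgt_parts):
        entries.extend(zip(src_parts, tgt_parts))
    return entries


entries = expand_entry(
    "I have a cat and I am very happy",
    "Yo tengo un gato y estoy muy feliz",
)
```

Positional alignment breaks as soon as the two languages order their clauses differently, which is one reason linguistic features are needed on top of simple string rules.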

Improving TM Matching

• Several languages → maximise the reuse of existing human translations
• Linguistic features → improve fuzzy matching
○ String transformation
○ Segmentation rules
○ POS tagger
○ Tokenizer

[Diagram: languages EN, ES, PT, JA, ..., FR shown on both sides.]

Improving TM Matching
Linguistic approach to improve match

• Segment the text by sentence
○ Using delimiters like . ? ! , - :
○ Limit the total number of words
• Intra-sentence segmentation
○ Using conjunctions, adverbs, determiners, pronouns
○ Other approaches
• Replace segments
○ Numbers, dates, proper nouns and identifiers, URLs, e-mail addresses, punctuation marks, acronyms, variables
• POS pattern string
• Named entity recognition
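The first step in the list above, sentence segmentation with a word limit, might look like the sketch below. It splits only after `.`, `?`, `!` and `:` followed by whitespace, which is a narrower delimiter set than the slide lists, and the default of 20 words is an arbitrary assumption.

```python
import re

# Split after sentence-final punctuation that is followed by whitespace.
DELIMITERS = re.compile(r"(?<=[.?!:])\s+")


def segment(text, max_words=20):
    """Split text into sentences, then cap each segment at max_words words."""
    segments = []
    for sentence in DELIMITERS.split(text.strip()):
        words = sentence.split()
        for start in range(0, len(words), max_words):
            segments.append(" ".join(words[start:start + max_words]))
    return segments


parts = segment("I have a cat. Are you happy? Yes!")
```

Commas and hyphens are left out here because splitting on them fragments many sentences; they fit better in the intra-sentence step, where conjunctions and similar cues decide the split points.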

[Diagram labels: ElasticTM, TMX files, source text]

(Puscasu, 2004; Eriksson and Myhrman, 2010; Orasan, 2000)

Challenges

• Morphologically rich and non-Indo-European languages
• Go beyond statistics (ongoing work, part of EXPERT project)
○ Hybrid approaches improve certain language pairs: Japanese (R&D with Japanese partners), morphologically rich languages, Semitic languages.
• Continue building revenue streams on MT
○ MT allows Pangeanic to build other technologies (web, search, etc.), enhance and improve its solutions for its client portfolio, and offer new services.

Thank you! Questions?

[email protected] #pangeanic pangeanic
