Pangea v3 - When Engine Search Meets Machine Translation, Manuel Herranz, Pangeanic
ElasticTM
• TM database built from TMX files
• Based on a state-of-the-art full-text search engine
• Extremely fast indexing, search and retrieval
• Supports advanced text retrieval techniques (fuzzy match, regular expressions)
• Easily scalable
• Role-based security
ElasticTM - Search Engine
Considered Lucene-based search engines: Solr and ElasticSearch
• Mature open source projects
• Have similar capabilities & performance
ElasticSearch was picked mainly because of:
• Out-of-the-box scalability
• Powerful Query DSL (query language)
• Role-based security (via plugin)
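The Query DSL mentioned above expresses searches as JSON documents. As a minimal sketch, a fuzzy full-text query body could be built as a plain Python dict. The field name `source_text` and the helper `build_fuzzy_query` are hypothetical; the slides do not show the actual ElasticTM index mapping or queries, and no client call is made here.

```python
import json


def build_fuzzy_query(segment, field="source_text", size=5):
    """Build a Query DSL body for a fuzzy full-text match.

    `field` is a hypothetical field name used for illustration only.
    """
    return {
        "size": size,  # maximum number of hits to return
        "query": {
            "match": {
                field: {
                    "query": segment,
                    "fuzziness": "AUTO",  # edit distance chosen from term length
                }
            }
        },
    }


body = build_fuzzy_query("I have 2 cats")
print(json.dumps(body, indent=2))
```

The resulting body would be sent to the search endpoint of whichever index holds the source-language segments.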
ElasticTM - Design (cont’d)
• Monolingual indices
○ Memory-effective
○ Implicit transitive language pairs
• Bilingual mappings
○ Fast bidirectional id <-> id mapping
• Role-based security system
○ Admin, project admin, user etc.
ElasticTM - Map
• Mapping source language segments to a target language
• Bidirectional map (id to id)
• Supports quick bulk incremental updates
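To make the id-to-id map concrete, here is a minimal in-memory sketch. The class `BidirectionalMap`, its one-to-one assumption, and the helper `transitive_pairs` are all hypothetical illustrations; the slides do not describe how ElasticTM stores the map internally. The second function shows one possible reading of "implicit transitive language pairs": an ES-FR pairing derived from EN-ES and EN-FR maps that share English segment ids.

```python
class BidirectionalMap:
    """In-memory stand-in for a source-id <-> target-id map (one-to-one assumed)."""

    def __init__(self):
        self.src_to_tgt = {}
        self.tgt_to_src = {}

    def bulk_upsert(self, pairs):
        """Insert or overwrite many (source_id, target_id) pairs in one call."""
        for src_id, tgt_id in pairs:
            # Drop stale links first so both directions stay consistent.
            old_tgt = self.src_to_tgt.get(src_id)
            if old_tgt is not None:
                self.tgt_to_src.pop(old_tgt, None)
            old_src = self.tgt_to_src.get(tgt_id)
            if old_src is not None:
                self.src_to_tgt.pop(old_src, None)
            self.src_to_tgt[src_id] = tgt_id
            self.tgt_to_src[tgt_id] = src_id


def transitive_pairs(pivot_to_a, pivot_to_b):
    """Derive an A -> B pairing from two maps that share a pivot language."""
    return {
        a_id: pivot_to_b.src_to_tgt[pivot_id]
        for pivot_id, a_id in pivot_to_a.src_to_tgt.items()
        if pivot_id in pivot_to_b.src_to_tgt
    }


en_es = BidirectionalMap()
en_es.bulk_upsert([("en1", "es1"), ("en2", "es2")])
en_fr = BidirectionalMap()
en_fr.bulk_upsert([("en1", "fr1")])
es_fr = transitive_pairs(en_es, en_fr)
print(es_fr)
```

Because every update goes through `bulk_upsert`, re-importing the same pairs does not create duplicate entries.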
Alternatives?
• NoSQL key-value databases: MongoDB, CouchDB, Redis, ElasticSearch, many others …
• SQL databases: MySQL, PostgreSQL
Issues considered: lack of upsert support for bulk updates, handling duplicate entries, scalability
✗✗
ElasticTM - Map - Benchmarks
               ElasticSearch   MongoDB   CouchDB   Redis
Add (47K)      83s             432s      67s       458s
Add (440K)     858s            6112s     644s      621s
Query (10K)    51s             187s      458s      72s
Query (440K)   1400s           6451s     19647s    1210s
Memory         252M            549M      771M      148M
ElasticTM - Scaling
[Figure: two cluster diagrams. 1) Cluster: EN, ES, ...; EN-ES1, EN-ES2. 2) Cluster: EN1, EN2, ES1, ES2, ...; EN-ES1, EN-ES2.]
Translation Memory (TM)
• Pre-translations stored in a database and offered as suggestions
• A matching algorithm proposes relevant translations
○ Exact match and fuzzy match
○ Segment similarity based on characters or tokens
• NLP improves the matching algorithm
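The difference between character-based and token-based similarity can be sketched with Python's standard `difflib`. This is an illustration only: the slides do not say which similarity measure ElasticTM uses, so the scores below will not necessarily match the percentages shown in the examples. The function names and the `source` key are hypothetical.

```python
from difflib import SequenceMatcher


def char_similarity(a, b):
    """Similarity in [0, 1] computed over characters."""
    return SequenceMatcher(None, a, b).ratio()


def token_similarity(a, b):
    """Similarity in [0, 1] computed over whitespace-separated tokens."""
    return SequenceMatcher(None, a.split(), b.split()).ratio()


def best_match(query, tm_entries, scorer=token_similarity):
    """Return the TM entry whose source segment scores highest against the query."""
    return max(tm_entries, key=lambda entry: scorer(query, entry["source"]))


tm_entries = [
    {"source": "I have 3 cats", "target": "Yo tengo 3 gatos"},
    {"source": "I am very happy", "target": "Estoy muy feliz"},
]
hit = best_match("I have 2 cats", tm_entries)
print(hit["target"], token_similarity("I have 2 cats", hit["source"]))
```

Token-based scoring penalises a single changed word more heavily than character-based scoring, which is one reason the choice of unit matters for fuzzy-match thresholds.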
Approach
• Statistical Machine Translation (SMT)
• Computer-Aided Translation (CAT) environment
Run maintenance
• Search and replace
• Update TM entries
• Import & export entries
Translation Memory
TM processing
ElasticTM: full-text search engine + NLP techniques
Basic examples of TM Matching & processing
Input: {"input_source": "I have 2 cats", "output_target": " "}
Original TM (fuzzy match): {"source_TM": "I have 3 cats", "target_TM": "Yo tengo 3 gatos", "score": "80%"}
TM after preprocessing (perfect match by substitution): {"source_TM": "I have <number> cats", "target_TM": "Yo tengo <number> gatos", "score": "100%"}
Preprocessing also covers:
• URLs
• Emails
• Dates
• Units
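The number example above can be sketched as follows. This is a minimal illustration, not the ElasticTM implementation: the regular expression, the helper names, and the assumption that placeholders appear in the same order in source and target are all hypothetical.

```python
import re

NUMBER = re.compile(r"\d+(?:[.,]\d+)?")
PLACEHOLDER = "<number>"


def generalise(text):
    """Replace each number with a placeholder; return the template and the numbers found."""
    return NUMBER.sub(PLACEHOLDER, text), NUMBER.findall(text)


def translate_by_substitution(source, tm):
    """Look up the generalised source in the TM and put the numbers back."""
    template, numbers = generalise(source)
    target = tm.get(template)
    if target is None:
        return None
    # Assumes placeholders occur in the same order on both sides.
    for number in numbers:
        target = target.replace(PLACEHOLDER, number, 1)
    return target


# The TM is generalised once at indexing time.
src_template, _ = generalise("I have 3 cats")
tgt_template, _ = generalise("Yo tengo 3 gatos")
tm = {src_template: tgt_template}

result = translate_by_substitution("I have 2 cats", tm)
print(result)
```

The same pattern extends to URLs, emails, dates and units by adding one placeholder type per pattern.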
Basic examples of TM Matching & processing
Input: {"input_source": "I have a cat", "output_target": " "}
Original TM (fuzzy match): {"source_TM": "I have a cat and I am very happy", "target_TM": "Yo tengo un gato y estoy muy feliz", "score": "44%"}
TM after preprocessing (perfect match by substitution): {"target_TM": "Yo tengo un gato y estoy muy feliz", "source_TM": "I have a cat", "target_TM": "Yo tengo un gato", "source_TM": "I am very happy", "target_TM": "Estoy muy feliz", "score": "100%"}
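The second example splits one long TM entry into sub-segments so that a shorter input can match exactly. A minimal sketch follows, assuming a hypothetical split rule on the conjunctions "and" / "y" and a monotone alignment (the n-th source part corresponds to the n-th target part). Real segmentation rules would be language-specific and are not given in the slides.

```python
import re

# Hypothetical split rules: break at a single coordinating conjunction.
SPLITTERS = {"en": re.compile(r"\s+and\s+"), "es": re.compile(r"\s+y\s+")}


def capitalise(text):
    """Upper-case the first character only, leaving the rest untouched."""
    return text[:1].upper() + text[1:]


def split_entry(source, target, src_lang="en", tgt_lang="es"):
    """Split a TM entry into aligned sub-segments.

    Falls back to the unsplit entry when the two sides split into a
    different number of parts, since the alignment is then unknown.
    """
    src_parts = SPLITTERS[src_lang].split(source)
    tgt_parts = SPLITTERS[tgt_lang].split(target)
    if len(src_parts) != len(tgt_parts):
        return [(source, target)]
    return [(capitalise(s), capitalise(t)) for s, t in zip(src_parts, tgt_parts)]


pairs = split_entry(
    "I have a cat and I am very happy",
    "Yo tengo un gato y estoy muy feliz",
)
sub_tm = dict(pairs)
print(sub_tm.get("I have a cat"))
```

Looking up "I have a cat" in `sub_tm` now gives an exact hit instead of a low fuzzy score against the full sentence.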
Improving TM Matching
• Several languages → maximise the reuse of existing human translations
• Linguistic features → improve fuzzy matching
○ String transformation
○ Segmentation rules
○ POS tagger
○ Tokenizer
Languages: EN, ES, PT, JA, ..., FR
Improving TM Matching
Linguistic approach to improve match
• Segment the text by sentence
○ Using delimiters like . ? ! , - :
○ Limit the total number of words
• Intra-sentence segmentation
○ Using conjunctions, adverbs, determiners, pronouns
○ Other approaches
• Replace segments
○ Numbers, dates, proper nouns and identifiers, URLs, e-mail addresses, punctuation marks, acronyms, variables.
• POS pattern string
• Named entity recognition
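The first step in the list above, splitting on delimiters and capping segment length, can be sketched as below. The delimiter pattern, the `max_words` default and the function name are assumptions chosen for illustration; the slides give no concrete values.

```python
import re

# Split after . ? ! , : when followed by whitespace, or on a spaced hyphen.
DELIMITERS = re.compile(r"(?<=[.?!,:])\s+|\s+-\s+")


def segment(text, max_words=10):
    """Split text on delimiters, then cap each segment at max_words tokens."""
    segments = []
    for part in DELIMITERS.split(text):
        words = part.split()
        for start in range(0, len(words), max_words):
            segments.append(" ".join(words[start:start + max_words]))
    return segments


print(segment("Hello world. How are you? Fine, thanks"))
```

A hard word cap is a blunt tool; the intra-sentence rules in the list (conjunctions, adverbs, determiners, pronouns) would choose more natural break points.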
[Diagram labels: ElasticTM, TMX files, source text]
(Puscasu, 2004; Eriksson and Myhrman, 2010; Orasan, 2000)
Challenges
• Morphologically rich and non-Indo-European languages
• Go beyond statistics (ongoing work, part of EXPERT project)
○ Hybrid approaches improve certain language pairs: Japanese (R&D with Japanese partners), morphologically rich languages, Semitic languages.
• Continue building revenue streams on MT
○ MT allows Pangeanic to build other technologies (web, search, etc.), enhance and improve its solutions for its client portfolio, and offer new services.