Computer communication B
Automatic (Machine) translation
Bibliography
Arnold, D., Balkan, L., Lee Humphreys, R., Meijer, S. & Sadler, L. (1994) Machine translation: an introductory guide. Blackwell, Oxford
Hutchins, W. John (2000) Early years in machine translation. Benjamins
Anything else you find to go deeper into the topic is welcome
Automatic translation
Automatic translation (AT) is the automation of all or part of the process of translating from one human language into another
AT has several aspects
A socio-political aspect
Automation of translation can be a necessity for societies that do not want to impose a common language on their members.
An economic aspect
Human translators are expensive and human translation can take a lot of time.
A scientific aspect
AT sits at the interface between AI, linguistics and computer science
Automatic translation: a brief history
First patent applications for machine translation in the 1930s
First discussions about the possibility of automatic translation around 1946/47
1949 (Warren Weaver, letter to the Rockefeller Foundation)
“I have a text in front of me which is written in Russian but I am going to pretend that it is really written in English and that it has been coded in some strange symbols. All I need to do is strip off the code in order to retrieve the information contained in the text”
Around those years, theoretical and linguistic doubts arose, for example about ambiguous sentences
The philosopher Bar-Hillel argued that Fully Automatic High Quality Machine Translation was impossible from both a theoretical and a technical perspective
Automatic translation: a brief history
In the 1950s AT was introduced as an academic topic
1954
First public demonstration of an automatic translation system, which translated from Russian into English (the Georgetown-IBM experiment)
1955
First AT activities in the Soviet Union
These were the years in which AT gained a lot of popularity and a lot of funding as well
1964: ALPAC (Automatic Language Processing Advisory Committee) is set up; its report is published in 1966 as “Language and machines: computers in translation and linguistics”:
“There is no immediate or predictable prospect of useful machine translation”
Automatic translation: the dark years (66-75)
As a consequence of the ALPAC report, there was a huge drop in funding for AT and, above all, a drop in motivation.
Many research groups closed
Only 3 systems remained active in those years (two in the US and one within the EURATOM project in Ispra, Italy).
Some groups within the Mormon Church developed AT systems to produce a translation of the Bible.
Automatic translation: The renaissance
Around the 1980s the Commission of the European Communities (CEC) bought the English-French version of Systran
The METAL systems (Siemens) were developed
AT began to be adapted for companies
From the 1980s on, a large number of AT systems were developed in Japan
EUROTRA project
New flow of money for AT from the 1980s on
Automatic translation: The present times
In the 1990s: AT based on statistics
Verbmobil project in Germany: bidirectional translation of spoken language between German and English and between German and Japanese (developed between 1993 and 2000 in partnership with Siemens and Philips)
Possible applications for small companies
http://verbmobil.dfki.de/overview-us.html
Present times: hybrid AT
The need for AT keeps growing (internet, globalization)
Technical advances: construction of large corpora and development of statistical methods for AT
AT for small languages as well.
Image of rollercoaster
AT: Evaluation
Positive
It is important to keep a good perspective: results do not need to be perfect
Sketch of the scenario for users
Help for translators (who work on the results in any case)
Provide rough translations to check whether something is important
To get better
Input quality: how easy is the text to translate?
Most texts were not written with translation in mind
AT: a possible stage scenario
For the most part the aim of AT is to produce the first draft of the translation.
The scenario is composed of 8 main stages
1) Documents are in electronic form
2) Several computers are linked by a network
3) There is an MT system (called, for example, X)
4) Parts of the document to be translated are sent to the MT system
5) The text is translated
6) The MT system gives an output
7) Post-editing
8) Human double check
AT: The stage process
1) The submitted text should be in a format that helps the MT system
Bad input = bad output
Short sentences
Grammatical sentences
Avoid semantic and syntactic ambiguities
A good input means a better output, and therefore less post-editing time is required.
There are text-critiquing systems built into MT programs
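As a rough illustration of what a text-critiquing step might check, the sketch below flags sentences that are probably too long for an MT system. The function name `flag_long_sentences` and the 20-word threshold are assumptions made for this example; the slides give no concrete limit, and real text-critiquing components check much more than length.

```python
import re

MAX_WORDS = 20  # assumed threshold; the slides do not specify a number


def flag_long_sentences(text, max_words=MAX_WORDS):
    """Return (sentence, word_count) pairs for sentences longer than max_words."""
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    flagged = []
    for sentence in sentences:
        count = len(sentence.split())
        if count > max_words:
            flagged.append((sentence, count))
    return flagged


text = (
    "Il latte era aperto. "
    "ieri sera sono andata a casa e dopo aver visto che il cartone del latte "
    "si era aperto e tutto il latte era sparso sul pavimento mi sono sentita male"
)
for sentence, count in flag_long_sentences(text):
    print(count, "words:", sentence)
```

A check like this only addresses the "short sentences" guideline; grammaticality and ambiguity need linguistic analysis and are not covered here.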
Systran online: an example
Input:
“ieri sera sono andata a casa e dopo aver visto che il cartone del latte si era aperto e tutto il latte era sparso sul pavimento mi sono sentita male”
Output:
“yesterday evening has gone to house and after to have since the latte cardboard of the latte ones had been opened and all the era scattered on the pavement they are felt to me badly”
Systran online: an example
Input:
Sentences that are too long
Words that are ambiguous between the two languages: “era”
Output:
Not optimal
Post-editing is needed
A human double check is needed as well
What do a (human) translator and a machine need?
Knowledge of the source language
Knowledge of the target language
Knowledge of the correspondences between L1 and L2
Cultural knowledge
Common sense
All types of knowledge are needed at several levels:
Lexical (lexical knowledge is implemented by dictionaries)
Phonological
Morphological
Syntactic
Semantic
Pragmatic
Discourse
Formal representations of grammar
A sentence is not a simple concatenation of words: words are put together in specific groups called constituents, which are in turn combined into phrases (NP, VP, PP, NegP etc.)
More simply, an (English) sentence is usually formed by a subject, a verb and an object, with the possible presence of auxiliaries, modal verbs or wh-elements
In linguistics the structure of a sentence (syntax) can be formally represented in many ways. The most widely used graphical representation is the “tree structure”, typically following the X-bar schema proposed by Jackendoff (1977)
The phrase structure
[Tree diagram: NP at the root, with a Specifier and, one level down, the Head and its Complement (typically another phrase)]
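A phrase-structure tree of this kind can be sketched as a small data structure. The example below is only an illustration: the `Node` class, the `bracket` helper and the phrase "the picture of Mary" are assumptions chosen for this sketch, not taken from the slides.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    label: str                                   # category, e.g. "NP" or "N'"
    children: List["Node"] = field(default_factory=list)
    word: Optional[str] = None                   # set only on leaves


def bracket(node):
    """Render a tree in labelled-bracket notation."""
    if node.word is not None:
        return f"[{node.label} {node.word}]"
    inner = " ".join(bracket(child) for child in node.children)
    return f"[{node.label} {inner}]"


# Hypothetical example phrase: specifier, head, and a complement that is
# itself another phrase (a PP).
tree = Node("NP", [
    Node("Spec", word="the"),
    Node("N'", [
        Node("N", word="picture"),               # head
        Node("PP", [                             # complement
            Node("P", word="of"),
            Node("NP", [Node("N", word="Mary")]),
        ]),
    ]),
])

print(bracket(tree))
# [NP [Spec the] [N' [N picture] [PP [P of] [NP [N Mary]]]]]
```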
Parsing (or analysis) by MT
MT systems use annotated text
Part-of-speech (POS) tagging (whether an element in a sentence is a noun, a verb etc.)
Syntactic annotation (how phrases are related to each other)
Semantic annotation
MT systems use formal representations of grammar as models and they parse (derive) the syntactic structure of sentences using more or less complicated algorithms.
Usually lists are used instead of the graphical tree representations
Automatic parsers
Automatic parsers use a formal grammar as their base
They take an input sentence
They apply that specific grammar to the sentence
to check whether the sentence is grammatical
and to show how words are combined into phrases
They give an insight into the syntactic structure of a sentence
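The behaviour described above can be sketched with a very small top-down parser. Everything in this example is an assumption made for illustration: the toy grammar, the toy lexicon and the function names are invented, and real MT parsers use far larger grammars and more efficient algorithms. The parse result is a nested list, in line with the remark that lists are often used instead of tree diagrams.

```python
# Toy grammar and lexicon, invented for this sketch.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["Det", "N"], ["Name"]],
    "VP": [["V", "NP"], ["V"]],
}
LEXICON = {
    "the": "Det", "boss": "N", "music": "N",
    "Gianni": "Name", "likes": "V", "sleeps": "V",
}


def parse(symbol, words, start):
    """Yield (tree, next_position) for each way `symbol` can start at `start`."""
    if symbol not in GRAMMAR:                    # part-of-speech category
        if start < len(words) and LEXICON.get(words[start]) == symbol:
            yield [symbol, words[start]], start + 1
        return
    for rhs in GRAMMAR[symbol]:
        yield from expand(rhs, words, start, [symbol])


def expand(rhs, words, pos, partial):
    """Match the remaining right-hand-side symbols one after another."""
    if not rhs:
        yield partial, pos
        return
    for subtree, nxt in parse(rhs[0], words, pos):
        yield from expand(rhs[1:], words, nxt, partial + [subtree])


def parse_sentence(sentence):
    """Return a nested-list tree if the sentence is grammatical, else None."""
    words = sentence.split()
    for tree, end in parse("S", words, 0):
        if end == len(words):
            return tree
    return None


print(parse_sentence("Gianni likes the music"))
print(parse_sentence("likes Gianni the"))        # not covered by the grammar
```

The first call returns a structure showing how the words combine into phrases; the second returns `None`, which is how this sketch reports an ungrammatical input.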
Translation engines
Translation engines are the part of an MT system that actually performs the automatic translation
They can be classified according to their architecture
Transformer architecture
Linguistic knowledge architectures
Transfer systems
Interlingual systems
Transformer (or direct) engines
Input sentences are transformed into output (target language) sentences by carrying out the simplest possible parse,
replacing the source words with their target language equivalents as specified in a bilingual dictionary, and then roughly re-arranging their order to suit the rules of the target language
Stages
Parser (analysis of the source language)
Transformation rules (these include a bilingual dictionary and some re-ordering rules)
Some morphological transformations are applied as well (morphological component)
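The stages listed above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the five-word Italian-English dictionary, the single noun-adjective re-ordering rule and the function name `transform` are invented for this example and do not describe any particular system.

```python
# Toy bilingual dictionary (Italian -> English) with a part of speech per
# entry; invented for this sketch.
BILINGUAL = {
    "la": ("the", "Det"),
    "casa": ("house", "N"),
    "rossa": ("red", "Adj"),
    "è": ("is", "V"),
    "grande": ("big", "Adj"),
}


def transform(sentence):
    """Word-for-word replacement followed by one rough re-ordering rule."""
    # 1) Simplest possible "parse": look each word up; unknown words are
    #    passed through unchanged instead of stopping the system.
    tagged = [BILINGUAL.get(w, (w, "Unk")) for w in sentence.lower().split()]
    # 2) Re-ordering rule: Italian noun + adjective -> English adjective + noun.
    i = 0
    while i < len(tagged) - 1:
        if tagged[i][1] == "N" and tagged[i + 1][1] == "Adj":
            tagged[i], tagged[i + 1] = tagged[i + 1], tagged[i]
            i += 2
        else:
            i += 1
    return " ".join(word for word, _ in tagged)


print(transform("La casa rossa è grande"))   # the red house is big
print(transform("La casa gialla"))           # the house gialla
```

The second call shows the typical trade-off of this design: the unknown word "gialla" does not stop the system, but it is left untranslated and the re-ordering rule no longer applies.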
Transformer engines 2
Transformers have little independent knowledge of either the source language or the target language
They rarely recognize ungrammaticalities
The output can sometimes be totally ungrammatical, close to a word salad
+ Points
Robust: the system does not stop when it encounters unknown words
- Points
Not linguistically oriented
Can give ungrammatical output
Difficult to expand into a multilingual system
Linguistic knowledge architectures
High-quality MT requires linguistic knowledge of both the input and the output languages (together with knowledge of the differences between them)
Linguistic knowledge architectures carry out a deeper syntactic analysis.
They have a substantial grammar for both the input and the output languages
They have a comparative grammar that relates the input and the output languages
The two grammars are developed and represented quite separately.
Linguistic knowledge architectures 2
Analysis
The parser and the grammar are used to analyze the input language
Transfer
The underlying representation of the input sentence is changed into the corresponding representation in the output language
Synthesis
From the resulting underlying representation, the generator creates a sentence in the output language (using the relevant grammar as well)
All these processes can be carried out with the two grammars available
But comparative rules are needed to resolve differences between the grammars of the input and the output languages
Linguistic knowledge architectures 3
Example
Le mele piacciono a Gianni
Le mele (subj) piacciono (V) a Gianni (obj)
Gianni likes apples
Gianni (subj) likes (V) apples (obj)
For every difference in grammar between the two languages, a specific comparative grammar rule has to be written
The deeper the level of abstraction of the parser, the fewer comparative grammar rules have to be written
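The "piacere" example can be sketched as an analysis, transfer and synthesis chain. The representation format (a dictionary with `pred`, `subj`, `iobj`, `obj` keys), the function names and the tiny lexicon are assumptions made for this illustration; the analysis result is written by hand, whereas a real system would obtain it from the parser and the Italian grammar.

```python
# Hand-written analysis of "Le mele piacciono a Gianni" (assumed format).
source = {"pred": "piacere", "subj": "mele", "iobj": "Gianni"}

# Toy lexical correspondences, invented for this sketch.
LEXICAL = {"mele": "apples", "Gianni": "Gianni"}


def transfer(rep):
    """Comparative rule: piacere(subj=X, iobj=Y) -> like(subj=Y, obj=X)."""
    if rep["pred"] == "piacere":
        return {
            "pred": "like",
            "subj": LEXICAL[rep["iobj"]],
            "obj": LEXICAL[rep["subj"]],
        }
    raise ValueError(f"no transfer rule for {rep['pred']!r}")


def synthesize(rep):
    """Tiny English generator; assumes a third-person singular subject."""
    return f"{rep['subj']} {rep['pred']}s {rep['obj']}"


print(synthesize(transfer(source)))   # Gianni likes apples
```

The single rule in `transfer` plays the role of one comparative grammar rule: it exists only because the two languages assign the grammatical roles differently for this verb.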
LK architectures 3
Advantages
The output will always be grammatical
It is theoretically a reversible system (from L1 to L2 and conversely from L2 to L1)
Interlingua
As the need for a comparative grammar decreases (because the parser works at a deeper level), what is called an INTERLINGUA arises
Interlingual systems are language independent
Interlingual representations concern meanings
An interlingua tries to capture how the world is made and how its elements are put together
[Diagram: parser depth, comparative grammar, interlingua]
Interlingua
An interlingua is a meaning representation suitable for all languages
John can not go: obligatory(not(go(john)))
Each translation happens in two steps
The source language is translated into the interlingua
The interlingua is then translated into the output language
But: because each translation happens in two steps
The process is longer
Generalization to the worst case (the interlingua has to encode every distinction that any of the languages needs)
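The two steps can be sketched with the meaning representation from the slide written as nested tuples. This is a toy under stated assumptions: the tuple encoding, the function names, the choice of Italian as target language and the single hard-coded sentence pattern are all invented for illustration.

```python
def show(term):
    """Print a nested (operator, argument) term in the slide's notation."""
    if isinstance(term, str):
        return term
    op, arg = term
    return f"{op}({show(arg)})"


def analyse_english(sentence):
    """Step 1: source language -> interlingua (one hard-coded pattern only)."""
    words = sentence.lower().split()
    if len(words) == 4 and words[1:3] == ["can", "not"]:
        return ("obligatory", ("not", (words[3], words[0])))
    raise ValueError("sentence not covered by this toy analyser")


def generate_italian(term):
    """Step 2: interlingua -> target language (one hard-coded pattern only)."""
    verbs = {"go": "andare"}                     # toy lexicon
    op, (neg, (verb, subject)) = term
    if op == "obligatory" and neg == "not":
        return f"{subject.capitalize()} non può {verbs[verb]}"
    raise ValueError("meaning not covered by this toy generator")


meaning = analyse_english("John can not go")
print(show(meaning))               # obligatory(not(go(john)))
print(generate_italian(meaning))   # John non può andare
```

Note that the analyser knows nothing about Italian and the generator knows nothing about English; they only share the intermediate representation.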
Interlingua
Think, by analogy, of individuals living in a series of tall closed towers, all erected over a common foundation. When they try to communicate with one another, they shout back and forth, each from his own closed tower. It is difficult to make the sound penetrate even the nearest towers, and communication proceeds very poorly indeed.
But, when an individual goes down his tower, he finds himself in a great open basement, common to all the towers. Here he establishes easy and useful communication with the persons who have also descended from their towers.
Thus it may be true that the way to translate from Chinese to Arabic, or from Russian to Portuguese, is not to attempt the direct route, shouting from tower to tower. Perhaps the way is to descend, from each language, down to the common base of human communication -the real but as yet undiscovered universal language- and then re-emerge by whatever particular route is convenient.
Warren Weaver
AT: other methods
Example-based methods
Texts in parallel corpora are compared
The process matches the input against stored example translations
It relies on a corpus of bilingual translations, and the goal is to find the best-matching translation (using specific algorithms)
The challenge is to be able to draw conclusions about the rules of translation
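The matching step can be sketched with a similarity measure from the Python standard library. This is only an illustration: the three-sentence parallel corpus is invented, and word-level `difflib.SequenceMatcher` similarity stands in for the "specific algorithms" mentioned above, which the slides do not name.

```python
from difflib import SequenceMatcher

# Toy parallel corpus (Italian, English), invented for this sketch.
EXAMPLES = [
    ("il capo ascolta la musica", "the boss listens to the music"),
    ("Gianni ascolta la radio", "Gianni listens to the radio"),
    ("le mele piacciono a Gianni", "Gianni likes apples"),
]


def best_match(source):
    """Return the stored example closest to `source` and its similarity."""
    def score(pair):
        # Compare word sequences, not characters.
        return SequenceMatcher(None, source.split(), pair[0].split()).ratio()

    best = max(EXAMPLES, key=score)
    return best, score(best)


example, similarity = best_match("il capo ascolta la radio")
print(example[1], similarity)
```

The retrieved translation still contains "music" where the input has "radio", which is exactly the challenge noted above: a full system has to work out which parts of the stored example to keep and which to replace.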
AT: other methods
Statistical methods
Translate the words, giving a literal translation in the target language
This translation is then edited to produce a good expression in the target language
They are based on probability and statistics
Some problems
Not every good sentence in the target language is a good translation
Case ambiguities
Den Vorschlag lehnt die Kommission ab
1) The proposal rejects the commission
2) The commission rejects the proposal
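One way a statistical system can prefer one candidate over another is with a language model of the target language. The sketch below trains a bigram model with add-one smoothing on a three-sentence corpus and scores the two candidate translations. The corpus, the function names and the choice of a bigram model are assumptions made for this illustration; real systems use very large corpora and combine such a model with a translation model.

```python
import math
from collections import Counter

# Tiny target-language corpus, invented for this sketch.
CORPUS = [
    "the commission rejects the proposal",
    "the commission accepts the proposal",
    "the parliament rejects the report",
]


def train_bigrams(sentences):
    """Count context words and word pairs, with sentence boundary markers."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(tokens[:-1])
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def log_prob(sentence, unigrams, bigrams, vocab_size):
    """Log-probability of a sentence under the add-one smoothed bigram model."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    return sum(
        math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab_size))
        for a, b in zip(tokens, tokens[1:])
    )


unigrams, bigrams = train_bigrams(CORPUS)
vocab_size = len(set(unigrams) | {"</s>"})

candidates = [
    "the proposal rejects the commission",
    "the commission rejects the proposal",
]
best = max(candidates, key=lambda c: log_prob(c, unigrams, bigrams, vocab_size))
print(best)   # the commission rejects the proposal
```

The model picks the second reading only because similar word sequences occur in its corpus. It measures fluency in the target language, not faithfulness to the source, which is why a fluent sentence is not necessarily a good translation.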
AT: some problems
Idiomatic expressions (very difficult)
“Non piangere sul latte versato”
Do not cry on the poured (spilled) milk
It is useless to cry over damage that is already done
http://babelfish.altavista.com/tr
http://www.systran.co.uk/
http://www.google.com/translate_t
Lexical and morphological mistakes
AT: some problems
Semantic ambiguities
Il capo ascolta la musica
The boss listens to the music
Morphology
Der Urinstinkt ist noch immer vorhanden
The primitive instinct is still present
Complicated constructions
Sentences that are too long, with too many subordinate clauses
“Se avessi saputo che sarebbe andata a casa l'avrei immediatamente fermata”
“If I would have known that she would have gone home, I would have immediately stopped her”