Computer communication B Automatic (Machine) translation

Upload: daxia

Post on 14-Jan-2016



TRANSCRIPT

Page 1: Computer communication B

Computer communication B

Automatic (Machine) translation

Page 2: Computer communication B

Bibliography
Arnold, D., Balkan, L., Lee Humphreys, R., Meijer, S. & Sadler, L. (1994) Machine translation: an introductory guide. Blackwell, Oxford
Hutchins, W. John. (2000) Early years in machine translation. Benjamins
Any further material you find to go deeper into the topic is welcome

Page 3: Computer communication B

Automatic translation
Automatic translation (AT) is the automation of all or part of the process of translating from one human language into another.
AT has several aspects
A social and political aspect
  Automation of translation can be a necessity for societies that do not want to impose a common language on their members.
An economic aspect
  Human translators are expensive and human translation can take a lot of time.
A scientific aspect
  AT sits at the interface between AI, linguistics and computer science.

Page 4: Computer communication B

Automatic translation: a brief history
First patent application for machine translation in the 1930s
First discussions about the possibility of automatic translation around 1946/47
1949 (Warren Weaver, letter to the Rockefeller Foundation)
  “I have a text in front of me which is written in Russian but I am going to pretend that it is really written in English and that it has been coded in some strange symbols. All I need to do is strip off the code in order to retrieve the information contained in the text”
In the same years, theoretical and linguistic doubts arose, especially about ambiguous sentences
  The philosopher Bar-Hillel argued that Fully Automatic High Quality Machine Translation was impossible from both a theoretical and a technical perspective

Page 5: Computer communication B

Automatic translation: a brief history
In the 1950s AT was introduced as an academic topic
1954
  First public demonstration of an automatic translation system, which translated from Russian into English (the Georgetown-IBM experiment)
1955
  First AT activities in the Soviet Union
These were the years in which AT gained a lot of popularity and a lot of funding as well
1964 (published in 1966 as “Language and Machines: Computers in Translation and Linguistics”): ALPAC (Automatic Language Processing Advisory Committee) report:
  “There is no immediate or predictable prospect of useful machine translation”

Page 6: Computer communication B

Automatic translation: the dark years (66-75)

As a consequence of the ALPAC report, funding for AT dropped sharply, and above all motivation dropped.
Many research groups closed
Only 3 systems remained active in those years (two in the US and one within the EURATOM project in Ispra, Italy).
Some groups within the Mormon Church developed AT systems in order to produce a translation of the Bible.

Page 7: Computer communication B

Automatic translation: The renaissance

Around the 1980s the Commission of the European Communities (CEC) bought the English-French version of Systran
The METAL systems (Siemens) were developed
AT began to be adapted for companies
From the 1980s on there was a large development of AT systems in Japan
EUROTRA project
New flow of money for AT from the 1980s on

Page 8: Computer communication B

Automatic translation: The present times

In the 1990s: AT based on statistics
Verbmobil project in Germany
  Translation of spoken language, in both directions between German and English and between German and Japanese
  Developed between 1993 and 2000 in partnership with Siemens and Philips
  Possible applications for small companies
  http://verbmobil.dfki.de/overview-us.html
Present times: hybrid AT
  The need for AT keeps growing (internet, globalization)
  Technical advances: building of large corpora and development of statistical methods for AT
  AT for small languages as well

Page 9: Computer communication B

Image of rollercoaster

Page 10: Computer communication B

AT: Evaluation
Positive
  It is important to keep a good perspective: results do not need to be perfect
Sketch of the scenarios for users
  Help for translators (who work on the results in any case)
  Rough translations to check whether something is important
To become better
  Input quality: how easy is the text to translate?
  Most texts were not written with translation in mind

Page 11: Computer communication B

AT: a possible stage scenario
For the most part the aim of AT is to produce the first draft of a translation.
The process is composed of 8 main stages
1) Documents are in an electronic form
2) Several computers are linked by a network
3) There is an MT system (called, for example, X)
4) Parts of the document to be translated are sent to the MT system
5) The text needs to be translated
6) The MT system gives an output
7) Post-editing
8) Human double check

Page 12: Computer communication B

AT: The stage process

1) The submitted text should be in a format that helps the MT system
Bad input = bad output
  Short sentences
  Grammatical
  Avoid semantic and syntactic ambiguities
A good input means a better output, and therefore less post-editing time is required.
There are text-critique systems built into MT programs
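The idea of a text-critique step can be sketched as a small check that flags sentences likely to cause trouble before they are sent to the MT system. This is only an illustration: the function name `critique` and the 20-word threshold are assumptions, not taken from any real MT product, and real systems check much more than length.

```python
import re

MAX_WORDS = 20  # assumed threshold, chosen only for illustration


def critique(text, max_words=MAX_WORDS):
    """Return (sentence, word_count) pairs for sentences longer than max_words."""
    # Naive sentence splitting on ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    flagged = []
    for sentence in sentences:
        count = len(sentence.split())
        if count > max_words:
            flagged.append((sentence, count))
    return flagged
```

Calling `critique` on a draft returns only the overlong sentences, so the author can shorten them before translation.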

Page 13: Computer communication B

Systran online: an example

Input:

“ieri sera sono andata a casa e dopo aver visto che il cartone del latte si era aperto e tutto il latte era sparso sul pavimento mi sono sentita male”

Output:

“yesterday evening has gone to house and after to have since the latte cardboard of the latte ones had been opened and all the era scattered on the pavement they are felt to me badly”

Page 14: Computer communication B

Systran online: an example

Input:

  Sentences that are too long
  Words that are ambiguous between the two languages: “era”

Output:

  Not optimal

Post editing is needed

Human double check is needed as well

Page 15: Computer communication B

What do a (human) translator and a machine need?
Knowledge of the source language
Knowledge of the target language
Knowledge of the correspondences between L1 and L2
Cultural knowledge
Common sense
All types of knowledge should be available at several levels:
  Lexical (the lexical knowledge is implemented by dictionaries)
  Phonological
  Morphological
  Syntactic
  Semantic
  Pragmatic
  Discourse

Page 17: Computer communication B

Formal representations of grammar

A language is not a simple concatenation of words: words are put together in specific groups called constituents, which are in turn combined into phrases (NP, VP, PP, NegP, etc.)

More simply, an (English) sentence is usually formed by a subject, a verb and an object, with the possible presence of auxiliaries, modal verbs or wh-elements

In linguistics the structure of a sentence (syntax) can be formally represented in many ways. The most widely used representation is the tree structure, typically following the X-bar schema proposed by Jackendoff (1977)

Page 18: Computer communication B

The phrase structure

NP
  Specifier
  Head
  Complement (typically another phrase)

Page 19: Computer communication B

Parsing (or analysis) by MT

MT systems use annotated text
  Part-of-speech (POS) tagging (whether an element in a sentence is a noun, a verb, etc.)
  Syntactic annotation (how phrases are related to each other)
  Semantic annotation
MT systems use formal representations of grammar as models, and they parse (derive) the syntactic structure of sentences using more or less complicated algorithms.
Usually lists are used instead of the formal linguistic representations
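To illustrate what "lists instead of tree diagrams" can look like, here is a minimal sketch in which a syntactic tree is stored as nested lists of the form [label, child, child, ...]. The example sentence, the labels and the helper names `leaves` and `bracketed` are assumptions made for this illustration.

```python
# A parse of "the dog chased the cat" as nested lists: [label, child, ...].
TREE = ["S",
        ["NP", ["Det", "the"], ["N", "dog"]],
        ["VP", ["V", "chased"],
               ["NP", ["Det", "the"], ["N", "cat"]]]]


def leaves(node):
    """Return the words of the tree from left to right."""
    if isinstance(node, str):
        return [node]
    words = []
    for child in node[1:]:  # node[0] is the category label
        words.extend(leaves(child))
    return words


def bracketed(node):
    """Render the tree in labelled bracket notation."""
    if isinstance(node, str):
        return node
    return "[" + node[0] + " " + " ".join(bracketed(c) for c in node[1:]) + "]"
```

The same structure that a linguist would draw as a tree can be traversed, printed or transformed with a few lines of list processing.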

Page 20: Computer communication B

Automatic parsers

Automatic parsers use a formal grammar as their base
They take an input sentence
They apply that specific grammar to the sentence
  To check whether the sentence is grammatical
  To show and derive how words are combined into phrases
They give an insight into the syntactic structure of a sentence
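The behaviour described above can be sketched with a very small top-down parser. The grammar, the lexicon and all function names below are toy assumptions for illustration only; real MT parsers use far larger grammars and more efficient algorithms.

```python
# Toy context-free grammar: each category maps to its possible expansions.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["Det", "N"]],
    "VP": [["V", "NP"], ["V"]],
}
# Toy lexicon: word -> part of speech.
LEXICON = {"the": "Det", "a": "Det", "dog": "N", "cat": "N",
           "chased": "V", "sleeps": "V"}


def parse(symbol, words, start):
    """Yield (tree, next_position) for each way symbol can start at words[start]."""
    if symbol not in GRAMMAR:  # part-of-speech category: consume one word
        if start < len(words) and LEXICON.get(words[start]) == symbol:
            yield [symbol, words[start]], start + 1
        return
    for rule in GRAMMAR[symbol]:
        yield from expand(rule, words, start, [symbol])


def expand(rule, words, pos, partial):
    """Match the remaining categories of a rule, extending the partial tree."""
    if not rule:
        yield partial, pos
        return
    for subtree, nxt in parse(rule[0], words, pos):
        yield from expand(rule[1:], words, nxt, partial + [subtree])


def parse_sentence(sentence):
    """Return a tree if the sentence is grammatical under GRAMMAR, else None."""
    words = sentence.lower().split()
    for tree, end in parse("S", words, 0):
        if end == len(words):
            return tree
    return None
```

A returned tree means the sentence is grammatical according to this grammar and shows how its words combine into phrases; `None` means the grammar does not accept the sentence.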

Page 21: Computer communication B

Translation engines

Translation engines are the part of an MT system that actually performs the automatic translation

They can be classified according to their architecture
  Transformer architecture
  Linguistic knowledge architectures
    Transfer systems
    Interlingual systems

Page 22: Computer communication B

Transformer (or direct) engines

Input sentences are transformed into output (target-language) sentences by carrying out the simplest possible parse, replacing the source words with their target-language equivalents as specified in a bilingual dictionary, and then roughly rearranging their order to suit the rules of the target language

Stages
  Parser (analysis of the source language)
  Transformation rules (a bilingual dictionary and some reordering rules; some morphological transformations are present as well, in a morphological component)
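A minimal sketch of the stages of a transformer engine: a dictionary lookup per word followed by one reordering rule. The Italian-English word list, the tags and the single noun-adjective rule are toy assumptions for illustration, not the contents of any real system.

```python
# Toy bilingual dictionary: Italian word -> (English word, part of speech).
LEXICON = {
    "la": ("the", "DET"),
    "casa": ("house", "NOUN"),
    "rossa": ("red", "ADJ"),
    "è": ("is", "VERB"),
    "grande": ("big", "ADJ"),
}


def transform(sentence):
    """Word-for-word replacement plus one reordering rule (NOUN ADJ -> ADJ NOUN)."""
    # Simplest possible "parse": split into words and look each one up.
    # Unknown words are passed through unchanged.
    tagged = [LEXICON.get(w, (w, "UNK")) for w in sentence.lower().split()]
    output = []
    i = 0
    while i < len(tagged):
        is_noun_adj = (i + 1 < len(tagged)
                       and tagged[i][1] == "NOUN"
                       and tagged[i + 1][1] == "ADJ")
        if is_noun_adj:
            # Italian puts the adjective after the noun; English puts it before.
            output += [tagged[i + 1][0], tagged[i][0]]
            i += 2
        else:
            output.append(tagged[i][0])
            i += 1
    return " ".join(output)
```

For example, `transform("la casa rossa")` gives "the red house". Because the engine knows nothing beyond its dictionary and its rules, anything the rules do not cover comes out in source-language order.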

Page 23: Computer communication B

Transformer engines 2

Transformers do not have much independent knowledge of either the source language or the target language
  They rarely recognize ungrammaticalities
  The output can sometimes be totally ungrammatical, resembling a word salad
+ Points
  Robust: the system does not stop when it encounters unknown words
- Points
  Not linguistically oriented
  Can give ungrammatical output
  Difficult to expand into a multilingual system

Page 24: Computer communication B

Linguistic knowledge architectures

High-quality MT requires linguistic knowledge of both the input and the output languages (together with knowledge of the differences between them)
Linguistic knowledge architectures have a deeper syntactic analysis.
They have a substantial grammar for both the input and the output languages
They have a comparative grammar to relate the input and the output languages
The two grammars are developed and represented quite separately.

Page 25: Computer communication B

Linguistic knowledge architectures 2

Analysis
  The parser and the grammar are used to analyze the input language
Transfer
  The underlying representation of the input language is changed into that of the output language
Synthesis
  From the generated underlying representation of the output language, the generator creates a sentence in the output language (using the relevant grammar as well)

All these processes can be carried out with the two grammars available

But to resolve some differences between the grammars of the input and the output languages, comparative rules are needed

Page 26: Computer communication B

Linguistic knowledge architectures 3

But to solve some differences between the grammars of the input and the output languages comparative rules are needed

Example
  Le mele piacciono a Gianni
  Le mele (subj) piacciono (V) a Gianni (obj)
  Gianni likes apples
  Gianni (subj) likes (V) apples (obj)

For every difference in grammar between the two languages, a specific comparative grammar rule has to be written

The deeper the level of abstraction of the parser, the fewer comparative grammar rules have to be written
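The piacere/like example can be sketched as a transfer step followed by synthesis. The analysis step is assumed to have already produced the abstract clause; the data classes, the lemma dictionary and the rule set `SWAP_ARGUMENTS` are hypothetical names introduced only for this illustration.

```python
from dataclasses import dataclass


@dataclass
class Arg:
    lemma: str
    plural: bool = False


@dataclass
class Clause:
    subj: Arg
    verb: str
    obj: Arg


# Toy bilingual lemma dictionary (Italian -> English).
LEMMAS = {"mela": "apple", "Gianni": "Gianni", "piacere": "like"}

# Comparative rule: for these verbs, subject and object swap roles in English.
SWAP_ARGUMENTS = {"piacere"}


def transfer(src):
    """Map the Italian clause representation to an English one."""
    subj, obj = src.subj, src.obj
    if src.verb in SWAP_ARGUMENTS:
        subj, obj = obj, subj
    return Clause(Arg(LEMMAS[subj.lemma], subj.plural),
                  LEMMAS[src.verb],
                  Arg(LEMMAS[obj.lemma], obj.plural))


def synthesise(clause):
    """Generate an English sentence with very simple number agreement."""
    def noun(arg):
        return arg.lemma + "s" if arg.plural else arg.lemma
    verb = clause.verb if clause.subj.plural else clause.verb + "s"
    return f"{noun(clause.subj)} {verb} {noun(clause.obj)}"


# Assumed result of analysing "Le mele piacciono a Gianni".
SOURCE = Clause(Arg("mela", plural=True), "piacere", Arg("Gianni"))
```

Running `synthesise(transfer(SOURCE))` yields "Gianni likes apples". The swap of subject and object is exactly the kind of comparative rule that has to be written for each grammatical difference between the two languages.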

Page 27: Computer communication B

LK architectures 3

Advantages
  The output will always be grammatical
  It is theoretically a reversible system (from L1 to L2 and conversely from L2 to L1)

Page 28: Computer communication B

Interlingua
As the need for contrastive grammar decreases (thanks to a deeper level of parsing), what is called an INTERLINGUA arises

Interlingual systems are language independent
Interlingual representations concern meanings
An interlingua tries to explain how the world is made and how its elements are put together

[Diagram with the labels: Parser depth, Interlingua, Comparative grammar]

Page 29: Computer communication B

Interlingua

An interlingua is a meaning representation suitable for all languages
  John cannot go: obligatory(not(go(john)))

Each translation happens in two steps
  The source language is translated into the interlingua
  The interlingua is then translated into the output language

But: because each translation happens in two steps
  The process is longer
  Generalization to the worst case
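The two steps can be sketched with the slide's own representation, written as nested tuples. The tiny analyser and generator below handle only sentences of the shape "X cannot V" and are assumptions made for illustration, as are the two word lists.

```python
# Toy target lexicons: interlingua symbol -> word, plus the modal expression.
ENGLISH = {"john": "John", "go": "go", "cannot": "cannot"}
ITALIAN = {"john": "John", "go": "andare", "cannot": "non può"}


def analyse_english(sentence):
    """Step 1: English -> interlingua, for sentences like 'John cannot go'."""
    subject, modal, verb = sentence.split()
    if modal != "cannot":
        raise ValueError("this sketch only handles 'X cannot V'")
    # obligatory(not(verb(subject))), as nested tuples
    return ("obligatory", ("not", (verb.lower(), subject.lower())))


def generate(term, lexicon):
    """Step 2: interlingua -> target language, using only the target lexicon."""
    operator, (negation, (predicate, argument)) = term
    if (operator, negation) != ("obligatory", "not"):
        raise ValueError("this sketch only handles obligatory(not(...))")
    return f"{lexicon[argument]} {lexicon['cannot']} {lexicon[predicate]}"
```

Here `generate(analyse_english("John cannot go"), ITALIAN)` gives "John non può andare". Note that the generator never looks at the English sentence, only at the meaning representation.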

Page 31: Computer communication B

Interlingua

Think, by analogy, of individuals living in a series of tall closed towers, all erected over a common foundation. When they try to communicate with one another, they shout back and forth, each from his own closed tower. It is difficult to make the sound penetrate even the nearest towers, and communication proceeds very poorly indeed.

But, when an individual goes down his tower, he finds himself in a great open basement, common to all the towers. Here he establishes easy and useful communication with the persons who have also descended from their towers.

Thus it may be true that the way to translate from Chinese to Arabic, or from Russian to Portuguese, is not to attempt the direct route, shouting from tower to tower. Perhaps the way is to descend, from each language, down to the common base of human communication -the real but as yet undiscovered universal language- and then re-emerge by whatever particular route is convenient.

Warren Weaver

Page 32: Computer communication B

AT: other methods

Example-based methods
  Texts in parallel corpora are compared
  The process matches the input against stored example translations
  It relies on a corpus of bilingual translations, and the goal is to find the best-matching translation (using specific algorithms)

The challenge is to be able to draw conclusions about the rules of translation
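The matching step can be sketched with Python's standard `difflib.SequenceMatcher`, used here only as one possible similarity measure. The three stored example pairs and the function name `best_match` are invented for this illustration.

```python
from difflib import SequenceMatcher

# Toy store of example translations (source sentence, target sentence).
EXAMPLES = [
    ("il gatto dorme", "the cat sleeps"),
    ("il cane dorme", "the dog sleeps"),
    ("il gatto mangia il pesce", "the cat eats the fish"),
]


def best_match(source):
    """Return the stored example closest to source, with its similarity score."""
    def similarity(pair):
        # Compare word sequences rather than characters.
        return SequenceMatcher(None, source.split(), pair[0].split()).ratio()
    best = max(EXAMPLES, key=similarity)
    return best, similarity(best)
```

The closest stored example is only a starting point: a full example-based system would still have to adapt the retrieved translation to the parts of the input that did not match.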

Page 33: Computer communication B

AT: other methods

Statistical methods
  The words are translated, giving a literal translation in the target language
  This translation is then edited to produce a good expression in the target language
  They are based on probabilistic statistics
Some problems
  Not every well-formed sentence in the target language is a good translation
  Case ambiguities
    Den Vorschlag lehnt die Kommission ab
    1) The proposal rejects the commission
    2) The commission rejects the proposal
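One common way to combine probabilities, not named on the slide, is to score each candidate by how plausible it is as a target-language sentence multiplied by how well it matches the source. The sketch below applies this to the two candidates above. All numbers are invented toy values for illustration only, not measured probabilities, and the names `lm` and `tm` are assumptions.

```python
# Invented toy scores, for illustration only:
#   lm = plausibility of the English sentence on its own
#   tm = how well the English words match the German source words
CANDIDATES = {
    "The proposal rejects the commission": {"lm": 0.001, "tm": 0.5},
    "The commission rejects the proposal": {"lm": 0.02, "tm": 0.4},
}


def best_translation(candidates):
    """Pick the candidate with the highest product lm * tm."""
    return max(candidates,
               key=lambda sentence: candidates[sentence]["lm"] * candidates[sentence]["tm"])
```

With these toy values the second candidate wins, because a commission rejecting a proposal is far more plausible as an English sentence even though the first candidate follows the German word order more closely.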

Page 35: Computer communication B

AT: some problems

Idiomatic expressions (very difficult)
  “Non piangere sul latte versato”
  Literally: Do not cry on the poured (spilled) milk
  Meaning: It is useless to cry over damage that is already done

http://babelfish.altavista.com/tr
http://www.systran.co.uk/
http://www.google.com/translate_t

Lexical and morphological mistakes

Page 36: Computer communication B

AT: some problems

Semantic ambiguities
  Il capo ascolta la musica
  The boss listens to the music

Morphology
  Der Urinstinkt ist noch immer vorhanden
  The primitive instinct is still present

Complicated constructions
  Sentences that are too long, with too many subordinate clauses
  “Se avessi saputo che sarebbe andata a casa l'avrei immediatamente fermata”
  “If I would have known that she would have gone home, I would have immediately stopped her”