Luísa Coheur - PT-STAR Project
DESCRIPTION
Presentation by Dr. Luísa Coheur at the I Conferência Internacional de Tradução e Tecnologia (First International Conference on Translation and Technology), 13 and 14 May, Faculdade de Letras do Porto.
TRANSCRIPT
Speech-to-Speech Machine Translation in the PT-STAR Project
Luísa Coheur (L2F/INESC-ID)
2
INESC-ID and L2F
3
INESC-ID
Brief history: Established January 2000 (owned by IST and INESC)
Private not-for-profit research institute of public interest
Associated Laboratory since December 2004
Facilities: Alameda, Tagus Park
4
The Spoken Language Systems Lab
History: Work on speech processing for Portuguese since the 90s
Creation: 2001
Mission: Creating technology to bridge the gap between natural spoken language and the underlying semantic information.
Interdisciplinary background: signal processing, natural language processing, linguistics, etc.
5
Core Technologies
Speech processing
– Text-to-speech synthesis
  – Automatic process for building new voices
  – Limited domain synthesis
  – Expressive speech synthesis
  – Audio-visual synthesis
– Automatic speech recognition
  – Robust speech recognition
  – Speaker adaptation
  – Large vocabulary continuous recognition
  – Rich transcription of spontaneous speech
– Speech coding
– Speech enhancement
– Speaker and language identification
Text processing
– Morphological analysis
– Syntactic analysis
– Semantic analysis
– Discourse analysis
– NL generation
– Named entity extraction
– Information retrieval
– Summarization
– Question answering
– Machine translation
Spoken language processing
– Speech understanding
– Spoken dialog systems
– Speech-to-speech machine translation
– Summarization of spoken documents
– Question answering on spoken documents
– Classification of multimedia documents
– Language tutoring
– etc.
6
Statistical Machine Translation
7
Statistical Machine Translation
Automatic translators aim to maximize two properties:
Faithfulness (fidelity): how close the meaning of the translation is to the meaning of the original
Fluency (naturalness): how natural the translation reads, considering only the target language
This formulation was developed by researchers from IBM
$\hat{T} = \arg\max_{T} \; \mathrm{fluency}(T) \cdot \mathrm{faithfulness}(T, S)$
8
Statistical Machine Translation
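$\hat{T} = \arg\max_{T} \; \mathrm{fluency}(T) \cdot \mathrm{faithfulness}(T, S)$
fluency(T): Language Model
faithfulness(T, S): Translation Model
Source sentence: Estou cansado ("I am tired")
Candidate         Fluency   Fidelity
I'm exhausted     5         3
Tired me          2         5
I love cookies    5         0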
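The scoring rule on this slide can be tried directly on the three candidates in the table. The sketch below is only an illustration: `Candidate` and `best_translation` are names introduced here, and the 0 to 5 scores are the hand-assigned values from the slide, not real model probabilities.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    text: str
    fluency: int   # language model score (0-5 scale taken from the slide)
    fidelity: int  # translation model score (0-5 scale taken from the slide)


# Candidate translations of "Estou cansado", with the scores shown on the slide.
CANDIDATES = [
    Candidate("I'm exhausted", fluency=5, fidelity=3),
    Candidate("Tired me", fluency=2, fidelity=5),
    Candidate("I love cookies", fluency=5, fidelity=0),
]


def best_translation(candidates):
    """Return the candidate that maximizes fluency * fidelity."""
    return max(candidates, key=lambda c: c.fluency * c.fidelity)


if __name__ == "__main__":
    for c in CANDIDATES:
        print(f"{c.text!r}: {c.fluency * c.fidelity}")
    print("best:", best_translation(CANDIDATES).text)
```

The products are 15, 10 and 0, so "I'm exhausted" is selected: a fluent but unfaithful sentence ("I love cookies") and a faithful but disfluent one ("Tired me") both lose to a candidate that is reasonable on both axes.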
9
Language model: fluency
Which sentence is the most fluent? This becomes: "which is the most probable?"
We can use language models built from n-grams, for example
Advantage: this is monolingual knowledge!
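To make "the most fluent sentence is the most probable one" concrete, here is a minimal bigram language model with add-one smoothing. It is a sketch under stated assumptions: the four-sentence training corpus is invented for illustration, and the function names are not taken from the talk.

```python
import math
from collections import Counter

# Tiny monolingual corpus, invented for this illustration.
TOY_CORPUS = [
    "I am tired",
    "I am exhausted",
    "you are tired",
    "I am very tired",
]


def train_bigram_lm(sentences):
    """Count bigrams and their left contexts, with sentence boundary markers."""
    contexts, bigrams, vocab = Counter(), Counter(), set()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        contexts.update(tokens[:-1])
        bigrams.update(zip(tokens, tokens[1:]))
        vocab.update(tokens[1:])
    # +1 reserves probability mass for one unknown-word type.
    return contexts, bigrams, len(vocab) + 1


def log_prob(sentence, model):
    """Add-one smoothed log-probability of a sentence under the bigram model."""
    contexts, bigrams, vocab_size = model
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    return sum(
        math.log((bigrams[(prev, word)] + 1) / (contexts[prev] + vocab_size))
        for prev, word in zip(tokens, tokens[1:])
    )


if __name__ == "__main__":
    lm = train_bigram_lm(TOY_CORPUS)
    for candidate in ("I am tired", "tired am I"):
        print(candidate, round(log_prob(candidate, lm), 3))
```

Both candidates contain the same words, but only the first uses word sequences seen in the corpus, so it receives the higher score. No translated text was needed to train the model, which is the "monolingual knowledge" advantage noted on the slide.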
10
Translation model: fidelity
Which sentence is the most faithful? Here we must observe how sentences in the source language are translated into the target language.
Problem: this requires parallel corpora
European Parliament
TED Talks
…
11
Centauri/Arcturan [Knight 97]
1a. ok-voon ororok sprok . 1b. at-voon bichat dat .
2a. ok-drubel ok-voon anok plok sprok . 2b. at-drubel at-voon pippat rrat dat .
3a. erok sprok izok hihok ghirok . 3b. totat dat arrat vat hilat .
4a. ok-voon anok drok brok jok . 4b. at-voon krat pippat sat lat .
5a. wiwok farok izok stok . 5b. totat jjat quat cat .
6a. lalok sprok izok jok stok . 6b. wat dat krat quat cat .
7a. lalok farok ororok lalok sprok izok enemok . 7b. wat jjat bichat wat dat vat eneat .
8a. lalok brok anok plok nok . 8b. iat lat pippat rrat nnat .
9a. wiwok nok izok kantok ok-yurp . 9b. totat nnat quat oloat at-yurp .
10a. lalok mok nok yorok ghirok clok . 10b. wat nnat gat mat bat hilat .
11a. lalok nok crrrok hihok yorok zanzanok . 11b. wat nnat arrat mat zanzanat .
12a. lalok rarok nok izok hihok mok . 12b. wat nnat forat arrat vat gat .
Translate this to Arcturan: farok crrrok hihok yorok clok kantok ok-yurp
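One way to approach this puzzle automatically is to estimate word translation probabilities from the twelve sentence pairs with expectation-maximization. The sketch below uses IBM Model 1 as a stand-in technique; the talk does not say which model was demonstrated, and the function names are introduced here. Sentence-final periods are dropped, and no NULL word is modelled.

```python
from collections import defaultdict

# The twelve Centauri/Arcturan sentence pairs from the slide (periods removed).
CORPUS = [
    ("ok-voon ororok sprok", "at-voon bichat dat"),
    ("ok-drubel ok-voon anok plok sprok", "at-drubel at-voon pippat rrat dat"),
    ("erok sprok izok hihok ghirok", "totat dat arrat vat hilat"),
    ("ok-voon anok drok brok jok", "at-voon krat pippat sat lat"),
    ("wiwok farok izok stok", "totat jjat quat cat"),
    ("lalok sprok izok jok stok", "wat dat krat quat cat"),
    ("lalok farok ororok lalok sprok izok enemok", "wat jjat bichat wat dat vat eneat"),
    ("lalok brok anok plok nok", "iat lat pippat rrat nnat"),
    ("wiwok nok izok kantok ok-yurp", "totat nnat quat oloat at-yurp"),
    ("lalok mok nok yorok ghirok clok", "wat nnat gat mat bat hilat"),
    ("lalok nok crrrok hihok yorok zanzanok", "wat nnat arrat mat zanzanat"),
    ("lalok rarok nok izok hihok mok", "wat nnat forat arrat vat gat"),
]


def train_ibm1(corpus, iterations=10):
    """Estimate P(target word | source word) with IBM Model 1 EM."""
    pairs = [(src.split(), tgt.split()) for src, tgt in corpus]
    tgt_vocab = {w for _, tgt in pairs for w in tgt}
    uniform = 1.0 / len(tgt_vocab)
    t_prob = defaultdict(lambda: uniform)  # keyed by (target word, source word)
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        # E-step: share each target word among the source words of its sentence.
        for src, tgt in pairs:
            for tw in tgt:
                norm = sum(t_prob[(tw, sw)] for sw in src)
                for sw in src:
                    frac = t_prob[(tw, sw)] / norm
                    count[(tw, sw)] += frac
                    total[sw] += frac
        # M-step: renormalize the expected counts per source word.
        t_prob = defaultdict(
            float, {(tw, sw): c / total[sw] for (tw, sw), c in count.items()}
        )
    return t_prob


def best_translation(t_prob, src_word):
    """Most probable target word for src_word, or None if it was never seen."""
    options = {tw: p for (tw, sw), p in t_prob.items() if sw == src_word}
    return max(options, key=options.get) if options else None


if __name__ == "__main__":
    model = train_ibm1(CORPUS)
    for word in "farok crrrok hihok yorok clok kantok ok-yurp".split():
        print(word, "->", best_translation(model, word))
```

The output is a word-by-word gloss only. Model 1 ignores word order and cannot decide that a source word should produce nothing, so words that occur only once, in the same sentence pair, remain ambiguous; the full puzzle also needs a language model over Arcturan to order and filter the result.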
14
Spanish/English corpus
1a. Garcia and associates . 1b. Garcia y asociados .
2a. Carlos Garcia has three associates . 2b. Carlos Garcia tiene tres asociados .
3a. his associates are not strong . 3b. sus asociados no son fuertes .
4a. Garcia has a company also . 4b. Garcia tambien tiene una empresa .
5a. its clients are angry . 5b. sus clientes estan enfadados .
6a. the associates are also angry . 6b. los asociados tambien estan enfadados .
7a. the clients and the associates are enemies . 7b. los clientes y los asociados son enemigos .
8a. the company has three groups . 8b. la empresa tiene tres grupos .
9a. its groups are in Europe . 9b. sus grupos estan en Europa .
10a. the modern groups sell strong pharmaceuticals . 10b. los grupos modernos venden medicinas fuertes .
11a. the groups do not sell zanzanine . 11b. los grupos no venden zanzanina .
12a. the small groups are not modern . 12b. los grupos pequenos no son modernos .
15
Speech to Speech Machine Translation
16
Speech to speech machine translation
Speech-to-Speech Machine Translation (S2SMT) technologies aim to enable natural language communication between people who do not share the same language.
17
Speech to speech machine translation
S2SMT can be seen as a cascade of three major components: Automatic Speech Recognition (ASR)
Machine Translation (MT)
Text-to-Speech Synthesis (TTS)
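The cascade can be pictured as plain function composition. The sketch below is hypothetical: the type aliases, `make_s2smt`, and the three toy components are placeholders invented here and do not correspond to any PT-STAR interface.

```python
from typing import Callable

Recognizer = Callable[[bytes], str]   # audio -> source-language text (ASR)
Translator = Callable[[str], str]     # source text -> target text (MT)
Synthesizer = Callable[[str], bytes]  # target text -> audio (TTS)


def make_s2smt(asr: Recognizer, mt: Translator, tts: Synthesizer) -> Callable[[bytes], bytes]:
    """Chain the three components into one speech-to-speech function."""
    def pipeline(audio: bytes) -> bytes:
        transcript = asr(audio)
        translation = mt(transcript)
        return tts(translation)
    return pipeline


# Toy stand-ins: "audio" is just UTF-8 text, so the flow can be run end to end.
def toy_asr(audio: bytes) -> str:
    return audio.decode("utf-8")


def toy_mt(text: str) -> str:
    return {"estou cansado": "I am tired"}.get(text, text)


def toy_tts(text: str) -> bytes:
    return text.encode("utf-8")


s2s = make_s2smt(toy_asr, toy_mt, toy_tts)

if __name__ == "__main__":
    print(s2s(b"estou cansado"))
```

Each stage sees only the single best output of the previous one, so recognition errors propagate into translation and translation errors into synthesis.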
18
Speech to speech machine translation
19
The PT-STAR project
20
The PT-STAR project
Team: L2F/INESC-ID
LTI/CMU
UBI
FLUL
21
The PT-STAR project
One of the main problems of S2SMT is the still weak integration between the three components.
The main goal of PT-STAR (Speech Translation Advanced Research to and from Portuguese) is to improve speech translation systems for Portuguese by strengthening this integration.
22
Task 1: ASR/MT
TASK 1
23
Task 1: ASR/MT
Challenge: Improve full stop and comma insertion
Segmentation is a hard problem in automatic translation
Improve capitalization
Important for disambiguation (e.g. without capitalization, the name Pedro Passos Coelho can be translated word for word as "Pedro Steps Rabbit")
Detect interrogatives
Important if you target synthesis
Port everything to English
Try to make everything as language-independent as possible
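As an illustration of the capitalization step, the sketch below restores case with a simple frequency-based truecaser: each lowercased word is mapped to the surface form it most often takes in cased training text. This is a common baseline, not necessarily the method used in the project, and the training sentences are invented here.

```python
from collections import Counter, defaultdict

# Small cased training sample, invented for this illustration.
CASED_TEXT = [
    "O ministro das Finanças visitou Portugal",
    "As medidas de Portugal e Espanha",
    "O governo de Portugal",
]


def train_truecaser(cased_sentences):
    """Map each lowercased word to its most frequent cased form."""
    forms = defaultdict(Counter)
    for sentence in cased_sentences:
        # Skip the first token: its capital letter comes from sentence position.
        for token in sentence.split()[1:]:
            forms[token.lower()][token] += 1
    return {lower: counts.most_common(1)[0][0] for lower, counts in forms.items()}


def truecase(text, model):
    """Apply the learned casing to lowercase ASR output; unknown words stay lowercase."""
    return " ".join(model.get(token, token) for token in text.lower().split())


if __name__ == "__main__":
    model = train_truecaser(CASED_TEXT)
    print(truecase("o governo de portugal e espanha", model))
```

Because the approach relies only on counts from cased text, it carries over to another language by swapping the training data, which fits the goal of keeping the modules language-independent.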
24
Rich transcriptions
boa tarde o governo considera que as medidas de austeridade aprovadas e em vigor só para já adequadas às necessidades financeiras de portugal o ministro das finanças mostra-se confiante com as metas traçadas no programa de estabilidade e crescimento apesar de não fechar as portas à hipótese de medidas adicionais de controlo orçamental em dois mil e doze é desta forma que teixeira dos santos responde a pressão dos países da moeda única querem que portugal e espanha avança com mais medidas de austeridade dentro de ano e meio ainda em mês passou diz que o governo decidiu apertar o cinto aos portugueses e já europa vem pedir mais para depois de dois mil e onze o ministro das finanças não fecha a porta, mas defende cada ano a seu tempo acho que estamos de em condições de alimentar digamos confessa estar confiantes de que o objectivo para dois mil e dez vai ser conseguido com as medidas adicionais que foram entretanto já decididas
25
Rich transcriptions
[anchor 150] Boa tarde o governo considera que as medidas de austeridade aprovadas e em vigor. Só para já adequadas às necessidades financeiras de Portugal. O ministro das Finanças mostra-se confiante com as metas traçadas no programa de Estabilidade e Crescimento. Apesar de não fechar as portas à hipótese de medidas adicionais de controlo orçamental, em dois mil e doze. É desta forma que Teixeira dos Santos responde a pressão dos países da moeda única, querem que Portugal e Espanha avança com mais medidas de austeridade, dentro de ano e meio.
[spk 2000] Ainda em mês passou diz que o Governo decidiu apertar o cinto aos portugueses e já Europa vem pedir mais para depois de dois mil e onze. O ministro das Finanças não fecha a porta, mas defende cada ano, a seu tempo.
[spk 1000] Acho que estamos de em condições de alimentar, digamos confessa estar confiantes, de que o objectivo para dois mil e dez, vai ser conseguido com as medidas adicionais que foram entretanto já decididas.
Tópicos: Política; Economia; Nacional;
27
Translation
[anchor 150] Good afternoon, the government believes that the austerity measures approved and in force. Only for already suited to financial needs of Portugal. The finance minister seems confident with the targets set out in the stability and growth programme. Despite not close the door to the possibility of additional measures of budgetery control in two thousand, twelve. This is the way that Teixeira dos Santos responds the pressure of the countries of the single currency, they want Spain and Portugal progresses with more austerity measures, within a year and a half.
[spk 2000] Still in month passed says that the government has decided to tighten their belts the Portuguese and already Europe comes to ask for more for after two thousand and eleven. The finance minister is not closes the door, but defends each year, the his time.
[spk 1000] I think that we are in conditions of food, say admits be trusted, that the objective for two thousand, ten, will be achieved with the additional measures that were in the meantime, has already decided.
Topic: Politics; Economy; National;
28
Task 1: ASR/MT
Challenge: Take advantage of in-domain texts to build domain-adapted language models for ASR and MT
Domain adaptation is one of the major problems in SMT (if a word is not seen during training, the system will not be able to translate it)
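One common way to adapt a language model to a domain is to interpolate an in-domain model with a general one. The sketch below does this for unigram models; it is an assumed illustration rather than the project's method, and the two one-sentence corpora and the weight `lam` are invented.

```python
from collections import Counter


def unigram_model(text):
    """Maximum-likelihood unigram probabilities for a whitespace-tokenized text."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def interpolate(in_domain, out_of_domain, lam=0.5):
    """Linear interpolation: lam * P_in(w) + (1 - lam) * P_out(w)."""
    vocab = set(in_domain) | set(out_of_domain)
    return {
        word: lam * in_domain.get(word, 0.0) + (1 - lam) * out_of_domain.get(word, 0.0)
        for word in vocab
    }


if __name__ == "__main__":
    general = unigram_model("the government approved the measures")
    domain = unigram_model("the flight departs from the gate")
    adapted = interpolate(domain, general, lam=0.7)
    for word in ("the", "flight", "government"):
        print(word, round(adapted[word], 3))
```

Words that the general model has never seen, such as "flight" here, receive non-zero probability in the adapted model, while general vocabulary is still covered.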
29
Task 1: ASR/MT
Challenge: Take advantage of imperfect transcriptions (in which annotations do not include laughter, applause, filled pauses, repetitions, or other disfluencies, and sometimes contain errors) to build acoustic models for ASR
Example:
… In my opinion the many options to solve the...
… In my opinion ++BREATH++ the ++UH++ many options to solve the...
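The difference between the two example lines can be located automatically by aligning their token sequences. The sketch below uses Python's standard `difflib.SequenceMatcher` for the alignment; the function name is introduced here, and this is only an illustration of the idea, not the project's alignment procedure.

```python
from difflib import SequenceMatcher

CLEAN = "In my opinion the many options to solve the"
VERBATIM = "In my opinion ++BREATH++ the ++UH++ many options to solve the"


def find_unannotated_events(clean, verbatim):
    """Return (position in clean transcript, inserted tokens) for each gap."""
    a, b = clean.split(), verbatim.split()
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    events = []
    for tag, i1, _i2, j1, j2 in matcher.get_opcodes():
        if tag == "insert":  # tokens present only in the verbatim version
            events.append((i1, b[j1:j2]))
    return events


if __name__ == "__main__":
    for position, tokens in find_unannotated_events(CLEAN, VERBATIM):
        print(f"before clean token {position}: {' '.join(tokens)}")
```

On the slide's example this reports `++BREATH++` before token 3 ("the") and `++UH++` before token 4 ("many"), which are exactly the events missing from the imperfect transcription.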
30
Task 2: MT/TTS
TASK 2
31
Task 2: MT/TTS
Challenges: Build statistical parametric synthetic voices for Portuguese
How to deal with translation errors when you target synthesis?
Techniques for optimal synchronization using the MT N-best list
Grammar-based phrasing strategies to improve synthesis of disfluent MT output
Voice morphing
Cross-lingual voice morphing to match the source speaker
32
Task 3: MT
TASK 3
33
Task 3: MT
Challenges: Alignments
New algorithms to generate the well-known lexicalized reordering model using weighted alignment matrices
Geppetto: a toolkit for word alignments and phrase extraction
Users can improve the phrase extraction algorithm, because key control points can be manipulated
Available at Google Code
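For readers unfamiliar with phrase extraction, the sketch below shows the standard consistency criterion used in phrase-based SMT: a source span and a target span form a phrase pair only if no word inside either span is aligned to a word outside the other. This is not Geppetto's API or algorithm, it works on a plain alignment rather than the weighted alignment matrices mentioned above, and the alignment points are hand-made for sentence pair 4 of the Spanish/English corpus.

```python
# English/Spanish sentence pair 4 from the corpus slide.
SRC = "Garcia has a company also".split()
TGT = "Garcia tambien tiene una empresa".split()

# Hand-made word alignment (an assumption): (source index, target index).
ALIGNMENT = {(0, 0), (1, 2), (2, 3), (3, 4), (4, 1)}


def extract_phrases(src, tgt, alignment, max_len=7):
    """Extract all phrase pairs consistent with the word alignment."""
    pairs = set()
    for i1 in range(len(src)):
        for i2 in range(i1, min(i1 + max_len, len(src))):
            # Target positions linked to the source span [i1, i2].
            linked = [j for (i, j) in alignment if i1 <= i <= i2]
            if not linked:
                continue
            j1, j2 = min(linked), max(linked)
            if j2 - j1 >= max_len:
                continue
            # Reject if a target word in [j1, j2] is aligned outside the source span.
            if any(j1 <= j <= j2 and not (i1 <= i <= i2) for (i, j) in alignment):
                continue
            pairs.add((" ".join(src[i1:i2 + 1]), " ".join(tgt[j1:j2 + 1])))
    return pairs


if __name__ == "__main__":
    for source_phrase, target_phrase in sorted(extract_phrases(SRC, TGT, ALIGNMENT)):
        print(f"{source_phrase} ||| {target_phrase}")
```

The pair "has a company ||| tiene una empresa" is extracted, but no phrase starting with "Garcia has" is, because the target span it would need also covers "tambien", which is aligned to "also". This simplified version does not extend phrases over unaligned words.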
34
Task 3: MT
Challenges: Error analysis
Taxonomy and detailed analysis of Moses vs. Google
From BP (Brazilian Portuguese) to EP (European Portuguese)
Built the BP2EP translator
Corpora:
TAP-UP corpus
Flight magazine with parallel corpora PT/EN
6000 questions translated into PT
Original corpus in EN, from TREC
Translation model adapted with the questions corpus
Important BLEU improvements (EN/PT 9, PT/EN 8)
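BLEU, the metric quoted above, compares n-grams of a system translation against a reference. The sketch below is a simplified sentence-level version with a single reference and no smoothing, written only to show the ingredients (clipped n-gram precision and brevity penalty); published scores are normally computed over a whole test set.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(candidate, reference, max_n=4):
    """Simplified BLEU: geometric mean of clipped precisions times brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams, ref_ngrams = ngrams(cand, n), ngrams(ref, n)
        total = sum(cand_ngrams.values())
        # Clip each n-gram count by its count in the reference.
        clipped = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
        if total == 0 or clipped == 0:
            return 0.0  # no smoothing: any empty precision gives zero
        log_precisions.append(math.log(clipped / total))
    # Penalize candidates that are shorter than the reference.
    brevity = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return brevity * math.exp(sum(log_precisions) / max_n)


if __name__ == "__main__":
    reference = "the finance minister does not close the door"
    print(sentence_bleu("the finance minister does not close the door", reference))
    print(sentence_bleu("the finance minister is not closes the door", reference))
```

A translation identical to the reference scores 1.0, and one with no matching bigram scores 0.0. Scores are often reported multiplied by 100.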
35
Task 3: MT
Challenges: Participated in the IWSLT 2010 evaluation campaign
CN-EN, EN-CN
FR-EN
36
Task 4: Proof of concept
TASK 4
37
Proof-of-concept
Prototype development (pt, en, cn): Broadcast news (S2T)
TED Talks (S2S)
Real-time demo (S2S)
38
Demo and references
Video demonstration:
https://www.l2f.inesc-id.pt/demos/pt-star/Demo_S2S.mov
Media coverage:
Report on SIC Notícias
Article in "Ciência Hoje"
Report in Sábado magazine