On Feature Selection in Maximum Entropy Approach to Statistical Concept-based Speech-to-Speech Translation
Authored by: Liang Gu and Yuqing Gao
Presented by: Ruhi Sarikaya
October 1st, 2004


Page 1

On Feature Selection in Maximum Entropy Approach to Statistical Concept-based Speech-to-Speech Translation

Authored by: Liang Gu and Yuqing Gao
Presented by: Ruhi Sarikaya
October 1st, 2004

Page 2

Outline

• Statistical Concept-based Speech Translation
• Natural Concept Generation (NCG)
• Feature Selection in Statistical NCG
  • Conciseness vs. Informativity of Concepts
  • Features using both Concept & Word Information
  • Multiple Feature Selection
• Experimental Results

“On Feature Selection in Maximum Entropy Approach to Statistical Concept-based Speech-to-Speech Translation,” IWSLT, Kyoto, Japan, 2004

Page 3

Overview of IBM MASTOR System

[System diagram] Speech Input → Speech Recognizer → NLU Parser / Concept Extractor → Phrase/Word Translator → Statistical NLG → Speech Synthesizer → Speech Output
Supporting resources: Translation Lexicon, ME Models, Annotated Corpus


Page 4

Spoken Language Translation via Concepts

[System diagram] Speech Input → Speech Recognizer → Natural Language Understanding → Natural Concept Generation → Natural Word Generation → Speech Synthesizer → Speech Output
Natural Concept Generation and Natural Word Generation together constitute Natural Language Generation; the pipeline implements Statistical Concept-based Machine Translation


Page 5

Spoken Language Translation via Concepts

Input: "Is he bleeding anywhere else besides his abdomen"

Natural Language Understanding parses the English sentence into a semantic tree:
!S! → QUERY SUBJECT PLACE WELLNESS, with SUBJECT → PRON, PLACE → PLACE PREPPH BODY-PART, and WELLNESS → "bleeding"

Natural Language Generation (Natural Concept Generation followed by Natural Word Generation) produces the corresponding Chinese tree !S! → QUERY SUBJECT PLACE WELLNESS, with PLACE → PLACE PREPPH BODY-PART, and the output:
除了 他的腹部 在其他任何地方 他 流血 吗 ("besides his abdomen, anywhere else, he, bleeding, [question particle]")


Page 6

Concept-based Speech Translation

Concepts
• Language-independent representation of intended meanings
• Parsed from the source language
• Organized in a language-dependent tree structure
• Comparable to an interlingua

Merits
• More flexible meaning preservation
• Wider sentence coverage
• Easier portability between different domains

Challenges
• Design and selection of concepts
• Generation of concepts


Page 7

Natural Concept Generation (NCG)

Purpose
• Correct set of concepts in the target language
• Appropriate order of concepts in the target language

Approaches
• Statistical model-based generation
• Trained with Maximum-Entropy models

Challenges
• Design of the generation procedure
• Generation of concept sequences
• Transformation of the semantic parse tree
• Selection of features


Page 8

Statistical NCG on Sequence Level

English concept sequence: QUERY PRON WEAPON POSSESS
Chinese concept sequence: the same concepts, reordered for the target language


Page 9

Statistical NCG on Sequence Level

Source C: C1 C2 … CM
Target S: S1 S2 … Sn → Sn+1

$$
p\left(s_{n+1}\mid c_m, s_n, s_{n-1}\right)=\frac{\prod_k \alpha_k^{\,g\left(\vec f_k,\,s_{n+1},\,c_m,\,s_n,\,s_{n-1}\right)}}{\sum_{s\in V}\prod_k \alpha_k^{\,g\left(\vec f_k,\,s,\,c_m,\,s_n,\,s_{n-1}\right)}}
$$

where
• $\vec f_k$: feature of concepts
• $g$: binary test function
• $\alpha_k$: probability weight corresponding to feature $\vec f_k$

$$
g\left(\vec f_k, s_{n+1}, c_m, s_n, s_{n-1}\right)=\begin{cases}1 & \text{if } \vec f_k=\left(s_{n+1}, c_m, s_n, s_{n-1}\right)\\ 0 & \text{otherwise}\end{cases}
$$

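As a numeric illustration of the model above, here is a minimal sketch of the ME probability; the vocabulary, feature list, and weights are invented for the example and are not from the paper.

```python
from math import prod

def me_prob(s_next, c_m, s_n, s_prev, V, features, alpha):
    """p(s_next | c_m, s_n, s_prev) for a maximum-entropy model.

    features[k] is a tuple f_k = (s', c, s, s_prev); alpha[k] is its weight.
    The binary test g fires (exponent 1) iff f_k matches the full context,
    so each matching feature contributes a factor alpha_k to the score.
    """
    def score(s):
        return prod(a for f, a in zip(features, alpha)
                    if f == (s, c_m, s_n, s_prev))
    z = sum(score(s) for s in V)  # normalize over the target vocabulary V
    return score(s_next) / z
```

With a single feature ("QUERY", "c1", "START", "START") of weight 2.0 and V = {QUERY, PRON}, the model assigns QUERY probability 2/3 and the distribution sums to 1.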

Page 10

Statistical NCG on Sequence Level (Cont.)

Generation: select the concept candidate with the highest probability:

$$
\hat s_{n+1}=\arg\max_{s_{n+1}\in V}\ \prod_{m=1}^{M} p\left(s_{n+1}\mid c_m, s_n, s_{n-1}\right)
$$

with $s_0 = s_{-1} = \text{START}$.

Source C: C1 C2 … CM
Target S: S1 S2 … Sn+1 … SN

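The generation rule can be sketched as a greedy loop over target positions; `prob` stands in for the trained ME model and is a hypothetical callable, not the paper's implementation.

```python
from math import prod

def generate_concepts(source_concepts, V, prob, length):
    """Greedy NCG: at each target position pick the concept s in V that
    maximizes the product over source concepts c_m of p(s | c_m, s_n, s_{n-1}),
    starting from s_0 = s_{-1} = START."""
    s_prev, s_n = "START", "START"
    out = []
    for _ in range(length):
        best = max(V, key=lambda s: prod(prob(s, c, s_n, s_prev)
                                         for c in source_concepts))
        out.append(best)
        s_prev, s_n = s_n, best  # shift the concept history window
    return out
```

A beam or full Viterbi search over V would also realize the argmax; greedy decoding just keeps the sketch short.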

Page 11

Model Training by Maximizing Entropy

$$
p\left(s_{n+1}\mid c_m, s_n, s_{n-1}\right)=\frac{\prod_k \alpha_k^{\,g\left(\vec f_k,\,s_{n+1},\,c_m,\,s_n,\,s_{n-1}\right)}}{\sum_{s\in V}\prod_k \alpha_k^{\,g\left(\vec f_k,\,s,\,c_m,\,s_n,\,s_{n-1}\right)}}
$$

$$
\hat\alpha=\arg\max_{\alpha}\ \sum_{l=1}^{L}\sum_{s\in q_l}\sum_{m}\log p\left(s_{n+1}\mid c_m, s_n, s_{n-1}\right)
$$

$Q=\{q_l,\ 1\le l\le L\}$: total set of concept sequences

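The weight optimization can be illustrated with plain gradient ascent on the log-likelihood in the log domain (λ_k = log α_k). ME models are usually trained with dedicated procedures such as GIS/IIS, so this is only a toy sketch with an invented data shape.

```python
import math

def train_alpha(data, V, features, iters=200, lr=0.5):
    """Fit alpha_k by gradient ascent on sum of log p(s_obs | context).

    data: list of (s_obs, context) events, context = (c_m, s_n, s_prev).
    The gradient of each lam_k is the observed count minus the expected
    count of feature k under the current model.
    """
    lam = [0.0] * len(features)
    for _ in range(iters):
        grad = [0.0] * len(lam)
        for s_obs, ctx in data:
            scores = [math.exp(sum(l for f, l in zip(features, lam)
                                   if f == (s,) + ctx)) for s in V]
            z = sum(scores)
            for k, f in enumerate(features):
                grad[k] += 1.0 if f == (s_obs,) + ctx else 0.0
                grad[k] -= sum(sc / z for s, sc in zip(V, scores)
                               if f == (s,) + ctx)
        lam = [l + lr * g for l, g in zip(lam, grad)]
    return [math.exp(l) for l in lam]
```

With three "A" events and one "B" event in the same context and one feature firing on "A", the fitted weight converges to α ≈ 3, which puts probability 3/4 on "A" as the data demand.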

Page 12

Structural Concept Sequence Generation

1. Traverse the semantic parse tree in a bottom-up, left-to-right, breadth-first mode.
2. For each unprocessed concept unit in the parse tree, generate an optimal concept unit in the target language via the sequence-level procedure.
3. Repeat until all units in the source-language parse tree are processed.

Example: "leave the building quickly" → 赶快 离开 大楼
Source tree: !S! → ADV ACTION-VERB PLACE (swiftness, go-away, place)
Target tree: !S! → ACTION-VERB BE ADV (go-away, art, place, swiftness)

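The traversal step can be sketched as follows; the tuple-based tree encoding is my own, not the paper's data structure.

```python
from collections import deque

def bottom_up_units(root):
    """List the concept units (a node's label plus its children's labels)
    of a parse tree in bottom-up, left-to-right, breadth-first order:
    deepest level first, left to right within each level.

    A node is (label, children); a leaf has an empty child list."""
    levels = []
    queue = deque([(root, 0)])
    while queue:
        (label, children), depth = queue.popleft()
        if children:  # only internal nodes form a concept unit to translate
            while len(levels) <= depth:
                levels.append([])
            levels[depth].append((label, [child[0] for child in children]))
        queue.extend((child, depth + 1) for child in children)
    return [unit for level in reversed(levels) for unit in level]
```

Each returned unit would then be fed to the sequence-level generation procedure until the whole source tree is consumed.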

Page 13

Feature Selection in Maximum-Entropy-based Statistical NCG

Source C: C1 C2 … CM
Target S: S1 S2 … Sn … SN

Baseline features:
$$\vec f_k^{\,(4)}=\left(s_{k+1},\, c_k,\, s_k,\, s_{k-1}\right)$$

Augmented features on parallel corpora:
$$\vec f_k^{\,(5)}=\left(s_{k+1},\, c_{k+1},\, c_k,\, s_k,\, s_{k-1}\right)$$

• New features derived from both the source language and the target language
• Trained on a parallel tree-bank
• Strengthen the link between source language and target language


Page 14

Conciseness vs. Informativity of Concepts

Conciseness
• Define the minimum number of distinct concepts
• Reduce the labor-intensive, time-consuming annotation process
• Improve NLU parsing

Informativity
• Define concepts that are as informative as possible
• Concept generation relies heavily on the information each concept provides
• Improve NCG


Page 15

Examples of NCG with too Concise Concepts

"What did you eat yesterday" (WHQ SUBJECT AUX ACTION TIME) → 你 昨天 吃了 什么 ("you yesterday ate what"), generated as SUBJECT TIME ACTION WHQ

"Where did you eat yesterday" (WHQ SUBJECT AUX ACTION TIME) → 你 昨天 吃 在哪里 ("you yesterday eat where"), generated as SUBJECT TIME ACTION WHQ

• Two input English sentences with the SAME set and order of concepts generate two DIFFERENT concept sequences
• The concept WHQ is so concise that it is not informative enough to discriminate the different generation behaviors of (WHQ, what) and (WHQ, where)
• Features $\vec f_k^{\,(4)}=\left(s_{k+1}, c_k, s_k, s_{k-1}\right)$ and $\vec f_k^{\,(5)}=\left(s_{k+1}, c_{k+1}, c_k, s_k, s_{k-1}\right)$ are not helpful here


Page 16

Using both Concept & Word Information

Concept information only:
"What did you eat yesterday" (WHQ SUBJECT AUX ACTION TIME) → 你 昨天 吃了 什么
"Where did you eat yesterday" (WHQ SUBJECT AUX ACTION TIME) → 你 昨天 吃 在哪里

Concept & word information:
"What did you eat yesterday" (WHQ-what SUBJECT AUX ACTION TIME) → 你 昨天 吃了 什么
"Where did you eat yesterday" (WHQ-where SUBJECT AUX ACTION TIME) → 你 昨天 在哪里 吃的


Page 17

Features using both Concept & Word Information

$$
p\left(s_{n+1}\mid c_{m+1}, c_m, w_{m+1}, w_m, s_n, s_{n-1}\right)=\frac{\prod_k \alpha_k^{\,g\left(\vec f_k^{\,(7)},\,s_{n+1},\,c_{m+1},\,c_m,\,w_{m+1},\,w_m,\,s_n,\,s_{n-1}\right)}}{\sum_{s\in V}\prod_k \alpha_k^{\,g\left(\vec f_k^{\,(7)},\,s,\,c_{m+1},\,c_m,\,w_{m+1},\,w_m,\,s_n,\,s_{n-1}\right)}}
$$

$$
\hat\alpha=\arg\max_{\alpha}\ \sum_{l=1}^{L}\sum_{s\in q_l}\sum_{m=1}^{M}\log p\left(s_{n+1}\mid c_{m+1}, c_m, w_{m+1}, w_m, s_n, s_{n-1}\right)
$$

Optimized on a parallel treebank: $Q=\{q_l=\left(u_l, v_l\right),\ 1\le l\le L\}$

Concept sequence: $C=\{c_1, c_2, \ldots, c_M\}$
Word sequence: $W=\{w_1, w_2, \ldots, w_M\}$

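The three feature templates can be made concrete with a small extraction helper. The index conventions are my reading of the formulas above, and padding the sequence boundary with START is an assumption.

```python
def extract_features(S, C, W, k):
    """Feature tuples at 0-based position k, for target concepts S,
    source concepts C, and source words W:
      f4 = (s_{k+1}, c_k, s_k, s_{k-1})                         baseline
      f5 = (s_{k+1}, c_{k+1}, c_k, s_k, s_{k-1})                parallel corpora
      f7 = (s_{k+1}, c_{k+1}, c_k, w_{k+1}, w_k, s_k, s_{k-1})  concept-word
    """
    s_prev = S[k - 1] if k >= 1 else "START"  # boundary padding (assumed)
    f4 = (S[k + 1], C[k], S[k], s_prev)
    f5 = (S[k + 1], C[k + 1], C[k], S[k], s_prev)
    f7 = (S[k + 1], C[k + 1], C[k], W[k + 1], W[k], S[k], s_prev)
    return f4, f5, f7
```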

Page 18

Multiple Feature Selection

Problem
• Sparser data because of higher-dimensional features

Strategy
• Additional sets of features in ME-based concept generation
• Multiple sets of features represent context information in both the source and the target language at different levels

Example
Feature A: $\vec f_k^{\,(5)}=\left(s_{k+1}, c_{k+1}, c_k, s_k, s_{k-1}\right)$
Feature B: $\vec f_k^{\,(7)}=\left(s_{k+1}, c_{k+1}, c_k, w_{k+1}, w_k, s_k, s_{k-1}\right)$

$$
\left(\hat\alpha,\hat\alpha'\right)=\arg\max_{\alpha,\alpha'}\ \sum_{l=1}^{L}\sum_{s\in q_l}\sum_{m=1}^{M}\left[\log\frac{\prod_k \alpha_k^{\,g\left(\vec f_k^{\,(5)},\,s_{n+1},\,c_{m+1},\,c_m,\,s_n,\,s_{n-1}\right)}}{\sum_{s\in V}\prod_k \alpha_k^{\,g\left(\vec f_k^{\,(5)},\,s,\,c_{m+1},\,c_m,\,s_n,\,s_{n-1}\right)}}+\log\frac{\prod_k {\alpha'_k}^{\,g\left(\vec f_k^{\,(7)},\,s_{n+1},\,c_{m+1},\,c_m,\,w_{m+1},\,w_m,\,s_n,\,s_{n-1}\right)}}{\sum_{s\in V}\prod_k {\alpha'_k}^{\,g\left(\vec f_k^{\,(7)},\,s,\,c_{m+1},\,c_m,\,w_{m+1},\,w_m,\,s_n,\,s_{n-1}\right)}}\right]
$$

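In the combined objective above, a candidate concept is scored by the sum of log-probabilities from two ME models, one per feature set. A minimal sketch, where the model callables are hypothetical stand-ins for the trained models:

```python
import math

def combined_log_prob(s_next, ctx5, ctx7, model_f5, model_f7):
    """Score of candidate s_next under multiple feature selection:
    log p_f5(s_next | ctx5) + log p_f7(s_next | ctx7), i.e. the product of
    the two ME models' probabilities, each trained on its own feature set."""
    return math.log(model_f5(s_next, ctx5)) + math.log(model_f7(s_next, ctx7))
```

Summing the two log terms per position mirrors the bracketed sum in the training criterion.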

Page 19

Experimental Setup

Language Pair: English – Chinese (Mandarin)
Domain: medical and force protection
Corpora: 10,000 annotated parallel sentences
Vocabulary size: 3,000
Size of Concept Set: 68
MT Method: statistical interlingua-based speech translation


Page 20

Experiments on ME-based Statistical NCG

Description
• Evaluate on primary concept sequences, i.e., the top-layer concepts in a semantic parse tree
• Concept sequences containing only one concept are removed, as they are trivial to generate
• A specific set of parallel concept sequences that contain the same set of concepts in both languages is used
• 5,600 concept sequences are selected: 80% for training and 20% for testing
• Training and test sets are randomly re-partitioned 100 times, and average error rates are recorded
• Worst-case test: sequences that appear in the training corpus are not allowed to appear in the test corpus
• Normal-case test: sequences that appear in the training corpus may appear in the test corpus


Page 21

ME-NCG Experiments with Forward Models

NCG Method | Training set (SER / CER) | Test set (SER / CER)
Baseline NCG with basic feature $\vec f_k^{\,(4)}$ | 14.0% / 8.8% | 28.0% / 18.9%
+ feature on parallel corpora $\vec f_k^{\,(5)}$ | 6.2% / 3.5% | 21.7% / 14.1%
+ concept-word features $\vec f_k^{\,(7)}$ | 0.7% / 0.4% | 20.2% / 13.1%
+ multiple feature selection $\left(\vec f_k^{\,(5)}, \vec f_k^{\,(7)}\right)+\vec f_k^{\,(4)}$ | 0.7% / 0.4% | 17.4% / 11.4%

• A concept sequence counts as an error in the sequence error rate (SER) if one or more errors occur anywhere in the sequence
• The concept error rate (CER), on the other hand, evaluates individual concept errors within sequences: substitutions, deletions, and insertions


Page 22

ME-NCG Experiments with Forward-Backward Models

NCG Method | Training set (SER / CER) | Test set (SER / CER)
Baseline NCG with basic feature $\vec f_k^{\,(4)}$ | 9.1% / 5.5% | 24.4% / 16.4%
+ feature on parallel corpora $\vec f_k^{\,(5)}$ | 5.7% / 3.2% | 17.8% / 11.6%
+ concept-word features $\vec f_k^{\,(7)}$ | 0.5% / 0.3% | 17.7% / 11.5%
+ multiple feature selection $\left(\vec f_k^{\,(5)}, \vec f_k^{\,(7)}\right)+\vec f_k^{\,(4)}$ | 0.5% / 0.3% | 15.8% / 10.4%


Page 23

Experiment on Statistical Interlingua-based S2S Translation

Translation Method | $\vec f_k^{\,(4)}$ | $\vec f_k^{\,(5)}$ | $\left(\vec f_k^{\,(5)}, \vec f_k^{\,(7)}\right)$
Text-to-Text | 0.536 | 0.578 | 0.605
Speech-to-Text | 0.437 | 0.469 | 0.489

The BLEU metric (proposed by Papineni et al.) measures MT performance by evaluating n-gram accuracy with a brevity penalty:

$$
\mathrm{Bleu}=BP\cdot\exp\left(\sum_{n=1}^{N} w_n \log p_n\right)
$$

$$
BP=\begin{cases}1 & \text{if } c>r\\ e^{1-r/c} & \text{if } c\le r\end{cases}
$$

where $p_n$ is the n-gram precision, $w_n$ the n-gram weight, and $BP$ the brevity penalty.

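The BLEU formula can be sketched for a single reference with uniform weights w_n = 1/N; this toy version uses clipped n-gram counts and is not the official multi-reference implementation.

```python
import math
from collections import Counter

def bleu(candidate, reference, N=4):
    """BLEU = BP * exp(sum_n w_n log p_n) with w_n = 1/N, clipped n-gram
    precisions p_n, and brevity penalty BP; single reference, token lists."""
    precisions = []
    for n in range(1, N + 1):
        cand = Counter(tuple(candidate[i:i + n])
                       for i in range(len(candidate) - n + 1))
        ref = Counter(tuple(reference[i:i + n])
                      for i in range(len(reference) - n + 1))
        match = sum(min(cnt, ref[g]) for g, cnt in cand.items())  # clipped
        total = max(sum(cand.values()), 1)
        precisions.append(match / total)
    if min(precisions) == 0:
        return 0.0  # log p_n undefined; score degenerates to zero
    c, r = len(candidate), len(reference)
    bp = 1.0 if c > r else math.exp(1 - r / c)  # brevity penalty
    return bp * math.exp(sum(math.log(p) for p in precisions) / N)
```

An identical candidate and reference score 1.0; a candidate shorter than four tokens has no 4-grams and scores 0.0 under this toy version.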

Page 24

Summary

• Attacked the problems of feature selection in maximum-entropy-based model training and concept generation
• Proposed new concept-word features that exploit information at both the concept level and the word level
• A multiple feature selection algorithm combines different features in maximum-entropy models to alleviate the over-training problem caused by data sparseness
• Significant improvements are achieved in both concept-sequence-generation tests and speech-translation experiments
