
Better Punctuation Prediction with Dynamic Conditional Random Fields

Wei Lu and Hwee Tou Ng

National University of Singapore

Talk Overview

• Background
• Related Work
• Approaches
  – Previous approach: Hidden Event Language Model
  – Previous approach: Linear-Chain CRF
  – This work: Factorial CRF
• Evaluation
• Conclusion

Punctuation Prediction

• Automatically insert punctuation symbols into transcribed speech utterances
• Widely studied in the speech processing community
• Example:

>> Original speech utterance:

you are quite welcome and by the way we may get other reservations so could you please call us as soon as you fix the date

>> Punctuated (and cased) version:

You are quite welcome . And by the way , we may get other reservations , so could you please call us as soon as you fix the date ?

Our Task

• Processing prosodic features requires access to the raw speech data, which may be unavailable
• This work tackles the problem from a text processing perspective

Task: perform punctuation prediction for conversational speech texts without relying on prosodic features

Related Work

• With prosodic features
  – Kim and Woodland (2001): a decision tree framework
  – Christensen et al. (2001): a finite state and a multi-layer perceptron approach
  – Huang and Zweig (2002): a maximum entropy-based approach
  – Liu et al. (2005): linear-chain conditional random fields

• Without prosodic features
  – Beeferman et al. (1998): comma prediction with a trigram language model
  – Gravano et al. (2009): an n-gram based approach

Related Work (continued)

• One well-known approach that does not exploit prosodic features
  – Stolcke et al. (1998) presented a hidden event language model
  – It treats boundary detection and punctuation insertion as an inter-word hidden event detection task
  – Widely used in many recent spoken language translation tasks, as either a pre-processing (Wang et al., 2008) or post-processing (Kirchhoff and Yang, 2007) step

Hidden Event Language Model

• HMM (Hidden Markov Model)-based approach
  – A joint distribution over words and inter-word events
  – Observations are the words; word/event pairs are the hidden states
• Implemented in the SRILM toolkit (Stolcke, 2002)
• A variant of this approach
  – Relocates/duplicates the ending punctuation symbol to be closer to the indicative words
  – Works well for predicting English question marks

where is the nearest bus stop ?

? where is the nearest bus stop
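The relocation trick above amounts to a small training-time preprocessing step. A minimal sketch of the idea (our own illustration, not the SRILM implementation; the function name is hypothetical):

```python
def relocate_ending_punct(tokens, movable=("?",)):
    """Move a movable sentence-final punctuation token to the front of the
    sentence, so it sits next to indicative words like "where" or "would you"."""
    if tokens and tokens[-1] in movable:
        return [tokens[-1]] + tokens[:-1]
    return tokens

relocate_ending_punct("where is the nearest bus stop ?".split())
# → ['?', 'where', 'is', 'the', 'nearest', 'bus', 'stop']
```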

Linear-Chain CRF

• Linear-chain conditional random fields (L-CRF): an undirected graphical model used for sequence learning
  – Avoids the strong assumptions about dependencies made by the hidden event language model
  – Capable of modeling dependencies with arbitrary non-independent overlapping features

[Figure: linear-chain CRF, with word-layer tags Y1 … Yn over utterance tokens X1 … Xn]

An Example L-CRF

• A linear-chain CRF assigns a single tag to each individual word at each time step
  – Tags: NONE, COMMA, PERIOD, QMARK, EMARK
  – Factorized features

• Sentence: no , please do not . would you save your questions for the end of my talk , when i ask for them ?

COMMA NONE NONE PERIOD NONE NONE … NONE COMMA NONE … QMARK

no please do not would you … my talk when … them
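The word-layer tag sequence above can be read off mechanically from a punctuated sentence: each word is tagged with the punctuation symbol that follows it, or NONE. A minimal sketch (our own helper, not the paper's code):

```python
PUNCT_TAGS = {",": "COMMA", ".": "PERIOD", "?": "QMARK", "!": "EMARK"}

def to_word_tags(punctuated):
    """Turn a punctuated sentence into (word, tag) training pairs:
    each word carries the tag of the punctuation that follows it."""
    tokens = punctuated.split()
    pairs = []
    for i, tok in enumerate(tokens):
        if tok in PUNCT_TAGS:
            continue  # punctuation tokens become tags, not observations
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        pairs.append((tok, PUNCT_TAGS.get(nxt, "NONE")))
    return pairs

to_word_tags("no , please do not .")
# → [('no', 'COMMA'), ('please', 'NONE'), ('do', 'NONE'), ('not', 'PERIOD')]
```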


Features for L-CRF

• Feature factorization (Sutton et al., 2007)
  – Each feature is the product of a binary function on the assignment of the set of cliques at each time step and a feature function defined solely on the observation sequence
  – Feature functions: n-gram (n = 1, 2, 3) occurrences within 5 words of the current word

Example: for the word “do”:

do@0, please@-1, would_you@[2,3], no_please_do@[-2,0]

COMMA NONE NONE PERIOD NONE NONE … NONE COMMA NONE … QMARK

no please do not would you … my talk when … them
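The n-gram features above (e.g. do@0, would_you@[2,3]) can be generated with a simple window scan. A sketch of one plausible implementation (the offset notation follows the examples on this slide; details of the original feature extractor may differ):

```python
def ngram_features(words, i, max_n=3, window=5):
    """N-gram (n = 1..3) occurrence features within `window` words of
    position i, labeled with offsets relative to the current word."""
    lo, hi = max(0, i - window), min(len(words), i + window + 1)
    feats = []
    for start in range(lo, hi):
        for n in range(1, max_n + 1):
            end = start + n
            if end > hi:
                break
            gram = "_".join(words[start:end])
            if n == 1:
                feats.append(f"{gram}@{start - i}")
            else:
                feats.append(f"{gram}@[{start - i},{end - 1 - i}]")
    return feats

feats = ngram_features("no please do not would you save".split(), 2)
# includes: 'do@0', 'please@-1', 'would_you@[2,3]', 'no_please_do@[-2,0]'
```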


Problems with L-CRF

• Long-range dependencies between punctuation symbols and indicative words cannot be captured properly

• For example: no please do not would you save your questions for the end of my talk when i ask for them

It is hard for a linear-chain CRF to capture the long-range dependency between the ending question mark (?) and the initial phrase “would you”

Problems with L-CRF (continued)

• What humans might do
  – Start with the raw utterance: no please do not would you save your questions for the end of my talk when i ask for them
  – Segment it into sentences: [no please do not] [would you save your questions for the end of my talk when i ask for them]
  – Punctuate each sentence: no , please do not . would you save your questions for the end of my talk , when i ask for them ?

• Sentence-level punctuation (. ? !) is associated with the complete sentence, and should therefore be assigned at the sentence level

What Do We Want?

• A model that jointly performs the following three tasks
  – Sentence boundary detection (or sentence segmentation)
  – Sentence type identification
  – Punctuation insertion

Factorial CRF

• An instance of a dynamic CRF
  – A two-layer factorial CRF (F-CRF) jointly annotates an observation sequence with two label sequences
  – Models the conditional probability of the label sequence pair <Y, Z> given the observation sequence X

[Figure: two-layer factorial CRF, with sentence-layer tags Z1 … Zn and word-layer tags Y1 … Yn over utterance tokens X1 … Xn]

Example of F-CRF

DEBEG DEIN DEIN DEIN QNBEG QNIN … QNIN QNIN QNIN … QNIN

COMMA NONE NONE PERIOD NONE NONE … NONE COMMA NONE … QMARK

no please do not would you … my talk when … them

• We propose two sets of tags for this joint task
  – Word-layer: NONE, COMMA, PERIOD, QMARK, EMARK
  – Sentence-layer: DEBEG, DEIN (declarative), QNBEG, QNIN (question), EXBEG, EXIN (exclamatory)
  – An analogous feature factorization and the same feature functions as in the L-CRF are used
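The sentence-layer tag sequence for one sentence follows directly from its boundaries and type: the first word takes a *BEG tag, the rest take *IN, with the prefix (DE/QN/EX) given by the sentence type. A minimal sketch (our own helper):

```python
SENT_PREFIX = {".": "DE", "?": "QN", "!": "EX"}

def sentence_layer_tags(words, end_punct):
    """Sentence-layer tags for one sentence: the first word gets a *BEG
    tag, the rest *IN; the prefix encodes the sentence type."""
    p = SENT_PREFIX[end_punct]
    return [p + ("BEG" if i == 0 else "IN") for i in range(len(words))]

sentence_layer_tags("no please do not".split(), ".")
# → ['DEBEG', 'DEIN', 'DEIN', 'DEIN']
```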


Why Does it Work?

• The sentence-layer tags are used for sentence segmentation and sentence type identification

• The word-layer tags are used for punctuation insertion

• Knowledge learned from the sentence-layer can guide the word-layer tagging process

• The two layers are jointly learned, providing evidence that influences each other’s tagging process

[no please do not]declarative sent.  [would you save your questions for the end of my talk when i ask for them]question sent.

The sentence-layer tags QNBEG QNIN … mark the second sentence as a question, guiding the word layer to emit the ending ?

Evaluation Datasets

                                      |  BTEC          |  CT
                                      |  CN     EN     |  CN     EN
Number of utterance pairs             |     19,972     |     10,061
Percentage of declarative sentences   |  64%    65%    |  77%    81%
Percentage of question sentences      |  36%    35%    |  22%    19%
Multiple sentences per utterance      |  14%    17%    |  29%    39%
Average words per utterance           |  8.59   9.46   |  10.18  14.33

• IWSLT 2009 BTEC and CT datasets
• Consists of both English (EN) and Chinese (CN)
• 90% used for training, 10% for testing

Experimental Setup

• Designed extensive experiments for the hidden event language model
  – Duplication vs. no duplication
  – Single-pass vs. cascaded
  – Trigram vs. 5-gram
• Conducted the following experiments
  – Accuracy on correctly recognized (CRR) texts (F1 measure)
  – Accuracy on automatically recognized (ASR) texts (F1 measure)
  – Translation performance with punctuated ASR texts (BLEU metric)

Punctuation Prediction: Evaluation Metrics

• Precision = (# correctly predicted punctuation symbols) / (# predicted punctuation symbols)

• Recall = (# correctly predicted punctuation symbols) / (# expected punctuation symbols)

• F1 measure = 2 / (1/Precision + 1/Recall)
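The metrics above compose as follows (a straightforward sketch; F1 is the harmonic mean of precision and recall):

```python
def prf(correct, predicted, expected):
    """Precision, recall, and F1 from counts of correctly predicted,
    total predicted, and total expected (gold) punctuation symbols."""
    precision = correct / predicted
    recall = correct / expected
    f1 = 2 / (1 / precision + 1 / recall)  # harmonic mean
    return precision, recall, f1

prf(correct=80, predicted=100, expected=160)
# → precision 0.8, recall 0.5, F1 ≈ 0.615
```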


Punctuation Prediction Evaluation: Correctly Recognized Texts (I), BTEC

                 Hidden event LM (no duplication)   Hidden event LM (use duplication)
                 Single pass      Cascaded          Single pass      Cascaded
     LM order    3       5        3       5         3       5        3       5        L-CRF   F-CRF
CN   Prec.       87.40   86.44    87.72   87.13     76.74   77.58    77.89   78.50    94.82   94.83
     Rec.        83.01   83.58    82.04   83.76     72.62   73.72    73.02   75.53    87.06   87.94
     F1          85.15   84.99    84.79   85.41     74.63   75.60    75.37   76.99    90.78   91.25
EN   Prec.       64.72   62.70    62.39   58.10     85.33   85.74    84.44   81.37    88.37   92.76
     Rec.        60.76   59.49    58.57   55.28     80.42   80.98    79.43   77.52    80.28   84.73
     F1          62.68   61.06    60.42   56.66     82.80   83.29    81.86   79.40    84.13   88.56

• The “duplication” trick for the hidden event language model is language-specific

• Unlike in English, indicative words can appear anywhere in a Chinese sentence

Punctuation Prediction Evaluation: Correctly Recognized Texts (II), CT

                 Hidden event LM (no duplication)   Hidden event LM (use duplication)
                 Single pass      Cascaded          Single pass      Cascaded
     LM order    3       5        3       5         3       5        3       5        L-CRF   F-CRF
CN   Prec.       89.14   87.83    90.97   88.04     74.63   75.42    75.37   76.87    93.14   92.77
     Rec.        84.71   84.16    77.78   84.08     70.69   70.84    64.62   73.60    83.45   86.92
     F1          86.87   85.96    83.86   86.01     72.60   73.06    69.58   75.20    88.03   89.75
EN   Prec.       73.86   73.42    67.02   65.15     75.87   77.78    74.75   74.44    83.07   86.69
     Rec.        68.94   68.79    62.13   61.23     70.33   72.56    69.28   69.93    76.09   79.62
     F1          71.31   71.03    64.48   63.13     72.99   75.08    71.91   72.12    79.43   83.01

• Significant improvement over L-CRF (p < 0.01)

• Our approach is general: it requires minimal linguistic knowledge and consistently performs well across different languages

Punctuation Prediction Evaluation: Automatically Recognized Texts, BTEC

                 Hidden event LM (no duplication)   Hidden event LM (use duplication)
                 Single pass      Cascaded          Single pass      Cascaded
     LM order    3       5        3       5         3       5        3       5        L-CRF   F-CRF
CN   Prec.       85.96   84.80    86.48   85.12     66.86   68.76    68.00   68.75    92.81   93.82
     Rec.        81.87   82.78    83.15   82.78     63.92   66.12    65.38   66.48    85.16   89.01
     F1          83.86   83.78    84.78   83.94     65.36   67.41    66.67   67.60    88.83   91.35
EN   Prec.       62.38   59.29    56.86   54.22     85.23   87.29    84.49   81.32    90.67   93.72
     Rec.        64.17   60.99    58.76   56.21     88.22   89.65    87.58   84.55    88.22   92.68
     F1          63.27   60.13    57.79   55.20     86.70   88.45    86.00   82.90    89.43   93.19

• 504 Chinese utterances and 498 English utterances
• Recognition accuracy: 86% and 80%, respectively
• Significant improvement (p < 0.01)

Punctuation Prediction Evaluation: Translation Performance (BLEU), BTEC

                 Hidden event LM (no duplication)   Hidden event LM (use duplication)
                 Single pass      Cascaded          Single pass      Cascaded
     LM order    3       5        3       5         3       5        3       5        L-CRF   F-CRF
CN → EN          30.77   30.71    30.98   30.64     30.16   30.26    30.33   30.42    31.27   31.30
EN → CN          21.21   21.00    21.16   20.76     23.03   24.04    23.61   23.34    23.44   24.18

• This tells us how well the punctuated ASR outputs can be used for downstream NLP tasks

• Used the Berkeley aligner and Moses (with lexicalized reordering)

• Averaged BLEU-4 scores over 10 MERT runs with random initial parameters

Conclusion

• We propose a novel approach for punctuation prediction that does not rely on prosodic features
  – Jointly performs punctuation prediction, sentence boundary detection, and sentence type identification
  – Performs better than the hidden event language model and a linear-chain CRF model
  – A general approach that consistently works well across different languages
  – Effective when incorporated into downstream NLP tasks
