Phrase Tagset Mapping for French and English Treebanks and Its Application in Machine Translation Evaluation

25th International Conference, GSCL 2013. Aaron L.-F. Han, Derek F. Wong, Lidia S. Chao, Liangye He, Shuo Li, and Ling Zhu. September 25th-27th, 2013, Darmstadt, Germany. Natural Language Processing & Portuguese-Chinese Machine Translation Laboratory, Department of Computer and Information Science, University of Macau


Page 1: Phrase Tagset Mapping for French and English Treebanks and Its Application in Machine Translation Evaluation

25th International Conference, GSCL 2013

Aaron L.-F. Han, Derek F. Wong, Lidia S. Chao, Liangye He, Shuo Li, and Ling Zhu

September 25th -27th, 2013, Darmstadt, Germany

Natural Language Processing & Portuguese-Chinese Machine Translation Laboratory

Department of Computer and Information Science

University of Macau

Page 2:

Background of language Treebank

Motivation

Designed phrase tagset mapping

Application in MT evaluation

1. Manual evaluations

2. Traditional automatic MT evaluation methods

3. Designed unsupervised MT evaluation

4. Evaluating the evaluation method

5. Experiments

6. Open source code

Discussion

Further information

Page 3:

• To promote the development of syntactic analysis

• Many language treebanks have been developed

– English Penn Treebank (Marcus et al., 1993; Mitchell et al., 1994)

– German Negra Treebank (Skut et al., 1997)

– French Treebank (Abeillé et al., 2003)

– Chinese Sinica Treebank (Chen et al., 2003)

– Etc.

Page 4:

• Problems

– Different treebanks use their own syntactic tagsets

– The number of tags ranges from tens (e.g. English Penn Treebank) to hundreds (e.g. Chinese Sinica Treebank)

– Inconvenient when undertaking multilingual or cross-lingual research

Page 5:

• To bridge the gap between these treebanks and facilitate future research

– E.g. the unsupervised induction of syntactic structure

• Petrov et al. (2012) develop a universal POS tagset

• How about the phrase level tags?

• The disagreement problem among phrase-level tags remains unsolved, so let's try to solve it

Page 6:

• Tentative design of phrase tagset mapping

– On English Penn Treebank I, II & French Treebank

• 9 universal phrasal categories covering

– 14 phrase tags in English Penn Treebank I

– 26 phrase tags in English Penn Treebank II

– 14 phrase tags in French Treebank

Page 7:

Table 1: phrase tagset mapping for French and English treebanks

Page 8:

• Universal phrasal categories: NP (noun phrase), VP (verb phrase), AJP (adjective phrase), AVP (adverbial phrase), PP (prepositional phrase), S (sentence/sub-sentence), CONJP (conjunction phrase), COP (coordinated phrase), X (other phrases or unknown)

• NP covering

– French tags: NP

– English tags: NP, NAC (the scope of certain prenominal modifiers within an NP), NX (within certain complex NPs to mark the head of NP), WHNP (wh-noun phrase), QP (quantifier phrase)

Page 9:

• VP covering

– French tags: VN (verbal nucleus), VP (infinitives and nonfinite clauses)

– English tags: VP (verb phrase)

• AJP covering

– French tags: AP (adjectival phrase)

– English tags: ADJP (adjective phrase), WHADJP (wh-adjective phrase)

Page 10:

• AVP covering

– French tags: AdP (adverbial phrases)

– English tags: ADVP (adverb phrase), WHADVP (wh-adverb phrase), PRT (particle)

• PP covering

– French tags: PP

– English tags: PP, WHPP (wh-prepositional phrase)

Page 11:

• S covering

– French tags: SENT (sentence), S (finite clause)

– English tags: S (simple declarative clause), SBAR (clause introduced by a subordinating conjunction), SBARQ (direct question introduced by a wh-phrase), SINV (declarative sentence with subject-aux inversion), SQ (sub-constituent of SBARQ), PRN (parenthetical), FRAG (fragment), RRC (reduced relative clause).

• CONJP covering

– French tags: N/A

– English tags: CONJP

Page 12:

• COP covering

– French tags: COORD (coordinated phrase)

– English tags: UCP (coordinated phrases belonging to different categories)

• X covering

– French tags: unknown

– English tags: X (unknown or uncertain), INTJ (interjection), LST (list marker)
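The covering lists above can be collected into a simple lookup table. The sketch below is illustrative Python (not the authors' released tool); the tag inventories follow the slides:

```python
# Sketch of the phrase tagset mapping described above (9 universal categories).
# Tag lists follow the slides; the released tool may differ in details.
UNIVERSAL_PHRASE_MAP = {
    # English Penn Treebank tags
    "NP": "NP", "NAC": "NP", "NX": "NP", "WHNP": "NP", "QP": "NP",
    "VP": "VP",
    "ADJP": "AJP", "WHADJP": "AJP",
    "ADVP": "AVP", "WHADVP": "AVP", "PRT": "AVP",
    "PP": "PP", "WHPP": "PP",
    "S": "S", "SBAR": "S", "SBARQ": "S", "SINV": "S", "SQ": "S",
    "PRN": "S", "FRAG": "S", "RRC": "S",
    "CONJP": "CONJP",
    "UCP": "COP",
    "X": "X", "INTJ": "X", "LST": "X",
}

# French Treebank tags
FRENCH_PHRASE_MAP = {
    "NP": "NP", "VN": "VP", "VP": "VP", "AP": "AJP", "AdP": "AVP",
    "PP": "PP", "SENT": "S", "S": "S", "COORD": "COP",
}

def to_universal(tag, lang="en"):
    """Map a treebank phrase tag to its universal category (X if unknown)."""
    table = FRENCH_PHRASE_MAP if lang == "fr" else UNIVERSAL_PHRASE_MAP
    return table.get(tag, "X")
```

Unknown tags fall back to X, matching the role of the X category on the slides.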

Page 13:

4. Application in Machine Translation Evaluation

Page 14:

• Rapid development of Machine Translation

– MT began as early as in the 1950s (Weaver, 1955)

– Big progress since the 1990s due to the development of computers (storage capacity and computational power) and the enlarged bilingual corpora (Marino et al. 2006)

• Difficulties of MT evaluation

– language variability results in no single correct translation

– the natural languages are highly ambiguous and different languages do not always express the same content in the same way (Arnold, 2003)

Page 15:

• Traditional manual evaluation criteria:

– intelligibility (measuring how understandable the sentence is)

– fidelity (measuring how much information the translated sentence retains as compared to the original) by the Automatic Language Processing Advisory Committee (ALPAC) around 1966 (Carroll, 1966)

– adequacy (similar to fidelity), fluency (whether the sentence is well-formed and fluent) and comprehension (improved intelligibility) by the Defense Advanced Research Projects Agency (DARPA) of the US (White et al., 1994)

Page 16:

• Problems of manual evaluations:

– Time-consuming

– Expensive

– Unrepeatable

– Low agreement (Callison-Burch, et al., 2011)

Page 17:

• Measuring the similarity of automatic translation and reference translation

– Automatic translation (or hypothesis translation, target translation): by automatic MT system

– Reference translation: by professional translators

– Source language and source document: not used

• Traditional automatic evaluation:

– BLEU: n-gram precisions (Papineni et al., 2002)

– TER: edit distances (Snover et al., 2006)

– METEOR: precision and recall (Banerjee and Lavie, 2005)

Page 18:

• Problems in supervised MT evaluation

– Reference translations are expensive

– Reference translations are not available in some cases

• Could we get rid of the reference translation?

– Unsupervised MT evaluation method

– Extract information from source and target language

– How to use the designed universal phrase tagset?

Page 19:

• Assume that the translated sentence should have a set of phrase categories similar to that of the source sentence.

– This design is inspired by the synonymy relation between the source and target sentences.

• Two sentences with a similar set of phrases may still talk about different things.

– However, this evaluation approach is not designed for general circumstances

– It assumes that the target sentences are indeed translations of the source document

Page 20:

• First, we parse the source and target sentences respectively

• Second, we extract the phrase sets from the source and target sentences

• Third, we convert the phrases into the designed universal phrase categories

• Last, we measure the similarity of the source and target sides over the universal phrase tag sequences
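The four steps above can be sketched as follows. The parsers themselves (steps 1-2) are assumed external, the tiny mapping tables are illustrative subsets of the full tagset mapping, and the unigram-overlap score at the end is only a placeholder for the HPPR metric introduced on the following slides:

```python
# Sketch of the four evaluation steps, assuming phrase-tag sequences
# have already been extracted from parser output (parsers not included).
EN_MAP = {"NP": "NP", "VP": "VP", "ADVP": "AVP", "PP": "PP"}
FR_MAP = {"NP": "NP", "VN": "VP", "AdP": "AVP", "PP": "PP"}

def to_universal_seq(tags, table):
    """Step 3: convert extracted phrase tags to universal categories."""
    return [table.get(t, "X") for t in tags]

# Steps 1-2 (parsing + phrase extraction) assumed done; example sequences:
src = to_universal_seq(["NP", "VN", "AdP", "PP"], FR_MAP)
hyp = to_universal_seq(["NP", "VP", "PP"], EN_MAP)

# Step 4: a naive similarity placeholder (unigram set overlap), standing
# in for the full HPPR metric described later.
overlap = len(set(src) & set(hyp)) / len(set(src) | set(hyp))
```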

Page 21:

Figure 1: the parsed French and English sentence

Page 22:

Figure 2: convert the extracted phrase into universal phrase tags

The extracted phrase tags are taken from the level just above the POS tags, working bottom-up

Page 23:

• What similarity metric did we employ?

• Designed similarity metric: HPPR

– N1 gram position order difference penalty

– Weighted N2 gram precision

– Weighted N3 gram recall

– Weighted geometric mean in n-gram precision & recall

– Weighted harmonic mean to combine sub-factors

– The parameters are tunable according to different language pairs

Page 24:

• $\mathrm{HPPR} = Har(w_{Ps}\,N_1PsDif,\; w_{Pr}\,N_2Pre,\; w_{Rc}\,N_3Rec)$

• $\mathrm{HPPR} = \dfrac{w_{Ps} + w_{Pr} + w_{Rc}}{\frac{w_{Ps}}{N_1PsDif} + \frac{w_{Pr}}{N_2Pre} + \frac{w_{Rc}}{N_3Rec}}$

• $N_1PsDif$, $N_2Pre$, and $N_3Rec$ are the corpus-level scores of the sub-factors position difference penalty, precision, and recall.
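The weighted harmonic mean combination can be sketched directly from the formula. The default weights below are illustrative placeholders, not the tuned values from the paper:

```python
def weighted_harmonic_mean(values, weights):
    """Weighted harmonic mean: sum(w_i) / sum(w_i / x_i)."""
    assert len(values) == len(weights)
    return sum(weights) / sum(w / x for w, x in zip(weights, values))

def hppr(ps_dif, pre, rec, w_ps=1.0, w_pr=1.0, w_rc=1.0):
    """HPPR as the weighted harmonic mean of the three sub-factor scores.
    The weights are tunable per language pair; the defaults here are
    illustrative, not the values tuned on WMT11."""
    return weighted_harmonic_mean([ps_dif, pre, rec], [w_ps, w_pr, w_rc])
```

The harmonic mean pulls the combined score toward the weakest sub-factor, so a translation must do reasonably well on all three to score highly.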

Page 25:

• The sentence-level $N_1PsDif$ score:

• $N_1PsDif = \exp(-N_1PD)$

• $N_1PD = \frac{1}{Length_{hyp}} \sum_i |PD_i|$

• $PD_i = |PsN_{hyp} - MatchPsN_{src}|$

• $PsN_{hyp}$ and $MatchPsN_{src}$ are the position numbers of the matching tag in the hypothesis and source sentence respectively. When there is no match for a tag: $PD_i = |PsN_{hyp} - 0|$
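A sentence-level sketch of this penalty; the nearest-match alignment used here is a simplification assumed for illustration, standing in for the paper's N1-gram alignment algorithm (Figure 3):

```python
import math

def n1_ps_dif(hyp_tags, src_tags):
    """Sentence-level position difference penalty, exp(-N1PD).
    Each hypothesis tag is aligned to the nearest unmatched occurrence
    of the same tag in the source (a simplification of the paper's
    alignment algorithm); unmatched tags use match position 0."""
    used = set()
    total = 0.0
    for i, tag in enumerate(hyp_tags, start=1):  # 1-based positions
        best = None
        for j, s in enumerate(src_tags, start=1):
            if s == tag and j not in used:
                if best is None or abs(j - i) < abs(best - i):
                    best = j
        if best is not None:
            used.add(best)
        total += abs(i - (best or 0))  # |PsN_hyp - MatchPsN_src|
    n1_pd = total / len(hyp_tags)
    return math.exp(-n1_pd)
```

Identical tag sequences incur zero penalty and score exp(0) = 1; reorderings and unmatched tags push the score toward 0.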

Page 26:

Figure 3: N1 gram tag alignment algorithm

Page 27:

Figure 4: 𝑁1𝑃𝐷 calculation example

Page 28:

• Corpus-level weighted n-gram precision & recall:

• $N_2Pre = \exp\left(\sum_{n=1}^{N_2} w_n \log P_n\right)$

• $N_3Rec = \exp\left(\sum_{n=1}^{N_3} w_n \log R_n\right)$

• $P_n = \frac{\#\,\text{matched } n\text{-gram chunks}}{\#\,n\text{-gram chunks of hypothesis corpus}}$

• $R_n = \frac{\#\,\text{matched } n\text{-gram chunks}}{\#\,n\text{-gram chunks of source corpus}}$
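A sentence-level sketch of these scores (the paper computes them at corpus level); chunk matching is approximated here with clipped multiset counts, an assumption rather than the paper's exact matching procedure:

```python
import math
from collections import Counter

def ngram_chunks(tags, n):
    """All contiguous n-gram chunks of a tag sequence."""
    return [tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)]

def weighted_ngram_score(hyp, src, max_n, weights, denom="hyp"):
    """exp(sum_n w_n * log(s_n)), where s_n is clipped n-gram chunk
    precision (denom='hyp') or recall (denom='src')."""
    total = 0.0
    for n, w in zip(range(1, max_n + 1), weights):
        hyp_c = Counter(ngram_chunks(hyp, n))
        src_c = Counter(ngram_chunks(src, n))
        matched = sum((hyp_c & src_c).values())  # clipped match counts
        denom_count = sum(hyp_c.values()) if denom == "hyp" else sum(src_c.values())
        s_n = matched / denom_count if denom_count else 0.0
        if s_n == 0.0:
            return 0.0  # log(0): the score collapses, as with BLEU
        total += w * math.log(s_n)
    return math.exp(total)
```

The early return on a zero n-gram score mirrors why the paper caps N2 and N3 at 3: higher-order chunk matches are often empty and would zero out the whole product.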

Page 29:

Figure 5: bigram chunk matching example

Page 30:

• How reliable is the automatic metric?

• Evaluation criteria for evaluation metrics:

– Human judgments are currently the gold standard to approach

– Correlation with human judgments (Callison-Burch, et al., 2011, 2012)

• Spearman rank correlation coefficient $r_s$:

– $r_s(X, Y) = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}$

– for two rank sequences $X = \{x_1, \dots, x_n\}$ and $Y = \{y_1, \dots, y_n\}$, where $d_i = x_i - y_i$
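A minimal sketch of this coefficient, assuming no tied scores (system scores are first converted to ranks):

```python
def rank(scores):
    """1-based ranks of a score list, best score = rank 1 (no ties assumed)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranks = [0] * len(scores)
    for r, i in enumerate(order, start=1):
        ranks[i] = r
    return ranks

def spearman(x_scores, y_scores):
    """Spearman rank correlation via the rank-difference formula above."""
    n = len(x_scores)
    x, y = rank(x_scores), rank(y_scores)
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return 1 - 6 * d2 / (n * (n * n - 1))
```

In the WMT setting, one sequence would be the metric's ranking of the MT systems and the other the human ranking; identical rankings give +1, fully reversed rankings give -1.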

Page 31:

• Corpus from WMT

– Workshop on Statistical Machine Translation

– SIGMT, ACL's special interest group on machine translation

• Training data (WMT11), used to tune the parameters

– 3,003 sentences for each document

– 18 automatic French-to-English MT systems

• Testing data (WMT12)

– 3,003 sentences for each document

– 15 automatic French-to-English MT systems

Page 32:

• Training: tuning the parameters

– N1, N2 and N3 are tuned to 2, 3 and 3, because 4-gram chunk matching usually yields a score of 0.

– The tuned values of the factor weights are shown in Table 2

Table 2: tuned parameter values

Page 33:

• Comparisons with:

– BLEU: measures the closeness of the hypothesis and reference translations via n-gram precision

– TER: measures the edit distance from the hypothesis to the reference translations

Page 34:

Table 3: training (development) scores on WMT11 corpus

Table 4: testing scores on WMT12 corpus

Page 35:

Table 5: correlation score interpretation (Cohen, 1988)

The experiment results on the development and testing corpora show that HPPR, without using reference translations, has yielded promising correlation scores (0.63 and 0.59 respectively).

There is still potential to improve the performance of all three metrics, even though correlation scores higher than 0.5 are already considered strong, as shown in Table 5.

Page 36:

• Phrase Tagset Mapping for French and English Treebanks and Its Application in Machine Translation Evaluation. Aaron L.-F. Han, Derek F. Wong, Lidia S. Chao, Liangye He, Shuo Li, and Ling Zhu. GSCL 2013, Darmstadt, Germany. LNCS Vol. 8105, pp. 119-131. Volume Editors: Iryna Gurevych, Chris Biemann and Torsten Zesch.

• Open source tool for phrase tagset mapping and HPPR similarity measuring algorithms: https://github.com/aaronlifenghan/aaron-project-hppr

Page 37:

• To facilitate future research in multilingual and cross-lingual studies, this paper designs a phrase tagset mapping between the French Treebank and the English Penn Treebank using 9 universal phrase categories.

• One of the potential applications of the designed universal phrase tagset is shown in the unsupervised MT evaluation task in the experiment section.

Page 38:

• There are still some limitations in this work to be addressed in the future.

– The designed universal phrase categories may not be able to cover all the phrase tags of other language treebanks, so this tagset could be expanded when necessary.

– The designed HPPR formula contains the n-gram factors of position difference, precision and recall, which may not be sufficient or suitable for some other language pairs, so different measuring factors should be added or switched when facing new tasks.

Page 39:

• In essence, the designed models are closely related to similarity measurement; here we have employed them in MT evaluation. These works may be further extended to other areas:

– information retrieval

– question answering

– search

– text analysis

– etc.

Page 40:

• Ongoing and further works:

– The combination of translation and evaluation, tuning the translation model using evaluation metrics

– Evaluation models from the perspective of semantics

– The further explorations of unsupervised evaluation models, extracting other features from source and target languages

• Aaron's open-source tools: https://github.com/aaronlifenghan

• Aaron's homepage: http://www.linkedin.com/in/aaronhan

Page 41:

GSCL 2013, Darmstadt, Germany

Aaron L.-F. Han

email: hanlifengaaron AT gmail DOT com

Natural Language Processing & Portuguese-Chinese Machine Translation Laboratory

Department of Computer and Information Science

University of Macau