Decoding Chinese character sequence: neural network model and beyond
TRANSCRIPT
-
Decoding Chinese character sequence: neural network model and beyond
Zhao Hai, Shanghai Jiao Tong University
2015.04.19
-
Outline
Motivation: the role Chinese characters play in Sino-Tibetan languages and in Sinosphere languages
Chinese IME
Loose machine translation
Neural network language model
Experimental results
Conclusion
-
Chinese connection
Chinese characters have more external connections than the Chinese language itself.
Chinese is related only to the Sino-Tibetan languages, but Chinese characters link it to many more related languages.
-
Sino-Tibetan Language Family Tree
-
Sino-Tibetan Languages on the Map
-
Sinosphere, writing connects all
Chinese characters (kanji), a well developed logographic script and the oldest continuously used writing system in the world, are still used in China, Japan, the Korean peninsula, Vietnam, Singapore and Malaysia.
The Sinosphere unofficially refers to the regions that have been historically or culturally influenced by China.
However, languages in the Sinosphere have weak linguistic relations with Chinese.
-
Sino-Tibetan languages, writing in the same way: Lolo (Yi)
-
Why a character writing system leads to the future
It accommodates more change in pronunciation:
Vietnamese: differences between reading and writing appear within only 50 years
Burmese: writing goes one way and reading another after 1,000 years
It accommodates freer word order
-
Four Types of Word Order in Sino-Tibetan Languages
Chinese, Bai:                   Verb-Object, Modifier-Noun
Karen, Shan, Thai:              Verb-Object, Noun-Modifier
Jingpho:                        Object-Verb, Modifier-Noun
Burmese, Tibetan, Lolo, Qiang:  Object-Verb, Noun-Modifier
-
Free order in Chinese
Is Chinese an SOV-order language? No.
Does Chinese support only a modifier-noun order? No.
-
Which order in Chinese
Chinese is a free word order language; only detailed semantics and function words matter.
Every type of word order is possible.
That is the future: 4,000 years of evolution with the largest speaker population. Thus we need a character-based writing system.
-
Sino-Tibetan languages, seen from their writing
Chinese   20th century BC   hanzi
Tibetan   7th century       abugida
Tangut    11th century      self-made hanzi
Burmese   11th century      abugida
-
Alphabetization of languages in the Sinosphere (red means official position)
                        China etc.       Japan                               Korea                             Vietnam
Romanization alphabets  Chinese pinyin   romanization scheme for Japanese    romanization scheme for Korean    Chữ Quốc Ngữ
National alphabets      -                kana                                hangul                            -
Chinese characters      hanzi            kanji                               hanja                             Hán tự / Chữ Nôm
-
Application driven: syllable-to-character conversion tasks
Chinese pinyin IME (input method engine): from a pinyin sequence to a Chinese character sentence
Loose machine translation: from kana, hangul or Vietnamese to a Chinese character sentence
Rewriting those Sino-Tibetan languages
More: to see multilingual pronunciation differences on the basis of semantic equivalence
-
Pinyin based Chinese IME
Most IMEs are based on Pinyin.
Ignoring tone, there are fewer than 500 pinyin syllables in Chinese.
Meanwhile, 3,000 to 20,000 Chinese characters are in use, depending on the application.
In any case, the main obstacle for a pinyin IME is to let users choose the intended characters as quickly as possible for any input pinyin syllables.
-
General Strategy
For each input pinyin syllable, there are usually dozens of Chinese characters it can map to.
If we input two, three, or even more pinyin syllables at a time, far fewer character candidates remain on the mapping list.
Therefore, for quicker and more accurate Chinese input, input as long a syllable sequence as possible!
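To make the candidate-reduction point concrete, here is a minimal Python sketch; the tiny syllable and word lexicons are hypothetical, not the actual IME dictionaries behind the talk.

```python
# A toy illustration of why longer pinyin input narrows the candidate list:
# multi-syllable lexicon entries are far less ambiguous than single syllables.
SYLLABLE_TO_CHARS = {                 # single-syllable candidates (dozens in reality)
    "yu":  ["语", "雨", "鱼", "玉", "于"],
    "yan": ["言", "眼", "演", "盐", "燕"],
}
WORD_LEXICON = {                      # multi-syllable entries constrain each other
    ("yu", "yan"): ["语言", "寓言"],
}

def candidates(syllables):
    """Return character-sequence candidates for a tuple of pinyin syllables."""
    if len(syllables) == 1:
        return SYLLABLE_TO_CHARS.get(syllables[0], [])
    return WORD_LEXICON.get(tuple(syllables), [])

print(len(candidates(("yu",))))        # 5 single-character candidates
print(len(candidates(("yu", "yan"))))  # only 2 word candidates
```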
-
Pinyin IME as Chinese character sequence decoding task
Input: pinyin sequence
Output: Chinese character sequence under a one-to-one mapping
A sequence labeling task
Maximum entropy model: previous work
Statistical machine translation: ours
Example input: zi ran yu yan chu li (自然语言处理, "natural language processing")
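Below is a minimal sketch of this decoding view: each pinyin syllable maps to a few candidate characters, and a beam (Viterbi-style) search with a bigram character language model picks the best sequence. The candidate table and LM scores are toy assumptions, not the maximum entropy or SMT systems discussed here.

```python
import math

# Toy per-syllable candidates and bigram LM scores (assumptions for illustration).
CANDIDATES = {"zi": ["自", "子"], "ran": ["然", "燃"], "yu": ["语", "雨"]}
GOOD_BIGRAMS = {("<s>", "自"), ("自", "然"), ("然", "语")}

def bigram_logprob(prev, cur):
    """Hypothetical bigram scores that favour the intended sentence."""
    return math.log(0.9) if (prev, cur) in GOOD_BIGRAMS else math.log(0.01)

def decode(syllables, beam_size=10):
    """Return the highest-scoring one-character-per-syllable output."""
    beams = [(0.0, ["<s>"])]
    for syl in syllables:
        expanded = [
            (score + bigram_logprob(seq[-1], ch), seq + [ch])
            for score, seq in beams
            for ch in CANDIDATES[syl]
        ]
        beams = sorted(expanded, key=lambda b: b[0], reverse=True)[:beam_size]
    return "".join(beams[0][1][1:])

print(decode(["zi", "ran", "yu"]))  # -> 自然语
```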
-
Chinese character sequence decoding as SMT (Yang and Zhao, PACLIC 2012)
Pipeline: no alignment learning
Only standard MERT tuning and Moses decoding are adopted
Effectively integrates the language model and other linguistic features
Both character accuracy and whole-sentence accuracy outperform the previous maximum entropy model
Character accuracy
        10K     100K    1M
ME      0.829   0.891   0.933
SMT     0.947   0.952   0.955

Whole-sentence accuracy
        10K     100K    1M
ME      0.075   0.169   0.302
SMT     0.402   0.429   0.454
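As a rough illustration of the SMT framing (a sketch only, with a toy phrase table and a stand-in LM rather than the Moses/MERT pipeline above), pinyin-to-character conversion becomes monotone phrase-based decoding: the pinyin is segmented into phrases from the table, translation and language model scores are combined, and no reordering is needed, which is why alignment learning can be skipped.

```python
import math

# Toy pinyin-phrase -> character-phrase table with translation log probabilities.
PHRASE_TABLE = {
    ("zi", "ran"): [("自然", math.log(0.8)), ("子然", math.log(0.01))],
    ("yu", "yan"): [("语言", math.log(0.7)), ("寓言", math.log(0.2))],
    ("zi",): [("子", math.log(0.3))], ("ran",): [("然", math.log(0.2))],
    ("yu",): [("语", math.log(0.2))], ("yan",): [("言", math.log(0.2))],
}

def lm_score(chars):
    return -0.1 * len(chars)          # stand-in for a real n-gram LM

def decode(syllables):
    """Monotone dynamic programming over phrase segmentations."""
    best = {0: (0.0, "")}             # end position -> (score, output so far)
    for end in range(1, len(syllables) + 1):
        for start in range(end):
            phrase = tuple(syllables[start:end])
            if start not in best or phrase not in PHRASE_TABLE:
                continue
            for chars, tm in PHRASE_TABLE[phrase]:
                score = best[start][0] + tm + lm_score(chars)
                if end not in best or score > best[end][0]:
                    best[end] = (score, best[start][1] + chars)
    return best[len(syllables)][1]

print(decode(["zi", "ran", "yu", "yan"]))   # -> 自然语言
```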
-
A close lexical connection through Chinese characters
Japanese: more than 50% of Japanese vocabulary comes from Chinese. However, in modern times many words for Western science, technology and culture were first written in Japanese kanji and then passed back into Chinese.
Korean: Sino-Korean vocabulary covers about 60%.
Vietnamese: Sino-Vietnamese vocabulary covers about 60%.
Examples (Sino-Vietnamese word / Mandarin pinyin):
lịch sử / lì shǐ
định nghĩa / dìng yì
phong phú / fēng fù
thời sự / shí shì
-
Both Vietnamese and Korean adopted alphabetic writing in modern times.
Japanese is different: its writing mixes alphabets with Chinese characters, so Chinese readers can more or less guess what a Japanese text means.
But Vietnamese and Korean texts offer no such clue.
For machine translation between Vietnamese/Korean and Chinese, it is very hard to collect a sufficient parallel corpus.
-
Korean can be written in this way
Sino-Korean writing:
Korean only
Korean with Chinese characters added in parentheses
Korean and Chinese mixed, with Korean as the majority
Korean and Chinese mixed, with Chinese as the majority
-
Korean can be written in this way: South Korea's constitution, the first part
-
Meaning-read Chinese character sequence
Regarding the historic connection among all these languages in the Sinosphere, we present a Chinese character transliteration form that follows strict lexical-semantic equivalence for the related machine translation.
By analogy with the Japanese term kun-yomi (meaning reading), such a sequence of Chinese characters kept in the word order of the original language is called a meaning-read Chinese character sequence (MRCCS).
-
Language difference: Korean vs. Chinese
Sound: Korean is spoken without tones (like Japanese), while Chinese has tones.
Korean follows vowel harmony rules.
Grammar: Korean is SOV in its syntax (just like Japanese), while Chinese is SVO.
Korean is agglutinative in its morphology, with rich suffixes used for meaning representation; Chinese is an isolating language whose main grammatical means is word order.
Korean has five groups and nine parts of speech; only words such as nouns and pronouns can be translated.
-
Language difference: Vietnamese vs. Chinese
Sound: both have tones; Vietnamese has six and Chinese has five.
Grammar: both are isolating (analytic) languages, and neither uses morphological marking of case, gender, number or tense. Word order plays the most important grammatical role in both languages, and both conform to SVO word order.
Like most Southeast Asian languages (Thai, Cambodian), Vietnamese is head-initial, which is quite different from Chinese. So the term for "the Vietnamese language" should not be "Việt Nam tiếng" but "tiếng Việt Nam". The phrase "the official language of the Kinh people" should be "ngôn ngữ (language) chính thức (official) của (of) dân tộc (people) Kinh".
-
Problems to be solved
Grammar translation: the MRCCS is ungrammatical as Chinese.
Solution: re-phrasing based revision.
Simple solution: use only a language model to perform reordering.
From the n-gram language model to the neural network language model.
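A minimal sketch of the "language model only" reordering idea follows: enumerate orderings of the MRCCS segments and keep the one the LM scores highest. The scoring function here is a toy placeholder (any n-gram or neural LM could be plugged in), and the example strings are illustrative, not taken from the experiments.

```python
from itertools import permutations

# Re-phrasing by reordering: try segment orders and keep the LM's favourite.
def rephrase(segments, lm_score, max_segments=6):
    """Return the MRCCS segment order with the highest language model score."""
    assert len(segments) <= max_segments, "exhaustive search only for short spans"
    return max(permutations(segments), key=lambda order: lm_score("".join(order)))

# Toy usage: a fake LM that simply prefers outputs starting with the subject.
toy_lm = lambda s: 1.0 if s.startswith("西班牙游客") else 0.0
print("".join(rephrase(["品尝", "西班牙游客", "茶"], toy_lm)))
# -> 西班牙游客品尝茶 ("Spanish tourists taste tea") under the toy LM
```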
-
NNLM background
Neural network language models (NNLMs), or continuous-space language models (CSLMs), have been shown to improve perplexity (PPL) and statistical machine translation (SMT) performance. However, CSLMs have not been used during decoding, because querying a CSLM in decoding takes too much time.
We propose a method for converting CSLMs into back-off n-gram language models (BNLMs), so that the converted CSLMs can be used in decoding.
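For readers unfamiliar with CSLMs, here is a minimal sketch of a feed-forward (Bengio-style) network language model in numpy; the layer sizes are toy assumptions, not the models used in these experiments. The softmax over the whole output vocabulary is exactly what makes querying a CSLM during decoding so slow.

```python
import numpy as np

# A toy feed-forward CSLM: project the n-1 context words, apply a hidden
# layer, then a softmax over the entire vocabulary.
rng = np.random.default_rng(0)
V, D, H, CTX = 10_000, 128, 256, 4            # vocab, embedding, hidden, context size

E  = rng.normal(scale=0.01, size=(V, D))       # word embeddings (projection layer)
W1 = rng.normal(scale=0.01, size=(CTX * D, H))
W2 = rng.normal(scale=0.01, size=(H, V))       # output layer: one score per word

def cslm_prob(context_ids, next_id):
    """P(next word | n-1 context words) from the toy network."""
    x = E[context_ids].reshape(-1)              # concatenate context embeddings
    h = np.tanh(x @ W1)
    logits = h @ W2                             # the costly part: scores every word
    logits -= logits.max()                      # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return probs[next_id]

print(cslm_prob(np.array([1, 2, 3, 4]), 42))
```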
-
CSLM
-
Why not CSLM in decoding?
2,000 NTCIR-9 English sentences as test data.
A 5-gram CSLM (4 context words) and a BNLM are trained on the same 1 million NTCIR-9 English sentences.
Evaluate the probability of every n-gram.
LMs     CPU time 1   CPU time 2   CPU time 3
BNLM    3.241 s      4.044 s      4.404 s
CSLM    42.058 s     42.372 s     38.361 s
-
CSLM in SMT
(Flowchart: two SMT pipelines of training, MERT tuning and decoding. Both produce an N-best list from a first-pass decoding and re-rank it with the CSLM to give the result; in one pipeline the first pass uses the converted model (CONV), in the other it uses the BNLM.)
-
Conversion method
(Flowchart) From the text data, 2-gram, 3-gram and 4-gram CSLMs are trained.
Each k-gram CSLM is converted and entropy-pruned into a k-gram CONV model.
Each CONV model is appended to the corresponding BNLM and the back-off weights are renormalized, growing a 2-gram BNLM into a 3-gram BNLM and then a 4-gram BNLM, which is used as the BLM.
Arsoy et al., ICASSP 2013
Wang et al. (ours), EMNLP 2013
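The sketch below illustrates the core of one conversion step under simplifying assumptions (a fake CSLM, a single n-gram order, no entropy pruning): n-grams observed in the text are re-scored with CSLM probabilities, and each context's back-off weight is recomputed so the distribution still sums to one before the next order is appended.

```python
# Toy conversion of one n-gram order: CSLM probabilities replace the observed
# n-grams' probabilities, then back-off weights are renormalized. The fake
# CSLM and toy data are assumptions for illustration only.
def fake_cslm_prob(context, word):
    return 0.4 if word == "the" else 0.1        # stands in for a network query

def convert_order(observed, lower_prob, cslm_prob):
    """observed: dict mapping a context tuple to the next words seen in text."""
    probs, backoff = {}, {}
    for ctx, words in observed.items():
        for w in words:
            probs[ctx + (w,)] = cslm_prob(ctx, w)            # converted probability
        seen_mass = sum(cslm_prob(ctx, w) for w in words)
        lower_mass = sum(lower_prob(ctx[1:], w) for w in words)
        backoff[ctx] = (1.0 - seen_mass) / (1.0 - lower_mass)  # renormalized weight
    return probs, backoff

observed = {("of",): {"the", "a"}}               # bigrams "of the" and "of a"
lower = lambda ctx, w: 0.2                       # toy lower-order probabilities
print(convert_order(observed, lower, fake_cslm_prob))
```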
-
Experiments and results
Corpus:
(1) NTCIR-9: 1 million Chinese-to-English sentence pairs
(2) TED: 186K Chinese-to-English sentence pairs (an additional monolingual corpus is hard to obtain)
-
Pinyin IME decoding with NNLM
Test corpus   LM        One-best   N-best    Character acc.
10K           trigram   0.7472     0.8992    0.968
10K           NNLM      0.7571     0.9014    0.968
400K          trigram   0.6702     0.8608    0.9546
400K          NNLM      0.6768     0.8645    0.9559
-
A full example on Vietnamese to Chinese translation
Vietnamese: Du khách Tây Ban Nha thưởng thức trà tại Trâm Anh quán.
Gloss: tourist | Spain | enjoy | tea | at | Trâm Anh | shop
Steps: (1) convert to MRCCS; (2) reorder with language model scores.
(The result is a precise translation, except that the prepositional phrase should precede the verb in Chinese.)
-
A full example on Vietnamese to Chinese translation
Google translation: "Spanish tourists enjoy tea at the British outpost." (Both the Chinese and the English translations are far from the correct meaning of the original Vietnamese text.)
Why? Google translation maps the word "British" to "người Anh" in Vietnamese. Note that the named entity above, "Trâm Anh", contains the same core syllable, "Anh".
As Google translation cannot find a good mapping for "Trâm Anh", it falls back on the English translation of "người Anh" instead; hence the incorrect translation "British". From this procedure we can infer that Google translation uses English as the pivot language for Vietnamese-to-Chinese translation.
This shows that the historic connection between related languages can help improve machine translation.
-
Vietnamese to MRCCS
-
Vietnamese to MRCCS
-
Conclusions
Chinese characters as pivot: more accurate translation than before
Using the same decoder to solve different problems
The neural network language model works
Using Chinese characters as the lasting writing system
-
PACLIC 29 2015@Shanghai
-
Thank you
xie xie