Decoding Chinese character sequence: neural network model and beyond
TRANSCRIPT
-
Decoding Chinese character sequence: neural network model and beyond
Zhao Hai, Shanghai Jiao Tong University
2015.04.19
-
Outline
Motivation: the role Chinese characters play in Sino-Tibetan languages and in Sinosphere languages
Chinese IME
Loose machine translation
Neural network language model
Experimental results
Conclusion
-
Chinese connection
Chinese characters have more external connections than the Chinese language itself.
Chinese is related only to the Sino-Tibetan languages, but Chinese characters link it to many more related languages.
-
Sino-Tibetan Language Family Tree
-
Sino-Tibetan Languages on the Map
-
Sinosphere, writing connects all
Chinese characters (kanji), a well developed logographic script and the oldest continuously used writing system in the world, are still used in China, Japan, the Korean peninsula, Vietnam, Singapore and Malaysia.
The Sinosphere unofficially refers to the regions that have been historically or culturally influenced by China.
However, languages in the Sinosphere have weak linguistic relations with Chinese.
-
Sino-Tibetan languages, writing in the same way: Lolo (Yi)
-
Why a character writing system leads to the future
It accommodates more change in pronunciation:
Vietnamese: differences between reading and writing appear within only 50 years
Burmese: writing goes one way and reading another after 1,000 years
It accommodates freer word order
-
Four Types of Word Order in Sino-Tibetan Languages
Chinese, Bai:                   Verb-Object, Modifier-Noun
Karen, Shan, Thai:              Verb-Object, Noun-Modifier
Jingpho:                        Object-Verb, Modifier-Noun
Burmese, Tibetan, Lolo, Qiang:  Object-Verb, Noun-Modifier
-
Free order in Chinese
Is Chinese an SOV-order language? No.
Does Chinese support only a modifier-noun order? No.
-
Which order in Chinese
Chinese is a free word order language; only detailed semantics and function words matter.
Every type of word order is possible.
That is the future: 4,000 years of evolution with the largest speaker population. Thus we need a character-based writing system.
-
Sino-Tibetan languages, seen from their writing
Chinese   20th century BC   hanzi
Tibetan   7th century       abugida
Tangut    11th century      self-made hanzi
Burmese   11th century      abugida
-
Alphabetization of languages in the Sinosphere (red means official position)
                        China etc.       Japan                               Korea                             Vietnam
Romanization alphabets  Chinese pinyin   romanization scheme for Japanese    romanization scheme for Korean    Chữ Quốc Ngữ
National alphabets      -                kana                                hangul                            -
Chinese characters      hanzi            kanji                               hanja                             Hán tự / Chữ Nôm
-
Application driven: syllable-to-character conversion tasks
Chinese pinyin IME (input method engine): from a pinyin sequence to a Chinese character sentence
Loose machine translation: from kana, hangul or Vietnamese to a Chinese character sentence
Rewriting those Sino-Tibetan languages
More: to see multilingual pronunciation differences on the basis of semantic equivalence
-
Pinyin based Chinese IME
Most IMEs are based on Pinyin.
Ignoring tone, there are fewer than 500 pinyin syllables in Chinese.
Meanwhile, 3,000 to 20,000 Chinese characters are in use, depending on the application.
In any case, the main obstacle for a pinyin IME is to let users choose the intended characters as quickly as possible for any input pinyin syllables.
-
General Strategy
For each input pinyin syllable, there are usually dozens of Chinese characters it can map to.
If we input two, three, or even more pinyin syllables at a time, far fewer character candidates remain on the mapping list.
Therefore, for quicker and more accurate Chinese input, input as long a syllable sequence as possible!
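To make the candidate-reduction point concrete, here is a minimal Python sketch; the tiny syllable and word lexicons are hypothetical, not the actual IME dictionaries behind the talk.

```python
# A toy illustration of why longer pinyin input narrows the candidate list:
# multi-syllable lexicon entries are far less ambiguous than single syllables.
SYLLABLE_TO_CHARS = {                 # single-syllable candidates (dozens in reality)
    "yu":  ["语", "雨", "鱼", "玉", "于"],
    "yan": ["言", "眼", "演", "盐", "燕"],
}
WORD_LEXICON = {                      # multi-syllable entries constrain each other
    ("yu", "yan"): ["语言", "寓言"],
}

def candidates(syllables):
    """Return character-sequence candidates for a tuple of pinyin syllables."""
    if len(syllables) == 1:
        return SYLLABLE_TO_CHARS.get(syllables[0], [])
    return WORD_LEXICON.get(tuple(syllables), [])

print(len(candidates(("yu",))))        # 5 single-character candidates
print(len(candidates(("yu", "yan"))))  # only 2 word candidates
```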
-
Pinyin IME as Chinese character sequence decoding task
Input: pinyin sequence
Output: Chinese character sequence under a one-to-one mapping
A sequence labeling task
Maximum entropy model: previous work
Statistical machine translation: ours
Example input: zi ran yu yan chu li (自然语言处理, "natural language processing")
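Below is a minimal sketch of this decoding view: each pinyin syllable maps to a few candidate characters, and a beam (Viterbi-style) search with a bigram character language model picks the best sequence. The candidate table and LM scores are toy assumptions, not the maximum entropy or SMT systems discussed here.

```python
import math

# Toy per-syllable candidates and bigram LM scores (assumptions for illustration).
CANDIDATES = {"zi": ["自", "子"], "ran": ["然", "燃"], "yu": ["语", "雨"]}
GOOD_BIGRAMS = {("<s>", "自"), ("自", "然"), ("然", "语")}

def bigram_logprob(prev, cur):
    """Hypothetical bigram scores that favour the intended sentence."""
    return math.log(0.9) if (prev, cur) in GOOD_BIGRAMS else math.log(0.01)

def decode(syllables, beam_size=10):
    """Return the highest-scoring one-character-per-syllable output."""
    beams = [(0.0, ["<s>"])]
    for syl in syllables:
        expanded = [
            (score + bigram_logprob(seq[-1], ch), seq + [ch])
            for score, seq in beams
            for ch in CANDIDATES[syl]
        ]
        beams = sorted(expanded, key=lambda b: b[0], reverse=True)[:beam_size]
    return "".join(beams[0][1][1:])

print(decode(["zi", "ran", "yu"]))  # -> 自然语
```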
-
Chinese character sequence decoding as SMT (Yang and Zhao, PACLIC 2012)
Pipeline: no alignment learning
Only standard MERT tuning and Moses decoding are adopted
Effectively integrates the language model and other linguistic features
Both character accuracy and whole-sentence accuracy outperform the previous maximum entropy model
Character accuracy
        10K     100K    1M
ME      0.829   0.891   0.933
SMT     0.947   0.952   0.955

Whole-sentence accuracy
        10K     100K    1M
ME      0.075   0.169   0.302
SMT     0.402   0.429   0.454
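As a rough illustration of the SMT framing (a sketch only, with a toy phrase table and a stand-in LM rather than the Moses/MERT pipeline above), pinyin-to-character conversion becomes monotone phrase-based decoding: the pinyin is segmented into phrases from the table, translation and language model scores are combined, and no reordering is needed, which is why alignment learning can be skipped.

```python
import math

# Toy pinyin-phrase -> character-phrase table with translation log probabilities.
PHRASE_TABLE = {
    ("zi", "ran"): [("自然", math.log(0.8)), ("子然", math.log(0.01))],
    ("yu", "yan"): [("语言", math.log(0.7)), ("寓言", math.log(0.2))],
    ("zi",): [("子", math.log(0.3))], ("ran",): [("然", math.log(0.2))],
    ("yu",): [("语", math.log(0.2))], ("yan",): [("言", math.log(0.2))],
}

def lm_score(chars):
    return -0.1 * len(chars)          # stand-in for a real n-gram LM

def decode(syllables):
    """Monotone dynamic programming over phrase segmentations."""
    best = {0: (0.0, "")}             # end position -> (score, output so far)
    for end in range(1, len(syllables) + 1):
        for start in range(end):
            phrase = tuple(syllables[start:end])
            if start not in best or phrase not in PHRASE_TABLE:
                continue
            for chars, tm in PHRASE_TABLE[phrase]:
                score = best[start][0] + tm + lm_score(chars)
                if end not in best or score > best[end][0]:
                    best[end] = (score, best[start][1] + chars)
    return best[len(syllables)][1]

print(decode(["zi", "ran", "yu", "yan"]))   # -> 自然语言
```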
-
A close lexical connection through Chinese characters
Japanese: more than 50% of Japanese vocabulary comes from Chinese. However, in modern times many words for Western science, technology and culture were first written in Japanese kanji and then passed back into Chinese.
Korean: Sino-Korean vocabulary covers about 60%.
Vietnamese: Sino-Vietnamese vocabulary covers about 60%.
Examples (Sino-Vietnamese word / Mandarin pinyin):
lịch sử / lì shǐ
định nghĩa / dìng yì
phong phú / fēng fù
thời sự / shí shì
-
Both Vietnamese and Korean adopted alphabetic writing in modern times.
Japanese is different: its writing mixes alphabets with Chinese characters, so Chinese readers can more or less guess what a Japanese text means.
But Vietnamese and Korean texts offer no such clue.
For machine translation between Vietnamese/Korean and Chinese, it is very hard to collect a sufficient parallel corpus.
-
Korean can be written in this way
Sino-Korean writing:
Korean only
Korean with Chinese characters added in parentheses
Korean and Chinese mixed, with Korean as the majority
Korean and Chinese mixed, with Chinese as the majority
-
Korean can be written in this way: South Korea's constitution, the first part
-
Meaning-read Chinese character sequence
Regarding the historic connection among all these languages in the Sinosphere, we present a Chinese character transliteration form that follows strict lexical-semantic equivalence for the related machine translation.
By analogy with the Japanese term kun-yomi (meaning reading), such a sequence of Chinese characters kept in the word order of the original language is called a meaning-read Chinese character sequence (MRCCS).
-
Language difference: Korean vs. Chinese
Sound: Korean is spoken without tones (like Japanese), while Chinese has tones.
Korean follows vowel harmony rules.
Grammar: Korean is SOV in its syntax (just like Japanese), while Chinese is SVO.
Korean is agglutinative in its morphology, with rich suffixes used for meaning representation; Chinese is an isolating language whose main grammatical means is word order.
Korean has five groups and nine parts of speech; only words such as nouns and pronouns can be translated.
-
Language difference: Vietnamese vs. Chinese
Sound: both have tones; Vietnamese has six and Chinese has five.
Grammar: both are isolating (analytic) languages, and neither uses morphological marking of case, gender, number or tense. Word order plays the most important grammatical role in both languages, and both conform to SVO word order.
Like most Southeast Asian languages (Thai, Cambodian), Vietnamese is head-initial, which is quite different from Chinese. So the term for "the Vietnamese language" should not be "Việt Nam tiếng" but "tiếng Việt Nam". The phrase "the official language of the Kinh people" should be "ngôn ngữ (language) chính thức (official) của (of) dân tộc (people) Kinh".
-
Problems to be solved
Grammar translation: the MRCCS is ungrammatical as Chinese.
Solution: re-phrasing based revision.
Simple solution: use only a language model to perform reordering.
From the n-gram language model to the neural network language model.
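A minimal sketch of the "language model only" reordering idea follows: enumerate orderings of the MRCCS segments and keep the one the LM scores highest. The scoring function here is a toy placeholder (any n-gram or neural LM could be plugged in), and the example strings are illustrative, not taken from the experiments.

```python
from itertools import permutations

# Re-phrasing by reordering: try segment orders and keep the LM's favourite.
def rephrase(segments, lm_score, max_segments=6):
    """Return the MRCCS segment order with the highest language model score."""
    assert len(segments) <= max_segments, "exhaustive search only for short spans"
    return max(permutations(segments), key=lambda order: lm_score("".join(order)))

# Toy usage: a fake LM that simply prefers outputs starting with the subject.
toy_lm = lambda s: 1.0 if s.startswith("西班牙游客") else 0.0
print("".join(rephrase(["品尝", "西班牙游客", "茶"], toy_lm)))
# -> 西班牙游客品尝茶 ("Spanish tourists taste tea") under the toy LM
```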
-
NNLM background
Neural network language models (NNLMs), or continuous-space language models (CSLMs), have been shown to improve perplexity (PPL) and statistical machine translation (SMT) performance. However, CSLMs have not been used during decoding, because querying a CSLM in decoding takes too much time.
We propose a method for converting CSLMs into back-off n-gram language models (BNLMs), so that the converted CSLMs can be used in decoding.
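For readers unfamiliar with CSLMs, here is a minimal sketch of a feed-forward (Bengio-style) network language model in numpy; the layer sizes are toy assumptions, not the models used in these experiments. The softmax over the whole output vocabulary is exactly what makes querying a CSLM during decoding so slow.

```python
import numpy as np

# A toy feed-forward CSLM: project the n-1 context words, apply a hidden
# layer, then a softmax over the entire vocabulary.
rng = np.random.default_rng(0)
V, D, H, CTX = 10_000, 128, 256, 4            # vocab, embedding, hidden, context size

E  = rng.normal(scale=0.01, size=(V, D))       # word embeddings (projection layer)
W1 = rng.normal(scale=0.01, size=(CTX * D, H))
W2 = rng.normal(scale=0.01, size=(H, V))       # output layer: one score per word

def cslm_prob(context_ids, next_id):
    """P(next word | n-1 context words) from the toy network."""
    x = E[context_ids].reshape(-1)              # concatenate context embeddings
    h = np.tanh(x @ W1)
    logits = h @ W2                             # the costly part: scores every word
    logits -= logits.max()                      # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return probs[next_id]

print(cslm_prob(np.array([1, 2, 3, 4]), 42))
```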
-
CSLM
-
Why not CSLM in decoding?
2,000 NTCIR-9 English sentences as test data.
A 5-gram CSLM (4 context words) and a BNLM are trained on the same 1 million NTCIR-9 English sentences.
Evaluate the probability of every n-gram.
LMs     CPU time 1   CPU time 2   CPU time 3
BNLM    3.241 s      4.044 s      4.404 s
CSLM    42.058 s     42.372 s     38.361 s
-
CSLM in SMT
(Flowchart: two SMT pipelines of training, MERT tuning and decoding. Both produce an N-best list from a first-pass decoding and re-rank it with the CSLM to give the result; in one pipeline the first pass uses the converted model (CONV), in the other it uses the BNLM.)
-
Conversion method
(Flowchart) From the text data, 2-gram, 3-gram and 4-gram CSLMs are trained.
Each k-gram CSLM is converted and entropy-pruned into a k-gram CONV model.
Each CONV model is appended to the corresponding BNLM and the back-off weights are renormalized, growing a 2-gram BNLM into a 3-gram BNLM and then a 4-gram BNLM, which is used as the BLM.
Arsoy et al., ICASSP 2013
Wang et al. (ours), EMNLP 2013
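The sketch below illustrates the core of one conversion step under simplifying assumptions (a fake CSLM, a single n-gram order, no entropy pruning): n-grams observed in the text are re-scored with CSLM probabilities, and each context's back-off weight is recomputed so the distribution still sums to one before the next order is appended.

```python
# Toy conversion of one n-gram order: CSLM probabilities replace the observed
# n-grams' probabilities, then back-off weights are renormalized. The fake
# CSLM and toy data are assumptions for illustration only.
def fake_cslm_prob(context, word):
    return 0.4 if word == "the" else 0.1        # stands in for a network query

def convert_order(observed, lower_prob, cslm_prob):
    """observed: dict mapping a context tuple to the next words seen in text."""
    probs, backoff = {}, {}
    for ctx, words in observed.items():
        for w in words:
            probs[ctx + (w,)] = cslm_prob(ctx, w)            # converted probability
        seen_mass = sum(cslm_prob(ctx, w) for w in words)
        lower_mass = sum(lower_prob(ctx[1:], w) for w in words)
        backoff[ctx] = (1.0 - seen_mass) / (1.0 - lower_mass)  # renormalized weight
    return probs, backoff

observed = {("of",): {"the", "a"}}               # bigrams "of the" and "of a"
lower = lambda ctx, w: 0.2                       # toy lower-order probabilities
print(convert_order(observed, lower, fake_cslm_prob))
```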
-
Experiments and results
Corpus:
(1) NTCIR-9: 1 million Chinese-to-English sentence pairs
(2) TED: 186K Chinese-to-English sentence pairs (an additional monolingual corpus is hard to obtain)
-
Pinyin IME decoding with NNLM
Test corpus   LM        One-best   N-best    Character acc.
10K           trigram   0.7472     0.8992    0.968
10K           NNLM      0.7571     0.9014    0.968
400K          trigram   0.6702     0.8608    0.9546
400K          NNLM      0.6768     0.8645    0.9559
-
A full example on Vietnamese to Chinese translation
Vietnamese: Du khách Tây Ban Nha thưởng thức trà tại Trâm Anh quán.
Gloss: tourist | Spain | enjoy | tea | at | Trâm Anh | shop
Steps: (1) convert to MRCCS; (2) reorder with language model scores.
(The result is a precise translation, except that the prepositional phrase should precede the verb in Chinese.)
-
A full example on Vietnamese to Chinese translation
Google translation: "Spanish tourists enjoy tea at the British outpost." (Both the Chinese and the English translations are far from the correct meaning of the original Vietnamese text.)
Why? Google translation maps the word "British" to "người Anh" in Vietnamese. Note that the named entity above, "Trâm Anh", contains the same core syllable, "Anh".
As Google translation cannot find a good mapping for "Trâm Anh", it falls back on the English translation of "người Anh" instead; hence the incorrect translation "British". From this procedure we can infer that Google translation uses English as the pivot language for Vietnamese-to-Chinese translation.
This shows that the historic connection between related languages can help improve machine translation.
-
Vietnamese to MRCCS
-
Vietnamese to MRCCS
-
Conclusions
Chinese characters as pivot: more accurate translation than before
Using the same decoder to solve different problems
The neural network language model works
Using Chinese characters as the lasting writing system
-
PACLIC 29 2015@Shanghai
-
Thank you
xie xie