
G. Zhou et al. (Eds.): NLPCC 2013, CCIS 400, pp. 346–354, 2013. © Springer-Verlag Berlin Heidelberg 2013

Research on Semantic-Based Passive Transformation in Chinese-English Machine Translation

Wenfei Chang, Zhiying Liu, and Yaohong Jin

Institute of Chinese Information Processing, Beijing Normal University, Beijing, China [email protected], {liuzhy,jinyaohong}@bnu.edu.cn

Abstract. Passive voice is widely used in English but much less so in Chinese, and it is especially prevalent in English patent documents. This difference means that voice must often be transformed in Chinese-English machine translation to make the output fluent and natural. Previous studies in this field are based on statistics, but their results are not very good. In this paper we propose a strategy for Chinese-English passive voice transformation from a semantic perspective. By analyzing sentences, we summarize a series of transformation rules and then test them in our system. Experimental results show that the transformation rules achieve a precision of 89.1% overall.

Keywords: passive voice, patent documents, machine translation, transformation rules.

1 Introduction

Voice refers to the expression of the relationship between a verb and a noun phrase in a language [1]. It includes two types: active voice and passive voice. Active voice indicates that the subject is the agent of the action; passive voice means that the subject is the patient of the action. Both Chinese and English have passive sentences, but they differ considerably in grammatical concept, structural form, typical usage, and semantic roles. In English, the passive voice is used when the agent is unknown, is inconvenient to state explicitly, or can be inferred from the context. The passive voice is also adopted when the sentence emphasizes the event or action itself rather than the agent. In Chinese, however, the active voice is used in most cases, except when the sentence expresses unhappiness or dissatisfaction. As a result, passive voice is widely used in English but less used in Chinese. These differences require us to transform the voice in order to make the translation result smoother and more natural.

With the rapid development of the world economy, technical knowledge is being updated faster than ever. According to the World Intellectual Property Organization (WIPO), patent applications have increased year by year and reached 1.8 million in 2010. Most applications come from China or Europe and take effect in these areas. To better protect the interests of applicants, several major intellectual property offices are actively exploring how to improve the quality of machine translation.


Patent documents are official, legal documents that tend to follow fixed formats, which makes them suitable for machine translation (MT). However, present MT systems do not have a good strategy for passive transformation, which greatly degrades the overall quality of MT.

The Writing Center of the University of Delaware compiled statistics showing that the passive form accounts for 65% of all predicate verbs in science and technology writing [2]. According to [3], passive voice is one of the most important characteristics of English: one third, or even more than one half, of the verbs in science and technology texts appear in the passive voice. In [4], 500 Chinese-English bilingual abstracts of patent documents were analyzed, and the passive voice was absent from only 22 of the English abstracts. That means more than 95% of the English patent abstracts use passive voice. It is therefore essential to explore passive translation methods in Chinese-English patent machine translation.

The remainder of this paper is organized as follows. We discuss related work in Section 2. Section 3 presents a semantic analysis of the passive voice, and Section 4 describes the transformation process. The experiments and discussion are presented in Section 5. Finally, Section 6 gives a conclusion and outlines future work.

2 Related Work

Research on passive transformation falls mainly into two fields: traditional linguistics and information processing.

In traditional linguistics, many papers have recognized that passive voice is widely used in English, especially in science and technology. Some researchers [5][6] have observed that only transitive verbs can be used in the passive voice; in addition, the verb must express an action and be followed by an object. The differences between English and Chinese are analyzed in [2], whose authors propose that translation should follow the habits of the language and transform the voice as much as possible. They also present six methods for transforming voice, but they mainly address translation from English into Chinese. The similarities and differences between the constituents of Chinese and English passive sentences are discussed in [7]. Its authors describe the situations in which voice should be transformed by analyzing the features of the subject, object, predicate, or passive preposition in the sentence.

Although these studies examine passive transformation in depth, most of them take the perspective of the human translator rather than the machine, so they do not apply directly to machine translation.

In the information processing field, some researchers have put forward translation methods from the perspective of lexical semantics and syntactic structure [8][9], and [10] presents a method for handling passive transformation based on Case Grammar. However, related work in this field is still limited.


Moreover, most present MT systems are based on statistics. Among them, Google Translator (Google for short) is the best. We therefore selected some sentences from patent documents and submitted them to Google to check its performance.

Example 1 根据各齿轮的旋转，夹持光盘，并装载托盘12。
Reference¹: In accordance with the revolutions of the combined gears, an optical disk [is chucked], and a tray [is loaded].
Google: According to the rotation of the gears, the [clamping] disc, and the [loading] tray 12.

Example 2 这个字通过在光学领域内执行逐个比特的布尔“与”运算来识别。
Reference: The word [is recognized] by carrying out in the optical domain a bit-wise Boolean “AND” operation.
Google: The word through the implementation of the bit by bit in the optical field within the Boolean “and” operation to [identify].

The subject is omitted in Example 1, and the object is omitted in Example 2. In these cases, the bracketed words should be transformed into passive voice according to English usage, but the results show that Google failed to transform them. After testing several kinds of sentences, we find that the accuracy of passive transformation is low. Although the statistical method is the mainstream, it does not currently have a good strategy for passive transformation. The results suggest that it is difficult to achieve good results without syntactic and semantic analysis when translating long patent sentences.

Hence, in this paper, we propose from a semantic perspective a systematic processing strategy composed of a series of rules tailored to the features of patent documents, which greatly improves the quality of MT.

3 Semantic Analysis of the Passive Voice

In English, the structure “be + V-ed” marks a sentence as passive. In Chinese, however, many passive meanings are expressed in active form, so deciding whether a sentence should be translated into a passive sentence in a Chinese-English MT system cannot rely on the passive mark alone; the semantics of the sentence must also be examined. Sentences with a passive mark are only one kind of sentence that should be transformed; many kinds of sentences without a passive mark should also be transformed in translation. All of them should use passive voice when translated into English. Different transformation methods are adopted depending on whether a passive mark can be found in the sentence.

3.1 Sentences with Passive Mark in Chinese

In Chinese, the prepositions BEI and SUO are used to mark the passive voice, but they differ in usage.

1 The bilingual corpus is provided by China Patent Information Center.


• Passive Mark BEI

BEI is an unconditional transformation mark: whenever BEI is found before a verb in the sentence, the passive voice is used in the English translation, regardless of whether BEI is immediately adjacent to the verb.

1) Patient + BEI + Verb: In this kind of sentence, BEI is immediately before the verb with nothing between them. The order of the language blocks in the sentence remains unchanged when translated into English.

Example 3 因此提交订单的交易者将被通知成交。(Thereby the trader that sent in the order will be informed about the deal.)

2) Patient + BEI + … + Verb: In this kind of sentence, an agent, an adverb, or other components may appear between BEI and the verb. The order of the language blocks remains unchanged here, too.

Example 4 如图中可见的，排列单元被匹配单元分离并连接到输入机构3。(As can be seen in the figure the ranking unit is separated by the matching unit and connected to the input mechanism 3.)

• Passive Mark SUO

SUO is also a mark of the passive voice. Unlike BEI, SUO does not allow any component between it and the verb. Therefore, if SUO is located immediately before a verb in Chinese, the verb should be transformed into passive form when translated into English.

Example 5 因此，它不需要处理在第一排列单元所接收的并且不是最优排列的订单。(Hence, it does not need to handle the order that was received at the first ranking unit and which was not top ranked.)
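The difference between the two marks is only how far they may sit from the verb, which can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the paper: the sentence is assumed to be already segmented into words, the index of the verb is assumed to be known, the function name `find_passive_mark` is hypothetical, and the word lists are hand-segmented fragments of Examples 3 to 5. The sketch also ignores the case where BEI belongs to a different verb in a multi-verb sentence.

```python
from typing import List, Optional


def find_passive_mark(words: List[str], verb_index: int) -> Optional[str]:
    """Return "SUO", "BEI", or None for the verb at verb_index.

    Hypothetical helper: `words` is an already segmented sentence.
    """
    # SUO counts only when it sits immediately before the verb.
    if verb_index > 0 and words[verb_index - 1] == "所":
        return "SUO"
    # BEI counts anywhere before the verb, adjacent or not.
    if "被" in words[:verb_index]:
        return "BEI"
    return None


# Example 3: BEI immediately before the verb 通知.
example3 = ["交易者", "将", "被", "通知", "成交"]
# Example 4: an agent (匹配单元) stands between BEI and the verb 分离.
example4 = ["排列单元", "被", "匹配单元", "分离"]
# Example 5: SUO immediately before the verb 接收.
example5 = ["第一排列单元", "所", "接收", "的", "订单"]

print(find_passive_mark(example3, 3))  # BEI
print(find_passive_mark(example4, 3))  # BEI
print(find_passive_mark(example5, 2))  # SUO
```

A verb with neither mark returns `None`, which corresponds to the sentences without a passive mark.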

3.2 Sentences without Passive Mark in Chinese

Through a statistical analysis of 1000 sentences, we find that sentences that have no passive mark but should be transformed into passive voice in translation account for as much as 61%. The data are shown in Table 1.

Table 1. Classification of Passive Sentence

Type                              Sentence number   Proportion
Sentences with passive mark       390               39%
Sentences without passive mark    610               61%

From the table we can see that most of the passive sentences have no passive mark in Chinese, so it is difficult for MT systems to recognize the passive meaning and transform the verb into passive voice when translating. Although such sentences are difficult to identify, handling them plays an important role in raising the transformation accuracy. Consequently, they are the focus of our research.

Our research is based on the Hierarchical Network of Concepts theory (HNC theory) [11], a theory of natural language understanding from a semantic perspective. HNC views language processing as a mapping from the natural language space to the language concept space. Language concepts can be divided into two categories: action concepts (denoted GX) and effect concepts (denoted GY), where the action is the cause and the effect is the result [12]. According to the concept category of the main verb, sentences are classified into two categories: global action sentences and global effect sentences. The work in this section is based on this division.

• Action Sentence

The verb in a global action sentence mainly expresses that one participant exerts a force on another. Generally speaking, sentences of this category need not be transformed into passive voice if their components are complete. But when a component is omitted, or when a preposition is immediately adjacent to the main verb, the sentence should be transformed into passive voice.

Component ellipsis in sentence. The complete sentence structure is SVO in both Chinese and English. However, sentences without a subject or an object are frequent in Chinese, in which case the structure becomes “V+NP” or “NP+V”. In these structures, NP acts as the patient of the action, so the sentence should be transformed into passive voice when translated into English.

“Verb+Prep” structure in sentence. The compound structure formed by the main verb and an immediately adjacent preposition is used to describe an objective phenomenon. The subject in this kind of sentence no longer acts as the agent but as the patient of the action, so the sentence should be transformed into passive voice when translating.

• Effect Sentence

Unlike action sentences, effect sentences have no agent or patient; they describe an objective phenomenon. But when the verb expresses a strong resultative meaning, the word itself implies an agent, so it should be translated into passive voice, too. To handle this situation, we add a property “ALL_PASS” to the knowledge base to provide information for the MT system. As long as the main verb has the property “ALL_PASS[Y]”, it is transformed into passive voice in the translation process.

4 Transformation Rules and Algorithm

Based on the situations described above, we draw up a series of rules for passive voice transformation in the MT system.

4.1 Transformation Rules

• Transformation with Passive Mark in Chinese

There are mainly two rules in this part according to [13].


Rule 1: (b){(-1)CHN[被]}+(0)LC_CHK[E,EG,EP]=>DEL_NODE(-1)+COPY[-1,0]+(0){VOI=P}$

Rule 2: (-1)CHN[所]&LC_CHK[QE]+(0)LC_CHK[E,EG,EP]=>DEL_NODE(-1)+(0){VOI=P}$

Rule 1 means that if the preposition BEI (被) is found by the search (b)² anywhere before E, EG, or EP³, regardless of whether it is immediately adjacent to node 0, then BEI (被) is deleted, the components between BEI (被) and node 0 are copied, and node 0 is transformed into passive voice.

Example 6 一条指定水平线的像素数据的扫描级被有次序地存储在一个地址存储器中。(A scanning level of pixel data for a given horizontal line is regularly stored in an address memory.)

Rule 2 means that if SUO (所) acts as QE⁴ and is immediately adjacent to node 0, then SUO (所) is deleted and node 0 is transformed into passive voice.

Example 7 图像传感器装置所测定的色彩范围取决于光源的色彩。(The range of colors measured by an image sensor device depends on the color of the illuminant.)
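To make the node operations in Rules 1 and 2 concrete, the following Python sketch applies them to a flat list of nodes. It is an illustration only and does not reproduce the rule engine of the system described in [13]: the `Node` class, the function names, and the way chunk labels are stored are all assumptions, and the node lists are hand-built fragments of Examples 6 and 7.

```python
from dataclasses import dataclass, field, replace
from typing import Dict, List

VERB_CHUNKS = {"E", "EG", "EP"}  # chunk labels that Rules 1 and 2 accept at node 0


@dataclass
class Node:
    text: str
    chunk: str = ""                              # hypothetical chunk label
    attrs: Dict[str, str] = field(default_factory=dict)


def _passive(node: Node) -> Node:
    """Return a copy of the node with voice set to passive (VOI=P)."""
    return replace(node, attrs={**node.attrs, "VOI": "P"})


def apply_rule1(nodes: List[Node], verb_idx: int) -> List[Node]:
    """Rule 1 sketch: delete the nearest 被 before the verb, keep what lies between."""
    if nodes[verb_idx].chunk not in VERB_CHUNKS:
        return list(nodes)
    for i in range(verb_idx - 1, -1, -1):        # scan the nodes before the verb
        if nodes[i].text == "被":
            between = nodes[i + 1:verb_idx]      # COPY[-1,0]
            return nodes[:i] + between + [_passive(nodes[verb_idx])] + nodes[verb_idx + 1:]
    return list(nodes)


def apply_rule2(nodes: List[Node], verb_idx: int) -> List[Node]:
    """Rule 2 sketch: delete 所 (labeled QE) only when it directly precedes the verb."""
    if verb_idx == 0 or nodes[verb_idx].chunk not in VERB_CHUNKS:
        return list(nodes)
    prev = nodes[verb_idx - 1]
    if prev.text == "所" and prev.chunk == "QE":
        return nodes[:verb_idx - 1] + [_passive(nodes[verb_idx])] + nodes[verb_idx + 1:]
    return list(nodes)


# Fragment of Example 6: an adverb stands between 被 and the verb 存储.
ex6 = [Node("扫描级"), Node("被"), Node("有次序地"), Node("存储", "E"), Node("在一个地址存储器中")]
# Fragment of Example 7: 所 directly precedes the verb 测定.
ex7 = [Node("图像传感器装置"), Node("所", "QE"), Node("测定", "E"), Node("的色彩范围")]

out6 = apply_rule1(ex6, 3)
out7 = apply_rule2(ex7, 2)
print([n.text for n in out6], out6[2].attrs)
print([n.text for n in out7], out7[1].attrs)
```

Both functions return a new list and leave the input nodes unchanged, which keeps each rule easy to test in isolation.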

• Transformation without Passive Mark in Chinese

In action sentences, different transformation rules apply to different situations. Several examples are given below.

Rule 3: (-1){BEGIN%}+(b){!LC_CHK[GBK]}+(0){LC_CHK[E,EG,EP]&LC_SC_KEY[GX]&!CHN[使,具有,使得]}+(1)LC_CHK[GBK]=>(-1)+COPY[-1,0]+(1)+(0){VOI=P}$

Rule 3 means that if the verb belongs to [GX]⁵, is not one of the words “使”, “具有”, or “使得”, and no GBK⁶ is found before it, then node (1) is moved in front of the verb and the verb is transformed into passive voice in translation.

Example 8 在外壳118中在叶片120的径向向内的位置处形成环形凹槽122。(An annular recess 122 is formed in housing 118 radially inward of blade 120.)

Rule 4: (b){(-1)BEGIN%}+(b){!LC_CHK[L0]}+(0)LC_CHK[E,EG,EP]&LC_SC_KEY[GX]+(1){END%}=>(-1)+COPY[-1,0]+(0){VOI=P}+(1)$

Rule 4 means that if the verb belongs to [GX], no L0⁷ is found before it, and it is located at the end of the sentence, then the verb is transformed into passive voice.

2 (b) means looking for something forward.
3 E, EG, and EP are HNC terms for the verb in a sentence.
4 QE is an HNC term for the modifier of E.
5 GX means action concept.
6 GBK is short for general object chunk.
7 L0 is an HNC term for the mark of the main semantic chunk.


Rule 5: (0)LC_CHK[E]+(1)CHN[至,到,给,于,成]&LC_CHK[HV]=>(0){VOI=P}+DEL_NODE(1)+ADD_NODE(ENG=[to])$

Rule 5 means that if a preposition immediately follows E and acts as HV⁸, then the verb is transformed into passive voice and HV is replaced by the English word “to” in translation.

Example 9 在步骤505中，已标准化的像素数据子集投射到色空间子集中。(In step 505, the normalized pixel data subset is projected into the color space subset.)
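Rule 5 combines a voice change with a node substitution, which the following Python sketch tries to make explicit. It is only an approximation of the rule as described in the text: the `Node` class and the function name `apply_rule5` are hypothetical, and placing the new “to” node at the position of the deleted HV node is an assumption, since the rule notation does not state where ADD_NODE inserts it.

```python
from dataclasses import dataclass
from typing import List, Optional

HV_PREPOSITIONS = {"至", "到", "给", "于", "成"}  # the words listed in Rule 5


@dataclass
class Node:
    text: str
    chunk: str = ""                # hypothetical chunk label, e.g. "E" or "HV"
    voice: str = "A"               # "A" active, "P" passive (VOI=P)
    english: Optional[str] = None  # fixed English output, if any


def apply_rule5(nodes: List[Node], verb_idx: int) -> List[Node]:
    """Rule 5 sketch: E followed directly by an HV preposition becomes passive + "to"."""
    nxt = verb_idx + 1
    if nodes[verb_idx].chunk != "E" or nxt >= len(nodes):
        return list(nodes)
    follower = nodes[nxt]
    if follower.text in HV_PREPOSITIONS and follower.chunk == "HV":
        verb = Node(nodes[verb_idx].text, "E", voice="P")  # (0){VOI=P}
        to_node = Node(text="", english="to")              # ADD_NODE(ENG=[to])
        # DEL_NODE(1): the HV node is dropped; "to" takes its place (assumed position).
        return nodes[:verb_idx] + [verb, to_node] + nodes[nxt + 1:]
    return list(nodes)


# Hand-built fragment of Example 9: the verb 投射 is followed directly by 到.
ex9 = [Node("已标准化的像素数据子集"), Node("投射", "E"), Node("到", "HV"), Node("色空间子集中")]
out9 = apply_rule5(ex9, 1)
print([(n.text, n.voice, n.english) for n in out9])
```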

In effect sentences, we use the information in the knowledge base to determine whether to transform the voice. One rule is used to invoke this information.

Rule 6: (0)LC_CHK[E,EG,EP]&LC_SC_KEY[ALL_PASSIVE]=>(0){VOI=P}$

Rule 6 means that if the verb has been labeled with the tag “ALL_PASSIVE” in the knowledge base, it will be transformed into passive voice.

Example 10 具有预定形状的反光板形成于一下壳体中。(A reflection plate with a predetermined shape is formed inside a lower casing.)

4.2 Algorithm

Based on the features of the transformation rules, we design the following procedure for semantically transforming the passive voice in the MT system:

Step 1: Determine whether there is a passive mark in the Chinese sentence. If yes, go to step 6; if no, go to step 2.

Step 2: Determine the concept category of the predicate verb. If GX, go to step 3; if GY, go to step 5.

Step 3: Determine whether there is a component ellipsis in the sentence. If yes, go to step 6; if no, go to step 4.

Step 4: Determine whether the sentence contains the “Verb + Prep” structure. If yes, go to step 6; if no, stop.

Step 5: Determine whether the main verb has the property ALL_PASS[Y]. If yes, go to step 6; if no, stop.

Step 6: Transform the verb into passive voice.
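The six steps amount to a small decision procedure, sketched below in Python. The sketch assumes that the earlier analysis stages have already produced the answers to each question; the `SentenceInfo` fields and the function name `should_passivize` are hypothetical names chosen for illustration, not part of the system described in the paper.

```python
from dataclasses import dataclass


@dataclass
class SentenceInfo:
    """Hypothetical summary of what the analysis found for one sentence."""
    has_passive_mark: bool = False        # BEI or SUO found (step 1)
    verb_category: str = "GX"             # "GX" action or "GY" effect (step 2)
    has_component_ellipsis: bool = False  # subject or object omitted (step 3)
    has_verb_prep: bool = False           # "Verb + Prep" structure (step 4)
    verb_all_pass: bool = False           # ALL_PASS[Y] in the knowledge base (step 5)


def should_passivize(s: SentenceInfo) -> bool:
    """Return True when the procedure reaches step 6 (transform to passive)."""
    if s.has_passive_mark:                # step 1
        return True
    if s.verb_category == "GX":           # step 2, action branch
        if s.has_component_ellipsis:      # step 3
            return True
        return s.has_verb_prep            # step 4
    if s.verb_category == "GY":           # step 2, effect branch
        return s.verb_all_pass            # step 5
    return False                          # unknown category: leave the voice alone


print(should_passivize(SentenceInfo(has_passive_mark=True)))        # True
print(should_passivize(SentenceInfo(has_component_ellipsis=True)))  # True
print(should_passivize(SentenceInfo(verb_category="GY")))           # False
```

Keeping the decision separate from the rewriting rules means each rule only has to perform the transformation once the decision has been made.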

5 Experiments and Result Analysis

5.1 Experiments

In this experiment, we randomly selected 1000 sentences and submitted them to our rule-based system (RB for short) to test the transformation performance. We also tested them in Google. Three types of data were counted; the detailed figures are shown in Table 2.

8 HV is a terminology in HNC which means the verb suffix.


Table 2. Types of data

Type     Total number   Should be transformed   Transformed   Correctly transformed
RB       1000           632                     540           481
Google   1000           632                     515           430

Then, the Precision (P) and Recall (R) are calculated, and the results are shown in Table 3:

Table 3. Result of transformation

System   Precision   Recall
RB       89.1%       76.1%
Google   83.4%       68.1%
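The text does not spell out the formulas, so the following Python sketch assumes the standard definitions: Precision is the number of correctly transformed sentences divided by the number actually transformed, and Recall is the number of correctly transformed sentences divided by the number that should be transformed. Under that assumption, the RB counts from Table 2 reproduce the RB row of Table 3. The function names are hypothetical.

```python
def precision(correct: int, transformed: int) -> float:
    """Share of the transformed sentences that were transformed correctly."""
    return correct / transformed


def recall(correct: int, should_transform: int) -> float:
    """Share of the sentences needing transformation that were handled correctly."""
    return correct / should_transform


# RB counts from Table 2: 632 should be transformed, 540 transformed, 481 correct.
rb_precision = precision(481, 540)
rb_recall = recall(481, 632)

print(f"RB precision: {rb_precision:.1%}")  # RB precision: 89.1%
print(f"RB recall:    {rb_recall:.1%}")     # RB recall:    76.1%
```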

From Table 3 we can see that our system achieves higher Precision and Recall than Google, with a Precision as high as 89% overall. The results indicate that our method can effectively improve translation performance in a Chinese-English machine translation system.

5.2 Result Analysis

Although our system has achieved good results, there is still room for improvement. By analyzing the errors in the results, we find four main reasons: a) The rules do not cover all kinds of linguistic phenomena. b) In effect sentences, the passive voice transformation relies mainly on the information in the knowledge base, so if a verb has been wrongly given the property “ALL_PASS[Y]”, it will be wrongly transformed. c) Our work is based on the verb; if the verb is wrongly recognized in the sentence, the sentence will not match the right transformation rule. This is the main reason for the low Recall. d) The system may leave some sentences unanalyzed, so the transformation cannot proceed.

6 Conclusions and Future Work

Passive voice is widely used in English patent documents but less used in Chinese, which makes it an important problem in Chinese-English machine translation. In this paper, guided by HNC theory, we first classify sentences into two types, those with a passive mark in Chinese and those without, and then analyze them in detail. Sentences without a passive mark are our focus; for these, we further analyze which action sentences and which effect sentences should be transformed in translation. By analyzing a large number of bilingual sentences, we derived the transformation rules and then tested them in our system. Results show that the precision of our system reaches 89.1%.

In the future, in view of the causes of the errors, we will investigate more sentences in order to supplement and refine the existing rules. We will also further improve the related information in the knowledge base.

Acknowledgements. This work was supported by the Hi-Tech Research and Development Program of China (2012AA011104), and the Fundamental Research Funds for the Central Universities.

References

1. Richards, J.C., Schmidt, R.W.: Longman Dictionary of Language Teaching and Applied Linguistics, 3rd edn. Foreign Language Teaching and Research Press, Beijing (2005)

2. Man, B., Zijuan, S., Shengtao, Z.: A method of translating English passive voice into Chinese. Journal of Guangdong Mechanical Institute 14(2) (June 1996)

3. Bin, L.: The comparative approach to the translation of English typical patterns in MT software. Southwest Jiaotong University, 5 (2004)

4. Zhiying, L., Yaohong, J.: Passive sentence transformation in Chinese-English patent machine translation. The Journal of China Universities of Posts and Telecommunications 19(suppl. 2), 135–139 (2012)

5. Baoyu, B.: A discussion on English voice transformation. Journal of Daqing College 16(3) (August 1996)

6. Yongxin, Z.: Comparison of Chinese and English passive structure. Foreign Language Teaching (February 1983)

7. Wenhua, X.: Comparison of passive sentences in Chinese and English. Language Teaching and Linguistic Studies (April 1983)

8. Yaohong, J., Zhiying, L.: Improving Chinese-English patent machine translation using sentence segmentation. In: IEEE 7th International Conference on Natural Language Processing and Knowledge Engineering (NLP-KE 2011), Tokushima, Japan, pp. 620–625 (2011)

9. Nunberg, G.: The Linguistics of Punctuation. CSLI Lecture Notes, No. 18, Stanford CA (1990) (July 2012); Bai, X., Zhan, W.: Constraints of BEI and process of English passive in machine translation, New expansion of Chinese passive expression, 1–17 (2006)

10. Jian, L., Bingxi, W., Yonghui, G.: Rule-Based Converter and Generation in English-Chinese MT System. In: The 2nd National Conference on Computational Linguistics for Students, pp. 390–393 (2004)

11. Zengyang, H.: Hierarchical Network of Concepts (HNC) Theory. Tsinghua University Press (1998)

12. Chuanjiang, M.: HNC (hierarchical network of concepts) theory introduction. Tsinghua University Press, Beijing (2005)

13. Yun, Z., Yaohong, J.: A Chinese-English patent machine translation system based on the theory of hierarchical network of concepts. The Journal of China Universities of Posts and Telecommunications 19(suppl. 2), 140–146 (2012)
