crestec - taus tokyo forum 2015
Research on developing a system to translate
Japanese laws and regulations
CRESTEC Inc.
Yasuhiro SEKINE
April 9, 2015 @ TAUS Executive Forum 2015
Background: Japanese Law Translation (http://www.japaneselawtranslation.go.jp/?re=02)
o Launched in April 2009 by the Ministry of Justice, Japan
o 489 translated laws (as of March 31, 2015)
o More than 100,000 accesses every day
Background: Law Data Providing System (http://law.e-gov.go.jp/cgi-bin/idxsearch.cgi)
o Launched in April, 2001 by Ministry of Internal Affairs and Communications
o Provides about 8000 texts of Japanese laws and regulations (Japanese text only)
Background: Problems
o Only 489 translations of laws are available
  More than 8,000 laws and regulations are currently in effect in Japan
o Most of the translations do not include the latest amendments
  About 100 laws are amended every year in Japan
Translating every law and keeping the translations up to date is costly in both money and human resources.
Solving this problem is one of the motivations for developing a system that uses technology to provide a translation of every law in its latest version.
Background: Japanese Law Machine Translation (http://itrd.crestec.co.jp/jlmt/default_en.aspx)
o Test version released in July 2014
o 8,206 translated laws (as of April 9, 2015)
Overview of the system
• Purpose of the system
o Provide a translation of every law in Japan
→ Collect every Japanese law from the Law Data Providing System and translate them all automatically
o Provide translations of the latest amended versions and keep them updated
→ Collect the latest amended versions of laws and retranslate them constantly
• Functions
o Search by keywords
o Search by category
Resources
• Source texts
o HTML files downloaded from the Law Data Providing System
• Bilingual corpus
o Made from translated laws downloaded from the Japanese Law Translation website
o Used for translation memory and machine translation
• Dictionaries
o Governmental organizations
o Positions of governmental authorities
o Titles of laws
o Place names …etc.
o Compiled by hand
Translation methods
(1) Automatic conversion
Expressions that have definitive translations
Expressions with numbers …etc.
(2) 100% match from translation memory
(3) Automatic post-edit of fuzzy matches
A specific type of fuzzy match from the translation memory is post-edited automatically
(4) Statistical machine translation
Microsoft Translator Hub
Translation order: (1) → (2) → (3) → (4)
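The translation order can be pictured as a cascade in which each segment falls through to the next method only when the previous one fails. The following is a minimal sketch of that idea, not the system's actual code: the function names, the toy dictionary, and the placeholder methods are all assumptions made for illustration.

```python
from typing import Callable, Optional

# Each method returns a translation, or None when it cannot handle the segment.
Method = Callable[[str], Optional[str]]

# Toy stand-ins for the real resources (hypothetical data).
DICTIONARY = {"民法": "Civil Code"}
TRANSLATION_MEMORY = {
    "次の各号のいずれかに該当する者は、五百万円以下の罰金に処する。":
        "A person who falls under any of the following items shall be "
        "punished by a fine of not more than five million yen.",
}


def auto_conversion(segment: str) -> Optional[str]:
    # (1) Dictionary lookup; number conversion is omitted in this sketch.
    return DICTIONARY.get(segment)


def exact_match(segment: str) -> Optional[str]:
    # (2) 100% match from the translation memory.
    return TRANSLATION_MEMORY.get(segment)


def auto_post_edit(segment: str) -> Optional[str]:
    # (3) Placeholder: a real version would post-edit a fuzzy match.
    return None


def smt(segment: str) -> Optional[str]:
    # (4) Placeholder for a call to a statistical MT engine.
    return f"[SMT output for: {segment}]"


METHODS: list[tuple[str, Method]] = [
    ("auto_conversion", auto_conversion),
    ("exact_match", exact_match),
    ("auto_post_edit", auto_post_edit),
    ("smt", smt),
]


def translate_segment(segment: str) -> tuple[str, Optional[str]]:
    """Try methods (1) to (4) in order; report which one produced the result."""
    for name, method in METHODS:
        result = method(segment)
        if result is not None:
            return name, result
    return "unable_to_translate", None


print(translate_segment("民法"))
```

Returning the method name together with the translation makes it possible to record which method produced each segment, which is what per-method highlighting and statistics would need.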
Translation method (1): Automatic conversion
• Expressions that have definitive translations are translated using dictionaries
o Law title (民法 → Civil Code)
o Position (内閣広報官 → Cabinet Public Relations Secretary)
o Organization (林野庁 → Forestry Agency)
o Place (北海道 → Hokkaido) …etc.
• Expressions with numbers are translated by a conversion program
o Law number (昭和五十二年政令第二十号 → Cabinet Order No. 20 of 1977)
o Reference number (第五条第三項第二号 → Article 5, paragraph (3), item (ii))
o Date (昭和五十二年六月八日 → June 8, 1977)
o Price (千五百二十円 → 1,520 yen)
o Age (十八歳 → 18 years of age)
o Weight (二十ミリグラム → 20 milligrams)
o Length (三十キロメートル → 30 kilometers) ...etc.
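As an illustration of what such a conversion program might look like, here is a minimal sketch that parses kanji numerals and converts three of the expression types listed above. It is an assumption-laden example, not the system's implementation: the function names and regular expressions are hypothetical, only Cabinet Orders are handled among law numbers, and first-year notation (元年) is not covered.

```python
import re
from typing import Optional

DIGITS = {"一": 1, "二": 2, "三": 3, "四": 4, "五": 5,
          "六": 6, "七": 7, "八": 8, "九": 9}
SMALL_UNITS = {"十": 10, "百": 100, "千": 1000}
LARGE_UNITS = {"万": 10_000, "億": 100_000_000}
NUM = "[一二三四五六七八九十百千万億]+"

# Offset to add to an era year to get the Gregorian year (e.g. 昭和52 = 1977).
ERA_OFFSETS = {"明治": 1867, "大正": 1911, "昭和": 1925, "平成": 1988}
MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]


def kanji_to_int(text: str) -> int:
    """Convert a kanji numeral such as 千五百二十 to an integer."""
    total = 0    # completed 万 / 億 groups
    section = 0  # value accumulated below the current large unit
    digit = 0    # pending digit waiting for a unit
    for ch in text:
        if ch in DIGITS:
            digit = DIGITS[ch]
        elif ch in SMALL_UNITS:
            section += (digit or 1) * SMALL_UNITS[ch]
            digit = 0
        elif ch in LARGE_UNITS:
            total += (section + digit) * LARGE_UNITS[ch]
            section = 0
            digit = 0
        else:
            raise ValueError(f"not a kanji numeral: {ch!r}")
    return total + section + digit


def convert_price(text: str) -> Optional[str]:
    m = re.fullmatch(f"({NUM})円", text)
    return f"{kanji_to_int(m.group(1)):,} yen" if m else None


def convert_date(text: str) -> Optional[str]:
    m = re.fullmatch(f"(明治|大正|昭和|平成)({NUM})年({NUM})月({NUM})日", text)
    if not m:
        return None
    era, year, month, day = m.groups()
    month_no = kanji_to_int(month)
    if not 1 <= month_no <= 12:
        return None
    gregorian = ERA_OFFSETS[era] + kanji_to_int(year)
    return f"{MONTHS[month_no - 1]} {kanji_to_int(day)}, {gregorian}"


def convert_law_number(text: str) -> Optional[str]:
    # Only Cabinet Orders (政令) are handled in this sketch.
    m = re.fullmatch(f"(明治|大正|昭和|平成)({NUM})年政令第({NUM})号", text)
    if not m:
        return None
    era, year, number = m.groups()
    gregorian = ERA_OFFSETS[era] + kanji_to_int(year)
    return f"Cabinet Order No. {kanji_to_int(number)} of {gregorian}"


print(convert_price("千五百二十円"))
print(convert_date("昭和五十二年六月八日"))
print(convert_law_number("昭和五十二年政令第二十号"))
```

Each converter returns None when the pattern does not match, so a caller can fall through to the next translation method.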
Translation method (2): 100% match from translation memory
o A 100% match in the translation memory is used
o Translation memory is made from Japanese and English XML files downloaded from the Japanese Law Translation website
o Translation memory consists of 273,046 units taken from 489 laws (As of March 31, 2015)
Translation method (3): Automatic post-edit
If there is a fuzzy match and the parts that need correction can be converted as in translation method (1), the translation of the fuzzy match is post-edited automatically.
Source sentence:
次の各号のいずれかに該当する者は、三十万円以下の罰金に処する。
↓ The sentence is abstracted using variables
[次の各号のいずれかに該当する者は、<price>以下の罰金に処する。]
↓ A corresponding abstracted sentence is found in the translation memory
TM: 次の各号のいずれかに該当する者は、五百万円以下の罰金に処する。
A person who falls under any of the following items shall be punished by a fine of not more than five million yen.
↓ The price in the matched translation is replaced with the converted price from the source sentence
次の各号のいずれかに該当する者は、三十万円以下の罰金に処する。
A person who falls under any of the following items shall be punished by a fine of not more than three hundred thousand yen.
The text translated by this method is highlighted in blue on mouse over
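The post-edit steps for the price example can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the system's code: it handles only a single `<price>` variable, assumes the translation memory spells prices out in English words, and all function names are hypothetical.

```python
import re
from typing import Optional

DIGITS = {"一": 1, "二": 2, "三": 3, "四": 4, "五": 5,
          "六": 6, "七": 7, "八": 8, "九": 9}
SMALL = {"十": 10, "百": 100, "千": 1000}
LARGE = {"万": 10_000, "億": 100_000_000}
PRICE_RE = re.compile("([一二三四五六七八九十百千万億]+)円")

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]


def kanji_to_int(text: str) -> int:
    total = section = digit = 0
    for ch in text:
        if ch in DIGITS:
            digit = DIGITS[ch]
        elif ch in SMALL:
            section += (digit or 1) * SMALL[ch]
            digit = 0
        else:  # large unit (万 or 億); the regex admits no other characters
            total += (section + digit) * LARGE[ch]
            section = digit = 0
    return total + section + digit


def below_thousand(n: int) -> str:
    parts = []
    if n >= 100:
        parts.append(ONES[n // 100] + " hundred")
        n %= 100
    if n >= 20:
        parts.append(TENS[n // 10] + ("-" + ONES[n % 10] if n % 10 else ""))
    elif n > 0:
        parts.append(ONES[n])
    return " ".join(parts)


def int_to_words(n: int) -> str:
    """Spell out 0 <= n < 1,000,000,000 in English words."""
    if n == 0:
        return "zero"
    parts = []
    for value, name in ((1_000_000, "million"), (1_000, "thousand")):
        if n >= value:
            parts.append(below_thousand(n // value) + " " + name)
            n %= value
    if n:
        parts.append(below_thousand(n))
    return " ".join(parts)


def abstract(sentence: str) -> tuple[str, Optional[int]]:
    """Replace the first price with <price>; return (template, value)."""
    m = PRICE_RE.search(sentence)
    if not m:
        return sentence, None
    template = sentence[:m.start()] + "<price>" + sentence[m.end():]
    return template, kanji_to_int(m.group(1))


def post_edit(source: str, tm: dict[str, str]) -> Optional[str]:
    template, value = abstract(source)
    if value is None:
        return None
    for tm_source, tm_target in tm.items():
        tm_template, tm_value = abstract(tm_source)
        if tm_template != template or tm_value is None:
            continue
        old = int_to_words(tm_value) + " yen"
        if old in tm_target:
            return tm_target.replace(old, int_to_words(value) + " yen")
    return None


TM = {
    "次の各号のいずれかに該当する者は、五百万円以下の罰金に処する。":
        "A person who falls under any of the following items shall be "
        "punished by a fine of not more than five million yen.",
}
print(post_edit("次の各号のいずれかに該当する者は、三十万円以下の罰金に処する。", TM))
```

A production version would more likely abstract the translation memory once, at build time, and index it by template rather than scanning every entry per segment.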
Translation method (4): Statistical machine translation
Microsoft Translator Hub
o The translation memory used in translation methods (2) and (3) is used as training data.
o The dictionaries used in translation method (1) are also used
(This does not seem to be working, though…)
o BLEU score: 36.59 (17.92 higher than Microsoft’s general-domain system)
The text translated by this method is highlighted in yellow on mouse over
Statistics: Proportion of translation methods by segment
Method                     No. of segments
1) Auto conversion               923,270 (20%)
2) 100% match                  1,737,505 (40%)
3) Auto post-edit                121,511 (3%)
4) SMT                         1,640,607 (37%)
5) Unable to translate               239
Total                          4,423,132
Statistics: Proportion of translation methods by character
Method                     No. of characters
1) Auto conversion             6,387,268 (5%)
2) 100% match                 21,453,627 (17%)
3) Auto post-edit              3,428,573 (3%)
4) SMT                        95,468,688 (75%)
5) Unable to translate           563,782
Total                        127,301,938
Quality of the translation
A score that roughly indicates the quality of each translation is calculated from the proportion of the translation methods used in it, and the search results are ordered by that score. The score is displayed as “★” symbols in the search results.
Higher score
  (1) Automatic conversion
  (2) 100% match
  (3) Automatic post-edit
  (4) Statistical machine translation
Lower score
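The slides do not give the scoring formula, so the following is only one way such a score could be computed: a weighted average of the share of characters translated by each method. The weights, the five-star scale, and all names here are purely illustrative assumptions; only the ordering of the methods from higher to lower comes from the slide.

```python
import math

# Hypothetical weights: only their ordering reflects the slide.
WEIGHTS = {
    "auto_conversion": 1.0,
    "exact_match": 0.9,
    "auto_post_edit": 0.7,
    "smt": 0.3,
    "unable": 0.0,
}


def quality_score(char_counts: dict[str, int]) -> float:
    """Weighted share of characters per method, in the range 0.0 to 1.0."""
    total = sum(char_counts.values())
    if total == 0:
        return 0.0
    weighted = sum(WEIGHTS[method] * count
                   for method, count in char_counts.items())
    return weighted / total


def stars(score: float, max_stars: int = 5) -> str:
    """Render a score as a row of stars (at least one star)."""
    return "★" * max(1, math.ceil(score * max_stars))


# Two made-up laws: one mostly covered by the translation memory,
# one mostly translated by statistical machine translation.
law_a = {"auto_conversion": 200, "exact_match": 700, "smt": 100}
law_b = {"auto_conversion": 50, "auto_post_edit": 50, "smt": 900}

ranked = sorted([("law_a", law_a), ("law_b", law_b)],
                key=lambda item: quality_score(item[1]), reverse=True)
for name, counts in ranked:
    print(name, stars(quality_score(counts)))
```

With any weights that decrease from method (1) to method (4), a law translated mostly from the translation memory ranks above one translated mostly by statistical machine translation.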
Challenges for the future
• Quality evaluation of translations
o How understandable are they?
• Increasing the reusability of the translation memory
o Add more variables
o Add translations that are likely to be highly reusable to the translation memory
• Improving the quality of statistical machine translation
o Use dictionaries
o Try other settings
o Try other MT engines
• Applying this system to other documents
o Municipal laws
o School rules
o Internal company rules
Acknowledgements
Japan Legal Information Institute,
Graduate School of Law, Nagoya University
Prof. Tomoko Masuda, Prof. Yoshiharu Matsuura,
Prof. Katsuhiko Toyama, Prof. Tokuyasu Kakuta,
Prof. Yasuhiro Ogawa, Prof. Makoto Nakamura,
and other professors and researchers
Thank you!
Japanese Law Machine Translation
http://itrd.crestec.co.jp/jlmt/
Yasuhiro SEKINE, CRESTEC Inc.