someinterestingdata - nanyang technological...
Overview
➣ Japanese WordNet
➢ 60% coverage of the English WordNet (with Pictures)
➣ Tanaka Corpus
➣ NICT Multilingual Corpus
➣ NTU Multilingual Corpus
➣ Bracket-Dic
HG-251 1
WordNet Overview
➣ We are building an open Japanese WordNet,
inspired by the Princeton WordNet of English
➣ Version 1.1 now available
➢ nlpwww.nict.go.jp/wn-ja
∗ 49,190 Synsets
∗ 85,966 Words
∗ 156,684 Senses
∗ Illustrations for 541 Synsets
➢ Semantic structure based on Princeton WordNet
Overview
➣ Recently Added
➢ Japanese Definitions and Examples
➢ Links to other resources
➣ Still being extended
➢ Revised Structure
➢ Sense Tagged Corpora
➣ Imperfect Version Available now
Release Early, Release Often
Release Formats
➣ SynSet Word pairs (TAB separated)
➣ English and Japanese combined in sqlite3 database
➢ Includes sense links and ancestor table
➢ Perl module for manipulating it
➢ Python: http://subtech.g.hatena.ne.jp/y_yanbe/20090314/p2
➣ SynSet Illustration Pairs
➣ WordNet-LMF (xml)
➣ Online lookup (NICT, MLSN, LangGrid, Kyoto Project, . . . )
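A minimal sketch of reading the TAB-separated SynSet Word pairs. The column order (synset ID, then word) and the sample IDs are assumptions for illustration; check them against the released file before relying on this.

```python
import csv
import io
from collections import defaultdict

def load_synset_words(lines):
    """Group words by synset from 'synset<TAB>word' rows (assumed layout)."""
    synsets = defaultdict(list)
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) < 2:
            continue  # skip blank or malformed rows
        synset_id, word = row[0], row[1]
        synsets[synset_id].append(word)
    return dict(synsets)

# Hypothetical sample rows; the synset IDs are placeholders, not real data.
sample = io.StringIO("syn-0001\t蝙蝠\nsyn-0001\tコウモリ\nsyn-0002\tバット\n")
pairs = load_synset_words(sample)
print(pairs)
```

Any iterable of lines works as input, so an open file handle can be passed in place of the `io.StringIO` sample.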
Illustrations
➣ 849 illustrations (541 synsets)
➢ Tagged as OK (default), weird, or best (iff there is more than one
illustration)
➢ An illustration illustrates its hypernyms
➣ SVG images (include metadata)
➣ From the Open ClipArt Library (public domain)
➣ Many more untagged images (11,209 in the 2009-01-10 snapshot)
Illustration Example
dir       animals/mammals/       recreation/sports/
basename  bat_orlando_karam      cricket_bat
title     bat                    Cricket Bat
tags      bat, mammal, animal    sports, cricket, recreation
synset    bat#n#1                cricket bat#n#1, bat#n#4
Ja        蝙蝠                   バット
match     hypernym               monosemous
          bat ⊂ mammal
Text Annotation
➣ Base the sense inventory on actual usage
➣ Obtain sense frequencies
➣ Annotate data for WSD
Corpus    Sentences  Words      Content Words  Translations
Semcor    12,842     224,260    120,000        En, It, Ja
Glosses   165,977    1,468,347  459,000        En, Ja, . . .
Kyoto     38,383     969,558    527,000        Ja, En, Zh
Table 1: Corpora to be Sense Tagged
Translating Glosses
➣ Translated Glosses and Examples in the Princeton WordNet
➢ Can be used for the Japanese WordNet
➢ Useful for unsupervised WSD (Lesk)
➢ Freely redistributable (aligned to: En, Ko, Es, . . . )
➣ Sense Tagged as the Princeton WordNet Gloss Corpus
➣ Definition for Seal-アザラシ
➢ “any of numerous marine mammals that come on shore to
breed; chiefly of cold regions”
➢ 「繁殖のために岸に上がる海洋性哺乳動物の各種;主に寒帯地域に」
Tanaka Corpus
➣ http://tatoeba.org/
➣ Aligned short sentences
➢ 150,000 English
➢ 150,000 Japanese
➢ 19,000 Chinese
➣ Can be used to find correspondences
➣ Petter is using it to find translation rules
http://172.21.171.235/~petterha/comp-mrs/overview.html
Trilingual Data
➣ Use NICT multilingual corpus (JEC)
➢ crosslingual links narrow the interpretations
➣ The result is a cheaply tagged corpus
Ja: 委員長として党の結束を大切にしたい
En: As the chairperson, I would like to regard the unity of the party as important.
Zh: 作为委员长,我希望维护党内团结。
Multilingual WSD
➣ English
➢ party1 “an organization to gain political power”
➢ party2 “a group of people gathered together for pleasure”
➢ party3 “a band of people associated temporarily in some
activity”
➢ party4 “an occasion on which people can assemble for
social interaction”
➣ Japanese
➢ 党1 “an organization to gain political power”
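The narrowing step can be sketched as a set intersection: each aligned word contributes its candidate senses, and only the senses shared by all of them survive. The sense labels follow the slide; the function name and the dictionary layout are hypothetical.

```python
def narrow_senses(candidates_by_word):
    """Intersect the candidate sense sets of words aligned across languages."""
    sense_sets = [set(senses) for senses in candidates_by_word.values()]
    if not sense_sets:
        return set()
    return set.intersection(*sense_sets)

# Candidate senses from the slide: English "party" is ambiguous, Japanese 党 is not.
candidates = {
    "en:party": {"party1", "party2", "party3", "party4"},
    "ja:党": {"party1"},
}
remaining = narrow_senses(candidates)
print(remaining)
```

An empty result would mean the aligned words share no sense in the inventory, which is a useful signal of either a bad alignment or a missing sense.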
Conclusion
➣ Created the Japanese WordNet
➢ Usable
➢ Accessible
➣ Similar information available for Chinese
➣ Can be used in assignment two
NICT multilingual corpus
➣ Aiming to have 10 million sentences of parallel text
➢ 2-3 million Ja-Zh
➢ Remainder Ja-En
➢ Small amount of other languages
➣ Make it as free as copyright allows
➢ Used for SMT, EBMT
➢ MASTAR - tourism, manga, JC - scientific
➣ Cathedral and Bazaar test corpus (Language Grid)
➢ En, Zh, Ja, Ko, Fr, Es, De, It, Pt
NTU Multilingual Corpus
➣ aligned corpus of Chinese, English, Malay, Tamil
and other languages if possible
➢ get data from government publications
national and local
➢ many data cleansing issues — pdf2txt, pictures, ...
NTU Multilingual Corpus Sample
➣ So practise the 10-Minute Mozzie Wipeout everyday to
ensure that you and your loved ones stay safe, healthy
and happy all year long.
➣ Berita baiknya ialah, oleh kerana cara penularan adalah
serupa, dengan mempraktikkan kebiasaan anti-nyamuk
secara berterusan, kita boleh, dengan secara efektif,
menglindungi diri dari ancaman berkembar Chikungunya dan
Denggi.
Bracket-Dic
➣ Translation quality is getting better,
but unknown words and combinations remain a problem
➣ Dictionaries have incomplete coverage
➣ Bilingual corpora are still relatively small in size
➣ Look for examples in basically monolingual data
➢ Text with English glosses in brackets
Pustejovskyの生成的辞書(generative lexicon)の記述方式を利用して . . .
➢ Extract the English and the text before it
Pustejovskyの生成的辞書(generative lexicon)の記述方式を利用して . . .
generative lexicon ⇔ Pustejovskyの生成的辞書
➢ Problems
∗ How much text before it should we extract?
∗ Is the bracketed text really a gloss?
Previous Research
➣ Several earlier works:
➢ Using the Web as a Bilingual Dictionary (Nagata et al., 2001)
➢ Using Bilingual Web Data to Mine and Rank Translations (Li et al., 2003)
➢ Acquiring Compound Word Translations Both Automatically and
Dynamically (Zhang and Isahara, 2004)
➣ Why redo it? We have new corpora – not all on the web
➢ Possible to improve by looking at many terms
➢ Possible to add domain info
➢ We want the translations
Corpora Examined
Lang  Name          Size (MB)  Comment
Ja    WWW           514,212
      J-STAGE       604        some English
      NLP           43         some duplicates
Zh    BLCU          80,000     OCR errors
      SohuTechNews  974        XML
      GigaWord      4,444      LDC (traditional)
Table 2: Size and types of Corpora Used
A lot of raw data — how many terms can we find?
Two kinds of patterns
➣ Fully Bracketed Examples
(1) 「収穫逓減の法則(the law of diminishing return)」
(2) 《德拉吉报道》(DrudgeReport)
(3) “魔兽世界”(World of Warcraft)
➣ Partly Bracketed Examples
(4) 図1に,明瞭性 (Clarity)・新奇性 (Novelty)
(5) 目标递归策略 (GoalRecursionStrategy),这是一种内部指导的策略。
Regular Expressions
full1 = tlbr(term+)trbr lbr(gloss{3,})rbr
《德拉吉报道》(DrudgeReport)
full2 = tlbr(term+)lbr(gloss{3,})rbr trbr
「収穫逓減の法則(the law of diminishing return)」
part = (term+)lbr(gloss{3,})rbr
図1に,明瞭性 (Clarity)
term = any non punctuation (1+ nonlatin (CJK))
gloss = latin, connector punctuation,
full space latin, whitespace
lbr = ( | (    rbr = ) | )
tlbr = Unicode: Punctuation, Open
trbr = Unicode: Punctuation, Close
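A rough sketch of the `part` pattern using Python's standard `re` module. The exact character ranges are assumptions, since the slide only describes the classes informally, and the optional whitespace before the bracket is added to cover examples such as 明瞭性 (Clarity). The names `CJK`, `TERM`, `GLOSS` and `extract_pairs` are illustrative.

```python
import re

# Assumed ranges: hiragana, katakana letters, prolonged sound mark, CJK ideographs.
CJK = r"\u3041-\u3096\u30a1-\u30fa\u30fc\u3400-\u4dbf\u4e00-\u9fff"
# term: a run of non-punctuation characters containing at least one CJK character.
TERM = rf"(?=[^\W_]*[{CJK}])[^\W_]+"
# gloss: Latin, full-width Latin, connector punctuation, whitespace (3 or more).
GLOSS = r"[A-Za-z\uff21-\uff3a\uff41-\uff5a_\-\s]{3,}"
LBR, RBR = r"[((]", r"[))]"  # half-width or full-width brackets

PART = re.compile(rf"({TERM})\s*{LBR}({GLOSS}){RBR}")

def extract_pairs(text):
    """Return (term, gloss) pairs for partly bracketed examples."""
    return [(m.group(1), m.group(2).strip()) for m in PART.finditer(text)]

print(extract_pairs("Pustejovskyの生成的辞書(generative lexicon)の記述方式を利用して"))
print(extract_pairs("図1に,明瞭性 (Clarity)・新奇性 (Novelty)"))
```

The fully bracketed patterns would wrap the same groups in the `tlbr`/`trbr` classes; the standard `re` module has no Unicode category escapes, so those would need explicit character lists or a third-party regex engine.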
Stop Words
Roman Numerals: xii, iii, . . .
Units: MPa, Kmh, . . .
Smilies: T T, o , m m, x x . . .
Week Days: mon, wed, fri, . . .
Other: pdf, PDF . . .
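A small sketch of applying such a stop list to extracted pairs. The set below contains only the examples named on the slide (smilies omitted), and matching case-insensitively is an assumption; the real list is longer.

```python
# Only the examples shown on the slide; the full stop list is not given.
STOP_GLOSSES = {"xii", "iii", "mpa", "kmh", "mon", "wed", "fri", "pdf"}

def is_stop_gloss(gloss):
    """True if a bracketed string is a known non-gloss (case-insensitive, assumed)."""
    return gloss.strip().lower() in STOP_GLOSSES

def drop_stop_glosses(pairs):
    """Keep only (term, gloss) pairs whose gloss is not on the stop list."""
    return [(term, gloss) for term, gloss in pairs if not is_stop_gloss(gloss)]

candidates = [("図", "iii"), ("圧力", "MPa"), ("生成的辞書", "generative lexicon")]
print(drop_stop_glosses(candidates))
```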
Distribution of Bracketed Terms
Lang  Name          Fully    Partly
Ja    WWW           896,000  14,861,000
      J-STAGE       552      45,000
      NLP           64       1,300
Zh    BLCU          151,000  6,563,000
      SohuTechNews  5,400    123,000
➣ A lot of hits!
➣ Terms the authors think are important
➣ Many terms not found in lexicons
➢ 生成的辞書 ≡ generative lexicon
世の中はそう甘くない (The real world is not so kind)
# English Chinese
10 World of Warcraft 魔兽世界
5 WOW 魔兽世界
3 WoW 魔兽世界
2 WorldofWarcraft 魔兽世界
1 World orWarcraft 魔兽世界
1 World of WarcraftTM 魔兽世界
1 Warcraft 魔兽世界
➣ Errors in the source corpus
➢ OCR errors
➢ Mistyping
Partly bracketed is harder
➣ Discard unshared left hand contexts
➢ Macaca fuscata 特にニホンザル
➢ Macaca fuscata 日本産哺乳類の中でこのような動作が可能な ニホンザル
➣ Discard non-term left hand contexts
➢ s/(^.*を)//;
➢ s/^や//;
➢ s/^対//;
➣ Merge whitespace variations:
World of Warcraft ≈ WorldofWarcraft
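The merging steps can be sketched as follows, assuming candidate pairs are `(term, gloss)` tuples: glosses that differ only in whitespace share a key, and for each key the terms are cut back to the right-hand context they all share. The function names are hypothetical, and the non-term prefix rules (を, や, 対) are left out.

```python
import os
import re
from collections import defaultdict

def gloss_key(gloss):
    """Collapse whitespace variants: 'World of Warcraft' and 'WorldofWarcraft'."""
    return re.sub(r"\s+", "", gloss)

def shared_right_context(terms):
    """Longest suffix shared by all terms; the unshared left context is dropped."""
    reversed_terms = [term[::-1] for term in terms]
    # commonprefix compares character by character, so it works on plain strings.
    return os.path.commonprefix(reversed_terms)[::-1]

def merge_pairs(pairs):
    """Map each whitespace-normalised gloss to the term its candidates agree on."""
    grouped = defaultdict(list)
    for term, gloss in pairs:
        grouped[gloss_key(gloss)].append(term)
    return {key: shared_right_context(terms) for key, terms in grouped.items()}

raw = [
    ("特にニホンザル", "Macaca fuscata"),
    ("日本産哺乳類の中でこのような動作が可能なニホンザル", "Macaca fuscata"),
]
print(merge_pairs(raw))
```

A gloss seen with only one left-hand context keeps that context unchanged, so low-frequency pairs still need the prefix rules or manual cleaning.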
Results after Merging
Lang  Name     Raw #       Merged
Ja    WWW      14,861,000  1,635,000
      J-STAGE  45,000      20,000
      NLP      1,300       372
Zh    BLCU     6,563,000   964,000
      Sohu     123,000     33,000
➣ Fewer, better pairs
More Examples (Table 5)
Corpus  Rank  English            J/C               Freq
BLCU    1     SOD                超氧化物歧化酶    18,000
        1001  quercetin          槲皮素            121
        2001  Alcan              加拿大铝业公司    55
        3001  CSTC               中国软件评测中心  34
        4001  John               约翰              18
JST     1     Bunseki Kagaku     分析化学          517
        1001  STEM               走査型TEM         2
        2001  structural factor  構造係数          2
        3001  explicit attitude  *的態度           1
        4001  Lake Magadi        マガディ湖        1
Evaluation: 言語処理学会誌 (Journal of the Association for Natural Language Processing)
Known: good terms already in our lexicons
文法機能 grammatical function
(EDR, JMDict, CICC and lingdic)
New: good terms not in any of our lexicons
生成的辞書 generative lexicon
Now in lingdic
General: good translations but not NLP terms
すべての学生 all of the students (Example)
Other: the remainder
似テイル ohxap ketidu (Mongolian)
形態素解析システム JUMAN (Description)
Results for the NLP Corpus
Status # %
Known 61 16%
New 138 37%
General 74 20%
Other 99 27%
Total 372 100%
➣ Many new useful terms (83% ok)
➣ Useful as-is for alignment
➣ Could still be cleaned further
ToDo: Other Languages vs English
➣ Extract Data from Other Languages
➢ Thai, Korean, Russian, Greek, . . .
➣ Test the English as English
➢ Re-space
➢ Compare to a language model
ToDo: Internal Structure
➣ Is it compositional? (if so do we need it?)
➢ 複合述部 ≡ complex predicate
➣ Are the lengths roughly equivalent?
➢ One en word ≈ two characters (we can measure)
➢ What about TLAs (three letter acronyms)?
GTF:http://www.xs4all.nl/~jtv/gtf/
➣ Is it a transliteration?
ペンシルバニア大学 University of Pennsylvania
德拉吉报道 Drudge Report
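The length check could be measured along these lines. This is only a sketch: the expected ratio of two characters per English word comes from the slide, while the tolerance value and function names are made up for illustration.

```python
def length_ratio(term, gloss):
    """Number of term characters per whitespace-separated English word."""
    words = gloss.split()
    return len(term) / len(words) if words else float("inf")

def plausible_length(term, gloss, expected=2.0, tolerance=1.5):
    """True if the ratio is within an (assumed) tolerance of the expected value."""
    return abs(length_ratio(term, gloss) - expected) <= tolerance

print(length_ratio("複合述部", "complex predicate"))
print(plausible_length("ペンシルバニア大学", "University of Pennsylvania"))
print(plausible_length("選択していく論説文要約システム", "GREEN"))
```

Acronyms and transliterations break the assumption in both directions, which is why the slide lists them as separate questions.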
Link Japanese-Chinese Results
➣ Sohu-JST (1,695 terms)
ウェブログ                  blog             博客
ファイアーウォール          firewall         防火墙
分解能                      resolution       分辨率
ヒューマンインターフェース  human interface  人机界面
➣ Should evaluate with JC dic
➣ Can do internal and external confirmation
➣ Give the data to Tsunakawa and Erdenebat
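One way to picture the linking step is a join on the shared English gloss, assuming each corpus yields `(term, English gloss)` pairs. Matching glosses case-insensitively is an assumption, and `link_via_english` is a hypothetical name.

```python
from collections import defaultdict

def link_via_english(ja_en, zh_en):
    """Join Japanese and Chinese terms that share the same English gloss."""
    zh_by_gloss = defaultdict(set)
    for zh_term, gloss in zh_en:
        zh_by_gloss[gloss.lower()].add(zh_term)
    triples = []
    for ja_term, gloss in ja_en:
        for zh_term in sorted(zh_by_gloss.get(gloss.lower(), ())):
            triples.append((ja_term, gloss, zh_term))
    return triples

ja_pairs = [("ウェブログ", "blog"), ("分解能", "resolution")]
zh_pairs = [("博客", "Blog"), ("防火墙", "firewall")]
print(link_via_english(ja_pairs, zh_pairs))
```

Ambiguous English glosses (for example, resolution) will link unrelated terms, so the output still needs the evaluation against a Japanese-Chinese dictionary mentioned above.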
ToDo: Knowledge Extraction
➣ System Names: (look for /システム$/)
➢ GREEN 選択していく論説文要約システム
➣ Better name handling
写信给当时的大数学家欧拉 Euler
大数学家 欧拉 Euler
➣ General text mining
Conclusions: Bracket-Dic
➣ Currently extracted for Japanese and Chinese
➣ Some cleaning/merging
⋆ Will release cleaned high frequency data
⋆ Will also release RAW data (as far as possible)
➢ let other people clean it
➢ ask for (but don’t expect) feedback