GeoName: a system for back-transliterating Pinyin place names (Kui-Lam Kwok & Qiang Deng)

Post on 12-Jan-2016

GeoName: a system for back-transliterating Pinyin place names

Kui-Lam Kwok & Qiang Deng

Computer Science Dept., Queens College, City University of New York

email: kwok@ir.cs.qc.edu
email: peterqc@yahoo.com

Or: issues involving cross-language referencing of a Chinese place by name

Content:

1. Back-transliteration problem

2. GeoName system - a proposed approach

3. Evaluation

4. Observation/conclusion

Transliteration:

• ‘alphabet mismatch’ when expressing Chinese (place) names in English texts

• names represented by PRC Pinyin code:

e.g. Beijing, Shenzhen

Back-Transliteration:

given the Pinyin code,

what are the original Chinese characters?

Back-Transliteration:

Why are Chinese characters needed?

• remove ambiguity of referenced Pinyin place

• reconcile names in English & Chinese texts

• may assist alignment in E/C parallel texts

• necessary for E-C cross-language IR (when translating English queries containing Pinyin place, person, or organization names)

4 Possible Ambiguities in English–Chinese cross-language place name references

Ambiguity #3: Back-transliteration --> which character string is correct?

e.g.

• China’s capital in Chinese: 北京

• PRC Pinyin (1 char, 1 syllable): 北 --> bei, 京 --> jing

• map back from Pinyin to characters:
  bei --> { 北, 贝, 被, 背, 碑, 杯, 备, 鐾, … } (total 23)
  jing --> { 京, 景, 井, 静, 敬, 竞, 精, 荆, … } (total 20)

• ambiguous candidates: 北井, 贝京, 贝荆, …, 北京: which one?
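The candidate explosion in this example can be illustrated with a minimal sketch. It assumes a hypothetical table `PINYIN_TO_CHARS` that holds only the eight characters per syllable shown on the slide (the full sets have 23 and 20 members), and it simply enumerates every combination.

```python
from itertools import product

# Hypothetical, truncated syllable-to-character table: only the characters
# listed on the slide (the full sets have 23 and 20 members respectively).
PINYIN_TO_CHARS = {
    "bei": "北贝被背碑杯备鐾",
    "jing": "京景井静敬竞精荆",
}

def candidates(syllables):
    """Return every string whose i-th character can be read as syllables[i]."""
    char_sets = [PINYIN_TO_CHARS[s] for s in syllables]
    return ["".join(chars) for chars in product(*char_sets)]

cands = candidates(["bei", "jing"])
print(len(cands))       # 8 * 8 combinations from the truncated table
print("北京" in cands)   # the correct name is only one of them
```

With the full character sets quoted on the slide, the same enumeration would yield 23 x 20 = 460 candidates for a two-syllable name.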

Ambiguity #4: Name Reference --> same name, different places

Suppose result of back-transliteration is:

beijing --> 贝荆 , then which 贝荆 ? (longitude, latitude)

Ambiguity #1: E/C Pinyin Systems --> which Pinyin system was used?

e.g. ‘Hong Kong’ in characters: 香港

• PRC Pinyin: 香 -> xiang, 港 -> gang
• Wade-Giles: 香 -> hsiang, 港 -> kang
• Hong Kong: 香 -> hong, 港 -> kong
• …

‘hong kong’ back-transliterated using PRC Pinyin:

hong -> { 红洪鸿宏虹弘泓闳烘项黉哄 … } (19)
kong -> { 孔空恐崆控箜倥 } (7)

The original ‘香港’ is NOT one of these 19 x 7 = 133 combinations!
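A small sketch of why the choice of romanization system matters. The tables below contain only the readings shown on the slide (the `hong` set is truncated to the 12 characters listed, out of 19), and the names `ROMANIZATION`, `PRC_CHARS`, `romanize` and `reachable` are hypothetical, introduced here for illustration.

```python
# Romanizations of 香港 under three systems, as listed on the slide.
ROMANIZATION = {
    "PRC Pinyin": {"香": "xiang", "港": "gang"},
    "Wade-Giles": {"香": "hsiang", "港": "kang"},
    "Hong Kong": {"香": "hong", "港": "kong"},
}

# PRC Pinyin readings from the slide (the hong set is truncated: 12 of 19).
PRC_CHARS = {
    "hong": "红洪鸿宏虹弘泓闳烘项黉哄",
    "kong": "孔空恐崆控箜倥",
}

def romanize(name, system):
    """Spell a Chinese name under the given romanization system."""
    table = ROMANIZATION[system]
    return " ".join(table[ch] for ch in name)

def reachable(name, syllables):
    """True if every character of name is a listed PRC reading of its syllable."""
    return len(name) == len(syllables) and all(
        ch in PRC_CHARS.get(syl, "") for ch, syl in zip(name, syllables)
    )

print(romanize("香港", "Hong Kong"))           # spelling seen in English text
print(reachable("香港", ["hong", "kong"]))     # cannot be recovered via PRC Pinyin
print(reachable("红孔", ["hong", "kong"]))     # a wrong but reachable candidate
```

Because the table here is truncated, the failed lookup only illustrates the slide's claim; the slide itself states that 香港 is absent from the full 19 x 7 candidate set.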

Ambiguity #2: Syllable Segmentation --> which segmentation is correct?

e.g. 秦皇岛 - possible Pinyin writing styles:

• Qin Huang Dao
• QinHuangDao
• Qinhuangdao <-- most common, used in NYT

--> how many syllables?

Qin huang dao      3 char
Qin huang da o     4 char
Qin hu ang dao     4 char
Qin hu ang da o    5 char
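The segmentation ambiguity can be reproduced with a short recursive sketch. The syllable inventory `SYLLABLES` is a hypothetical toy set containing just the syllables needed for this example, not a full Pinyin syllabary; sorting by length mirrors the "fewest syllables first" preference.

```python
# Hypothetical toy inventory: just enough syllables for the slide's example.
SYLLABLES = {"qin", "huang", "dao", "hu", "ang", "da", "o"}

def segmentations(text, syllables=SYLLABLES):
    """Return every way to split text into syllables from the inventory."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in syllables:
            for tail in segmentations(text[end:], syllables):
                results.append([head] + tail)
    return results

# Fewest syllables first (sorted() is stable, so ties keep discovery order).
segs = sorted(segmentations("qinhuangdao"), key=len)
for seg in segs:
    print(len(seg), " ".join(seg))
```

With a complete syllabary the number of segmentations can grow further, so a real system would need the full inventory of the chosen romanization system.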

Summarize: given a Pinyin geographic name

1. Pinyin system -- which?

2. segmentation -- how many syllables?

3. back-transliterate -- which candidate character string?

4. resolve same name, different places.

GeoName: a system for back-transliterating Pinyin place names

GeoName: E-C cross-language place reference

1. which Pinyin system?
   -- user chooses; or allow both PY & WG

2. how many segmented syllables?
   -- fewest syllables ranked first

3. back-transliterate: which candidate?
   -- a) bi-list; b) confirm by web/Chinese place list; c) rank candidates by frequency

4. resolve same name, different places
   -- not considered

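GeoName –

Given an English Pinyin place name $E = e_1 e_2 \ldots e_n$ ($n$ syllables), there are many possible Chinese character string candidates:

$$C^* = c_1 c_2 \ldots c_n = \arg\max_C P(C \mid E) = \arg\max_C P(E \mid C)\,P(C) \approx \arg\max_C P(C),$$

by assuming

$$P(E \mid C) \approx \prod_i P(e_i \mid C) \quad \text{(i.e. } e_i, e_k \text{ independent)}$$

$$\approx \prod_i P(e_i \mid c_i) \quad \text{(i.e. } e_i, c_k \text{ independent)}$$

$$\approx 1 \quad \text{(i.e. all } c_i \text{ map to unique } e_i\text{)}$$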

GeoName –

P(C) = language model of Chinese place names
<obtain training data by processing TREC, NTCIR Chinese collections using BBN IdentiFinder: ~80K approximate unique place names>

Use P(C) to sort candidates; fewest syllables ranked earlier
<bigram model P(c2|c1)P(c3|c2).. not too good>
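As a rough illustration of sorting candidates by P(C), the sketch below estimates P(C) as the relative frequency of a whole name in a list of extracted place names. The training list is a toy stand-in (not data from the TREC or NTCIR collections), and `train_place_model` and `rank_by_prior` are hypothetical names.

```python
from collections import Counter

def train_place_model(names):
    """Estimate P(C) as the relative frequency of each whole place name."""
    counts = Counter(names)
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}

def rank_by_prior(candidates, model):
    """Sort candidates by P(C), highest first; unseen strings get 0."""
    return sorted(candidates, key=lambda c: model.get(c, 0.0), reverse=True)

# Toy training data, for illustration only.
training = ["北京", "北京", "北京", "上海", "上海", "深圳"]
model = train_place_model(training)

ranked = rank_by_prior(["贝京", "北井", "北京"], model)
print(ranked[0])   # the candidate seen in the training names comes first
```

A whole-string model like this cannot distinguish between unseen candidates, which is one motivation for also using bigram and character frequencies.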

GeoName –

A heuristic weighting formula based on whole string, bigram and character frequencies:

$$g(C) = a_1 \log[f(C) + a_1] + a_2 \log[f(c_i c_j) + a_2] + a_3 \log[f(c_i) + a_3]$$

- factor ignored if f(.) = 0; a1 > a2 > a3

- a1*log[f(C)+a1] => a string seen before is probably correct
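A minimal sketch of the weighting formula, under stated assumptions: the slide does not say how the bigram and character terms are aggregated, so this version sums them over all adjacent bigrams and all characters of the candidate; the coefficient values `(3.0, 2.0, 1.0)` are hypothetical and chosen only to satisfy a1 > a2 > a3.

```python
import math

def g(cand, f_string, f_bigram, f_char, a=(3.0, 2.0, 1.0)):
    """Heuristic weight of a candidate string.

    f_string, f_bigram, f_char map strings, bigrams and characters to
    frequencies. A term is skipped when its frequency is 0, as on the slide.
    Summing over all bigrams and characters is an assumption of this sketch.
    """
    a1, a2, a3 = a
    score = 0.0
    fs = f_string.get(cand, 0)
    if fs > 0:
        score += a1 * math.log(fs + a1)
    for i in range(len(cand) - 1):
        fb = f_bigram.get(cand[i:i + 2], 0)
        if fb > 0:
            score += a2 * math.log(fb + a2)
    for ch in cand:
        fc = f_char.get(ch, 0)
        if fc > 0:
            score += a3 * math.log(fc + a3)
    return score

# Toy frequencies, for illustration only.
f_string = {"北京": 10}
f_char = {"北": 50, "京": 40, "贝": 5}

print(g("北京", f_string, {}, f_char) > g("贝京", f_string, {}, f_char))
```

Because a1 is the largest coefficient, a candidate whose whole string has been seen before receives a large boost, matching the slide's remark that such a string is probably correct.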

Evaluation

Use the frequency formula only, on 162 Pinyin city names from a bilingual map (no bilingual pair list was employed)

[Figure: GeoName Evaluation - Frequency Formula (back-transliterating 162 Pinyin geographic names). x-axis: Rank (1 to 10, >10); y-axis: Cumulative # Correct at Rank (60 to 180). Data labels on the curve: 48%, 70%, 74%, 82%.]
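The "cumulative # correct at rank" measure plotted in this evaluation can be computed as in the sketch below. The input ranks are invented toy values, not the results of the 162-name experiment, and `cumulative_correct` is a hypothetical helper name.

```python
def cumulative_correct(ranks, max_rank=10):
    """Count, for each cutoff k = 1..max_rank, the names whose correct
    character string appears at rank k or better.

    ranks holds the rank of the correct answer for each test name,
    or None when the correct string was not produced at all.
    """
    return [
        sum(1 for r in ranks if r is not None and r <= k)
        for k in range(1, max_rank + 1)
    ]

# Toy ranks for six names (illustrative only).
toy_ranks = [1, 1, 2, 5, None, 12]
counts = cumulative_correct(toy_ranks, max_rank=5)
print(counts)
print([round(100 * c / len(toy_ranks)) for c in counts])  # as percentages
```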

Examples of Correct Names ranked #1

Daqiu (大丘), Wanbi (湾碧), Gongzhuling (公主岭), ..

Examples of Failed Names

• Non-Pinyin: Qarqi (察尔齐), Yengisar (阳霞), Jorra (觉拉), Dongkar (洞嘎), ..

• mainly longer names: Tuolu (驮芦), Fenglingguan (枫岭关), Qingguandu (清官渡), Dating (大亭), Shasonggang (杉松岗), Denglonghe (灯笼河), ..

GeoName – further improvement

Hypothesis: prefer candidate strings that have been seen before as location names

confirm candidates on:

1. a bilingual list (~4K) – tag: 100
   ftp://ftpserver.ciesin.columbia.edu/pub/data/China/CITAS/gb_code/

2. Chinese monolingual place name list (~80K+4K) – tag: 010

3. web data via Google search – tag: 001

GeoName – flowchart

1. Pinyin place name input; user indicates PRC or WG system.

2. Pinyin segmentation; map to all possible GB character strings. tag 000

3. Bilingual table (4K) lookup. tag 100

4. Merge GB candidates.

5. Monolingual name list (84K) confirmation. tag 110, 010

6. WWW confirmation. tag 101, 001, 111, 011

7. Evaluate weight g(C); rank according to: (1) tag, (2) name character length, (3) g(C).
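The tagging and ranking steps of the flowchart can be sketched as follows. The three tag bits follow the slides (bilingual list = 100, monolingual list = 010, web = 001). The sort directions are assumptions: higher tag first, shorter names first, higher g(C) first. The `Candidate` class and `rank` function are hypothetical names; the sample rows are taken from the Chagugang example in this deck.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    text: str
    weight: float               # g(C) from the frequency formula
    in_bilingual: bool = False  # confirmed by the bilingual list (bit 100)
    in_monolingual: bool = False  # confirmed by the monolingual list (bit 010)
    on_web: bool = False        # confirmed by web search (bit 001)

    @property
    def tag(self):
        """Three-bit confirmation tag, e.g. '110'."""
        return f"{int(self.in_bilingual)}{int(self.in_monolingual)}{int(self.on_web)}"

def rank(cands):
    """Order by (1) tag, (2) name character length, (3) g(C).
    Assumed directions: tag descending, length ascending, weight descending."""
    return sorted(cands, key=lambda c: (-int(c.tag, 2), len(c.text), -c.weight))

cands = [
    Candidate("查古港", 15.68423107),
    Candidate("诧古港", 9.24647942),
    Candidate("汊沽港", 1.38629436, on_web=True),
]
for c in rank(cands):
    print(c.tag, c.weight, c.text)
```

This ordering reproduces the behaviour seen in the Chagugang example: the web-confirmed candidate 汊沽港 is placed first even though its g(C) is the lowest of the three.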

GeoName – Evaluation

Evaluate system result using:

• tag=000, rank by g(C)
• tag=001, web confirmation + g(C)
• tag=010, mono-list confirmation + g(C)
• tag=111, bi-list + all above

[Figure: GeoName Evaluation - Various Methods (back-transliterating 162 Pinyin geographic names). x-axis: Rank (1 to 10, >10); y-axis: Cumulative # Correct at Rank (60 to 180). Series: freq+mono.list (010), all (111), freq only (000), freq+web (001). Data labels: 48%, 70%, 74%, 82%, 72%, 83%, 86%, 79%.]

Example of back-transliteration: web & no-web

Tag = 111 (with web confirmation)

Chagugang
tag  weight       candidate
001  1.38629436   汊沽港
000  15.68423107  查古港
000  9.24647942   诧古港
000  9.24647942   岔古港
000  8.55333224   锸古港
000  8.55333224   槎古港
000  8.55333224   楂古港
000  8.55333224   汊古港
000  8.55333224   嚓古港
000  8.55333224   刹古港

Tag = 110 (without web confirmation)

Chagugang
tag  weight       candidate
000  15.68423107  查古港
000  9.24647942   诧古港
000  9.24647942   岔古港
000  8.55333224   锸古港
000  8.55333224   槎古港
000  8.55333224   楂古港
000  8.55333224   汊古港
000  8.55333224   嚓古港
000  8.55333224   刹古港
000  8.55333224   差古港

Examples:

Luliangqu (district/region)
tag  weight       candidate
010  40.02587171  吕梁区
000  9.24647942   吕梁瞿
000  9.24647942   吕梁衢
000  9.24647942   吕梁渠
000  9.24647942   吕梁曲
000  9.24647942   陆良瞿
000  9.24647942   陆良衢
000  9.24647942   陆良渠
000  9.24647942   陆良曲
000  9.24647942   陆良区

Xiaoyishi (city)
tag  weight       candidate
110  40.18588115  孝义市
000  9.24647942   孝尾市
000  9.24647942   萧尾市
000  8.55333224   箫尾市
000  8.55333224   筱尾市
000  8.55333224   骁尾市
000  8.55333224   潇尾市
000  8.55333224   崤尾市
000  8.55333224   哓尾市
000  8.55333224   效尾市

Yimaxiang (village)
tag  weight       candidate
000  15.68423107  义马乡
000  9.24647942   义马缃
000  9.24647942   义马巷
000  9.24647942   义马祥
000  9.24647942   义马湘
000  9.24647942   义马襄
000  9.24647942   义马香
000  9.24647942   伊玛缃
000  9.24647942   伊玛巷
000  9.24647942   伊玛祥

Mengnanzhuang (place)
tag  weight       candidate
000  14.95494484  蒙南庄
000  8.51719319   懵南庄
000  8.51719319   孟南庄
000  8.51719319   盟南庄
000  8.51719319   萌南庄
000  7.82404601   虻南庄
000  7.82404601   勐南庄
000  7.82404601   梦南庄
000  7.82404601   猛南庄
000  7.82404601   锰南庄

Conclusion:

• reasonable back-transliteration results for map cities

• longer names (>2 char): more errors

• non-Pinyin names: does not work

Future Work:

• increase training data

• improve ranking function

• direct translation (not just confirmation) using web

• better/more realistic evaluation

If interested:

can demonstrate GeoName (needs Linux re-boot)

Try GeoName at:

http://post.cs.qc.edu/spell2gb/ (needs Chinese character display)

feedback appreciated

Thank You!
