Search in Transliterated Space Shared Task Proposal, FIRE 2012 Monojit Choudhury Microsoft Research Lab India

Search in Transliterated Space

Shared Task Proposal, FIRE 2012

Monojit Choudhury
Microsoft Research Lab India

A Transliterated World Wide Web

Song Lyrics

A Transliterated World Wide Web

Reviews and Forums

A Transliterated World Wide Web

Facebook and Twitter

A Transliterated World Wide Web

And a lot more

Beyond Indic languages

Many languages that use non-Roman scripts:
Arabic (Saudi Arabia, UAE, Egypt, Morocco, …)
Persian
Indian sub-continental languages (IL & Dzongkha, Nepalese, Sinhala)
Thai, Vietnamese
Cyrillic (Russian, Ukrainian)
Chinese, Japanese, Korean (rare)

Aspects of Transliterated Text

Code Mixing

Transliteration

Errors, Contraction

IR Scenario - I

Mono-script Monolingual IR in transliterated space
Query: thandee hava yeh chandni suhanee
Results: Only Roman transliterated documents
Challenge: Spelling variations, e.g. tandee hawa ye chandny soohaany
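One way to picture the spelling-variation challenge is to map each Roman spelling to a rough canonical key and compare keys instead of raw strings. The sketch below is only an illustration, not the method proposed in the slides: the rewrite rules (digraph vowels, w/v, final y, dropped h) are hand-picked assumptions that happen to cover the example query, and the function names `spelling_key` and `query_key` are hypothetical.

```python
import re

def spelling_key(word: str) -> str:
    """Map a Roman-transliterated word to a rough canonical key (assumed rules)."""
    w = word.lower()
    w = w.replace("ee", "i").replace("oo", "u")  # long vowels written as digraphs
    w = re.sub(r"(.)\1+", r"\1", w)              # collapse doubled letters (aa -> a)
    w = w.replace("w", "v")                      # treat w and v as the same sound
    w = re.sub(r"y$", "i", w)                    # word-final y treated as i
    w = re.sub(r"(?<=[^aeiou])h", "", w)         # drop h after a consonant (th -> t)
    w = re.sub(r"(?<=[aeiou])h$", "", w)         # drop word-final h after a vowel
    return w

def query_key(text: str) -> list[str]:
    """Apply spelling_key to every whitespace-separated token."""
    return [spelling_key(tok) for tok in text.split()]

query = "thandee hava yeh chandni suhanee"
variant = "tandee hawa ye chandny soohaany"
print(query_key(query))    # ['tandi', 'hava', 'ye', 'candni', 'suhani']
print(query_key(variant))  # ['tandi', 'hava', 'ye', 'candni', 'suhani']
```

Rules this coarse will also merge unrelated words, so a real system would more likely learn the equivalences from aligned word pairs.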

IR Scenario - II

Cross-script and Multi-script Monolingual IR in transliterated space

Query: thandee hava yeh chandni OR ठंडी हवा ये चाँदनी
Results: Both Roman transliterated and native-script documents

Challenge: Transliteration

IR Scenario - III

Cross-script and Cross-lingual IR
Query: death of mareech and subahoo
Documents: Hindi (transliterated and Devanagari) and English documents

Shared Task on Retrieval

Mono-script Monolingual IR: Transliterated query in Roman; transliterated documents in Roman

Cross-script Monolingual IR: Transliterated query in Roman; transliterated documents in native script

Multi-script Monolingual IR: Query in Roman or native script; documents in Roman and native scripts
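The three retrieval settings differ only in which scripts the query and the document collection use, so a system first needs to know the script of its input. The sketch below is a minimal illustration under stated assumptions: Devanagari stands in for "native script", Unicode character names are used as the script test, and the helper names `scripts_in` and `retrieval_setting` are hypothetical.

```python
import unicodedata

def scripts_in(text: str) -> set[str]:
    """Return which of the two scripts ('roman', 'native') occur in the text."""
    found = set()
    for ch in text:
        name = unicodedata.name(ch, "")  # empty string for unnamed characters
        if name.startswith("LATIN"):
            found.add("roman")
        elif name.startswith("DEVANAGARI"):  # assumption: native script = Devanagari
            found.add("native")
    return found

def retrieval_setting(query: str, collection_scripts: set[str]) -> str:
    """Name the retrieval setting for a query against a collection."""
    q = scripts_in(query)
    if collection_scripts == {"roman", "native"}:
        return "multi-script"
    if q == {"roman"} and collection_scripts == {"roman"}:
        return "mono-script"
    if q == {"roman"} and collection_scripts == {"native"}:
        return "cross-script"
    return "not covered"

print(retrieval_setting("thandee hava", {"roman"}))
print(retrieval_setting("thandee hava", {"native"}))
print(retrieval_setting("ठंडी हवा", {"roman", "native"}))
```

Other native scripts could be handled the same way by testing for their Unicode name prefixes.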

Shared Sub-Tasks

Language identification of transliterated queries, documents, code-mixed text

kooda/ML kazhikkan/ML oru/ML urgan/ML split/EN pea/EN soup/EN undaki/ML (ML = Malayalam, EN = English)

Transliteration
Forward: കഴിക്കാൻ → kazhikkan
Backward: kazhikkan → കഴിക്കാൻ
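The word-level language identification sub-task above can be pictured as assigning one tag per token and scoring the result against gold tags. The sketch below only illustrates that input/output format: the three-word English list is a hypothetical stand-in chosen to fit the slide's example, and lexicon lookup is an assumed baseline, not a method from the proposal.

```python
# Hypothetical, tiny English word list; a real system would need a large
# lexicon or a trained classifier.
ENGLISH_WORDS = {"split", "pea", "soup"}

def tag_tokens(sentence: str, other_tag: str = "ML") -> list[tuple[str, str]]:
    """Tag each token EN if it is in the word list, else with other_tag."""
    return [
        (tok, "EN" if tok.lower() in ENGLISH_WORDS else other_tag)
        for tok in sentence.split()
    ]

def accuracy(predicted: list[tuple[str, str]], gold: list[str]) -> float:
    """Fraction of tokens whose predicted tag matches the gold tag."""
    correct = sum(p == g for (_, p), g in zip(predicted, gold))
    return correct / len(gold)

sentence = "kooda kazhikkan oru urgan split pea soup undaki"
gold = "ML ML ML ML EN EN EN ML".split()
predicted = tag_tokens(sentence)
print(predicted)
print(accuracy(predicted, gold))  # 1.0 here only because the word list was built for this example
```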

Available Data

20,000 word pairs each in Bengali, Telugu, and Hindi (labeled with language tags)

35,000 unique Hindi-Roman word pairs obtained by aligning Bollywood song lyrics

More data under preparation from Facebook, covering a mixture of various languages.

Looking for partners to extend!

Available Data

Currently we have 500 query-URL pairs with relevance judgments for Bollywood song lyrics

Looking for partners to extend it to other (Indian) languages

Other domains?

Thank you! [email protected]

Other resources

Lexicons
Pronunciation lexicons
G2P for some languages
Stemmers and morphological analyzers

Anything else?

Concluding Remarks

We have built Multi-script Bollywood Song Search and are working on transliteration and code-mixing

These are just some initial ideas that came up from our experiences

If you are interested, please let me know