![Page 1: ISM@FIRE 2012: Adhoc Retrieval Task & Morpheme Extraction Task](https://reader036.vdocuments.us/reader036/viewer/2022062323/5681660f550346895dd95160/html5/thumbnails/1.jpg)
ISM@FIRE 2012: Adhoc Retrieval Task & Morpheme Extraction Task
Avinash Yadav, Robins Yadav, Sukomal Pal
Department of Computer Science & Engineering, Indian School of Mines, Dhanbad, India

Contents

- Introduction
- Adhoc retrieval task participation
- Morpheme Extraction Task participation
- Conclusion

Introduction

- Stemmer
- ISMstemmer
- Evaluation

Stemmer

- Attempts to reduce word variants to their stem or root form
- Example: education, educating and educative will all reduce to educat

Approaches for stemming

- Language-based approach
- Statistical approach

ISMstemmer

- Statistical stemmer
- Based on suffix extraction
- Suffix frequency algorithm

Data Preprocessing

1. Convert the corpus into a single file: File 1, File 2, …, File n are merged into a Single File.
2. Cleaning of data:
   - Before: John asked a girl with an apple of Kashmir, “do you have the time”. She said, “yes”.
   - After: John asked a girl with an apple of Kashmir do you have the time she said yes
3. Removing stop words:
   - Before: John asked a girl with an apple of Kashmir do you have the time she said yes
   - After: John asked girl with apple Kashmir you time she said yes
4. Convert the file into a single column (one word per line):
   - Before: John asked girl with apple Kashmir you time she said yes
   - After: John / asked / girl / with / apple / Kashmir / you / time / she / said / yes
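
The cleaning, stop-word removal and single-column steps above can be sketched in a few lines of Python. This is only an illustration, not the authors' code: the stop list is an assumption (it contains exactly the words dropped in the slide's example), `preprocess` is a hypothetical name, and case is left untouched because the slides do not say how case is handled.

```python
import re

# Assumed stop list: exactly the words removed in the slide's example sentence.
STOP_WORDS = {"a", "an", "of", "do", "have", "the"}


def preprocess(text):
    """Return the unique words of `text` in first-seen order (one per row)."""
    # Cleaning: replace every character that is not a letter, digit or
    # whitespace (punctuation, quotes) with a space.
    cleaned = re.sub(r"[^\w\s]", " ", text)
    # Stop-word removal: compare case-insensitively, keep original casing.
    tokens = [t for t in cleaned.split() if t.lower() not in STOP_WORDS]
    # Unique words, preserving the order of first occurrence.
    return list(dict.fromkeys(tokens))


text = ("John asked a girl with an apple of Kashmir, “do you have the time”. "
        "She said, “yes”.")
column = preprocess(text)
print("\n".join(column))  # single-column output, one word per line
```

On a real corpus the same function would be applied to the merged single file, and the resulting list is the unique-word vocabulary used in the later steps.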

Data preprocessing (contd.)

Unique words extracted:

- Hindi: 4,90,391
- English: 7,95,144

Find valid suffixes

1. Reverse the words of the single-column file:
   - Before: aborning absolution absorption abuilding acquisition activation added addition admiration admitted admitting agreed agreeing allotted allotting ambling angling
   - After: gninroba noitulosba noitprosba gnidliuba noitisiuqca noitavitca dedda noitidda noitarimda dettimda gnittimda deerga gnieerga dettolla gnittolla gnilbma gnilgna
2. Sort the reversed list:
   - dedda deerga dettimda dettolla gnidliuba gnieerga gnilbma gnilgna gninroba gnittimda gnittolla noitarimda noitavitca noitidda noitisiuqca noitprosba noitulosba
3. Find suffixes according to the threshold, that is, keep the reversed prefixes whose frequency in the sorted list reaches the threshold.

[Figure: reversed suffix candidates marked on the sorted list with their frequencies. Labels as extracted from the slide: "degniniot", "gni", 17%, 40%]
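
The reverse, sort and threshold steps above can be sketched as follows. This is a minimal sketch under stated assumptions, not the ISMstemmer implementation: the function name `find_suffixes`, the candidate suffix lengths (2 to 4 letters) and the 20% threshold are all illustrative choices, and the threshold is far higher than the sub-1% values reported for the real corpora because the toy list has only 17 words.

```python
from itertools import groupby


def find_suffixes(words, lengths=(2, 3, 4), threshold=0.20):
    """Map each suffix whose share of the word list reaches `threshold`
    to that share. Parameter values are illustrative assumptions."""
    # Reverse every word so that suffixes become prefixes, then sort so
    # that words sharing a (reversed) suffix end up next to each other.
    reversed_sorted = sorted(w[::-1] for w in words)
    total = len(reversed_sorted)
    found = {}
    for n in lengths:
        # Ignore words too short to leave any stem after removing n letters.
        candidates = [rw for rw in reversed_sorted if len(rw) > n]
        # Sorting makes equal prefixes contiguous, so groupby can count them.
        for prefix, group in groupby(candidates, key=lambda rw: rw[:n]):
            share = sum(1 for _ in group) / total
            if share >= threshold:
                found[prefix[::-1]] = share  # reverse back to a normal suffix
    return found


words = ("aborning absolution absorption abuilding acquisition activation "
         "added addition admiration admitted admitting agreed agreeing "
         "allotted allotting ambling angling").split()
suffixes = find_suffixes(words)
for suffix, share in sorted(suffixes.items()):
    print(f"{suffix}: {share:.0%}")
```

Note that short fragments such as "ng" and "on" also pass a pure frequency test; how ISMstemmer chooses among overlapping candidates is not stated on the slides, so the sketch simply reports all of them.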

Threshold used

- English: 0.01% to 0.1%
- Hindi: 0.1% to 1.0%

Stemming of corpus

1. Stem the reversed words with the reversed valid suffixes:
   - Before: dedda deerga dettimda dettolla gnidliuba gnieerga gnilbma gnilgna gninroba gnittimda gnittolla noitarimda noitavitca noitidda noitisiuqca noitprosba noitulosba
   - After: dda erga ttimda ttolla dliuba eerga lbma lgna nroba ttimda ttolla arimda avitca idda isiuqca prosba ulosba
2. Reverse the stemmed words to get the stems in their original letter order:
   - add agre admitt allott abuild agree ambl angl aborn admitt allott admira activa addi acquisi absorp absolu

Note: If a word would be shorter than 3 letters after stemming, it is not stemmed.

Example: aging and king would become ag and k, so they are left unchanged.
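
The stemming step and the minimum-length rule can be sketched together. This is an illustrative sketch, not the authors' code: the suffix set {ed, ing, tion} is read off the worked example in the slides, trying the longest suffix first is an assumption, and the names `stem` and `min_stem_len` are hypothetical.

```python
def stem(word, suffixes, min_stem_len=3):
    """Strip the first matching suffix unless the stem would be too short."""
    reversed_word = word[::-1]
    # Assumption: try longer suffixes first so "tion" wins over shorter ones.
    for suffix in sorted(suffixes, key=len, reverse=True):
        reversed_suffix = suffix[::-1]
        if reversed_word.startswith(reversed_suffix):
            # Drop the reversed suffix, then reverse back to normal order.
            candidate = reversed_word[len(reversed_suffix):][::-1]
            # Rule from the slide: stems shorter than 3 letters are rejected
            # and the word is returned unstemmed.
            return candidate if len(candidate) >= min_stem_len else word
    return word


valid_suffixes = {"ed", "ing", "tion"}
for w in ["added", "agreeing", "acquisition", "aging", "king"]:
    print(w, "->", stem(w, valid_suffixes))
```

With the same suffix set, `stem("Beijing", valid_suffixes)` gives `Beij`, which is the proper-noun problem discussed later in the slides.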

Evaluation of ISMstemmer

To evaluate ISMstemmer, we participated in:

1. Monolingual Adhoc retrieval task in the English and Hindi languages
2. Morpheme Extraction Task (MET) of FIRE-2012

Adhoc Retrieval Task (ART) Participation

- Monolingual task
- Languages chosen:
  - English: Approach, Results
  - Hindi: Approach, Results

ART: English

Approach:

- Indexing: search engine used: Indri (IndriBuildIndex)
- Retrieval: search engine used: Lemur (RetEval)
- Data provided:
  - Corpus from The Telegraph and BD News
  - 50 query set

ART: English (contd.)

Results:

| Run id | No. of queries | No. of results | No. of relevant docs | No. of rel. docs retrieved | MAP value |
|---|---|---|---|---|---|
| EE.ism.unstemmed | 50 | 50000 | 3539 | 2503 | 0.2264 |
| EE.ism.krovetzstemmer | 50 | 50000 | 3539 | 2504 | 0.2255 |
| EE.ism.ismstemmer | 50 | 50000 | 3539 | 2415 | 0.2096 |

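
The MAP column is presumably the standard mean average precision used in TREC-style evaluation; the slides do not define it, so the sketch below states that usual definition as an assumption. The document ids and relevance judgments are invented purely for illustration, and the function names are hypothetical.

```python
def average_precision(ranked_docs, relevant):
    """Average precision for one query: mean of precision@k over the ranks k
    at which a relevant document appears, divided by the number of relevant
    documents (so unretrieved relevant documents count as zero)."""
    hits, precision_sum = 0, 0.0
    for rank, doc in enumerate(ranked_docs, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """MAP: arithmetic mean of the per-query average precision values."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Toy data (hypothetical): two queries with ranked results and relevant sets.
runs = [
    (["d1", "d2", "d3", "d4"], {"d1", "d3", "d5"}),  # AP = (1/1 + 2/3) / 3
    (["d7", "d8"], {"d8"}),                          # AP = (1/2) / 1
]
print(round(mean_average_precision(runs), 4))
```

Because unretrieved relevant documents contribute zero, a run that retrieves fewer relevant documents (as the ISMstemmer run does here) tends to score a lower MAP unless it ranks the ones it finds much higher.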

ART: Hindi

Approach:

- Indexing: search engine used: Indri (IndriBuildIndex)
- Retrieval: search engine used: Indri (IndriRunQuery)
- Data provided:
  - Corpus from Navbharat Times and Amar Ujala
  - 50 query set

ART: Hindi (contd.)

Results:

| Run id | No. of queries | No. of results | No. of relevant docs | No. of rel. docs retrieved | MAP value |
|---|---|---|---|---|---|
| HH.ism.unstemmed.indri | 50 | 50000 | 2309 | 222 | 0.0173 |
| HH.stemmmedcorpus.unstemmedquery | 50 | 50000 | 2309 | 98 | 0.0026 |
| HH.stemmmedcorpus.stemmedquery | 50 | 50000 | 2309 | 209 | 0.0137 |


Morpheme Extraction Task Participation

- Tool submitted
- Results

MET Tool Submission

- ISMstemmer submitted
- Evaluated at IR Labs, DAIICT, Gujarat
- Tested on 6 languages of South Asian origin
- Gave good results for 3 of these languages

MET Results

1. BENGALI

| Institute | Language | MAP obtained |
|---|---|---|
| Baseline | Bengali | 0.2740 |
| JU | Bengali | 0.3307 |
| DCU | Bengali | 0.3300 |
| IIT-KGP | Bengali | 0.3225 |
| CVPR-Team1 | Bengali | 0.3159 |
| ISM | Bengali | 0.3103 |
| CVPR-Team2+ | Bengali | NA |


MET Results (contd.)

2. GUJARATI

| Institute | Language | MAP obtained |
|---|---|---|
| Baseline | Gujarati | 0.2677 |
| ISM | Gujarati | 0.2824 |

3. MARATHI

| Institute | Language | MAP obtained |
|---|---|---|
| Baseline | Marathi | 0.2320 |
| ISM | Marathi | 0.2797 |
| IIT-B | Marathi | 0.2684 |


MET Results (contd.)

4. ODIA

| Institute | Language | MAP obtained |
|---|---|---|
| Baseline | Odia | 0.1537 |
| IIIT-Bh | Odia | 0.1537 |
| ISM | Odia | 0.1537 |

5. HINDI

| Institute | Language | MAP obtained |
|---|---|---|
| Baseline | Hindi | 0.2821 |
| DCU | Hindi | 0.2963 |
| ISM | Hindi | 0.2793 |


MET Results (contd.)

6. TAMIL

| Institute | Language | MAP obtained |
|---|---|---|
| Baseline | Tamil | NA |
| AUCEG | Tamil | NA |
| ISM | Tamil | NA |

NA: results are not available due to the non-availability of qrels.

Reasons for Underperformance with Hindi

- Overstemming
- Undesired stemming of proper nouns

Overstemming

Overstemming occurs when words that should not be grouped together are reduced to the same stem.

Example:

1. accent, accentual, accentuate (stem word: accent)
2. accept, acceptant, acceptor (stem word: accept)
3. access, accessible, accession (stem word: access)

Due to overstemming, all of these words may be grouped under a single wrong stem: acce
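
The effect can be made concrete by grouping the nine example words by the stem each stemmer assigns. Both stemmers below are hypothetical stand-ins, not ISMstemmer: `crude_stemmer` is deliberately over-aggressive (it keeps only the first four letters) and `ideal_stemmer` simply looks up the intended root.

```python
from collections import defaultdict

WORDS = ["accent", "accentual", "accentuate",
         "accept", "acceptant", "acceptor",
         "access", "accessible", "accession"]


def group_by_stem(words, stemmer):
    """Return {stem: [words reduced to that stem]} for the given stemmer."""
    groups = defaultdict(list)
    for word in words:
        groups[stemmer(word)].append(word)
    return dict(groups)


def crude_stemmer(word):
    # Hypothetical over-aggressive stemmer: keep only the first four letters.
    return word[:4]


def ideal_stemmer(word):
    # Hypothetical reference: map each word to its intended root.
    for root in ("accent", "accept", "access"):
        if word.startswith(root):
            return root
    return word


overstemmed = group_by_stem(WORDS, crude_stemmer)
intended = group_by_stem(WORDS, ideal_stemmer)
print(len(intended), "intended groups;", len(overstemmed), "group after overstemming")
```

For retrieval this means a query about "access" would also match documents about "accent" and "accept", which lowers precision.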

Undesired stemming of proper nouns

- Proper nouns should not be stemmed, as they are not inflected.
- Example: Beijing will get stemmed to Beij.

Conclusion

- ART:
  - English: not satisfactory
  - Hindi: poor
  - Reasons: overstemming, undesired stemming of proper nouns
- MET:
  - Performed well with the Bengali, Gujarati and Marathi languages
  - Matched the baseline with Odia
  - Underperformed with Hindi

Thank you!