abstract - sourceforgemultiword.sourceforge.net/download/presentations_mwe2010/...(bari bari, one...
-
Abstract
• Identification of reduplication, a subtask of multi-word expression (MWE) identification.
• Reduplication is a very productive process at both the grammatical and the semantic levels in Bengali.
• Reduplications are identified in a Bengali corpus of the articles of the noted Indian Nobel laureate Rabindranath Tagore.
• A rule-based approach with two phases: identification of reduplication and semantic analysis.
-
What is Reduplication?
• Repetition of any linguistic unit such as a phoneme, morpheme, word, phrase, clause or the utterance as a whole.
Example: in English, ha-ha, blah-blah; in Bengali, abal-tabal (incoherent).
• Bengali is the richest Indian language in the onomatopoeic and ideophonic category of reduplication, with 2400 words (Chaudhuri et al., 2005).
• Reduplication carries various meanings and helps to identify the mental state of the speaker.
• Two coarse-grained categories:
(a) repetition at the expression level.
(b) repetition at the content or semantic (sense) level.
-
General Classification
• Onomatopoeic expression: khat khat (knock knock)
• Complete reduplication: bara-bara (big big)
• Partial reduplication: thakur-thukur (God)
• Semantic reduplication:
  Synonym: matha-mundu (head)
  Antonym: din-rat (day and night)
  Class representative: cha-paani (snacks)
• Correlative reduplication: maramari (fighting)
-
Expression Level Classification
• Non-sound symbolic words
  Nouns and pronouns: bari bari (from one house to another)
  Adjectives: lal lal phul (red flowers)
  Verbs: bolte bolte (speaking) [Mandatory]; bhebe chinte (thinking) [Optional]
  Adverbs: dhere dhere (slowly)
• Sound words
  chal chal (sound of water falling)
-
Sense Level Reduplication
• Sense of repetition: bachar bachar (every year)
• Sense of plurality: bara bara bari (many big houses)
• Sense of emphatic meaning: lal-lal phul (deep red flowers)
• Sense of completion: kheye deye jabo (after eating)
• Sense of hesitation or softness: hasi-hasi mukh (laughing face)
• Sense of incompleteness of the verb: kotha bolte bolte (talking)
• Sense of corresponding correlative words: maramari (fighting)
• Sense of onomatopoeia: khat khat (knock knock)
-
System Design
Phase 1: Identifying Reduplications
Identify five main cases of reduplication: onomatopoeic, complete, partial, semantic and correlative.
Phase 2: Semantic Analysis
Extract the associated meaning, i.e. the sense-level category of each reduplication.
-
Phase 1: System Architecture
[Diagram components: Bengali Corpus, Tokenizer, Rule-based Identifier, Classifier, Set of Inflections, Dictionary]
-
Components of the Architecture
Corpus: articles (novels, stories, dramas) of Rabindranath Tagore [http://www.rabindra-rachanabali.nltr.org]
Tokenizer: separates words on blank space or special symbols (such as hyphens and exclamation marks) to identify two consecutive words.
Rule-based Identifier: consecutive tokens are passed to it to verify, using different algorithms, whether or not they are reduplicated words.
Classifier: classifies reduplications at the expression level.
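The tokenizer step can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes romanized input rather than Bengali script, the separator set is a guess at "special symbols", and the names `tokenize` and `consecutive_pairs` are hypothetical.

```python
import re

# Assumed separator set: whitespace plus a few "special symbols"
# (hyphen, exclamation mark, common punctuation).
SEPARATORS = re.compile(r"[\s\-!?,.;:]+")


def tokenize(text):
    """Split text on blank space or special symbols, dropping empty pieces."""
    return [tok for tok in SEPARATORS.split(text) if tok]


def consecutive_pairs(tokens):
    """Return every pair of adjacent tokens for the rule-based identifier."""
    return list(zip(tokens, tokens[1:]))


if __name__ == "__main__":
    tokens = tokenize("bari bari lal-lal phul!")
    for first, second in consecutive_pairs(tokens):
        # Flag only the simplest case here: two identical neighbours.
        print(first, second, "(identical)" if first == second else "")
```

Splitting on the hyphen means that hyphenated and space-separated reduplications reach the identifier in the same two-token form; closed forms such as maramari stay as a single token and need a word-internal check.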
-
Components of the Architecture
Dictionary: includes the lexicon and the associated semantics. The system uses both Bengali-to-Bengali (monolingual) and Bengali-to-English (bilingual) dictionaries.
Set of inflections:
0(����), �(-�, -), -�(-��), -��, -��(-���), -�, -��(��), ���, -���, -��, -�, -����, -�, -�
-
Brief Classification Algorithms
• Complete: the two words are compared for exact equality.
• Partial: three cases: (i) the first vowel attached to the first consonant changes, (ii) the first consonant itself changes, or (iii) both the matra (vowel sign) and the consonant change.
Exception: abal-tabal (incoherent). [Solution: only a fixed set of consonants can be produced by the change (S. K. Chattopadhyay, 1992).]
• Onomatopoeic: after the inflection is removed, the word is divided into two equal halves, which are then compared.
• Correlative: the formative affixes '-a' and '-i' are added to the root to form the first and second words respectively, and the two are agglutinated.
• Semantic: a dictionary-based approach using the set of inflections listed above.
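The rules above might look like this in code. This is a hedged sketch over romanized transliterations, so "first consonant" and "first vowel" are approximated with Latin letters rather than Bengali consonants and matras. The affixes `a` and `i` are an assumption inferred from the example maramari, and the inflection list and `lexicon` are hypothetical stand-ins for the system's real resources.

```python
VOWELS = set("aeiou")


def strip_inflection(word, inflections=()):
    """Remove the longest matching suffix from a (hypothetical) inflection list."""
    for suffix in sorted(inflections, key=len, reverse=True):
        if suffix and word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)]
    return word


def is_complete(first, second):
    """Complete reduplication: the two words are exactly equal."""
    return first == second


def split_onset(word):
    """Split into (initial consonants, first vowel run, remainder)."""
    i = 0
    while i < len(word) and word[i] not in VOWELS:
        i += 1
    j = i
    while j < len(word) and word[j] in VOWELS:
        j += 1
    return word[:i], word[i:j], word[j:]


def is_partial(first, second):
    """Partial: only the first consonant and/or the first vowel differ."""
    c1, v1, rest1 = split_onset(first)
    c2, v2, rest2 = split_onset(second)
    return bool(rest1) and rest1 == rest2 and (c1, v1) != (c2, v2)


def is_onomatopoeic(word, inflections=()):
    """Closed form: strip the inflection, halve the stem, compare the halves."""
    stem = strip_inflection(word, inflections)
    half, odd = divmod(len(stem), 2)
    return half > 0 and odd == 0 and stem[:half] == stem[half:]


def is_correlative(word, first_affix="a", second_affix="i"):
    """Closed form root + first_affix + root + second_affix (affixes assumed)."""
    root_len, odd = divmod(len(word) - len(first_affix) - len(second_affix), 2)
    if root_len <= 0 or odd:
        return False
    root = word[:root_len]
    return word == root + first_affix + root + second_affix


def semantic_relation(first, second, lexicon):
    """Dictionary lookup; `lexicon` is a toy stand-in for the real dictionaries."""
    entry = lexicon.get(first, {})
    for relation in ("synonym", "antonym"):
        if second in entry.get(relation, ()):
            return relation
    return None
```

With a toy lexicon such as `{"matha": {"synonym": {"mundu"}}, "din": {"antonym": {"rat"}}}`, the pair (matha, mundu) comes back as a synonym and (din, rat) as an antonym. The consonant restriction mentioned for the abal-tabal exception is not modelled here, because the permitted consonants are not recoverable from the slide.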
-
Phase 2: Semantic (Sense) Analysis
• Correspondence between general and sense-level reduplications:
Reduplication          Semantics (sense)
Onomatopoeic           onomatopoeia
Semantic or partial    completion
Correlative            corresponding correlative words
Complete               repetition / hesitation, softness
• Problem for sense disambiguation of complete reduplication: multiple senses depending on the context.
• The system identifies some related words such as kara (to do), bhaba (to think), mato (like) and laga (to feel) for disambiguation.
• These are not enough to disambiguate the sense of the phrase.
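The correspondence can be pictured as a simple lookup. The sketch below is an illustration only: the names `SENSES`, `CUE_WORDS`, `candidate_senses` and `has_cue` are hypothetical, and the cue words are romanized. The slide does not say which sense each cue word selects, so the code only reports whether a cue is present.

```python
# Category-to-sense correspondence as listed on the slide. A list with more
# than one entry means the sense stays ambiguous without further context.
SENSES = {
    "onomatopoeic": ["onomatopoeia"],
    "semantic": ["completion"],
    "partial": ["completion"],
    "correlative": ["corresponding correlative words"],
    "complete": ["repetition", "hesitation or softness"],
}

# Related words named on the slide (romanized): to do, to think, like, to feel.
CUE_WORDS = {"kara", "bhaba", "mato", "laga"}


def candidate_senses(category):
    """Return the sense(s) associated with a reduplication category."""
    return SENSES.get(category, [])


def has_cue(context_tokens):
    """True if any token in the surrounding context is a known cue word."""
    return any(token in CUE_WORDS for token in context_tokens)
```

For a complete reduplication, `candidate_senses("complete")` returns two candidates, which is exactly the ambiguity the slide reports.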
-
Experimental Results
• The collected corpus includes 14,810 tokens for 3,675 distinct word forms at the root level.
• Metrics:
  IR metrics: Precision, Recall, F-score.
  Frequency measurements of each class.
  Hyphen and closed-form count.
• Evaluation:
Reduplication Precision Recall F-score
Onomatopoeic 99.85 99.77 99.79
Complete 99.98 99.92 99.95
Partial 79.15 75.80 77.44
Semantic 85.20 82.26 83.71
Correlative 99.91 99.73 99.82
System 92.82 91.50 92.15
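For reference, the F-score is the harmonic mean of precision and recall. The sketch below recomputes it from the table and also checks one reading of the System row: that it is an unweighted (macro) average of the five class rows. That reading is an inference from the numbers, not something the slides state, and the function names are illustrative.

```python
# (precision, recall) per class, copied from the evaluation table.
SCORES = {
    "Onomatopoeic": (99.85, 99.77),
    "Complete": (99.98, 99.92),
    "Partial": (79.15, 75.80),
    "Semantic": (85.20, 82.26),
    "Correlative": (99.91, 99.73),
}


def f_score(precision, recall):
    """Harmonic mean of precision and recall (balanced F-score)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_average(scores):
    """Unweighted mean of precision and of recall over all classes."""
    n = len(scores)
    precision = sum(p for p, _ in scores.values()) / n
    recall = sum(r for _, r in scores.values()) / n
    return precision, recall


if __name__ == "__main__":
    for name, (p, r) in SCORES.items():
        print(f"{name:13s} F = {f_score(p, r):.2f}")
    p, r = macro_average(SCORES)
    print(f"{'Macro':13s} P = {p:.2f}  R = {r:.2f}  F = {f_score(p, r):.2f}")
```

Under this assumption the macro-averaged precision and recall round to 92.82 and 91.50, matching the System row.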
-
Error Analysis
[Bar chart: Precision, Recall and F-Score on a 0 to 100 scale]
• Partial and semantic evaluation scores are not satisfactory because of some wrong tagging by the shallow parser.
• Some synonymous reduplications (dhire-susthe, slowly and steadily) convey a sense similar to that of the first word without being its exact synonym. These words are not identified properly because Bengali lacks lexical resources such as WordNet.
-
Frequency Hypothesis
• Frequency is an important indication of whether a compound is an MWE.
• 8.52% of reduplications are hyphenated.
• 33.09% of reduplications are closed (written as one word); most of these are onomatopoeic, correlative and semantic reduplications.
• 100% of correlative reduplications and most onomatopoeic reduplications are closed.
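Counts like these could be gathered with a small helper. This is a minimal sketch under stated assumptions: the label "open" for space-separated forms and the function names are hypothetical, and the sample list is toy data, not the corpus.

```python
from collections import Counter


def surface_form(expression):
    """Classify how a reduplication is written: hyphenated, open or closed."""
    if "-" in expression:
        return "hyphenated"
    if " " in expression:
        return "open"
    return "closed"


def form_percentages(expressions):
    """Percentage of expressions in each surface form."""
    counts = Counter(surface_form(e) for e in expressions)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {form: 100.0 * n / total for form, n in counts.items()}


if __name__ == "__main__":
    sample = ["din-rat", "maramari", "khat khat", "bari bari"]  # toy data
    print(form_percentages(sample))
```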
[Pie chart: Frequency Analysis. Legend: Onomatopoeic, Complete, Partial, Semantic, Correlative. Slice labels as extracted: 8.51, 51.06, 26.6, 12.7, 18.08]
-
Conclusion
• Reduplication is mainly used for emphasis, generality, intensity, or to show continuation of an act.
• The semantics of reduplicated words call for a kind of sense disambiguation that rule-based analysis alone cannot handle.
• Further research: stylometric analysis of authors and plagiarism detection.