p05- dina: a multi-dialect dataset for arabic emotion analysis

DINA: A Multi-Dialect Dataset for Arabic Emotion Analysis

Muhammad Abdul-Mageed1,2, Hassan AlHuzliy1, Duaa’ Abu Elhija1, Mona Diab2

Indiana University1, The George Washington University2

Emotions

• Categories of emotion:

– Ekman (e.g., 1992) proposes that there are 6 basic emotions: anger, disgust, fear, happiness, sadness, and surprise.

– Plutchik (1980, 1985, 1994) adds trust and anticipation.

• Emotion on 3 dimensions:

– e.g., Francisco and Gervas (2006) mark the attributes of pleasantness, activation, and dominance in the genre of fairy tales.

– DINA is focused on the Ekman emotions.

Motivations

• Opinion Mining:

– Provides an enriching component beyond the mere binary valence (i.e., positive and negative) of most sentiment analysis systems.

• Health & Wellness:

– Early detection of certain emotional disorders such as depression.

– Improving the well-being of people by exposing them to desired emotions (since emotion is contagious [Kramer et al., 2014]).

• Education:

– Integrating emotionally-aware agents in intelligent computer-assisted language learning, for example, should prove useful and enhance the naturalness of the pedagogical experience.

Motivations Cont.

• Marketing:

– e.g., emotion-sensitive language generation can help with marketing (Heath et al., 2001; Tan et al., 2014), political campaigning, etc.

• Security:

– Deflect potential hazards and anticipate dangerous behaviors.

• Author Profiling:

– Useful for predicting age and gender (Meina et al., 2013; Flekova and Gurevych, 2013; Farias et al., 2013; Bamman et al., 2014; Forner et al., 2013) and personality (Mohammad and Kiritchenko, 2013).

Related Work

• SemEval-2007 Affective Text task (Strapparava and Mihalcea, 2007) [SEM07]:

– Collection and classification of emotion and valence in news headlines.

• Aman and Szpakowicz (2007):

– Annotation and detection of emotions from blogs.

• Qadir and Riloff (2014), Mohammad (2012), Wang et al. (2012):

– Use hashtags as an approximation of emotion categories to collect emotion data.

Arabic: Motivations

• Morphologically Rich Language:

– Highly inflected: person, number, gender, case, mood, aspect, voice.

• Strategic Language:

– One of the 6 official languages of the UN, with ~300M speakers worldwide.

• Exponential Web growth:

– More than 2000% growth rate on the Web from 2010 onwards (www.internetworldstats.com).

Arabic Dialects

Data Collection

• We crawled Twitter data using a seed set of fewer than 10 phrases for each of the six Ekman emotion types.

• Each phrase is composed of an emotion word (e.g., “happy”) and the first-person pronoun “I”.

• We collect only tweets where a seed phrase occurs in the tweet body text.

• This approach does not depend on hashtags.

• We collect 500 tweets from each of the 6 emotion types. Total = 3,000.

• Seeds capture various Arabic dialects.
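The seed-based filtering described above can be sketched as follows. This is a minimal illustration, not the authors' actual crawler: the function name `collect_by_seed`, the `SEEDS` dictionary, and the example phrases are all assumptions made for this sketch (the real seed list is the one referred to in Table 1), and the tweets are assumed to be already available as plain strings.

```python
from collections import defaultdict

# Hypothetical seed phrases, for illustration only. Each pairs the
# first-person pronoun with an emotion word; these are NOT the DINA seeds.
SEEDS = {
    "happiness": ["أنا سعيد", "انا فرحان"],
    "sadness": ["أنا حزين", "انا زعلان"],
}


def collect_by_seed(tweets, seeds, per_emotion=500):
    """Group tweets by emotion when a seed phrase occurs in the body text.

    tweets: iterable of tweet body strings.
    seeds: dict mapping an emotion name to a list of seed phrases.
    per_emotion: maximum number of tweets kept for each emotion.
    """
    collected = defaultdict(list)
    for text in tweets:
        for emotion, phrases in seeds.items():
            if len(collected[emotion]) >= per_emotion:
                continue  # this emotion already reached its quota
            # Match on the body text itself; hashtags play no special role.
            if any(phrase in text for phrase in phrases):
                collected[emotion].append(text)
    return dict(collected)
```

In this sketch a tweet that contains seeds for two emotions is kept under both; whether the original collection handled such cases differently is not stated in the slides.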

Table 1. Example seeds

Annotation

• To verify the utility of this seed-based approach, two college-educated native speakers of Arabic labeled the data.

• For labeling, we use one of four tags from the set {“no-emotion/zero”, “weak-emotion”, “moderate/fair-emotion”, “strong-emotion”}.

• We measure inter-annotator agreement on these intensity labels using Cohen’s Kappa.

• We also calculate the % of emotion-carrying tweets per category (those that did not end up assigned the label “no-emotion/zero”).
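The two measurements above can be sketched in a few lines. This is a minimal sketch using the standard definition of Cohen's Kappa, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance from each annotator's label distribution. The function names and the toy labels in the usage note are assumptions for illustration and are not DINA data.

```python
from collections import Counter

NO_EMOTION = "no-emotion/zero"


def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the annotators' marginal label rates.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


def percent_emotion(labels, zero_label=NO_EMOTION):
    """Percentage of items NOT labeled with the no-emotion tag."""
    return 100.0 * sum(label != zero_label for label in labels) / len(labels)
```

For example, with made-up labels for four tweets, annotator A = [strong, weak, no-emotion/zero, strong] and annotator B = [strong, weak, strong, strong], the observed agreement is 0.75, the chance agreement is 7/16, and kappa is 5/9 (about 0.56). The per-annotator emotion percentages are 75% and 100%; averaging them across the two annotators is one plausible reading of "average percentage of emotion", but the slides do not spell out the averaging.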

DINA: Agreement & % Emotion

Table 3. Agreement in fine-grained annotation and average percentage of emotion

Gold Labels from Happiness Class

Table 2. Agreement in happiness annotation

Examples: Anger

Examples: Disgust

Examples: Fear

Examples: Happiness

Examples: Sadness

Examples: Surprise

Context of No- and Mixed Emotions

• Even with a list of well-crafted seeds, both annotators assign “no-emotion” for 7.5% of the data.

• This is a function of emotion being a pragmatics-level phenomenon.

• Contexts for “no-emotion” include:

– Reported speech

– Sarcasm

Reported Speech

Sarcasm

Conclusion

• Emotion is like other pragmatic-level phenomena; hence a seed-collection approach is useful, but not perfect.

• Phenomena like reported speech and sarcasm interact with our method for emotion data collection.

• DINA is multidialectal, but we do not have exact dialect labels on the tweets.

• DINA is at 3,000 tweets, and we plan to grow the size.

• Full evaluation of DINA is only possible when we build models exploiting these data, which we plan to do.
