A Corpus for Entity Profiling in Microblog Posts

UNED NLP & IR Group

Madrid, Spain

ISLA, University of Amsterdam

Amsterdam, The Netherlands

LREC Workshop on Language Engineering for Online Reputation Management

May 26th, 2012 - Istanbul, Turkey

Edgar Meij, Andrei Oghina, Minh T. Bui, Mathias Breuss, Maarten de Rijke, Damiano Spina

Introduction

• Online Reputation Management

– Public image of an entity in Online Media

– Entity = { brand, organization, company, person, product }

• Microblogging services (e.g. Twitter)

– People sharing thoughts about an entity

– Dynamic, Real-Time

• Human Language Technologies

– Aid to reputation managers

– Retrieval and Analysis of entity mentions

Sentiment vs. Profiling

• Sentiment analysis

• Entity Profiling

– “Hot” topics that people talk about in the context of an entity

Our task: Aspect identification

• @xbox_news here we go again, microsoft being jealous of sony again.

• I lov big Sony headphones .. I lov my #music 2 b more beautiful

• not surprising that @graypowell was out and about - he used to be a ‘Field Verification & Operator Acceptance Engineer’ at Sony

Goal

• Build manually annotated corpora

– Evaluate the task of entity profiling in microblog streams


WePS-3 ORM Corpus

• Collection of tweets

• Disambiguated company names (e.g. apple fruit vs. Apple Inc.)

• Two datasets derived from the corpus:

– Pooling → Aspects

– Tweet annotation → Opinion targets

Approach I: Pooling aspects

• Pooling methodology

– 4 Ranking Methods:

• TF.IDF [Salton and Buckley, 1988]

• Log-Likelihood Ratio [Dunning, 1993]

• Parsimonious Language Model [Hiemstra et al. 2004]

• Opinion target extraction using topic-specific subjective lexicons [Jijkoun et al. 2010]

– Top 10 terms

• Manual annotation
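The pooling step above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: only two of the four rankers are shown (TF.IDF and a log-likelihood ratio over a 2x2 contingency table), tweets are assumed to be tokenized already, and the function names (`tf_idf_scores`, `llr_scores`, `top_k`, `pool`) and the toy data are hypothetical.

```python
import math
from collections import Counter

def tf_idf_scores(entity_tokens, background_docs):
    """TF.IDF of each term in one entity's tweets against background documents."""
    tf = Counter(entity_tokens)
    doc_sets = [set(doc) for doc in background_docs]
    n_docs = len(doc_sets) + 1  # background documents plus the entity itself
    return {
        term: freq * math.log(n_docs / (1 + sum(term in d for d in doc_sets)))
        for term, freq in tf.items()
    }

def llr_scores(entity_tokens, background_docs):
    """Log-likelihood ratio (G^2) for terms over-represented in the entity's tweets."""
    fg = Counter(entity_tokens)
    bg = Counter(tok for doc in background_docs for tok in doc)
    A, B = sum(fg.values()), sum(bg.values())
    N = A + B
    scores = {}
    for term, a in fg.items():
        b = bg[term]
        if a * B <= b * A:  # keep only terms relatively more frequent in the foreground
            continue
        # (observed count, row total, column total) for the four cells of the table
        cells = [(a, a + b, A), (b, a + b, B),
                 (A - a, N - a - b, A), (B - b, N - a - b, B)]
        scores[term] = 2 * sum(k * math.log(k * N / (row * col))
                               for k, row, col in cells if k > 0)
    return scores

def top_k(scores, k=10):
    """Top-k terms by score; ties are broken alphabetically for determinism."""
    return [t for t, _ in sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:k]]

def pool(score_dicts, k=10):
    """Union of the top-k terms of every ranker: the candidates shown to annotators."""
    pooled = set()
    for scores in score_dicts:
        pooled.update(top_k(scores, k))
    return sorted(pooled)

# Toy data (hypothetical): one entity's tokens and two background documents.
entity = "sony headphones sony music headphones great".split()
background = ["apple iphone great".split(), "nokia phone music great".split()]
candidates = pool([tf_idf_scores(entity, background),
                   llr_scores(entity, background)], k=3)
print(candidates)  # ['headphones', 'music', 'sony']
```

Pooling the union of several rankers, rather than trusting a single one, keeps the manual annotation effort bounded (at most 10 terms per ranker and entity) while reducing the bias toward any one method.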

Aspects dataset: annotation example

Aspects dataset: outcome

• Three annotators, substantial agreement (> 0.6 Cohen/Fleiss’ kappa)

• 94 entities, 17,775 tweets, ≈177 tweets/entity

• 2,455 terms, 1,304 aspects (54.11%)
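For reference, pairwise agreement of the kind reported above can be computed as in the sketch below. It covers only Cohen's kappa for two annotators (Fleiss' kappa, for more than two, is not shown). The function name `cohen_kappa` and the toy labels are hypothetical and are not taken from the corpus.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    # Chance agreement: from each annotator's own label distribution.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:  # both annotators used one identical label throughout
        return 1.0
    return (observed - expected) / (1 - expected)

# Toy example: 1 = "term is an aspect", 0 = "term is not an aspect".
annotator_1 = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
annotator_2 = [1, 1, 0, 0, 1, 0, 1, 0, 0, 1]
print(round(cohen_kappa(annotator_1, annotator_2), 2))  # 0.6
```

In the toy example the annotators agree on 8 of 10 terms (observed 0.8) against a chance level of 0.5, giving kappa = 0.6.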

Approach II: Tweet annotation

• Opinion targets dataset

• Tweet-level annotation

– Is the tweet subjective?

• Phrase-level annotation

– Subjective phrase

– Opinion target phrase p:

• p is an aspect of the entity

• p is included in a sentence that contains a direct subjective phrase

• p is the target of the expressed opinion
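The three criteria for an opinion target phrase can be written as a simple record plus a predicate. The sketch assumes the criteria are meant conjunctively (all three must hold). The class and field names are hypothetical, and the example judgments are illustrative, not taken from the released annotations.

```python
from dataclasses import dataclass

@dataclass
class CandidatePhrase:
    """One phrase p in a tweet, with the three annotator judgments."""
    text: str
    is_aspect_of_entity: bool            # p is an aspect of the entity
    in_subjective_sentence: bool         # sentence has a direct subjective phrase
    is_target_of_opinion: bool           # p is what the opinion is about

def is_opinion_target(p: CandidatePhrase) -> bool:
    """Assumed reading: p is an opinion target only if all three criteria hold."""
    return (p.is_aspect_of_entity
            and p.in_subjective_sentence
            and p.is_target_of_opinion)

# Illustrative judgments for "I lov big Sony headphones" (entity: Sony).
headphones = CandidatePhrase("headphones", True, True, True)
music = CandidatePhrase("#music", False, True, True)
print(is_opinion_target(headphones), is_opinion_target(music))  # True False
```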

Opinion targets dataset: annotation example

Opinion targets dataset: outcome

• 59 entities, 9,396 tweets, ≈159 tweets/entity

• 15.16% of tweets with subjective phrases

• 13.82% of tweets with opinion targets

Aspects vs. Opinion targets

[Venn diagram: “Aspects” and “Terms in Opinion Targets”; region labels 1650, 783, 270; percentages 26.69% and 12.67%]


WePS-3 ORM Corpus

• Pooling → Aspects dataset

– 94 entities, 17,775 tweets, ≈177 tweets/entity

– 2,455 terms, 1,304 aspects (54.11%)

• Tweet annotation → Opinion targets dataset

– 59 entities, 9,396 tweets, ≈159 tweets/entity

– 15.16% of tweets with subjective phrases

– 13.82% of tweets with opinion targets

• Available at http://bitly.com/profilingTwitter
