A Corpus for Entity Profiling in Microblog Posts
DESCRIPTION
Microblogs have become an invaluable source of information for online reputation management. Streams of microblog posts are of great value because of their direct and real-time nature. An emerging problem is to identify not only the microblog posts (such as tweets) that are relevant for a given entity, but also the specific aspects that people discuss. Determining such aspects can be non-trivial because of creative language usage, the highly contextualized and informal nature of microblog posts, and the limited length of this form of communication. In this paper we present two manually annotated corpora for evaluating the task of identifying aspects on Twitter, both based on the WePS-3 ORM task dataset and made available online. The first is created using a pooling methodology, for which we have implemented various methods for automatically extracting aspects from tweets that are relevant for an entity. Human assessors have labeled each of the candidates as relevant or not. The second corpus is more fine-grained and contains opinion targets. Here, annotators consider individual tweets related to an entity and manually identify whether the tweet is opinionated and, if so, which part of the tweet is subjective and what the target of the sentiment is, if any.
TRANSCRIPT
A Corpus for Entity Profiling in Microblog Posts
UNED NLP & IR Group
Madrid, Spain
ISLA, University of Amsterdam
Amsterdam, The Netherlands
LREC Workshop on Language Engineering for Online Reputation Management
May 26th, 2012 - Istanbul, Turkey
Edgar Meij, Andrei Oghina, Minh T. Bui, Mathias Breuss, Maarten de Rijke, Damiano Spina
Introduction
• Online Reputation Management
– Public image of an entity in Online Media
– Entity = { brand, organization, company, person, product }
• Microblogging services (e.g. Twitter)
– People sharing thoughts about an entity
– Dynamic, Real-Time
• Human Language Technologies
– Aid to reputation managers
– Retrieval and Analysis of entity mentions
Sentiment vs. Profiling
• Sentiment analysis
• Entity Profiling
– “hot” topics that people talk about in the context of an entity
Our task: Aspect identification
• @xbox_news here we go again, microsoft being jealous of sony again.
• I lov big Sony headphones .. I lov my #music 2 b more beautiful
• not surprising that @graypowell was out and about - he used to be a ’Field Verification & Operator Acceptance Engineer’ at Sony
Goal
• Build manually annotated corpora
– Evaluate the task of entity profiling in microblog streams
A Corpus for Entity Profiling in Microblog Posts
WePS-3 ORM Corpus
• Collection of tweets
• Disambiguated company names (e.g. apple fruit vs. Apple Inc.)
Pooling → Aspects
Tweet annotation → Opinion targets
Approach I: Pooling aspects
• Pooling methodology
– 4 Ranking Methods:
• TF.IDF [Salton and Buckley, 1988]
• Log-Likelihood Ratio [Dunning, 1993]
• Parsimonious Language Model [Hiemstra et al. 2004]
• Opinion target extraction using topic-specific subjective lexicons [Jijkoun et al. 2010]
– Top 10 terms
• Manual annotation
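The pooling step above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes tweets are already tokenized, implements only a simple TF.IDF ranker, and uses a smoothed frequency ratio as a stand-in for a second ranker (it is not one of the four methods listed on the slide). All function names and the toy data are hypothetical.

```python
import math
from collections import Counter


def tfidf_scores(entity_tweets, background_tweets):
    """Score terms from the entity's tweets by TF.IDF.

    TF is the term count over the entity's tweets; IDF is computed over all
    tweets (entity + background), treating each tweet as one document.
    """
    all_tweets = entity_tweets + background_tweets
    n_docs = len(all_tweets)
    df = Counter(term for tweet in all_tweets for term in set(tweet))
    tf = Counter(term for tweet in entity_tweets for term in tweet)
    return {term: tf[term] * math.log(n_docs / df[term]) for term in tf}


def frequency_ratio_scores(entity_tweets, background_tweets):
    """Stand-in second ranker (assumption): smoothed relative-frequency ratio."""
    fg = Counter(term for tweet in entity_tweets for term in tweet)
    bg = Counter(term for tweet in background_tweets for term in tweet)
    fg_total = sum(fg.values())
    bg_total = sum(bg.values())
    vocab_size = len(set(fg) | set(bg))
    return {
        term: (fg[term] / fg_total) / ((bg[term] + 1) / (bg_total + vocab_size))
        for term in fg
    }


def top_k(scores, k=10):
    """Return the k highest-scoring terms (ties broken alphabetically)."""
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]


def pool(rankings):
    """Merge several top-k lists into one candidate pool, keeping first-seen order."""
    pooled = {}
    for ranking in rankings:
        for term in ranking:
            pooled.setdefault(term, None)
    return list(pooled)


# Toy, made-up data: tokenized tweets about one entity and a background sample.
entity_tweets = [
    ["sony", "headphones", "music"],
    ["sony", "headphones", "great"],
    ["sony", "camera"],
]
background_tweets = [["music", "great", "day"], ["great", "day", "today"]]

candidates = pool([
    top_k(tfidf_scores(entity_tweets, background_tweets), k=2),
    top_k(frequency_ratio_scores(entity_tweets, background_tweets), k=2),
])
```

The pooled `candidates` list is what human assessors would then judge term by term; with the slide's setting, `k` would be 10 and there would be four rankers.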
Aspects dataset: annotation example
Aspects dataset: outcome
• Three annotators, substantial agreement (> 0.6 Cohen/Fleiss’ kappa)
• 94 entities, 17775 tweets, ≈177 tweets/entity
• 2455 terms, 1304 aspects (54.11%)
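For readers unfamiliar with the agreement statistic mentioned above, here is a minimal sketch of Cohen's kappa for two annotators. The function name and the toy labels are assumptions for illustration; the slide does not say how the reported agreement was computed, and Fleiss' kappa (the multi-annotator variant) is not shown.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label proportions, summed.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Toy, made-up judgments (1 = term is an aspect, 0 = it is not).
annotator_1 = [1, 1, 0, 0]
annotator_2 = [1, 1, 0, 1]
kappa = cohen_kappa(annotator_1, annotator_2)
```

Kappa corrects raw agreement for the agreement expected by chance, which is why it is preferred over plain percentage agreement when one label dominates.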
Approach II: Tweet annotation
• Opinion targets dataset
• Tweet-level annotation
– Is the tweet subjective?
• Phrase-level annotation
– Subjective phrase
– Opinion target phrase p:
• p is an aspect of the entity
• p is included in a sentence that contains a direct subjective phrase
• p is the target of the expressed opinion
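One way to picture the two annotation levels is as a record per tweet. The sketch below is hypothetical: the class, its field names, and the consistency rules are assumptions made for illustration and do not describe the released corpus format. The labels in the example are invented, not taken from the corpus.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TweetAnnotation:
    """Hypothetical record for one annotated tweet (not the corpus format)."""
    entity: str
    text: str
    is_subjective: bool                       # tweet-level judgment
    subjective_phrase: Optional[str] = None   # phrase-level: subjective phrase
    opinion_target: Optional[str] = None      # phrase-level: target phrase p

    def is_consistent(self) -> bool:
        """Check structural constraints assumed from the scheme above."""
        if not self.is_subjective:
            # A non-subjective tweet carries no phrase-level annotation.
            return self.subjective_phrase is None and self.opinion_target is None
        if self.subjective_phrase is not None and self.subjective_phrase not in self.text:
            return False
        if self.opinion_target is not None:
            # A target needs an accompanying subjective phrase and must occur in the tweet.
            return self.subjective_phrase is not None and self.opinion_target in self.text
        return True


def share_with_targets(annotations):
    """Fraction of tweets that have an opinion target annotated."""
    if not annotations:
        return 0.0
    return sum(a.opinion_target is not None for a in annotations) / len(annotations)


# Invented labels for illustration only.
example = TweetAnnotation(
    entity="Sony",
    text="I lov big Sony headphones .. I lov my #music 2 b more beautiful",
    is_subjective=True,
    subjective_phrase="lov",
    opinion_target="headphones",
)
neutral = TweetAnnotation(entity="Sony", text="he used to work at Sony", is_subjective=False)
```

Whether a target is truly an aspect of the entity and truly the target of the opinion remains a human judgment; the check above only covers the structural part.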
Opinion Targets dataset: annotation example
Opinion targets dataset: outcome
• 59 entities, 9396 tweets, ≈159 tweets/entity
• 15.16% of tweets with subjective phrases
• 13.82% of tweets with opinion targets
Aspects vs. Opinion targets
[Venn diagram: “Aspects” and “Terms in Opinion Targets”; values shown: 1650, 783, 270; percentages shown: 26.69%, 12.67%]
A Corpus for Entity Profiling in Microblog Posts
• Available at http://bitly.com/profilingTwitter
WePS-3 ORM Corpus
Pooling → Aspects dataset
• 94 entities, 17,775 tweets, ≈177 tweets/entity
• 2455 terms, 1304 aspects (54.11%)
Tweet annotation → Opinion targets dataset
• 59 entities, 9,396 tweets, ≈159 tweets/entity
• 15.16% of tweets with subjective phrases
• 13.82% of tweets with opinion targets