Entity Set Expansion in Opinion Documents
Lei Zhang Bing Liu
University of Illinois at Chicago
Introduction to opinion mining
Opinion Mining: the computational study of opinions and sentiments expressed in text.
Why opinion mining now? Mainly because of the Web: we can now get huge volumes of opinionated text.
Why opinion mining is important
Whenever we need to make a decision, we would like to hear others' advice.
In the past: individuals turned to friends or family; businesses used surveys and consultants.
Word of mouth on the Web: people can express their opinions in reviews, forum discussions, blogs, etc.
An example of an opinion mining application
What is an entity in opinion documents
An entity can be a product, service, person, organization, or event in an opinion document.
Basically, opinion mining extracts the opinions expressed on entities and their attributes.
“I bought a Sony camera yesterday, and its picture quality is great.”
Here picture quality is the product attribute; Sony is the entity.
Why we need entity extraction
Without knowing the entity, an opinion has little value.
Companies want to know their competitors in the market. Entity extraction is the first step to understanding the competitive landscape from opinion documents.
Related work
Named entity recognition (NER): aims to identify entities such as names of persons, organizations, and locations in natural language text.
Our problem is similar to the NER problem, but with some differences:
1. Fine-grained entity classes (products, services) rather than coarse-grained entity classes (people, locations, organizations)
2. Only want a specific type: e.g., a particular type of drug names.
3. Neologisms: e.g., “Sammy” (Sony), “SE” (Sony-Ericsson)
4. Feature sparseness (lack of contextual patterns)
5. Data noise (over-capitalization, under-capitalization)
NER methods
Supervised learning methods: the current dominant technique for addressing the NER problem. Hidden Markov Models (HMM), Maximum Entropy models (ME), Support Vector Machines (SVM), Conditional Random Fields (CRF).
Shortcomings: they rely on large sets of labeled examples, and labeling is labor-intensive and time-consuming.
NER methods
Unsupervised learning methods: mainly clustering, gathering named entities from clustered groups based on the similarity of context. These techniques rely on lexical resources (e.g., WordNet), on lexical patterns, and on statistics computed over a large unannotated corpus.
Shortcomings: low precision and recall.
NER methods
Semi-supervised learning methods
They show promise for identifying and labeling entities. Starting with a set of seed entities, semi-supervised methods use either class-specific patterns to populate an entity class or distributional similarity to find terms similar to the seeds.
Specific methods: bootstrapping, co-training, distributional similarity
Our problem is a set expansion problem
To find competing entities, the extracted entities must be relevant, i.e., they must be of the same class/type as the user provided entities.
The user can only provide a few names because there are so many different brands and models.
Our problem is actually a set expansion problem, which expands a set of given seed entities.
Set expansion problem
Given a set Q of seed entities of a particular class C, and a set D of candidate entities, we wish to determine which of the entities in D belong to C. That is, we “grow” the class C based on the set of seed examples Q.
This is a classification problem. However, in practice, the problem is often solved as a ranking problem.
Distributional similarity
Distributional similarity is a classical method for the set expansion problem.
(Basic idea: words with similar meanings tend to appear in similar contexts.)
It compares the distribution of the words surrounding a candidate entity with that of the seed entities, and then ranks the candidate entities by the similarity values.
Our results show this approach is inaccurate.
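To make the baseline concrete, here is a minimal sketch of distributional similarity for set expansion, assuming a toy corpus and simple bag-of-words context vectors compared by cosine similarity (the corpus, the tokenization, and the sentence-level context window are illustrative assumptions, not the authors' setup):

```python
from collections import Counter
import math

def context_vector(sentences, target):
    """Bag of words co-occurring with `target` in the same sentence."""
    vec = Counter()
    for sent in sentences:
        words = sent.lower().split()
        if target in words:
            vec.update(w for w in words if w != target)
    return vec

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

sents = ["i bought the sony camera yesterday",
         "i bought the canon camera last week",
         "the weather was nice yesterday"]
seed_vec = context_vector(sents, "sony")
# a fellow camera brand shares more context with the seed than an unrelated word
print(cosine(seed_vec, context_vector(sents, "canon")) >
      cosine(seed_vec, context_vector(sents, "weather")))  # → True
```

Candidates are then ranked by their similarity to the seeds, which is exactly the step the slides argue is inaccurate in practice.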
Bayesian sets
Bayesian Sets is based on Bayesian inference and was designed specifically for the set expansion problem.
It learns from a seed set (i.e., a positive set P) and an unlabeled candidate set U.
Bayesian sets
Given a data set D and a query Q ⊂ D (the seed set), we aim to rank the elements e ∈ D \ Q by how well they would “fit into” a set which includes Q.
Define a score for each e:
score(e) = p(e | Q) / p(e)
From Bayes' rule, the score can be re-written as:
score(e) = p(e, Q) / (p(e) p(Q))
Bayesian sets
Intuitively, the score compares the probability that e and Q were generated by the same model with the same unknown parameters θ, to the probability that e and Q came from models with different parameters θ and θ′.
Bayesian sets
Compute the following equations (posterior updates from the seed set, for binary features q_ij with Beta(α_j, β_j) priors):
α̃_j = α_j + Σ_{i=1..N} q_ij
β̃_j = β_j + N − Σ_{i=1..N} q_ij
Bayesian sets
The final score can be computed as:
log score(e) = c + Σ_j e_j w_j
where
w_j = log α̃_j − log α_j − log β̃_j + log β_j
c = Σ_j [ log(α_j + β_j) − log(α_j + β_j + N) + log β̃_j − log β_j ]
α and β are hyperparameters obtained from the data.
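The score computation can be sketched as follows, assuming binary feature vectors and Beta-Bernoulli priors set from the empirical feature means; this follows the standard Bayesian Sets recipe, and the toy seed/candidate vectors are illustrative, not the paper's data:

```python
import math

def bayesian_sets_scores(seed_vectors, candidate_vectors, all_vectors, k=2.0):
    """Log of score(e) = p(e, Q) / (p(e) p(Q)) for binary feature vectors,
    with Beta(alpha_j, beta_j) priors set from the empirical feature means."""
    n_feats = len(all_vectors[0])
    N = len(seed_vectors)
    m = [sum(v[j] for v in all_vectors) / len(all_vectors) for j in range(n_feats)]
    alpha = [k * mj for mj in m]                  # alpha_j = k * m_j
    beta = [k * (1 - mj) for mj in m]             # beta_j  = k * (1 - m_j)
    s = [sum(v[j] for v in seed_vectors) for j in range(n_feats)]  # sum_i q_ij
    scores = []
    for x in candidate_vectors:
        log_score = 0.0
        for j in range(n_feats):
            a_t = alpha[j] + s[j]                 # alpha~_j
            b_t = beta[j] + N - s[j]              # beta~_j
            log_score += math.log((alpha[j] + beta[j]) / (alpha[j] + beta[j] + N))
            log_score += math.log(a_t / alpha[j]) if x[j] else math.log(b_t / beta[j])
        scores.append(log_score)
    return scores

seeds = [[1, 0, 1], [1, 0, 0]]          # toy seed set Q
cands = [[1, 0, 1], [0, 1, 0]]          # candidates to rank
scores = bayesian_sets_scores(seeds, cands, seeds + cands)
print(scores[0] > scores[1])  # → True: the seed-like candidate ranks higher
```

Because the score factorizes over features, it reduces to a sparse dot product between the candidate vector and the per-feature weights, which is what makes the method fast.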
According to the Bayesian Sets algorithm, the top-ranked entities should be highly related to the seed set Q.
Direct application of Bayesian sets
For seeds and candidate entities, the feature vector is created as follows:
(1) A set of features is first designed to represent each entity.
(2) For each entity, identify all the sentences in the corpus that contain the entity. Based on the context, produce a single feature vector to represent the entity.
But this produces poor results.
(Reasons: first, Bayesian Sets uses binary features, so multiple occurrences of an entity in the corpus, which give rich contextual information, are not fully exploited; second, the number of seeds is very small, so the result is not reliable.)
Improving Bayesian sets
We propose a more sophisticated way to use Bayesian Sets, consisting of the following two steps:
(1) Feature identification: design a set of features to represent each entity.
(2) Data generation:
(a) multiple feature vectors for an entity; (b) feature reweighting; (c) candidate entity ranking
How to get candidate entities
Part-of-speech (POS) tags: NNP, NNPS, and CD serve as entity indicators.
A phrase (possibly a single word) whose tokens carry a sequence of NNP, NNPS, and CD POS tags is taken as one candidate entity (e.g., “Nokia/NNP N97/CD” becomes the single entity “Nokia N97”).
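A minimal sketch of this candidate extraction, assuming the input has already been POS-tagged into (word, tag) pairs (the tagger itself is outside the slide's scope):

```python
def extract_candidates(tagged_tokens):
    """Group maximal runs of NNP/NNPS/CD-tagged tokens into candidate entities.
    `tagged_tokens` is a list of (word, pos) pairs from any POS tagger."""
    indicator = {"NNP", "NNPS", "CD"}
    candidates, current = [], []
    for word, pos in tagged_tokens:
        if pos in indicator:
            current.append(word)
        elif current:
            candidates.append(" ".join(current))
            current = []
    if current:
        candidates.append(" ".join(current))
    return candidates

tagged = [("I", "PRP"), ("bought", "VBD"), ("a", "DT"),
          ("Nokia", "NNP"), ("N97", "CD"), ("yesterday", "NN")]
print(extract_candidates(tagged))  # → ['Nokia N97']
```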
How to identify features
Like a typical learning algorithm, one has to design a set of features for learning. Our feature set consists of two subsets:
Entity word features (EWF): characterize the words representing the entities themselves. This set of features is completely domain independent. (e.g., “Sony”, “IBM”)
Surrounding word features (SWF): the words surrounding a candidate entity. (e.g., “I bought the Sony tv yesterday”)
Data generation
Because several feature vectors are generated for each candidate entity, this causes problems of feature sparseness and entity ranking.
We propose two techniques to deal with these problems:
(1) Feature reweighting
(2) Candidate entity ranking
Feature reweighting
Recall the feature weight in the Bayesian Sets score:
w_j = log( (α_j + Σ_{i=1..N} q_ij) / α_j ) − log( (β_j + N − Σ_{i=1..N} q_ij) / β_j ), with α_j = k m_j and β_j = k (1 − m_j)
N is the number of items in the seed set; q_ij is feature j of seed entity q_i; m_j is the mean of feature j over all possible entities; k is a scaling factor (we use 1).
In order to make a positive contribution to the final score of entity e, wj must be greater than zero.
Feature reweighting
It means that for feature j to be effective (w_j > 0), the seed data mean must be greater than the candidate data mean on feature j.
Due to idiosyncrasies of the data, there are many high-quality features whose seed data mean is nevertheless less than the candidate data mean.
Feature reweighting
For example, in the drug data set, “prescribe” is a very good entity feature: “prescribe EN/NNP” (EN represents an entity, NNP is its POS tag) strongly suggests that EN is a drug. However, the mean of this feature in the seed set is 0.024, which is less than its candidate set mean of 0.025.
So the feature gets a negative weight, which is worse than having no feature at all!
Feature reweighting
In order to fully utilize all features, we change the original m_j to m′_j = m_j / t, multiplying by a scaling factor 1/t (t > 1) to force all feature weights w_j > 0.
The idea is that we lower the candidate data mean intentionally so that all the features found in the seed data can be utilized.
To determine t, we require Σ_{i=1..N} q_ij / m′_j to be greater than N for all features j.
Identifying high-quality Features
Suppose features A and B have the same feature frequency; they then receive the same feature weight.
But in some cases, all of feature A's counts come from only one entity in the seed set, while feature B's counts come from different entities (e.g., “bought” + “Nokia”, “Motorola”, “SE”).
Feature B is then a better feature than feature A, because it is shared by (or associated with) more entities.
Identifying high-quality Features
We use r to represent the quality of feature j: h is the number of unique seed entities that have the jth feature, and T is the total number of entities in the seed set.
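A sketch of this quality measure under the natural reading r_j = h / T (the slide does not show the formula, so treat the ratio, and the feature names below, as illustrative assumptions):

```python
def feature_quality(feature_entities, T):
    """r_j = h / T: the fraction of the T seed entities exhibiting feature j,
    where h is the number of unique seed entities carrying the feature."""
    return {j: len(entities) / T for j, entities in feature_entities.items()}

# feature B ("bought_+") is spread over three seeds; feature A over only one
counts = {"bought_+": {"Nokia", "Motorola", "SE"},
          "rare_ctx": {"Nokia"}}
r = feature_quality(counts, T=3)
print(r["bought_+"] > r["rare_ctx"])  # → True
```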
Candidate entity ranking
Each unique candidate entity may generate multiple feature vectors
It is highly desirable to rank those correct and frequent entities at the top
Ranking candidate entities
Md is the median value of all feature vector scores of the candidate entity, n is the candidate entity's frequency, and fs(d) is the final score for the candidate entity.
Additional technique: enlarge the seed sets
Enlarge the seed set using some high-precision syntactic coordination patterns:
EN [or | and] EN
from EN to EN
neither EN nor EN
prefer EN to EN
[such as | especially | including] EN (, EN)* [or | and] EN
EN is the entity name.
e.g., “Nokia and Samsung do not produce smart phones.”
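A sketch of pattern-based seed enlargement using a subset of the coordination patterns above; capitalized tokens stand in for entity names here (real matching would run against the POS-based candidate set, not raw regexes):

```python
import re

def expand_with_patterns(text, seeds):
    """Harvest new entity names that coordinate with a known seed name."""
    name = r"([A-Z][A-Za-z0-9]+)"
    found = set()
    for seed in seeds:
        for pat in (rf"{seed}\s+(?:or|and)\s+{name}",
                    rf"{name}\s+(?:or|and)\s+{seed}",
                    rf"from\s+{seed}\s+to\s+{name}",
                    rf"neither\s+{seed}\s+nor\s+{name}",
                    rf"prefer\s+{seed}\s+to\s+{name}"):
            found.update(re.findall(pat, text))
    return found - set(seeds)

print(expand_with_patterns("Nokia and Samsung do not produce smart phones",
                           {"Nokia"}))  # → {'Samsung'}
```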
Additional technique: bootstrapping Bayesian set
This strategy again tries to find more seeds, but using Bayesian Sets itself, running the algorithm iteratively.
At the end of each iteration, pick up the k top-ranked entities (k = 5 in our experiments).
The iteration ends when no new entity is added to the current seed list.
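The loop can be sketched as follows; the score threshold is an assumption (the slide only states the top-k pickup and the stopping condition), and the first-letter `score` function is a toy stand-in for the Bayesian Sets ranker:

```python
def bootstrap(seeds, candidates, score_fn, k=5, threshold=0.0):
    """Re-rank with the current seed set each iteration and absorb the top-k
    entities scoring above `threshold`; stop when nothing new is added."""
    seeds = list(seeds)
    while True:
        pool = [c for c in candidates if c not in seeds]
        ranked = sorted(pool, key=lambda c: score_fn(seeds, c), reverse=True)
        new = [c for c in ranked[:k] if score_fn(seeds, c) > threshold]
        if not new:
            return seeds
        seeds.extend(new)

# toy stand-in for the Bayesian Sets score: share a first letter with any seed
score = lambda seeds, c: max(1.0 if c[0] == s[0] else 0.0 for s in seeds)
print(bootstrap(["Nokia"], ["Nexus", "Nikon", "Apple"], score, k=1))
# → ['Nokia', 'Nexus', 'Nikon']
```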
The whole algorithm
Experimental results
Similar web-based systems:
Google Sets
Boo! Wa!
Experimental results
Thank you