Entity Set Expansion in Opinion Documents
Lei Zhang Bing Liu
University of Illinois at Chicago
Introduction to opinion mining
Opinion Mining: the computational study of opinions and sentiments expressed in text.
Why opinion mining now? Mainly because of the Web: we can now get huge volumes of opinionated text.
Why opinion mining is important
Whenever we need to make a decision, we would like to hear others' advice.
In the past: individuals turned to friends or family; businesses used surveys and consultants.
Word of mouth on the Web: people can express their opinions in reviews, forum discussions, blogs, etc.
An example of an opinion mining application
What is an entity in opinion documents
An entity can be a product, service, person, organization, or event in an opinion document.
Basically, opinion mining extracts the opinions expressed on entities and their attributes.
“I bought a Sony camera yesterday, and its picture quality is great.”
Here picture quality is the product attribute; Sony is the entity.
Why we need entity extraction
Without knowing the entity, an opinion has little value.
Companies want to know their competitors in the market. Entity extraction is the first step to understanding the competitive landscape from opinion documents.
Related work
Named entity recognition (NER): aims to identify entities such as names of persons, organizations, and locations in natural language text.
Our problem is similar to the NER problem, but with some differences:
1. Fine-grained entity classes (products, services) rather than coarse-grained entity classes (people, locations, organizations)
2. Only want a specific type: e.g., a particular type of drug names.
3. Neologisms: e.g., “Sammy” (Sony), “SE” (Sony-Ericsson)
4. Feature sparseness (lack of contextual patterns)
5. Data noise (over-capitalization, under-capitalization)
NER methods
Supervised learning methods: the current dominant technique for addressing the NER problem. Hidden Markov Models (HMM), Maximum Entropy models (ME), Support Vector Machines (SVM), Conditional Random Fields (CRF).
Shortcomings: they rely on large sets of labeled examples, and labeling is labor-intensive and time-consuming.
NER methods
Unsupervised learning methods: mainly clustering, gathering named entities from clustered groups based on the similarity of context. These techniques rely on lexical resources (e.g., WordNet), on lexical patterns, and on statistics computed over a large unannotated corpus.
Shortcomings: low precision and recall.
NER methods
Semi-supervised learning methods
They show promise for identifying and labeling entities. Starting with a set of seed entities, semi-supervised methods use either class-specific patterns to populate an entity class or distributional similarity to find terms similar to the seeds.
Specific methods: bootstrapping, co-training, distributional similarity
Our problem is a set expansion problem
To find competing entities, the extracted entities must be relevant, i.e., they must be of the same class/type as the user provided entities.
The user can only provide a few names because there are so many different brands and models.
Our problem is actually a set expansion problem, which expands a set of given seed entities.
Set expansion problem
Given a set Q of seed entities of a particular class C, and a set D of candidate entities, we wish to determine which of the entities in D belong to C. That is, we “grow” the class C based on the set of seed examples Q.
This is a classification problem. However, in practice, the problem is often solved as a ranking problem.
Distributional similarity
Distributional similarity is a classical method for the set expansion problem.
(Basic idea: words with similar meanings tend to appear in similar contexts.)
It compares the distribution of the words surrounding a candidate entity with that of the seed entities, and then ranks the candidate entities by the similarity values.
Our results show this approach is inaccurate.
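To make the baseline concrete, here is a minimal sketch of distributional similarity for set expansion, assuming a toy corpus and simple bag-of-words context vectors compared by cosine similarity (the corpus, the tokenization, and the sentence-level context window are illustrative assumptions, not the authors' setup):

```python
from collections import Counter
import math

def context_vector(sentences, target):
    """Bag of words co-occurring with `target` in the same sentence."""
    vec = Counter()
    for sent in sentences:
        words = sent.lower().split()
        if target in words:
            vec.update(w for w in words if w != target)
    return vec

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

sents = ["i bought the sony camera yesterday",
         "i bought the canon camera last week",
         "the weather was nice yesterday"]
seed_vec = context_vector(sents, "sony")
# a fellow camera brand shares more context with the seed than an unrelated word
print(cosine(seed_vec, context_vector(sents, "canon")) >
      cosine(seed_vec, context_vector(sents, "weather")))  # → True
```

Candidates are then ranked by their similarity to the seeds, which is exactly the step the slides argue is inaccurate in practice.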
Bayesian sets
Bayesian Sets is based on Bayesian inference and was designed specifically for the set expansion problem.
It learns from a seed set (i.e., a positive set P) and an unlabeled candidate set U.
Bayesian sets
Given a data set D and a query Q ⊂ D (the seed set), we aim to rank the elements e ∈ D \ Q by how well they would “fit into” a set which includes Q.
Define a score for each e:
score(e) = p(e | Q) / p(e)
From Bayes' rule, the score can be re-written as:
score(e) = p(e, Q) / (p(e) p(Q))
Bayesian sets
Intuitively, the score compares the probability that e and Q were generated by the same model with the same unknown parameters θ, to the probability that e and Q came from models with different parameters θ and θ′.
Bayesian sets
Compute the following equations (posterior updates from the seed set, for binary features q_ij with Beta(α_j, β_j) priors):
α̃_j = α_j + Σ_{i=1..N} q_ij
β̃_j = β_j + N − Σ_{i=1..N} q_ij
Bayesian sets
The final score can be computed as:
log score(e) = c + Σ_j e_j w_j
where
w_j = log α̃_j − log α_j − log β̃_j + log β_j
c = Σ_j [ log(α_j + β_j) − log(α_j + β_j + N) + log β̃_j − log β_j ]
α and β are hyperparameters obtained from the data.
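The score computation can be sketched as follows, assuming binary feature vectors and Beta-Bernoulli priors set from the empirical feature means; this follows the standard Bayesian Sets recipe, and the toy seed/candidate vectors are illustrative, not the paper's data:

```python
import math

def bayesian_sets_scores(seed_vectors, candidate_vectors, all_vectors, k=2.0):
    """Log of score(e) = p(e, Q) / (p(e) p(Q)) for binary feature vectors,
    with Beta(alpha_j, beta_j) priors set from the empirical feature means."""
    n_feats = len(all_vectors[0])
    N = len(seed_vectors)
    m = [sum(v[j] for v in all_vectors) / len(all_vectors) for j in range(n_feats)]
    alpha = [k * mj for mj in m]                  # alpha_j = k * m_j
    beta = [k * (1 - mj) for mj in m]             # beta_j  = k * (1 - m_j)
    s = [sum(v[j] for v in seed_vectors) for j in range(n_feats)]  # sum_i q_ij
    scores = []
    for x in candidate_vectors:
        log_score = 0.0
        for j in range(n_feats):
            a_t = alpha[j] + s[j]                 # alpha~_j
            b_t = beta[j] + N - s[j]              # beta~_j
            log_score += math.log((alpha[j] + beta[j]) / (alpha[j] + beta[j] + N))
            log_score += math.log(a_t / alpha[j]) if x[j] else math.log(b_t / beta[j])
        scores.append(log_score)
    return scores

seeds = [[1, 0, 1], [1, 0, 0]]          # toy seed set Q
cands = [[1, 0, 1], [0, 1, 0]]          # candidates to rank
scores = bayesian_sets_scores(seeds, cands, seeds + cands)
print(scores[0] > scores[1])  # → True: the seed-like candidate ranks higher
```

Because the score factorizes over features, it reduces to a sparse dot product between the candidate vector and the per-feature weights, which is what makes the method fast.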
According to the Bayesian Sets algorithm, the top-ranked entities should be highly related to the seed set Q.
Direct application of Bayesian sets
For seeds and candidate entities, the feature vector is created as follows:
(1) A set of features is first designed to represent each entity.
(2) For each entity, identify all the sentences in the corpus that contain the entity. Based on the context, produce a single feature vector to represent the entity.
But this produces poor results.
(Reasons: first, Bayesian Sets uses binary features, so multiple occurrences of an entity in the corpus, which give rich contextual information, are not fully exploited; second, the number of seeds is very small, so the result is not reliable.)
Improving Bayesian sets
We propose a more sophisticated way to use Bayesian Sets, consisting of the following two steps:
(1) Feature identification: design a set of features to represent each entity.
(2) Data generation:
(a) multiple feature vectors for an entity; (b) feature reweighting; (c) candidate entity ranking
How to get candidate entities
Part-of-speech (POS) tags: NNP, NNPS, and CD serve as entity indicators.
A phrase (possibly a single word) whose tokens carry a sequence of NNP, NNPS, and CD POS tags is taken as one candidate entity (e.g., “Nokia/NNP N97/CD” becomes the single entity “Nokia N97”).
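A minimal sketch of this candidate extraction, assuming the input has already been POS-tagged into (word, tag) pairs (the tagger itself is outside the slide's scope):

```python
def extract_candidates(tagged_tokens):
    """Group maximal runs of NNP/NNPS/CD-tagged tokens into candidate entities.
    `tagged_tokens` is a list of (word, pos) pairs from any POS tagger."""
    indicator = {"NNP", "NNPS", "CD"}
    candidates, current = [], []
    for word, pos in tagged_tokens:
        if pos in indicator:
            current.append(word)
        elif current:
            candidates.append(" ".join(current))
            current = []
    if current:
        candidates.append(" ".join(current))
    return candidates

tagged = [("I", "PRP"), ("bought", "VBD"), ("a", "DT"),
          ("Nokia", "NNP"), ("N97", "CD"), ("yesterday", "NN")]
print(extract_candidates(tagged))  # → ['Nokia N97']
```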
How to identify features
Like a typical learning algorithm, one has to design a set of features for learning. Our feature set consists of two subsets:
Entity word features (EWF): characterize the words representing the entities themselves. This set of features is completely domain independent. (e.g., “Sony”, “IBM”)
Surrounding word features (SWF): the words surrounding a candidate entity. (e.g., “I bought the Sony tv yesterday”)
Data generation
Because several feature vectors are generated for each candidate entity, this causes problems of feature sparseness and entity ranking.
We propose two techniques to deal with these problems:
(1) Feature reweighting
(2) Candidate entity ranking
Feature reweighting
Recall the feature weight in the Bayesian Sets score:
w_j = log( (α_j + Σ_{i=1..N} q_ij) / α_j ) − log( (β_j + N − Σ_{i=1..N} q_ij) / β_j ), with α_j = k m_j and β_j = k (1 − m_j)
N is the number of items in the seed set; q_ij is feature j of seed entity q_i; m_j is the mean of feature j over all possible entities; k is a scaling factor (we use 1).
In order to make a positive contribution to the final score of entity e, wj must be greater than zero.
Feature reweighting
It means that for feature j to be effective (w_j > 0), the seed data mean must be greater than the candidate data mean on feature j.
Due to idiosyncrasies of the data, there are many high-quality features whose seed data mean is nevertheless less than the candidate data mean.
Feature reweighting
For example, in the drug data set, “prescribe” is a very good entity feature: “prescribe EN/NNP” (EN represents an entity, NNP is its POS tag) strongly suggests that EN is a drug. However, the mean of this feature in the seed set is 0.024, which is less than its candidate set mean of 0.025.
So the feature gets a negative weight, which is worse than having no feature at all!
Feature reweighting
In order to fully utilize all features, we change the original m_j to m′_j = m_j / t, multiplying by a scaling factor 1/t (t > 1) to force all feature weights w_j > 0.
The idea is that we lower the candidate data mean intentionally so that all the features found in the seed data can be utilized.
To determine t, we require Σ_{i=1..N} q_ij / m′_j to be greater than N for all features j.
Identifying high-quality Features
Suppose features A and B have the same feature frequency; they then receive the same feature weight.
But in some cases, all of feature A's counts come from only one entity in the seed set, while feature B's counts come from different entities (e.g., “bought” + “Nokia”, “Motorola”, “SE”).
Feature B is then a better feature than feature A, because it is shared by (or associated with) more entities.
Identifying high-quality Features
We use r to represent the quality of feature j: h is the number of unique seed entities that have the jth feature, and T is the total number of entities in the seed set.
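A sketch of this quality measure under the natural reading r_j = h / T (the slide does not show the formula, so treat the ratio, and the feature names below, as illustrative assumptions):

```python
def feature_quality(feature_entities, T):
    """r_j = h / T: the fraction of the T seed entities exhibiting feature j,
    where h is the number of unique seed entities carrying the feature."""
    return {j: len(entities) / T for j, entities in feature_entities.items()}

# feature B ("bought_+") is spread over three seeds; feature A over only one
counts = {"bought_+": {"Nokia", "Motorola", "SE"},
          "rare_ctx": {"Nokia"}}
r = feature_quality(counts, T=3)
print(r["bought_+"] > r["rare_ctx"])  # → True
```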
Candidate entity ranking
Each unique candidate entity may generate multiple feature vectors
It is highly desirable to rank those correct and frequent entities at the top
Ranking candidate entities
Md is the median value of all feature vector scores of the candidate entity, n is the candidate entity's frequency, and fs(d) is the final score for the candidate entity.
Additional technique: enlarge the seed sets
Enlarge the seed set using some high-precision syntactic coordination patterns:
EN [or | and] EN
from EN to EN
neither EN nor EN
prefer EN to EN
[such as | especially | including] EN (, EN)* [or | and] EN
EN is the entity name.
e.g., “Nokia and Samsung do not produce smart phones.”
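A sketch of pattern-based seed enlargement using a subset of the coordination patterns above; capitalized tokens stand in for entity names here (real matching would run against the POS-based candidate set, not raw regexes):

```python
import re

def expand_with_patterns(text, seeds):
    """Harvest new entity names that coordinate with a known seed name."""
    name = r"([A-Z][A-Za-z0-9]+)"
    found = set()
    for seed in seeds:
        for pat in (rf"{seed}\s+(?:or|and)\s+{name}",
                    rf"{name}\s+(?:or|and)\s+{seed}",
                    rf"from\s+{seed}\s+to\s+{name}",
                    rf"neither\s+{seed}\s+nor\s+{name}",
                    rf"prefer\s+{seed}\s+to\s+{name}"):
            found.update(re.findall(pat, text))
    return found - set(seeds)

print(expand_with_patterns("Nokia and Samsung do not produce smart phones",
                           {"Nokia"}))  # → {'Samsung'}
```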
Additional technique: bootstrapping Bayesian set
This strategy again tries to find more seeds, but using Bayesian Sets itself, running the algorithm iteratively.
At the end of each iteration, pick up the k top-ranked entities (k = 5 in our experiments).
The iteration ends when no new entity is added to the current seed list.
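The loop can be sketched as follows; the score threshold is an assumption (the slide only states the top-k pickup and the stopping condition), and the first-letter `score` function is a toy stand-in for the Bayesian Sets ranker:

```python
def bootstrap(seeds, candidates, score_fn, k=5, threshold=0.0):
    """Re-rank with the current seed set each iteration and absorb the top-k
    entities scoring above `threshold`; stop when nothing new is added."""
    seeds = list(seeds)
    while True:
        pool = [c for c in candidates if c not in seeds]
        ranked = sorted(pool, key=lambda c: score_fn(seeds, c), reverse=True)
        new = [c for c in ranked[:k] if score_fn(seeds, c) > threshold]
        if not new:
            return seeds
        seeds.extend(new)

# toy stand-in for the Bayesian Sets score: share a first letter with any seed
score = lambda seeds, c: max(1.0 if c[0] == s[0] else 0.0 for s in seeds)
print(bootstrap(["Nokia"], ["Nexus", "Nikon", "Apple"], score, k=1))
# → ['Nokia', 'Nexus', 'Nikon']
```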
The whole algorithm
Experimental results
Similar web-based systems:
Google Sets
Boo! Wa!
Experimental results
Thank you