Information Extraction based Approach to Summarize Social Interactions
Thesis submitted in partial fulfillment
of the requirements for the degree of
Master of Science (by Research)
in
Computer Science and Engineering
by
Sudheer Kovelamudi
200602014
Search and Information Extraction Lab
International Institute of Information Technology
Hyderabad - 500 032, INDIA
July 2012
Copyright © Sudheer Kovelamudi, 2012
All Rights Reserved
International Institute of Information Technology
Hyderabad, India
CERTIFICATE
It is certified that the work contained in this thesis, titled “Information Extraction based
Approach to summarize Social Interactions” by Sudheer Kovelamudi, has been carried out
under my supervision and is not submitted elsewhere for a degree.
Date Advisor: Dr. Vasudeva Varma
To my loving mother, father and sister.
Acknowledgments
I thank Dr. Vasudeva Varma, my advisor and thesis guide, for inspiring and supporting me
in carrying out my research. Our fruitful conversations helped in weighing our ideas and research
perspectives at work. He is responsible for instilling an exquisite taste for research in me and
for my becoming a passionate researcher.
I sincerely thank my seniors Kiran, Praveen, Sai Krishna, Kranthi, VijayBharath and my
friend Phani Gadde for sparing their time to give me valuable suggestions. I am very
thankful to Dr. Ravi Vijayaraghavan, Sethu Ramalingam and Kranthi Adusumilli for their inputs
during our weekly meetings and for providing a link to current industry research so that my work
can be used effectively. I truly acknowledge the efforts of my juniors Arpit and Ajay in helping me
complete my thesis. I thank all my fellow researchers in SIEL for creating such a
productive ambiance all through my research track.
I thank Dr. Suresh Purini for granting me travel fund to attend IJCNLP 2011 in Thailand
and present my work. I sincerely thank all the reviewers of my work at IJCNLP 2011 for
providing their valuable feedback.
I enjoyed all my days at IIITH, from the beginning of my B.Tech, in all proportions of academics
and extracurricular activities. I thank my fellow football players for providing such energetic
and positive times during the evenings. I appreciate the time I spent during my MS with Pruthvi,
Akhilesh, Abhilash, Santosh and my other fellow CSD students. I take this as an opportunity to thank
my 2k6-batch mates and all my friends on campus with whom I shared my happiness and grief.
I cherish the time I spent with them all through my life.
Abstract
With the advent of Web 2.0, the Internet has been taken over by a group of applications that
facilitate the creation and exchange of user generated content. Such groups are collectively
represented as Social Media, which changed the face of communication between individuals.
Blogs, Internet forums, user reviews, chats, activities on social networking sites, etc., are some
of the notable online forms of current social media. As the content of social interactions on these
social media forms increases rapidly, it leads to the problem of information overload. A user
may find it difficult to go through millions of lines of text from different users to grasp the status of a
topic. This problem can be dealt with wisely by presenting the crux of the content rather than the whole
user generated text. Automatic extraction of important topics or attributes, and summarization
of these topics, may help in saving considerable human effort in understanding the content.
Automatic summarization has always been a classical solution to the information overload problem.
Summarization of news articles has been explored from theory to building satisfactory models,
but summarization of user generated content has not received much attention. This may
be because the amount of user content on the Internet was once not so significant but later
increased exponentially. In this thesis we study the text in user generated content, focusing on
its summarization using new extraction methodologies.
We initially focus on deriving information extraction techniques for the social interactions
domain. We identify extraction of important topics or attributes of a discussion as the first step
towards successful summarization of social interactions. Many methods have been developed
for extraction from structured text using natural language processing tools, but not much has
been done for information extraction from unstructured social media text. Hence we first focus
on deriving extraction methodologies which can later enhance summarization. We refrained
from using language processing resources in our extraction procedures to ensure their success
in all domains of social interactions rather than only the selected domain of experimentation.
We made use of knowledge from external resources like Wikipedia and the Web to enhance
extraction quality.
We used machine learning (regression) in judging the extracted output. We chose the scenario
of online customer reviews for testing our extraction engine. As the online retail market
is growing immensely, it presents a large arena of products, their descriptions, and the customer
and professional reviews that pertain to them. Reviews contain useful opinionated information about
products and their attributes. Most of the product attribute extraction techniques in the literature
work on structured descriptions using several text analysis tools. However, attributes in these
descriptions are limited compared to those in the customer reviews of a product, where users
discuss deeper and more specific attributes. In this thesis, we propose a novel supervised
domain independent model for product attribute extraction from user reviews. User generated
content contains unstructured and semi-structured text, where conventional language-grammar-dependent
tools like part-of-speech taggers, named-entity recognizers and parsers do not perform
at their expected levels. We used Wikipedia and the Web to identify product attributes from
customer reviews.
In later parts of this thesis we focus on summarization of user generated content with
help from our extraction modules. Our summarization work can be classified as extractive
summarization, where text units from the original content are used in summary production.
The text units we choose for a summary are sentence-level units. Our trained system picks
sentences from the content and then ranks them to produce a summary. Sentence ranking is done
by estimating sentence importance through a combination of word-level and sentence-level
features. We chose a scenario where summarization is very much needed. Sales/service chats
in present-day E-commerce are crucial for customer support and the growth of a company.
These chats should be carefully analyzed and followed up for product, service, agent and customer
validation. Hence summarization of these chats can minimize considerable human effort. Here,
we suggest a novel approach to effectively summarize sales/service chats by analyzing their
structure and using Wikipedia. Our system outperformed classic text summarization systems
when applied to social interactions.
Contents
Chapter Page
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Social Media and Social Interactions . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 User Generated Content . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Industry perspective . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4 Criticality of User Generated Text in mining Social Interactions . . . . . . . . . . 5
1.5 Problem Description and Contributions Made . . . . . . . . . . . . . . . . . . . . 6
1.5.1 Information Extraction from Social Interactions . . . . . . . . . . . . . . . 7
1.5.2 Summarization of Social Interactions . . . . . . . . . . . . . . . . . . . . . 8
1.6 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.7 Organisation of Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.1 Approaches of mining in social interactions . . . . . . . . . . . . . . . . . . . . . 10
2.2 Need for Information Extraction in Social Interactions . . . . . . . . . . . . . . . 10
2.3 Work related to Attribute extraction . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.4 Uses of product attribute extraction . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.5 Extraction to Summarization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.6 Flavours of Summarization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.7 Background related to Summarization of Social interactions . . . . . . . . . . . . 16
3 Information Extraction based approach for summarizing Social interactions . . . . . . 18
3.1 Machine learning and evaluation related background for IE & Summarization . . 18
3.1.1 Support Vector Machines . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.1.2 ROUGE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.2 Information Extraction Framework for Social Interactions . . . . . . . . . . . . . 21
3.2.1 Most Frequent Items-MFI . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.2.2 Context Relation using Wikipedia - CR . . . . . . . . . . . . . . . . . . . 22
3.2.3 Role of surrounding window - SW . . . . . . . . . . . . . . . . . . . . . . 26
3.2.4 Web search engine reference-WR . . . . . . . . . . . . . . . . . . . . . . . 26
3.3 Data for Information Extraction system in Social interactions . . . . . . . . . . . 29
3.4 Product attribute extraction using Wikipedia . . . . . . . . . . . . . . . . . . . . 31
3.5 Product attribute extraction using Wikipedia & Web . . . . . . . . . . . . . . . . 33
3.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.7 IE based Framework for Summarizing Social interactions . . . . . . . . . . . . . . 34
3.7.1 Modeling process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.7.1.1 Summary Generation . . . . . . . . . . . . . . . . . . . . 36
3.7.2 Semantic relation using Wikipedia - SR . . . . . . . . . . . . . . . 37
3.7.3 Prepositional Importance - PF . . . . . . . . . . . . . . . . . . . . 39
3.7.4 Term Frequency - TF . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.7.5 Wiki Frequency - WF . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.8 Data for Summarization of Social interactions . . . . . . . . . . . . . . . . 40
3.9 Experiments related to summarization of consumer sales/service Interactions . . 41
3.9.1 Baseline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.9.2 Comparison in performances of PF, TF, SR & WF . . . . . . . . . 42
3.9.3 Summarization Outcome in terms of different features . . . . . . . 43
4 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
Appendix A: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
List of Figures
Figure Page
2.1 Extraction of templates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.1 Context Relation using Wikipedia . . . . . . . . . . . . . . . . . . . . . . . 25
3.2 WR illustration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.3 WR using Bing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.4 Summarization Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
List of Tables
Table Page
1.1 Differences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
3.1 CR, SW, MFI as the features . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.2 CR, SW, MFI, WR as the features . . . . . . . . . . . . . . . . . . . . . . 33
3.3 Relative scores . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.4 Data Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.5 Scores of baseline, different features in our model . . . . . . . . . . . . . . 42
Chapter 1
Introduction
The communication sector of the world has seen dramatic changes since the launch of the Web
in 1991, and with Web 2.0 it is now far from impossible to connect two persons from
different parts of the world. Web 2.0 is associated with applications that provide a base for
information exchange, sharing and interoperability. Users can do a lot more than passively
view information on Web 2.0 websites; it is basically a user-centric design. Any two
persons in the world can interact using the different means provided by current applications of
Web 2.0. By extending what was already possible in Web 1.0, these applications give the user
more software, storage capabilities and interfaces, easily accessible through the browser. Here the network
basically provides a platform for computing. Not much time elapsed before the
Web transformed into a social domain. People are now virtually present in their social circles and are in
contact with each other with the help of the Internet. A part of this virtual world, termed social
media, comprises a large number of mobile and web-based technologies that have made communication
more interactive.
1.1 Social Media and Social Interactions
Enabled by the ideological and technological foundations of Web 2.0, social media can be
defined as a group of applications that allow the creation and exchange of information between
users. It is omnipresent, connecting communities, organizations and individuals and allowing
them to interact and communicate. Social media shapes into different forms of users' social
interactions like Internet forums, social blogs, microblogs, consumer reviews, picture blogs,
wikis, video sharing, rating, social bookmarking, social networking and podcasts. Email, social
networking websites, crowdsourcing, vlogs, voice over IP and instant messengers are widely
used applications in the above mentioned forms of social media. Many organizations build their
own social networks for their brands using these applications. There are also other
communities such as Facebook, Orkut and Google+ that engage people to interact around events.
Organizations and businesses look at social interactions as consumer generated media because
social media, which promote social interactions, are a relatively inexpensive means of
sharing information with larger audiences compared to traditional media such as newspapers and
television1. There are also other commercial user-interactive practices that involve education
and enable interactive preparation. The content in online social interactions belongs to the
category of User Generated Content.
1.2 User Generated Content
The content in social interactions is mostly controlled by users who are general public
rather than field professionals. As there is no strict standard for presentation in social media,
a user's presentation is more adapted to environmental variables. The text in
user generated content (UGC), in particular, is getting mixed up with a lot of noise.
The quality of user generated content varies from extremely useful information to spam. The
Organisation for Economic Co-operation and Development (OECD) [24] defines UGC as fitting
the following requirements:
• Content that is made publicly available, even if only to a small group of people.
• The content should present a certain amount of creativity. There is no hard line for
the amount of creativity; it always depends on the context. Creativity here means that
the content should be produced by the user rather than copied from some professional
source and posted. For example, a snippet from a professional article
used by a user to describe a situation does not count as UGC. There can always be
a very small percentage of content on UGC websites that cannot be considered UGC.
• The creation of user generated content generally happens outside of professional routines.
With the exponential increase in user generated content on the Internet, it is hard to actually find
and classify useful information. Here, we deal with the text forms of user generated
content that prevail in social interactions.
The text form of user generated content occurs in customer reviews/feedback on e-commerce
websites like Amazon.com, blogs like WordPress, wikis in text based collaboration formats, educational
content, citizen journalism, social bookmarking and group based aggregation and tag
sharing like del.icio.us, hosting sites like YouTube, and social networking sites like Twitter, Google+
and Facebook.
1http://en.wikipedia.org/wiki/Social_media
In Table 1.1 we present some basic differences between a social interaction and a regular
text document such as a news article. More discussion on these differences is given in Section 1.4.
Table 1.1 Differences

Social interaction                          Regular document
language grammar is not guaranteed          grammar is consistent
very short phrases exist as sentences       small sentences may exist but no small
                                            phrases independently
noise because of non-standard emotional     no non-standard emotional content
content words                               words exist
lack of good sentence structure             sentence structure guaranteed
missing punctuation                         punctuated text
NLP tools cannot perform consistently       NLP tools can perform consistently
discourse level features                    no discourse level features
false starts in sentences are allowed       no false starts
In this thesis, we mine and summarize information from social interactions, handling
user generated content across domains and providing a unified solution.
1.3 Industry perspective
The number of people contributing to consumer media through social interactions has grown
to such a level that the content has turned into large databases which hold incredible
market-research value for companies. These databases help organizations grasp consumer
needs, market trends, interests and spending capabilities.
• Companies have developed platforms for consumers to create content on product recommendations,
which led to the advancement of e-Commerce. On many such platforms, people
can even rate the recommendations made by others, which earns a trust and reliability
factor. The main interest for a potential buyer of a product
is to gather more information on the product from the actual experiences of people
who used it. The buyer tries to get the reviews of people who are not linked to the
company selling the product and are thus unbiased [1].
It is interesting to notice that some companies have even based their business model on
these recommendations.
• The world of E-commerce is expanding, offering millions of products for customers. Data
related to these products like descriptions supplied by retailers, customer reviews and
other customer-retailer statistics are also growing rapidly.
• Customer reviews present comprehensive information regarding the experience
with a particular product. Product ratings are another kind of product
recommendation, a bit different from customer reviews but sometimes accompanying
them. They are usually used to give a very brief approval by the user of the product,
based on its quality, on a scale of one to five.
• There is one more major perspective of online user social interactions in social media,
which takes the form of chats. Chats over the Internet range from casual conversations
between two or occasionally more users of a display-based communications system to
organizational conferencing2.
• The following is an interesting aspect of how chats influence businesses. The E-commerce
industry possesses a customer support segment that monitors the problems of its customers
and their resolution through effective means like live chats. As more consumers turn
to Web support and online shopping, tremendous opportunities arise for companies
to deliver better customer service online and drive more sales. In live chats, every
customer is assigned an agent who looks after the concerns of that customer, in real
time if possible. These chats are typically service chats, where the company provides its
service to the customer. The second category of chats that we also address is
sales chats. In sales chats, a customer approaches a live chat session to buy a product. An
agent then answers questions posed by the customer regarding product verification,
order, mode of purchase, transaction details, etc. In the sales scenario the agent-customer
chat sessions take place before and after sales, which can satisfy customers and
effectively increase sales volume. Sales and service chats are structurally similar in terms
of the customer querying and a company's representative resolving the queries.
Both sales and service chats play a crucial role in developing a fair relationship of a
company with its customers. Hence assessment of sales/service chats is done regularly to
2http://en.wikipedia.org/wiki/Chat
improve company standards. In order to assess such user social interactions, which
contain rich market-oriented value, one needs to understand the user generated content
pertaining to them and the problems in processing it.
1.4 Criticality of User Generated Text in mining Social Interactions
The text in user generated content occurring in social interactions is usually low in natural
language grammar, structure and formality. It also disagrees with other aspects of language
in ways like missing letter-case information for named entities, missing punctuation,
repetitions, lack of good sentence structure, false starts, non-standard words, pause-filling
words like "uuumm" and "uhh", and other texting disfluencies. It is more prone to express
emotional and context-specific content. Unstructured noisy text data is found in informal
environmental settings such as online chat, text messages, e-mails, message boards, user reviews,
blogs, wikis and social networking posts3.
The degree of distortion of the structure and nature of the text in user generated content varies
• From domain to domain (blogs, reviews, chats, etc.)
• From user to user
• With the editor tool environment
– If the editor provides dictionaries, the text may be more standard.
– If there are space constraints, the user may tend to input more information represented
by less text, which consequently results in breaking the grammar.
• The worst cases occur when the user presents different styles in the same domain and
environment depending on his mood, availability of time, etc.
Hence, while carrying out information extraction tasks, contemporary research faces
a lot of problems with unstructured text, as missing punctuation and the use of short phrases
can often hinder the performance of standard natural language processing tools such as part-of-speech
(POS) taggers, parsers and named entity recognizers (NER).
Possible Solutions:
It is better to avoid any natural language tool that performs best when trained on formal,
cleaned and structured language data. This is because the performance of these tools on
unstructured text is not consistent. The formats or patterns of irregularities in unstructured
text change with change of domain.
3http://en.wikipedia.org/wiki/Noisy_text_analytics
For example, reviews given by users on E-commerce sites are more unstructured than those
given by professional reviewers, and the user generated content in a given chat conversation is
much more unstructured compared to the reviews domain. If we train natural language tools
freshly on unstructured data from a given domain, they may not perform consistently
on unstructured text from other domains.
In this thesis we deal with information extraction and summarization in social interactions
while avoiding natural language processing (NLP) tools as much as possible. We experimented on
two different scenarios for extraction and summarization: one is customer reviews of
products and the other is sales/service chats. The details of the problems addressed are elaborated
in the following sections.
1.5 Problem Description and Contributions Made
Given the scenario and content of social interactions, the two interconnected
tasks of mining information, i.e., Information Extraction (IE) and Summarization,
involve gathering the crux of information from the given content. These are challenging
tasks because of the several text variations peculiar to social interactions.
Following are the contributions made in addressing the above problems in this thesis.
1. We examined the need for information extraction and summarization systems for text in
social interactions domain.
2. We examined scenarios of customer reviews for products on e-commerce websites and
customer sales/service chats, as examples of typical social interactions.
3. We built a domain independent information extraction system for extracting product
attributes from customer reviews.
4. We built features to extract semantics from user generated content using Wikipedia and
Web.
5. We examined the role of information extraction in summarization of user generated
content in social interactions.
6. We crossed barriers of user reviews and chats domains to provide a unified solution to
information mining in social interactions.
7. We built a system for summarizing sales/service chats by deriving semantics utilizing the
features from our extraction system.
1.5.1 Information Extraction from Social Interactions
To closely examine user generated content and design an extraction engine, we selected
the scenario of customer reviews on E-commerce sites such as Amazon.com, ebay.com, etc.
The online retail market is growing immensely, offering millions of products for customers. The
products are generally described in terms of a small set of attributes. Such product attributes are
mined from the descriptions to represent the product in a structured manner.
Often, descriptions deal only with generic attributes. For example, specific attributes like power
consumption, pulsator, load, spin-dry effectiveness, noise, water usage, water leakage, etc., for
a product like a washing machine cannot typically be found in descriptions. On the other hand,
customers express their opinions in the form of reviews. The opinions expressed are in terms
of the attributes they like and dislike, but not always in terms of those that are provided by the
retailer for that particular product. Hence mining the attributes that customers discuss
can be really helpful for sellers as well as for other customers.
Mining product attributes from customer reviews can enable retailers to fetch and group other
products that have similar specific attributes and to forecast more precisely. Hence many
retailers are trying to enrich their product knowledge bases with these domain specific and
product specific attributes. Attribute extraction from reviews is also useful in tasks like review
summarization, product rating, sales agent assessment, opinion mining of reviews, product
recommendation systems, customer relationship management, customer satisfaction analysis,
customer profiling, etc.
On the customers' side, they are more inclined to seek the opinions of other customers who actually
used the product or bought it from a particular retailer website. They seek an unbiased evaluation
of a product by leveraging information from multiple reviews, even though each individual
review can be subjective in nature. Therefore, a person is more interested in reading a featured
review than overall reviews like "the product is really great, awesome!" or "this is the greatest
product I have ever seen!!!" or simply the product rating.
Mining attributes from customer reviews is a challenging task, as reviews mostly comprise
user generated content. We already know that the text in such user generated content is low in
natural language grammar, structure and formality, which often hinders NLP tools.
With this motivation, we have designed a novel framework that can extract the attributes of a
product without making use of natural language tools, instead treating the text as a 'bag of words'
and using the knowledge of Wikipedia and the Web.
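As a minimal illustration of the 'bag of words' treatment (the review text below is an invented example, not thesis data), a noisy review sentence reduces to unordered token counts that require no grammar, parsing or tagging:

```python
import re
from collections import Counter

def bag_of_words(text):
    # Lowercase and keep alphanumeric runs; word order, punctuation and
    # grammar are discarded, so ungrammatical review text is still usable.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(tokens)

review = "battery life is great,, great battery!!"
print(bag_of_words(review))
```

Attribute candidates can then be scored over these counts, independently of sentence structure, which is exactly why such a representation survives noisy user generated text.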
1.5.2 Summarization of Social Interactions
After designing information extraction procedures for social interactions, we investigated
summarizing the content of social interactions. Summarization involves basic procedures
like
• Selection of information from the original content
• Ranking of the content and then its organization into a summary.
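The two steps above can be sketched generically as follows. A simple term-frequency scorer stands in for our feature combination, and the chat lines are invented for illustration; this is not the thesis system itself:

```python
from collections import Counter

def summarize(sentences, k=2):
    # Step 1: selection - every sentence is a candidate text unit.
    words = [w for s in sentences for w in s.lower().split()]
    freq = Counter(words)

    # Step 2: ranking - score each sentence by the average frequency
    # of its words, then keep the top-k sentences as the summary.
    def score(s):
        tokens = s.lower().split()
        return sum(freq[t] for t in tokens) / len(tokens)

    return sorted(sentences, key=score, reverse=True)[:k]

chat = [
    "My order has not arrived yet",
    "The order number is ABC123",
    "I will check the order status now",
]
print(summarize(chat, k=1))
```

In the full system, the term-frequency scorer is replaced by a combination of word-level and sentence-level features, but the selection-then-ranking skeleton is the same.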
We have already introduced sales/service chats in Section 1.3. An agent in a typical contact
center handles over a hundred calls in a day. Agents operate in various communication modes such
as E-mail, voice and online chat, which consequently produce huge volumes (gigabytes) of data in the form
of chat logs, customer feedback, voice conversation transcriptions, E-mails, etc. Text modes of
communication like online customer-agent chats and interactions over email tend to be noisy.
Also, transcription of voice conversations using state of the art automatic speech recognition
results in text with a 30-40% word error rate4.
Analysis of such data is essential for customer satisfaction analysis, call modeling,
customer relationship management, customer profiling, agent profiling, etc. For such analysis,
sophisticated and advanced automatic techniques are needed to handle poorly
created text. To assess a sales/service chat, one needs to go through the whole chat session and its
previous chat sessions if required (as in situations where, if an appropriate solution is not provided to
the issue raised, the customer may come back after a period of time). This demands considerable
human effort. This effort can be minimized if assessors are provided with summaries of the chat
sessions they need to go through. This also results in assessors effectively grading agents and
thereby increases chat throughput. Summarization also helps agents to quickly grasp the
information exchanged in a chat session.
1.6 Evaluation
As this thesis deals with problems of information extraction and summarization of
content in social interactions, we provide evaluation for both information extraction and
summarization.
The evaluation procedures we followed provide a platform for comparison with peer
results.
We adopted Precision, Recall and F-measure as the evaluation measures for assessing the
capability of our extraction engine, while for the evaluation of our summarization system we
used the popular ROUGE metric to score the summaries produced. The mode of evaluation is
elaborated further in Chapter 3.
4http://en.wikipedia.org/wiki/Noisy_text_analytics
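For concreteness, Precision, Recall and F-measure over a single extraction run can be computed as below. The gold and extracted attribute sets are hypothetical examples, not results from our experiments:

```python
# Gold-standard vs. extracted attribute sets (hypothetical example).
gold = {"battery life", "screen", "price", "camera"}
extracted = {"battery life", "price", "weight"}

tp = len(gold & extracted)        # attributes extracted correctly
precision = tp / len(extracted)   # how much of the output is correct
recall = tp / len(gold)           # how much of the gold set was found
f_measure = 2 * precision * recall / (precision + recall)

print(f"P={precision:.2f} R={recall:.2f} F={f_measure:.2f}")
```

ROUGE follows the same precision/recall spirit but counts overlapping n-grams between a system summary and reference summaries rather than exact set matches.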
1.7 Organisation of Thesis
The rest of this thesis is organized into the following chapters:
• Chapter 2 presents a survey of related work and relevant literature associated with infor-
mation extraction. It includes different approaches for text mining in social interactions,
their areas of success and drawbacks. It expands the perspective of information extrac-
tion into summarization of social interactions and presents work related to summarization
along with various flavors of summarization.
• Chapter 3 deals with the details of our solution to the ‘information extraction from online
social interactions’ problem. The extraction framework is explained in detail, with elaboration
of the motivation and derivation of the Wikipedia-based features and other word-density features
that help in identifying salient keywords along with their semantic relations. It also elaborates
the motivation of our approach to summarization of social interactions and presents in detail
the summarization framework we built using features from our extraction engine.
Further detail is given on the datasets used and on the classification of experimental runs by
their use of knowledge from Wikipedia and the Web. It also elaborates the evaluation procedures
adopted, the experiments conducted in a strict environment and the ablation tests that evaluate
our individual features, especially with regard to the use of external knowledge for
summarization.
We followed different evaluation themes for extraction and summarization to demonstrate the
various contributions of this thesis.
• Chapter 4 concludes with the important contributions of this thesis to the research field
of text mining in social interactions. It describes how the central idea presented in this
thesis can be adopted in different scenarios of online social interactions, and leaves the
reader with a list of intriguing extensions and future plans for this work which can have a
sound impact on current social media.
Chapter 2
Related Work
2.1 Approaches to mining social interactions
After extensive study of news-media summarization, research has in recent years migrated to
mining social media for the many benefits it offers. Mining social media helps in enhancing
security, tracking world trends and managing user interactions. With the advent of community
question answering (QA), it is now easy and effective to post a question on popular community
QA forums such as Yahoo! Answers. For users, these community QA sites have become a popular
platform for a wide range of information needs, where they can rely on other users to provide
answers. The resulting archives hold millions of such questions and their sets of answers, many
of which are priceless for the information needs of other users. To access such an immense
repository of knowledge, effective information mining systems are required [2].
2.2 Need for Information Extraction in Social Interactions
As any user can contribute an answer to a question on a community forum, the majority of the
content often reflects personal opinions and experiences. For this reason, research has always
been pressed to focus on extracting salient information from social interactions. There has
been budding research on mining and summarizing opinion from blogs, reviews and chats to
evaluate the overall drift of the content. However, theory has largely been developed for
finding the resultant or overall sentiment associated with the media rather than for digging
out important information from the content.
Question answering and information extraction have been studied over the past decade; however,
evaluation has generally been limited to isolated targets or small scopes1.
1http://nlp.cs.qc.cuny.edu/kbp/2011/
Work along the lines of information extraction continues because it forms the first step for
most summarization, question answering and opinion mining systems.
Recently, TAC2 (the Text Analysis Conference) started a track called Knowledge Base Population,
which contains a slot-filling task in which participants are encouraged to run their extraction
systems on given data. This task explores the extraction of information about entities with
reference to an external knowledge source: using a basic schema for persons, organizations and
locations, a knowledge base must be created and populated with information found in text.
Recognizing textual entailment (RTE), a task also introduced by TAC, aims at capturing major
semantic inference needs across many natural language processing applications, such as
information extraction (IE), question answering (QA) and summarization. The task is to validate
whether a hypothesis is entailed by (in agreement with) a given text.
As an example, assume common background knowledge of the business news domain and the following
text:
T1: Nokia and Intel will merge their top-end smart phone software as they face increasing
pressure from cellphone industry newcomers Google and Apple.
The following hypotheses are entailed:
• H1.1: Google and Apple are newcomers in the cellphone industry.
• H1.2: Nokia and Intel are facing pressure.
• H1.3: Nokia and Intel are facing pressure from Google and Apple.
• H1.4: Nokia and Intel produce top-end smartphone software.
If H is not entailed by T, there are two possibilities:
1. H contradicts T
2. The information in H cannot be judged as true on the basis of the information contained in T.
On the basis of entailment, a question answering system can be reframed: the hypothesis
represents the expected answer pattern for a question, and the QA problem is restructured as
identifying texts that entail this expected answer form.
In information extraction, entailment holds between different text variants that express the
same content.
We approached the RTE task by finding linguistic structures, which we call templates, that
share the same anchors. The lexical elements describing the context of a sentence are termed as
2http://www.nist.gov/tac/2011/
Figure 2.1 Extraction of templates
anchors. Templates are extracted from sentences of both the text and the hypothesis, and if the
anchors of these sentences agree with each other (i.e., overlap), the case is taken as
entailment.
For example, the sentences ‘Yahoo bought Overture’ and ‘Yahoo acquired Overture’ share the
anchors {X = Yahoo, Y = Overture}, suggesting that the two sentences entail each other.
Entailment here simply means that the given text is in agreement with the source text.
Figure 2.1 portrays our system.
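The anchor-overlap idea can be sketched in a few lines. The sketch below is only a toy
illustration, not our actual system: `extract_template` is a hypothetical stand-in that fakes
template extraction with a simple subject-verb-object split, whereas real templates come from
richer linguistic structures.

```python
# Toy sketch of anchor-overlap entailment: two sentences are taken as
# mutually entailing when the templates extracted from them share the
# same anchors {X, Y}. Template extraction is faked here with a simple
# subject-verb-object split (hypothetical; real templates are richer).
def extract_template(sentence):
    """Return (anchors, predicate) for a three-word S-V-O sentence."""
    subj, verb, obj = sentence.rstrip(".").split()
    return frozenset([subj, obj]), verb

def anchors_agree(text, hypothesis):
    """Entailment is assumed when the two anchor sets fully overlap."""
    t_anchors, _ = extract_template(text)
    h_anchors, _ = extract_template(hypothesis)
    return t_anchors == h_anchors

print(anchors_agree("Yahoo bought Overture.", "Yahoo acquired Overture."))  # True
print(anchors_agree("Yahoo bought Overture.", "Google acquired Overture."))  # False
```

On the Yahoo/Overture pair above, the shared anchors {Yahoo, Overture} make the check succeed
regardless of the predicate used.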
Extracting such templates from text requires the methods of information extraction. As the data
we worked with for the RTE task was mainly newswire media, we took major help from natural
language processing tools such as named entity recognizers (NER) and part-of-speech (POS)
taggers to identify named entities for template building.
This shows the importance of information extraction in the field of text analysis and artificial
intelligence. However, the text used for these tasks comes mainly from news articles, not much
from social media.
Extraction systems for unstructured text are still at the stage of theory and experimentation
and have not reached even a budding stage of development. This shows a requirement for good
information extraction systems for unstructured text, and an even stronger urge in the case of
social media.
Researchers have experimented with extraction techniques in different scenarios of social media,
but found a greater requirement for solid systems that can deal with the user-generated
subjectivity of the text rather than the simple objective text found in news media.
For a closer look, we selected one scenario of information extraction in social media:
extraction of product attributes from the reviews users write, on e-commerce websites like
Amazon.com, about products they have purchased and used.
One perspective on extracting product attributes from customer reviews is that of identifying
subtopics (attributes) for a given topic (product) in a discussion. Working on this problem
therefore allows us to investigate extraction systems for domains containing user-generated text.
2.3 Work related to Attribute extraction
A good amount of research has been put into product attribute extraction in recent years, but
the focus has been on extracting attributes from product descriptions, with little done on
extracting the same or more specific attributes from user reviews. Much of the existing work
focuses on whole-review classification and overall opinion extraction.
Work based on word-order occurrences, where product attributes are believed to exist as noun
phrases, has already been contributed [11, 5]. However, it was shown [10, 9, 17] that using noun
phrases tends to produce noise and low precision; that work instead identifies product
attributes with the help of part-of-speech (POS) tags and the occurrence of adjectives.
But in most cases where free-format reviews are considered, POS taggers do not function at the
expected level, as grammar is not guaranteed in user-generated text.
Some effort has also gone into product attribute extraction from product descriptions [25].
That algorithm relies on the fact that descriptions are structured pieces of text: a trained
noun-phrase recognizer model identifies noun phrases in product descriptions, which works well
on structured text but, when tested, did not work on unstructured text and long reviews.
Mooney [21] gave a good survey of prevailing techniques in general information extraction.
Chin [4] has done contextual sense disambiguation and semantic association using Wordnet.
We examined the method of using general ontologies like WordNet to find synonyms of product
attributes in reviews, which is inconsistent, as ontologies like WordNet lack domain knowledge;
domain-specific ontologies, where available, are very limited.
Turney used knowledge bases such as a thesaurus and measured the association of a pair of words
through association counts obtained by querying a search engine, which is an interesting way of
supplying external knowledge [29]. The limitation of several of these methods lies in their
failure to acquire context knowledge; context analysis is therefore in demand, as product
attributes are context- and product-dependent.
2.4 Uses of product attribute extraction
Extracting product attributes from text holds a lot of use in industry and for knowledge
building:
• Demand forecasting and prediction of market trends, through the marking of positive and
negative attributes of a product
• Product recommendations
• Easier comparison of manufacturers, suppliers and retailers
• Building knowledge bases for products
2.5 Extraction to Summarization
The summarization literature suggests information extraction as an important process of
summarization. Many text summarization systems in the past have used information extraction
either implicitly or explicitly to produce summaries.
Generally, information extraction implies that we already know what kind of information is to
be found in the source text, whereas summarization implies finding the interesting parts of the
source text. From the perspective of system developers, however, the two applications overlap
and blend into each other.
The information extraction models used for summary production typically extract important text
units, which are then used to rank the content of the text to obtain its summary. Purposeful
extraction of text units means extracting named entities, actions, attributes, subjects, etc.,
whichever are required for a domain. As seen in the previous section, people have developed very
specific kinds of extraction systems which mostly work best for documented text rather than for
social interactions. When aiming at social interactions, one needs to ensure that systems work
across their different domains, including extreme cases like chats, so that they can later be
useful to other information retrieval systems like question answering and summarization.
In this thesis, having selected the scenario of product reviews for building our extraction
system, we chose a different domain, corporate sales/service chats, for summarization, and used
the developments of our domain-independent extraction system to enhance our summarization
system.
Almost five decades of research have gone into text summarization. The following are the
different forms of summarization prevailing today.
2.6 Flavours of Summarization
• Single Document vs. Multi-Document : This categorization is based on the original
content considered for summary generation. In single-document summarization one summarizes a
single document of text, while multi-document summarization works from multiple documents that
pertain to a focused topic or to multiple topics. Handling redundancy of information is the
biggest challenge in multi-document summarization. DUC and, later, TAC are the conferences
that have focused on summarization research, providing various tasks along with datasets and
bringing together researchers from NLP and other areas.
• Query-focused vs. Generic : When summaries are produced by procedures that take the
user’s need as a query, the process is called query-focused summarization. In generic
summarization, summaries are produced by capturing the important information of the source
documents.
• Extract vs. Abstract : An extractive summary consists entirely of material from the
source text, whereas an abstract is a summary whose material is not entirely from the source.
In general, abstract summaries are human-written. Automated summary generation systems try to
achieve abstract summaries close to human-produced ones, using technologies such as natural
language generators over the extracted phrases, but have not yet reached even amateur levels.
• Blog Summarization corresponds to summarizing blogs available on the Internet. Such
summaries can focus on a topic or cover generically all the information dealt with in a blog
series. Blog summarization has gone a step further in recent years by considering and including
information from user comments while summarizing a blog. The challenges lie in dealing with
user-generated content.
• Update Summarization : When a user is aware of past proceedings in a topic or a stream of
information, the user desires an updated summary that avoids the information he already knows.
The task of producing summaries of the updated information while avoiding redundant information
is known as update summarization.
• Personalized vs. Guided Summarization : The notion of importance and relevance changes
from person to person. A personalized summarizer caters to a user’s interests and personal
background. In guided summarization, by contrast, the summary is guided by a fixed template
prepared for the particular domain of text; if the template takes the form of a questionnaire,
the summary should answer the questions in the template whenever the answers are present in the
source documents.
• Chat Summarization : Summarization of chats is in its beginning stages and has been
studied extensively in recent years because of the challenges it poses to text summarization
and because of corporate business and governmental defense requirements. Summarization in this
area is perceived somewhat differently in its procedures compared to traditional summarization
because of the medium of the source content: chats are more unstructured than blogs, which
makes extraction of salient information from a chat hard.
2.7 Background related to Summarization of Social interactions
A good amount of research has already been done in traditional text summarization, but not much
on summarization of chats. The research in this direction has given attention to technical
blogs, forums, reviews and, recently, Internet chats, but none of it has addressed the problem
we focus on in this thesis. Previous attempts used standard natural language processing (NLP)
tools and techniques to extract information from social interactions.
We suggest avoiding reliance on NLP tools, as social interactions span different domains and
text mediums whose formalities keep changing.
Among past contributions in the social interactions field, Roman worked on extraction of
stance, politeness and bias in a conversation [26, 28]. Zhuang proposed a method [35] that uses
key attributes along with occurrence frequency and parts of speech. This method works to an
extent on well-structured user reviews, but more dynamic social interactions like sales/service
chats and Internet chats contain short phrases and improper grammatical structure, where it is
better to avoid any use of natural language tools.
Attempts have been made to summarize Internet chats by segmenting the text into subtopics
[8, 34, 33]. This method cannot be applied directly to categories such as sales/service chats
and other scenarios where chats are focused on a particular topic, as it misses several
specificities of such chats; moreover, segmenting the text into subtopics is not a priority in
these situations.
Summarization of chats using a list of phrases, thereby classifying sentences by these phrases
and their frequency of occurrence, has also been experimented with [12]. Murray and Zechner
worked on summarization of multi-party, diverse-domain discussion lists [22, 23, 32], which,
even though they belong to the family of unstructured text, diverge from basic chats and
corporate customer service chats in their participants and in the level of topic shifts.
Sales/service chats can be taken as the most focused chat medium among all kinds of chats.
Much of the existing work on chat summarization focuses on topic shifts, identification of
question-answer pairs and topic clustering. No good effort has yet been made to mine
information by deriving semantics from a text domain such as social interactions. In our
scenario, every exchange of information is a question-answer (QA) pair, but one focused on the
root question; there is thus no great need to identify topic shifts and QA pairs, as the chats
are focused on their initial root question and its resolution.
In this thesis we make a successful maiden attempt at automatic summarization of social
interactions by deriving the actual semantics implicated in a discussion using external
knowledge sources.
Chapter 3
Information Extraction based approach for summarizing Social
interactions
Before getting into the details of our approach to the contemporary problem of mining
information from social interactions, the following sections briefly elaborate on the machine
learning techniques employed, their specific suitability for this problem and the metrics used
to evaluate our approach.
3.1 Machine learning and evaluation related background for IE
& Summarization
3.1.1 Support Vector Machines
Classification of data is a common task in machine learning, which is about predicting unknown
properties using known properties learned from training data.
Support vector machines (SVMs), a new generation of machine learning systems, deliver
state-of-the-art performance in real-world applications such as pattern recognition, image
classification, text categorisation, biosequence analysis and hand-written character
recognition, and are now considered one of the standard tools for data mining1.
From a set of training samples, each marked as belonging to one of two categories, an SVM
training algorithm builds a model that assigns new samples to one category or the other. More
analytically, support vector machines construct a hyperplane or set of hyperplanes in a
high-dimensional space, which can be used to classify samples into different categories.
For the current extraction task we used classification with support vector machines, whereas
for the later summarization module we made use of support vector regression, another branch of
SVMs, dealt with in later sections of this chapter.
1http://en.wikipedia.org/wiki/Support_vector_machine
In classification tasks, the parent problem may be stated in a finite-dimensional space, but it
often happens that the sets are not linearly separable there. In SVMs, the original
finite-dimensional space is therefore mapped into a much higher-dimensional space, making the
discrimination task easier in that space.
Good discrimination between categories is achieved by the hyperplane that has the greatest
distance, termed the functional margin, to the nearest training sample of any class. In
general, the greater the margin, the lower the generalization error of the classifier.
For this reason we selected support vector machines, over other machine learning algorithms
like neural networks, decision tree learning and probabilistic graphical models such as
Bayesian networks, to aid our extraction procedure through classification.
Our methodology is modeled with the different domains of social interactions in mind, so we aim
to minimize the generalization error of our system rather than overfit it to data from a
particular domain. Because training sets are finite and the future is uncertain, learning
theory usually does not yield guaranteed consistency in the performance of algorithms2; it is
therefore always better to choose a machine learning algorithm that minimizes the
generalization error when addressing scientific problems such as ours.
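As a concrete illustration of the margin intuition (and only an illustration; the experiments
in this thesis use the LIBSVM package, not this code), a minimal linear SVM can be trained with
Pegasos-style sub-gradient descent on the hinge loss over a toy separable dataset:

```python
# Minimal linear SVM (no bias term) trained with Pegasos-style
# sub-gradient descent on the hinge loss. Illustrative sketch only:
# the actual experiments in this thesis rely on LIBSVM.
def train_linear_svm(samples, labels, lam=0.01, epochs=100):
    w = [0.0, 0.0]
    t = 0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            t += 1
            eta = 1.0 / (lam * t)                    # decaying step size
            w = [(1 - eta * lam) * wi for wi in w]   # regularisation shrink
            # push only on margin violations (hinge-loss sub-gradient)
            if y * (w[0] * x[0] + w[1] * x[1]) < 1:
                w = [wi + eta * y * xi for wi, xi in zip(w, x)]
    return w

def predict(w, x):
    return 1 if w[0] * x[0] + w[1] * x[1] >= 0 else -1

X = [(2.0, 2.0), (3.0, 1.0), (-2.0, -2.0), (-3.0, -1.0)]
y = [1, 1, -1, -1]
w = train_linear_svm(X, y)
print([predict(w, x) for x in X])  # separates both classes
```

The regularisation term `lam` plays the role of the margin/error trade-off: a smaller value
drives the solution toward the largest-margin separating hyperplane.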
3.1.2 ROUGE
In the summarization domain, ROUGE scores for content match have been shown to correlate highly
with human evaluation [14].
A string of conferences, special topic sessions and workshops on automatic text summarization,
such as ACL, COLING, SIGIR and WAS 2000-02, along with US-government-sponsored evaluation
efforts through conferences like DUC and TAC, have advanced the technology and produced
experimental text summarization systems (Radev et al. 2001, McKeown et al. 2002).
Based on various statistical metrics, results show that automatic evaluation using n-gram or
unigram co-occurrences between summary pairs correlates very well with human evaluations [16].
In spite of all these efforts, however, there were no common, repeatable and convenient
evaluation procedures that could be applied with ease to support system development and quick
comparison among different summarization procedures.
ROUGE, a package for the automatic evaluation of summaries, was finally developed by Chin-Yew
Lin; the name stands for Recall-Oriented Understudy for Gisting Evaluation [15]. It includes
measures to determine automatically the quality of a summary by comparing it with corresponding
human model summaries.
2http://en.wikipedia.org/wiki/Machine_learning
The measures count the number of overlapping and matching units, such as word sequences and
word pairs, between the system-generated summary to be evaluated and the human-made reference
summaries.
• Formally, ROUGE-N is an n-gram recall between a candidate summary and a reference
summary, where n is the length of the n-gram. Suppose r is a reference summary; the basic
formulation of ROUGE-N is given as

ROUGE\text{-}N = \frac{\sum_{gram_n \in r} COUNT_{match}(gram_n)}{\sum_{gram_n \in r} COUNT(gram_n)}

If more than one reference summary is provided, then ROUGE-N is

ROUGE\text{-}N = \frac{\sum_{r \in references} \sum_{gram_n \in r} COUNT_{match}(gram_n)}{\sum_{r \in references} \sum_{gram_n \in r} COUNT(gram_n)}

COUNT_{match}() gives the number of n-grams matched. Setting n = 2 gives ROUGE-2.
• ROUGE-S is skip-bigram co-occurrence statistics. A skip-bigram is any pair of words in
their sentence order, allowing for arbitrary gaps. For example, the sentence “Sudheer is
writing thesis.” contains C(4,2) = 6 skip-bigrams:
– Sudheer-is
– Sudheer-writing
– Sudheer-thesis
– is-writing
– is-thesis
– writing-thesis
Skip-bigram co-occurrence statistics measure the overlap of skip-bigrams between a candidate
summary S and a reference summary r. Suppose skipBi() gives the count of matched skip-bigrams;
then ROUGE-S is given as

R = \frac{skipBi(S, r)}{C(r, 2)}, \quad P = \frac{skipBi(S, r)}{C(S, 2)}, \quad ROUGE\text{-}S = \frac{2RP}{R + P}

C(r, 2) and C(S, 2) give the number of skip-bigrams in the reference summary and the candidate
summary respectively.
20
• ROUGE-SU is an extension of ROUGE-S with the addition of the unigram as a counting unit.
ROUGE-SU4 thus matches skip-bigrams with skip distance up to 4, along with the unigram counts
of the (stemmed) words (Lin, 2004).
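The measures above are straightforward to reproduce for a single reference summary. The
following sketch implements ROUGE-N recall and the skip-bigram ROUGE-S F-score, using the
“Sudheer is writing thesis.” example:

```python
# Sketch of ROUGE-N recall and skip-bigram ROUGE-S for one reference.
from collections import Counter
from itertools import combinations

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def rouge_n(candidate, reference, n):
    """ROUGE-N: n-gram recall of the candidate against one reference."""
    cand, ref = Counter(ngrams(candidate, n)), Counter(ngrams(reference, n))
    matched = sum(min(c, ref[g]) for g, c in cand.items())
    return matched / sum(ref.values())

def skip_bigrams(tokens):
    """All in-order word pairs with arbitrary gaps: C(len, 2) of them."""
    return Counter(combinations(tokens, 2))

def rouge_s(candidate, reference):
    """Skip-bigram F-score between a candidate S and a reference r."""
    s, r = skip_bigrams(candidate), skip_bigrams(reference)
    matched = sum(min(c, r[b]) for b, c in s.items())
    recall = matched / sum(r.values())       # skipBi(S, r) / C(r, 2)
    precision = matched / sum(s.values())    # skipBi(S, r) / C(S, 2)
    return 2 * recall * precision / (recall + precision) if matched else 0.0

ref = "Sudheer is writing thesis".split()
print(len(list(combinations(ref, 2))))                     # C(4,2) = 6
print(rouge_n("Sudheer is writing code".split(), ref, 1))  # 3/4 = 0.75
```

A full ROUGE run additionally stems words, handles multiple references and reports confidence
intervals; this sketch only mirrors the core formulas.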
3.2 Information Extraction Framework for Social Interactions
The major information extraction (IE) problem that has been studied extensively in the past is
named entity extraction. In the general scenario of a news media corpus, the extraction of
desired entities and information is guided by named entity recognition tools and
part-of-speech tools, along with applied rules framed using knowledge of the field of the
given content.
Turning to the media of social interactions, we tried extracting the crux entities, i.e.,
product attributes, from a set of customer reviews. We used the orthodox methods listed above
for extraction, but the reviews, which consist of incomplete sentences and short phrases, made
the system perform poorly and inconsistently. This is because current NLP tools are trained on
ordered language datasets with little extension to noisy text analytics.
We approached the problem of extracting the attributes of a product by designing a solution
that can serve as a general extraction method across all dimensions of social interactions,
trying to automatically grasp the semantics of the content by using external resources like
Wikipedia.
For any given product, our approach to attribute extraction involves:
1. Collecting customer reviews of the given product.
2. Filtering out stop words.
3. Computing the features we have defined for the remaining words.
4. Identifying possible attribute words using a classification model trained on these features.
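The four steps above can be sketched as follows. The feature function and decision rule here
are hypothetical placeholders (a bare frequency feature and a threshold), not our actual
system, which computes the MFI, CR, SW and WR features described in the following sections and
classifies with an SVM.

```python
# Sketch of the four-step attribute-extraction pipeline. The feature
# function and the classifier are hypothetical stand-ins: the real
# system uses MFI/CR/SW/WR features and an SVM decision function.
from collections import Counter

STOP_WORDS = {"the", "a", "is", "of", "and", "this", "very"}

def candidate_words(reviews):
    # Steps 1-2: pool the review text and filter out stop words
    tokens = [w.lower() for text in reviews for w in text.split()]
    return [w for w in tokens if w not in STOP_WORDS]

def features(word, candidates):
    # Step 3: placeholder feature vector (relative frequency only)
    counts = Counter(candidates)
    return [counts[word] / len(candidates)]

def extract_attributes(reviews, classify):
    # Step 4: keep the words the (trained) classifier accepts
    cands = candidate_words(reviews)
    return sorted({w for w in cands if classify(features(w, cands))})

reviews = ["The lens of this camera is very sharp",
           "This camera lens is a zoom lens"]
# stand-in for the SVM decision function: accept frequent words
attrs = extract_attributes(reviews, lambda f: f[0] > 0.15)
print(attrs)  # ['camera', 'lens']
```

Swapping the lambda for a trained classifier and enriching `features` with the measures below
recovers the structure of the actual system.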
Support vector machines (SVMs), a set of related supervised learning methods for classification
and regression analysis, facilitate our model.
To classify the candidate words into attributes, we used the LIBSVM package for our machine
learning tasks. LIBSVM is integrated software for support vector classification, regression
and distribution estimation, and also supports multi-class classification [3].
We used LIBSVM for machine learning tasks from data scaling to parameter selection. We made use
of the scripts provided with the package to select an appropriate kernel and internal
parameters; this is time-expensive, as it tests all the kernels that may suit the given data,
so we opted for it only in the training phase.
Our task is now reduced to identifying a set of features that can pick out the attributes
from customer reviews. The features on which our system has been trained are explained in the
following sections.
3.2.1 Most Frequent Items-MFI
Words related to the topics discussed most occur at high frequencies in any given text, and in
general people frequently discuss the attributes of a product in their reviews. The ‘Most
Frequent Items’ feature boosts the importance of attribute words by their frequency of
occurrence in customer reviews; it is close to the tf-idf measure [27]. Note that our task is
not to identify the attributes of a particular product from the customer reviews of various
products, but to identify attributes only from the customer reviews of that particular product.
The set of words {z_1, z_2, z_3, ..., z_m} used for this feature is obtained from the customer
reviews of a given product after stop-word removal. For any word z_i, the ‘Most Frequent
Items’ feature is computed as

MFI(z_i) = \frac{Freq(z_i)}{\sum_{j=1}^{m} Freq(z_j)}

where Freq(z_i) gives the total number of occurrences of z_i in the reviews of the given
product.
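A minimal sketch of the MFI computation over the stop-filtered review words:

```python
# MFI(z_i) = Freq(z_i) / sum_j Freq(z_j), over the stop-filtered words
# from the reviews of one product.
from collections import Counter

def mfi(words):
    freq = Counter(words)
    total = sum(freq.values())
    return {w: c / total for w, c in freq.items()}

# example word list after stop-word removal (illustrative data)
words = ["camera", "lens", "battery", "lens", "lens"]
scores = mfi(words)
print(scores["lens"])  # 3/5 = 0.6
```

By construction the scores sum to 1 over the vocabulary of a product's reviews.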
3.2.2 Context Relation using Wikipedia - CR
To understand or identify a context, we need the set of keywords that portray it. We therefore
assume that any context C can be expressed as C = {t_1, t_2, t_3, ..., t_n}, where the t_i are
the related keywords dealt with in C.
In customer reviews, the product forms the context: people talk about the product and its
attributes in their reviews, and the attributes and other highly related things belong to the
set of keywords of the context.
The CR feature is about identifying the related keywords mentioned in customer reviews that can
be found in Wikipedia. We start by identifying in Wikipedia all words discussed in the reviews
of a given product, and then proceed to calculate the most semantically related words among
them.
When we judge the semantic relatedness between any two given words, we draw on a huge amount of
background knowledge about the concepts those words represent; any attempt to state the
semantic relatedness between words automatically needs to do the same. One can use hand-crafted
lexical structures like thesauri and taxonomies, or statistical analysis of huge corpora, to
make these semantic decisions automatically [19]. The limiting factors of such techniques when
carried across domains are background knowledge, precision, scalability and scope. With more
than 4 million articles and thousands of volunteers all over the world, Wikipedia, a massive
and growing repository of knowledge, is the best alternative in the face of such limitations.
We explore Wikipedia’s link structure, category structure, article titles and page types from
the latest static pages-articles XML dump3 of Wikipedia; we only need Wikipedia’s structure
rather than its full textual content. We created an SQL database with tables to store and
access the page titles and articles quickly, as has been suggested and explored before [20].
We map a word in the customer reviews to a Wikipedia article if the word is contained in that
article’s title. We call such words Wikipedia words; words that cannot be mapped are referred
to as non-Wikipedia words in later sections. A word can be mapped to all its homonyms in
Wikipedia: for instance, the word ‘bank’ can refer to ‘river bank’ or ‘savings bank’. To
identify the correct article mappings, we first need to disambiguate words that may have
mappings in more than one domain; to address this, we used a method [20] in which the articles
of unambiguous words are used to disambiguate the ambiguous words.
Computing the semantic relatedness between two words that are mapped to Wikipedia amounts to
finding the semantic relatedness between the Wikipedia articles to which these words refer.
The best-known way to do this is to compute the relation from the links to these articles in
Wikipedia [18, 19].
The relation between two Wikipedia articles x and y is given by

Relation_{x,y} = 1 - \frac{\max(\log|A|, \log|B|) - \log|A \cap B|}{\log T - \min(\log|A|, \log|B|)}

Here A and B are the sets of articles that link to articles x and y respectively, T is the
total number of Wikipedia articles, and A ∩ B is their overlap. Thus for every Wikipedia word
we find its semantic relatedness to all other such words. The context relatedness (CR) feature
of a word is then computed as the sum of its similarity scores with all other such words in the
context, normalized by the total number of such words. Therefore, for a set of Wikipedia words
{x_1, x_2, x_3, ..., x_k}, the semantic relatedness of x_i to the context is given by
3http://dumps.wikimedia.org/enwiki/
CR_{x_i} = \frac{\sum_{j=1, j \neq i}^{k} Relation_{x_i, x_j}}{k}
The applicability of the CR feature is justified by its high scalability and the ever-growing
knowledge in Wikipedia.
For non-Wikipedia words {y_1, y_2, y_3, ..., y_l} in the product reviews, the CR feature is
instead set to the average of the CR values of all Wikipedia words from the reviews of that
particular product. Hence the CR value of any non-Wikipedia word y_i is uniformly given by
CR_{y_i} = \frac{\sum_{j=1}^{k} CR_{x_j}}{k}
where xj is a Wikipedia word.
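Assuming hypothetical inlink sets in place of the SQL-backed Wikipedia dump, the link-based
relatedness measure and the CR feature can be sketched as:

```python
# Link-based relatedness between Wikipedia articles and the CR feature.
# The inlink sets below are hypothetical article-id sets standing in
# for the real SQL-backed Wikipedia dump.
import math

def relation(inlinks_x, inlinks_y, total_articles):
    """Relation_{x,y} from the inlink sets A, B and the article count T."""
    a, b = len(inlinks_x), len(inlinks_y)
    overlap = len(inlinks_x & inlinks_y)
    if overlap == 0:
        return 0.0                       # no shared inlinks: unrelated
    num = max(math.log(a), math.log(b)) - math.log(overlap)
    den = math.log(total_articles) - min(math.log(a), math.log(b))
    return 1 - num / den

def cr_scores(inlinks, total_articles):
    """CR of each Wikipedia word: mean relatedness to all other such words."""
    words = list(inlinks)
    k = len(words)
    return {w: sum(relation(inlinks[w], inlinks[v], total_articles)
                   for v in words if v != w) / k
            for w in words}

inlinks = {"camera": {1, 2, 3, 4}, "lens": {2, 3, 4, 5}, "banana": {9}}
cr = cr_scores(inlinks, total_articles=4_000_000)
# non-Wikipedia words receive the average CR of all Wikipedia words
cr_default = sum(cr.values()) / len(cr)
print(cr["camera"] > cr["banana"])  # True: camera shares inlinks with lens
```

Words whose articles share many inlinks (camera, lens) score high, while a contextually
unrelated word (banana) scores zero, which is exactly the separation the classifier exploits.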
Illustration of CR:
If, for example, we find the words lens and camera in a piece of text, we look up the inlinks
of the Wikipedia articles these words represent in order to calculate the semantic relation
between lens and camera. Figure 3.1 shows that Wikipedia articles such as Angular resolution,
Ray tracing, Optical coating, Microlens, Aspheric lens, Carl Zeiss, Image, Zoom lens, Optics
and Photography are common inlinks of the two Wikipedia articles Lens and Camera. This number
of common inlinks is more than enough, according to our thresholds, to consider the two words
closely semantically related.
Figure 3.1 Context Relation using Wikipedia
3.2.3 Role of surrounding window - SW
We take into account the surrounding text of t Wikipedia words to the left and right of a given Wikipedia word to examine its role in identifying an attribute. Since some topics arise and eventually diminish within a small window of discussion, we are motivated to consider the relation with the surrounding text as a classification feature for identifying product attributes.
This feature can help in identifying sub-attributes (attributes of attributes). Sub-attributes may not seem related when the overall context is considered, but they are relevant within the limited contexts in which they occur.
Suppose there are p instances of the Wikipedia word xi in the reviews. The relation of xi with its surrounding text is computed as

SW(x_i) = ( Σ_{j=−t, j≠i}^{t} Relation(x_i, x_j) ) / (k · N · p)

where N is the total number of words in the customer reviews of a given product. The window length t is arbitrarily taken as N/20; "−t" means t words to the left of xi and "+t" means t words to the right.
The SW feature for non-Wikipedia words is uniformly given as the average of the SW values of all Wikipedia words from the reviews:

SW(y_i) = ( Σ_{j=1}^{k} SW(x_j) ) / k
3.2.4 Web search engine reference - WR
As there are words that cannot be mapped to Wikipedia, we may lose a few trivial attributes in the candidate selection stage. To boost such words we use knowledge from the Web. The WR feature measures the association, on the Internet, between a particular word from the customer reviews of a product and that product.
Figure 3.2 WR illustration
In the illustration in Figure 3.2, a query is formed with the words 'camera' and 'lens' to examine the relation between these concepts. We assume that semantically related words are more likely to co-occur, and hence measure the degree of association of the two words in the search results when a bigram query is formed from them. We can see that for the query "camera lens" there are more than 10 instances of 'camera' and 'lens' occurring together in the result snippets. The threshold for deciding whether any two words are semantically related, based on their association frequency in search results, is learnt automatically by the system in its training phase.
Unfortunately, the Google Web Search API has been officially deprecated, so we used the Bing Search API4 to compute WR for a word. Figure 3.3 shows an instance of Bing retrieving search results for the query "camera lens".
Figure 3.3 WR using Bing
4http://msdn.microsoft.com/en-us/library/dd251072.aspx
The WR value for a word zi is given by

WR(z_i) = Res(z_i, P) / SN

where Res(z_i, P) is the number of instances in which the word zi and the product name P both occur within the text snippets returned as search results, normalized by the total number of search results SN taken into account. The limitation of this feature is that the system needs to be online with a search engine.
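A hedged sketch of the WR computation: we assume the search engine has already returned a list of text snippets for the query, and we count the snippets in which the candidate word and the product name co-occur, which is one plausible reading of Res(z_i, P). The actual system used the Bing Search API; the function name here is ours.

```python
def wr_score(word, product, snippets):
    """WR: fraction of result snippets in which the candidate word and
    the product name co-occur. `snippets` stands in for the text
    snippets returned by the search API for the bigram query."""
    hits = sum(1 for s in snippets
               if word.lower() in s.lower() and product.lower() in s.lower())
    return hits / len(snippets)
```

With three snippets of which only one contains both "camera" and "lens", the score is 1/3.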
Based on the research literature in information extraction, our previous related work and our intuitions, a new information extraction system was designed with the features CR, SW, MFI and WR. This information extraction model is intended to extend the scope of its features to larger information mining systems for social interactions. The current extraction system is tested for performance in extracting product attributes from customer reviews.
3.3 Data for Information Extraction system in Social interac-
tions
We trained our newly designed extraction system using SVM and evaluated it against two popular review datasets: the Reviews-9-products dataset [6] and the Customer Reviews dataset [9].
Product Reviews:
The CustomerReviews dataset contains the semi-structured and unstructured user reviews of five products:
• Apex AD2600 Progressive-scan DVD player,
• Canon G3,
• Creative Labs Nomad Jukebox Zen Xtra 40GB,
• Nikon Coolpix 4300,
• Nokia 6610.
The Reviews-9-products dataset consists of user reviews of the following nine products:
• Canon PowerShot SD500,
• Diaper Champ,
• iPod,
• MicroMP3,
• Norton,
• Canon S100,
• Hitachi router,
• Linksys Router,
• Nokia 6600.
These datasets have been used for opinion mining tasks and are referred to by several other publications5. They have already been annotated manually in terms of product attributes and opinions on those attributes. For our task we do not need the opinions, hence we did not take the opinion annotations into account.
The words annotated as attributes consist of trivial words, terminologies, and concepts. The datasets contain customer reviews of products from different domains. Experiments are carried out at two levels: first, crucial features are tested to determine their respective performance, and then the complete combination of features is tested. To train our model we used the Reviews-9-products dataset, with the CustomerReviews dataset used for testing. Similarly, we also tested on the Reviews-9-products dataset by generating the training data from the CustomerReviews dataset.
Both datasets consist of annotated reviews of a total of fourteen products taken from Amazon6.
The following is an instance which illustrates the nature of the datasets. It shows review text
of a camera annotated according to the attributes discussed.
camera[+2][p]##the more i work with it , the more i love it !
##i would recomend that you purchase a lexar media cf for the camera as the sandisk card that comes packaged is too
small and too slow !
camera[+3][u]##this quality and ease of use for under 1500 - i ’m thrilled with my purchase !
[t]outstanding camera
camera[+3]##this is my first digital camera , and i am very pleased with it... .
##i do not know a whole lot about photography , but i am happy to know that this camera can always perform , even as i
grow in skill and knowledge .
camera[+2][u]##seriously , this thing has everything that a pro or expert amateur could want .
picture[+2], auto mode[+2]##but at the same time , it takes wonderful pictures very easily in ” auto ” mode , so that even
an average joe like me can use it !
5http://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html
6http://www.amazon.com/
four megapixel[+1]##four megapixels is great .
##i know there are five mp cameras out there , but this thing does just fine for me .
##if you want , check out the canon website and they have some sample images , taken by this camera , for you to download..
.
product[+3][u]##it is a very amazing product .
camera[+3][p]##i highly recommend it .
[t]love my new g 3
————–end of snippet———————
Symbols used in the annotated reviews:
[t]: the title of the review; each [t] tag starts a review. We did not use the title information in our work.
xxxx[+|-n]: xxxx is a product feature.
[+n]: positive opinion; n is the opinion strength (3 strongest, 1 weakest). Note that the strength is quite subjective; one may ignore it and consider only + and -.
[-n]: negative opinion.
##: start of each sentence. Each line is a sentence.
[u]: the feature did not appear in the sentence.
[p]: the feature did not appear in the sentence; pronoun resolution is needed.
[s]: suggestion or recommendation.
[cc]: comparison with a competing product from a different brand.
[cs]: comparison with a competing product from the same brand.
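For illustration, a small parser for this annotation format can be sketched in Python. The function name and the exact regular expression are our own; the format itself is as documented above (feature annotations before "##", the sentence after it, and "[t]" lines carrying review titles).

```python
import re

# One feature annotation: name, signed strength, optional tags like [u], [p].
ANNOT = re.compile(r'(?P<feat>[^,\[\]]+)\[(?P<pol>[+-]\d)\](?P<tags>(?:\[\w+\])*)')

def parse_review_line(line):
    """Split one annotated line into (features, sentence).
    Each feature is a tuple (name, signed_strength, extra_tags)."""
    if line.startswith('[t]'):                      # review title line
        return [], line[3:].strip()
    head, sep, sent = line.partition('##')
    feats = []
    if sep:
        for m in ANNOT.finditer(head):
            tags = re.findall(r'\[(\w+)\]', m.group('tags'))
            feats.append((m.group('feat').strip(), int(m.group('pol')), tags))
    return feats, sent.strip()
```

For the line `camera[+2][p]##the more i work with it , the more i love it !` this yields the feature ("camera", 2, ["p"]) plus the raw sentence; lines with several annotations, such as `picture[+2], auto mode[+2]##...`, yield one tuple per feature.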
3.4 Product attribute extraction using Wikipedia
We have considered the MFI feature as the baseline for this approach, since MFI is intuitive: when discussing a product, people mention its attributes a good number of times in their reviews. Precision in our task is given by

Precision = (number of attributes identified correctly) / (total number of words identified as attributes)

and recall is given by

Recall = (number of attributes identified correctly) / (number of attributes actually annotated)

The F-score, which combines precision and recall, is given by

F-score = (2 × Precision × Recall) / (Precision + Recall)
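As a quick sanity check, the F-scores reported later in Table 3.3 can be reproduced from their precision and recall values:

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Precision/recall pairs as reported in Table 3.3:
for name, p, r in [("MFI", 0.603, 0.112),
                   ("CR, SW, MFI", 0.878, 0.202),
                   ("CR, SW, MFI, WR", 0.802, 0.666)]:
    print(f"{name}: F-score = {f_score(p, r):.3f}")
```

The computed values agree with the table to within rounding.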
When we tested our Wikipedia-based features CR and SW along with the baseline feature MFI, we encountered low recall but a good average precision of approximately 88%. The reason for the low recall is that trivial words and some verbs cannot be mapped to Wikipedia. For example, for the Diaper Champ product listed in Table 3.1, annotated attributes such as bang-for-the-buck, deal, looking, cost-effective, works, pull, assemble, costlier, clean, safer, etc., cannot be correctly linked to Wikipedia articles. To rule out such discrepancies we could use an ontology such as WordNet, but it adds a lot of noise. When observed manually, the attributes identified using Wikipedia contained more quality attributes than loose attributes like the words mentioned above. The statistics of identified attributes from both datasets are shown in Table 3.1, and their collective precision, recall and F-score values are given in Table 3.3.
Table 3.1 CR, SW, MFI as the features

Product Name | Annotated Attributes | Candidates Selected | Attributes Identified
Diaper Champ | 68 | 16 | 14
Canon G3 | 106 | 30 | 25
Hitachi router | 82 | 14 | 11
Canon S100 | 99 | 26 | 23
Nokia 6600 | 147 | 48 | 44
MicroMP3 | 196 | 41 | 35
Nikon Coolpix 4300 | 76 | 16 | 13
iPod | 92 | 23 | 10
Creative Labs Nomad Jukebox Zen Xtra 40GB | 186 | 47 | 43
Norton | 107 | 24 | 23
Linksys Router | 85 | 24 | 18
Apex AD2600 Progressive-scan DVD player | 115 | 24 | 19
Canon PowerShot SD500 | 70 | 13 | 12
Nokia 6610 | 111 | 35 | 31
3.5 Product attribute extraction using Wikipedia & Web
When the web-based feature WR was combined with the other features, recall increased. The statistics of product attribute identification from both datasets using the features CR, SW, MFI and WR are shown in Table 3.2, and the results in terms of precision, recall and F-score are presented in Table 3.3. We can clearly see that the combination of all four features, comprising the Wikipedia-based features together with the frequency and web-based features, performed best in terms of F-score. The increase in recall is due to the gain in knowledge from WR; the fall in precision can be explained by the boosting of insignificant words in the search results.
If a Wikipedia word in the reviews is identified as an attribute by our model, we output as the product attribute the title of the Wikipedia article to which the word is mapped. If a non-Wikipedia word is identified as an attribute, we output the word itself.
Table 3.2 CR, SW, MFI, WR as the features

Product Name | Annotated Attributes | Candidates Selected | Attributes Identified
Diaper Champ | 68 | 57 | 45
Canon G3 | 106 | 93 | 70
Hitachi router | 82 | 79 | 66
Canon S100 | 99 | 91 | 73
Nokia 6600 | 147 | 112 | 85
MicroMP3 | 196 | 133 | 102
Nikon Coolpix 4300 | 76 | 54 | 46
iPod | 92 | 85 | 66
Creative Labs Nomad Jukebox Zen Xtra 40GB | 186 | 157 | 122
Norton | 107 | 94 | 73
Linksys Router | 85 | 79 | 52
Apex AD2600 Progressive-scan DVD player | 115 | 90 | 79
Canon PowerShot SD500 | 70 | 63 | 52
Nokia 6610 | 111 | 92 | 74
Table 3.3 Relative scores

Feature combination | Recall | Precision | F-score
MFI | 0.112 | 0.603 | 0.189
CR, SW, MFI | 0.202 | 0.878 | 0.328
CR, SW, MFI, WR | 0.666 | 0.802 | 0.727
3.6 Discussion
In Table 3.2, the products Diaper Champ and iPod belong to the most divergent domains. For Diaper Champ our model identified 45 out of 68 annotated attributes, whereas for iPod it identified 66 out of 92. Similarly, for the Apex AD2600 Progressive-scan DVD player it identified 79 out of 115 attributes. Recall is thus approximately equal across the products, which is evidence that the model does not depend on the domain of a product. Table 3.3 shows that adding the Web kept precision approximately constant but improved recall, owing to the added knowledge. As already mentioned, the Wikipedia-based features SW and CR contribute more quality attributes than the loose attributes identified by WR. This suggests that Wikipedia, whose database of structured knowledge grows every second, could in coming years surpass the need for the Web in attaining this level of performance on tasks like these.
3.7 IE based Framework for Summarizing Social interactions
Coupling information extraction techniques with summarization is a relatively unexplored area, and relevant literature is hard to find. We combine information extraction and extractive summarization to support content-directed summaries. As mentioned in Chapter 2, we selected the sales/service chats scenario for summarizing social interactions. Our methodology for summarizing sales/service chats evolved after careful observation of their structural elements. We treat the problem as an extractive summarization task and follow a sentence-ranking method in which the importance of each sentence is measured by a combination of feature scores. We selected this method so that we can address several properties of sales/service chats through our varied ranking features.
Sales/service chat summarization can be compared to the Guided Summarization task recently introduced at TAC7, where summaries of source documents are guided by given templates. The templates are typically a set of questions that the summary should answer.
The test dataset released by NIST8 for the TAC guided summarization task is composed of five categories:
• Accidents and Natural Disasters
• Health and Safety
• Attacks
• Endangered Resources
• Investigations and Trials
Each of the above-mentioned categories has a template of aspects that the summary must answer. For example, the accident category has the following template:
• WHAT: what happened
• WHY: reasons for accident
• DAMAGES: damages caused by the accident
• COUNTERMEASURES: countermeasures, rescue efforts, prevention efforts
The guided summarization task emphasizes a unified information model that can be emulated by automatic summarizers. It also highlights the task of finding relevant content at the sub-sentential level, enabling the use of information extraction techniques and other semantic methods, and it promotes a deeper linguistic and semantic analysis of source documents.
Through a research collaboration, we were permitted to work on a sales/service chat dataset from a reputed firm. For sales/service chats the template is taken as
1. What is the issue?
2. Steps for the resolution of the issue, if mentioned.
7http://www.nist.gov/tac/
8http://www.nist.gov/index.html
The difficulty of summarization here, however, lies in the unstructured text medium, whereas at TAC systems are required to work on newswire data. Hence one cannot proceed by finding entities, dates and values in the chat through extraction rules and language tools that answer the questions given in the templates.
In this thesis, we put forth a model that produces summaries of sales/service chats answering the above template by deriving semantics. After experimenting with and formulating methods for IE in social interactions, we moved on to extractive summarization (explained earlier in Section 2.4), making use of the extraction method developed earlier.
3.7.1 Modeling process
We trained our system with the defined features using SVM regression [7], as it eliminates the need to check feature independence and is robust. The word-level machine learning features from our extraction framework are adopted and modified into sentence-level features, along with some new features. The feature values of every sentence are extracted and its importance I(s) is estimated. Each sentence s in the training data is converted into a tuple of the form (F(s), I(s)), where F(s) = {f1, f2, f3, ...} is the vector of feature values of sentence s. A model is built by training on these tuples, and the importance of a sentence in the test dataset is predicted using this trained model.
3.7.1.1 Summary Generation
Once the sentence importance scores are obtained, the sentences are ranked in order of importance. The top-ranked sentences in the test data are the candidates for building the summary. For readability, even when a sentence is ranked higher than another, the summary follows the order of occurrence in the source text.
Sentence importance is the target value in training and testing. ROUGE, which is explained in the previous sections, is used to represent the target values of sentence importance: during training, the importance of a sentence is its ROUGE-SU4 score when compared to the reference summary, and in testing it is the value to be predicted. This importance value is used to rank sentences for inclusion in our 100-word candidate summaries. The basic framework of our summarization system is represented in Figure 3.4, where the sentence importance value is calculated using a set of features. The following are the algorithmic steps of our system with the corresponding inputs and outputs.
Training:
Input: training set (T) of social interactions.
Algorithm:
    for sentence s in T:
        calculate features f1(s), f2(s), f3(s), f4(s), ...;
        calculate F(s) = ROUGE(s, model summary);
        train SVM with {F(s), <f1, f2, f3, f4, ...>};
Output: trained SVM model M.

Summary production:
Input: a social interaction I.
Algorithm:
    for sentence s in I:
        calculate features f1(s), f2(s), f3(s), f4(s), ...;
        form feature vector F = <f1, f2, f3, f4, ...>;
    predict ranks using SVM model M: F(s) = {f1, f2, f3, f4, ...};
    G = sort(I, F(s));
        where G is the ranked list of sentences sorted in descending order
        of the obtained target rank values F(s).
    for sentence in RankedList G:
        while (summary.length <= 100):
            summary.add(sentence);
    adjust summary to source sequence;
Output: summary of I.
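The summary-assembly step can be sketched in runnable Python. This is one reading of the pseudocode's word-budget loop, assuming the sentence importance scores have already been predicted by the trained model: sentences are taken greedily by score under a 100-word budget and then restored to source order, as described in Section 3.7.1.1.

```python
def build_summary(sentences, scores, limit=100):
    """Greedily pick the highest-scoring sentences until the word
    budget is reached, then restore source order for readability."""
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen, words = [], 0
    for i in ranked:
        n = len(sentences[i].split())
        if words + n > limit:
            continue  # skip sentences that would overshoot the budget
        chosen.append(i)
        words += n
    # "adjust summary to source sequence": sort selected indices.
    return [sentences[i] for i in sorted(chosen)]
```

Whether an over-budget sentence is skipped or the loop stops outright is not specified in the pseudocode; the sketch skips it and keeps looking for shorter sentences.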
The following sections elaborate on the features fi we used.
3.7.2 Semantic relation using Wikipedia - SR
The Semantic Relation using Wikipedia (SR) feature finds the semantic relatedness between words mentioned in a text by linking them to their possible articles in Wikipedia. SR is derived from the CR and SW features of our IE framework to suit the summarization scenario. From Section 3.2.2 we are already familiar with deriving the relation between any two words through the inlink and outlink structure of the Wikipedia articles these words represent [19]. This way of deriving semantics from a text can be effective in an unstructured text environment [13].
Figure 3.4 Summarization Framework (user generated content → preprocessing → sentences → features → ranker → summary)
This feature is selected to derive possible relationships between text segments in the chat and thereby weight the sentences.
The best-known way to compute the relation [18, 19] between any two Wikipedia articles x and y is repeated here for convenience:

Relation(x, y) = 1 − (max(log|A|, log|B|) − log|A ∩ B|) / (log T − min(log|A|, log|B|))

where A and B are the sets of articles that link to the articles x and y respectively, T is the total number of Wikipedia articles, and A ∩ B is their overlap.
We take into account the surrounding text of t Wikipedia words to the left and right of a given Wikipedia word in a chat to examine its role in linking information between sentences. The relation of a Wikipedia word xi with its surrounding text is computed as

SR(x_i) = Σ_{j=−t, j≠i}^{t} Relation(x_i, x_j)
The window length t is arbitrarily taken as the average sentence length in the data; "−t" means t Wikipedia words to the left of xi and "+t" means t words to the right.
There can also be words in sales/service chats that cannot be linked to any article in Wikipedia. Hence the SR feature for non-Wikipedia words is uniformly given as the average of the SR values of all Wikipedia words in the given sentence:

SR(y_i) = ( Σ_{j=1}^{k} SR(x_j) ) / k

where k is the total number of Wikipedia words in the given sentence. The SR value for a sentence si is given by

SR(s_i) = ( Σ_{x_j ∈ s_i} SR(x_j) + Σ_{y_j ∈ s_i} SR(y_j) ) / |s_i|

where xj and yj represent a Wikipedia word and a non-Wikipedia word respectively, and |si| is the total number of words in sentence si.
3.7.3 Prepositional Importance - PF
A preposition in English grammar generally expresses the temporal, spatial or logical relationship of its object to the rest of the sentence [31]. It is interesting to observe how prepositions implicitly capture the key elements of a sentence. Observe, for example, the role of the prepositions {for, to, with} in the sentence below:
Representative: xxxx, for registration issues please send your request to [email protected] with all the needed
info. They are happy to support you finishing your registration
After careful observation of the data, we propose using the frequency of a small set of prepositions {in, on, of, at, for, from, to, by, with} as a sentence scoring feature. The frequency of prepositions indirectly achieves the effect of performing Named Entity Recognition (NER) on a sentence, but without the additional cost of processing or using POS tags. The PF score of a sentence s is given by

PF(s_i) = ( Σ_{w_i ∈ s} IsPrep(w_i) ) / |s|

where IsPrep(wi) returns 1 if wi is a preposition and 0 otherwise.
3.7.4 Term Frequency - TF
Term frequency weights a word according to the frequency of its occurrence. We calculate the frequencies of all words except stop words in a given chat conversation, and the TF score of a word is

TF(w_i) = Freq(w_i) / Σ_{w_j ∈ chat} Freq(w_j)

where Freq(wi) gives the total number of occurrences of wi in that particular chat conversation. The final term frequency (TF) score for a sentence is given by

TF(s_i) = ( Σ_{w_j ∈ s_i} TF(w_j) ) / |s_i|
3.7.5 Wiki Frequency - WF
We consider the number of Wikipedia words in a sentence as an indicator of significant information. The Wiki Frequency feature for a sentence si is given by

WF(s_i) = ( Σ_{w_j ∈ s_i} IsWikiWord(w_j) ) / |s_i|

where IsWikiWord(wj) returns 1 if wj is a Wikipedia word and 0 otherwise.
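The three surface features TF, PF and WF can be sketched together in a few lines of Python. The per-sentence normalisation by |s| follows the formulas above, while the whitespace tokenisation, lower-casing, and the way Wikipedia words and stop words are passed in as plain sets are simplifying assumptions of this sketch.

```python
from collections import Counter

PREPOSITIONS = {"in", "on", "of", "at", "for", "from", "to", "by", "with"}

def sentence_scores(sentences, stop_words=frozenset(), wiki_words=frozenset()):
    """Per-sentence TF, PF and WF scores over one chat conversation."""
    tokens = [s.lower().split() for s in sentences]
    # Chat-level word frequencies, excluding stop words (Section 3.7.4).
    freq = Counter(w for toks in tokens for w in toks if w not in stop_words)
    total = sum(freq.values())
    out = []
    for toks in tokens:
        n = len(toks)
        tf = sum(freq[w] / total for w in toks) / n
        pf = sum(w in PREPOSITIONS for w in toks) / n
        wf = sum(w in wiki_words for w in toks) / n
        out.append({"TF": tf, "PF": pf, "WF": wf})
    return out
```

For instance, in the five-word sentence "send the specification to me", the single preposition "to" gives PF = 1/5, and if "specification" is the only Wikipedia word, WF = 1/5 as well.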
3.8 Data for Summarization of Social interactions
Most of the conversations in this dataset are agent-customer sessions, varying in length and in the issues addressed. Statistics on chat length are presented in Table 3.4.
The average chat length is 309 words, already a difficult length for a human to go through at rush hour, and many chats exceed 600, 800 or even 1000 words. These numbers underscore the need to summarize these sales/service chats.
The following is an example snippet of a chat conversation in the provided dataset.
[06:25:36] Agent: Would you like me to send the specification of zz?
[06:26:45] Customer: Sorry I got called away from my desk
[06:26:57] Agent: It is okay.
[06:27:11] Customer: Yes please...
[06:27:36] Customer: In some areas it says included...hmm ..in some it mentioned "coming soon" so I wasn’t
Total no. of chats | Avg chat length | Above 600 words | Above 800 words | Above 1000 words
783 chats | 309 words | 89 chats | 40 chats | 15 chats
Table 3.4 Data Statistics
too sure!!
[06:27:49] Agent: I will send you the specification link of zz?
[06:27:58] Customer: yes please..
[06:28:01] Agent: cc.cc.cc <URL>
[06:28:30] Customer: perfect!! thank you.... thats all I was looking for now
Some details, such as the customer's name, the agent's name, other entity names, dates, timings and values, are masked to safeguard the policy of the dataset donor.
100-word human model summaries were created for 60 chat conversations, which were split into two equal disjoint sets for the training and testing phases of our evaluation. Hence the training set consists of 30 conversations and the testing set of the other 30.
3.9 Experiments related to summarization of consumer sales/ser-
vice Interactions
3.9.1 Baseline
The summarization system used at TAC 2009 by IIIT-H [30], which proved to perform well in single-document and multi-document summarization, is considered the baseline for evaluation. We chose this particular system as the baseline because it is based on sentence and word position features, which can be applied to the sales/service chats scenario.
The TAC 2009 system is inspired by the anatomy of news articles and was later applied to other kinds of text because of its generalized features.
In detail, it uses two variations of a sentence location feature. The first is inspired by the fact that the first three sentences of a document generally contain its most informative content, which holds for news articles and descriptive text such as articles in Wikipedia9. Hence, with this feature, the top three sentences are scored inversely proportionally to their position in the source text, and the remaining sentences directly proportionally.
9http://www.wikipedia.org/
score(s) = 1 − n/1000   if n <= 3
         = n/1000       otherwise
(3.1)
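This location scoring can be sketched directly in Python; the cut-off of three sentences and the 1000 divisor follow Equation 3.1, while the parameter name is ours.

```python
def position_score(n, scale=1000):
    """Sentence-location baseline of Equation 3.1: the first three
    sentences are scored higher the earlier they appear (just below 1),
    and later sentences get a small score growing with position."""
    return 1 - n / scale if n <= 3 else n / scale
```

So sentence 1 scores 0.999, sentence 3 scores 0.997, and sentence 4 drops to 0.004.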
The other variation of the sentence location feature is learnt by the machine on the TAC 2008 summarization dataset; it scores a sentence according to this learnt optimum position.
3.9.2 Comparison in performances of PF, TF, SR & WF
We used the traditional ROUGE-2 (bigrams) and ROUGE-SU4 (skip bigrams; refer to Section 3.2.2) scores for our system evaluation. Table 3.5 gives performance in terms of average ROUGE scores for the features Term Frequency (TF), Prepositional Importance (PF), Semantic Relation (SR) and Wiki Frequency (WF). It shows that our system outperformed the baseline in average ROUGE-2 and ROUGE-SU4 scores when the PF, TF and SR features are used in combination.

System | Avg ROUGE-2 | Avg ROUGE-SU4
Baseline | 0.07441 | 0.11787
TF | 0.05963 | 0.09102
TF+PF | 0.07103 | 0.10932
TF+PF+WF | 0.08216 | 0.12531
TF+PF+SR | 0.08909 | 0.12826
TF+PF+WF+SR | 0.08287 | 0.12013
Table 3.5 Scores of the baseline and of different feature combinations in our model
Our system performed well when tested against human model summaries. From Table 3.5 we clearly observe a significant performance gap between the baseline and the best combination <TF, PF, SR>. Other combinations, such as <TF, PF, WF> and <TF, PF, SR, WF>, came close to the top score. Simple TF and combinations omitting TF performed poorly, which establishes that TF is an important feature: it may not give great performance independently, but it renders the best performance alongside vital features like PF and SR.
As words related to the issue in a conversation are more frequent, TF helps in extracting the issue into summaries, while the PF, WF and SR features capture the resolution of an issue. We are also able to draw customer satisfaction into our summaries, as the TF feature captures the word "thank", used frequently by the customer and agent in a positive conversation. We infer that our semantics-based approach can effectively generate summaries for social interactions such as sales/service chats.
We showed that our system answers the two important queries, 'issue' and 'resolution', but did not use any optimizations or dependencies related to these queries, so that the system remains applicable to all other social interactions.
A more functional view of the different features in our approach is given with examples in the next section.
3.9.3 Summarization Outcome in terms of different features
The following is an example of a sales chat followed by its summaries to illustrate the
performances of different features we have used. Some of the details like the customer’s name,
agent’s name, other entity names, dates, timings, values are masked to safeguard the policy of
dataset donor.
[06:18:46] Agent: Thank you for contacting zz Pre-Sales Chat. My name is xx. How may I help you today?
[06:19:40] Customer: Hi xx,... my name is yy and am looking to purchase zz... I was wondering if this
product comes with only wi-fi or 3G and wi-fi
[06:20:15] Agent: I will be glad to assist you
[06:20:39] Agent: Please give me a moment while I check this information for you
[06:20:45] Customer: sure... thank you
[06:25:05] Agent: Thank you for your time. I have checked that 3G and wi-fi are available in zz.
[06:25:36] Agent: Would you like me to send the specification of zz?
[06:26:45] Customer: Sorry I got called away from my desk
[06:26:57] Agent: It is okay.
[06:27:11] Customer: Yes please...
[06:27:36] Customer: In some areas it says included...hmm ..in some it mentioned "coming soon" so
I wasn’t too sure!!
[06:27:49] Agent: I will send you the specification link of zz?
[06:27:58] Customer: yes please..
[06:28:01] Agent: cc.cc.cc <URL>
[06:28:30] Customer: perfect!! thank you.... thats all I was looking for now
[06:28:51] Agent: You are welcome.
[06:29:00] Agent: We have an option to transfer your chat to our Sales Support Team to help you
in customizing zz and help to place an order. May I transfer the chat
[06:29:37] Customer: I am actually in Cxx right now and dont have access to my Cyy credit card for
that information..so..
[06:29:54] Customer: Sorry I will have to do that at a later moment
[06:31:00] Agent: I understand. Do you have any other queries regarding zz products
[06:31:14] Customer: thats it for now
[06:31:47] Agent: You are welcome.
To ensure that we are always improving our service, you may receive a survey invitation at the end
of the chat session to tell us what you think about our products and services. Your feedback will be
highly appreciated
[06:32:20] Agent: Thank you for contacting zz Pre-Sales Chat. You have a great day.
The summaries for the above sales chat using different features are given in the following
subsection demonstrating the possible reasons.
Summaries:
1. The following summary is generated using the features: Term Frequency (TF), Semantic Relation (SR) and Prepositional Importance (PF)
[06:19:40] Customer: Hi xx,... my name is yy and am looking to purchase zz... I was wondering
if this product comes with only wi-fi or 3G and wi-fi [06:25:05] Agent: Thank you for your
time. I have checked that 3G and wi-fi are available in zz. [06:29:00] Agent: We have an option
to transfer your chat to our Sales Support Team to help you in customizing zz and help to place
an order. May I transfer the chat [06:29:37] Customer: I am actually in Cxx right now and
dont have access to my Cyy credit card for that information..so..
2. The following summary is generated using the features: Term Frequency (TF) and Prepositional Importance (PF)
[06:19:40] Customer: Hi xx,... my name is yy and am looking to purchase zz... I was wondering
if this product comes with only wi-fi or 3G and wi-fi [06:25:05] Agent: Thank you for your
time. I have checked that 3G and wi-fi are available in zz. [06:28:30] Customer: perfect!! thank
you.... thats all I was looking for now [06:29:00] Agent: We have an option to transfer your
chat to our Sales Support Team to help you in customizing zz and help to place an order. May
I transfer the chat [06:31:47] Agent: You are welcome. To ensure that we are always improving
our service, you may receive a survey invitation at the end of the chat session to tell us what
you think about our products and services.
3. The following summary is generated using the TF feature
[06:18:46] Agent: Thank you for contacting zz Pre-Sales Chat. My name is xx. How may I help
you today? [06:19:40] Customer: Hi xx,... my name is yy and am looking to purchase zz... I
was wondering if this product comes with only wi-fi or 3G and wi-fi [06:25:05] Agent: Thank
you for your time. I have checked that 3G and wi-fi are available in zz. [06:32:20] Agent: Thank
you for contacting zz Pre-Sales Chat. You have a great day.
4. The following summary is generated using the SR feature
[06:19:40] Customer: Hi xx,... my name is yy and am looking to purchase zz... I was wondering
if this product comes with only wi-fi or 3G and wi-fi [06:25:36] Agent: Would you like me to
send the specification of zz? [06:29:37] Customer: I am actually in Cxx right now and dont have
access to my Cyy credit card for that information..so.. [06:31:47] Agent: You are welcome. To
ensure that we are always improving our service, you may receive a survey invitation at the
end of the chat session to tell us what you think about our products and services.
5. The following summary is generated using the PF feature
[06:29:00] Agent: We have an option to transfer your chat to our Sales Support Team to
help you in customizing zz and help to place an order. May I transfer the chat [06:29:37]
Customer: I am actually in Cxx right now and dont have access to my Cyy credit card for
that information..so.. [06:29:54] Customer: Sorry I will have to do that at a later moment
[06:31:47] Agent: You are welcome. To ensure that we are always improving our service, you
may receive a survey invitation at the end of the chat session to tell us what you think about
our products and services. Your feedback will be highly appreciated
Finally, from the above example, one can observe that the Prepositional Importance (PF) feature favors sentences that contain prepositions in high densities. The Semantic Relation (SR) feature uses knowledge from Wikipedia to recognize information and selects the sentences that are most semantically related. We can also clearly see that issue-related words like 3G, Wi-Fi and Pre-Sales, along with thanking phrases containing the word thank, occur with relatively high frequencies; hence these sentences are picked up by the Term Frequency (TF) feature.
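The behaviour of the TF and PF features above can be sketched as a simple sentence scorer. This is a minimal illustration, not the exact implementation used in our system: the preposition list, the combination weights and the scoring formulas are assumptions made for the example.

```python
from collections import Counter

# Small illustrative preposition list; a real PF feature may use a fuller set.
PREPOSITIONS = {"in", "on", "at", "to", "for", "with", "of", "by", "from", "about"}

def tf_score(sentence, doc_counts, total):
    """Average relative frequency of the sentence's words in the whole chat."""
    words = sentence.lower().split()
    if not words or total == 0:
        return 0.0
    return sum(doc_counts[w] / total for w in words) / len(words)

def pf_score(sentence):
    """Density of prepositions in the sentence (Prepositional Importance)."""
    words = sentence.lower().split()
    if not words:
        return 0.0
    return sum(1 for w in words if w in PREPOSITIONS) / len(words)

def rank_sentences(sentences, w_tf=0.5, w_pf=0.5):
    """Rank chat utterances by a weighted combination of TF and PF scores."""
    counts = Counter(w for s in sentences for w in s.lower().split())
    total = sum(counts.values())
    scored = [(w_tf * tf_score(s, counts, total) + w_pf * pf_score(s), s)
              for s in sentences]
    return sorted(scored, reverse=True)
```

With the weights tuned on held-out chats, the top-ranked utterances would form the extractive summary; dropping one feature's weight to zero reproduces the single-feature summaries shown above.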
Chapter 4
Conclusions
In this thesis, we examined the role of social media in carrying valuable information and the difficulties that user generated content poses for text analysis. We devised methods to effectively mine information from the user generated content of social interactions, and examined the need for good information extraction and summarization systems for such text. We closely studied the scenarios of customer reviews and customer sales/service chats as representative examples of users' social interactions.
We presented a domain independent approach for the automatic discovery of product attributes from user reviews. Extracting product attributes from customer reviews is akin to identifying subtopics (attributes) for a given topic (product) in a discussion. We worked on this problem to investigate new extraction systems for social interactions with user generated text. Our work highlighted the possibility of providing an incremental learning capability for the extraction system. The performance scores of our system show that applying Wikipedia to carve out product attributes from customer reviews is a sound design. The Wikipedia based feature was later extended to draw semantics from social interactions.
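One way such a Wikipedia based semantic feature can be computed is through link-structure relatedness in the style of Milne's measure, where two concepts are compared by the overlap of the Wikipedia articles that link to them. The sketch below is illustrative only; the inlink sets and total article count are hypothetical inputs, not our system's actual configuration.

```python
import math

def wikipedia_relatedness(inlinks_a, inlinks_b, total_articles):
    """Link-based relatedness between two Wikipedia concepts, each
    represented by the set of article ids that link to it."""
    common = inlinks_a & inlinks_b
    if not common:
        return 0.0
    big = max(len(inlinks_a), len(inlinks_b))
    small = min(len(inlinks_a), len(inlinks_b))
    # Normalized overlap of inlink sets, scaled by corpus size.
    score = 1.0 - (math.log(big) - math.log(len(common))) / (
        math.log(total_articles) - math.log(small))
    return max(0.0, score)
```

Concepts sharing many incoming links (e.g. two product-related articles) score close to 1, while unrelated concepts with disjoint inlink sets score 0, which is what lets the feature separate attribute-bearing sentences from chatter.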
Our contribution lies in leveraging large knowledge sources like Wikipedia and the Web for tasks across domains, while dispensing with language-specific tools altogether. We viewed the problem as a classification task, which achieved good performance on the given datasets. Attribute extraction from customer reviews helps tasks like summarization of reviews, product recommendation, and enriching product attribute knowledge bases.
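The classification view can be illustrated with a minimal sketch in which each candidate word is reduced to a feature vector (here, a hypothetical term-frequency score and a Wikipedia-relatedness score) and a linear classifier labels it as attribute or non-attribute. Our system used an SVM; the perceptron below stands in only to keep the example self-contained and is not the trained model from our experiments.

```python
def perceptron_train(examples, epochs=10, lr=1.0):
    """Train a linear classifier on (feature_vector, label) pairs, label in {0, 1}."""
    dim = len(examples[0][0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, y in examples:
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
            if pred != y:
                sign = 1 if y == 1 else -1  # push the boundary toward the error
                w = [wi + lr * sign * xi for wi, xi in zip(w, x)]
                b += lr * sign
    return w, b

def is_attribute(w, b, x):
    """Decide whether a candidate's feature vector x denotes a product attribute."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b > 0
```

Candidates with both high frequency and high Wikipedia relatedness land on the positive side of the learned boundary, mirroring how the classifier separates attributes from ordinary vocabulary.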
In this work we trained and tested our system on products that belong to different domains and, interestingly, found that it performs uniformly across all domains.
As we did not make use of any natural language processing tools, this work can be extended to other languages with only minor changes in the preprocessing stage.
Later, we examined the role of information extraction in the summarization of user generated content in social interactions. We moved beyond user reviews into chats to provide a unified solution for extraction and summarization in social interactions.
In this thesis, we addressed the problem of summarizing corporate sales/service chats and proposed a model that builds summaries not only by considering the structure of the chats but also by extracting their semantics. We implicitly reused the extraction system we built for social interactions. From the results we conclude that our proposed model can be safely applied to the chat domain.
Possible extensions of this thesis include other aspects of social interactions, such as effectively carrying sentiment into summaries. Extending our extraction system to further information mining procedures, and incorporating other dynamic properties of interactions into summarization, can be taken up as future directions.
Appendix A
Social Interactions :- The set of interactions by users, which can comprise blogs, customer reviews, sales/service chats, Internet relay chats, social networking blogs and posts, etc.
Annotated attributes :- Attributes already annotated (words marked by humans as attributes). Annotation is a common term in the IR and NLP literature.
Candidates selected :- Candidate words that are selected as attributes by our system.
Attributes identified :- Words that are correctly identified as attributes by our system.
Sales/service chats :- Customer-agent chat conversations that address the queries of customers.
Customer/user reviews :- Product reviews and comments by users/customers on e-commerce websites.
UGC :- User generated content, which generally occurs in user generated forms of social media and social interactions.
Related Publications
• Sudheer Kovelamudi, Sethu Ramalingam, Arpit Sood and Vasudeva Varma, “Domain In-
dependent Model for Product Attribute Extraction from User Reviews using Wikipedia”,
In International Joint Conference on Natural Language Processing (IJCNLP), pages 1408-
1412. AFNLP, 2011.
• Vasudeva Varma, Sudheer Kovelamudi, Jayant Gupta, Nikhil Priyatam, Arpit Sood,
Harshit Jain, Aditya Mogadala, Srikanth Reddy Vaddepally, “IIIT Hyderabad in Summa-
rization and Knowledge Base Population at TAC 2011”, In proceedings of Text Analysis
Conference (TAC), National Institute of Standards and Technology Gaithersburg, Mary-
land USA, November, 2011.
• Praveen Bysani, Kranthi Reddy, Vijay Bharath Reddy, Sudheer Kovelamudi, Prasad Pin-
gali, Vasudeva Varma, “IIIT Hyderabad in Guided Summarization and Knowledge Base
Population”, In the Working Notes of Text Analysis Conference (TAC), National Institute
of Standards and Technology Gaithersburg, Maryland USA, November, 2010.
• Vasudeva Varma, Vijay Bharath Reddy, Sudheer K, Praveen Bysani, GSK Santosh, Kiran Kumar, Kranthi Reddy, Karuna Kumar, Nithin M, “IIIT Hyderabad at TAC 2009”, In the Working Notes of Text Analysis Conference (TAC), National Institute of Standards and Technology Gaithersburg, Maryland USA, November, 2009.