Information Extraction based Approach to Summarize Social Interactions
Thesis submitted in partial fulfillment
of the requirements for the degree of
Master of Science (by Research)
in
Computer Science and Engineering
by
Sudheer Kovelamudi
200602014
Search and Information Extraction Lab
International Institute of Information Technology
Hyderabad - 500 032, INDIA
July 2012
Copyright © Sudheer Kovelamudi, 2012
All Rights Reserved
International Institute of Information Technology
Hyderabad, India
CERTIFICATE
It is certified that the work contained in this thesis, titled “Information Extraction based
Approach to summarize Social Interactions” by Sudheer Kovelamudi, has been carried out
under my supervision and is not submitted elsewhere for a degree.
Date Advisor: Dr. Vasudeva Varma
To my loving mother, father and sister.
Acknowledgments
I thank Dr. Vasudeva Varma, my advisor and thesis guide, for inspiring and supporting me
in carrying out my research. Our fruitful conversations helped in weighing our ideas and research
perspectives at work. He is responsible for instilling an exquisite taste for research in me and
for my becoming a passionate researcher.
I sincerely thank my seniors Kiran, Praveen, Sai Krishna, Kranthi, VijayBharath and my
friend Phani Gadde for sparing their time to give me valuable suggestions. I am very
thankful to Dr. Ravi Vijayaraghavan, Sethu Ramalingam and Kranthi Adusumilli for their inputs
during our weekly meetings and for providing a link to current industry research so that my work
can be used effectively. I truly acknowledge the efforts of my juniors Arpit and Ajay in helping me
complete my thesis. I thank all my fellow researchers in SIEL for creating such a
productive ambiance all through my research track.
I thank Dr. Suresh Purini for granting me travel fund to attend IJCNLP 2011 in Thailand
and present my work. I sincerely thank all the reviewers of my work at IJCNLP 2011 for
providing their valuable feedback.
I enjoyed all my days at IIITH, from the beginning of my B.Tech, in all proportions of academics
and extracurricular activities. I thank my fellow football players for providing such energetic
and positive times during the evenings. I appreciate the time I spent during my MS with Pruthvi,
Akhilesh, Abhilash, Santosh and my other fellow CSD students. I take this as an opportunity to thank
my 2k6-batch mates and all my friends on campus with whom I shared my happiness and grief.
I cherish the time I spent with them all through my life.
Abstract
With the advent of Web 2.0, the Internet has been taken over by a group of applications that
facilitate the creation and exchange of user generated content. Such groups are collectively
represented as Social Media, which changed the face of communication between individuals.
Blogs, Internet forums, user reviews, chats, activities on social networking sites, etc., are some
of the notable online forms of current social media. As the content of social interactions on these
social media forms increases rapidly, it leads to the problem of information overload. A user
may find it difficult to go through millions of lines of text from different users to grasp the status of a
topic. This problem can be dealt with wisely by presenting the crux of the content rather than the whole
user generated text. Automatic extraction of important topics or attributes, and summarization
of these topics, may help in saving considerable human effort in understanding the content.
Automatic summarization has always been a classical solution to the information overload problem.
Summarization of news articles has been explored from theory to building satisfactory models,
but summarization of user generated content has not received much attention. This may
be because the amount of user content on the Internet was once not so significant but later
increased exponentially. In this thesis we study the text in user generated content, focusing on
its summarization using new extraction methodologies.
We initially focus on deriving information extraction techniques for the social interactions
domain. We identify extraction of important topics or attributes of a discussion as the first step
towards successful summarization of social interactions. Many methods have been developed
for extraction from structured text using natural language processing tools, but not much has
been done for information extraction from unstructured social media text. Hence we first focus
on deriving extraction methodologies which can later enhance summarization. We refrained
from using language processing resources in our extraction procedures to ensure their success
in all domains of social interactions rather than only the selected domain of experimentation.
We made use of knowledge from external resources like Wikipedia and the Web to enhance
extraction quality.
We used machine learning (regression) in judging the extracted output. We chose the scenario
of online customer reviews for testing our extraction engine. As the online retail market
is growing immensely, it presents a large arena of products, their descriptions, and the customer
and professional reviews that pertain to them. Reviews contain useful opinionated information about
products and their attributes. Most of the product attribute extraction techniques in the literature
work on structured descriptions using several text analysis tools. However, attributes in these
descriptions are limited compared to those in the customer reviews of a product, where users
discuss deeper and more specific attributes. In this thesis, we propose a novel supervised
domain independent model for product attribute extraction from user reviews. User generated
content contains unstructured and semi-structured text, where conventional language-grammar-dependent
tools like part-of-speech taggers, named-entity recognizers and parsers do not perform
at their expected levels. We used Wikipedia and the Web to identify product attributes from
customer reviews.
In later parts of this thesis we focus on summarization of user generated content with
help from our extraction modules. Our summarization work can be classified as extractive
summarization, where text units from the original content are used in summary production.
The text units we choose for a summary are sentence-level units. Our trained system picks
sentences from the content and then ranks them to produce a summary. Sentence ranking is done
by estimating sentence importance through a combination of word-level and sentence-level
features. We chose a scenario where summarization is very much needed. Sales/service chats
in present-day E-commerce are crucial for customer support and the growth of a company.
These chats should be carefully analyzed and followed up for product, service, agent and customer
validation. Hence summarization of these chats can minimize considerable human effort. Here,
we suggest a novel approach to effectively summarize sales/service chats by analyzing their
structure and using Wikipedia. Our system outperformed classic text summarization systems
when applied to social interactions.
Contents
Chapter Page
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Social Media and Social Interactions . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 User Generated Content . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Industry perspective . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4 Criticality of User Generated Text in mining Social Interactions . . . . . . . . . . 5
1.5 Problem Description and Contributions Made . . . . . . . . . . . . . . . . . . . . 6
1.5.1 Information Extraction from Social Interactions . . . . . . . . . . . . . . . 7
1.5.2 Summarization of Social Interactions . . . . . . . . . . . . . . . . . . . . . 8
1.6 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.7 Organisation of Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.1 Approaches of mining in social interactions . . . . . . . . . . . . . . . . . . . . . 10
2.2 Need for Information Extraction in Social Interactions . . . . . . . . . . . . . . . 10
2.3 Work related to Attribute extraction . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.4 Uses of product attribute extraction . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.5 Extraction to Summarization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.6 Flavours of Summarization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.7 Background related to Summarization of Social interactions . . . . . . . . . . . . 16
3 Information Extraction based approach for summarizing Social interactions . . . . . . 18
3.1 Machine learning and evaluation related background for IE & Summarization . . 18
3.1.1 Support Vector Machines . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.1.2 ROUGE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.2 Information Extraction Framework for Social Interactions . . . . . . . . . . . . . 21
3.2.1 Most Frequent Items-MFI . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.2.2 Context Relation using Wikipedia - CR . . . . . . . . . . . . . . . . . . . 22
3.2.3 Role of surrounding window - SW . . . . . . . . . . . . . . . . . . . . . . 26
3.2.4 Web search engine reference-WR . . . . . . . . . . . . . . . . . . . . . . . 26
3.3 Data for Information Extraction system in Social interactions . . . . . . . . . . . 29
3.4 Product attribute extraction using Wikipedia . . . . . . . . . . . . . . . . . . . . 31
3.5 Product attribute extraction using Wikipedia & Web . . . . . . . . . . . . . . . . 33
3.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.7 IE based Framework for Summarizing Social interactions . . . . . . . . . . . . . . 34
3.7.1 Modeling process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.7.1.1 Summary Generation . . . . . . . . . . . . . . . . . . . . 36
3.7.2 Semantic relation using Wikipedia - SR . . . . . . . . . . . . . . . 37
3.7.3 Prepositional Importance - PF . . . . . . . . . . . . . . . . . . . . 39
3.7.4 Term Frequency - TF . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.7.5 Wiki Frequency - WF . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.8 Data for Summarization of Social interactions . . . . . . . . . . . . . . . . 40
3.9 Experiments related to summarization of consumer sales/service Interactions . . 41
3.9.1 Baseline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.9.2 Comparison in performances of PF, TF, SR & WF . . . . . . . . . 42
3.9.3 Summarization Outcome in terms of different features . . . . . . . 43
4 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
Appendix A: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
List of Figures
Figure Page
2.1 Extraction of templates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.1 Context Relation using Wikipedia . . . . . . . . . . . . . . . . . . . . . . . 25
3.2 WR illustration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.3 WR using Bing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.4 Summarization Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
List of Tables
Table Page
1.1 Differences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
3.1 CR, SW, MFI as the features . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.2 CR, SW, MFI, WR as the features . . . . . . . . . . . . . . . . . . . . . . 33
3.3 Relative scores . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.4 Data Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.5 Scores of baseline, different features in our model . . . . . . . . . . . . . . 42
Chapter 1
Introduction
The communication sector of the world has seen dramatic changes since the launch of the Web
in 1991, and with Web 2.0 it is now far from impossible to connect two persons from
different parts of the world. Web 2.0 is associated with applications that provide a base for
information exchange, sharing and interoperability. Users can do a lot more than passively
view information on Web 2.0 websites; it is basically a user-centric design. Any two
persons in the world can interact using the different means provided by current applications of
Web 2.0. By extending what was already possible in Web 1.0, these applications give the user
more software, storage capabilities and interfaces, easily accessible through the browser. Here the network
basically provides a platform for computing. Not much time elapsed before the
Web transformed into a social domain. People are now virtually present in their social circles and are in
contact with each other with the help of the Internet. A part of this virtual world, termed social
media, comprises a large number of mobile and web-based technologies that have made communication
more interactive.
1.1 Social Media and Social Interactions
Enabled by the ideological and technological foundations of Web 2.0, social media can be
defined as a group of applications that allow the creation and exchange of information between
users. It is omnipresent, connecting communities, organizations and individuals and allowing
them to interact and communicate. Social media shapes into different forms of users' social
interactions like Internet forums, social blogs, microblogs, consumer reviews, picture blogs,
wikis, video sharing, rating, social bookmarking, social networking and podcasts. Email, social
networking websites, crowdsourcing, vlogs, voice over IP and instant messengers are widely
used applications in the above mentioned forms of social media. Many organizations build their
own social networks for their brands using these applications. There are also other
communities such as Facebook, Orkut and Google+ that engage people to interact around events.
Organizations and businesses look at social interactions as consumer generated media because
social media, which promote social interactions, are a relatively inexpensive means of
sharing information with larger audiences compared to traditional media such as newspapers and
television1. There are also other commercial user-interactive practices that involve education
and enable interactive preparation. The content in online social interactions belongs to the
category of User Generated Content.
1.2 User Generated Content
The content in social interactions is mostly controlled by users who are general public
rather than field professionals. As there is no strict standard for presentation in social media,
a user's presentation is more adapted to environmental variables. The text in
user generated content (UGC), in particular, is getting mixed up with a lot of noise.
The quality of user generated content varies from extremely useful information to spam. The
Organisation for Economic Co-operation and Development (OECD) [24] defines UGC as fitting
the following requirements:
• Content that is made publicly available, even if only to a small group of people.
• The content should present a certain amount of creativity. There is no hard line for
the amount of creativity; it always depends on the context. Creativity here means that
the content should be produced by the user rather than copied from some professional
source and posted. For example, a snippet from a professional article
used by a user to describe a situation does not count as UGC. There can always be
a very small percentage of content on UGC websites that cannot be considered UGC.
• The creation of user generated content generally happens outside of professional routines.
With the exponential increase in user generated content on the Internet, it is hard to actually find
and classify useful information. Here, we deal with the text forms of user generated
content that prevail in social interactions.
The text form of user generated content occurs in customer reviews/feedback on e-commerce
websites like Amazon.com, blogs like WordPress, wikis in text based collaboration formats, educational
content, citizen journalism, social bookmarking and group based aggregation and tag
sharing like del.icio.us, hosting sites like YouTube, and social networking sites like Twitter, Google+
and Facebook.
1http://en.wikipedia.org/wiki/Social_media
In Table 1.1 we present some basic differences between a social interaction and a regular
text document such as a news article. More discussion on these differences is given in Section 1.4.
Table 1.1 Differences

Social interaction                          Regular document
language grammar is not guaranteed          grammar is consistent
very short phrases exist as sentences       small sentences may exist but no small
                                            phrases independently
noise because of non-standard emotional     no non-standard emotional content
content words                               words exist
lack of good sentence structure             sentence structure guaranteed
missing punctuation                         punctuated text
NLP tools cannot perform consistently       NLP tools can perform consistently
discourse level features                    no discourse level features
false starts in sentences are allowed       no false starts
In this thesis, we mine and summarize information from social interactions, handling
user generated content across domains and providing a unified solution.
1.3 Industry perspective
The number of people contributing to consumer media through social interactions has grown
to such a level that the content has turned into large databases which hold incredible
market-research value for companies. These databases help organizations grasp consumer
needs, market trends, interests and spending capabilities.
• Companies have developed platforms for consumers to create content on product recommendations,
which led to the advancement of e-Commerce. On many such platforms, people
can even rate the recommendations made by others, which earns a trust and reliability
factor. The main interest for a potential buyer of a product
is to gather more information on the product from the actual experiences of people
who used it. The buyer tries to get the reviews of people who are not linked to the
company selling the product and are thus unbiased [1].
It is interesting to notice that some companies have even based their business model on
these recommendations.
• The world of E-commerce is expanding, offering millions of products for customers. Data
related to these products like descriptions supplied by retailers, customer reviews and
other customer-retailer statistics are also growing rapidly.
• Customer reviews present comprehensive information regarding the experience
with a particular product. Product ratings are another kind of product
recommendation, a bit different from customer reviews but sometimes accompanying
them. They are usually used to give a very brief approval by the user of the product,
based on its quality, on a scale of one to five.
• There is one more major perspective of online user social interactions in social media,
which takes the form of chats. Chats over the Internet range from casual conversations
between two or occasionally more users of a display-based communications system to
organizational conferencing2.
• The following is an interesting aspect of how chats influence businesses. The E-commerce
industry possesses a customer support segment that monitors the problems of its customers
and their resolution through effective means like live chats. As more consumers turn
to Web support and online shopping, tremendous opportunities arise for companies
to deliver better customer service online and drive more sales. In live chats, every
customer is assigned an agent who looks after the concerns of that customer, in real
time if possible. These chats are typically service chats, where the company provides its
service to the customer. The second category of chats that we also address is
sales chats. In sales chats, a customer approaches a live chat session to buy a product. An
agent then answers questions posed by the customer regarding product verification,
order, mode of purchase, transaction details, etc. In the sales scenario the agent-customer
chat sessions take place before and after sales, which can satisfy customers and
effectively increase sales volume. Sales and service chats are structurally similar in terms
of the customer querying and a company's representative resolving the queries.
Both sales and service chats play a crucial role in developing a fair relationship of a
company with its customers. Hence assessment of sales/service chats is done regularly to
2http://en.wikipedia.org/wiki/Chat
improve company standards. In order to assess such user social interactions, which
contain rich market-oriented value, one needs to understand the user generated content
pertaining to them and the problems in processing it.
1.4 Criticality of User Generated Text in mining Social Interactions
The text in user generated content occurring in social interactions is usually low in natural
language grammar, structure and formality. It also disagrees with other aspects of language
in ways like missing letter-case information for named entities, missing punctuation,
repetitions, lack of good sentence structure, false starts, non-standard words, pause-filling
words like "uuumm" and "uhh", and other texting disfluencies. It is more prone to express
emotional and context-specific content. Unstructured noisy text data is found in informal
environmental settings such as online chat, text messages, e-mails, message boards, user reviews,
blogs, wikis and social networking posts3.
The degree of distortion of the structure and nature of the text in user generated content varies
• From domain to domain (blogs, reviews, chats, etc.)
• From user to user
• With the editor tool environment
– If the editor provides dictionaries, the text may be more standard.
– If there are space constraints, the user may tend to input more information represented
by less text, which consequently results in breaking the grammar.
• The worst cases occur when the user presents different styles in the same domain and
environment depending on his mood, availability of time, etc.
Hence, while carrying out information extraction tasks, contemporary research faces
a lot of problems with unstructured text, as missing punctuation and the use of short phrases
can often hinder the performance of standard natural language processing tools such as part-of-speech
(POS) taggers, parsers and named entity recognizers (NER).
Possible Solutions:
It is better to avoid any natural language tool that performs best when trained on formal,
cleaned and structured language data. This is because the performance of these tools on
unstructured text is not consistent. The formats or patterns of irregularities in unstructured
text change with change of domain.
3http://en.wikipedia.org/wiki/Noisy_text_analytics
For example, reviews given by users on E-commerce sites are more unstructured than those
given by professional reviewers, and the user generated content in a given chat conversation is
much more unstructured compared to the reviews domain. If we train natural language tools
freshly on unstructured data from a given domain, they may not perform consistently
on unstructured text from other domains.
In this thesis we deal with information extraction and summarization in social interactions
while avoiding natural language processing (NLP) tools as much as possible. We experimented on
two different scenarios for extraction and summarization: one is customer reviews of
products and the other is sales/service chats. The details of the problems addressed are elaborated
in the following sections.
1.5 Problem Description and Contributions Made
Given the scenario and content of social interactions, the two interconnected
tasks of mining information, i.e., Information Extraction (IE) and Summarization,
involve gathering the crux of information from the given content. These are challenging
tasks because of the several text variations peculiar to social interactions.
Following are the contributions made in addressing the above problems in this thesis.
1. We examined the need for information extraction and summarization systems for text in
social interactions domain.
2. We examined scenarios of customer reviews for products on e-commerce websites and
customer sales/service chats, as examples of typical social interactions.
3. We built a domain independent information extraction system for extracting product
attributes from customer reviews.
4. We built features to extract semantics from user generated content using Wikipedia and
Web.
5. We examined the role of information extraction in summarization of user generated
content in social interactions.
6. We crossed barriers of user reviews and chats domains to provide a unified solution to
information mining in social interactions.
7. We built a system for summarizing sales/service chats by deriving semantics utilizing the
features from our extraction system.
1.5.1 Information Extraction from Social Interactions
To closely examine user generated content and design an extraction engine, we selected
the scenario of customer reviews on E-commerce sites such as Amazon.com, ebay.com, etc.
The online retail market is growing immensely, offering millions of products for customers. The
products are generally described in terms of a small set of attributes. Such product attributes are
mined from the descriptions to represent the product in a structured manner.
Often, descriptions deal only with generic attributes. For example, specific attributes like power
consumption, pulsator, load, spin-dry effectiveness, noise, water usage, water leakage, etc., for
a product like a washing machine cannot typically be found in descriptions. On the other hand,
customers express their opinions in the form of reviews. The opinions expressed are in terms
of the attributes they like and dislike, but not always in terms of those that are provided by the
retailer for that particular product. Hence mining the attributes that customers discuss
can be really helpful for sellers as well as for other customers.
Mining product attributes from customer reviews can enable retailers to fetch and group other
products that have similar specific attributes and to forecast more precisely. Hence many
retailers are trying to enrich their product knowledge bases with these domain specific and
product specific attributes. Attribute extraction from reviews is also useful in tasks like review
summarization, product rating, sales agent assessment, opinion mining of reviews, product
recommendation systems, customer relationship management, customer satisfaction analysis,
customer profiling, etc.
On the customers' side, they are more inclined to seek the opinions of other customers who actually
used the product or bought it from a particular retailer website. They seek an unbiased evaluation
of a product by leveraging information from multiple reviews, even though each individual
review can be subjective in nature. Therefore, a person is more interested in reading a featured
review than overall reviews like "the product is really great, awesome!" or "this is the greatest
product I have ever seen!!!" or simply the product rating.
Mining attributes from customer reviews is a challenging task, as reviews mostly comprise
user generated content. We already know that the text in such user generated content is low in
natural language grammar, structure and formality, which often hinders NLP tools.
With this motivation, we have designed a novel framework that can extract the attributes of a
product without making use of natural language tools, instead treating the text as a 'bag of words'
and using the knowledge of Wikipedia and the Web.
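As a minimal illustration of the 'bag of words' treatment (the review text below is an invented example, not thesis data), a noisy review sentence reduces to unordered token counts that require no grammar, parsing or tagging:

```python
import re
from collections import Counter

def bag_of_words(text):
    # Lowercase and keep alphanumeric runs; word order, punctuation and
    # grammar are discarded, so ungrammatical review text is still usable.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(tokens)

review = "battery life is great,, great battery!!"
print(bag_of_words(review))
```

Attribute candidates can then be scored over these counts, independently of sentence structure, which is exactly why such a representation survives noisy user generated text.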
1.5.2 Summarization of Social Interactions
After designing information extraction procedures for social interactions, we investigated
summarizing the content of social interactions. Summarization involves basic procedures
like
• Selection of information from the original content
• Ranking of the content and then its organization into a summary.
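The two steps above can be sketched generically as follows. A simple term-frequency scorer stands in for our feature combination, and the chat lines are invented for illustration; this is not the thesis system itself:

```python
from collections import Counter

def summarize(sentences, k=2):
    # Step 1: selection - every sentence is a candidate text unit.
    words = [w for s in sentences for w in s.lower().split()]
    freq = Counter(words)

    # Step 2: ranking - score each sentence by the average frequency
    # of its words, then keep the top-k sentences as the summary.
    def score(s):
        tokens = s.lower().split()
        return sum(freq[t] for t in tokens) / len(tokens)

    return sorted(sentences, key=score, reverse=True)[:k]

chat = [
    "My order has not arrived yet",
    "The order number is ABC123",
    "I will check the order status now",
]
print(summarize(chat, k=1))
```

In the full system, the term-frequency scorer is replaced by a combination of word-level and sentence-level features, but the selection-then-ranking skeleton is the same.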
We have already introduced sales/service chats in Section 1.3. An agent in a typical contact
center handles over a hundred calls in a day. Agents operate in various communication modes such
as E-mail, voice and online chat, which consequently produce huge volumes (gigabytes) of data in the form
of chat logs, customer feedback, voice conversation transcriptions, E-mails, etc. Text modes of
communication like online customer-agent chats and interactions over email tend to be noisy.
Also, transcription of voice conversations using state of the art automatic speech recognition
results in text with a 30-40% word error rate4.
Analysis of such data is essential for customer satisfaction analysis, call modeling,
customer relationship management, customer profiling, agent profiling, etc. For such analysis,
sophisticated and advanced automatic techniques are needed to handle poorly
created text. To assess a sales/service chat, one needs to go through the whole chat session and its
previous chat sessions if required (as in situations where, if an appropriate solution is not provided to
the issue raised, the customer may come back after a period of time). This demands considerable
human effort. This effort can be minimized if assessors are provided with summaries of the chat
sessions they need to go through. This also results in assessors effectively grading agents and
thereby increases chat throughput. Summarization also helps agents to quickly grasp the
information exchanged in a chat session.
1.6 Evaluation
As this thesis deals with problems of information extraction and summarization of
content in social interactions, we provide evaluation for both information extraction and
summarization.
The evaluation procedures we followed provide a platform for comparison with peer
results.
We adopted Precision, Recall and F-measure as the evaluation measures for assessing the
capability of our extraction engine, while for the evaluation of our summarization system we
used the popular ROUGE metric to score the summaries produced. The mode of evaluation is
elaborated further in Chapter 3.
4http://en.wikipedia.org/wiki/Noisy_text_analytics
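For concreteness, Precision, Recall and F-measure over a single extraction run can be computed as below. The gold and extracted attribute sets are hypothetical examples, not results from our experiments:

```python
# Gold-standard vs. extracted attribute sets (hypothetical example).
gold = {"battery life", "screen", "price", "camera"}
extracted = {"battery life", "price", "weight"}

tp = len(gold & extracted)        # attributes extracted correctly
precision = tp / len(extracted)   # how much of the output is correct
recall = tp / len(gold)           # how much of the gold set was found
f_measure = 2 * precision * recall / (precision + recall)

print(f"P={precision:.2f} R={recall:.2f} F={f_measure:.2f}")
```

ROUGE follows the same precision/recall spirit but counts overlapping n-grams between a system summary and reference summaries rather than exact set matches.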
1.7 Organisation of Thesis
The rest of this thesis is organized into the following chapters:
• Chapter 2 presents a survey of related work and relevant literature associated with infor-
mation extraction. It includes different approaches for text mining in social interactions,
their areas of success and drawbacks. It expands the perspective of information extrac-
tion into summarization of social interactions and presents work related to summarization
along with various flavors of summarization.
• Chapter 3 deals with the details of our solution to the ‘information extraction from online
social interactions’ problem. The extraction framework is explained in detail, with elaboration
of the motivation and derivation of the Wikipedia-based features and other word-density features
that help in identifying salient keywords along with their semantic relations. It also elaborates
the motivation of our approach to summarization of social interactions and presents in detail
the summarization framework we built using features from our extraction engine.
Further detail is given on the datasets used and on the classification of experimental runs by
their use of knowledge from Wikipedia and the Web. It also elaborates the evaluation procedures
adopted, the experiments conducted in a strict environment and the ablation tests that evaluate
our individual features, especially with regard to the use of external knowledge for
summarization.
We followed different evaluation themes for extraction and summarization to demonstrate the
various contributions of this thesis.
• Chapter 4 concludes with the important contributions of this thesis to the research field
of text mining in social interactions. It describes how the central idea presented in this
thesis can be adopted in different scenarios of online social interactions, and leaves the
reader with a list of intriguing extensions and future plans for this work which can have a
sound impact on current social media.
Chapter 2
Related Work
2.1 Approaches to mining social interactions
After extensive study of news-media summarization, research has in recent years migrated to
mining social media for the many benefits it offers. Mining social media helps in enhancing
security, tracking world trends and managing user interactions. With the advent of community
question answering (QA), it is now easy and effective to post a question on popular community
QA forums such as Yahoo! Answers. For users, these community QA sites have become a popular
platform for a wide range of information needs, where they can rely on other users to provide
answers. The resulting archives hold millions of such questions and their sets of answers, many
of which are priceless for the information needs of other users. To access such an immense
repository of knowledge, effective information mining systems are required [2].
2.2 Need for Information Extraction in Social Interactions
As any user can contribute an answer to a question on a community forum, the majority of the
content often reflects personal opinions and experiences. For this reason, research has always
been pressed to focus on extracting salient information from social interactions. There has
been budding research on mining and summarizing opinion from blogs, reviews and chats to
evaluate the overall drift of the content. However, theory has largely been developed for
finding the resultant or overall sentiment associated with the media rather than for digging
out important information from the content.
Question answering and information extraction have been studied over the past decade; however,
evaluation has generally been limited to isolated targets or small scopes1.
1http://nlp.cs.qc.cuny.edu/kbp/2011/
Work along the lines of information extraction continues because it forms the first step for
most summarization, question answering and opinion mining systems.
Recently, TAC2 (the Text Analysis Conference) started a track called Knowledge Base Population,
which contains a slot-filling task in which participants are encouraged to run their extraction
systems on given data. This task explores the extraction of information about entities with
reference to an external knowledge source: using a basic schema for persons, organizations and
locations, a knowledge base must be created and populated with information found in text.
Recognizing textual entailment (RTE), a task also introduced by TAC, aims at capturing major
semantic inference needs across many natural language processing applications, such as
information extraction (IE), question answering (QA) and summarization. The task is to validate
whether a hypothesis is entailed by (in agreement with) a given text.
As an example, assume common background knowledge of the business news domain and the following
text:
T1: Nokia and Intel will merge their top-end smart phone software as they face increasing
pressure from cellphone industry newcomers Google and Apple.
The following hypotheses are entailed:
• H1.1: Google and Apple are newcomers in the cellphone industry.
• H1.2: Nokia and Intel are facing pressure.
• H1.3: Nokia and Intel are facing pressure from Google and Apple.
• H1.4: Nokia and Intel produce top-end smartphone software.
If H is not entailed by T, there are two possibilities:
1. H contradicts T
2. The information in H cannot be judged as true on the basis of the information contained in T.
On the basis of entailment, a question answering system can be reframed: the hypothesis
represents the expected answer pattern for a question, and the QA problem is restructured as
identifying texts that entail this expected answer form.
In information extraction, entailment holds between different text variants that express the
same content.
We approached the RTE task by finding linguistic structures, which we call templates, that
share the same anchors. The lexical elements describing the context of a sentence are termed as
2http://www.nist.gov/tac/2011/
Figure 2.1 Extraction of templates
anchors. Templates are extracted from sentences of both the text and the hypothesis, and if the
anchors of these sentences agree with each other (i.e., overlap), the case is taken as
entailment.
For example, the sentences ‘Yahoo bought Overture’ and ‘Yahoo acquired Overture’ share the
anchors {X = Yahoo, Y = Overture}, suggesting that the two sentences entail each other.
Entailment here simply means that the given text is in agreement with the source text.
Figure 2.1 portrays our system.
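The anchor-overlap idea can be sketched in a few lines. The sketch below is only a toy
illustration, not our actual system: `extract_template` is a hypothetical stand-in that fakes
template extraction with a simple subject-verb-object split, whereas real templates come from
richer linguistic structures.

```python
# Toy sketch of anchor-overlap entailment: two sentences are taken as
# mutually entailing when the templates extracted from them share the
# same anchors {X, Y}. Template extraction is faked here with a simple
# subject-verb-object split (hypothetical; real templates are richer).
def extract_template(sentence):
    """Return (anchors, predicate) for a three-word S-V-O sentence."""
    subj, verb, obj = sentence.rstrip(".").split()
    return frozenset([subj, obj]), verb

def anchors_agree(text, hypothesis):
    """Entailment is assumed when the two anchor sets fully overlap."""
    t_anchors, _ = extract_template(text)
    h_anchors, _ = extract_template(hypothesis)
    return t_anchors == h_anchors

print(anchors_agree("Yahoo bought Overture.", "Yahoo acquired Overture."))  # True
print(anchors_agree("Yahoo bought Overture.", "Google acquired Overture."))  # False
```

On the Yahoo/Overture pair above, the shared anchors {Yahoo, Overture} make the check succeed
regardless of the predicate used.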
Extracting such templates from text requires the methods of information extraction. As the data
we worked with for the RTE task was mainly newswire media, we took major help from natural
language processing tools such as named entity recognizers (NER) and part-of-speech (POS)
taggers to identify named entities for template building.
This shows the importance of information extraction in the field of text analysis and artificial
intelligence. However, the text used for these tasks comes mainly from news articles, not much
from social media.
Extraction systems for unstructured text are still at the stage of theory and experimentation
and have not reached even a budding stage of development. This shows a requirement for good
information extraction systems for unstructured text, and an even stronger urge in the case of
social media.
Researchers have experimented with extraction techniques in different scenarios of social media,
but found a greater requirement for solid systems that can deal with the user-generated
subjectivity of the text rather than the simple objective text found in news media.
For a closer look, we selected one scenario of information extraction in social media:
extraction of product attributes from the reviews users write, on e-commerce websites like
Amazon.com, about products they have purchased and used.
One perspective on extracting product attributes from customer reviews is that of identifying
subtopics (attributes) for a given topic (product) in a discussion. Working on this problem
therefore allows us to investigate extraction systems for domains containing user-generated text.
2.3 Work related to Attribute extraction
A good amount of research has been put into product attribute extraction in recent years, but
the focus has been on extracting attributes from product descriptions, with little done on
extracting the same or more specific attributes from user reviews. Much of the existing work
focuses on whole-review classification and overall opinion extraction.
Work based on word-order occurrences, where product attributes are believed to exist as noun
phrases, has already been contributed [11, 5]. However, it was shown [10, 9, 17] that using noun
phrases tends to produce noise and low precision; that work instead identifies product
attributes with the help of part-of-speech (POS) tags and the occurrence of adjectives.
But in most cases where free-format reviews are considered, POS taggers do not function at the
expected level, as grammar is not guaranteed in user-generated text.
Some effort has also gone into product attribute extraction from product descriptions [25].
That algorithm relies on the fact that descriptions are structured pieces of text: a trained
noun-phrase recognizer model identifies noun phrases in product descriptions, which works well
on structured text but, when tested, did not work on unstructured text and long reviews.
Mooney [21] gave a good survey of prevailing techniques in general information extraction.
Chin [4] has done contextual sense disambiguation and semantic association using Wordnet.
We examined the method of using general ontologies like WordNet to find synonyms of product
attributes in reviews, which is inconsistent, as ontologies like WordNet lack domain knowledge;
domain-specific ontologies, where available, are very limited.
Turney used knowledge bases such as a thesaurus and measured the association of a pair of words
through association counts obtained by querying a search engine, which is an interesting way of
supplying external knowledge [29]. The limitation of several of these methods lies in their
failure to acquire context knowledge; context analysis is therefore in demand, as product
attributes are context- and product-dependent.
2.4 Uses of product attribute extraction
Extracting product attributes from text holds a lot of use in industry and for knowledge
building:
• Demand forecasting and prediction of market trends, through the marking of positive and
negative attributes of a product
• Product recommendations
• Easier comparison of manufacturers, suppliers and retailers
• Building knowledge bases for products
2.5 Extraction to Summarization
The summarization literature suggests information extraction as an important process of
summarization. Many text summarization systems in the past have used information extraction
either implicitly or explicitly to produce summaries.
Generally, information extraction implies that we already know what kind of information is to
be found in the source text, whereas summarization implies finding the interesting parts of the
source text. From the perspective of system developers, however, the two applications overlap
and blend into each other.
The information extraction models used for summary production typically extract important text
units, which are then used to rank the content of the text to obtain its summary. Purposeful
extraction of text units means extracting named entities, actions, attributes, subjects, etc.,
whichever are required for a domain. As seen in the previous section, people have developed very
specific kinds of extraction systems which mostly work best for documented text rather than for
social interactions. When aiming at social interactions, one needs to ensure that systems work
across their different domains, including extreme cases like chats, so that they can later be
useful to other information retrieval systems like question answering and summarization.
In this thesis, having selected the scenario of product reviews for building our extraction
system, we chose a different domain, corporate sales/service chats, for summarization, and used
the developments of our domain-independent extraction system to enhance our summarization
system.
Almost five decades of research have gone into text summarization. The following are the
different forms of summarization prevailing today.
2.6 Flavours of Summarization
• Single Document vs. Multi-Document : This categorization is based on the original
content considered for summary generation. In single-document summarization one summarizes a
single document of text, while multi-document summarization works from multiple documents that
pertain to a focused topic or to multiple topics. Handling redundancy of information is the
biggest challenge in multi-document summarization. DUC and, later, TAC are the conferences
that have focused on summarization research, providing various tasks along with datasets and
bringing together researchers from NLP and other areas.
• Query-focused vs. Generic : When summaries are produced by procedures that take the
user’s need as a query, the process is called query-focused summarization. In generic
summarization, summaries are produced by capturing the important information of the source
documents.
• Extract vs. Abstract : An extractive summary consists entirely of material from the
source text, whereas an abstract is a summary whose material is not entirely from the source.
In general, abstract summaries are human-written. Automated summary generation systems try to
achieve abstract summaries close to human-produced ones, using technologies such as natural
language generators over the extracted phrases, but have not yet reached even amateur levels.
• Blog Summarization corresponds to summarizing blogs available on the Internet. Such
summaries can focus on a topic or cover generically all the information dealt with in a blog
series. Blog summarization has gone a step further in recent years by considering and including
information from user comments while summarizing a blog. The challenges lie in dealing with
user-generated content.
• Update Summarization : When a user is aware of past proceedings in a topic or a stream of
information, the user desires an updated summary that avoids the information he already knows.
The task of producing summaries of the updated information while avoiding redundant information
is known as update summarization.
• Personalized vs. Guided Summarization : The notion of importance and relevance changes
from person to person. A personalized summarizer caters to a user’s interests and personal
background. In guided summarization, by contrast, the summary is guided by a fixed template
prepared for the particular domain of text; if the template takes the form of a questionnaire,
the summary should answer the questions in the template whenever the answers are present in the
source documents.
• Chat Summarization : Summarization of chats is in its beginning stages and has been
studied extensively in recent years because of the challenges it poses to text summarization
and because of corporate business and governmental defense requirements. Summarization in this
area is perceived somewhat differently in its procedures compared to traditional summarization
because of the medium of the source content: chats are more unstructured than blogs, which
makes extraction of salient information from a chat hard.
2.7 Background related to Summarization of Social interactions
A good amount of research has already been done in traditional text summarization, but not much
on summarization of chats. The research in this direction has given attention to technical
blogs, forums, reviews and, recently, Internet chats, but none of it has addressed the problem
we focus on in this thesis. Previous attempts used standard natural language processing (NLP)
tools and techniques to extract information from social interactions.
We suggest avoiding reliance on NLP tools, as social interactions span different domains and
text mediums whose formalities keep changing.
Among past contributions in the social interactions field, Roman worked on extraction of
stance, politeness and bias in a conversation [26, 28]. Zhuang proposed a method [35] that uses
key attributes along with occurrence frequency and parts of speech. This method works to an
extent on well-structured user reviews, but more dynamic social interactions like sales/service
chats and Internet chats contain short phrases and improper grammatical structure, where it is
better to avoid any use of natural language tools.
Attempts have been made to summarize Internet chats by segmenting the text into subtopics
[8, 34, 33]. This method cannot be applied directly to categories such as sales/service chats
and other scenarios where chats are focused on a particular topic, as it misses several
specificities of such chats; moreover, segmenting the text into subtopics is not a priority in
these situations.
Summarization of chats using a list of phrases, thereby classifying sentences by these phrases
and their frequency of occurrence, has also been experimented with [12]. Murray and Zechner
worked on summarization of multi-party, diverse-domain discussion lists [22, 23, 32], which,
even though they belong to the family of unstructured text, diverge from basic chats and
corporate customer service chats in their participants and in the level of topic shifts.
Sales/service chats can be taken as the most focused chat medium among all kinds of chats.
Much of the existing work on chat summarization focuses on topic shifts, identification of
question-answer pairs and topic clustering. No good effort has yet been made to mine
information by deriving semantics from a text domain such as social interactions. In our
scenario, every exchange of information is a question-answer (QA) pair, but one focused on the
root question; there is thus no great need to identify topic shifts and QA pairs, as the chats
are focused on their initial root question and its resolution.
In this thesis we make a successful maiden attempt at automatic summarization of social
interactions by deriving the actual semantics implicated in a discussion using external
knowledge sources.
Chapter 3
Information Extraction based approach for summarizing Social
interactions
Before getting into the details of our approach to the contemporary problem of mining
information from social interactions, the following sections briefly elaborate on the machine
learning techniques employed, their specific suitability for this problem and the metrics used
to evaluate our approach.
3.1 Machine learning and evaluation related background for IE
& Summarization
3.1.1 Support Vector Machines
Classification of data is a common task in machine learning, which is about predicting unknown
properties using known properties learned from training data.
Support vector machines (SVMs), a new generation of machine learning systems, deliver
state-of-the-art performance in real-world applications such as pattern recognition, image
classification, text categorisation, biosequence analysis and hand-written character
recognition, and are now considered one of the standard tools for data mining1.
From a set of training samples, each marked as belonging to one of two categories, an SVM
training algorithm builds a model that assigns new samples to one category or the other. More
analytically, support vector machines construct a hyperplane or set of hyperplanes in a
high-dimensional space, which can be used to classify samples into different categories.
For the current extraction task we used classification with support vector machines, whereas
for the later summarization module we made use of support vector regression, another branch of
SVMs, dealt with in later sections of this chapter.
1http://en.wikipedia.org/wiki/Support_vector_machine
In classification tasks, the parent problem may be stated in a finite-dimensional space, but it
often happens that the sets are not linearly separable there. In SVMs, the original
finite-dimensional space is therefore mapped into a much higher-dimensional space, making the
discrimination task easier in that space.
Good discrimination between categories is achieved by the hyperplane that has the greatest
distance, termed the functional margin, to the nearest training sample of any class. In
general, the greater the margin, the lower the generalization error of the classifier.
For this reason we selected support vector machines, over other machine learning algorithms
like neural networks, decision tree learning and probabilistic graphical models such as
Bayesian networks, to aid our extraction procedure through classification.
Our methodology is modeled with the different domains of social interactions in mind, so we aim
to minimize the generalization error of our system rather than overfit it to data from a
particular domain. Because training sets are finite and the future is uncertain, learning
theory usually does not yield guaranteed consistency in the performance of algorithms2; it is
therefore always better to choose a machine learning algorithm that minimizes the
generalization error when addressing scientific problems such as ours.
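As a concrete illustration of the margin intuition (and only an illustration; the experiments
in this thesis use the LIBSVM package, not this code), a minimal linear SVM can be trained with
Pegasos-style sub-gradient descent on the hinge loss over a toy separable dataset:

```python
# Minimal linear SVM (no bias term) trained with Pegasos-style
# sub-gradient descent on the hinge loss. Illustrative sketch only:
# the actual experiments in this thesis rely on LIBSVM.
def train_linear_svm(samples, labels, lam=0.01, epochs=100):
    w = [0.0, 0.0]
    t = 0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            t += 1
            eta = 1.0 / (lam * t)                    # decaying step size
            w = [(1 - eta * lam) * wi for wi in w]   # regularisation shrink
            # push only on margin violations (hinge-loss sub-gradient)
            if y * (w[0] * x[0] + w[1] * x[1]) < 1:
                w = [wi + eta * y * xi for wi, xi in zip(w, x)]
    return w

def predict(w, x):
    return 1 if w[0] * x[0] + w[1] * x[1] >= 0 else -1

X = [(2.0, 2.0), (3.0, 1.0), (-2.0, -2.0), (-3.0, -1.0)]
y = [1, 1, -1, -1]
w = train_linear_svm(X, y)
print([predict(w, x) for x in X])  # separates both classes
```

The regularisation term `lam` plays the role of the margin/error trade-off: a smaller value
drives the solution toward the largest-margin separating hyperplane.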
3.1.2 ROUGE
In the summarization domain, ROUGE scores for content match have been shown to correlate highly
with human evaluation [14].
A string of conferences, special topic sessions and workshops on automatic text summarization,
such as ACL, COLING, SIGIR and WAS 2000-02, along with US-government-sponsored evaluation
efforts through conferences like DUC and TAC, have advanced the technology and produced
experimental text summarization systems (Radev et al. 2001, McKeown et al. 2002).
Based on various statistical metrics, results show that automatic evaluation using n-gram or
unigram co-occurrences between summary pairs correlates very well with human evaluations [16].
In spite of all these efforts, however, there were no common, repeatable and convenient
evaluation procedures that could be applied with ease to support system development and quick
comparison among different summarization procedures.
ROUGE, a package for the automatic evaluation of summaries, was finally developed by Chin-Yew
Lin; the name stands for Recall-Oriented Understudy for Gisting Evaluation [15]. It includes
measures to determine automatically the quality of a summary by comparing it with corresponding
human model summaries.
2http://en.wikipedia.org/wiki/Machine_learning
The measures count the number of overlapping and matching units, such as word sequences and
word pairs, between the system-generated summary to be evaluated and the human-made reference
summaries.
• Formally, ROUGE-N is an n-gram recall between a candidate summary and a reference
summary, where n is the length of the n-gram. Suppose r is a reference summary; the basic
formulation of ROUGE-N is given as

ROUGE\text{-}N = \frac{\sum_{gram_n \in r} COUNT_{match}(gram_n)}{\sum_{gram_n \in r} COUNT(gram_n)}

If more than one reference summary is provided, then ROUGE-N is

ROUGE\text{-}N = \frac{\sum_{r \in references} \sum_{gram_n \in r} COUNT_{match}(gram_n)}{\sum_{r \in references} \sum_{gram_n \in r} COUNT(gram_n)}

COUNT_{match}() gives the number of n-grams matched. Setting n = 2 gives ROUGE-2.
• ROUGE-S is skip-bigram co-occurrence statistics. A skip-bigram is any pair of words in
their sentence order, allowing for arbitrary gaps. For example, the sentence “Sudheer is
writing thesis.” contains C(4,2) = 6 skip-bigrams:
– Sudheer-is
– Sudheer-writing
– Sudheer-thesis
– is-writing
– is-thesis
– writing-thesis
Skip-bigram co-occurrence statistics measure the overlap of skip-bigrams between a candidate
summary S and a reference summary r. Suppose skipBi() gives the count of matched skip-bigrams;
then ROUGE-S is given as

R = \frac{skipBi(S, r)}{C(r, 2)}, \quad P = \frac{skipBi(S, r)}{C(S, 2)}, \quad ROUGE\text{-}S = \frac{2RP}{R + P}

C(r, 2) and C(S, 2) give the number of skip-bigrams in the reference summary and the candidate
summary respectively.
20
• ROUGE-SU is an extension of ROUGE-S with the addition of the unigram as a counting unit.
ROUGE-SU4 thus matches skip-bigrams with skip distance up to 4, along with the unigram counts
of the (stemmed) words (Lin, 2004).
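The measures above are straightforward to reproduce for a single reference summary. The
following sketch implements ROUGE-N recall and the skip-bigram ROUGE-S F-score, using the
“Sudheer is writing thesis.” example:

```python
# Sketch of ROUGE-N recall and skip-bigram ROUGE-S for one reference.
from collections import Counter
from itertools import combinations

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def rouge_n(candidate, reference, n):
    """ROUGE-N: n-gram recall of the candidate against one reference."""
    cand, ref = Counter(ngrams(candidate, n)), Counter(ngrams(reference, n))
    matched = sum(min(c, ref[g]) for g, c in cand.items())
    return matched / sum(ref.values())

def skip_bigrams(tokens):
    """All in-order word pairs with arbitrary gaps: C(len, 2) of them."""
    return Counter(combinations(tokens, 2))

def rouge_s(candidate, reference):
    """Skip-bigram F-score between a candidate S and a reference r."""
    s, r = skip_bigrams(candidate), skip_bigrams(reference)
    matched = sum(min(c, r[b]) for b, c in s.items())
    recall = matched / sum(r.values())       # skipBi(S, r) / C(r, 2)
    precision = matched / sum(s.values())    # skipBi(S, r) / C(S, 2)
    return 2 * recall * precision / (recall + precision) if matched else 0.0

ref = "Sudheer is writing thesis".split()
print(len(list(combinations(ref, 2))))                     # C(4,2) = 6
print(rouge_n("Sudheer is writing code".split(), ref, 1))  # 3/4 = 0.75
```

A full ROUGE run additionally stems words, handles multiple references and reports confidence
intervals; this sketch only mirrors the core formulas.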
3.2 Information Extraction Framework for Social Interactions
The major information extraction (IE) problem that has been studied extensively in the past is
named entity extraction. In the general scenario of a news media corpus, the extraction of
desired entities and information is guided by named entity recognition tools and
part-of-speech tools, along with applied rules framed using knowledge of the field of the
given content.
Turning to the media of social interactions, we tried extracting the crux entities, i.e.,
product attributes, from a set of customer reviews. We used the orthodox methods listed above
for extraction, but the reviews, which consist of incomplete sentences and short phrases, made
the system perform poorly and inconsistently. This is because current NLP tools are trained on
ordered language datasets with little extension to noisy text analytics.
We approached the problem of extracting the attributes of a product by designing a solution
that can serve as a general extraction method across all dimensions of social interactions,
trying to automatically grasp the semantics of the content by using external resources like
Wikipedia.
For any given product, our approach to attribute extraction involves:
1. Collecting customer reviews of the given product.
2. Filtering out stop words.
3. Computing the features we have defined for the remaining words.
4. Identifying possible attribute words using a classification model trained on these features.
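The four steps above can be sketched as follows. The feature function and decision rule here
are hypothetical placeholders (a bare frequency feature and a threshold), not our actual
system, which computes the MFI, CR, SW and WR features described in the following sections and
classifies with an SVM.

```python
# Sketch of the four-step attribute-extraction pipeline. The feature
# function and the classifier are hypothetical stand-ins: the real
# system uses MFI/CR/SW/WR features and an SVM decision function.
from collections import Counter

STOP_WORDS = {"the", "a", "is", "of", "and", "this", "very"}

def candidate_words(reviews):
    # Steps 1-2: pool the review text and filter out stop words
    tokens = [w.lower() for text in reviews for w in text.split()]
    return [w for w in tokens if w not in STOP_WORDS]

def features(word, candidates):
    # Step 3: placeholder feature vector (relative frequency only)
    counts = Counter(candidates)
    return [counts[word] / len(candidates)]

def extract_attributes(reviews, classify):
    # Step 4: keep the words the (trained) classifier accepts
    cands = candidate_words(reviews)
    return sorted({w for w in cands if classify(features(w, cands))})

reviews = ["The lens of this camera is very sharp",
           "This camera lens is a zoom lens"]
# stand-in for the SVM decision function: accept frequent words
attrs = extract_attributes(reviews, lambda f: f[0] > 0.15)
print(attrs)  # ['camera', 'lens']
```

Swapping the lambda for a trained classifier and enriching `features` with the measures below
recovers the structure of the actual system.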
Support vector machines (SVMs), a set of related supervised learning methods for classification
and regression analysis, facilitate our model.
To classify the candidate words into attributes, we used the LIBSVM package for our machine
learning tasks. LIBSVM is integrated software for support vector classification, regression
and distribution estimation, and also supports multi-class classification [3].
We used LIBSVM for machine learning tasks from data scaling to parameter selection. We made use
of the scripts provided with the package to select an appropriate kernel and internal
parameters; this is time-expensive, as it tests all the kernels that may suit the given data,
so we opted for it only in the training phase.
Our task is now reduced to identifying a set of features that can pick out the attributes
from customer reviews. The features on which our system has been trained are explained in the
following sections.
3.2.1 Most Frequent Items-MFI
Words related to the topics discussed most occur at high frequencies in any given text, and in
general people frequently discuss the attributes of a product in their reviews. The ‘Most
Frequent Items’ feature boosts the importance of attribute words by their frequency of
occurrence in customer reviews; it is close to the tf-idf measure [27]. Note that our task is
not to identify the attributes of a particular product from the customer reviews of various
products, but to identify attributes only from the customer reviews of that particular product.
The set of words {z_1, z_2, z_3, ..., z_m} used for this feature is obtained from the customer
reviews of a given product after stop-word removal. For any word z_i, the ‘Most Frequent
Items’ feature is computed as

MFI(z_i) = \frac{Freq(z_i)}{\sum_{j=1}^{m} Freq(z_j)}

where Freq(z_i) gives the total number of occurrences of z_i in the reviews of the given
product.
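A minimal sketch of the MFI computation over the stop-filtered review words:

```python
# MFI(z_i) = Freq(z_i) / sum_j Freq(z_j), over the stop-filtered words
# from the reviews of one product.
from collections import Counter

def mfi(words):
    freq = Counter(words)
    total = sum(freq.values())
    return {w: c / total for w, c in freq.items()}

# example word list after stop-word removal (illustrative data)
words = ["camera", "lens", "battery", "lens", "lens"]
scores = mfi(words)
print(scores["lens"])  # 3/5 = 0.6
```

By construction the scores sum to 1 over the vocabulary of a product's reviews.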
3.2.2 Context Relation using Wikipedia - CR
To understand or identify a context, we need the set of keywords that portray it. We therefore
assume that any context C can be expressed as C = {t_1, t_2, t_3, ..., t_n}, where the t_i are
the related keywords dealt with in C.
In customer reviews, the product forms the context: people talk about the product and its
attributes in their reviews, and the attributes and other highly related things belong to the
set of keywords of the context.
The CR feature is about identifying the related keywords mentioned in customer reviews that can
be found in Wikipedia. We start by identifying in Wikipedia all words discussed in the reviews
of a given product, and then proceed to calculate the most semantically related words among
them.
When we judge the semantic relatedness between any two given words, we draw on a huge amount of
background knowledge about the concepts those words represent; any attempt to state the
semantic relatedness between words automatically needs to do the same. One can use hand-crafted
lexical structures like thesauri and taxonomies, or statistical analysis of huge corpora, to
make these semantic decisions automatically [19]. The limiting factors of such techniques when
carried across domains are background knowledge, precision, scalability and scope. With more
than 4 million articles and thousands of volunteers all over the world, Wikipedia, a massive
and growing repository of knowledge, is the best alternative in the face of such limitations.
We explore Wikipedia’s link structure, category structure, article titles and page types from
the latest static pages-articles XML dump3 of Wikipedia; we only need Wikipedia’s structure
rather than its full textual content. We created an SQL database with tables to store and
access the page titles and articles quickly, as has been suggested and explored before [20].
We map a word in the customer reviews to a Wikipedia article if the word is contained in that
article’s title. We call such words Wikipedia words; words that cannot be mapped are referred
to as non-Wikipedia words in later sections. A word can be mapped to all its homonyms in
Wikipedia: for instance, the word ‘bank’ can refer to ‘river bank’ or ‘savings bank’. To
identify the correct article mappings, we first need to disambiguate words that may have
mappings in more than one domain; to address this, we used a method [20] in which the articles
of unambiguous words are used to disambiguate the ambiguous words.
Computing the semantic relatedness between two words that are mapped to Wikipedia amounts to
finding the semantic relatedness between the Wikipedia articles to which these words refer.
The best-known way to do this is to compute the relation from the links to these articles in
Wikipedia [18, 19].
The relation between two Wikipedia articles x and y is given by

Relation_{x,y} = 1 - \frac{\max(\log|A|, \log|B|) - \log|A \cap B|}{\log T - \min(\log|A|, \log|B|)}

Here A and B are the sets of articles that link to articles x and y respectively, T is the
total number of Wikipedia articles, and A ∩ B is their overlap. Thus for every Wikipedia word
we find its semantic relatedness to all other such words. The context relatedness (CR) feature
of a word is then computed as the sum of its similarity scores with all other such words in the
context, normalized by the total number of such words. Therefore, for a set of Wikipedia words
{x_1, x_2, x_3, ..., x_k}, the semantic relatedness of x_i to the context is given by
3http://dumps.wikimedia.org/enwiki/
CR_{x_i} = \frac{\sum_{j=1, j \neq i}^{k} Relation_{x_i, x_j}}{k}
The applicability of the CR feature is justified by its high scalability and the ever-growing
knowledge in Wikipedia.
For non-Wikipedia words {y_1, y_2, y_3, ..., y_l} in the product reviews, the CR feature is
instead set to the average of the CR values of all Wikipedia words from the reviews of that
particular product. Hence the CR value of any non-Wikipedia word y_i is uniformly given by
CR_{y_i} = \frac{\sum_{j=1}^{k} CR_{x_j}}{k}
where xj is a Wikipedia word.
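Assuming hypothetical inlink sets in place of the SQL-backed Wikipedia dump, the link-based
relatedness measure and the CR feature can be sketched as:

```python
# Link-based relatedness between Wikipedia articles and the CR feature.
# The inlink sets below are hypothetical article-id sets standing in
# for the real SQL-backed Wikipedia dump.
import math

def relation(inlinks_x, inlinks_y, total_articles):
    """Relation_{x,y} from the inlink sets A, B and the article count T."""
    a, b = len(inlinks_x), len(inlinks_y)
    overlap = len(inlinks_x & inlinks_y)
    if overlap == 0:
        return 0.0                       # no shared inlinks: unrelated
    num = max(math.log(a), math.log(b)) - math.log(overlap)
    den = math.log(total_articles) - min(math.log(a), math.log(b))
    return 1 - num / den

def cr_scores(inlinks, total_articles):
    """CR of each Wikipedia word: mean relatedness to all other such words."""
    words = list(inlinks)
    k = len(words)
    return {w: sum(relation(inlinks[w], inlinks[v], total_articles)
                   for v in words if v != w) / k
            for w in words}

inlinks = {"camera": {1, 2, 3, 4}, "lens": {2, 3, 4, 5}, "banana": {9}}
cr = cr_scores(inlinks, total_articles=4_000_000)
# non-Wikipedia words receive the average CR of all Wikipedia words
cr_default = sum(cr.values()) / len(cr)
print(cr["camera"] > cr["banana"])  # True: camera shares inlinks with lens
```

Words whose articles share many inlinks (camera, lens) score high, while a contextually
unrelated word (banana) scores zero, which is exactly the separation the classifier exploits.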
Illustration of CR:
If, for example, we find the words lens and camera in a piece of text, we look up the inlinks
of the Wikipedia articles these words represent in order to calculate the semantic relation
between lens and camera. Figure 3.1 shows that Wikipedia articles such as Angular resolution,
Ray tracing, Optical coating, Microlens, Aspheric lens, Carl Zeiss, Image, Zoom lens, Optics
and Photography are common inlinks of the two Wikipedia articles Lens and Camera. This number
of common inlinks is more than enough, according to our thresholds, to consider the two words
closely semantically related.
Figure 3.1 Context Relation using Wikipedia
3.2.3 Role of surrounding window - SW
We take into account the surrounding text of t Wikipedia words to the left and right of a given Wikipedia word to examine its role in identifying an attribute. Since some topics arise and eventually diminish within a small window of discussion, we are motivated to consider the relation with the surrounding text as a classification feature for identifying product attributes.
This feature can help in identifying sub-attributes (attributes of attributes). Sub-attributes may not seem related when the overall context is considered, but they are relevant within the limited contexts in which they occur.
Suppose there are p instances of the Wikipedia word xi in the reviews. The relation of xi with its surrounding text is computed as

SW(x_i) = ( Σ_{j=−t, j≠i}^{t} Relation(x_i, x_j) ) / (k · N · p)

where N is the total number of words in the customer reviews of a given product. The window length t is arbitrarily taken as N/20; "−t" means t words to the left of xi and "+t" means t words to the right.
The SW feature for non-Wikipedia words is uniformly given as the average of the SW values of all Wikipedia words from the reviews:

SW(y_i) = ( Σ_{j=1}^{k} SW(x_j) ) / k
3.2.4 Web search engine reference - WR
As there are words that cannot be mapped to Wikipedia, we may lose a few trivial attributes in the candidate selection stage. To boost such words we use knowledge from the Web. The WR feature measures the association, on the Internet, between a particular word from the customer reviews of a product and that product.
Figure 3.2 WR illustration
In the illustration in Figure 3.2, a query is formed with the words 'camera' and 'lens' to examine the relation between these concepts. We assume that semantically related words are more likely to co-occur, and hence measure the degree of association of the two words in the search results when a bigram query is formed from them. We can see that for the query "camera lens" there are more than 10 instances of 'camera' and 'lens' occurring together in the result snippets. The threshold for deciding whether any two words are semantically related, based on their association frequency in search results, is learnt automatically by the system in its training phase.
Unfortunately, the Google Web Search API has been officially deprecated, so we used the Bing Search API4 to compute WR for a word. Figure 3.3 shows an instance of Bing retrieving search results for the query "camera lens".
Figure 3.3 WR using Bing
4http://msdn.microsoft.com/en-us/library/dd251072.aspx
The WR value for a word zi is given by

WR(z_i) = Res(z_i, P) / SN

where Res(z_i, P) is the number of instances in which the word zi and the product name P both occur within the text snippets returned as search results, normalized by the total number of search results SN taken into account. The limitation of this feature is that the system needs to be online with a search engine.
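A hedged sketch of the WR computation: we assume the search engine has already returned a list of text snippets for the query, and we count the snippets in which the candidate word and the product name co-occur, which is one plausible reading of Res(z_i, P). The actual system used the Bing Search API; the function name here is ours.

```python
def wr_score(word, product, snippets):
    """WR: fraction of result snippets in which the candidate word and
    the product name co-occur. `snippets` stands in for the text
    snippets returned by the search API for the bigram query."""
    hits = sum(1 for s in snippets
               if word.lower() in s.lower() and product.lower() in s.lower())
    return hits / len(snippets)
```

With three snippets of which only one contains both "camera" and "lens", the score is 1/3.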
Based on the research literature in information extraction, our previous related work and our intuitions, a new information extraction system was designed with the features CR, SW, MFI and WR. This information extraction model is intended to extend the scope of its features to larger information mining systems for social interactions. The current extraction system is tested for performance in extracting product attributes from customer reviews.
3.3 Data for Information Extraction system in Social interac-
tions
We trained our newly designed extraction system using SVM and evaluated it against two popular review datasets: the Reviews-9-products dataset [6] and the Customer Reviews dataset [9].
Product Reviews:
The CustomerReviews dataset contains the semi-structured and unstructured user reviews of five products:
• Apex AD2600 Progressive-scan DVD player,
• Canon G3,
• Creative Labs Nomad Jukebox Zen Xtra 40GB,
• Nikon Coolpix 4300,
• Nokia 6610.
The Reviews-9-products dataset consists of user reviews of the following nine products:
• Canon PowerShot SD500,
• Diaper Champ,
• iPod,
• MicroMP3,
• Norton,
• Canon S100,
• Hitachi router,
• Linksys Router,
• Nokia 6600.
These datasets have been used for opinion mining tasks and are referred to by several other publications5. They have already been annotated manually in terms of product attributes and opinions on those attributes. For our task we do not need the opinions, hence we did not take the opinion annotations into account.
The words annotated as attributes consist of trivial words, terminologies, and concepts. The datasets contain customer reviews of products from different domains. Experiments are carried out at two levels: first, crucial features are tested to determine their respective performance, and then the complete combination of features is tested. To train our model we used the Reviews-9-products dataset, with the CustomerReviews dataset used for testing. Similarly, we also tested on the Reviews-9-products dataset by generating the training data from the CustomerReviews dataset.
Both datasets consist of annotated reviews of a total of fourteen products taken from Amazon6.
The following is an instance which illustrates the nature of the datasets. It shows review text
of a camera annotated according to the attributes discussed.
camera[+2][p]##the more i work with it , the more i love it !
##i would recomend that you purchase a lexar media cf for the camera as the sandisk card that comes packaged is too
small and too slow !
camera[+3][u]##this quality and ease of use for under 1500 - i ’m thrilled with my purchase !
[t]outstanding camera
camera[+3]##this is my first digital camera , and i am very pleased with it... .
##i do not know a whole lot about photography , but i am happy to know that this camera can always perform , even as i
grow in skill and knowledge .
camera[+2][u]##seriously , this thing has everything that a pro or expert amateur could want .
picture[+2], auto mode[+2]##but at the same time , it takes wonderful pictures very easily in ” auto ” mode , so that even
an average joe like me can use it !
5http://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html
6http://www.amazon.com/
four megapixel[+1]##four megapixels is great .
##i know there are five mp cameras out there , but this thing does just fine for me .
##if you want , check out the canon website and they have some sample images , taken by this camera , for you to download..
.
product[+3][u]##it is a very amazing product .
camera[+3][p]##i highly recommend it .
[t]love my new g 3
————–end of snippet———————
Symbols used in the annotated reviews:
[t]: the title of the review; each [t] tag starts a review. We did not use the title information in our work.
xxxx[+|-n]: xxxx is a product feature.
[+n]: positive opinion; n is the opinion strength (3 strongest, 1 weakest). Note that the strength is quite subjective; one may ignore it and consider only + and -.
[-n]: negative opinion.
##: start of each sentence. Each line is a sentence.
[u]: the feature did not appear in the sentence.
[p]: the feature did not appear in the sentence; pronoun resolution is needed.
[s]: suggestion or recommendation.
[cc]: comparison with a competing product from a different brand.
[cs]: comparison with a competing product from the same brand.
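For illustration, a small parser for this annotation format can be sketched in Python. The function name and the exact regular expression are our own; the format itself is as documented above (feature annotations before "##", the sentence after it, and "[t]" lines carrying review titles).

```python
import re

# One feature annotation: name, signed strength, optional tags like [u], [p].
ANNOT = re.compile(r'(?P<feat>[^,\[\]]+)\[(?P<pol>[+-]\d)\](?P<tags>(?:\[\w+\])*)')

def parse_review_line(line):
    """Split one annotated line into (features, sentence).
    Each feature is a tuple (name, signed_strength, extra_tags)."""
    if line.startswith('[t]'):                      # review title line
        return [], line[3:].strip()
    head, sep, sent = line.partition('##')
    feats = []
    if sep:
        for m in ANNOT.finditer(head):
            tags = re.findall(r'\[(\w+)\]', m.group('tags'))
            feats.append((m.group('feat').strip(), int(m.group('pol')), tags))
    return feats, sent.strip()
```

For the line `camera[+2][p]##the more i work with it , the more i love it !` this yields the feature ("camera", 2, ["p"]) plus the raw sentence; lines with several annotations, such as `picture[+2], auto mode[+2]##...`, yield one tuple per feature.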
3.4 Product attribute extraction using Wikipedia
We have considered the MFI feature as the baseline for this approach, since MFI is intuitive: when discussing a product, people mention its attributes a good number of times in their reviews. Precision in our task is given by

Precision = (number of attributes identified correctly) / (total number of words identified as attributes)

and recall is given by

Recall = (number of attributes identified correctly) / (number of attributes actually annotated)

The F-score, which combines precision and recall, is given by

F-score = (2 × Precision × Recall) / (Precision + Recall)
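As a quick sanity check, the F-scores reported later in Table 3.3 can be reproduced from their precision and recall values:

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Precision/recall pairs as reported in Table 3.3:
for name, p, r in [("MFI", 0.603, 0.112),
                   ("CR, SW, MFI", 0.878, 0.202),
                   ("CR, SW, MFI, WR", 0.802, 0.666)]:
    print(f"{name}: F-score = {f_score(p, r):.3f}")
```

The computed values agree with the table to within rounding.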
When we tested our Wikipedia-based features CR and SW along with the baseline feature MFI, we encountered low recall but a good average precision of approximately 88%. The reason for the low recall is that trivial words and some verbs cannot be mapped to Wikipedia. For example, for the Diaper Champ product listed in Table 3.1, annotated attributes such as bang-for-the-buck, deal, looking, cost-effective, works, pull, assemble, costlier, clean, safer, etc., cannot be correctly linked to Wikipedia articles. To rule out such discrepancies we could use an ontology such as WordNet, but it adds a lot of noise. When observed manually, the attributes identified using Wikipedia contained more quality attributes than loose attributes like the words mentioned above. The statistics of identified attributes from both datasets are shown in Table 3.1, and their collective precision, recall and F-score values are given in Table 3.3.
Table 3.1 CR, SW, MFI as the features

Product Name | Annotated Attributes | Candidates Selected | Attributes Identified
Diaper Champ | 68 | 16 | 14
Canon G3 | 106 | 30 | 25
Hitachi router | 82 | 14 | 11
Canon S100 | 99 | 26 | 23
Nokia 6600 | 147 | 48 | 44
MicroMP3 | 196 | 41 | 35
Nikon Coolpix 4300 | 76 | 16 | 13
iPod | 92 | 23 | 10
Creative Labs Nomad Jukebox Zen Xtra 40GB | 186 | 47 | 43
Norton | 107 | 24 | 23
Linksys Router | 85 | 24 | 18
Apex AD2600 Progressive-scan DVD player | 115 | 24 | 19
Canon PowerShot SD500 | 70 | 13 | 12
Nokia 6610 | 111 | 35 | 31
3.5 Product attribute extraction using Wikipedia & Web
When the web-based feature WR was combined with the other features, recall increased. The statistics of product attribute identification from both datasets using the features CR, SW, MFI and WR are shown in Table 3.2, and the results in terms of precision, recall and F-score are presented in Table 3.3. We can clearly see that the combination of all four features, comprising the Wikipedia-based features together with the frequency and web-based features, performed best in terms of F-score. The increase in recall is due to the gain in knowledge from WR; the fall in precision can be explained by the boosting of insignificant words in the search results.
If a Wikipedia word in the reviews is identified as an attribute by our model, we output as the product attribute the title of the Wikipedia article to which the word is mapped. If a non-Wikipedia word is identified as an attribute, we output the word itself.
Table 3.2 CR, SW, MFI, WR as the features

Product Name | Annotated Attributes | Candidates Selected | Attributes Identified
Diaper Champ | 68 | 57 | 45
Canon G3 | 106 | 93 | 70
Hitachi router | 82 | 79 | 66
Canon S100 | 99 | 91 | 73
Nokia 6600 | 147 | 112 | 85
MicroMP3 | 196 | 133 | 102
Nikon Coolpix 4300 | 76 | 54 | 46
iPod | 92 | 85 | 66
Creative Labs Nomad Jukebox Zen Xtra 40GB | 186 | 157 | 122
Norton | 107 | 94 | 73
Linksys Router | 85 | 79 | 52
Apex AD2600 Progressive-scan DVD player | 115 | 90 | 79
Canon PowerShot SD500 | 70 | 63 | 52
Nokia 6610 | 111 | 92 | 74
Table 3.3 Relative scores

Feature combination | Recall | Precision | F-score
MFI | 0.112 | 0.603 | 0.189
CR, SW, MFI | 0.202 | 0.878 | 0.328
CR, SW, MFI, WR | 0.666 | 0.802 | 0.727
3.6 Discussion
In Table 3.2, the products Diaper Champ and iPod belong to the most divergent domains. For Diaper Champ our model identified 45 out of 68 annotated attributes, whereas for iPod it identified 66 out of 92. Similarly, for the Apex AD2600 Progressive-scan DVD player it identified 79 out of 115 attributes. Recall is thus approximately equal across the products, which is evidence that the model does not depend on the domain of a product. Table 3.3 shows that adding the Web kept precision approximately constant but improved recall, owing to the added knowledge. As already mentioned, the Wikipedia-based features SW and CR contribute more quality attributes than the loose attributes identified by WR. This suggests that Wikipedia, whose database of structured knowledge grows every second, could in coming years surpass the need for the Web in attaining this level of performance on tasks like these.
3.7 IE based Framework for Summarizing Social interactions
Coupling information extraction techniques with summarization is a relatively unexplored area, and relevant literature is hard to find. We combine information extraction and extractive summarization to support content-directed summaries. As mentioned in Chapter 2, we selected the sales/service chats scenario for summarizing social interactions. Our methodology for summarizing sales/service chats evolved after careful observation of their structural elements. We treat the problem as an extractive summarization task and follow a sentence-ranking method in which the importance of each sentence is measured by a combination of feature scores. We selected this method so that we can address several properties of sales/service chats through our varied ranking features.
Sales/service chat summarization can be compared to the Guided Summarization task recently introduced at TAC7, where summaries of source documents are guided by given templates. The templates are typically a set of questions that the summary should answer.
The test dataset released by NIST8 for the TAC guided summarization task is composed of five categories:
• Accidents and Natural Disasters
• Health and Safety
• Attacks
• Endangered Resources
• Investigations and Trials
Each of the above-mentioned categories has a template of aspects that the summary must answer. For example, the accident category has the following template:
• WHAT: what happened
• WHY: reasons for accident
• DAMAGES: damages caused by the accident
• COUNTERMEASURES: countermeasures, rescue efforts, prevention efforts
The guided summarization task emphasizes a unified information model that can be emulated by automatic summarizers. It also highlights the task of finding relevant content at the sub-sentential level, enabling the use of information extraction techniques and other semantic methods, and it promotes a deeper linguistic and semantic analysis of source documents.
Through a research collaboration, we were permitted to work on a sales/service chat dataset from a reputed firm. For sales/service chats the template is taken as
1. What is the issue?
2. Steps for the resolution of the issue, if mentioned.
7http://www.nist.gov/tac/
8http://www.nist.gov/index.html
The difficulty of summarization here, however, lies in the unstructured text medium, whereas at TAC systems are required to work on newswire data. Hence one cannot proceed by finding entities, dates and values in the chat through extraction rules and language tools that answer the questions given in the templates.
In this thesis, we put forth a model that produces summaries of sales/service chats answering the above template by deriving semantics. After experimenting with and formulating methods for IE in social interactions, we moved on to extractive summarization (explained earlier in Section 2.4), making use of the extraction method developed earlier.
3.7.1 Modeling process
We trained our system with the defined features using SVM regression [7], as it eliminates the need to check feature independence and is robust. The word-level machine learning features from our extraction framework are adopted and modified into sentence-level features, along with some new features. The feature values of every sentence are extracted and its importance I(s) is estimated. Each sentence s in the training data is converted into a tuple of the form (F(s), I(s)), where F(s) = {f1, f2, f3, ...} is the vector of feature values of sentence s. A model is built by training on these tuples, and the importance of a sentence in the test dataset is predicted using this trained model.
3.7.1.1 Summary Generation
Once the sentence importance scores are obtained, the sentences are ranked in order of importance. The top-ranked sentences in the test data are the candidates for building the summary. For readability, even when a sentence is ranked higher than another, the summary follows the order of occurrence in the source text.
Sentence importance is the target value in training and testing. ROUGE, which is explained in the previous sections, is used to represent the target values of sentence importance: during training, the importance of a sentence is its ROUGE-SU4 score when compared to the reference summary, and in testing it is the value to be predicted. This importance value is used to rank sentences for inclusion in our 100-word candidate summaries. The basic framework of our summarization system is represented in Figure 3.4, where the sentence importance value is calculated using a set of features. The following are the algorithmic steps of our system with the corresponding inputs and outputs.
Training:
Input: training set (T) of social interactions.
Algorithm:
    for sentence s in T:
        calculate features f1(s), f2(s), f3(s), f4(s), ...;
        calculate F(s) = ROUGE(s, model summary);
        train SVM with {F(s), <f1, f2, f3, f4, ...>};
Output: trained SVM model M.

Summary production:
Input: a social interaction I.
Algorithm:
    for sentence s in I:
        calculate features f1(s), f2(s), f3(s), f4(s), ...;
        form feature vector F = <f1, f2, f3, f4, ...>;
    predict ranks using SVM model M: F(s) = {f1, f2, f3, f4, ...};
    G = sort(I, F(s));
        where G is the ranked list of sentences sorted in descending order
        of the obtained target rank values F(s).
    for sentence in RankedList G:
        while (summary.length <= 100):
            summary.add(sentence);
    adjust summary to source sequence;
Output: summary of I.
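The summary-assembly step can be sketched in runnable Python. This is one reading of the pseudocode's word-budget loop, assuming the sentence importance scores have already been predicted by the trained model: sentences are taken greedily by score under a 100-word budget and then restored to source order, as described in Section 3.7.1.1.

```python
def build_summary(sentences, scores, limit=100):
    """Greedily pick the highest-scoring sentences until the word
    budget is reached, then restore source order for readability."""
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen, words = [], 0
    for i in ranked:
        n = len(sentences[i].split())
        if words + n > limit:
            continue  # skip sentences that would overshoot the budget
        chosen.append(i)
        words += n
    # "adjust summary to source sequence": sort selected indices.
    return [sentences[i] for i in sorted(chosen)]
```

Whether an over-budget sentence is skipped or the loop stops outright is not specified in the pseudocode; the sketch skips it and keeps looking for shorter sentences.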
The following sections elaborate on the features fi we used.
3.7.2 Semantic relation using Wikipedia - SR
The Semantic Relation using Wikipedia (SR) feature finds the semantic relatedness between words mentioned in a text by linking them to their possible articles in Wikipedia. SR is derived from the CR and SW features of our IE framework to suit the summarization scenario. From Section 3.2.2 we are already familiar with deriving the relation between any two words through the inlink and outlink structure of the Wikipedia articles these words represent [19]. This way of deriving semantics from a text can be effective in an unstructured text environment [13].
Figure 3.4 Summarization Framework (user generated content → preprocessing → sentences → features → ranker → summary)
This feature is selected to derive possible relationships between text segments in the chat and thereby weight the sentences.
The best-known way to compute the relation [18, 19] between any two Wikipedia articles x and y is repeated here for convenience:

Relation(x, y) = 1 − (max(log|A|, log|B|) − log|A ∩ B|) / (log T − min(log|A|, log|B|))

where A and B are the sets of articles that link to the articles x and y respectively, T is the total number of Wikipedia articles, and A ∩ B is their overlap.
We take into account the surrounding text of t Wikipedia words to the left and right of a given Wikipedia word in a chat to examine its role in linking information between sentences. The relation of a Wikipedia word xi with its surrounding text is computed as

SR(x_i) = Σ_{j=−t, j≠i}^{t} Relation(x_i, x_j)
The window length t is arbitrarily taken as the average sentence length in the data; "−t" means t Wikipedia words to the left of xi and "+t" means t words to the right.
There can also be words in sales/service chats that cannot be linked to any article in Wikipedia. Hence the SR feature for non-Wikipedia words is uniformly given as the average of the SR values of all Wikipedia words in the given sentence:

SR(y_i) = ( Σ_{j=1}^{k} SR(x_j) ) / k

where k is the total number of Wikipedia words in the given sentence. The SR value for a sentence si is given by

SR(s_i) = ( Σ_{x_j ∈ s_i} SR(x_j) + Σ_{y_j ∈ s_i} SR(y_j) ) / |s_i|

where xj and yj represent a Wikipedia word and a non-Wikipedia word respectively, and |si| is the total number of words in sentence si.
3.7.3 Prepositional Importance - PF
A preposition in English grammar generally expresses the temporal, spatial or logical relationship of its object to the rest of the sentence [31]. It is interesting to observe how prepositions implicitly capture the key elements of a sentence. Observe, for example, the role of the prepositions {for, to, with} in the sentence below:
Representative: xxxx, for registration issues please send your request to [email protected] with all the needed
info. They are happy to support you finishing your registration
After careful observation of the data, we propose using the frequency of a small set of prepositions {in, on, of, at, for, from, to, by, with} as a sentence scoring feature. The frequency of prepositions indirectly achieves the effect of performing Named Entity Recognition (NER) on a sentence, but without the additional cost of processing or using POS tags. The PF score of a sentence s is given by

PF(s_i) = ( Σ_{w_i ∈ s} IsPrep(w_i) ) / |s|

where IsPrep(wi) returns 1 if wi is a preposition and 0 otherwise.
3.7.4 Term Frequency - TF
Term frequency weights a word according to the frequency of its occurrence. We calculate the frequencies of all words except stop words in a given chat conversation, and the TF score of a word is

TF(w_i) = Freq(w_i) / Σ_{w_j ∈ chat} Freq(w_j)

where Freq(wi) gives the total number of occurrences of wi in that particular chat conversation. The final term frequency (TF) score for a sentence is given by

TF(s_i) = ( Σ_{w_j ∈ s_i} TF(w_j) ) / |s_i|
3.7.5 Wiki Frequency - WF
We consider the number of Wikipedia words in a sentence as an indicator of significant information. The Wiki Frequency feature for a sentence si is given by

WF(s_i) = ( Σ_{w_j ∈ s_i} IsWikiWord(w_j) ) / |s_i|

where IsWikiWord(wj) returns 1 if wj is a Wikipedia word and 0 otherwise.
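The three surface features TF, PF and WF can be sketched together in a few lines of Python. The per-sentence normalisation by |s| follows the formulas above, while the whitespace tokenisation, lower-casing, and the way Wikipedia words and stop words are passed in as plain sets are simplifying assumptions of this sketch.

```python
from collections import Counter

PREPOSITIONS = {"in", "on", "of", "at", "for", "from", "to", "by", "with"}

def sentence_scores(sentences, stop_words=frozenset(), wiki_words=frozenset()):
    """Per-sentence TF, PF and WF scores over one chat conversation."""
    tokens = [s.lower().split() for s in sentences]
    # Chat-level word frequencies, excluding stop words (Section 3.7.4).
    freq = Counter(w for toks in tokens for w in toks if w not in stop_words)
    total = sum(freq.values())
    out = []
    for toks in tokens:
        n = len(toks)
        tf = sum(freq[w] / total for w in toks) / n
        pf = sum(w in PREPOSITIONS for w in toks) / n
        wf = sum(w in wiki_words for w in toks) / n
        out.append({"TF": tf, "PF": pf, "WF": wf})
    return out
```

For instance, in the five-word sentence "send the specification to me", the single preposition "to" gives PF = 1/5, and if "specification" is the only Wikipedia word, WF = 1/5 as well.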
3.8 Data for Summarization of Social interactions
Most of the conversations in this dataset are agent-customer sessions, varying in length and in the issues addressed. Statistics on chat length are presented in Table 3.4.
The average chat length is 309 words, already a difficult length for a human to go through at rush hour, and many chats exceed 600, 800 or even 1000 words. These numbers underscore the need to summarize these sales/service chats.
The following is an example snippet of a chat conversation in the provided dataset.
[06:25:36] Agent: Would you like me to send the specification of zz?
[06:26:45] Customer: Sorry I got called away from my desk
[06:26:57] Agent: It is okay.
[06:27:11] Customer: Yes please...
[06:27:36] Customer: In some areas it says included...hmm ..in some it mentioned "coming soon" so I wasn’t
Total no. of chats | Avg chat length | Above 600 words | Above 800 words | Above 1000 words
783 chats | 309 words | 89 chats | 40 chats | 15 chats
Table 3.4 Data Statistics
too sure!!
[06:27:49] Agent: I will send you the specification link of zz?
[06:27:58] Customer: yes please..
[06:28:01] Agent: cc.cc.cc <URL>
[06:28:30] Customer: perfect!! thank you.... thats all I was looking for now
Some details, such as the customer's name, the agent's name, other entity names, dates, timings and values, are masked to safeguard the policy of the dataset donor.
100-word human model summaries were created for 60 chat conversations, which were split into two equal disjoint sets for the training and testing phases of our evaluation. Hence the training set consists of 30 conversations and the testing set of the other 30.
3.9 Experiments related to summarization of consumer sales/ser-
vice Interactions
3.9.1 Baseline
The summarization system used at TAC 2009 by IIIT-H [30], which proved to perform well in single-document and multi-document summarization, is considered the baseline for evaluation. We chose this particular system as the baseline because it is based on sentence and word position features, which can be applied to the sales/service chats scenario.
The TAC 2009 system is inspired by the anatomy of news articles and was later applied to other kinds of text because of its generalized features.
In detail, it uses two variations of a sentence location feature. The first is inspired by the fact that the first three sentences of a document generally contain its most informative content, which holds for news articles and descriptive text such as articles in Wikipedia9. Hence, with this feature, the top three sentences are scored inversely proportionally to their position in the source text, and the remaining sentences directly proportionally.
9http://www.wikipedia.org/
score(s) = 1 − n/1000   if n <= 3
         = n/1000       otherwise
(3.1)
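This location scoring can be sketched directly in Python; the cut-off of three sentences and the 1000 divisor follow Equation 3.1, while the parameter name is ours.

```python
def position_score(n, scale=1000):
    """Sentence-location baseline of Equation 3.1: the first three
    sentences are scored higher the earlier they appear (just below 1),
    and later sentences get a small score growing with position."""
    return 1 - n / scale if n <= 3 else n / scale
```

So sentence 1 scores 0.999, sentence 3 scores 0.997, and sentence 4 drops to 0.004.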
The other variation of the sentence location feature is learnt by the machine on the TAC 2008 summarization dataset; it scores a sentence according to this learnt optimum position.
3.9.2 Comparison in performances of PF, TF, SR & WF
We used the traditional ROUGE-2 (bigrams) and ROUGE-SU4 (skip bigrams; refer to Section 3.2.2) scores for our system evaluation. Table 3.5 gives performance in terms of average ROUGE scores for the features Term Frequency (TF), Prepositional Importance (PF), Semantic Relation (SR) and Wiki Frequency (WF). It shows that our system outperformed the baseline in average ROUGE-2 and ROUGE-SU4 scores when the PF, TF and SR features are used in combination.

System | Avg ROUGE-2 | Avg ROUGE-SU4
Baseline | 0.07441 | 0.11787
TF | 0.05963 | 0.09102
TF+PF | 0.07103 | 0.10932
TF+PF+WF | 0.08216 | 0.12531
TF+PF+SR | 0.08909 | 0.12826
TF+PF+WF+SR | 0.08287 | 0.12013
Table 3.5 Scores of the baseline and of different feature combinations in our model
Our system performed well when tested against human model summaries. From Table 3.5 we clearly observe a significant performance gap between the baseline and the best combination <TF, PF, SR>. Other combinations, such as <TF, PF, WF> and <TF, PF, SR, WF>, came close to the top score. Simple TF and combinations omitting TF performed poorly, which establishes that TF is an important feature: it may not give great performance independently, but it renders the best performance alongside vital features like PF and SR.
As words related to the issue in a conversation are more frequent, TF helps in extracting the issue into summaries, while the PF, WF and SR features capture the resolution of an issue. We are also able to draw customer satisfaction into our summaries, as the TF feature captures the word "thank", used frequently by the customer and agent in a positive conversation. We infer that our semantics-based approach can effectively generate summaries for social interactions such as sales/service chats.
We showed that our system answers the two important queries, 'issue' and 'resolution', but did not use any optimizations or dependencies related to these queries, so that the system remains applicable to all other social interactions.
A more functional view of the different features in our approach is given with examples in the next section.
3.9.3 Summarization Outcome in terms of different features
The following is an example of a sales chat followed by its summaries to illustrate the
performances of different features we have used. Some of the details like the customer’s name,
agent’s name, other entity names, dates, timings, values are masked to safeguard the policy of
dataset donor.
[06:18:46] Agent: Thank you for contacting zz Pre-Sales Chat. My name is xx. How may I help you today?
[06:19:40] Customer: Hi xx,... my name is yy and am looking to purchase zz... I was wondering if this
product comes with only wi-fi or 3G and wi-fi
[06:20:15] Agent: I will be glad to assist you
[06:20:39] Agent: Please give me a moment while I check this information for you
[06:20:45] Customer: sure... thank you
[06:25:05] Agent: Thank you for your time. I have checked that 3G and wi-fi are available in zz.
[06:25:36] Agent: Would you like me to send the specification of zz?
[06:26:45] Customer: Sorry I got called away from my desk
[06:26:57] Agent: It is okay.
[06:27:11] Customer: Yes please...
[06:27:36] Customer: In some areas it says included...hmm ..in some it mentioned "coming soon" so
I wasn’t too sure!!
[06:27:49] Agent: I will send you the specification link of zz?
[06:27:58] Customer: yes please..
[06:28:01] Agent: cc.cc.cc <URL>
[06:28:30] Customer: perfect!! thank you.... thats all I was looking for now
[06:28:51] Agent: You are welcome.
[06:29:00] Agent: We have an option to transfer your chat to our Sales Support Team to help you
in customizing zz and help to place an order. May I transfer the chat
[06:29:37] Customer: I am actually in Cxx right now and dont have access to my Cyy credit card for
that information..so..
[06:29:54] Customer: Sorry I will have to do that at a later moment
[06:31:00] Agent: I understand. Do you have any other queries regarding zz products
[06:31:14] Customer: thats it for now
[06:31:47] Agent: You are welcome.
To ensure that we are always improving our service, you may receive a survey invitation at the end
of the chat session to tell us what you think about our products and services. Your feedback will be
highly appreciated
[06:32:20] Agent: Thank you for contacting zz Pre-Sales Chat. You have a great day.
The summaries for the above sales chat using different features are given in the following
subsection demonstrating the possible reasons.
Summaries:
1. The following summary is generated using the features: Term Frequency (TF), Semantic Relation (SR) and Prepositional Importance (PF)
[06:19:40] Customer: Hi xx,... my name is yy and am looking to purchase zz... I was wondering
if this product comes with only wi-fi or 3G and wi-fi [06:25:05] Agent: Thank you for your
time. I have checked that 3G and wi-fi are available in zz. [06:29:00] Agent: We have an option
to transfer your chat to our Sales Support Team to help you in customizing zz and help to place
an order. May I transfer the chat [06:29:37] Customer: I am actually in Cxx right now and
dont have access to my Cyy credit card for that information..so..
2. The following summary is generated using the features: Term Frequency (TF) and Prepositional Importance (PF)
[06:19:40] Customer: Hi xx,... my name is yy and am looking to purchase zz... I was wondering
if this product comes with only wi-fi or 3G and wi-fi [06:25:05] Agent: Thank you for your
time. I have checked that 3G and wi-fi are available in zz. [06:28:30] Customer: perfect!! thank
you.... thats all I was looking for now [06:29:00] Agent: We have an option to transfer your
chat to our Sales Support Team to help you in customizing zz and help to place an order. May
I transfer the chat [06:31:47] Agent: You are welcome. To ensure that we are always improving
our service, you may receive a survey invitation at the end of the chat session to tell us what
you think about our products and services.
3. The following summary is generated using the TF feature
[06:18:46] Agent: Thank you for contacting zz Pre-Sales Chat. My name is xx. How may I help
you today? [06:19:40] Customer: Hi xx,... my name is yy and am looking to purchase zz... I
was wondering if this product comes with only wi-fi or 3G and wi-fi [06:25:05] Agent: Thank
you for your time. I have checked that 3G and wi-fi are available in zz. [06:32:20] Agent: Thank
you for contacting zz Pre-Sales Chat. You have a great day.
4. The following summary is generated using the SR feature
[06:19:40] Customer: Hi xx,... my name is yy and am looking to purchase zz... I was wondering
if this product comes with only wi-fi or 3G and wi-fi [06:25:36] Agent: Would you like me to
send the specification of zz? [06:29:37] Customer: I am actually in Cxx right now and dont have
access to my Cyy credit card for that information..so.. [06:31:47] Agent: You are welcome. To
ensure that we are always improving our service, you may receive a survey invitation at the
end of the chat session to tell us what you think about our products and services.
5. The following summary is generated using the PF feature
[06:29:00] Agent: We have an option to transfer your chat to our Sales Support Team to
help you in customizing zz and help to place an order. May I transfer the chat [06:29:37]
Customer: I am actually in Cxx right now and dont have access to my Cyy credit card for
that information..so.. [06:29:54] Customer: Sorry I will have to do that at a later moment
[06:31:47] Agent: You are welcome. To ensure that we are always improving our service, you
may receive a survey invitation at the end of the chat session to tell us what you think about
our products and services. Your feedback will be highly appreciated
Finally, from the above example, one can observe that the Prepositional Importance (PF) feature favors sentences that contain prepositions in high densities. The Semantic Relation (SR) feature uses knowledge from Wikipedia to recognize information and selects the sentences that are most semantically related. We can also clearly see that issue-related words like 3G, Wi-Fi and Pre-Sales, along with thanking phrases containing the word thank, occur with relatively high frequencies; hence these sentences are picked up by the Term Frequency (TF) feature.
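The behaviour of the TF and PF features above can be sketched as a simple sentence scorer. This is a minimal illustration, not the exact implementation used in our system: the preposition list, the combination weights and the scoring formulas are assumptions made for the example.

```python
from collections import Counter

# Small illustrative preposition list; a real PF feature may use a fuller set.
PREPOSITIONS = {"in", "on", "at", "to", "for", "with", "of", "by", "from", "about"}

def tf_score(sentence, doc_counts, total):
    """Average relative frequency of the sentence's words in the whole chat."""
    words = sentence.lower().split()
    if not words or total == 0:
        return 0.0
    return sum(doc_counts[w] / total for w in words) / len(words)

def pf_score(sentence):
    """Density of prepositions in the sentence (Prepositional Importance)."""
    words = sentence.lower().split()
    if not words:
        return 0.0
    return sum(1 for w in words if w in PREPOSITIONS) / len(words)

def rank_sentences(sentences, w_tf=0.5, w_pf=0.5):
    """Rank chat utterances by a weighted combination of TF and PF scores."""
    counts = Counter(w for s in sentences for w in s.lower().split())
    total = sum(counts.values())
    scored = [(w_tf * tf_score(s, counts, total) + w_pf * pf_score(s), s)
              for s in sentences]
    return sorted(scored, reverse=True)
```

With the weights tuned on held-out chats, the top-ranked utterances would form the extractive summary; dropping one feature's weight to zero reproduces the single-feature summaries shown above.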
Chapter 4
Conclusions
In this thesis, we examined the role of social media in carrying valuable information and the difficulties that user generated content poses for text analysis. We devised methods to effectively mine information from the user generated content of social interactions, and examined the need for good information extraction and summarization systems for such text. We closely studied the scenarios of customer reviews and customer sales/service chats as representative examples of users' social interactions.
We presented a domain independent approach for the automatic discovery of product attributes from user reviews. Extracting product attributes from customer reviews is akin to identifying subtopics (attributes) for a given topic (product) in a discussion. We worked on this problem to investigate new extraction systems for social interactions with user generated text. Our work highlighted the possibility of providing an incremental learning capability for the extraction system. The performance scores of our system show that applying Wikipedia to carve out product attributes from customer reviews is a sound design. The Wikipedia based feature was later extended to draw semantics from social interactions.
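One way such a Wikipedia based semantic feature can be computed is through link-structure relatedness in the style of Milne's measure, where two concepts are compared by the overlap of the Wikipedia articles that link to them. The sketch below is illustrative only; the inlink sets and total article count are hypothetical inputs, not our system's actual configuration.

```python
import math

def wikipedia_relatedness(inlinks_a, inlinks_b, total_articles):
    """Link-based relatedness between two Wikipedia concepts, each
    represented by the set of article ids that link to it."""
    common = inlinks_a & inlinks_b
    if not common:
        return 0.0
    big = max(len(inlinks_a), len(inlinks_b))
    small = min(len(inlinks_a), len(inlinks_b))
    # Normalized overlap of inlink sets, scaled by corpus size.
    score = 1.0 - (math.log(big) - math.log(len(common))) / (
        math.log(total_articles) - math.log(small))
    return max(0.0, score)
```

Concepts sharing many incoming links (e.g. two product-related articles) score close to 1, while unrelated concepts with disjoint inlink sets score 0, which is what lets the feature separate attribute-bearing sentences from chatter.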
Our contribution lies in leveraging large knowledge sources like Wikipedia and the Web for tasks across domains, while dispensing with language-specific tools altogether. We viewed the problem as a classification task, which achieved good performance on the given datasets. Attribute extraction from customer reviews helps tasks like summarization of reviews, product recommendation, and enriching product attribute knowledge bases.
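The classification view can be illustrated with a minimal sketch in which each candidate word is reduced to a feature vector (here, a hypothetical term-frequency score and a Wikipedia-relatedness score) and a linear classifier labels it as attribute or non-attribute. Our system used an SVM; the perceptron below stands in only to keep the example self-contained and is not the trained model from our experiments.

```python
def perceptron_train(examples, epochs=10, lr=1.0):
    """Train a linear classifier on (feature_vector, label) pairs, label in {0, 1}."""
    dim = len(examples[0][0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, y in examples:
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
            if pred != y:
                sign = 1 if y == 1 else -1  # push the boundary toward the error
                w = [wi + lr * sign * xi for wi, xi in zip(w, x)]
                b += lr * sign
    return w, b

def is_attribute(w, b, x):
    """Decide whether a candidate's feature vector x denotes a product attribute."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b > 0
```

Candidates with both high frequency and high Wikipedia relatedness land on the positive side of the learned boundary, mirroring how the classifier separates attributes from ordinary vocabulary.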
In this work we trained and tested our system on products that belong to different domains and, interestingly, found that it performs uniformly across all domains.
As we did not make use of any natural language processing tools, this work can be extended to other languages with only minor changes in the preprocessing stage.
Later, we examined the role of information extraction in the summarization of user generated content in social interactions. We moved beyond user reviews into chats to provide a unified solution for extraction and summarization in social interactions.
In this thesis, we addressed the problem of summarizing corporate sales/service chats and proposed a model that builds summaries not only by considering the structure of the chats but also by extracting their semantics. We implicitly reused the extraction system we built for social interactions. From the results we conclude that our proposed model can be safely applied to the chat domain.
Possible extensions of this thesis include other aspects of social interactions, such as effectively carrying sentiment into summaries. Extending our extraction system to further information mining procedures, and incorporating other dynamic properties of interactions into summarization, can be taken up as future directions.
Appendix A
Social Interactions :- The set of interactions by users, which can comprise blogs, customer reviews, sales/service chats, Internet relay chats, social networking blogs and posts, etc.
Annotated attributes :- Attributes already annotated (words marked by humans as attributes). Annotation is a common term in the IR and NLP literature.
Candidates selected :- Candidate words that are selected as attributes by our system.
Attributes identified :- Words that are correctly identified as attributes by our system.
Sales/service chats :- Customer-agent chat conversations that address the queries of customers.
Customer/user reviews :- Product reviews and comments by users/customers on e-commerce websites.
UGC :- User generated content, which generally occurs in user generated forms of social media and social interactions.
Related Publications
• Sudheer Kovelamudi, Sethu Ramalingam, Arpit Sood and Vasudeva Varma, “Domain In-
dependent Model for Product Attribute Extraction from User Reviews using Wikipedia”,
In International Joint Conference on Natural Language Processing (IJCNLP), pages 1408-
1412. AFNLP, 2011.
• Vasudeva Varma, Sudheer Kovelamudi, Jayant Gupta, Nikhil Priyatam, Arpit Sood,
Harshit Jain, Aditya Mogadala, Srikanth Reddy Vaddepally, “IIIT Hyderabad in Summa-
rization and Knowledge Base Population at TAC 2011”, In proceedings of Text Analysis
Conference (TAC), National Institute of Standards and Technology Gaithersburg, Mary-
land USA, November, 2011.
• Praveen Bysani, Kranthi Reddy, Vijay Bharath Reddy, Sudheer Kovelamudi, Prasad Pin-
gali, Vasudeva Varma, “IIIT Hyderabad in Guided Summarization and Knowledge Base
Population”, In the Working Notes of Text Analysis Conference (TAC), National Institute
of Standards and Technology Gaithersburg, Maryland USA, November, 2010.
• Vasudeva Varma, Vijay Bharath Reddy, Sudheer K, Praveen Bysani, GSK Santosh, Kiran Kumar, Kranthi Reddy, Karuna Kumar, Nithin M, “IIIT Hyderabad at TAC 2009”, In the Working Notes of Text Analysis Conference (TAC), National Institute of Standards and Technology Gaithersburg, Maryland USA, November, 2009.