VOC Real World Enterprise Needs

Communicating Knowledge
Sentiment Analysis Symposium

Lessons Learned from a VOC Analysis System for a Big Korean Telecommunication Company

Ivan Berlocher

SALTLUX

Nov. 9th 2011


Introduction

• Saltlux Inc. is located in Seoul, Korea. It was established in 1979 and renovated in 2003.

• Domain of expertise: Information Retrieval and Text/Data/Web/Graph Mining solutions and services based on Semantic Web technology.

• Main supported languages: Korean, Japanese, English. Other languages rely on external solutions.

• 70 employees in Seoul, one development center in Vietnam (12 employees), and one sales office in Japan (3 employees).

• Several partnerships with other companies/institutes:
– Ontoprise in Germany
– Franz in California
– DERI in Ireland

• Many R&D partnerships (ETRI, KAIST, universities…)


Table of Contents

• Project & Environment Description
– Needs of Customer
– System (Main) Requirements

• VOC Data
– Sample Data
– Data Analysis

• System Overview
• Korean Linguistic
• Sentiment Analysis
• Lessons Learned
• Future Work


Project & Environment Description

• Needs of Customer
– Customer: a Korean telecommunication corporation
– Department: Voice of Customer (VOC) Analysis
– Mission: analyze (human-typed) memos from all call centers to identify major problems and produce reports for decision makers, in order to improve quality of service and increase customer satisfaction.
– Data: human-typed notes covering any kind of customer question:
  • Information about subscriptions
  • Inquiries or complaints about devices (phones), services, or dealerships
  • Complaints about quality of communication
  • etc.

Number of notes: ~200 thousand a day (~5 million a month). Notes must remain searchable for 1 year (~60 million).


Project & Environment Description

• System (Main) Requirements
– Distinguish simple inquiries from complaints
– Classify into categories/departments of services
– Monitor topic trends in real time, daily, weekly, and monthly
– Compare trends between time slices
– Find related topics
– Manage personal vocabulary
– Anonymize personal data (people's names, telephone numbers, social IDs, addresses, etc.)

The project started in October 2010 with a 3-month POC (~10 MM). After acceptance (success): integration with the real system for another 3 months (~10 MM). The 2 phases together: ~$200,000.


VOC Data Sample

• Data often contain some structured information (metadata), but without any standard.

• Most of the time there is no particular mark or metadata.

This makes Named Entity Recognition more complex.

The same information is entered in many different ways (e.g. 연락처: phone number).
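The anonymization requirement and the free-form phone fields described here can be illustrated with a small pattern-based masker. This is only a sketch: the regular expression, the `<PHONE>` placeholder and the function name `mask_phone_numbers` are assumptions made for illustration, not the rules used in the actual system, and real memos would need many more patterns (names, social IDs, addresses).

```python
import re

# Hypothetical pattern: a leading 0, a 1-2 digit area/carrier code, then
# 3-4 digits and 4 digits, with an optional '-', '.' or space between groups.
PHONE_RE = re.compile(r"(?<!\d)0\d{1,2}[-.\s]?\d{3,4}[-.\s]?\d{4}(?!\d)")


def mask_phone_numbers(memo: str, placeholder: str = "<PHONE>") -> str:
    """Replace every substring that looks like a phone number."""
    return PHONE_RE.sub(placeholder, memo)


if __name__ == "__main__":
    for memo in ["연락처: 010-1234-5678", "연락처 01012345678", "H.P 010 1234 5678"]:
        print(mask_phone_numbers(memo))
```

A purely pattern-based approach only covers entities with a regular surface form; people's names and addresses would still need the named entity recognition step.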


VOC Data Analysis

• Data contain lots of named entities (products, services, people, social IDs, phone numbers), often privacy-related.
• Data contain lots of technical (domain) terms.
• The real content to analyze is mostly very short (tweet-like), but sometimes very long.
• Lots of misspellings/mistypings.
• The Korean (Asian) segmentation problem, amplified by the speed constraint.
• Lots of (non-standard) abbreviations.


System Overview

• Overall Architecture

[Architecture diagram. Components shown: Text Segmentation, Morphological Analyzer, Chunk/Phrase Identification, Named Entities Recognition, Synonyms & Normalization, Indexing, Distributed Indexes, Classifier (Hybrid SVM & Rules), Analysis Phase, Searching/Clustering (TopicRank), Timelines Dumper, DFS, Timelines (20110713_0700_1.df, 20110713_0700_2.df, 20110713_0700_3.df, 20110713_0710_1.df, 20110713_0710_2.df, 20110713_0710_3.df), Scheduler, Merger & Ranker, Trend (TopN), DB, Web Server (Web UI), Complaint Detector.]

In the real system, indexing was parallelized over 18 Linux machines for speed.
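The Timelines / Merger & Ranker / Trend (TopN) part of the architecture can be sketched as follows. Everything here is an assumption made for illustration: the slides do not describe the content of the `.df` files, so the sketch treats each dump as an in-memory keyword count table, and it reads a name such as `20110713_0700_1.df` as date, time slot and shard number. The function names `parse_dump_name` and `merge_and_rank` are hypothetical.

```python
from collections import Counter, defaultdict
from datetime import datetime


def parse_dump_name(name: str):
    """Split '20110713_0700_1.df' into (time slot, shard number) (assumed layout)."""
    stem = name.rsplit(".", 1)[0]
    day, hhmm, shard = stem.split("_")
    return datetime.strptime(day + hhmm, "%Y%m%d%H%M"), int(shard)


def merge_and_rank(dumps: dict, top_n: int = 3) -> dict:
    """Sum the per-shard keyword counts of each time slot and keep the top N."""
    per_slot = defaultdict(Counter)
    for name, counts in dumps.items():
        slot, _shard = parse_dump_name(name)
        per_slot[slot].update(counts)  # Counter.update adds counts
    return {slot: c.most_common(top_n) for slot, c in sorted(per_slot.items())}


if __name__ == "__main__":
    dumps = {
        "20110713_0700_1.df": {"요금": 3, "해지": 1},
        "20110713_0700_2.df": {"요금": 2, "통화품질": 4},
        "20110713_0710_1.df": {"해지": 5},
    }
    for slot, top in merge_and_rank(dumps, top_n=2).items():
        print(slot, top)
```

Merging per time slot keeps the shards independent, which fits an indexing step that is parallelized across machines.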


System Overview

• Home page
• Top N Keywords Extraction
• Related Keywords (Word Clustering)
• Trend (Timeline) view
• Tweets view


Korean Linguistic

• Brief introduction

Korean is alphabet-based: consonants and vowels are composed into blocks of consonant/vowel or consonant/vowel/consonant.

“나는 학생입니다.”
=> 나 = ㄴ (N) + ㅏ (A) = NA
=> 학 = ㅎ (H) + ㅏ (A) + ㄱ (K) = HAK

One unit of consonant/vowel or consonant/vowel/consonant is a syllable (“eumjeol”), and words are composed of several syllables.

Basic grammar: a word is composed of one root (noun, adjective/verb) followed by an inflection marking the grammatical role (subject/object/location, etc.) for nouns (called “josa”) or the aspect/mood (tense, honorific form, etc.) for verbs/adjectives (called “eomi”).
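The consonant/vowel composition shown for 나 and 학 can be reproduced from the Unicode layout of precomposed Hangul syllables (U+AC00 to U+D7A3), where each code point encodes 19 initials × 21 vowels × 28 finals. The sketch below is an illustration of that arithmetic; the function name `decompose` is hypothetical and not part of the system described in the slides.

```python
# Jamo tables in Unicode order (compatibility jamo are used for display).
CHOSEONG = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"          # 19 initial consonants
JUNGSEONG = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"      # 21 vowels
JONGSEONG = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")  # none + 27 finals


def decompose(syllable: str):
    """Return (initial, vowel, final) for one precomposed Hangul syllable."""
    code = ord(syllable) - 0xAC00
    if not 0 <= code < 19 * 21 * 28:
        raise ValueError(f"not a precomposed Hangul syllable: {syllable!r}")
    initial, rest = divmod(code, 21 * 28)
    vowel, final = divmod(rest, 28)
    return CHOSEONG[initial], JUNGSEONG[vowel], JONGSEONG[final]


if __name__ == "__main__":
    for ch in "나학":
        print(ch, decompose(ch))
```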


Korean Linguistic

• Examples:

“나는 학생입니다.”
=> “나는” = “나” (Na: I/me) + “는” (Neun: theme/topic marker)
=> “학생입니다” = “학생” (Hak-seng: student) + “입니다” (Im-ni-da: am)
=> I’m (a) student.

Lots of (composite) inflectional forms:
학생 + 입니다 = noun + be
학생 + 인 / 이예요 / 이다 / 입니까? / 인데 / 인데요 etc. (was, will be …) (eomi)
학생 + syntactic role (이: subject / 에게: to / 한테: from / 을: object) etc. (josa)

Korean is highly agglutinative (it concatenates prefixes/nouns/josa/eomi):
=> Search engine: 검색엔진
=> High-performance search engine: 고성능검색엔진

But the use of spaces is free/arbitrary: 검색엔진 and 검색 엔진 are equivalent. Especially with SNS, space-limited devices, and speed constraints (like real-time transcription of conversations), spaces are more and more omitted or misused. => Need for Automatic Segmentation Correction.


Project & Environment Description

• Automatic Segmentation Correction Illustration


Korean Linguistic

• Automatic Segmentation Correction Implementation

Binary classification approach: tag each syllable according to whether or not a space precedes it. Any kind of classifier can be used; here we use a CRF model (an SVM would also work) with the following set of features.

Example input: 프랑스의 세계적인 디자이너 …

• Features
– 1-gram, 2-gram, 3-gram, 4-gram of characters (syllables)
– Korean or not, contains a number

• Evaluation
– Accuracy (character level)
– Word precision = # correctly spaced words / # words produced by the system

CRF results:
Accuracy at character level: 96.25%
Precision at word level: 95.58%

• Very simple to train (easy to get huge amounts of data)
• No need for a lexicon or any lexical information
• Performs surprisingly well
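The data preparation and the word-precision metric described on this slide can be sketched in plain Python. This is a hedged illustration, not the original implementation: the exact n-gram windows used for the CRF are not given in the slides, so `char_features` assumes n-grams ending at the current syllable, and all function names are hypothetical. The CRF training itself is left out.

```python
def to_labels(spaced: str):
    """Return (syllables, labels); label 1 means 'a space precedes this syllable'."""
    chars, labels = [], []
    space_before = False
    for ch in spaced:
        if ch == " ":
            space_before = True
            continue
        # The first syllable is never labeled 1, even after leading spaces.
        labels.append(1 if (space_before and chars) else 0)
        chars.append(ch)
        space_before = False
    return chars, labels


def char_features(chars, i, max_n=4):
    """Character n-grams (n = 1..max_n) ending at position i, plus two flags."""
    feats = {}
    for n in range(1, max_n + 1):
        if i - n + 1 >= 0:
            feats[f"{n}gram"] = "".join(chars[i - n + 1 : i + 1])
    feats["is_korean"] = "\uac00" <= chars[i] <= "\ud7a3"
    feats["has_digit"] = chars[i].isdigit()
    return feats


def word_spans(spaced: str):
    """Word boundaries as (start, end) offsets in the text without spaces."""
    spans, pos = set(), 0
    for word in spaced.split():
        spans.add((pos, pos + len(word)))
        pos += len(word)
    return spans


def word_precision(predicted: str, gold: str) -> float:
    """# correctly spaced words / # words produced by the system."""
    pred, ref = word_spans(predicted), word_spans(gold)
    return len(pred & ref) / len(pred) if pred else 0.0


if __name__ == "__main__":
    chars, labels = to_labels("프랑스의 세계적인 디자이너")
    print(list(zip(chars, labels)))
    print(char_features(chars, 4))
    print(word_precision("프랑스의 세계 적인 디자이너", "프랑스의 세계적인 디자이너"))
```

Because the labels come directly from correctly spaced text, any large well-edited corpus can serve as training data, which is why no lexicon is needed.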


Korean Linguistic

• Transliteration
– Korean uses more and more English-derived words, transliterated phonetically into the Korean alphabet (the reverse of “Romanization”), especially for foreign names (companies, products, people, technical/domain terms).
– Transcription is non-unique and non-standard.

Examples:
tablet: 태블릿, 타블렛, 테블릿
Hitachi: 히타치, 히타찌, 히다찌
iPhone 4s: 아이폰 4s, 아이폰포에스, 아이폰 포에스


Korean Linguistic

• Automatic transliteration recognition
– Build a rule-based phonetic transliteration that acts similarly to Soundex, adapted to Korean pronunciation.

tablet => 태블릿
T => ㅌ / ㄸ / ㄷ
A => ㅏ / ㅓ / ㅔ / ㅐ
etc.

This method has high recall but low precision and needs post-processing filtering (remove known Korean words found in lexicons, remove nouns that are too short, etc.).

The result has to be corrected by humans, so an efficient workbench is needed for productivity.

Gathered a dictionary of 130 thousand entries, mainly IT-oriented.

More academic research is still needed to solve this problem.
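A Soundex-like key for Hangul spelling variants could look like the sketch below. It is a simplification made for illustration, not the rule set from the slides: like Soundex it ignores vowels, keeps only consonants and collapses adjacent duplicates. Only the T group (ㅌ/ㄸ/ㄷ) comes from the slide; the other sound classes and the name `phonetic_key` are assumptions.

```python
CHOSEONG = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"
JONGSEONG = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")

# Hypothetical sound classes; only the "T" group is taken from the slide.
CLASSES = {"T": "ㅌㄸㄷ", "J": "ㅈㅉㅊ", "K": "ㄱㄲㅋ", "P": "ㅂㅃㅍ", "S": "ㅅㅆ"}
JAMO_TO_CLASS = {jamo: name for name, group in CLASSES.items() for jamo in group}


def phonetic_key(word: str) -> str:
    """Consonant skeleton of a Hangul word, with similar consonants merged."""
    key = []
    for ch in word:
        code = ord(ch) - 0xAC00
        if not 0 <= code < 11172:
            continue  # skip anything that is not a precomposed Hangul syllable
        initial = CHOSEONG[code // (21 * 28)]
        final = JONGSEONG[code % 28]
        consonants = []
        if initial != "ㅇ":  # initial ㅇ is silent
            consonants.append(initial)
        if final:
            consonants.append(final)
        for jamo in consonants:
            symbol = JAMO_TO_CLASS.get(jamo, jamo)
            if not key or key[-1] != symbol:  # collapse adjacent duplicates
                key.append(symbol)
    return "".join(key)


if __name__ == "__main__":
    for variants in (["태블릿", "타블렛", "테블릿"], ["히타치", "히타찌", "히다찌"]):
        print([phonetic_key(v) for v in variants])
```

Grouping by such a key is deliberately coarse, which matches the high-recall, low-precision behavior noted on the slide and the need for a filtering step afterwards.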


Sentiment Analysis

• Complaint Detection

A problem similar to standard subjectivity detection (detecting whether or not a sentence is sentiment-bearing).

Simple approach: binary classification using an SVM and manually tagged training/test corpora (more than 20 thousand).

Feature space: character (syllable) n-grams + word n-grams; 2- to 4-grams gave the best results.

Feature selection is important to reduce the feature space; chi-square/information gain gave the best results.
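The feature selection step can be sketched with character 2- to 4-grams scored by the standard 2×2 chi-square statistic. This is a minimal pure-Python illustration on a toy corpus invented for the example; the SVM training is omitted, and the function names are hypothetical rather than taken from the system.

```python
from collections import Counter


def char_ngrams(text: str, n_min: int = 2, n_max: int = 4) -> set:
    """All distinct character n-grams of the text for n in [n_min, n_max]."""
    return {
        text[i : i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(text) - n + 1)
    }


def chi_square_scores(docs, labels) -> dict:
    """Chi-square of each n-gram's presence against the binary label (1 = complaint)."""
    n = len(docs)
    n_pos = sum(labels)
    df, df_pos = Counter(), Counter()
    for text, y in zip(docs, labels):
        for gram in char_ngrams(text):
            df[gram] += 1
            df_pos[gram] += y
    scores = {}
    for gram, present in df.items():
        a = df_pos[gram]       # complaints containing the n-gram
        b = present - a        # non-complaints containing it
        c = n_pos - a          # complaints without it
        d = n - n_pos - b      # non-complaints without it
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        scores[gram] = n * (a * d - b * c) ** 2 / denom if denom else 0.0
    return scores


if __name__ == "__main__":
    docs = ["통화 끊김 불만", "요금 문의", "끊김 심함 불만", "요금제 변경 문의"]
    labels = [1, 0, 1, 0]
    scores = chi_square_scores(docs, labels)
    best = sorted(scores, key=scores.get, reverse=True)[:5]
    print(best)
```

Keeping only the highest-scoring n-grams shrinks the feature space before the classifier is trained.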


Sentiment Analysis

Problem: no freely available resources such as SentiWordNet. => Need to build one!

– Built our general-domain dictionary as a baseline: 20,000 verbs/adjectives classified as positive/negative/neutral.
– The result is a lexicon of ~5,000 entries (positive/negative only).
– Enriched with features manually extracted from n-grams.
– Precision-oriented (92%), but recall is still quite low (75%). Overall accuracy: 85%.

=> Still working on ways to improve recall without sacrificing precision. Basic ideas:
– Bagging / Boosting (combining several classifiers)
– Hybrid models combining (linguistic: semantic/syntactic) rules and machine learning (statistics)
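The slides report precision and recall separately. One common single-number summary, which is not given in the original, is the F-measure; the short sketch below computes it from the reported 92% precision and 75% recall. The function name `f_beta` is hypothetical, and values of beta below 1 weight precision more heavily, which fits a precision-oriented system.

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall."""
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


if __name__ == "__main__":
    print(round(f_beta(0.92, 0.75), 3))            # balanced F1
    print(round(f_beta(0.92, 0.75, beta=0.5), 3))  # precision-weighted F0.5
```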


Lessons Learned

• Lessons Learned
- There is still quite a big gap between customer expectations and reality. Need to explain, and keep the customer involved in the assessment process and in knowledge/domain vocabulary acquisition.
- Need to acquire a lot of lexicons: named entities, synonyms, stopwords, sentiment words.
- The quality and quantity of these lexicons is a real asset of the company. Acquiring lexicons requires workbenches for efficient semi-supervised methods (manually filtering the output of automatic methods) to reduce costs.
- Tuning classifier parameters, feature extraction, linguistic knowledge, etc. consumes time and expertise.
- Simple academic methods work quite well (even if they need a lot of tuning).
- Beyond simple search engines, the quality of NLP components becomes more and more important, especially for Sentiment Analysis.


Lessons Learned

• Lessons Learned
- Customers are more and more interested in “Big Data”, “Listening Platform”, “Cloud”, “Social Network/Intelligence”…
- More and more customers want to get data/opinions from outside their in-site systems (blogs, communities (BBS), tweets, etc.). Typical questions:
=> How many crawlers are needed to crawl all Korean tweets/blogs?
=> How about crawling Facebook?
- How to identify “anti communities” (like “Anti-Samsung”)? Who are the power bloggers?
=> The solutions required go far beyond Sentiment Analysis.
=> But customers often can’t afford, or don’t want, crawling infrastructure and maintenance fees.
=> New opportunities to deliver software in forms other than traditional package sales: SaaS/PaaS/IaaS (Software/Platform/Infrastructure as a Service).
=> Even in the enterprise, a distributed framework is required (not only for web-scale services).
- Customers (at least in Korea) love knowing the technology and are more and more high-level users. They buy not only solutions but also consulting/expertise.
- Projects are more and more expensive, and many require either benchmarks or a POC.


Future Work & Plan

• Future Work (On-going)
- Acquire more entries in the sentiment dictionary
- Build a framework for handling linguistic rules and statistical models (SVM/Rocchio)
- Coupling with antonyms and/or hints
- Better handling of negation
- Better workbench for faster acquisition / (re-)training
- Co-reference resolution
- (Full/Semi) parsing?
- More complex models than binary classification?
- Building/maintaining a platform for PaaS/SaaS

A long, long way to go…


Questions?

Thank you.