voc real world enterprise needs
DESCRIPTION
VOC, sentiment analysis, Korean language processing, morphological analysis, CRF
TRANSCRIPT
Communicating KnowledgeSentiment Analysis Symposium
Lessons Learned from a VOC Analysis System for a Big Korean Telecommunication Company
Ivan Berlocher
SALTLUX
Sentiment Analysis Symposium
Nov. 9th 2011
Introduction
• Saltlux Inc. is located in Seoul, Korea; established in 1979 and renovated in 2003.
• Domain of expertise: Information Retrieval and Text/Data/Web/Graph Mining solutions and services based on Semantic Web technology.
• Main supported languages: Korean, Japanese, English. Other languages are handled with external solutions.
• 70 employees in Seoul, one development center in Vietnam (12 employees), and one sales office in Japan (3 employees).
• Several partnerships with other companies/institutes:
– Ontoprise in Germany
– Franz in California
– DERI in Ireland
• Many R&D partnerships (ETRI, KAIST, universities…)
Table of Contents
• Project & Environment Description
– Needs of Customer
– System (Main) Requirements
• VOC Data
– Sample Data
– Data Analysis
• System Overview
• Korean Linguistic
• Sentiment Analysis
• Lessons Learned
• Future work
Project & Environment Description
• Needs of Customer
– Customer: a Korean telecommunication corporation, Department of Voice of Customer (VOC) Analysis
– Mission: analyze the human-typed memos from all call centers to identify major problems and produce reports for decision makers, in order to improve quality of service and increase customer satisfaction.
– Data: human-typed notes covering any kind of customer question:
• Information about subscriptions
• Inquiries or complaints about devices (phones), services, or dealerships
• Complaints about quality of communication
• etc.
Number of notes: ~200 thousand a day (~5 million a month). Notes must remain searchable for 1 year (~60 million).
Project & Environment Description
• System (Main) Requirements
• Distinguish between simple inquiries and complaints
• Classify into categories/departments of services
• Monitor trends of topics in real time, daily, weekly, monthly
• Compare trends between time slices
• Find related topics
• Manage personal vocabulary
• Anonymize personal data (people's names, telephone numbers, social IDs, addresses, etc.)
The project started in October 2010 with a 3-month POC (~10 man-months). After acceptance (success), integration with the real system took another 3 months (~10 man-months). The 2 phases together: ~$200,000.
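The anonymization requirement above can be illustrated with a minimal sketch. The two patterns (Korean mobile numbers and resident registration numbers) and the placeholder tokens are assumptions chosen for illustration, not the rules used in the real system; names and addresses would need named entity recognition rather than regular expressions.

```python
import re

# Illustrative patterns only (assumptions, not the production rules):
# a 13-digit resident registration number such as 800101-1234567, and a
# mobile number such as 010-1234-5678, with optional hyphens or spaces.
PATTERNS = [
    ("<RRN>", re.compile(r"(?<!\d)\d{6}[-\s]?\d{7}(?!\d)")),
    ("<PHONE>", re.compile(r"(?<!\d)01[016789][-\s]?\d{3,4}[-\s]?\d{4}(?!\d)")),
]


def anonymize(note: str) -> str:
    """Replace each matched identifier with its placeholder token."""
    # The longer identifier is masked first so its digits cannot be
    # partially matched by the phone pattern.
    for placeholder, pattern in PATTERNS:
        note = pattern.sub(placeholder, note)
    return note


if __name__ == "__main__":
    print(anonymize("연락처: 010-1234-5678"))
```

Masking with a typed placeholder rather than deleting the span keeps the note searchable for "a phone number was given" without exposing the number itself.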
VOC Data Sample
• Data often contain some structured information (metadata), but without any standard.
• Most of the time, though, there is no particular mark or metadata.
This makes Named Entity Recognition more complex.
The same information is entered in many different ways (e.g. 연락처: phone number).
VOC Data Analysis
• Data contains lots of named entities: products/services/people/social IDs/phone numbers, often privacy-related
• Data contains lots of technical (domain) terms
• The real content to analyze is mostly very short (tweet-like) but sometimes very long.
• Lots of misspelling/mistyping
• Korean (Asian) problem of segmentation, amplified by the speed constraint
• Lots of (non-standard) abbreviations
System Overview
• Overall Architecture
[Architecture diagram. Component labels, in the order extracted:]
– Text Segmentation
– Morphological Analyzer
– Chunk/Phrase Identification
– Named Entities Recognition
– Synonyms & Normalization
– Indexing
– Distributed Indexes
– Classifier (Hybrid SVM & Rules)
– Analysis Phase
– Searching/Clustering (TopicRank)
– Timelines Dumper
– DFS
– Timelines: 20110713_0700_1.df, 20110713_0700_2.df, 20110713_0700_3.df, 20110713_0710_1.df, 20110713_0710_2.df, 20110713_0710_3.df
– Scheduler
– Merger & Ranker
– Trend (TopN)
– DB
– Web Server (Web UI)
– Complaint Detector
In the real system, indexing was parallelized across 18 Linux machines for speed.
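The timeline file names suggest several partial dumps per time slot that a "Merger & Ranker" combines into a Top-N trend. The slides do not describe the contents of the .df files, so the sketch below simply assumes each partial dump is a mapping from keyword to count; `merge_and_rank` is a hypothetical name.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def merge_and_rank(shards: Iterable[Dict[str, int]], n: int) -> List[Tuple[str, int]]:
    """Sum keyword counts over all partial dumps and return the top n."""
    total: Counter = Counter()
    for shard in shards:
        total.update(shard)  # adds counts keyword by keyword
    # Sort by descending count, then by keyword, so ties are deterministic.
    return sorted(total.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


if __name__ == "__main__":
    slot_0700 = [
        {"billing": 3, "cancel": 1},
        {"billing": 2, "handset": 4},
        {"roaming": 1},
    ]
    print(merge_and_rank(slot_0700, 2))
```

Because counts are additive, the same function can merge ten-minute slots into daily, weekly, or monthly trends.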
• Screenshots:
– Home page
– Top N Keywords Extraction
– Related Keywords (Word Clustering)
– Trend (Timeline) view
– Tweets view
Korean Linguistic
Korean Linguistic
• Brief introduction
Korean is alphabet-based: consonants and vowels are composed into blocks of consonant/vowel or consonant/vowel/consonant.
“나는 학생입니다.”
=> 나 = ㄴ (N) + ㅏ (A) = NA
=> 학 = ㅎ (H) + ㅏ (A) + ㄱ (K) = HAK
One consonant/vowel or consonant/vowel/consonant unit is a syllable (“eumjeol”), and a space-delimited word unit (“eojeol”) is composed of several syllables.
Basic grammar: a word is composed of one root (noun, adjective/verb) followed by an inflection marking the grammatical role (subject/object/location etc.) for nouns (called “josa”), or aspect/mood (tense, honorific form etc.) for verbs/adjectives (called “eomi”).
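The syllable composition described above is mirrored by the Unicode encoding of precomposed Hangul, where each syllable's code point is computed from the indices of its initial consonant, vowel, and optional final consonant. A minimal sketch of the decomposition (the function name `decompose` is illustrative):

```python
# Jamo in Unicode order: 19 initial consonants, 21 vowels, 27 finals (+ none).
LEADS = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"
VOWELS = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"
TAILS = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")

HANGUL_BASE = 0xAC00  # code point of the first precomposed syllable, 가


def decompose(syllable: str):
    """Split one precomposed Hangul syllable into (initial, vowel, final)."""
    index = ord(syllable) - HANGUL_BASE
    if not 0 <= index < len(LEADS) * len(VOWELS) * len(TAILS):
        raise ValueError(f"not a precomposed Hangul syllable: {syllable!r}")
    lead, rest = divmod(index, len(VOWELS) * len(TAILS))
    vowel, tail = divmod(rest, len(TAILS))
    return LEADS[lead], VOWELS[vowel], TAILS[tail]


if __name__ == "__main__":
    for ch in "나학":
        print(ch, decompose(ch))
```

An empty final component means the syllable is a consonant/vowel block with no closing consonant.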
Korean Linguistic
• Examples:
“나는 학생입니다.”
=> “나는” = “나” (na: I/me) + “는” (neun: topic marker)
=> “학생입니다” = “학생” (hak-seng: student) + “입니다” (im-ni-da: am)
=> I’m (a) student.
Lots of (composite) inflectional forms:
학생 + 입니다 = Noun + Be
학생 + 인 / 이예요 / 이다 / 입니까? / 인데 / 인데요 etc. (was, will be …) (eomi)
학생 + syntactic role (이: subject / 에게: to / 한테: from / 을: object) etc. (josa)
Korean is highly agglutinative (it concatenates prefix/nouns/josa/eomi):
=> Search engine: 검색엔진
=> High-performance search engine: 고성능검색엔진
But the use of spaces is free/arbitrary: 검색엔진 and 검색 엔진 are equivalent. Especially on SNS, on space-limited devices, and under speed constraints (like real-time transcription of conversations), spaces are more and more often omitted or misused.
=> Need Automatic Segmentation Correction.
Project & Environment Description
• Automatic Segmentation Correction Illustration
Korean Linguistic
• Automatic Segmentation Correction Implementation
Binary classification approach: tag each syllable according to whether or not a space precedes it.
Any kind of classifier can be used. Here we use a CRF model (it could be an SVM) with the set of features listed below.
Example input: 프랑스의 세계적인 디자이너 …
• Features
– 1-gram, 2-gram, 3-gram, 4-gram of characters (syllables)
– Korean or not, contains number
• Evaluation
– Accuracy (character)
– Word precision = # correctly spaced words / # words produced by the system
Results (CRF):
Accuracy at character level: 96.25%
Precision at word level: 95.58%
• Very simple to train (easy to get huge amounts of data)
• No need for a lexicon or any lexical information
• Performs surprisingly well
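The tagging scheme, features, and word-precision metric above can be sketched without committing to a particular CRF library. The function names, feature names, and exact n-gram windows are assumptions for illustration; the feature dictionaries could then be fed to any sequence classifier.

```python
from typing import Dict, List, Set, Tuple


def to_labels(spaced: str) -> Tuple[List[str], List[int]]:
    """Turn correctly spaced text into syllables and 0/1 tags
    (1 = a space precedes this syllable). Training data comes for free."""
    chars: List[str] = []
    labels: List[int] = []
    space_before = False
    for ch in spaced:
        if ch.isspace():
            space_before = True
            continue
        labels.append(1 if space_before and chars else 0)
        chars.append(ch)
        space_before = False
    return chars, labels


def features(chars: List[str], i: int, max_n: int = 4) -> Dict[str, object]:
    """Character n-grams (n = 1..4) ending at and starting at position i,
    plus the 'Korean or not' and 'contains number' flags."""
    feats: Dict[str, object] = {
        "korean": "\uac00" <= chars[i] <= "\ud7a3",
        "digit": chars[i].isdigit(),
    }
    for n in range(1, max_n + 1):
        if i - n + 1 >= 0:
            feats[f"left{n}"] = "".join(chars[i - n + 1 : i + 1])
        if i + n <= len(chars):
            feats[f"right{n}"] = "".join(chars[i : i + n])
    return feats


def apply_labels(chars: List[str], labels: List[int]) -> str:
    """Rebuild spaced text from syllables and predicted tags."""
    out: List[str] = []
    for ch, label in zip(chars, labels):
        if label and out:
            out.append(" ")
        out.append(ch)
    return "".join(out)


def word_spans(labels: List[int]) -> Set[Tuple[int, int]]:
    """(start, end) syllable offsets of every word implied by the tags."""
    starts = [0] + [i for i, label in enumerate(labels) if label and i > 0]
    return set(zip(starts, starts[1:] + [len(labels)]))


def word_precision(gold: List[int], predicted: List[int]) -> float:
    """# correctly spaced words / # words produced by the system."""
    produced = word_spans(predicted)
    return len(produced & word_spans(gold)) / len(produced)
```

A word counts as correct only if both of its boundaries match the reference, which is why word-level precision is lower than character-level accuracy for the same output.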
Korean Linguistic
• Transliteration
– Korean uses more and more English-derived words, transliterated phonetically into the Korean alphabet (the reverse of “Romanization”), especially for foreign names (companies, products, people, technical/domain terms).
– Transcription is non-unique and non-standard.
Examples:
tablet: 태블릿, 태블릿, 타블렛, 테블릿
Hitachi: 히타치, 히타찌, 히다찌, 히타찌
iPhone 4s: 아이폰 4s, 아이폰포에스, 아이폰 포에스
Korean Linguistic
• Automatic transliteration recognition
– Build a rule-based transliteration matcher on phonetic correspondences, acting similarly to Soundex but adapted to Korean pronunciation.
tablet, 태블릿
T => ㅌ / ㄸ / ㄷ
A => ㅏ / ㅓ / ㅔ / ㅐ
etc.
This method has high recall but low precision and needs post-processing filtering (remove known Korean words found in lexicons, remove nouns that are too short, etc.).
The result has to be corrected by a human, so an efficient workbench is needed for productivity.
We gathered a dictionary of 130 thousand entries, mainly IT-oriented.
More academic research is still needed to solve this problem.
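One way to picture the Soundex-like idea is a key that keeps only consonant classes, so that spelling variants of the same loanword collapse to the same key. This is a simplified sketch, not the rule set from the slides: the T class comes from the example above, the J class and the choice to drop vowels entirely (as classic Soundex does) are assumptions, and `sound_key` is a hypothetical name.

```python
import unicodedata

LEADS = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"
TAILS = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")
N_VOWELS = 21

# Consonants treated as interchangeable in transliterations.
# T class: from the slide's example. J class: an assumption.
CLASSES = {"ㄷ": "T", "ㄸ": "T", "ㅌ": "T", "ㅈ": "J", "ㅉ": "J", "ㅊ": "J"}


def consonants(word: str):
    """Yield the initial and final consonants of each Hangul syllable."""
    for ch in unicodedata.normalize("NFC", word):
        index = ord(ch) - 0xAC00
        if not 0 <= index < len(LEADS) * N_VOWELS * len(TAILS):
            continue  # skip spaces, digits, Latin letters
        lead, rest = divmod(index, N_VOWELS * len(TAILS))
        tail = rest % len(TAILS)
        if LEADS[lead] != "ㅇ":  # initial ㅇ is silent
            yield LEADS[lead]
        if TAILS[tail]:
            yield TAILS[tail]


def sound_key(word: str) -> str:
    """Soundex-like key: consonant classes only, vowels dropped."""
    return "".join(CLASSES.get(c, c) for c in consonants(word))


if __name__ == "__main__":
    for w in ["태블릿", "타블렛", "테블릿", "히타치", "히다찌"]:
        print(w, sound_key(w))
```

Such a coarse key groups many unrelated words as well, which is consistent with the high recall and low precision noted above and with the need for lexicon-based filtering afterwards.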
Sentiment Analysis
• Complaint Detection
Similar to the standard problem of subjectivity detection (detecting whether or not a sentence is sentiment-bearing).
Simple approach: binary classification using an SVM, with manually tagged training/test corpora (more than 20 thousand).
Feature space: n-grams of characters (syllables) + n-grams of words; 2- to 4-grams gave the best results.
Feature selection is important to reduce the feature space; chi-square/information gain gave the best results.
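To make the feature-selection step concrete, here is a minimal pure-Python sketch of chi-square scoring over character 2- to 4-grams for a binary complaint/non-complaint corpus. The function names and the toy notes are assumptions; in practice the top-scoring n-grams would be kept as SVM features.

```python
from collections import Counter
from typing import Dict, List, Set


def char_ngrams(text: str, n_min: int = 2, n_max: int = 4) -> Set[str]:
    """Set of character n-grams of a note, ignoring spaces."""
    s = text.replace(" ", "")
    return {s[i : i + n] for n in range(n_min, n_max + 1) for i in range(len(s) - n + 1)}


def chi_square_scores(docs: List[str], labels: List[int]) -> Dict[str, float]:
    """Chi-square of each n-gram's presence against the label (1 = complaint)."""
    n_docs, n_pos = len(docs), sum(labels)
    with_feat: Counter = Counter()
    pos_with_feat: Counter = Counter()
    for doc, label in zip(docs, labels):
        for gram in char_ngrams(doc):
            with_feat[gram] += 1
            pos_with_feat[gram] += label
    scores: Dict[str, float] = {}
    for gram, total in with_feat.items():
        a = pos_with_feat[gram]   # complaints containing the n-gram
        b = total - a             # non-complaints containing it
        c = n_pos - a             # complaints without it
        d = n_docs - n_pos - b    # non-complaints without it
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        scores[gram] = n_docs * (a * d - b * c) ** 2 / denom if denom else 0.0
    return scores


if __name__ == "__main__":
    notes = ["통화 끊김 불만", "요금 문의", "통화 품질 불만", "요금 확인 문의"]
    tags = [1, 0, 1, 0]
    ranked = sorted(chi_square_scores(notes, tags).items(), key=lambda kv: -kv[1])
    print(ranked[:5])
```

An n-gram that occurs in every note gets a score of zero, since it carries no information about the label.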
Sentiment Analysis
Problems: no freely available resources such as SentiWordNet. Need to build them!
Built our general-domain dictionary as a baseline: 20,000 verbs/adjectives classified as positive/negative/neutral.
The result is a lexicon of ~5,000 entries (positive/negative only).
Enriched with manually extracted features from n-grams.
Precision-oriented (92%) but still quite low recall (75%). Overall accuracy: 85%.
=> Still working on ways to improve recall without sacrificing precision.
Basic ideas:
– Bagging/boosting (combining several classifiers)
– Hybrid models combining linguistic (semantic/syntactic) rules and machine learning (statistics)
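For reference, the reported precision and recall can be combined into a single F1 score, which is not stated in the slides and is derived here only from the two figures above.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Reported figures: precision 92%, recall 75%.
    print(round(f1_score(0.92, 0.75), 3))  # 0.826
```

Because the harmonic mean is pulled toward the lower value, raising recall would move this score more than a further gain in precision.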
Lessons Learned
• Lessons Learned
- There is still quite a big gap between customer expectations and reality. We need to explain this and keep the customer involved in the assessment process and in knowledge/domain vocabulary acquisition.
- Need to acquire a lot of lexicons: named entities/synonyms/stopwords/sentiment words.
- The quality and quantity of these lexicons are a real asset for the company. Acquiring lexicons requires workbenches for efficient semi-supervised methods (manually filtering the output of automatic methods) to reduce costs.
- Tuning classifier parameters, feature extraction, linguistic knowledge etc. consumes time and expertise.
- Simple academic methods work quite well (even if they need a lot of tuning).
- Beyond a simple search engine, the quality of NLP components is becoming more and more important, especially for sentiment analysis.
Lessons Learned
• Lessons Learned
- Customers are more and more interested in “Big Data”, “Listening Platform”, “Cloud”, “Social Network/Intelligence”…
- More and more customers want data/opinions from outside their in-house systems (blogs, communities (BBS), tweets etc.). Typical questions:
=> How many crawlers are needed to crawl all Korean tweets/blogs?
=> How about crawling Facebook?
- How to identify “anti communities” (like “Anti-Samsung”)? Who are the power bloggers?
=> The solutions required go far beyond sentiment analysis.
=> But often customers can’t afford, or don’t want, crawling infrastructure and maintenance fees.
=> New opportunities to deliver software in forms other than traditional package sales: SaaS/PaaS/IaaS (Software/Platform/Infrastructure as a Service).
=> Even in the enterprise, a distributed framework is required (not only for web-scale services).
- Customers (at least in Korea) love knowing the technology and are increasingly advanced users. They buy not only solutions but also consulting/expertise.
- Projects are more and more expensive, and many require benchmarks or a POC.
Future Work & Plan
• Future Work (On-going)
- Acquire more entries in the sentiment dictionary
- Build a framework for handling linguistic rules and statistical models (SVM/Rocchio)
- Coupling with antonyms and/or hints
- Better handling of negation
- Better workbench for faster acquisition / (re-)training
- Co-reference resolution
- (Full/semi) parsing?
- More complex models than binary classification?
- Building/maintaining a platform for PaaS/SaaS
A long, long way to go…
Questions?
Thank you.