Sentiment Analysis. Sivaji Bandyopadhyay, Jadavpur University, Kolkata, India


Post on 17-Dec-2015

TRANSCRIPT

  • Slide 1
  • Sentiment Analysis. Sivaji Bandyopadhyay, Jadavpur University, Kolkata, India
  • Slide 2
  • Overview: Sentiment Analysis is a multifaceted problem
      • Sentiment Knowledge Acquisition
      • Sentiment / Subjectivity Detection
      • Sentiment Polarity Detection
      • Sentiment Structurization
      • Sentiment Summary
    Sentiment - Human Intelligence
  • Slide 3
  • Sentiment Knowledge Acquisition
      • Prior Polarity Sentiment Lexicon
      • Automatic Computational Processes: WordNet, Dictionary Based, Antonym
      • Cross Lingual Projection of Sentiment Lexicons
      • Involving Human Intelligence: Dr Sentiment
  • Slide 4
  • Sentiment Analysis
      • Sentiment Detection
      • Sentiment Classification
    Example: IITH is a very good institution. I love Hyderabad, the city is famous for its Biriyani, Pearl and old Mughal architecture! Summer in Hyderabad is too scorching.
    Andrea Esuli and Fabrizio Sebastiani. SentiWordNet: A publicly available lexical resource for opinion mining. In Proceedings of Language Resources and Evaluation (LREC), 2006.
  • Slide 5
  • What is SentiWordNet? A Prior Polarity Lexicon
    POS         Offset     Positivity   Negativity   Synset
    Adjective   1006361    0.875        0.0          happy
    Noun        4466580    0.375        0.0          friendliness
    Adverb      214589     0.625        0.125        sharply
    Verb        2471993    0.0          0.125        shame
  • Slide 6
  • Prior Polarity Lexicon
      • Sentiment-bearing words: love, hate, good, favorite
      • Challenges for polarity identification:
          • Context Information (Pang et al., 2002)
          • Domain Pragmatic Knowledge (Aue and Gamon, 2005)
          • Time Dimension (Read, 2005)
          • Language/Culture Properties (Wiebe and Mihalcea, 2006)
  • Slide 7
  • Prior Polarity Lexicon (continued)
      • Context Information: "I prefer Limuzin as it is longer than Mercedes." versus "Avoid longer baggage during excursion in Amazon."
      • Language/Culture Properties: (Sahera: A marriage-wear of India), (Durgapujo: A festival of Bengal)
      • Domain Pragmatic Knowledge: "Sensex goes high." versus "Price goes high."
      • Time Dimension: During the 90s, mobile phone users generally wrote about their color phones in online reviews, but in recent times a color phone alone is not enough. People are fascinated and influenced by touch screens and the software installation facilities of these new-generation gadgets.
  • Slide 8
  • Prior Polarity Lexicon (continued)
    Suppose the total number of occurrences of a word (for example, "long") in a domain corpus is n, and its positive and negative occurrences are S_p and S_n respectively. In a developed sentiment lexicon, the positivity and negativity scores assigned to that word are then:
        Positivity: S_p / n
        Negativity: S_n / n
    These associated positive and negative scores are called the prior polarity.
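The S_p / n and S_n / n definition above can be tried on a toy corpus. This is a minimal sketch, assuming each occurrence of a word has already been labeled "pos", "neg" or "obj"; the function name, the label names and the sample data are illustrative and do not come from the slides.

```python
from collections import Counter

def prior_polarity(observations):
    """Return {word: (positivity, negativity)} from (word, label) pairs.

    Assumption: label is one of "pos", "neg" or "obj" (hypothetical names).
    """
    totals, positives, negatives = Counter(), Counter(), Counter()
    for word, label in observations:
        totals[word] += 1              # n: every occurrence of the word
        if label == "pos":
            positives[word] += 1       # S_p
        elif label == "neg":
            negatives[word] += 1       # S_n
    return {w: (positives[w] / n, negatives[w] / n) for w, n in totals.items()}

# Toy data: "long" occurs four times, once positively and twice negatively.
observations = [("long", "pos"), ("long", "neg"), ("long", "neg"), ("long", "obj")]
scores = prior_polarity(observations)
print(scores["long"])  # (0.25, 0.5)
```

Occurrences that are neither positive nor negative still count towards n, so the two scores need not sum to 1.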
  • Slide 9
  • Source Lexicon Acquisition: Available Resources for English
      • SentiWordNet (Esuli et al., 2006): an automatically constructed lexical resource for English that assigns a positivity score and a negativity score to each WordNet synset.
      • WordNet Affect List (Strapparava et al., 2004): WordNet synsets tagged with six basic emotions: anger, disgust, fear, joy, sadness, surprise.
      • Taboada's Adjective List (Voll et al., 2006): an automatically constructed adjective list with positivity and negativity polarity assignments.
      • Subjectivity Word List (Wilson et al., 2005): entries are manually labeled with part-of-speech (POS) tags and with either a strong or a weak subjective tag, depending on the reliability of the subjective nature of the entry.
  • Slide 10
  • Source Language Acquisition (continued): Chosen Source Lexicon Resources
      • SentiWordNet: the most widely used resource in applications such as sentiment analysis, opinion mining and emotion analysis.
      • Subjectivity Word List (Wilson et al., 2005): the most trustworthy resource, as the opinion mining system OpinionFinder, which uses the subjectivity word list, has reported the highest score for opinion/sentiment subjectivity (Wiebe and Riloff, 2006) (Das and Bandyopadhyay, 2010).
  • Slide 11
  • Source Language Acquisition (continued): Noise Reduction
      • A merged sentiment lexicon has been developed from both resources by removing duplicates.
      • 64% of the single-word entries are common to the Subjectivity Word List and SentiWordNet.
      • The new merged sentiment lexicon consists of 14,135 tokens.
      • Several filtering techniques have been applied to generate the new list.
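The duplicate-removing merge can be sketched as a dictionary union. This is only an illustration under stated assumptions: both lexicons are represented as word to (positivity, negativity) mappings, words are compared case-insensitively, and the first lexicon wins on a duplicate. None of these choices is specified on the slide, and the sample entries and scores are made up.

```python
def _normalise(word):
    # Assumption: entries are matched case-insensitively, ignoring outer spaces.
    return word.strip().lower()

def merge_lexicons(primary, secondary):
    """Union of two lexicons with one entry per word; primary wins on duplicates."""
    merged = {}
    for lexicon in (primary, secondary):
        for word, scores in lexicon.items():
            merged.setdefault(_normalise(word), scores)
    return merged

def common_entries(primary, secondary):
    """Words that appear in both lexicons after normalisation."""
    return {_normalise(w) for w in primary} & {_normalise(w) for w in secondary}

# Hypothetical miniature lexicons.
sentiwordnet_like = {"happy": (0.875, 0.0), "shame": (0.0, 0.125)}
subjectivity_like = {"Happy": (1.0, 0.0), "hate": (0.0, 1.0)}

merged = merge_lexicons(sentiwordnet_like, subjectivity_like)
shared = common_entries(sentiwordnet_like, subjectivity_like)
print(sorted(merged), shared)
```

The filtering techniques mentioned on the slide (thresholds, ambiguity checks) are not modeled here.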
  • Slide 12
  • Source Language Acquisition (continued)
                                 SentiWordNet         Subjectivity Word List
                                 Single     Multi     Single     Multi
    Entries                      115424     79091     5866       990
    Unambiguous Words            20789      30000     4745       963
    Discarded Ambiguous Words    86944      30000     2652       928
    Discard criteria: Threshold, Orientation Strength, Subjectivity Strength, POS
  • Slide 13
  • Target Language Generation: Generation Strategies
      • Bilingual Dictionary Based Approach
      • WordNet Based Approach
      • Antonym Generation
      • Corpus Based Approach
      • Dr Sentiment (A Gaming Approach)
  • Slide 14
  • Target Language Generation (continued): Bilingual Dictionary Based Approach
      • A word-level translation technique is adopted.
      • Robust and reliable synsets (approx. 9966) were created by native speakers as well as linguistic experts of the specific languages as part of English to Indian Languages Machine Translation Systems (EILMT).
      • Various language-specific dictionaries were acquired.
  • Slide 15
  • Target Language Generation (continued): Bilingual Dictionary Based Approach
      • Hindi (90,872)
          • SHABDKOSH (http://www.shabdkosh.com/)
          • Shabdanjali (http://www.shabdkosh.com/content/category/downloads/)
      • Bengali (102,119)
          • Samsad Bengali-English Dictionary (http://dsal.uchicago.edu/dictionaries/biswas_bengali/)
      • Telugu (112,310)
          • Charles Philip Brown English-Telugu Dictionary (http://dsal.uchicago.edu/dictionaries/brown/)
          • Aksharamala English-Telugu Dictionary (https://groups.google.com/group/aksharamala)
          • English-Telugu Dictionary (http://ltrc.iiit.ac.in/onlineServices/Dictionaries/Dict_Frame.html)
  • Slide 16
  • Target Language Generation (continued): Bilingual Dictionary Based Approach
      • Hindi: the translation process resulted in 22,708 Hindi entries
      • Bengali: the translation process resulted in 34,117 Bengali entries
      • Telugu: the translation process resulted in 30,889 Telugu entries
      • Almost 88% of the Telugu SentiWordNet was generated by this process
  • Slide 17
  • Target Language Generation (continued): WordNet Based Expansion Approach
      • Synonymy Expansion: the WordNet based expansion technique produces more synset members, e.g. inactive, motionless, static for the source word still. Prior polarity scores are copied directly.
      • Antonymy Expansion: the WordNet based expansion technique produces more sentiment lexemes, e.g. ugly for the source word beautiful. Prior polarities are calculated as:
            T_p = 1 - S_p
            T_n = 1 - S_n
        where S_p, S_n are the positivity and negativity scores for the source language (i.e., English) and T_p, T_n are the positivity and negativity scores for the target languages.
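The two expansion rules can be sketched in a few lines. This is an illustrative sketch only: the synonym and antonym maps stand in for WordNet lookups, the function name is hypothetical, and the scores given to "still" and "beautiful" are made-up values chosen to keep the arithmetic exact.

```python
def expand_with_wordnet(lexicon, synonyms, antonyms):
    """Expand a {word: (positivity, negativity)} lexicon.

    synonyms / antonyms: {word: [related words]} maps, assumed to come
    from a WordNet-like resource (hypothetical inputs here).
    """
    expanded = dict(lexicon)
    for word, (s_p, s_n) in lexicon.items():
        for synonym in synonyms.get(word, []):
            # Synonymy expansion: scores are copied unchanged.
            expanded.setdefault(synonym, (s_p, s_n))
        for antonym in antonyms.get(word, []):
            # Antonymy expansion: T_p = 1 - S_p, T_n = 1 - S_n.
            expanded.setdefault(antonym, (1 - s_p, 1 - s_n))
    return expanded

lexicon = {"still": (0.0, 0.25), "beautiful": (0.75, 0.0)}   # made-up scores
synonyms = {"still": ["inactive", "motionless", "static"]}
antonyms = {"beautiful": ["ugly"]}

expanded = expand_with_wordnet(lexicon, synonyms, antonyms)
print(expanded["static"], expanded["ugly"])  # (0.0, 0.25) (0.25, 1.0)
```

Existing entries are never overwritten (`setdefault`), which is a design choice of this sketch rather than something stated on the slide.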
  • Slide 18
  • Target Language Generation (continued): WordNet Based Expansion Approach
      • Hindi: Hindi WordNet (Jha et al., 2001) (http://www.cfilt.iitb.ac.in/wordnet/webhwn/) is a well structured and manually compiled resource that has been updated over the last nine years. Almost 60% of the entries were generated by this process.
      • Bengali: the Bengali WordNet (http://bn.asianwordnet.org/) contains only 1775 noun synsets, as reported in (Robkop et al., 2010). Only 5% new lexicon entries have been generated in this process.
  • Slide 19
  • Target Language Generation (continued): Antonymy Generation
    Affix/Suffix     Word         Antonym
    abX              Normal       Ab-normal
    misX             Fortune      Mis-fortune
    imX-exX          Im-plicit    Ex-plicit
    antiX            Clockwise    Anti-clockwise
    nonX             Aligned      Non-aligned
    inX-exX          In-trovert   Ex-trovert
    disX             Interest     Dis-interest
    unX              Biased       Un-biased
    upX-downX        Up-hill      Down-hill
    imX              Possible     Im-possible
    illX             Legal        Il-legal
    overX-underX     Overdone     Under-done
    inX              Consistent   In-consistent
    rX-irX           Regular      Ir-regular
    Xless-Xful       Harm-less    Harm-ful
    malX             Function     Mal-function
    About 8% of Bengali, 7% of Hindi and 11% of Telugu SentiWordNet entries are generated in this process.
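The affix patterns in the table can be turned into a simple candidate generator. This is a sketch under assumptions: candidates are kept only if they occur in a known vocabulary (a filter not described on the slide), the pattern lists are transcribed from the table for English only, and the function and variable names are hypothetical.

```python
# Prefixes that negate a word when attached (abX, misX, antiX, ...).
NEGATING_PREFIXES = ["ab", "mis", "anti", "non", "dis", "un", "im", "il", "in", "ir", "mal"]
# Prefix pairs that swap into each other (imX-exX, inX-exX, upX-downX, overX-underX).
PREFIX_SWAPS = [("im", "ex"), ("in", "ex"), ("up", "down"), ("over", "under")]
# Suffix pairs that swap into each other (Xless-Xful).
SUFFIX_SWAPS = [("less", "ful")]

def antonym_candidates(word, vocabulary):
    """Return affix-derived antonym candidates of `word` found in `vocabulary`."""
    word = word.lower()
    candidates = set()
    for prefix in NEGATING_PREFIXES:
        candidates.add(prefix + word)               # normal -> abnormal
        if word.startswith(prefix):
            candidates.add(word[len(prefix):])      # illegal -> legal
    for left, right in PREFIX_SWAPS:
        for a, b in ((left, right), (right, left)):
            if word.startswith(a):
                candidates.add(b + word[len(a):])   # implicit -> explicit
    for left, right in SUFFIX_SWAPS:
        for a, b in ((left, right), (right, left)):
            if word.endswith(a):
                candidates.add(word[:-len(a)] + b)  # harmless -> harmful
    candidates.discard(word)
    return sorted(c for c in candidates if c in vocabulary)

vocabulary = {"normal", "abnormal", "implicit", "explicit", "harmless",
              "harmful", "legal", "illegal", "possible", "impossible"}
print(antonym_candidates("normal", vocabulary))    # ['abnormal']
print(antonym_candidates("implicit", vocabulary))  # ['explicit']
```

Without the vocabulary filter the generator would emit many non-words (for example "misnormal"), which is why the sketch checks every candidate against a word list.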
  • Slide 20
  • Target Language Generation (continued): Corpus Based Approach
      • Language/culture specific words: (Sahera: A marriage-wear), (Durgapujo: A festival of Bengal)
      • Technique: the generated sentiment lexicon is used as a seed list
      • Tag-Set: SWP (Sentiment Word Positive), SWN (Sentiment Word Negative)
      • Corpus: EILMT language specific corpus, approximately 10K sentences
      • Model: Conditional Random Field (CRF). An n-gram (n=4) sequence labeling model has been used for the present task.
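The data preparation for such a sequence labeler might look like the sketch below. It does not train a CRF. It only shows how a seed lexicon could assign SWP/SWN tags and how a 4-token context window could be turned into features. The "O" tag for non-sentiment tokens, the reading of "n=4" as the current token plus its three predecessors, the function names and the sample scores are all assumptions of this sketch.

```python
def seed_tag(tokens, seed_lexicon):
    """Tag each token SWP, SWN or O using a {word: (positivity, negativity)} seed list."""
    tags = []
    for token in tokens:
        positivity, negativity = seed_lexicon.get(token.lower(), (0.0, 0.0))
        if positivity > negativity:
            tags.append("SWP")
        elif negativity > positivity:
            tags.append("SWN")
        else:
            tags.append("O")   # assumed label for tokens outside the seed list
    return tags

def window_features(tokens, index, n=4):
    """Features for one position: the current token and its n-1 predecessors."""
    features = {"w0": tokens[index].lower()}
    for k in range(1, n):
        j = index - k
        features[f"w-{k}"] = tokens[j].lower() if j >= 0 else "<s>"
    return features

tokens = ["I", "love", "Hyderabad", "but", "summer", "is", "scorching"]
seed_lexicon = {"love": (0.625, 0.0), "scorching": (0.0, 0.5)}  # made-up scores

tags = seed_tag(tokens, seed_lexicon)
features = window_features(tokens, 1)
print(tags)
print(features)
```

The per-token feature dictionaries and tag sequences are the usual input shape for CRF toolkits, so this output could be handed to one for training.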
  • Slide 21
  • Limitations: Issues in Cross Lingual Projection
      • The sentiment score may not be equal to that of the source language
      • A relative sentiment score is needed rather than an absolute score
      • Language / culture specific lexicons should be included
      • The sentiment score should be updated over time
  • Slide 22
  • Involving Human Intelligence
    WORLD INTERNET USAGE AND POPULATION STATISTICS
    World Regions             Population       Internet Users   Internet Users   Penetration      Growth      Users %
                              (2010 Est.)      Dec. 31, 2000    Latest Data      (% Population)   2000-2010   of Table
    Africa                    1,013,779,050    4,514,400        110,931,700      10.9 %           2,357.3 %   5.6 %
    Asia                      3,834,792,852    114,304,000      825,094,396      21.5 %           621.8 %     42.0 %
    Europe                    813,319,511      105,096,093      475,069,448      58.4 %           352.0 %     24.2 %
    Middle East               212,336,924      3,284,800        63,240,946       29.8 %           1,825.3 %   3.2 %
    North America             344,124,450      108,096,800      266,224,500      77.4 %           146.3 %     13.5 %
    Latin America/Caribbean   592,556,972      18,068,919       204,689,836      34.5 %           1,032.8 %   10.4 %
    Oceania / Australia       34,700,201       7,620,480        21,263,990       61.3 %           179.0 %     1.1 %
    WORLD TOTAL               6,845,609,960    360,985,492      1,966,514,816    28.7 %           444.8 %     100.0 %
  • Slide 23
  • Dr. Sentiment: Q1
  • Slide 24
  • Dr. Sentiment: Q2
    Word     Positivity   Negativity
    Good     0.625        0.0
    Better   0.875        0.0
    Best     0.980        0.0
  • Slide 25
  • Dr. Sentiment: Q3
  • Slide 26
  • Dr. Sentiment: Q4
  • Slide 27
  • Sentiment: Un-Explored Dimensions
      • Geo-Spatial
        Blue in Islam: In verse 20:102 of the Quran, the word zurq (plural of azraq 'blue') is used metaphorically for evil doers whose eyes are glazed with fear.
  • Slide 28
  • Sentiment: Un-Explored Dimensions
      • Age-Wise Senti-Mentality
  • Slide 29
  • Sentiment: Un-Explored Dimensions
      • Gender-Specific Senti-Mentality
  • Slide 30
  • Expected Impact of the Resources
      • The resources are useful in multiple aspects
      • Mono-lingual Sentiment/Opinion/Emotion Analysis tasks
      • The generated language specific SentiWordNet(s) could be expanded by the other proposed methods (Dictionary, WordNet, Antonym and Corpus Based Approach)
      • The other dimensions: Geospatial Information retrieval, Personalized search, Recommender System, etc.
      • Stylometry: a writer's Senti-Mentality
      • Plagiarism
      • Spamming Technique: Geo-Spatial and User Perspective
  • Slide 31
  • The Road Ahead
    Basic SentiWordNet has been developed for 56 languages:
    Afrikaans     Bulgarian   Dutch      German       Irish        Malay        Russian     Thai
    Albanian      Catalan     Estonian   Greek        Italian      Maltese      Serbian     Turkish
    Arabic        Chinese     Filipino   Haitian      Japanese     Norwegian    Slovak      Ukrainian
    Armenian      Croatian    Finnish    Hebrew       Korean       Persian      Slovenian   Urdu
    Azerbaijani   Creole      French     Hungarian    Latvian      Polish       Spanish     Vietnamese
    Basque        Czech       Galician   Icelandic    Lithuanian   Portuguese   Swahili     Welsh
    Belarusian    Danish      Georgian   Indonesian   Macedonian   Romanian     Swedish     Yiddish
    A. Das and S. Bandyopadhyay. Towards The Global SentiWordNet. In the Workshop on Model and Measurement of Meaning (M3), PACLIC 24, November 4, Sendai, Japan, 2010. (Accepted)
  • Slide 32
  • References: Resources
      I. A. Das and S. Bandyopadhyay. Towards The Global SentiWordNet. In the Workshop on Model and Measurement of Meaning (M3), PACLIC 24, November 4, Sendai, Japan, 2010.
      II. A. Das and S. Bandyopadhyay. SentiWordNet for Indian Languages. In the 8th Workshop on Asian Language Resources (ALR), August 21-22, Beijing, China, 2010.
      III. A. Das and S. Bandyopadhyay. SentiWordNet for Bangla. In Knowledge Sharing Event-4: Task 2: Building Electronic Dictionary, February 23rd-24th, 2010, Mysore.
  • Slide 33
  • Sentiment / Subjectivity Detection: Solution Architectures Explored
      • Rule-Based
      • Machine Learning
      • Hybrid
      • Adaptive Genetic Algorithm: Multiple Objective Optimization, the evolutionary technique to detect sentiment
    The Adaptive Genetic Algorithm (Multiple Objective Optimization) technique outperformed all other techniques.
  • Slide 34
  • Sentence subjectivity: An objective sentence expresses some factual information about the world, while a subjective sentence expresses some personal feelings or beliefs.
    Example: Type: Film Review, Film Name: Deep Blue Sea, Holder: Arbitrary (outside of theatre)
        "Oh, this is blue!"
      • Is this statement objective or subjective?
      • "blue" is not an evaluative expression
      • Colour schemes differ among cultures (blue: positive or negative?)
  • Slide 35
  • Example: Type: Comment, Holder: Governor of WB, Issue: Nandigram.
        "Governor said the government should keep patience."
      • Is this statement objective or subjective?
      • Keep patience regarding what?
      • How can it be determined that the Governor's comment is important?
  • Slide 36
  • Subjectivity is a social norm
      • Subjectivity knowledge is pragmatic
      • Prior knowledge always helps to identify subjectivity
  • Slide 37
  • A rule-based approach
      • Use Themes and Ontology as pragmatic knowledge
      • SentiWordNet (Bengali): a prior polarity lexicon
      • Features: Frequency, Average Distribution, Functional Word, Positional Aspect, Theme Identification, Ontology List, Stemming Cluster, Part of Speech, Chunk, SentiWordNet (Bengali)
  • Slide 38
  • Feature wise System Performance
    Features                 Overall Performance incremented by
    Stemming Cluster         4.05%
    Part of Speech           3.62%
    Chunk                    4.07%
    Functional Word          1.88%
    SentiWordNet (Bengali)   5.02%
    Ontology List            3.66%
    [Figure: bar chart comparing English and Bengali (axis 0 to 80) for Base-Line, POS-Chunk, Ontology, Position, Distribution]
  • Slide 39
  • Overall System Performance
    Domain   Precision   Recall
    NEWS     72.16%      76.00%
    BLOG     74.60%      80.40%
    Observations
      • Subjectivity detection is easier for the blog corpus than for the news corpus
      • Using the CRF technique with the same feature set improved performance by only 2% over the rule-based system
  • Slide 40
  • GBML is used to automatically identify the best feature set, based on the principle of natural selection and survival of the fittest. The identified fittest feature set is then optimized locally, and global optimization is then obtained by a multi-objective optimization technique. Local optimization identifies the best range of values for a particular feature. Global optimization identifies the best ranges of values for the given multiple features.
  • Slide 41
  • Experimentally Best Identified Feature Set
    Types              Features
    Lexico-Syntactic   POS, SentiWordNet, Frequency, Stemming
    Syntactic          Chunk Label, Dependency Parsing
    Discourse Level    Title of the Document, First Paragraph, Average Distribution, Theme Word
  • Slide 42
  • GAs are characterized by five basic components:
      I. A chromosome representation for the feasible solutions to the optimization problem.
      II. An initial population of feasible solutions.
      III. A fitness function that evaluates each solution.
      IV. Genetic operators that generate a new population from the existing population.
      V. Control parameters such as population size, probability of genetic operators, number of generations, etc.
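The five components can be seen together in a small feature-selection loop. This is a generic single-objective sketch, not the multi-objective GBML system described in the slides. The bit-string chromosome, tournament selection, one-point crossover, elitism, the toy fitness function and every parameter value are assumptions made for illustration.

```python
import random

def run_ga(fitness, n_genes, pop_size=20, generations=30,
           crossover_rate=0.8, mutation_rate=0.05, seed=0):
    """Evolve a 0/1 feature mask; component V is the keyword arguments."""
    rng = random.Random(seed)
    # I + II: chromosome = list of bits, initial population drawn at random.
    population = [[rng.randint(0, 1) for _ in range(n_genes)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)           # III: fitness ranks solutions

    def select():
        # IV (selection): binary tournament between two random individuals.
        a, b = rng.sample(population, 2)
        return a if fitness(a) >= fitness(b) else b

    for _ in range(generations):
        next_population = [best[:]]               # elitism: keep the best so far
        while len(next_population) < pop_size:
            parent1, parent2 = select(), select()
            if rng.random() < crossover_rate:     # IV (one-point crossover)
                cut = rng.randint(1, n_genes - 1)
                child = parent1[:cut] + parent2[cut:]
            else:
                child = parent1[:]
            # IV (mutation): flip each bit with a small probability.
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            next_population.append(child)
        population = next_population
        best = max(population, key=fitness)
    return best

# Toy fitness: reward agreement with a made-up "useful feature" mask.
FEATURES = ["POS", "SentiWordNet", "Frequency", "Stemming", "Chunk Label", "Dependency Parsing"]
USEFUL = [1, 1, 0, 1, 0, 1]                       # hypothetical, not from the slides

def toy_fitness(mask):
    return sum(1 for gene, target in zip(mask, USEFUL) if gene == target)

best_mask = run_ga(toy_fitness, n_genes=len(FEATURES))
print([name for name, gene in zip(FEATURES, best_mask) if gene])
```

In a real setting the fitness function would train and evaluate a subjectivity classifier on the selected features. Because the best individual is carried over each generation, the best fitness never decreases.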
  • Slide 43
  • Where is the resultant subjectivity function to be calculated and is the i-th feature function. If the present model is represented in a vector space model, then the above function could be rewritten as: This equation specifies what is known as the dot product between vectors. The GBML provides the facility to search in the Pareto-optimal set of possible features. To make the Pareto optimality mathematically more rigorous, we state that a feature vector x is partially less than feature vector y, symbolically x