INFORMATION EXTRACTION ENGINE FOR SENTIMENT-TOPIC MATCHING IN PRODUCT INTELLIGENCE APPLICATIONS
CORNELIA FERNER | INTERNATIONAL DATA SCIENCE CONFERENCE | SALZBURG 2017
WERNER POMWENGER | MARTIN SCHNÖLL | VERONIKA HAAF | ARNOLD KELLER | STEFAN WEGENKITTL
MOTIVATION
http://www.pcworld.com/article/2138145/ultrabooks/lenovo-thinkpad-x1-carbon-review-slightly-overdone-but-plenty-tasty.html
AGENDA
• ARIE – article and review information extraction engine
• Topic classification
• Sentiment analysis
• Recap
ARIE
WEBPAGE PIPELINE
• Content extraction:
  • Boilerplate removal (comments, ads, teasers, etc.)
  • Raw text extraction (without HTML tags)
  • Store metadata
• Review validation:
  • Only expert reviews are needed
  • Sort out ads, comparisons, etc.
  • Latent Dirichlet Allocation + SVM
• Product type recognition:
  • Only laptops (e.g. ultrabooks, convertibles) are needed
  • Sort out reviews on displays, speakers, etc.
  • Maximum Entropy (logistic regression for multiclass problems)
Webpage pipeline: Content Extraction → Review Validation → Product Type Recognition
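The raw text extraction step can be illustrated with a minimal sketch using only the Python standard library. The class name `TextExtractor`, the function `extract_raw_text` and the set of skipped tags are assumptions made for illustration; the slides do not say how ARIE removes boilerplate.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text and skip tags that rarely hold article text."""

    # Assumed boilerplate containers; a real system would need a richer model.
    SKIPPED_TAGS = {"script", "style", "nav", "aside", "footer"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped tag
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def extract_raw_text(html):
    """Return the text content of an HTML string without tags."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


page = "<nav>Home</nav><p>The keyboard is great.</p><p>Battery life is short.</p>"
print(extract_raw_text(page))
```

A tag blacklist like this only approximates boilerplate removal; comments, ads and teasers usually need layout or text-density cues as well.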
ARTICLE PIPELINE
Article pipeline: Tokenization → Topic Classification → Sentiment Analysis
• Tokenization:
  • Words and sentences
  • Sentence-level annotations for topic classification and sentiment analysis
• Preprocessing:
  • Lowercasing
  • No stopwords, no stemming, no removal of infrequent/frequent words
• Features:
  • Bag-of-words (also with bigrams)
  • Word2Vec
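The tokenization and bag-of-words steps listed above could look roughly like the following sketch. The regular expressions and the function names `split_sentences`, `tokenize` and `bag_of_words` are illustrative assumptions, not the tokenizer used in ARIE.

```python
import re
from collections import Counter


def split_sentences(text):
    # Naive splitter (assumption): break after ., ! or ? followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(sentence):
    # Lowercasing is the only normalization; no stemming, no frequency filtering.
    return re.findall(r"[a-z0-9']+", sentence.lower())


def bag_of_words(tokens, use_bigrams=True):
    # Count unigrams and, optionally, adjacent word pairs.
    features = Counter(tokens)
    if use_bigrams:
        features.update(" ".join(pair) for pair in zip(tokens, tokens[1:]))
    return features


review = "The keyboard is great. The display is not bright!"
for sentence in split_sentences(review):
    print(bag_of_words(tokenize(sentence)))
```

Bigrams such as "not bright" keep local word order that single-word counts lose, which matters for negation in sentiment analysis.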
ARTICLE PIPELINE
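• $V = \{1, \dots, M\}$ … dictionary; set of all words
• $S = \{1, \dots, N\}$ … set of topics
• $q \in S$ and $o = (o_1, \dots, o_T)$ with $o_t \in V$ … topics and sequence of words, respectively
• MaxEnt (logistic regression, softmax):
  • $x = (x_1, \dots, x_M)$ … count vector (absolute word counts in a sequence) with $x_k = \sum_{t=1}^{T} \mathbb{1}(o_t = k)$
  • $P(Q = j \mid O = o) = \frac{\exp\left(\sum_{k \in V} x_k \cdot \lambda_{kj} + \mu_j\right)}{\sum_{i=1}^{N} \exp\left(\sum_{k \in V} x_k \cdot \lambda_{ki} + \mu_i\right)} = P(Q = j \mid x)$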
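As a minimal sketch of the MaxEnt classifier, the code below computes a softmax over word counts in plain Python. Here `weights[k][j]` is the weight of word `k` for topic `j` and `bias[j]` is a per-topic offset; the function names, the 0-based indexing and the toy parameters are assumptions for illustration, and real weights would come from training.

```python
import math


def count_vector(word_ids, vocab_size):
    # x[k] = number of positions in the sequence that hold word k
    x = [0] * vocab_size
    for k in word_ids:
        x[k] += 1
    return x


def maxent_posterior(x, weights, bias):
    """Return P(topic j | count vector x) for every topic j."""
    n_topics = len(bias)
    scores = [
        sum(x_k * weights[k][j] for k, x_k in enumerate(x)) + bias[j]
        for j in range(n_topics)
    ]
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Toy model: 3 words, 2 topics (hypothetical parameters).
weights = [[2.0, 0.0], [0.0, 2.0], [0.0, 0.0]]
bias = [0.0, 0.0]
x = count_vector([0, 0, 2], vocab_size=3)
print(maxent_posterior(x, weights, bias))
```

Subtracting the largest score before exponentiating leaves the probabilities unchanged and avoids overflow for long sequences.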
ARTICLE PIPELINE
• Hidden Markov Model (HMM):
  • Decoding: find the most probable sequence of hidden states (topics) given the model and a sequence of observations (words).
  • $A = \{a_{ij}\}$ with $a_{ij} = P(Q_{t+1} = j \mid Q_t = i)$ … transition probabilities
  • $B = \{b_j(k)\}$ with $b_j(k) = P(O_t = k \mid Q_t = j)$ … emission probabilities
  • $\pi_i = P(Q_1 = i)$ … initial state probabilities
  • $\theta = (S, V, A, B, \pi)$
• Combining MaxEnt and HMM (Bayes):
  • $b_j(k) = P(O_t = k \mid Q_t = j) = \frac{P(Q_t = j \mid O_t = k) \cdot P(O_t = k)}{\sum_{l=1}^{M} P(Q_t = j \mid O_t = l) \cdot P(O_t = l)}$
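Decoding is commonly done with the Viterbi algorithm; the slides do not name the algorithm, so the following is a sketch under that assumption. The function name `viterbi`, the 0-based state and word indices and the toy probabilities are illustrative only.

```python
import math


def viterbi(observations, initial, transition, emission):
    """Return the most probable hidden-state sequence for the observations.

    initial[i]       ~ P(Q_1 = i)
    transition[i][j] ~ P(Q_{t+1} = j | Q_t = i)
    emission[j][k]   ~ P(O_t = k | Q_t = j)
    """
    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    n_states = len(initial)
    # best[j]: log-probability of the best path that ends in state j
    best = [log(initial[j]) + log(emission[j][observations[0]]) for j in range(n_states)]
    backpointers = []
    for obs in observations[1:]:
        step_best, step_back = [], []
        for j in range(n_states):
            prev = max(range(n_states), key=lambda i: best[i] + log(transition[i][j]))
            step_back.append(prev)
            step_best.append(best[prev] + log(transition[prev][j]) + log(emission[j][obs]))
        best = step_best
        backpointers.append(step_back)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda j: best[j])
    path = [state]
    for step_back in reversed(backpointers):
        state = step_back[state]
        path.append(state)
    return path[::-1]


# Toy example: states 0 = "Keyboard", 1 = "Display"; words 0 = "key", 1 = "bright".
initial = [0.5, 0.5]
transition = [[0.8, 0.2], [0.2, 0.8]]
emission = [[0.9, 0.1], [0.1, 0.9]]
print(viterbi([0, 0, 1, 1], initial, transition, emission))
```

Working in log space avoids numerical underflow on long observation sequences.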
[Figure: HMM illustration with the topics "Keyboard" and "Display" and the words "key" and "bright"]
ARTICLE PIPELINE
• Combining MaxEnt and HMM (Bayes):
  • $b_j(k) = \frac{P(Q_t = j \mid O_t = k) \cdot P(O_t = k)}{\sum_{l=1}^{M} P(Q_t = j \mid O_t = l) \cdot P(O_t = l)}$
  • With the estimate $\hat{p}_k \cong P(O = k)$: $b_j(k) = \frac{P(Q_t = j \mid O_t = k) \cdot \hat{p}_k}{\sum_{l=1}^{M} P(Q_t = j \mid O_t = l) \cdot P(O_t = l)}$
  • With the denominator written as a normalizing constant $C(j)$: $b_j(k) = P(Q_t = j \mid O_t = k) \cdot \hat{p}_k \cdot C^{-1}(j)$
  • With the MaxEnt posterior for a single word, $P(Q = j \mid O = k) = \frac{e^{\lambda_{kj} + \mu_j}}{\sum_{i=1}^{N} e^{\lambda_{ki} + \mu_i}}$: $b_j(k) = \frac{e^{\lambda_{kj} + \mu_j} \cdot \hat{p}_k}{\sum_{i=1}^{N} e^{\lambda_{ki} + \mu_i}} \cdot C^{-1}(j)$
� 415 = H � ⋅ G(�) G � = �∑ I(�)=�,-
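The Bayes step above can be sketched numerically: given per-word topic posteriors from a classifier and an estimate of the word probabilities, the emission probabilities follow by multiplying and normalizing over the vocabulary. The function name `emissions_from_posteriors` and the use of relative word frequencies as the word estimate are assumptions for illustration.

```python
def emissions_from_posteriors(posterior, word_prior):
    """Turn P(topic j | word k) into P(word k | topic j) with Bayes' rule.

    posterior[k][j] ~ P(Q_t = j | O_t = k), e.g. from a MaxEnt classifier
    word_prior[k]   ~ estimate of P(O = k), e.g. relative word frequency
    """
    n_words = len(posterior)
    n_topics = len(posterior[0])
    emission = []
    for j in range(n_topics):
        unnormalized = [posterior[k][j] * word_prior[k] for k in range(n_words)]
        norm = sum(unnormalized)  # estimated P(Q_t = j), the normalizing constant
        emission.append([u / norm for u in unnormalized])
    return emission


# Toy example with 3 words and 2 topics (hypothetical numbers).
posterior = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
word_prior = [0.5, 0.3, 0.2]
for row in emissions_from_posteriors(posterior, word_prior):
    print(row)
```

Each returned row sums to one, so it can be used directly as one row of an HMM emission matrix.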
ARTICLE PIPELINE
• 3,152 reviews
• 240,220 manually labelled sentences
• 17 predefined topics
ARTICLE PIPELINE
• Distance of the topics' MaxEnt distributions (bottom left) and the topics' word frequencies (top right).
• Hellinger distance: $H^2(P, Q) = \frac{1}{2} \lVert \sqrt{P} - \sqrt{Q} \rVert_2^2$
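For two discrete distributions given as lists of probabilities, the Hellinger distance can be computed as in this short sketch. The function name `hellinger` is an assumption; the function returns the distance itself, that is, the square root of half the squared Euclidean distance between the element-wise square roots.

```python
import math


def hellinger(p, q):
    # H(P, Q) = sqrt(1/2 * sum_i (sqrt(p_i) - sqrt(q_i))^2)
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    squared = 0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared)


print(hellinger([0.5, 0.5], [0.5, 0.5]))  # identical distributions
print(hellinger([1.0, 0.0], [0.0, 1.0]))  # disjoint support
```

The distance is symmetric and bounded between 0 (identical distributions) and 1 (disjoint support), which makes pairwise topic distances directly comparable.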
ARTICLE PIPELINE
[Figure: Classifier Accuracy for Topic Detection, comparing MaxEnt and HMM; y-axis: accuracy from 0% to 100%]
ARTICLE PIPELINE
• MaxEnt
  • Baseline
• Recurrent neural network (RNN)
  • Best performance on sentence level
• Recursive neural tensor network (RNTN)
  • Requires parsed syntax tree
  • Good on word level
Image credit: Socher et al. https://www.slideshare.net/jiessiecao/parsing-natural-scenes-and-natural-language-with-recursive-neural-networks
ARTICLE PIPELINE
• 21,695 manually labelled sentences
• 5-class analysis:
  • very positive
  • positive
  • neutral
  • negative
  • very negative
• 3-class analysis:
  • positive
  • neutral
  • negative
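One plausible way to relate the two label sets is to merge the "very" classes into their base classes. The slides do not state how the 3-class labels were obtained, so the mapping `FIVE_TO_THREE` and the helper `collapse_labels` below are assumptions for illustration.

```python
# Assumed mapping from the 5-class to the 3-class label set.
FIVE_TO_THREE = {
    "very positive": "positive",
    "positive": "positive",
    "neutral": "neutral",
    "negative": "negative",
    "very negative": "negative",
}


def collapse_labels(labels):
    # Map each 5-class label to its 3-class counterpart.
    return [FIVE_TO_THREE[label] for label in labels]


print(collapse_labels(["very positive", "neutral", "very negative"]))
```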
ARTICLE PIPELINE
Classifier Accuracy for Sentiment Analysis

| Classifier | 5-class | 3-class |
|---|---|---|
| RNTN | 56.71% | 66.90% |
| RNN | 59.19% | 68.34% |
| MaxEnt | 55.62% | 65.82% |
LESSONS LEARNT
• Language is ambiguous.
• Lack of standardized features or off-the-shelf (preprocessing) methods.
• There isn't only English.
ONGOING RESEARCH
• Representation learning with CNNs (deep neural network)
• Change fully connected layers to adapt net to new tasks
• Multiple languages: reuse network architecture
• Character-wise approach for reusability of representations
http://www.wildml.com
LITERATURE
[MaxEnt] A. Berger, S. Della Pietra and V. Della Pietra, “A maximum entropy approach to natural language processing,” Computational Linguistics, vol. 22(1), pp. 39-71, 1996.
[HMM] O. Cappé, E. Moulines and T. Rydén, “Inference in Hidden Markov Models,” New York, Springer, 2005.
[RNN] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9(8), pp. 1735-1780, 1997.
[RNTN] R. Socher, A. Perelygin, J. Wu, J. Chuang, C. Manning, A. Ng and C. Potts, “Recursive deep models for semantic compositionality over a sentiment treebank,” Conference on Empirical Methods in Natural Language Processing, 2013.
[UIMA] D. Ferrucci and A. Lally, “UIMA: An architectural approach to unstructured information processing in the corporate research environment,” Natural Language Engineering, vol. 10(3), pp. 327-348, 2004.