© author(s) of these slides including research results from the KOM research network and TU Darmstadt; otherwise it is specified at the respective slide. 14 September 2014
Prof. Dr.-Ing. Ralf Steinmetz, KOM – Multimedia Communications Lab
iKNOW_SentenceClassification__SebS___2014.09.18.pptx
Authors: Sebastian Schmidt (presenting), Steffen Schnitzer, Christoph Rensing
Generic Sentence Classification: Examining the Scenario of Scientific Abstracts and Scrum Protocols
KOM – Multimedia Communications Lab 2
Outline
- Introduction: Motivation; Challenge and Concept
- Scenarios: Overview; Corpora
- Approach used for classification
- Evaluation: Setup; Results for the scenarios
- Conclusion and Future Work
Motivation
- Information overload through a flood of textual documents: professional, research, and educational settings
- Hard for individuals to find textual documents relevant to their information need
- String-based filtering can help reduce the number of documents to be read:
  "Find online tutorials that deal with Java"
  "I am searching for a job in the pharmaceutical sector"
Challenge & Concept
- Contextual ambiguity:
  "Cleaning staff wanted! We are a company in the pharmaceutical sector." (Company Description)
  vs.
  "We are recruiting people with pharmaceutical training" (Requirements)

  "For taking this course you should know about Java programming." (Prerequisites)
  vs.
  "After this course you will be an expert in Java programming." (Learning Goals)
- Pre-filtering of text sections can help, based on the type of information contained
- Goal: a generic concept for sentence-type classification
Scenarios: Abstracts of Scientific Articles
- The abstract presents the article's content in condensed form
- Typical queries from researchers:
  "Which other articles face a particular problem?"
  "Which other articles use a particular approach?"
  "Which approach performs best for a specific problem?"
- Types can be assigned to the sentences, e.g. Motivation, Goals, Related Work
  → Knowing the sentence type simplifies the execution of such queries
Scenarios: Protocols of Scrum Retrospective Meetings
- Common questions (with variations): "What went well?", "What went wrong?", "What could be improved?"
- Often informal content: "Testing took too long", "Teamwork was excellent", ...
- Management might be interested in particular questions only
- Automated assignment of sentences to questions could simplify the creation of the protocols
Corpora: Abstracts of Scientific Articles (Multimedia)
- Annotation study: 8 sentence labels defined, 3 annotators, 81 abstracts with 628 sentences collected
- Taken from a multimedia research journal; broken into sentences by us
- Majority agreement for 86.94% of the sentences, total agreement for only 40.76%; the majority labels are used
→ Corpus MM
Corpora: Abstracts of Scientific Articles ([1])
- 1,000 abstracts with 8,633 sentences, biomedical domain
- 7 classes: Background, Objective, Result, ...
- Sentences annotated with one label each by three annotators; high inter-annotator agreement (κ = 0.85)
  → Annotations of only one annotator were used
→ Corpus BioM
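The reported inter-annotator agreement is given as Cohen's κ. For illustration, a minimal sketch of computing κ between two annotators; the labels below are hypothetical toy data, not taken from the corpus:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same sentences."""
    n = len(labels_a)
    # Observed agreement: fraction of sentences with identical labels
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: expected overlap given each annotator's label distribution
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical toy labels (class names borrowed from the BioM scheme)
ann1 = ["Background", "Objective", "Result", "Result", "Background"]
ann2 = ["Background", "Objective", "Result", "Background", "Background"]
print(cohens_kappa(ann1, ann2))
```

Values close to 1 indicate near-perfect agreement; the 0.85 reported for BioM is why using a single annotator's labels is reasonable there.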
Corpora: Protocols of Scrum Retrospective Meetings
- 139 Scrum retrospective protocols from a major software company, 653 sentences
- Sentences were clustered into "What went well?", "What went wrong?", "What could be improved?" → Corpus Scrum
- All sentences that humans could not assign to a cluster were removed, e.g. "Timing", "Collaboration with Peter Smith" → Corpus Scrum_Subset
Approach
- Supervised classification with domain-independent features
- 10 feature groups:

Feature group             Description
Content                   All words as features
Sentiment                 Positive/negative based on word-to-sentiment mapping
Negation                  Count of negation words
Tense                     Based on Stanford Lexicalized Parser
Tense indicator           Based on word endings and modal verbs
Adjectives                Based on Stanford Lexicalized Parser
Indicative indicator      Count of "need", "should", "must"
Personal pronouns         Based on Stanford Lexicalized Parser
Position of the sentence  Normalized position of the sentence within its context
Number of words           Total number of words
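A minimal sketch of how a few of these feature groups could be computed; the negation word list and sentiment lexicon are toy placeholders, not the resources used in the paper, and the parser-based groups (tense, adjectives, pronouns) are omitted:

```python
def extract_features(sentence, index, total_sentences, sentiment_lexicon):
    """Compute a handful of the domain-independent feature groups for one sentence."""
    words = sentence.lower().split()
    negations = {"not", "no", "never", "n't", "none"}   # assumed word list
    indicative = {"need", "should", "must"}             # from the feature table
    return {
        # Number of words: total number of words
        "num_words": len(words),
        # Position: normalized position of the sentence within its context
        "position": index / max(total_sentences - 1, 1),
        # Negation: count of negation words
        "negation_count": sum(w in negations for w in words),
        # Indicative indicator: count of "need", "should", "must"
        "indicative_count": sum(w in indicative for w in words),
        # Sentiment: score based on a word-to-sentiment mapping
        "sentiment": sum(sentiment_lexicon.get(w, 0) for w in words),
    }

lexicon = {"excellent": 1, "wrong": -1, "long": -1}     # toy mapping
feats = extract_features("Teamwork was excellent", 0, 3, lexicon)
print(feats["sentiment"], feats["num_words"])           # → 1 3
```

Each sentence then becomes one feature vector (plus the bag of words from the Content group) for the supervised classifiers.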
Evaluation Setup
- Different classifiers used: Support Vector Machines, Naïve Bayes, J48
- Weka
- 10-fold cross-validation
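The experiments were run in Weka; an analogous setup can be sketched in scikit-learn, with `DecisionTreeClassifier` standing in for J48 (C4.5) and synthetic data replacing the corpora:

```python
# Sketch of the evaluation protocol: three classifiers, 10-fold cross-validation.
# Data and classifier choices are placeholders, not the original Weka setup.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier  # closest analogue to J48 (C4.5)

X, y = make_classification(n_samples=600, n_features=50, n_classes=3,
                           n_informative=10, random_state=0)

for name, clf in [("SVM", LinearSVC()),
                  ("NB", GaussianNB()),
                  ("Tree", DecisionTreeClassifier(random_state=0))]:
    # 10-fold cross-validation, scored with macro F1
    scores = cross_val_score(clf, X, y, cv=10, scoring="f1_macro")
    print(name, round(scores.mean(), 3))
```

Each classifier is trained on 9 folds and tested on the held-out fold, so every sentence is used for testing exactly once.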
Evaluation: Abstracts of Scientific Articles (F1-Measure)

                          MM                  BioM
                   SVM    NB     J48    SVM    NB     J48
All features       0.692  0.690  0.640  0.798  0.731  0.739
Single feature
  Words            0.634  0.668  0.575  0.748  0.683  0.668
  Position         0.489  0.487  0.492  0.557  0.540  0.554
  Tense indicator  0.278  0.279  0.265  0.254  0.319  0.319
All except single feature
  Words            0.555  0.492  0.510  0.666  0.605  0.648
  Position         0.634  0.656  0.576  0.750  0.670  0.675
  Adjectives       0.699  0.692  0.641  0.799  0.735  0.738

- Best results for SVM
- Words alone already gives acceptable results
- Results can be better when not using all features
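As a reminder, the F1-measure is the harmonic mean of precision and recall, averaged over the classes for multi-class tasks. A macro-averaged sketch with toy Scrum-style labels; the exact averaging scheme used for the tables is an assumption here:

```python
from sklearn.metrics import f1_score

# Toy gold and predicted labels for three hypothetical sentence classes
y_true = ["well", "wrong", "improve", "well", "wrong", "improve"]
y_pred = ["well", "wrong", "well", "well", "improve", "improve"]

# Macro averaging: compute F1 per class, then take the unweighted mean
print(round(f1_score(y_true, y_pred, average="macro"), 3))  # → 0.656
```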
Evaluation: Abstracts of Scientific Articles
- Different tag sets for the same kind of corpus seem to have only a minor influence on the results
  → The size of the evaluation data is more relevant
Evaluation: Protocols of Scrum Retrospective Meetings (F1-Measure)

                          Scrum               Scrum_Subset
                   SVM    NB     J48    SVM    NB     J48
All features       0.572  0.562  0.513  0.661  0.669  0.592
Single feature
  Words            0.552  0.533  0.485  0.647  0.644  0.546
  Sentiment        0.323  0.379  0.425  0.415  0.464  0.458
  Tense indicator  0.357  0.339  0.410  0.366  0.366  0.315
All except single feature
  Words            0.467  0.484  0.466  0.550  0.570  0.548
  Sentiment        0.558  0.550  0.495  0.656  0.650  0.565
  Adjectives       0.572  0.560  0.520  0.664  0.685  0.606

- Best results for SVM/NB
- In the subset, Sentiment is meaningful
- Results can be better when not using all features
Conclusion & Future Work
- Results are generally good, even though the training corpora are not very large; no domain-specific features are required
- Worse results for the Scrum scenarios: incorrect grammar, many typos, shorter sentences
- Adding contextual information might be helpful
- An implementation within an application is needed to evaluate the usefulness of the filtering concept
Questions & Contact
References
[1] Y. Guo, A. Korhonen, M. Liakata, I. Silins, L. Sun, and U. Stenius. Identifying the information structure of scientific abstracts: An investigation of three different schemes. In Proceedings of the 2010 Workshop on Biomedical Natural Language Processing, BioNLP '10, pages 99–107, Stroudsburg, PA, USA, 2010. Association for Computational Linguistics.