
Generic Sentence Classification: Examining the Scenario of Scientific Abstracts and Scrum Protocols

Authors: Sebastian Schmidt (presenting), Steffen Schnitzer, Christoph Rensing

Prof. Dr.-Ing. Ralf Steinmetz
KOM - Multimedia Communications Lab

14-September-2014

© author(s) of these slides including research results from the KOM research network and TU Darmstadt; otherwise it is specified at the respective slide

iKNOW_SentenceClassification__SebS___2014.09.18.pptx

Image source: www.moebellisten.de

Outline

- Introduction
  - Motivation
  - Challenge and concept
- Scenarios
  - Overview
  - Corpora
- Approach used for classification
- Evaluation
  - Setup
  - Results for the scenarios
- Conclusion and Future Work

Motivation

- Information overload through a flood of textual documents
  - Professional settings
  - Research settings
  - Educational settings
- It is hard for individuals to find the textual documents that are relevant to their information need
- String-based filtering can help to reduce the number of documents to be read
  - “Find online tutorials that deal with Java”
  - “I am searching for a job in the pharmaceutical sector”

Challenge & Concept

- Contextual ambiguity
  - “Cleaning staff wanted! We are a company in the pharmaceutic sector.” (Company Description)
    vs.
    “We are acquiring people having pharmaceutic training” (Requirements)
  - “For taking this course you should know about Java programming.” (Prerequisites)
    vs.
    “After this course you will be an expert in Java programming.” (Learning Goals)
- Pre-filtering of text sections can help!
  - Based on the type of information contained
- Goal: A generic concept for sentence-type classification
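The contextual ambiguity on this slide can be illustrated with a minimal sketch. A plain substring filter returns both sentences of each example pair, while a filter that also checks the sentence type keeps only the wanted one. The sentences and type names are taken from the slide; the pairing of sentences with types is read from the slide layout, the types are assigned by hand here, and the function names are hypothetical. In the presented concept the types would be predicted by a classifier.

```python
# Illustrative sketch only: sentence types are hand-assigned here (assumption),
# whereas the presented concept would predict them with a classifier.
SENTENCES = [
    ("Cleaning staff wanted! We are a company in the pharmaceutic sector.",
     "Company Description"),
    ("We are acquiring people having pharmaceutic training", "Requirements"),
    ("For taking this course you should know about Java programming.",
     "Prerequisites"),
    ("After this course you will be an expert in Java programming.",
     "Learning Goals"),
]


def string_filter(keyword):
    """Return every sentence containing the keyword (case-insensitive)."""
    return [text for text, _ in SENTENCES if keyword.lower() in text.lower()]


def typed_filter(keyword, wanted_type):
    """Return sentences containing the keyword whose type equals wanted_type."""
    return [text for text, label in SENTENCES
            if keyword.lower() in text.lower() and label == wanted_type]


# The keyword alone cannot separate prerequisites from learning goals.
print(string_filter("Java"))
# Adding the sentence type resolves the ambiguity.
print(typed_filter("Java", "Learning Goals"))
```

The second call returns a single sentence, which is the pre-filtering effect the slide argues for.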


Scenarios: Abstracts of Scientific Articles

- An abstract presents the content of an article in condensed form
- Typical queries from researchers
  - Which other articles face a particular problem?
  - Which other articles use a particular approach?
  - Which approach performs best for a specific problem?
- Types can be assigned to the sentences, e.g.
  - Motivation
  - Goals
  - Related Work
  → Knowing this type simplifies the execution of the queries

Scenarios: Protocols of Scrum Retrospective Meetings

- Common questions (with variations)
  - What went well?
  - What went wrong?
  - What could be improved?
- Often informal content
  - “Testing took too long”
  - “Teamwork was excellent”
  - …
- Management might be interested in particular ones only
- Automated assignment to questions could simplify the creation of the protocols

Image source: commons.wikimedia.org

Corpora: Abstracts of Scientific Articles (Multimedia)

- Annotation study
  - 8 sentence labels defined
  - 3 annotators
  - 81 abstracts with 628 sentences collected
    - Taken from a multimedia research journal
    - Broken into sentences by us
- Majority for 86.94% of the sentences
- Total agreement for 40.76% only
- Majority labels are used
→ Corpus MM

Image source: http://digitalsherpa.com/how-to-use-social-media-to-conduct-market-research/
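How majority labels and the two agreement shares on this slide can be derived is sketched below. This is a minimal illustration under assumptions: the toy annotations are invented for the example, the function names are hypothetical, and the slides do not say how sentences without a majority were handled.

```python
from collections import Counter


def majority_label(labels):
    """Return the label chosen by more than half of the annotators, else None."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count > len(labels) / 2 else None


def agreement_stats(annotations):
    """Return (share with a majority label, share with total agreement)."""
    n = len(annotations)
    with_majority = sum(majority_label(a) is not None for a in annotations)
    unanimous = sum(len(set(a)) == 1 for a in annotations)
    return with_majority / n, unanimous / n


# Hypothetical toy data: three annotators, one label each per sentence.
toy = [
    ("Motivation", "Motivation", "Motivation"),  # total agreement
    ("Goals", "Goals", "Related Work"),          # majority only
    ("Motivation", "Goals", "Related Work"),     # no majority
    ("Goals", "Goals", "Goals"),                 # total agreement
]
majority_share, unanimous_share = agreement_stats(toy)
print(majority_share, unanimous_share)  # 0.75 0.5
```

If the percentages on the slide refer to all 628 sentences, they correspond to about 546 sentences with a majority label and about 256 with total agreement.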

Corpora: Abstracts of Scientific Articles ([1])

- 1,000 abstracts
- 8,633 sentences
- Biomedical domain
- 7 classes
  - Background
  - Objective
  - Result
  - …
- Sentences annotated with one label by three annotators
- High inter-annotator agreement (κ = 0.85)
  → Annotations of only one annotator were used
→ Corpus BioM

Image source: http://www.dmu.ac.uk/research/research-faculties-and-institutes/health-and-life-sciences/biomedical-and-environmental-health/biomedical-and-environmental-health.aspx
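The slide reports κ = 0.85 without naming the kappa variant. As a hedged illustration of how such an agreement score is computed, the sketch below implements Cohen's kappa for two annotators; whether [1] used this variant or a multi-annotator one is not stated on the slide. The label sequences are invented for the example.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equally long, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical annotations using class names from the slide.
a = ["Background", "Background", "Objective", "Result", "Result", "Result"]
b = ["Background", "Objective", "Objective", "Result", "Result", "Background"]
print(round(cohen_kappa(a, b), 3))  # 0.5
```

Kappa corrects the raw agreement (4 of 6 items here) for the agreement expected by chance, which is why it is lower than the raw share.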

Corpora: Protocols of Scrum Retrospective Meetings

- 139 Scrum retrospective protocols from a major software company
- 653 sentences
- Sentences were clustered into
  - “What went well?”
  - “What went wrong?”
  - “What could be improved?”
  → Corpus Scrum
- All sentences that could not be assigned to a cluster by humans were removed, e.g.
  - “Timing”
  - “Collaboration with Peter Smith”
  → Corpus Scrum_Subset


Approach

- Supervised classification with domain-independent features
- 10 feature groups
  - Content: All words as features
  - Sentiment: Positive/negative based on word-to-sentiment mapping
  - Negation: Count of negation words
  - Tense: Based on Stanford Lexicalized Parser
  - Tense indicator: Based on word endings and modal verbs
  - Adjectives: Based on Stanford Lexicalized Parser
  - Indicative indicator: Count of “need”, “should”, “must”
  - Personal pronouns: Based on Stanford Lexicalized Parser
  - Position of the sentence: Normalized position of the sentence within its context
  - Number of words: Total number of words
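Four of the ten feature groups need no parser and can be sketched in a few lines. This is an illustration under assumptions, not the authors' implementation: the negation word list, the regex tokenizer, and the position formula `index / (total - 1)` are choices made for this example, and only the indicative words “need”, “should”, “must” come from the slide.

```python
import re

NEGATION_WORDS = {"not", "no", "never", "none", "cannot"}  # assumed word list
INDICATIVE_WORDS = {"need", "should", "must"}              # from the slide


def tokenize(sentence):
    """Lowercase word tokens; a simple regex stand-in for a real tokenizer."""
    return re.findall(r"[a-z']+", sentence.lower())


def extract_features(sentence, index, total):
    """Feature dict for the sentence at 0-based index out of total sentences."""
    tokens = tokenize(sentence)
    return {
        "negation_count": sum(t in NEGATION_WORDS for t in tokens),
        "indicative_count": sum(t in INDICATIVE_WORDS for t in tokens),
        # Assumed normalization: 0.0 for the first sentence, 1.0 for the last.
        "position": index / (total - 1) if total > 1 else 0.0,
        "num_words": len(tokens),
    }


features = extract_features(
    "For taking this course you should know about Java programming.", 0, 2)
print(features)
```

None of these features depends on domain vocabulary, which is the point of the domain-independent design named on the slide.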


Evaluation Setup

- Different classifiers used
  - Support Vector Machines
  - Naïve Bayes
  - J48
- Weka
- 10-fold cross validation

Image source: http://www.cs.waikato.ac.nz/ml/weka/, http://scriptslines.com/blog/k-fold-cross-validation/
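The 10-fold cross validation named on this slide can be sketched as follows. The evaluation itself was run in Weka; this pure-Python version is only an illustration of the procedure, uses plain (unstratified) folds, and substitutes a majority-class baseline for the real classifiers. All function names and the toy data are hypothetical.

```python
import random
from collections import Counter


def k_fold_indices(n_items, k=10, seed=0):
    """Yield (train_indices, test_indices) for k roughly equal folds."""
    if not 2 <= k <= n_items:
        raise ValueError("k must be between 2 and n_items")
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def cross_val_predict(items, labels, train_fn, predict_fn, k=10):
    """Predict every item once, using a model trained on the other k-1 folds."""
    predictions = [None] * len(items)
    for train, test in k_fold_indices(len(items), k):
        model = train_fn([items[i] for i in train], [labels[i] for i in train])
        for i in test:
            predictions[i] = predict_fn(model, items[i])
    return predictions


# Stand-in classifier (assumption): always predicts the most frequent label.
def train_majority(items, labels):
    return Counter(labels).most_common(1)[0][0]


def predict_majority(model, item):
    return model


sentences = [f"sentence {i}" for i in range(20)]
gold = ["went well"] * 14 + ["went wrong"] * 6
predicted = cross_val_predict(sentences, gold, train_majority, predict_majority)
print(Counter(predicted))
```

Because every sentence is predicted by a model that never saw it during training, the pooled predictions can be scored against the gold labels in one pass.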

Evaluation: Abstracts of Scientific Articles (F1-Measure)

| Feature set | MM: SVM | MM: NB | MM: J48 | BioM: SVM | BioM: NB | BioM: J48 |
|---|---|---|---|---|---|---|
| All features | 0.692 | 0.690 | 0.640 | 0.798 | 0.731 | 0.739 |
| Single feature: Words | 0.634 | 0.668 | 0.575 | 0.748 | 0.683 | 0.668 |
| Single feature: Position | 0.489 | 0.487 | 0.492 | 0.557 | 0.540 | 0.554 |
| Single feature: Tense Indicator | 0.278 | 0.279 | 0.265 | 0.254 | 0.319 | 0.319 |
| All except single feature: Words | 0.555 | 0.492 | 0.510 | 0.666 | 0.605 | 0.648 |
| All except single feature: Position | 0.634 | 0.656 | 0.576 | 0.750 | 0.670 | 0.675 |
| All except single feature: Adjectives | 0.699 | 0.692 | 0.641 | 0.799 | 0.735 | 0.738 |

- Best results for SVM
- Words alone gives results that are OK
- Results can be better when not using all features
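For readers unfamiliar with the F1-measure reported in this table, a minimal sketch of its computation for a multi-class task follows. The slides do not state how the per-class scores were averaged; the frequency-weighted average used here is an assumption, and the label sequences are invented for the example.

```python
from collections import Counter


def f1_per_class(gold, predicted):
    """Return {label: F1} computed from parallel gold/predicted label lists."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label was wrong
            fn[g] += 1  # gold label was missed
    scores = {}
    for label in set(gold) | set(predicted):
        # F1 = 2*TP / (2*TP + FP + FN), the harmonic mean of precision and recall.
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def weighted_f1(gold, predicted):
    """Average the per-class F1 scores, weighted by class frequency in gold."""
    scores = f1_per_class(gold, predicted)
    support = Counter(gold)
    return sum(scores[label] * n for label, n in support.items()) / len(gold)


# Hypothetical labels using sentence types named earlier in the slides.
gold = ["Motivation", "Motivation", "Goals", "Goals", "Related Work"]
pred = ["Motivation", "Goals", "Goals", "Goals", "Motivation"]
print(f1_per_class(gold, pred))
print(round(weighted_f1(gold, pred), 3))  # 0.52
```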

Evaluation: Abstracts of Scientific Articles

- Different tag sets for the same kind of corpus seem to have only a minor influence on the results
  → Size of evaluation data is more relevant

Evaluation: Protocols of Scrum Retrospective Meetings (F1-Measure)

| Feature set | Scrum: SVM | Scrum: NB | Scrum: J48 | Scrum_Subset: SVM | Scrum_Subset: NB | Scrum_Subset: J48 |
|---|---|---|---|---|---|---|
| All features | 0.572 | 0.562 | 0.513 | 0.661 | 0.669 | 0.592 |
| Single feature: Words | 0.552 | 0.533 | 0.485 | 0.647 | 0.644 | 0.546 |
| Single feature: Sentiment | 0.323 | 0.379 | 0.425 | 0.415 | 0.464 | 0.458 |
| Single feature: Tense Indicator | 0.357 | 0.339 | 0.410 | 0.366 | 0.366 | 0.315 |
| All except single feature: Words | 0.467 | 0.484 | 0.466 | 0.550 | 0.570 | 0.548 |
| All except single feature: Sentiment | 0.558 | 0.550 | 0.495 | 0.656 | 0.650 | 0.565 |
| All except single feature: Adjectives | 0.572 | 0.560 | 0.520 | 0.664 | 0.685 | 0.606 |

- Best results for SVM/NB
- In the subset Sentiment is meaningful
- Results can be better when not using all features
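The “all except single feature” rows are easiest to read as differences from the “all features” row. The sketch below computes those differences from the F1 values in the Scrum results above; the helper name and column labels are hypothetical, the numbers are copied from the slide.

```python
# F1 values copied from the Scrum results; one entry per corpus/classifier pair.
COLUMNS = ["Scrum SVM", "Scrum NB", "Scrum J48",
           "Subset SVM", "Subset NB", "Subset J48"]
ALL_FEATURES = [0.572, 0.562, 0.513, 0.661, 0.669, 0.592]
WITHOUT = {
    "Words": [0.467, 0.484, 0.466, 0.550, 0.570, 0.548],
    "Sentiment": [0.558, 0.550, 0.495, 0.656, 0.650, 0.565],
    "Adjectives": [0.572, 0.560, 0.520, 0.664, 0.685, 0.606],
}


def ablation_deltas(full, ablated):
    """F1 change after removing one feature group (negative: the group helped)."""
    return [round(a - f, 3) for f, a in zip(full, ablated)]


for group, scores in WITHOUT.items():
    print(group, dict(zip(COLUMNS, ablation_deltas(ALL_FEATURES, scores))))
```

Removing Words lowers F1 in every column, while removing Adjectives leaves it unchanged or slightly higher in five of the six columns, which matches the remark that results can be better when not using all features.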


Conclusion & Future Work

- Results generally good
  - Even though the training corpora are not very large
  - No domain-specific features required
- Worse results for Scrum scenarios
  - Incorrect grammar
  - Many typos
  - Shorter sentences
- Adding contextual information might be helpful
- Implementation in an application is needed to evaluate the usefulness of the filtering concept

Questions & Contact

Image Source: http://www.dreifragezeichen.de/

References

[1] Y. Guo, A. Korhonen, M. Liakata, I. S. Karolinska, L. Sun, and U. Stenius. Identifying the information structure of scientific abstracts: An investigation of three different schemes. In Proceedings of the 2010 Workshop on Biomedical Natural Language Processing, BioNLP ’10, pages 99–107, Stroudsburg, PA, USA, 2010. Association for Computational Linguistics.

Backup Slides: Results Scientific Abstracts

Backup Slides: Results Scrum