
Post on 25-Aug-2020


The CASS Technique for Evaluating the Performance of Argument Mining

Rory Duthie, John Lawrence, Katarzyna Budzynska, Chris Reed

Centre for Argument Technology, University of Dundee

IFiS Polish Academy of Sciences

Outline

Motivation and Aim

• Problems when publishing evaluation and results

• CASS (Combined Argument Similarity Score)

Metric

• How CASS is calculated

Automation

• Deployment of CASS


Motivation

•Consistency for the Argument Mining community

•Metric which does not double penalise mismatches

•Automate the calculations


Motivation: Consistency for the community

From the 2nd Workshop on Argument(ation) Mining:

• Inter-annotator agreement:

  3 papers - Cohen’s Kappa

  3 papers - percentage agreement

  2 papers - precision and recall

  3 papers - other methods

• Automatic Argument Mining results:

  4 papers - accuracy

  5 papers - precision, recall and F-score

  1 paper - macro-averaged F-score

• Other Metrics in Comp Ling: ROUGE, in text summarization

Motivation: Metric (1/3)

(Kirschner et al., 2015) provides:

• Graph-based approach, APA, weighted average

Problems:

• Segmentation differences

• Propositional content relations only

• Not all nodes in an analysis (Distance < 6)

• Relation direction ignored

• Set metrics

Motivation: Metric (2/3)

CASS extends (Kirschner et al., 2015):

• Segmentation differences

• Propositional content relations and dialogical content relations:

• confusion matrices

• all nodes

• differing segmentation

Motivation: Metric (3/3)

• Use CASS to combine scores

• CASS with any metric

• Annotator agreement and Argument Mining results

• Comparison of analysis in different annotation schemes


Motivation: Automatic Solution

Manual vs. manual: Cohen’s Kappa, Fleiss’ Kappa…

Manual vs. automatic: Precision, Recall, F-score, Accuracy…


Metric: Segmentation (1/4)


Still, it is possible that, should war erupt in Iraq, American and British forces might fall foul of, for example, the provision of the ICC treaty outlawing attacks on military targets that cause "clearly excessive" harm to civilians.

Metric: Segmentation (2/4)


That is especially so if they do not learn lessons from recent wars and take corrective steps. The weapon most likely to produce such harm is the cluster bomb.

Metric: Segmentation (3/4)

[Figure: two segmentations, S1 and S2, of the following passage]

Still, it is possible that, should war erupt in Iraq, American and British forces might fall foul of, for example, the provision of the ICC treaty outlawing attacks on military targets that cause "clearly excessive" harm to civilians.

Metric: Segmentation (4/4)

•Pk - (Beeferman et al., 1999)

•WindowDiff - (Pevzner and Hearst, 2002)

•Segmentation Similarity - (Fournier and Inkpen, 2012)
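To make the first two metrics concrete, here is a minimal sketch of Pk and WindowDiff. It assumes each segmentation is given as a list of segment lengths in words, and that the window size k defaults to half the mean reference segment length; the function names and this input format are illustrative choices, not the authors' implementation.

```python
def boundary_positions(masses):
    """Turn segment lengths (in words) into boundary positions and a total length."""
    positions, total = set(), 0
    for mass in masses[:-1]:
        total += mass
        positions.add(total)  # boundary sits between word total-1 and word total
    return positions, sum(masses)


def _count_in_window(bounds, i, k):
    """Number of boundaries separating word i from word i + k."""
    return sum(1 for b in bounds if i < b <= i + k)


def _prepare(ref, hyp, k):
    ref_b, n = boundary_positions(ref)
    hyp_b, n_hyp = boundary_positions(hyp)
    if n != n_hyp:
        raise ValueError("segmentations must cover the same number of words")
    if k is None:
        k = max(1, round(n / (2 * len(ref))))  # half the mean reference segment length
    return ref_b, hyp_b, n, k


def pk(ref, hyp, k=None):
    """Pk: share of windows where ref and hyp disagree on 'same segment or not'."""
    ref_b, hyp_b, n, k = _prepare(ref, hyp, k)
    errors = sum(
        (_count_in_window(ref_b, i, k) == 0) != (_count_in_window(hyp_b, i, k) == 0)
        for i in range(n - k)
    )
    return errors / (n - k)


def window_diff(ref, hyp, k=None):
    """WindowDiff: share of windows where the boundary counts differ."""
    ref_b, hyp_b, n, k = _prepare(ref, hyp, k)
    errors = sum(
        _count_in_window(ref_b, i, k) != _count_in_window(hyp_b, i, k)
        for i in range(n - k)
    )
    return errors / (n - k)


# The hypothesis adds one extra boundary right next to a correct one.
print(pk([2, 4], [2, 1, 3], k=2))           # 0.25
print(window_diff([2, 4], [2, 1, 3], k=2))  # 0.5
```

Both are error rates, so 0 means identical segmentations. WindowDiff counts the extra boundary in every window it falls in, while Pk misses it wherever the reference also has a boundary in the window.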


Metric: Calculating Relations

•Guaranteed matching formula used for all propositions and locutions

•We use the Levenshtein distance

•Levenshtein distance and word positions are combined to give node matches
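As an illustration of the matching step, the sketch below computes the Levenshtein distance and uses it to pair nodes across two annotations. The slides do not give the matching formula, so the way text similarity and word position are combined here (highest similarity first, nearest word position as tie-breaker) is an assumption, as are the names `similarity` and `match_nodes`.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        curr = [i]
        for j, char_b in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,                       # delete char_a
                curr[j - 1] + 1,                   # insert char_b
                prev[j - 1] + (char_a != char_b),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Levenshtein distance rescaled to [0, 1], where 1 means identical text."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


def match_nodes(nodes_a, nodes_b):
    """Pair every node in A with its best candidate in B.

    Each node is a (text, start_word_position) tuple. Assumed rule: prefer the
    most similar text, and break ties by the closest word position.
    """
    matches = {}
    for i, (text_a, pos_a) in enumerate(nodes_a):
        matches[i] = max(
            range(len(nodes_b)),
            key=lambda j: (similarity(text_a, nodes_b[j][0]),
                           -abs(pos_a - nodes_b[j][1])),
        )
    return matches


annotation_1 = [("war may erupt", 0), ("forces fall foul", 10)]
annotation_2 = [("forces might fall foul", 9), ("war may erupt in Iraq", 0)]
print(match_nodes(annotation_1, annotation_2))  # {0: 1, 1: 0}
```

Because every node in the first annotation receives some partner, a match is always produced, even when the two annotators segmented the text differently.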


Metric: Propositional Relations (1/3)

[Figure: argument graphs over numbered nodes for Annotation 1 and Annotation 2]

Metric: Propositional Relations (2/3)

[Figure: argument graphs over numbered nodes for Annotation 1 and Annotation 2]

Metric: Propositional Relations (3/3)

•Pair nodes and check the relation attached

•When there is a differing segmentation, consider fine grained and convergent arguments

•All node pairs are considered to give a confusion matrix
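The pairing step can be sketched as follows, assuming that nodes have already been matched across the two annotations and that each annotation is a mapping from an ordered (source, target) node pair to a relation label. The label names and the `relation_confusion` helper are hypothetical.

```python
from collections import Counter
from itertools import permutations


def relation_confusion(nodes, relations_a, relations_b, none_label="none"):
    """Count (label in A, label in B) over every ordered pair of nodes.

    Ordered pairs keep the direction of a relation, and pairs without a
    relation are counted under none_label so that all nodes contribute.
    """
    matrix = Counter()
    for pair in permutations(nodes, 2):
        label_a = relations_a.get(pair, none_label)
        label_b = relations_b.get(pair, none_label)
        matrix[(label_a, label_b)] += 1
    return matrix


nodes = [1, 2, 3]
annotation_1 = {(2, 1): "support", (3, 1): "support"}
annotation_2 = {(2, 1): "support", (3, 2): "support"}
matrix = relation_confusion(nodes, annotation_1, annotation_2)
print(matrix[("support", "support")])  # 1
print(matrix[("none", "none")])        # 3
```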


Metric: Dialogical Relations (1/3)


Metric: Dialogical Relations (2/3)


Metric: Dialogical Relations (3/3)

•Split calculation into parts

•When there is a differing segmentation, consider matched pairs

•All node pairs are considered to give a confusion matrix


CASS technique

•Combine scores for the CASS technique

•Applied to any consistent combination of scores
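The slides do not spell out the combination formula. One plausible reading, sketched below, averages the relation scores and then takes the harmonic mean of that average and the segmentation score, in the style of an F1 score. Treat the formula and the name `combine_scores` as assumptions.

```python
def combine_scores(segmentation, relation_scores):
    """Harmonic mean of the segmentation score and the mean relation score.

    Assumes all scores are on the same scale, e.g. all kappa values or all
    F-scores, with higher meaning better agreement.
    """
    mean_relations = sum(relation_scores) / len(relation_scores)
    if mean_relations + segmentation == 0:
        return 0.0
    return 2 * mean_relations * segmentation / (mean_relations + segmentation)


# Segmentation score 0.8, propositional score 0.6, dialogical score 0.4.
print(round(combine_scores(0.8, [0.6, 0.4]), 3))  # 0.615
```

A harmonic mean keeps the combined score low when either component is low, so good agreement on relations cannot hide poor agreement on segmentation.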


CASS: Evaluation

•Use CASS-Kappa, as it adjusts the score for chance agreement

•Not the only score that can be used with CASS
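For reference, Cohen's kappa can be computed directly from a confusion matrix. The sketch below assumes the matrix is a mapping from (label given by annotator A, label given by annotator B) to a count; the input format is an illustrative choice.

```python
def cohen_kappa(matrix):
    """Cohen's kappa: (observed - expected) / (1 - expected) agreement."""
    total = sum(matrix.values())
    labels = {label for pair in matrix for label in pair}
    observed = sum(matrix.get((label, label), 0) for label in labels) / total
    expected = 0.0
    for label in labels:
        row = sum(count for (a, _), count in matrix.items() if a == label)
        col = sum(count for (_, b), count in matrix.items() if b == label)
        expected += (row / total) * (col / total)
    if expected == 1:
        return 1.0  # both annotators used a single label throughout
    return (observed - expected) / (1 - expected)


counts = {
    ("support", "support"): 1,
    ("support", "none"): 1,
    ("none", "support"): 1,
    ("none", "none"): 3,
}
print(round(cohen_kappa(counts), 2))  # 0.25
```

Raw agreement in this example is 4 out of 6, but most of it comes from the frequent "none" label, which is why the chance-adjusted score is much lower.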


CASS: Extension

•Any metric with a confusion matrix can be applied to CASS

• E.g. Balanced Accuracy, Informedness…

•We provide a select set, but no metric is ruled out
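As a sketch of how other metrics plug in, the two examples named above can be read off a binary confusion matrix. The argument names are illustrative, with "positive" standing for "a relation is present".

```python
def rates(tp, fn, fp, tn):
    """True positive rate (recall) and true negative rate (specificity)."""
    return tp / (tp + fn), tn / (tn + fp)


def balanced_accuracy(tp, fn, fp, tn):
    """Mean of the true positive rate and the true negative rate."""
    tpr, tnr = rates(tp, fn, fp, tn)
    return (tpr + tnr) / 2


def informedness(tp, fn, fp, tn):
    """Informedness (Youden's J): TPR + TNR - 1, where 0 is chance level."""
    tpr, tnr = rates(tp, fn, fp, tn)
    return tpr + tnr - 1


print(balanced_accuracy(tp=1, fn=1, fp=1, tn=3))  # 0.625
print(informedness(tp=1, fn=1, fp=1, tn=3))       # 0.25
```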


Automation: AIF (Argument Interchange Format)

•AIF allows us to split calculations into component parts: segmentation, propositional and dialogical

•AIF allows the translation of other representation models to AIF format

•Allows for comparison of corpora in different representations.

•However, the CASS technique is independent of AIF
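A rough sketch of such a split is shown below. It assumes an AIF graph serialised as JSON with a "nodes" list whose entries carry "nodeID", "text" and "type" fields, and it assumes that I, RA, CA and MA nodes form the propositional part while L and TA nodes form the dialogical part. Both the field names and this grouping are assumptions made for illustration.

```python
PROPOSITIONAL_TYPES = {"I", "RA", "CA", "MA"}  # propositions and relations between them
DIALOGICAL_TYPES = {"L", "TA"}                 # locutions and transitions between them


def split_aif(graph):
    """Group the nodes of an AIF-style graph into the parts scored separately."""
    parts = {"propositional": [], "dialogical": [], "other": []}
    for node in graph["nodes"]:
        if node["type"] in PROPOSITIONAL_TYPES:
            parts["propositional"].append(node["nodeID"])
        elif node["type"] in DIALOGICAL_TYPES:
            parts["dialogical"].append(node["nodeID"])
        else:
            parts["other"].append(node["nodeID"])  # e.g. YA nodes linking the two layers
    return parts


example = {
    "nodes": [
        {"nodeID": "1", "text": "The weapon is the cluster bomb", "type": "I"},
        {"nodeID": "2", "text": "Default Inference", "type": "RA"},
        {"nodeID": "3", "text": "A: The weapon is the cluster bomb", "type": "L"},
        {"nodeID": "4", "text": "Asserting", "type": "YA"},
    ]
}
print(split_aif(example))
```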


Automation: AIFdb


http://www.aifdb.org/search

Automation: AIFcorpora

http://corpora.aifdb.org/


Automation: Argument Analytics

http://analytics.arg-tech.org


Thank You.

rory@arg.tech

Find out more at http://arg.tech

Come to COMMA 2016: Conference on Computational Models of Argument (Potsdam)

Investigate the datasets at http://aifdb.org

References

Christian Kirschner, Judith Eckle-Kohler, and Iryna Gurevych. 2015. Linking the thoughts: Analysis of argumentation structures in scientific publications. In Proceedings of the Second Workshop on Argumentation Mining, pages 1–11. Association for Computational Linguistics.

Doug Beeferman, Adam Berger, and John Lafferty. 1999. Statistical models for text segmentation. Machine learning, 34(1-3):177–210.

Lev Pevzner and Marti A Hearst. 2002. A critique and improvement of an evaluation metric for text segmentation. Computational Linguistics, 28(1):19–36.

Chris Fournier and Diana Inkpen. 2012. Segmentation similarity and agreement. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 152–161. Association for Computational Linguistics.
