QUANT-Question Answering Benchmark Curator

Ria Hari Gusmita, Rricha Jalota, Daniel Vollmers, Jan Reineke, Axel-Cyrille Ngonga Ngomo, and Ricardo Usbeck

September 10, 2019

Gusmita et al QUANT September 10, 2019 1 / 33

Outline

1 Motivation

2 Approach

3 Evaluation

4 QALD-specific Analysis

5 Conclusion & Future Work

Motivation
Drawback in evaluating Question Answering systems over knowledge bases

Mainly based on benchmark datasets (benchmarks)
Challenge in maintaining high-quality and benchmarks

Motivation
Challenge in maintaining high-quality and benchmarks

Change of the underlying knowledge base

DBpedia 2016-04 → DBpedia 2016-10
http://dbpedia.org/resource/Surfing → http://dbpedia.org/resource/Surfer
http://dbpedia.org/ontology/seatingCapacity → http://dbpedia.org/property/capacity
http://dbpedia.org/property/portrayer → http://dbpedia.org/ontology/portrayer
http://dbpedia.org/property/establishedDate → http://dbpedia.org/ontology/foundingDate

Motivation
Challenge in maintaining high-quality and benchmarks

Metadata annotation errors

Motivation
Degradation of QALD benchmarks against various versions of DBpedia

Contribution

QUANT, a framework for the intelligent creation and curation of QA benchmarks

Definition

Given $B$, $D$, and $Q$ as benchmark, dataset, and questions respectively

$S$ represents QUANT’s suggestions

The $i$-th version of a QA benchmark $B_i$ is a pair $(D_i, Q_i)$
Given a query $q_{ij} \in Q_i$ with zero results on $D_k$ with $k > i$, $S : q_{ij} \longrightarrow q'_{ij}$

QUANT aims
to ensure that queries from $B_i$ can be reused for $B_k$
to speed up the curation process as compared to the existing one
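
The definition above can be illustrated with a toy sketch. It is not QUANT's implementation: a dataset is modelled as a set of triples, a query as a single triple pattern, and the suggestion function S as a plain lookup in a rename table. The entity `ex:Stadium`, its value, and the helper names `run` and `suggest` are made up for illustration; only the predicate rename (`dbo:seatingCapacity` to `dbp:capacity`) comes from the slides.

```python
from typing import Dict, List, Optional, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

def run(pattern: Triple, dataset: Set[Triple]) -> List[Triple]:
    """Return the triples matching a pattern; terms starting with '?' are variables."""
    return [
        t for t in dataset
        if all(p.startswith("?") or p == v for p, v in zip(pattern, t))
    ]

def suggest(pattern: Triple, dataset: Set[Triple],
            renames: Dict[str, str]) -> Optional[Triple]:
    """Toy stand-in for S: rewrite renamed terms, keep the result only if it has answers."""
    candidate = tuple(renames.get(term, term) for term in pattern)
    return candidate if run(candidate, dataset) else None

# D_i and D_k with k > i: the predicate was renamed between the two versions.
# The entity and the value are invented placeholders.
d_i = {("ex:Stadium", "dbo:seatingCapacity", "1000")}
d_k = {("ex:Stadium", "dbp:capacity", "1000")}

q_ij = ("ex:Stadium", "dbo:seatingCapacity", "?capacity")
renames = {"dbo:seatingCapacity": "dbp:capacity"}

has_zero_results = not run(q_ij, d_k)   # q_ij worked on D_i but is broken on D_k
q_prime = suggest(q_ij, d_k, renames)   # S : q_ij -> q'_ij
print(has_zero_results, q_prime)
```

A real curator would run SPARQL against an endpoint and would need a way to discover candidate renames; the lookup table here only stands in for that step.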

What QUANT supports

1 Creation of SPARQL queries
2 The validity of benchmark metadata
3 Spelling and grammatical correctness of questions

Approach
Architecture

Approach
Smart suggestions

1 SPARQL suggestion
2 Metadata suggestion
3 Multilingual questions and keywords suggestion

Smart suggestion
1. How the SPARQL suggestion module works

1. SPARQL suggestion
Missing prefix

The original SPARQL query

SELECT ?s
WHERE {
    res:New_Delhi dbo:country ?s .
}

The suggested SPARQL query

PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX res: <http://dbpedia.org/resource/>
SELECT ?s
WHERE {
    res:New_Delhi dbo:country ?s .
}
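
A missing-prefix suggestion of this kind could be sketched as follows. This is an assumption about one possible approach, not QUANT's actual code: the function `add_missing_prefixes` and the table `KNOWN_PREFIXES` are hypothetical names, and the table holds only the two namespaces shown on the slide.

```python
import re

# Hypothetical prefix table; the two namespace IRIs are the ones shown on the slide.
KNOWN_PREFIXES = {
    "dbo": "http://dbpedia.org/ontology/",
    "res": "http://dbpedia.org/resource/",
}

def add_missing_prefixes(query: str, known: dict = KNOWN_PREFIXES) -> str:
    """Prepend PREFIX declarations for prefixes that are used but not declared."""
    declared = set(re.findall(r"PREFIX\s+(\w+)\s*:", query, flags=re.IGNORECASE))
    # Drop full IRIs and string literals so "http:" is not mistaken for a prefix.
    stripped = re.sub(r'<[^>]*>|"[^"]*"', " ", query)
    # A prefixed name looks like "dbo:country"; variables such as "?s" are skipped.
    used = set(re.findall(r"(?<![\w?])([A-Za-z]\w*):\w", stripped))
    missing = [p for p in sorted(used - declared) if p in known]
    header = "".join(f"PREFIX {p}: <{known[p]}>\n" for p in missing)
    return header + query

original = "SELECT ?s\nWHERE {\n    res:New_Delhi dbo:country ?s .\n}"
suggested = add_missing_prefixes(original)
print(suggested)
```

Prefixes that are used but absent from the table are left alone, so a curator would still need to resolve those by hand.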

1. SPARQL suggestion
Predicate change

The original SPARQL query

SELECT ?date
WHERE {
    ?website rdf:type onto:Software .
    ?website onto:releaseDate ?date .
    ?website rdfs:label "DBpedia" .
}

The suggested SPARQL query

SELECT ?date
WHERE {
    ?website rdf:type onto:Software .
    ?website rdfs:label "DBpedia" .
    ?website dbp:latestReleaseDate ?date .
}

1. SPARQL suggestion
Predicate missing

The original SPARQL query

SELECT ?uri
WHERE {
    ?subject rdfs:label "Tom Hanks" .
    ?subject foaf:homepage ?uri
}

The suggested SPARQL query

The predicate foaf:homepage is missing in ?subject foaf:homepage ?uri

1. SPARQL suggestion
Entity change

The original SPARQL query

SELECT ?uri WHERE { ?uri rdf:type yago:CapitalsInEurope }

The suggested SPARQL query

SELECT ?uri WHERE { ?uri rdf:type yago:WikicatCapitalsInEurope }

2. Metadata suggestion

3. Multilingual questions and keywords suggestion

Question with missing keywords and translations

Generated keywords: state, united, states, america, highest, density
Utilizing Trans Shell tool → Generated keywords translations suggestion
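
One simple way to arrive at such a keyword list is stop-word removal. The sketch below is an assumption rather than QUANT's documented method: the input question is not shown in the transcript and is guessed here, and the stop-word list and the function name `generate_keywords` are hypothetical.

```python
import re

# Hypothetical stop-word list; a real system would likely use a larger one.
STOP_WORDS = {"which", "what", "who", "of", "the", "has", "is", "a", "an", "in"}

def generate_keywords(question: str) -> list:
    """Lowercase the question, tokenize it, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", question.lower())
    return [t for t in tokens if t not in STOP_WORDS]

# Assumed question; the slide shows the original only as an image.
question = "Which state of the United States of America has the highest density?"
print(generate_keywords(question))
# ['state', 'united', 'states', 'america', 'highest', 'density']
```

The translation step is left out because it depends on an external tool.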

Suggested Question Translations

Evaluation

Three goals of the evaluation:
1 QUANT vs manual curation
Graduate students curated 50 questions using QUANT and another 50 questions manually
23 minutes vs 278 minutes
2 Effectiveness of smart suggestions
10 expert users were involved in creating a new joint benchmark, called QALD-9, with 653 questions
3 QUANT’s capability to provide a high-quality benchmark dataset
The inter-rater agreement between each two users amounts to 0.83 on average

Group            Inter-rater Agreement
1st Two-Users    0.97
2nd Two-Users    0.72
3rd Two-Users    0.88
4th Two-Users    0.77
5th Two-Users    0.96
Average          0.83
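
For the first evaluation goal, the two reported timings (23 minutes with QUANT, 278 minutes manually) can be turned into a relative time saving. The short calculation below only restates those two numbers; the function name `time_saving` is illustrative.

```python
def time_saving(tool_minutes: float, manual_minutes: float) -> float:
    """Fraction of the manual curation time saved by using the tool."""
    return 1 - tool_minutes / manual_minutes

saving = time_saving(23, 278)      # timings reported on the slide
speedup_factor = 278 / 23          # how many times faster curation was
print(f"time saved: {saving:.1%}, speed-up factor: {speedup_factor:.1f}x")
# time saved: 91.7%, speed-up factor: 12.1x
```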

Evaluation
Users acceptance rate in %

Figure: acceptance rate per user (x-axis: list of users, User 1 to User 10; y-axis: acceptance rate in %, 0 to 100)

QUANT provided 2380 suggestions, and the average user acceptance rate is 81%
The top 4 acceptance rates are for QALD-7 and QALD-8

Evaluation
Number of accepted suggestions from all users

Figure: number of accepted suggestions per user (x-axis: list of users, User 1 to User 10; y-axis: number of accepted suggestions, 0 to 500; legend: SPARQL Query, Question Translations, Out of Scope, Onlydbo, Keywords Translations, Hybrid, Answer Type, Aggregation)

Most users accepted suggestions for out-of-scope metadata
Keyword and question translation suggestions yielded the second and third highest acceptance rates.

Evaluation
Number of users who accepted QUANT’s suggestions for each question’s attribute.

Figure: percentage of users who accepted suggestions per attribute (x-axis: name of attributes: Aggregation, Answer Type, Hybrid, Keywords Translations, Only Dbo, Out of Scope, Question Translations, SPARQL Query; y-axis: number of users who accepted suggestions in %, 0 to 110)

On average, 83.75% of the users accepted QUANT’s smart suggestions
Hybrid and SPARQL suggestions were only accepted by 2 and 5 users respectively.

Evaluation
Number of suggestions provided by users

Figure: number of suggestions provided per user (x-axis: list of users, User 1 to User 10; y-axis: number of provided suggestions, 0 to 40; legend: SPARQL Query, Question Translations, Out of Scope, Onlydbo, Keywords Translations, Hybrid, Answer Type, Aggregation)

Answer type, onlydbo, out-of-scope, and SPARQL query metadata were the attributes whose values were redefined by users

QALD-specific Analysis

There are 1924 questions, of which 1442 are training data and 482 are test data

Duplicate removal resulted in 655 unique questions
Removing 2 semantically similar questions produced 653 questions
Using QUANT with 10 expert users, we obtained 558 benchmark questions in total → an increase of 110.6% over the size of QALD-8
The new benchmark formed the QALD-9 dataset

Figure: Distribution of unique questions in all QALD versions
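
The slides do not say how duplicates were detected. As a minimal sketch under the assumption that two questions count as duplicates when they match exactly after normalization (case, punctuation, whitespace), the merge step could look like this. The sample questions and the names `normalize` and `unique_questions` are invented for illustration.

```python
import re

def normalize(question: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9\s]", "", question.lower())
    return " ".join(cleaned.split())

def unique_questions(questions: list) -> list:
    """Keep the first occurrence of each question, comparing normalized text."""
    seen = set()
    unique = []
    for q in questions:
        key = normalize(q)
        if key not in seen:
            seen.add(key)
            unique.append(q)
    return unique

# Invented sample standing in for questions merged from several benchmark versions.
merged = [
    "Who is the mayor of Berlin?",
    "who is the  mayor of Berlin",
    "What is the capital of France?",
]
deduplicated = unique_questions(merged)
print(len(merged), "->", len(deduplicated))
```

Semantically similar questions with different wording, like the 2 removed in the second step above, would not be caught by this kind of exact matching.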

Conclusion

QUANT’s evaluation highlights the need for better datasets and their maintenance
QUANT speeds up the curation process by up to 91%.
Smart suggestions motivate users to engage in more attribute corrections than if there were no hints

Future Work

There is a need to invest more time into SPARQL suggestions as only 5 users accepted them
We plan to support more file formats based on our internal library

Thank you for your attention!

Ria Hari Gusmita

https://github.com/dice-group/QUANT

DICE Group at Paderborn University
https://dice-research.org/team/profiles/gusmita/
