Methodology and Campaign Design for the
Evaluation of Semantic Search Tools
Stuart N. Wrigley¹, Dorothee Reinhard², Khadija Elbedweihy¹, Abraham Bernstein², Fabio Ciravegna¹
¹University of Sheffield, UK · ²University of Zurich, Switzerland
23.04.2010
Outline
• SEALS initiative
• Evaluation design
– Criteria
– Two phase approach
– API
– Workflow
• Data
• Results and Analyses
• Conclusions
SEALS INITIATIVE
SEALS goals
• Develop and diffuse best practices in evaluation of semantic technologies
• Create a lasting reference infrastructure for semantic technology evaluation
– This infrastructure will be the SEALS Platform
• Facilitate the continuous evaluation of semantic technologies
• Organise two worldwide Evaluation Campaigns
– One this summer
– Next in late 2011 / early 2012
• Allow easy access to both:
– evaluation results (for developers and researchers)
– technology roadmaps (for non-technical adopters)
• Transfer all infrastructure to the community
Targeted technologies
Five different types of semantic technologies:
• Ontology Engineering tools
• Ontology Storage and Reasoning Systems
• Ontology Matching tools
• Semantic Search tools
• Semantic Web Service tools
What’s our general approach?
• Low overhead to the participant
– Automate as far as possible
– We provide the compute
– We initiate the actual evaluation run
– We perform the analysis
• Encourage participation in evaluation campaign definitions and design
• Provide infrastructure for more than simply running high-profile evaluation campaigns
– reuse existing evaluations for your personal testing
– create new evaluations
– store / publish / download test data sets
• Open Source (Apache 2.0)
SEALS Platform
[Architecture diagram] The SEALS Service Manager coordinates five services: the Result Repository Service, the Tool Repository Service, the Test Data Repository Service, the Evaluation Repository Service, and the Runtime Evaluation Service. The platform serves three groups: Technology Developers, Evaluation Organisers, and Technology Users.
SEARCH EVALUATION DESIGN
What do we want to do?
• Evaluate / benchmark semantic search tools
– with respect to their semantic peers
• Allow as wide a range of interface styles as possible
• Assess tools on basis of a number of criteria including usability
• Automate (part of) it
Evaluation criteria
User-centred search methodologies will be evaluated according to the following criteria:
• Query expressiveness
– Is the style of interface suited to the type of query?
– How complex can the queries be?
• Usability (effectiveness, efficiency, satisfaction)
– How easy is the tool to use?
– How easy is it to formulate the queries?
– How easy is it to work with the answers?
• Scalability
– Ability to cope with a large ontology
– Ability to query a large repository in a reasonable time
– Ability to cope with a large number of results returned
• Quality of documentation
– Is it easy to understand?
– Is it well structured?
• Performance (resource consumption)
– execution time (speed)
– CPU load
– memory required
Two phase approach
• Evaluating semantic search tools demands a user-in-the-loop phase
– usability criterion
• Two phases:
– User-in-the-loop
– Automated
Evaluation criteria
Each phase will address a different subset of criteria.
• Automated evaluation: query expressiveness, scalability, performance, quality of documentation
• User-in-the-loop: usability, query expressiveness
RUNNING THE EVALUATION
Automated evaluation
• Tools uploaded to platform. Includes:
– wrapper implementing API
– supporting libraries
• Test data and questions stored on platform
• Workflow specifies details of evaluation sequence
• Evaluation executed offline in batch mode
• Results stored on platform
• Analyses performed and stored on platform
[Diagram: the search tool, wrapped by the API, is driven by the Runtime Evaluation Service within the SEALS Platform.]
User-in-the-loop evaluation
• Performed at tool provider site
• All materials provided
– Controller software
– Instructions (leader and subjects)
– Questionnaires
• Data downloaded from platform
• Results uploaded to platform
[Diagram: on the tool provider's machine, the Controller drives the search tool through the API and communicates with the SEALS Platform over the web.]
API
• A range of information needs to be acquired from the tool in both phases
• In the automated phase, the tool has to be executed and interrogated with no human assistance
• The interface between the SEALS platform and the tool must therefore be formalised
API – common
• Load ontology
– success / failure informs the interoperability analysis
• Determine result type
– ranked list or set?
• Results ready?
– used to determine execution time
• Get results
– list of URIs
– number of results to be determined by developer
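
A minimal sketch of what such a common wrapper interface might look like; the interface name, method signatures, and the ResultType enum are illustrative assumptions, not the actual SEALS API:

```java
import java.net.URI;
import java.util.List;

// Hypothetical common wrapper interface (names and signatures are assumptions).
public interface SemanticSearchToolWrapper {

    // Load the ontology; the success / failure flag informs the interoperability analysis.
    boolean loadOntology(URI ontologyLocation);

    // Does the tool return a ranked list or an unordered set?
    ResultType getResultType();

    // Polled by the platform to determine execution time.
    boolean resultsReady();

    // The answers as a list of URIs; how many to return is left to the developer.
    List<URI> getResults();

    enum ResultType { RANKED_LIST, SET }
}
```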
API – user in the loop
• User query input complete?
– used to determine input time
• Get user query
– String representation of the user's query
– if NL interface, same as the text entered
• Get internal query
– String representation of the internal query
– for use with…
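
Extending the common sketch above, the user-in-the-loop additions might look as follows (again, all names are illustrative assumptions):

```java
// Hypothetical user-in-the-loop extension of the common wrapper sketch.
public interface UserDrivenSearchToolWrapper extends SemanticSearchToolWrapper {

    // Polled by the controller to determine query input time.
    boolean isUserQueryComplete();

    // The user's query as entered (for an NL interface, the raw text).
    String getUserQuery();

    // The tool's internal query (e.g., SPARQL) serialised as a String.
    String getInternalQuery();
}
```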
API – automated
• Execute query
– must not constrain tools to a particular query format
– tool provider given questions shortly before evaluation is executed
– tool provider converts those questions into some form of ‘internal representation’ which can be serialised as a String
– serialised internal representation passed to this method
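
A sketch of the automated-phase method together with the kind of driver loop the Runtime Evaluation Service might run; all class and method names are illustrative assumptions:

```java
import java.net.URI;
import java.util.List;

// Hypothetical automated-phase extension of the common wrapper sketch.
interface AutomatedSearchToolWrapper extends SemanticSearchToolWrapper {
    // Receives the provider's serialised 'internal representation' of a question;
    // the String format is left entirely to the tool provider.
    void executeQuery(String serialisedQuery);
}

// Illustrative driver loop, sketching how the evaluation might be sequenced.
class EvaluationDriver {
    void run(AutomatedSearchToolWrapper tool, URI ontology,
             List<String> serialisedQueries) throws InterruptedException {
        tool.loadOntology(ontology);                  // interoperability check
        for (String query : serialisedQueries) {
            long start = System.nanoTime();
            tool.executeQuery(query);
            while (!tool.resultsReady()) {            // poll for completion
                Thread.sleep(50);
            }
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            List<URI> answers = tool.getResults();
            // ... store answers and elapsedMs via the Result Repository Service
        }
    }
}
```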
DATA
Data set – user in the loop
• Mooney Natural Language Learning Data
– used by a previous semantic search evaluation
– simple and well-known domain
– using the geography subset:
• 9 classes
• 11 datatype properties
• 17 object properties
• 697 instances
– 877 questions already available
Data set – automated
• EvoOnt
– set of object-oriented software source code ontologies
– easy to create different ABox sizes given a TBox
– 5 data set sizes: 1k, 10k, 100k, 1M, 10M triples
– questions generated by software engineers
RESULTS AND ANALYSES
Questionnaires
3 questionnaires:
• SUS questionnaire
• Extended questionnaire
– similar to SUS in terms of type of question but more detailed
• Demographics questionnaire
System Usability Scale (SUS) score
• SUS is a Likert scale
• 10-item questionnaire
• Each question has 5 levels (strongly disagree to strongly agree)
• SUS scores have a range of 0 to 100.
• A score of around 60 or above is generally considered an indicator of good usability.
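
As a concrete illustration, the standard SUS scoring rule can be computed as follows (a minimal sketch; the class and method names are ours):

```java
// Standard SUS scoring: 10 responses, each from 1 (strongly disagree) to 5 (strongly agree).
public final class Sus {
    public static double score(int[] responses) {
        if (responses.length != 10) {
            throw new IllegalArgumentException("SUS requires exactly 10 responses");
        }
        int sum = 0;
        for (int i = 0; i < 10; i++) {
            // Odd-numbered items (1, 3, ...) are positively worded: contribution = response - 1.
            // Even-numbered items are negatively worded: contribution = 5 - response.
            sum += (i % 2 == 0) ? responses[i] - 1 : 5 - responses[i];
        }
        return sum * 2.5;  // yields a score in the range 0..100
    }
}
```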
Demographics
• Age
• Gender
• Profession
• Number of years in education
• Highest qualification
• Number of years in employment
• Knowledge of informatics
• Knowledge of linguistics
• Knowledge of formal query languages
• Knowledge of English
• …
Automated
Results
• Execution success (OK / FAIL / PLATFORM ERROR)
• Triples returned
• Time to execute each query
• CPU load, memory usage
Analyses
• Ability to load ontology and query (interoperability)
• Precision and Recall (search accuracy and query expressiveness)
• Tool robustness: ratio of all benchmarks executed to number of failed executions
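
For illustration, precision and recall over the returned URIs might be computed as follows (a sketch assuming a gold-standard answer set is available per query; all names are ours):

```java
import java.net.URI;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Minimal sketch: precision and recall of a tool's returned URIs
// against a gold-standard answer set for a single query.
public final class SearchAccuracy {

    // Fraction of returned results that are correct.
    public static double precision(List<URI> returned, Set<URI> gold) {
        if (returned.isEmpty()) return 0.0;
        long correct = returned.stream().filter(gold::contains).count();
        return (double) correct / returned.size();
    }

    // Fraction of gold-standard answers that were returned.
    public static double recall(List<URI> returned, Set<URI> gold) {
        if (gold.isEmpty()) return 0.0;
        long found = new HashSet<>(returned).stream().filter(gold::contains).count();
        return (double) found / gold.size();
    }
}
```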
User-in-the-loop
Results (in addition to the core results shared with the automated phase)
• Query captured by the tool
• Underlying query (e.g., SPARQL)
• Is answer in result set? (user may try a number of queries before being successful)
• time required to obtain answer
• number of queries required to answer question
Analyses
• Precision and Recall
• Correlations between results and SUS scores, demographics, etc.
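
As one example of such an analysis, a Pearson correlation between per-subject SUS scores and, say, mean time-to-answer could be computed like this (the variable pairing is an illustrative assumption):

```java
// Minimal Pearson correlation sketch for relating two per-subject measures.
public final class Correlation {
    public static double pearson(double[] x, double[] y) {
        if (x.length != y.length || x.length == 0) {
            throw new IllegalArgumentException("inputs must be equal-length and non-empty");
        }
        int n = x.length;
        double sumX = 0, sumY = 0, sumXY = 0, sumXX = 0, sumYY = 0;
        for (int i = 0; i < n; i++) {
            sumX += x[i];
            sumY += y[i];
            sumXY += x[i] * y[i];
            sumXX += x[i] * x[i];
            sumYY += y[i] * y[i];
        }
        double cov  = sumXY - sumX * sumY / n;   // unnormalised covariance
        double varX = sumXX - sumX * sumX / n;   // unnormalised variances
        double varY = sumYY - sumY * sumY / n;
        return cov / Math.sqrt(varX * varY);
    }
}
```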
Dissemination
• Results browsable on the SEALS portal
• Split into three areas:
– performance
– usability
– comparison between tools
CONCLUSIONS
Conclusions
• Methodology and design of a semantic search tool evaluation campaign
• Exists within the wider context of the SEALS initiative
• First version: feedback from participants and the community will drive the design of the second campaign
• Emphasis on the user experience (for search)
– Two phase approach
Get involved!
• First Evaluation Campaign in all SEALS technology areas this Summer
• Get involved – your input and participation are crucial
• Workshop planned for ISWC 2010 after campaign
• Find out more (and take part!) at:
http://www.seals-project.eu
or talk to me, or email me ([email protected])
Timeline
• May 2010: Registration opens
• May-June 2010: Evaluation materials and documentation are provided to participants
• July 2010: Participants upload their tools
• August 2010: Evaluation scenarios are executed
• September 2010: Evaluation results are analysed
• November 2010: Evaluation results are discussed at ISWC 2010 workshop (tbc)
Best paper award
SEALS is proud to be sponsoring the best paper award here at SemSearch2010
Congratulations to the winning authors!