
Introduction to CodaLab Competitions

Tristan Miller

Presented at: School of Data Analysis and Artificial Intelligence, National Research University – Higher School of Economics, 25 May 2017


Overview

- What are shared tasks?
- What is CodaLab Competitions?
- Organizing CodaLab Competitions
- Student competitions
- Caveats


What are shared tasks?


Introduction

In a shared task / challenge / competition / evaluation campaign / evaluation exercise, the organizers:

- define a data processing task (classification, segmentation, ranking, etc.)
- define evaluation metrics to measure performance
- produce test data and the gold-standard answers
- produce trial data for demonstration purposes
- produce or provide ancillary resources (knowledge bases, etc.)
- produce training data for use by supervised algorithms
- solicit participants to write algorithms to process the test data
- implement various baseline algorithms
- score the algorithms’ output against the gold standard using the evaluation metrics
- compare and analyze the participants’ and baseline algorithms


Shared tasks: pros and cons

Pros:
- stimulate methodological research on unsolved problems
- provide standardized data sets, resources, and evaluation metrics
- facilitate reproducibility of research results
- centralize publication and discussion of research results

Cons:
- everything must be planned in advance
- large organizational overhead (data distribution, publicity, communication with participants, etc.)
- encourage “teaching to the test”


What is CodaLab Competitions?


CodaLab Competitions

- Web-based platform for running online data-based competitions
- Developed by Microsoft, Stanford University, and others
- Free hosted implementation: https://competitions.codalab.org/
- Free software (Apache License 2.0): https://github.com/codalab/codalab-competitions/


Microsoft COCO Image Captioning Challenge


ChaLearn Looking at People Challenges


SemEval-2017 Multilingual and Cross-lingual Semantic Word Similarity


CodaLab features

- Hosts the public task website
- Hosts private gold-standard data and scoring software
- Manages participant registration
- Enforces submission deadlines
- Runs scoring software
- Tabulates, stores, and publishes results
- Handles communication between organizers and participants (e-mail, forums)
- Provides some publicity


Organizing CodaLab Competitions


Competition bundle file structure

Competitions are defined by a zipped file archive (a bundle) containing:

- a logo image
- HTML files for the competition website
- scoring software (ZIP archive)
- gold-standard (“reference”) data (ZIP archive)
- competition.yaml
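
As a sketch, the bundle for the pun-interpretation competition used as a running example below might be laid out like this before zipping (every filename except competition.yaml is illustrative and is referenced from within the YAML):

semeval2017-task7-subtask3/
    competition.yaml                competition definition
    semeval2017-task7-logo.png      logo image
    overview.html                   website pages
    evaluation.html
    terms_and_conditions.html
    data.html
    scorer.zip                      scoring software
    data_trial.zip                  gold-standard reference data

The bundle itself is a flat ZIP of these files (competition.yaml at the archive root), e.g. zip ../bundle.zip * run from inside the directory.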


competition.yaml settings

- Competition title, description, and logo
- Whether registration is required
- Filenames for standard and optional web pages
- Configuration of competition phases
- Format of the leaderboard


competition.yaml: basic settings

title: SemEval-2017 Task 7, Subtask 3
description: Interpretation of English Puns
image: semeval2017-task7-logo.png
has_registration: True
allow_teams: True
end_date: 2017-01-30
html:
    overview: overview.html
    evaluation: evaluation.html
    terms: terms_and_conditions.html
    data: data.html
    ...


HTML files

<p>Data sets for this subtask are described in detail on the
<a href="http://alt.qcri.org/semeval2017/task7/index.php?id=data-and-resources">
SemEval-2017 Task 7 website</a>.</p>

<h4>Download</h4>
<ul>
  <li><a href="data/uploads/semeval2017_pun_task.tar.xz">Trial data</a></li>
  <li>Test data will not be released until the evaluation begins</li>
</ul>

<p>Note that, due to the difficulty in amassing a large number of pun
examples per word or per sense, there is <strong>no training data</strong>
for this task.</p>


Competition phases

- Phases break your competition into optional subtasks
- Each phase has its own:
  - title
  - start date
  - scoring program
  - data
  - submission limit
- Phases can be run in parallel, staggered, or (with some difficulty) in sequence
- It’s common to have a “trial” phase for sandbox testing:
  - earlier start date
  - toy data
  - unlimited submissions


competition.yaml: defining phases

phases:
    1:
        phasenumber: 1
        label: "Trial"
        start_date: 2016-12-05
        max_submissions: 999
        scoring_program: scorer.zip
        reference_data: data_trial.zip
        leaderboard_management_mode: hide_results
        color: white
    2:
        phasenumber: 2
        label: "Test (Homographic)"
        start_date: 2017-01-23
        ...


Leaderboards

- A leaderboard is a dynamically updated results table
- Each phase has 0 or more leaderboards, public or hidden
- The columns are the metrics output by your scoring software
- The rows are the participants’ submissions
- You define the column labels and numeric format
- You can also rank the metrics by priority


competition.yaml: defining the leaderboards

leaderboard:
    leaderboards:
        RESULTS: &RESULTS
            label: Results
            rank: 1
    columns:
        coverage:
            leaderboard: *RESULTS
            label: coverage
            rank: 4
            numeric_format: 4
        precision:
            leaderboard: *RESULTS
            label: precision
            ...


Scoring programs and reference data

- You provide a ZIP file containing arbitrary gold-standard reference data
- You provide another ZIP file containing:
  - an executable scorer
  - a metadata file that describes how to run the scorer, using the following variables:

    $program      scorer directory
    $input/ref    reference data directory
    $input/res    submission data directory
    $output       output directory

For example, the metadata file for the pun-interpretation scorer:

description: Scoring program for Subtask 3 of SemEval-2017 Task 7 (pun interpretation)
command: java -classpath $program de.tudarmstadt.ukp.semeval2017.task7.scorer.PunScorer -i $input/ref/truth.txt $input/res/answer.txt $output/scores.txt


Scoring programs and reference data

- Scorers can be written in any language (Python, Java, Perl, etc.)
- The scorer must produce a key–value file $output/scores.txt
- Each key corresponds to a leaderboard column key:

  coverage: 1.000
  precision: 0.825
  recall: 0.775
  f1: 0.799

- stderr is captured and reported to the submitter
- stderr and stdout are captured and stored for the competition organizer
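
To make this contract concrete, here is a minimal sketch of a scorer in Python. The task, the input format (one id<TAB>label pair per line in both files), and the script name are hypothetical; only the invocation pattern, which mirrors the Java command in the metadata file above, and the scores.txt output format follow the slides.

# score.py -- hypothetical minimal scorer (not the actual Task 7 scorer).
# The metadata command might invoke it as:
#   python $program/score.py $input/ref/truth.txt $input/res/answer.txt $output/scores.txt
import sys

def read_pairs(path):
    # Assumes well-formed "id<TAB>label" lines, one pair per line.
    with open(path) as f:
        return dict(line.rstrip("\n").split("\t", 1) for line in f if line.strip())

truth = read_pairs(sys.argv[1])   # gold standard ($input/ref/truth.txt)
answer = read_pairs(sys.argv[2])  # participant submission ($input/res/answer.txt)

attempted = sum(1 for key in answer if key in truth)
correct = sum(1 for key, label in answer.items() if truth.get(key) == label)

coverage = attempted / len(truth) if truth else 0.0    # fraction of gold items answered
precision = correct / attempted if attempted else 0.0  # correct among attempted
recall = correct / len(truth) if truth else 0.0        # correct among all gold items
f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# CodaLab parses $output/scores.txt as "key: value" lines; each key must
# match a leaderboard column key defined in competition.yaml.
with open(sys.argv[3], "w") as out:
    for key, value in [("coverage", coverage), ("precision", precision),
                       ("recall", recall), ("f1", f1)]:
        out.write("%s: %.3f\n" % (key, value))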


Upon uploading a competition

- You will (usually) be warned if there is a problem
- You optionally “publish” your competition:
  - Unpublished: accessible only if you know the URL (default)
  - Published: listed on the CodaLab Competitions home page
- Competitions can be edited via the web interface to:
  - publish/unpublish
  - change competition settings
  - edit web pages (via a rich text editor)
  - add or remove reference data and scoring programs
- Editing the original YAML is no longer possible!


Viewing and downloading submissions

The “Submissions” page allows you to:

- see a list of all successful and failed submissions
- download the original system output
- download a CSV of all system scores
- re-run the scorer on one or all submissions
- delete a submission
- (un)hide a submission from the leaderboard


Student competitions


- skill development for student participants/organizers:
  - experimentation, problem formulation/solving
  - writing/fulfilling technical specifications
  - collaboration with peers
  - presentation of research results
- benefits for student participants:
  - unrestricted choice of tools/methods
  - instant feedback to students
  - freedom to experiment
- benefits for teachers:
  - automatic enforcement of submission deadlines
  - instant tabulation of scores
  - consistent packaging of submissions


Caveats


CodaLab Competitions is...
- underdocumented
- unpredictable
- unintuitive
- unstable

But it’s...
- free (as in beer)
- free (as in speech)
- popular
- under active development
- probably a time-saver, in the long run


Thank you!

Questions?
