
Introduction to CodaLab Competitions

Tristan Miller

Presented at: School of Data Analysis and Artificial Intelligence, National Research University – Higher School of Economics, 25 May 2017


Overview

- What are shared tasks?
- What is CodaLab Competitions?
- Organizing CodaLab Competitions
- Student competitions
- Caveats


What are shared tasks?


Introduction

In a shared task / challenge / competition / evaluation campaign / evaluation exercise, the organizers:

- define a data processing task (classification, segmentation, ranking, etc.)
- define evaluation metrics to measure performance
- produce test data and the gold-standard answers
- produce trial data for demonstration purposes
- produce or provide ancillary resources (knowledge bases, etc.)
- produce training data for use by supervised algorithms
- solicit participants to write algorithms to process the test data
- implement various baseline algorithms
- score the algorithms’ output against the gold standard using the evaluation metrics
- compare and analyze the participants’ and baseline algorithms


Shared tasks: pros and cons

Pros:
- stimulate methodological research on unsolved problems
- provide standardized data sets, resources, and evaluation metrics
- facilitate reproducibility of research results
- centralize publication and discussion of research results

Cons:
- everything must be planned in advance
- large organizational overhead (data distribution, publicity, communication with participants, etc.)
- encourage “teaching to the test”


What is CodaLab Competitions?


CodaLab Competitions

- Web-based platform for running online data-based competitions
- Developed by Microsoft, Stanford University, and others
- Free hosted implementation: https://competitions.codalab.org/
- Free software (Apache License 2.0): https://github.com/codalab/codalab-competitions/


Microsoft COCO Image Captioning Challenge


ChaLearn Looking at People Challenges


SemEval-2017 Multilingual and Cross-lingual Semantic Word Similarity


CodaLab features

- Hosts the public task website
- Hosts private gold-standard data and scoring software
- Manages participant registration
- Enforces submission deadlines
- Runs scoring software
- Tabulates, stores, and publishes results
- Handles communication between organizers and participants (e-mail, forums)
- Provides some publicity


Organizing CodaLab Competitions


Competition bundle file structure

Competitions are defined by a zipped file archive (a bundle) containing:

- a logo image
- HTML files for the competition website
- scoring software (ZIP archive)
- gold-standard (“reference”) data (ZIP archive)
- competition.yaml
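
As a sketch, the bundle for the pun-interpretation competition used as a running example below might be laid out like this before zipping (every filename except competition.yaml is illustrative and is referenced from within the YAML):

semeval2017-task7-subtask3/
    competition.yaml                competition definition
    semeval2017-task7-logo.png      logo image
    overview.html                   website pages
    evaluation.html
    terms_and_conditions.html
    data.html
    scorer.zip                      scoring software
    data_trial.zip                  gold-standard reference data

The bundle itself is a flat ZIP of these files (competition.yaml at the archive root), e.g. zip ../bundle.zip * run from inside the directory.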


competition.yaml settings

- Competition title, description, and logo
- Whether registration is required
- Filenames for standard and optional web pages
- Configuration of competition phases
- Format of the leaderboard


competition.yaml: basic settings

title: SemEval-2017 Task 7, Subtask 3
description: Interpretation of English Puns
image: semeval2017-task7-logo.png
has_registration: True
allow_teams: True
end_date: 2017-01-30
html:
    overview: overview.html
    evaluation: evaluation.html
    terms: terms_and_conditions.html
    data: data.html
    ...


HTML files

<p>Data sets for this subtask are described in detail on the
<a href="http://alt.qcri.org/semeval2017/task7/index.php?id=data-and-resources">
SemEval-2017 Task 7 website</a>.</p>

<h4>Download</h4>
<ul>
  <li><a href="data/uploads/semeval2017_pun_task.tar.xz">Trial data</a></li>
  <li>Test data will not be released until the evaluation begins</li>
</ul>

<p>Note that, due to the difficulty in amassing a large number of pun
examples per word or per sense, there is <strong>no training data</strong>
for this task.</p>


Competition phases

- Phases break your competition into optional subtasks
- Each phase has its own:
  - title
  - start date
  - scoring program
  - data
  - submission limit
- Phases can be run in parallel, staggered, or (with some difficulty) in sequence
- It’s common to have a “trial” phase for sandbox testing:
  - earlier start date
  - toy data
  - unlimited submissions


competition.yaml: defining phases

phases:
    1:
        phasenumber: 1
        label: "Trial"
        start_date: 2016-12-05
        max_submissions: 999
        scoring_program: scorer.zip
        reference_data: data_trial.zip
        leaderboard_management_mode: hide_results
        color: white
    2:
        phasenumber: 2
        label: "Test (Homographic)"
        start_date: 2017-01-23
        ...


Leaderboards

- A leaderboard is a dynamically updated results table
- Each phase has 0 or more leaderboards, public or hidden
- The columns are the metrics output by your scoring software
- The rows are the participants’ submissions
- You define the column labels and numeric format
- You can also rank the metrics by priority


competition.yaml: defining the leaderboards

leaderboard:
    leaderboards:
        RESULTS: &RESULTS
            label: Results
            rank: 1
    columns:
        coverage:
            leaderboard: *RESULTS
            label: coverage
            rank: 4
            numeric_format: 4
        precision:
            leaderboard: *RESULTS
            label: precision
            ...


Scoring programs and reference data

- You provide a ZIP file containing arbitrary gold-standard reference data
- You provide another ZIP file containing:
  - an executable scorer
  - a metadata file that describes how to run the scorer, using the following variables:

    $program      scorer directory
    $input/ref    reference data directory
    $input/res    submission data directory
    $output       output directory

For example, the metadata file for the pun-interpretation scorer:

description: Scoring program for Subtask 3 of SemEval-2017 Task 7 (pun interpretation)
command: java -classpath $program de.tudarmstadt.ukp.semeval2017.task7.scorer.PunScorer -i $input/ref/truth.txt $input/res/answer.txt $output/scores.txt


Scoring programs and reference data

- Scorers can be written in any language (Python, Java, Perl, etc.)
- The scorer must produce a key–value file $output/scores.txt
- Each key corresponds to a leaderboard column key:

  coverage: 1.000
  precision: 0.825
  recall: 0.775
  f1: 0.799

- stderr is captured and reported to the submitter
- stderr and stdout are captured and stored for the competition organizer
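
To make this contract concrete, here is a minimal sketch of a scorer in Python. The task, the input format (one id<TAB>label pair per line in both files), and the script name are hypothetical; only the invocation pattern, which mirrors the Java command in the metadata file above, and the scores.txt output format follow the slides.

# score.py -- hypothetical minimal scorer (not the actual Task 7 scorer).
# The metadata command might invoke it as:
#   python $program/score.py $input/ref/truth.txt $input/res/answer.txt $output/scores.txt
import sys

def read_pairs(path):
    # Assumes well-formed "id<TAB>label" lines, one pair per line.
    with open(path) as f:
        return dict(line.rstrip("\n").split("\t", 1) for line in f if line.strip())

truth = read_pairs(sys.argv[1])   # gold standard ($input/ref/truth.txt)
answer = read_pairs(sys.argv[2])  # participant submission ($input/res/answer.txt)

attempted = sum(1 for key in answer if key in truth)
correct = sum(1 for key, label in answer.items() if truth.get(key) == label)

coverage = attempted / len(truth) if truth else 0.0    # fraction of gold items answered
precision = correct / attempted if attempted else 0.0  # correct among attempted
recall = correct / len(truth) if truth else 0.0        # correct among all gold items
f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# CodaLab parses $output/scores.txt as "key: value" lines; each key must
# match a leaderboard column key defined in competition.yaml.
with open(sys.argv[3], "w") as out:
    for key, value in [("coverage", coverage), ("precision", precision),
                       ("recall", recall), ("f1", f1)]:
        out.write("%s: %.3f\n" % (key, value))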


Upon uploading a competition

- You will (usually) be warned if there is a problem
- You optionally “publish” your competition:
  - Unpublished: accessible only if you know the URL (default)
  - Published: listed on the CodaLab Competitions home page
- Competitions can be edited via the web interface to:
  - publish/unpublish
  - change competition settings
  - edit web pages (via a rich text editor)
  - add or remove reference data and scoring programs
- Editing the original YAML is no longer possible!


Viewing and downloading submissions

The “Submissions” page allows you to:

- see a list of all successful and failed submissions
- download the original system output
- download a CSV of all system scores
- re-run the scorer on one or all submissions
- delete a submission
- (un)hide a submission from the leaderboard


Student competitions


- skill development for student participants/organizers:
  - experimentation, problem formulation/solving
  - writing/fulfilling technical specifications
  - collaboration with peers
  - presentation of research results
- benefits for student participants:
  - unrestricted choice of tools/methods
  - instant feedback to students
  - freedom to experiment
- benefits for teachers:
  - automatic enforcement of submission deadlines
  - instant tabulation of scores
  - consistent packaging of submissions


Caveats


CodaLab Competitions is...
- underdocumented
- unpredictable
- unintuitive
- unstable

But it’s...
- free (as in beer)
- free (as in speech)
- popular
- under active development
- probably a time-saver, in the long run


Thank you!

Questions?
