
Summary & Results, Task 2: Audio tagging with noisy labels and minimal supervision

Coordinators: Eduardo Fonseca, Manoj Plakal, Frederic Font, Daniel P. W. Ellis
Results: https://tinyurl.com/y4�j2a8

Motivation

- General-purpose sound event recognizers
- Follow-up of DCASE2018 Task 2, but:
  - double the number of categories
  - much more data
  - multi-class ⇒ multi-label

Dataset: FSDKaggle2019

- 80 classes, over 100 hours
- labels/clip: 1.2 ⇒ 1.4
- variable length: 0.3 ⇒ 30 s
- weak labels of varying reliability
  - FSD: human-labeled
  - YFCC: machine-labeled (noisy)

Class distribution (figure)

Results, Top-8

(Figure: lwlrap on the private leaderboard, 0.70-0.77, for Akiyama, Zhang, Ebbers, Bouteillon, Kuaiyu, Koutini, Zhang, and Boqing; values in the table below.)

System      Features                    Classifier          lwlrap
Akiyama     log-mel energies, waveform  CNN, ensemble       0.758
Zhang       log-mel energies, CQT       CNN, RNN, ensemble  0.758
Ebbers      log-mel energies            CRNN, ensemble      0.755
Bouteillon  log-mel energies            CNN                 0.752
Kuaiyu      log-mel energies            CNN, ensemble       0.741
Koutini     log-mel energies            CNN, RFR, ensemble  0.737
Zhang       log-mel energies, PCEN      CNN, ensemble       0.734
Boqing      log-mel energies            CNN, ensemble       0.724

Task Description

- Goal: predict labels among 80 diverse categories
- Multi-label audio tagging
- Many noisy labels & limited supervision
- Domain mismatch: Freesound vs. Flickr

Task Setup & Some Numbers

- Kaggle kernels-only competition: all systems run on Kaggle's servers, with scores computed on a hidden test set
- 880 teams (14 of them submitting 28 systems to DCASE) & 8618 entries
- Maximum of 2 daily submissions
- Judges' Award to foster novel, problem-specific, efficient approaches
- Top-3 winners & the Judges' Award winner are required to publish their code

Metric: label-weighted label-ranking average precision (lwlrap)

For sample s & class label c:

- C(s): set of true class labels of s
- Lab(s, r): class label at rank r (ranking classes by descending score)
- Rank(s, c): rank of class c
- 1[·]: 1 if the argument is true, else 0
- Prec(s, c): label-ranking precision, Prec(s, c) = (1 / Rank(s, c)) * Σ_{r=1..Rank(s,c)} 1[Lab(s, r) ∈ C(s)]
- lwlrap = average of Prec(s, c) over all true labels across all samples, which weights each class by its number of positive labels; a sketch follows
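Below is a minimal numpy sketch of lwlrap under the definitions just given; it is illustrative, not the official Kaggle reference implementation, and the function name per_class_lwlrap is my own.

import numpy as np

def per_class_lwlrap(truth, scores):
    """Return (per-class average Prec, per-class label weight).

    truth, scores: arrays of shape (num_samples, num_classes);
    truth is multi-hot, scores are real-valued rankings.
    """
    num_samples, num_classes = truth.shape
    precisions = np.zeros_like(scores, dtype=float)
    for s in range(num_samples):
        pos = np.flatnonzero(truth[s])          # C(s): true labels of s
        if pos.size == 0:
            continue
        # Rank(s, c): 1-based rank of every class, by descending score.
        order = np.argsort(-scores[s])
        rank = np.empty(num_classes, dtype=int)
        rank[order] = np.arange(1, num_classes + 1)
        for c in pos:
            # Prec(s, c): fraction of the top Rank(s, c) labels that are true.
            hits = np.sum(rank[pos] <= rank[c])
            precisions[s, c] = hits / rank[c]
    labels_per_class = truth.sum(axis=0)
    # Average Prec over each class's positive labels; weight by label count.
    per_class = precisions.sum(axis=0) / np.maximum(labels_per_class, 1)
    weights = labels_per_class / truth.sum()
    return per_class, weights

# Toy usage: two samples, three classes, perfect rankings give lwlrap = 1.0.
truth = np.array([[1, 0, 1], [0, 1, 0]])
scores = np.array([[0.9, 0.2, 0.3], [0.1, 0.8, 0.5]])
per_class, weight = per_class_lwlrap(truth, scores)
print("lwlrap =", float(np.sum(per_class * weight)))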

Baseline System

- Audio representation: log mel spectrogram patches, 1 s window, 0.5 s hop, 64 mel bands (front end sketched after this list)
- Classifier: MobileNet v1 (Conv / 13 depth-separable Conv / MaxPool / 80-way logistic); 3.3M weights, ~430M multiplies, 8x smaller / 4x faster than ResNet-50
- Clip-level prediction: average of all patch-level predictions
- Training in two stages: on noisy YFCC for ~10 epochs, then on curated FSD for ~100 epochs, both with dropout, label smoothing, and max pooling; the audio representation is tuned between the stages
- lwlraps: 0.537 public, 0.538 private
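A minimal sketch of the baseline front end and clip-level averaging, assuming librosa for the log-mel computation; the sample rate, STFT settings, and the model callable are illustrative assumptions, not values stated on the poster.

import numpy as np
import librosa

SR = 44100                       # assumed sample rate (not on the poster)
PATCH_SEC, HOP_SEC = 1.0, 0.5    # 1 s window, 0.5 s hop between patches
N_MELS = 64                      # 64 mel bands

def logmel_patches(wav):
    """Cut a waveform into overlapping log-mel patches."""
    mel = librosa.feature.melspectrogram(y=wav, sr=SR, n_mels=N_MELS)
    logmel = librosa.power_to_db(mel)              # shape (64, num_frames)
    fps = logmel.shape[1] / (len(wav) / SR)        # frames per second
    win, hop = int(PATCH_SEC * fps), int(HOP_SEC * fps)
    if logmel.shape[1] < win:                      # pad clips shorter than 1 s
        logmel = np.pad(logmel, ((0, 0), (0, win - logmel.shape[1])))
    patches = [logmel[:, i:i + win]
               for i in range(0, logmel.shape[1] - win + 1, hop)]
    return np.stack(patches)     # (num_patches, 64, frames_per_patch)

def clip_prediction(model, wav):
    """Average per-patch scores into a clip-level prediction, as the
    baseline does. `model` is a hypothetical callable mapping one patch
    to an array of 80 class probabilities."""
    patch_scores = np.stack([model(p) for p in logmel_patches(wav)])
    return patch_scores.mean(axis=0)   # (80,) clip-level probabilities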

Adopted techniques

- Inputs: log-mel energies, waveform, CQT, ...
- Mainly CNN/CRNN architectures: VGG, DenseNet, ResNe(X)t, Shake-Shake, frequency-aware CNNs, Squeeze-and-Excitation, MobileNet
- Heavy use of ensembles (from 2 up to 170 models): several networks or snapshot learning, aggregated by averaging predictions or by shallow meta-learning
- Exploiting the curated train set: mixup (sketched below), SpecAugment, SpecMix, TTA, ...
- Handling label noise: semi-supervised learning, instance selection, multi-task learning, loss functions, per-class loss weighting, stochastic weight averaging, adaptive weighting of noisy samples, MixMatch, ...
- Open knowledge: discussion forum & shared kernels
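As a concrete example of one augmentation from the list above, here is a hedged sketch of mixup applied to batches of log-mel patches and multi-hot labels; the alpha value and array shapes are illustrative assumptions, not details from any submitted system.

import numpy as np

def mixup(features, labels, alpha=0.2):
    """Blend random pairs of (patch, multi-hot label) examples.

    features: (batch, 64, frames) log-mel patches
    labels:   (batch, 80) multi-hot targets
    """
    lam = np.random.beta(alpha, alpha)             # mixing coefficient
    perm = np.random.permutation(len(features))    # random partner for each example
    mixed_x = lam * features + (1 - lam) * features[perm]
    mixed_y = lam * labels + (1 - lam) * labels[perm]   # soft multi-label targets
    return mixed_x, mixed_y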