How does Disagreement Help Generalization against Label Corruption?

Center for Advanced Intelligence Project, RIKEN, Japan
Centre for Artificial Intelligence, University of Technology Sydney, Australia

Jun 12th, 2019

(RIKEN & UTS) Co-teaching+ Jun 12th, 2019 1 / 30



Outline

1 Introduction to Learning with Label Corruption/Noisy Labels.

2 Related works: Learning with small-loss instances; Decoupling

3 Co-teaching: From Small-loss to Cross-update

4 Co-teaching+: Divergence Matters

5 Experiments

6 Summary


Big, high-quality data drives the success of deep models.

Figure: There has been a steady reduction of error every year in object classification on a large-scale dataset (1,000 object categories, 1.2 million training images) [Russakovsky et al., 2015].

However, what we usually have in practice is big data with noisy labels.


Noisy labels from crowdsourcing platforms.

Credit: Torbjørn Marø

Unreliable labels may occur when the workers have limited domain knowledge.


Noisy labels from web search/crawler.

Screenshot of Google.com

The keywords may not be relevant to the image contents.


How to model noisy labels?

Class-conditional noise (CCN):
Each label y in the training set (with c classes) is flipped into a noisy label ỹ with probability p(ỹ | y).
Denote by T ∈ [0, 1]^(c×c) the noise transition matrix specifying the probability of flipping one label into another, so that ∀ i, j: T_ij = p(ỹ = j | y = i).

Figure: Illustration of noisy labels (positive vs. negative instances around a decision boundary).
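As an illustration, the CCN model above can be simulated by sampling each noisy label from the row of T indexed by the clean label. A minimal sketch (the helper name `flip_labels` is ours, not from the paper):

```python
import numpy as np

def flip_labels(y, T, rng):
    """Corrupt clean labels y under the CCN model: for each clean label i,
    draw the noisy label from the categorical distribution T[i]."""
    c = T.shape[0]
    return np.array([rng.choice(c, p=T[label]) for label in y])
```

With T equal to the identity matrix the labels pass through unchanged; any off-diagonal mass in row i injects noise into class i.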


What happens when learning with noisy labels?

Figure: Accuracy of neural networks on noisy MNIST with different noise rates (0, 0.2, 0.4, 0.6, 0.8). (Solid: train; dotted: validation.) [Arpit et al., 2017]

Memorization: the networks learn easy patterns first, then (totally) over-fit the noisy training data.
Effect: training deep neural networks directly on noisy labels results in accuracy degradation.


How can we robustly learn from noisy labels?

Current progress in three orthogonal directions:

Learning with noise transition:
Forward Correction (Australian National University, CVPR’17)
S-adaptation (Bar Ilan University, ICLR’17)
Masking (RIKEN-AIP/UTS, NeurIPS’18)

Learning with selected samples:
MentorNet (Google AI, ICML’18)
Learning to Reweight Examples (University of Toronto, ICML’18)
Co-teaching (RIKEN-AIP/UTS, NeurIPS’18)

Learning with implicit regularization:
Virtual Adversarial Training (Preferred Networks, ICLR’16)
Mean Teachers (Curious AI, NIPS’17)
Temporal Ensembling (NVIDIA, ICLR’17)


Learning with small-loss instances

A promising research line: Learning with small-loss instances

Main idea: regard small-loss instances as “correct” instances.

Figure: Self-training MentorNet [Jiang et al., 2018].

Benefit: easy to implement & free of assumptions.

Drawback: accumulated error caused by sample-selection bias.


Learning with small-loss instances

A promising research line: Learning with small-loss instances

Consider the standard class-conditional noise (CCN) model.

We can learn a reliable classifier if a set of clean data is available.

Then, we can use the reliable classifier to filter out the noisy data, where “small loss” serves as a gold standard.

However, we usually only have access to noisy training data. The selected small-loss instances are only likely to be correct, not guaranteed to be correct.

(Problem) There exists accumulated error caused by sample-selection bias.

(Solution 1) In order to select more correct samples, can we design a “small-loss” rule by utilizing the memorization effect of deep neural networks?
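The basic small-loss rule can be written in a few lines (a sketch, assuming per-sample losses are already computed; `small_loss_subset` is a hypothetical helper name):

```python
import numpy as np

def small_loss_subset(losses, keep_ratio):
    """Regard the keep_ratio fraction of instances with the smallest loss
    as "correct" and return their indices."""
    k = int(np.ceil(keep_ratio * len(losses)))
    return np.argsort(losses)[:k]
```

Note that if a mislabeled sample happens to have small loss early on, it keeps being selected, which is exactly the accumulated sample-selection bias discussed above.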


Decoupling

Related work: Decoupling

Figure: Decoupling [Malach and Shalev-Shwartz, 2017].

Easy samples can be quickly learnt and classified (memorization effect).

Decoupling focuses on hard samples, which can be more informative.

Decoupling updates the networks using only the samples in each mini-batch on which the two classifiers disagree in their predictions.

(Solution 2) Can we further attenuate the error from noisy data by utilizing two networks?
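Decoupling’s “when to update” rule reduces to a disagreement mask over the mini-batch (a sketch; the helper name is ours):

```python
import numpy as np

def disagreement_subset(pred1, pred2):
    """Decoupling: keep only the instances on which the two classifiers'
    predicted labels differ; only these are used for the update."""
    return np.flatnonzero(pred1 != pred2)
```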


Co-teaching: Cross-update meets small-loss

Figure: Co-teaching [Han et al., 2018].

Co-teaching maintains two networks (A & B) simultaneously.

Each network samples its small-loss instances based on memorization of neural networks.

Each network teaches such useful instances to its peer network. (Cross-update)


Divergence

Figure: Total variation between the two networks’ predictions vs. epoch (0–200) for Disagreement, Co-teaching, and Co-teaching+.

The two networks in Co-teaching gradually converge to a consensus.
However, the two networks in Disagreement keep diverging.
We bridge the “Disagreement” strategy with Co-teaching to obtain Co-teaching+.
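One plausible way to measure such divergence, assuming each network outputs a probability vector per instance, is the mean total-variation distance between the two prediction distributions (our reading of the figure’s y-axis, not necessarily the paper’s exact implementation):

```python
import numpy as np

def mean_total_variation(p, q):
    """Mean total-variation distance between two batches of probability
    vectors p and q (shape: [n_samples, n_classes])."""
    return 0.5 * np.abs(p - q).sum(axis=1).mean()
```

Identical predictions give 0; maximally diverged predictions give 1.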


How does Disagreement Benefit Co-teaching?

Disagreement-update step: Both networks first feed forward and predict on all the data, and only the prediction-disagreement data are kept.

Cross-update step: On the disagreement data, each network selects its small-loss data, but back-propagates the small-loss data selected by its peer network.


Co-teaching+ Paradigm

1: Input w(1) and w(2), training set D, batch size B, learning rate η, estimated noise rate τ, epochs Ek and Emax;
for e = 1, 2, ..., Emax do
2:   Shuffle D into |D|/B mini-batches; // noisy dataset
  for n = 1, ..., |D|/B do
3:     Fetch the n-th mini-batch D̄ from D;
4:     Select the prediction-disagreement data D′ = {(xi, yi) : ȳ(1)_i ≠ ȳ(2)_i};
5:     Get D′(1) = argmin_{D̄′ ⊆ D′ : |D̄′| ≥ λ(e)|D′|} ℓ(D̄′; w(1)); // sample λ(e)% small-loss instances
6:     Get D′(2) = argmin_{D̄′ ⊆ D′ : |D̄′| ≥ λ(e)|D′|} ℓ(D̄′; w(2)); // sample λ(e)% small-loss instances
7:     Update w(1) = w(1) − η∇ℓ(D′(2); w(1)); // update w(1) by D′(2)
8:     Update w(2) = w(2) − η∇ℓ(D′(1); w(2)); // update w(2) by D′(1)
  end
9:   Update λ(e) = 1 − min{(e/Ek)τ, τ} or 1 − min{(e/Ek)τ, (1 + (e−Ek)/(Emax−Ek))τ}; // memorization helps
end
10: Output w(1) and w(2).

Co-teaching+: Step 4 is the disagreement-update; Steps 5–8 are the cross-update.
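The selection logic of steps 4–6 and the λ(e) schedule can be sketched in NumPy, operating on precomputed per-sample losses and predicted labels; the gradient updates of steps 7–8 are left to the training framework, and the function names are ours:

```python
import numpy as np

def lambda_schedule(e, e_k, tau):
    """Step 9 (first variant): keep ratio lambda(e) = 1 - min{(e/Ek) tau, tau}."""
    return 1.0 - min(e / e_k * tau, tau)

def coteaching_plus_step(loss1, loss2, pred1, pred2, lam):
    """Steps 4-6: select disagreement data, then each network's small-loss
    subset; return them swapped, ready for the cross-update (steps 7-8)."""
    disagree = np.flatnonzero(pred1 != pred2)          # step 4: disagreement-update
    if disagree.size == 0:                             # no disagreement: nothing to update
        empty = np.array([], dtype=int)
        return empty, empty
    k = max(1, int(np.ceil(lam * disagree.size)))
    sel1 = disagree[np.argsort(loss1[disagree])[:k]]   # step 5: D'(1)
    sel2 = disagree[np.argsort(loss2[disagree])[:k]]   # step 6: D'(2)
    return sel2, sel1                                  # network 1 trains on D'(2), network 2 on D'(1)
```

Swapping the two selections at the end is what makes the update a cross-update rather than self-training.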


Relations to other approaches

Table: Comparison of state-of-the-art and related techniques with our Co-teaching+ approach.
“small loss”: regarding small-loss samples as “clean” samples;
“double classifiers”: training two classifiers simultaneously;
“cross update”: updating parameters in a cross manner;
“divergence”: keeping two classifiers diverged during training.

                     MentorNet  Co-training  Co-teaching  Decoupling  Co-teaching+
small loss               ✓           ✗            ✓            ✗            ✓
double classifiers       ✗           ✓            ✓            ✓            ✓
cross update             ✗           ✓            ✓            ✗            ✓
divergence               ✗           ✓            ✗            ✓            ✓


Datasets for CCN model

Table: Summary of data sets used in the experiments.

             # of train  # of test  # of classes  size
MNIST        60,000      10,000     10            28×28
CIFAR-10     50,000      10,000     10            32×32
CIFAR-100    50,000      10,000     100           32×32
NEWS         11,314      7,532      7             1000-D
T-ImageNet   100,000     10,000     200           64×64


Noise Transitions for CCN model

We manually generate class-conditional noisy labels using two types of noise transitions:

(a) Pair (ε = 45%). (b) Symmetry (ε = 50%).

Figure: Different noise transitions (using 5 classes as an example) [Han et al., 2018].
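The two transition matrices can be built explicitly; below is a sketch of the standard constructions (function names are ours):

```python
import numpy as np

def symmetry_T(c, eps):
    """Symmetry flipping: keep the label w.p. 1-eps, spread eps uniformly
    over the remaining c-1 classes."""
    T = np.full((c, c), eps / (c - 1))
    np.fill_diagonal(T, 1.0 - eps)
    return T

def pair_T(c, eps):
    """Pair flipping: keep the label w.p. 1-eps, flip to the next class w.p. eps."""
    T = (1.0 - eps) * np.eye(c)
    for i in range(c):
        T[i, (i + 1) % c] = eps
    return T
```

Each row is a probability distribution over noisy labels given the clean label, matching the CCN definition of T earlier in the talk.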


Baselines

MentorNet: small-loss trick;

Co-teaching: small-loss and cross-update tricks;

Decoupling: updating only on instances where the two classifiers’ predictions differ;

F-correction: loss correction based on the noise transition matrix;

Standard: directly training on the noisy dataset.


Network structures

Table: MLP and CNN models used in our experiments on MNIST, CIFAR-10, CIFAR-100/Open-sets, and NEWS.

MLP on MNIST: 28×28 gray image → Dense 28×28 → 256, ReLU → Dense 256 → 10.

CNN on CIFAR-10: 32×32 RGB image → 5×5 Conv, 6, ReLU → 2×2 Max-pool → 5×5 Conv, 16, ReLU → 2×2 Max-pool → Dense 16×5×5 → 120, ReLU → Dense 120 → 84, ReLU → Dense 84 → 10.

CNN on CIFAR-100/Open-sets: 32×32 RGB image → [3×3 Conv, 64, BN, ReLU] ×2 → 2×2 Max-pool → [3×3 Conv, 128, BN, ReLU] ×2 → 2×2 Max-pool → [3×3 Conv, 196, BN, ReLU] ×2 → 2×2 Max-pool → Dense 256 → 100/10.

MLP on NEWS: 1000-D text → 300-D Embedding → Flatten → 1000×300 → Adaptive avg-pool → 16×300 → Dense 16×300 → 4×300, BN, Softsign → Dense 4×300 → 300, BN, Softsign → Dense 300 → 7.


MNIST

Figure: Test accuracy vs. number of epochs on the MNIST dataset: (a) Pair-45%, (b) Symmetry-50%, (c) Symmetry-20%. Curves: Standard, Decoupling, F-correction, MentorNet, Co-teaching, Co-teaching+.


CIFAR-10

Figure: Test accuracy vs. number of epochs on the CIFAR-10 dataset: (a) Pair-45%, (b) Symmetry-50%, (c) Symmetry-20%. Curves: Standard, Decoupling, F-correction, MentorNet, Co-teaching, Co-teaching+.


CIFAR-100

Figure: Test accuracy vs. number of epochs on the CIFAR-100 dataset: (a) Pair-45%, (b) Symmetry-50%, (c) Symmetry-20%. Curves: Standard, Decoupling, F-correction, MentorNet, Co-teaching, Co-teaching+.


NEWS

Figure: Test accuracy vs. number of epochs on the NEWS dataset: (a) Pair-45%, (b) Symmetry-50%, (c) Symmetry-20%. Curves: Standard, Decoupling, F-correction, MentorNet, Co-teaching, Co-teaching+.


T-ImageNet

Table: Averaged/maximal test accuracy (%) of different approaches on T-ImageNet over the last 10 epochs. The best results are in blue.

Flipping rate    Standard      Decoupling    F-correction  MentorNet     Co-teaching   Co-teaching+
Pair-45%         26.14/26.32   26.10/26.61   0.63/0.67     26.22/26.61   27.41/27.82   26.54/26.87
Symmetry-50%     19.58/19.77   22.61/22.81   32.84/33.12   35.47/35.76   37.09/37.60   41.19/41.77
Symmetry-20%     35.56/35.80   36.28/36.97   44.37/44.50   45.49/45.74   45.60/46.36   47.73/48.20


Open-sets

Open-set noise: an open-set noisy label occurs when a noisy sample’s true class is not contained in the set of known classes in the training data.

Open-sets: a CIFAR-10 noisy dataset with 40% open-set noise drawn from CIFAR-100, ImageNet32, and SVHN.

Figure: Examples of open-set noise for “airplane” in CIFAR-10, drawn from CIFAR-100, ImageNet32, and SVHN [Wang et al., 2018].


Open-sets

Table: Averaged/maximal test accuracy (%) of different approaches on Open-sets over the last 10 epochs. The best results are in blue.

Open-set noise           Standard  MentorNet     Iterative [Wang et al., 2018]  Co-teaching   Co-teaching+
CIFAR-10 + CIFAR-100     62.92     79.27/79.33   79.28                          79.43/79.58   79.28/79.74
CIFAR-10 + ImageNet-32   58.63     79.27/79.40   79.38                          79.42/79.60   79.89/80.52
CIFAR-10 + SVHN          56.44     79.72/79.81   77.73                          80.12/80.33   80.62/80.95


Summary

Conclusion:

This paper presents Co-teaching+, a robust training method for learning with noisy labels.

Three key points towards robust training on noisy labels:

1) use the small-loss trick, based on the memorization effect of deep networks;
2) cross-update the parameters of the two networks;
3) keep the two networks diverged during training.

Future work:

Investigate the theory of Co-teaching+ from the view of disagreement-based algorithms[Wang and Zhou, 2017].


Link to our paper:

Our poster will be: Wed Jun 12th, 06:30–09:00 PM @ Pacific Ballroom #21.

Thank you very much for your attention!


References

Arpit, D., Jastrzebski, S., Ballas, N., Krueger, D., Bengio, E., Kanwal, M. S., Maharaj, T., Fischer, A., Courville, A., Bengio, Y., et al. (2017). A closer look at memorization in deep networks. In International Conference on Machine Learning, pages 233–242.

Han, B., Yao, Q., Yu, X., Niu, G., Xu, M., Hu, W., Tsang, I., and Sugiyama, M. (2018). Co-teaching: Robust training of deep neural networks with extremely noisy labels. In Advances in Neural Information Processing Systems, pages 8527–8537.

Jiang, L., Zhou, Z., Leung, T., Li, L.-J., and Fei-Fei, L. (2018). MentorNet: Learning data-driven curriculum for very deep neural networks on corrupted labels. In International Conference on Machine Learning.

Malach, E. and Shalev-Shwartz, S. (2017). Decoupling “when to update” from “how to update”. In Advances in Neural Information Processing Systems, pages 960–970.

Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al. (2015). ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252.

Wang, W. and Zhou, Z.-H. (2017). Theoretical foundation of co-training and disagreement-based algorithms. arXiv preprint arXiv:1708.04403.

Wang, Y., Liu, W., Ma, X., Bailey, J., Zha, H., Song, L., and Xia, S.-T. (2018). Iterative learning with open-set noisy labels. In IEEE Conference on Computer Vision and Pattern Recognition, pages 8688–8696.
