
Multimodal Self-Supervised Learning for Medical Image Analysis

Aiham Taleb∗1, Christoph Lippert1, Moin Nabi2, and Tassilo Klein2

1Digital Health & Machine Learning, Hasso-Plattner-Institute, Potsdam University, Berlin, Germany
{aiham.taleb, christoph.lippert}@hpi.de
2SAP Machine Learning Research, Berlin, Germany

{m.nabi, tassilo.klein}@sap.com

Abstract

In this paper, we propose a self-supervised learning approach leveraging multiple imaging modalities to increase data efficiency in machine learning for medical imaging. In particular, we introduce multimodal puzzle-solving as a proxy task to facilitate neural network feature learning from multiple image modalities, with subsequent fine-tuning for different target tasks. To achieve that, we employ the Sinkhorn operator to predict permutations of puzzle pieces in conjunction with a modality-agnostic feature embedding. Together, they allow for a lean network architecture and increased computational efficiency. Under this framework, we propose different strategies to permute medical imaging modalities, creating puzzles with varying levels of complexity. We benchmark these strategies in a range of experiments and for different target tasks. Our experiments show that solving puzzles interleaved with multimodal content yields more powerful semantic representations. This allows us to solve downstream tasks more accurately and efficiently, compared to treating each modality independently. Our approach's effectiveness is demonstrated on semantic segmentation of brain tumors as well as the survival regression task in the BraTS challenge, where we achieve results that are competitive with the state of the art at a fraction of the computational expense.

1 Introduction

Generating expert annotations of medical imaging data at scale is an expensive and time-consuming task, especially for 3D scans. In fact, with growing sizes of imaging datasets, expert annotation becomes nearly impossible without computerized assistance [4]. Even current semi-automatic software tools fail to sufficiently reduce the time and effort required for annotation and measurement of these large datasets. Consequently, the scarcity of data and annotations is one of the main constraints for machine learning applications in medical imaging. Self-supervised learning provides a viable alternative when labeled data is scarce. In these approaches, the supervisory signals are derived from the data itself, typically by unsupervised learning of a proxy task. Subsequently, these models facilitate data-efficient supervised fine-tuning on downstream tasks, significantly reducing the burden of manual annotation. The most closely related self-supervised method was proposed by Noroozi and Favaro [10], who solved jigsaw puzzles on natural images as a proxy task. The intuition behind this idea is that, in order to solve a puzzle of sufficient complexity, the model has to understand the objects that appear in the images as well as those objects' parts. In contrast to our approach, their method relies on single-modality inputs only. However, in a medical context, the inclusion of other modalities, e.g., by mixing T1- and T2-weighted scans, should yield more informative data representations that

∗Work done during an internship at SAP SE

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Figure 1: Schematic illustration of the proposed approach, assuming four imaging modalities (Flair, T1ce, T1, T2): (a) we draw patches from random modalities, (b) yielding the ground truth P∗ and a shuffled puzzle P; (c) the puzzle-solving model (convolutional layers followed by a fully connected layer) predicts a soft permutation matrix S; (d) applying S to P yields the reconstruction Prec, which is compared to P∗ with an MSE loss.

benefit the downstream tasks. In addition, their method requires large memory and compute resources, as it integrates 9 replicas of AlexNet [6]. To overcome this limitation, our approach builds upon the Sinkhorn networks proposed by Mena et al. [8], in which the Sinkhorn operator [12, 1] is utilized as an analog of the Softmax operator for permutation-related tasks. We extend the method of Mena et al. [8] to work efficiently with modern architectures [11]. Consequently, as opposed to Noroozi and Favaro's method [10], our approach can solve puzzles at higher levels of complexity.

2 Method

Solving a jigsaw puzzle entails reassembling shuffled image pieces such that, when aligned correctly, the original image is restored. If C is the number of puzzle pieces, then there exist C! possible permutations. However, when the puzzle complexity increases, the association of individual puzzle tiles might be ambiguous. Nevertheless, the placement of different puzzle tiles is mutually exclusive. Therefore, when all tiles are observed at the same time, the positional ambiguities are attenuated. In a conventional jigsaw puzzle, the puzzle pieces originate from only one image at a time, i.e., the computational complexity of solving such a puzzle is O(C!). We instead propose a multimodal jigsaw puzzle extension in which tiles can come from M different modalities. As a result, the complexity of solving multimodal puzzles is O((C!)^M). This quickly becomes prohibitively expensive due to two growth factors in the solution space: i) factorial growth in the number of permutations C!, and ii) exponential growth in the number of modalities M. To reduce the computational burden, we employ two solutions. First, we use the Sinkhorn operator, which efficiently handles the factorial factor, largely following [8]. Second, we employ a feed-forward network G that learns a cross-modal representation, which cancels out the exponential factor M while simultaneously learning a semantically rich representation for downstream tasks.
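To make the multimodal puzzle construction concrete, the following is a minimal sketch of how such puzzles could be assembled from spatially aligned slices. The grid size, the per-tile random choice of modality, and the function name make_multimodal_puzzle are our own illustrative assumptions; the paper does not prescribe these implementation details.

```python
import numpy as np

def make_multimodal_puzzle(slices, grid=3, rng=None):
    """Build one multimodal jigsaw puzzle from spatially aligned 2D slices.

    slices: list of M arrays of shape (H, W), one per modality (e.g. Flair,
            T1, T1ce, T2) of the same patient and slice position.
    Returns the ground-truth tile arrangement P_star, the shuffled puzzle P,
    and the permutation used to shuffle the tiles.
    """
    rng = rng or np.random.default_rng()
    H, W = slices[0].shape
    h, w = H // grid, W // grid
    tiles = []
    for i in range(grid):
        for j in range(grid):
            m = rng.integers(len(slices))  # each position is drawn from a random modality
            tiles.append(slices[m][i * h:(i + 1) * h, j * w:(j + 1) * w])
    P_star = np.stack(tiles)        # correct arrangement (ground truth)
    perm = rng.permutation(len(tiles))
    P = P_star[perm]                # shuffled puzzle fed to the network
    return P_star, P, perm
```

Because each tile position can come from any of the M modalities, the network cannot rely on modality-specific appearance alone, which is what pushes it towards a cross-modal representation.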

To efficiently solve the self-supervised jigsaw puzzle task, we train a network that can learn a permutation. A permutation matrix of size N × N corresponds to some permutation of the numbers 1 to N: every row and column contains exactly one entry equal to 1, with 0s everywhere else, and every permutation corresponds to a unique permutation matrix. Such a discrete parameterization of a permutation is non-differentiable. However, as shown in [8], it can be approximated by a differentiable relaxation, the so-called Sinkhorn operator. Our approach is illustrated in more detail in Figure 1.
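For illustration, the sketch below implements the Sinkhorn operator as alternating row and column normalization in log space, following the general formulation of [8, 12, 1], together with a hypothetical reconstruction loss that applies the resulting soft permutation to the shuffled tiles. The tensor shapes, iteration count, temperature, and loss pairing are assumptions made for this sketch, not the authors' exact implementation.

```python
import torch

def sinkhorn(log_alpha, n_iters=20, tau=1.0):
    """Sinkhorn operator: maps an N x N score matrix to a (near) doubly
    stochastic soft permutation matrix via alternating row/column
    normalization in log space. log_alpha has shape (batch, N, N)."""
    log_alpha = log_alpha / tau
    for _ in range(n_iters):
        log_alpha = log_alpha - torch.logsumexp(log_alpha, dim=2, keepdim=True)  # normalize rows
        log_alpha = log_alpha - torch.logsumexp(log_alpha, dim=1, keepdim=True)  # normalize columns
    return torch.exp(log_alpha)

def reconstruction_loss(S, P, P_star):
    """MSE between the un-shuffled puzzle S^T @ P and the ground truth P_star;
    P and P_star hold flattened tiles of shape (batch, C, tile_dim)."""
    P_rec = torch.bmm(S.transpose(1, 2), P)
    return torch.mean((P_rec - P_star) ** 2)
```

Because every operation here is differentiable, the puzzle-solving network can be trained end-to-end with standard back-propagation, and its encoder can later be reused for the downstream tasks.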

3 Experiments

We evaluate the quality of the representations learned by our auxiliary task of multimodal puzzle-solving by transferring them to other downstream tasks and assessing their impact on downstream performance. We do not use any synthetic data in this section.

3.1 Brain Tumor Segmentation

CNN-based methods have topped the rankings [5, 13, 7, 2] of recent editions of the BraTS segmentation challenge [9, 3], on which we showcase the effectiveness of our method.


Table 1: BraTS segmentation (Dice scores)

Model                            WT      TC      ET
Baseline (from scratch)          80.76   77.07   67.77
Li [7]                           88.30   78.80   72.00
Albiol et al. [2]                87.20   76.00   75.10
Chang et al. [13]                89.00   82.41   76.60
Isensee et al. [5] (3D U-Net)    90.80   84.32   79.59
Our Proposed Method              89.67   83.73   78.54

Table 2: BraTS survival prediction (mean squared error; lower is better)

Model                            MSE
Baseline (from scratch)          112,841
CNN + age                        137,912
Random Forest Reg                152,130
FeatNet + all features           103,878
Lin. Reg. + top 16 features       99,370
Our Proposed Method               97,291

Figure 2: Results in the low-shot data regime: segmentation performance when fine-tuning on 1%, 10%, 50%, and 100% of the labelled data. Our Method reaches roughly 60%, 74%, 80%, and 90%, while training From Scratch reaches roughly 15%, 42%, 72%, and 88%.

However, these methods rely on the existence of large amounts of manual annotations of cancerous tissues. Some of these works [5, 7] use additional training sets and perform multiple augmentation techniques. Others, such as [13, 2], rely on ensemble models, which require significant computing time and resources. We achieve comparable results by simply fine-tuning the representations learned from the self-supervised task, thus requiring much less data, augmentation, and computational resources. In addition to comparing to these methods, we compare to the baseline of training the segmentation model from scratch. Table 1 summarizes these results. We use the U-Net [11] architecture for segmentation, a fully convolutional network (FCN) that significantly improved the state of the art in semantic segmentation of biomedical images and has since found use in multiple computer vision applications. The evaluation metrics we report, following the BraTS challenge, are the Dice scores for the Whole Tumor (WT), the Tumor Core (TC), and the Enhancing Tumor (ET).
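For reference, the Dice score reported in Table 1 is the standard overlap measure between a predicted and a manual binary mask; a minimal version is sketched below. The per-region masks for WT, TC, and ET are assumed to be derived beforehand from the label maps.

```python
import numpy as np

def dice_score(pred, target, eps=1e-7):
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) between two binary masks,
    e.g. a predicted and a manual whole-tumor segmentation."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    intersection = np.logical_and(pred, target).sum()
    return (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)
```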

3.2 Survival Prediction (Regression)

As another example downstream task, we transfer the learned weights of our puzzle-solving model to the regression task of predicting survival days. We perform this task to show the general applicability of the representations learned by our self-supervised method. We reuse the convolutional features and add a fully connected layer with only five units, followed by a single output layer. We also include the patient age as a sixth feature right before the output layer, largely following the baselines set by [14]. In Table 2, we compare to the baseline model trained from scratch, which provides an insight into the benefits of self-supervised pretraining. Additionally, we compare to the baselines of Suter et al. [14], who benchmarked multiple models on this task. Our proposed method outperforms all of these baselines.
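A possible form of this regression head is sketched below in PyTorch; the encoder, pooling, and training schedule are omitted, and the module and layer names are our own assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class SurvivalHead(nn.Module):
    """Regression head on top of pooled convolutional features from the
    pretrained puzzle-solving encoder: reduces them to five features,
    appends the patient age as a sixth feature, and predicts survival days."""
    def __init__(self, feat_dim):
        super().__init__()
        self.reduce = nn.Linear(feat_dim, 5)
        self.out = nn.Linear(5 + 1, 1)  # five learned features + age

    def forward(self, features, age):
        x = torch.relu(self.reduce(features))        # (batch, 5)
        x = torch.cat([x, age.unsqueeze(1)], dim=1)  # (batch, 6)
        return self.out(x)                           # (batch, 1) survival days
```

Training such a head with an MSE objective on the survival days would match the metric reported in Table 2.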

3.3 Low-Shot Data Regime

To assess how our self-supervised task benefits segmentation performance at different training-set sizes, we randomly select subsets of patients comprising 1%, 10%, 50%, and 100% of the total segmentation set. We then compare the performance of our model fine-tuned on these subsets against the baseline trained from scratch. As shown in Figure 2, our method outperforms the baseline by a large margin when only few labelled training samples are used; with as little as 1% of the overall dataset, this margin is largest. This case, in particular, suggests the potential for generic unsupervised features applicable to relevant medical imaging tasks.
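The subset selection can be done at the patient level, as in the hypothetical helper below, so that all slices of a selected patient enter the reduced fine-tuning set together; the function name and seeding are illustrative assumptions.

```python
import numpy as np

def low_shot_subset(patient_ids, fraction, seed=0):
    """Randomly select a fraction of patients (e.g. 0.01, 0.1, 0.5, 1.0)
    whose labelled scans form the reduced fine-tuning set."""
    rng = np.random.default_rng(seed)
    n = max(1, int(round(fraction * len(patient_ids))))
    return rng.choice(np.asarray(patient_ids), size=n, replace=False)
```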


4 Conclusion & Future Work

We demonstrated that self-supervised puzzle-solving in a multimodal context allows for learning powerful semantic representations that facilitate downstream tasks in medical imaging. What is more, our method achieves this with a rather inexpensive training procedure. Our approach leverages unlabelled multimodal medical scans and further reduces the cost of manual annotation required for downstream tasks. The preliminary results of our experiments support this idea, especially those in the low-data regime. Our self-supervised approach provides performance gains on the evaluated downstream tasks. However, to further reduce the performance gap between 2D and 3D models, we plan to extend this work towards 3D multimodal puzzles, making full use of the spatial context.

References

[1] R. P. Adams and R. S. Zemel. Ranking via Sinkhorn propagation. arXiv preprint arXiv:1106.1925, 2011.

[2] A. Albiol, A. Albiol, and F. Albiol. Extending 2D deep learning architectures to 3D image segmentation problems. In Pre-Conference Proceedings of the 7th MICCAI BraTS Challenge, 2018.

[3] S. Bakas, H. Akbari, A. Sotiras, M. Bilello, M. Rozycki, J. S. Kirby, J. B. Freymann, K. Farahani, and C. Davatzikos. Advancing The Cancer Genome Atlas glioma MRI collections with expert segmentation labels and radiomic features. Scientific Data, 4:170117, 2017.

[4] K. Grünberg, O. Jimenez-del Toro, A. Jakab, G. Langs, T. Salas Fernandez, M. Winterstein, M.-A. Weber, and M. Krenn. Annotating Medical Image Data, pages 45–67. Springer International Publishing, Cham, 2017.

[5] F. Isensee, P. Kickingereder, W. Wick, M. Bendszus, and K. H. Maier-Hein. No New-Net. In International MICCAI Brainlesion Workshop, pages 234–244. Springer, 2018.

[6] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.

[7] X. Li. Fused U-Net for brain tumor segmentation based on multimodal MR images. In Pre-Conference Proceedings of the 7th MICCAI BraTS Challenge, 2018.

[8] G. Mena, D. Belanger, S. Linderman, and J. Snoek. Learning latent permutations with Gumbel-Sinkhorn networks. arXiv preprint arXiv:1802.08665, 2018.

[9] B. H. Menze, A. Jakab, S. Bauer, J. Kalpathy-Cramer, K. Farahani, et al. The multimodal brain tumor image segmentation benchmark (BRATS). IEEE Transactions on Medical Imaging, 34(10):1993–2024, 2015.

[10] M. Noroozi and P. Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. CoRR, abs/1603.09246, 2016.

[11] O. Ronneberger, P. Fischer, and T. Brox. U-Net: Convolutional networks for biomedical image segmentation. In MICCAI, 2015.

[12] R. Sinkhorn. A relationship between arbitrary positive matrices and doubly stochastic matrices. The Annals of Mathematical Statistics, 35(2):876–879, 1964.

[13] Y. Chang, Z. Lin, et al. Automatic segmentation of brain tumor from 3D MR images using a 2D convolutional neural network. In Pre-Conference Proceedings of the 7th MICCAI BraTS Challenge, 2018.

[14] Y. Suter, A. Jungo, and M. Reyes. End-to-end deep learning versus classical regression for brain tumor patient survival prediction. In Pre-Conference Proceedings of the 7th MICCAI BraTS Challenge, 2018.
