adaptive feature selection guided deep forest for covid-19

8
1 Adaptive Feature Selection Guided Deep Forest for COVID-19 Classification with Chest CT Liang Sun , Zhanhao Mo , Fuhua Yan , Liming Xia , Fei Shan , Zhongxiang Ding , Wei Shao , Feng Shi, Huan Yuan, Huiting Jiang, Dijia Wu, Ying Wei, Yaozong Gao, Wanchun Gao, He Sui, Daoqiang Zhang * , Dinggang Shen * , IEEE Fellow Abstract—Chest computed tomography (CT) becomes an ef- fective tool to assist the diagnosis of coronavirus disease-19 (COVID-19). Due to the outbreak of COVID-19 worldwide, using the computed-aided diagnosis technique for COVID-19 classification based on CT images could largely alleviate the burden of clinicians. In this paper, we propose an Adaptive Feature Selection guided Deep Forest (AFS-DF) for COVID-19 classification based on chest CT images. Specifically, we first extract location-specific features from CT images. Then, in order to capture the high-level representation of these features with the relatively small-scale data, we leverage a deep forest model to learn high-level representation of the features. Moreover, we propose a feature selection method based on the trained deep forest model to reduce the redundancy of features, where the feature selection could be adaptively incorporated with the COVID-19 classification model. We evaluated our proposed AFS- DF on COVID-19 dataset with 1495 patients of COVID-19 and 1027 patients of community acquired pneumonia (CAP). The accuracy (ACC), sensitivity (SEN), specificity (SPE) and AUC achieved by our method are 91.79%, 93.05%, 89.95% and 96.35%, respectively. Experimental results on the COVID- 19 dataset suggest that the proposed AFS-DF achieves superior performance in COVID-19 vs. CAP classification, compared with L. Sun, W. Shao and D. Zhang are with the College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, MIIT Key Laboratory of Pattern Analysis and Machine Intelligence, Nanjing 211106, China. Z. Mo and H. Sui are with the Department of Radiology, The Third Hospital of Jilin University, Changchun, China. F. Yan is with Department of Radiology, Ruijin Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, China. L. Xia is with Department of Radiology, Tongji Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, Hubei, China. F. Shan is with Department of Radiology, Shanghai Public Health Clinical Center, Fudan University, Shanghai, China. Z. Ding is with the Department of Radiology, Hangzhou First Peoples Hospital, Zhejiang University School of Medicine, Hangzhou, Zhejiang, China. F. Shi, H. Yuan, H. Jiang, D. Wu, Y. Wei, Y. Gao and D. Shen are with the Department of Research and Development, Shanghai United Imaging Intelligence Co., Ltd., Shanghai, China W. Gao is with the Department of Radiology, Qianjiang Central Hospital, Jishou University School of Medicine, Chongqing, China. : Equal contribution * Corresponding authors: D. Zhang ([email protected]) and D.Shen ([email protected]). This work was supported by the National Key Research and Development Program of China (Nos. 2018YFC2001600, 2018YFC2001602) and the Na- tional Natural Science Foundation of China (Nos. 61876082, 61861130366, 61732006, 61902183 and 81871337), the Royal Society-Academy of Medical Sciences Newton Advanced Fellowship (No. NAF\R1\180371), and China Postdoctoral Science Foundation funded project (No. 2019M661831), Hubei Provincial Novel Pneumonia Emergency Science and Technology Project (No. 2020FCA021), the Novel Coronavirus Special Research Foundation of the Shanghai Municipal Science and Technology Commission (No. 20441900600). 4 widely used machine learning methods. Index Terms—COVID-19 Classification, Deep Forest, Feature Selection, Chest CT I. I NTRODUCTION S INCE Decemeber 2019, the outbreak of coronavirus disease-19 (COVID-19) [1], [2] has infected more than three million people worldwide, and causing more than 230, 000 deaths. World Health Organization (WHO) has declared the COVID-19 as a global health emergency on January 30, 2020 [3]. The chest computed tomography (CT) has shown to be useful to assist clinical diagnosis of COVID-19 [4]–[9]. However, the rapid growth of COVID-19 patients results in the shortage of the clinicians and radiologists. It is highly desired to develop automatic methods for computer-aided COVID-19 classification tools with chest CT images. A few machine learning methods have been proposed for COVID-19 classification using chest CT images. For exam- ple, [10] employs a logistic regression method for COVID-19 classification by using clinical and laboratory features. [11], [12] use random forest model with the handcrafted features for COVID-19 classification. Moreover, some deep learning based methods are proposed for the diagnosis of COVID- 19. For instance, [13] leverages a deep learning method to learn the feature representation of the chest CT images, and then uses the learned features for COVID-19 classification by combining decision tree and Adaboost algorithm. In addition, [14] employs an end-to-end network to map the CT images to label space for COVID-19 disease identification. In summary, the existing machine learning methods for COVID-19 diagnosis are mainly based on the handcrafted features or the learned image representation by deep neu- ral networks. However, simply adopting handcrafted features cannot fully utilize the high-level information for COVID-19 classification, while the features learned from neural networks require great effort for parameter tuning with a small amount of medical image data. To this end, in this paper, we propose a novel adaptive fea- ture selection guided deep forest method that takes advantage of the high-level deep features with small number of medical image data for the classification between COVID-19 and CAP. Specifically, as shown in Fig. 1, we first extract the location- specific features from the chest CT image. Then, a deep forest model [15] is introduced to learn the latent high-level repre- sentations of these features, which can effectively describe the arXiv:2005.03264v1 [eess.IV] 7 May 2020

Upload: others

Post on 10-Feb-2022

3 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Adaptive Feature Selection Guided Deep Forest for COVID-19

1

Adaptive Feature Selection Guided Deep Forest forCOVID-19 Classification with Chest CT

Liang Sun†, Zhanhao Mo†, Fuhua Yan†, Liming Xia†, Fei Shan†, Zhongxiang Ding†, Wei Shao†, Feng Shi, HuanYuan, Huiting Jiang, Dijia Wu, Ying Wei, Yaozong Gao, Wanchun Gao, He Sui, Daoqiang Zhang∗, Dinggang

Shen∗, IEEE Fellow

Abstract—Chest computed tomography (CT) becomes an ef-fective tool to assist the diagnosis of coronavirus disease-19(COVID-19). Due to the outbreak of COVID-19 worldwide,using the computed-aided diagnosis technique for COVID-19classification based on CT images could largely alleviate theburden of clinicians. In this paper, we propose an AdaptiveFeature Selection guided Deep Forest (AFS-DF) for COVID-19classification based on chest CT images. Specifically, we firstextract location-specific features from CT images. Then, in orderto capture the high-level representation of these features withthe relatively small-scale data, we leverage a deep forest modelto learn high-level representation of the features. Moreover,we propose a feature selection method based on the traineddeep forest model to reduce the redundancy of features, wherethe feature selection could be adaptively incorporated with theCOVID-19 classification model. We evaluated our proposed AFS-DF on COVID-19 dataset with 1495 patients of COVID-19and 1027 patients of community acquired pneumonia (CAP).The accuracy (ACC), sensitivity (SEN), specificity (SPE) andAUC achieved by our method are 91.79%, 93.05%, 89.95%and 96.35%, respectively. Experimental results on the COVID-19 dataset suggest that the proposed AFS-DF achieves superiorperformance in COVID-19 vs. CAP classification, compared with

L. Sun, W. Shao and D. Zhang are with the College of ComputerScience and Technology, Nanjing University of Aeronautics and Astronautics,MIIT Key Laboratory of Pattern Analysis and Machine Intelligence, Nanjing211106, China.

Z. Mo and H. Sui are with the Department of Radiology, The Third Hospitalof Jilin University, Changchun, China.

F. Yan is with Department of Radiology, Ruijin Hospital, Shanghai JiaoTong University School of Medicine, Shanghai, China.

L. Xia is with Department of Radiology, Tongji Hospital, Tongji MedicalCollege, Huazhong University of Science and Technology, Wuhan, Hubei,China.

F. Shan is with Department of Radiology, Shanghai Public Health ClinicalCenter, Fudan University, Shanghai, China.

Z. Ding is with the Department of Radiology, Hangzhou First PeoplesHospital, Zhejiang University School of Medicine, Hangzhou, Zhejiang,China.

F. Shi, H. Yuan, H. Jiang, D. Wu, Y. Wei, Y. Gao and D. Shen are withthe Department of Research and Development, Shanghai United ImagingIntelligence Co., Ltd., Shanghai, China

W. Gao is with the Department of Radiology, Qianjiang Central Hospital,Jishou University School of Medicine, Chongqing, China.†: Equal contribution∗Corresponding authors: D. Zhang ([email protected]) and D.Shen

([email protected]).This work was supported by the National Key Research and Development

Program of China (Nos. 2018YFC2001600, 2018YFC2001602) and the Na-tional Natural Science Foundation of China (Nos. 61876082, 61861130366,61732006, 61902183 and 81871337), the Royal Society-Academy of MedicalSciences Newton Advanced Fellowship (No. NAF\R1\180371), and ChinaPostdoctoral Science Foundation funded project (No. 2019M661831), HubeiProvincial Novel Pneumonia Emergency Science and Technology Project(No. 2020FCA021), the Novel Coronavirus Special Research Foundationof the Shanghai Municipal Science and Technology Commission (No.20441900600).

4 widely used machine learning methods.

Index Terms—COVID-19 Classification, Deep Forest, FeatureSelection, Chest CT

I. INTRODUCTION

S INCE Decemeber 2019, the outbreak of coronavirusdisease-19 (COVID-19) [1], [2] has infected more than

three million people worldwide, and causing more than 230,000 deaths. World Health Organization (WHO) has declaredthe COVID-19 as a global health emergency on January 30,2020 [3]. The chest computed tomography (CT) has shownto be useful to assist clinical diagnosis of COVID-19 [4]–[9].However, the rapid growth of COVID-19 patients results in theshortage of the clinicians and radiologists. It is highly desiredto develop automatic methods for computer-aided COVID-19classification tools with chest CT images.

A few machine learning methods have been proposed forCOVID-19 classification using chest CT images. For exam-ple, [10] employs a logistic regression method for COVID-19classification by using clinical and laboratory features. [11],[12] use random forest model with the handcrafted featuresfor COVID-19 classification. Moreover, some deep learningbased methods are proposed for the diagnosis of COVID-19. For instance, [13] leverages a deep learning method tolearn the feature representation of the chest CT images, andthen uses the learned features for COVID-19 classification bycombining decision tree and Adaboost algorithm. In addition,[14] employs an end-to-end network to map the CT images tolabel space for COVID-19 disease identification.

In summary, the existing machine learning methods forCOVID-19 diagnosis are mainly based on the handcraftedfeatures or the learned image representation by deep neu-ral networks. However, simply adopting handcrafted featurescannot fully utilize the high-level information for COVID-19classification, while the features learned from neural networksrequire great effort for parameter tuning with a small amountof medical image data.

To this end, in this paper, we propose a novel adaptive fea-ture selection guided deep forest method that takes advantageof the high-level deep features with small number of medicalimage data for the classification between COVID-19 and CAP.Specifically, as shown in Fig. 1, we first extract the location-specific features from the chest CT image. Then, a deep forestmodel [15] is introduced to learn the latent high-level repre-sentations of these features, which can effectively describe the

arX

iv:2

005.

0326

4v1

[ee

ss.I

V]

7 M

ay 2

020

Page 2: Adaptive Feature Selection Guided Deep Forest for COVID-19

2

high-level information within the extracted location-specificfeatures by using a small-scale training data. Intuitively, theuse of the feature selection could promote the performance ofthe classification task. Hence, in our study, we also introducea task-driven feature selection method to adaptively reducethe redundancy of features. In particular, the feature selectionoperation will discard a portion of unimportant features basedon the feature importance which is calculated from trainedforests in each deep forest layer. Hence, the feature selectionand classifier training are adaptively incorporated into a uni-fied framework. Finally, the trained adaptive feature selectionguided deep forest is used for COVID-19 prediction.

The major contributions of this paper are three-fold. First,a deep forest is leveraged to learn the high-level featurerepresentation of the location-specific features from chest CTfor the diagnosis of COVID-19. Second, we proposed anadaptive feature selection method to adaptively select thediscriminative features for the diagnosis of COVID-19. Third,the proposed method is evaluated on the collected COVID-19 dataset, which consists of 1495 COVID-19 patients and1027 CAP patients. Experimental results demonstrate that ourmethod achieves superior classification performance than thecomparison methods.

The rest of the paper is organized as follows. We firstintroduce the materials used in this study and the proposedadaptive feature selection guided deep forest in Section II.Then, in Section III, we present experimental settings andexperimental results. In Section IV, we study the influence ofparameters in the proposed methods and present the limitationsof the current study as well as possible future directions. Wefinally conclude this paper in Section V.

II. MATERIALS AND METHOD

In this section, we first introduce the dataset used in thisstudy. Then, we present the feature extraction procedure forthe location-specific features. Next, we describe the proposedadaptive feature selection guided deep forest (AFS-DF). Fi-nally, we provide the implementation details for our proposedAFS-DF.

A. Materials

A total of 2522 chest CT images are used in our study,provided by The Third Hospital of Jilin University, RuijinHospital of Shanghai Jiao Tong University, Tongji Hospitalof Huazhong University of Science and Technology, Shang-hai Public Health Clinical Center of Fudan University, andHangzhou First Peoples Hospital of Zhejiang University. Inthis dataset, 1495 cases are from the confirmed COVID-19cases diagnosed by positive nucleic acid testing. The other1027 case are from CAP patients. COVID-19 images are ac-quired from Jan. 9, 2020 to Feb. 14, 2020, and CAP images areobtained from Jul. 30, 2018 to Feb. 22, 2020. The demographicinformation of these 2522 subjects is summarized in Table I.

All patients underwent chest CT scans with thin section.Specifically, CT scanners include uCT 780 from UIH, OptimaCT520, Discovery CT750, LightSpeed 16 from GE, AquilionONE from Toshiba, SOMATOM Force from Siemens, and

TABLE IDEMOGRAPHIC INFORMATION OF THE STUDIED 2685 CHEST CT SCANS

FROM COVID-19 DATASET. M/F: MALE/FEMALE

Category Scan # Gender (M/F)

COVID-19 1495 770/725

CAP 1027 488/539

SCENARIA from Hitachi. CT protocol includes: 120 kV ,reconstructed slice thickness ranging from 0.625 to 2 mm,with breath hold at full inspiration. All images are de-identifiedbefore sending for analysis. The study are approved by theInstitutional Review Board of participating institutes. Writteninformed consent is waived due to the retrospective nature ofthe study.

B. Feature Extraction

Similar to [12], we extract the location-specific features(i.e., infection locations and spreading patterns) to representthe chest CT images for diagnosis of COVID-19. Specifically,the chest CT images are first automatically segmented intoinfected lung regions and lung fields bilaterally by using VB-Net [16], [17]. The infected lung regions are mainly related tomosaic sign, ground glass opacity (GGO), lesion-related signs(air bronchogram) and interlobular septal thickening. The lungfields include left lung, right lung, five lung lobes, and eighteenpulmonary segments. Then, we extract four kinds of location-specific handcrafted features, including volume, infected lesionnumber, histogram distribution and surface area from chest CTimages. Meanwhile, we also extract the radiomics features fordescribing the CT images. More details are as follows.

1) Volume features: Based on the segmented infectedlung regions, we extract the total volume of infectedregion, and then calculate the percentage of the infectedregion of the whole lung. Meanwhile, according to thelung field segmentation results, we further extract thevolume and percentage in each lobe and each pulmonarysegment, respectively. Since there are evidence thatpneumonia caused by COVID-19 more likely occurs inboth right and left lungs, we also calculate the infectedlesion difference as well as the percentage differencebetween left and right lungs.

2) Infected lesion number: In comparison to CAP, mostof the COVID-19 infections encompass bilateral lungswith multifocal involvement [6], [7], and COVID-19generally has concentrated infection lesions while CAPshows small in volume and patchy in distribution [18].Therefore, we calculate the features of the total numberof infected regions in the bilateral lungs, lung lobes, andpulmonary segments, respectively.

3) Histogram distribution: The predominant chest CTfindings show that bilateral and peripheral GGO andconsolidation are a radiologic hallmark of COVID-19 [19], [20]. GGO is a pattern of hazy increasedlung opacity with preservation of bronchial and vascularmargins, whereas consolidation is characterized by ahomogeneous increase in lung parenchymal attenuation

Page 3: Adaptive Feature Selection Guided Deep Forest for COVID-19

SUN et al.: ADAPTIVE FEATURE SELECTION GUIDED DEEP FOREST 3

AFS-DF

COVID-19

or

CAP ?

⚫ Volume features⚫ Infected lesion number⚫ Histogram distribution⚫ Surface area⚫ Radiomics features⚫ Age⚫ Gender⚫ ……

Feature Extraction

LR

Classification

Fig. 1. Pipeline of the proposed adaptive feature selection guided deep forest for COVID-19 vs. CAP classification. We first extract the location-specific featuresfrom chest CT. Then, the proposed AFS-DF is leveraged to train the classifier based on the location-specific features. Finally, based on the location-specificfeatures, the trained FAS-DF is adopted for COVID-19 identification task.

COVID-19

or

CAP ?

Loca

tio

n-s

pec

ific

fea

ture

s

...

RF 1

RF N

...

......

FeatureSelection ...

RF 1

RF N

...

......

... RF 1

RF N

...

...

FeatureSelection...

...

...

Layer 1 Layer 2 Layer K

...

Fig. 2. Overview of the proposed adaptive feature selection guided deep forest. Each layer of the proposed adaptive feature selection guided deep forestconsists of N random forests and an adaptive feature selection unit.

that obscures the margins of vessels and airway walls onCT images [18]. To extract the intensity distribution ofthe infected regions in chest CT images, we calculatedthe histogram features of the infected regions.

4) Surface area: In previous study [21], it has been foundthat COVID-19 had a predominate distribution in theposterior and peripheral lung, and the abnormalities oflung parenchyma eventually spread to the central areaand bilateral upper lobes [6]. Therefore, we constructedthe infection surface as well as the lung boundary sur-face. We further calculated the distance of each infectionsurface vertex to the nearest lung boundary surface, andcategorized them into 5 ranges, i.e., 3, 6, 9, 12 and 15voxels. For each feature, the number of infection surfacevertices within each range of distances to the lung wallis calculated. Furthermore, the percentage of infectionvertex number against the number of whole infectionsurface vertices in each range is also considered.

5) Radiomics features: Radiomics features extracted frominfected lesions, including intensity features (e.g., aver-age gray level intensity, range of gray values) and texturefeatures (e.g., gray level co-occurrence matrix, gray-level run-length matrix, gray-level size-zone matrix, and

neighborhood gray-tone difference matrix) are used inour study.

Beside, we also adopt the age and gender into the location-specific features for the diagnosis of COVID-19. In summary,a total of 239 dimensions features are used in our study.

C. Adaptive Feature Selection Guided Deep ForestAs mentioned in Section II-B, we extract location-specific

features from the chest CT images. But, simply using thesefeatures cannot adequately describe the high-level informationfor COVID-19 classification. In this work, we propose anadaptive feature selection guided deep forest to learn the la-tent high-level representation of the extracted location-specificfeatures with adaptive task-driven feature selection process fordiagnosis of COVID-19. The architecture of proposed adaptivefeature selection guided deep forest is shown in Fig. 2.

As shown in Fig. 2, each layer of proposed adaptive featureselection guided deep forest consists of N independent randomforests and a feature selection unit. Here, each random forestproduces a probability distribution of the COVID-19 and CAP(the yellow rectangle in Fig. 2). Then, the N probability dis-tribution vector of the COVID-19 and CAP are concatenatedwith the input feature vector. To reduce the redundancy of

Page 4: Adaptive Feature Selection Guided Deep Forest for COVID-19

4

the features, we further perform an adaptive feature selectionoperation. In particular, for each trained random forest, wecan calculate the feature importance for each feature withinthe input feature vector. Thus, we calculate the overall featureimportance ci for i-th feature as follows,

ci =1

N

N∑n=1

ci,n (1)

where ci,n is the feature importance for i-th feature in n-thrandom forest. Herein, we discard the features with low featureimportance by a specific ratio based on the calculated featureimportance. Hence, the feature selection and classifier trainingare adaptively incorporated into a unified framework. Thus, theselected feature vector as the input of the next layer. Finally,we cascade multiply layers to learn the deep discriminativefeature representation for COVID-19 classification task.

D. Implementation

As shown in Fig. 2, in the proposed adaptive featureselection guided deep forest, we employ a Xgboost [22] with20 trees, a random forest [23] with 20 trees, and two extremelyrandomized trees [24] with 20 and 50 trees, respectively. Weempirically set the feature discard ratio as 0.2 in our study.In the training stage, we feed the extracted location-specificfeatures to the adaptive feature selection guided deep forest.The training set further is divided to 5 subsect for 5-fold cross-validation. In particular, each subset is sequentially selected asthe validation set, while the remaining four subsets are treatedas training set for model construction. Thus, the numbersof cascade layers and selected features are automaticallydetermined by using the cross-validation strategy. Hence, theadaptive feature selection guided deep forest can adaptivelytrain the feature selection and classification model in a task-driven manner.

In the testing stage, we also feed the location-specificfeatures of test subject to the trained adaptive feature selectionguided deep forest model. In the last layer, each forest willproduce a probability distribution p for the identification ofCOVID-19. For each subject, we use the following equationto ensemble the predicted value for diagnose of COVID-19,

p(l = c) =1

N

N∑n=1

p(l = c|n) (2)

where p(l = c|n) is the probability of subject belongs tocategory c (i.e., COVID-19 or CAP) that is provided by then-th forest in last layer. Finally, we use the MAP criterion toobtain the label for each subject, i.e., argmaxcp(l = c)

III. EXPERIMENT

In this section, we first illustrate the competing methodsand experimental settings in our study. Then, we presentexperimental results achieved by different methods on thechest CT images with 1495 patients of COVID-19 and 1027patients of CAP.

A. Competing Methods

In our experiments, we compare our proposed AFS-DF withthe following four widely adopted machine learning methods.

1) Logistic Regression (LR): A Logistic regressionmethod is employed for COVID-19 classification byusing the extracted location-specific features.

2) Support Vector Machine (SVM): The extractedlocation-specific features are fed into the SVM classi-fier by using radial basis function kernel with defaultparameters.

3) Random Forests (RF): The random forest classifier isapplied on the location-specific features for COVID-19classification, and the number of trees in random forestis set as 500 via cross-validation.

4) Neural Networks (NN): In this method, a fully con-nection neural network is employed for COVID-19classification. Specifically, we empirically set the mini-batch size as 64, the number of epochs as 100, and thelearning rate as 0.001.

B. Experimental Settings

For all extracted location-specific features from chest CTimages, we first perform the normalization with center 0 anddeviation 1 for each feature. Then, we employ a 5-fold cross-validation strategy to evaluate the performance of the proposedAFS-DF as well as its competitors. Specifically, all subjects arerandomly partitioned into 5 subsets. Each subset is sequentiallyselected as the testing set, while the remaining four subsetsare treated as training set for model construction, and the finalclassification results are the average over the five-fold cross-validations.

In order to measure the classification performance of dif-ferent methods, four evaluation metrics are adopted, includingclassification accuracy (ACC), sensitivity (SEN), specificity(SPE) and the Area Under the receiver operating characteristicCurve (AUC) [25]. Here, the ACC, SEN and SPE are definedas,

ACC =TP + TN

TP + TN + FP + FN(3)

SEN =TP

TP + FN(4)

SPE =TN

TN + FP(5)

where TP, TN, FP and FN in Eqs.3–5 represent True Positive,True Negative, False Positive, and False Negative, respectively.

C. Classification Performance

We evaluate the COVID-19 vs. CAP classificationin 5-foldcross-validation on the collected chest CT images dataset.Table II shows the quantitative results (i.e., ACC, SEN, SPEand AUC) achieved by different methods.

From Table II, we can observe that our AFS-DF archivesthe best results in the terms of ACC, SEN, SPE and AUC.

Page 5: Adaptive Feature Selection Guided Deep Forest for COVID-19

SUN et al.: ADAPTIVE FEATURE SELECTION GUIDED DEEP FOREST 5

TABLE IIPERFORMANCE OF COVID-19 VS. CAP CLASSIFICATION ACHIEVED BY LR, SVM, RF, NN AND AFS-DF. THE TERMS a AND b IN “a± b” DENOTE THE

MEAN AND STANDARD DEVIATION, RESPECTIVELY.

Method ACC (%) SEN (%) SPE (%) AUC (%)

LR 89.81± 1.63 91.64± 1.69 87.10± 2.14 95.52± 1.31

SVM 89.97± 1.37 91.51± 2.04 87.70± 0.89 95.62± 0.91

RF 89.41± 1.67 90.51± 0.87 87.85± 1.01 95.40± 1.21

NN 89.96± 1.58 92.66± 1.57 86.04± 1.97 95.73± 1.00

AFS-DF 91.79± 1.04 93.05± 1.67 89.95± 1.28 96.35± 1.13

In particular, our AFS-DF method achieves the highest clas-sification accuracy (i.e., 91.79%), which is better than theLR (i.e., 89.81%), SVM (i.e., 89.97%), RF (i.e., 89.41%)and NN (i.e., 89.53%). In general, the proposed AFS-DFachieves 1.98%, 1.82%, 2.38% and 1.83% improvements interms of ACC over the LR, SVM, RF and NN, respectively.The possible reason for improvements is that our AFS-DFnot only leverages the high-level feature representation byusing the deep forest to improve the performance of COVID-19 vs. CAP, but also employs the discriminative features byusing the adaptive feature selection process. The deep methods(i.e., NN and AFS-DF) show better classification accuracyin task of diagnosis of COVID-19 when compared with theconventional classifiers (i.e., LR, SVM and RF). The possiblereason is that, the deep models can learn the high-level featurerepresentative, which can boost the classification performance.It is worth noting that COVID-19 is the highly contagiousdisease, the higher SEN should have practically meaningfuladvantage for timely COVID-19 diagnosis to prevent thespread of the COVID-19. The SEN achieved by our AFS-DFfor COVID-19 vs. CAP is 93.05%, which is better than otherbaseline methods. These results imply that using the adaptivefeature selection guided deep forest model can improve theidentification ability of COVID-19.

As shown in Fig 3, our proposed AFS-DF method producesthe best classification performance when compared with thebaseline methods. These results further validate that usingthe high-level representation and adaptive feature selectionstrategy could improve the performance for COVID-19 vs.CAP classification. The statistics of feature importance ofextracting location-specific features is shown in Fig. 4. Wecalculate the sum of feature importance of location-specificfeatures over the last layer in 5 trained AFS-DF model. Fig. 4shows that the surface area features have importance influencefor COVID-19 classification.

IV. DISCUSSION

In this section, we first compare our proposed adaptivefeature selection guided deep forest with several state-of-the-art methods for COVID-19 classification. Then, we study theinfluence of the adaptive feature selection strategy and theselected deep features that are learned by adaptive featureselection guided deep forest. Finally, we present the limitationsof this work as well as possible future research directions.

Fig. 3. ROC curves achieved by LR, SVM, RF, NN and AFS-DF in COVID-19 vs. CAP classification.

TABLE IIICOMPARISON WITH STATE-OF-THE-ART METHODS FOR COVID-19

CLASSIFICATION.

Method ACC (%) SEN (%) SPE (%) AUC (%)

Wang et al. [13] 73.1 67.0 74.0 78.0

Xu et al. [14] 86.7 − − −Shi et al. [10] 89.0 82.2 82.8 89.0

Tang et al. [11] 87.5 93.3 74.5 91.0

Shi et al. [12] 87.9 90.7 83.3 94.2

AFS-DF 91.79 93.05 89.95 96.35

A. Comparison with State-of-the-art Methods

Since several attempts have been made for COVID-19classification, we now compare our proposed AFS-DF withstate-of-the-art methods. [13] employs a convolutional neuralnetworks (CNN) to extract the features of chest CT images,and then combines the decision tree and Adaboost to producethe classification result of COVID-19 vs. typical viral pneu-monia (a total of 670 CT scans). [14] leverages a ResNet [26]to predict the COVID-19, Influenza-A viral pneumonia, andhealthy cases (a total of 618 CT scans). [10] extracts clinical

Page 6: Adaptive Feature Selection Guided Deep Forest for COVID-19

6

Fig. 4. The top 30 important features of location-specific features in AFS-DF.

and laboratory features and uses a logistic regression modelfor non-severe and severe patient classification (a total of 196CT scans). [11] extracts quantitative features, and introduced arandom forest method to assess the severity of the COVID-19patient (a total of 176 CT scans). [12] extracts location-specifichandcrafted features, and proposed an infection size-adaptiverandom forest for COVID-19 classification (a total of 2685CT scans).

The results are reported in Table III. One can observe fromTable III that the proposed AFS-DF shows competitive clas-sification performance for COVID-19 patient identification.The underlying reason could be that our AFS-DF can utilizethe high-level discriminative representation of the extractedfeatures.

B. Influence of Adaptive Feature Selection

To study the effectiveness of the proposed adaptive featureselection, we compare it to two feature selection methods(i.e., Lasso and ElasticNet [27]) and its variant (i.e., DeepForest (DF)). Lasso is used to select a discriminative sub-set of features from the feature vector by using l1-normsparsity constraint. The parameter for the sparsity constraintin this method is set as 0.001 by using cross-validation.In ElasticNet, a l1-norm is leveraged to reduce dimensionof extracted features. Also, a l2-norm is further introducedinto the classification model to ensure the smoothness of thelinear model. The parameters for the sparsity constraint andsmoothness constraint in ElasticNet are set as 0.001 and 0.1by using cross-validation, respectively. DF is a variant of theproposed AFS-DF, which employs the same architecture as

TABLE IVPERFORMANCE OF COVID-19 VS. CAP CLASSIFICATION BY USING

LASSO, ELASTICNET, DF AND AFS-DF. THE TERMS a AND b IN “a± b”DENOTE THE MEAN AND STANDARD DEVIATION, RESPECTIVELY.

Method ACC (%) SEN (%) SPE (%)

Lasso 90.33± 0.87 92.51± 1.25 87.14± 0.53

ElasticNet 90.25± 1.26 92.71± 1.41 86.68± 1.35

DF 91.16± 1.34 92.85± 1.54 88.65± 1.59

AFS-DF 91.79± 1.04 93.05± 1.67 89.95± 1.28

our AFS-DF model without feature selection block. Note thatthis variants method has same parameters setting with AFS-DF for training and test in our study. The experimental resultsare reported in Table IV.

As shown in Table IV, the proposed AFS-DF achievesthe best classification performance, when compared withLasso, ElasticNet and DF. In particular, compared with thefeature selection methods (i.e., Lasso and ElasticNet), theproposed AFS-DF achieves better performance by using thehigh-level feature representation. Meanwhile, by using theadaptive feature selection operation, the AFS-DF achieve abetter performance when compared with DF. These resultsimply the effectiveness of the proposed AFS-DF.

C. Influence of Features

To evaluate the effectiveness of the selected deep features(e.g., the features used in the last layer of AFS-DF) forCOVID-19 vs. CAP classification, we further develop threemethods based on LR, SVM and RF by using the selecteddeep features (i.e., AFSDF-LR, AFSDF-SVM and AFSDF-RF). We evaluate these six methods for COVID-19 vs. CAPclassification, with the results reported in Table V.

As can be seen from Table V, the proposed AFSDF-LR,AFSDF-SVM and AFSDF-RF outperform their counterparts(i.e., LR, SVM and RF) in most of evaluation metrics. Ofnote, the proposed methods consistently achieve better resultsin terms of ACC and SEN. For example, AFSDF-LR, AFSDF-SVM and AFSDF-RF achieve 1.38%, 1.15% and 1.11%improvement over LC, SVM and RF in terms of ACC forCOVID-19 vs. CAP classification, respectively. Comparedwith LC, SVM and RF, the AFSDF-LR, AFSDF-SVM andAFSDF-RF also show improvement in terms of SEN forCOVID-19 vs. CAP classification, respectively. The possiblereason is that, with the selected deep features by using AFS-DF, the features include the high-level and discriminative in-formation. Hence, the conventional machine learning methods(i.e., LR, SVM and RF) can use these selected deep featuresto improve the performance of COVID-19 classification task.

Beside, In Fig 5, we plot the original location-specificfeatures, the features of the last layer in DF, and the featuresof the last layer in AFS-DF, which perform dimensionalityreduction by using t-SNE [28]. As shown in Fig 5, our AFS-DF produces more discriminative features for COVID-19 vs.CAP classification.

Page 7: Adaptive Feature Selection Guided Deep Forest for COVID-19

SUN et al.: ADAPTIVE FEATURE SELECTION GUIDED DEEP FOREST 7

TABLE VPERFORMANCE OF COVID-19 VS. CAP CLASSIFICATION BY USING LR,AFSDF-LR, SVM, AFSDF-SVM, RF AND AFSDF-RF. THE TERMS a

AND b IN “a± b” DENOTE THE MEAN AND STANDARD DEVIATION,RESPECTIVELY.

Method ACC (%) SEN (%) SPE (%)

LR 89.81± 1.63 91.64± 1.69 87.10± 2.14

AFSDF-LR 91.19± 1.11 92.52± 1.51 89.08± 2.00

SVM 89.97± 1.37 91.51± 2.04 87.70± 0.89

AFSDF-SVM 91.12± 1.50 92.31± 1.82 89.38± 1.52

RF 89.41± 1.67 90.51± 0.87 87.85± 1.01

AFSDF-RF 90.52± 1.21 92.51± 1.27 87.57± 2.50

(a)

(b)

(c)

Fig. 5. Visual illustration of original location-specific features (a), the featuresof the last layer in DF (b), and the features of the last layer in AFS-DF (c).

D. Limitations and Future Work

There are still several limitations in the current study. First,the adaptive feature selection guided deep forest is only vali-dated on COVID-19 vs. CAP classification task. In the future,we plan to perform our proposed method on other COVID-19classification tasks (e.g., COVID-19 vs. normal, severe patientsvs. non-severe patients, etc.). Second, we extract the handcraftfeatures by using prior knowledge in current work; in future,the features learned by deep learning method are expectedto leverage our proposed method for further performanceimprovement.

V. CONCLUSION

In this paper, we propose an adaptive feature selectionguided deep forest for COVID-19 vs. CAP classification byusing the chest CT images. Specifically, the AFS-DF uses thedeep forest to learn the high-level representation based onthe location-specific features. Meanwhile, an adaptive featureselection operation is employed to reduce the redundancyof features based on the trained forest. Experimental resultson the collected COVID-19 dataset with 1495 COVID-19cases and 1027 CAP cases show that our proposed AFS-DF approach can achieve superior performance on COVID-19classification with chest CT images in comparison with severalexisting methods.

REFERENCES

[1] Z. Y. Zu, M. D. Jiang, P. P. Xu, W. Chen, Q. Q. Ni, G. M. Lu, andL. J. Zhang, “Coronavirus disease 2019 (COVID-19): A perspective fromChina,” Radiology, p. 200490, 2020.

[2] Y. J. Choe et al., “Coronavirus disease-19: The first 7,755 cases in theRepublic of Korea,” medRxiv, 2020.

[3] C. Sohrabi, Z. Alsafi, N. ONeill, M. Khan, A. Kerwan, A. Al-Jabir,C. Iosifidis, and R. Agha, “World Health Organization declares globalemergency: A review of the 2019 novel coronavirus (covid-19),” Inter-national Journal of Surgery, 2020.

[4] J. Wu, X. Wu, W. Zeng, D. Guo, Z. Fang, L. Chen, H. Huang, andC. Li, “Chest CT findings in patients with corona virus disease 2019and its relationship with clinical features,” Invest Radiol, vol. 10, 2020.

[5] Y. Fang, H. Zhang, J. Xie, M. Lin, L. Ying, P. Pang, and W. Ji,“Sensitivity of chest CT for COVID-19: comparison to RT-PCR,”Radiology, p. 200432, 2020.

[6] Y. Li and L. Xia, “Coronavirus disease 2019 (COVID-19): Role of chestCT in diagnosis and management,” American Journal of Roentgenology,pp. 1–7, 2020.

[7] M. Chung, A. Bernheim, X. Mei, N. Zhang, M. Huang, X. Zeng, J. Cui,W. Xu, Y. Yang, Z. A. Fayad et al., “CT imaging features of 2019 novelcoronavirus (2019-nCoV),” Radiology, p. 200230, 2020.

[8] M. Li, P. Lei, B. Zeng, Z. Li, P. Yu, B. Fan, C. Wang, Z. Li, J. Zhou,S. Hu et al., “Coronavirus disease (covid-19): Spectrum of CT findingsand temporal progression of the disease,” Academic Radiology, 2020.

[9] C. Long, H. Xu, Q. Shen, X. Zhang, B. Fan, C. Wang, B. Zeng, Z. Li,X. Li, and H. Li, “Diagnosis of the coronavirus disease (covid-19): rRT-PCR or CT?” European Journal of Radiology, p. 108961, 2020.

[10] W. Shi, X. Peng, T. Liu, Z. Cheng, H. Lu, S. Yang, J. Zhang,F. Li, M. Wang, X. Zhang et al., “Deep learning-based quantitativecomputed tomography model in predicting the severity of COVID-19:A retrospective study in 196 patients,” 2020.

[11] Z. Tang, W. Zhao, X. Xie, Z. Zhong, F. Shi, J. Liu, and D. Shen, “Sever-ity assessment of coronavirus disease 2019 (COVID-19) using quantita-tive features from chest CT images,” arXiv preprint arXiv:2003.11988,2020.

[12] F. Shi, L. Xia, F. Shan, D. Wu, Y. Wei, H. Yuan, H. Jiang, Y. Gao, H. Sui,and D. Shen, “Large-scale screening of COVID-19 from communityacquired pneumonia using infection size-aware classification,” arXivpreprint arXiv:2003.09860, 2020.

Page 8: Adaptive Feature Selection Guided Deep Forest for COVID-19

8

[13] S. Wang, B. Kang, J. Ma, X. Zeng, M. Xiao, J. Guo, M. Cai, J. Yang,Y. Li, X. Meng et al., “A deep learning algorithm using CT images toscreen for Corona Virus Disease (covid-19),” medRxiv, 2020.

[14] X. Xu, X. Jiang, C. Ma, P. Du, X. Li, S. Lv, L. Yu, Y. Chen, J. Su,G. Lang et al., “Deep learning system to screen coronavirus disease2019 pneumonia,” arXiv preprint arXiv:2002.09334, 2020.

[15] Z.-H. Zhou and J. Feng, “Deep forest: Towards an alternative to deepneural networks,” in Proceedings of the Twenty-Sixth International JointConference on Artificial Intelligence, IJCAI-17, 2017, pp. 3553–3559.

[16] F. Shan, Y. Gao, J. Wang, W. Shi, N. Shi, M. Han, Z. Xue, D. Shen,and Y. Shi, “Lung infection quantification of COVID-19 in CT imageswith deep learning,” arXiv preprint arXiv:2003.04655, 2020.

[17] F. Milletari, N. Navab, and S.-A. Ahmadi, “V-net: Fully convolutionalneural networks for volumetric medical image segmentation,” in 2016Fourth International Conference on 3D Vision (3DV). IEEE, 2016, pp.565–571.

[18] D. M. Hansell, A. A. Bankier, H. MacMahon, T. C. McLoud, N. L.Muller, and J. Remy, “Fleischner Society: glossary of terms for thoracicimaging,” Radiology, vol. 246, no. 3, pp. 697–722, 2008.

[19] X. Li, X. Zeng, B. Liu, and Y. Yu, “Covid-19 infection presenting withCT Halo Sign,” Radiology: Cardiothoracic imaging, vol. 2, no. 1, p.e200026, 2020.

[20] A. Bernheim, X. Mei, M. Huang, Y. Yang, Z. A. Fayad, N. Zhang,K. Diao, B. Lin, X. Zhu, K. Li et al., “Chest CT findings in coro-navirus disease-19 (COVID-19): relationship to duration of infection,”Radiology, p. 200463, 2020.

[21] F. Song, N. Shi, F. Shan, Z. Zhang, J. Shen, H. Lu, Y. Ling, Y. Jiang, andY. Shi, “Emerging 2019 novel coronavirus (2019-nCoV) pneumonia,”Radiology, p. 200274, 2020.

[22] T. Chen and C. Guestrin, “Xgboost: A scalable tree boosting system,”in Proceedings of the 22nd acm sigkdd international conference onknowledge discovery and data mining, 2016, pp. 785–794.

[23] A. Liaw, M. Wiener et al., “Classification and regression by random-Forest,” R news, vol. 2, no. 3, pp. 18–22, 2002.

[24] P. Geurts, D. Ernst, and L. Wehenkel, “Extremely randomized trees,”Machine learning, vol. 63, no. 1, pp. 3–42, 2006.

[25] G. S. Fletcher, Clinical epidemiology: the essentials. LippincottWilliams & Wilkins, 2019.

[26] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for imagerecognition,” in Proceedings of the IEEE conference on computer visionand pattern recognition, 2016, pp. 770–778.

[27] H. Zou and T. Hastie, “Regularization and variable selection via theelastic net,” Journal of the royal statistical society: series B (statisticalmethodology), vol. 67, no. 2, pp. 301–320, 2005.

[28] L. v. d. Maaten and G. Hinton, “Visualizing data using t-SNE,” Journalof machine learning research, vol. 9, no. Nov, pp. 2579–2605, 2008.