Detecting and Explaining Self-Admitted Technical Debts with Attention-based Neural Networks

Xin Wang, School of Computer Science, Wuhan University, Wuhan, China ([email protected])
Jin Liu*, School of Computer Science, Wuhan University, Wuhan, China ([email protected])
Li Li, Faculty of Information Technology, Monash University, Melbourne, Australia ([email protected])
Xiao Chen, Faculty of Information Technology, Monash University, Melbourne, Australia ([email protected])
Xiao Liu, School of Information Technology, Deakin University, Geelong, Australia ([email protected])
Hao Wu, School of Information Science and Engineering, Yunnan University, Kunming, China ([email protected])

ABSTRACT

Self-Admitted Technical Debt (SATD) is a sub-type of technical debt. The term is used to represent technical debts that are intentionally introduced by developers in the process of software development. While providing short-term benefits, the introduction of SATDs often has to be paid back later at a higher cost, e.g., by introducing bugs into the software or increasing its complexity.

To cope with these issues, our community has proposed various machine learning-based approaches to detect SATDs. These approaches, however, either are not generic, usually requiring manual feature engineering effort, or do not provide reliable means to explain the predicted outcomes. To that end, we propose to the community a novel approach, namely HATD (Hybrid Attention-based method for self-admitted Technical Debt detection), to detect and explain SATDs using attention-based neural networks. Through extensive experiments on 445,365 comments in 20 projects, we show that HATD is effective in detecting SATDs on both in-the-lab and in-the-wild datasets under both within-project and cross-project settings. HATD also outperforms the state-of-the-art approaches in detecting and explaining SATDs.

CCS CONCEPTS

• Software and its engineering → Software notations and tools; Software maintenance tools.

KEYWORDS

Self-Admitted Technical Debt, Word Embedding, Attention-based Neural Networks, SATD

*Jin Liu is the corresponding author.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
ASE '20, September 21–25, 2020, Virtual Event, Australia
© 2020 Association for Computing Machinery.
ACM ISBN 978-1-4503-6768-4/20/09 ... $15.00
https://doi.org/10.1145/3324884.3416583

ACM Reference Format:
Xin Wang, Jin Liu, Li Li, Xiao Chen, Xiao Liu, and Hao Wu. 2020. Detecting and Explaining Self-Admitted Technical Debts with Attention-based Neural Networks. In 35th IEEE/ACM International Conference on Automated Software Engineering (ASE '20), September 21–25, 2020, Virtual Event, Australia. ACM, New York, NY, USA, 12 pages. https://doi.org/10.1145/3324884.3416583

1 INTRODUCTION

Technical debt is a metaphor, proposed by Cunningham [6], used to denote a suboptimal solution in which developers take shortcuts to achieve rapid delivery during development. Some typical examples of technical debt include introducing hard-coded values into the code, making code changes while ignoring failing unit tests, copy-pasting code from other modules, etc. These technical debts are usually introduced into a software system unintentionally by developers, or intentionally but without being adequately documented. As time goes by, those undocumented technical debts, if not resolved in time, may fade away from the developers' minds and become equivalent to unintentionally introduced technical debts [2, 45]. To mitigate this issue, developers may decide to document (via comments) their intentionally introduced technical debts. The documented debt is often known as self-admitted technical debt (SATD in short).

Whether introduced intentionally or unintentionally, the compromises made when introducing technical debts will likely have negative effects on the maintenance of the software in the long run [18, 21, 23, 27, 35, 53]. Indeed, as argued by Megan Horn¹, technical debts may cause software systems to behave unexpectedly, and those surprising behaviors can be notoriously difficult to test and fix. Often, fixing one issue introduces several new issues, resulting in high costs for resolving technical debts. As revealed in a recent study by CAST Software², a provider of software analysis and measurement tools, the technical debt that has to be addressed after an application is deployed in production amounts, on average, to over one million dollars for each business application.

¹ https://www.farreachinc.com/blog/far-reach/2017/10/05/the-true-cost-of-technical-debt
² https://www.itbusinessedge.com/slideshows/show.aspx?c=93589


Table 1: Examples of SATD comments.

Project      SATD Comments
Apache Ant   // TODO: allow user to request the system or no parent
             // What is the property supposed to be?
             // cannot remove underscores due to protected visibility >:(
JFreeChart   // do we need to update the crosshair values?
             // defer argument checking...
             // do we need to update the crosshair values?
JMeter       // Don't try restoring the URL TODO: wy not?
             // Can be null (not sure why)
             // Maybe move to vector if MT problems occur
SQuirrel     // Do we need this one.
             // is this right???
             // verify this
ArgoUML      // Why does the next part not work?
             // Shouldn't we throw an exception here?!?!
             // The following can be removed if selectAll gets fixed

Despite the negative impacts that technical debts have on the maintenance of software, technical debt is still widespread and seems to be unavoidable in software systems [3, 12, 47, 54]. Hence, there is a strong need to invent automated approaches to detect technical debts and fix them as early as possible. Nevertheless, it is a challenging endeavor to handle all types of technical debts at once [25, 38]. As an initial step towards achieving such a purpose, many state-of-the-art works start by focusing on the detection of SATDs [10, 14, 17, 58]. Unlike other technical debts, SATDs are often highlighted as comments in the code of the software [30, 41, 59]. Table 1 enumerates some SATD examples identified in popular open-source projects. For example, the comment "May be replaced later" in the JMeter project indicates that the current code is a temporary solution and needs to be replaced in the future.

Although explicitly highlighted, as shown in Table 1, SATDs are still largely present in open-source software projects. Our community has hence spent various efforts on experimentally characterizing SATDs. For example, Potdar et al. [38] manually explore SATDs in code comments of Java projects and eventually summarize 62 patterns that can be used to identify SATD. Huang et al. [16] propose a text-mining method to identify SATD and achieve substantial improvements compared with pattern-based methods. Yan et al. [56] propose a change-level method for detecting SATDs utilizing 25 software change features.

Unfortunately, existing state-of-the-art approaches either involve heavy manual pre-processing (such as identifying SATD-relevant patterns or extracting features to feed ML-based classifiers) or lack reliable explanations to elucidate why a SATD is flagged as such [4, 43, 50, 51, 60]. To fill this gap, we propose to the community a novel deep learning-based approach to detect and explain self-admitted technical debts in open-source software projects. We design and implement a prototype tool called HATD. First, it leverages a positional encoder and a Bi-directional Long Short-Term Memory (Bi-LSTM) [64] network to capture the sequential characteristics of SATDs from code comments. Then, it leverages diverse attention mechanisms to highlight the importance of the automatically prepared features that contribute to the detection of SATDs. The most important features are subsequently leveraged to explain the classification result, e.g., why a comment is (not) flagged as SATD.

Figure 1: Distribution of changes made to SATD-involved and non-SATD-involved source code files.

Through extensive experiments on 445,365 comments in 20 projects, HATD demonstrates its superior performance and explainability in both within-project and cross-project SATD detection over the state-of-the-art methods.

The main contributions of this paper can be summarized as follows:
• We provide an overview of code comment characteristics that could be challenging for creating automated tools to detect SATDs in software projects.
• We design and implement a prototype tool called HATD, which leverages a hybrid attention-based method combining both single-head attention and multi-head attention mechanisms for pinpointing SATDs from code comments.
• We demonstrate the effectiveness of HATD by comparing it with state-of-the-art methods on open-source benchmark projects, detecting SATDs in real-world and popular software projects, and providing explanations for the classification results.

2 MOTIVATION AND BACKGROUND

To help readers better understand our work and to motivate the necessity of developing automated approaches for pinpointing SATDs, we first present in this section a simple empirical investigation that checks whether SATDs impact the maintainability of software projects. Then, we enumerate some characteristics of code comments that make it challenging to automatically interpret the semantics of the comments (i.e., to pinpoint self-admitted technical debts).

2.1 SATD vs. Code Changes

Intuitively, SATDs will trigger more future code changes, i.e., increase the maintenance effort of the software project [39, 55]. To quantify the relationship between SATDs and code changes, we conduct a simple empirical study (counting the number of commits related to files that have been changed, with and without technical debts) to check whether SATDs trigger more future code changes. To this end, we select a popular open-source project called Apache Ant for the experiments. This project is chosen because the SATDs introduced over its development are known. Indeed, as of Jan 14, 2016, the Apache Ant project had introduced in total 131 SATDs, accounting for roughly 3.2% of its total comments (cf. 4,137).


In order to explore the long-term impact of SATDs on the maintenance of the project, we set a cutting point on Jan 1, 2014 and recount the SATDs introduced up to that point. Among the 131 SATDs, 111 had already been introduced into the project and had not been resolved, involving 82 Java files. In other words, at that point, 1,116 Java files were not involved with SATDs. Starting from this cutting point, we revisit all upcoming commits and count the number of changes related to each Java file. The final results are plotted in Figure 1, which illustrates the distribution of changes made to SATD-involved and non-SATD-involved Java files. It is quite clear that SATD-involved files require more changes compared to the files without SATDs. This difference is further confirmed by a Mann-Whitney-Wilcoxon (MWW) test, i.e., the difference is significant at a significance level³ of 0.001. This simple empirical investigation shows that SATDs indeed negatively impact the maintenance of software projects, and hence there is a strong need to invent promising approaches to detect and resolve them [13].
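To make the procedure concrete, the following is a minimal sketch (not the authors' script) of how the per-file change counts and the Mann-Whitney-Wilcoxon test can be computed; the repository path, the cutting point, and the two file lists are assumed inputs, and GitPython/SciPy are assumed to be available.

```python
from collections import Counter
from git import Repo                       # pip install GitPython
from scipy.stats import mannwhitneyu

def changes_per_file(repo_path, since="2014-01-01"):
    """Count, for every Java file, the commits touching it after the cutting point."""
    counts = Counter()
    for commit in Repo(repo_path).iter_commits(since=since):
        for path in commit.stats.files:
            if path.endswith(".java"):
                counts[path] += 1
    return counts

def compare(satd_files, non_satd_files, counts):
    """MWW test: are SATD-involved files changed more often than the other files?"""
    satd = [counts.get(f, 0) for f in satd_files]
    non_satd = [counts.get(f, 0) for f in non_satd_files]
    return mannwhitneyu(satd, non_satd, alternative="greater")  # p < 0.001 => significant
```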

2.2 Characteristics of Code Comments

Code comments are often provided by developers to remind themselves to record unsatisfactory design decisions, question design decisions of other developers, and note future changes. Since code comments are usually written in natural language by different developers with different writing styles, they may come with characteristics that are difficult to interpret [31]. In this section, we summarize some of the characteristics of code comments that make automated detection of SATDs challenging:

(1) Domain Semantics. Code comments often contain jargon, method names, and abstract concepts that only programmers can understand. For example, the comment "FIXME: I use a for block to implement END node because we need a proc which captures" in the JRuby project contains a term for and an abstract concept END node.

(2) Level of Details (Comment Length). The length of phrases in code comments varies widely, ranging from one word like "unused" or "load", to short phrases like "call workaround" and "should probably error", to sentences like "this is such a bad way of doing it, but ...". This varying-length text feature of code comments also makes it challenging to detect SATDs.

(3) Project Uniqueness. SATD comments in different projects show different styles due to different developers, as empirically confirmed by Potdar et al. [38] in their manual pattern summarization work. In addition, the coding habits and vocabulary gaps among different developers aggravate this phenomenon.

(4) Polysemy. Polysemy is common in sentence expressions. It is often accompanied by domain knowledge in code comments. For example, code comments may contain method names. It is challenging to distinguish method names from common words, which often requires contextual information.

(5) Abbreviation. To facilitate memorization and writing efficiency, developers often use abbreviations in code comments. Project differences often exacerbate this phenomenon. Consider, for example, the comment "no default in DB. If nullable, use null." in the SQuirrel project, in which DB is the abbreviation of Database.

³ Given a significance level α = 0.001, if p-value < α, there is one chance in a thousand that the difference between the datasets is due to a coincidence.

Figure 2: The overall architecture of our approach.

These characteristics, exhibited frequently in SATD comments, make it difficult for traditional pattern-based and text-mining-based SATD detection approaches to achieve high performance. Therefore, there is a strong need to mine the in-depth implicit and semantic features from code comments so as to address the different challenges in effectively detecting SATDs.

3 OUR APPROACH

We propose a Hybrid Attention-based method for self-admitted Technical Debt detection named HATD. HATD analyzes code comments of different projects or application domains (Challenges 1 and 3), aiming to learn comprehensive implicit features of SATDs. It can flexibly extract and aggregate variable-length text features for SATD detection (Challenge 2). To overcome polysemy and abbreviation issues (Challenges 4 and 5), we use ELMo to obtain dynamic word embeddings across contexts for SATD detection.

As depicted in Figure 2, HATD has three modules. The word embedding module maps the words in the comments to vectors; different word embedding techniques can be incorporated into our approach (by default, the ELMo algorithm is implemented). In the feature learning module, we use a positional encoder and a Bi-LSTM to learn sequential features, and we then leverage diverse attention mechanisms to further obtain potential representations of words and distinguish the importance of different words. Finally, the learned features are fed into the detection module to predict SATDs.

3.1 Word Embedding

Word embedding is a collective term for language models and representation learning techniques in natural language processing (NLP). Conceptually, it refers to embedding a high-dimensional vector, whose dimension is the number of all words, into a vector of a much lower dimension. The goal of the word embedding module is to overcome the aforementioned challenges (cf. Section 2.2) in handling code comments. Consider the keyword proc from the code comment "FIXME: I use a for block to implement END node because we need a proc which captures" in the JRuby project. The proc here is an abbreviation of process, which may be seen as OOV (out of vocabulary): because labeled code comments are limited, we cannot guarantee that every test word exists in the training corpus. Based on morphological knowledge, FastText [5, 19] solves the OOV problem to some extent by learning character-based embedding representations of subwords. Static word embedding techniques, however, cannot deal with the phenomenon of polysemy, which is common in comments. Dynamic word representation techniques such as ELMo [37] can deal with this problem to some extent by adjusting the word vector according to the context. ELMo (Embeddings from Language Models) [37] is a deep contextualized word representation language model, which can model the complex features of words (such as syntax and semantics) and the change of words across linguistic contexts (i.e., polysemy).


Figure 3: The working process of the feature learning module.

Since the meaning of words is context-sensitive, ELMo's principle is to feed sentences into a pre-trained language model at inference time to get dynamic word vectors, which can effectively deal with polysemy. Considering that the main application scenario of ELMo is to provide dynamic word vectors acting on downstream tasks, our model integrates ELMo to identify SATDs. A detailed comparison of existing word embedding schemes can be found in Section 5.3.
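As an illustration of such dynamic embeddings, the sketch below queries a pre-trained ELMo model through the ElmoEmbedder shipped with older AllenNLP releases (an assumption on our side; HATD only requires some source of contextual word vectors), averaging the three ELMo layers into one 1024-dimensional vector per token.

```python
from allennlp.commands.elmo import ElmoEmbedder   # assumes allennlp 0.9.x

elmo = ElmoEmbedder()                              # downloads the default pre-trained weights

tokens = "FIXME I use a for block to implement END node".split()
layers = elmo.embed_sentence(tokens)               # numpy array: (3 layers, n_tokens, 1024)
word_vectors = layers.mean(axis=0)                 # one contextual 1024-d vector per token

# The same surface word receives a different vector in a different context (polysemy):
other = elmo.embed_sentence("TODO there is probably a better way to do this hack".split())
```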

3.2 Feature Learning

The feature learning module consists of two sub-modules, a single-head attention encoder (SAE) and a multi-head attention encoder (MAE), adopting different attention mechanisms and sequence feature extractors. The single-head attention encoder first feeds the embedding representation of the comment into a Bi-LSTM network; an attention mechanism then assigns different weights to different words according to their importance, for the purpose of interpreting detection results. To further encode the current word, the multi-head attention encoder allows the encoding of words considering other words' information via the so-called self-attention mechanism. By combining the SAE and MAE modules, we hope to exploit the strengths of the RNN-based model [64] and the Transformer-based model [48] to improve SATD detection performance and interpret the detection results with the attention mechanism. Figure 3 shows the internal structure of the feature learning module. The details of these two internal sub-modules are given below.

3.2.1 Single-head Attention Encoder (SAE). Given a code comment, the SAE module first learns the features of the comment from the preceding as well as the following tokens with a two-layer Bi-LSTM network. In this process, we exploit an attention mechanism to assign different attention weights to words for the purpose of interpreting detection results. Figure 4 shows the working process of SAE, which is carried out in two steps: (1) extracting sequential features of the comment, and (2) assigning an attention weight to each word. $S_1$ denotes the obtained implicit feature vector of the comment.

Word Encoder. Given a code comment, we assume that the comment contains $l$ words, and $x_t$ represents the $t$-th word's vector. We use a Bi-LSTM to get implicit representations of words by summarizing information from both directions.

Figure 4: Workflow of the SAE module.

Figure 5: Workflow of the MAE module.

The Bi-LSTM contains a forward LSTM which reads the comment $c$ from $x_1$ to $x_l$ and a backward LSTM which reads from $x_l$ to $x_1$:

$$h_t = \overrightarrow{LSTM}(x_t), \quad t \in [1, l], \qquad (1)$$
$$h'_t = \overleftarrow{LSTM}(x_t), \quad t \in [1, l]. \qquad (2)$$

We obtain the implicit representation of a given word $x_t$ by concatenating the forward hidden state $h_t$ and the backward hidden state $h'_t$, i.e., $[h_t, h'_t]$.

Assigning Attention Weights. In a code comment, each word contributes differently to SATD detection. For example, Potdar et al. [38] manually summarized 62 patterns for detecting SATDs. Considering the evolution of projects and the gaps between developers, such manually summarized patterns lack generalizability and adaptability to newly generated code comments. Based on these considerations, we exploit an attention mechanism to distinguish the contributions of different words and aggregate the representations of the informative words to form a sentence vector.

$$u_t = \tanh(W_w [h_t, h'_t] + b_w) \qquad (3)$$
$$a_t = \mathrm{softmax}(u_t^{T} u_w) \qquad (4)$$
$$S_1 = \sum_{t} a_t [h_t, h'_t] \qquad (5)$$

We first feed the word representation $[h_t, h'_t]$ through a one-layer perceptron to get $u_t$ as a hidden representation of $[h_t, h'_t]$; then we measure the importance of the word with a word-level context vector $u_w$ and obtain a normalized importance weight $a_t$ through a softmax function. Finally, we sum all the word vectors weighted by their attention weights to obtain the final feature representation ($S_1$) of the code comment.
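A minimal PyTorch sketch of the SAE module (Eq. (1)-(5)) is given below: a two-layer Bi-LSTM followed by word-level attention. The dimensions follow Section 4.3 (128 hidden units per direction); class and variable names are ours, not taken from the authors' implementation.

```python
import torch
import torch.nn as nn

class SingleHeadAttentionEncoder(nn.Module):
    def __init__(self, embed_dim=1024, hidden_dim=128):
        super().__init__()
        self.bilstm = nn.LSTM(embed_dim, hidden_dim, num_layers=2,
                              bidirectional=True, batch_first=True)
        self.proj = nn.Linear(2 * hidden_dim, 2 * hidden_dim)      # W_w, b_w in Eq. (3)
        self.context = nn.Parameter(torch.randn(2 * hidden_dim))   # word context vector u_w

    def forward(self, x):                           # x: (batch, seq_len, embed_dim)
        h, _ = self.bilstm(x)                       # (batch, seq_len, 256) = [h_t, h'_t]
        u = torch.tanh(self.proj(h))                # Eq. (3)
        a = torch.softmax(u @ self.context, dim=1)  # Eq. (4): one weight per word
        s1 = (a.unsqueeze(-1) * h).sum(dim=1)       # Eq. (5): attention-weighted sum
        return s1, a                                # a is later visualized for interpretability
```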

3.2.2 Multi-head Attention Encoder (MAE). The purpose of the MAE is to attend to all the words in the entire input sequence to help the model better encode the current word. Figure 5 illustrates the working process of MAE in two steps: (1) encoding the positional information of words, and (2) encoding words considering other words' information. $S_2$ denotes the obtained implicit feature vector of the comment.


Capturing Positional Information. To learn the order of the words in the input sequence, we first add a positional vector following the one proposed by Vaswani et al. [48], which follows a specific pattern that the model learns. Then we perform a linear transformation on the result. The propagation rule is defined by the following equations:

$$p_t^{(i)} = \begin{cases} \sin\left(t/10000^{2i/d}\right) & \text{if } i = 2k, \\ \cos\left(t/10000^{2i/d}\right) & \text{if } i = 2k+1, \end{cases} \qquad (6)$$
$$x_t = (x_t + p_t) W_p + b_p \qquad (7)$$

where $t$ is the position, $i$ is the dimension, and $d$ is the dimension of the produced output.
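The following is a minimal sketch of the sinusoidal positional encoding of Eq. (6) and the linear transformation of Eq. (7); the dimension of 1024 is only an assumption for illustration.

```python
import math
import torch
import torch.nn as nn

def positional_encoding(seq_len, d):
    """p[t, i] = sin(t / 10000^(2i/d)) for even i, cos(t / 10000^(2i/d)) for odd i."""
    pe = torch.zeros(seq_len, d)
    position = torch.arange(seq_len, dtype=torch.float).unsqueeze(1)
    div = torch.exp(torch.arange(0, d, 2, dtype=torch.float) * (-math.log(10000.0) / d))
    pe[:, 0::2] = torch.sin(position * div)
    pe[:, 1::2] = torch.cos(position * div)
    return pe                                       # (seq_len, d)

class PositionalInput(nn.Module):
    def __init__(self, d=1024):
        super().__init__()
        self.linear = nn.Linear(d, d)               # W_p, b_p in Eq. (7)

    def forward(self, x):                           # x: (batch, seq_len, d)
        pe = positional_encoding(x.size(1), x.size(2)).to(x.device)
        return self.linear(x + pe)                  # Eq. (7)
```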

Assigning Self-Attention. Given a code comment, the representation of a word is closely related to the other words in the comment. To obtain further word vector representations, we introduce the self-attention mechanism [48] to better encode the words. The details are given by the following equations:

$$Q_j = X W_q \qquad (8)$$
$$K_j = X W_k \qquad (9)$$
$$V_j = X W_v \qquad (10)$$
$$head_j = \mathrm{Softmax}\left(\frac{Q_j K_j^{T}}{\sqrt{d_k}}\right) V_j \qquad (11)$$
$$S_2 = \mathrm{Norm}\left(\mathrm{Cat}(head_1, \ldots, head_h)\, W^{o}\right) \qquad (12)$$

Here, $X \in \mathbb{R}^{l \times d}$ is the word vector matrix. We compute the attention function on a set of queries simultaneously, packed together into a matrix $Q$; the keys and values are also packed together into matrices $K$ and $V$ [48]. $W_q \in \mathbb{R}^{d \times d_q}$, $W_k \in \mathbb{R}^{d \times d_k}$, and $W_v \in \mathbb{R}^{d \times d_v}$ are three learnable weight matrices. We use eight attention heads ($h = 8$ by default) to expand the model's ability to focus on different words. For each head we have $d_q = d_k = d_v = d/h$, and $W^{o} \in \mathbb{R}^{h d_v \times d}$. After concatenating the embedding vectors from all the heads, we introduce layer normalization to fix the mean and the variance of the accumulated inputs.
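A minimal sketch of Eq. (8)-(12) with h = 8 heads and d_q = d_k = d_v = d/h is shown below; it is written directly from the equations rather than from the authors' code, and the pooling of the per-word outputs into the comment vector S2 is left to the caller.

```python
import torch
import torch.nn as nn

class MultiHeadAttentionEncoder(nn.Module):
    def __init__(self, d=1024, heads=8):
        super().__init__()
        assert d % heads == 0
        self.d, self.h, self.dk = d, heads, d // heads
        self.wq = nn.Linear(d, d, bias=False)        # W_q
        self.wk = nn.Linear(d, d, bias=False)        # W_k
        self.wv = nn.Linear(d, d, bias=False)        # W_v
        self.wo = nn.Linear(d, d, bias=False)        # W^o in Eq. (12)
        self.norm = nn.LayerNorm(d)                  # layer normalization

    def forward(self, x):                            # x: (batch, seq_len, d)
        b, l, _ = x.shape
        def split(t):                                # -> (batch, heads, seq_len, d_k)
            return t.view(b, l, self.h, self.dk).transpose(1, 2)
        q, k, v = split(self.wq(x)), split(self.wk(x)), split(self.wv(x))  # Eq. (8)-(10)
        scores = torch.softmax(q @ k.transpose(-2, -1) / self.dk ** 0.5, dim=-1)
        heads = scores @ v                           # Eq. (11), one tensor per head
        cat = heads.transpose(1, 2).reshape(b, l, self.d)                  # Cat(head_1..head_h)
        return self.norm(self.wo(cat))               # Eq. (12)
```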

3.3 SATD Detection

Finally, as shown in Figure 6, $S_1$ and $S_2$ are concatenated to form an abstract representation of the code comment, which is then fed into a fully connected layer $f_{fc}$ for SATD detection. Prediction results are calculated by a sigmoid function. The classification rule is defined by the following equation:

$$Y = \mathrm{Sigmoid}\left(f_{fc}(S_1 \oplus S_2)\right) \qquad (13)$$

Here, $\oplus$ denotes the concatenation operation, and $Y$ denotes the prediction result of SATD detection.
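A minimal sketch of this detection step is shown below; the input dimensions (256 for S1, 1024 for S2) are assumptions carried over from the earlier sketches rather than values stated by the paper.

```python
import torch
import torch.nn as nn

class SATDClassifier(nn.Module):
    def __init__(self, s1_dim=256, s2_dim=1024):
        super().__init__()
        self.fc = nn.Linear(s1_dim + s2_dim, 1)      # f_fc in Eq. (13)

    def forward(self, s1, s2):
        s = torch.cat([s1, s2], dim=-1)              # S1 concatenated with S2
        return torch.sigmoid(self.fc(s))             # Y: predicted probability of SATD
```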

As shown in Table 2, SATD and non-SATD comments have a very imbalanced distribution in the open-source projects [20]. To address the data imbalance problem, instead of downsampling or applying feature selection, which may sacrifice the feature learning capability, we introduce a weighted cross-entropy loss as the loss function.

Figure 6: Workflow of module combination and SATD detection.

Suppose there are $a$ SATD comments and $b$ non-SATD comments in the training data; then the loss function is defined as:

$$weight_s = \begin{cases} \dfrac{b}{a+b} & \text{if the } s\text{-th class is SATD}, \\[4pt] \dfrac{a}{a+b} & \text{otherwise}, \end{cases} \qquad (14)$$
$$Loss(Y, T) = -\sum_{s} weight_s \, T_s \log(Y_s) \qquad (15)$$

Here, $T$ and $Y$ are the ground-truth label and the predicted label of the comment, respectively. All the parameters in the model are trained via gradient descent.
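A minimal sketch of the weighted loss of Eq. (14)-(15) is given below; it realizes the weighting with PyTorch's built-in cross-entropy over two logits (an equivalent reformulation, not necessarily the authors' exact implementation), and the Apache Ant counts are used only as an example.

```python
import torch
import torch.nn as nn

def weighted_loss_fn(a, b):
    """a: number of SATD comments, b: number of non-SATD comments in the training data."""
    # Eq. (14): the SATD class is weighted by b/(a+b), the non-SATD class by a/(a+b)
    weights = torch.tensor([a / (a + b), b / (a + b)], dtype=torch.float)  # index 0: non-SATD, 1: SATD
    return nn.CrossEntropyLoss(weight=weights)       # Eq. (15), applied to (logits, targets)

criterion = weighted_loss_fn(a=131, b=4006)          # e.g., Apache Ant counts from Table 2
# usage: loss = criterion(logits, targets), logits of shape (batch, 2), targets in {0, 1}
```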

4 EXPERIMENTAL SETUP

In this section, we conduct comprehensive experiments on SATD detection aiming to answer the following research questions:
• RQ1: Is HATD effective in detecting self-admitted technical debts in software projects?
• RQ2: How does HATD compare with the state-of-the-art approaches for recognizing self-admitted technical debts?
• RQ3: How effective is the word embedding technique integrated into HATD compared with other baseline techniques?
• RQ4: To what extent do self-admitted technical debts exist in real-world software projects?

4.1 Dataset Description

In our experiments, we use the dataset provided by [7]. This dataset contains labeled code comments for ten open-source projects, including Apache Ant, ArgoUML, Columba, EMF, Hibernate, JEdit, JFreeChart, JMeter, JRuby, and SQuirrel. The detailed descriptions of the dataset are shown in Table 2. These ten projects belong to different application areas and have different numbers of contributors and comments, as well as different scale and complexity. We can see that only a small ratio of the comments in each project are SATD comments.

4.2 Evaluation Methods and Metrics

In this work, we adopt three commonly used metrics, namely Precision, Recall, and F1-score (the balanced view between Precision and Recall), to evaluate the performance of our approach.


Table 2: Statistics of SATD comments in the benchmark software projects.

Project      Description                          Release   Contributors   Comments   SATD    % of SATD
Apache Ant   Java library and command-line tool   1.7.0     74             4,137      131     3.17
ArgoUML      UML modeling tool                    0.34      87             9,548      1,413   14.80
Columba      E-mail client                        1.4       9              6,478      204     3.15
EMF          Eclipse Model Driven Architecture    2.4.1     30             4,401      104     2.36
Hibernate    ORM framework                        3.3.2     226            2,968      472     15.90
JEdit        Text editor                          4.2       57             10,322     256     2.48
JFreeChart   Chart library                        1.0.19    19             4,423      209     4.73
JMeter       Performance tester                   2.10      33             8,162      374     4.58
JRuby        Ruby interpreter                     1.4.0     328            4,897      662     13.52
SQuirrel     SQL client                           3.0.3     46             7,230      285     3.94
Average                                                     91             6,257      411     6.57

$$Precision = \frac{TP}{TP + FP} \qquad (16)$$
$$Recall = \frac{TP}{TP + FN} \qquad (17)$$
$$F1 = \frac{2 \times Precision \times Recall}{Precision + Recall} \qquad (18)$$

Here, TP (true positive) is the number of SATD comments that are detected as such; FP (false positive) is the number of non-SATD comments that are detected as SATD; and FN (false negative) is the number of SATD comments that are detected as non-SATD.
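These metrics can be computed directly from the three counts; a minimal sketch follows (equivalent results can be obtained with scikit-learn's precision_recall_fscore_support).

```python
def precision_recall_f1(tp, fp, fn):
    """Eq. (16)-(18) computed from the raw counts, guarding against empty denominators."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```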

4.3 Parameters and Experimental Environment

In the Bi-LSTM component of our model, the number of layers is two, the number of hidden units is set to 128, and the two branches produce outputs with 256 units, respectively. The Adam optimizer with default settings (except that the learning rate is configured to be 1e-4) is used for training. In order to avoid overfitting, we add a dropout layer between every two layers, with a drop probability of 0.5. We set the L2 loss weight to 5e-4. Due to the unbalanced distribution of SATD and non-SATD comments, we introduce a weighted cross-entropy loss as the loss function. We compute the mean values of Precision, Recall, and F1-score across ten experiments to estimate the performance of the models. The experiments are run on a Linux system (Ubuntu 16.04 LTS) with 64GB memory and an RTX 2080Ti GPU, and implemented by leveraging the deep learning library PyTorch⁴.

⁴ https://pytorch.org/
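A minimal sketch of this configuration is shown below; the tiny stand-in network is only there to make the snippet self-contained, and the L2 penalty is expressed as Adam's weight decay (an assumption on our side).

```python
import torch
import torch.nn as nn

# Stand-in for the full HATD network, used only to illustrate the training configuration.
model = nn.Sequential(
    nn.Linear(1024, 256), nn.ReLU(),
    nn.Dropout(p=0.5),                  # dropout between every two layers to avoid overfitting
    nn.Linear(256, 2),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=5e-4)  # L2 weight 5e-4
```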

5 RESULTS

We now present our experimental results towards answering the aforementioned research questions.

5.1 RQ1: Effectiveness of HATD

Towards evaluating the effectiveness of our approach, we apply HATD to the ten projects maintained by [7], since these ten projects have been provided with ground truth. Based on these ten projects, we evaluate the performance of HATD from two aspects (a sketch of both settings follows the list):
• Within-project. For each project, we perform 10-fold cross-validation to assess the classification capability of HATD.


In particular, we first equally divide the project's comments into ten subsets. Then, we train our model with nine subsets of comments and leverage it to classify the remaining subset of comments. We repeat this experiment 10 times to ensure that each subset has been used as a testing set once. The final performance is reported as the average score of the ten rounds of experiments.
• Cross-projects. In this aspect, we also conduct our experiments in ten rounds, with one project tested in each round. Specifically, for each project, we train our model on the other nine projects and leverage it to predict all the comments of the project under testing. The classification results are directly attached to the testing project, for the sake of simplicity, even though the classification models are trained on the other nine projects.
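A minimal sketch of both evaluation settings is given below; `train_and_eval` and the `projects` dictionary are placeholders, and the stratified splitter is our choice for keeping the SATD ratio stable across folds (the paper only states that the comments are divided equally).

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def within_project(comments, labels, train_and_eval):
    """10-fold cross-validation inside one project; returns the mean score over the folds."""
    skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    scores = [train_and_eval(comments[tr], labels[tr], comments[te], labels[te])
              for tr, te in skf.split(comments, labels)]
    return np.mean(scores)

def cross_project(projects, target, train_and_eval):
    """Train on the other nine projects, then test on every comment of `target`."""
    train_c = np.concatenate([projects[p][0] for p in projects if p != target])
    train_y = np.concatenate([projects[p][1] for p in projects if p != target])
    test_c, test_y = projects[target]
    return train_and_eval(train_c, train_y, test_c, test_y)
```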

Table 3: The performance of our approach over both within-project and cross-projects SATD detections.

              Within-Project                      Cross-Projects
Project       Precision   Recall    F1            Precision   Recall    F1
Apache Ant    0.7050      0.6108    0.6545        0.6569      0.7801    0.7132
ArgoUML       0.8731      0.9469    0.9085        0.8615      0.9487    0.9030
Columba       0.9434      0.9616    0.9524        0.8906      0.9598    0.9239
EMF           0.7475      0.8586    0.7992        0.6769      0.8787    0.7647
Hibernate     0.8866      0.9536    0.9189        0.8699      0.9305    0.8992
JEdit         0.5468      0.6554    0.5962        0.7783      0.8797    0.8259
JFreeChart    0.9268      0.8747    0.9000        0.6897      0.7135    0.7014
JMeter        0.9249      0.8938    0.9091        0.8178      0.8715    0.8438
JRuby         0.9568      0.9388    0.9477        0.8977      0.9397    0.9182
SQuirrel      0.7978      0.8769    0.8355        0.7814      0.8318    0.8058
Average       0.8309      0.8571    0.8422        0.7921      0.8734    0.8299

Table 3 presents the experimental results of our approach w.r.t. the Precision, Recall, and F1-score metrics. The best and worst results are highlighted in bold and underlined, respectively, and the difference between them is quite significant. This finding shows that our approach might be sensitive to the training dataset. Indeed, our approach yields relatively poor performance for the JEdit project because many of its comments are short and sometimes meaningless (e.g., "Inner classes" or "unsplit() method"), resulting in noise or a lack of effective semantic information to be learned by our approach. In addition, the fact that some projects achieve better performance in the within-project setting (e.g., JFreeChart) while others achieve better performance in the cross-projects setting (e.g., JEdit) further confirms that the quality of the training set is important to our approach.


Nevertheless, the experimental results show that our approach generally yields good performance for most of the projects under both within-project and cross-projects settings. Indeed, on average, our approach achieves a Precision, Recall, and F1-score of 83.09%, 85.71%, and 84.22%, respectively, for within-project SATD detection, and a Precision, Recall, and F1-score of 79.21%, 87.34%, and 82.99%, respectively, for cross-projects SATD detection. It is worth mentioning that our approach does not exhibit significant differences between the within-project and cross-projects settings. This evidence suggests that our approach could be effective and practical for pinpointing SATDs in real-world software projects.

Answer to RQ1: Our approach achieves high performance in both within-project and cross-projects settings and hence is effective and practical for pinpointing SATDs in real-world software projects.

5.2 RQ2: Performance Comparison with the State-of-the-art Approach

Table 4: The performance of our approach compared to the state-of-the-art.

              Within-Project                     Cross-Projects
Projects      Ren's    Ours     Gains            Ren's    Ours     Gains
Apache Ant    0.5977   0.6545   +9.50%           0.6260   0.7132   +13.93%
ArgoUML       0.8895   0.9085   +2.14%           0.8409   0.9030   +7.38%
Columba       0.8545   0.9524   +11.46%          0.7782   0.9239   +18.72%
EMF           0.7010   0.7992   +14.01%          0.5797   0.7647   +31.91%
Hibernate     0.8465   0.9189   +8.55%           0.8035   0.8992   +11.91%
JEdit         0.5564   0.5962   +7.15%           0.5760   0.8259   +43.39%
JFreeChart    0.7759   0.9000   +15.99%          0.6504   0.7014   +7.84%
JMeter        0.8858   0.9091   +2.63%           0.7726   0.8438   +9.22%
JRuby         0.8626   0.9477   +9.87%           0.8445   0.9182   +8.73%
SQuirrel      0.7467   0.8355   +11.89%          0.7069   0.8058   +13.99%
Average       0.7717   0.8422   +9.14%           0.7179   0.8299   +15.61%

We now compare our approach with the state-of-the-art approach proposed by Ren et al. [39]. To the best of our knowledge, this is the closest work to ours, since both approaches are implemented based on deep learning to detect SATDs. Their method leverages basic neural networks to accomplish its purposes, while our approach brings attention mechanisms into neural networks to achieve our objectives. We compare our approach with their method on the same dataset, i.e., the ten projects with ground truth.

Table 4 summarizes the comparison results. In within-project SATD detection, compared with Ren's method, our method achieves performance improvements on all projects ranging from 2.14% to 15.99%. Notably, on the EMF project, which only contains 104 SATD comments, our method obtains a 14.01% improvement. This evidence suggests that our approach is more resilient to small training corpora. On average, our method obtains an F1-score of 84.22%, which is 9.14% higher than that of Ren's method.

In cross-projects SATD detection, our method achieves even larger performance improvements than in the within-project setting, ranging from 7.38% to 43.39%. On average, over the ten projects, our method achieves an F1-score of 82.99%, which is 15.61% higher than that of Ren's method.

It is worth highlighting that for the JEdit project, our method achieves a 43.39% improvement, and the performance improvement for the EMF project also exceeds 30%. These experimental results demonstrate that our approach can learn more useful information from other projects than Ren's method. Hence, our approach is more practical for detecting SATDs in real-world software projects.

Finally, despite the fact that we have only compared our approach with Ren's state-of-the-art approach, our approach should also outperform other similar ones targeting SATD extraction or classification. Indeed, as experimentally demonstrated by Ren et al. on the same dataset, their method outperforms three baseline approaches, including a pattern-based SATD extraction approach, a traditional text-mining-based SATD classification approach, and Kim's simple CNN-based sentence classification approach. Therefore, since our approach outperforms Ren's method in the same regard, we believe that our approach should also be able to outperform the aforementioned state-of-the-art approaches.

Answer to RQ2: HATD outperforms the state-of-the-art approaches, including neural-network-based ones such as Ren's approach and pattern-based ones, as well as traditional text-mining-based approaches.

5.3 RQ3: Effectiveness of the Selected Word Embedding Technique

Word embedding, which allows embedding a high-dimensional space whose dimension is the number of all words into a continuous vector space of a much lower dimension, has become one of the most important techniques in the natural language processing (NLP) community. Apart from the ELMo technique adopted in this work, the NLP community has invented various other methods to realize word embedding. Towards answering the third research question, we conduct a comparative study between ELMo and some of the other methods to justify our selection of ELMo and its effectiveness.

To the best of our knowledge, there are at least four additional popular word embedding techniques available in the community that we can compare with.

One-hot is the simplest encoding method. Each word has its own unique word vector representation. The dimension of the word vector is the length of the vocabulary; only one position in each word vector has a value of 1, and the others are 0. One-hot encoding assumes that all words are independent of each other. It ignores the semantic similarity between words and causes severe data sparsity problems when the corpus is very large.

Word2Vec [34] is an open-source toolkit for generating word vectors launched by Google in 2013, containing two training modes: skip-gram and continuous bag of words (CBOW). The skip-gram model uses one word to predict the surrounding words, while the CBOW model uses the surrounding words to predict the central word. In order to improve model accuracy and training speed, Word2Vec usually uses negative sampling and hierarchical softmax. Word vectors generated by Word2Vec can better express the similarity and analogy relationships between words.


Table 5: The comparison results of ELMo and other state-of-the-art word embedding techniques.

              Within-Project                                        Cross-Projects
Projects      ELMo     One-Hot   Word2Vec   GloVe    FastText       ELMo     One-Hot   Word2Vec   GloVe    FastText
Apache Ant    0.6545   0.6263    0.6242     0.5576   0.5825         0.7132   0.6308    0.6783     0.6226   0.6316
ArgoUML       0.9085   0.9132    0.9059     0.8967   0.8918         0.9030   0.8487    0.8440     0.8290   0.8400
Columba       0.9524   0.9467    0.9024     0.9424   0.9434         0.9239   0.8337    0.8685     0.8469   0.8718
EMF           0.7992   0.7826    0.7273     0.7556   0.7506         0.7647   0.6043    0.6435     0.6376   0.6818
Hibernate     0.9189   0.9126    0.9293     0.8871   0.9592         0.8992   0.8257    0.8331     0.8261   0.8242
JEdit         0.5962   0.5862    0.5714     0.5818   0.5622         0.8259   0.6475    0.6013     0.6118   0.6225
JFreeChart    0.9000   0.8837    0.8947     0.8421   0.8660         0.7014   0.6533    0.6645     0.6095   0.7214
JMeter        0.9091   0.8608    0.8718     0.8741   0.8721         0.8438   0.7451    0.7601     0.7537   0.7659
JRuby         0.9477   0.9000    0.9091     0.8841   0.9524         0.9182   0.8726    0.8548     0.8834   0.8626
SQuirrel      0.8355   0.8214    0.7719     0.7970   0.8000         0.8058   0.7070    0.7559     0.7299   0.6872
Average       0.8422   0.8234    0.8108     0.8019   0.8190         0.8299   0.7369    0.7504     0.7351   0.7509

GloVe (Global Vectors for Word Representation) [36] is essentially a log-bilinear model with a weighted least-squares objective. The model is inspired by word co-occurrence probabilities, which may encode global information for words. This makes up for the weakness of Word2Vec, which uses only local word co-occurrence information. In GloVe, each word involves two word vectors, one being the vector of the word itself and the other the context vector of the word. The final representation of the word vector is obtained by adding the two vectors.

FastText [5, 19] can be regarded as a derivative of Word2Vec, which performs word vector training based on knowledge of language morphology. For each input word, a character-based n-gram representation is computed, and all n-grams are then added to the original word to represent morphological information. The advantage of this method is that, for English words, the morphological similarity of prefixes or suffixes can be used to establish relationships between words.
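For reference, two of these baselines can be trained on a tokenized comment corpus with Gensim as sketched below (assuming gensim >= 4, where the dimensionality parameter is `vector_size`); the toy corpus and parameter values are placeholders.

```python
from gensim.models import Word2Vec, FastText

corpus = [["todo", "there", "is", "probably", "a", "better", "way", "to", "do", "this"],
          ["this", "is", "a", "hack", "for", "the", "multiline", "column"]]

w2v = Word2Vec(corpus, vector_size=1024, sg=1, negative=5, min_count=1)  # skip-gram + negative sampling
ft = FastText(corpus, vector_size=1024, sg=1, min_count=1)               # subword n-gram embeddings

vec = ft.wv["workaround"]   # FastText can still build a vector for an out-of-vocabulary word
```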

To conduct a fair comparison, we respectively reimplement the word embedding module with each of the aforementioned four techniques and keep the remaining implementation of HATD unchanged. We then launch the new version of HATD on the same dataset, with the same parameters and experimental settings (i.e., within-project and cross-projects). The word embedding size determines the ability of the model to capture the feature information of a code comment. In order to avoid the influence of different word embedding sizes on the experimental results, we uniformly set the embedding size to 1024, except for the One-Hot embedding method, whose dimension is always equal to the size of the vocabulary.

Table 5 presents the experimental results for both within-project and cross-projects settings. In the within-project setting, due to the small size of the corpus, even One-Hot performs well, and even slightly better than some other word embedding techniques (Word2Vec, GloVe, and FastText). ELMo achieves the best performance for eight projects. This result shows that ELMo is a promising word embedding technique for supporting the automated classification of SATDs. This evidence can be easily observed in the cross-projects experiments as well. Indeed, as also shown in Table 5, for nine out of the ten selected projects, ELMo achieves the best performance in supporting HATD to pinpoint SATDs.

Moreover, as suggested by the average performance in both within-project and cross-projects experiments, at least for supporting SATD classification, ELMo performs much better than the other word embedding techniques (i.e., One-Hot, Word2Vec, GloVe, FastText). This can be explained by the fact that ELMo generates different vector representations for the same word depending on the context. For example, some teams use the term 'rework' to refer to routine refactoring, which is not always technical debt. ELMo can alleviate this situation to some extent.

In within-project SATD detection, the training set and testing set come from the same project and have similar comment characteristics, since they are written by the same batch of developers. In cross-projects SATD detection, the comments of the testing set come from a new project, which is different from the training set. Overall, the performance of cross-project SATD detection is, to varying degrees, generally lower than that of within-project SATD detection for all word embedding techniques.

Answer to RQ3: ELMo outperforms other word embedding techniques by mostly achieving the best or second-best performance when supporting neural network-based SATD detection. Moreover, in the same context, dynamic word embedding techniques seem to be more reliable than static ones.

5.4 RQ4: SATDs in Real-world Software Projects

To evaluate the practicality of HATD, in this last research question, we assess to what extent HATD can be leveraged to detect SATDs in real-world software projects, for which no ground truth has been constructed beforehand. In addition, we also evaluate whether HATD is applicable to other programming languages, such as Python, JavaScript, etc. To this end, we resort to GitHub to search for popular projects, and eventually we select ten software projects, taking different technical fields into consideration. Table 6 summarizes the ten projects, along with their metadata, such as short descriptions, the number of total comments, etc.

We then train our approach based on the ten benchmark projects (as enumerated in Table 2) and apply it to classify the comments of each of the aforementioned ten real-world software projects. The second-to-last column of Table 6 illustrates the experimental results.


Table 6: The list of our selected real-world popular software projects.

Project        Description                                                       Release   Stars    Contributors   Comments   SATD    % of SATD
Spring Boot    A framework for creating Spring based applications.               2.3.0     47.0k    675            67,425     44      0.07
Dubbo          A high-performance, Java based, open source RPC framework.        2.7.6     31.9k    297            37,516     129     0.34
OkHttp         Square's meticulous HTTP client for Java and Kotlin.              4.5.0     36.7k    199            2,728      25      0.92
Guava          Google core libraries for Java.                                   29.0      36.8k    229            102,705    1,398   1.36
Vue            A progressive JavaScript framework for building UI on the web.    2.6.11    162.0k   292            11,721     152     1.30
Werkzeug       The comprehensive WSGI web application library.                   1.0.1     5.3k     332            1,575      17      1.08
PyTorch        Tensors and dynamic neural networks in Python.                    1.4.0     37.9k    1,361          21,977     742     3.38
Django         The web framework for perfectionists with deadlines.              3.0.5     48.7k    1,877          24,074     186     0.77
scikit-learn   A Python module for machine learning.                             0.22.2    40.3k    1,634          19,625     288     1.47
TensorFlow     An open source machine learning framework for everyone.           2.2.0     143.0k   2,460          93,453     2,452   2.62
Average                                                                                    59.0k    936            38,280     543     1.33

In total, among 382,799 comments, HATD flags 5,438 of them as SATDs. The ratios of SATDs in each project's comments are quite low, ranging from 0.07% to 3.38%, with an average of 1.33%. Compared with the ground-truth projects, we observe that there are fewer SATDs in the ten real-world software projects we selected, which may be because they are very popular on GitHub and more contributors participate in code commits and maintenance: the average number of code contributors in these software projects is ten times that of the ground-truth projects.

Manual verification of the results reported by HATD. We randomly select 50 samples each (100 comments in total) from the code comments marked as having technical debt and as having no technical debt for further analysis. We then invite five programmers to manually judge these 100 selected comments and compare their decisions with the results yielded by our attention-based deep learning model. The five programmers flag the 100 comments as SATD or non-SATD independently and then discuss with each other to reach a consensus if inconsistent decisions are made in the first place. Among the 50 code comments marked as SATD, 12 are not flagged by the five programmers as such. Among the 50 code comments marked as non-SATD, only five comments differ from the manual annotations. Overall, our approach yields an accuracy of 83% in detecting SATDs in popular real-world software projects. Considering that the real-world projects we selected come from different technical fields and have significant domain differences, some domain characteristics will not be fully covered by our training set (based on the ten benchmark projects). With more training data, we believe that the performance of our model can be further improved.

Answer to RQ4: In a practical scenario, our approach achieves an accuracy of 83% in detecting SATDs in real-world software projects. The fact that all the selected software projects are more or less involved with SATDs shows that SATDs are widespread in software projects. Therefore, we argue that software developers and maintainers should pay more attention to SATDs.

6 DISCUSSION

We now discuss the explainability of our approach to SATD detection and introduce potential threats to the validity of this study.

Figure 7: Visualization of weights for two code comments (Comment 01: This is a hack for the multiline column; Comment 02: TODO: There's probably a better way to do this hack).

6.1 Interpretability of HATD

In order to interpret the results of SATD classification, Potdar et al. [38] manually summarized 62 frequent SATD patterns for detecting SATD comments. However, this manual process is time-consuming and labor-intensive, and it is difficult to summarize all SATD patterns. Ren et al. [39] exploit a backtracking mechanism to map CNN-extracted features to key n-grams and summarize key n-grams into SATD patterns. Their method, however, requires a backtracking mechanism to propagate the extracted features back into the same structure of the model.

In this work, we propose an attention-based neural network to detect and explain SATD classifications. Through the visualization of weights, we find that our method can immediately provide word-level and phrase-level interpretability for each comment, regardless of project differences and without time-consuming pattern summarization.
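The visualization itself is straightforward once the per-word attention scores are available; the sketch below renders each token with a background intensity proportional to its weight, with purely illustrative scores.

```python
import matplotlib.pyplot as plt

tokens = ["TODO", "there", "is", "probably", "a", "better", "way", "to", "do", "this", "hack"]
weights = [0.24, 0.03, 0.02, 0.09, 0.04, 0.16, 0.14, 0.03, 0.05, 0.04, 0.16]  # e.g., SAE attention

fig, ax = plt.subplots(figsize=(8, 1))
for i, (tok, w) in enumerate(zip(tokens, weights)):
    ax.text(i, 0.5, tok, ha="center", va="center",
            bbox=dict(facecolor="red", alpha=w / max(weights), edgecolor="none"))
ax.set_xlim(-1, len(tokens))
ax.axis("off")
plt.show()
```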

Figure 7 shows two examples of the comments we have examined. The color shade indicates the attention weight of each word: the higher the attention weight, the darker the color. For visualization purposes, we remove padding words to ensure that only meaningful words are emphasized. From Figure 7, we can see that our method highlights the word hack in the first comment, while in the second comment, it highlights the word TODO and the phrase a better way. We find an interesting phenomenon: the keyword hack carries a significant SATD tendency in the first comment, which is different from its role in the second comment.


This shows that the same keywords play different roles in SATD detection in different comments, which further exacerbates the difficulty of manually summarizing SATD patterns to identify SATDs.

Furthermore, we summarize in Table 7 the Top-10 SATD patterns for each of the ten open-source projects. Interestingly, we can observe that many projects share some common one-gram SATD patterns (underlined), such as "todo", "xxx", "fix", "fixme", "hack", "work", "perhaps", "workaround", "better", and "bug". At the same time, some SATD patterns appear in only one or two projects. For instance, "a public setter" is only used in the Apache Ant project; "model element", "model/extent", and "behind addTrigger" are only used in the ArgoUML project; and "programming error" and "tweak" are only used in the JMeter and JFreeChart projects, respectively. We observe that some patterns (bold) are not discovered by Potdar et al. and Ren et al. [38, 39], suggesting that manual summarization of SATD patterns will likely miss some less evident SATD patterns and will also encounter the problem of project differences. Our method can complement those works by automatically identifying SATDs and highlighting the keywords or phrases contributing to such identifications, regardless of project differences.

6.2 Threats to Validity
Threats to internal validity are related to errors in our implementation and personal bias in data labeling. To avoid implementation errors, we have carefully checked our experimental steps and parameter settings. To avoid personal bias in manually labeling the code comments, we use a dataset with a high inter-rater agreement (Cohen's Kappa coefficient of 0.81), as reported by Maldonado et al. [31]. In addition, we do not use the fine-grained SATD categories proposed by [7] and only consider the binary classification of code comments. In other words, design debt, requirement debt, defect debt, documentation debt, and test debt are all simply treated as SATDs.
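For reference, such inter-rater agreement can be computed with scikit-learn's cohen_kappa_score; the two label vectors below are made up purely to illustrate the computation:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical SATD labels (1 = SATD, 0 = non-SATD) assigned by two annotators
# to the same ten comments; the real dataset is of course much larger.
rater_a = [1, 0, 0, 1, 1, 0, 0, 1, 0, 0]
rater_b = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]

print(cohen_kappa_score(rater_a, rater_b))  # agreement beyond chance, in [-1, 1]
```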

Threats to external validity are related to both the quantity and quality of our experimental dataset and the generalizability of our experimental results and findings. To guarantee the quantity and quality of our dataset, we have taken into account ten open-source projects with different functions and domains, and with different numbers of comments as well as comment characteristics. To guarantee the generalizability of our experimental results and findings, we have crawled another ten popular software projects from GitHub to test our trained model. By manually verifying the results reported by our method, we conclude that our method achieves an accuracy of 83% on 100 randomly selected comments.

7 RELATED WORK
Self-Admitted Technical Debt (SATD) is a variant of technical debt that is used to identify debt that is intentionally introduced during the software development process. The detection of SATD can better support software maintenance and ensure high software quality [9, 17, 22, 26, 28, 29, 33]. This paper aims to propose a deep learning based method to detect SATDs. To the best of our knowledge, there is very limited research on the use of deep learning technology for SATD detection. Therefore, we divide the related work into two main parts: code comment analysis and existing studies on Self-Admitted Technical Debt.

7.1 Code Comment Analysis
Code comments play an important role in software development. The significance of comments is to enable developers to easily understand and manage code [15, 42, 49]. Code comments are usually written in natural language, containing functional descriptions and task annotations that aim to assist project development. Code comments have different characteristics due to differences in developers, projects, and programming languages.

Many studies have investigated the characteristics of code comments. Fluri et al. [11] investigated the extent to which developers add or adapt comments when evolving code in three open-source systems. They found that when the relevant code changed, the comment changes were usually made in the same version. Steidl et al. [44] and Sun et al. [46] conducted code comment quality analyses to improve the quality of code comments. More specifically, Steidl et al. provide a semi-automated approach to conduct quantitative and qualitative evaluation of comment quality. Sun et al. extended their work and provide more accurate and comprehensive comment assessments and recommendations.

7.2 Studies in Self-Admitted Technical Debt
Currently, many researchers have focused on proposing approaches to detect and manage technical debts. Indeed, many empirical studies have been conducted on technical debts [1, 24, 40, 61–63]. The concept of self-admitted technical debt (SATD) was proposed by Potdar et al. [38]. SATD means that the debt is intentionally introduced and reported in code comments by developers. They manually summarized 62 specific patterns from different Java projects for the detection of SATD comments. Wehaibi et al. [52] examined the relationship between SATDs and software defects and found that SATDs cause software systems to consume more resources in the future, rather than directly equating to software defects. According to the different characteristics of SATD comments, Maldonado et al. [31] further classified SATD into five types, namely design debt, defect debt, documentation debt, requirement debt, and test debt.

In recent years, researchers in the field of software engineering have made significant efforts to address the issue of SATD detection [7, 8, 32, 57]. Huang et al. [16] proposed a text-mining-based method for SATD detection. In their work, they leverage feature selectors to select useful features for classifier training and detect SATD comments in the target project based on the votes of classifiers trained on different source projects. Maldonado et al. [7] built a maximum entropy classifier for automatically identifying design and requirement SATD comments based on Natural Language Processing (NLP) technologies. They conducted extensive experiments on 10 open-source projects and outperformed the then state-of-the-art approach based on fixed keywords and phrases. Yan et al. [55] proposed a change-level method for SATD detection utilizing 25 change features. In addition, they investigated the most important features (e.g., "diffusion") that impact SATD detection.

With the rapid development of deep learning, Ren et al. [39] exploited a deep learning based method to support SATD detection. In their work, they utilized a convolutional neural network (CNN) to automatically learn key features of SATD comments. They introduced a weighted loss function to deal with the issue of data imbalance.



Table 7: Top-10 SATD patterns in each project extracted by our approach

Apache Ant      | ArgoUML           | Columba        | EMF                | Hibernate
xxx             | todo              | todo           | todo               | todo
todo            | not needed        | fixme          | better             | better
what is         | should be         | function       | need remove        | work
a public setter | model/extent      | work           | why not            | this support
fixme           | model element     | implement      | factor up into     | bug
perhaps         | more work         | replace with   | SubProgressMonitor | workaround
hack            | fixme             | hack           | revisit            | fix
better          | behind addTrigger | better         | GenBaseImpl        | perhaps
should we       | not in uml        | workaround     | instead of         | fixme
this comment    | hack              | used currently | fix                | render

JEdit      | JFreeChart                 | JMeter            | JRuby                     | SQuirrel
hack       | todo                       | todo              | todo                      | todo
work       | check                      | hack              | fixme                     | work
todo       | defer argument checking... | perhaps           | require pop               | data type
fix        | fixme                      | not used          | hack                      | render
workaround | tweak                      | work              | better                    | bug
bug        | happen                     | appear            | bother                    | hack to deal with
xxx        | implement this             | fix               | fix                       | follow
broken     | assert a value             | programming error | perhaps                   | workaround
stupid     | crosshair values           | bail out          | NOT_ALLOCATABLE_ALLOCATOR | is this right
look bad   | NOT trigger                | bug               | not very efficient        | better way

In addition, for the explainability of detection, they designed a backtracking method to extract and highlight key phrases and patterns of SATD comments to explain the prediction results.
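To make the class-imbalance remedy concrete, the snippet below sketches a generic class-weighted cross-entropy loss in PyTorch (our own illustration, not Ren et al.'s exact formulation); the weight of 4.0 for the minority SATD class is a hypothetical value:

```python
import torch
import torch.nn as nn

# Up-weight the minority SATD class so that misclassified SATD comments
# contribute more to the loss than misclassified non-SATD comments.
class_weights = torch.tensor([1.0, 4.0])  # [non-SATD, SATD]
criterion = nn.CrossEntropyLoss(weight=class_weights)

logits = torch.randn(8, 2, requires_grad=True)    # model outputs for a batch of 8 comments
labels = torch.tensor([0, 0, 1, 0, 0, 1, 0, 0])   # ground-truth classes
loss = criterion(logits, labels)
loss.backward()                                   # gradients reflect the class weights
```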

8 CONCLUSION
The objective of this work is to automatically detect and explain self-admitted technical debts in software projects, allowing software developers to fix those issues in a timely manner so as to avoid long-term maintenance efforts. To fulfill this objective, we have provided an overview of the code comment characteristics that make it challenging to automatically detect SATDs. Then, we propose to the community a hybrid attention-based method for SATD detection named HATD, which is equipped with the flexibility to switch word embedding techniques based on project uniqueness and comment characteristics. Finally, we experimentally demonstrate the effectiveness of HATD by (1) outperforming state-of-the-art methods on benchmark datasets, (2) detecting SATDs in real-world software projects, and (3) demonstrating the explainability of HATD in highlighting the core words or phrases justifying why a given SATD is flagged as such.

ACKNOWLEDGMENT
This work was supported by grants from the National Natural Science Foundation of China (Nos. 61972290, U163620068, 61962061, 61562090).

REFERENCES
[1] Nicolli S. R. Alves, Leilane F. Ribeiro, Vivyane Caires, Thiago S. Mendes, and Rodrigo O. Spinola. 2014. Towards an Ontology of Terms on Technical Debt. In 2014 Sixth International Workshop on Managing Technical Debt. 1–7.

[2] Nicolli S. R. Alves, Thiago Souto Mendes, Manoel Mendonca, Rodrigo O. Spinola, Forrest Shull, and Carolyn Seaman. 2016. Identification and management of technical debt. Information and Software Technology 70, 70 (2016), 100–121.

[3] Areti Ampatzoglou, Apostolos Ampatzoglou, Alexander Chatzigeorgiou, and Paris Avgeriou. 2015. The financial aspect of managing technical debt. Information and Software Technology 64, 64 (2015), 52–73.

[4] G. Bavota and B. Russo. 2016. A Large-Scale Empirical Study on Self-Admitted Technical Debt. In 2016 IEEE/ACM 13th Working Conference on Mining Software Repositories (MSR). 315–326.

[5] Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics 5 (2017), 135–146.

[6] Ward Cunningham. 1993. The WyCash portfolio management system. ACM SIGPLAN OOPS Messenger 4, 2 (1993), 29–30.

[7] Everton da Silva Maldonado, Emad Shihab, and Nikolaos Tsantalis. 2017. Using natural language processing to automatically detect self-admitted technical debt. IEEE Transactions on Software Engineering 43, 11 (2017), 1044–1062.

[8] Mário André de Freitas Farias, Manoel Gomes de Mendonça Neto, Marcos Kalinowski, and Rodrigo Oliveira Spínola. 2020. Identifying self-admitted technical debt through code comment analysis with a contextualized vocabulary. Information and Software Technology 121 (2020), 106270.

[9] Jernej Flisar and Vili Podgorelec. 2018. Enhanced Feature Selection Using Word Embeddings for Self-Admitted Technical Debt Identification. In Euromicro Conference on Software Engineering and Advanced Applications.

[10] Jernej Flisar and Vili Podgorelec. 2019. Identification of Self-Admitted Technical Debt Using Enhanced Feature Selection Based on Word Embedding. IEEE Access 7 (2019), 106475–106494.

[11] B. Fluri, M. Wursch, and H.C. Gall. 2007. Do Code and Comments Co-Evolve? On the Relation between Source Code and Comment Changes. In 14th Working Conference on Reverse Engineering (WCRE 2007). 70–79.

[12] Matthieu Foucault, Xavier Blanc, Margaret-Anne Storey, Jean-Rémy Falleri, and Cedric Teyton. 2018. Gamification: a Game Changer for Managing Technical Debt? A Design Study. arXiv: Software Engineering (2018).

[13] Sávio Freire, Nicolli Rios, Boris Gutierrez, Darío Torres, Manoel G. Mendonça, Clemente Izurieta, Carolyn B. Seaman, and Rodrigo O. Spínola. 2020. Surveying Software Practitioners on Technical Debt Payment Practices and Reasons for not Paying off Debt Items. In EASE. 210–219.

[14] Zhaoqiang Guo, Shiran Liu, Jinping Liu, Yanhui Li, Lin Chen, Hongmin Lu, Yuming Zhou, and Baowen Xu. 2019. MAT: A simple yet strong baseline for identifying self-admitted technical debt. arXiv: Software Engineering (2019).

[15] Matthew J. Howard, Samir Gupta, Lori Pollock, and K. Vijay-Shanker. 2013. Automatically mining software-based, semantically-similar words from comment-code mappings. In MSR. 377–386.

[16] Qiao Huang, Emad Shihab, Xin Xia, David Lo, and Shanping Li. 2018. Identifying self-admitted technical debt in open source projects using text mining. Empirical Software Engineering 23, 1 (2018), 418–451.

[17] Martina Iammarino, Fiorella Zampetti, Lerina Aversano, and Massimiliano Di Penta. 2019. Self-Admitted Technical Debt Removal and Refactoring Actions: Co-Occurrence or More?. In 2019 IEEE International Conference on Software Maintenance and Evolution, ICSME. 186–190.

[18] Clemente Izurieta, Ipek Ozkaya, Carolyn B. Seaman, Philippe Kruchten, Robert L. Nord, Will Snipes, and Paris Avgeriou. 2016. Perspectives on Managing Technical Debt: A Transition Point and Roadmap from Dagstuhl. In Joint Proceedings of the 4th International Workshop on Quantitative Approaches to Software Quality (QuASoQ 2016), Vol. 1771. 84–87.

[19] Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2017. Bag of Tricks for Efficient Text Classification. EACL 2017 (2017), 427.

[20] Michael Kampffmeyer, Arnt-Borre Salberg, and Robert Jenssen. 2016. Semantic segmentation of small objects and modeling of uncertainty in urban remote sensing images using deep convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. 1–9.

[21] Kisub Kim, Dongsun Kim, Tegawendé F. Bissyandé, Eunjong Choi, Li Li, Jacques Klein, and Yves Le Traon. 2018. FaCoY - A Code-to-Code Search Engine. In ICSE 2018.

[22] Pingfan Kong, Li Li, Jun Gao, Kui Liu, Tegawendé F. Bissyandé, and Jacques Klein. 2018. Automated Testing of Android Apps: A Systematic Literature Review. IEEE Transactions on Reliability (2018).

[23] Philippe Kruchten, Robert L. Nord, and Ipek Ozkaya. 2012. Technical Debt: From Metaphor to Theory and Practice. IEEE Software 29, 6 (2012), 18–21.

[24] Philippe Kruchten, Robert L. Nord, Ipek Ozkaya, and Davide Falessi. 2013. Technical debt: towards a crisper definition report on the 4th international workshop on managing technical debt. ACM SIGSOFT Software Engineering Notes 38, 5 (2013), 51–54.

[25] Valentina Lenarduzzi, Nyyti Saarimäki, and Davide Taibi. 2019. The Technical Debt Dataset. In Proceedings of the Fifteenth International Conference on Predictive Models and Data Analytics in Software Engineering, PROMISE 2019, Recife, Brazil, September 18, 2019. ACM, 2–11.

[26] Li Li, Tegawendé F. Bissyandé, Mike Papadakis, Siegfried Rasthofer, Alexandre Bartel, Damien Octeau, Jacques Klein, and Yves Le Traon. 2017. Static Analysis of Android Apps: A Systematic Literature Review. Information and Software Technology (2017).

[27] Zengyang Li, Paris Avgeriou, and Peng Liang. 2015. A systematic mapping study on technical debt and its management. Journal of Systems and Software 101 (2015), 193–220.

[28] Zhongxin Liu, Qiao Huang, Xin Xia, Emad Shihab, David Lo, and Shanping Li. 2018. SATD detector: a text-mining-based self-admitted technical debt detection tool. In Proceedings of the 40th International Conference on Software Engineering: Companion Proceedings, ICSE 2018, Gothenburg, Sweden, May 27 - June 03, 2018. 9–12.

[29] Rungroj Maipradit, Christoph Treude, Hideaki Hata, and Kenichi Matsumoto. 2019. Wait For It: Identifying "On-Hold" Self-Admitted Technical Debt. arXiv: Software Engineering (2019).

[30] E. D. S. Maldonado, R. Abdalkareem, E. Shihab, and A. Serebrenik. 2017. An Empirical Study on the Removal of Self-Admitted Technical Debt. In 2017 IEEE International Conference on Software Maintenance and Evolution (ICSME). 238–248.

[31] Everton da S. Maldonado and Emad Shihab. 2015. Detecting and quantifying different types of self-admitted technical debt. In 2015 IEEE 7th International Workshop on Managing Technical Debt (MTD). IEEE, 9–15.

[32] Solomon Mensah, Jacky Keung, Jeffery Svajlenko, Kwabena Ebo Bennin, and Qing Mi. 2018. On the value of a prioritization scheme for resolving self-admitted technical debt. Journal of Systems and Software 135 (2018), 37–54.

[33] Solomon Mensah, Jacky W. Keung, Michael Franklin Bosu, and Kwabena Ebo Bennin. 2016. Rework Effort Estimation of Self-admitted Technical Debt. In Joint Proceedings of the 4th International Workshop on Quantitative Approaches to Software Quality (QuASoQ 2016), Hamilton, New Zealand, December 6, 2016, Vol. 1771. 72–75.

[34] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013).

[35] Robert L. Nord, Ipek Ozkaya, Edward J. Schwartz, Forrest Shull, and Rick Kazman. 2016. Can Knowledge of Technical Debt Help Identify Software Vulnerabilities?. In 9th Workshop on Cyber Security Experimentation and Test.

[36] Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In EMNLP. 1532–1543.

[37] Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of NAACL-HLT. 2227–2237.

[38] Aniket Potdar and Emad Shihab. 2014. An exploratory study on self-admitted technical debt. In ICSME. IEEE, 91–100.

[39] Xiaoxue Ren, Zhenchang Xing, Xin Xia, David Lo, Xinyu Wang, and John Grundy. 2019. Neural Network Based Detection of Self-admitted Technical Debt: From Performance to Explainability. ACM Transactions on Software Engineering and Methodology 28, 3 (2019).

[40] Carolyn Seaman, Yuepu Guo, Nico Zazworka, Forrest Shull, Clemente Izurieta, Yuanfang Cai, and Antonio Vetrò. 2012. Using technical debt data in decision making: Potential decision approaches. In 2012 Third International Workshop on Managing Technical Debt (MTD). IEEE, 45–48.

[41] Giancarlo Sierra, Emad Shihab, and Yasutaka Kamei. 2019. A survey of self-admitted technical debt. Journal of Systems and Software 152 (2019), 70–82.

[42] Giancarlo Sierra, Ahmad Tahmid, Emad Shihab, and Nikolaos Tsantalis. 2019. Is Self-Admitted Technical Debt a Good Indicator of Architectural Divergences?. In SANER, Xinyu Wang, David Lo, and Emad Shihab (Eds.). 534–543.

[43] Rodrigo O. Spínola, Nico Zazworka, Antonio Vetro, Forrest Shull, and Carolyn B. Seaman. 2019. Understanding automated and human-based technical debt identification approaches - a two-phase study. J. Braz. Comp. Soc. 25, 1 (2019), 5:1–5:21.

[44] Daniela Steidl, Benjamin Hummel, and Elmar Juergens. 2013. Quality analysis of source code comments. In 2013 21st International Conference on Program Comprehension (ICPC). 83–92.

[45] Ben Stopford, Ken Wallace, and John Allspaw. 2017. Technical Debt: Challenges and Perspectives. IEEE Software 34, 4 (2017), 79–81.

[46] Xiaobing Sun, Qiang Geng, David Lo, Yucong Duan, Xiangyue Liu, and Bin Li. 2016. Code Comment Quality Analysis and Improvement Recommendation: An Automated Approach. International Journal of Software Engineering and Knowledge Engineering 26, 6 (2016), 981–1000.

[47] Edith Tom, Aybuke Aurum, and Richard Vidgen. 2013. An exploration of technical debt. Journal of Systems and Software 86, 6 (2013), 1498–1516.

[48] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems. 5998–6008.

[49] Bradley L. Vinz and Letha H. Etzkorn. 2008. Improving program comprehension by combining code understanding with comment understanding. Knowledge-Based Systems 21, 8 (2008), 813–825.

[50] Supatsara Wattanakriengkrai, Rungroj Maipradit, Hideaki Hata, Morakot Choetkiertikul, and Kenichi Matsumoto. 2018. Identifying Design and Requirement Self-Admitted Technical Debt Using N-gram IDF. In IWESEP.

[51] Supatsara Wattanakriengkrai, Napat Srisermphoak, Sahawat Sintoplertchaikul, Morakot Choetkiertikul, Chaiyong Ragkhitwetsagul, Thanwadee Sunetnanta, Hideaki Hata, and Kenichi Matsumoto. 2019. Automatic Classifying Self-Admitted Technical Debt Using N-Gram IDF. (2019), 316–322.

[52] Sultan Wehaibi, Emad Shihab, and Latifa Guerrouj. 2016. Examining the impact of self-admitted technical debt on software quality. In 2016 IEEE 23rd International Conference on Software Analysis, Evolution, and Reengineering (SANER), Vol. 1. IEEE, 179–188.

[53] Will Snipes, Carolyn Seaman, Ipek Ozkaya, and Clemente Izurieta. 2017. Technical Debt: A Research Roadmap Report on the Eighth Workshop on Managing Technical Debt (MTD 2016). ACM SIGSOFT Software Engineering Notes (2017).

[54] Jifeng Xuan, Yan Hu, and Jiang He. 2012. Debt-Prone Bugs: Technical Debt in Software Maintenance. International Journal of Advancements in Computing Technology 4 (10 2012), 453–461.

[55] Meng Yan, Xin Xia, Emad Shihab, David Lo, Jianwei Yin, and Xiaohu Yang. 2018. Automating change-level self-admitted technical debt determination. IEEE Transactions on Software Engineering (2018).

[56] Meng Yan, Xin Xia, Emad Shihab, David Lo, Jianwei Yin, and Xiaohu Yang. 2019. Automating Change-Level Self-Admitted Technical Debt Determination. IEEE Transactions on Software Engineering 45, 12 (2019), 1211–1229.

[57] Zhe Yu, Fahmid Morshed Fahid, Huy Tu, and Tim Menzies. 2020. Identifying Self-Admitted Technical Debts with Jitterbug: A Two-step Approach. CoRR abs/2002.11049 (2020).

[58] Fiorella Zampetti, Cedric Noiseux, Giuliano Antoniol, Foutse Khomh, and Massimiliano Di Penta. 2017. Recommending when Design Technical Debt Should be Self-Admitted. In 2017 IEEE International Conference on Software Maintenance and Evolution, ICSME. 216–226.

[59] Fiorella Zampetti, Alexander Serebrenik, and Massimiliano Di Penta. 2018. Was Self-Admitted Technical Debt Removal a Real Removal? An In-Depth Perspective. In MSR.

[60] Fiorella Zampetti, Alexander Serebrenik, and Massimiliano Di Penta. 2020. Automatically Learning Patterns for Self-Admitted Technical Debt Removal. In SANER.

[61] Nico Zazworka, Michele A. Shaw, Forrest Shull, and Carolyn Seaman. 2011. Investigating the impact of design debt on software quality. In Proceedings of the 2nd Workshop on Managing Technical Debt. 17–23.

[62] Nico Zazworka, Rodrigo O. Spínola, Antonio Vetro, Forrest Shull, and Carolyn Seaman. 2013. A case study on effectively identifying technical debt. In Proceedings of the 17th International Conference on Evaluation and Assessment in Software Engineering. 42–47.

[63] Nico Zazworka, Antonio Vetro, Clemente Izurieta, Sunny Wong, Yuanfang Cai, Carolyn Seaman, and Forrest Shull. 2014. Comparing four approaches for technical debt identification. Software Quality Journal 22, 3 (2014), 403–426.

[64] Peng Zhou, Wei Shi, Jun Tian, Zhenyu Qi, Bingchen Li, Hongwei Hao, and Bo Xu. 2016. Attention-based bidirectional long short-term memory networks for relation classification. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). 207–212.