


Understanding Uses and Misuses of Similarity Hashing Functions for

Malware Detection and Family Clustering in Actual Scenarios

Marcus Botacin1 Vitor Hugo Galhardo Moia2,3 Fabricio Ceschin1

Marco A. Amaral Henriques3 André Grégio1

1 Federal University of Paraná (UFPR) {mfbotacin,fjoceschin,gregio}@inf.ufpr.br

2 Samsung R&D Institute Brazil [email protected]

3 University of Campinas (UNICAMP) [email protected]

Abstract

An ever-growing number of malware variants targets end-users and organizations. To reduce the amount of individual malware handling, security analysts apply techniques for finding similarities to cluster samples. A popular clustering method relies on similarity hashing functions, which create short representations of files and compare them to produce a score related to the similarity level between them. Despite the popularity of those functions, the limits of their application to malware samples have not been extensively studied so far. To help bridge this gap, we performed a set of experiments to characterize the application of these functions in long-term, realistic malware analysis scenarios. To do so, we introduce SHAVE, an ideal model of a similarity hashing-based antivirus engine. The evaluation of SHAVE consisted of applying two distinct hash functions (ssdeep and sdhash) to a dataset of 21 thousand actual malware samples collected over four years. We characterized this dataset based on the resulting clustering and discovered that: (i) smaller groups are more prevalent than large ones; (ii) the chosen threshold value may significantly change the conclusions about the prevalence of similar samples in a given dataset; (iii) establishing a ground truth for comparing similarity hashing functions has its issues, since the clusters originated from traditional AV labeling routines may result from a completely distinct approach; (iv) the application of similarity hashing functions improves traditional AVs' detection rates by up to 40%; and finally, (v) taking specific binary regions into account (e.g., instructions) leads to better classification results than hashing the entire binary file.

1 Introduction

Antiviruses (AVs) have been the main line of defense against malware infections in the last years, but even the AV industry struggles to provide countermeasures for the exponentially growing number of malware variants released daily [22]. AV companies usually handle the huge amount of newly discovered malware by grouping samples into clusters, then applying the same detection processes to each of them. Thus, companies save time and prevent their analysts from analyzing plenty of similar individual samples. This procedure of complexity reduction is backed by the fact that many samples are created by the same attackers with the same "development model", which results in malware that exhibits similar constructs [4].

Identifying malware similarity is as challenging as it is important for defense purposes. Many different approaches have been proposed for similarity identification over time. A popular similarity identification technique is similarity hashing (a.k.a. Approximate Matching [18]), which produces small and compact object representations called digests. The comparison of two objects' digests using these types of functions produces a value (score) that indicates the level of similarity between those objects; a given threshold determines whether the objects are considered similar to each other or not. Similarity hashing functions became popular due to their greater speed in comparison to related approaches, thus being implemented by many security companies in their products, servers, and endpoints (e.g., TrendMicro relies on similarity hashing for IoT malware clustering [43], whereas Google applies it for finding malicious apps in its Play Store [28]).

Despite their popularity, the academic literature about similarity hashing functions is still limited, either in number or in the broadness of their use in malware detection. For instance, Pagani et al. [56] presented an extensive evaluation on the application of hash functions for many binary analysis tasks, but the specific case of malware variants was little explored. No available work focuses on the drawbacks of malware analysis, such as threshold definition issues. The literature also does not discuss the limitations, development opportunities, and challenges of similarity hashing application to actual malware samples in real scenarios. Therefore, we decided to investigate these possibilities and fill these gaps with a methodological approach for the application of similarity hashing functions for malware classification and clustering.

We investigate the application of similarity hashing functions for three distinct, independent tasks: (i) characterization of real datasets; (ii) malware detection; and (iii) malware clustering in families. For each one of them, we highlight the similarity hashing functions' strong and weak points. For all tasks, since our goal is to evaluate the functions in real scenarios, we used a dataset of 21 thousand distinct real-world malware samples collected during seven years from real infected machines.

For the characterization task, we leverage distinct setups of hashing functions (ssdeep and sdhash) and thresholds (from 0 up to 100%), such that we can explore the best parameters for performing the task at hand. We report a set of exploratory results on the application of hashing functions to malware samples, such as the number of clustered samples, their sizes, and the impact of distinct threshold values on cluster definition. We successfully identified multiple clusters in the real-world dataset, some of which contain up to a hundred samples. We discovered that distinct threshold values might completely change the dataset characterization outcomes by indicating that it presents either a low or a high ratio of similar samples. Even worse, there is currently no guideline for such threshold definition, such that distinct researchers choose values in an ad-hoc manner.

For the detection task, we introduce SHAVE (Similarity Hashing AntiVirus Engine), an ideal model of a cloud-based AV with similarity hashing capabilities, to evaluate its application to malware samples. By leveraging SHAVE, we discovered that traditional AVs' detection rates can be increased by up to 40% if the similarity between detected and undetected samples is taken into account. We also discovered that AVs' cloud-based architectures might allow more complex similarity identification routines to be put into practice by applying them only to the clusters' centroids, instead of to all samples.

For the malware classification task, we compared similarity hashing functions' and traditional AntiViruses' (AVs) clustering capabilities, since AVs are the most popular solutions for labeling malware. We identified difficulties in establishing a ground truth for the number and size of clusters reported by the similarity hashing functions. We highlight that it is difficult to compare SHAVE and traditional AVs (the closest-related malware detection solutions to this work), since the latter generate clusters based on completely distinct metrics (e.g., behavioral heuristics). We conclude that malware classification is the hardest task to be performed either by using AVs or similarity hashing functions, which is often reflected in the lower metric scores (accuracy, precision, and so on) of these methods, as reported in this paper. In an attempt to increase classification capabilities, we investigate distinct strategies, such as considering only specific binary excerpts and/or sections. For instance, we investigate the impact of common blocks [47], i.e., common structures that repeat in many objects of the same and different file types, on the similarity scores. We discovered that considering only the 70% most prevalent binary instructions slightly increases the overall metric scores (e.g., accuracy, precision, recall).

Therefore, our results suggest that (i) the adoption of similarity hashing functions is a promising candidate technique to support the development of the next generation of AVs intended to operate under large-scale demands; but (ii) their application should take into account the drawbacks we describe in this study.

Our major goal with this case study is to discuss and suggest mechanisms to increase the confidence and reliability of the malware classification procedures based on similarity hashing functions, as used by malware analysts and forensic experts. In this sense, our contributions in this paper are:

• We propose SHAVE, which presents the concept of a similarity hashing-based AV engine operating solely from a cloud server with a centralized malware definition repository.

• We evaluate the limits and benefits of applying similarity hashing functions for malware detection and classification in real-world scenarios, investigate the impact of removing common blocks from the similarity scoring, and show the issues of using AV labels as ground truth for malware clustering.

• We show that using AV labels as ground truth for malware family clustering is challenging, since labels come from techniques other than similarity hashing functions, producing distinct results.

• We investigate the impact of adapting a forensic common-blocks removal approach to the malware similarity identification context, so as to allow the most discriminant binary instructions to weigh more heavily on the final similarity score than common instructions.

• We pinpoint and discuss challenges and opportunities for future work on malware detection and similarity clustering.

The remainder of this paper is organized as follows: In Section 2, we present background information on how similarity hashing functions work; in Section 3, we present our experimental methodology and the concept of a similarity hashing-based AntiVirus; in Section 4, we evaluate the application of similarity hashing functions for malware detection and classification in real-world scenarios; in Section 5, we revisit our findings to propose guidelines for the application of similarity hashing functions; in Section 6, we discuss the implications of our findings; in Section 7, we present related work to better position our contributions; finally, we draw our conclusions in Section 8.

2 Background

This work's goal is to evaluate the application of similarity hashing to speed up malware clustering procedures. Therefore, we here present background information on similarity hashing that supported the development of this work and describe the clustering tasks we are tackling with it.


2.1 Similarity Hashing

Traditional hash functions (e.g., MD5, SHA-1, SHA-2) aim to create small and compact object representations (a.k.a. digests) for their unique identification. The main characteristic of these hashes is that flipping a single bit in the input drastically changes the output, making possible only the detection of identical objects. To tackle this limitation of traditional hashes, similarity hashing functions were developed to create short object representations that allow the identification of similarity between objects. A small change in the input is reflected in a minor and localized variation in the digest, allowing similar objects to be correlated. Another difference is that, while traditional hashes produce binary answers for the comparison of two objects (they are identical or not), similarity hashing uses special comparison functions that provide a confidence measure (score) about the similarity of two objects. This score can be in a fixed interval (from 0 to 100) or express dissimilarity (from 0 up to a given limit). A score of 100 (or another value representing maximum similarity) produced by similarity hashes does not mean that the two compared objects are identical in every single byte, but that they have a high degree of similarity.
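The contrast between the two behaviors can be demonstrated with a small sketch. Here, `difflib` is only a stand-in comparison (it is not a real similarity hash), used because it is self-contained and mimics the 0-100 score interval described above:

```python
import difflib
import hashlib

def crypto_digest(data):
    # Traditional hash: flipping a single bit yields a completely new digest.
    return hashlib.sha256(data).hexdigest()

def similarity_score(a, b):
    # A 0-100 similarity score; difflib stands in for ssdeep/sdhash here.
    # autojunk=False avoids difflib's popular-element heuristic on long inputs.
    return round(difflib.SequenceMatcher(None, a, b, autojunk=False).ratio() * 100)

original = b"MZ\x90\x00" + b"A" * 1000
variant  = b"MZ\x90\x00" + b"A" * 990 + b"B" * 10   # small, localized change

# Cryptographic digests only reveal identity...
assert crypto_digest(original) != crypto_digest(variant)
# ...while a similarity score still relates the two variants.
assert similarity_score(original, variant) > 90
assert similarity_score(original, original) == 100
```

Even though only ten bytes changed, the cryptographic digests share nothing, while the similarity score remains close to the maximum.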

The digest generation process [45] is composed of a feature extraction step (here, a feature is defined as a sequence of bytes extracted from the object) and a filtering step to reduce the number of produced features. This happens because some tools extract overlapping features and/or some features are weak and should not be considered, since they would increase false positives (e.g., a sequence of zeros). The interpretation of the final result varies according to the tool; some use fixed threshold values to declare similarity, while others produce percentage values.
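The two steps can be sketched as follows; the window width and the entropy cut-off are illustrative values, not the parameters of any particular tool:

```python
import math
from collections import Counter

def extract_features(data, width=8):
    # Feature extraction: every fixed-width byte sequence seen through a
    # window sliding over the object, one position at a time.
    return [data[i:i + width] for i in range(len(data) - width + 1)]

def shannon_entropy(feature):
    counts = Counter(feature)
    return -sum((c / len(feature)) * math.log2(c / len(feature))
                for c in counts.values())

def filter_features(features, min_entropy=1.0):
    # Filtering: collapse overlapping duplicates (set) and drop weak,
    # low-entropy features (e.g., runs of zeros) that inflate false positives.
    return {f for f in features if shannon_entropy(f) >= min_entropy}

data = b"\x00" * 32 + b"payload-bytes-of-interest"
kept = filter_features(extract_features(data))
assert b"\x00" * 8 not in kept              # weak all-zero feature filtered out
assert any(b"payload" in f for f in kept)   # distinctive content survives
```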

Similarity hashing is widely applied in the digital forensics field, where practitioners use it to find similar data in multiple contexts [61, 31], including the location of malware variants, the topic of this work. Although these functions are formally known as Approximate Matching [18], we opted for using the term similarity hashing for the rest of this work, since it is better known in the malware community.

The Multiple Similarity Hashing Functions. Many tools implement the concepts of similarity hashes, and it is hard to cover all of them in a single research effort. Therefore, we limited our analysis to the most popular solutions. Table 1 shows a sampling of the most recent papers available in popular publishers' (ACM, IEEE, Elsevier) repositories. Our findings suggest ssdeep and sdhash as the targets of most research works in the literature. Besides, these two functions were also scrutinized in many research papers (see next paragraphs) and are well accepted by the community.

In the following paragraphs, we summarize the use and the operation of these two well-known tools (ssdeep and sdhash), which are the target of constant research and the focus of this paper. A detailed analysis of their internal aspects is presented in [42].

Table 1: Recently Published Works. ssdeep and sdhash are the most popular functions. (The table marks which of the surveyed functions (ssdeep, sdhash, TLSH, mvHash, Lempel-Ziv) each work evaluates: Shiel et al. [66], Sarantinos et al. [64], Pagani et al. [56], Azab et al. [4], Naik et al. [51], and Raff et al. [59].)

ssdeep is a popular similarity hashing function used in multiple contexts. It implements the Content-Triggered Piecewise Hashing (CTPH) concept to detect content similarity at the bytewise level. ssdeep creates variable-size blocks with the aid of a rolling hash algorithm that determines where blocks start and stop (their boundaries) based on a (random) value obtained from a fixed-size window that moves through the input byte by byte. Next, all blocks produced are hashed using an FNV hash function [52], and the six least significant bits of each hash are encoded as a Base64 character; the digest is the concatenation of all characters. Similarity is assessed with ssdeep by using an Edit Distance function [68] that counts the minimum number of operations required to transform one digest into the other, producing a value (score) indicating similarity; it ranges from 0 (dissimilarity) to 100 (perfect match). No threshold is used by ssdeep; all comparisons with score > 0 are considered as similar. More information about the whole process of creating and comparing digests can be found in [36].
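The block-splitting idea behind CTPH can be sketched in a few lines. This is a deliberately simplified toy, not ssdeep itself: a plain rolling sum replaces ssdeep's actual rolling hash, and there is no adaptive block sizing. It nonetheless shows why a local edit only perturbs a few digest characters:

```python
FNV_PRIME, FNV_OFFSET = 0x01000193, 0x811C9DC5
B64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

def fnv1a(block):
    # 32-bit FNV-1a over one block.
    h = FNV_OFFSET
    for byte in block:
        h = ((h ^ byte) * FNV_PRIME) & 0xFFFFFFFF
    return h

def ctph_digest(data, window=7, trigger=64):
    # Toy CTPH: a rolling sum over a fixed-size window decides block
    # boundaries; each block is FNV-hashed and the 6 least significant bits
    # become one Base64 character. The digest concatenates the characters.
    digest, start = [], 0
    for i in range(window, len(data)):
        rolling = sum(data[i - window:i])      # stand-in for the rolling hash
        if rolling % trigger == trigger - 1:   # content-defined boundary
            digest.append(B64[fnv1a(data[start:i]) & 0x3F])
            start = i
    digest.append(B64[fnv1a(data[start:]) & 0x3F])  # final partial block
    return "".join(digest)

a = bytes(range(256)) * 8
b = a[:1000] + b"\xff\xff" + a[1000:]  # small insertion in the middle
# Boundaries are content-defined, so blocks before the edit and blocks after
# the boundaries resynchronize are identical: the digests share a long prefix
# and suffix, and only the characters around the edited region differ.
assert ctph_digest(a)[:5] == ctph_digest(b)[:5]
assert ctph_digest(a)[-5:] == ctph_digest(b)[-5:]
```

Because the boundaries depend only on local window content, an insertion shifts the data without invalidating the blocks around it, which is what makes edit-distance comparison of the digests meaningful.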

sdhash is another similarity hashing function that operates at the bytewise level to detect similarity between objects [60]. In short, sdhash aims to identify statistically improbable features in objects (i.e., unique ones) and map them into similarity digests. A feature in this context is a β-byte sequence (64 bytes by default) extracted from the object, and those which have the lowest empirical probability of being encountered by chance (based on their Shannon entropy) are selected to represent the object. Next, all selected features are hashed using SHA-1 and stored into Bloom filters [10]. Each filter can store at most 160 features; when a filter reaches its capacity, a new one is created. The final object digest is the sequence of all Bloom filters. The similarity of two digests is measured by comparing the sets of Bloom filters against each other; a normalized score is returned in the range of 0 (no similarity) to 100 (very similar or identical). Roussev and Quates [63] suggest the use of a threshold value of 21 to define when objects are similar, since comparisons with lower scores tend to produce many false positives, although this threshold was not evaluated in the malware detection context. Many studies have been developed to improve sdhash's efficiency [34] and precision [47, 45]. In this work, we adopt a new version of sdhash called J-sdhash [45], since it adopts a new comparison function that is more precise and whose results are easier to interpret, as they are expressed in terms of the real similarity shared between objects.
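The pipeline can be approximated in a toy sketch. This is not sdhash: it uses a single fixed-size Bloom filter instead of a chain of 160-feature filters, non-overlapping features, a simplified entropy selection, and a crude bit-overlap ratio in place of sdhash's normalized comparison. It only illustrates the select-hash-insert-compare flow described above:

```python
import hashlib
import math
from collections import Counter

class BloomFilter:
    # Minimal Bloom filter; k bit positions are sliced from a SHA-1 digest.
    def __init__(self, bits=2048, k=5):
        self.bits, self.k, self.array = bits, k, 0

    def _positions(self, feature):
        digest = hashlib.sha1(feature).digest()  # 20 bytes -> five 4-byte slices
        for j in range(self.k):
            yield int.from_bytes(digest[4 * j:4 * j + 4], "big") % self.bits

    def add(self, feature):
        for p in self._positions(feature):
            self.array |= 1 << p

    def overlap(self, other):
        # Fraction of this filter's set bits also present in the other filter:
        # a crude stand-in for sdhash's filter-vs-filter comparison.
        mine = bin(self.array).count("1")
        shared = bin(self.array & other.array).count("1")
        return shared / mine if mine else 0.0

def entropy(chunk):
    counts = Counter(chunk)
    return -sum((c / len(chunk)) * math.log2(c / len(chunk))
                for c in counts.values())

def sdhash_like(data, beta=64, min_entropy=2.0):
    # Keep only sufficiently "improbable" (high-entropy) beta-byte features
    # and map them into the Bloom-filter-backed digest.
    bf = BloomFilter()
    for i in range(0, len(data) - beta + 1, beta):
        feature = data[i:i + beta]
        if entropy(feature) >= min_entropy:
            bf.add(feature)
    return bf

a = bytes((i * 37) % 251 for i in range(4096))
b = a[:2048] + bytes((i * 53) % 241 for i in range(2048))  # half-shared object
assert round(sdhash_like(a).overlap(sdhash_like(a)) * 100) == 100
half_score = round(sdhash_like(a).overlap(sdhash_like(b)) * 100)
assert 0 < half_score < 100  # shared first half keeps the score well above zero
```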


2.2 Malware Clustering & Similarity

The presented functions can be applied to many classification tasks and over any type of binary. Malware analysis is one of the most popular tasks for similarity hashing functions and therefore our focus in this work. Malware classification tasks can be divided into three types:

1. Malware Family Clustering: In this task, a set of binaries reported as malicious (with no malware family label) is classified into clusters of similar binaries (malware families). The goal is to group them into N clusters that maximize their similarities, generating family labels for future classification and remediation tasks. Note that this step is similar to unsupervised learning in machine learning, given that there are no labels involved in the process of generating labels for unknown samples [8].

2. Malware Detection: In this task, unknown binaries are classified into a cluster of either malware or goodware samples. The goal is to identify whether the unknown binary is malicious or not based on a set of known malware and goodware samples, similar to a supervised learning process [8].

3. Family Classification: In this task, a binary reported as malicious is classified into clusters of similar, known, labeled malware binaries (malware families). This task's goal is to separate malware samples belonging to one family from another, thus allowing the development of targeted remediation procedures. This approach should not be confused with malware family clustering: in the task described here, family labels are available for the known malware binaries (e.g., assigned by previous antivirus checks), as in any training set of supervised learning approaches [8].

Similarity hash functions present distinct drawbacks and outcomes when applied for clustering, detection, and classification. In this work, we explore the three scenarios to investigate their advantages and constraints.

3 Methodology

In this section, we present the scientific methodology adopted in the malware clustering experiments and the dataset considered in them. While we do not claim that our scientific methodology can be immediately applied by practitioners, we expect it to be adapted and applied in conjunction with other existing forensic methodologies (e.g., [40, 3]).

3.1 Dataset Selection

Our goal in this work is to evaluate the application of similarity hashing for clustering malware samples in real-world scenarios. To this end, we searched for a dataset of real malware samples. We considered the dataset characterized in [11] as representative of a real scenario, because it covers a time frame of seven years of in-the-wild malware samples collected from infected users' machines by a security company. This dataset has already proven to be challenging for other classification tasks [7, 21], such that we believe it might highlight challenges to malware classification using similarity hashing functions as well. This dataset is composed of multiple file types, either executable ones (e.g., EXEs and DLLs written in C, CPLs written in Delphi, and code compiled from .NET) or script-based ones (e.g., VBS, JavaScript, and so on). We selected for our evaluation only the executable binary files present in the dataset, which resulted in a total of 20,986 files considered in this study. These binaries are spread over more than 100 distinct family clusters (according to AVs' labels), with 53% of them being "Downloaders" or "Password Stealers" variations. Around 40% of all samples are packed (50% of them with UPX and the other 50% with multiple, distinct packers). We did not filter obfuscated samples out, to demonstrate their impact on analyses in a real inspection case.

To assess the False Positive (FP) rates of the classification procedures, we also applied the tools to a set of legitimate binaries (goodware) divided into two classes: (i) 2,870 executable binaries and DLL files retrieved from a fresh Windows 8/x64 installation; and (ii) the 2,935 most downloaded binaries from popular software repositories (e.g., CNET) in 2019. The applications were described in a previous study [12]. All goodware files were checked using the VirusTotal service to ensure they were downloaded free of bundled malware.

3.2 Clustering Criteria

The considered similarity hashing tools show the similarity between files A and B in a pairwise manner, but they do not cluster these files automatically. Thus, we parsed their outputs to create the clusters ourselves. When clustering the files, it is natural to think about transitivity: if A and B are similar and B and C are similar, then A and C should be similar too. Unfortunately, this might not hold for all files, since they might share distinct code portions. Therefore, we need to choose a criterion for whether to include such files in a given cluster. In this work, we decided to include the three files of the aforementioned example in the same cluster whenever the similarities between A and B and between B and C are greater than or equal to the considered threshold for a given experiment.
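This transitive criterion amounts to computing connected components over the pairwise matches that reach the threshold. A minimal sketch using union-find (the sample names and scores are illustrative, not data from the study):

```python
class DisjointSet:
    # Union-find: files end up in the same cluster whenever a chain of
    # pairwise matches (score >= threshold) connects them, which is exactly
    # the transitive inclusion criterion described above.
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

def cluster(pairwise_scores, threshold):
    ds = DisjointSet()
    for (a, b), score in pairwise_scores.items():
        ds.find(a), ds.find(b)          # register both files, even singletons
        if score >= threshold:
            ds.union(a, b)
    clusters = {}
    for f in list(ds.parent):
        clusters.setdefault(ds.find(f), set()).add(f)
    return sorted(clusters.values(), key=len, reverse=True)

# A~B and B~C put all three in one cluster, even though A and C score
# below the threshold (45 < 60); D remains a singleton.
scores = {("A", "B"): 80, ("B", "C"): 70, ("A", "C"): 45, ("C", "D"): 10}
result = cluster(scores, threshold=60)
assert {"A", "B", "C"} in result and {"D"} in result
```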

3.3 AV Ground-Truth

AntiViruses (AVs) are the most popular defense against malware. Therefore, in this work, we consider AV detection results for ground-truth evaluation experiments. All samples hereby reported as malware were labeled as such by at least one AV engine after the samples' submission to the VirusTotal service [70]. In addition to detection rates, we also considered the labels attributed to the samples by the AVs as ground truth for family clustering experiments. Moreover, we normalized all labels to avoid inconsistency issues [65] whenever the experiments required it. For instance, samples


reported as belonging to the W32/Delf.A and W32/Delf.B families were assigned to the same cluster. In the case of "Generic" labels, we assigned the samples to a new cluster to reflect the fact that AV analysts will have to manually inspect the "unknown" samples. Depending on the experiment's goal, singleton clusters are discarded. All labels were retrieved in Dec/2019, and attempts to reproduce our results should refer to these scans, as AV labels may change over time [13].
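A hypothetical normalizer following these rules might look as below; the paper does not publish its exact normalization code, so the suffix pattern and naming are illustrative:

```python
import re

def normalize_label(av_label, sample_hash):
    # "Generic" detections give no family information, so each such sample
    # gets its own fresh cluster, reflecting the need for manual inspection.
    if "generic" in av_label.lower():
        return f"UNKNOWN-{sample_hash}"
    # Strip the trailing variant suffix so that W32/Delf.A and W32/Delf.B
    # map to the same family cluster.
    return re.sub(r"\.[A-Za-z0-9]+$", "", av_label)

assert normalize_label("W32/Delf.A", "s1") == normalize_label("W32/Delf.B", "s2") == "W32/Delf"
assert normalize_label("Win32.Generic", "s1") != normalize_label("Win32.Generic", "s2")
```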

3.4 The Ideal AV Model

All experiments are described in this paper as if they were applied in the light of an ideal, Similarity Hashing-based AntiVirus Engine (SHAVE). The reliance on similarity hashing functions allows us not only to match identical binaries (100% similarity), as in typical signature-based detection schemes, but also to extend AV detection capabilities to detect samples based on partial matches. We restricted our threat model to signature-based AVs so as not to bias our results with influences from other detection mechanisms, such as heuristic detectors. The ideal AV engine is structured in a client-server architecture. The client running on the user's endpoint triggers inspection procedures according to user demands and uploads the digest of the suspicious file to the server running in the cloud. The server is responsible for effectively implementing the detection logic, which, in the case of this work, relies on similarity metrics. The server compares the suspicious digest against a previously built database of similarity digests of multiple known threats and reports the similarity score. The suspicious file is considered malicious or not according to the established threshold, which can be configured either on the client or the server side. The major advantage of the considered approach is that the AV only uploads a digest to the cloud instead of an entire file, unlike some cloud-based AV solutions. This preserves users' privacy and saves network traffic. The centralized database also allows instant malware definition updates, as these do not need to be distributed to endpoints via the Internet. Therefore, each scan request is always performed against the latest version of the malicious digest database. Whereas the experiments presented in Section 4 justify SHAVE's viability, we do not expect it to be adopted in isolation, as in our experiments, but in conjunction with traditional AV solutions.
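The client-server split can be sketched as follows. Both the digest and the comparison functions below are toys chosen to keep the example self-contained; in SHAVE they would be a real similarity hash such as ssdeep or sdhash:

```python
from dataclasses import dataclass

def compute_digest(file_bytes):
    # Toy digest (NOT a real similarity hash): the 32 lowest distinct byte
    # values, just enough to demonstrate the digest-only protocol below.
    return bytes(sorted(set(file_bytes))[:32])

def compare_digests(d1, d2):
    # Jaccard-style 0-100 score standing in for an ssdeep/sdhash comparison.
    s1, s2 = set(d1), set(d2)
    return round(100 * len(s1 & s2) / max(len(s1 | s2), 1))

@dataclass
class ShaveServer:
    known_threats: dict   # centralized digest database: family name -> digest
    threshold: int = 60   # similarity cut-off, configurable server-side

    def scan(self, digest):
        # Only the digest crosses the network; the file never leaves the client.
        scores = {name: compare_digests(digest, d)
                  for name, d in self.known_threats.items()}
        name, score = max(scores.items(), key=lambda kv: kv[1])
        return (name, score) if score >= self.threshold else (None, score)

malware = b"\x4d\x5a" + bytes(range(40))
variant = b"\x4d\x5a" + bytes(range(38)) + b"\xff\xfe"  # partial match only
server = ShaveServer({"W32/Delf": compute_digest(malware)})

label, score = server.scan(compute_digest(variant))  # client sends digest only
assert label == "W32/Delf" and score >= server.threshold
assert server.scan(compute_digest(bytes(range(128, 240)))) == (None, 0)
```

Updating the `known_threats` dictionary on the server immediately affects every subsequent scan, which is the instant-update property described above.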

4 Experiments & Results

In this section, we present and discuss the experiments performed to evaluate the feasibility of leveraging similarity hashing functions for multiple, distinct malware-related tasks. More precisely, we first describe the characteristics presented by similarity hashing functions when characterizing a real dataset of malware samples. We further evaluate whether this type of function allows one to cluster unknown binaries to separate them into malicious and goodware samples. We finally evaluate whether one can leverage this approach to identify distinct malware families.

4.1 Characterizing a Real Dataset: Malware Family Clustering

An immediate application for similarity hashing functions is to characterize an unknown dataset regarding the number of malware families and their sizes, which allows analysts to make inferences about the given scenario, such as how active and diverse it is. Dataset characterization, however, is not a straightforward task, but depends on multiple factors. In this work, we are interested in shedding light on some of them. In particular, we considered two main factors: (i) the similarity hash function used; and (ii) the similarity threshold value.

The similarity hash function used directly affects the final result, as each function is developed using different techniques (see Section 2) and thus outputs a distinct similarity score. Pagani et al. [56] identified that the sdhash function outperforms the popular ssdeep when applied to their dataset of collected samples. We here applied these same function classes1 to a dataset of malware samples collected in the wild to evaluate whether this conclusion holds in broader scopes.

The threshold value affects the result because it conceptually defines when a sample is considered similar to another. Failures in adopting a proper threshold value might lead to False Positives (FPs) or False Negatives (FNs) in the case of malware detectors and/or classifiers. In the following, we discuss the extent of the influence of the threshold value on the overall dataset characterization.

The similarity hash functions. We evaluated the impact of distinct similarity hash functions by applying them to the dataset described in Section 3. Figure 1 and Figure 2 show, respectively, the relative number of clustered samples and the total number of identified clusters for the two similarity hash functions according to the threshold variation (from 0 to 100). Figure 1 shows that the tighter the considered threshold, the fewer samples are identified as similar. J-sdhash slightly outperforms ssdeep for all thresholds, which confirms Pagani's results. Figure 2 clarifies that, when using laxer thresholds, the similarity hashing functions not only add more samples to already-existing clusters (which could be due to False Positives), but indeed generate more clusters to host samples whose similarity was not identified using stricter thresholds. From an incident response perspective, this brings two significant advantages: (i) the more samples are identified as similar, the faster they can be handled by a uniform procedure; and (ii) having more specific clusters allows more fine-grained incident response.
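The metric plotted in Figure 1 (share of samples deemed similar, per threshold) can be computed from pairwise scores in a few lines; the file names and scores below are illustrative, not data from the study:

```python
def clustered_ratio(pairwise_scores, files, threshold):
    # A sample counts as "clustered" when it matches at least one other
    # sample at or above the threshold; everything else stays a singleton.
    clustered = set()
    for (a, b), score in pairwise_scores.items():
        if score >= threshold:
            clustered.update((a, b))
    return 100 * len(clustered) // len(files)

files = ["m1", "m2", "m3", "m4"]
scores = {("m1", "m2"): 95, ("m2", "m3"): 40, ("m3", "m4"): 10}

# The same dataset looks mostly "variants" under a lax threshold and mostly
# "unique samples" under a strict one.
assert clustered_ratio(scores, files, 10) == 100
assert clustered_ratio(scores, files, 50) == 50
assert clustered_ratio(scores, files, 100) == 0
```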

Finding #1: The J-SDHASH tool covers more samples and generates more clusters than the SSDEEP tool, thus easing malware triaging.
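The grouping procedure underlying these measurements can be sketched as pairwise scoring followed by transitive merging. Below is a minimal, hypothetical sketch (not the authors' implementation): `similarity(a, b)` is a placeholder for a real digest comparison such as ssdeep's or J-sdhash's, returning a 0-100 score.

```python
from itertools import combinations

def cluster_by_similarity(samples, similarity, threshold):
    """Group samples whose pairwise similarity score (0-100) meets the
    threshold, merging transitively (union-find).

    `similarity` is a placeholder for a real digest comparison such as
    ssdeep.compare(); it is NOT the paper's implementation.
    """
    parent = {s: s for s in samples}  # each sample starts alone

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in combinations(samples, 2):
        if similarity(a, b) >= threshold:
            parent[find(a)] = find(b)  # merge the two groups

    groups = {}
    for s in samples:
        groups.setdefault(find(s), []).append(s)
    # Singletons are "unclustered": only groups of 2+ count as clusters.
    return [c for c in groups.values() if len(c) > 1]
```

Sweeping `threshold` over this routine reproduces the kind of curves shown in Figures 1 and 2: laxer thresholds admit more pairs, and thus more and larger clusters.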

The threshold. The threshold value is key for a good classification analysis, but there seems to be no consensus in the literature about which is the ideal value. One can find related work suggesting values of 5 [62], 20 [43], and 90 [27]. In practice, researchers end up adopting ad-hoc threshold

¹ We used J-sdhash instead of the original sdhash version.



Figure 1: Relative Number of Clustered Samples per Similarity Threshold for the two similarity hashing functions.


Figure 2: Absolute Number of Clusters per Similarity Threshold for the two similarity hashing functions.

values among the ones considered in our experiments, since there is currently no guideline for threshold definition. Understanding and defining threshold criteria is an important open research question, because the variation in the number of clusters and clustered samples in some scenarios might be expressive enough to lead to contradictory conclusions. For instance, in the considered dataset, on the one hand, a lax threshold of 50% or less (Figure 1) would support the conclusion that more than 50% of the dataset is composed of malware variants (a scenario with few but very active attackers, as discussed below). On the other hand, a strict threshold value of 100% (Figure 1) would support the conclusion that less than a third of the dataset is composed of similar samples (a scenario with more but less active attackers, as discussed below). Therefore, researchers should (i) define a clear threshold for their experiments instead of adopting default settings; and (ii) avoid comparing datasets using distinct thresholds. In the following, we discuss additional criteria for threshold definition.

Finding #2: Distinct threshold values might lead to completely opposite conclusions about the characteristics of the same dataset.

To better understand the impact of multiple threshold values, we took a look at the cluster size distribution in addition to the number of clustered samples. The cluster size distribution allows understanding and inferring multiple characteristics of the observed dataset. One can, for instance, leverage the number of distinct identified malware families as a proxy for the number of attackers distributing malware samples and/or attack sources: the greater the number of distinct clusters, the greater the chance that the associated malware samples were generated by distinct attackers. Alternatively, one could use the sizes of the clusters as a proxy for characterizing how active the attackers are: the greater the number of samples in a cluster, the greater the chance that the attacker remained active for a long time generating new malware variants.

We evaluated how clusters are distributed for the distinct threshold values. Figure 3 illustrates the cluster size distribution when the 50% threshold is applied². We notice that small clusters (e.g., sized 2, with most samples being associated with at most one other sample) are more prevalent than large clusters, thus indicating a variety of malware threats. This characteristic also holds for all the other threshold values we investigated. It was also reported in previous studies [27], thus indicating that it is a typical characteristic of real malware datasets and that it can be observed regardless of the considered threshold value.

Figure 3: Cluster Distribution for the 50% threshold. Smaller clusters are more prevalent than larger ones, thus indicating samples' diversity.

Our experiments revealed that the threshold change affects both the maximum number of samples attributed to a single cluster as well as the size of the largest identified cluster. These factors are associated, since the threshold change “moves” samples back and forth from new clusters to already-existing ones. For instance, our experiments show that clusters sized 2 are the prevalent ones for all threshold values, as shown in Figure 4, even though their number is reduced for tighter threshold values. This happens because the use of tighter thresholds implies that some samples will

² Chosen randomly as a single example due to space constraints.



Figure 4: Clusters sized 2 are prevalent for all threshold values, although reduced in number for increased threshold values.


Figure 5: Maximum Cluster Size. The tighter the threshold value, the fewer large clusters and the more specialized ones are generated.

not be considered similar enough anymore, thus being removed from any cluster. This suggests that laxer thresholds are more suitable for tasks that require maximizing the samples' coverage by the similarity hash function.

Our experiments also show that the size of the largest clusters changes when tighter thresholds are considered, as shown in Figure 5. This happens because tighter thresholds tend to divide single, large clusters into multiple, smaller ones, according to the samples that best match each other. This suggests that tighter thresholds are more suitable for tasks that require samples' clusters to be highly specialized.

More specifically, two classes of analysis tasks require distinct cluster characteristics to operate with. When triaging malware, it is desirable to have the greatest coverage possible to minimize processing costs; when applying remediation procedures, it is desirable to have the most similar clusters possible, so as to apply the correct, specialized vaccines to them. Therefore, we conclude that there is not a single threshold value that applies to all tasks, but that smaller threshold values should be used for malware triaging and higher thresholds for remediation procedures. For both cases, we observed that the J-sdhash function outperformed the ssdeep function for all threshold values. In the following subsections, we dig deeper into these tasks.

Finding #3: Smaller thresholds are more interesting for malware triaging and larger thresholds are more interesting for remediation.

4.2 Detecting New Threats: Malware Detection

A common way to detect malware is to check whether an unknown binary clusters with a set of known malware binaries, according to the considered clustering criteria. We here discuss how similarity hashing functions can assist in this task. For all the experiments presented in the following, the modified sdhash function (J-sdhash) was considered, as it was the one that produced the greatest variety of clusters in the experiments presented in the previous section.

AV Detection Rates Increase using Similarity Information. Antiviruses (AVs) often perform their clustering tasks using deterministic signatures, which is reasonably effective (95% of all samples in the real dataset were detected by at least one AV). Our hypothesis is that similarity hashing functions could also be used for this task and perform better. In the following, we present an evaluation of how AVs' detection rates could be individually increased by considering binaries' structural similarity information. We evaluated the detection rate when looking at the similarity between a known malware sample and a suspicious sample. More specifically, we searched for tuples of samples (A, B) such that either A or B was detected by a given AV solution (but not both) and the similarity between A and B is greater than a predefined threshold. All experiments were performed using the same aforementioned dataset, and all AV solutions available in the VirusTotal service were considered.
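The pair search described above can be sketched as follows. This is a hypothetical illustration, not the paper's code: `detected(s)` stands in for a per-AV VirusTotal verdict lookup, and `similarity(a, b)` for a similarity hash comparison.

```python
def detection_increase_pairs(samples, detected, similarity, threshold):
    """Find pairs (a, b) where exactly one member is detected by a given
    AV and the similarity score exceeds the threshold; the undetected
    member could then be flagged by similarity alone.

    `detected(s)` and `similarity(a, b)` are placeholders for an AV
    verdict lookup and a digest comparison (e.g., ssdeep.compare).
    """
    pairs = []
    for i, a in enumerate(samples):
        for b in samples[i + 1:]:
            # XOR: exactly one of the two samples is detected by the AV.
            if detected(a) != detected(b) and similarity(a, b) > threshold:
                pairs.append((a, b))
    return pairs
```

The number of such pairs, relative to the AV's baseline detections, is the kind of "detection increase" plotted per threshold bucket in Figure 6.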

Figure 6: AV Detection Increase when considering binary similarity, shown for the most affected AVs (Baidu, Comodo, DrWeb, F-Secure, AhnLab, eScan). Similarity effectively increases AV detection.

Figure 6 shows the detection rate increase according to the multiple thresholds for the most affected AVs, i.e., the AVs which would most increase their detection rates if similarity


hashing capabilities were added to them. As expected, the laxer the threshold, the more samples are considered similar to known malware and thus the greater the detection rate increase. We notice that even when a tight threshold of 100% is considered (which requires the known malware and the unknown payload to be 100% compatible, even though not 100% equal), the detection rate keeps increasing, thus showing that there is room for improvements in AV detection strategies and capabilities (such as by using similarity hashing functions).

Finding #4: The use of similarity hashing functions by AVs increases their detection capabilities in practice.

Whereas increasing the detection rate via the application of similarity hashing functions is possible, and even very significant in the presented cases, it is important to highlight that distinct AVs will be affected differently, according to their engines' weaknesses and capabilities. Figure 7 presents a landscape of the detection rate increase for the multiple AVs present in the VirusTotal service. We notice that all AVs were positively affected, even though some by only a small factor.

The case for False Positives. We are not aware of AV solutions using similarity hashing functions in the same way as proposed in this work, although our results suggest that adopting this approach is a promising way of increasing malware detection. This fact raised the concern of whether any limitations were preventing AVs from adopting this type of operation mode. We hypothesized that a plausible explanation could be an increase in the False Positive (FP) rates when similarity scores were considered, i.e., if the laxer thresholds were resulting in goodware samples being considered similar to malware samples. To evaluate that possibility, we repeated the same experiments described above, but now considering the two datasets of goodware samples previously described in Section 3.

Table 2 shows the FP rate (regarding detection by any of VirusTotal's AVs) for distinct threshold values in the two datasets. Around one out of four samples results in a FP when using the minimum supported similarity threshold (threshold > 0, 1%). Increasing the threshold considerably decreases the FP rate: a threshold of 50% decreases the FP rate to 1%. If we consider that a perfect compatibility score (100% threshold) between known malware and suspicious payloads is required to classify a payload as malicious, we observe an almost negligible FP rate while still allowing AV detection rates to increase, as previously shown. Therefore, our findings show that FPs are not a limitation for the application of similarity scores to increase AVs' malware detection.

Finding #5: A threshold of 50% is the best trade-off between TPs and FNs when using a similarity hashing function to assist AV detections.

Similarity Hashing Functions Performance. The major advantage of the signature-based approaches adopted by many AVs is that they are very fast. Similarity hashing

functions are more complex than traditional hash functions (J-sdhash even requires the application of a traditional hash function as part of its operation; see Section 2), thus imposing a greater processing load. We, therefore, evaluated whether the greater performance impact could be a limitation for their application in real-world scenarios. For such, we measured the throughput of distinct similarity and traditional hash functions when fed with real malware samples. All experiments were performed on a machine with 64 GB of RAM and an Intel Xeon E5-2620 2.00 GHz CPU, running Ubuntu 18.04 LTS.

Table 3 and Table 4 show the average throughput (hashes per second) while hashing the entire malware dataset described in Section 3 using distinct hash functions, considering, respectively, the CPU time (actual processing time) and the wall time (total time spent). We notice that in all cases the throughput reported using CPU time is an order of magnitude greater than that using wall time, thus showing that the time taken to load the hash database into memory is not negligible (which is key for an AV).
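The two throughput views can be reproduced with Python's standard timers: `time.process_time()` counts CPU time only, while `time.perf_counter()` also counts time the process spends blocked (e.g., on I/O). A minimal sketch, with `hashlib.sha256` standing in for any of the compared hash functions and dummy in-memory payloads instead of real samples:

```python
import hashlib
import time

def throughput(hash_fn, payloads):
    """Return (cpu_hashes_per_s, wall_hashes_per_s) for hashing payloads."""
    cpu0, wall0 = time.process_time(), time.perf_counter()
    for data in payloads:
        hash_fn(data).hexdigest()
    cpu = time.process_time() - cpu0    # CPU time actually spent computing
    wall = time.perf_counter() - wall0  # total elapsed time, I/O included
    n = len(payloads)
    return n / cpu, n / wall

# Example: hash 1000 dummy 64 KiB "samples" with SHA-256.
payloads = [bytes([i % 256]) * 65536 for i in range(1000)]
cpu_tput, wall_tput = throughput(hashlib.sha256, payloads)
```

On disk-resident samples and signature databases, the wall-time figure would additionally absorb the load time discussed above, which is why the two tables diverge.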

Finding #6: Similarity hashing-based AVs should not neglect the cost of loading the signature database into memory to operate, even though they might reduce the database size by condensing similar files' signatures.

We observed that all the traditional hash functions presented greater throughput than the similarity hash functions, which shows that the distinct performance rates are due to the distinct nature of the functions rather than particular implementation decisions. We also observed that the previously shown increase in J-sdhash's similarity results in comparison to ssdeep's is obtained at the cost of a decrease in its performance. This result suggests that whereas the J-sdhash function should be selected in unconstrained scenarios (e.g., offline analyses) due to its higher classification capabilities (shown in the previous subsection), the ssdeep function is a better option for performance-limited scenarios (e.g., real-time operation), since it provides a ≈ 4 times greater throughput without losing ≈ 4 times the classification power.

Finding #7: Whereas J-sdhash is a better choice for unconstrained scenarios, ssdeep is a better choice for performance-limited scenarios.

Similarity hashing performance and AV databases. In addition to performance penalties when hashing individual files, similarity hashing functions also do not scale well naturally (i.e., without a proper approach/protocol), which might be prohibitive for an AV company trying to establish a database of known malware hashes. Figure 8 shows the time spent to naively match a file against the totality of a growing database, up to the limit of the entire malware dataset. This simulates the addition of newly-discovered samples to an AV company database, as AV companies often perform in traditional hash-based procedures. This kind of growth might seem prohibitive at first glance.

The viability of leveraging similarity hashing functions in real-world cases only becomes apparent when we consider that



Figure 7: Overall AV Comparison. Distinct AVs increase their detection rates differently.

Table 2: False Positive Analysis. Few goodware samples are mistakenly flagged as malicious.

                Win/Sys32                    Repositories
Threshold   0+%      50+%    100%       0+%      50+%    100%
FPs         27.30%   1.00%   0.05%      25.00%   1.00%   0.05%

Table 3: Hash Functions Throughput Comparison. Hashes per second considering CPU processing time.

Function   Throughput   Function   Throughput
J-sdhash   948          SHA256     5651
ssdeep     2475         MD5        9777
SHA512     5146         SHA1       11110

Table 4: Hash Functions Throughput Comparison. Hashes per second considering wall time.

Function   Throughput   Function   Throughput
J-sdhash   82           SHA256     613
ssdeep     369          MD5        727
SHA512     413          SHA1       787


Figure 8: Matching Time. The matching time of the similarity hashing function grows exponentially.

they do not need to perform the same number of comparisons as traditional hash functions to produce the same result. Whereas typical hash functions would require comparing against all the thousands of files present in the dataset to check whether any of them is malicious, a similarity hash function would require checking only the centroids of the corresponding clusters, thus reducing the problem space by an order of magnitude. This is possible because any sample in the dataset is similar at least to its cluster centroid by construction. Therefore, for a similarity hash-based AV to be practical in terms of performance, a good clustering approach at the server side is required.
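The centroid shortcut can be sketched in a few lines; again, `similarity` is a hypothetical placeholder for a real comparison routine, and `centroids` holds one representative digest per known-malware cluster:

```python
def is_known_malicious(sample, centroids, similarity, threshold):
    """Detection via centroid comparison only.

    Every cluster member is similar to its centroid by construction, so
    one comparison per cluster suffices instead of one per known sample.
    `similarity` is a placeholder for, e.g., ssdeep.compare.
    """
    return any(similarity(sample, c) >= threshold for c in centroids)
```

The scan cost thus grows with the number of clusters rather than with the number of known samples, which is what makes the database sizes in Figure 8 tractable in practice.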

There are distinct approaches to accelerate the digest search for large databases. These strategies might cover databases in up to O(n · log(n)) or O(n) comparisons, depending on the chosen strategy and the problem constraints (more information about these strategies can be found in [48]). However, a common drawback of all these approaches is that they consider an already-built, fixed-size database, which is not the case for AV companies, which build their databases as new samples are discovered.

To better understand the impact of centroid selection on AV performance, we evaluated how fast a reference dataset (the AV database) can be covered in practice, again simulating an AV company creating a digest database, adding more samples to it, and querying it for detection. For such, we considered a DBSCAN-like algorithm that takes a new sample (randomly chosen for the sake of the experiment) and assigns it to an existing cluster in the AV company's database if the score is above the predefined threshold, or creates a new cluster otherwise. This type of approach has been previously used along with other similarity hash functions [32, 71] and we are here extending it to operate along with J-sdhash. In a real operation, this procedure is indefinitely repeated for every new incoming sample. In our experiment, the procedure repeats until all clusters (with regard to the non-probabilistic ground-truth) are formed. We considered 100 repetitions of this algorithm while initializing it in random states.
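The incremental procedure can be sketched as follows. This is a simplified, hypothetical rendering of the DBSCAN-like algorithm described above: each cluster is represented here by its first member, and `similarity` is a placeholder for the actual J-sdhash comparison.

```python
import random

def incremental_cluster(samples, similarity, threshold, seed=None):
    """DBSCAN-like incremental clustering sketch (simplified).

    Samples arrive in random order; each one joins the first cluster whose
    representative (its founding member) scores at or above the threshold,
    or founds a new cluster. `similarity` is a placeholder comparison.
    """
    order = list(samples)
    random.Random(seed).shuffle(order)

    clusters = []  # each entry is a list; entry[0] is the representative
    for s in order:
        for members in clusters:
            if similarity(s, members[0]) >= threshold:
                members.append(s)
                break
        else:  # no representative matched: s founds a new cluster
            clusters.append([s])
    return clusters
```

The coverage reported in Figure 9 would then correspond to `len(clusters) / len(samples)`, averaged over repeated runs with different random orders.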



Figure 9: Clustering: Number of Samples.

The average results presented in Figure 9 show that the performance gains are greater the laxer the threshold. When a 50% threshold value is considered, matching against ≈ a third of the dataset is enough to cover all samples. In other words, if an AV company had built its database using ≈ a third of the samples we considered in the experiment, it would be able to detect the remaining two-thirds due to their similarity with the ≈ one-third of centroids. When the threshold value is increased to its limit (100%), around 70% of the samples are required to achieve full coverage. In other words, the AV company would need access to 70% of the samples to detect the remaining 30% of the samples. This result is explained by the fact that when a strict match is considered, the result is bounded by the great number of clusters sized two, as shown in the previous sections, thus requiring more checks to cover all clusters. Therefore, the larger the reference clusters, the better the performance of the AV scanning scheme.

Finding #8: A good threshold selection allows an AV database that covers the full reference dataset to be developed with only a third of the reference samples.

4.3 Response & Remediation: Malware Family Classification

In addition to detecting malware, AVs also try to classify them into families. This is essential to label the threats and allows incident response and remediation procedures (in the ideal scenario). As for malware detection, previously presented, we also hypothesized that similarity hashing functions could assist AVs in this task. In the following, we present our findings.

AV Ground-truth is hard. The first step to evaluate whether similarity hashing functions are good for malware family clustering is to adopt a ground-truth, i.e., to identify which families malware samples actually belong to. In our experiments, we used the most adopted strategy in the literature: reliance on AV labels. We consider this choice reasonable since, although there are alternatives for malware labeling (see Section 7), AVs are often the only immediately

available tool for most analysts. Notice that having to rely on a third party to label similar samples is the case for most real-world datasets (including ours), where there is no previous information about the samples' similarity.

To evaluate the AVs' agreement with the similarity hash functions' results, we considered two cases: individual AVs' classification capabilities, and the classification capabilities of an AV committee. For such, we modeled two hypothetical AVs from the labels assigned by VirusTotal's AV engines. The first AV we modeled is an “ideal” AV that confirms that two samples are really similar whenever two labels assigned by any VirusTotal AVs agree (even if the labels are provided by distinct AVs); the second AV we modeled considers the case of an “average” AV that confirms that the samples are similar whenever two labels assigned by the same VirusTotal AV agree (for the same AV).
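The two hypothetical AVs can be expressed as predicates over the per-engine labels reported for a sample pair. In this sketch, `labels_a` and `labels_b` are assumed to be `{engine: family_label}` dicts built from VirusTotal reports; this data structure is our assumption, not the paper's.

```python
def ideal_av_agrees(labels_a, labels_b):
    """'Ideal' AV: the pair is confirmed similar if ANY label assigned to
    sample A matches ANY label assigned to sample B, across engines."""
    return bool(set(labels_a.values()) & set(labels_b.values()))

def average_av_agrees(labels_a, labels_b):
    """'Average' AV: the pair is confirmed similar only if SOME single
    engine assigned the same label to both samples."""
    common = set(labels_a) & set(labels_b)
    return any(labels_a[e] == labels_b[e] for e in common)
```

By construction, every pair the average AV confirms is also confirmed by the ideal AV, so the ideal curve upper-bounds the average one, as observed in Figure 10.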


Figure 10: AV ground-truth. Samples reported similar by an “ideal” and an “average” AV.

Our first finding was that relying on individual AVs for establishing malware family ground-truth is a challenging task, since the labels hardly ever agree. Figure 10 shows the agreement between the samples identified as similar by J-sdhash and the two hypothetical AVs that we modeled. The fact that most of the samples considered similar by the similarity hash functions are also reported similar by the “ideal” AV confirms that the similarity hash function is able to effectively detect similar samples. However, the low “average” score indicates that few AVs are able to individually detect similar samples, which makes their use as ground-truth labelers hard. In other words, the difference between the “ideal” and the “average” AVs shows that the distinct approaches adopted by each AV solution are effective at detecting similar malware samples when combined, but not when applied individually.

Finding #9: An AV committee is able to confirm the similarity results presented by the functions, but individual AVs are not.

A significant factor limiting the performance of the average AV is that AVs often generate “generic” labels. When this happens, one has to either try to cluster the generic sample


anyway or generate a new cluster for the unknown sample. In both cases, the correctness of the clustering process is affected, either due to False Positives or False Negatives. Using a committee mitigates this problem by diluting the responsibility of the decision procedure, since it is very unlikely that all AVs will present generic labels at the same time. Thus, relying on the AVs' diversity mitigates the generic labeling problem.

The average result is highlighted when we observe the differences between individual AVs when clustering the samples. Figure 11 shows the AVs that clustered the most malware samples into families and the number of clusters identified by each of these solutions. We notice that each solution identified a distinct number of clusters, thus showing that they rely on distinct criteria for the samples' classification.

Figure 11: Number of clusters generated by different AV solutions (Tencent, Antiy, Avast, MW, Bitdefender, Emsis, Arcabit, AdAware). Each AV solution generates a distinct number of clusters.

Figure 12 shows the cluster size distribution for the AV that clustered the most samples. The observed distribution is clearly different from the distributions obtained using the similarity hash functions (shown in Section 4.1, with a huge prevalence of clusters sized 2 and without a thousand-sized cluster), even when considering the multiple evaluated threshold values. This result also suggests that the AV relies on a method for malware clustering distinct from the use of a similarity hash function.

Finding #10: The criteria used by AVs to classify samples as similar are completely different from those used by the similarity hashing functions.

The Root of AV labels. Given the identified heterogeneous results, we investigated which types of clustering mechanisms were used by the AV solutions. Table 5 shows the most popular labels assigned to the largest clusters. We discovered that AVs have been labeling samples according to heuristics, which flag samples as similar when they are all matched by the same rule, regardless of their structural similarity. We can observe, for instance, that many samples are classified in the same “AutoIt” family, which means that all these malware samples were created using this same framework. This label, however, does not


Figure 12: Cluster histogram of the Tencent AV solution. The distribution is clearly different from the ones generated from similarity hashing functions.

provide any information about the structural similarity between these samples. Other samples were grouped in the “Themida” cluster, which means that they were all packed using this same solution. Unfortunately, this label cannot say much about the similarity or behavior of the embedded payloads.

Table 5: Tencent Cluster's Labels. The AV clusters samples based on heuristics and not on binary similarity.

Cluster   Samples (#)   Cluster   Samples (#)
Delf      437           AutoIt    385
Proxy     12            Heur      10
Themida   4             Rogue     2

Finding #11: AVs flag samples as similar when they present the same file packaging and/or when they are packed with the same solution, regardless of their content.

The Impact of Packing. The previous results show that AV labels are based on binary packing. This might significantly affect detection and labeling, since malware often uses packers to hide their payloads from analysts and analysis tools. Given the potential impact of packing on malware classification, we decided to investigate what happens when similarity hashing functions are applied to packed samples. Ideally, we would like to check all packed samples, but there is no automatic unpacking solution for all types of packers identified in the samples of our dataset. Thus, we limited our analysis to UPX [53], a popular open-source packer, since it packed 50% of all packed samples in our dataset.

Figure 13 shows the comparison of the relative number of unpacked and UPX-packed malware samples clustered by the sdhash function. We first notice that the unpacked versions are more clustered than the packed versions, which shows that the packing tool can hide some of the samples' common characteristics. The tighter the selected threshold, the greater the relative difference in the number of packed and



Figure 13: The impact of UPX packing. Packing reduces samples' similarity scores.

unpacked samples, as the impact of the few common characteristics identified by the similarity hashing functions is less weighted. Nevertheless, packing similar samples does not completely eliminate the capability of hashing functions to cluster them. This happens because similar samples packed with the same solution result in similarly packed binaries, which are then clustered. The “identical” curve shows the matching rate of the same samples in their packed and unpacked versions (discarding non-original samples similar to them). We notice that the packer significantly changes the binary structure, such that most packed samples are very distinct from their original versions. This allows us to conclude that the samples that were clustered by the similarity hashing functions in their packed versions are not similar to their own unpacked versions, but to multiple packed variants of themselves distributed by the attackers.

Finding #12: Malware packing with UPX significantly reduces clustering capabilities but does not completely eliminate its application to same-packer variants.

[Figure 14 diagram: Unpacked 1 is similar to Unpacked 2 and Packed 1 is similar to Packed 2 (each unpacked sample is linked to its packed version by UPX packing), while all cross-comparisons between packed and unpacked samples are not similar.]

Figure 14: Average Packed Sample's Similarity Scheme. Cross-comparisons should be avoided.

Figure 14 summarizes the similarity scheme we identified for the majority of the inspected samples. Notice that this result does not imply that all packers will present this same characteristic when generating malware variants. Packers that perform distinct transformations for each run (e.g., crypters that use distinct keys) might generate distinct variants from the same original code.

AVs Agreement. A drawback resulting from the differences between our approach of similarity hashing clustering for family identification and the current methods leveraged by AVs to label samples is that these two outcomes cannot be directly compared. If they are, a low agreement between the samples considered similar by the hash function and by the AV is observed. In this work, we define the label agreement of a given cluster as the ratio of the samples in the cluster that present the same label. For instance, if all samples in a given cluster defined according to the similarity hash function scores present the same AV label, their agreement is 100%. If 2 out of 3 samples in the same cluster present the same AV label, their agreement is 66%. If all samples present distinct AV labels, their agreement is zero. Ideally, we would like the similarity hashing functions and the AVs to present perfect cluster and label agreement rates (100% agreement for the labels of 100% of all clusters).
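The agreement metric can be sketched as the relative frequency of a cluster's most common label. One detail is our interpretation, not the paper's: to match the all-distinct-labels example (agreement zero), a label carried by only one sample is not counted as agreement.

```python
from collections import Counter

def label_agreement(labels):
    """Fraction of a cluster's samples carrying its most common AV label.

    Returns 0.0 when no two samples share a label, matching the worked
    examples in the text (100%, 66%, 0%). The singleton-label handling is
    an assumption made to reproduce the all-distinct case.
    """
    top = Counter(labels).most_common(1)[0][1] if labels else 0
    return top / len(labels) if top > 1 else 0.0
```

A cluster counts toward the "perfect agreement" fraction plotted per threshold when this ratio equals 1.0.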


Figure 15: Agreement between the clusters generated by the AV and by J-sdhash. We notice that most samples clustered by J-sdhash present divergent AV labels.

Figure 15 shows the fraction of clusters that presented a perfect label agreement rate for distinct thresholds. For a strict threshold of 100%, if we consider all samples in the dataset, the average agreement ratio is 50%, i.e., 50% of all clusters agree 100% on their samples' labels. Whereas this is not a very high value, it is still surprising, since it seems to contradict the previously presented difference in the cluster distribution for distinct thresholds (Figure 3). We then realized that this result was biased by the large number of small clusters with two samples only (Figure 4), which tend to agree more due to their low intra-cluster diversity. The agreement ratio of clusters with three or more samples is only around 20%, thus showing that using AV labels as ground-truth for cluster labeling is a challenging task. This same phenomenon was observed when we varied the threshold.


Finding #13: The direct application of AV labels over clusters generated by similarity hashing functions leads to low intra-cluster label agreement rates.

On the positive side, the tighter the threshold, the greater the label agreement, showing that the clusters tend to be more specific, and thus more similar, in these cases. This confirms our Finding #2 about tighter thresholds being better suited for remediation procedures than for triaging.
Selecting individual AVs. Despite the presented limitations, we cannot completely discard individual AVs as labeling tools, for two reasons: (i) they are still the only solution available to label unknown payloads in many scenarios; and (ii) many experiments can only be validated by using a consistent evaluation tool. For instance, a similarity hashing function's classification improvements can only be properly validated by using the same AV as ground-truth; otherwise, one cannot ensure that a better classification result is due to the proposed function's improvement rather than to the file matching a distinct AV that is part of a committee.

Faced with this scenario, it is essential to develop criteria for selecting an AV for experiment development. Since our ultimate goal is to evaluate the similarity hashing function, we propose that the AV whose results most resemble the similarity hashing function's should be selected. Thus, we propose considering the usual metrics of accuracy, precision, recall, and F1 score for comparing their outputs. In our context, a TP occurs when the AV assigns the same labels to two samples that the similarity hashing function considered similar. A TN occurs when the AV assigns distinct labels to two samples that the similarity hashing function considered non-similar. A FP occurs when the AV assigns the same labels to two samples that the similarity hashing function considered non-similar. A FN occurs when the AV assigns distinct labels to two samples that the similarity hashing function considered similar.
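A minimal sketch of this pairwise evaluation follows; the `similar` and `same_label` predicates are hypothetical stand-ins for the similarity hashing function's verdict and the AV's label comparison, respectively, and follow the TP/TN/FP/FN definitions given above:

```python
from itertools import combinations

def pairwise_confusion(samples, similar, same_label):
    """Count TP/TN/FP/FN over all sample pairs, using the definitions
    in the text: the similarity hashing function's verdict is compared
    against whether the AV assigned the same label to the pair."""
    tp = tn = fp = fn = 0
    for a, b in combinations(samples, 2):
        is_similar = similar(a, b)        # similarity hash says "similar"
        labels_match = same_label(a, b)   # AV assigns the same label
        if is_similar and labels_match:
            tp += 1
        elif not is_similar and not labels_match:
            tn += 1
        elif not is_similar and labels_match:
            fp += 1
        else:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return tp, tn, fp, fn, accuracy, precision, recall, f1

# Toy example: s1 and s2 are considered similar by the hash function
# and share an AV family label; s3 stands apart on both criteria.
samples = [("s1", "fam1"), ("s2", "fam1"), ("s3", "fam2")]
hash_similar = lambda a, b: {a[0], b[0]} == {"s1", "s2"}
av_same_label = lambda a, b: a[1] == b[1]
tp, tn, fp, fn, acc, prec, rec, f1 = pairwise_confusion(samples, hash_similar, av_same_label)
```

Note that the naming is intentionally inverted with respect to the conventional setting: here the AV labels are the quantity under evaluation, judged against the similarity hashing function's pairs.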

The first hypothesis that comes to mind for selecting an AV is that the AV that detects the most samples is also the one that best clusters them. However, this might not hold in all cases. In our experiments, AVG was the AV that detected the most samples, but using its labels for clustering purposes resulted in poor accuracy rates, as shown in Figure 16.

Finding #14: The AV that detects the most samples is not necessarily the one that best clusters them.

To identify the AV that actually best fits the similarity hashing function, we repeated the experiment to compare the metrics for all of VirusTotal's AVs. Figures 17, 18, 19, and 20 show, respectively, the accuracy, precision, recall, and F1-scores for the AVs that performed best in our evaluations. The NANO AV was the one that achieved the best scores for all metrics.

We notice that the accuracy values grow in the 0-30% range, which might be due to TP or TN increases. However, AV labels seem to impose an upper bound after the 30% value, which prevents the metric from growing significantly. Precision values confirm that TP also increases in this range, which is explained by the similarity hashing function generating clusters of more similar samples as the threshold tightens, as previously presented. However, this same characteristic makes recall values decrease as the threshold increases, because the similarity hashing function stops clustering together samples whose AV labels state they are similar. As previously shown, this behavior is due to the broad heuristic and generic labels assigned by the AVs to cover samples packed in the same format regardless of their content. The F1-score metric weights these findings and shows that the best results are achieved for intermediate threshold values. We notice that the F1-score for the 100% threshold is low because, whereas it correctly labels very similar samples, it leaves all the generic, heuristically-labeled samples out of the clusters.

[Figure: AVG's clustering accuracy (%) vs. similarity score.]

Figure 16: AVG Clustering Accuracy. Despite presenting the greatest detection rates among all AVs, AVG was not the AV that best clustered samples.

In other words, we observe that, despite all the challenges, some AVs can present a reasonable trade-off between TPs and FPs, depending on the considered threshold values. The balance is completely tied to the way the AV labels the samples. For instance, more conservative AVs will cluster fewer samples as similar than similarity hash functions do, thus raising the number of false negatives and limiting the recall rate. Thus, in practice, the AV's drawbacks when labeling samples turn into an upper bound on the performance rates.

Finding #15: Choosing the AV that best fits the similarity hashing function's results is the best choice for an individual ground-truth, but this might impose an upper-bound limitation on the evaluation metrics.

The Root of Similarity. Once we identified the nature of the labels assigned by the AVs and the AV solution that best fits the similarity hashing function's behavior, we started investigating the nature of the similarity matches at the bytewise level. This task requires us to split the files into parts to identify which of them contributes most to the similarity score. A natural choice for such splitting is to follow the natural organization of executable binary files into sections. In this study, we considered Windows malware samples, which


[Figures: AV clustering accuracy, precision, recall, and F1 score (%) vs. similarity score for the best-performing AVs (AdAware, Avira, Cyren, NANO, Zillya).]

Figure 17: Whole Binary Accuracy

Figure 18: Whole Binary Precision

Figure 19: Whole Binary Recall

Figure 20: Whole Binary F1 Score


are distributed via the Portable Executable (PE) [58] file format. The PE format is very flexible and does not define fixed section names and permissions, such that each sample might present its own combination. To triage all possibilities, we started by investigating all sections presented by the binaries, as summarized in Table 6.

We discovered that the diversity in binary sections is due to the attackers' implementation choices. Thus, each section represents a distinct implementation decision and has a distinct goal. For instance, whereas "text" sections often implement malicious actions by themselves (instructions), data sections are often used to carry payloads injected by third parties. Similarly, sections such as "UPX" are mainly used to store the packing decompression stub. As each section's goals are distinct, the samples' matching rates when considering each one of them are different as well, even when using the same J-sdhash tool, as shown in Figure 21. As expected, each implementation choice led to a distinct matching rate. Even distinct packers' sections, such as UPX and Themida, resulted in distinct matching rates.

Finding #16: Each binary section implements a distinct part of the malicious behavior and thus presents a distinct similarity score.

Considering this finding, it is straightforward to hypothesize that separating the distinct sections according to their goals (e.g., the instructions that implement a malicious behavior and the malicious payloads stored as data) might lead to greater similarity metrics. However, there is still a challenge to be considered: whereas the differences in the matching rates of the distinctly named sections are suggestive, they can only be confirmed by understanding the attributes of each section. As the PE format does not specify any strict rule for section naming, a section named "data" by a malware sample could in fact store instructions. Therefore, to perform a more precise check, we chose to parse the section headers of all PE files and dump their executable and non-executable section bytes into separate files. We then repeated the previously presented clustering experiments with these new files.
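A sketch of this flag-based splitting is shown below. It relies on the documented IMAGE_SCN_MEM_EXECUTE characteristics bit of the PE section header rather than on section names; the `(name, characteristics, raw_bytes)` tuples are a hypothetical stand-in for the output of a real PE parser such as pefile:

```python
# IMAGE_SCN_MEM_EXECUTE: documented PE section-characteristics flag.
IMAGE_SCN_MEM_EXECUTE = 0x20000000

def split_sections(sections):
    """Split PE sections into executable and non-executable byte streams
    based on section header flags rather than on section names.
    `sections` is a list of (name, characteristics, raw_bytes) tuples."""
    executable, non_executable = b"", b""
    for name, characteristics, data in sections:
        if characteristics & IMAGE_SCN_MEM_EXECUTE:
            executable += data
        else:
            non_executable += data
    return executable, non_executable

# A section with a misleading name (".data") but the executable flag set
# is correctly treated as code:
code, data = split_sections([
    (".text", 0x60000020, b"\x55\x8b\xec"),  # code section, executable
    (".data", 0x20000040, b"payload"),       # executable flag set despite the name
    (".rsrc", 0x40000040, b"resources"),     # read-only data
])
```

The two resulting byte streams can then be hashed independently, exactly as whole files were hashed before.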

Figures 22, 23, 24, and 25 present, respectively, the accuracy, precision, recall, and F1-score values obtained in the experiments. All tests were performed considering the NANO AV because it was the AV that presented the best results in the aforementioned AV tests. We discovered that this strategy resulted in an overall increase in all considered metrics. We highlight that such improvement is observed in comparison with AV-assigned labels, which naturally complicates the task, as shown in previous experiments. Therefore, we conclude that this change not only clustered more samples, but actually helped AVs to recognize them as similar.

More specifically, we observe that the behavior of the accuracy metric is very similar to what is observed in the full-binary experiment. It starts growing, with some points achieving a 10% higher score than was achieved for the full binary. However, it is still limited by the AV labels, which prevent the metric from achieving values higher than 70%. The precision metric results confirm that TP increased when separating data and instructions from the main binary. This also caused the recall metric to be less affected, achieving values more than 10% higher at some specific points, especially for higher thresholds, when the samples are expected to be really more similar. These facts resulted in overall slightly better F1 scores.

Finding #17: Separating instructions from data sections leads to better similarity metric scores.

We notice that both the instruction-only and the data-only sections outperformed the original baseline, thus showing that mixing instructions and data makes similarity detection harder. We believe that in the future, when similarity hashing function-based approaches become mainstream, this fact might be purposely exploited by attackers to lower the similarity score of their released malware variants. We consider this a plausible hypothesis since many attackers already mix code and data to fool disassemblers [15].
Increasing Matching Rates. The previous results have shown that instruction sections are the most discriminant binary components for family clustering procedures using similarity hashing functions. This is reasonable, as the instruction sequences present in the binaries are responsible for causing harm and thus for characterizing a binary as malicious. Therefore, instructions are a natural focus of any strategy to enhance the classification performance of the functions.

In the forensics context, researchers discovered that common blocks might influence the similarity assessment process of two objects. Many pieces of common data are found to be present in many files of the same and of different types. These pieces are related to application-generated content, such as header/footer information, color palettes, font specifications, etc. [47]. For instance, Moia et al. [46] show that removing such blocks from the similarity assessment may increase the similarity detection rate in some cases and even filter out many irrelevant matches. Thus, it is plausible to hypothesize that this same reasoning might apply to binary similarity.

We adapted the concept of common blocks to common instructions, i.e., instead of considering a set of bytes representative of a data pattern, we considered them representative of an instruction and its argument. The rationale behind this choice is that by removing common instructions that are present in all (or most) binaries, and thus do not differentiate them, we allow the similarity hashing function to focus on the file contents that actually make the binaries similar or distinct. In other words, we aim to make similarity scores more representative, since a similarity score considering instructions present in all binaries is naturally lifted by this shared content. A drawback of the common blocks method is that the common block pattern might be removed from any file region, thus potentially removing more than it was designed to remove (e.g., removing actual data in addition to file headers). This problem is exacerbated when moving to instructions, since the same instruction at distinct file regions might imply distinct high-level behaviors. However, as in the common blocks' case, we expect the most common instructions to be statistically more significant in


Table 6: Binary sections of the malware samples in our dataset. Each sample presents its own set of executable binary sections.

Section   Prevalence    Section   Prevalence    Section   Prevalence    Section   Prevalence
data      97%           .bss      97%           .idata    92%           .reloc    90%
bss       85%           ..tls     73%           itext     72%           .rdata    58%
.rsrc     51%           UPX1      34%           UPX0      34%           .data     34%
.code     29%           .edata    20%           .text     18%           Others    17%

[Figure: fraction of samples (%) vs. similarity score (%), per section: vmp0, vmp1, rsrc, text, data, rdata.]

Figure 21: Similarity Matches per Section. Each section presents a distinct rate.

[Figures: AV clustering accuracy, precision, recall, and F1 score (%) vs. similarity score for the whole binary (All), text sections, and data sections.]

Figure 22: Binary Sections Accuracy

Figure 23: Binary Sections Precision

Figure 24: Binary Sections Recall

Figure 25: Binary Sections F1 Score


non-essential code regions, such that we end up observingmetrics improvements.

Considering the proposed method adaptation, we repeated the presented experiments, now considering each binary instruction as a possible common block. In total, we considered 377 million unique combinations of instruction opcodes and arguments present in the disassembly of the executable sections of the malware binaries. Table 7 exemplifies the top 10 instructions most found in our dataset that might be removed to improve malware detection rates.

Table 7: Common Blocks. Top 10 most removed instructions.

Instruction   Prevalence    Instruction   Prevalence
push <reg>    95.34%        nop           89.84%
pop <reg>     94.77%        int3          89.24%
inc <reg>     94.36%        cltd          88.43%
dec <reg>     94.19%        leave         83.49%
ret           94.18%        in            83.38%

We notice that the most prevalent instructions, individually or in combination, really reflect common binary constructions. More specifically, we notice their relation to function prologues (push) and epilogues (pop+ret). By removing these instructions that surround binary functions, we allow the similarity hashing function to focus on the similarity of the actual functions' content.
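The removal strategy can be sketched as follows. The exact semantics of the removal ratio (over the instruction vocabulary versus over instruction occurrences) is an implementation choice we leave open; the hypothetical sketch below ranks normalized instruction strings by their prevalence across binaries and drops the top fraction of the vocabulary:

```python
from collections import Counter

def common_instructions(corpus, remove_ratio):
    """Identify the most common instructions across a corpus of
    disassembled binaries. Each binary is a list of normalized
    'opcode operands' strings (e.g., 'push <reg>'). Returns the set of
    instructions to drop so that roughly `remove_ratio` of the distinct
    instruction vocabulary, ranked by prevalence, is removed."""
    prevalence = Counter()
    for listing in corpus:
        prevalence.update(set(listing))  # count presence once per binary
    ranked = [inst for inst, _ in prevalence.most_common()]
    k = int(len(ranked) * remove_ratio)
    return set(ranked[:k])

def filter_listing(listing, blocked):
    """Drop the common instructions before feeding the remaining
    content to the similarity hashing function."""
    return [inst for inst in listing if inst not in blocked]

# Toy corpus: prologue/epilogue instructions appear in every binary.
corpus = [
    ["push <reg>", "mov eax, 1", "ret"],
    ["push <reg>", "xor eax, eax", "ret"],
    ["push <reg>", "ret", "call sub_401000"],
]
blocked = common_instructions(corpus, remove_ratio=0.4)
```

The surviving instructions (or their raw bytes) are then hashed with ssdeep/sdhash exactly as before.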

To evaluate our hypothesis in practice, we varied the ratio of removed instructions (in % of the most common instructions) to identify the binary portion that most discriminates the binaries. Figures 26, 27, 28, and 29 present, respectively, the accuracy, precision, recall, and F1-score values obtained in the experiment while keeping 70%, 80%, 90%, and 100% of the binary instructions (thus removing from 0% to 30% of the most common instructions).

This experiment sheds light on two important aspects of similarity hashing functions. First, we notice that, whereas removing distinct amounts of common instructions did not improve the overall metrics, it also did not decrease them. This shows that approximate matching might also be successful in identifying the similarity of corrupted data or code excerpts captured from partial memory dumps, which is key for analyzing malware. Further, we notice that the 70% threshold (removing the 30% most common instructions and keeping the remaining 70%) leads to an overall performance increase. We again highlight that this result is observed even considering the limits imposed by the AV labeling procedures, which shows that this strategy not only clustered more samples but also helped the similarity hashing functions and the AV to agree on their clusters.

The behavior of all metrics in this experiment was very similar to that in the previous one, but now achieving only slightly better results, which is expected, since the result had already been significantly improved by the adoption of the previous techniques (e.g., separating instructions from other sections). Even so, precision was the metric that increased the most, which demonstrates that this technique still has the potential to increase the detection of similar samples in agreement with AV labels.

The obtained result is explained by the fact that, by feeding the similarity hash with less data (70% of the total), the bytes are distributed differently within the similarity hashing function's clustering buckets (see Section 2). This new bucket organization leads to distinct digest values, which lead to distinct similarity scores, and thus to distinct clusters.

Finding #18: Removing common instructions forces a reorganization of the similarity hashing function's internal buckets, which leads to an increase in similarity metrics.

5 Guidelines for the Application of Similarity Hashing Functions

With this work’s development, we discovered that althoughsimilarity hashing functions are already popular solutionsin many contexts, there are still no guidelines for theirapplication, with distinct researchers and solutions takingad-hoc decisions on parameter selection. Our experimentspresented results provide interesting insights about the sce-narios in which distinct parameters and approaches lead tobetter usage of such functions. Therefore, aiming to helpto bridge this guidelines gap, we propose some rules to befollowed when designing and performing experiments withsimilarity hashing functions.

1. Be aware of the existing trade-offs when choosing asimilarity hashing function.

(a) J-sdhash clusters more samples than ssdeep, thus being more suitable for tasks that require greater sample coverage.

(b) Ssdeep is faster than J-sdhash while presenting slightly lower similarity scores, thus being more suitable for performance-constrained environments.

i. The cost of loading a database for matching is non-negligible for either of the two functions.

2. Always explicitly choose a threshold value and do not simply rely on tools' default parameters and their heterogeneous meanings. A threshold must reflect what and how much the researchers envision as similar or not.

(a) Be aware that distinct threshold values might lead to completely distinct dataset characterizations, thus leading you to opposite conclusions as they vary.

(b) Distinct threshold values perform better for distinct tasks, with smaller threshold values performing better for malware triaging and greater threshold values performing better for malware remediation and response.

3. Do not mix malware detection and malware family clustering, since they are distinct tasks with distinct drawbacks.


[Figures: AV clustering accuracy, precision, recall, and F1 score (%) vs. similarity score, comparing all instructions (All Inst) with keeping 90%, 80%, 70%, 60%, and 50% of the binary instructions.]

Figure 26: Accuracy Removing Blocks

Figure 27: Precision Removing Blocks

Figure 28: Recall Removing Blocks

Figure 29: F1 Removing Blocks


4. When detecting malware samples:

(a) Considering samples' similarity increases the AV's detection, since samples not detected by the AV can now be detected by their similarity with previously detected samples.

(b) Intermediate threshold values (e.g., 50%) present the best trade-off between malware detection and false positives.

(c) A good threshold choice (e.g., 50%) allows a similarity-based AV to detect a whole dataset without adding all of its samples to a database.

5. When classifying malware samples into families:

(a) Notice that a committee of AVs is a better ground-truth than individual AVs, since it reduces the heterogeneity of the AVs' labels.

(b) Individual AVs should be avoided because distinct AVs lead to distinct results, since they rely on distinct classification criteria.

i. Many AVs classify samples as similar based on their file type and/or packer, regardless of their actual content.

ii. Avoid directly comparing packed and unpacked samples, and prefer unpacking the samples whenever possible to maximize similarity scores.

(c) The application of individual AVs leads to low intra-cluster agreement, since the clusters created by the similarity hashing functions will be distinct from the clusters generated by the AVs.

(d) If using individual AVs is still the only possibility:

i. Be aware that the AV that detects the most samples is not necessarily the one that best clusters them, thus additional selection criteria are required.

ii. Choose the AV that best fits the similarity hashing function's results, as it allows one to check the function's improvements without the uncertainty of AV results.

(e) Remember that you can optimize the use of hash functions to achieve better results.

i. Distinct binary sections present distinct similarity rates, as they implement distinct malicious behaviors and/or components.

ii. Separating code and data increases similarity scores, as it avoids mixing the distinct patterns exhibited by these two kinds of sections.

A. Be careful when identifying executable sections, as packed code might turn sections executable only at runtime.

iii. Removing common instructions increases similarity metrics, as it supplies the function with less data and causes internal bucket rearrangements.

6 Discussion

In this section, we discuss the limits of our evaluations and the implications of our findings.

The Importance of the Threshold Value. Our experiments have shown that the threshold value affects the number of clustered samples and thus the conclusions drawn when characterizing a dataset. Therefore, we highlight the importance of researchers reporting the threshold used in their experiments to allow reproducibility and fair comparisons. There is currently no guideline for threshold definition. We consider this definition a key open question towards the adoption of similarity metrics as a standard for malware experiments and solutions. We expect our findings might help bridge this gap and incentivize future guidelines development. Meanwhile, we suggest that researchers clearly state their considered threshold values so as to allow reproducibility. Moreover, we suggest that they open-source their similarity solutions, since the meaning of the threshold value changes from solution to solution.

Defining Ground-Truth is Hard. Our results have shown that the clusters originating from AVs and from similarity hashing functions differ significantly. We discovered that the main reason is that AVs generate labels based on detection heuristics and not on binary similarity. Therefore, there is a significant opportunity for the development of new methods and strategies for ground-truth definition. We consider this an important step towards allowing malware similarity experiments to be validated and new clustering/identification approaches to be compared to each other on the same basis.

The Advantages of Alternative Malware Representations. Due to the nature of similarity hashing functions, our proposed AV only needs to upload the digest of a suspicious file to be scanned instead of the entire file content, which results in privacy and performance gains. This is possible because the uploaded digest is, in fact, a short representation of the scanned file, which saves network bandwidth and presents itself as an effective and scalable alternative representation for malware detection.

Similarity hashing for triaging malware. A straightforward application of similarity hashing in the context of malware analysis and detection is to triage samples, i.e., to verify whether a sample is known or not. This task is often performed using cryptographic hashes (e.g., MD5 or SHA), which limits its application mostly to static databases. If similarity hashes were used, the triage procedure could be turned into the verification of whether there is a known way to handle that sample, regardless of whether it is known or unknown. Similarity hashes could then be used, for instance, to cluster new samples with existing ones, and thus the same procedures applied to that cluster would be applied to the new sample. This would allow the scaling of multiple triaging approaches, such as Google SafeBrowsing [29], as users would not need to upload entire files (but only their similarity digests) to the server to receive initial feedback about the file's security.
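A sketch of such a similarity-based triage loop is shown below. Since ssdeep/sdhash bindings may not be available everywhere, `difflib.SequenceMatcher` from the standard library is used here purely as a stand-in for a real similarity hash score; the database and threshold are hypothetical:

```python
from difflib import SequenceMatcher

def similarity(a: bytes, b: bytes) -> float:
    """Stand-in for a similarity hashing score in the 0-100 range
    (a real deployment would compare ssdeep/sdhash digests instead)."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()

def triage(sample: bytes, known: dict, threshold: float = 50.0):
    """Return (id, score) of the most similar known sample if the score
    reaches the threshold, else None (a truly unknown sample).
    `known` maps identifiers to contents; a real database would store
    only the similarity digests, not the full contents."""
    best_id, best_score = None, 0.0
    for sample_id, content in known.items():
        s = similarity(sample, content)
        if s > best_score:
            best_id, best_score = sample_id, s
    return (best_id, best_score) if best_score >= threshold else None
```

A hit means the procedures already defined for the matched sample can be reused, even when the exact file was never seen before.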

Scaling Malware Variants Identification. Multiple approaches have been proposed over time to identify malware variants. Some of them, such as those relying on dynamic analysis [14], are effective but computationally expensive, and thus not practical for handling large-scale databases. The reliance on similarity hashing functions proposed in this work allows for an efficient and yet effective solution to handle large-scale databases. In our approach, suspicious files are matched only against representative samples of each cluster. Therefore, costly approaches can be employed on the server side to define each cluster's "leader", whereas our approach can be used to quickly calculate the similarity between these leaders and the suspicious file.
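This two-step scheme (costly leader election on the server side, cheap matching against leaders afterwards) can be sketched as follows, again with a hypothetical `score` stand-in in place of a real similarity hash comparison:

```python
from difflib import SequenceMatcher

def score(a: bytes, b: bytes) -> float:
    # Stand-in for a similarity hash comparison (ssdeep/sdhash in practice).
    return 100.0 * SequenceMatcher(None, a, b).ratio()

def elect_leader(cluster_samples):
    """Server-side, potentially costly step: elect the cluster medoid
    (the sample with the highest total similarity to all other members)
    as the cluster's representative 'leader'."""
    best, best_total = None, -1.0
    for candidate in cluster_samples:
        total = sum(score(candidate, other) for other in cluster_samples)
        if total > best_total:
            best, best_total = candidate, total
    return best

def match_against_leaders(suspicious: bytes, leaders: dict, threshold: float = 50.0):
    """Client-facing, cheap step: match a suspicious file only against
    each cluster's leader instead of against every known sample."""
    best_cluster, best_score = None, 0.0
    for cluster_id, leader in leaders.items():
        s = score(suspicious, leader)
        if s > best_score:
            best_cluster, best_score = cluster_id, s
    return best_cluster if best_score >= threshold else None
```

Matching cost thus grows with the number of clusters rather than with the number of known samples, which is what makes the scheme practical for large-scale databases.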

Identifying executable sections might be challenging. The major drawback of relying on individual sections is that originally read-only sections can be turned into executable ones at runtime by malware samples. This type of information is only available to dynamic analysis procedures and not to static approaches like the ones proposed in this paper. Therefore, the best way of benefiting from this type of approach is not to perform this task statically at the client side, but to outsource it to a more powerful agent, such as a cloud-based AV server, which might execute and trace the sample before matching its executable sections.

Limitations. Whereas we believe our work covered a significant number of challenges in the application of similarity hashing functions to malware decision processes, we are aware that it has some limitations. For instance, ssdeep and J-sdhash are not the only similarity hashing functions that could be selected. We considered them in our evaluation due to their popularity, but other functions, such as TLSH variations, should also be evaluated to draw a broader landscape of the malware classification subject. Moreover, our dataset is only a limited view of the larger problem of malware handling. We do not claim that our findings generalize from this dataset to any other. Instead, our claim is that this dataset particularly highlights the challenges of malware classification using similarity hashing functions. We demonstrated that the challenges we pointed out really occurred in a real dataset, and thus they should be considered by professionals and researchers, as they might appear during their investigative activities. However, other datasets might reveal distinct challenges, so more research is warranted.

Future Work. This work has shown alternative approaches for the application of similarity hashing functions. We believe that our findings might help broaden the usage scenarios of this type of function. However, this work is not exhaustive, and more research is warranted to investigate additional issues. For instance, whereas our work encompassed only MS Windows binaries, the same experiments should be repeated to assess whether the same conclusions hold for Linux and Android applications. In future work, we plan to investigate the impact of considering common blocks of multiple instructions in addition to single instructions, as well as new sources of label ground-truths.

7 Related Work

In this section, we present related work to better position our contributions. Whereas the main usage of similarity hashing functions is for sample triaging and clustering [25, 30], we focus below on more specialized developments of similarity hashing.

Approximate Matching. Many functions were developed over the years trying to provide a lightweight solution, with digests as short as possible and generation and comparison times just as fast as those of traditional hash functions, such as SHA-1, SHA-2, etc. The more prominent functions include ssdeep [36], sdhash [60], mrsh-v2 [17], TLSH [54], and the most recent one, LZJD [59]. However, the most popular ones, and the ones chosen for this work, are ssdeep and sdhash, for being the first functions of their kind and the target of constant research regarding improvements [16, 46, 45] and discussions of their detection capabilities [5, 20, 19].

Approximate Matching and Malware. The application of similarity hashing functions to binary files was well studied by Pagani et al. [56], who investigate use cases such as library identification and binary recompilation. Malware are particular types of binaries whose analysis tasks can benefit from the application of similarity hash functions. These have already been reported to be applied in forensic procedures [64] and for hunting ransomware [50]. Their major application field, however, is the identification of malware variants [4].

Drawbacks of Approximate Matching. The application of similarity hashing functions presents some drawbacks, such as those mentioned by Pagani et al. [56], related to changes in the binary structure of files. Since the functions employed in this work are meant to find similarity at the bytewise level, changes in the compiler during binary creation, or changes due to packing and compression techniques, will negatively impact the use of similarity hashing functions in the assessment of similarity. A drawback related to their application in practice is the resulting cluster diversity. The application of similarity hash functions to IoT binaries produced the same trade-offs between cluster sizes and performance presented in this work [6].

Design of new similarity hash functions. A way to advance malware similarity identification solutions is to develop new similarity hash functions. Previous research works have identified the requirements for a good similarity hash function [55] and how it can inspire new applications [35], but there are still few works targeting the similarity of executable binaries [39], and more specifically proposing new functions that are malware-targeted [38]. We expect that this study might foster new research and that our findings might support such developments.

Malware Variants Identification. The identification of malware variants is a challenging task, and multiple approaches have been proposed to address it: the use of n-grams [26], visual analytics [57], call graphs (CGs) [9], graph embedding [72], and dynamic analysis [2]. All of these approaches, however, are more computationally expensive than the application of hash functions proposed here. Additionally, nothing prevents these approaches from being used as a complement to ours if applied over the clusters identified by the similarity hash functions.

Other Matching Approaches. Malware variant identification is not the only method that works by matching similar patterns. The same task is performed in the context of basic block comparison [1] and code reuse identification [69]. Similarity hashing is an important building block of these approaches, and our findings also apply when used along with them.

Matching Strategies. A key contribution of this work is to discuss matching criteria and strategies to effectively detect common binaries. Whereas our focus is on malware detection, previous work presented related discussions in other fields, such as Windows module identification [41]. The application of this kind of strategy to malware samples is an open problem.

Other Labeling Strategies. In addition to AVs, other solutions might provide ground-truth information about malware families. For instance, many machine learning-based approaches have been proposed recently to address the problem of weak malware labels [73]. These approaches are varied, relying on anything from static strings extracted from the samples [67] to artifacts extracted during dynamic analysis procedures [44]. Other approaches try to overcome malware classification challenges by observing sample behaviors [49]. A major drawback of these approaches is that they are more costly (e.g., requiring runtime analysis) than relying on AV labels. Also, their implementations are often unavailable to practicing analysts and/or forensic experts. Thus, we focused this work on the use of the widely-available AV labels.
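For context, label-based family inference typically normalizes the labels assigned by different AV engines and takes a plurality vote over the remaining family tokens, in the spirit of AVclass [65]. The sketch below is a deliberately simplified version of that idea; the generic-token list and the example labels are illustrative, not AVclass's actual rules.

```python
# Simplified AVclass-style family vote over per-engine AV labels.
import re
from collections import Counter

# Tokens too generic to name a family (illustrative list).
GENERIC = {"trojan", "win32", "win64", "malware", "generic", "variant",
           "agent", "application", "riskware", "heur"}

def family_vote(av_labels: dict) -> str:
    """av_labels maps AV engine name -> label string; returns the most
    voted non-generic token as the inferred family name."""
    votes = Counter()
    for label in av_labels.values():
        tokens = re.split(r"[.:/\-_!\s]+", label.lower())
        votes.update(t for t in tokens if len(t) > 3 and t not in GENERIC)
    family, _ = votes.most_common(1)[0]
    return family
```

For example, the labels `Trojan.Win32.Allaple.gen`, `Win32/Allaple.A`, and `Worm:Win32/Allaple!x` all vote for the family token `allaple`.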

Cloud AVs. A key insight of our evaluation is that the similarity hash-based matching can be outsourced to a cloud-based AV. The idea of a cloud AV is not new, but most solutions only move the same technology to the cloud and do not change how detection is performed [24]. In some cases, detection is even outsourced to third-party AVs [33]. In our proposal, in addition to the processing tasks, we also change the file scanning representation from signatures to similarity hashes and scores. This allows us to benefit from other cloud-based strategies, such as distributed data collection [23].
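A minimal sketch of the server side of such a design: clients submit only a similarity digest, and the server scans its digest database, reporting a detection when the best score clears a threshold. Here `difflib` stands in for a real ssdeep/sdhash comparison, and all names are illustrative.

```python
# Server-side lookup sketch for a similarity-hash cloud AV.
from difflib import SequenceMatcher

class DigestServer:
    def __init__(self, threshold: int = 60):
        self.threshold = threshold
        self.known = {}                        # digest -> family label

    def enroll(self, digest: str, family: str):
        """Register the digest of a known malicious sample."""
        self.known[digest] = family

    def scan(self, digest: str):
        """Return (family, score) of the best match, or None if clean."""
        best = max(
            ((fam, int(100 * SequenceMatcher(None, digest, d).ratio()))
             for d, fam in self.known.items()),
            key=lambda match: match[1], default=None)
        if best and best[1] >= self.threshold:
            return best
        return None
```

Centralizing the digest database on the server is what makes threshold tuning and heavier clustering feasible without touching the clients.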

The bias towards the easy cases. In this paper, we tackled the problem of identifying similar malware samples in a real scenario, which brings many difficulties and limits the obtained results. Our goal with this evaluation is to highlight how challenging a theoretically-feasible task might become in a real scenario. In this sense, our evaluation follows the philosophy adopted in previous work to demonstrate that evaluating solutions using only the “easy-to-classify” samples is not enough for a complete understanding of the problem. Previous works have already demonstrated that the high accuracy rates reported for both traditional [37] and emerging [7] (texture-based) malware clustering approaches might not be reproducible depending on how challenging the evaluated dataset is.

8 Conclusion

In this work, we investigated the limits and benefits of applying similarity hashing to clustering routines to enhance malware detection procedures applied to real-world datasets. We proposed an ideal model of a similarity hashing-based AntiVirus (AV) that applied two distinct similarity hash functions to a dataset of real malware samples collected over four years. We discovered that considering similarity scores might improve standard AVs’ detection rates by up to 40% without raising False Positives. The centralized database model adopted by our AV, along with our findings, suggests that approaches so far considered impractical, such as complex clustering procedures, might now be successfully implemented at the AV server level. On the other hand, we discovered that leveraging AV labels as ground-truth for malware similarity identification is challenging and requires additional steps to achieve reasonable results. For instance, we showed that considering only the most representative binary instructions (i.e., the instructions unique to each binary, or those with fewer occurrences across different files) leads to better similarity metrics than considering the entire binary, and thus to a better fit between the similarity hashing function and the considered AV.

Acknowledgements. This study was financed in part by the Brazilian National Council for Scientific and Technological Development (CNPq, process n. 164745/2017-3), the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior, Brasil (CAPES, Finance Code 001), and the Santander International Mobility Program for Graduate Students.

Reproducibility. All source code developed for this research is available at https://github.com/marcusbotacin/Binary.Similarity. The list of considered malware samples is available at https://github.com/marcusbotacin/malware-data. Goodware information is available at https://github.com/marcusbotacin/Application.Installers.Overview.
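A minimal sketch of the instruction-filtering idea behind that last finding, under the assumption that samples have already been disassembled into mnemonic sequences (the corpus below is invented for illustration): mnemonics shared by many binaries are dropped, and only the low document-frequency ones would be fed to the similarity hash.

```python
# Keep only each binary's most representative mnemonics, i.e., those
# present in few other binaries, before computing a similarity digest.
from collections import Counter

def filter_representative(corpus: dict, max_df: int = 2) -> dict:
    """corpus: binary name -> mnemonic list. Keep mnemonics whose document
    frequency (number of binaries containing them) is at most max_df."""
    df = Counter()
    for mnemonics in corpus.values():
        df.update(set(mnemonics))
    return {name: [m for m in seq if df[m] <= max_df]
            for name, seq in corpus.items()}

def jaccard(a, b) -> float:
    """Set similarity of two mnemonic sequences."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0
```

Dropping boilerplate mnemonics (prologue/epilogue instructions present in every binary) removes the cross-family similarity they artificially inflate, which is why the filtered representation separates families better.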

References

[1] F. Adkins, L. Jones, M. Carlisle, and J. Upchurch. Heuristic malware detection via basic block comparison. In 2013 8th International Conference on Malicious and Unwanted Software: “The Americas” (MALWARE), pages 11–18, 2013.

[2] E. M. S. Alkhateeb. Dynamic malware detection using API similarity. In 2017 IEEE International Conference on Computer and Information Technology (CIT), pages 297–301, 2017.

[3] Flora Amato, Aniello Castiglione, Giovanni Cozzolino, and Fabio Narducci. A semantic-based methodology for digital forensics analysis. Journal of Parallel and Distributed Computing, 138:172–177, 2020.

[4] A. Azab, R. Layton, M. Alazab, and J. Oliver. Mining malware to detect variants. In 2014 Fifth Cybercrime and Trustworthy Computing Conference, pages 44–53, 2014.


[5] Harald Baier and Frank Breitinger. Security aspects of piecewise hashing in computer forensics. In IT Security Incident Management and IT Forensics (IMF), 2011 Sixth International Conference on, pages 21–36. IEEE, 2011.

[6] Marton Bak, Dorottya Papp, Csongor Tamas, and Levente Buttyan. Clustering IoT malware based on binary similarity. https://www.crysys.hu/publications/files/setit/cpaper_bme_BakPTB20dissect.pdf, 2020.

[7] Tamy Beppler, Marcus Botacin, Fabrício J. O. Ceschin, Luiz E. S. Oliveira, and André Grégio. L(a)ying in (test)bed. In Zhiqiang Lin, Charalampos Papamanthou, and Michalis Polychronakis, editors, Information Security, pages 381–401, Cham, 2019. Springer International Publishing.

[8] Christopher M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag, Berlin, Heidelberg, 2006.

[9] K. Blokhin, J. Saxe, and D. Mentis. Malware similarity identification using call graph based system call subsequence features. In 2013 IEEE 33rd International Conference on Distributed Computing Systems Workshops, pages 6–10, 2013.

[10] Burton H. Bloom. Space/time trade-offs in hash coding with allowable errors. Commun. ACM, 13(7):422–426, July 1970.

[11] Marcus Botacin, Hojjat Aghakhani, Stefano Ortolani, Christopher Kruegel, Giovanni Vigna, Daniela Oliveira, Paulo Lício De Geus, and André Grégio. One size does not fit all: A longitudinal analysis of Brazilian financial malware. ACM Trans. Priv. Secur., 24(2), January 2021.

[12] Marcus Botacin, Giovanni Bertao, Paulo de Geus, André Grégio, Christopher Kruegel, and Giovanni Vigna. On the security of application installers and online software repositories. In Clémentine Maurice, Leyla Bilge, Gianluca Stringhini, and Nuno Neves, editors, Detection of Intrusions and Malware, and Vulnerability Assessment, pages 192–214, Cham, 2020. Springer International Publishing.

[13] Marcus Botacin, Fabricio Ceschin, Paulo de Geus, and André Grégio. We need to talk about antiviruses: challenges & pitfalls of AV evaluations. Computers & Security, 95:101859, 2020.

[14] Marcus Botacin, Paulo de Geus, and André Grégio. Malware variants identification in practice. https://sbseg2019.ime.usp.br/anais/195666.pdf, 2019.

[15] Marcus Botacin, Lucas Galante, Paulo de Geus, and André Grégio. Revenge is a dish served cold: Debug-oriented malware decompilation and reassembly. In Proceedings of the 3rd Reversing and Offensive-Oriented Trends Symposium, ROOTS’19, New York, NY, USA, 2019. Association for Computing Machinery.

[16] Frank Breitinger and Harald Baier. Performance Issues About Context-Triggered Piecewise Hashing, pages 141–155. Springer Berlin Heidelberg, Berlin, Heidelberg, 2012.

[17] Frank Breitinger and Harald Baier. Similarity Preserving Hashing: Eligible Properties and a New Algorithm MRSH-v2, pages 167–182. Springer Berlin Heidelberg, Berlin, Heidelberg, 2013.

[18] Frank Breitinger, Barbara Guttman, Michael McCarrin, Vassil Roussev, and Douglas White. Approximate matching: definition and terminology. NIST Special Publication, 800:168, 2014.

[19] Frank Breitinger and Vassil Roussev. Automated evaluation of approximate matching algorithms on real data. Digital Investigation, 11:S10–S17, 2014.

[20] Frank Breitinger, Georgios Stivaktakis, and Vassil Roussev. Evaluating detection error trade-offs for bytewise approximate matching algorithms. Digital Investigation, 11(2):81–89, 2014.

[21] F. Ceschin, F. Pinage, M. Castilho, D. Menotti, L. S. Oliveira, and A. Gregio. The need for speed: An analysis of Brazilian malware classifiers. IEEE Security & Privacy, 16(6):31–41, Nov.–Dec. 2018.

[22] CNN. Nearly 1 million new malware threats released every day. https://money.cnn.com/2015/04/14/technology/security/cyber-attack-hacks-security/, 2015.

[23] M. Dev, H. Gupta, S. Mehta, and B. Balamurugan. Cache implementation using collective intelligence on cloud based antivirus architecture. In 2016 International Conference on Advanced Communication Control and Computing Technologies (ICACCCT), pages 593–595, 2016.

[24] Dimitris Deyannis, Eva Papadogiannaki, Giorgos Kalivianakis, Giorgos Vasiliadis, and Sotiris Ioannidis. TrustAV: Practical and privacy preserving malware analysis in the cloud. In Proceedings of the Tenth ACM Conference on Data and Application Security and Privacy, CODASPY ’20, page 39–48, New York, NY, USA, 2020. Association for Computing Machinery.

[25] Sebastian Eschweiler, Khaled Yakdan, and Elmar Gerhards-Padilla. discovRE: Efficient cross-architecture identification of bugs in binary code. In NDSS, 2017.

[26] Z. Fuyong and Z. Tiezhu. Malware detection and classification based on n-grams attribute similarity. In 2017 IEEE International Conference on Computational Science and Engineering (CSE) and IEEE International Conference on Embedded and Ubiquitous Computing (EUC), volume 1, pages 793–796, 2017.


[27] Lucas Galante, Marcus Botacin, André Grégio, and Paulo de Geus. Malicious Linux binaries: A landscape. https://sol.sbc.org.br/index.php/sbseg_estendido/article/view/4160, 2018.

[28] Google. How Google Play Protect kept users safe in 2019. https://security.googleblog.com/2020/03/how-google-play-protect-kept-users-safe.html, 2019.

[29] Google. Improved malware protection for users in the Advanced Protection Program. https://security.googleblog.com/2020/09/improved-malware-protection-for-users.html, 2020.

[30] Mariano Graziano, Davide Canali, Leyla Bilge, Andrea Lanzi, and Davide Balzarotti. Needles in a haystack: Mining information from public dynamic analysis sandboxes for malware intelligence. In Proceedings of the 24th USENIX Conference on Security Symposium, SEC’15, page 1057–1072, USA, 2015. USENIX Association.

[31] Vikram S Harichandran, Frank Breitinger, and Ibrahim Baggili. Bytewise approximate matching: The good, the bad, and the unknown. The Journal of Digital Forensics, Security and Law: JDFSL, 11(2):59, 2016.

[32] Qing He, Hai Xia Gu, Qin Wei, and Xu Wang. A novel DBSCAN based on binary local sensitive hashing and binary-KNN representation. Advances in Multimedia, 2017:3695323, Dec 2017.

[33] Chris Jarabek, David Barrera, and John Aycock. ThinAV: Truly lightweight mobile cloud-based anti-malware. In Proceedings of the 28th Annual Computer Security Applications Conference, ACSAC ’12, page 209–218, New York, NY, USA, 2012. Association for Computing Machinery.

[34] Andre Kameyama, Vitor Moia, and Marco Aurelio Amaral Henriques. Aperfeiçoamento da ferramenta sdhash para identificação de artefatos similares em investigações forenses. https://sol.sbc.org.br/index.php/sbseg_estendido/article/view/4161, 2018.

[35] ElMouatez Billah Karbab, Mourad Debbabi, and Djedjiga Mouheb. Fingerprinting Android packaging: Generating DNAs for malware detection. Digital Investigation, 18:S33–S45, 2016.

[36] Jesse Kornblum. Identifying almost identical files using context triggered piecewise hashing. Digital Investigation, 3:91–97, 2006.

[37] Peng Li, Limin Liu, Debin Gao, and Michael K. Reiter. On challenges in evaluating malware clustering. In Somesh Jha, Robin Sommer, and Christian Kreibich, editors, Recent Advances in Intrusion Detection, pages 238–255, Berlin, Heidelberg, 2010. Springer Berlin Heidelberg.

[38] Yuping Li, Sathya Chandran Sundaramurthy, Alexandru G. Bardas, Xinming Ou, Doina Caragea, Xin Hu, and Jiyong Jang. Experimental study of fuzzy hashing in malware clustering analysis. In Proceedings of the 8th USENIX Conference on Cyber Security Experimentation and Test, CSET’15, page 8, USA, 2015. USENIX Association.

[39] Lorenz Liebler and Harald Baier. Towards exact and inexact approximate matching of executable binaries. Digital Investigation, 28:S12–S21, 2019.

[40] Jacques Linden, Raymond Marquis, Silvia Bozza, and Franco Taroni. Dynamic signatures: A review of dynamic feature variation and forensic methodology. Forensic Science International, 291:216–229, 2018.

[41] Miguel Martín-Pérez, Ricardo J. Rodríguez, and Davide Balzarotti. Pre-processing memory dumps to improve similarity score of Windows modules. Computers & Security, 101:102119, 2021.

[42] Miguel Martín-Pérez, Ricardo J. Rodríguez, and Frank Breitinger. Bringing order to approximate matching: Classification and attacks on similarity digest algorithms. Forensic Science International: Digital Investigation, 36:301120, 2021. DFRWS 2021 EU - Selected Papers and Extended Abstracts of the Eighth Annual DFRWS Europe Conference.

[43] Fernando Merces. Grouping Linux IoT malware samples with Trend Micro ELF hash. https://blog.trendmicro.com/trendlabs-security-intelligence/grouping-linux-iot-malware-samples-with-trend-micro-elf-hash/, 2020.

[44] Aziz Mohaisen and Omar Alrawi. AMAL: High-fidelity, behavior-based automated malware analysis and classification. In Kyung-Hyune Rhee and Jeong Hyun Yi, editors, Information Security Applications, pages 107–121, Cham, 2015. Springer International Publishing.

[45] Vitor Hugo Galhardo Moia. A study on approximate matching for similarity search: techniques, limitations and improvements for digital forensic investigations. PhD dissertation, School of Electrical and Computer Engineering - University of Campinas, Campinas, SP, Brazil, 2020.

[46] Vitor Hugo Galhardo Moia, Frank Breitinger, and Marco Aurelio Amaral Henriques. Understanding the effects of removing common blocks on approximate matching scores under different scenarios for digital forensic investigations. In XIX Brazilian Symposium on Information and Computational Systems Security, pages 1–14. Brazilian Computer Society (SBC), São Paulo, SP, Brazil, 2019.

[47] Vitor Hugo Galhardo Moia, Frank Breitinger, and Marco Aurelio Amaral Henriques. The impact of excluding common blocks for approximate matching. Computers & Security, 89:101676, 2020.


[48] Vitor Hugo Galhardo Moia and Marco Aurelio Amaral Henriques. Similarity digest search: A survey and comparative analysis of strategies to perform known file filtering using approximate matching. Security and Communication Networks, 2017, 2017.

[49] Azqa Nadeem, Christian Hammerschmidt, Carlos H. Ganan, and Sicco Verwer. Beyond Labeling: Using Clustering to Build Network Behavioral Profiles of Malware Families, pages 381–409. Springer International Publishing, Cham, 2021.

[50] N. Naik, P. Jenkins, N. Savage, and L. Yang. Cyberthreat hunting - part 1: Triaging ransomware using fuzzy hashing, import hashing and YARA rules. In 2019 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE), pages 1–6, 2019.

[51] N. Naik, P. Jenkins, N. Savage, L. Yang, K. Naik, and J. Song. Augmented YARA rules fused with fuzzy hashing in ransomware triaging. In 2019 IEEE Symposium Series on Computational Intelligence (SSCI), pages 625–632, 2019.

[52] Landon Curt Noll. Fowler/Noll/Vo (FNV) hash. http://www.isthe.com/chongo/tech/comp/fnv/index.html, 2012. Accessed 2020 Apr 14.

[53] Markus Oberhumer, Laszlo Molnar, and John Reiser. UPX: the ultimate packer for executables. https://upx.github.io/, 2017.

[54] Jonathan Oliver, Chun Cheng, and Yanggui Chen. TLSH – a locality sensitive hash. In Cybercrime and Trustworthy Computing Workshop (CTC), 2013 Fourth, pages 7–13. IEEE, 2013.

[55] Jonathan Oliver and Josiah Hagen. On designing the elements of a fuzzy hashing scheme. https://www.malwareconference.org/index.php/en/2019-malware-conference-proceedings/2019-malware-conference/session-2-case-studies-analysis-and-industry-view/05-4720-pdf/detail, 2019.

[56] Fabio Pagani, Matteo Dell’Amico, and Davide Balzarotti. Beyond precision and recall: Understanding uses (and misuses) of similarity hashes in binary analysis. In Proceedings of the Eighth ACM Conference on Data and Application Security and Privacy, CODASPY ’18, page 354–365, New York, NY, USA, 2018. Association for Computing Machinery.

[57] A. Paturi, M. Cherukuri, J. Donahue, and S. Mukkamala. Mobile malware visual analytics and similarities of attack toolkits (malware gene analysis). In 2013 International Conference on Collaboration Technologies and Systems (CTS), pages 149–154, 2013.

[58] Matt Pietrek. An in-depth look into the Win32 portable executable file format. https://docs.microsoft.com/en-us/archive/msdn-magazine/2002/february/inside-windows-win32-portable-executable-file-format-in-detail, 2002.

[59] Edward Raff and Charles Nicholas. Lempel-Ziv Jaccard distance, an effective alternative to ssdeep and sdhash. Digital Investigation, 24:34–49, 2018.

[60] Vassil Roussev. Data fingerprinting with similarity digests. In IFIP International Conf. on Digital Forensics, pages 207–226. Springer, 2010.

[61] Vassil Roussev. An evaluation of forensic similarity hashes. Digital Investigation, 8:34–41, 2011.

[62] Vassil Roussev. An evaluation of forensic similarity hashes. Digital Investigation, 8:S34–S41, 2011. The Proceedings of the Eleventh Annual DFRWS Conference.

[63] Vassil Roussev and Candice Quates. sdhash tutorial: Release 0.8. http://roussev.net/sdhash/tutorial/sdhash-tutorial.pdf, 2013. Accessed 2016 Sep 13.

[64] N. Sarantinos, C. Benzaïd, O. Arabiat, and A. Al-Nemrat. Forensic malware analysis: The value of fuzzy hashing algorithms in identifying similarities. In 2016 IEEE Trustcom/BigDataSE/ISPA, pages 1782–1787, 2016.

[65] Marcos Sebastian, Richard Rivera, Platon Kotzias, and Juan Caballero. AVclass: A tool for massive malware labeling. In Fabian Monrose, Marc Dacier, Gregory Blanc, and Joaquin Garcia-Alfaro, editors, Research in Attacks, Intrusions, and Defenses, pages 230–253, Cham, 2016. Springer International Publishing.

[66] Ian Shiel and Stephen O’Shaughnessy. Improving file-level fuzzy hashes for malware variant classification. Digital Investigation, 28:S88–S94, 2019.

[67] Prasha Shrestha, Suraj Maharjan, Gabriela Ramírez de la Rosa, Alan Sprague, Thamar Solorio, and Gary Warner. Using string information for malware family identification. In Ana L.C. Bazzan and Karim Pichara, editors, Advances in Artificial Intelligence – IBERAMIA 2014, pages 686–697, Cham, 2014. Springer International Publishing.

[68] Esko Ukkonen. On approximate string matching. In International Conference on Fundamentals of Computation Theory, pages 487–495. Springer, 1983.

[69] J. Upchurch and X. Zhou. Malware provenance: code reuse detection in malicious software at scale. In 2016 11th International Conference on Malicious and Unwanted Software (MALWARE), pages 1–9, 2016.

[70] VirusTotal. VirusTotal. https://www.virustotal.com, 2018.

[71] Y. Wu, J. Guo, and X. Zhang. A linear DBSCAN algorithm based on LSH. In 2007 International Conference on Machine Learning and Cybernetics, volume 5, pages 2608–2614, 2007.


[72] Xiaochuan Zhang, Wenjie Sun, Jianmin Pang, Fudong Liu, and Zhen Ma. Similarity metric method for binary basic blocks of cross-instruction set architecture. https://archive.bar/pdfs/bar2020-preprint2.pdf, 2020.

[73] Y. Zhang, Y. Sui, S. Pan, Z. Zheng, B. Ning, I. Tsang, and W. Zhou. Familial clustering for weakly-labeled Android malware using hybrid representation learning. IEEE Transactions on Information Forensics and Security, 15:3401–3414, 2020.
