spam: signal processing for analyzing malware - … · spam: signal processing for analyzing...

1

SPAM: Signal Processing for Analyzing MalwareLakshmanan Nataraj, Student Member, IEEE, B.S. Manjunath, Fellow, IEEE

CYBER attacks have risen in recent times. The attackon Sony Pictures by hackers, allegedly from North

Korea, has caught worldwide attention. The President of theUnited States of America issued a statement and “vowed aUS response after North Korea’s alleged cyber-attack”.Thisdangerous malware termed “wiper” could overwrite data andstop important execution processes. An analysis by the FBIshowed distinct similarities between this attack and the codeused to attack South Korea in 2013, thus confirming thathackers re-use code from already existing malware to createnew variants. This attack along with other recently discoveredattacks such as Regin, Opcleaver give one clear message:current cyber security defense mechanisms are not sufficientenough to thwart these sophisticated attacks.

Today’s defense mechanisms are based on scanning systemsfor suspicious or malicious activity. If such an activity is found,the files under suspect are either quarantined or the vulnerablesystem is patched with an update. These scanning methodsare based on a variety of techniques such as static analysis,dynamic analysis and other heuristics based techniques, whichare often slow to react to new attacks and threats. Staticanalysis is based on analyzing an executable without executingit, while dynamic analysis executes the binary and studiesits behavioral characteristics. Hackers are familiar with thesestandard methods and come up with ways to evade the currentdefense mechanisms. They produce new malware variants thateasily evade the detection methods. These variants are createdfrom existing malware using inexpensive easily available “fac-tory toolkits” in a “virtual factory” like setting, which thenspread over and infect more systems. Once a system is com-promised, it either quickly looses control and/or the infectionspreads to other networked systems. While security techniquesconstantly evolve to keep up with new attacks, hackers toochange their ways and continue to evade defense mechanisms.As this never-ending billion dollar “cat and mouse game”continues, it may be useful to look at avenues that can bringin novel alternative and/or orthogonal defense approaches tocounter the ongoing threats. The hope is to catch these newattacks using orthogonal and complementary methods whichmay not be well known to hackers, thus making it moredifficult and/or expensive for them to evade all detectionschemes. This paper focuses on such orthogonal approachesfrom Signal and Image Processing that complement standardapproaches.

MALWARE LANDSCAPEMalware - malicious software, is any software that is

designed to cause damage to a computer, server, network,mobile phones and more such devices. Based on their func-tion, malware are classified into different Types such as Tro-jans, Backdoors, Virus, Worm, Spyware, Adware and more.

1

MALWARE

Type Type Type

Family Family Family Family Family Family

. . . . . .

Variants Variants

. . . . . . . . .

Variants Variants Variants Variants

Fig. 1: Malware LandscapeMalware are also identified by which Platform they belongto, such as Windows, Linux, AndroidOS and others. Apartfrom Types and Platforms, malware are further classified intoFamilies depending on their specific function. These Familiesin turn have many Variants which perform almost the samefunction. The entire malware landscape is shown in Fig. 1.According to the Computer Antivirus Research Organiza-tion (CARO) convention for naming malware, a malware isrepresented by: Type:Platform/Family.Variant. For example,PWS:Win32/Zbot.gen!AF denotes a password stealer malwareof the generic Zbot family that attacks 32-bit Windows plat-forms.

Malware variants are created either by making changesto the malware code or by using executable packers. In theformer case a simple mutation occurs by changing smallparts of the code. These are referred as unpacked malwarevariants. In the latter case a more complex mutation occurseither by compressing or encrypting (usually with differentkeys) the main body of the code and appending a decom-pression/decryption routine, which during runtime decom-presses/decrypts the encrypted payload. The new variants arecalled packed malware variants and they perform the samefunction as the original malware but their attributes wouldbe so different that Antivirus software, which use traditionalsignature based detection, would not be able to detect them.The tools used for obfuscation are called Executable Packers,available both as freeware and commercial tools. There arehundreds of packers that exist today which make it very easyfor malware writers to create new variants.

MALWARE ANALYSIS

Malware classification deals with identifying the family ofan unknown malware variant from a malware dataset that isdivided into many families. The level of risk of a particularmalware is determined by what function it does, which is inturn reflected in its family. Hence, identifying the malware

arX

iv:1

605.

0528

0v1

[cs

.CR

] 1

7 M

ay 2

016

2

8 Bit Vector

Malware Binary Image

SignalHex Viewer

Fig. 2: Malware represented as a Signal and an Image

family of an unknown malware is crucial in understandingand stopping new malware. It is usually assumed that anunknown malware variant belongs to a known set of malwarefamilies (supervised classification). Having a high classifica-tion accuracy (the number of correctly classified families)is desirable. A closely related problem is malware retrievalwhere the objective is to retrieve similar malware matchesfor a given query from a large database of malware. Inmalware detection the problem is to determine if an unknownexecutable is malicious, benign or unknown. This problemis more challenging than malware classification where allsamples are known to be malicious. In this tutorial we willfocus on malware classification and malware retrieval.

While most malware are geared towards Windows Operat-ing System, they are also quickly expanding to other avenuessuch as Android, Linux and OS X. Antivirus vendor G-DATAreported that they discovered more than 1.5 Million maliciousAndroid apps in 2014 and more than 400,000 apps in just thefirst quarter of 2015. Similarly, there has also been a stark risein Linux malware and OS X malware. An important questionin this context is: Can we have a single method that can detectmalware irrespective of which Operating System it comes fromwithout having to know the nuances of each system?

A common way to defeat static analysis is by using packerson a executable which compress and/or encrypt the executablecode and create a new packed executable that mimics theprevious executable in function but reveals the actual codeonly upon execution runtime. Dynamic analysis is agnosticto packing but is slow and time consuming. Further, today’smalware are designed to be Virtual Machine (VM) aware,which either do not do any malicious activity in the presenceof VM or attempts a “suicide” when a VM is detected. Thechallenges here are: Can we design techniques that are fast,do not need disassembly, unpacking or execution?

A key emphasis in all the above-mentioned challengesis development of complementary methods that address thelimitations of existing approaches. Alternative representationsof malware data such as signals or images have patterns thatare not captured by standard methods. We explore these typesof representations in this paper.

.text

.rdata

.data

.rsrc

Fig. 3: Example of a Malware Image

MALWARE IMAGES

A common method of viewing and editing malware binariesis by using Hex Editors, which display the bytes of the binariesin hexadecimal representation from ‘00’ to ‘FF’. Effectively,these are 8-bit numbers in the range of 0-255. Groupingthese 8-bit numbers results in a 8-bit vector, from whichwe construct a signal or an image as shown in Fig. 2. Foran image, the width is fixed and the height is allowed tovary depending on the file size. Fig. 3 shows an example

3

(c) Fakerean (d) Yuner.A(b) Dialplatform.B(a) Adialer.C

Fig. 4: Visual similarity among malware variants of 4 different families

image of a common Windows Trojan downloader, Dontovo.A,which downloads and executes arbitrary files. We can see thatdifferent sections of this malware exhibit distinctive imagepatterns. The .text section which contains the executable codehas a fine grained texture. It is followed by a black block(zeros), indicating zero padding at the end of this section. The.data section contains both uninitialized code (black patch) andinitialized data (fine grained texture). The final .rsrc sectioncontains all the resources of the module, including the icon ofthe executable.

Dataset Size No. of Families Accuracy

Malimg (Win) 9,339 25 97.4

Malheur (Win) 3,131 24 98.37

VxShare (Linux) 568 8 83.27

Malgenome (Android) 1,200 13 97.4

Fig. 5: Classification Accuracy on Datasets from differentOperating Systems

When we visualized a large number of malware variants,an empirical observation we could make is that there wasvisual similarity among malware variants of the same family(Fig. 4). At the same time, the variants were also distinct fromthose belonging to other families. This is because the variantsare created using either simple code mutations or packing. Itis easy to identify the variants for unpacked malware sincethe structure of the variants are very similar. In the case ofpacked malware, the executable code is compressed and/orencrypted. During runtime, this code is then unpacked andexecuted. When two unpacked variants belonging to a specificmalware family are using a packer to obtain packed variantsof the same family, their structure no longer remains thesame as that of the unpacked variants. However, the structurewithin the packed variants are still similar though the actualbytes may vary due to compression and/or encryption. Thevisual similarity of malware images motivated us to look atmalware classification using techniques from computer vision,where image based classification has been well studied. Weuse global image similarity descriptors and obtain compactsignatures for these malware, which are then used to identify

their families.

CLASSIFICATION

Once the malware binary is converted to an image, animage similarity descriptor is computed on the image tocharacterize the malware. The descriptor that we use isthe GIST feature [1], which is commonly used in imagerecognition systems such as scene classification [1], objectrecognition [2] and large scale image search [3]. Every imagelocation is represented by the output of filters tuned to differentorientations and scales. A steerable pyramid with 4 scales and8 orientations is used. The local representation of an image isthen given by: V L(x) = Vk(x)k=1..N where N = 20 is thenumber of sub-bands. To capture the global image propertieswhile retaining some local information, the mean value of themagnitude of the local features is computed and averaged overlarge spatial regions: m(x) =

∑x′ |V (x′)|W (x′ − x) where

W (x) is the averaging window. The resulting representationis downsampled to have a spatial resolution of M×M pixels(here we use M=4). Thus the feature vector obtained is of sizeM×M×N = 320. For faster processing, the images are usuallyresized to a smaller size (we use 64× 64).

To identify malware families, we perform supervised classi-fication with 10-fold cross validation and compute the averageclassification accuracy. We use Nearest Neighbor (NN) clas-sifier which assigns the family of the nearest malware to anunknown malware. We obtained four datasets: Malimg dataset(Windows) [6], Malheur dataset (Windows) [7], MalGenomedataset (Android) [5] and VxShare ELF dataset (Linux) [8].On all four datasets, we obtained a high classification accuracy(Fig. 5). Further, on comparing our approach with dynmaicanalysis, our method was comparable in terms of classificationaccuracy but 4,000 times faster than dynamic analysis [9].In [10], we extend our approach to separate malware frombenign software. In order to get a richer discrimination be-tween benign and malicious samples, we adopt a section-aware approach and compute GIST descriptors on the entirebinary as well as the top two sections of the binary whichcould contain the code. With more than 99% precision, ourapproach outperformed other static similarity features.

4

1Fig. 6: Block Schematic of SARVAM

SEARCH AND RETRIEVAL

We developed SARVAM: Search And RetrieVAl of Mal-ware [11] (accessible at http://sarvam.ece.ucsb.edu), an onlinesystem for large scale malware search and retrieval. It isone of the few systems available to public where researcherscan upload or search for a sample and retrieve similar mal-ware matches from a large database. Leveraging on our pastwork [4], we use GIST descriptors for content-based searchand retrieval of malware. These effectively capture the visual(structural) similarity between similar malware variants. Forfast search and retrieval, we use a scalable Balltree-basedNearest Neighbor searching technique. On a database of morethan 7 Million samples comprising mostly malware and afew benign samples, SARVAM could find a match in about 6seconds. SARVAM has been operational since May 2012 andduring this period, we received more than 440,000 samples.Nearly 60% were possible variants of already existing malwarefrom our database.

There are two phases in the system design as illustratedin Fig. 6. During the initial phase, we first obtain a largecorpus of malware samples from various sources . The imagefingerprints for all the samples in the corpus are then computedand stored in a database. Simultaneously, we obtain theAntivirus (AV) labels for all the samples from Virustotal [12],an online system that maintains a database of AV labels. Theselabels act as a ground truth and are later used to describe thenature of a sample, i.e., how malicious or benign a sampleis. During the query phase, the fingerprint for the new sampleis computed and matched with the existing fingerprints in thedatabase to retrieve the top matches.

Of the 440,000 uploaded samples we received, not all thesamples have a good match with our corpus database. In Fig. 7,we see the distribution of the confidence levels of the top

match. Close to 37% fall under Very High Confidence, 8%under High Confidence, 49.5% under Low confidence and5.5% under Very Low Confidence.

Fig. 7: Confidence of the Top Match

SPARSITY BASED MALWARE ANALYSIS

In this part we explore Sparse Representation based Classi-fication (SRC) methods to classify malware variants into fam-ilies. Such methods have been previously applied to problemswhere samples belonging to a class have small variations inthem, such as face recognition [14] and iris recognition [16].We developed SATTVA: SparsiTy inspired classificaTion ofmalware VAriants [13], where we model a malware variant be-longing to a particular malware family as a linear combinationof variants from that family. Since variants of a family havesmall changes in the overall structure and differ from variantsof other families, projections of malware in lower dimensionspreserve this “similarity”.

Given a dataset of N labeled malware belonging to Ldifferent malware families with P malware per family, thetask is to identify the family of an unknown malware u. We

5

Random Projections

Signal Representation

Sparse Modeling

Malware Data

𝐴1𝐴2 𝐴𝑁

...

𝑀 ×𝑁𝑀 × 1

=

u A 𝛼

u2

u𝑀

u1

.

.

. ...

α1

α2

α𝑁𝑁 × 1

.

.

.

α1

α2

D × 𝑁

α𝑁

𝐷 × 1

=

w RA 𝛼

... 𝑅𝐴𝑁𝑅𝐴1...

w𝐷

w1

𝑁 × 1

Fig. 8: Sparse Representation based Classification (SRC) framework for Malware Classification

represent a malware as a digital signal x of range [0, 255],where every entry of x is a byte value of the malware. Sinceeach malware sample can have a different code-length, wenormalize all vectors to a maximum length (M ) by zero-padding.

The entire dataset can now be represented as an M × Nmatrix A, where every column represents a malware. Further,for every family k (k = 1, 2, ..., L), we define an M×P matrixAk = [xk1,xk2, ...xkP ] where xk{.} represents a malwaresample belonging to family k. Now, A can be expressed as aconcatenation of block-matrices Ak:

A = [A1A2..AL] ∈ RM×N (1)

Let u ∈ RM be an unknown malware whose family is tobe determined, with the assumption that u belongs to one ofthe families in the dataset. Then, following [14], we representu as a sparse linear combination of the training samples as:

u =

L∑i=1

P∑j=1

αijxij = Aα (2)

where α = [α1,1, ..., αL,P ]T represents the N × 1 sparse

coefficient vector (N = LP ). α will have non-zero values onlyfor samples that are from the same family as u. The sparsestsolution to (2) can be obtained using Basis Pursuit [16] bysolving the following l1-norm minimization problem:

α̂ = argminα′∈RN

‖α′‖1 subject to u = Aα′ (3)

Estimating the family of u is done by computing residualsfor every family in the training set and then selecting thefamily that has minimum residue.

RANDOM PROJECTIONS

When a malware binary is represented as a numerical vectorby considering every byte, the dimensions of that vector can bevery high. For example, a 1 MB malware has around 1 Millionbytes and this could make the calculations computationallyexpensive. Hence, we project the vectors to lower dimensionsusing Random Projections (RP). This also removes depen-dency on any particular feature extraction method. Previousworks have demonstrated that SRC is effective in lower-dimensional random projections as well, see [14]–[16]. LetR ∈ RD×M be the matrix that projects u from signal spaceM to w of lower dimensional space D (D << M ):

w = Ru = RAα (4)

The entries of R are drawn from a zero mean normal dis-tribution. The above system of equations is underdeterminedand sparse solutions can be obtained by reduced l1-normminimization:

α̂ = argminα′∈RN

‖α′‖1 subject to w = RAα′ (5)

The overall approach is shown in Fig.8.We tested our technique on two public malware datasets:

Malimg Dataset [6] and Malheur Dataset [7]. On both datasets,we selected equal number of samples to reduce any biastowards a particular family.For comparison, we used GISTdescriptors, which we had previously applied for malware clas-sification. We used the SRC framework to identify the malwarefamily of a test sample and compared with Nearest Neighbors(NN) classification that was previously used in [4]. We variedthe dimensions from {48, 96, 192, 256, 384, 512}, which areconsistent for both RP and GIST. In our experiments, we chose80% of a dataset for training and 20% for testing. On both theMalimg dataset (Fig. 9a) and the Malheur dataset (Fig. 9b),

6

0 50 100 150 200 250 300 350 400 450 500 55080

82

84

86

88

90

92

94

96

98

100

Dimensions

Acc

ura

cy

RP+NNGIST+NNGIST+SRCRP+SRC

(a) Malimg Dataset

0 50 100 150 200 250 300 350 400 450 500 55080

82

84

86

88

90

92

94

96

98

100

Dimensions

Acc

ura

cy

RP+NNGIST+NNGIST+SRCRP+SRC

(b) Malheur DatasetFig. 9: Experimental Results on (a) Malimg Dataset and (b) Malheur Dataset with features using Random Projections (RP)and GIST, and classification algorithms using Sparse Representation based Classification (SRC) and Nearest Neighbor (NN).

the best accuracy is obtained for the combination of RandomProjections (RP) and the SRC classification framework. Theaccuracies for GIST for both classifiers were almost the same.In [13], we further showed how this approach can be used toreject potential outliers in a dataset and also evaluated on largescale datasets having 42,480 malware and 2,124 families.

FUTURE DIRECTIONS

While we explored signal and image based analysis ofmalware data, a natural complement is to treat the malwareas audio-like one dimensional signals and leverage automatedaudio descriptors. Another possible approach is computingimage similarity descriptors and/or random projections on allthe sections and represent a malware as bag of descriptors,which can then be used for better characterization of malware.Using the error model in the sparse representation basedmalware classification framework, we can determine the exactpositions in which the malware variant differs from anothervariant. This approach can also be used to find the exact sourcefrom which a malware variant evolves. Patched malware thatattaches to benign software can be identified using this method.

CONCLUSION

In this paper we explored orthogonal yet complementarymethods to analyze malware motivated by Signal and ImageProcessing. Malware samples are represented as images orsignals. Image and signal based features are extracted tocharacterize malware. Our extensive experiments demonstratethe efficacy of our methods on malware classification andretrieval. We believe that our techniques will open the scope ofsignal and image based methods to broader fields in computersecurity.

REFERENCES

[1] A. Oliva and A. Torralba, Modeling the shape of the scene: A holisticrepresentation of the spatial envelope, International Journal of ComputerVision, 42(3), 145-175, 2002.

[2] A. Torralba, K.P. Murphy, W.T. Freeman and M. Rubin, Context-basedvision system for place and object recognition, In Proceedings of theNinth IEEE International Conference on Computer Vision, pp. 273-280,2003.

[3] M. Douze, H. Jgou, H. Sandhawalia, L. Amsaleg and M. Schmid,Evaluation of gist descriptors for web-scale image search, In Proceedingsof the ACM International Conference on Image and Video Retrieval, (p.19). , 2009.

[4] L. Nataraj, S. Karthikeyan, G. Jacob and B.S. Manjunath, MalwareImages: Visualization and Automatic Classification, In Proceedings of the8th International Symposium on Visualization for Cyber Security (VizSec’11), 2011, New York, NY, USA.

[5] Y. Zhou and X. Jiang, Dissecting android malware: Characterizationand evolution, In Proceedings of the IEEE Symposium on Security andPrivacy (SP), 2012.

[6] Malimg Dataset, http://old.vision.ece.ucsb.edu/spam/malimg.shtml[7] K. Rieck, P. Trinius, C. Willems and T. Holz, Automatic analysis of

malware behavior using machine learning, Professoren der Fak. IV., 2009.[8] VirusShare, http://www.virusshare.com[9] L. Nataraj, V. Yegneswaran, P. Porras and J. Zhang, A comparative

assessment of malware classification using binary texture analysis anddynamic analysis, In In Proceedings of the 4th ACM Workshop onSecurity and Artificial Intelligence, (AISec ’11), 2011, Chicago, IL, USA.

[10] D. Kirat, L. Nataraj, G. Vigna and B.S. Manjunath, SigMal: A staticsignal processing based malware triage, In Proceedings of Proceedingsof the 29th Annual Computer Security Applications Conference (ACSAC’13), 2013, New Orleans, LA, USA.

[11] L. Nataraj, D. Kirat, B.S. Manjunath and G. Vigna, SARVAM: SearchAnd RetrieVAl of Malware, In Proceedings of the Annual ComputerSecurity Conference (ACSAC) Worshop on Next Generation MalwareAttacks and Defense (NGMAD ’13), 2013, New Orleans, LA, USA.

[12] Virustotal, https://www.virustotal.com/[13] L. Nataraj, S. Karthikeyan, and B.S. Manjunath, SATTVA: SpArsiTy in-

spired classificaTion of malware VAriants, In Proceedings of the 3rd ACMWorkshop on Information Hiding and Multimedia Security (IH&MMSEC’15), 2015, Portland,OR, USA.

[14] J. Wright, A.Y. Yang, A. Ganesh, S.S. Sastry and Y. Ma, Robust facerecognition via sparse representation, In IEEE Transactions on IPatternAnalysis and Machine Intelligence, 31(2), 210-227, 2009.

[15] D. Donoho and J. Tanner, Counting faces of randomly projected poly-topes when the projection radically lowers dimension, In Journal of theAmerican Mathematical Society, 22(1), 1-53, 2009.

[16] J.K. Pillai, V.M. Patel, R. Chellapa and N.K. Ratha, Secure and robustiris recognition using random projections and sparse representations, InIEEE Transactions on IPattern Analysis and Machine Intelligence, 33(9),1877-1893, 2011.

spam: signal processing for analyzing malware - … · spam: signal processing for analyzing...

Documents