ontology-driven knowledge graph for android malware

Ontology-driven Knowledge Graph for Android MalwareRyan Christian1, Sharmishtha Dutta1, Youngja Park2, Nidhi Rastogi1

Rensselaer Polytechnic Institute1, IBM TJ Watson Research Center2New York, USA

ABSTRACTWe present MalONT2.0 – an ontology for malware threat intelli-gence [5]. New classes (attack patterns, infrastructural resources toenable attacks, malware analysis to incorporate static analysis, anddynamic analysis of binaries) and relations have been added follow-ing a broadened scope of core competency questions. MalONT2.0allows researchers to extensively capture all requisite classes andrelations that gather semantic and syntactic characteristics of an an-droidmalware attack. This ontology forms the basis for themalwarethreat intelligence knowledge graph, MalKG, which we exemplifyusing three different, non-overlapping demonstrations. Malwarefeatures have been extracted from CTI reports on android threatintelligence shared on the Internet and written in the form of un-structured text. Some of these sources are blogs, threat intelligencereports, tweets, and news articles. The smallest unit of informa-tion that captures malware features is written as triples comprisinghead and tail entities, each connected with a relation. In the posterand demonstration, we discuss MalONT2.0, MalKG, as well as thedynamically growing knowledge graph, TINKER.

CCS CONCEPTS• Computer systems organization→ Embedded systems; Re-dundancy; Robotics; • Networks→ Network reliability.

KEYWORDSKnowledge Graphs, Security Intelligence, Android, MalwareACM Reference Format:Ryan Christian1, Sharmishtha Dutta1, Youngja Park2, Nidhi Rastogi1. 2021.Ontology-driven Knowledge Graph for Android Malware. In Seoul’21: ACMConference on Computer and Communications Security (CCS), November14–19, 2021. ACM, New York, NY, USA, 3 pages.

1 INTRODUCTIONMalware attack intelligence describes the working of the attacks,their tactics, techniques, and procedures (TTPs), and the technol-ogy vulnerabilities exploited by the malware. This intelligence canequip security researchers with information to build better defensesagainst advanced cyber attacks and issue early warnings aboutfuture threats. The Internet can lead to evidence on attack intel-ligence through thousands of diverse and heterogeneous sources,globally known as cyber threat intelligence (CTI). Discerning and

Permission to make digital or hard copies of all or part of this work for personal orclassroom use is granted without fee provided that copies are not made or distributedfor profit or commercial advantage and that copies bear this notice and the full citationon the first page. Copyrights for components of this work owned by others than ACMmust be honored. Abstracting with credit is permitted. To copy otherwise, or republish,to post on servers or to redistribute to lists, requires prior specific permission and/or afee. Request permissions from [email protected]’21, November 14–19, 2021, Seoul, South Korea© 2021 Association for Computing Machinery.

utilizing this knowledge speedily and accurately for longitudinalstudies mandate rigorous development of techniques that are con-stantly evolving and adapting to the complexities of malware at-tacks. Therefore, threat intelligence is best utilized when it is timely,actionable, shared in an universally acceptable format, contextual,and understandable.

CVE1, NVD2 are vulnerability tracking programs where informa-tion is combined through a centralized platform in a semi-structuredway. Industry standards like Structured Threat Information eXpres-sion (STIX)[1] and trusted automated exchange of indicator infor-mation (TAXII)[2] provide a language-agnostic framework for stor-ing, sending, and receiving packages. However, despite concertedefforts by experts towards organizing malware threat information,there is a lack of context and transparency in sharing CTI. Ana-lysts require more than data-driven threat intelligence. They seektrustworthy data sources, relevant threat indicators, sources andmotivations behind attacks, and the likelihood of an attack. Theyalso demand context to get a comprehensive picture of the threats,victims, and the distinct tactics an attacker deploys.

Malware threat ontologies, like MalONT2.0, enable communi-cating contextual CTI feeds by representing them in a structuredformat that encapsulates data and information characterized byproperties that vary according to context. Our main contributionsare as follows:

(1) We present MalONT2.0, an ontology for capturing malwarethreat intelligence through classes and relations that com-bine semantic features (such as malware, attacker, infrastruc-ture) with syntactic features and factual data (extracted fromVirusTotal3).

(2) Annotated CTI reports on android malware attacks, whereeach annotation is instantiated into classes and shares a rela-tionship with other instances. Both classes and relations aredefined and described in MalONT2.0 and share similaritieswith the STIX2.1 framework, where applicable. The instancesare stored as RDF triples [3], ⟨𝑒𝑛𝑡𝑖𝑡𝑦ℎ𝑒𝑎𝑑 , 𝑟𝑒𝑙𝑎𝑡𝑖𝑜𝑛, 𝑒𝑛𝑡𝑖𝑡𝑦𝑡𝑎𝑖𝑙 ⟩.

(3) We provide a dynamically growing knowledge graph gen-erated by automatically feeding new CTI reports into theexisting KG.We demonstrate the use of this knowledge graphthrough three queries (shown in Section 3).

1.1 What’s new in MalONT2.0MalONT2.0 is a significant improvement over prior version [5] as itcomprehensively captures semantic, syntactic, and factual descrip-tion of a malware threat. Prior version emphasised on contextualinformation coming from semantic data and contained partial fac-tual data. The main source of CTI was unstructured threat reportswritten on diverse threats such as malware, and APT. The latest1https://cve.mitre.org/2https://nvd.nist.gov/3virustotal.com

arX

iv:2

109.

0154

4v1

[cs

.CR

] 3

Sep

202

1

Seoul’21, November 14–19, 2021, Seoul, South Korea Anonymous

knowledge graph is generated from CTI reports that focus exclu-sively on android malware threats. Designing an ontology requiresanswering competency questions that can provide a wide coverageof the domain. These questions (or a narrower version of a com-petency question) are validated by confirming answers to querieson the instantiated triples. While building MalONT2.0, we updatedthe competency questions to the following:

a Find all missing intelligence from the KG relating to variousattack vectors - malware, actors (attacker, attacker-group,organizations, country), infrastructure (software, applica-tions, platform, infrastructure used, TTPs). This informationmay be spread across triples generated across multiple CTIreports describing an attack vector.

b Triples collected from CTI reports from a wide range ofsources can be aggregated with syntactic intelligence from asource like VirusTotal to provide a richer description of theattack vector.

c Identifying similar properties and grouping attack vectorscan reveal latent behaviors and can be used in predictivemodels to forecast future events, both short- and long-term.

Figure 1: Main Classes (left), Relations (right) from Mal-ONT2.0

1.2 Example Knowledge GraphSee Figure 4 (left) for a sub-graph of MalKG. A CTI report containsinformation about a malware, Pegasus mapped toMalware class.This malware is also known as Chrysaor, and it logs user keystrokesand leaks the data of popular apps. The CTI report contains thehash for a single sample of the malware. According to VirusTotal,this sample was first seen in April 2017.

2 SYSTEM ARCHITECTUREThe proposed framework dynamically gathers unstructured CTIreports from the Internet. It extracts threat intelligence informationin the form of RDF triples, assigns them classes and relations fromMalONT2.0 (see Figure 1) forming a new knowledge graph, whichit appends to the existing MalKG. In this section, we describe themain components of this framework .

(1) CTI reports corpus– MalONT2.0 is used to instantiate 25CTI reports written between 2011 – 2021, and downloadedfrom the Internet. We followed the process of natural lan-guage annotation for machine learning [4], created mutuallyagreed upon annotation guidelines, including a tie-breaking

process managed by a security expert. These reports wereauthored by analysts from security organizations such asMcAfee Labs. The annotated text has approximately 3,400tags extracted by annotators using BRAT4 resulting in 1,100entities and 2,300 relations. Triples generated from these arethe structural components of MalKG that capture large-scalefacts related to android malware threat intelligence.

(2) MalONT2.0– Semantic text patterns map to classes (calledentity in KG) and object properties (called relations in KG)defined by MalONT2.0. An ideal open-source ontology cansystematically capture cyber threat and attack information(facts and analysis) to model the contents of CTI reports.For instance, in MalONT2.0, three classes largely describemalware behavior –Malware, Vulnerability, and Indicator.Instances of these classes can equip the analysts with infor-mation on malware behavior and TTPs. See Figure 3 for asnapshot on an instance of MalONT2.0. See GitHub5 code forannotated text and corresponding triples for all CTI reports.

(3) Knowledge Graph Generation and Querying– We con-struct a knowledge graph corresponding to the malwareontology by populating it with triple instances derived fromactual CTI reports. However, one may argue about the ne-cessity concerning knowledge graphs, given that the instan-tiated ontology is previously obtaining facts and knowledgeregarding the domain. The two key features of knowledgegraphs are the ability to reason on deduced information andinfer latent information. MalKG captures properties connect-ing nodes (also called entities) and employs a reasoner todraw associations among entities that would otherwise notbe recognized.

(4) Dynamically generated KG– In addition to the MalKGgenerated from annotated triples, we also present a dynami-cally growing knowledge graph called TINKER. CTI reportsare regularly published by security and technology com-panies. These reports, especially those freely available forpublic access, are shared on the Twitter platform by companypersonnel. We have built an interface using python that usesacademic Twitter API to extract unique occurrences of an-droid malware CTI reports. Triple extraction models trainedon annotated instances are used for dynamic extraction fromfrequent batches of CTI reports.

3 OUTLINE OF POSTER ANDDEMONSTRATION

Our demonstration describes the ontology MalONT2.0 and brieflycompares it with other ontologies, namely UCO [7], MalONT (ourprior work)[5], and Swimmer ontology[6]– arguably the first mal-ware ontology. These ontologies have been chosen for comparisonbased on the competency questions defined in Section 1.1. We alsohave a live demonstration of MalONT2.0 ontology prepared, as wellas queries for the knowledge graph. A few queries of the knowledgegraph are visualized using Neo4j Bloom6 in figures 2, 3, and 4.

4https://brat.nlplab.org/5https://github.com/aiforsec/DemoCCS2021/6https://neo4j.com/developer/neo4j-bloom/

Ontology-driven Knowledge Graph for Android Malware Seoul’21, November 14–19, 2021, Seoul, South Korea

Figure 2: MalKG

Figure 3: Complete sub-graph from a single CTI report

Figure 4: Subgraph of a single malware (left), path fromZitmo malware to other related malware (right).

3.1 MalONT2.0 based annotationsFirst we will demonstrate (a) MalONT2.0 in protege showing allthe classes, sub-classes, and their descriptions, (b) configurationof BRAT7 using the OWL file generated by MalONT2.0 prior toannotating CTI reports, (c) semantic features in MalONT2.0 basedon STIX2.1 framework and the syntactic features and factual databased on VirusTotal, (d) triple generation in the form of RDF fromthe annotated text, and assignment of classes and relations. SeeFigure 5 for a snippet of a McAfee threat report on a spyware,"Golden Cup".

7https://brat.nlplab.org/

Figure 5:McAfee Blog snippet (left), BRAT’s interface (right)

3.2 Demonstrate MalKGWe will demonstrate (a) MalKG generated using only annotatedtriples, (b) MalKG generated using all CTI reports collected so farusing the dynamically growing platform, (c) an entire sub-graph ofa malware extracted from the larger KG. This will include all thetriples connected to this malware including those generated fromthe CTI reports and VirusTotal (see Figure 4).

3.3 Run queries on MalKGWe will demonstrate live queries on the MalKG through the Neo4jplatform. Queries will include searching for entities such as a mal-ware, an application, or a vulnerability and demonstrate the kindsof sub-graphs that can be drawn from them (see Figure 3 and 4).

4 CONCLUSION AND FUTUREWORKWe present MalONT2.0 ontology for capturing contextual threatintelligence from heterogeneous sources. CTI reports provide se-mantic informationwhich is used to instantiate classes and relationsof MalONT2.0. VirusTotal provides additional syntactic informationfor hashes that occur in the CTI reports, and the graph generatedfrom this is appended to the existing MalKG. A dynamic knowl-edge graph, TINKER, is also proposed. For future work, we plan toperform validations on triples generated for TINKER. We also planto demonstrate the use on TINKER for forecasting threat vectors.

5 ACKNOWLEDGEMENTThis work is supported by the IBM AI Research Collaboration(AIRC). The authors would like to thank RPI researchers Erin Turn-bull and Yueting Liao for their support.

REFERENCES[1] Sean Barnum. 2012. Standardizing cyber threat intelligence information with the

Structured Threat Information eXpression (STIX). Mitre Corporation 11 (2012),1–22.

[2] Julie Connolly, Mark Davidson, and Charles Schmidt. 2014. The trusted automatedexchange of indicator information (taxii). The MITRE Corporation (2014), 1–20.

[3] Jay Pujara, Hui Miao, Lise Getoor, and William Cohen. 2013. Knowledge graphidentification. In International Semantic Web Conference. Springer, 542–557.

[4] James Pustejovsky and Amber Stubbs. 2013. Natural Language Annotation forMachine Learning. OReilly Media.

[5] Nidhi Rastogi, Sharmishtha Dutta, Mohammed J Zaki, Alex Gittens, and CharuAggarwal. 2020. MalONT: An ontology for malware threat intelligence. In Inter-national Workshop on Deployable Machine Learning for Security Defense. Springer,28–44.

[6] Morton Swimmer. 2008. Towards an ontology of malware classes. Online] January27 (2008).

[7] Zareen Syed, Ankur Padia, Tim Finin, LisaMathews, andAnupam Joshi. 2016. UCO:A unified cybersecurity ontology. In Workshops at the Thirtieth AAAI Conferenceon Artificial Intelligence.

ontology-driven knowledge graph for android malware

Documents