


Vol.106 (2) June 2015 SOUTH AFRICAN INSTITUTE OF ELECTRICAL ENGINEERS 41

June 2015
Volume 106 No. 2
www.saiee.org.za

Africa Research Journal
ISSN 1991-1696

Research Journal of the South African Institute of Electrical Engineers
Incorporating the SAIEE Transactions



(SAIEE FOUNDED JUNE 1909 INCORPORATED DECEMBER 1909)
AN OFFICIAL JOURNAL OF THE INSTITUTE

ISSN 1991-1696

Secretary and Head Office
Mrs Gerda Geyer
South African Institute for Electrical Engineers (SAIEE)
PO Box 751253, Gardenview, 2047, South Africa
Tel: (27-11) 487 3003
Fax: (27-11) 487 3002
E-mail: [email protected]

SAIEE AFRICA RESEARCH JOURNAL

ARTICLES SUBMITTED TO THE SAIEE AFRICA RESEARCH JOURNAL ARE FULLY PEER REVIEWED PRIOR TO ACCEPTANCE FOR PUBLICATION. Additional reviewers are approached as necessary.

The following organisations have listed SAIEE Africa Research Journal for abstraction purposes:

INSPEC (The Institution of Electrical Engineers, London); ‘The Engineering Index’ (Engineering Information Inc.)

Unless otherwise stated on the first page of a published paper, copyright in all materials appearing in this publication vests in the SAIEE. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, magnetic tape, mechanical, photocopying, recording or otherwise, without permission in writing from the SAIEE. Notwithstanding the foregoing, permission is not required to make abstracts on condition that a full reference to the source is shown. Single copies of any material in which the Institute holds copyright may be made for research or private use purposes without reference to the SAIEE.

EDITORS AND REVIEWERS
EDITOR-IN-CHIEF
Prof. B.M. Lacquet, Faculty of Engineering and the Built Environment, University of the Witwatersrand, Johannesburg, SA, [email protected]

MANAGING EDITOR
Prof. S. Sinha, Faculty of Engineering and the Built Environment, University of Johannesburg, SA, [email protected]

SPECIALIST EDITORS
Communications and Signal Processing:
Prof. L.P. Linde, Dept. of Electrical, Electronic & Computer Engineering, University of Pretoria, SA
Prof. S. Maharaj, Dept. of Electrical, Electronic & Computer Engineering, University of Pretoria, SA
Dr O. Holland, Centre for Telecommunications Research, London, UK
Prof. F. Takawira, School of Electrical and Information Engineering, University of the Witwatersrand, Johannesburg, SA
Prof. A.J. Han Vinck, University of Duisburg-Essen, Germany
Dr E. Golovins, DCLF Laboratory, National Metrology Institute of South Africa (NMISA), Pretoria, SA
Computer, Information Systems and Software Engineering:
Dr M. Weststrate, Newco Holdings, Pretoria, SA
Prof. A. van der Merwe, Department of Informatics, University of Pretoria, SA
Dr C. van der Walt, Modelling and Digital Science, Council for Scientific and Industrial Research, Pretoria, SA
Prof. B. Dwolatzky, Joburg Centre for Software Engineering, University of the Witwatersrand, Johannesburg, SA
Control and Automation:
Dr B. Yuksel, Advanced Technology R&D Centre, Mitsubishi Electric Corporation, Japan
Prof. T. van Niekerk, Dept. of Mechatronics, Nelson Mandela Metropolitan University, Port Elizabeth, SA
Electromagnetics and Antennas:
Prof. J.H. Cloete, Dept. of Electrical and Electronic Engineering, Stellenbosch University, SA
Prof. T.J.O. Afullo, School of Electrical, Electronic and Computer Engineering, University of KwaZulu-Natal, Durban, SA
Prof. R. Geschke, Dept. of Electrical and Electronic Engineering, University of Cape Town, SA
Dr B. Jokanović, Institute of Physics, Belgrade, Serbia
Electron Devices and Circuits:
Dr M. Božanić, Azoteq (Pty) Ltd, Pretoria, SA
Prof. M. du Plessis, Dept. of Electrical, Electronic & Computer Engineering, University of Pretoria, SA
Dr D. Foty, Gilgamesh Associates, LLC, Vermont, USA
Energy and Power Systems:
Prof. M. Delimar, Faculty of Electrical Engineering and Computing, University of Zagreb, Croatia
Dr A.J. Grobler, School of Electrical, Electronic and Computer Engineering, North-West University, SA
Engineering and Technology Management:
Prof. J-H. Pretorius, Faculty of Engineering and the Built Environment, University of Johannesburg, SA

Prof. L. Pretorius, Dept. of Engineering and Technology Management, University of Pretoria, SA

Engineering in Medicine and Biology:
Prof. J.J. Hanekom, Dept. of Electrical, Electronic & Computer Engineering, University of Pretoria, SA
Prof. F. Rattay, Vienna University of Technology, Austria
Prof. B. Bonham, University of California, San Francisco, USA

General Topics / Editors-at-large:
Dr P.J. Cilliers, Hermanus Magnetic Observatory, Hermanus, SA
Prof. M.A. van Wyk, School of Electrical and Information Engineering, University of the Witwatersrand, Johannesburg, SA

INTERNATIONAL PANEL OF REVIEWERS
W. Boeck, Technical University of Munich, Germany
W.A. Brading, New Zealand
Prof. G. De Jager, Dept. of Electrical Engineering, University of Cape Town, SA
Prof. B. Downing, Dept. of Electrical Engineering, University of Cape Town, SA
Dr W. Drury, Control Techniques Ltd, UK
P.D. Evans, Dept. of Electrical, Electronic & Computer Engineering, The University of Birmingham, UK
Prof. J.A. Ferreira, Electrical Power Processing Unit, Delft University of Technology, The Netherlands
O. Flower, University of Warwick, UK
Prof. H.L. Hartnagel, Dept. of Electrical Engineering and Information Technology, Technical University of Darmstadt, Germany
C.F. Landy, Engineering Systems Inc., USA
D.A. Marshall, ALSTOM T&D, France
Dr M.D. McCulloch, Dept. of Engineering Science, Oxford, UK
Prof. D.A. McNamara, University of Ottawa, Canada
M. Milner, Hugh MacMillan Rehabilitation Centre, Canada
Prof. A. Petroianu, Dept. of Electrical Engineering, University of Cape Town, SA
Prof. K.F. Poole, Holcombe Dept. of Electrical and Computer Engineering, Clemson University, USA
Prof. J.P. Reynders, Dept. of Electrical & Information Engineering, University of the Witwatersrand, Johannesburg, SA
I.S. Shaw, University of Johannesburg, SA
H.W. van der Broeck, Phillips Forschungslabor Aachen, Germany
Prof. P.W. van der Walt, Stellenbosch University, SA
Prof. J.D. van Wyk, Dept. of Electrical and Computer Engineering, Virginia Tech, USA
R.T. Waters, UK
T.J. Williams, Purdue University, USA

Published by
South African Institute of Electrical Engineers (Pty) Ltd, PO Box 751253, Gardenview, 2047
Tel. (27-11) 487 3003, Fax. (27-11) 487-3002, E-mail: [email protected]

President: Mr André Hoffmann
Deputy President: Mr TC Madikane
Senior Vice President: Mr Jacob Machinijke
Junior Vice President: Dr Hendri Geldenhuys
Immediate Past President: Dr Pat Naidoo
Honorary Vice President: Mr Max Clarke


VOL 106 No 2, June 2015

SAIEE Africa Research Journal


A Sandbox-based Approach to the Deobfuscation and Dissection of PHP-based Malware
P. Wrench and B. Irwin .............................................................. 46

The Impact of Triggers on Forensic Acquisition and Analysis of Databases
W. K. Hauger and M. S. Olivier .................................................. 64

An Investigation into Reducing Third Party Privacy Breaches during the Investigation of Cybercrime
W.J. van Staden ......................................................................... 74

A Multi-faceted Model for IP-based Service Authorization in the Eduroam Network
L. Tekeni, R. Botha and K. Thomson .......................................... 83

Secure Separation of Shared Caches in AMP-based Mixed-Criticality Systems
P. Schnarz, C. Fischer, J. Wietzke, I. Stengel ............................. 93

SAIEE AFRICA RESEARCH JOURNAL EDITORIAL STAFF .......................... IFC


GUEST EDITORIAL
INFORMATION SECURITY SOUTH AFRICA (ISSA) 2014

This special issue of the SAIEE Africa Research Journal is devoted to selected papers from the Information Security South Africa (ISSA) 2014 Conference which was held in Johannesburg, South Africa from 13 to 14 August 2014. The aim of the annual ISSA conference is to provide information security practitioners and researchers, from all over the globe, an opportunity to share their knowledge and research results with their peers. The 2014 conference focused on a wide spectrum of aspects in the information security domain. The fact that issues covering the functional, business, managerial, human, theoretical and technological aspects were addressed emphasizes the wide multi-disciplinary nature of modern-day information security.

With the assistance of the original reviewers, twelve conference papers that received good overall reviews were identified. I attended the presentation of each of these twelve papers at the conference and, based on the reviewer reports and the presentations, selected eight of them for possible publication in this Special Edition. The authors of the eight selected papers were asked to rework their papers by expanding and/or further formalizing the research conducted. Each paper was subsequently reviewed again by a minimum of three international subject specialists; where conflicting reviews were received, further reviews were requested, in some cases as many as five, to enable objective, quality decisions to be made. In all cases the reviews were conducted by members of Technical Committee (TC) 11 of the International Federation for Information Processing (IFIP), or by subject experts suggested by them, so enough reviews were received from reputable specialists both to make confident decisions and to improve the relevant papers.

In the end, five papers were selected for publication in this Special Edition after the reviewer comments had been satisfactorily addressed. These five papers, covering malware protection, database forensics, secure operating systems, privacy and service authorization, are quite diverse, providing a true reflection of the multi-disciplinary nature of information security as a field of study.

I would like to thank the members of Technical Committee 11 of the International Federation for Information Processing (IFIP) who assisted so diligently with the reviewing of these papers, as well as the other subject experts proposed by them. Their thorough reviews certainly contributed to the quality of this Special Edition.

Lastly, I would like to express my appreciation to IEEE Xplore, which originally published the ISSA conference papers, for granting permission for these reworked papers to be published in this Special Edition.

Prof Rossouw von Solms
Guest Editor


A SANDBOX-BASED APPROACH TO THE DEOBFUSCATION AND DISSECTION OF PHP-BASED MALWARE

P. Wrench∗ and B. Irwin†

∗ Department of Computer Science, Rhodes University, P.O. Box 94, Grahamstown 6140, Email: [email protected]
† Department of Computer Science, Rhodes University, P.O. Box 94, Grahamstown 6140, Email: [email protected]

Abstract: The creation and proliferation of PHP-based Remote Access Trojans (or web shells) used in both the compromise and post-exploitation of web platforms has fuelled research into automated methods of dissecting and analysing these shells. Current malware tools disguise themselves by making use of obfuscation techniques designed to frustrate any efforts to dissect or reverse engineer the code. Advanced code engineering can even cause malware to behave differently if it detects that it is not running on the system for which it was originally targeted. To combat these defensive techniques, this paper presents a sandbox-based environment that aims to accurately mimic a vulnerable host and is capable of semi-automatic semantic dissection and syntactic deobfuscation of PHP code.

Key words: Code deobfuscation, Sandboxing, Reverse engineering

1. INTRODUCTION

The overwhelming popularity of PHP as a hosting platform [1] has made it the language of choice for developers of Remote Access Trojans (RATs) and other malicious software [2]. Web shells are typically used to compromise and monetise web platforms by providing the attacker with basic remote access to the system, including file transfer, command execution, network reconnaissance and database connectivity. Once compromised, systems can be used to defraud users by hosting phishing sites, to perform Distributed Denial of Service (DDoS) attacks, or to serve as anonymous platforms for sending spam and other malfeasance [3].

The proliferation of such malware has become increasingly aggressive in recent years, with some monitoring institutes registering over 70 000 new threats every day [4]. The sheer volume of software and the rate at which it is able to spread make traditional, static signature-matching infeasible as a method of detection [5, 6]. Previous research has found that automated and dynamic approaches capable of identifying malware based on its semantic behaviour in a sandbox environment fare much better against the many variations that are constantly being created [5, 7]. Furthermore, many malware tools disguise themselves by making extensive use of obfuscation techniques designed to frustrate any efforts to dissect or reverse engineer the code [8]. Advanced code engineering can even cause malware to behave differently if it detects that it is not running on the system for which it was originally targeted [9]. To combat these defensive techniques, this project intended to create a sandbox environment that accurately mimics a vulnerable host and is capable of semi-automatic semantic dissection and syntactic deobfuscation of PHP code. The novel technique of performing deobfuscation based on the identification and reversal of common obfuscation idioms proved highly effective in revealing hidden code. Although not included in the scope of this project, this could act as a stepping stone for the identification of PHP-based malware on production servers.

This paper expands substantially on a work previously published [10] in the proceedings of the 13th Annual Information Security South Africa Conference held in 2014. The two most substantial extensions were the reworking of the Design and Implementation section to provide more detail on how the two major components were configured, and the inclusion of additional testing outcomes in the Results section. The Summary section was also updated to reflect these two changes and to provide a more in-depth analysis of the new results.

2. PAPER STRUCTURE

The remainder of the paper commences with an overview of the PHP language and its relevant features in Section 3, along with an outline of a typical web shell and its common capabilities. The concept of code obfuscation is also introduced, with particular emphasis on how it is typically achieved in PHP. Several alternate deobfuscation techniques are briefly discussed, along with dynamic approaches to code analysis such as sandboxing. Section 4 details how the system was designed and implemented, outlining the structure and functionality of the two main components (namely the decoder and the sandbox). The results obtained during system testing are presented in Section 5. The work concludes in Sections 6 and 7, which provide a summary of the results and present ideas for future work and improvement.

3. BACKGROUND AND PREVIOUS WORK

The deobfuscation and dissection of PHP-based malware is a non-trivial task with no well-defined generalised solution. Many different techniques and approaches can be found in the literature, each with their own advantages and limitations [8, 11, 12, 13, 14]. In an attempt to evaluate these approaches, this section provides an overview of PHP, a description of the structure and capabilities of typical web shells, and an overview of both code obfuscation and dissection techniques.

Based on: “Towards a Sandbox for the Deobfuscation and Dissection of PHP Malware”, by Peter Wrench and Barry Irwin, which appeared in the Proceedings of Information Security South Africa (ISSA) 2014, Johannesburg, 13 & 14 August 2014. © 2014 IEEE

3.1 PHP Overview

PHP (a recursive acronym for PHP: Hypertext Preprocessor) is a general-purpose scripting language that is primarily used for the development and maintenance of dynamic web pages. First conceived in 1994 by Rasmus Lerdorf [15], the power and ease of use of PHP have enabled it to become the world's most popular server-side scripting language by number of deployments. Using PHP, it is possible to transform traditional static web pages with predefined content into pages capable of displaying dynamic content based on a set of parameters. Although originally developed as a purely interpreted language, multiple compilers have since been developed for PHP, allowing it to function as a platform for standalone applications. Since 2001, the reference releases of PHP have been issued and managed by The PHP Group [16].

Language Features:

Much of the popularity of PHP can be attributed to its relatively shallow learning curve. Users familiar with the syntax of C++, C#, Java or Perl are able to gain an understanding of PHP with ease, as many of the basic programming constructs have been adapted from these C-style languages [15, 17]. As is the case with more recent derivatives of C, users need not concern themselves with memory or pointer management, both of which are dealt with by the PHP interpreter [18]. The documentation provided by The PHP Group is concise and comprehensively describes the many built-in functions that are included in the language's core distribution [19]. The simple syntax, recognisable programming constructs, and thorough documentation combine to allow even novice programmers to become reasonably proficient in a short space of time.

PHP is compatible with a vast number of platforms, including all variants of UNIX, Windows, Solaris, OpenBSD and Mac OS X [15]. Although most commonly used in conjunction with the Apache web server, PHP also supports a variety of other servers, such as the Common Gateway Interface, Microsoft's Internet Information Services, Netscape iPlanet and Java servlet engines [15]. Its core libraries provide functionality for string manipulation, database and network connectivity, and file system support [15, 16], giving PHP unparalleled flexibility in terms of deployment and operation.

As an open source language, PHP can be modified to suit the developer. In an effort to ensure stability and uniformity, however, reference implementations of the language are periodically released by The PHP Group [16]. This rapid development cycle ensures that bug fixes and additional functionality are readily available, and has contributed directly to PHP's reputation as one of the most widely supported open source languages in circulation today [15, 20]. An abundance of code samples and programming resources exists on the Internet in addition to the standard documentation [21, 22, 23], and many extensions have been created and published by third-party developers [24].

Performance and Use:

PHP is most commonly deployed as part of the LAMP (Linux, Apache, MySQL and PHP/Perl/Python) stack [25]. It is a server-side scripting language in that the PHP code embedded in a page will be executed by the interpreter on the server before that page is served to the client [16]. This means that it is not possible for a client to know what PHP code has been executed – they are only able to see the result. The purpose of this preprocessing is to allow for the creation of dynamic pages that can be customised and served to clients on the fly [15].

When implemented as an interpreted language, studies have found PHP to be noticeably slower than compiled languages such as Java and C [26, 27]. However, since version 4, PHP code has been compiled into bytecode that can then be executed by the Zend Engine, dramatically increasing efficiency and allowing PHP to outperform competitors written in other languages (such as Axis2 and the Java Servlets Package) [28, 29, 30]. Performance can be further enhanced by deploying commonly-used PHP scripts as executable files, eliminating the need to recompile them each time they are run [31].

At the time of writing, PHP was being used as the primary server-side scripting language by over 240 million websites, with its core module, mod_php, logging the most downloads of any Apache HTTP module [32]. Of the websites that disclosed their scripting language (several chose not to for security reasons), 79.8% were running some implementation of PHP, including popular sites such as Facebook, Baidu, Wikipedia, and Wordpress [33].

Security:

A study of the United States National Vulnerability Database performed in April 2013 found that approximately 30% of all reported vulnerabilities were related to PHP [34]. Although this figure might seem alarmingly high, it is important to note that most of these are not vulnerabilities in the language itself, but are rather the result of poor programming practices employed by PHP developers. In 2008, for example, a mere 19 core PHP vulnerabilities were discovered, along with just four in the language's libraries [34]. These numbers represent a small percentage of the 2218 total vulnerabilities reported in the same year [34].


Apart from a lack of knowledge and caution on the part of PHP developers, the most plausible explanation for the large number of vulnerabilities involving PHP is that the language is specifically being targeted by hackers. Because of its popularity, any exploit targeting PHP can potentially be used to compromise a multitude of other systems running the same language implementation [34]. PHP bugs are thus highly sought after because of the high pay-off associated with their discovery. This mentality is clearly demonstrated in the recent spate of exploits targeting open source PHP-based Content Management Systems like phpBB, PostNuke, Mambo, Drupal, and Joomla, the last of which has over 30 million registered users [35, 36].

3.2 Web Shells

Remote Access Trojans (or web shells) are small scripts designed to be uploaded onto production servers. They are so named because they will often masquerade as a legitimate program or file. Once in place, these shells act as a backdoor, allowing a remote operator to control the server as if they had physical access to it [37]. Any server that allows a client to upload files (usually via the HTTP POST method or compromised FTP) is vulnerable to infection by malicious web shells.

In addition to basic remote administration capabilities, most web shells include a host of other features, such as access to the local file system, keystroke logging, registry editing, and packet sniffing capabilities [3].

The motive behind the use of web shells to compromise servers is usually financial gain. Compromised servers can either be monetised directly (by selling access to them to a third party) or indirectly, by using the servers to facilitate fraudulent activities. Common uses include the establishment of phishing sites, piracy servers, malware download sites, and spam sites [3, 37].

The structure of a web shell can vary according to its intended function. Smaller, more limited shells are better at avoiding detection, and are often used to secure initial access to a system. These shells can then be used to upload a more fully-featured RAT when it is less likely to be noticed.

3.3 Code Obfuscation

Code obfuscation is a program transformation intended to thwart reverse engineering attempts. The resulting program should be functionally identical to the original, but may produce additional side effects in an attempt to disguise its true nature.

In their seminal work detailing the taxonomy of obfuscation transforms, Collberg et al. [38] define a code obfuscator as a “potent transformation that preserves the observable behaviour of programs”. The concept of “observable behaviour” is defined as behaviour that can be observed by the user, and deliberately excludes the distracting side effects mentioned above, provided that they are not discernible during normal execution. A transformation can be classified as potent if it produces code that is more complex than the original.

All methods of code obfuscation can be evaluated according to three metrics [38]:

• Potency – the extent to which the obfuscated code is able to confuse a human reader

• Resilience – the level of resistance to automated deobfuscation techniques

• Cost – the amount of overhead that is added to the program as a result of the transformation

Although primarily used by authors of legitimate software as a method of protecting technical secrets, code obfuscation is also employed by malware authors to hide their malicious code. Reverse engineering obfuscated malware can be tedious, as the obfuscation process complicates the instruction sequences, disrupts the control flow, and makes the algorithms difficult to understand. Manual deobfuscation in particular is so time-consuming and error-prone that it is often not worth the effort.

Although the number of code obfuscation methods is limited only by the creativity of the obfuscator, the ones listed in the sections below fall neatly into the three categories of layout, data and control obfuscation [39]. Each category boasts methods of varying potency, and a powerful obfuscator should employ methods from each category to achieve a high level of obfuscation.

3.4 Code Obfuscation and PHP

As an interpreted language with object-oriented features, PHP can be obfuscated using all of the methods detailed above. In addition to this, the language contains several functions that directly support the protection/hiding of code and which are often combined to form the following obfuscation idiom:

eval(gzinflate(base64_decode($str)))

The string containing the malicious code is compressed before being encoded in base64. At runtime, the process is reversed, and the resulting code is executed through the use of the eval() function.

The ubiquitous nature of idioms such as these means that they can be used as a means of detecting obfuscated code. Although seemingly complex, code obfuscated in this manner can easily be neutralised and analysed for potential backdoors. Replacing the eval() function with an echo statement will display the hidden code instead of executing it, allowing the user to determine whether it is safe to run. This process can be automated using PHP's built-in function overriding mechanisms.
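The reversal of this idiom can also be sketched outside of PHP. The following Python sketch (the payload string and helper names are illustrative, not taken from the paper's implementation) mimics gzinflate() with a raw DEFLATE decompress and reveals a stand-in payload without ever executing it:

```python
import base64
import zlib

def php_gzinflate(data: bytes) -> bytes:
    """Mimic PHP's gzinflate(): a raw DEFLATE stream with no zlib/gzip header."""
    return zlib.decompress(data, -zlib.MAX_WBITS)

def reveal_payload(encoded: str) -> str:
    """Reverse eval(gzinflate(base64_decode($str))) without executing the
    hidden code -- the analyst's 'echo' in place of 'eval'."""
    return php_gzinflate(base64.b64decode(encoded)).decode("utf-8", "replace")

# Build a stand-in obfuscated payload the way an obfuscator would
# (harmless illustrative code, not real malware).
hidden = "echo 'backdoor installed';"
compressor = zlib.compressobj(wbits=-zlib.MAX_WBITS)
blob = base64.b64encode(compressor.compress(hidden.encode()) + compressor.flush())

print(reveal_payload(blob.decode()))  # displays the hidden source, never runs it
```

Because the decoding chain is fully deterministic, repeating the same three calls also undoes nested applications of the idiom.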

Page 9: Vol.106 (2) June 2015 SOUTH AFRICAN INSTITUTE OF … · 2016-08-23 · 46 SOUTH AFRICAN INSTITUTE OF ELECTRICAL ENGINEERS Vol.106 (2) June 2015 A SANDBOX-BASED APPROACH TO THE DEOBFUSCATION

Vol.106 (2) June 2015 SOUTH AFRICAN INSTITUTE OF ELECTRICAL ENGINEERS 49

3.5 Deobfuscation Techniques

The obfuscation methods described in the previous sections are all designed to prevent code from being reverse engineered. Given enough time and resources, however, a determined deobfuscator will always be able to restore the code to its original state. This is because perfect obfuscation is provably impossible, as is demonstrated by Barak et al. [40] in their seminal paper “On the (Im)possibility of Obfuscating Programs”. Collberg et al. [38] concur, postulating that every method of code obfuscation simply “embeds a bogus program within a real program” and that an obfuscated program essentially consists of “a real program which performs a useful task and a bogus program that computes useless information”. Bearing this in mind, it is useful to review the techniques that are widely employed by existing deobfuscation systems:

• Pattern matching – the detection and removal of known bogus code segments

• Program slicing – the decomposition of a program into manageable units that can then be evaluated individually

• Statistical analysis – the replacement of expressions that are discovered to always produce the same value with that value

• Partial evaluation – the removal of the static part of the program so as to evaluate just the remaining dynamic expressions
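Of these techniques, statistical analysis is compact enough to sketch. The following Python fragment (operating on Python source for brevity; the idea transfers to PHP) replaces constant subexpressions with their values using the standard ast module:

```python
import ast
import operator

# Map AST operator nodes to concrete functions (a small illustrative subset).
OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.FloorDiv: operator.floordiv}

class ConstantFolder(ast.NodeTransformer):
    """Replace expressions that always produce the same value with that
    value -- the 'statistical analysis' transform in miniature."""
    def visit_BinOp(self, node):
        self.generic_visit(node)  # fold the leaves first
        if (isinstance(node.left, ast.Constant)
                and isinstance(node.right, ast.Constant)
                and type(node.op) in OPS):
            value = OPS[type(node.op)](node.left.value, node.right.value)
            return ast.copy_location(ast.Constant(value), node)
        return node

def fold(source: str) -> str:
    tree = ConstantFolder().visit(ast.parse(source))
    return ast.unparse(ast.fix_missing_locations(tree))

print(fold("x = 2 * 3 + 4"))  # -> x = 10
```

A practical deobfuscator would extend the same traversal to string concatenation and calls with constant arguments, which is where obfuscators hide most of their indirection.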

3.6 Code Dissection

The process of analysing the behaviour of a computer program by examining its source code is known as code dissection or semantic analysis [41]. The main goal of the dissection process is to extract the primary features of the source program and, in the case of malicious software, to neutralise and report on any undesirable actions [42]. Sophisticated anti-malware programs go beyond traditional signature matching techniques, employing advanced methods of detection such as sandboxing and behaviour analysis [43].

3.7 Static Dissection Techniques

Static analysis approaches attempt to examine code without running it [44]. Because of this, these approaches have the benefit of being immune to any potentially malicious side effects. The lack of runtime information such as variable values and execution traces does limit the scope of static approaches, but they are still useful for exposing the structure of code and comparing it to previously analysed samples [45].

Signature Matching:

A software signature is a characteristic byte sequence that can be used to uniquely identify a piece of code [45]. Anti-malware solutions make use of static signatures to detect malicious programs by comparing the signature of an unknown program to a large database containing the signatures of all known malware – if the signatures match, the unknown program is flagged as suspicious. This kind of detection can easily be overcome by making trivial changes to the source code of a piece of malware, thereby modifying its signature [45].
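A minimal sketch of this mechanism, assuming a hypothetical digest database, shows both its precision and its fragility:

```python
import hashlib

# Hypothetical signature database: SHA-256 digests of known-malicious samples.
KNOWN_MALWARE = {
    hashlib.sha256(b"<?php eval($_POST['cmd']); ?>").hexdigest(),
}

def is_known_malicious(sample: bytes) -> bool:
    """Static signature matching: flag a sample only on an exact digest hit."""
    return hashlib.sha256(sample).hexdigest() in KNOWN_MALWARE

print(is_known_malicious(b"<?php eval($_POST['cmd']); ?>"))   # True
# A single inserted space changes the digest and defeats the signature:
print(is_known_malicious(b"<?php eval( $_POST['cmd']); ?>"))  # False
```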

Pattern Matching:

Pattern matching is a generalised form of signature matching in which patterns and heuristics are used in place of signatures to analyse pieces of code [45]. This allows pattern matching systems to recognise and flag code that contains patterns that have been found in previously analysed malware samples, which, although an improvement on signature matching, is still insufficient to identify newly developed malware [45]. Patterns that are too general will lead to false positives (benign code that is incorrectly classified as malicious), whereas patterns that are too specific will suffer from the same restrictions faced by signature matching [45].
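As a toy illustration of this idea, a scanner might map descriptive labels to regular expressions for constructs frequently seen in obfuscated PHP malware. The patterns below are our own illustrative examples, not the heuristics of any real anti-malware product.

```php
<?php
// A toy pattern-matching scanner: each entry maps a descriptive label to
// a regex for a construct often seen in obfuscated PHP malware. These
// patterns are illustrative examples only.
$patterns = [
    'eval of decoded data'     => '/eval\s*\(\s*(gzinflate|base64_decode|str_rot13)/i',
    'preg_replace /e modifier' => '/preg_replace\s*\(\s*[\'"][^\'"]*\/e[\'"]/i',
    'user input executed'      => '/\$[a-zA-Z_]\w*\s*\(\s*\$_(GET|POST|REQUEST)/',
];

// Return the labels of every pattern that matches the given source code.
function scanForPatterns(string $code, array $patterns): array {
    $hits = [];
    foreach ($patterns as $label => $regex) {
        if (preg_match($regex, $code)) {
            $hits[] = $label;
        }
    }
    return $hits;
}

print_r(scanForPatterns('eval(gzinflate(base64_decode($payload)));', $patterns));
```

Overly broad patterns in such a table would produce exactly the false positives described above, which is why the labels are tied to fairly specific syntactic combinations.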

3.8 Dynamic Dissection Techniques

Dynamic approaches to analysis extract information about a program’s functioning by monitoring it during execution [44]. These approaches examine how a program behaves and are best confined to a virtual environment such as a sandbox so as to minimise the exposure of the host system to infection [44].

API Hooking:

Application programming interface (API) hooking is a technique used to intercept function calls between an application and an operating system’s different APIs [46]. In the context of code dissection, API hooking is usually carried out to monitor the behaviour of a potentially malicious program [47]. This is achieved by altering the code at the start of the function that the program has requested access to before it actually accesses it, redirecting the request to the user’s own injected code [47]. The request can then be examined to determine the exact behaviour exhibited by the program before it is directed back to the original function code [46].

The precision and volume of code required for correct API hooking mean that behaviour monitoring systems that make use of the technique are complex and time-consuming to implement [47]. They are also virtually undetectable and thoroughly customisable (only functions relevant to behaviour analysis need be hooked) [47].


Vol.106 (2) June 2015 SOUTH AFRICAN INSTITUTE OF ELECTRICAL ENGINEERS 50

Sandboxes and Function Overriding:

A sandbox is a restricted programming environment that is used to separate running programs [48]. Malicious code can safely be run in a sandbox without affecting the host system, making it an ideal platform for the observation of malware behaviour [49].

PHP’s Runkit extension contains the Runkit Sandbox class, which is capable of executing PHP code in a sandbox environment [50]. This class creates its own execution thread upon instantiation, defines a new scope and constructs a new program stack, effectively isolating any code that is run within it from other active processes [50]. Other options are also provided to further restrict the sandbox environment [50]:

• The safe_mode_include_dir option can be used to specify a single directory from which modules can be included in the sandbox.

• The open_basedir option can be used to specify a single directory that can be accessed from within the sandbox.

• The allow_url_fopen option can be set to false to prevent code in the sandbox from accessing content on the Internet.

• The disable_functions and disable_classes options can be used to prevent specified functions and classes from being used inside the sandbox.

Of particular interest to a developer of a code dissection system is the runkit.internal_override configuration directive, which can be used to enable the ability to modify, remove or rename functions within the sandbox [50]. This can facilitate the dissection of PHP code by providing the functionality to replace functions associated with code obfuscation (such as eval()) with benign functions that merely report an attempt to execute a string of PHP code [50]. Network activity could be monitored in much the same way – calls to fopen() on URLs could be replaced by an echo statement that prints out the URL that was requested by the code.
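A sandbox restricted along these lines might be instantiated as in the following configuration sketch. It requires PHP's Runkit extension, and the directory paths and disabled-function list are illustrative values of our own, not those used by the system described here.

```php
<?php
// Illustrative Runkit_Sandbox configuration (requires the Runkit
// extension; paths and disabled-function lists are example values).
$options = [
    'safe_mode'             => true,
    'safe_mode_include_dir' => '/var/sandbox/include',
    'open_basedir'          => '/var/sandbox/files',
    'allow_url_fopen'       => 'false',  // no Internet access from inside
    'disable_functions'     => 'exec,passthru,shell_exec,system',
    'disable_classes'       => '',
];

// The options array is passed to the constructor, which configures
// the isolated sub-interpreter accordingly.
$sandbox = new Runkit_Sandbox($options);
```
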

4. DESIGN AND IMPLEMENTATION

The development of a system capable of analysing PHP shells required the design and construction of two main components: the decoder and the sandbox. The environment in which both of these components were developed and run is detailed in Section 4.2. The design and implementation of the decoder responsible for code normalisation and deobfuscation is presented in Section 4.5, and the next stage of the analytical process, the sandbox capable of dynamic shell analysis, is described in Section 4.6.

4.1 Scope and Limits

The system was originally envisioned as consisting of three distinct components (the decoder, the sandbox, and the reporter) that would communicate via a database. As development progressed, it was found that a separate reporting component would necessitate complex communication between itself, the other components, and the database. For this reason, the design of the system was modified and each component was made responsible for reporting on its own activities. The closer coupling between the components and the feedback mechanisms allows information relating to each stage in the process of shell analysis to be relayed to the user as it occurs – deobfuscation results are displayed during static analysis, and the results of executing the shell in the sandbox environment are displayed during dynamic analysis.

4.2 Architecture, Operating System and Database

While the deobfuscation and dissection of PHP shells is a nontrivial task, neither of the stages involved in the process is computationally intensive. It was thus not necessary to acquire any special hardware – the system was simply developed and run on the lab machines provided by Rhodes University.

A core part of the system is the sandbox environment, which is designed to safely execute potentially malicious PHP code. This component relies heavily on the Runkit Sandbox class that forms part of PHP’s Runkit extension package [50]. Since this extension is not available as a dynamic-link library (DLL) or Windows binary, a decision was made to develop the system in a Linux environment. Ubuntu (version 12.10) was chosen because of its familiarity and status as the most popular (and therefore most widely supported) Linux distribution. Another welcome byproduct of Ubuntu’s popularity is the abundance of Ubuntu-specific tutorials for procedures such as setting up web servers, installing and configuring libraries, and setting file permissions, all of which were useful during the development period.

VMware Player is used to run Ubuntu in a virtual machine environment. The primary reason for this is to protect the host machine from being affected by any malicious actions performed by the PHP shells during execution and to provide greater control over the development environment. Although the Runkit Sandbox class can be configured to restrict the activities of such shells (see Section 4.6), there is still a risk that an incorrectly configured option or unforeseen action on the part of the shell could corrupt the system in some way. Backups of the virtual machine were therefore made on a regular basis. These backups had the added benefit of acting as a version control system that permitted rollback in the event of system failure due to shell activity or errors that arose during development. Traditional version control systems such as Git would have worked well with just the source files, but since the project involved extensive recompilation and configuration of both PHP and Apache,


it proved more expedient to back up snapshots of the entire virtual machine.

Both the decoder and the sandbox components make use of a MySQL database for the persistent storage of web shells. PHP scripts being analysed are stored by computing the MD5 hash of the raw code and using the resulting 32-character hexadecimal string as the primary key. MD5 was chosen because it is faster than other common hashing algorithms such as SHA-1 and SHA-256 [51]. Each MD5 hash is then checked against the previously analysed code stored in the database to prevent duplication. Once the shell has been decoded, the resulting deobfuscated and normalised version of the code is stored alongside the hash and the raw code in the database. This deobfuscated code is what is then executed in the sandbox environment. A flowchart depicting the passage of a shell through the system is shown in Figure 1.
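The hash-keyed storage step can be sketched as follows. This is a simplified stand-in in which an in-memory array takes the place of the MySQL table; the function name is ours, not the system's.

```php
<?php
// Simplified sketch of the MD5-keyed storage scheme: the raw code is
// hashed, the hash serves as the primary key, and duplicates are
// skipped. An array stands in for the MySQL table of the real system.
$shellTable = [];

function storeShell(array &$table, string $rawCode): bool {
    $key = md5($rawCode);            // 32-character hexadecimal digest
    if (isset($table[$key])) {
        return false;                // already analysed: skip duplicate
    }
    $table[$key] = [
        'raw'     => $rawCode,
        'decoded' => null,           // filled in after deobfuscation
    ];
    return true;
}

$code = '<?php eval($_POST["x"]); ?>';
var_dump(storeShell($shellTable, $code)); // first upload is stored
var_dump(storeShell($shellTable, $code)); // duplicate is rejected
```
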

Figure 1: The path of a web shell through the system

4.3 Web Server Sandbox

The PHP shells that the system was created to dissect and analyse are all designed to be uploaded onto web servers, thereby providing remote access to an attacker [52]. For this reason, many of the shells only function correctly when run in a web server environment – advanced scripts fail to begin executing at all if they do not detect an HTTP server and its associated environment variables [53, 9]. The system was thus designed to closely mimic conditions that might be found on a real-world web platform to facilitate correct shell execution and allow analysis to take place.

In pursuit of this goal, an Apache HTTP server was installed inside the virtual machine. This server can be accessed via the loopback network interface by directing a web browser in the virtual machine to the default localhost address of 127.0.0.1. Although the virtual machine itself has no access to the broader Internet, shells executing inside the sandbox are barred from making web requests as an added precaution. This restriction was achieved by modifying the configuration options of the Runkit Sandbox class (see Section 4.6 for full details of how the sandbox was configured).

Choice of Apache:

As the world’s most popular HTTP server, Apache is used to power over half of all websites on the Internet [54]. Its rampant popularity made it an ideal choice for this project for two reasons. Firstly, as was the case with Ubuntu, many installation and configuration guides are available for Apache. Since it was necessary to compile the web server from its source code (Ubuntu’s Advanced Packaging Tool does not allow configuration options relating to non-standard modules such as Runkit and PHP to be set; it simply performs a default install of commonly used modules), these guides and the documentation provided by the Apache Software Foundation proved invaluable. Secondly, Apache’s popularity means that it is also well supported by the developers of web shells – a significant number of these shells are able to run on the Apache HTTP server.

Apache was also chosen as the preferred web server because of its modular design and the abundance of modules available for use. Its behaviour can be modified by enabling and disabling these modules, allowing it to be tailored to suit the needs of any system designed to run on it. This modularity also allows it to be compatible with a wide variety of languages used for server-side scripting, including PHP, the language used to develop this system. Furthermore, PHP’s Runkit Sandbox class, a core part of the sandbox environment, requires that both the underlying web server and PHP itself support thread safety. This was achieved by manipulating configuration options during the compilation process. A detailed description of exactly how this was performed is provided in Section 4.6.

In a system of this scale, server performance is not an important factor. Shells are uploaded and processed individually instead of concurrently. In future, however, if the system were to be extended to automatically collect and process web shells, performance would become more of a concern and other approaches (such as multithreading or concurrent programming) would have to be considered. Furthermore, the focus during development was on testing a proof of concept rather than developing a high-performance system able to be deployed in a production environment.

Apache Compilation and Configuration:

As has already been stated, it was necessary to compile Apache from source to gain access to the configuration options needed to enable the thread safety required by PHP’s Runkit extension. Although this was the primary reason, compiling the server from its source code had other key advantages. It provided more flexibility, as it was possible to choose only the functionality required by the system and no more – this would not have been possible if the server was installed from a binary created


by a third party. Furthermore, the default install directory could be modified during compilation, which proved helpful when managing multiple versions of Apache and testing different configuration settings. Descriptions of the configuration options required specifically for the system and the Runkit Sandbox in particular, but which are not included as part of the default install, are shown below:

--enable-so

The --enable-so configuration option was used to enable Apache’s mod_so module, which allows the server to load dynamic shared objects (DSOs). Modules in Apache can either be statically compiled into the httpd binary or exist as DSOs that are separate from this binary [55]. If a statically compiled module is updated or recompiled, Apache itself must also be recompiled. Since recompilation is a time-consuming process, PHP was compiled as a shared module so that it was only necessary to restart Apache when changes were made to the PHP installation.

--with-mpm=worker

The --with-mpm=worker configuration option was included to specify the multi-processing module (MPM) that Apache should use. MPMs implement the basic behaviour of the Apache server, and every server is required to implement at least one of these modules [55]. The default MPM is prefork, a non-threaded web server that allocates one process to each request. While this MPM is appropriate for powering sites that make use of non-thread-safe libraries, it was not chosen for this system because it is not compatible with PHP’s Runkit Sandbox class. It was therefore necessary to specify the use of the worker MPM, a hybrid multi-process multi-threaded server that is able to serve more requests using fewer system resources while still maintaining the thread safety demanded by the aforementioned class.

4.4 PHP Configuration

As was the case with Apache, PHP was compiled from source and installed in the wwwroot directory for flexibility and ease of modification. It was configured by manipulating configuration options during installation – once again, the focus was on enabling thread safety and creating a sandbox-friendly environment.

--with-zlib

When developers of malware attempt to hide their work, they often employ compression functions such as gzdeflate() as part of the obfuscation process. Since the goal of the system is to remove such obfuscation, it is necessary to reverse these functions. The zlib software library facilitates reverse engineering of this kind by allowing the system to decompress compressed data using the gzinflate() function. Listing 1 depicts an obfuscation idiom that includes a call to the aforementioned function, where the string in brackets represents the obfuscated code.

<?php
eval(gzinflate(base64_decode("4+VKK8n...")));
?>

Listing 1: A common obfuscation idiom
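The decoder reverses this idiom by applying the inverse functions in the opposite order, which can be demonstrated with a round trip. The payload below is our own example, since the string in Listing 1 is truncated.

```php
<?php
// Round-trip demonstration of the idiom in Listing 1: the hidden code
// is recovered by applying the inverse functions in reverse order.
$hidden = 'echo "malicious payload";';

// Obfuscation, as a shell author would perform it:
$obfuscated = base64_encode(gzdeflate($hidden));

// Deobfuscation, as the decoder performs it:
$recovered = gzinflate(base64_decode($obfuscated));

var_dump($recovered === $hidden); // bool(true)
```
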

--enable-maintainer-zts and --enable-runkit

PHP is interpreted by the Zend Engine. This engine provides memory and resource management for the language, and runs with thread safety disabled by default so as to support the use of non-thread-safe PHP libraries. Thread safety was enabled by passing the --enable-maintainer-zts configuration option during the compilation process. The purpose of enabling thread safety was to provide an environment in which the Runkit extension could function; this extension was enabled using the last configuration option, --enable-runkit.

4.5 The Decoder

The first of the major components developed for the system was the decoder, which is responsible for performing code normalisation and deobfuscation prior to execution in the sandbox environment. Code normalisation is the process of altering the format of a script to promote readability and understanding, while deobfuscation is the process of revealing code that has been deliberately disguised [56].

The decoder is considered a static deobfuscator in that it manipulates the code without ever executing it. The advantage of this approach is that it suffers from none of the risks associated with malicious software execution, such as the unintentional inclusion of remote files, the overwriting of system files, and the loss of confidential information. Static analysers are, however, unable to access runtime information (such as the value of a variable at any given time or the current program state) and are thus limited in terms of behavioural analysis.

The purpose of this component is to expose the underlying program logic and source code of an uploaded shell by removing any layers of obfuscation that may have been added by the shell’s developer. This process is controlled by the decode() function, which is described below. It makes use of two core supporting functions, processEvals() and processPregReplace().

In addition to performing code deobfuscation, the decoder also attempts to extract information such as which variables were used, which URLs were referenced, and which email addresses were discovered. Some code normalisation (or pretty printing) is also performed on the output of the deobfuscation process in an attempt to transform it into a more readable form.


BEGIN
    Format the code
    WHILE there is still an eval or preg_replace
        Increment the obfuscation depth
        Process the eval(s)
        Format the code
        Process the preg_replace(s)
        Format the code
    END WHILE
    Perform pretty printing
    Initiate information harvesting
    Store the shell in the database
END

Listing 2: Pseudo-code for the decode() function

Decode():

The part of the Decoder class responsible for removing layers of obfuscation from PHP shells is the decode() function. It scans the code for the two functions most associated with obfuscation, namely eval() and preg_replace(), both of which are capable of arbitrarily executing PHP code. The eval() function interprets its string argument as PHP code, and preg_replace() can be made to perform an eval() on the result of its search and replace by including the deprecated '/e' modifier. Furthermore, eval() is often used in conjunction with auxiliary string manipulation and compression functions in an attempt to further obfuscate the actual code.

Once an eval() or preg_replace() is found in the script, either the processEvals() or the processPregReplace() helper function is called to extract the offending construct and replace it with the code that it represents. To deal with nested obfuscation techniques, this process is repeated until neither of the functions is detected in the code. Some pretty printing is then performed to get the output into a readable format, the functions that carry out the information gathering are called, and the decoded shell is stored in the database alongside the raw script. The full pseudo-code of this process is presented in Listing 2.

After both the processEvals() and processPregReplace() functions have been called, the formatLines() pretty printing function is used to remove unnecessary spaces in the code that could otherwise thwart the string processing techniques used in these helper functions.

ProcessEvals():

The eval() function is able to evaluate an arbitrary string as PHP code, and as such is widely used as a method of obfuscating code. The function is so commonly exploited that the PHP group includes a warning against its use. It is recommended that it only be used in controlled situations, and that user-supplied data be strictly validated before being passed to the function [57].

BEGIN
    WHILE there is still an eval in the script
        Find the starting position of the eval
        Find the end position of the eval
        Remove the eval from the script
        Extract the string argument
        Count the number of auxiliary functions
        Populate the array of functions
        Reverse the array
        FOR every function in the reversed array
            Apply the function to the argument
        END FOR
        Insert the deobfuscated code
    END WHILE
END

Listing 3: Pseudo-code for the processEvals() function

Listing 3 shows the full pseudo-code of the processEvals() function. This function is tasked with detecting eval() constructs in a script and replacing them with the code that they represent. String processing techniques are used to detect the eval() constructs and any auxiliary string manipulation functions contained within them. The eval() is then removed from the script and its argument is stored as a string variable. Auxiliary functions are detected and stored in an array, which is then reversed, and each function is applied to the argument in turn. The result of this process is then re-inserted into the shell in place of the original construct.

The processEvals() function was designed to be extensible. At its core is a switch statement that is used to apply auxiliary functions to the string argument. Adding another function to the list already supported by the system can be achieved by simply adding a case for that function. In future, the system could be extended to try to apply functions that it has not encountered before or been programmed to deal with.
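The reversed-array application step can be sketched as follows. The function name is our own, and the cases shown cover only a few common auxiliary functions; the real system's switch is assumed to handle more.

```php
<?php
// Sketch of the reversed-array step: auxiliary function names are
// detected outermost-first, so the array is reversed to undo the
// innermost transformation first. The switch makes the decoder easy
// to extend with one extra case per new function.
function applyAuxiliaryFunctions(array $functions, string $argument): string {
    foreach (array_reverse($functions) as $function) {
        switch ($function) {
            case 'base64_decode':
                $argument = base64_decode($argument);
                break;
            case 'gzinflate':
                $argument = gzinflate($argument);
                break;
            case 'str_rot13':
                $argument = str_rot13($argument);
                break;
            // Supporting a new obfuscation function is a one-case change.
        }
    }
    return $argument;
}

// For eval(gzinflate(base64_decode("..."))), detection yields
// ['gzinflate', 'base64_decode']; reversal recovers the hidden code.
$arg = base64_encode(gzdeflate('echo "hidden";'));
echo applyAuxiliaryFunctions(['gzinflate', 'base64_decode'], $arg);
// prints: echo "hidden";
```
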

ProcessPregReplace():

The preg_replace() function is used to perform a regular expression search and replace in PHP [17]. The danger of the function lies in the use of the deprecated '/e' modifier. If this modifier is included at the end of the search pattern, the interpreter will perform the replacement and then evaluate the result as PHP code; the system prevents this from happening, as is demonstrated below.

Listing 4 shows the full pseudo-code of the processPregReplace() function. It is tasked with detecting preg_replace() calls in a script and replacing them with the code that they were attempting to obfuscate. In much the same way as the processEvals() function, string processing techniques are used to extract the preg_replace() construct from the script. Its three string arguments are then stored in separate string variables and, if detected, the '/e' modifier is removed from the first argument to prevent the resulting text from being interpreted as PHP code. The preg_replace() can then be safely performed and its result can be inserted back into the script.

BEGIN
    WHILE there is still a preg_replace
        Find the starting position
        Find the end position
        Remove the preg_replace from the script
        Extract the string arguments
        Remove '/e' from the first argument to prevent evaluation
        Perform the preg_replace
        Insert the deobfuscated code
    END WHILE
END

Listing 4: Pseudo-code for the processPregReplace() function

<?php
preg_match_all($pattern, $this->decoded, $matches);
?>

Listing 5: Call to the preg_match_all() function
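The '/e' neutralisation step can be sketched in isolation as follows. The helper name and the pattern are our own illustrative choices, not the paper's code.

```php
<?php
// Sketch of neutralising the deprecated /e modifier: strip 'e' from the
// pattern's trailing modifiers so that the replacement text is inserted
// verbatim instead of being evaluated as PHP code.
function neutralisePattern(string $pattern): string {
    $delimiter = $pattern[0];                 // e.g. '/'
    $end = strrpos($pattern, $delimiter);     // position of closing delimiter
    $modifiers = substr($pattern, $end + 1);  // e.g. 'e' or 'ie'
    return substr($pattern, 0, $end + 1) . str_replace('e', '', $modifiers);
}

$dangerous = '/^any.*$/e';           // would evaluate its replacement
$safe      = neutralisePattern($dangerous);

// The replacement now appears as plain text in the output.
echo preg_replace($safe, 'phpinfo();', 'anything'); // prints: phpinfo();
```
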

Information Gathering:

The Decoder class contains three functions for extracting variables, URLs, and email addresses from PHP code. These functions are called after decoding has been completed to ensure that no obfuscation constructs are able to frustrate the information gathering process. Three accompanying functions for listing these code features are also contained within the class and are called from the HTML code associated with it to display the results of the information gathering to the user.

Each of these functions uses simple pattern matching and regular expressions to locate the three code features. PHP’s preg_match_all() function is used to perform this matching, accepting a pattern to search for, a string to search through, and an array in which to store the results as its arguments. The call to the function is identical for all three of the feature extraction functions, and is shown in Listing 5.

The only difference between the three feature extraction functions is the regular expression (or pattern) that is passed to the preg_match_all() function. The regular expressions for each of the functions are shown in Table 1.

Table 1: Regular expressions used for information gathering
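The shared call shape can be illustrated with a stand-in URL pattern; the actual regular expressions used by the system are those listed in Table 1 and are not reproduced here, and a literal string takes the place of $this->decoded.

```php
<?php
// Illustration of the shared preg_match_all() call with a stand-in URL
// pattern (not the system's actual Table 1 regex).
$pattern = '#https?://[^\s\'"]+#i';
$decoded = '$updateurl = "http://emp3ror.com/N3tshell//update/";';

preg_match_all($pattern, $decoded, $matches);
print_r($matches[0]); // every URL referenced by the shell
```
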

<?php
...
//Update server
$updateurl = "http://emp3ror.com/N3tshell//update/";
//Sources server
$sourcesurl = "http://emp3ror.com/N3tshell/";
...
?>

Listing 6: Extract from c99.php showing a reference to an update server

The information gathered in this way is useful for the purposes of discovering where a web shell has originated from and where it is reporting server information to. For example, some web shells, including many of the variants derived from the original c99 shell, will attempt to update themselves via an update server if given the opportunity (see Listing 6). Large resources are also often stored on remote servers and accessed at runtime to minimise shell size [43]. A list of these servers could potentially be stored and published as a URL blacklist that could then be blocked by ISPs or individual web hosts.

In addition to URLs, creators and modifiers of shells often include email addresses that can reveal information about their online aliases and any groups with which they may be associated. This information, in conjunction with the URL and variable analysis, could potentially be used to track the evolution of common web shells or as input to a system that attempts to perform similarity matching between shells (see Section 7.3 for more details).

4.6 The Sandbox

The second major component developed for the system was the sandbox, which is responsible for executing the deobfuscated code produced by the decoder in a controlled environment. As such, it forms the dynamic part of the shell analysis process – information about the shell’s functioning is extracted at runtime [42]. The purpose of the sandbox component is to log calls to functions that have the potential to be exploited by an attacker and make the user aware of such calls by specifying where they were made in the code. This was achieved in part through the use of the


Runkit Sandbox, an embeddable sub-interpreter bundled with PHP’s Runkit extension. The Runkit Sandbox class and how it was configured are discussed later in this section.

The redefineFunctions() function is the part of the sandbox responsible for identifying malicious functions and overriding them with functions that perform an identical task (at least as far as the script is concerned) while also recording where in the code the call was made. This redefinition process takes place before the code is executed in the Runkit Sandbox. Finally, shell execution and call logging are performed to complete the process.

Class Outline:

Unlike the decoder, which involves extensive string processing and the removal of nested obfuscation constructs, the sandbox is mainly concerned with the configuration of the Runkit Sandbox, the redefinition of functions, and the monitoring of any malicious function calls. As such, it requires far less processing logic and dispenses with a controlling function (like the decoder’s decode() function) altogether.

To begin with, the deobfuscated shell is retrieved from the temporary file created by the decoder. The outer PHP tags are then removed, as the eval() function used to initiate code execution inside the Runkit Sandbox requires that the code be contained in a string without them. An array of options is then used to instantiate a Runkit Sandbox object, and redefineFunctions() is called to override malicious functions within the sandbox.

The callList class is an auxiliary class created to maintain a list of potentially malicious function calls made by a shell executing in the sandbox. A callList object is initialised by the constructor before the shell is run, and is constantly updated as execution progresses. Once the shell script has completed, it is displayed in the user interface along with its output and a list of exploitable functions that it referenced.

Runkit Sandbox Class:

The sandbox’s core component is the Runkit Sandbox class, an embeddable sub-interpreter capable of providing a safe environment in which to execute PHP code. Instantiating an object of this class creates a new thread with its own scope and program stack, effectively separating the Runkit Sandbox from the rest of the shell analysis system. It is this functionality that necessitated the enabling of thread safety in both Apache and the PHP interpreter.

The behaviour of the Runkit Sandbox is controlled by an associative array of configuration options. Using these options, it was possible to restrict the environment to a subset of what the primary PHP interpreter can do (i.e. prevent activity such as network and file system access).

BEGIN
    FOR every exploitable function
        Copy the function to "name"_new
        Redefine the original function
        Modify the function body to echo function information
        Modify the function body to call the copied function
    END FOR
END

Listing 7: Pseudo-code for the redefineFunctions() function

These options were all set prior to the initialisation of the sandbox object and are passed to its constructor, which then configures the environment appropriately.

Function Redefinition and Classification:

The redefineFunctions() function is used to override potentially exploitable PHP functions with alternatives that perform identical tasks, but also log the function name, where it was called in the code, and the type of vulnerability that the function represents. The pseudo-code for this process is shown in Listing 7.

To begin with, the potentially exploitable function is copied using the Runkit extension’s runkit_function_copy() function to preserve its functionality and prevent it from being overwritten completely. The runkit_function_redefine() function is then used to override the original function, accepting the name of the original function, a list of new parameters, and a new function body as its arguments. The parameters are kept the same as those of the original function to allow it to be called in exactly the same way, but the body is modified to echo information about the function, which is then processed for logging purposes. A call is then made to the function that was copied to ensure that the script continues to execute.

Functions with the potential for exploitation can be grouped into four main categories: command execution, code execution, information disclosure and filesystem functions. Command execution functions can be used to run external programs and pass commands directly to a client’s browser, while code execution functions (such as the infamous eval()) allow arbitrary strings to be executed as PHP code. Information disclosure functions are not directly exploitable, but they can be used to leak information about the host system, thereby assisting a potential attacker. Filesystem functions can allow an attacker to manipulate local files and even include remote files if PHP’s allow_url_fopen configuration option has been set to true.
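The log-then-forward effect of the copy-and-redefine step can be imitated without the Runkit extension; the sketch below routes calls through a logging wrapper of our own devising to illustrate the behaviour (the real system instead redefines the functions in place with runkit_function_redefine()).

```php
<?php
// Conceptual stand-in for the redefinition step, written without the
// Runkit extension: calls are routed through a wrapper that logs the
// function name, vulnerability category, and call site, then forwards
// to the original function so execution continues unchanged.
$callLog = [];

function loggedCall(string $function, string $category, array $args) {
    global $callLog;
    $trace = debug_backtrace(DEBUG_BACKTRACE_IGNORE_ARGS, 1);
    $callLog[] = [
        'function' => $function,
        'category' => $category,         // e.g. "command execution"
        'line'     => $trace[0]['line'], // where the call was made
    ];
    return call_user_func_array($function, $args);
}

// strtoupper() stands in here for a genuinely exploitable function:
$result = loggedCall('strtoupper', 'information disclosure', ['id']);
print_r($callLog);
```
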


//Output handler for the sandbox

//Split the string into separate words
$arr = explode(" ", $str);

//For every word in the array
for ($i = 0; $i < count($arr); $i++){

    //If the word has ###PROCESS### attached to it
    //it is a function call and must be written to
    //call_list.txt
    if (strpos($arr[$i], '###PROCESS###') !== false){
        file_put_contents("/wwwroot/htdocs/temp/call_list.txt",
            str_replace("###PROCESS###", "", $arr[$i])."\n",
            FILE_APPEND);
    }
    //If it does not, it is sandbox output
    //and must be written to output.txt
    else{
        file_put_contents("/.../temp/output.txt",
            $arr[$i]."\n", FILE_APPEND);
    }
}
return '';

Listing 8: Output handler for the Runkit Sandbox object

Shell Execution and the Logging of Function Calls:

During the function redefinition process, the body of the original function is modified to echo information about it. While the shell is executing, this output is then captured by the output handler, a function designed to process all sandbox output without allowing it to affect the outer script. Since the output handler deals with both the information about the function calls and the actual output of the script executing in the sandbox, it is necessary to differentiate between the two. For this reason, processing tags consisting of an unlikely sequence of characters are appended to all information pertaining to the function calls. When the output handler receives information enclosed in such tags, it writes the information to a file, which is then read by the addCall() method of the callList object to record the details of the call. Information that is not enclosed in these tags is written to a separate file that is subsequently output to the browser. A code snippet demonstrating the output handler's selection process is shown in Listing 8.
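The demultiplexing the output handler performs reduces to a few lines. The sketch below (Python for brevity; the tag value mirrors the one used in the listings) separates tagged call records from ordinary script output:

```python
TAG = "###PROCESS###"

def split_output(raw: str):
    """Separate tagged function-call records from ordinary sandbox output."""
    calls, output = [], []
    for word in raw.split(" "):
        if TAG in word:
            # Strip the processing tag and keep the call record.
            calls.append(word.replace(TAG, ""))
        else:
            output.append(word)
    return calls, output
```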

The function names and classifications are hard-coded into each of the redefinition operations. As the only dynamic part of the three pieces of information associated with a function call, the line numbers must be determined at runtime. This is achieved through the use of PHP's debug_backtrace() function, which returns a backtrace of the function call that includes the line it was called on. An example of the use of debug_backtrace() in a function redefinition is shown in Listing 9.

//--------------Command Execution----------------
//Exec
$this->sandbox->runkit_function_copy('exec', 'exec_new');
$this->sandbox->runkit_function_redefine('exec', '$str',
    'echo " ".array_shift(debug_backtrace())["line"]
    ."###PROCESS### exec###PROCESS### Command_Execution###PROCESS### ";
    return exec_new($str);');

Listing 9: Example of a function redefinition
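The same runtime lookup is available in most languages; in Python, for instance, the caller's line number can be recovered from the interpreter stack. This is a hedged analogue of the role array_shift(debug_backtrace())["line"] plays in Listing 9, not part of the system itself:

```python
import inspect

def report_call_site() -> int:
    """Return the line number on which this function was called,
    mirroring array_shift(debug_backtrace())["line"] in PHP."""
    return inspect.stack()[1].lineno
```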

5. RESULTS

Throughout the development of the shell analysis system, the components were tested to ensure that they functioned as intended. These ranged from the smaller unit tests designed to test specific scenarios to comprehensive tests that involved functional units from all parts of the system. The smaller unit tests were based on sections of real shell code, but were adapted to clearly demonstrate the specific capabilities of the system.

In addition to the adapted unit tests, several active and fully-featured web shells were used as inputs to the system in order to assess its performance in a live production environment. These shells were sourced from a comprehensive web malware collection maintained by Insecurety Research [58], which contains a variety of bots, backdoors and other malicious scripts. This repository is updated on a regular basis, and could theoretically be used to automate the addition of shells to the system's database by periodically checking for and downloading any new shells.

5.1 Decoder Tests

The decoder is responsible for performing code normalisation and deobfuscation prior to execution in the sandbox, with the goal of exposing the program logic of a shell. As such, it can be declared a success if it is able to remove all layers of obfuscation from a script (i.e., if it removes all eval() and preg_replace() constructs). The tests for this component progressed from scripts containing simple, single-level eval() and preg_replace() statements to more comprehensive tests involving auxiliary functions and nested obfuscation constructs. Each test was designed to clearly demonstrate a specific capability of the decoder. Finally, several tests were performed with the fully-functional web shells.

5.2 Single-level Eval() and Base64_decode()

The most basic test of the decoder involved providing a single eval() statement and base64-encoded argument as input and recording whether it was correctly identified, extracted, and replaced with the code that it was obscuring. The input script is shown in Listing 10.


<?phpecho "Hello";eval(base64_decode("ZWNobyAiR29vZGJ5ZSI7"));

?>

Listing 10: Single-level eval() with a base64-encoded argument

<?phpecho "Hello";echo "Goodbye";

?>

Listing 11: Expected decoder output with the script in Listing 10 as input

To create the input script, a simple echo() statement (with "Goodbye" included as an argument) was encoded using PHP's base64_encode() function. The expected output would therefore be a script in which the eval() construct has been replaced by this echo() statement, as is shown in Listing 11.

The actual output produced by the decoder component matched the expected output exactly.
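Since Base64 is language-independent, the round trip underlying this test can be reproduced in any environment (shown here in Python):

```python
import base64

payload = "ZWNobyAiR29vZGJ5ZSI7"

# Decoding the eval() argument recovers the hidden PHP statement.
hidden = base64.b64decode(payload).decode()
print(hidden)  # prints: echo "Goodbye";
```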

5.3 Eval() with Auxiliary Functions

A slightly more complex eval() was tested to ensure that the system could cope with a combination of auxiliary string manipulation functions. The string shown in Listing 12 was subjected to the str_rot13(), base64_encode() and gzdeflate() functions before being placed in the eval() construct. The reverse of these functions (str_rot13(), base64_decode() and gzinflate()) were then inserted ahead of the string.

The decoder was expected to detect all of these functions and apply them to the string, leaving only the decoded string shown in Listing 13. The actual output produced by the decoder component matched the expected output exactly. In addition to the results shown above, several other tests of this nature were performed with different arrangements of the string manipulation functions mentioned in Section 4.5, all with the same degree of success.

<?php
eval(gzinflate(base64_decode(str_rot13('GIKK
PhmVSslK+7V2LJg+S3Lrv...'))));
?>

Listing 12: Extract of a single-level eval() with multiple auxiliary functions

<?php
h5('http://mycompanyeye.com/list', 1*900);
function h5($u, $t){
    $nobot = isset($_REQUEST['nobot']) ? true : false;
    $debug = isset($_REQUEST['debug']) ? true : false;
    $t2 = 3600*5;
    $t3 = 3600*12;
    $tm = (!@ini_get('upload_tmp_dir')) ? '/tmp/' : @ini_get('upload_tmp_dir');
    ...
?>

Listing 13: Extract of the expected decoder output with the script in Listing 12 as input
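PHP's gzdeflate()/gzinflate() operate on raw DEFLATE streams, base64_encode() is standard Base64, and str_rot13() is its own inverse, so the layering in Listing 12 can be reproduced outside PHP. The sketch below (Python; not the decoder's actual code) builds the obfuscation chain and its inverse:

```python
import base64
import codecs
import zlib

def obfuscate(code: str) -> str:
    """Apply gzdeflate -> base64_encode -> str_rot13, as a shell author might."""
    deflater = zlib.compressobj(wbits=-15)  # raw DEFLATE, like gzdeflate()
    blob = deflater.compress(code.encode()) + deflater.flush()
    return codecs.encode(base64.b64encode(blob).decode(), "rot13")

def deobfuscate(blob: str) -> str:
    """Apply the inverse chain str_rot13 -> base64_decode -> gzinflate."""
    deflated = base64.b64decode(codecs.encode(blob, "rot13"))
    return zlib.decompress(deflated, wbits=-15).decode()
```

Applying deobfuscate() to its own obfuscate() output returns the original source, which is exactly the property the decoder relies on when it runs the auxiliary functions in reverse.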

<?phppreg_replace("/x/e", "echo ($greeting);", "y");

?>

Listing 14: Single-level preg_replace() with explicit string arguments

5.4 Single-level Preg_replace()

The single-level preg_replace() test was very similar to the single-level eval() test in Section 5.2, but its purpose was to test the processPregReplace() function specifically. To this end, a very simple preg_replace() function that searches for the pattern "x" in the string "y", replaces it with the string "echo($greeting);" and then evaluates the code was constructed. As was discussed in Section 4.5, the preg_replace() function can be used to execute PHP code through the use of the '/e' modifier. The script used to test the removal of such constructs is shown in Listing 14.

The decoder was expected to detect the preg_replace(), remove the '/e' modifier from the first argument to prevent evaluation, and then perform the preg_replace(), leaving only the replacement string (see Listing 15). The actual output produced by the decoder component matched the expected output exactly.

During testing, it was found that the processPregReplace() function was able to deal with preg_replace() constructs that contained explicit strings as arguments, but failed to deal with constructs that passed variables as arguments. The preg_replace()

<?php
echo($greeting);
?>

Listing 15: Expected decoder output with the script in Listing 14 as input
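Detecting and disarming the '/e' modifier can be done with a regular expression over the pattern argument. The sketch below (Python; a deliberately simplified pattern that only handles double-quoted literal arguments, unlike the real processPregReplace()) removes the 'e' flag so that the replacement string is no longer evaluated:

```python
import re

# Matches preg_replace() calls whose double-quoted pattern carries the
# /e modifier, capturing the pattern body and any other modifiers.
E_MODIFIER = re.compile(r'preg_replace\(\s*"(/.*?/)([a-z]*)e([a-z]*)"')

def strip_e_modifier(php_source: str) -> str:
    """Remove the 'e' flag from a preg_replace() pattern argument."""
    return E_MODIFIER.sub(r'preg_replace("\1\2\3"', php_source)
```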


<?phppreg_replace("/.+/e","\x65...",".");

?>

Listing 16: Extract of a simple preg_replace() statement

<?php
eval(gzinflate(base64_decode('TVCuzIFfy...')));
?>

Listing 17: Extract of an eval() construct encapsulating the preg_replace() statement in Listing 16

construct was still identified and correctly removed from the script, but it was not replaced with any code. This is because of the nature of the decoder: as a static code analyser, it has no way of knowing what the value of a variable is. The preg_replace() was therefore performed with empty strings as arguments and returned an empty string as a result. In future, this limitation could be eliminated by adapting the processPregReplace() function (and the processEvals() function, which suffers from the same shortcoming) to be part of the sandbox component, as they would then have access to runtime information such as the value of variables passed as arguments. This extension is discussed further in Section 7 as a possible addition to the system in the future.

5.5 Multi-level Eval() and Preg_replace() with Auxiliary Functions

To test the system's capacity for dealing with nested obfuscation constructs, a preg_replace() was encapsulated inside an eval() statement. The same script from Section 5.3 was placed in a preg_replace() statement before the whole construct was obfuscated using gzdeflate() and base64_encode(), and placed in an eval() statement. The original preg_replace() is shown in Listing 16, and the preg_replace() encapsulated in the eval() is shown in Listing 17.

The decoder was expected to remove both layers of obfuscation and replace them with the script from Section 5.3. The actual output showed that the decoder was able to handle the layered obfuscated construct, and is shown in Listing 18.

5.6 Full Shell Test

The previous tests were all aimed at ensuring that all parts of the decoder component functioned as intended. Aside from the limitations associated with static analysis (i.e. the inability to determine the value of a variable), each of the individual tests succeeded. As part of a final and more comprehensive set of tests, a

<?php
h5('http://mycompanyeye.com/list', 1*900);
function h5($u, $t){
    $nobot = isset($_REQUEST['nobot']) ? true : false;
    $debug = isset($_REQUEST['debug']) ? true : false;
    $t2 = 3600*5;
    $t3 = 3600*12;
    $tm = (!@ini_get('upload_tmp_dir')) ? '/tmp/' : @ini_get('upload_tmp_dir');
?>

Listing 18: Extract of the actual decoder output with the script in Listing 17 as input

eval(gzinflate(base64_decode('FJ3HcqPsFkUVA...')));

Listing 19: Extract of the outermost obfuscation layer

fully-functional derivative of the popular c99 web shell was passed as input. The shell is wrapped within 13 eval(gzinflate(base64_decode())) constructs, the outermost of which is partially displayed in Listing 19.

The decoder correctly produced the output shown in Listing 20. An analysis of the output found that all eval() and preg_replace() constructs had been correctly removed from the input script.

5.7 Sandbox Tests

The sandbox is responsible for executing potentially malicious scripts in a secure environment, with the goal of identifying calls to exploitable PHP functions. As such, it can be declared a success if it is able to classify and redefine the aforementioned functions and report on where they were called. The tests for this component included determining whether functions could be correctly identified, copied and overridden, and whether example PHP scripts could be executed successfully within the sandbox. Finally, several fully-functional web shells were

<?php
if (!function_exists("getmicrotime")){
    function getmicrotime(){list($usec, $sec)...}
error_reporting(5);
@ignore_user_abort(TRUE);
@set_magic_quotes_runtime(0);
$win = strtolower(substr(PHP_OS, 0, 3)) == "win";
define("starttime", getmicrotime());
...
?>

Listing 20: Extract of the decoder output with the script in Listing 19 as input
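The thirteen-layer wrapping illustrates why the decoder iterates: each pass exposes another eval() construct. The loop below is a Python sketch of the idea, under the same raw-DEFLATE assumption as before and with a deliberately simplified regular expression; it builds and then peels such nested layers:

```python
import base64
import re
import zlib

# Simplified: matches a script that consists of exactly one wrapped layer.
EVAL_RE = re.compile(r"^eval\(gzinflate\(base64_decode\('([^']*)'\)\)\);$")

def wrap(code: str, layers: int) -> str:
    """Wrap code in nested eval(gzinflate(base64_decode())) layers."""
    for _ in range(layers):
        deflater = zlib.compressobj(wbits=-15)
        blob = deflater.compress(code.encode()) + deflater.flush()
        code = "eval(gzinflate(base64_decode('%s')));" % base64.b64encode(blob).decode()
    return code

def peel(code: str) -> str:
    """Strip layers one at a time until no eval() construct remains."""
    match = EVAL_RE.match(code)
    while match:
        code = zlib.decompress(base64.b64decode(match.group(1)), wbits=-15).decode()
        match = EVAL_RE.match(code)
    return code
```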


<?php
echo getlastmod();
echo "\n";
echo getlastmod_new();
?>

Listing 21: Script calling an overridden function and the corresponding copied function

Sandbox Output:

1382402952
1382402952

Listing 22: Sandbox output and results with the script in Listing 21 as input

executed in the sandbox to determine its feasibility as a tool for code dissection.

5.8 Function Copy

The first step during function redefinition is the copying of the original function to a new function so that it can be overridden without losing its functionality. The end result of this process should be the existence of two functions, one with the original function name that has been overridden to echo log information when it is called, and a new function that contains the logic of the original function. This outcome was tested by utilising a script that calls both the overridden function and the copied function, as is shown in Listing 21. The function used in this test is the getlastmod() function, which simply returns a number denoting the date of the last modification of the current file [59].

It was expected that both of the function calls would be successful and would return identical results. The output of the sandbox is shown in Listing 22.

It can be seen that both functions were run successfully, the logic of the original function was preserved, and the overridden function was able to call the copied function to complete its task before logging the call.

5.9 Overriding and Classification of System Functions

Functions in the sandbox are overridden to report information about the name of the function and where it was called. The type of vulnerability that they represent should also be recorded. To test this, a script containing three functions (one each from the Command Execution, Information Disclosure, and Code Execution classes of functions described in Section 4.6) was constructed and input to the sandbox. This script is shown in Listing 23.

As expected, the sandbox identified all three of these

<?phpexec("whoami");echo getlastmod();$newfunc = create_function("$a", "return $a;");

?>

Listing 23: Script calling three exploitable functions

Sandbox Results:

Potentially malicious call to:
Command_Execution function "exec" on line 1
Potentially malicious call to:
Information_Disclosure function "getlastmod" on line 2
Potentially malicious call to:
Code_Execution function "create_function" on line 3

Listing 24: Sandbox results with the script in Listing 23 as input

functions as being potentially exploitable, and correctly classified each of them. The sandbox results are shown in Listing 24.

5.10 Full Shell Test - connect-back.php

When executed, this shell attempts to open a socket connection to a remote host and provide it with the system's username, password, and an ID number identifying the current process. An extract of the relevant code is shown in Listing 25.

In order to get the shell to run, the code had to be modified to force the call to fsockopen() to be made regardless of the lack of an IP address. The function was duly identified and reported by the sandbox, as is shown in Listing 26.

The testing of the sandbox proved to be far more complex and unpredictable. Shells containing malformed CSS and JavaScript failed to run at all, and modifications had to be made to some shells to ensure that certain functions were

...
$ipim = $_POST['ipim'];
$portum = $_POST['portum'];
if (true){
    $mucx = fsockopen($ipim, $portum, $errno, $errstr);
}
if (!$mucx){
    $result = "Error: didnt connect !!!";
}
...

Listing 25: Extract of the connect-back.php web shell


Sandbox Results:

Potentially malicious call to:
Miscellaneous function "fsockopen" on line 32

Sandbox Output:

<title>ZoRBaCK Connect</title>
...

Listing 26: Sandbox results and output with the script in Listing 25 as input

called even if their required arguments were not present. Despite this, testing of the individual elements proved successful: exploitable functions were correctly copied and redefined, and calls to these functions were recorded and displayed as intended. Furthermore, shells containing a combination of PHP and HTML were successfully analysed in a dynamic environment, and any attempts by these shells to call exploitable functions were recorded and correctly classified.

6. SUMMARY

The two primary goals of this research were to create a sandbox-based environment capable of safely executing and dissecting potentially malicious PHP code and a decoder component for performing normalisation and deobfuscation of input code prior to execution in the sandbox environment. Both of these undertakings proved to be successful for the most part. Section 5.1 demonstrated how the decoder was able to correctly expose code hidden by multiple nested eval() and preg_replace() constructs and extract pertinent information from the code. Similarly, the sandbox environment proved effective at classifying and reporting on calls to potentially exploitable functions (see Section 5.7).

As a proof of concept, the research ably demonstrated that the sandbox-based approach to malware analysis, combined with a decoder capable of code deobfuscation and normalisation, is a viable one. Despite this, the system was found to have some limitations: the decoder was able to deal with obfuscation constructs such as eval() and preg_replace() only if they contained explicit string arguments, and performed no analysis of the shell information after it was extracted. The sandbox environment proved unpredictable, occasionally failing to execute real-world shells that employed a mixture of CSS and JavaScript in addition to PHP and HTML. Although these limitations make the system unsuitable for use in a production environment, they do not detract from the results proving the feasibility of the approach itself.

7. FUTURE WORK

7.1 System Structure

The system is currently composed of two core components, namely the decoder and the sandbox. Each of these components represents a different approach to malware analysis: the decoder engages in static code analysis, and the sandbox performs dynamic code analysis. One of the major disadvantages of the decoder is that it is unable to deobfuscate constructs that contain variables as arguments, as it has no way of knowing which values these variables might represent. As a component that performs dynamic analysis, the sandbox has access to this information. In future it would therefore be useful to implement a closer coupling between the two components, allowing them to share this information instead of working in isolation and thereby enabling a more comprehensive code analysis system.

7.2 Implementation Language

The current system was implemented using PHP because of the existence of the Runkit Sandbox class, which forms a core part of the sandbox component. If the system were to be expanded, it would be beneficial to recode it in a language more suited to larger development projects, such as Python, which supports true object orientation and multiple inheritance, and is more scalable as a result of its use of modules as opposed to include statements. The core of the sandbox component would still have to use PHP and the Runkit Sandbox for code execution, but the decoder and all information gathering and inference logic could be converted to Python scripts.

7.3 Similarity Analysis and a Webshell Taxonomy

A useful extension to the current system would be to include a component capable of determining how different shells relate to each other. This would be responsible for the following two tasks:

• Code classification based on similarity to previously analysed samples. This would draw on existing work in the field of similarity analysis [60, 61] and could make use of the information gathered by the decoder. Fuzzy hashing algorithms such as ssdeep could also be used to obtain a measure of the similarity between shells [62].

• The construction of a taxonomy tracing the evolution of popular web shells such as c99, r57, b374k and barc0de [63] and their derivatives. This would involve the implementation of several tree-based structures that have the aforementioned shells as their roots and are able to show the mutation of the shells over time. Such a task would build on research into the evolutionary similarity of malware already undertaken by Li et al. [64].
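As a rough illustration of similarity scoring, the standard library's sequence matcher can stand in for a fuzzy hashing algorithm such as ssdeep (which is far better suited to large, partially rewritten files; this Python sketch is illustrative only):

```python
import difflib

def similarity(shell_a: str, shell_b: str) -> float:
    """Return a similarity score between two shell sources (0-100)."""
    return 100.0 * difflib.SequenceMatcher(None, shell_a, shell_b).ratio()
```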


REFERENCES

[1] K. Tatroe, Programming PHP. O'Reilly & Associates Inc., 2005.

[2] N. Cholakov, "On some drawbacks of the PHP platform," in Proceedings of the 9th International Conference on Computer Systems and Technologies and Workshop for PhD Students in Computing, ser. CompSysTech '08. New York, NY, USA: ACM, 2008. [Online]. Available: http://doi.acm.org/10.1145/1500879.1500894

[3] M. Landesman. (2007, March) Malware Revolution: A Change in Target. Microsoft. [Online]. Available: http://technet.microsoft.com/en-us/library/cc512596.aspx

[4] E. Kaspersky. (2011, October) Number of the Month: 70K per day. Kaspersky Labs. Accessed on 1 March 2013. [Online]. Available: http://eugene.kaspersky.com/2011/10/28/number-of-the-month-70k-per-day/

[5] M. Christodorescu, S. Jha, S. Seshia, D. Song, and R. Bryant, "Semantics-aware malware detection," in 2005 IEEE Symposium on Security and Privacy, May 2005, pp. 32–46.

[6] M. D. Preda, M. Christodorescu, S. Jha, and S. Debray, "A semantics-based approach to malware detection," SIGPLAN Notices, vol. 42, no. 1, pp. 377–388, January 2007. [Online]. Available: http://doi.acm.org/10.1145/1190215.1190270

[7] A. Moser, C. Kruegel, and E. Kirda, "Limits of Static Analysis for Malware Detection," in Twenty-Third Annual Computer Security Applications Conference, December 2007, pp. 421–430.

[8] M. Christodorescu and S. Jha, "Testing malware detectors," SIGSOFT Softw. Eng. Notes, vol. 29, no. 4, pp. 34–44, Jul. 2004. [Online]. Available: http://doi.acm.org/10.1145/1013886.1007518

[9] M. I. Sharif, A. Lanzi, J. T. Giffin, and W. Lee, "Impeding Malware Analysis Using Conditional Code Obfuscation," in NDSS, 2008.

[10] P. Wrench and B. Irwin, "Towards a sandbox for the deobfuscation and dissection of PHP malware," in Information Security for South Africa (ISSA), 2014, Aug 2014, pp. 1–8.

[11] H. C. Kim, D. Inoue, and M. Eto. (2009) Toward Generic Unpacking Techniques for Malware Analysis with Quantification of Code Revelation. Accessed on 1 March 2013. [Online]. Available: http://jwis2009.nsysu.edu.tw/location/paper/Toward%20Generic%20Unpacking%20Techniques%20for%20Malware%20Analysis%20with%20Quantification%20of%20Code%20Revelation.pdf

[12] E. Laspe. (2008, September) An Automated Approach to the Identification and Removal of Code Obfuscation. Riverside Research Institute. Accessed on 26 May 2013. [Online]. Available: http://www.blackhat.com/presentations/bh-usa-08/Laspe Raber/BH US 08 Laspe Raber Deobfuscator.pdf

[13] M. Sharif, V. Yegneswaran, H. Saidi, P. Porras, and W. Lee, "Eureka: A Framework for Enabling Static Malware Analysis," in Computer Security - ESORICS 2008, ser. Lecture Notes in Computer Science, S. Jajodia and J. Lopez, Eds. Springer Berlin Heidelberg, 2008, vol. 5283, pp. 481–500. [Online]. Available: http://dx.doi.org/10.1007/978-3-540-88313-5_31

[14] M. Madou, L. Van Put, and K. De Bosschere, "LOCO: an interactive code (de)obfuscation tool," in Proceedings of the 2006 ACM SIGPLAN symposium on Partial evaluation and semantics-based program manipulation, ser. PEPM '06. New York, NY, USA: ACM, 2006, pp. 140–144. [Online]. Available: http://doi.acm.org/10.1145/1111542.1111566

[15] L. Argerich, Professional PHP4, ser. Professional Series. Wrox Press, 2002. [Online]. Available: http://books.google.co.za/books?id=gcD3NX92fucC

[16] M. Doyle, Beginning PHP 5.3. Wiley, 2011. [Online]. Available: http://books.google.co.za/books?id=1TcK2bIJlZIC

[17] The PHP Group. (2013, May) Basic Syntax. Accessed on 22 May 2013. [Online]. Available: http://php.net/manual/en/language.basic-syntax.php

[18] B. McLaughlin, PHP & MySQL, ser. Missing Manual. O'Reilly Media, Incorporated, 2012. [Online]. Available: http://books.google.co.za/books?id=39s5PElSmg8C

[19] The PHP Group. (2013, May) Function Reference. Accessed on 22 May 2013. [Online]. Available: http://www.php.net/manual/en/funcref.php

[20] D. Sklar, Learning PHP 5. O'Reilly Media, 2008. [Online]. Available: http://books.google.co.za/books?id=PVvmMRSGzFEC

[21] The Resource Index Online Network. (2005, January) The PHP Resource Index. Accessed on 24 May 2013. [Online]. Available: http://php.resourceindex.com/

[22] The PHP Group. (2013, May) PEAR - PHP Extension and Application Repository. Accessed on 24 May 2013. [Online]. Available: http://pear.php.net/

[23] Zend Technologies. (2013, February) The PHP Company. Accessed on 24 May 2013. [Online]. Available: http://www.zend.com/en/resources/

[24] The PHP Group. (2013, May) PECL. Accessed on 24 May 2013. [Online]. Available: http://pecl.php.net/


[25] J. Bughin, M. Chui, and B. Johnson, "The next step in open innovation," The McKinsey Quarterly, vol. 4, no. 6, pp. 1–8, 2008.

[26] A. Wu, H. Wang, and D. Wilkins, "Performance Comparison of Alternative Solutions For Web-To-Database Applications," in Proceedings of the Southern Conference on Computing, 2000, pp. 26–28.

[27] L. Titchkosky, M. Arlitt, and C. Williamson, "A performance comparison of dynamic Web technologies," SIGMETRICS Perform. Eval. Rev., vol. 31, no. 3, pp. 2–11, Dec. 2003. [Online]. Available: http://doi.acm.org/10.1145/974036.974037

[28] E. Cecchet, A. Chanda, S. Elnikety, J. Marguerite, and W. Zwaenepoel, "Performance Comparison of Middleware Architectures for Generating Dynamic Web Content," in Middleware 2003, ser. Lecture Notes in Computer Science, M. Endler and D. Schmidt, Eds. Springer Berlin Heidelberg, 2003, vol. 2672, pp. 242–261.

[29] T. Suzumura, S. Trent, M. Tatsubori, A. Tozawa, and T. Onodera, "Performance Comparison of Web Service Engines in PHP, Java and C," in IEEE International Conference on Web Services, 2008, pp. 385–392.

[30] S. Trent, M. Tatsubori, T. Suzumura, A. Tozawa, and T. Onodera, "Performance comparison of PHP and JSP as server-side scripting languages," in Proceedings of the 9th ACM/IFIP/USENIX International Conference on Middleware, ser. Middleware '08. New York, NY, USA: Springer-Verlag New York, Inc., 2008, pp. 164–182. [Online]. Available: http://dl.acm.org/citation.cfm?id=1496950.1496961

[31] L. Atkinson and Z. Suraski, Core PHP Programming, ser. Core series. Prentice Hall Computer, 2004. [Online]. Available: http://books.google.co.za/books?id=e7D-mITABmEC

[32] The PHP Group. (2013, May) Usage Stats for January 2013. Accessed on 21 May 2013. [Online]. Available: http://php.net/usage.php

[33] Web Technology Surveys. (2013, May) Usage statistics and market share of PHP for websites. Accessed on 24 May 2013. [Online]. Available: http://w3techs.com/technologies/details/pl-php/all/all

[34] F. Coelho. (2013, April) PHP-related vulnerabilities on the National Vulnerability Database. Accessed on 25 May 2013. [Online]. Available: http://www.coelho.net/php-cve.html

[35] R. Miller. (2006, January) PHP Apps A Growing Target for Hackers. Accessed on 25 May 2013. [Online]. Available: http://news.netcraft.com/archives/2006/01/31/php apps a growing target for hackers.html

[36] Open Source Matters. (2013, January) What is Joomla? Accessed on 25 May 2013. [Online]. Available: http://www.joomla.org/about-joomla.html

[37] R. Kazanciyan. (2012, December) Old Web Shells, New Tricks. Mandiant. [Online]. Available: https://www.owasp.org/images/c/c3/ASDC12-Old Webshells New Tricks How Persistent Threats have revived an old idea and how you can detect them.pdf

[38] C. Collberg, C. Thomborson, and D. Low, "A taxonomy of obfuscating transformations," Department of Computer Science, The University of Auckland, New Zealand, Tech. Rep., 1997.

[39] C. Linn and S. Debray, "Obfuscation of Executable Code to Improve Resistance to Static Disassembly," in ACM Conference on Computer and Communications Security. ACM Press, 2003, pp. 290–299.

[40] B. Barak, O. Goldreich, R. Impagliazzo, S. Rudich, A. Sahai, S. Vadhan, and K. Yang, "On the (im)possibility of obfuscating programs," in Advances in Cryptology - CRYPTO 2001. Springer, 2001, pp. 1–18.

[41] D. Binkley, "Source Code Analysis: A Road Map," in 2007 Future of Software Engineering, ser. FOSE '07. Washington, DC, USA: IEEE Computer Society, 2007, pp. 104–119. [Online]. Available: http://dx.doi.org/10.1109/FOSE.2007.27

[42] C. Willems, T. Holz, and F. Freiling, "Toward automated dynamic malware analysis using CWSandbox," Security & Privacy, IEEE, vol. 5, no. 2, pp. 32–39, 2007.

[43] G. Wagener, R. State, and A. Dulaunoy, "Malware behaviour analysis," Journal in Computer Virology, vol. 4, no. 4, pp. 279–287, 2008. [Online]. Available: http://dx.doi.org/10.1007/s11416-007-0074-9

[44] M. Christodorescu, S. Jha, J. Kinder, S. Katzenbeisser, and H. Veith, "Software transformations to improve malware detection," Journal in Computer Virology, vol. 3, no. 4, pp. 253–265, 2007. [Online]. Available: http://dx.doi.org/10.1007/s11416-007-0059-8

[45] A. M. Zaremski and J. M. Wing, "Signature matching: a tool for using software libraries," ACM Transactions on Software Engineering and Methodology (TOSEM), vol. 4, no. 2, pp. 146–170, 1995.

[46] H.-M. Sun, Y.-H. Lin, and M.-F. Wu, "API Monitoring System for Defeating Worms and Exploits in MS-Windows System," in Information Security and Privacy, ser. Lecture Notes in Computer Science, L. Batten and R. Safavi-Naini, Eds. Springer Berlin Heidelberg, 2006, vol. 4058, pp. 159–170. [Online]. Available: http://dx.doi.org/10.1007/11780656_14


THE IMPACT OF TRIGGERS ON FORENSIC ACQUISITION AND ANALYSIS OF DATABASES

W. K. Hauger* and M. S. Olivier**

* ICSA Research Group, Department of Computer Science, Corner of University Road and Lynnwood Road, University of Pretoria, Pretoria 0002, South Africa. E-mail: [email protected]
** Department of Computer Science, Corner of University Road and Lynnwood Road, University of Pretoria, Pretoria 0002, South Africa. E-mail: [email protected]

Abstract: An aspect of database forensics that has not yet received much attention in the academic research community is the presence of database triggers. Database triggers and their implementations have not yet been thoroughly analysed to establish what impact they could have on digital forensic analysis methods and processes. This paper firstly attempts to establish whether triggers could be used as an anti-forensic mechanism in databases to disrupt or even thwart forensic investigations. Secondly, it explores whether triggers could be used to manipulate ordinary database actions for nefarious purposes while at the same time implicating innocent parties. The database triggers as defined in the SQL standard were studied together with a number of database trigger implementations, in order to establish which aspects of a trigger might have an impact on digital forensic analysis. The paper demonstrates that certain database forensic acquisition and analysis methods are affected by the possible presence of non-data triggers; this is specific to databases that provide non-data trigger implementations. Furthermore, it finds that the forensic interpretation and attribution processes should be extended to include the handling and analysis of all database triggers. This is necessary to enable a more accurate attribution of actions in all databases that provide any form of trigger implementation.

Keywords: database forensics, database triggers, digital forensic analysis, methods, processes.

1. INTRODUCTION

Forensic science, or simply forensics, is today widely used by law enforcement to aid investigations of crimes committed. Forensic science technicians, who are specially trained law enforcement officials, perform a number of forensically sound steps in the execution of their duties. These steps include the identification, collection, preservation and analysis of physical artefacts, and the reporting of results.

One critical part is the collection and preservation of physical artefacts. The collection needs to be performed in such a manner that the artefacts are not contaminated, and the artefacts then need to be preserved in such a way that their integrity is maintained. This part is critical because any evidence gained from the analysis of these artefacts must be incontestable: the evidence found would be used to either implicate or exonerate the involved parties, and any doubt about the integrity of the collected artefacts could lead to the evidence being dismissed or excluded from legal proceedings.

In digital forensics these steps are more commonly referred to as processes, and a number of process models have been developed to guide the digital forensic investigator [1]. The digital forensic process that matches the collection and preservation step in the physical world is the acquisition process. Traditionally, this process involves making exact digital copies of all relevant data media identified [19]. However, database forensics needs to be performed on information systems that are becoming increasingly complex. Several factors influence the way that data is forensically acquired and how databases are analysed, including data context, business continuity, storage architecture, storage size and database models. These factors and their influence on database forensics are examined further in Section 2.

Database triggers are designed to perform automatic actions based on events that occur in a database. A wide variety of commission and omission actions can be performed by triggers, and these actions can potentially affect data both inside and outside of the DBMS. Triggers and the actions they perform are therefore forensically important, as already recognised by Khanuja and Adane in the framework for database forensic analysis they proposed [4].

The effect that triggers can have on data raises the concern that they could compromise the integrity of the data being investigated. Could triggers, due to their nature in combination with the way databases are forensically analysed, lead to the contamination of the data being analysed? Another concern revolves around the automatic nature of the actions performed by triggers. Can the current attribution process correctly identify which party is responsible for which changes? This paper attempts to establish whether these concerns around triggers are justified.

The database trigger is defined in the ISO/IEC 9075 SQL standard [5]. Triggers were first introduced in the 1999 version of the standard and subsequently updated in the 2008 version. The specification could thus be examined to determine on a theoretical basis whether there is reason for concern. However, the standard is merely used as a guideline by DBMS manufacturers and there is no requirement to conform to it. Certain manufacturers also use feature engineering to gain a competitive advantage in the marketplace [6]: they might implement additional triggers based on actual feature requests from high-profile clients, or enhance standard triggers and implement further triggers based on perceived usefulness. Such features could also be used to overcome certain limitations in their DBMS implementations. It is therefore necessary to study actual trigger implementations, rather than the standard itself.

There are thousands of database implementations available, and investigating the trigger implementations of all those databases that use triggers would be prohibitive. Thus the database trigger implementations of a few proprietary and open-source DBMSs were chosen. The DBMSs investigated were Oracle, Microsoft SQL Server, MySQL, PostgreSQL, DB2, Sybase and SQLite. These selected relational database management systems (RDBMSs) are widely adopted in industry. SQLite is particularly interesting since it is not a conventional database: it has no server or running process of its own, but is rather a single file that is accessed via libraries in the application using it. SQLite is promoted as a file replacement for local information storage, and well-known applications such as Adobe Reader, Adobe Integrated Runtime (AIR), Firefox and Thunderbird use it for information storage. SQLite is also very compact and thus well suited for use in embedded and mobile devices; the mobile operating systems iOS and Android both make use of it [28,29]. The dominance of the selected RDBMSs in the market means that they would be encountered fairly often by the general digital forensic investigator.

These RDBMSs are also the most popular based on the number of web pages on the Internet according to solid IT's ranking method [7]. The official documentation of these RDBMSs was used to study their trigger implementations; the latest published version of the documentation was retrieved from each manufacturer's website [8-12,25,26]. At the time of the investigation the latest versions available were as follows: Oracle 11.2g, Microsoft SQL Server 2012, Oracle MySQL 5.7, PostgreSQL 9.3, IBM DB2 10, Sybase ASE 15.7 and SQLite 3.8.6.

This article is a reworked and extended version of a paper presented by the authors at the Information Security South Africa (ISSA) 2014 conference [30]. The popular databases Sybase and SQLite have been added to the investigation. The INSTEAD OF trigger, which was later added to the standard, is now also covered. This particular trigger raises additional challenges that are discussed under commission and omission.

Based on: "The Role of Triggers in Database Forensics", by Werner Hauger and Martin Olivier, which appeared in the Proceedings of Information Security South Africa (ISSA) 2014, Johannesburg, 13 & 14 August 2014. © 2014 IEEE

Section 2 provides the database forensic background against which database triggers will be investigated. Section 3 describes the database trigger implementations investigated and is divided into four sub-sections. Firstly, the triggers defined in the standard are explored. Then the implementations of the standard triggers by the selected DBMSs are examined. Thereafter, other non-standard triggers that some DBMSs have implemented are considered; for each type of trigger, the question is asked how the usage of that particular trigger could impact the forensic process or method. Lastly, it is established to which objects triggers can be applied. Section 4 asks whether the current forensic processes would correctly identify and attribute actions if triggers were used by attackers to commit their crimes; this question is investigated through a few hypothetical examples. Section 5 concludes this paper and contemplates further research.

2. BACKGROUND

Historically, digital forensics attempts to collect and preserve data media in a static state, which is referred to as dead acquisition [19]. Typically, this process starts with isolating any device that is interacting with a data medium by disconnecting it from all networks and power sources. The data medium is then disconnected or removed from the device and connected via a write-blocker to a forensic workstation. The write-blocker ensures that the data medium cannot be contaminated while connected to the forensic workstation. Software is then used to copy the contents to a similar medium or to an alternative medium with enough capacity. Hashing is also performed on the original content with a hash algorithm such as MD5 or SHA-1 [19]. The hashes are used to prove that the copies made are exact copies of the originals and have not been altered, and they are used throughout the analysis process to confirm the integrity of the data being examined. Once the copies have been made, there is no further need to preserve the originals [2]. However, if the data being examined is to be used to gather evidence in legal proceedings, some jurisdictions may require that the originals still be available.

A different approach is to perform live acquisition. This involves the collection and preservation of both volatile data (e.g. CPU cache, RAM, network connections) and non-volatile data (e.g. data files, control files, log files). Since the acquisition is performed while the system is running, there are some risks that affect the reliability of the acquired data. These risks can however be mitigated by employing certain countermeasures [20].

In today's information systems there are several instances where it has become necessary to perform live acquisition. Firstly, in a permanently switched-on and connected world, the context around the imaged data may be required to perform the forensic analysis. This includes volatile items such as running processes, process memory, network connections and logged-on users [19]. One area where the context gained from live acquisition is particularly useful is when dealing with possibly encrypted data, because the encrypted data might already be open on a running system and the encryption keys used might be cached in memory [21]. The increasing use of encryption to protect data by both individuals and organisations increases the need for live acquisitions to be performed.

Another instance where live acquisition is performed is when business continuity is required. For many organisations information systems have become a critical part of their operations. The seizure or downtime of such information systems would lead to great financial losses and damaged reputations, and the shutdown of mission-critical systems might even endanger human life. During forensic investigations, such important information systems can thus no longer be shut down to perform imaging in the traditional way [19].

The complex storage architecture of today's information systems also necessitates the use of live acquisition techniques. To ensure availability, redundancy, capacity and performance, single storage disks are no longer used for important applications and databases. At least a redundant array of independent disks (RAID) or a full-blown storage area network (SAN) is used. Both of these technologies group a variable number of physical storage disks together using different methodologies, and present a logical storage disk to the operating system that is accessible on the block level. In such a storage configuration a write-blocker can no longer be efficiently used. There may simply be too many disks in the RAID configuration to make it cost- and time-effective to image them all [19]. In the case of a SAN, the actual physical disks holding the particular logical disk might not be known, or might be shared among multiple logical disks.
These other logical disks may form part of other systems that are unrelated to the application or database system and should preferably not be affected. Attaching the disks in a RAID configuration to another controller with the same configuration can make the data appear corrupt and impossible to access, since RAID controller and server manufacturers only support RAID migration between specific hardware families and firmware versions. The same holds true for the imaged disks. While it is still technically possible to image a logical disk the same way as a physical disk, it may not be feasible to do so either. Firstly, the size of the logical disk may be bigger than the disk capacity available to the forensic investigator [24]. Secondly, the logical disk may hold a lot of other unrelated data, especially in a virtualised environment. Lastly, organisations may be running a huge single application or database server containing many different applications and databases.

Due to hardware, electricity and licensing costs, the organisation may prefer this to having multiple smaller application or database servers.

Lastly, database systems have their own complexities that affect digital forensic investigations. The models used by the database manufacturers are tightly integrated into their database management systems (DBMSs) and are often of a proprietary nature. Reverse engineering is purposely made difficult to prevent the intellectual property being used by a competitor, and is sometimes explicitly prohibited in the licensing agreements governing the usage of the DBMSs. Forensically analysing the raw data directly is thus not easy, cost-effective or even always possible. The data also needs to be analysed in conjunction with the metadata, because the metadata not only describes how to interpret the data, but can also influence the information actually seen [3]. Using the DBMS itself, and by extension the model it contains, has therefore become the necessary approach to forensically analyse databases.

The database analysis can be performed in two ways: an analysis on site or an analysis in a clean laboratory environment. On site, the analysis is performed on the actual system running the database. In the laboratory, a clean copy of the DBMS with the exact same model as used in the original system is used to analyse the data and metadata acquired [3]. Both ways can be categorised as live analysis, since both are performed on a running system: in the first instance the real system is used, while in the second a resuscitated system is used in a more controlled environment (e.g. single user, no network connection). Due to all these complexities associated with applications and particularly databases, live acquisition is the favoured approach when dealing with an information system of a particular size and importance.
Fowler documents such a live acquisition in a real-world forensic investigation he performed on a Microsoft SQL Server 2005 database [23]. It should be noted that both the operating system and the DBMS are used to access and acquire data after authentication. To preserve the integrity of the acquired data, he uses his own clean tools that are stored on a read-only medium [20]. However, merely accessing the system will already cause changes to the data, effectively contaminating it before it can be copied. Since all the operations performed during the acquisition are documented, they can be accounted for during a subsequent analysis; this kind of contamination is therefore acceptable, as it can be negated during analysis. Against this background of how forensic acquisition and analysis is performed on a database system, triggers are now examined.
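The hashing step described in this section can be sketched in a few lines of Python. This is an illustrative sketch only, not part of any tool discussed above; the function names and file paths are invented, while MD5 and SHA-1 are the algorithms mentioned in the text, both available in the standard hashlib module.

```python
import hashlib

def file_digests(path, algorithms=("md5", "sha1"), chunk_size=1 << 20):
    """Compute digests of an acquired image in a single pass over the file."""
    hashes = {name: hashlib.new(name) for name in algorithms}
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            for h in hashes.values():
                h.update(chunk)  # large images are read chunk by chunk
    return {name: h.hexdigest() for name, h in hashes.items()}

def verify_copy(original_path, copy_path):
    """A copy is only forensically sound if every digest matches the original."""
    return file_digests(original_path) == file_digests(copy_path)
```

Recomputing the digests at each analysis step and comparing them against the values recorded at acquisition time is what allows the analyst to demonstrate that the examined data has not been altered.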


3. TRIGGER IMPLEMENTATIONS

This section firstly examines what types of triggers are defined in the standard and how they have been implemented in the DBMSs surveyed. It then looks at other types of triggers that some DBMSs have implemented. Lastly, the database objects to which triggers can be applied are examined. Throughout the section, the possible impact on database forensics is explored.

3.1 Definition

The ISO/IEC 9075 standard part 2: Foundation defines a trigger as one or more actions taking place as a result of an operation being performed on a certain object. The operations are defined as changes made to rows by inserting, updating or deleting them. Three trigger types are therefore defined: the insert trigger, the delete trigger and the update trigger. The action can take place immediately before the operation, instead of the operation, or immediately after the operation; a trigger is thus defined as a BEFORE trigger, an INSTEAD OF trigger or an AFTER trigger. The action can take place only once, or it can occur for every row that the operation manipulates; the trigger is thus further defined as a statement-level trigger or as a row-level trigger.

3.2 Standard triggers

The first aspect that was looked at was conformance to the ISO/IEC 9075 SQL standard regarding the types of triggers. All DBMSs surveyed implement the three types of data manipulation language (DML) triggers defined. The only implementations that match the specification exactly in terms of trigger types are those of Oracle and PostgreSQL: they have implemented all combinations of BEFORE/AFTER/INSTEAD OF and statement-level/row-level triggers. The others either place restrictions on the combinations or implement only a subset of the definition from the specification. DB2 has no BEFORE statement-level trigger, but all the other combinations are implemented. SQL Server and Sybase do not implement BEFORE triggers at all.
MySQL and SQLite do not have any statement-level triggers. PostgreSQL goes one step further and differentiates between the DELETE and the TRUNCATE operation. Because the standard only specifies the DELETE operation, most databases will not execute DELETE triggers when a TRUNCATE operation is performed. Depending on the viewpoint, this can be advantageous or problematic. It allows for the quick clearing of data from a table without having to perform possibly time-consuming trigger actions. However, if a DELETE trigger was placed on a table to clean up data in other tables first, a TRUNCATE operation on that table might fail due to referential integrity constraints; the linked tables will then have to be truncated in the correct order to be successfully cleared. PostgreSQL allows additional TRUNCATE triggers to be placed on such linked tables, facilitating easy truncation of related tables.

Since all three types of DML triggers defined by the standard rely on changes of data taking place, i.e. either the insertion of new data or the changing or removal of existing data, the standard methods employed by the forensic analyst are not impacted. These methods are specifically chosen because they do not cause any changes and can be used to create proof that in fact no changes have occurred.

Some members of the development community forums have expressed the need for a select trigger [13]. A select trigger would be a trigger that fires when a select operation takes place on the object on which it is defined. None of the DBMSs surveyed implement such a select trigger. Microsoft, however, is working on such a trigger and its researchers have already presented their work [14]. Oracle, on the other hand, has created another construct that can be used to perform one of the tasks that developers want to perform with select triggers: manipulating the SQL queries that are executed. The construct Oracle has created is called a group policy. It transparently applies the output from a user function to the SQL executed on the defined object for a certain user group. The function can be triggered by selecting, inserting, updating or deleting data. The good news for the forensic analyst is that these functions will not be invoked for users with system privileges. As long as the forensic analyst uses a database user with the highest privileges, the group policies will not interfere with the investigation.

The existence of a select trigger would have greatly impacted the standard methods used by the database forensic analyst. One of the methods used to gather data and metadata for analysis is the execution of SQL select statements on system and user database objects such as tables and views.
This would have meant that an attacker could have used such a trigger to hide or, even worse, destroy data. A hacker could use select triggers to booby-trap his rootkit: by placing select triggers on sensitive tables used by the rootkit, he could initiate the cleanup of incriminating data, or even the complete removal of the rootkit, should somebody become curious about those tables and start investigating.

3.3 Non-standard triggers

The second aspect that was investigated was the additional types of triggers that some DBMSs define. The main reason for the existence of such extra trigger types is to allow developers to build additional and more specialised auditing and authentication functionality than what is supplied by the DBMS. However, that is not the only application area, and triggers can be used for a variety of other purposes. For example, instead of having an external application monitoring the state of certain elements of the database and performing an action once certain conditions become true, the database itself can initiate these actions.

The non-standard triggers can be categorised into two groups: data definition language (DDL) triggers and other non-data triggers. Of the DBMSs investigated, only Oracle and SQL Server provide non-standard triggers.

DDL triggers: The first group of non-standard triggers are the DDL triggers. These are triggers that fire on changes made to the data dictionary with DDL SQL statements, e.g. CREATE, DROP, ALTER etc. Different DBMSs define different DDL SQL statements that can trigger actions. SQL Server has a short list that contains just the basic DDL SQL statements; Oracle has a more extensive list and also a special DDL indicator that refers to all of them combined. Since DDL SQL statements can be applied to different types of objects in the data dictionary, these triggers are no longer defined on specific objects. They are rather defined on a global level, firing on any occurrence of the event irrespective of the object being changed. Both SQL Server and Oracle allow the scope to be set to a specific schema or the whole database. These triggers once again rely on data changes being made in the database to fire, and thus pose no problem of interference during the forensic investigation.

Non-data triggers: The second group of non-standard triggers are non-data triggers. These are triggers that fire on events that occur during the normal running and usage of a database. Since these triggers do not need any data changes to fire, they potentially have the biggest impact on the methods employed by the forensic analyst. Fortunately the impact is isolated, because only a few DBMSs have implemented such triggers.

SQL Server, Oracle and Sybase define a login trigger. This trigger fires when a user logs into the database. SQL Server's login trigger can be defined to perform an action either before or after the login.
Authentication, however, will be performed first in both cases, meaning only authenticated users can activate the trigger. The login trigger can thus be used to perform conditional login or even to block all logins completely. An attacker could use this trigger to easily perform a denial-of-service (DoS) attack. Many applications today use some kind of database connection pool that dynamically grows or shrinks depending on the load of the application. Installing a trigger that prevents further logins to the database would cripple the application during high load. It would be especially bad after an idle period, during which the application would have reduced its connections to the minimum pool size.

Oracle's login trigger only performs its action after successful login. Unfortunately that distinction does not make a significant difference, and this trigger can also be used to perform conditional login or to prevent any login completely. That is because the content of the trigger is executed in the same transaction as the triggering action [16]. Should any error occur in either the triggering action or the trigger itself, the whole transaction will be rolled back, so simply raising an explicit error in the login trigger will reverse the successful login.

Sybase distinguishes between two different kinds of login triggers. The first is the login-specific login trigger, whose action is directly linked to a specific user account. This kind of trigger is analogous to the facility some operating systems provide to execute tasks on login. The second kind is the global login trigger, whose action is performed for all valid user accounts. Sybase allows both kinds of login triggers to be present simultaneously, in which case the global login trigger is executed first and then the login-specific trigger [27]. Neither kind of login trigger is created with the standard Sybase SQL trigger syntax. Instead, a two-step process is used: first a normal stored procedure is created that contains the intended action of the trigger; then this stored procedure is either linked to an individual user account or made applicable to all user accounts with built-in system procedures. As with Oracle, the action procedure is executed after successful login, but within the same transaction. It can thus be similarly misused to perform a DoS attack.

Microsoft has considered the possibility of complete account lockout and subsequently created a special method to log into a database that bypasses all triggers. Oracle, on the other hand, has made the complete transaction rollback not applicable to user accounts with system privileges or to the owners of the schemas, to prevent a complete lockout. Additionally, both SQL Server and Oracle have a special kind of single-user mode the database can be put into, which will also disable all triggers [15,16].
Sybase on the other hand has no easy workaround and the database needs to be started with a special flag to disable global login triggers [27]. A hacker could use this trigger to check if a user with system privileges, that has the ability to look past the root kits attempts to hide itself, has logged in. Should such a user log in, he can remove the root kit almost completely, making everything seem normal to the user even on deeper inspection. He can then use Oracle's BEFORE LOGOFF trigger to re-insert the root kit, or use a scheduled task [17] that the root kit hides to re-insert itself after the user with system privileges has logged off. Another non-data trigger defined by Oracle is the server error trigger. This trigger fires when non-critical server errors occur and could be used to send notifications or perform actions that attempt to solve the indicated error. The final non-data triggers defined by Oracle only have a database scope due to their nature: the database role change trigger, the database startup trigger and the


database shutdown trigger. The role change trigger refers to Oracle's proprietary Data Guard product, which provides high availability by using multiple database nodes. This trigger could be used to send notifications or to perform configuration changes relating to a node failure and the subsequent switch-over. The database startup trigger fires when the database is opened after successfully starting up. It could be used to perform certain initialisation tasks that do not persist and consequently do not survive a database restart. The database shutdown trigger fires before the database is shut down and could be used to perform clean-up tasks. These last two triggers can be exploited by a hacker in a similar way to the login and logoff triggers to manage and protect his root kit.

3.4 Trigger objects

The third aspect that was investigated was which database objects the DBMSs allow to have triggers. The standard generically defines that triggers should operate on objects, but implies that the objects have rows. It was found that all DBMSs allow triggers to be applied to database tables. Additionally, most DBMSs allow triggers to be applied to database views, with certain varying restrictions. Only MySQL restricts triggers to tables.

None of the DBMSs allow triggers to be applied to system tables and views; triggers are strictly available only on user tables and views. Additionally, there are restrictions on the kinds of user tables and user views that triggers can be applied to. This is good news for forensic investigators, since they are very interested in the internal objects that form part of the data dictionary. However, there is a move by some DBMSs to provide system procedures and views to display the data from the internal tables [22]. To protect these views and procedures from possible user changes, they have been made part of the data dictionary.
The ultimate goal seems to be to completely remove direct access to internal tables of the data dictionary. This might be unsettling news for forensic investigators as they prefer to access any data as directly as possible to ensure the integrity of the data. It will then become important to not only use a clean DBMS, but also a clean data dictionary (at least the system parts). Alternatively the forensic investigator first needs to show that the data dictionary is uncompromised by comparing it to a known clean copy [3]. Only then can he use the functions and procedures provided by the data dictionary.
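The comparison against a known clean copy can be automated. The following sketch uses SQLite purely for convenience (its data dictionary is the sqlite_master catalogue; other DBMSs expose analogous views, e.g. Oracle's DBA_TRIGGERS); the table and trigger names are hypothetical.

```python
import sqlite3

def trigger_inventory(conn):
    """Return {trigger_name: sql_text} read from the data dictionary."""
    rows = conn.execute(
        "SELECT name, sql FROM sqlite_master WHERE type = 'trigger'")
    return {name: sql for name, sql in rows}

def diff_against_baseline(current, baseline):
    """Report triggers that were added, removed or altered."""
    added = set(current) - set(baseline)
    removed = set(baseline) - set(current)
    altered = {n for n in set(current) & set(baseline)
               if current[n] != baseline[n]}
    return added, removed, altered

# Demonstration on an in-memory database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE patient(id INTEGER PRIMARY KEY, allergy TEXT);
    CREATE TRIGGER audit_patient AFTER INSERT ON patient
    BEGIN SELECT 1; END;
""")
baseline = trigger_inventory(conn)  # the "known clean" copy

# An attacker replaces the trigger body while keeping its name.
conn.execute("DROP TRIGGER audit_patient")
conn.executescript("""
    CREATE TRIGGER audit_patient AFTER INSERT ON patient
    BEGIN UPDATE patient SET allergy = NULL WHERE id = NEW.id; END;
""")
added, removed, altered = diff_against_baseline(trigger_inventory(conn), baseline)
print(altered)  # the tampered trigger stands out: {'audit_patient'}
```

A real investigation would of course take the baseline from trusted media rather than from the live system.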

4. IDENTITY AND ATTRIBUTION

The login trigger example brings up another interesting problem. Once the forensic investigator has pieced together all the actions that occurred at the time when the user with system privileges was logged in, the attribution of those actions can be performed. Since the forensic investigator can now assume that the picture of the events that took place is complete, he attributes all the actions to this specific user, because all the individual actions can be traced to this user through the audit information. Without looking at triggers, the investigator will miss that the particular user was completely unaware of certain actions, even though they were triggered and executed with his credentials.

These actions can be categorised into two groups: commission actions and omission actions. The BEFORE/AFTER trigger can be used to commission additional actions before or after the original operation is performed. Since the original operation is still performed unchanged, no omission actions can be performed. The outcome of the original operation can still be changed or completely reversed by actions performed in an AFTER trigger, but those actions are still commission actions. The INSTEAD OF trigger, on the other hand, can be used to perform actions in both groups. Normally this trigger is intended to commission alternative actions to the original operation requested. Like the BEFORE/AFTER trigger, it can also be used to commission actions in addition to the original operation. Importantly, however, it provides the ability to modify the original operation and its values. This ability also makes it possible to remove some values or to remove the operation completely: operations that were requested simply never happen, and values that were meant to be used or stored disappear. These removal actions therefore fall into the omission action group.

Consider a database in a medical system that contains patient medical information. An additional information table is used to store optional information such as organ donor consent, allergies etc. in nullable columns.
This system is used, among other things, to capture the information of new patients being admitted to a hospital. The admissions clerk carefully enters all the information from a form that is completed by the patient or his admitting partner. The form of a specific patient clearly indicates that he is allergic to penicillin. This information is dutifully entered into the system by the admissions clerk. However, an attacker has placed an INSTEAD OF trigger on the additional information table that changes the allergy value to null before executing the original insert. After admission, the medical system is used to print the patient's chart. A medical doctor then orders the administration of penicillin as part of a treatment plan after consulting the chart, which indicates no allergies. This action ultimately leads to the death of the patient due to an allergic reaction. An investigation is performed to determine the liability of the hospital after the cause of death has been established. The investigation finds that the allergy was disclosed on the admissions form, but not entered into the medical system. The admissions clerk


that entered the information of the patient that died is identified and questioned. The admissions clerk however insists that he did enter the allergy information from the form, and that the system indicated that the entry was successful. Without any proof substantiating this, the admissions clerk will be found negligent. Depending on the logging performed by the particular database, there might be no record in the database that can prove that the admissions clerk was not negligent. The application used to capture the information might however contain a log that shows a disparity between the data captured and the data stored. Without such a log there will possibly be only evidence to the contrary, implying gross negligence on the part of the admissions clerk. This could ultimately lead to the admissions clerk being charged with having performed an act of omission. However, should triggers be examined as part of a forensic investigation, they could provide a different perspective. In this example the presence of the trigger can at a minimum cast doubt on the evidence and possibly provide actual evidence to confirm the version of events as related by the admissions clerk.

The next example shows commission actions by using a trigger to implement the salami attack technique. An insurance company pays its brokers commission for each active policy they have sold. The commission amount is calculated according to some formula and the result stored in a commission table with five-decimal precision. At the end of the month, a payment process adds all the individual commission amounts together per broker and stores the total amount, rounded to two decimals, in a payment table. The data from the payment table is then used to create payment instructions for the bank. An attacker could now add an INSTEAD OF trigger on the insert/update/delete operations of the commission table, which would get executed instead of the insert/update/delete operation that was requested.
In the trigger, the attacker could truncate the commission amount to two decimals, write the truncated portion into the payment table against a dormant broker, and write the truncated two-decimal amount together with the other original values into the commission table. The banking details of the dormant broker would be changed to an account the attacker controls, and the contact information removed or changed to something invalid so that the real broker would not receive any notification of the payment. When the forensic investigator gets called in after the fraudulent bank instruction is discovered, he will find one of two scenarios: the company has an application that uses database user accounts for authentication, or an application that has its own built-in authentication mechanism and uses a single database account for all database connections. In the first case, he will discover from the audit logs that possibly all users that have access in the application to manage broker commissions have at some point updated the fraudulent bank instruction.

Surely not all employees are working together to defraud the company. In the second case, the audit logs will attribute all updates to the fraudulent bank instruction to the single account the application uses. In both cases it would now be worthwhile to query the data dictionary for any triggers whose content directly or indirectly refers to the payment table. Both Oracle and SQL Server have audit tables that log trigger events. If the trigger events correlate with the updates of the payment table as indicated in the log files, the investigator will have proof that the trigger in fact performed the fraudulent payment instruction updates. He can then move on to determine when and by whom the trigger was created. Should no trigger be found, the investigator can move on to examining the application and its interaction with the database.

Another, more prevalent crime that gets a lot of media attention is the stealing of the banking details of customers of large companies [18]. The most frequent approach is the breach of the IT infrastructure of the company and the large-scale download of customer information, including banking details. This normally takes place as a single big operation that gets discovered soon afterwards. A more stealthy approach would be the continuous leaking of small amounts of customer information over a long period. Triggers could quite easily be used to achieve that at the insurance company in our previous example. The attacker can add an AFTER trigger on the insert/update operations of the banking details table. The trigger takes the new or updated banking information and writes it to another table. There might already be such a trigger on the banking details table for auditing purposes, in which case the attacker simply has to add his part. To prevent any object count auditing from picking up his activities, the attacker can use an existing unused table.
There is a good chance he will find such a table, because there are always features of the application that the database was designed to support that simply were not implemented and might never be, owing to the dynamic business environment in which companies operate. Suppose every evening a scheduled task runs that takes all the information stored in the table, puts it in an email and clears the table. Some form of email notification method may already have been set up for the database administrator's own auditing process. The attacker simply needs to piggyback on this process, and as long as he maintains the same conventions, it will not stand out from the other audit processes. Otherwise, he can invoke operating system commands from the trigger to transmit the information to the outside. If the database server has Internet connectivity, he can connect directly to a server on the Internet and upload the information. Alternatively, he can use the email infrastructure of the company to email the information to a mailbox he controls.
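The leaking mechanism itself is a few lines of SQL. The following sketch uses SQLite [26] for convenience; the table and column names are hypothetical, and a real attack would of course target a production DBMS.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE banking_details(broker_id INTEGER, account TEXT);
    -- A forgotten, unused table repurposed by the attacker as a drop box.
    CREATE TABLE unused_table(c1 INTEGER, c2 TEXT);

    -- The malicious AFTER trigger: every insert is silently copied aside.
    CREATE TRIGGER leak AFTER INSERT ON banking_details
    BEGIN
        INSERT INTO unused_table VALUES (NEW.broker_id, NEW.account);
    END;
""")

# A normal application action...
conn.execute("INSERT INTO banking_details VALUES (7, 'ZA-123456')")

# ...has left a copy behind for later exfiltration.
print(conn.execute("SELECT * FROM unused_table").fetchall())
# [(7, 'ZA-123456')]
```

The application's insert succeeds exactly as before, which is why the clerk or the application log sees nothing unusual.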


The forensic analyst who investigates this data theft will find the same two scenarios as in the previous example. The audit information will point to one of the following: all the staff members are stealing the banking information together, or somebody is using the business application to steal the banking details through a malicious piece of functionality. Only by investigating triggers and any interaction with the table that contains the banking information will he be able to identify the party actually responsible for the data leak.

The actual breach of the IT infrastructure and the subsequent manipulation of the database could have happened weeks or months earlier. This creates a problem for the forensic investigator who tries to establish who compromised the database. Some log files he would normally use might no longer be available on the system because they have been archived due to space constraints. If the compromise happened far enough back, some archives might also no longer be available because, for example, the backup tapes have already been rotated through and reused. The fact that a trigger was used in this example is very useful to the forensic investigator. The creation date and time of the trigger can give him a possible beginning for the timeline and, more importantly, the time window in which the IT infrastructure breach occurred. He can then use the log information still available for that time window to determine who is responsible for the data theft.
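The omission action from the earlier admissions example can also be sketched concretely. Note that SQLite only permits INSTEAD OF triggers on views, so this sketch assumes the application writes through a view (Oracle and SQL Server likewise attach INSTEAD OF triggers to views); all object names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE patient_info(patient_id INTEGER, allergy TEXT);
    -- The application is assumed to write through this view.
    CREATE VIEW patient_info_v AS
        SELECT patient_id, allergy FROM patient_info;

    -- The malicious trigger drops the allergy value before the real insert.
    CREATE TRIGGER null_allergy INSTEAD OF INSERT ON patient_info_v
    BEGIN
        INSERT INTO patient_info VALUES (NEW.patient_id, NULL);
    END;
""")

# The admissions clerk's insert reports success...
conn.execute("INSERT INTO patient_info_v VALUES (1, 'penicillin')")

# ...but the stored row has lost the allergy: an omission action.
print(conn.execute("SELECT * FROM patient_info").fetchall())
# [(1, None)]
```

From the application's point of view the insert succeeded, which is precisely what makes the omission invisible without examining the trigger.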

5. CONCLUSION

Two concerns were raised around the presence of database triggers during forensic investigations. Can triggers cause the contamination of the data being analysed, and can the actions performed by triggers be correctly identified and attributed without analysing triggers? A contribution of this paper is a thorough survey of all trigger types found in the most widely used relational databases. The research found that database triggers are generally defined to perform actions based on changes in the database, be it on the data level or the data definition level. This will normally not affect the work of a forensic analyst, since he is primarily viewing information (be it data or metadata) without making any changes.

5.1 Results

In contrast, the research also showed that some DBMSs allow triggers to be set on the accessing of information. If the forensic analyst works with an Oracle or SQL Server database, he needs to consider the non-data triggers. He should take great care in how he connects to the database to prevent unintended changes from happening, and thus potentially having to do time-consuming reconstruction to get back to the initial state of the database.

Furthermore, the research demonstrated that triggers can be used to facilitate malicious actions on the back of normal application or operational actions on the database. These changes would be executed in the context of the initial change, and the standard audit material would attribute all changes to the same user. It is therefore necessary to examine database triggers as part of the forensic interpretation and attribution processes. All types of triggers should be examined for out-of-the-ordinary and suspicious actions that relate to the compromised data. This is needed to separate the user actions from the automatic trigger actions.

5.2 Future Work

The current research being conducted is focused on determining how best to analyse the different kinds of triggers. A database under investigation may contain several triggers. Many of those triggers, if not all of them, may bear no relevance to the investigation. A possible starting point would therefore be the ability to identify whether any of those triggers played a part in the specific data being analysed. This can be accomplished by searching the content of all triggers for occurrences of the database objects that are being analysed. A paper proposing an algorithmic approach to achieve this is to be presented at the 2015 IFIP Working Group (WG) 11.9 conference on Digital Forensics [31]. Attention also needs to be given to the fact that some DBMSs allow the obfuscation of trigger content. This would make it difficult to determine what actions a specific trigger performs and what database operations would initiate them. It also makes searching the content for database objects impossible. However, some Oracle and SQL Server database versions have obfuscation weaknesses that make it possible to retrieve the clear-text content of an obfuscated trigger.
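A naive form of such a content search is a substring scan over the stored trigger definitions in the data dictionary. The sketch below is not the algorithm of [31]; it is a minimal illustration using SQLite's sqlite_master catalogue, with hypothetical object and trigger names, and it fails as soon as the trigger content is obfuscated, as noted above.

```python
import sqlite3

def triggers_referencing(conn, object_name):
    """Return the names of triggers whose stored definition mentions
    object_name (case-insensitive substring scan of the data dictionary)."""
    return [name for name, sql in conn.execute(
                "SELECT name, sql FROM sqlite_master WHERE type = 'trigger'")
            if object_name.lower() in (sql or "").lower()]

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE commission(broker_id INTEGER, amount REAL);
    CREATE TABLE payment(broker_id INTEGER, total REAL);
    CREATE TRIGGER innocent AFTER INSERT ON commission
    BEGIN SELECT 1; END;
    -- This trigger touches the payment table and should be flagged.
    CREATE TRIGGER suspicious AFTER INSERT ON commission
    BEGIN INSERT INTO payment VALUES (NEW.broker_id, NEW.amount); END;
""")
print(triggers_referencing(conn, "payment"))  # ['suspicious']
```

Indirect references (e.g. via stored procedures or synonyms) require the deeper analysis that the cited paper proposes.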
Further research also needs to be conducted to determine how best to perform forensic acquisition and analysis when the database being investigated supports login triggers. Since the login trigger is non-standard, implementations differ between databases. Hence it will not be easy, or even possible, to establish a common process; any process that can be followed to neutralise or circumvent potentially interfering login triggers would be very database-specific. An aspect that has not been addressed in this paper is what impact triggers have when the forensic investigator does make intentional changes on a copy of the data. The investigator could be testing a hypothesis, performing data reduction, reconstructing deleted data or simply storing his results in a temporary table.


6. REFERENCES

[1] M.M. Pollitt: "An Ad Hoc Review of Digital Forensic Models", Proceedings of the Second International Workshop on Systematic Approaches to Digital Forensic Engineering, Seattle, pp. 43-54, April 2007.

[2] F. Cohen: Digital Forensic Evidence Examination, Fred Cohen & Associates, Livermore, CA, 4th edition, Chapter 2, p. 45, 2009.

[3] M.S. Olivier: "On metadata context in Database Forensics", Digital Investigation, Vol. 5 No. 3-4, pp. 115-123, 2009.

[4] H.K. Khanuja and D.S. Adane: "A framework for database forensic analysis", Computer Science & Engineering, Vol. 2 No. 3, June 2012.

[5] ISO/IEC 9075-2, "Information technology - Database languages - SQL - Part 2: Foundation (SQL/Foundation)", 2011.

[6] C.R. Turner, A. Fuggetta, L. Lavazza and A.L. Wolf: "A conceptual basis for feature engineering", The Journal of Systems and Software, Vol. 49 No. 1, pp. 3-15, 1999.

[7] "DB-Engines Ranking of Relational DBMS". Internet: http://db-engines.com/en/ranking/relational+dbms, [1 May 2014].

[8] "CREATE TRIGGER Statement", Oracle® Database PL/SQL Language Reference 11g Release 2 (11.2). Internet: http://docs.oracle.com/cd/E11882_01/appdev.112/e17126/create_trigger.htm, [2 May 2014].

[9] "CREATE TRIGGER", Data Definition Language (DDL) Statements. Internet: http://msdn.microsoft.com/en-us/library/ms189799.aspx, [2 May 2014].

[10] "CREATE TRIGGER Syntax", MySQL 5.7 Reference Manual. Internet: http://dev.mysql.com/doc/refman/5.7/en/create-trigger.html, [2 May 2014].

[11] "CREATE TRIGGER", PostgreSQL 9.3.4 Documentation. Internet: http://www.postgresql.org/docs/9.3/static/sql-createtrigger.html, [2 May 2014].

[12] "CREATE TRIGGER", DB2 reference information. Internet: http://publib.boulder.ibm.com/infocenter/dzichelp/v2r2/index.jsp?topic=/com.ibm.db2z10.doc.sqlref/src/tpc/db2z_sql_createtrigger.htm, [2 May 2014].

[13] "Oracle Community". Internet: https://community.oracle.com/community/developer/search.jspa?peopleEnabled=true&userID=&containerType=&container=&q=select+trigger, [2 May 2014].

[14] D. Fabbri, R. Ramamurthy and R. Kaushik: "SELECT triggers for data auditing", Proceedings of the 29th International Conference on Data Engineering, Brisbane, pp. 1141-1152, April 2013.

[15] "Logon Triggers", Database Engine Instances (SQL Server). Internet: http://technet.microsoft.com/en-us/library/bb326598.aspx, [2 May 2014].

[16] "PL/SQL Triggers", Oracle® Database PL/SQL Language Reference 11g Release 2 (11.2). Internet: http://docs.oracle.com/cd/E11882_01/appdev.112/e17126/triggers.htm, [2 May 2014].

[17] A. Kornbrust: "Database rootkits", Black Hat Europe, April 2005. Internet: http://www.red-database-security.com/wp/db_rootkits_us.pdf, [1 May 2014].

[18] C. Osborne: "How hackers stole millions of credit card records from Target", ZDNet (13 February 2014). Internet: http://www.zdnet.com/how-hackers-stole-millions-of-credit-card-records-from-target-7000026299/, [5 May 2014].

[19] F. Adelstein: "Live forensics: diagnosing your system without killing it first", Communications of the ACM, Vol. 49 No. 2, pp. 63-66, February 2006.

[20] B.D. Carrier: "Risks of live digital forensic analysis", Communications of the ACM, Vol. 49 No. 2, pp. 56-61, February 2006.

[21] C. Hargreaves and H. Chivers: "Recovery of Encryption Keys from Memory Using a Linear Scan", Proceedings of the Third International Conference on Availability, Reliability and Security, Barcelona, pp. 1369-1376, March 2008.

[22] M. Lee and G. Bieker: Mastering SQL Server 2008, Wiley Publishing Inc., Indianapolis, Indiana, Chapter 2, pp. 48-51, 2009.

[23] K. Fowler: "A real world scenario of a SQL Server 2005 database forensics investigation", Black Hat USA, 2007. Internet: https://www.blackhat.com/presentations/bh-usa-07/Fowler/Whitepaper/bh-usa-07-fowler-WP.pdf, [3 July 2014].

[24] S.L. Garfinkel: "Digital forensics research: The next 10 years", Digital Investigation, Vol. 7 Supplement, pp. S64-S73, 2010.

[25] "Create Trigger", Adaptive Server Enterprise 15.7: Reference Manual - Commands. Internet: http://infocenter.sybase.com/help/index.jsp?topic=/com.sybase.infocenter.dc36272.1570/html/commands/X19955.htm, [1 September 2014].

[26] "CREATE TRIGGER", SQL As Understood By SQLite. Internet: http://www.sqlite.org/lang_createtrigger.html, [1 September 2014].

[27] "Login triggers in ASE 12.5+", Login triggers. Internet: http://www.sypron.nl/logtrig.html, [1 September 2014].

[28] "Appropriate Uses For SQLite", Categorical Index Of SQLite Documents. Internet: http://www.sqlite.org/whentouse.html, [1 September 2014].

[29] "Well-Known Users of SQLite", Categorical Index Of SQLite Documents. Internet: http://www.sqlite.org/famous.html, [1 September 2014].

[30] W.K. Hauger and M.S. Olivier: "The role of triggers in database forensics", Proceedings of the 2014 Information Security for South Africa Conference, Johannesburg, August 2014.

[31] W.K. Hauger and M.S. Olivier: "Determining trigger involvement during forensic attribution in databases", 2015 IFIP Working Group 11.9 Conference on Digital Forensics. Accepted for presentation.


AN INVESTIGATION INTO REDUCING THIRD PARTY PRIVACY BREACHES DURING THE INVESTIGATION OF CYBERCRIME

Wynand JC van Staden∗

∗ School of Computing, University of South Africa, Science Campus, Johannesburg, South Africa. E-mail: [email protected]

Abstract: In this article we continue previous work in which a framework for preventing or limiting a privacy breach of a third party during the investigation of a cybercrime was presented. The investigations may be conducted internally (by the enterprise) or externally (by a third party, or a law enforcement agency), depending on the jurisdiction and context of the case. In many cases, an enterprise will conduct an internal investigation against some allegation of wrongdoing by an employee or a client. In these cases maintaining the privacy promise made to other clients or customers is an ideal that the enterprise may wish to honour, especially if the image or brand of the enterprise may be impacted when the details of the process followed during the investigation become clear. Moreover, there may be a duty to honour the privacy of third parties (through legislation or best practice). Providing tools to aid the investigative process in this regard may be invaluable in a world where privacy concerns are enjoying ever more attention – it provides a measure of due diligence from the investigator in showing that reasonable measures were in place to honour privacy. The article reports on the results of the implementation of a privacy breach mitigation tool – it also includes lessons learned, and proposes further steps for refining the breach detection techniques and methods for future digital forensic investigation.

Key words: Privacy, Digital Forensics, Privacy Breach, Third Party Privacy, Cybercrime

1. INTRODUCTION

A Digital Forensic (DF) investigation relies on the ability of the investigator to build a coherent and consistent description of the events of the incident being investigated. In order to accomplish this, the investigator should have access to all the relevant information that may provide context in the investigation. However, the investigator may be granted access to more information than is, strictly speaking, needed. For example, an investigator searching for proof of pornographic images containing children may conduct a legitimate search for all images on a storage medium that was shared by more than one person. A subsequent search may return all images on the disk – thus the potential for a privacy breach exists: the event that a third party, not culpable in the actions leading to the investigation, may be 'investigated' (we define this as a Third Party Privacy Breach (TPPB)). Such privacy breaches may not be considered harmful; however, they could reveal the third party's political affiliation, detail on their social circles and so on – details that could place the third party in a compromised position.

During the post-mortem analysis of data, the investigator uses a plethora of tools in order to sift through large volumes of data – these tools also provide clues to the documents and artefacts that may be of interest. A common tool is a regular expression parser and searcher, which will scan a collection of documents for keywords. The investigator may then (painstakingly) examine the documents in order to determine their relevance and to build a case. This requires that all returned results be considered, with data relevant to the investigation retained and non-relevant data discarded.
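The keyword-scanning step described above can be sketched in a few lines; the documents and keywords below are invented purely for illustration.

```python
import re

# A toy document collection standing in for a seized storage medium.
documents = {
    "memo1.txt": "Quarterly commission payments approved by finance.",
    "memo2.txt": "Birthday party photos attached.",
    "memo3.txt": "Wire the commission to the new account.",
}

def keyword_search(docs, keywords):
    """Return the names of documents matching any keyword,
    for subsequent manual review by the investigator."""
    pattern = re.compile("|".join(map(re.escape, keywords)), re.IGNORECASE)
    return sorted(name for name, text in docs.items() if pattern.search(text))

print(keyword_search(documents, ["commission", "account"]))
# ['memo1.txt', 'memo3.txt']
```

Note that such a scan returns every match, relevant or not, which is exactly where the third-party exposure discussed in this paper arises.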

In previous work [1], we proposed a method for limiting the possibility of a privacy breach through the use of information retrieval techniques such as clustering and diversity index creation. The principle behind the proposal was to release the results of a search only if it can be determined that the query being posed to the system is not considered 'too wide' or 'unspecific'. A query is deemed too wide if it may reveal information on third parties. (In some cases revealing information on third parties is unavoidable, and our original work includes an audit log to record such activities, as well as a threshold mechanism to aid in the controlled access of such data – the threshold mechanism can be adjusted to bypass the filter completely, revealing all data related to the query. However, this action will be recorded in an audit log. This allows the investigator the freedom to reveal more information if the filter is too restrictive.) A focused query returns results that fit close together, and thus has a smaller chance of revealing information on a third party.
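A minimal sketch of the filtering idea follows, using the Shannon index as a stand-in for the diversity measure; the cluster labels and threshold are hypothetical and do not reflect the actual implementation described later in the paper.

```python
import math
from collections import Counter

def shannon_diversity(cluster_labels):
    """Shannon diversity of the topic clusters a result set falls into:
    0.0 when all results come from one cluster; higher values mean the
    query cuts across many topics (i.e. is 'wide')."""
    counts = Counter(cluster_labels)
    n = sum(counts.values())
    return -sum((c / n) * math.log(c / n) for c in counts.values())

def release(results, threshold=0.5):
    """Release results only if the query is focused enough; otherwise
    withhold them (the real framework would also write an audit-log
    entry and allow the threshold to be raised)."""
    labels = [cluster for _, cluster in results]
    return results if shannon_diversity(labels) <= threshold else []

# (document_id, topic_cluster) pairs from a hypothetical search.
focused = [("d1", "trading"), ("d2", "trading"), ("d3", "trading")]
wide = [("d1", "trading"), ("d4", "personal"), ("d5", "legal")]

print(len(release(focused)), len(release(wide)))  # 3 0
```

The focused query's results all fall in one cluster and are released; the wide query spans three clusters and is withheld.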

A tool or tool-set which provides a degree of TPPB protection through mitigation or avoidance provides a measure of proof that the investigator (within reason) attempted to honour the privacy of third parties. Of course, the investigator can choose not to use the tools, and it may not be possible to force the use of such tools. However, there is an advantage in the event that a privacy breach did occur: the investigator can show a reasonable attempt (through the use of tools and their audit trails) at protecting third party privacy. Forensic tools such as Encase, and the scores of file carving and data gathering tools, are already accepted as reasonable means of gathering digital evidence to be presented as fact. Thus, a privacy preserving/mitigation tool (that has passed the initial acceptance tests) could also be presented as a reasonable means of gathering evidence and protecting privacy if an investigation into a breach takes place. This work is a first step at investigating such tools. We therefore do not focus on existing tools or tool-sets – the paper is concerned with the aspects of preserving privacy during an investigation, and focusses on a tool to accomplish that.

Based on: "An Investigation into Reducing Third Party Privacy Breaches during the Investigation of Cybercrime", by Wynand van Staden, which appeared in the Proceedings of Information Security South Africa (ISSA) 2014, Johannesburg, 13 & 14 August 2014. © 2014 IEEE

In this paper we continue that work by presenting the results of implementing the Information Retrieval (IR) system on the well-known (and well-researched) ENRON email corpus. The corpus contains real-world data consisting of communication relating to the core business of the ENRON corporation, as well as numerous emails of a personal nature. Our intent was to determine how well the proposed system would aid in filtering out searches that would be considered possible privacy breaches. We implemented two clustering algorithms (single-link, and Group-Average Agglomerative Clustering (GAAC)), and compare the results produced by the different algorithms.

While it is clear that fraudsters may use duplicitous messages to communicate with each other, the work in this paper does not consider the analysis of 'meaning within meaning'. The investigative prowess required to discover such sequestered meaning in all likelihood requires that the investigator bypass conventional investigative techniques. For the moment that means turning the filter off – however, the threshold mechanism allows for this, and the original framework provides for an audit log to protect the intrepid investigator in their search for answers in the face of a perceived privacy breach.

1.1 CONTRIBUTION

This paper contributes to the general area of digital forensic knowledge in the following way: the application of well-known information retrieval techniques to DF investigations has been proposed elsewhere [2, 3]. Extending search techniques to include privacy considerations will aid in the prevention of TPPBs, thereby ensuring that the right to privacy of persons is protected even when an investigation takes place. We have implemented the search filtering component of a previously proposed framework using the mentioned techniques and a generally available corpus of real-world data to examine the potential for privacy breaches and the mitigation of this risk. We have also implemented two different clustering algorithms in order to compare the results of 'naturally' clustering emails by topic in order to measure the diversity of the search results. The results of these are compared in order to determine the potential effectiveness of each as a privacy protection mechanism.

Our results provide insight into the complexity of the analysis of searches using IR techniques for privacy protection, and provide clues and avenues for future research in this area. The questions and areas of difficulty identified in this paper will be used to guide future research with the specific intent of finding better ways of limiting privacy breach exposure during a digital forensic investigation. Limiting privacy risk in such a way is becoming ever more relevant when one takes the recent reports of government-sanctioned information gathering into account: what once was an implicit trust of those in power to behave responsibly with Personally Identifiable Information (PII) has now become a matter of public interest and action.

1.2 STRUCTURE OF THE PAPER

The rest of the paper is structured as follows: section 2 provides relevant background and related material, and section 3 provides a description of the implementation that was done (the artefact). Section 4 provides results that were forthcoming from the implemented search prototype. Section 5 provides a summary of the insights gathered after analysing the results, and finally section 6 provides concluding remarks and proposes avenues for future research.

2. BACKGROUND AND RELATED WORK

The work presented in this paper covers two areas of research: digital forensic investigation of cyber-crime, and privacy. This section provides a brief background on each, as well as a short description of our previous proposal on TPPBs in order to provide context.

2.1 DIGITAL FORENSICS

Digital forensic science is “the use of scientifically derived and proven methods toward the preservation, collection, validation, identification, analysis, interpretation, documentation and presentation of digital evidence derived from digital sources for the purpose of facilitating or furthering the reconstruction of events found to be criminal or helping to anticipate unauthorised actions shown to be disruptive to planned operations.” [4]. The preservation and collection of digital evidence results in the seizure of any computer equipment that has been identified as a source of digital evidence [5, 6]. This paper focuses on the instances in which storage media (hard disks, DVDs, flash-drives, and so on) are seized for analysis.

The process followed (as described above) is not limited to governmental institutions: it is common for large enterprises to house internal investigative units which conduct investigations and may hand an investigation over to a governmental law enforcement agency. Moreover, the enterprise may not have the expertise or experience needed to conduct the investigation, and the incident may be handed over without an internal investigation taking place at all.

Analysis of the seized storage media is referred to as a post-mortem analysis, and requires the scrutinizing of both used and non-used areas of non-volatile storage. Used areas contain data that is managed by the file-system (including areas that are reserved for usage but that are not


associated with any container managed by the operating system – referred to as slack-space), and non-used areas contain residual data (data that was used at some point, but of which the container was removed). Used storage typically consists of directories and files that contain data. Files are classified according to the type of data they contain, and may contain image data (binary), plain text data, Hyper-text Mark-up Language (HTML), compressed data, or encrypted data. Encrypted data is meaningless unless the key used to decrypt the data is known to the investigator, and as such may hinder an investigation. Access to the relevant keys can normally be obtained through legal means, and since the methods presented in this paper focus on decrypted data, we do not consider the actions and procedures necessary to obtain keys. The rest of the paper therefore assumes that all data under consideration is clear-text data.

In the case of multi-user systems, the operating system will store the data for each user separately, and access to the data is enforced through classic information security mechanisms such as Access Control Lists (ACLs), an Access Control Matrix (ACM), or tokens [7]. However, most current operating systems provide unrestricted access to data on the system through a super-user or administrator account (even if this were not the case, it is a fairly trivial task to construct a program that will bypass the operating system and provide unrestricted access to the data). In the event of an investigation the investigator needs unrestricted access to the data being analysed. This means that the barriers that existed through the access control mechanisms listed above no longer apply, and that the investigator should have the same unrestricted access to the data as the super-user.

Searches for data that may be of interest to the investigator require high recall [3], and are commonly done by running pattern-matching searches on the data on the storage media. These searches reveal (in the case of used space) all the files that contain data matching the pattern, and (in the case of non-used space) the blocks on disk that contain matching patterns. The analyst then constructs evidence by carefully examining the results, including files and blocks that may support the case and discarding the files and blocks that match the pattern but are unrelated to the case.
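The pattern-matching pass over raw storage can be sketched as follows. This is a simplified illustration (the block size and function name are our own choices), and real forensic tools also handle matches that straddle block boundaries and map used blocks back to their files:

```python
import re

def scan_blocks(image_bytes, pattern, block_size=4096):
    """Scan a raw disk image block by block and return the numbers of
    the blocks whose contents match the byte pattern."""
    regex = re.compile(pattern)
    hits = []
    for offset in range(0, len(image_bytes), block_size):
        if regex.search(image_bytes[offset:offset + block_size]):
            hits.append(offset // block_size)
    return hits

# Toy 3-block 'image': an empty block, a block holding residual email
# text, and a wiped block.
image = (b"\x00" * 4096
         + b"meeting about the gas pipeline" + b"\x00" * 4066
         + b"\xff" * 4096)
print(scan_blocks(image, rb"pipeline"))  # [1]
```

The hits over non-used space are exactly the blocks the analyst must then triage by hand, which is where the privacy question below arises.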

It is this unfettered access to information that raises the question of privacy. An investigator on a multi-user system may (as a result of their search query) be presented with files that are unrelated to the investigation, but are returned because they contain data that match the pattern specified by the investigator. We consider the ability of the investigator to view these files a potential TPPB.

2.2 PRIVACY

Privacy is defined as the right to informational self-determination [8, 9], or the ability to state who has information about you, what they can do with it, and with whom they can share it. The foundations of privacy are well established [10–12], the ability of a system to offer privacy protection has been well researched, and an architecture for such systems has been proposed [13]. Despite this, one can easily find newspaper articles on the relative ease with which privacy breaches take place. The PRIME [14] project (and PRIMElife) attempts to put a framework in place in which systems adhere to the promises they make regarding privacy.

Privacy is a major concern for a society in which data gathering and analysis are becoming easier as technology improves. The manner in which modern society interacts through social media also provides a means for data gathering and analysis which poses a serious threat to privacy. The expectation created by informational self-determination may have a serious impact on how investigations into computer crime are conducted – as a user of a multi-user computer system one does expect that PII will be kept private. However, as indicated above, the investigator may be granted unlimited access to all the data on the system. Systems that state that they take the privacy of their users into account create an expectation of privacy, which may cause problems for the data controller∗ if a breach occurs.

2.3 PRIVACY DURING AN INVESTIGATION

In our previous work [1] we proposed a method for limiting the risk of a privacy breach by avoiding searches for patterns which yield a search result that may contain a privacy breach. The proposal used a search filter which classifies results based on an automated topic clustering algorithm [15] and a diversity rating for the results based on Shannon’s diversity index [16].

Beebe and Clark [3] proposed a thematic clustering technique for search results in order to cluster similar results based on Kohonen’s Self-Organising Maps (SOMs) [17], and Fei and Olivier [18] used SOMs to identify anomalies in search results. Both of these techniques attempt to cluster data in a way that eases the burden on the investigator. The primary difference between the approach used in this paper and the mentioned work is that we cluster similar documents in the search results purely to determine whether they cover different topics. Too diverse a set of topics would indicate a potential privacy breach – the reasoning is that a set of documents that are closely related better reflects a focussed query, and the results that are returned thus pose less of a privacy breach risk. In essence the system rewards patterns or queries that yield results that are closely related. We proposed a framework in which the risk of a TPPB would be reduced. This framework consisted of an indexing system, which is used during the search, and a filtering system which calculates a diversity index on the resulting corpus. The diversity index is calculated for the entire clustered corpus as:

DI = Σ_{ci ∈ C} p(ci) log p(ci)    (1)

∗The enterprise that stores the data


where C is the clustered corpus, and ci is a document within the cluster.

An ideal index for a search query is calculated, and if the resulting corpus deviates too much from the ideal index, the query is denied (no results are returned). The ideal index is estimated as the diversity of the entire result corpus treated as a single cluster (that is, what the diversity index would have been if all the documents in the result corpus were related). The ideal index is thus given as

DN = Σ_{ci ∈ S} p(ci) log p(ci)    (2)

where S is the entire search result corpus, and ci is a document in the corpus.

This forces the investigator to narrow the query using more specific search terms. To avoid situations where the investigator is stymied by the filtering system, the notion of a threshold was introduced, which can be shifted by the investigator to allow diverse queries to yield results. The shifting of the threshold is recorded in an audit log for later examination should a breach of privacy be reported. We did not implement the threshold or audit components as part of this experiment.
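The filtering step can be sketched in a few lines of Python. This is an illustrative sketch only: the function names, the use of base-2 logarithms, the negated (standard Shannon) form of the sum, and the simple DI − DN threshold test are our assumptions, not the authors’ exact implementation.

```python
import math

def diversity_index(cluster_sizes):
    """Shannon diversity (equation 1) over a clustered result corpus;
    cluster_sizes holds the number of documents in each cluster.
    Uses the negated form -sum p*log2(p), the usual Shannon
    convention (the paper prints the sum without the minus sign)."""
    total = sum(cluster_sizes)
    return -sum((n / total) * math.log2(n / total)
                for n in cluster_sizes if n > 0)

def diversity_norm(result_count):
    """Equation 2: the diversity the result corpus would have if all
    documents formed a single, perfectly related cluster."""
    return diversity_index([result_count])

def query_allowed(cluster_sizes, threshold):
    """Deny the query (False) when the clustered result corpus is too
    diverse relative to the ideal index."""
    deviation = (diversity_index(cluster_sizes)
                 - diversity_norm(sum(cluster_sizes)))
    return deviation <= threshold

print(query_allowed([98, 2], threshold=1.0))    # True: focussed query
print(query_allowed([10] * 10, threshold=1.0))  # False: too diverse
```

Shifting `threshold` upward corresponds to the investigator’s audited override described above.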

In the following section we provide detail on the implementation of our experiment.

3. IMPLEMENTATION DETAIL

In this section we provide detail on the construction of the proof of concept implementation of the search and filter portion of the proposed framework. The proof of concept makes use of standard information retrieval techniques [19], run on the Enron email corpus∗∗. The corpus itself consists of roughly 517 000 email messages from 151 employees. All attachments were removed from the emails, and the messages themselves are listed in files in RFC 4155 (mbox) format [20].

We created an inverted term index on the entire Enron corpus using the Single-pass In-Memory Indexing (SPIMI) method in Python. Extracted terms were stemmed using the Porter [21] stemmer. Our tokenizer for the email content delivered just over 780 000 unique terms, which include phone numbers, 16 to 19 digit numbers resembling credit card numbers, email addresses, web addresses, and file-names. The resulting index was itself indexed using a simple BTree to speed up search results. The entire scanning and indexing process on the corpus took a little over 30 minutes on our modest hardware∗∗∗ (given the simplicity of our implementation, a more robust implementation should be significantly faster).
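As an illustration of this indexing step, the sketch below builds a single in-memory index block over a toy corpus; a full SPIMI run would flush sorted blocks to disk and merge them, and `crude_stem` is only a toy stand-in for the Porter stemmer that was actually used.

```python
import re
from collections import defaultdict

def crude_stem(term):
    """Toy suffix-stripper standing in for the Porter stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if term.endswith(suffix) and len(term) > len(suffix) + 2:
            return term[: -len(suffix)]
    return term

def build_index(docs):
    """One pass over the documents (SPIMI-style), accumulating
    postings in an in-memory dictionary, then sorting the terms."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in re.findall(r"[a-z0-9@.\-]+", text.lower()):
            index[crude_stem(token)].add(doc_id)
    return {term: sorted(ids) for term, ids in sorted(index.items())}

def search(index, query):
    """AND query: stem each term and intersect the postings lists."""
    postings = [set(index.get(crude_stem(t), ()))
                for t in query.lower().split()]
    return sorted(set.intersection(*postings)) if postings else []

mails = {
    1: "Request for a new gas pipeline hook up",
    2: "Pipeline maintenance schedule",
    3: "Weekend plans and graduate school loans",
}
index = build_index(mails)
print(search(index, "pipeline request"))  # [1]
```

Stemming at both index and query time is what lets a query for ‘loan’ match emails containing ‘loans’, which also contributes to the incidental matches discussed later.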

Queries were stemmed, and the indexes were used to find relevant mails (mbox files). Using the N-Gram-based

∗∗Available for download from http://www-2.cs.cmu.edu/~enron/
∗∗∗Intel i5 architecture, with 16 GB of Random Access Memory (RAM), and a 7200 rpm SATA disk drive.

technique as proposed by Cavnar et al. [15], the distance between the mails in the results was calculated. These distances were stored for later use (in order to speed up subsequent queries). Once distances were calculated, we employed a single-linkage clustering method, as proposed in our original paper, to cluster emails based on topic. Once clustering was complete, we used the Shannon entropy index to calculate the diversity index (equation 1) on the corpus, and we calculated an ideal diversity index for the corpus. The ideal index or diversity norm (equation 2) is a diversity indicator for a result corpus containing a single cluster with all the mails contained in it (or, equivalently, a small number of clusters with mails spread evenly between them). The diversity norm is thus the ideal, and the actual diversity index is compared to it. The resulting clustered corpus was therefore given both a diversity norm and a diversity index. The diversity norm represents the ideal situation in which a query is focussed enough to return only documents that are related in topic, while the diversity index is an indicator of the diversity of topics covered in the result corpus. If the norm and the diversity index differed too much (past a threshold) no results were returned.
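The distance and clustering steps can be sketched as follows. The profile size, the fixed out-of-place penalty, and the naive merge loop are illustrative choices (the distance follows Cavnar and Trenkle’s n-gram ranking idea), not the paper’s implementation:

```python
from collections import Counter

def ngram_profile(text, n=3, top=300):
    """Ranked character n-gram profile (most frequent first)."""
    text = " ".join(text.lower().split())
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return {g: rank for rank, (g, _) in enumerate(grams.most_common(top))}

def out_of_place(p1, p2, max_penalty=300):
    """Cavnar-style 'out-of-place' distance: total rank shift of p1's
    n-grams in p2, with a fixed penalty for absent n-grams."""
    return sum(abs(rank - p2.get(g, max_penalty)) for g, rank in p1.items())

def single_link(dist, items, threshold):
    """Naive single-link agglomeration: repeatedly merge two clusters
    whose closest members are within the threshold."""
    clusters = [[i] for i in items]
    merged = True
    while merged:
        merged = False
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                closest = min(dist[i][j]
                              for i in clusters[a] for j in clusters[b])
                if closest < threshold:
                    clusters[a].extend(clusters.pop(b))
                    merged = True
                    break
            if merged:
                break
    return clusters
```

Pairwise `out_of_place` distances between the result mails feed `single_link`, and the resulting cluster sizes are what equations 1 and 2 are computed over.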

We also implemented a Hierarchical Agglomerative Clustering (HAC) algorithm using Ward’s method [22] (GAAC). Since the system relies on clustering to determine whether a query is acceptable or not, choosing the correct clustering mechanism is extremely important.

We made no attempt to find ‘culprits’ in the original Enron saga in our investigation: our primary aim was to find indications of privacy leaks using regular techniques that a forensic investigator would employ, and to see what types of results were returned. Several search terms related to an energy supplier’s business were used, and we generated reports indicating the number of clusters, the diversity norm, and the diversity index. To further aid our investigation we used the Natural Language Toolkit (NLTK)† to generate collocation and concordance reports for the documents. Broad searches (few search terms) in many cases resulted in a large number of mails in the search results (see table 1).

Search phrase used                     Number of emails returned
connection                             13569
hook up                                1709
pipeline                               15940
chief executive officer                5330
federal energy regulatory commission   4230

Table 1: Broad search results

We extended our search to include words that may form part of business vernacular, or social context. These included words like ‘graduate’, ‘student’, ‘loan’,

†www.nltk.org


‘weekend’, and so on. For the person searching for emails relating to specific topics (using keywords), the corpus of slightly more than 517 000 emails provides some interesting results. For example the phrase ‘hook up’, a term used in the Enron corpus to refer to a new gas or power connection (or an ADSL connection – Enron branched into internet connectivity as well), returns several emails referring to new business proposals and requests for meetings to discuss the potential sale of new connections to energy providers, or clients.

However, the term ‘hook up’ may also refer informally to a social get-together. The implication here is that an ‘innocent’ search for emails relating to new business (using accepted vernacular within the business context), when investigating persons involved in fraudulent sales activity (for example), will also return detail on other staff members’ personal lives – something they might not want exposed during the investigation, but which will be exposed because the investigation relies on high recall. There are many other faux amis like this, such as ‘naked contract’, which is a business term; however, several emails of a social nature include the term ‘naked’.

A keyword search for ‘graduate’ yielded emails on an intern programme that was run at Enron, discussions about the repayment of student loans, as well as some employees’ plans to enrol for graduate school. In many of these cases, the emails that turn up are in the mail collections of employees in departments different from those of the persons of interest (those involved in the sales of Enron products). They appear as persons of interest purely because they used a word that has a bearing on the one the investigator is using, but the context differs – the topic of the email is different from the ones being targeted.

We could consider these emails harmless in the face of a legitimate criminal or civil investigation – if trust were absolute we would be assured that this personal information would never leak; however, past experience should convince us otherwise. In the event the information is leaked, we could argue that in spite of this a greater good has been served; however, we would be ignoring the principle of privacy, as well as an ethical problem: the liberties of a few cannot be ignored in favour of an aggregate good being served.

In the event of an internal investigation, an enterprise may uncover information about an employee that would constitute information asymmetry to the disadvantage of the employee – a situation which is similar to constant email monitoring in most respects.

4. RESULTS

In this section we present some results and draw some

conclusions on the value of the results.

Our searches indicated that ‘innocent’ keyword searches reveal private information. This is definitely not always the case, especially if terminology that is extremely narrow in scope was used in the search. However, when searching for keywords in a broader context, we found that search results contained emails detailing personal encounters, and emails with humorous content (but which contained explicit depictions). Email with explicit content in and of itself may not be considered private, but it is exactly these emails that could be used by the media to label persons during a trial. These labels may change the perception others have of the persons whose information is leaked, and we may consider that a privacy leak.

Furthermore, searches which included the keywords ‘graduate’ or ‘loan’ returned references to social security numbers and personal identification numbers. It is thus clear that a search with high precision may result in a privacy breach.

The following section provides a discussion of our results using single-link clustering.

4.1 SINGLE-LINK CLUSTERING

Clustering the search results using the proposed method produced the following results: firstly, for broad searches such as ‘hook up’, ‘meeting’, or ‘request’, the resulting clustered corpus had an extremely high diversity index (using equation 1), as we expected. A high diversity index indicates a wide query (a query that returns a diverse set of documents). The ideal is to get a result corpus with a low diversity index (ideally as close to the diversity norm as possible). Narrowing the searches resulted in a sharp drop in the diversity index (bringing it closer to the diversity norm – see equation 2).

Tables 2 and 3 summarise our findings.

Search phrase used                     Number of emails returned
hook up                                1709
hook up request                        434
hook up new request gas pipeline       219

Table 2: Broad keyword search with specific intent

What becomes apparent is the following: firstly, in many cases there is a drop in the number of files returned as more keywords are added (thus the query starts focussing on a particular context). Secondly, depending on the query term, the diversity index and number of clusters start stabilising. However, the stabilisation is far from ideal: in the example presented in table 3 the query would still be deemed too wide and filtered out. A manual inspection of the resulting corpus revealed that the emails were either news-feed summaries, correspondence regarding new business, or new requests for gas pipeline connections (‘hook-ups’). Thus, we have succeeded in filtering out emails that may be included incidentally based on a raw keyword search; however, the more focussed query was still deemed too wide because of the blind clustering


Search phrase used                     Diversity norm   Diversity index   Nr of clusters
hook up                                9.00             811.62            515
hook up request                        6.93             197.01            122
hook up new request gas pipeline       5.83             94.05             57

Table 3: Reduction in diversity for narrowing search (based on clustering of documents as returned in table 2)

technique. Indications are that ‘anonymous’ emails such as news-feeds or industry-related news matters throw off the clustering algorithm.

To enhance the clustering we implemented the BM25 [23] ranking algorithm on the search results, and clustered the top 100 documents. The diversity index was reduced, but this did not perform significantly better than the blind clustering on the full set of results (see table 4).

Search phrase used                     Diversity norm   Diversity index   Nr of clusters
hook up                                5.04             48.67             33
hook up request                        5.04             48.30             33
hook up new request gas pipeline       4.91             46.13             30

Table 4: Reduction in diversity for narrowing search (BM25 ranking, n = 100)
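For reference, a minimal Okapi BM25 sketch on a toy corpus is shown below; the parameter values k1 = 1.5 and b = 0.75 are common defaults, not necessarily those used in the experiment:

```python
import math
from collections import Counter

def bm25_scores(docs, query_terms, k1=1.5, b=0.75):
    """Okapi BM25 score of each document for the given query terms."""
    tokenised = [d.lower().split() for d in docs]
    N = len(docs)
    avgdl = sum(len(t) for t in tokenised) / N
    # Document frequency of each term (number of docs containing it).
    df = Counter(term for toks in tokenised for term in set(toks))
    scores = []
    for toks in tokenised:
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if tf[term]:
                idf = math.log(1 + (N - df[term] + 0.5) / (df[term] + 0.5))
                norm = tf[term] + k1 * (1 - b + b * len(toks) / avgdl)
                score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = ["gas pipeline hook up", "pipeline pipeline pipeline", "weekend plans"]
scores = bm25_scores(docs, ["pipeline"])
# Keep only the highest-ranked documents before clustering:
top = sorted(range(len(docs)), key=scores.__getitem__, reverse=True)[:100]
print(top[0])  # 1 (the most 'pipeline'-heavy document ranks first)
```

Clustering only the top-ranked slice, as in table 4, corresponds to truncating `top` before the distances are computed.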

The single-linkage clustering algorithm relies on the distance between the closest documents in order to cluster. This may lead to inefficient clustering, since only two documents are involved in the decision about merging clusters. To minimize the probability of the clustering being stymied by the single-link algorithm, we implemented a HAC method which uses all the documents in two clusters to determine whether documents should be clustered together.

We discuss our results using GAAC in the following section.

4.2 GROUP-AVERAGE AGGLOMERATIVE CLUSTERING

We implemented a HAC method using GAAC to achieve a more ‘natural’ topic clustering. In GAAC, all documents start off in their own clusters. New clusters are formed by calculating a centroid for each cluster, calculating an error rate based on the distance of each document in a cluster to the cluster centroid (Ward’s method), and merging those clusters which have the smallest error. This forms a dendrogram which eventually yields a single cluster containing all documents.

The clustering employed for the experiment relies on a natural selection of the number of clusters K, which is determined using equation 3:

K = argmin_{K′} [RSS(K′) + λK′]    (3)

where K′ is a candidate number of clusters in the HAC, λ is a penalty for each additional cluster, and RSS(K′) is the residual sum of squares for K′ clusters. RSS is calculated as the sum, over all clusters, of the squared differences between each document and the centroid of its cluster.
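Equation 3 can be illustrated with a small sketch: given candidate flat clusterings (cuts of the dendrogram) over toy one-dimensional document vectors, pick the cut minimising RSS(K′) + λK′. The helper names and the tiny data set are ours:

```python
def centroid(points):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(p[d] for p in points) / len(points)
            for d in range(len(points[0]))]

def rss(clustering, vectors):
    """Residual sum of squares: squared distance of every document to
    its cluster centroid, summed over all clusters."""
    total = 0.0
    for cluster in clustering:
        c = centroid([vectors[i] for i in cluster])
        for i in cluster:
            total += sum((vectors[i][d] - c[d]) ** 2 for d in range(len(c)))
    return total

def choose_k(clusterings, vectors, lam):
    """Equation 3: pick the dendrogram cut minimising RSS(K') + lam*K'."""
    return min(clusterings, key=lambda cl: rss(cl, vectors) + lam * len(cl))

# Two tight groups of points; candidate cuts with K' = 1, 2 and 4.
vectors = [[0.0], [0.1], [5.0], [5.1]]
clusterings = [[[0, 1, 2, 3]], [[0, 1], [2, 3]], [[0], [1], [2], [3]]]
print(len(choose_k(clusterings, vectors, lam=0.5)))  # 2
```

With a large λ the single-cluster cut wins, and with λ near zero every document keeps its own cluster – which is why choosing λ is difficult without prior knowledge of the corpus.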

GAAC uses cosine similarity [24] (as opposed to n-gram distance) to produce the resulting dendrogram, and obtaining K clusters involves cutting the dendrogram at the correct level.

Choosing λ depends on the type of corpus, and is typically very difficult unless some prior knowledge of the corpus content is available. To aid in our analysis, we ran several queries and performed clustering using different values of λ (see figure 1 for the cluster shift using the example queries presented in this paper).

Figure 1: Cluster shifts for different values of λ (number of clusters plotted against λ for the queries [hook up request] and [hook up new request gas pipeline])

Table 5 provides some results from clustering using GAAC with λ set to 15. In most cases, as can be seen in figure 1, the number of clusters tends to show large jumps as λ grows. Smaller values of λ generally result in clusterings that do not skew in favour of narrow queries.

Having presented both the single-link and GAAC implementations, the following section provides a short comparison of the two.

4.3 COMPARISON OF SINGLE-LINK AND GAAC

Comparing both clustering algorithms (without limiting results through ranking), the single-linkage


Search phrase used                     Diversity norm   Diversity index   Nr of clusters
hook up                                10.16            788.76            815
hook up request                        7.96             216.66            176
hook up new request gas pipeline       7.33             110.02            85

Table 5: GAAC, using cosine similarity, λ = 15

algorithm performs favourably (see figure 2) without the need to calculate the penalty value λ for the Residual Sum of Squares (RSS). This is especially favourable considering that the single-link algorithm achieves good results without the need for training. The diversity-index problem is discussed again in section 5.

Figure 2: Single-link compared to GAAC (number of results, clusters, diversity index, and diversity norm per algorithm). Q1 = [hook up], Q2 = [hook up request], Q3 = [hook up new request gas pipeline]

What also became apparent was that the single-link algorithm benefited from the ranking and limiting of search results, whereas the GAAC algorithm did not. In most cases placing a limit of n = 100 on the search results yielded an ideal cluster count of 2 for a query that was deemed very wide to begin with (see figure 3). Inspection of the email subject-lines revealed extremely diverse topics, meaning that the similarity measure used for ranking worked extremely well (as was to be expected) but skewed the clustering algorithm to favour a very wide query.

The following section discusses some of the lessons learned during our investigation into the techniques used in this paper.

5. LESSONS LEARNED

In this section we share some insights and lessons learned from the approach as applied to real-world data that closely reflects the intent of the original proposal: a multi-user system in which certain users were to

Figure 3: Single-link compared to GAAC with results limited to 100 (number of results, clusters, diversity index, and diversity norm per algorithm). Q1 = [hook up], Q2 = [hook up request], Q3 = [hook up new request gas pipeline]

be investigated for a certain crime, and in which a particular approach to finding relevant emails would reveal information of a personal nature about third parties.

Firstly, even our rudimentary implementation performed well, indexing the 517 000 emails in a negligible amount of time and thus not significantly interfering with an investigation. Searches using the inverted term index and the BTree index on top of it completed in sub-second times.

Secondly, and arguably the most important lesson learned from the experiment: reducing privacy breach exposure using a blind topic categoriser is at best difficult – we found in too many cases that the number of clusters returned by the clustering component indicated a too diverse set of topics (see tables 3 and 4). Manual inspection of the returned corpus revealed that many of the clusters were related; however, the single-linkage clustering technique clustered them correctly based on the initial protocol devised. To avoid this problem, we implemented a HAC, using GAAC.

Although it performed reasonably, the clustering still resulted in too wide a query, resulting in queries being filtered. This may mean that the initial diversity norm calculation is over-fitting: it is far too optimistic in terms of what it expects an ideal cluster of search results to be. After manually inspecting some of the search results it became clear that the topics are naturally similar (and the query could therefore be deemed narrow); however, the diversity norm places the query out of reach. The lesson learned here is that in most cases a narrow query will return a narrow corpus which has more than one cluster, and the diversity norm should be modelled after such a narrow corpus. Future investigation will therefore focus on providing a better estimate of what an ideal search result would be – resulting in a diversity norm that more closely matches the diversity index for a narrow query.

Thirdly, the BM25 ranking algorithm was also a fairly good indicator of the number of clusters in the corpus. We found invariably that a wide range in BM25 scores indicated a large number of clusters. Ranks with


little difference between them would typically correspond to a cluster as returned by the clustering algorithm. This indicates a strong correlation between automated blind topic categorisation and ranking based on BM25.

Finally, we found that ‘anonymous’ emails such as news feeds, and specifically emails relating to general industry matters, would generate a large number of clusters. Identifying these emails beforehand should improve the diversity index calculations, and will be investigated. The reports generated on the search queries provided strong evidence that collocation analysis as reported by Wanner and Ramos [25] would provide a better way of detecting news-feed documents and allow better clustering.

The following section provides concluding remarks.

6. CONCLUSION

This article presented an investigation into the application of a framework for preventing a TPPB during a digital forensic investigation.

As details of the Enron saga were released, it was shown that many people not involved in the insider trading and fraudulent activities had some of their personal information released. Our work illustrates that privacy breaches for these third parties may have happened unintentionally.

We presented a proof of concept of the search and filter portion of the previously proposed framework for preventing TPPBs, and showed how its application to the Enron mail corpus could have restricted the amount of embarrassment felt by third parties. In situations where internal investigations are conducted, as is normal in civil liability cases, the tool could be used to avoid leakage of private information, thereby protecting the good standing of the enterprise in the community.

We also showed that the clustering technique, although useful, requires refinement. Our initial study shows that although blind clustering may provide a suitable diversity index, the diversity norm is difficult to calculate correctly. Ignoring pure news-feed emails, and correcting for ideal clusters, may result in more accurate index values which will benefit the search and filter mechanism.
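The paper does not give the exact form of its diversity norm, so as an assumption the sketch below uses Shannon entropy [16] over cluster proportions, normalised by the maximum possible entropy. A news-feed result set that fragments into many evenly sized clusters pushes the norm toward 1, while a focused result set dominated by one cluster keeps it low:

```python
import math

def shannon_diversity(cluster_sizes):
    """Shannon entropy H = -sum(p_i * log p_i) over cluster proportions."""
    total = sum(cluster_sizes)
    ps = [c / total for c in cluster_sizes if c > 0]
    return -sum(p * math.log(p) for p in ps)

def diversity_norm(cluster_sizes):
    """Entropy divided by log(k), its maximum for k clusters; in [0, 1]."""
    k = len(cluster_sizes)
    if k < 2:
        return 0.0
    return shannon_diversity(cluster_sizes) / math.log(k)

print(diversity_norm([5, 5, 5, 5]))   # evenly fragmented (news-feed-like), near 1
print(diversity_norm([17, 1, 1, 1]))  # dominated by one cluster, much lower
```

Discarding pure news-feed emails before computing the norm, as proposed above, would remove the evenly fragmented case that inflates the index.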

Another possible approach is classification using machine learning techniques such as naive Bayesian classification. However, we purposefully avoided using guided training for two reasons. Firstly, since the enterprise’s email will be specific to their industry, a set of training data may not be applicable to the email corpus, resulting in a low level of recall, and the results can therefore not be trusted. Secondly, guided training in such a specific setting means that someone will have to manually examine existing emails and classify them so that they can be used as training data. This again poses a risk to privacy. The use of Natural Language Processing (NLP) techniques to detect context may be of value; we have started investigating this possibility, and will report on it elsewhere.
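To make concrete why guided training implies a manual labelling step, the sketch below trains a minimal multinomial naive Bayes classifier with Laplace smoothing. All labels and messages are invented for illustration; this is the approach the text deliberately avoids, precisely because every training item must first be read and labelled by a person:

```python
import math
from collections import Counter, defaultdict

def train_nb(labelled):
    """labelled: list of (tokens, label). Returns class priors and word counts."""
    priors = Counter(label for _, label in labelled)
    words = defaultdict(Counter)
    for tokens, label in labelled:
        words[label].update(tokens)
    return priors, words

def classify(tokens, priors, words, vocab_size):
    """Pick the label maximising log P(label) + sum log P(token | label)."""
    best, best_lp = None, float("-inf")
    total = sum(priors.values())
    for label in priors:
        lp = math.log(priors[label] / total)
        n = sum(words[label].values())
        for t in tokens:
            # Laplace (add-one) smoothing handles unseen words
            lp += math.log((words[label][t] + 1) / (n + vocab_size))
        if lp > best_lp:
            best, best_lp = label, lp
    return best

# Every item below had to be examined and labelled by hand -- the privacy
# risk described in the text.
labelled = [(["quarterly", "fraud", "trading"], "relevant"),
            (["newsletter", "industry", "update"], "news-feed"),
            (["industry", "digest", "weekly"], "news-feed")]
priors, words = train_nb(labelled)
vocab = len({t for toks, _ in labelled for t in toks})
print(classify(["industry", "newsletter"], priors, words, vocab))
```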

REFERENCES

[1] W. J. van Staden, “Third Party Privacy and the Investigation of Cybercrime,” in Advances in Digital

Forensics IX, G. Peterson and S. Shenoi, Eds.

Orlando, Florida, USA: Springer, 2013.

[2] N. Beebe and J. Clark, “Dealing with Terabyte Data Sets in Digital Investigations,” in Advances in Digital Forensics, ser. IFIP International Federation for Information Processing, vol. 194. Springer US, 2005, pp. 3–16.

[3] N. L. Beebe and J. G. Clark, “Digital forensic text

string searching: Improving information retrieval

effectiveness by thematically clustering search results,” Digital Investigation, vol. 4, pp. 49–54,

2007.

[4] G. Palmer, “A Road Map for Digital Forensic

Research,” DFRWS, Utica, NY, Tech. Rep., 2001.

[5] Technical Working Group for the Examination of Digital Evidence, Forensic Examination of Digital

Evidence: A guide for law enforcement, 2002.

[6] R. Nolan, C. O’Sullivan, J. Branson, and C. Waits,

“First Responders Guide to Computer Forensics,”

Carnegie Mellon Software Engineering Institute, Pittsburgh, Pennsylvania, Tech. Rep., March 2005.

[7] C. P. Pfleeger and S. L. Pfleeger, Security in

Computing, Fourth ed. Prentice Hall, 2012.

[8] S. Fischer-Hübner, IT Security and Privacy: Design and Use of Privacy Enhancing Security Mechanisms. Springer-Verlag, 2001.

[9] W. v. Staden and M. S. Olivier, “On Compound Purposes and Compound Reasons for Enabling Privacy,” J. UCS, vol. 17, no. 3, pp. 426–450, 2011.

[10] S. D. Warren and L. D. Brandeis, “The Right to Privacy,” Harvard Law Review, vol. 4, no. 5, pp. 193–220, 1890.

[11] “OECD Guidelines on the Protection of Privacy and Transborder Flows of Personal Data,” Organisation

for Economic Cooperation and Development, Tech.

Rep., 1980.

[12] “EU Data Protection Directive 95/46/EC,” Tech.

Rep., October 1995.

[13] R. Hes and J. Borking, Eds., Background Studies and Investigations 11: The Road to Anonymity, Revised ed. Den Haag: Registratiekamer, Dutch DPA, August 2000.

[14] J. Camenisch, A. Shelat, D. Sommer, S. Fischer-Hübner, M. Hansen, H. Krasemann, R. Leenes, and J. Tseng, “Privacy and Identity Management for Everyone,” in DIM’05. Fairfax, Virginia, USA: ACM, November 2005.


[15] W. B. Cavnar and J. M. Trenkle, “N-Gram-Based

Text Categorisation,” in SDAIR-94, Third Annual

Symposium on Document Analysis and Information

Retrieval, 1994, pp. 161–175.

[16] C. E. Shannon, “A Mathematical Theory of

Communication,” The Bell System Technical Journal,

vol. 27, pp. 379–423, July 1948.

[17] T. Kohonen, “The self-organizing map,” Proceedings of the IEEE, vol. 78, no. 9, pp. 1464–1480, 1990.

[18] B. K. L. Fei, J. H. P. Eloff, M. S. Olivier, and

H. S. Venter, “The use of self-organising maps

for anomalous behaviour detection in a digital investigation,” Forensic Sci. Int., vol. 162, no. 1–3, pp. 33–37, 2006.

[19] C. D. Manning, P. Raghavan, and H. Schütze, Introduction to Information Retrieval. Cambridge, England: Cambridge University Press, 2009.

[20] E. A. Hall, “The application/mbox Media Type,” RFC 4155, September 2005, http://datatracker.ietf.org/doc/rfc4155/.

[21] M. F. Porter, “An Algorithm for Suffix Stripping,” in Readings in Information Retrieval, K. Sparck Jones and P. Willett, Eds. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1997, pp. 313–316. [Online]. Available: http://dl.acm.org/citation.cfm?id=275537.275705

[22] J. H. Ward, Jr., “Hierarchical Grouping to Optimize an Objective Function,” Journal of the American Statistical Association, vol. 58, no. 301, pp. 236–244, 1963.

[23] S. E. Robertson, S. Walker, S. Jones, M. M. Hancock-Beaulieu, and M. Gatford, “Okapi at TREC-3,” in Proceedings of the Third Text REtrieval Conference (TREC-3), 1996, pp. 109–126.

[24] G. Salton, A. Wong, and C. S. Yang, “A Vector Space Model for Automatic Indexing,” Communications of the ACM, vol. 18, no. 11, pp. 613–620, 1975. [Online]. Available: http://dblp.org/db/journals/cacm/cacm18.html

[25] L. Wanner and M. A. Ramos, “Local Document Relevance Clustering in IR Using Collocation Information,” in Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC), 2006.


A MULTI-FACETED MODEL FOR IP-BASED SERVICE AUTHORIZATION IN THE EDUROAM NETWORK

Luzuko Tekeni∗, Reinhardt A. Botha† and Kerry-Lynn Thomson‡

∗ School of ICT, Nelson Mandela Metropolitan University, Port Elizabeth, South Africa, e-mail: [email protected]
† School of ICT, Nelson Mandela Metropolitan University, Port Elizabeth, South Africa, e-mail: [email protected]
‡ School of ICT, Nelson Mandela Metropolitan University, Port Elizabeth, South Africa, e-mail: [email protected]

Abstract: Eduroam provides a facility for users from participating institutions to access the Internet at any other participating visited institution using their home credentials. The authentication credentials are verified by the home institution, while authorization is done by the visited institution. The user receives an IP address through the visited institution, and accesses the Internet through the firewall and proxy servers of the visited institution. While this provides great flexibility, it competes with security: access may be wrongfully provided or denied to services that use IP-based authorization. This paper enumerates the risks associated with IP-based authorization in the eduroam network by using Digital Library access as an example. The tension between security and flexibility suggests that a multi-faceted approach to the problem is needed. This paper presents such a multi-faceted model that can be used holistically to consider options for IP-based authorization.

Key words: eduroam, authorization

1. INTRODUCTION

In the current generation, the number of users who connect to the Internet using mobile devices has increased significantly [1]. Most mobile users would like to get connectivity everywhere, including at home and at educational institutions. TERENA (the Trans-European Research and Education Networking Association) proposed a service for WLAN roaming between educational institutions and research networks [2]. This WLAN roaming service is called eduroam (EDUcation ROAMing). Eduroam provides users (researchers, teachers and students) with Internet access using their home credentials at any institution around the globe which participates in eduroam. This access occurs with minimal administrative overhead [3, 4].

Institutions see eduroam as beneficial to them as academic staff members and students frequently travel between institutions. These students and academic staff members can use their home institution credentials to log in at visited institutions that participate in the eduroam service. In eduroam, the authentication credentials are verified by the home institution, while authorization is done by the visited institution [5]. The student or academic staff member receives an IP address in the range of the visited institution, and accesses the Internet through the firewall and proxy servers of the visited institution. However, access granted to services that authorize via an IP address of the visited institution may include access to services that are not allowed at the home institution.

In eduroam there is no specific policy in place regarding how authorization should be handled at the visited institution. The NREN (National Research and Education Network) of a country must sign the eduroam compliance statement [6] in order to be part of the eduroam federation. The unavailability of an eduroam authorization policy is due, possibly, to mismatch and scalability issues at the institutional level, as well as at national or global levels. However, at the institutional level, the basic and acceptable access policy approach that most of the institutions are following is that the visited institution denies access to their local resources to the users who are not part of the domain [7], while it only grants them access to the Internet using the network infrastructure of the visited institution. To facilitate flexibility and ease of maintenance, Internet-based services may use IP-based authorization.

This paper enumerates problems with IP-based service authorization in the eduroam network. Further, the paper proposes a multi-faceted model to address the problem. There are many services that authorize users via an IP address when connecting to the Internet using eduroam and it is impossible to analyze all of them. To provide focus, therefore, this paper will primarily consider Digital Library Service Providers.

The rest of this paper is organized as follows. Firstly, in Section 2. an overview of the eduroam service is given. Since this paper addresses the problem of IP-based authorization, Section 3. introduces the IP-based authentication and authorization process that can be encountered when roaming between institutions. Thereafter, Section 4. illustrates the problem using Digital Library access as an example. Section 5. analyzes the risks in much more detail. This is followed by considering the interaction between stakeholders and possible technologies that can help in Section 6. However, not all solutions

Based on: “Concerns Regarding Service Authorization by IP address using Eduroam”, by Luzuko Tekeni, Reinhardt Botha and Kerry-Lynn Thomson, which appeared in the Proceedings of Information Security South Africa (ISSA) 2014, Johannesburg, 13 & 14 August 2014. © 2014 IEEE


are suitable for all circumstances. Therefore Section 7. proposes a multi-faceted model that can help in deciding how to address the problem. Section 8. discusses how the model can be used when considering a decision. Finally, Section 9. concludes the paper and provides the next steps for future work.

2. AN OVERVIEW OF EDUROAM

This section provides a general overview of the eduroam service. It does so by considering its history and by providing an overview of the infrastructural components necessary to implement the eduroam service.

2.1 The Origin

The eduroam service started as an idea of combining a RADIUS-based infrastructure with the IEEE 802.1X protocol for roaming Internet access across institutions in Europe [8]. The actual eduroam service started in 2003 within the TERENA Task Force on Mobility, TF-Mobility [9]. During that time many institutions showed an interest in eduroam by joining. Those institutions were from the Netherlands, Finland, Croatia, the United Kingdom, Portugal and Germany [10]. Gradually, other NRENs in Europe began joining what was then named eduroam [1]. In December 2004, Australia became involved and was the first non-European country to join eduroam [11]. According to the eduroam website [12], eduroam “is now available in 69 territories worldwide”, but is only available at certain locations within those countries, as long as their NRENs have signed the eduroam Compliance Statement [6].

2.2 The Infrastructure

The eduroam infrastructure is based on hierarchically organized RADIUS proxy servers [13] and the IEEE 802.1X protocol [4]. This initiative makes use of three levels of RADIUS proxy servers, namely: the Top-level server (Confederation), National-level servers (Federation) and Institutional-level servers (Edge) [3]. The Top-level server acts as the bridge between National-level servers for global communication, while the National-level server is responsible for connecting institutions within the country. Every institution wanting to join eduroam connects to its National-level server and deploys a dedicated server for eduroam.

Figure 1 shows a User who wants to connect to eduroam at institution A (visited institution), whose home institution is institution B (home institution). In this case, the supplicant software of the user contacts the Access Point (AP) using 802.1X with the EAP (Extensible Authentication Protocol) protocol.

The EAP protocol provides integrity and confidentiality to protect the transportation of user credentials throughout the hierarchy of RADIUS servers [14]. Then the AP contacts its local RADIUS server for authentication. The RADIUS server examines the realm part of the username, and as it

Figure 1: eduroam Infrastructure

is not a local realm in this example, the RADIUS server then proxies the request through the hierarchy of RADIUS servers until institution B is reached. The RADIUS server of institution B decapsulates the EAP message and verifies the credentials of the user. It can either accept or deny the request by proxying the results in the reverse order using the same path. The AP at institution A informs the user of the outcome (accept or deny) and the connection is established (if the response is accept).
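The realm examination described above amounts to a simple routing decision. The sketch below illustrates it in Python (the realm and proxy names are invented for illustration; real deployments configure this in the RADIUS proxy itself, not in application code):

```python
# Hypothetical sketch of the routing decision a visited institution's RADIUS
# server makes on the realm part of an eduroam username, e.g. "alice@inst-b.ac.za".

NATIONAL_PROXY = "flr.example.ac.za"   # hypothetical federation-level (national) server
LOCAL_REALM = "inst-a.ac.za"           # hypothetical realm of the visited institution

def next_hop(username):
    """Return where the authentication request should be sent next."""
    if "@" not in username:
        return None                     # malformed username: reject locally
    realm = username.rsplit("@", 1)[1].lower()
    if realm == LOCAL_REALM:
        return "local"                  # authenticate against the local user store
    return NATIONAL_PROXY               # proxy up the RADIUS hierarchy

print(next_hop("alice@inst-b.ac.za"))   # foreign realm: forwarded up the hierarchy
print(next_hop("bob@inst-a.ac.za"))     # local realm: handled by this institution
```

In the actual hierarchy the national server repeats the same decision, forwarding the request toward the user's home institution, which alone verifies the credentials.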

At the time of writing (late 2014) the South African NREN handled about 10000 (non-unique) authentication requests per day (personal communication, S. Miteff, 5 Jan 2015). The next section briefly explains the IP-based authorization process.

3. IP-BASED AUTHORIZATION PROCESS

Some services at universities, such as Digital Libraries, use an IP address to authorize users. This presents a potential problem when using eduroam. Figure 2 shows home and visited institutions and their Service Provider. In this example, before a user can be given any kind of access, the IP-based process for authentication and authorization must first take place. The user then roams between the two institutions using his or her home institutional credentials.

Figure 2: IP-Based Process

When the user reaches the visited institution and connects to eduroam, the following happens:


1. The user tries to log in at the visited institution using his or her home credentials.

2. The visited institution examines the realm part of the username and sees that the user belongs to the home institution. It then sends the user credentials through the hierarchy of RADIUS servers to the home institution for authentication (verification).

3. The home institution decapsulates the message and verifies the user's credentials. It can either accept or deny the request by sending the response back to the visited institution.

4. The visited institution receives the response and grants Internet access if the result is positive (accepted), and it assigns an IP address to the user. This IP address can be assigned from a general pool of IP addresses or from a special pool of IP addresses used for roaming users.

5. The user accesses the resource (service X) of the Service Provider using the IP address assigned by the visited institution.

6. The Service Provider verifies the validity of the IP address and gives permission (or not) to the user based on the IP address provided by the visited institution.

This approach certainly alleviates administration, but may have interesting side-effects on the experience of the roaming user.
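Step 6 above reduces to a source-address check against the institutional ranges the Service Provider has on file. A minimal sketch of such a check, using Python's standard ipaddress module (the CIDR ranges and institution names are invented for illustration):

```python
import ipaddress

# Hypothetical IP ranges a Service Provider associates with subscribing
# institutions; access is granted purely on the client's source address.
SUBSCRIBED_RANGES = {
    "Institution1": ["196.21.0.0/16"],
    "Institution3": ["155.232.0.0/16"],
}

def authorize(client_ip):
    """Return the subscribing institution whose range contains the address, or None."""
    addr = ipaddress.ip_address(client_ip)
    for institution, ranges in SUBSCRIBED_RANGES.items():
        if any(addr in ipaddress.ip_network(r) for r in ranges):
            return institution
    return None

# A roaming user holding a visited-institution address is authorized (or
# denied) exactly as if he or she were a member of that institution.
print(authorize("196.21.4.17"))   # falls in Institution1's range
print(authorize("10.0.0.5"))      # no subscription: None
```

Because the check sees only the address, a visitor assigned an Institution1 address inherits Institution1's subscriptions, regardless of what the home institution subscribes to.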

4. PROBLEM IDENTIFICATION

This section explains the effects that IP-based authorization can have on the experience of roaming users. To illustrate, this section uses “Digital Library access” as an example of a service. This section firstly uncovers what the typical legal agreement between universities and Service Providers stipulates. Thereafter, an example is considered where the authorization may contravene these stipulations.

A Service Level Agreement (SLA) is “an agreement between an IT Service Provider and a customer” [15]. Expectations between Service Providers and their customers are typically governed through SLAs. To understand the stipulations between Digital Library providers and universities, three Service Level Agreements were reviewed. The gist of these agreements was the same in all cases. Figure 3 shows an extract from the South African Library Consortium Site License Agreement with Emerald.

Of particular interest is the mention of “walk-in users”. Walk-in users should only be able to access Licensed Material from computer terminals within the Library premises. This implies that the users must be within the physical premises of the Library to access the service.

Figure 3: Emerald Licence Agreement

In a world with wireless Internet and eduroam, this SLA can be considered fairly antiquated. Eduroam users visiting from another institution would not be classified as walk-in users if they accessed the licensed material through their eduroam-authenticated device, and may thus breach the SLA. This is assuming that the eduroam users access the Service Provider using an IP address from a pool that has access. On the other hand, if visiting eduroam users are allocated to IP pools that don't have access, the value of roaming authentication is decreased, as they will also not have access to services they have access to at their home institution.

What can be seen here is tension between flexibility and security. The flexibility offered by IP-based authorization comes at the expense of security. More secure solutions will mean more effort on the part of stakeholders and will thus reduce flexibility. The rest of this paper considers this flexibility-security tension and proposes a multi-faceted model to help us to decide on a road ahead.

The next section will consider the risks in more detail.

5. RISK ANALYSIS

The previous section pointed out that IP-based authorization may be problematic. This section examines the risks associated with IP-based authorization from the vantage point of Digital Library services.

According to [16, 17], risk can be defined as the possibility of an undesired outcome or the absence of the desired outcome to a service. It “is a future event that may or may not occur” [16]. This paper explores the risks from different stakeholder perspectives: the Users, the Service Providers, and Libraries at universities. Each of these risk perspectives is expounded on next.

5.1 The Users

As discussed, when Users visit a particular institution, they could have access to services that they normally do not have when they are at their home institution. These Users


can be regarded as ‘happy’ users, because they have access to services to which they are not subscribed. However, the opposite could also be true. Users may have access to certain services at their home institution that become unavailable to them at a visited institution. These Users could be regarded as ‘unhappy’ users. In these scenarios, the situation can be seen as unfair to some of the users, while others are enjoying the benefits of accessing services that are not available to them at their home institution. This could impact the users either positively or negatively, depending on the specific circumstances.

To further clarify this, consider the South African academic landscape. Table 1 below shows a comparison of digital Libraries available at selected South African Universities and Research Institutes. Note that, for brevity, only a few selected digital Libraries at each institution are shown. Further, as this is merely illustrative, the names of institutions are not used. The selected institutions participate in eduroam in South Africa.

Table 1: Digital Libraries at Institutions (Comparison of eduroam Institutions)

Digital Libraries      Institution1  Institution2  Institution3
AccessEngineering      �             �             �
AccessPharmacy         �             �             �
AccessScience          �             �             �
ACM                    �             �             �
African Journals       �             �             �
BioMed Central         �             �             �
Emerald                �             �             �
IEEE Xplore            �             �             �
ISI Web of Knowledge   �             �             �
LexisNexis Academic    �             �             �
Sabinet                �             �             �
SAGE                   �             �             �
ScienceDirect          �             �             �

Based on Table 1, the risk varies depending on the institution that the User is visiting. For example, if the User visits Institution1 from Institution3, that User can access the ACM database; whereas at Institution3 he or she does not have access to the ACM database. While users from Institution3 will be very happy with the situation (as they have more access), users from Institution1 visiting Institution3 will be less happy, as they do not have access to the database that they usually have when at the home institution. Table 2 summarizes the two risks to users.

Table 2: Risks Facing Users

Risk  Description
1A    Users cannot access material they legally have access to
1B    Users could have access to material that they should not have

5.2 Service Providers

The Service Providers are responsible for providing a particular service to the Users. In this context, the Service Providers could find themselves in a position of losing a portion of their income when the users access the service. In other words, the user might visit the institution just to access the service that is unavailable while he or she is at the home institution. On the other hand, the user might unsubscribe from a particular service intentionally because he or she knows that the service is available to the neighbors, and he or she could just go and visit in order to access it. To some extent Service Providers depend on the honesty of the clients to whom they provide the service. If an authorization issue exists on the side of the client, the Service Provider is at risk. The situation needs to be controlled by the clients because, if the Service Provider sets an SLA, there is no assurance that the clients will enforce the SLA effectively. Table 3 summarizes the above-mentioned risks.

Table 3: Risks Facing Service Providers

Risk  Description
2A    Loss of income when unauthorized users are accessing the service
2B    Service misuse by institutions

5.3 Libraries at universities

Many Libraries at universities use an IP address to authorize users. This presents a potential risk when using eduroam. The eduroam user is given an IP address when visiting a particular institution which gives him or her access to services that could be unavailable at the home institution. This would result in an unauthorized user gaining access to certain services of the visited institution. Furthermore, the visited institution might breach the SLA if these users access the Licensed Material through their own devices and not on the Library premises, as stated in the license agreement in Figure 3 above. Libraries, therefore, run the risk of being held legally liable. Libraries also do not want to subscribe to unused (and therefore unnecessary) services.

So if, at institution X, the library staff members capture their online database usage for the purpose of terminating the contract if an online database is not used, visitors accessing these databases through eduroam may result in incorrect statistics being captured. This could lead to the Library not terminating the use of an online


database. At first this risk may seem negligible, but it is worth remembering that services in this category (possible cancellation) are already little used. Hence, even a small number of visitors accessing the service could multiply the number of accesses, thereby rendering the service in the expensive, but needed, category. There is no tracking of users and their activities in the current eduroam infrastructure and, therefore, it is impossible to assess the extent of visitor access.

Universities have thousands of users to manage, potentially including several visitors. Keeping track of registered users and visitors could be challenging in this environment. Service Providers are also at risk as their services could be misused by the visitors, since they know they are not paying for it. However, maintaining access records on an individual level rather than at an institutional level is certainly more costly. Table 4 enumerates the main risks discussed in this section that affect Libraries at universities.

Table 4: Risks Facing Libraries at universities

Risk  Description
3A    Libraries might breach the SLA if users (visitors) are accessing the licensed material from their own devices
3B    If Libraries are capturing their online database usage, visitors may lead to incorrect statistics being captured. This could lead to an unnecessary service subscription, which is only used by visitors
3C    Service misuse by visitors

5.4 Risk analysis summary

To summarise the risks affecting the Users, Service Providers and Libraries at universities, consider Table 5 below.

Table 5: Risk analysis summary

Risk                                              Users  SP  LB
1A: Cannot access                                 -      x   x
1B: Could access                                  +      -   -
2A: Loss of income                                x      -   -
2B: Service misuse                                +      -   -
3A: Breaching SLA                                 x      x   -
3B: Unnecessary service and incorrect statistics  x      x   -
3C: Same as 2B                                    +      -   -

+ = Positive Impact, - = Negative Impact, x = Not Applicable
SP = Service Providers, LB = Libraries

As shown in Table 5, if the users cannot have access to a particular service at the visited institution, it affects them negatively. However, if they do have access to a resource they normally do not have, it affects them positively. Service Providers and the Libraries at universities are both potentially negatively affected by this situation, albeit not at the same time. The Service Providers are at risk of losing money owing to missed sales opportunities. Libraries also miss opportunities for patrons to pay membership fees for access to resources. Breaches of SLAs negatively impact Libraries as they open up the possibility of legal action. In addition, Libraries may not be able to trust usage records, for example, not knowing how many access requests really originated from a specific institution.

Although this section did not try to quantify the risks to the various stakeholders, it did show that there are risks. The question now is how to address these risks. The next section considers the relationships among stakeholders and discusses various possible technologies and issues to be considered.

6. THE RELATIONSHIPS AMONG STAKEHOLDERS

The various stakeholders have different relationships among them. Figure 4 abstractly depicts these relationships.

Figure 4: The Relationships among Users, Libraries and Service Providers

Each relationship, as seen in Figure 4, will be briefly described to illustrate how the previously identified risks are mapped and addressed, along with a possible solution for each. Eduroam in each country is managed and monitored by its NREN [18]. The NREN rolls out the eduroam service to the interested institutions who want to have the eduroam service on their campus. In Figure 4, as marked by “X”, the point of intersection among the Users, Service Providers and Libraries represents where the NREN is placed. The following subsections look at the relationships among the Users, Service Providers and Libraries at universities. Each relationship introduces technologies or actions that might help address the issues. Note that these technologies are not new. On the contrary, they are currently used in various ways by the relevant stakeholders. The discussion in the next section serves to introduce the technologies and actions. Section 7. will combine them into a multi-faceted model that allows us to argue more easily the effect of choices we make in using IP-based authorization in eduroam.


6.1 Users and Libraries

Consider first the relationship between Users and the Libraries, as marked by “A” in Figure 4. According to risks 1A and 1B, Users cannot access material they legally have access to (at their home institution) when they visit another institution, because the visited institution has no subscription to the same material. However, if the users are from the visited institution going to the opposite institution, they could have access to material to which they should not have access, because the opposite institution has a subscription to the material. A possible solution to this involves the use of a Virtual Private Network (VPN) tunnel. A VPN tunnel provides complete data privacy and integrity for Users who access the network from outside their Intranet in a secure manner [19]. For instance, by enabling VPN between the user and the home institution, a secure tunnel will be established. This will help to improve the IP-based level of security, as highlighted by risk 3A. In other words, this is adding another layer of security to the IP-based authorization process. Figure 5 shows how a VPN tunnel between the user and the Service Provider in the eduroam network can address the risk of a user not having access to services when visiting institutions.

Figure 5: VPN Tunnel in eduroam

However, this only addresses the risk from the perspective of a user not having access to something to which he or she normally has access.

Figure 5 depicts the use of a VPN tunnel. Steps 1 to 3, 5 and 6 were described in Figure 2. Step 4 secures the IP address that is normally assigned by the visited institution to access a service. However, access to the service is obtained via the VPN tunnel being established to allow one-to-one communication rather than consulting the visited institution. Access to Service X is now requested from the home institution, and not from the visited institution.

This approach eliminates the risk identified by 3C, which concerns service misuse by visitors. For technology-savvy users, the use of VPN could be a straightforward concept to understand, and they will be able to VPN back to their home institution when visiting others. However, for non-technical users it may not be as straightforward. Hence, training sessions should be provided to educate users with regard to how to connect to the eduroam service at visited institutions and how to VPN back to the home institution.

6.2 Libraries and Service Providers

The relationship between the Libraries and the Service Providers is marked by “B” in the diagram. In the SLA analysed earlier, it was highlighted that Libraries might breach the SLA if the visitors access the licensed material from their own devices (as described in risk 3A), and the Service Providers may lose a portion of income (risk 2A). Their service could be misused by the users of the institution (risk 2B).

As a possible solution, the Libraries must engage the Service Providers in connection with the SLA presented to them. It is important to understand that eduroam was introduced far more recently than the SLAs. It might be of no great concern to the Service Providers that visitors are being given access to their service, or it may require an adjustment of the SLA. The SLA presented in Figure 3 (Section 4) needs to be revised by the authorized entities. Another option is for the Libraries to consider joining existing federations or consortia that provide more controlled access. These can be country-wide federations, which also come with policies on how a service can be accessed.

6.3 Service Providers and Users

The relationship between the Users and the Service Providers is marked by “C” in the diagram. When the user accesses online material such as Digital Libraries, the “identifier” is merely an IP address [20]. However, IP addresses are easy to spoof. Hence, federated identity domain technologies need to be investigated to determine how the attributes of the User can be transported in order to provide access control to the website of the Service Provider. Even though a high-level analysis of the risk involved may not identify major risks, it is worth noting that possible controls may already exist to address this situation. Since the issue here is really that of the identity of a user being used from outside the institution, solutions may exist in different environments.
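The weakness of such IP-based authorization can be sketched briefly. In the toy check below, the subscriber ranges and institution names are hypothetical; the point is only that the decision depends on the source address, not on the user's identity, so any device on a subscribed campus network, including a visitor's, is authorized:

```python
from ipaddress import ip_address, ip_network

# Hypothetical subscriber ranges held by a Digital Library provider.
# Authorization depends only on the source IP, not on who the user is.
LICENSED_RANGES = {
    "University A": ip_network("146.64.0.0/16"),
    "University B": ip_network("155.232.0.0/16"),
}

def is_authorized(client_ip: str) -> bool:
    """Grant access if the request originates from any subscribed campus."""
    addr = ip_address(client_ip)
    return any(addr in net for net in LICENSED_RANGES.values())

# A visitor roaming at University A is authorized simply because the
# visited institution assigned them a campus address (risks 1A/1B, 3A).
print(is_authorized("146.64.10.20"))   # address inside University A's range
print(is_authorized("196.21.5.5"))     # off-campus address is rejected
```

Routing the user's traffic through a VPN to the home institution changes only which of these ranges the request appears to come from, which is exactly why the VPN both restores legitimate access and preserves the underlying IP-based weakness.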

One way to solve this is to use virtual direct dialing into a campus server through the L2F (Layer Two Forwarding) protocol using PPP/SLIP (Point-to-Point Protocol/Serial Line Interface Protocol) connections [21]. A remote user's personal device is given an IP address of the institution through the L2F protocol, which provides a virtual direct dialing service into protected (private) networks through the (public) Internet Service Providers [22]. Figure 6 shows a high-level overview of the virtual dialing approach.

As can be seen in Figure 6, the remote users connect to a NAS (Network Access Server). These NASs can be geographically located anywhere in the world [22]. In turn they connect through the Internet. The network then checks whether the user is dialing up (in this case it is). Then the L2F protocol takes control using


Vol.106 (2) June 2015 SOUTH AFRICAN INSTITUTE OF ELECTRICAL ENGINEERS 89

Figure 6: Virtual Dialling Approach

PPP/SLIP, communicating with the home gateway of the user, after which the user has access to the licensed material. However, this approach suffers from many disadvantages [21]. Hardware and software packages are required to set up this kind of communication, and the solution can be expensive to deploy. Furthermore, it has a limited calling area: if a remote user is located far away, it might not work. Clearly this approach has several disadvantages that are particularly problematic for a large-scale network.

Another technology, a Web proxy, was therefore considered. According to Duke and Yu [21], “The Web proxy is designed primarily to secure an Intranet, to control and monitor employees' access to the Internet”. Web proxy servers are normally placed outside the firewall of a company. In this context, visitors accessing the Internet using eduroam can be forced to go through a Web proxy server which restricts them from accessing the licensed material. However, this will inconvenience legitimate users at the home institution when they connect to the Internet using eduroam. Hence, network administrators need to find a scalable way to implement these proxy servers. Figure 7 illustrates how a Web proxy works between a user and an institution holding the required service.

Figure 7: Web Proxy Server at Institution

The remote user tries to access licensed material using a proxy-enabled browser. The request is directed to a proxy server located on campus. When the proxy server receives the request and forwards it to the institution, the institution validates that the IP address came from the proxy and authenticates the request. Thereafter, an internal system returns the data to the proxy server, which in turn sends the retrieved data back to the remote user. However, this only solves the problem of visitors accessing the licensed material at the visited institution. The approach may also suffer in terms of distance coverage: at times it could be very difficult to reach the proxy server over a long distance. Since the problem here involves federated identity management domains, the use of Shibboleth could solve the problem.

Shibboleth acts as an intermediate third party between the home institution and the Service Provider on behalf of the visited institution [20]. Figure 8 shows a high-level view of how Shibboleth could be used in eduroam.

Figure 8: Shibboleth in eduroam

A Where Are You From (WAYF) database is used to identify the user; once that is done, the Service Provider is able to access the attributes of the user from the Attribute Authority (AA) at the home institution of the user. The AA is the database, located at the home institution under the supervision of the Identity Provider (IdP), that stores the attributes of the user. Shibboleth is able to identify the services that are allowed at the home institution for the user, using SAML (Security Assertion Markup Language) [23] query request/response messages. This will, however, require services to evaluate SAML attributes in order to perform authorization. Shibboleth works well if an institution forms part of a federation. Federated identity management allows users to hand out identity information dynamically across the distributed domain of the federation. On the negative side, Service Providers also need to co-operate and perform authorization based on SAML attributes.
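As a rough illustration of this attribute exchange, the sketch below assembles a bare SAML 2.0 AttributeQuery using Python's standard library. The entity IDs and the NameID are hypothetical, and a real Shibboleth deployment adds message IDs and timestamps, signs such messages, and exchanges them over defined bindings, none of which is shown here:

```python
import xml.etree.ElementTree as ET

# Standard SAML 2.0 namespaces for protocol and assertion elements.
SAMLP = "urn:oasis:names:tc:SAML:2.0:protocol"
SAML = "urn:oasis:names:tc:SAML:2.0:assertion"

def build_attribute_query(sp_entity_id: str, name_id: str,
                          attribute_name: str) -> str:
    """Assemble a minimal AttributeQuery: who is asking (Issuer),
    about whom (Subject/NameID), and which attribute is requested."""
    query = ET.Element(f"{{{SAMLP}}}AttributeQuery", Version="2.0")
    issuer = ET.SubElement(query, f"{{{SAML}}}Issuer")
    issuer.text = sp_entity_id
    subject = ET.SubElement(query, f"{{{SAML}}}Subject")
    name = ET.SubElement(subject, f"{{{SAML}}}NameID")
    name.text = name_id
    ET.SubElement(query, f"{{{SAML}}}Attribute", Name=attribute_name)
    return ET.tostring(query, encoding="unicode")

# Hypothetical entity IDs; eduPersonScopedAffiliation (by OID) is a
# typical attribute an SP could evaluate for authorization decisions.
xml = build_attribute_query(
    "https://sp.example.ac.za/shibboleth",
    "user@home.example.ac.za",
    "urn:oid:1.3.6.1.4.1.5923.1.1.1.9",
)
print(xml)
```

The point of the sketch is the division of labour: the Service Provider never sees the user's credentials, only asserted attributes it must then evaluate, which is precisely the co-operation burden noted above.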

From the above discussion it is clear that several potential solutions exist, each with its own advantages and disadvantages. The next section introduces a multi-faceted model that can be used to reason about these different options in a more holistic way.

7. A MULTI-FACETED MODEL

Before introducing the model components in sub-section 7.2, sub-section 7.1 first positions the model conceptually.

7.1 Conceptual positioning

It has become commonplace to think about business in terms of strategic, tactical and operational decisions. To understand where the proposed model fits, two important assumptions that have been made must be considered.

Firstly, the proposed model assumes that participants in the eduroam network have made the strategic decision


to provide cost-effective access to relevant information and services to users while travelling. Of course, what exactly “cost-effective access” means to each participating university may differ vastly.

Secondly, the proposed model assumes that the underlying network operations are in place and that they are being configured and managed by the various NRENs.

In the bigger picture, our model is therefore conceptually positioned as a tactical model, as shown in Figure 9. Decisions in the proposed model will certainly be influenced by the strategic direction assumed by participating universities, but it is not overly concerned with the operations of the underlying network. This does not mean that the proposed model might not influence the strategic direction, nor does it mean that decisions made will not influence the operations. However, its focus is on making tactical decisions to manage the tension between flexibility and security when considering IP-based authorization decisions.

Figure 9: Positioning of the Model

As an example, consider a tactical decision to implement federated identity management technologies such as Shibboleth. This will clearly impact the configuration of the network devices and the operation of the network - an operational concern. The proposed model, however, is deliberately silent in terms of operational implementation issues. Similarly, strategic decisions such as joining federated identity consortia should be made with the view of supporting a bigger strategic vision in order to allow access to relevant information to all users, even while they are traveling.

7.2 Model Components

This section briefly discusses the components that constitute the proposed model. The model considers the decisions that must be made at each of its layers. Figure 10 presents the proposed model.

The model is made up of three main components: strategic decisions, tactical planning and operational implementation. Each of the components represents various decisions that must be made. These decisions are multi-faceted in that various options are available, but these choices are not independent. A choice made for one component influences the choices for other components.

Figure 10: The Multi-faceted Proposed Model

The next sub-sections briefly consider the nature of each of the components by positioning some of the issues revealed in Section 6 within the model components.

Strategic decisions:

Strategic decisions are made to guide the direction an organization takes. In order to provide access to relevant information to researchers, certain decisions need to be made. These decisions are often embodied in policies [24] and agreements with suppliers and partners. The SLA between the university library and the Digital Library Service Provider is an example of such an agreement. One way to reduce risk would be to engage Service Providers on the definitions used in these agreements in order to bring the SLA in line with modern trends regarding access to electronic information. Another issue to consider is joining federated identity provider consortia, such as InCommon.

Tactical Planning:

After the strategic decisions have been made, plans must be put in place to ensure that the strategic choices can actually be implemented feasibly. This may involve project planning activities that, inter alia, include raising awareness of the strategic direction and training in the use of technologies. In the current context, users may often be unaware of eduroam and the advantages it holds for them. If the institution makes VPN access available to users, some may lack the skills to use the VPN access effectively, and will thus require training.

Operational implementation:

Decisions related to the operational implementation of the underlying infrastructure should also be made. For example, does the institution allow VPN access to its resources? The use of technologies such as web proxy servers, virtual direct dialing, and Shibboleth, as discussed in section 6.3, are all operational implementation decisions that must be made.


From the multi-faceted perspective of the model, none of these operational implementation options represents a panacea. Instead, the interplay between the three components in this essentially tactical model ensures a balanced view that will enhance future decision-making. The next section discusses how the proposed model can be used to aid decision-making.

8. DISCUSSION

With a multi-faceted model it is important to strike a balance between different, sometimes opposing concerns. This is also the case in this model, which proposes three components to be considered when addressing the risks associated with IP-based authentication. The proposed multi-faceted model allows us to think holistically about the implications of our decisions. For example, consider two cases, one from a technological perspective and the other from a strategic perspective: using Shibboleth as the underlying authentication technology, and strictly enforcing SLA requirements.

8.1 Technological decision: Using Shibboleth

Firstly, consider a decision to deal with IP-based authorization issues through the use of Shibboleth. Clearly this is a decision aimed at the operational implementation of a technology. Deciding to go that route, one will have to consider the tactical planning required. This would include project planning, but also what training needs to be put in place on a technical level. Furthermore, how will awareness of the advantages to our user base be raised? Finally, the impact on strategy must be considered. Will we, or perhaps should we, join specific federated identity management consortia? Also, what kind of cooperation agreements must be put in place with Service Providers? In the case of digital libraries, will the provider support Shibboleth authorization? Clearly, it is not just a case of “just implementing Shibboleth”.

8.2 Strategic decision: Enforcing SLAs

Consider another decision: reducing the risk to the institution by ensuring that visitors cannot access online libraries illegally. Operational implementation may be by means of managing different IP pools for visitors and not using the same web proxy to exit to the Internet. For our own users, who may encounter similar measures at other institutions, we would need to allow secure access to their home network through a VPN. However, such a decision would require tactical planning in terms of raising awareness regarding eduroam and the possible issues that users may have while traveling. It further requires tactical planning regarding training programs to help end-users understand the concept of a VPN and to teach them how to use it while traveling. Again, thinking about the decision from multiple facets uncovers the fact that it is not as simple as applying technical measures to enforce the SLA requirements more strictly.

These two decision cases demonstrate that the multi-faceted model can be used to ensure that the various facets impacted by (or that impact) a decision are considered. Of course, many more options can be considered, but they would be thought through in the same manner as the two illustrative cases.

9. CONCLUSION

This paper discussed the origin of the eduroam service and its components. It showed that eduroam certainly helps with the flexibility of Internet access. However, it also highlighted that problems can be experienced when services use IP-based authorization. Several potential solutions exist to address the problem if the relationships between the various role players are considered. However, the tension between flexibility and security means that a purely technical solution may be more elusive than is immediately evident.

The paper therefore proposes a multi-faceted model for considering decisions regarding IP-based authorization in the eduroam network. Decisions regarding one facet have a direct implication for the other facets. This paper drew on online Digital Library providers as a case study. This is but one example of a service that might be appropriate in the eduroam environment. Other services, for example collaboration services or grid computation services, may expose other risks. However, the model does not exclude any particular risk, but requires us to think not only of the operational implementation issues, but also of the tactical planning issues and the strategic decisions that are impacted. Thinking around the various facets will lead us to better consider the risks involved.

Information technologies change at a rapid rate. The proposed model positioned several existing technologies, but any new technologies or risks could similarly be positioned. This ensures that the model has utility amidst a fast-changing landscape.

The proposed model does not make any claims regarding completeness, but does claim to provide a more holistic view on authorization problems, which are often considered purely technical issues. Future work could include expanding the model to security problem spaces other than authorization. Furthermore, the interplay between different stakeholders, as well as between the different facets of the model, warrants further investigation.

The proposed model recognizes that the ‘non-technical’ security implications in large-scale computer networks, such as those created by the eduroam service, are not clearly understood, and it therefore sets out, as a prudent first step, to address these issues. The sheer scale of eduroam definitely amplifies what might otherwise be considered minor problems.

REFERENCES

[1] K. Wierenga and L. Florio, “Eduroam: past, present and future,” Computational methods in science and


technology, vol. 11, no. 2, pp. 169–173, 2005.

[2] K. Wierenga, “Initial proposal for (now) eduroam,” Tech. Rep., 2002.

[3] K. Wierenga, S. Winter, R. Arends, J. R. Poortinga, D. Simonsen, and M. S. Sova, “Deliverable DJ5.1.4: Inter-NREN roaming architecture: Description and development items,” GN2 JRA5, GEANT, vol. 2, 2006.

[4] M. Milinovic, J. Rauschenbach, S. Winter, and L. Florio, “Deliverable DS5.1.1: eduroam service definition and implementation plan,” 2008. [Online]. Available: https://eduroam.org/downloads/docs/GN2-07-327v2-DS5 1 1- eduroam ServiceDefinition.pdf

[5] O. Canovas, A. F. Gomez-Skarmeta, G. Lopez, and M. Sanchez, “Deploying authorisation mechanisms for federated services in eduroam (DAMe),” Internet Research, vol. 17, no. 5, pp. 479–494, 2007.

[6] TERENA. (2011) eduroam compliance statement. [Online]. Available: https://www.eduroam.org/downloads/docs/eduroamCompliance Statement v1 0.pdf

[7] T. Watanabe, S. Kinoshita, J. Yamato, H. Goto, and H. Sone, “Flexible access control framework considering IdP-side’s authorization policy in roaming environment,” in Computer Software and Applications Conference Workshops (COMPSAC), 2012 IEEE 36th Annual, Izmir, Turkey, 2012, pp. 76–81.

[8] TERENA. (2012) eduroam celebrates a decade of providing secure roaming internet access for users. [Online]. Available: http://www.terena.org/news/fullstory.php?news id=3162

[9] Eduroam. (n.d.) About eduroam. [Online]. Available: https://www.eduroam.org/index.php?p=about

[10] D. Olesen. (2003) TERENA annual report 2003. [Online]. Available: http://www.terena.org/publications/files/terena final 2003.pdf

[11] TERENA. (2004) Eduroam goes global. [Online]. Available: http://www.terena.org/news/archive/2004/newsflash163.pdf

[12] Eduroam. (n.d.) Where can I eduroam? [Online]. Available: https://www.eduroam.org/index.php?p=where

[13] C. Rigney, S. Willens, A. Rubens, and W. Simpson, “Remote authentication dial in user service (RADIUS),” 2000. [Online]. Available: http://www.hjp.at/doc/rfc/rfc2865.html

[14] B. Aboba, L. Blunk, J. Vollbrecht, J. Carlson, and H. Levkowetz, “Extensible authentication protocol (EAP),” RFC 3748, Tech. Rep., June 2004. [Online]. Available: http://www.hjp.at/doc/rfc/rfc3748.html

[15] C. Rudd, V. Lloyd, and L. Hunnebeck, ITIL Service Design. TSO, 2011.

[16] E. S. Chia, “Risk assessment framework for project management,” in 2006 IEEE International Engineering Management Conference, Bahia, Brazil, Sept 2006, pp. 376–379.

[17] P. Smith and G. Merritt, “Proactive risk management,” in Controlling Uncertainty in Product Development. New York, USA: Productivity Press, 2002.

[18] eduroam. (n.d.) eduroam infrastructure. [Online]. Available: https://www.eduroam.org/index.php?p=about

[19] S. Rangarajan, A. Takkallapalli, S. Mukherjee, S. Paul, and S. Miller, “Adaptive VPN: Tradeoff between security levels and value-added services in virtual private networks,” Bell Labs Technical Journal, vol. 8, no. 4, pp. 93–113, 2004.

[20] M. Erdos and S. Cantor, “Shibboleth architecture draft v05,” Internet2/MACE, vol. 2, May 2002.

[21] J. K. Duke and X. Yu, “Authenticating users inside and outside the library,” Internet Reference Services Quarterly, vol. 4, no. 3, pp. 25–41, 1999.

[22] A. J. Valencia, “Virtual dial-up protocol for network communication,” US Patent 5918019, Jun. 29, 1999. [Online]. Available: http://www.google.com/patents/US5918019

[23] S. Cantor, J. Kemp, R. Philpott, and E. Maler, “Assertions and protocols for the OASIS Security Assertion Markup Language,” OASIS Standard, March 2005. [Online]. Available: https://www.oasis-open.org/committees/download.php/10391/sstc-saml-core-2.0-cd-02g-diff.pdf

[24] R. von Solms, K.-L. Thomson, and M. Maninjwa, “Information security governance control through comprehensive policy architectures,” in Information Security South Africa (ISSA), Aug 2011, pp. 1–6.


SECURE SEPARATION OF SHARED CACHES IN AMP-BASED MIXED CRITICALITY SYSTEMS

P. Schnarz, C. Fischer, J. Wietzke∗, I. Stengel†

∗ Faculty of Computer Science, University of Applied Sciences Darmstadt, Germany, E-mail: [email protected]
† Centre for Security, Communications and Network Research, Plymouth, United Kingdom, E-mail: [email protected]

Abstract: The secure separation of functionality is one of the key requirements, particularly in mixed criticality systems (MCS). Well-known security models such as multiple independent levels of security (MILS) aim to formalise the isolation of compartments to avoid interference and make them reliable enough to work in safety-critical environments. Especially for in-car multimedia systems, also known as In-Vehicle Infotainment (IVI) systems, the composition of compartments onto a system-on-chip (SoC) offers a wide variety of advantages in embedded system development. The development of such systems often implies the combination of pre-qualified hardware and software components, for example CPU subsystems and operating systems. However, the required strict separation can suffer because pre-qualified hardware components are not reconfigurable. In particular, this is true for shared cache levels in CPU subsystems. The phenomenon of interference in the concurrent usage of shared last-level caches is exploitable by adversaries. This article therefore identifies the attack surface and proposes a mitigation to prevent the intentional misuse of the fixed cache association. The solution is based on a suitable mapping scheme in the intermediate address space of an asymmetric multiprocessing environment which implements the MCS. Furthermore, we evaluate the strength of the approach and show how the solution contributes to a separation-property-conformal system.

Key words: security architecture, shared cache, denial of service, proof of separability

1. INTRODUCTION

Partitioning and combining different software components into a shared platform is an ongoing trend in embedded system development. These partitions provide the opportunity to separate applications having different functional or non-functional requirements. Among other purposes, this fosters the development process or certification needs. Non-functional requirements, e.g. for safety or security, are particularly mandatory in critical environments like avionics or automotive. Such software compositions are referred to as Mixed Criticality Systems (MCS). Especially in the automotive sector, the safe and secure composition of MCS is pushed by the rising number of functions of in-car multimedia systems, also known as In-Vehicle Infotainment (IVI) systems. From a technical view, there are several possibilities to partition or separate applications. The focus of this work is the combination of multiple operating systems on a common System-on-Chip (SoC) for automotive purposes. Multi-operating systems (multi-OS) are going to be established in the future [1].

The development of systems and software in the automotive environment implies special requirements and challenges [2]. In particular, one of the challenges is to integrate the wide-spread functions onto one single head unit [3]. Furthermore, tightly network-coupled (cloud) applications, such as social networks, will increasingly gain entry into that environment. The utilization of multiple OSs on one platform provides the opportunity

of gaining advantage of their capabilities. An example is the execution of a real-time OS in parallel with a mobile OS or a general-purpose OS. The loosely-coupled component structure of SoCs offers the possibility to implement a multi-OS following the asymmetric multiprocessing (AMP) paradigm rather than the classical virtualization schemes implemented in commodity desktop architectures. The AMP paradigm implies the total split of every resource in the system. Specialized hardware extensions introduced with current RISC processor architectures such as the ARMv7 [4] enable the assignment of devices and resources of the SoC to single OSs. The asymmetric paradigm is not yet applicable to every part of current system-on-chip (SoC) architectures. Caches are often shared between multiple processor cores on multi-processor SoCs (MP-SoC). Owing to their fixed association scheme to the main memory, code running on the processor cores is naturally vulnerable to DoS attacks. Generally spoken, exploitation of the caching infrastructure makes starvation of an OS-partition possible. Adversaries could explicitly aim for this weakness.

Hence, in this work we introduce and provide an empirical analysis of the separation capabilities of an indirect memory mapping method. The proposed memory mapping technique aims to mitigate the surface of interference on shared caches.

The latency of each memory access of a certain processor has been chosen to quantify the effects of our proposed

Based on: “On a Domain Block Based Mechanism to Mitigate DoS Attacks on Shared Caches in Asymmetric Multiprocessing Multi Operating Systems”, by P. Schnarz, C. Fischer, J. Wietzke and I. Stengel, which appeared in the Proceedings of Information Security South Africa (ISSA) 2014, Johannesburg, 13 & 14 August

2014. © 2014 IEEE


method, because this value represents the time consumed by the cache subsystem to transfer the issued data. It can be assumed that heavy usage of a shared cache will, on average, increase the access latency of a co-CPU. Therefore, in order to evaluate the separation capability, we applied measurements to quantify the latencies. These measurements are executed using certain memory access patterns which are supposed to provoke intended interference or starvation effects. The pattern is derived from the specific cache implementation of our experimental platform.

The outcome of this measurement tooling is twofold. First, it is possible to show that the issue of denial-of-service attacks, as well as starvation, is real in such environments. Second, the output data shows the effectiveness of our solution to this particular problem. A detailed description of the quantitative methodology in the particular setup is given in Section 6.
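The article's measurements were taken on the experimental platform itself; purely as an illustration of the methodology, the following Python sketch times strided walks over buffers sized relative to an assumed 1 MiB LLC. The buffer sizes, the 64-byte line stride and the cache size are assumptions, and in an interpreted language the absolute numbers are dominated by interpreter overhead, so only the shape of the method carries over: average the time per access over an access pattern derived from the cache geometry.

```python
import time
from array import array

def mean_access_latency(num_bytes: int, stride: int = 64, rounds: int = 5) -> float:
    """Walk a buffer at cache-line stride and return the mean time per
    access in nanoseconds. Working sets larger than the LLC miss more
    often, so their mean latency rises - the effect the measurements
    are designed to quantify."""
    buf = array("b", bytes(num_bytes))       # zero-filled byte buffer
    indices = range(0, num_bytes, stride)    # one access per cache line
    total = 0
    start = time.perf_counter()
    for _ in range(rounds):
        for i in indices:
            total += buf[i]                  # the timed memory access
    elapsed = time.perf_counter() - start
    return elapsed * 1e9 / (rounds * len(indices))

# Assumed 1 MiB LLC: compare a fitting and an overflowing working set.
small = mean_access_latency(256 * 1024)        # fits in the assumed LLC
large = mean_access_latency(4 * 1024 * 1024)   # exceeds it
print(f"{small:.1f} ns vs {large:.1f} ns per access")
```

On the real platform the same comparison is made while a co-CPU thrashes the shared cache, so that the rise in the victim's mean latency quantifies the interference.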

Additionally, this article maps the novel mapping method to security models which are commonly applied to mixed criticality systems.

1.1 Threat Analysis

The security threat discussed in this work is specific to this particular environment. This affects the consideration of threat sources, actors and the foci of interests/assets. Since this article focuses on denial-of-service attacks, we only consider circumstances that attack the availability of the asset.

Threat Sources/Actors: Commonly, threat sources and actors are distinguishable. Briefly, the source of a threat is not necessarily the actor or the actual attacker in the end. More important are the capabilities and the motivation of the actors. In this case, attackers are assumed to be very knowledgeable insiders. This results in a relatively high capability, considering their substantial skills in embedded engineering. Furthermore, we assume actors have access to the non-disclosure documentation of the hardware platform. As an example, the motivation to attack the availability could be to prevent important safety messages from being displayed. This would result in the safety-relevant partitions of the system being threatened.

Threat Vector: In this work we consider availability attacks at system level. This means we assume an attacker has already succeeded in compromising and controlling an OS-domain. This is reasonable due to the attack surface of highly Internet-coupled mobile OSs.

Focus of Interest: As a result, the actors aim for vulnerabilities at system level. Apart from privilege escalation attacks (horizontal or vertical), this article focuses on DoS attack surfaces in the shared last-level cache (LLC). The attacker's aim is to overcommit the cache from its compromised OS side in order to degrade the memory access performance on the target's side. Owing to the cache associativity, the attacker is able to target a memory access to a specific physical address (PA) in the system. As an example, it is

feasible for adversaries to aim at a co-OS-domain which computes the cluster device (speedometer) for the driver of a vehicle. The target would need to fetch data from the memory in a strict timing order. If the memory access is delayed by the attacker, the displaying of the data could be significantly delayed as well. This has an impact on the reliability and availability of the target system.
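The overcommitting attack just described relies on the fixed mapping from physical addresses to cache sets in a set-associative cache. As a hedged illustration (the 16-way, 2 MiB, 64-byte-line geometry and the target address are assumptions, not the parameters of the experimental platform), the following sketch generates addresses that all index the same cache set as a victim line; accessing more of them than the cache has ways evicts the victim's data:

```python
def cache_set(addr: int, line_size: int = 64, num_sets: int = 2048) -> int:
    """Set index of `addr` in a physically indexed set-associative cache."""
    return (addr // line_size) % num_sets

def same_set_addresses(target_addr: int, count: int,
                       line_size: int = 64, num_sets: int = 2048) -> list[int]:
    """Return `count` addresses mapping to the same set as `target_addr`.
    Same-set lines are spaced exactly line_size * num_sets bytes apart."""
    set_stride = line_size * num_sets
    base = target_addr - (target_addr % line_size)  # align to line start
    return [base + i * set_stride for i in range(1, count + 1)]

# Assumed geometry: 2 MiB, 16-way LLC with 64 B lines -> 2048 sets.
target = 0x40080040                       # hypothetical victim address
evictors = same_set_addresses(target, count=17)   # one more than 16 ways
assert all(cache_set(a) == cache_set(target) for a in evictors)
print(f"{len(evictors)} conflicting addresses generated")
```

An attacker in a compromised OS-domain only needs to touch such a conflicting working set in a loop to keep the victim's lines out of the shared cache, which is the degradation effect the measurements in later sections quantify.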

1.2 Related Work

A commonly known phenomenon involving caches in multi-core processors is cache thrashing. This work generally addresses the intentional interference with the cache system.

Similar approaches to implement a multi-OS are shown in [5], [6], [7] and [8]. However, those approaches focus on IA-32 multicore architectures, which are used for desktop computers. The difference to the architecture proposed in this work lies in the design paradigm, protocols and hardware compilation, which cannot simply be transferred to SoC architectures. Nevertheless, in [9] Crespo et al. show a hypervisor meeting the requirements of high-assurance mixed criticality systems. The intended solution to manage the cache allocation indirectly is introduced in [10], [11] and [12]. The authors propose coloring caches in order to avoid interference between applications. They also deal with the problem of the dynamic allocation and assignment of cache colors to applications. The solutions are implemented in the memory allocation mechanism at OS level.
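The page-coloring idea of [10]-[12] can be sketched briefly. Assuming a hypothetical 2 MiB, 16-way LLC with 4 KiB pages (this geometry is an illustration, not that of any platform named here), the color of a physical page follows directly from the cache parameters:

```python
def page_color(phys_addr: int, page_size: int = 4096,
               cache_size: int = 2 * 1024 * 1024, ways: int = 16) -> int:
    """Color of the physical page holding `phys_addr`. Pages of different
    colors can never map to the same cache sets, so assigning disjoint
    color groups to OS-domains removes inter-domain cache interference."""
    # Number of distinct colors = bytes per way / page size.
    num_colors = cache_size // (ways * page_size)
    return (phys_addr // page_size) % num_colors

# With the assumed geometry there are 32 colors; consecutive physical
# pages cycle through them, so 64 pages cover every color twice.
colors = {page_color(n * 4096) for n in range(64)}
print(sorted(colors))   # colors 0..31
```

Such coloring is normally enforced dynamically by the OS page allocator; as the following paragraphs argue, a statically configured AMP system must instead enforce an equivalent partitioning at system level.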

In [13] and [14] the authors present approaches for the prevention of starvation in multi-node systems or cache coherence protocols. Those approaches might be applicable in AMP-based systems. However, they would have to be implemented in the hardware platform, which is not in scope for this work. It is not possible in every embedded systems development project to introduce changes in pre-qualified hardware blocks, such as the CPU subsystems.

The approach proposed in this article differs substantially from this related work. In the case of AMP-based MCS, the memory segmentation must be enforced at system level to have the capacity of configuration protection. Furthermore, since the system setup is statically defined, no dynamic page coloring allocation algorithms can be implemented.

1.3 Structure

In Section 5. a method is proposed which addresses the possible surface for interference. Furthermore, implementation issues for the solution are given in Section 5.2. To verify the concept, a quantitative measurement method is introduced in Section 4.. The impact is shown with an experimental setup examined in Section 6.. All measurement results are given in Section 7.. The remainder of this article is organized as follows: In Section


Vol.106 (2) June 2015 SOUTH AFRICAN INSTITUTE OF ELECTRICAL ENGINEERS 95

Figure 1: Multi-OS composition and resource assignment

2. the Multi-OS environment and its practical realization are introduced. In Section 3. the organization of the memory is examined. Lastly, a conclusion is presented in Section 8..

2. AMP BASED MIXED CRITICALITY SYSTEMS

As previously mentioned, many ways exist to construct a Mixed Criticality System. In this work we consider MCS separated into multiple instances of operating systems (OS). According to their capabilities, they can, but are not obliged to, be of different types. In this work, OSs are divided into three categories: real-time, general purpose and mobile. Each OS maintains its own memory as well as the physical devices provided by the hardware platform. The combination of OS, memory and devices builds an independent and self-organizing OS-domain. Figure 1 shows a semantic system overview.

The difference to other multi-OS approaches (compare the related work in 1.2) is that OS-domains are statically bound to a processing unit, which is usually one of the central processing units (CPU) or CPU-cores of the hardware platform. This is led by the intention of achieving a static system configuration. It is not intended to expand domains in the main memory or to reorganize the device assignments during runtime. During the boot phase all necessary initializations and configurations are set up. In comparison to classical virtualization examples, this approach avoids interfaces to manage the configuration once the system is running. The intention is to minimise the attack surface.

The proposed approach is based on asymmetric multiprocessing. AMP systems are not new in certain domains. Primarily, an AMP system provides asymmetric access to hardware. The term asymmetry refers to the separation of hardware resources core-by-core in the system. Each core has access to and works on a different partition of the main memory and hardware devices. To maintain and handle the different core partitions and their applications, each partition must run an OS. In this way it is also possible to run different OSs on an AMP multicore system. This approach to encapsulation is hardware-based and requires fundamental hardware functions to be adapted or configured to run multiple OSs.
The approach is contrary to symmetric multiprocessing (SMP), which runs a single OS on all CPU-cores and maintains all hardware resources. AMP-based systems can be categorized as mixed critical

Figure 2: Generalized SoC-structure

systems [15] because of the independence of all subsystems running on the multiple CPUs.

2.1 System on Chip Structure

A SoC is an integrated circuit that combines all components of a computing platform or other electronic system into a single chip. Designing SoCs is a very demanding area in embedded system development. The platform is constructed for its intended environment. The architecture of the system is generally tailored to its application rather than being general-purpose. This means there is usually no common structure for such platforms. Nevertheless, most SoCs are developed from pre-qualified hardware blocks (compare [16]). These blocks are connected through an on-chip network, often referred to as the system bus. In Figure 2 a generalized structure for SoCs is presented. For this work, the focus is set on a single subsystem which implements CPUs. Furthermore, the CPU-subsystem's connection to the main memory is considered.

2.2 Security Model

The Multi-OS presented here applies a security model which fits the mixed criticality of the system. Multiple Independent Levels of Security (MILS) has been introduced to prove the security of high-assurance systems, especially for aircraft, space, and defence purposes [17] [18]. The model incorporates the following properties [19]:

• Data isolation: The isolation of data does not only affect the confidentiality of information, it is also aimed at the separation capabilities of the system design.

• Control of information flow: The location of the information has to be defined. Furthermore, access control has to be enforced.

• Periods processing: This aims at single-core or classically virtualized systems which share one or more processors. In this case, it has to be proven that each system can meet its timing requirements.

• Fault isolation: Failures are detected, contained and recovered locally [19].

Security measures to enforce the aforementioned properties need to implement the following attributes:


Figure 3: MILS AMP system overview

non-bypassable, evaluable, always invoked, tamper proof. These are commonly known by the acronym NEAT [20]. In Figure 3 the AMP system, which is based on a SoC, is mapped to the MILS model. Single subsystems are either defined as multi-level secure (MLS) or single level secure (SLS). Hence, each compartment has its own multiple security levels or is considered to be single levelled. There is no dependency between the security levels of the compartments. The highest security level of system 1 is unequal to the highest level of system 2, etc.

One of the key elements of the MILS model with respect to our work is the data isolation property. Sometimes referred to as proof of separability [21], it aims to assure a multi-compartment system by avoiding surfaces for interference. Therefore, the method we are going to introduce must fit the NEAT attributes in order to be compliant with the security model.

2.3 AMP Multi-OS Realization

This section will examine issues associated with realizing an AMP-based multi-OS. In this work, hardware architectures are considered which use memory maps for hardware accesses (memory-mapped i/o). Each device connected to the common shared bus (CSB) is accessible by a statically defined physical address. These addresses are bundled in an i/o-address space, or a configuration address space if there are any configuration registers of the device. Processors map those physical addresses to their virtual address space using a memory management unit (MMU). The MMU itself uses a translation table (TT) to match and redirect accesses to peripherals connected to the CSB. The translation table either resides somewhere in the physical memory space or is implemented in the MMU hardware. The whole system configuration is set during the boot phase within the privileged hypervisor mode. Figure 4 shows the CPU subsystem.

Figure 4: First and second stage MMUs in the CPU-centric memory management

2.4 Centralized Memory Mapping and Peripheral Assignment

The segmentation of the addressable space as well as the assignment of certain resources, peripherals or devices is an integral part of the creation of an AMP multi-OS.

In this context, assignment means the device is only accessible by a single, defined OS-domain. As mentioned, the assignment will be enforced by the second stage MMU. To bind a resource to an OS-domain, it must provide an interface (configuration registers) in a dedicated address area (configuration space). This includes clock assignments, MMU activation and signals, etc. In order to assign a resource to an OS-domain, all of its configuration registers will be mapped to its address space. The example in Figure 5 shows two OS-domains and two devices. Based on the full addressable space, each device or memory partition is assigned to a domain. As an example, if 4GB of main memory is given, the main memory could be mapped from the physical address 0x80000000 to 0xbfffffff. If an address space is shared, the associated addresses are multiply assigned.

2.5 Address Spaces

As a result of the two-staged address translation, the system deals with three different address space types, which are briefly introduced in this section.

• Virtual address space (VA): This space is typically maintained by the OS-domain. The addresses used in this space are referred to as virtual addresses. An address used in an instruction, as a data or instruction address, is a VA [4] and has a space of up to 32 bits.

• Intermediate physical address space (IPA): The IPA is the output of the stage 1 translation and the input of the stage 2 translation. If no stage 2 translation takes place, the IPA is the same as the physical address.

• Physical address space (PA): The PA contains the address of a location in the memory map, which is an output address from the processor to the memory system.


Figure 5: Example for a system memory map

The translation process works as follows:

VA --(stage 1)--> IPA --(stage 2)--> PA    (1)

The proposed method introduces a suitable mapping at the second stage.

2.6 Translation Tables

The MMU utilizes a translation table to convert an input address to a corresponding output address. Depending on the implementation, these translation tables are located in the PA address space of the SoC. In order to manage a huge address space, the translation tables are divided into different levels. Typically, there are three translation levels. According to the ARM reference [4], Level 1 maps 1GiB blocks, Level 2 2MiB blocks and Level 3 4KiB pages respectively. The input address, particularly the IPA, indexes the position in the table. Each entry points either to a memory region or to the next corresponding translation table level. The suitable construction of these translation tables is further described in 5.1.
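As an illustration, the way a 32-bit input address selects entries at the three levels can be sketched as follows. The field layout is a simplified assumption for illustration, not a reproduction of the ARMv7 long-descriptor format:

```c
#include <stdint.h>

/* Illustrative split of a 32-bit input address into table indices:
 * Level 1 selects a 1 GiB block, Level 2 a 2 MiB block, Level 3 a
 * 4 KiB page. Index widths are a simplified sketch. */
typedef struct {
    uint32_t l1;     /* index into the Level 1 table (1 GiB granularity) */
    uint32_t l2;     /* index into the Level 2 table (2 MiB granularity) */
    uint32_t l3;     /* index into the Level 3 table (4 KiB granularity) */
    uint32_t offset; /* byte offset within the 4 KiB page */
} tt_index_t;

tt_index_t tt_split(uint32_t addr)
{
    tt_index_t idx;
    idx.l1     = addr >> 30;           /* 1 GiB = 2^30 bytes        */
    idx.l2     = (addr >> 21) & 0x1FF; /* 2 MiB = 2^21, 512 entries */
    idx.l3     = (addr >> 12) & 0x1FF; /* 4 KiB = 2^12, 512 entries */
    idx.offset = addr & 0xFFF;
    return idx;
}
/* tt_split(0x80012345) -> l1 = 2, l2 = 0, l3 = 18, offset = 0x345 */
```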

3. MEMORY ORGANIZATION

This work develops general models. Nevertheless, the organization of memory, cache subsystems, their protocols and hierarchy are based on the ARMv7 architecture specification [4]. This specification is the foundation of a significant number of SoC platforms.

3.1 Caches

The general intention of integrating caches into processors is to speed up access to frequently used memory. The memory in computing systems is

Table 1: Caching terminology

Sign      Description
WS        Way set
CL        Cache line
ML        Memory line
CLSize    Size of a single cache line
MLSize    Size of a single memory line
WSID      A specific WS
DBSize    Size of a domain block
WSCount   Number of way sets
CLCount   Number of CLs

hierarchically organized [22]. Regardless of the highest orders, which are the processor's registers, there are one or more levels of cache, which are denoted as L1, L2, etc. In multi-processor (MP) systems some levels are private to a processor and some are shared between multiple processors. In SMP-based systems a coherence protocol maintains the synchronization of shared data. Caches expand from the lower to the higher levels. The smallest addressable entity within a cache is a cache line (CL), which has a fixed CL size, such as 64 Byte.

The last level cache (LLC) before the main memory often has an associativity scheme. The scheme describes which CL in main memory, the memory line (ML), is loaded to a specific location in the cache. The location a ML is loaded to is denoted as CL ID. The LLC can be fully associative or organized into associativity way sets. Fully associative means each ML can be loaded to all possible CL ID positions in the LLC. In most cases caches are divided into way sets (WS). Thus, a specific ML is associated with a specific WS in the cache. If a WS has a size of 8, it is called an 8-way set associative cache. When a ML is loaded to a CL in the WS, a replacement algorithm determines the specific location. Depending on the implementation, this could be done by a least recently used algorithm or could be totally randomized. The CL that gets replaced will be written back to the main memory. The number of WSs in the cache can be calculated by:

WSCount = CacheSize / (CLSize * WSAssoc)    (2)
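Equation (2) can be checked against the platform used later in Section 6. (2MiB shared L2 cache, 64-Byte cache lines, 16-way associativity); a minimal sketch, with a helper name of our choosing:

```c
/* Equation (2): number of way sets in a set-associative cache. */
unsigned ws_count(unsigned cache_size, unsigned cl_size, unsigned ws_assoc)
{
    return cache_size / (cl_size * ws_assoc);
}
/* Experimental platform: 2 MiB L2, 64-byte lines, 16-way:
 * ws_count(2 * 1024 * 1024, 64, 16) == 2048 */
```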

3.2 Addressing Scheme

Addressing is the fundamental part of memory access. Usually the smallest addressable entities in computer systems are 32 Bit wide. As a result, a single CL contains 16 addressable locations. The data or instructions loaded into the cache can be logically/virtually indexed or physically indexed. In the case of physically indexed caches, the PA of a memory location identifies CLs in the cache system. As a result, VAs or IPAs have to be translated through the MMUs


before the data can be loaded into the cache. If a processor accesses a particular memory location in main memory, the VA will be translated into a PA. Equation 3 shows how to determine the specific WS in the cache for a given PA.

WSID = (PA / CLSize) mod WSCount    (3)
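A minimal sketch of Equation (3); the helper name is ours:

```c
#include <stdint.h>

/* Equation (3): way set selected by a physical address. */
unsigned ws_id(uint64_t pa, unsigned cl_size, unsigned ws_count)
{
    return (unsigned)((pa / cl_size) % ws_count);
}
/* With CLSize = 64 and WSCount = 2048: neighbouring cache lines map to
 * consecutive sets, and two addresses exactly 2048 * 64 bytes apart map
 * to the same set:
 *   ws_id(0x80000000, 64, 2048)             == 0
 *   ws_id(0x80000040, 64, 2048)             == 1
 *   ws_id(0x80000000 + 2048 * 64, 64, 2048) == 0 */
```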

4. QUANTITATIVE INTERFERENCE

In this section, the method of how to achieve interference between two co-OS-domains will be described. Furthermore, we show how to implement the method. For our consideration, we have assumed a two-level cache hierarchy and a 16-WS associativity of the L2 cache to the main memory.

4.1 Method

In order to introduce a performance impact on co-OS-domains, an attack vector is examined which aims to overcommit a certain WS in the LLC. The memory mapping introduced in Section 2.4 shows that the memory partitions are assigned in two big consecutive blocks to the OS-domains, which are denoted as Memory Partition 0 and Memory Partition 1 respectively.

As shown in Figure 6, each ML in the main memory is assigned to a specific WS in the LLC. Since the main memory is bigger than the L2 cache, the pattern repeats every time WSCount has been reached. The method to fill a specific WS in the cache is to compute the WSID according to the following equation:

BlockSize = WSCount * CLSize    (4)

Algorithm 1 Fill specified WS
  Require: PA ≥ 0; BlockSize > 0
  NextPA ⇐ PA
  for i = 0; i < WSSize; i++ do
    Access(NextPA)
    NextPA ⇐ NextPA + BlockSize
  end for
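Algorithm 1 might look as follows in C. This is a sketch under the assumption that `base` is a (virtual) address mapped onto the intended PA, as in the paper's kernel-module setting; the returned checksum only exists to keep the loads from being optimized away:

```c
#include <stdint.h>
#include <stddef.h>

/* Sketch of Algorithm 1: touching one address every BlockSize bytes
 * (Equation 4) visits WSSize cache lines that all map to the same way
 * set, filling the set and evicting older lines. */
unsigned fill_ws(volatile const uint8_t *base, size_t block_size,
                 unsigned ws_size)
{
    unsigned sum = 0;
    for (unsigned i = 0; i < ws_size; i++) {
        sum += *base;        /* load: forces the line into the set */
        base += block_size;  /* next line with the same WSID       */
    }
    return sum;
}
```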

If the attacker aims to interfere with the victim, he just has to use the same WSs as the victim. By aiming for a specific WS, and by doing this very frequently, the victim's memory accesses to this WS can be significantly delayed. Depending on the replacement strategy in the WSs, this effect can be deterministically predicted in the case of least recently used (LRU) replacement or statistically measured in the case of random replacement.

5. COUNTERMEASURE

The attack vector shows that it is possible to interfere with a co-OS-domain by attacking specific WSs. Since

this would lead to unpredictable memory access execution delays, a method is introduced to prohibit this effect. The cache covers the whole main memory. Attributes like the associativity commonly cannot be changed in the system. Hence, a method is required which is applicable without the introduction of architectural hardware changes.

5.1 Domain Block Memory Mapping

The general strategy for the countermeasure is to invert the DoS method. The method assigns WSs in the cache to OS-domains. This is achieved by the introduction of Domain Blocks (DB). The DBs are later mapped to the particular OS-domains. A DB is a memory region that is assigned to a specified set of WSs in the LLC. In Figure 7 the method is depicted. The example shows a DB mapping for two memory partitions. As a result, there are two different "colors" for DBs in this case. A single DB consists of a set of MLs. The DBs describe an alternating pattern within the main memory. In the example, a cache with 2048 WSs is assumed. To split the cache literally into two halves, a DB consists of 1024 MLs. Generally, the size of a DB is calculated through:

DBSize = (WSCount / Domains) * CLSize    (5)
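Equation (5) applied to the example above; the helper name is ours:

```c
/* Equation (5): size of a domain block in bytes. */
unsigned db_size(unsigned ws_count, unsigned domains, unsigned cl_size)
{
    return ws_count / domains * cl_size;
}
/* Example from Section 5.1: 2048 way sets, two domains, 64-byte cache
 * lines give db_size(2048, 2, 64) == 65536, i.e. a 64 KiB block of
 * 1024 memory lines. */
```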

The DBs will be mapped to the OS-domains using the second stage MMU. The mappings in the Level 3 descriptors are generated as follows: The input address space, represented by the IPA, must be consecutive for a proper operation of the OS. The output addresses (PA) are generated with regard to the proposed pattern. Since each entry in the TT describes a 4096 Byte page in the main memory, each page contains 64 MLs with a size of 64 Byte. To contain 1024 MLs, 16 pages form a single DB. The mappings are generated using Algorithm 2. The PA space for the main memory starts at address 0x80000000. For OS-domain 1 the IPA space starts at 0x80000000 and for OS-domain 2 at 0xA0000000. The algorithm iterates through the whole address space of the main memory.

Algorithm 2 Generate Level 3 TT
  IPA[1,2] ⇐ 0x80000000, 0xA0000000
  PA[1,2] ⇐ 0x80000000, 0x80010000
  PageSize ⇐ 0x1000
  for j = 0; j < MainMemorySize; j += DBSize do
    for i = 0; i < 16; i++ do
      Map(IPA[1,2] → PA[1,2])
      IPA[1,2] ⇐ IPA[1,2] + PageSize
      PA[1,2] ⇐ PA[1,2] + PageSize
    end for
    PA[1,2] ⇐ PA[1,2] + DBSize
  end for
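A sketch of Algorithm 2 for a single OS-domain, emitting (IPA, PA) page pairs into arrays instead of writing real Level 3 descriptors; the function and parameter names are ours:

```c
#include <stdint.h>

#define PAGE_SIZE    0x1000u                    /* 4 KiB Level 3 page */
#define PAGES_PER_DB 16u                        /* 16 pages = one DB  */
#define DB_SIZE      (PAGES_PER_DB * PAGE_SIZE) /* 64 KiB             */

/* Walk the domain's consecutive IPA range and scatter it over every
 * second domain block in PA space. Returns the number of pages mapped. */
unsigned gen_l3_map(uint32_t ipa, uint32_t pa, uint32_t domain_mem_size,
                    uint32_t *out_ipa, uint32_t *out_pa)
{
    unsigned n = 0;
    for (uint32_t j = 0; j < domain_mem_size; j += DB_SIZE) {
        for (uint32_t i = 0; i < PAGES_PER_DB; i++) {
            out_ipa[n] = ipa;   /* consecutive input addresses       */
            out_pa[n]  = pa;    /* scattered output addresses        */
            n++;
            ipa += PAGE_SIZE;
            pa  += PAGE_SIZE;
        }
        pa += DB_SIZE;  /* skip the co-domain's block: stride 2 * DB_SIZE */
    }
    return n;
}
```

With the start values of Algorithm 2, OS-domain 1 would be generated by `gen_l3_map(0x80000000, 0x80000000, ...)` and OS-domain 2 by `gen_l3_map(0xA0000000, 0x80010000, ...)`.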

There are multiple possibilities to map these pages. For example, in the introduced algorithm, the DBs are optimized to the maximum size. To some extent, this avoids a high fragmentation of the PA memory.


Figure 6: Exploiting the cache way-set association

5.2 Implementation Issues

The proposed mapping scheme shows how to separate formerly shared resources like the set-associative cache. In the results we show its applicability to avoid DoS attacks. However, the implementation of the scheme involves certain challenges in system and architectural improvements.

As mentioned previously, the system setup is statically defined and initialized during bootup. To implement the proposed mapping, the initialization of the second stage MMU must be completed before the OS images or configuration files such as Device Tree Blobs are loaded into the main memory. This is problematic, since the second stage MMU is commonly initialized using an identity mapping, which means there is no remapping of memory pages. Since we divide the main memory into DBs, this scheme has to be considered when loading the OS images. Basically, the images are bigger than the 64KiB DBs, so they have to be loaded with respect to that pattern.

Furthermore, the proposed memory mapping scheme demands a special treatment of direct memory accesses of other subsystems connected to the common system interconnect. Direct memory accesses are performed on consecutive PA memory areas. In order to use DMA capabilities of certain resources, the maximum transfer size needs to be limited and aligned to the DBs. Otherwise, the IPA address space needs to be established for the DMA-capable resource. This requires that the DMA device is aware of the mapping scheme, and is able to translate the IPA addresses which are communicated by the OS-domains into the real PAs residing in the main memory.

Figure 8: Proposed Architecture

5.3 Properties for a System Architecture

In order to incorporate the outcomes of the key findings in this paper, we propose properties for a secure integration of the domain block mapping.

A secure integration requires either a two-level memory mapping or a memory protection unit. As shown in Figure 8, dedicated MPUs for each master subsystem are proposed. The MPUs must be implemented in a multi-level privilege scheme. This means the configuration is only accessible by the highest privilege level (hypervisor level).

The maximum number of OS-domains, respectively CPUs, supported by this method depends on the cache size, the CLSize and the minimum page size supported by the translation table. Assuming an ARMv7 architecture, the minimum page size is set to 4KiB and the CLSize is 64 Byte.

Domains = WSCount / (PageSize / CLSize)    (6)

In the example configuration, 32 OS-domains or CPUs


Figure 7: The principle of cache domain blocks

would be supported.
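The calculation behind this figure, as a sketch (the helper name is ours):

```c
/* Equation (6): maximum number of OS-domains supported by the DB
 * mapping. One 4 KiB page covers PageSize / CLSize consecutive memory
 * lines, so the way sets can only be split at that granularity. */
unsigned max_domains(unsigned ws_count, unsigned page_size, unsigned cl_size)
{
    return ws_count / (page_size / cl_size);
}
/* max_domains(2048, 4096, 64) == 32, the value given in the text. */
```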

The aforementioned properties are needed to meet theNEAT attributes of the MILS security model.

• Non-bypassable: Once activated, the address translation will be enforced by the MPU or second stage MMU. Since this is implemented in hardware, it cannot be bypassed by the software running on the CPU. However, this implies that i/o devices which are capable of accessing the main memory directly need to be aware of the mapping method.

• Always invoked: In line with the non-bypassable attribute, the translation of memory accesses is done for every read or write attempt of the processor.

• Tamper proof: The mechanism is implemented in hardware, which makes it tamper proof against external manipulation. However, the MMU translation tables reside within the main memory. Their location needs to be in a restricted area that is not accessible to the processors. As a result, each DMA-capable subsystem needs to incorporate a MMU/MPU.

• Evaluable: The herein proposed method to separate shared caches does not introduce any further complexity to the implementation of hardware or software. Our method shows an alternative approach to building a second-level address translation.

6. EXPERIMENTAL SETUP

This section will give an overview of the system setup that was used to produce the results given in Section 7.. To produce the results, a SoC platform was chosen which is

publicly available. Therefore, a Pandaboard [23] that incorporates the Texas Instruments OMAP5 SoC and a set of peripherals necessary for ICM applications is used. The OMAP5 implements a multi processing unit (MPU) subsystem with two ARM Cortex-A15 processors. The MPUs have a direct connection to the main memory via an external memory interface (EMIF). Each Cortex-A15 core has a private 64KiB L1 cache (32KiB each for instructions and data) and a shared 2MiB L2 (unified) cache. The cache line size is 64 Byte and the L2 cache has 2048 WSs.

Two OSs run on the platform, the adversary and the victim OS-domain. Both OS-domains consist of an upstream Linux kernel (version 3.8.13) and a suitable root file system. The methods to measure the proposed effects are implemented on OS level, using Linux kernel modules.

In order to obtain the results, the following functions have been implemented.

• measure-loop(): The loop simply executes iterations over a set of data. During each iteration it loads from a ML into a CPU register using the ARM LDR instruction.

• DoS-loop(): Corresponds to the measure-loop, but without time measurements.

• get_time(): Takes timestamps at the start and end of each iteration to determine the CPU cycles consumed.

• prepare_cachelines(): Determines the CLs belonging to the targeted WS.

• get_next_CL(): Iterates through the CLs.

The loops iterate over a set of MLs/CLs determined by the equations given in Section 4..


measure_loop() {
    for (k = 0; k < TEST_ITERATIONS; k++) {
        reset_timer();
        prepare_cachelines();
        for (l = 0; l < value; l++) {
            start = get_time();
            get_next_CL();
            end = get_time();
            cycles += end - start;
        }
    }
}

Listing 1: Pseudo code to measure the latency
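For experimentation outside a kernel module, a hedged userspace analogue of Listing 1 could use clock_gettime() as the time source; the absolute values are then nanoseconds rather than CPU cycles and are not comparable to the paper's numbers:

```c
#define _POSIX_C_SOURCE 199309L
#include <stdint.h>
#include <time.h>

/* Userspace sketch of Listing 1, assuming clock_gettime() instead of
 * the timer subsystem used in the paper's kernel modules. */
static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Sum the latency of one load per prepared cache line over several
 * test iterations; `lines` plays the role of prepare_cachelines(). */
uint64_t measure_loop(volatile uint8_t **lines, unsigned n_lines,
                      unsigned iterations)
{
    uint64_t total_ns = 0;
    for (unsigned k = 0; k < iterations; k++) {
        for (unsigned l = 0; l < n_lines; l++) {
            uint64_t start = now_ns();
            (void)*lines[l];              /* the timed memory access */
            total_ns += now_ns() - start;
        }
    }
    return total_ns;
}
```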

7. RESULTS

The methods according to Sections 4. and 5. have been evaluated. The results are obtained by measuring the latencies of memory accesses.

The measurements are performed as follows: The victim OS-domain logs the latency of its memory accesses by producing a characteristic workload through the access of a specific domain block. The adversary OS works on the same amount of data within the same domain block set. The following characteristics are considered:

• CPU cycle count: Number of cycles consumed by an operation. This value quantifies the duration of a measurement. The cycle count was taken from the timer subsystem of the experimental platform.

• CL count: Number of CLs allocated and iterated throughout the measurement.

• DoS impact: This value describes the CPU cycle count of an operation being interfered with, expressed as a percentage of the undisturbed cycle count.

In order to produce the results, the arithmetic mean of the measurements has been computed. During the measurements some factors have been observed which produce particular outliers. Those outliers are caused by functionalities such as cache prediction [24] or bus usage. Technically, it would be possible to turn off those processor features, but this would not produce real-world results, because an adversary is not able to do so. The general aim was to prove that the concept fits the predictions rather than to build a polished output.

7.1 Maximum Impact

The first measurement shows how the thrashing impact compares to the number of CLs iterated in a single WS. Both systems run the DoS-loop and measure the latency concurrently. The results proving the DoS method are depicted in Figure 9. The most significant impact on the execution performance of the victim OS peaks at about 457 percent. This means that by using the attack vector, it is possible to delay the execution of an operation up to this value. Another value observed is the number of CLs that have been allocated to cause the interference. The highest peak of interference appears when the victim and the attacker are using 16 CLs, meaning a full WS. If

Figure 9: Analysis of WS hit (impact in percent over the number of iterated lines, for the Victim-OS and the Adversary-OS)

both systems use the same full WS, the possibility that a single CL must be replaced by the cache is highest. As a result, the prediction of the attack model, to overcommit a common WS, is proven to be true.

One of the most important results obtained is the maximum thrashing impact. Here, the victim OS executes the measure-loop with latency measurement and the adversary OS only the DoS-loop. The results are shown in Figure 10. For these measurements both systems iterate over 16 CLs, which produces the highest impact (compare Figure 9). Table 2 gives an overview of all observed values. In the test runs, the mean cycle count of a memory access is about 3.602. By activating the DoS-loop the cycle count increases up to 132.803. This implies a DoS impact of about 3686%. Compared to the value of the previous graph, the significantly higher result is explained by the frequency of the DoS-loop: if a single iteration of the DoS-loop runs without the latency marks, the CPU cycle count per step is lower.

7.2 Domain Block Mapping

To evaluate the value of the DB memory mapping, the measurements were repeated after establishing the mappings. The results in Figure 10 compare the CPU cycles consumed during a data fetch. As in the measurements with the identity mapping, both systems iterate through the same full WS. The normal execution of the measure-loop reveals a mean cycle count of 3.599. By applying the DoS-loop to the measurement, the cycle count rises to 8.911. Consequently, this results in a DoS impact of about 247.55%. Comparing the mean cycle counts of the DB and the identity mapping, the DoS impact subsides significantly.

7.3 Overhead Evaluation

Despite the advantages of the separability of the LLC, the herein proposed method will introduce some overhead in certain attributes of the system execution. Therefore we evaluate the approach with regard to the memory access latency, to determine the cost of a single memory access. Furthermore, the latency of memory accesses to consecutive memory areas in copy operations will be discussed.

The introduction of the proposed memory mapping will introduce overhead by retrieving the mapping entries from


Table 2: Domain Block Mapping impact

Mapping        Impact (%)   Mean CPU Cycles   Std. Deviation
identity       -            3.602             0.022
DB             -            3.599             0.021
identity DoS   3686.90%     132.803           2.265
DB DoS         247.55%      8.911             2.062
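The impact figures follow from the mean cycle counts in Table 2; a small helper reproduces them up to rounding (the function name is ours):

```c
/* DoS impact as used in Section 7: the disturbed mean cycle count
 * expressed as a percentage of the undisturbed mean. */
double dos_impact(double thrashed_cycles, double base_cycles)
{
    return 100.0 * thrashed_cycles / base_cycles;
}
/* dos_impact(132.803, 3.602) ~= 3686.9  (identity mapping under DoS)
 * dos_impact(8.911, 3.599)   ~= 247.6   (DB mapping under DoS)      */
```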

Figure 10: Comparison of identity and DB mapping (mean CPU cycles for the identity, identity thrashed and DB thrashed cases)

the MMU translation tables. This might be caused by additional table walks. A full translation table lookup is called a translation table walk. As mentioned previously, the MMU translation tables are concatenated tables, where each level describes a finer-grained amount of data, from the higher to the lower level. Using an identity memory mapping, the granularity of the Level 2 block descriptors is sufficient to produce a proper system mapping. For the domain block mapping, Level 3 descriptors which map to the 4KiB pages need to be used rather than the 2MiB blocks in Level 2. This implies that a lookup for a physical output address needs to walk through descriptor levels 1 to 3, where Level 3 has an architecture-dependent number of entries.

As mentioned above, the mapping method in the second stage MMU implies three table walks to convert the IPA input address into the PA output address. In order to quantify the cost of the additional table walk, the latency of the measure-loop has been observed with the original 2MiB Level 2 mapping and with the domain block mapping which resides in the Level 3 translation table. The results show that the DB mapping method does not add any performance overhead to the memory access if a single WS is considered. The mean CPU cycles remain at 3.602 for the identity mapping and 3.599 using the DB mapping.

In addition to the table walks, the MMU implements a cache for its recently used translation entries. This cache is commonly referred to as a translation lookaside buffer (TLB). Depending on the particular architectural specification, the TLB caches up to 512 translation descriptors. Assuming a normal identity system mapping on our experimental platform, utilizing Level 2 descriptors for 2MiB blocks, there are approximately 2048

Figure 11: Memory bandwidth comparison (mean CPU cycles for copying blocks of 64KiB to 4096KiB; identity vs. DB mapping)

descriptors to be stored. In contrast, the DB mapping method, which uses fine-grained Level 3 descriptors, requires about 1048576 translations (the size of the main memory divided by the size of a page mapped by a Level 3 descriptor). The probability of refreshing the TLB is therefore much higher with the DB mapping than with the normal method.

Particularly, in this architecture the TLB stores up to 512 translations. Since in the previous scenario only 32 different CLs are accessed, these translations reside inside the TLB. Therefore, we increased the number of CLs in our access scenario to exceed the capacity of the TLB. The results quantify the impact on the access latency caused by refreshing the TLB. In this case the mean access latency of the measure-loop increases by about 2.52 percent.

In practice, infotainment systems sometimes transfer large amounts of data, for example graphics data or media files handled by the CPU. Therefore, we determine the behaviour of copy operations on larger data blocks, in contrast to our previous measurements. Figure 11 shows a comparison of the mean CPU cycles consumed by copying a particular amount of data, and thus analyses the overhead for large data operations. The data block is allocated contiguously in the intermediate address space. The results show that the bandwidth of the identity mapping and the DB mapping is at virtually the same level; in fact, the domain block approach has no influence on the transfer of contiguous memory areas.
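A rough stand-in for the copy benchmark behind Figure 11 can be sketched as follows; `perf_counter` here replaces the CPU cycle counter used on the real hardware, and the block sizes are the ones shown on the figure's axis:

```python
import time

# Illustrative copy-bandwidth loop: time a plain contiguous copy of each
# block size from Figure 11 (4096 KiB down to 64 KiB).

KIB = 1024
sizes = [4096, 2048, 512, 256, 128, 64]   # block sizes in KiB, as in Figure 11

for kib in sizes:
    src = bytearray(kib * KIB)
    t0 = time.perf_counter()
    dst = bytes(src)                      # contiguous block copy
    elapsed = time.perf_counter() - t0
    print(f"{kib:5d} KiB copied in {elapsed * 1e6:10.1f} us")
```

On the real platform the same loop would be driven by the cycle counter and run against the identity and DB stage-2 mappings in turn.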

The evaluation shows the negative performance impact of deliberate interference on shared caches. By utilizing the DB mapping, it is possible to substantially mitigate these effects. Furthermore, the performance overhead for domain blocking is negligible. Nevertheless, we still observe a small performance impact when using the DB mapping. This might be caused by the bus interconnect, which transports the data from the main memory through the caches to the CPUs.

8. CONCLUSION

In Mixed Criticality Systems the isolation of the independent system components is important. By implementing an AMP-based multi-OS, this goal can be achieved by design. However, the separation of common resources like caches must be proven to achieve a reliable system composition. In this article, a method has been introduced to reduce the surface for interference and DoS attacks on shared caches. In order to prove the separability of concurrent OS-domains in a mixed-critical system, the approach proposed herein has been designed to meet the requirements of established security models. The method is based on the partitioning of memory lines within the main memory. According to the way-set associativity, these partitions, so-called domain blocks, can be assigned to OS-domains. The mapping is enforced using an MMU within the stage-two memory address translation. Additionally, the method fits the AMP system design paradigm.

Furthermore, this article shows the significance of, and the need for, a solution to mitigate or prohibit DoS attacks on shared last-level caches. In addition, the surface for interference is quantified to evaluate our solution. The proposed domain block mapping extends the AMP paradigm down to the shared caches of CPU subsystems in SoCs. Since the approach uses the second-stage MMU, which is used by the system anyway, the complexity of the system design and the additional engineering cost of the method are kept to a minimum. Moreover, the measurement results show negligible performance degradation. The method enables a more reliable implementation of AMP-based multi-OSs on MP-SoCs using shared caches, without the need to modify the pre-qualified hardware layout. The method and the results also have an impact on other disciplines related to SoCs: in the real-time research field, it can be used to make memory access more predictable, regardless of whether a multi-OS is implemented or not.
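The underlying set-partitioning idea can be sketched as page colouring: if the stage-2 MMU only hands a domain physical frames of certain colours, that domain's lines are confined to the corresponding cache sets. The cache parameters and domain split below are hypothetical, chosen only to make the disjointness visible:

```python
# Sketch of the domain-block idea as page colouring, assuming a shared
# 2 MiB, 16-way L2 cache with 64-byte lines (2048 sets) and 4 KiB pages.

LINE = 64
WAYS = 16
CACHE = 2 * 1024 * 1024
SETS = CACHE // (WAYS * LINE)            # 2048 sets
PAGE = 4096
COLOURS = (SETS * LINE) // PAGE          # 32 distinct page colours

def cache_set(pa):
    """Set index selected by a physical address."""
    return (pa // LINE) % SETS

def colour(frame):
    """Colour of a 4 KiB physical frame number."""
    return frame % COLOURS

# If the stage-2 MMU hands domain A only frames of colours 0..15 and domain B
# only frames of colours 16..31, their cache lines can never share a set:
sets_a = {cache_set(f * PAGE + o) for f in range(64) if colour(f) < 16
          for o in range(0, PAGE, LINE)}
sets_b = {cache_set(f * PAGE + o) for f in range(64) if colour(f) >= 16
          for o in range(0, PAGE, LINE)}
print(len(sets_a), len(sets_b), sets_a.isdisjoint(sets_b))  # -> 1024 1024 True
```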

The present consideration focuses on a dual-core CPU subsystem. In the future, the technique can be applied and evaluated on SoCs that incorporate quad-core (or larger) processors sharing a last-level cache. The proposed domain blocks make it necessary to extend the mapping to the fully addressable space on the SoC. Therefore, solutions to establish these domain blocks for devices which access the main memory directly must be considered. Furthermore, the cache interconnection bus provides a further surface for interference. In future work, this particular aspect has to be considered so that the approach fits the AMP system design more adequately.

ACKNOWLEDGMENTS

The authors would like to thank Alexander Minor and Thorsten Schnarz for their excellent practical efforts and comments. Furthermore, we appreciate the detailed and helpful comments of the reviewers.



SAIEE AFRICA RESEARCH JOURNAL – NOTES FOR AUTHORS

This journal publishes research, survey and expository contributions in the field of electrical, electronics, computer, information and communications engineering. Articles may be of a theoretical or applied nature, must be novel and must not have been published elsewhere.

Nature of Articles

Two types of articles may be submitted:

• Papers: Presentation of significant research and development and/or novel applications in electrical, electronic, computer, information or communications engineering.

• Research and Development Notes: Brief technical contributions, technical comments on published papers or on electrical engineering topics.

All contributions are reviewed with the aid of appropriate reviewers. A slightly simplified review procedure is used in the case of Research and Development Notes, to minimize publication delays. No maximum length for a paper is prescribed. However, authors should keep in mind that a significant factor in the review of the manuscript will be its length relative to its content and clarity of writing. Membership of the SAIEE is not required.

Process for initial submission of manuscript

Preferred submission is by e-mail in electronic MS Word and PDF formats. PDF format files should be ‘press optimised’ and include all embedded fonts, diagrams etc. All diagrams are to be in black and white (not colour). For printed submissions contact the Managing Editor. Submissions should be made to:

The Managing Editor, SAIEE Africa Research Journal, PO Box 751253, Gardenview 2047, South Africa.
E-mail: [email protected]

These submissions will be used in the review process. Receipt will be acknowledged by the Editor-in-Chief and subsequently by the assigned Specialist Editor, who will further handle the paper and all correspondence pertaining to it. Once accepted for publication, you will be notified of acceptance and of any alterations necessary. You will then be requested to prepare and submit the final script. The initial paper should be structured as follows:

• TITLE in capitals, not underlined.
• Author name(s): First name(s) or initials, surname (without academic title or preposition ‘by’).
• Abstract, in single spacing, not exceeding 20 lines.
• List of references (references to published literature should be cited in the text using Arabic numerals in square brackets and arranged in numerical order in the List of References).
• Author(s) affiliation and postal address(es), and email address(es).
• Footnotes, if unavoidable, should be typed in single spacing.
• Authors must refer to the website http://www.saiee.org.za/arj, where detailed guidelines, including templates, are provided.

Format of the final manuscript

The final manuscript will be produced in a ‘direct to plate’ process. The assigned Specialist Editor will provide you with instructions for preparation of the final manuscript and the required format, to be submitted directly to:

The Managing Editor, SAIEE Africa Research Journal, PO Box 751253, Gardenview 2047, South Africa.
E-mail: [email protected]

Page charges

A page charge of R200 per page will be charged to offset some of the expenses incurred in publishing the work. Detailed instructions will be sent to you once your manuscript has been accepted for publication.

Additional copies

An additional copy of the issue in which articles appear will be provided free of charge to authors. If the page charge is honoured the authors will also receive 10 free reprints without covers.

Copyright

Unless otherwise stated on the first page of a published paper, copyright in all contributions accepted for publication is vested in the SAIEE, from whom permission should be obtained for the publication of whole or part of such material.

South African Institute for Electrical Engineers (SAIEE)
PO Box 751253, Gardenview, 2047, South Africa
Tel: 27 11 487 3003 | Fax: 27 11 487 3002
E-mail: [email protected] | Website: www.saiee.org.za