
Tool testing and reliability issues in the field of digital forensics

Abstract

The digital forensic discipline is wholly reliant upon software applications and tools designed and marketed for the acquisition, display and interpretation of digital data. The results of any subsequent investigation using such tools must be reliable and repeatable whilst supporting the establishment of fact, allowing criminal justice proceedings the ability to digest any findings during the process of determining guilt or innocence. Errors present at any stage of an examination can undermine an entire investigation, compromising any potentially evidential results. Despite a clear dependence on digital forensic tools, arguably, the field currently lacks sufficient testing standards and procedures to effectively validate their usage during an investigation. Digital forensics is a discipline which provides decision-makers with a reliable understanding of digital traces on any device under investigation; however, it cannot say with 100% certainty that the tools used to undertake this process produce factually accurate results in all cases. This is an increasing concern given the push for digital forensic organisations to now acquire ISO 17025 accreditation. This article examines the current state of digital forensic tool-testing in 2018 along with the difficulties of sufficiently testing applications for use in this discipline. The results of a practitioner survey are offered, providing an insight into industry consensus surrounding tool-testing and reliability.

Keywords: Digital Forensics; Testing; Validation; Research; Error-Rate; Reliability.

1 Introduction

The field of digital forensics (DF) has long been defined as involving the acquisition, examination, interpretation and reporting of digital evidence (Carrier and Spafford, 2003). The discipline requires the use of forensic software to analyse digital data which exists largely in an intangible form, stored and interpreted via the processing and translation of its resident electronic signal information. Despite digital storage media itself being capable of a high-level physical and visual examination (a typical USB memory stick or hard disk drive), the data stored upon these devices is only examinable via the use of specific equipment and software capable of interpreting it and displaying it in a readable format. Whilst microscopic technologies potentially offer the ability to manually analyse data on some device types at a sector level, it is not feasible to consider investigating media in this way in most instances (Wright, Kleiman and Sundhar, 2008). As a result, DF practitioners are fully reliant on the DF software tools they use during an investigation to provide an accurate interpretation and presentation of digital information (Guo, Slay and Beckett, 2009).

While all forensic disciplines are dependent upon the tools they use during an examination to ensure valid results, arguably the level of reliance in the field of DF is greater. To stress this point, without the use of DF software, practitioners in most cases would not be able to see what content is stored on an item of digital storage media without compromising data integrity (Association of Chief Police Officers, 2010). There is little feasible scope to manually examine digital storage media without specialist equipment, and any such examination would face scalability issues and be subject to human error. If the process of interpreting digital evidence is inaccurate, leading to erroneous data being presented to a practitioner for evaluation, any subsequent investigation may be compromised, potentially beyond the knowledge of the practitioner. To provide a simplified example, Figure 1 compares a fingerprint examination procedure with a simplified digital storage media examination procedure. In traditional forensic sciences, the examination of fingerprints involves the preservation and collection of a print, followed by an analysis and interpretation of the sample and then subsequent evaluation and matching (Warner, 2012). Notably, a practitioner may be involved in all stages of this process, and even where reliance is placed upon procedures for automated matching of print structures, a manual verification of the match can take place by sight in order to validate accuracy. As a result, the practitioner can be involved in all stages of the investigation and can intervene and confirm results at any point.

This is not the case with digital storage media analysis. A DF practitioner's analysis of digital data often starts at the data visualisation phase, post-acquisition and before the interpretation phase, which are both completed by forensic software. DF practitioners analyse the results of procedures which acquire and interpret the digital content of a form of digital storage media. In comparison to fingerprint analysis, these two stages are not manually verifiable, as digital storage media content cannot be analysed by sight alone to confirm or deny its existence, or to confirm that a tool's interpretation is correct. Instead, specialist software and hardware interpret signals stored on the physical device and convert this information into a viewable form. Therefore, a DF practitioner's engagement in the investigatory process in terms of analysing digital data occurs late in the overall procedure of an investigation. The issue remains: what happens if a tool's interpretation of data is incorrect? In the absence of the ability in most cases to visually or manually verify results, practitioners are vulnerable to misinterpretations or errors made by forensic tools, potentially beyond their detection. Such issues may also exist in other forensic science disciplines where reliance is placed on software to allow analysis of data, for example, DNA analysis.

Figure 1. A comparison of simplified forensic procedures for fingerprint and digital storage media analysis. (Fingerprint image source: http://www.gadgetsnow.com/photo/57061444.cms)

Garfinkel et al. (2009, p.2) stated that 'sadly, much of today's digital forensic research results are not reproducible. For example, techniques developed and tested by one set of researchers cannot be validated by others since the different research groups use different data sets to test and evaluate their techniques'. DF analysis now forms part of many criminal cases (Parliamentary Office of Science and Technology, 2016) where forensic tools must meet a required level of reliability in order to be accepted for use in legal proceedings (Craiger et al., 2006). Casey's (2017) recent editorial highlights the continued need for developments in tool-testing in the DF field to improve standards of evidence. Yet, whilst the volume of DF research has significantly increased in recent years, work on tool-testing appears to have stagnated, with few projects focusing on tackling this issue. This article provides an examination of the problems surrounding DF tool-testing, debating the feasibility of tool-testing and potential solutions, and examining past concerns and current issues in 2018. The results of a practitioner survey with up to 100 responses are also offered, providing industry consensus on current tool-testing structures and strategies. Finally, concluding thoughts are offered.

2 Tool-Testing

The problem facing DF regarding tool-testing, and the potential lack of it, is best stated by the Association of Chief Police Officers (2010, p.50):

“All software has ‘bugs’ (minor programming anomalies) which can cause the erroneous reports of what appears to be fact”.

Tool-testing is the DF field's elephant in the room. Whilst those operating in the field of DF acknowledge the dependency practitioners maintain on DF software, there remains minimal discussion as to whether these tools are trustworthy and, importantly, how to establish this. There is also limited discussion as to why this is, and whether the discipline has simply come to accept that errors inevitably exist within the tools it uses and that these must simply be managed to the best of the field's ability (in some cases, this may simply mean the replacement of a malfunctioning piece of hardware, noticeable due to a process failure) in an attempt to limit their impact on investigations. Such a statement remains a supposition offered via this article, unsupported by published reports, a fact which in itself stunts the growth of tool-testing cultures in DF, as errors in this discipline are rarely reported publicly. However, if such a consensus exists, further issues arise regarding whether such a situation is acceptable given the nature of the work undertaken in the DF field. A plausible consideration remains that the issues posed by DF tool-testing are simply too difficult a task for any one individual to undertake and, as yet, the field lacks a global regulatory body willing to take on this issue in its entirety.

Attempts at tool-testing have previously been made, with some continuing to tackle the need for testing. Perhaps most notable is the National Institute of Standards and Technology's (NIST) (2015b) Computer Forensics Tool Testing (CFTT) Project. The CFTT Project has been in operation since 2000 (Guttman, Lyle and Ayers, 2011) and attempts to establish robust testing methodologies from which to validate and evaluate DF tool performance, covering tasks from disk imaging and media preparation to file carving and key term searching. The CFTT Project's objective 'is to provide measurable assurance to practitioners, researchers, and other applicable users that the tools used in computer forensics investigations provide accurate results' (Guttman, Lyle and Ayers, 2011). Whilst the CFTT Project offers an invaluable resource to practitioners, it also provides an insight into the issues surrounding DF tool-testing.

The CFTT Project is a 'joint National Institute of Justice (NIJ), the Department of Homeland Security (DHS), and the National Institute of Standards and Technology's (NIST's) Law Enforcement Standards Office (OLES) and Information Technology Laboratory (ITL). CFTT is supported by other organizations, including the Federal Bureau of Investigation, the U.S. Department of Defense Cyber Crime Center, U.S. Internal Revenue Service Criminal Investigation Division Electronic Crimes Program, the Bureau of Immigration and Customs Enforcement and U.S. Secret Service' (Guttman, Lyle and Ayers, 2011, p.1). Whilst to date almost 100 tool-testing reports have been released (Department of Homeland Security, 2017), this number falls short of the volume of tools and techniques which are currently in circulation and operation in DF investigations worldwide. To provide an example, the file carving algorithms used in EnCase v7.09.05 and v6.18.0.59 have been tested using a preconstructed raw dd image dataset designed for the recovery of the still image formats of .bmp, .png, .tiff, .gif and .jpg files, with reports published in 2014. Yet the following issues remain.

1. Testing covers only two sub-versions of EnCase, where multiple updates to the version 6 and 7 packages have been released. Reliance on these findings in relation to a different sub-version would be based on the assumption that the underlying carving algorithm in the EnCase toolset was not subject to update in any subsequent software releases. As Guidance Software's EnCase is closed source, it would not be possible to analyse the carving algorithm code thoroughly and therefore any such assumption would be unreliable. Even if Guidance Software formally announces no changes to the algorithm, an independent verification of no change should take place for completeness. In addition, as EnCase is a tool suite, it is necessary to ensure that any non-algorithmic updates have not indirectly affected the performance of the carving algorithm itself or the display of subsequent results.

2. The release dates of the tool-testing reports also demonstrate the burden of taking on the task of tool-testing and the resources needed. EnCase v6 was released in 2007 (Guidance Software, 2007) and v7 in 2011 (Guidance Software, 2011). Both of the CFTT Project reports regarding EnCase's carving functionality were published on July 16, 2014.

3. Testing has yet to cover EnCase 8, released in early 2016. Whilst this testing may currently be ongoing, over one year has passed since the release, meaning that potential issues which would have been flagged by the CFTT Project are yet to be identified. Further, it demonstrates that the rate required for thorough testing is not achievable in line with the speed of software releases.

4. Tests are narrowly defined, covering five graphic file types. Whilst the structure of data on the test media covers a range of scenarios (fragmentation, cluster padding etc.), it cannot be assumed with 100% confidence that because the algorithm functions for .bmp, .jpg, .tiff, .gif and .png, it will also operate with the same characteristics when attempting to recover other file variants. It would be easy to make such assumptions; however, doing so begins to pick away at the very purpose of implementing thorough tool-testing in the first instance. In theory, functionality is likely to remain unchanged, but the very purpose of effective tool-testing is to determine this as fact.

5. Testing is confined to only the 'dd' image format. Whilst, in theory, forensic image file formats should not impact the process, for the purposes of thorough testing they should be explored.


Whilst individual attempts have been made, the CFTT Project is currently the only sustained attempt at maintaining the burden of tool-testing. Additional efforts often result in the production of test data for implementation by the practitioner, but traditionally job roles have not accommodated this volume of work (Beckett and Slay, 2007). Tool-testing research is not in abundance, despite the growing reliance on digital forensic evidence in criminal cases, and as a result there are very few solutions to this growing issue. Following an extensive literature review, it appears that the issues surrounding tool testing continue to exist. Arguably, tool testing attracted peak academic attention between 2007 and 2012. The limitations of testing strategies and discussions for improving such processes received critical commentary, and whilst some developments have taken place, largely these issues remain. As a result, this work aims to restate such points and acknowledge that tool testing issues in DF continue to persist, and that the questions raised during initial research some 10 years earlier must now be restated in current research. These points are discussed and analysed in a modern context in Sections 4.1-4.7.

2.1 Areas for Tool-Testing

A DF investigation is multifaceted, with the goal of producing forensically sound evidence (McKemmish, 2008), with Daniel (2012) defining a forensically sound tool as one which is definable, predictable, repeatable and verifiable. Figure 2 provides a visual depiction of the three main areas in a typical traditional investigation where tool-testing can have an impact, notably acquisition, the parsing and display of file system structures, and system/application artefact parsing. A tool error occurring at each stage can impact an investigation with different levels of severity, and each is discussed in turn below.

Figure 2. Breakdown of the data interpretation stages.

2.1.1 Data Acquisition


At the foundation of many investigations lies data acquisition; the process of creating a forensically sound copy of the whole (or, in some instances, partial) storage area of digital media (Casey, 2011). In all instances the goal remains the same: to extract, in a forensically sound manner, data exactly as it originates (or as close to this as possible in cases such as physical memory). This step is vital to the forensic process, where tool-generated errors at this point would undermine any subsequent examination stemming from a forensic image created during this process. As a result, a tool error occurring at this stage of an investigation has the potential to be critical. A forensic examination relies on being able to examine data from an event, and if this data cannot be captured accurately, reliable and applicable examination results can often not be produced. Current reliability measures such as hash checks are in place, allowing a comparison of data to be made in order to identify a match, where a disparity in hash values indicates non-comparable data (subject to hash collisions (Thompson, 2005)).
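
A minimal sketch of the kind of hash check described above is given below, assuming a source device is accessible read-only (for example via a write blocker) and an acquired image exists as a file; the paths shown are hypothetical and purely illustrative.

import hashlib

def sha256_of(path, block_size=1024 * 1024):
    """Stream a file or raw device and return its SHA-256 digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()

# Hypothetical paths: the original exhibit (read via a write blocker) and its acquired image.
source_hash = sha256_of("/dev/sdb")
image_hash = sha256_of("exhibit01.dd")

if source_hash == image_hash:
    print("Acquisition verified: hashes match (" + source_hash + ")")
else:
    print("Hash mismatch - the acquired image should not be relied upon")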

The problem remains that the burden of proof in criminal investigations is high; in the United Kingdom it is beyond reasonable doubt. Yet if it cannot be guaranteed that an examination is based on a reliable and sound representation of suspect material, ambiguity already exists. If a tool cannot guarantee the validity of data acquisition processes, the following questions are raised:

● Why has the tool not been able to effectively acquire data?:- If a successful image of a device's content cannot be acquired, it raises questions as to whether there is an issue with the target hardware or whether any imaging algorithm is fundamentally flawed. The problem lies with establishing which one is the source of the error.

● What has the tool missed?:- Investigations need to include an objective examination of all available evidential data. If practitioners cannot in the first instance gather all data needed for an examination, it raises questions as to what has been missed. Whilst in some instances threshold amounts of evidence may provide a counter-argument, it cannot be a reliable one.

● What has a tool potentially added?:- As with the issue of missed data noted above, questions should also be raised as to whether the process has resulted in the addition of any content, particularly if this content is evidential. Despite this being unlikely, if the acquisition algorithm cannot be fully documented, its function is unlikely to be fully trusted. Further, the implementation of the tool's algorithm must be validated as correct and free from error in order to determine if the software is functioning in a reliable and correct way (Scientific Working Group on Digital Evidence, 2017). A sketch of a block-level comparison which speaks to the last two questions follows this list.
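
As referenced above, one illustrative way of probing the 'missed or added data' questions is to compare the source and the acquired image block by block rather than relying on a single whole-image hash, which at least localises where a discrepancy occurs. The sketch below is offered under the assumption that both source and image are readable as files; paths and the block size are hypothetical choices.

import hashlib

def block_hashes(path, block_size=4096):
    """Yield (offset, SHA-256 digest) for each fixed-size block of a file or raw device."""
    with open(path, "rb") as f:
        offset = 0
        while True:
            block = f.read(block_size)
            if not block:
                break
            yield offset, hashlib.sha256(block).hexdigest()
            offset += len(block)

# Compare the acquired image against the source block by block.
# Note: zip() stops at the shorter input, so a separate size comparison is still
# needed to detect a truncated or padded image.
mismatches = [
    offset
    for (offset, src_hash), (_, img_hash) in zip(block_hashes("/dev/sdb"),
                                                 block_hashes("exhibit01.dd"))
    if src_hash != img_hash
]
print("First diverging block offsets:", [hex(o) for o in mismatches[:10]])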

2.1.2 File System Interpretation

Unlike the challenges faced during acquisition and the diversity of device standards and interfaces, file systems have, in comparison, remained largely consistent, with only a comparatively small subset of well documented mainstream structures (NTFS, EXT etc.). It is key to emphasise here that this remains a statement relatable to mainstream technologies (unregulated and unpopular file systems may be released but attract limited usage). Whilst there are multiple file systems in operation, with updates and amendments being applied, and arguably areas of current knowledge weakness (see for example mobile device file systems), in contrast to the volume of different devices in circulation which a practitioner must examine, there is a smaller number of file system types which are likely to be encountered. For example, Microsoft operating systems have utilised NTFS for multiple iterations. This provides a potential opportunity for forensic analysts to obtain a sound understanding of its functionality. As a result, the level of research and development targeted at this information, combined with limited internal changes, may suggest that practitioners are less likely to encounter errors at this stage (albeit some inconsistent results at this level have been reported on practitioner forums; see ForensicFocus (2013)). Such a supposition may arguably be controversial, where tools for the parsing of a file system may focus on displaying file and folder content to a user and inaccurately interpret file system metadata, possibly due to insufficient testing. Errors at this stage are not beyond possibility, yet at a file system level there is a smaller set of variables (file systems) at which to target research resources in order to better understand this type of analysis. Such a view may be oversimplified; however, the risks of incorrect file system analysis should not be understated, as when new file systems are released (for example, Apple's latest APFS release), some remain undocumented and unsupported for a period of time whilst tool vendors catch up, or in some cases their implementation may be inconsistent.

Providing that the acquisition stage produces a valid and complete set of data, should the file system parsing process contain errors, the practitioner still has all available data from which to correct any misinterpreted or missed data. The problem is that, due to the complexity and size of modern computer file systems, a practitioner may be unlikely to a) detect the error in the first instance (unless obvious erroneous interpretation is present) and b) be in a position to manually reconstruct the data given the time, effort and knowledge it may take. As a result, tool errors at this stage are a concern to the practitioner.
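
Where a parser's output is doubted, some low-level values can still be cross-checked by hand against the published on-disk format. As an illustrative sketch only (assuming a raw image containing an NTFS volume starting at offset 0; the file name is hypothetical), the boot sector's geometry fields can be read directly and compared against the figures a forensic suite reports:

import struct

def ntfs_boot_values(image_path, volume_offset=0):
    """Read key NTFS boot sector fields straight from a raw image."""
    with open(image_path, "rb") as f:
        f.seek(volume_offset)
        boot = f.read(512)
    if boot[3:11] != b"NTFS    ":
        raise ValueError("No NTFS signature at the given offset")
    bytes_per_sector = struct.unpack_from("<H", boot, 11)[0]
    sectors_per_cluster = boot[13]
    total_sectors = struct.unpack_from("<Q", boot, 40)[0]
    mft_cluster = struct.unpack_from("<Q", boot, 48)[0]
    return {
        "bytes_per_sector": bytes_per_sector,
        "sectors_per_cluster": sectors_per_cluster,
        "total_sectors": total_sectors,
        "mft_start_offset": volume_offset + mft_cluster * sectors_per_cluster * bytes_per_sector,
    }

# The values returned here can be compared against those displayed by a forensic tool.
print(ntfs_boot_values("exhibit01.dd"))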

2.1.3 Artefact Interpretation

Arguably, DF tools which target the interpretation of specific system and application artefacts carry the greatest chance of inaccuracy in terms of interpretation. The reasons for this are as follows:

1. Diversity:- There is a significant volume of different artefact types in need of interpretation, both system and application specific. In comparison, arguably over the last 15 years, those dealing with the development of file system parsing tools have predominantly tackled four main fundamental file system types, notably FAT, NTFS, EXT and HFS. Over this period of time, file system structures have become more greatly understood, with interpretive procedures honed and improved. Whilst potentially not perfect (given that testing is arguably still inadequate), it is hoped that there are fewer errors in the tools used to carry out these tasks. In comparison, the hundreds of thousands of applications in circulation, each capable of being abused and becoming involved in a digital forensic investigation, have led to a wide diversity of file structures and metadata. Whilst some standards have developed (such as the consistent use of SQLite in many Internet browsing and mobile applications), internal structures and metadata types may still vary. In addition, minor application updates can go unnoticed where underlying artefact structures may have been modified, effectively rendering past parsing algorithms ineffective until they are updated to reflect the change. A disparity in file types in need of interpretation inevitably leads to more work in terms of tool development and, ultimately, human error.

2. Individuals Vs. Organisations:- Forensic tool development organisations must balance profitability against demand. As a result, tools often cover core interpretative functionality with a 'sprinkling' of targeted interpretation scripts aimed at specific artefact types, which may be based on popularity and a practitioner's interpretative need. Whilst still comprehensive, these organisations cannot tackle every practitioner requirement. As a result, there are areas which see greater investment in development than others. Although tools such as Magnet's (2017) AXIOM provide wide coverage of artefact analysis, no tool is 100% comprehensive and there will be occasions where practitioners must look to additional tools for help. At this point, an individual examiner's or script writer's tools may be utilised. Such individuals may not be professional code writers or, in the absence of resources, may not implement the thorough, documented testing of funded organisations.

3. Responsiveness:- Artefact parsing tools are often developed in response to an event or product release. Whilst fundamental tool functions such as file system parsing are core to a case, artefact parsing is arguably an 'advantageous bolt-on' and retrospective afterthought. In addition, where new applications are released, the development of parsing scripts can be responsive and rushed in an attempt to be the first to produce a tool offering support and to gain field kudos. In such cases, thorough planning and robust development and testing may be missing.

Unlike acquisition and file system interpretation, errors occurring at the application parsing stage could be considered less severe. This statement is arguably controversial and is offered tentatively for the following reason. If acquisition and file system parsing are functioning 100% correctly, then the practitioner should be presented with the correct set of data from within an artefact. This data is significantly less (in most cases) than that faced during acquisition and file system parsing, and in some cases manually verifying results 'may' (emphasis here) be possible. Similarly, surrounding evidence from other aspects of an investigation may alert a practitioner to the potential existence of such an error, prompting further examination. Yet some caveats remain, such as instances of Internet history parsing where thousands of records may be present, calling into question the viability of manually reviewing them.
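
Because many application artefacts are SQLite databases, one way of manually cross-checking a tool's parsed output, where time permits, is to query the underlying database directly. The sketch below assumes a working copy of a Chromium-style browser History database, in which visit times are stored as microseconds since 1601-01-01; the copied file name is hypothetical.

import sqlite3
from datetime import datetime, timedelta

CHROME_EPOCH = datetime(1601, 1, 1)  # Chromium stores times as microseconds from this epoch

def chrome_time(value):
    """Convert a Chromium timestamp (microseconds since 1601-01-01) to a datetime."""
    return CHROME_EPOCH + timedelta(microseconds=value)

# Work on a copy of the database, never the original evidence file.
con = sqlite3.connect("History.copy")
for url, title, last_visit in con.execute(
        "SELECT url, title, last_visit_time FROM urls ORDER BY last_visit_time DESC LIMIT 20"):
    print(chrome_time(last_visit).isoformat(), url, title)
con.close()

The converted timestamps and URLs can then be compared against the records displayed by the parsing tool under scrutiny.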

3 A Breakdown of Results

To acquire industry consensus regarding tool-testing in the field of DF, a supplementary survey was utilised to acquire anonymous digital forensic practitioner responses. This survey was carried out using the online survey platform Qualtrics, with the survey link distributed via online DF practitioner professional networking platforms and online DF practitioner forums. Participation in the survey was voluntary, with no compensation awarded, and participants were required to be operating within the field of DF in order to offer a valid response. Due to the nature of distributing online anonymous surveys in such ways with an accompanying self-declaration for meeting the suitability criteria, a caveat must be applied that the validity of responses cannot be vetted. However, to minimise this risk, survey distribution was targeted at the practitioner-based locations noted above in order to glean responses from appropriate individuals. The 14 questions contained within the survey are presented in Sections 3.1-3.3. Question design focused on the topics of tool validation, usage and trust, mirroring existing issues surrounding tool testing in an attempt to obtain practitioner consensus, and was influenced by recent practitioner discussions surrounding tool testing and ISO 17025 implications. Questions 1-5 are designed to establish the current state of practitioner tool-testing and procedural implementation. Questions 6-10 examine trust in DF tool usage, apparent errors and dual-tool implementation. Finally, questions 11-14 are designed to gather responses regarding the consensus on existing tool-testing strategies and their reliability and usage. As a collective, the survey is designed to provide a representation of practitioner opinion on the collective stages of tool testing practices in their current state.

To account for partial responses and drop-outs part way through the survey, the volume of respondents is stated for every question and response. The survey results have been broken down into three parts, commencing with questions designed to establish a consensus surrounding tool testing in DF. Analysis of the results is offered throughout.

3.1 Part 1: Is there belief that DF has a tool-testing issue?

Questions 1-5 are designed to acquire a consensus on tool-testing within the DF field. Submissions are displayed below.

Question 1: Are you concerned about the current state of tool-testing in the field of digital forensics?

Yes: 76 (76%)    No: 24 (24%)

Total No. responses: 100

Question 2: Do you implement your own set of tool tests on the forensic/security tools you have/purchase before using them in a live forensic investigation?

Yes: 61 (66%)    No: 32 (34%)

Total No. responses: 93

Question 3: Have you ever utilised a tool during a forensic investigation which you have not tested yourself, instead relying on the vendor/provider?

Yes: 71 (79%)    No: 19 (21%)

Total No. responses: 90

Question 4: Do you feel it is acceptable in the forensics field if you cannot personally confirm the accuracy of your tool's results?

Yes: 13 (16%)    No: 70 (84%)

Total No. responses: 83


Question 5: Do you think there is a lack of transparency from forensic software providers regarding their product's error rates and testing procedures which have been carried out?

Yes: 72 (90%)    No: 8 (10%)

Total No. responses: 80

3.1.1 Part 1 - Do we have an issue?:- Analysis

The results of Q1 suggest a consensus of concern (over three quarters of respondents (76%)) regarding the current state of tool-testing in DF. Presumably as a direct result of stances held in Q1, 66% of Q2 respondents indicated that they self-test their applications before using them in an investigation (indicating that 34% do not undertake such procedures). What is key to note here is that respondents acknowledged using their 'own set of tool tests', inferring that steps have been taken to develop tests capable of assessing the reliability of their tools, in addition to those undertaken by a vendor. Whilst Q2 suggests the positive acknowledgement and undertaking of self-tool-testing (despite a drop in value between those concerned in Q1 (76%) and those self-tool-testing (66%)), 79% of respondents to Q3 reported using a tool which they had not personally tested. Such a majority does raise concern given the apparent apprehension surrounding the state of tool testing (noted in Q1), particularly with regard to Q6, where 88% of respondents reported having 'encountered erroneous results generated by a forensic tool', and Q8, where 48% of respondents were concerned 'that a tool they are using may miss evidential content because of inadequate tool-testing carried out' (both discussed in Section 3.2.1). The results of Q3 may infer a reliance on vendor testing procedures, which may be due to numerous factors, including resourcing and organisational requirements which may restrict a respondent's ability (a lack of time, for example) to design and implement sufficient tool-testing.

Whilst some may argue that the cost of purchase and yearly licensing of major DF products should adequately cover the cost of in-house vendor testing, it is arguably not feasible to lay the burden of tool-testing solely at the door of a vendor, with many indicating that, on purchase, a tool should be further tested before use, and with software EULAs noting that errors may be present (Horsman, 2018). The problem with reliance on vendor testing surfaces in Q5, where a lack of transparency regarding product error rates and the testing procedures carried out appears to be an overarching feeling amongst practitioners. Given that this may be a concern and existing testing practices may be limited in terms of uptake, Q4 sits in conflict. Here, 84% of respondents indicated that it was not acceptable if a practitioner 'cannot personally confirm the accuracy of your tool's results'. However, in order to do so, sufficient testing is required, and despite the barriers to this noted above, Q1-3 suggest that a lack of self-testing is likely to prevent practitioners from being able to personally validate the reliability of their tools, even though this is perceived as unacceptable.

3.2 Part 2: Errors and Trust

Part 2 of the analysis focuses on responses to questions 6-10, aimed at establishing a consensus on trust and error rates in DF software. The following questions and responses are presented.

Question 6: Have you ever encountered erroneous results generated by a forensic tool?

Yes: 79 (88%)    No: 11 (12%)

Total No. responses: 90

Question 7: How much do you trust the tools that you are using during an investigation? (81 submissions).

Figure 3. The results of Question 7

Question 8: Have you ever been concerned that a tool you are using may miss evidential content because of inadequate tool-testing carried out on a tool you are using?

No: 12 (15%)    Maybe: 30 (37%)    Yes: 39 (48%)

Total No. responses: 81

Question 9: Have you ever felt it necessary to utilise more than one forensic tool during a specific task due to a fear of it missing/providing incomplete results due to insufficient testing?

Never: 6 (7%)    Sometimes: 23 (29%)    About half of the time: 7 (9%)    Most of the time: 27 (34%)    All of the time: 17 (21%)

Total No. responses: 80

Question 10: What percentage of missed evidence or inaccuracy due to a forensic tool's limitation is acceptable to you?

Maximum: 90%    Minimum: 0%    Mean: 9.8%

Total No. responses: 84

3.2.1 Part 2 - Trust:- Analysis

88% of respondents to Q6 had encountered erroneous results generated by a forensic tool. Such a result raises concerns about both the effectiveness of current testing procedures and the potential number of errors in existence. Whilst Q6 by no means indicates that ineffective test strategies are being used, it is a likely indicator of the sheer complexity and volume of processes and procedures which forensic tools are attempting to cater for, where some likelihood of error is inevitable. Yet Q6 raises a concern when examined in line with the responses noted in Section 3.1.1 and the fact that self-tool-testing is not undertaken by everyone. Q7 indicates the level of trust respondents have in the tools that they use as part of their investigations. Here, only 2% of respondents indicated that they fully trust their tools, and 69% of respondents indicated a score of 7 or greater in terms of trust. Whilst this question suggests that practitioners tend to trust their tools more than they distrust them, there is still an unease in the fact that an element of doubt with regard to their performance exists. Reasoning for a lack of full tool trust is apparent in Q8, where 48% of respondents indicated that they were concerned that a tool they were using may miss evidential content because of inadequate tool-testing; 15% of respondents indicated no concern.

Q9 addresses the implementation of dual-tool strategies (as recommended by the Association of Chief Police Officers (2010)) as a potential method of combating testing and trust issues. Only 21% of respondents always dual-tool when carrying out an investigation, with 34% doing so most of the time; as a result, 55% of respondents can be classed as 'regularly' deploying multiple tools in order to try to protect against tool issues. 36% of respondents are classed as dual-tooling less than half of the time. Dual-tool approaches are not a guaranteed way of improving the reliability of an investigation (discussed further in Section 4.5) and may have resourcing implications in terms of multiple tool purchase costs and processing overheads. However, it is generally seen as good practice, and it is of potential concern that Q9 does not indicate a higher percentage of respondents undertaking this practice given the previous responses. Q10 indicates that, on average, respondents suggested that an 'error rate' of approximately 10% is acceptable. This can be viewed both as a realistic target and as a concern, in that practitioners may be willing to accept such levels of inaccuracy in tools designed to support the establishment of fact.

3.3 Part 3: Existing Strategies

Part 3 focuses on responses to questions 11-14, aimed at establishing potential strategies for improving tool-testing. The following questions and responses are presented.

Question 11: Do you think the field of digital forensics will ever be in a position where all of the major tools in use have been satisfactorily tested?

No: 21 (26%)    Maybe: 28 (35%)    Yes: 12 (15%)    It is currently in that position now: 1 (1%)    Not possible: 18 (23%)

Total No. responses: 80

Question 12: How would you rate the following (0 (inadequate) - 10 (very good)): (Total No. responses = 81).


Figure 4. The results of Question 12.

Question 13: How useful would you personally find the following items as part of tool-testing? Please rate the following (0 (not useful) - 10 (extremely useful)). (Total No. responses = 81).

Figure 5. The results of Question 13.

Question 14 (a complete breakdown of rankings is displayed below this table)


Please rank the following (1 (most trusted) - 6 (least trusted)) sources of a test and documented results which you trust the most. (Total No. responses = 82).

Source (mean ranking score):

A vendor / provider's documented tests and results: 3.78
NIST (or equivalent organisation offering tool-testing): 1.61
Sources originating from within academic institutions: 3.32
Relevant Blogs: 4.23
Relevant Forums: 4.49
Journal / conference articles: 3.57

Figure 6. The results of Question 14.

3.3.1 Part 3: Existing Strategies:- Analysis

Q11 examined whether respondents felt the field of digital forensics would ever be in a position where all of the major tools in use have been satisfactorily tested. 26% indicated that the field would not, with 23% indicating that attaining such a position was not possible. Only 15% suggested it would be able to attain sufficient levels of testing. Such a consensus suggests a number of potential issues. First, if attaining sufficient testing levels is not possible, then the impact of this position on future evidential reliability benchmarks must be assessed. Second, such results suggest that dialogue is needed between practitioners and tool vendors in DF to assess any current shortfalls and identify solutions for developing test strategies.

Q12 provides an indication of the current limitations of available resources for DF tool testing, with respondents asked to rank industry resources, methodologies and standards for testing DF software on a scale of 0 (inadequate) to 10 (very good). Results show that resources (mean score of 4.13), methodologies (mean score of 3.92) and standards (mean score of 3.36) for tool-testing in DF all score an average ranking of less than 5, suggesting a consensus of inadequacy across all three. Such results may reveal part of the issue surrounding tool testing in DF and a reason behind the responses witnessed in Sections 3.1.1 and 3.2.1. If practitioners perceive existing resources to support tool testing to be inadequate, then their adoption as part of laboratory testing may be lessened. Further, where ready-made solutions are not viewed favourably, practitioners may not have the time and resources available to create their own, therefore opting to rely on vendor testing.

Q13 asked respondents to rank the 'usefulness' (0 (not useful) - 10 (very useful)) of source code access, test data access and vendor-produced error rate documentation. Responses of 8 or higher were recorded by 75% of respondents in relation to vendor-published error rates, 77% in relation to access to test data, and 33% in relation to access to source code. Results suggest that source code access may provide limited value, where 43% of respondents scored source code access below 5, in comparison to 9% for access to test data and 2% for access to vendor-published error rates. This result may be expected, as a practitioner can directly respond to known error rates and can run their own tests using test data. However, parsing source code for error indicators requires bespoke programming knowledge and time, which a practitioner may not possess. Discussions regarding the difficulties of source code analysis are elaborated on in Section 4.7.

Q14 asked respondents to rank the trustworthiness (1 (most trusted) - 6 (least trusted)) of sources of documented tests and results. NIST carries clear levels of support and is considered the most trustworthy source, demonstrated by its mean ranking of 1.61. Both relevant blogs and forums score poorly (4.23 and 4.49 respectively), which may be understandable as they may lack formal peer review. However, this may be surprising given that they are an easily accessible source of domain-specific knowledge and are frequently created and maintained by practitioners operating in the field in some capacity, of which www.forensicfocus.com provides a prominent example. Sources originating from within academic institutions (3.32) and journal / conference articles (3.57) could both be considered to score arguably poorly, despite potentially being subject to increased scrutiny and peer review, in some cases comparable to that of work by NIST. This may indicate a disconnect between industry and academia, where academic work in areas of testing and development may be undervalued and not held in as high regard as other, more prominently known sources such as NIST. Despite this, both sources score better in terms of trust than vendor-produced documentation (3.78), a result which may be due to the perceived lack of transparency in testing processes indicated in Q5.

4 The Problem of Tool-Testing

Tool-testing in DF has been, and continues to be, an issue (Brady, 2018). As DF is a discipline driven by technological change, law enforcement requires the validation of tools to bring DF in line with other forensic disciplines (Guo, Slay and Beckett, 2009). Validation requires the "confirmation by examination and the provision of objective evidence that a tool, technique or procedure functions correctly and as intended" (Guo, Slay and Beckett, 2009, p.3). As defined by SWGDE (2014, p.4), validation testing is a process of 'evaluation to determine if a tool, technique or procedure functions correctly and as intended.' In an ideal world, practitioners would know everything about every tool they use in terms of functionality. Whether this is achieved in the real world is debatable, and it is likely that the field of DF is often operating with tools which have not been fully validated. DF is a discipline driven by the establishment of fact, yet it cannot say in most instances that practitioners operating within it can factually ascertain that the DF tools in use are functioning correctly. Section 4 analyses those issues historically acknowledged in DF with regards to tool-testing and discusses these in line with current developments and limitations.

Casey (2002, p.4) states that "forensic examiners who do not account for error, uncertainty, and loss during their analysis may reach incorrect conclusions in the investigative stage and may find it harder to justify their assertions when cross-examined". Weaknesses in tools can also be exploited for purposes of anti-forensics (Cusack and Homewood, 2013). By no means does this article suggest forensic software companies are producing sub-standard software or implementing sub-standard testing; in fact, such organisations are often in the best position to evaluate their products. Yet arguably their testing cannot sufficiently cover every eventuality and errors are inevitable (see the Casey Anthony trial reported by Digital Detective (2011) and Cusack and Liang's 2011 study highlighting discrepancies in testing between FTK Imager Version 2.9.0, Helix 3 Pro and Automated Image and Restore (AIR) Version 2.0.0). A call for better tool-testing is easy to make, but in practical terms, defining strategies to produce a notable improvement in this area of DF requires consideration of the following problems.

4.1 Dataset Generation

One of the main components (and challenges) of testing in DF is the generation and maintenance of sufficiently detailed and documented test datasets which can be used in an attempt to exhaust a tool's functionality in pursuit of validation (Brunty, 2011; Grajeda et al., 2017). Although test images are available (see, for example, ForensicFocus (2017); DigitalCorpora.org (2017); the National Institute of Standards and Technology's (2016) Computer Forensic Reference Data Sets (CFReDS); and The International Society of Forensic Computer Examiners (2017)), there remain limitations in this area, and this is acknowledged in the results of Q11 of the practitioner survey noted in Section 3. Whilst these data sets provide a good basis for developing tool tests, they are non-exhaustive and arguably do not provide the required depth needed to effectively test the complete functionality of available and future DF tools.

The time needed to create a sufficient data set for use in testing cannot be underestimated (Garfinkel, 2012) and this is the main issue surrounding tool-testing; available resources to dedicate to this issue are sparse. It is key that any data set maintains sufficiently comprehensive data to exhaustively test tool functionality, and this information must also be fully tested and documented to allow a tool's performance to be benchmarked and its limitations identified. A created data set must also be fully documented and contain 'evidence' to test a tool at various levels of interpretation and across all of its features. For example, data for both high-level file recovery and tool-feature chaining, along with byte-level recovery for purposes such as carving and string matching evaluation, may be needed in order to validate tool functionality. Achieving such reliable and documented data sets is a difficult task, but only once this has been achieved can they be distributed for industry-wide use.
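
A minimal sketch of the documentation element described above: a test data set could be accompanied by a machine-readable manifest recording the name, size and hash of every planted file, so that a tool's output can later be benchmarked against known ground truth. The directory and file names used here are illustrative assumptions.

import hashlib
import json
import os

def build_manifest(dataset_dir, manifest_path):
    """Record the relative path, size and SHA-256 of every planted file in a test data set."""
    entries = []
    for root, _, files in os.walk(dataset_dir):
        for name in files:
            path = os.path.join(root, name)
            with open(path, "rb") as f:
                digest = hashlib.sha256(f.read()).hexdigest()
            entries.append({
                "path": os.path.relpath(path, dataset_dir),
                "size": os.path.getsize(path),
                "sha256": digest,
            })
    with open(manifest_path, "w") as out:
        json.dump({"files": entries}, out, indent=2)
    return entries

# The manifest then travels with the test image and defines the expected results of any test run.
build_manifest("planted_files/", "dataset_manifest.json")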

Maintenance also plays a vital role, where it is not enough to simply produce a singular data set (Grajeda et al., 2017). To be effective, datasets must match the pace of technological development (Yates and Chi, 2011; Parliamentary Office of Science and Technology, 2016). When new technology hits the market, effective data set generation must take place for additional tool functionality to be tested. Whilst simple in concept, in reality this requires resources, both time and money, for research and development. Arguably, this can only be achieved through the organisation and implementation of a higher independent governing body which can manage and distribute testing results, not through any singular organisation or research institute.

Finally, those seeking to create test data sets must maintain the knowledge and experience to do so (both at a software development level and in terms of subject expertise within the DF discipline, to ensure complete knowledge coverage). These individuals are not available in abundance and may already hold investigatory roles which would be unlikely to allow the time needed to carry out this task in addition to their working duties.

4.2 Impossible to Exhaust all Testing Scenarios

Test complexity is a crucial aspect of testing in DF, where the field is dealing with an almost infinite number of possible scenarios and subsequent outcomes, all in need of testing. Even when focusing on a singular aspect of functionality, there remain multiple valid outcomes. File carving provides a useful example from which to draw out the underlying issues (see Figure 7). A simple contiguous carving algorithm may identify a file's header and footer, then carve and export all data in between into a separate file. Although this process may seem straightforward, there are actually a number of variables to be examined. First, the algorithm must be tested, including any changeable variables. For example, many carvers offer a range of file recovery options including interchangeable file signatures, predefined signatures and user-definable signatures. To ensure thorough testing, all functionality should be validated. Therefore, although conceptually the process of carving, for example, a .jpg is comparable to carving a .bmp, both processes should be equally scrutinised and subject to the same conditions. The second consideration is that of media type. The carving algorithm should be subject to changes in target media, both physically and in terms of file system structures. The context of files on digital storage media is also a consideration, where both standard and non-standard file positioning should be examined for algorithm performance (see work carried out by the National Institute of Standards and Technology (2015)).
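
To make these variables concrete, the sketch below implements only the simplest case discussed: contiguous carving of JPEG data between a header and footer signature in a raw image. It is offered purely as an illustration (the image file name is hypothetical) and deliberately ignores fragmentation, maximum file sizes and embedded thumbnails, each of which, as Figure 7 indicates, would need its own test conditions.

JPEG_HEADER = b"\xff\xd8\xff"
JPEG_FOOTER = b"\xff\xd9"

def carve_contiguous_jpegs(image_path):
    """Naive contiguous carve: export everything between a JPEG header and the next footer."""
    with open(image_path, "rb") as f:
        data = f.read()  # assumes the image fits in memory; real carvers work in chunks
    carved = []
    start = data.find(JPEG_HEADER)
    while start != -1:
        end = data.find(JPEG_FOOTER, start + len(JPEG_HEADER))
        if end == -1:
            break
        carved.append(data[start:end + len(JPEG_FOOTER)])
        start = data.find(JPEG_HEADER, end + len(JPEG_FOOTER))
    return carved

# Write each carved candidate out for review.
for i, blob in enumerate(carve_contiguous_jpegs("exhibit01.dd")):
    with open("carved_%04d.jpg" % i, "wb") as out:
        out.write(blob)

Even in this trivial form, several of the test conditions noted above apply: the signatures, the handling of a missing footer, and the assumption of contiguity would each need to be exercised by a documented data set before the routine's behaviour could be considered validated.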


Figure 7. A breakdown of the elements of a simple carving process which all need to be validated.

Any external factors which may affect the validity of performance need to be considered and evaluated. This includes all direct and indirect functions of a process. To expand upon this concept, a direct function is the algorithm itself (a carving algorithm used for carving files). Indirect functions include those involved in the display of results; therefore, even if an algorithm functions correctly, incorrectly displayed results can still lead to investigatory errors.

4.3 Establishing Confidence and Repeatability

In 2010, Garfinkel suggested that forensic tool 'algorithms should be reported with a measurable error rate'. This remains a valid argument today and, as previously stated by Carrier in 2013, 'tools must ensure that the output data is accurate and a margin of error is calculated so that the results can be interpreted appropriately'. In the absence of reliability, the evidential value of any results is diminished (McKemmish, 2008). Reports of malfunctioning tools have been made, most notably with the release of FTK 2.0 (Glisson et al., 2013).
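
One way a measurable error rate of the kind Garfinkel and Carrier call for might be expressed is sketched below, under the assumption that a documented ground-truth manifest of file hashes exists for the test image (as in the Section 4.1 sketch). The hash values in the example call are illustrative placeholders.

def recovery_metrics(expected_hashes, recovered_hashes):
    """Compare tool output against ground truth and return simple error-rate figures."""
    expected = set(expected_hashes)
    recovered = set(recovered_hashes)
    true_positives = expected & recovered
    missed = expected - recovered            # planted items the tool failed to recover
    spurious = recovered - expected          # items reported that were never planted
    return {
        "recall": len(true_positives) / len(expected) if expected else 1.0,
        "precision": len(true_positives) / len(recovered) if recovered else 1.0,
        "missed": len(missed),
        "spurious": len(spurious),
    }

# Illustrative hash lists taken from a manifest and from a tool's export report.
print(recovery_metrics({"a1", "b2", "c3", "d4"}, {"a1", "b2", "e5"}))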

To have confidence in the accuracy of a tool requires establishing that the process is repeatable, where, in turn, defined thresholds of repeatability and tolerances of error should be identified (Daniel, 2012). Carrier (2003) states that "to ensure the accuracy of a tool, it must always produce the same output when given a translation rule set and input." With the caveat of live investigations, digital data is static, in a non-volatile forensic image format. As it does not change, any processes run should produce repeatable results, and any changes in result stemming from repeated tests must derive from inconsistencies in the interpretation process. However, assessing reproducibility requires test data sets (Garfinkel et al., 2009), where issues already exist, as noted in Section 4.2 above. Garfinkel et al. (2009, p.3) state that "fundamentally, the reproducibility of scientific results makes it possible for groups of scientists to build upon the results of others", and reproducibility forms part of the U.S. Daubert Test requirements to define known error rates (Garrie, 2014). The issue is also alluded to by the Association of Chief Police Officers (2010, p.50):

“The repeat of the examination should conclude in the report of the same result, thus giving the one ‘nugget of gold’ more reliability and credibility” (Association of Chief Police Officers, 2010, p50).

Whilst the Daubert test may not be legally applicable in some countries outside of the U.S., many other jurisdictions maintain complex rules governing evidence reliability and admissibility, with the goal of ensuring reliable evidence is offered to the court as part of any criminal justice process. Repeatability only forms part of the issue, as it remains viable to repeat a process and achieve the same output, both of which may still be factually incorrect; for example, a flawed algorithm which repeatedly interprets data incorrectly. Therefore, despite being a mandatory requirement in most instances, repeatability is not in itself solely a measure of tool validation, but simply a confidence indicator and one which should form part of the overall process of validating a tool.
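
A sketch of the repeatability element alone (which, as noted, is necessary but not sufficient): hash every file produced by two independent runs of the same tool over the same static image and confirm the output sets are identical. The export directory names are hypothetical.

import hashlib
import os

def output_fingerprint(output_dir):
    """Map each exported file (by relative path) to its SHA-256 digest."""
    fingerprint = {}
    for root, _, files in os.walk(output_dir):
        for name in files:
            path = os.path.join(root, name)
            with open(path, "rb") as f:
                fingerprint[os.path.relpath(path, output_dir)] = hashlib.sha256(f.read()).hexdigest()
    return fingerprint

run1 = output_fingerprint("run1_exports/")
run2 = output_fingerprint("run2_exports/")

if run1 == run2:
    print("Outputs are repeatable across runs")
else:
    differing = {k for k in run1.keys() | run2.keys() if run1.get(k) != run2.get(k)}
    print("Non-repeatable outputs:", sorted(differing))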

4.4 Must Distinguish Tool Errors from User Errors

Part of the challenge of tool-testing requires validating not only correct functionality but also correct usage, and the two need to be distinguished. This is highlighted by Lyle (2010):


“There are three broad sources of error that can occur in execution of a procedure:

1. The algorithm intended for the process.
2. The software implementation of the algorithm.
3. The performance of the process by a person." (Lyle, 2010, p.136).

Fundamentally, a tool must be validated to confirm that it processes data in the manner for which it was intended and that any outputs are reliable and repeatable. However, given the persistent lack of standardisation and practitioner accreditation, differing levels of experience and knowledge are also a factor. Any tool-testing procedure must ensure that tool errors are distinguished from user procedural errors and incorrect tool usage. Tool usage errors are potentially difficult to identify and may be mistaken for actual tool errors, yet this is an issue for the DF field to address at the level of practitioner training and competence.

One method for tackling tool errors is the effective logging of errors which occur during operational processes. Where errors occur, transparent and detailed logging can provide an effective way to address issues and develop knowledge and standards moving forward, improving the reliability of results. The problem with error logging, however, is that an application must first acknowledge that an error has occurred. Whilst some procedures may generate exception errors, errors involving the misinterpretation of data are unlikely to be flagged to the user as, arguably, these would have been addressed at the tool development stage. Testing must also be comparable, taking into account changes in tool configuration. For example, in string search testing, changes to the searchable character set may result in different outputs, yet both processes may be accurate within the confines of their configured approach (Lyle, 2010). Essentially, when validating a tool, establishing what it is capable of (and accepting this as a feature limitation, not an error) is key to ensuring fair and comparable assessments of validity.
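The character-set example can be made concrete with the following illustrative sketch (Python, hypothetical search functions): the two configurations return different hit counts over the same data, yet neither is in error within its own configuration, which is why the configuration must be recorded alongside the result.

```python
# Illustrative only: two string-search configurations applied to the same data.
def search_ascii(data: bytes, term: str) -> int:
    """Count hits when only the ASCII encoding of the term is searched."""
    return data.count(term.encode("ascii"))

def search_ascii_and_utf16le(data: bytes, term: str) -> int:
    """Count hits when both ASCII and UTF-16LE encodings are searched."""
    return data.count(term.encode("ascii")) + data.count(term.encode("utf-16-le"))

data = b"invoice" + "invoice".encode("utf-16-le")
results = {
    "ascii only": search_ascii(data, "invoice"),                    # 1 hit
    "ascii + utf-16le": search_ascii_and_utf16le(data, "invoice"),  # 2 hits
}
# Neither configuration is 'in error'; each is accurate within its configured
# character set, so any comparison of results must record the configuration.
print(results)
```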

Establishing errors, either through independent testing or via published vendor reports, is advantageous to practitioners, yet currently very few such reports are published and available for scrutiny with regards to DF tools. The field is not necessarily demanding tools which function with 100% accuracy 100% of the time (albeit this offers the perfect, yet arguably unattainable, position), but makes a more realistic request for tools which alert practitioners to their actual capability with an accurate account of their reliability and constraints. Often, there is incomplete availability of limitation and known-bug information regarding DF tools, which would allow practitioners to exercise caution where necessary. The results of Q5 in the practitioner survey draw reference to inadequate error rate reporting. In addition, one of the challenges of identifying tool errors is actually knowing that an error has occurred in the first instance. As Marshall and Paige (2018, p.27) note, 'examinations start with a source of potential evidence whose contents are unknown. Thus the inputs to the whole forensic process are unknown. Although the user may have some experience of what abnormal outputs look like, this depends entirely on the tool actually producing abnormal outputs or indications of errors. It is entirely possible for a tool to process inputs incorrectly and produce something which still appears to be consistent with correct operation'. In such cases, error detection may be solely down to the experience of the practitioner operating the tool.

4.5 Is DF Striving for Something that is Unobtainable?

Perhaps a controversial statement, yet one which falls in line with the sentiment echoed in survey Q11: it is unlikely that the field of DF will ever be in a position to satisfactorily test and validate, with 100% confidence, all of the functions of DF tools. This is not possible due to the diversity of tools combined with the lack of a regulatory body (or equivalent) overseeing the production and dissemination of content into industry. Although this is arguably an unsatisfactory situation, it is not a new one and has persisted since DF's commencement. DF has never been in a position to thoroughly test and validate its tools, yet it continues to function and support legal processes daily.

Often, arguments around tool validation and testing stem from the need to manually verify the output of any completed process. Whilst in theory the approach seems logical, it is limited in practice. Practitioners are competing against client demands, time and resource constraints, case backlogs, jurisdictional requirements and also company profitability. Where an organisation purchases a tool, it is not feasible to verify every output which it gives, despite evidential reliability requirements. For example, where a tool recovers 10,000 Internet history records, it is not feasible to manually locate the underlying metadata at its physical position on the suspect device and confirm that a correct interpretation has taken place for every record. At best, the process may incorporate a sampled set of verified results. Whilst certainly a method for indicating reliability, this does not provide a 100% measure of accuracy. However, sampling strategies are arguably the best compromise available to practitioners.
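A minimal sketch of such a sampled verification strategy is given below, assuming a hypothetical list of recovered internet-history records. Sampling gives an indication of reliability, not a 100% measure of accuracy.

```python
# Illustrative sampling sketch; record structure and sample size are assumptions.
import random

def draw_verification_sample(records: list, sample_size: int = 50,
                             seed: int = 1) -> list:
    """Select records for manual, at-offset verification by a practitioner."""
    rng = random.Random(seed)  # fixed seed so the sample itself is repeatable
    return rng.sample(records, min(sample_size, len(records)))

# e.g. 50 of 10,000 recovered records are manually checked against the raw
# data at their reported offsets; any failure triggers wider verification.
```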

Often, 'dual tool' validation is touted as a solution for ensuring the reliability of results given during an investigation (Association of Chief Police Officers, 2010). Although a valid consideration, it should not be considered the silver bullet to all DF tool-testing problems. Essentially, those championing a dual tool approach are suggesting that, as the primary tool is not and cannot be thoroughly tested, verification via a second, equally untested and unverified tool should increase the reliability of results. When examined, the concept of dual tool verification is flawed, built upon the premise that if two or more tools return the same output, neither can be wrong. Yet it remains a viable possibility that both erroneously interpret data in the same way, for example where shared flawed code libraries have been used (Horsman, 2018). Similarly, both tools may provide a correct answer but with differing degrees of 'completeness', which on face value may look different and suggest one is functioning incorrectly. Without knowing the underlying design and configuration of any tool, it is not possible to tell whether both tools are built upon a shared code library or set of functionality; if this shared code contains errors, then both are unreliable. Dual-tooling is an easy option to call for, but it also carries cost and resource issues. DF software can be expensive, and the need to have two distinct packages which achieve the same goal should be questioned, both in terms of business efficiency and basic logic.
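The logic of a dual-tool comparison can be sketched as follows, assuming each (hypothetical) tool's output has been normalised to a comparable set of records. Agreement raises confidence but proves nothing on its own.

```python
# Illustrative sketch of a dual-tool output comparison.
def compare_outputs(tool_a: set, tool_b: set) -> dict:
    """Partition two tools' normalised outputs into agreement and discrepancy."""
    return {
        "agreed": tool_a & tool_b,        # reported by both tools
        "only_tool_a": tool_a - tool_b,   # requires manual examination
        "only_tool_b": tool_b - tool_a,   # requires manual examination
    }

# Agreement does not prove correctness: both tools could share the same flawed
# parsing logic, and discrepancies may simply reflect differing 'completeness'
# rather than an error in either tool.
```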

Nevertheless, as organisations ultimately seek to gain and retain the confidence of clients, claims of dual-verified evidence will continue. This article is not claiming that such methods are bad practice (in fact, the opposite); it simply emphasises that they should not be used to claim complete reliability in the evidence which is provided. DF operates in a business where reliability is crucial, and any method which allows practitioners to increase it should be taken. DF will continue implementing the dual tool approach, but arguably should not settle for it as the sole method of validating a set of results.

4.6 Endless Tools, Endless Functionality

The increased use and development of technologies has driven the demand for, and creation of, new tools and the expansion of existing functionality (Guo, Slay and Beckett, 2009). Investigations now often require multiple tools to effectively examine a single case, which greatly increases testing requirements (Beckett and Slay, 2007). Due to the nature of the DF field, tool-testing is a continuous cycle driven by technological developments (Scientific Working Group on Digital Evidence, 2017). Testing must be continuous; it cannot be a one-time event. Updated versions of applications are constantly released, often without accompanying in-depth documentation regarding the revision or updating of any underlying structural metadata. Therefore, as new releases of applications are offered, tools must be revalidated against new data, as even simple changes to underlying metadata structures can affect forensic tool performance.
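Because revalidation is needed on every release, a lightweight regression-style harness is one way to structure it. The sketch below assumes a set of reference images with known expected outputs and a hypothetical run_tool() wrapper around the release under test; any mismatch flags the new version for deeper validation before operational use.

```python
# Illustrative revalidation sketch; image names, expected values and
# run_tool() are hypothetical placeholders.
EXPECTED = {
    "reference_fat32.dd": {"files_recovered": 128},
    "reference_ntfs.dd":  {"files_recovered": 342},
}

def run_tool(image: str, tool_version: str) -> dict:
    raise NotImplementedError  # invoke the tool release under test here

def revalidate(tool_version: str) -> dict:
    """Re-run the reference cases against a new release and report mismatches."""
    failures = {}
    for image, expected in EXPECTED.items():
        actual = run_tool(image, tool_version)
        if actual != expected:
            failures[image] = {"expected": expected, "actual": actual}
    return failures  # an empty dict means behaviour is unchanged on reference data
```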

4.7 Expectations and Capabilities of Testing

DF tool-testing requires an unprecedented depth of testing, as not only emerging tools but also existing tools require (backdated) testing. Whilst some tools maintain a singular functionality, larger tool suites can possess hundreds of features, and every feature needs to be tested in order to validate the accuracy of its functionality. The importance of validating every function lies with the 'stacking' of processes in order to achieve a goal. For example, to achieve goal 'x', process 'y' may need to be run followed by process 'z'. A real-world example may include the mounting of compressed volumes followed by a keyword search in order to identify a set of evidential files (see the sketch below).
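A minimal sketch of validating such 'stacked' processes is given below; the function names are hypothetical. Each step should be validated in isolation, but the chained combination must also be checked end-to-end against known ground truth.

```python
# Illustrative sketch only: end-to-end validation of a chained process.
def mount_compressed_volume(image_path: str) -> str:
    raise NotImplementedError  # returns a path to the mounted, decompressed contents

def keyword_search(mount_point: str, terms: list) -> list:
    raise NotImplementedError  # returns paths of files containing the terms

def validate_stacked(image_path: str, terms: list, expected_hits: set) -> bool:
    """Check that the chained result matches the known ground truth for a test image."""
    mount_point = mount_compressed_volume(image_path)
    return set(keyword_search(mount_point, terms)) == expected_hits
```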

Arguably, it is impossible to fully validate a tool's function without access to its source code (Gerber and Leeson, 2004). Yet in most instances, access to source code for tool-testing is not possible, and even where code can be accessed for analysis, it is likely that a practitioner would have neither the time nor the resources to effectively scrutinise its structure for errors, as previously noted. Effective testing requires in-depth knowledge of a process, followed by the robust design of procedures which exhaust all possibilities in order to accurately assess the reliability of the process. In addition, should code errors or vulnerabilities be detected, issues are likely to exist regarding the distribution of warnings about using the tool and methods for correcting the code. Despite open source code allowing the further development of a tool, it may not be the case that the person who detects an error also possesses both the knowledge and the time to correct the issue for an updated release. Further, there are few places (other than practitioner forums, for example) for the wide-scale dissemination of warnings regarding the use of a particular tool. There are two issues to consider regarding tool-testing and time: first, it takes time to validate a tool; second, in most instances, we cannot wait for tools to be validated.

As a result, the majority of DF tool-testing takes place via black box (BB) testing strategies, where internal coding structures are not examined. During BB testing, the practitioner relies on a knowledge of what a tool is 'supposed to do', and is simply concerned with obtaining what can be considered a correct output when the tool is fed a specifically designed test case. Whilst this testing method provides a relatively quick way of determining the basic functionality of a tool or process, Khan and Khan (2012) identify that it is not well suited to algorithm testing, a major requirement for DF tool-testing. Without knowledge of the code and algorithm design, it is difficult to design test cases which can exhaustively test its function (Glisson et al., 2013). As a result, test cases designed in BB testing may simply provide false confidence in a tool's function, because they have failed to extensively test all variations of its implementation (Flandrin et al., 2014).

These limitations have led to a reliance on vendor software testing, with Beckett and Slay in 2007 highlighting "in discussions with practitioners, there has been a heavy reliance on vendors to validate their tools. The vendor validation has been widely undocumented, and not proven publicly, except through rhetoric and hearsay on the bulletin boards of individual tool developers such as Guidance Software (www.encase.com), and Access Data (www.accessdata.com) the main players in this domain". Arguably this situation remains relatively similar today.

5 Tool-Testing and ISO/IEC 17025 Accreditation

Whilst arguably the need for validated tools in DF has always been important, with the deadline set by the Forensic Science Regulator (October 2017) for DF organisations to have implemented ISO/IEC 17025 having passed (Forensic Science Regulator, 2016), it is now effectively a mandatory requirement. ISO/IEC 17025 'specifies the general requirements for the competence to carry out tests and/or calibrations, including sampling. It covers testing and calibration performed using standard methods, non-standard methods, and laboratory-developed methods' (International Organisation for Standardization, n.d.). To comply with ISO/IEC 17025, effective testing and validation must be carried out on the tools and methods utilised during a DF investigation, yet as noted above, there are difficulties in achieving this and a current dissatisfaction with testing standards in the field. This has led to conflicting stances surrounding the imposition of the ISO/IEC 17025 requirement on DF organisations, resulting in a number of dissenting opinions expressed in Sommer et al.'s (2017) recent practitioner survey regarding this ISO standard. In addition, arguments of cost and the inability of the field to keep pace with technological developments are also touted as issues potentially affecting ISO/IEC 17025's implementation in this field (Flandrin et al., 2014).

The problem lies with the conflicting demands. ISO/IEC 17025 is a standard designed to ensure organisational competence and maintain public confidence that standards of DF evidence are maintained (Forensic Science Regulator, 2016; Watson and Jones, 2013; United Nations Industrial Development Organisation, 2009). This can only be beneficial to the field, supporting its continued growth and the development of forensically sound methods and processes. Yet there appear to be issues surrounding the feasibility of adherence to ISO/IEC 17025's requirements. Whilst some mainstream vendors provide software products considered fit for purpose (Watson and Jones, 2013), which are ultimately accepted by criminal justice systems in many jurisdictions, a concern exists that such tools may still produce erroneous results. Where non-standard tools are being utilised, the indication that the resources, standards and methodologies available for validating such tools are inadequate may result in their non-compliance with this ISO standard.

ISO/IEC 17025 requires an organisation to demonstrate the reliability of the methods it uses, a method being defined by the Forensic Science Regulator (2016b, p.8) as 'a logical sequence of procedures or operations intended to accomplish a defined task'. In terms of validation, methods and tools are closely related, as methods may include the use of tools to accomplish a given task (Forensic Science Regulator, 2016b). Method validation processes must objectively establish that a given method is fit for purpose (linking to ISO/IEC 27041, a standard for demonstrating the fitness for purpose of a method (International Organisation for Standardization, n.d.b.)) and can achieve the goal it was intended to achieve (discussed in detail by Marshall and Paige (2018)). The problem here is that in DF, the tools and methods utilised and developed are driven by technology which moves at a significant pace. Arguably, the testing and verification of existing tools has not reached the standard required to factually ascertain the accuracy of their function, and new tools are frequently released, exacerbating the issue. A lack of what can be considered 'standard methods' in DF means that almost all methods utilised must be subject to rigorous validation processes. Where methods are new or have minimal available data to support effective validation, the method will be considered novel, which requires testing to a greater degree (Forensic Science Regulator, 2016b). This is an issue as DF tools often focus on the reverse engineering of data for which limited documentation denoting its structure exists, meaning that 'it may be considered to be difficult for producers or users of forensic tools to show that the tools are actually correct except by potentially lengthy and costly empirical methods' (Marshall and Paige, 2018).

There is currently no universal way to effectively meet the levels of validation of tools and methods in DF required by ISO/IEC 17025, and this is an issue. Whilst individual organisations can attempt to demonstrate the use of measures in a laboratory environment to validate results, in reality these are likely not exhaustive, and this risks both the validity of the evidence produced and the field itself, as reliance upon it to produce reliable evidence is increasing.

At the time of writing there is not one DF tool available to practitioners which claims (with documented proof) to produce 100% accurate results. This does not necessarily suggest that tools with such levels of validity do not exist, but arguably it is unlikely; a potentially damning assessment for a forensic science discipline.

6 Concluding Thoughts

It can be argued that the DF field has yet to reach a satisfactory point of tool-testing and, worryingly, there are no obvious and easy solutions available to rectify this situation. This appears to be reflected in the results of the practitioner survey offered in this article. The difficulties of exhaustive testing within the DF discipline have been previously acknowledged, but in order to improve this area, potential solutions should be considered.

In regards to tool-testing, the advantages and disadvantages of both federated and centralised approaches should be considered. A centralised approach places a significant burden on an identified entity to design, implement and carry out tool-testing, followed by the challenge of establishing methods for the effective dissemination of findings. Suggestions for the use of such an overarching governing body responsible for the regulation and implementation of tool-testing are easy to make, despite the associated resourcing costs and the difficulty of establishing the vast expertise required under one umbrella organisation. The feasibility of developing and maintaining an organisation of this type may also be unrealistic, given the associated costs both at setup and for long-term continued operation. Further, centralised approaches may not allow testing processes and results to be exposed to the levels of scrutiny required to achieve the necessary reliability, given the limited number of individuals involved in the initial testing processes. Despite such concerns, centralised approaches do have advantages and may achieve greater consistency in the development and implementation of testing, due to the greater oversight potentially imparted on those involved in such processes. Assuming effective implementation, the benefits of a centralised governing body would lie in the ability to carry out and document transparent and independent testing which is available for the discipline to evaluate as a whole, assuming that such dissemination of knowledge could be effectively achieved. Any overseeing body would need to provide access to a central database of testing and subsequent results, which can be accessed and updated, showing tool versions which have been validated against application versions. This is one of the issues faced at present, as there is a large volume of sub-version releases of major software packages, each in need of evaluation, not just the latest version.

In comparison, federated approaches offer a lesser level of centralised oversight, granting autonomy to the individuals involved in any procedure. This is seen with the CFTT Federated Testing Project introduced by NIST (2018), where test materials are centrally developed but disseminated to individual labs which carry out their own testing and subsequently share results. This approach potentially sees more practitioners involved in tool-testing, and in doing so, governance of the application of test procedures is left to each individual entity. Whilst such approaches provide the potential for greater engagement in, and scrutiny of, tool-testing and results, there remains a chance that the application of test approaches and the subsequent reliability of results may vary in quality. Further, the dissemination of results requires commitment from each entity, where engagement levels differ.

Both centralised and federated approaches to tool-testing offer challenges and advantages in terms of implementation and maintenance, and as yet, the field of DF has arguably not settled on a single, widely adopted approach. NIST arguably provides the closest thing DF has to an organisation aiming to directly tackle tool-testing but, as noted previously, despite the in-roads made, its work is not extensive enough to provide field-wide, complete tool validity. If the field is to expect a greater contribution to testing from its practitioners, it needs to develop more resources and standards to facilitate and manage increased levels of engagement. The development of a blueprint defining the structure for dataset development and its subsequent use in the design and implementation of tool-testing could make effective testing more accessible to practitioners. Such a blueprint could be used to create a manageable standard which can lead to the standardisation of tool-testing in DF and increased confidence in the results generated.

Tool-testing is arguably one of the hardest challenges faced by the DF field, but this does not mean that it should simply be ignored. Whilst it is unlikely that satisfactory testing will be achieved any time soon, every valid attempt to improve the current situation should be met with support. The field is under an ethical and legal obligation to continue to strive to improve standards, therefore every step to improve the current situation must be taken, regardless of how small. The more testing that takes place, the greater the chance of errors being detected and standards improving, where "testing may not highlight every error, but will lead to a measure of reliability" (Beckett and Slay, 2007). Failure to tool-test means practitioners are unable to hold tool creators to account, which may indirectly lead to the development of poor practice and further error-prone tools.

Finally, the extent of testing must also be considered. It is possible to derive hypothetical disaster scenarios prompting the continuous scrutiny of a piece of investigation software's functionality. Yet if it is considered practically infeasible to validate every single function, then the field must examine existing resources and evaluate their deployment for testing against specific areas following an assessment of risk. Whilst arguably unsatisfactory and a potentially defeatist stance to take, it is a realistic one. If a complete validation of tools cannot be accomplished, then areas where the risk of error is both increased and severe should be identified and dealt with immediately, whilst the field continues its pursuit of a suitable, long-term and thorough solution to tool-testing.

References

Association of Chief Police Officers (2010) 'ACPO Managers Guide: Good Practice and Advice Guide for Managers of e-Crime Investigation' Available at: http://www.digital-detective.net/digital-forensics-documents/ACPO_Good_Practice_and_Advice_for_Manager_of_e-Crime-Investigation.pdf (Accessed 25th June 2017)


Beckett, J. and Slay, J., 2007, January. Digital forensics: Validation and verification in a dynamic work environment. In System Sciences, 2007. HICSS 2007. 40th Annual Hawaii International Conference on (pp. 266a-266a). IEEE.

Brady, O.D., 2018. Exploiting digital evidence artefacts: finding and joining digital dots (Doctoral dissertation, King's College London).

Brunty, Josh (2011) ‘Validation of Forensic Tools and Software: A Quick Guide for the Digital Forensic Examiner’ Available at: https://www.forensicmag.com/article/2011/03/validation-forensic-tools-and-software-quick-guide-digital-forensic-examiner (Accessed 22nd June 2017)

Carrier, B., 2003. Defining digital forensic examination and analysis tools using abstraction layers. International Journal of digital evidence, 1(4), pp.1-12.

Carrier, B. and Spafford, E.H., 2003. Getting physical with the digital investigation process. International Journal of digital evidence, 2(2), pp.1-20.

Casey, E., 2002. Error, uncertainty, and loss in digital evidence. International Journal of Digital Evidence, 1(2), pp.1-45.

Casey, E., 2011. Digital evidence and computer crime: Forensic science, computers, and the internet. Academic press.

Casey, E. (2017) The broadening horizons of digital investigation, Digital Investigation, Volume 21, Pages 1-2

Craiger, P., Swauger, J., Marberry, C., Hendricks, C. and Kanellis, P., 2006. Validation of digital forensics tools. Digital Crime and Forensic Science in Cyberspace, p.91.

Cusack, B. and Liang, J., 2011. Comparing the performance of three digital forensic tools. Journal of Applied Computing and Information Technology, 15(1), p.2011.

Cusack, B. and Homewood, A., 2013. Identifying bugs in digital forensic tools. Australian Digital Forensics Conference

Daniel, L.E., 2012. Digital forensics for legal professionals: understanding digital evidence from the warrant to the courtroom. Elsevier.

Department of Homeland Security (2017) ‘National Institute of Standards and Technology (NIST) Computer Forensic tool-testing (CFTT) Reports’ Available at: https://www.dhs.gov/science-and-technology/nist-cftt-reports (Accessed 23rd June 2017)

DigitalCorpora.org (2017) ‘Home’ Available at: http://digitalcorpora.org/ (Accessed 22nd June 2017)

Digital Detective (2011) ‘Digital Evidence Discrepancies – Casey Anthony Trial’ Available at: http://www.digital-detective.net/digital-evidence-discrepancies-casey-anthony-trial/ (Accessed 26th June 2017)


Flandrin, F., Buchanan, W., Macfarlane, R., Ramsay, B. and Smales, A., 2014, September. Evaluating digital forensic tools (DFTs). In 7th International Conference: Cybercrime Forensics Education & Training.

ForensicFocus (2013) ‘Interpretation of NTFS Timestamps’ Available at: https://articles.forensicfocus.com/2013/04/06/interpretation-of-ntfs-timestamps/ (Accessed 26th June 2017)

ForensicFocus (2017) ‘Test Images and Forensic Challenges’ Available at: http://www.forensicfocus.com/images-and-challenges (Accessed 22nd June 2017)

Forensic Science Regulator, (2016) Codes of Practice and Conduct for forensic science providers and practitioners in the Criminal Justice System (Issue 3: February 2016)

Forensic Science Regulator, (2016b) Method Validation in Digital Forensics (Issue 1)

Garfinkel, S., Farrell, P., Roussev, V. and Dinolt, G., 2009. Bringing science to digital forensics with standardized forensic corpora. digital investigation, 6, pp.S2-S11.

Garfinkel, S.L., 2010. Digital forensics research: The next 10 years. digital investigation, 7, pp.S64-S73.

Garfinkel, S., 2012. Lessons learned writing digital forensics tools and managing a 30TB digital evidence corpus. Digital Investigation, 9, pp.S80-S89.

Garrie, D.B., 2014. Digital Forensic Evidence in the Courtroom: Understanding Content and Quality. Nw. J. Tech. & Intell. Prop., 12, p.i.

Gerber, M. and Leeson, J., 2004. Formalization of computer input and output: the Hadley model. Digital Investigation, 1(3), pp.214-224.

Glisson, W.B., Storer, T. and Buchanan-Wollaston, J., 2013. An empirical comparison of data recovered from mobile forensic toolkits. Digital Investigation, 10(1), pp.44-55.

Grajeda, C., Breitinger, F. and Baggili, I., 2017. Availability of datasets for digital forensics–and what is missing. Digital Investigation, 22, pp.S94-S105.

Guidance Software, (2007) ‘Guidance Software Releases EnCase(R) Version 6’ Available at: http://investors.guidancesoftware.com/releasedetail.cfm?releaseid=225079 (Accessed 22nd June 2017)

Guidance Software, (2011) ‘Guidance Software Transforms Digital Forensics with EnCase Forensic Version 7’ Available at: http://investors.guidancesoftware.com/releasedetail.cfm?ReleaseID=586685 (Accessed 22nd June 2017)

Guo, Y., Slay, J. and Beckett, J., 2009. Validation and verification of computer forensic software tools—Searching Function. digital investigation, 6, pp.S12-S22.

Guttman, B., Lyle, J.R. and Ayers, R., 2011. Ten years of computer forensic tool-testing. Digital Evidence & Elec. Signature L. Rev., 8, p.139.


Horsman, G., 2018. “I couldn't find it your honour, it mustn't be there!”–Tool errors, tool limitations and user error in digital forensics. Science & Justice.

International Organisation for Standardization, n.d.a ‘ISO/IEC 17025:2005’ Available at: https://www.iso.org/standard/39883.html (Accessed 6 July 2017)

International Organisation for Standardization, n.d.b ‘ISO/IEC 27041:2015’ Available at: https://www.iso.org/standard/44405.html (Accessed 9 January 2019)

Khan, M.E. and Khan, F., 2012. A comparative study of white box, black box and grey box testing techniques. International Journal of Advanced Computer Sciences and Applications, 3(6), pp.12-1.

Lyle, J.R., 2010. If error rate is such a simple concept, why don’t I have one for my forensic tool yet?. digital investigation, 7, pp.S135-S139.

Magnet (2017) ‘Magnet AXIOM’ Available at: https://www.magnetforensics.com/ (Accessed 25th June 2017)

Marshall, A.M. and Paige, R., 2018. Requirements in digital forensics method definition: Observations from a UK study. Digital Investigation, 27, pp.23-29.

National Institute of Standards and Technology (2015) ‘Forensic File Carving’ Available at: https://www.cftt.nist.gov/filecarving.htm (Accessed 22nd June 2017)

National Institute of Standards and Technology (2015b) ‘Welcome to the Computer Forensics tool testing (CFTT) Project Web Site’ Available at: https://www.cftt.nist.gov/ (Accessed 22nd June 2017)

National Institute of Standards and Technology (2016) ‘The CFReDS Project’ Available at: https://www.cfreds.nist.gov/ (Accessed 22nd June 2017)

National Institute of Standards and Technology (2018) ‘CFTT Federated Testing Project’ Available at: https://www.nist.gov/itl/ssd/software-quality-group/computer-forensics-tool-testing-program-cftt/cftt-federated-testing (Accessed 22nd October 2018)

Parliamentary Office of Science and Technology (2016) ‘Digital Forensics and Crime’ POSTNOTE Number 520 March 2016

Scientific Working Group on Digital Evidence (2017) ‘SWGDE Establishing Confidence in Digital Forensic Results by Error Mitigation Analysis’ Version: 1.6 Available at: https://www.swgde.org/documents/Current%20Documents/SWGDE%20Establishing%20Confidence%20in%20Digital%20Forensic%20Results%20by%20Error%20Mitigation%20Analysis (Accessed: 26th June 2017)

Sommer, Peter, Pat Beardmore, Geoff Fellows (2017) ‘UK ISO 17025 Digital Forensics Survey April 2017: Results’ Available at: http://digital-evidence.expert/UK%20ISO%2017025%20Digital%20Forensics%20Survey%20April%202017.pdf (Accessed 6th July 2017)


SWGDE (2014) ‘SWGDE Recommended Guidelines for Validation Testing’ Available at: https://www.swgde.org/documents/Current%20Documents/SWGDE%20Recommended%20Guidelines%20for%20Validation%20Testing (Accessed 22nd October 2018)

SWGDE (2018) ‘SWGDE Minimum Requirements for Testing Tools used in Digital and Multimedia Forensics’ Available at: https://www.swgde.org/documents/Released%20For%20Public%20Comment/SWGDE%20Minimum%20Requirements%20for%20Testing%20Tools%20used%20in%20Digital%20and%20Multimedia%20Forensics (Accessed 22nd October 2018)

The International Society of Forensic Computer Examiners (2017) ‘Sample Practical Exercise Problem’ Available at: http://www.isfce.com/sample-pe.htm (Accessed 22nd June 2017)

Thompson, E., 2005. MD5 collisions and the impact on computer forensics. Digital investigation, 2(1), pp.36-40.

United Nations Industrial Development Organisation (2009) ‘Complying with ISO 17025 A Practical Guidebook’ Available at: https://www.unido.org/fileadmin/user_media/Publications/Pub_free/Complying_with_ISO_17025_A_practical_guidebook.pdf (Accessed: 10th July 2017)

Warner, G.C., 2012, November. Practical fingerprint analysis process and challenges both internal and external for the latent print community. In Homeland Security (HST), 2012 IEEE Conference on Technologies for (pp. 384-389). IEEE.

Wright, C., Kleiman, D. and Sundhar RS, S., 2008. Overwriting hard drive data: The great wiping controversy. Information systems security, pp.243-257.

Yates, M. and Chi, H., 2011, March. A framework for designing benchmarks of investigating digital forensics tools for mobile devices. In Proceedings of the 49th Annual Southeast Regional Conference (pp. 179-184). ACM.