
Research Proposal

ANOMALY DETECTION CAPABILITY OF EXISTING MODELS OF CONTINUITY EQUATIONS IN CONTINUOUS ASSURANCE

E.J.F. VAN KEMPEN ANR: 201386

Pre-master Accounting

Supervisor : Prof. Dr. W.F.J. Buijink

2014

Abstract

Continuous assurance is a methodology to provide assurance on financial data on a near

real-time basis. One of the fundamental elements of continuous assurance is continuous data

auditing in which the integrity of the data provided by the client is tested. Continuity

equations can be used to evidence assertions regarding data integrity. In order to do so, data

is tested by predicting subsequent values based on a fitting model. In total there are three

models: the simultaneous equations model, the vector autoregressive model and the restricted

vector autoregressive model. I propose to test these models and compare them on the aspect

of anomaly detection capability.


I. Introduction

Continuous assurance has been a subject of interest for auditors and financial professionals

for the last three decades. However, this field of research took off only after Vasarhelyi et al.

(2004) published a widely accepted conceptual framework for continuous assurance. In the

following years additional studies were performed in this field, but most of these studies were

focused on refining the theoretical framework and developing new and innovative analysis

methods. Existing analysis models have not yet been compared with one another. This proposal focuses

on comparing the anomaly detection capability of existing models of continuity

equations.

Conventional audit procedures focus on time-consuming manual testing of a fixed number

of randomly selected supporting documents, like invoices or inventory counts. By

introducing more advanced audit procedures from the continuous assurance domain, like

continuity equations, substantive testing can in theory be performed more efficiently and

effectively. The level of assurance can improve, while time consumption is reduced at the

same time.

However, all these audit procedures from the continuous assurance domain are fairly new

and remain mostly untested in the real world. This research intends to investigate one of these

procedures, continuity equations, on a more detailed level. By using continuity equations

business processes could be tested by detecting anomalies in one or more of the steps within

these processes. The audit procedures or manual testing can then be narrowed down to the

detected anomalies.

Efficient performance of anomaly detection could lead to a paradigm shift in the field of

auditing. Instead of sampling evidence randomly from the population, the level of assurance

can be improved by inspecting exceptions only: audit by exception.


II. Literature review and research question

Continuous assurance

The Canadian Institute of Chartered Accountants (1999) provides a definition of continuous

assurance: “Continuous auditing [or continuous assurance] is a methodology that enables

independent auditors to provide written assurance on a subject matter using a series of

auditor’s reports issued simultaneously with, or a short period of time after, the occurrence of

events underlying the subject matter.” The emphasis of continuous assurance is on reducing

the lag between preparing a report and subsequently providing assurance on the matters

reported.

In order to be able to provide assurance on a near real-time basis, the auditors have to rely

heavily on automated testing. Vasarhelyi et al. (2004; 2010) have defined three elements of

continuous assurance and continuous monitoring: Continuous Control Monitoring (CCM),

Continuous Data Auditing (CDA) and Continuous Risk Monitoring and Assessment (CRMA).

CCM can be compared to interim testing of procedures in the conventional audit framework

and CDA can be compared to final testing focusing more on data than procedures. These two

elements combined can be used to provide sufficient assurance. CRMA can be used as an

additional part of the control framework, but is not essential for providing assurance. CDA

verifies the integrity of the data flowing through the information system. The data provided

by the client is the basis for all testing procedures, so data assurance forms an essential part of

continuous assurance. Continuity equations can be used as a tool from the CDA sub-domain

to evidence management assertions focusing on data integrity.

Continuity equations

Continuity equations have been a fundamental part of classical physics since the eighteenth

century. These equations describe the transport of a quantity, while simultaneously ensuring

conservation of this quantity (like mass and/or energy). Accordingly, similar relations can be

defined for the transport of quantities within a system in the financial domain. The movement

of reported quantities, e.g. ordered kilograms or invoiced units, between steps in the key

business processes can be described with continuity equations.

The term continuity equations was coined in 1991, when Vasarhelyi and Halper (1991)

modeled the flow of billing data at AT&T. Although Vasarhelyi and Halper proposed


continuity equations more than 20 years ago, little research has been performed on their

practical application or on the implementation of a workable continuity equations model.

In most businesses the flow of goods is the most important basis for revenue recognition.

As such, the flow of goods can be used to provide evidence for the completeness, timeliness

and accuracy of the reported revenue. If the continuity equations hold for a specific business

process, one can assert that there are no ‘leakages’ from the transaction flow, i.e. the integrity

of the flow of goods can be asserted. Therefore, continuity equations provide a method to

evidence the integrity of the basis for revenue recognition, which makes them a valuable tool

in continuous assurance.

Continuity equations are based on historical data of quantities in the separate steps of

business processes. For example, the sales cycle can be modeled as three separate steps:

receiving the order from the customer, shipping goods to the customer and invoicing for the

ordered and shipped goods. The quantity of goods ordered today will show up in

the invoicing step a certain number of days later. The daily flow of goods between these steps

can be described by a quantity and a lag between the steps. This research will

focus on the sales cycle consisting of the three previously defined process steps.

Previous research by Leitch and Chen (2003), Kogan et al. (2010) and Alles et al. (2005)

has resulted in three models of continuity equations: the simultaneous equations model

(SEM), vector autoregressive model (VAR) and the restricted vector autoregressive model

(RVAR).

Simultaneous Equations Model

Leitch and Chen (2003) proposed a first model of continuity equations in the field of

assurance: the Simultaneous Equations Model (SEM). When applied to the sales cycle this

model can be represented as Equation (1). Each step in the sales cycle is simultaneously

dependent on historic quantities from the previous step. These historic quantities are

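represented with a lag $\delta$ in each step. This model simplifies the sales cycle by assuming that

there is only a single fixed lag between each step.

$$
\begin{aligned}
GS_t &= \alpha_1 \, SO_{t-\delta_1} + \varepsilon_{1,t} \\
IS_t &= \alpha_2 \, GS_{t-\delta_2} + \varepsilon_{2,t}
\end{aligned} \tag{1}
$$

Here $SO_t$, $GS_t$ and $IS_t$ denote the quantities ordered, shipped and invoiced on day $t$.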

The coefficients of this model are estimated by OLS linear regression, optimizing for the

overall fit of the model.
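As a minimal illustration of such a single-lag estimation, the sketch below fits one step of the sales cycle on the previous step by ordinary least squares. It is written in Python for illustration only and is not the R implementation referred to in this proposal. The function name `fit_lagged_ols`, the inclusion of an intercept and the example quantities are all assumptions made here.

```python
def fit_lagged_ols(predictor, response, lag):
    """Fit response[t] = a + b * predictor[t - lag] by ordinary least squares."""
    x = predictor[:len(predictor) - lag]   # predictor values, `lag` days earlier
    y = response[lag:]                     # responses aligned with those values
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return {"intercept": intercept, "slope": slope, "r2": 1.0 - ss_res / ss_tot}


# Made-up daily quantities, for illustration only (not the proposal's data).
so = [100.0, 120.0, 90.0, 150.0, 130.0, 110.0, 160.0, 140.0]
gs = [95.0] + [0.9 * q + 5.0 for q in so[:-1]]   # shipped one day after ordering
is_ = list(gs)                                   # invoiced on the shipping day

shipping_model = fit_lagged_ols(so, gs, lag=1)       # GS explained by lagged SO
invoicing_model = fit_lagged_ols(gs, is_, lag=0)     # IS explained by same-day GS
```

Each step is fitted separately here, which keeps the sketch short; a full simultaneous equations estimation would treat the equations as one system.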


Leitch and Chen tested the application of SEM on monthly data of financial statements.

They found that SEM outperformed other more conventional models of analytical

procedures.

Basic Vector Autoregressive model

Alles et al. (2005) introduced another model: the basic Vector Autoregressive (VAR)

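model. This model for the sales cycle can be represented as Equation (2). In this model

$SO_t$, $GS_t$ and $IS_t$ are respectively the quantities ordered, shipped and invoiced

at time $t$, the $\Phi$ terms are transition vectors for a multivariate linear model, the

$M(\cdot)$ terms are vectors containing daily aggregates of quantities for the given dimension

and $p$ is the number of time periods covered in the model.

$$
\begin{aligned}
SO_t &= \Phi_{SO,SO}\, M(SO) + \Phi_{GS,SO}\, M(GS) + \Phi_{IS,SO}\, M(IS) + \varepsilon_{SO,t} \\
GS_t &= \Phi_{SO,GS}\, M(SO) + \Phi_{GS,GS}\, M(GS) + \Phi_{IS,GS}\, M(IS) + \varepsilon_{GS,t} \\
IS_t &= \Phi_{SO,IS}\, M(SO) + \Phi_{GS,IS}\, M(GS) + \Phi_{IS,IS}\, M(IS) + \varepsilon_{IS,t}
\end{aligned} \tag{2}
$$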

Each of these sub-equations models a predictor for the reported quantities in a specific step

in the business process. As previously defined, the quantities are related to quantities in the

other process steps by a time delay (lag). For example, if orders are shipped in exactly one

day, without exception, and invoicing is performed simultaneously with shipping, the

resulting predictors can be defined as Equation (3).

$$
GS_t = SO_{t-1}, \qquad IS_t = GS_t \tag{3}
$$

The VAR model is estimated by OLS linear regression, optimizing for the overall fit by

trying different lags for the process steps. Only the maximum expected lag is provided to the

algorithm, which then tries to find the best fitting model by iterating through all lag

possibilities up to the maximum expected lag. The exact lags do not have to be known prior

to modeling as the best fitting lags are determined while modeling.
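The idea of supplying only a maximum lag and letting the procedure find the best fitting one can be sketched as follows. This is a deliberately simplified Python illustration with a single predictor and a single response, scored by R-squared; it is not the multivariate VAR estimation itself. The helper names `r_squared` and `best_lag` and the example series are assumptions made here.

```python
def r_squared(x, y):
    """R-squared of a simple linear regression of y on x (squared correlation)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy * sxy / (sxx * syy)


def best_lag(predictor, response, max_lag):
    """Try every lag from 0 to max_lag and return (best lag, R-squared per lag)."""
    y = response[max_lag:]  # same evaluation window for every candidate lag
    scores = {}
    for lag in range(max_lag + 1):
        x = predictor[max_lag - lag:len(predictor) - lag]
        scores[lag] = r_squared(x, y)
    return max(scores, key=scores.get), scores


# Made-up series in which goods are shipped exactly two days after ordering.
so = [100.0, 120.0, 90.0, 150.0, 130.0, 110.0, 160.0, 140.0, 95.0, 170.0, 125.0, 105.0]
gs = [0.0, 0.0] + [0.8 * q for q in so[:-2]]

lag, scores = best_lag(so, gs, max_lag=3)
```

All candidate lags are scored on the same observations, so that the R-squared values remain comparable across lags.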

Determining lags prior to the modeling process is not always trivial,

e.g. lags in the purchasing cycle are highly dependent on the policies and

processes at third parties. Therefore, the VAR model can be a powerful tool for modeling

continuity equations when exact lags cannot easily be predefined.


In contrast to the SEM model, the VAR model does not assume that there is a single fixed

lag between steps. All lags up to a maximum are considered in the model. This can

result in an extensive estimated model with many terms. Therefore, most VAR models are represented

using matrix notation.

Restricted Vector Autoregressive model

Kogan et al. (2010) showed that the VAR model achieves high prediction

accuracy. More importantly, they showed that the Restricted VAR (RVAR) model is

more accurate still. With a MAPE (mean absolute percentage error) of 0.3374 on the test set, it

outperformed several other models, namely the SEM and VAR-type models. Only the Bayesian

VAR model achieved a lower MAPE, but it also showed

a larger standard deviation of the absolute percentage error. Therefore, the Bayesian VAR

model is not considered viable for auditing purposes. The RVAR model was found to be one

of the best models for continuity equations.
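For reference, the MAPE is conventionally defined as the mean of the absolute prediction errors relative to the actual values. The notation below is assumed here and is not taken from Kogan et al. (2010): $y_t$ is the actual quantity, $\hat{y}_t$ the predicted quantity and $n$ the number of observations in the test set.

```latex
\mathrm{MAPE} = \frac{1}{n} \sum_{t=1}^{n} \frac{\lvert y_t - \hat{y}_t \rvert}{\lvert y_t \rvert}
```

If the reported value of 0.3374 is expressed as a fraction rather than a percentage, it would correspond to predictions that deviate from the actual values by roughly a third on average.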

The RVAR model roughly amounts to improving the fit of each predictor by removing

insignificant coefficients from the VAR model. For example, if the mean lag between ordering

and shipping is less than a month, a shipment a year after ordering is obviously

not significant and is thus excluded from the model. This method iterates the modeling process

per equation by removing all coefficients with |t|-statistics below a predefined threshold, as

explained in Figure 1. Kogan et al. (2010) find that a threshold of and its

corresponding yields the model with the best prediction accuracy.

[Figure 1: flowchart. Inputs: data and threshold. Start -> initial model estimation -> exclude parameters with t-statistic below threshold -> re-estimate model -> all t-statistics above threshold? No: repeat the exclusion step. Yes: final model.]

Figure 1. RVAR modeling process. The initial VAR model is restricted by excluding parameters with a t-

statistic below a predefined threshold. The model is re-estimated followed by the next exclusion iteration, until

all parameters satisfy the t-statistic requirement.


The RVAR model usually results in less extensive and more accurate estimated models due

to the restriction to significant terms only.
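The exclusion loop described above can be sketched for a single equation as follows. This is a Python illustration under stated assumptions, not the estimation code used in this proposal: the helper names (`invert`, `ols_t_stats`, `restrict`), the term names, the data and the threshold of 2.0 are all made up for the example, and the regression is fitted without an intercept.

```python
import math


def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def ols_t_stats(columns, y):
    """OLS without intercept; returns (coefficients, t-statistics)."""
    k, n = len(columns), len(y)
    xtx = [[sum(a * b for a, b in zip(ci, cj)) for cj in columns] for ci in columns]
    xty = [sum(a * b for a, b in zip(ci, y)) for ci in columns]
    inv = invert(xtx)
    beta = [sum(inv[i][j] * xty[j] for j in range(k)) for i in range(k)]
    resid = [yi - sum(b * c[t] for b, c in zip(beta, columns)) for t, yi in enumerate(y)]
    sigma2 = sum(e * e for e in resid) / (n - k)   # residual variance
    t_stats = [b / math.sqrt(sigma2 * inv[i][i]) for i, b in enumerate(beta)]
    return beta, t_stats


def restrict(named_columns, y, threshold):
    """Drop all terms with |t| below threshold, re-estimate, repeat until none remain."""
    kept = dict(named_columns)
    while kept:
        names = list(kept)
        beta, t_stats = ols_t_stats([kept[name] for name in names], y)
        weak = [name for name, t in zip(names, t_stats) if abs(t) < threshold]
        if not weak:
            return dict(zip(names, beta))   # final model: every term is significant
        for name in weak:
            del kept[name]
    return {}


# Made-up example: y depends on the first term only; the second is irrelevant.
term_a = [float(v) for v in range(1, 13)]
term_b = [1.0, -1.0] * 6
noise = [0.3, -0.2, 0.1, -0.4, 0.2, 0.0, -0.1, 0.3, -0.3, 0.1, 0.2, -0.2]
y = [2.0 * a + e for a, e in zip(term_a, noise)]

final_model = restrict({"term_a": term_a, "term_b": term_b}, y, threshold=2.0)
```

All terms below the threshold are removed in one pass before re-estimating, which mirrors the loop in Figure 1; removing only the weakest term per pass would be a more cautious variant.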

Research question

In total three different models of continuity equations are used in the field of continuous

assurance. Auditors rely on the accuracy and anomaly detection capability of these models to

provide assurance on the data. This leads to my research question:

Which of the existing models of continuity equations in continuous auditing has the best

anomaly detection capability?

III. Method

Data

The proposed base model for the sales cycle is based on three different quantities: the

ordered quantity, the quantity of goods shipped and the quantity invoiced. These three

variables can be provided by most ERP systems on a daily basis.

Data is provided by a Dutch wholesaler in technical supplies. This company uses an off-

the-shelf solution of Microsoft Dynamics AX 2009. The data was extracted from separately

generated reports containing transaction quantities for each of the process steps by merging

the columns by date, as presented in Figure 2.

[Figure 2: data model. The tables SalesOrders, Shipments and Invoices (each with primary key Date and a Quantity field) are combined into SalesData (primary key and foreign keys FK1, FK2, FK3: Date; fields SO, GS, IS).]


Figure 2. Data model consisting of daily aggregates for three different stages in the sales cycle: ordered

quantity (SO), quantity of goods shipped to customer (GS) and quantity invoiced (IS) combined by date via a

SQL join clause. The date serves as the primary and foreign keys of the data source involved.
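The merge by date can be illustrated with a short Python sketch. It assumes an inner join, so only dates present in all three reports are kept; the text does not state which join type was used. The function name `merge_by_date` and the sample quantities are hypothetical.

```python
def merge_by_date(sales_orders, shipments, invoices):
    """Inner-join three {date: quantity} mappings on their shared date key."""
    common = sorted(set(sales_orders) & set(shipments) & set(invoices))
    return [
        {"date": d, "SO": sales_orders[d], "GS": shipments[d], "IS": invoices[d]}
        for d in common
    ]


# Made-up daily aggregates keyed by ISO date strings (which sort chronologically).
sales_orders = {"2007-02-01": 10, "2007-02-02": 20, "2007-02-03": 5}
shipments = {"2007-02-01": 8, "2007-02-02": 18}
invoices = {"2007-02-02": 18, "2007-02-01": 8, "2007-02-05": 1}

sales_data = merge_by_date(sales_orders, shipments, invoices)
```

An outer join would instead keep days on which one of the steps had no transactions, which may matter for days such as Saturdays with little activity.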

The data reflects actual day-to-day transaction quantities from February 2007 up to November

2007, excluding Sundays and holidays during which the company was closed for business.

Saturdays are still included, because high-priority orders are sometimes shipped on Saturdays.

The resulting data is exported as a CSV file to be imported by the model implementations

in R. The CSV file consists of four data fields, i.e. date, the quantities ordered, quantities

shipped and quantities invoiced. More detailed information about the data can be found in

Appendix A.

Panel A

Variable              n     Mean     Std.Dev.   25th Pct.   Median   75th Pct.
Sales orders (SO)     264   66,845   60,676     38,384      62,548   83,122
Goods shipped (GS)    264   62,068   46,099     42,295      63,326   40,865
Invoices sent (IS)    264   60,211   47,237     78,393      60,745   81,303

Panel B

Pearson correlations   SO       GS       IS
SO                     1.000    0.600*   0.588*
GS                              1.000    0.960*
IS                                       1.000

*: values significant at the 1% level.

Table 1. A: sample characteristics of the data set consisting of 264 observations of actual day-to-day

transaction quantities in sales orders, goods shipped and invoices sent. B: Pearson correlations between the

quantity variables.

Table 1 and Figure 3 present descriptive statistics for the three quantity fields in the data

set. The Pearson correlations show that the GS and IS variables are strongly related. This is

fully in line with the notion that, most of the time, invoices are generated at the same time as the goods are

shipped. Furthermore, the charts clearly show less activity on Saturdays

than on weekdays. On Saturdays only priority orders and over-the-counter sales are

handled.

The data is split into two separate parts, which account for roughly ⁄ and ⁄ of the

observations included in the data set respectively. The first part will be used as a training set

to estimate the model parameters for all three models. The second part is used as a test set.

After estimation, the models will be tested by generating predictions for the test set.


Implementation of the models

The models will be implemented in R, a widely used language for statistical

processing and data analytics. Rudimentary implementations of these models are already

available in the form of R packages.

The SEM model is implemented in four stages: data collection, pre-processing, modeling

and prediction. The code is based on the systemfit package, which was developed and

published by Arne Henningsen and Jeff D. Hamann and is available via CRAN

(Henningsen & Hamann, 2007).

The VAR and RVAR models are also implemented in four stages: data collection,

pre-processing, modeling and prediction. The code is centered around the vars package, which

was developed and published by Bernhard Pfaff and Matthieu Stigler and is available via

CRAN (Pfaff & Im Taunus, 2007; Pfaff, 2008; Pfaff, 2008). The package includes several

functions for modeling VARs, testing the VARs and presenting the results.

Figure 3. Plot of daily aggregates for three different stages in the sales cycle: ordered quantity (SO), quantity of

goods shipped to customer (GS) and quantity invoiced (IS) as provided in the data set.


The modeling implementation in R can be found in Appendix B.

Testing of the models

After the model parameters have been estimated on the training set, the resulting models

are tested. Anomaly detection capability is tested by counting false negatives (Type II

errors) in the model predictions for a slightly modified test set. Type I errors (false

positives) are not in scope, because they do not reduce the level of assurance.

The test set is altered by increasing the quantities in five randomly selected observations by

100%. These altered observations serve as injected anomalies in the test set. The test set,

including the seeded anomalies, is then processed by the model implementation and

anomalies are reported.

In order to improve randomness and reduce the apparent selection bias the testing is

repeated 1,000 times, while randomly selecting five observations to be altered by 100% in the

original test set for every repetition. The mean number of Type II errors found serves as the

test statistic for comparison purposes. These means are compared using a dependent t-test.

The test procedure, as implemented in R, can be found in Appendix C.
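The seeding and counting steps can be sketched as follows. This Python illustration is not the R procedure of Appendix C. In particular, the detection rule (flag an observation when it deviates from its prediction by more than a relative `tolerance`) is an assumption made here, since the text does not specify how a model flags an anomaly, and the predictions are treated as fixed rather than re-computed from the seeded data. All function names are hypothetical.

```python
import math
import random


def seed_anomalies(values, rng, count=5, factor=2.0):
    """Copy `values` and increase `count` random observations by 100% (factor 2)."""
    indices = rng.sample(range(len(values)), count)
    seeded = list(values)
    for i in indices:
        seeded[i] *= factor
    return seeded, set(indices)


def count_type_ii(seeded, seeded_indices, predictions, tolerance):
    """Count seeded anomalies that the (assumed) detection rule fails to flag."""
    flagged = {
        i for i, (actual, predicted) in enumerate(zip(seeded, predictions))
        if abs(actual - predicted) > tolerance * abs(predicted)
    }
    return len(seeded_indices - flagged)


def type_ii_counts(values, predictions, tolerance, repetitions=1000, seed=0):
    """Repeat the seeding experiment and return the Type II count per repetition."""
    rng = random.Random(seed)  # same seed gives the same anomalies for every model
    counts = []
    for _ in range(repetitions):
        seeded, indices = seed_anomalies(values, rng)
        counts.append(count_type_ii(seeded, indices, predictions, tolerance))
    return counts


def paired_t(a, b):
    """Dependent (paired) t statistic for two equally long samples."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean = sum(diffs) / n
    variance = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    return mean / math.sqrt(variance / n)
```

Calling `type_ii_counts` once per model with the same `seed` yields paired counts, whose means can then be compared with `paired_t`. The statistic is undefined when all paired differences are identical.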

IV. Expected results

After testing I expect to find that the RVAR model is the superior model in terms of

anomaly detection capability. The SEM model will probably underperform due to its

oversimplification of the sales cycle steps and the accompanying lag terms. I expect most

companies to have two or more lag terms associated with the largest part of the flow of

goods. The data provider for the proposed tests, for example, provides next-day delivery for

some items, which are shipped separately. The ordered quantity can thus be considered as two

or more flows, each with its own lag. The SEM model would oversimplify this cycle.

In theory the RVAR model should also outperform the basic VAR model purely based on statistical

properties. In both the RVAR and VAR models multiple lag terms are considered and

included in the model. This should result in better performance than the SEM model. The

RVAR model can be considered an improved version of the basic VAR model due to the

exclusion of statistically insignificant terms. Even though the algorithm for estimating the

RVAR model on real data is simple and elegant, it could result in a suboptimal estimation.


Predicting anomaly detection performance and accuracy before running the estimation algorithm is

even more difficult.

V. Limitations

Type II errors only

The research focuses on Type II errors only, since only false negatives (failing to identify

an anomaly when one exists) influence the level of assurance. The level of assurance is the

most important factor in acceptance of the models used. If the models are considered

unreliable, auditors will not be able to use them. Therefore, actual errors must not pass the

test undiscovered.

However, Type I errors also influence the audit procedure. False positives

can lead to an increase in audit activities, since all detected anomalies have to be tested

manually. Even though Type I errors are not in scope, the models can only be accepted if the

number of false positives stays below a certain limit.

Data

The data used in this research is provided by a single entity and for a single year only.

Therefore, conclusions and results are only applicable to the data provider and cannot be

generalized. In order to be able to generalize the results and conclusions, the proposed

methods need to be used on data provided by multiple entities. Furthermore, reliability will

be improved by testing data from subsequent years. Moreover, since the data is provided

by a single entity, selection bias may occur. In addition, the data set contains noise:

pre-existing anomalies might be present in the data set.


REFERENCES

Canadian Institute of Chartered Accountants (CICA). (1999). Continuous Auditing. Toronto, ON, Canada.

Alles, M., Kogan, A., Vasarhelyi, M., & Wu, J. (2005). Continuity Equations in Continuous

Auditing: Detecting Anomalies in Business Processes.

Dzeng, S. (1994). A Comparison of Analytical Procedures Expectation Models Using Both

Aggregate and Disaggregate Data. Auditing: A Journal of Practice & Theory,

13(Fall), 1-24.

Henningsen, A., & Hamann, J. D. (2007). systemfit: A Package for Estimating Systems of

Simultaneous Equations in R. Journal of Statistical Software, 23(4), 1-40.

Kogan, A., Alles, M. G., Vasarhelyi, M. A., & Wu, J. (2010). Analytical Procedures for

Continuous Data Level Auditing: Continuity Equations.

Leitch, R. A., & Chen, Y. (2003). The effectiveness of expectation models in recognizing

error patterns and generating and eliminating hypotheses while conducting analytical

procedures. Auditing: A Journal of Practice & Theory, 22(2), 147-170.

Pfaff, B. (2008). VAR, SVAR and SVEC models: Implementation within R package vars.

Journal of Statistical Software, 27(4), 1-32.

Pfaff, B. (2008). vars: VAR Modelling. R package version, 1-3.

Pfaff, B., & Im Taunus, K. (2007). Using the vars package.

Vasarhelyi, M. A., & Halper, F. B. (1991). The continuous audit of online systems. Auditing:

A Journal of Practice & Theory, 10(1), 110-125.

Vasarhelyi, M. A., Alles, M. G., & Kogan, A. (2004). Principles of analytic monitoring for

continuous assurance. Journal of Emerging Technologies in Accounting, 1(1), 1-21.

Vasarhelyi, M. A., Alles, M., & Williams, K. T. (2010). Continuous assurance for the now

economy. Institute of Chartered Accountants in Australia Sydney, Australia.


Appendix A. Data

The data is provided by a Dutch wholesaler in technical supplies and contains daily

aggregates of the three separate steps in the sales cycle.

[Figure 2: data model. The tables SalesOrders, Shipments and Invoices (each with primary key Date and a Quantity field) are combined into SalesData (primary key and foreign keys FK1, FK2, FK3: Date; fields SO, GS, IS).]

Figure 2. Data model consisting of daily aggregates for three different stages in the sales cycle: ordered

quantity (SO), quantity of goods shipped to customer (GS) and quantity invoiced (IS) combined by date via a

SQL join clause. The date serves as the primary and foreign keys of the data source involved.

The data is imported by using the following R code:


Appendix B. Implementations of the models in R


Appendix C. Test algorithm
