Research Proposal
ANOMALY DETECTION CAPABILITY OF EXISTING MODELS OF CONTINUITY EQUATIONS IN CONTINUOUS ASSURANCE
E.J.F. VAN KEMPEN ANR: 201386
Pre-master Accounting
Supervisor: Prof. Dr. W.F.J. Buijink
2014
Abstract
Continuous assurance is a methodology to provide assurance on financial data on a near
real-time basis. One of the fundamental elements of continuous assurance is continuous data
auditing in which the integrity of the data provided by the client is tested. Continuity
equations can be used to evidence assertions regarding data integrity. In order to do so, data
is tested by predicting subsequent values based on a fitting model. In total there are three
models: the simultaneous equations model, the vector autoregressive model and the restricted
vector autoregressive model. I propose to test these models and compare them on the aspect
of anomaly detection capability.
I. Introduction
Continuous assurance has been a subject of interest for auditors and financial professionals
for the last three decades. However, this field of research took off only after Vasarhelyi et al.
(2004) published a widely accepted conceptual framework for continuous assurance. In the
following years additional studies were performed in this field, but most of these studies were
focused on refining the theoretical framework and developing new and innovative analysis
methods. Comparison of existing analysis models was not yet in scope. This proposal focuses
on the comparison of the anomaly detection capability of existing models of continuity
equations.
Conventional audit procedures focus on time-consuming manual testing of a fixed number
of randomly selected supporting documents, such as invoices or inventory counts. By
introducing more advanced audit procedures from the continuous assurance domain, such as
continuity equations, substantive testing can in theory be performed more efficiently and
effectively. The level of assurance can improve while time consumption is reduced.
However, these audit procedures from the continuous assurance domain are fairly new
and remain mostly untested in the real world. This research intends to investigate one of these
procedures, continuity equations, in more detail. By using continuity equations,
business processes could be tested by detecting anomalies in one or more of the steps within
these processes. The audit procedures or manual testing can then be narrowed down to the
detected anomalies.
Efficient performance of anomaly detection could lead to a paradigm shift in the field of
auditing. Instead of sampling evidence randomly from the population, the level of assurance
can be improved by inspecting exceptions only: audit by exception.
II. Literature review and research question
Continuous assurance
The Canadian Institute of Chartered Accountants (1999) provides a definition of continuous
assurance: “Continuous auditing [or continuous assurance] is a methodology that enables
independent auditors to provide written assurance on a subject matter using a series of
auditor’s reports issued simultaneously with, or a short period of time after, the occurrence of
events underlying the subject matter.” The emphasis of continuous assurance is on reducing
the lag between preparing a report and subsequently providing assurance on the matters
reported.
In order to be able to provide assurance on a near real-time basis, the auditors have to rely
heavily on automated testing. Vasarhelyi et al. (2004; 2010) have defined three elements of
continuous assurance and continuous monitoring: Continuous Control Monitoring (CCM),
Continuous Data Auditing (CDA) and Continuous Risk Monitoring and Assessment (CRMA).
CCM can be compared to interim testing of procedures in the conventional audit framework
and CDA can be compared to final testing focusing more on data than procedures. These two
elements combined can be used to provide sufficient assurance. CRMA can be used as an
additional part of the control framework, but is not essential for providing assurance. CDA
verifies the integrity of the data flowing through the information system. The data provided
by the client is the basis for all testing procedures, so data assurance forms an essential part of
continuous assurance. Continuity equations can be used as a tool from the CDA sub-domain
to evidence management assertions focusing on data integrity.
Continuity equations
Continuity equations have been a fundamental part of classical physics since the eighteenth
century. These equations describe the transport of a quantity, while simultaneously ensuring
conservation of this quantity (like mass and/or energy). Accordingly, similar relations can be
defined for the transport of quantities within a system in the financial domain. The movement
of reported quantities, e.g. ordered kilograms or invoiced units, between steps in the key
business processes can be described with continuity equations.
The term continuity equations was coined in 1991, when Vasarhelyi and Halper (1991)
modeled the flow of billing data at AT&T. Although Vasarhelyi and Halper proposed
continuity equations more than 20 years ago, little research has been performed on the
application in practice and implementation of a decent continuity equations model.
In most businesses the flow of goods is the most important basis for revenue recognition.
As such, the flow of goods can be used to provide evidence for the completeness, timeliness
and accuracy of the reported revenue. If the continuity equations hold for a specific business
process, one can assert that there are no ‘leakages’ from the transaction flow, i.e. the integrity
of the flow of goods can be asserted. Therefore, continuity equations provide a method to
evidence the integrity of the basis for revenue recognition, which makes them a valuable tool
in continuous assurance.
Continuity equations are based on historical data of quantities in the separate steps of
business processes. For example, the sales cycle can be modeled as three separate steps:
receiving the order from the customer, shipping goods to the customer and invoicing for the
ordered and shipped goods. The quantity of ordered goods today will of course show up in
the invoicing step a certain number of days later. The daily flow of goods between these steps
can be defined with a certain quantity and a lag between the steps. This research will
focus on the sales cycle consisting of the three previously defined process steps.
Previous research by Leitch and Chen (2003), Kogan et al. (2010) and Alles et al. (2005)
has resulted in three models of continuity equations: the simultaneous equations model
(SEM), vector autoregressive model (VAR) and the restricted vector autoregressive model
(RVAR).
Simultaneous Equations Model
Leitch and Chen (2003) proposed a first model of continuity equations in the field of
assurance: the Simultaneous Equations Model (SEM). When applied to the sales cycle this
model can be represented as Equation (1). Each step in the sales cycle is simultaneously
dependent on historic quantities from the previous step. These historic quantities are
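represented with a lag \Delta in each step. This model simplifies the sales cycle by assuming that
there is only a single fixed lag between each step. Here SO_t, GS_t and IS_t are the quantities
ordered, shipped and invoiced at time t.

    GS_t = \alpha \, SO_{t-\Delta_1} + \varepsilon_1
    IS_t = \beta \, GS_{t-\Delta_2} + \varepsilon_2        (1)

The coefficients of this model are estimated by OLS linear regression, optimizing for the
overall R^2 of the model.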
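As an illustration of the estimation step, the following minimal sketch fits one SEM-style equation with a single fixed lag by least squares. It is written in Python for brevity, not in the R used for the actual implementation, and it assumes a no-intercept form GS_t = a * SO_{t-lag}. The function name fit_fixed_lag and the toy numbers are hypothetical.

```python
def fit_fixed_lag(source, target, lag):
    """Least-squares slope a in target[t] = a * source[t - lag] (no intercept).

    Assumption: a single fixed lag and no intercept, for illustration only.
    """
    pairs = [(source[t - lag], target[t]) for t in range(lag, len(target))]
    sxx = sum(x * x for x, _ in pairs)
    sxy = sum(x * y for x, y in pairs)
    return sxy / sxx


# Toy data (hypothetical): 95% of each day's orders ship on the next day.
so = [100.0, 120.0, 80.0, 150.0, 90.0]
gs = [0.0, 95.0, 114.0, 76.0, 142.5]
a = fit_fixed_lag(so, gs, lag=1)
print(round(a, 4))
```

With a wrongly specified lag the same function still returns a slope, but the fit degrades, which is why the single-lag assumption matters for this model.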
Leitch and Chen tested the application of SEM on monthly data of financial statements.
They found that SEM outperformed other more conventional models of analytical
procedures.
Basic Vector Autoregressive model
Alles et al. (2005) introduced another model: the basic Vector Autoregressive (VAR)
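model. This model for the sales cycle can be represented as Equation (2). In this model
SO_t, GS_t and IS_t are respectively the quantities ordered, shipped and invoiced
at time t, the \Phi terms are transition vectors for a multivariate linear model, the
M(\cdot) terms are vectors containing daily aggregates of quantities for the given dimension,
M(X) = (X_{t-1}, \dots, X_{t-N}), and N is the number of time periods covered in the model.

    SO_t = \Phi_{oo} M(SO) + \Phi_{so} M(GS) + \Phi_{io} M(IS) + \varepsilon_o
    GS_t = \Phi_{os} M(SO) + \Phi_{ss} M(GS) + \Phi_{is} M(IS) + \varepsilon_s
    IS_t = \Phi_{oi} M(SO) + \Phi_{si} M(GS) + \Phi_{ii} M(IS) + \varepsilon_i        (2)
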
Each of these sub-equations models a predictor for the reported quantities in a specific step
in the business process. As previously defined, the quantities are related to quantities in the
other process steps by a time delay (lag). For example, if orders are shipped in exactly one
day, without exception, and invoicing is performed simultaneously with shipping, the
resulting predictors can be defined as Equation (3).

    GS_t = SO_{t-1}
    IS_t = GS_t = SO_{t-1}        (3)

The VAR model is estimated by OLS linear regression, optimizing for the overall R^2 by
trying different lags for the process steps. Only the maximum expected lag is provided to the
algorithm, which then tries to find the best fitting model by iterating through all lag
possibilities up to the maximum expected lag. The exact lags do not have to be known prior
to modeling, as the best fitting lags are determined while modeling.
One can easily understand that it is not always trivial to determine lags prior to the
modeling process, e.g. lags in the purchasing cycle are highly dependent on the policies and
processes at third parties. Therefore, the VAR model can be a powerful tool for modeling
continuity equations when exact lags cannot easily be predefined.
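The idea of searching over lags up to a maximum can be illustrated with a deliberately simplified sketch. It is written in Python, not in the R of the actual implementation, and it looks at only one pair of process steps with one lag at a time, whereas the VAR model regresses each quantity on lagged values of all three. The names r_squared and best_lag and the toy series are hypothetical.

```python
def r_squared(xs, ys):
    """R^2 of a simple linear regression of ys on xs (with intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        return 0.0
    return (sxy * sxy) / (sxx * syy)


def best_lag(source, target, max_lag):
    """Try every lag from 1 to max_lag and return the one with the highest R^2."""
    scores = {}
    for lag in range(1, max_lag + 1):
        xs = source[:len(source) - lag]   # source values `lag` periods earlier
        ys = target[lag:]                 # target values they should explain
        scores[lag] = r_squared(xs, ys)
    return max(scores, key=scores.get), scores


# Toy data (hypothetical): the target repeats the source two periods later.
source = [10, 30, 20, 50, 40, 70, 60, 90]
target = [0, 0, 10, 30, 20, 50, 40, 70]
lag, scores = best_lag(source, target, max_lag=3)
print(lag)
```

Only the maximum lag is supplied by the user; the lag itself is chosen by fit, which mirrors the property described above.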
Contrary to the SEM, the VAR model does not assume that there is a single fixed
lag between steps. All lags up to a maximum are considered in the model. This can
result in an extensive estimated model. Therefore, most VAR models are represented
using matrix notation.
Restricted Vector Autoregressive model
Kogan et al. (2010) have shown that the VAR model achieves high
accuracy. More importantly, they showed that the Restricted VAR (RVAR) model resulted in
better accuracy. With a MAPE (mean absolute percentage error) of 0.3374 on the test set it
outperformed several other models, i.e. SEM and VAR type models. Only the Bayesian
VAR model performed better when taking only the MAPE into account, but it also resulted in
a larger standard deviation of the absolute percentage error. Therefore, the Bayesian VAR
model is not considered viable for auditing purposes. The RVAR model was found to be one
of the best models for continuity equations.
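For reference, the two accuracy measures mentioned here, the MAPE and the standard deviation of the absolute percentage error, can be computed as in the following Python sketch. The function names are hypothetical, results are expressed as fractions rather than percentages, and skipping observations with an actual value of zero is an assumption made to avoid division by zero.

```python
import statistics


def absolute_percentage_errors(actual, predicted):
    """|actual - predicted| / |actual| per observation (zero actuals skipped)."""
    return [abs((a - p) / a) for a, p in zip(actual, predicted) if a != 0]


def mape(actual, predicted):
    """Mean absolute percentage error as a fraction (0.25 means 25%)."""
    return statistics.mean(absolute_percentage_errors(actual, predicted))


def ape_stdev(actual, predicted):
    """Sample standard deviation of the absolute percentage errors."""
    return statistics.stdev(absolute_percentage_errors(actual, predicted))


# Toy values (hypothetical).
actual = [100.0, 200.0, 400.0]
predicted = [110.0, 180.0, 400.0]
print(mape(actual, predicted), ape_stdev(actual, predicted))
```

A model with a low MAPE but a high standard deviation is accurate on average yet erratic per observation, which is the reason given above for setting the Bayesian VAR aside.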
The RVAR model translates roughly to optimizing the predictor by removing
insignificant coefficients from the VAR model. For example, if the mean lag between order
and shipping is less than a month, a shipment a year after ordering is obviously
not significant and is thus excluded from the model. This method iterates the modeling process
per equation by removing all coefficients with |t|-statistics below a predefined threshold, as
explained in Figure 1. Kogan et al. (2010) report the threshold that yields the model with
the best prediction accuracy.
[Figure 1: flowchart. Inputs: data and threshold. Start -> initial model estimation ->
exclude parameters with t-statistic below threshold -> re-estimate model -> all t-statistics
above threshold? No: return to the exclusion step. Yes: final model.]
Figure 1. RVAR modeling process. The initial VAR model is restricted by excluding parameters with a
t-statistic below a predefined threshold. The model is re-estimated followed by the next exclusion iteration, until
all parameters satisfy the t-statistic requirement.
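The loop in Figure 1 can be sketched for a single equation as follows. This is a minimal Python illustration using only the standard library, not the R implementation based on the vars package. The function names, the toy data and the threshold of 2.0 are assumptions and do not reproduce the settings of Kogan et al. (2010).

```python
import math


def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination (assumes it is non-singular)."""
    n = len(matrix)
    aug = [row[:] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def ols_t_stats(X, y):
    """OLS coefficients and t-statistics for a design matrix X given as a list of rows."""
    n, k = len(X), len(X[0])
    xtx = [[sum(X[i][p] * X[i][q] for i in range(n)) for q in range(k)] for p in range(k)]
    xty = [sum(X[i][p] * y[i] for i in range(n)) for p in range(k)]
    inv = invert(xtx)
    beta = [sum(inv[p][q] * xty[q] for q in range(k)) for p in range(k)]
    resid = [y[i] - sum(X[i][p] * beta[p] for p in range(k)) for i in range(n)]
    sigma2 = sum(e * e for e in resid) / (n - k)   # residual variance
    t = [beta[p] / math.sqrt(sigma2 * inv[p][p]) for p in range(k)]
    return beta, t


def restrict(X, y, names, threshold):
    """Drop all regressors with |t| below threshold, re-estimate, and repeat."""
    keep = list(range(len(names)))
    while keep:
        Xk = [[row[j] for j in keep] for row in X]
        beta, t = ols_t_stats(Xk, y)
        weak = [j for j, tj in zip(keep, t) if abs(tj) < threshold]
        if not weak:
            return {names[j]: b for j, b in zip(keep, beta)}
        keep = [j for j in keep if j not in weak]
    return {}


# Toy data (hypothetical): y depends on x1 only; x2 is irrelevant.
x1 = [1, 2, 3, 4, 5, 6, 7, 8]
x2 = [1, -1, 1, -1, 1, -1, 1, -1]
noise = [0.1, -0.1, -0.1, 0.1, 0.1, -0.1, -0.1, 0.1]
y = [2 * a + e for a, e in zip(x1, noise)]
X = [[float(a), float(b)] for a, b in zip(x1, x2)]
model = restrict(X, y, ["x1", "x2"], threshold=2.0)
print(model)
```

In the toy run the irrelevant regressor is removed in the first pass and the re-estimated model keeps only x1, which is the behavior the flowchart describes.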
The RVAR model usually results in less extensive and more accurate estimated models due
to the restriction to significant terms only.
Research question
In total three different models of continuity equations are used in the field of continuous
assurance. Auditors rely on the accuracy and anomaly detection capability of these models to
provide assurance on the data. This leads to my research question:
Which of the existing models of continuity equations in continuous auditing has the best
anomaly detection capability?
III. Method
Data
The proposed base model for the sales cycle is based on three different quantities: the
ordered quantity, the quantity of goods shipped and the quantity invoiced. These three
variables can be provided by most ERP systems on a daily basis.
Data is provided by a Dutch wholesaler in technical supplies. This company uses an off-
the-shelf solution of Microsoft Dynamics AX 2009. The data was extracted from separately
generated reports containing transaction quantities for each of the process steps by merging
the columns by date, as presented in Figure 2.
    SalesOrders (PK Date, Quantity)
    Shipments   (PK Date, Quantity)
    Invoices    (PK Date, Quantity)
    SalesData   (PK/FK1/FK2/FK3 Date, SO, GS, IS)

Figure 2. Data model consisting of daily aggregates for three different stages in the sales cycle: ordered
quantity (SO), quantity of goods shipped to customer (GS) and quantity invoiced (IS) combined by date via a
SQL join clause. The date serves as the primary and foreign keys of the data source involved.
The data reflects actual day-to-day transaction quantities of February 2007 up to November
2007, excluding Sundays and holidays during which the company was closed for business.
Saturdays are still included, because sometimes high priority orders are shipped on Saturdays.
The resulting data is exported as a CSV file to be imported by the model implementations
in R. The CSV file consists of four data fields, i.e. date, the quantities ordered, quantities
shipped and quantities invoiced. More detailed information about the data can be found in
Appendix A.
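The merge by date and the CSV export described above could look roughly like the following Python sketch. The actual extraction uses a SQL join and R; the inner join on dates present in all three sources, the column names and the toy quantities are assumptions for illustration.

```python
import csv
import io


def merge_by_date(orders, shipments, invoices):
    """Inner-join three {date: quantity} mappings on the dates they share."""
    dates = sorted(set(orders) & set(shipments) & set(invoices))
    return [(d, orders[d], shipments[d], invoices[d]) for d in dates]


# Toy daily aggregates (hypothetical), keyed by ISO date so sorting is chronological.
orders = {"2007-02-01": 120, "2007-02-02": 80, "2007-02-03": 15}
shipments = {"2007-02-01": 100, "2007-02-02": 90}
invoices = {"2007-02-01": 100, "2007-02-02": 90, "2007-02-05": 40}

sales_data = merge_by_date(orders, shipments, invoices)

# Write the four fields to CSV text (an in-memory buffer stands in for a file).
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["date", "SO", "GS", "IS"])
writer.writerows(sales_data)
print(buffer.getvalue())
```

An outer join would instead keep dates on which one of the steps had no activity, so the choice of join affects how many observations the models see.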
Panel A
Variable              n     Mean     Std.Dev.  25th Pct.  Median   75th Pct.
Sales orders (SO)     264   66,845   60,676    38,384     62,548   83,122
Goods shipped (GS)    264   62,068   46,099    42,295     63,326   40,865
Invoices sent (IS)    264   60,211   47,237    78,393     60,745   81,303

Panel B
Pearson correlations  SO      GS      IS
SO                    1.000   0.600*  0.588*
GS                            1.000   0.960*
IS                                    1.000
*: values significant at the 1% level.

Table 1. A: sample characteristics of the data set consisting of 264 observations of actual day-to-day
transaction quantities in sales orders, goods shipped and invoices sent. B: Pearson correlations between the
quantity variables.
Table 1 and Figure 3 present descriptive statistics about the three quantity fields in the data
set. The Pearson correlations show that the GS and IS variables are strongly related. This is
fully in line with the notion that invoices are generated at the same time as the goods are
shipped most of the time. Furthermore, the charts clearly show less activity on Saturdays
compared to weekdays. On Saturdays only priority orders and over-the-counter sales are
handled.
The data is split into two separate parts. The first part will be used as a training set
to estimate the model parameters for all three models. The second part is used as a test set.
After estimation, the models will be tested by generating predictions for the test set.
Implementation of the models
The models will be implemented in R, a widely used language for statistical
processing and data analytics. A rudimentary implementation of these models is already
available in the form of R packages.
The SEM model is implemented in four stages: data collection, pre-processing, modeling
and prediction. The code is based on the systemfit package, which has been developed and
published by Arne Henningsen and Jeff D. Hamann and is available via CRAN
(Henningsen & Hamann, 2007).
The VAR and RVAR models are also implemented in four stages: data collection,
pre-processing, modeling and prediction. The code is centered around the vars package, which
has been developed and published by Bernhard Pfaff and Matthieu Stigler and is available via
CRAN (Pfaff & Im Taunus, 2007; Pfaff, 2008; Pfaff, 2008). The package includes several
functions for modeling VARs, testing the VARs and presenting the results.
Figure 3. Plot of daily aggregates for three different stages in the sales cycle: ordered quantity (SO), quantity of
goods shipped to customer (GS) and quantity invoiced (IS) as provided in the data set.
The modeling implementation in R can be found in Appendix B.
Testing of the models
After the model parameters have been estimated on the training set, the resulting models
are tested. Anomaly detection capability is tested by counting false negatives (Type II
errors) in the model predictions based on a slightly modified test set. False positives (Type I
errors) are not in scope, because they do not reduce the level of assurance.
The test set is altered by increasing the quantities in five randomly selected observations by
100%. These altered observations serve as injected anomalies in the test set. The test set,
including the seeded anomalies, is then processed by the model implementation and
anomalies are reported.
In order to improve randomness and reduce the apparent selection bias the testing is
repeated 1,000 times, while randomly selecting five observations to be altered by 100% in the
original test set for every repetition. The mean number of Type II errors found serves as the
test statistic for comparison purposes. These means are compared using a dependent t-test.
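The seeding and counting procedure can be sketched as follows in Python (the actual test procedure is implemented in R). The detection rule is not specified in this proposal, so the sketch assumes a hypothetical one: an observation is flagged when it deviates from the model prediction by more than a relative tolerance. All function names and the toy values are likewise assumptions.

```python
import random


def seed_anomalies(values, count, rng):
    """Return a copy with `count` randomly chosen observations increased by 100%."""
    seeded = list(values)
    picked = rng.sample(range(len(values)), count)
    for i in picked:
        seeded[i] = values[i] * 2
    return seeded, set(picked)


def count_type_ii(observed, predicted, anomalies, tolerance):
    """Number of seeded anomalies that the (assumed) detection rule fails to flag."""
    flagged = {i for i, (o, p) in enumerate(zip(observed, predicted))
               if abs(o - p) > tolerance * abs(p)}
    return len(anomalies - flagged)


def mean_type_ii(values, predicted, repetitions, count, tolerance, seed=0):
    """Mean number of missed anomalies over repeated random seedings."""
    rng = random.Random(seed)
    misses = []
    for _ in range(repetitions):
        seeded, anomalies = seed_anomalies(values, count, rng)
        misses.append(count_type_ii(seeded, predicted, anomalies, tolerance))
    return sum(misses) / repetitions


# Toy test set (hypothetical): a flat series that the model predicts perfectly.
values = [100.0] * 20
predicted = [100.0] * 20
strict = mean_type_ii(values, predicted, repetitions=1000, count=5, tolerance=0.5)
loose = mean_type_ii(values, predicted, repetitions=1000, count=5, tolerance=1.5)
print(strict, loose)
```

With the tight tolerance every doubled observation is flagged, while the loose tolerance misses all five in every repetition, which shows how the detection rule drives the Type II count.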
The test procedure, as implemented in R, can be found in Appendix C.
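The dependent t-test used to compare the mean numbers of Type II errors can be illustrated with the following Python sketch. It assumes that two models are evaluated on the same seeded test sets, so that their miss counts are paired per repetition. The function name paired_t and the sample values are hypothetical.

```python
import math
import statistics


def paired_t(a, b):
    """t statistic of a dependent (paired) t-test on two equal-length samples."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    standard_error = statistics.stdev(diffs) / math.sqrt(n)
    return statistics.mean(diffs) / standard_error


# Toy miss counts per repetition for two models (hypothetical).
misses_model_a = [3, 4, 5, 6]
misses_model_b = [1, 3, 2, 5]
t_value = paired_t(misses_model_a, misses_model_b)
print(round(t_value, 4))
```

The statistic has n - 1 degrees of freedom; it is undefined when all paired differences are identical, because the standard deviation of the differences is then zero.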
IV. Expected results
After testing I expect to find the RVAR model to be the superior model in terms of
anomaly detection capability. The SEM model will probably underperform due to the
oversimplification of the sales cycle steps and the accompanying lag terms. I expect most
companies to have two or more lag terms associated with the largest part of the flow of
goods. The data provider for the proposed tests, for example, provides next day delivery for
some items, which are shipped separately. The ordered quantity can thus be considered as two
or more flows, each with its own lag. The SEM model would oversimplify this cycle.
In theory the RVAR model should also outperform the basic VAR model purely based on statistical
properties. In both the RVAR and VAR model multiple lag terms are considered and
included in the model. This should result in better performance than the SEM model. The
RVAR model can be considered an improved version of the basic VAR model due to the
exclusion of statistically insignificant terms. Even though the algorithm for estimating the
RVAR model on real data is simple and elegant, it could result in a suboptimal estimation.
Estimating anomaly detection performance and accuracy prior to running the estimation algorithm is
even more difficult.
V. Limitations
Type II errors only
The research focuses on Type II errors only, since only false negatives (failing to identify
an anomaly when one exists) influence the level of assurance. The level of assurance is the
most important factor in acceptance of the models used. If the models are considered
unreliable, auditors will not be able to use them. Therefore, actual errors must not pass the
test undiscovered.
However, Type I errors also influence the audit procedure. The detection of false positives
can lead to an increase in audit activities, since all detected anomalies have to be tested
manually. Even though Type I errors are not in scope, the models can only be accepted if the
number of false positives stays below a certain limit.
Data
The data used in this research is provided by a single entity and for a single year only.
Therefore, conclusions and results are only applicable to the data provider and cannot be
generalized. In order to be able to generalize the results and conclusions, the proposed
methods need to be used on data provided by multiple entities. Furthermore, reliability will
be improved by testing data from subsequent years. Since the data is provided
by a single entity, selection bias may also occur. In addition, the data set contains noise:
pre-existing anomalies might exist in the data set.
REFERENCES
Canadian Institute of Chartered Accountants (CICA). (1999). Continuous Auditing. Toronto, ON, Canada.
Alles, M., Kogan, A., Vasarhelyi, M., & Wu, J. (2005). Continuity Equations in Continuous
Auditing: Detecting Anomalies in Business Processes.
Dzeng, S. (1994). A Comparison of Analytical Procedures Expectation Models Using Both
Aggregate and Disaggregate Data. Auditing: A Journal of Practice & Theory,
13(Fall), 1-24.
Henningsen, A., & Hamann, J. D. (2007). systemfit: A Package for Estimating Systems of
Simultaneous Equations in R. Journal of Statistical Software, 23(4), 1-40.
Kogan, A., Alles, M. G., Vasarhelyi, M. A., & Wu, J. (2010). Analytical Procedures for
Continuous Data Level Auditing: Continuity Equations.
Leitch, R. A., & Chen, Y. (2003). The effectiveness of expectation models in recognizing
error patterns and generating and eliminating hypotheses while conducting analytical
procedures. Auditing: A Journal of Practice & Theory, 22(2), 147-170.
Pfaff, B. (2008). VAR, SVAR and SVEC models: Implementation within R package vars.
Journal of Statistical Software, 27(4), 1-32.
Pfaff, B. (2008). vars: VAR Modelling. R package version, 1-3.
Pfaff, B., & Im Taunus, K. (2007). Using the vars package.
Vasarhelyi, M. A., & Halper, F. B. (1991). The continuous audit of online systems. Auditing:
A Journal of Practice & Theory, 10(1), 110-125.
Vasarhelyi, M. A., Alles, M. G., & Kogan, A. (2004). Principles of analytic monitoring for
continuous assurance. Journal of Emerging Technologies in Accounting, 1(1), 1-21.
Vasarhelyi, M. A., Alles, M., & Williams, K. T. (2010). Continuous assurance for the now
economy. Institute of Chartered Accountants in Australia Sydney, Australia.
Appendix A. Data
The data is provided by a Dutch wholesaler in technical supplies and contains daily
aggregates of the three separate steps in the sales cycle.
    SalesOrders (PK Date, Quantity)
    Shipments   (PK Date, Quantity)
    Invoices    (PK Date, Quantity)
    SalesData   (PK/FK1/FK2/FK3 Date, SO, GS, IS)

Figure 2. Data model consisting of daily aggregates for three different stages in the sales cycle: ordered
quantity (SO), quantity of goods shipped to customer (GS) and quantity invoiced (IS) combined by date via a
SQL join clause. The date serves as the primary and foreign keys of the data source involved.
The data is imported by using the following R code:
Appendix B. Implementations of the models in R
Appendix C. Test algorithm