
EXTENDING THE PORTFOLIO AND STRATEGIC PLANNING HORIZON BY THE STOCHASTIC FORECASTING OF UNKNOWN FUTURE PROJECTS: AN FDOT CASE STUDY

By

ALIREZA SHOJAEI KOL KACHI

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT

OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2017

To my mother and father


ACKNOWLEDGMENTS

I would like to thank my parents. Without their support, I could not have

accomplished this task. I would also like to thank my advisor, who helped me to grow

and become a better person, both in academia and on a personal level. I would also like

to thank my committee members for their encouragement, insightful comments, and

useful questions. Finally, my gratitude also goes to my friends, whose presence made

this journey pleasant.


TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
LIST OF ABBREVIATIONS
ABSTRACT

CHAPTER

1 INTRODUCTION

2 LITERATURE REVIEW
    Definition of Project Portfolio Management
    Project Portfolio Management Methods
    Uncertainties in Project Portfolio Management

3 PROBLEM STATEMENT AND RESEARCH METHODOLOGY
    Research Scope
        Aim
        Objectives
    Data Structure
    Research Design

4 MODEL COMPONENT DEVELOPMENT
    Project Frequency Modeling
        Model Identification
        Strategies to Divide the Data and Test the Models
        Model Development
            Univariate modeling
            Identifying potentially relevant predictors and the exploratory data analysis
            Feature selection and feature importance
            Multivariate modeling
        Final Model Diagnostic Checks
    Cost and Duration Characterization

5 SIMULATION RESULTS AND DISCUSSION
    Simulation Results
    Analysis and Discussion

6 CONCLUSIONS AND RECOMMENDATIONS

LIST OF REFERENCES
BIOGRAPHICAL SKETCH


LIST OF TABLES

Table

2-1 Summary of the literature on approaches toward project portfolio management
3-1 Candidate variables and sources
4-1 Summary of the ADF test for the project frequency series
4-2 Results of the ADF test for the explanatory variables
4-3 The RMSE of the AR models
4-4 The MAE of the AR models
4-5 The RMSE of the MA models
4-6 The MAE of the MA models
4-7 The RMSE of the ARMA models
4-8 The MAE of the ARMA models
4-9 The RMSE and MAE of the exponential smoothing models
4-10 The RMSE of the LSTM models
4-11 MAE of LSTM models
4-12 Potential variables and their abbreviations
4-13 Cross-correlation of the dependent variables
4-14 Result of the ADF test for the explanatory variables
4-15 Linear filter approach results
4-16 Nonlinear Filter approach results
4-17 Linear correlation table of project cost and frequency with the budget
4-18 Parameters of the generalized linear models
4-19 Performance of the Generalized linear model
4-20 Multilayer perceptron models' performance
4-21 Performance of the support vector machine models
4-22 Summary of the best performing models
4-23 Result of Box-Ljung test
4-24 Best fitted distribution function on cross-validation datasets
5-1 Copula functions fit results
5-2 Mean and standard deviation of the actual and simulated data
5-3 Goodness of fit for the duration distribution function
5-4 Comparison of the best-fitting distribution’s properties of project duration
5-5 Goodness of fit for the cost distribution function
5-6 Comparison of the best-fitting distribution’s properties of project cost


LIST OF FIGURES

Figure

1-1 Relationship between project, program, and portfolio
3-1 Data structure
3-2 The sequence of generating information
4-1 Possible internal structures of the model
4-2 Rolling mean and standard deviation of the project frequencies
4-3 Evaluation of a rolling forecasting
4-4 Model development scheme
4-5 The ACF for the project frequencies
4-6 The PACF for the project frequencies
4-7 Comparison of the AR models’ performance
4-8 Comparison of the MA models’ performances
4-9 The RMSE and MAE of the ARMA models
4-10 The RMSE and MAE of the exponential smoothing models
4-11 The LSTM structure
4-12 The RMSE and MAE of the LSTM models with one look-back
4-13 The ARMA (8,8) forecast based on cross-validation section 7
4-14 Correlation plot of the variables
4-15 Linear variable importance
4-16 Nonlinear variable importance
4-17 Comparison of the budgets and costs of the projects
4-18 Generalized linear method optimization
4-19 Lasso coefficient curve
4-20 Variable importance for the generalized linear model
4-21 Optimum network structure with all the independent variables
4-22 Feature importance according to the Olden method
4-23 A 3D plot of the neural net model optimization
4-24 A 2D plot of the neural net model optimization
4-25 A focused 3D plot of the optimized parameters of the neural network
4-26 A focused 2D plot of the optimized parameters of the neural network
4-27 Structure of the optimized neural network
4-28 A 3D plot of the support vector machine parameter optimization
4-29 A 2D plot of the support vector machine parameter optimization
4-30 Residual autocorrelations
4-31 Scatterplot illustrating the relationship between duration and cost
4-32 Cumulative cost per month
4-33 Project frequency per month
4-34 Empirical density and cumulative distribution of the project durations
4-35 Fitted distribution function and cumulative distribution of the project durations
4-36 Empirical density and cumulative distribution of the project costs
4-37 Fitted distribution function and cumulative distribution of the project costs
5-1 The ARMA (8,8) model’s project frequency forecast
5-2 Autocorrelation plot of the project frequency forecast error
5-3 Histogram of the forecast errors
5-4 An example of project frequency simulation
5-5 The copula’s probability density function
5-6 The probability density of the data sampled from the defined copula
5-7 A sampled dataset plotted against actual values
5-8 Kernel density estimates for the duration data
5-9 Comparison of the project durations and representative distributions
5-10 Kernel density estimates for the cost data
5-11 Comparison of project costs and representative distributions
5-12 Example of the functioning of the proposed method


LIST OF ABBREVIATIONS

ACF Autocorrelation Function

ADF Augmented Dickey–Fuller

AIC Akaike information criterion

APM Association for Project Management

AR Autoregressive

ARMA Autoregressive Moving Average

FDOT Florida Department of Transportation

LSTM Long Short-Term Memory

MA Moving Average

MAE Mean Absolute Error

PACF Partial Autocorrelation Function

PMI Project Management Institute

PPM Project Portfolio Management

RMSE Root Mean Squared Error


Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

EXTENDING THE PORTFOLIO AND STRATEGIC PLANNING HORIZON BY THE STOCHASTIC FORECASTING OF UNKNOWN FUTURE PROJECTS: AN FDOT CASE STUDY

By

Alireza Shojaei Kol Kachi

August 2017

Chair: Ian Flood

Major: Design, Construction, and Planning

Construction companies typically work on many projects simultaneously, each

with its own specific objectives and resource demands. Consequently, a key managerial

function is to allocate equipment, employees, and financial resources across concurrent

projects in a way that satisfies individual project constraints while optimizing the

company’s overall objectives.

Project portfolio management (PPM) is concerned with managing multiple

projects to accomplish strategic goals. To date, the main research streams in this area

have emphasized project selection, project prioritization, and the alignment of a portfolio

with strategic goals among a pool of awarded projects. The literature contains a gap

regarding the effects of uncertainties associated with future projects, including both

known (but yet to be awarded to a contractor) and unknown (although statistically

quantifiable) ones. Such a capability, looking into the future, is critical for effective

medium- and long-term strategic planning for a company.

It is evident that companies should focus not only on current and known projects

but also on uncertain and unknown future projects. This research develops and


validates a stochastic model for predicting streams of uncertain and unknown future

projects. It also seeks to demonstrate the significance and implications of such

uncertainties on project portfolios and strategic planning. In terms of scope, this

research project considered the Florida Department of Transportation’s (FDOT) design-

bid-build projects as a case study. Records containing letting information from the past

14 years, along with a pool of candidate variables, were analyzed to capture the

characteristics of the time-series data and to identify any correlations between those

variables and macroeconomic factors. The objective was to develop a model capable of

generating representative future project streams to assist in strategic planning and

portfolio management.

The findings demonstrate how various univariate and multivariate models can be

used to forecast the number of future projects for individual months. Furthermore, a

sampling method was developed and verified to assign a cost and duration to each

forecasted project. Contractors could, for example, use these stochastic data streams to

test different bidding strategies and assess the sensitivity of a portfolio’s performance to

changes in market factors.
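As a rough illustration of the kind of stochastic project stream described above, the following minimal sketch draws a number of projects for each month and then assigns each project a cost and a duration. It is not the model developed in this research: the Poisson monthly counts, the lognormal marginals, the way cost and duration are correlated, and every parameter value are assumptions made only for illustration, and all function and parameter names are hypothetical.

```python
import math
import random


def sample_poisson(rng, mean):
    """Draw one Poisson variate with Knuth's multiplication method (suitable for small means)."""
    threshold = math.exp(-mean)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def simulate_project_stream(months, mean_projects_per_month=5.0,
                            cost_mu=14.0, cost_sigma=1.0,
                            duration_mu=5.0, duration_sigma=0.6,
                            correlation=0.7, seed=None):
    """Return a list of (month, cost, duration) tuples for hypothetical future projects.

    All default parameter values are illustrative assumptions, not fitted values.
    """
    rng = random.Random(seed)
    stream = []
    for month in range(1, months + 1):
        # Assumed count model: independent Poisson draws for each month.
        for _ in range(sample_poisson(rng, mean_projects_per_month)):
            # Two correlated standard normal draws link cost and duration.
            z_cost = rng.gauss(0.0, 1.0)
            noise = rng.gauss(0.0, 1.0)
            z_duration = correlation * z_cost + math.sqrt(1.0 - correlation ** 2) * noise
            # Assumed lognormal marginals for cost and duration.
            cost = math.exp(cost_mu + cost_sigma * z_cost)
            duration = math.exp(duration_mu + duration_sigma * z_duration)
            stream.append((month, cost, duration))
    return stream
```

A contractor could, for instance, generate many such streams with different seeds and evaluate a candidate bidding strategy against each of them.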


CHAPTER 1 INTRODUCTION

It is first necessary to review the definitions of, and relationships between, the key concepts in this research. Figure 1-1 shows the relationships among projects, programs, and portfolios. Projects are the smallest unit and can be grouped and managed in programs. Programs can cover multiple projects or even contain smaller programs. Some projects are not included in any program and are instead managed directly under the portfolio. Together, all of a company’s projects and programs make up its portfolio.

Figure 1-1. Relationship between project, program, and portfolio (Project Management Institute 2013a; b)

The Project Management Institute (PMI) (2013a; b) defines a project in its standards as “a temporary endeavor undertaken to create a unique product, service, or result.” This research focuses on construction projects; however, the proposed conceptual model can be applied in other fields. The product of a construction project is a building or facility, such as a factory or hospital.

The Project Management Institute (2013a; b) defines project management as “the application of knowledge, skills, tools, and techniques to project activities to meet the project requirements.” Resource allocation is the part of the project management skill set that deals with scheduling activities and assigning available financial, human, and equipment resources to activities within projects in a timely manner. Scheduling involves not only deciding how to allocate resources to tasks but also sequencing and prioritizing those tasks. A schedule typically consists of a set of tasks or activities with defined milestones and start and finish times.

A program is a group of projects or smaller programs that pursue the same strategic objectives or are otherwise significantly related. The Project Management Institute (2013a; b) describes programs as follows: “programs are grouped within a portfolio and are comprised of subprograms, projects, or other work that are managed in a coordinated fashion in support of the portfolio.”

The Project Management Institute (2013a; b) defines a portfolio as “a component collection of programs, projects, or operations managed as a group to achieve strategic objectives.” Portfolio management is the coordinated management of one or more portfolios to fulfill an organization’s strategies and achieve its objectives.

Companies usually face multiple projects at any given time. Although these projects progress concurrently, they have different goals and objectives; for instance, some may have financial objectives, while others may serve marketing or strategic networking purposes. Consequently, a key managerial duty is to allocate financial, material, and human resources among these concurrent projects and to manage their workflows together to maximize the company’s performance in terms of financial or any other defined objectives (Blichfeldt and Eskerod 2008). Coordinating different projects within a company is challenging because each incoming project affects the schedules and progress of all other ongoing projects (Araúzo et al. 2010), and failing to foresee these effects can have devastating results.

The concept of a project portfolio is similar to that of a financial portfolio, where factors such as risk, return, time-to-benefit, complexity, and portfolio balance are taken into consideration before investing. Similarly, the main focus of PPM so far has been on selecting and ranking projects to balance risk, resource distribution, and benefits in accordance with the company’s strategy.

The publications on PPM (including the normative body of knowledge such as the PMI standards) are relatively recent, and most of them have attempted to address the most pressing needs rather than to cover all aspects of the field. For example, the PPM literature considers the potential disruption of portfolio plans caused by typical changes in the business environment, such as new upcoming projects, the sudden termination of an ongoing project, inaccurate plans due to high uncertainty, and resource scarcity. Changes in market conditions and new threats and opportunities that might affect the successful implementation of the portfolio between portfolio planning cycles are also among the changes considered. These issues do not mean that the developed methods are incorrect or inadequate, only that they are incomplete. As a result, research in these areas can improve the current state of knowledge and practice.

Selecting projects from the available options and planning and scheduling them

have recently received a considerable amount of attention (Liu and Wang 2011). For

construction-related organizations, such as investors, developers, and contractors, it is

critical to gather and analyze project information to select the best options according to

their strategic goals and schedule them within the required time frame and the financial

constraints. This is a complex and multifaceted process with many contributing

factors, such as market conditions, the organization’s structure, and resource availability

(Scott 2002). Research on this topic has come from several different points of

view, such as selection model criteria and scheduling mechanisms (Martinsuo 2013),

yet the primary focus has been choosing the most appropriate projects rather than

providing a real-time dynamic model to address the project selection and scheduling

issues (Araúzo et al. 2010). Another shortcoming has been to disregard the importance

of multiple project scheduling and resource allocation under influential factors and

uncertainties, such as the economic situation of the construction industry.

Despite the available modeling proposals, companies still struggle to optimize and manage changes among their projects (Martinsuo 2013). One reason is that the proposed mathematical models cannot address the complexity of real-world situations (Araúzo et al. 2010). Another noteworthy contributor to the poor performance of existing models is the exclusion of uncertainties, such as the impact of possible upcoming projects or changes in the economic and financial situation of the construction industry.


Based on the preceding discussion, because the impact of major uncertainties on a portfolio plan has received little consideration, providing a methodology and model for the simultaneous planning and control of multiple projects remains a challenging and important task. A successful methodology and model must incorporate both ongoing and incoming projects (known and unknown) while considering major uncertainties, such as the economic condition of the construction industry.


CHAPTER 2 LITERATURE REVIEW

Definition of Project Portfolio Management

The success of a construction company is strongly affected by its ability to

strategically plan and manage a stream of projects, many of which will overlap in time,

and all of which are subject to uncertainty about their occurrence, scope, and resource

needs. This task can be broadly classified as project portfolio management. Cooper et

al. (1997) describe PPM as “dealing with the coordination and control of multiple

projects pursuing the same strategic goals and competing for the same resources,

whereby managers prioritize among projects to achieve strategic benefit.” PPM is

rooted in two complementary but independent tasks. The first is supporting investment

decision making by selecting project types and projects with the goal of

optimizing return on investment and risk (Markowitz 1952). The second is allocating available

resources across many different projects in a way that best meets the goals of those

projects (such as contract deadlines and profitability) while managing the risks involved

(Pennypacker and Dye 2002).

Modern portfolio theory was introduced by Markowitz (1952) within a financial

context. In his theory, a portfolio is defined as a set of financial assets and potential

investments, which are used to select a set of investments that either maximize return

on investment for a given risk or minimize risk for a given return on investment. Nearly three

decades later, McFarlan (1981) introduced the concept of PPM in an information

technology context. He suggested using projects as the elements of a portfolio (instead

of investments) to better achieve an organization’s objectives as well as reduce the

overall risk that the organization encounters during execution of those projects.
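For reference, the trade-off in Markowitz’s (1952) theory described above is commonly written as a mean-variance optimization. The notation below is the standard textbook form and does not appear in this chapter: w is the vector of portfolio weights, μ the vector of expected returns, Σ the covariance matrix of returns, r* a target return, and σ*² a risk budget. The first problem minimizes risk for a given return, and the second maximizes return for a given risk.

```latex
\begin{align}
\min_{w}\; & w^{\top}\Sigma\, w
  && \text{subject to } \mu^{\top} w \ge r^{*},\quad \mathbf{1}^{\top} w = 1, \\
\max_{w}\; & \mu^{\top} w
  && \text{subject to } w^{\top}\Sigma\, w \le \sigma^{*2},\quad \mathbf{1}^{\top} w = 1 .
\end{align}
```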


The first definitions of project portfolios tended to be simple and fairly close to the

financial portfolio definitions. For example, Archer and Ghasemzadeh (1999, 2004)

propose a definition of project portfolio as “a group of projects that are carried out under

the sponsorship and/or management of a particular organization.” Dye and

Pennypacker (1999) include the notion of fit to organizational strategy in their definition

for project portfolio: “a collection of projects that, in aggregate, make up an

organization’s investment strategy.” Githens (2002) adds the notions of program and fit to

organizational strategy in his definition: “a collection of projects or programs that fit into

an organizational strategy. Portfolios include the dimensions of market newness and

technical innovativeness.” The Project Management Institute (2013a; b) defines the term

portfolio in its standards as “a component collection of programs, projects, or

operations managed as a group to achieve strategic objectives.”

PPM operates at the strategic level of decision making in the organizational structure. Its components include defining, prioritizing, planning, managing, and controlling the projects and programs that make up the project portfolio in order to better distribute available resources and address the associated risks (Young and Conboy 2013). In other words, PPM is a continuous process that seeks to align the management of all projects by continually examining and updating the selection and management of projects to increase the company’s performance (Young and Conboy 2013).

However, while there is some agreement in the recent definitions of project

portfolio, there is still much variation in the definition of PPM. Authors emphasize different

aspects in their definitions, and none of them is comprehensive. For example, the Project

Management Institute (2013a) lists the PPM subprocesses and repeats its definition of

portfolio in its definition of PPM as “the coordinated management of one or more

portfolios to achieve organizational strategies and objectives. It includes interrelated

organizational processes by which an organization evaluates, selects, prioritizes, and

allocates its limited internal resources to best accomplish organizational strategies

consistent with its vision, mission, and values.” On the other hand, Dye and

Pennypacker (1999) prefer to focus on the term ‘management’ and define project

portfolio management as using management skills to satisfy an organization’s

investment strategy.

Some recent definitions emphasize strategic alignment. For instance,

Rajegopal et al. (2007) view portfolio management as a tool for implementing an

organization’s strategy. Levine (2005) similarly emphasizes the role of PPM in

contributing to the overall success of the enterprise. Cooper et al. (2001) focus on the

decision and revision processes in their definition of project portfolio management. This

definition supports the view adopted in this research that project portfolios are dynamic

entities, which must continuously be monitored, analyzed and controlled to ensure that

they are kept in line with the organizational goals. Finally, Turner and Müller (2003) take

an alternative view by building on the notion of the project portfolio as an organization.

They emphasize collective management of the projects to achieve better resource

distribution among projects and reduce uncertainty. However, their definition of a project

portfolio as an organization has not been widely accepted by the business and

academic communities.


In this research, the definition of project portfolio management is adopted from

the Project Management Institute’s (2013a) The Standard for Portfolio Management, which

appears to be the definition most widely accepted by scholars and practitioners.

The objectives of PPM are well defined in the project management literature and include maximizing portfolio value and aligning projects with the organization’s strategic objectives (Cooper et al. 2001; Elonen and Artto 2003; Teller 2013; Unger et al. 2012). The literature also describes single-project management as a necessary but not sufficient requirement for PPM (Martinsuo and Lehtonen 2007). The elements of successful PPM include average project success, synergy in management, strategic coordination, risk management, and financial success (Teller 2013).

In spite of all the models that have been developed to assist in establishing a project portfolio, allocating resources among projects, and examining portfolio success, companies have generally found that PPM models neither meet their expectations nor appropriately address the dynamic nature of project portfolios (Elonen and Artto 2003; Engwall and Jerbrant 2003).

Project Portfolio Management Methods

Many authors have developed models to address different issues in project portfolio management. In this section, the most notable methods and models proposed in the literature are reviewed and compared to illustrate the state of knowledge in this discipline and to identify its gaps.

It is evident in the project portfolio literature that there is no single project portfolio

management system that works for all companies. In fact, each company should

24

customize their framework to best suit their situation (Floricel and Miller 2003; Killen et

al. 2007). For example, Dahlgren and Söderlund (2002) reviewed the project portfolio

control mechanisms in four Swedish enterprises and found that different types of firms

have different control mechanisms depending on the level of uncertainty and the extent

of dependencies between their projects. Based on the initial findings from a qualitative

investigation in four firms and using a model which had been developed by Thompson

(1967), they proposed four types of control mechanisms: routine-based control,

resource-based control, planning-based control, and program-based control based on

the level of uncertainty and the degree of dependencies between the projects.

Conventional planning techniques are not an appropriate control mechanism in

contexts with high uncertainty because they require a certain level of stability. If the

projects are rather independent of each other but uncertainty is high, control at the portfolio level is

based on controlling the independent projects individually.

Resource-based control is centered on the choice of the project

managers (plus a delegation of authority) and the allocation of resources to projects. In

projects and portfolios with high dependencies and a high degree of uncertainty,

implementing measures for coordinating these dependencies is critical in addition to the

resource-based controls. One solution is to hold frequent progress meetings to

resolve dependencies and identify errors in the portfolio plan. This

control mode is named program-based control.

Bengtsson et al. (2009) studied coordination mechanisms (instead of the control

mechanisms studied by Dahlgren and Söderlund (2002)) in relation to the activity

context (complex or simple) and ambiguity of the tasks (clear or ambiguous). They

defined four quadrants by these two variables and then further decomposed them to

identify the coordination activities from a temporal and spatial perspective. Based on the

temporal point of view, the coordination activities are planned time, continuous time,

predesigned or flexible while the coordination activities from a spatial perspective are

networking, virtualizing, sequencing and task forcing. Although the approach of Bengtsson et al. (2009)

is more sophisticated, it shares many similarities with that of Dahlgren and

Söderlund (2002).

Danilovic and Sandkull (2002, 2005) studied the relationship between uncertainty

and dependencies in multiple project situations. They claim that the sources of

uncertainty in new product development are the organizational settings, the product

architecture, and the project management.

PPM frameworks are also critiqued for not considering all resource restrictions

(such as time and interdependence) simultaneously, and for lack of consideration of an

organization’s historical performance data which is necessary if a plan is to be based on

the organization’s capabilities (Henriksen and Traynor 1999; Martinsuo 2013).

A study by Liu and Wang (2011) presents an optimization framework for

selecting projects in a portfolio and scheduling them with consideration of time

constraints. Their model was developed for use in construction and research and

development departments to maximize benefit while considering limitations such as

budget and time constraints. Their model integrates, to some extent, the project selection

process, scheduling with priority consideration, and the correlation between projects to

optimize portfolio planning. However, their verification of the model lacks empirical data

and is based on synthetically produced data. In addition, the developed model is

too mathematically complex for industry users to implement in practice. The model could

be more user-friendly by using a more visual approach. Moreover, the model is defined

to optimize financial benefit while not all companies’ objectives are purely financial.

Also, the lack of resource utilization monitoring and sensitivity analysis, together with the need for

a more comprehensive scheduling system, are other areas in which their model can be

improved.

Providing a practical and comprehensive methodology to facilitate management

and coordination of multiple projects in a company’s portfolio is a challenging task.

There are no appropriate analytical solutions available for dynamic scheduling and

resource allocation of project portfolios in real-time (Araúzo et al. 2010). Existing

proposed mathematical models (such as those of Archer and Ghasemzadeh 1999;

Browning and Yassine 2010; Carazo et al. 2010; Engwall 2003) cannot handle the

complexity of real world challenges due to a limited consideration of significant

uncertainties within their models and a lack of provision for dynamic and real-time

analysis. Araúzo et al. (2010) have proposed a multi-agent system, where there is an

intermediate buffer between projects and resources. The buffer distributes resources

between projects with a mechanism that they called an auction, which is conceptually

the same as a real-world auction: a resource is allocated to the project that

returns the best value. They have modeled both projects and resources as agents, and

an auction mechanism tries to match them to reach the best solution in terms of

resource distribution and financial benefit while trying to satisfy time constraints. Their

model only optimizes the resource distribution for financial objectives. Also, their model

lacks the ability to consider changes in the economy such as changes in the inflation

rate, a crisis or changes in internal organizational levels such as expansion or human

resource reduction. They have argued that “results show that auction based allocation

mechanism improves schedules and resource flexibility to achieve more efficient

performance” (Araúzo et al. 2010), but it is not clear against what baseline this

comparison was made. Their research could be bolstered by comparing against

other available methods, using actual empirical data to validate their

model, and extending their model to cover changes in the economy, price adjustments,

and flexible scenarios.
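
To make the auction idea described above more concrete, the following is a minimal sketch in Python. It is not the implementation of Araúzo et al. (2010); it only assumes that each project places a bid equal to the value it expects from one unit of a resource, and that each unit goes to the highest bidder. The names ProjectAgent, value_per_slot, and run_auction, and all numbers, are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ProjectAgent:
    """Hypothetical project agent that bids for resource time slots."""
    name: str
    value_per_slot: float          # assumed benefit of receiving one slot
    slots_needed: int
    slots_won: list = field(default_factory=list)

    def bid(self) -> float:
        # Bid only while the project still needs capacity.
        if len(self.slots_won) < self.slots_needed:
            return self.value_per_slot
        return 0.0


def run_auction(projects, slots):
    """Award each slot to the highest positive bid.

    Ties go to the project listed first. Slots with no positive bid
    stay unallocated.
    """
    for slot in slots:
        bids = [(p.bid(), -i, p) for i, p in enumerate(projects)]
        if not bids:
            break
        best_bid, _, winner = max(bids, key=lambda b: (b[0], b[1]))
        if best_bid > 0:
            winner.slots_won.append(slot)
    return {p.name: p.slots_won for p in projects}


# Illustrative data only.
projects = [ProjectAgent("bridge", 5.0, 2), ProjectAgent("resurfacing", 3.0, 1)]
allocation = run_auction(projects, ["week1", "week2", "week3", "week4"])
```

In this toy run the higher-value project is served first and the fourth slot is left idle, which also illustrates the limitation noted above: the allocation reflects only the stated financial value of the projects already known to the auction.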

Petit and Hobbs (2010) identify changes in scope, new

customers, and new products as the most important sources of uncertainty with significant

impact on a company’s portfolio performance. In the construction industry,

each new building or facility is usually unique and can be considered a new product.

This shows how critical it is to address uncertainty in construction companies’ portfolio

planning. Also, Petit and Hobbs (2010) identified third party suppliers,

organizational change, and changes in processes as some sources of uncertainty in

portfolio management with medium to relatively high impact. These contributing factors

are primarily part of any new construction project. As a result, modeling and considering

the impact of upcoming projects is critical in the construction context.

The main priority of PPM publications and research was initially to improve

organizational performance by introducing good practices to choose and prioritize

projects and make certain that the right mix of projects was executed. A recurring theme

is the alignment of the projects with the organization’s strategy. There is also extensive

literature on project selection with different quantitative approaches. However, most

empirical research fails to demonstrate much about the application of these models in

practice. Also, notably, there is no forward-looking theme in portfolio

management that incorporates future project streams into the models.

Table 2-1 shows a summary of the literature on approaches toward project

portfolio management and compares the tackled problem, proposed solution, and limitations

and gap of each study to help demonstrate the contribution of this dissertation. It

is clear that none of the presented models includes unknown future project streams in

their portfolio management methods and their planning horizon is limited to the known

projects. However, it is repeatedly argued that upcoming projects significantly impact a

portfolio’s performance. The proposed model in this research is not a standalone

portfolio management framework but should be considered as a supplementary

component to current PPM frameworks. It can be used as an add-on to the existing

PPM models to extend their horizon of planning and assist strategic planning by

forecasting unknown future projects. In this research, it is not proposed that developed

models are incorrect. Instead, it is argued they are incomplete, and the strategic horizon

of portfolio planning can be extended by using the proposed method in this dissertation.

Table 2-1. Summary of the literature on approaches toward project portfolio management

| Author(s) | Research Problem | Solution | Limitations and Gap |
|---|---|---|---|
| Henriksen and Traynor (1999) | Project evaluation and selection in a portfolio | Developed a new algorithm with criteria of relevance, risk, reasonableness, and return | Limited application to research and development project evaluation and only focuses on known project selection |
| Liu and Wang (2011) | Project selection and scheduling problems with time-dependent resource constraints | Developed an optimization model using constraint programming | Considers only known projects and financial objectives and lacks monitoring of resource utilization |
| Ghasemzadeh et al. (1999) | Selecting and scheduling an optimal project portfolio, based on the organization’s objectives and constraints | Developed a zero-one integer linear programming model | Model lacks the ability of dynamic and real-time analysis. Only considers known projects. Lack of consideration of uncertainties in the model |
| Archer and Ghasemzadeh (1999) | Selecting projects for a portfolio | Developed a qualitative multistage framework for selecting projects | Qualitative, only considers known projects and focuses on selecting projects |
| Browning and Yassine (2010) | Performance of priority rules in static resource-constrained multi-project scheduling | Sensitivity analysis of priority rule method in different contexts by simulation | Deterministic, and only considers known projects |
| Carazo et al. (2010) | Selection and scheduling of project portfolios from a set of candidate projects | A multi-objective binary programming model using a metaheuristic procedure | Only works for a pool of known projects |
| Araúzo et al. (2010) | Dynamic scheduling of resources within a portfolio | Distributing resources by a multi-agent system through an auction mechanism | Model is limited to resource allocation optimization. It considers only known projects |
| Danilovic and Sandkull (2005) | Interdependencies and relations in a multi-project environment | Dependence structure matrix and domain mapping matrix approach is suggested | Mainly focuses on multi-project management and only considers known projects |

Uncertainties in Project Portfolio Management

The concept of uncertainty is very significant within the field of project portfolio

management. Duncan (1972) and Daft (2009), for example, demonstrated that changes

in the business environment combined with projects with high complexity always result

in an increase in uncertainty in parameters such as the number of projects, how rapidly

and how closely to plan projects progress, and changes in economic conditions. This

has led to an extensive literature on uncertainty and the ways to handle it in

management. The following is a literature review and discussion of the terminology and

concepts of areas related to uncertainty in management, including risks, risk

management, changes, unexpected events, and uncertainty.

Risk management has long been one of the essential parts of the body of knowledge in

project management. There is plenty of literature in this area

(Persson et al. 2009; Ward and Chapman 2003; Wideman 1992), and typically there is

at least one chapter in project management books dedicated to risk management (Gray

and Larson 2008; Kerzner 2009). Risk management is also covered in the PMI

standard for project management, the Project Management Body of Knowledge (PMBOK),

which defines a project risk as “an uncertain event or condition that, if it occurs, has a

positive or negative effect on one or more project objectives.” (Project Management Institute

2013b)

PMI uses a similar definition for project portfolio risks, in which the

effects are on the portfolio rather than on the project objectives: “An uncertain event,

set of events or conditions that, if they occur, have one or more effects, either positive

or negative on at least one strategic business objective of the portfolio.” (Project

Management Institute 2013a)

There are a number of methods developed to predict the probability and measure

the effect of risks on a project. One of the classic ways to do so is based on the degree

of knowledge about the probability of occurrence and the impact of the risk. This

perspective leads to four categories as follows (Cleden 2009):

1. Known-known: This type is related to states such as a predictable future and confirmable evidence. It is defined as the risks that we know may happen and whose impact we also know.

2. Unknown-known: This type is related to issues such as unutilized skills and potential. It is defined as the risks that we do not know may happen but whose impact we know.

3. Known-unknown: This type is related to recognized risks that we are aware might occur, but we do not know when, nor what their impact might be. A possible delay in the delivery of a piece of equipment is an example of something that we are aware we do not know.

4. Unknown-unknown: This type is related to events that we do not know exist or might happen, and whose impact we do not know. Gaps in knowledge and unpredictable events are some examples of this type.

Different procedures have been developed to handle risks, mainly of the known-unknown

type, with remedies such as reducing the probability of occurrence (risk

mitigation) or reducing their impact on the project. This can be seen in the definitions of

risk management given by PMI and the Association for Project Management (APM):

“Project Risk Management includes the processes of conducting risk

management planning, identification, analysis, response planning, and controlling risk

on a project. The objectives of project risk management are to increase the likelihood

and impact of positive events, and decrease the likelihood and impact of negative

events in the project.” (Project Management Institute 2013b)

“Project Risk Management is a structured process that allows individual risk

events and overall project risk to be understood and managed proactively, optimizing

project success by minimizing threats and maximizing opportunities.” (Association for

Project Management 2006)

The technical term “risk” is typically restricted to events rather than more generic

sources of uncertainty. In dynamic environments with high fluctuation and instability,

managers usually try to go beyond routine risk management practices such as planning

and control systems and instead use more flexible systems. In fact, the dilemma is to find

the balance between planning and learning (De Meyer et al. 2002).

The impact of uncertainty on organizations is well established across many

disciplines from psychology to economics (Petit and Hobbs 2010). Environmental

uncertainties and their relation to organizations are analogous to the state of a person

with a shortage of critical information about the environment. Scott (2002) provides an

example of the definition of environmental uncertainty as variability or the extent of

predictability of the environment where work is executed. Scott also introduces some

measures for uncertainty, such as variability of inputs, the number of deviations in work

process, and the number of changes in the main products. In the project management

context, uncertainty in a project is defined as the accuracy of predicting the variation of

resource consumption, output, and work process (Dahlgren and Söderlund 2002).

Uncertainty in a project can be seen as variation from the expected performance of the

system under investigation. It can also be seen as variation from the basis on which the work

activities are carried out, together with the unforeseeable performance of humans. There are

different measures of uncertainties based on the variance in data and number of

anomalies in the workflow. In fact, risk can be defined as a measure for certain kinds of

uncertainties.

Using the uncertainty concept, De Meyer et al. (2002) propose to replace the

known-unknown method for risk classification with four types of uncertainties as follows:

1. Variation: Originates from the aggregation of small influences that can have differing impacts on an activity or a project. Managers can still plan and control the project with scheduling techniques such as PERT or Monte Carlo simulation.

2. Foreseen uncertainties: These known or identifiable variables are similar to risks with known-known status and can be mitigated with contingency remedies.

3. Unforeseen uncertainties: In terms of risks, this category is similar to the unknown-unknown category. For example, this situation can arise from an unpredictable chain reaction of many foreseeable events.

4. Chaos: Typically, projects start with certain goals, objectives, and a final product in mind, but in cases such as technology development or research and development projects, the whole structure and basis of the project is uncertain. Sometimes the final product is entirely different from the project’s initial aim.
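
As an illustration of how the first category (variation) can be handled with the scheduling techniques named in the list, the sketch below computes the classic PERT three-point estimate for one activity and then runs a small Monte Carlo simulation of two activities performed in sequence. The function names, the three-point values, and the use of a triangular distribution for sampling are assumptions made here for illustration; they are not taken from De Meyer et al. (2002).

```python
import random
import statistics


def pert_mean_sd(optimistic, most_likely, pessimistic):
    """Classic PERT three-point estimate of an activity duration."""
    mean = (optimistic + 4 * most_likely + pessimistic) / 6
    sd = (pessimistic - optimistic) / 6
    return mean, sd


def simulate_chain(activities, runs=10_000, seed=1):
    """Monte Carlo total duration of activities performed in sequence.

    Each activity is an (optimistic, most_likely, pessimistic) tuple.
    Durations are drawn from a triangular distribution, an assumption
    made for simplicity. Returns the mean and the 90th percentile.
    """
    rng = random.Random(seed)
    totals = [
        sum(rng.triangular(a, b, m) for a, m, b in activities)
        for _ in range(runs)
    ]
    p90 = statistics.quantiles(totals, n=10)[8]
    return statistics.mean(totals), p90


# Made-up durations in weeks.
mean, sd = pert_mean_sd(2, 4, 12)
sim_mean, sim_p90 = simulate_chain([(2, 4, 12), (1, 2, 3)])
```

The gap between the simulated mean and the 90th percentile gives a rough sense of how much contingency such small, aggregated variations might call for. The other three categories are not addressed by this kind of calculation.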

Project Management Institute (2013a), in its standard for portfolio management,

defines portfolio risk as “an uncertain event or condition that, if it occurs, has a positive

or negative effect on one or more project objectives.” It then describes risk

management in a portfolio as “a structured process for assessing and analyzing

portfolio risks with the goal of capitalizing on the potential opportunities and mitigating

those events, activities, or circumstances which can adversely impact the portfolio.”

The PMI standard for portfolio management, despite introducing the risk management

concept at a portfolio level, does not provide much information on how managers should

handle uncertainty and risk within their portfolio. It only provides guidelines on

categorizing different possible stages and processes and names some of the possible

techniques available to handle uncertainties. The PMI only suggests monitoring risks

and the performance of the project portfolio under the monitoring and control process

group. The proposed framework by the PMI also includes monitoring changes in

business strategy. This is an important task because when such a change occurs, it might result in a

complete realignment of the portfolio. The mechanisms involved in this realignment are

not specified other than restarting the whole PPM process from the beginning. Also,

ad hoc disturbances to the ongoing and approved project portfolios are almost entirely

neglected. This oversight is not because the topic lacks interest or that authors assume

a stable and predictable environment. Rather, it can probably be explained by the fact

that the subject of PPM is relatively young and that the researchers and academics

preferred to focus on more pressing issues in this area. For many companies, the

environment is unstable, and the high level of uncertainty and unknowns resulting from

the dynamic environment lead to some challenges. New tools and techniques are

required to help manage portfolios that exist within a continuously changing non-ergodic

environment.

Martinsuo's (2013) review of empirical research on PPM noted that uncertainty

and constant changes in company portfolios have a considerable impact on project

portfolio performance. Furthermore, the review proposes that further research is required into

PPM as a continuous process of project selection, resource allocation optimization,

sensing and adapting to changes within an uncertain dynamic environment. Martinsuo's

(2013) findings can be summarized as follows:

1. Projects outside the preliminary portfolio compete for available resources with projects in the portfolio, and as a result, actual work progress will differ from the portfolio plan if the portfolio is not defined comprehensively.

2. Uncertainty and constant changes in companies’ portfolios have a considerable impact on project portfolio performance.

3. Project portfolio management as a continuous process of selecting projects, optimizing resource allocation, sensing changes, and adapting to them in a dynamic environment, with consideration of the impact of uncertainties on the portfolio schedule, demands further research.

Uncertainty can be divided into two broad categories. First, uncertainty which is

internal to a project affecting its performance (resulting from, for example, unexplained

variance in production rates or unexpected delays to the delivery of resources). Second,

uncertainty concerning the future supply of work, as affected by the number, size,

and timing of jobs that become available for bidding, and a contractor’s success in

securing the winning bid.

Most modeling efforts have focused on the first of these two categories of

uncertainty, that which is internal to a project. However, upcoming projects significantly

affect the performance of a project portfolio (Araúzo et al. 2010) and are essential to

medium and long-term strategic planning. The typical approach when a new project is

acquired is to update the project portfolio's plans and to try to re-optimize everything.

This is neither practical (requiring frequent updates to the plans) nor efficient. In fact, a

practical portfolio management scheme should enable a user to see different scenarios

based on possible upcoming projects in a pipeline and incorporate possible impacts of

significant uncertainties on a portfolio to facilitate a better understanding of the future

options and likely best strategies. One solution to this issue would be to use a

stochastic sampling of streams of uncertain upcoming projects operating within

alternative business environments and model the resultant impact on the portfolio

performance. The idea is to plan proactively by taking into account statistical

assessments of potential future opportunities and needs, as opposed to planning

reactively (and therefore iteratively) to changes in the current circumstances. These

streams would allow PPM to be used for strategic project selection and resource

planning taking into account a potential future stream of known projects and unknown

projects.

The literature review showed that little work has been done on predicting

future project streams and incorporating them into portfolio management. Nevertheless, it

has been recognized that upcoming projects significantly affect portfolio performance.

The idea of predicting future project streams and incorporating them into the portfolio

management framework is a novel approach to portfolio management, which is

investigated in this research. In this research, it is not proposed that developed models are

incorrect; instead, it is shown that they are incomplete. For example, focusing on a pool

of known projects and lacking insight into unknown future projects are among the

reasons for the inadequacy of developed models. This research is intended to build upon

developed methods and findings of previous research to develop a new model to

address the discussed problems of practicality and efficiency in project portfolio planning. In

this research, it is suggested to enhance the current frameworks with empirical data and

conceptual supplements to help manage project portfolios with consideration of

uncertainties such as future project streams, and thereby help companies plan their strategy

better.

CHAPTER 3

PROBLEM STATEMENT AND RESEARCH METHODOLOGY

Research Scope

Current project planning practices are only focused at the project level, while the

portfolio approach covers joint project management and considers the interactions

among individual undertakings. Master planning is planning for a full set of ongoing

projects, which makes it somewhat similar to program management (simultaneously

managing a selection of ongoing projects). In portfolio planning, one of the main

objectives is to decide which available future projects the company should pursue to

optimize its objectives. However, with master planning, the focus is only on current

projects, whether it be all of them or solely a selection. A comparison between master

planning and multi-project management, on the one hand, and the proposed models for

project portfolio management (PPM), on the other hand, demonstrates that while they

may provide the same type of results, portfolio management models yield more realistic

outcomes and better insight into how future projects will impact the portfolio plan. Future

projects can consist of projects on which the company has bid; projects that the firm has

won, but that will start in the future; and projects about which the company knows

little. This research introduces a new method for approaching current

portfolio management. Specifically, it describes the development of a project stream

generator that allows users to evaluate their portfolios by considering unknown future

projects in their planning processes.

This research not only contributes to the body of knowledge in the management

field but also has important practical applications in managing companies and

enterprises. Regardless of whether a firm is a small or medium-sized company or a

large international corporation, its managers need to optimize their resource (finances,

materials, human resources, and equipment) allocations across the projects in their

portfolios to more successfully and effectively achieve their objectives. Moreover, the

question remains of how upcoming projects with which a company is unfamiliar may

affect its portfolio. Due to the complexity of the issue, the solution is neither intuitive nor

apparent. Based on the presented literature review, the current PPM models are mostly

deterministic, while it is evident that the appropriate tools to plan for the future are

nondeterministic. As a result, current models’ methodologies fall short in terms of

incorporating significant uncertainties, such as uncertain upcoming projects and the

construction industry’s economic situation. This leads to a lack of practicality and a

deviation between the expected results and a company’s actual performance.

Consequently, comprehensive research is required to develop a portfolio management

framework that addresses these problems. The proposed approach provides a more

realistic plan for companies, and as a result, it achieves significant savings in terms of

both financial and human resources. Further, a better planning process will reduce idle

time for both employees and equipment, which will result in greater efficiency and less

waste.

The proposed model is not a standalone portfolio management framework;

rather, it should be considered a component of a broader portfolio management system.

It can be used as an extension of current PPM models, one that extends their planning

horizon and assists with strategic planning by forecasting unknown future projects. This

research seeks to extend PPM to include streams of projects advancing far enough into

the future, thereby facilitating medium- and long-term strategic planning. Incorporating

these streams would allow PPM to be used for strategic project selection and

resource planning, taking into account both known and unknown potential future

projects. Known future projects are those that have been announced but have not yet

been awarded to a contractor. Unknown projects are those that have not been

announced (they may still be in the design process or may not have even been

conceived) but can be modeled as a statistical expectancy based on historical data.

These streams are developed using stochastic techniques that are statistically

representative of potential future outcomes. A statistically significant sample of these

streams can then be filtered through a company’s bidding success model, with the

output used to optimize strategic planning, taking into account uncertainty and variance

in the future market. The sensitivity of a plan’s optimality to changes in key market

parameters can also be tested, and appropriate contingencies for such events thereby

established.
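
The pipeline described above (sample many possible project streams, filter each through a bidding success model, and examine the spread of outcomes) can be sketched as follows. This is only a toy illustration under loudly stated assumptions, not the generator developed in this research: lettings are assumed to arrive as a Poisson process, costs and durations are drawn from lognormal distributions with placeholder parameters, and bidding success is reduced to a single fixed win probability. All names and numbers in the code are hypothetical.

```python
import random
import statistics
from dataclasses import dataclass


@dataclass
class Project:
    letting_month: float   # time of letting, in months from now
    cost: float            # contract value in dollars
    duration_days: float


def sample_stream(horizon_months, rate_per_month, rng):
    """Draw one possible stream of future lettings over the horizon.

    Exponential gaps between lettings give a Poisson arrival process
    (an assumption, not a fitted model).
    """
    stream = []
    t = rng.expovariate(rate_per_month)
    while t < horizon_months:
        stream.append(Project(
            letting_month=t,
            cost=rng.lognormvariate(14.0, 1.0),          # placeholder parameters
            duration_days=rng.lognormvariate(5.0, 0.5),  # placeholder parameters
        ))
        t += rng.expovariate(rate_per_month)
    return stream


def won_projects(stream, win_probability, rng):
    """Crude stand-in for a company's bidding success model."""
    return [p for p in stream if rng.random() < win_probability]


def simulate_won_value(n_streams=1000, horizon_months=24,
                       rate_per_month=19.0, win_probability=0.1, seed=7):
    """Total contract value won in each sampled stream."""
    rng = random.Random(seed)
    totals = []
    for _ in range(n_streams):
        stream = sample_stream(horizon_months, rate_per_month, rng)
        won = won_projects(stream, win_probability, rng)
        totals.append(sum(p.cost for p in won))
    return totals


totals = simulate_won_value(n_streams=200)
deciles = statistics.quantiles(totals, n=10)
p10, p90 = deciles[0], deciles[8]
```

The interval between p10 and p90 is the kind of output a planner could feed into an existing PPM model as a range of plausible future workload. Rerunning the simulation with a different arrival rate or win probability is a simple way to probe the sensitivity of a plan to changes in market parameters.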

The scope of this research is limited to developing a stochastic project stream

generator based on the past 14 years of Florida Department of Transportation (FDOT)

design-bid-build projects.

Aim

• To develop a stochastic project stream generator to predict FDOT project streams in terms of occurrence, cost, and duration to facilitate short-, medium-, and long-term strategic planning. This project stream generator can be used as a supplement to current portfolio planning models to extend the planning horizon beyond known projects. This is a proof of concept to see whether it is feasible and how one may proceed in developing a general solution for the construction industry.

Objectives

• To devise a framework for developing a stochastic project stream generator for FDOT-led design-bid-build projects in terms of their occurrence, budget, and duration

• To identify the components of the model based on the literature and data limitations

• To identify appropriate models for each component

• To test different models for each component and finalize each component by optimizing and validating the best model

• To combine the components, and build and test the stochastic project stream generator

Data Structure

The main data for this study were obtained from the FDOT’s historical project

lettings database, which covers the past 14 years (from 2003 to 2017). The database

contains 3,192 project-letting reports. Based on the letting date, the monthly, quarterly,

and annual project frequencies were calculated as secondary variables.
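
As a minimal sketch of how such secondary frequency variables can be derived from letting dates, consider the following Python fragment. The function name and the sample dates are hypothetical, and the use of calendar quarters (rather than, for example, fiscal quarters) is an assumption.

```python
from collections import Counter
from datetime import date


def letting_frequencies(letting_dates):
    """Count lettings per month, calendar quarter, and year."""
    monthly = Counter((d.year, d.month) for d in letting_dates)
    quarterly = Counter((d.year, (d.month - 1) // 3 + 1) for d in letting_dates)
    annual = Counter(d.year for d in letting_dates)
    return monthly, quarterly, annual


# Made-up letting dates for illustration only (not FDOT data).
sample = [
    date(2016, 1, 27), date(2016, 1, 27), date(2016, 3, 30),
    date(2016, 6, 15), date(2017, 2, 22),
]
monthly, quarterly, annual = letting_frequencies(sample)
```

Periods with no lettings do not appear as keys, although looking them up in a Counter returns zero. A time-series model would normally need those zero-count periods filled in explicitly.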

A pool of candidate variables, including macroeconomic metrics and

construction indices, was compiled from relevant sources and the literature

(Shahandashti and Ashuri 2016). Table 3-1 provides a list of these variables and their

sources.

Table 3-1. Candidate variables and sources

| Candidate Variables | Source |
|---|---|
| Gross domestic product (GDP) | U.S. Bureau of Economic Analysis |
| GDP implicit price deflator | U.S. Bureau of Economic Analysis |
| Inflation rate | World Bank |
| Consumer price index | U.S. Bureau of Labor Statistics |
| National highway construction cost index (NHCCI) | U.S. Department of Transportation |
| FDOT’s annual budget | Florida Department of Transportation |
| FDOT’s product budget | Florida Department of Transportation |
| Federal funds rate | Federal Reserve System |
| Unemployment rate | U.S. Bureau of Labor Statistics |
| Florida unemployment rate | U.S. Bureau of Labor Statistics |
| Number of employees in construction | U.S. Bureau of Labor Statistics |
| Number of employees in construction in FL | U.S. Bureau of Labor Statistics |
| Average weekly hours | U.S. Bureau of Labor Statistics |
| Prime loan rate | Federal Reserve System |
| Building permits | U.S. Census Bureau |
| Money supply | Federal Reserve System |
| Average hourly earnings | U.S. Bureau of Labor Statistics |
| Employment Cost Index (ECI) Civilian | U.S. Bureau of Labor Statistics |
| Dow Jones industrial average | Yahoo Finance |
| Crude oil price | U.S. Energy Information Administration |
| Brent oil price | U.S. Energy Information Administration |
| Producer price index | U.S. Bureau of Labor Statistics |
| Housing starts | U.S. Census Bureau |
| Construction spending | U.S. Census Bureau |

Factors such as infrastructure needs can also influence future project streams.

As infrastructure projects require substantial advance planning and budgeting, it was

assumed that including variables such as the FDOT’s annual budget and product

budget would capture some of these factors, such as the impact of infrastructure needs

on future project stream behavior.

Figure 3-1 presents the data structure and the investigated connections between

the variables. Notably, when using the cumulative dataset for a period, the duration and

cost variables were unusable, since it would have been meaningless to sum project

durations and costs in a particular month or quarter. However, the cumulative datasets

provided the project frequencies for different timeframes.

Figure 3-1. Data structure

Research Design

Little research has been conducted on nondeterministic project portfolio modeling

and forecasting unknown future projects. As a result, this study was inductive and used

grounded theory. In that approach, the researcher tries to develop a model or theory

based on the collected primary data.

In this study, synthetic analysis was employed. According to Auyang (2005),

synthetic analysis first acquires an abstract view of a complex problem and recognizes its

characteristics. Then, to describe these traits, it breaks the problem into smaller

modules and studies them independently. Ultimately, it synthesizes all the individual

outcomes to find a solution. Trial and error is typically necessary to achieve acceptable

results. Synthetic analysis tries to solve complex problems by looking at their

components; however, it never loses sight of the whole system. In fact, this approach is

the opposite of reductionism, which solely examines the parts of a system.

To date, research has focused on the selection and prioritization of projects from

among a pool of known projects, ignoring the opportunities and needs of unknown

future projects. That method is, by definition, a short-term planning strategy with no

guarantee of satisfying a company’s longer-term goals. The current horizon of strategic

planning—which covers selecting projects for bidding, planning for their contractual

needs, and identifying the resources necessary for execution—is limited to considering

projects advertised in the market.

The proposed approach rests on the assumptions that unknown future projects

can be represented statistically and that by bringing them into the strategic planning

process, companies can devise more appropriate medium- and long-term strategies.

Forecasts of a company’s unknown future projects can be based on its past and present

portfolio data. Alternately, historical market data can be employed to forecast all

upcoming projects as project streams and to filter those project streams by bidding

success models. In a highly competitive environment in which the supply of projects is

scarce, using only a company’s past projects to forecast future unknown projects is a

potentially less accurate method. Arguably, it is more valid to forecast streams of

unknown projects (all the projects that will be available in future), considering contextual

uncertainties and filtering those projects using bidding success models to predict the

final future projects comprising a company’s portfolio. Such a forecast could statistically

generate a single set of outputs or stochastically produce streams of values as outputs.

Considering the uncertainties in the market, the PPM context, and the availability of

future projects, stochastic forecasting appeared to be the right choice.

The outputs from the generator are those parameters most critical to a company,

namely, the occurrence and letting date of a project, its expected duration, and its

anticipated cost. Other factors, such as economic conditions, can have an impact on the

project stream. The project stream generator can be divided into three sections. First,

the project frequency model forecasts the number of projects for each month. Second,

the generator estimates the cost of each project, and third, it predicts the duration of

each project, while considering any possible relationship between the duration and cost.

The sequence of information generation in the proposed model is illustrated in

Figure 3-2. The first step is to forecast the number of projects (frequency) for the

chosen time span. Next, sampling from the project cost distribution takes place. At each

point in time, the number of samples from the distribution is based on the number of

projects forecasted in the previous step. Finally, the same process is applied to the

duration distribution. The sampling process should also consider a potential correlation

between cost and duration.
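
The generation sequence described here can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation used in this study: the lognormal marginals, the correlation value `rho`, and the parameters `cost_mu`, `cost_sigma`, `dur_mu`, and `dur_sigma` are hypothetical placeholders, and the monthly frequencies are supplied directly instead of coming from a fitted frequency model.

```python
import numpy as np

def generate_project_stream(monthly_counts, rho=0.6, cost_mu=14.0, cost_sigma=1.0,
                            dur_mu=5.0, dur_sigma=0.5, seed=0):
    """Return (month, cost, duration) tuples, one per forecasted project.

    Every distribution parameter here is a hypothetical placeholder.
    """
    rng = np.random.default_rng(seed)
    # Assumed correlation between log-cost and log-duration.
    cov = np.array([[1.0, rho], [rho, 1.0]])
    stream = []
    for month, n in enumerate(monthly_counts):
        if n == 0:
            continue  # no projects forecasted for this month
        # Step 2 and 3: draw n correlated standard-normal pairs, then map
        # them to lognormal cost and duration values.
        z = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=n)
        costs = np.exp(cost_mu + cost_sigma * z[:, 0])
        durations = np.exp(dur_mu + dur_sigma * z[:, 1])
        stream.extend((month, float(c), float(d)) for c, d in zip(costs, durations))
    return stream

# Step 1 (frequency forecast) is represented by a fixed list in this sketch.
counts = [3, 0, 5, 2]
projects = generate_project_stream(counts)
```

Drawing cost and duration jointly, rather than from two independent distributions, is what lets the sampler respect a potential correlation between the two.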

Figure 3-2. The sequence of generating information

The complete set of results produced by the proposed framework can be used as

an input for any PPM model to consider unknown future projects in strategic planning.

[Figure 3-2 sequence: frequency, then project cost, then project duration]

CHAPTER 4 MODEL COMPONENT DEVELOPMENT

This chapter covers the overall model’s component development. First, the

project frequency forecast modeling is discussed. Then, the characterization of projects’

cost and duration distributions is presented.

Project Frequency Modeling

The first step in the simulation was forecasting the number of projects for each

month. This can be achieved with a range of modeling techniques, and the following

sections extensively discuss and analyze these options. In summary, the first task was

to identify appropriate models based on the data’s characteristics and limitations, taking

into account the model’s objectives. Next, the data were divided by a time-series cross-

validation method for training and testing the identified models, so as to identify the one

with the best performance. Afterward, model development and optimization took place.

During this stage, different models were trained and tested against actual data, while

parameter optimization and feature selection were completed. The output from this step

was the best-performing model with the right features and parameters for forecasting

the monthly project frequencies in the simulation. Before employing that model,

however, it was necessary to run diagnostic tests to check its stability. For instance,

checking for an autocorrelation between the forecast residuals was an appropriate tool

for the time-series forecasts. Moreover, assessing how error compounded and

undertaking a sensitivity analysis to identify how the parameter values affected the

model’s output yielded more insight into its performance.

Model Identification

Project frequency models can be divided into univariate and multivariate

methods. Models concerning time-series data frequently use the values from one or

more previous time steps to forecast values for the succeeding point in time; in other

words, they regress based on past values. In conventional modeling, the assumption is

that the independent values are known, and the dependent values are forecasted.

However, in multivariate time-series forecasting, even the independent variables’ future

values are unknown and must be estimated. As a result, such a model contains a

system of equations that forecast future values for both independent and dependent

variables. This system is recursive when all the causal relationships are unidirectional,

and it is non-recursive (simultaneous) when there is reciprocal causation between

variables.
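
A recursive system of this kind can be illustrated with a small sketch. The two equations and all coefficients (`a`, `b`, `c`, `d`, `e`) are hypothetical and chosen only for the example; the point is that the independent variable is first projected from its own past and that projection then feeds the equation for the dependent variable.

```python
def recursive_forecast(y_hist, x_hist, steps, a=0.5, b=0.3, c=2.0, d=0.8, e=1.0):
    """Forecast y when the future values of x are also unknown.

    Hypothetical system (coefficients are illustrative assumptions):
        x_t = e + d * x_{t-1}                 independent variable, own past only
        y_t = c + a * y_{t-1} + b * x_{t-1}   dependent variable
    Causation runs one way (x -> y), so the system is recursive.
    """
    y, x = list(y_hist), list(x_hist)
    for _ in range(steps):
        x_next = e + d * x[-1]
        y_next = c + a * y[-1] + b * x[-1]
        x.append(x_next)
        y.append(y_next)
    return y[len(y_hist):], x[len(x_hist):]

y_fc, x_fc = recursive_forecast([10.0], [5.0], steps=2)
```

In a non-recursive (simultaneous) system, the equation for `x` would also contain `y`, and the two could no longer be solved one after the other in this way.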

Figure 4-1 demonstrates four possible internal structures of the model. Figure 4-

1A highlights the dependencies between the inputs and output in a univariate

autoregressive (AR) model with two lags. In this example, the forecast value at each

point in time is based on the two preceding values. Figure 4-1B contains a recursive

multivariate model where the dependent variable forecast is based on past values of

both itself and the independent variables. However, each independent variable is only

based on its own past values. Figure 4-1C displays another recursive model, which

differs from model 4-1B in that the independent variables also act as inputs to each

other. Figure 4-1D depicts an example of a non-recursive (simultaneous) model where

all the variables work as inputs for each other. There is no discrimination between

dependent and independent variables in this approach.

Figure 4-1. Possible internal structures of the model A) univariate AR model B) recursive multivariate model without dependency C) recursive multivariate model with dependency D) non-recursive (simultaneous) model

Univariate models (e.g., AR, moving average (MA), autoregressive moving

average (ARMA), and exponential smoothing models) are among the most widely used

time-series forecasting methods cited in the literature. These univariate methods need

fewer data points as compared to the more complex multivariate models. However, they

cannot account for the interaction between important factors by design. In contrast,

multivariate models, by using other variables as input, can account for those factors’

effects and their interactions with the dependent variable. Linear regression

implementations, artificial neural networks, random forest models, support vector

machines, and Gaussian processes are among the most widely used methods in

forecasting time-series data in the literature. Cargnoni et al. (1997) used Gaussian

models to forecast the number of high-school students in each grade in future school

years in the Italian school system. Voyant et al. (2017) employed a multilayer

perceptron to forecast global solar radiation. Li and Chen (2014) used a LASSO (Least

Absolute Shrinkage And Selection Operator) based regression to estimate

macroeconomic time series, and they demonstrated how this method could be

combined with a dynamic factor model to yield a more accurate forecast performance.

Exterkate et al. (2016) used kernel ridge regression as a multivariate model for

economic time-series forecasting by considering the nonlinear relationships among the

variables. They found that this method outperformed traditional time-series forecasting

techniques based on principal components. Yu and Liong (2007) compared the linear

ridge regression, ARIMA (Autoregressive Integrated Moving Average), naïve, inverse

approach, and support vector machine in forecasting hydrologic time series and

concluded that the ridge linear regression outperformed the other models in terms of

both performance and time of execution. Choubin et al. (2016) compared multiple linear

regression, a multilayer perceptron neural network, and an adaptive neuro-fuzzy

inference system for forecasting precipitation and concluded that the multilayer

perceptron neural network outperformed the other methods. Cao and Tay (2003) used a

support vector machine for financial time-series forecasting and compared it with a

multilayer back-propagation neural network and a regularized radial basis function

neural network. They concluded that the support vector machine outperformed the

back-propagation neural network and produced a performance similar to that of the

regularized radial basis function neural network. Different implementations of artificial

neural networks have been employed in univariate and multivariate time-series

forecasting. Gers et al. (2002) demonstrated how Long Short-Term Memory (LSTM)

neural nets can be used for time-series forecasting to solve problems that regular

feedforward networks are unable to resolve. Kohzadi et al. (1996) compared a

feedforward neural network with an ARMA model and found that the former

outperformed the ARMA model in forecasting time series.

To choose the right models for the research problem and better understand the

nature of project frequency series, a set of preliminary analyses was required. An

essential analysis tested for stationarity and identified the order of differencing that

made the series stationary. A time series is stationary if its mean and variance evolve

around constant values. To implement many of the modeling methods, verifying a

series’ stationarity was necessary. However, series that are not stationary can be

transformed into that format via tools such as differencing. Differencing consists of

calculating the difference between consecutive data points, and the order of difference

is the number of times a series must be differenced to make it stationary. Figure 4-2

illustrates the rolling mean and standard deviation of the project frequencies, plotted

with the actual data. Both statistics appear to fluctuate around constant values.
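
Differencing and the rolling statistics can be sketched as follows. The helper names and the window length are illustrative assumptions, not the settings used to produce the figure.

```python
def difference(series, order=1):
    """Apply first differencing `order` times."""
    out = list(series)
    for _ in range(order):
        out = [b - a for a, b in zip(out, out[1:])]
    return out

def rolling_mean_std(series, window):
    """Rolling mean and sample standard deviation over a fixed window."""
    means, stds = [], []
    for i in range(window, len(series) + 1):
        chunk = series[i - window:i]
        m = sum(chunk) / window
        var = sum((v - m) ** 2 for v in chunk) / (window - 1)
        means.append(m)
        stds.append(var ** 0.5)
    return means, stds

# A linear trend is nonstationary in the mean; one difference removes it.
trend = [2 * t + 1 for t in range(10)]
first_diff = difference(trend)        # constant series
second_diff = difference(trend, 2)    # all zeros
```

A series whose rolling mean and standard deviation stay roughly flat, as in the differenced example, is a candidate for being treated as stationary, subject to a formal test.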

Figure 4-2. Rolling mean and standard deviation of the project frequencies

An Augmented Dickey-Fuller test (ADF) was conducted to assess the stationarity

of the data. There are three variations of the ADF test, all with the null hypothesis that a

unit root is present in a time-series sample (series is not stationary). If the null

hypothesis is rejected under any of the three variations, it can be inferred that the time

series is stationary. Choosing the appropriate lag in the ADF is critical. In this research,

the suitable lag was selected based on the Akaike information criterion (AIC). A

summary of the ADF results is presented in Table 4-1. Under the intercept-only variation,

the null hypothesis could be rejected at the 95% confidence level (p = 0.027). Thus, the

frequency series was treated as stationary.
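
The regression behind the intercept-only variation of the test can be sketched with NumPy. This is a simplified illustration: the function name is hypothetical, the lag is fixed by the caller rather than selected by the AIC, and no p-value is computed. The resulting statistic is compared against Dickey-Fuller critical values (approximately -2.86 at the 5% level for the intercept-only case in large samples), not against the usual t-distribution.

```python
import numpy as np

def adf_t_statistic(series, lags=1):
    """t-statistic on gamma in
        dy_t = a + gamma * y_{t-1} + sum_i b_i * dy_{t-i} + e_t
    (intercept-only ADF regression; `lags` is supplied by the caller)."""
    y = np.asarray(series, dtype=float)
    dy = np.diff(y)
    n = len(dy) - lags
    # Columns: intercept, lagged level, then the lagged differences.
    cols = [np.ones(n), y[lags:-1]]
    for i in range(1, lags + 1):
        cols.append(dy[lags - i:len(dy) - i])
    X = np.column_stack(cols)
    target = dy[lags:]
    beta, *_ = np.linalg.lstsq(X, target, rcond=None)
    resid = target - X @ beta
    sigma2 = resid @ resid / (n - X.shape[1])
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return beta[1] / np.sqrt(cov[1, 1])

rng = np.random.default_rng(1)
noise = rng.normal(size=300)   # stationary by construction
walk = np.cumsum(noise)        # random walk, contains a unit root
stat_noise = adf_t_statistic(noise)
stat_walk = adf_t_statistic(walk)
```

A strongly negative statistic, as for the white-noise series, is evidence against a unit root.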

Table 4-1. Summary of the ADF test for the project frequency series

Variation                          Lag   t-statistic   P-value
ADF with intercept and trend       11    -3.15         0.100
ADF with intercept                 11    -3.12         0.027
ADF without intercept and trend    12    -0.43         0.52

On the basis of the literature (Shahandashti and Ashuri 2016; Thomas Ng et al.

2000; Wong and Ng 2010), incorporating the interrelationships between macroeconomic

factors and primary variables was anticipated to further improve the generator’s ability

to capture the essential characteristics of a project stream.

Testing the order of integration for the independent variables was essential. An

ADF test was conducted to identify the order of integration of the independent data. The

strategy presented by Enders (2015) was adopted to find the right order of integration.

Choosing the appropriate lag in the ADF is critical, and in this case, the lag length was

selected based on the AIC. A summary of the ADF results for the available monthly data

is presented in Table 4-2. It is evident that most of the variables were nonstationary and

required differencing to become stationary. The unemployment rate in the construction

sector, the number of housing starts, and Florida employment needed two rounds of

differencing to become stationary. This variation in the order of integration indicated

that typical multivariate modeling methods, such as vector autoregressive and vector

error correction models, could not be used, as they need all the variables to be integrated of the

same order.

Based on the literature review and the variable analysis, a set of univariate

models, including AR, MA, ARMA, exponential smoothing, and LSTM neural network

models, were selected. Also, a set of multivariate models, including a generalized linear

model, a multilayer perceptron, and a support vector machine, were chosen.

Table 4-2. Results of the ADF test for the explanatory variables

Variable                                     Significance   Lag   T-statistic   Type
Federal Fund Rate                            0.0736         8     -3.913497     Intercept and Trend
Florida Employees in Construction            0.0586         9     -4.210828     Intercept and Trend
Average Prime Rate                           0.0135         9     -2.250865     Intercept
D(Brent Oil Price)                           0.0000         0     -7.212745     Intercept and Trend
D(Crude Oil Price)                           0.0000         0     -6.102265     Intercept and Trend
D(Consumer Price Index)                      0.0000         7     -6.142466     Intercept and Trend
D(Dow Jones Industrial)                      0.0000         0     -10.78966     Intercept and Trend
D(Job Opening in Construction)               0.0844         12    -3.223512     Intercept and Trend
D(Money Stock)                               0.0000         3     -5.617107     Intercept and Trend
D(Producer Price Index)                      0.0001         4     -5.457166     Intercept and Trend
D(Highway and Street Spending In Florida)    0.0000         5     -8.101598     Intercept and Trend
D(Unemployment Rate in Construction)         0.0821         24    -1.713120     Intercept and Trend
D(Construction Spending)                     0.0778         4     -2.650357     Intercept
D(Building Permit)                           0.0994         12    -1.887453     No Trend, No Intercept
D(Florida Unemployment Rate)                 0.0765         1     -1.768058     No Trend, No Intercept
D(Number of Employees in Construction)       0.0650         3     -1.779882     No Trend, No Intercept
D(Unemployment Rate)                         0.0153         4     -2.412137     No Trend, No Intercept
D^2(Unemployment rate in Construction)       0.0000         12    -6.934389     Intercept and Trend
D^2(Number of Housing Started)               0.0000         12    -11.31414     Intercept and Trend
D^2(Florida Employment)                      0.0361         7     -2.084427     No Trend, No Intercept

Strategies to Divide the Data and Test the Models

The data were split into three sections: a training set, a testing and model

selection set, and a validation set for the final simulation. The validation set consisted of

the data from 2015 and 2016, and the data from 2003 to 2015 were used to train and

evaluate the models for each component.

The data under study were time series. Thus, the integrity and temporal

continuity of the data were important, meaning that randomly dividing the data into

different sections for validation would have been inappropriate. In this case, as

demonstrated in Figure 4-3, the evaluation technique relied on a rolling forecasting

origin method. In this method, the data were divided into two sections, training and

testing. The training and testing sets started with three years of consecutive data, while

the training set was extended by one year in each trial. This method allowed for a form

of cross-validation without tampering with the integrity of the data.
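
The rolling forecasting origin scheme can be sketched as follows. The function name is hypothetical, and two details are assumptions made for the example because the text does not state them explicitly: the evaluation span is taken as the twelve years 2003 to 2014, and each test window is taken as the three years that follow the training window. Under those assumptions the scheme produces seven folds.

```python
def rolling_origin_splits(years, initial_train=3, test_len=3):
    """Return (train_years, test_years) pairs.

    The training window starts at `initial_train` years and grows by one
    year per fold; the test window (assumed fixed length) always follows
    the training window, so temporal order is never broken.
    """
    folds = []
    train_end = initial_train
    while train_end + test_len <= len(years):
        folds.append((years[:train_end], years[train_end:train_end + test_len]))
        train_end += 1
    return folds

years = list(range(2003, 2015))  # 2003..2014, an assumed evaluation span
folds = rolling_origin_splits(years)
```

Because every test window lies strictly after its training window, no future observation leaks into model fitting.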

Figure 4-3. Evaluation of a rolling forecasting origin

Model Development

The procedure used to develop the model to forecast the project frequencies is

depicted in Figure 4-4. The purpose of this procedure was to identify data

characteristics, capture them in the model’s projections, and then check whether the

model reproduced those features by using cross-validation techniques. The univariate

model was adopted as a benchmark against which the more complex multivariate

models were compared. These evaluations assessed whether these more intricate

models had an improved forecast accuracy and provided insight into more suitable

means of modeling this problem.

The first step was modeling the main variable through univariate modeling

methods, such as the AR, MA, ARMA, and exponential smoothing models. More

sophisticated approaches, such as artificial neural networks, can also be implemented

when the necessary volume of data is available. After establishing a

benchmark, potentially relevant predictors were identified to populate a pool of

candidate independent variables, and their selection was based on a literature review

and cognitive theories. Explanatory variables brought environmental uncertainties into

the forecast with the goal of improving the accuracy of the simulation. These variables

did not need to have a causal relationship with the main variables; the only inclusion

criterion was that they needed to be helpful in forecasting the dependent variable.

Afterward, the exploratory data analysis was executed. That procedure started with a

graphical comparison of the independent and dependent variables, considering

elements such as scatterplots of pairs of variables, Pearson correlations, and unit-root

(stationarity) tests.

The last step was selecting a set of multivariate modeling approaches based on

the results of the exploratory data analysis and investigating whether including

explanatory variables and more complex models improved the accuracy of the

forecasts. The model range needed to test for linear and nonlinear relationships based

on the results of the previous step, along with variable selection (pruning) and

parameter optimization. It was crucial to embed a cross-validation method within the

variable selection approach to avoid overfitting.

Figure 4-4. Model development scheme

This study forecasts a stream of future projects for which only the statistical

likelihood is known. The ultimate aim was to identify the best model in terms of its

ability to forecast unknown future project streams. The research thus attempts to

forecast an unknown-unknown phenomenon, for which neither the actual project occurrences nor their

characteristics are specified. However, a reasonable stochastic forecast is better

than no estimate at all; the objective was to capture a stream’s characteristic behavior,

rather than its actual behavior.

An R² value primarily evaluates a model in terms of its ability to reproduce

past values, while time-series forecasting is more concerned with how well a model

predicts future values. In addition, R² is a problematic measure

of a time-series regression’s forecast, as a near-perfect R² value can be obtained simply by

adding regressors. Models built on the basis of the R² criterion therefore tend to perform

poorly in forecasting out-of-sample data points and future values, which was the target

of this study. This issue arises when the unsystematic variability or irreducible error of

the dependent variable is turned into systematic variability by capturing it in an

estimated formula. Furthermore, R² can be drastically affected by occasional large

errors. As a result, it was not a suitable measure for cross-model comparisons.

[Figure 4-4 steps: (1) univariate modeling; (2) identifying the potentially relevant predictors and exploratory data analysis; (3) multivariate modeling, along with variable selection (pruning), parameter optimization, and finding the appropriate lag between variables]

In cases such as this study, the aim is to produce the best possible forecast while

understanding the possible error in those estimates. The Root Mean Squared Error

(RMSE) and the Mean Absolute Error (MAE) were better measures of accuracy in this

regard, as they had the same unit as the data and provided insight into the possible

error of the forecasts. Similar to R², the RMSE is sensitive to occasional large errors.

However, a low RMSE requires both high precision and the absence of

systematic error. As a result, the RMSE was a better measure than R² in this research.

Conversely, the MAE is less sensitive to occasional large errors. In conclusion, the

RMSE and MAE provided the most suitable means of evaluating the error in this study.
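
The different sensitivity of the two measures can be seen in a small worked example. The data are invented for illustration: both forecasts have the same MAE, but the one with a single large miss has twice the RMSE.

```python
import math

def rmse(actual, forecast):
    """Root mean squared error, in the units of the data."""
    return math.sqrt(sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual))

def mae(actual, forecast):
    """Mean absolute error, in the units of the data."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)

# Illustrative values only (not data from this study).
actual = [10, 12, 9, 11]
even = [12, 10, 11, 9]     # every error has magnitude 2
spiky = [10, 12, 9, 19]    # three perfect forecasts and one error of 8

even_scores = (rmse(actual, even), mae(actual, even))
spiky_scores = (rmse(actual, spiky), mae(actual, spiky))
```

Reporting both measures therefore indicates not only the typical size of the error but also whether it is dominated by a few large misses.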

Univariate modeling

Prior to univariate modeling, a set of preliminary analyses was necessary to

optimize the models’ parameters, improve their performance, and better understand the

project frequency characteristics. The Autocorrelation Function (ACF) served as another

essential analysis. Autocorrelation is the correlation between a time series and a

delayed version of that time series. The ACF is helpful in finding repeating patterns

in data. A correlogram is a plot of these correlations against the lag.

Figure 4-5 illustrates the ACF correlogram of the project frequencies. The X-axis

indicates the lag (delay) in months, the Y-axis gives the correlation value, and the dotted

line shows the 5% significance boundary. Lag 8 and lag 12 crossed the

significance bounds. The ACF correlogram thus suggested that using an MA model

with eight lags (the first lag with a significant correlation) was appropriate.
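
The sample ACF and an approximate 5% significance bound can be computed as in the sketch that follows. The function names are hypothetical, the bound uses the common large-sample approximation of 1.96 divided by the square root of the series length, and the test series is synthetic rather than the project frequency data.

```python
import math

def acf(series, max_lag):
    """Sample autocorrelation at lags 1..max_lag."""
    n = len(series)
    mean = sum(series) / n
    dev = [v - mean for v in series]
    c0 = sum(d * d for d in dev)
    return [sum(dev[t] * dev[t - k] for t in range(k, n)) / c0
            for k in range(1, max_lag + 1)]

def significance_bound(n):
    """Approximate 5% bound for a white-noise series of length n."""
    return 1.96 / math.sqrt(n)

# Synthetic series that repeats every four observations.
periodic = [1, 0, -1, 0] * 25
r = acf(periodic, 4)
bound = significance_bound(len(periodic))
significant_lags = [k for k, v in enumerate(r, start=1) if abs(v) > bound]
```

For this synthetic series the even lags exceed the bound while the odd lags do not, which is the kind of repeating pattern a correlogram is meant to expose.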

Figure 4-5. The ACF for the project frequencies

The Partial ACF (PACF) is the ACF between a time series and its lagged version

after removing any linear dependence on values with shorter lags. Figure 4-6 illustrates

the PACF correlogram of the project frequencies. The X-axis denotes the lag (delay) in

months, the Y-axis the correlation value, and the blue line the 5% significance

boundary. It is visible that lag 8 and lag 12 again crossed the significance bounds. The

PACF correlogram consequently revealed that using an AR model with eight lags (the

first lag with a significant correlation) was fitting. Considering the results of the ACF and

PACF, using an ARMA model, which combines an AR and MA model, could improve

the performance of the forecast.
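
One standard way to obtain partial autocorrelations from autocorrelations is the Durbin-Levinson recursion, sketched here. The source does not state which algorithm produced the correlogram, so this is an illustration of the concept rather than a reproduction of the analysis; the function name is hypothetical.

```python
def pacf_from_acf(rho):
    """Partial autocorrelations via the Durbin-Levinson recursion.

    rho[k-1] is the autocorrelation at lag k. The result has the same
    length, with entry k-1 holding the partial autocorrelation at lag k.
    """
    pacf, prev = [], []
    for k in range(1, len(rho) + 1):
        if k == 1:
            phi_kk = rho[0]
            cur = [phi_kk]
        else:
            # prev[j] holds the AR(k-1) coefficient on lag j+1.
            num = rho[k - 1] - sum(prev[j] * rho[k - 2 - j] for j in range(k - 1))
            den = 1.0 - sum(prev[j] * rho[j] for j in range(k - 1))
            phi_kk = num / den
            cur = [prev[j] - phi_kk * prev[k - 2 - j] for j in range(k - 1)] + [phi_kk]
        pacf.append(phi_kk)
        prev = cur
    return pacf

# Theoretical ACF of an AR(1) process with coefficient 0.5.
ar1_pacf = pacf_from_acf([0.5, 0.25, 0.125])
```

For an AR(1) process the partial autocorrelation is nonzero only at lag 1, which is why a cutoff in the PACF is read as a hint about the AR order.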

Figure 4-6. The PACF for the project frequencies

The AR model is a stochastic model in which future values are calculated based

on a regression formula from past values. In this case, the parameter requiring

optimization was the number of past values in need of consideration. Then, the

coefficients for each element, along with the intercept, were calculated. Based on the

PACF correlogram, eight and twelve lags were the best options for fitting the AR models,

as they had the highest correlations. In addition to the two identified lags, an automatic

AR model was also fitted. This approach relied on fitting the AR models with different

lags and choosing the best model via the AIC. The automatic algorithm selected the

AR(12) as the best model based on the AIC. Table 4-3 presents the RMSE of the AR

models in the seven cross-validation sections, along with their average. The difference

between models’ performance was marginal. However, the AR(8) performed slightly

better.
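
Fitting an AR model and choosing its order by the AIC can be sketched with ordinary least squares. This is an illustrative simplification, not the fitting routine used in this study: the function names are hypothetical, the AIC is the Gaussian form up to a constant, and candidate orders are fitted on slightly different sample sizes, which a careful comparison would equalize.

```python
import numpy as np

def fit_ar(series, p):
    """Least-squares AR(p) fit; returns (intercept, coefficients, aic)."""
    y = np.asarray(series, dtype=float)
    # Column i holds the series lagged by i steps.
    lagged = [y[p - i:len(y) - i] for i in range(1, p + 1)]
    X = np.column_stack([np.ones(len(y) - p)] + lagged)
    target = y[p:]
    beta, *_ = np.linalg.lstsq(X, target, rcond=None)
    resid = target - X @ beta
    n = len(target)
    sigma2 = max(float(resid @ resid) / n, 1e-12)  # guard against log(0)
    aic = n * np.log(sigma2) + 2 * (p + 1)
    return float(beta[0]), beta[1:], aic

def select_order(series, max_p):
    """Order with the lowest AIC among 1..max_p (simplified selection)."""
    return min(range(1, max_p + 1), key=lambda p: fit_ar(series, p)[2])

def forecast_ar(series, intercept, coefs, steps):
    """Iterate the fitted equation forward, feeding forecasts back in."""
    hist = list(series)
    for _ in range(steps):
        hist.append(intercept + sum(c * hist[-i] for i, c in enumerate(coefs, start=1)))
    return hist[len(series):]

# Noise-free check series generated by y_t = 1 + 0.5 * y_{t-1}.
demo = [0.0]
for _ in range(11):
    demo.append(1.0 + 0.5 * demo[-1])
c_hat, phi_hat, _ = fit_ar(demo, 1)
```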

Table 4-3. The RMSE of the AR models (unit: frequency of projects)

Model    1       2        3        4        5        6        7        Average
AR(8)    9.899   11.382   11.825   11.623   10.808   10.196   10.745   10.925
AR(12)   9.947   11.243   11.982   12.084   11.067   10.196   10.022   10.934

Table 4-4 contains the MAE of the fitted AR models. The results confirmed the

findings of the RMSE measure. The MAEs were lower than the corresponding

RMSEs, and the size of the gap potentially implies the presence of large errors. Such errors result in a

higher RMSE, as that measure penalizes larger errors more heavily.

Table 4-4. The MAE of the AR models (unit: frequency of projects)

Model    1      2      3      4      5      6      7      Average
AR(8)    7.18   8.81   9.14   9.23   8.36   8.21   8.45   8.48
AR(12)   7.25   8.64   9.25   9.65   8.61   8.21   7.83   8.49

Figure 4-7 compares the performances of the AR models. Up to a point, as the

number of training years increased, the performance of the models decreased.

However, after validation set 4, their performance improved back to the early levels. It

was inferred that training the model needed to involve either the recent values or the

entire dataset to achieve the best performance.

Figure 4-7. Comparison of the AR models’ performance (x-axis: data section, 1 to 7; y-axis: error in frequency of projects; series: AR(8) RMSE, AR(12) RMSE, AR(8) MAE, AR(12) MAE)

There are two general types of MA models: (1) those used in ARMA models,

which are based on a linear regression on past forecast errors, and (2) the arithmetic

mean of the series over the past observations. The second approach has multiple

varieties, such as simple, exponential, and double exponential smoothing methods. In

exponential methods, more weight is given to recent observations. Table 4-5 presents

the RMSE of the MA models fitted to the past 8 and 12 values of the series. As

compared to the AR models, the MA models performed poorly. In the MA models, the

past 12 values and a simple MA resulted in the best performance.
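
The three smoothing variants can be sketched as follows. The definitions are assumptions made for illustration, since the text does not give the formulas: the exponential average uses the common smoothing factor 2 / (window + 1), and the double exponential average is taken as twice the exponential average minus the exponential average of that average.

```python
def simple_ma(series, window):
    """Arithmetic mean of the last `window` observations."""
    return sum(series[-window:]) / window

def ema(series, window):
    """Exponential moving average; alpha = 2 / (window + 1) is an assumed convention."""
    alpha = 2.0 / (window + 1)
    value = series[0]
    out = [value]
    for v in series[1:]:
        value = alpha * v + (1 - alpha) * value
        out.append(value)
    return out

def dema(series, window):
    """Double exponential moving average, assumed here as 2*EMA - EMA(EMA)."""
    first = ema(series, window)
    second = ema(first, window)
    return [2 * a - b for a, b in zip(first, second)]

history = [4, 6, 5, 7, 6, 8]
next_simple = simple_ma(history, 3)   # forecast = mean of the last three values
next_ema = ema(history, 3)[-1]        # forecast = last smoothed value
next_dema = dema(history, 3)[-1]
```

The exponential variants weight recent observations more heavily, so they react faster to a level shift than the simple average does.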

Table 4-5. The RMSE of the MA models (unit: frequency of projects)

Model                                     1        2        3        4        5        6        7        Average
Moving averages (8)                       10.388   11.976   11.952   12.131   11.408   10.513   10.876   11.321
Moving averages (12)                      10.330   11.156   11.978   12.485   11.431   10.673   10.966   11.288
Exponential moving averages (8)           10.084   11.769   12.769   12.587   11.139   10.486   10.998   11.404
Exponential moving averages (12)          10.229   11.546   12.440   12.492   11.162   10.523   10.879   11.324
Double exponential moving averages (8)    10.485   13.278   14.317   12.892   11.067   10.435   11.875   12.050
Double exponential moving averages (12)   10.087   12.570   13.237   12.677   11.165   10.478   11.313   11.647

Table 4-6 presents the MAE of the MA models. The results confirmed the

findings for the RMSE measure. However, the MAE values were lower than the RMSE

ones. That finding potentially suggested large error values, as those lead to a higher

RMSE.

Table 4-6. The MAE of the MA models (unit: frequency of projects)

Model                                     1      2       3       4       5      6      7      Average
Moving averages (8)                       7.46   9.83    9.18    9.71    8.83   8.23   8.73   8.85
Moving averages (12)                      7.42   8.62    9.24    10.12   8.85   8.27   8.67   8.74
Exponential moving averages (8)           7.26   9.54    10.32   10.23   8.60   8.22   8.96   9.02
Exponential moving averages (12)          7.35   9.24    9.93    10.13   8.61   8.23   8.77   8.89
Double exponential moving averages (8)    7.53   11.42   12.16   10.55   8.61   8.24   9.77   9.75
Double exponential moving averages (12)   7.26   10.59   10.86   10.32   8.61   8.23   9.24   9.30

Figure 4-8 compares the MA models in terms of performance. The same pattern

found in Figure 4-7 is evident, with the same implications.

Figure 4-8. Comparison of the MA models’ performances (x-axis: data sections, 1 to 7; y-axis: error in frequency of projects; series: RMSE and MAE for MA 8, MA 12, EMA 8, EMA 12, DEMA 8, and DEMA 12)

The ARMA class is the most general category of models used in forecasting

univariate time series. This type of model is typically represented as an ARMA (p,q),

where p is the AR order, and q is the MA order. The order of the AR and MA was

selected via an autocorrelation correlogram and a partial autocorrelation correlogram.

As discussed previously, the preliminary analysis indicated that the project frequency

data were stationary, and so they were suitable for ARMA forecasting. Based on the results

of the autocorrelation and partial autocorrelation tests, an ARMA model (p=8, q=8) was

the best choice to model the project frequency series. However, the ACF and PACF

results also revealed a significant correlation on lag 12. As a result, a set of

combinations of lag 8 and lag 12 for the AR and MA parts of the ARMA model was

tested. Further, an automatic algorithm was used to fit all combinations up to lag 24 and

to identify the best model via the AIC. The RMSEs of the ARMA models are presented

in Table 4-7. The results indicated that the ARMA (p=8, q=8) was the best model.
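
How the AR and MA parts combine in a one-step forecast can be sketched as follows. The coefficients are supplied by the caller and are purely illustrative; estimating them (for example by maximum likelihood) is outside this sketch, and errors before the first forecastable point are assumed to be zero.

```python
def arma_one_step(series, c, phi, theta):
    """In-sample one-step ARMA(p, q) forecasts for given coefficients.

    forecast_t = c + sum_i phi[i] * y_{t-1-i} + sum_j theta[j] * e_{t-1-j}
    where e_t = y_t - forecast_t. Errors before index p are assumed zero.
    """
    p, q = len(phi), len(theta)
    errors = [0.0] * len(series)
    forecasts = [None] * len(series)
    for t in range(p, len(series)):
        ar_part = sum(phi[i] * series[t - 1 - i] for i in range(p))
        ma_part = sum(theta[j] * errors[t - 1 - j] for j in range(q) if t - 1 - j >= 0)
        forecasts[t] = c + ar_part + ma_part
        errors[t] = series[t] - forecasts[t]
    return forecasts, errors

# Hypothetical ARMA(1,1) coefficients and a toy series.
fc, err = arma_one_step([1.0, 2.0, 2.0], c=0.5, phi=[0.5], theta=[0.4])
```

The AR part regresses on past values while the MA part regresses on past forecast errors, which is the first of the two MA meanings distinguished earlier in this chapter.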

Table 4-7. The RMSE of the ARMA models (unit: frequency of projects)

Model           1        2        3        4        5        6        7        Average
Auto ARIMA      9.944    11.073   11.982   12.084   11.067   10.294   10.954   11.057
ARIMA (8,8)     10.882   11.753   11.784   11.428   10.691   9.656    8.812    10.715
ARIMA (8,12)    11.094   12.166   12.064   10.769   10.108   10.013   9.596    10.830
ARIMA (12,8)    11.035   12.238   11.878   10.911   10.745   10.341   8.938    10.870
ARIMA (12,12)   12.062   13.807   12.128   11.911   11.294   9.600    10.093   11.556

Table 4-8 presents the MAE of the ARMA models. The MAE results were in

alignment with the RMSE values. It was evident that the ARMA (p=8, q=8) outperformed

the other models.

Table 4-8. The MAE of the ARMA models (unit: frequency of projects)

Model           1      2       3      4      5      6      7      Average
Auto ARIMA      7.28   8.43    9.25   9.65   8.61   8.21   8.69   8.59
ARIMA (8,8)     8.50   9.19    9.05   9.30   8.80   7.29   7.03   8.45
ARIMA (8,12)    8.60   9.19    9.47   8.59   7.84   8.06   7.98   8.53
ARIMA (12,8)    8.36   9.33    9.47   9.11   8.68   7.74   7.16   8.55
ARIMA (12,12)   9.89   11.15   9.36   9.53   8.76   7.43   8.52   9.23

Figure 4-9 plots the RMSE and MAE of the ARMA models across the seven

cross-validation sections. In general, as the number of training data points increased,

the accuracy of the models also improved.

Figure 4-9. The RMSE and MAE of the ARMA models

There is another univariate time-series forecasting approach called exponential

smoothing. It has different variations, the most straightforward of which is simple

exponential smoothing. This technique is similar to the MA method. However, in

contrast to the MA approach, the weights assigned to the past values exponentially

decrease as the data grows older.

Holt-Winters is a triple exponential smoothing method that has two variations: the

additive and multiplicative seasonal methods. The multiplicative method is inappropriate

for series with negative or zero values. As the project frequency series had zero values

in some months, only the additive method was implemented. That method can consider

both seasonal changes and trends. Table 4-9 presents the RMSE and MAE of the

exponential smoothing models. The Holt-Winters model outperformed the simple

exponential smoothing one on average RMSE (10.82 versus 11.057), although its average

MAE was slightly higher (8.7 versus 8.59). Still, the ARMA (8,8) remained the best model.
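
The additive Holt-Winters recursion can be sketched in a few lines. This is a minimal illustration, not the routine used in this study: the function name is hypothetical, the smoothing parameters are passed in rather than optimized, and the initialization from the first two seasons is only one of several conventions. The series must cover at least two full seasons.

```python
def holt_winters_additive(series, m, alpha, beta, gamma, horizon):
    """Additive Holt-Winters forecast with season length m.

    Initialization (an assumed convention): level = mean of the first
    season, trend = change in seasonal mean per period, seasonal terms =
    deviations of the first season from its mean.
    """
    level = sum(series[:m]) / m
    trend = (sum(series[m:2 * m]) - sum(series[:m])) / (m * m)
    season = [series[i] - level for i in range(m)]
    for t in range(m, len(series)):
        prev_level, prev_trend = level, trend
        s_old = season[t % m]
        level = alpha * (series[t] - s_old) + (1 - alpha) * (prev_level + prev_trend)
        trend = beta * (level - prev_level) + (1 - beta) * prev_trend
        season[t % m] = gamma * (series[t] - prev_level - prev_trend) + (1 - gamma) * s_old
    n = len(series)
    return [level + h * trend + season[(n + h - 1) % m] for h in range(1, horizon + 1)]

# Synthetic series with a pure four-period seasonal pattern and no trend.
pattern = [3, 5, 4, 8]
hw_forecast = holt_winters_additive(pattern * 3, m=4, alpha=0.3, beta=0.1,
                                    gamma=0.2, horizon=4)
```

Because the seasonal terms are added rather than multiplied, the recursion remains well defined when some observations are zero, which is the property that made the additive form usable for the project frequency series.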

[Figure 4-9 plot: x-axis, data sections (1 to 7); y-axis, error (frequency of projects); series, RMSE and MAE for the Auto ARIMA, ARIMA (8,8), ARIMA (8,12), ARIMA (12,8), and ARIMA (12,12) models]

Table 4-9. The RMSE and MAE of the exponential smoothing models (unit: frequency of projects)

Model and error term 1 2 3 4 5 6 7 Average

Exponential smoothing RMSE 9.896 11.075 11.986 12.084 11.067 10.345 10.945 11.057

Holt-Winters RMSE 9.814 10.377 11.651 12.404 9.269 11.141 11.081 10.82

Exponential smoothing MAE 7.17 8.44 9.26 9.65 8.61 8.31 8.68 8.59

Holt-Winters MAE 7.98 8.51 8.97 9.82 7.82 8.96 8.84 8.7

Figure 4-10 depicts the plot of the RMSE and MAE of the exponential smoothing

models across the seven cross-validation sections. Up to a certain point (six years for

the training set), performance declined as the training set grew in size. However,

including seven or more years of data improved the models’ performance.

Figure 4-10. The RMSE and MAE of the exponential smoothing models

[Figure 4-10 plot: x-axis, data sections (1 to 7); y-axis, error (frequency of projects); series, exponential smoothing RMSE, Holt-Winters RMSE, exponential smoothing MAE, and Holt-Winters MAE]

The LSTM is a type of artificial neural network that falls into the category of

recurrent neural networks. LSTMs are especially useful in recognizing a pattern in a

data sequence. The literature has widely used the LSTM approach to forecast time-series

data, because the network holds a kind of memory that allows it to use previous values

in forecasting current ones.

Figure 4-11 A illustrates a neural cell with a loop for which X is the input and h is

the output. The loop enables the cell to pass information from one step to the next.

When it comes to learning long-term dependencies, LSTM networks are especially

strong. Figure 4-11 B depicts an LSTM memory cell. It consists of four main elements.

First, an input gate regulates the impact of the incoming data. It can allow that data to

affect the current state of the memory cell, or it can block it. Second, a neuron with a

recurrent connection ensures that, in the absence of any outside interference, the memory

cell’s state will persist from one step to the next. Third, an output gate controls

the effect of the memory cell on the other neurons. Fourth, a forget gate regulates the

self-recurrent connection, determining whether the memory cell will be permitted to

remember its previous state or made to forget it.
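The interaction of the four elements can be sketched for a single cell with scalar input and state. This follows the commonly used LSTM formulation rather than any equations given in this study; the parameter names and the example weights are assumptions chosen for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, params):
    """One time step of a single-unit LSTM cell with scalar input and state.

    `params` maps "input", "forget", "output", and "candidate" to a
    (w_x, w_h, bias) triple. All names here are illustrative.
    """
    def pre_activation(name):
        w_x, w_h, bias = params[name]
        return w_x * x + w_h * h_prev + bias

    i = sigmoid(pre_activation("input"))        # input gate: admit or block new data
    f = sigmoid(pre_activation("forget"))       # forget gate: keep or erase the old state
    o = sigmoid(pre_activation("output"))       # output gate: expose the state or hide it
    g = math.tanh(pre_activation("candidate"))  # candidate value built from the new input
    c = f * c_prev + i * g                      # self-recurrent memory cell state
    h = o * math.tanh(c)                        # cell output passed to the next step
    return h, c


# Input gate closed, forget and output gates open: the memory persists unchanged.
persist = {"input": (0, 0, -50), "forget": (0, 0, 50),
           "output": (0, 0, 50), "candidate": (1, 0, 0)}
h, c = lstm_step(x=3.0, h_prev=0.0, c_prev=0.7, params=persist)
print(c)  # stays at the previous cell state despite the new input
```

With the forget gate closed instead (a large negative bias), the previous state would be discarded, which is the "made to forget" case described above.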

Figure 4-11. The LSTM structure A) a neural cell with loop B) an LSTM memory cell

Building a neural network requires choosing the number of layers and the number of

neurons before the network can be trained and tested. Another characteristic

that needed to be defined in this study was the number of look-backs, or the number of

previous values used to forecast the next value. In this study, only a one-layer LSTM

was used, with varying numbers of neurons and look-backs across the cross-validation

datasets. Table 4-10 presents the results of the trained LSTM models. With the lowest

average RMSE (10.75), one look-back and two neurons were the optimal arrangement.
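The role of the look-back can be shown with a short sketch that turns a series into input windows and targets. The function name and the sample values are hypothetical, and the study's own preprocessing code is not reproduced here.

```python
def make_lookback_dataset(series, look_back):
    """Pair each window of `look_back` consecutive values with the value that follows it."""
    inputs, targets = [], []
    for start in range(len(series) - look_back):
        inputs.append(series[start:start + look_back])
        targets.append(series[start + look_back])
    return inputs, targets


monthly_counts = [5, 8, 0, 3, 9]  # made-up project frequencies
windows, next_values = make_lookback_dataset(monthly_counts, look_back=3)
print(windows)      # [[5, 8, 0], [8, 0, 3]]
print(next_values)  # [3, 9]
```

A larger look-back gives the network more history per sample but leaves fewer training samples, which is one possible reason the longer look-backs in Table 4-10 performed worse.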

Table 4-10. The RMSE of the LSTM models (unit: frequency of projects)

Look-backs Neurons 1 2 3 4 5 6 7 Average

1 1 10.08 11.02 11.23 11.76 10.43 10.35 10.67 10.79

1 2 10.08 11.02 11.07 11.65 10.5 10.29 10.67 10.75

1 3 10.1 11.04 11.13 11.6 10.46 10.31 10.65 10.76

1 4 10.08 11.11 11.13 11.55 10.48 10.32 10.68 10.76

1 5 10.16 11.08 11.16 11.59 10.46 10.33 10.74 10.79

1 10 10.12 11.04 11.12 11.61 10.44 10.31 10.66 10.76

1 20 10.1 11.04 11.12 11.6 10.48 10.32 10.71 10.77

3 1 12.4 11.48 10.93 10.87 9.26 11.37 11.96 11.18

3 2 11.98 11.56 10.75 12.88 9.28 11.87 12.3 11.52

3 3 12.68 11.54 10.93 11.12 9.02 11.19 12.07 11.22

3 4 12.68 11.45 10.61 12.34 9.29 11.09 12.46 11.42

3 5 12.65 11.27 11.52 10.94 9.3 11.34 12.31 11.33

3 10 12.73 11.64 11.68 14.48 9.63 11.44 12.48 12.01

3 20 12.69 11.88 14.18 14.41 10.09 11.2 12.08 12.36

5 1 12.94 11.08 11.24 11.31 8.98 13.08 14.07 11.81

5 2 12.02 12.37 9.92 10.69 9.72 11.66 14.4 11.54

5 3 12.02 11.44 9.82 9.53 9.71 11.65 15.34 11.36

5 4 12.45 9.96 10.54 11.12 10.8 13.22 13.52 11.66

5 5 11.73 10.29 9.87 9.84 11.45 12.22 13.48 11.27

5 10 12.55 12.47 11.76 13.07 10.61 14.45 14.53 12.78

5 20 13.49 12.86 12.12 13.09 12.61 18.29 17.26 14.25

8 1 14.52 12.29 12.05 12.82 10.02 13.47 11.84 12.43

8 2 14.88 12.34 12.75 13.54 12.24 14.43 12.27 13.21

8 3 14.69 21.46 15.2 14.81 10.56 15.01 18.87 15.80

8 4 17.53 18.46 15.03 15.07 12.24 17.78 14.54 15.81

8 5 11.29 22.16 15.61 16.65 11.23 21.6 18.9 16.78

8 10 15.04 23.35 18.38 14.84 15.73 15.84 27.77 18.71

8 20 19.2 20.1 17.5 15.71 14.31 24.38 37.72 21.27

12 1 23.08 19.08 20.32 18 15.3 11.08 17.37 17.75

12 2 23.65 16.12 22.41 19.76 12.83 14.42 15.29 17.78

12 3 17.94 15.7 19.74 22.09 17.06 17.36 14.79 17.81

12 4 19.46 16.88 16.22 20.9 14.92 24.99 23.4 19.54

12 5 16.9 16.08 19.72 17.24 17.78 18.98 20.62 18.19

12 10 14.69 15.33 15.3 17.64 15.66 16.24 16.98 15.98

12 20 16.21 15.49 18.88 19.01 13.11 17.42 18.46 16.94

Table 4-11 exhibits the MAE of the LSTM models, which confirmed the optimum number

of look-backs. However, according to that measure, one neuron seemed to be ideal.

Because the RMSE penalizes large errors more heavily than the MAE does, the difference

may have arisen because the one-neuron model produced occasional larger errors, while

the two-neuron model produced errors of more uniform size.
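A small numerical sketch shows how the two metrics can disagree. The forecasts below are invented for illustration and are not taken from the study's models.

```python
import math


def rmse(actual, predicted):
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mae(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


actual = [10, 10, 10, 10]
even_errors = [14, 6, 14, 6]      # every forecast is off by 4
one_big_error = [10, 10, 10, 26]  # three exact forecasts, one off by 16

print(mae(actual, even_errors), rmse(actual, even_errors))      # 4.0 4.0
print(mae(actual, one_big_error), rmse(actual, one_big_error))  # 4.0 8.0
```

Both forecasts have the same MAE, but the single large miss doubles the RMSE, so a model with occasional large errors can rank well on MAE and poorly on RMSE.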

Table 4-11. The MAE of the LSTM models (unit: frequency of projects)

Look-backs Neurons 1 2 3 4 5 6 7 Average

1 1 7.48 8.52 8.61 9.43 8.53 8.59 8.73 8.56

1 2 7.46 8.5 8.58 9.7 8.57 8.56 8.72 8.58

1 3 7.5 8.55 8.8 9.62 8.52 8.57 8.71 8.61

1 4 7.47 8.61 8.79 9.55 8.55 8.56 8.73 8.61

1 5 7.5 8.57 8.83 9.62 8.53 8.62 8.76 8.63

1 10 7.47 8.52 8.72 9.64 8.51 8.56 8.72 8.59

1 20 7.47 8.51 8.79 9.63 8.52 8.63 8.75 8.61

3 1 9.39 8.73 8.23 8.65 7.57 9.42 9.68 8.81

3 2 9.27 9.06 8.36 10.32 7.87 9.61 9.53 9.15

3 3 9.44 8.96 9.17 9.18 7.57 9.46 10.05 9.12

3 4 9.61 8.8 8.97 10.02 7.33 9.4 9.73 9.12

3 5 9.54 8.97 9.46 9.29 7.99 9.49 9.78 9.22

3 10 9.56 9.28 9.36 11.48 8.06 9.56 10.23 9.65

3 20 9.67 9.53 10.22 11 8.38 9.35 9.73 9.70

5 1 9.72 8.48 9.13 9.66 7.6 10.11 10.86 9.37

5 2 9.22 9.35 8.27 8.53 8.05 8.78 11.15 9.05

5 3 9.5 9.37 8.17 7.65 7.97 9.04 11.92 9.09

5 4 9.42 7.84 8.81 9.03 9.51 10.02 10.2 9.26

5 5 9.05 8.33 8.18 7.86 9.87 9.29 10.47 9.01

5 10 9.91 10.48 9.09 10.88 9.29 10.8 11.54 10.28

5 20 10.53 9.56 9.5 10.64 10.91 12.78 13.1 11.00

8 1 12.67 9.64 8.88 10.27 8.11 10.26 10.24 10.01

8 2 12.64 9.69 10.11 11.1 10.86 10.88 9.89 10.74

8 3 12.86 16.06 12.14 12.39 9.18 10.51 14.04 12.45

8 4 14.96 14.36 11.83 12.49 10.97 13.25 11.02 12.70

8 5 9.42 16.21 13.48 13.96 9.91 15.15 14.6 13.25

8 10 12.67 17.86 14.83 11.34 13.15 12.19 19.83 14.55

8 20 15.55 15.83 14.36 12.54 12.3 17.57 22.8 15.85

12 1 14.97 15.2 14.43 14.96 10.88 8.89 11.8 13.02

12 2 14.32 13.24 18.89 15.52 10.68 11.26 10.68 13.51

12 3 12.43 12.86 15.73 18.03 14.07 14.04 10.79 13.99

12 4 15.3 13.63 13.33 16.97 11.94 16.53 20.19 15.41

12 5 13.58 13.94 16.32 14.2 14.67 14.78 16.46 14.85

12 10 11.88 12.84 12.49 14.72 13.48 12.82 13.89 13.16

12 20 12.21 12.74 16.33 15.09 10.86 12.59 13.95 13.40

Figure 4-12 plots the RMSE and MAE of the LSTM models with one look-back

and a variable number of neurons. Up to a point, as the number of training years

increased, the performance of the models declined. However, after validation set 4, the

performance improved. These findings suggested that to achieve the best possible

performance, the model’s training needed to involve either the recent values or the

entire dataset.

Figure 4-12. The RMSE and MAE of the LSTM models with one look-back

[Figure 4-12 plot: x-axis, data sections (1 to 7); y-axis, error (frequency of projects); series, RMSE and MAE for the models labeled #1, #2, #3, #4, #5, #10, and #20]

Comparing performance across all the univariate models revealed that an ARMA

(8,8) was the best univariate model to forecast the project frequencies. Figure 4-13

provides a more in-depth overview of the results via a visual illustration of the ARMA

model’s performance. It highlights the difference between the actual data and the model

with the best performance. The predicted values are in blue, and the actual data are

plotted in red. The dark gray indicates the 80% prediction interval, and the light gray the

95% prediction interval. Visual examination of Figure 4-13 reveals that the ARMA model

was more successful at forecasting values after 2008. The model was clearly better at

reproducing the data’s variance in later years, as the variance of the actual data

increased as time passed. However, the variance in the predicted values was smaller

than the variance in the actual data over the entire time span.

Figure 4-13. The ARMA (8,8) forecast based on cross-validation section 7

The presented univariate models served as a benchmark with which to compare

the performance of the multivariate models, and to assess whether using explanatory

variables and more complex multivariate models generated more accurate forecasts.

The first step to build multivariate models was to identify potentially relevant variables

for this forecast and to conduct an exploratory data analysis to better understand the

relationships among the variables.

Identifying potentially relevant predictors and the exploratory data analysis

To develop the multivariate models, a better understanding of the data

characteristics was first necessary, and that information was gained through an

exploratory data analysis and the identification of potentially relevant predictors. The

variables presented in Table 3-1 are reported at different time intervals (monthly,

quarterly, and annually), and not all of them were suitable for monthly predictions, as

certain data were only available annually. Table 4-12 indicates which variables were

available at the monthly level and did not have any missing values for the explored time

frame. It also provides the abbreviation for each variable. These factors served as the

independent (explanatory) variables in this study.

Table 4-12. Potential variables and their abbreviations

Variable name Abbreviation

DIJ Average Vol DIJ

DIJ Closing DIJC

Money Stock M1 MS1

Money Stock M2 MS2

Federal Fund Rate FFR

Average Prime Rate APR

PPIACO PPIACO

Building Permit BP

Brent Oil Price BOP

Consumer Price Index CPI

Crude Oil Price COP

Unemployment Rate UR

Florida Employment FE

Florida Unemployment FU

Florida Unemployment Rate FUR

Florida Number of Employees in Construction NFEC

Number Housing Started HS

Unemployment Rate Construction URC

Number of Employees in Construction NEC

Number of Job Opening in Construction JOC

Construction Spending CS

Total Highway and Street Spending THSS

The first exploratory analysis consisted of a correlation analysis. Figure 4-14

provides the correlation plot of the variables. The color indicates the magnitude of the

correlation, and the direction of the ellipse illustrates the direction of the relationship.

Furthermore, the narrowness of the ellipse indicates the strength of the linear

relationship between the variables. Project frequency is represented by “freq” in the last

row and column. It appears that none of the explanatory variables had a strong linear

relationship with the project frequency. As a result, it was expected that the linear

models would perform poorly in forecasting the project frequencies. Nonlinear models

were therefore also tested to check for any nonlinear relationships between the

variables.

Figure 4-14. Correlation plot of the variables

The correlation analysis was conducted at the level of the variables without any

lag. To verify that no significant linear relationships existed between the variables, a

cross-correlation analysis was also conducted. The results are presented in Table 4-13,

which illustrates the maximum correlation between each variable and the project

frequency, along with the correlation’s associated lag and standard error. Even when

considering the lags of the variables, no significant linear relationship was found

between the dependent and independent variables.
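The search for the lag with the strongest correlation can be sketched as follows. This is a simplified illustration, not the procedure used to produce Table 4-13: it recomputes a Pearson correlation on the overlapping part of the two series at each lag, the sign convention for the lag is an assumption, and the standard error uses the common large-sample approximation 1/sqrt(n - |lag|), which is also assumed here.

```python
import math


def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def lagged_correlation(predictor, target, lag):
    """Correlate predictor[t - lag] with target[t]; a positive lag means the predictor leads."""
    if lag >= 0:
        x, y = predictor[:len(predictor) - lag], target[lag:]
    else:
        x, y = predictor[-lag:], target[:len(target) + lag]
    return pearson(x, y)


def strongest_lag(predictor, target, max_lag):
    """Return (lag, correlation, approximate standard error) for the largest |correlation|."""
    best = max(range(-max_lag, max_lag + 1),
               key=lambda k: abs(lagged_correlation(predictor, target, k)))
    std_error = 1.0 / math.sqrt(len(target) - abs(best))
    return best, lagged_correlation(predictor, target, best), std_error


predictor = [1, 3, 2, 5, 4, 6, 8, 7]     # made-up explanatory series
target = [0, 0] + predictor[:-2]         # the same series delayed by two periods
print(strongest_lag(predictor, target, max_lag=3))
```

A correlation that is smaller than roughly twice its standard error, as is the case for most rows of Table 4-13, is usually not regarded as significant.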

Table 4-13. Cross-correlation of the independent variables with the project frequency

Name Lag Cross Correlation Std. Error

DIJ -1 0.214 0.127

DIJC 4 -0.165 0.13

MS1 1 -0.144 0.127

MS2 4 -0.145 0.13

FFR 4 0.159 0.13

PPIACO 4 -0.247 0.13

BP -9 -0.236 0.136

BOP 4 -0.172 0.13

CPI 4 -0.209 0.13

COP -12 0.218 0.14

UR 4 0.145 0.13

FE 4 -0.15 0.13

FU 4 0.149 0.13

FUR 4 0.152 0.13

NFEC -12 -0.107 0.14

HS 1 -0.147 0.127

URC 3 0.268 0.129

NEC 0 -0.099 0.126

JOC 4 -0.275 0.13

CS 1 -0.144 0.127

THSS -5 0.206 0.131

Testing the order of integration of the independent variables was essential. An ADF

test was executed to identify the order of integration of each series. The

strategy presented by Enders (2015) was employed to detect the right order of

integration. With an ADF test, choosing the appropriate lag is critical; in this case, the

lag length was selected based on the AIC. A summary of the ADF results for

the available monthly data is presented in Table 4-14. The results revealed that the

majority of the variables needed one level of differencing to become stationary, while

the project frequency (the dependent variable) was stationary at the level (without

differencing).
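The differencing step can be sketched in a few lines. The ADF test itself is not reproduced here; the sketch only shows what one and two levels of differencing do to a series, and it assumes that D(x) and D^2(x) in Table 4-14 denote the first and second differences.

```python
def difference(series, order=1):
    """Apply first differencing `order` times: D(x)[t] = x[t] - x[t-1]."""
    result = list(series)
    for _ in range(order):
        result = [later - earlier for earlier, later in zip(result, result[1:])]
    return result


trending = [1, 4, 9, 16, 25]          # made-up series with a growing trend
print(difference(trending))           # [3, 5, 7, 9]   still trending
print(difference(trending, order=2))  # [2, 2, 2]      constant after two differences
```

Each level of differencing shortens the series by one observation, so the differenced predictors must be aligned with the dependent variable before modeling.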

Table 4-14. Result of the ADF test for the explanatory variables

Variable Significance Lag (m) T-statistic Type

Federal Fund Rate 0.0736 8 -3.913497 Intercept and Trend

Florida Employees in Construction 0.0586 9 -4.210828 Intercept and Trend

Average Prime Rate 0.0135 9 -2.250865 Intercept

D(Brent Oil Price) 0.0000 0 -7.212745 Intercept and Trend

D(Crude Oil Price) 0.0000 0 -6.102265 Intercept and Trend

D(Consumer Price Index) 0.0000 7 -6.142466 Intercept and Trend

D(Dow Jones Industrial) 0.0000 0 -10.78966 Intercept and Trend

D(Job Opening in Construction) 0.0844 12 -3.223512 Intercept and Trend

D(Money Stock) 0.0000 3 -5.617107 Intercept and Trend

D(Producer Price Index) 0.0001 4 -5.457166 Intercept and Trend

D(Highway and Street Spending In Florida) 0.0000 5 -8.101598 Intercept and Trend

D(Unemployment Rate in Construction) 0.0821 24 -1.713120 Intercept and Trend

D(Construction Spending) 0.0778 4 -2.650357 Intercept

D(Building Permit) 0.0994 12 -1.887453 No Trend, No Intercept

D(Florida Unemployment Rate) 0.0765 1 -1.768058 No Trend, No Intercept

D(Number of Employees in Construction) 0.0650 3 -1.779882 No Trend, No Intercept

D(Unemployment Rate) 0.0153 4 -2.412137 No Trend, No Intercept

D^2(Unemployment rate in Construction) 0.0000 12 -6.934389 Intercept and Trend

D^2(Number of Housing Started) 0.0000 12 -11.31414 Intercept and Trend

D^2(Florida Employment) 0.0361 7 -2.084427 No Trend, No Intercept

Feature selection and feature importance

Feature selection is the process of selecting the most relevant predictors and

removing irrelevant variables from the pool of potentially useful predictors. Depending

on the model’s structure, feature selection can improve a model’s accuracy. This

process can be carried out by measuring the contribution of each variable to the

model’s accuracy, and then removing irrelevant and redundant variables while keeping

the most useful ones. In some cases, irrelevant features can even reduce a model’s

accuracy. In general, there are three approaches to feature selection: the filter method,

wrapper method, and embedded method.

The filter method of feature selection uses simple statistical tests to check which

variables are statistically significant. On that basis, the variable subset with the

maximum predictive power is selected. The typical approach is to calculate a score

based on the chosen statistical test and to then rank the variables according to their

individual scores. Finally, the features with the highest rank are chosen. This method

does not consider the interactions between the variables, and the selection process is

independent of the particular machine learning method that is implemented. However, it

is easy to apply and does not require much computational power.

Wrapper methods train a model using all the variables, or a subset of them. Next,

the model’s performance and comparisons with the previous subset determine whether

a feature should be added or removed. Wrapper methods are computationally

expensive. These methods fall into three main categories: forward selection, backward

elimination, and recursive feature elimination. In forward selection, the model

starts with no variables and adds one variable per iteration. The model is trained and its

accuracy calculated, and the final step is to check whether the new variable has improved

its performance. In backward elimination, the model starts with all the variables and

then removes the least significant feature at each iteration, checking whether the

exclusion improves its accuracy. Recursive feature elimination uses a

greedy search algorithm to find the best-performing variable subset. This method

iteratively creates new subsets and checks the model’s performance while removing the

worst-performing feature. Finally, it ranks the features based on their order of

elimination. Using wrapper methods rather than filter techniques might result in a model

with better performance. However, with that approach, a model is more prone to

overfitting.
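Forward selection, the first of the wrapper categories, can be sketched as a greedy loop. This is a generic illustration and not a procedure applied in this study; `score` is a hypothetical user-supplied function that trains a model on a feature list and returns an error to minimize, and `toy_score` is an invented stand-in for it.

```python
def forward_selection(candidates, score):
    """Greedy forward selection that stops when no added feature lowers the error."""
    selected = []
    best_error = float("inf")
    remaining = list(candidates)
    while remaining:
        # Try adding each remaining feature and keep the most helpful one.
        trial_errors = {f: score(selected + [f]) for f in remaining}
        best_feature = min(trial_errors, key=trial_errors.get)
        if trial_errors[best_feature] >= best_error:
            break  # no candidate improves the model, so stop
        selected.append(best_feature)
        best_error = trial_errors[best_feature]
        remaining.remove(best_feature)
    return selected, best_error


def toy_score(features):
    # Hypothetical error: "a" and "b" help, and every extra feature costs a little.
    error = 10.0
    if "a" in features:
        error -= 4.0
    if "b" in features:
        error -= 2.0
    return error + 0.5 * len(features)


print(forward_selection(["a", "b", "c"], toy_score))  # (['a', 'b'], 5.0)
```

Because every candidate requires retraining the model, the number of model fits grows quickly with the number of variables, which is the computational cost noted above.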

Embedded methods implement feature selection and model tuning at the same

time. In other words, these machine learning algorithms have built-in feature selection

elements. Examples of embedded method implementations include LASSO and elastic

net. Regularization is a process in which the user intentionally introduces bias into the

training, preventing the coefficients from taking large values. This method is especially

useful when the number of variables is high. In such a situation, the linear regression is

not stable, and a small change in a few variables results in a large shift in the

coefficients. The LASSO approach uses L1 regularization (adding a penalty proportional to

the sum of the absolute values of the coefficients), while ridge regression uses L2

regularization (adding a penalty proportional to the sum of the squared coefficients).

Elastic net uses a combination of L1 and L2. In each case, the model minimizes the sum

of the squared residuals plus the penalty. Ridge regression is effective in reducing a

model’s variance because it shrinks the coefficients toward zero, although it rarely sets

them exactly to zero. The LASSO approach produces a sparse model in which some

coefficients are exactly zero. As a result, this approach has implicit feature selection.

The generalized linear method implemented in the next section uses elastic net. This

approach incorporates both L1 and L2 regularization and thus has implicit feature

selection.

To execute feature selection in the multilayer perceptron used in this research,

the method proposed by Olden and Jackson (2002) for measuring feature importance

was used. This method is very similar to that put forward by Goh (1995), which uses the

connection weights among the different layers to determine the variable importance.

The literature has made clear that the method proposed by Olden and Jackson (2002)

outperforms Goh’s algorithm. Olden and Jackson’s (2002) method uses the summation

of the product of the raw input-hidden and hidden-output connection weights between

the respective input and output neuron. As compared to the previous methods, this

technique works on neural nets with multiple layers, and it also considers the sign of the

weights in addition to their absolute value. However, the calculated values are relative

and should only be used in comparing the variables’ importance within a model, and not

across different models.
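For a network with one hidden layer and one output, the connection-weight calculation described above can be sketched as follows. The weight layout and the example weights are assumptions made for illustration, not values from the trained models.

```python
def olden_importance(input_hidden, hidden_output):
    """Connection-weight importance for a network with one hidden layer.

    input_hidden[i][h] is the raw weight from input i to hidden neuron h.
    hidden_output[h] is the raw weight from hidden neuron h to the single output.
    """
    return [
        sum(w_ih * w_ho for w_ih, w_ho in zip(weights_from_input, hidden_output))
        for weights_from_input in input_hidden
    ]


# Three inputs, two hidden neurons (made-up weights).
input_hidden = [[2.0, -1.0], [0.5, 0.5], [-3.0, 1.0]]
hidden_output = [1.0, -2.0]
print(olden_importance(input_hidden, hidden_output))  # [4.0, -0.5, -5.0]
```

The sign is kept, so the first input pushes the output up and the third pushes it down, and the magnitudes are only comparable within this one network.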

Looking at the correlation between the independent variables and the dependent

variable, it became evident that a filter method using a correlation analysis was not

useful, as all the variables had a nonsignificant relationship with the project frequency.

Instead, feature sets compiled with both linear and nonlinear filter approaches were used

to prune the inputs of the support vector machine, which has no specific feature

importance measure, and of the neural networks (for comparison with the Olden

method).

A linear regression model was fitted to the data, and the absolute value of the t-

value of each independent variable was used as the importance measure within a linear

filtering process. Table 4-15 and Figure 4-15 provide the linear filter approach’s results.

Table 4-15. Linear filter approach results

Variable Name t statistic

URC 1.02340148

FE 0.87849779

UR 0.87478354

FUR 0.78731785

BP 0.78663472

FU 0.74257859

DIJC 0.49361533

NEC 0.43265008

MS1 0.42060125

CS 0.38206996

JOC 0.36666722

APR 0.35951523

MS2 0.32116516

FFR 0.30353176

HS 0.24692241

NFEC 0.23642652

DIJ 0.18496543

BOP 0.10243016

PPIACO 0.08947572

CPI 0.05912804

COP 0.05655688

THSS 0.03656233

Based on the presented results, three levels for variable pruning were chosen.

First, the models were trained with all available variables to create a benchmark. Then,

using a t-value of 0.3 as the threshold for the first level of pruning yielded the following

independent variables: URC, FE, UR, FUR, BP, FU, DIJC, NEC, MS1, CS, JOC, APR,

MS2, and FFR. The second level of pruning used a t-value of 0.5 as

the threshold, and this step left URC, FE, UR, FUR, BP, and FU as the independent

variables.
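The two pruning levels can be reproduced from the values in Table 4-15 with a simple threshold filter. The function name is hypothetical, and the use of a strict comparison is an assumption.

```python
# Absolute t-values copied from Table 4-15.
T_VALUES = {
    "URC": 1.02340148, "FE": 0.87849779, "UR": 0.87478354, "FUR": 0.78731785,
    "BP": 0.78663472, "FU": 0.74257859, "DIJC": 0.49361533, "NEC": 0.43265008,
    "MS1": 0.42060125, "CS": 0.38206996, "JOC": 0.36666722, "APR": 0.35951523,
    "MS2": 0.32116516, "FFR": 0.30353176, "HS": 0.24692241, "NFEC": 0.23642652,
    "DIJ": 0.18496543, "BOP": 0.10243016, "PPIACO": 0.08947572, "CPI": 0.05912804,
    "COP": 0.05655688, "THSS": 0.03656233,
}


def prune_by_threshold(scores, threshold):
    """Keep the variables whose score exceeds the threshold, highest score first."""
    kept = [name for name, value in scores.items() if value > threshold]
    return sorted(kept, key=scores.get, reverse=True)


print(prune_by_threshold(T_VALUES, 0.3))  # level one: 14 variables
print(prune_by_threshold(T_VALUES, 0.5))  # level two: URC, FE, UR, FUR, BP, FU
```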

Figure 4-15. Linear variable importance

To understand the importance of the features in a nonlinear manner, locally

weighted scatterplot smoothing (loess) was used. Loess is a flexible, nonparametric

method that fits a weighted regression to the nearest neighbors of each point. A loess

smoother was fit between the project frequency and each variable, and the R2 statistic

was calculated for each one. This metric served as a relative variable importance

measure. Table 4-16 and Figure 4-16 present the results of the nonlinear filter approach.

[Figure 4-15 plot, titled “Linear filter”: t-value axis from 0 to 1.2; variables listed in the same order as in Table 4-15, from URC to THSS]

Table 4-16. Nonlinear filter approach results

Variable Name R Squared

JOC 0.038

URC 0.037

COP 0.024

DIJ 0.020

MS1 0.018

FUR 0.018

FU 0.018

DIJC 0.018

FE 0.018

HS 0.017

BOP 0.016

BP 0.013

MS2 0.011

CS 0.011

THSS 0.010

PPIACO 0.009

UR 0.007

APR 0.006

CPI 0.005

FFR 0.003

NFEC 0.003

NEC 0.001

Similar to the linear approach, three feature sets were chosen. First, all the

variables were used as a benchmark. Then, trimming the variables with a 0.01 R2

threshold resulted in JOC, URC, COP, DIJ, MS1, FUR, FU, DIJC, FE, HS, BOP, BP,

MS2, and CS as independent variables for level-one nonlinear pruning. Finally, further

trimming using the 0.02 R2 threshold yielded JOC, URC, and COP as the outputs of the

level-two nonlinear pruning.

Figure 4-16. Nonlinear variable importance

[Figure 4-16 plot, titled “Nonlinear filter”: R2 value axis from 0 to 0.04; variables listed in the same order as in Table 4-16, from JOC to NEC]

In this study, the analysis primarily focused on the monthly level, and as a result,

those variables reported on a yearly basis could not be included. However, variables

such as the FDOT’s budget may be important and impact future project streams. To

ensure that the final model did not disregard the impact of the FDOT budget, a

comparison between the FDOT’s total budget, the FDOT’s product budget, cumulative

costs, and project frequency on a yearly basis was executed. Figure 4-17 plots the

cumulative cost of the projects, along with the FDOT’s total and product budgets. It is

evident that the total budget and product budget were fully synchronized. In addition,

from 2006 to 2009, the cumulative cost of the projects revealed a similar pattern.

However, considering the whole data span, project costs were not affected by the

FDOT’s total budget and product budget.

Figure 4-17. Comparison of the budgets and costs of the projects

To quantify the observations from Figure 4-17, a correlation analysis was

conducted, examining the relationship between the cumulative project costs and project

frequencies, and the FDOT’s total and product budget. Table 4-17 presents the

correlation analysis results. The strongest correlation between either project frequency or

project cost and the budgets was -0.42 (project costs versus the FDOT total budget), which

was not significant.

Table 4-17. Linear correlation table of project cost and frequency with the budget

Variable Project frequency Project Costs FDOT Production budget FDOT total budget

Project frequency 1.000 0.083 -0.407 -0.348

Project Costs 0.083 1.000 -0.291 -0.420

FDOT Production budget -0.407 -0.291 1.000 0.980

FDOT total budget -0.348 -0.420 0.980 1.000

[Figure 4-17 plot: x-axis, date (2003 to 2014); y-axis, dollar value (0 to 4.5E+09); series labeled cost, FPB, and FTB]

Multivariate modeling

Based on the results of the correlation analysis, cross-correlation evaluation, and

ADF test of the variables, using multivariate time-series modeling methods such as

vector autoregression and vector error correction models was not feasible. Consequently,

machine learning methods were used, especially those based on nonlinear relationships

among variables. Based on the literature review, three machine learning methods were

implemented: the generalized linear model, the feedforward perceptron, and support vector

machines with a radial basis function kernel. All three used the cross-validation

method explained earlier.

These machine learning methods were applied to the data using the previously

discussed cross-validation method to divide the data and test the model. An increasing

data window was employed for training the model, which was then tested on three

consecutive years of data. The independent variables differed by orders of magnitude,

and using them as-is might have resulted in the ones with small magnitudes being

overlooked. As a result, a 0-1 scaling transformation prepared the variables for the

implementation of the machine learning algorithms.
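The increasing training window and the 0-1 scaling can be sketched as follows. The number of years, the size of the first training window, and the choice to take the scaling range from the training data only are assumptions made for illustration; the text does not specify them.

```python
def expanding_window_splits(n_years, first_train_years, test_years=3):
    """Yield (train_years, test_years) index lists with a growing training window."""
    train_end = first_train_years
    while train_end + test_years <= n_years:
        yield list(range(train_end)), list(range(train_end, train_end + test_years))
        train_end += 1


def min_max_scale(train_column, column):
    """Scale `column` to the 0-1 range using the training minimum and maximum."""
    low, high = min(train_column), max(train_column)
    return [(value - low) / (high - low) for value in column]


for train, test in expanding_window_splits(n_years=6, first_train_years=2):
    print(train, test)
# [0, 1] [2, 3, 4]
# [0, 1, 2] [3, 4, 5]

print(min_max_scale([10, 20, 30], [10, 25, 30]))  # [0.0, 0.75, 1.0]
```

Scaling with the training range keeps information from the test years out of the preprocessing, at the price of test values that can fall slightly outside the 0-1 interval.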

The general process of model optimization and feature selection consisted of first

defining a set of model parameter values to evaluate. Then, the data was preprocessed

in accordance with the 0-1 scale. For each parameter set, the cross-validation method

discussed earlier served to train and test the model. Finally, the average performance

was calculated for each parameter set, which identified the optimal parameters.
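The optimization loop described in this paragraph can be sketched as a grid search that averages the error over the cross-validation sections. The function `evaluate` is a hypothetical placeholder for training and testing a model on one section, and the toy error function and grid values are invented.

```python
from itertools import product


def grid_search(param_grid, folds, evaluate):
    """Return the parameter set with the lowest error averaged over the folds.

    `evaluate(params, fold)` is a user-supplied function that trains on the
    fold's training part and returns the error on its test part.
    """
    names = sorted(param_grid)
    best_params, best_error = None, float("inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        mean_error = sum(evaluate(params, fold) for fold in folds) / len(folds)
        if mean_error < best_error:
            best_params, best_error = params, mean_error
    return best_params, best_error


def toy_evaluate(params, fold):
    # Hypothetical error surface with its minimum at alpha=1.0 and lam=0.5.
    return (params["alpha"] - 1.0) ** 2 + abs(params["lam"] - 0.5) + fold


grid = {"alpha": [0.0, 0.5, 1.0], "lam": [0.25, 0.5, 1.0]}
print(grid_search(grid, folds=[0, 1, 2], evaluate=toy_evaluate))
# ({'alpha': 1.0, 'lam': 0.5}, 1.0)
```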

Ordinary linear regression is based on the underlying assumption that the model

for the dependent variable has a normal error distribution. Generalized linear models

are a flexible generalization of the ordinary linear regression that allows for other error

distributions. In general, they can be applied to a wider variety of problems than can the

ordinary linear regression. Generalized linear models are defined by three components:

a random component, a systematic component, and a link function. The random

component specifies the response variable and its corresponding probability

distribution. The systematic component specifies the independent variables and their

linear combination, which is called the linear predictor. The link function defines the

connection between the random and systematic components. In other words, it

describes how the expected value of the dependent variable is related to the linear

predictor of the independent variables.

Ridge regression uses an L2 penalty to limit the size of the coefficients, while

LASSO regression uses an L1 penalty to increase the interpretability of the model. The

elastic net uses a mix of L1 and L2 regularization, which makes it superior to the other

two methods in most cases. Using a combination of L1 and L2, the elastic net can

produce a sparse model with few variables selected from the independent variables.

This approach is especially useful when multiple features with high correlations with

each other exist.

A generalized linear model was fit to the data using the cross-validation method

discussed earlier. Alpha (mixing percentage) and lambda (regularization parameter)

were the tuning parameters. Alpha controls the elastic net penalty, where α=1

represents lasso regression, and α=0 represents ridge regression. Lambda controls the

strength of the penalty. The L2 penalty shrinks the coefficients of correlated variables

toward each other, whereas the L1 penalty tends to pick one of the correlated variables and

remove the rest. Figure 4-18 illustrates the results of the generalized linear model,

optimized by tuning alpha and lambda to minimize the RMSE. The optimized parameters were

α=1 and λ=0.56, which corresponds to a pure LASSO penalty.
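The roles of alpha and lambda can be made concrete with a sketch of the penalty term. The formula follows the parameterization used by the glmnet package, which is assumed here because it matches the description of alpha above; the coefficient values are invented.

```python
def elastic_net_penalty(coefficients, alpha, lam):
    """Penalty lam * (alpha * L1 + (1 - alpha) / 2 * squared L2), glmnet-style (assumed)."""
    l1 = sum(abs(b) for b in coefficients)
    l2_squared = sum(b * b for b in coefficients)
    return lam * (alpha * l1 + 0.5 * (1.0 - alpha) * l2_squared)


beta = [3.0, -4.0]  # made-up coefficients
print(elastic_net_penalty(beta, alpha=1.0, lam=0.5))  # 3.5   pure LASSO (L1 only)
print(elastic_net_penalty(beta, alpha=0.0, lam=0.5))  # 6.25  pure ridge (L2 only)
print(elastic_net_penalty(beta, alpha=0.5, lam=0.5))  # 4.875 equal mix
```

The fitted model minimizes the residual error plus this penalty, so a larger lambda shrinks the coefficients more strongly for any given alpha.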

Figure 4-18. Generalized linear method optimization

Figure 4-19 depicts the LASSO coefficient curves. Each curve represents a

variable. The path for each variable shows its coefficient in relation to the L1

norm. The coefficient paths more effectively highlight why only two variables were

significant in the generalized linear model. When two variables were excluded, all other

coefficients became zero at the selected L1 norm, and this arrangement yielded the best

performance. Figure 4-20 offers the variable importance for the generalized linear

model with all the variables. Only the unemployment rate in the construction industry, the

Brent oil price, and the unemployment rate (total) had non-zero coefficients. However,

the unemployment rate (total) seemed to be relatively insignificant.

Figure 4-19. Lasso coefficient curve

Figure 4-20. Variable importance for the generalized linear model

To further prune the generalized linear model, another model with only the

unemployment rate in the construction sector and the Brent oil price was trained and

tested. Table 4-18 contains the optimized parameters for the generalized linear models.

Table 4-18. Parameters of the generalized linear models (unit: frequency of projects)

Parameter All variables Pruned by one variable

URC 3.94 4.03

BP 2.80 2.77

UR 0.11 0.00

Intercept 17.14 17.16

Table 4-19 presents the performance of the optimized generalized linear models

with each feature set across the cross-validation sections. On average, excluding

the unemployment rate improved the model’s performance, and the only variables

contributing to the linear model were the unemployment rate in the construction sector

and the Brent oil price.

Table 4-19. Performance of the Generalized linear model (unit: frequency of projects)

Error term Feature set 1 2 3 4 5 6 7 Average

RMSE All 16.13 11.58 13.86 13.16 12.07 11.03 10.89 12.67

Pruned 9.78 11.94 13.69 13.14 10.94 10.27 10.87 11.52

MAE All 13.24 9.64 11.60 10.82 9.55 8.53 8.60 10.28

Pruned 10.80 8.56 8.01 8.25 10.00 8.60 11.28 9.36

A multilayer perceptron is a feedforward neural network including at least three

layers. An input layer, one (or more) hidden layer, and an output layer. Nodes in the

hidden layers and output layer are neurons with activation functions. A supervised

learning algorithm called back propagation is used to train a multilayer perceptron. The

advantages of the multilayer perceptron are its capability to learn nonlinear relationships

among variables and to learn in real time. However, different random weight

86

initializations could lead to different accuracy rates, as the optimization problem has

more than one local minimum. The issue of choosing the right number of hidden layers

and neurons is another disadvantage.
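A single-hidden-layer network and the weight-decay penalty that is tuned later in this section can be sketched as below. The logistic activation, the linear output unit, and the choice to penalize connection weights but not biases are assumptions made for illustration; the source does not state these details, and all function names are hypothetical.

```python
import math

def logistic(z):
    """Logistic (sigmoid) activation, assumed here for the hidden units."""
    return 1.0 / (1.0 + math.exp(-z))

def mlp_forward(x, w_hidden, b_hidden, w_out, b_out):
    """Forward pass: inputs -> one hidden layer -> single linear output."""
    hidden = [logistic(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(w_hidden, b_hidden)]
    return sum(w * h for w, h in zip(w_out, hidden)) + b_out

def penalized_loss(xs, ys, w_hidden, b_hidden, w_out, b_out, decay):
    """Mean squared error plus decay times the sum of squared weights."""
    mse = sum((y - mlp_forward(x, w_hidden, b_hidden, w_out, b_out)) ** 2
              for x, y in zip(xs, ys)) / len(xs)
    penalty = decay * (sum(w * w for row in w_hidden for w in row)
                       + sum(w * w for w in w_out))
    return mse + penalty

# Tiny hypothetical network: 3 inputs, 2 hidden neurons.
w_hidden = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
b_hidden = [0.0, 0.0]
w_out = [1.0, -2.0]
b_out = 3.0
print(mlp_forward([1.0, 2.0, 3.0], w_hidden, b_hidden, w_out, b_out))
```

A larger decay value shrinks the weights toward zero, which trades some flexibility for stability; that trade-off is what the optimization over neurons and weight decay explores.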

A multilayer perceptron model was optimized with one hidden layer over different

feature sets. The Olden and Jackson (2002) method was used to measure the feature

importance and conduct feature selection processes. Furthermore, datasets pruned with

the previously mentioned filter approaches were used to train and test the model. The

network with one layer of hidden units was optimized for the number of neurons in the

hidden layer and the weight decay for each dataset. The model optimization of the

single hidden layer neural net with all the variables revealed that 15 neurons with a

weight decay of 0.00017 were the optimum choice for the model’s parameters. Figure 4-21

depicts the structure of the optimized model with 15 neurons in the hidden layer, with all

the variables as inputs and project frequency as the output.

Figure 4-21. Optimum network structure with all the independent variables


Figure 4-22 visualizes the feature importance using the Olden and Jackson

(2002) method for the optimized model with all the variables. Two levels of pruning were

chosen for feature selection based on the results in figure 4-22. The first level of pruning

relied on a threshold of 100, and the outcome was the exclusion of APR, FFR, NFEC,

HS, NEC, CS, and DIJ. The second level used a threshold of 200, which left only URC,

THSS, COP, CPI, FUR, UR, FU, BOP, and PPIACO.
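The connection-weight idea behind the Olden and Jackson (2002) importance measure, followed by threshold-based pruning, can be sketched as below. This is a simplified illustration for a network with one hidden layer and one output; comparing the magnitude of each score with the threshold is an assumption, and the variable names and weights are hypothetical.

```python
def connection_weight_importance(w_input_hidden, w_hidden_output):
    """Sum over hidden neurons of (input->hidden weight) * (hidden->output weight).

    w_input_hidden[i][h] is the weight from input i to hidden neuron h;
    w_hidden_output[h] is the weight from hidden neuron h to the output.
    """
    return [sum(w_ih * w_ho for w_ih, w_ho in zip(row, w_hidden_output))
            for row in w_input_hidden]

def prune_by_threshold(names, importance, threshold):
    """Keep the features whose absolute importance reaches the threshold."""
    return [n for n, s in zip(names, importance) if abs(s) >= threshold]

# Hypothetical weights for three inputs and two hidden neurons.
names = ["A", "B", "C"]
w_input_hidden = [[2.0, -1.0], [0.5, 0.5], [-3.0, 1.0]]
w_hidden_output = [10.0, -20.0]

scores = connection_weight_importance(w_input_hidden, w_hidden_output)
print(scores)
print(prune_by_threshold(names, scores, 30.0))
```

Unlike measures that use absolute weights only, the raw products keep their sign, so a score indicates both the strength and the direction of a variable's contribution.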

Figure 4-22. Feature importance according to the Olden method

Table 4-20 contains the RMSE and MAE of the optimized multilayer perceptron

models with all the variables and pruned datasets. The pruning had a marginal impact

on the models’ performance. However, the nonlinear level-two pruning tended to

provide the best performance.


Table 4-20. Multilayer perceptron models' performance (unit: frequency of projects)

Error term  Feature set                  1      2      3      4      5      6      7      Average
RMSE        All variables                10.18  11.09  12.00  11.89  11.11  10.39  10.89  11.08
RMSE        Level one nonlinear pruning  9.75   11.09  11.92  12.06  11.16  10.68  10.90  11.08
RMSE        Level two nonlinear pruning  9.87   10.99  12.01  11.91  11.06  10.36  10.90  11.01
RMSE        Level one linear pruning     9.78   11.14  11.94  12.02  11.15  10.54  11.11  11.10
RMSE        Level two linear pruning     9.76   10.46  12.11  11.87  11.25  10.48  11.41  11.05
RMSE        Level one Olden pruning      9.80   11.10  11.96  11.88  11.09  10.32  10.93  11.01
RMSE        Level two Olden pruning      9.76   11.07  11.92  11.90  11.31  10.38  10.91  11.04
MAE         All variables                7.44   8.42   8.94   9.28   8.57   8.32   8.69   8.52
MAE         Level one nonlinear pruning  7.16   8.45   9.24   9.28   8.91   8.24   8.68   8.57
MAE         Level two nonlinear pruning  7.15   7.70   8.94   9.29   8.60   9.08   8.69   8.49
MAE         Level one linear pruning     7.20   8.42   9.01   9.54   8.76   8.23   8.73   8.56
MAE         Level two linear pruning     7.50   8.33   9.00   9.93   8.83   8.25   8.72   8.65
MAE         Level one Olden pruning      7.17   8.64   9.04   9.29   8.60   8.24   8.68   8.52
MAE         Level two Olden pruning      7.29   8.43   8.95   9.28   8.66   8.23   9.14   8.57

Figure 4-23 features a 3D plot of the multilayer perceptron optimization for the

level-two nonlinear pruning. As the weight decay increased, the accuracy declined.

Moreover, in models with a high weight decay, an increase in the number of neurons

resulted in a less accurate performance. Figure 4-24 displays the same information in a

2D plot. In that visualization, it is clear that as the weight decay increased, the accuracy

decreased. However, to more effectively identify the optimum model, we needed to

focus on the section with the low weight decay.


Figure 4-23. A 3D plot of the neural net model optimization

Figure 4-24. A 2D plot of the neural net model optimization

Figure 4-25 and figure 4-26 illustrate the optimization of the neural network

model, emphasizing the area of interest. For a lower weight decay, the performance



was more chaotic, but as the weight decay increased, the model became more stable.

As for the number of neurons in the model, the models with fewer than 10 neurons were

more stable in terms of error consistency.

Figure 4-25. A focused 3D plot of the optimized parameters of the neural network

Figure 4-26. A focused 2D plot of the optimized parameters of the neural network




Figure 4-27 presents the structure of the network for the best-performing

multilayer perceptron. The inputs were the number of job openings in the construction

sector, unemployment in the construction sector, and the crude oil price. The network

had five neural cells in its hidden layer.

Figure 4-27. Structure of the optimized neural network

Support vector machines are among the nonparametric (using a kernel function)

supervised learning algorithms that can be used in classification and regression

problems. A support vector machine builds a hyperplane or a number of hyperplanes in

a high-dimensional space to maximize the distance between the nearest data point and

the hyperplane. As a result, a model constructed by a support vector machine only

depends on a subset of the dataset, because the points beyond the margin do not have

any effect on the model. The support vector regression works similarly to its

classification method. However, as the output is numerical, a margin of tolerance is

introduced to the model, allowing it to produce single numerical values. Support vector

machines are especially useful for high-dimensional problems, and due to the different

kernel functions, they are applicable to a range of problems. Moreover, as they only use


a subset of data, they are computationally efficient. Because, by this point in our inquiry, we

were more interested in the nonlinear relationships among the variables, a support

vector regression model using a Gaussian kernel was implemented and optimized.
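The ingredients of a support vector regression with a Gaussian kernel can be sketched as follows. The kernel is written as exp(-sigma * squared distance), which is one common parameterization and is assumed here; the support vectors, coefficients, and bias are hypothetical, and no training routine is shown.

```python
import math

def gaussian_kernel(x, y, sigma):
    """Gaussian (RBF) kernel, parameterized as exp(-sigma * ||x - y||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sigma * squared_distance)

def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """Errors inside the tolerance margin (epsilon) cost nothing."""
    return max(0.0, abs(y_true - y_pred) - epsilon)

def svr_predict(x, support_vectors, coefficients, bias, sigma):
    """Prediction is a kernel-weighted sum over the support vectors only."""
    return sum(c * gaussian_kernel(sv, x, sigma)
               for sv, c in zip(support_vectors, coefficients)) + bias

# Hypothetical fitted model with a single support vector.
print(svr_predict([1.0, 2.0], [[1.0, 2.0]], [2.0], 1.0, 0.5))
print(epsilon_insensitive_loss(10.0, 12.0, 0.5))
```

The prediction depends only on the stored support vectors, which is the sense in which the fitted model uses a subset of the data.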

Table 4-21 presents the results of the support vector regression for different datasets.

The feature pruning did not have a significant effect on the model’s performance.

In terms of the average RMSE, the model with all variables performed marginally better, whereas in terms of the average MAE, the model with the level-two linear pruning performed marginally better.

Table 4-21. Performance of the support vector machine models (unit: frequency of projects)

Error term  Feature set                  1      2      3      4      5      6      7      Average
RMSE        All variables                9.92   11.12  11.97  12.07  10.98  10.30  10.93  11.04
RMSE        Level one nonlinear pruning  9.95   11.14  12.12  12.05  10.99  10.27  10.90  11.06
RMSE        Level two nonlinear pruning  10.36  10.89  12.11  12.00  11.12  10.43  10.82  11.10
RMSE        Level one linear pruning     10.07  11.14  11.91  11.99  11.00  10.33  10.89  11.05
RMSE        Level two linear pruning     10.00  11.00  11.78  11.98  11.16  10.61  10.89  11.06
MAE         All variables                7.26   8.34   9.09   9.50   8.58   8.26   8.71   8.53
MAE         Level one nonlinear pruning  7.30   8.36   9.29   9.49   8.53   8.18   8.72   8.55
MAE         Level two nonlinear pruning  7.43   8.36   9.05   9.44   8.70   8.34   8.81   8.59
MAE         Level one linear pruning     7.26   8.33   9.01   9.44   8.57   8.22   8.71   8.51
MAE         Level two linear pruning     7.35   8.21   8.92   9.41   8.61   8.16   8.58   8.46

The parameter optimization for the support vector machine tuned the kernel parameter sigma and the cost parameter. Figure 4-28 contains a 3D plot, while Figure 4-29 contains a 2D plot of the support vector machine parameter optimization for the model with all the variables. Increasing the cost clearly led to a decline in the model’s


performance. Also, in models with a higher sigma value, the error increased exponentially as the cost parameter took higher values.

Figure 4-28. A 3D plot of the support vector machine parameter optimization



Figure 4-29. A 2D plot of the support vector machine parameter optimization

Table 4-22 summarizes the best-performing univariate and multivariate models.

The univariate ARMA (8,8) model had the lowest average RMSE (10.72) and the lowest average MAE (8.45) among all the models, including the multivariate ones, although the margins were small. As a result, the ARMA was selected for the final simulation and validation process.

Table 4-22. Summary of the best performing models (unit: frequency of projects)

Error term  Model       Feature set                  1      2      3      4      5      6      7      Average
RMSE        ARMA        Univariate                   10.88  11.75  11.78  11.43  10.69  9.66   8.81   10.72
RMSE        LSTM        Univariate                   10.08  11.02  11.07  11.65  10.50  10.29  10.67  10.75
RMSE        GLM         Pruned                       9.78   11.94  13.69  13.14  10.94  10.27  10.87  11.52
RMSE        Neural net  Level two nonlinear pruning  9.87   10.99  12.01  11.91  11.06  10.36  10.90  11.01
RMSE        SVM         Level two linear pruning     10.00  11.00  11.78  11.98  11.16  10.61  10.89  11.06
MAE         ARMA        Univariate                   8.50   9.19   9.05   9.30   8.80   7.29   7.03   8.45
MAE         LSTM        Univariate                   7.46   8.50   8.58   9.70   8.57   8.56   8.72   8.58
MAE         GLM         Pruned                       10.80  8.56   8.01   8.25   10.00  8.60   11.28  9.36
MAE         Neural net  Level two nonlinear pruning  7.15   7.70   8.94   9.29   8.60   9.08   8.69   8.49
MAE         SVM         Level two linear pruning     7.35   8.21   8.92   9.41   8.61   8.16   8.58   8.46

Final Model Diagnostic Checks

After choosing the final model (ARMA (8,8)), it was necessary to run a diagnostic

analysis on it to verify its stability. As the best model was an ARMA model, and as the

data consisted of time series, an autocorrelation test and a Box-Ljung (portmanteau) test were conducted on the ARMA model’s residuals. Figure 4-30 depicts the

correlogram of the ARMA model’s residuals in the seven cross-sections of the dataset

used for cross-validation. There was no evidence of a high correlation or a correlation

beyond the significance boundaries. As a result, the ARMA model passed the

autocorrelation test.


Figure 4-30. Residual autocorrelations a) cross-section 1 b) cross-section 2 c) cross-section 3 d) cross-section 4 e) cross-section 5 f) cross-section 6 g) cross-section 7

After the autocorrelation at each lag was examined, the Box-Ljung test was used to assess the overall randomness of the residuals for each model. The Box-Ljung test is a statistical measure that checks the goodness of fit of a time-series model. A small p-value suggests significant autocorrelation in the residuals, while a high p-value indicates no significant autocorrelation and is therefore consistent with random errors. Table 4-23 contains the results of the Box-Ljung test for the seven data sections from the cross-validation method. All the p-values were above 0.9. Thus, the model did not exhibit a significant lack of fit.
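The Box-Ljung statistic and its p-value can be sketched with the standard library alone. The statistic is Q = n(n+2) times the sum over lags k of r_k^2 / (n - k), where r_k is the lag-k sample autocorrelation of the residuals. The chi-squared tail probability below uses a closed form that holds only for an even number of degrees of freedom, which covers the df = 16 case reported here; the function names are illustrative and this is not the routine used in the study.

```python
import math

def autocorrelation(series, lag):
    """Sample autocorrelation of a series at the given lag."""
    n = len(series)
    mean = sum(series) / n
    denominator = sum((v - mean) ** 2 for v in series)
    numerator = sum((series[t] - mean) * (series[t - lag] - mean)
                    for t in range(lag, n))
    return numerator / denominator

def ljung_box_q(residuals, max_lag):
    """Box-Ljung Q statistic over lags 1..max_lag."""
    n = len(residuals)
    return n * (n + 2) * sum(autocorrelation(residuals, k) ** 2 / (n - k)
                             for k in range(1, max_lag + 1))

def chi2_sf_even_df(x, df):
    """Upper-tail chi-squared probability; closed form valid for even df only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for j in range(1, df // 2):
        term *= half / j
        total += term
    return math.exp(-half) * total

# Cross-validation section 1: chi-squared = 7.20 with df = 16.
p_section_1 = chi2_sf_even_df(7.20, 16)
print(round(p_section_1, 3))
```

A p-value this close to 1 means the observed Q is far below what would signal leftover autocorrelation, so there is no evidence against the hypothesis that the residuals are random.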


Table 4-23. Result of Box-Ljung test

Statistic    1      2      3      4      5      6      7
Chi-squared  7.20   8.62   8.59   9.15   4.68   4.95   4.90
df           16     16     16     16     16     16     16
p-value      0.969  0.928  0.929  0.907  0.997  0.996  0.996

Cost and Duration Characterization

Cost and duration were the two variables sampled from distributions fitted to past projects. A Pearson correlation test yielded a correlation coefficient of 0.662 (p < 0.001) between duration and cost at the project level. Thus, there was a moderately strong linear relationship between the two variables, and that link was taken into consideration when sampling from the cost and duration distributions. Figure 4-31 contains a scatterplot of project durations and costs, with the best-fit line illustrating the association between the two variables. The R² value for the complete dataset was 0.4358, meaning that the linear fit explained less than half of the variance in cost. However, the figure also makes it clear that several data points were outliers. These introduced a high level of error that drastically reduced the R² value. For example, removing the top 16 outliers from the 2,816 data points increased the R² value to 0.7.
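The link between the Pearson coefficient and the R² of a straight-line fit can be checked numerically: for a simple linear regression with one predictor, R² equals the square of the Pearson correlation. The sketch below uses hypothetical duration and cost values, not the FDOT records.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def linear_fit_r_squared(x, y):
    """R-squared of the ordinary least squares line y = slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    slope = (sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
             / sum((a - mean_x) ** 2 for a in x))
    intercept = mean_y - slope * mean_x
    sse = sum((b - (slope * a + intercept)) ** 2 for a, b in zip(x, y))
    sst = sum((b - mean_y) ** 2 for b in y)
    return 1.0 - sse / sst

# Hypothetical durations (days) and costs (millions of dollars).
durations = [100.0, 200.0, 300.0, 400.0, 500.0]
costs = [1.0, 2.5, 2.0, 4.5, 4.0]

r = pearson_r(durations, costs)
print(r ** 2, linear_fit_r_squared(durations, costs))
```

Because both numbers rest on squared deviations, a handful of extreme projects can pull them down sharply, which is the outlier effect described above.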


Figure 4-31. Scatterplot illustrating the relationship between duration and cost

Figure 4-32 plots the cumulative cost per month, along with the quarterly and

annual MA for the 12-year period from 2003-2015. The figure highlights a decreasing

trend in the dollar value of the projects in more recent years.

Figure 4-32. Cumulative cost per month
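The quarterly and annual moving averages shown in these plots can be sketched as a trailing window mean over 3 and 12 months, respectively. Whether the original plots used a trailing or a centered window is not stated, so the trailing form is an assumption, and the monthly values are hypothetical.

```python
def trailing_moving_average(values, window):
    """Mean of the current and previous (window - 1) values.

    Positions without a full window are returned as None.
    """
    averages = []
    for i in range(len(values)):
        if i + 1 < window:
            averages.append(None)
        else:
            averages.append(sum(values[i - window + 1:i + 1]) / window)
    return averages

# Hypothetical monthly totals.
monthly = [3, 6, 9, 12]
print(trailing_moving_average(monthly, 3))
```

A longer window smooths out more of the month-to-month noise, so the annual series reveals the long-run trend while the quarterly series keeps shorter swings visible.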

[Figure 4-31 axes: duration (days) on the horizontal axis and cost (dollars) on the vertical axis; fitted line y = 42292x - 5E+06, R² = 0.4358.]

[Figure 4-32 series: cost per month (millions of dollars), quarterly moving average, and annual moving average, January 2003 to January 2015.]

Figure 4-33 plots the project frequency for each month during the 12-year period (2003-2015). It also presents the quarterly and annual MAs. The trend appears relatively constant, apart from a slight increase in later years. Considering the decline in the combined monthly project budget (Figure 4-32), along with the slight increase in the number of projects per month, individual project budgets have, on average, decreased.

Figure 4-33. Project frequency per month

To identify the distribution functions for project cost and duration, a set of

continuous distribution functions (including: normal, logistic, lognormal, loglogistic,

inverse Gaussian, exponential, beta, gamma, Weibull, Cauchy, uniform, student,

triangular, Laplace, Levy, Rayleigh, Pert, Fréchet, fatigue life, extreme value, Dagum,

Erlang, and hyperbolic secant) were fitted to the same cross-validation data sections

used for the project frequency model training and ranked using the AIC. This method

demonstrated how the distribution function and its representative parameters changed

throughout the dataset and helped us to choose the best representation. Table 4-24

reveals the best-fitted distribution functions for the cross-validation datasets. For all

seven data sections, the inverse Gaussian distribution was the best at representing the

project durations, while the lognormal function most accurately represented the project

costs.
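Ranking candidate distributions by AIC can be sketched for two of the candidates, the lognormal and the inverse Gaussian, because both have closed-form maximum likelihood estimates. AIC is computed as 2k - 2 ln L, with k = 2 parameters for each candidate; a lower value indicates a better trade-off between fit and complexity. The sample below is hypothetical and the function names are illustrative.

```python
import math

def lognormal_mle(data):
    """Maximum likelihood estimates (mean log, standard deviation log)."""
    logs = [math.log(v) for v in data]
    mu = sum(logs) / len(logs)
    sigma = math.sqrt(sum((l - mu) ** 2 for l in logs) / len(logs))
    return mu, sigma

def lognormal_loglik(data, mu, sigma):
    return sum(-math.log(v) - math.log(sigma) - 0.5 * math.log(2 * math.pi)
               - (math.log(v) - mu) ** 2 / (2 * sigma ** 2) for v in data)

def inverse_gaussian_mle(data):
    """Maximum likelihood estimates (mean mu, shape lambda)."""
    n = len(data)
    mu = sum(data) / n
    lam = n / sum(1.0 / v - 1.0 / mu for v in data)
    return mu, lam

def inverse_gaussian_loglik(data, mu, lam):
    return sum(0.5 * math.log(lam / (2 * math.pi * v ** 3))
               - lam * (v - mu) ** 2 / (2 * mu ** 2 * v) for v in data)

def aic(loglik, n_params):
    return 2 * n_params - 2 * loglik

def rank_by_aic(data):
    """Return (name, AIC) pairs sorted from best (lowest AIC) to worst."""
    ln_mu, ln_sigma = lognormal_mle(data)
    ig_mu, ig_lam = inverse_gaussian_mle(data)
    scores = {
        "lognormal": aic(lognormal_loglik(data, ln_mu, ln_sigma), 2),
        "inverse Gaussian": aic(inverse_gaussian_loglik(data, ig_mu, ig_lam), 2),
    }
    return sorted(scores.items(), key=lambda item: item[1])

# Hypothetical project durations in days.
sample = [30.0, 90.0, 150.0, 240.0, 400.0, 900.0]
ranking = rank_by_aic(sample)
print(ranking)
```

Repeating this ranking on each cross-validation section shows whether the preferred family, and not just its parameters, is stable over time.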

[Figure 4-33 plot: monthly project frequency with quarterly and annual moving averages; horizontal axis from January 2003 to September 2014.]

Table 4-24. Best fitted distribution function on cross-validation datasets

Data section  Duration          Cost
1             Inverse Gaussian  Lognormal
2             Inverse Gaussian  Lognormal
3             Inverse Gaussian  Lognormal
4             Inverse Gaussian  Lognormal
5             Inverse Gaussian  Lognormal
6             Inverse Gaussian  Lognormal
7             Inverse Gaussian  Lognormal

A quantitative description of the distribution of the main variables and the

correlations among the parameters such as cost and duration was essential for the

stochastic generation of the project stream and to validate the generator’s results.

Figure 4-34 illustrates the empirical density and cumulative distribution of the project

durations.

Figure 4-34. Empirical density and cumulative distribution of the project durations

Figure 4-35 contains a histogram, a corresponding fitted distribution, and a

cumulative distribution for the durations of FDOT projects. The AIC indicated that an

inverse Gaussian distribution with µ= 244.67 and λ= 273.93 provided the best fit.


Figure 4-35. Fitted distribution function and cumulative distribution of the project durations

Figure 4-36 contains the empirical density and cumulative distribution of the

project costs. A concentration of project costs in a specific region (less expensive

projects) was evident, along with scattered expensive projects.

Figure 4-36. Empirical density and cumulative distribution of the project costs


Figure 4-37 contains a histogram, a corresponding fitted distribution, and a

cumulative distribution for FDOT project costs. The AIC indicated that a lognormal

distribution with (mean log) µ= 14.413319 and (standard deviation log) σ = 1.524961

provided the best fit.

Figure 4-37. Fitted distribution function and cumulative distribution of the project costs

The performances of the various model components presented in this section

indicated the viability of an integrated project stream forecaster to predict, within a

simulation environment, project frequencies and empirical distributions of project

durations and costs. Specifically, the goal was for the generator to produce stochastic

streams of unknown future FDOT projects.


CHAPTER 5 SIMULATION RESULTS AND DISCUSSION

Simulation Results

The project frequency modeling section demonstrated that an ARMA (8,8) model

was the model that produced the best representation of project frequencies. The next

step was to validate the model using the holdout samples (2015 and 2016). Training the

ARMA model on the data from 2003 to 2014 and testing it on the 2015 and 2016 data

resulted in an RMSE of 8.69 and an MAE of 7.47 for the error on the validation set.

These figures were lower than the errors in the model selection tests. This finding may have resulted from having 12 years of data for training; in accordance with what was inferred from figure 4-9, including more training data improves the model’s performance. The lower errors indicated that the selected model performed at least as well as it did in the model selection section. Figure 5-1

outlines the final fitted ARMA model. The predicted values are in blue, and the actual

data are plotted in red. The dark gray refers to the 80% prediction interval, and the light

gray to the 95% prediction interval.

Figure 5-1. The ARMA (8,8) model’s project frequency forecast


It was necessary to run a diagnostic check on the model to ensure that it was

stable and that the error was random. Figure 5-2 illustrates the autocorrelation for the

residuals of the ARMA model. The autocorrelations were within the significance boundaries, and no significant

correlations were found.

Figure 5-2. Autocorrelation plot of the project frequency forecast error

The Box-Ljung test results were as follows: chi-squared = 12.249, df = 16, and p-

value = 0.7267. The p-value was lower than those presented in the model selection

section, but it was still well above the 0.05 level, indicating that the model did not exhibit

a significant lack of fit. Finally, figure 5-3 contains a histogram of the forecast errors and

a normal distribution fitted to them. According to that figure, the errors appeared approximately normally distributed, and the model was acceptable.


Figure 5-3. Histogram of the forecast errors

In the final step of the project frequency modeling process, a simulation of the

ARMA model for the years from 2015 to 2018 was conducted. The results are

presented in figure 5-4.
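Simulating a path from an ARMA model can be sketched as below. The fitted ARMA (8,8) coefficients are not reported in this section, so the example uses a hypothetical ARMA (1,1) with made-up values; rounding the simulated values to non-negative integers to obtain project counts is also an assumption.

```python
import random

def simulate_arma(phi, theta, mean, sigma, n_steps, rng):
    """Simulate x_t = mean + sum(phi_i * d_{t-i}) + e_t + sum(theta_j * e_{t-j}).

    d denotes past deviations from the mean and e denotes Gaussian shocks.
    Past deviations and shocks start at zero.
    """
    past_deviations = [0.0] * len(phi)   # most recent first
    past_shocks = [0.0] * len(theta)     # most recent first
    path = []
    for _ in range(n_steps):
        shock = rng.gauss(0.0, sigma)
        deviation = (sum(c * d for c, d in zip(phi, past_deviations))
                     + shock
                     + sum(c * e for c, e in zip(theta, past_shocks)))
        path.append(mean + deviation)
        past_deviations = [deviation] + past_deviations[:-1]
        past_shocks = [shock] + past_shocks[:-1]
    return path

def to_counts(path):
    """Convert simulated values to non-negative integer project counts."""
    return [max(0, round(v)) for v in path]

# Hypothetical ARMA(1,1) parameters; 48 months covers a four-year horizon.
simulated = simulate_arma([0.5], [0.3], 30.0, 10.0, 48, random.Random(7))
print(to_counts(simulated)[:6])
```

Each call with a different seed yields another equally plausible future path, which is what makes the forecast stochastic rather than a single point estimate.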

Figure 5-4. An example of project frequency simulation

To represent the project costs and durations, a marginal distribution was identified for each variable, and their dependency was measured through a correlation analysis. The aim was to be able to sample from those distributions while taking the

correlation into consideration. One solution was using a copula. A copula represents a

multivariate distribution taking the relationships among the variables into account.

Copulas are functions that help to build multivariate distributions and generate samples

of correlated data. This process can be done by identifying the marginal distributions for

each variable and choosing a copula to construct the multivariate model. By using a

copula, a multivariate distribution was developed for which sampling yielded two values,

one for cost and one for duration. This distribution took into account the underlying

relationship between these two variables.
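The copula sampling idea can be sketched with the standard library. Note the substitution: this sketch uses a Gaussian copula, not the t copula selected later in this chapter, because the Gaussian case needs only the normal distribution. Treating 0.86 as the Gaussian correlation is an assumption made for illustration. The marginal parameters are the ones reported in Chapter 4 (inverse Gaussian for duration, lognormal for cost), and the inverse Gaussian quantile is obtained by bisection because it has no closed form.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()

def inverse_gaussian_cdf(x, mu, lam):
    """CDF of the inverse Gaussian distribution (mean mu, shape lam)."""
    if x <= 0.0:
        return 0.0
    root = math.sqrt(lam / x)
    return (STD_NORMAL.cdf(root * (x / mu - 1.0))
            + math.exp(2.0 * lam / mu) * STD_NORMAL.cdf(-root * (x / mu + 1.0)))

def inverse_gaussian_quantile(p, mu, lam, rel_tol=1e-9):
    """Invert the CDF numerically by bisection."""
    low, high = 0.0, mu
    while inverse_gaussian_cdf(high, mu, lam) < p:
        high *= 2.0
    while high - low > rel_tol * high:
        mid = 0.5 * (low + high)
        if inverse_gaussian_cdf(mid, mu, lam) < p:
            low = mid
        else:
            high = mid
    return 0.5 * (low + high)

def sample_duration_cost(n, rho, ig_mu, ig_lam, log_mu, log_sigma, rng):
    """Draw n correlated (duration, cost) pairs through a Gaussian copula."""
    pairs = []
    for _ in range(n):
        z_duration = rng.gauss(0.0, 1.0)
        z_cost = (rho * z_duration
                  + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0))
        # Map the normal score to a uniform, then through the marginal quantile.
        u = min(max(STD_NORMAL.cdf(z_duration), 1e-12), 1.0 - 1e-12)
        duration = inverse_gaussian_quantile(u, ig_mu, ig_lam)
        # The lognormal quantile of Phi(z) simplifies to exp(mu + sigma * z).
        cost = math.exp(log_mu + log_sigma * z_cost)
        pairs.append((duration, cost))
    return pairs

pairs = sample_duration_cost(50, 0.86, 244.67, 273.93,
                             14.413319, 1.524961, random.Random(42))
print(pairs[0])
```

The copula fixes only the dependence between the two uniforms, so either marginal can be swapped for another family without touching the dependence structure.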

The first step was to identify a suitable copula function for our variables. This

goal was achieved by fitting all the candidate copula functions, ranking them by AIC, and choosing the best representation. Table 5-1 provides the copula function fit results. The t

copula function best represented our data. The best-fitting model was a bivariate t

copula with the following parameters: par = 0.86, df = 5.51, and tau = 0.66.
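The reported tau is consistent with the reported correlation parameter. For elliptical copulas, which include the t and Gaussian copulas, Kendall's tau and the correlation parameter are related by tau = (2 / pi) * arcsin(par). The short check below applies that identity to the fitted value; the function name is illustrative.

```python
import math

def kendall_tau_from_correlation(par):
    """Kendall's tau implied by the correlation parameter of an elliptical copula."""
    return 2.0 / math.pi * math.asin(par)

tau = kendall_tau_from_correlation(0.86)
print(round(tau, 2))
```

Kendall's tau is a rank-based measure, so unlike the Pearson coefficient it is unaffected by the heavy right tails of the cost and duration marginals.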

Table 5-1. Copula functions fit results.

Function AIC

t -3,793.7032

Gumbel -3,719.3256

Gaussian -3,538.6109

Frank -3,399.4464

GumbelR -3,359.1308

ClaytonR -3,236.5057

Clayton -2,610.7184

Visualizing that information, figure 5-5 depicts the probability density function of

the bivariate copula that represented the project costs and durations. Figure 5-6

provides a 3D scatterplot of the probability density of the data sampled from the defined

copula.


Figure 5-5. The copula’s probability density function

Figure 5-6. The probability density of the data sampled from the defined copula

To demonstrate a means of sampling from the multivariate distribution and its

representativeness of future project streams, the results of a random sampling are

plotted against the actual data from 2015 and 2016 in figure 5-7. Figure 5-7 is useful in

terms of carrying out a visual inspection of the sampling method.


Figure 5-7. A sampled dataset plotted against actual values

To compare the sampled data and the actual data in more quantitative terms, the mean

and standard deviation of the actual data and the sampled values were examined.

Table 5-2 offers the mean and standard deviation of duration and cost for the actual and

simulated data. The results indicated that both simulated variables were an

acceptable representation of the actual data. That said, the duration metrics were in

closer alignment than were the cost values. This difference can be justified, as the

distribution function used to represent the duration variables was a better fit based on

the AIC metric. These comparisons were based on one sample. While other iterations

could yield different numbers, the larger picture would not change.

Table 5-2. Mean and standard deviation of the actual and simulated data

Variable                  Mean         Standard deviation
Actual duration (day)     295.06       256.82
Simulated duration (day)  259.59       242.43
Actual cost ($)           7294763.00   13164990.00
Simulated cost ($)        6704132.00   17846240.00
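One way to put the actual and simulated summary statistics on a common scale is the relative difference, (simulated - actual) / actual. The sketch below applies it to the values in Table 5-2; the dictionary layout and names are illustrative.

```python
def relative_difference(simulated, actual):
    """Signed difference of the simulated value relative to the actual value."""
    return (simulated - actual) / actual

# (actual, simulated) pairs taken from Table 5-2.
TABLE_5_2 = {
    "duration mean": (295.06, 259.59),
    "duration standard deviation": (256.82, 242.43),
    "cost mean": (7294763.00, 6704132.00),
    "cost standard deviation": (13164990.00, 17846240.00),
}

differences = {name: relative_difference(simulated, actual)
               for name, (actual, simulated) in TABLE_5_2.items()}

for name, value in differences.items():
    print(f"{name}: {value:+.1%}")
```

Because these figures come from a single sample, repeating the calculation over many draws would show how much of each gap is sampling noise.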


To measure the model’s performance and validity in more quantitative terms, the

distribution functions identified in the model characterization section were compared

with the validation dataset to assess their ability to represent future project streams.

Figure 5-8 contains the kernel density estimates for the duration data. The blue curve is

the training data, and the red curve is the validation data from 2015 and 2016. It is

apparent that the small values were denser in the training dataset.
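A kernel density estimate of the kind plotted here can be sketched with a Gaussian kernel. The bandwidth is passed in explicitly because the bandwidth rule used for the figures is not stated; the durations below are hypothetical.

```python
import math

def gaussian_kde(data, bandwidth):
    """Return a function that evaluates a Gaussian kernel density estimate."""
    scale = 1.0 / (len(data) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return scale * sum(math.exp(-0.5 * ((x - v) / bandwidth) ** 2)
                           for v in data)

    return density

# Hypothetical project durations in days.
density = gaussian_kde([60.0, 90.0, 120.0, 400.0], 50.0)
print(density(100.0), density(400.0))
```

Evaluating the training and validation estimates on the same grid of durations makes differences in where the mass concentrates directly comparable.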

Figure 5-8. Kernel density estimates for the duration data

The distribution functions used in the characterization section were fitted onto the

validation data to assess whether the same distribution function was the best

representation of the project durations. Table 5-3 contains the goodness-of-fit results for

the five best distributions in terms of their AIC ranks. The inverse Gaussian distribution

was found to be the best fit, which was the same as in the training dataset.


Table 5-3. Goodness of fit for the duration distribution function

Distribution AIC

Inverse Gaussian 4997.2206

Frechet 4997.4358

Lognormal 4999.0683

Dagum 4999.4734

Fatigue Life 5002.0779

Another means of testing the goodness of fit of the proposed distribution

consisted of comparing the properties of each best fitting distribution. Table 5-4 shows

the mean, mode, median, standard deviation, skewness, kurtosis, and AIC for an

inverse Gaussian distribution fitted to the training and validation datasets. The AIC reported for the training distribution was obtained by evaluating that distribution, with its parameters unchanged, on the validation data. The small difference indicated that the

chosen distribution fit the data appropriately. Although the values were not exactly

equal, they were very close, demonstrating that the selected distribution successfully

represented future project durations.

Table 5-4. Comparison of the best-fitting distribution’s properties of project duration

Property              Inverse Gaussian train  Inverse Gaussian test
Mean (day)            244.6700                295.0601
Mode (day)            81.2425                 113.7535
Median (day)          171.1620                217.7877
Std. Deviation (day)  231.2338                253.1105
Skewness              2.8353                  2.5677
Kurtosis              16.3978                 13.9886
AIC                   5030.2328               4997.2206
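Most of the properties in Table 5-4 follow directly from the two inverse Gaussian parameters through standard closed-form expressions. The sketch below evaluates them for the training fit (µ = 244.67, λ = 273.93). The median is omitted because it has no closed form, and the kurtosis is the non-excess value (3 plus the excess kurtosis), which is the convention that matches the table.

```python
import math

def inverse_gaussian_summary(mu, lam):
    """Closed-form properties of an inverse Gaussian distribution."""
    ratio = mu / lam
    return {
        "mean": mu,
        "mode": mu * (math.sqrt(1.0 + 2.25 * ratio ** 2) - 1.5 * ratio),
        "std": math.sqrt(mu ** 3 / lam),
        "skewness": 3.0 * math.sqrt(ratio),
        "kurtosis": 3.0 + 15.0 * ratio,
    }

train = inverse_gaussian_summary(244.67, 273.93)
for name, value in train.items():
    print(f"{name}: {value:.4f}")
```

Because every property depends on only µ and λ, comparing the training and validation columns amounts to comparing two pairs of fitted parameters.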

Regarding the final comparison, figure 5-9 contains a histogram of the project

durations in the validation dataset and the previously discussed representative


distributions. The blue curve is the distribution compiled from the training dataset, and

the green one is the distribution developed on the basis of the validation dataset.

Figure 5-9. Comparison of the project durations and representative distributions

Figure 5-10 offers the kernel density estimates of the cost data. The blue curve is

the training data (2003 to 2014), and the red curve is the validation data from 2015 and

2016. The small values were denser in the training dataset, and higher values were

associated with more noise.

Figure 5-10. Kernel density estimates for the cost data


The distribution functions used in the characterization section were fitted onto the

validation data to gauge whether the same distribution function was the best

representative of the project costs. Table 5-5 illustrates the goodness-of-fit results for

the best five distributions ranked by AIC. The inverse Gaussian distribution ranked first (lowest AIC), closely followed by the lognormal distribution. The best distribution in the

training dataset was the lognormal one.

Table 5-5. Goodness of fit for the cost distribution function

Distribution AIC

Inverse Gaussian 12704.2243

Lognormal 12705.1336

Loglogistic 12709.5431

Dagum 12711.5327

Frechet 12720.2120

The next step in testing the goodness of fit of the proposed distribution was comparing the properties of each best fit. The goodness-of-fit test indicated that the inverse Gaussian distribution was a slightly better fit for the validation dataset. Table 5-6 provides the mean, mode, median, standard deviation, skewness, kurtosis, and AIC for the lognormal distribution fitted onto the training and validation datasets, and for the inverse Gaussian distribution fitted onto the validation dataset. The AIC reported for the training distribution was obtained by evaluating that distribution, with its parameters unchanged, on the validation data. The small difference between the three distributions’ properties demonstrated that the chosen lognormal distribution fit the data appropriately. Putting the inverse Gaussian distribution aside, however, the two lognormal distributions did not have identical properties. That said, these values were close to each other, demonstrating that the chosen distribution effectively represented the future project costs.


Table 5-6. Comparison of the best-fitting distribution’s properties of project cost

Property            Lognormal train  Lognormal test  Inverse Gaussian test
Mean ($)            5752815.816      7563901.0963    7294762.9285
Mode ($)            179799.9279      307830.4175     445507.2042
Median ($)          1812105.524      2627870.3292    2468295.9731
Std. Deviation ($)  17333507.31      20375708.7638   14770719.9266
Skewness            36.393           27.5152         5.9198
Kurtosis            12666.7026       5907.5680       61.4064
AIC                 12722.4063       12705.1336      12704.2243

Figure 5-11 contains a histogram of project costs in the validation dataset and

the previously discussed representative distributions. The blue curve represents the

distribution compiled from the training dataset, while the green one refers to the

distribution created from the validation dataset. It is apparent that the two closely

followed each other.

Figure 5-11. Comparison of project costs and representative distributions

Analysis and Discussion

This chapter explained how the simulation and forecasting of future project

streams works. The project frequency modeling results demonstrated that the ARMA

model is stable and lacks systematic errors. The magnitude of error was, in fact, lower

than that found in the model selection and training set, and the decline is attributable to

the use of additional training data and the correspondingly more accurate coefficients.


The multivariate distribution sampling procedure was tested and verified.

Furthermore, the marginal distributions were compared against the validation dataset,

and the goodness of fit was found to be similar to that of the best-fitting distribution for

the validation dataset. In conclusion, the hold-out dataset validated the performance of

the project stream generator.

The proposed method is not a standalone portfolio management framework.

Rather, it should be considered a supplementary component to the current PPM

frameworks that is capable of extending those models’ planning horizons. Figure 5-12

demonstrates how the proposed method could be implemented and how the research

outcomes could be utilized. Historical data are used as the model’s input. Then, the

number of projects, along with their cost and duration, is forecast as the model’s output.

Finally, the output of the proposed model, along with known projects (advertised

projects) can be used as inputs for those PPM models currently implemented by a

company. For instance, the model’s output could be used as an input for the models of

Liu and Wang (2011) or Archer and Ghasemzadeh (1999) to extend their strategic

planning horizons.

Figure 5-12. Example of the functioning of the proposed method.


CHAPTER 6 CONCLUSIONS AND RECOMMENDATIONS

This research has outlined an extension to the existing project portfolio planning

framework to enable users to consider unknown (but statistically quantifiable) projects,

along with known projects, for strategic planning purposes. This research project

focused on developing, validating, and testing a stream generator that stochastically

forecasts possible samples of future FDOT projects (in terms of time of occurrence,

expected duration, and expected cost) based on historical data and economic

indicators. The proposed model should be considered as a supplement to the current

PPM framework, one that can extend the portfolio planning horizon to improve strategic

planning.

This research discussed a general modeling approach with multiple potential

training and validating options, and it presented the findings on developing, validating,

and testing a stream generator to forecast FDOT projects in terms of their time of

occurrence, expected duration, and anticipated cost. A set of potentially relevant

predictors, including macroeconomics metrics and construction indices, was identified to

test further improvements to the model using multivariate methods. This research

demonstrated how univariate and multivariate models can be used to forecast project

frequencies and project cost and duration distributions, and it also discussed the

relationship between these latter two variables.

With the goal of forecasting project frequency, a set of univariate models was

tested, and the ARMA model was found to be the top performer. To move from

univariate modeling to multivariate modeling, a set of exploratory data analyses was

conducted, and the results were used to prune the multivariate models’ features and


identify appropriate models for forecasting. Then, a generalized linear method, a

multilayer perceptron, and a support vector machine were trained and tested on the

identified independent variables. This procedure involved parameter tuning and feature

selection with the aim of identifying the best explanatory variables and parameters. The

generalized linear model indicated that the best explanatory variables were the

unemployment rate in the construction sector and the Brent oil price. However, that

model’s performance fell significantly short of that of the ARMA model. The multilayer

perceptron performed better than the generalized linear model. Its best explanatory

variables were the number of job openings in the construction sector, the unemployment

rate in that industry, and the crude oil price. However, the performance of the multilayer

perceptron was substantially lower than that of the ARMA model. The support vector

machine performed similarly to the multilayer perceptron. Its best explanatory

variables were the unemployment rate in the construction sector, Florida's employment,

the unemployment rate, Florida’s unemployment rate, the number of building permits,

and Florida unemployment. Overall, employment and oil prices play an important role in

the frequency of projects. The number of building permits was also found to be

significant, which may be connected to that factor’s effect on employment. However, the

multivariate models failed to improve on the benchmark model’s performance. As a

result, the best-performing model (ARMA) was chosen to be the final component in the

simulation. The lack of improvement may be due to the size of the dataset. Although the

whole dataset included around 3,100 projects spanning 14 years, that period translates

into only 168 monthly data points for forecasting project frequencies, which was

probably too few for the machine learning tools to perform at their best.
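
As an illustration of the univariate benchmark discussed above, the following minimal sketch fits an ARMA(1,1) model to a monthly count series and produces a 12-month forecast. It is not the procedure used in this study: the model order, the conditional-sum-of-squares grid search, the function names, and the synthetic 168-month series (with a hypothetical mean of 18 projects per month) are all assumptions made for the example.

```python
import random


def css_residuals(y, c, phi, theta):
    """One-step residuals of y_t = c + phi*y_{t-1} + theta*e_{t-1} + e_t."""
    resid = [0.0]  # the first shock is unobserved and set to zero
    for t in range(1, len(y)):
        pred = c + phi * y[t - 1] + theta * resid[-1]
        resid.append(y[t] - pred)
    return resid


def fit_arma11(y):
    """Grid search minimizing the conditional sum of squared residuals."""
    mean = sum(y) / len(y)
    grid = [i / 20 for i in range(-19, 20)]  # -0.95 ... 0.95
    best = None
    for phi in grid:
        c = mean * (1 - phi)  # keeps the long-run mean at the sample mean
        for theta in grid:
            sse = sum(e * e for e in css_residuals(y, c, phi, theta))
            if best is None or sse < best[0]:
                best = (sse, c, phi, theta)
    return best


def forecast(y, c, phi, theta, horizon):
    """Iterate the fitted recursion; future shocks are set to zero."""
    shock = css_residuals(y, c, phi, theta)[-1]
    prev, path = y[-1], []
    for _ in range(horizon):
        prev = c + phi * prev + theta * shock
        path.append(prev)
        shock = 0.0
    return path


def simulate_counts(n=168, seed=1):
    """Synthetic monthly project counts (hypothetical, not FDOT data)."""
    rng = random.Random(seed)
    y, prev, shock = [], 18.0, 0.0
    for _ in range(n):
        e = rng.gauss(0, 3)
        prev = 18.0 * (1 - 0.6) + 0.6 * prev + 0.3 * shock + e
        y.append(prev)
        shock = e
    return y


counts = simulate_counts()
sse, c, phi, theta = fit_arma11(counts)
next_year = forecast(counts, c, phi, theta, horizon=12)
```

With only 168 observations, a model with two coefficients can be estimated with reasonable stability, whereas models with many tunable parameters have far less data per parameter. This is consistent with the explanation offered above for the benchmark's advantage.
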

Based on the 12 years of data, the inverse Gaussian distribution was found to be

the best representation of the project durations, while the lognormal distribution most

accurately represented the project costs. The correlation between project costs and

durations was found to be significant, and this relationship was incorporated into the

simulation by using a copula to build a multivariate distribution for sampling. The

identified distributions were verified by comparing them to the verification data from

2015 and 2016. The results indicated that neither the individual components nor the overall

model produced systematic errors. Finally, it was verified that the proposed

framework generated representative future FDOT design-bid-build project streams.
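
The joint sampling of correlated durations and costs described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in this study: a Gaussian copula is assumed (the copula family is not specified in this summary), and the parameter values, units, and function names are hypothetical.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def inv_gauss_cdf(x, mu, lam):
    """CDF of the inverse Gaussian distribution (mean mu, shape lam)."""
    if x <= 0:
        return 0.0
    a = math.sqrt(lam / x)
    return (STD_NORMAL.cdf(a * (x / mu - 1))
            + math.exp(2 * lam / mu) * STD_NORMAL.cdf(-a * (x / mu + 1)))


def inv_gauss_ppf(u, mu, lam):
    """Quantile function obtained by bisection on the CDF."""
    lo, hi = 0.0, mu
    while inv_gauss_cdf(hi, mu, lam) < u:
        hi *= 2  # expand the bracket until it contains the quantile
    for _ in range(80):
        mid = (lo + hi) / 2
        if inv_gauss_cdf(mid, mu, lam) < u:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def sample_projects(n, rho, dur_mu, dur_lam, cost_mu, cost_sigma, seed=0):
    """Draw (duration, cost) pairs linked by a Gaussian copula (assumed)."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        # Correlated standard normals with correlation rho.
        z1 = rng.gauss(0, 1)
        z2 = rho * z1 + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1)
        # Map the first normal to a uniform, then to the duration marginal.
        u1 = min(max(STD_NORMAL.cdf(z1), 1e-12), 1 - 1e-12)
        duration = inv_gauss_ppf(u1, dur_mu, dur_lam)
        # The lognormal quantile of Phi(z2) simplifies to exp(mu + sigma*z2).
        cost = math.exp(cost_mu + cost_sigma * z2)
        pairs.append((duration, cost))
    return pairs


# Hypothetical parameters: durations in days, costs in dollars.
projects = sample_projects(5, rho=0.7, dur_mu=200.0, dur_lam=300.0,
                           cost_mu=13.0, cost_sigma=1.0, seed=42)
```

Each marginal keeps its own fitted distribution while the copula carries the dependence, so the duration and cost models can be re-estimated independently of the correlation structure.
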

The proposed method is not a standalone portfolio management framework.

Instead, it constitutes a supplement to current PPM frameworks that is capable of

extending the planning horizon. The complete framework will allow users to examine

different bidding and project-selection strategies in terms of their impact on a company’s

portfolio and its future resource demands. Furthermore, it can support the selection of

better strategies and resource distributions in the future. Finally, taking into

account uncertainties in future project streams might decrease the extent of continuous

adjustments to a company’s portfolio plan due to the addition of new projects.

This research could be advanced by testing the model in a real-life scenario and

demonstrating its capabilities and effectiveness. Another recommendation for future

work is to expand the model to include factors such as contingency funds and different

value-added definitions. This research could also be extended by adding other elements

(e.g., different project types) to the project stream generator.

It is strongly recommended to apply the same procedure and framework in

different contexts, including more fluid markets where the environmental uncertainties

have more impact on the future project streams. Such an evaluation could improve the

performance of the multivariate models and make progress towards achieving this

study’s overall aim of identifying and capturing the impact of environmental uncertainties

on future project streams.

LIST OF REFERENCES

Araúzo, J. A., Pajares, J., and Lopez-Paredes, A. (2010). “Simulating the dynamic scheduling of project portfolios.” Simulation Modelling Practice and Theory, 18(10), 1428–1441.

Archer, N., and Ghasemzadeh, F. (1999). “An integrated framework for project portfolio selection.” International Journal of Project Management, 17(4), 207–216.

Archer, N., and Ghasemzadeh, F. (2004). “Project Portfolio Selection and Management.” The Wiley guide to managing projects, 237–255.

Association for Project Management. (2006). APM Body of Knowledge. Association for Project Management.

Auyang, S. (2005). “Synthetic analysis of complex systems I-Theories.” <http://www2.econ.iastate.edu/tesfatsi/Auyang.ComplexSystemsTheories.htm> (Mar. 7, 2014).

Bengtsson, M., Müllern, T., Söderholm, A., and Wåhlin, N. (2009). A grammar of organizing. Edward Elgar Publishing.

Blichfeldt, B. S., and Eskerod, P. (2008). “Project portfolio management - There’s more to it than what management enacts.” International Journal of Project Management, 26(4), 357–365.

Browning, T. R., and Yassine, A. A. (2010). “Resource-constrained multi-project scheduling: Priority rule performance revisited.” International Journal of Production Economics, Elsevier, 126(2), 212–228.

Cao, L. J., and Tay, F. E. H. (2003). “Support vector machine with adaptive parameters in financial time series forecasting.” IEEE Transactions on Neural Networks, 14(6), 1506–1518.

Carazo, A. F., Gómez, T., Molina, J., Hernández-Díaz, A. G., Guerrero, F. M., and Caballero, R. (2010). “Solving a comprehensive model for multiobjective project portfolio selection.” Computers and Operations Research, 37(4), 630–639.

Cargnoni, C., Müller, P., and West, M. (1997). “Bayesian Forecasting of Multinomial Time Series through Conditionally Gaussian Dynamic Models.” Journal of the American Statistical Association, 92(438), 640–647.

Choubin, B., Khalighi-Sigaroodi, S., Malekian, A., and Kişi, Özgür. (2016). “Multiple linear regression, multi-layer perceptron network and adaptive neuro-fuzzy inference system for forecasting precipitation based on large-scale climate signals.” Hydrological Sciences Journal, Taylor & Francis, 61(6), 1001–1009.

Cleden, D. (2009). Managing project uncertainty. Gower Publishing, Ltd.

Cooper, R., Edgett, S., and Kleinschmidt, E. (2001). “Portfolio management for new product development: results of an industry practices study.” R and D Management, 31(4), 361–380.

Cooper, R. G., Edgett, S. J., and Kleinschmidt, E. J. (1997). “Portfolio management in new product development: Lessons from the leaders—I.” Research-Technology Management, 40(5), 16–28.

Daft, R. L. (2009). “Organization Theory and Design.” South-Western Cengage Learning, 138–157.

Dahlgren, J., and Söderlund, J. (2002). “Management control in multi-project organizations: a study of R&D companies.” at IRNOP V.

Danilovic, M., and Sandkull, B. (2002). “Managing Complexity and Uncertainty in a Multiproject Environment.” IRNOP V, International Research Network on Organizing By Projects, Proceedings.

Danilovic, M., and Sandkull, B. (2005). “The use of dependence structure matrix and domain mapping matrix in managing uncertainty in multiple project situations.” International Journal of Project Management, 23(3), 193–203.

Duncan, R. B. (1972). “Characteristics of Organizational Environments and Perceived Environmental Uncertainty.” Administrative Science Quarterly, 313–327.

Dye, L. D., and Pennypacker, J. S. (1999). Project portfolio management: selecting and prioritizing projects for competitive advantage. Center for Business Practices.

Elonen, S., and Artto, K. A. (2003). “Problems in managing internal development projects in multi-project environments.” International Journal of Project Management, 21(6), 395–402.

Enders, W. (2015). Applied econometric time series.

Engwall, M. (2003). “No project is an island: Linking projects to history and context.” Research Policy, 32(5), 789–808.

Engwall, M., and Jerbrant, A. (2003). “The resource allocation syndrome: The prime challenge of multi-project management?” International Journal of Project Management, 21(6), 403–409.

Exterkate, P., Groenen, P. J. F., Heij, C., and van Dijk, D. (2016). “Nonlinear forecasting with many predictors using kernel ridge regression.” International Journal of Forecasting, Elsevier B.V., 32(3), 736–753.

Floricel, S., and Miller, R. (2003). “An exploratory comparison of the management of innovation in the New and Old economies.” R&D Management, 33(5), 501–525.

Gers, F. A., Eck, D., and Schmidhuber, J. (2002). “Applying LSTM to time series predictable through time-window approaches.” Neural Nets WIRN Vietri-01, Springer, 193–200.

Ghasemzadeh, F., Archer, N., and Iyogun, P. (1999). “A zero-one model for project portfolio selection and scheduling.” Journal of the Operational Research Society, 50(7), 745–755.

Githens, G. D. (2002). “Programs, Portfolios, and Pipelines: How to Anticipate Executives’ Strategic Questions.” Managing Multiple Projects: Planning, Scheduling, and Allocating Resources for Competitive Advantage, CRC Press, 83–90.

Goh, A. T. C. (1995). “Back-propagation neural networks for modeling complex systems.” Artificial Intelligence in Engineering, Computational Mechanics Publications, Ashurst, Southampton, England, 9(3), 143.

Gray, C. F., and Larson, E. W. (2008). Project Management: The Managerial Process. McGraw-Hill.

Henriksen, A. D., and Traynor, A. J. (1999). “A practical r&d project-selection scoring tool.” IEEE Transactions on Engineering Management, 46(2), 158–170.

Kerzner, H. (2009). Project Management: a Systems Approach To Planning, Scheduling and Control. John Wiley & Sons.

Killen, C. P., Hunt, R. A., and Kleinschmidt, E. J. (2007). “Managing the New Product Development Project Portfolio: A Review of the Literature and Empirical Evidence.” PICMET ’07 - 2007 Portland International Conference on Management of Engineering & Technology, IEEE, 1864–1874.

Kohzadi, N., Boyd, M. S., Kermanshahi, B., and Kaastra, I. (1996). “A comparison of artificial neural network and time series models for forecasting commodity prices.” Neurocomputing, 10(2), 169–181.

Levine, H. A. (2005). Project portfolio management: A Practical Guide to Selecting Projects, Managing Portfolios, and Maximizing Benefit. San Francisco, CA: Jossey-Bass., John Wiley & Sons.

Li, J., and Chen, W. (2014). “Forecasting macroeconomic time series: LASSO-based approaches and their forecast combinations with dynamic factor models.” International Journal of Forecasting, Elsevier B.V., 30(4), 996–1015.

Liu, S.-S., and Wang, C.-J. (2011). “Optimizing project selection and scheduling problems with time-dependent resource constraints.” Automation in Construction, 20(8), 1110–1119.

Markowitz, H. (1952). “Portfolio selection.” The Journal of Finance, 7(1), 77–91.

Martinsuo, M. (2013). “Project portfolio management in practice and in context.” International Journal of Project Management, 31(6), 794–803.

Martinsuo, M., and Lehtonen, P. (2007). “Role of single-project management in achieving portfolio management efficiency.” International Journal of Project Management, 25(1), 56–65.

McFarlan, W. F. (1981). “Portfolio approach to information systems.” Harvard Business Review, 59(5), 142–150.

De Meyer, A., Loch, C. H., and Pich, M. T. (2002). “Managing project uncertainty: From variation to chaos.” IEEE Engineering Management Review, 30(3), 91–98.

Olden, J. D., and Jackson, D. A. (2002). “Illuminating the ‘black box’: a randomization approach for understanding variable contributions in artificial neural networks.” Ecological Modelling, Elsevier, Amsterdam, 154(1), 135.

Pennypacker, J. S., and Dye, L. D. (2002). “Project Portfolio Management and Managing Multiple Projects: Two Sides of the Same Coin?” Managing Multiple Projects: Planning, Scheduling, and Allocating Resources for Competitive Advantage, CRC Press, 1–10.

Persson, J. S., Mathiassen, L., Boeg, J., Madsen, T. S., and Steinson, F. (2009). “Managing Risks in Distributed Software Projects: An Integrative Framework.” IEEE Transactions on Engineering Management, 56(3), 508–532.

Petit, Y., and Hobbs, B. (2010). “Project portfolios in dynamic environments: Sources of uncertainty and sensing mechanisms.” Project Management Journal, 41(4), 46–58.

Project Management Institute. (2013a). The Standard for Portfolio Management. Project Management Institute.

Project Management Institute. (2013b). A Guide to the Project Management Body of Knowledge (PMBOK). Project Management Institute.

Rajegopal, S., Waller, J., and McGuin, P. (2007). Project Portfolio Management: Leading the Corporate Vision. Palgrave Macmillan.

Scott, W. R. (2002). Organizations: Rational, Natural, and Open Systems. Prentice Hall.

Shahandashti, S. M., and Ashuri, B. (2016). “Highway Construction Cost Forecasting Using Vector Error Correction Models.” Journal of Management in Engineering, 32(2), 4015040.

Teller, J. (2013). “Portfolio risk management and its contribution to project portfolio success: An investigation of organization, process, and culture.” Project Management Journal, 44(2), 36–51.

Thomas Ng, S., Cheung, S. O., Martin Skitmore, R., Lam, K. C., and Wong, L. Y. (2000). “Prediction of tender price index directional changes.” Construction Management and Economics, 18(7), 843–852.

Thompson, J. D. (1967). Organizations in action: Social science bases of administrative theory. Transaction publishers.

Turner, J. R., and Müller, R. (2003). “On the nature of the project as a temporary organization.” International Journal of Project Management, 21(1), 1–8.

Unger, B. N., Kock, A., Gemünden, H. G., and Jonas, D. (2012). “Enforcing strategic fit of project portfolios by project termination: An empirical study on senior management involvement.” International Journal of Project Management, Elsevier Ltd, 30(6), 675–685.

Voyant, C., Notton, G., Darras, C., Fouilloy, A., and Motte, F. (2017). “Uncertainties in global radiation time series forecasting using machine learning: The multilayer perceptron case.” Energy, Elsevier B.V., 125, 248–257.

Ward, S., and Chapman, C. (2003). “Transforming project risk management into project uncertainty management.” International Journal of Project Management, 21(2), 97–105.

Wideman, R. M. (1992). “Project and program risk management a guide to managing project risks and opportunities.” The PMBOK handbook series v. no. 6, Project Management Institute.

Wong, J. M. W., and Ng, S. T. (2010). “Forecasting construction tender price index in Hong Kong using vector error correction model.” Construction Management and Economics, 28(12), 1255–1268.

Young, M., and Conboy, K. (2013). “Contemporary project portfolio management: Reflections on the development of an Australian competency standard for project portfolio management.” International Journal of Project Management, Elsevier Ltd, 31(8), 1089–1100.

Yu, X., and Liong, S.-Y. (2007). “Forecasting of hydrologic time series with ridge regression in feature space.” Journal of Hydrology, 332(3–4), 290–302.

BIOGRAPHICAL SKETCH

Alireza Shojaei Kol Kachi is a researcher in the field of computer-based modeling

and simulation in construction management and built environments. He received his

Ph.D. in design, construction, and planning from the University of Florida in August

2017. He also received a master’s degree in management from the College of Business

Administration at the University of Florida. Moreover, he holds a master’s degree in

construction management from the University of Reading and a bachelor’s degree in

civil engineering from the University of Amir Kabir.

Apart from the study presented in this dissertation, he has also conducted

research in various fields, including information communication technology in

construction, sustainable development, project management and economics in

construction, and uncertainty and risk management.
