The Pennsylvania State University
The Graduate School
School of Science, Engineering, and Technology
THE EFFECT OF SPATIAL SEGMENTATION ON SAFETY
PERFORMANCE FUNCTION MODELING
A Thesis in
Computer Science
by
Xingsheng Wang
© 2017 Xingsheng Wang
Submitted in Partial Fulfillment
of the Requirements
for the Degree of
Master of Science
December 2017
The thesis of Xingsheng Wang was reviewed and approved* by the following:
Jeremy Blum
Associate Professor of Computer Science
Thesis Adviser
Thang N. Bui
Graduate Program Chair
Associate Professor of Computer Science
Linda Null
Associate Professor of Computer Science
Sukmoon Chang
Associate Professor of Computer Science
Hyuntae Na
Assistant Professor of Computer Science
*Signatures are on file in the Graduate School.
ABSTRACT
Building predictive models called safety performance functions (SPFs) is important for
the study of roadway safety. The first step in SPF modeling is roadway segmentation, which
partitions roadways into segments. To build the predictive models, we train them
on a set of observations that covers as many cases as possible in order to build
accurate and transferable models. These observations, each with its geometric
parameters and crash count, are derived from the segmentation.
Roadway segmentation is not only an essential but also a challenging step. Previous studies
have found that segmentation approaches affect the models’ transferability, that is,
their ability to predict future crashes or crashes on other roadways. Some researchers
have found that even a small shift in segmentation yields very different models.
To find better approaches to segmentation, in this thesis, we propose a novel
segmentation methodology, which is driven by a machine learning clustering approach.
While this approach is limited in its ability to improve model transferability, it does help to
characterize the extent to which segmentation approaches affect conclusions drawn from
the models. In the clustering step of this approach, roadway segmentation is based on a
weighted distance between adjacent segments. Segmented roadway data is used to build
models that allow for the estimation of the gradient in the error metric as a function of the
segmentation weights. The weights are updated based on this gradient, and this process
repeats with the performance of models guiding the updating of weights and the resulting
segmentation.
TABLE OF CONTENTS
List of Tables ............................................................................... v
List of Figures ............................................................................... vi
List of Abbreviations ............................................................................... vii
Acknowledgements ............................................................................... ix
Chapter 1. INTRODUCTION ....................... 1
Chapter 2. RELATED WORKS ....................... 5
2.1. Roadway segmentation methods ....................... 6
2.2. Feature selection in roadway crash research ....................... 7
2.3. Evolution of the modeling methodologies ...................... 8
2.4. Challenges in the modeling of crash-frequency data ..................... 11
2.5. Influence of segmentation on resulting model ..................... 12
Chapter 3. METHODOLOGY ..................... 14
3.1. Data for segmentation and modeling ..................... 15
3.2. Algorithm for segmentation and modeling ..................... 18
3.3. Parameters for segmentation and modeling ..................... 29
Chapter 4. RESULTS ..................... 33
4.1. The performance of the system on the whole dataset ..................... 33
4.2. Validity in the models based on the initial segmentation parameters ... 34
4.3. The limit of the generalizability of models ..................... 36
Chapter 5. CONCLUSIONS AND FUTURE WORK ..................... 41
5.1. Significance of this thesis ..................... 41
5.2. Future work ..................... 42
REFERENCES ..................... 43
Appendix: Route parameters for modeling ..................... 46
List of Tables
Table 3.1. Five groups of data for cross-validation .............. 24
Table 4.1. Parameters of the negative binomial models with the
different initial segmentation weights .............. 35
Table 4.2. The final normalized segmentation weights for five
experiments with different initial segmentation weights .............. 36
Table 4.3. The predicted testing errors with different initial weights. .............. 38
List of Figures
Figure 3.1. Overall description of methodology ........... 15
Figure 3.2. Features of the horizontal curve ........... 17
Figure 3.3. Features of the vertical curve ........... 17
Figure 3.4. Algorithm for segmentation and modeling ........... 19
Figure 3.5. Algorithm of fragments clustering ........... 21
Figure 3.6. Distance measurement between two adjacent segments/clusters ......... 21
Figure 3.7. Attribute value updating formula for merging segments
during the clustering ........... 22
Figure 3.8. An example for roadway clustering ........... 23
Figure 3.9. Algorithm of building the statistical model ........... 23
Figure 3.10. Algorithm of coordinate-descent calculation ........... 25
Figure 3.11. Weights screening and coordinate-descent generation ........... 26
Figure 3.12. Coordinate-descent update ........... 27
Figure 3.13. An example for weights updating ........... 28
Figure 3.14. Formulas for three types of error measurements ........... 31
Figure 3.15. Average error of the training and the testing datasets
with different target number of segments ........... 32
Figure 4.1. The training errors of the learning with different initial
segmentation weights ........... 34
Figure 4.2. The errors with the learning for 20 experiments based on Table 4.3 .... 40
List of Abbreviations
AADT Annual Average Daily Traffic
absPG Algebraic difference in Gradients of vertical curve
ADT Average Daily Traffic
ANN Artificial Neural Network
BegMP Beginning of Mile Post
EndMP End of Mile Post
GBM Generalized Boosted Models
hcLen Length of horizontal curve
hcMSE Max Super Elevation of horizontal curve
hcR Radius of horizontal curve
HSM Highway Safety Manual
len Length of fragment
MAPE Mean Absolute Percentage Error
NB Negative Binomial models
QIC Quasilikelihood under the Independence model Criterion
RMSE Root Mean Squared Error
sd Average of left and right shoulder width
sdC Average of left and right center median side shoulder width
sdL Left shoulder width
sdLC Left center median side shoulder width
sdR Right shoulder width
sdRC Right center median side shoulder width
SPF Safety Performance Function
SVM Support Vector Machine
vcLen Length of vertical curve
WAPE Weighted Absolute Percent Error
wid Roadway width
Acknowledgements
First and foremost, I would like to express my sincere gratitude to my advisor Dr. Jeremy
Blum who continuously supports my research, for his patience, encouragement,
enthusiasm, insightful comments, and immense knowledge. I very much enjoyed working
with him. Without his guidance and constant feedback this thesis would not have been
achievable.
Secondly, I would like to thank Dr. Linda Null, Master Program Coordinator, and all
professors in this program, who provided me with the opportunity to enroll in this program.
I would like to thank Mrs. Jeanne M. Miller, Administrative Support Assistant, for all her
help and support.
I greatly appreciate Dr. Linda Null, Dr. Thang Bui, Dr. Jeremy Blum, Dr. Sukmoon
Chang, Dr. Omar El Ariss, and Dr. Hyuntae Na. They provided me with strong knowledge
and experience in the areas of database design, algorithm design, system design, data
mining, machine learning, and natural language processing.
I would like to thank my family, my wife, Li Liu and my daughter, Wendy Lily Wang.
Without my wife’s hard work and support, it is hard to imagine how I could have finished
my courses and thesis.
Last but not the least, I would like to thank Dr. Semontee and Dr. Stephanie at the
Learning Center, Penn State Harrisburg, Mr. Chad Snyder, Dr. Richard Lee Gill Jr, Dr.
Christie McCracken, and Mr. Harry Li for their suggestions about writing.
Xingsheng Wang
Chapter 1. INTRODUCTION
Improving roadway safety is an important unsolved problem. According to a 2015 World Health
Organization global status report on road safety, more than 1.2 million people die each year
on the world’s roads [1]. This number has plateaued since 2007. On U.S. roadways, the
number of motor vehicle crash fatalities reached 30,000 in 2014. About two million people
were injured in motor vehicle traffic crashes [2], an increase of 1.1 percent as compared to
2013. To study roadway safety, scientists have developed two major statistical methods since the
1970s. One method is based on descriptive statistics, which describes the basic features of
historical data. Another method is based on inferential statistics, which makes inferences
and predictions based on crash data.
The inferential method is best suited to study the relationship between car crashes and
their causes. The aim of this method is to build predictive models called safety performance
functions (SPFs). SPFs predict the likelihood of crashes on a roadway segment as a
function of the segment length, traffic counts, and roadway features. These functions,
derived using statistical and machine learning analyses, are an important tool to identify
infrastructure improvements that can increase roadway safety.
To build predictive models for car crashes, we must first segment the roadways. The
roadway segments form the observations for our models, which have an ability to predict
the number of crashes based on the attributes of each segment. Scientists have employed
various segmentation methods based on fixed-length segmentation or homogeneous
segmentation. Fixed-length segmentation divides roadways into fragments with the same
length, while homogeneous segmentation separates roadways into fragments with the same
roadway attributes. These methods, however, have been shown to have significant
shortcomings. Importantly, previous research has found that segmentation choices can
affect the transferability of resulting models, i.e., their ability to accurately predict future
crashes or crashes on other roadways.
In this study, to solve the challenges in segmentation, we propose a new methodology to
segment the roadway, called machine learning driven segmentation. The roadways are first
partitioned into fragments of equal length. Each fragment contains the number of
crashes and the five roadway geometric attributes that belong to it. The algorithm
consists of the repetition of three steps, including spatial weights updates, roadway
segmentation, and model fitting. The performance of the models in the current iteration
guides the updating of the spatial weights and segmentation for the subsequent iteration.
In the first iteration, the spatial weights are set to initial values. Segmentation consists
of clustering neighboring segments with the minimal distance. The distance between two
neighboring segments is the sum of five weighted differences. Each weighted difference is
the product of the spatial weight of an attribute and the difference between the attribute
values of the two neighbors. Once the roadway is segmented, the predictive model is built
with a widely used method for modeling crash count data known as negative binomial regression.
In subsequent iterations, the spatial weights are updated in an attempt to reduce the error
in the predictive model. The model error is a non-continuous function of the spatial weights,
so gradient descent approaches are not appropriate for this function. Instead, a coordinate-
descent approach is used to iteratively reduce the error function by making small changes
to the segmentation weights.
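The coordinate-descent idea can be sketched as follows. This is an illustrative toy, not the thesis’s implementation: `error_fn` stands in for the full segment-then-model pipeline, and the step size, iteration count, and quadratic test function are all assumptions for brevity.

```python
# Hedged sketch of coordinate descent: perturb one segmentation weight at a
# time by a small step and keep the move only if the error improves.

def coordinate_descent(weights, error_fn, step=0.05, iterations=20):
    weights = list(weights)
    best_err = error_fn(weights)
    for _ in range(iterations):
        improved = False
        for i in range(len(weights)):
            for delta in (step, -step):
                trial = list(weights)
                trial[i] += delta
                err = error_fn(trial)
                if err < best_err:
                    weights, best_err, improved = trial, err, True
        if not improved:
            break  # a local minimum of the error surface
    return weights, best_err

# Toy error surface with a minimum at weights (1, 2).
f = lambda w: (w[0] - 1) ** 2 + (w[1] - 2) ** 2
w, e = coordinate_descent([0.0, 0.0], f, step=0.25, iterations=100)
```

Because only the sign of the error change matters, this scheme works even when the error is a non-continuous function of the weights, which is why it is preferred here over gradient descent.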
Developing effective and transferable models is a difficult problem for a number of
reasons beyond the challenges presented by roadway segmentation, and thus, this thesis
addresses only a portion of the overall problem. For example, roadway crashes are rare,
random events that are caused by many different contributing factors, which are beyond
the control of transportation engineers. These factors include the roadway environment, the
driver, and the vehicle. Lum and Reagan [3] estimated that three percent, fifty-seven
percent, and two percent of car crashes are due solely to the roadway environment, the
driver, and the vehicle, respectively. Twenty-seven percent of car crashes are due to a
combination of the roadway environment and the driver. Three percent of car crashes are
due to a combination of all three factors. In this study, we investigate the frequency of car
crashes and roadway geometry parameters and build predictive models for the frequency
of car crashes based on these geometry parameters. It is important to note that this study is
limited in that it ignores important factors related to the driver and the vehicle. Moreover,
due to limits on available data, this study only considers a portion of roadway environment
characteristics. Due to the nature of crashes and the limited attributes considered, the
modelling approaches would explain only a portion of crashes that occur.
This thesis advances knowledge in the effect of segmentation choices on modelling
outcomes. Specifically, this thesis includes the following novel contributions:
It presents a new coordinate-descent approach for roadway segmentation, which
allows for different initial weights that can reflect the biases that investigators
bring to these studies with respect to the importance of different roadway features
on safety.
It introduces a new error metric for segmentation studies, Weighted Absolute
Percentage Error (WAPE). WAPE has been used in business applications for
demand forecasting. It is particularly appropriate for segmentation studies due
to its independence of the number of segments and its ability to handle the excessive
zeroes present in the data.
The coordinate-descent approach shows that the error is a locally convex
function of the spatial weights. The approach is able to consistently reduce the
error in a training dataset. However, the improvements in the performance of the
model on the training dataset do not always correspond to improvements in a
held-out testing dataset.
The spatial clustering algorithm illustrates the extent to which emphasizing the
importance of different features in the clustering can affect the resulting SPFs.
The differences in the final segmentation, the coefficients in the resulting models,
and the uneven performance on held-out testing data, all suggest inherent
limitations in the transferability of the resulting models.
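The WAPE metric introduced above has a standard definition: the sum of absolute errors divided by the sum of the actual values. A minimal sketch (the data values are hypothetical):

```python
# Illustrative sketch of WAPE. Unlike MAPE, which divides per observation
# and breaks when an actual value is zero, WAPE divides once by the total,
# so it stays well defined for crash data with many zero-count segments.

def wape(actual, predicted):
    total = sum(actual)
    if total == 0:
        raise ValueError("WAPE undefined when all actual values are zero")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / total

# Hypothetical crash counts with many zeroes, as is typical for segments.
crashes    = [0, 0, 3, 1, 0, 6]
prediction = [0.2, 0.1, 2.5, 1.4, 0.3, 5.5]
score = wape(crashes, prediction)  # 2.0 / 10 = 0.2
```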
This thesis is organized into five chapters. Chapter 2 begins with a literature review of
the related works. It includes roadway segmentation methods, feature selection in roadway
crash research, evolution of the modeling methodologies, challenges in the modeling of
crash-frequency data, and the influence of segmentation on the resulting models.
Chapter 3 describes the novel segmentation approach, the machine learning driven
segmentation. After discussing the data used in this study, we present the details of the
algorithm for segmentation and modeling. Finally, we discuss the parameters for the
segmentation and modeling.
Chapter 4 discusses the results, including the performance of the system on the whole
dataset, validity in the models based on the initial segmentation parameters, and the limit
of the generalizability of models.
Chapter 5, which is the last chapter, presents a discussion of the findings and looks into
areas for future study.
Chapter 2. RELATED WORKS
Building predictive models for roadway car crashes includes two major steps. The first step
is the segmentation of the roadway. There are three widely used segmentation methods
including fixed-length segmentation, homogeneous segmentation, and variable-length
heterogeneous segmentation. Roadway segmentation creates the units of
observation. Each observation contains several features of the segment, which must be
selected from among many candidate features.
The second step includes selecting a predictive model and then modeling the data from
roadway segmentation. Various deterministic and non-deterministic models have been
investigated. Deterministic models include a number of regression models. Linear
regression models are not appropriate for modeling of car crashes because car crashes are
rare events and count data. Poisson regression is appropriate to build models for the count
data, but may overstate or understate the likelihood of car crashes due to the over-
dispersion of car crash data. Negative binomial models allow the variance to exceed the
mean and are widely used in this area. Non-deterministic models such as artificial neural
network (ANN) and support vector machine (SVM) provide an alternative way to model
car crashes.
Building statistical models for car crash data faces several challenges: inaccuracy in the
data, including incomplete reporting of crash data and time-varying parameters; properties
of the data, such as over-dispersion; and complexities of modeling, including omitted
variables and the functional form of the model.
Roadway segmentation is a challenging step for the modeling. Studies show that it
heavily affects the resulting models. To address this challenge, this thesis uses a
coordinate-descent approach to find a heterogeneous segmentation that represents a local minimum in
the error function. This method requires the use of deterministic models in order to be able
to obtain a consistent estimate of coordinate-descent in modeling error as a function of
segmentation weights.
2.1. Roadway segmentation methods
Roadway segmentation partitions roadways into segments to create the observations
for modeling. Segmentation heavily affects the modeling of roadway crash frequencies
because crashes are rare events. Very short segments result in a large number of segments
with zero crashes, which leads to over-dispersion. Over-dispersion means that the variance
of crash data exceeds its mean. It creates challenges for statistical inference because it is
difficult to accurately assign a crash to a segment if segment lengths are very small.
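A quick numerical illustration of over-dispersion, with hypothetical crash counts: when most segments have zero crashes and a few have several, the sample variance of the counts exceeds the sample mean, violating the Poisson assumption that the two are equal.

```python
# Minimal sketch: compare mean and variance of segment crash counts.

def dispersion(counts):
    n = len(counts)
    mean = sum(counts) / n
    var = sum((c - mean) ** 2 for c in counts) / n  # population variance
    return mean, var

# Hypothetical counts for short segments: mostly zeroes, a few spikes.
counts = [0, 0, 0, 0, 0, 0, 0, 0, 5, 7]
mean, var = dispersion(counts)
# var / mean > 1 indicates over-dispersion
```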
As mentioned earlier, there are three widely used roadway segmentation methods
including fixed-length segmentation, homogeneous segmentation, and variable-length
heterogeneous segmentation.
Fixed-length segmentation divides roadways into fragments with equal length. Each
fragment is likely to have varying attributes. Lengths of fragments from eighty meters to
several hundred meters have been studied [4]. Miaou and Lum [5] reported that
eighty-meter segments could bias linear models because such segments are very short.
The second widely used method is homogeneous segmentation. Homogeneous segments
are defined as segments that do not vary with respect to one or more roadway geometry
parameters, Annual Average Daily Traffic (AADT), or both. Cafiso and Silvestro [6]
showed that the segment length should be related to AADT; lower AADT values require
longer segment lengths. The Highway Safety Manual (HSM) [7, 8] recommends “the use
of homogeneous segments with respect to AADT, number of lanes, curvature, presence of
ramp at the interchange, lane width, outside and inside shoulder widths, median width and
clear zone width”. Cafiso et al. [8] investigated five segmentation methods and found that
different segmentation methods ended in very different performances in modeling.
The method of homogeneous segmentation obviously results in many segments of
different lengths. One drawback of this approach is that the segmentation may create many
short segments. To avoid bias, the short segments may be combined (aggregated) and
heterogeneous segments generated [8]. In a winding roadway, homogeneous segmentation
results in some relatively short curves and it may be hard to confidently ascribe a particular
crash to a segment. Koorey [9] carefully discussed how to deal with short segments with
length less than fifty meters. He suggested combining curves with less than two degrees of
total deflection with the subsequent data, creating a special segment type for tight reverse
curves, and removing short segments with tight curves at an intersection. The cut-off values
for these rules are somewhat arbitrarily chosen, but the aim of this method is to eliminate
short segments.
2.2. Feature selection in roadway crash research
In roadway car crash research, researchers have focused on explanatory variables including
the roadway environment, the driver, and the vehicle. Numerous researchers have focused
on the relationships between crashes and roadway geometric design variables, such as
horizontal and vertical curvature, lane width, and shoulder width [10, 11, 12, 13]. In
addition, the Annual Average Daily Traffic (AADT) is usually considered in modeling [11].
Models have also been developed for special roadway configurations such as highway
intersections [14], highway-railway crossings [15], and highway construction zones [16,
17]. Driving conditions, a critical risk factor, have also been shown to affect the likelihood of car
crashes [18]. Snowfall, icy roadways, and wet roadways make vehicle handling more
difficult. Several studies that developed models for winter crashes have been reported [18,
19, 20].
Driver characteristics and behavior [21] are another important factor, one that heavily
affects the frequency of car crashes. Blevins [22], for example, reported that 80 percent of
crashes are caused by distracted driving. Amarasingha and Dissanayake [23] focused
instead on drivers’ age as a surrogate for driving behavior when they modeled injury
severity of young drivers using highway crash data. Hu et al. [12], on the other hand,
created temporal models of crash counts for senior and non-senior drivers.
In this study, we ignore driving conditions, the driver-specific factors, and the vehicle-
specific factors. Instead, the effect of these factors is incorporated in the error term. The
models described in this thesis are based on average daily traffic (ADT) and roadway
geometry information including horizontal curvature, vertical curvature, the roadway
width in feet, the shoulder width, and the median side shoulder width.
2.3. Evolution of the modeling methodologies
Modeling methodologies have evolved for the modeling of crash likelihood. Linear
regression is not suitable for modeling of car crash count data. Poisson regression is
suitable for count data modeling, but it may overstate or understate the likelihood of car
crashes because of the over-dispersed nature of crash data. Negative binomial models allow
the variance to exceed the mean and are the best choice among these models. Recently,
more complex models such as support vector machine models (SVM) and artificial neural
networks (ANNs) have been employed in this area.
Linear regression is not appropriate for car crash data modeling. Car crashes are
random, discrete, nonnegative, and rare events. Conventional linear regression lacks the
distributional properties to describe such count data, which does not follow a normal distribution [24].
Miaou and Lum [5] proposed two special linear regression models, additive linear
regression model and multiplicative linear regression model. However, the test statistics
showed that neither is appropriate for crash data.
Poisson regression may overstate or understate the likelihood of car crashes. The
Poisson model has three properties [25]: the mean of the counts experienced by an
individual equals its variance; events in the Poisson process are independent and
memoryless; and the event rate within time intervals is constant. Count data is the number
of events per time interval. The mean count can be evaluated by distributions of Poisson
family. The Poisson regression was used for the modeling of car crashes [26, 13]. The
model takes the following form:
$$P(Y_i = y_i) = \frac{\lambda_i^{y_i} e^{-\lambda_i}}{y_i!} \qquad (i = 1, 2, 3, \ldots, n)$$

where

$$\lambda_i = E(Y_i) = e^{\sum_{j=1}^{k} x_{ij}\beta_j}$$

where $\beta_1, \beta_2, \ldots, \beta_k$ are $k$ unknown regression parameters, $P(Y_i = y_i)$ is the
probability of $y_i$ crashes occurring on roadway segment $i$ in one year, and $\lambda_i$ is the
expected crash frequency for segment $i$.

The parameters $\beta_1, \beta_2, \ldots, \beta_k$ can be estimated with the maximum likelihood method
[27], the quasi-likelihood method [28], or the generalized least squares method [29]. The
maximum likelihood method uses the likelihood function $L(\beta)$ to estimate the coefficient
vector $\beta = (\beta_1, \beta_2, \ldots, \beta_k)$ [30].
$$L(\beta) = \prod_{i=1}^{n} P(Y_i = y_i) = \prod_{i=1}^{n} \frac{\lambda_i^{y_i} e^{-\lambda_i}}{y_i!}$$

where

$$\lambda_i = E(Y_i) = e^{\sum_{j=1}^{k} x_{ij}\beta_j} = \exp(X_i\beta)$$

That is,

$$\ln(\lambda_i) = X_i\beta$$
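As an illustration of the maximum-likelihood idea, the sketch below fits a single-coefficient Poisson regression by simple gradient ascent on the log-likelihood. This is a toy, not the estimation procedure used in the thesis; the data, the learning rate, and the intercept-free single-predictor form are all assumptions chosen for brevity.

```python
import math

def poisson_loglik(beta, xs, ys):
    # Log-likelihood with ln(lambda_i) = beta * x_i (single predictor).
    ll = 0.0
    for x, y in zip(xs, ys):
        lam = math.exp(beta * x)
        ll += y * beta * x - lam - math.lgamma(y + 1)
    return ll

def fit_beta(xs, ys, lr=0.005, steps=5000):
    # Gradient ascent on the log-likelihood:
    # d(ll)/d(beta) = sum_i (y_i - lambda_i) * x_i
    beta = 0.0
    for _ in range(steps):
        grad = sum((y - math.exp(beta * x)) * x for x, y in zip(xs, ys))
        beta += lr * grad
    return beta

# Toy data in which counts grow roughly exponentially with x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2, 3, 4, 8]
beta = fit_beta(xs, ys)
```

In practice, library routines (e.g., iteratively reweighted least squares in GLM software) do this estimation; the gradient form above simply makes the likelihood maximization concrete.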
The limitation of the Poisson regression is the equality between mean and variance of
the counts. In many applications, count data for car crashes show over-dispersion, that is,
the variance of the data is greater than the mean. As a result, Miaou and Lum [5] reported
that Poisson regression models may overstate or understate the likelihood of car crashes.
Negative binomial models allow the variance to exceed the mean. To overcome the
over-dispersion problem, an error term ($\varepsilon_i$) is added to the expected crash frequency ($\lambda_i$) in
the negative binomial regression model [13, 31, 32]:

$$\ln(\lambda_i) = X_i\beta + \varepsilon_i$$

where $e^{\varepsilon_i}$ is a gamma-distributed error with mean one and variance $\alpha$ [30]. This relaxes the
assumption of the Poisson model that the mean of the crash frequencies equals the variance.
The negative binomial model allows the variance to exceed the mean and is widely used to
build predictive models for over-dispersed data. In this study, we use
this model to predict the likelihood of car crashes.
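The gamma-Poisson mixture behind the negative binomial model can be illustrated with a small simulation. This is a sketch under the standard parameterization (a gamma multiplier with mean one and variance alpha), not code from the thesis; for small means, drawing the Poisson variate by CDF inversion is adequate.

```python
import math
import random

def simulate_nb(mu, alpha, n, seed=0):
    # Marginal counts from lambda_i scaled by a gamma multiplier are
    # over-dispersed: Var(Y) is approximately mu + alpha * mu**2.
    rng = random.Random(seed)
    counts = []
    for _ in range(n):
        # gamma multiplier with mean 1 and variance alpha
        g = rng.gammavariate(1.0 / alpha, alpha)
        lam = mu * g
        # Poisson(lam) draw by CDF inversion (fine for small lam)
        u, k = rng.random(), 0
        p = cdf = math.exp(-lam)
        while u > cdf and k < 1000:
            k += 1
            p *= lam / k
            cdf += p
        counts.append(k)
    return counts

counts = simulate_nb(mu=2.0, alpha=0.5, n=5000)
m = sum(counts) / len(counts)
v = sum((c - m) ** 2 for c in counts) / len(counts)
# Expect m near 2 and v near mu + alpha*mu**2 = 4, i.e., clearly above m.
```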
Other methods have also been applied to the statistical analysis of crash-frequency data,
and these have been widely reviewed [33, 34]. Besides
different regression models, neural networks [30], Bayesian neural networks [31], and
support vector machine models (SVM) [32] have been reported. For example, Chang [30]
demonstrated that artificial neural network (ANN) is an alternative method for the study in
this area. These methods provide alternative ways to build predictive models for roadway
car crashes.
2.4. Challenges in the modeling of crash-frequency data
Challenges in the modeling of crash-frequency data have been widely discussed in two
recent reviews [33, 35]. Here, we discuss the potential challenges that are related to our
research. These challenges may come from inaccuracy in the data, including incomplete
reporting of crash data and time-varying parameters; properties of the data, such as
over-dispersion; complexities of modeling, including omitted variables and the
functional form of the model; and segmentation.
Time varying parameters Usually, the data for the modeling are considered over some
time period and some parameter values may change during this period. If we ignore within-
period variations, the results may lose explanatory information. To minimize the influence
of time-varying parameters, the data we used spans only one year.
Over-dispersion of incident counts Car crashes are rare events, and the variance of
crash data exceeds its mean due to the large number of segments with no crashes, a
condition called over-dispersion. In this study, we address this problem by using the negative
binomial model, which can handle over-dispersion in the data.
Omitted variables bias When the size of a dataset limits the number of parameters that
can be estimated, researchers must decrease the complexity of the model. This can result in
biased estimates. Some researchers have used random parameter models to try to capture
segment specific heterogeneity resulting from omitted variables [36].
Functional form of the modeling The functional form is a very important factor for
modeling. For over-dispersed crash data, a large body of research has demonstrated
that non-linear forms are much better than linear forms [37, 24]. However, non-linear
models are more complex and need estimation procedures to increase the accuracy of the
estimated expected crash frequency [38, 39].
Incomplete reporting of crash data Kumara and Chin [35] reported that under-
reporting might produce biased estimates. Less severe crashes are underreported, and some
potentially serious problems are also not reported. We do not know the magnitude of
incomplete reporting, but studies have shown that it biases the modeling.
2.5. Influence of segmentation on resulting model
Roadway segmentation heavily affects car crash modeling. For example, Cafiso et al. [8]
investigated and assessed five different segmentation approaches: (1) homogeneous
segmentation with respect to AADT and curvature; (2) each segment containing 2 curves and
2 tangents while avoiding short segments; (3) each segment having constant AADT; (4) fixed-length
segmentation; and (5) each segment incorporating variables in a stepwise procedure. They used
the Quasi-likelihood under the Independence model Criterion (QIC) to evaluate the
goodness of fit of the models. The values of QIC for these five segmentation methods are
3322, 1082, 1762, 2707, and 4511, respectively. While the second method had the best
results, the parameters for the models varied widely. This study is important as it
demonstrated how roadway segmentation affects the modeling.
The work here seeks to extend this analysis by using a coordinate-descent approach to
explore how segmentation produces different models and results. Non-deterministic
methods such as ANN and Generalized Boosted Models (GBM), are potentially good
methods for the study of crash-frequency data. But these methods cannot provide stable
results after many iterations of learning. In the current work, segmentation and modeling are
conducted alternately. Roadway segmentation uses a spatial clustering algorithm based on
the weighted distances between features of adjacent segments. Segmented data is then used
for the modeling. The segmentation weights are updated with a coordinate-descent approach
based on the errors of models. Segmentation and modeling are repeated until a threshold is
achieved. The segmentation and model selection must be deterministic to maintain
consistency for the coordinate-descent calculations. Therefore, we use a deterministic
method, the negative binomial method, to build models.
Chapter 3. METHODOLOGY
To build predictive models for the count of car crashes based on roadway geometry
parameters, we designed a spatial clustering algorithm which uses coordinate-descent to
find a roadway segmentation. As shown in Figure 3.1, the raw data contains four types of
roadway geometry information, average daily traffic, and car crash information. After the
data is cleaned, the roadway is partitioned into fragments with fixed length. Each fragment
contains five geometric parameters, average daily traffic, and the number of crashes in this
fragment.
The fixed-length segments are further clustered based on the weighted distance between
two neighboring segments. Then, the clustered segments are divided into three sub-datasets:
a training dataset, a validation dataset, and a testing dataset. The training dataset is used to
build a negative binomial model followed by calculation of the training error. The training
error, together with a set of errors from weight screening, is used to generate the
coordinate-descent update. Weight screening is a parallel process that evaluates small
changes to the clustering weights. It selects the best change to update these weights for the
next iteration. After the clustering weights are updated, the new iteration begins. The
updating of the clustering weights plays a central role in the algorithm. It continues until a
threshold is attained.
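The clustering step described above can be sketched as an agglomerative merge of adjacent fragments. The attribute names, the length-weighted merge rule, and the stopping condition (a target number of segments) are illustrative stand-ins for the details in Figures 3.5–3.7, not the thesis’s actual implementation.

```python
def merge(a, b):
    # Lengths and crash counts add; other attributes become
    # length-weighted averages of the two segments.
    total = a["len"] + b["len"]
    out = {"len": total, "crashes": a["crashes"] + b["crashes"]}
    for key in a:
        if key not in ("len", "crashes"):
            out[key] = (a[key] * a["len"] + b[key] * b["len"]) / total
    return out

def distance(a, b, weights):
    # Weighted distance between two adjacent segments.
    return sum(w * abs(a[key] - b[key]) for key, w in weights.items())

def cluster(fragments, weights, target):
    # Repeatedly merge the closest pair of adjacent segments.
    segs = [dict(f) for f in fragments]
    while len(segs) > target:
        i = min(range(len(segs) - 1),
                key=lambda j: distance(segs[j], segs[j + 1], weights))
        segs[i:i + 2] = [merge(segs[i], segs[i + 1])]
    return segs

# Hypothetical fragments: only one geometric attribute ("wid") for brevity.
fragments = [
    {"len": 0.1, "crashes": 0, "wid": 24.0},
    {"len": 0.1, "crashes": 1, "wid": 24.0},
    {"len": 0.1, "crashes": 0, "wid": 36.0},
]
segs = cluster(fragments, {"wid": 1.0}, target=2)
```

Here the first two fragments, having identical widths, are merged first; the resulting 0.2-mile segment keeps their combined crash count, while the wider third fragment remains separate.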
Figure 3.1. Overall description of the methodology. Weight screening based on the
clustering weights consists of ten parallel segmentation-and-modeling processes. Java or R
package names are shown in parentheses.
3.1. Data for segmentation and modeling
Data resources
The raw data contains roadway geometry information, average daily traffic (ADT), and car
crash information from 2012 in Washington State. The data include 33 two-lane state routes
with 33,000 crashes, spanning about 1,800 miles.
The initial data comes from four geometry files and an ADT file (Figure 3.1). Each
record in these five files contains the state route number, the beginning milepost (BegMP),
and the ending milepost (EndMP). The milepost data is recorded to the nearest hundredth
of a mile. Each record in the car crash file contains the state route number and the specific
milepost where the car crash occurs.
Attributes of data
The geometric information for roadways includes horizontal curve, vertical curve, lane,
and shoulder width information. Horizontal curves [40] (Figure 3.2) are characterized by
three features: the radius of the curve (hcR), the length of the curve in feet (hcLen), and
the max super elevation (hcMSE). Super elevation is the banking of the roadway such that
the outside edge of pavement is higher than the inside edge. Vertical curves (Figure 3.3)
are characterized by the length of the curve in feet (vcLen), the algebraic difference in
gradients (absPG), and the length of the parabolic curve (L). L is the projection of the curve
onto a horizontal surface, i.e., its plan distance. As shown in Figure 3.3, the point of
curvature (PC) is the beginning of a vertical curve, the point of tangency (PT) is the end of
the vertical curve, and the point of intersection of the tangents (PI) is the point of vertical
intersection.
The lane data has two attributes: the number of lanes and the roadway width in feet (wid).
The shoulder data has four attributes: the left shoulder width (sdL), the right shoulder
width (sdR), the left center median side shoulder width (sdLC), and the right center
median side shoulder width (sdRC). Left and right are labeled based on travelling on the
roadway in the direction of increasing milepost values. In this study, we use the average left
and right shoulder width (sd) and the average left and right median side shoulder width (sdC).
Figure 3.2. Features of the horizontal curve.
Figure 3.3. Features of the vertical curve. A curve is shown in dark blue.
Data cleaning and transformation
For this study, the roadway is first partitioned into very short fixed-length segments of
0.01 miles each. For any segment without horizontal curvature, the radius of the horizontal
curve (hcR) is assigned a value of 40,000, which is significantly larger than any value in
the dataset. For any segment without vertical curvature, the length of the vertical curve
(vcLen) is assigned a value of 0.
The car crash file contains the specific milepost where each crash occurred. Each crash
is distributed to its two neighboring fragments: i.e., each fragment is assigned 0.5 crashes.
Any segment with missing attributes is deleted from the dataset. Crash data is not used in
the segmentation, but it is used for the modeling.
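The half-crash allocation just described can be sketched as follows (a minimal illustration; the function name and the boundary-index convention are assumptions, not the thesis code):

```python
def distribute_crashes(crash_mileposts, num_fragments, frag_len=0.01):
    """Assign each crash 0.5/0.5 to the two fragments that meet at its milepost."""
    crashes = [0.0] * num_fragments
    for mp in crash_mileposts:
        # Mileposts are recorded to the nearest hundredth of a mile, so each
        # crash falls on the boundary between fragment k-1 and fragment k.
        k = round(mp / frag_len)
        if 0 <= k - 1 < num_fragments:
            crashes[k - 1] += 0.5
        if 0 <= k < num_fragments:
            crashes[k] += 0.5
    return crashes
```

For example, a crash at milepost 0.05 contributes 0.5 crashes each to the fragments ending and starting at that milepost.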
The attributes are altered in order to eliminate direction-specific values, attributes that
are highly correlated, and attributes that are not statistically significant. The left and right
shoulder attributes are replaced with a single attribute equal to their sum. The center left
and center right shoulder attributes are similarly replaced with a single attribute
representing their sum. The other geometric attributes used in the modeling are lane width,
horizontal curvature, and vertical curvature. Our statistical models also use non-geometric
attributes: crash counts, average daily traffic, and segment length. For the segmentation,
attributes are normalized, while for the modeling, attributes are used directly.
3.2. Algorithm for segmentation and modeling
The roadway segmentation algorithm uses an iterative, coordinate-descent, spatial
clustering approach. This approach tries to find a local optimum in the roadway
segmentation space with respect to the ability of the statistical model to predict car
crashes. As shown in Figure 3.4, the algorithm starts with an initial fixed-length
segmentation and clusters the fixed-length segments based on the current set of attribute
weights. The algorithm then builds a statistical model based on the clustered data. The
clustering weight for each attribute is then perturbed, both increased and decreased; this
is called weight screening. For each change, a new clustering is performed and a model is
built. Based on these alternate clusters and models, the algorithm estimates the coordinate
descent of the statistical model's loss function as a function of the weights. The weights
are then adjusted to follow this coordinate descent, and the process repeats until the
termination criteria are met.
Figure 3.4. Algorithm for segmentation and modeling. Segmentation is the process of
clustering two neighboring segments based on the weighted distance between them.
Algorithm for segmentation and modeling
1. initialize ω, Δ, and N // ω is a weight vector for clustering
2. data0 ← initial_segment(raw data) // fixed-length segmentation
3. data ← cluster(data0, ω, N) // cluster data based on attributes, ω, and N
4. error ← model(data) // modeling with negative binomial models
5. while (Δ > 0.001) // Δ is the increment value for ω
6. ∇ ← coordinate_descent(ω, Δ, error) // Weights Screening
7. if (|∇| is 0) // if each element in ∇ equals 0
8. Δ ← Δ / 2
9. else
10. ω ← update(ω, ∇) // clustering weights updating
11. data ← cluster(data0, ω, N) // cluster based on updated ω and N
12. error ← model(data) // modeling and calculation of error
13. end if
14. end while
Variables initialization (Step 1)
Weights (ω⁰ = [ω₁⁰, ω₂⁰, ..., ω₅⁰]) are initialized to a set of initial values. The weights
ω₁⁰, ω₂⁰, ..., ω₅⁰ represent the initial segmentation weights for the radius of the horizontal
curve (hcR), the length of the vertical curve (vcLen), the roadway width (wid), the average
left and right shoulder width (sd), and the average left and right median side shoulder
width (sdC), respectively. To establish a baseline clustering, the initial weights are all set
to one. Subsequent runs of the system assign initial weights of one to every attribute except
one, whose weight is assigned an initial value of two. These alternative initial weights
allow us to explore more of the segmentation space and to determine whether local minima
in the segmentation space produce models that lead to different conclusions.
The variable Δ is used to control the magnitude of the change in the weights during
each iteration, as explained later in the "Coordinate-descent to update clustering
weights" section. We tried different initial values of Δ from 0.001 to 5.0. The results
indicate that an initial value of 0.5 produces the best results.
The variable N is the target number of segments for the modeling after clustering. That
is, the clustering algorithm stops when the number of clusters reaches N. Here, each cluster
is a segment of the roadway. The process used to determine the value of this parameter is
described in the last section of this chapter.
Segmentation algorithm (Step 2, Step 3, and Step 11)
The segmentation algorithm contains two steps. The first step is fixed-length segmentation
(Step 2 in Figure 3.4). The second step spatially clusters the fixed-length segments into a
certain number of clusters (Steps 3 and 11 in Figure 3.4). Attributes are normalized by
scaling all values to the range from 0 to 1 (min-max normalization) before segmentation.
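The 0-to-1 scaling step can be sketched as a simple min-max scaler (a minimal illustration; the handling of constant attributes is an assumption, since the thesis does not specify it):

```python
def min_max_normalize(values):
    """Scale a list of attribute values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant attribute: map everything to 0 (assumed convention)
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]
```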
Roadways are first partitioned into fixed-length fragments, each 0.01 miles long (Step 2).
Next, the fixed-length fragments are clustered into the target number of clusters (Steps 3
and 11). After these steps, each cluster contains one or more contiguous fixed-length
segments of a roadway. The clustering algorithm is shown in Figure 3.5.
Figure 3.5. Algorithm for fragment clustering. Clustering the fragments produces the
segmentation; each cluster is a segmented fragment.
The distance between the jth segment/cluster and the (j+1)th segment/cluster is calculated
with a weighted Manhattan distance formula (Figure 3.6).
Figure 3.6. Distance measurement between two adjacent segments/clusters.
For i = 1..5, a_i,j represents the value in the jth cluster of the radius of the horizontal curve
(hcR), the length of the vertical curve (vcLen), the roadway width (wid), the average left
and right shoulder width (sd), and the average left and right median side shoulder width (sdC),
Algorithm of clustering
3.1. n ← the number of fixed-length segments
3.2. N ← the target number of clusters
3.3. while n > N
3.4. find the two neighboring fragments/clusters with the minimal distance
3.5. combine these two fragments/clusters into a new fragment/cluster
3.6. calculate the distances between the new fragment and its two neighboring fragments
3.7. n ← n − 1
3.8. end while
Distance measurement
distance(j, j+1) = ∑_{i=1}^{5} ω_i |a_i,j − a_i,j+1|
respectively. ω_i is the weight for attribute i. Clusters j and j+1 are adjacent clusters on
the roadway.
There are two types of attributes. The first type includes hcR, vcLen, wid, sd, and sdC.
These are combined with a length-weighted formula: the value of the attribute in the
newly merged cluster, a_i,new, is calculated from clusters j and j+1 (Figure 3.7).
Figure 3.7. Attribute value updating formula for merging segments during the clustering.
l_j and l_{j+1} are the lengths of clusters j and j+1.
The second type includes the number of crashes. When segments/clusters are merged,
we just sum the number of crashes from clusters j and j+1.
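The clustering loop of Figure 3.5, together with the distance rule of Figure 3.6 and the merge rules just described, can be sketched as follows (a minimal illustration; the dictionary-based fragment representation and function names are assumptions, not the thesis implementation):

```python
def weighted_distance(a, b, w):
    """Weighted Manhattan distance between adjacent clusters' attribute vectors."""
    return sum(wi * abs(ai - bi) for wi, ai, bi in zip(w, a, b))

def merge(cj, ck):
    """Merge two adjacent clusters: length-weighted attribute average, summed crashes."""
    lj, lk = cj["len"], ck["len"]
    attrs = [(lj * aj + lk * ak) / (lj + lk)
             for aj, ak in zip(cj["attrs"], ck["attrs"])]
    return {"len": lj + lk, "attrs": attrs,
            "crashes": cj["crashes"] + ck["crashes"]}

def cluster_fragments(fragments, w, target_n):
    """Greedily merge the closest adjacent pair until target_n clusters remain."""
    clusters = list(fragments)
    while len(clusters) > target_n:
        # Find the adjacent pair with the minimal weighted distance.
        j = min(range(len(clusters) - 1),
                key=lambda i: weighted_distance(clusters[i]["attrs"],
                                                clusters[i + 1]["attrs"], w))
        clusters[j:j + 2] = [merge(clusters[j], clusters[j + 1])]
    return clusters
```

Note that only adjacent clusters are ever compared or merged, which is what makes this a spatial clustering: each resulting cluster is a contiguous stretch of roadway.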
To illustrate the roadway clustering algorithm, an example is shown in Figure 3.8. There
are 10 fragments before a new iteration of the clustering, and the weight for each attribute
is 1. Distances between neighboring fragments are calculated using the formula shown in
Figure 3.6. The distance of 3 between fragments 9 and 10 is the minimal value among the
nine distances. Therefore, fragments 9 and 10 are combined into a new fragment 9′. The
length and the number of crashes of fragment 9′ are the sums of those of fragments 9 and
10, while the attributes a1, a2, and a3 are calculated with the formula in Figure 3.7.
Attribute value update when merging segments
a_i,new = (l_j a_i,j + l_{j+1} a_i,j+1) / (l_j + l_{j+1})
Figure 3.8. An example of roadway clustering. The first column is the fragment sequence
number, from 1 to 10. The second column is the length of each fragment. The third column
is the number of crashes in each fragment. The fourth to sixth columns are the roadway
geometry attributes a1, a2, and a3. The last column is the distance between neighboring
fragments. The last row gives the parameters for the combined fragment 9′ after combining
fragments 9 and 10.
Building the statistical model (Step 4 & Step 12)
Building the statistical model that predicts the frequency of crashes for each segment
includes several steps as shown in Figure 3.9.
Figure 3.9. Algorithm of building the statistical model.
Algorithm of modeling
4.1. split the clustered/segmented data into training, validation, and testing
datasets (Optional)
4.2. build negative binomial (NB) model based on the training datasets
4.3. calculate error (called error or WAPE) for training dataset
Table 3.1. Five groups of data for cross-validation.
Group No. Route No. in Washington State
1 12
2 5, 3, 129, 548, 166
3 2, 525, 903, 100, 308, 128
4 155, 23, 504, 161, 821, 501, 411, 223, 528, 193, 204, 433
5 26, 9, 27, 174, 11, 270, 531, 538, 906
There are 33 routes in Washington State used in this study. The segmented/clustered
roadways are further divided into five groups based on route numbers (Table 3.1) for the
purpose of cross-validation. The total lengths in miles of the five groups are equal.
To test the generalizability of the models, we partition these five groups of data into three
folds (Step 4.1). One fold, the testing dataset, is used to estimate the performance of the
model on unseen data. A second fold, the validation dataset, is used to prevent overfitting
of the model during learning. The remaining three groups of data, the training dataset, are
used to train the model. The algorithm is then run a total of 120 times (see "4.3 The limit
of the generalizability of models" in Chapter 4).
To evaluate the effect of initial weights on resulting clustering and statistical models, the
entire dataset is used as the training dataset (see “4.1 The performance of the system on the
whole dataset” in Chapter 4) since estimates of the performance of the models on unseen
data had already been established.
For Step 4.2, the negative binomial (NB) models are built on the training dataset. For the
modeling, the geometric attributes hcR, vcLen, wid, sd, and sdC, the average daily traffic
(adt), and the length of the fragment (len) are used as the input data. The output of the
model is the predicted number of crashes. The error is calculated from the actual number
of crashes (yi) and the predicted number of crashes (pi), as shown in Figure 3.14.
Non-normalized values of these attributes are used to build the statistical models. In
addition to the geometric attributes, both average daily traffic and segment length are used
as offsets in the modeling, so that the number of crashes is proportional to these two
parameters.
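The role of the two offsets can be illustrated with the mean function of the negative binomial model (a sketch only; the coefficient values used below are hypothetical, not the fitted values from this study):

```python
import math

def predicted_crashes(beta0, betas, attrs, adt, length):
    """Mean of an NB regression with log(adt) and log(length) as offsets:
    mu = adt * length * exp(beta0 + sum(beta_i * x_i)).
    Because offsets enter with a fixed coefficient of 1, the expected crash
    count is directly proportional to both traffic volume and segment length."""
    linear = beta0 + sum(b * x for b, x in zip(betas, attrs))
    return adt * length * math.exp(linear)
```

Doubling the segment length (or the traffic volume) doubles the expected crash count, which is exactly the proportionality stated above.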
Coordinate-descent to update clustering weights (Step 6)
The algorithm of coordinate-descent generation is shown in Figure 3.10. First, ten sets of
weights (Step 6.1) are created by increasing and decreasing the current weights () by the
weight increment (). Each set of weights is used to segment/cluster the data (Step 6.3
and 6.5), and segmented/clustered data is then used to model and generate error (Step 6.4
and 6.6). Lastly, a coordinate-descent approach is used to update the weights (Step 6.8)
based on ten sets of errors and the error from the previous iteration (Figure 3.11).
Figure 3.10. Algorithm of coordinate-descent calculation.
Algorithm of coordinate-descent calculation
coordinate_descent(ω, Δ, error) {
6.1. generate 10 sets of weights (ω1+, ω2+, ..., ω5+, ω1−, ω2−, ..., ω5−) based on
ω and Δ
6.2. for j in 1..5
6.3. data ← cluster(data0, ωj+, N) // clustering
6.4. ej+ ← model(data) // model and calculate error
6.5. data ← cluster(data0, ωj−, N) // clustering
6.6. ej− ← model(data) // model and calculate error
6.7. end for
6.8. ∇ ← update(error, e = [e1+, e2+, ..., e5+, e1−, e2−, ..., e5−])
6.9. return ∇
}
Figure 3.11. Weights screening and coordinate-descent generation.
In Step 6.1, ten sets of weights (ω1+, ω2+, ..., ω5+, ω1−, ω2−, ..., ω5−) are generated
based on the weights of the previous iteration (ω) and Δ. That is, each new set of weights
is derived from ω by adding Δ to, or subtracting Δ from, the weight of one attribute. Ten
sets of weights can be generated because there are five attributes in this study. After
obtaining the ten sets of weights, we cluster the fixed-length segments based on each set,
producing ten segmented/clustered datasets (Steps 6.3 and 6.5). Based on these ten
datasets, ten models are built and ten errors (e = [e1+, e2+, ..., e5+, e1−, e2−, ..., e5−]) are
calculated (Steps 6.4 and 6.6).
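The generation of the ten perturbed weight vectors in Step 6.1 can be sketched as (the function name is an assumption):

```python
def screening_weight_sets(w, delta):
    """Generate the 2 * len(w) perturbed weight vectors used in weight screening:
    for each attribute j, one copy with w[j] + delta and one with w[j] - delta."""
    sets = []
    for j in range(len(w)):
        plus = list(w)
        plus[j] += delta
        minus = list(w)
        minus[j] -= delta
        sets.append(plus)   # ωj+
        sets.append(minus)  # ωj−
    return sets
```

With five attributes this yields exactly the ten weight sets screened in each iteration.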
The weight update step (Step 6.8) estimates the best change in the weights based on the
error from the previous iteration (error) and the errors from the weight screening step
(e = [e1+, e2+, ..., e5+, e1−, e2−, ..., e5−]). As described above, two sets of weights are
generated for each attribute (ωj+ and ωj− for attribute j). After segmenting based on these
two sets of weights and building models on the two segmented datasets, we obtain two
errors (ej+ and ej− for attribute j). The coordinate-descent update algorithm based on these
three errors (error, ej+, and ej−) is shown below.
Figure 3.12. Coordinate-descent update. j = 1..5 represents the attributes of hcR, vcLen,
wid, sd, and sdC, respectively.
Weights update (Step 10)
The new weight (ωᵢ) for an attribute is updated based on its weight from the previous
iteration, the difference in the error metric (∇ᵢ), and Δ:
ωᵢ ← ωᵢ × (1 + ∇ᵢ × Δ / G)
G = ∑_{i=1}^{5} |∇ᵢ|
G is the sum of the absolute values of the differences in the error metric (∇). i = 1..5
represents the attributes hcR, vcLen, wid, sd, and sdC, respectively. Thus, this formula
adjusts each weight in the direction of improvement, by a factor proportional to the
improvement in the error metric.
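Combining the update rule of Figure 3.12 with the formula above gives the following sketch (the numeric errors used for illustration are hypothetical, not taken from Figure 3.13):

```python
def coordinate_descent_nabla(error, e_plus, e_minus):
    """Per-attribute error difference: positive if increasing the weight by
    delta improved on the previous error, negative if decreasing it did,
    and zero if neither perturbation helped (the rule of Figure 3.12)."""
    nabla = []
    for ep, em in zip(e_plus, e_minus):
        if ep < em:
            nabla.append(error - ep if ep < error else 0.0)
        else:
            nabla.append(em - error if em < error else 0.0)
    return nabla

def update_weights(w, nabla, delta):
    """w_i <- w_i * (1 + nabla_i * delta / G), where G = sum(|nabla_i|)."""
    G = sum(abs(n) for n in nabla)
    if G == 0:
        return list(w)  # no improving perturbation was found
    return [wi * (1 + n * delta / G) for wi, n in zip(w, nabla)]
```

A weight with a zero ∇ᵢ is left unchanged, while positive and negative components move their weights up and down, respectively.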
To illustrate the weight update, an example is shown in Figure 3.13. The error before
weight screening (e0) is 100. Ten errors (e1+, e2+, ..., e5+, e1−, e2−, ..., e5−) are generated
from weight screening. The third row (∇) is the coordinate descent calculated with the
algorithm shown in Figure 3.12. The row labeled ω is the weights before updating, and the
row labeled
Coordinate-descent update
update(error, e = [e1+, e2+, ..., e5+, e1−, e2−, ..., e5−]) {
initialize the gradient, ∇ = [0, 0, 0, 0, 0]
for j in 1..5
if (ej+ < ej−)
if (ej+ < error) ∇j ← error − ej+
else
if (ej− < error) ∇j ← ej− − error
end if
end for
return ∇
}
ω′ is the weights after updating. As shown in Figure 3.13, the performance improvement
determines the change in each weight. For attribute a1, the best performance is the error of
50, which corresponds to increasing the weight for a1 by Δ; after the update, the weight for
a1 is increased by 25%. For attribute a2, there is no improvement in performance, so the
weight for a2 does not change. In addition, the direction of the perturbation (increasing or
decreasing by Δ) determines whether the weight increases or decreases. For example, for
attribute a4, the best performance is the error of 90, which corresponds to decreasing the
weight for a4 by Δ; therefore, after the update, the weight for a4 is decreased by 5%.
Figure 3.13. An example for weights updating.
Stop criterion for the algorithm (Step 5)
The variable Δ serves as the threshold for the algorithm (Step 5). It is halved (Step 8)
when there is no improvement in the performance on the training dataset. The algorithm
stops when Δ is less than 0.001.
Implementation and running time of algorithms
Segmentation is implemented in Java, and modeling is implemented in R with the MASS
package [41]. Penn State LionX clusters are used, and parallel computing is implemented
with MPICH2. Weight screening and coordinate-descent generation comprise ten
independent tasks per iteration, which allows parallel computation. Each iteration takes
about ten minutes with four nodes and sixteen gigabytes of memory. Each experiment
takes 3-48 hours, depending on the convergence of the algorithm.
3.3. Parameters for segmentation and modeling
Error measurement
One of the challenges in selecting a loss function for the algorithm is that typical loss
functions will not allow for comparison between models with different numbers of
segments/clusters. Commonly used error metrics, such as Root Mean Squared Error
(RMSE) and Mean Absolute Percentage Error (MAPE) are not appropriate for this problem.
Instead, we introduce the use of Weighted Absolute Percentage Error (WAPE), which has
not previously been used in studies of car crashes. WAPE has been used in demand
forecasting models, which have similar requirements to Safety Performance Function
(SPF) models.
Root Mean Squared Error (RMSE) is defined as shown in Figure 3.14. This metric is scale
dependent, which means that when it is used in SPF models, its value is a function of the
number of segments. The total number of car crashes in the dataset is fixed, so when the
number of segments decreases, the average number of car crashes per fragment increases.
RMSE will therefore increase intrinsically as the number of segments decreases.
Another commonly used metric is Mean Absolute Percentage Error (MAPE), calculated
as shown in Figure 3.14 [42]. Note that MAPE is undefined when the actual value (yi) is
0. MAPE, like RMSE, is scale dependent in this setting.
Weighted Absolute Percent Error (WAPE) is an alternative to MAPE and is a widely
used metric for forecasting errors [42]. WAPE is defined as shown in Figure 3.14. In
time-series demand forecasting models, it offers advantages over MAPE: unlike MAPE,
it remains defined when the actual values (yi) are 0, and it is scale independent, which is
important when time period lengths vary in demand forecasting models.
In this study, WAPE represents the total absolute prediction error as a percentage of the
total number of car crashes. The rationale for using WAPE over MAPE is based on the
distribution of the data and the goal of this research. Car crashes are rare events, and most
segments experience no crashes over the course of a year; thus, there are many segments
with values of 0. Rather than requiring scale independence over time period lengths, this
work requires scale independence over segment length, and MAPE makes the error metric
scale dependent. In addition, the goal of this study is to use predictive models to reduce
the total number of crashes. MAPE weights an accident on a segment with few crashes
more heavily than an accident on a segment with more crashes, a difference that does not
make sense given the goal of the modelling. Lastly, WAPE provides a way to compare the
performance of a model with a naïve model that predicts zero crashes for all segments:
such a naïve model would have a WAPE of 100.
RMSE = √( (1/N) ∑_{i=1}^{N} (p_i − y_i)² )
MAPE = (100/N) ∑_{i=1}^{N} |(p_i − y_i) / y_i|
WAPE = 100 × ∑_{i=1}^{N} |p_i − y_i| / ∑_{i=1}^{N} y_i
Figure 3.14. Formulas for three types of error measurements. N is the number of segments
for the modeling. yi is the number of crashes for the ith segment. pi is the predicted number
of crashes from the model for the ith segment.
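The three error metrics of Figure 3.14 translate directly into code (a straightforward sketch; note that the naïve all-zero predictor indeed scores a WAPE of exactly 100):

```python
import math

def rmse(y, p):
    """Root Mean Squared Error: scale dependent."""
    return math.sqrt(sum((pi - yi) ** 2 for yi, pi in zip(y, p)) / len(y))

def mape(y, p):
    """Mean Absolute Percentage Error: undefined when any y_i is 0."""
    return 100.0 / len(y) * sum(abs((pi - yi) / yi) for yi, pi in zip(y, p))

def wape(y, p):
    """Weighted Absolute Percent Error: total absolute error as a
    percentage of the total observed crash count."""
    return 100.0 * sum(abs(pi - yi) for yi, pi in zip(y, p)) / sum(y)
```

Unlike MAPE, `wape` stays finite for segments with zero observed crashes, which is the common case in this data.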
Choosing the target number of clusters
To understand how the target number of segments/clusters affects model performance, we
perform an unweighted spatial segmentation with different target numbers. After
segmentation, the data are partitioned into five groups of equal length (Table 3.1). Four
groups of data, the training dataset, are used to build the negative binomial model. One
group, the testing dataset, is used to evaluate the model. Five-fold cross-validation is used
to estimate the error of the model on unseen data.
There are 179,754 fixed-length segments. These fixed-length segments form 18,637
homogeneous segments. We built models with target numbers of segments ranging from
100 to 18,000. As shown in Figure 3.15, both training error and testing error increase as
the target number of segments increases. The training errors increase from 44.6% to 98.6%
when the target number of segments jumps from 300 to 18,000, while the testing errors
increase from 46.0% to 135.5%. When the target number of segments is less than 300, the
negative binomial models cannot be successfully built because there are too few
observations.
Figure 3.15. Average error of the training and testing datasets with different target
numbers of segments. Average values are calculated from the results of 5-fold
cross-validation. Training error and testing error are drawn in black and red, respectively.
The black and red bars are the standard deviations from the results of the 5-fold
cross-validation.
The standard deviations of both the training and testing errors are minimized when the
target number of segments is 500, and the difference between the training error and the
testing error is also small. This suggests that the model generalizes most easily when the
target number of segments is 500. This is reasonable, since the number of crashes per
segment increases as the target number of segments decreases; consequently, crashes, as
rare events, can be better modeled. For this reason, in the remaining analysis we only
investigate the case in which the target number of segments is 500. To enable comparisons
of errors across different segmentations, which is essential to accurately estimating the
coordinate descent, we also fixed the number of segments for each route, as shown in the
Appendix.
Chapter 4. RESULTS
To investigate the performance and generalizability of our method, we perform two types
of experiments. The first type uses the whole dataset as the training dataset to evaluate the
effect of the initial weights on the resulting clustering and statistical models. The second
type employs five-fold cross-validation to systematically explore generalizability. Based
on these studies, we find the following: (1) Both the roadway segmentation and the
resulting models are sensitive to the initial segmentation weights; the minimal and
maximal errors span a large range, from 22.8% to 129.6%, with an average error of 53.3%
and a standard deviation of 22.8%. (2) Some of the final models converge to a similar
roadway segmentation, resulting in consistent models. However, this convergence does not
always occur, and the resulting models can lead to very different conclusions. (3) The
resulting models do not consistently generalize to unseen data. This may be the result of
the small sample size, the exclusion of driver behavior, or other features of the modeling.
4.1. The performance of the system on the whole dataset
To investigate the performance of the segmentation algorithm under different initial
weights, we conduct segmentation and modeling with different initial segmentation
weights. These initial weights are all one, except for a single attribute whose initial weight
is two. Five sets of initial weights can be used, since our dataset contains five attributes,
and thus five experiments are conducted. For these experiments, the whole dataset is used
to train the models. As shown in Figure 4.1, the training errors decrease smoothly with
each iteration of the algorithm. The average initial and final errors across these five
experiments are 52.1 ± 1.8% and 45.1 ± 1.5%. The variance among the five models is
small, about 1.8%. After the learning, the training performance improves by about 7% on
average.
This data shows that the clustering on the whole dataset yields consistent improvement,
regardless of the initial weights.
Figure 4.1. The training errors during learning with different initial segmentation weights.
The horizontal axis is the number of iterations and the vertical axis is the error on the
training dataset. "x = 2" (x = hcR, vcLen, wid, sd, or sdC) means that the initial
segmentation weights are all one except for attribute "x", which is two. The training
errors at each iteration are shown as lines of different colors.
4.2. Validity in the models based on the initial segmentation parameters
From the above analysis, we know that the models produce steady and comparable
improvements in performance on the whole dataset, regardless of the initial segmentation
weights. There is also some consistency in the resulting models. As shown in Table 4.1,
the model coefficients for the intercept, vcLen, wid, and sd are almost the same in all five
models, as are their p-values.
The coefficients of hcR in these five models have the largest variance: their average and
standard deviation are 46 and 77. The p-values of two models are less than 0.05, meaning
that the hcR coefficients of these two models are statistically significant. For the other
three models, however, the p-values of the hcR coefficients are greater than 0.2, and this
variable is not statistically significant.
Table 4.1. Parameters of the negative binomial models with different initial segmentation
weights. Modeling parameters from the five experiments are shown in the third to seventh
rows. The first column gives the initial weights (w0) for the segmentation; "x" (x = hcR,
vcLen, wid, sd, or sdC) means that the initial segmentation weights are all one except for
attribute "x", which is two. "Int" is the intercept. "est" and "Pr" represent the estimated
coefficient value and the p-value (Pr > |t|, where t is the t value) from the model.
w0      Int             hcR            vcLen           wid             sd
        est     Pr      est    Pr      est     Pr      est     Pr      est     Pr
hcR     -6.4    2E-16   62     0.2     -1E-4   0.007   -6E-3   2E-5    -3E-3   2E-9
vcLen   -6.2    2E-16   -36    0.7     -1E-4   0.03    -6E-3   6E-6    -4E-2   4E-13
wid     -6.3    2E-16   117    0.001   -1E-4   0.08    -7E-3   1E-6    -3E-2   2E-10
sd      -6.2    2E-16   -34    0.7     -1E-4   0.008   -5E-3   6E-5    -4E-2   1E-14
sdC     -6.4    2E-16   119    9E-5    -1E-4   0.04    -7E-3   1E-7    -3E-2   3E-9
Modeling parameters, including the coefficients and their statistical significance, differ
when the initial segmentation weights differ. Table 4.2 shows the final segmentation
weights after learning; to compare them, we normalize each row to sum to one. As shown
in Table 4.2, the average weight of wid is the smallest and that of vcLen is the largest,
which suggests that the roadway width contributes least to the segmentation while the
length of the vertical curve contributes most.
Table 4.2. The final normalized segmentation weights for five experiments with different
initial segmentation weights. The final weights for the five experiments are shown in the
second to sixth rows. The first column gives the initial weights (w0) for the segmentation;
"x=2" (x = hcR, vcLen, wid, sd, or sdC) means that the initial weights are all one except
for attribute "x", which is two. The last row is the mean and the standard deviation for
each attribute.
w0 hcR vcLen wid sd sdC
hcR=2 0.21 0.37 0.03 0.04 0.34
vcLen=2 0.11 0.69 0.04 0.03 0.13
wid=2 0.20 0.19 0.03 0.08 0.51
sd=2 0.10 0.72 0.02 0.04 0.12
sdC=2 0.29 0.24 0.04 0.08 0.35
mean ± SD 0.18 ± 0.08 0.44 ± 0.25 0.03 ± 0.01 0.05 ± 0.02 0.29 ± 0.17
In summary, within the different models, even though the performances are similar, both
modeling and segmentation parameters can vary significantly. Hence, the segmentation is
largely affected by the initial segmentation weights used.
4.3. The limit of the generalizability of models
While the resulting models are fairly consistent, their value lies in their generalizability.
To test the generalizability of the models, we systematically design 120 experiments based
on five-fold cross-validation with different initial segmentation weights. To prevent
overfitting, the whole dataset is divided into five folds (Table 3.1). We hold out one fold
as the validation dataset (column 2 in Table 4.3) and choose as the final weights the ones
that performed best on the validation dataset during learning. A second fold is used as the
testing dataset (column 1 in Table 4.3) to estimate the performance of the model on unseen
data. The remaining three folds are used as the training dataset to train the model. We
investigate 20 cases that take turns holding out two groups of data as the validation and
testing datasets. For each case, we perform six experiments with different sets of initial
segmentation weights: one experiment starts with initial weights that are all one, and the
other five begin with initial weights of one except for one attribute, where the attributes
hcR, vcLen, wid, sd, and sdC take turns receiving an initial value of two. In total, 120
experiments are performed.
Table 4.3 and Figure 4.2 show the best performance for each of the 20 cases among the six experiments with different initial segmentation weights. Across these 20 cases, the minimal and maximal testing errors are 25.7% and 129.6%, and the mean and standard deviation of the 20 errors are 53.3% and 22.8%. These are larger than the 45.1 ± 1.5% obtained from the whole dataset (Figure 4.1). The average errors for folds 1 to 5 (Table 4.3) are 43.8 ± 16.7%, 60.4 ± 46.7%, 55.4 ± 14.3%, 50.6 ± 14.1%, and 56.5 ± 14.5%, respectively. Different folds therefore yield noticeably different performance, which indicates that the models do not consistently generalize to unseen data.
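These summary statistics can be reproduced directly from the ERRpredict column of Table 4.3 (a quick check; the sample standard deviation is used):

```python
from statistics import mean, stdev

# Testing errors (ERRpredict, %) for the 20 cases in Table 4.3, panels A-T.
err = [65.974, 25.733, 44.268, 39.125,   # testing group 1
       129.61, 27.363, 44.81, 39.7,      # testing group 2
       74.588, 57.42, 47.783, 41.965,    # testing group 3
       66.115, 53.083, 31.99, 51.22,     # testing group 4
       74.515, 58.222, 39.395, 53.853]   # testing group 5

print(round(min(err), 1), round(max(err), 1))     # 25.7 129.6
print(round(mean(err), 1), round(stdev(err), 1))  # 53.3 22.8
```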
From the above analysis, we know that the model parameters vary with the different initial weights. In this study, only 500 samples are used for modeling; therefore, to further improve the generalizability of the models, more samples are needed. Models other than negative binomial regression could also be investigated. In addition, of the fifteen features in the original data, seven are used, and four of those are combined into two averaged attributes, in order to limit model complexity to what our small dataset can support. With more data, more complex models could be built using more features, which may further improve generalizability.
Table 4.3. The predicted testing errors with different initial weights. The first and second columns give the group numbers of the testing and validation datasets. The third column gives the panel labels for Figure 4.2. In the fourth column, "all=1" means that all initial segmentation weights equal one; "x=2" means that all initial weights equal one except that of attribute "x", which equals two, where "x" is one of "hcR", "sd", "sdC", "vcLen", or "wid". #min in column 5 is the iteration number at which the validation error reaches its minimal value (ERRmin), shown in column 6, during learning. ERRpredict in the last column is the corresponding testing error at iteration #min.
Testing     Validation  Panel  Initial    Iteration      Validation       Testing error
dataset     dataset     label  weights    number (#min)  error (ERRmin)   (ERRpredict)
(group No)  (group No)

1           2           A      sd = 2     2              140.87           65.974
1           3           B      vcLen = 2  3              86.353           25.733
1           4           C      hcR = 2    1              73.1             44.268
1           5           D      all = 1    3              68.501           39.125
2           1           E      all = 1    3              65.325           129.61
2           3           F      sd = 2     8              54.62            27.363
2           4           G      hcR = 2    2              55.356           44.81
2           5           H      vcLen = 2  3              56.417           39.7
3           1           I      sd = 2     4              26.216           74.588
3           2           J      all = 1    6              50.126           57.42
3           4           K      hcR = 2    1              49.796           47.783
3           5           L      vcLen = 2  6              26.049           41.965
4           1           M      vcLen = 2  4              43.874           66.115
4           2           N      vcLen = 2  3              44.894           53.083
4           3           O      vcLen = 2  6              46.21            31.99
4           5           P      vcLen = 2  4              51.378           51.22
5           1           Q      sd = 2     1              39.417           74.515
5           2           R      all = 1    2              40.914           58.222
5           3           S      vcLen = 2  1              43.152           39.395
5           4           T      vcLen = 2  2              53.183           53.853
Figure 4.2. The errors during learning for the 20 experiments based on Table 4.3. The black, red, and blue lines are the training, validation, and testing errors, respectively. The 20 plots, panels A through T, correspond to the 20 experiments in Table 4.3, each showing the best performance among the six experiments that begin with different initial segmentation weights.
Chapter 5. CONCLUSIONS AND FUTURE WORK
5.1. Significance of this thesis
The first step of car crash modeling is roadway segmentation. Three roadway segmentation methods are widely used: fixed-length segmentation, homogeneous segmentation, and variable-length segmentation. These methods can produce very short segments or heterogeneous segments, both of which can bias the models. In this thesis, we propose a machine learning-driven methodology to segment the roadway. It starts with a roadway segmentation based on a weighted distance between adjacent segments, computed from their attributes and initial weights. The segmented data are used to build models, and the modeling errors drive a coordinate-descent algorithm that updates the segmentation weights.
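As a rough illustration of the segmentation step, adjacent segments can be merged greedily by a weighted distance over the segment attributes. This is a minimal sketch under simplifying assumptions: the attribute names follow the thesis, but the Euclidean distance form and the averaging merge rule here are illustrative choices, not the exact procedure.

```python
import math

ATTRS = ["hcR", "vcLen", "wid", "sd", "sdC"]

def weighted_distance(a, b, w):
    """Weighted Euclidean distance between two adjacent segments."""
    return math.sqrt(sum(w[k] * (a[k] - b[k]) ** 2 for k in ATTRS))

def merge_adjacent(segments, weights, target):
    """Repeatedly merge the closest pair of adjacent segments
    (attribute-wise averages) until `target` segments remain."""
    segs = [dict(s) for s in segments]
    while len(segs) > target:
        i = min(range(len(segs) - 1),
                key=lambda j: weighted_distance(segs[j], segs[j + 1], weights))
        merged = {k: (segs[i][k] + segs[i + 1][k]) / 2 for k in ATTRS}
        segs[i:i + 2] = [merged]
    return segs

# Toy example: four segments reduced to two with unit weights; the two
# similar pairs merge first because their weighted distances are smallest.
toy = [{"hcR": r, "vcLen": v, "wid": 12, "sd": 4, "sdC": 2}
       for r, v in [(0.1, 300), (0.1, 310), (0.9, 900), (0.9, 910)]]
w1 = {k: 1.0 for k in ATTRS}
print(len(merge_adjacent(toy, w1, 2)))   # 2
```

Raising one attribute's weight makes differences in that attribute count more in the merge order, which is how the coordinate-descent updates steer the segmentation.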
To better evaluate and compare models with different numbers of segments, we introduce a new metric, the Weighted Absolute Percentage Error (WAPE). This metric is well suited to modeling pipelines whose first step is segmentation, and in particular to the modeling of car crashes.
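A common formulation of WAPE, which the metric here resembles, divides the total absolute prediction error by the total of the observed values. The sketch below shows that standard definition; the thesis's exact weighting may differ.

```python
def wape(observed, predicted):
    """Weighted Absolute Percentage Error: total absolute error
    relative to total observed crashes, in percent."""
    total = sum(observed)
    if total == 0:
        raise ValueError("observed counts sum to zero")
    return 100.0 * sum(abs(o - p) for o, p in zip(observed, predicted)) / total

# Toy example: crash counts on four segments vs. model predictions.
print(wape([4, 0, 2, 6], [3, 1, 2, 5]))   # 25.0
```

Unlike a per-segment percentage error, this form stays defined when individual segments have zero crashes, which matters for crash data where many short segments record no crashes at all.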
In this study, we choose a target number of segments based on the model that minimizes the standard deviation of errors under five-fold cross-validation. With this target, we conduct machine learning-driven segmentation with negative binomial regression and coordinate-descent updating of the segmentation weights. The results show that across the different models, both modeling and segmentation parameters can vary significantly. This indicates that segmentation, whether driven by a researcher or by an automated process, has a profound impact on conclusions drawn from the model. It further suggests that the ability of the models to generalize to unseen data is limited.
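A typical safety performance function under negative binomial regression models the expected crash count as a log-linear function of traffic volume and segment length. The sketch below uses hypothetical coefficients purely for illustration; the thesis's fitted values are not reproduced here.

```python
import math

def spf_expected_crashes(adt, length_mi, beta0=-7.5, beta1=0.85, beta2=1.0):
    """Expected crashes: mu = exp(b0) * ADT^b1 * L^b2.
    Coefficients are illustrative, not fitted values."""
    return math.exp(beta0) * adt ** beta1 * length_mi ** beta2

def nb_variance(mu, alpha=0.5):
    """Negative binomial variance exceeds the mean by alpha * mu^2,
    which lets the model absorb overdispersed crash counts."""
    return mu + alpha * mu ** 2

mu = spf_expected_crashes(adt=12000, length_mi=1.5)
assert nb_variance(mu) > mu   # overdispersed relative to a Poisson model
```

The overdispersion term alpha * mu^2 is what distinguishes the negative binomial from the Poisson model, whose variance equals its mean; crash counts typically show variance well above the mean.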
5.2. Future work
From the above analysis, we know the limits of the generalizability of the models. To improve generalizability, we can use more samples, employ more features, build more complex models, or investigate separate optimization of the segmentation and the modeling.
In this study, crash data and roadway geometry parameters come from Washington State only. These data form 179,754 fixed-length segments and 18,637 homogeneous segments, which produce biased results because of very short segments with zero crashes. To address this problem, we conduct segmentation before modeling, clustering neighboring segments based on similarity metrics. We find that 500 segments give the best performance and minimize the standard deviations, which means that the data for modeling contain only 500 samples. In further studies, data from other states could be included to improve the generalizability of the models.
The roadway geometry parameters, covering the horizontal curve, vertical curve, lanes, and shoulder width, contain 14 features. In this study, to segment the roadway, we use only the attributes that were found to be statistically significant: the radius of the horizontal curve (hcR), the length of the vertical curve (vcLen), the roadway width (wid), the average left and right shoulder width (sd), and the average left and right median-side shoulder width (sdC). To build the predictive models, we use these five features plus the length of the segments (len) and the average daily traffic (adt). More complex models, using more features or feature transformations, could be studied with a larger dataset.
APPENDIX
Route parameters for modeling
Route ID   Final number of segments for modeling   Length in miles   Segments per mile
5 273 277 0.99
12 69 431 0.16
2 59 326 0.18
3 19 60 0.32
26 17 134 0.13
9 16 97 0.16
525 5 31 0.16
166 5 5 0.97
27 5 90 0.06
204 4 2 1.69
903 3 10 0.30
11 3 21 0.14
501 2 14 0.14
155 1 78 0.01
100 1 5 0.21
161 1 36 0.03
193 1 3 0.39
129 1 43 0.02
128 1 2 0.45
270 1 10 0.10
308 1 3 0.29
411 1 13 0.07
433 1 1 1.08
223 1 4 0.26
504 1 52 0.02
174 1 41 0.02
528 1 3 0.29
531 1 10 0.10
538 1 4 0.28
548 1 14 0.07
821 1 25 0.04
23 1 66 0.02
906 1 3 0.38