The Pennsylvania State University
The Graduate School
School of Science, Engineering, and Technology
THE EFFECT OF SPATIAL SEGMENTATION ON SAFETY
PERFORMANCE FUNCTION MODELING
A Thesis in
Computer Science
by
Xingsheng Wang
© 2017 Xingsheng Wang
Submitted in Partial Fulfillment
of the Requirements
for the Degree of
Master of Science
December 2017
The thesis of Xingsheng Wang was reviewed and approved* by the following:
Jeremy Blum
Associate Professor of Computer Science
Thesis Adviser
Thang N. Bui
Graduate Program Chair
Associate Professor of Computer Science
Linda Null
Associate Professor of Computer Science
Sukmoon Chang
Associate Professor of Computer Science
Hyuntae Na
Assistant Professor of Computer Science
*Signatures are on file in the Graduate School.
ABSTRACT
Building predictive models called safety performance functions (SPFs) is important for
the study of roadway safety. The first step in SPF modeling is roadway segmentation, which
partitions roadways into segments. To build the predictive models, we train them
on a set of observations that covers as many cases as possible in order to build
accurate and transferable models. These observations, each with its geometric
parameters and crash count, are derived from the segmentation.
Roadway segmentation is not only an essential but also a challenging step. Previous studies
have found that segmentation approaches affect the models’ transferability, that is,
their ability to predict future crashes or crashes on other roadways. Some researchers
have found that even a small shift in segmentation yields very different models.
To find better approaches to segmentation, in this thesis, we propose a novel
segmentation methodology, which is driven by a machine learning clustering approach.
While this approach is limited in its ability to improve model transferability, it does help to
characterize the extent to which segmentation approaches affect conclusions drawn from
the models. In the clustering step of this approach, roadway segmentation is based on a
weighted distance between adjacent segments. Segmented roadway data is used to build
models that allow for the estimation of the gradient in the error metric as a function of the
segmentation weights. The weights are updated based on this gradient, and this process
repeats with the performance of models guiding the updating of weights and the resulting
segmentation.
TABLE OF CONTENTS
List of Tables ............................................................................... v
List of Figures ............................................................................... vi
List of Abbreviations ............................................................................... vii
Acknowledgements ............................................................................... ix
Chapter 1. INTRODUCTION ....................... 1
Chapter 2. RELATED WORKS ....................... 5
2.1. Roadway segmentation methods ....................... 6
2.2. Feature selection in roadway crash research ....................... 7
2.3. Evolution of the modeling methodologies ...................... 8
2.4. Challenges in the modeling of crash-frequency data ..................... 11
2.5. Influence of segmentation on resulting model ..................... 12
Chapter 3. METHODOLOGY ..................... 14
3.1. Data for segmentation and modeling ..................... 15
3.2. Algorithm for segmentation and modeling ..................... 18
3.3. Parameters for segmentation and modeling ..................... 29
Chapter 4. RESULTS ..................... 33
4.1. The performance of the system on the whole dataset ..................... 33
4.2. Validity in the models based on the initial segmentation parameters ... 34
4.3. The limit of the generalizability of models ..................... 36
Chapter 5. CONCLUSIONS AND FUTURE WORK ..................... 41
5.1. Significance of this thesis ..................... 41
5.2. Future work ..................... 42
REFERENCES ..................... 43
Appendix: Route parameters for modeling ..................... 46
List of Tables
Table 3.1. Five groups of data for cross-validation .............. 24
Table 4.1. Parameters of the negative binomial models with the
different initial segmentation weights .............. 35
Table 4.2. The final normalized segmentation weights for five
experiments with different initial segmentation weights .............. 36
Table 4.3. The predicted testing errors with different initial weights. .............. 38
List of Figures
Figure 3.1. Overall description of methodology ........... 15
Figure 3.2. Features of the horizontal curve ........... 17
Figure 3.3. Features of the vertical curve ........... 17
Figure 3.4. Algorithm for segmentation and modeling ........... 19
Figure 3.5. Algorithm of fragments clustering ........... 21
Figure 3.6. Distance measurement between two adjacent segments/clusters ......... 21
Figure 3.7. Attribute value updating formula for merging segments
during the clustering ........... 22
Figure 3.8. An example for roadway clustering ........... 23
Figure 3.9. Algorithm of building the statistical model ........... 23
Figure 3.10. Algorithm of coordinate-descent calculation ........... 25
Figure 3.11. Weights screening and coordinate-descent generation ........... 26
Figure 3.12. Coordinate-descent update ........... 27
Figure 3.13. An example for weights updating ........... 28
Figure 3.14. Formulas for three types of error measurements ........... 31
Figure 3.15. Average error of the training and the testing datasets
with different target number of segments ........... 32
Figure 4.1. The training errors of the learning with different initial
segmentation weights ........... 34
Figure 4.2. The errors with the learning for 20 experiments based on Table 4.3 .... 40
List of Abbreviations
AADT Annual Average Daily Traffic
absPG Algebraic difference in Gradients of vertical curve
ADT Average Daily Traffic
ANN Artificial Neural Network
BegMP Beginning of Mile Post
EndMP End of Mile Post
GBM Generalized Boosted Models
hcLen Length of horizontal curve
hcMSE Max Super Elevation of horizontal curve
hcR Radius of horizontal curve
HSM Highway Safety Manual
len Length of fragment
MAPE Mean Absolute Percentage Error
NB Negative Binomial models
QIC Quasilikelihood under the Independence model Criterion
RMSE Root Mean Squared Error
sd Average of left and right shoulder width
sdC Average of left and right center median side shoulder width
sdL Left shoulder width
sdLC Left center median side shoulder width
sdR Right shoulder width
sdRC Right center median side shoulder width
SPF Safety Performance Function
SVM Support Vector Machine
vcLen Length of vertical curve
WAPE Weighted Absolute Percent Error
wid Roadway width
Acknowledgements
First and foremost, I would like to express my sincere gratitude to my advisor Dr. Jeremy
Blum who continuously supports my research, for his patience, encouragement,
enthusiasm, insightful comments, and immense knowledge. I very much enjoyed working
with him. Without his guidance and constant feedback this thesis would not have been
achievable.
Secondly, I would like to thank Dr. Linda Null, Master Program Coordinator, and all
professors in this program, who provided me with the opportunity to enroll in this program.
I would like to thank Mrs. Jeanne M. Miller, Administrative Support Assistant, for all her
help and support.
I greatly appreciate Dr. Linda Null, Dr. Thang Bui, Dr. Jeremy Blum, Dr. Sukmoon
Chang, Dr. Omar El Ariss, and Dr. Hyuntae Na. They provided me with strong knowledge
and experience in the areas of database design, algorithm design, system design, data
mining, machine learning, and natural language processing.
I would like to thank my family, my wife, Li Liu and my daughter, Wendy Lily Wang.
Without my wife’s hard work and support, it is hard to imagine how I could have finished
my courses and thesis.
Last but not the least, I would like to thank Dr. Semontee and Dr. Stephanie at the
Learning Center, Penn State Harrisburg, Mr. Chad Snyder, Dr. Richard Lee Gill Jr, Dr.
Christie McCracken, and Mr. Harry Li for their suggestions about writing.
Xingsheng Wang
Chapter 1. INTRODUCTION
Improving roadway safety is an important unsolved problem. According to a 2015 World Health
Organization global status report on road safety, more than 1.2 million people die each year
on the world’s roads [1]. This number has plateaued since 2007. On U.S. roadways, the
number of motor vehicle crash fatalities reached 30,000 in 2014. About two million people
were injured in motor vehicle traffic crashes [2], an increase of 1.1 percent as compared to
2013. To study roadway safety, scientists have developed two major statistical methods since the
1970s. One method is based on descriptive statistics, which describes the basic features of
historical data. Another method is based on inferential statistics, which makes inferences
and predictions based on crash data.
The inferential method is best suited to study the relationship between car crashes and
their causes. The aim of this method is to build predictive models called safety performance
functions (SPFs). SPFs predict the likelihood of crashes on a roadway segment as a
function of the segment length, traffic counts, and roadway features. These functions,
derived using statistical and machine learning analyses, are an important tool to identify
infrastructure improvements that can increase roadway safety.
To build predictive models for car crashes, we must first segment the roadways. The
roadway segments form the observations for our models, which have an ability to predict
the number of crashes based on the attributes of each segment. Scientists have employed
various segmentation methods based on fixed-length segmentation or homogeneous
segmentation. Fixed-length segmentation divides roadways into fragments with the same
length, while homogeneous segmentation separates roadways into fragments with the same
roadway attributes. These methods, however, have been shown to have significant
shortcomings. Importantly, previous research has found that segmentation choices can
affect the transferability of resulting models, i.e., their ability to accurately predict future
crashes or crashes on other roadways.
In this study, to solve the challenges in segmentation, we propose a new methodology to
segment the roadway, called machine learning driven segmentation. The roadways are first
partitioned into fragments of equal length. Each fragment contains the number of
crashes and the five roadway geometric attributes that belong to it. The algorithm
consists of the repetition of three steps, including spatial weights updates, roadway
segmentation, and model fitting. The performance of the models in the current iteration
guides the updating of the spatial weights and segmentation for the subsequent iteration.
In the first iteration, the spatial weights are set to initial values. Segmentation consists
of clustering neighboring segments with the minimal distance. The distance between two
neighboring segments is the sum of five weighted differences. Each weighted difference is
the product of the spatial weight of an attribute and the difference between the attribute
values of the two neighbors. Once the roadway is segmented, the predictive model is built
with a widely used method for modeling crash count data known as negative binomial regression.
In subsequent iterations, the spatial weights are updated in an attempt to reduce the error
in the predictive model. The model error is a non-continuous function of the spatial weights,
so gradient descent approaches are not appropriate for this function. Instead, a coordinate-
descent approach is used to iteratively reduce the error function by making small changes
to the segmentation weights.
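The coordinate-descent idea can be sketched as follows. This is an illustrative toy, not the thesis’s implementation: `error_fn` stands in for the full segment-then-model pipeline, and the step size, iteration count, and quadratic test function are all assumptions for brevity.

```python
# Hedged sketch of coordinate descent: perturb one segmentation weight at a
# time by a small step and keep the move only if the error improves.

def coordinate_descent(weights, error_fn, step=0.05, iterations=20):
    weights = list(weights)
    best_err = error_fn(weights)
    for _ in range(iterations):
        improved = False
        for i in range(len(weights)):
            for delta in (step, -step):
                trial = list(weights)
                trial[i] += delta
                err = error_fn(trial)
                if err < best_err:
                    weights, best_err, improved = trial, err, True
        if not improved:
            break  # a local minimum of the error surface
    return weights, best_err

# Toy error surface with a minimum at weights (1, 2).
f = lambda w: (w[0] - 1) ** 2 + (w[1] - 2) ** 2
w, e = coordinate_descent([0.0, 0.0], f, step=0.25, iterations=100)
```

Because only the sign of the error change matters, this scheme works even when the error is a non-continuous function of the weights, which is why it is preferred here over gradient descent.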
Developing effective and transferable models is a difficult problem for a number of
reasons beyond the challenges presented by roadway segmentation, and thus, this thesis
addresses only a portion of the overall problem. For example, roadway crashes are rare,
random events that are caused by many different contributing factors, which are beyond
the control of transportation engineers. These factors include the roadway environment, the
driver, and the vehicle. Lum and Reagan [3] estimated that three percent, fifty-seven
percent, and two percent of car crashes are due solely to the roadway environment, the
driver, and the vehicle, respectively. Twenty-seven percent of car crashes are due to a
combination of the roadway environment and the driver. Three percent of car crashes are
due to a combination of all three factors. In this study, we investigate the frequency of car
crashes and roadway geometry parameters and build predictive models for the frequency
of car crashes based on these geometry parameters. It is important to note that this study is
limited in that it ignores important factors related to the driver and the vehicle. Moreover,
due to limits on available data, this study only considers a portion of roadway environment
characteristics. Due to the nature of crashes and the limited attributes considered, the
modelling approaches would explain only a portion of crashes that occur.
This thesis advances knowledge in the effect of segmentation choices on modelling
outcomes. Specifically, this thesis includes the following novel contributions:
It presents a new coordinate-descent approach for roadway segmentation, which
allows for different initial weights that can reflect the biases that investigators
bring to these studies with respect to the importance of different roadway features
on safety.
It introduces a new error metric for segmentation studies, Weighted Absolute
Percentage Error (WAPE). WAPE has been used in business applications for
demand forecasting. It is particularly appropriate for segmentation studies due
to its independence of the number of segments and its ability to handle the excessive
zeroes present in the data.
The coordinate-descent approach shows that the error is a locally convex
function of the spatial weights. The approach is able to consistently reduce the
error in a training dataset. However, the improvements in the performance of the
model on the training dataset do not always correspond to improvements in a
held-out testing dataset.
The spatial clustering algorithm illustrates the extent to which emphasizing the
importance of different features in the clustering can affect the resulting SPFs.
The differences in the final segmentation, the coefficients in the resulting models,
and the uneven performance on held-out testing data, all suggest inherent
limitations in the transferability of the resulting models.
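The WAPE metric introduced above has a standard definition: the sum of absolute errors divided by the sum of the actual values. A minimal sketch (the data values are hypothetical):

```python
# Illustrative sketch of WAPE. Unlike MAPE, which divides per observation
# and breaks when an actual value is zero, WAPE divides once by the total,
# so it stays well defined for crash data with many zero-count segments.

def wape(actual, predicted):
    total = sum(actual)
    if total == 0:
        raise ValueError("WAPE undefined when all actual values are zero")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / total

# Hypothetical crash counts with many zeroes, as is typical for segments.
crashes    = [0, 0, 3, 1, 0, 6]
prediction = [0.2, 0.1, 2.5, 1.4, 0.3, 5.5]
score = wape(crashes, prediction)  # 2.0 / 10 = 0.2
```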
This thesis is organized into five chapters. Chapter 2 begins with a literature review of
the related works. It includes roadway segmentation methods, feature selection in roadway
crash research, evolution of the modeling methodologies, challenges in the modeling of
crash-frequency data, and the influence of segmentation on the resulting models.
Chapter 3 describes the novel segmentation approach, the machine learning driven
segmentation. After discussing the data used in this study, we present the details of the
algorithm for segmentation and modeling. Finally, we discuss the parameters for the
segmentation and modeling.
Chapter 4 discusses the results, including the performance of the system on the whole
dataset, validity in the models based on the initial segmentation parameters, and the limit
of the generalizability of models.
Chapter 5, which is the last chapter, presents a discussion of the findings and looks into
areas for future study.
Chapter 2. RELATED WORKS
Building predictive models for roadway car crashes includes two major steps. The first step
is the segmentation of the roadway. There are three widely used segmentation methods
including fixed-length segmentation, homogeneous segmentation, and variable-length
heterogeneous segmentation. Roadway segmentation creates the units of
observation. Each observation contains several features of the segment, which must be
selected from among many candidate features.
The second step includes selecting a predictive model and then modeling the data from
roadway segmentation. Various deterministic and non-deterministic models have been
investigated. Deterministic models include a number of regression models. Linear
regression models are not appropriate for modeling of car crashes because car crashes are
rare events and count data. Poisson regression is appropriate to build models for the count
data, but may overstate or understate the likelihood of car crashes due to the over-
dispersion of car crash data. Negative binomial models allow the variance to exceed the
mean and are widely used in this area. Non-deterministic models such as artificial neural
network (ANN) and support vector machine (SVM) provide an alternative way to model
car crashes.
Building statistical models for car crash data faces several challenges: inaccuracy in the
data, including incomplete reporting of crash data and time-varying parameters; properties
of the data, such as over-dispersion; and complexities of modeling, including omitted
variables and the functional form of the model.
Roadway segmentation is a challenging step for the modeling. Studies show that it
heavily affects the resulting models. To address this challenge, this thesis uses a
coordinate-descent approach to find a heterogeneous segmentation that represents a local minimum in
the error function. This method requires the use of deterministic models in order to be able
to obtain a consistent estimate of coordinate-descent in modeling error as a function of
segmentation weights.
2.1. Roadway segmentation methods
Roadway segmentation partitions roadways into segments to create the observations
for modeling. Segmentation heavily affects the modeling of roadway crash frequencies
because crashes are rare events. Very short segments result in a large number of segments
with zero crashes, which leads to over-dispersion. Over-dispersion means that the variance
of crash data exceeds its mean. It creates challenges for statistical inference because it is
difficult to accurately assign a crash to a segment if segment lengths are very small.
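A quick numerical illustration of over-dispersion, with hypothetical crash counts: when most segments have zero crashes and a few have several, the sample variance of the counts exceeds the sample mean, violating the Poisson assumption that the two are equal.

```python
# Minimal sketch: compare mean and variance of segment crash counts.

def dispersion(counts):
    n = len(counts)
    mean = sum(counts) / n
    var = sum((c - mean) ** 2 for c in counts) / n  # population variance
    return mean, var

# Hypothetical counts for short segments: mostly zeroes, a few spikes.
counts = [0, 0, 0, 0, 0, 0, 0, 0, 5, 7]
mean, var = dispersion(counts)
# var / mean > 1 indicates over-dispersion
```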
As mentioned earlier, there are three widely used roadway segmentation methods
including fixed-length segmentation, homogeneous segmentation, and variable-length
heterogeneous segmentation.
Fixed-length segmentation divides roadways into fragments with equal length. Each
fragment is likely to have varying attributes. Lengths of fragments from eighty meters to
several hundred meters have been studied [4]. Miaou and Lum [5] reported that
eighty-meter segments could bias linear models because such segments are very short.
The second widely used method is homogeneous segmentation. Homogeneous segments
are defined as segments that do not vary with respect to one or more roadway geometry
parameters, Annual Average Daily Traffic (AADT), or both. Cafiso and Silvestro [6]
showed that the segment length should be related to AADT; lower AADT values require
longer segment lengths. The Highway Safety Manual (HSM) [7, 8] recommends “the use
of homogeneous segments with respect to AADT, number of lanes, curvature, presence of
ramp at the interchange, lane width, outside and inside shoulder widths, median width and
clear zone width”. Cafiso et al. [8] investigated five segmentation methods and found that
different segmentation methods ended in very different performances in modeling.
The method of homogeneous segmentation obviously results in many segments of
different lengths. One drawback of this approach is that the segmentation may create many
short segments. To avoid bias, the short segments may be combined (aggregated) and
heterogeneous segments generated [8]. In a winding roadway, homogeneous segmentation
results in some relatively short curves and it may be hard to confidently ascribe a particular
crash to a segment. Koorey [9] carefully discussed how to deal with short segments with
length less than fifty meters. He suggested combining curves with less than two degrees of
total deflection with the subsequent data, creating a special segment type for tight reverse
curves, and removing short segments with tight curves at an intersection. The cut-off values
for these rules are somewhat arbitrarily chosen, but the aim of this method is to eliminate
short segments.
2.2. Feature selection in roadway crash research
In roadway car crash research, researchers have focused on explanatory variables including
the roadway environment, the driver, and the vehicle. Numerous researchers have focused
on the relationships between crashes and roadway geometric design variables, such as
horizontal and vertical curvature, lane width, and shoulder width [10, 11, 12, 13]. In
addition, the Annual Average Daily Traffic (AADT) is usually considered in modeling [11].
Models have also been developed for special roadway configurations such as highway
intersections [14], highway-railway crossings [15], and highway construction zones [16,
17]. Driving conditions, a critical risk factor, have also been shown to affect the likelihood of car
crashes [18]. Snowfall, icy roadways, and wet roadways make vehicle handling more
difficult. Several studies that developed models for winter crashes have been reported [18,
19, 20].
Driver characteristics and behavior [21] are another important factor, one that heavily
affects the frequency of car crashes. Blevins [22], for example, reported that 80 percent of
crashes are caused by distracted driving. Amarasingha and Dissanayake [23] focused
instead on drivers’ age as a surrogate for driving behavior when they modeled injury
severity of young drivers using highway crash data. Hu et al. [12], on the other hand,
created temporal models of crash counts for senior and non-senior drivers.
In this study, we ignore driving conditions, the driver-specific factors, and the vehicle-
specific factors. Instead, the effect of these factors is incorporated in the error term. The
models described in this thesis are based on average daily traffic (ADT) and roadway
geometry information including horizontal curvature, vertical curvature, the roadway
width in feet, the shoulder width, and the median side shoulder width.
2.3. Evolution of the modeling methodologies
Modeling methodologies have evolved for the modeling of crash likelihood. Linear
regression is not suitable for modeling of car crash count data. Poisson regression is
suitable for count data modeling, but it may overstate or understate the likelihood of car
crashes because of the over-dispersed nature of crash data. Negative binomial models allow
the variance to exceed the mean and are the best choice among these models. Recently,
more complex models such as support vector machine models (SVM) and artificial neural
networks (ANNs) have been employed in this area.
Linear regression is not appropriate for car crash data modeling. Car crashes are
random, discrete, nonnegative, and rare events. Conventional linear regression lacks the
distributional properties to describe such count data, which does not follow a normal distribution [24].
Miaou and Lum [5] proposed two special linear regression models, additive linear
regression model and multiplicative linear regression model. However, the test statistics
showed that neither is appropriate for crash data.
Poisson regression may overstate or understate the likelihood of car crashes. The
Poisson model has three properties [25]: the mean of the counts experienced by an
individual equals its variance; events in the Poisson process are independent and
memoryless; and the event rate within time intervals is constant. Count data is the number
of events per time interval. The mean count can be evaluated by distributions of Poisson
family. The Poisson regression was used for the modeling of car crashes [26, 13]. The
model takes the following form:
$$P(Y_i = y_i) = \frac{\lambda_i^{y_i} e^{-\lambda_i}}{y_i!} \qquad (i = 1, 2, 3, \ldots, n)$$

where

$$\lambda_i = E(Y_i) = e^{\sum_{j=1}^{k} x_{ij}\beta_j}$$

where $\beta_1, \beta_2, \ldots, \beta_k$ are $k$ unknown regression parameters, $P(Y_i = y_i)$ is the
probability of $y_i$ crashes occurring on roadway segment $i$ in one year, and $\lambda_i$ is the
expected crash frequency for segment $i$.

The parameters $\beta_1, \beta_2, \ldots, \beta_k$ can be estimated with the maximum likelihood method
[27], the quasi-likelihood method [28], or the generalized least squares method [29]. The
maximum likelihood method uses the likelihood function $L(\beta)$ to estimate the coefficient
vector $\beta = (\beta_1, \beta_2, \ldots, \beta_k)$ [30].
$$L(\beta) = \prod_{i=1}^{n} P(Y_i = y_i) = \prod_{i=1}^{n} \frac{\lambda_i^{y_i} e^{-\lambda_i}}{y_i!}$$

where

$$\lambda_i = E(Y_i) = e^{\sum_{j=1}^{k} x_{ij}\beta_j} = \exp(X_i\beta)$$

That is,

$$\ln(\lambda_i) = X_i\beta$$
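As an illustration of the maximum-likelihood idea, the sketch below fits a single-coefficient Poisson regression by simple gradient ascent on the log-likelihood. This is a toy, not the estimation procedure used in the thesis; the data, the learning rate, and the intercept-free single-predictor form are all assumptions chosen for brevity.

```python
import math

def poisson_loglik(beta, xs, ys):
    # Log-likelihood with ln(lambda_i) = beta * x_i (single predictor).
    ll = 0.0
    for x, y in zip(xs, ys):
        lam = math.exp(beta * x)
        ll += y * beta * x - lam - math.lgamma(y + 1)
    return ll

def fit_beta(xs, ys, lr=0.005, steps=5000):
    # Gradient ascent on the log-likelihood:
    # d(ll)/d(beta) = sum_i (y_i - lambda_i) * x_i
    beta = 0.0
    for _ in range(steps):
        grad = sum((y - math.exp(beta * x)) * x for x, y in zip(xs, ys))
        beta += lr * grad
    return beta

# Toy data in which counts grow roughly exponentially with x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2, 3, 4, 8]
beta = fit_beta(xs, ys)
```

In practice, library routines (e.g., iteratively reweighted least squares in GLM software) do this estimation; the gradient form above simply makes the likelihood maximization concrete.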
The limitation of the Poisson regression is the equality between mean and variance of
the counts. In many applications, count data for car crashes show over-dispersion, that is,
the variance of the data is greater than the mean. As a result, Miaou and Lum [5] reported
that Poisson regression models may overstate or understate the likelihood of car crashes.
Negative binomial models allow the variance to exceed the mean. To overcome the
over-dispersion problem, an error term ($\varepsilon_i$) is added to the expected crash frequency ($\lambda_i$) in
the negative binomial regression model [13, 31, 32]:

$$\ln(\lambda_i) = X_i\beta + \varepsilon_i$$

where $e^{\varepsilon_i}$ is a gamma-distributed error with mean one and variance $\alpha$ [30]. This relaxes the
assumption of the Poisson model that the mean of the crash frequencies equals the variance.
The negative binomial model allows the variance to exceed the mean and is widely used to
build predictive models for over-dispersed data. In this study, we use
this model to predict the likelihood of car crashes.
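The gamma-Poisson mixture behind the negative binomial model can be illustrated with a small simulation. This is a sketch under the standard parameterization (a gamma multiplier with mean one and variance alpha), not code from the thesis; for small means, drawing the Poisson variate by CDF inversion is adequate.

```python
import math
import random

def simulate_nb(mu, alpha, n, seed=0):
    # Marginal counts from lambda_i scaled by a gamma multiplier are
    # over-dispersed: Var(Y) is approximately mu + alpha * mu**2.
    rng = random.Random(seed)
    counts = []
    for _ in range(n):
        # gamma multiplier with mean 1 and variance alpha
        g = rng.gammavariate(1.0 / alpha, alpha)
        lam = mu * g
        # Poisson(lam) draw by CDF inversion (fine for small lam)
        u, k = rng.random(), 0
        p = cdf = math.exp(-lam)
        while u > cdf and k < 1000:
            k += 1
            p *= lam / k
            cdf += p
        counts.append(k)
    return counts

counts = simulate_nb(mu=2.0, alpha=0.5, n=5000)
m = sum(counts) / len(counts)
v = sum((c - m) ** 2 for c in counts) / len(counts)
# Expect m near 2 and v near mu + alpha*mu**2 = 4, i.e., clearly above m.
```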
Other methods have also been applied to the statistical analysis of crash-frequency data,
and these have been widely reviewed [33, 34]. Besides
different regression models, neural networks [30], Bayesian neural networks [31], and
support vector machine models (SVM) [32] have been reported. For example, Chang [30]
demonstrated that artificial neural network (ANN) is an alternative method for the study in
this area. These methods provide alternative ways to build predictive models for roadway
car crashes.
2.4. Challenges in the modeling of crash-frequency data
Challenges in the modeling of crash-frequency data have been widely discussed in two
recent reviews [33, 35]. Here, we discuss the potential challenges that are related to our
research. These challenges may come from inaccuracy in the data, including incomplete
reporting of crash data and time-varying parameters; properties of the data, such as
over-dispersion; complexities of modeling, including omitted variables and the
functional form of the model; and segmentation.
Time varying parameters Usually, the data for the modeling are considered over some
time period and some parameter values may change during this period. If we ignore within-
period variations, the results may lose explanatory information. To minimize the influence
of time-varying parameters, the data we used spans only one year.
Over-dispersion of incident counts Car crashes are rare events, and the variance of
crash data exceeds its mean due to the large number of segments with no crashes, a
condition called over-dispersion. In this study, we address this problem by using the negative
binomial model, which can handle over-dispersion in the data.
Omitted variables bias When the size of a dataset limits the number of parameters that
can be estimated, researchers must decrease the complexity of the model. This can result in
biased estimates. Some researchers have used random parameter models to try to capture
segment specific heterogeneity resulting from omitted variables [36].
Functional form of the modeling The functional form is a very important factor for
modeling. For over-dispersed crash data, a large body of research has demonstrated
that non-linear forms are much better than linear forms [37, 24]. However, non-linear
models are more complex and need estimation procedures to increase the accuracy of the
estimated expected crash frequency [38, 39].
Incomplete reporting of crash data Kumara and Chin [35] reported that under-
reporting might produce biased estimates. Less severe crashes are underreported, and some
potentially serious problems are also not reported. We do not know the magnitude of
incomplete reporting, but studies have shown that it biases the modeling.
2.5. Influence of segmentation on resulting model
Roadway segmentation heavily affects car crash modeling. For example, Cafiso et al. [8]
investigated and assessed five different segmentation approaches: (1) homogeneous
segmentation with respect to AADT and curvature; (2) each segment containing 2 curves and
2 tangents while avoiding short segments; (3) each segment having constant AADT; (4) fixed-length
segmentation; and (5) each segment incorporating variables in a stepwise procedure. They used
the Quasi-likelihood under the Independence model Criterion (QIC) to evaluate the
goodness of fit of the models. The values of QIC for these five segmentation methods are
3322, 1082, 1762, 2707, and 4511, respectively. While the second method had the best
results, the parameters for the models varied widely. This study is important as it
demonstrated how roadway segmentation affects the modeling.
The work here seeks to extend this analysis by using a coordinate-descent approach to
explore how segmentation produces different models and results. Non-deterministic
methods such as ANN and Generalized Boosted Models (GBM), are potentially good
methods for the study of crash-frequency data. But these methods cannot provide stable
results after many iterations of learning. In the current work, segmentation and modeling are
conducted alternately. Roadway segmentation uses a spatial clustering algorithm based on
the weighted distances between features of adjacent segments. Segmented data is then used
for the modeling. The segmentation weights are updated with a coordinate-descent approach
based on the errors of models. Segmentation and modeling are repeated until a threshold is
achieved. The segmentation and model selection must be deterministic to maintain
consistency for the coordinate-descent calculations. Therefore, we use a deterministic
method, the negative binomial method, to build models.
Chapter 3. METHODOLOGY
To build predictive models for the count of car crashes based on roadway geometry
parameters, we designed a spatial clustering algorithm which uses coordinate-descent to
find a roadway segmentation. As shown in Figure 3.1, the raw data contains four types of
roadway geometry information, average daily traffic, and car crash information. After the
data is cleaned, the roadway is partitioned into fragments with fixed length. Each fragment
contains five geometric parameters, average daily traffic, and the number of crashes in this
fragment.
The fixed-length segments are further clustered based on the weighted distance between
two neighboring segments. Then, the clustered segments are divided into three sub-datasets:
a training dataset, a validation dataset, and a testing dataset. The training dataset is used to
build a negative binomial model followed by calculation of the training error. The training
error, together with a set of errors from weight screening, is used to generate the
coordinate-descent update. Weight screening is a parallel process that evaluates small
changes to the clustering weights. It selects the best change to update these weights for the
next iteration. After the clustering weights are updated, the new iteration begins. The
updating of the clustering weights plays a central role in the algorithm. It continues until a
threshold is attained.
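The clustering step described above can be sketched as an agglomerative merge of adjacent fragments. The attribute names, the length-weighted merge rule, and the stopping condition (a target number of segments) are illustrative stand-ins for the details in Figures 3.5–3.7, not the thesis’s actual implementation.

```python
def merge(a, b):
    # Lengths and crash counts add; other attributes become
    # length-weighted averages of the two segments.
    total = a["len"] + b["len"]
    out = {"len": total, "crashes": a["crashes"] + b["crashes"]}
    for key in a:
        if key not in ("len", "crashes"):
            out[key] = (a[key] * a["len"] + b[key] * b["len"]) / total
    return out

def distance(a, b, weights):
    # Weighted distance between two adjacent segments.
    return sum(w * abs(a[key] - b[key]) for key, w in weights.items())

def cluster(fragments, weights, target):
    # Repeatedly merge the closest pair of adjacent segments.
    segs = [dict(f) for f in fragments]
    while len(segs) > target:
        i = min(range(len(segs) - 1),
                key=lambda j: distance(segs[j], segs[j + 1], weights))
        segs[i:i + 2] = [merge(segs[i], segs[i + 1])]
    return segs

# Hypothetical fragments: only one geometric attribute ("wid") for brevity.
fragments = [
    {"len": 0.1, "crashes": 0, "wid": 24.0},
    {"len": 0.1, "crashes": 1, "wid": 24.0},
    {"len": 0.1, "crashes": 0, "wid": 36.0},
]
segs = cluster(fragments, {"wid": 1.0}, target=2)
```

Here the first two fragments, having identical widths, are merged first; the resulting 0.2-mile segment keeps their combined crash count, while the wider third fragment remains separate.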
Figure 3.1. Overall description of the methodology. Weight screening based on the
clustering weights consists of ten parallel segmentation-and-modeling processes. Java or R
package names are shown in parentheses.
3.1. Data for segmentation and modeling
Data resources
The raw data contains roadway geometry information, average daily traffic (ADT), and car
crash information from 2012 in Washington State. The data include 33 two-lane state routes
with 33,000 crashes, spanning about 1,800 miles.
The initial data comes from four geometry files and an ADT file (Figure 3.1). Each
record in these five files contains the state route number, the beginning milepost (BegMP),
and the ending milepost (EndMP). The milepost data is recorded to the nearest hundredth
of a mile. Each record in the car crash file contains the state route number and the specific
milepost where the car crash occurs.
Attributes of data
The geometric information for roadways includes horizontal curve, vertical curve, lane,
and shoulder width information. Horizontal curves [40] (Figure 3.2) are characterized by
three features: the radius of the curve (hcR), the length of the curve in feet (hcLen), and
the max super elevation (hcMSE). Super elevation is the banking of the roadway such that
the outside edge of pavement is higher than the inside edge. Vertical curves (Figure 3.3)
are characterized by the length of the curve in feet (vcLen), the algebraic difference in
gradients (absPG), and the length of the parabolic curve (L). L is the projection of the curve
onto a horizontal surface, i.e., its plan distance. As shown in Figure 3.3, the point of
curvature (PC) is the beginning of a vertical curve, the point of tangency (PT) is the end of
the vertical curve, and the point of intersection of the tangents (PI) is the point of vertical
intersection.
The lane data has two attributes: the number of lanes and the roadway width in feet (wid).
The shoulder data has four attributes: the left shoulder width (sdL), the right shoulder
width (sdR), the left center median side shoulder width (sdLC), and the right center
median side shoulder width (sdRC). Left and right are labeled based on travelling on the
roadway in the direction of increasing milepost values. In this study, we use the average left
and right shoulder width (sd) and the average left and right median side shoulder width (sdC).
Figure 3.2. Features of the horizontal curve.
Figure 3.3. Features of the vertical curve. A curve is shown in dark blue.
Data cleaning and transformation
For this study, the roadway is first partitioned into very short fixed-length segments of
0.01 miles each. For any segment without horizontal curvature, the radius of the horizontal
curve (hcR) is assigned a value of 40,000, which is significantly larger than any value in
the dataset. For any segment without vertical curvature, the length of the vertical curve
(vcLen) is assigned a value of 0.
The car crash file contains the specific milepost where each crash occurred. Each crash
is distributed to its two neighboring fragments: i.e., each fragment is assigned 0.5 crashes.
Any segment with missing attributes is deleted from the dataset. Crash data is not used in
the segmentation, but it is used for the modeling.
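The half-crash allocation just described can be sketched as follows (a minimal illustration; the function name and the boundary-index convention are assumptions, not the thesis code):

```python
def distribute_crashes(crash_mileposts, num_fragments, frag_len=0.01):
    """Assign each crash 0.5/0.5 to the two fragments that meet at its milepost."""
    crashes = [0.0] * num_fragments
    for mp in crash_mileposts:
        # Mileposts are recorded to the nearest hundredth of a mile, so each
        # crash falls on the boundary between fragment k-1 and fragment k.
        k = round(mp / frag_len)
        if 0 <= k - 1 < num_fragments:
            crashes[k - 1] += 0.5
        if 0 <= k < num_fragments:
            crashes[k] += 0.5
    return crashes
```

For example, a crash at milepost 0.05 contributes 0.5 crashes each to the fragments ending and starting at that milepost.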
The attributes are altered in order to eliminate direction-specific values, attributes that
are highly correlated, and attributes that are not statistically significant. The left and right
shoulder attributes are replaced with a single attribute equal to their sum. The center left
and center right shoulder attributes are similarly replaced with a single attribute
representing their sum. The other geometric attributes used in the modeling are lane width,
horizontal curvature, and vertical curvature. Our statistical models also use non-geometric
attributes: crash counts, average daily traffic, and segment length. For the segmentation,
attributes are normalized, while for the modeling, attributes are used directly.
3.2. Algorithm for segmentation and modeling
The roadway segmentation algorithm uses an iterative, coordinate-descent, spatial
clustering approach. This approach tries to find a local optimum in the roadway
segmentation space with respect to the ability of the statistical model to predict car
crashes. As shown in Figure 3.4, the algorithm starts with an initial fixed-length
segmentation and clusters the fixed-length segments based on the current set of attribute
weights. The algorithm then builds a statistical model based on the clustered data. The
clustering weight for each attribute is then perturbed, both increased and decreased; this
is called weight screening. For each change, a new clustering is performed and a model is
built. Based on these alternate clusters and models, the algorithm estimates the coordinate
descent of the statistical model's loss function as a function of the weights. The weights
are then adjusted to follow this coordinate descent, and the process repeats until the
termination criteria are met.
Figure 3.4. Algorithm for segmentation and modeling. Segmentation is the process of
clustering two neighboring segments based on the weighted distance between them.
Algorithm for segmentation and modeling
1. initialize ω, Δ, and N // ω is a weight vector for clustering
2. data0 ← initial_segment(raw data) // fixed-length segmentation
3. data ← cluster(data0, ω, N) // cluster data based on attributes, ω, and N
4. error ← model(data) // modeling with negative binomial models
5. while (Δ > 0.001) // Δ is the increment value for ω
6. ∇ ← coordinate_descent(ω, Δ, error) // Weights Screening
7. if (|∇| is 0) // if each element in ∇ equals 0
8. Δ ← Δ / 2
9. else
10. ω ← update(ω, ∇) // clustering weights updating
11. data ← cluster(data0, ω, N) // cluster based on updated ω and N
12. error ← model(data) // modeling and calculation of error
13. end if
14. end while
Variables initialization (Step 1)
Weights (ω⁰ = [ω₁⁰, ω₂⁰, ..., ω₅⁰]) are initialized to a set of initial values. The weights
ω₁⁰, ω₂⁰, ..., ω₅⁰ represent the initial segmentation weights for the radius of the horizontal
curve (hcR), the length of the vertical curve (vcLen), the roadway width (wid), the average
left and right shoulder width (sd), and the average left and right median side shoulder
width (sdC), respectively. To establish a baseline clustering, the initial weights are all set
to one. Subsequent runs of the system assign initial weights of one to every attribute except
one, whose weight is assigned an initial value of two. These alternative initial weights
allow us to explore more of the segmentation space and to determine whether local minima
in the segmentation space produce models that lead to different conclusions.
The variable Δ is used to control the magnitude of the change in the weights during
each iteration, as explained later in the "Coordinate-descent to update clustering
weights" section. We tried different initial values of Δ from 0.001 to 5.0. The results
indicate that an initial value of 0.5 produces the best results.
The variable N is the target number of segments for the modeling after clustering. That
is, the clustering algorithm stops when the number of clusters reaches N. Here, each cluster
is a segment of the roadway. The process used to determine the value of this parameter is
described in the last section of this chapter.
Segmentation algorithm (Step 2, Step 3, and Step 11)
The segmentation algorithm contains two steps. The first step is fixed-length segmentation
(Step 2 in Figure 3.4). The second step spatially clusters the fixed-length segments into a
certain number of clusters (Steps 3 and 11 in Figure 3.4). Attributes are normalized by
scaling all values to the range from 0 to 1 (min-max normalization) before segmentation.
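The 0-to-1 scaling step can be sketched as a simple min-max scaler (a minimal illustration; the handling of constant attributes is an assumption, since the thesis does not specify it):

```python
def min_max_normalize(values):
    """Scale a list of attribute values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant attribute: map everything to 0 (assumed convention)
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]
```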
Roadways are first partitioned into fixed-length fragments, each 0.01 miles long (Step 2).
Next, the fixed-length fragments are clustered into the target number of clusters (Steps 3
and 11). After these steps, each cluster contains one or more contiguous fixed-length
segments of a roadway. The clustering algorithm is shown in Figure 3.5.
Figure 3.5. Algorithm for fragment clustering. Clustering the fragments produces the
segmentation; each cluster is a segmented fragment.
The distance between the jth segment/cluster and the (j+1)th segment/cluster is calculated
with a weighted Manhattan distance formula (Figure 3.6).
Figure 3.6. Distance measurement between two adjacent segments/clusters.
For i = 1..5, a_i,j represents the value in the jth cluster of the radius of the horizontal curve
(hcR), the length of the vertical curve (vcLen), the roadway width (wid), the average left
and right shoulder width (sd), and the average left and right median side shoulder width (sdC),
Algorithm of clustering
3.1. n ← the number of fixed-length segments
3.2. N ← the target number of clusters
3.3. while n > N
3.4. find the two neighboring fragments/clusters with the minimal distance
3.5. combine these two fragments/clusters into a new fragment/cluster
3.6. calculate the distances between the new fragment and its two neighboring fragments
3.7. n ← n − 1
3.8. end while
Distance measurement
distance(j, j+1) = ∑_{i=1}^{5} ω_i |a_i,j − a_i,j+1|
respectively. ω_i is the weight for attribute i. Clusters j and j+1 are adjacent clusters on
the roadway.
There are two types of attributes. The first type includes hcR, vcLen, wid, sd, and sdC.
These are combined with a length-weighted formula: the value of the attribute in the
newly merged cluster, a_i,new, is calculated from clusters j and j+1 (Figure 3.7).
Figure 3.7. Attribute value updating formula for merging segments during the clustering.
l_j and l_{j+1} are the lengths of clusters j and j+1.
The second type includes the number of crashes. When segments/clusters are merged,
we just sum the number of crashes from clusters j and j+1.
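The clustering loop of Figure 3.5, together with the distance rule of Figure 3.6 and the merge rules just described, can be sketched as follows (a minimal illustration; the dictionary-based fragment representation and function names are assumptions, not the thesis implementation):

```python
def weighted_distance(a, b, w):
    """Weighted Manhattan distance between adjacent clusters' attribute vectors."""
    return sum(wi * abs(ai - bi) for wi, ai, bi in zip(w, a, b))

def merge(cj, ck):
    """Merge two adjacent clusters: length-weighted attribute average, summed crashes."""
    lj, lk = cj["len"], ck["len"]
    attrs = [(lj * aj + lk * ak) / (lj + lk)
             for aj, ak in zip(cj["attrs"], ck["attrs"])]
    return {"len": lj + lk, "attrs": attrs,
            "crashes": cj["crashes"] + ck["crashes"]}

def cluster_fragments(fragments, w, target_n):
    """Greedily merge the closest adjacent pair until target_n clusters remain."""
    clusters = list(fragments)
    while len(clusters) > target_n:
        # Find the adjacent pair with the minimal weighted distance.
        j = min(range(len(clusters) - 1),
                key=lambda i: weighted_distance(clusters[i]["attrs"],
                                                clusters[i + 1]["attrs"], w))
        clusters[j:j + 2] = [merge(clusters[j], clusters[j + 1])]
    return clusters
```

Note that only adjacent clusters are ever compared or merged, which is what makes this a spatial clustering: each resulting cluster is a contiguous stretch of roadway.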
To illustrate the roadway clustering algorithm, an example is shown in Figure 3.8. There
are 10 fragments before a new iteration of the clustering, and the weight for each attribute
is 1. Distances between neighboring fragments are calculated using the formula shown in
Figure 3.6. The distance of 3 between fragments 9 and 10 is the minimal value among the
nine distances. Therefore, fragments 9 and 10 are combined into a new fragment 9′. The
length and the number of crashes of fragment 9′ are the sums of those of fragments 9 and
10, while the attributes a1, a2, and a3 are calculated with the formula in Figure 3.7.
Attribute value update when merging segments
a_i,new = (l_j a_i,j + l_{j+1} a_i,j+1) / (l_j + l_{j+1})
Figure 3.8. An example of roadway clustering. The first column is the fragment sequence
number, from 1 to 10. The second column is the length of each fragment. The third column
is the number of crashes in each fragment. The fourth to sixth columns are the roadway
geometry attributes a1, a2, and a3. The last column is the distance between neighboring
fragments. The last row gives the parameters for the combined fragment 9′ after combining
fragments 9 and 10.
Building the statistical model (Step 4 & Step 12)
Building the statistical model that predicts the frequency of crashes for each segment
includes several steps as shown in Figure 3.9.
Figure 3.9. Algorithm of building the statistical model.
Algorithm of modeling
4.1. split the clustered/segmented data into training, validation, and testing
datasets (Optional)
4.2. build negative binomial (NB) model based on the training datasets
4.3. calculate error (called error or WAPE) for training dataset
Table 3.1. Five groups of data for cross-validation.
Group No. Route No. in Washington State
1 12
2 5, 3, 129, 548, 166
3 2, 525, 903, 100, 308, 128
4 155, 23, 504, 161, 821, 501, 411, 223, 528, 193, 204, 433
5 26, 9, 27, 174, 11, 270, 531, 538, 906
There are 33 routes in Washington State used in this study. The segmented/clustered
roadways are further divided into five groups based on route numbers (Table 3.1) for the
purpose of cross-validation. The total lengths in miles of the five groups are equal.
To test the generalizability of the models, we partition these five groups of data into three
folds (Step 4.1). One fold, the testing dataset, is used to estimate the performance of the
model on unseen data. A second fold, the validation dataset, is used to prevent overfitting
of the model during learning. The remaining three groups of data, the training dataset, are
used to train the model. The algorithm is then run a total of 120 times (see "4.3 The limit
of the generalizability of models" in Chapter 4).
To evaluate the effect of initial weights on resulting clustering and statistical models, the
entire dataset is used as the training dataset (see “4.1 The performance of the system on the
whole dataset” in Chapter 4) since estimates of the performance of the models on unseen
data had already been established.
For Step 4.2, the negative binomial (NB) models are built on the training dataset. For the
modeling, the geometric attributes hcR, vcLen, wid, sd, and sdC, the average daily traffic
(adt), and the length of the fragment (len) are used as the input data. The output of the
model is the predicted number of crashes. The error is calculated from the actual number
of crashes (yi) and the predicted number of crashes (pi), as shown in Figure 3.14.
Non-normalized values of these attributes are used to build the statistical models. In
addition to the geometric attributes, both average daily traffic and segment length are used
as offsets in the modeling, so that the number of crashes is proportional to these two
parameters.
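The role of the two offsets can be illustrated with the mean function of the negative binomial model (a sketch only; the coefficient values used below are hypothetical, not the fitted values from this study):

```python
import math

def predicted_crashes(beta0, betas, attrs, adt, length):
    """Mean of an NB regression with log(adt) and log(length) as offsets:
    mu = adt * length * exp(beta0 + sum(beta_i * x_i)).
    Because offsets enter with a fixed coefficient of 1, the expected crash
    count is directly proportional to both traffic volume and segment length."""
    linear = beta0 + sum(b * x for b, x in zip(betas, attrs))
    return adt * length * math.exp(linear)
```

Doubling the segment length (or the traffic volume) doubles the expected crash count, which is exactly the proportionality stated above.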
Coordinate-descent to update clustering weights (Step 6)
The algorithm of coordinate-descent generation is shown in Figure 3.10. First, ten sets of
weights (Step 6.1) are created by increasing and decreasing the current weights () by the
weight increment (). Each set of weights is used to segment/cluster the data (Step 6.3
and 6.5), and segmented/clustered data is then used to model and generate error (Step 6.4
and 6.6). Lastly, a coordinate-descent approach is used to update the weights (Step 6.8)
based on ten sets of errors and the error from the previous iteration (Figure 3.11).
Figure 3.10. Algorithm of coordinate-descent calculation.
Algorithm of coordinate-descent calculation
coordinate_descent(ω, Δ, error) {
6.1. generate 10 sets of weights (ω1+, ω2+, ..., ω5+, ω1−, ω2−, ..., ω5−) based on
ω and Δ
6.2. for j in 1..5
6.3. data ← cluster(data0, ωj+, N) // clustering
6.4. ej+ ← model(data) // model and calculate error
6.5. data ← cluster(data0, ωj−, N) // clustering
6.6. ej− ← model(data) // model and calculate error
6.7. end for
6.8. ∇ ← update(error, e = [e1+, e2+, ..., e5+, e1−, e2−, ..., e5−])
6.9. return ∇
}
Figure 3.11. Weights screening and coordinate-descent generation.
In Step 6.1, ten sets of weights (ω1+, ω2+, ..., ω5+, ω1−, ω2−, ..., ω5−) are generated
based on the weights of the previous iteration (ω) and Δ. That is, each new set of weights
is derived from ω by adding Δ to, or subtracting Δ from, the weight of one attribute. Ten
sets of weights can be generated because there are five attributes in this study. After
obtaining the ten sets of weights, we cluster the fixed-length segments based on each set,
producing ten segmented/clustered datasets (Steps 6.3 and 6.5). Based on these ten
datasets, ten models are built and ten errors (e = [e1+, e2+, ..., e5+, e1−, e2−, ..., e5−]) are
calculated (Steps 6.4 and 6.6).
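The generation of the ten perturbed weight vectors in Step 6.1 can be sketched as (the function name is an assumption):

```python
def screening_weight_sets(w, delta):
    """Generate the 2 * len(w) perturbed weight vectors used in weight screening:
    for each attribute j, one copy with w[j] + delta and one with w[j] - delta."""
    sets = []
    for j in range(len(w)):
        plus = list(w)
        plus[j] += delta
        minus = list(w)
        minus[j] -= delta
        sets.append(plus)   # ωj+
        sets.append(minus)  # ωj−
    return sets
```

With five attributes this yields exactly the ten weight sets screened in each iteration.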
The weight update step (Step 6.8) estimates the best change in the weights based on the
error from the previous iteration (error) and the errors from the weight screening step
(e = [e1+, e2+, ..., e5+, e1−, e2−, ..., e5−]). As described above, two sets of weights are
generated for each attribute (ωj+ and ωj− for attribute j). After segmenting based on these
two sets of weights and building models on the two segmented datasets, we obtain two
errors (ej+ and ej− for attribute j). The coordinate-descent update algorithm based on these
three errors (error, ej+, and ej−) is shown below.
Figure 3.12. Coordinate-descent update. j = 1..5 represents the attributes of hcR, vcLen,
wid, sd, and sdC, respectively.
Weights update (Step 10)
The new weight (ωᵢ) for an attribute is updated based on its weight from the previous
iteration, the difference in the error metric (∇ᵢ), and Δ:
ωᵢ ← ωᵢ × (1 + ∇ᵢ × Δ / G)
G = ∑_{i=1}^{5} |∇ᵢ|
G is the sum of the absolute values of the differences in the error metric (∇). i = 1..5
represents the attributes hcR, vcLen, wid, sd, and sdC, respectively. Thus, this formula
adjusts each weight in the direction of improvement, by a factor proportional to the
improvement in the error metric.
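Combining the update rule of Figure 3.12 with the formula above gives the following sketch (the numeric errors used for illustration are hypothetical, not taken from Figure 3.13):

```python
def coordinate_descent_nabla(error, e_plus, e_minus):
    """Per-attribute error difference: positive if increasing the weight by
    delta improved on the previous error, negative if decreasing it did,
    and zero if neither perturbation helped (the rule of Figure 3.12)."""
    nabla = []
    for ep, em in zip(e_plus, e_minus):
        if ep < em:
            nabla.append(error - ep if ep < error else 0.0)
        else:
            nabla.append(em - error if em < error else 0.0)
    return nabla

def update_weights(w, nabla, delta):
    """w_i <- w_i * (1 + nabla_i * delta / G), where G = sum(|nabla_i|)."""
    G = sum(abs(n) for n in nabla)
    if G == 0:
        return list(w)  # no improving perturbation was found
    return [wi * (1 + n * delta / G) for wi, n in zip(w, nabla)]
```

A weight with a zero ∇ᵢ is left unchanged, while positive and negative components move their weights up and down, respectively.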
To illustrate the weight update, an example is shown in Figure 3.13. The error before
weight screening (e0) is 100. Ten errors (e1+, e2+, ..., e5+, e1−, e2−, ..., e5−) are generated
from weight screening. The third row (∇) is the coordinate descent calculated with the
algorithm shown in Figure 3.12. The row labeled ω is the weights before updating, and the
row labeled
Coordinate-descent update
update(error, e = [e1+, e2+, ..., e5+, e1−, e2−, ..., e5−]) {
initialize the gradient, ∇ = [0, 0, 0, 0, 0]
for j in 1..5
if (ej+ < ej−)
if (ej+ < error) ∇j ← error − ej+
else
if (ej− < error) ∇j ← ej− − error
end if
end for
return ∇
}
ω′ is the weights after updating. As shown in Figure 3.13, the performance improvement
determines the change in each weight. For attribute a1, the best performance is the error of
50, which corresponds to increasing the weight for a1 by Δ; after the update, the weight for
a1 is increased by 25%. For attribute a2, there is no improvement in performance, so the
weight for a2 does not change. In addition, the direction of the perturbation (increasing or
decreasing by Δ) determines whether the weight increases or decreases. For example, for
attribute a4, the best performance is the error of 90, which corresponds to decreasing the
weight for a4 by Δ; therefore, after the update, the weight for a4 is decreased by 5%.
Figure 3.13. An example for weights updating.
Stop criterion for the algorithm (Step 5)
The variable Δ serves as the threshold for the algorithm (Step 5). It is halved (Step 8)
when there is no improvement in the performance on the training dataset. The algorithm
stops when Δ is less than 0.001.
Implementation and running time of algorithms
Segmentation is implemented in Java, and modeling is implemented in R with the MASS
package [41]. Penn State LionX clusters are used, and parallel computing is implemented
with MPICH2. Weight screening and coordinate-descent generation comprise ten
independent tasks per iteration, which allows parallel computation. Each iteration takes
about ten minutes with four nodes and sixteen gigabytes of memory. Each experiment
takes 3-48 hours, depending on the convergence of the algorithm.
3.3. Parameters for segmentation and modeling
Error measurement
One of the challenges in selecting a loss function for the algorithm is that typical loss
functions will not allow for comparison between models with different numbers of
segments/clusters. Commonly used error metrics, such as Root Mean Squared Error
(RMSE) and Mean Absolute Percentage Error (MAPE) are not appropriate for this problem.
Instead, we introduce the use of Weighted Absolute Percentage Error (WAPE), which has
not previously been used in studies of car crashes. WAPE has been used in demand
forecasting models, which have similar requirements to Safety Performance Function
(SPF) models.
Root Mean Squared Error (RMSE) is defined as shown in Figure 3.14. This metric is scale
dependent, which means that when it is used in SPF models, its value is a function of the
number of segments. The total number of car crashes in the dataset is fixed, so when the
number of segments decreases, the average number of car crashes per fragment increases.
RMSE will therefore increase intrinsically as the number of segments decreases.
Another commonly used metric is Mean Absolute Percentage Error (MAPE), calculated
as shown in Figure 3.14 [42]. Note that MAPE is undefined when the actual value (yi) is
0. MAPE, like RMSE, is scale dependent in this setting.
Weighted Absolute Percent Error (WAPE) is an alternative to MAPE and is a widely
used metric for forecasting errors [42]. WAPE is defined as shown in Figure 3.14. In
time-series demand forecasting models, it offers advantages over MAPE: unlike MAPE,
it remains defined when the actual values (yi) are 0, and it is scale independent, which is
important when time period lengths vary in demand forecasting models.
In this study, WAPE represents the total absolute prediction error as a percentage of the
total number of car crashes. The rationale for using WAPE over MAPE is based on the
distribution of the data and the goal of this research. Car crashes are rare events, and most
segments experience no crashes over the course of a year; thus, there are many segments
with values of 0. Rather than requiring scale independence over time period lengths, this
work requires scale independence over segment length, and MAPE makes the error metric
scale dependent. In addition, the goal of this study is to use predictive models to reduce
the total number of crashes. MAPE weights an accident on a segment with few crashes
more heavily than an accident on a segment with more crashes, a difference that does not
make sense given the goal of the modelling. Lastly, WAPE provides a way to compare the
performance of a model with a naïve model that predicts zero crashes for all segments:
such a naïve model would have a WAPE of 100.
RMSE = √( (1/N) ∑_{i=1}^{N} (p_i − y_i)² )
MAPE = (100/N) ∑_{i=1}^{N} |(p_i − y_i) / y_i|
WAPE = 100 × ∑_{i=1}^{N} |p_i − y_i| / ∑_{i=1}^{N} y_i
Figure 3.14. Formulas for three types of error measurements. N is the number of segments
for the modeling. yi is the number of crashes for the ith segment. pi is the predicted number
of crashes from the model for the ith segment.
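The three error metrics of Figure 3.14 translate directly into code (a straightforward sketch; note that the naïve all-zero predictor indeed scores a WAPE of exactly 100):

```python
import math

def rmse(y, p):
    """Root Mean Squared Error: scale dependent."""
    return math.sqrt(sum((pi - yi) ** 2 for yi, pi in zip(y, p)) / len(y))

def mape(y, p):
    """Mean Absolute Percentage Error: undefined when any y_i is 0."""
    return 100.0 / len(y) * sum(abs((pi - yi) / yi) for yi, pi in zip(y, p))

def wape(y, p):
    """Weighted Absolute Percent Error: total absolute error as a
    percentage of the total observed crash count."""
    return 100.0 * sum(abs(pi - yi) for yi, pi in zip(y, p)) / sum(y)
```

Unlike MAPE, `wape` stays finite for segments with zero observed crashes, which is the common case in this data.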
Choosing the target number of clusters
To understand how the target number of segments/clusters affects model performance, we
perform an unweighted spatial segmentation with different target numbers. After
segmentation, the data are partitioned into five groups of equal length (Table 3.1). Four
groups of data, the training dataset, are used to build the negative binomial model. One
group, the testing dataset, is used to evaluate the model. Five-fold cross-validation is used
to estimate the error of the model on unseen data.
There are 179,754 fixed-length segments. These fixed-length segments form 18,637
homogeneous segments. We built models with target numbers of segments ranging from
100 to 18,000. As shown in Figure 3.15, both training error and testing error increase as
the target number of segments increases. The training errors increase from 44.6% to 98.6%
when the target number of segments jumps from 300 to 18,000, while the testing errors
increase from 46.0% to 135.5%. When the target number of segments is less than 300, the
negative binomial models cannot be successfully built because there are too few
observations.
Figure 3.15. Average error of the training and testing datasets with different target
numbers of segments. Average values are calculated from the results of 5-fold
cross-validation. Training error and testing error are drawn in black and red, respectively.
The black and red bars are the standard deviations from the results of the 5-fold
cross-validation.
The standard deviations of both the training and testing errors are minimized when the
target number of segments is 500, and the difference between the training error and the
testing error is also small. This suggests that the model generalizes most easily when the
target number of segments is 500. This is reasonable, since the number of crashes per
segment increases as the target number of segments decreases; consequently, crashes, as
rare events, can be better modeled. For this reason, in the remaining analysis we only
investigate the case in which the target number of segments is 500. To enable comparisons
of errors across different segmentations, which is essential to accurately estimating the
coordinate descent, we also fixed the number of segments for each route, as shown in the
Appendix.
Chapter 4. RESULTS
To investigate the performance and generalizability of our method, we perform two types
of experiments. The first type uses the whole dataset as the training dataset to evaluate the
effect of the initial weights on the resulting clustering and statistical models. The second
type employs five-fold cross-validation to systematically explore generalizability. Based
on these studies, we find the following: (1) Both the roadway segmentation and the
resulting models are sensitive to the initial segmentation weights; the minimal and
maximal errors span a large range, from 22.8% to 129.6%, with an average error of 53.3%
and a standard deviation of 22.8%. (2) Some of the final models converge to a similar
roadway segmentation, resulting in consistent models. However, this convergence does not
always occur, and the resulting models can lead to very different conclusions. (3) The
resulting models do not consistently generalize to unseen data. This may be the result of
the small sample size, the exclusion of driver behavior, or other features of the modeling.
4.1. The performance of the system on the whole dataset
To investigate the performance of the segmentation algorithm under different initial
weights, we conduct segmentation and modeling with different initial segmentation
weights. These initial weights are all one, except for a single attribute whose initial weight
is two. Five sets of initial weights can be used, since our dataset contains five attributes,
and thus five experiments are conducted. For these experiments, the whole dataset is used
to train the models. As shown in Figure 4.1, the training errors decrease smoothly with
each iteration of the algorithm. The average initial and final errors across these five
experiments are 52.1 ± 1.8% and 45.1 ± 1.5%. The variance among the five models is
small, about 1.8%. After the learning, the training performance improves by about 7% on
average.
This data shows that the clustering on the whole dataset yields consistent improvement,
regardless of the initial weights.
Figure 4.1. The training errors during learning with different initial segmentation weights.
The horizontal axis is the number of iterations and the vertical axis is the error on the
training dataset. "x = 2" (x = hcR, vcLen, wid, sd, or sdC) means that the initial
segmentation weights are all one except for attribute "x", which is two. The training
errors at each iteration are shown as lines of different colors.
4.2. Validity in the models based on the initial segmentation parameters
From the above analysis, we know that the models produce steady and comparable
improvements in performance on the whole dataset, regardless of the initial segmentation
weights. There is also some consistency in the resulting models. As shown in Table 4.1,
the model coefficients for the intercept, vcLen, wid, and sd are almost the same in all five
models, as are their p-values.
The coefficients of hcR in these five models have the largest variance: their average and
standard deviation are 46 and 77. The p-values of two models are less than 0.05, meaning
that the hcR coefficients of these two models are statistically significant. For the other
three models, however, the p-values of the hcR coefficients are greater than 0.2, and this
variable is not statistically significant.
Table 4.1. Parameters of the negative binomial models with different initial segmentation
weights. Modeling parameters from the five experiments are shown in the third to seventh
rows. The first column gives the initial weights (w0) for the segmentation; "x" (x = hcR,
vcLen, wid, sd, or sdC) means that the initial segmentation weights are all one except for
attribute "x", which is two. "Int" is the intercept. "est" and "Pr" represent the estimated
coefficient value and the p-value (Pr > |t|, where t is the t value) from the model.
w0      Int             hcR            vcLen           wid             sd
        est     Pr      est    Pr      est     Pr      est     Pr      est     Pr
hcR     -6.4    2E-16   62     0.2     -1E-4   0.007   -6E-3   2E-5    -3E-3   2E-9
vcLen   -6.2    2E-16   -36    0.7     -1E-4   0.03    -6E-3   6E-6    -4E-2   4E-13
wid     -6.3    2E-16   117    0.001   -1E-4   0.08    -7E-3   1E-6    -3E-2   2E-10
sd      -6.2    2E-16   -34    0.7     -1E-4   0.008   -5E-3   6E-5    -4E-2   1E-14
sdC     -6.4    2E-16   119    9E-5    -1E-4   0.04    -7E-3   1E-7    -3E-2   3E-9
Modeling parameters, including the coefficients and their statistical significance, differ
when the initial segmentation weights differ. Table 4.2 shows the final segmentation
weights after learning; to compare them, we normalize each row to sum to one. As shown
in Table 4.2, the average weight of wid is the smallest and that of vcLen is the largest,
which suggests that the roadway width contributes least to the segmentation while the
length of the vertical curve contributes most.
Table 4.2. The final normalized segmentation weights for five experiments with different
initial segmentation weights. The final weights for the five experiments are shown in the
second to sixth rows. The first column gives the initial weights (w0) for the segmentation;
"x=2" (x = hcR, vcLen, wid, sd, or sdC) means that the initial weights are all one except
for attribute "x", which is two. The last row is the mean and the standard deviation for
each attribute.
w0 hcR vcLen wid sd sdC
hcR=2 0.21 0.37 0.03 0.04 0.34
vcLen=2 0.11 0.69 0.04 0.03 0.13
wid=2 0.20 0.19 0.03 0.08 0.51
sd=2 0.10 0.72 0.02 0.04 0.12
sdC=2 0.29 0.24 0.04 0.08 0.35
mean ± SD 0.18 ± 0.08 0.44 ± 0.25 0.03 ± 0.01 0.05 ± 0.02 0.29 ± 0.17
In summary, within the different models, even though the performances are similar, both
modeling and segmentation parameters can vary significantly. Hence, the segmentation is
largely affected by the initial segmentation weights used.
4.3. The limit of the generalizability of models
While the resulting models are fairly consistent, their value lies in their generalizability.
To test the generalizability of the models, we systematically design 120 experiments based
on five-fold cross-validation with different initial segmentation weights. To prevent
overfitting, the whole dataset is divided into five folds (Table 3.1). We hold out one fold
as the validation dataset (column 2 in Table 4.3) and choose as the final weights the ones
that performed best on the validation dataset during learning. A second fold is used as the
testing dataset (column 1 in Table 4.3) to estimate the performance of the model on unseen
data. The remaining three folds are used as the training dataset to train the model. We
investigate 20 cases that take turns holding out two groups of data as the validation and
testing datasets. For each case, we perform six experiments with different sets of initial
segmentation weights: one experiment starts with initial weights that are all one, and the
other five begin with initial weights of one except for one attribute, where the attributes
hcR, vcLen, wid, sd, and sdC take turns receiving an initial value of two. In total, 120
experiments are performed.
Table 4.3 and Figure 4.2 show the best performance for each of the 20 cases among the six experiments with different initial segmentation weights. Across these 20 cases, the minimal and maximal testing errors are 25.7% and 129.6%, and the mean and standard deviation of the 20 errors are 53.3% and 22.8%. These are larger than the 45.1 ± 1.5% obtained from the whole dataset (Figure 4.1). The average errors for folds 1 to 5 (Table 4.3) are 43.8 ± 16.7%, 60.4 ± 46.7%, 55.4 ± 14.3%, 50.6 ± 14.1%, and 56.5 ± 14.5%, respectively. Different folds therefore yield noticeably different performance, which indicates that the models do not consistently generalize to unseen data.
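These summary statistics can be reproduced directly from the ERRpredict column of Table 4.3 (a quick check; the sample standard deviation is used):

```python
from statistics import mean, stdev

# Testing errors (ERRpredict, %) for the 20 cases in Table 4.3, panels A-T.
err = [65.974, 25.733, 44.268, 39.125,   # testing group 1
       129.61, 27.363, 44.81, 39.7,      # testing group 2
       74.588, 57.42, 47.783, 41.965,    # testing group 3
       66.115, 53.083, 31.99, 51.22,     # testing group 4
       74.515, 58.222, 39.395, 53.853]   # testing group 5

print(round(min(err), 1), round(max(err), 1))     # 25.7 129.6
print(round(mean(err), 1), round(stdev(err), 1))  # 53.3 22.8
```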
From the above analysis, we know that the model parameters vary with the different initial weights. In this study, only 500 samples are used for modeling; therefore, to further improve the generalizability of the models, more samples are needed. Models other than negative binomial regression could also be investigated. In addition, of the fifteen features in the original data, seven are used, and four of those are combined into two averaged attributes, in order to limit model complexity to what our small dataset can support. With more data, more complex models could be built using more features, which may further improve generalizability.
Table 4.3. The predicted testing errors with different initial weights. The first and second columns give the group numbers of the testing and validation datasets. The third column gives the panel labels for Figure 4.2. In the fourth column, "all=1" means that all initial segmentation weights equal one; "x=2" means that all initial weights equal one except that of attribute "x", which equals two, where "x" is one of "hcR", "sd", "sdC", "vcLen", or "wid". #min in column 5 is the iteration number at which the validation error reaches its minimal value (ERRmin), shown in column 6, during learning. ERRpredict in the last column is the corresponding testing error at iteration #min.
Testing     Validation  Panel  Initial    Iteration      Validation       Testing error
dataset     dataset     label  weights    number (#min)  error (ERRmin)   (ERRpredict)
(group No)  (group No)

1           2           A      sd = 2     2              140.87           65.974
1           3           B      vcLen = 2  3              86.353           25.733
1           4           C      hcR = 2    1              73.1             44.268
1           5           D      all = 1    3              68.501           39.125
2           1           E      all = 1    3              65.325           129.61
2           3           F      sd = 2     8              54.62            27.363
2           4           G      hcR = 2    2              55.356           44.81
2           5           H      vcLen = 2  3              56.417           39.7
3           1           I      sd = 2     4              26.216           74.588
3           2           J      all = 1    6              50.126           57.42
3           4           K      hcR = 2    1              49.796           47.783
3           5           L      vcLen = 2  6              26.049           41.965
4           1           M      vcLen = 2  4              43.874           66.115
4           2           N      vcLen = 2  3              44.894           53.083
4           3           O      vcLen = 2  6              46.21            31.99
4           5           P      vcLen = 2  4              51.378           51.22
5           1           Q      sd = 2     1              39.417           74.515
5           2           R      all = 1    2              40.914           58.222
5           3           S      vcLen = 2  1              43.152           39.395
5           4           T      vcLen = 2  2              53.183           53.853
Figure 4.2. The errors during learning for the 20 experiments based on Table 4.3. The black, red, and blue lines are the training, validation, and testing errors, respectively. The 20 plots, panels A through T, correspond to the 20 experiments in Table 4.3, each showing the best performance among the six experiments that begin with different initial segmentation weights.
Chapter 5. CONCLUSIONS AND FUTURE WORK
5.1. Significance of this thesis
The first step of car crash modeling is roadway segmentation. Three roadway segmentation methods are widely used: fixed-length segmentation, homogeneous segmentation, and variable-length segmentation. These methods can produce very short segments or heterogeneous segments, both of which can bias the models. In this thesis, we propose a machine learning-driven methodology to segment the roadway. It starts with a roadway segmentation based on a weighted distance between adjacent segments, computed from their attributes and initial weights. The segmented data are used to build models, and the modeling errors drive a coordinate-descent algorithm that updates the segmentation weights.
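As a rough illustration of the segmentation step, adjacent segments can be merged greedily by a weighted distance over the segment attributes. This is a minimal sketch under simplifying assumptions: the attribute names follow the thesis, but the Euclidean distance form and the averaging merge rule here are illustrative choices, not the exact procedure.

```python
import math

ATTRS = ["hcR", "vcLen", "wid", "sd", "sdC"]

def weighted_distance(a, b, w):
    """Weighted Euclidean distance between two adjacent segments."""
    return math.sqrt(sum(w[k] * (a[k] - b[k]) ** 2 for k in ATTRS))

def merge_adjacent(segments, weights, target):
    """Repeatedly merge the closest pair of adjacent segments
    (attribute-wise averages) until `target` segments remain."""
    segs = [dict(s) for s in segments]
    while len(segs) > target:
        i = min(range(len(segs) - 1),
                key=lambda j: weighted_distance(segs[j], segs[j + 1], weights))
        merged = {k: (segs[i][k] + segs[i + 1][k]) / 2 for k in ATTRS}
        segs[i:i + 2] = [merged]
    return segs

# Toy example: four segments reduced to two with unit weights; the two
# similar pairs merge first because their weighted distances are smallest.
toy = [{"hcR": r, "vcLen": v, "wid": 12, "sd": 4, "sdC": 2}
       for r, v in [(0.1, 300), (0.1, 310), (0.9, 900), (0.9, 910)]]
w1 = {k: 1.0 for k in ATTRS}
print(len(merge_adjacent(toy, w1, 2)))   # 2
```

Raising one attribute's weight makes differences in that attribute count more in the merge order, which is how the coordinate-descent updates steer the segmentation.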
To better evaluate and compare models with different numbers of segments, we introduce a new metric, the Weighted Absolute Percentage Error (WAPE). This metric is well suited to modeling pipelines whose first step is segmentation, and in particular to the modeling of car crashes.
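A common formulation of WAPE, which the metric here resembles, divides the total absolute prediction error by the total of the observed values. The sketch below shows that standard definition; the thesis's exact weighting may differ.

```python
def wape(observed, predicted):
    """Weighted Absolute Percentage Error: total absolute error
    relative to total observed crashes, in percent."""
    total = sum(observed)
    if total == 0:
        raise ValueError("observed counts sum to zero")
    return 100.0 * sum(abs(o - p) for o, p in zip(observed, predicted)) / total

# Toy example: crash counts on four segments vs. model predictions.
print(wape([4, 0, 2, 6], [3, 1, 2, 5]))   # 25.0
```

Unlike a per-segment percentage error, this form stays defined when individual segments have zero crashes, which matters for crash data where many short segments record no crashes at all.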
In this study, we choose a target number of segments based on the model that minimizes the standard deviation of errors under five-fold cross-validation. With this target, we conduct machine learning-driven segmentation with negative binomial regression and coordinate-descent updating of the segmentation weights. The results show that across the different models, both modeling and segmentation parameters can vary significantly. This indicates that segmentation, whether driven by a researcher or by an automated process, has a profound impact on conclusions drawn from the model. It further suggests that the ability of the models to generalize to unseen data is limited.
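A typical safety performance function under negative binomial regression models the expected crash count as a log-linear function of traffic volume and segment length. The sketch below uses hypothetical coefficients purely for illustration; the thesis's fitted values are not reproduced here.

```python
import math

def spf_expected_crashes(adt, length_mi, beta0=-7.5, beta1=0.85, beta2=1.0):
    """Expected crashes: mu = exp(b0) * ADT^b1 * L^b2.
    Coefficients are illustrative, not fitted values."""
    return math.exp(beta0) * adt ** beta1 * length_mi ** beta2

def nb_variance(mu, alpha=0.5):
    """Negative binomial variance exceeds the mean by alpha * mu^2,
    which lets the model absorb overdispersed crash counts."""
    return mu + alpha * mu ** 2

mu = spf_expected_crashes(adt=12000, length_mi=1.5)
assert nb_variance(mu) > mu   # overdispersed relative to a Poisson model
```

The overdispersion term alpha * mu^2 is what distinguishes the negative binomial from the Poisson model, whose variance equals its mean; crash counts typically show variance well above the mean.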
5.2. Future work
From the above analysis, we know the limits of the generalizability of the models. To improve generalizability, we can use more samples, employ more features, build more complex models, or investigate separate optimization of the segmentation and the modeling.
In this study, crash data and roadway geometry parameters come from Washington State only. These data form 179,754 fixed-length segments and 18,637 homogeneous segments, which produce biased results because of very short segments with zero crashes. To address this problem, we conduct segmentation before modeling, clustering neighboring segments based on similarity metrics. We find that 500 segments give the best performance and minimize the standard deviations, which means that the data for modeling contain only 500 samples. In further studies, data from other states could be included to improve the generalizability of the models.
The roadway geometry parameters, covering the horizontal curve, vertical curve, lanes, and shoulder width, contain 14 features. In this study, to segment the roadway, we use only the attributes that were found to be statistically significant: the radius of the horizontal curve (hcR), the length of the vertical curve (vcLen), the roadway width (wid), the average left and right shoulder width (sd), and the average left and right median-side shoulder width (sdC). To build the predictive models, we use these five features plus the length of the segments (len) and the average daily traffic (adt). More complex models, using more features or feature transformations, could be studied with a larger dataset.
APPENDIX
Route parameters for modeling
Route ID   Final number of segments for modeling   Length in miles   Segments per mile
5 273 277 0.99
12 69 431 0.16
2 59 326 0.18
3 19 60 0.32
26 17 134 0.13
9 16 97 0.16
525 5 31 0.16
166 5 5 0.97
27 5 90 0.06
204 4 2 1.69
903 3 10 0.30
11 3 21 0.14
501 2 14 0.14
155 1 78 0.01
100 1 5 0.21
161 1 36 0.03
193 1 3 0.39
129 1 43 0.02
128 1 2 0.45
270 1 10 0.10
308 1 3 0.29
411 1 13 0.07
433 1 1 1.08
223 1 4 0.26
504 1 52 0.02
174 1 41 0.02
528 1 3 0.29
531 1 10 0.10
538 1 4 0.28
548 1 14 0.07
821 1 25 0.04
23 1 66 0.02
906 1 3 0.38