Car Accident Patterns for Manhattan, NYC: 2015-2017

Yiwen Hu and Nathan Jiwatram

Advanced Vector GIS Final Project Report: December 15, 2017

Abstract

This project examines car accidents using kernel density, the average nearest neighbor index, emerging hot spot analysis, Ordinary Least Squares (OLS), Geographically Weighted Regression (GWR), and a Support Vector Machine (SVM). Kernel density mapping shows that car accidents are concentrated in the downtown areas. The average nearest neighbor index is used to identify the spatial distribution of the car accidents. The emerging hot spot analysis reveals that the highways running around the city have far fewer car accidents than the other areas. The regression analysis using OLS and GWR is used to identify which of the explanatory factors contribute to car accidents. The SVM allows us to take a sample of car accident data and create an accurate estimate of the total car accident distribution.

Problem statement

In recent years, the potential benefits of using GIS to present car accident data and to identify the key areas and features where accidents are concentrated have become clear. One example of the valuable information gained is a study by Katharine D. Bennett, who analyzed emergency 911 call-for-service records for motor vehicle accidents in Johnson City, Tennessee, and found that twice as many motor vehicle accidents occur near commercial properties as near residential properties. Motor vehicle accidents are also more likely to occur on arterial thoroughfares. Approximately 40% of injury accidents happen at roadway intersections, with 22% occurring at signalized intersections (Bennett, 2010).

Previous studies have used techniques such as regression, density mapping, and hot spot analysis to compare car accidents with features such as road type, neighboring land use, and traffic lights (Anderson, 2009). Other projects have expanded on this, such as one conducted in Mashhad, Iran, that introduced five classifications for determining the eventfulness of car accidents in the study area (Shafabakhsh et al., 2017). This project explores several untested spatial variables: hospitals, schools, universities, plazas, slope, and traffic levels were compared with car accident locations. These explanatory variables demonstrate a different pattern than the variables used in previous studies and shed light on the placement of these public areas and their impact on car accidents. Additionally, this project created an accident prediction tool that requires only a sample of the car accident data.

Data

All of the data was collected from two sources: NYC Open Data and gis.ny.gov. The traffic data was the only dataset taken from gis.ny.gov, a clearinghouse for both local and New York State data. The traffic data was available for 2015 and was downloaded as a vector shapefile of line features. The NYC Open Data website is an aggregation of New York City data created by the Mayor’s Office of Data Analytics (MODA) and the Department of Information Technology and Telecommunications (DoITT). From it we collected individual shapefiles for 2015-2017, including hospitals, schools, bus stations, hydrants, universities, parking signs, plazas, and elevation data for Manhattan, New York City. These files were in WGS 1984, so they first had to be projected. The car accident data was downloaded as a .csv file, and we used its latitude and longitude coordinates to create a shapefile of car accident points. After looking at several possible projections, we decided that NAD 1983 StatePlane New York Long Island FIPS 3104 Feet would provide the most accurate maps for the New York City area. Next, we converted the elevation point data to a raster with the Point to Raster tool and calculated the slope from that raster. We then classified the slope values into 10 classes using natural breaks so that the data would load faster.

Methodology

GIS analysis/operations performed (with references to papers that have used this

method). Also GIS tools parameters used and why you chose the tool.

In order to determine the spatial distribution of the car accidents, we ran the Average Nearest Neighbor tool, which analyzes feature locations rather than values found in the attribute table and is best suited to point data.
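As a rough illustration of what the Average Nearest Neighbor index measures, the sketch below computes the ratio of the observed to the expected mean nearest-neighbor distance, together with its z-score, using the standard formulas for a random pattern. The function name, the sample points, and the study-area size are hypothetical and are not the project's data. A ratio below 1 with a strongly negative z-score indicates clustering.

```python
import math

def average_nearest_neighbor(points, area):
    """Return (ratio, z_score) for a list of (x, y) points in a study area."""
    n = len(points)
    # Observed mean distance from each point to its nearest neighbor (brute force).
    total = 0.0
    for i, (xi, yi) in enumerate(points):
        nearest = min(
            math.hypot(xi - xj, yi - yj)
            for j, (xj, yj) in enumerate(points)
            if j != i
        )
        total += nearest
    observed = total / n
    # Expected mean distance if the same number of points were placed at random.
    expected = 0.5 / math.sqrt(n / area)
    # Standard error of the mean nearest-neighbor distance.
    std_error = 0.26136 / math.sqrt(n * n / area)
    return observed / expected, (observed - expected) / std_error

# Hypothetical example: a tight cluster of 4 points in a 100 x 100 study area.
clustered = [(10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0)]
ratio, z = average_nearest_neighbor(clustered, area=100.0 * 100.0)
print(ratio, z)
```

The brute-force search is quadratic in the number of points, so a real accident dataset would call for a spatial index; the sketch is only meant to show the arithmetic.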

To create the density maps, we used the Kernel Density tool. This tool calculates a magnitude per unit area from point features, using a kernel function to fit a smooth surface over the points. We used a cell size of 45, which was determined by trial, and area units of square feet.
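The kernel density idea can be sketched as below, assuming a quartic kernel and a hypothetical search radius. The report gives only the cell size, so the radius, the function name, and the sample points are illustrative assumptions.

```python
import math

def quartic_kernel_density(points, cell_centers, radius):
    """Density (events per unit area) at each cell center using a quartic kernel."""
    densities = []
    for cx, cy in cell_centers:
        total = 0.0
        for px, py in points:
            d = math.hypot(cx - px, cy - py)
            if d < radius:
                # Quartic kernel: the weight falls smoothly to 0 at the search radius.
                total += (3.0 / math.pi) * (1.0 - (d / radius) ** 2) ** 2
        # Dividing by radius squared converts the summed weights to a per-area density.
        densities.append(total / radius ** 2)
    return densities

# Hypothetical accident points and two cell centers (same linear units as the radius).
points = [(0.0, 0.0), (1.0, 0.0), (50.0, 50.0)]
cell_centers = [(0.0, 0.0), (100.0, 100.0)]
densities = quartic_kernel_density(points, cell_centers, radius=10.0)
print(densities)
```

Cells close to several points receive a high value, while cells farther than the radius from every point receive 0, which is what produces the smooth surface.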

The car accident point data was input into the Optimized Hot Spot Analysis tool. Our focus in using this tool was on the spatial presence of the incidents, not the attributes of each point. The tool aggregates the car accident points and determines an appropriate scale of analysis. For the incident data aggregation method, we chose to count incidents within fishnet polygons. The “bounding polygons defining where incidents are possible” option was set to our nyc_area polygon, which defines our study area.

To further examine the distribution of car accidents, we created a space-time cube so that changes in car accidents could be viewed through both a temporal and a spatial lens. We used a time step of 1 month in order to capture seasonal changes in car accidents. The cube was then run through the Emerging Hot Spot Analysis tool, which calculates the Getis-Ord Gi* statistic for each bin in the space-time cube using the neighborhood distance and neighborhood time step parameters. We used the Create Fishnet tool to build a fishnet grid on which to perform the hot spot analysis, OLS, GWR, and SVM. We determined through trial that a grid of 60 columns and 80 rows, with a cell size of 200 meters by 250 meters, was the best way to allow the patterns in the data to be visualized.
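Counting incidents per fishnet cell can be sketched as follows. The grid dimensions (60 columns by 80 rows, cells of 200 by 250 meters) come from the text; the grid origin, the sample coordinates, and the function name are hypothetical.

```python
def count_points_in_fishnet(points, x_min, y_min, cell_w, cell_h, n_cols, n_rows):
    """Return an n_rows x n_cols grid of point counts (row 0 is the lowest-y row)."""
    grid = [[0] * n_cols for _ in range(n_rows)]
    for x, y in points:
        col = int((x - x_min) // cell_w)
        row = int((y - y_min) // cell_h)
        # Points outside the fishnet extent are ignored.
        if 0 <= col < n_cols and 0 <= row < n_rows:
            grid[row][col] += 1
    return grid

# Hypothetical accident coordinates in meters, relative to an assumed origin of (0, 0).
accidents = [(10.0, 10.0), (150.0, 240.0), (250.0, 10.0), (-5.0, 0.0), (12000.0, 0.0)]
grid = count_points_in_fishnet(
    accidents, x_min=0.0, y_min=0.0, cell_w=200.0, cell_h=250.0, n_cols=60, n_rows=80
)
print(grid[0][0], grid[0][1])
```

The same per-cell table can then hold the counts, sums, and means of the explanatory variables, so that every later analysis works on one record per cell.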

To identify the factors contributing to car accidents, we used OLS and GWR. The independent variables were traffic volume, roads, hospitals, schools, universities, plazas, bus stations, and slope. We chose these factors because we expect more cars and people near hospitals, schools, universities, and plazas, and therefore a higher likelihood of car accidents. For the hospitals, schools, universities, and plazas, we created a 100-meter buffer, as we think those features influence their surrounding areas. The dependent variable was the car accident count. For the roads, hospitals, schools, universities, plazas, and bus stations, we counted the number of features in each fishnet cell; for traffic volume, we summed the values in each cell; and for slope, we calculated the mean value in each cell.
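A minimal sketch of how an OLS fit and its R-squared values can be computed from a per-cell table is given below. It uses made-up values for two predictors only, solves the normal equations directly, and is not the ArcGIS implementation; all names and numbers are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def fit_ols(rows, y):
    """Fit y = b0 + b1*x1 + ... and return (coefficients, r2, adjusted_r2)."""
    design = [[1.0] + list(r) for r in rows]  # prepend the intercept column
    n, k = len(y), len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    beta = solve_linear_system(xtx, xty)
    predicted = [sum(b * v for b, v in zip(beta, d)) for d in design]
    mean_y = sum(y) / n
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, predicted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    r2 = 1.0 - ss_res / ss_tot
    # Adjusted R-squared penalizes the number of fitted coefficients.
    adjusted_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - k)
    return beta, r2, adjusted_r2

# Hypothetical cells: (road count, summed traffic volume) and accident counts
# generated exactly as 1 + 2 * roads + 0.5 * traffic.
cells = [(1, 2), (2, 1), (3, 5), (4, 3), (5, 8)]
accidents = [4.0, 5.5, 9.5, 10.5, 15.0]
beta, r2, adjusted_r2 = fit_ols(cells, accidents)
print(beta, r2, adjusted_r2)
```

Because OLS fits a single set of coefficients for the whole study area, it cannot capture relationships that vary from place to place, which is the motivation for also running GWR.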

Lastly, we used a Support Vector Machine (SVM), a supervised machine learning algorithm, to predict the car accident frequency degree. An SVM can efficiently perform non-linear classification using the kernel trick, implicitly mapping its inputs into high-dimensional feature spaces. Given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that assigns new examples to one category or the other, making it a non-probabilistic binary linear classifier (Wikipedia, n.d.). Yiwen developed an SVM package for raster image classification in an R geospatial analysis class, and we modified it to suit car accident prediction. The default parameters for the SVM are a radial kernel, a gamma of 1, and a cost of 1. We changed gamma to 0.14 (gamma = 1 / (number of independent variables)) and cost to 100 (we tried 1, 10, 50, and 100 for cost, and 100 gave the best results). For the SVM, we used the same independent variables as for OLS and classified the car accident frequency into 6 degrees (0-5). We then integrated them into one table, so that each fishnet cell has the independent variables and the car accident frequency degree as attributes. After that, we randomly selected 50% of the rows in the table as training data to build the SVM model. Normally, an SVM does not need 50% of the data for training, but more than half of our data has a frequency degree of 0, so we had to enlarge the training dataset in order to get enough training data for every degree. An alternative is to select the training data manually rather than randomly so that each degree has enough training data, but the results are then easily influenced by the analyst’s selection. Once the SVM model is built, we can input the independent variables to predict the car accident frequency.
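The random 50% split, and the class imbalance it has to cope with, can be sketched as follows. The gamma and cost values are the ones reported in the text; the table contents, the seed, and the function name are hypothetical, and the SVM fit itself (done in R in this project) is not reproduced here.

```python
import random
from collections import Counter

GAMMA = 0.14  # value reported in the text
COST = 100    # value reported in the text

def split_half(records, seed=0):
    """Randomly assign 50% of the records to training and the rest to testing."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

# Hypothetical table of (cell_id, frequency_degree); most cells have degree 0.
records = [(i, 0) for i in range(60)] + [(60 + i, 1 + i % 5) for i in range(40)]
train, test = split_half(records, seed=42)

# Count how many training rows each frequency degree received.
train_counts = Counter(degree for _, degree in train)
print(len(train), len(test))
print(sorted(train_counts.items()))
```

Printing the per-degree counts makes it easy to check whether a given random draw left any degree with too few training rows.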

Results

The results of the Average Nearest Neighbor index show that the spatial distribution of car accidents is clustered, with less than a 1% likelihood that this pattern is the result of random chance. This means the project could continue and examine the explanatory variables. The density mapping reveals that car accidents are centered in the lower Manhattan and midtown areas (Figure 3-a). This area is a geographic focal point, as it is where the Queensboro Bridge, Queens Midtown Tunnel, and Lincoln Tunnel are located. Figure 3-b shows the density mapped by time of day, segmented into 6-hour periods. The first and last time-of-day density maps show a slightly lower overall car accident density than the other two, reflecting the increased traffic and increased car accidents during commuting times in the city.

The optimized hot spot analysis presents the simple insight that the highways running along the outer edge of the island of Manhattan are areas of few car accidents compared to the streets and avenues of the city itself (Figure 4). The emerging hot spot analysis map (Figure 5) shows a spatial and temporal view of car accidents. The majority of the midtown and lower Manhattan area is classified as an oscillating hot spot, indicating an uneven increase over time. There are a few new hot spot cells at the southern tip of the city, which could be due to extensive revitalization and construction in the area.

The OLS results (Figures 6-a and 6-b) are poor: the AICc value is extremely high, and the R-squared and adjusted R-squared are less than 0.5. We ran the Moran’s I tool, which showed that the residuals are clustered. Thus, this model cannot explain the relationship between the dependent variable and the independent variables. Because OLS could not explain the relationship, we then used GWR (Figure 7-a), which gave a better result than OLS (Figure 7-b). Its R-squared and adjusted R-squared are greater than those of OLS, and its AICc is lower than that of OLS but still high.
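To illustrate what clustered residuals mean, the sketch below computes global Moran's I on a small grid, assuming rook (edge-sharing) neighbors with equal weights. The weighting scheme, the function name, and the example grids are assumptions for illustration and do not reflect the settings used in the Moran’s I tool.

```python
def morans_i_grid(grid):
    """Global Moran's I for a 2-D grid using rook (edge-sharing) neighbors."""
    n_rows, n_cols = len(grid), len(grid[0])
    values = [v for row in grid for v in row]
    mean = sum(values) / len(values)
    cross, weight_sum = 0.0, 0
    for r in range(n_rows):
        for c in range(n_cols):
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < n_rows and 0 <= cc < n_cols:
                    # Each neighboring pair has weight 1; accumulate the cross-product.
                    cross += (grid[r][c] - mean) * (grid[rr][cc] - mean)
                    weight_sum += 1
    denom = sum((v - mean) ** 2 for v in values)
    return (len(values) / weight_sum) * cross / denom

# Hypothetical residual patterns: similar values side by side vs. alternating values.
clustered_grid = [[1, 1, 0, 0] for _ in range(4)]
checkerboard = [[(r + c) % 2 for c in range(4)] for r in range(4)]
clustered_i = morans_i_grid(clustered_grid)
checker_i = morans_i_grid(checkerboard)
print(clustered_i, checker_i)
```

A clearly positive value, as in the first grid, means neighboring cells have similar residuals, which suggests the model is missing a spatially structured effect.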

The SVM result is promising. Both the actual car accident frequency classification map (Figure 8) and the prediction map (Figure 9-a) show high frequency in the downtown area, lower frequency in the uptown area, and 0 frequency outside of Manhattan. However, the prediction map contains more high-frequency areas. In addition, the SVM accuracy results (Figure 9-b) show that the overall accuracy is more than 85%, but the accuracy for individual degrees is not good: degree 0 has 97% accuracy, while the others are much lower than 85%. This is because degree 0 has more training data than the other degrees.
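The gap between overall and per-degree accuracy can be reproduced with a small, made-up example. The labels below are hypothetical and chosen only to show how a dominant class can carry the overall figure; they are not the project's results.

```python
def accuracy_by_class(actual, predicted):
    """Return (overall_accuracy, {class: accuracy within that class})."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    overall = correct / len(actual)
    per_class = {}
    for cls in sorted(set(actual)):
        idx = [i for i, a in enumerate(actual) if a == cls]
        per_class[cls] = sum(predicted[i] == cls for i in idx) / len(idx)
    return overall, per_class

# Hypothetical labels: 90 cells of degree 0 and 10 cells of degree 3.
actual = [0] * 90 + [3] * 10
# The classifier gets 88 of the degree-0 cells right but only 4 of the degree-3 cells.
predicted = [0] * 88 + [3] * 2 + [3] * 4 + [0] * 6
overall, per_class = accuracy_by_class(actual, predicted)
print(overall, per_class)
```

When one degree dominates the table, the overall accuracy mostly reflects that degree, so reporting accuracy for each degree separately gives a fairer picture of the model.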

Conclusion

The results of this project reveal spatial and temporal patterns of car accidents occurring between 2015 and 2017. We concluded that the lower Manhattan and Midtown areas are where the highest density of car accidents occurred during this period. Car accidents in this area are increasing over time, but not at a consistent rate. The explanatory variables were better modeled by GWR than by OLS. The regression revealed that some of our variables explained the car accidents better than others. The SVM shows a high overall accuracy of more than 85% and is able to map the trend of car accidents. Some limitations of our project are that the explanatory variables do not fully explain car accidents, that our data is limited to only 3 years, and that the data may not reflect the actual number of car accidents, as many accidents are never reported. Future work could run the same tools on other attributes in the car accident layer, such as type of car, number of cars, and number of injuries.

Figures and Tables

Title               Source         Data Format       URL
Car accidents data  NYC Open Data  CSV               https://opendata.cityofnewyork.us/
Elevation points    NYC Open Data  Shapefile (.shp)  https://opendata.cityofnewyork.us/
Plazas              NYC Open Data  Shapefile (.shp)  https://opendata.cityofnewyork.us/
Hospitals           NYC Open Data  Shapefile (.shp)  https://opendata.cityofnewyork.us/
Universities        NYC Open Data  Shapefile (.shp)  https://opendata.cityofnewyork.us/
Schools             NYC Open Data  Shapefile (.shp)  https://opendata.cityofnewyork.us/
Bus station         NYC Open Data  Shapefile (.shp)  https://opendata.cityofnewyork.us/
Traffic Volume      GIS.NY.GOV     Shapefile (.shp)  https://gis.ny.gov/gisdata

Table 1 Data sources

Figure 1-a Flowchart

Figure 1-b Flowchart for Regression Analysis

Figure 2 ANN Results

Figure 3-a Car Accidents Kernel Density Map

Figure 3-b Car Accidents Kernel Density Map by Time

Figure 4 Optimized Hot Spot Map

Figure 5 Emerging Hot Spot Map

Figure 6-a OLS Residual Map

Figure 6-b OLS Statistic Results

(count_1: number of roads, count_2: number of hospitals, count_3: number of schools, count_4: number of universities, count_5: number of bus stations, count_6: number of plazas, SUM_AADT: sum of traffic volume, AVG_GRID_C: mean of slope)

Figure 7-a GWR Residual Map

Figure 7-b GWR Statistic Results

Figure 8 Actual Car Accident Frequency Classification Map

Figure 9-a SVM Prediction Car Accident Frequency Classification Map

Figure 9-b SVM accuracy results

References

[1] Tessa K. Anderson, Kernel density estimation and K-means clustering to profile road accident hotspots, Accident Analysis & Prevention, 41 (3) (2009), pp. 359-364.
[2] Gholam Ali Shafabakhsh et al., GIS-based spatial analysis of urban traffic accidents: Case study in Mashhad, Iran, Journal of Traffic and Transportation Engineering, Volume 4, June 2017, pp. 290-299.
[3] Katharine D. Bennett, Spatial Analysis of Motor Vehicle Accidents in Johnson City, Tennessee, as Reported to Washington County Emergency Communications District (911), Electronic Theses and Dissertations, 2010.
[4] Support vector machine. (n.d.). In Wikipedia. Retrieved December 13, 2017, from https://en.wikipedia.org/wiki/Support_vector_machine#Nonlinear_classification
