Car Accident Patterns for Manhattan, NYC: 2015-2017
Yiwen Hu and Nathan Jiwatram
Advanced Vector GIS Final Project Report: December 15, 2017
Abstract
This project examines car accidents using kernel density, the average nearest neighbor index,
emerging hot spot analysis, Ordinary Least Squares (OLS), Geographically Weighted
Regression (GWR) and a Support Vector Machine (SVM). Kernel density mapping shows that car
accidents are concentrated in the downtown areas. The average nearest neighbor index is used to
identify the spatial distribution of the car accidents. The emerging hot spot analysis reveals that
the highways running around the city have far fewer car accidents than the other areas. The
regression analysis using OLS and GWR is used to identify which of the explanatory factors
contribute to car accidents. The SVM allows us to take a sample of car accident data and create
an accurate estimate of the total car accident distribution.
Problem statement
In recent years, the potential benefits of using GIS to present car accident data and
identify the key areas and features where accidents are concentrated have become clear. Valuable
information can be gained this way, as in a study by Katharine D. Bennett, who analyzed
emergency 911 call-for-service records for motor vehicle accidents in Johnson City, Tennessee
and found that twice as many motor vehicle accidents occur near commercial properties as near
residential properties, and that motor vehicle accidents are more likely to occur on arterial
thoroughfares. Approximately 40% of injury accidents happen at roadway intersections, with 22%
occurring at signalized intersections (Bennett, 2010).
Previous studies have used techniques such as regression, density mapping and
hot spot analysis to compare car accidents with features such as road type, neighboring land use
and traffic lights (Anderson, 2009). Other projects have expanded on this, such as one conducted
in Mashhad, Iran that introduced five classifications for determining the eventfulness of car
accidents in the study area (Shafabakhsh et al., 2017). This project explores several
untested spatial variables: hospitals, schools, universities, plazas, slope and traffic levels were
compared with car accident locations. These explanatory variables demonstrate a different
pattern than the variables used in the previous studies and shed light on the placement of these
public areas and their impact on car accidents. Additionally, this project created an accident
prediction tool that requires only a sample of the car accident data.
Data
All of the data was collected from two sources: NYC Open Data and gis.ny.gov. The
traffic data was the only dataset taken from gis.ny.gov, a clearinghouse for both
local and New York State data. The traffic data was available for 2015 and was downloaded as a
vector shapefile of line features. The NYC Open Data website is an aggregation of New York City
data created by the Mayor’s Office of Data Analytics (MODA) and the Department of
Information Technology and Telecommunications (DoITT). From it we collected individual
shapefiles for 2015-2017, including hospitals, schools, bus stations,
hydrants, universities, parking signs, plazas and elevation data for Manhattan, New York
City. The files were in WGS 1984, so the data first had to be projected. The car accident data
was downloaded as a .csv file, and using the latitude
and longitude coordinates we created a shapefile of car accident points. After looking at several
possible projections we decided that NAD 1983 StatePlane New York Long Island FIPS 3104
Feet would provide the most accurate maps for the New York City area. Next we converted the
elevation point data to a raster with the Point to Raster tool, which allowed us to calculate
the slope. We then classified the slope values into 10 classes using natural breaks so that
the data would load faster.
Methodology
To determine the spatial distribution of the car accidents we ran the Average Nearest
Neighbor index, which analyzes feature locations rather than values in the attribute table and
is best suited to point data.
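The calculation behind the index can be sketched as follows. This is a minimal illustration using the standard nearest neighbor ratio and z-score formulas, not the tool's own implementation; the point coordinates and the study area value are hypothetical.

```python
import math

def average_nearest_neighbor(points, area):
    """Return (ratio, z_score) for (x, y) points inside a study area of given size."""
    n = len(points)
    total = 0.0
    for i, (xi, yi) in enumerate(points):
        # Distance from point i to its closest other point (brute force).
        total += min(
            math.hypot(xi - xj, yi - yj)
            for j, (xj, yj) in enumerate(points)
            if j != i
        )
    observed = total / n
    # Expected mean distance and standard error for a random pattern.
    expected = 0.5 / math.sqrt(n / area)
    std_error = 0.26136 / math.sqrt(n * n / area)
    return observed / expected, (observed - expected) / std_error

# Hypothetical toy data: two tight groups of points in a 10 x 10 area.
toy_points = [(1.0, 1.0), (1.1, 1.0), (1.0, 1.1), (1.1, 1.1), (9.0, 9.0), (9.1, 9.0)]
ratio, z_score = average_nearest_neighbor(toy_points, area=100.0)
```

A ratio below 1 with a strongly negative z-score indicates clustering, while a ratio above 1 indicates dispersion.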
To create the density maps we used the Kernel Density tool. This tool calculates a
magnitude per unit area from point features, using a kernel function to fit a smooth surface
to the points. We used a cell size of 45, which was determined by trial, with square feet as
the area unit.
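The idea of a kernel density surface can be sketched in a few lines. This is an illustration only, assuming a quartic kernel and a search radius of 45 units; the report states the cell size but not the radius, and the accident coordinates here are hypothetical.

```python
import math

def quartic_kernel_density(points, center, radius):
    """Density (points per unit area) at one location using a quartic kernel."""
    cx, cy = center
    total = 0.0
    for x, y in points:
        d = math.hypot(x - cx, y - cy)
        if d < radius:
            # Points closer to the center contribute more; beyond the radius, nothing.
            total += (3.0 / math.pi) * (1.0 - (d / radius) ** 2) ** 2
    return total / radius ** 2

def density_grid(points, x_min, y_min, n_cols, n_rows, cell_size, radius):
    """Evaluate the density at the center of every cell of a regular grid."""
    grid = []
    for row in range(n_rows):
        y = y_min + (row + 0.5) * cell_size
        grid.append([
            quartic_kernel_density(points, (x_min + (col + 0.5) * cell_size, y), radius)
            for col in range(n_cols)
        ])
    return grid

# Hypothetical accident locations: three close together, one isolated.
accidents = [(10.0, 10.0), (12.0, 11.0), (11.0, 9.0), (80.0, 80.0)]
grid = density_grid(accidents, 0.0, 0.0, n_cols=2, n_rows=2, cell_size=45.0, radius=45.0)
```

The cell nearest the three clustered points receives the highest density, and cells farther than the radius from every point receive zero.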
The car accident point data was input into the Optimized Hot Spot Analysis tool. Our focus
in using this tool was on the spatial presence of the incidents and not the attributes of each point.
The tool aggregates the car accident points and determines an appropriate scale of analysis. For
the incident data aggregation method we chose “count incidents within fishnet
polygons”. The “bounding polygons defining where incidents are possible” option was set to
our nyc_area polygon, which defines our study area.
To examine the distribution of car accidents further, we created a space time cube so that
changes in the car accidents could be viewed through both a temporal and a spatial lens. We used
a time step of 1 month in order to capture seasonal changes in car accidents. The cube was then
passed to the Emerging Hot Spot Analysis tool for visualization. This tool calculates
the Getis-Ord Gi* statistic for each bin in the space time cube using the neighborhood distance
and neighborhood time step parameters. We used the “Create Fishnet” tool to create a
fishnet grid on which to perform the hot spot analysis, OLS, GWR and SVM. We determined through
trial that a grid of 60 columns and 80 rows, with cells of 200 meters by 250
meters, was the best way to make the patterns in the data visible.
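The Getis-Ord Gi* statistic can be sketched for a single time slice as follows. This is a simplified illustration, assuming binary weights and a hypothetical strip of six cells in which each cell's neighborhood is itself plus its adjacent cells; the actual tool also includes neighbors in time.

```python
import math

def getis_ord_gi_star(values, neighbors):
    """Gi* z-score per cell with binary weights.

    neighbors[i] lists the indices in cell i's neighborhood, including i itself.
    """
    n = len(values)
    mean = sum(values) / n
    s = math.sqrt(sum(v * v for v in values) / n - mean * mean)
    scores = []
    for i in range(n):
        w_sum = len(neighbors[i])     # sum of binary weights
        w_sq_sum = len(neighbors[i])  # sum of squared binary weights (1 * 1 each)
        local_sum = sum(values[j] for j in neighbors[i])
        numerator = local_sum - mean * w_sum
        denominator = s * math.sqrt((n * w_sq_sum - w_sum ** 2) / (n - 1))
        scores.append(numerator / denominator)
    return scores

# Hypothetical accident counts along a strip of six fishnet cells.
counts = [9, 8, 7, 1, 0, 1]
neighborhoods = [[0, 1], [0, 1, 2], [1, 2, 3], [2, 3, 4], [3, 4, 5], [4, 5]]
scores = getis_ord_gi_star(counts, neighborhoods)
```

Large positive scores mark hot spots (high counts surrounded by high counts) and large negative scores mark cold spots.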
To identify the contributing factors of car accidents, OLS and GWR were used to
model the car accidents. The independent variables were traffic volume, roads, hospitals,
schools, universities, plazas, bus stations and slope. We chose these factors because
we think there are more cars and people near hospitals, schools, universities and plazas,
and car accidents are therefore more likely there. For the hospitals, schools, universities and
plazas we created 100-meter buffers, as we think those features influence their
surrounding areas. The dependent variable was the car accidents. For the roads, hospitals,
schools, universities, plazas and bus stations, we counted the features in each fishnet
cell; for the traffic volume, we summed the values in each fishnet cell; for the slope we
calculated the mean value in each fishnet cell.
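The per-cell aggregation described above (count, sum and mean) can be sketched as follows. This is an illustration under simplifying assumptions: every feature is reduced to a single (x, y, value) point, whereas the traffic data in the project is line data, and all coordinates and values are hypothetical.

```python
from collections import defaultdict

def cell_index(x, y, x_min, y_min, cell_w, cell_h, n_cols, n_rows):
    """Return the fishnet cell number for a point, or None if it falls outside."""
    col = int((x - x_min) // cell_w)
    row = int((y - y_min) // cell_h)
    if 0 <= col < n_cols and 0 <= row < n_rows:
        return row * n_cols + col
    return None

def aggregate(features, how, **grid):
    """Aggregate (x, y, value) features per cell using 'count', 'sum' or 'mean'."""
    buckets = defaultdict(list)
    for x, y, value in features:
        idx = cell_index(x, y, **grid)
        if idx is not None:
            buckets[idx].append(value)
    result = {}
    for idx in range(grid["n_cols"] * grid["n_rows"]):
        vals = buckets.get(idx, [])
        if how == "count":
            result[idx] = len(vals)
        elif how == "sum":
            result[idx] = sum(vals)
        elif how == "mean":
            result[idx] = sum(vals) / len(vals) if vals else None
        else:
            raise ValueError("unknown aggregation: " + how)
    return result

# Hypothetical 2 x 2 fishnet with 200 m by 250 m cells.
grid = dict(x_min=0.0, y_min=0.0, cell_w=200.0, cell_h=250.0, n_cols=2, n_rows=2)
schools = [(50.0, 50.0, 1), (150.0, 100.0, 1), (350.0, 400.0, 1)]
traffic = [(50.0, 50.0, 1200.0), (60.0, 60.0, 800.0), (250.0, 100.0, 500.0)]
slope = [(10.0, 10.0, 2.0), (20.0, 20.0, 4.0)]

school_count = aggregate(schools, "count", **grid)
traffic_sum = aggregate(traffic, "sum", **grid)
slope_mean = aggregate(slope, "mean", **grid)
```

Each resulting dictionary maps a cell number to one attribute, which together form one row of explanatory variables per fishnet cell.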
Lastly, we used a Support Vector Machine (SVM), a supervised machine learning
algorithm, to predict the car accident frequency degree. An SVM can efficiently perform
non-linear classification using the kernel trick, implicitly mapping its inputs into
high-dimensional feature spaces. Given a set of training examples, each
marked as belonging to one or the other of two categories, an SVM training algorithm builds a
model that assigns new examples to one category or the other, making it a non-probabilistic
binary linear classifier (Wikipedia, n.d.). Yiwen developed an SVM package for raster image
classification in an R geospatial analysis class, and we adapted it for car accident
prediction. The default parameters for the SVM are a radial kernel, gamma of 1 and cost of 1;
we changed gamma to 0.14 (gamma = 1 / (number of independent variables)) and cost to
100 (we tried 1, 10, 50 and 100 for cost, and 100 gave the best results). For the SVM, we used
the same independent variables as for OLS and classified the car accident frequency into 6
degrees (0-5). We then integrated them into one table, so that each fishnet cell has the
attributes of the independent variables and the car accident frequency degree. After that, we
randomly selected 50% of the rows in the table as training data to build the SVM model.
Normally an SVM does not need 50% of the data for training, but more than half of our cells
have a frequency degree of 0, so in order to get enough training data for all degrees we had to
enlarge the training dataset. An alternative is to select the training data manually rather
than randomly so that each degree has enough training data, but the results are then easily
influenced by the analyst's selection. Once we have the SVM model, we can input the independent
variables to predict the car accident frequency.
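Two ingredients of this setup, the radial kernel and the random 50% split, can be sketched as follows. This is not the R package used in the project and it does not train an SVM; the table of 100 cells with seven features each is hypothetical, chosen so that gamma = 1 / (number of features) comes to about 0.14.

```python
import math
import random
from collections import Counter

def rbf_kernel(x, z, gamma):
    """Radial basis function kernel: exp(-gamma * ||x - z||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)

def random_half_split(rows, seed=0):
    """Shuffle a copy of the rows and cut it in half (training, held-out)."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

# Hypothetical table: 100 fishnet cells, 7 made-up features each,
# with frequency degree 0 dominating as described in the text.
degrees = [0] * 60 + [1] * 8 + [2] * 8 + [3] * 8 + [4] * 8 + [5] * 8
feature_rng = random.Random(42)
rows = [([feature_rng.random() for _ in range(7)], d) for d in degrees]

gamma = 1.0 / 7  # roughly 0.14
train, held_out = random_half_split(rows)
train_counts = Counter(degree for _, degree in train)
similarity = rbf_kernel(rows[0][0], rows[1][0], gamma)
```

Inspecting train_counts shows how few examples each non-zero degree receives under a purely random split, which is the motivation given above for using a large training share.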
Results
The results of the Average Nearest Neighbor index show that the car accident spatial
distribution is clustered and that there is less than a 1% chance that this pattern is the
result of random chance. This means the
project could continue and examine the explanatory variables. The density mapping reveals that
the car accidents are centered in the lower Manhattan and Midtown areas of Manhattan. This area
is a geographic focal point, as it is where the Queensboro Bridge, Queens-Midtown Tunnel
and Lincoln Tunnel are located (Figure 3-a). Figure 3-b shows the density mapped by time of day,
segmented into 6-hour sections. The first and last time-of-day density maps show a slightly lower
overall car accident density than the other two, reflecting the increased traffic and
car accidents during commuting times in the city.
The optimized hot spot analysis presents the simple insight that the highways running
along the outside of the island of Manhattan are areas of few car accidents compared to the
streets and avenues of the city itself (Figure 4). The emerging hot spot analysis map (Figure 5)
shows a spatial and temporal view of car accidents. The majority of the Midtown and lower
Manhattan area is an oscillating hot spot, indicating an uneven increase over time. There are a
few new hot spot cells at the southern tip of the city, which could be due to extensive
revitalization and construction in the area.
The results of the OLS (Figure 6-a) are poor (Figure 6-b): the AICc value is extremely
high, and the R-squared and adjusted R-squared are less than 0.5. We ran the Moran’s I tool,
which showed that the residuals are clustered. Thus, this model cannot explain the relationship
between the dependent variable and the independent variables. Because OLS could not explain the
relationship, we then used GWR (Figure 7-a), whose result is better than that of OLS (Figure
7-b). The R-squared and adjusted R-squared are greater than for OLS. The AICc is lower than for
OLS, but it is still high.
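The residual check with Moran’s I can be illustrated with a small sketch. It assumes binary adjacency weights on a hypothetical strip of six cells, and the residual values are invented for illustration; they are not the project's residuals.

```python
def morans_i(values, weights):
    """Global Moran's I; weights maps (i, j) pairs with i != j to w_ij."""
    n = len(values)
    mean = sum(values) / n
    deviations = [v - mean for v in values]
    weight_total = sum(weights.values())
    cross_products = sum(
        w * deviations[i] * deviations[j] for (i, j), w in weights.items()
    )
    return (n / weight_total) * cross_products / sum(d * d for d in deviations)

# Binary weights between adjacent cells of a six-cell strip (both directions).
adjacency = {}
for i in range(5):
    adjacency[(i, i + 1)] = 1.0
    adjacency[(i + 1, i)] = 1.0

# Hypothetical residuals: similar values side by side vs. alternating values.
clustered_i = morans_i([3.0, 2.0, 2.0, -2.0, -2.0, -3.0], adjacency)
alternating_i = morans_i([3.0, -3.0, 3.0, -3.0, 3.0, -3.0], adjacency)
```

A clearly positive value, as for the first set of residuals, means that nearby cells have similar residuals, which suggests the regression is missing a spatially structured explanatory factor.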
The result of the SVM is promising. The actual car accident frequency classification
map (Figure 8) and the prediction map (Figure 9-a) both show high
frequency in the downtown area, lower frequency in the uptown area, and 0 frequency outside the
Manhattan area. However, there are more high-frequency areas in the prediction map. In addition,
the SVM accuracy results (Figure 9-b) show that the overall accuracy is more than 85%, but the
accuracy for individual degrees is not good: degree 0 has 97% accuracy, while the others are
much lower than 85%. This is because degree 0 has more training data than the others.
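The gap between overall and per-degree accuracy follows directly from the class imbalance, as the sketch below illustrates. The labels are hypothetical and use only three degrees for brevity; they are not the project's results.

```python
from collections import Counter

def accuracy_report(actual, predicted):
    """Return overall accuracy and a dict of accuracy per actual class."""
    totals = Counter(actual)
    correct = Counter(a for a, p in zip(actual, predicted) if a == p)
    overall = sum(correct.values()) / len(actual)
    per_class = {label: correct[label] / totals[label] for label in sorted(totals)}
    return overall, per_class

# Hypothetical labels: 80 cells of degree 0, 10 of degree 1, 10 of degree 2.
actual = [0] * 80 + [1] * 10 + [2] * 10
predicted = (
    [0] * 78 + [1] * 2      # degree 0: 78 of 80 correct
    + [1] * 4 + [0] * 6     # degree 1: 4 of 10 correct
    + [2] * 4 + [0] * 6     # degree 2: 4 of 10 correct
)
overall, per_class = accuracy_report(actual, predicted)
```

Because most cells belong to degree 0, the overall figure is dominated by that class even when the rarer degrees are predicted poorly.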
Conclusion
The results of this project reveal spatial and temporal patterns of car accidents occurring
between 2015 and 2017. We concluded that the lower Manhattan and Midtown areas are where the
highest density of car accidents occurred during the time period. Car accidents in that area
are increasing over time, but not at a consistent rate. The explanatory
variables were better modeled by GWR than by OLS. The regression revealed
that some of our variables were able to explain the car accidents better than others. The SVM
shows a high overall accuracy of 85% and is able to map the trend of car accidents. Some of the
limitations of our project are that the explanatory variables do not fully explain car accidents,
and that our data is limited to only 3 years and may not reflect the actual number of
car accidents, as many accidents are never reported. Future work could
run the same tools on the different attributes in the car accident layer, such as
type of car, number of cars and number of injuries.
Figures and Tables
Title                 Source           Data Format         URL
Car accidents data    NYC Open Data    CSV (.csv)          https://opendata.cityofnewyork.us/
Elevation points      NYC Open Data    Shapefile (.shp)    https://opendata.cityofnewyork.us/
Plazas                NYC Open Data    Shapefile (.shp)    https://opendata.cityofnewyork.us/
Hospitals             NYC Open Data    Shapefile (.shp)    https://opendata.cityofnewyork.us/
Universities          NYC Open Data    Shapefile (.shp)    https://opendata.cityofnewyork.us/
Schools               NYC Open Data    Shapefile (.shp)    https://opendata.cityofnewyork.us/
Bus station           NYC Open Data    Shapefile (.shp)    https://opendata.cityofnewyork.us/
Traffic Volume        GIS.NY.GOV       Shapefile (.shp)    https://gis.ny.gov/gisdata
Table 1 Data sources
Figure 1-a Flowchart
Figure 1-b Flowchart for Regression Analysis
Figure 2 ANN Results
Figure 3-a Car Accidents Kernel Density Map
Figure 3-b Car Accidents Kernel Density Map by Time
Figure 4 Optimized Hot Spot Map
Figure 5 Emerging Hot Spot Map
Figure 6-a OLS Residual Map
Figure 6-b OLS Statistic Results
(count _1: number of roads, count _2: number of hospitals, count _3: number of schools, count _4: number of universities, count _5: number of bus stations, count _6: number of
plazas, SUM_AADT: sum of traffic volume, AVG_GRID_C: mean of slope)
Figure 7-a GWR Residual Map
Figure 7-b GWR Statistic Results
Figure 8 Actual Car Accident Frequency Classification Map
Figure 9-a SVM Prediction Car Accident Frequency Classification Map
Figure 9-b SVM accuracy results
References
[1] Tessa K. Anderson, Kernel density estimation and K-means clustering to profile road accident hotspots, Accident Analysis & Prevention, 41 (3) (2009), pp. 359-364
[2] Gholam Ali Shafabakhsh et al., GIS-based spatial analysis of urban traffic accidents: Case study in Mashhad, Iran, Journal of Traffic and Transportation Engineering, Volume 4, June 2017, pp. 290-299
[3] Katharine D. Bennett, Spatial Analysis of Motor Vehicle Accidents in Johnson City, Tennessee, as Reported to Washington County Emergency Communications District (911), Electronic Theses and Dissertations, 2010
[4] Support vector machine. (n.d.). In Wikipedia. Retrieved December 13, 2017, from https://en.wikipedia.org/wiki/Support_vector_machine#Nonlinear_classification