
George Mason University – SEOR Department

OR/SYST 699 MS Capstone Project – Spring 2018

Bayesian Forecast for Transit Demand

Final Report

Tuan Le

Acknowledgement

I would like to sincerely thank Dr. Vadim Sokolov for his advice and support over the years. He introduced me to the origin-destination demand matrix estimation problem, taught me the power of being a Bayesian (rather than a frequentist!), and gave me chances to present my work at big conferences like INFORMS 2017 (Houston, TX). He has been a good advisor on all occasions, and I am very grateful for his help in defining and structuring the research work. He also guided me through the process of scientific writing and of presenting my work to both technical and non-technical audiences. He offered his personal time to advise me on the research direction of this project and helped guide me through the darkest days of my PhD life. I also want to thank him for his helpful feedback on this report, which improved it substantially.

I also want to thank Dr. Hubert Ley, Director of the Transportation Research and Analysis Computing Center (TRACC) at Argonne National Laboratory, and Mr. James Garner, Manager of the Research and Analysis Department at Pace Suburban Bus, for taking the time to discuss thoroughly multiple issues with the datasets used in our analysis, and for kindly sharing with us how PACE’s bus operations systems work.

Next, I would like to thank Dr. Kathryn Laskey for her helpful feedback on this report and the presentation throughout the course. That feedback improved the quality of this paper substantially.

Finally, I would like to express my greatest gratitude to my family and my friends. I have appreciated their encouragement and understanding, especially in situations where I did not manage to fully explain the problems troubling me.


Contents

1 Executive Summary

2 Introduction
2.1 Project Background
2.2 Motivation and Problem Statement
2.3 Objectives, Scope and Deliverables
2.4 Model Formulation, Assumptions and Limitations

3 Exploratory Data Analysis
3.1 APC Dataset
3.2 VENTRA Fare Card Dataset

4 Bayesian Model
4.1 Introduction about Stan - A probabilistic Programming Language
4.2 Normal - Normal (prior - likelihood)
4.3 Poisson - Normal (prior - likelihood)
4.4 Poisson - Poisson (prior - likelihood)
4.5 Normal - Poisson (prior - likelihood)
4.6 Bayesian Sensitivity Analysis
4.7 Summary of Results

5 Future Work

A Appendix - Literature Research Summary
A.0.1 Gaussians mixture generative model
A.0.2 Single-level time dependent path flow estimation model
A.0.3 Bayesian modeling for large-scale dynamic network flow
A.0.4 Model Mapping for Bayesian Emulations of Dynamic Gravity Models (DGMs) by BDFMs
A.0.5 Bayesian Inference on network traffic using link count data
A.1 Use-Case Problem
A.1.1 Routing Matrix Construction Heuristic Algorithm
A.1.2 Kalman-Filter - Analytical Solution
A.1.3 MCMC Simulation - Numerical Solution
A.1.4 Historical Demand (dh) is available
A.1.5 Historical Demand (dh) is unknown

B Heatmap Analysis for APC and Ventra dataset
B.1 Differences between APC and Ventra datasets

C Appendix 1 - Figures for Exploratory Data Analysis on APC dataset

D Figures for Exploratory Data Analysis on Ventra dataset

E Figures for comparing between APC versus Ventra

F Figures for Hierarchical Bayesian Model

G Earned Value Management Chart


1 Executive Summary

Information on the origin-destination (OD) matrix of a transport network is a fundamental requirement in much of transportation planning. It is a core topic in transportation research and has received a lot of attention in the past two or three decades. Consider a transport network with a number of OD nodes connected through directed links. An OD matrix consists of traffic counts from all origins to all destinations. Historically, trips have been estimated through roadside interviews, number plate surveys, etc., which are expensive in terms of manpower requirements and disruptions of traffic flows. Another way is to estimate an OD matrix using a single observation of traffic flows on a specific set of network links. Its lower cost, and the fact that the same link counts can be used for several purposes (accident studies, maintenance planning, etc.), make this approach very attractive for inference about OD matrices. However, there are three major challenges for inference on an OD matrix from a single observation of traffic flows on a specific set of network links. First, this is an underspecified problem: the number of links on which traffic volumes are measured is typically much smaller than the number of unknown parameters of interest. Thus, these unknown parameters cannot be uniquely determined from the collected data alone. Second, based on previous research in transportation, traffic volumes measured on the monitored network links have multivariate Poisson distributions (likelihood) and multivariate negative binomial distributions (marginal distributions). These multivariate distributions are analytically intractable. Finally, since the dimensions of transport networks are extremely high in most applications, computational cost is always a major issue for research in this area.

On the other hand, a relatively inexpensive method to update an OD matrix is to draw inference about it based on a single observation of traffic flows on a specific set of network links, where the Bayesian approach is a natural choice to combine prior knowledge about the OD matrix with the current observation of traffic flows. Indeed, extensive research has been devoted to the estimation of OD matrices in this direction. Most notably, [10] investigated Bayesian inference using MCMC simulation that combines Metropolis-Hastings steps within an overall Gibbs sampling framework, and [8] investigated an application to a particular region of Leicester via Bayesian inference using MCMC.

Recognizing the power of Bayesian statistics for estimating population means of traffic flows, reconstructing traffic flows, and making predictions, Professor Vadim Sokolov of the Systems Engineering and Operations Research department at George Mason University, who is the main sponsor of this project, introduced me to a project on developing a statistical model that uses Bayesian statistics to forecast transit demand for PACE’s buses (PACE is the largest suburban bus service in the Chicago area). In this project, we provide a detailed exploratory analysis of a very large dataset of approximately 69.4 million rows collected by PACE’s buses using Automated Vehicle Location technology, as well as the 1-month (October 2015) dataset collected from Ventra card swipes, to examine the travel patterns of people using PACE’s public buses (both inbound and outbound).

In addition, we apply Bayesian statistics together with the Markov Chain Monte Carlo sampling technique to determine the distribution of the demand in chosen time periods (for example, one day or one month), given the available (but noisy) data on historical demand and the currently observed number of passengers getting on/off at each bus stop. We test the accuracy of our models and their corresponding assumptions on the prior and likelihood distributions against the analytical result given by the Kalman filter, and provide recommendations to PACE’s managers on the best assumptions to follow. The output of this project is intended as an operations support tool to aid the managers at PACE, who are also sponsors of this project, in making better decisions when coordinating transit response planning in real time (especially in all-hazard emergency events), mitigating potential harm to city riders, minimizing unwanted delays in PACE’s bus system (and potentially in other similar systems such as metro, railway or paratransit), and improving their service’s on-time performance.


2 Introduction

2.1 Project Background

The notion of mobility was first introduced around 2009, and since then it has gradually incorporated individual practices and lifestyles into the analysis of transport demand, which made it necessary to redefine the meaning of the term. The aim of technically optimizing the straightforward spatial movement of goods and individuals (i.e., planning, flow, traffic, vehicle technology, etc.) has been supplemented, or even replaced, by the objective of obtaining a detailed understanding of the variation in individuals’ ability to travel (accessibility), individuals’ experience of daily travel conditions (comfort, sustainability), and even the role that mobility plays in individuals’ lifestyles in terms of both actual and possible interactions. Consequently, mobility is now studied by economists, sociologists, urban planners, geographers, and data scientists.

Traditionally, the analysis of mobility is based on travel surveys (origin-destination surveys, household travel surveys (HTSs)). However, these surveys tend to be expensive and consequently are undertaken fairly infrequently, which means that current developments, and the public policies that aim to influence them, are not closely monitored. In recent years, there has been increased interest in using completely anonymous data from real-time smart card collection systems to better understand the behavioral habits of public transport passengers. The use of smart card data to generate insights into passengers’ travel practices and to identify or predict travel patterns has become a very active research area. In particular, the problems of making inference on arrival times and of modeling dynamic, real-time origin-destination (OD) traffic to estimate the demand flow of a bus network, based on either link count or smart card data, have been covered extensively in the literature over the last twenty years. Novel methods for modeling and analyzing the dynamic OD demand flow of a large-scale public bus network include: passenger clustering with a Gaussian mixture generative model using smart card data over a five-year span, taking into account the continuous representation of time and the usage habits of passengers and their behavioral changes over time (see [2]); the Bayesian approach with Markov Chain Monte Carlo (MCMC) simulation methods ([10] and [3]); and a single-level time-dependent path flow estimation model with constraints on traffic flow dynamics and updated states (see [5]). Due to the inherent structural features of our problem of inference about OD demand flow during disruptive events such as flooding, tornadoes, blizzards, and man-made emergencies, we present a Bayesian model and method for effectively analyzing transit data.

Our objective for this project includes four goals. First, I want to develop a statistical model that incorporates prior knowledge about the historical demand and the noise in the number of on/off-boarding passengers (ON/OFF count) at each stop to estimate demand between individual (or groups of) bus stops. Second, based on our estimated demand, I predict the crowded zone-level destinations of riders. Third, I obtain the estimated population means of traffic flows between any two zones. Finally, I analyze and document riders’ travel patterns based on two datasets (APC and Ventra).

The remainder of this paper is organized as follows. Section 2 presents the background of this project, explains why PACE’s managers care about the result of our project, and gives a detailed description of our problem. The objectives, scope and results expected to be delivered to our sponsor are also described in detail within this section. Section 3 provides my insights into riders’ travel patterns from exploratory data analysis on the two datasets: APC (whole year 2015, October 2016) and Ventra (March 2016). Section 4 starts with a brief description of our approach to recovering the historical demand and a detailed description of Stan (including what Stan does, how models are specified and how inference in Stan works). Next, I provide the simulation results for the four pairs of prior - likelihood: Normal - Normal, Poisson - Normal, Poisson - Poisson, Normal - Poisson. I then analyze the results with comparison plots and two metrics (mean-squared error (MSE) and mean absolute percentage error (MAPE)) to determine the optimal prior-likelihood pair (meaning the pair that provides the best estimated demand when compared with the simulated true demand). I then conduct a Bayesian sensitivity analysis to examine whether our Bayesian model is robust with respect to different distributions assigned to the prior, and whether the result of our optimal prior-likelihood pair is robust with respect to the variance parameter of our prior. Finally, Section 5 provides some plausible research directions for anyone who is interested in exploring this topic further.

2.2 Motivation and Problem Statement

Since this project stems from my sponsor’s direct collaboration with PACE, I would like to give a brief overview of PACE. Headquartered in Arlington Heights, Illinois, PACE is the suburban bus division of the RTA, which oversees several Chicago-area public transit organizations such as the Chicago Transit Authority (CTA) and Metra (a suburban railway network). Besides its most popular fixed-route bus service, which handles 32.0M trips annually, PACE also offers other transportation services such as ad-hoc demand-response (1.5M trips/year), vanpool commuter services (1.5M trips/year), and ADA paratransit services (3.7M trips/year). PACE also has 9 garages containing hundreds of buses, vans and other owned vehicles across the six counties of IL, and several other major counties of Chicago and Indiana. It covers a total service area of 3,500 square miles, with 5.6 million people in the service area outside of Chicago (which has another 3 million people). Because the number of riders on PACE’s buses is increasing annually, the managers at PACE want to improve their fleet allocation (that is, re-routing and re-scheduling their buses, given that they only have a limited number of buses). To do this effectively, they want a better demand forecasting tool to predict future traffic flows. This tool is expected to incorporate their prior knowledge of riders’ historical demand and to capture the uncertainty in the ON/OFF counts data in its forecasting process. Since the origin-destination (OD) matrix is the most fundamental representation of demand between bus stops (or groups of bus stops), PACE’s managers would like us to come up with a statistical model that estimates this matrix accurately. Our estimated OD matrix would be used by PACE’s managers as a fundamental input for improving POLARIS, the integrated traffic simulation model that supports PACE’s managers in re-routing and re-scheduling buses to meet riders’ demand.

Our approach is inspired by the model used in [10], in which the authors solved a similar problem dealing with Internet network traffic. However, the authors made the major assumption that the collected data is perfectly accurate, which is unrealistic. My advisor, Dr. Sokolov, who is an expert in using Bayesian statistics to solve transportation problems, proposed that one plausible approach to this problem is to use Bayes’s formula to incorporate the prior knowledge of the historical demand and to capture the uncertainty in the ON/OFF counts data. A Bayesian model is therefore a natural choice for tackling the problem at hand. In addition, with this approach, after obtaining the estimated demand and its distribution, we can also predict the crowded zone-level destinations and obtain the estimated population means of traffic flows. We hope that our work motivates the use of Bayesian statistics in modeling large-scale transportation networks.

2.3 Objectives, Scope and Deliverables

To achieve the goal of PACE’s managers, the objective of this project includes four major points:

• Develop a statistical model to estimate the OD matrix based on the two historical datasets (APC and Ventra) obtained from PACE.

• Estimate population means of traffic flows, based on the estimated results by our statistical model.

• Predict the crowded "zone-level" destinations of riders based on the estimated results.

• Analyze and document riders’ travel patterns based on the APC and Ventra datasets.

The scope of this project includes performing exploratory data analysis on the APC dataset, which is available for the whole year of 2015 and October 2016, and the Ventra dataset, which is only available for March 2016, and providing insights into riders’ travel patterns at different time scales. Furthermore, our Bayesian model for estimating demand is considered only with two specific classes of distributions assigned to the prior and likelihood, which gives four pairs: Normal - Normal, Poisson - Normal, Poisson - Poisson, and Normal - Poisson. The reasons for choosing these pairs are two-fold. First, we are inspired by the positive results of the previous research work in [10]. Second, an analytical solution exists in one of these cases (Normal - Normal), which is helpful for checking the performance of our numerical solution in that case.

Finally, the deliverables of this project include a detailed exploratory data analysis of riders’ travel patterns, a detailed introduction to Stan and its functionality, and the simulation results for each of the four possible prior-likelihood pairs, which support a recommendation to PACE’s managers on the optimal prior-likelihood pair for our Bayesian model (i.e., the pair whose estimated demand is closest to the simulated true demand over a time period of 365 days). A sensitivity analysis of the result given by that optimal pair, with respect to the choice of prior and to the variance parameter (which represents noise in the counts data) of a specific distribution assigned to the prior, is also provided.

2.4 Model Formulation, Assumptions and Limitations

As our Bayesian model contains a routing matrix as one of its parameters, I will introduce the "routing matrix" in the context of our problem through a simple example. Assume we have a bus network with only three stops, A, B and C, and exactly three routes. Assume we only have data on the number of people getting on at stop A and getting off at stop C (see Figure 1). For this three-node network, the column labels denote all feasible routes in the following order: AB, BC, AC, and the two row labels are $A_{in}$ and $C_{out}$. In the first row, the (1,1) and (1,3) entries are 1 because the first row denotes all possible destinations of people getting on at stop A (they get off at either stop B or stop C), which corresponds to columns AB and AC. The (1,2) entry is 0 because BC is irrelevant here: the row $A_{in}$ covers only routes starting from A. Similarly, for $C_{out}$, the (2,2) and (2,3) entries are 1 because those are all possible paths for riders getting off at stop C (they start from either B or A, which corresponds to columns BC and AC). The first entry in the same row, which corresponds to column AB, is 0 because AB is irrelevant in this case.

Figure 1: Three-node network


Figure 2: Routing Matrix
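The three-stop example above can be sketched in a few lines of Python. This is a minimal illustration, not the construction heuristic used in the report: the helper names `routing_matrix` and `expected_counts` and the demand values are hypothetical, and only the row and column ordering (rows $A_{in}$, $C_{out}$; columns AB, BC, AC) comes from the text.

```python
def routing_matrix(od_pairs, observations):
    """Return a 0/1 matrix: one row per observed count, one column per OD pair.

    od_pairs: list of (origin, destination) tuples, in column order.
    observations: list of (stop, kind) tuples, where kind is "in"
        (boardings at the stop) or "out" (alightings at the stop).
    """
    matrix = []
    for stop, kind in observations:
        if kind == "in":
            # A boarding count at `stop` sums every OD pair that starts there.
            row = [1 if origin == stop else 0 for origin, _ in od_pairs]
        elif kind == "out":
            # An alighting count at `stop` sums every OD pair that ends there.
            row = [1 if dest == stop else 0 for _, dest in od_pairs]
        else:
            raise ValueError("kind must be 'in' or 'out'")
        matrix.append(row)
    return matrix


def expected_counts(matrix, demand):
    """Compute x = A d for a routing matrix A and a demand vector d."""
    return [sum(a * d for a, d in zip(row, demand)) for row in matrix]


# Column order AB, BC, AC and row order A_in, C_out, as in the text.
od_pairs = [("A", "B"), ("B", "C"), ("A", "C")]
observations = [("A", "in"), ("C", "out")]
A = routing_matrix(od_pairs, observations)

# Hypothetical demand on AB, BC, AC (illustrative values only).
demand = [4, 3, 5]
x = expected_counts(A, demand)
print(A)  # [[1, 0, 1], [0, 1, 1]]
print(x)  # [9, 8]
```

With two observed counts and three unknown OD flows, many demand vectors produce the same counts, which is the underdetermination that motivates using a prior.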

We can extend the above argument to networks with more than three nodes; the general form of the routing matrix is as follows:

Figure 3: General Form of Routing Matrix

The probabilistic graphical model (Figure 4) below indicates the relationship between our given data (extracted from the APC dataset), our demand variable $d$, and the currently observed ON/OFF counts ($x$). From the relationship between the data layer and the demand variable $d$ (given $d_h$, we can make inference on possible values of $d$), and between $d$ and the ON/OFF counts $x$ (given $d$, we can make inference on possible values of $x$), we denote the prior and likelihood distributions as $P(d \mid d_h)$ and $P(x \mid d)$, with the routing matrix $A$ treated as given data. Our objective is to obtain $P(d \mid d_h, x)$, the posterior distribution of $d$ given $d_h$ and $x$.

Inspired by the success of the work by West and Tebaldi ([10]), the fact that the ON/OFF counts data are discrete, and the technical validity of Bayes’s formula, I make the following three assumptions:

• Only two classes of distributions, specifically Poisson and Normal, are considered for the prior and likelihood.

• Counts data ($x$) and historical demand ($d_h$) are conditionally independent given demand ($d$).

• The traffic flows of any two routes are independent.

Under these assumptions, by applying Bayes’s formula twice, we obtain

$$P(d \mid d_h, x) \propto P(x \mid d)\, P(d \mid d_h) \qquad (1)$$
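One way to arrive at equation (1) is sketched below. It is a reconstruction under a stated reading of the assumptions, not a derivation taken from the report: the conditional-independence assumption is read as "$x$ is independent of $d_h$ given $d$", and the routing matrix $A$ is treated as fixed and omitted from the conditioning.

```latex
\begin{align*}
P(d \mid d_h, x)
  &= \frac{P(x \mid d, d_h)\, P(d \mid d_h)}{P(x \mid d_h)}
     && \text{Bayes's formula, conditioning on } d_h \\
  &= \frac{P(x \mid d)\, P(d \mid d_h)}{P(x \mid d_h)}
     && \text{since } x \perp d_h \mid d \\
  &\propto P(x \mid d)\, P(d \mid d_h)
     && \text{since } P(x \mid d_h) \text{ does not depend on } d
\end{align*}
```

The denominator is only a normalizing constant in $d$, which is why a sampler needs the product of likelihood and prior but never the marginal $P(x \mid d_h)$.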


Figure 4: Relationship between variables $d$, $x$ and data $A$, $d_h$

The major difficulty in evaluating the posterior distribution $P(d \mid d_h, x)$ in equation (1) is that no analytical formula exists for most prior-likelihood pairs (with the exception of Normal - Normal, in which case we can use the Kalman filter to obtain a Normal posterior with closed-form formulas for its mean and variance). However, if we only consider Normal - Normal for the prior-likelihood, we limit the flexibility and suitability of our model with respect to the given APC dataset. Therefore, in most cases, Markov Chain Monte Carlo (MCMC) simulation is a powerful technique for obtaining the posterior distribution of $d \mid d_h, x$, because it can deal with any distributions assigned to the prior and likelihood. Now, by noting that $x = Ad$, we can model the likelihood as $x \mid d \sim P_1(Ad, \sigma_1)$, where $P_1$ is either Normal or Poisson, and $\sigma_1$ is a vector accounting for the noise in the estimated demand $d$. Similarly, based on the methods used to collect the counts data in the APC dataset, the historical demand $d_h$ most likely underestimates the true demand. Therefore, our prior can be modeled as $d \mid d_h \sim P_2(d_h, \sigma_2)$, where $P_2$ is also Poisson or Normal, and $\sigma_2$ is a vector representing the difference between our historical demand $d_h$ and the true demand.
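For the Normal - Normal case, the closed-form posterior can be illustrated with the standard linear-Gaussian (Kalman filter) update. The sketch below is written under assumptions that are not stated in the text: the noise vectors are treated as independent per-component variances (diagonal covariances), there are exactly two observed counts (so a 2x2 inverse suffices), and all numeric values are hypothetical. The function name `normal_normal_posterior` is illustrative.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(col) for col in zip(*a)]


def diag(values):
    n = len(values)
    return [[float(values[i]) if i == j else 0.0 for j in range(n)]
            for i in range(n)]


def inverse_2x2(m):
    (p, q), (r, s) = m
    det = p * s - q * r
    if det == 0:
        raise ValueError("matrix is singular")
    return [[s / det, -q / det], [-r / det, p / det]]


def normal_normal_posterior(A, d_h, prior_var, x, noise_var):
    """Posterior mean and covariance of d for
    d ~ Normal(d_h, diag(prior_var)) and x | d ~ Normal(A d, diag(noise_var)).

    Assumes exactly two observed counts so that a 2x2 inverse is enough.
    """
    P = diag(prior_var)                        # prior covariance (n x n)
    R = diag(noise_var)                        # observation noise (2 x 2)
    PAt = matmul(P, transpose(A))              # n x 2
    S = matmul(A, PAt)                         # A P A^T
    S = [[S[i][j] + R[i][j] for j in range(2)] for i in range(2)]
    K = matmul(PAt, inverse_2x2(S))            # Kalman gain (n x 2)
    predicted = [sum(a * d for a, d in zip(row, d_h)) for row in A]
    innovation = [xi - pi for xi, pi in zip(x, predicted)]
    mean = [d_h[j] + sum(K[j][i] * innovation[i] for i in range(2))
            for j in range(len(d_h))]
    KA = matmul(K, A)
    n = len(d_h)
    I_minus_KA = [[(1.0 if i == j else 0.0) - KA[i][j] for j in range(n)]
                  for i in range(n)]
    cov = matmul(I_minus_KA, P)                # (I - K A) P
    return mean, cov


# Three-stop example: columns AB, BC, AC; rows A_in, C_out.
A = [[1, 0, 1], [0, 1, 1]]
d_h = [4.0, 3.0, 5.0]     # hypothetical historical demand
x = [11.0, 8.0]           # hypothetical observed ON at A and OFF at C
mean, cov = normal_normal_posterior(A, d_h, [1.0, 1.0, 1.0], x, [1.0, 1.0])
print(mean)  # [4.75, 2.75, 5.5]
```

In this toy run the observed boardings at A exceed what the historical demand predicts (11 versus 9), so the posterior raises the AB and AC flows and slightly lowers BC to keep the alighting count at C close to its observation.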

Finally, our approach, which uses Bayes’s formula and MCMC simulation as its main tools, has three major drawbacks. First, the assumed distributions for the prior or the likelihood might not be true, which leads to an inaccurate posterior distribution if our Bayesian analysis is not robust. Second, and as a consequence, the obtained posterior distribution is valid only with respect to the specific class of prior-likelihood distributions. Third, the computational cost of MCMC is very high when working with high-dimensional data (for one iteration, the MCMC simulation in R ran for 18 hours on my MacBook Air and ended up crashing). Due to this limitation, MCMC does not scale to very large networks with approximately one million nodes.
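To make the MCMC approach discussed in this section concrete, the sketch below runs a random-walk Metropolis sampler for a single route with a Normal prior and a Poisson likelihood (the Normal - Poisson pair). It is a toy illustration only: it is not the sampler used by Stan, the routing matrix reduces to the scalar 1, and the values of `d_h`, `prior_sd` and `x` are hypothetical.

```python
import math
import random


def log_posterior(d, d_h, prior_sd, x):
    """Unnormalized log posterior for one route.

    Prior: d ~ Normal(d_h, prior_sd). Likelihood: x | d ~ Poisson(d).
    """
    if d <= 0:
        return float("-inf")  # a Poisson rate must be positive
    log_prior = -0.5 * ((d - d_h) / prior_sd) ** 2
    log_likelihood = x * math.log(d) - d  # constant -log(x!) dropped
    return log_prior + log_likelihood


def metropolis(d_h, prior_sd, x, n_samples=20000, burn_in=2000,
               step_sd=5.0, seed=0):
    """Random-walk Metropolis draws from the posterior of d."""
    rng = random.Random(seed)
    current = float(d_h)
    current_lp = log_posterior(current, d_h, prior_sd, x)
    samples = []
    for i in range(n_samples + burn_in):
        proposal = current + rng.gauss(0.0, step_sd)
        proposal_lp = log_posterior(proposal, d_h, prior_sd, x)
        # Accept with probability min(1, posterior ratio); 1 - random()
        # lies in (0, 1], so the logarithm is always defined.
        if math.log(1.0 - rng.random()) < proposal_lp - current_lp:
            current, current_lp = proposal, proposal_lp
        if i >= burn_in:
            samples.append(current)
    return samples


# Hypothetical inputs: historical demand 50, prior sd 10, observed count 60.
samples = metropolis(d_h=50, prior_sd=10, x=60)
posterior_mean = sum(samples) / len(samples)
print(round(posterior_mean, 1))
```

The posterior mean lands between the historical demand and the observed count, which is the compromise the prior and likelihood are meant to produce. Each iteration costs one posterior evaluation, so the cost grows with both the number of iterations and the dimension of $d$.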


3 Exploratory Data Analysis

3.1 APC Dataset

Thanks to our collaboration with PACE, the largest suburban bus division in Illinois, I obtained two datasets, denoted APC and Ventra, which contain data on all of PACE’s bus rides. Together, these datasets have 71.18 million data points, with 16 common fields such as "Latitude", "Longitude", "Days of a week", "Route", "Stop Name", "ON", "OFF" (these are counts data) and "Bus ID". We were informed by PACE’s managers that the counts data in the APC dataset were quite inaccurate, and sometimes misleading, because on certain days the sensors malfunctioned or the bus driver did not keep track of the counts correctly for cash-paying passengers. In addition, to comply with regulatory and non-discrimination requirements, the buses must be assigned to different routes every day.

With that said, the APC dataset was collected in two different time periods: one in October 2016, and the other over the entire year of 2015 (365 days, from 01/01/2015 to 12/31/2015). There are three major differences between the one-month dataset and the one-year dataset. First, the "Trip Time" column is included in the one-month data, but not in the whole-year data. Second, the one-month dataset only records trips made by 63 buses departing from the same PACE garage, located in the Northwest region of Chicago, whereas the one-year dataset includes all 635 buses from the 9 garages located across IL. Third, in the one-month dataset, approximately 85% of the data was collected on weekdays. In both datasets, however, the counts data are noisy.

First, I provide the exploratory data analysis for the one-month dataset. Using the "tidyverse" and "ggplot2" packages and the group_by function in R, I computed the average number of riders ON and OFF for every hour of the day and created bar plots to observe the trends across hours (note that our dataset does not contain data at 3am). The result is interesting: the highest average hourly APC ON occurred at 10pm (= 0.2758), while the lowest occurred at 1am (= 0.2187). A different pattern was observed for the highest and lowest average APC OFF, which occurred at 11pm (= 0.2831) and 4am (= 0.2333). To check whether these differences are statistically significant or just due to noise in the data, I conducted the Mann-Whitney-Wilcoxon test using the function wilcox.test() in R and obtained a p-value of 0.0685, which is greater than 0.05; the standard errors for average ON and OFF are 0.2251 and 0.2518, respectively. This implies that we cannot reject the null hypothesis that the two groups came from identical populations, so the difference between average ON and average OFF at the above hours is mainly due to noise in the counts data.

I then looked at the total ON count to find the time at which most riders are active. I used the ggplot() function to create a bar plot of the total APC ON across hours of the day (see Figure 6) and observed that the early morning period (6am - 7am) is the most active time for riders, while very few riders travel between midnight and 4am. However, the peak time of the total hourly APC ON and OFF, which is 6am for both, is quite different from that of the average hourly APC ON. The reason is as follows: at 10pm, when the average hourly APC ON peaks, the amount of data collected is only approximately one-fifth of that at 6am, when the total APC ON and OFF peak (= 42483). Since

$$\text{average hourly APC ON at time } t = \frac{\text{total APC ON at time } t}{\text{number of ON count records collected at time } t},$$

and the ratio of the number of count records at 6am to that at 10pm is larger than the ratio of the total hourly APC ON at 6am to that at 10pm ($5.035 > \frac{42483}{9863} \approx 4.307$), we conclude that the substantial difference in the amount of data collected at 6am and 10pm causes the difference between the peak time of total APC ON and that of average APC ON.
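The total-versus-average argument above can be checked with a few lines of arithmetic. The sketch below uses only figures quoted in the text (42483, 9863, 0.2758 and 5.035), reading 9863 as the total APC ON at 10pm; the implied 6am average it computes is a derived quantity for illustration and is not a value reported in the analysis.

```python
# Figures quoted in the text.
total_on_6am = 42483      # total APC ON at 6am (peak of the totals)
total_on_10pm = 9863      # total APC ON at 10pm (read from the quoted ratio)
avg_on_10pm = 0.2758      # average hourly APC ON at 10pm (peak of the averages)
record_ratio = 5.035      # count records at 6am / count records at 10pm

# Ratio of totals between the two hours.
total_ratio = total_on_6am / total_on_10pm

# Because average = total / records, the ratio of averages equals
# (ratio of totals) / (ratio of record counts).
implied_avg_on_6am = avg_on_10pm * total_ratio / record_ratio

print(round(total_ratio, 3))         # 4.307
print(round(implied_avg_on_6am, 3))  # 0.236
```

Since the record count grows faster than the total between 10pm and 6am, the 6am average falls below the 10pm average even though 6am has by far the larger total.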


Figure 5: Distribution of average hourly APC ON and OFF (October 2016)

Figure 6: Distribution of total hourly APC ON and OFF (25 = 1am, 26 = 2am - October 2016)


Figure 7: Distribution of total APC ON per week-days (October 2016)

I conducted a similar analysis for the average APC ON and OFF per day of the week to observe the distributions of ON and OFF. The same pattern holds for both average APC ON and OFF: the peak and trough days are Saturday and Monday, respectively. The absolute magnitudes are not substantially different: for average daily APC ON, the highest and lowest values are 0.2960 and 0.2434, while those of APC OFF are 0.300 and 0.2457. The averages do not differ much because only high-demand routes are served on the weekend. I then conducted the Mann-Whitney-Wilcoxon test using the wilcox.test() function in R to see whether the difference between average APC ON and OFF is statistically significant. The standard errors obtained for average APC ON and OFF are 0.219 and 0.232, and the p-value obtained is 0.127, which is greater than 0.05. Thus, we cannot reject the null hypothesis that the average APC ON and OFF come from an identical distribution, so the minor difference between the highest and lowest average APC ON and OFF is due to noise in the data.

In addition, I computed the total number of APC ON and OFF per day of the week and observed that Monday is the most active day for riders, evidenced by the highest total APC ON and OFF (86772 and 87610), while Sunday is the least active day, with the lowest total APC ON and OFF (34246 and 34091). Once again, I conducted the Mann-Whitney-Wilcoxon test to check whether the difference between total and average APC ON, as well as between total and average APC OFF, is statistically significant. For total APC ON and OFF, the standard errors are 25.3 and 28.9, and the p-values for total versus average APC ON and OFF are 0.0417 and 0.0439, respectively. Since both p-values are less than 0.05, the differences between total APC ON and average APC ON (as well as between total APC OFF and average APC OFF) per day of the week are statistically significant at the 5% significance level.

When conducting the same analysis and computing the standard error and p-value of the t-test statistic for the total daily APC ON and OFF, the above pattern still holds: a weekday (Thursday, 10/19) has the highest number of riders and a weekend day (Sunday, 10/30) has the lowest. The standard errors for total daily APC ON and OFF in October are 145.22 and 152.31, and the p-value for total daily APC ON versus total daily APC OFF is 0.067, which is greater than 0.05. This implies that the difference between total daily APC ON and OFF is not statistically significant at the 5% significance level, and that the difference is due to noise in the counts data.
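The Mann-Whitney-Wilcoxon comparisons above were run with wilcox.test() in R. As a rough illustration of what the test computes, the sketch below implements the U statistic with a two-sided normal approximation and a tie correction, using only the Python standard library. It is not a reimplementation of wilcox.test(): R may use an exact p-value or a continuity correction, so its output can differ. The sample values are hypothetical.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def mann_whitney_u(sample_a, sample_b):
    """Return (U for sample_a, two-sided p-value by normal approximation)."""
    pooled = list(sample_a) + list(sample_b)
    n1, n2 = len(sample_a), len(sample_b)
    n = n1 + n2
    ranks = average_ranks(pooled)
    u_a = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    # Tie correction: subtract sum(t^3 - t) over groups of tied values.
    counts = {}
    for v in pooled:
        counts[v] = counts.get(v, 0) + 1
    tie_term = sum(t ** 3 - t for t in counts.values())
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    z = (u_a - mean_u) / math.sqrt(var_u)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return u_a, p_value


# Hypothetical, clearly separated samples: the null of identical
# populations should be rejected at the 0.05 level.
u_sep, p_sep = mann_whitney_u([1, 2, 3, 4, 5], [6, 7, 8, 9, 10])

# Hypothetical identical samples: no evidence against the null.
u_same, p_same = mann_whitney_u([1, 2, 3], [1, 2, 3])
print(u_sep, p_sep < 0.05, p_same)
```

A p-value above 0.05, as in the ON versus OFF comparisons in the text, means the null hypothesis of identical populations is not rejected; it does not by itself show that the two populations are the same.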

To complete my analysis of the one-month APC dataset, I examined the number of riders on particular routes per day of the week, hour of the day and day of the year. This is helpful to the managers at PACE because it gives them insight into how often each route is used, which helps in future planning if they want to eliminate certain low-usage routes and re-route certain buses. First, I found that route 215 is used on only two days of the week (Thursday and Friday) and serves on average 240 riders per day. Second, route 237 was only available on Thursday, and very few people used it (only 63 riders in October). Researching this particular route, I found that it operates for special events or serves riders taking long outbound trips to Chicago. Since there was no big event in Chicago during October 2016, this explains why it carried, on average, only approximately 2 riders per day in October.

Now, I perform the same exploratory data analysis for the one-year APC dataset, which has 69.4 million rows and 18 columns. First, using the 'dplyr' library and its 'group_by' function in R, I created a barplot of the total number of riders across the days of the week. I observe that the peak of total ON is on Tuesday and Wednesday, while the trough is on Sunday (see Figure 8). In addition, I examine the average APC ON across days of the week, defined as

$$\text{average APC ON} = \frac{\text{total APC ON on a particular day of the week}}{\text{amount of data collected on the same day of the week in a year}},$$

to see if it has a different pattern. Indeed, the opposite pattern appears here: average APC ON peaks on Sunday. Once again, I conducted the Mann-Whitney-Wilcoxon test to examine whether the difference between the distributions of average and total APC ON across days of the week is statistically significant. The standard errors of average and total APC ON are 0.169 and 11,528, and the p-value is 0.045, which is less than 0.05, so the difference between the distributions of total and average APC ON is statistically significant at the 95% confidence level. I then observe the trends in the average and total monthly APC ON in the corresponding barplots. The patterns are similar: both average and total monthly APC ON peak in October, and the distributions of average and total monthly APC ON are both skewed to the left. This means October is the most active month for riders (see Figures 10-11). Finally, with regard to the total APC ON per day of the year 2015, the peak is approximately 75,000 and occurs near the middle of the year. The day-to-day variations are quite large, with an approximate range between 5,000 and 60,000 (see Figure 12; note that the x-axis includes all days from 01/01/2015 to 12/31/2015).
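The contrast between total and average ON counts can be made concrete with a short sketch. The report's computation was done in R with dplyr; the Python below is a hypothetical equivalent on made-up records, where the record layout (a weekday label and an ON count) is an assumption. It shows how a day with fewer observations can tie on the total yet lead on the average.

```python
from collections import defaultdict


def total_and_average_on(rows):
    """Return (totals, averages) of ON counts keyed by day of the week."""
    totals = defaultdict(int)
    n_obs = defaultdict(int)
    for day, on_count in rows:
        totals[day] += on_count
        n_obs[day] += 1
    # average = total ON on that weekday / number of records on that weekday
    averages = {day: totals[day] / n_obs[day] for day in totals}
    return dict(totals), averages


# Made-up (weekday, ON count) records, not taken from the APC data.
records = [("Mon", 3), ("Mon", 1), ("Mon", 2), ("Sun", 4), ("Sun", 2)]
totals, averages = total_and_average_on(records)
```

In this toy input both days have the same total, but Sunday has the higher average because fewer records are collected on that day.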


3.2 VENTRA Fare Card Dataset

Technological advancements allow the transit authority in Chicago to issue a declining-balance, RFID-enabled card called "Ventra", which allows passengers to add an unlimited number of ride passes to their card at any time. Moreover, passengers can now transfer between different transit agencies (CTA and PACE, PACE and Metra, etc.) without having to purchase a new fare, unlike with the old fare payment method. These features significantly improve the user experience for frequent riders and commuters, which in turn encourages more people in the area to use public transportation for getting around the city. Note that the Ventra dataset does not have the OFF counts (unlike the APC dataset). However, the Ventra dataset contains information that was not available in the APC dataset, such as passengers' transaction history, their transfer points, and customers' trip types. This information was not collected previously because the old payment methods could not obtain it without negatively affecting riders' user experience.

The Ventra dataset is available for March 2016 (03/01 - 03/31), and it includes buses from all nine PACE garages (unlike the one-month APC dataset, which only covers buses from the Northwest garage of PACE). The dataset has 88 columns for each observation and a total of 404643 observations. For my data manipulation, I keep only the columns that are crucial to the exploratory data analysis. Those columns are: location information ("lat", "lon"), the trip's start time ("START TIME"), the direction of the bus trip ("DIRECTION"), hour ("Hour"), the type of trip riders made ("TRIP TYPE"), bus stop number ("STOP"), the number of boarding riders ("Count"), the specific route that a bus covers ("ROUTE Number"), and the status of the transaction recorded from the card swipe ("Transaction status").

Repeating the same steps as with the APC datasets above, I first compute the total and average ON across different hours and create bar plots to observe the trends in the distributions of these two quantities. For the Ventra dataset, the "COUNT" data is available for every hour of the day (unlike the APC dataset, which does not have count data at 3am). Once again, the average ON count at time $t$ is computed with the formula

$$\text{average ON count at time } t = \frac{\text{total ON count at time } t}{\text{total amount of ON count data at time } t}.$$

After creating the bar plots for the average and total ON count across hours of the day, the following trends are observed: the average "ON" count peaks at 2am (= 2.20822 riders) and troughs at 4am (= 1.1172 riders). The total ON count peaks at 3pm (= 54859 "ON" riders) and troughs at 1am (= 225 "ON" riders). In order to verify whether the difference is statistically significant, I once again conducted the Mann-Whitney-Wilcoxon test using the wilcox.test() function in R. The standard errors obtained for total and average ON counts are 358.17 and 1.451, and the p-value is 0.0317, which is less than 0.05. Thus, the difference between total and average ON count is statistically significant. Finally, I observe that the distribution of total hourly "ON" count resembles a combination of two Normal distributions (one for the morning - noon period, and one for the afternoon - midnight period), and the distribution of average hourly "ON" count looks similar to the Poisson distribution.


Furthermore, for the total and average "ON" count per day of the week, the trends observed from the bar plots are similar to the ones observed in the two APC datasets. The distribution of average ON count per day of the week peaks on Saturday (= 1.3941 riders) and troughs on a weekday (= 1.3088 riders), while that of total ON counts peaks on Tuesday (= 105647 riders) and troughs on Sunday (= 16513 riders). The Mann-Whitney-Wilcoxon test was conducted once again, and gives the standard errors for total and average as 11251.72 and a p-value of 0.0415, which is less than 0.05. Thus, the difference between total and average ON counts is statistically significant. Finally, the distribution of the total ON counts per day of March is very similar to that of the total ON counts per day of the week, as it peaks on a weekday (Tuesday) and troughs on the weekend (Sunday). Thus, the most active time for riders is Tuesday, and the least active time is Sunday.

4 Bayesian Model

In this section, I utilize the Bayesian model described in Section 2.6 above to obtain the posterior distribution of $d \mid d_h, x$, where $d_h$ and $x$ are data from the 2015 APC dataset. Up to a normalizing constant, this is equivalent to computing $P(x \mid d)\,P(d \mid d_h)$, where the prior - likelihood pair is one of four possible choices: Normal - Normal, Poisson - Normal, Poisson - Poisson, and Normal - Poisson. The first step is recovering the historical demand $d_h$, which is the solution to the system of linear equations $A d_h = x_h$, where $A$ is the routing matrix between individual bus stops. This system of linear equations is inconsistent, so it has no exact solution. Therefore, I employed the least-squares method with the 'nnls' package in R to solve for the non-negative solution $d_h$ that minimizes $\|A d_h - x_h\|_2^2$. Unfortunately, this package is unable to handle the size of matrix $A$. One common way to resolve this scaling issue is to divide the map into zones and group bus stops in the same zone together. My advisor provided the 'stopzone.csv' file, which contains the column 'ZONE' mapping each bus stop to a unique zone using its 'lat-lon' pair (the region is divided into 1993 different polygon zones). I utilized the 'left_join()' function in R to assign a zone to each bus stop based on their common 'geonodeID'. After that, I aggregated the total ON counts of all the stops within the same zone and reconstructed the "zone-level" routing matrix $A$. The dimensions of $A$ are thereby reduced to hundreds × hundreds. Now, using the same 'nnls' package, I recover the non-negative zone-level solution $d_h$ that minimizes $\|A d_h - x_h\|_2^2$ (note that by doing this, I implicitly relax the constraint that $d_h$ must be integer). This $d_h$ is then chosen to be the rate $\lambda$ of the Poisson distribution when it is assigned to our prior or likelihood.
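The aggregation and non-negative least-squares steps can be sketched on a toy problem. This is a minimal illustration under stated assumptions, not the report's code: the report uses R's 'nnls' package (an implementation of the Lawson-Hanson algorithm), whereas the sketch below swaps in plain projected gradient descent because it fits in a few lines of standard-library Python. The function names, stop identifiers, zone labels and the small matrix are all hypothetical.

```python
from collections import defaultdict


def aggregate_by_zone(stop_on, stop_zone):
    """Sum stop-level ON counts per zone; stops with no zone are grouped under None."""
    zone_on = defaultdict(float)
    for stop_id, on_count in stop_on.items():
        zone_on[stop_zone.get(stop_id)] += on_count
    return dict(zone_on)


def nnls_projected_gradient(A, x, steps=2000):
    """Approximately minimize ||A d - x||_2^2 subject to d >= 0."""
    n_rows, n_cols = len(A), len(A[0])
    # 1 / ||A||_F^2 is a safe step size for the objective 0.5 * ||A d - x||^2.
    step = 1.0 / sum(a * a for row in A for a in row)
    d = [0.0] * n_cols
    for _ in range(steps):
        residual = [sum(A[i][j] * d[j] for j in range(n_cols)) - x[i] for i in range(n_rows)]
        grad = [sum(A[i][j] * residual[i] for i in range(n_rows)) for j in range(n_cols)]
        # Gradient step followed by projection onto the non-negative orthant.
        d = [max(0.0, d[j] - step * grad[j]) for j in range(n_cols)]
    return d


# Hypothetical toy data: three stops in two zones, and a 3x2 routing matrix.
zone_on = aggregate_by_zone({"s1": 4, "s2": 6, "s3": 5}, {"s1": 1, "s2": 1, "s3": 2})
d_h = nnls_projected_gradient([[1.0, 1.0], [1.0, 0.0], [0.0, 1.0]], [3.0, 1.0, 2.0])
d_clipped = nnls_projected_gradient([[1.0, 0.0], [0.0, 1.0]], [-1.0, 2.0])
```

The second call shows the effect of the constraint: the unconstrained least-squares answer would have a negative first component, which the projection holds at zero.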

4.1 Introduction to Stan - A Probabilistic Programming Language

In order to specify the data, the prior and likelihood in our Bayesian model, and the parameter $d$, and to compute the Bayesian inference for continuous-variable models through MCMC simulation, I extensively use Stan, a C++ program for performing Bayesian inference. A Stan program makes inference by directly computing the log-posterior density function over parameters conditioned on specified data and constants. The result is a set of posterior simulations of the parameters in the model (or a point estimate, if Stan is set to optimize). Stan differs from BUGS and JAGS in two ways. First, Stan is based on a new probabilistic programming language that is more flexible and expressive than the declarative graphical modeling languages underlying BUGS or JAGS, in ways such as declaring variables with types and supporting local variables and conditional statements. Second, Stan's MCMC simulation is based on Hamiltonian Monte Carlo (HMC), a more efficient and robust sampler than Gibbs sampling or Metropolis-Hastings for models with complex posteriors. Stan has interfaces for the command line shell (cmdstan), for Python (pystan), and for R (library 'rstan').

A typical Stan program includes multiple blocks, each of which serves a unique purpose. The blocks must be specified in the following order. A Stan program always starts with the data block (unless the program has user-defined functions, in which case those functions must be specified before the data block), which declares the data (double types) required to fit the model. From the modeling standpoint, this differs from BUGS and JAGS, which determine which variables are data and which are parameters at run time based on the shape of the data input to them. Thanks to these declarations, Stan compiles much more efficient code (the underlying language supporting Stan's compilation is C++, which compiles data variables as double types much faster).
The next (optional) block is the transformed data block, which may be used to define new variables that can be computed from the data. This block is executed during construction, after the data is read in (note that the transformed data variables can only be used after they are declared). Next is the parameters block, which defines the parameters whose posterior distribution and/or point estimate we are interested in. This block is executed every time the log density is evaluated. The probability distribution defined by a Stan program works with unconstrained support (i.e., no points of zero probability), so variables declared with constrained support are implicitly transformed to an unconstrained space over which the model block is defined. These unconstrained parameters are then inverse transformed back to satisfy their constraints before any statements in the model block are executed. To account for this change of variables, the log absolute Jacobian determinant of the inverse transform is added to the overall log density. No validation is required for the parameters block. Next is the (optional) transformed parameters block, which is executed after the parameters block. Constraints are validated after all of the statements defining the transformed parameters have executed. If the constraints are not satisfied, the execution of the log density function is halted. Next is the model block, which defines the log density on the constrained parameter space. It can contain as many sampling statements as needed, and every such statement is translated into an increment of the log density (e.g., the statement 'beta ∼ normal(0,1)' has the same effect, up to an additive constant, as incrementing the log density directly with the value of the log probability density function of the Normal distribution using the target increment statement: target += normal_lpdf(beta | 0, 1)). Stan does not require proper priors, but if the posterior is improper, Stan will halt with an error message. Finally, an (optional) generated quantities block allows values that depend on parameters and data, and might be used to compute predictive inferences. Figure 8 below is an example of a Stan model in vectorized form that contains the 3 must-have blocks: data block, parameters block, and model block.

Finally, despite its strength in computing the log-posterior density to perform Bayesian inference, the main limitation of Stan is that it does not allow inference for discrete parameters. Stan allows discrete data and discrete-data models such as logistic regressions, but it cannot perform inference for discrete unknowns. This explains why, in my approach for obtaining the posterior distribution when the prior follows a Poisson distribution, I have to use a Normal distribution with equal mean and approximately equal variance to approximate the original Poisson distribution.
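The way a Stan model block accumulates a log density, including the Jacobian term for a constrained parameter, can be imitated in a few lines. The Python sketch below is an analogy written for this explanation, not Stan code and not part of the report's model: the toy model (observations y with unknown mean mu and positive standard deviation sigma) and the function names are assumptions. It uses the log transform for the positive parameter, so the log absolute Jacobian of the inverse transform sigma = exp(log_sigma) is simply log_sigma.

```python
import math


def normal_lpdf(y, mu, sigma):
    """Log density of Normal(mu, sigma) at y, with sigma the standard deviation."""
    return -0.5 * math.log(2 * math.pi) - math.log(sigma) - 0.5 * ((y - mu) / sigma) ** 2


def log_density_unconstrained(mu, log_sigma, data):
    """Log density over (mu, log_sigma) for y_i ~ Normal(mu, sigma) with mu ~ Normal(0, 10)."""
    sigma = math.exp(log_sigma)            # inverse transform back to sigma > 0
    target = log_sigma                     # log |d sigma / d log_sigma|
    target += normal_lpdf(mu, 0.0, 10.0)   # plays the role of: mu ~ normal(0, 10)
    for y in data:
        target += normal_lpdf(y, mu, sigma)  # plays the role of: y ~ normal(mu, sigma)
    return target


value = log_density_unconstrained(0.5, 0.0, [0.2, 0.9, 0.4])
```

A sampler such as HMC only ever sees the unconstrained pair (mu, log_sigma), which is why no point of the search space has zero probability.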

Figure 8: A Stan program modeling a linear regression with unknown coefficients

The following four subsections contain the simulation results given by MCMC (and the analytical solution given by the Kalman Filter for the Normal - Normal case) for all four prior - likelihood pairs. Each MCMC simulation contains the estimated demand for 365 days. For each day, the estimated demand vector $d$ has hundreds of parameters to sample. Notice that some prior - likelihood pairs provide very good estimates between certain zone-level origin-destination pairs on some days but perform very poorly on other days. Other pairs consistently provide good estimated demand $d$ over the entire 365 days, and such pairs are the ones that should be chosen by our sponsor. The comparison plots shown below mainly focus on the days where certain prior - likelihood pairs provide bad results. For the pairs that provide consistently good estimates across all days, the days shown in the comparison plots were chosen randomly. Note that the label on the x-axis only stands for the ordering of different zone pairs on a particular day (e.g., zone index = 100 corresponds to zone pair 1795 - 1375 on day 15).

4.2 Normal - Normal (prior - likelihood)

The prior and likelihood follow $N(d, \sigma_2)$ and $N(Ad + \varepsilon_1, \sigma_1)$, where $\varepsilon_1$, randomly drawn from $U(0, 20)$, represents the random difference between $Ad$ and $x$, and the vectors $\sigma_1$ and $\sigma_2$ are drawn from $U(0, 10)$ (component-wise). They represent noise in each component of $d$ and $Ad + \varepsilon_1$. Since we do not currently have the observed true demand $d'$, we simulate it as $d_h + \varepsilon_2$, where $\varepsilon_2 \sim U(0, 10)$ is the expected difference between $d_h$ and $d'$ (since the count data $x_h$ is underestimated, $d_h$ underestimates $d'$). For this particular case, an analytical solution exists via the Kalman Filter. The posterior follows a Normal distribution, and an explicit formula for computing its mean and variance exists (see Appendix A.1.2). The numerical solution is obtained by running MCMC simulations with 250 iterations and 3 chains. From the one-year APC dataset, I observe that each day has a different zone-level routing matrix $A$, and the dimension of the estimated demand vector $d$ also changes from day to day. The computational complexity of the MCMC simulation required me to put our model written in Stan, the execution code written in R, and the data onto Amazon Web Services, using the 'm5.2xlarge' instance type (32GB RAM, 8 CPUs). The MCMC calculation is performed with library(rstan) and the 'sampling()' function in R. Each figure, on average, still took more than 3896 seconds to produce. Below are the comparison plots between numerical and analytical solutions versus simulated true demand on 4 different days.

Figure 9: Day 4 (Left) - 5(Right). MCMC’s solution (black) versus Kalman-Filter (green) and true demand(red)


Figure 10: Day 12 (Left) and 361 (Right). MCMC's solution (black) versus Kalman-Filter (green) and true demand (red)
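The analytical Normal - Normal solution referred to in this subsection has a simple scalar analogue, sketched below. The report's own formula is the one in Appendix A.1.2; this sketch only shows the standard conjugate update for a single demand component under the assumptions that the prior is $N(m, s^2)$, the observation is $x \mid d \sim N(a d, r^2)$, and the second argument of each Normal is a variance. The function and argument names are hypothetical. Both the Kalman-gain form and the precision-weighted form are given, and they should agree.

```python
def normal_normal_update(prior_mean, prior_var, a, x, obs_var):
    """Posterior of d given x, for d ~ N(prior_mean, prior_var) and x | d ~ N(a * d, obs_var)."""
    gain = prior_var * a / (a * a * prior_var + obs_var)  # Kalman gain
    post_mean = prior_mean + gain * (x - a * prior_mean)
    post_var = (1.0 - gain * a) * prior_var
    return post_mean, post_var


def precision_form(prior_mean, prior_var, a, x, obs_var):
    """Same posterior, written as a precision-weighted average."""
    post_var = 1.0 / (1.0 / prior_var + a * a / obs_var)
    post_mean = post_var * (prior_mean / prior_var + a * x / obs_var)
    return post_mean, post_var


# Made-up numbers: prior demand 10 with variance 4, observed count 14 with variance 4.
kalman = normal_normal_update(10.0, 4.0, 1.0, 14.0, 4.0)
weighted = precision_form(10.0, 4.0, 1.0, 14.0, 4.0)
```

With equal prior and observation variances the posterior mean lands halfway between the prior mean and the observation, and the variance is halved.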

From the four plots, the solution by the Kalman Filter underestimates the true demand across most of the zones, except that it tracks the crowded zone pair 151 and 1287 on Day 12 and zone pair 40 and 70 on Day 4 fairly closely. The MCMC solution matches the true demand well at the zone pairs that are not crowded (zone pairs 52 and 136, or zone pairs 198 and 160), but it severely underestimates the zone pairs with a sudden peak in demand (zone pairs 40 and 70 on Day 4, 151 and 1287 on Day 12). I then computed the mean squared error (MSE) and the mean absolute percentage error (MAPE) for each of these 4 days. The formulas for MSE and MAPE are

$$\text{MSE} = \frac{\|d' - d\|_2^2}{\text{number of components of vector } d}, \qquad \text{MAPE} = \frac{\|(d' - d)/d'\|_1}{\text{number of components of vector } d},$$

respectively. I obtained large values for both MSE and MAPE across the 4 days: 21.37, 15.91, 13.26, 12.48, and 42.33%, 38.44%, 33.17% and 26.41%. The major sources of error in MSE and MAPE come from the underestimation of the MCMC solution at the most crowded zone pairs (such as 40 and 70 on Day 4, 151 and 1287 on Day 12). Combining the above results, I conclude that Normal - Normal gives poor estimated demand $d$ on certain days for crowded zone pairs (40 and 70, 151 and 1287), although it does capture some low-demand zone pairs (52 and 136, 198 and 160) fairly well.
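The two error measures above can be written out directly. The sketch below follows the formulas in the text, with the division in MAPE taken component-wise; the function names and the two short demand vectors are made up for illustration, and the decision to raise an error when a true-demand component is zero is an assumption, since the text does not say how such components were handled.

```python
def mean_squared_error(d_true, d_est):
    """||d' - d||_2^2 divided by the number of components."""
    return sum((t - e) ** 2 for t, e in zip(d_true, d_est)) / len(d_true)


def mean_absolute_percentage_error(d_true, d_est):
    """||(d' - d) / d'||_1 divided by the number of components (as a fraction, not percent)."""
    if any(t == 0 for t in d_true):
        raise ValueError("MAPE is undefined when a true-demand component is zero")
    return sum(abs((t - e) / t) for t, e in zip(d_true, d_est)) / len(d_true)


# Made-up demand vectors for two zone pairs.
mse = mean_squared_error([10.0, 20.0], [8.0, 25.0])
mape = mean_absolute_percentage_error([10.0, 20.0], [8.0, 25.0])
```

Because MSE squares the errors, a single badly missed high-demand pair can dominate it, while MAPE weighs each pair relative to its own demand; this is why the two measures can rank models differently.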

4.3 Poisson - Normal (prior - likelihood)

When the prior is Poisson, $d \sim \text{Pois}(d_h)$. Our estimated parameter $d$ now lives in a discrete unbounded space, and as mentioned in subsection 4.1, Stan cannot sample discrete parameters. However, observing that most of the non-zero components of $d_h$ are sufficiently large, I approximate the Poisson distribution with rate $d_h$ by a Normal distribution having the same mean and approximately equal variance (recall that the mean and variance of $\text{Pois}(d_h)$ are both $d_h$). Therefore, our Normal approximation is of the form $N(d_h, d_h + k)$ where $k \sim U(0.02, 0.5)$. The reason we use $d_h + k$ rather than $d_h$ is that some components of $d_h$ are zero, while Stan, by default, starts its MCMC simulation from the interval $[-2, 2]$. Since $\log 0$ is not well-defined, we want to avoid those cases. We also want $k$ to be small enough that the second moment of $N(d_h, d_h + k)$ matches that of $\text{Pois}(d_h)$. Thus, we run the MCMC simulation with 250 iterations and 3 chains to obtain the posterior of $N(d_h, d_h + k) \times N(Ad + \varepsilon_1, \sigma_1)$. Below are the comparison plots over 4 different days between our MCMC solution and the simulated true demand $d'$.


Figure 11: Day 4 (Left) and 364 (Right)- MCMC’s solution versus simulated true demand d ′

Figure 12: Day 363 (Left) and 361 (Right)- MCMC’s solution versus simulated true demand d ′
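The quality of the Normal approximation to the Poisson prior used in this subsection depends on the size of the rate, which the following sketch illustrates. It compares the Poisson probability mass function with the density of $N(\lambda, \lambda + k)$ at the integers and reports the largest gap. The function names, the choice $k = 0.1$, and the two rates are assumptions made for illustration; the second argument of the Normal is treated as a variance.

```python
import math


def poisson_pmf(n, rate):
    """P(N = n) for N ~ Poisson(rate), computed on the log scale for stability."""
    return math.exp(n * math.log(rate) - rate - math.lgamma(n + 1))


def normal_pdf(y, mean, var):
    return math.exp(-0.5 * (y - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)


def max_abs_gap(rate, k=0.1):
    """Largest |Pois(rate) pmf - N(rate, rate + k) pdf| over the integers 0 .. 4 * rate + 20."""
    upper = int(4 * rate + 20)
    return max(abs(poisson_pmf(n, rate) - normal_pdf(n, rate, rate + k))
               for n in range(upper + 1))


gap_small_rate = max_abs_gap(2.0)
gap_large_rate = max_abs_gap(100.0)
```

The gap shrinks as the rate grows, which is consistent with the remark that the approximation is justified because most non-zero components of $d_h$ are sufficiently large; for components near zero it is noticeably cruder.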

From the four graphs above, I see that the estimated demand $d$ given by the Poisson - Normal model captures the most crowded zone pairs on Day 4 (zone pair 92 and 40) fairly well. However, on the same Day 4, it badly underestimates other, less crowded zone pairs, such as 635 and 394, and 5 and 58. On the remaining three days shown above, it severely underestimates some zone pairs, such as 206 and 40 on Day 364, and 69 and 251 on Day 361. However, the true demand between zone pairs such as 685 and 274, and 24 and 180, is captured quite well by this model. The MSE and MAPE computed for those 4 days are still fairly high: 24.32, 18.82, 11.71 and 20.75, and 30.19%, 21.93%, 18.52% and 24.47%. The major contributing factors to the large MSE and MAPE on Days 364 and 363 are the severe underestimations of moderately crowded zone pairs, while on Days 361 and 4 they are the most crowded zone pairs (92 and 40 on Day 4, 459 and 234 on Day 361). We conclude that this prior - likelihood pair is better than Normal - Normal, as it captures the most crowded zone pairs quite well, although it still misses other zone pairs that are not so crowded.

4.4 Poisson - Poisson (prior - likelihood)

Since the prior is still Poisson, I use the same Normal distribution $N(d_h, d_h + k)$, where $k \sim U(0.02, 0.5)$, to approximate the Poisson distribution. Our likelihood in this case becomes $\text{Pois}(Ad + \varepsilon_1)$. Using MCMC simulation (250 iterations, 3 chains) to obtain the posterior distribution of $N(d_h, d_h + k) \times \text{Pois}(Ad + \varepsilon_1)$, below are the comparison plots over 4 different days:

Figure 13: Day 1 (Left) and 10 (Right) - MCMC’s solution versus true simulated demand d ′


Figure 14: Day 11 (Left) and 12 (Right) - MCMC’s solution versus true simulated demand d ′

From the four graphs above, the MCMC solution overestimates the true demand by a wide margin at both crowded and not-so-crowded zone pairs, such as 57 and 110, and 215 and 1028. However, it captures the true demand very well between low-demand zone pairs such as 356 and 105, and 219 and 456. The MSE values on the four days above are higher than in the Poisson - Normal case: 25.24, 30.31, 27.55 and 21.38, but the MAPE values are smaller: 20.44%, 17.25%, 16.79% and 21.30%, respectively. The major source of the high MSE is the severe overestimation of the estimated solution over the low-demand zone pairs. The lower MAPE occurs because the estimated solution matches the true demand quite well over low-demand zone pairs such as 422 and 1080, and 52 and 1105.

4.5 Normal - Poisson (prior - likelihood)

Since the Poisson distribution is assigned to the likelihood, which Stan can handle because the parameter is no longer discrete, we do not have to use a Normal distribution to approximate it. Using MCMC simulation again to obtain the posterior distribution of $N(d_h, d_h + k) \times \text{Pois}(Ad + \varepsilon_1)$, where $\varepsilon_1$ is drawn from $U(0, 20)$, below are the comparison plots over 4 different days chosen randomly, since the match is consistently good across the 365 days:


Figure 15: Days 362 (Left) and 351 (Right) - MCMC’s solution versus simulated true demand d ′

Figure 16: Day 361 (Left) and 1 (Right) - MCMC’s solution versus simulated true demand d ′

From the four graphs above, the MCMC solution fits the simulated true demand quite well, especially at the most crowded zone pairs such as 29 and 84 on Day 361, 315 and 71 on Day 362, and 30 and 62. Only at a few zone pairs, including 32 and 25, 84 and 62, and 60 and 607, does the estimated demand underestimate the simulated true demand. The MSE and MAPE on the four days are the lowest achieved among the 4 cases: 5.24, 2.11, 4.77 and 7.28, and 5.35%, 4.75%, 5.81% and 4.33%, respectively. The MSE and MAPE are small because the estimated demand given by this particular prior - likelihood pair does not severely overestimate or underestimate the demand between any zone pair. There are still a couple of zone pairs where demand could not be matched closely, but the difference between the estimated and the true demand is not as large as before, as reflected in the low MSE; this is unlike the three previous methods. Finally, I create histograms to examine the distributions of the estimated demand between zone pairs 71 and 315, and 91 and 39. The histograms show distributions that are similar to the Normal distribution, which is the distribution of our prior.

Figure 17: Day 362 - Distribution of two components of estimated demand look similar to Normal
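To give a feel for sampling from a Normal prior combined with a Poisson likelihood, here is a one-dimensional sketch. It is not the report's implementation: the report samples a vector of hundreds of components with Stan's Hamiltonian Monte Carlo, whereas this illustration swaps in a plain random-walk Metropolis sampler for a single component, with the routing coefficient fixed to 1 and $\varepsilon_1$ omitted. All names and numbers (prior centre 20, $k = 0.1$, observed count 30, step size, seed) are assumptions.

```python
import math
import random


def log_posterior(d, d_h, k, x, a=1.0):
    """Unnormalized log posterior: Normal(d_h, d_h + k) prior times Poisson(a * d) likelihood."""
    if d <= 0:
        return float("-inf")  # the Poisson rate a * d must be positive
    log_prior = -0.5 * (d - d_h) ** 2 / (d_h + k)
    log_lik = x * math.log(a * d) - a * d
    return log_prior + log_lik


def metropolis(d_h, k, x, n_iter=20000, burn_in=2000, step_sd=3.0, seed=1):
    """Random-walk Metropolis draws of d, discarding the first burn_in iterations."""
    rng = random.Random(seed)
    d = max(d_h, 1.0)
    current = log_posterior(d, d_h, k, x)
    draws = []
    for i in range(n_iter):
        proposal = d + rng.gauss(0.0, step_sd)
        candidate = log_posterior(proposal, d_h, k, x)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, candidate - current)):
            d, current = proposal, candidate
        if i >= burn_in:
            draws.append(d)
    return draws


draws = metropolis(d_h=20.0, k=0.1, x=30.0)
posterior_mean = sum(draws) / len(draws)
```

The draws concentrate between the prior centre and the observed count, and a histogram of them is roughly bell-shaped, in line with the near-Normal shapes noted for Figure 17.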

4.6 Bayesian Sensitivity Analysis

From all the results above, it is easy to see that when the class of distributions assigned to the prior changes (such as from Poisson to Normal), the estimated demand changes substantially. Thus, the simulation result from the best case, Normal - Poisson (prior - likelihood), is sensitive to the specific distributions assigned to the prior and likelihood (i.e., it cannot be generalized when the prior does not follow a Normal distribution). However, I also modify the variance parameter $k$ of the Normal distribution by widening or narrowing the range of the uniform distribution from which it is drawn. The former is $k \sim U(0.02, 0.9)$ (instead of $U(0.02, 0.5)$) and the latter is $k \sim U(0.02, 0.1)$. Running the MCMC simulation again with 250 iterations and 3 chains, the estimated demand $d$ still matches the true demand $d'$ quite closely: the MAPE for those cases is approximately 4.53% and 4.94%, while the MSE increases to 7.3 and 8.5, respectively. This change does not fix the underestimation at zone pair 285 and 261. Overall, the result given by Normal - Poisson (prior - likelihood) is fairly robust with respect to the variance parameter of the Normal distribution assigned to the prior.


Figure 18: Day 9 (Left) and 355 (Right) - MCMC’s solution versus simulated true demand
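The limited effect of the variance offset $k$ can also be seen in a scalar worked example. Assuming a single demand component with prior $N(d_h, d_h + k)$, a single observed count $x \sim \text{Pois}(d)$, and no $\varepsilon_1$ term, setting the derivative of the log posterior $-\frac{(d - d_h)^2}{2(d_h + k)} + x \log d - d$ to zero gives the quadratic $d^2 + k\,d - (d_h + k)\,x = 0$, whose positive root is the posterior mode. The sketch below evaluates that root for several values of $k$; the function name and the numbers are hypothetical and are not taken from the report's runs.

```python
import math


def posterior_mode(d_h, k, x):
    """Posterior mode for a Normal(d_h, d_h + k) prior and one Poisson(d) count x."""
    v = d_h + k
    b = v - d_h  # equals k; kept in this form to mirror the derivation
    # Positive root of d^2 + b * d - v * x = 0.
    return (-b + math.sqrt(b * b + 4.0 * v * x)) / 2.0


modes = {k: posterior_mode(20.0, k, 30.0) for k in (0.02, 0.1, 0.5, 0.9)}
spread = max(modes.values()) - min(modes.values())
```

Across this range of $k$ the mode moves by less than a tenth of a rider in this toy setting, which matches the qualitative conclusion that the estimate is not very sensitive to the prior's variance offset.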

4.7 Summary of Results

From the results and the sensitivity analysis above, among the four possible prior - likelihood pairs, Normal - Poisson performs consistently well across the 365 days, as it has the lowest MAPE and MSE (there are still a couple of zone pairs missed, but this is expected because we are forecasting). The second-best model is either Poisson - Normal or Poisson - Poisson, as there is a trade-off between these two (the first has lower MSE but higher MAPE, while the second has higher MSE and lower MAPE). Furthermore, based on the simulation result, I can compute the sample mean of traffic flows between any zone pair. This sample mean, by the Law of Large Numbers, would be expected to converge to the population mean. However, our result is not robust with respect to the prior, so no inference can be made from our estimated demand once we relax the assumption that the prior follows a Normal distribution. The second-best model for this particular problem is Poisson - Normal, as it has the lower mean squared error compared to the Poisson - Poisson and Normal - Normal cases. Finally, Normal - Normal does not work well on certain days due to its severe underestimation, which results in the highest MSE and MAPE among the four cases.

5 Future Work

There are many ways in which a future team interested in further research on this topic can build upon the model and results above. First, they can explore other pairs of prior and likelihood, both at the zone level and at the individual-stop level. For the latter case, a zero-inflated negative binomial distribution assigned to the prior is a promising candidate, as the count data has lots of zeros. Second, the future team can attempt to find a robust Bayesian model for this APC dataset, so that we do not have to depend on the choice of prior (or, if such a robust model does not exist, prove it). Third, the future group can either incorporate other factors that might affect demand, such as distance traveled, transaction types, etc., into the Hierarchical Bayesian model and derive the posterior distribution for the estimated demand, or simply apply the model above (potentially with different choices of priors and likelihoods) to other transportation systems such as Amtrak, bike-sharing and car-sharing. Fourth, even with this dataset, the future group can reduce the forecasting period from one day to half a day, or even one hour, assuming that the "Trip Time" information can be collected by PACE (which should be the case, as people use the Ventra card nowadays). Finally, the future team can reformulate the original problem as a two-stage non-linear stochastic optimization problem, obtain the estimated demand by solving that problem, and then compare the results against the ones from the Hierarchical Bayesian model to see which produces the best result.


A Appendix - Literature Research Summary

A.0.1 Gaussians mixture generative model

Profile clustering has been conducted on mobility data in recent years for studying the temporal habits of passengers on their networks, and several methods have been developed to this end. Inspired by the work of [11], the authors in [2] proposed the use of a Gaussian mixture approach instead of a unigram mixture. The two-level generative mixture model uses non-aggregated data and fits a Gaussian mixture onto it. Specifically, at the first level the cards are partitioned into groups (card clusters), and the second level takes all ticketing logs of each cluster's cards to represent the temporal activity profile of that group as a Gaussian mixture model. This choice of Gaussian mixture is reasonable because we need to preserve the continuous nature of timestamps.

For modeling the cluster memberships of the cards, the authors introduced the latent variable $Z^1_i \sim \mathcal{M}(1, \pi)$, where $\mathcal{M}$ denotes a multinomial distribution, $Z^1$ denotes membership of one of the $K$ card clusters, and $Z^1_i$ denotes the membership of card $i$ ($i \in \{1, \dots, M\}$) in one of the $K$ card clusters and follows a multinomial distribution with parameter $\pi = (\pi_1, \dots, \pi_K)$. Similarly, for the second level, let $Z^2$ denote membership of one of the $H$ Gaussians, and let $Z^2_{ij}$ denote the membership of trip $j$ ($j \in \{1, \dots, N_i\}$, where $N_i$ is the number of trips of card $i$) in one of the $H$ Gaussians that describe the temporal activity of cluster $Z^1_{ik}$ for the day $D_{ijl}$ ($l \in \{1, \dots, 7\}$ being the set of the days of a week). Finally, $X_{ij}$ denotes trip time, which the authors assumed follows a Gaussian distribution $N(\mu_{khl}, \sigma_{khl})$. Thus, mathematically, the two-level model can be written as follows:

$$Z^1_i \sim \mathcal{M}(1, \pi),$$

$$Z^2_{ij} \mid Z^1_{ik} D_{ijl} = 1 \sim \mathcal{M}(1, \tau_{khl}),$$

$$X_{ij} \mid Z^1_{ik} Z^2_{ijh} D_{ijl} = 1 \sim N(\mu_{khl}, \sigma_{khl}).$$

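The conditional density of $X_{ij}$ is

$$f(X_{ij} \mid \{Z^1_{ik} Z^2_{ijh} D_{ijl} = 1\}) = \sum_{h=1}^{H} \tau_{k h d_{ij}}\, f(x; \mu_{k h d_{ij}}, \sigma_{k h d_{ij}}),$$

where $f(\cdot\,; \mu, \sigma^2)$ is the density function of the Gaussian distribution with mean $\mu$ and variance $\sigma^2$. From this, we obtain the likelihood model:

$$L(\theta) = \prod_{i=1}^{M} \sum_{k=1}^{K} \pi_k \left( \prod_{j=1}^{N_i} \sum_{h=1}^{H} \tau_{k h d_{ij}}\, f(x; \mu_{k h d_{ij}}, \sigma_{k h d_{ij}}) \right)$$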

At this step, we could estimate the likelihood parameters in the mixture models using either the Expectation Maximization (EM) algorithm or a Classification Expectation Maximization (CEM) algorithm, which includes a classification step. Since our model consists of two levels, it is natural to adopt a two-step maximization process for this parameter estimation. We would first use the complete log-likelihood as the maximization criterion for the estimation. Then a CEM algorithm is used ([6]), since it includes a classification step that assigns each observation to its most probable cluster (rather than yielding a vector of membership probabilities, as in the classic EM). Finally, this algorithm takes three key inputs: user id (comprised of anonymized card ID, the card type, transaction date and time, stop location, transport line, method of validation and type of transaction), the day of the week (Monday, ..., Saturday) and the hour of validation (service hours only, with no break at midnight). It then returns the associated cluster for each user, the Gaussian mixture parameters and the complete log-likelihood.
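The difference between the EM membership probabilities and the CEM hard assignment can be sketched for a single observation. The code below is a simplified one-level illustration, not the two-level algorithm of [2]: it treats one trip time and a small Gaussian mixture with known weights, means and variances, all of which are made-up values (a morning and an evening peak, in hours). The function names are hypothetical.

```python
import math


def gaussian_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)


def responsibilities(x, weights, means, variances):
    """E-step: posterior probability of each Gaussian component for one observation x."""
    joint = [w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
    total = sum(joint)  # the mixture density evaluated at x
    return [j / total for j in joint]


def classify(x, weights, means, variances):
    """C-step: hard-assign x to its most probable component."""
    post = responsibilities(x, weights, means, variances)
    return max(range(len(post)), key=post.__getitem__)


# Made-up two-component mixture: morning peak at 8.0h, evening peak at 17.5h.
weights, means, variances = [0.5, 0.5], [8.0, 17.5], [1.0, 1.0]
soft = responsibilities(8.5, weights, means, variances)
hard_morning = classify(8.5, weights, means, variances)
hard_evening = classify(17.0, weights, means, variances)
```

Classic EM would carry the vector `soft` forward into the maximization step, while CEM replaces it with the single index returned by `classify`.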


The main disadvantages of the above model are that it takes the days of the week into account, which could increase the number of clusters combinatorially, and that the data used in [2] for analysis are incomplete (lost or stolen cards do not keep the same ID when they are replaced). Furthermore, a dedicated model needs to be developed if we want to better understand the motivations behind the cards' cluster changes.

A.0.2 Single-level time dependent path flow estimation model

Time-dependent origin-destination (OD) demand matrices are fundamental inputs for dynamic traffic assignment (DTA) models to describe network flow evolution as a result of interactions of individual travelers. With the aim of developing an internally consistent approach for the dynamic OD demand estimation problem, single-level path flow estimators (PFEs) have been proposed for the static OD estimation problem (e.g., the linear programming PFE by [9] for estimating deterministic UE path flows, and the nonlinear programming PFE by [1] for estimating stochastic UE path flows). Inspired by those works, the authors in [5] present a new path flow-based optimization model and an effective Lagrangian relaxation-based solution framework for jointly solving the complex OD demand estimation and UE DTA problems. Their model simultaneously minimizes the deviation between measured and estimated traffic states, as well as the deviation between aggregated path flows and target OD flows, subject to a dynamic user equilibrium (DUE) constraint, which is reformulated using an equivalent gap function. The proposed Lagrangian relaxation-based algorithm dualizes the gap function-based DUE constraint into the objective function and solves the single-level relaxation problem by reducing the difference between the upper and the lower bounds. This is different from previous research, which developed column generation algorithms to solve the VI-based single-level model.

The nonlinear program is formulated as follows, where $DNLF(r)$ denotes the given DNL function of path flows, based on Newell's simplified KW model (see [7] for more details).

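$$
\min Z = \beta_d \sum_{w} \Big[ \sum_{\tau \in H_d} \sum_{p} r(w,\tau,p) - \bar{d}(w) \Big]^2 + \sum_{l \in S} \sum_{t \in H_0} \Big\{ \beta_q \big[ q(l,t) - \bar{q}(l,t) \big]^2 + \beta_k \big[ k(l,t) - \bar{k}(l,t) \big]^2 \Big\}
$$

subject to

$$
(c, q, k) = DNLF(r),
$$

$$
g(r,\pi) = \sum_{w} \sum_{\tau} \sum_{p} \big\{ r(w,\tau,p) \, [\, c(w,\tau,p) - \pi(w,\tau) \,] \big\} = 0,
$$

$$
c(w,\tau,p) - \pi(w,\tau) \ge 0 \quad \forall\, p \in P(w,\tau),\ \forall\, w, \tau,
$$

$$
\pi(w,\tau) \ge 0 \quad \forall\, w, \tau,
$$

$$
r(w,\tau,p) \ge 0 \quad \forall\, w, \tau, p.
$$

Here $\bar{d}(w)$ is the target OD flow, and $\bar{q}(l,t)$ and $\bar{k}(l,t)$ are the measured counterparts of the estimated link flow and density.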

where

Sets
• $A$ = set of links
• $P$ = set of paths
• $H_d$ = set of discretized departure time intervals
• $W$ = set of OD pairs

List of indices
• $t$ = index of simulation time intervals ($t = 0, \ldots, T$)
• $\tau$ = index of departure time intervals ($\tau \in H_d$)
• $w$ = index of OD pairs ($w \in W$)
• $p$ = index of paths for each OD pair ($p \in P$)
• $l$ = index of links ($l \in A$)

Estimation variables
• $r(w,\tau,p)$ = estimated path flow on path $p$ of OD pair $w$ and departure time interval $\tau$
• $c = \{c(w,\tau,p)\ \forall\, w,\tau,p\}$ = estimated path travel time on path $p$ of OD pair $w$ and departure time interval $\tau$
• $\pi(w,\tau)$ = estimated least path travel time of OD pair $w$ and departure time interval $\tau$
• $q = \{q(l,t)\ \forall\, l,t\}$ = estimated number of vehicles passing through an upstream detector on link $l$ during observation interval $t$
• $k = \{k(l,t)\ \forall\, l,t\}$ = estimated density on link $l$ during observation interval $t$
• $d(w,\tau)$ = estimated demand of OD pair $w$ and departure time interval $\tau$
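The gap function $g(r,\pi)$ in the DUE constraint can be evaluated directly from the definitions above. The sketch below is a minimal illustration, assuming path flows and path travel times are stored in dictionaries keyed by $(w, \tau, p)$; the data layout, function name and numbers are hypothetical and are not taken from [5].

```python
def gap_function(r, c):
    """Return (g, pi), where pi[(w, tau)] is the least path travel time and
    g = sum over (w, tau, p) of r * (c - pi)."""
    pi = {}
    for (w, tau, p), travel_time in c.items():
        key = (w, tau)
        pi[key] = min(travel_time, pi.get(key, float("inf")))
    gap = sum(flow * (c[key] - pi[(key[0], key[1])]) for key, flow in r.items())
    return gap, pi

# One OD pair, one departure interval, three paths (hypothetical travel times).
c = {("w1", 0, "p1"): 10.0, ("w1", 0, "p2"): 12.0, ("w1", 0, "p3"): 10.0}
r_equilibrium = {("w1", 0, "p1"): 30.0, ("w1", 0, "p2"): 0.0, ("w1", 0, "p3"): 20.0}
r_violating = {("w1", 0, "p1"): 30.0, ("w1", 0, "p2"): 5.0, ("w1", 0, "p3"): 15.0}

gap_eq, pi = gap_function(r_equilibrium, c)
gap_bad, _ = gap_function(r_violating, c)
```

The gap is zero only when all flow uses least-time paths. In the second assignment, 5 units of flow on a path that is 2 time units slower give a gap of 10.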

The final solution is a set of path flows satisfying "tolled user equilibrium" (Lawphongpanich and Hearn, 2004), where the deviation with respect to traffic measurements can be viewed as an additional penalty for over-estimated or under-estimated path flows. By incorporating heterogeneous real-world measurements in the objective function, such as link densities from video surveillance and roadside detectors, the proposed estimation model fully utilizes the available information to reflect route choices in a congested network.

The main advantage of this formulation is that it can directly aggregate estimated path flows to obtain the final OD flow patterns, and it obviates explicit dynamic link-path incidences, as opposed to the majority of previous studies such as [1] or [4]. Moreover, the proposed OD demand estimation model circumvents the difficulty, present in most existing dynamic OD demand estimation methods, of providing complex mapping matrices between OD demand flows and those measurements. However, the authors did not explore the generalization of their modeling framework to the problems of real-time traffic state estimation and prediction. This would require further investigation into numerous issues, such as calibrating the maximum queue discharge rates, which critically affect flows on downstream links, and accommodating possible modeling errors and behavioral heterogeneity in the DUE assignment.

A.0.3 Bayesian modeling for large-scale dynamic network flow

The authors of [3] used internet browser traffic flows through domains of the Fox News website to present Bayesian analyses of two linked classes of models that allow fast, scalable and interpretable Bayesian inference. Their strategy is as follows:

1. Developed a class of Bayesian dynamic flow models (BDFMs), which are (non-stationary and non-normal) state-space models for streaming count data, to adaptively characterize and quantify network dynamics effectively and efficiently in real time.

2. Developed Poisson Dynamic Models and Multinomial Dynamic Models for describing network inflows and transitions between network nodes, respectively.

3. Utilized these efficiently implemented models as emulators of time-varying gravity models to allow closer and formal dissection of network dynamics.

4. Yielded interpretable inferences on traffic flow characteristics, and on dynamics in interactions among network nodes.

5. Developed a Bayesian model assessment methodology for sequential monitoring of flow patterns, with the ability to signal departures from predictions in real time and allow informed interventions in response.

A.0.3.1 Bayesian dynamic flow models (BDFMs)

Let $x_t$ be a time series with $x_t \mid \phi_t \sim P(m_t \phi_t)$, conditionally independent for $t = 1, 2, \ldots$, where $\phi_t$ is a latent process and $m_t$ is a scaling factor known at time $t$. Under a Markov model, the $\phi_t$ process evolves as

$$
\phi_t = \phi_{t-1} \frac{\eta_t}{\delta_t}, \qquad \eta_t \sim B\big(\delta_t r_t,\ (1-\delta_t) r_t\big),
$$

where $\eta_t$ is independent of $\eta_s$ and $\phi_s$ for $s < t$, $\delta_t \in (0,1)$ is a discount factor, $r_t$ is a given function of $t$ and $x_{0:t-1}$, and the independent innovations $\eta_t / \delta_t$ drive the evolution of the $\phi_t$ process.

Note: The beta distributions imply that (1) $E(\phi_t \mid \phi_{t-1}) = \phi_{t-1}$, so this is a multiplicative random walk model (i.e., "steady" evolution), and (2) a lower value of $\delta_t$ leads to a more diffuse distribution for $\eta_t / \delta_t$, and hence to increased uncertainty about $\phi_t$ and greater adaptability to changing rates over time.

The BDFM above ensures full conjugacy in the forward filtering (Bayesian sequential learning) over time. Here $x_0$ is a synthetic notation for the initial information.

• Forward Filtering (FF): At any time $t$, both the prior $p(\phi_t \mid x_{0:t-1})$ and the posterior $p(\phi_t \mid x_{0:t})$ for the "current" latent level are gamma distributions, with parameters that are updated as $t$ evolves.

• One-Step Forecasts: The one-step ahead forecast distribution made at time $t-1$ to predict time $t$ is generalized negative binomial with p.d.f.

$$
p(x_t \mid x_{0:t-1}, \delta_t) = \frac{\Gamma(\delta_t r_{t-1} + x_t)}{\Gamma(\delta_t r_{t-1})\, \Gamma(x_t + 1)} \cdot \frac{m_t^{x_t}\, (\delta_t c_{t-1})^{\delta_t r_{t-1}}}{(\delta_t c_{t-1} + m_t)^{\delta_t r_{t-1} + x_t}}
$$

The above model can be defined with any sequence of discount factors $\{\delta_t\}$. A constant value over time defines a global smoothing rate; values closer to 1 constrain the stochastic innovation and hence the change from $\phi_{t-1}$ to $\phi_t$, while smaller discount factor values lead to greater random changes in these Poisson levels. Interventions that specify smaller discount factors at some time points, to reflect or anticipate higher levels of dynamic variation at those times, are sometimes relevant. In the network flow models below, the specification of the sequence of discount factors is customized to address issues that arise in cases of low flow levels. That extension of discount-based modeling defines the $\delta_t$ as time-varying functions of an underlying base discount rate, and the latter are then evaluated using MML measures.
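The effect of the discount factor on the one-step forecast can be checked numerically. The sketch below evaluates the negative binomial forecast p.d.f. given above in log space and sums it over a truncated range; the values of $r_{t-1}$, $c_{t-1}$ and $\delta_t$ are hypothetical, and the truncation point `upper` is an assumption that is adequate only for small means like these.

```python
import math

def forecast_logpmf(x, r_prev, c_prev, delta, m=1.0):
    """Log of the one-step forecast p.d.f. for a count x >= 0."""
    a = delta * r_prev  # prior gamma shape, delta_t * r_{t-1}
    b = delta * c_prev  # prior gamma rate,  delta_t * c_{t-1}
    return (math.lgamma(a + x) - math.lgamma(a) - math.lgamma(x + 1)
            + x * math.log(m) + a * math.log(b) - (a + x) * math.log(b + m))

def forecast_moments(r_prev, c_prev, delta, m=1.0, upper=2000):
    """Total mass, mean and variance of the forecast by summation over 0..upper."""
    probs = [math.exp(forecast_logpmf(x, r_prev, c_prev, delta, m))
             for x in range(upper + 1)]
    total = sum(probs)
    mean = sum(x * p for x, p in enumerate(probs))
    var = sum((x - mean) ** 2 * p for x, p in enumerate(probs))
    return total, mean, var

# Same posterior at t-1, two different discount factors (hypothetical values).
total_hi, mean_hi, var_hi = forecast_moments(r_prev=20.0, c_prev=2.0, delta=0.9)
total_lo, mean_lo, var_lo = forecast_moments(r_prev=20.0, c_prev=2.0, delta=0.5)
```

Both forecasts have the same mean $m_t r_{t-1} / c_{t-1}$, but the smaller discount factor gives a larger forecast variance, in line with the remark that smaller values allow greater random changes in the Poisson levels.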

A.0.3.2 Network inflows: Poisson Dynamic Models

Adding suffixes $i$ for network nodes and setting the Poisson mean scaling factors to 1, the authors customized this model via the specification of discount factor sequences. At any node $i$, the time-$t$ inflow to node $i$ is $x_{0it} \sim P(\phi_{it})$, independently across nodes $i = 1:I$, and the latent levels $\phi_{it}$ follow node-specific gamma-beta discount models with discount factor $\delta_{it}$ at time $t$. The time $t \to t+1$ update/evolve steps are:

1. The time $t$ prior $\phi_{it} \mid x_{0i,0:t-1} \sim G(\delta_{it} r_{i,t-1},\ \delta_{it} c_{i,t-1})$ updates to the posterior $\phi_{it} \mid x_{0i,0:t} \sim G(r_{it}, c_{it})$ with $r_{it} = \delta_{it} r_{i,t-1} + x_{0it}$ and $c_{it} = \delta_{it} c_{i,t-1} + 1$.

2. This then evolves to the time $t+1$ prior $\phi_{i,t+1} \mid x_{0i,0:t} \sim G(\delta_{i,t+1} r_{it},\ \delta_{i,t+1} c_{it})$, and so on.

3. The discount factors $\delta_{it}$ relate to the information content of the gamma distributions as measured by the shape parameters $r_{i*}$; evolution at each time point reduces this by the discount factor, which represents a per-time-step decay of information induced by the stochastic evolution.

Node-specific MML measures feed into model assessment to aid in the selection of the baseline discount factors $d_i$. These measures of the short-term predictive fit of the models can also be monitored sequentially over time for online tracking of model performance.
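The update and evolve steps for a single node can be sketched in a few lines. The example below is a minimal illustration with a constant discount factor; the initial values `r0` and `c0` and the inflow counts are hypothetical.

```python
def filter_node(counts, delta, r0=1.0, c0=1.0):
    """Forward-filter one node's inflow counts with a constant discount factor.

    Returns a list of (r_t, c_t, posterior mean r_t / c_t) for each time step.
    """
    r, c = r0, c0
    path = []
    for x in counts:
        prior_shape, prior_rate = delta * r, delta * c  # time-t prior G(delta r, delta c)
        r, c = prior_shape + x, prior_rate + 1.0        # time-t posterior G(r_t, c_t)
        path.append((r, c, r / c))
    return path

inflows = [12, 15, 9, 30, 28]           # hypothetical inflow counts x_{0it}
smooth = filter_node(inflows, delta=0.95)
adaptive = filter_node(inflows, delta=0.50)
```

With the smaller discount factor, older counts lose weight faster, so the final posterior mean sits closer to the most recent (higher) inflows.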

A.0.3.3 Transitions from Network Nodes: Multinomial Dynamic Models

Transitions from any node $i$ at time $t$ are inherently multinomial with time-varying transition probabilities. Building flexible and scalable models for the dynamics and dependencies in transition probability vectors is a challenge, with computational issues quickly dominating even for simple models. The authors extend the univariate Poisson/gamma-beta random walk models to enable flexibility in modeling node-pair specific effects as they vary over time, as well as scalability.

The core model is $x_{i,0:I,t} \sim Mn(n_{i,t-1},\ \theta_{i,0:I,t})$, where the current occupancy level of node $i$ is $n_{i,t-1}$ and $\theta_{i,0:I,t}$ is the $(I+1)$-vector of transition probabilities $\theta_{ijt}$ (including the "external" node, i.e., leaving the network, at $j = 0$). The decoupled BDFMs are

$$
x_{ijt} \sim P(m_{it}\, \phi_{ijt}), \qquad m_{it} = \frac{n_{i,t-1}}{n_{i,t-2}},
$$

independently, with independent gamma-beta evolutions for each latent level $\phi_{ijt}$. These BDFMs can be customized for each node pair with node-pair specific discount factors, allowing greater or lesser degrees of variation by node pair. The set of models for the elements of $\phi_{i,0:I,t}$ implies a dynamic model for the vector of transition probabilities $\theta_{i,0:I,t}$, which has elements

$$
\theta_{ijt} = \frac{\phi_{ijt}}{\sum_{j=0:I} \phi_{ijt}}.
$$

Independence across nodes enables scaling, as the analyses can be decoupled and run in parallel for the $\phi_{ijt}$ and then recoupled to infer the $\theta_{ijt}$.

Now, the decoupled, scaled models are not predictive of overall occupancy; rather, they are decoupled, tractable models that are relevant to tracking and short-term prediction of relative occupancy levels through the implied multinomial probabilities. In sequential analysis of transitions, the node-pair specific models generate full joint predictions one step ahead (or more, if desired) for the theoretically exact set of multivariate flow vectors $x_{i,0:I,t}$ across all nodes.
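The "decouple, then recouple" idea can be illustrated for one origin node. The sketch below normalizes latent levels $\phi_{ijt}$ into transition probabilities $\theta_{ijt}$, and then repeats this on independent gamma draws to approximate a posterior for $\theta$. The gamma shapes and rates are hypothetical stand-ins for the filtered $(r, c)$ values of each node pair.

```python
import random

def transition_probs(phi_row):
    """Recouple: theta_ij = phi_ij / sum_j phi_ij for one origin node i."""
    total = sum(phi_row)
    return [value / total for value in phi_row]

def sample_theta(shapes, rates, rng):
    """Draw each phi_ij independently from G(shape, rate), then normalize."""
    # random.gammavariate takes (shape, scale), so scale = 1 / rate.
    phi = [rng.gammavariate(r, 1.0 / c) for r, c in zip(shapes, rates)]
    return transition_probs(phi)

# Point values for j = 0 (external node), 1, 2, 3.
theta_point = transition_probs([0.5, 2.0, 1.0, 0.5])

# Monte Carlo version with hypothetical posterior parameters per node pair.
rng = random.Random(0)
draws = [sample_theta([3.0, 12.0, 6.0, 3.0], [6.0, 6.0, 6.0, 6.0], rng)
         for _ in range(1000)]
```

Because each $\phi_{ijt}$ is sampled on its own, the per-pair analyses could run in parallel; only the final normalization needs all destinations of a node together.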

A.0.4 Model Mapping for Bayesian Emulations of Dynamic Gravity Models (DGMs) by BDFMs

The DGM is defined as follows: for each network node $i = 1:I$ and all $j = 0:I$,

$$
\phi_{ijt} = \mu_t\, \alpha_{it}\, \beta_{jt}\, \gamma_{ijt}
$$

with (i) a baseline process $\mu_t$; (ii) a node $i$ main effect process $\alpha_{it}$, adjusting the baseline intensity of flows (the origin or outflow parameter process for node $i$); (iii) a node $j$ main effect process $\beta_{jt}$, representing the additional "attractiveness" of node $j$ (the destination or inflow parameter process for node $j$); and (iv) an interaction term $\gamma_{ijt}$, representing the directional "affinity" of node $i$ for $j$ over time relative to the combined contributions of the baseline and main effects.

The authors also commented that analysis via MCMC is computationally very demanding, with a burden that increases quadratically in $I$, and is inherently non-sequential.

Now, the mapping to DGM parameters in eqn. (13) requires aliasing constraints to match dimensions. [5] then defined $h_t = \log(\mu_t)$, $a_{it} = \log(\alpha_{it})$, $b_{jt} = \log(\beta_{jt})$ and $g_{ijt} = \log(\gamma_{ijt})$. Using the $+$ notation to denote summation over the range of the identified indices, the constraints are $a_{+t} = b_{+t} = 0$ and $g_{+jt} = g_{i+t} = 0$ for all $i, j, t$. We then have a bijective map between BDFM and DGM parameters; given the $\phi_{ijt}$, we can directly compute the implied, identified DGM parameters. The emulating BDFM enforces smoothness over time in the parameter process trajectories, and this acts to substantially reduce the effective model dimension.

Define $f_{ijt} = \log(\phi_{ijt})$ for each $i = 1:I$, $j = 0:I$ at each time $t = 1:T$. Then at each time $t$, we compute the following in order:

• The baseline level $\mu_t = e^{h_t}$ where $h_t = \dfrac{f_{++t}}{I(I+1)}$

• For each $i = 1:I$, the origin node main effect $\alpha_{it} = e^{a_{it}}$ where $a_{it} = \dfrac{f_{i+t}}{I+1} - h_t$

• For each $j = 0:I$, the destination node main effect $\beta_{jt} = e^{b_{jt}}$ where $b_{jt} = \dfrac{f_{+jt}}{I} - h_t$

• For each $i = 1:I$ and $j = 0:I$, the affinity $\gamma_{ijt} = e^{g_{ijt}}$ where $g_{ijt} = f_{ijt} - h_t - a_{it} - b_{jt}$

The authors then apply this to all simulated $\phi_{ijt}$ from the full posterior analysis under the BDFM to map to posteriors for the DGM parameter processes.
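The mapping from BDFM levels to DGM parameters at a single time point can be sketched as below. It follows the four computations listed above for one matrix of $\phi_{ijt}$ values; the function name and the example numbers are hypothetical, and the sparse-flow adjustment is not included.

```python
import math

def bdfm_to_dgm(phi):
    """Map phi[i][j] (rows i = 1..I, columns j = 0..I) to (h, a, b, g) on the log scale."""
    n_rows = len(phi)        # I origin nodes
    n_cols = len(phi[0])     # I + 1 destinations, including the external node
    f = [[math.log(v) for v in row] for row in phi]
    h = sum(sum(row) for row in f) / (n_rows * n_cols)              # baseline
    a = [sum(f[i]) / n_cols - h for i in range(n_rows)]             # origin effects
    b = [sum(f[i][j] for i in range(n_rows)) / n_rows - h
         for j in range(n_cols)]                                    # destination effects
    g = [[f[i][j] - h - a[i] - b[j] for j in range(n_cols)]
         for i in range(n_rows)]                                    # affinities
    return h, a, b, g

# Hypothetical latent levels for I = 2 nodes (3 destination columns).
phi = [[0.2, 1.5, 3.0],
       [0.4, 2.5, 0.8]]
h, a, b, g = bdfm_to_dgm(phi)

# Rebuild phi from the DGM parameters to confirm the map is invertible.
rebuilt = [[math.exp(h + a[i] + b[j] + g[i][j]) for j in range(3)] for i in range(2)]
```

The computed effects satisfy the identifiability constraints ($a_{+t} = b_{+t} = 0$, $g_{+jt} = g_{i+t} = 0$), and exponentiating $h_t + a_{it} + b_{jt} + g_{ijt}$ recovers each $\phi_{ijt}$.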

The disadvantage of this mapping approach arises in cases of sparse flows, i.e., when the counts $x_{ijt}$ are zero or very small for multiple node pairs. In such cases the posterior for $\phi_{ijt}$ favors very small values and the log transforms are large and negative, which unduly impacts the resulting overall mean and/or the origin or destination means. While one can imagine model extensions to address this, at a practical level it suffices to adjust the mapping as is typically done in related problems of log-linear models of contingency tables with structural zeros. This is implemented by simply restricting the summations in the identifiability constraints to node pairs for which $x_{ijt} > d$, for some small $d$, and adjusting the divisors to count the number of terms in each summation.

A.0.5 Bayesian Inference on network traffic using link count data

In [10], the authors considered a fixed network of $n$ nodes, arbitrarily labeled A, B, ..., and addressed the problem of estimating the actual counts of messages travelling between pairs of nodes in the network, based on observed traffic counts on all individual directed links in the network (links without intervening nodes). Let $r$ be the total number of directed links in the network, let $s = (i, j)$ represent the directed link from node $i$ to node $j$, and let $Y_s$ be the traffic count on this link. Given the observed link counts $Y := (Y_1, \ldots, Y_r)^T$, they inferred the OD counts $X := (X_1, \ldots, X_c)^T$ using the relationship $Y = AX$, where $A$ is the $r \times c$ routing matrix $\{A_{s,a}\}$, with $A_{s,a} = 1$ if the directed link $s$ belongs to the directed route through the network between OD pair $a$, and $A_{s,a} = 0$ otherwise. From an algebraic perspective, $Y$ imposes a set of linear constraints on $X$. Note that $(AA^T)_{a,a}$ counts the number of OD routes passing through link $a$, and $(AA^T)_{a,b}$ counts the number of routes that pass through both links $a$ and $b$.
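A toy network makes the role of the routing matrix concrete. The sketch below assumes a hypothetical three-node line A → B → C with two directed links and three OD pairs; it is not the network studied in [10].

```python
def matvec(A, x):
    """Compute Y = A X for a list-of-rows matrix."""
    return [sum(a * v for a, v in zip(row, x)) for row in A]

def gram(A):
    """Compute A A^T: route counts per link and per pair of links."""
    return [[sum(u * v for u, v in zip(row_s, row_t)) for row_t in A] for row_s in A]

# Rows: links A->B and B->C. Columns: OD pairs (A,B), (B,C), (A,C).
# The A->C route uses both links, so its column has two ones.
A = [[1, 0, 1],
     [0, 1, 1]]
X = [5, 3, 2]        # hypothetical OD counts
Y = matvec(A, X)     # observed link counts
AAt = gram(A)
```

Here the link counts are $Y = (7, 5)^T$. Each diagonal entry of $AA^T$ is 2, because two routes use each link, and the off-diagonal entry is 1, because only the A → C route uses both links. Since $c = 3 > r = 2$, the link counts alone do not determine $X$.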

Now, the authors aimed to compute the posterior distribution $p(X \mid Y)$ for all route counts $X$ given the observed link counts $Y$. To solve this problem, they assume that $X$ is generated from a collection of independent Poisson distributions for the elements $X_a$ (i.e., $X_a \sim P(\lambda_a)$, independent over $a$). With $\Lambda = (\lambda_1, \ldots, \lambda_c)$, the prior joint model is then

$$
p(X, \Lambda) = p(\Lambda) \prod_{a=1}^{c} \frac{\lambda_a^{X_a} e^{-\lambda_a}}{X_a!}.
$$

To find the posterior $p(X, \Lambda \mid Y)$, they developed iterative MCMC simulation methods, in particular Gibbs sampling, in which they iteratively resample from the conditional posteriors for the elements of the $X$ and $\Lambda$ variables.

Note that $p(\Lambda \mid X, Y) \equiv p(\Lambda \mid X) = \prod_{a=1}^{c} p(\lambda_a \mid X_a)$, whose components have the form of the prior density $p(\lambda_a)$ multiplied by the gamma form arising in the Poisson-based likelihood function. Thus, conditioning on $X$, the authors simulated new $\Lambda$ values as a set of independent draws from the implied univariate posteriors. For this simulation, they embedded a Metropolis-Hastings step in the MCMC scheme. Finally, fixing $\Lambda$, they deduced the posterior distribution $p(X \mid \Lambda, Y)$, with the constraints imposed on $X$ by the equation $Y = AX$, using simulation in the MCMC scheme. To simplify the computations for this inference, they used the following result.

Theorem. Assume $A$ is of full rank $r$. Then the columns of $A$ can be reordered so that the revised routing matrix has the form $[A_1, A_2] = A$, where $A_1$ is a non-singular $r \times r$ matrix. Also, reordering the elements of the $X$ vector in the same way and partitioning $X^T = (X_1^T, X_2^T)$, it follows that $X_1 = A_1^{-1}(Y - A_2 X_2)$.
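The partition in the theorem can be tried on a small example. The sketch below uses a hypothetical two-link, three-route network whose columns are already ordered so that $A_1$ is non-singular, and solves for $X_1$ with a small exact Gauss-Jordan routine (a generic solver chosen here for self-containment, not the method of [10]).

```python
from fractions import Fraction

def solve(M, rhs):
    """Solve M z = rhs by Gauss-Jordan elimination with exact fractions."""
    n = len(M)
    aug = [[Fraction(v) for v in row] + [Fraction(b)] for row, b in zip(M, rhs)]
    for col in range(n):
        pivot = next(i for i in range(col, n) if aug[i][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for i in range(n):
            if i != col and aug[i][col] != 0:
                factor = aug[i][col]
                aug[i] = [vi - factor * vc for vi, vc in zip(aug[i], aug[col])]
    return [row[-1] for row in aug]

def x1_given_x2(A1, A2, Y, X2):
    """X1 = A1^{-1} (Y - A2 X2)."""
    resid = [y - sum(a * x for a, x in zip(row, X2)) for y, row in zip(Y, A2)]
    return solve(A1, resid)

# Hypothetical routing matrix A = [A1, A2] with r = 2 links and c = 3 routes.
A1 = [[1, 1],
      [0, 1]]
A2 = [[0],
      [1]]
Y = [7, 5]

x1_feasible = x1_given_x2(A1, A2, Y, [3])    # X2 = (3,)
x1_infeasible = x1_given_x2(A1, A2, Y, [6])  # X2 = (6,) forces a negative count
```

With $X_2 = 3$ the implied $X_1 = (5, 2)^T$ is a valid set of counts, while $X_2 = 6$ gives a negative entry. This is why the support of each free element has to be identified before it is resampled.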

Using this theorem, the conditional distribution $p(X \mid \Lambda, Y)$ is concentrated in a subspace of dimension $c - r$ defined by the partition $[A_1, A_2] = A$ of the routing matrix $A$. This posterior has the form $p(X_1 \mid X_2, \Lambda, Y)\, p(X_2 \mid \Lambda, Y)$, where $p(X_1 \mid X_2, \Lambda, Y)$ is degenerate at $X_1 = A_1^{-1}(Y - A_2 X_2)$ and, with $X_2 = (X_{r+1}, \ldots, X_c)^T$ defining $X_1 = (X_1, \ldots, X_r)^T$ as earlier,

$$
p(X_2 \mid \Lambda, Y) \propto \prod_{a=1}^{c} \frac{\lambda_a^{X_a}}{X_a!}, \qquad X_a \ge 0 \ \text{for all } a = 1, \ldots, c.
$$

This is the product of independent Poisson priors for the $X_i$, constrained by the identity $Y = AX$ rewritten in the form $X_1 = A_1^{-1}(Y - A_2 X_2)$. Now, considering each element $X_i$ of $X_2$ ($i = r+1, \ldots, c$) and writing $X_{2,-i}$ for the remaining elements, the authors obtained the conditional distribution

$$
p(X_i \mid X_{2,-i}, \Lambda, Y) \propto \frac{\lambda_i^{X_i}}{X_i!} \prod_{a=1}^{r} \frac{\lambda_a^{X_a}}{X_a!} \qquad (14)
$$

where $X_i \ge 0$ for $i = r+1, \ldots, c$, and $X_a \ge 0$ for each $a = 1, \ldots, r$.

Gibbs and Metropolis-Hastings Algorithms

Fix starting values of the route counts $X$ and proceed as follows:

1. Draw sampled values of the rates $\Lambda$ from the $c$ conditionally independent posteriors $p(\lambda_a \mid X_a)$.

2. Conditioning on these values of $\Lambda$, simulate a new $X$ vector by sequencing through $i = r+1, \ldots, c$ and, at each step, sampling a new $X_i$ from (14), with the conditioning elements $X_{2,-i}$ set at their most recent sampled values; at each step $X_1$ is explicitly reevaluated via $X_1 = A_1^{-1}(Y - A_2 X_2)$ as a function of the most recently sampled elements of $X_2$.

3. Return to step 1 and iterate.

This is a standard Gibbs sampling setup in which the scalar elements of both $\Lambda$ and $X$ are resampled from the relevant distribution, conditional on the most recently simulated values of all other uncertain quantities. The sampling steps in 1 are easy. The sampling steps in 2 require evaluation of the support of (14), and subsequent evaluation of the unnormalized posterior (14) at each step. Sampling may be performed directly, treating (14) as a simple multinomial distribution on this relevant range. But in larger, more realistic networks, the implied evaluation of (14) across what may be a very large support, at each iteration and for each element $X_i$, leads to a computational burden that may be excessive compared with alternative approaches; identifying the support of (14) alone can become computationally very burdensome in networks of even moderate size.

A more efficient algorithm is based on embedding Metropolis-Hastings steps within the Gibbs sampling framework. Specifically, assume a fixed proposal distribution with probability mass function $q_i(X_i)$ for each element $X_i$ in step 2. A candidate value $X_i^*$ is drawn from $q_i(\cdot)$ and accepted with probability

$$
\min\left[ 1,\ \frac{p_i(X_i^*)\, q_i(X_i)}{p_i(X_i)\, q_i(X_i^*)} \right]
$$

where $X_i$ is the current, most recently sampled value and $p_i(\cdot)$ is the normalized conditional posterior in equation (14). From the structure of the network equations in (1), it is possible to identify bounds on each $X_i$ so that a suitable range for the proposal distribution can be computed. Then, based on the specified bounds, the implied vector $X_1$ in (14) is recomputed and checked for feasibility, that is, for nonnegative values. If any element of $X_1$ is negative, then the trial value of $X_a$ is either incremented, in searching for the lower bound on its range, or decremented, in searching for the upper bound. This process terminates and delivers the resulting bounds once the $X_1$ vector has $r$ nonnegative entries.
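The acceptance probability can be computed in log space from the unnormalized conditional (14), since the normalizing constant cancels in the ratio. The sketch below assumes a hypothetical network with one free route count $X_3$, where the link-count constraints give $X_1 = 2 + X_3$ and $X_2 = 5 - X_3$, and a uniform proposal over the feasible range; the rates are made up for illustration.

```python
import math

def log_poisson_kernel(x, lam):
    """log of lam^x / x!, the X-dependent part of a Poisson p.m.f."""
    return x * math.log(lam) - math.lgamma(x + 1)

def log_conditional(x3, lam):
    """Unnormalized log of (14) for the free element X3; -inf if infeasible."""
    x1, x2 = 2 + x3, 5 - x3          # X1 recomputed from the link-count constraints
    if min(x1, x2, x3) < 0:
        return float("-inf")
    return (log_poisson_kernel(x3, lam[2])
            + log_poisson_kernel(x1, lam[0])
            + log_poisson_kernel(x2, lam[1]))

def mh_accept_prob(log_p_new, log_p_old, log_q_new, log_q_old):
    """min(1, p(new) q(old) / (p(old) q(new))) evaluated on the log scale."""
    log_ratio = (log_p_new + log_q_old) - (log_p_old + log_q_new)
    return 1.0 if log_ratio >= 0 else math.exp(log_ratio)

lam = (5.0, 2.0, 3.0)                # hypothetical rates (lambda_1, lambda_2, lambda_3)
log_q = -math.log(6)                 # uniform proposal on the feasible range {0, ..., 5}
accept_4 = mh_accept_prob(log_conditional(4, lam), log_conditional(3, lam), log_q, log_q)
accept_6 = mh_accept_prob(log_conditional(6, lam), log_conditional(3, lam), log_q, log_q)
```

Moving from $X_3 = 3$ to $X_3 = 4$ is accepted with probability $\tfrac{3}{4} \cdot \tfrac{5}{6} \cdot \tfrac{2}{2} = 0.625$, while a candidate outside the feasible range is never accepted.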

Theoretical assurance that the MCMC algorithm so defined converges, that is, ultimately generates samples from the true joint posterior $p(X, \Lambda \mid Y)$, follows if we can establish that the Markov chain is irreducible. This is equivalent to determining whether or not the current value $(X, \Lambda)$ can "move" to any other point in the joint parameter space within a finite number of iterations of steps (1) and (2). For the elements of $\Lambda$ there is no problem, because of the continuous priors with fixed support. But for $X$, the support of the conditional posteriors (14) depends on the resampled values of the elements of $X_2$, and so it changes after each iteration. It can be shown, however, that $X_2$ is in fact free to move arbitrarily around its parameter space in consecutive iterations, despite the support constraints and complications. Thus the resulting chain is irreducible, and convergence is assured.

A.1 Use-Case Problem

From now on, we denote the distributions for the prior and the likelihood as P1 and P2, respectively. We then apply the Hierarchical Bayesian model demonstrated above, with the prior and likelihood both following Normal distributions. In this particular case, we use both the Kalman filter (since an analytical solution exists in this case) and MCMC to obtain the posterior distribution for $d \mid d_h, A$. The sample network we consider comprises eight bus stops, every pair of which is connected either directly or indirectly.

Figure 19: Sample network with 8 bus stops

Denote the set of bus stops as $I = \{A, B, C, D, E, F, G, H\}$, the historical demand as $d_h = (d_h)_{ij}$ where $i, j \in I$, and the demand from $i$ to $j$ as $d_{ij}$. The true demand is denoted by the vector $d = (d)_{ij}$. We now construct the routing matrix $A$ using the following heuristic algorithm:

A.1.1 Routing Matrix Construction Heuristic Algorithm

1. Initialize a zero routing matrix $A$ whose columns are labeled $ij$ ($i \neq j$) if stops $i$ and $j$ are connected directly. Row $i$ corresponds to the ON count at stop $i$. Set $(a_{ij})_{\text{row } i} = 1$.

2. Transitivity Rule

2.1. For any two columns $(ij)$ and $(jk)$, add a new column $ik$ to the end of the matrix.

2.2. If $(a_{jk})_{\text{row } i} = 1$ and $(a_{kl})_{\text{row } i} = 1$, set $(a_{jl})_{\text{row } i} = 1$.
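Step 1 and rule 2.1 of the heuristic can be sketched as follows. This is an illustrative reading of the algorithm, not the exact procedure used to produce the matrices in this report: it applies rule 2.1 in a single pass, places the entry of each new column $ik$ in the row of its origin stop $i$, and does not implement rule 2.2. The example uses only stops A to D and their direct links.

```python
def build_routing_matrix(stops, direct_links):
    """Return (column labels, 0/1 matrix) with one row per stop's ON count."""
    columns = list(direct_links)
    # Rule 2.1 (single pass): columns ij and jk generate a new column ik.
    for (i, j) in direct_links:
        for (j2, k) in direct_links:
            if j == j2 and i != k and (i, k) not in columns:
                columns.append((i, k))
    # Row for stop s has a 1 in every column whose origin is s.
    matrix = [[1 if origin == stop else 0 for (origin, _) in columns]
              for stop in stops]
    return columns, matrix

stops = ["A", "B", "C", "D"]
direct = [("A", "B"), ("A", "D"), ("B", "C"), ("B", "D"),
          ("B", "A"), ("D", "B"), ("C", "A")]
columns, A_small = build_routing_matrix(stops, direct)
new_columns = ["".join(pair) for pair in columns[len(direct):]]
```

For these four stops, the pass adds the indirect pairs AC, DA, DC, CB and CD, and each row sums the demands that board at the corresponding stop.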

Based on the heuristic algorithm above and the structure of our sample network in Figure 19, we obtain the following routing matrix $A$:

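• Step 1. Fill out all entries $(a_{ij})_{\text{row } i} = 1$ for directly connected pairs of stops $i$ and $j$:

$$
H = \begin{array}{l|ccccccccccccccc}
 & AB & AD & BC & BD & BA & DB & CA & EF & EH & FG & FH & GF & GH & HE & HF \\ \hline
(d_A)_{ON} & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
(d_B)_{ON} & 0 & 0 & 1 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
(d_C)_{ON} & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
(d_D)_{ON} & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
(d_E)_{ON} & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
(d_F)_{ON} & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 & 0 & 0 & 0 & 0 \\
(d_G)_{ON} & 0 & 0 & 0 & 0 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 1 & 1 & 0 & 0 \\
(d_H)_{ON} & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 1
\end{array}
$$

• Step 2. Apply the Transitivity Rule to every row (note that $H_{i,\cdot}$ denotes row $i$ of matrix $H$):

$$
A = \begin{array}{l|c|ccccccccc}
 & \text{column labels of } H & AC & DA & DC & CB & CD & EG & FE & GE & HG \\ \hline
(d_A)_{ON} & H_{1,\cdot} & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
(d_B)_{ON} & H_{2,\cdot} & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
(d_C)_{ON} & H_{3,\cdot} & 0 & 0 & 0 & 1 & 1 & 0 & 0 & 0 & 0 \\
(d_D)_{ON} & H_{4,\cdot} & 0 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
(d_E)_{ON} & H_{5,\cdot} & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\
(d_F)_{ON} & H_{6,\cdot} & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\
(d_G)_{ON} & H_{7,\cdot} & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\
(d_H)_{ON} & H_{8,\cdot} & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1
\end{array}
$$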

Since the size of $A$ is $8 \times 16$, the estimated demand variable $d$ must have size $16 \times 1$. Each component of $d$ must match both the column labels of $A$ and the corresponding row of $A$ (for example, the first three components of $d$ must be $d_{AB}$, $d_{AD}$ and $d_{AC}$ because row 1 of $A$ corresponds to $(d_A)_{ON}$). Since, by assumption, the prior and likelihood both follow Normal distributions, we have $P(x \mid d) \sim N(Ad, \sigma_1)$ and $P(d) \sim N(d_h, \sigma_2)$, where the discrete noise terms are $\sigma_1 \sim N(0, 10)$ and $\sigma_2 \sim N(0, 5)$ (the ranges of these normal distributions are chosen according to how far we expect the historical demand to deviate from the true demand, and how good we expect our estimated demand to be). Since this is a use-case example, I simulate discrete $x \sim U(400, 600)$. Since the equation $A d_h = x$ does not have an exact solution, I recovered $d_h$ in the least-squares sense by minimizing $\|A d_h - x\|_2^2$. Using equation (1), we only need to compute the unnormalized posterior $p(x \mid d)\, p(d) = N(Ad, \sigma_1)\, N(d_h, \sigma_2)$. To compute this, we show two methods: the Kalman filter and MCMC simulation. To show the flexibility of MCMC simulation compared to the Kalman filter, we also apply MCMC to obtain the posterior distribution $P(d \mid x, d_h)$ for the case when $d_h$ is not available. Finally, we generate histograms to show the distribution of each component of the estimated demand $d$ obtained by MCMC simulation, and compare $d$ to the simulated "true" demand $d'$, where $d' = d_h + \sigma_2$.

A.1.2 Kalman-Filter - Analytical Solution

Since we need to compute the distribution $p(d \mid d_h, x) \propto p(x \mid d)\, p(d \mid d_h) \sim N(Ad, \sigma_1)\, N(d_h, \sigma_2)$, which results in another normal distribution $N(d, \Sigma_{cov})$, the Kalman filter gives the following analytical formulas for the mean $d$ and the covariance matrix $\Sigma_{cov}$ at a given time $t$:

$$
d_t := K (x_{t-1} - A d_{t-1}) + (d_h)_{t-1} \qquad (2)
$$

where

$$
K = \sigma_2^2 A^T (A \sigma_2^2 A^T + \sigma_1^2)^{-1}, \qquad (3)
$$

$$
\Sigma_{cov} := (I - KA)\, \sigma_2^2 \qquad (4)
$$

and $I$ is an identity matrix.
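Equations (2) to (4) can be worked through by hand when there is a single ON count, because the matrix inverse in (3) reduces to a scalar division. The sketch below is a minimal illustration under that assumption, additionally taking $d_{t-1}$ in (2) to be the historical demand $d_h$ and treating $\sigma_1$, $\sigma_2$ as standard deviations; the numbers are hypothetical and unrelated to the results reported below.

```python
def posterior_single_count(a_row, d_h, x, sigma1, sigma2):
    """Posterior mean and covariance for one observation x ~ N(a_row . d, sigma1^2)
    with prior d ~ N(d_h, sigma2^2 I)."""
    n = len(d_h)
    # A sigma2^2 A^T + sigma1^2 is a scalar when A has a single row.
    s = sigma2 ** 2 * sum(a * a for a in a_row) + sigma1 ** 2
    K = [sigma2 ** 2 * a / s for a in a_row]                         # eq. (3)
    innovation = x - sum(a * d for a, d in zip(a_row, d_h))
    mean = [d + k * innovation for d, k in zip(d_h, K)]              # eq. (2)
    cov = [[((1.0 if i == j else 0.0) - K[i] * a_row[j]) * sigma2 ** 2
            for j in range(n)] for i in range(n)]                    # eq. (4)
    return mean, cov

# One stop whose ON count is the sum of two OD demands (hypothetical values).
mean, cov = posterior_single_count([1, 1], d_h=[40.0, 30.0], x=76.0,
                                   sigma1=2.0, sigma2=1.0)
```

The observed count exceeds the historical total by 6, and the gain $K = (1/6, 1/6)^T$ splits one unit of that surplus onto each demand, giving a posterior mean of $(41, 31)^T$. The negative off-diagonal covariance reflects that only the sum of the two demands is observed.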

Using R to perform the matrix multiplications and subtractions needed to compute $K$, and substituting it into equations (2) and (4), I obtain the mean and covariance matrix of the posterior distribution $N(d, \Sigma_{cov})$:

$$
d = [\,40.83\ \ 33.57\ \ 53.31\ \ 24.72\ \ 39.23\ \ 21.15\ \ 29.83\ \ 21.19\ \ 30.23\ \ 20.54\ \ 32.13\ \ 46.83\ \ 18.23\ \ 42.59\ \ 12.83\ \ 16.11\,]^T
$$

$$
\Sigma_{cov} = \begin{bmatrix}
1 & -0.79 & -0.43 & \cdots & -0.10 & 0.02 \\
-0.79 & 1 & -0.60 & \cdots & 0.024 & -0.016 \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
0.023 & -0.016 & -0.05 & \cdots & -1.32 & 1
\end{bmatrix}
$$

A.1.3 MCMC Simulation - Numerical Solution

As shown above, we need to compute $p(d \mid d_h, x) \propto p(x \mid d)\, p(d \mid d_h) \sim N(Ad, \sigma_1)\, N(d_h, \sigma_2)$. We use Stan, a probabilistic modeling language for statistical inference with a simple interface to R, and its "sampling()" function in R to perform MCMC simulation in two different scenarios:

A.1.4 Historical Demand (dh ) is available

In this first scenario, we assume that the currently observed OD flows are close to the historically observed values. To simulate the values of these current OD flows, we add a noise term $\varepsilon \sim N(0, 5)$ to our historical demand $d_h$ to account for measurement errors while taking the historical OD flows into account. Using the "sampling()" function in R, I perform MCMC to estimate the current OD flows with 3 chains of 500 iterations each. The results show a close match between our estimated demand $d$ and the simulated true demand, as reflected in Figure 21. Furthermore, I generate histograms for each component of the estimated demand $d$ to observe its distribution. The histograms for the first four components of the estimated demand $d$ are displayed in Figure 20.


Figure 20: Distributions of the first four components of the true demand d

Finally, we average over the three chains of the MCMC simulation, stored in a 3-dimensional object in R, to obtain the approximate mean value of each component of the estimated demand $d$:

$$
d_{MCMC} = [\,41.12\ \ 34.22\ \ 53.35\ \ 24.96\ \ 40.51\ \ 22.75\ \ 30.48\ \ 22.15\ \ 31.33\ \ 21.65\ \ 32.09\ \ 45.96\ \ 19.09\ \ 42.19\ \ 12.93\ \ 16.51\,]^T
$$

We compare the mean $d_{MCMC}$ with the Kalman filter mean $d_{KF}$ by computing the L2-norm error $\|d_{MCMC} - d_{KF}\|_2^2 \approx 6.989$. This is sufficiently small because the true data $d$ is discrete and on the order of hundreds, so the estimated demand obtained by MCMC is fairly accurate.


Figure 21: Comparison of estimated demand against the historical and true demands, respectively

A.1.5 Historical Demand (dh ) is unknown

In this second scenario, we assume that the historical OD flows and the currently observed ones are learned simultaneously (so $d_h$ is no longer known; instead, we simulate $d_h \sim P(24, \lambda)$, where $\lambda$ is drawn randomly 24 times from a uniform distribution $U(0, 40)$). Adding a noise term $\varepsilon \sim N(0, 5)$ to our historical demand $d_h$ to account for measurement errors while taking the historical OD flows into account, we estimate the current OD flows with MCMC (again using 3 chains with 500 iterations each). The results allow me to generate histograms displaying the distributions of the first four components of the estimated demand $d$.


Figure 22: Distributions of the first four components of the true demand d

I then proceed exactly as in the first scenario and obtain the mean value of our estimated demand $d$:

$$
d_{MCMC} = [\,41.53\ \ 37.83\ \ 49.52\ \ 30.15\ \ 40.51\ \ 22.75\ \ 30.48\ \ 22.15\ \ 23.31\ \ 27.55\ \ 23.90\ \ 41.69\ \ 20.91\ \ 24.25\ \ 11.23\ \ 19.15\,]^T
$$

I then compare the mean $d$ obtained by MCMC with the mean $d_{KF}$, and obtain the L2-norm error $\|d_{MCMC} - d_{KF}\|_2^2 \approx 9.2$. This is sufficiently small because the simulated true demand data is on the order of hundreds.


Figure 23: Comparison of estimated demand d and simulated true demand

B Heatmap Analysis for APC and Ventra dataset

Using the "ggmap" package in R, I generated several heat maps of the average and total APC "ON" and "OFF" counts per hour of the day, day of the week and day of the month, to gain a better understanding of the spatial locations where people travel with PACE's buses within the Northwest region of Chicago (note that our APC dataset contains only buses starting from one garage located in the Northwest region of Chicago). We observed the following results:

• The heat map of average APC OFF per hour of the day shows that most people, on average, get off at a few common sub-regions at each hour of the day, while that of average APC ON (also per hour of the day) shows no heat at almost every hour except 4 am (the heat location at 4 am has longitude ∈ (−87.9, −87.8) and latitude ∈ (41.9, 42.0)). This implies that, on average, at each hour of the day, the origins of most bus trips are spread out over the whole area, while the destinations of those trips concentrate in just a few sub-regions (for example, between 5 and 7 in the morning, most people get off in the rectangular region with longitude ∈ (−88, −87.7) and latitude ∈ (41.9, 42)).

• The heat map of average APC OFF per day in the month of October also concentrates on a few sub-regions (those with longitude ∈ (−87.9, −87.7) and latitude ∈ (42.1, 42.2) or (41.9, 42.0)), and that of average APC ON per day in October also concentrates on a particular sub-region, with longitude ∈ (−88, −87.7) and latitude ∈ (41.9, 42). This means that, on average, during each day of the month, people board at bus stops within a particular sub-region and get off in either the same or another specific sub-region.

• The heat map of average APC OFF per day of the week, except Sunday, concentrates on a particular sub-region with longitude ∈ (−87.9, −87.7) and latitude ∈ (41.9, 42.0), and that of average APC ON (for the corresponding days) also concentrates on that same region. This means that, on average, people make short trips on most days, starting and ending the trip in the same sub-region (although the exact locations certainly differ). On Sunday, though, the travel pattern changes: most people start in the same sub-region as on the other days, but they mostly get off in the sub-region with longitude ∈ (−87.9, −87.8) and latitude ∈ (42.1, 42.2).

• All three of the previous observations also apply to the total APC ON and OFF for the day-to-day, hour-to-hour and week-to-week variations, but the sub-regions change. We show the resulting heat maps for all of these cases (Appendix A).

Finally, we examine two other aspects of this October 2015 dataset. The first is the heatmap of total and average APC ON and OFF (based on different time variations), to observe whether there are any popular "zone-level" departures and destinations for the majority of bus riders. The second is the total APC ON per route per day of the week, to see how riders use individual routes and to detect whether any route is used unusually little compared to other routes (so that managers can re-route buses to avoid this route, or assign fewer buses to cover it because of the low demand).

Now, the heatmap of total APC ON and OFF per day of the week shows that most riders depart from the same region, with lat × lon in [42.0, 42.1] × [−87.7, −87.6], across the entire week. The fact that the same region holds the largest number of people getting on and off at bus stops implies that bus riders departed from the same place, went to different places, and then all returned to the same location by bus. This aligns with PACE's mission of mainly serving students, workers and citizens travelling to workplaces, schools and well-known public places. The total APC ON and OFF across different months also shows the same region in [42.0, 42.1] × [−87.7, −87.6], except during January to March, when another region in [41.7, 41.8] × [−87.7, −87.6] has more people getting on and off. Once again, this region is the most common departure and arrival location of bus riders, which agrees with our observations for the total APC ON/OFF (see Figures 36-39).

With regard to the total and average APC ON per route per day of the week, route 769 was only used on Thursday and Sunday, with a total of only 170 riders, compared to the most crowded route, 714, which has more than 50,000 riders per week. This is in contrast to the average APC ON per route, as the peak average APC ON of route 769 is 17 on Thursday, compared to only 0.1 for route 714. This is mainly because the number of times riders take route 714 is more than 100 times greater than for route 769, which skews the average substantially (see Figures 40-43). In general, the total APC ON per route across the days of a week does not seem to follow a common distribution. Some routes seem to follow a Normal distribution, such as route 635, some seem to follow a Poisson distribution, such as route 640, and some routes, such as 769, do not seem to follow either (see Figures 44-45).

Using the "barplot" function, we plotted histograms of the aggregated number of rider transactions of each type across the month of March and across the hours of a day, to examine when bus service is the most useful means of travel compared to the alternatives. Counting the transaction status ("Success", "Pass First Use" or "No payment") of each transaction, we observe that, in the distribution of daily trip types in March, 82% of the 404643 total riders who used Ventra’s fare card in March had to recharge the card before trying it again, or paid the fare with Ventra’s smartphone application. Such transactions were classified as "No Payment" in the "Trans-status" column. The distribution of the average number of transactions for each of the three statuses per day of March also shows a clear domination of the "No payment" category, while the other two categories are about equal. These results imply that a majority of riders are frequent riders, but either they are too busy to keep track of the remaining balance on their cards, or they have switched to Ventra’s smartphone application for convenience (i.e., no need to carry a Ventra card every time they travel). In addition, the distribution of the average number of transactions per transaction type over the hours of a day shows the same pattern: the majority of transactions were classified as "No payment", and a very small percentage of people used Ventra cards for the first time during March 2016, as evidenced by the small number of "Pass First Use" transactions.

Finally, to avoid potential biases from only observing trends in averaged data, we also plotted barplots of the total number of transactions per transaction type for each day of the month and each hour of the day, to examine whether the pattern observed in the barplots of the average number of transactions repeats for this type of aggregated data. Indeed, the same pattern still holds for the types of riders and their habits when using PACE’s bus service.
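The counting step behind these barplots can be sketched as follows. This is an illustrative Python version, not the R code used for the report; the rows are made-up (day, status) pairs chosen so that the "No payment" share comes out at 82%, and the function names are hypothetical.

```python
from collections import Counter


def status_shares(rows):
    """Return each transaction status's share of all transactions."""
    counts = Counter(status for _day, status in rows)
    total = sum(counts.values())
    return {status: count / total for status, count in counts.items()}


def mean_daily_counts(rows):
    """Return the average number of transactions per observed day, by status."""
    counts = Counter(status for _day, status in rows)
    n_days = len({day for day, _status in rows})
    return {status: count / n_days for status, count in counts.items()}


# Hypothetical two-day sample of (day_of_month, transaction_status) rows.
rows = (
    [(1, "No payment")] * 41 + [(2, "No payment")] * 41
    + [(1, "Success")] * 5 + [(2, "Success")] * 4
    + [(1, "Pass First Use")] * 4 + [(2, "Pass First Use")] * 5
)

shares = status_shares(rows)
daily = mean_daily_counts(rows)
print(shares)
print(daily)
```

Comparing `shares` (based on totals) with `daily` (based on averages) mirrors the check in the text that the pattern seen in the averaged barplots also appears in the totals.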

Then we applied the same analysis to the "trip types" data, because we want to know how efficiently PACE’s bus service serves local commuters. We examine whether the majority of our riders use PACE’s buses for multi-ride or single-ride trips, so that we can use this information to aid our inference on the purpose of each individual trip. To do this, we plotted histograms of the total and average number of trips per trip type, to observe their distributions and how they vary daily and hourly. Although the average number of trips per trip type is distributed approximately in the ratio 2 : 1 for single-ride (the sum of the trip types labeled "0" and "2" on the histograms) to multi-ride trips per hour of a day and per day of March, the total number of trips per trip type shows a completely different pattern: the majority of riders opt for the trip type labeled "2", which is a single-ride trip, especially at 7am, the usual time when most people go to work, and at 5pm, the time students go home. An extremely small number of riders transfer between buses; these trips were labeled "1". Gathering all the above insights, we conclude that the majority of our riders indeed opted for single-ride trips during March 2016, and very few opted for transfer trips. This implies that riders mostly use PACE buses as a means of commuting between home and the workplace, but rarely for other activities like grocery shopping, laundry, etc. Finally, using the "ggmap" library, we also generated heat maps for trip types and transaction types to observe their geographic flows more clearly.

Finally, using the "ggmap" package in R, we generated several heat maps of the average and total "ON" count per hour of a day, day of a month (i.e., March 2016) and day of a week, to gain a better understanding of the spatial locations where people can travel using PACE’s buses around Chicago (note that the Ventra dataset contains all the buses from 9 garages across Chicago). We observed the following results:

• The heat map of average "ON" count per hour of a day shows that only a few people (between 2 and 4), on average, get off at locations scattered across the region with latitude (lat) x longitude (lon) ≈ [41.6,42.2]× [−88.4,−87.5] at any hour between 5am and 9pm. Only late at night and in the very early morning (12am - 4am, and 10pm - 11pm) are the locations for those people getting off narrower, mostly concentrated in the region with lat x lon ≈ [41.6,42.0]× [−88.0,−87.5]. In addition, the heat map shows heat scattered almost everywhere in the region with lat x lon ≈ [41.6,42.2]× [−88.4,−87.5], and very little heat in the regions with lat x lon ≈ [42.0,42.2]× [−87.8,−87.75] and [41.7,41.8]× [−87.7,−87.65]. We also examine the heat map of total "ON" count per hour of a day to verify whether the same pattern persists. The total number of people traveling is smallest during the early morning (from 12am to 5am), between 1000 and 2000 people. Starting at 6am, much greater heat appears over some separate sub-regions. The peak of the heat (between 4000 and 6000 people) was at 7am, and it occurs in three sub-regions with lat x lon ≈ [41.7,41.8]× [−88.2,−88.1], [41.8,41.9]× [−87.9,−87.7] and ≈ [42.0,42.1]× [−88.0,−87.9]. This implies that even though the average distribution does not show heat over those regions, a very large number of people travel to particular locations at three distinct hours: 7am, 3pm and 4pm. Since these hours tend to correspond to the times when people go to and from work or school, this might be the main source of the heat in our map. At other times, far fewer people travel, and they travel to many different places scattered across the Northern part of Chicago.

• The heat map of average "ON" count per day of March 2016 has heat on different sub-regions on each day (for example, on 03/15, the heat region has lat x lon ≈ [41.8,41.9]× [−87.7,−87.6], while on 03/28, the heat region is ≈ [41.7,41.8]× [−87.7,−87.6]). However, the heat map of total "ON" count per day of March shows the heat, whose peak is around 1000, coming from a single sub-region with lat x lon ≈ [42.0,42.1]× [−87.9,−87.8] on every day of March except the four Sundays, when there was no heat (03/06, 03/13, 03/20 and 03/27). This means a large number (i.e., 1000 or so) of riders get on a fixed route every day except Sunday, and their destinations are concentrated in a single sub-region to the Northwest of Chicago.

• The heat map of average "ON" count per day of a week shows a uniform distribution of the heat source over the weekdays (Monday - Friday), with the heat region ≈ [42.0,42.4]× [−88.3,−87.8], but different heat regions with lat x lon ≈ [42.0,42.4]× [−88.3,−87.8] and [42.0,42.1]× [−88.1,−88.0] on the weekend. This means that, on average, riders travel to the same location on weekdays (most likely to go to work), and only travel to other locations for other activities during the weekend. The heat map of total "ON" count per day of a week shows the same pattern (with different sub-regions, whose lat x lon ≈ [42.0,42.1]× [−87.9,−87.8] and ≈ [42.3,42.4]× [−87.9,−87.8]). This implies that people use PACE’s buses mainly on weekdays, which makes intuitive sense: on weekdays they need to go to work, so the locations of the heat source on those days are most likely their workplaces. On the weekend, they probably prefer alternative services for greater convenience and flexibility in time.
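The difference between the "average" and "total" heat maps described in the bullets above can be sketched by binning boardings into lat x lon cells. The report produced its maps with "ggmap" in R; the Python below is only an illustrative sketch, in which the 0.1-degree cell size, the record layout (hour, lat, lon, ON count) and all sample values are assumptions.

```python
import math
from collections import defaultdict


def grid_cell(lat, lon, cell=0.1):
    """Return the lower-left corner of the lat x lon cell containing a point."""
    return (round(math.floor(lat / cell) * cell, 6),
            round(math.floor(lon / cell) * cell, 6))


def heat_by_hour(records, cell=0.1):
    """Aggregate ON counts: returns {(hour, cell): (total_on, mean_on)}."""
    sums = defaultdict(int)
    n_records = defaultdict(int)
    for hour, lat, lon, on_count in records:
        key = (hour, grid_cell(lat, lon, cell))
        sums[key] += on_count
        n_records[key] += 1
    return {key: (sums[key], sums[key] / n_records[key]) for key in sums}


# Hypothetical records: (hour_of_day, latitude, longitude, ON count).
records = [
    (7, 41.75, -88.15, 3),
    (7, 41.78, -88.12, 5),
    (7, 42.05, -87.95, 2),
    (15, 41.75, -88.15, 4),
]

heat = heat_by_hour(records)
for (hour, cell), (total, mean) in sorted(heat.items()):
    print(f"hour={hour:2d} cell={cell}: total={total}, mean={mean:.1f}")
```

A cell with many records can have a large total and a modest mean, which is why the two kinds of map can highlight different sub-regions.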

B.1 Differences between APC and Ventra datasets

Even though the APC and Cubic’s Ventra datasets both contain information about bus trips and the number of riders getting on and off at each stop, they differ in many aspects because of how they were collected. First, the APC data was recorded from two main sources: the information stored on the card swiped by each boarding customer, and the two sensors attached to the doors of each bus, which count the passengers getting off at each bus stop. However, certain problems could arise: for example, some cash transactions might be missed, or the sensors or the card machine on some buses might malfunction during certain periods of data collection. This means the APC ON and/or OFF data are quite noisy, which explains our motivation to develop a Bayesian-based forecasting model. In addition, due to regulatory requirements on the fair distribution of buses to everyone regardless of gender, income and social status, no bus was assigned the same route on two successive days. For this reason, the individual APC ON/OFF records are not very meaningful on their own, and categorizing the entire APC dataset by route number is the most reasonable choice. Finally, the number of categories in our APC dataset is quite large (81 different categories), as the buses come from all 9 garages across IL, and the dataset has around 69.4 million observations representing the entire year of 2015, which is sufficient for conducting our data analysis.

On the other hand, our collaborators at PACE were able to obtain Cubic’s Ventra dataset by extracting the information recorded from the Ventra cards tapped by boarding customers. As a result, the number of customers getting off at each bus stop is missing. However, this dataset has several advantages over the APC dataset. First, the "ON" count is much less noisy, and the dataset has 100% coverage. Second, it includes buses coming from different garages across Chicago (rather than from only one garage), so many areas that were not covered in the APC dataset are covered in this dataset. Third, it has a larger number of qualitative categories for each observation (88 in total), but only 184318 observations. Fourth, the restriction that applies to the APC dataset does not apply here: in the Ventra dataset, the same bus could be assigned to the same route on two successive days. With these advantages, we expect a better visualization of the flow of people across the entire city of Chicago (unlike the APC dataset, where our results were only valid in the Northwest region).

However, the Ventra dataset still has two disadvantages compared to the APC dataset. First, the data was only recorded every thirty seconds, which means that riders might get on at different recorded positions that correspond to the same bus stop. It is therefore necessary to prepare the data by aggregating the counts whose recorded locations, given by latitude and longitude, are closest to the same bus stop. Second, in the column "transaction status", approximately 83.4% of the transactions with "No Payment" status were labeled as "Success" in another "transaction status2" column, which resulted in additional "ON" counts. Thus, the total "ON" count might be exaggerated by a large margin, and circumventing this problem means discarding a significant portion of our current dataset. After removing all the rows with that transaction status, the total number of observations in our Ventra dataset is 67083, an 83.4% decrease from the original size of our Ventra dataset. We then plotted the histograms to see whether our above conclusions on the distributions of the "ON" count per hour of a day, day of a week and day of a month change in the new Ventra dataset. Fortunately, only the three histograms for the average "ON" count change, from a non-uniform distribution to a completely uniform distribution (which makes sense, since each successful transaction equals one count), while those for the total "ON" counts keep the same pattern (only the actual total counts changed, as expected). We also generated the heat map of this "new" Ventra dataset: the locations where passengers get on the bus are much more sparse compared to the previous one (i.e., Figure 55), but also dense in certain sub-regions (for example, the sub-region with lat x lon ≈ [41.65,41.7]× [−88.2,−88.0] has a large number of boarding riders).
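The two preparation steps described above (dropping rows by transaction status, then aggregating boardings whose recorded position is closest to the same bus stop) can be sketched as follows. This is an illustrative Python version under stated assumptions, not the procedure used for the report: the stop names, coordinates and transactions are hypothetical, and great-circle (haversine) distance is assumed as the notion of "closest".

```python
import math
from collections import Counter


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))


def nearest_stop(lat, lon, stops):
    """Return the name of the stop closest to the recorded position."""
    return min(stops, key=lambda name: haversine_km(lat, lon, *stops[name]))


def on_counts_by_stop(transactions, stops, drop_status="No Payment"):
    """Count one boarding per kept transaction, grouped by nearest stop."""
    counts = Counter()
    for lat, lon, status in transactions:
        if status == drop_status:
            continue  # discard rows with the status being removed
        counts[nearest_stop(lat, lon, stops)] += 1
    return counts


# Hypothetical stops and transactions (lat, lon, transaction status).
stops = {"stop_1": (41.68, -88.10), "stop_2": (41.90, -87.80)}
transactions = [
    (41.681, -88.101, "Success"),
    (41.679, -88.099, "Success"),
    (41.6805, -88.1002, "No Payment"),
    (41.899, -87.801, "Pass First Use"),
]

counts = on_counts_by_stop(transactions, stops)
print(dict(counts))
```

With real data, a maximum snapping distance would probably also be needed so that positions recorded far from any stop are not forced onto one.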

C Appendix 1 - Figures for Exploratory Data Analysis on APC dataset

Heat maps generated for average and total ON/OFF counts of the APC dataset per different time variations across the whole month of October 2016.

Figure 24: Heat map of Average APC OFF per hours (October 2016)

Figure 25: Heat map of Average APC ON per hours (October 2016)

Figure 26: Heat map of Average APC OFF per each day (October 2016)

Figure 27: Heat map of Average APC ON per each day (October 2016)

Figure 28: Heat map of Average APC OFF per each day of a week (October 2016)

Figure 29: Heat map of Average APC ON per each day of a week (October 2016)

Figure 30: Heat map of Total APC OFF per hours (October 2016)

Figure 31: Heat map of Total APC ON per hours (October 2016)

Figure 32: Heat map of Total APC OFF per day (October 2016)

Figure 33: Heat map of Total APC ON per day (October 2016)

Figure 34: Heat map of Total APC OFF per days of a week (October 2016)

Figure 35: Heat map of Total APC ON per days of a week (October 2016)

Distribution of total and average APC ON/OFF in October 2016 per different time variations:

Figure 36: Distribution of total APC OFF per weekdays

Figure 37: Distribution of Total APC ON and OFF in October 2016

Figure 38: Distribution of average APC OFF in October 2016

Figure 39: Distribution of average APC ON in October 2016

Heat maps generated for average and total ON/OFF counts of the APC dataset per different time variations (days of week, days of months) across the whole year 2015.

Figure 40: Heatmap - Total APC ON per days of a week (2015)

Figure 41: Heatmap - Total APC OFF per days of a week (2015)

Figure 42: Heatmap - Total APC ON per months of 2015

Figure 43: Heatmap - Total APC OFF per months of 2015

Figure 44: Distribution of total APC On per route 769 across days (2015)


Figure 46: Distribution of average APC On per route 769 across days (2015)




Comparison of average APC ON count across different time variations (i.e., days of a week, days of a month and hours of a day) in all the areas where PACE’s buses are active.

Figure 50: Distribution of average APC ON per days of month

Figure 51: Distribution of average "ON" count per days of month - Ventra combined dataset

Figure 52: Distribution of average APC ON per days of week

Figure 53: Uniform distribution of average "ON" count per days of week - Ventra combined dataset

Figure 54: Distribution of average APC ON per hours of day

Figure 55: Distribution of average "ON" count per hours of day - Ventra combined dataset

Figure 56: Distribution of average APC ON per hours of day

Figure 57: Distribution of average "ON" count per hours of day - Ventra dataset

Distributions of total and average APC ON per each individual route across days of a week for APC dataset (2015)

Figure 58: Route 877 - Distribution of average APC ON per days of week


Figure 62: Route 834 - Distribution of average APC ON per days of a week







Figure 68: Route 877 - Distribution of total APC ON per days of week










D Figures for Exploratory Data Analysis on Ventra dataset

Figure 76: Distribution of average number of transactions per each transaction type per each day in March

Figure 77: Distribution of average number of transactions per each transaction type in different hours of a day

Figure 78: Distribution of total transactions per each transaction type per each day in March

Figure 79: Distribution of total number of trips per each trip type per each hour of a day

Figure 80: Distribution of total number of trips per each trip type per each day in March

Figure 81: Distribution of average number of trips per each trip type across days (March 2016)

Figure 82: Distribution of average number of trips per each trip type across hours of a day (March 2016)

Figure 83: Heat map - Total number of transfers per each transfer type (March 2016)

Figure 84: Heat map - Average number of transfers per each transfer type (March 2016)

Figure 85: Heat map - Total number of transfers per each transfer type (March 2016)

Figure 86: Heat map - Average number of transfers per each transfer type (March 2016)

Figure 87: Heat map of average "ON" count per hour of a day

Figure 88: Heat map of total "ON" count per hour of a day

Figure 89: Heat map of average "ON" count per each day of March 2016

Figure 90: Heat map of total "ON" count per each day of March 2016

Figure 91: Heat map of average "ON" count per day of a week

Figure 92: Heat map of total "ON" count per day of a week

E Figures for comparing between APC versus Ventra

Figure 93: Distribution of average on-boarding riders per days of a week

Figure 94: Distribution of total on-boarding riders per days of a week

Figure 95: Distribution of average on-boarding riders per each day in March

Figure 96: Distribution of total on-boarding riders per each day in March

Figure 97: A sample snapshot of APC dataset

Figure 98: A sample snapshot of Ventra dataset

Figure 99: Distribution map of bus stations in APC dataset (2015)

Figure 100: Distribution map of bus stations in Ventra dataset - March 2016

Figure 101: Distribution of total APC ON per hours of day

Figure 102: Distribution of total Ventra "ON" count per hours of day

Figure 103: Distribution of total APC ON per days of week

Figure 104: Distribution of total Ventra "ON" count per days of week

Figure 105: Distribution of total APC ON per days of March 2015

Figure 106: Distribution of total Ventra "ON" count per days of March 2016

Figure 107: Heat map - All "cash-transactions" trips

Figure 108: Distribution of total APC ON per hours of day

Figure 109: Distribution of total "ON" count per hours of day - Ventra combined dataset

Figure 110: Distribution of total APC ON per days of week

Figure 111: Distribution of total "ON" count per days of week - Ventra combined dataset

Figure 112: Distribution of total APC ON per days of month

Figure 113: Distribution of total "ON" count per days of month - Ventra combined dataset

F Figures for Hierarchical Bayesian Model

The comparison plots for Days 350 - 355 in each of the four cases of prior-likelihood are shown below, together with R.

Figure 114: Day 350 - Posterior Distribution of Normal - Normal case




Figure 120: Day 350 - Posterior Distribution of Poisson - Normal case


Figure 126: Day 350 - Posterior Distribution of Poisson - Poisson case




Figure 132: Day 350 - Posterior Distribution of Normal - Poisson case


Figure 138: Day 350 - R of Normal - Normal case




Figure 144: Day 350 - R of Poisson - Normal case


Figure 150: Day 350 - R of Poisson - Poisson case

Figure 151: Day 351 - R of Poisson - Poisson case



Figure 156: Day 350 - R of Normal - Poisson case







G Earned Value Management Chart


References

[1] Michael GH Bell, Caroline M Shield, Fritz Busch, and Gunter Kruse. A stochastic user equilibrium path flow estimator. Transportation Research Part C: Emerging Technologies, 5(3):197–210, 1997.

[2] Anne-Sarah Briand, Etienne Côme, Martin Trépanier, and Latifa Oukhellou. Analyzing year-to-year changes in public transport passenger behaviour using smart card data. Transportation Research Part C: Emerging Technologies, 79:274–289, 2017.

[3] Xi Chen, Kaoru Irie, David Banks, Robert Haslinger, Jewell Thomas, and Mike West. Scalable Bayesian modeling, monitoring and analysis of dynamic network flow data. Journal of the American Statistical Association, (just-accepted), 2017.

[4] Siriphong Lawphongpanich and Donald W Hearn. An MPEC approach to second-best toll pricing. Mathematical Programming, 101(1):33–55, 2004.

[5] Chung-Cheng Lu, Xuesong Zhou, and Kuilin Zhang. Dynamic origin–destination demand flow estimation under congested traffic conditions. Transportation Research Part C: Emerging Technologies, 34:16–37, 2013.

[6] Geoffrey McLachlan and Thriyambakam Krishnan. The EM algorithm and extensions, volume 382. John Wiley & Sons, 2007.

[7] Gordon F Newell. A simplified theory of kinematic waves in highway traffic, part I: General theory. Transportation Research Part B: Methodological, 27(4):281–287, 1993.

[8] Katharina Parry and Martin L Hazelton. Estimation of origin–destination matrices from link counts and sporadic routing data. Transportation Research Part B: Methodological, 46(1):175–188, 2012.

[9] Hanif D Sherali and Taehyung Park. Estimation of dynamic origin–destination trip tables for a general network. Transportation Research Part B: Methodological, 35(3):217–235, 2001.

[10] Claudia Tebaldi and Mike West. Bayesian inference on network traffic using link count data. Journal of the American Statistical Association, 93(442):557–573, 1998.

[11] Florian Toqué, Etienne Côme, Mohamed Khalil El Mahrsi, and Latifa Oukhellou. Forecasting dynamic public transport origin-destination matrices with long-short term memory recurrent neural networks. In Intelligent Transportation Systems (ITSC), 2016 IEEE 19th International Conference on, pages 1071–1076. IEEE, 2016.
