Can Twitter data be used to validate travel demand models?
Jae Hyun Lee
Graduate Student Researcher, Department of Geography and GeoTrans Lab
University of California, Santa Barbara
Santa Barbara, CA 93106
Phone: 805-284-7835
Email: [email protected]
Song Gao
Graduate Student Researcher, Department of Geography and STKO Lab
University of California, Santa Barbara
Santa Barbara, CA 93106
Phone: 805-980-8424
Email: [email protected]
Konstadinos G. Goulias
Professor, Department of Geography and GeoTrans Lab
University of California, Santa Barbara
Santa Barbara, CA 93106
Phone: 805-284-1597
E-mail: [email protected]
Number of Words = 6,855
Number of Tables and Figures = 11*250 = 2,750
Total equivalent = 9,605
Paper submitted for 95th Annual Transportation Research Board Meeting, Washington, D.C.,
January 10-14, 2016
Can Twitter data be used to validate travel demand models?
Jae Hyun Lee*
Song Gao
Konstadinos G. Goulias
*corresponding author ([email protected])
Department of Geography and GeoTrans Lab.
University of California Santa Barbara
Santa Barbara, CA 93106-4060
Abstract
In this paper we use Twitter data and a recently developed algorithm at the University of California
Santa Barbara to extract Origin-Destination pairs in the Greater Los Angeles metropolitan area
known as the Southern California Association of Governments (SCAG) region. This algorithm
contains two steps: individual-based trajectory detection and place-based trip aggregation. In
essence, if a person tweeted in different TAZs within 4 hours, it is considered to be one OD-trip.
The extracted OD-trips were aggregated into 30 minute intervals. Then, we compare these trips
with a traditional travel demand model (SCAG, 2012, 4-step model). Substantial heterogeneity is
found due to zero trip zones and a variety of social factors including the tweeting demographics.
In this paper we illustrate the results from a Tobit model and a three-class latent class regression
model that convert tweet-derived trips to four-step model trips while accounting for zonal and
trip-maker heterogeneity. In these regression models we use measures of business density and
diversity, and population density, as added explanatory/control variables, so that the unit
contribution of a tweet trip can be adjusted for land-use effects and the zero-trip-producing zones
in the Twitter data can be explained in a more complete way. Preliminary results are encouraging and show the usefulness
of harvested large-scale mobility data from location-based social media. The results also show the
added value of latent class regression models in this experiment. The paper concludes with a
review of next steps.
1. Introduction
Household travel survey data have played the most significant role in travel behavior
research for many years. Although information extracted from this source allows various levels of
government to develop their transportation plans, the cost of data collection tends to increase over
time (Leiman et al., 2006). Furthermore, recently developed modeling and simulation approaches
for travel demand forecasting require even more detailed information, which may further increase
survey costs (Goulias et al., 2011). One of the factors increasing the cost is nonresponse due to
respondent burden. Thus, there have been diverse attempts to reduce the respondent burden
with a variety of new technologies including Global Positioning System (GPS) loggers,
computer-based survey systems, smartphone applications, Personal Digital Assistants (PDAs), and car
navigation systems (Tuner, 1996; Shirima et al., 2007; Auld et al., 2009; Fan et al., 2012; Nitsche
et al., 2012; Cottrill et al., 2013).
More recently, online social media services (e.g., Facebook, Foursquare, Twitter) have received
increasing attention in transportation research as a potential source of data for travel behavior
analysis (Collins et al., 2013; Cebelak, 2013; Hassan and Ukkusuri, 2014; Gao et al., 2014). Collins
et al. (2013) used about 500 tweets to evaluate transit riders’ satisfaction with a Sentiment
Strength Detection Algorithm. Cebelak (2013) focused on Foursquare and was able to
reconstruct a zonal Origin-Destination matrix based on the check-in counts for Austin, Texas.
Another travel analysis example with Foursquare data was given by Hassan and Ukkusuri (2014),
who employed an activity pattern model to infer the latent pattern of weekly activities from
geo-location data. Similarly, but with different algorithms, tweets were used to study temporal and
spatial patterns of activity participation by Coffey and Pozdnoukhov (2013).
A more recent algorithm and application tested a new methodology to extract Origin-Destination
pairs from Twitter data for the Greater Los Angeles metropolitan area known as the Southern
California Association of Governments (SCAG) region (Gao et al., 2014). This algorithm contains
two steps: individual-based trajectory detection and place-based trip aggregation. In essence, if a
person tweeted in different TAZs within 4 hours, it is considered to be one OD-trip. The extracted
OD-trips were aggregated into 30 minute intervals. Then, these trips were compared with the
commuting data in the American Community Survey (ACS) for validation. Among the Weekday,
Weekend, and Christmas datasets, Weekday data has the most similar temporal distribution of trips
(Pearson correlation coefficient=0.91, p=0.002). This high correlation may be misleading due to
extreme spatial heterogeneity that is due to different behavior among groups of residents in each
spatial unit analyzed.
In this paper, we expand this research, using the algorithm of Gao et al. (2014), by comparing the
spatial distribution of Twitter-based trips on weekdays with other, more widely used sources of
travel demand information. In essence, our objective is to validate the social-media-based
travel data by comparing them with the results of a traditional travel demand model (SCAG, 2012,
4-step model). In doing so, we use a four-step approach: 1) Harvest geotagged tweets and extract
OD pairs; 2) Examine spatial zoning levels for aggregating OD pairs for the comparison; 3) Test
the correlation between Twitter-based trips and trip-based (four-step) travel demand model trips
and develop spatial distributions of travel; and 4) Create a conversion method that builds OD
matrices from tweets in regression models that account for spatial heterogeneity.
2. Data Used
In this analysis, we used four datasets from different sources: 1) a social media dataset
collected from Twitter; 2) the output of a traditional travel demand model; 3) Census data; and 4)
business establishment data to capture land use.
2.1. Twitter Data
The Twitter data used here were collected on December 10 and 11, 2013. In this
period, 355,059 geo-tagged tweets from 54,272 people, filtered by mobile device source, were
used to extract OD trips. In this way, we were able to ensure that the locations of tweets reflect
individuals’ physical locations rather than social robots’ locations or default locations
(hometown). We used the OD trip extraction algorithm developed by Gao et al. (2014), which
requires a spatial zoning system to detect trips from individual-based trajectories because
the algorithm recognizes a trip if an individual’s geo-locations belong to different zones at
different times of a day. In this analysis, we used the smallest zoning system (US Census Blocks,
203,191 zones) to capture even short-distance trips from Twitter data. As a result, a total of 67,266 trips
were successfully extracted covering the entire Greater Los Angeles area. Each of these trips has
its origin in one zone and its destination in the same or another zone, and in this way the trips
provide an origin-destination (OD) matrix as used in more traditional travel demand models.
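The extraction rule described in this paper (tweets by the same person in different zones within 4 hours count as one OD trip, and trips are aggregated into 30-minute intervals) can be sketched as follows. This is a minimal illustration, not the implementation of Gao et al. (2014): the function names, the pairing of consecutive tweets, and the binning by departure time are all assumptions made for the example.

```python
from collections import Counter, defaultdict
from datetime import datetime, timedelta

MAX_GAP = timedelta(hours=4)  # time window stated in the paper


def extract_trips(tweets):
    """tweets: iterable of (user_id, timestamp, zone_id) tuples.

    Assumption: only consecutive tweets of the same user are paired.
    """
    by_user = defaultdict(list)
    for user, ts, zone in tweets:
        by_user[user].append((ts, zone))

    trips = []
    for user, points in by_user.items():
        points.sort()  # chronological order per user
        for (t0, z0), (t1, z1) in zip(points, points[1:]):
            if z0 != z1 and t1 - t0 <= MAX_GAP:
                trips.append((user, z0, z1, t0))
    return trips


def half_hour_bin(ts):
    """Floor a timestamp to the start of its 30-minute interval."""
    return ts.replace(minute=0 if ts.minute < 30 else 30, second=0, microsecond=0)


def aggregate(trips):
    """Count trips per (origin, destination, 30-minute departure bin)."""
    return Counter((o, d, half_hour_bin(t0)) for _, o, d, t0 in trips)


# Hypothetical sample: user "a" moves A -> B within 4 hours, then tweets in C
# more than 4 hours later; user "b" stays in zone A.
sample = [
    ("a", datetime(2013, 12, 10, 8, 5), "A"),
    ("a", datetime(2013, 12, 10, 9, 40), "B"),
    ("a", datetime(2013, 12, 10, 15, 0), "C"),
    ("b", datetime(2013, 12, 10, 8, 10), "A"),
    ("b", datetime(2013, 12, 10, 8, 20), "A"),
]
sample_trips = extract_trips(sample)
sample_od = aggregate(sample_trips)
```

In this toy sample only the A-to-B movement qualifies as a trip, because the later tweet in zone C falls outside the 4-hour window and user "b" never changes zone.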
2.2. Four-step Travel Demand Model Output
The Southern California Association of Governments is the primary planning agency in the Greater
Los Angeles area and has been developing travel demand forecasting models since 1967. The latest
models were originally developed and validated in 2008, and are regularly updated to reflect
demographic changes as well as travel conditions. A traditional four-step model was used with
4,109 Traffic Analysis Zones (TAZs) covering the entire SCAG region, and the model consists of
seven modules (extended four-step): Household Classification and Population Synthesizer, Auto
Ownership Model, Trip Generation Model, Trip Distribution Models, Mode Choice Models,
Heavy Duty Truck (HDT) Model, and Network Assignment Model. As a result, a balanced vehicle
Origin-Destination matrix was created with a total of 40,392,636 vehicle trips (updated in 2012). In
this paper, we use the matrix with the smallest number of OD trips (the evening OD matrix,
2,788,173 vehicle trips) to validate the Twitter-based Origin-Destination matrix.
2.3. 2010 US Census Data & National Establishment Time-Series (NETS) Database
We use a variety of characteristics of origin and destination zones to explain OD trips.
Population size and housing units were obtained from the 2010 US Census dataset, and population
and housing density were also computed with this dataset. In addition, using the 2010 National
Establishment Time-Series Database, we enumerate the business establishments in each
zone to create variables such as the total number of employees in each zone and a diversity indicator
computed with the Shannon Index (Equation 1), where $p_i$ denotes the proportion of employees
belonging to industry type $i$, based on the two-digit Standard Industrial Classification (SIC)
code, and $R$ is the number of industry types. In this way we classify employees into each
category and compute the diversity indicator.
$$H' = -\sum_{i=1}^{R} p_i \ln p_i \qquad (1)$$
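As a small worked illustration of Equation 1, the sketch below computes the Shannon diversity index from employee counts per industry type. The function name and the sample counts are hypothetical; empty categories are skipped, following the usual convention that 0 ln 0 is treated as 0.

```python
import math


def shannon_index(employees_by_industry):
    """Shannon diversity H' = -sum(p_i * ln p_i) over industry types.

    employees_by_industry: dict mapping an industry code to an employee count.
    """
    total = sum(employees_by_industry.values())
    if total == 0:
        return 0.0
    h = 0.0
    for count in employees_by_industry.values():
        if count > 0:  # empty categories contribute nothing
            p = count / total
            h -= p * math.log(p)
    return h


# Hypothetical zones: one evenly mixed, one dominated by a single industry.
mixed_zone = {"52": 100, "58": 100, "73": 100, "80": 100}
single_use_zone = {"52": 400, "58": 0, "73": 0, "80": 0}

h_mixed = shannon_index(mixed_zone)        # equals ln(4) for four equal shares
h_single = shannon_index(single_use_zone)  # equals 0 for a single industry
```

The index is largest when employment is spread evenly across industry types and drops to zero when one type holds all employees, which is why it serves as a land-use diversity indicator.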
2.4. Spatial Aggregation
Since the sources of the four datasets are different, their spatial scales also differ from each
other (Table 1). In order to develop a conversion method that builds OD matrices from tweets, we
need to use a suitable spatial zoning system to avoid bias due to too many zero Twitter-based OD
entries. In addition, the zoning system should be applicable in a consistent way to all the data.
Since the SCAG travel demand model uses the most aggregate zonal system, we first attempted to
use the Traffic Analysis Zone (TAZ) level (4,109 zones, 16,883,881 OD pairs). This was possible
because TAZs are built from US Census Blocks and are in turn used to create Traffic Analysis
Districts (TADs) and Public Use Microdata Areas (PUMAs) (US Census Bureau; Figure 1, Top).
However, the TAZ system was still too fine-grained to compare Twitter data with SCAG model output,
because even at this level of aggregation we have a large number of zero Origin-Destination pairs
due to the smaller sample size of Twitter trips; the ratio between Twitter trips and TAZ OD pairs
was 0.004:1 (i.e., 67,266 Twitter trips, 16,883,881 OD pairs in the TAZ system). After a few trials
of aggregation we chose the Public Use Microdata Area (PUMA) level to compare spatial
distributions of OD pairs (Figure 1, Bottom). Finally, we were able to create two 124-by-124
Origin-Destination matrices covering the entire SCAG region, and the ratio between trips and OD
pairs ended up being 4.375:1 (i.e., 67,266 Twitter trips, 15,376 OD pairs in the PUMA system).
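The trips-per-OD-pair ratios follow directly from the trip count and the number of zones, since a full OD matrix over n zones has n × n cells. The short sketch below reproduces that arithmetic; the function name is introduced here for illustration only.

```python
def trips_per_od_pair(n_trips, n_zones):
    """Average number of trips per cell of an n_zones-by-n_zones OD matrix."""
    return n_trips / (n_zones * n_zones)


TWITTER_TRIPS = 67_266  # trips extracted from the tweets

taz_pairs = 4_109 * 4_109   # OD cells at the TAZ level
puma_pairs = 124 * 124      # OD cells at the PUMA level

taz_ratio = trips_per_od_pair(TWITTER_TRIPS, 4_109)
puma_ratio = trips_per_od_pair(TWITTER_TRIPS, 124)

print(f"TAZ:  {taz_pairs:,} OD pairs, {taz_ratio:.4f} trips per pair")
print(f"PUMA: {puma_pairs:,} OD pairs, {puma_ratio:.3f} trips per pair")
```

Moving from TAZs to PUMAs raises the average cell count from far below one trip to more than four, which is the motivation for the coarser zoning.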
Table 1. Data Vintage and Spatial Zoning System
| Name of Dataset | Vintage | Spatial Zoning System | Number of Units |
|---|---|---|---|
| Twitter Data | 2013 | Census Block | 203,191 |
| SCAG Output | 2012 | Traffic Analysis Zones | 4,109 |
| Census Data | 2010 | Census Block | 203,191 |
| NETS Business Data | 2010 | Latitude, Longitude | 2,509,672 |
| PUMA Dataset | 2010 | PUMAs | 124 |
Figure 1. Spatial aggregation (Top) and PUMAs zone system for SCAG region (Bottom, 124 units)
3. Descriptive Analysis
In order to compare Twitter trips and SCAG trips, we first converted the two 124-by-124
OD matrices into a 15,376-by-2 matrix. The mean and standard deviation of Twitter trips within
and between PUMA zones were 1.31 and 5.571 respectively, and their range was from 0 to 104.
On the other hand, the average number of SCAG OD trips was 110.79 and its standard deviation
was 344.718 (min 0.072, max 6021.79). The Pearson correlation coefficient was calculated to test
the relationship between the Twitter and SCAG trips, and a strong correlation was found,
significant at the 0.001 level (Pearson correlation coefficient 0.839, p-value < 0.001). However,
this is only an approximate indication of good agreement between a trip-based model and Twitter
data due to the heterogeneity of trips among zones.
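The comparison step can be sketched as follows: each OD matrix is flattened row by row so that the two sources form paired columns, and the Pearson correlation is computed over the pairs. The 3-by-3 matrices are made-up toy values, not the study data, and the helper names are hypothetical.

```python
import math


def flatten(matrix):
    """Turn an n-by-n OD matrix into a list of n*n cell values (row by row)."""
    return [value for row in matrix for value in row]


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Toy 3-by-3 OD matrices (hypothetical values).
twitter_od = [[5, 1, 0],
              [2, 8, 0],
              [0, 1, 3]]
scag_od = [[120.0, 30.5, 2.1],
           [45.0, 160.0, 0.4],
           [1.2, 25.0, 70.0]]

# Paired observations: one row per OD pair, one column per data source.
pairs = list(zip(flatten(twitter_od), flatten(scag_od)))
r = pearson([t for t, _ in pairs], [s for _, s in pairs])
```

With 124 zones the same flattening yields 15,376 paired observations. A single coefficient computed this way summarizes overall agreement but cannot reveal the zone-to-zone heterogeneity discussed in the text.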
Figure 2 (a) illustrates the distribution of the Twitter trips and SCAG trips (left), and the residual
distribution of the simple regression model (right). The number of SCAG trips was used as the
dependent variable, and the number of Twitter trips was the only independent variable in
this model. The estimated coefficient of Twitter trips was 20.682 and was significant at the 0.001
level. However, this coefficient might be problematic because datasets with a large
proportion of zero cells and spatially autocorrelated OD pairs require different
estimation methods (Figure 2 (a), right side, residual distribution). In order to address
these problems, we used regression models that are designed to account for these limitations;
they are described in the following section.
Figure 2 (b) shows the spatial distributions of Twitter trips (left) and SCAG trips (right). The red
bars indicate the number of trips originating in each PUMA zone, and the number of trips destined
for each PUMA zone is represented with blue bars. As expected, a relatively higher number of Twitter
trips were collected in the central Los Angeles area compared to the SCAG trips. Figure 2 (c) shows
the top 2% of OD flows from Twitter (left) and SCAG (right) in the City of Los Angeles area. A relatively
higher number of short-distance trips in the downtown area were collected with Twitter than with
the SCAG dataset. These trips are between areas of higher housing and business density and smaller zonal area.
(a) Twitter trips and SCAG Trips
(b) Number of OD trips from Twitter (Left) and SCAG Model (Right) in PUMA level
(c) Twitter Trips and SCAG trips (Top 2 percent sample in terms of amount of OD flows)
Figure 2. Distributions of Twitter trips and SCAG trips
4. Model
As discussed in the previous section, a large proportion of zero trip interchanges among
zones and the presence of spatial autocorrelation among zones are problematic in developing the
conversion methods between Twitter and SCAG OD trips. Therefore, we construct several
versions of spatial lag variables, and then we use them as explanatory variables in a Tobit
regression model and a latent class regression model to understand the relationship between
Twitter-based ODs and trip-based model generated ODs. We use SCAG ODs as the dependent
variable and Twitter trips, land use, and zonal demographic characteristics as independent
variables along with the spatial lag variables. In this way, we create a three-way comparison among
three different sources of information about Origins and Destinations for cross-validation. Land
use is described with indicators of business density and diversity. Demography is described by
population density. In this way the regression coefficient associated with a Twitter OD trip is the
multiplier needed to represent SCAG OD trips from a zone, and this is the net multiplier after taking
into account land-use effects and the zero-trip-producing zones in the SCAG data.
4.1. Defining Spatially Lagged Variables
Both Twitter and SCAG model OD trips are spatially dependent, because the number of OD
trips is correlated with the number of trips between neighboring Origins and Destinations. As
discussed above, we use spatially lagged variables in this analysis to account for spatial
autocorrelation in the OD trip variables. General ways of defining spatial lag variables can be found
in Anselin (1988). However, our OD trips are doubly dependent upon space (the Origins’ and
Destinations’ neighborhoods); therefore we defined three sets of spatial lag variables. Figure 3
illustrates the methodology for defining them. The first image (a) represents an OD
trip for which we want to compute the spatial lag variables. Images (b) and (c) show two ways
of calculating spatial lag variables: the number of trips from the origin to the
adjacent zones of the destination (O_Dadj), and from the adjacent zones of the origin to the
destination (Oadj_D). The last image (d) in Figure 3 illustrates the method to compute the third spatially
lagged variable: the number of trips from the neighborhood area of the origin to the neighborhood
area of the destination (Oadj_Dadj). We defined the neighbor zones as all of the zones that are
located within three miles (Euclidean distance between centroids of the zones is sufficient for this
initial testing). For example, the zone located south of the destination (marked with a
green asterisk in maps (b) and (d)) was not included as a neighbor of the destination zone
because the distance between the centroids of the two zones was greater than three miles. We use a
three-mile radius because more than 70 percent of the OD trips in both datasets fall within this
Euclidean distance (73.6% of SCAG trips, 90.4% of Twitter trips). We computed these spatial lag
variables for both Twitter trips and SCAG model trips, so a total of six variables were constructed
(T_O_Dadj, T_Oadj_D, T_Oadj_Dadj, S_O_Dadj, S_Oadj_D, S_Oadj_Dadj).
Figure 3. Defining Spatially Lagged Variables for the OD trips
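The three spatially lagged variables can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' code: neighbors are zones whose centroids lie within three miles, the zone itself is excluded from its own neighborhood, the lag is a plain sum of trips, and centroid coordinates are given in a planar system measured in kilometers. All names and toy values are hypothetical.

```python
import math

THREE_MILES_KM = 3 * 1.609344  # three-mile radius expressed in km


def neighbors(centroids, radius_km=THREE_MILES_KM):
    """Map each zone to the set of other zones with centroids within the radius."""
    result = {}
    for a, (xa, ya) in centroids.items():
        result[a] = {
            b for b, (xb, yb) in centroids.items()
            if b != a and math.hypot(xa - xb, ya - yb) <= radius_km
        }
    return result


def spatial_lags(od, nbrs, origin, dest):
    """Return (O_Dadj, Oadj_D, Oadj_Dadj) for one OD pair.

    od: dict mapping (origin, destination) to a trip count.
    """
    o_dadj = sum(od.get((origin, j), 0) for j in nbrs[dest])
    oadj_d = sum(od.get((i, dest), 0) for i in nbrs[origin])
    oadj_dadj = sum(od.get((i, j), 0) for i in nbrs[origin] for j in nbrs[dest])
    return o_dadj, oadj_d, oadj_dadj


# Toy layout: A and B are neighbors, C and D are neighbors, E is isolated.
centroids = {"A": (0, 0), "B": (1, 0), "C": (10, 0), "D": (11, 0), "E": (30, 0)}
od_trips = {("A", "C"): 5, ("A", "D"): 2, ("B", "C"): 3, ("B", "D"): 4, ("A", "E"): 7}

nbrs = neighbors(centroids)
lags_a_to_c = spatial_lags(od_trips, nbrs, "A", "C")
```

For the pair A to C in this toy layout, O_Dadj picks up the A-to-D trips, Oadj_D the B-to-C trips, and Oadj_Dadj the B-to-D trips. Applying the same function to the Twitter and SCAG matrices would give the six variables listed in the text.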
4.2. Tobit Model
In the SCAG OD database we find 425 OD pairs with fewer than one trip. A
dependent variable like this is limited from below and can be considered as censored at zero
(Monzon et al., 1989), and there are exact and approximate ways to estimate regression models
accounting for limits and censoring (Goulias and Kitamura, 1993). With this type of dependent
variable, the Tobit model is generally recommended because it is designed to
handle censored distributions (Maddala, 1986; Greene, 2003). Equation (2) shows the Tobit model
we used in this analysis.
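$$y_i^* = \rho W_j y + \beta x_i + \gamma W_j x_i + Z\delta + \epsilon_i \qquad (2)$$
$$y_i = \begin{cases} y_i^* & \text{if } y_i^* > 0 \\ 0 & \text{if } y_i^* \le 0 \end{cases}$$
where $\epsilon_i \sim N(0, \sigma^2)$.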
Each OD pair is indexed by $i$, and $y_i$ and $x_i$ represent OD trips from the SCAG model and
Twitter, respectively. $y_i^*$ is a latent variable that is assumed to be normally distributed. There
are two sets of spatial lag variables: endogenous ($W_j y$) and exogenous ($W_j x_i$). $W_j$ denotes
the three spatial weight structures used to construct spatial lag variables for both the dependent and
independent variables (matrices containing cells with 0 and 1 values; 1 for zones that are
considered adjacent and 0 otherwise). Lastly, $Z$ denotes all other explanatory variables including
land use and demographic characteristics of origins and destinations, and the Euclidean distance
between centroids of origins and destinations. According to Xu and Lee (2014), the Maximum
Likelihood Estimator (MLE) produces consistent results for the Spatial Autoregressive Tobit Model.
Therefore, we use the MLE method to estimate this model with the likelihood function in
Equation (3), where $\tau$ denotes the censoring point. In our model estimation, we set $\tau = 0$.
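$$L = \prod_{i}^{N} \left[ \frac{1}{\sigma}\, \phi\!\left( \frac{y - \mu}{\sigma} \right) \right]^{d_i} \left[ 1 - \Phi\!\left( \frac{\mu - \tau}{\sigma} \right) \right]^{1 - d_i} \qquad (3)$$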
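To make Equation (3) concrete, the sketch below evaluates the corresponding log-likelihood for given linear predictors. It is an illustration only, not the LIMDEP routine used in the paper: the indicator d_i is taken to be 1 for uncensored observations (y_i above the censoring point) and 0 otherwise, which is an interpretation assumed here, and the function names and sample values are hypothetical.

```python
import math


def normal_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def tobit_loglik(y, mu, sigma, tau=0.0):
    """Log of Equation (3) for observations y with linear predictors mu.

    Uncensored cases (y_i > tau) contribute the normal density term;
    censored cases contribute the probability mass at or below tau.
    """
    total = 0.0
    for y_i, mu_i in zip(y, mu):
        if y_i > tau:  # d_i = 1
            total += math.log(normal_pdf((y_i - mu_i) / sigma) / sigma)
        else:          # d_i = 0
            total += math.log(1.0 - normal_cdf((mu_i - tau) / sigma))
    return total


# Hypothetical values: two observed OD flows and one censored (zero) flow.
y_obs = [12.0, 3.5, 0.0]
mu_hat = [10.0, 4.0, -1.0]
ll = tobit_loglik(y_obs, mu_hat, sigma=2.0)
```

A maximum likelihood routine would search over the coefficients that generate mu (and over sigma) to maximize this quantity.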
The marginal effects of explanatory variables x or Z are the partial derivative of the expected value
of y with respect to x or Z (Equation 4). Since these effects indicate the magnitude of change in
dependent variable by changing one unit in each independent variable, we can obtain the unit
contribution of a Twitter OD trip on SCAG model output.
$$\frac{\partial E[y \mid x]}{\partial x_k} = \Phi(\text{nonzero})\, \beta_k \qquad (4)$$
Equation (4) shows that the multiplier (also called marginal effect) applied to a Twitter OD trip
differs for each OD pair. It is a function of the regression coefficient $\beta_k$, which is the same
across all zones, but it also accounts for the probability that an origin-destination pair produces
zero trips, which makes this marginal effect different among zones.
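A minimal sketch of Equation (4) follows. It assumes that the probability of a nonzero outcome is Φ(μ_i/σ), the standard Tobit result for censoring at zero, where μ_i is the linear predictor of OD pair i. The function names and all numeric values are hypothetical and only show how the same coefficient yields different multipliers for different OD pairs.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def tobit_marginal_effect(beta_k, mu_i, sigma):
    """Marginal effect of x_k on E[y | x] for one OD pair, censoring at zero.

    beta_k: Tobit coefficient of the variable of interest.
    mu_i:   linear predictor of the OD pair (assumed known here).
    sigma:  standard deviation of the error term.
    """
    prob_nonzero = normal_cdf(mu_i / sigma)
    return prob_nonzero * beta_k


BETA = 20.0    # hypothetical coefficient
SIGMA = 100.0  # hypothetical error standard deviation

# A pair that is very likely to have trips versus one close to the censoring point.
busy_pair = tobit_marginal_effect(BETA, mu_i=300.0, sigma=SIGMA)
quiet_pair = tobit_marginal_effect(BETA, mu_i=0.0, sigma=SIGMA)
```

For the busy pair the multiplier is close to the full coefficient, while for the pair at the censoring point it is exactly half of it.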
4.3. Latent Class Regression Model
Though the spatial lag Tobit model provides a suitable framework to find the conversion of Twitter
OD trips to SCAG OD trips for the entire Greater Los Angeles area, it may be limited in reflecting
the heterogeneous nature of space. Latent class analysis provides a method to capture many
different types of heterogeneity because it allows us to classify observational units into
a set of latent classes and estimate class-specific regression models simultaneously. This is
particularly useful when we attempt to capture spatial heterogeneity (Deutsch-Burgner and
Goulias, 2014). With our dataset, spatially homogeneous OD trips can be grouped into latent classes
and regression coefficients are estimated for each class simultaneously with the determination of
the number of classes (Vermunt and Magidson, 2013). In this way we can test whether each
hypothetical class has a different Twitter trip conversion multiplier (in essence, different values
for the regression coefficients).
In this model, we also use SCAG OD trips as the endogenous variable; all other variables are used
as exogenous variables. There are two different types of exogenous variables in a Latent Class
Regression model, and their roles are distinct from each other: 1) covariates: variables that affect
the latent variable defining classes; and 2) predictors: variables that affect the dependent variable
(SCAG OD trips). Equation (5) illustrates the density of a general Latent Class Regression model
with more than one predictor as well as covariates.
$$f(y_i \mid z_{i1}^{cov}, z_{i2}^{cov}, z_{i1}^{pred}, z_{i2}^{pred}) = \sum_{x=1}^{K} P(x \mid z_{i1}^{cov}, z_{i2}^{cov})\, f(y_i \mid x, z_{i1}^{pred}, z_{i2}^{pred}) \qquad (5)$$
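The mixture density above can be evaluated with a short sketch. Two modeling choices are assumed here for illustration and are not stated in the paper: class membership follows a multinomial logit in the covariates, and each class has a normal regression density in the predictors. Parameter names and values are hypothetical.

```python
import math


def normal_pdf(y, mean, sd):
    """Normal density with the given mean and standard deviation."""
    z = (y - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def class_probabilities(z_cov, gammas):
    """P(x | z_cov) for each class x under a multinomial logit (assumed form).

    gammas: one (intercept, [slopes]) tuple per class.
    """
    scores = [g0 + sum(g * z for g, z in zip(gs, z_cov)) for g0, gs in gammas]
    top = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def mixture_density(y, z_cov, z_pred, gammas, betas, sds):
    """Sum over classes of P(x | z_cov) * f(y | x, z_pred)."""
    probs = class_probabilities(z_cov, gammas)
    density = 0.0
    for p, (b0, bs), sd in zip(probs, betas, sds):
        mean = b0 + sum(b * z for b, z in zip(bs, z_pred))
        density += p * normal_pdf(y, mean, sd)
    return density


# Two hypothetical classes with different Twitter-trip multipliers (1.0 and 2.0).
gammas = [(0.0, [0.0]), (0.0, [0.0])]  # equal class shares
betas = [(0.0, [1.0]), (0.0, [2.0])]   # class-specific regressions
sds = [1.0, 1.0]
density = mixture_density(1.0, z_cov=[0.3], z_pred=[1.0],
                          gammas=gammas, betas=betas, sds=sds)
```

Covariates enter only through the class shares and predictors only through the class-specific regressions, which mirrors the distinction between the two roles drawn in the text.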
This model can be estimated with a combination of the Expectation Maximization (EM) algorithm
and Newton-Raphson Maximum Likelihood Estimation. For a given set of starting values, the EM
algorithm first performs iterations until the change in the likelihood value falls below a
predetermined tolerance. Then, the estimation switches to the Newton-Raphson algorithm and
continues until a predefined convergence criterion is reached. Although
the combination of the two algorithms provides quite stable results, most latent class models are still
very sensitive to local maxima of the likelihood function (Vermunt and Magidson, 2002). In order
to address this issue we test multiple models with different sets of initial parameter values
(Goulias, 1999). Another issue is operational. Since the degrees of freedom
rapidly decrease as we increase the number of parameters, this may lead to a variety of operational
problems including failure of model identification (parameters cannot be estimated) or failure to
converge (parameter estimates at successive estimation steps are not close enough). Therefore, we
use a hierarchical iterative process to estimate this model as follows:
a) We start with a one-class assumption without covariates.
b) We proceed by increasing the number of classes until any parameter fails
to be identified and the size of the classes becomes too small to be meaningful.
c) We select the most suitable number of classes based on changes in goodness-of-fit criteria,
such as the Bayesian Information Criterion (BIC), the Akaike Information Criterion (AIC), and
the Consistent Akaike Information Criterion (CAIC), following McCutcheon (2002)
and Nylund et al. (2007); and we estimate a series of Latent Class Regression models
with different combinations of exogenous variables.
d) Then, we compare the models with different specifications and decide on the best model
based on multiple statistical goodness-of-fit measures, as in the previous step, as well as
classification errors and R-square values. A higher R-square indicates a better model
for predicting the endogenous variable, while a lower classification error means a better
model for classifying spatially homogeneous groups.
5. Results and Findings
As we described in the previous sections, we use two different types of regression models to
develop a method to validate travel demand model output with Social Media data. The Tobit model
with spatial lag variables was estimated with LIMDEP 9.0 (Greene, 1986 – 2007), and we use
Latent Gold 5.0 (Vermunt and Magidson, 2005 – 2014) to estimate the Latent Class Regression
models.
5.1. Tobit Model with Spatially Lagged Variables
A few preparatory tasks were undertaken to make sure the data matrices do not contain very large
or very small numbers. First, we replace the 425 unrealistic SCAG OD trips (less than one trip
between an OD pair) with zeros; then, we divide the number of residents, houses, and employees per
PUMA zone by 10,000. We also divide the area of each zone by 10 km² and measure the Euclidean
distances between zones in km. Lastly, we use people and houses per square kilometer as the
units of the population and housing densities.
Table 2 shows the Tobit regression coefficients estimated by the Maximum Likelihood
Estimator (left) as well as the marginal effects on the dependent variable y (Equation 4 and
Table 2, right-hand side). First of all, the regression coefficient of the Twitter OD trip is
β = 19.737, significant at the 0.001 level. This indicates that every trip from Twitter corresponds
to a larger number of SCAG OD trips, as expected, while accounting for other influencing factors.
All of the spatial lag variables are significantly different from zero at the 0.01 level. This means
that the spatial lags of both the exogenous and endogenous variables play significant roles and are
able to control for spatial autocorrelation in this model. In terms of demographic characteristics,
positive coefficients are found for the area of origin and destination zones, and for population
size and housing density at origins (significant at the 0.05 level), as expected. On the other hand,
negative coefficients are found for population densities at both origins and destinations,
presumably due to the zero-trip zones. Finally, the negative coefficient associated with the
Euclidean distance signifies that neighboring zones have more trip interchanges than zones further apart.
The marginal effects of the explanatory variables can be found on the right side of Table 2; they
are scaled values of the original coefficients. Since this effect represents the unit contribution of
each exogenous variable, we can identify the conversion rate of Twitter OD trips to SCAG OD trips. A
Twitter OD trip has a marginal effect of 12.507, meaning that we can expect approximately 12.5 SCAG
OD trips to be represented by one Twitter trip on average throughout the region.
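If the partial derivatives in Table 2 follow Equation 4, each marginal effect should equal its coefficient multiplied by one common probability factor. The sketch below checks this using a handful of coefficient and marginal-effect pairs copied from Table 2. The dictionary name and the selection of rows are choices made here for illustration.

```python
# (Tobit coefficient, partial derivative of expected value), copied from Table 2.
TABLE2_ROWS = {
    "T_OD": (19.737, 12.507),
    "S_Oadj_D": (0.219, 0.139),
    "T_Oadj_D": (-8.396, -5.320),
    "O_Population": (8.416, 5.333),
    "Euclidean distance": (-4.098, -2.597),
}

# Implied scale factor: marginal effect divided by coefficient.
scale = {name: me / coef for name, (coef, me) in TABLE2_ROWS.items()}

for name, factor in scale.items():
    print(f"{name:20s} {factor:.3f}")
```

The ratios all come out at roughly 0.63, which is consistent with the marginal effects being the coefficients scaled by a single probability of a nonzero outcome. Small differences between rows reflect the three-decimal rounding of the table entries.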
Table 2. Results of Spatial Lag Tobit Model
Columns 3 to 6: Estimated Coefficients from OLS. Columns 7 to 10: Partial Derivatives of Expected Values.
| Group | Independent Variable | Coef. | S.E. | b/St.Er. | p-value | Coef. | S.E. | b/St.Er. | p-value | Mean of X |
|---|---|---|---|---|---|---|---|---|---|---|
| Twitter OD | T_OD | 19.737 | 0.118 | 166.592 | 0.000 | 12.507 | 0.098 | 128.255 | 0.000 | 4.375 |
| Spatially Lagged Variables from SCAG OD | S_Oadj_Dadj | -0.017 | 0.002 | -7.940 | 0.000 | -0.011 | 0.001 | -7.933 | 0.000 | 6035.197 |
| | S_Oadj_D | 0.219 | 0.009 | 23.475 | 0.000 | 0.139 | 0.006 | 23.312 | 0.000 | 716.794 |
| | S_O_Dadj | 0.006 | 0.001 | 5.705 | 0.000 | 0.004 | 0.001 | 5.702 | 0.000 | 6035.197 |
| Spatially Lagged Variables from Twitter OD | T_Oadj_Dadj | 1.067 | 0.134 | 7.946 | 0.000 | 0.676 | 0.085 | 7.939 | 0.000 | 78.983 |
| | T_Oadj_D | -8.396 | 0.558 | -15.053 | 0.000 | -5.320 | 0.354 | -15.010 | 0.000 | 8.593 |
| | T_O_Dadj | -0.586 | 0.060 | -9.775 | 0.000 | -0.371 | 0.038 | -9.764 | 0.000 | 78.983 |
| Demographic Characteristics from US Census | O_AREA / 10 km² | 0.031 | 0.010 | 3.213 | 0.001 | 0.020 | 0.006 | 3.212 | 0.001 | 82.540 |
| | D_AREA / 10 km² | 0.042 | 0.010 | 4.394 | 0.000 | 0.027 | 0.006 | 4.393 | 0.000 | 82.540 |
| | O_Population / 10,000 | 8.416 | 2.404 | 3.500 | 0.001 | 5.333 | 1.524 | 3.500 | 0.001 | 16.102 |
| | D_Population / 10,000 | 2.736 | 2.387 | 1.146 | 0.252 | 1.734 | 1.513 | 1.146 | 0.252 | 16.102 |
| | O_Housing / 10,000 | -9.717 | 5.428 | -1.790 | 0.073 | -6.158 | 3.440 | -1.790 | 0.073 | 5.642 |
| | D_Housing / 10,000 | 2.059 | 5.328 | 0.386 | 0.699 | 1.305 | 3.377 | 0.386 | 0.699 | 5.642 |
| | O_POP_Density (person/km²) | -0.490 | 0.080 | -6.140 | 0.000 | -0.311 | 0.051 | -6.139 | 0.000 | 270.576 |
| | D_POP_Density (person/km²) | -0.216 | 0.080 | -2.709 | 0.007 | -0.137 | 0.050 | -2.709 | 0.007 | 270.576 |
| | O_Housing Density (house/km²) | 0.747 | 0.209 | 3.569 | 0.000 | 0.473 | 0.133 | 3.569 | 0.000 | 91.889 |
| | D_Housing Density (house/km²) | 0.258 | 0.209 | 1.236 | 0.216 | 0.163 | 0.132 | 1.236 | 0.216 | 91.889 |
| Land Use Characteristics from NETS dataset | O_Number of Employees (10,000) | 4.059 | 1.140 | 3.561 | 0.000 | 2.572 | 0.722 | 3.561 | 0.000 | 6.914 |
| | D_Number of Employees (10,000) | -1.259 | 1.099 | -1.145 | 0.252 | -0.798 | 0.697 | -1.145 | 0.252 | 6.914 |
| | O_Business Diversity | 6.148 | 24.258 | 0.253 | 0.800 | 3.896 | 15.373 | 0.253 | 0.800 | 3.501 |
| | D_Business Diversity | 8.197 | 24.257 | 0.338 | 0.735 | 5.194 | 15.372 | 0.338 | 0.735 | 3.501 |
| Distance | Euclidean Distance Between OD (km) | -4.098 | 0.173 | -23.756 | 0.000 | -2.597 | 0.110 | -23.663 | 0.000 | 21.424 |
5.2. Latent Class Regression Model
As mentioned above, we used a stepwise approach to develop the Latent Class Regression Model.
The first step is identifying a suitable number of classes describing this OD trip dataset. Similar to
the Tobit regression, we use SCAG OD trips as the dependent variable, and estimate a series of
Latent Class models (also called mixture regression models) starting with one class and increasing
the number of classes until we find an optimal model. No explanatory variable was added in this step
and seven models were estimated (Table 3). Although the model fit improves as we increase the
number of classes, the magnitude of improvement in the goodness-of-fit indices (BIC, AIC,
AIC3) decreases dramatically after the three-class model, approaching an asymptote. This indicates
that three classes are suitable to describe the distribution of SCAG OD trips. In other words, it is
possible to explain the heterogeneous nature of the SCAG trips efficiently with just three latent
classes representing three different groups of people in associating trips between the two sources
(Twitter and four-step). Therefore, the subsequent latent class regression models are estimated
with the three-class assumption.
Table 3. Results of Latent Class Models without Covariates
| Model | LL | BIC(LL) | AIC(LL) | AIC3(LL) | Npar | Class.Err. |
|---|---|---|---|---|---|---|
| 1-Class | -124282.7 | 248584.7 | 248569.5 | 248571.5 | 2 | 0.000 |
| 2-Class | -88644.0 | 177336.2 | 177298.0 | 177303.0 | 5 | 0.020 |
| 3-Class | -82891.7 | 165860.5 | 165799.4 | 165807.4 | 8 | 0.067 |
| 4-Class | -80851.1 | 161808.3 | 161724.2 | 161735.2 | 11 | 0.095 |
| 5-Class | -79904.7 | 159944.4 | 159837.4 | 159851.4 | 14 | 0.108 |
| 6-Class | -79382.3 | 158928.4 | 158798.5 | 158815.5 | 17 | 0.132 |
| 7-Class | -79100.7 | 158394.2 | 158241.3 | 158261.3 | 20 | 0.144 |
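The fit indices in Table 3 can be recomputed from the log-likelihood and the number of parameters using the standard definitions of BIC, AIC, AIC3, and CAIC. The sketch below does so, assuming the sample size is the 15,376 PUMA OD pairs (an assumption that appears consistent with the reported BIC values), and then lists how much BIC improves with each added class.

```python
import math

N_OBS = 15_376  # assumed sample size: 124 x 124 PUMA OD pairs

# Number of classes -> (log-likelihood, number of parameters), from Table 3.
TABLE3 = {
    1: (-124282.7, 2),
    2: (-88644.0, 5),
    3: (-82891.7, 8),
    4: (-80851.1, 11),
    5: (-79904.7, 14),
    6: (-79382.3, 17),
    7: (-79100.7, 20),
}


def information_criteria(loglik, n_par, n_obs):
    """Standard log-likelihood based fit indices (lower is better)."""
    deviance = -2.0 * loglik
    return {
        "BIC": deviance + n_par * math.log(n_obs),
        "AIC": deviance + 2 * n_par,
        "AIC3": deviance + 3 * n_par,
        "CAIC": deviance + n_par * (math.log(n_obs) + 1),
    }


fit = {k: information_criteria(ll, p, N_OBS) for k, (ll, p) in TABLE3.items()}

# Improvement in BIC gained by adding one more class.
bic_drop = {k: fit[k - 1]["BIC"] - fit[k]["BIC"] for k in range(2, 8)}

for k, drop in bic_drop.items():
    print(f"{k - 1} -> {k} classes: BIC improves by {drop:,.1f}")
```

The improvement shrinks sharply after the third class, which is the pattern the text uses to justify the three-class assumption.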
Covariates and predictors play different roles in estimating a Latent Class Regression Model, as
discussed earlier: covariates influence the latent classes, whereas predictors influence the
dependent variable. Since we use latent class analysis to capture spatial heterogeneity,
covariates in this model should be able to reflect spatial characteristics of Origins and Destinations.
All of our exogenous variables contain zonal information; therefore all of them can be used as
either covariates or predictors, or both. Since there is not yet a consensus in the literature on how
to define exogenous variables for this type of analysis, we conducted a few experiments with different
settings of the explanatory variables.
Table 4 shows the specifications of five latent class regression models and the estimated values of
the goodness-of-fit indices. All five models use SCAG OD trips as the dependent variable and Twitter
OD trips as a predictor with the three-latent-class assumption. In the first model, we use all the
land use and demographic characteristics as covariates along with all six spatial lag
variables. In other words, only Twitter OD trips are used as a predictor, and all other variables are
used to capture spatial heterogeneity. The model converged, but its BIC, AIC, and AIC3 are relatively
higher than those of the other models in the experiments that follow. Because we used many
variables as covariates but only one variable as a predictor, this model has the lowest
classification error and the lowest R-square value. This indicates the model was able to reflect
spatial characteristics well, but SCAG OD trips were not predicted well compared to the next set of
models. The second model has the opposite specification: all the explanatory variables are used as
predictors, and no covariates are used. This model also converges, with slightly higher values of the
goodness-of-fit indices (BIC, AIC, AIC3). As expected, its results are also opposite to those of the
first model; it is estimated with the largest classification error but the highest R-square value.
This shows that this model cannot capture the heterogeneous nature of space well, but predicts SCAG
OD trips well. In the third and fourth models, we divide the explanatory variables into two groups:
six spatial lag variables and fifteen land use and demographic variables. Then,
we use them as either covariates or predictors. The third model uses the spatial lag variables as
predictors and all other variables as covariates, and the fourth model does the opposite. In terms of
model fit indices, the third model produced better results, but the fourth model captures the
heterogeneous spatial structure and predicts SCAG OD trips better than the third model. Finally,
the fifth model has all the explanatory variables as both covariates and predictors. This model has
the best goodness-of-fit indices, the second lowest classification error, and the second highest
R-square value among the five models. However, the fifth model is estimated with almost twice
as many parameters as the fourth model, while the differences in classification error and
R-square are very small (0.0064 and 0.0026 respectively). Based on all the model evaluation criteria,
the fifth model seems to be the best model, but the fourth model can be the best if we take
parsimony into account.
Table 4. Results of Latent Class Regression Models with Different Settings of Exogenous Variables
Model  Covariates        Predictors        LL        BIC(LL)   AIC(LL)   AIC3(LL)  Npar  Class.Err.  R2
1      All               None              -69708.1  139927.2  139522.2  139575.2  53    0.0272      0.7649
2      None              All               -70606.2  141925.8  141360.4  141434.4  74    0.0683      0.8784
3      Land use & Demo.  Spatial Lag       -66382.9  133334.6  132883.8  132942.8  59    0.0438      0.8129
4      Spatial Lag       Land use & Demo.  -69645.0  139945.6  139426.1  139494.1  68    0.0384      0.8359
5      All               All               -62567.5  126311.1  125379.0  125501.0  122   0.0320      0.8385
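The trade-off between the fourth and fifth models can be checked directly from Table 4. The following sketch simply redoes that arithmetic; the dictionary layout and the function name `compare` are illustrative choices, not part of the original analysis.

```python
# Values copied from Table 4 for the fourth and fifth models.
MODELS = {
    4: {"npar": 68, "class_err": 0.0384, "r2": 0.8359, "bic": 139945.6},
    5: {"npar": 122, "class_err": 0.0320, "r2": 0.8385, "bic": 126311.1},
}


def compare(small, large):
    """Summarise what the larger model gains over the smaller one."""
    return {
        "npar_ratio": round(large["npar"] / small["npar"], 2),
        "class_err_gain": round(small["class_err"] - large["class_err"], 4),
        "r2_gain": round(large["r2"] - small["r2"], 4),
        "bic_drop": round(small["bic"] - large["bic"], 1),
    }


result = compare(MODELS[4], MODELS[5])
print(result)
```

The comparison makes the parsimony argument explicit: the parameter count nearly doubles, while classification error and R-square change only in the third decimal place, even though the information criteria still favor the fifth model.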
As mentioned earlier, the fourth model was also estimated with three classes, and their
estimated membership proportions are reported in Figure 4. The largest proportion of the sample
was found in the first class (65%, 10,145 OD trips), followed by the second class (28%, 4,190 OD
trips) and the third class (7%, 1,041 OD trips). Because we use spatial lag variables as
covariates, these latent classes represent homogeneous groups of OD flows with respect to
their neighbors’ OD flow patterns. The right-hand side of Figure 4 illustrates the means of the
covariates for each class. The first class has relatively small numbers of OD trips and of
neighbors’ OD trips. The second class is characterized by a moderate level of SCAG trips and
neighbors’ OD trips, and the third class consists of the OD pairs with the largest numbers of SCAG
OD trips and neighbors’ interactions.
Figure 4. Proportion of Latent Classes and Mean of Covariates
Figure 5 shows a set of maps of the Greater Los Angeles area with bar charts indicating the
number of trips, with origins in red and destinations in blue. The first map (a) shows the total
number of SCAG OD trips, and the other three show the SCAG OD trips in each latent class. In this
way, maps (b), (c), and (d) show the geographical differences among the OD trips that belong to
latent classes 1, 2, and 3, respectively. The first class OD pairs are more densely concentrated in
the City of Los Angeles area, whereas the second and third classes mostly consist of OD pairs in
the suburban areas of Los Angeles. These maps also show spatial clusters of zones whose OD trip
patterns are similar to their neighbors’ trip patterns. This classification also reflects the
effect of the size and location of the zones, because those were captured via the spatial lag
variables used as covariates. This is the most important demonstration that we successfully
captured the spatial heterogeneity of OD trips using latent class analysis, and we can now estimate
conversion coefficients between Twitter OD trips and SCAG OD trips that account for this
otherwise unobserved spatial heterogeneity.
Figure 5. Geographical Distribution of Latent Classes of OD trips from SCAG model
Table 5 shows the latent class-specific coefficients of the predictors as well as Wald statistics
for the test of whether the coefficients differ between the latent classes. Coefficients marked
with an asterisk are significant at the 5% level. Significant predictors of SCAG OD trips include
the classes themselves (i.e., the overall class-specific averages are different). Based on the Wald
test statistics, all of the predictors differ among the latent groups except for the housing-related
variables and population size in origins. Most importantly, the coefficients of Twitter trips
turned out to be very different. The smallest unit contribution of Twitter OD trips was found in
the first class and the largest in the second class (7.085 and 37.438, respectively). This means
that when we validate model-based OD trips, a Twitter-based OD trip should be used in a different
way depending on the underlying spatial structure. This result also shows the necessity of a
methodology that can reflect the heterogeneous nature of geography and of the people living in
different geographies. In terms of directional effects (i.e., positive or negative coefficients),
all of the explanatory variables show the same type of influence on the dependent variable
regardless of the class membership.
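To make the class-dependent conversion concrete, the sketch below evaluates only the Twitter term of the class-specific regressions, using the Twitter OD coefficients from Table 5. It is not a full prediction: intercepts and all other predictors are deliberately left out, and the function name `twitter_term` and the example count of 10 Twitter trips are hypothetical.

```python
# Class-specific coefficients of Twitter OD trips (T_OD), copied from Table 5.
TWITTER_COEF = {1: 7.085, 2: 37.438, 3: 15.149}


def twitter_term(n_twitter_trips, latent_class):
    """Contribution of the Twitter term alone to the fitted SCAG OD trips.

    Intercepts and the remaining predictors are omitted on purpose, so this
    is only one component of the fitted value for an OD pair.
    """
    return TWITTER_COEF[latent_class] * n_twitter_trips


# Hypothetical OD pair with 10 observed Twitter trips, evaluated per class.
for latent_class in (1, 2, 3):
    print(latent_class, round(twitter_term(10, latent_class), 2))
```

The same Twitter count maps to very different model-trip contributions depending on the latent class, which is the reason a single conversion factor would be misleading.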
Although the first latent class regression model was estimated with the lowest R-square value
(0.5964) among the classes, it shows a variety of significant predictors (11 in total), and their
signs are the same as in the output of the Tobit model in the previous section. However, the
magnitudes of the coefficients are smaller than in the Tobit model results. For example, the unit
contribution of a Twitter OD trip for this latent group was 7.085 (a difference of 5.422 from the
Tobit estimate). In addition, although the diversity of business establishments was not a
significant explanatory variable in the Tobit model, the diversities of both origins and
destinations played significant roles in this model. Also, higher numbers of business
establishments in origins and destinations indicate more trips between two zones.
The same number of significant predictors was found for the second class, which has the highest
R-square value (0.9902), but the set of significant predictors differs from that of the first
class: two predictors drop out, and two others enter. As mentioned earlier, the highest coefficient
of Twitter OD trips across all the classes was found in this class (37.438). This can be
interpreted to mean that Twitter-based OD trips in this spatial area can be converted into
four-step-like OD trips in a reliable way. The coefficients in this second class are also higher
than those in the first class, meaning that differences in land use and demographics in OD pairs
imply more trips when compared to the first class ODs.
The third class regression model was estimated with an R-square of 0.7537 and five significant
predictor variables. Fewer significant coefficients were found in this class than in the other
classes, presumably because it has the smallest sample size among the classes. This model
produced the unit contribution of a Twitter OD trip closest to that of the Tobit model. Two
coefficients in the third class were much larger in magnitude than in the first and second classes:
Euclidean distance between origins and destinations, and housing size in destinations (from the
second to the third class, -13.481 → -224.131 and 2.108 → 159.056, respectively). This is
presumably because a large proportion of these OD pairs are related to a certain area (Ventura
County), so that their locational and demographic characteristics outweigh the small number of
third class members (Ventura County is located in the north-west corner of the SCAG area and has
bigger zones with a larger housing supply).
Finally, the spatial lag variables played important roles as covariates in this model; the
coefficients can be found in Table 6. Based on the Wald statistics, the number of trips from the
neighboring areas to the destination in the SCAG model was the most important variable for
classifying the latent classes, followed by the two other spatial lag variables from the SCAG data.
This can also be verified with Figure 4, which shows the means of the covariates.
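The role of the covariates can be sketched as a class membership model. The example below assumes the usual multinomial logit form for class membership and uses the SCAG-based covariate coefficients from Table 6. The class intercepts are not reported in this section, so they are set to zero as placeholders, and the covariate values passed in are hypothetical; the resulting probabilities are therefore illustrative only.

```python
import math

# Covariate coefficients for classes 1 to 3, copied from Table 6 (SCAG-based lags).
GAMMA = {
    "S_Oadj_Dadj": (0.001, 0.000, -0.001),
    "S_Oadj_D": (-0.006, 0.002, 0.003),
    "S_O_Dadj": (-0.001, 0.000, 0.001),
}

# Assumption: intercepts are not reported here, so zeros are used as placeholders.
INTERCEPTS = (0.0, 0.0, 0.0)


def class_probabilities(covariates):
    """Multinomial logit membership probabilities for the three classes."""
    scores = list(INTERCEPTS)
    for name, value in covariates.items():
        for c in range(3):
            scores[c] += GAMMA[name][c] * value
    top = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical OD pairs with few and many SCAG trips from neighboring origins.
low = class_probabilities({"S_Oadj_D": 100.0})
high = class_probabilities({"S_Oadj_D": 1000.0})
print([round(p, 3) for p in low])
print([round(p, 3) for p in high])
```

With these placeholder intercepts, a larger neighboring-origin flow shifts probability away from the first class and toward the second and third classes, which matches the ordering of the classes by trip volume described above.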
Table 5. Class-specific Coefficients of Predictors
                                  Class 1              Class 2              Class 3              Between classes
Independent variable              Coef.     z-value    Coef.     z-value    Coef.     z-value    Wald(=)   p-val.  Mean     Std.Dev.
Twitter OD
  T_OD                            7.085     17.481*    37.438    569.131*   15.149    38.855*    8463.340  0.000   16.110   13.407
Demographic Characteristics from US Census
  O_AREA / 10km2                  0.005     21.688*    1.316     31.524*    1.226     6.418*     994.395   0.000   0.457    0.617
  D_AREA / 10km2                  0.006     27.162*    0.211     8.236*     1.374     8.325*     125.089   0.000   0.160    0.347
  O_Population / 10,000           0.099     1.702      1.565     2.124*     18.769    0.596      4.673     0.097   1.829    4.720
  D_Population / 10,000           0.155     2.695*     2.840     4.349*     8.468     0.262      16.955    0.000   1.491    2.262
  O_Housing / 10,000              0.220     1.715      0.510     0.275      101.979   1.383      2.028     0.360   7.504    26.077
  D_Housing / 10,000              0.538     4.281*     2.108     1.301      159.056   2.048*     5.297     0.071   12.197   40.541
  O_POP_Density (person/km2)      -0.006    -2.658*    0.004     0.217      -1.538    -1.536     2.440     0.300   -0.111   0.394
  D_POP_Density (person/km2)      -0.006    -3.079*    -0.080    -4.153*    -1.092    -1.055     15.833    0.000   -0.103   0.275
  O_Housing Density (House/km2)   0.008     1.426      -0.028    -0.548     0.649     0.258      0.518     0.770   0.043    0.168
  D_Housing Density (House/km2)   0.002     0.467      -0.023    -0.446     -0.732    -0.279     0.325     0.850   -0.057   0.187
Land Use Characteristics from NETS dataset
  O_Number of Employees (10,000)  0.462     14.446*    3.422     14.804*    -0.538    -0.058     162.704   0.000   1.216    1.394
  D_Number of Employees (10,000)  0.055     1.829      -1.457    -5.945*    -24.651   -2.132*    41.785    0.000   -2.116   6.256
  O_Business Diversity            5.581     6.607*     20.289    2.639*     596.274   1.564      5.953     0.051   51.494   150.509
  D_Business Diversity            5.736     6.735*     19.611    2.682*     627.857   1.623      6.116     0.047   53.641   158.612
Distance
  Euclidean Distance Between OD   -0.651    -55.628*   -13.481   -30.853*   -224.131  -12.874*   1070.597  0.000   -20.045  56.615
Table 6. Class-specific Coefficients of Covariates
Independent variable              Class 1   z-value    Class 2   z-value    Class 3   z-value    Wald      p-value
Spatially Lagged Variables from SCAG OD
  S_Oadj_Dadj                     0.001     14.460*    0.000     -11.142*   -0.001    -16.499*   298.773   0.000
  S_Oadj_D                        -0.006    -22.032*   0.002     18.460*    0.003     23.994*    599.306   0.000
  S_O_Dadj                        -0.001    -15.710*   0.000     14.372*    0.001     16.606*    293.315   0.000
Spatially Lagged Variables from Twitter OD
  T_Oadj_Dadj                     -0.036    -7.623*    0.015     6.354*     0.021     8.570*     89.493    0.000
  T_Oadj_D                        -0.003    -0.141     0.001     0.125      0.002     0.151      0.024     0.990
  T_O_Dadj                        0.008     2.413*     -0.004    -2.104*    -0.004    -2.677*    10.681    0.005
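The p-values attached to the Wald statistics in Tables 5 and 6 can be reproduced approximately. The sketch below assumes each test of equality across the three classes is referred to a chi-square distribution with 2 degrees of freedom (number of classes minus one); the degrees of freedom are an assumption, since they are not stated in the tables. For 2 degrees of freedom the upper-tail probability has the closed form exp(-W/2).

```python
import math


def wald_p_value(wald):
    """Upper-tail probability of a chi-square variable with 2 degrees of freedom.

    For df = 2 the survival function reduces to exp(-x / 2).
    """
    return math.exp(-wald / 2.0)


# (Wald statistic, reported p-value) pairs copied from Tables 5 and 6.
REPORTED = {
    "O_Population": (4.673, 0.097),
    "D_Housing": (5.297, 0.071),
    "O_Business Diversity": (5.953, 0.051),
    "D_Business Diversity": (6.116, 0.047),
    "T_O_Dadj": (10.681, 0.005),
}

for name, (wald, p_reported) in REPORTED.items():
    print(f"{name}: computed {wald_p_value(wald):.3f}, reported {p_reported:.3f}")
```

Under this assumption the computed values agree with the tabulated p-values to the reported precision, which also confirms that 5.297 in the D_Housing row is a Wald statistic rather than a coefficient.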
6. Summary and Conclusion
In this paper, a popular social media data source, geo-tagged Twitter data, is used to understand
its relationship with a traditional travel demand model, along with the US Census dataset (2010)
and the NETS Database (2010). Two days of geo-tagged Twitter data for the Greater Los Angeles
region were used in this analysis. We successfully extracted 67,266 trips from about 54,272 people
and aggregated them into Public Use Microdata Areas (PUMAs) to avoid the zero-cell problem. We
compared the Twitter-based OD matrix with a recent OD matrix (4-step model output) provided by the
local Metropolitan Planning Organization (MPO). Regression techniques are used to estimate
correlations between the traditional travel demand model ODs and Twitter-derived ODs. This type
of regression using these datasets presents two known issues: censoring of the dependent variable
distribution and spatial autocorrelation. To address them, we used a Tobit model with spatial
lag variables to develop an unbiased conversion method between Twitter trips and Four-step
Travel Demand Model output. We also used latent class regression models to take into account
the heterogeneous nature of space. Finally, multiple unit contributions of a Twitter OD trip were
estimated with the two types of regression models. The Tobit model produced a single unit
contribution of a Twitter trip, whereas the latent class regression model produced three different
unit contributions depending on spatial structure.
In summary, our approach provides a methodology to convert the OD matrix extracted from social
media data into the one created with a conventional travel demand model. The OD extraction
algorithm and spatial aggregation techniques play pivotal roles in making this comparison possible.
In addition, this paper develops methods to define spatially lagged variables for OD trips (which
are doubly dependent on space), and those variables efficiently control for spatial
autocorrelation. The Tobit model and the latent class regression model provide a framework to
handle the censored distribution of the dependent variable and to capture spatial heterogeneity.
Although our approach provides a conversion method based on spatial aggregation, a better way to
address the zero-cell problem and to represent travel demand in space and time is to collect
larger amounts of social media data and then extract larger numbers of OD trips that spread
throughout the region, so that even less populated zones are represented with a sufficient
number of trips to analyze them. Therefore, our immediate next step is to collect more social
media data over longer periods and then apply this approach with a larger zoning system, such as
Traffic Analysis Districts, and with OD matrices for different times of day. In addition,
activity-based travel demand models provide a series of OD matrices by type of activity, and these
could be compared to social media data using a variety of filters that may yield more fine-grained
information.
Acknowledgement
Dr. Hsi-Hwa Hu from the Southern California Association of Governments is acknowledged for
providing us with the SCAG OD matrix used here. Funding for this research was provided by the
Multicampus Research Program Initiative on Sustainable Transportation, the University of California
Transportation Center (UCTC), the University of California Center on Economic Competitiveness in
Transportation (UCCONNECT), and grants of the University of California Santa Barbara Office of
Research and the College of Letters and Science.
References
Anselin, L. (1988). Spatial econometrics: methods and models (Vol. 4). Springer Science &
Business Media.
Auld, J., Williams, C., Mohammadian, A., & Nelson, P. (2009). An automated GPS-based
prompted recall survey with learning algorithms. Transportation Letters, 1(1), 59-79.
Cebelak, Meredith Kimberly, 2013. Location-based social networking data: doubly-constrained
gravity model origin-destination estimation of the urban travel demand for Austin, TX. Master’s
Thesis, The University of Texas at Austin.
Coffey, C., & Pozdnoukhov, A. (2013, November). Temporal decomposition and semantic
enrichment of mobility flows. In Proceedings of the 6th ACM SIGSPATIAL International
Workshop on Location-Based Social Networks (pp. 34-43). ACM.
Collins, Craig, Hasan, Samiul, Ukkusuri, Satish V., 2013. A novel transit rider satisfaction
metric: rider sentiments measured from online social media data. J. Public Transport. 16 (2).
Cottrill, C. D., Pereira, F. C., Zhao, F., Dias, I. F., Lim, H. B., Ben-Akiva, M. E., & Zegras, P. C.
(2013). Future Mobility Survey. Transportation Research Record: Journal of the Transportation
Research Board, 2354(1), 59-67.
Deutsch-Burgner K and K.G. Goulias (2014) Measuring Heterogeneity in Spatial Perception for
Activity and Travel Behavior Modeling. Paper submitted for presentation at the 94th Annual
Meeting of the Transportation Research Board, Washington, D.C., January 11-15, 2015. Also
published as GEOTRANS Report 2014-8-07, Santa Barbara, CA.
Fan, Y., Chen, Q., Liao, C. F., & Douma, F. (2012, November). UbiActive: Smartphone-Based
Tool for Trip Detection and Travel-Related Physical Activity Assessment. In Submitted for
Presentation at the Transportation Research Board 92nd Annual Meeting.
Gao, S., Yang, J-A., Yan, B., Hu, Y., Janowicz, K., McKenzie, G. (2014) Detecting Origin-
Destination Mobility Flows From Geotagged Tweets in Greater Los Angeles Area. Eighth
International Conference on Geographic Information Science (GIScience '14). Vienna, Austria.
Goulias, K. G. and R. Kitamura (1993) Analysis of binary choice frequencies with limit cases:
Comparison of alternative estimation methods and application to weekly household mode choice.
Transportation Research Part B: Methodological, 27(1), pp. 65-78.
Goulias K.G., R.M. Pendyala, and C. R. Bhat (2013) Chapter 2: Keynote - Total Design Data
Needs for the New Generation Large-Scale Activity Microsimulation Models. In Transport
Survey Methods: Best Practice for Decision Making (eds J. Zmud, M. Lee-Gosselin, M.
Munizaga, and J. A. Carasco). Emerald Group Publishing, Bingley, UK.
Goulias, K. G., Bhat, C. R., Pendyala, R. M., Chen, Y., Paleti, R., Konduri, K. C., ... & Hu, H. H.
(2012, January). Simulator of activities, greenhouse emissions, networks, and travel
(SimAGENT) in Southern California. In 91st annual meeting of the Transportation Research
Board, Washington, DC.
Hasan, S., & Ukkusuri, S. V. (2014). Urban activity pattern classification using topic models
from online geo-location data. Transportation Research Part C: Emerging Technologies, 44, 363-
381.
Leiman, J. M., Bengelsdorf, T., Faussett, K., & Transportation Research Board. (2006).
Household Travel Surveys: Using Design Effects to Compare Alternative Sample Designs.
In 85th Annual Meeting of the Transportation Research Board, Washington, DC.
Maddala, G. S. (1986). Limited-dependent and qualitative variables in econometrics (No. 3).
Cambridge University Press.
Monzon, J., K. G. Goulias, and R. Kitamura (1989) Trip Generation Models for Infrequent Trips.
Transportation Research Record, 1220, pp. 40-46.
Nitsche, P., Widhalm, P., Breuss, S., & Maurer, P. (2012). A strategy on how to utilize
smartphones for automatically reconstructing trips in travel surveys. Procedia-Social and
Behavioral Sciences, 48, 1033-1046.
Pew Research Internet Project, 2012, Twitter Use 2012,
http://www.pewinternet.org/2012/05/31/twitter-use-2012
Shirima, K., Mukasa, O., Schellenberg, J. A., Manzi, F., John, D., Mushi, A., ... & Schellenberg,
D. (2007). The use of personal digital assistants for data entry at the point of collection in a large
household survey in southern Tanzania. Emerging themes in epidemiology, 4(1), 5.
Southern California Association of Governments (SCAG), 2012, 2012-2035 Regional
Transportation Plan/Sustainable Communities Strategy (RTP/SCS), Adopted April 2012,
http://rtpscs.scag.ca.gov/Documents/2012/final/f2012RTPSCS.pdf ,
http://rtpscs.scag.ca.gov/Documents/2012/draft/SR/2012dRTP_TransportationConformityReport
Turner, S. M. (1996). Advanced techniques for travel time data collection. Transportation
Research Record: Journal of the Transportation Research Board, 1551(1), 51-58.
Vermunt, J. K. and J. Magidson (2013). Technical Guide for Latent GOLD 5.0: Basic, Advanced,
and Syntax. Belmont, MA: Statistical Innovations Inc.
Xu, X., & Lee, L. F. (2014). Maximum likelihood estimation of a spatial autoregressive Tobit
model. Manuscript, Department of Economics, Ohio State University.