Can Twitter data be used to validate travel demand models?

Jae Hyun Lee

Graduate Student Researcher, Department of Geography and GeoTrans Lab

University of California, Santa Barbara

Santa Barbara, CA 93106

Phone: 805-284-7835

Email: [email protected]

Song Gao

Graduate Student Researcher, Department of Geography and STKO Lab



Phone: 805-980-8424

Email: [email protected]

Konstadinos G. Goulias

Professor, Department of Geography and GeoTrans Lab



Phone: 805-284-1597

E-mail: [email protected]

Number of Words = 6,855

Number of Tables and Figures = 11*250 = 2,750

Total equivalent = 9,605

Paper submitted for 95th Annual Transportation Research Board Meeting, Washington, D.C.,

January 10-14, 2016

Jae Hyun Lee*

Song Gao

Konstadinos G. Goulias

*corresponding author ([email protected])

Department of Geography and GeoTrans Lab.

University of California Santa Barbara

Santa Barbara, CA 93106-4060

Abstract

In this paper we use Twitter data and a recently developed algorithm at the University of California

Santa Barbara to extract Origin-Destination pairs in the Greater Los Angeles metropolitan area

known as the Southern California Association of Governments (SCAG) region. This algorithm

contains two steps: individual-based trajectory detection and place-based trip aggregation. In

essence, if a person tweeted in different TAZs within 4 hours, it is considered to be one OD-trip.

The extracted OD-trips were aggregated into 30 minute intervals. Then, we compare these trips

with a traditional travel demand model (SCAG, 2012, 4-step model). Substantial heterogeneity is

found due to zero trip zones and a variety of social factors including the tweeting demographics.

In this paper we illustrate the results from a Tobit model and a three-class latent class regression

model that convert tweet-derived trips to four-step trips accounting for zonal and trip-maker

heterogeneity. In these regression models we use measures of business density and diversity, and

population density as added explanatory/control variables, so that a unit contribution of a tweet

trip can be adjusted by land-use effects and the zero trip producing zones in the Twitter data can be

explained in a more complete way. Preliminary results are encouraging and show the usefulness

of harvested large-scale mobility data from location-based social media. The results also show the

added value of latent class regression models in this experiment. The paper concludes with a

review of next steps.

1. Introduction

Household travel survey data have been playing the most significant role in travel behavior

research for many years. Although information extracted from this source allows various levels of

governments to develop their transportation plans, the cost of data collection tends to increase over

time (Leiman et al., 2006). Furthermore, recently developed modeling and simulation approaches

for travel demand forecasting require even more detailed information, which may further increase survey

costs (Goulias et al., 2011). One of the factors increasing the cost is nonresponse due to the effects

of respondent burden. Thus, there have been diverse attempts to reduce the respondent burden

with a variety of new technologies including Global Positioning System Loggers, Computer-based

Survey System(s), Smartphone Applications, Personal Digital Assistants (PDA), and Car

Navigation Systems (Tuner, 1996; Shirima et al., 2007; Auld et al., 2009; Fan et al., 2012; Nitsche

et al., 2012; Cottrill et al., 2013).

More recently, online Social Media Services (e.g. Facebook, Foursquare, Twitter) have received

much more attention from transportation research as a potential source of data for travel behavior

analysis (Collins et al., 2013; Cebelak 2013; Hassan and Ukkusuri, 2014; Gao et al., 2014). Collins

et al. (2013) used about 500 Twitter texts to evaluate transit riders’ satisfaction with a Sentiment

Strength Detection Algorithm. Cebelak (2013) focused on Foursquare, and she was able to

reconstruct a zonal Origin-Destination matrix based on the check-in counts for Austin, Texas.

Another travel analysis example with Foursquare data was given by Hassan and Ukkusuri (2014).

They employed an activity pattern model to infer the latent pattern of weekly activities with

geo-location data. Similarly, but with different algorithms, Coffey and Pozdnoukhov (2013) used tweets

to study temporal and spatial patterns of activity participation.

A more recent algorithm and application tested a new methodology to extract Origin-Destination

pairs from Twitter data for the Greater Los Angeles metropolitan area known as the Southern

California Association of Governments (SCAG) region (Gao et al., 2014). This algorithm contains

two steps: individual-based trajectory detection and place-based trip aggregation. In essence, if a

person tweeted in different TAZs within 4 hours, it is considered to be one OD-trip. The extracted

OD-trips were aggregated into 30 minute intervals. Then, these trips were compared with the

commuting data in the American Community Survey (ACS) for validation. Among the Weekday,

Weekend, and Christmas datasets, Weekday data has the most similar temporal distribution of trips

(Pearson correlation coefficient=0.91, p=0.002). This high correlation may be misleading, however, because of

extreme spatial heterogeneity arising from different behavior among groups of residents in each

spatial unit analyzed.
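The two-step rule described above (consecutive tweets by the same person in different zones within 4 hours form one trip, and trips are then binned into 30-minute intervals) can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of Gao et al. (2014): the tuple layout, the function names, and the choice to bin by departure time are all hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta

MAX_GAP = timedelta(hours=4)   # trip window stated in the text
BIN_MINUTES = 30               # aggregation interval stated in the text


def extract_od_trips(tweets):
    """Return a list of (origin_zone, destination_zone, departure_time).

    `tweets` is an iterable of (user_id, timestamp, zone_id) tuples
    (hypothetical layout). Two consecutive tweets by the same user count
    as one trip when they fall in different zones at most MAX_GAP apart.
    """
    trips = []
    last_seen = {}  # user_id -> (timestamp, zone_id) of the previous tweet
    for user, ts, zone in sorted(tweets, key=lambda t: (t[0], t[1])):
        prev = last_seen.get(user)
        if prev is not None:
            prev_ts, prev_zone = prev
            if zone != prev_zone and ts - prev_ts <= MAX_GAP:
                trips.append((prev_zone, zone, prev_ts))
        last_seen[user] = (ts, zone)
    return trips


def aggregate_by_interval(trips):
    """Count trips per (origin, destination, 30-minute bin of departure)."""
    counts = Counter()
    for origin, dest, ts in trips:
        bin_index = (ts.hour * 60 + ts.minute) // BIN_MINUTES
        counts[(origin, dest, bin_index)] += 1
    return counts


sample = [
    ("u1", datetime(2013, 12, 10, 8, 5), "A"),
    ("u1", datetime(2013, 12, 10, 9, 40), "B"),   # A -> B within 4 h: trip
    ("u1", datetime(2013, 12, 10, 15, 0), "C"),   # gap longer than 4 h: no trip
    ("u2", datetime(2013, 12, 10, 8, 10), "A"),
    ("u2", datetime(2013, 12, 10, 8, 50), "A"),   # same zone: no trip
]
trips = extract_od_trips(sample)
od_counts = aggregate_by_interval(trips)
```

In this toy input only the first pair of tweets by `u1` qualifies as a trip, which shows how sparse the resulting OD counts can be relative to the number of tweets.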

In this paper, we expand this research using the algorithm of Gao et al. (2014) by comparing the

spatial distribution of Twitter-based trips on weekdays with other more widely used travel demand

sources of information. In essence, the objective of this paper is to validate the social-media-based

travel data by comparing them with the results of a traditional travel demand model (SCAG, 2012,

4-step model). In doing so, we use a four-step approach: 1) Harvest geotagged tweets and extract

OD-pairs; 2) Examine spatial zoning levels for aggregating OD pairs for the comparison; 3) Test

the correlation between Twitter-based trips and trip-based (four-step) travel demand model trips

and develop spatial distributions of travel; and 4) Create a conversion method that builds OD

matrices from tweets in regression models that account for spatial heterogeneity.

2. Data Used

In this analysis, we used datasets from four different sources: 1) Social Media dataset

collected from Twitter; 2) Output of a traditional travel demand model; 3) Census data; and 4)

Business Establishments data to capture land use.

2.1. Twitter Data

The Twitter data used here were collected on December 10 and 11, 2013. In this

period, 355,059 geo-tagged tweets from 54,272 people, filtered by mobile device source, were

used to extract OD-trips. In this way, we were able to ensure that the locations of tweets reflect

individuals’ physical locations rather than social-robots’ locations or default locations

(hometown). We used the OD trip extraction algorithm developed by Gao et al. (2014), which

requires the use of a spatial zoning system to detect trips from individual-based trajectory because

the algorithm recognizes a trip if individuals’ geo-locations belong to different zones at different

times of a day. In this analysis, we used the smallest zoning system (US Census Block, 203,191

zones) to capture even short distance trips from Twitter data. As a result, a total of 67,266 trips

were successfully extracted covering the entire Greater Los Angeles area. Each of these trips has

its origin in one zone and its destination in the same or another zone, and in this way they provide an

origin-destination matrix (OD) as used in more traditional travel demand models.

2.2. Four-step Travel Demand Model Output

The Southern California Association of Governments is the primary planning agency in the Greater Los

Angeles area and has been developing travel demand forecasting models since 1967. The latest

models were originally developed and validated in 2008, and regularly updated to reflect

demographic changes as well as travel conditions. A traditional four-step model was used with

4,109 Traffic Analysis Zones (TAZs) covering the entire SCAG region, and the model consists of

seven modules (extended four-step): Household Classification and Population Synthesizer, Auto

Ownership Model, Trip Generation Model, Trip Distribution Models, Mode Choice Models,

Heavy Duty Truck (HDT) Model, and Network Assignment Model. As a result, a balanced vehicle

Origin-Destination matrix was created with a total of 40,392,636 vehicle trips (updated in 2012). In

this paper, we use the matrix with the smallest number of OD trips (the evening OD matrix,

2,788,173 vehicle trips) to validate the Twitter-based Origin-Destination matrix.

2.3. 2010 US Census Data & National Establishment Time-Series (NETS) Database

We use a variety of characteristics of origin and destination regions to explain OD trips.

Population size and housing units were obtained from the 2010 US Census dataset, and population

and housing density are also computed with this dataset. In addition, by using the National

Establishment Time-Series Database of 2010 we enumerate the Business Establishments in each

zone to create variables such as total number of employees in each zone and a diversity indicator

computed with the Shannon Index (Equation 1), where $p_i$ denotes the proportion of employees

belonging to industry type $i$, based on the two-digit Standard Industrial Classification (SIC) code, and $R$ is the number of industry types. In this way we

classify employees into each category and compute the diversity indicator.

$$H' = -\sum_{i=1}^{R} p_i \ln p_i \qquad (1)$$
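As a worked illustration of the diversity indicator, the Shannon Index can be computed from employee counts per industry category. The dictionary keys below are hypothetical two-digit codes and the counts are invented for the example; only the formula comes from the text.

```python
import math


def shannon_index(employees_by_industry):
    """Shannon diversity H' = -sum(p_i * ln p_i) over industries with employees."""
    total = sum(employees_by_industry.values())
    if total == 0:
        return 0.0  # no employees in the zone: treat diversity as zero
    h = 0.0
    for count in employees_by_industry.values():
        if count > 0:  # empty categories contribute nothing (p ln p -> 0)
            p = count / total
            h -= p * math.log(p)
    return h


# Hypothetical zone with employees in three two-digit industry groups.
zone = {"58": 50, "73": 30, "80": 20}
diversity = shannon_index(zone)

# Four equally sized industries give the maximum for R = 4, i.e. ln(4).
even = shannon_index({"58": 10, "73": 10, "80": 10, "15": 10})
```

A zone dominated by a single industry scores close to zero, while a zone whose employment is spread evenly over R industries reaches the upper bound ln(R).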

2.4. Spatial Aggregation

Since the sources of the four datasets are different, their spatial scales are also different from each

other (Table 1). In order to develop a conversion method that builds OD matrices from tweets, we

need to use a suitable spatial zoning system to avoid bias due to too many zero Twitter-based OD

entries. In addition, the zoning system should be applicable in a consistent way for all the data.

Since the SCAG travel demand model uses the most aggregate zonal system, we first attempted to

use the Traffic Analysis Zone (TAZ) level (4,109 zones, 16,883,881 OD pairs). This was possible

because TAZs are created from US Census Blocks and are also used to create Traffic Analysis

Districts (TADs) and Public Use Microdata Areas (PUMAs) (US Census Bureau, Figure 1, Top).

However, the TAZ system was still too fine to compare Twitter data with SCAG Model output,

because even at this level of aggregation we have a large number of zero Origin-Destination pairs

due to the smaller sample size of Twitter trips; the ratio between Twitter trips and TAZ OD pairs

was 0.004:1 (i.e., 67,266 Twitter trips, 16,883,881 OD pairs in the TAZ system). After a few trials

of aggregation we chose the Public Use Microdata Areas level (PUMA) to compare spatial

distributions of OD pairs (Figure 1, Bottom). Finally, we were able to create two 124-by-124

Origin-Destination matrices covering the entire SCAG region, and the ratio between trips and OD pairs

was 4.375:1 (i.e., 67,266 Twitter trips, 15,376 OD pairs in the PUMA system).
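The aggregation step and the trips-per-OD-pair ratio can be sketched as below. The block-to-PUMA lookup and the toy trip counts are hypothetical; the trip total and the zone counts are the ones reported in the text, and an n-zone system is assumed to have n × n OD pairs.

```python
from collections import Counter


def aggregate_od(trips, zone_to_puma):
    """Re-key (origin, destination) trip counts from fine zones to PUMAs."""
    puma_counts = Counter()
    for (origin, dest), n in trips.items():
        puma_counts[(zone_to_puma[origin], zone_to_puma[dest])] += n
    return puma_counts


def trips_per_od_pair(n_trips, n_zones):
    """Average number of trips per cell of an n_zones-by-n_zones OD matrix."""
    return n_trips / (n_zones * n_zones)


# Hypothetical block-level trips and block-to-PUMA lookup.
block_trips = {("b1", "b2"): 3, ("b1", "b3"): 1, ("b4", "b2"): 2}
lookup = {"b1": "P1", "b2": "P2", "b3": "P2", "b4": "P1"}
puma_trips = aggregate_od(block_trips, lookup)

ratio_taz = trips_per_od_pair(67_266, 4_109)   # TAZ system
ratio_puma = trips_per_od_pair(67_266, 124)    # PUMA system
```

Coarser zones collapse many sparse cells into fewer, denser ones, which is why the ratio rises by roughly three orders of magnitude when moving from TAZs to PUMAs.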

Table 1. Data Vintage and Spatial Zoning System

Name of Dataset       Vintage   Spatial Zoning System     Number of Units
Twitter Data          2013      Census Block              203,191
SCAG Output           2012      Traffic Analysis Zones    4,109
Census Data           2010      Census Block              203,191
NETS Business Data    2010      Latitude, Longitude       2,509,672
PUMA Dataset          2010      PUMAs                     124

Figure 1. Spatial aggregation (Top) and PUMAs zone system for SCAG region (Bottom, 124 units)

3. Descriptive Analysis

In order to compare Twitter trips and SCAG trips, we first converted the two 124-by-124

OD matrices into a 15,376-by-2 matrix. The mean and standard deviation of Twitter trips within

and between PUMA zones were 1.31 and 5.571 respectively, and their range was from 0 to 104.

On the other hand, the average number of SCAG OD trips was 110.79 and its standard deviation

was 344.718 (min 0.072, max 6021.79). The Pearson Correlation coefficient was calculated to test

the relationship between the Twitter and SCAG trips, and a strong correlation was found, significant at the

0.001 level (Pearson correlation coefficient = 0.839, p < 0.001). However, this is only an

approximate indication of good agreement between a trip-based model and Twitter data due to

heterogeneity of trips among zones.

Figure 2 (a) illustrates the distribution of the Twitter trips and SCAG trips (left), and the residual

distribution of the simple regression model (right). The number of SCAG trips was used as a

dependent variable, and only the number of Twitter trips was used as an independent variable in

this model. As a result, the coefficient of Twitter trips was 20.682 and was significant at the 0.001

level. However, this coefficient may be unreliable, because datasets with a large

proportion of zero cells and spatially autocorrelated OD pairs require different

estimation methods (Figure 2 (a), right side, residual distribution). In order to address

these problems, we used additional regression models that are designed to account for these limitations

and are described in the following section.
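The descriptive comparison (flattening both OD matrices into paired columns, then computing a Pearson correlation and a simple regression of SCAG trips on Twitter trips) can be sketched with the standard library alone. The 2-by-2 matrices are invented toy data standing in for the 124-by-124 matrices, and the helper names are hypothetical.

```python
import math


def flatten(matrix):
    """Row-major flattening of a square OD matrix into one list."""
    return [cell for row in matrix for cell in row]


def _centered_sums(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return mx, my, sxy, sxx, syy


def pearson_r(x, y):
    _, _, sxy, sxx, syy = _centered_sums(x, y)
    return sxy / math.sqrt(sxx * syy)


def simple_ols(x, y):
    """Slope and intercept of y regressed on x by ordinary least squares."""
    mx, my, sxy, sxx, _ = _centered_sums(x, y)
    slope = sxy / sxx
    return slope, my - slope * mx


# Toy OD matrices (rows = origins, columns = destinations).
twitter_od = [[0, 2], [1, 5]]
scag_od = [[3.0, 45.0], [20.0, 110.0]]

x, y = flatten(twitter_od), flatten(scag_od)
pairs = list(zip(x, y))            # the N-by-2 layout used for comparison
r = pearson_r(x, y)
slope, intercept = simple_ols(x, y)
```

The slope plays the role of the naive conversion multiplier from Twitter trips to model trips; it ignores censoring and spatial dependence, which is the limitation the later models address.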

Figure 2 (b) shows the spatial distributions of Twitter trips (left) and SCAG trips (right). The red bars

indicate the number of trips originating in each PUMA zone, and the blue bars represent the number of trips

destined for each PUMA zone. As expected, a higher number of Twitter trips were

collected in the central Los Angeles area compared to the SCAG trips. Figure 2 (c) shows the top 2%

of OD flows from Twitter (left) and SCAG (right) in the City of Los Angeles area. A relatively

higher number of short-distance trips in the downtown area were collected with Twitter than with the SCAG

dataset. These trips are between areas with higher housing and business density and smaller zonal area.

(a) Twitter trips and SCAG Trips

(b) Number of OD trips from Twitter (Left) and SCAG Model (Right) in PUMA level

(c) Twitter Trips and SCAG trips (Top 2 percent sample in terms of amount of OD flows)

Figure 2. Distributions of Twitter trips and SCAG trips

4. Model

As we discussed in the previous section, a large proportion of zero trip interchanges among

zones and the presence of spatial autocorrelation among zones are problematic in developing the

conversion methods between Twitter and SCAG OD trips. Therefore, we construct several

versions of spatial lag variables, and then we use them as explanatory variables in a Tobit

regression model and a latent class regression model to understand the relationship between

Twitter-based ODs and trip-based model generated ODs. We use SCAG ODs as a dependent

variable and Twitter trips, land use, and zonal demographic characteristics as independent

variables along with the spatial lag variables. In this way, we create a three-way comparison among

three different sources of information about Origins and Destinations for cross-validation. Land

use is described with indicators of business density and diversity. Demography is described by

population density. In this way the regression coefficient associated with a Twitter OD trip is the

multiplier needed to represent SCAG OD trips from a zone; it is a net multiplier that takes

into account land-use effects and the zero trip producing zones in the SCAG data.

4.1. Defining Spatially Lagged Variables

Both the OD trips from Twitter and those from the SCAG model are spatially dependent, because the number of OD

trips is correlated with the number of trips between neighboring Origins and Destinations. As

discussed above, we use Spatially Lagged Variables in this analysis to account for spatial

autocorrelation in the OD trip variables. General ways of defining spatial lag variables can be found

in Anselin (1988). However, our OD trips are doubly dependent upon space (Origins’ and

Destinations’ Neighborhoods); therefore, we defined three sets of spatial lag variables. Figure 3

illustrates the methodology for defining spatial lag variables. The first image (a) represents an

OD trip for which we want to compute the spatial lag variables. Images (b) and (c) show two ways

of calculating spatial lag variables, which indicate the number of trips from an origin to the

adjacent zones of the destination (O_Dadj), and from the adjacent zones of the origin to a destination

(Oadj_D). The last image (d) in Figure 3 illustrates the method to compute the third spatially

lagged variable: the number of trips from the neighborhood area of the origin to the neighborhood

area of the destination (Oadj_Dadj). We defined the neighbor zones as all of the zones that are

located within three miles (Euclidean distance between centroids of the zones is sufficient for this

initial testing). For example, the zone located in the southern area of destination (Marked with a

green asterisk in maps (b) and (d)) was not included as a neighbor of the destination zone

because the distance between the centroids of the two zones was greater than three miles. We use a

three-mile radius because more than 70 percent of the OD trips in both datasets fall within this

Euclidean distance (73.6% of SCAG trips, 90.4% of Twitter trips). We computed these spatial lag

variables for both Twitter trips and SCAG model trips, so a total of six variables were constructed

(T_O_Dadj, T_Oadj_D, T_Oadj_Dadj, S_O_Dadj, S_Oadj_D, S_Oadj_Dadj).

Figure 3. Defining Spatially Lagged Variables for the OD trips
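The three spatial lag variables can be sketched as sums over neighbor sets defined by the three-mile centroid distance. The sketch assumes centroid coordinates already expressed in miles on a planar grid and excludes a zone from its own neighbor set; both are assumptions for illustration, and the function names and toy data are hypothetical.

```python
import math

RADIUS_MILES = 3.0  # neighbor threshold stated in the text


def neighbors(centroids, radius=RADIUS_MILES):
    """Map each zone to the other zones whose centroid lies within `radius`."""
    result = {}
    for a, (xa, ya) in centroids.items():
        result[a] = {
            b for b, (xb, yb) in centroids.items()
            if b != a and math.hypot(xa - xb, ya - yb) <= radius
        }
    return result


def spatial_lags(od, nbrs, origin, dest):
    """Return (O_Dadj, Oadj_D, Oadj_Dadj) for one OD pair."""
    o_dadj = sum(od.get((origin, d), 0) for d in nbrs[dest])
    oadj_d = sum(od.get((o, dest), 0) for o in nbrs[origin])
    oadj_dadj = sum(
        od.get((o, d), 0) for o in nbrs[origin] for d in nbrs[dest]
    )
    return o_dadj, oadj_d, oadj_dadj


# Toy layout: A and B are neighbors, C and D are neighbors, E is isolated.
centroids = {"A": (0, 0), "B": (2, 0), "C": (10, 0), "D": (12, 0), "E": (20, 0)}
od = {("A", "C"): 5, ("A", "D"): 2, ("B", "C"): 3, ("B", "D"): 1, ("A", "E"): 7}

nbrs = neighbors(centroids)
lags_a_to_c = spatial_lags(od, nbrs, "A", "C")
```

For the pair A to C, the lags pick up the A to D flow (destination neighbor), the B to C flow (origin neighbor), and the B to D flow (both neighbors), while the A to E flow is ignored because E is not near C.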

4.2. Tobit Model

In the SCAG OD database we find 425 OD pairs with number of trips smaller than one. A

dependent variable like this is limited from below and can be considered as censored at zero

(Monzon et al., 1989) and there are exact and approximate ways to estimate regression models

accounting for limits and censoring (Goulias and Kitamura, 1993). With this type of dependent

variable, the Tobit model is generally recommended because the model provides functionality to

handle censored distributions (Maddala, 1986; Greene, 2003). Equation (2) shows the Tobit model

we used in this analysis.

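$$y_i^* = \rho W_j y + \beta x_i + \gamma W_j x_i + Z\delta + \epsilon_i \qquad (2)$$

$$y_i = \begin{cases} y_i^* & \text{if } y_i^* > 0 \\ 0 & \text{if } y_i^* \le 0 \end{cases}$$

where $\epsilon_i \sim N(0, \sigma^2)$.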

Each OD pair is denoted by $i$, and $y_i$ and $x_i$ represent OD trips from the SCAG model and Twitter,

respectively; $y_i^*$ is a latent variable that is assumed to be normally distributed. There are two sets

of spatial lag variables: endogenous ($W_j y$) and exogenous ($W_j x_i$). $W_j$ denotes the three

spatial weight structures used to construct spatial lag variables for both dependent and

independent variables (matrices containing cells with 0 and 1 values; 1 for zones that are

considered adjacent and 0 otherwise). Lastly, $Z$ denotes all other explanatory variables, including

land use and demographic characteristics of origins and destinations, and the Euclidean distance

between centroids of origins and destinations. According to Xu and Lee (2014), the Maximum

Likelihood Estimator (MLE) produces consistent results for the Spatial Autoregressive Tobit Model.

Therefore, we use the MLE method to estimate this model with the Likelihood function in

Equation (3), where $\tau$ denotes the censoring point. In our model estimation, we set $\tau = 0$.

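$$L = \prod_{i}^{N} \left[\frac{1}{\sigma}\,\phi\!\left(\frac{y - \mu}{\sigma}\right)\right]^{d_i} \left[1 - \Phi\!\left(\frac{\mu - \tau}{\sigma}\right)\right]^{1 - d_i} \qquad (3)$$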
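The likelihood in Equation (3) can be evaluated on the log scale as in the sketch below. It assumes that the indicator d_i equals 1 for uncensored observations (y above the censoring point) and 0 otherwise, which is the usual Tobit convention but is not spelled out in the text. It takes the linear predictor mu as given rather than building it from the spatial lag terms.

```python
import math


def norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def tobit_loglik(y, mu, sigma, tau=0.0):
    """Log of the likelihood in Equation (3).

    y[i]  : observed (censored) value for observation i
    mu[i] : linear predictor for observation i
    Assumed convention: d_i = 1 when y[i] > tau (uncensored), else 0.
    """
    total = 0.0
    for yi, mi in zip(y, mu):
        if yi > tau:
            # density term: (1 / sigma) * phi((y - mu) / sigma)
            total += math.log(norm_pdf((yi - mi) / sigma) / sigma)
        else:
            # censoring term: 1 - Phi((mu - tau) / sigma)
            total += math.log(1.0 - norm_cdf((mi - tau) / sigma))
    return total


# One censored and one uncensored observation with sigma = 1.
ll = tobit_loglik(y=[0.0, 2.0], mu=[0.0, 2.0], sigma=1.0)
```

A maximum likelihood routine would maximize this quantity over the regression coefficients and sigma; that optimization step is left out of the sketch.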

The marginal effects of explanatory variables x or Z are the partial derivative of the expected value

of y with respect to x or Z (Equation 4). Since these effects indicate the magnitude of change in the

dependent variable for a one-unit change in each independent variable, we can obtain the unit

contribution of a Twitter OD trip to the SCAG model output.

$$\frac{\partial E[y \mid x]}{\partial x_k} = \Phi(\text{nonzero})\,\beta_k \qquad (4)$$

Equation (4) shows that the multiplier (also called the marginal effect) applied to Twitter OD trips differs for each

OD pair. It is a function of the regression coefficient $\beta_k$, which is the same across all zones, but it

also accounts for the probability that an origin-destination pair produces a nonzero number of trips,

making this marginal effect different among zones.
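For a Tobit model censored at zero, the probability of a nonzero outcome is Φ(μ/σ), so the marginal effect of a regressor is its coefficient scaled by that probability. The sketch below applies this to the Twitter OD coefficient reported in Table 2; the values of μ and σ are hypothetical and chosen only to show how the effect varies across OD pairs.

```python
import math


def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def tobit_marginal_effect(beta_k, mu, sigma):
    """dE[y|x]/dx_k = Phi(mu / sigma) * beta_k for censoring at zero."""
    return norm_cdf(mu / sigma) * beta_k


BETA_TWITTER = 19.737  # Twitter OD coefficient reported in Table 2

# Hypothetical linear predictors for three OD pairs, with sigma = 100.
effects = [
    tobit_marginal_effect(BETA_TWITTER, mu, 100.0)
    for mu in (-150.0, 0.0, 150.0)
]
```

OD pairs that are unlikely to carry any trips (strongly negative predictor) get a multiplier near zero, while pairs that are almost surely nonzero approach the full coefficient.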

4.3. Latent Class Regression Model

Though the spatial lag Tobit Model provides a suitable framework to find the conversion of Twitter

OD trips to SCAG OD trips for all the Greater Los Angeles Area, it may be limited in reflecting

the heterogeneous nature of space. Latent Class Analysis provides a method to capture many

different types of heterogeneity because it allows us to classify observational units into

a set of latent classes and estimate class-specific regression models simultaneously. This is

particularly useful when we attempt to capture spatial heterogeneity (Deutsch-Burgner and

Goulias, 2014). With our dataset, spatially homogeneous OD trips can be grouped into latent classes

and regression coefficients are estimated for each class simultaneously with the determination of

the number of classes (Vermunt and Magidson, 2013). In this way we can test whether each hypothetical

class has a different Twitter trip conversion multiplier (in essence different values for the regression

coefficients).

In this model, we also use SCAG OD trips as the endogenous variable; all other variables are used

as exogenous variables. There are two different types of exogenous variables in the Latent Class

Regression model, and their roles are distinct from each other: 1) covariates – variables that affect the

latent variable defining classes; and 2) predictors – variables that affect the dependent variable

(SCAG OD trips). Equation (5) illustrates the density of a general Latent Class Regression model

with more than one predictor as well as covariates.

$$f(y_i \mid z_{i1}^{cov}, z_{i2}^{cov}, z_{i1}^{pred}, z_{i2}^{pred}) = \sum_{x=1}^{K} P(x \mid z_{i1}^{cov}, z_{i2}^{cov})\, f(y_i \mid x, z_{i1}^{pred}, z_{i2}^{pred}) \qquad (5)$$
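The mixture density above can be evaluated with a short sketch. Two modeling choices here are assumptions made for illustration and are not stated in the text: class membership follows a multinomial logit in the covariates, and each class-specific density is normal with a linear mean in the predictors. All parameter values are invented.

```python
import math


def normal_pdf(y, mean, sd):
    z = (y - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def class_probabilities(z_cov, gammas):
    """Assumed multinomial-logit membership P(x | z_cov).

    gammas[k] = (intercept, [slopes]) for class k.
    """
    scores = [g0 + sum(g * z for g, z in zip(gs, z_cov)) for g0, gs in gammas]
    top = max(scores)                       # subtract max for stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mixture_density(y, z_cov, z_pred, gammas, betas, sds):
    """f(y | z_cov, z_pred) = sum over classes of P(x | z_cov) f(y | x, z_pred)."""
    probs = class_probabilities(z_cov, gammas)
    density = 0.0
    for p, (b0, bs), sd in zip(probs, betas, sds):
        mean = b0 + sum(b * z for b, z in zip(bs, z_pred))
        density += p * normal_pdf(y, mean, sd)
    return density


# Two classes with equal membership odds but different Twitter multipliers.
gammas = [(0.0, [0.0]), (0.0, [0.0])]
betas = [(0.0, [10.0]), (0.0, [20.0])]
sds = [1.0, 1.0]

probs = class_probabilities([1.0], gammas)
d = mixture_density(10.0, [1.0], [1.0], gammas, betas, sds)
```

Covariates enter only through the class probabilities and predictors only through the class-specific means, which mirrors the distinction between the two kinds of exogenous variables described in the text.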

This model can be estimated with a combination of the Expectation Maximization (EM) algorithm

and Newton-Raphson Maximum Likelihood Estimation. The EM algorithm first performs

maximum likelihood iterations until a change in likelihood value is smaller than a predetermined

difference in parameters for a given set of starting values. Then, the algorithm switches to the

Newton-Raphson algorithm and continues until a predefined criterion of convergence is reached. Although

the combination of the two algorithms provides quite stable results, most latent class models are still

very sensitive to local maxima of the likelihood function (Vermunt and Magidson, 2002). In order

to address this issue we test multiple models with different sets of initial values of parameters

(Goulias, 1999). Another issue is operational. Since the degrees of freedom

decrease rapidly as we increase the number of parameters, this may lead to a variety of operational

problems, including failure of model identification (parameters cannot be estimated) or failure to converge

(parameters in subsequent estimation steps are not close enough). Therefore, we use a hierarchical

iterative process to estimate this model as follows:

a) We start with a one-class assumption without covariates.

b) We proceed by increasing the number of classes for the models until any parameter fails

to be identified and the size of the classes becomes too small to be meaningful.

c) We select the most suitable number of classes based on changes in goodness-of-fit criteria,

such as the Bayesian Information Criterion (BIC), the Akaike Information Criterion (AIC), and

the Consistent Akaike Information Criterion (CAIC), based on McCutcheon (2002)

and Nylund et al. (2007); and we estimate a series of Latent Class Regression models

with different combinations of exogenous variables.

d) Then, we compare the models with different specifications and decide on the best model

based on multiple statistical goodness-of-fit measures, as in the previous step, as well as

classification errors and R-square values. A higher R-square indicates a better model

for predicting the endogenous variable, whereas a lower classification error indicates a better

model for classifying spatially homogeneous groups.
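For reference, the usual log-likelihood-based definitions of these criteria are given below, with LL the maximized log-likelihood, N_par the number of estimated parameters, and N the number of observations. These are the standard textbook forms and are offered as an assumption about what the estimation software reports, not as a statement of its exact implementation; lower values indicate a better trade-off between fit and parsimony.

```latex
\begin{aligned}
\mathrm{AIC}  &= -2\,LL + 2\,N_{par} \\
\mathrm{AIC3} &= -2\,LL + 3\,N_{par} \\
\mathrm{BIC}  &= -2\,LL + N_{par}\,\ln N \\
\mathrm{CAIC} &= -2\,LL + N_{par}\,(\ln N + 1)
\end{aligned}
```

Because the penalty grows with the number of parameters, adding a class improves a criterion only when the gain in log-likelihood outweighs the extra parameters it requires.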

5. Results and Findings

As we described in the previous sections, we use two different types of regression models to

develop a method to validate travel demand model output with Social Media data. The Tobit model

with spatial lag variables was estimated with LIMDEP 9.0 (Greene, 1986 – 2007), and we use

Latent Gold 5.0 (Vermunt and Magidson, 2005 – 2014) to estimate the Latent Class Regression

models.

5.1. Tobit Model with Spatially Lagged Variables

A few preparatory tasks were undertaken to make sure the matrices of data do not have very large

and very small numbers. First, we replace the 425 unrealistic SCAG OD trips (less than one trip

between OD) with zeros; then, we divide the number of residents, houses, employees per PUMA

zones by 10,000. We also divide the area of each zone by 10 km² and the Euclidean distances

between zones are measured in km. Lastly, we use people and houses per square kilometer as the

units of the population and housing densities.

Table 2 shows the list of the Tobit regression coefficients estimated by Maximum Likelihood

Estimator (left) as well as the marginal effects on the latent dependent variable y (Equation 4 and

Table 2, right-hand side). First of all, the regression coefficient of the Twitter OD trip is B = 19.737,

significant at the level of 0.001. This indicates that every trip from Twitter corresponds to a larger

number of SCAG OD trips, as expected, while accounting for other influencing factors. All of the

spatial lag variables are significantly different from zero at the level of 0.01. This means that both

spatial lag variables of the exogenous and endogenous variables play significant roles and are able

to control for spatial autocorrelation in this model. In terms of demographic characteristics,

positive coefficients are found for the size of origins and destinations area, population size and

housing density at origins (significant at the level of 0.05) as expected. On the other hand, negative

coefficients are found for population densities, presumably due to the zero trip zones, at both

origins and destinations. Finally, the negative coefficient associated with the Euclidean distance

signifies that neighboring zones have more trip interchanges than zones further apart.

The marginal effects of the explanatory variables can be found in the right side of Table 2, which

are scaled values of the original coefficients. Since this effect represents unit contribution of each

exogenous variable, we can identify the conversion rate of Twitter OD trips to SCAG OD trips. A

Twitter OD trip has a marginal effect of 12.507, meaning that we can expect approximately 12 SCAG OD trips to be

represented by one Twitter trip on average throughout the region.
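As a quick consistency check on the statement that the marginal effects are scaled values of the coefficients, the ratio of marginal effect to coefficient can be computed for several rows of Table 2. The interpretation of the common ratio as an estimated probability of a nonzero observation is an assumption based on the form of Equation (4); the numbers themselves are taken from the table.

```python
# (variable, estimated coefficient, marginal effect) as reported in Table 2
ROWS = [
    ("T_OD", 19.737, 12.507),
    ("S_Oadj_D", 0.219, 0.139),
    ("T_Oadj_Dadj", 1.067, 0.676),
    ("T_Oadj_D", -8.396, -5.320),
    ("O_Population", 8.416, 5.333),
    ("Euclidean distance", -4.098, -2.597),
]

# Ratio of marginal effect to coefficient for each variable.
scale = {name: effect / coef for name, coef, effect in ROWS}
mean_scale = sum(scale.values()) / len(scale)
```

The ratios agree to within rounding of the three-decimal table entries, which suggests a single scale factor of roughly 0.63 was applied to every coefficient.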

Table 2. Results of Spatial Lag Tobit Model

Independent Variables                          Estimated Coefficients from OLS  Partial Derivatives of Expected Values Mean of X
                                           Coef.      S.E.  b/St.Er.  P[|Z|>z]     Coef.      S.E.  b/St.Er.  P[|Z|>z]
Twitter OD
  T_OD                                    19.737     0.118   166.592     0.000    12.507     0.098   128.255     0.000     4.375
Spatially Lagged Variables from SCAG OD
  S_Oadj_Dadj                             -0.017     0.002    -7.940     0.000    -0.011     0.001    -7.933     0.000  6035.197
  S_Oadj_D                                 0.219     0.009    23.475     0.000     0.139     0.006    23.312     0.000   716.794
  S_O_Dadj                                 0.006     0.001     5.705     0.000     0.004     0.001     5.702     0.000  6035.197
Spatially Lagged Variables from Twitter
  T_Oadj_Dadj                              1.067     0.134     7.946     0.000     0.676     0.085     7.939     0.000    78.983
  T_Oadj_D                                -8.396     0.558   -15.053     0.000    -5.320     0.354   -15.010     0.000     8.593
  T_O_Dadj                                -0.586     0.060    -9.775     0.000    -0.371     0.038    -9.764     0.000    78.983
Demographic Characteristics from US Census
  O_AREA / 10 km2                          0.031     0.010     3.213     0.001     0.020     0.006     3.212     0.001    82.540
  D_AREA / 10 km2                          0.042     0.010     4.394     0.000     0.027     0.006     4.393     0.000    82.540
  O_Population / 10,000                    8.416     2.404     3.500     0.001     5.333     1.524     3.500     0.001    16.102
  D_Population / 10,000                    2.736     2.387     1.146     0.252     1.734     1.513     1.146     0.252    16.102
  O_Housing / 10,000                      -9.717     5.428    -1.790     0.073    -6.158     3.440    -1.790     0.073     5.642
  D_Housing / 10,000                       2.059     5.328     0.386     0.699     1.305     3.377     0.386     0.699     5.642
  O_POP_Density (person/km2)              -0.490     0.080    -6.140     0.000    -0.311     0.051    -6.139     0.000   270.576
  D_POP_Density (person/km2)              -0.216     0.080    -2.709     0.007    -0.137     0.050    -2.709     0.007   270.576
  O_Housing Density (House/km2)            0.747     0.209     3.569     0.000     0.473     0.133     3.569     0.000    91.889
  D_Housing Density (House/km2)            0.258     0.209     1.236     0.216     0.163     0.132     1.236     0.216    91.889
Land Use Characteristics from NETS dataset
  O_Number of Employees (10,000)           4.059     1.140     3.561     0.000     2.572     0.722     3.561     0.000     6.914
  D_Number of Employees (10,000)          -1.259     1.099    -1.145     0.252    -0.798     0.697    -1.145     0.252     6.914
  O_Business Diversity                     6.148    24.258     0.253     0.800     3.896    15.373     0.253     0.800     3.501
  D_Business Diversity                     8.197    24.257     0.338     0.735     5.194    15.372     0.338     0.735     3.501
Distance
  Euclidean Distance Between OD (km)      -4.098     0.173   -23.756     0.000    -2.597     0.110   -23.663     0.000    21.424

5.2. Latent Class Regression Model

As mentioned above, we used a stepwise approach to develop the Latent Class Regression Model.

The first step is identifying a suitable number of classes describing this OD trip dataset. Similar to

the Tobit regression, we use SCAG OD trips as the dependent variable and estimate a series of

Latent Class models (also called mixture regression models), starting with one class and increasing

the number of classes until we find an optimal model. No explanatory variable was added in this step, and

seven models were identified (Table 3). Although the model fit improves as we increase the

number of classes, the improvement in the goodness-of-fit indices (BIC, AIC,

AIC3) decreases dramatically after the three-class model, approaching an asymptote. This indicates

that three classes are suitable to describe the distribution of SCAG OD trips. In other words, it is

possible to explain the heterogeneous nature of the SCAG trips efficiently with just three latent

classes representing three different groups of people in associating trips between the two sources

(Twitter and four-step). Therefore, the subsequent latent class regression models are estimated

with the three-class assumption.

Table 3. Results of Latent Class Models without Covariates

Model             LL    BIC(LL)    AIC(LL)   AIC3(LL)  Npar  Class.Err.
1-Class    -124282.7   248584.7   248569.5   248571.5     2       0.000
2-Class     -88644.0   177336.2   177298.0   177303.0     5       0.020
3-Class     -82891.7   165860.5   165799.4   165807.4     8       0.067
4-Class     -80851.1   161808.3   161724.2   161735.2    11       0.095
5-Class     -79904.7   159944.4   159837.4   159851.4    14       0.108
6-Class     -79382.3   158928.4   158798.5   158815.5    17       0.132
7-Class     -79100.7   158394.2   158241.3   158261.3    20       0.144
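The flattening of the fit improvement can be checked directly from the log-likelihoods and parameter counts in Table 3. The sketch assumes BIC = -2 LL + Npar ln N with N = 15,376 OD pairs; under that assumption the recomputed values match the tabulated BIC to within rounding of the reported log-likelihoods.

```python
import math

N_OBS = 15_376  # OD pairs in the 124-by-124 PUMA matrix (assumed sample size)

# (number of classes, log-likelihood, number of parameters) from Table 3
TABLE3 = [
    (1, -124282.7, 2), (2, -88644.0, 5), (3, -82891.7, 8), (4, -80851.1, 11),
    (5, -79904.7, 14), (6, -79382.3, 17), (7, -79100.7, 20),
]


def bic(ll, npar, n=N_OBS):
    return -2.0 * ll + npar * math.log(n)


bics = {k: bic(ll, npar) for k, ll, npar in TABLE3}

# Reduction in BIC gained by adding one more class.
drops = {k: bics[k - 1] - bics[k] for k in range(2, 8)}
```

The reduction in BIC shrinks at every step, and the gain from a fourth class is roughly a third of the gain from the third, which is the pattern behind choosing three classes.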

The covariates and predictors play different roles in estimating the Latent Class Regression Model, as

discussed earlier: covariates influence the latent classes, whereas predictors influence the

dependent variable. Since we use latent class analysis to capture the spatial heterogeneity,

covariates in this model should be able to reflect spatial characteristics of Origins and Destinations.

All of our exogenous variables contain zonal information; therefore, all of them can be used as

either covariates or predictors, or both. Since there is no consensus in the literature to define

exogenous variables for this type of analysis yet, we conducted a few experiments with different

settings in explanatory variables.

Table 4 shows the specifications of five latent class regression models and the estimated values of

the goodness of fit indices. All five models use SCAG OD trips as the dependent variable and Twitter

OD trips as a predictor under the three-class assumption. In the first model, we use all the

land use and demographic characteristics as covariates along with all six spatial lag

variables. In other words, only Twitter OD trips are used as a predictor, and all other variables are

used to capture spatial heterogeneity. The model converged, but its BIC, AIC, and AIC3 are relatively

high compared with the other models in the experiments that follow. Because we used many

variables as covariates but only one as a predictor, this model has the lowest classification error

and the lowest R-square value. This indicates that the model reflects spatial characteristics well,

but predicts SCAG OD trips less well than the next set of models. The second model

has the opposite specification: all the explanatory variables are used as predictors and no

covariates are used. This model also converges, with slightly higher values of the goodness of

fit indices (BIC, AIC, AIC3). As expected, its results are also opposite to those of the first

model: it has the largest classification error but the highest R-square value.

This shows that the model cannot capture the heterogeneous nature of space well, but predicts SCAG

OD trips well. In the third and fourth models, we divide the explanatory variables into two groups:

six spatial lag variables and fifteen land use and demographic variables. We then

use one group as covariates and the other as predictors. The third model uses the spatial lag variables as

predictors and all other variables as covariates; the fourth model does the opposite. In terms of

model fit indices, the third model produced better results, but the fourth model captures the

heterogeneous spatial structure and predicts SCAG OD trips better than the third model. Finally,

the fifth model uses all the explanatory variables as both covariates and predictors. This model has

the best goodness of fit indices, the second lowest classification error, and the second highest

R-square value among the five models. However, the fifth model is estimated with almost twice

as many parameters as the fourth model (122 versus 68), while the differences in classification error and

R-square are very small (0.0064 and 0.0026, respectively). Based on all the model evaluation criteria,

the fifth model seems to be the best model, but the fourth model is preferable once parsimony is

taken into account.

Table 4. Results of Latent Class Regression Models with Different Setting in Exogenous Variable

       Exogenous Variables
Model  Covariates        Predictors        LL        BIC(LL)   AIC(LL)   AIC3(LL)  Npar  Class.Err.  R2
1      All               None              -69708.1  139927.2  139522.2  139575.2  53    0.0272      0.7649
2      None              All               -70606.2  141925.8  141360.4  141434.4  74    0.0683      0.8784
3      Land use & Demo.  Spatial Lag       -66382.9  133334.6  132883.8  132942.8  59    0.0438      0.8129
4      Spatial Lag       Land use & Demo.  -69645.0  139945.6  139426.1  139494.1  68    0.0384      0.8359
5      All               All               -62567.5  126311.1  125379.0  125501.0  122   0.0320      0.8385
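
The parsimony argument for choosing between the fourth and fifth models can be reproduced from the values in Table 4. The sketch below is only a worked arithmetic check; the dictionary layout and the function name `compare_models` are illustrative assumptions, not part of the estimation software.

```python
# Values copied from Table 4: (BIC, Npar, classification error, R-square).
TABLE4 = {
    1: (139927.2, 53, 0.0272, 0.7649),
    2: (141925.8, 74, 0.0683, 0.8784),
    3: (133334.6, 59, 0.0438, 0.8129),
    4: (139945.6, 68, 0.0384, 0.8359),
    5: (126311.1, 122, 0.0320, 0.8385),
}


def compare_models(simple, rich, table=TABLE4):
    """Summarize what the richer model gains relative to the simpler one."""
    bic_s, npar_s, err_s, r2_s = table[simple]
    bic_r, npar_r, err_r, r2_r = table[rich]
    return {
        "npar_ratio": npar_r / npar_s,       # how many times more parameters
        "bic_drop": bic_s - bic_r,           # positive means better fit index
        "class_err_drop": err_s - err_r,     # positive means fewer errors
        "r2_gain": r2_r - r2_s,              # positive means higher R-square
    }


if __name__ == "__main__":
    for name, value in compare_models(4, 5).items():
        print(f"{name}: {value:.4f}")
```

For models 4 and 5 this gives a parameter ratio of about 1.8, against a classification error reduction of 0.0064 and an R-square gain of 0.0026.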

As mentioned earlier, the fourth model was estimated with three classes, and their

estimated membership proportions are reported in Figure 4. The largest proportion of the sample

was found in the first class (65%, 10,145 OD trips), followed by the second class (28%,

4,190 OD trips) and the third class (7%, 1,041 OD trips). Because we use spatial lag variables as

covariates, these latent classes represent homogeneous groups of OD flows with respect to

their neighbors’ OD flow patterns. The right hand side of Figure 4 illustrates the means of the

covariates for each class. The first class has relatively small amounts of OD trips and of

neighbors’ OD trips. The second class has moderate levels of SCAG trips and neighbors’ OD trips,

and the third class consists of the OD pairs with the largest amounts of SCAG OD trips

and neighbors’ interactions.

Figure 4. Proportion of Latent Classes and Mean of Covariates

Figure 5 shows a set of maps of the Greater Los Angeles area with bar charts indicating the

amount of trips, with origins in red and destinations in blue. The first map (a) shows the total

number of SCAG OD trips, and the other three show the SCAG OD trips of each latent class. In this

way, maps (b), (c), and (d) reveal the geographical differences among the OD trips that

belong to latent classes 1, 2, and 3, respectively. The first class OD pairs are more densely

concentrated in the City of Los Angeles area, whereas the second and third classes mostly contain

OD pairs in the suburban areas of Los Angeles. These maps also show spatial clusters of zones

whose OD trip patterns are similar to their neighbors’ trip patterns. This classification also

reflects the effect of the size and location of the zones, because those were captured via the spatial lag

variables used as covariates. This is the most important demonstration that we successfully

captured the spatial heterogeneity of OD trips using Latent Class Analysis, and we can now estimate

conversion coefficients between Twitter OD trips and SCAG OD trips that account for this

otherwise unobserved spatial heterogeneity.

Figure 5. Geographical Distribution of Latent Classes of OD trips from SCAG model

Table 5 shows the latent class-specific coefficients of the predictors as well as Wald statistics that

test whether the coefficients differ between the latent classes. Asterisks

mark coefficients that are significant at the 5% level. Significant predictors of

SCAG OD-trips are the classes themselves (i.e., the overall class specific averages are different).

Based on the Wald test statistics, most of the predictors differ among the latent classes; the exceptions are

the housing related variables, population size and population density in origins, and business

diversity in origins (p = 0.051). Most importantly, the coefficients of

Twitter trips turned out to be very different. The smallest unit contribution of Twitter OD trips was

found in the first class (7.085) and the largest in the second class (37.438). This means that,

when we validate model based OD trips, a Twitter based OD trip should be weighted differently

depending on the underlying spatial structure. This result also shows the necessity of a

methodology that is able to reflect the heterogeneous nature of geography and of the people living

in different geographies. In terms of directional effects (i.e., positive or negative coefficients),

the statistically significant coefficients of each explanatory variable have the same sign

regardless of class membership.
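
To make the class dependence concrete, the sketch below applies the three Twitter coefficients from Table 5 to the same Twitter trip count. It evaluates only the Twitter term of each class-specific regression; the intercept and all other predictors are deliberately left out, so the result is a partial contribution and not a full prediction of SCAG OD trips. The function name and the example count of 10 trips are hypothetical.

```python
# Class-specific coefficients of Twitter OD trips (T_OD) copied from Table 5.
TWITTER_COEF = {1: 7.085, 2: 37.438, 3: 15.149}


def twitter_contribution(twitter_trips, latent_class, coef=TWITTER_COEF):
    """Part of the modeled SCAG OD trips attributable to Twitter OD trips.

    Only the Twitter term of the class-specific regression is evaluated;
    the intercept and the remaining predictors are not included.
    """
    if latent_class not in coef:
        raise ValueError(f"unknown latent class: {latent_class}")
    if twitter_trips < 0:
        raise ValueError("trip counts cannot be negative")
    return coef[latent_class] * twitter_trips


if __name__ == "__main__":
    # Hypothetical OD pair with 10 Twitter trips, evaluated in each class.
    for cls in sorted(TWITTER_COEF):
        print(f"class {cls}: {twitter_contribution(10, cls):.2f}")
```

The same 10 Twitter trips therefore correspond to very different SCAG-equivalent contributions depending on the latent class of the OD pair, which is why a single conversion factor would be misleading.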

Although the first latent class regression model was estimated with the lowest R-square value

(0.5964) among the classes, it shows a variety of significant predictors (11 in total), and their

signs are the same as in the output of the Tobit model in the previous section. However, the

magnitudes of the coefficients are smaller than in the Tobit model results. For example, the unit

contribution of a Twitter OD trip for this latent group was 7.085 (the difference from the Tobit estimate is

5.422). In addition, the diversity of business establishments was not a significant explanatory

variable in the Tobit model, but the diversities of both origins and destinations play significant

roles in this model. Also, higher numbers of business establishments in origins and destinations

indicate a larger number of trips between two zones.

The same number of significant predictors was found in the second class, which has the highest

R-square value (0.9902), but the set of significant predictors differs from that of the first class: two

coefficients lost significance and two others became significant relative to the first class. As

mentioned earlier, the highest coefficient of Twitter OD trips across all the classes was found in

this class (37.438). This suggests that Twitter based OD trips in these areas can be

converted into four-step-like OD trips in a reliable way. In terms of magnitude, higher

coefficients than in the first class were found in this second class, meaning that differences in land

use and demographics in OD-pairs imply more trips when compared to the first class ODs.

The third class regression model was estimated with an R-square of 0.7537 and five significant

predictor variables. Fewer significant coefficients were found in this class than in

the other classes, presumably because it has the smallest sample size among the classes. This model

produced the unit contribution of a Twitter OD trip closest to that of the Tobit model. Some coefficients in

the third class were much larger in magnitude than in the first and second classes, notably Euclidean distance between

origins and destinations and housing size in destinations (-13.481 → -224.131 and 2.108 →

159.056, respectively, from the second to the third class). This is presumably because a large proportion of the OD pairs in this class are related to

a certain area (Ventura County), so that its locational and demographic characteristics

dominate the small number of third class members (Ventura County is located in the north-west

corner of the SCAG area and has bigger zones with larger housing supply).

Finally, the spatial lag variables played important roles as covariates in this model; their coefficients

can be found in Table 6. Based on the Wald statistics, the number of SCAG model trips from the areas neighboring the origin to

the destination (S_Oadj_D) was the most important variable for classifying the latent classes,

followed by the two other spatial lag variables from the SCAG data. This can also be verified with

Figure 4, which shows the chart of the means of the covariates.

Table 5. Class-specific Coefficients of Predictors

                                Class 1              Class 2              Class 3               Between classes
Independent Variable            Coef.    z-value     Coef.    z-value     Coef.     z-value     Wald(=)   p-val.  Mean     Std.Dev.

Twitter OD
T_OD                            7.085    17.481 *    37.438   569.131 *   15.149    38.855 *    8463.340  0.000   16.110   13.407

Demographic Characteristics from US Census
O_AREA / 10km2                  0.005    21.688 *    1.316    31.524 *    1.226     6.418 *     994.395   0.000   0.457    0.617
D_AREA / 10km2                  0.006    27.162 *    0.211    8.236 *     1.374     8.325 *     125.089   0.000   0.160    0.347
O_Population / 10,000           0.099    1.702       1.565    2.124 *     18.769    0.596       4.673     0.097   1.829    4.720
D_Population / 10,000           0.155    2.695 *     2.840    4.349 *     8.468     0.262       16.955    0.000   1.491    2.262
O_Housing / 10,000              0.220    1.715       0.510    0.275       101.979   1.383       2.028     0.360   7.504    26.077
D_Housing / 10,000              0.538    4.281 *     2.108    1.301       159.056   2.048 *     5.297     0.071   12.197   40.541
O_POP_Density (person/km2)      -0.006   -2.658 *    0.004    0.217       -1.538    -1.536      2.440     0.300   -0.111   0.394
D_POP_Density (person/km2)      -0.006   -3.079 *    -0.080   -4.153 *    -1.092    -1.055      15.833    0.000   -0.103   0.275
O_Housing Density (House/km2)   0.008    1.426       -0.028   -0.548      0.649     0.258       0.518     0.770   0.043    0.168
D_Housing Density (House/km2)   0.002    0.467       -0.023   -0.446      -0.732    -0.279      0.325     0.850   -0.057   0.187

Land Use Characteristics from NETS dataset
O_Number of Employees (10,000)  0.462    14.446 *    3.422    14.804 *    -0.538    -0.058      162.704   0.000   1.216    1.394
D_Number of Employees (10,000)  0.055    1.829       -1.457   -5.945 *    -24.651   -2.132 *    41.785    0.000   -2.116   6.256
O_Business Diversity            5.581    6.607 *     20.289   2.639 *     596.274   1.564       5.953     0.051   51.494   150.509
D_Business Diversity            5.736    6.735 *     19.611   2.682 *     627.857   1.623       6.116     0.047   53.641   158.612

Distance
Euclidean Distance Between OD   -0.651   -55.628 *   -13.481  -30.853 *   -224.131  -12.874 *   1070.597  0.000   -20.045  56.615

Table 6. Class-specific Coefficients of Covariates

Independent Variable   Class1   z-value     Class2   z-value     Class3   z-value     Wald     p-value

Spatially Lagged Variables from SCAG OD
S_Oadj_Dadj            0.001    14.460 *    0.000    -11.142 *   -0.001   -16.499 *   298.773  0.000
S_Oadj_D               -0.006   -22.032 *   0.002    18.460 *    0.003    23.994 *    599.306  0.000
S_O_Dadj               -0.001   -15.710 *   0.000    14.372 *    0.001    16.606 *    293.315  0.000

Spatially Lagged Variables from Twitter
T_Oadj_Dadj            -0.036   -7.623 *    0.015    6.354 *     0.021    8.570 *     89.493   0.000
T_Oadj_D               -0.003   -0.141      0.001    0.125       0.002    0.151       0.024    0.990
T_O_Dadj               0.008    2.413 *     -0.004   -2.104 *    -0.004   -2.677 *    10.681   0.005
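
The three kinds of spatial lag in Table 6 can be illustrated for a small OD matrix. The sketch below is a minimal reading of the variable names (Oadj_D, O_Dadj, Oadj_Dadj) and rests on assumptions: neighbors are given as an explicit adjacency list, a zone is not its own neighbor, and the lag is the plain sum of neighboring flows. Whether the variables used in the models are sums, averages, or weighted differently is not stated in this section, and the function name `spatial_lags` is hypothetical.

```python
def spatial_lags(flows, neighbors):
    """Compute three spatial lags for every OD pair.

    flows[o][d]  : trips from zone o to zone d
    neighbors[z] : zones adjacent to zone z (z itself excluded)

    Returns three matrices shaped like flows:
      oadj_d    : trips from the neighbors of o to d
      o_dadj    : trips from o to the neighbors of d
      oadj_dadj : trips from the neighbors of o to the neighbors of d
    """
    n = len(flows)
    oadj_d = [[sum(flows[i][d] for i in neighbors[o]) for d in range(n)]
              for o in range(n)]
    o_dadj = [[sum(flows[o][j] for j in neighbors[d]) for d in range(n)]
              for o in range(n)]
    oadj_dadj = [[sum(flows[i][j] for i in neighbors[o] for j in neighbors[d])
                  for d in range(n)] for o in range(n)]
    return oadj_d, o_dadj, oadj_dadj


if __name__ == "__main__":
    # Hypothetical example: three zones in a row (0 - 1 - 2).
    neighbors = {0: [1], 1: [0, 2], 2: [1]}
    flows = [[0, 5, 1],
             [4, 0, 2],
             [3, 6, 0]]
    oadj_d, o_dadj, oadj_dadj = spatial_lags(flows, neighbors)
    print("Oadj_D   :", oadj_d)
    print("O_Dadj   :", o_dadj)
    print("Oadj_Dadj:", oadj_dadj)
```

Because each OD pair depends on space at both ends, all three lags are needed to describe the neighborhood of a flow: one shifts the origin, one shifts the destination, and one shifts both.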

6. Summary and Conclusion

In this paper, geo-tagged Twitter data, a popular social media data source, are used to understand their

relationship with a traditional travel demand model, along with the US Census dataset (2010) and the NETS

Database (2010). Using two days of geo-tagged Twitter data for the Greater Los Angeles region,

we extracted 67,266 trips from about 54,272 people and aggregated

them into Public Use Microdata Areas (PUMAs) to avoid the zero-cell problem. We compared the

Twitter based OD matrix with a recent OD matrix (4-step model output) provided by the local

Metropolitan Planning Organization (MPO). Regression techniques are used to estimate

correlations between the traditional travel demand model ODs and the Twitter derived ODs. This type

of regression with these datasets presents two known issues: censoring of the dependent

variable distribution and spatial autocorrelation. To address them, we used a Tobit model with spatial

lag variables to develop an unbiased conversion method between Twitter trips and Four-step

Travel Demand Model output. We also used Latent Class Regression models to take into account

the heterogeneous nature of space. Finally, multiple unit contributions of a Twitter OD trip were

estimated with the two types of regression models. The Tobit model produced a single unit

contribution of a Twitter trip, whereas the Latent Class Regression model yielded three different

unit contributions depending on spatial structure.

In summary, our approach provides a methodology to convert an OD matrix extracted from social

media data into one comparable with the OD matrix of a conventional travel demand model. The OD extraction

algorithm and the spatial aggregation techniques play pivotal roles in making this comparison possible.

In addition, this paper develops methods to define spatially lagged variables for OD trips (which are doubly dependent on space),

and those variables efficiently control for spatial autocorrelation. The

Tobit model and the Latent Class Regression model provide a framework to handle the censored

distribution of the dependent variable and to capture spatial heterogeneity.

Although our approach provides a conversion method based on spatial aggregation, a

better way to address the zero-cell problem and to represent travel demand in space and time is to

collect larger amounts of social media data and then extract larger numbers of OD trips that are spread

throughout the region, so that even less populated zones are represented with a sufficient

number of trips to analyze them. Therefore, our immediate next step is to collect

more social media data for longer periods, and then apply this approach with a larger zoning system

such as Traffic Analysis Districts and with OD matrices for different times of day. In addition,

activity-based travel demand models provide a series of OD matrices by type of activity,

and these could be compared to social media data using a variety of filters that may yield more

fine-grained information.

Acknowledgement

Dr. Hsi-Hwa Hu from the Southern California Association of Governments is acknowledged for

providing us with the SCAG OD matrix used here. Funding for this research was provided by the

Multicampus Research Program Initiative on Sustainable Transportation, the University of California

Transportation Center (UCTC), the University of California Center on Economic Competitiveness in

Transportation (UCCONNECT), and grants of the University of California Santa Barbara Office of

Research and the College of Letters and Science.

References

Anselin, L. (1988). Spatial econometrics: methods and models (Vol. 4). Springer Science &

Business Media.

Auld, J., Williams, C., Mohammadian, A., & Nelson, P. (2009). An automated GPS-based

prompted recall survey with learning algorithms. Transportation Letters, 1(1), 59-79.

Cebelak, Meredith Kimberly, 2013. Location-based social networking data: doubly-constrained

gravity model origin-destination estimation of the urban travel demand for Austin, TX. Master’s

Thesis, The University of Texas at Austin.

Coffey, C., & Pozdnoukhov, A. (2013, November). Temporal decomposition and semantic

enrichment of mobility flows. In Proceedings of the 6th ACM SIGSPATIAL International

Workshop on Location-Based Social Networks (pp. 34-43). ACM.

Collins, Craig, Hasan, Samiul, Ukkusuri, Satish V., 2013. A novel transit rider satisfaction

metric: rider sentiments measured from online social media data. J. Public Transport. 16 (2).

Cottrill, C. D., Pereira, F. C., Zhao, F., Dias, I. F., Lim, H. B., Ben-Akiva, M. E., & Zegras, P. C.

(2013). Future Mobility Survey. Transportation Research Record: Journal of the Transportation

Research Board, 2354(1), 59-67.

Deutsch-Burgner K and K.G. Goulias (2014) Measuring Heterogeneity in Spatial Perception for

Activity and Travel Behavior Modeling. Paper submitted for presentation at the 94th Annual

Meeting of the Transportation Research Board, Washington, D.C., January 11-15, 2015. Also

published as GEOTRANS Report 2014-8-07, Santa Barbara, CA.

Fan, Y., Chen, Q., Liao, C. F., & Douma, F. (2012, November). UbiActive: Smartphone-Based

Tool for Trip Detection and Travel-Related Physical Activity Assessment. In Submitted for

Presentation at the Transportation Research Board 92nd Annual Meeting.

Gao, S., Yang, J-A., Yan, B., Hu, Y., Janowicz, K., McKenzie, G. (2014) Detecting

Origin-Destination Mobility Flows From Geotagged Tweets in Greater Los Angeles Area. Eighth

International Conference on Geographic Information Science (GIScience '14). Vienna, Austria.

http://grantmckenzie.com/academics/McKenzie_DODMF.pdf

Goulias, K. G. and R. Kitamura (1993) Analysis of binary choice frequencies with limit cases:

Comparison of alternative estimation methods and application to weekly household mode choice.

Transportation Research Part B: Methodological, 27(1), pp. 65-78.

Goulias K.G., R.M. Pendyala, and C. R. Bhat (2013) Chapter 2: Keynote - Total Design Data

Needs for the New Generation Large-Scale Activity Microsimulation Models. In Transport

Survey Methods: Best Practice for Decision Making (eds J. Zmud, M. Lee-Gosselin, M.

Munizaga, and J. A. Carasco). Emerald Group Publishing, Bingley, UK.

Goulias, K. G., Bhat, C. R., Pendyala, R. M., Chen, Y., Paleti, R., Konduri, K. C., ... & Hu, H. H.

(2012, January). Simulator of activities, greenhouse emissions, networks, and travel

(SimAGENT) in Southern California. In 91st annual meeting of the Transportation Research

Board, Washington, DC.

Hasan, S., & Ukkusuri, S. V. (2014). Urban activity pattern classification using topic models

from online geo-location data. Transportation Research Part C: Emerging Technologies, 44, 363-381.

Leiman, J. M., Bengelsdorf, T., Faussett, K., & Transportation Research Board. (2006).

Household Travel Surveys: Using Design Effects to Compare Alternative Sample Designs.

In 85th Annual Meeting of the Transportation Research Board, Washington, DC.

Maddala, G. S. (1986). Limited-dependent and qualitative variables in econometrics (No. 3).

Cambridge university press.

Monzon, J., K. G. Goulias, and R. Kitamura (1989) Trip Generation Models for Infrequent Trips.

Transportation Research Record, 1220, pp. 40-46.

Nitsche, P., Widhalm, P., Breuss, S., & Maurer, P. (2012). A strategy on how to utilize

smartphones for automatically reconstructing trips in travel surveys. Procedia-Social and

Behavioral Sciences, 48, 1033-1046.

Pew Research Internet Project, 2012, Twitter Use 2012,

http://www.pewinternet.org/2012/05/31/twitter-use-2012

Shirima, K., Mukasa, O., Schellenberg, J. A., Manzi, F., John, D., Mushi, A., ... & Schellenberg,

D. (2007). The use of personal digital assistants for data entry at the point of collection in a large

household survey in southern Tanzania. Emerging themes in epidemiology, 4(1), 5.

Southern California Association of Governments (SCAG), 2012, 2012-2035 Regional

Transportation Plan/Sustainable Communities Strategy (RTP/SCS), Adopted April 2012,

http://rtpscs.scag.ca.gov/Documents/2012/final/f2012RTPSCS.pdf ,

http://rtpscs.scag.ca.gov/Documents/2012/draft/SR/2012dRTP_TransportationConformityReport.pdf

Turner, S. M. (1996). Advanced techniques for travel time data collection. Transportation

Research Record: Journal of the Transportation Research Board, 1551(1), 51-58.

J.K. Vermunt and J. Magidson (2013). Technical Guide for Latent GOLD 5.0: Basic, Advanced,

and Syntax. Belmont, MA: Statistical Innovations Inc

Xu, X., & Lee, L. F. (2014). Maximum likelihood estimation of a spatial autoregressive Tobit

model. Manuscript, Department of Economics, Ohio State University.
