discovering latent activity patterns from human...

9
Discovering Latent Activity Paerns from Human Mobility Zhan Zhao Massachusetts Institute of Technology Cambridge, MA, USA [email protected] Haris N. Koutsopoulos Northeastern University Boston, MA, USA [email protected] Jinhua Zhao Massachusetts Institute of Technology Cambridge, MA, USA [email protected] ABSTRACT Although automatically collected human travel records can accu- rately capture the time and location of human movements, they do not directly explain the hidden semantic structures behind the data, e.g., activity types and/or travel purposes. Probabilistic topic models, which are widely used in natural language processing for document classification, can be used to uncover semantic human mobility patterns from large amount of spatiotemporal data in an unsupervised manner. In this case, the trips of an individual user are treated as words in a document, and each topic is a distribution over space and time that corresponds to some activity. Specifically, a classical topic model, Latent Dirichlet Allocation (LDA), is ex- tended to incorporate multiple heterogeneous trip attributes—the destination, time of arrival, day of week, and duration of stay. The proposed methodology is demonstrated using pseudonymized tran- sit smart card data from London, U.K. The discovered topics reveal diverse spatiotemporal patterns, and are mostly interpretable. It is relatively easy to identify work/school, home, and night-life activi- ties from these topics. The results can be used to infer the latent activity patterns behind the physical human movements directly observed in mobility datasets. KEYWORDS human mobility, automatic activity discovery, topic model, gibbs sampling, transit smart card data ACM Reference Format: Zhan Zhao, Haris N. Koutsopoulos, and Jinhua Zhao. 2018. Discovering Latent Activity Patterns from Human Mobility. In Proceedings of The 7th ACM SIGKDD International Workshop on Urban Computing (UrbComp’18). August 20, 2018, London, UK, 9 pages. 1 INTRODUCTION Recent years have seen an explosion of large-scale spatiotemporal datasets related to human movements, such as cellular network data, transit smart card data, and geo-tagged social media data. Although such automated data sources can capture the time and location of human mobility in high precision and fine details, they do not explicitly provide any semantic information related to travel behavior, e.g., why people visit a certain place at a certain time. Human activities have long been recognized as the fundamental dri- ver for travel demand. In activity-based analysis of travel behavior, travel is treated as being derived from the need to pursue activities Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s). UrbComp’18, August 20, 2018, London, UK © 2018 Copyright held by the owner/author(s). distributed in space [3, 4, 8, 24]. The most common way to collect such information is through manual surveys of individual activity participation, which are costly and do not scale well. Because of the limited sample size and observation period, such data may not be able to capture the diversity and dynamics of activity patterns. Also, in these surveys, activities are typically treated as predefined (e.g., home, work, school, recreation). It is questionable such cat- egorization is complete or detailed enough to represent human activities. Nevertheless, even without explicit semantic information from other data sources, longitudinal spatiotemporal data itself generally contains a significant amount of structure [9]. Assuming that people’s spatiotemporal choices for each trip they take are generated based on the activity they intend to participate in at the destination, it is possible to infer the latent activity patterns that underlie human mobility. The objective of this work is to develop a methodology that can uncover the latent activity patterns from large-scale human mobility datasets. Automatic activity discovery is a challenging task, as people’s spatiotemporal choices vary from day to day and from individual to individual. Some of the variations can be explained by differ- ent underlying activities (i.e., inter-activity variability), and some are attributed to exogenous factors (e.g., weather) and thus be- come inherent randomness for the same activity (i.e., intra-activity variability). A good approach should be able to sift through large amounts of noisy data and find meaningful underlying activities. Similar to activity-based models, a supervised learning approach would require prior knowledge in the form of predefined activity categories and labeled data [1, 20]. In contrast, an unsupervised learning approach does not require training data, and has the po- tential of automatic discovery of emerging activity patterns [11, 14]. In this work, we develop a novel methodology that extends Latent Dirichlet Allocation (LDA), a well known probabilistic topic model first introduced by [7]. Topic models are generative models that rep- resent documents as mixtures of topics, and assign a topic to each word in a document. In our model, we treat the travel history of each individual as a document, and each trip as a multi-dimensional word. Therefore, our model can incorporate multiple heterogeneous behavior dimensions, and infer a latent activity for each trip and an activity mixture for each individual. The inferred activity pat- terns can then be used to understand, and predict, travel demand. Understanding human activity patterns has important applications beyond transportation, such as urban planning, location-based ser- vices, public health and safety, and emergency response. 2 RELATED WORK A plethora of methods have been proposed in the literature to in- fer activity patterns from human mobility datasets. They can be generally categorized into two types—supervised learning meth- ods, and unsupervised learning methods. In supervised learning,

Upload: others

Post on 07-Oct-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Discovering Latent Activity Patterns from Human Mobilityurbcomp.ist.psu.edu/2018/papers/discovering.pdf · through sensors in smartphones and activity information through a web-based

Discovering Latent Activity Patterns from Human MobilityZhan Zhao

Massachusetts Institute of TechnologyCambridge, MA, [email protected]

Haris N. KoutsopoulosNortheastern University

Boston, MA, [email protected]

Jinhua ZhaoMassachusetts Institute of Technology

Cambridge, MA, [email protected]

ABSTRACTAlthough automatically collected human travel records can accu-rately capture the time and location of human movements, theydo not directly explain the hidden semantic structures behind thedata, e.g., activity types and/or travel purposes. Probabilistic topicmodels, which are widely used in natural language processing fordocument classification, can be used to uncover semantic humanmobility patterns from large amount of spatiotemporal data in anunsupervised manner. In this case, the trips of an individual userare treated as words in a document, and each topic is a distributionover space and time that corresponds to some activity. Specifically,a classical topic model, Latent Dirichlet Allocation (LDA), is ex-tended to incorporate multiple heterogeneous trip attributes—thedestination, time of arrival, day of week, and duration of stay. Theproposed methodology is demonstrated using pseudonymized tran-sit smart card data from London, U.K. The discovered topics revealdiverse spatiotemporal patterns, and are mostly interpretable. It isrelatively easy to identify work/school, home, and night-life activi-ties from these topics. The results can be used to infer the latentactivity patterns behind the physical human movements directlyobserved in mobility datasets.

KEYWORDShuman mobility, automatic activity discovery, topic model, gibbssampling, transit smart card dataACM Reference Format:Zhan Zhao, Haris N. Koutsopoulos, and Jinhua Zhao. 2018. DiscoveringLatent Activity Patterns from Human Mobility. In Proceedings of The 7thACM SIGKDD International Workshop on Urban Computing (UrbComp’18).August 20, 2018, London, UK, 9 pages.

1 INTRODUCTIONRecent years have seen an explosion of large-scale spatiotemporaldatasets related to human movements, such as cellular networkdata, transit smart card data, and geo-tagged social media data.Although such automated data sources can capture the time andlocation of human mobility in high precision and fine details, theydo not explicitly provide any semantic information related to travelbehavior, e.g., why people visit a certain place at a certain time.Human activities have long been recognized as the fundamental dri-ver for travel demand. In activity-based analysis of travel behavior,travel is treated as being derived from the need to pursue activities

Permission to make digital or hard copies of part or all of this work for personal orclassroom use is granted without fee provided that copies are not made or distributedfor profit or commercial advantage and that copies bear this notice and the full citationon the first page. Copyrights for third-party components of this work must be honored.For all other uses, contact the owner/author(s).UrbComp’18, August 20, 2018, London, UK© 2018 Copyright held by the owner/author(s).

distributed in space [3, 4, 8, 24]. The most common way to collectsuch information is through manual surveys of individual activityparticipation, which are costly and do not scale well. Because ofthe limited sample size and observation period, such data may notbe able to capture the diversity and dynamics of activity patterns.Also, in these surveys, activities are typically treated as predefined(e.g., home, work, school, recreation). It is questionable such cat-egorization is complete or detailed enough to represent humanactivities. Nevertheless, even without explicit semantic informationfrom other data sources, longitudinal spatiotemporal data itselfgenerally contains a significant amount of structure [9]. Assumingthat people’s spatiotemporal choices for each trip they take aregenerated based on the activity they intend to participate in at thedestination, it is possible to infer the latent activity patterns thatunderlie human mobility. The objective of this work is to developa methodology that can uncover the latent activity patterns fromlarge-scale human mobility datasets.

Automatic activity discovery is a challenging task, as people’sspatiotemporal choices vary from day to day and from individualto individual. Some of the variations can be explained by differ-ent underlying activities (i.e., inter-activity variability), and someare attributed to exogenous factors (e.g., weather) and thus be-come inherent randomness for the same activity (i.e., intra-activityvariability). A good approach should be able to sift through largeamounts of noisy data and find meaningful underlying activities.Similar to activity-based models, a supervised learning approachwould require prior knowledge in the form of predefined activitycategories and labeled data [1, 20]. In contrast, an unsupervisedlearning approach does not require training data, and has the po-tential of automatic discovery of emerging activity patterns [11, 14].In this work, we develop a novel methodology that extends LatentDirichlet Allocation (LDA), a well known probabilistic topic modelfirst introduced by [7]. Topic models are generative models that rep-resent documents as mixtures of topics, and assign a topic to eachword in a document. In our model, we treat the travel history ofeach individual as a document, and each trip as amulti-dimensionalword. Therefore, our model can incorporate multiple heterogeneousbehavior dimensions, and infer a latent activity for each trip andan activity mixture for each individual. The inferred activity pat-terns can then be used to understand, and predict, travel demand.Understanding human activity patterns has important applicationsbeyond transportation, such as urban planning, location-based ser-vices, public health and safety, and emergency response.

2 RELATEDWORKA plethora of methods have been proposed in the literature to in-fer activity patterns from human mobility datasets. They can begenerally categorized into two types—supervised learning meth-ods, and unsupervised learning methods. In supervised learning,

Page 2: Discovering Latent Activity Patterns from Human Mobilityurbcomp.ist.psu.edu/2018/papers/discovering.pdf · through sensors in smartphones and activity information through a web-based

UrbComp’18, August 20, 2018, London, UK Z. Zhao et al.

the true activities associated with travel records need to be pro-vided. For example, using annotated GPS data, [20] proposed a newapproach for activity inference based on Relational Markov Net-works (RMN) and Conditional Random Fields (CRF). [1] adopted amulti-class Support Vector Machine (SVM) approach to infer theactivity type, and validated it on a subset of the 2001 CaliforniaPersonal Travel Survey data. More recently, researchers turned todata fusion to form labeled training samples. This was commonlydone by combining mobility data (e.g., transit smart card data)with survey data [2, 18, 19]. The advancement of information andcommunication technologies has made data fusion more feasible.For example, [17] demonstrated the feasibility of activity inferenceusing data from the Future Mobility Survey (FMS), a smartphonebased activity-travel survey system, which acquires movement datathrough sensors in smartphones and activity information througha web-based interactive process.

For unsupervised learning, the activity information is not pro-vided, and the problem is to discover and interpret latent patternsfrom the data. In one of the first studies of this kind, [9] used Prin-ciple Component Analysis (PCA) to extract a set of characteristicbehavior vectors, called “eigenbehavior” from mobile phone data.Apart from PCA, other variations of dimension reduction methodshave been applied to discover latent patterns from human mobilitydata, including non-negative matrix factorization [22], and prob-abilistic tensor factorization [25]. A Continuous Hidden MarkovModel (CHMM) was proposed in [13] to impute the sequence ofactivities for each trip chain. Overall, these methods are not suit-able for grouped data, where multiple trips associated with thesame individual user are correlated. To model this type of data, ahierarchical structure is needed.

First introduced by [7], Latent Dirichlet Allocation (LDA) is agenerative probabilistic model for collections of grouped discretedata. Each group is described as a random mixture over a set oflatent topics where each topic is a discrete distribution over the col-lection’s vocabulary. Other more recent topic models are generallyextensions of LDA, including dynamic topic model [5], supervisedtopic model [6], and Hierarchical Dirichlet Process (HDP) [26]. Orig-inally designed as a text mining tool, it has found applications inother fields such as image processing [23] and bioinformatics [21].For activity recognition, it was first applied to wearable sensordata in [16]. Regarding its application to mobility analysis, [10, 11]adapted the LDA model for annotated mobile phone data, in whicha daily mobility of an individual is represented as a “bag of loca-tion sequences”. Later, a similar approach was used by [14] to findweekly activity patterns from individual activity information sharedin social media. These studies differ from our work in two ways.First, the common goal of these studies is to identify routines fromactivity data or annotated location data. In comparison, our focusis on inferring activities from human mobility data, without anno-tation. Second, the methods proposed in these studies only workwith grouped discrete data (e.g., words), which is what LDA wasoriginally designed for. In our work, we consider human mobilityas a combination of multiple heterogeneous dimensions, includingboth discrete and continuous observations. In the next section, wewill present a new methodology that extends the LDA model towork with multi-dimensional heterogeneous data, for the purposeof inferring the latent activity associated with each trip.

A similar approach to ours was proposed by [30] for mobilecontext discovery. It considered both spatial and temporal aspectsof human behavior, but only focused on identifying temporal rou-tines. Specifically, the spatial patterns were forced to be individual-specific and could not be shared across individuals. This may limitthe method’s capability in inferring activities. The method wasvalidated on detailed mobile phone data from 20 users with com-plete survey information. For large-scale application, however, suchdetailed information is rarely available. While our methodologydraws many inspirations from [30], we distinguish our work inseveral ways. First, both spatial and temporal patterns are treatedas global; they can be shared across individuals. In our model, eachtopic (or activity) is characterized by a distinct spatiotemporal dis-tribution. Second, the duration of a stay is included in our model,which provides valuable information for activity inference. Third,we test our methodology using a large collection of individual-leveltransit smart card records. Unlike mobile phone data, transit smartcard data is intrinsic to human mobility [28]. As a result, the modelneeds to be adapted to match the characteristics of the data.

3 METHODOLOGY3.1 Problem FormulationLet us assume that for each userm (m = 1, ...,M), we observe acollection of Nm trips, and the n-th trip of userm is associated witha latent topic (or activity) zmn . The goal is to find zmn that can bestexplain the observed trips.

Each trip is characterized by a set of attributes, which should bechosen based on the problem and the available data source. For thepurpose of latent activity inference, we should choose attributesthat are observable from the data and can help distinguish betweendifferent activities. In this study, we consider four trip attributes: thedestination xmn , day of week dmn , time of day tmn , and duration ofstay rmn (i.e., how long the user stays at the destination). Both dmnand xmn are discrete, but tmn and rmn are continuous variables.Based on the activity-based analysis framework, the distributionsof these variables depend on zmn .

To reflect individual heterogeneity, each user is characterized bya multinomial distribution over topics πm . In other words, differentusers may have different composition of activities. For example,some users tend to travel more for commuting, while others travelmore for recreational purposes. πm is used to describe the distribu-tion of the latent activity zmn of a userm.

Specifically, the model assumes the data are generated accordingto the following process:

(1) For each topic z = 1, 2, ...,Z ,(a) Sample a location distribution θz ∼ Dirichlet(β)(b) Sample a day of week distribution ϕz ∼ Dirichlet(γ )(c) Sample a time of day distribution µz ∼ Normal(µ0,τ−10 )(d) Sample a duration distribution ηz ∼ Normal(η0, λ−10 )

(2) For each individualm = 1, 2, ...,M ,(a) Sample a topic distribution: πm ∼ Dirichlet(α)(b) For each trip of the individual n = 1, 2, ...,Nm ,

(i) Sample a topic zmn ∼ Multinomial(πm )(ii) Sample a location xmn ∼ Multinomial(θzmn )(iii) Sample a day of week dmn ∼ Multinomial(ϕzmn )(iv) Sample a time of day tmn ∼ Normal(µzmn ,τ

−1)

Page 3: Discovering Latent Activity Patterns from Human Mobilityurbcomp.ist.psu.edu/2018/papers/discovering.pdf · through sensors in smartphones and activity information through a web-based

Discovering Latent Activity Patterns from Human Mobility UrbComp’18, August 20, 2018, London, UK

(v) Sample a duration rmn ∼ Normal(ηzmn , λ−1)

Figure 1: Plate notation of the human mobility LDA model

Table 1: List of notations

Notation Explanation

M number of individualsZ number of topicsX number of locationsD number of days of weekN total number of tripsNm number of trips for individualmxmn destination indicator for the n-th trip of individualmdmn day of week indicator for the n-th trip of individualmtmn time of day indicator for the n-th trip of individualmrmn duration indicator for the n-th trip of individualmzmn topic assignment indicator for the n-th trip of individualmπm parameter for the topic distribution of individualmθz parameter for the location distribution of topic zϕz parameter for the day of week distribution of topic zµz , τ mean and precision for the time of day distribution of topic zηz , λ mean and precision for the duration distribution of topic zα hyperparameter for the topic distribution πmβ hyperparameter for the location distribution θzγ hyperparameter for the day of week distribution ϕz

µ0, τ0 prior mean and precision for µzη0, λ0 prior mean and precision for ηznz number of trips assigned to topic zumz number of trips of individualm assigned to topic zvzx number of trips to location x assigned to topic zwzd number of trips on day of week d assigned to topic zsz sum of time of day t for trips assigned to topic zqz sum of duration r for trips assigned to topic z

The structure of the adapted LDA model is shown in Figure 1,where the shaded circles represent the observed or pre-specifiedvariables, and the non-shaded circles represent the latent variables

to be estimated. The complete list of the notation used in this paperis summarized in Table 1. In this study, we assumeτ and λ are knownand constant regardless of the topic. Given hyperparameters α , β ,γ , µ0, τ0, η0, and λ0, the generative process described above resultsin the following joint distribution:

P(x ,d, t ,r ,z,π ,θ ,ϕ, µ,η)=P(z | π )P(x | θz )P(d | ϕz )P(t | µz ,τ )P(r | ηz , λ)·P(π )P(θ )P(ϕ)P(µ)P(η)

(1)

where x , d , t , and r are observed, and z, π , θ , π , µ, and η arelatent variables to be estimated. The hyperparameters are omittedfor clarity.

3.2 LikelihoodsTo evaluate the goodness of fit of the modelM, we use the likeli-hood function, which can be expressed as

L(M) = P(x ,d, t ,r | M)

=

M∏m=1

Nm∏n=1

Z∑zmn=1

P(zmn ,xmn ,dmn , tmn , rmn )(2)

For then-th trip of them-th individual, the above joint probabilitycan be further expanded as

P(zmn = z,xmn = x ,dmn = d, tmn = t , rmn = r )

=( ∫

πmP(z | πm )P(πm )

)·( ∫

θP(x | θz )P(θ )

)·( ∫

ϕP(d | ϕz )P(ϕ)

)·( ∫

µP(t | µz )P(µ)

)·( ∫

ηP(r | ηz )P(η)

)=

umz + αz∑Zk=1 umk + αk

· vzx + βx∑Xk=1vzk + βk

· wzd + γd∑Dk=1wzk + γk

·N(t | τsz + τ0µ0

τnz + τ0,

1τnz + τ0

+1τ

)·N

(r | λqz + λ0η0

λnz + λ0,

1λnz + λ0

+1λ

)(3)

where the first term represents the likelihood of topic assign-ments, and the second through fifth terms indicate the marginallikelihood of location, day of week, time of day, and duration ofstay choices given topic assignments. N(t | µ,σ 2) represents theprobability density function for a normal distribution with mean µand variance σ 2 (or precision σ−2).

Perplexity is a standard metric in machine learning to measurethe performance of a probabilistic model, and it has been oftenused to evaluate topic models such as LDA [11, 14]. A lower per-plexity value indicates better model performance. It can be directlycalculated based on the likelihood function:

Perplexity = exp(− log(L(M))

N

)(4)

where N is the total number of trips in the data.

3.3 Inference via Gibbs SamplingIn the literature, two types of approximate techniques have beenadopted to estimate the LDA model—variational inference [7] and

Page 4: Discovering Latent Activity Patterns from Human Mobilityurbcomp.ist.psu.edu/2018/papers/discovering.pdf · through sensors in smartphones and activity information through a web-based

UrbComp’18, August 20, 2018, London, UK Z. Zhao et al.

Gibbs sampling [12]. In this work, we use Gibbs sampling ap-proach, because it is relatively more flexible and easier to imple-ment. Gibbs sampling is a special case of the Markov Chain MonteCarlo (MCMC) methods, which can emulate the target posteriordistribution by the stationary behavior of a Markov chain. In high-dimensional cases, Gibbs sampling works by sampling each dimen-sion iteratively, conditioned on the values of all other dimensions.

In practice, only x , d , t , and r are observed, and we want toestimate latent variables z, π , θ , ϕ, µ, and η. However, the latter fivevariables can be integrated out, because they can be derived usingthe topic variable z:

πmz =umz + αz∑Zk=1 umk + αk

(5)

θzx =vzx + βx∑Xk=1vzk + βk

(6)

ϕzd =wzd + γd∑Dk=1wzk + γk

(7)

µz =τsz + τ0µ0τnz + τ0

(8)

ηz =λqz + λ0η0λnz + λ0

(9)

The strategy of integrating out some of the parameters for modelinference is often referred to as collapsed Gibbs sampling. In or-der to construct a collapsed Gibbs sampler, we need to computethe probability of a topic being assigned to a trip, given all othertopic assignments to all other trips. Therefore, we can derive thefull conditional topic distribution for a specific trip. Assuming thatxmn = x , dmn = d , tmn = t , and rmn = r , the conditional probabil-ity of zmn = z is given by

P(zmn = z | z−mn ,x ,d, t ,r )∝P(zmn = z | z−mn ) · P(xmn = x | zmn = z,z−mn ,x−mn )·P(dmn = d | zmn = z,z−mn ,d−mn )·P(tmn = t | zmn = z,z−mn , t−mn )·P(rmn = r | zmn = z,z−mn ,r−mn )

∝ u−mnmz + αz∑Z

k=1 u−mnmk + αk

· v−mnzx + βx∑X

k=1v−mnzk + βk

·w−mnzd + γd∑D

k=1w−mnzk + γk

·N(t | τs

−mnz + τ0µ0τn−mn

z + τ0,

1τn−mn

z + τ0+

)·N

(r | λq

−mnz + λ0η0

λn−mnz + λ0

,1

λn−mnz + λ0

+1λ

)

(10)

where the superscript −mn signifies leaving the n-th trip of them-th individual out of the calculation. Note that Eq. (10) is similarto Eq. (3), which is not surprising. The probability of a topic assign-ment is proportional to the joint probability of the data with thetopic assignment.

In practice, it is more convenient to store the input data x , d , t ,r in arrays, so that xi , di , ti , and ri are the attributes of the i-thtrip in the dataset. In order to keep track of the user that each tripbelongs to, we use another arraym, wheremi indicates the user IDassociated with the i-th trip. See Algorithm 1 for the detailed GibbsSampling procedure.

Algorithm 1:Adapted LDAmodel for latent activity discovery

Data: trip attributes grouped by individual x , d , t , r , andmResult: topic assignments z, and related latent variables π , θ ,

ϕ, µ, and ηbegin

randomly initialize z, and set up auxiliary variables nz ,umz , vzx ,wzd , sz , and qz ;foreach iteration do

for i ← 1 to N doz ← zi , x ← xi , d ← di , t ← ti , r ← ri ,m ←mi ;nz = nz − 1, umz = umz − 1, vzx = vzx − 1,wzd = wzd − 1 ;sz = sz − t , qz = qz − r ;for k ← 1 to Z do

calculate the conditional probabilityP(zi = k |·) based on Eq. (10) ;

endz′ ← sample from P(zi |·) ;nz′ = nz′ + 1, umz′ = umz′ + 1, vz′x = vz′x + 1,wz′d = wz′d + 1 ;sz′ = sz′ + t , qz′ = qz′ + r ;

endendfor j ← 1 toM do

calculate πj based on Eq. (5) ;endfor k ← 1 to Z do

calculate θk , ϕk , µk , and ηk based on Eqs. (6) to (9) ;endreturn z, π , θ , ϕ, µ, η ;

end

3.4 HyperparametersThe choice of hyperparameters can significantly influence the be-havior of the model. Typically, symmetric Dirichlet priors are usedin LDA, which means that the a priori assumption is that all pos-sible outcomes have the same chance of occurring. The Dirichlethyperparameters generally have a smoothing effect on multinomialparameters. Lowering the values of these hyperparameters willreduce the smoothing effect and increase sparsity of the posteriordistribution. In our model, the sparsity of the πm , θz , and ϕz arecontrolled by the Dirichlet hyperparameters α , β , and γ , respec-tively. A sparser πm means that the model prefers to characterizeeach user by few topics. Similarly, a sparser θz or ϕz means thatthe model prefers to characterize each topic by few locations ordays of week.

For normal hyperparameters, unless there are strong beliefsabout µz and ηz , it is preferable to set the prior means µ0 and η0 tothe sample averages, and the prior precisions τ0 and λ0 to a smallvalue (i.e., a large variance) so that a larger range of possible valuesof µz and ηz can be explored. In addition, the precision parameters τand λ need to be specified as well, which control the concentrationof the distributions of t and r given topic assignment. A larger τor λ means that the distribution of t or r given topic z is more

Page 5: Discovering Latent Activity Patterns from Human Mobilityurbcomp.ist.psu.edu/2018/papers/discovering.pdf · through sensors in smartphones and activity information through a web-based

Discovering Latent Activity Patterns from Human Mobility UrbComp’18, August 20, 2018, London, UK

concentrated on the mean µz or ηz . Both t and r are measuredin hours, and thus µ0, η0, µz and ηz are all in hours. The specificchoices of the hyperparameters used in this study are summarizedas follows:• αz = 50/Z , for z = 1, ...,Z ; this choice is based on [12].• βx = 0.1, for x = 1, ...,X ; this choice is based on [12].• γd = 1, for d = 1, ...,D; typically this value should be muchlarger than βx , because D ≪ X in most cases.• µ0 = 14; this is roughly the mean of t in the data• η0 = 9; this is roughly the mean of r in the data• τ0 = λ0 = 1/64 (i.e., standard deviation is 8 hours)• τ = λ = 1 (i.e., standard deviation is 1 hour)

The only hyperparameter that remains to be set is the numberof topics Z , which we will test in Section 5. Typically, when thetopics are sparse (e.g., β and γ are small, and τ and λ are large), themodel will fit better to the data if Z is set higher [15].

4 DATATo test the proposed model, we use a dataset of pseudonymised triprecords from more than 100,000 unique smart cards over two years.The data were made available by Transport for London. We assumeeach card corresponds to an individual user. The public transporta-tion system in London consists of several modes. However, thedataset only covers the rail-based modes, including London Un-derground, Overground, and part of National Rail. Therefore, thedataset can only capture a subset of the trips taken by each user,which is typical for large-scale mobility data sources.

For each trip in the dataset, we extract four attributes—destinationx , day of weekd , time of day t , and duration of stay r . The first threeattributes are directly obtained from the smart card transactionrecorded when the user exit the transit system at the destinationstation. The duration of stay for a trip is defined as the differencebetween the end time of the trip and the start time of the next trip.However, because only a subset of trips are recorded in the data, itis critical to ensure that the user does not travel to another placebetween the two observed trips, referred to as a hidden visit in [29].If there exists a hidden visit, the true duration of stay would beunknown. Therefore, we use two thresholds to filter out potentialhidden visits. For a trip to be included in the analysis, (1) the dis-tance between its destination and the next trip’s origin has to besmaller than a distance threshold δ = 2 km, and (2) the differencebetween its end time and the next trip’s start time has to be smallerthan a duration threshold T = 24 hours. Furthermore, we limit thateach user has to have at least 20 trips, i.e., Nm ≥ 20. After datapreprocessing, we obtain 3,014,534 trips committed by 19,691 users.

Figure 2 illustrates the distribution of the four trip attributesof interest. Figure 2(a) shows the distribution of the time of day t ,which is dominated by themorning and afternoon peaks. Figure 2(b)shows the distribution of the day of week d ; it is clear that thereare more trips on weekdays than weekends. The distribution of theduration of stay r is shown in Figure 2(c), which is characterized bythree peaks—one in 13-15 hours, one in 9-11 hours, and one in 1-3hours. They probably roughly correspond to the three categoriesof activities—home, work, and other. Figure 2(d) presents the top20 most popular destinations in our data, and their correspondingprobabilities. Oxford Circus is by far the most popular destination,

Figure 2: Distribution of trip attributes

followed by London Bridge and Stratford. In total, 665 stationsappear in our data, i.e., X = 665. As one can expect, most stationshave relatively low probabilities, many of which are located in thesuburban areas. Showing the top stations may not correctly reflectthe overall spatial patterns. Therefore, we use P(inner) to indicatethe total probability of all the stations located in Inner London, andP(central) for Central London. Inner London refers to the group ofLondon boroughs, and the City of London, which form the interiorpart of Greater London. Central London is located at the core ofInner London. In this study, we define Central London as the areawithin the congestion charging zone. P(inner) and P(central) areshown in the top right corner of Figure 2(d).

The distributions shown in Figure 2 are not meant to representthe overall travel patterns for all transit passengers in London,which may not be the case given the way we filter the data. For ex-ample, the popular tourist destinations are likely underrepresentedin the data, because we only include users with at least 20 trips.Instead, Figure 2 should be used to benchmark the topic-specifictrip attribute distributions presented in Section 5.

5 RESULTS5.1 Model SelectionThe application of the model requires the number of topics Z tobe selected. In the literature, perplexity is often used to choose Z[11, 14]. Figure 3 shows how the perplexity value varies by Z . Aset of potential values of Z are tested: 5, 10, 15, 20, 30, 40, 50, and60. We find that when Z < 15, increasing Z drastically decreasesthe perplexity; when Z > 15, increasing Z can slightly increase theperplexity. Therefore, we set Z = 15.

Note that our choice is much lower than the number of topicschosen in prior studies, which often set Z ≥ 100. However, they arenot comparable because our model structure is different and theinclusion of continuous variables such as t and r may significantlyaffect the number of topics needed to fit the data well. In practice,

Page 6: Discovering Latent Activity Patterns from Human Mobilityurbcomp.ist.psu.edu/2018/papers/discovering.pdf · through sensors in smartphones and activity information through a web-based

UrbComp’18, August 20, 2018, London, UK Z. Zhao et al.

a smaller number of topics is preferable as it is easier to examineand interpret the results, and less computationally costly to fit themodel.

Figure 3: Perplexity by number of topics

5.2 Topic AnalysisIt is worth noting that having Z topics does not mean that theyare equally important; some topics are more prevalent than others.To measure the importance of a topic, we use the average topicproportion per user, or 1

M∑Mm=1 πm . The discovered topics are

ranked by importance. Therefore, the topic index indicates theorder of importance for that topic. The overall importance of thetopics are shown in Figure 4.

Figure 4: Importance of topics

One way to interpret the discovered topics is by examining theconditional distributions of different trip attributes. Figure 5 showsthe distributions of P(t |z), P(d |z), P(r |z), and P(x |z) for each topicz. In the figure, each column corresponds to a topic, and each rowcorresponds to a specific trip attribute. P(x |z) is shown in the fourthrow. Because it is difficult to visually present the probabilities ofall 665 locations, we only show the top 10 locations related toeach topic. P(inner) and P(central) are embedded in the figure torepresent the overall spatial pattern of each topic.

It is relatively easy to identify topics that are related to work orschool, as morning commuting trips tend to occur during morn-ing rush hours on weekdays. Topic 1 fits this description, as itsP(t |z) concentrates around 9 am and its P(d |z) is much higher onweekdays than weekends. Some of the most likely destinations areimportant employment centers, such as Canary Wharf and Bank,and users stay at their destination for around 10 hours. WhereasTopic 1 represents the typical pattern of going to work, some vari-ations can be found from other topics. Topic 9, for example, aresimilar to Topic 1 in terms of P(t |z), P(d |z), and P(x |z), but theydiffer significantly in P(r |z). Topic 9 may be related to employeesgetting off work early, or people working on a half-day shift. Topic11, on the other hand, has a significantly longer duration of staythan Topic 1. It does not necessarily mean people sometimes workfor more than 12 hours; it is possible that people may participate insome other activities after work (e.g., having dinner with colleagues)in the same area. Note that P(central) is significantly higher forTopic 11, and Central London has a high concentration of both jobsand recreational/cultural activities. Topics 1, 9, and 11 are coloredin red in Figure 5.

In addition, we can identify topics related to going home by ex-amining (1) P(t |z) and P(r |z), because people mostly stay at home atnight, and (2) P(inner) and P(central), because residential locationstend to be more dispersed than other types of locations. Topics2, 6, and 7 are likely candidates. They are similar in the spatialdimension, but different in the temporal dimension. Topic 2 likelyrepresents the typical afternoon commuting trips, arriving homeat around 7 pm and stay there for 14-15 hours. For Topic 6, thetime of arrival is typically later and the duration of stay is shorter.Interestingly, both Topics 2 and 6 have a much lower probability ofoccurring on Fridays than other weekdays. A possible explanationfor this is that most people do not need to go to work on Saturdays;after going home on Fridays, their duration of stay tends to belonger than other weekdays. This is partly captured by Topic 7, forwhich the duration of stay is much longer than Topics 2 and 6. Moregenerally, Topic 7 likely captures the activity of going home oneday and not traveling in the morning of next day. As the durationof stay is capped at 24 hours, the activity of not traveling for morethan a day is be captured in this analysis. Topics 2, 6, and 7 arecolored in green in Figure 5.

It is more challenging to assign specific meanings to the remain-ing topics, but each of them has distinct spatiotemporal characteris-tics. Topics 3, 4, and 5 may be general-purpose short visits occurringat different times of day, with an average duration of a little morethan 2 hours. These short visits can include recreational activitiessuch as shopping or eating, or work-related activities such as busi-ness meeting or part-time jobs. It is difficult to distinguish thembased on the available data. Topics 8 and 10 are similar to Topics3 and 4, respectively, but with longer duration of stay. Topic 10,and to a less extent Topic 4, are likely related to night life activities,including, but not limited to, going to theaters, movies, restaurants,and bars/clubs. Piccadilly Circus and Leicester Square, both locatedin London’s Theater district, are among the most likely destinationsfor both topics. The duration of stay is even longer for Topics 12and 13, which are relatively more likely to occur on Saturdays inCentral London. This may be attributed to people spending a whole

Page 7: Discovering Latent Activity Patterns from Human Mobilityurbcomp.ist.psu.edu/2018/papers/discovering.pdf · through sensors in smartphones and activity information through a web-based

Discovering Latent Activity Patterns from Human Mobility UrbComp’18, August 20, 2018, London, UK

Figure 5: Topic-level spatiotemporal distributions

Page 8: Discovering Latent Activity Patterns from Human Mobilityurbcomp.ist.psu.edu/2018/papers/discovering.pdf · through sensors in smartphones and activity information through a web-based

UrbComp’18, August 20, 2018, London, UK Z. Zhao et al.

Saturday afternoon in the city center for a series of recreational ac-tivities (e.g., shopping + dinner + movies). Topics 14 and 15 indicategeneric visits to areas outside of Central London.

5.3 User AnalysisBased on the proposed model, a user m is characterized by anindividual-specific topic distribution πm . By definition, πm is avector of length Z that corresponds to a categorical probabilitydistribution over Z topics; in other words,

∑Zz=1 πmz = 1 ∀m. Thus

πm can be used as a normalized latent feature vector to describe anindividual’s activity pattern, or the combination of topics.

Correlation may exist between topics. If πmj and πmk are pos-itively correlated across users, it means that Topics j and k aremore likely to co-occur for the same individual. Figure 6 shows thecorrelation matrix across the 15 topics. We find that Topics 1 and2 are likely to co-occur. This is not surprising, as the two topicsrepresent the primary morning and afternoon commuting activities,respectively. Topics 11 and 6 are another pair of commuting activi-ties that has a high probability of co-occurrence. The difference isthat the duration of stay for Topic 11 is generally longer than Topic1, and the time of arrival at home for Topic 6 is later than Topic 2.Furthermore, Topics 14 and 15 are also likely to co-occur, as theyare both related to trips going to outside of Central London.

Figure 6: Correlation matrix across topics

The individual-level topic distribution πm also provides a newway to measure the similarity between users, which can be usedto cluster users based on their activity patterns. We apply the K-Means algorithm for user segmentation, and chooseK = 6 based onthe inertia value and interpretability of the results. Figure 7 showsthe average πm for each cluster of users. Cluster 1 represent theusers whose activity patterns are relatively evenly distributed toall topics; they are likely all-purpose travelers. Clusters 2 and 5 areboth patterns of commuters whose activity patterns concentrateon Topics 1 and 2. In comparison, users in Cluster 5 are mostlyexclusive commuters, while Cluster 2 have a higher proportion ofnon-commuting trips. Topic 6 represent an alternative commutingpattern with high probabilities for Topics 11 and 6. Clusters 3 and 4represent users who make mostly short visits. Some of these usersmay be part-time or shift works who work on one or more shifts.

Figure 7: User segmentation results based on K-means

6 DISCUSSIONAlthough automatically collected spatiotemporal records can accu-rately capture the time and location of human movements, they donot explicitly explain the hidden semantic structures behind thedata, e.g., activity types and/or travel purposes. In this study, we pro-pose a model to discover latent activities from human mobility datain an unsupervised manner. The proposed model extends the LDAtopic model by incorporating multiple heterogeneous dimensions ofhuman mobility. Specifically, four trip attributes—destination, timeof arrival, day of week, and duration of stay—are used in the modelto infer the hidden topical structure, where each topic represents anlatent activity with distinct distribution over these attributes. Themodel is demonstrated via a dataset of transit smart card recordsfrom London, U.K. The discovered topics reveal diverse spatiotem-poral patterns, and are mostly interpretable. It is relatively easy toidentify activities such as work/school, home, and night-life activi-ties from these topics. Our study makes it possible to enrich humanmobility data with semantically meaningful information, whichmitigates our reliance for manually collected activity-based sur-veys. In addition, the individual-level topic distribution can be usedto characterize the user’s activity pattern and measure similaritybetween users.

While we have shown that many insights about human activ-ities can be obtained with the proposed model, our work can beextended in several ways. First, our method requires the numberof topics Z to be selected by the researcher. Typically, preliminaryexperiments are needed to choose a reasonable value for Z , whichmay not be ideal for general applications. Nonparametric meth-ods, such as Hierarchical Dirichlet Process, relaxes this constraintby automatically inferring Z from the data [26]. Second, the topicdistribution is assumed to be stable over time, which may not betrue. However, it has been shown that individual travel patterns canchange in the long term [27]. Dynamic topic models can be usedto analyze the evolution of topics over time [5]. Future researchshould explore the applicability of such methods.

ACKNOWLEDGMENTSThe authors would like to thank Transport for London for providingthe financial and data support that made this research possible.

Page 9: Discovering Latent Activity Patterns from Human Mobilityurbcomp.ist.psu.edu/2018/papers/discovering.pdf · through sensors in smartphones and activity information through a web-based

Discovering Latent Activity Patterns from Human Mobility UrbComp’18, August 20, 2018, London, UK

REFERENCES[1] Mahdieh Allahviranloo and Will Recker. 2013. Daily activity pattern recognition

by using support vector machines with multiple classes. Transportation ResearchPart B: Methodological 58 (Dec. 2013), 16–43. https://doi.org/10.1016/j.trb.2013.09.008

[2] Azalden Alsger, Ahmad Tavassoli, Mahmoud Mesbah, Luis Ferreira, and MarkHickman. 2018. Public transport trip purpose inference using smart card fare data.Transportation Research Part C: Emerging Technologies 87 (Feb. 2018), 123–137.https://doi.org/10.1016/j.trc.2017.12.016

[3] Kay W. Axhausen and Tommy Gärling. 1992. Activity-based approaches to travelanalysis: conceptual frameworks, models, and research problems. TransportReviews 12, 4 (Oct. 1992), 323–341. https://doi.org/10.1080/01441649208716826

[4] Chandra R. Bhat and Frank S. Koppelman. 1999. Activity-Based Modeling ofTravel Demand. In Handbook of Transportation Science. Springer, Boston, MA,35–61. https://doi.org/10.1007/978-1-4615-5203-1_3

[5] David M. Blei and John D. Lafferty. 2006. Dynamic Topic Models. In Proceedingsof the 23rd International Conference on Machine Learning (ICML ’06). ACM, NewYork, NY, USA, 113–120. https://doi.org/10.1145/1143844.1143859

[6] David M. Blei and Jon D. McAuliffe. 2010. Supervised Topic Models.arXiv:1003.0783 [stat] (March 2010). http://arxiv.org/abs/1003.0783 arXiv:1003.0783.

[7] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent DirichletAllocation. Journal of Machine Learning Research 3, Jan (2003), 993–1022. http://www.jmlr.org/papers/v3/blei03a.html

[8] J. L Bowman and M. E Ben-Akiva. 2001. Activity-based disaggregate traveldemand model system with activity schedules. Transportation Research Part A:Policy and Practice 35, 1 (Jan. 2001), 1–28. https://doi.org/10.1016/S0965-8564(99)00043-9

[9] Nathan Eagle and Alex Sandy Pentland. 2009. Eigenbehaviors: identifying struc-ture in routine. Behavioral Ecology and Sociobiology 63, 7 (April 2009), 1057–1066.https://doi.org/10.1007/s00265-009-0739-0

[10] Katayoun Farrahi and Daniel Gatica-Perez. 2009. Learning and Predicting Multi-modal Daily Life Patterns fromCell Phones. In Proceedings of the 2009 InternationalConference on Multimodal Interfaces (ICMI-MLMI ’09). ACM, New York, NY, USA,277–280. https://doi.org/10.1145/1647314.1647373

[11] Katayoun Farrahi and Daniel Gatica-Perez. 2011. Discovering Routines fromLarge-scale Human Locations Using Probabilistic Topic Models. ACMTrans. Intell.Syst. Technol. 2, 1 (Jan. 2011), 3:1–3:27. https://doi.org/10.1145/1889681.1889684

[12] Thomas L. Griffiths andMark Steyvers. 2004. Finding scientific topics. Proceedingsof the National Academy of Sciences 101, suppl 1 (April 2004), 5228–5235. https://doi.org/10.1073/pnas.0307752101

[13] Gain Han and Keemin Sohn. 2016. Activity imputation for trip-chains elicitedfrom smart-card data using a continuous hidden Markov model. TransportationResearch Part B: Methodological 83 (Jan. 2016), 121–135. https://doi.org/10.1016/j.trb.2015.11.015

[14] Samiul Hasan and Satish V. Ukkusuri. 2014. Urban activity pattern classificationusing topic models from online geo-location data. Transportation Research PartC: Emerging Technologies 44 (July 2014), 363–381. https://doi.org/10.1016/j.trc.2014.04.003

[15] Gregor Heinrich. 2004. Parameter estimation for text analysis. Technical Report.[16] Tâm Huynh, Mario Fritz, and Bernt Schiele. 2008. Discovery of Activity Pat-

terns Using Topic Models. In Proceedings of the 10th International Conferenceon Ubiquitous Computing (UbiComp ’08). ACM, New York, NY, USA, 10–19.https://doi.org/10.1145/1409635.1409638

[17] Y. Kim, F. C. Pereira, F. Zhao, A. Ghorpade, P. C. Zegras, and M. Ben-Akiva. 2014.Activity Recognition for a Smartphone Based Travel Survey Based on Cross-User History Data. In 2014 22nd International Conference on Pattern Recognition.432–437. https://doi.org/10.1109/ICPR.2014.83

[18] Takahiko Kusakabe and Yasuo Asakura. 2014. Behavioural data mining of transitsmart card data: A data fusion approach. Transportation Research Part C: EmergingTechnologies 46 (Sept. 2014), 179–191. https://doi.org/10.1016/j.trc.2014.05.012

[19] Sang Gu Lee and Mark Hickman. 2014. Trip purpose inference using automatedfare collection data. Public Transport 6, 1-2 (April 2014), 1–20. https://doi.org/10.1007/s12469-013-0077-5

[20] Lin Liao, Dieter Fox, and Henry Kautz. 2005. Location-based Activity RecognitionUsing Relational Markov Networks. In Proceedings of the 19th International JointConference on Artificial Intelligence (IJCAI’05). Morgan Kaufmann Publishers Inc.,San Francisco, CA, USA, 773–778. http://dl.acm.org/citation.cfm?id=1642293.1642417

[21] Lin Liu, Lin Tang, Wen Dong, Shaowen Yao, and Wei Zhou. 2016. An overviewof topic modeling and its current applications in bioinformatics. SpringerPlus 5,1 (Sept. 2016). https://doi.org/10.1186/s40064-016-3252-8

[22] Chengbin Peng, Xiaogang Jin, Ka-Chun Wong, Meixia Shi, and Pietro Liò. 2012.Collective Human Mobility Pattern from Taxi Trips in Urban Area. PLOS ONE 7,4 (April 2012), e34487. https://doi.org/10.1371/journal.pone.0034487

[23] N. Rasiwasia and N. Vasconcelos. 2013. Latent Dirichlet Allocation Modelsfor Image Classification. IEEE Transactions on Pattern Analysis and Machine

Intelligence 35, 11 (Nov. 2013), 2665–2679. https://doi.org/10.1109/TPAMI.2013.69[24] Soora Rasouli and Harry Timmermans. 2014. Activity-based models of travel

demand: promises, progress and prospects. International Journal of Urban Sciences18, 1 (Jan. 2014), 31–60. https://doi.org/10.1080/12265934.2013.835118

[25] Lijun Sun and Kay W. Axhausen. 2016. Understanding urban mobility patternswith a probabilistic tensor factorization framework. Transportation Research PartB: Methodological 91 (Sept. 2016), 511–524. https://doi.org/10.1016/j.trb.2016.06.011

[26] Yee Whye Teh, Michael I. Jordan, Matthew J. Beal, and David M. Blei. 2006.Hierarchical Dirichlet Processes. J. Amer. Statist. Assoc. 101, 476 (Dec. 2006),1566–1581. https://doi.org/10.1198/016214506000000302

[27] Zhan Zhao, Haris N. Koutsopoulos, and Jinhua Zhao. 2018. Detecting patternchanges in individual travel behavior: A Bayesian approach. TransportationResearch Part B: Methodological 112 (June 2018), 73–88. https://doi.org/10.1016/j.trb.2018.03.017

[28] Zhan Zhao, Haris N. Koutsopoulos, and Jinhua Zhao. 2018. Individual mobilityprediction using transit smart card data. Transportation Research Part C: EmergingTechnologies 89 (April 2018), 19–34. https://doi.org/10.1016/j.trc.2018.01.022

[29] Zhan Zhao, Jinhua Zhao, and Haris N. Koutsopoulos. 2016. Individual-Level TripDetection using Sparse Call Detail Record Data based on Supervised StatisticalLearning. https://trid.trb.org/view.aspx?id=1393647

[30] J. Zheng, S. Liu, and L. M. Ni. 2014. Effective Mobile Context Pattern Discoveryvia Adapted Hierarchical Dirichlet Processes. In 2014 IEEE 15th InternationalConference on Mobile Data Management, Vol. 1. 146–155. https://doi.org/10.1109/MDM.2014.24