Opportunistic Discovery of Personal Places Using Multi-source Sensor Data

Sudip Vhaduri, Student Member, IEEE and Christian Poellabauer, Senior Member, IEEE

Abstract—Modern smartphones and wearables are able to continuously collect significant amounts of sensor data, where such data can be helpful to study a user’s mobility or social interaction patterns, but also to deliver various services based on a user’s presence at different places during certain times of the day. Therefore, it is important to accurately identify personal places of interest (POIs), such as a user’s workplace or home. Such places are usually determined using segmentation of location traces, but frequent gaps in the data (i.e., missing location readings) can result in a large number of small and incomplete segments that should actually be grouped together into a single large segment. This paper presents a segmentation approach that utilizes a user’s personal data obtained from multiple sensor sources and devices, such as the battery recharge behavior (measured on smartphones), step counts, and sleep patterns (measured by wearables), to opportunistically fill gaps in the user’s location traces. Using the data from a mobile crowd sensing study of more than 450 users over a 2-year period, we show that our approach is able to generate fewer, but more complete segments compared to the state of the art.

Index Terms—Mobile Crowd Sensing, Location Traces, Spatio-temporal Clustering, Opportunistic Sensing.

1 INTRODUCTION

For most location-aware services, it is important to not only know the actual geographic location of a user, but also the type of place and the significance of that place to an individual. Examples of such services include location-dependent delivery of directory services (e.g., finding the location of the nearest gas station), social networking services (e.g., finding nearby friends), to-do lists (e.g., a shopping list reminder when near a grocery store) [1], incentive programs (e.g., to encourage users to perform physical activity when near gyms or parks) [2], automatic selection of operational modes of devices (e.g., switching a phone into vibrate mode when entering a place for personal reflection) [3], and various proactive recommendation systems based on a user’s preferences, habits, or mobility patterns [4], [5], [6].

With recent advances in mobile technologies, we now have access to a wide range of sensors within a single device, including GPS, accelerometer, and gyroscope, which combined are capable of collecting tens or even hundreds of gigabytes of data over the course of a year. Many mobile crowd sensing (MCS) studies [7], which may involve hundreds or thousands of users, will therefore often collect a significant amount of data that will form the foundation for various studies on social behaviors, communication trends, etc. Fundamental to these studies is also a solid interpretation of the users’ locations and mobility habits, which requires a transformation of raw location traces into places of interest (POIs). These POIs are the places that people visit frequently or where people spend extensive periods of time in a day, such as their homes or workplaces [8]. Discovery of such places is essential to effective delivery of various location-aware services and recommendations, to promote services for improved health and well-being, to help avert crimes, and to assure public safety [7], to name just a few examples.

• The authors are with the Department of Computer Science and Engineering, University of Notre Dame, Notre Dame, IN, 46556. E-mail: {svhaduri,cpoellab}@nd.edu

In order to obtain such POIs, a user’s collected location traces can be grouped into different clusters (each representing a POI), using various spatio-temporal clustering approaches [5], [8]. However, previous work has shown that on average about 40–70% of the time the collected traces will suffer from extensive gaps [5], [9], often because of users’ presence in GPS-denied areas (e.g., indoors) or other situations that prevent data collection (e.g., a smartphone placed into a low-power state). Spatio-temporal clustering on such incomplete traces will often yield a large number of small clusters, where these small clusters should actually be grouped together to form a large cluster. This can be achieved by carefully augmenting the location traces to reduce or completely remove these data gaps, e.g., by using other sources of information to determine if two small clusters should be merged together or not. Researchers have proposed various methods for merging several small clusters into fewer larger clusters using various types of “secondary data”, e.g., using techniques such as fingerprint matching [10], [11] or location data sharing among co-located users [9], [12]. However, these approaches have various limitations, e.g., fingerprint maps may be incomplete or outdated and traces of co-located users (for location sharing) may not always be available.

In contrast, our approach focuses on using other sources of sensor data to augment the location traces and potentially merge clusters. For example, modern mobile devices can provide various types of data describing device status and usage, such as the current battery charge level, charging status, and screen/display mode. Various types of physiological data (e.g., heart rate and calorie burn) and behavioral data (e.g., different types of physical activity, step count, and sleep status) can also be obtained from different types of mobile sensors embedded in modern wearables (e.g., devices such as Fitbit, Apple Watch, and Microsoft Band) [8], [12], [13]. Therefore, the approach proposed in this paper augments a user’s location traces using several other types of data, specifically: step counts (wearable), sleep data (wearable), and battery charging status (smartphone). We claim that these types of data can be used to verify the location of a user during a gap between two potentially co-located clusters, i.e., indicating if the user is still at the same location (thereby allowing us to merge these two clusters) or if the user moved to a new location between these clusters (in which case these clusters cannot be merged).

Compared to our previously proposed approach [13], in this paper we also use physical activity data (i.e., step counts) due to its wider availability compared to sleep and battery charging data, which have a skewed spatio-temporal distribution (as shown in [13]). For a given user, we first use the location traces of the user to determine a series of stay point (sp) clusters using our previously proposed Enhanced Stay Point Clustering approach (described in [9]). We then extract the sleep and battery charging status as well as the step count data from the same user during gaps between consecutive sp-clusters and perform a cluster merging technique referred to as Opportunistic Stay Point Clustering, which is fully described in Section 3.3 of this paper. Further, in Section 5, we compare our proposed technique to state-of-the-art solutions using smartphone and Fitbit data collected from a mobile crowd sensing study (called “NetHealth” [14]), where data from over 450 subjects over a 2-year period were collected. Our analysis shows that the proposed approach reduces the cluster count on average by about 64% and increases the cluster time by about 15%, compared to previous solutions. We find that activity-assisted approaches decrease cluster count by 57%–64% as compared to a 4%–26% decrease in cluster count using approaches that do not consider physical activity. Similarly, activity-assisted approaches increase cluster times by 10%–15% as compared to 1%–4% by other approaches.

2 RELATED WORK

With the help of the Mobile Crowd Sensing paradigm [7], we are able to automatically collect a wide range of sensor data from a large group of subjects and their devices, such as smartphones and wearables, over an extended period of time. However, MCS also poses several challenges, including efficient task design and allocation, target participant selection, incentive mechanisms, security and privacy concerns, and ensuring the reliability, trustworthiness, and quality of the collected data. Many of these issues have been addressed in prior research [15], [16], [17], [18], [19], [20]. While early MCS studies have focused on a few very common types of sensor data, such as GPS traces and accelerometer readings, more recent efforts include many other types of sensor readings, such as physiological data (e.g., heart rate and calorie burn), behavioral data (e.g., sleep habits, battery recharge patterns, and different types of physical activities), social networking data, and systems information, including scans of Wi-Fi, Bluetooth, and cell radios. The focus of our work is to demonstrate how these types of sensor data can be utilized to ensure effective and accurate discovery of places of interest.

Previous efforts for the discovery of POIs are either based on “stationary radio emitters” or “actual location data.” The radio emitters-based place discovery approach is also known as fingerprint-based place identification [10], [11], where devices continuously scan for the presence of known stationary radio emitters (e.g., Wi-Fi routers or cell towers) to determine a place. On the other hand, the actual location-based place discovery approach is known as geometry-based place identification [1], [3], [6], [8], which relies on actual location data (e.g., GPS coordinates). These location data are then placed into clusters that represent places that a user frequently visits or where a user spends a significant amount of time [1], [3], [6], [8]. Geometry-based approaches can further be divided into two subcategories: (1) approaches that rely entirely on GPS data for localization [1], [5], [6], [21], [22], [23] and (2) approaches that use other sources of data, such as Wi-Fi routers, cell towers, or Bluetooth beacons in addition to GPS data [3], [8], [9], [12], [24], [25].

Geometry-based place discovery approaches inspect GPS traces chronologically, particularly identifying times when these traces are disrupted, i.e., there are gaps in the traces when the signal is lost (which very often happens in urban areas and indoors). A common approach to clustering is point-based clustering, e.g., the ComMotion algorithm [1] identifies a place as a position where the GPS signal has been lost three or more times within a given radius. However, this approach works well only for places such as homes that have smaller radii. In [6], researchers attempt to address the limitations of ComMotion by first registering a place as a candidate place when the signal is lost. Then, all candidate places are merged using a variant of k-means clustering, which further reduces the false positives found by the ComMotion approach. In another study [26], researchers use density-based DBSCAN to implement their version of point-based clustering.

A major limitation of geometry-based approaches is their reliance on only GPS signals for place discovery [1], [6], where these signals often have rather low accuracy due to the geometry of visible satellites and the loss of signal in urban areas. Possible approaches to improving the accuracy include the use of additional sources of data (such as radio beacons [24]) and the development of new place discovery approaches (such as temporal point clustering approaches [3]). In another study [5], researchers proposed a temporal-based clustering approach that relies on only GPS data. In their approach, they first form day-level stay point (sp) clusters from the continuous stream of time-stamped location points derived from GPS data. During this cluster formation, researchers use two thresholds: (1) a maximum distance threshold between any pair of points within the cluster, which is the maximum distance that a user can cover in a place and (2) a minimum time duration threshold, which is the minimum duration that the user needs to stay in the same cluster. Other researchers have also used similar spatio-temporal techniques that rely on both distance and time thresholds [27], [28].

However, a major limitation of previous approaches is that they do not consider the inter-sample time gaps, which can lead to unreliable and inaccurate place estimation due to missing data. Therefore, in another study [8], researchers introduced a new time constraint called the maximum time gap, which is used to limit the time between successive location points. This ensures that all consecutive location points of a stay point cluster are close in time. A concern with this approach is that it does not consider the first part of two consecutive time-separated data segments as a cluster, which could potentially form an additional stay point cluster. In [23], a similar approach is used, where the minimum time duration and the maximum time gap are identical. Further, they merge two clusters only based on time and distance thresholds and aggregate a location point with a cluster based on distance only.

Researchers have identified multiple ways to determine POIs, the most common one using GPS traces with the addition of some other form of information, such as analysis of stable and dense logical neighborhoods, bearing change of a user, speed of a user, and accuracy of the location points, among others [21], [22], [29], [30]. Additionally, researchers have also discovered POIs using scene analysis of photos and analysis of cellular traffic [25]. A limitation of these techniques is that they ignore gaps in the location traces, which can adversely affect the performance of place discovery. Previous work has also considered Wi-Fi-based place identification techniques as an alternative [31], but these approaches rely on fixed-location access points or routers and also require a mapping of geographical locations to the Wi-Fi access points to determine the actual location of a place. Additionally, Wi-Fi has a relatively large transmission range, which makes Wi-Fi-based approaches less reliable when deciding if a user remained at the same place or not. In [9], [12], we introduced a new approach, where we first use a hierarchy of 1-hop and 2-hop Bluetooth fingerprint matching techniques, followed by Wi-Fi fingerprint matching, in order to discover a group of co-located users of a particular user. Then we borrow the location points from the co-located users to fill the gaps of the target user. This approach is more reliable than previous approaches since it uses fingerprint matching only to find co-located users and then uses them to fill the gaps using borrowed location points, i.e., it does not blindly fill the gaps just based on fingerprint matching. Additionally, this approach is more accurate since fingerprint matching and gap filling are performed hierarchically, from short-range (≈ 10 meters) 1-hop Bluetooth fingerprint matching to long-range (≈ 100 meters) Wi-Fi fingerprint matching, i.e., we try to fill gaps with borrowed location points from the users that are co-located using Wi-Fi fingerprint matching only if the users that are co-located using 1-hop and 2-hop Bluetooth fingerprint matching cannot help to fill the gaps. The approach is also “dynamic” since it does not need to know the actual location mapping of static fingerprints such as Wi-Fi and it also relies on fingerprint matching using Bluetooth, which is very widely available on portable devices (compared to fixed Wi-Fi access points).

However, a shortcoming of our previous work is that it still depends on Wi-Fi and Bluetooth fingerprint matching when discovering the co-located users. That is, this cooperative opportunistic cluster merging approach does not work if there are either no co-located users or the fingerprint matching approach fails, which can happen due to a variety of reasons. For example, the same UUID (universally unique identifier) of a BLE (Bluetooth Low Energy) transmitter can be recorded differently by two co-located smartphones, making it difficult to find co-located users based on BLE UUID matching. However, recent advances in mobile technologies enable us to obtain additional contextual information that can be used to determine if an individual is at a specific place, even in the absence of direct location data. Therefore, the work in this paper builds upon [13], where the battery charging status of smartphones and sleep data recorded by wearables (such as Fitbits) have been used to fill the gaps in location traces. However, in this paper, we also consider users’ activity levels, i.e., step counts, which are commonly recorded by devices such as Fitbits. This information, combined with the sleep and battery charging data, allows us to more successfully merge successive co-located sp-clusters, as demonstrated in this paper.

3 OPPORTUNISTIC CLUSTERING

This section describes the main shortcomings and challenges of prior work on location data clustering and proposes a new clustering approach that addresses these shortcomings.

3.1 Preliminaries

We first define the terminology used in the remainder of this paper before describing our proposed approach:

• A location point (lp) is a location provided by one or more of the device’s location sensors, typically expressed as $l = \{\lambda, \vartheta, T\}$ [8], where $\lambda$ and $\vartheta$ are the latitude and longitude of the lp, and $T$ is the timestamp when the location was recorded.

• A stay point (sp) is a cluster of continuous time-stamped location points from the same place [8], typically expressed as $S = \{\lambda, \vartheta, T_{start}, T_{end}\}$. The $\lambda$ and $\vartheta$ pair represents the latitude and longitude of the centroid ($C$), with $S.\lambda = \sum_{\forall l_k \in S} (l_k.\lambda) / |S|$ and $S.\vartheta = \sum_{\forall l_k \in S} (l_k.\vartheta) / |S|$, where $|S|$ is the cardinality of $S$, i.e., the number of lps in $S$. $T_{start}$ and $T_{end}$ are the entrance-time and departure-time, i.e., the timestamps of the first and last lps in the cluster. An sp represents a user’s presence at a geographic region for a significant amount of time during a day.

• An opportunistic sp-cluster is an extended type of sp-cluster (typically larger than a regular sp-cluster obtained using existing clustering approaches) that is discovered using our proposed approach.

Additionally, there are three thresholds: (1) $T_{min}$, which represents the minimum duration of an sp-cluster, (2) $T_{max}$, which represents the maximum allowed time gap between a pair of successive location points of the same cluster, and (3) $D_{max}$, which represents the maximum allowed distance for a location point from the first location point of a cluster. All experiments are performed based on a set of time-stamped location traces $L = \{\forall l_k \mid k \in [1, N]\}$ with $l_1.T < l_2.T < \dots < l_N.T$, where $l_k.T$ is the timestamp of an lp $l_k$.
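The definitions above can be made concrete with a short Python sketch. It is only an illustration of the lp and sp records and of the centroid computation; the class names, field names, and the choice of seconds as the time unit are assumptions made here and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class LocationPoint:
    """A location point l = {lambda, theta, T}."""
    lat: float  # latitude (lambda), in degrees
    lon: float  # longitude (theta), in degrees
    t: float    # timestamp T, assumed here to be in seconds


@dataclass(frozen=True)
class StayPoint:
    """A stay point S = {lambda, theta, T_start, T_end}."""
    lat: float      # centroid latitude
    lon: float      # centroid longitude
    t_start: float  # entrance time (timestamp of the first lp)
    t_end: float    # departure time (timestamp of the last lp)


def make_stay_point(lps: List[LocationPoint]) -> StayPoint:
    """Build a stay point from lps that are assumed to be sorted by time."""
    if not lps:
        raise ValueError("a stay point needs at least one location point")
    n = len(lps)
    return StayPoint(
        lat=sum(p.lat for p in lps) / n,  # S.lambda = sum(l_k.lambda) / |S|
        lon=sum(p.lon for p in lps) / n,  # S.theta  = sum(l_k.theta) / |S|
        t_start=lps[0].t,
        t_end=lps[-1].t,
    )
```

For example, two lps at (41.0, -86.0) and (41.2, -86.2) recorded 600 seconds apart yield a stay point whose centroid is their arithmetic mean and whose duration is 600 seconds.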

Fig. 1. Secondary sleep and battery charge sessions that overlap with the gap between a pair of closely located sp-clusters, i.e., $S_1 = \{\forall l_k \mid k \in [i, j-1]\}$ and $S_2 = \{\forall l_k \mid k \in [j, N]\}$. The two sp-clusters are represented by the time intervals $[t_1, t_2]$ and $[t_3, t_4]$, respectively, and their lps are represented by cross (x) markers. Start and end times of a secondary session are marked by $t'_1$ and $t'_2$. The maroon solid horizontal double-sided arrow lines show the gaps that remain undecided after using secondary sleep and battery charge sessions.

3.2 Prior Work

3.2.1 Regular Stay Point Clustering (RSPC)

Prior work has used the $D_{max}$ and $T_{min}$ thresholds to define a day-level sp-cluster of an individual using the following two constraints [5], [8]:

$$\Delta d(l_i, l_N) < D_{max} \tag{1}$$

$$\Delta t(l_i.T, l_N.T) > T_{min} \tag{2}$$

where $l_i$ and $l_N$ are the start and end location points from a continuous stream $L$ of location traces and $\Delta d(l_i, l_N)$ and $\Delta t(l_i.T, l_N.T)$ are the two functions used to compute the haversine spherical distance and time difference between the two location points. Researchers later introduced another threshold, i.e., $T_{max}$, and a new constraint in order to ensure reliable clustering during data gaps [8]:

$$\Delta t(l_{j-1}.T, l_j.T) < T_{max} \tag{3}$$

where $l_j$ and $l_{j-1}$ are two successive location points with $i < j \leq N$ and $l_{j-1}.T < l_j.T$. In our previous work [9], [12], [13], we referred to this existing approach as the Regular Stay Point Clustering or RSPC approach.
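As a rough illustration of Constraints 1 to 3, the following Python sketch computes the haversine distance and checks the three constraints for a candidate segment. It is a minimal sketch under stated assumptions: the function names, the (lat, lon, t) tuple layout, and the Earth radius constant are choices made here, and the distance test is applied to every point relative to the first point of the segment (following the definition of the distance threshold), which is stricter than testing the end point only.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in meters (assumed constant)


def haversine_m(lat1, lon1, lat2, lon2):
    """Haversine (great-circle) distance in meters between two coordinates."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def satisfies_constraints(points, d_max, t_min, t_max):
    """Check Constraints 1-3 for a time-sorted list of (lat, lon, t) tuples."""
    if not points:
        return False
    first, last = points[0], points[-1]
    # Constraint 1: every point stays within d_max of the first point.
    c1 = all(haversine_m(first[0], first[1], p[0], p[1]) < d_max for p in points)
    # Constraint 2: the segment lasts longer than t_min.
    c2 = (last[2] - first[2]) > t_min
    # Constraint 3: successive points are less than t_max apart.
    c3 = all(b[2] - a[2] < t_max for a, b in zip(points, points[1:]))
    return c1 and c2 and c3
```

With hypothetical thresholds of 50 m, 600 s, and 400 s, a stationary segment sampled at 0, 200, 400, and 700 s satisfies all three constraints, whereas the same segment sampled at 0, 200, and 900 s fails Constraint 3 because of the 700 s gap.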

While the introduction of the upper bound time constraint on successive samples (Constraint 3) improves the reliability, RSPC does not consider the part of the location traces, i.e., $l_i$ to $l_{j-1}$, that appears just before the missing segment (between $l_{j-1}$ and $l_j$), and the clustering process is re-initiated from $l_j$. However, the part of the data stream that exists right before the missing segment, i.e., $l_i$ to $l_{j-1}$, already satisfies Constraint 1, i.e., $\Delta d(l_i, l_{j-1}) < D_{max}$. Therefore, the segment $l_i$ to $l_{j-1}$ could potentially form a cluster if it also meets Constraint 2, i.e., $\Delta t(l_i.T, l_{j-1}.T) > T_{min}$. Since RSPC ignores these additional clusters, the accuracy of the temporal and spatial analysis may suffer. In Figure 1, we describe a scenario where we have a continuous stream of location traces $L = \{\forall l_k \mid k \in [i, N]\}$ and we assume that Constraint 1 and Constraint 2 have been satisfied for $L$. However, for a pair of consecutive lps, i.e., $l_{j-1}$ and $l_j$, Constraint 3 has been violated, which splits the entire $L$ into two parts, i.e., $S_1 = \{\forall l_k \mid k \in [i, j-1]\}$ and $S_2 = \{\forall l_k \mid k \in [j, N]\}$. The RSPC approach will ignore $S_1$ and restart the clustering process from $l_j$. However, $S_1$ could form a cluster since the two constraints, i.e., Constraint 1 and Constraint 2, are satisfied. Furthermore, $S_1$ and $S_2$ can together form a larger cluster if all lps in $L$ satisfy Constraint 1 and Constraint 2, and we can reliably identify the user’s presence in the same place during the gap using some additional opportunistic data, which we will discuss in Section 3.3.

Fig. 2. A case where the step counts during the gap between a pair of closely located sp-clusters (i.e., $S_1$ and $S_2$) lead to cluster merging. Each activity window $W(i)$, where $i \in \{1, \dots, K\}$, is composed of step counts that are represented by circles. The green and red solid circles are step counts that fall within and outside $R_{S_1S_2}$, respectively. In this example, all windows are classified as proof of presence.

3.2.2 Enhanced Stay Point Clustering (ESPC)

In our previous work [9], [12], we modified the regular stay point clustering approach to capture additional clusters that RSPC does not consider when the gap between successive lps exceeds $T_{max}$. This ESPC (Enhanced Stay Point Clustering) approach considers $S_1$ in addition to $S_2$ in Figure 1. However, ESPC uses the same set of threshold values used in the RSPC approach. In the rest of this paper, we refer to the clusters formed by ESPC as enhanced sp-clusters.
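To illustrate the difference between RSPC and ESPC, the following Python sketch segments a time-sorted trace and either keeps or drops the segment that precedes a data gap. It is a simplified illustration and not the implementation from [9], [12]; the function name `stay_points`, the `enhanced` flag, the (lat, lon, t) tuple layout, and the threshold values in the usage note are hypothetical.

```python
import math


def haversine_m(p, q):
    """Great-circle distance in meters between two (lat, lon, t) tuples."""
    phi1, phi2 = math.radians(p[0]), math.radians(q[0])
    dphi = phi2 - phi1
    dlmb = math.radians(q[1] - p[1])
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * 6_371_000.0 * math.asin(math.sqrt(a))


def stay_points(points, d_max, t_min, t_max, enhanced=True):
    """Return (t_start, t_end) intervals of sp-clusters in a sorted trace.

    enhanced=False mimics RSPC: a segment cut short by a data gap is dropped.
    enhanced=True mimics ESPC: that segment is kept if it lasts longer than t_min.
    """
    clusters = []
    if not points:
        return clusters
    start = 0  # index of the first lp of the current candidate segment
    for j in range(1, len(points) + 1):
        at_end = j == len(points)
        # Constraint 3 violated between lp j-1 and lp j (data gap).
        gap = (not at_end) and (points[j][2] - points[j - 1][2] >= t_max)
        # Constraint 1 violated: lp j is too far from the first lp.
        moved = (not at_end) and (haversine_m(points[start], points[j]) >= d_max)
        if at_end or gap or moved:
            duration = points[j - 1][2] - points[start][2]
            dropped_by_rspc = gap and not enhanced
            if duration > t_min and not dropped_by_rspc:  # Constraint 2
                clusters.append((points[start][2], points[j - 1][2]))
            start = j  # restart clustering from lp j
    return clusters
```

For a stationary trace sampled at 0, 300, 600, 900, 5000, 5300, 5600, and 5900 s with hypothetical thresholds of 100 m, 800 s, and 600 s, the enhanced variant returns two clusters, (0, 900) and (5000, 5900), while the regular variant returns only the second one.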

3.3 Opportunistic Stay Point Clustering

In our proposed approach, we first find a pair of successive enhanced sp-clusters from the same place for a target user and then try to reliably determine the user’s presence in the same place during the data gap between the cluster pair using additional sensor data, which we refer to as secondary data in the remainder of this paper. The goal is to merge the cluster pair if the secondary data provides evidence that the user remained in the same place during the gap. Our focus in this work is on three specific types of secondary data: (1) sleep periods, which we refer to as sleep sessions, (2) phone battery recharge periods, which we refer to as battery charge sessions, and (3) physical activity in terms of step counts. We obtain these data from different sources, e.g., sleep sessions and activity information can be obtained from fitness trackers such as Fitbits, while battery charge sessions can be obtained from smartphones. Note that sometimes some of these data types may not be available due to several reasons, e.g., a user may forget to wear the tracker or leave the phone behind when leaving home. However, by using different types of behavioral data from different data sources, we can increase the likelihood of being able to determine a user’s location during cluster gaps. Based on different combinations of the data, we distinguish between five different types of opportunistic clustering strategies¹:

• Sleep-Assisted Stay Point Clustering (SA-SPC)
• Battery-Assisted Stay Point Clustering (BA-SPC)
• Sleep- and Battery-Assisted Stay Point Clustering (SBA-SPC)
• Activity-Assisted Stay Point Clustering (AA-SPC)
• Sleep-, Battery-, and Activity-Assisted Stay Point Clustering (SBAA-SPC)

¹ https://github.com/svhaduri/SBAA-SPC

Fig. 3. A case where the step counts during the gap between a pair of closely located sp-clusters (i.e., $S_1$ and $S_2$) do not lead to cluster merging. Each activity window $W(i)$, where $i \in \{1, \dots, K\}$, is composed of step counts that are represented by circles. The green and red solid circles are step counts that fall within and outside $R_{S_1S_2}$, respectively. Window $W(1)$ is classified as proof of presence, but $W(2)$ is not classified as proof of presence. The test terminates immediately after $W(2)$ and the remaining step counts (marked as empty circles) in the remainder of the gap are not tested.

Figure 1 demonstrates four basic cases, i.e., case-1 to case-4, where the sleep and battery charge sessions overlap either partially (case-2 to case-4) or completely (case-1) with the gap between the two sp-clusters. After filling the gap between a pair of sp-clusters with secondary sleep and/or battery charging data, if the remaining gaps are smaller than $T_{max}$, we can merge the two sp-clusters.

In activity-assisted cluster merging, we need to obtain a proof of presence (i.e., a clear indicator that the individual has not moved to a new location between two sp-clusters) in order to determine if two clusters can be merged. This approach relies on minute-level aggregated step counts (obtained from wearable devices) as a measure of a user’s activity during the gap to approximate a proof of presence (Figure 2) or absence (Figure 3).

For a pair of consecutive sp-clusters, we first determine the range $R_{S_1S_2}$ of step counts within the cluster pair. The range is defined by the average and standard deviation of the step counts, i.e., $[\mu - \sigma, \mu + \sigma]$, and expresses the typical step count for the individual during the two clusters to be merged. Next, the activity data during the gap is split into windows of fixed duration. For each window, we try to detect whether a user was present in the same place or moved to a different place. A window is classified as proof of presence if the majority of the step counts fall within $R_{S_1S_2}$, i.e., the individual’s level of activity during the gap window is similar to the activity during the clusters.
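The range computation and the per-window test described above can be sketched in a few lines of Python. This is an illustrative sketch only: the function names are hypothetical, the population standard deviation is an assumption (the text does not state which estimator is used), and a tie between samples inside and outside the range is counted as presence.

```python
from statistics import mean, pstdev


def step_range(cluster_steps):
    """Return [mu - sigma, mu + sigma] for step counts recorded during the two clusters."""
    mu = mean(cluster_steps)
    sigma = pstdev(cluster_steps)  # population SD; the estimator is an assumption
    return mu - sigma, mu + sigma


def is_proof_of_presence(window_steps, lo, hi):
    """True if at least half of the step counts in a gap window fall within [lo, hi]."""
    if not window_steps:
        raise ValueError("an empty window is undecided and must be handled by the caller")
    inside = sum(lo <= s <= hi for s in window_steps)
    return inside >= len(window_steps) - inside
```

For instance, in-cluster step counts of 0, 0, 2, 4, and 4 give a range of roughly [0.21, 3.79]. A gap window with counts 1, 2, 3, and 50 is then classified as proof of presence (three of four samples inside the range), while a window with counts 40, 50, and 2 is not.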

Figure 2 demonstrates the case where every single window is classified as proof of presence. Therefore, the cluster pair can be merged to form a larger cluster. However, Figure 3 demonstrates a case where we fail to classify the second window as proof of presence. Therefore, the merging process terminates immediately after $W(2)$ and all further windows after $W(2)$ are ignored. This window-based approach allows us to terminate the merging process as soon as the first window violating the proof of presence requirement is detected. This early termination is particularly beneficial when large gaps are analyzed.

     1: procedure SBA-SPC-MERGEABLE(I, S1, S2)
     2:   t1 ← S1.Tstart, t2 ← S1.Tend
     3:   t3 ← S2.Tstart, t4 ← S2.Tend
     4:   C1 ← FindCentroid(S1), C2 ← FindCentroid(S2)
     5:   if ∆d(C1, C2) ≤ Dmax then
     6:     for all [t′1, t′2] ∈ I do
     7:       if t′1 ≤ t2 and t′2 ≥ t3 then            ▷ Case-1
     8:         return 1
     9:       else if t′1 > t2 and t′2 < t3 then       ▷ Case-2
    10:         if ∆t(t′1, t2) ≤ Tmax and ∆t(t3, t′2) ≤ Tmax then
    11:           return 1
    12:         end if
    13:       else if t′2 > t2 and t′2 < t3 then       ▷ Case-3
    14:         if ∆t(t3, t′2) ≤ Tmax then
    15:           return 1
    16:         end if
    17:       else if t′1 > t2 and t′1 < t3 then       ▷ Case-4
    18:         if ∆t(t′1, t2) ≤ Tmax then
    19:           return 1
    20:         end if
    21:       end if
    22:     end for
    23:   end if
    24:   return 0
    25: end procedure

Fig. 4. Opportunistic Cluster Merging Algorithm using secondary sleep and/or battery charge sessions. $I \leftarrow \{\forall [t'_{1k}, t'_{2k}] \mid k \in [1, N]\}$ is the set of all secondary data and $N$ is the total number of sleep and battery charge sessions. FindCentroid($S_1$) is used to compute the centroid using the average $\lambda$ and $\vartheta$ values of all lps that compose $S_1$.

Formally, the proposed approach is expressed as follows:

$$M(W) = \begin{cases} 1 & \text{if } POP(W) = 1 \quad \text{(4a)} \\ 0 & \text{if } POA(W) = 1 \quad \text{(4b)} \\ 0 & \text{if } POU(W) = 1 \quad \text{(4c)} \end{cases}$$

Case 4a (proof of presence): $POP(W) = 1$, if $\forall i \in \{1, 2, \dots, K\}, W(i) \subseteq W, \exists W_p(i) \subseteq W(i), \forall a \in W_p(i) : a \in R_{S_1S_2} \wedge \frac{|W_p(i)|}{|W(i)|} \geq 0.5$.

Case 4b (proof of absence): $POA(W) = 1$, if $\exists i \in \{1, 2, \dots, K\}, W(i) \subseteq W, \exists W_p(i) \subseteq W(i), \forall a \in W_p(i) : a \notin R_{S_1S_2} \wedge \frac{|W_p(i)|}{|W(i)|} \geq 0.5$.

Case 4c (proof of undecided): $POU(W) = 1$, if $\exists i \in \{1, 2, \dots, K\}, W(i) \subseteq W : W(i) = \emptyset$.

Here, $M(W)$ is the decision whether the set $W$ of $K$ activity windows can fill the gap and merge the two clusters. $W(i)$ is a single window, $W_p(i)$ is a part of $W(i)$, and $a$ is an activity sample of $W_p(i)$. $|W_p(i)|$ and $|W(i)|$ are the cardinalities of $W_p(i)$ and $W(i)$, respectively.

The details of the proposed sleep- and battery-assisted clustering approaches are as follows. First, we generate enhanced sp-clusters using the same $T_{min}$, $T_{max}$, and $D_{max}$ values used for the ESPC approach (Section 3.2.2). Then, for each pair of sp-clusters, we decide whether they can be merged using secondary sleep and battery charging information as described in the algorithm shown in Figure 4.

     1: procedure AA-SPC-MERGEABLE(A, w, S1, S2)
     2:   t1 ← S1.Tstart, t2 ← S1.Tend
     3:   t3 ← S2.Tstart, t4 ← S2.Tend
     4:   C1 ← FindCentroid(S1), C2 ← FindCentroid(S2)
     5:   if ∆d(C1, C2) ≤ Dmax then
     6:     t ← t2
     7:     while t < t3 do
     8:       win.Tstart ← t, win.Tend ← t + w
     9:       if win.Tend > t3 then                    ▷ testing last window
    10:         if w ≤ Tmax then
    11:           break                                ▷ acceptable gap
    12:         else
    13:           win.Tend ← t3
    14:         end if
    15:       end if
    16:       A′ ← GetStepCounts(win.Tstart, win.Tend)
    17:       if A′ = ∅ then                           ▷ undecided
    18:         return 0
    19:       end if
    20:       µ ← ComputeMean(A, t1, t2, t3, t4)
    21:       σ ← ComputeSD(A, t1, t2, t3, t4)
    22:       lm ← FindMajorityLabel(A′, µ, σ)
    23:       if lm = 1 then                           ▷ proof of presence
    24:         t ← win.Tend
    25:       else                                     ▷ proof of absence
    26:         return 0
    27:       end if
    28:     end while
    29:     return 1
    30:   end if
    31:   return 0
    32: end procedure

Fig. 5. Algorithm to decide whether it is reliable to merge two successive clusters based on activity (i.e., step count) data. $A \leftarrow \{\forall \langle a'_k, t'_k \rangle \mid k \in [1, N_1]\}$ is the set of all activity (i.e., step count) values along with their associated timestamps from a specific subject, $N_1$ is the total number of activity samples, $w$ is the duration of each activity window, and $\mu$ and $\sigma$ are computed based on the activity values recorded during $[t_1, t_2]$ and $[t_3, t_4]$.

For each consecutive pair of sp-clusters, we first inspect whether the clusters are from the same place. To check this, we verify whether the distance of the second cluster’s centroid to the first cluster’s centroid does not exceed $D_{max}$ (line 5 in Figure 4). Once this is confirmed, we next inspect whether the sleep or battery charge session can fill the gap between the two clusters either completely (line 7 in Figure 4) or partially (lines 9, 13, and 17 in Figure 4). In the case of partial overlap, we need to compare the remaining gap times with $T_{max}$ to decide if the clusters are mergeable or not (lines 10, 14, and 18 in Figure 4). After merging, clusters are further evaluated to see if their end points (i.e., start and end times) can be modified with the help of secondary sessions. If the start time of the sleep or battery charge session is before the start time of the newly created cluster, the start time of the cluster will be adjusted to the start time of the session. Similarly, the end time of the cluster can also be adjusted.
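The four overlap cases of Figure 4 translate directly into code. The following Python sketch mirrors lines 6 to 24 of the algorithm under the assumption that the centroid distance test (line 5) has already been performed by the caller; the function name, the (start, end) tuple layout for sessions, and the use of seconds are hypothetical choices made for illustration.

```python
from typing import Iterable, Tuple

Session = Tuple[float, float]  # (start, end) of a sleep or battery charge session


def sessions_fill_gap(t2: float, t3: float,
                      sessions: Iterable[Session], t_max: float) -> bool:
    """Decide whether a secondary session bridges the gap (t2, t3).

    t2 is the end of the first cluster and t3 the start of the second one.
    Assumes the caller has already verified that the centroids are close.
    """
    for s_start, s_end in sessions:
        # Case-1: the session covers the whole gap.
        if s_start <= t2 and s_end >= t3:
            return True
        # Case-2: the session lies strictly inside the gap.
        if s_start > t2 and s_end < t3:
            if s_start - t2 <= t_max and t3 - s_end <= t_max:
                return True
        # Case-3: the session starts before the gap and ends inside it.
        elif t2 < s_end < t3:
            if t3 - s_end <= t_max:
                return True
        # Case-4: the session starts inside the gap and ends after it.
        elif t2 < s_start < t3:
            if s_start - t2 <= t_max:
                return True
    return False
```

As in the figure, each session is tested on its own, so two short sessions that only jointly cover the gap would not trigger a merge in this sketch.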

Next, we present the details of our activity-assisted staypoint clustering (AA-SPC) algorithm in Figures 5 and 6.For each pair of sp-clusters, we look at the step counts

1: procedure FINDMAJORITYLABEL(A′, µ, σ)2: Count1 ← 0, Count0 ← 03: for all a′ ∈ A′ do4: if a′ ≥ µ− σ and a′ ≤ µ+ σ then . within band5: Count1 ← Count1 + 16: else7: Count0 ← Count0 + 18: end if9: end for

10: if Count1 ≥ Count0 then . find majority11: return 112: end if13: return 014: end procedure

Fig. 6. Decide whether a majority of activity samples fall within the [µ−σ, µ+ σ] range.

during the clusters and during the gap between them to determine whether the clusters can be merged (Figure 5). As mentioned previously, we divide the entire gap into smaller windows of duration w and analyze a subject’s step counts during these windows. We start forming windows from the end of the first cluster (i.e., t2, line 6 in Figure 5) and proceed towards the beginning of the second cluster (i.e., t3, line 7 in Figure 5). For each valid window (i.e., not empty), we first check whether the majority of the step counts fall within RS1S2 (lines 20–23 in Figure 5). We continue the process until we either find an empty window (lines 17–18), fail to classify a window as proof of presence (lines 25–26 in Figure 5), or reach the last window, i.e., the window that reaches t3 (line 9 in Figure 5). We can merge the clusters only if all windows are classified as proof of presence (Case 4a). If the end of the last window exceeds the start of the second cluster (line 9 in Figure 5), we have two possible options depending on the window duration w. If w is not longer than the maximum allowed data gap Tmax, then we can consider the window as part of the same clusters (lines 10–11 in Figure 5); otherwise, we need to reset the end of the last window to the beginning of the second cluster (i.e., t3, line 13 in Figure 5) and perform the window-level activity analysis.
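The window-level analysis can be illustrated with a short sketch. It follows one reading of the description above and is not the listing in Figure 5: the function names, the `(timestamp, steps)` sample layout, the half-open window boundaries, and the use of the population standard deviation for σ are assumptions.

```python
from statistics import mean, pstdev


def find_majority_label(window, mu, sigma):
    """Return 1 if at least half of the samples lie in [mu - sigma, mu + sigma]."""
    inside = sum(1 for a in window if mu - sigma <= a <= mu + sigma)
    return 1 if inside >= len(window) - inside else 0


def can_merge_by_activity(samples, t1, t2, t3, t4, w, t_max):
    """Decide whether clusters [t1, t2] and [t3, t4] can be merged.

    samples: list of (timestamp, step_count) pairs for one subject.
    w:       window duration; t_max: maximum allowed data gap (same time unit).
    """
    # Reference band from the step counts recorded during both clusters.
    ref = [a for t, a in samples if t1 <= t <= t2 or t3 <= t <= t4]
    if not ref:
        return False
    mu, sigma = mean(ref), pstdev(ref)   # assumption: population std. deviation

    start = t2
    while start < t3:
        end = start + w
        if end > t3:
            # Last window overshoots the second cluster.
            if w <= t_max:
                return True              # residual gap is tolerated as-is
            end = t3                     # otherwise clip and analyse it
        window = [a for t, a in samples if start <= t < end and t2 < t < t3]
        if not window:
            return False                 # empty window: no proof of presence
        if find_majority_label(window, mu, sigma) == 0:
            return False                 # activity differs from the clusters
        start = end
    return True
```

In this sketch a single failing or empty window stops the merge, which matches the requirement that all windows be classified as proof of presence.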

4 DATA COLLECTION AND USEFULNESS TEST

Before evaluating the performance of different opportunistic approaches, we first need to obtain a validation dataset and process this dataset. Then we need to test the usefulness of the collected data and the feasibility of our approaches.

4.1 NetHealth Study Dataset

The NetHealth study [9], [13], [14], [32], [33], [34], [35] began at the University of Notre Dame in 2015 with the purpose of investigating the impacts of “always-on connectivity” on the health habits, emotional wellness, and social ties of college students over a multi-year period. In this work, we analyzed smartphone and Fitbit data from a set of 467 iPhone users residing in on-campus dormitories during the academic year. Subjects were recruited from the 2015 freshman class and were instructed to continuously wear a Fitbit Charge HR device that was provided to them. Further, they were given a data collection app for their smartphones. Both the Fitbit and the smartphone app collect data 24 hours a day.

The data collected by the smartphone app includes identifiers of the device’s network connections (Wi-Fi, cellular), proximity to other users or devices using BLE scanning, device state (e.g., battery charge level and charging state), application usage, device usage, geographic location, and user communications (e.g., phone calls). The data collected by the Fitbit device includes heart rate, calorie burn, physical activity level, step count, sleep status, and self-recorded activity labels. All collected sensor data are first transmitted to a remote check-in server during a nightly upload, and then stored on a secure database server for use. Table 1 summarizes the part of the NetHealth dataset used in this work. We further conducted a survey to assess phone recharging behavior, e.g., users indicated whether they typically use a wall charger, a USB charger, or a portable charger for their smartphones.

4.2 Data Pre-Processing

As described previously, to fill the data gaps in the location traces obtained from the smartphones, we use sleep sessions and minute-by-minute step counts obtained from Fitbit devices, and battery charge sessions obtained from the smartphones. However, before the step count, sleep, and battery data can be used in our algorithm, they have to be processed as described below.

4.2.1 Fitbit Sleep Sessions

A Fitbit expresses each sleep session with a start time, a duration (in milliseconds), and an end date (i.e., year, month, and day). Since sleep sessions often begin during one day and end on the next, the end date and duration are used to determine whether the sleep session began on the previous day, and the start time is combined with the duration to obtain the session end time. That is, a session is then expressed as a 2-tuple (start-date-time, end-date-time).
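The conversion into a (start-date-time, end-date-time) tuple can be sketched as below. The function name and the argument types are assumptions; the actual field formats in the Fitbit export are not specified here.

```python
from datetime import date, datetime, time, timedelta


def sleep_session_bounds(start_time: time, duration_ms: int, end_date: date):
    """Return (start_datetime, end_datetime) for one sleep session.

    The start date is not given directly, so the session is first placed on
    the end date and then shifted back by however many days it overshoots.
    """
    start = datetime.combine(end_date, start_time)
    end = start + timedelta(milliseconds=duration_ms)
    overshoot_days = (end.date() - end_date).days
    if overshoot_days > 0:               # session began on an earlier day
        shift = timedelta(days=overshoot_days)
        start, end = start - shift, end - shift
    return start, end
```

For example, a session with start time 23:30, a duration of eight hours, and an end date of 2 March is placed from 23:30 on 1 March to 07:30 on 2 March.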

4.2.2 Battery Charge Sessions

The smartphones provide three types of callback labels to describe the battery status: (1) “charging” (the phone is plugged in), (2) “full” (the phone’s battery level has reached 100%), and (3) “unplug” (the phone has been disconnected from the charger). We distinguish between two types of charging sessions:

TABLE 1
Summary of sensor data types

Device   Sensor/Logger     Data Type                        Sampling Period
iPhone   Location          latitude, longitude              165 s
iPhone   Battery Status    charging, full, unplug           Callback
iPhone   Battery Level     0 – 100%                         Callback
Fitbit   Accelerometer     step counts                      60 s
Fitbit   Sleep Monitor     start time, end date, duration   Callback
Fitbit   SmartTrack [36]   activity type, duration          Callback

• Type-1 (Battery partially charged): consists of (charging, unplug) sessions.

• Type-2 (Battery fully charged): consists of (charging, full, unplug) sessions.

Data collection is imperfect and often leads to missing labels. To be able to describe a charging session, we need an initial “charging” label, followed by either an “unplug” or “full” label. That is, for Type-2 sessions, if the “unplug” label is missing, we can still use the (charging, full) pair to describe a charging session. Therefore, we subdivide Type-2 further into:

• Type-2a (Battery overcharged): consists of (charging, full, unplug) sessions.

• Type-2b (Battery fully charged, but not overcharged): consists of (charging, full) sessions. This case mostly happens due to missing “unplug” labels.

When a device is unplugged, this is typically followed by a series of battery charge level labels indicating the drop of the charge from 100% in 1% decrements. For Type-2b sessions, we can then approximate the missing “unplug” label by the 99% charge level label (unless this label is missing too), promoting these Type-2b sessions to Type-2a sessions.
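The session types above can be derived from the label stream with a small state machine. The sketch below is illustrative only: the event layout `(timestamp, kind, value)`, the `"level"` kind for battery level callbacks, and the function name are assumptions, not the format used by the data collection app.

```python
def extract_charge_sessions(events):
    """Group battery callbacks into (type, start, end) charging sessions.

    events: time-ordered (timestamp, kind, value) tuples, where kind is
    "charging", "full", "unplug", or "level" (value = charge level in %).
    """
    sessions = []
    start = full_at = None
    for t, kind, value in events:
        if kind == "charging":
            # A pending (charging, full) pair without "unplug" stays Type-2b.
            if start is not None and full_at is not None:
                sessions.append(("2b", start, full_at))
            start, full_at = t, None
        elif kind == "full" and start is not None:
            full_at = t
        elif kind == "unplug" and start is not None:
            sessions.append(("2a" if full_at is not None else "1", start, t))
            start = full_at = None
        elif kind == "level" and value == 99 and full_at is not None:
            # The 99% label approximates a missing "unplug": promote to Type-2a.
            sessions.append(("2a", start, t))
            start = full_at = None
    if start is not None and full_at is not None:
        sessions.append(("2b", start, full_at))
    return sessions
```

A "charging" label that is followed by neither "full" nor "unplug" yields no session in this sketch, since such a session cannot be described.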

4.2.3 Validity of Fitbit Activity Data

A Fitbit device records heart rate data only when the device is actually worn. However, the device continues to record activity data even if it is not worn. Activity data recorded during these periods do not reflect a user’s actual activity level; we therefore consider these periods invalid and remove them from our dataset.

4.3 Applicability and Usefulness Analysis

Before we evaluate our proposed approach, we first identify the types of sp-clusters that can actually benefit from the three secondary data sources by analyzing the spatio-temporal distribution of those data sources. Then, we determine their importance using statistical analysis.

4.3.1 Spatio-Temporal Distribution of Secondary Data

First, we map the sleep and battery charge sessions onto various on-campus places to identify the clusters that could utilize these sessions for merging. In order to obtain the spatial distribution of the sessions, we look at the place labels obtained from the location traces that were recorded during those sessions using the “place mapping” approach described in [13]. We found that about 94% of the battery charge sessions and about 98% of the sleep sessions are located inside dormitories. Furthermore, about 5% of the battery charge sessions are located inside classrooms. While dormitories and classrooms are places where students spend a significant amount of their time on campus, they are also the places that often suffer most from either missing or inaccurate location readings (due to the lack of GPS signals indoors), which indicates that the sleep and battery charge sessions can be useful to fill the gaps in the location traces at those places.


Fig. 7. Probability Density Function (PDF) of sleep and battery charge session counts over the 24 hours of a day.

TABLE 2
Paired T-test results when comparing step counts within different places and walk sessions

Comparison                    df    t-statistic   p-value
Dorms vs. Walk sessions       341   -107.73       < .001
Classes vs. Walk sessions     341   -102.88       < .001
Athletics vs. Walk sessions   341   -50.93        < .001
Dining vs. Walk sessions      341   -90.28        < .001

Next, we investigate the temporal distribution of the sleep and battery charge sessions, shown in Figure 7. Battery charge sessions are distributed more uniformly over the 24 hours of a day than sleep sessions, which occur mostly at night. Therefore, the battery charge sessions have a higher availability than sleep sessions and can be helpful in cases when sleep sessions are not available. However, sleep sessions are usually longer than the battery charge sessions. Therefore, sleep sessions can contribute more to cluster time when merging pairs of sp-clusters compared to battery charge sessions. Compared to sleep and battery charge sessions, minute-level activity data is available across all places and times of day as long as the wearable is worn. Therefore, the activity data is expected to improve all types of spatio-temporal clusters.

4.3.2 Usefulness Test of Secondary Data

We further investigate whether step counts can improve the clusters. We examine whether the step counts from the four most frequently visited places (i.e., dormitories, classrooms, athletic centers, and dining halls) are significantly different from the step counts during “walk sessions”, i.e., when a subject moves between places. The assumption is that minute-level average step counts within places of interest are significantly different from the step counts during walk sessions. In the rest of the paper we describe the different investigated locations with the terms “dorms” (dormitories), “classes” (classrooms), “athletics” (athletic centers), and “dining” (dining halls).

In Figure 8 we observe that the average step counts within different places and during walk sessions are different. Two subcategories of athletic centers (gymnasiums and sport venues such as a stadium) also have significantly different average step counts, as shown in Figure 8.

Fig. 8. Average step counts within different places and walk sessions.

TABLE 3
Paired T-test results when comparing step counts within different places

Comparison              df    t-statistic   p-value
Dorms vs. Classes       341   -21.66        < .001
Dorms vs. Athletics     341   -24.01        < .001
Dorms vs. Dining        341   -52.57        < .001
Classes vs. Athletics   341   -21.24        < .001
Classes vs. Dining      341   -41.33        < .001
Athletics vs. Dining    341   11.27         < .001

We first perform the Paired T-test to test the null hypothesis H0: “subject-level average step counts within different places and walk sessions are the same.” In Table 2 we observe that the null hypothesis is rejected for each case. Therefore, the average step counts are significantly different within different places when compared to walk sessions. The subject-level average step counts within the two subcategories of athletic centers are also significantly different from the step counts during walk sessions.

Next, we perform the Paired T-test to see whether the subject-level average step counts within different places are significantly different. In Table 3 we reject the null hypothesis for each case. Therefore, the average step counts are also significantly different in each place.
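For readers who want to reproduce this kind of comparison, the paired t statistic can be computed from per-subject averages as sketched below. This uses the textbook definition with the sample standard deviation of the paired differences; the function name and the toy numbers in the usage note are illustrative and are not taken from the study.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(x, y):
    """Return (t, degrees_of_freedom) for paired samples x and y.

    x[i] and y[i] must belong to the same subject, e.g., the subject's
    average step count in dorms and during walk sessions.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long samples with at least 2 pairs")
    d = [a - b for a, b in zip(x, y)]          # per-subject differences
    n = len(d)
    t = mean(d) / (stdev(d) / sqrt(n))         # mean difference / standard error
    return t, n - 1
```

With toy inputs `x = [1, 2, 3, 4]` and `y = [2, 4, 5, 8]`, the differences are all negative and the statistic is about -3.58 with 3 degrees of freedom; the p-value would then be read from the t distribution.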

5 EXPERIMENTAL EVALUATION

In this section, our goal is to demonstrate that the proposed spatio-temporal stay point (sp) clustering approach is able to capture more context information (e.g., via larger clusters), compared to existing spatio-temporal clustering approaches [5], [8], [9], [12].

5.1 Experimental Setup

5.1.1 Machine and Software Configuration

For our experimentation we use two Lenovo NeXtScale nx360 servers, each having dual 12-core Intel(R) Xeon(R) CPU E5-2680 v3 2.50 GHz Haswell processors, 256 GB RAM, and 1.4 TB storage. We also use a Dell PowerEdge R815 server with quad 16-core 2.4 GHz AMD Opteron processors (Model 6378), 512 GB RAM, and 600 GB storage. We use the Matlab Parallel Computing Toolbox to access multiple cores to run our experimentation.


TABLE 4
Set of clusters with different types of behavioral data

Available data type                  Cluster count
Sleep                                85,530
Battery charging                     95,471
Sleep/Battery charging               99,046
Step counts                          193,180
Sleep/Battery charging/Step counts   218,292

5.1.2 Clustering Parameter Configuration

For the evaluation of our five opportunistic approaches described in Section 3.3, we execute the spatio-temporal clustering algorithms with a minimum cluster duration Tmin = 10 minutes, a maximum allowed time gap between successive location points Tmax = 10 minutes, and a maximum cluster span Dmax = 250 meters. These are the same values as those used for the ESPC approach in [9], [12].

In contrast to our previous work [13], in this work we consider 341 subjects who do not use portable chargers (as determined by the phone recharging behavior survey). Each subject has at least one walk session. We consider the clusters that overlap with any of the three secondary data types, i.e., sleep sessions, battery charge sessions, or activity data (step counts). Table 4 summarizes the distribution of cluster counts across different types of secondary data. We compare our opportunistic approaches with ESPC only, since in our previous work we have already shown that ESPC performs better than RSPC [9], [12]. We try to merge a pair of clusters from the same place. We compare the inter-centroid distance with Dmax to decide whether the clusters are from the same place. The choice of window duration (when trying to merge clusters using step counts) depends on the sampling rate of step counts, and this may affect the performance of cluster merging. Using a very large duration may make all step counts fit into a single window. Similarly, a very short duration may lead to an empty window or a window with very few step counts. Since our step counts are recorded with a period of 1 minute, we use 10 minutes as our window duration so that we have a sufficient number of step counts when classifying a window.

5.2 Evaluation Using Unlabelled Data

5.2.1 Evaluation Metrics

We use the following metrics for evaluation:

Cluster Count Decrease (CCD), which is the difference (as a percentage) of the total number of clusters (cluster count), i.e.:

$$\mathrm{CCD} = \frac{C_{ESPC} - C_{op}}{C_{ESPC}} \times 100\% \tag{5}$$

where CESPC and Cop are the cluster counts obtained using ESPC and using our opportunistic approach, respectively. The value is close to 100% if our opportunistic approach is able to merge most of the clusters into larger clusters for a user, when ESPC is not able to do so. Similarly, the value is 0% if both our approach and ESPC yield identical results.

Cluster Time Increase (CTI), which is the difference (as a percentage) in cluster time, i.e.:

$$\mathrm{CTI} = \frac{T_{op} - T_{ESPC}}{T_{op}} \times 100\% \tag{6}$$

TABLE 5
Overall cluster count improvement using the super set of clusters

Approach   before merging   after merging   CCD (%)
SA-SPC     218,292          210,266         3.68
BA-SPC     218,292          187,043         14.32
SBA-SPC    218,292          162,095         25.74
AA-SPC     218,292          94,440          56.74
SBAA-SPC   218,292          77,667          64.42

Fig. 9. Subject-level average decrease in cluster count (in %).

where Top and TESPC are the cluster times obtained using our opportunistic approach and using ESPC, respectively. The value is close to 100% if our opportunistic approach is able to recover long data gaps among the clusters during merging, when ESPC is not able to do so. Similarly, the value is 0% if both our approach and ESPC yield identical results.
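Equations 5 and 6 translate directly into code. The sketch below is only a restatement of the two formulas; the function names are assumptions, and the durations used in the CTI usage note are hypothetical.

```python
def cluster_count_decrease(c_espc: float, c_op: float) -> float:
    """CCD (Equation 5): relative drop in cluster count, in percent."""
    return (c_espc - c_op) / c_espc * 100.0


def cluster_time_increase(t_op: float, t_espc: float) -> float:
    """CTI (Equation 6): relative gain in cluster time, in percent.

    Note that the gain is normalized by the time obtained with the
    opportunistic approach (t_op), not by the ESPC time.
    """
    return (t_op - t_espc) / t_op * 100.0
```

Applied to the SBAA-SPC row of Table 5, `cluster_count_decrease(218292, 77667)` rounds to 64.42. For CTI, a hypothetical pair of clusters covering 40 minutes that grows to 60 minutes after merging gives `cluster_time_increase(60, 40)`, i.e., about 33.33.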

5.2.2 Evaluation with the Super Set of Clusters

In this section we evaluate all of our opportunistic clustering approaches using the set of 218,292 enhanced sp-clusters (Table 4), each of which has some overlap with sleep sessions, battery charge sessions, or activity data (i.e., step counts). We first try to merge pairs of successive sp-clusters from the same place using the secondary data and then evaluate the opportunistic approaches based on cluster count and cluster time improvement.

Cluster count improvement: Table 5 shows the total number of clusters before and after merging pairs of sp-clusters using different secondary data and associated clustering approaches. In the table we observe that with the inclusion of more secondary information, we are able to merge more clusters and improve the percent cluster count decrease (CCD) (computed using Equation 5). When comparing CCD across individual approaches, AA-SPC performs better than SA-SPC and BA-SPC. The combined SBA-SPC approach achieves better performance than the individual SA-SPC and BA-SPC approaches. However, AA-SPC still achieves better performance than the SBA-SPC approach, which is due to the higher availability of step count data across the entire day compared to sleep and battery charging data. We achieve the highest performance of 64.42% CCD when applying all three types of secondary information together using the SBAA-SPC approach.

Figure 9 presents the subject-level average CCD along with the standard error across different opportunistic approaches. The findings in the figure are similar to those observed in general in Table 5. For example, using the SBAA-SPC approach we obtain 62.62% subject-level average CCD (Figure 9) and 64.42% CCD in general (Table 5). The small difference between the subject-level average CCD and overall CCD is due to subject bias and its effect on average value computation, which is further demonstrated using boxplots in Figure 10. Therefore, additional secondary information helps to improve subject-level average cluster counts.

Fig. 10. Boxplot of subject-level decrease in cluster count (in %).

Figure 10 presents the boxplot of subject-level CCD across different opportunistic approaches. In the figure we observe that BA-SPC has a better range than SA-SPC, which is because of a higher availability of battery charge sessions, i.e., they are distributed over the entire day compared to sleep sessions, which can be found mostly at night. Therefore, when we consider both of them, we achieve a better performance, i.e., SBA-SPC achieves a higher range compared to both SA-SPC and BA-SPC. Activity has a higher availability compared to battery and sleep sessions. In the figure we also observe that AA-SPC (i.e., considering only activity) achieves a better subject-level CCD range compared to the range obtained from SBA-SPC. Similarly, when we combine all three types of secondary information, availability of secondary information increases and that also increases the chance of merging cluster pairs. Therefore, SBAA-SPC achieves the best performance, i.e., a higher range of subject-level CCD compared to other approaches. These findings are consistent with the findings in Figure 9 and Table 5. Therefore, we conclude that availability of more secondary information will lead to more significant improvements of CCD in general.

Cluster time improvement: Figures 11 and 12 present the cluster-level average CTI along with the standard error, and the subject-level average CTI obtained from different opportunistic approaches. In both figures we observe the same order of performance improvement, i.e., SA-SPC achieves the lowest average CTI, BA-SPC achieves a better average CTI than SA-SPC, and SBA-SPC achieves a better average CTI compared to both SA-SPC and BA-SPC since SBA-SPC uses both sleep and battery charge session information to merge cluster pairs. We further observe that AA-SPC achieves a better average CTI compared to SBA-SPC since AA-SPC uses activity data, which has a higher availability than the other two secondary data types. SBAA-SPC achieves the

Fig. 11. Cluster-level increase in cluster time (in %).

Fig. 12. Boxplot of subject-level average increase in cluster time (in %).

TABLE 6
Paired T-test results of CTI (%) when comparing different approaches using the super set of clusters

Comparison           degrees of freedom   t-statistic   p-value
SA-SPC vs. ESPC      129,680              45.07         < .001
BA-SPC vs. ESPC      138,898              85.66         < .001
SBA-SPC vs. ESPC     144,254              102.65        < .001
AA-SPC vs. ESPC      81,615               151.71        < .001
SBAA-SPC vs. ESPC    65,272               168.59        < .001

highest average CTI compared to other approaches since SBAA-SPC uses all three types of secondary data, which increases the chance to merge more cluster pairs. These findings are similar to our findings in Figures 9 and 10. Therefore, we conclude that a combination of all three types of secondary information will lead to greater improvement of CTI compared to individual ones.

Next, we perform the Paired T-test in order to test the significance of cluster time improvement using the opportunistic approaches compared to the ESPC approach. The null hypothesis is “H0 : µD = 0,” where µD is the average of the differences in cluster durations before and after merging using our opportunistic approaches. In Table 6 we reject the null hypothesis for all possible cases. Therefore, the merging using our opportunistic clustering approaches significantly improves the cluster times.

Figure 13 shows the Probability Density Function (PDF) and Cumulative Distribution Function (CDF) of the increase


Fig. 13. Probability Density Function (PDF) and Cumulative Distribution Function (CDF) of cluster-level increase in cluster time (in %).

TABLE 7
Overall cluster count improvement using separate cluster sets

Approach   before merging   after merging   CCD (%)
SA-SPC     85,530           77,464          9.43
BA-SPC     95,471           64,115          32.84
SBA-SPC    99,046           42,708          56.88
AA-SPC     193,180          68,786          64.39
SBAA-SPC   218,292          77,667          64.42

in cluster time for merged clusters. From the PDF, we observe that most of the increases in cluster times using the opportunistic approaches are in the 0–30% range, which is consistent with the findings from Figure 12. However, within this 0–30% range, SA-SPC has fewer samples than other approaches. Beyond a 30% increase in cluster times, SA-SPC dominates the other opportunistic approaches, i.e., SA-SPC has many more samples that achieve a higher increase in cluster time than the other approaches. From the CDFs, we observe that the CDFs of BA-SPC and SBA-SPC, and the CDFs of AA-SPC and SBAA-SPC are almost identical. The CDFs of BA-SPC and SBA-SPC are also significantly above the CDF of SA-SPC and below the CDFs of AA-SPC and SBAA-SPC until they reach saturation. The CDFs of AA-SPC and SBAA-SPC reach 0.9 probability quickly (i.e., around 40% CTI). Yet, the CDFs of BA-SPC and SBA-SPC reach 0.9 probability at around 70% CTI. The differences in this figure show that for successful merging the SA-SPC approach is able to improve the cluster times much more than the other opportunistic approaches. This is because SA-SPC with its relatively longer sleep sessions is able to fill longer data gaps among clusters when merging as compared to other approaches.

5.2.3 Evaluation with Separate Cluster Sets

In this section we evaluate each of the opportunistic clustering approaches using a separate set of enhanced sp-clusters based on the availability of secondary data (Table 4).

Cluster count improvement: In Table 7 we observe that AA-SPC and SBAA-SPC have similar CCD for their respective cluster sets. SA-SPC has the lowest CCD compared to other approaches. The subject-level average values obtained for CCDs across different approaches are also similar to the overall CCDs as presented in Table 7. In Table 8 we observe

TABLE 8
CCD (%) and CTI (%) ranges when comparing different clustering approaches using separate cluster sets

Comparison           CCD (%)           CTI (%)           CTI (%)
                     (subject-level)   (subject-level)   (cluster-level)
SA-SPC vs. ESPC      0 – 22            0 – 77            0 – 99
BA-SPC vs. ESPC      5 – 60            2 – 40            0 – 59
SBA-SPC vs. ESPC     34 – 72           2 – 45            0 – 63
AA-SPC vs. ESPC      47 – 80           3 – 27            0 – 35
SBAA-SPC vs. ESPC    47 – 81           3 – 30            0 – 38

Fig. 14. Cluster-level average increase in cluster time (in %) using separate cluster sets.

TABLE 9
Paired T-test results of CTI (%) when comparing different approaches using separate cluster sets

Comparison           degrees of freedom   t-statistic   p-value
SA-SPC vs. ESPC      4,538                59.79         < .001
BA-SPC vs. ESPC      21,319               101.75        < .001
SBA-SPC vs. ESPC     29,638               121.17        < .001
AA-SPC vs. ESPC      60,720               159.65        < .001
SBAA-SPC vs. ESPC    65,272               168.59        < .001

that the ranges of subject-level CCDs are consistent with the overall CCDs presented in Table 7.

Cluster time improvement: Figure 14 presents the cluster-level average cluster time increase (CTI) along with the standard error. In the figure, we observe that BA-SPC and SBA-SPC have similar CTI for their respective cluster sets. SA-SPC has the highest and AA-SPC has the lowest CTI compared to other approaches. This happens since sleep sessions are usually long enough to fill larger gaps compared to the other secondary information.

We further investigate the significance of cluster time increase caused by the opportunistic approaches using their respective cluster sets. In Table 9 we reject the null hypotheses for all possible cases. Therefore, the merging in our opportunistic clustering approaches significantly improves the cluster times.

5.3 Validation Using Labelled Data

Opportunistic cluster merging approaches that use sleep and/or battery charging information to merge successive clusters are deterministic, i.e., once sleep and/or battery charging information is available during the data gap between clusters, opportunistic clustering will always merge the successive clusters. However, the use of step counts during gaps may not always result in cluster merging, depending on a user’s activity level during gaps. Therefore, in this section we validate the opportunistic clustering approach that uses activity data (i.e., step counts) to merge clusters.

5.3.1 Location/Place Ground Truth Collection

We analyze the performance of Activity-Assisted Stay Point Clustering (AA-SPC) using labelled data, i.e., ground truth labels that are obtained using the same place mapping technique discussed in our previous paper [9]. We obtain a total of 119,201 continuous location traces with ground truth labels, which are used as candidate traces to evaluate the AA-SPC approach.

5.3.2 Creation of Location Data Gaps

All these 119,201 continuous location traces are obtained from a set of 368 subjects. These continuous traces have varying lengths of at least 60 minutes. For each of these continuous traces, we keep 20 minutes at the beginning and the end of the trace to mimic a scenario of 2 separate sp-clusters separated by a gap of at least 20 minutes (i.e., twice as large as Tmax = 10 minutes). Therefore, we obtain 119,201 cluster pairs, each of which has minute-level activity data (i.e., step counts) during all three parts. The goal of our evaluation is then to measure how successful the AA-SPC approach is in determining whether a subject remains in the same place during the gap between successive clusters.
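The gap creation step can be sketched as follows. The trace layout (a time-ordered list of `(minute, point)` pairs), the function name, and the exact boundary handling are assumptions made for illustration.

```python
def make_cluster_pair(trace, keep=20, min_length=60):
    """Split one continuous trace into two artificial sp-clusters and a gap.

    trace: time-ordered list of (minute, location_point) pairs.
    Returns (head, tail, gap) or None if the trace is shorter than min_length.
    """
    t_first, t_last = trace[0][0], trace[-1][0]
    if t_last - t_first < min_length:
        return None
    # Keep the first and last `keep` minutes; drop everything in between.
    head = [p for p in trace if p[0] < t_first + keep]
    tail = [p for p in trace if p[0] > t_last - keep]
    gap = (t_first + keep, t_last - keep)   # removed interval, in minutes
    return head, tail, gap
```

For the shortest admissible trace of 60 minutes, the removed interval is 20 minutes long, i.e., twice Tmax, so neither cluster can be rejoined on location data alone.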

5.3.3 Evaluation Metric

In addition to our previous two evaluation measures, i.e., CCD (Equation 5) and CTI (Equation 6), in this section, we also use the following metric for our evaluation of AA-SPC:

Merging Success (MS), which is the ratio of the number of cluster pairs that successfully become single clusters after merging to the number of cluster pairs before merging, i.e.:

$$\mathrm{MS} = \frac{\sum_{i=1}^{K} M_i}{\sum_{i=1}^{K} (M_i + N_i)} \times 100\% \tag{7}$$

where Mi and Ni are the number of cluster pairs that are merged and not merged into single clusters, respectively, after merging for the ith subject and K is the total number of subjects.

5.3.4 Merging Effectiveness of AA-SPC

AA-SPC is able to merge 95,605 cluster pairs into single clusters out of the total 119,201 pairs obtained from 368 subjects. Therefore, according to Equation 7, the merging success (MS) is 80.2%. On a subject level, we are able to merge on average 79% (SD = 13%) of the cluster pairs with a range of 64% – 97%.
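Equation 7 pools the counts over all subjects before dividing, which is why the overall MS can differ slightly from the mean of the per-subject percentages. A minimal sketch, with an assumed function name and input layout:

```python
def merging_success(per_subject):
    """MS (Equation 7), in percent.

    per_subject: iterable of (merged_pairs, unmerged_pairs), one entry per subject.
    """
    pairs = list(per_subject)
    merged = sum(m for m, _ in pairs)
    total = sum(m + n for m, n in pairs)
    return merged / total * 100.0
```

With the pooled counts reported above, `merging_success([(95605, 119201 - 95605)])` rounds to 80.2. As a toy illustration of the pooling effect, two hypothetical subjects with (9, 1) and (1, 1) pairs give a pooled MS of about 83.3%, whereas the mean of their individual rates (90% and 50%) is 70%.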

Those successfully merged single clusters have an average duration of 161.68 minutes (SD = 138.89 minutes) with a range of 60 minutes to 333 minutes. Since each of the 119,201 cluster pairs is separated by a data gap of at least 20 minutes (i.e., 2 times Tmax = 10 minutes), neither RSPC (Section 3.2.1) nor ESPC (Section 3.2.2) is able to merge these clusters. However, AA-SPC is able to merge them if a

Fig. 15. Probability Density Function (PDF) and Cumulative Distribution Function (CDF) of increase in cluster time (in %) using the AA-SPC approach compared to ESPC.

Fig. 16. Probability Density Function (PDF) and Cumulative Distribution Function (CDF) of subject-level merging success with the variation of length of location data gap.

user’s activity data (i.e., step count) is available and the data can prove the user’s presence at the same place during gaps.

We further investigate the CCD and CTI for this validation test. We are able to decrease the cluster count by 40% on average (SD = 7%) with a range of 32% – 48%. Similarly, on a cluster level, we are able to increase cluster time by 62.78% on average (SD = 17.86%) with a range of 33.33% – 97.22% when comparing AA-SPC with ESPC. To test the statistical significance of this cluster time increase, we perform the Paired T-test. We reject the null hypothesis “H0 : µD = 0,” where µD is the average of the differences in cluster durations before and after merging, with t(95604) = 1086.8, p < .001. Therefore, AA-SPC significantly increases the cluster durations after merging. We further perform the CTI analysis on a subject level. We obtain an average CTI of 61.70% (SD = 7.90%) with a range of 53% – 71%. In Figure 15 we observe that the cluster-level time increase (in %) when comparing AA-SPC with ESPC is almost uniformly distributed from 40% to 100% (PDF bars). Therefore, activity data (i.e., step count) equally improves cluster times over the entire range.

5.3.5 Impact of Gap Length on AA-SPC Performance

In Figure 16 we observe that merging success drops (PDF bars) with the increase of gap length. In the figure, the solid horizontal line corresponds to 0.9 probability, which crosses the CDF of merging success (blue dashed line) at the 320-minute data gap (dotted vertical line). Therefore, we can successfully recover up to 90% of the places for gaps of up to about 320 minutes using the step counts. Similarly, we can successfully recover up to 95% of the places for gaps of up to about 460 minutes.

Fig. 17. Change of performance measures with the change of window size.

5.3.6 Sensitivity of Performance Measures to Window Size

To test the effect of window size when trying to merge two successive sp-clusters using activity data, we vary the window size from 3 minutes to 19 minutes in steps of 2 minutes. In Figure 17 we draw an error bar around the average value of each performance measure at a particular window size and then connect the average values of a performance measure across different window sizes using a line. In the figure we observe that, in general, the performance improves with increasing window sizes. However, compared to MS and CCD, CTI is less sensitive to window size variation. This is because with increasing window sizes, we are also considering more activity samples, which then reduces the chance of misidentifying a window due to the presence of noisy samples in the window. Although an increased window size improves the MS and CCD, it does not help much to improve the CTI of successfully merged clusters. This finding is consistent with our earlier findings in Section 5.2.3, where we have shown that sleep- and battery-assisted clustering approaches are able to fill larger gaps compared to the activity-assisted approach.

6 DISCUSSION

While the proposed approach relies on three different types of secondary data, not all types may always be available, e.g., sleep sessions are concentrated at night. Compared to sleep, battery charging sessions have a better spatio-temporal distribution. Therefore, while sleep sessions can improve the night time clusters within homes, battery charge sessions can improve clusters from additional places such as workplaces, where people also typically recharge their phone batteries. Compared to sleep and battery charge sessions, activity data has the highest spatio-temporal availability and can further improve clustering. However, the improvements in clustering also depend on a user’s choice of carrying multiple devices. For example, the valid activity data that we used in our analysis requires the users to wear the Fitbit device. Another factor is the application scenario, e.g., if the goal is a better discovery of night time clusters or clusters within homes, then battery charge sessions with and without sleep sessions could be enough to rely on. As observed in Section 5.2.2, in general, the addition of more types of secondary data increases the availability of information to infer a user’s state and allows us to merge more clusters to obtain fewer but larger clusters. For example, it can also be helpful to combine other secondary data such as Wi-Fi and Bluetooth scan records to match the fingerprints in order to discover closely co-located users and share location data among the users to fill the gaps of a particular user.

Although it may seem redundant to consider activity data when we already have sleep or battery charge information, the additional data increases the reliability of a user’s state prediction, since battery and sleep sessions work in a deterministic way, i.e., they always merge cluster pairs. In other words, unlike minute-level activity data, those sessions never check for a user’s absence from a place. Also, all three types of secondary data are collected from readily available sensors in smartphones and wearables (Table 1). Therefore, combining them can add real value.

In Section 5.2.3, we presented the outcomes of cases where we consider separate sets of clusters. We observed that the SA-SPC clustering approach improves both cluster count and time compared to other approaches, because sleep sessions are usually long enough to fill larger gaps than the other secondary information, such as battery charge sessions and minute-level activity data. Also, in our AA-SPC and SBAA-SPC approaches, even a single noisy window could stop the merging of a cluster pair.
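The contrast between the deterministic session-based merge and the window-based activity check can be illustrated with a small sketch. Everything here is an assumption made for illustration rather than the paper's implementation: time is encoded as integer minutes, a session is taken to fill a gap when it spans the gap entirely, and a window is treated as "walking" when its mean step count exceeds a hypothetical `walk_threshold`.

```python
from typing import Sequence, Tuple

Interval = Tuple[int, int]  # (start_minute, end_minute); hypothetical encoding


def session_fills_gap(gap: Interval, sessions: Sequence[Interval]) -> bool:
    """Deterministic rule (assumed): merge the cluster pair whenever a
    sleep or battery charge session spans the whole gap between them."""
    start, end = gap
    return any(s <= start and end <= e for s, e in sessions)


def activity_fills_gap(steps: Sequence[int], window: int,
                       walk_threshold: float) -> bool:
    """Window-based rule (assumed): merge only if no window of
    minute-level step counts inside the gap looks like walking."""
    for i in range(0, len(steps), window):
        chunk = steps[i:i + window]
        if sum(chunk) / len(chunk) > walk_threshold:
            # A single walking-like (or noisy) window blocks the merge.
            return False
    return True
```

With the illustrative step counts `[0, 0, 0, 90, 0, 0]` and a threshold of 20 steps per minute, a 3-minute window rejects the merge because one window averages 30 steps, whereas a 6-minute window averages 15 steps and accepts it. This mirrors the observation that larger windows are less affected by isolated noisy samples.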

7 CONCLUSIONS

Gaps in location traces impact the performance of clustering techniques that are required to identify places of importance to a user. In this paper, we introduce a new approach that fills such gaps using a user’s own opportunistic secondary information. Specifically, we utilize secondary sensor data from the user’s own smartphone (i.e., the phone’s battery charge sessions) and wearable device (i.e., sleep sessions and step counts recorded by a Fitbit device) to determine the individual’s spatio-temporal state. Under the assumption that the user remains at the same location when sleeping or recharging the phone’s battery, we can use the sleep and battery charge session data to augment the location traces. Also, under the hypothesis “a user’s step counts differ among different places and walk sessions” with the assumption that walking takes place when a user moves between places, we can also use the step counts to augment the location traces, yielding fewer but larger clusters that more accurately represent the places of significance. Another advantage of using step counts (i.e., activity data) in clustering is their higher availability over different places and different times of day compared to the sleep and battery charge sessions, which are skewed in their spatio-temporal distributions. The evaluation results presented in this work (using a large-scale mobile crowd sensing study dataset collected from several hundred college students) confirm that the proposed approach is able to produce larger clusters that include a larger percentage of a user’s traces, where each cluster represents a place of significance. In our future work, we plan to look at additional context data such as a user’s communication trends and app usage habits in order to further improve the place discovery.

8 ACKNOWLEDGEMENTS

The research reported in this paper was supported by the National Heart, Lung, and Blood Institute (NHLBI) of the National Institutes of Health (NIH) under award number R01HL117757.

REFERENCES

[1] N. Marmasse and C. Schmandt, “Location-aware informationdelivery withcommotion,” in International Symposium on Handheldand Ubiquitous Computing, 2000.

[2] V. W. Zheng, Y. Zheng, X. Xie, and Q. Yang, “Collaborative locationand activity recommendations with gps history data,” in ACMWorld Wide Web, 2010.

[3] J. H. Kang, W. Welbourne, B. Stewart, and G. Borriello, “Extractingplaces from traces of locations,” ACM SIGMOBILE Mobile Comput-ing and Communications Review, vol. 9, no. 3, pp. 58–68, 2005.

[4] S. Vhaduri, A. Munch, and C. Poellabauer, “Assessing healthtrends of college students using smartphones,” in IEEE HI-POCT,2016.

[5] Y. Ye, Y. Zheng, Y. Chen, J. Feng, and X. Xie, “Mining individuallife pattern based on location history,” in IEEE MDM, 2009.

[6] D. Ashbrook and T. Starner, “Using gps to learn significant loca-tions and predict movement across multiple users,” Personal andUbiquitous computing, vol. 7, no. 5, pp. 275–286, 2003.

[7] B. Guo, Z. Wang, Z. Yu et al., “Mobile crowd sensing andcomputing: The review of an emerging human-powered sensingparadigm,” ACM Computing Surveys, vol. 48, no. 1, p. 7, 2015.

[8] R. Montoliu, J. Blom, and D. Gatica-Perez, “Discovering places ofinterest in everyday life from smartphone data,” Multimedia toolsand applications, vol. 62, no. 1, pp. 179–207, 2013.

[9] S. Vhaduri and C. Poellabauer, “Hierarchical cooperative discov-ery of personal places from location traces,” IEEE Transactions onMobile Computing, vol. 17, no. 8, pp. 1865–1878, 2018.

[10] Y.-C. Fan, C.-W. Chang, W.-C. Lee, K.-C. Wu, and A. L. Chen, “Onthe semantic annotation of wi-fi ssid logs in mobile applications,”Pervasive and Mobile Computing, vol. 43, pp. 131–145, 2018.

[11] S. He, S.-H. G. Chan, L. Yu, and N. Liu, “Slac: Calibration-free pedometer-fingerprint fusion for indoor localization,” IEEETransactions on Mobile Computing, 2017.

[12] S. Vhaduri and C. Poellabauer, “Cooperative discovery of personalplaces from location traces,” in ICCCN, 2016.

[13] S. Vhaduri, C. Poellabauer, A. Striegel, O. Lizardo, and D. Hachen,“Discovering places of interest using sensor data from smart-phones and wearables,” in IEEE UIC, 2017.

[14] S. Vhaduri and C. Poellabauer, “Wearable device user authentica-tion using physiological and behavioral metrics,” in IEEE PIMRC,2017.

[15] E. Wang, Y. Yang, J. Wu, W. Liu, and X. Wang, “An efficientprediction-based user recruitment for mobile crowdsensing,” IEEETransactions on Mobile Computing, vol. 17, no. 1, pp. 16–28, 2018.

[16] D. Wu, Q. Liu, Y. Li et al., “Adaptive lookup of open wifi usingcrowdsensing,” IEEE/ACM Transactions on Networking, vol. 24,no. 6, pp. 3634–3647, 2016.

[17] W. Dai, Y. Wang, Q. Jin, and J. Ma, “Geo-qti: A quality aware truth-ful incentive mechanism for cyber–physical enabled geographiccrowdsensing,” Future Generation Computer Systems, vol. 79, pp.447–459, 2018.

[18] Y. Wang, Z. Cai, X. Tong, Y. Gao, and G. Yin, “Truthful incentivemechanism with location privacy-preserving for mobile crowd-sourcing systems,” Computer Networks, 2018.

[19] M. Pouryazdan, C. Fiandrino, B. Kantarci, T. Soyata, D. Kliazovich,and P. Bouvry, “Intelligent gaming for mobile crowd-sensing par-ticipants to acquire trustworthy big data in the internet of things,”IEEE Access, vol. 5, pp. 22 209–22 223, 2017.

[20] Z. Zheng, F. Wu, X. Gao, H. Zhu, S. Tang, and G. Chen, “A budgetfeasible incentive mechanism for weighted coverage maximizationin mobile crowdsensing,” IEEE Transactions on Mobile Computing,vol. 16, no. 9, pp. 2392–2407, 2017.

[21] M. Pavan, S. Mizzaro, and I. Scagnetto, “Mining movement datato extract personal points of interest: A feature based approach,”in Information Filtering and Retrieval. Springer, 2017.

[22] A. Thomason, N. Griffiths, and V. Sanchez, “Identifying locationsfrom geospatial trajectories,” Journal of Computer and System Sci-ences, vol. 82, no. 4, pp. 566–581, 2016.

[23] Z. Fu, Z. Tian, Y. Xu, and K. Zhou, “Mining frequent route patternsbased on personal trajectory abstraction,” IEEE Access, vol. 5, pp.11 352–11 363, 2017.

[24] A. LaMarca, Y. Chawathe, S. Consolvo et al., “Place lab: Devicepositioning using radio beacons in the wild,” in InternationalConference on Pervasive Computing, 2005.

[25] F. Xu, D. Wu, Y. Li et al., “Big data driven mobile traffic under-standing and forecasting: A time series approach,” IEEE transac-tions on services computing, vol. 9, no. 5, pp. 796–805, 2016.

[26] S. Hwang, C. Evans, and T. Hanke, “Detecting stop episodesfrom gps trajectories with gaps,” in Seeing Cities Through Big Data.Springer, 2017, pp. 427–439.

[27] L. Zhu, C. Xu, J. Guan, and H. Zhang, “Sem-ppa: a semanticalpattern and preference-aware service mining method for person-alized point of interest recommendation,” Journal of Network andComputer Applications, vol. 82, pp. 35–46, 2017.

[28] S. M. Mousavi, A. Harwood, S. Karunasekera, and M. Maghrebi,“Geometry of interest (goi): spatio-temporal destination extractionand partitioning in gps trajectory data,” Journal of Ambient Intelli-gence and Humanized Computing, vol. 8, no. 3, pp. 419–434, 2017.

[29] B. Shams and S. Haratizadeh, “Graphloc: a graph based approachfor automatic detection of significant locations from gps trajectorydata,” Journal of Spatial Science, vol. 63, no. 1, pp. 115–134, 2018.

[30] V. Kulkarni, A. Moro, B. Chapuis, and B. Garbinato, “Extract-ing hotspots without a-priori by enabling signal processing overgeospatial data,” in ACM SIGSPATIAL GIS, 2017.

[31] N. Kiukkonen, J. Blom, O. Dousse et al., “Towards rich mobilephone datasets: Lausanne data collection campaign,” ICPS, 2010.

[32] S. Vhaduri and C. Poellabauer, “Impact of different pre-sleepphone use patterns on sleep quality,” in IEEE BSN, 2018.

[33] ——, “Biometric-based wearable user authentication duringsedentary and non-sedentary periods,” in IEEE/ACM IoTSec, 2018.

[34] ——, “Human factors in the design of longitudinal smartphone-based wellness surveys,” in IEEE ICHI, 2016.

[35] ——, “Opportunistic discovery of personal places using smart-phone and fitness tracker data,” in IEEE ICHI, 2018.

[36] “What should i know about smarttrack exercise detection?”Accessed: August 2017. [Online]. Available: https://help.fitbit.com/articles/en US/Help article/1933

Sudip Vhaduri received the B.Sc. degree in computer science and engineering from Bangladesh University of Engineering and Technology, Bangladesh, the M.Sc. degree in computer science from University of Memphis, and is currently a Ph.D. student in computer science and engineering at University of Notre Dame. His research interests include mobile sensing and computing, mobile health, and data science. He is a student member of the IEEE and the IEEE Computer Society.

Christian Poellabauer received the Diplom-Ingenieur degree from University of Technology Vienna and the Ph.D. degree in computer science from Georgia Institute of Technology. He is an associate professor of computer science and engineering at University of Notre Dame. His research interests are in the areas of distributed real-time systems, mobile computing, wireless sensor networks, and smart health. He is a senior member of the IEEE and the IEEE Computer Society.
