This is an author created version of the article. The original manuscript is available from http://doi.ieeecomputersociety.org/10.1109/TPDS.2013.34.


Exploiting Service Similarity for Privacy in Location Based Search Queries

Rinku Dewri, Member, IEEE, and Ramakrisha Thurimella

Abstract—Location-based applications utilize the positioning capabilities of a mobile device to determine the current location of a user, and customize query results to include neighboring points of interest. However, location knowledge is often perceived as personal information. One of the immediate issues hindering the wide acceptance of location-based applications is the lack of appropriate methodologies that offer fine grain privacy controls to a user without vastly affecting the usability of the service. While a number of privacy-preserving models and algorithms have taken shape in the past few years, there is an almost universal need to specify one’s privacy requirement without understanding its implications on the service quality. In this paper, we propose a user-centric location-based service architecture where a user can observe the impact of location inaccuracy on the service accuracy before deciding the geo-coordinates to use in a query. We construct a local search application based on this architecture and demonstrate how meaningful information can be exchanged between the user and the service provider to allow the inference of contours depicting the change in query results across a geographic area. Results indicate the possibility of large default privacy regions (areas of no change in result set) in such applications.

Index Terms—Privacy-supportive LBS, location privacy, service quality.


1 INTRODUCTION

The consumer market for location-based services (LBS) is estimated to grow from 2.9 billion dollars in 2010 to 10.4 billion dollars in 2015 [1]. While navigation applications are currently generating the most significant revenues, location-based advertising and local search will be driving the revenues going forward. The legal landscape, unfortunately, is unclear about what happens to a subscriber’s location data. The non-existence of regulatory controls has led to a growing concern about potential privacy violations arising out of the usage of a location-based application. While new regulations to plug the loopholes are being sought, the privacy-conscious user currently feels reluctant to adopt one of the most functional business models of the decade.

Privacy and usability are two equally important requirements for successful realization of a location-based application. Privacy (location) is loosely defined as a “personally” assessed restriction on when and where someone’s position is deemed appropriate for disclosure. To begin with, this is a very dynamic concept. Usability has a twofold meaning: a) privacy controls should be intuitive yet flexible, and b) the intended purpose of an application is reasonably maintained. Towards this end, prior research has led to the development of a number of privacy criteria, and algorithms for their optimal achievement. However, there is no known attempt to bring into view the mutual interactions between the accuracy of a location coordinate and the service quality from an application using those coordinates. Therefore, the question of what minimal location accuracy is required for an LBS application to function remains open. The common man’s question is: “how important is my position to get me to the nearest coffee shop?” Unfortunately, this question remains unanswered in the scientific community.

• R. Dewri and R. Thurimella are with the Department of Computer Science, University of Denver, CO 80208, USA. Email: {rdewri,ramki}@cs.du.edu.

It is worth mentioning that a separate line of research in analyzing anonymous location traces has revealed that user locations are heavily correlated, and knowing a few frequently visited locations can easily identify the user behind a certain trace [2], [3]. The privacy breach in these cases occurs because the location to identity mapping results in a violation of user anonymity. The proposal in this work attempts to prevent the reverse mapping, from user identity to user location, albeit in a user-controllable manner.

1.1 Related Work

Location obfuscation has been extensively investigated in the context of privacy. Obfuscation has been earlier achieved either through the use of dummy queries or cloaking regions. In the dummy query method, a user hides her actual query (with the true location) amongst a set of additional queries with incorrect locations [4], [5]. The user’s actual location is one amongst the locations in the query set. The additional processing overhead at the LBS, resulting from the dummy queries, must be addressed while using this method. Cheng et al. propose a data model to augment uncertainty to location data using circular regions around all objects [6]. They use imprecise queries that hide the location of the query issuer and yield probabilistic results. The results are modeled as the amount of overlap between the query range and the circular region around the queried objects. Yiu et al. propose an incremental nearest neighbor processing algorithm to retrieve query results [7]. The process starts with an anchor, a location different from that of the user, and it proceeds until an accurate query result can be reported. The work focuses on reducing the communication cost of the repeated querying mechanism.

IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS VOL:25 NO:2 YEAR 2014

Trusted third party based approaches rely on an anonymizer that creates spatial regions to hide the true location of users. The use of spatial and temporal cloaking to obfuscate user locations was first proposed by Gruteser and Grunwald [8]. Continuing on, Gedik and Liu develop a location privacy architecture where each user can specify maximum temporal and spatial tolerances for the cloaking regions [9]. Drawing inspiration from the concept of k-anonymity in database privacy [10], Gedik and Liu enforce a location k-anonymity requirement while creating the cloaking regions. This requirement ensures that the user will not be uniquely located inside the region in a given period of time. Ghinita et al. propose a decentralized architecture to construct an anonymous spatial region, and eliminate the need for the centralized anonymizer [11]. In their approach, mobile nodes utilize a distributed protocol to self-organize into a fault-tolerant overlay network, from which a k-anonymous cloaking set of users can be determined. Kalnis et al. propose that all obfuscation methods should satisfy the reciprocity property [12]. This prevents inversion attacks where knowledge of the underlying anonymizing algorithm can be used to identify the actual object [13]. Parameter specification remains the biggest hindrance to real world application of these techniques. Even when a user has advanced knowledge to comprehend the implications of a parameter setting on location privacy, the impact on service is unknown in these approaches. Refer to Section 1 of the supplementary file for additional literature review.

1.2 Contributions

Our contributions in this work are two-fold. First, we propose a novel architecture for LBS applications that is directed towards revealing privacy/utility trade-offs to a user before an actual geo-tagged query is made. Unlike a typical competitive architecture where the LBS provider does not actively participate in making privacy decisions, we envision a privacy-supportive LBS as a provider willing to provide supplemental information for making “informed” privacy decisions. An informed decision implies that the LBS user operates under reasonable knowledge about the service level implications of revealing her location with a given degree of inaccuracy. Under this platform, a user first obtains an overview of the impact of using inaccurate locations in a certain query. Thereafter, the actual query made to the service provider is geo-tagged with a location that the user has carefully chosen to balance result accuracy and location privacy. We describe in Section 2 the underlying rationale, setting, expectations and components that go into such an architecture. Refer to Section 2 of the supplementary file for a separate study, which demonstrates that users have the flexibility of adding significant noise to their locations and still obtain accurate search results.

As our second contribution, we present in Section 3 a proof of concept design for a privacy-supportive local search LBS. Given a search term (e.g. generic ones such as “cafes”, and targeted ones such as “starbucks coffee”) and a highly generalized user location (e.g. the metropolitan city), the privacy-supportive LBS generates a concise representation of the variation in the 10-nearest neighbor result set as a hypothetical user moves across the large metropolitan area. Once the representation is communicated to the user, she can infer the geographic variability that can be introduced in her location coordinates to retrieve all or a subset of the result set. Our results, using a publicly available local business database, indicate that the proposed approach can precisely reveal the area boundaries within which the result set is fully preserved (a default privacy level). Further, we observe a high degree of precision in estimating the area boundaries when user requirements on result set accuracy are relaxed (i.e. location sensitivity is hardened). Section 4 presents the empirical results to support these claims.

2 PRIVACY-SUPPORTIVE LBS

Future LBS architectures must make room for a service provider to cooperate with the user in making sound privacy decisions. There is growing skepticism about how an LBS provider handles (or might handle) location data. If strong market adoption is an agenda item for these businesses, then it becomes their responsibility to present evidence that the sought location accuracy is indeed a characteristic requirement of the application. Further, regulatory enforcements on location data procurement, and subsequent liability in the event of improper handling, can make the collection of unnecessarily precise geo-locations an unattractive choice. From a computational perspective, only the service provider maintains the database of queried objects in real time. Therefore, it is reasonable that differences (or similarities) in the output of a query can be efficiently computed at the server side. A user cannot make informed privacy decisions without this computation. In light of these arguments, a privacy-supportive LBS seems both appropriate and important. Note that a simple opt-in LBS is not privacy-supportive, since the implications of not using one’s geo-location are not available to the user.

2.1 Setting

The communication setting we assume includes one or more users equipped with GPS-enabled devices, and an LBS provider possessing a database of points-of-interest (POI). These points-of-interest may be static, as in local business listings, or dynamic, as in a friend-finder service where users frequently check-in/out of the underlying social-networking platform. As in almost all operating LBS applications, user access to the service is augmented by a geographic tag identifying the position of the user. Authentication may or may not be required to use the service, although many applications claim to be able to provide a better result set in the latter case. The service itself may require other parameters to be specified, such as search keywords or profile descriptions. The geographic tag in the query is typically the GPS-coordinates of the user device, but can also be a carefully crafted location as explained in the next subsection.

[Figure 1 labels: user device (privacy profile; 1. high-level derivation; 4. location perturbation), privacy-supportive LBS (2. query-output similarity profiler; 5. regular query processor; DB; data structure), 3. service-similarity profile, geo-tagged query, query result.]

Figure 1. Communication order for a location-based query in the presence of a privacy-supportive LBS.

2.2 Architecture

The location disclosure mechanism in a privacy-supportive LBS architecture employs an intermediate communication with the LBS. A high-level schematic of the communication pattern is depicted in Fig. 1. The user device forwards the query to the LBS, but uses a high-level generalization of the user’s geographic location in it. This generalization may be derived as per user-specification (say at the level of the city), or obtained automatically from the location approximation that a provider can infer using a database of cell towers and Wi-Fi access points¹. In response to this first query phase, the user obtains a service-similarity profile. This profile is a representation of the similarities in the query output at different geographic locations. The exact form taken by this profile, as well as the data structures employed in computing this profile, may vary from application to application. A location perturbation engine on the user side then determines a noisy location to use based on the user’s privacy profile and the retrieved service-similarity profile. The LBS processes the query with respect to the noisy location.

A user can manually interact with the service-similarity profile to assess which locations have the highest (or acceptable) level of result set similarity, within the constraints of the location noise she wants to infuse into the query. In this case, a good visualization of the similarity profile is required. Although this is the most flexible method of putting the trade-off information to use, such a high degree of interaction will affect the usability of the application, especially when queries are made frequently. Hence, we assume that action axioms have been provided by the user to make the process automatic. The privacy profile then states how a location is to be selected for different categories of applications, their importance, and the relative location sensitivity. Policy specifications such as these, and their integration into the decision making process, warrant an extensive exploration. We will avoid this frontier in this work. A naive approach is to allow the user to select a location sensitivity level (much like choosing the ringer-state in a mobile phone), assess query result accuracy at the corresponding location granularity (using the similarity profile), and notify the user if the accuracy drops below a threshold. Note that the policy executes within a user’s device and reveals little or no information on how locations get chosen.

1. Creating and updating cell-tower and Wi-Fi access point maps is a costly affair. The businesses that do so (Skyhook, Google, Apple, Navizon, etc.) often consider it proprietary. The legal standard for accessing these databases is currently being litigated in a number of cases (http://epic.org/privacy/location_privacy).

2.3 Privacy expectations and threat model

We interpret location privacy as the accuracy with which an adversary can determine the position of a user. This interpretation resembles the intuitive perception that a location estimated closer to our true position is more encroaching on our privacy than a relatively distant estimation. However, the privacy-supportive architecture does not make any assumption on what is “distant” and what is “close enough.” This is a significant departure from statistical measures of privacy, where a statement on “what is private” must be made pro-actively before issuing the query. A privacy-supportive LBS does not require this decision until the user determines the usability of the information that would be revealed as a result of the location disclosure, if at all. In light of this difference, the architecture, its underlying algorithms, or the service provider itself, cannot make any claims on the enforced level of privacy. It only facilitates the process to enforce personally desirable levels of location privacy after careful consideration of its impact. On similar grounds, we assume a threat model where the provider is semi-honest (follows protocol but may be curious). Note that, on one hand, even the weakest of the adversaries may learn the precise locations of a privacy-indifferent user (one who always reveals the true location), while on the other, even the strongest of the adversaries may learn nothing additional from a privacy-paranoid user. A privacy-aware user would use the system to her advantage, perhaps frequently revealing accurate (not necessarily precise) positions, and occasionally the heavily perturbed ones. An adversary who can classify these locations as real or dummy infers some knowledge about the user’s whereabouts; however, this is information that the user has opted to reveal in the first place.

3 A LOCAL SEARCH APPLICATION

Mobile local search is demonstrating an upward market trend, the gap with the desktop counterpart diminishing in the next three years, and then rising further². Given the penetration of web-enabled handheld devices in the consumer market, it has become exceedingly common for a user to instantly look up the information she seeks to find. These search queries are estimated to produce 27.8 billion more queries than desktop-search by the year 2016. A vast majority of the users performing mobile search seek access to information pertinent in the locality of the query. Multiple LBS applications (e.g. Where, AroundMe, MeetMoi, Skout and Loopt) have spawned in the past few years to address this market segment. In general, a local search application provides information on local businesses, events, and/or friends, weighted by the location of the query issuer. Location and service accuracy trade-offs are clearly present in a local search LBS. A privacy-supportive variant is therefore well-suited for this application class. Local search results tend to cycle through periods of plateaus and minor changes as one moves away from a specified location. The plateaus provide avenues for relaxation in the location accuracy without affecting service accuracy, while the minor changes allow one to assess accuracy in a continuous manner.

3.1 Problem statement

In the traditional usage of a local search application, the user would communicate a search keyword to the provider, and retrieve a ranked list of records matching the search term. Let us denote the items that match the search term in the points-of-interest database by $P = \{P_1, P_2, \ldots, P_N\}$. A ranking function $R$ is applied to this set and a top-$k$ subset of the ranked results is returned to the user. Since neighboring results are considered more useful, the ranking function would utilize the geo-location of the user. We use $R_k(P, pos)$ to collectively denote this result set when retrieved with respect to the position $pos$.

2. Source: BIA/Kelsey Press Releases, April 2012

[Figure 2: similarity map with a color scale running from 1.0 down to 0.0.]

Figure 2. Hypothetical query result set similarity with the user at the center of the area.

3.1.1 An ideal scenario

Let us next consider a hypothetical scenario where the user has access to a matrix that shows the percentage similarity of the result set with respect to the user’s current location. In order to formalize this map, let us superimpose a grid of $r \times c$ cells on a geographic area $G$. In local search, it is sufficient to restrict focus to this geographic area while determining the set $P$. The position of the user in the grid is given as $p = \langle x_0, y_0 \rangle$. Let $Sim$ be a similarity function, defined in this application as follows.

$$Sim(\langle x, y \rangle, \langle x_0, y_0 \rangle) = \frac{|R_k(P, \langle x, y \rangle) \cap R_k(P, \langle x_0, y_0 \rangle)|}{k}.$$

For brevity, we will also use $R_k(P, \langle x, y \rangle)$ and $R_k(P, \langle x_0, y_0 \rangle)$ as arguments to the $Sim$ function. Let $S_{x_0,y_0}$ be a matrix of $r$ rows and $c$ columns, with

$$S_{x_0,y_0}[i, j] = Sim(\langle x_0, y_0 \rangle, \langle i, j \rangle).$$

Hence, $S_{x_0,y_0}$ is a cell-by-cell measure of the similarity of the result set retrieved for the user’s position relative to that retrieved for any other position in the grid. As depicted in Fig. 2, this matrix allows the user to identify cell boundaries where the result set similarity gradually decreases from 100% to 0%. We call these boundaries the service-contour of the issued query. The innermost region in the figure, $S_{x_0,y_0} = 1.0$, is the default privacy region: the user can claim to be anywhere in that region and yet retrieve the same result set as she would by using her precise coordinates. The size of this default region is a characteristic feature of the distribution of the points in the set $P$ across the grid.
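The definitions of this subsection can be made concrete with a minimal Python sketch. It assumes, purely for illustration, that the ranking function orders points by Euclidean distance to the query position (the real ranking function is provider-specific), and the names `top_k`, `sim` and `similarity_matrix` are hypothetical.

```python
import math

def top_k(points, pos, k):
    """R_k(P, pos): names of the k points nearest to pos (assumed ranking)."""
    ranked = sorted(points, key=lambda name: (math.dist(points[name], pos), name))
    return frozenset(ranked[:k])

def sim(result_a, result_b, k):
    """Fraction of the top-k results shared by two result sets."""
    return len(result_a & result_b) / k

def similarity_matrix(points, r, c, x0, y0, k):
    """S_{x0,y0}[i][j] = Sim(<x0, y0>, <i, j>) over an r x c grid."""
    reference = top_k(points, (x0, y0), k)
    return [[sim(reference, top_k(points, (i, j), k), k) for j in range(c)]
            for i in range(r)]

# Toy data: four points of interest on a 5 x 5 grid, top-2 queries.
points = {"a": (0, 0), "b": (0, 1), "c": (4, 4), "d": (4, 3)}
S = similarity_matrix(points, 5, 5, 0, 0, 2)
default_region = [(i, j) for i in range(5) for j in range(5) if S[i][j] == 1.0]
```

Cells collected in `default_region` are those where the result set is unchanged, which corresponds to the default privacy region described above.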

The service-contour of a query reveals the regions where a certain percentage of the top-$k$ results is retained. Given a certain requirement on the fraction of results that must be retained (i.e. the utility that must be maintained), the area of the corresponding region is a measure of the privacy achievable by the user, since a query originating from any point in the region will return a result set with the desired utility. The user can calculate these regions for any level of utility requirement, which in other words implies that an overall picture of the privacy/utility trade-offs is available to the user for decision making. Trading between service accuracy and location inaccuracy is then a question of choosing a point in one of the demarcated regions.

Unfortunately, the user device cannot compute $S_{x_0,y_0}$ without access to $P$, which resides at the LBS provider. The LBS cannot compute $S_{x_0,y_0}$ since it requires access to the exact position $\langle x_0, y_0 \rangle$. The question we investigate is: what form of information can the LBS provide to the user to help infer the service-contour?

3.1.2 Service-contour inferencing

There exists a trivial solution to the raised question: push the set $P$ and the ranking function $R$ to the user, and perform the top-$k$ ranking locally on the user device. As one can see, this solution clearly ignores underlying communication overheads and policies on sharing business intelligence. Note that the set $P$ is not simply a collection of positions, but includes additional attributes about the businesses located at those positions. This could range from names, addresses, categories, subcategories, to specifics such as value, feedback scores, and entire profiles of individuals with personal information. The ranking function $R$ is often a well-guarded business secret on how these attributes are combined. Another approach is to send a set of similarity matrices to the user, one each corresponding to a specific coordinate in the grid. The approach requires the computation and transfer of an inordinate amount of information ($O(r^2c^2)$). Given a geographic area, our objective is to restrict the transfer of information to a bounded size, or $O(1)$. The service-contour inferencing problem is then defined as follows.

Service-contour inferencing: Given a set of points $P$ on a geographic area (represented as an $r \times c$ grid), a ranking function $R$, and a similarity function $Sim$, find functions $Enc$ and $Dec$ such that

1) output $T = Enc(P, R, Sim)$ is $O(1)$ in size, and
2) assuming $S'_{x,y} = Dec(T, \langle x, y \rangle)$, with $\langle x, y \rangle$ being any point on the grid, we have $S'_{x,y} = S_{x,y}$.

3.1.3 Approximate inferencing

Without the bounded size constraint, the service-contour inferencing problem can be solved by computing the top-$k$ results for each point in the grid, and then conveying an identification vector with respect to each point. An identification vector uniquely identifies the $k$ results corresponding to a point. The service-contour can then be exactly generated. This is an attractive choice provided the communication overhead is not exceedingly high. Note that the top-$k$ results induce a set of order $k$ Voronoi regions [14], [15], [16], each region sharing a certain result set. Therefore, the information to be conveyed may be highly compressible. We shall use the communication overhead of this method as a benchmark in the experimental analysis.

V:
V1   {a,b,c,d,e}
V2   {a,b,c,f,g}
V3   {f,g,h,i,j}

VSim:
      V1    V2    V3
V1    1     0.6   0
V2    0.6   1     0.4
V3    0     0.4   1

I:
1 1 1 2 2
1 1 1 2 2
1 1 2 2 2
1 1 2 3 3
3 3 3 3 3

Figure 3. Set $V$ shows hypothetical top-5 result sets on a $5 \times 5$ grid. $I$ depicts which result set is applicable at a point. $V_{Sim}$ shows pairwise similarity of the 3 unique result sets for the grid. The image is a compact representation of $I$ and $V_{Sim}$; grey color codes used are: 1-white=1.0, 2-grey=0.6 and 3-black=0.0.

Consider a hypothetical scenario where the top-$k$ results corresponding to a point can be represented by one of $V$ symbols. Further, a maximum entropy condition is achieved under arbitrary distribution of the points in $P$ across the grid. Therefore, each symbol is equi-probable ($1/V$). Under this setting, no lossless compression of the symbol sequence describing the top-$k$ results across the grid can achieve a compression level better than $\log_2 V$ bits per point, i.e. $rc \log_2 V$ bits for $T$. Assuming a $320 \times 320$ grid on a $32 \times 32$ km$^2$ area (a point then resembles a 100 m $\times$ 100 m area), and $V = 1000$ unique top-$k$ result sets generated for the points in this area, this number is around 124.5 KB. While this is not a large data transfer in itself, repeated querying will result in an accumulated overhead that is a significant fraction of typical bandwidth limitations. We seek algorithms that can avoid such a communication overhead (even in the worst case), yet provide a good approximation of $S_{x,y}$. Note that this observation assumes a worst case scenario and only pertains to the ability to correctly determine if two points have different (or the same) result sets. Computing the similarity would involve encoding additional identifier data corresponding to every set.
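The benchmark size quoted above follows directly from the $rc \log_2 V$ bound. The short Python sketch below reproduces the arithmetic; the function name `benchmark_bytes` is hypothetical, and a kilobyte is assumed to be 1024 bytes, which is the convention that matches the quoted figure.

```python
import math

def benchmark_bytes(rows, cols, symbols):
    """Lower bound on the lossless encoding size: rows * cols * log2(symbols) bits."""
    bits = rows * cols * math.log2(symbols)
    return bits / 8

size_kb = benchmark_bytes(320, 320, 1000) / 1024  # assumes 1 KB = 1024 bytes
print(f"{size_kb:.1f} KB")
```

The result lands between 124 and 125 KB, in line with the estimate in the text.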

3.2 Privacy-supported local search

The crucial piece of information to infer the service-contour is the similarity measure $Sim$ that tells the percentage overlap in the result sets from two points. Given that the top-$k$ result sets (the output of $R$) do not always change as one moves from one point to the next, the same calculation is performed (operates on the same data) by $Sim$ for most pairs of points. Let us denote by $V$ the set of distinct outputs of $R$ for the points of the grid, i.e. $V = \{R_k(P, \langle x, y \rangle) \mid 1 \le x \le c, 1 \le y \le r\}$. Note that the size of $V$ is going to be comparatively smaller than the size of the grid. Let $V_{Sim}$ be a matrix that denotes the $Sim$ values on pairs of elements of $V$, i.e.

$$V_{Sim}[i, j] = Sim(V_i, V_j), \quad V_i, V_j \in V.$$

Next, we define an $r \times c$ index matrix $I$ such that $I[i, j] = t$ implies $R_k(P, \langle i, j \rangle) = V_t$, where $V_t$ is a member of $V$. Fig. 3 captures the relationship between $V$, $V_{Sim}$ and $I$. In the same figure, we also see another representation of the three sets in the form of a $5 \times 5$ pixel image. The color of each pixel is indicative of points having the same value in $I$. In addition, the similarity measure, as computed in $V_{Sim}$, can be inferred from the shades of the colors.

$$Sim(\langle x, y \rangle, \langle x_0, y_0 \rangle) = 1 - |color(x, y) - color(x_0, y_0)|$$

For example, the result set similarity between the points $\langle 3, 3 \rangle$ and $\langle 5, 5 \rangle$ is $V_{Sim}[2, 3] = 0.4$, which can also be derived as $1 - |0.6 - 0.0|$. The advantage here is that the similarity information is conveyed without the need to communicate $V$. The representation is rather straightforward in this example, but need not be so for arbitrary $V$, $V_{Sim}$ and $I$.
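As an illustration of how $V$, $I$ and $V_{Sim}$ relate, the following Python sketch derives all three from a grid of result sets and checks the Fig. 3 example. The function name `build_profile` and the first-seen numbering of result sets are assumptions made for this sketch, not details taken from the paper.

```python
def build_profile(result_grid, k):
    """Return (V, I, V_Sim) for a grid whose cells hold top-k result sets."""
    distinct = []  # V: distinct result sets, in order of first appearance
    index = []     # I: 1-based position in V of each cell's result set
    for row in result_grid:
        index_row = []
        for result in row:
            if result not in distinct:
                distinct.append(result)
            index_row.append(distinct.index(result) + 1)
        index.append(index_row)
    # V_Sim: fraction of shared results for every pair of distinct sets
    v_sim = [[len(a & b) / k for b in distinct] for a in distinct]
    return distinct, index, v_sim

# The Fig. 3 example: three top-5 result sets laid out on a 5 x 5 grid.
V1, V2, V3 = frozenset("abcde"), frozenset("abcfg"), frozenset("fghij")
layout = [[1, 1, 1, 2, 2],
          [1, 1, 1, 2, 2],
          [1, 1, 2, 2, 2],
          [1, 1, 2, 3, 3],
          [3, 3, 3, 3, 3]]
grid = [[(V1, V2, V3)[t - 1] for t in row] for row in layout]
V, I, V_sim = build_profile(grid, 5)
grey = {1: 1.0, 2: 0.6, 3: 0.0}           # grey codes from Fig. 3
from_colors = 1 - abs(grey[2] - grey[3])  # similarity of sets 2 and 3 via shades
```

Here `from_colors` agrees with the $V_{Sim}[2, 3]$ entry (index `[1][2]` in zero-based Python), which is the point of the worked example above.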

3.2.1 Multi-dimensional scaling

The example above involves determining three greyscale color codes (values in $[0, 1]$) such that the Euclidean distance between two values reflects the dissimilarity, i.e. one minus the similarity measurement given by $V_{Sim}$. The objective is not different when $V_{Sim}$ has significantly more entries. We adopt the classical method of multi-dimensional scaling at this step. The multi-dimensional scaling problem is stated as follows for the problem at hand.

Multi-dimensional scaling: Given a set of top-$k$ result sets $V = \{V_1, V_2, \ldots, V_n\}$ and a similarity matrix $V_{Sim}$, obtain a set of $n$ $m$-dimensional vectors $c_1, c_2, \ldots, c_n$ that minimizes

$$\sum_{i<j} \left( Euc(c_i, c_j) - (1 - V_{Sim}[i, j]) \right)^2.$$

$Euc$ is the Euclidean distance function. The scaling happens from a $k$-dimensional space to an $m$-dimensional space. For the case when a minimum value of zero exists (and is found), the Euclidean distance between any two vectors $c_i$ and $c_j$ is equal to the dissimilarity between two result sets $V_i$ and $V_j$. Such distance preserving embedding of high dimensional data is readily useful for data visualization. Numerical solvers for a multi-dimensional scaling problem are included in most statistical packages. We use the implementation provided in the cmdscale function of the R statistical package. The implementation follows the analysis of Mardia [17]. We use a value of $m = 3$ since it allows one to graphically visualize the similarity trend in the form of an RGB color image. Higher values of $m$ allow for the possibility of better distance preservation, but result in a larger encoded size.
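To make the scaling step tangible, here is a minimal pure-Python sketch of classical (Torgerson) scaling, the technique that `cmdscale` provides in R. It is not the R implementation: the function name `classical_mds` is hypothetical, and a shifted power iteration with deflation stands in for a proper eigen-solver, which is adequate only for small, well-separated examples.

```python
import math

def classical_mds(dissim, m=3, iters=1000):
    """Return n m-dimensional vectors whose distances approximate dissim."""
    n = len(dissim)
    # Double-centre the squared dissimilarities: B = -1/2 * J * D^2 * J.
    sq = [[dissim[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(row) / n for row in sq]
    all_mean = sum(row_mean) / n
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + all_mean)
          for j in range(n)] for i in range(n)]
    coords = [[0.0] * m for _ in range(n)]
    for dim in range(m):
        # Shifting by a Gershgorin bound makes the largest eigenvalue of B
        # (not merely the largest in magnitude) dominate the iteration.
        shift = max(sum(abs(x) for x in row) for row in b)
        v = [1.0 + i for i in range(n)]
        for _ in range(iters):
            w = [sum(b[i][j] * v[j] for j in range(n)) + shift * v[i]
                 for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        vv = sum(x * x for x in v)
        lam = sum(v[i] * b[i][j] * v[j]
                  for i in range(n) for j in range(n)) / vv
        if lam <= 1e-12:
            break  # no positive eigenvalue left; remaining coordinates stay 0
        for i in range(n):
            coords[i][dim] = v[i] * math.sqrt(lam / vv)
        # Deflate so that the next pass finds the next eigenpair.
        b = [[b[i][j] - lam * v[i] * v[j] / vv for j in range(n)]
             for i in range(n)]
    return coords

# Dissimilarities (1 - V_Sim) for the three result sets of Fig. 3.
d = [[0.0, 0.4, 1.0], [0.4, 0.0, 0.6], [1.0, 0.6, 0.0]]
c = classical_mds(d)
```

For this small example the dissimilarities can be embedded exactly, so the pairwise distances between the rows of `c` match `d` up to rounding error. For general $V_{Sim}$ matrices an exact embedding in three dimensions need not exist, and the result is only an approximation.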

The $Enc$ function based on 3-dimensional scaling then operates as follows: each component of the $c_i$ vectors is normalized to the $[0, 1]$ interval, and an $r \times c$ pixel image is created with the RGB color of pixel $(i, j)$ set to $c_{I[i,j]}$. This image is the output $T$ produced by the $Enc$ function and communicated to the user. Although a vector $c_i$ can take infinitely many values in $[0, 1]^3$, the number of possibilities reduces to 16.7 million due to the color mapping. Fig. 1 in Appendix A (see supplementary file) illustrates an example image created by $Enc$ for 10-nearest Starbucks coffee shop locations in the city of Los Angeles, CA (1024 square kilometers area centered around Los Angeles City Hall).

[Figure 4 labels: box, inscribed-circle, fill-out, push out, erroneous inclusion, user location $\langle x_0, y_0 \rangle$.]

Figure 4. Heuristics for service-contour inferencing. Shaded regions depict true areas with a given service similarity. Output of fill-out is shown as a dashed-line around the determined area.
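The image construction performed by $Enc$ can be sketched in a few lines of Python. Two details are assumptions of this sketch rather than statements from the paper: all components are normalized with one common offset and scale (so that relative distances stay proportional), and each component is quantized to 8 bits, which gives the $256^3 \approx 16.7$ million colors mentioned above. The name `encode_image` is hypothetical.

```python
def encode_image(index, coords):
    """Map each grid point to the 8-bit RGB colour of its result set."""
    lo = min(min(c) for c in coords)
    hi = max(max(c) for c in coords)
    span = (hi - lo) or 1.0  # avoid division by zero when all values agree
    # One colour per result set: joint normalisation to [0, 1], then 8 bits.
    palette = [tuple(round(255 * (x - lo) / span) for x in c) for c in coords]
    # Index matrix entries are 1-based references into the palette.
    return [[palette[t - 1] for t in row] for row in index]

index = [[1, 1, 1, 2, 2],
         [1, 1, 1, 2, 2],
         [1, 1, 2, 2, 2],
         [1, 1, 2, 3, 3],
         [3, 3, 3, 3, 3]]
coords = [(0.0, 0.0, 0.0), (0.4, 0.0, 0.0), (1.0, 0.0, 0.0)]  # illustrative vectors
image = encode_image(index, coords)
```

The size of `image` depends only on the grid dimensions, not on the number of points of interest, which is what keeps the transferred profile bounded.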

3.2.2 Inferring the service-contour

In order to retrieve the service-contour from $T$, the $Dec$ function uses the location of the user $\langle x_0, y_0 \rangle$ as a point of reference for similarity comparison. Let $T_{x,y}$ be the RGB color vector at the $(x, y)$ pixel in $T$. The Euclidean distance between $T_{x_0,y_0}$ and the color vector $T_{i,j}$ of any other pixel $(i, j)$ (a point in the grid) attempts to closely estimate the dissimilarity measure, the similarity estimate then being $S'_{x_0,y_0}[i, j] = 1 - Euc(T_{x_0,y_0}, T_{i,j})$. The $Dec$ function then simply computes this estimate for all possible points $\langle i, j \rangle$ in the grid. Computation of the service-contour can also be parameterized by a similarity threshold such that points in the grid with a similarity estimate higher than or equal to the threshold are the only ones identified. To do so, one can begin at point $\langle x_0, y_0 \rangle$ and continue to explore neighboring points as long as the similarity estimate satisfies the threshold. We explore three fast heuristics in order to avoid a point by point generation of the service-contour. Fig. 4 illustrates the difference between them.
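A minimal Python sketch of the decoding step and of the point by point exploration follows. It assumes the image holds color vectors already scaled to $[0, 1]$, uses zero-based (row, column) indices, and explores 4-connected neighbors; the names `decode` and `contour` are hypothetical.

```python
import math
from collections import deque

def decode(image, row0, col0):
    """S'[i][j] = 1 - Euc(T[row0][col0], T[i][j]) for every grid point."""
    ref = image[row0][col0]
    return [[1.0 - math.dist(ref, pixel) for pixel in row] for row in image]

def contour(image, row0, col0, threshold):
    """Points reachable from the user whose estimate meets the threshold."""
    rows, cols = len(image), len(image[0])
    ref = image[row0][col0]
    region = {(row0, col0)}
    queue = deque(region)
    while queue:
        r, c = queue.popleft()
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if (0 <= nr < rows and 0 <= nc < cols and (nr, nc) not in region
                    and 1.0 - math.dist(ref, image[nr][nc]) >= threshold):
                region.add((nr, nc))
                queue.append((nr, nc))
    return region

# Greyscale version of the Fig. 3 image, one colour component per pixel.
grey = {1: (1.0,), 2: (0.6,), 3: (0.0,)}
index = [[1, 1, 1, 2, 2],
         [1, 1, 1, 2, 2],
         [1, 1, 2, 2, 2],
         [1, 1, 2, 3, 3],
         [3, 3, 3, 3, 3]]
image = [[grey[t] for t in row] for row in index]
default_region = contour(image, 0, 0, 1.0)  # result set fully preserved
relaxed_region = contour(image, 0, 0, 0.5)  # at least half of the results kept
```

Lowering the threshold enlarges the region, which is the privacy gain obtained by accepting a less accurate result set.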

Box: Starting from the user location $\langle x_0, y_0 \rangle$, a box is grown by pushing the four edges outward (in clockwise order), one point-step at a time. Edge pushing along a direction is stopped whenever doing so would result in the inclusion of a point with a similarity estimate less than the threshold.

Inscribed-circle: Box-expansion tends to cover inaccurate points (those outside the threshold) in the corner areas, especially when similarity estimates are not exact. A circular region inscribed in the box, centered at $\langle x_0, y_0 \rangle$, eliminates such errors on the corners of the box.

Fill-out: While an inscribed-circle is good at reducing the error in some cases, it cannot cover irregularly shaped regions within the threshold. The fill-out method expands the circular region by including neighboring points that have the same color vectors as points within the inscribed-circle.
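The three heuristics can be sketched as follows. The sketch makes several assumptions that the text does not fix: `sim` is a precomputed grid of similarity estimates, the circle radius is the smallest distance from the user location to any box edge, and "neighboring" means the four axis-aligned cells. The function names are hypothetical.

```python
from collections import deque


def grow_box(sim, ref, threshold):
    """Return inclusive bounds (top, bottom, left, right) of the grown box."""
    rows, cols = len(sim), len(sim[0])
    top = bottom = ref[0]
    left = right = ref[1]
    active = {"up", "right", "down", "left"}
    while active:
        for side in ("up", "right", "down", "left"):  # clockwise order
            if side not in active:
                continue
            if side == "up":
                edge = [(top - 1, j) for j in range(left, right + 1)]
            elif side == "right":
                edge = [(i, right + 1) for i in range(top, bottom + 1)]
            elif side == "down":
                edge = [(bottom + 1, j) for j in range(left, right + 1)]
            else:
                edge = [(i, left - 1) for i in range(top, bottom + 1)]
            ok = all(0 <= i < rows and 0 <= j < cols and sim[i][j] >= threshold
                     for i, j in edge)
            if not ok:
                active.discard(side)  # this edge can no longer be pushed
            elif side == "up":
                top -= 1
            elif side == "right":
                right += 1
            elif side == "down":
                bottom += 1
            else:
                left -= 1
    return top, bottom, left, right


def inscribed_circle(box, ref):
    """Grid points within the largest circle centered at ref that fits in the box."""
    top, bottom, left, right = box
    radius = min(ref[0] - top, bottom - ref[0], ref[1] - left, right - ref[1])
    return {(i, j)
            for i in range(ref[0] - radius, ref[0] + radius + 1)
            for j in range(ref[1] - radius, ref[1] + radius + 1)
            if (i - ref[0]) ** 2 + (j - ref[1]) ** 2 <= radius ** 2}


def fill_out(image, circle):
    """Expand the circle over neighbors whose color already occurs inside it."""
    rows, cols = len(image), len(image[0])
    colors = {image[i][j] for i, j in circle}
    region = set(circle)
    queue = deque(circle)
    while queue:
        i, j = queue.popleft()
        for ni, nj in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)):
            if (0 <= ni < rows and 0 <= nj < cols and (ni, nj) not in region
                    and image[ni][nj] in colors):
                region.add((ni, nj))
                queue.append((ni, nj))
    return region


# Toy 5 x 5 image with an irregular region of color A around the user at (1, 1).
A, B = (10, 10, 10), (200, 200, 200)
image = [[A, A, A, B, B],
         [A, A, A, B, B],
         [A, A, A, A, B],
         [A, A, A, B, B],
         [B, B, B, B, B]]
sim = [[1.0 if color == A else 0.0 for color in row] for row in image]
box = grow_box(sim, (1, 1), 1.0)
circle = inscribed_circle(box, (1, 1))
region = fill_out(image, circle)
```

In this toy grid the box stops at column 2 because column 3 contains a non-matching cell, so it misses the matching cell at (2, 3); fill-out recovers it because it follows color equality rather than a rectangular shape.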

An interactive process of inference would involve determining the service-contour for a given threshold value (say 90%), and then progressively growing it depending on the area of the region inferred at a certain threshold. We refrain from using methods based on computational geometry due to their higher processing requirements.

Note that we have excluded the possibility of a malicious server model in this scheme. A malicious server can manipulate the similarity data to create the impression that no two neighboring cells have the same result set. However, it would not be correct to state that such manipulations will force the user to reveal her precise location. The decision on whether a default privacy region is sufficiently large is user-driven. A distorted picture of the similarity profile may in fact drive the user to believe that no reasonable privacy can be achieved in the application, and thereby to discontinue using it. In another case, a privacy-aware user may still pick a location from a larger area, i.e. trade accuracy (although based on distorted information) for privacy. Hence, even after a malicious server manipulates the similarity matrix intelligently, it is not guaranteed that the location communicated by the user is true, or a consequence of the privacy/accuracy trade-off process. In addition, the server must also keep the user motivated to use the service. This in itself is much more difficult once the user observes discrepancies between the final query answers and the physical realities. A formal evaluation substantiating these arguments would be useful; otherwise, distributed methods to share trust scores on service providers can be sought to identify malicious servers.

4 EMPIRICAL EVALUATION

The empirical evaluation is performed using the SimpleGeo Places dataset, which contains information on more than 20 million places around the world and is distributed under the Creative Commons open license. The US part of the dataset has 12,993,248 entries, with data corresponding to multiple business categories and sub-categories. Entries are maintained in the GeoJSON format, and include attributes such as name, latitude/longitude, address, phone numbers, classifiers (category, type, subcategory) and tags. In our study, a place is considered a match for the search keyword if it includes the keyword in any of these attributes, and the city matches the city attribute. The evaluation is performed for the four largest cities in the USA: Los Angeles, Houston, Chicago and New York. One of the factors influencing the top-k results is the number of objects returned by a query, and their distribution around the query point. The existence of a large number of objects implies that the top-k results are likely to change for small changes in location. For objects that are low in density, large variations in the location are possible without changing the result set. This behavior can be reasonably assumed irrespective of the density of users in the city. Therefore, we choose large cities where we can obtain different densities of objects, especially ones with high densities. Objects that are high in density in large cities may not be so in a smaller city. Hence, we believe that a comprehensive evaluation can be performed by considering these large cities.

For each city, a 1024 km$^2$ area is used as the high-level generalization $G$ to generate the similarity profile. A $320 \times 320$ cell grid is superimposed on this area. Each cell then reflects a 100 m $\times$ 100 m area. This approach implicitly assumes that positioning a user in a cell is equivalent to exactly locating her. For Los Angeles and Houston, the city center is at the center of this grid ($\langle 160, 160 \rangle$). For Chicago and New York, the city centers are at $\langle 288, 160 \rangle$ and $\langle 32, 160 \rangle$ respectively. The geographic coordinates are provided in Appendix A. Euclidean distance based nearest neighbor is used as the ranking function, with $k = 10$. We employ the cover tree algorithm by Beygelzimer et al. [18] to determine the 10 nearest query matches with respect to a point on the grid.
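The grid arithmetic and the ranking step can be sketched as below. Note the substitution: this sketch uses a brute-force sort rather than the cover tree of [18], purely to stay self-contained. It also assumes that a cell is represented by its center point and that objects are given as `(id, x, y)` tuples in meters; both are illustrative choices, and the names are hypothetical.

```python
import math

CELL_M = 100   # cell side in meters
GRID = 320     # cells per side; 320 * 100 m = 32 km, and 32 km * 32 km = 1024 km^2


def cell_center(i, j):
    """Coordinates (in meters) of the center of cell (i, j)."""
    return ((i + 0.5) * CELL_M, (j + 0.5) * CELL_M)


def top_k(cell, objects, k=10):
    """IDs of the k objects nearest to the cell center (brute force, not a cover tree)."""
    cx, cy = cell_center(*cell)
    ranked = sorted(objects, key=lambda o: math.hypot(o[1] - cx, o[2] - cy))
    return frozenset(o[0] for o in ranked[:k])


# Fifteen synthetic objects spaced 1 km apart along one axis.
objects = [(n, n * 1000.0, 0.0) for n in range(15)]
nearest = top_k((0, 0), objects, k=3)
```

Returning a `frozenset` reflects the treatment of a result as an unordered set, so two cells can be compared by checking whether their result sets are equal.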

Instead of experimenting with a large corpus of search keywords, we generalize the notion of query points into low, medium and high density objects. Low density objects result from targeted queries, with frequencies ranging from 10 to 50 within the grid. Queries resulting in 50 to 200 objects are considered medium density, while frequencies higher than that are considered high density. We were able to generate low density objects by using search terms such as “bowling”, “electronics store” and local grocery store names in the cities. Medium density objects are generated from search terms such as “starbucks coffee” and “police”. High density objects are generated by heavily generic terms such as “atm” and “gas station.” For the high density case, frequencies were often observed to be in the range of 400 to 900. The search keyword itself does not hold much importance for this study, but is used to retrieve query point distributions that reflect the real world. The results below combine performance measures irrespective of what search term produced them; the only distinction made is with respect to the density.

4.1 Evaluation process

Performance of the Enc and Dec functions is measured using precision and recall metrics. Given a threshold, we arrive at a set of points $Z$ on the grid that the user can use to perturb her location. Depending on the accuracy of maintaining similarities, and the subsequent estimation by the three heuristics, this set of points may be over- or underestimated. If $Z_{true}$ is the true set of points satisfying the threshold, then the precision is given as the fraction of points in $Z$ that are also in $Z_{true}$. Recall is the fraction of points in $Z_{true}$ that are also in $Z$.

$$\mathrm{Precision} = \frac{|Z \cap Z_{true}|}{|Z|}; \qquad \mathrm{Recall} = \frac{|Z \cap Z_{true}|}{|Z_{true}|}$$

Precision can be viewed as the probability that the service similarity guarantee (within the threshold) is not violated. Recall measures the ability to identify the areas where a certain level of service similarity is guaranteed. While precision can be viewed as a measure of the quality of service, the absolute recalled area ($|Z \cap Z_{true}|$) is the size of the geographic region where the user can hide herself, and yet retrieve true query results (within the threshold). In other words, the recall-area may be viewed as a measure of the privacy level obtained by the user.

[Figure 5 plots: panels for Los Angeles, Chicago and New York; each shows precision and recall for the box, inscribed-circle and fill-out heuristics, and recall-area (sq. km) against distance from city center (km).]

Figure 5. Precision and recall when searching for “starbucks coffee” in a given city. Each plot shows performance of fill-out for a threshold of 1.0 (leftmost) and then three sets of rectangles, one each for thresholds of 0.9, 0.8 and 0.7 (from left to right). The lower edge of a rectangle represents the 10th percentile, the upper edge represents the median (50th percentile), and the dot represents the 25th percentile. Also shown is the area recalled (in km$^2$) by the fill-out heuristic as a user moves away (distance in km) from the city center. Trend lines are marked with the corresponding threshold value.
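The precision, recall and recall-area measures of Section 4.1 can be computed directly from the two point sets. The sketch below assumes both sets are non-empty Python sets of grid cells and converts the recalled cell count to km$^2$ using the 100 m cell size of the evaluation grid; the function name is hypothetical.

```python
CELL_AREA_KM2 = 0.01  # one 100 m x 100 m cell


def precision_recall(z, z_true):
    """Return (precision, recall, recalled area in km^2) for non-empty cell sets."""
    hits = len(z & z_true)
    return hits / len(z), hits / len(z_true), hits * CELL_AREA_KM2


# Toy example: the heuristic reports four cells, three of which are truly valid.
z = {(0, 0), (0, 1), (1, 0), (1, 1)}
z_true = {(0, 0), (0, 1), (0, 2), (1, 0), (2, 0), (2, 1)}
precision, recall, area = precision_recall(z, z_true)
```

Here one reported cell violates the similarity guarantee (precision 0.75) and half of the truly valid cells are found (recall 0.5).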

Experiments are performed for four service similarity thresholds: 1.0, 0.9, 0.8 and 0.7. For each value, precision and recall are calculated for the three heuristics using a sample of points as the user location $\langle x_0, y_0 \rangle$ on the grid. The sample consists of 1521 points uniformly distributed on the grid, with a sample point every 800 m (0.5 mi) along the horizontal and vertical directions. For a threshold of 1.0, results are only reported for the fill-out heuristic.
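The sample size follows from the grid dimensions: 800 m corresponds to 8 cells, and a 320-cell side admits 39 interior multiples of 8, giving $39 \times 39 = 1521$ points. The sketch below reproduces this count under the assumption (not stated in the text) that sample points lie at interior multiples of 800 m and exclude the grid border.

```python
GRID = 320  # cells per side
STEP = 8    # 800 m spacing divided by the 100 m cell size

# Interior sample points at every 8th cell in both directions.
samples = [(i, j)
           for i in range(STEP, GRID, STEP)
           for j in range(STEP, GRID, STEP)]
```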

4.2 The case of “starbucks coffee”

The case of locally searching for a coffee shop (e.g. “starbucks coffee”) often comes up in location privacy discussions. We present the detailed comparative results with respect to a privacy-aware user trying to find the nearest Starbucks coffee shop location. Fig. 5 and Fig. 6 show the comparative efficiency of the three heuristics in the four cities. For each city, the precision and recall plots show the performance of fill-out for a threshold of 1.0 (leftmost) and then three sets of rectangles, one each for thresholds of 0.9, 0.8 and 0.7 (from left to right). A precision and recall of 1.0 for fill-out at a threshold of 1.0 implies that a privacy-indifferent user does not lose any accuracy in the result set as a result of the process. In addition, the heuristic exactly reveals the default privacy region with respect to the issued query. For the other threshold values, each rectangle shows the 10th percentile (lower edge), 25th percentile (center dot) and 50th percentile (upper edge) of the computed precision and recall values. Recall that the pth percentile is the value below which p percent of the observations lie. The inscribed-circle and fill-out heuristics guarantee 90% or more precision for 75% of the points sampled on the grid (possible user locations), as indicated by the 25th percentile, across the four cities. This is observed irrespective of the service similarity requirement imposed by a user. Precision for the box heuristic is comparatively worse because of its tendency towards erroneous inclusion of points. As expected, inscribed-circle clearly improves upon this, but results in an extensive pruning of the identified regions (poor recall). It is not difficult to create a heuristic with high precision; however, the desirable one has high recall as well.

Fill-out improves upon the recall of inscribed-circle without heavily degrading the precision. However, the recall values themselves are all below 50%. The bottom of each plot shows trend lines depicting how the area recalled ($|Z \cap Z_{true}|$ in km$^2$) by the fill-out heuristic changes as a user moves away from the city center. The query object (“starbucks coffee”) has a relatively higher concentration near the city center areas. The trend line for a threshold of 1.0 (for which fill-out has 100% recall) indicates that the default privacy region may not be significantly large when query objects are concentrated. However, areas as large as 20-40 km$^2$ become available within 8 km (approximately 5 mi) of the city center, provided one or two incorrect results are acceptable. This is despite the poor recall of the heuristic. These areas will presumably be large enough for a privacy-conscious user, given that the observations hold more strongly for regions that are less crowded. Note that lowering the service accuracy requirement further can expand the determined area. Object locations in this case, although not the nearest ones, will not be unrealistically far away.

[Figure 6 plots: panel for Houston; precision and recall for the box, inscribed-circle and fill-out heuristics, and recall-area (sq. km) against distance from city center (km).]

Figure 6. Precision and recall when searching for “starbucks coffee” in the city of Houston, Texas. See Fig. 5 caption for details.

4.3 Precision/recall trends

The precision and recall trends we observe for the case of “starbucks coffee” are repeated for the other medium density experiment (derived using the keyword “police”). For the fill-out heuristic, Fig. 7 shows the mean (across the search keywords) of the 25th percentiles of the precision scores for different object densities. Full precision for low density objects is almost guaranteed, irrespective of the service accuracy threshold. However, the approach has difficulty maintaining those same values for high density objects. High density objects are often located close to each other, thereby creating a scenario where moving small distances significantly changes the result set. It also means that finding such objects is not difficult in the real world. Note that the density designation is not based on what is being queried: a “gas station” could be a high density object in parts of a city, and low/medium in others. In the latter case, when finding one by simply looking around could become difficult, local search is possible in a privacy-supportive manner. The ranking function is also a crucial component in deciding the density of objects. For instance, a ranking function that accounts for local reviews of restaurants while making suggestions will result in a low density categorization for the keyword “restaurants”, meaning the top-k result set does not change significantly even for a high concentration of restaurants in the area.

[Figure 7 plots: panels for Los Angeles, Houston, Chicago and New York; precision against service similarity threshold (1.0, 0.9, 0.8, 0.7) for low, medium and high density objects.]

Figure 7. Precision of the fill-out heuristic for different service similarity thresholds (0.7, 0.8, 0.9, 1.0) and object densities (low, medium, high). Vertical bar shows one standard deviation.

The recalled area is also significantly large for low density objects, occasionally dropping when clusters of such objects are found. Fig. 8 depicts this drop for the cities of Chicago and New York. The observation reiterates the fact that object densities can be locally high. The conclusions made in the “starbucks coffee” case remain applicable in general to the recalled area for medium density objects. Refer to Section 3 in the supplementary file for results on the communication overhead associated with the proposed methodology.

4.4 Conclusions

Based on the observations from the empirical study, we draw the following conclusions on the efficacy of a privacy-supportive local search application.

Precise geo-locations are necessary for result set accuracy when the queried objects exist as a dense cluster in the search area. It seems unlikely that both location privacy and result exactness can be maintained in this case. A privacy-supportive application would allow the user to aggressively trade off the service similarity requirement to determine a sufficiently large area for location perturbation. Given the high density of objects, resulting objects can still be expected to be in the near vicinity.

[Figure 8 plots: panels for Los Angeles, Houston, Chicago and New York; recall-area (sq. km) against distance from city center (km), with trend lines for thresholds 0.7, 0.8, 0.9 and 1.]

Figure 8. Area (km$^2$) recalled by the fill-out heuristic for different service similarity thresholds (0.7, 0.8, 0.9, 1.0), as the user moves away (distance in km) from the city center. Top plots are for low density objects and bottom plots for medium density objects.

When object density is low, location accuracy has a minor role to play in retrieving relevant results. A privacy-supportive application would help identify the large default-privacy regions resulting in such situations.

Next generation telecommunication systems could very well make it possible to quickly (and cost-effectively) transfer all information required to infer the service-contour exactly. Until then, approximate inferencing algorithms can be used to reduce the communication overhead.

5 SUMMARY

In this paper, we proposed a novel architecture to help identify privacy and utility trade-offs in a location-based service. The architecture has a user-centric design that delays the sharing of a location coordinate until the user has evaluated the impact of its accuracy on the service quality. Using the prototypical example of a local search application, we showed the form of information that can be exchanged between the user and the provider to enable a privacy-supportive LBS. Section 4 of the supplementary file suggests some future directions of research for this work.

REFERENCES

[1] J. Sythoff and J. Morrison, Location-Based Services: Market Forecast, 2011-2015. Pyramid Research, 2011.

[2] P. Golle and K. Partridge, “On the Anonymity of Home/Work Location Pairs,” in Proceedings of the 7th International Conference on Pervasive Computing, 2009, pp. 390–397.

[3] H. Zang and J. Bolot, “Anonymization of Location Data Does Not Work: A Large-Scale Measurement Study,” in Proceedings of the 17th Annual International Conference on Mobile Computing and Networking, 2011, pp. 145–156.

[4] M. Duckham and L. Kulik, “A Formal Model of Obfuscation and Negotiation for Location Privacy,” in Proceedings of the 3rd International Conference on Pervasive Computing, 2005, pp. 152–170.

[5] H. Kido, Y. Yanagisawa, and T. Satoh, “An Anonymous Communication Technique Using Dummies for Location-Based Services,” in Proceedings of the IEEE International Conference on Pervasive Services, 2005, pp. 88–97.

[6] R. Cheng, Y. Zhang, E. Bertino, and S. Prabhakar, “Preserving User Location Privacy in Mobile Data Management Infrastructures,” in Proceedings of the 6th Workshop on Privacy Enhancing Technologies, 2006, pp. 393–412.

[7] M. L. Yiu, C. S. Jensen, X. Huang, and H. Lu, “SpaceTwist: Managing the Trade-Offs Among Location Privacy, Query Performance, and Query Accuracy in Mobile Services,” in Proceedings of the 24th International Conference on Data Engineering, 2008, pp. 366–375.

[8] M. Gruteser and D. Grunwald, “Anonymous Usage of Location-Based Services Through Spatial and Temporal Cloaking,” in Proceedings of the 1st International Conference on Mobile Systems, Applications, and Services, 2003, pp. 31–42.

[9] B. Gedik and L. Liu, “Protecting Location Privacy with Personalized k-Anonymity: Architecture and Algorithms,” IEEE Transactions on Mobile Computing, vol. 7, no. 1, pp. 1–18, 2008.

[10] P. Samarati, “Protecting Respondents’ Identities in Microdata Release,” IEEE Transactions on Knowledge and Data Engineering, vol. 13, no. 6, pp. 1010–1027, 2001.

[11] G. Ghinita, P. Kalnis, and S. Skiadopoulos, “PRIVE: Anonymous Location-Based Queries in Distributed Mobile Systems,” in Proceedings of the 16th International Conference on World Wide Web, 2007, pp. 371–380.

[12] P. Kalnis, G. Ghinita, K. Mouratidis, and D. Papadias, “Preventing Location-Based Identity Inference in Anonymous Spatial Queries,” IEEE Transactions on Knowledge and Data Engineering, vol. 19, no. 12, pp. 1719–1733, 2007.

[13] G. Ghinita, K. Zhao, D. Papadias, and P. Kalnis, “A Reciprocal Framework for Spatial k-Anonymity,” Journal of Information Systems, vol. 35, no. 3, pp. 299–314, 2010.

[14] P. K. Agarwal, M. de Berg, J. Matousek, and O. Schwarzkopf, “Constructing Levels in Arrangements and Higher Order Voronoi Diagrams,” in Proceedings of the 10th Annual Symposium on Computational Geometry, 1994, pp. 67–75.

[15] F. Aurenhammer and O. Schwarzkopf, “A Simple On-line Randomized Incremental Algorithm for Computing Higher Order Voronoi Diagrams,” in Proceedings of the 7th Annual Symposium on Computational Geometry, 1991, pp. 142–151.

[16] D.-T. Lee, “On k-Nearest Neighbor Voronoi Diagrams in the Plane,” IEEE Transactions on Computers, vol. C-31, no. 6, pp. 478–487, 1982.

[17] K. V. Mardia, “Some Properties of Classical Multidimensional Scaling,” Communications in Statistics – Theory and Methods, vol. A, no. 7, pp. 1233–1241, 1978.

[18] A. Beygelzimer, S. Kakade, and J. Langford, “Cover Trees for Nearest Neighbor,” in Proceedings of the 23rd International Conference on Machine Learning, 2006, pp. 97–104.


Supplement: Exploiting Service Similarity for Privacy in Location Based Search Queries

Rinku Dewri, Member, IEEE, and Ramakrisha Thurimella


1 ADDITIONAL RELATED WORK

Location privacy preservation has received significant interest over the past decade, among both policy makers and academic researchers. Legislative efforts to preserve location privacy date back to the United States' Communications Act of 1934, wherein “Section 222 requires telecommunications carriers to provide confidentiality for customer information as proprietary information of another common carrier.” Disclosure is only allowed during emergency situations, or with permission of the customer. Efforts are ongoing to enforce more specific laws related to geolocation information tracking and sharing (e.g. Location Privacy Act of 2011, currently in the first step of the legislative process). However, laws are often regional: while policies in the European Union may require every user to consent to location sharing, a policy in the United States may require users to opt out of a default sharing. Nonetheless, the important question that still remains open is whether a user can derive any reasonable utility out of the location-based service and still protect her location information.

Multiple suggestions are available on how a cloaking region should be formed. Bamba et al. enforce a location l-diversity requirement in addition to k-anonymity, where the number of still-object counts must also be above a user-specified threshold [1]. Liu et al. propose that a minimum level of entropy should also be maintained in the queries originating from the cloaking region [2]. Dewri et al. have extended these concepts to the case of continuous services [3], [4]. Shin et al. introduce profile anonymization in cloaking regions, wherein at least $k - 1$ other users with the same profile (denoted by a vector) as the request issuer are present [5]. Riboni et al. make a similar argument, but in the context of service parameters. Inferences that can be drawn based on these parameters are avoided by smoothing the differences among the distributions of the parameters in requests from different cloaking regions [6].

A mix zone model is presented for location privacy by Beresford and Stajano [7]. The objective of mix zones is to prevent tracking of long-term user movements, while short-term revelation of location data is permissible. A trusted middleware usually mixes the identities of users in specific zones, thereby preventing continuous tracking. Extensions of this technique are proposed for the scenario where user movements are constrained to road networks [8].

• R. Dewri and R. Thurimella are with the Department of Computer Science, University of Denver, CO 80208, USA. Email: {rdewri,ramki}@cs.du.edu.

Mokbel et al. explore query processing of different types on spatial regions: private queries over public data, public queries over private data, and private queries over private data [9]. Their effort is directed towards facilitating different query formats using cloaking regions. Lee et al. explore privacy concerns in path queries where source and destination inputs may reveal personal information about users [10]. They propose the notion of obfuscated path queries where multiple sources and destinations are specified to hide the true inputs. Although we do not focus on continuous location-based services in this work, it is worth noting that certain locations (home or work places) reveal more information about a user. Hence, the privacy expectations are also bound to be different when users are at such locations. Historical location data is used by Xu and Cai in a variant of location k-anonymity, where the cloaking region is required to have at least k different footprints [11]. In a later work, the authors argue that the impact of a privacy parameter, such as k, on the level of privacy is often difficult to perceive. Hence, they treat privacy as a feeling-based property and propose using the popularity of a public region as the privacy level [12]. Each user specifies a spatial region as her privacy index, and the cloaking region for the user must have at least the same popularity as that of the specified region. An entropy based computation is used to define the popularity of a spatial region. Soriano et al. show that the privacy assurances of this model do not hold when the adversary possesses footprint knowledge on the spatial regions over time [13]. Shokri et al. propose a framework to quantify location privacy based on the expected estimation error of an adversary [14]. This work provides a method to arrive at different types of inferences regarding a user’s location based on a known mobility profile of the user. Using methods of likelihood estimation, the authors show that measures such as the anonymity set size or entropy do not correctly quantify the privacy enforced by the method [15].


Table 1. Minimum area (km$^2$) in which local search results (10 nearest neighbors) are the same for a given percentage of the continental United States landmass. Values in parentheses show the minimum area that shares 9 out of the 10 results.

keyword        90%        75%         50%          25%              10%
atm            1 (2.4)    1 (6.4)     2 (20.8)     6 (73)           21.4 (273.2)
bus station    1 (6)      1 (15.8)    4 (60.9)     16.4 (229.4)     61.4 (721.12)
cafe           1 (2)      1 (6)       2 (20.3)     6 (85.3)         24 (267.1)
car rental     1 (3)      1 (8)       2 (28.7)     8 (117.95)       34.88 (450.66)
gas station    1 (2)      1 (5)       2 (17)       5 (59.8)         18.3 (208.5)
hospital       1 (2)      1 (6)       2 (18.4)     5 (69.2)         20.4 (297.32)
library        1 (5.9)    1 (14)      3.9 (54.2)   11.9 (152.88)    38.5 (409.8)
lodging        1 (8)      1 (22.2)    4.6 (83.2)   18.6 (301.65)    74 (887)
night club     1 (4)      1 (12.5)    3.5 (62)     15 (257.25)      65.5 (891.4)
parking        1 (5)      1 (15.2)    3.6 (50.8)   13.15 (206.7)    60.2 (632.6)
pharmacy       1 (2)      1 (6)       2 (18.2)     6 (73.5)         23.1 (288)
police         1 (6.9)    1 (17)      3.9 (55)     12.8 (167.6)     44.3 (438.08)

Data transformation is another method to prevent the inference of locations. Agrawal et al. propose an encryption technique called OPES (Order Preserving Encryption Scheme) that allows comparison operations to be directly applied on encrypted data [16]. Operand decryption is however required for computing SUM and AVG. Wong et al. overcome this drawback by developing an asymmetric scalar-product preserving encryption [17]. This allows the preservation of relative distances between database points. Khoshgozaran et al. employ Hilbert curves to transform the data points and then answer queries in the transformed space [18]. The parameters of the transformation, called the Space Decryption Key, are assumed to be unknown to an adversary. A new paradigm in location privacy is based on private information retrieval (PIR) techniques. Khoshgozaran et al. propose K nearest neighbor queries that can be reduced to a set of PIR block retrievals [19]. These retrievals can be performed using a tamper-resistant processor located at the server so that the content provider is oblivious of the retrieved blocks. Papadopoulos et al. further argue for the need to retrieve the same number of blocks across queries [20]. While the use of PIR techniques in providing location privacy is an interesting direction to explore, computational inefficiency or the dependence on additional hardware makes these approaches currently unsuitable for mainstream adoption.

2 A MOTIVATING STUDY

The literature reviewed in this work highlights the efforts of the academic community to prevent the sharing of “pure” location information. A universal assumption in most of these methods is that the user, by default, is unwilling to share her location, irrespective of the service-level impacts. One can argue that a user willing to do so will simply avoid using the privacy-preserving transformation. It is our opinion that individuals do not view privacy as an immutable property, but rather as a personal yet adaptable element. For instance, while a mobile user may keep her GPS device turned off most of the time, she may occasionally turn it on to earn Shopkick™ (www.shopkick.com) rewards when visiting a department store. This user’s perspective on location privacy is guided by prospective gains from revealing her location. As another example, a user may precisely reveal her location (irrespective of its sensitivity) while looking for nearby emergency care centers; the same user may not be willing to do so while getting a listing of nearby local businesses. This user’s perspective on location privacy is requirement driven, depending on the personally assessed importance of location sensitivity and service usefulness. We performed an empirical study to determine if a location-based search application can generate any utility for an extreme user (always paranoid about revealing her current location) in this latter category.

Consider a grid of cells, each 1000 m $\times$ 1000 m, across the continental United States landmass. An individual located at any of these cells issues a local search query that retrieves the 10 nearest businesses matching the search term. Table 1 lists, for a given percentage of cells in the grid where the individual could be located, the number of other cells that would receive the same query answer as received by this individual. The values in parentheses indicate the number of other cells that would retrieve at least 9 out of the 10 businesses retrieved by the individual. In the context of the paranoid user, this data highlights that, for most places where the user could be located (say 75% of the landmass), she has the freedom to use a location coordinate anywhere in an area of size at least 6.4 km$^2$ and still retrieve 9 out of the 10 nearest ATMs. The statistics can be different depending on the actual search term issued by the user. In addition, an area of 6.4 km$^2$ may still not be comforting enough for the user. A possibility then is to consider an area that guarantees an 8 out of 10 match. This process presents an adaptive mechanism for the user, who can choose to trade off location accuracy for service accuracy, or vice versa. The challenge however lies in the fact that the user does not necessarily have the requisite resources (both in terms of computation and data) to compute these areas. On the other hand, the LBS provider that performs the local search has the computational and data resources to compute the area boundaries, provided it can accurately (and quickly) convey the information without requiring the user to reveal her location.
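The per-cell counts behind this study can be sketched as follows. This is an illustration under stated assumptions, not the procedure actually used: `results` is assumed to map each grid cell to the set of object IDs it retrieves, cells are compared regardless of contiguity, and the function names are hypothetical. The toy example uses 3 results per cell instead of 10.

```python
from collections import Counter


def same_result_counts(results):
    """For each cell, count the other cells that retrieve an identical result set."""
    freq = Counter(results.values())
    return {cell: freq[rs] - 1 for cell, rs in results.items()}


def near_match_counts(results, min_shared):
    """For each cell, count the other cells sharing at least min_shared results."""
    return {cell: sum(1 for other, rs2 in results.items()
                      if other != cell and len(rs & rs2) >= min_shared)
            for cell, rs in results.items()}


# Toy grid of four cells with three results each.
results = {
    (0, 0): frozenset({1, 2, 3}),
    (0, 1): frozenset({1, 2, 3}),
    (1, 0): frozenset({1, 2, 4}),
    (1, 1): frozenset({5, 6, 7}),
}
exact = same_result_counts(results)
relaxed = near_match_counts(results, min_shared=2)
```

With 1 km$^2$ cells, a cell count translates directly into an area in km$^2$; percentiles of these counts over all cells would then give figures of the kind reported in Table 1. The pairwise comparison in `near_match_counts` is quadratic in the number of cells and is only practical for small grids.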

3 COMMUNICATION OVERHEAD

The communication overhead is measured by the amount of information that is to be transferred to the user to infer the service-contour. The baseline for our comparison is the size of the sets $I$ and $V$, after compression using the DEFLATE algorithm. The object identifiers for elements in $V$ are obtained from the unique identifiers assigned by SimpleGeo in its database. The data transferred when using the Enc function includes the compressed version of the set $T$.

For low density objects, the transferred data has a 35% reduction in size from that of the baseline data. Although the absolute size of the baseline data is in the range of 5 to 10 KB, the impact of the improvement is seen when aggregated over a number of queries. The reduction factor (transferred data size over baseline data size) varies considerably for medium density objects, with values ranging from 0.8 to 0.15 in some cases. Absolute values for the baseline data are observed to be in the range of 25 to 150 KB. The critical factor contributing to the difference in size is the set $V$, which in turn depends on the number of distinct result sets that can be obtained within a geographic area.
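The reduction factor defined above can be measured with Python's standard `zlib` module, which produces a DEFLATE stream wrapped in a small zlib header and checksum. The sketch below is illustrative only: how the sets $I$, $V$ and $T$ are serialized to bytes is not specified in the text, so the payloads here are synthetic placeholders and the function names are hypothetical.

```python
import zlib


def deflated_size(payload: bytes) -> int:
    """Size in bytes of the payload after DEFLATE compression (zlib container)."""
    return len(zlib.compress(payload, level=9))


def reduction_factor(transferred: bytes, baseline: bytes) -> float:
    """Compressed transferred size divided by compressed baseline size."""
    return deflated_size(transferred) / deflated_size(baseline)


# Synthetic stand-ins for the serialized baseline (I and V) and the image T.
baseline = b"".join(f"cell{n}:set{n % 97};".encode() for n in range(5000))
transferred = bytes((n // 40) % 256 for n in range(5000))
factor = reduction_factor(transferred, baseline)
```

A factor below 1 indicates that the encoded image is cheaper to send than the baseline; the values reported in this section come from the authors' data, not from this synthetic example.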

4 FUTURE DIRECTIONS

One of the assumptions we made in the empirical study is that the rank order of the top-k results is not important. Without this assumption, the (dis)similarity measurement will have to be redefined to include disagreements in the result ordering. Higher utility will be maintained if the result objects that are the closest to the user are indeed retrieved by the mechanism.
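To illustrate the difference between an order-insensitive and an order-sensitive measurement, the sketch below contrasts a set-overlap dissimilarity with a Spearman-footrule-style distance in which an object missing from one list is treated as if it sat at rank k+1. Neither function is claimed to be the measure used in the paper. Both are assumptions chosen for illustration, and both expect two duplicate-free lists of equal length k.

```python
from typing import List


def set_dissimilarity(a: List[str], b: List[str]) -> float:
    """Order-insensitive: fraction of the k results in `a` that are missing from `b`."""
    k = len(a)
    return 1.0 - len(set(a) & set(b)) / k


def footrule_dissimilarity(a: List[str], b: List[str]) -> float:
    """Order-sensitive: sum of rank displacements over all objects in either list,
    with a missing object placed at rank k+1, normalized to the range [0, 1]."""
    k = len(a)
    rank_a = {obj: pos for pos, obj in enumerate(a, start=1)}
    rank_b = {obj: pos for pos, obj in enumerate(b, start=1)}
    total = sum(
        abs(rank_a.get(obj, k + 1) - rank_b.get(obj, k + 1))
        for obj in set(a) | set(b)
    )
    # Two disjoint lists give the maximum displacement, k * (k + 1).
    return total / (k * (k + 1))


same_set_reordered = (["x", "y", "z"], ["z", "y", "x"])
unordered_score = set_dissimilarity(*same_set_reordered)
ordered_score = footrule_dissimilarity(*same_set_reordered)
```

For the reordered pair above, the set-based measure sees no difference at all, while the rank-aware one reports a nonzero dissimilarity, which is the kind of disagreement a redefined measurement would need to capture.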

For a continuous query LBS model, the policy that determines the final choice of the location must also induce realistic correlations between subsequent locations. This would involve analyzing the current service-contour from multiple reference points, in an effort to generate a reasonable trajectory of future locations. The difficulty arises because of the possibility of dynamic updates to the objects database. Additional directions include reducing the communication overhead, efficiently solving the service-contour inferencing problem for a moving objects database, augmenting the inference process with clear privacy policy descriptions, and integrating application sensitivity into the decision making process.

Dynamic updates in our application environment can occur by addition/deletion of objects. These updates can happen in the background, and the query processor can have access to the updated database as soon as the update operations are complete. We note that, as a result of this process, the query performance will not degrade, although stale results may be generated for a brief period. If the update time is not significant, a locking mechanism can be enforced to guarantee result validity. It is also important to note that frequent updates to POI databases are not likely. In this work, we did not consider the possibility of mobile POIs (for example, as in a friend finder service where the searched objects are also mobile). We believe that the case of mobile POIs needs an extensive and formal study of its own, since the locations of the moving objects may be sensitive information. In such a case, obtaining the service-contour is not as straightforward as in the case of a local search.
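The locking option mentioned above can be sketched as a single mutex shared by updates and queries, so that a query never observes a half-applied batch of additions and deletions. This is a minimal illustration and not the authors' implementation. The class name, the planar coordinates, and the brute-force nearest-neighbor search are all assumptions.

```python
import threading
from typing import Dict, Iterable, List, Tuple

Point = Tuple[float, float]


class PoiStore:
    """In-memory POI table guarded by one lock (hypothetical sketch)."""

    def __init__(self, pois: Dict[str, Point]) -> None:
        self._pois = dict(pois)
        self._lock = threading.Lock()

    def apply_updates(self, added: Dict[str, Point], removed_ids: Iterable[str]) -> None:
        """Apply a batch of deletions and additions atomically with respect to queries."""
        with self._lock:
            for poi_id in removed_ids:
                self._pois.pop(poi_id, None)
            self._pois.update(added)

    def nearest(self, x: float, y: float, k: int) -> List[str]:
        """Return the ids of the k POIs closest to (x, y) by squared planar distance."""
        with self._lock:
            ranked = sorted(
                self._pois.items(),
                key=lambda item: (item[1][0] - x) ** 2 + (item[1][1] - y) ** 2,
            )
        return [poi_id for poi_id, _ in ranked[:k]]


store = PoiStore({"a": (0.0, 0.0), "b": (1.0, 0.0), "c": (5.0, 5.0)})
before = store.nearest(0.0, 0.0, k=2)
store.apply_updates(added={"d": (0.5, 0.0)}, removed_ids=["a"])
after = store.nearest(0.0, 0.0, k=2)
```

Holding the lock for the whole update blocks queries for the duration of the batch, which is why this choice only makes sense when the update time is short. The background-update alternative avoids the blocking at the cost of briefly stale results.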

APPENDIX A

Table 2. City center coordinates used in the empirical study.

City          Latitude       Longitude
Los Angeles   34.053691°N    118.243126°W
Houston       29.760177°N    95.3692910°W
Chicago       41.87045°N     87.629905°W
New York      40.713256°N    74.005905°W

Figure 1. Output T of Enc function based on multi-dimensional scaling for a query involving “starbucks coffee” as the search term in the city of Los Angeles, CA. The ranking function is 10-nearest-neighbors. Note: color variations are lost in greyscale viewing.

REFERENCES

[1] B. Bamba, L. Liu, P. Pesti, and T. Wang, “Supporting AnonymousLocation Queries in Mobile Environments with Privacy Grid,”in Proceedings of the 17th International World Wide Web Conference,2008, pp. 237–246.

[2] F. Liu, K. A. Hua, and Y. Cai, “Query l-Diversity in Location-Based Services,” in Proceedings of the 10th International Conferenceon Mobile Data Management: Systems, Services and Middleware, 2009,pp. 436–442.

[3] R. Dewri, I. Ray, I. Ray, and D. Whitley, “On the Formationof Historically k-Anonymous Anonymity Sets in a ContinuousLBS,” in 6th International ICST Conference on Security and Privacyin Communication Networks, 2010, pp. 71–88.

Page 15: Exploiting Service Similarity for Privacy in Location Based Search Queries

MigrantSystems

4

[4] ——, “Query m-Invariance: Preventing Query Disclosures inContinuous Location-Based Services,” in Proceedings of the 11thInternational Conference on Mobile Data Management, 2010, pp. 95–104.

[5] H. Shin, J. Vaidya, and V. Atluri, “A Profile Anonymization Modelfor Location Based Services,” Journal of Computer Security, vol. 19,no. 5, pp. 795–833, 2011.

[6] C. B. D. Riboni, L. Pareschi and S. Jajodia, “Preserving Anonymityof Recurrent Location-Based Queries,” in Proceedings of the 16thInternational Symposium on Temporal Representation and Reasoning,2009.

[7] A. R. Beresford and F. Stajano, “Mix Zones: User Privacy inLocation-Aware Services,” in Proceedings of the Second IEEE AnnualConference on Pervasive Computing and Communications Workshops,2004, pp. 127–131.

[8] B. Palanisamy and L. Liu, “MobiMix: Protecting Location Privacywith Mix-Zones Over Road Networks,” in Proceedings of the 27thInternational Conference on Data Engineering, 2011, pp. 494–505.

[9] M. F. Mokbel, C. Chow, and W. G. Aref, “The New Casper: QueryProcessing for Location Services Without Compromising Privacy,”in Proceedings of the 32nd International Conference on Very Large DataBases, 2006, pp. 763–774.

[10] K. C. K. Lee, W.-C. Lee, H. V. Leong, and B. Zheng, “OPAQUE:Protecting Path Privacy in Directions Search,” in Proceedings of the25th International Conference on Data Engineering, 2009, pp. 1271–1274.

[11] T. Xu and Y. Cai, “Exploring Historical Location Data forAnonymity Preservation in Location-Based Services,” in IEEEINFOCOM 2008, 2008, pp. 1220–1228.

[12] ——, “Feeling-Based Location Privacy Protection for Location-Based Services,” in Proceedings of the 16th ACM Conference onComputer and Communications Security, 2009, pp. 348–357.

[13] M. Soriano, S. Qing, and J. Lopez, “Time Warp: How Time AffectsPrivacy in LBSs,” in Proceedings of the 12th International Conferenceon Information and Communications Security, 2010, pp. 325–339.

[14] R. Shokri, G. Theodorakopoulos, J.-Y. L. Boudec, and J.-P. Hubaux,“Quantifying Location Privacy,” in Proceedings of the 32nd IEEESymposium on Security and Privacy, 2011, pp. 247–262.

[15] R. Shokri, C. Troncoso, C. Diaz, J. Freudiger, and J.-P. Hubaux,“Unraveling an Old Cloak: k-Anonymity for Location Privacy,”in Proceedings of the 9th Annual ACM Workshop on Privacy in theElectronic Society, 2010, pp. 115–118.

[16] R. Agrawal, J. Kiernan, R. Srikant, and Y. Xu, “Order PreservingEncryption for Numeric Data,” in Proceedings of the ACM SIGMODInternational Conference on Management of Data, 2004, pp. 563–574.

[17] W. K. Wong, D. W. Cheung, B. Kao, and N. Mamouslis, “SecurekNN Computation on Encrypted Databases,” in Proceedings of the35th SIGMOD International Conference on Management of Data, 2009,pp. 139–152.

[18] A. Khoshgozaran and C. Shahabi, “Blind Evaluation of NearestNeighbor Queries Using Space Transformation to Preserve Loca-tion Privacy,” in Proceedings of the 10th International Conference onAdvances in Spatial and Temporal Databases, 2007, pp. 239–257.

[19] A. Khoshgozaran, C. Shahabi, and H. Shirani-Mehr, “Location Pri-vacy: Going beyond k-Anonymity, Cloaking and Anonymizers,”Journal of Knowledge and Information Systems, vol. 26, no. 3, pp.435–465, 2011.

[20] S. Papadopoulos, S. Bakiras, and D. Papadias, “Nearest NeighborSearch with Strong Location Privacy,” VLDB Endowment, vol. 3,no. 1-2, pp. 619–629, 2010.