finding similar mobile consumers with a privacy-friendly...

24
This article was downloaded by: [128.122.185.232] On: 10 November 2015, At: 14:10 Publisher: Institute for Operations Research and the Management Sciences (INFORMS) INFORMS is located in Maryland, USA Information Systems Research Publication details, including instructions for authors and subscription information: http://pubsonline.informs.org Finding Similar Mobile Consumers with a Privacy-Friendly Geosocial Design Foster Provost, David Martens, Alan Murray To cite this article: Foster Provost, David Martens, Alan Murray (2015) Finding Similar Mobile Consumers with a Privacy-Friendly Geosocial Design. Information Systems Research 26(2):243-265. http://dx.doi.org/10.1287/isre.2015.0576 Full terms and conditions of use: http://pubsonline.informs.org/page/terms-and-conditions This article may be used only for the purposes of research, teaching, and/or private study. Commercial use or systematic downloading (by robots or other automatic processes) is prohibited without explicit Publisher approval, unless otherwise noted. For more information, contact [email protected]. The Publisher does not warrant or guarantee the article’s accuracy, completeness, merchantability, fitness for a particular purpose, or non-infringement. Descriptions of, or references to, products or publications, or inclusion of an advertisement in this article, neither constitutes nor implies a guarantee, endorsement, or support of claims made of that product, publication, or service. Copyright © 2015, INFORMS Please scroll down for article—it is on subsequent pages INFORMS is the largest professional society in the world for professionals in the fields of operations research, management science, and analytics. For more information on INFORMS, its publications, membership, or meetings visit http://www.informs.org

Upload: others

Post on 27-Jun-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Finding Similar Mobile Consumers with a Privacy-Friendly ...pages.stern.nyu.edu/~fprovost/Papers/PMM_ISR_2015.pdf · Provost, Martens, and Murray: Finding Similar Mobile Consumers

This article was downloaded by [128122185232] On 10 November 2015 At 1410Publisher Institute for Operations Research and the Management Sciences (INFORMS)INFORMS is located in Maryland USA

Information Systems Research

Publication details including instructions for authors and subscription informationhttppubsonlineinformsorg

Finding Similar Mobile Consumers with a Privacy-FriendlyGeosocial DesignFoster Provost David Martens Alan Murray

To cite this articleFoster Provost David Martens Alan Murray (2015) Finding Similar Mobile Consumers with a Privacy-Friendly Geosocial DesignInformation Systems Research 26(2)243-265 httpdxdoiorg101287isre20150576

Full terms and conditions of use httppubsonlineinformsorgpageterms-and-conditions

This article may be used only for the purposes of research teaching andor private study Commercial useor systematic downloading (by robots or other automatic processes) is prohibited without explicit Publisherapproval unless otherwise noted For more information contact permissionsinformsorg

The Publisher does not warrant or guarantee the articlersquos accuracy completeness merchantability fitnessfor a particular purpose or non-infringement Descriptions of or references to products or publications orinclusion of an advertisement in this article neither constitutes nor implies a guarantee endorsement orsupport of claims made of that product publication or service

Copyright copy 2015 INFORMS

Please scroll down for articlemdashit is on subsequent pages

INFORMS is the largest professional society in the world for professionals in the fields of operations research managementscience and analyticsFor more information on INFORMS its publications membership or meetings visit httpwwwinformsorg

Information Systems ResearchVol 26 No 2 June 2015 pp 243ndash265ISSN 1047-7047 (print) ISSN 1526-5536 (online) httpdxdoiorg101287isre20150576

copy 2015 INFORMS

Finding Similar Mobile Consumers with aPrivacy-Friendly Geosocial Design

Foster ProvostDepartment of Information Operations and Management Sciences Leonard N Stern School of Business

New York University New York New York 10012 fprovoststernnyuedu

David MartensDepartment of Engineering Management Faculty of Applied Economics University of Antwerp

B-2000 Antwerp Belgium davidmartensuantwerpbe

Alan MurrayCoriolis Labs New York New York 10021 alancoriolisventurescom

This paper focuses on finding the same and similar users based on location-visitation data in a mobile envi-ronment We propose a new design that uses consumer-location data from mobile devices (smartphones

smart pads laptops etc) to build a ldquogeosimilarity networkrdquo among users The geosimilarity network (GSN)could be used for a variety of analytics-driven applications such as targeting advertisements to the same useron different devices or to users with similar tastes and to improve online interactions by selecting users withsimilar tastes The basic idea is that two devices are similar and thereby connected in the GSN when they shareat least one visited location They are more similar as they visit more shared locations and as the locations theyshare are visited by fewer people This paper first introduces the main ideas and ties them to theory and relatedwork It next introduces a specific design for selecting entities with similar location distributions the resultsof which are shown using real mobile location data across seven ad exchanges We focus on two high-levelquestions (1) Does geosimilarity allow us to find different entities corresponding to the same individual forexample as seen through different bidding systems And (2) do entities linked by similarities in local mobilebehavior show similar interests as measured by visits to particular publishers The results show positive resultsfor both Specifically for (1) even with the data samplersquos limited observability 70ndash80 of the time the sameindividual is connected to herself in the GSN For (2) the GSN neighbors of visitors to a wide variety of pub-lishers are substantially more likely also to visit those same publishers Highly similar GSN neighbors showvery substantial lift

Keywords design science mobile computing analytical modeling network analysisHistory Chris Forman Senior Editor Gautam Pant Associate Editor This paper was received February 17

2014 and was with the authors 2 months for 2 revisions

1 IntroductionThis design-science paper is about finding the sameand similar users based on location-visitation data ina mobile environment A user is an instance of anindividual as viewed through some information sys-tem(s) and to avoid confusion we sometimes willexplicitly refer to a user instance For example anindividual would be viewed as two different userinstances if the individual was interacting with theonline advertising ecosystem on two different devicesor was viewed on the same device through two dif-ferent bidding systems

Specifically we investigate ldquogeosimilarityrdquomdashthesimilarity of instances of consumers based on the dis-tribution of the locations they have been observed tovisit The basic idea is that two users are similar andthereby connected in a geosimilarity network (GSN)when they share at least one visited location Users

are more similar as they visit more shared locationsand locations that are less frequently visited1 Thisdesign is motivated by research showing that geo-graphical co-occurrence provides a strong indicationof users being friends (Crandall et al 2010 Cho et al2011) and influencing each otherrsquos purchasing behav-ior (Pan et al 2011) de Montjoye et al (2013) foundthat human mobility traces are highly discriminativebased on the analysis of 15 months of location data for15 million users they found that four locations aresufficient to accurately identify 95 of the users Thisnot only motivates our design it also places emphasison the desirability of privacy-friendly designs whichwe discuss further below

1 As we discuss in future directions (see sect8) the network perspec-tive could allow similarity judgments also between users who areindirectly connected in the network

243

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers244 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

This paperrsquos motivating application is mobile ad-vertising As with traditional display ad targetingmobile ad targeting could be based on context demo-graphics or psychographics if such data were avail-able Because such data are largely unavailable onmobile consumers in the advertising ecosystem adtargeting firms increasingly are looking to othermeans of finding suitable candidates for ads An alter-native method that has been gaining traction for (non-mobile) ad targeting both in research and in practiceis targeting based on direct or indirect connectionsto specific individuals which we loosely call ldquosocialrdquotargeting as we discuss in detail below For exam-ple brands often want to target individuals who aresimilar to particular existing customers of the brandor product in question or the same existing cus-tomers on different devices2 Geosimilarity providesa viable alternative for such indirect social targetingfor mobile advertising where (anonymized) locationdata are available and importantly anonymized loca-tion data may be much more readily available thanother targeting data

No matter how users are chosen for targetingmobile advertisers need to deal effectively with theproblem of consumer fragmentation individuals areobserved only through the multiple (baroque) infor-mation systems that comprise the digital adver-tising ecosystem and a particular individual maycorrespond to various user instances (Kerho 2012) Forexample an individual may have different instancesbecause she is observed on different devices such asher mobile phone tablet laptop and PC Individualsmay also have different instances because they appearin different advertisement bidding systems each ofwhich presents the individual differently In manysituations these instances are not associated with aunique identifier for the individual unlike the cookiesthat are used to represent a userndashbrowser pair in mostwork on individualized desktop ad targeting More-over as we will discuss presently there are importantreasons why we might prefer that the instances not beassociated with a personal identifier The bottom lineis that once an instance of an individual is identified asbeing a good target advertisers would like to be ableto also target other instances of the same individual(as well as individuals with very similar interests)

Geosimilarity could also be used for privacy-friendly ldquohyperlocalrdquo targeting meaning targeting

2 Note that this paper is not about the ultimate effectiveness of thebrandrsquos choice of whom to targetmdashit may be that targeting theselected users on different devices or targeting users with similarinterests is not appropriate from a marketing standpoint This isa question specific to each brand and the brandrsquos campaign goals(eg direct marketing versus brand advertising) Furthermore thisquestion is intricately intertwined with the design of the creativebeing delivered which also is outside the scope of this paper

people in a precise location (statistically speaking)without needing to store data on the actual loca-tions of the users For example consider a hyper-local coupon campaign a couponing company wantsto target special offers for the local restaurant on thecorner The best prospects are those people who fre-quent this precise area If the restaurant can providesome (anonymized) ldquoseedrdquo usersmdashfor example exist-ing clients of the particular local business (eg identi-fied via an online loyalty program)mdashthe geosimilarityneighbors of these seed users may have a high prob-ability of also frequenting the same precise locationsand thereby be very good prospects for the hyperlocalcoupon

Looking from a different perspective we have seenthe sort of uproar that arises from the idea that ourlocation behavior is being ldquotrackedrdquo by our mobiletechnology (Pogue 2011)3 Therefore marketers whodream of location-driven targeting should think care-fully about what the Federal Trade Commission (FTC)calls ldquoprivacy by designrdquo4 and consider what optionscan provide effective advertising with minimal datacollection and storage As will be detailed in sect4we explicitly take this desideratum into account byanonymizing both the device and location identifiersThis method of using ldquodoubly anonymizedrdquo data forprivacy friendliness is described in more detail else-where (Provost et al 2009)

In sum geosimilarity can be used for composing amobile audience for targeting as follows Based on adoubly anonymized GSN find individuals who areclosely linked to individuals we know already to havethe characteristic(s) that we desire such as prior pur-chase history (as with ldquoretargetingrdquo) brand affinity(eg via visiting a Web page ldquolikingrdquo a brand orproduct or clicking on an ad) or key demograph-ics or even based on sophisticated targeting designs(Bampo et al 2008 Agarwal et al 2008 Heidemannet al 2010 Perlich et al 2014) Once these key ldquoseedrdquousers have been identified targeting close neighborsin the GSN will tend to target both users with simi-lar interests and users who are other instances of thesame individual Below we present theoretical justi-fication for such an approach tying it in to relatedresearch We then investigate empirically whether themethod indeed tends to target (i) other instances ofthe same individual andor (ii) individuals with sim-ilar interests The empirical study is based on datadrawn from actual real-time bidding (RTB) exchangesfor mobile advertising

Before we move on to related work please let usnote that our proposed GSN is different from geoso-cial networking which is defined as a type of social

3 httppogueblogsnytimescom20110428wrapping-up-the-apple-location-brouhaha4 httpwwwftcgovopa201012privacyreportshtm

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 245

networking in which geographic services are usedto enable additional social dynamics (Quercia et al2010) The latter starts from an actual network ofinterpersonal relationships and adds location data torecommend other locations and events In contrastthe GSN uses location data to create links betweenusers without the explicit guarantee that these linkscorrespond to an actual interpersonal relationship(however this surely might be as described in thenext section see Crandall et al 2010)

Section 2 argues theoretically why this GSN designshould be a good idea After that we present the high-level GSN design in more detail followed by specificsof the implementation we examine empirically Thenwe provide results from an empirical study basedon real mobile location data from real-time biddingexchanges Section 5 describes the data and experi-mental setup including several technical definitionsfor geosimilarity The ability of the GSN to connectdifferent instances of the same individual is evaluatedin sect6 The use of the GSN to select users with similarinterests and tastes is assessed in sect7

2 Motivation and Related Work21 Social TargetingAdvertising targeting has evolved substantially overthe past half century As information systems providedaccess to new sources and types of data marketersadded new targeting strategies designed around thenew data For example as demographic data becameavailable a few decades ago contextual targetingmdashtargeting based on inferring audience compositionfrom the context in which the ad will be shown (ega billboard location TV show magazine etc)mdashhadto share the spotlight with data-driven demographictargeting either based on explicit demographic pro-files or predictive modeling As data aggregators coa-lesced and integrated information such as magazinesubscriptions and catalog purchases ldquopsychographicrdquodata entered the mix and broadened yet again thespace of targeting designs

Recently we have seen the introduction of a differ-ent sort of targeting design which we can generallycall social targeting Social targeting differs from theaforementioned targeting methods because it relieson explicit linkages between specific individuals Forexample Hill et al (2006) showed the remarkableeffectiveness of social-network targeting targeting con-sumers who are linked to known customers by asocial network Subsequently Facebook (and others)have attempted to implement social-network target-ing for online advertising with varying degrees ofsuccess (see eg Oinas-Kukkonen et al 2010)

We explicitly generalize from social-network target-ing to social targeting to retain the notion that the

targeting is based on linkages to specific other indi-viduals but to relax the notion that the linkages needto be ldquotruerdquo interpersonal relationships The designof Provost et al (2009) is an example of social tar-geting that is not based on ldquotruerdquo interpersonal rela-tionships the linkages between individuals are basedon a bipartite content-affinity network So the socialtargeting there is based on forming an audience byfinding consumers who are linked by shared con-tent visitation with other specific consumers who areknown to have brand affinity (more on that later)Similarly Martens and Provost (2011) define a net-work among banking customers based on paymenttransaction data which is subsequently used to targetmarketing offers for financial products

22 GeosimilarityThe GSN can be very fine-grained based on shared(anonymized) location data for example Internet pro-tocol (IP) addresses fine-grained latitudelongitudecells or small geographic tracts Why would targetinggeosimilarity network neighbors (NNs) of preselectedseed users be a good idea There are several reasonsLet us call mobile devices that are (directly) linked inthe GSN (first-degree) network neighbors

First of all first-degree network neighbors shareat least one location and possibly several As thenumber of shared locations grows we conjecture thatthe likelihood increases that the two devices actuallybelong to the same person For example who besidesme is observed primarily on my home and office IPaddress let alone my favorite coffee shop We seeanalogous results showing that different instances ofthe same person call the same phone numbers (Corteset al 2001) and cite the same references (Hill andProvost 2003)

Second direct coarse-grained geographic targetingalready is used widely in off-line advertising (not viasocial targeting) because geography is a reasonableproxy for demographics and other predictive featuresAn important difference in the present work is thatwe do not choose the geographies to target explic-itly but instead use them implicitly This allows theactual locations to be anonymized Furthermore itallows us to use location information that is too fine-grained even to include in most predictive modelingfor example locations appearing only in a tiny frac-tion of instances (eg a home Wi-Fi address may onlyappear in 0000001 or less of the usersrsquo location sets)as well as transient Wi-Fi locations that only connecttwo devices for a brief time

Third fine-grained location information is likely tocontain more detailed (latent) information than stan-dard geographic information Not only would it linkdevices by approximate wealth income demograph-ics etc it may well link by employer educational

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers246 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

institution interests community and even shoppinghabits Thus we conjecture that geosimilarity tar-geting may combine the advantages of geographictargeting discussed in the previous paragraph withadvantages similar to content-affinity social targetingwhich has been shown to be effective specifically foronline advertising (Provost et al 2009 Perlich et al2014) We might call this ldquolocale-affinityrdquo social tar-geting The empirical results below show that indeedonersquos geosimilarity neighbors are much more likely tofrequent the same online publishers and mobile apps

If that were not enough research provides yetanother reason to expect that geosimilarity targetingmay be especially effective In an article in the Pro-ceedings of the National Academy of Sciences of the UnitedStates of America Crandall et al (2010 p 22436) showthat geographic co-occurrences between individualsare very strongly predictive of the individuals beingfriends ldquoThe knowledge that two people were prox-imate at just a few distinct locations at roughly thesame times can indicate a high conditional proba-bility that they are directly linked in the underlyingsocial networkrdquo This means that a GSN would notonly capture the advantages of geographic targetingand locale-affinity targeting but might also incorpo-rate actual social-network targeting also shown to beextremely effective for marketing (Hill et al 2006)In fact when massive descriptive data are availablethe (usually latent) similarity between social-networkneighbors has been shown to explain much of themarketing advantage previously attributed to socialinfluence (Aral et al 2009)

Moreover interestingly research also has shown thatthe homophily (McPherson et al 2001) that has beenused to explain the effectiveness of social-network tar-geting may actually be due largely to the constraintsplaced by opportunity (Kossinets and Watts 2009)Tie formation in social networks is biased heavilyby triadic closure and thus by structural proximityOver many generations of tie formation biases in theselection of structurally proximate individuals ldquocanamplify even a modest preference for similar othersvia a cumulative advantage-like process to producestriking patterns of observed homophilyrdquo (Kossinetsand Watts 2009 p 405) Why is this important for thepresent paper Because with the exception of sociallinks formed solely online (like new Facebook-onlyfriends) constraints on opportunity are framed byconstraints on physical colocation As Kossinets andWatts (2009 p 406) observe ldquoan individualrsquos choiceof relations is heavily constrained by other aspects ofhis or her life such as geographical location choice ofoccupation place of workrdquo These are exactly the sortsof links that would be represented in a GSN based onmobile device location data (including tablets and lap-tops) Thus the GSN may in addition reveal a large

driver of homophily and thus expand the effective-ness of a social-network targeting strategy to individ-uals who also would have been similar ldquofriendsrdquo butfor whatever reason were not chosen

3 Geosimilarity GeosimilarityNetworks and Mobile Ad Targeting

The basis for the targeting design is to create a net-work among users based on geosimilarity and thento use the network for inference For this paper wewill create a GSN directly among user instances5 Theelements of the network-targeting design include thefollowing How will entities (mobile user instances)be represented What exactly will be the geosimilaritylinks And how exactly will the predictive inferencebe conducted

In this section we will discuss the design at a highlevel including some further-looking elements thatare not part of the experiments but which providecompleteness for the discerning reader To make thediscussion more concrete we first present the moti-vating data scenario The design should generalize tomobile settings beyond this specific data scenario

31 The Mobile Real-Time Bidding EcosystemA high-level overview of how an RTB exchange worksis given in Figure 1 RTB exchanges bring togetheradvertisers and the publishers of Web pages andapps When a user visits a Web pageapp (here-after ldquoWeb pagerdquo) that sells ad slots through RTBexchanges a bid request is created Such a bid requestwill provide additional data on the user and Webpage visited such as the IP address publisher anddevice type (see Table 1 for a list of fields and statis-tics on their availability across RTB exchanges) Basedon such data advertisers can choose to bid For exam-ple a brand may want to target users that previ-ously visited their automobile website browsing for aecofriendly hybrid car The highest bidder is allowedto load a potentially personalized ad for exampleshowing an ad for an ecofriendly hybrid All of thisoccurs in real time in less than 100 milliseconds Thetwo interrelated problems this paper addresses canbe stated specifically in the context of this examplehow does the targeter find instances of this individ-ual who previously visited the brandrsquos website Canthe targeter find users with very similar interests (egthe focal userrsquos husband best friend) under the pre-sumption that they might also be reasonable candi-dates for an ad for an ecofriendly hybrid

5 An alternative is to create a bipartite network between users andlocations and draw inferences directly from the bipartite networkWe do not investigate that explicitly in this paper however theinterested reader should take a careful look at the similarity calcu-lations we present below

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 247

Figure 1 (Color online) Working of Real-Time Bidding

AdvertisersAd

Consumers

Publisher hasad slots for sale

At IP X is on website YBrand 1

51c

14c

33cBrand 2

Brand 3

3

Bids

Bid request

Loading of ad

2

with device type ZBidding is open 4

1

Notes When a user visits a Web page that sells ad slots through an RTB exchange (1) a bid request is created (2) after which several advertisers bid to allowto show an ad at that moment on that Web page to that user (3) The highest bidder is allowed to load a potentially personalized ad (4)

For this paper we address the data currently avail-able in the digital advertising ecosystem Specificallyvia RTB exchanges we can observe a massive numberof mobile users along with the data made availablefor advertisers to decide whether to bid on the userfor a particular campaign (see eg Pubmatic 2010)Two aspects of the data are crucial (1) Each instanceof a mobile user is associated with a key when thiskey is observed in the future we know that the futuredata item involves the same mobile device6 (2) Partof the data associated with each mobile device is alocation For this paper we consider that location tobe the (anonymized) IP address of the Wi-Fi network7

currently in use (although there are various sorts oflocation data) We observe the locations visited byeach user for the data records included in the researchdata set (described in more detail in sect51)

The mobile RTB ecosystem holds different infor-mation than for the typical online (desktop) worldwhich is important for our design The GSN designmeasures similarity in location visitation behaviorOne might wonder if other behavioral data can beobtained from the RTB systems such as website vis-itation data (Provost et al 2009 Raeder et al 2012)Table 1 shows which typical fields (besides the key)are visible on mobile devices over seven different RTBexchanges Let us consider the different data fields aspossible sources of behavioral data

First let us have a look at the fields that are avail-able everywhere the time and date of the bid request(Created) the RTB exchange (Network) the dimen-sions of the ad (Dims) the device type (Device) the

6 The notion of an instance is important Many instances of the samedevice may have different keys in the online advertising ecosystemAs discussed above connecting these is one focus of this research7 For this paper we will consider transient IP addresses that are notexplicitly removed such as those belonging to 3G or other carriernetworks (but not identified as such) to be noise in the data thatmust be accommodated by the similaritytargeting design Betterremoval techniques could increase the strength of the results pre-sented in the next sections

user agent the publisher and the IP address TheIP address (anonymized) is the location we use in ourGSN design Figures 2(a) and 2(b) show the distri-bution of the number of users seen at an IP addressand the number of IP addresses seen per user respec-tively (more information on this data set is providedlater in sect51) On average a user visits 166 IP ad-dresses and an IP address is visited on average bytwo users (over a time period of 10 days see sect51)Although in our data set some known 3G IP addresses(subranges) have already been removed Figure 2(a)shows that high-volume IP addresses are still presentSome of these IP addresses with a very large numberof unique users seen correspond to large institutionsthink for example of a university Wi-Fi address

Although the publisher data item is available for allbid requests we observe that most device instancesvisit only a single publisher (about 95 whereasless than 1 visit three or more publishers andonly 00009 are seen on 10 or more publishers)The publisher-specific Interactive Advertising Bureau(IAB) categorizations (Iabcats) suffer from the sameproblem (as well as being too coarse-grained to beuseful for identifying the same user in any event)The reason is that as mentioned above a mobiledevice has different identifiers when viewed throughdifferent channels in the mobile advertising ecosys-tem such as different RTB systems or even differentapps On the other hand a publisher has on average1366 users that visited it and publishers by and largeseem to be RTB specific Figures 3(a) and 3(b) showmore details of the distributions As a consequenceusing publisher-visitation data to connect users is notbroadly helpful because two user instances in dif-ferent RTB systems rarely share a publisher The IPaddress is always available and is passed consistentlythrough the different apps and RTB systems

The other consistently available fields (such as useragent device type dimensions) are always the samefor a single device These could in principle be used toimprove same-device finding but are only marginally

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers248 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

Table 1 Data Availability for Mobile Devices Across Different RTB Systems

RTB 1 () RTB 2 () RTB 3 () RTB 4 () RTB 5 () RTB 6 () RTB 7 ()

Created 100 100 100 100 100 100 100Network 100 100 100 100 100 100 100Dims 100 100 100 100 100 100 100ID 100 100 100 36 0 19 44Device 100 100 100 100 100 100 100User agent 100 100 100 100 100 100 100Iabcats 0 100 100 98 100 100 84Geo 100 16 100 100 100 100 100Location 0 4 56 42 1 20 100Referrer 8 0 0 17 0 6 3Publisher 100 100 100 100 100 100 100URL 100 0 0 5 93 0 0IP 100 100 100 100 100 100 100

Notes Specifically the table shows how often the following fields are provided time of bid request (Created) the RTB network identifier(Network) dimensions of the ad (Dims) identifier of the device (ID) device type (Device) the user agent IAB categorization (Iabcats)the IP-inferred city state or country (Geo) the location latitude and longitude (Location) the referrer from where the user comes(Referrer) the publisher of the Web page (Publisher) the URL and the IP address

helpful for finding the same user on different devicesor for similar-user finding (eg using Apple productsversus Android products) we do not take advantageof them in this paper instead focusing specificallyon assessing the ability of the geosimilarity design

Figure 2 Distribution of (a) Number of Unique User Instances Seen per IP Address and (b) Number of IP Addresses a User Instance Logs Into

(b) Number of IPs per ID

Number of IPs

Fre

quen

cy o

f occ

uren

ce

100

108

107

106

105

104

103

102

100

101

101 102 103 104 105 106 107 108

(a) Number of IDs per IP

Fre

quen

cy o

f occ

uren

ce

108

107

106

105

104

103

102

100

101

Number of IDs

100 101 102 103 104 105

Figure 3 Distribution of (a) Number of Unique User Instances Seen per Publisher and (b) Number of Publishers a User Instance Visits

(a) Number of IDs per publisher

Number of IDs

Fre

quen

cy o

f occ

uren

ce

100

104

103

102

100

101

101 102 103 104 105 106 107

(b) Number of publishers per ID

Fre

quen

cy o

f occ

uren

ce

108

107

106

105

104

103

102

100

101

Number of publishers

100 101 102 103 104

(as opposed to how to best engineer a same-user-finding system)

The URL field is missing in all but two RTB ex-changes The Geo field is often limited to the citystate and country as inferred from the IP address

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 249

Location data in the form of latitude and longitude(Location) are only available in some RTB systemsand the accuracy of the latitudelongitude data in thecurrent mobile ecosystem is questionable8 again mak-ing it not suitable for broad use

32 Entity and Link RepresentationWe represent each user instance by a ldquoprofilerdquo ofbehavior across locations where we define a locationprofile as a vector of visit frequencies This will neces-sarily require a sparse representation as we observe avery large total set of locations fortunately each indi-vidual is seen at only a small subset of the locationsFor the empirical results below the location profile isa sparse vector of frequencies of observed visitationsto locations (which can be weighted in the calculationof similarity as described below)

The structure of the GSN is the graph with mobiledevices as the nodes and links between the devicesas the edges The corresponding design decisionsinvolve the selection of which devices to link at alland the weights to place on the links The simplestlink representation places unweighted links betweenall pairs of devices that share at least one specific loca-tion More sophisticated weighting criteria are basedon two notions which are discussed next and illus-trated with a simple example in Figure 4 The loca-tion data are obtained by listening to RTB ecosystemswhere IP addresses of devices are broadcast (as shownpreviously in Figure 1)

1 Given two devices that share at least one loca-tion the weight of the link between them can bea function of the devicesrsquo location profiles Such mea-sures can include simply the number of shared loca-tions and the number of shared locations weightedby their visitation percentage in the profile Alterna-tively the weighting could be a measure of similaritybetween the two location profiles for example cosinesimilarity

2 As such Devices A B and C are connected in theGSN of Figure 4 since they all logged into the Wi-Fiof the Met Museum Devices B and C also loggedinto the public Wi-Fi at Times Square and thereforeshould be more strongly connected than A and B (or Band C) which share only one location

8 Analyses of the latitudelongitude data show various data prob-lems including flipped coordinates coordinates set equal to eachother many latitudinallongitudinal points at the centroids of com-mon geographic areas and many impossible or highly improba-ble coordinates Mobile ad industry insiders claim that the highprice paid for latitudelongitude-based hyperlocal targeting hasled to a significant amount of latitudelongitude fraud We havefound nothing written on this phenomenon although data fraudis not uncommon in the online advertising industry (Stitelmanet al 2013)

Figure 4 (Color online) Basic Illustration of the Creation of a GSNBased on Mobile Location Data Observed for Three MobileDevices (A B and C) and a Laptop (D)

Met MuseumA B

Geosocial network(GSN)

A C

D

B

C

D

Times Square

Apartment

Note Devices are connected in the GSN if they visited (logged into) the samelocation (IP address)

3 Links can be weighted according to the componentlocationsrsquo 4lack of5 popularity For example it mightbe argued that the link formed by sharing a loca-tion with a small number of other devices (eg myapartment) indicates a stronger geosimilarity thansharing a location with a massive number of otherdevices This intuition can be extended if two devicesspend a lot of time at such an unpopular locale(eg my apartment) then that should indicate a verystrong similarity If two devices have approximatelythe same distributional profile across several of thesesites (their ldquolocation profilerdquo) then they are eitherthe same person or close friendssoulmates In ourexample C and D are connected because they bothlogged into the Wi-Fi of an apartment whereas Band C are connected by logging into the public (pop-ular) Wi-Fi at Times Square Hence the connectionbetween C and D (which are likely devices belongingto roommates or belonging to the same user) shouldbe stronger than the connection between B and C(which are likely devices belonging to two differenttourists)

This notion of similarity can be incorporated tech-nically by (1) adapting notions from information re-trieval we can weight locations for a given deviceby their device-specific popularity (from the devicersquoslocation profile) divided by the (log of) the locationrsquosoverall popularity let us call that LF times IDF (locationfrequency times inverse device frequency) Then (2) thestrength of the link between two devices that sharea location would be a function of the correspondingLF times IDF scores

Once we have one or more GSNs there are variousways to take advantage of them for targeting or otherinference The simplest method for inference is sim-ply to target all of the geosimilar network neighborsAlternatively one could target the ldquoclosestrdquo neighborsbased on one or more notions of geosimilarity (as dis-

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers250 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

cussed above) These are the methods used for theempirical results presented below

4 Specifics of the GSN DesignAs described above in this design the strength of thelink between two individuals in the GSN is based onthe similarity in the distribution of the locations thatthey visit For this paperrsquos results locations will be(anonymized) IP addresses We now define differentsimilarity measures of varying complexity which willbe evaluated in the following sections

41 NotationThe location profile of a user is represented by a vec-tor Ex of length m where element i denotes the num-ber of visits to IP address IPi2 Ex isin m Vector Exbool isa binary vector of size m where element i denoteswhether the user has visited IPi For similarity com-putations both the Hadamard product () and theinner product 4 middot 5 are used and the minimum (min)operator is defined to operate componentwise on twovectors as well as on a vector and a scalar The max-imum (max) and average (avg) operators are similarTheir logic is defined in Equations (1)ndash(4) The weightof an IP address is defined by its popularity in anal-ogy to the inverse document frequency used in text

Table 2 Similarity Metrics Between Two Userrsquos Location Visit Distributions Ex1 and Ex2

Similarity metric Range d4 Ex11 Ex25

1 scount4 Ex11 Ex25= Exbool11 middot Exbool12 6015 1

2 sbool4 Ex11 Ex25= min4scount4 Ex11 Ex25115 80119 1

3 scount1W 4 Ex11 Ex25= 4 Ew Exbool115 middot Exbool12 6015 103

4 sfreq1min4 Ex11 Ex25= min4 Ex11 Ex251 6015 2

5 sfreq1max4 Ex11 Ex25= max4 Exbool12 Ex11 Exbool11 Ex251 6015 3

6 sfreq1 avg4 Ex11 Ex25= avg4 Exbool12 Ex11 Exbool11 Ex251 6015 205

7 sfreq1min1W 4 Ex11 Ex25= min4 Ex1 Ew1 Ex2 Ew51 6015 206

8 sfreq1max1W 4 Ex11 Ex25= max44 Exbool12 Ew5 Ex11 4 Exbool11 Ew5 Ex251 6015 309

9 sfreq1 avg1W 4 Ex11 Ex25= avg44 Exbool12 Ew5 Ex11 4 Exbool11 Ew5 Ex251 6015 3025

10 scosine4 Ex11 Ex25=Ex1 middot Ex2

Ex11 middot Ex21

60117 0074

11 scosine1W 4 Ex11 Ex25=4 Exbool11 Ew5 middot 4 Exbool12 Ew5

Exbool11 Ew1 middot Exbool12 Ew1

60117 0060

12 sJaccard4 Ex11 Ex25=scount4 Ex11 Ex25

max4 Exbool111 Exbool1251

60117 0033

13 sJaccard1W 4 Ex11 Ex25=scount1W 4 Ex11 Ex25

max4 Exbool11 Ew1 Exbool12 Ew51

60117 0026

14 sJaccard1 freq4 Ex11 Ex25=sfreq1min4 Ex11 Ex25

max4 Ex1 Exbool121 Ex2 Exbool1151

60117 0067

15 sJaccard1 freq1W 4 Ex11 Ex25=sfreq1min1W 4 Ex11 Ex25

max44 Ex1 Exbool125 Ew1 4 Ex2 Exbool115 Ew51

60117 0067

mining (Hotho et al 2005) the weight wi for IPi (theinverse device frequency) is defined as the logarithmof the total number of users n divided by the numberof (unique) users that visit that specific IP address ni

Ez= Ex Ey1 zi = xi times yi1 (1)

a= Ex middot Ey =sum

i=12m

4xi times yi51 (2)

Ez= min4Ex1 Ey51 zi = min4xi1yi51 (3)

Ez= min4Ex1a51 zi = min4xi1 a51 (4)

wi = log104nni50 (5)

42 Link Strength MetricsFifteen similarity metrics are defined in Table 2 andare illustrated by measuring the strength betweenthe location profiles of two example users Ex1 and Ex2shown in Equation (6) The first user visits IP1 threetimes and IP2 twice whereas the second one visits IP1twice and IP3 once Several variants of the same met-rics are considered As described above some onlytake into account the unique number of shared loca-tions whereas others take into account the frequencyof the visits to these locations and the inverse devicefrequency The latter are denoted by freq and W respectively

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 251

The first similarity metric scount counts the numberof shared locations Since the two example users shareonly IP1 this strength is 1 The IDF weighted versionscount1W sums the IDF values of the shared locationswhich is 13 for our simple example The most basicmetric is given by sbool which is 1 when a location isshared and 0 otherwise indicating whether two usersare neighbors or not

The sfreq1min sfreq1max and sfreq1avg metrics compare thefrequencies of the visits to the shared locations Theytake the minimum maximum and average respec-tively of the frequencies for each shared location andsum them The weighted versions sfreq1min1W sfreq1max1W and sfreq1avg1W do the same but weight each frequencywith the corresponding IDF value

The scosine metric measures the similarity by takingthe cosine of the angle between the two vectors TheJaccard metric is defined as the size of the intersec-tion over the size of the union In this case it countsthe number of shared locations but normalizes theseover the total number of locations both users havevisited Users that visit many locations will have ahigher chance to share some location with anotheruser so the Jaccard metric penalizes them For ourbasic example sJaccard is 1 (IP1) over 3 (IP1 IP2 IP3)Once more variants taking into account the frequencyand the IDF are defined as well

Ex1 = 63 2 071

Ex2 = 62 0 171 (6)

Ew = 6103 107 270

5 Experimental Setup51 Data SetThe data set D for the empirical results is 10 days ofanonymized advertising bid requests from mobile de-vices (smartphones tablets etc) observed across sevenRTB exchanges A total of 322770794 bid requestsare observed coming from 42437559 unique userinstances Over these 10 days we observe 41829088unique IP addresses

The number of unique user instances per RTBexchange is given in Table 3 The distributions of thenumber of bid requests per device type and operatingsystem are given in Tables 4 and 5 which show thatthe majority of bid requests come from smartphonesand the iOS operating system

A user (instance) corresponds to an exchange-assigned user ID One person on several deviceswill hence correspond to several users One deviceon several exchanges will often correspond to sev-eral users Also it might be that one ad exchangesometimes provides different user IDs to the same

Table 3 User Instances per RTB Exchange

Exchange Volume Percentage ()

RTB 1 311751979 6RTB 2 211591686 4RTB 3 1415661098 29RTB 4 2015011243 41RTB 5 212751015 5RTB 6 510551815 10RTB 7 212231418 4

Table 4 Distribution of Bid Requests per Device Type

Type Volume Percentage ()

Phone 29213581772 91Tablet 3010161334 9

Table 5 Distribution of Bid Requests per OperatingSystem

Type Volume Percentage ()

Android 12913151617 4001iOS 19219231648 5908Windows 5311529 002

device for example when the device is using differ-ent particular apps As described above an impor-tant task in online advertising is to be able to targetadvertisements to different instances of the same indi-vidual with reasonable probability regardless of thecomplexities of identification in the baroque onlineadvertising ecosystem Figure 5 visualizes a sample oflatitudes and longitudes broadcast by mobile devicesin our data set

52 SamplingFor the results in this paper we apply the GSN anal-ysis to data samples Each sample corresponds to acohortneighborhood and is built as follows A sin-gle (random) ldquoegordquo user is chosen user X aroundwhich a neighborhood is built Every neighbor of theego user must share at least one IP address so webegin by obtaining all (anonymized) IP addresses vis-ited by user X Next all other users that visited atleast one of these IP addresses (the neighbors) areadded The ego must have at least one neighbor to beincluded in this analysis Finally we obtain the com-plete location distribution of each of these neighborsWhen the next sample is chosen we ensure that noneof the users already included in the previous samplesis taken as the ego user (X) As such we ensure thatwe do not have repeated egondashneighbor dyads Theremay be shared neighbors of two different egos buttwo egos will never be immediate neighbors in thedata sample

The result of the sampling over 500 cohorts is summa-rized in Figure 6 on average an ego has 71 neighbors

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers252 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

Figure 5 Scatter Plots of Samples of Latitudes and Longitudes Broadcast by Mobile Devices

ndash130 ndash120 ndash110 ndash100 ndash90 ndash80 ndash70 ndash6020

25

30

35

40

45

50

55(a)

Long

Lat

ndash20 ndash10 0 10 20 30 40 50

Long

30

35

40

45

50

55

65

60

75

70

Lat

ndash200 ndash150 ndash100 ndash50 1000 50 150 200

Long

ndash100

ndash80

ndash60

ndash40

ndash20

0

40

20

100

60

80

Lat

4068ndash7406 ndash7404 ndash7402 ndash7400 ndash7394ndash7398 ndash7396 ndash7392 ndash7390

Long

4070

4072

4074

4076

4080

4078

4086

4082

4084

Lat

(b)

(c) (d)

Note Note that locations are plotted without reference to any map The figure gives a striking picture of population density across the world and makes uswonder what is going on with mobile devices in Antarctica

Figure 6 (Color online) The Distribution of the Number of Neighborsfor 500 Randomly Sampled Ego Users

0 500 1000 1500 2000 2500 3000 35000

50

100

150

200

250

300

350

400

450

No neighbors

Fre

quen

cy

the median being 1 The maximum number of neigh-bors is 3167 this user probably visited a publicWi-Fi that thousands of other users also logged intoWe chose 500 cohorts as we observe that the statis-

tics become quite stable as seen in Figure 7 Even200 cohorts seems to be enough to obtain stablesample statistics

The life span of user instances in this set of500 cohorts is shown by the cumulative frequency inFigure 8 Observe that only about 40 of the userinstances are seen at different hours About 30 ofthe user instances are seen only once in the sampledweek This illustrates that it is vital to understand thenotion of user instances These percentages are mis-leading if we interpret the user instances as individu-als or devices One individualdevice without stableidentification in the online advertising ecosystem willcreate many transient user instances which will driveup the percentage of user instances that are seen onlyonce or that are seen only within a short timeframeWe have not figured out how to determine or esti-mate reliably how many ldquounstablerdquo usersdevices arerepresented by the user instances in these data Wekeep all of the data for the analyses below (rather thanfiltering very short-lived user instances) to maintainrealism

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 253

Figure 7 Evolution of Several Sample Statistics as an Increasing Number of Cohorts are Created

0 100 200 300 400 5000

10

20

30

40

50

60

70

Average No network neighbors

0 100 200 300 400 5000

2

4

6

8

10

12

Median No network neighbors

0 100 200 300 400 5000

50

100

150

200

250

300

350

Average No IPs

0 100 200 300 400 5000

2

4

6

8Median No IPs

Notes The horizontal axis denotes the number of cohorts By 500 cohorts the statistics have become stable

Figure 8 (Color online) Cumulative Frequency of Life Span ofUser Instances

100 101 102 103 104 105 1060

01

02

03

04

05

06

07

08

09

10

x = lifespan (sec)

F(x

)

End of June (mobile only)

larr 1 min larr 1 hour larr 1 day

Cumulative frequency of user instances

6 Study 1 Connecting Instances ofthe Same Individual

Returning to our motivating application as men-tioned previously an important task in mobile adver-tising is to be able to target a particular individual inthe face of the individualrsquos fragmentation into differ-ent user instances (for example as seen through dif-ferent bidding systems) Advertisers understand that

it may be impossible to find the same user with 100accuracy Nonetheless it often is important for themto reach the userrsquos various instances without target-ing too many others9 The GSN is designed in partbased on the hypothesis that different instances of thesame person would visit similar locations and there-fore they will be connected in the GSN In this sectionwe present results examining whether and to whatextent that is indeed the case Furthermore the resultsquantify to what extent the same user receives a highGSN-similarity to herself as compared to her otherneighbors

Three experimental settings are used to assess judg-ments that two users are the same The first two sim-ulate different user instances by splitting up the bidrequests of known user instances The last one is notsimulated it links users via the recently introducedidentifier for advertising (IFA) observed over severalRTB exchanges for a subset of device instances

1 Random We divide the IP address visits of a userrandomly across two simulated users Table 6 showsfor example how we can divide a userrsquos IP visits across

9 The astute reader may notice that if the other users targeted arevery similar to the target user the advertising may be well placeddespite not being shown to the exact target user That is not thesubject of this study Study 2 shows evidence that these two notionsare not independent when using the GSN

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers254 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

Table 6 Randomly Splitting User X rsquos IP VisitsCreating Two New Artificial Users X1 and X2

IP1 IP2 IP3

X 1 5 0X1 0 3 0X2 1 2 0

Table 7 The Volume of IFA Occurrence Within OurData Set

Volume

RTB 1 119401855RTB 2 113831962RTB 3 7891292

Table 8 The Frequency of Unique IFAs Visible AcrossRTB Exchanges

Frequency

RTB 1RTB 2 3061594 (1016)RTB 2RTB 3 451256 (169)RTB 3RTB 1 181417 (085)

two new users X1 and X2 for which we know thatthey are actually the same individual Results withRandom should be used only as a ceiling on perfor-mance because the results will be overly optimistic

2 Temporal A more realistic split-up of the trans-actions is a temporal split-up where first the middleday of Xrsquos transactions is determined (hence dynam-ically chosen for each user) All IP address visits upto this middle day are assigned to X1 and all trans-actions after the middle day are assigned to X2 Itwill only be applied to users who are seen on multi-ple days

3 Identifier for advertising IFA is an advertisingidentifier from Apple that users can change or asknot to be used for advertising and is seen as aprivacy-friendlier version of the controversial fixeddevice ID (the ldquoUDIDrdquo) The IFA is observed in threeexchanges with volumes shown in Table 7 for exam-ple 1940855 unique IFAs are seen in RTB 110 Whenwe observe the same IFA across different exchangeswe now know that two devices in different exchangesactually are the same The frequency of unique IFAsvisible across exchanges is shown in Table 8 For exam-ple a total of 3018223 IFAs are seen when we look atboth RTB 1 and RTB 2 with 306594 of these (1016)seen on both exchange 1 and exchange 2 What ifthere were not an IFA and the device would havehad a different identifier in each exchange To what

10 For confidentiality RTB 1 in this setting is not necessarily thesame as RTB 1 in sectsect4 or 7

extent are these two device instances then connectedin the GSN

With these different ldquogold standardrdquo methods toindicate whether or not two users are the samewe assess the ability of the geosimilarity metricsto link the same user determining how often twoinstances of the same person are connected in theGSN (sect61) Afterward we shall assess the differentGSN strength metricsrsquo ability to rank users determin-ing to what extent two instances of the same personexhibit strong geosimilarity as compared to the otherneighbors (sect62) For the random and temporal split-ups 200 samples are used For the IFA-based methodthe complete population is used for measuring thedegree of connectedness to the same user to analyzethe relative strength of the connected users 200 sam-ples are used

61 Same Individual Different ldquoScreensrdquoWe now present results on the GSN connectednessmeasured as how often two instances of the same userare connected in the GSN (in percentage) Note thatthe results summarized in Table 9 should be inter-preted in light of the limited number of days in build-ing the research data set which will limit the GSNconnectivity Nonetheless the relatively high connec-tivities observed even with this sample provide quitepromising results which lends considerable supportto the merit of the overall design

Random0 Using the random split two users thatcorrespond to the same individual are not connectedin only 9 of the neighborhoods Moreover in thesecases user X (whose IP visits are split up) usuallyhas only one or two entries and the random split-upresults in no visits for one of the two simulated users

Temporal0 The temporal split-up is more realisticthan the random one where we assume that a personuses simulated device 1 in the first half of the week(eg during the weekend) and simulated device 2 inthe second half of the week (eg during the week)

Users X1 and X2 are not connected in 32 of theneighborhoods Whereas the random split-up was toooptimistic the temporal one is too pessimistic as inreality there likely is some overlap in the time peri-ods of use of two different devices If user X visitsan IP address only in the first days of the week this

Table 9 To What Extent is the Same User Connectedto Herself

connected

Random 91Temporal 68IFAmdashRTB 1RTB 2 67IFAmdashRTB 2RTB 3 82IFAmdashRTB 3RTB 1 73

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 255

IP address will not occur in user X2rsquos transactions andreduces the chance to be connected to X1

IFA0 We look at all combinations of the three ex-changes with IFA identifiers For example there are45256 unique IFAs seen on both RTB 2 and RTB 3 In82 of these the two IFAs visit at least one same IP onboth exchanges and hence are connected in our GSN

The bottom line is that even with this limited slice ofonline behavior different instances of the same indi-vidual are connected in the GSN 70ndash80 of the timeEncouragingly the results from the simulated scenar-ios concur with the results from the real (IFA) sce-narios (including the aforementioned optimism andslight pessimism of the two simulated scenarios)Consistency over these different settings gives addi-tional confidence in the results suggesting that theGSN indeed holds promise for a high degree of suc-cess at targeting the same user across the fragmenta-tion of the online advertising ecosystem

62 Ranking User InstancesNext given that two user instances of the same indi-vidual are indeed connected we assess to what extentthe second instance of the same user is ranked highlyamong its network neighbors based on the weightof their linkage in the GSN We first will discuss themost general results using the best ranking methodand then will look across different ranking methods

We decompose the results based on the numberof network neighbors (NNs) since there are twoboundary cases that require special interpretationTables 10ndash14 show the proportions of the ego net-works that fall into three scenarios First a user mayhave only one NN In this scenario the ranking isperfect (trivially)mdashthe other instance of the same useris the only user instance given a nonzero score As

Table 10 In What Portion of Cases Is the Same UserConnected Depending on the Number ofNetwork NeighborsmdashRandom Split-up

Ego has only 1 NN 3104Ego has exactly 2 NNs 1905

Same user ranked first 8005Ego has more than 2 NNs 4901

Note For the scenario with two NNs the table reports thepercentage of cases where the other known instance of thesame user is ranked first (including ties)

Table 11 In What Portion of Cases Is the Same UserConnected Depending on the Number ofNetwork NeighborsmdashTemporal Split-up

Ego has only 1 NN 3008Ego has exactly 2 NNs 1807

Same user ranked first 8309Ego has more than 2 NNs 5005

Table 12 In What Portion of Cases Is the Same UserConnected Depending on the Number ofNetwork NeighborsmdashIFA RTB 1RTB 2Split-up

Ego has only 1 NN 4401Ego has exactly 2 NNs 1201

Same user ranked first 8201Ego has more than 2 NNs 4400

Table 13 In What Portion of Cases Is the Same UserConnected Depending on the Number ofNetwork NeighborsmdashIFA RTB 2RTB 3Split-up

Ego has only 1 NN 6109Ego has exactly 2 NNs 1907

Same user ranked first 7102Ego has more than 2 NNs 1804

Table 14 In What Portion of Cases Is the Same UserConnected Depending on the Number ofNetwork NeighborsmdashIFA RTB 3RTB 1Split-up

Ego has only 1 NN 5305Ego has exactly 2 NNs 2501

Same user ranked first 6304Ego has more than 2 NNs 2104

shown in the tables this scenario accounts for 30ndash60 of the ego networks (depending on the split-upmethod)

The second boundary case is when the ego has ex-actly two NNs This scenario accounts for 10ndash25of the cases depending on the split-up method Herethe tables also report whether the other known userinstance is ranked first (including ties) among thetwo neighbors The other known instance of the sameuser is ranked first 60ndash80 of the time dependingon the split-up scenario The IFA results are affectedstrongly by the particular RTB settingmdashthis may indi-cate that the publishers on the different RTBs have dif-ferent policies for passing the IFA Importantly theseresults are conservativemdashpossibly very conservativeThe other NNs may also be instances of the same userunbeknownst to usmdashfor example in the IFA casebecause an app does not pass the IFA to the RTB orbecause the other user is the same user on a differentdevice and thus does not share the IFA

The final scenario is when more than two NNsare present As shown in the tables this accounts for20ndash50 of the cases depending on the split-up sce-nario In this case we measure the percentile in theranking where the instance of the known same userfalls (see below) as well as the area under the receiveroperating characteristic curve (AUC) (Fawcett 2006)which is equivalent to the MannndashWhitneyndashWilcoxon(MWW) statistic The AUCMWW measures to what

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers256 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

Table 15 When There Are More Than Two Network Neighbors to What Extent Is the Known Same User StronglyConnected to Itself

AUC (if gt 2 NNs) () Average rank percentile (if gt 2 NNs) ()

Ranking metric R T RTB 12 RTB 23 RTB 31 R T RTB 12 RTB 23 RTB 31

1 78 64 68 59 55 29 48 34 43 542 59 58 67 58 54 45 53 34 44 553 79 69 74 61 61 29 44 29 40 484 86 75 69 61 59 24 40 32 41 505 83 70 71 59 63 26 44 31 43 476 85 74 71 60 65 25 41 31 42 457 86 76 73 62 64 24 40 29 40 468 83 70 72 59 65 26 44 30 43 459 85 74 71 60 67 25 41 30 42 44

10 79 69 69 61 63 29 44 32 41 4711 79 69 71 62 64 44 44 30 41 4612 81 68 68 60 62 28 46 34 42 4913 81 70 71 61 64 27 44 30 40 4614 74 67 66 61 48 34 47 35 41 6015 74 67 66 61 48 34 47 35 41 60

Note See the text for a discussion of ranking metrics

extent data points with label 1 are ranked higherthan those with label 0 In the results that follow(see Table 15) the AUC is measured for each samplewhere any user instance among the neighbors thatcorresponds to the same individual is labeled as 1and all others are labeled as 0 Thus in this settingthe AUC measures how well the neighbors are rankedby the likelihood of being the same individual11 Theaverage over all samples is reported If for a sampleall neighbors have the same score that AUC is setto 05 In addition to the factors noted above this alsowill tend to make the results conservative For exam-ple if the GSN links an ego user to a set of neighborsthat are all instances of the same individual in thisevaluation the AUC will be 05

The ranking measure is the percentile in the rank-ing at which the known-same-user instance is rankedwith lower values being better For example if thesame-user instance has the highest score (closest con-nection) among five neighbors the rank is 20 (15)Note again that this is quite conservative in thisexample the user is ranked at the top of the list butbecause there were only five neighbors the highestpossible score is 20 In case the known-same-userinstance has the same score as another neighbor theaverage rank is reported

As shown in Table 15 all AUC values exceed 05ranging up to 08 This means that in every casea known-same-user instance is likely to be rankedhigher than an instance not known to be the same

11 Technically in this setting the AUCMWW is the probability thata randomly selected neighbor that is in fact a known instance of thesame user is ranked more highly than a randomly selected neighborthat is not known to be an instance of the same user If the known-same-user instances are all ranked above the other-user instancesthe AUC = 100 If they are all below then the AUC = 0

user The frequency-based ranking metrics (4ndash9) per-form quite well in connecting instances of the sameuser strongly

Almost all metrics perform better than a randommodel with performances that even come close to thebest possible solution Consistently performing quitewell are the frequency-based metrics (metrics 3ndash9)with metric 7 (the minimum of the IDF weights ofthe shared locations) getting the most wins Not per-forming well is the very basic count metric as well asthe Boolean metric that provides a binary score onlyInteresting to note the Jaccard metrics do not performwell either It seems that penalizing users that visitmany IP addresses negatively affects the results anissue we will return to in the next study

7 Study 2 Does the GSN Select Userswith Similar Interests

The foregoing section addressed our paperrsquos firstclaimmdashthat the GSN design indeed links instances ofthe same user and links them more strongly thaninstances of other users In this section we turn to thepaperrsquos second claimmdashthat the GSN links users withsimilar interests We will measure similar interests bysimilarity in behavior accessing particular publisherson the Web and using mobile apps12

71 Connecting Users with Similar Interests

To What Extent Are the GSN Neighbors of Visi-tors to Particular Publishers Also Visitors to ThoseSame Publishers To assess whether the GSN con-nects users with similar interestsbehavior we select

12 From now on all user instances will be complete real user in-stances the simulated split-up scenarios (Random Temporal) werecreated specifically for the purpose of the same-user study

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 257

Table 16 Distribution of Publishers per RTB in Terms of thePercentage of Users in Our Data Set That Visit the GivenPublisher

Publisher of D

RTB 1Burstly 3004myYearbookcom 1062My Fitness Pal 0065TextMe Inc 0056GameResort 0025Flixster 0019247Sports 0015Daily Workout Apps LLC 0009Mobile Deluxe 0007YoYo Games Ltd 0004

RTB 2mworldnowcom 0032mtmzcom 0024mtopixcom 0014babycentercom 0015meetmecom 0006appsfacebookcom 0006itunesapplecom 0039appevitecom 0002beautyandskincarescom 0003mmajunkiecom 0001

RTB 3Conversion Exchange 5053Pinger Inc 2096Top Game Developer 4029PremiumEntertainmentApp-FamilyOriented-Android 0059Enflick 0053Clapfoot Games 0044FunPokes Inc 0018Talkatone 0015Sevenlogics Inc 0008A Star Software 0007

RTB 4Pinger PhonemdashiOSmdashConversation 1001Talkatone iPhone App 0007DiceWithBuddies 00139GAG Reader 0012Video Downloader Pro Lite 0016Rage Comics 0009Spades 3D Lite 0006Apalabrados 0003Relax Melodies HD 0002Celeb MemdashPhotoMaker 0000

a diverse collection of 40 publishers listed in Table 16A publisher can correspond to a mobile app a web-site or a set of appswebsites We selected both pub-lishers that are visited by many users and more nichepublishers with fewer visitors As can be seen fromTable 16 several types of publishers can be distin-guished the main groups are games (eg GameRe-sort Top Game Developer) and social networkingtools (eg appsfacebookcom Text Me Inc)

More specifically we create 200 samples for 40 pub-lishers 10 publishers for each of four different RTB

exchanges listed in Table 1613 The percentage ofunique users that visit each publisher in our completedata set is also shown (D) To create each samplean ldquoegordquo user is selected that has visited the pub-lisher in question (an ldquoego visitorrdquo) this ego visi-torrsquos GSN neighborhood is created Next we assesshow many of the egorsquos neighbors also visited thatsame publisher This will be compared with a base-line visit rate to produce lift and leverage values(Provost and Fawcett 2013)mdashspecifically how muchmore likely is it for the neighbors of ego visitors tovisit the publisher than would be expected by chanceWe consider three different setups that produce dif-ferent baselines

1 Considering all RTB exchanges to determine the base-line A user is considered a visitor for a given pub-lisher if she visits the publisher on any RTB ex-change (since some publishers are visible on differentexchanges) Hence the baseline visit rate is the num-ber of users (on any RTB) who visit that publisherdivided by the total number of users across all RTBexchanges This will provide optimistic lifts sincemost publishers appear only on one RTB exchangewhereas the baseline is measured over all exchanges

2 Considering one RTB exchange only to determine thebaseline A user is considered a visitor for a given pub-lisher and RTB if she visits the publisher on the sameRTB exchange This changes the baseline to the num-ber of users (on this RTB) who visit that publisherdivided by the total number of users on this RTBexchange The lift and leverage results will be worsethan in the first setting because the baseline will belarger Seeing that some publishers are seen on differ-ent RTB exchanges this setting is slightly pessimistic

3 Taking time of location visits into account In thethird setting we link users only if they visit an IPwithin the same time period This will be elaboratedon in sect73

Results are shown in Tables 17 and 18 The percent-age of publisher visitors in the neighborhood is givenas N This percentage is compared with the baselinepercentage of users that visit that publisher in thecomplete data D using two metrics the lift is the ratioof these numbers the leverage is the difference Liftis particularly useful for very small base rates (con-sider that an activity with a base rate of 30 cannotpossibly have a lift higher than about 3) Leverage ismore telling for larger base rates because it shows theabsolute improvement (here in percentage points)

When considering all NNs the lifts for all publish-ers are greater than 1 except for one Burstly on RTB 1has a lift of 097 (essentially no lift) but only when weconsider RTB 1 as the baseline This publisher already

13 For confidentiality the RTB numbering here is again differentfrom above

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers258 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

Table 17 RTBs 1 and 2 Percentage of Network Neighbors That Also Visit the Publisher

RTB 1

All NNs N1 N10

D N Lift Lev N1 Lift Lev N10 Lift Lev

Considering all RTB exchanges to determine baseline1 Burstly 3004 8063 2084 5059 47022 15054 44018 25061 8043 220572 myYearbookcom 1062 25077 15094 24015 44074 27067 43012 35016 21074 330543 My Fitness Pal 0065 4015 6038 305 3609 56077 36025 15055 23091 14094 TextMe Inc 0056 1805 33006 17094 39002 69075 38046 25018 45 240625 GameResort 0025 4093 19077 4068 18052 74026 18027 11086 47058 110626 Flixster 0019 1055 8015 1036 8014 42085 7095 1079 9042 1067 247Sports 0015 27057 188058 27042 77062 530093 77048 74053 509075 740388 Daily Workout Apps LLC 0009 1092 21047 1083 19023 215046 19014 11054 129027 110459 Mobile Deluxe 0007 1087 28067 108 23053 36009 23046 9063 14707 9056

10 YoYo Games Ltd 0004 3052 99058 3049 21057 609064 21053 13 367045 12096Considering this RTB exchange only to determine baseline

1 Burstly 8085 8063 0097 minus0022 47022 5033 38037 25061 2089 160762 myYearbookcom 4071 25077 5047 21006 44074 905 40003 35016 7046 300453 My Fitness Pal 1089 4015 2019 2025 3609 19048 35001 15055 8021 130654 TextMe Inc 1063 1805 11035 16087 39002 23094 37039 25018 15045 230555 GameResort 0073 4093 6078 402 18052 25049 17079 11086 16033 110146 Flixster 0055 1055 208 0099 8014 14071 7059 1079 3023 10247 247Sports 0043 27057 64073 27014 77062 182023 7702 74053 174097 74018 Daily Workout Apps LLC 0026 1092 7037 1066 19023 73095 18097 11054 44037 110289 Mobile Deluxe 0019 1087 9084 1068 23053 123088 23034 9063 5007 9044

10 YoYo Games Ltd 001 3052 34018 3042 21057 209025 21047 13 126012 1209

RTB 2

Considering all RTB exchanges to determine baseline1 mworldnowcom 0032 13027 40086 12095 54039 167042 54006 31065 97044 310332 mtmzcom 0024 6089 28086 6065 23008 9607 22084 10068 44075 100443 mtopixcom 0014 15007 104079 14093 53045 37106 5303 33075 234065 330614 babycentercom 0015 6087 47031 6072 43048 299056 43033 17074 122024 17065 meetmecom 0006 6002 10308 5097 54072 942093 54066 27085 47909 270796 appsfacebookcom 0006 301 48039 3003 9009 141097 9003 2076 43008 20697 itunesapplecom 0039 25077 65071 25038 51043 131012 51004 29049 75018 290098 appevitecom 0002 48028 21212043 48025 54076 21509068 54074 48028 21212043 480259 beautyandskincarescom 0003 18019 642087 18016 8048 299094 8046 1107 413049 11067

10 mmajunkiecom 0001 1036 201064 1035 26067 3195505 26066 13033 11977074 13033Considering this RTB exchange only to determine baseline

1 mworldnowcom 4034 13027 3006 8093 54039 12053 50005 31065 7029 270312 mtmzcom 3019 6089 2016 307 23008 7024 19089 10068 3035 70493 mtopixcom 1092 15007 7084 13015 53045 27081 51053 33075 17056 310834 babycentercom 1094 6087 3054 4093 43048 22042 41054 17074 9015 15085 meetmecom 0078 6002 7077 5025 54072 70057 53094 27085 35092 270076 appsfacebookcom 0086 301 3062 2024 9009 10063 8024 2076 3022 1097 itunesapplecom 5024 25077 4092 20053 51043 9081 46019 29049 5063 240258 appevitecom 0029 48028 165058 47098 54076 187082 54047 48028 165058 470989 beautyandskincarescom 0038 18019 48011 17081 8048 22045 8011 1107 30095 11032

10 mmajunkiecom 0009 1036 15009 1027 26067 296003 26058 13033 148001 13024

Notes Results are given for the complete data set (D) the complete neighborhood of a user who visited the given publisher (All NNs) and considering theneighborhood of closest 1 and 10 users (N1 and N10)

has a very high baseline and as we will see in thenext section when we consider only the top 1 and 10NNs the lifts go up to 5 and 3 respectively Burstlyprovides a generic platform for building apps and sois a weaker indication of user interest than many ofthe other publishers (fitness babies games)

All other lifts exceed one by substantial mar-gins and go up to even 2391 (Apalabrados onRTB 4 a Spanish-language social Scrabble-like game)

Obviously the conclusion that the NNs indeed aremore likely to visit the same publishers is highly sig-nificant by a simple sign test comparing the visita-tion rate within the geosimilarity neighborhood tothat of overall population either via lift or leveragep lt 0001 Thus NNs of publisher visitors are indeedmore likely also to visit the same websites the abso-lute values of the lifts and leverages suggest that theyare substantially more likely

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 259

Table 18 RTBs 3 and 4 Percentage of Network Neighbors That Also Visit the Publisher

RTB 3

All NNs N1 N10

D N Lift Lev N1 Lift Lev N10 Lift Lev

Considering all RTB exchanges to determine baseline1 Conversion Exchange 5053 34054 6024 29001 64071 11069 59017 50077 9018 450242 Pinger Inc 2096 46002 15053 43006 65012 21097 62015 51048 17037 480523 Top Game Developer 4029 47024 11002 42095 60 13099 55071 48097 11042 440684 PremiumEntertainmentApp-FamilyOriented-Android 0059 23079 40012 2302 20045 34049 19086 22025 37052 210665 Enflick 0053 11071 21095 11017 21057 40044 21004 12027 23 110736 Clapfoot Games 0044 8055 19032 8011 11054 26006 1101 13099 31061 130557 FunPokes Inc 0018 11011 61028 10093 50 275077 49082 15079 87008 150618 Talkatone 0015 13013 88051 12098 31011 209076 30096 17091 120078 170779 Sevenlogics Inc 0008 2071 32059 2063 5088 70074 508 8033 100022 8025

10 A Star Software 0007 3033 44086 3025 23008 311024 23 9009 122061 9002Considering this RTB exchange only to determine baseline

1 Conversion Exchange 11045 34054 3002 23009 64071 5065 53025 50077 4043 390322 Pinger Inc 6013 46002 705 39089 65012 10062 58098 51048 8039 450353 Top Game Developer 8088 47024 5032 38036 60 6076 51012 48097 5052 400094 PremiumEntertainmentApp-FamilyOriented-Android 1023 23079 19038 22057 20045 16066 19023 22025 18013 210025 Enflick 101 11071 1006 1006 21057 19054 20046 12027 11011 110166 Clapfoot Games 0092 8055 9033 7064 11054 12059 10062 13099 15027 130087 FunPokes Inc 0038 11011 2906 10074 50 133022 49062 15079 42007 150418 Talkatone 0031 13013 42076 12082 31011 101034 3008 17091 58035 170619 Sevenlogics Inc 0017 2071 15074 2054 5088 34017 5071 8033 48041 8016

10 A Star Software 0015 3033 21067 3017 23008 150036 22092 9009 59023 8094

RTB 4

Considering all RTB exchanges to determine baseline1 Pinger PhonemdashiOSmdashConversation 1001 71081 70085 7008 80077 79068 79076 70014 6902 690132 Talkatone iPhone App 0007 11038 17203 11031 29003 439068 28097 17002 257078 160963 DiceWithBuddies 0013 7041 59004 7028 18018 144092 18006 7069 61031 70574 9GAG Reader 0012 62013 497099 62 49009 39305 48097 49005 393017 480925 Video Downloader Pro Lite 0016 20003 123015 19087 22058 138084 22042 17093 110027 170776 Rage Comics 0009 609 8001 6081 13033 154086 13025 609 8001 60817 Spades 3D Lite 0006 1205 225076 12044 27027 492057 27022 1205 225076 120448 Apalabrados 0003 77042 21391071 77039 75 21316097 74097 79032 21450043 790299 Relax Melodies HD 0002 31058 21002029 31056 50 31170029 49098 31058 21002029 31056

10 Celeb MemdashPhotoMaker 0 3023 82107 3022 7014 11819048 7014 3023 82107 3022Considering this RTB exchange only to determine baseline

1 Pinger PhonemdashiOSmdashConversation 8051 71081 8044 6303 80077 9049 72026 70014 8024 610632 Talkatone iPhone App 0055 11038 20053 10082 29003 52038 28048 17002 30071 160473 DiceWithBuddies 0009 7041 82011 7032 18018 201054 18009 7069 85027 7064 9GAG Reader 0097 62013 63097 61016 49009 50055 48012 49005 50051 480085 Video Downloader Pro Lite 0044 20003 45036 19059 22058 51014 22014 17093 40062 170496 Rage Comics 0037 609 18079 6053 13033 36032 12097 609 18079 60537 Spades 3D Lite 0046 1205 2609 12004 27027 58068 26081 1205 2609 120048 Apalabrados 0021 77042 371093 77021 75 360031 74079 79032 381006 790119 Relax Melodies HD 0013 31058 238054 31045 50 377069 49087 31058 238054 31045

10 Celeb MemdashPhotoMaker 0003 3023 97089 3019 7014 216076 7011 3023 97089 3019

Notes Results are given for the complete data set (D) the complete neighborhood of a user who visited the given publisher (All NNs) and considering theneighborhood of closest 1 and 10 users (N1 and N10)

The very high lifts could be attributed to two phe-nomena They may provide some evidence that theGSN embeds a true social network as pointed out bythe good results for the social networking publishers(eg Apalabrados with a lift of 2391 appsevitecomwith a lift of 2212 and others) Friends visiting thesame social networking sites (here unusual ones)

would increase the lift and leverage over just havingdifferent instances of the same user visiting the samewebsites An alternative explanation is that some ofthese publishers are much more likely to be visitedacross different devices of the same user It is not clearwhether this is actually true However it argues forobtaining specific and broad data on which devices

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers260 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

are not the same user to shed further light on forexample any broad data that include a (anonymized)user identifier

From these results we can conclude that GSNneighbors indeed exhibit substantially similar pub-lisher visitation behavior Next we assess to whatextent more strongly connected NNs are more similarthan less strongly connected NNs

72 Ranking Users with Similar Interests

Are the More-Similar Geosimilarity Neighbors of aPublisherrsquos Visitors Even More Likely Also to Be Vis-itors to the Same Publisher To assess whether more-similar geosimilarity neighbors are even more likelyto share interests we compute the measures exactlyas in the previous study except instead of consider-ing all NNs only the 1 (N1) and 10 (N10) neighborswith the highest scores are considered (see Tables 17and 18) First we must consider which of the 15 met-rics that are defined to measure link strength to useTable 19 reports the rank aggregations of the differentmetrics as to which provide the best lifts Specificallyfor each of the 40 publishers each scoring metric pro-vides a lift and for that publisher the scoring metricscan be ranked (the best being ranked 1 etc) The rankaggregation is the average rank for a scoring metricacross all 40 publishers Thus if one metric providedthe best lift for all publishers it would get an aggre-gated rank of 1

To visualize the best metrics the metrics with anaverage ranking less than 8 are shown in boldfacein Table 19 The Jaccard metrics do not perform wellin this study either showing that penalizing the con-nection to users that visit many locations is not sen-sible Remember that the Jaccard metric is based onthe idea that users that visit many locations will havea higher chance to share some location with anotheruser However we have so many locations in totalthat visiting a couple more locations will increase onlymarginally this probability of sharing a location bychance A user that logs into more IP addresses andis more active should hence not be penalized andranked lower than other less active users Metric 3which sums the IDF scores of the shared locationsperforms quite well Given the operational efficiencyadvantage over the frequency-based metrics this met-ric is chosen to measure the strength of the links

As seen from the results in Tables 17 and 18 thelift and leverage values for the strongly connected (top 1and top 10) NNs are even higher sometimes astro-nomical (3170 for Relax Melodies HD (RTB 4)) Theseresults are highly significant by sign tests compar-ing either the lift or the leverage between the top-ranked neighbors and the entire cohort (p lt 0001) Theabsolute values again indicate that the users with the

strongest geosimilarity are often substantially morelikely to visit the same publisher

Thus we can conclude that not only does geosim-ilarity find users with similar interests it also ranksusers well by their likelihood of having similarinterests

An interesting follow-up question is whether thecharacteristics of a publisher are of importance for theresults For example are GSN neighbors more sim-ilar in terms of using the same social network appas compared to playing the same games To answerthis question the publishers are categorized into sixclasses funny social news games communicationand miscellaneous In Figure 9 the lifts of the pub-lishers are shown per category Some general trendscan be observed where publishers related to funnycontent have the highest median lift This could beexplained by the fact that users often forward funnycontent to their friends The social websitesapps fol-low next which seems to further demonstrate thatfriends are likely linked in the GSN However pub-lishers in the communication category perform not sowell compared with the other categories and ratherhigh variances are observed in the different cate-gories Please note that high variance is observedacross the categories therefore it is difficult to makewell-supported claims from these results

73 TimingAs a final evaluation we explore whether the inclu-sion of temporal similarity in location visitation ishelpful for defining GSN connections To this end weintroduce a setting where two users are connectedonly if they visit an IP address in the same timeperiod of a day Specifically we divide a day intodifferent time windowsmdashfor example two time peri-ods of 12 hoursmdashwhere users are connected only ifthey visit the same IP address in the same time periodof a day (which might be on different days) Practi-cally two digits are added to an IP which denote thestarting moment of the corresponding time periodWe report on time periods of 2 hours and of 12 hours(other periods yield similar results) We also includethe setting where users are connected only when theyvisit the same IP address on the same day in light ofthe result of Crandall et al (2010) that visiting thesame location at around the same time is indicativeof friendship

The resulting lifts are reported in Figures 10ndash12when considering respectively all NNs the top 1 NNand the top 10 NNs The results when not consider-ing the time period are repeated as well (All) lim-ited to the conservative case of only considering oneexchange (see above) The baseline (lift of one) isindicated with a red horizontal line lifts higher than1000 are shown as 1000 to limit the range to be able

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 261

Table 19 Rank Aggregation per Metric Over All Publishers (Across All RTBs)

Top 1 NN Top 10 NNs

1 Shared unique IPs 801 1 Shared Unique IPs 802252 BooleanmdashSharing an IP 80625 2 BooleanmdashSharing an IP 809133 Shared unique IPsmdashIDF weighted 70025 3 Shared unique IPsmdashIDF weighted 709754 Shared IP visitsmdashMin 70063 4 Shared IP visitsmdashMin 704135 Shared IP visitsmdashMax 7095 5 Shared IP visitsmdashMax 70256 Shared IP visitsmdashAvg 8005 6 Shared IP visitsmdashAvg 704887 Shared IP visitsmdashMinmdashIDF weighted 60625 7 Shared IP visitsmdashMinmdashIDF weighted 703758 Shared IP visitsmdashMaxmdashIDF weighted 70975 8 Shared IP visitsmdashMaxmdashIDF weighted 702259 Shared IP visitsmdashAvgmdashIDF weighted 70938 9 Shared IP visitsmdashAvgmdashIDF weighted 70525

10 Cosine similarity 80338 10 Cosine similarity 804511 Cosine similaritymdashIDF weighted 80213 11 Cosine similaritymdashIDF weighted 8086312 GSN Jaccard 80925 12 GSN Jaccard 913 GSN JaccardmdashIDF weighted 80325 13 GSN JaccardmdashIDF weighted 9013814 GSN Jaccard Freq 80425 14 GSN Jaccard Freq 80515 GSN Jaccard FreqmdashIDF weighted 80425 15 GSN Jaccard FreqmdashIDF weighted 805

to visualize the results Tables 20 and 21 show theaverage rank aggregation (see the discussion of rankaggregation above) in terms of lift per RTB and thep-values of a signed rank test respectively

When comparing the different temporal settingswe see that the ldquoDayrdquo variant performs best and lim-ited differences exist between the time periods of 2and 12 hours More important to observe from theseresults is that (overall) not including time in the GSNdesign outperforms all three temporal variants sub-stantially Only for RTB 3 and the Day version whereusers are connected only if they visit the same IPaddress on the same day are the lifts not signifi-cantly worse than having no time constraint at a 1level (although even here they are significantly worseat a 10 level) This shows that for finding userswith similar interests which locations they visit ismore important than when they visit them By includ-ing the time constraints connections between previ-ously connected users are lost simply because they

Figure 9 (Color online) Lifts of the Publishers (for All Four RTBs)Ranked According to the Median Lift Within the Category(Medians Shown by the Full Line)

visit locations in different time periods Although weuse time to define links metrics that include time todefine the strength of a link might improve the resultsfurther and we consider that an interesting issue forfuture research

8 ConclusionThis paper presents a new design for using geosim-ilarity to connect user instances in a geosimilarity-weighted network In the GSN users are connected ifthey share at least one observed location Various met-rics can be used to yield degrees of similarity We pre-sented intuitive arguments theory and prior researchthat suggest that geosimilarity and the GSN designshould link similar users We also provided strongempirical support Specifically Study 1 shows thatthe GSN indeed links different users instances cor-responding to the same individual and the geosimi-larity metrics rank user instances by their likelihoodof corresponding to the same individual Study 2shows that the GSN links users who have similarinterests as measured by their propensity to visitthe same publishersuse the same apps It also pro-vides strong evidence that the geosimilarity metricsrank user instances by their likelihood of havingsimilar interests Taken together the results providestrong support that the GSN links users with similarinterestsmdashin many cases because they actually repre-sent the same individuals

The main limitation of the research presented inthis paper is that the data are not sufficient to dis-tinguish whether the similarity in interests is solelybased on connecting instances of the same user Thisis an important avenue for future research This lim-itation notwithstanding the very strong results fromStudy 2 have important implications in either caseeither the network is indeed finding different indi-viduals with substantially similar interests or we

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers262 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

Figure 10 (Color online) Lifts Considering All Network Neighbors per RTB System (Each Panel) With and Without the Time Constraint Limited toOne Network (see sect71) for the 40 Publishers

RTB 3mdashAll NNs

0 5 10

Publisher

0

20

40

60

80

100

Lift

RTB 4mdashAll NNs

0 5 10

Publisher

0

200

400

600

800

1000

Lift

RTB 2mdashAll NNs

0 5 10

Publisher

0

200

400

600

800

1000

Lift

0 5 100

200

400

600

800

1000RTB 1mdashAll NNs

Publisher

Lift

All

TP2

TP12

Day

Note Lifts higher than 1000 are shown at 1000

Figure 11 (Color online) Lifts Considering the Closest 1 Network Neighbor per RTB System (Each Panel) With and Without the Time ConstraintLimited to One Network (see sect71) for the 40 Publishers

Lift

0

100

200

400

300

RTB 3mdashTop 1 NNs

0 5 10

Publisher

All

TP2

TP12

Day

0 5 100

200

400

600

800

1000RTB 1mdashTop 1 NNs

Publisher

Lift

RTB 2mdashTop 1 NNs

0 5 10

Publisher

0

200

400

600

800

1000

Lift

RTB 4mdashTop 1 NNs

0 5 10

Publisher

0

200

400

600

800

1000

Lift

Note Lifts higher than 1000 are shown at 1000

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 263

Figure 12 (Color online) Lifts Considering the Closest 10 Network Neighbors per RTB System (Each Panel) With and Without the Time ConstraintLimited to One Network (see sect71) for the 40 Publishers

RTB 2mdashTop 10 NNs

0 5 10

Publisher

0

200

400

600

800

1000

Lift

0 5 10

Publisher

Lift

0

50

100

150

0 5 10

Publisher

0

200

400

600

800

1000

Lift

0 5 100

200

400

600

800

1000

RTB 1mdashTop 10 NNs

RTB 3mdashTop 10 NNs RTB 4mdashTop 10 NNs

Publisher

Lift

All

TP2

TP12

Day

Note Lifts higher than 1000 are shown at 1000

have very strong additional support for the rankingresults of Study 1 as the neighbors with the strongestgeosimilarity have significantly higher affinity for thesame publishersapps

These results have broad and immediate manage-ment implications within our motivating applicationof mobile advertising As discussed above digitalmarketers would like to target the many fragmentedinstances of individuals in the digital advertising

Table 20 Rank Aggregations in Terms of Lift AveragedAcross All Publishers (per RTB)

All TP2 TP12 Day

RTB 1 101 209 309 201RTB 2 1 204 304 302RTB 3 101 3 4 109RTB 4 101 208 308 203

Note Linking based on time does not improve over linkingignoring time

Table 21 Evaluation of Including Time p-Values ofSigned Rank Test

All TP2 TP12 Day

RTB 1 mdash 0 0002 0 0002 0 0006RTB 2 mdash 0 0002 0 0002 0 0002RTB 3 mdash 0 0002 0 0002 00065RTB 4 mdash 0 0002 0 0002 0 0004

ecosystem as well as users who have similar intereststo particular ldquoseedrdquo users Location data such asIP addresses are readily available across the adver-tising ecosystem The most direct use of the GSN formobile ad targeting is to seed a targeting campaignwith users chosen to exhibit some characteristics ofinterest These could be chosen through any of themyriad of methods currently used in advertising TheGSN will directly allow the targeting of the othermembers of these usersrsquo geosimilarity cohorts whichwill include other instances of the same user othersimilar users and likely both So for example if a setof users has been identified to have brand affinity viavisits to a brandrsquos website the GSN cohorts of theseusers would provide an attractive avenue to expandthe reach of a campaign to target them (extendingtraditional retargeting) and also target users similarto the seeds (customer prospecting) In the researchdata set we see that different instances of the sameusers are very often connected in the GSN and thatthose connected in the GSN exhibit substantially sim-ilar interests

The GSN also could be used for (privacy-friendly)hyperlocal targetingmdashmeaning targeting people whofrequent a particular location without needing to storedata on the actual locations of the users as describedearlier IP addresses currently are used by somemarketers for coarse-grained demographic targetinginferring geography from IP address registration data

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers264 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

and then inferring demographics from the geographyThe GSN provides a complementary alternative it willconnect people who visit the same fine-grained loca-tions As described above in the couponing exampleif a campaign is seeded with customers of a localestablishment (eg via an online loyalty program)geosimilarity can target others who frequent the samelocations

An exciting potential future use of the GSN is theevaluation of marketing campaigns across differentchannels This has become an important challenge formobile advertizing (Lee 2014)14 For example I mightreceive an ad for a hot new product on my mobiledevice I may be interested in the product but Irsquomunlikely to buy this product on my mobile phoneRather Irsquoll buy the product using my laptop or PC Bylinking the screens of the same user we are now ableto provide a broader (and possibly much more robust)estimate of campaign success (eg conversion rate)across different channels by aggregating the successmetrics across GSN neighbors For example similar tohow a traditional ad campaign may runnot run anad in a controlled study across different cities agen-cies could targetnot target different (matched) usersand then look at the success rates in their respec-tive geosimilarity neighborhoods This should captureeffects of the same individual on different devices

As discussed at the outset advertisers need to becognizant of privacy concerns regarding the collec-tion storage and use of data This paper followswhat the FTC calls the ldquoprivacy by designrdquo approachBy design the technique does not need to store anydirect personal information about mobile users thereis no need for nonanonymized identifiers demo-graphics geographics psychographics etc In addi-tion the storage of indirect information about userscan be severely limited as well The method doesnot need the actual locationsmdashonly anonymized loca-tion ldquokeysrdquo So for example IP addresses can bereplaced with random numbers15 without affectingthe GSN performance (European Commission 2014Federal Trade Commission 2012) More technically atthe ldquoouter wallrdquo of the system or firm each deviceid can be irreversibly hashed to a random key Theonly requirement is that the same device be hashedto the same key if encountered again Similarly atthe ldquoouter wallrdquo of the system or firm every loca-tion also can be irreversibly hashed to a random keyThe GSN can be formed just the same with the ran-dom keys as with the actual locations If more privacy

14 httpwwwclickzcomclickznews233699680-of-marketers-will-run-cross-channel-marketing-campaigns-in-2014-study15 See both the FTCrsquos and the European Commissionrsquos 2010ndash2011privacy-by-design reports and the comments therein along withProvost et al (2009) for a thorough treatment of this topic

is desired hashing can be done irreversibly many toone and in that case it becomes impossible to asso-ciate definitively any particular location with any par-ticular deviceuser

Finally an aspect of this research that is importantfor managers but not typically covered in the predic-tive modeling literature is the notion of increasing thereach of a campaign For example as discussed abovepossibly the most straightforward use of the GSN isto select the same actor on different mobile devicesThis would allow us to expand the reach of a ldquoretar-getingrdquo campaign16

Increasing reach has important subtleties in the cur-rent advertising ecosystemmdashit is not just the otherside of the coin of increasing lift The reason is thatthere are many targeters in the onlinemobile adver-tising ecosystem all of whom are using the samedata retargeting data and demographicgeographicpsychographic data that are purchased from third-party data providers However these data are avail-able only on a subset of devices Thus all partieswho are using these data are competing for the samesometimes small set of devices Since large advertis-ers typically contract with multiple targeting firmswho take different strategies the effect is that theadvertiser is paying the firms to compete against eachother effectively raising the price in the auction andthereby raising their own cost of advertising17 Whatis more these will be the same devices that also willbe targeted for other campaigns because they are thedevices for which the targeters have data

However by connecting devices in a GSN we canexpand campaigns to devices for which there are noretargeting or third-party data available at all If thereis less competition for these devices we should beable to target them for a lower price Thus expand-ing reach has an implication for the cost-effectivenessof achieving a certain level of predictive performance(eg lift)

AcknowledgmentsThe authors thank the anonymous reviewers and edi-tors for very constructive suggestions which substantiallyimproved the paper This work was done while the firstauthor was at Coriolis Ventures and Everyscreen MediaThe authors extend our gratitude to Everyscreen MediaDstillery Lauren Moores Tom OrsquoReilly Vinayak Javaly andTina Eliassi-Rad for many discussions about mobile adver-tising and mobile ad targeting as well as technical assistance

16 Retargeting is to target browsers who have previously purchasedfrom the brand or who have taken some other indicative brandaction such as browsing the brandrsquos site Retargeting is consid-ered by many to be one of the most effective targeting strategies(although to our knowledge these conclusions are drawn based onassessing conversion rate rather than the influence of the advertis-ing please see Stitelman et al 2011)17 The latter is our conjecture

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 265

to obtain all of the relevant data This particular data set wasnot necessarily used in the development of any productionmodel used for mobile advertising The first author thanksNEC and Andre Meyer for faculty fellowships The authorsthank the Moore and Sloan Foundations for their generoussupport of the Moore-Sloan Data Science Environment atNew York University

ReferencesAgarwal R Gupta AK Kraut RE (2008) Editorial overviewmdashThe

interplay between digital and social networks Inform SystemsRes 19(3)243ndash252

Aral S Muchnik L Sundararajan A (2009) Distinguishing influence-based contagion from homophily-driven diffusion in dynamicnetworks Proc National Acad Sci USA 106(51)21544ndash21549

Bampo M Ewing MT Mather DR Stewart D Wallace M (2008)The effects of the social structure of digital networks on viralmarketing performance Inform Systems Res 19(3)273ndash290

Cho E Myers SA Leskovec J (2011) Friendship and mobility Usermovement in location-based social networks Proc 17th ACMSIGKDD Internat Conf Knowledge Discovery and Data MiningKDD rsquo11 (ACM New York) 1082ndash1090

Cortes C Pregibon D Volinsky CT (2001) Communities of interestHoffmann F Hand DJ Adams N Fisher D Guimaraes G edsAdvances in Intelligent Data Analysis Lecture Notes ComputSci Vol 2189 (Springer-Verlag Berlin Heidelberg) 105ndash114

Crandall DJ Backstrom L Cosley D Suri S Huttenlocher DKleinberg J (2010) Inferring social ties from geographic coinci-dences Proc National Acad Sci USA 107(52)22436ndash22441

de Montjoye YA Hidalgo CA Verleysen M Blondel VD (2013)Unique in the crowd The privacy bounds of human mobilitySci Rep 31ndash5

European Commission (2014) Data protection day 2014 Full speedon EU data protection reform Press release January 27httpeuropaeurapidpress-release_MEMO-14-60_enhtm

Fawcett T (2006) An introduction to ROC analysis Pattern Recogni-tion Lett 27(8)861ndash874

Federal Trade Commission (2012) Protecting consumer pri-vacy in an era of rapid change FTC Report March 2012httpswwwftcgovsitesdefaultfilesdocumentsreportsfederal-trade-commission-report-protecting-consumer-privacy-era-rapid-change-recommendations120326privacyreportpdf

Heidemann J Klier M Probst F (2010) Identifying key users inonline social networks A pagerank based approach InformSystems J 4801(December)157ndash160

Hill S Provost F (2003) The myth of the double-blind reviewAuthor identification using only citations SIGKDD Explorations5(2)179ndash184

Hill S Provost F Volinsky C (2006) Network-based marketingIdentifying likely adopters via consumer networks Statist Sci22(2)256ndash276

Hotho A Nuumlrnberger A Paass G (2005) A brief survey of text min-ing LDV Forum 20(1)19ndash62

Kerho S (2012) Mobile marketingmdashA new analytics frameworkwhat we have and what we need Presentation Marketing onthe Move Conference Wharton School University of PennsylvaniaPhiladelphia February 28

Kossinets G Watts DJ (2009) Origins of homophily in an evolvingsocial network Amer J Sociol 115(2)405ndash450

Lee J (2014) 80 of marketers will run cross-channel marketingcampaigns in 2014 ClickZ (March 31) httpwwwclickzcomclickznews233699680-of-marketers-will-run-cross-channel-marketing-campaigns-in-2014-study

Martens D Provost F (2011) Pseudo-social network targeting fromconsumer transaction data Working paper CEDER-11-05 SternSchool of Business New York University New York

McPherson M Smith-Lovin L Cook JM (2001) Birds of afeather Homophily in social networks Annual Rev Sociol 27415ndash444

Oinas-Kukkonen H Lyytinen K Yoo Y (2010) Social networks andinformation systems Ongoing and future research streamsJ Assoc Inform Systems 11(2)61ndash68

Pan W Aharony N Pentland A (2011) Composite social networkfor predicting mobile apps installation Burgard W Roth Deds Proc AAAI 2011 (AAAI Press San Francisco)

Perlich C Dalessandro B Raeder T Stitelman O Provost F (2014)Machine learning for targeted display advertising Transferlearning in action Machine Learn 95(1)103ndash127

Pogue D (2011) Wrapping up the apple location brouhaha NewYork Times (April 28) httppogueblogsnytimescom20110428wrapping-up-the-apple-location-brouhaha

Provost F Fawcett T (2013) Data Science for Business What You Needto Know About Data Mining and Data-analytic Thinking (OrsquoReillyMedia Inc Sebastopol CA)

Provost F Dalessandro B Hook R Zhang X Murray A (2009) Audi-ence selection for on-line brand advertising Privacy-friendlysocial network targeting KDD rsquo09 Proc 15th ACM SIGKDDInternat Conf Knowledge Discovery and Data Mining (ACMNew York) 707ndash716

Pubmatic (2010) Understanding real-time bidding (RTB) from thepublisherrsquos perspective Technical report Pubmatic SiliconValley CA

Quercia D Lathia N Calabrese F Di Lorenzo G Crowcroft J (2010)Recommending social events from mobile phone location dataProc 2010 IEEE Internat Conf Data Mining ICDM rsquo10 (IEEEComputer Society Washington DC) 971ndash976

Raeder T Stitelman O Dalessandro B Perlich C Provost F (2012)Design principles of massive robust prediction systems Proc18th ACM SIGKDD Internat Conf Knowledge Discovery and DataMining (ACM New York) 1357ndash1365

Stitelman O Dalessandro B Perlich C Provost F (2011) Estimatingthe effect of online display advertising on browser conversionData Mining and Audience Intelligence for Advertising ADKDD2011 (ACM New York) 8

Stitelman O Perlich C Dalessandro B Hook R Raeder T Provost F(2013) Using co-visitation networks for detecting large scaleonline display advertising exchange fraud Proc 19th ACMSIGKDD Internat Conf Knowledge Discovery and Data Mining(ACM New York) 1240ndash1248

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Page 2: Finding Similar Mobile Consumers with a Privacy-Friendly ...pages.stern.nyu.edu/~fprovost/Papers/PMM_ISR_2015.pdf · Provost, Martens, and Murray: Finding Similar Mobile Consumers

Information Systems ResearchVol 26 No 2 June 2015 pp 243ndash265ISSN 1047-7047 (print) ISSN 1526-5536 (online) httpdxdoiorg101287isre20150576

copy 2015 INFORMS

Finding Similar Mobile Consumers with aPrivacy-Friendly Geosocial Design

Foster ProvostDepartment of Information Operations and Management Sciences Leonard N Stern School of Business

New York University New York New York 10012 fprovoststernnyuedu

David MartensDepartment of Engineering Management Faculty of Applied Economics University of Antwerp

B-2000 Antwerp Belgium davidmartensuantwerpbe

Alan MurrayCoriolis Labs New York New York 10021 alancoriolisventurescom

This paper focuses on finding the same and similar users based on location-visitation data in a mobile envi-ronment We propose a new design that uses consumer-location data from mobile devices (smartphones

smart pads laptops etc) to build a ldquogeosimilarity networkrdquo among users The geosimilarity network (GSN)could be used for a variety of analytics-driven applications such as targeting advertisements to the same useron different devices or to users with similar tastes and to improve online interactions by selecting users withsimilar tastes The basic idea is that two devices are similar and thereby connected in the GSN when they shareat least one visited location They are more similar as they visit more shared locations and as the locations theyshare are visited by fewer people This paper first introduces the main ideas and ties them to theory and relatedwork It next introduces a specific design for selecting entities with similar location distributions the resultsof which are shown using real mobile location data across seven ad exchanges We focus on two high-levelquestions (1) Does geosimilarity allow us to find different entities corresponding to the same individual forexample as seen through different bidding systems And (2) do entities linked by similarities in local mobilebehavior show similar interests as measured by visits to particular publishers The results show positive resultsfor both Specifically for (1) even with the data samplersquos limited observability 70ndash80 of the time the sameindividual is connected to herself in the GSN For (2) the GSN neighbors of visitors to a wide variety of pub-lishers are substantially more likely also to visit those same publishers Highly similar GSN neighbors showvery substantial lift

Keywords design science mobile computing analytical modeling network analysisHistory Chris Forman Senior Editor Gautam Pant Associate Editor This paper was received February 17

2014 and was with the authors 2 months for 2 revisions

1 IntroductionThis design-science paper is about finding the sameand similar users based on location-visitation data ina mobile environment A user is an instance of anindividual as viewed through some information sys-tem(s) and to avoid confusion we sometimes willexplicitly refer to a user instance For example anindividual would be viewed as two different userinstances if the individual was interacting with theonline advertising ecosystem on two different devicesor was viewed on the same device through two dif-ferent bidding systems

Specifically we investigate ldquogeosimilarityrdquomdashthesimilarity of instances of consumers based on the dis-tribution of the locations they have been observed tovisit The basic idea is that two users are similar andthereby connected in a geosimilarity network (GSN)when they share at least one visited location Users

are more similar as they visit more shared locationsand locations that are less frequently visited1 Thisdesign is motivated by research showing that geo-graphical co-occurrence provides a strong indicationof users being friends (Crandall et al 2010 Cho et al2011) and influencing each otherrsquos purchasing behav-ior (Pan et al 2011) de Montjoye et al (2013) foundthat human mobility traces are highly discriminativebased on the analysis of 15 months of location data for15 million users they found that four locations aresufficient to accurately identify 95 of the users Thisnot only motivates our design it also places emphasison the desirability of privacy-friendly designs whichwe discuss further below

1 As we discuss in future directions (see sect8) the network perspec-tive could allow similarity judgments also between users who areindirectly connected in the network

243

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers244 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

This paperrsquos motivating application is mobile ad-vertising As with traditional display ad targetingmobile ad targeting could be based on context demo-graphics or psychographics if such data were avail-able Because such data are largely unavailable onmobile consumers in the advertising ecosystem adtargeting firms increasingly are looking to othermeans of finding suitable candidates for ads An alter-native method that has been gaining traction for (non-mobile) ad targeting both in research and in practiceis targeting based on direct or indirect connectionsto specific individuals which we loosely call ldquosocialrdquotargeting as we discuss in detail below For exam-ple brands often want to target individuals who aresimilar to particular existing customers of the brandor product in question or the same existing cus-tomers on different devices2 Geosimilarity providesa viable alternative for such indirect social targetingfor mobile advertising where (anonymized) locationdata are available and importantly anonymized loca-tion data may be much more readily available thanother targeting data

No matter how users are chosen for targetingmobile advertisers need to deal effectively with theproblem of consumer fragmentation individuals areobserved only through the multiple (baroque) infor-mation systems that comprise the digital adver-tising ecosystem and a particular individual maycorrespond to various user instances (Kerho 2012) Forexample an individual may have different instancesbecause she is observed on different devices such asher mobile phone tablet laptop and PC Individualsmay also have different instances because they appearin different advertisement bidding systems each ofwhich presents the individual differently In manysituations these instances are not associated with aunique identifier for the individual unlike the cookiesthat are used to represent a userndashbrowser pair in mostwork on individualized desktop ad targeting More-over as we will discuss presently there are importantreasons why we might prefer that the instances not beassociated with a personal identifier The bottom lineis that once an instance of an individual is identified asbeing a good target advertisers would like to be ableto also target other instances of the same individual(as well as individuals with very similar interests)

Geosimilarity could also be used for privacy-friendly ldquohyperlocalrdquo targeting meaning targeting

2 Note that this paper is not about the ultimate effectiveness of thebrandrsquos choice of whom to targetmdashit may be that targeting theselected users on different devices or targeting users with similarinterests is not appropriate from a marketing standpoint This isa question specific to each brand and the brandrsquos campaign goals(eg direct marketing versus brand advertising) Furthermore thisquestion is intricately intertwined with the design of the creativebeing delivered which also is outside the scope of this paper

people in a precise location (statistically speaking)without needing to store data on the actual loca-tions of the users For example consider a hyper-local coupon campaign a couponing company wantsto target special offers for the local restaurant on thecorner The best prospects are those people who fre-quent this precise area If the restaurant can providesome (anonymized) ldquoseedrdquo usersmdashfor example exist-ing clients of the particular local business (eg identi-fied via an online loyalty program)mdashthe geosimilarityneighbors of these seed users may have a high prob-ability of also frequenting the same precise locationsand thereby be very good prospects for the hyperlocalcoupon

Looking from a different perspective we have seenthe sort of uproar that arises from the idea that ourlocation behavior is being ldquotrackedrdquo by our mobiletechnology (Pogue 2011)3 Therefore marketers whodream of location-driven targeting should think care-fully about what the Federal Trade Commission (FTC)calls ldquoprivacy by designrdquo4 and consider what optionscan provide effective advertising with minimal datacollection and storage As will be detailed in sect4we explicitly take this desideratum into account byanonymizing both the device and location identifiersThis method of using ldquodoubly anonymizedrdquo data forprivacy friendliness is described in more detail else-where (Provost et al 2009)

In sum geosimilarity can be used for composing amobile audience for targeting as follows Based on adoubly anonymized GSN find individuals who areclosely linked to individuals we know already to havethe characteristic(s) that we desire such as prior pur-chase history (as with ldquoretargetingrdquo) brand affinity(eg via visiting a Web page ldquolikingrdquo a brand orproduct or clicking on an ad) or key demograph-ics or even based on sophisticated targeting designs(Bampo et al 2008 Agarwal et al 2008 Heidemannet al 2010 Perlich et al 2014) Once these key ldquoseedrdquousers have been identified targeting close neighborsin the GSN will tend to target both users with simi-lar interests and users who are other instances of thesame individual Below we present theoretical justi-fication for such an approach tying it in to relatedresearch We then investigate empirically whether themethod indeed tends to target (i) other instances ofthe same individual andor (ii) individuals with sim-ilar interests The empirical study is based on datadrawn from actual real-time bidding (RTB) exchangesfor mobile advertising

Before we move on to related work please let usnote that our proposed GSN is different from geoso-cial networking which is defined as a type of social

3 httppogueblogsnytimescom20110428wrapping-up-the-apple-location-brouhaha4 httpwwwftcgovopa201012privacyreportshtm

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 245

networking in which geographic services are usedto enable additional social dynamics (Quercia et al2010) The latter starts from an actual network ofinterpersonal relationships and adds location data torecommend other locations and events In contrastthe GSN uses location data to create links betweenusers without the explicit guarantee that these linkscorrespond to an actual interpersonal relationship(however this surely might be as described in thenext section see Crandall et al 2010)

Section 2 argues theoretically why this GSN designshould be a good idea After that we present the high-level GSN design in more detail followed by specificsof the implementation we examine empirically Thenwe provide results from an empirical study basedon real mobile location data from real-time biddingexchanges Section 5 describes the data and experi-mental setup including several technical definitionsfor geosimilarity The ability of the GSN to connectdifferent instances of the same individual is evaluatedin sect6 The use of the GSN to select users with similarinterests and tastes is assessed in sect7

2 Motivation and Related Work21 Social TargetingAdvertising targeting has evolved substantially overthe past half century As information systems providedaccess to new sources and types of data marketersadded new targeting strategies designed around thenew data For example as demographic data becameavailable a few decades ago contextual targetingmdashtargeting based on inferring audience compositionfrom the context in which the ad will be shown (ega billboard location TV show magazine etc)mdashhadto share the spotlight with data-driven demographictargeting either based on explicit demographic pro-files or predictive modeling As data aggregators coa-lesced and integrated information such as magazinesubscriptions and catalog purchases ldquopsychographicrdquodata entered the mix and broadened yet again thespace of targeting designs

Recently we have seen the introduction of a differ-ent sort of targeting design which we can generallycall social targeting Social targeting differs from theaforementioned targeting methods because it relieson explicit linkages between specific individuals Forexample Hill et al (2006) showed the remarkableeffectiveness of social-network targeting targeting con-sumers who are linked to known customers by asocial network Subsequently Facebook (and others)have attempted to implement social-network target-ing for online advertising with varying degrees ofsuccess (see eg Oinas-Kukkonen et al 2010)

We explicitly generalize from social-network target-ing to social targeting to retain the notion that the

targeting is based on linkages to specific other indi-viduals but to relax the notion that the linkages needto be ldquotruerdquo interpersonal relationships The designof Provost et al (2009) is an example of social tar-geting that is not based on ldquotruerdquo interpersonal rela-tionships the linkages between individuals are basedon a bipartite content-affinity network So the socialtargeting there is based on forming an audience byfinding consumers who are linked by shared con-tent visitation with other specific consumers who areknown to have brand affinity (more on that later)Similarly Martens and Provost (2011) define a net-work among banking customers based on paymenttransaction data which is subsequently used to targetmarketing offers for financial products

22 GeosimilarityThe GSN can be very fine-grained based on shared(anonymized) location data for example Internet pro-tocol (IP) addresses fine-grained latitudelongitudecells or small geographic tracts Why would targetinggeosimilarity network neighbors (NNs) of preselectedseed users be a good idea There are several reasonsLet us call mobile devices that are (directly) linked inthe GSN (first-degree) network neighbors

First of all first-degree network neighbors shareat least one location and possibly several As thenumber of shared locations grows we conjecture thatthe likelihood increases that the two devices actuallybelong to the same person For example who besidesme is observed primarily on my home and office IPaddress let alone my favorite coffee shop We seeanalogous results showing that different instances ofthe same person call the same phone numbers (Corteset al 2001) and cite the same references (Hill andProvost 2003)

Second direct coarse-grained geographic targetingalready is used widely in off-line advertising (not viasocial targeting) because geography is a reasonableproxy for demographics and other predictive featuresAn important difference in the present work is thatwe do not choose the geographies to target explic-itly but instead use them implicitly This allows theactual locations to be anonymized Furthermore itallows us to use location information that is too fine-grained even to include in most predictive modelingfor example locations appearing only in a tiny frac-tion of instances (eg a home Wi-Fi address may onlyappear in 0000001 or less of the usersrsquo location sets)as well as transient Wi-Fi locations that only connecttwo devices for a brief time

Third fine-grained location information is likely tocontain more detailed (latent) information than stan-dard geographic information Not only would it linkdevices by approximate wealth income demograph-ics etc it may well link by employer educational

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers246 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

institution interests community and even shoppinghabits Thus we conjecture that geosimilarity tar-geting may combine the advantages of geographictargeting discussed in the previous paragraph withadvantages similar to content-affinity social targetingwhich has been shown to be effective specifically foronline advertising (Provost et al 2009 Perlich et al2014) We might call this ldquolocale-affinityrdquo social tar-geting The empirical results below show that indeedonersquos geosimilarity neighbors are much more likely tofrequent the same online publishers and mobile apps

If that were not enough research provides yetanother reason to expect that geosimilarity targetingmay be especially effective In an article in the Pro-ceedings of the National Academy of Sciences of the UnitedStates of America Crandall et al (2010 p 22436) showthat geographic co-occurrences between individualsare very strongly predictive of the individuals beingfriends ldquoThe knowledge that two people were prox-imate at just a few distinct locations at roughly thesame times can indicate a high conditional proba-bility that they are directly linked in the underlyingsocial networkrdquo This means that a GSN would notonly capture the advantages of geographic targetingand locale-affinity targeting but might also incorpo-rate actual social-network targeting also shown to beextremely effective for marketing (Hill et al 2006)In fact when massive descriptive data are availablethe (usually latent) similarity between social-networkneighbors has been shown to explain much of themarketing advantage previously attributed to socialinfluence (Aral et al 2009)

Moreover interestingly research also has shown thatthe homophily (McPherson et al 2001) that has beenused to explain the effectiveness of social-network tar-geting may actually be due largely to the constraintsplaced by opportunity (Kossinets and Watts 2009)Tie formation in social networks is biased heavilyby triadic closure and thus by structural proximityOver many generations of tie formation biases in theselection of structurally proximate individuals ldquocanamplify even a modest preference for similar othersvia a cumulative advantage-like process to producestriking patterns of observed homophilyrdquo (Kossinetsand Watts 2009 p 405) Why is this important for thepresent paper Because with the exception of sociallinks formed solely online (like new Facebook-onlyfriends) constraints on opportunity are framed byconstraints on physical colocation As Kossinets andWatts (2009 p 406) observe ldquoan individualrsquos choiceof relations is heavily constrained by other aspects ofhis or her life such as geographical location choice ofoccupation place of workrdquo These are exactly the sortsof links that would be represented in a GSN based onmobile device location data (including tablets and lap-tops) Thus the GSN may in addition reveal a large

driver of homophily and thus expand the effective-ness of a social-network targeting strategy to individ-uals who also would have been similar ldquofriendsrdquo butfor whatever reason were not chosen

3 Geosimilarity GeosimilarityNetworks and Mobile Ad Targeting

The basis for the targeting design is to create a net-work among users based on geosimilarity and thento use the network for inference For this paper wewill create a GSN directly among user instances5 Theelements of the network-targeting design include thefollowing How will entities (mobile user instances)be represented What exactly will be the geosimilaritylinks And how exactly will the predictive inferencebe conducted

In this section we will discuss the design at a highlevel including some further-looking elements thatare not part of the experiments but which providecompleteness for the discerning reader To make thediscussion more concrete we first present the moti-vating data scenario The design should generalize tomobile settings beyond this specific data scenario

31 The Mobile Real-Time Bidding EcosystemA high-level overview of how an RTB exchange worksis given in Figure 1 RTB exchanges bring togetheradvertisers and the publishers of Web pages andapps When a user visits a Web pageapp (here-after ldquoWeb pagerdquo) that sells ad slots through RTBexchanges a bid request is created Such a bid requestwill provide additional data on the user and Webpage visited such as the IP address publisher anddevice type (see Table 1 for a list of fields and statis-tics on their availability across RTB exchanges) Basedon such data advertisers can choose to bid For exam-ple a brand may want to target users that previ-ously visited their automobile website browsing for aecofriendly hybrid car The highest bidder is allowedto load a potentially personalized ad for exampleshowing an ad for an ecofriendly hybrid All of thisoccurs in real time in less than 100 milliseconds Thetwo interrelated problems this paper addresses canbe stated specifically in the context of this examplehow does the targeter find instances of this individ-ual who previously visited the brandrsquos website Canthe targeter find users with very similar interests (egthe focal userrsquos husband best friend) under the pre-sumption that they might also be reasonable candi-dates for an ad for an ecofriendly hybrid

5 An alternative is to create a bipartite network between users andlocations and draw inferences directly from the bipartite networkWe do not investigate that explicitly in this paper however theinterested reader should take a careful look at the similarity calcu-lations we present below

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 247

Figure 1 (Color online) Working of Real-Time Bidding

AdvertisersAd

Consumers

Publisher hasad slots for sale

At IP X is on website YBrand 1

51c

14c

33cBrand 2

Brand 3

3

Bids

Bid request

Loading of ad

2

with device type ZBidding is open 4

1

Notes When a user visits a Web page that sells ad slots through an RTB exchange (1) a bid request is created (2) after which several advertisers bid to allowto show an ad at that moment on that Web page to that user (3) The highest bidder is allowed to load a potentially personalized ad (4)

For this paper we address the data currently avail-able in the digital advertising ecosystem Specificallyvia RTB exchanges we can observe a massive numberof mobile users along with the data made availablefor advertisers to decide whether to bid on the userfor a particular campaign (see eg Pubmatic 2010)Two aspects of the data are crucial (1) Each instanceof a mobile user is associated with a key when thiskey is observed in the future we know that the futuredata item involves the same mobile device6 (2) Partof the data associated with each mobile device is alocation For this paper we consider that location tobe the (anonymized) IP address of the Wi-Fi network7

currently in use (although there are various sorts oflocation data) We observe the locations visited byeach user for the data records included in the researchdata set (described in more detail in sect51)

The mobile RTB ecosystem holds different infor-mation than for the typical online (desktop) worldwhich is important for our design The GSN designmeasures similarity in location visitation behaviorOne might wonder if other behavioral data can beobtained from the RTB systems such as website vis-itation data (Provost et al 2009 Raeder et al 2012)Table 1 shows which typical fields (besides the key)are visible on mobile devices over seven different RTBexchanges Let us consider the different data fields aspossible sources of behavioral data

First let us have a look at the fields that are avail-able everywhere the time and date of the bid request(Created) the RTB exchange (Network) the dimen-sions of the ad (Dims) the device type (Device) the

6 The notion of an instance is important Many instances of the samedevice may have different keys in the online advertising ecosystemAs discussed above connecting these is one focus of this research7 For this paper we will consider transient IP addresses that are notexplicitly removed such as those belonging to 3G or other carriernetworks (but not identified as such) to be noise in the data thatmust be accommodated by the similaritytargeting design Betterremoval techniques could increase the strength of the results pre-sented in the next sections

user agent the publisher and the IP address TheIP address (anonymized) is the location we use in ourGSN design Figures 2(a) and 2(b) show the distri-bution of the number of users seen at an IP addressand the number of IP addresses seen per user respec-tively (more information on this data set is providedlater in sect51) On average a user visits 166 IP ad-dresses and an IP address is visited on average bytwo users (over a time period of 10 days see sect51)Although in our data set some known 3G IP addresses(subranges) have already been removed Figure 2(a)shows that high-volume IP addresses are still presentSome of these IP addresses with a very large numberof unique users seen correspond to large institutionsthink for example of a university Wi-Fi address

Although the publisher data item is available for allbid requests we observe that most device instancesvisit only a single publisher (about 95 whereasless than 1 visit three or more publishers andonly 00009 are seen on 10 or more publishers)The publisher-specific Interactive Advertising Bureau(IAB) categorizations (Iabcats) suffer from the sameproblem (as well as being too coarse-grained to beuseful for identifying the same user in any event)The reason is that as mentioned above a mobiledevice has different identifiers when viewed throughdifferent channels in the mobile advertising ecosys-tem such as different RTB systems or even differentapps On the other hand a publisher has on average1366 users that visited it and publishers by and largeseem to be RTB specific Figures 3(a) and 3(b) showmore details of the distributions As a consequenceusing publisher-visitation data to connect users is notbroadly helpful because two user instances in dif-ferent RTB systems rarely share a publisher The IPaddress is always available and is passed consistentlythrough the different apps and RTB systems

The other consistently available fields (such as useragent device type dimensions) are always the samefor a single device These could in principle be used toimprove same-device finding but are only marginally

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers248 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

Table 1 Data Availability for Mobile Devices Across Different RTB Systems

RTB 1 () RTB 2 () RTB 3 () RTB 4 () RTB 5 () RTB 6 () RTB 7 ()

Created 100 100 100 100 100 100 100Network 100 100 100 100 100 100 100Dims 100 100 100 100 100 100 100ID 100 100 100 36 0 19 44Device 100 100 100 100 100 100 100User agent 100 100 100 100 100 100 100Iabcats 0 100 100 98 100 100 84Geo 100 16 100 100 100 100 100Location 0 4 56 42 1 20 100Referrer 8 0 0 17 0 6 3Publisher 100 100 100 100 100 100 100URL 100 0 0 5 93 0 0IP 100 100 100 100 100 100 100

Notes Specifically the table shows how often the following fields are provided time of bid request (Created) the RTB network identifier(Network) dimensions of the ad (Dims) identifier of the device (ID) device type (Device) the user agent IAB categorization (Iabcats)the IP-inferred city state or country (Geo) the location latitude and longitude (Location) the referrer from where the user comes(Referrer) the publisher of the Web page (Publisher) the URL and the IP address

helpful for finding the same user on different devicesor for similar-user finding (eg using Apple productsversus Android products) we do not take advantageof them in this paper instead focusing specificallyon assessing the ability of the geosimilarity design

Figure 2 Distribution of (a) Number of Unique User Instances Seen per IP Address and (b) Number of IP Addresses a User Instance Logs Into

(b) Number of IPs per ID

Number of IPs

Fre

quen

cy o

f occ

uren

ce

100

108

107

106

105

104

103

102

100

101

101 102 103 104 105 106 107 108

(a) Number of IDs per IP

Fre

quen

cy o

f occ

uren

ce

108

107

106

105

104

103

102

100

101

Number of IDs

100 101 102 103 104 105

Figure 3 Distribution of (a) Number of Unique User Instances Seen per Publisher and (b) Number of Publishers a User Instance Visits

(a) Number of IDs per publisher

Number of IDs

Fre

quen

cy o

f occ

uren

ce

100

104

103

102

100

101

101 102 103 104 105 106 107

(b) Number of publishers per ID

Fre

quen

cy o

f occ

uren

ce

108

107

106

105

104

103

102

100

101

Number of publishers

100 101 102 103 104

(as opposed to how to best engineer a same-user-finding system)

The URL field is missing in all but two RTB ex-changes The Geo field is often limited to the citystate and country as inferred from the IP address

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 249

Location data in the form of latitude and longitude(Location) are only available in some RTB systemsand the accuracy of the latitudelongitude data in thecurrent mobile ecosystem is questionable8 again mak-ing it not suitable for broad use

32 Entity and Link RepresentationWe represent each user instance by a ldquoprofilerdquo ofbehavior across locations where we define a locationprofile as a vector of visit frequencies This will neces-sarily require a sparse representation as we observe avery large total set of locations fortunately each indi-vidual is seen at only a small subset of the locationsFor the empirical results below the location profile isa sparse vector of frequencies of observed visitationsto locations (which can be weighted in the calculationof similarity as described below)

The structure of the GSN is the graph with mobiledevices as the nodes and links between the devicesas the edges The corresponding design decisionsinvolve the selection of which devices to link at alland the weights to place on the links The simplestlink representation places unweighted links betweenall pairs of devices that share at least one specific loca-tion More sophisticated weighting criteria are basedon two notions which are discussed next and illus-trated with a simple example in Figure 4 The loca-tion data are obtained by listening to RTB ecosystemswhere IP addresses of devices are broadcast (as shownpreviously in Figure 1)

1 Given two devices that share at least one loca-tion the weight of the link between them can bea function of the devicesrsquo location profiles Such mea-sures can include simply the number of shared loca-tions and the number of shared locations weightedby their visitation percentage in the profile Alterna-tively the weighting could be a measure of similaritybetween the two location profiles for example cosinesimilarity

2 As such Devices A B and C are connected in theGSN of Figure 4 since they all logged into the Wi-Fiof the Met Museum Devices B and C also loggedinto the public Wi-Fi at Times Square and thereforeshould be more strongly connected than A and B (or Band C) which share only one location

8 Analyses of the latitudelongitude data show various data prob-lems including flipped coordinates coordinates set equal to eachother many latitudinallongitudinal points at the centroids of com-mon geographic areas and many impossible or highly improba-ble coordinates Mobile ad industry insiders claim that the highprice paid for latitudelongitude-based hyperlocal targeting hasled to a significant amount of latitudelongitude fraud We havefound nothing written on this phenomenon although data fraudis not uncommon in the online advertising industry (Stitelmanet al 2013)

Figure 4 (Color online) Basic Illustration of the Creation of a GSNBased on Mobile Location Data Observed for Three MobileDevices (A B and C) and a Laptop (D)

Met MuseumA B

Geosocial network(GSN)

A C

D

B

C

D

Times Square

Apartment

Note Devices are connected in the GSN if they visited (logged into) the samelocation (IP address)

3 Links can be weighted according to the componentlocationsrsquo 4lack of5 popularity For example it mightbe argued that the link formed by sharing a loca-tion with a small number of other devices (eg myapartment) indicates a stronger geosimilarity thansharing a location with a massive number of otherdevices This intuition can be extended if two devicesspend a lot of time at such an unpopular locale(eg my apartment) then that should indicate a verystrong similarity If two devices have approximatelythe same distributional profile across several of thesesites (their ldquolocation profilerdquo) then they are eitherthe same person or close friendssoulmates In ourexample C and D are connected because they bothlogged into the Wi-Fi of an apartment whereas Band C are connected by logging into the public (pop-ular) Wi-Fi at Times Square Hence the connectionbetween C and D (which are likely devices belongingto roommates or belonging to the same user) shouldbe stronger than the connection between B and C(which are likely devices belonging to two differenttourists)

This notion of similarity can be incorporated tech-nically by (1) adapting notions from information re-trieval we can weight locations for a given deviceby their device-specific popularity (from the devicersquoslocation profile) divided by the (log of) the locationrsquosoverall popularity let us call that LF times IDF (locationfrequency times inverse device frequency) Then (2) thestrength of the link between two devices that sharea location would be a function of the correspondingLF times IDF scores

Once we have one or more GSNs there are variousways to take advantage of them for targeting or otherinference The simplest method for inference is sim-ply to target all of the geosimilar network neighborsAlternatively one could target the ldquoclosestrdquo neighborsbased on one or more notions of geosimilarity (as dis-

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers250 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

cussed above) These are the methods used for theempirical results presented below

4 Specifics of the GSN DesignAs described above in this design the strength of thelink between two individuals in the GSN is based onthe similarity in the distribution of the locations thatthey visit For this paperrsquos results locations will be(anonymized) IP addresses We now define differentsimilarity measures of varying complexity which willbe evaluated in the following sections

41 NotationThe location profile of a user is represented by a vec-tor Ex of length m where element i denotes the num-ber of visits to IP address IPi2 Ex isin m Vector Exbool isa binary vector of size m where element i denoteswhether the user has visited IPi For similarity com-putations both the Hadamard product () and theinner product 4 middot 5 are used and the minimum (min)operator is defined to operate componentwise on twovectors as well as on a vector and a scalar The max-imum (max) and average (avg) operators are similarTheir logic is defined in Equations (1)ndash(4) The weightof an IP address is defined by its popularity in anal-ogy to the inverse document frequency used in text

Table 2 Similarity Metrics Between Two Userrsquos Location Visit Distributions Ex1 and Ex2

Similarity metric Range d4 Ex11 Ex25

1 scount4 Ex11 Ex25= Exbool11 middot Exbool12 6015 1

2 sbool4 Ex11 Ex25= min4scount4 Ex11 Ex25115 80119 1

3 scount1W 4 Ex11 Ex25= 4 Ew Exbool115 middot Exbool12 6015 103

4 sfreq1min4 Ex11 Ex25= min4 Ex11 Ex251 6015 2

5 sfreq1max4 Ex11 Ex25= max4 Exbool12 Ex11 Exbool11 Ex251 6015 3

6 sfreq1 avg4 Ex11 Ex25= avg4 Exbool12 Ex11 Exbool11 Ex251 6015 205

7 sfreq1min1W 4 Ex11 Ex25= min4 Ex1 Ew1 Ex2 Ew51 6015 206

8 sfreq1max1W 4 Ex11 Ex25= max44 Exbool12 Ew5 Ex11 4 Exbool11 Ew5 Ex251 6015 309

9 sfreq1 avg1W 4 Ex11 Ex25= avg44 Exbool12 Ew5 Ex11 4 Exbool11 Ew5 Ex251 6015 3025

10 scosine4 Ex11 Ex25=Ex1 middot Ex2

Ex11 middot Ex21

60117 0074

11 scosine1W 4 Ex11 Ex25=4 Exbool11 Ew5 middot 4 Exbool12 Ew5

Exbool11 Ew1 middot Exbool12 Ew1

60117 0060

12 sJaccard4 Ex11 Ex25=scount4 Ex11 Ex25

max4 Exbool111 Exbool1251

60117 0033

13 sJaccard1W 4 Ex11 Ex25=scount1W 4 Ex11 Ex25

max4 Exbool11 Ew1 Exbool12 Ew51

60117 0026

14 sJaccard1 freq4 Ex11 Ex25=sfreq1min4 Ex11 Ex25

max4 Ex1 Exbool121 Ex2 Exbool1151

60117 0067

15 sJaccard1 freq1W 4 Ex11 Ex25=sfreq1min1W 4 Ex11 Ex25

max44 Ex1 Exbool125 Ew1 4 Ex2 Exbool115 Ew51

60117 0067

mining (Hotho et al 2005) the weight wi for IPi (theinverse device frequency) is defined as the logarithmof the total number of users n divided by the numberof (unique) users that visit that specific IP address ni

Ez= Ex Ey1 zi = xi times yi1 (1)

a= Ex middot Ey =sum

i=12m

4xi times yi51 (2)

Ez= min4Ex1 Ey51 zi = min4xi1yi51 (3)

Ez= min4Ex1a51 zi = min4xi1 a51 (4)

wi = log104nni50 (5)

42 Link Strength MetricsFifteen similarity metrics are defined in Table 2 andare illustrated by measuring the strength betweenthe location profiles of two example users Ex1 and Ex2shown in Equation (6) The first user visits IP1 threetimes and IP2 twice whereas the second one visits IP1twice and IP3 once Several variants of the same met-rics are considered As described above some onlytake into account the unique number of shared loca-tions whereas others take into account the frequencyof the visits to these locations and the inverse devicefrequency The latter are denoted by freq and W respectively

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 251

The first similarity metric scount counts the numberof shared locations Since the two example users shareonly IP1 this strength is 1 The IDF weighted versionscount1W sums the IDF values of the shared locationswhich is 13 for our simple example The most basicmetric is given by sbool which is 1 when a location isshared and 0 otherwise indicating whether two usersare neighbors or not

The sfreq1min sfreq1max and sfreq1avg metrics compare thefrequencies of the visits to the shared locations Theytake the minimum maximum and average respec-tively of the frequencies for each shared location andsum them The weighted versions sfreq1min1W sfreq1max1W and sfreq1avg1W do the same but weight each frequencywith the corresponding IDF value

The scosine metric measures the similarity by takingthe cosine of the angle between the two vectors TheJaccard metric is defined as the size of the intersec-tion over the size of the union In this case it countsthe number of shared locations but normalizes theseover the total number of locations both users havevisited Users that visit many locations will have ahigher chance to share some location with anotheruser so the Jaccard metric penalizes them For ourbasic example sJaccard is 1 (IP1) over 3 (IP1 IP2 IP3)Once more variants taking into account the frequencyand the IDF are defined as well

Ex1 = 63 2 071

Ex2 = 62 0 171 (6)

Ew = 6103 107 270

5 Experimental Setup51 Data SetThe data set D for the empirical results is 10 days ofanonymized advertising bid requests from mobile de-vices (smartphones tablets etc) observed across sevenRTB exchanges A total of 322770794 bid requestsare observed coming from 42437559 unique userinstances Over these 10 days we observe 41829088unique IP addresses

The number of unique user instances per RTBexchange is given in Table 3 The distributions of thenumber of bid requests per device type and operatingsystem are given in Tables 4 and 5 which show thatthe majority of bid requests come from smartphonesand the iOS operating system

A user (instance) corresponds to an exchange-assigned user ID One person on several deviceswill hence correspond to several users One deviceon several exchanges will often correspond to sev-eral users Also it might be that one ad exchangesometimes provides different user IDs to the same

Table 3 User Instances per RTB Exchange

Exchange Volume Percentage ()

RTB 1 311751979 6RTB 2 211591686 4RTB 3 1415661098 29RTB 4 2015011243 41RTB 5 212751015 5RTB 6 510551815 10RTB 7 212231418 4

Table 4 Distribution of Bid Requests per Device Type

Type Volume Percentage ()

Phone 29213581772 91Tablet 3010161334 9

Table 5 Distribution of Bid Requests per OperatingSystem

Type Volume Percentage ()

Android 12913151617 4001iOS 19219231648 5908Windows 5311529 002

device for example when the device is using differ-ent particular apps As described above an impor-tant task in online advertising is to be able to targetadvertisements to different instances of the same indi-vidual with reasonable probability regardless of thecomplexities of identification in the baroque onlineadvertising ecosystem Figure 5 visualizes a sample oflatitudes and longitudes broadcast by mobile devicesin our data set

52 SamplingFor the results in this paper we apply the GSN anal-ysis to data samples Each sample corresponds to acohortneighborhood and is built as follows A sin-gle (random) ldquoegordquo user is chosen user X aroundwhich a neighborhood is built Every neighbor of theego user must share at least one IP address so webegin by obtaining all (anonymized) IP addresses vis-ited by user X Next all other users that visited atleast one of these IP addresses (the neighbors) areadded The ego must have at least one neighbor to beincluded in this analysis Finally we obtain the com-plete location distribution of each of these neighborsWhen the next sample is chosen we ensure that noneof the users already included in the previous samplesis taken as the ego user (X) As such we ensure thatwe do not have repeated egondashneighbor dyads Theremay be shared neighbors of two different egos buttwo egos will never be immediate neighbors in thedata sample

The result of the sampling over 500 cohorts is summa-rized in Figure 6 on average an ego has 71 neighbors

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers252 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

Figure 5 Scatter Plots of Samples of Latitudes and Longitudes Broadcast by Mobile Devices

ndash130 ndash120 ndash110 ndash100 ndash90 ndash80 ndash70 ndash6020

25

30

35

40

45

50

55(a)

Long

Lat

ndash20 ndash10 0 10 20 30 40 50

Long

30

35

40

45

50

55

65

60

75

70

Lat

ndash200 ndash150 ndash100 ndash50 1000 50 150 200

Long

ndash100

ndash80

ndash60

ndash40

ndash20

0

40

20

100

60

80

Lat

4068ndash7406 ndash7404 ndash7402 ndash7400 ndash7394ndash7398 ndash7396 ndash7392 ndash7390

Long

4070

4072

4074

4076

4080

4078

4086

4082

4084

Lat

(b)

(c) (d)

Note Note that locations are plotted without reference to any map The figure gives a striking picture of population density across the world and makes uswonder what is going on with mobile devices in Antarctica

Figure 6 (Color online) The Distribution of the Number of Neighborsfor 500 Randomly Sampled Ego Users

0 500 1000 1500 2000 2500 3000 35000

50

100

150

200

250

300

350

400

450

No neighbors

Fre

quen

cy

the median being 1 The maximum number of neigh-bors is 3167 this user probably visited a publicWi-Fi that thousands of other users also logged intoWe chose 500 cohorts as we observe that the statis-

tics become quite stable as seen in Figure 7 Even200 cohorts seems to be enough to obtain stablesample statistics

The life span of user instances in this set of500 cohorts is shown by the cumulative frequency inFigure 8 Observe that only about 40 of the userinstances are seen at different hours About 30 ofthe user instances are seen only once in the sampledweek This illustrates that it is vital to understand thenotion of user instances These percentages are mis-leading if we interpret the user instances as individu-als or devices One individualdevice without stableidentification in the online advertising ecosystem willcreate many transient user instances which will driveup the percentage of user instances that are seen onlyonce or that are seen only within a short timeframeWe have not figured out how to determine or esti-mate reliably how many ldquounstablerdquo usersdevices arerepresented by the user instances in these data Wekeep all of the data for the analyses below (rather thanfiltering very short-lived user instances) to maintainrealism

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 253

Figure 7 Evolution of Several Sample Statistics as an Increasing Number of Cohorts are Created

0 100 200 300 400 5000

10

20

30

40

50

60

70

Average No network neighbors

0 100 200 300 400 5000

2

4

6

8

10

12

Median No network neighbors

0 100 200 300 400 5000

50

100

150

200

250

300

350

Average No IPs

0 100 200 300 400 5000

2

4

6

8Median No IPs

Notes The horizontal axis denotes the number of cohorts By 500 cohorts the statistics have become stable

Figure 8 (Color online) Cumulative Frequency of Life Span ofUser Instances

100 101 102 103 104 105 1060

01

02

03

04

05

06

07

08

09

10

x = lifespan (sec)

F(x

)

End of June (mobile only)

larr 1 min larr 1 hour larr 1 day

Cumulative frequency of user instances

6 Study 1 Connecting Instances ofthe Same Individual

Returning to our motivating application as men-tioned previously an important task in mobile adver-tising is to be able to target a particular individual inthe face of the individualrsquos fragmentation into differ-ent user instances (for example as seen through dif-ferent bidding systems) Advertisers understand that

it may be impossible to find the same user with 100accuracy Nonetheless it often is important for themto reach the userrsquos various instances without target-ing too many others9 The GSN is designed in partbased on the hypothesis that different instances of thesame person would visit similar locations and there-fore they will be connected in the GSN In this sectionwe present results examining whether and to whatextent that is indeed the case Furthermore the resultsquantify to what extent the same user receives a highGSN-similarity to herself as compared to her otherneighbors

Three experimental settings are used to assess judg-ments that two users are the same The first two sim-ulate different user instances by splitting up the bidrequests of known user instances The last one is notsimulated it links users via the recently introducedidentifier for advertising (IFA) observed over severalRTB exchanges for a subset of device instances

1 Random We divide the IP address visits of a userrandomly across two simulated users Table 6 showsfor example how we can divide a userrsquos IP visits across

9 The astute reader may notice that if the other users targeted arevery similar to the target user the advertising may be well placeddespite not being shown to the exact target user That is not thesubject of this study Study 2 shows evidence that these two notionsare not independent when using the GSN

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers254 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

Table 6 Randomly Splitting User X rsquos IP VisitsCreating Two New Artificial Users X1 and X2

IP1 IP2 IP3

X 1 5 0X1 0 3 0X2 1 2 0

Table 7 The Volume of IFA Occurrence Within OurData Set

Volume

RTB 1 119401855RTB 2 113831962RTB 3 7891292

Table 8 The Frequency of Unique IFAs Visible AcrossRTB Exchanges

Frequency

RTB 1RTB 2 3061594 (1016)RTB 2RTB 3 451256 (169)RTB 3RTB 1 181417 (085)

two new users X1 and X2 for which we know thatthey are actually the same individual Results withRandom should be used only as a ceiling on perfor-mance because the results will be overly optimistic

2 Temporal A more realistic split-up of the trans-actions is a temporal split-up where first the middleday of Xrsquos transactions is determined (hence dynam-ically chosen for each user) All IP address visits upto this middle day are assigned to X1 and all trans-actions after the middle day are assigned to X2 Itwill only be applied to users who are seen on multi-ple days

3 Identifier for advertising IFA is an advertisingidentifier from Apple that users can change or asknot to be used for advertising and is seen as aprivacy-friendlier version of the controversial fixeddevice ID (the ldquoUDIDrdquo) The IFA is observed in threeexchanges with volumes shown in Table 7 for exam-ple 1940855 unique IFAs are seen in RTB 110 Whenwe observe the same IFA across different exchangeswe now know that two devices in different exchangesactually are the same The frequency of unique IFAsvisible across exchanges is shown in Table 8 For exam-ple a total of 3018223 IFAs are seen when we look atboth RTB 1 and RTB 2 with 306594 of these (1016)seen on both exchange 1 and exchange 2 What ifthere were not an IFA and the device would havehad a different identifier in each exchange To what

10 For confidentiality RTB 1 in this setting is not necessarily thesame as RTB 1 in sectsect4 or 7

extent are these two device instances then connectedin the GSN

With these different ldquogold standardrdquo methods toindicate whether or not two users are the samewe assess the ability of the geosimilarity metricsto link the same user determining how often twoinstances of the same person are connected in theGSN (sect61) Afterward we shall assess the differentGSN strength metricsrsquo ability to rank users determin-ing to what extent two instances of the same personexhibit strong geosimilarity as compared to the otherneighbors (sect62) For the random and temporal split-ups 200 samples are used For the IFA-based methodthe complete population is used for measuring thedegree of connectedness to the same user to analyzethe relative strength of the connected users 200 sam-ples are used

61 Same Individual Different ldquoScreensrdquoWe now present results on the GSN connectednessmeasured as how often two instances of the same userare connected in the GSN (in percentage) Note thatthe results summarized in Table 9 should be inter-preted in light of the limited number of days in build-ing the research data set which will limit the GSNconnectivity Nonetheless the relatively high connec-tivities observed even with this sample provide quitepromising results which lends considerable supportto the merit of the overall design

Random0 Using the random split two users thatcorrespond to the same individual are not connectedin only 9 of the neighborhoods Moreover in thesecases user X (whose IP visits are split up) usuallyhas only one or two entries and the random split-upresults in no visits for one of the two simulated users

Temporal0 The temporal split-up is more realisticthan the random one where we assume that a personuses simulated device 1 in the first half of the week(eg during the weekend) and simulated device 2 inthe second half of the week (eg during the week)

Users X1 and X2 are not connected in 32 of theneighborhoods Whereas the random split-up was toooptimistic the temporal one is too pessimistic as inreality there likely is some overlap in the time peri-ods of use of two different devices If user X visitsan IP address only in the first days of the week this

Table 9 To What Extent is the Same User Connectedto Herself

connected

Random 91Temporal 68IFAmdashRTB 1RTB 2 67IFAmdashRTB 2RTB 3 82IFAmdashRTB 3RTB 1 73

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 255

IP address will not occur in user X2rsquos transactions andreduces the chance to be connected to X1

IFA0 We look at all combinations of the three ex-changes with IFA identifiers For example there are45256 unique IFAs seen on both RTB 2 and RTB 3 In82 of these the two IFAs visit at least one same IP onboth exchanges and hence are connected in our GSN

The bottom line is that even with this limited slice ofonline behavior different instances of the same indi-vidual are connected in the GSN 70ndash80 of the timeEncouragingly the results from the simulated scenar-ios concur with the results from the real (IFA) sce-narios (including the aforementioned optimism andslight pessimism of the two simulated scenarios)Consistency over these different settings gives addi-tional confidence in the results suggesting that theGSN indeed holds promise for a high degree of suc-cess at targeting the same user across the fragmenta-tion of the online advertising ecosystem

62 Ranking User InstancesNext given that two user instances of the same indi-vidual are indeed connected we assess to what extentthe second instance of the same user is ranked highlyamong its network neighbors based on the weightof their linkage in the GSN We first will discuss themost general results using the best ranking methodand then will look across different ranking methods

We decompose the results based on the numberof network neighbors (NNs) since there are twoboundary cases that require special interpretationTables 10ndash14 show the proportions of the ego net-works that fall into three scenarios First a user mayhave only one NN In this scenario the ranking isperfect (trivially)mdashthe other instance of the same useris the only user instance given a nonzero score As

Table 10 In What Portion of Cases Is the Same UserConnected Depending on the Number ofNetwork NeighborsmdashRandom Split-up

Ego has only 1 NN 3104Ego has exactly 2 NNs 1905

Same user ranked first 8005Ego has more than 2 NNs 4901

Note For the scenario with two NNs the table reports thepercentage of cases where the other known instance of thesame user is ranked first (including ties)

Table 11 In What Portion of Cases Is the Same UserConnected Depending on the Number ofNetwork NeighborsmdashTemporal Split-up

Ego has only 1 NN 3008Ego has exactly 2 NNs 1807

Same user ranked first 8309Ego has more than 2 NNs 5005

Table 12 In What Portion of Cases Is the Same UserConnected Depending on the Number ofNetwork NeighborsmdashIFA RTB 1RTB 2Split-up

Ego has only 1 NN 4401Ego has exactly 2 NNs 1201

Same user ranked first 8201Ego has more than 2 NNs 4400

Table 13 In What Portion of Cases Is the Same UserConnected Depending on the Number ofNetwork NeighborsmdashIFA RTB 2RTB 3Split-up

Ego has only 1 NN 6109Ego has exactly 2 NNs 1907

Same user ranked first 7102Ego has more than 2 NNs 1804

Table 14 In What Portion of Cases Is the Same UserConnected Depending on the Number ofNetwork NeighborsmdashIFA RTB 3RTB 1Split-up

Ego has only 1 NN 5305Ego has exactly 2 NNs 2501

Same user ranked first 6304Ego has more than 2 NNs 2104

shown in the tables this scenario accounts for 30ndash60 of the ego networks (depending on the split-upmethod)

The second boundary case is when the ego has ex-actly two NNs This scenario accounts for 10ndash25of the cases depending on the split-up method Herethe tables also report whether the other known userinstance is ranked first (including ties) among thetwo neighbors The other known instance of the sameuser is ranked first 60ndash80 of the time dependingon the split-up scenario The IFA results are affectedstrongly by the particular RTB settingmdashthis may indi-cate that the publishers on the different RTBs have dif-ferent policies for passing the IFA Importantly theseresults are conservativemdashpossibly very conservativeThe other NNs may also be instances of the same userunbeknownst to usmdashfor example in the IFA casebecause an app does not pass the IFA to the RTB orbecause the other user is the same user on a differentdevice and thus does not share the IFA

The final scenario is when more than two NNsare present As shown in the tables this accounts for20ndash50 of the cases depending on the split-up sce-nario In this case we measure the percentile in theranking where the instance of the known same userfalls (see below) as well as the area under the receiveroperating characteristic curve (AUC) (Fawcett 2006)which is equivalent to the MannndashWhitneyndashWilcoxon(MWW) statistic The AUCMWW measures to what

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers256 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

Table 15 When There Are More Than Two Network Neighbors to What Extent Is the Known Same User StronglyConnected to Itself

AUC (if gt 2 NNs) () Average rank percentile (if gt 2 NNs) ()

Ranking metric R T RTB 12 RTB 23 RTB 31 R T RTB 12 RTB 23 RTB 31

1 78 64 68 59 55 29 48 34 43 542 59 58 67 58 54 45 53 34 44 553 79 69 74 61 61 29 44 29 40 484 86 75 69 61 59 24 40 32 41 505 83 70 71 59 63 26 44 31 43 476 85 74 71 60 65 25 41 31 42 457 86 76 73 62 64 24 40 29 40 468 83 70 72 59 65 26 44 30 43 459 85 74 71 60 67 25 41 30 42 44

10 79 69 69 61 63 29 44 32 41 4711 79 69 71 62 64 44 44 30 41 4612 81 68 68 60 62 28 46 34 42 4913 81 70 71 61 64 27 44 30 40 4614 74 67 66 61 48 34 47 35 41 6015 74 67 66 61 48 34 47 35 41 60

Note See the text for a discussion of ranking metrics

extent data points with label 1 are ranked higherthan those with label 0 In the results that follow(see Table 15) the AUC is measured for each samplewhere any user instance among the neighbors thatcorresponds to the same individual is labeled as 1and all others are labeled as 0 Thus in this settingthe AUC measures how well the neighbors are rankedby the likelihood of being the same individual11 Theaverage over all samples is reported If for a sampleall neighbors have the same score that AUC is setto 05 In addition to the factors noted above this alsowill tend to make the results conservative For exam-ple if the GSN links an ego user to a set of neighborsthat are all instances of the same individual in thisevaluation the AUC will be 05

The ranking measure is the percentile in the rank-ing at which the known-same-user instance is rankedwith lower values being better For example if thesame-user instance has the highest score (closest con-nection) among five neighbors the rank is 20 (15)Note again that this is quite conservative in thisexample the user is ranked at the top of the list butbecause there were only five neighbors the highestpossible score is 20 In case the known-same-userinstance has the same score as another neighbor theaverage rank is reported

As shown in Table 15 all AUC values exceed 05ranging up to 08 This means that in every casea known-same-user instance is likely to be rankedhigher than an instance not known to be the same

11 Technically in this setting the AUCMWW is the probability thata randomly selected neighbor that is in fact a known instance of thesame user is ranked more highly than a randomly selected neighborthat is not known to be an instance of the same user If the known-same-user instances are all ranked above the other-user instancesthe AUC = 100 If they are all below then the AUC = 0

user The frequency-based ranking metrics (4ndash9) per-form quite well in connecting instances of the sameuser strongly

Almost all metrics perform better than a randommodel with performances that even come close to thebest possible solution Consistently performing quitewell are the frequency-based metrics (metrics 3ndash9)with metric 7 (the minimum of the IDF weights ofthe shared locations) getting the most wins Not per-forming well is the very basic count metric as well asthe Boolean metric that provides a binary score onlyInteresting to note the Jaccard metrics do not performwell either It seems that penalizing users that visitmany IP addresses negatively affects the results anissue we will return to in the next study

7 Study 2 Does the GSN Select Userswith Similar Interests

The foregoing section addressed our paperrsquos firstclaimmdashthat the GSN design indeed links instances ofthe same user and links them more strongly thaninstances of other users In this section we turn to thepaperrsquos second claimmdashthat the GSN links users withsimilar interests We will measure similar interests bysimilarity in behavior accessing particular publisherson the Web and using mobile apps12

71 Connecting Users with Similar Interests

To What Extent Are the GSN Neighbors of Visi-tors to Particular Publishers Also Visitors to ThoseSame Publishers To assess whether the GSN con-nects users with similar interestsbehavior we select

12 From now on all user instances will be complete real user in-stances the simulated split-up scenarios (Random Temporal) werecreated specifically for the purpose of the same-user study

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 257

Table 16 Distribution of Publishers per RTB in Terms of thePercentage of Users in Our Data Set That Visit the GivenPublisher

Publisher of D

RTB 1Burstly 3004myYearbookcom 1062My Fitness Pal 0065TextMe Inc 0056GameResort 0025Flixster 0019247Sports 0015Daily Workout Apps LLC 0009Mobile Deluxe 0007YoYo Games Ltd 0004

RTB 2mworldnowcom 0032mtmzcom 0024mtopixcom 0014babycentercom 0015meetmecom 0006appsfacebookcom 0006itunesapplecom 0039appevitecom 0002beautyandskincarescom 0003mmajunkiecom 0001

RTB 3Conversion Exchange 5053Pinger Inc 2096Top Game Developer 4029PremiumEntertainmentApp-FamilyOriented-Android 0059Enflick 0053Clapfoot Games 0044FunPokes Inc 0018Talkatone 0015Sevenlogics Inc 0008A Star Software 0007

RTB 4Pinger PhonemdashiOSmdashConversation 1001Talkatone iPhone App 0007DiceWithBuddies 00139GAG Reader 0012Video Downloader Pro Lite 0016Rage Comics 0009Spades 3D Lite 0006Apalabrados 0003Relax Melodies HD 0002Celeb MemdashPhotoMaker 0000

a diverse collection of 40 publishers listed in Table 16A publisher can correspond to a mobile app a web-site or a set of appswebsites We selected both pub-lishers that are visited by many users and more nichepublishers with fewer visitors As can be seen fromTable 16 several types of publishers can be distin-guished the main groups are games (eg GameRe-sort Top Game Developer) and social networkingtools (eg appsfacebookcom Text Me Inc)

More specifically we create 200 samples for 40 pub-lishers 10 publishers for each of four different RTB

exchanges listed in Table 1613 The percentage ofunique users that visit each publisher in our completedata set is also shown (D) To create each samplean ldquoegordquo user is selected that has visited the pub-lisher in question (an ldquoego visitorrdquo) this ego visi-torrsquos GSN neighborhood is created Next we assesshow many of the egorsquos neighbors also visited thatsame publisher This will be compared with a base-line visit rate to produce lift and leverage values(Provost and Fawcett 2013)mdashspecifically how muchmore likely is it for the neighbors of ego visitors tovisit the publisher than would be expected by chanceWe consider three different setups that produce dif-ferent baselines

1 Considering all RTB exchanges to determine the base-line A user is considered a visitor for a given pub-lisher if she visits the publisher on any RTB ex-change (since some publishers are visible on differentexchanges) Hence the baseline visit rate is the num-ber of users (on any RTB) who visit that publisherdivided by the total number of users across all RTBexchanges This will provide optimistic lifts sincemost publishers appear only on one RTB exchangewhereas the baseline is measured over all exchanges

2 Considering one RTB exchange only to determine thebaseline A user is considered a visitor for a given pub-lisher and RTB if she visits the publisher on the sameRTB exchange This changes the baseline to the num-ber of users (on this RTB) who visit that publisherdivided by the total number of users on this RTBexchange The lift and leverage results will be worsethan in the first setting because the baseline will belarger Seeing that some publishers are seen on differ-ent RTB exchanges this setting is slightly pessimistic

3 Taking time of location visits into account In thethird setting we link users only if they visit an IPwithin the same time period This will be elaboratedon in sect73

Results are shown in Tables 17 and 18 The percent-age of publisher visitors in the neighborhood is givenas N This percentage is compared with the baselinepercentage of users that visit that publisher in thecomplete data D using two metrics the lift is the ratioof these numbers the leverage is the difference Liftis particularly useful for very small base rates (con-sider that an activity with a base rate of 30 cannotpossibly have a lift higher than about 3) Leverage ismore telling for larger base rates because it shows theabsolute improvement (here in percentage points)

When considering all NNs the lifts for all publish-ers are greater than 1 except for one Burstly on RTB 1has a lift of 097 (essentially no lift) but only when weconsider RTB 1 as the baseline This publisher already

13 For confidentiality the RTB numbering here is again differentfrom above

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers258 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

Table 17 RTBs 1 and 2 Percentage of Network Neighbors That Also Visit the Publisher

RTB 1

All NNs N1 N10

D N Lift Lev N1 Lift Lev N10 Lift Lev

Considering all RTB exchanges to determine baseline1 Burstly 3004 8063 2084 5059 47022 15054 44018 25061 8043 220572 myYearbookcom 1062 25077 15094 24015 44074 27067 43012 35016 21074 330543 My Fitness Pal 0065 4015 6038 305 3609 56077 36025 15055 23091 14094 TextMe Inc 0056 1805 33006 17094 39002 69075 38046 25018 45 240625 GameResort 0025 4093 19077 4068 18052 74026 18027 11086 47058 110626 Flixster 0019 1055 8015 1036 8014 42085 7095 1079 9042 1067 247Sports 0015 27057 188058 27042 77062 530093 77048 74053 509075 740388 Daily Workout Apps LLC 0009 1092 21047 1083 19023 215046 19014 11054 129027 110459 Mobile Deluxe 0007 1087 28067 108 23053 36009 23046 9063 14707 9056

10 YoYo Games Ltd 0004 3052 99058 3049 21057 609064 21053 13 367045 12096Considering this RTB exchange only to determine baseline

1 Burstly 8085 8063 0097 minus0022 47022 5033 38037 25061 2089 160762 myYearbookcom 4071 25077 5047 21006 44074 905 40003 35016 7046 300453 My Fitness Pal 1089 4015 2019 2025 3609 19048 35001 15055 8021 130654 TextMe Inc 1063 1805 11035 16087 39002 23094 37039 25018 15045 230555 GameResort 0073 4093 6078 402 18052 25049 17079 11086 16033 110146 Flixster 0055 1055 208 0099 8014 14071 7059 1079 3023 10247 247Sports 0043 27057 64073 27014 77062 182023 7702 74053 174097 74018 Daily Workout Apps LLC 0026 1092 7037 1066 19023 73095 18097 11054 44037 110289 Mobile Deluxe 0019 1087 9084 1068 23053 123088 23034 9063 5007 9044

10 YoYo Games Ltd 001 3052 34018 3042 21057 209025 21047 13 126012 1209

RTB 2

Considering all RTB exchanges to determine baseline1 mworldnowcom 0032 13027 40086 12095 54039 167042 54006 31065 97044 310332 mtmzcom 0024 6089 28086 6065 23008 9607 22084 10068 44075 100443 mtopixcom 0014 15007 104079 14093 53045 37106 5303 33075 234065 330614 babycentercom 0015 6087 47031 6072 43048 299056 43033 17074 122024 17065 meetmecom 0006 6002 10308 5097 54072 942093 54066 27085 47909 270796 appsfacebookcom 0006 301 48039 3003 9009 141097 9003 2076 43008 20697 itunesapplecom 0039 25077 65071 25038 51043 131012 51004 29049 75018 290098 appevitecom 0002 48028 21212043 48025 54076 21509068 54074 48028 21212043 480259 beautyandskincarescom 0003 18019 642087 18016 8048 299094 8046 1107 413049 11067

10 mmajunkiecom 0001 1036 201064 1035 26067 3195505 26066 13033 11977074 13033Considering this RTB exchange only to determine baseline

1 mworldnowcom 4034 13027 3006 8093 54039 12053 50005 31065 7029 270312 mtmzcom 3019 6089 2016 307 23008 7024 19089 10068 3035 70493 mtopixcom 1092 15007 7084 13015 53045 27081 51053 33075 17056 310834 babycentercom 1094 6087 3054 4093 43048 22042 41054 17074 9015 15085 meetmecom 0078 6002 7077 5025 54072 70057 53094 27085 35092 270076 appsfacebookcom 0086 301 3062 2024 9009 10063 8024 2076 3022 1097 itunesapplecom 5024 25077 4092 20053 51043 9081 46019 29049 5063 240258 appevitecom 0029 48028 165058 47098 54076 187082 54047 48028 165058 470989 beautyandskincarescom 0038 18019 48011 17081 8048 22045 8011 1107 30095 11032

10 mmajunkiecom 0009 1036 15009 1027 26067 296003 26058 13033 148001 13024

Notes Results are given for the complete data set (D) the complete neighborhood of a user who visited the given publisher (All NNs) and considering theneighborhood of closest 1 and 10 users (N1 and N10)

has a very high baseline and as we will see in thenext section when we consider only the top 1 and 10NNs the lifts go up to 5 and 3 respectively Burstlyprovides a generic platform for building apps and sois a weaker indication of user interest than many ofthe other publishers (fitness babies games)

All other lifts exceed one by substantial mar-gins and go up to even 2391 (Apalabrados onRTB 4 a Spanish-language social Scrabble-like game)

Obviously the conclusion that the NNs indeed aremore likely to visit the same publishers is highly sig-nificant by a simple sign test comparing the visita-tion rate within the geosimilarity neighborhood tothat of overall population either via lift or leveragep lt 0001 Thus NNs of publisher visitors are indeedmore likely also to visit the same websites the abso-lute values of the lifts and leverages suggest that theyare substantially more likely

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 259

Table 18 RTBs 3 and 4 Percentage of Network Neighbors That Also Visit the Publisher

RTB 3

All NNs N1 N10

D N Lift Lev N1 Lift Lev N10 Lift Lev

Considering all RTB exchanges to determine baseline1 Conversion Exchange 5053 34054 6024 29001 64071 11069 59017 50077 9018 450242 Pinger Inc 2096 46002 15053 43006 65012 21097 62015 51048 17037 480523 Top Game Developer 4029 47024 11002 42095 60 13099 55071 48097 11042 440684 PremiumEntertainmentApp-FamilyOriented-Android 0059 23079 40012 2302 20045 34049 19086 22025 37052 210665 Enflick 0053 11071 21095 11017 21057 40044 21004 12027 23 110736 Clapfoot Games 0044 8055 19032 8011 11054 26006 1101 13099 31061 130557 FunPokes Inc 0018 11011 61028 10093 50 275077 49082 15079 87008 150618 Talkatone 0015 13013 88051 12098 31011 209076 30096 17091 120078 170779 Sevenlogics Inc 0008 2071 32059 2063 5088 70074 508 8033 100022 8025

10 A Star Software 0007 3033 44086 3025 23008 311024 23 9009 122061 9002Considering this RTB exchange only to determine baseline

1 Conversion Exchange 11045 34054 3002 23009 64071 5065 53025 50077 4043 390322 Pinger Inc 6013 46002 705 39089 65012 10062 58098 51048 8039 450353 Top Game Developer 8088 47024 5032 38036 60 6076 51012 48097 5052 400094 PremiumEntertainmentApp-FamilyOriented-Android 1023 23079 19038 22057 20045 16066 19023 22025 18013 210025 Enflick 101 11071 1006 1006 21057 19054 20046 12027 11011 110166 Clapfoot Games 0092 8055 9033 7064 11054 12059 10062 13099 15027 130087 FunPokes Inc 0038 11011 2906 10074 50 133022 49062 15079 42007 150418 Talkatone 0031 13013 42076 12082 31011 101034 3008 17091 58035 170619 Sevenlogics Inc 0017 2071 15074 2054 5088 34017 5071 8033 48041 8016

10 A Star Software 0015 3033 21067 3017 23008 150036 22092 9009 59023 8094

RTB 4

Considering all RTB exchanges to determine baseline1 Pinger PhonemdashiOSmdashConversation 1001 71081 70085 7008 80077 79068 79076 70014 6902 690132 Talkatone iPhone App 0007 11038 17203 11031 29003 439068 28097 17002 257078 160963 DiceWithBuddies 0013 7041 59004 7028 18018 144092 18006 7069 61031 70574 9GAG Reader 0012 62013 497099 62 49009 39305 48097 49005 393017 480925 Video Downloader Pro Lite 0016 20003 123015 19087 22058 138084 22042 17093 110027 170776 Rage Comics 0009 609 8001 6081 13033 154086 13025 609 8001 60817 Spades 3D Lite 0006 1205 225076 12044 27027 492057 27022 1205 225076 120448 Apalabrados 0003 77042 21391071 77039 75 21316097 74097 79032 21450043 790299 Relax Melodies HD 0002 31058 21002029 31056 50 31170029 49098 31058 21002029 31056

10 Celeb MemdashPhotoMaker 0 3023 82107 3022 7014 11819048 7014 3023 82107 3022Considering this RTB exchange only to determine baseline

1 Pinger PhonemdashiOSmdashConversation 8051 71081 8044 6303 80077 9049 72026 70014 8024 610632 Talkatone iPhone App 0055 11038 20053 10082 29003 52038 28048 17002 30071 160473 DiceWithBuddies 0009 7041 82011 7032 18018 201054 18009 7069 85027 7064 9GAG Reader 0097 62013 63097 61016 49009 50055 48012 49005 50051 480085 Video Downloader Pro Lite 0044 20003 45036 19059 22058 51014 22014 17093 40062 170496 Rage Comics 0037 609 18079 6053 13033 36032 12097 609 18079 60537 Spades 3D Lite 0046 1205 2609 12004 27027 58068 26081 1205 2609 120048 Apalabrados 0021 77042 371093 77021 75 360031 74079 79032 381006 790119 Relax Melodies HD 0013 31058 238054 31045 50 377069 49087 31058 238054 31045

10 Celeb MemdashPhotoMaker 0003 3023 97089 3019 7014 216076 7011 3023 97089 3019

Notes Results are given for the complete data set (D) the complete neighborhood of a user who visited the given publisher (All NNs) and considering theneighborhood of closest 1 and 10 users (N1 and N10)

The very high lifts could be attributed to two phe-nomena They may provide some evidence that theGSN embeds a true social network as pointed out bythe good results for the social networking publishers(eg Apalabrados with a lift of 2391 appsevitecomwith a lift of 2212 and others) Friends visiting thesame social networking sites (here unusual ones)

would increase the lift and leverage over just havingdifferent instances of the same user visiting the samewebsites An alternative explanation is that some ofthese publishers are much more likely to be visitedacross different devices of the same user It is not clearwhether this is actually true However it argues forobtaining specific and broad data on which devices

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers260 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

are not the same user to shed further light on forexample any broad data that include a (anonymized)user identifier

From these results we can conclude that GSNneighbors indeed exhibit substantially similar pub-lisher visitation behavior Next we assess to whatextent more strongly connected NNs are more similarthan less strongly connected NNs

72 Ranking Users with Similar Interests

Are the More-Similar Geosimilarity Neighbors of aPublisherrsquos Visitors Even More Likely Also to Be Vis-itors to the Same Publisher To assess whether more-similar geosimilarity neighbors are even more likelyto share interests we compute the measures exactlyas in the previous study except instead of consider-ing all NNs only the 1 (N1) and 10 (N10) neighborswith the highest scores are considered (see Tables 17and 18) First we must consider which of the 15 met-rics that are defined to measure link strength to useTable 19 reports the rank aggregations of the differentmetrics as to which provide the best lifts Specificallyfor each of the 40 publishers each scoring metric pro-vides a lift and for that publisher the scoring metricscan be ranked (the best being ranked 1 etc) The rankaggregation is the average rank for a scoring metricacross all 40 publishers Thus if one metric providedthe best lift for all publishers it would get an aggre-gated rank of 1

To visualize the best metrics the metrics with anaverage ranking less than 8 are shown in boldfacein Table 19 The Jaccard metrics do not perform wellin this study either showing that penalizing the con-nection to users that visit many locations is not sen-sible Remember that the Jaccard metric is based onthe idea that users that visit many locations will havea higher chance to share some location with anotheruser However we have so many locations in totalthat visiting a couple more locations will increase onlymarginally this probability of sharing a location bychance A user that logs into more IP addresses andis more active should hence not be penalized andranked lower than other less active users Metric 3which sums the IDF scores of the shared locationsperforms quite well Given the operational efficiencyadvantage over the frequency-based metrics this met-ric is chosen to measure the strength of the links

As seen from the results in Tables 17 and 18 thelift and leverage values for the strongly connected (top 1and top 10) NNs are even higher sometimes astro-nomical (3170 for Relax Melodies HD (RTB 4)) Theseresults are highly significant by sign tests compar-ing either the lift or the leverage between the top-ranked neighbors and the entire cohort (p lt 0001) Theabsolute values again indicate that the users with the

strongest geosimilarity are often substantially morelikely to visit the same publisher

Thus we can conclude that not only does geosim-ilarity find users with similar interests it also ranksusers well by their likelihood of having similarinterests

An interesting follow-up question is whether thecharacteristics of a publisher are of importance for theresults For example are GSN neighbors more sim-ilar in terms of using the same social network appas compared to playing the same games To answerthis question the publishers are categorized into sixclasses funny social news games communicationand miscellaneous In Figure 9 the lifts of the pub-lishers are shown per category Some general trendscan be observed where publishers related to funnycontent have the highest median lift This could beexplained by the fact that users often forward funnycontent to their friends The social websitesapps fol-low next which seems to further demonstrate thatfriends are likely linked in the GSN However pub-lishers in the communication category perform not sowell compared with the other categories and ratherhigh variances are observed in the different cate-gories Please note that high variance is observedacross the categories therefore it is difficult to makewell-supported claims from these results

73 TimingAs a final evaluation we explore whether the inclu-sion of temporal similarity in location visitation ishelpful for defining GSN connections To this end weintroduce a setting where two users are connectedonly if they visit an IP address in the same timeperiod of a day Specifically we divide a day intodifferent time windowsmdashfor example two time peri-ods of 12 hoursmdashwhere users are connected only ifthey visit the same IP address in the same time periodof a day (which might be on different days) Practi-cally two digits are added to an IP which denote thestarting moment of the corresponding time periodWe report on time periods of 2 hours and of 12 hours(other periods yield similar results) We also includethe setting where users are connected only when theyvisit the same IP address on the same day in light ofthe result of Crandall et al (2010) that visiting thesame location at around the same time is indicativeof friendship

The resulting lifts are reported in Figures 10ndash12when considering respectively all NNs the top 1 NNand the top 10 NNs The results when not consider-ing the time period are repeated as well (All) lim-ited to the conservative case of only considering oneexchange (see above) The baseline (lift of one) isindicated with a red horizontal line lifts higher than1000 are shown as 1000 to limit the range to be able

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 261

Table 19 Rank Aggregation per Metric Over All Publishers (Across All RTBs)

Top 1 NN Top 10 NNs

1 Shared unique IPs 801 1 Shared Unique IPs 802252 BooleanmdashSharing an IP 80625 2 BooleanmdashSharing an IP 809133 Shared unique IPsmdashIDF weighted 70025 3 Shared unique IPsmdashIDF weighted 709754 Shared IP visitsmdashMin 70063 4 Shared IP visitsmdashMin 704135 Shared IP visitsmdashMax 7095 5 Shared IP visitsmdashMax 70256 Shared IP visitsmdashAvg 8005 6 Shared IP visitsmdashAvg 704887 Shared IP visitsmdashMinmdashIDF weighted 60625 7 Shared IP visitsmdashMinmdashIDF weighted 703758 Shared IP visitsmdashMaxmdashIDF weighted 70975 8 Shared IP visitsmdashMaxmdashIDF weighted 702259 Shared IP visitsmdashAvgmdashIDF weighted 70938 9 Shared IP visitsmdashAvgmdashIDF weighted 70525

10 Cosine similarity 80338 10 Cosine similarity 804511 Cosine similaritymdashIDF weighted 80213 11 Cosine similaritymdashIDF weighted 8086312 GSN Jaccard 80925 12 GSN Jaccard 913 GSN JaccardmdashIDF weighted 80325 13 GSN JaccardmdashIDF weighted 9013814 GSN Jaccard Freq 80425 14 GSN Jaccard Freq 80515 GSN Jaccard FreqmdashIDF weighted 80425 15 GSN Jaccard FreqmdashIDF weighted 805

to visualize the results Tables 20 and 21 show theaverage rank aggregation (see the discussion of rankaggregation above) in terms of lift per RTB and thep-values of a signed rank test respectively

When comparing the different temporal settingswe see that the ldquoDayrdquo variant performs best and lim-ited differences exist between the time periods of 2and 12 hours More important to observe from theseresults is that (overall) not including time in the GSNdesign outperforms all three temporal variants sub-stantially Only for RTB 3 and the Day version whereusers are connected only if they visit the same IPaddress on the same day are the lifts not signifi-cantly worse than having no time constraint at a 1level (although even here they are significantly worseat a 10 level) This shows that for finding userswith similar interests which locations they visit ismore important than when they visit them By includ-ing the time constraints connections between previ-ously connected users are lost simply because they

Figure 9 (Color online) Lifts of the Publishers (for All Four RTBs)Ranked According to the Median Lift Within the Category(Medians Shown by the Full Line)

visit locations in different time periods Although weuse time to define links metrics that include time todefine the strength of a link might improve the resultsfurther and we consider that an interesting issue forfuture research

8 ConclusionThis paper presents a new design for using geosim-ilarity to connect user instances in a geosimilarity-weighted network In the GSN users are connected ifthey share at least one observed location Various met-rics can be used to yield degrees of similarity We pre-sented intuitive arguments theory and prior researchthat suggest that geosimilarity and the GSN designshould link similar users We also provided strongempirical support Specifically Study 1 shows thatthe GSN indeed links different users instances cor-responding to the same individual and the geosimi-larity metrics rank user instances by their likelihoodof corresponding to the same individual Study 2shows that the GSN links users who have similarinterests as measured by their propensity to visitthe same publishersuse the same apps It also pro-vides strong evidence that the geosimilarity metricsrank user instances by their likelihood of havingsimilar interests Taken together the results providestrong support that the GSN links users with similarinterestsmdashin many cases because they actually repre-sent the same individuals

The main limitation of the research presented inthis paper is that the data are not sufficient to dis-tinguish whether the similarity in interests is solelybased on connecting instances of the same user Thisis an important avenue for future research This lim-itation notwithstanding the very strong results fromStudy 2 have important implications in either caseeither the network is indeed finding different indi-viduals with substantially similar interests or we

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers262 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

Figure 10 (Color online) Lifts Considering All Network Neighbors per RTB System (Each Panel) With and Without the Time Constraint Limited toOne Network (see sect71) for the 40 Publishers

RTB 3mdashAll NNs

0 5 10

Publisher

0

20

40

60

80

100

Lift

RTB 4mdashAll NNs

0 5 10

Publisher

0

200

400

600

800

1000

Lift

RTB 2mdashAll NNs

0 5 10

Publisher

0

200

400

600

800

1000

Lift

0 5 100

200

400

600

800

1000RTB 1mdashAll NNs

Publisher

Lift

All

TP2

TP12

Day

Note Lifts higher than 1000 are shown at 1000

Figure 11 (Color online) Lifts Considering the Closest 1 Network Neighbor per RTB System (Each Panel) With and Without the Time ConstraintLimited to One Network (see sect71) for the 40 Publishers

Lift

0

100

200

400

300

RTB 3mdashTop 1 NNs

0 5 10

Publisher

All

TP2

TP12

Day

0 5 100

200

400

600

800

1000RTB 1mdashTop 1 NNs

Publisher

Lift

RTB 2mdashTop 1 NNs

0 5 10

Publisher

0

200

400

600

800

1000

Lift

RTB 4mdashTop 1 NNs

0 5 10

Publisher

0

200

400

600

800

1000

Lift

Note Lifts higher than 1000 are shown at 1000

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 263

Figure 12 (Color online) Lifts Considering the Closest 10 Network Neighbors per RTB System (Each Panel) With and Without the Time ConstraintLimited to One Network (see sect71) for the 40 Publishers

RTB 2mdashTop 10 NNs

0 5 10

Publisher

0

200

400

600

800

1000

Lift

0 5 10

Publisher

Lift

0

50

100

150

0 5 10

Publisher

0

200

400

600

800

1000

Lift

0 5 100

200

400

600

800

1000

RTB 1mdashTop 10 NNs

RTB 3mdashTop 10 NNs RTB 4mdashTop 10 NNs

Publisher

Lift

All

TP2

TP12

Day

Note Lifts higher than 1000 are shown at 1000

have very strong additional support for the rankingresults of Study 1 as the neighbors with the strongestgeosimilarity have significantly higher affinity for thesame publishersapps

These results have broad and immediate manage-ment implications within our motivating applicationof mobile advertising As discussed above digitalmarketers would like to target the many fragmentedinstances of individuals in the digital advertising

Table 20 Rank Aggregations in Terms of Lift AveragedAcross All Publishers (per RTB)

All TP2 TP12 Day

RTB 1 101 209 309 201RTB 2 1 204 304 302RTB 3 101 3 4 109RTB 4 101 208 308 203

Note Linking based on time does not improve over linkingignoring time

Table 21 Evaluation of Including Time p-Values ofSigned Rank Test

All TP2 TP12 Day

RTB 1 mdash 0 0002 0 0002 0 0006RTB 2 mdash 0 0002 0 0002 0 0002RTB 3 mdash 0 0002 0 0002 00065RTB 4 mdash 0 0002 0 0002 0 0004

ecosystem as well as users who have similar intereststo particular ldquoseedrdquo users Location data such asIP addresses are readily available across the adver-tising ecosystem The most direct use of the GSN formobile ad targeting is to seed a targeting campaignwith users chosen to exhibit some characteristics ofinterest These could be chosen through any of themyriad of methods currently used in advertising TheGSN will directly allow the targeting of the othermembers of these usersrsquo geosimilarity cohorts whichwill include other instances of the same user othersimilar users and likely both So for example if a setof users has been identified to have brand affinity viavisits to a brandrsquos website the GSN cohorts of theseusers would provide an attractive avenue to expandthe reach of a campaign to target them (extendingtraditional retargeting) and also target users similarto the seeds (customer prospecting) In the researchdata set we see that different instances of the sameusers are very often connected in the GSN and thatthose connected in the GSN exhibit substantially sim-ilar interests

The GSN also could be used for (privacy-friendly)hyperlocal targetingmdashmeaning targeting people whofrequent a particular location without needing to storedata on the actual locations of the users as describedearlier IP addresses currently are used by somemarketers for coarse-grained demographic targetinginferring geography from IP address registration data

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers264 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

and then inferring demographics from the geographyThe GSN provides a complementary alternative it willconnect people who visit the same fine-grained loca-tions As described above in the couponing exampleif a campaign is seeded with customers of a localestablishment (eg via an online loyalty program)geosimilarity can target others who frequent the samelocations

An exciting potential future use of the GSN is theevaluation of marketing campaigns across differentchannels This has become an important challenge formobile advertizing (Lee 2014)14 For example I mightreceive an ad for a hot new product on my mobiledevice I may be interested in the product but Irsquomunlikely to buy this product on my mobile phoneRather Irsquoll buy the product using my laptop or PC Bylinking the screens of the same user we are now ableto provide a broader (and possibly much more robust)estimate of campaign success (eg conversion rate)across different channels by aggregating the successmetrics across GSN neighbors For example similar tohow a traditional ad campaign may runnot run anad in a controlled study across different cities agen-cies could targetnot target different (matched) usersand then look at the success rates in their respec-tive geosimilarity neighborhoods This should captureeffects of the same individual on different devices

As discussed at the outset advertisers need to becognizant of privacy concerns regarding the collec-tion storage and use of data This paper followswhat the FTC calls the ldquoprivacy by designrdquo approachBy design the technique does not need to store anydirect personal information about mobile users thereis no need for nonanonymized identifiers demo-graphics geographics psychographics etc In addi-tion the storage of indirect information about userscan be severely limited as well The method doesnot need the actual locationsmdashonly anonymized loca-tion ldquokeysrdquo So for example IP addresses can bereplaced with random numbers15 without affectingthe GSN performance (European Commission 2014Federal Trade Commission 2012) More technically atthe ldquoouter wallrdquo of the system or firm each deviceid can be irreversibly hashed to a random key Theonly requirement is that the same device be hashedto the same key if encountered again Similarly atthe ldquoouter wallrdquo of the system or firm every loca-tion also can be irreversibly hashed to a random keyThe GSN can be formed just the same with the ran-dom keys as with the actual locations If more privacy

14 httpwwwclickzcomclickznews233699680-of-marketers-will-run-cross-channel-marketing-campaigns-in-2014-study15 See both the FTCrsquos and the European Commissionrsquos 2010ndash2011privacy-by-design reports and the comments therein along withProvost et al (2009) for a thorough treatment of this topic

is desired hashing can be done irreversibly many toone and in that case it becomes impossible to asso-ciate definitively any particular location with any par-ticular deviceuser

Finally an aspect of this research that is importantfor managers but not typically covered in the predic-tive modeling literature is the notion of increasing thereach of a campaign For example as discussed abovepossibly the most straightforward use of the GSN isto select the same actor on different mobile devicesThis would allow us to expand the reach of a ldquoretar-getingrdquo campaign16

Increasing reach has important subtleties in the cur-rent advertising ecosystemmdashit is not just the otherside of the coin of increasing lift The reason is thatthere are many targeters in the onlinemobile adver-tising ecosystem all of whom are using the samedata retargeting data and demographicgeographicpsychographic data that are purchased from third-party data providers However these data are avail-able only on a subset of devices Thus all partieswho are using these data are competing for the samesometimes small set of devices Since large advertis-ers typically contract with multiple targeting firmswho take different strategies the effect is that theadvertiser is paying the firms to compete against eachother effectively raising the price in the auction andthereby raising their own cost of advertising17 Whatis more these will be the same devices that also willbe targeted for other campaigns because they are thedevices for which the targeters have data

However by connecting devices in a GSN we canexpand campaigns to devices for which there are noretargeting or third-party data available at all If thereis less competition for these devices we should beable to target them for a lower price Thus expand-ing reach has an implication for the cost-effectivenessof achieving a certain level of predictive performance(eg lift)

AcknowledgmentsThe authors thank the anonymous reviewers and edi-tors for very constructive suggestions which substantiallyimproved the paper This work was done while the firstauthor was at Coriolis Ventures and Everyscreen MediaThe authors extend our gratitude to Everyscreen MediaDstillery Lauren Moores Tom OrsquoReilly Vinayak Javaly andTina Eliassi-Rad for many discussions about mobile adver-tising and mobile ad targeting as well as technical assistance

16 Retargeting is to target browsers who have previously purchasedfrom the brand or who have taken some other indicative brandaction such as browsing the brandrsquos site Retargeting is consid-ered by many to be one of the most effective targeting strategies(although to our knowledge these conclusions are drawn based onassessing conversion rate rather than the influence of the advertis-ing please see Stitelman et al 2011)17 The latter is our conjecture

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 265

to obtain all of the relevant data This particular data set wasnot necessarily used in the development of any productionmodel used for mobile advertising The first author thanksNEC and Andre Meyer for faculty fellowships The authorsthank the Moore and Sloan Foundations for their generoussupport of the Moore-Sloan Data Science Environment atNew York University

ReferencesAgarwal R Gupta AK Kraut RE (2008) Editorial overviewmdashThe

interplay between digital and social networks Inform SystemsRes 19(3)243ndash252

Aral S Muchnik L Sundararajan A (2009) Distinguishing influence-based contagion from homophily-driven diffusion in dynamicnetworks Proc National Acad Sci USA 106(51)21544ndash21549

Bampo M Ewing MT Mather DR Stewart D Wallace M (2008)The effects of the social structure of digital networks on viralmarketing performance Inform Systems Res 19(3)273ndash290

Cho E Myers SA Leskovec J (2011) Friendship and mobility Usermovement in location-based social networks Proc 17th ACMSIGKDD Internat Conf Knowledge Discovery and Data MiningKDD rsquo11 (ACM New York) 1082ndash1090

Cortes C Pregibon D Volinsky CT (2001) Communities of interestHoffmann F Hand DJ Adams N Fisher D Guimaraes G edsAdvances in Intelligent Data Analysis Lecture Notes ComputSci Vol 2189 (Springer-Verlag Berlin Heidelberg) 105ndash114

Crandall DJ Backstrom L Cosley D Suri S Huttenlocher DKleinberg J (2010) Inferring social ties from geographic coinci-dences Proc National Acad Sci USA 107(52)22436ndash22441

de Montjoye YA Hidalgo CA Verleysen M Blondel VD (2013)Unique in the crowd The privacy bounds of human mobilitySci Rep 31ndash5

European Commission (2014) Data protection day 2014 Full speedon EU data protection reform Press release January 27httpeuropaeurapidpress-release_MEMO-14-60_enhtm

Fawcett T (2006) An introduction to ROC analysis Pattern Recogni-tion Lett 27(8)861ndash874

Federal Trade Commission (2012) Protecting consumer pri-vacy in an era of rapid change FTC Report March 2012httpswwwftcgovsitesdefaultfilesdocumentsreportsfederal-trade-commission-report-protecting-consumer-privacy-era-rapid-change-recommendations120326privacyreportpdf

Heidemann J Klier M Probst F (2010) Identifying key users inonline social networks A pagerank based approach InformSystems J 4801(December)157ndash160

Hill S Provost F (2003) The myth of the double-blind reviewAuthor identification using only citations SIGKDD Explorations5(2)179ndash184

Hill S Provost F Volinsky C (2006) Network-based marketingIdentifying likely adopters via consumer networks Statist Sci22(2)256ndash276

Hotho A Nuumlrnberger A Paass G (2005) A brief survey of text min-ing LDV Forum 20(1)19ndash62

Kerho S (2012) Mobile marketingmdashA new analytics frameworkwhat we have and what we need Presentation Marketing onthe Move Conference Wharton School University of PennsylvaniaPhiladelphia February 28

Kossinets G Watts DJ (2009) Origins of homophily in an evolvingsocial network Amer J Sociol 115(2)405ndash450

Lee J (2014) 80 of marketers will run cross-channel marketingcampaigns in 2014 ClickZ (March 31) httpwwwclickzcomclickznews233699680-of-marketers-will-run-cross-channel-marketing-campaigns-in-2014-study

Martens D Provost F (2011) Pseudo-social network targeting fromconsumer transaction data Working paper CEDER-11-05 SternSchool of Business New York University New York

McPherson M Smith-Lovin L Cook JM (2001) Birds of afeather Homophily in social networks Annual Rev Sociol 27415ndash444

Oinas-Kukkonen H Lyytinen K Yoo Y (2010) Social networks andinformation systems Ongoing and future research streamsJ Assoc Inform Systems 11(2)61ndash68

Pan W Aharony N Pentland A (2011) Composite social networkfor predicting mobile apps installation Burgard W Roth Deds Proc AAAI 2011 (AAAI Press San Francisco)

Perlich C Dalessandro B Raeder T Stitelman O Provost F (2014)Machine learning for targeted display advertising Transferlearning in action Machine Learn 95(1)103ndash127

Pogue D (2011) Wrapping up the apple location brouhaha NewYork Times (April 28) httppogueblogsnytimescom20110428wrapping-up-the-apple-location-brouhaha

Provost F Fawcett T (2013) Data Science for Business What You Needto Know About Data Mining and Data-analytic Thinking (OrsquoReillyMedia Inc Sebastopol CA)

Provost F Dalessandro B Hook R Zhang X Murray A (2009) Audi-ence selection for on-line brand advertising Privacy-friendlysocial network targeting KDD rsquo09 Proc 15th ACM SIGKDDInternat Conf Knowledge Discovery and Data Mining (ACMNew York) 707ndash716

Pubmatic (2010) Understanding real-time bidding (RTB) from thepublisherrsquos perspective Technical report Pubmatic SiliconValley CA

Quercia D Lathia N Calabrese F Di Lorenzo G Crowcroft J (2010)Recommending social events from mobile phone location dataProc 2010 IEEE Internat Conf Data Mining ICDM rsquo10 (IEEEComputer Society Washington DC) 971ndash976

Raeder T Stitelman O Dalessandro B Perlich C Provost F (2012)Design principles of massive robust prediction systems Proc18th ACM SIGKDD Internat Conf Knowledge Discovery and DataMining (ACM New York) 1357ndash1365

Stitelman O Dalessandro B Perlich C Provost F (2011) Estimatingthe effect of online display advertising on browser conversionData Mining and Audience Intelligence for Advertising ADKDD2011 (ACM New York) 8

Stitelman O Perlich C Dalessandro B Hook R Raeder T Provost F(2013) Using co-visitation networks for detecting large scaleonline display advertising exchange fraud Proc 19th ACMSIGKDD Internat Conf Knowledge Discovery and Data Mining(ACM New York) 1240ndash1248

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Page 3: Finding Similar Mobile Consumers with a Privacy-Friendly ...pages.stern.nyu.edu/~fprovost/Papers/PMM_ISR_2015.pdf · Provost, Martens, and Murray: Finding Similar Mobile Consumers

Provost Martens and Murray Finding Similar Mobile Consumers244 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

This paperrsquos motivating application is mobile ad-vertising As with traditional display ad targetingmobile ad targeting could be based on context demo-graphics or psychographics if such data were avail-able Because such data are largely unavailable onmobile consumers in the advertising ecosystem adtargeting firms increasingly are looking to othermeans of finding suitable candidates for ads An alter-native method that has been gaining traction for (non-mobile) ad targeting both in research and in practiceis targeting based on direct or indirect connectionsto specific individuals which we loosely call ldquosocialrdquotargeting as we discuss in detail below For exam-ple brands often want to target individuals who aresimilar to particular existing customers of the brandor product in question or the same existing cus-tomers on different devices2 Geosimilarity providesa viable alternative for such indirect social targetingfor mobile advertising where (anonymized) locationdata are available and importantly anonymized loca-tion data may be much more readily available thanother targeting data

No matter how users are chosen for targetingmobile advertisers need to deal effectively with theproblem of consumer fragmentation individuals areobserved only through the multiple (baroque) infor-mation systems that comprise the digital adver-tising ecosystem and a particular individual maycorrespond to various user instances (Kerho 2012) Forexample an individual may have different instancesbecause she is observed on different devices such asher mobile phone tablet laptop and PC Individualsmay also have different instances because they appearin different advertisement bidding systems each ofwhich presents the individual differently In manysituations these instances are not associated with aunique identifier for the individual unlike the cookiesthat are used to represent a userndashbrowser pair in mostwork on individualized desktop ad targeting More-over as we will discuss presently there are importantreasons why we might prefer that the instances not beassociated with a personal identifier The bottom lineis that once an instance of an individual is identified asbeing a good target advertisers would like to be ableto also target other instances of the same individual(as well as individuals with very similar interests)

Geosimilarity could also be used for privacy-friendly ldquohyperlocalrdquo targeting meaning targeting

2 Note that this paper is not about the ultimate effectiveness of thebrandrsquos choice of whom to targetmdashit may be that targeting theselected users on different devices or targeting users with similarinterests is not appropriate from a marketing standpoint This isa question specific to each brand and the brandrsquos campaign goals(eg direct marketing versus brand advertising) Furthermore thisquestion is intricately intertwined with the design of the creativebeing delivered which also is outside the scope of this paper

people in a precise location (statistically speaking)without needing to store data on the actual loca-tions of the users For example consider a hyper-local coupon campaign a couponing company wantsto target special offers for the local restaurant on thecorner The best prospects are those people who fre-quent this precise area If the restaurant can providesome (anonymized) ldquoseedrdquo usersmdashfor example exist-ing clients of the particular local business (eg identi-fied via an online loyalty program)mdashthe geosimilarityneighbors of these seed users may have a high prob-ability of also frequenting the same precise locationsand thereby be very good prospects for the hyperlocalcoupon

Looking from a different perspective we have seenthe sort of uproar that arises from the idea that ourlocation behavior is being ldquotrackedrdquo by our mobiletechnology (Pogue 2011)3 Therefore marketers whodream of location-driven targeting should think care-fully about what the Federal Trade Commission (FTC)calls ldquoprivacy by designrdquo4 and consider what optionscan provide effective advertising with minimal datacollection and storage As will be detailed in sect4we explicitly take this desideratum into account byanonymizing both the device and location identifiersThis method of using ldquodoubly anonymizedrdquo data forprivacy friendliness is described in more detail else-where (Provost et al 2009)

In sum geosimilarity can be used for composing amobile audience for targeting as follows Based on adoubly anonymized GSN find individuals who areclosely linked to individuals we know already to havethe characteristic(s) that we desire such as prior pur-chase history (as with ldquoretargetingrdquo) brand affinity(eg via visiting a Web page ldquolikingrdquo a brand orproduct or clicking on an ad) or key demograph-ics or even based on sophisticated targeting designs(Bampo et al 2008 Agarwal et al 2008 Heidemannet al 2010 Perlich et al 2014) Once these key ldquoseedrdquousers have been identified targeting close neighborsin the GSN will tend to target both users with simi-lar interests and users who are other instances of thesame individual Below we present theoretical justi-fication for such an approach tying it in to relatedresearch We then investigate empirically whether themethod indeed tends to target (i) other instances ofthe same individual andor (ii) individuals with sim-ilar interests The empirical study is based on datadrawn from actual real-time bidding (RTB) exchangesfor mobile advertising

Before we move on to related work please let usnote that our proposed GSN is different from geoso-cial networking which is defined as a type of social

3 httppogueblogsnytimescom20110428wrapping-up-the-apple-location-brouhaha4 httpwwwftcgovopa201012privacyreportshtm

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 245

networking in which geographic services are usedto enable additional social dynamics (Quercia et al2010) The latter starts from an actual network ofinterpersonal relationships and adds location data torecommend other locations and events In contrastthe GSN uses location data to create links betweenusers without the explicit guarantee that these linkscorrespond to an actual interpersonal relationship(however this surely might be as described in thenext section see Crandall et al 2010)

Section 2 argues theoretically why this GSN designshould be a good idea After that we present the high-level GSN design in more detail followed by specificsof the implementation we examine empirically Thenwe provide results from an empirical study basedon real mobile location data from real-time biddingexchanges Section 5 describes the data and experi-mental setup including several technical definitionsfor geosimilarity The ability of the GSN to connectdifferent instances of the same individual is evaluatedin sect6 The use of the GSN to select users with similarinterests and tastes is assessed in sect7

2 Motivation and Related Work21 Social TargetingAdvertising targeting has evolved substantially overthe past half century As information systems providedaccess to new sources and types of data marketersadded new targeting strategies designed around thenew data For example as demographic data becameavailable a few decades ago contextual targetingmdashtargeting based on inferring audience compositionfrom the context in which the ad will be shown (ega billboard location TV show magazine etc)mdashhadto share the spotlight with data-driven demographictargeting either based on explicit demographic pro-files or predictive modeling As data aggregators coa-lesced and integrated information such as magazinesubscriptions and catalog purchases ldquopsychographicrdquodata entered the mix and broadened yet again thespace of targeting designs

Recently we have seen the introduction of a differ-ent sort of targeting design which we can generallycall social targeting Social targeting differs from theaforementioned targeting methods because it relieson explicit linkages between specific individuals Forexample Hill et al (2006) showed the remarkableeffectiveness of social-network targeting targeting con-sumers who are linked to known customers by asocial network Subsequently Facebook (and others)have attempted to implement social-network target-ing for online advertising with varying degrees ofsuccess (see eg Oinas-Kukkonen et al 2010)

We explicitly generalize from social-network target-ing to social targeting to retain the notion that the

targeting is based on linkages to specific other indi-viduals but to relax the notion that the linkages needto be ldquotruerdquo interpersonal relationships The designof Provost et al (2009) is an example of social tar-geting that is not based on ldquotruerdquo interpersonal rela-tionships the linkages between individuals are basedon a bipartite content-affinity network So the socialtargeting there is based on forming an audience byfinding consumers who are linked by shared con-tent visitation with other specific consumers who areknown to have brand affinity (more on that later)Similarly Martens and Provost (2011) define a net-work among banking customers based on paymenttransaction data which is subsequently used to targetmarketing offers for financial products

22 GeosimilarityThe GSN can be very fine-grained based on shared(anonymized) location data for example Internet pro-tocol (IP) addresses fine-grained latitudelongitudecells or small geographic tracts Why would targetinggeosimilarity network neighbors (NNs) of preselectedseed users be a good idea There are several reasonsLet us call mobile devices that are (directly) linked inthe GSN (first-degree) network neighbors

First of all first-degree network neighbors shareat least one location and possibly several As thenumber of shared locations grows we conjecture thatthe likelihood increases that the two devices actuallybelong to the same person For example who besidesme is observed primarily on my home and office IPaddress let alone my favorite coffee shop We seeanalogous results showing that different instances ofthe same person call the same phone numbers (Corteset al 2001) and cite the same references (Hill andProvost 2003)

Second direct coarse-grained geographic targetingalready is used widely in off-line advertising (not viasocial targeting) because geography is a reasonableproxy for demographics and other predictive featuresAn important difference in the present work is thatwe do not choose the geographies to target explic-itly but instead use them implicitly This allows theactual locations to be anonymized Furthermore itallows us to use location information that is too fine-grained even to include in most predictive modelingfor example locations appearing only in a tiny frac-tion of instances (eg a home Wi-Fi address may onlyappear in 0000001 or less of the usersrsquo location sets)as well as transient Wi-Fi locations that only connecttwo devices for a brief time

Third fine-grained location information is likely tocontain more detailed (latent) information than stan-dard geographic information Not only would it linkdevices by approximate wealth income demograph-ics etc it may well link by employer educational

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers246 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

institution interests community and even shoppinghabits Thus we conjecture that geosimilarity tar-geting may combine the advantages of geographictargeting discussed in the previous paragraph withadvantages similar to content-affinity social targetingwhich has been shown to be effective specifically foronline advertising (Provost et al 2009 Perlich et al2014) We might call this ldquolocale-affinityrdquo social tar-geting The empirical results below show that indeedonersquos geosimilarity neighbors are much more likely tofrequent the same online publishers and mobile apps

If that were not enough research provides yetanother reason to expect that geosimilarity targetingmay be especially effective In an article in the Pro-ceedings of the National Academy of Sciences of the UnitedStates of America Crandall et al (2010 p 22436) showthat geographic co-occurrences between individualsare very strongly predictive of the individuals beingfriends ldquoThe knowledge that two people were prox-imate at just a few distinct locations at roughly thesame times can indicate a high conditional proba-bility that they are directly linked in the underlyingsocial networkrdquo This means that a GSN would notonly capture the advantages of geographic targetingand locale-affinity targeting but might also incorpo-rate actual social-network targeting also shown to beextremely effective for marketing (Hill et al 2006)In fact when massive descriptive data are availablethe (usually latent) similarity between social-networkneighbors has been shown to explain much of themarketing advantage previously attributed to socialinfluence (Aral et al 2009)

Moreover interestingly research also has shown thatthe homophily (McPherson et al 2001) that has beenused to explain the effectiveness of social-network tar-geting may actually be due largely to the constraintsplaced by opportunity (Kossinets and Watts 2009)Tie formation in social networks is biased heavilyby triadic closure and thus by structural proximityOver many generations of tie formation biases in theselection of structurally proximate individuals ldquocanamplify even a modest preference for similar othersvia a cumulative advantage-like process to producestriking patterns of observed homophilyrdquo (Kossinetsand Watts 2009 p 405) Why is this important for thepresent paper Because with the exception of sociallinks formed solely online (like new Facebook-onlyfriends) constraints on opportunity are framed byconstraints on physical colocation As Kossinets andWatts (2009 p 406) observe ldquoan individualrsquos choiceof relations is heavily constrained by other aspects ofhis or her life such as geographical location choice ofoccupation place of workrdquo These are exactly the sortsof links that would be represented in a GSN based onmobile device location data (including tablets and lap-tops) Thus the GSN may in addition reveal a large

driver of homophily and thus expand the effective-ness of a social-network targeting strategy to individ-uals who also would have been similar ldquofriendsrdquo butfor whatever reason were not chosen

3 Geosimilarity GeosimilarityNetworks and Mobile Ad Targeting

The basis for the targeting design is to create a net-work among users based on geosimilarity and thento use the network for inference For this paper wewill create a GSN directly among user instances5 Theelements of the network-targeting design include thefollowing How will entities (mobile user instances)be represented What exactly will be the geosimilaritylinks And how exactly will the predictive inferencebe conducted

In this section we will discuss the design at a highlevel including some further-looking elements thatare not part of the experiments but which providecompleteness for the discerning reader To make thediscussion more concrete we first present the moti-vating data scenario The design should generalize tomobile settings beyond this specific data scenario

31 The Mobile Real-Time Bidding EcosystemA high-level overview of how an RTB exchange worksis given in Figure 1 RTB exchanges bring togetheradvertisers and the publishers of Web pages andapps When a user visits a Web pageapp (here-after ldquoWeb pagerdquo) that sells ad slots through RTBexchanges a bid request is created Such a bid requestwill provide additional data on the user and Webpage visited such as the IP address publisher anddevice type (see Table 1 for a list of fields and statis-tics on their availability across RTB exchanges) Basedon such data advertisers can choose to bid For exam-ple a brand may want to target users that previ-ously visited their automobile website browsing for aecofriendly hybrid car The highest bidder is allowedto load a potentially personalized ad for exampleshowing an ad for an ecofriendly hybrid All of thisoccurs in real time in less than 100 milliseconds Thetwo interrelated problems this paper addresses canbe stated specifically in the context of this examplehow does the targeter find instances of this individ-ual who previously visited the brandrsquos website Canthe targeter find users with very similar interests (egthe focal userrsquos husband best friend) under the pre-sumption that they might also be reasonable candi-dates for an ad for an ecofriendly hybrid

5 An alternative is to create a bipartite network between users andlocations and draw inferences directly from the bipartite networkWe do not investigate that explicitly in this paper however theinterested reader should take a careful look at the similarity calcu-lations we present below

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 247

Figure 1 (Color online) Working of Real-Time Bidding

AdvertisersAd

Consumers

Publisher hasad slots for sale

At IP X is on website YBrand 1

51c

14c

33cBrand 2

Brand 3

3

Bids

Bid request

Loading of ad

2

with device type ZBidding is open 4

1

Notes When a user visits a Web page that sells ad slots through an RTB exchange (1) a bid request is created (2) after which several advertisers bid to allowto show an ad at that moment on that Web page to that user (3) The highest bidder is allowed to load a potentially personalized ad (4)

For this paper we address the data currently avail-able in the digital advertising ecosystem Specificallyvia RTB exchanges we can observe a massive numberof mobile users along with the data made availablefor advertisers to decide whether to bid on the userfor a particular campaign (see eg Pubmatic 2010)Two aspects of the data are crucial (1) Each instanceof a mobile user is associated with a key when thiskey is observed in the future we know that the futuredata item involves the same mobile device6 (2) Partof the data associated with each mobile device is alocation For this paper we consider that location tobe the (anonymized) IP address of the Wi-Fi network7

currently in use (although there are various sorts oflocation data) We observe the locations visited byeach user for the data records included in the researchdata set (described in more detail in sect51)

The mobile RTB ecosystem holds different infor-mation than for the typical online (desktop) worldwhich is important for our design The GSN designmeasures similarity in location visitation behaviorOne might wonder if other behavioral data can beobtained from the RTB systems such as website vis-itation data (Provost et al 2009 Raeder et al 2012)Table 1 shows which typical fields (besides the key)are visible on mobile devices over seven different RTBexchanges Let us consider the different data fields aspossible sources of behavioral data

First let us have a look at the fields that are avail-able everywhere the time and date of the bid request(Created) the RTB exchange (Network) the dimen-sions of the ad (Dims) the device type (Device) the

6 The notion of an instance is important Many instances of the samedevice may have different keys in the online advertising ecosystemAs discussed above connecting these is one focus of this research7 For this paper we will consider transient IP addresses that are notexplicitly removed such as those belonging to 3G or other carriernetworks (but not identified as such) to be noise in the data thatmust be accommodated by the similaritytargeting design Betterremoval techniques could increase the strength of the results pre-sented in the next sections

user agent the publisher and the IP address TheIP address (anonymized) is the location we use in ourGSN design Figures 2(a) and 2(b) show the distri-bution of the number of users seen at an IP addressand the number of IP addresses seen per user respec-tively (more information on this data set is providedlater in sect51) On average a user visits 166 IP ad-dresses and an IP address is visited on average bytwo users (over a time period of 10 days see sect51)Although in our data set some known 3G IP addresses(subranges) have already been removed Figure 2(a)shows that high-volume IP addresses are still presentSome of these IP addresses with a very large numberof unique users seen correspond to large institutionsthink for example of a university Wi-Fi address

Although the publisher data item is available for allbid requests we observe that most device instancesvisit only a single publisher (about 95 whereasless than 1 visit three or more publishers andonly 00009 are seen on 10 or more publishers)The publisher-specific Interactive Advertising Bureau(IAB) categorizations (Iabcats) suffer from the sameproblem (as well as being too coarse-grained to beuseful for identifying the same user in any event)The reason is that as mentioned above a mobiledevice has different identifiers when viewed throughdifferent channels in the mobile advertising ecosys-tem such as different RTB systems or even differentapps On the other hand a publisher has on average1366 users that visited it and publishers by and largeseem to be RTB specific Figures 3(a) and 3(b) showmore details of the distributions As a consequenceusing publisher-visitation data to connect users is notbroadly helpful because two user instances in dif-ferent RTB systems rarely share a publisher The IPaddress is always available and is passed consistentlythrough the different apps and RTB systems

The other consistently available fields (such as useragent device type dimensions) are always the samefor a single device These could in principle be used toimprove same-device finding but are only marginally

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers248 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

Table 1 Data Availability for Mobile Devices Across Different RTB Systems

RTB 1 () RTB 2 () RTB 3 () RTB 4 () RTB 5 () RTB 6 () RTB 7 ()

Created 100 100 100 100 100 100 100Network 100 100 100 100 100 100 100Dims 100 100 100 100 100 100 100ID 100 100 100 36 0 19 44Device 100 100 100 100 100 100 100User agent 100 100 100 100 100 100 100Iabcats 0 100 100 98 100 100 84Geo 100 16 100 100 100 100 100Location 0 4 56 42 1 20 100Referrer 8 0 0 17 0 6 3Publisher 100 100 100 100 100 100 100URL 100 0 0 5 93 0 0IP 100 100 100 100 100 100 100

Notes Specifically the table shows how often the following fields are provided time of bid request (Created) the RTB network identifier(Network) dimensions of the ad (Dims) identifier of the device (ID) device type (Device) the user agent IAB categorization (Iabcats)the IP-inferred city state or country (Geo) the location latitude and longitude (Location) the referrer from where the user comes(Referrer) the publisher of the Web page (Publisher) the URL and the IP address

helpful for finding the same user on different devicesor for similar-user finding (eg using Apple productsversus Android products) we do not take advantageof them in this paper instead focusing specificallyon assessing the ability of the geosimilarity design

Figure 2 Distribution of (a) Number of Unique User Instances Seen per IP Address and (b) Number of IP Addresses a User Instance Logs Into

(b) Number of IPs per ID

Number of IPs

Fre

quen

cy o

f occ

uren

ce

100

108

107

106

105

104

103

102

100

101

101 102 103 104 105 106 107 108

(a) Number of IDs per IP

Fre

quen

cy o

f occ

uren

ce

108

107

106

105

104

103

102

100

101

Number of IDs

100 101 102 103 104 105

Figure 3 Distribution of (a) Number of Unique User Instances Seen per Publisher and (b) Number of Publishers a User Instance Visits

(a) Number of IDs per publisher

Number of IDs

Fre

quen

cy o

f occ

uren

ce

100

104

103

102

100

101

101 102 103 104 105 106 107

(b) Number of publishers per ID

Fre

quen

cy o

f occ

uren

ce

108

107

106

105

104

103

102

100

101

Number of publishers

100 101 102 103 104

(as opposed to how to best engineer a same-user-finding system)

The URL field is missing in all but two RTB ex-changes The Geo field is often limited to the citystate and country as inferred from the IP address

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 249

Location data in the form of latitude and longitude(Location) are only available in some RTB systemsand the accuracy of the latitudelongitude data in thecurrent mobile ecosystem is questionable8 again mak-ing it not suitable for broad use

32 Entity and Link RepresentationWe represent each user instance by a ldquoprofilerdquo ofbehavior across locations where we define a locationprofile as a vector of visit frequencies This will neces-sarily require a sparse representation as we observe avery large total set of locations fortunately each indi-vidual is seen at only a small subset of the locationsFor the empirical results below the location profile isa sparse vector of frequencies of observed visitationsto locations (which can be weighted in the calculationof similarity as described below)

The structure of the GSN is the graph with mobiledevices as the nodes and links between the devicesas the edges The corresponding design decisionsinvolve the selection of which devices to link at alland the weights to place on the links The simplestlink representation places unweighted links betweenall pairs of devices that share at least one specific loca-tion More sophisticated weighting criteria are basedon two notions which are discussed next and illus-trated with a simple example in Figure 4 The loca-tion data are obtained by listening to RTB ecosystemswhere IP addresses of devices are broadcast (as shownpreviously in Figure 1)

1 Given two devices that share at least one loca-tion the weight of the link between them can bea function of the devicesrsquo location profiles Such mea-sures can include simply the number of shared loca-tions and the number of shared locations weightedby their visitation percentage in the profile Alterna-tively the weighting could be a measure of similaritybetween the two location profiles for example cosinesimilarity

2 As such Devices A B and C are connected in theGSN of Figure 4 since they all logged into the Wi-Fiof the Met Museum Devices B and C also loggedinto the public Wi-Fi at Times Square and thereforeshould be more strongly connected than A and B (or Band C) which share only one location

8 Analyses of the latitudelongitude data show various data prob-lems including flipped coordinates coordinates set equal to eachother many latitudinallongitudinal points at the centroids of com-mon geographic areas and many impossible or highly improba-ble coordinates Mobile ad industry insiders claim that the highprice paid for latitudelongitude-based hyperlocal targeting hasled to a significant amount of latitudelongitude fraud We havefound nothing written on this phenomenon although data fraudis not uncommon in the online advertising industry (Stitelmanet al 2013)

Figure 4 (Color online) Basic Illustration of the Creation of a GSNBased on Mobile Location Data Observed for Three MobileDevices (A B and C) and a Laptop (D)

Met MuseumA B

Geosocial network(GSN)

A C

D

B

C

D

Times Square

Apartment

Note Devices are connected in the GSN if they visited (logged into) the samelocation (IP address)

3 Links can be weighted according to the componentlocationsrsquo 4lack of5 popularity For example it mightbe argued that the link formed by sharing a loca-tion with a small number of other devices (eg myapartment) indicates a stronger geosimilarity thansharing a location with a massive number of otherdevices This intuition can be extended if two devicesspend a lot of time at such an unpopular locale(eg my apartment) then that should indicate a verystrong similarity If two devices have approximatelythe same distributional profile across several of thesesites (their ldquolocation profilerdquo) then they are eitherthe same person or close friendssoulmates In ourexample C and D are connected because they bothlogged into the Wi-Fi of an apartment whereas Band C are connected by logging into the public (pop-ular) Wi-Fi at Times Square Hence the connectionbetween C and D (which are likely devices belongingto roommates or belonging to the same user) shouldbe stronger than the connection between B and C(which are likely devices belonging to two differenttourists)

This notion of similarity can be incorporated tech-nically by (1) adapting notions from information re-trieval we can weight locations for a given deviceby their device-specific popularity (from the devicersquoslocation profile) divided by the (log of) the locationrsquosoverall popularity let us call that LF times IDF (locationfrequency times inverse device frequency) Then (2) thestrength of the link between two devices that sharea location would be a function of the correspondingLF times IDF scores

Once we have one or more GSNs there are variousways to take advantage of them for targeting or otherinference The simplest method for inference is sim-ply to target all of the geosimilar network neighborsAlternatively one could target the ldquoclosestrdquo neighborsbased on one or more notions of geosimilarity (as dis-

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers250 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

cussed above) These are the methods used for theempirical results presented below

4 Specifics of the GSN DesignAs described above in this design the strength of thelink between two individuals in the GSN is based onthe similarity in the distribution of the locations thatthey visit For this paperrsquos results locations will be(anonymized) IP addresses We now define differentsimilarity measures of varying complexity which willbe evaluated in the following sections

41 NotationThe location profile of a user is represented by a vec-tor Ex of length m where element i denotes the num-ber of visits to IP address IPi2 Ex isin m Vector Exbool isa binary vector of size m where element i denoteswhether the user has visited IPi For similarity com-putations both the Hadamard product () and theinner product 4 middot 5 are used and the minimum (min)operator is defined to operate componentwise on twovectors as well as on a vector and a scalar The max-imum (max) and average (avg) operators are similarTheir logic is defined in Equations (1)ndash(4) The weightof an IP address is defined by its popularity in anal-ogy to the inverse document frequency used in text

Table 2 Similarity Metrics Between Two Userrsquos Location Visit Distributions Ex1 and Ex2

Similarity metric Range d4 Ex11 Ex25

1 scount4 Ex11 Ex25= Exbool11 middot Exbool12 6015 1

2 sbool4 Ex11 Ex25= min4scount4 Ex11 Ex25115 80119 1

3 scount1W 4 Ex11 Ex25= 4 Ew Exbool115 middot Exbool12 6015 103

4 sfreq1min4 Ex11 Ex25= min4 Ex11 Ex251 6015 2

5 sfreq1max4 Ex11 Ex25= max4 Exbool12 Ex11 Exbool11 Ex251 6015 3

6 sfreq1 avg4 Ex11 Ex25= avg4 Exbool12 Ex11 Exbool11 Ex251 6015 205

7 sfreq1min1W 4 Ex11 Ex25= min4 Ex1 Ew1 Ex2 Ew51 6015 206

8 sfreq1max1W 4 Ex11 Ex25= max44 Exbool12 Ew5 Ex11 4 Exbool11 Ew5 Ex251 6015 309

9 sfreq1 avg1W 4 Ex11 Ex25= avg44 Exbool12 Ew5 Ex11 4 Exbool11 Ew5 Ex251 6015 3025

10 scosine4 Ex11 Ex25=Ex1 middot Ex2

Ex11 middot Ex21

60117 0074

11 scosine1W 4 Ex11 Ex25=4 Exbool11 Ew5 middot 4 Exbool12 Ew5

Exbool11 Ew1 middot Exbool12 Ew1

60117 0060

12 sJaccard4 Ex11 Ex25=scount4 Ex11 Ex25

max4 Exbool111 Exbool1251

60117 0033

13 sJaccard1W 4 Ex11 Ex25=scount1W 4 Ex11 Ex25

max4 Exbool11 Ew1 Exbool12 Ew51

60117 0026

14 sJaccard1 freq4 Ex11 Ex25=sfreq1min4 Ex11 Ex25

max4 Ex1 Exbool121 Ex2 Exbool1151

60117 0067

15 sJaccard1 freq1W 4 Ex11 Ex25=sfreq1min1W 4 Ex11 Ex25

max44 Ex1 Exbool125 Ew1 4 Ex2 Exbool115 Ew51

60117 0067

mining (Hotho et al 2005) the weight wi for IPi (theinverse device frequency) is defined as the logarithmof the total number of users n divided by the numberof (unique) users that visit that specific IP address ni

Ez= Ex Ey1 zi = xi times yi1 (1)

a= Ex middot Ey =sum

i=12m

4xi times yi51 (2)

Ez= min4Ex1 Ey51 zi = min4xi1yi51 (3)

Ez= min4Ex1a51 zi = min4xi1 a51 (4)

wi = log104nni50 (5)

42 Link Strength MetricsFifteen similarity metrics are defined in Table 2 andare illustrated by measuring the strength betweenthe location profiles of two example users Ex1 and Ex2shown in Equation (6) The first user visits IP1 threetimes and IP2 twice whereas the second one visits IP1twice and IP3 once Several variants of the same met-rics are considered As described above some onlytake into account the unique number of shared loca-tions whereas others take into account the frequencyof the visits to these locations and the inverse devicefrequency The latter are denoted by freq and W respectively

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 251

The first similarity metric scount counts the numberof shared locations Since the two example users shareonly IP1 this strength is 1 The IDF weighted versionscount1W sums the IDF values of the shared locationswhich is 13 for our simple example The most basicmetric is given by sbool which is 1 when a location isshared and 0 otherwise indicating whether two usersare neighbors or not

The sfreq1min sfreq1max and sfreq1avg metrics compare thefrequencies of the visits to the shared locations Theytake the minimum maximum and average respec-tively of the frequencies for each shared location andsum them The weighted versions sfreq1min1W sfreq1max1W and sfreq1avg1W do the same but weight each frequencywith the corresponding IDF value

The scosine metric measures the similarity by takingthe cosine of the angle between the two vectors TheJaccard metric is defined as the size of the intersec-tion over the size of the union In this case it countsthe number of shared locations but normalizes theseover the total number of locations both users havevisited Users that visit many locations will have ahigher chance to share some location with anotheruser so the Jaccard metric penalizes them For ourbasic example sJaccard is 1 (IP1) over 3 (IP1 IP2 IP3)Once more variants taking into account the frequencyand the IDF are defined as well

Ex1 = 63 2 071

Ex2 = 62 0 171 (6)

Ew = 6103 107 270

5 Experimental Setup51 Data SetThe data set D for the empirical results is 10 days ofanonymized advertising bid requests from mobile de-vices (smartphones tablets etc) observed across sevenRTB exchanges A total of 322770794 bid requestsare observed coming from 42437559 unique userinstances Over these 10 days we observe 41829088unique IP addresses

The number of unique user instances per RTBexchange is given in Table 3 The distributions of thenumber of bid requests per device type and operatingsystem are given in Tables 4 and 5 which show thatthe majority of bid requests come from smartphonesand the iOS operating system

A user (instance) corresponds to an exchange-assigned user ID One person on several deviceswill hence correspond to several users One deviceon several exchanges will often correspond to sev-eral users Also it might be that one ad exchangesometimes provides different user IDs to the same

Table 3 User Instances per RTB Exchange

Exchange Volume Percentage ()

RTB 1 311751979 6RTB 2 211591686 4RTB 3 1415661098 29RTB 4 2015011243 41RTB 5 212751015 5RTB 6 510551815 10RTB 7 212231418 4

Table 4 Distribution of Bid Requests per Device Type

Type Volume Percentage ()

Phone 29213581772 91Tablet 3010161334 9

Table 5 Distribution of Bid Requests per OperatingSystem

Type Volume Percentage ()

Android 12913151617 4001iOS 19219231648 5908Windows 5311529 002

device for example when the device is using differ-ent particular apps As described above an impor-tant task in online advertising is to be able to targetadvertisements to different instances of the same indi-vidual with reasonable probability regardless of thecomplexities of identification in the baroque onlineadvertising ecosystem Figure 5 visualizes a sample oflatitudes and longitudes broadcast by mobile devicesin our data set

52 SamplingFor the results in this paper we apply the GSN anal-ysis to data samples Each sample corresponds to acohortneighborhood and is built as follows A sin-gle (random) ldquoegordquo user is chosen user X aroundwhich a neighborhood is built Every neighbor of theego user must share at least one IP address so webegin by obtaining all (anonymized) IP addresses vis-ited by user X Next all other users that visited atleast one of these IP addresses (the neighbors) areadded The ego must have at least one neighbor to beincluded in this analysis Finally we obtain the com-plete location distribution of each of these neighborsWhen the next sample is chosen we ensure that noneof the users already included in the previous samplesis taken as the ego user (X) As such we ensure thatwe do not have repeated egondashneighbor dyads Theremay be shared neighbors of two different egos buttwo egos will never be immediate neighbors in thedata sample

The result of the sampling over 500 cohorts is summa-rized in Figure 6 on average an ego has 71 neighbors

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers252 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

Figure 5 Scatter Plots of Samples of Latitudes and Longitudes Broadcast by Mobile Devices

ndash130 ndash120 ndash110 ndash100 ndash90 ndash80 ndash70 ndash6020

25

30

35

40

45

50

55(a)

Long

Lat

ndash20 ndash10 0 10 20 30 40 50

Long

30

35

40

45

50

55

65

60

75

70

Lat

ndash200 ndash150 ndash100 ndash50 1000 50 150 200

Long

ndash100

ndash80

ndash60

ndash40

ndash20

0

40

20

100

60

80

Lat

4068ndash7406 ndash7404 ndash7402 ndash7400 ndash7394ndash7398 ndash7396 ndash7392 ndash7390

Long

4070

4072

4074

4076

4080

4078

4086

4082

4084

Lat

(b)

(c) (d)

Note Note that locations are plotted without reference to any map The figure gives a striking picture of population density across the world and makes uswonder what is going on with mobile devices in Antarctica

Figure 6 (Color online) The Distribution of the Number of Neighborsfor 500 Randomly Sampled Ego Users

0 500 1000 1500 2000 2500 3000 35000

50

100

150

200

250

300

350

400

450

No neighbors

Fre

quen

cy

the median being 1 The maximum number of neigh-bors is 3167 this user probably visited a publicWi-Fi that thousands of other users also logged intoWe chose 500 cohorts as we observe that the statis-

tics become quite stable as seen in Figure 7 Even200 cohorts seems to be enough to obtain stablesample statistics

The life span of user instances in this set of500 cohorts is shown by the cumulative frequency inFigure 8 Observe that only about 40 of the userinstances are seen at different hours About 30 ofthe user instances are seen only once in the sampledweek This illustrates that it is vital to understand thenotion of user instances These percentages are mis-leading if we interpret the user instances as individu-als or devices One individualdevice without stableidentification in the online advertising ecosystem willcreate many transient user instances which will driveup the percentage of user instances that are seen onlyonce or that are seen only within a short timeframeWe have not figured out how to determine or esti-mate reliably how many ldquounstablerdquo usersdevices arerepresented by the user instances in these data Wekeep all of the data for the analyses below (rather thanfiltering very short-lived user instances) to maintainrealism

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 253

Figure 7 Evolution of Several Sample Statistics as an Increasing Number of Cohorts are Created

0 100 200 300 400 5000

10

20

30

40

50

60

70

Average No network neighbors

0 100 200 300 400 5000

2

4

6

8

10

12

Median No network neighbors

0 100 200 300 400 5000

50

100

150

200

250

300

350

Average No IPs

0 100 200 300 400 5000

2

4

6

8Median No IPs

Notes The horizontal axis denotes the number of cohorts By 500 cohorts the statistics have become stable

Figure 8 (Color online) Cumulative Frequency of Life Span ofUser Instances

100 101 102 103 104 105 1060

01

02

03

04

05

06

07

08

09

10

x = lifespan (sec)

F(x

)

End of June (mobile only)

larr 1 min larr 1 hour larr 1 day

Cumulative frequency of user instances

6 Study 1 Connecting Instances ofthe Same Individual

Returning to our motivating application as men-tioned previously an important task in mobile adver-tising is to be able to target a particular individual inthe face of the individualrsquos fragmentation into differ-ent user instances (for example as seen through dif-ferent bidding systems) Advertisers understand that

it may be impossible to find the same user with 100accuracy Nonetheless it often is important for themto reach the userrsquos various instances without target-ing too many others9 The GSN is designed in partbased on the hypothesis that different instances of thesame person would visit similar locations and there-fore they will be connected in the GSN In this sectionwe present results examining whether and to whatextent that is indeed the case Furthermore the resultsquantify to what extent the same user receives a highGSN-similarity to herself as compared to her otherneighbors

Three experimental settings are used to assess judg-ments that two users are the same The first two sim-ulate different user instances by splitting up the bidrequests of known user instances The last one is notsimulated it links users via the recently introducedidentifier for advertising (IFA) observed over severalRTB exchanges for a subset of device instances

1 Random We divide the IP address visits of a userrandomly across two simulated users Table 6 showsfor example how we can divide a userrsquos IP visits across

9 The astute reader may notice that if the other users targeted arevery similar to the target user the advertising may be well placeddespite not being shown to the exact target user That is not thesubject of this study Study 2 shows evidence that these two notionsare not independent when using the GSN

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers254 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

Table 6 Randomly Splitting User X rsquos IP VisitsCreating Two New Artificial Users X1 and X2

IP1 IP2 IP3

X 1 5 0X1 0 3 0X2 1 2 0

Table 7 The Volume of IFA Occurrence Within OurData Set

Volume

RTB 1 119401855RTB 2 113831962RTB 3 7891292

Table 8 The Frequency of Unique IFAs Visible AcrossRTB Exchanges

Frequency

RTB 1RTB 2 3061594 (1016)RTB 2RTB 3 451256 (169)RTB 3RTB 1 181417 (085)

two new users X1 and X2 for which we know thatthey are actually the same individual Results withRandom should be used only as a ceiling on perfor-mance because the results will be overly optimistic

2 Temporal A more realistic split-up of the trans-actions is a temporal split-up where first the middleday of Xrsquos transactions is determined (hence dynam-ically chosen for each user) All IP address visits upto this middle day are assigned to X1 and all trans-actions after the middle day are assigned to X2 Itwill only be applied to users who are seen on multi-ple days

3 Identifier for advertising IFA is an advertisingidentifier from Apple that users can change or asknot to be used for advertising and is seen as aprivacy-friendlier version of the controversial fixeddevice ID (the ldquoUDIDrdquo) The IFA is observed in threeexchanges with volumes shown in Table 7 for exam-ple 1940855 unique IFAs are seen in RTB 110 Whenwe observe the same IFA across different exchangeswe now know that two devices in different exchangesactually are the same The frequency of unique IFAsvisible across exchanges is shown in Table 8 For exam-ple a total of 3018223 IFAs are seen when we look atboth RTB 1 and RTB 2 with 306594 of these (1016)seen on both exchange 1 and exchange 2 What ifthere were not an IFA and the device would havehad a different identifier in each exchange To what

10 For confidentiality RTB 1 in this setting is not necessarily thesame as RTB 1 in sectsect4 or 7

extent are these two device instances then connectedin the GSN

With these different ldquogold standardrdquo methods toindicate whether or not two users are the samewe assess the ability of the geosimilarity metricsto link the same user determining how often twoinstances of the same person are connected in theGSN (sect61) Afterward we shall assess the differentGSN strength metricsrsquo ability to rank users determin-ing to what extent two instances of the same personexhibit strong geosimilarity as compared to the otherneighbors (sect62) For the random and temporal split-ups 200 samples are used For the IFA-based methodthe complete population is used for measuring thedegree of connectedness to the same user to analyzethe relative strength of the connected users 200 sam-ples are used

61 Same Individual Different ldquoScreensrdquoWe now present results on the GSN connectednessmeasured as how often two instances of the same userare connected in the GSN (in percentage) Note thatthe results summarized in Table 9 should be inter-preted in light of the limited number of days in build-ing the research data set which will limit the GSNconnectivity Nonetheless the relatively high connec-tivities observed even with this sample provide quitepromising results which lends considerable supportto the merit of the overall design

Random0 Using the random split two users thatcorrespond to the same individual are not connectedin only 9 of the neighborhoods Moreover in thesecases user X (whose IP visits are split up) usuallyhas only one or two entries and the random split-upresults in no visits for one of the two simulated users

Temporal0 The temporal split-up is more realisticthan the random one where we assume that a personuses simulated device 1 in the first half of the week(eg during the weekend) and simulated device 2 inthe second half of the week (eg during the week)

Users X1 and X2 are not connected in 32 of theneighborhoods Whereas the random split-up was toooptimistic the temporal one is too pessimistic as inreality there likely is some overlap in the time peri-ods of use of two different devices If user X visitsan IP address only in the first days of the week this

Table 9 To What Extent is the Same User Connectedto Herself

connected

Random 91Temporal 68IFAmdashRTB 1RTB 2 67IFAmdashRTB 2RTB 3 82IFAmdashRTB 3RTB 1 73

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 255

IP address will not occur in user X2rsquos transactions andreduces the chance to be connected to X1

IFA0 We look at all combinations of the three ex-changes with IFA identifiers For example there are45256 unique IFAs seen on both RTB 2 and RTB 3 In82 of these the two IFAs visit at least one same IP onboth exchanges and hence are connected in our GSN

The bottom line is that even with this limited slice ofonline behavior different instances of the same indi-vidual are connected in the GSN 70ndash80 of the timeEncouragingly the results from the simulated scenar-ios concur with the results from the real (IFA) sce-narios (including the aforementioned optimism andslight pessimism of the two simulated scenarios)Consistency over these different settings gives addi-tional confidence in the results suggesting that theGSN indeed holds promise for a high degree of suc-cess at targeting the same user across the fragmenta-tion of the online advertising ecosystem

62 Ranking User InstancesNext given that two user instances of the same indi-vidual are indeed connected we assess to what extentthe second instance of the same user is ranked highlyamong its network neighbors based on the weightof their linkage in the GSN We first will discuss themost general results using the best ranking methodand then will look across different ranking methods

We decompose the results based on the numberof network neighbors (NNs) since there are twoboundary cases that require special interpretationTables 10ndash14 show the proportions of the ego net-works that fall into three scenarios First a user mayhave only one NN In this scenario the ranking isperfect (trivially)mdashthe other instance of the same useris the only user instance given a nonzero score As

Table 10 In What Portion of Cases Is the Same UserConnected Depending on the Number ofNetwork NeighborsmdashRandom Split-up

Ego has only 1 NN 3104Ego has exactly 2 NNs 1905

Same user ranked first 8005Ego has more than 2 NNs 4901

Note For the scenario with two NNs the table reports thepercentage of cases where the other known instance of thesame user is ranked first (including ties)

Table 11 In What Portion of Cases Is the Same UserConnected Depending on the Number ofNetwork NeighborsmdashTemporal Split-up

Ego has only 1 NN 3008Ego has exactly 2 NNs 1807

Same user ranked first 8309Ego has more than 2 NNs 5005

Table 12 In What Portion of Cases Is the Same UserConnected Depending on the Number ofNetwork NeighborsmdashIFA RTB 1RTB 2Split-up

Ego has only 1 NN 4401Ego has exactly 2 NNs 1201

Same user ranked first 8201Ego has more than 2 NNs 4400

Table 13 In What Portion of Cases Is the Same UserConnected Depending on the Number ofNetwork NeighborsmdashIFA RTB 2RTB 3Split-up

Ego has only 1 NN 6109Ego has exactly 2 NNs 1907

Same user ranked first 7102Ego has more than 2 NNs 1804

Table 14 In What Portion of Cases Is the Same UserConnected Depending on the Number ofNetwork NeighborsmdashIFA RTB 3RTB 1Split-up

Ego has only 1 NN 5305Ego has exactly 2 NNs 2501

Same user ranked first 6304Ego has more than 2 NNs 2104

shown in the tables this scenario accounts for 30ndash60 of the ego networks (depending on the split-upmethod)

The second boundary case is when the ego has ex-actly two NNs This scenario accounts for 10ndash25of the cases depending on the split-up method Herethe tables also report whether the other known userinstance is ranked first (including ties) among thetwo neighbors The other known instance of the sameuser is ranked first 60ndash80 of the time dependingon the split-up scenario The IFA results are affectedstrongly by the particular RTB settingmdashthis may indi-cate that the publishers on the different RTBs have dif-ferent policies for passing the IFA Importantly theseresults are conservativemdashpossibly very conservativeThe other NNs may also be instances of the same userunbeknownst to usmdashfor example in the IFA casebecause an app does not pass the IFA to the RTB orbecause the other user is the same user on a differentdevice and thus does not share the IFA

The final scenario is when more than two NNsare present As shown in the tables this accounts for20ndash50 of the cases depending on the split-up sce-nario In this case we measure the percentile in theranking where the instance of the known same userfalls (see below) as well as the area under the receiveroperating characteristic curve (AUC) (Fawcett 2006)which is equivalent to the MannndashWhitneyndashWilcoxon(MWW) statistic The AUCMWW measures to what

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers256 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

Table 15 When There Are More Than Two Network Neighbors to What Extent Is the Known Same User StronglyConnected to Itself

AUC (if gt 2 NNs) () Average rank percentile (if gt 2 NNs) ()

Ranking metric R T RTB 12 RTB 23 RTB 31 R T RTB 12 RTB 23 RTB 31

1 78 64 68 59 55 29 48 34 43 542 59 58 67 58 54 45 53 34 44 553 79 69 74 61 61 29 44 29 40 484 86 75 69 61 59 24 40 32 41 505 83 70 71 59 63 26 44 31 43 476 85 74 71 60 65 25 41 31 42 457 86 76 73 62 64 24 40 29 40 468 83 70 72 59 65 26 44 30 43 459 85 74 71 60 67 25 41 30 42 44

10 79 69 69 61 63 29 44 32 41 4711 79 69 71 62 64 44 44 30 41 4612 81 68 68 60 62 28 46 34 42 4913 81 70 71 61 64 27 44 30 40 4614 74 67 66 61 48 34 47 35 41 6015 74 67 66 61 48 34 47 35 41 60

Note See the text for a discussion of ranking metrics

extent data points with label 1 are ranked higherthan those with label 0 In the results that follow(see Table 15) the AUC is measured for each samplewhere any user instance among the neighbors thatcorresponds to the same individual is labeled as 1and all others are labeled as 0 Thus in this settingthe AUC measures how well the neighbors are rankedby the likelihood of being the same individual11 Theaverage over all samples is reported If for a sampleall neighbors have the same score that AUC is setto 05 In addition to the factors noted above this alsowill tend to make the results conservative For exam-ple if the GSN links an ego user to a set of neighborsthat are all instances of the same individual in thisevaluation the AUC will be 05

The ranking measure is the percentile in the rank-ing at which the known-same-user instance is rankedwith lower values being better For example if thesame-user instance has the highest score (closest con-nection) among five neighbors the rank is 20 (15)Note again that this is quite conservative in thisexample the user is ranked at the top of the list butbecause there were only five neighbors the highestpossible score is 20 In case the known-same-userinstance has the same score as another neighbor theaverage rank is reported

As shown in Table 15 all AUC values exceed 05ranging up to 08 This means that in every casea known-same-user instance is likely to be rankedhigher than an instance not known to be the same

11 Technically in this setting the AUCMWW is the probability thata randomly selected neighbor that is in fact a known instance of thesame user is ranked more highly than a randomly selected neighborthat is not known to be an instance of the same user If the known-same-user instances are all ranked above the other-user instancesthe AUC = 100 If they are all below then the AUC = 0

user The frequency-based ranking metrics (4ndash9) per-form quite well in connecting instances of the sameuser strongly

Almost all metrics perform better than a randommodel with performances that even come close to thebest possible solution Consistently performing quitewell are the frequency-based metrics (metrics 3ndash9)with metric 7 (the minimum of the IDF weights ofthe shared locations) getting the most wins Not per-forming well is the very basic count metric as well asthe Boolean metric that provides a binary score onlyInteresting to note the Jaccard metrics do not performwell either It seems that penalizing users that visitmany IP addresses negatively affects the results anissue we will return to in the next study

7 Study 2 Does the GSN Select Userswith Similar Interests

The foregoing section addressed our paperrsquos firstclaimmdashthat the GSN design indeed links instances ofthe same user and links them more strongly thaninstances of other users In this section we turn to thepaperrsquos second claimmdashthat the GSN links users withsimilar interests We will measure similar interests bysimilarity in behavior accessing particular publisherson the Web and using mobile apps12

71 Connecting Users with Similar Interests

To What Extent Are the GSN Neighbors of Visi-tors to Particular Publishers Also Visitors to ThoseSame Publishers To assess whether the GSN con-nects users with similar interestsbehavior we select

12 From now on all user instances will be complete real user in-stances the simulated split-up scenarios (Random Temporal) werecreated specifically for the purpose of the same-user study

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 257

Table 16 Distribution of Publishers per RTB in Terms of thePercentage of Users in Our Data Set That Visit the GivenPublisher

Publisher of D

RTB 1Burstly 3004myYearbookcom 1062My Fitness Pal 0065TextMe Inc 0056GameResort 0025Flixster 0019247Sports 0015Daily Workout Apps LLC 0009Mobile Deluxe 0007YoYo Games Ltd 0004

RTB 2mworldnowcom 0032mtmzcom 0024mtopixcom 0014babycentercom 0015meetmecom 0006appsfacebookcom 0006itunesapplecom 0039appevitecom 0002beautyandskincarescom 0003mmajunkiecom 0001

RTB 3Conversion Exchange 5053Pinger Inc 2096Top Game Developer 4029PremiumEntertainmentApp-FamilyOriented-Android 0059Enflick 0053Clapfoot Games 0044FunPokes Inc 0018Talkatone 0015Sevenlogics Inc 0008A Star Software 0007

RTB 4Pinger PhonemdashiOSmdashConversation 1001Talkatone iPhone App 0007DiceWithBuddies 00139GAG Reader 0012Video Downloader Pro Lite 0016Rage Comics 0009Spades 3D Lite 0006Apalabrados 0003Relax Melodies HD 0002Celeb MemdashPhotoMaker 0000

a diverse collection of 40 publishers listed in Table 16A publisher can correspond to a mobile app a web-site or a set of appswebsites We selected both pub-lishers that are visited by many users and more nichepublishers with fewer visitors As can be seen fromTable 16 several types of publishers can be distin-guished the main groups are games (eg GameRe-sort Top Game Developer) and social networkingtools (eg appsfacebookcom Text Me Inc)

More specifically we create 200 samples for 40 pub-lishers 10 publishers for each of four different RTB

exchanges listed in Table 1613 The percentage ofunique users that visit each publisher in our completedata set is also shown (D) To create each samplean ldquoegordquo user is selected that has visited the pub-lisher in question (an ldquoego visitorrdquo) this ego visi-torrsquos GSN neighborhood is created Next we assesshow many of the egorsquos neighbors also visited thatsame publisher This will be compared with a base-line visit rate to produce lift and leverage values(Provost and Fawcett 2013)mdashspecifically how muchmore likely is it for the neighbors of ego visitors tovisit the publisher than would be expected by chanceWe consider three different setups that produce dif-ferent baselines

1 Considering all RTB exchanges to determine the base-line A user is considered a visitor for a given pub-lisher if she visits the publisher on any RTB ex-change (since some publishers are visible on differentexchanges) Hence the baseline visit rate is the num-ber of users (on any RTB) who visit that publisherdivided by the total number of users across all RTBexchanges This will provide optimistic lifts sincemost publishers appear only on one RTB exchangewhereas the baseline is measured over all exchanges

2 Considering one RTB exchange only to determine thebaseline A user is considered a visitor for a given pub-lisher and RTB if she visits the publisher on the sameRTB exchange This changes the baseline to the num-ber of users (on this RTB) who visit that publisherdivided by the total number of users on this RTBexchange The lift and leverage results will be worsethan in the first setting because the baseline will belarger Seeing that some publishers are seen on differ-ent RTB exchanges this setting is slightly pessimistic

3 Taking time of location visits into account In thethird setting we link users only if they visit an IPwithin the same time period This will be elaboratedon in sect73

Results are shown in Tables 17 and 18 The percent-age of publisher visitors in the neighborhood is givenas N This percentage is compared with the baselinepercentage of users that visit that publisher in thecomplete data D using two metrics the lift is the ratioof these numbers the leverage is the difference Liftis particularly useful for very small base rates (con-sider that an activity with a base rate of 30 cannotpossibly have a lift higher than about 3) Leverage ismore telling for larger base rates because it shows theabsolute improvement (here in percentage points)

When considering all NNs the lifts for all publish-ers are greater than 1 except for one Burstly on RTB 1has a lift of 097 (essentially no lift) but only when weconsider RTB 1 as the baseline This publisher already

13 For confidentiality the RTB numbering here is again differentfrom above

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers258 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

Table 17 RTBs 1 and 2 Percentage of Network Neighbors That Also Visit the Publisher

RTB 1

All NNs N1 N10

D N Lift Lev N1 Lift Lev N10 Lift Lev

Considering all RTB exchanges to determine baseline1 Burstly 3004 8063 2084 5059 47022 15054 44018 25061 8043 220572 myYearbookcom 1062 25077 15094 24015 44074 27067 43012 35016 21074 330543 My Fitness Pal 0065 4015 6038 305 3609 56077 36025 15055 23091 14094 TextMe Inc 0056 1805 33006 17094 39002 69075 38046 25018 45 240625 GameResort 0025 4093 19077 4068 18052 74026 18027 11086 47058 110626 Flixster 0019 1055 8015 1036 8014 42085 7095 1079 9042 1067 247Sports 0015 27057 188058 27042 77062 530093 77048 74053 509075 740388 Daily Workout Apps LLC 0009 1092 21047 1083 19023 215046 19014 11054 129027 110459 Mobile Deluxe 0007 1087 28067 108 23053 36009 23046 9063 14707 9056

10 YoYo Games Ltd 0004 3052 99058 3049 21057 609064 21053 13 367045 12096Considering this RTB exchange only to determine baseline

1 Burstly 8085 8063 0097 minus0022 47022 5033 38037 25061 2089 160762 myYearbookcom 4071 25077 5047 21006 44074 905 40003 35016 7046 300453 My Fitness Pal 1089 4015 2019 2025 3609 19048 35001 15055 8021 130654 TextMe Inc 1063 1805 11035 16087 39002 23094 37039 25018 15045 230555 GameResort 0073 4093 6078 402 18052 25049 17079 11086 16033 110146 Flixster 0055 1055 208 0099 8014 14071 7059 1079 3023 10247 247Sports 0043 27057 64073 27014 77062 182023 7702 74053 174097 74018 Daily Workout Apps LLC 0026 1092 7037 1066 19023 73095 18097 11054 44037 110289 Mobile Deluxe 0019 1087 9084 1068 23053 123088 23034 9063 5007 9044

10 YoYo Games Ltd 001 3052 34018 3042 21057 209025 21047 13 126012 1209

RTB 2

Considering all RTB exchanges to determine baseline1 mworldnowcom 0032 13027 40086 12095 54039 167042 54006 31065 97044 310332 mtmzcom 0024 6089 28086 6065 23008 9607 22084 10068 44075 100443 mtopixcom 0014 15007 104079 14093 53045 37106 5303 33075 234065 330614 babycentercom 0015 6087 47031 6072 43048 299056 43033 17074 122024 17065 meetmecom 0006 6002 10308 5097 54072 942093 54066 27085 47909 270796 appsfacebookcom 0006 301 48039 3003 9009 141097 9003 2076 43008 20697 itunesapplecom 0039 25077 65071 25038 51043 131012 51004 29049 75018 290098 appevitecom 0002 48028 21212043 48025 54076 21509068 54074 48028 21212043 480259 beautyandskincarescom 0003 18019 642087 18016 8048 299094 8046 1107 413049 11067

10 mmajunkiecom 0001 1036 201064 1035 26067 3195505 26066 13033 11977074 13033Considering this RTB exchange only to determine baseline

1 mworldnowcom 4034 13027 3006 8093 54039 12053 50005 31065 7029 270312 mtmzcom 3019 6089 2016 307 23008 7024 19089 10068 3035 70493 mtopixcom 1092 15007 7084 13015 53045 27081 51053 33075 17056 310834 babycentercom 1094 6087 3054 4093 43048 22042 41054 17074 9015 15085 meetmecom 0078 6002 7077 5025 54072 70057 53094 27085 35092 270076 appsfacebookcom 0086 301 3062 2024 9009 10063 8024 2076 3022 1097 itunesapplecom 5024 25077 4092 20053 51043 9081 46019 29049 5063 240258 appevitecom 0029 48028 165058 47098 54076 187082 54047 48028 165058 470989 beautyandskincarescom 0038 18019 48011 17081 8048 22045 8011 1107 30095 11032

10 mmajunkiecom 0009 1036 15009 1027 26067 296003 26058 13033 148001 13024

Notes Results are given for the complete data set (D) the complete neighborhood of a user who visited the given publisher (All NNs) and considering theneighborhood of closest 1 and 10 users (N1 and N10)

has a very high baseline and as we will see in thenext section when we consider only the top 1 and 10NNs the lifts go up to 5 and 3 respectively Burstlyprovides a generic platform for building apps and sois a weaker indication of user interest than many ofthe other publishers (fitness babies games)

All other lifts exceed one by substantial mar-gins and go up to even 2391 (Apalabrados onRTB 4 a Spanish-language social Scrabble-like game)

Obviously the conclusion that the NNs indeed aremore likely to visit the same publishers is highly sig-nificant by a simple sign test comparing the visita-tion rate within the geosimilarity neighborhood tothat of overall population either via lift or leveragep lt 0001 Thus NNs of publisher visitors are indeedmore likely also to visit the same websites the abso-lute values of the lifts and leverages suggest that theyare substantially more likely

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 259

Table 18 RTBs 3 and 4 Percentage of Network Neighbors That Also Visit the Publisher

RTB 3

All NNs N1 N10

D N Lift Lev N1 Lift Lev N10 Lift Lev

Considering all RTB exchanges to determine baseline1 Conversion Exchange 5053 34054 6024 29001 64071 11069 59017 50077 9018 450242 Pinger Inc 2096 46002 15053 43006 65012 21097 62015 51048 17037 480523 Top Game Developer 4029 47024 11002 42095 60 13099 55071 48097 11042 440684 PremiumEntertainmentApp-FamilyOriented-Android 0059 23079 40012 2302 20045 34049 19086 22025 37052 210665 Enflick 0053 11071 21095 11017 21057 40044 21004 12027 23 110736 Clapfoot Games 0044 8055 19032 8011 11054 26006 1101 13099 31061 130557 FunPokes Inc 0018 11011 61028 10093 50 275077 49082 15079 87008 150618 Talkatone 0015 13013 88051 12098 31011 209076 30096 17091 120078 170779 Sevenlogics Inc 0008 2071 32059 2063 5088 70074 508 8033 100022 8025

10 A Star Software 0007 3033 44086 3025 23008 311024 23 9009 122061 9002Considering this RTB exchange only to determine baseline

1 Conversion Exchange 11045 34054 3002 23009 64071 5065 53025 50077 4043 390322 Pinger Inc 6013 46002 705 39089 65012 10062 58098 51048 8039 450353 Top Game Developer 8088 47024 5032 38036 60 6076 51012 48097 5052 400094 PremiumEntertainmentApp-FamilyOriented-Android 1023 23079 19038 22057 20045 16066 19023 22025 18013 210025 Enflick 101 11071 1006 1006 21057 19054 20046 12027 11011 110166 Clapfoot Games 0092 8055 9033 7064 11054 12059 10062 13099 15027 130087 FunPokes Inc 0038 11011 2906 10074 50 133022 49062 15079 42007 150418 Talkatone 0031 13013 42076 12082 31011 101034 3008 17091 58035 170619 Sevenlogics Inc 0017 2071 15074 2054 5088 34017 5071 8033 48041 8016

10 A Star Software 0015 3033 21067 3017 23008 150036 22092 9009 59023 8094

RTB 4

Considering all RTB exchanges to determine baseline1 Pinger PhonemdashiOSmdashConversation 1001 71081 70085 7008 80077 79068 79076 70014 6902 690132 Talkatone iPhone App 0007 11038 17203 11031 29003 439068 28097 17002 257078 160963 DiceWithBuddies 0013 7041 59004 7028 18018 144092 18006 7069 61031 70574 9GAG Reader 0012 62013 497099 62 49009 39305 48097 49005 393017 480925 Video Downloader Pro Lite 0016 20003 123015 19087 22058 138084 22042 17093 110027 170776 Rage Comics 0009 609 8001 6081 13033 154086 13025 609 8001 60817 Spades 3D Lite 0006 1205 225076 12044 27027 492057 27022 1205 225076 120448 Apalabrados 0003 77042 21391071 77039 75 21316097 74097 79032 21450043 790299 Relax Melodies HD 0002 31058 21002029 31056 50 31170029 49098 31058 21002029 31056

10 Celeb MemdashPhotoMaker 0 3023 82107 3022 7014 11819048 7014 3023 82107 3022Considering this RTB exchange only to determine baseline

1 Pinger PhonemdashiOSmdashConversation 8051 71081 8044 6303 80077 9049 72026 70014 8024 610632 Talkatone iPhone App 0055 11038 20053 10082 29003 52038 28048 17002 30071 160473 DiceWithBuddies 0009 7041 82011 7032 18018 201054 18009 7069 85027 7064 9GAG Reader 0097 62013 63097 61016 49009 50055 48012 49005 50051 480085 Video Downloader Pro Lite 0044 20003 45036 19059 22058 51014 22014 17093 40062 170496 Rage Comics 0037 609 18079 6053 13033 36032 12097 609 18079 60537 Spades 3D Lite 0046 1205 2609 12004 27027 58068 26081 1205 2609 120048 Apalabrados 0021 77042 371093 77021 75 360031 74079 79032 381006 790119 Relax Melodies HD 0013 31058 238054 31045 50 377069 49087 31058 238054 31045

10 Celeb MemdashPhotoMaker 0003 3023 97089 3019 7014 216076 7011 3023 97089 3019

Notes Results are given for the complete data set (D) the complete neighborhood of a user who visited the given publisher (All NNs) and considering theneighborhood of closest 1 and 10 users (N1 and N10)

The very high lifts could be attributed to two phe-nomena They may provide some evidence that theGSN embeds a true social network as pointed out bythe good results for the social networking publishers(eg Apalabrados with a lift of 2391 appsevitecomwith a lift of 2212 and others) Friends visiting thesame social networking sites (here unusual ones)

would increase the lift and leverage over just havingdifferent instances of the same user visiting the samewebsites An alternative explanation is that some ofthese publishers are much more likely to be visitedacross different devices of the same user It is not clearwhether this is actually true However it argues forobtaining specific and broad data on which devices

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers260 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

are not the same user to shed further light on forexample any broad data that include a (anonymized)user identifier

From these results we can conclude that GSNneighbors indeed exhibit substantially similar pub-lisher visitation behavior Next we assess to whatextent more strongly connected NNs are more similarthan less strongly connected NNs

72 Ranking Users with Similar Interests

Are the More-Similar Geosimilarity Neighbors of aPublisherrsquos Visitors Even More Likely Also to Be Vis-itors to the Same Publisher To assess whether more-similar geosimilarity neighbors are even more likelyto share interests we compute the measures exactlyas in the previous study except instead of consider-ing all NNs only the 1 (N1) and 10 (N10) neighborswith the highest scores are considered (see Tables 17and 18) First we must consider which of the 15 met-rics that are defined to measure link strength to useTable 19 reports the rank aggregations of the differentmetrics as to which provide the best lifts Specificallyfor each of the 40 publishers each scoring metric pro-vides a lift and for that publisher the scoring metricscan be ranked (the best being ranked 1 etc) The rankaggregation is the average rank for a scoring metricacross all 40 publishers Thus if one metric providedthe best lift for all publishers it would get an aggre-gated rank of 1

To visualize the best metrics the metrics with anaverage ranking less than 8 are shown in boldfacein Table 19 The Jaccard metrics do not perform wellin this study either showing that penalizing the con-nection to users that visit many locations is not sen-sible Remember that the Jaccard metric is based onthe idea that users that visit many locations will havea higher chance to share some location with anotheruser However we have so many locations in totalthat visiting a couple more locations will increase onlymarginally this probability of sharing a location bychance A user that logs into more IP addresses andis more active should hence not be penalized andranked lower than other less active users Metric 3which sums the IDF scores of the shared locationsperforms quite well Given the operational efficiencyadvantage over the frequency-based metrics this met-ric is chosen to measure the strength of the links

As seen from the results in Tables 17 and 18 thelift and leverage values for the strongly connected (top 1and top 10) NNs are even higher sometimes astro-nomical (3170 for Relax Melodies HD (RTB 4)) Theseresults are highly significant by sign tests compar-ing either the lift or the leverage between the top-ranked neighbors and the entire cohort (p lt 0001) Theabsolute values again indicate that the users with the

strongest geosimilarity are often substantially morelikely to visit the same publisher

Thus we can conclude that not only does geosim-ilarity find users with similar interests it also ranksusers well by their likelihood of having similarinterests

An interesting follow-up question is whether thecharacteristics of a publisher are of importance for theresults For example are GSN neighbors more sim-ilar in terms of using the same social network appas compared to playing the same games To answerthis question the publishers are categorized into sixclasses funny social news games communicationand miscellaneous In Figure 9 the lifts of the pub-lishers are shown per category Some general trendscan be observed where publishers related to funnycontent have the highest median lift This could beexplained by the fact that users often forward funnycontent to their friends The social websitesapps fol-low next which seems to further demonstrate thatfriends are likely linked in the GSN However pub-lishers in the communication category perform not sowell compared with the other categories and ratherhigh variances are observed in the different cate-gories Please note that high variance is observedacross the categories therefore it is difficult to makewell-supported claims from these results

73 TimingAs a final evaluation we explore whether the inclu-sion of temporal similarity in location visitation ishelpful for defining GSN connections To this end weintroduce a setting where two users are connectedonly if they visit an IP address in the same timeperiod of a day Specifically we divide a day intodifferent time windowsmdashfor example two time peri-ods of 12 hoursmdashwhere users are connected only ifthey visit the same IP address in the same time periodof a day (which might be on different days) Practi-cally two digits are added to an IP which denote thestarting moment of the corresponding time periodWe report on time periods of 2 hours and of 12 hours(other periods yield similar results) We also includethe setting where users are connected only when theyvisit the same IP address on the same day in light ofthe result of Crandall et al (2010) that visiting thesame location at around the same time is indicativeof friendship

The resulting lifts are reported in Figures 10ndash12when considering respectively all NNs the top 1 NNand the top 10 NNs The results when not consider-ing the time period are repeated as well (All) lim-ited to the conservative case of only considering oneexchange (see above) The baseline (lift of one) isindicated with a red horizontal line lifts higher than1000 are shown as 1000 to limit the range to be able

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 261

Table 19 Rank Aggregation per Metric Over All Publishers (Across All RTBs)

Top 1 NN Top 10 NNs

1 Shared unique IPs 801 1 Shared Unique IPs 802252 BooleanmdashSharing an IP 80625 2 BooleanmdashSharing an IP 809133 Shared unique IPsmdashIDF weighted 70025 3 Shared unique IPsmdashIDF weighted 709754 Shared IP visitsmdashMin 70063 4 Shared IP visitsmdashMin 704135 Shared IP visitsmdashMax 7095 5 Shared IP visitsmdashMax 70256 Shared IP visitsmdashAvg 8005 6 Shared IP visitsmdashAvg 704887 Shared IP visitsmdashMinmdashIDF weighted 60625 7 Shared IP visitsmdashMinmdashIDF weighted 703758 Shared IP visitsmdashMaxmdashIDF weighted 70975 8 Shared IP visitsmdashMaxmdashIDF weighted 702259 Shared IP visitsmdashAvgmdashIDF weighted 70938 9 Shared IP visitsmdashAvgmdashIDF weighted 70525

10 Cosine similarity 80338 10 Cosine similarity 804511 Cosine similaritymdashIDF weighted 80213 11 Cosine similaritymdashIDF weighted 8086312 GSN Jaccard 80925 12 GSN Jaccard 913 GSN JaccardmdashIDF weighted 80325 13 GSN JaccardmdashIDF weighted 9013814 GSN Jaccard Freq 80425 14 GSN Jaccard Freq 80515 GSN Jaccard FreqmdashIDF weighted 80425 15 GSN Jaccard FreqmdashIDF weighted 805

to visualize the results Tables 20 and 21 show theaverage rank aggregation (see the discussion of rankaggregation above) in terms of lift per RTB and thep-values of a signed rank test respectively

When comparing the different temporal settingswe see that the ldquoDayrdquo variant performs best and lim-ited differences exist between the time periods of 2and 12 hours More important to observe from theseresults is that (overall) not including time in the GSNdesign outperforms all three temporal variants sub-stantially Only for RTB 3 and the Day version whereusers are connected only if they visit the same IPaddress on the same day are the lifts not signifi-cantly worse than having no time constraint at a 1level (although even here they are significantly worseat a 10 level) This shows that for finding userswith similar interests which locations they visit ismore important than when they visit them By includ-ing the time constraints connections between previ-ously connected users are lost simply because they

Figure 9 (Color online) Lifts of the Publishers (for All Four RTBs)Ranked According to the Median Lift Within the Category(Medians Shown by the Full Line)

visit locations in different time periods Although weuse time to define links metrics that include time todefine the strength of a link might improve the resultsfurther and we consider that an interesting issue forfuture research

8 ConclusionThis paper presents a new design for using geosim-ilarity to connect user instances in a geosimilarity-weighted network In the GSN users are connected ifthey share at least one observed location Various met-rics can be used to yield degrees of similarity We pre-sented intuitive arguments theory and prior researchthat suggest that geosimilarity and the GSN designshould link similar users We also provided strongempirical support Specifically Study 1 shows thatthe GSN indeed links different users instances cor-responding to the same individual and the geosimi-larity metrics rank user instances by their likelihoodof corresponding to the same individual Study 2shows that the GSN links users who have similarinterests as measured by their propensity to visitthe same publishersuse the same apps It also pro-vides strong evidence that the geosimilarity metricsrank user instances by their likelihood of havingsimilar interests Taken together the results providestrong support that the GSN links users with similarinterestsmdashin many cases because they actually repre-sent the same individuals

The main limitation of the research presented inthis paper is that the data are not sufficient to dis-tinguish whether the similarity in interests is solelybased on connecting instances of the same user Thisis an important avenue for future research This lim-itation notwithstanding the very strong results fromStudy 2 have important implications in either caseeither the network is indeed finding different indi-viduals with substantially similar interests or we

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers262 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

Figure 10 (Color online) Lifts Considering All Network Neighbors per RTB System (Each Panel) With and Without the Time Constraint Limited toOne Network (see sect71) for the 40 Publishers

RTB 3mdashAll NNs

0 5 10

Publisher

0

20

40

60

80

100

Lift

RTB 4mdashAll NNs

0 5 10

Publisher

0

200

400

600

800

1000

Lift

RTB 2mdashAll NNs

0 5 10

Publisher

0

200

400

600

800

1000

Lift

0 5 100

200

400

600

800

1000RTB 1mdashAll NNs

Publisher

Lift

All

TP2

TP12

Day

Note Lifts higher than 1000 are shown at 1000

Figure 11 (Color online) Lifts Considering the Closest 1 Network Neighbor per RTB System (Each Panel) With and Without the Time ConstraintLimited to One Network (see sect71) for the 40 Publishers

Lift

0

100

200

400

300

RTB 3mdashTop 1 NNs

0 5 10

Publisher

All

TP2

TP12

Day

0 5 100

200

400

600

800

1000RTB 1mdashTop 1 NNs

Publisher

Lift

RTB 2mdashTop 1 NNs

0 5 10

Publisher

0

200

400

600

800

1000

Lift

RTB 4mdashTop 1 NNs

0 5 10

Publisher

0

200

400

600

800

1000

Lift

Note Lifts higher than 1000 are shown at 1000

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 263

Figure 12 (Color online) Lifts Considering the Closest 10 Network Neighbors per RTB System (Each Panel) With and Without the Time ConstraintLimited to One Network (see sect71) for the 40 Publishers

RTB 2mdashTop 10 NNs

0 5 10

Publisher

0

200

400

600

800

1000

Lift

0 5 10

Publisher

Lift

0

50

100

150

0 5 10

Publisher

0

200

400

600

800

1000

Lift

0 5 100

200

400

600

800

1000

RTB 1mdashTop 10 NNs

RTB 3mdashTop 10 NNs RTB 4mdashTop 10 NNs

Publisher

Lift

All

TP2

TP12

Day

Note Lifts higher than 1000 are shown at 1000

have very strong additional support for the rankingresults of Study 1 as the neighbors with the strongestgeosimilarity have significantly higher affinity for thesame publishersapps

These results have broad and immediate manage-ment implications within our motivating applicationof mobile advertising As discussed above digitalmarketers would like to target the many fragmentedinstances of individuals in the digital advertising

Table 20 Rank Aggregations in Terms of Lift AveragedAcross All Publishers (per RTB)

All TP2 TP12 Day

RTB 1 101 209 309 201RTB 2 1 204 304 302RTB 3 101 3 4 109RTB 4 101 208 308 203

Note Linking based on time does not improve over linkingignoring time

Table 21 Evaluation of Including Time p-Values ofSigned Rank Test

All TP2 TP12 Day

RTB 1 mdash 0 0002 0 0002 0 0006RTB 2 mdash 0 0002 0 0002 0 0002RTB 3 mdash 0 0002 0 0002 00065RTB 4 mdash 0 0002 0 0002 0 0004

ecosystem as well as users who have similar intereststo particular ldquoseedrdquo users Location data such asIP addresses are readily available across the adver-tising ecosystem The most direct use of the GSN formobile ad targeting is to seed a targeting campaignwith users chosen to exhibit some characteristics ofinterest These could be chosen through any of themyriad of methods currently used in advertising TheGSN will directly allow the targeting of the othermembers of these usersrsquo geosimilarity cohorts whichwill include other instances of the same user othersimilar users and likely both So for example if a setof users has been identified to have brand affinity viavisits to a brandrsquos website the GSN cohorts of theseusers would provide an attractive avenue to expandthe reach of a campaign to target them (extendingtraditional retargeting) and also target users similarto the seeds (customer prospecting) In the researchdata set we see that different instances of the sameusers are very often connected in the GSN and thatthose connected in the GSN exhibit substantially sim-ilar interests

The GSN also could be used for (privacy-friendly)hyperlocal targetingmdashmeaning targeting people whofrequent a particular location without needing to storedata on the actual locations of the users as describedearlier IP addresses currently are used by somemarketers for coarse-grained demographic targetinginferring geography from IP address registration data

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers264 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

and then inferring demographics from the geographyThe GSN provides a complementary alternative it willconnect people who visit the same fine-grained loca-tions As described above in the couponing exampleif a campaign is seeded with customers of a localestablishment (eg via an online loyalty program)geosimilarity can target others who frequent the samelocations

An exciting potential future use of the GSN is theevaluation of marketing campaigns across differentchannels This has become an important challenge formobile advertizing (Lee 2014)14 For example I mightreceive an ad for a hot new product on my mobiledevice I may be interested in the product but Irsquomunlikely to buy this product on my mobile phoneRather Irsquoll buy the product using my laptop or PC Bylinking the screens of the same user we are now ableto provide a broader (and possibly much more robust)estimate of campaign success (eg conversion rate)across different channels by aggregating the successmetrics across GSN neighbors For example similar tohow a traditional ad campaign may runnot run anad in a controlled study across different cities agen-cies could targetnot target different (matched) usersand then look at the success rates in their respec-tive geosimilarity neighborhoods This should captureeffects of the same individual on different devices

As discussed at the outset advertisers need to becognizant of privacy concerns regarding the collec-tion storage and use of data This paper followswhat the FTC calls the ldquoprivacy by designrdquo approachBy design the technique does not need to store anydirect personal information about mobile users thereis no need for nonanonymized identifiers demo-graphics geographics psychographics etc In addi-tion the storage of indirect information about userscan be severely limited as well The method doesnot need the actual locationsmdashonly anonymized loca-tion ldquokeysrdquo So for example IP addresses can bereplaced with random numbers15 without affectingthe GSN performance (European Commission 2014Federal Trade Commission 2012) More technically atthe ldquoouter wallrdquo of the system or firm each deviceid can be irreversibly hashed to a random key Theonly requirement is that the same device be hashedto the same key if encountered again Similarly atthe ldquoouter wallrdquo of the system or firm every loca-tion also can be irreversibly hashed to a random keyThe GSN can be formed just the same with the ran-dom keys as with the actual locations If more privacy

14 httpwwwclickzcomclickznews233699680-of-marketers-will-run-cross-channel-marketing-campaigns-in-2014-study15 See both the FTCrsquos and the European Commissionrsquos 2010ndash2011privacy-by-design reports and the comments therein along withProvost et al (2009) for a thorough treatment of this topic

is desired hashing can be done irreversibly many toone and in that case it becomes impossible to asso-ciate definitively any particular location with any par-ticular deviceuser

Finally an aspect of this research that is importantfor managers but not typically covered in the predic-tive modeling literature is the notion of increasing thereach of a campaign For example as discussed abovepossibly the most straightforward use of the GSN isto select the same actor on different mobile devicesThis would allow us to expand the reach of a ldquoretar-getingrdquo campaign16

Increasing reach has important subtleties in the cur-rent advertising ecosystemmdashit is not just the otherside of the coin of increasing lift The reason is thatthere are many targeters in the onlinemobile adver-tising ecosystem all of whom are using the samedata retargeting data and demographicgeographicpsychographic data that are purchased from third-party data providers However these data are avail-able only on a subset of devices Thus all partieswho are using these data are competing for the samesometimes small set of devices Since large advertis-ers typically contract with multiple targeting firmswho take different strategies the effect is that theadvertiser is paying the firms to compete against eachother effectively raising the price in the auction andthereby raising their own cost of advertising17 Whatis more these will be the same devices that also willbe targeted for other campaigns because they are thedevices for which the targeters have data

However by connecting devices in a GSN we canexpand campaigns to devices for which there are noretargeting or third-party data available at all If thereis less competition for these devices we should beable to target them for a lower price Thus expand-ing reach has an implication for the cost-effectivenessof achieving a certain level of predictive performance(eg lift)

AcknowledgmentsThe authors thank the anonymous reviewers and edi-tors for very constructive suggestions which substantiallyimproved the paper This work was done while the firstauthor was at Coriolis Ventures and Everyscreen MediaThe authors extend our gratitude to Everyscreen MediaDstillery Lauren Moores Tom OrsquoReilly Vinayak Javaly andTina Eliassi-Rad for many discussions about mobile adver-tising and mobile ad targeting as well as technical assistance

16 Retargeting is to target browsers who have previously purchasedfrom the brand or who have taken some other indicative brandaction such as browsing the brandrsquos site Retargeting is consid-ered by many to be one of the most effective targeting strategies(although to our knowledge these conclusions are drawn based onassessing conversion rate rather than the influence of the advertis-ing please see Stitelman et al 2011)17 The latter is our conjecture

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 265

to obtain all of the relevant data This particular data set wasnot necessarily used in the development of any productionmodel used for mobile advertising The first author thanksNEC and Andre Meyer for faculty fellowships The authorsthank the Moore and Sloan Foundations for their generoussupport of the Moore-Sloan Data Science Environment atNew York University

ReferencesAgarwal R Gupta AK Kraut RE (2008) Editorial overviewmdashThe

interplay between digital and social networks Inform SystemsRes 19(3)243ndash252

Aral S Muchnik L Sundararajan A (2009) Distinguishing influence-based contagion from homophily-driven diffusion in dynamicnetworks Proc National Acad Sci USA 106(51)21544ndash21549

Bampo M Ewing MT Mather DR Stewart D Wallace M (2008)The effects of the social structure of digital networks on viralmarketing performance Inform Systems Res 19(3)273ndash290

Cho E Myers SA Leskovec J (2011) Friendship and mobility Usermovement in location-based social networks Proc 17th ACMSIGKDD Internat Conf Knowledge Discovery and Data MiningKDD rsquo11 (ACM New York) 1082ndash1090

Cortes C Pregibon D Volinsky CT (2001) Communities of interestHoffmann F Hand DJ Adams N Fisher D Guimaraes G edsAdvances in Intelligent Data Analysis Lecture Notes ComputSci Vol 2189 (Springer-Verlag Berlin Heidelberg) 105ndash114

Crandall DJ Backstrom L Cosley D Suri S Huttenlocher DKleinberg J (2010) Inferring social ties from geographic coinci-dences Proc National Acad Sci USA 107(52)22436ndash22441

de Montjoye YA Hidalgo CA Verleysen M Blondel VD (2013)Unique in the crowd The privacy bounds of human mobilitySci Rep 31ndash5

European Commission (2014) Data protection day 2014 Full speedon EU data protection reform Press release January 27httpeuropaeurapidpress-release_MEMO-14-60_enhtm

Fawcett T (2006) An introduction to ROC analysis Pattern Recogni-tion Lett 27(8)861ndash874

Federal Trade Commission (2012) Protecting consumer pri-vacy in an era of rapid change FTC Report March 2012httpswwwftcgovsitesdefaultfilesdocumentsreportsfederal-trade-commission-report-protecting-consumer-privacy-era-rapid-change-recommendations120326privacyreportpdf

Heidemann J Klier M Probst F (2010) Identifying key users inonline social networks A pagerank based approach InformSystems J 4801(December)157ndash160

Hill S Provost F (2003) The myth of the double-blind reviewAuthor identification using only citations SIGKDD Explorations5(2)179ndash184

Hill S Provost F Volinsky C (2006) Network-based marketingIdentifying likely adopters via consumer networks Statist Sci22(2)256ndash276

Hotho A Nuumlrnberger A Paass G (2005) A brief survey of text min-ing LDV Forum 20(1)19ndash62

Kerho S (2012) Mobile marketingmdashA new analytics frameworkwhat we have and what we need Presentation Marketing onthe Move Conference Wharton School University of PennsylvaniaPhiladelphia February 28

Kossinets G Watts DJ (2009) Origins of homophily in an evolvingsocial network Amer J Sociol 115(2)405ndash450

Lee J (2014) 80 of marketers will run cross-channel marketingcampaigns in 2014 ClickZ (March 31) httpwwwclickzcomclickznews233699680-of-marketers-will-run-cross-channel-marketing-campaigns-in-2014-study

Martens D Provost F (2011) Pseudo-social network targeting fromconsumer transaction data Working paper CEDER-11-05 SternSchool of Business New York University New York

McPherson M Smith-Lovin L Cook JM (2001) Birds of afeather Homophily in social networks Annual Rev Sociol 27415ndash444

Oinas-Kukkonen H Lyytinen K Yoo Y (2010) Social networks andinformation systems Ongoing and future research streamsJ Assoc Inform Systems 11(2)61ndash68

Pan W Aharony N Pentland A (2011) Composite social networkfor predicting mobile apps installation Burgard W Roth Deds Proc AAAI 2011 (AAAI Press San Francisco)

Perlich C Dalessandro B Raeder T Stitelman O Provost F (2014)Machine learning for targeted display advertising Transferlearning in action Machine Learn 95(1)103ndash127

Pogue D (2011) Wrapping up the apple location brouhaha NewYork Times (April 28) httppogueblogsnytimescom20110428wrapping-up-the-apple-location-brouhaha

Provost F Fawcett T (2013) Data Science for Business What You Needto Know About Data Mining and Data-analytic Thinking (OrsquoReillyMedia Inc Sebastopol CA)

Provost F Dalessandro B Hook R Zhang X Murray A (2009) Audi-ence selection for on-line brand advertising Privacy-friendlysocial network targeting KDD rsquo09 Proc 15th ACM SIGKDDInternat Conf Knowledge Discovery and Data Mining (ACMNew York) 707ndash716

Pubmatic (2010) Understanding real-time bidding (RTB) from thepublisherrsquos perspective Technical report Pubmatic SiliconValley CA

Quercia D Lathia N Calabrese F Di Lorenzo G Crowcroft J (2010)Recommending social events from mobile phone location dataProc 2010 IEEE Internat Conf Data Mining ICDM rsquo10 (IEEEComputer Society Washington DC) 971ndash976

Raeder T Stitelman O Dalessandro B Perlich C Provost F (2012)Design principles of massive robust prediction systems Proc18th ACM SIGKDD Internat Conf Knowledge Discovery and DataMining (ACM New York) 1357ndash1365

Stitelman O Dalessandro B Perlich C Provost F (2011) Estimatingthe effect of online display advertising on browser conversionData Mining and Audience Intelligence for Advertising ADKDD2011 (ACM New York) 8

Stitelman O Perlich C Dalessandro B Hook R Raeder T Provost F(2013) Using co-visitation networks for detecting large scaleonline display advertising exchange fraud Proc 19th ACMSIGKDD Internat Conf Knowledge Discovery and Data Mining(ACM New York) 1240ndash1248

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Page 4: Finding Similar Mobile Consumers with a Privacy-Friendly ...pages.stern.nyu.edu/~fprovost/Papers/PMM_ISR_2015.pdf · Provost, Martens, and Murray: Finding Similar Mobile Consumers

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 245

networking in which geographic services are usedto enable additional social dynamics (Quercia et al2010) The latter starts from an actual network ofinterpersonal relationships and adds location data torecommend other locations and events In contrastthe GSN uses location data to create links betweenusers without the explicit guarantee that these linkscorrespond to an actual interpersonal relationship(however this surely might be as described in thenext section see Crandall et al 2010)

Section 2 argues theoretically why this GSN designshould be a good idea After that we present the high-level GSN design in more detail followed by specificsof the implementation we examine empirically Thenwe provide results from an empirical study basedon real mobile location data from real-time biddingexchanges Section 5 describes the data and experi-mental setup including several technical definitionsfor geosimilarity The ability of the GSN to connectdifferent instances of the same individual is evaluatedin sect6 The use of the GSN to select users with similarinterests and tastes is assessed in sect7

2 Motivation and Related Work21 Social TargetingAdvertising targeting has evolved substantially overthe past half century As information systems providedaccess to new sources and types of data marketersadded new targeting strategies designed around thenew data For example as demographic data becameavailable a few decades ago contextual targetingmdashtargeting based on inferring audience compositionfrom the context in which the ad will be shown (ega billboard location TV show magazine etc)mdashhadto share the spotlight with data-driven demographictargeting either based on explicit demographic pro-files or predictive modeling As data aggregators coa-lesced and integrated information such as magazinesubscriptions and catalog purchases ldquopsychographicrdquodata entered the mix and broadened yet again thespace of targeting designs

Recently we have seen the introduction of a differ-ent sort of targeting design which we can generallycall social targeting Social targeting differs from theaforementioned targeting methods because it relieson explicit linkages between specific individuals Forexample Hill et al (2006) showed the remarkableeffectiveness of social-network targeting targeting con-sumers who are linked to known customers by asocial network Subsequently Facebook (and others)have attempted to implement social-network target-ing for online advertising with varying degrees ofsuccess (see eg Oinas-Kukkonen et al 2010)

We explicitly generalize from social-network target-ing to social targeting to retain the notion that the

targeting is based on linkages to specific other indi-viduals but to relax the notion that the linkages needto be ldquotruerdquo interpersonal relationships The designof Provost et al (2009) is an example of social tar-geting that is not based on ldquotruerdquo interpersonal rela-tionships the linkages between individuals are basedon a bipartite content-affinity network So the socialtargeting there is based on forming an audience byfinding consumers who are linked by shared con-tent visitation with other specific consumers who areknown to have brand affinity (more on that later)Similarly Martens and Provost (2011) define a net-work among banking customers based on paymenttransaction data which is subsequently used to targetmarketing offers for financial products

22 GeosimilarityThe GSN can be very fine-grained based on shared(anonymized) location data for example Internet pro-tocol (IP) addresses fine-grained latitudelongitudecells or small geographic tracts Why would targetinggeosimilarity network neighbors (NNs) of preselectedseed users be a good idea There are several reasonsLet us call mobile devices that are (directly) linked inthe GSN (first-degree) network neighbors

First of all first-degree network neighbors shareat least one location and possibly several As thenumber of shared locations grows we conjecture thatthe likelihood increases that the two devices actuallybelong to the same person For example who besidesme is observed primarily on my home and office IPaddress let alone my favorite coffee shop We seeanalogous results showing that different instances ofthe same person call the same phone numbers (Corteset al 2001) and cite the same references (Hill andProvost 2003)

Second direct coarse-grained geographic targetingalready is used widely in off-line advertising (not viasocial targeting) because geography is a reasonableproxy for demographics and other predictive featuresAn important difference in the present work is thatwe do not choose the geographies to target explic-itly but instead use them implicitly This allows theactual locations to be anonymized Furthermore itallows us to use location information that is too fine-grained even to include in most predictive modelingfor example locations appearing only in a tiny frac-tion of instances (eg a home Wi-Fi address may onlyappear in 0000001 or less of the usersrsquo location sets)as well as transient Wi-Fi locations that only connecttwo devices for a brief time

Third fine-grained location information is likely tocontain more detailed (latent) information than stan-dard geographic information Not only would it linkdevices by approximate wealth income demograph-ics etc it may well link by employer educational

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers246 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

institution interests community and even shoppinghabits Thus we conjecture that geosimilarity tar-geting may combine the advantages of geographictargeting discussed in the previous paragraph withadvantages similar to content-affinity social targetingwhich has been shown to be effective specifically foronline advertising (Provost et al 2009 Perlich et al2014) We might call this ldquolocale-affinityrdquo social tar-geting The empirical results below show that indeedonersquos geosimilarity neighbors are much more likely tofrequent the same online publishers and mobile apps

If that were not enough research provides yetanother reason to expect that geosimilarity targetingmay be especially effective In an article in the Pro-ceedings of the National Academy of Sciences of the UnitedStates of America Crandall et al (2010 p 22436) showthat geographic co-occurrences between individualsare very strongly predictive of the individuals beingfriends ldquoThe knowledge that two people were prox-imate at just a few distinct locations at roughly thesame times can indicate a high conditional proba-bility that they are directly linked in the underlyingsocial networkrdquo This means that a GSN would notonly capture the advantages of geographic targetingand locale-affinity targeting but might also incorpo-rate actual social-network targeting also shown to beextremely effective for marketing (Hill et al 2006)In fact when massive descriptive data are availablethe (usually latent) similarity between social-networkneighbors has been shown to explain much of themarketing advantage previously attributed to socialinfluence (Aral et al 2009)

Moreover interestingly research also has shown thatthe homophily (McPherson et al 2001) that has beenused to explain the effectiveness of social-network tar-geting may actually be due largely to the constraintsplaced by opportunity (Kossinets and Watts 2009)Tie formation in social networks is biased heavilyby triadic closure and thus by structural proximityOver many generations of tie formation biases in theselection of structurally proximate individuals ldquocanamplify even a modest preference for similar othersvia a cumulative advantage-like process to producestriking patterns of observed homophilyrdquo (Kossinetsand Watts 2009 p 405) Why is this important for thepresent paper Because with the exception of sociallinks formed solely online (like new Facebook-onlyfriends) constraints on opportunity are framed byconstraints on physical colocation As Kossinets andWatts (2009 p 406) observe ldquoan individualrsquos choiceof relations is heavily constrained by other aspects ofhis or her life such as geographical location choice ofoccupation place of workrdquo These are exactly the sortsof links that would be represented in a GSN based onmobile device location data (including tablets and lap-tops) Thus the GSN may in addition reveal a large

driver of homophily and thus expand the effective-ness of a social-network targeting strategy to individ-uals who also would have been similar ldquofriendsrdquo butfor whatever reason were not chosen

3 Geosimilarity GeosimilarityNetworks and Mobile Ad Targeting

The basis for the targeting design is to create a net-work among users based on geosimilarity and thento use the network for inference For this paper wewill create a GSN directly among user instances5 Theelements of the network-targeting design include thefollowing How will entities (mobile user instances)be represented What exactly will be the geosimilaritylinks And how exactly will the predictive inferencebe conducted

In this section we will discuss the design at a highlevel including some further-looking elements thatare not part of the experiments but which providecompleteness for the discerning reader To make thediscussion more concrete we first present the moti-vating data scenario The design should generalize tomobile settings beyond this specific data scenario

31 The Mobile Real-Time Bidding EcosystemA high-level overview of how an RTB exchange worksis given in Figure 1 RTB exchanges bring togetheradvertisers and the publishers of Web pages andapps When a user visits a Web pageapp (here-after ldquoWeb pagerdquo) that sells ad slots through RTBexchanges a bid request is created Such a bid requestwill provide additional data on the user and Webpage visited such as the IP address publisher anddevice type (see Table 1 for a list of fields and statis-tics on their availability across RTB exchanges) Basedon such data advertisers can choose to bid For exam-ple a brand may want to target users that previ-ously visited their automobile website browsing for aecofriendly hybrid car The highest bidder is allowedto load a potentially personalized ad for exampleshowing an ad for an ecofriendly hybrid All of thisoccurs in real time in less than 100 milliseconds Thetwo interrelated problems this paper addresses canbe stated specifically in the context of this examplehow does the targeter find instances of this individ-ual who previously visited the brandrsquos website Canthe targeter find users with very similar interests (egthe focal userrsquos husband best friend) under the pre-sumption that they might also be reasonable candi-dates for an ad for an ecofriendly hybrid

5 An alternative is to create a bipartite network between users andlocations and draw inferences directly from the bipartite networkWe do not investigate that explicitly in this paper however theinterested reader should take a careful look at the similarity calcu-lations we present below

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 247

Figure 1 (Color online) Working of Real-Time Bidding

AdvertisersAd

Consumers

Publisher hasad slots for sale

At IP X is on website YBrand 1

51c

14c

33cBrand 2

Brand 3

3

Bids

Bid request

Loading of ad

2

with device type ZBidding is open 4

1

Notes When a user visits a Web page that sells ad slots through an RTB exchange (1) a bid request is created (2) after which several advertisers bid to allowto show an ad at that moment on that Web page to that user (3) The highest bidder is allowed to load a potentially personalized ad (4)

For this paper we address the data currently avail-able in the digital advertising ecosystem Specificallyvia RTB exchanges we can observe a massive numberof mobile users along with the data made availablefor advertisers to decide whether to bid on the userfor a particular campaign (see eg Pubmatic 2010)Two aspects of the data are crucial (1) Each instanceof a mobile user is associated with a key when thiskey is observed in the future we know that the futuredata item involves the same mobile device6 (2) Partof the data associated with each mobile device is alocation For this paper we consider that location tobe the (anonymized) IP address of the Wi-Fi network7

currently in use (although there are various sorts oflocation data) We observe the locations visited byeach user for the data records included in the researchdata set (described in more detail in sect51)

The mobile RTB ecosystem holds different infor-mation than for the typical online (desktop) worldwhich is important for our design The GSN designmeasures similarity in location visitation behaviorOne might wonder if other behavioral data can beobtained from the RTB systems such as website vis-itation data (Provost et al 2009 Raeder et al 2012)Table 1 shows which typical fields (besides the key)are visible on mobile devices over seven different RTBexchanges Let us consider the different data fields aspossible sources of behavioral data

First let us have a look at the fields that are avail-able everywhere the time and date of the bid request(Created) the RTB exchange (Network) the dimen-sions of the ad (Dims) the device type (Device) the

6 The notion of an instance is important Many instances of the samedevice may have different keys in the online advertising ecosystemAs discussed above connecting these is one focus of this research7 For this paper we will consider transient IP addresses that are notexplicitly removed such as those belonging to 3G or other carriernetworks (but not identified as such) to be noise in the data thatmust be accommodated by the similaritytargeting design Betterremoval techniques could increase the strength of the results pre-sented in the next sections

user agent the publisher and the IP address TheIP address (anonymized) is the location we use in ourGSN design Figures 2(a) and 2(b) show the distri-bution of the number of users seen at an IP addressand the number of IP addresses seen per user respec-tively (more information on this data set is providedlater in sect51) On average a user visits 166 IP ad-dresses and an IP address is visited on average bytwo users (over a time period of 10 days see sect51)Although in our data set some known 3G IP addresses(subranges) have already been removed Figure 2(a)shows that high-volume IP addresses are still presentSome of these IP addresses with a very large numberof unique users seen correspond to large institutionsthink for example of a university Wi-Fi address

Although the publisher data item is available for allbid requests we observe that most device instancesvisit only a single publisher (about 95 whereasless than 1 visit three or more publishers andonly 00009 are seen on 10 or more publishers)The publisher-specific Interactive Advertising Bureau(IAB) categorizations (Iabcats) suffer from the sameproblem (as well as being too coarse-grained to beuseful for identifying the same user in any event)The reason is that as mentioned above a mobiledevice has different identifiers when viewed throughdifferent channels in the mobile advertising ecosys-tem such as different RTB systems or even differentapps On the other hand a publisher has on average1366 users that visited it and publishers by and largeseem to be RTB specific Figures 3(a) and 3(b) showmore details of the distributions As a consequenceusing publisher-visitation data to connect users is notbroadly helpful because two user instances in dif-ferent RTB systems rarely share a publisher The IPaddress is always available and is passed consistentlythrough the different apps and RTB systems

The other consistently available fields (such as useragent device type dimensions) are always the samefor a single device These could in principle be used toimprove same-device finding but are only marginally

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers248 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

Table 1 Data Availability for Mobile Devices Across Different RTB Systems

RTB 1 () RTB 2 () RTB 3 () RTB 4 () RTB 5 () RTB 6 () RTB 7 ()

Created 100 100 100 100 100 100 100Network 100 100 100 100 100 100 100Dims 100 100 100 100 100 100 100ID 100 100 100 36 0 19 44Device 100 100 100 100 100 100 100User agent 100 100 100 100 100 100 100Iabcats 0 100 100 98 100 100 84Geo 100 16 100 100 100 100 100Location 0 4 56 42 1 20 100Referrer 8 0 0 17 0 6 3Publisher 100 100 100 100 100 100 100URL 100 0 0 5 93 0 0IP 100 100 100 100 100 100 100

Notes Specifically the table shows how often the following fields are provided time of bid request (Created) the RTB network identifier(Network) dimensions of the ad (Dims) identifier of the device (ID) device type (Device) the user agent IAB categorization (Iabcats)the IP-inferred city state or country (Geo) the location latitude and longitude (Location) the referrer from where the user comes(Referrer) the publisher of the Web page (Publisher) the URL and the IP address

helpful for finding the same user on different devicesor for similar-user finding (eg using Apple productsversus Android products) we do not take advantageof them in this paper instead focusing specificallyon assessing the ability of the geosimilarity design

Figure 2 Distribution of (a) Number of Unique User Instances Seen per IP Address and (b) Number of IP Addresses a User Instance Logs Into

(b) Number of IPs per ID

Number of IPs

Fre

quen

cy o

f occ

uren

ce

100

108

107

106

105

104

103

102

100

101

101 102 103 104 105 106 107 108

(a) Number of IDs per IP

Fre

quen

cy o

f occ

uren

ce

108

107

106

105

104

103

102

100

101

Number of IDs

100 101 102 103 104 105

Figure 3 Distribution of (a) Number of Unique User Instances Seen per Publisher and (b) Number of Publishers a User Instance Visits

(a) Number of IDs per publisher

Number of IDs

Fre

quen

cy o

f occ

uren

ce

100

104

103

102

100

101

101 102 103 104 105 106 107

(b) Number of publishers per ID

Fre

quen

cy o

f occ

uren

ce

108

107

106

105

104

103

102

100

101

Number of publishers

100 101 102 103 104

(as opposed to how to best engineer a same-user-finding system)

The URL field is missing in all but two RTB ex-changes The Geo field is often limited to the citystate and country as inferred from the IP address

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 249

Location data in the form of latitude and longitude(Location) are only available in some RTB systemsand the accuracy of the latitudelongitude data in thecurrent mobile ecosystem is questionable8 again mak-ing it not suitable for broad use

32 Entity and Link RepresentationWe represent each user instance by a ldquoprofilerdquo ofbehavior across locations where we define a locationprofile as a vector of visit frequencies This will neces-sarily require a sparse representation as we observe avery large total set of locations fortunately each indi-vidual is seen at only a small subset of the locationsFor the empirical results below the location profile isa sparse vector of frequencies of observed visitationsto locations (which can be weighted in the calculationof similarity as described below)

The structure of the GSN is the graph with mobiledevices as the nodes and links between the devicesas the edges The corresponding design decisionsinvolve the selection of which devices to link at alland the weights to place on the links The simplestlink representation places unweighted links betweenall pairs of devices that share at least one specific loca-tion More sophisticated weighting criteria are basedon two notions which are discussed next and illus-trated with a simple example in Figure 4 The loca-tion data are obtained by listening to RTB ecosystemswhere IP addresses of devices are broadcast (as shownpreviously in Figure 1)

1 Given two devices that share at least one loca-tion the weight of the link between them can bea function of the devicesrsquo location profiles Such mea-sures can include simply the number of shared loca-tions and the number of shared locations weightedby their visitation percentage in the profile Alterna-tively the weighting could be a measure of similaritybetween the two location profiles for example cosinesimilarity

2 As such Devices A B and C are connected in theGSN of Figure 4 since they all logged into the Wi-Fiof the Met Museum Devices B and C also loggedinto the public Wi-Fi at Times Square and thereforeshould be more strongly connected than A and B (or Band C) which share only one location

8 Analyses of the latitudelongitude data show various data prob-lems including flipped coordinates coordinates set equal to eachother many latitudinallongitudinal points at the centroids of com-mon geographic areas and many impossible or highly improba-ble coordinates Mobile ad industry insiders claim that the highprice paid for latitudelongitude-based hyperlocal targeting hasled to a significant amount of latitudelongitude fraud We havefound nothing written on this phenomenon although data fraudis not uncommon in the online advertising industry (Stitelmanet al 2013)

Figure 4 (Color online) Basic Illustration of the Creation of a GSNBased on Mobile Location Data Observed for Three MobileDevices (A B and C) and a Laptop (D)

Met MuseumA B

Geosocial network(GSN)

A C

D

B

C

D

Times Square

Apartment

Note Devices are connected in the GSN if they visited (logged into) the samelocation (IP address)

3 Links can be weighted according to the componentlocationsrsquo 4lack of5 popularity For example it mightbe argued that the link formed by sharing a loca-tion with a small number of other devices (eg myapartment) indicates a stronger geosimilarity thansharing a location with a massive number of otherdevices This intuition can be extended if two devicesspend a lot of time at such an unpopular locale(eg my apartment) then that should indicate a verystrong similarity If two devices have approximatelythe same distributional profile across several of thesesites (their ldquolocation profilerdquo) then they are eitherthe same person or close friendssoulmates In ourexample C and D are connected because they bothlogged into the Wi-Fi of an apartment whereas Band C are connected by logging into the public (pop-ular) Wi-Fi at Times Square Hence the connectionbetween C and D (which are likely devices belongingto roommates or belonging to the same user) shouldbe stronger than the connection between B and C(which are likely devices belonging to two differenttourists)

This notion of similarity can be incorporated tech-nically by (1) adapting notions from information re-trieval we can weight locations for a given deviceby their device-specific popularity (from the devicersquoslocation profile) divided by the (log of) the locationrsquosoverall popularity let us call that LF times IDF (locationfrequency times inverse device frequency) Then (2) thestrength of the link between two devices that sharea location would be a function of the correspondingLF times IDF scores

Once we have one or more GSNs there are variousways to take advantage of them for targeting or otherinference The simplest method for inference is sim-ply to target all of the geosimilar network neighborsAlternatively one could target the ldquoclosestrdquo neighborsbased on one or more notions of geosimilarity (as dis-

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers250 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

cussed above) These are the methods used for theempirical results presented below

4 Specifics of the GSN DesignAs described above in this design the strength of thelink between two individuals in the GSN is based onthe similarity in the distribution of the locations thatthey visit For this paperrsquos results locations will be(anonymized) IP addresses We now define differentsimilarity measures of varying complexity which willbe evaluated in the following sections

41 NotationThe location profile of a user is represented by a vec-tor Ex of length m where element i denotes the num-ber of visits to IP address IPi2 Ex isin m Vector Exbool isa binary vector of size m where element i denoteswhether the user has visited IPi For similarity com-putations both the Hadamard product () and theinner product 4 middot 5 are used and the minimum (min)operator is defined to operate componentwise on twovectors as well as on a vector and a scalar The max-imum (max) and average (avg) operators are similarTheir logic is defined in Equations (1)ndash(4) The weightof an IP address is defined by its popularity in anal-ogy to the inverse document frequency used in text

Table 2 Similarity Metrics Between Two Userrsquos Location Visit Distributions Ex1 and Ex2

Similarity metric Range d4 Ex11 Ex25

1 scount4 Ex11 Ex25= Exbool11 middot Exbool12 6015 1

2 sbool4 Ex11 Ex25= min4scount4 Ex11 Ex25115 80119 1

3 scount1W 4 Ex11 Ex25= 4 Ew Exbool115 middot Exbool12 6015 103

4 sfreq1min4 Ex11 Ex25= min4 Ex11 Ex251 6015 2

5 sfreq1max4 Ex11 Ex25= max4 Exbool12 Ex11 Exbool11 Ex251 6015 3

6 sfreq1 avg4 Ex11 Ex25= avg4 Exbool12 Ex11 Exbool11 Ex251 6015 205

7 sfreq1min1W 4 Ex11 Ex25= min4 Ex1 Ew1 Ex2 Ew51 6015 206

8 sfreq1max1W 4 Ex11 Ex25= max44 Exbool12 Ew5 Ex11 4 Exbool11 Ew5 Ex251 6015 309

9 sfreq1 avg1W 4 Ex11 Ex25= avg44 Exbool12 Ew5 Ex11 4 Exbool11 Ew5 Ex251 6015 3025

10 scosine4 Ex11 Ex25=Ex1 middot Ex2

Ex11 middot Ex21

60117 0074

11 scosine1W 4 Ex11 Ex25=4 Exbool11 Ew5 middot 4 Exbool12 Ew5

Exbool11 Ew1 middot Exbool12 Ew1

60117 0060

12 sJaccard4 Ex11 Ex25=scount4 Ex11 Ex25

max4 Exbool111 Exbool1251

60117 0033

13 sJaccard1W 4 Ex11 Ex25=scount1W 4 Ex11 Ex25

max4 Exbool11 Ew1 Exbool12 Ew51

60117 0026

14 sJaccard1 freq4 Ex11 Ex25=sfreq1min4 Ex11 Ex25

max4 Ex1 Exbool121 Ex2 Exbool1151

60117 0067

15 sJaccard1 freq1W 4 Ex11 Ex25=sfreq1min1W 4 Ex11 Ex25

max44 Ex1 Exbool125 Ew1 4 Ex2 Exbool115 Ew51

60117 0067

mining (Hotho et al 2005) the weight wi for IPi (theinverse device frequency) is defined as the logarithmof the total number of users n divided by the numberof (unique) users that visit that specific IP address ni

Ez= Ex Ey1 zi = xi times yi1 (1)

a= Ex middot Ey =sum

i=12m

4xi times yi51 (2)

Ez= min4Ex1 Ey51 zi = min4xi1yi51 (3)

Ez= min4Ex1a51 zi = min4xi1 a51 (4)

wi = log104nni50 (5)

42 Link Strength MetricsFifteen similarity metrics are defined in Table 2 andare illustrated by measuring the strength betweenthe location profiles of two example users Ex1 and Ex2shown in Equation (6) The first user visits IP1 threetimes and IP2 twice whereas the second one visits IP1twice and IP3 once Several variants of the same met-rics are considered As described above some onlytake into account the unique number of shared loca-tions whereas others take into account the frequencyof the visits to these locations and the inverse devicefrequency The latter are denoted by freq and W respectively

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 251

The first similarity metric scount counts the numberof shared locations Since the two example users shareonly IP1 this strength is 1 The IDF weighted versionscount1W sums the IDF values of the shared locationswhich is 13 for our simple example The most basicmetric is given by sbool which is 1 when a location isshared and 0 otherwise indicating whether two usersare neighbors or not

The sfreq1min sfreq1max and sfreq1avg metrics compare thefrequencies of the visits to the shared locations Theytake the minimum maximum and average respec-tively of the frequencies for each shared location andsum them The weighted versions sfreq1min1W sfreq1max1W and sfreq1avg1W do the same but weight each frequencywith the corresponding IDF value

The scosine metric measures the similarity by takingthe cosine of the angle between the two vectors TheJaccard metric is defined as the size of the intersec-tion over the size of the union In this case it countsthe number of shared locations but normalizes theseover the total number of locations both users havevisited Users that visit many locations will have ahigher chance to share some location with anotheruser so the Jaccard metric penalizes them For ourbasic example sJaccard is 1 (IP1) over 3 (IP1 IP2 IP3)Once more variants taking into account the frequencyand the IDF are defined as well

Ex1 = 63 2 071

Ex2 = 62 0 171 (6)

Ew = 6103 107 270

5 Experimental Setup51 Data SetThe data set D for the empirical results is 10 days ofanonymized advertising bid requests from mobile de-vices (smartphones tablets etc) observed across sevenRTB exchanges A total of 322770794 bid requestsare observed coming from 42437559 unique userinstances Over these 10 days we observe 41829088unique IP addresses

The number of unique user instances per RTBexchange is given in Table 3 The distributions of thenumber of bid requests per device type and operatingsystem are given in Tables 4 and 5 which show thatthe majority of bid requests come from smartphonesand the iOS operating system

A user (instance) corresponds to an exchange-assigned user ID One person on several deviceswill hence correspond to several users One deviceon several exchanges will often correspond to sev-eral users Also it might be that one ad exchangesometimes provides different user IDs to the same

Table 3 User Instances per RTB Exchange

Exchange Volume Percentage ()

RTB 1 311751979 6RTB 2 211591686 4RTB 3 1415661098 29RTB 4 2015011243 41RTB 5 212751015 5RTB 6 510551815 10RTB 7 212231418 4

Table 4 Distribution of Bid Requests per Device Type

Type Volume Percentage ()

Phone 29213581772 91Tablet 3010161334 9

Table 5 Distribution of Bid Requests per OperatingSystem

Type Volume Percentage ()

Android 12913151617 4001iOS 19219231648 5908Windows 5311529 002

device for example when the device is using differ-ent particular apps As described above an impor-tant task in online advertising is to be able to targetadvertisements to different instances of the same indi-vidual with reasonable probability regardless of thecomplexities of identification in the baroque onlineadvertising ecosystem Figure 5 visualizes a sample oflatitudes and longitudes broadcast by mobile devicesin our data set

52 SamplingFor the results in this paper we apply the GSN anal-ysis to data samples Each sample corresponds to acohortneighborhood and is built as follows A sin-gle (random) ldquoegordquo user is chosen user X aroundwhich a neighborhood is built Every neighbor of theego user must share at least one IP address so webegin by obtaining all (anonymized) IP addresses vis-ited by user X Next all other users that visited atleast one of these IP addresses (the neighbors) areadded The ego must have at least one neighbor to beincluded in this analysis Finally we obtain the com-plete location distribution of each of these neighborsWhen the next sample is chosen we ensure that noneof the users already included in the previous samplesis taken as the ego user (X) As such we ensure thatwe do not have repeated egondashneighbor dyads Theremay be shared neighbors of two different egos buttwo egos will never be immediate neighbors in thedata sample

The result of the sampling over 500 cohorts is summa-rized in Figure 6 on average an ego has 71 neighbors

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers252 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

Figure 5 Scatter Plots of Samples of Latitudes and Longitudes Broadcast by Mobile Devices

ndash130 ndash120 ndash110 ndash100 ndash90 ndash80 ndash70 ndash6020

25

30

35

40

45

50

55(a)

Long

Lat

ndash20 ndash10 0 10 20 30 40 50

Long

30

35

40

45

50

55

65

60

75

70

Lat

ndash200 ndash150 ndash100 ndash50 1000 50 150 200

Long

ndash100

ndash80

ndash60

ndash40

ndash20

0

40

20

100

60

80

Lat

4068ndash7406 ndash7404 ndash7402 ndash7400 ndash7394ndash7398 ndash7396 ndash7392 ndash7390

Long

4070

4072

4074

4076

4080

4078

4086

4082

4084

Lat

(b)

(c) (d)

Note Note that locations are plotted without reference to any map The figure gives a striking picture of population density across the world and makes uswonder what is going on with mobile devices in Antarctica

Figure 6 (Color online) The Distribution of the Number of Neighborsfor 500 Randomly Sampled Ego Users

0 500 1000 1500 2000 2500 3000 35000

50

100

150

200

250

300

350

400

450

No neighbors

Fre

quen

cy

the median being 1 The maximum number of neigh-bors is 3167 this user probably visited a publicWi-Fi that thousands of other users also logged intoWe chose 500 cohorts as we observe that the statis-

tics become quite stable as seen in Figure 7 Even200 cohorts seems to be enough to obtain stablesample statistics

The life span of user instances in this set of500 cohorts is shown by the cumulative frequency inFigure 8 Observe that only about 40 of the userinstances are seen at different hours About 30 ofthe user instances are seen only once in the sampledweek This illustrates that it is vital to understand thenotion of user instances These percentages are mis-leading if we interpret the user instances as individu-als or devices One individualdevice without stableidentification in the online advertising ecosystem willcreate many transient user instances which will driveup the percentage of user instances that are seen onlyonce or that are seen only within a short timeframeWe have not figured out how to determine or esti-mate reliably how many ldquounstablerdquo usersdevices arerepresented by the user instances in these data Wekeep all of the data for the analyses below (rather thanfiltering very short-lived user instances) to maintainrealism

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 253

Figure 7 Evolution of Several Sample Statistics as an Increasing Number of Cohorts are Created

0 100 200 300 400 5000

10

20

30

40

50

60

70

Average No network neighbors

0 100 200 300 400 5000

2

4

6

8

10

12

Median No network neighbors

0 100 200 300 400 5000

50

100

150

200

250

300

350

Average No IPs

0 100 200 300 400 5000

2

4

6

8Median No IPs

Notes The horizontal axis denotes the number of cohorts By 500 cohorts the statistics have become stable

Figure 8 (Color online) Cumulative Frequency of Life Span ofUser Instances

100 101 102 103 104 105 1060

01

02

03

04

05

06

07

08

09

10

x = lifespan (sec)

F(x

)

End of June (mobile only)

larr 1 min larr 1 hour larr 1 day

Cumulative frequency of user instances

6 Study 1 Connecting Instances ofthe Same Individual

Returning to our motivating application as men-tioned previously an important task in mobile adver-tising is to be able to target a particular individual inthe face of the individualrsquos fragmentation into differ-ent user instances (for example as seen through dif-ferent bidding systems) Advertisers understand that

it may be impossible to find the same user with 100accuracy Nonetheless it often is important for themto reach the userrsquos various instances without target-ing too many others9 The GSN is designed in partbased on the hypothesis that different instances of thesame person would visit similar locations and there-fore they will be connected in the GSN In this sectionwe present results examining whether and to whatextent that is indeed the case Furthermore the resultsquantify to what extent the same user receives a highGSN-similarity to herself as compared to her otherneighbors

Three experimental settings are used to assess judg-ments that two users are the same The first two sim-ulate different user instances by splitting up the bidrequests of known user instances The last one is notsimulated it links users via the recently introducedidentifier for advertising (IFA) observed over severalRTB exchanges for a subset of device instances

1 Random We divide the IP address visits of a userrandomly across two simulated users Table 6 showsfor example how we can divide a userrsquos IP visits across

9 The astute reader may notice that if the other users targeted arevery similar to the target user the advertising may be well placeddespite not being shown to the exact target user That is not thesubject of this study Study 2 shows evidence that these two notionsare not independent when using the GSN

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers254 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

Table 6 Randomly Splitting User X rsquos IP VisitsCreating Two New Artificial Users X1 and X2

IP1 IP2 IP3

X 1 5 0X1 0 3 0X2 1 2 0

Table 7 The Volume of IFA Occurrence Within OurData Set

Volume

RTB 1 119401855RTB 2 113831962RTB 3 7891292

Table 8 The Frequency of Unique IFAs Visible AcrossRTB Exchanges

Frequency

RTB 1RTB 2 3061594 (1016)RTB 2RTB 3 451256 (169)RTB 3RTB 1 181417 (085)

two new users X1 and X2 for which we know thatthey are actually the same individual Results withRandom should be used only as a ceiling on perfor-mance because the results will be overly optimistic

2 Temporal A more realistic split-up of the trans-actions is a temporal split-up where first the middleday of Xrsquos transactions is determined (hence dynam-ically chosen for each user) All IP address visits upto this middle day are assigned to X1 and all trans-actions after the middle day are assigned to X2 Itwill only be applied to users who are seen on multi-ple days

3 Identifier for advertising IFA is an advertisingidentifier from Apple that users can change or asknot to be used for advertising and is seen as aprivacy-friendlier version of the controversial fixeddevice ID (the ldquoUDIDrdquo) The IFA is observed in threeexchanges with volumes shown in Table 7 for exam-ple 1940855 unique IFAs are seen in RTB 110 Whenwe observe the same IFA across different exchangeswe now know that two devices in different exchangesactually are the same The frequency of unique IFAsvisible across exchanges is shown in Table 8 For exam-ple a total of 3018223 IFAs are seen when we look atboth RTB 1 and RTB 2 with 306594 of these (1016)seen on both exchange 1 and exchange 2 What ifthere were not an IFA and the device would havehad a different identifier in each exchange To what

10 For confidentiality RTB 1 in this setting is not necessarily thesame as RTB 1 in sectsect4 or 7

extent are these two device instances then connectedin the GSN

With these different ldquogold standardrdquo methods toindicate whether or not two users are the samewe assess the ability of the geosimilarity metricsto link the same user determining how often twoinstances of the same person are connected in theGSN (sect61) Afterward we shall assess the differentGSN strength metricsrsquo ability to rank users determin-ing to what extent two instances of the same personexhibit strong geosimilarity as compared to the otherneighbors (sect62) For the random and temporal split-ups 200 samples are used For the IFA-based methodthe complete population is used for measuring thedegree of connectedness to the same user to analyzethe relative strength of the connected users 200 sam-ples are used

61 Same Individual Different ldquoScreensrdquoWe now present results on the GSN connectednessmeasured as how often two instances of the same userare connected in the GSN (in percentage) Note thatthe results summarized in Table 9 should be inter-preted in light of the limited number of days in build-ing the research data set which will limit the GSNconnectivity Nonetheless the relatively high connec-tivities observed even with this sample provide quitepromising results which lends considerable supportto the merit of the overall design

Random0 Using the random split two users thatcorrespond to the same individual are not connectedin only 9 of the neighborhoods Moreover in thesecases user X (whose IP visits are split up) usuallyhas only one or two entries and the random split-upresults in no visits for one of the two simulated users

Temporal0 The temporal split-up is more realisticthan the random one where we assume that a personuses simulated device 1 in the first half of the week(eg during the weekend) and simulated device 2 inthe second half of the week (eg during the week)

Users X1 and X2 are not connected in 32 of theneighborhoods Whereas the random split-up was toooptimistic the temporal one is too pessimistic as inreality there likely is some overlap in the time peri-ods of use of two different devices If user X visitsan IP address only in the first days of the week this

Table 9 To What Extent is the Same User Connectedto Herself

connected

Random 91Temporal 68IFAmdashRTB 1RTB 2 67IFAmdashRTB 2RTB 3 82IFAmdashRTB 3RTB 1 73

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 255

IP address will not occur in user X2rsquos transactions andreduces the chance to be connected to X1

IFA0 We look at all combinations of the three ex-changes with IFA identifiers For example there are45256 unique IFAs seen on both RTB 2 and RTB 3 In82 of these the two IFAs visit at least one same IP onboth exchanges and hence are connected in our GSN

The bottom line is that even with this limited slice ofonline behavior different instances of the same indi-vidual are connected in the GSN 70ndash80 of the timeEncouragingly the results from the simulated scenar-ios concur with the results from the real (IFA) sce-narios (including the aforementioned optimism andslight pessimism of the two simulated scenarios)Consistency over these different settings gives addi-tional confidence in the results suggesting that theGSN indeed holds promise for a high degree of suc-cess at targeting the same user across the fragmenta-tion of the online advertising ecosystem

62 Ranking User InstancesNext given that two user instances of the same indi-vidual are indeed connected we assess to what extentthe second instance of the same user is ranked highlyamong its network neighbors based on the weightof their linkage in the GSN We first will discuss themost general results using the best ranking methodand then will look across different ranking methods

We decompose the results based on the numberof network neighbors (NNs) since there are twoboundary cases that require special interpretationTables 10ndash14 show the proportions of the ego net-works that fall into three scenarios First a user mayhave only one NN In this scenario the ranking isperfect (trivially)mdashthe other instance of the same useris the only user instance given a nonzero score As

Table 10 In What Portion of Cases Is the Same UserConnected Depending on the Number ofNetwork NeighborsmdashRandom Split-up

Ego has only 1 NN 3104Ego has exactly 2 NNs 1905

Same user ranked first 8005Ego has more than 2 NNs 4901

Note For the scenario with two NNs the table reports thepercentage of cases where the other known instance of thesame user is ranked first (including ties)

Table 11 In What Portion of Cases Is the Same UserConnected Depending on the Number ofNetwork NeighborsmdashTemporal Split-up

Ego has only 1 NN 3008Ego has exactly 2 NNs 1807

Same user ranked first 8309Ego has more than 2 NNs 5005

Table 12 In What Portion of Cases Is the Same UserConnected Depending on the Number ofNetwork NeighborsmdashIFA RTB 1RTB 2Split-up

Ego has only 1 NN 4401Ego has exactly 2 NNs 1201

Same user ranked first 8201Ego has more than 2 NNs 4400

Table 13 In What Portion of Cases Is the Same UserConnected Depending on the Number ofNetwork NeighborsmdashIFA RTB 2RTB 3Split-up

Ego has only 1 NN 6109Ego has exactly 2 NNs 1907

Same user ranked first 7102Ego has more than 2 NNs 1804

Table 14 In What Portion of Cases Is the Same UserConnected Depending on the Number ofNetwork NeighborsmdashIFA RTB 3RTB 1Split-up

Ego has only 1 NN 5305Ego has exactly 2 NNs 2501

Same user ranked first 6304Ego has more than 2 NNs 2104

shown in the tables this scenario accounts for 30ndash60 of the ego networks (depending on the split-upmethod)

The second boundary case is when the ego has ex-actly two NNs This scenario accounts for 10ndash25of the cases depending on the split-up method Herethe tables also report whether the other known userinstance is ranked first (including ties) among thetwo neighbors The other known instance of the sameuser is ranked first 60ndash80 of the time dependingon the split-up scenario The IFA results are affectedstrongly by the particular RTB settingmdashthis may indi-cate that the publishers on the different RTBs have dif-ferent policies for passing the IFA Importantly theseresults are conservativemdashpossibly very conservativeThe other NNs may also be instances of the same userunbeknownst to usmdashfor example in the IFA casebecause an app does not pass the IFA to the RTB orbecause the other user is the same user on a differentdevice and thus does not share the IFA

The final scenario is when more than two NNsare present As shown in the tables this accounts for20ndash50 of the cases depending on the split-up sce-nario In this case we measure the percentile in theranking where the instance of the known same userfalls (see below) as well as the area under the receiveroperating characteristic curve (AUC) (Fawcett 2006)which is equivalent to the MannndashWhitneyndashWilcoxon(MWW) statistic The AUCMWW measures to what

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers256 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

Table 15 When There Are More Than Two Network Neighbors to What Extent Is the Known Same User StronglyConnected to Itself

AUC (if gt 2 NNs) () Average rank percentile (if gt 2 NNs) ()

Ranking metric R T RTB 12 RTB 23 RTB 31 R T RTB 12 RTB 23 RTB 31

1 78 64 68 59 55 29 48 34 43 542 59 58 67 58 54 45 53 34 44 553 79 69 74 61 61 29 44 29 40 484 86 75 69 61 59 24 40 32 41 505 83 70 71 59 63 26 44 31 43 476 85 74 71 60 65 25 41 31 42 457 86 76 73 62 64 24 40 29 40 468 83 70 72 59 65 26 44 30 43 459 85 74 71 60 67 25 41 30 42 44

10 79 69 69 61 63 29 44 32 41 4711 79 69 71 62 64 44 44 30 41 4612 81 68 68 60 62 28 46 34 42 4913 81 70 71 61 64 27 44 30 40 4614 74 67 66 61 48 34 47 35 41 6015 74 67 66 61 48 34 47 35 41 60

Note See the text for a discussion of ranking metrics

extent data points with label 1 are ranked higherthan those with label 0 In the results that follow(see Table 15) the AUC is measured for each samplewhere any user instance among the neighbors thatcorresponds to the same individual is labeled as 1and all others are labeled as 0 Thus in this settingthe AUC measures how well the neighbors are rankedby the likelihood of being the same individual11 Theaverage over all samples is reported If for a sampleall neighbors have the same score that AUC is setto 05 In addition to the factors noted above this alsowill tend to make the results conservative For exam-ple if the GSN links an ego user to a set of neighborsthat are all instances of the same individual in thisevaluation the AUC will be 05

The ranking measure is the percentile in the rank-ing at which the known-same-user instance is rankedwith lower values being better For example if thesame-user instance has the highest score (closest con-nection) among five neighbors the rank is 20 (15)Note again that this is quite conservative in thisexample the user is ranked at the top of the list butbecause there were only five neighbors the highestpossible score is 20 In case the known-same-userinstance has the same score as another neighbor theaverage rank is reported

As shown in Table 15 all AUC values exceed 05ranging up to 08 This means that in every casea known-same-user instance is likely to be rankedhigher than an instance not known to be the same

11 Technically in this setting the AUCMWW is the probability thata randomly selected neighbor that is in fact a known instance of thesame user is ranked more highly than a randomly selected neighborthat is not known to be an instance of the same user If the known-same-user instances are all ranked above the other-user instancesthe AUC = 100 If they are all below then the AUC = 0

user The frequency-based ranking metrics (4ndash9) per-form quite well in connecting instances of the sameuser strongly

Almost all metrics perform better than a randommodel with performances that even come close to thebest possible solution Consistently performing quitewell are the frequency-based metrics (metrics 3ndash9)with metric 7 (the minimum of the IDF weights ofthe shared locations) getting the most wins Not per-forming well is the very basic count metric as well asthe Boolean metric that provides a binary score onlyInteresting to note the Jaccard metrics do not performwell either It seems that penalizing users that visitmany IP addresses negatively affects the results anissue we will return to in the next study

7 Study 2 Does the GSN Select Userswith Similar Interests

The foregoing section addressed our paperrsquos firstclaimmdashthat the GSN design indeed links instances ofthe same user and links them more strongly thaninstances of other users In this section we turn to thepaperrsquos second claimmdashthat the GSN links users withsimilar interests We will measure similar interests bysimilarity in behavior accessing particular publisherson the Web and using mobile apps12

71 Connecting Users with Similar Interests

To What Extent Are the GSN Neighbors of Visi-tors to Particular Publishers Also Visitors to ThoseSame Publishers To assess whether the GSN con-nects users with similar interestsbehavior we select

12 From now on all user instances will be complete real user in-stances the simulated split-up scenarios (Random Temporal) werecreated specifically for the purpose of the same-user study

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 257

Table 16 Distribution of Publishers per RTB in Terms of thePercentage of Users in Our Data Set That Visit the GivenPublisher

Publisher of D

RTB 1Burstly 3004myYearbookcom 1062My Fitness Pal 0065TextMe Inc 0056GameResort 0025Flixster 0019247Sports 0015Daily Workout Apps LLC 0009Mobile Deluxe 0007YoYo Games Ltd 0004

RTB 2mworldnowcom 0032mtmzcom 0024mtopixcom 0014babycentercom 0015meetmecom 0006appsfacebookcom 0006itunesapplecom 0039appevitecom 0002beautyandskincarescom 0003mmajunkiecom 0001

RTB 3Conversion Exchange 5053Pinger Inc 2096Top Game Developer 4029PremiumEntertainmentApp-FamilyOriented-Android 0059Enflick 0053Clapfoot Games 0044FunPokes Inc 0018Talkatone 0015Sevenlogics Inc 0008A Star Software 0007

RTB 4Pinger PhonemdashiOSmdashConversation 1001Talkatone iPhone App 0007DiceWithBuddies 00139GAG Reader 0012Video Downloader Pro Lite 0016Rage Comics 0009Spades 3D Lite 0006Apalabrados 0003Relax Melodies HD 0002Celeb MemdashPhotoMaker 0000

a diverse collection of 40 publishers listed in Table 16A publisher can correspond to a mobile app a web-site or a set of appswebsites We selected both pub-lishers that are visited by many users and more nichepublishers with fewer visitors As can be seen fromTable 16 several types of publishers can be distin-guished the main groups are games (eg GameRe-sort Top Game Developer) and social networkingtools (eg appsfacebookcom Text Me Inc)

More specifically we create 200 samples for 40 pub-lishers 10 publishers for each of four different RTB

exchanges listed in Table 1613 The percentage ofunique users that visit each publisher in our completedata set is also shown (D) To create each samplean ldquoegordquo user is selected that has visited the pub-lisher in question (an ldquoego visitorrdquo) this ego visi-torrsquos GSN neighborhood is created Next we assesshow many of the egorsquos neighbors also visited thatsame publisher This will be compared with a base-line visit rate to produce lift and leverage values(Provost and Fawcett 2013)mdashspecifically how muchmore likely is it for the neighbors of ego visitors tovisit the publisher than would be expected by chanceWe consider three different setups that produce dif-ferent baselines

1 Considering all RTB exchanges to determine the base-line A user is considered a visitor for a given pub-lisher if she visits the publisher on any RTB ex-change (since some publishers are visible on differentexchanges) Hence the baseline visit rate is the num-ber of users (on any RTB) who visit that publisherdivided by the total number of users across all RTBexchanges This will provide optimistic lifts sincemost publishers appear only on one RTB exchangewhereas the baseline is measured over all exchanges

2 Considering one RTB exchange only to determine thebaseline A user is considered a visitor for a given pub-lisher and RTB if she visits the publisher on the sameRTB exchange This changes the baseline to the num-ber of users (on this RTB) who visit that publisherdivided by the total number of users on this RTBexchange The lift and leverage results will be worsethan in the first setting because the baseline will belarger Seeing that some publishers are seen on differ-ent RTB exchanges this setting is slightly pessimistic

3 Taking time of location visits into account In thethird setting we link users only if they visit an IPwithin the same time period This will be elaboratedon in sect73

Results are shown in Tables 17 and 18 The percent-age of publisher visitors in the neighborhood is givenas N This percentage is compared with the baselinepercentage of users that visit that publisher in thecomplete data D using two metrics the lift is the ratioof these numbers the leverage is the difference Liftis particularly useful for very small base rates (con-sider that an activity with a base rate of 30 cannotpossibly have a lift higher than about 3) Leverage ismore telling for larger base rates because it shows theabsolute improvement (here in percentage points)

When considering all NNs the lifts for all publish-ers are greater than 1 except for one Burstly on RTB 1has a lift of 097 (essentially no lift) but only when weconsider RTB 1 as the baseline This publisher already

13 For confidentiality the RTB numbering here is again differentfrom above

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers258 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

Table 17 RTBs 1 and 2 Percentage of Network Neighbors That Also Visit the Publisher

RTB 1

All NNs N1 N10

D N Lift Lev N1 Lift Lev N10 Lift Lev

Considering all RTB exchanges to determine baseline1 Burstly 3004 8063 2084 5059 47022 15054 44018 25061 8043 220572 myYearbookcom 1062 25077 15094 24015 44074 27067 43012 35016 21074 330543 My Fitness Pal 0065 4015 6038 305 3609 56077 36025 15055 23091 14094 TextMe Inc 0056 1805 33006 17094 39002 69075 38046 25018 45 240625 GameResort 0025 4093 19077 4068 18052 74026 18027 11086 47058 110626 Flixster 0019 1055 8015 1036 8014 42085 7095 1079 9042 1067 247Sports 0015 27057 188058 27042 77062 530093 77048 74053 509075 740388 Daily Workout Apps LLC 0009 1092 21047 1083 19023 215046 19014 11054 129027 110459 Mobile Deluxe 0007 1087 28067 108 23053 36009 23046 9063 14707 9056

10 YoYo Games Ltd 0004 3052 99058 3049 21057 609064 21053 13 367045 12096Considering this RTB exchange only to determine baseline

1 Burstly 8085 8063 0097 minus0022 47022 5033 38037 25061 2089 160762 myYearbookcom 4071 25077 5047 21006 44074 905 40003 35016 7046 300453 My Fitness Pal 1089 4015 2019 2025 3609 19048 35001 15055 8021 130654 TextMe Inc 1063 1805 11035 16087 39002 23094 37039 25018 15045 230555 GameResort 0073 4093 6078 402 18052 25049 17079 11086 16033 110146 Flixster 0055 1055 208 0099 8014 14071 7059 1079 3023 10247 247Sports 0043 27057 64073 27014 77062 182023 7702 74053 174097 74018 Daily Workout Apps LLC 0026 1092 7037 1066 19023 73095 18097 11054 44037 110289 Mobile Deluxe 0019 1087 9084 1068 23053 123088 23034 9063 5007 9044

10 YoYo Games Ltd 001 3052 34018 3042 21057 209025 21047 13 126012 1209

RTB 2

Considering all RTB exchanges to determine baseline1 mworldnowcom 0032 13027 40086 12095 54039 167042 54006 31065 97044 310332 mtmzcom 0024 6089 28086 6065 23008 9607 22084 10068 44075 100443 mtopixcom 0014 15007 104079 14093 53045 37106 5303 33075 234065 330614 babycentercom 0015 6087 47031 6072 43048 299056 43033 17074 122024 17065 meetmecom 0006 6002 10308 5097 54072 942093 54066 27085 47909 270796 appsfacebookcom 0006 301 48039 3003 9009 141097 9003 2076 43008 20697 itunesapplecom 0039 25077 65071 25038 51043 131012 51004 29049 75018 290098 appevitecom 0002 48028 21212043 48025 54076 21509068 54074 48028 21212043 480259 beautyandskincarescom 0003 18019 642087 18016 8048 299094 8046 1107 413049 11067

10 mmajunkiecom 0001 1036 201064 1035 26067 3195505 26066 13033 11977074 13033Considering this RTB exchange only to determine baseline

1 mworldnowcom 4034 13027 3006 8093 54039 12053 50005 31065 7029 270312 mtmzcom 3019 6089 2016 307 23008 7024 19089 10068 3035 70493 mtopixcom 1092 15007 7084 13015 53045 27081 51053 33075 17056 310834 babycentercom 1094 6087 3054 4093 43048 22042 41054 17074 9015 15085 meetmecom 0078 6002 7077 5025 54072 70057 53094 27085 35092 270076 appsfacebookcom 0086 301 3062 2024 9009 10063 8024 2076 3022 1097 itunesapplecom 5024 25077 4092 20053 51043 9081 46019 29049 5063 240258 appevitecom 0029 48028 165058 47098 54076 187082 54047 48028 165058 470989 beautyandskincarescom 0038 18019 48011 17081 8048 22045 8011 1107 30095 11032

10 mmajunkiecom 0009 1036 15009 1027 26067 296003 26058 13033 148001 13024

Notes Results are given for the complete data set (D) the complete neighborhood of a user who visited the given publisher (All NNs) and considering theneighborhood of closest 1 and 10 users (N1 and N10)

has a very high baseline and as we will see in thenext section when we consider only the top 1 and 10NNs the lifts go up to 5 and 3 respectively Burstlyprovides a generic platform for building apps and sois a weaker indication of user interest than many ofthe other publishers (fitness babies games)

All other lifts exceed one by substantial mar-gins and go up to even 2391 (Apalabrados onRTB 4 a Spanish-language social Scrabble-like game)

Obviously the conclusion that the NNs indeed aremore likely to visit the same publishers is highly sig-nificant by a simple sign test comparing the visita-tion rate within the geosimilarity neighborhood tothat of overall population either via lift or leveragep lt 0001 Thus NNs of publisher visitors are indeedmore likely also to visit the same websites the abso-lute values of the lifts and leverages suggest that theyare substantially more likely

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 259

Table 18 RTBs 3 and 4 Percentage of Network Neighbors That Also Visit the Publisher

RTB 3

All NNs N1 N10

D N Lift Lev N1 Lift Lev N10 Lift Lev

Considering all RTB exchanges to determine baseline1 Conversion Exchange 5053 34054 6024 29001 64071 11069 59017 50077 9018 450242 Pinger Inc 2096 46002 15053 43006 65012 21097 62015 51048 17037 480523 Top Game Developer 4029 47024 11002 42095 60 13099 55071 48097 11042 440684 PremiumEntertainmentApp-FamilyOriented-Android 0059 23079 40012 2302 20045 34049 19086 22025 37052 210665 Enflick 0053 11071 21095 11017 21057 40044 21004 12027 23 110736 Clapfoot Games 0044 8055 19032 8011 11054 26006 1101 13099 31061 130557 FunPokes Inc 0018 11011 61028 10093 50 275077 49082 15079 87008 150618 Talkatone 0015 13013 88051 12098 31011 209076 30096 17091 120078 170779 Sevenlogics Inc 0008 2071 32059 2063 5088 70074 508 8033 100022 8025

10 A Star Software 0007 3033 44086 3025 23008 311024 23 9009 122061 9002Considering this RTB exchange only to determine baseline

1 Conversion Exchange 11045 34054 3002 23009 64071 5065 53025 50077 4043 390322 Pinger Inc 6013 46002 705 39089 65012 10062 58098 51048 8039 450353 Top Game Developer 8088 47024 5032 38036 60 6076 51012 48097 5052 400094 PremiumEntertainmentApp-FamilyOriented-Android 1023 23079 19038 22057 20045 16066 19023 22025 18013 210025 Enflick 101 11071 1006 1006 21057 19054 20046 12027 11011 110166 Clapfoot Games 0092 8055 9033 7064 11054 12059 10062 13099 15027 130087 FunPokes Inc 0038 11011 2906 10074 50 133022 49062 15079 42007 150418 Talkatone 0031 13013 42076 12082 31011 101034 3008 17091 58035 170619 Sevenlogics Inc 0017 2071 15074 2054 5088 34017 5071 8033 48041 8016

10 A Star Software 0015 3033 21067 3017 23008 150036 22092 9009 59023 8094

RTB 4

Considering all RTB exchanges to determine baseline1 Pinger PhonemdashiOSmdashConversation 1001 71081 70085 7008 80077 79068 79076 70014 6902 690132 Talkatone iPhone App 0007 11038 17203 11031 29003 439068 28097 17002 257078 160963 DiceWithBuddies 0013 7041 59004 7028 18018 144092 18006 7069 61031 70574 9GAG Reader 0012 62013 497099 62 49009 39305 48097 49005 393017 480925 Video Downloader Pro Lite 0016 20003 123015 19087 22058 138084 22042 17093 110027 170776 Rage Comics 0009 609 8001 6081 13033 154086 13025 609 8001 60817 Spades 3D Lite 0006 1205 225076 12044 27027 492057 27022 1205 225076 120448 Apalabrados 0003 77042 21391071 77039 75 21316097 74097 79032 21450043 790299 Relax Melodies HD 0002 31058 21002029 31056 50 31170029 49098 31058 21002029 31056

10 Celeb MemdashPhotoMaker 0 3023 82107 3022 7014 11819048 7014 3023 82107 3022Considering this RTB exchange only to determine baseline

1 Pinger PhonemdashiOSmdashConversation 8051 71081 8044 6303 80077 9049 72026 70014 8024 610632 Talkatone iPhone App 0055 11038 20053 10082 29003 52038 28048 17002 30071 160473 DiceWithBuddies 0009 7041 82011 7032 18018 201054 18009 7069 85027 7064 9GAG Reader 0097 62013 63097 61016 49009 50055 48012 49005 50051 480085 Video Downloader Pro Lite 0044 20003 45036 19059 22058 51014 22014 17093 40062 170496 Rage Comics 0037 609 18079 6053 13033 36032 12097 609 18079 60537 Spades 3D Lite 0046 1205 2609 12004 27027 58068 26081 1205 2609 120048 Apalabrados 0021 77042 371093 77021 75 360031 74079 79032 381006 790119 Relax Melodies HD 0013 31058 238054 31045 50 377069 49087 31058 238054 31045

10 Celeb MemdashPhotoMaker 0003 3023 97089 3019 7014 216076 7011 3023 97089 3019

Notes Results are given for the complete data set (D) the complete neighborhood of a user who visited the given publisher (All NNs) and considering theneighborhood of closest 1 and 10 users (N1 and N10)

The very high lifts could be attributed to two phe-nomena They may provide some evidence that theGSN embeds a true social network as pointed out bythe good results for the social networking publishers(eg Apalabrados with a lift of 2391 appsevitecomwith a lift of 2212 and others) Friends visiting thesame social networking sites (here unusual ones)

would increase the lift and leverage over just havingdifferent instances of the same user visiting the samewebsites An alternative explanation is that some ofthese publishers are much more likely to be visitedacross different devices of the same user It is not clearwhether this is actually true However it argues forobtaining specific and broad data on which devices

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers260 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

are not the same user to shed further light on forexample any broad data that include a (anonymized)user identifier

From these results we can conclude that GSNneighbors indeed exhibit substantially similar pub-lisher visitation behavior Next we assess to whatextent more strongly connected NNs are more similarthan less strongly connected NNs

72 Ranking Users with Similar Interests

Are the More-Similar Geosimilarity Neighbors of aPublisherrsquos Visitors Even More Likely Also to Be Vis-itors to the Same Publisher To assess whether more-similar geosimilarity neighbors are even more likelyto share interests we compute the measures exactlyas in the previous study except instead of consider-ing all NNs only the 1 (N1) and 10 (N10) neighborswith the highest scores are considered (see Tables 17and 18) First we must consider which of the 15 met-rics that are defined to measure link strength to useTable 19 reports the rank aggregations of the differentmetrics as to which provide the best lifts Specificallyfor each of the 40 publishers each scoring metric pro-vides a lift and for that publisher the scoring metricscan be ranked (the best being ranked 1 etc) The rankaggregation is the average rank for a scoring metricacross all 40 publishers Thus if one metric providedthe best lift for all publishers it would get an aggre-gated rank of 1

To visualize the best metrics the metrics with anaverage ranking less than 8 are shown in boldfacein Table 19 The Jaccard metrics do not perform wellin this study either showing that penalizing the con-nection to users that visit many locations is not sen-sible Remember that the Jaccard metric is based onthe idea that users that visit many locations will havea higher chance to share some location with anotheruser However we have so many locations in totalthat visiting a couple more locations will increase onlymarginally this probability of sharing a location bychance A user that logs into more IP addresses andis more active should hence not be penalized andranked lower than other less active users Metric 3which sums the IDF scores of the shared locationsperforms quite well Given the operational efficiencyadvantage over the frequency-based metrics this met-ric is chosen to measure the strength of the links

As seen from the results in Tables 17 and 18 thelift and leverage values for the strongly connected (top 1and top 10) NNs are even higher sometimes astro-nomical (3170 for Relax Melodies HD (RTB 4)) Theseresults are highly significant by sign tests compar-ing either the lift or the leverage between the top-ranked neighbors and the entire cohort (p lt 0001) Theabsolute values again indicate that the users with the

strongest geosimilarity are often substantially morelikely to visit the same publisher

Thus we can conclude that not only does geosim-ilarity find users with similar interests it also ranksusers well by their likelihood of having similarinterests

An interesting follow-up question is whether thecharacteristics of a publisher are of importance for theresults For example are GSN neighbors more sim-ilar in terms of using the same social network appas compared to playing the same games To answerthis question the publishers are categorized into sixclasses funny social news games communicationand miscellaneous In Figure 9 the lifts of the pub-lishers are shown per category Some general trendscan be observed where publishers related to funnycontent have the highest median lift This could beexplained by the fact that users often forward funnycontent to their friends The social websitesapps fol-low next which seems to further demonstrate thatfriends are likely linked in the GSN However pub-lishers in the communication category perform not sowell compared with the other categories and ratherhigh variances are observed in the different cate-gories Please note that high variance is observedacross the categories therefore it is difficult to makewell-supported claims from these results

73 TimingAs a final evaluation we explore whether the inclu-sion of temporal similarity in location visitation ishelpful for defining GSN connections To this end weintroduce a setting where two users are connectedonly if they visit an IP address in the same timeperiod of a day Specifically we divide a day intodifferent time windowsmdashfor example two time peri-ods of 12 hoursmdashwhere users are connected only ifthey visit the same IP address in the same time periodof a day (which might be on different days) Practi-cally two digits are added to an IP which denote thestarting moment of the corresponding time periodWe report on time periods of 2 hours and of 12 hours(other periods yield similar results) We also includethe setting where users are connected only when theyvisit the same IP address on the same day in light ofthe result of Crandall et al (2010) that visiting thesame location at around the same time is indicativeof friendship

The resulting lifts are reported in Figures 10ndash12when considering respectively all NNs the top 1 NNand the top 10 NNs The results when not consider-ing the time period are repeated as well (All) lim-ited to the conservative case of only considering oneexchange (see above) The baseline (lift of one) isindicated with a red horizontal line lifts higher than1000 are shown as 1000 to limit the range to be able

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 261

Table 19 Rank Aggregation per Metric Over All Publishers (Across All RTBs)

Top 1 NN Top 10 NNs

1 Shared unique IPs 801 1 Shared Unique IPs 802252 BooleanmdashSharing an IP 80625 2 BooleanmdashSharing an IP 809133 Shared unique IPsmdashIDF weighted 70025 3 Shared unique IPsmdashIDF weighted 709754 Shared IP visitsmdashMin 70063 4 Shared IP visitsmdashMin 704135 Shared IP visitsmdashMax 7095 5 Shared IP visitsmdashMax 70256 Shared IP visitsmdashAvg 8005 6 Shared IP visitsmdashAvg 704887 Shared IP visitsmdashMinmdashIDF weighted 60625 7 Shared IP visitsmdashMinmdashIDF weighted 703758 Shared IP visitsmdashMaxmdashIDF weighted 70975 8 Shared IP visitsmdashMaxmdashIDF weighted 702259 Shared IP visitsmdashAvgmdashIDF weighted 70938 9 Shared IP visitsmdashAvgmdashIDF weighted 70525

10 Cosine similarity 80338 10 Cosine similarity 804511 Cosine similaritymdashIDF weighted 80213 11 Cosine similaritymdashIDF weighted 8086312 GSN Jaccard 80925 12 GSN Jaccard 913 GSN JaccardmdashIDF weighted 80325 13 GSN JaccardmdashIDF weighted 9013814 GSN Jaccard Freq 80425 14 GSN Jaccard Freq 80515 GSN Jaccard FreqmdashIDF weighted 80425 15 GSN Jaccard FreqmdashIDF weighted 805

to visualize the results Tables 20 and 21 show theaverage rank aggregation (see the discussion of rankaggregation above) in terms of lift per RTB and thep-values of a signed rank test respectively

When comparing the different temporal settingswe see that the ldquoDayrdquo variant performs best and lim-ited differences exist between the time periods of 2and 12 hours More important to observe from theseresults is that (overall) not including time in the GSNdesign outperforms all three temporal variants sub-stantially Only for RTB 3 and the Day version whereusers are connected only if they visit the same IPaddress on the same day are the lifts not signifi-cantly worse than having no time constraint at a 1level (although even here they are significantly worseat a 10 level) This shows that for finding userswith similar interests which locations they visit ismore important than when they visit them By includ-ing the time constraints connections between previ-ously connected users are lost simply because they

Figure 9 (Color online) Lifts of the Publishers (for All Four RTBs)Ranked According to the Median Lift Within the Category(Medians Shown by the Full Line)

visit locations in different time periods Although weuse time to define links metrics that include time todefine the strength of a link might improve the resultsfurther and we consider that an interesting issue forfuture research

8 ConclusionThis paper presents a new design for using geosim-ilarity to connect user instances in a geosimilarity-weighted network In the GSN users are connected ifthey share at least one observed location Various met-rics can be used to yield degrees of similarity We pre-sented intuitive arguments theory and prior researchthat suggest that geosimilarity and the GSN designshould link similar users We also provided strongempirical support Specifically Study 1 shows thatthe GSN indeed links different users instances cor-responding to the same individual and the geosimi-larity metrics rank user instances by their likelihoodof corresponding to the same individual Study 2shows that the GSN links users who have similarinterests as measured by their propensity to visitthe same publishersuse the same apps It also pro-vides strong evidence that the geosimilarity metricsrank user instances by their likelihood of havingsimilar interests Taken together the results providestrong support that the GSN links users with similarinterestsmdashin many cases because they actually repre-sent the same individuals

The main limitation of the research presented inthis paper is that the data are not sufficient to dis-tinguish whether the similarity in interests is solelybased on connecting instances of the same user Thisis an important avenue for future research This lim-itation notwithstanding the very strong results fromStudy 2 have important implications in either caseeither the network is indeed finding different indi-viduals with substantially similar interests or we

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers262 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

Figure 10 (Color online) Lifts Considering All Network Neighbors per RTB System (Each Panel) With and Without the Time Constraint Limited toOne Network (see sect71) for the 40 Publishers

RTB 3mdashAll NNs

0 5 10

Publisher

0

20

40

60

80

100

Lift

RTB 4mdashAll NNs

0 5 10

Publisher

0

200

400

600

800

1000

Lift

RTB 2mdashAll NNs

0 5 10

Publisher

0

200

400

600

800

1000

Lift

0 5 100

200

400

600

800

1000RTB 1mdashAll NNs

Publisher

Lift

All

TP2

TP12

Day

Note Lifts higher than 1000 are shown at 1000

Figure 11 (Color online) Lifts Considering the Closest 1 Network Neighbor per RTB System (Each Panel) With and Without the Time ConstraintLimited to One Network (see sect71) for the 40 Publishers

Lift

0

100

200

400

300

RTB 3mdashTop 1 NNs

0 5 10

Publisher

All

TP2

TP12

Day

0 5 100

200

400

600

800

1000RTB 1mdashTop 1 NNs

Publisher

Lift

RTB 2mdashTop 1 NNs

0 5 10

Publisher

0

200

400

600

800

1000

Lift

RTB 4mdashTop 1 NNs

0 5 10

Publisher

0

200

400

600

800

1000

Lift

Note Lifts higher than 1000 are shown at 1000

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 263

Figure 12 (Color online) Lifts Considering the Closest 10 Network Neighbors per RTB System (Each Panel) With and Without the Time ConstraintLimited to One Network (see sect71) for the 40 Publishers

RTB 2mdashTop 10 NNs

0 5 10

Publisher

0

200

400

600

800

1000

Lift

0 5 10

Publisher

Lift

0

50

100

150

0 5 10

Publisher

0

200

400

600

800

1000

Lift

0 5 100

200

400

600

800

1000

RTB 1mdashTop 10 NNs

RTB 3mdashTop 10 NNs RTB 4mdashTop 10 NNs

Publisher

Lift

All

TP2

TP12

Day

Note Lifts higher than 1000 are shown at 1000

have very strong additional support for the rankingresults of Study 1 as the neighbors with the strongestgeosimilarity have significantly higher affinity for thesame publishersapps

These results have broad and immediate manage-ment implications within our motivating applicationof mobile advertising As discussed above digitalmarketers would like to target the many fragmentedinstances of individuals in the digital advertising

Table 20 Rank Aggregations in Terms of Lift AveragedAcross All Publishers (per RTB)

All TP2 TP12 Day

RTB 1 101 209 309 201RTB 2 1 204 304 302RTB 3 101 3 4 109RTB 4 101 208 308 203

Note Linking based on time does not improve over linkingignoring time

Table 21 Evaluation of Including Time p-Values ofSigned Rank Test

All TP2 TP12 Day

RTB 1 mdash 0 0002 0 0002 0 0006RTB 2 mdash 0 0002 0 0002 0 0002RTB 3 mdash 0 0002 0 0002 00065RTB 4 mdash 0 0002 0 0002 0 0004

ecosystem as well as users who have similar intereststo particular ldquoseedrdquo users Location data such asIP addresses are readily available across the adver-tising ecosystem The most direct use of the GSN formobile ad targeting is to seed a targeting campaignwith users chosen to exhibit some characteristics ofinterest These could be chosen through any of themyriad of methods currently used in advertising TheGSN will directly allow the targeting of the othermembers of these usersrsquo geosimilarity cohorts whichwill include other instances of the same user othersimilar users and likely both So for example if a setof users has been identified to have brand affinity viavisits to a brandrsquos website the GSN cohorts of theseusers would provide an attractive avenue to expandthe reach of a campaign to target them (extendingtraditional retargeting) and also target users similarto the seeds (customer prospecting) In the researchdata set we see that different instances of the sameusers are very often connected in the GSN and thatthose connected in the GSN exhibit substantially sim-ilar interests

The GSN also could be used for (privacy-friendly)hyperlocal targetingmdashmeaning targeting people whofrequent a particular location without needing to storedata on the actual locations of the users as describedearlier IP addresses currently are used by somemarketers for coarse-grained demographic targetinginferring geography from IP address registration data

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers264 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

and then inferring demographics from the geographyThe GSN provides a complementary alternative it willconnect people who visit the same fine-grained loca-tions As described above in the couponing exampleif a campaign is seeded with customers of a localestablishment (eg via an online loyalty program)geosimilarity can target others who frequent the samelocations

An exciting potential future use of the GSN is theevaluation of marketing campaigns across differentchannels This has become an important challenge formobile advertizing (Lee 2014)14 For example I mightreceive an ad for a hot new product on my mobiledevice I may be interested in the product but Irsquomunlikely to buy this product on my mobile phoneRather Irsquoll buy the product using my laptop or PC Bylinking the screens of the same user we are now ableto provide a broader (and possibly much more robust)estimate of campaign success (eg conversion rate)across different channels by aggregating the successmetrics across GSN neighbors For example similar tohow a traditional ad campaign may runnot run anad in a controlled study across different cities agen-cies could targetnot target different (matched) usersand then look at the success rates in their respec-tive geosimilarity neighborhoods This should captureeffects of the same individual on different devices

As discussed at the outset advertisers need to becognizant of privacy concerns regarding the collec-tion storage and use of data This paper followswhat the FTC calls the ldquoprivacy by designrdquo approachBy design the technique does not need to store anydirect personal information about mobile users thereis no need for nonanonymized identifiers demo-graphics geographics psychographics etc In addi-tion the storage of indirect information about userscan be severely limited as well The method doesnot need the actual locationsmdashonly anonymized loca-tion ldquokeysrdquo So for example IP addresses can bereplaced with random numbers15 without affectingthe GSN performance (European Commission 2014Federal Trade Commission 2012) More technically atthe ldquoouter wallrdquo of the system or firm each deviceid can be irreversibly hashed to a random key Theonly requirement is that the same device be hashedto the same key if encountered again Similarly atthe ldquoouter wallrdquo of the system or firm every loca-tion also can be irreversibly hashed to a random keyThe GSN can be formed just the same with the ran-dom keys as with the actual locations If more privacy

14 httpwwwclickzcomclickznews233699680-of-marketers-will-run-cross-channel-marketing-campaigns-in-2014-study15 See both the FTCrsquos and the European Commissionrsquos 2010ndash2011privacy-by-design reports and the comments therein along withProvost et al (2009) for a thorough treatment of this topic

is desired hashing can be done irreversibly many toone and in that case it becomes impossible to asso-ciate definitively any particular location with any par-ticular deviceuser

Finally an aspect of this research that is importantfor managers but not typically covered in the predic-tive modeling literature is the notion of increasing thereach of a campaign For example as discussed abovepossibly the most straightforward use of the GSN isto select the same actor on different mobile devicesThis would allow us to expand the reach of a ldquoretar-getingrdquo campaign16

Increasing reach has important subtleties in the cur-rent advertising ecosystemmdashit is not just the otherside of the coin of increasing lift The reason is thatthere are many targeters in the onlinemobile adver-tising ecosystem all of whom are using the samedata retargeting data and demographicgeographicpsychographic data that are purchased from third-party data providers However these data are avail-able only on a subset of devices Thus all partieswho are using these data are competing for the samesometimes small set of devices Since large advertis-ers typically contract with multiple targeting firmswho take different strategies the effect is that theadvertiser is paying the firms to compete against eachother effectively raising the price in the auction andthereby raising their own cost of advertising17 Whatis more these will be the same devices that also willbe targeted for other campaigns because they are thedevices for which the targeters have data

However by connecting devices in a GSN we canexpand campaigns to devices for which there are noretargeting or third-party data available at all If thereis less competition for these devices we should beable to target them for a lower price Thus expand-ing reach has an implication for the cost-effectivenessof achieving a certain level of predictive performance(eg lift)

AcknowledgmentsThe authors thank the anonymous reviewers and edi-tors for very constructive suggestions which substantiallyimproved the paper This work was done while the firstauthor was at Coriolis Ventures and Everyscreen MediaThe authors extend our gratitude to Everyscreen MediaDstillery Lauren Moores Tom OrsquoReilly Vinayak Javaly andTina Eliassi-Rad for many discussions about mobile adver-tising and mobile ad targeting as well as technical assistance

16 Retargeting is to target browsers who have previously purchasedfrom the brand or who have taken some other indicative brandaction such as browsing the brandrsquos site Retargeting is consid-ered by many to be one of the most effective targeting strategies(although to our knowledge these conclusions are drawn based onassessing conversion rate rather than the influence of the advertis-ing please see Stitelman et al 2011)17 The latter is our conjecture

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 265

to obtain all of the relevant data This particular data set wasnot necessarily used in the development of any productionmodel used for mobile advertising The first author thanksNEC and Andre Meyer for faculty fellowships The authorsthank the Moore and Sloan Foundations for their generoussupport of the Moore-Sloan Data Science Environment atNew York University

ReferencesAgarwal R Gupta AK Kraut RE (2008) Editorial overviewmdashThe

interplay between digital and social networks Inform SystemsRes 19(3)243ndash252

Aral S Muchnik L Sundararajan A (2009) Distinguishing influence-based contagion from homophily-driven diffusion in dynamicnetworks Proc National Acad Sci USA 106(51)21544ndash21549

Bampo M Ewing MT Mather DR Stewart D Wallace M (2008)The effects of the social structure of digital networks on viralmarketing performance Inform Systems Res 19(3)273ndash290

Cho E Myers SA Leskovec J (2011) Friendship and mobility Usermovement in location-based social networks Proc 17th ACMSIGKDD Internat Conf Knowledge Discovery and Data MiningKDD rsquo11 (ACM New York) 1082ndash1090

Cortes C Pregibon D Volinsky CT (2001) Communities of interestHoffmann F Hand DJ Adams N Fisher D Guimaraes G edsAdvances in Intelligent Data Analysis Lecture Notes ComputSci Vol 2189 (Springer-Verlag Berlin Heidelberg) 105ndash114

Crandall DJ Backstrom L Cosley D Suri S Huttenlocher DKleinberg J (2010) Inferring social ties from geographic coinci-dences Proc National Acad Sci USA 107(52)22436ndash22441

de Montjoye YA Hidalgo CA Verleysen M Blondel VD (2013)Unique in the crowd The privacy bounds of human mobilitySci Rep 31ndash5

European Commission (2014) Data protection day 2014 Full speedon EU data protection reform Press release January 27httpeuropaeurapidpress-release_MEMO-14-60_enhtm

Fawcett T (2006) An introduction to ROC analysis Pattern Recogni-tion Lett 27(8)861ndash874

Federal Trade Commission (2012) Protecting consumer pri-vacy in an era of rapid change FTC Report March 2012httpswwwftcgovsitesdefaultfilesdocumentsreportsfederal-trade-commission-report-protecting-consumer-privacy-era-rapid-change-recommendations120326privacyreportpdf

Heidemann J Klier M Probst F (2010) Identifying key users inonline social networks A pagerank based approach InformSystems J 4801(December)157ndash160

Hill S Provost F (2003) The myth of the double-blind reviewAuthor identification using only citations SIGKDD Explorations5(2)179ndash184

Hill S Provost F Volinsky C (2006) Network-based marketingIdentifying likely adopters via consumer networks Statist Sci22(2)256ndash276

Hotho A Nuumlrnberger A Paass G (2005) A brief survey of text min-ing LDV Forum 20(1)19ndash62

Kerho S (2012) Mobile marketingmdashA new analytics frameworkwhat we have and what we need Presentation Marketing onthe Move Conference Wharton School University of PennsylvaniaPhiladelphia February 28

Kossinets G Watts DJ (2009) Origins of homophily in an evolvingsocial network Amer J Sociol 115(2)405ndash450

Lee J (2014) 80 of marketers will run cross-channel marketingcampaigns in 2014 ClickZ (March 31) httpwwwclickzcomclickznews233699680-of-marketers-will-run-cross-channel-marketing-campaigns-in-2014-study

Martens D Provost F (2011) Pseudo-social network targeting fromconsumer transaction data Working paper CEDER-11-05 SternSchool of Business New York University New York

McPherson M Smith-Lovin L Cook JM (2001) Birds of afeather Homophily in social networks Annual Rev Sociol 27415ndash444

Oinas-Kukkonen H Lyytinen K Yoo Y (2010) Social networks andinformation systems Ongoing and future research streamsJ Assoc Inform Systems 11(2)61ndash68

Pan W Aharony N Pentland A (2011) Composite social networkfor predicting mobile apps installation Burgard W Roth Deds Proc AAAI 2011 (AAAI Press San Francisco)

Perlich C Dalessandro B Raeder T Stitelman O Provost F (2014)Machine learning for targeted display advertising Transferlearning in action Machine Learn 95(1)103ndash127

Pogue D (2011) Wrapping up the apple location brouhaha NewYork Times (April 28) httppogueblogsnytimescom20110428wrapping-up-the-apple-location-brouhaha

Provost F Fawcett T (2013) Data Science for Business What You Needto Know About Data Mining and Data-analytic Thinking (OrsquoReillyMedia Inc Sebastopol CA)

Provost F Dalessandro B Hook R Zhang X Murray A (2009) Audi-ence selection for on-line brand advertising Privacy-friendlysocial network targeting KDD rsquo09 Proc 15th ACM SIGKDDInternat Conf Knowledge Discovery and Data Mining (ACMNew York) 707ndash716

Pubmatic (2010) Understanding real-time bidding (RTB) from thepublisherrsquos perspective Technical report Pubmatic SiliconValley CA

Quercia D Lathia N Calabrese F Di Lorenzo G Crowcroft J (2010)Recommending social events from mobile phone location dataProc 2010 IEEE Internat Conf Data Mining ICDM rsquo10 (IEEEComputer Society Washington DC) 971ndash976

Raeder T Stitelman O Dalessandro B Perlich C Provost F (2012)Design principles of massive robust prediction systems Proc18th ACM SIGKDD Internat Conf Knowledge Discovery and DataMining (ACM New York) 1357ndash1365

Stitelman O Dalessandro B Perlich C Provost F (2011) Estimatingthe effect of online display advertising on browser conversionData Mining and Audience Intelligence for Advertising ADKDD2011 (ACM New York) 8

Stitelman O Perlich C Dalessandro B Hook R Raeder T Provost F(2013) Using co-visitation networks for detecting large scaleonline display advertising exchange fraud Proc 19th ACMSIGKDD Internat Conf Knowledge Discovery and Data Mining(ACM New York) 1240ndash1248

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Page 5: Finding Similar Mobile Consumers with a Privacy-Friendly ...pages.stern.nyu.edu/~fprovost/Papers/PMM_ISR_2015.pdf · Provost, Martens, and Murray: Finding Similar Mobile Consumers

Provost Martens and Murray Finding Similar Mobile Consumers246 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

institution interests community and even shoppinghabits Thus we conjecture that geosimilarity tar-geting may combine the advantages of geographictargeting discussed in the previous paragraph withadvantages similar to content-affinity social targetingwhich has been shown to be effective specifically foronline advertising (Provost et al 2009 Perlich et al2014) We might call this ldquolocale-affinityrdquo social tar-geting The empirical results below show that indeedonersquos geosimilarity neighbors are much more likely tofrequent the same online publishers and mobile apps

If that were not enough research provides yetanother reason to expect that geosimilarity targetingmay be especially effective In an article in the Pro-ceedings of the National Academy of Sciences of the UnitedStates of America Crandall et al (2010 p 22436) showthat geographic co-occurrences between individualsare very strongly predictive of the individuals beingfriends ldquoThe knowledge that two people were prox-imate at just a few distinct locations at roughly thesame times can indicate a high conditional proba-bility that they are directly linked in the underlyingsocial networkrdquo This means that a GSN would notonly capture the advantages of geographic targetingand locale-affinity targeting but might also incorpo-rate actual social-network targeting also shown to beextremely effective for marketing (Hill et al 2006)In fact when massive descriptive data are availablethe (usually latent) similarity between social-networkneighbors has been shown to explain much of themarketing advantage previously attributed to socialinfluence (Aral et al 2009)

Moreover interestingly research also has shown thatthe homophily (McPherson et al 2001) that has beenused to explain the effectiveness of social-network tar-geting may actually be due largely to the constraintsplaced by opportunity (Kossinets and Watts 2009)Tie formation in social networks is biased heavilyby triadic closure and thus by structural proximityOver many generations of tie formation biases in theselection of structurally proximate individuals ldquocanamplify even a modest preference for similar othersvia a cumulative advantage-like process to producestriking patterns of observed homophilyrdquo (Kossinetsand Watts 2009 p 405) Why is this important for thepresent paper Because with the exception of sociallinks formed solely online (like new Facebook-onlyfriends) constraints on opportunity are framed byconstraints on physical colocation As Kossinets andWatts (2009 p 406) observe ldquoan individualrsquos choiceof relations is heavily constrained by other aspects ofhis or her life such as geographical location choice ofoccupation place of workrdquo These are exactly the sortsof links that would be represented in a GSN based onmobile device location data (including tablets and lap-tops) Thus the GSN may in addition reveal a large

driver of homophily and thus expand the effective-ness of a social-network targeting strategy to individ-uals who also would have been similar ldquofriendsrdquo butfor whatever reason were not chosen

3 Geosimilarity GeosimilarityNetworks and Mobile Ad Targeting

The basis for the targeting design is to create a net-work among users based on geosimilarity and thento use the network for inference For this paper wewill create a GSN directly among user instances5 Theelements of the network-targeting design include thefollowing How will entities (mobile user instances)be represented What exactly will be the geosimilaritylinks And how exactly will the predictive inferencebe conducted

In this section we will discuss the design at a highlevel including some further-looking elements thatare not part of the experiments but which providecompleteness for the discerning reader To make thediscussion more concrete we first present the moti-vating data scenario The design should generalize tomobile settings beyond this specific data scenario

31 The Mobile Real-Time Bidding EcosystemA high-level overview of how an RTB exchange worksis given in Figure 1 RTB exchanges bring togetheradvertisers and the publishers of Web pages andapps When a user visits a Web pageapp (here-after ldquoWeb pagerdquo) that sells ad slots through RTBexchanges a bid request is created Such a bid requestwill provide additional data on the user and Webpage visited such as the IP address publisher anddevice type (see Table 1 for a list of fields and statis-tics on their availability across RTB exchanges) Basedon such data advertisers can choose to bid For exam-ple a brand may want to target users that previ-ously visited their automobile website browsing for aecofriendly hybrid car The highest bidder is allowedto load a potentially personalized ad for exampleshowing an ad for an ecofriendly hybrid All of thisoccurs in real time in less than 100 milliseconds Thetwo interrelated problems this paper addresses canbe stated specifically in the context of this examplehow does the targeter find instances of this individ-ual who previously visited the brandrsquos website Canthe targeter find users with very similar interests (egthe focal userrsquos husband best friend) under the pre-sumption that they might also be reasonable candi-dates for an ad for an ecofriendly hybrid

5 An alternative is to create a bipartite network between users andlocations and draw inferences directly from the bipartite networkWe do not investigate that explicitly in this paper however theinterested reader should take a careful look at the similarity calcu-lations we present below

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 247

Figure 1 (Color online) Working of Real-Time Bidding

AdvertisersAd

Consumers

Publisher hasad slots for sale

At IP X is on website YBrand 1

51c

14c

33cBrand 2

Brand 3

3

Bids

Bid request

Loading of ad

2

with device type ZBidding is open 4

1

Notes When a user visits a Web page that sells ad slots through an RTB exchange (1) a bid request is created (2) after which several advertisers bid to allowto show an ad at that moment on that Web page to that user (3) The highest bidder is allowed to load a potentially personalized ad (4)

For this paper we address the data currently avail-able in the digital advertising ecosystem Specificallyvia RTB exchanges we can observe a massive numberof mobile users along with the data made availablefor advertisers to decide whether to bid on the userfor a particular campaign (see eg Pubmatic 2010)Two aspects of the data are crucial (1) Each instanceof a mobile user is associated with a key when thiskey is observed in the future we know that the futuredata item involves the same mobile device6 (2) Partof the data associated with each mobile device is alocation For this paper we consider that location tobe the (anonymized) IP address of the Wi-Fi network7

currently in use (although there are various sorts oflocation data) We observe the locations visited byeach user for the data records included in the researchdata set (described in more detail in sect51)

The mobile RTB ecosystem holds different infor-mation than for the typical online (desktop) worldwhich is important for our design The GSN designmeasures similarity in location visitation behaviorOne might wonder if other behavioral data can beobtained from the RTB systems such as website vis-itation data (Provost et al 2009 Raeder et al 2012)Table 1 shows which typical fields (besides the key)are visible on mobile devices over seven different RTBexchanges Let us consider the different data fields aspossible sources of behavioral data

First let us have a look at the fields that are avail-able everywhere the time and date of the bid request(Created) the RTB exchange (Network) the dimen-sions of the ad (Dims) the device type (Device) the

6 The notion of an instance is important Many instances of the samedevice may have different keys in the online advertising ecosystemAs discussed above connecting these is one focus of this research7 For this paper we will consider transient IP addresses that are notexplicitly removed such as those belonging to 3G or other carriernetworks (but not identified as such) to be noise in the data thatmust be accommodated by the similaritytargeting design Betterremoval techniques could increase the strength of the results pre-sented in the next sections

user agent the publisher and the IP address TheIP address (anonymized) is the location we use in ourGSN design Figures 2(a) and 2(b) show the distri-bution of the number of users seen at an IP addressand the number of IP addresses seen per user respec-tively (more information on this data set is providedlater in sect51) On average a user visits 166 IP ad-dresses and an IP address is visited on average bytwo users (over a time period of 10 days see sect51)Although in our data set some known 3G IP addresses(subranges) have already been removed Figure 2(a)shows that high-volume IP addresses are still presentSome of these IP addresses with a very large numberof unique users seen correspond to large institutionsthink for example of a university Wi-Fi address

Although the publisher data item is available for allbid requests we observe that most device instancesvisit only a single publisher (about 95 whereasless than 1 visit three or more publishers andonly 00009 are seen on 10 or more publishers)The publisher-specific Interactive Advertising Bureau(IAB) categorizations (Iabcats) suffer from the sameproblem (as well as being too coarse-grained to beuseful for identifying the same user in any event)The reason is that as mentioned above a mobiledevice has different identifiers when viewed throughdifferent channels in the mobile advertising ecosys-tem such as different RTB systems or even differentapps On the other hand a publisher has on average1366 users that visited it and publishers by and largeseem to be RTB specific Figures 3(a) and 3(b) showmore details of the distributions As a consequenceusing publisher-visitation data to connect users is notbroadly helpful because two user instances in dif-ferent RTB systems rarely share a publisher The IPaddress is always available and is passed consistentlythrough the different apps and RTB systems

The other consistently available fields (such as useragent device type dimensions) are always the samefor a single device These could in principle be used toimprove same-device finding but are only marginally

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers248 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

Table 1 Data Availability for Mobile Devices Across Different RTB Systems

RTB 1 () RTB 2 () RTB 3 () RTB 4 () RTB 5 () RTB 6 () RTB 7 ()

Created 100 100 100 100 100 100 100Network 100 100 100 100 100 100 100Dims 100 100 100 100 100 100 100ID 100 100 100 36 0 19 44Device 100 100 100 100 100 100 100User agent 100 100 100 100 100 100 100Iabcats 0 100 100 98 100 100 84Geo 100 16 100 100 100 100 100Location 0 4 56 42 1 20 100Referrer 8 0 0 17 0 6 3Publisher 100 100 100 100 100 100 100URL 100 0 0 5 93 0 0IP 100 100 100 100 100 100 100

Notes Specifically the table shows how often the following fields are provided time of bid request (Created) the RTB network identifier(Network) dimensions of the ad (Dims) identifier of the device (ID) device type (Device) the user agent IAB categorization (Iabcats)the IP-inferred city state or country (Geo) the location latitude and longitude (Location) the referrer from where the user comes(Referrer) the publisher of the Web page (Publisher) the URL and the IP address

helpful for finding the same user on different devicesor for similar-user finding (eg using Apple productsversus Android products) we do not take advantageof them in this paper instead focusing specificallyon assessing the ability of the geosimilarity design

Figure 2 Distribution of (a) Number of Unique User Instances Seen per IP Address and (b) Number of IP Addresses a User Instance Logs Into

(b) Number of IPs per ID

Number of IPs

Fre

quen

cy o

f occ

uren

ce

100

108

107

106

105

104

103

102

100

101

101 102 103 104 105 106 107 108

(a) Number of IDs per IP

Fre

quen

cy o

f occ

uren

ce

108

107

106

105

104

103

102

100

101

Number of IDs

100 101 102 103 104 105

Figure 3 Distribution of (a) Number of Unique User Instances Seen per Publisher and (b) Number of Publishers a User Instance Visits

(a) Number of IDs per publisher

Number of IDs

Fre

quen

cy o

f occ

uren

ce

100

104

103

102

100

101

101 102 103 104 105 106 107

(b) Number of publishers per ID

Fre

quen

cy o

f occ

uren

ce

108

107

106

105

104

103

102

100

101

Number of publishers

100 101 102 103 104

(as opposed to how to best engineer a same-user-finding system)

The URL field is missing in all but two RTB ex-changes The Geo field is often limited to the citystate and country as inferred from the IP address

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 249

Location data in the form of latitude and longitude(Location) are only available in some RTB systemsand the accuracy of the latitudelongitude data in thecurrent mobile ecosystem is questionable8 again mak-ing it not suitable for broad use

32 Entity and Link RepresentationWe represent each user instance by a ldquoprofilerdquo ofbehavior across locations where we define a locationprofile as a vector of visit frequencies This will neces-sarily require a sparse representation as we observe avery large total set of locations fortunately each indi-vidual is seen at only a small subset of the locationsFor the empirical results below the location profile isa sparse vector of frequencies of observed visitationsto locations (which can be weighted in the calculationof similarity as described below)

The structure of the GSN is the graph with mobiledevices as the nodes and links between the devicesas the edges The corresponding design decisionsinvolve the selection of which devices to link at alland the weights to place on the links The simplestlink representation places unweighted links betweenall pairs of devices that share at least one specific loca-tion More sophisticated weighting criteria are basedon two notions which are discussed next and illus-trated with a simple example in Figure 4 The loca-tion data are obtained by listening to RTB ecosystemswhere IP addresses of devices are broadcast (as shownpreviously in Figure 1)

1 Given two devices that share at least one loca-tion the weight of the link between them can bea function of the devicesrsquo location profiles Such mea-sures can include simply the number of shared loca-tions and the number of shared locations weightedby their visitation percentage in the profile Alterna-tively the weighting could be a measure of similaritybetween the two location profiles for example cosinesimilarity

2 As such Devices A B and C are connected in theGSN of Figure 4 since they all logged into the Wi-Fiof the Met Museum Devices B and C also loggedinto the public Wi-Fi at Times Square and thereforeshould be more strongly connected than A and B (or Band C) which share only one location

8 Analyses of the latitudelongitude data show various data prob-lems including flipped coordinates coordinates set equal to eachother many latitudinallongitudinal points at the centroids of com-mon geographic areas and many impossible or highly improba-ble coordinates Mobile ad industry insiders claim that the highprice paid for latitudelongitude-based hyperlocal targeting hasled to a significant amount of latitudelongitude fraud We havefound nothing written on this phenomenon although data fraudis not uncommon in the online advertising industry (Stitelmanet al 2013)

Figure 4 (Color online) Basic Illustration of the Creation of a GSNBased on Mobile Location Data Observed for Three MobileDevices (A B and C) and a Laptop (D)

Met MuseumA B

Geosocial network(GSN)

A C

D

B

C

D

Times Square

Apartment

Note Devices are connected in the GSN if they visited (logged into) the samelocation (IP address)

3 Links can be weighted according to the componentlocationsrsquo 4lack of5 popularity For example it mightbe argued that the link formed by sharing a loca-tion with a small number of other devices (eg myapartment) indicates a stronger geosimilarity thansharing a location with a massive number of otherdevices This intuition can be extended if two devicesspend a lot of time at such an unpopular locale(eg my apartment) then that should indicate a verystrong similarity If two devices have approximatelythe same distributional profile across several of thesesites (their ldquolocation profilerdquo) then they are eitherthe same person or close friendssoulmates In ourexample C and D are connected because they bothlogged into the Wi-Fi of an apartment whereas Band C are connected by logging into the public (pop-ular) Wi-Fi at Times Square Hence the connectionbetween C and D (which are likely devices belongingto roommates or belonging to the same user) shouldbe stronger than the connection between B and C(which are likely devices belonging to two differenttourists)

This notion of similarity can be incorporated tech-nically by (1) adapting notions from information re-trieval we can weight locations for a given deviceby their device-specific popularity (from the devicersquoslocation profile) divided by the (log of) the locationrsquosoverall popularity let us call that LF times IDF (locationfrequency times inverse device frequency) Then (2) thestrength of the link between two devices that sharea location would be a function of the correspondingLF times IDF scores

Once we have one or more GSNs there are variousways to take advantage of them for targeting or otherinference The simplest method for inference is sim-ply to target all of the geosimilar network neighborsAlternatively one could target the ldquoclosestrdquo neighborsbased on one or more notions of geosimilarity (as dis-

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers250 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

cussed above) These are the methods used for theempirical results presented below

4 Specifics of the GSN DesignAs described above in this design the strength of thelink between two individuals in the GSN is based onthe similarity in the distribution of the locations thatthey visit For this paperrsquos results locations will be(anonymized) IP addresses We now define differentsimilarity measures of varying complexity which willbe evaluated in the following sections

41 NotationThe location profile of a user is represented by a vec-tor Ex of length m where element i denotes the num-ber of visits to IP address IPi2 Ex isin m Vector Exbool isa binary vector of size m where element i denoteswhether the user has visited IPi For similarity com-putations both the Hadamard product () and theinner product 4 middot 5 are used and the minimum (min)operator is defined to operate componentwise on twovectors as well as on a vector and a scalar The max-imum (max) and average (avg) operators are similarTheir logic is defined in Equations (1)ndash(4) The weightof an IP address is defined by its popularity in anal-ogy to the inverse document frequency used in text

Table 2 Similarity Metrics Between Two Userrsquos Location Visit Distributions Ex1 and Ex2

Similarity metric Range d4 Ex11 Ex25

1 scount4 Ex11 Ex25= Exbool11 middot Exbool12 6015 1

2 sbool4 Ex11 Ex25= min4scount4 Ex11 Ex25115 80119 1

3 scount1W 4 Ex11 Ex25= 4 Ew Exbool115 middot Exbool12 6015 103

4 sfreq1min4 Ex11 Ex25= min4 Ex11 Ex251 6015 2

5 sfreq1max4 Ex11 Ex25= max4 Exbool12 Ex11 Exbool11 Ex251 6015 3

6 sfreq1 avg4 Ex11 Ex25= avg4 Exbool12 Ex11 Exbool11 Ex251 6015 205

7 sfreq1min1W 4 Ex11 Ex25= min4 Ex1 Ew1 Ex2 Ew51 6015 206

8 sfreq1max1W 4 Ex11 Ex25= max44 Exbool12 Ew5 Ex11 4 Exbool11 Ew5 Ex251 6015 309

9 sfreq1 avg1W 4 Ex11 Ex25= avg44 Exbool12 Ew5 Ex11 4 Exbool11 Ew5 Ex251 6015 3025

10 scosine4 Ex11 Ex25=Ex1 middot Ex2

Ex11 middot Ex21

60117 0074

11 scosine1W 4 Ex11 Ex25=4 Exbool11 Ew5 middot 4 Exbool12 Ew5

Exbool11 Ew1 middot Exbool12 Ew1

60117 0060

12 sJaccard4 Ex11 Ex25=scount4 Ex11 Ex25

max4 Exbool111 Exbool1251

60117 0033

13 sJaccard1W 4 Ex11 Ex25=scount1W 4 Ex11 Ex25

max4 Exbool11 Ew1 Exbool12 Ew51

60117 0026

14 sJaccard1 freq4 Ex11 Ex25=sfreq1min4 Ex11 Ex25

max4 Ex1 Exbool121 Ex2 Exbool1151

60117 0067

15 sJaccard1 freq1W 4 Ex11 Ex25=sfreq1min1W 4 Ex11 Ex25

max44 Ex1 Exbool125 Ew1 4 Ex2 Exbool115 Ew51

60117 0067

mining (Hotho et al 2005) the weight wi for IPi (theinverse device frequency) is defined as the logarithmof the total number of users n divided by the numberof (unique) users that visit that specific IP address ni

Ez= Ex Ey1 zi = xi times yi1 (1)

a= Ex middot Ey =sum

i=12m

4xi times yi51 (2)

Ez= min4Ex1 Ey51 zi = min4xi1yi51 (3)

Ez= min4Ex1a51 zi = min4xi1 a51 (4)

wi = log104nni50 (5)

42 Link Strength MetricsFifteen similarity metrics are defined in Table 2 andare illustrated by measuring the strength betweenthe location profiles of two example users Ex1 and Ex2shown in Equation (6) The first user visits IP1 threetimes and IP2 twice whereas the second one visits IP1twice and IP3 once Several variants of the same met-rics are considered As described above some onlytake into account the unique number of shared loca-tions whereas others take into account the frequencyof the visits to these locations and the inverse devicefrequency The latter are denoted by freq and W respectively

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 251

The first similarity metric scount counts the numberof shared locations Since the two example users shareonly IP1 this strength is 1 The IDF weighted versionscount1W sums the IDF values of the shared locationswhich is 13 for our simple example The most basicmetric is given by sbool which is 1 when a location isshared and 0 otherwise indicating whether two usersare neighbors or not

The sfreq1min sfreq1max and sfreq1avg metrics compare thefrequencies of the visits to the shared locations Theytake the minimum maximum and average respec-tively of the frequencies for each shared location andsum them The weighted versions sfreq1min1W sfreq1max1W and sfreq1avg1W do the same but weight each frequencywith the corresponding IDF value

The scosine metric measures the similarity by takingthe cosine of the angle between the two vectors TheJaccard metric is defined as the size of the intersec-tion over the size of the union In this case it countsthe number of shared locations but normalizes theseover the total number of locations both users havevisited Users that visit many locations will have ahigher chance to share some location with anotheruser so the Jaccard metric penalizes them For ourbasic example sJaccard is 1 (IP1) over 3 (IP1 IP2 IP3)Once more variants taking into account the frequencyand the IDF are defined as well

Ex1 = 63 2 071

Ex2 = 62 0 171 (6)

Ew = 6103 107 270

5 Experimental Setup51 Data SetThe data set D for the empirical results is 10 days ofanonymized advertising bid requests from mobile de-vices (smartphones tablets etc) observed across sevenRTB exchanges A total of 322770794 bid requestsare observed coming from 42437559 unique userinstances Over these 10 days we observe 41829088unique IP addresses

The number of unique user instances per RTBexchange is given in Table 3 The distributions of thenumber of bid requests per device type and operatingsystem are given in Tables 4 and 5 which show thatthe majority of bid requests come from smartphonesand the iOS operating system

A user (instance) corresponds to an exchange-assigned user ID One person on several deviceswill hence correspond to several users One deviceon several exchanges will often correspond to sev-eral users Also it might be that one ad exchangesometimes provides different user IDs to the same

Table 3 User Instances per RTB Exchange

Exchange Volume Percentage ()

RTB 1 311751979 6RTB 2 211591686 4RTB 3 1415661098 29RTB 4 2015011243 41RTB 5 212751015 5RTB 6 510551815 10RTB 7 212231418 4

Table 4 Distribution of Bid Requests per Device Type

Type Volume Percentage ()

Phone 29213581772 91Tablet 3010161334 9

Table 5 Distribution of Bid Requests per OperatingSystem

Type Volume Percentage ()

Android 12913151617 4001iOS 19219231648 5908Windows 5311529 002

device for example when the device is using differ-ent particular apps As described above an impor-tant task in online advertising is to be able to targetadvertisements to different instances of the same indi-vidual with reasonable probability regardless of thecomplexities of identification in the baroque onlineadvertising ecosystem Figure 5 visualizes a sample oflatitudes and longitudes broadcast by mobile devicesin our data set

52 SamplingFor the results in this paper we apply the GSN anal-ysis to data samples Each sample corresponds to acohortneighborhood and is built as follows A sin-gle (random) ldquoegordquo user is chosen user X aroundwhich a neighborhood is built Every neighbor of theego user must share at least one IP address so webegin by obtaining all (anonymized) IP addresses vis-ited by user X Next all other users that visited atleast one of these IP addresses (the neighbors) areadded The ego must have at least one neighbor to beincluded in this analysis Finally we obtain the com-plete location distribution of each of these neighborsWhen the next sample is chosen we ensure that noneof the users already included in the previous samplesis taken as the ego user (X) As such we ensure thatwe do not have repeated egondashneighbor dyads Theremay be shared neighbors of two different egos buttwo egos will never be immediate neighbors in thedata sample

The result of the sampling over 500 cohorts is summa-rized in Figure 6 on average an ego has 71 neighbors

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers252 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

Figure 5 Scatter Plots of Samples of Latitudes and Longitudes Broadcast by Mobile Devices

ndash130 ndash120 ndash110 ndash100 ndash90 ndash80 ndash70 ndash6020

25

30

35

40

45

50

55(a)

Long

Lat

ndash20 ndash10 0 10 20 30 40 50

Long

30

35

40

45

50

55

65

60

75

70

Lat

ndash200 ndash150 ndash100 ndash50 1000 50 150 200

Long

ndash100

ndash80

ndash60

ndash40

ndash20

0

40

20

100

60

80

Lat

4068ndash7406 ndash7404 ndash7402 ndash7400 ndash7394ndash7398 ndash7396 ndash7392 ndash7390

Long

4070

4072

4074

4076

4080

4078

4086

4082

4084

Lat

(b)

(c) (d)

Note Note that locations are plotted without reference to any map The figure gives a striking picture of population density across the world and makes uswonder what is going on with mobile devices in Antarctica

Figure 6 (Color online) The Distribution of the Number of Neighborsfor 500 Randomly Sampled Ego Users

0 500 1000 1500 2000 2500 3000 35000

50

100

150

200

250

300

350

400

450

No neighbors

Fre

quen

cy

the median being 1 The maximum number of neigh-bors is 3167 this user probably visited a publicWi-Fi that thousands of other users also logged intoWe chose 500 cohorts as we observe that the statis-

tics become quite stable as seen in Figure 7 Even200 cohorts seems to be enough to obtain stablesample statistics

The life span of user instances in this set of500 cohorts is shown by the cumulative frequency inFigure 8 Observe that only about 40 of the userinstances are seen at different hours About 30 ofthe user instances are seen only once in the sampledweek This illustrates that it is vital to understand thenotion of user instances These percentages are mis-leading if we interpret the user instances as individu-als or devices One individualdevice without stableidentification in the online advertising ecosystem willcreate many transient user instances which will driveup the percentage of user instances that are seen onlyonce or that are seen only within a short timeframeWe have not figured out how to determine or esti-mate reliably how many ldquounstablerdquo usersdevices arerepresented by the user instances in these data Wekeep all of the data for the analyses below (rather thanfiltering very short-lived user instances) to maintainrealism

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 253

Figure 7 Evolution of Several Sample Statistics as an Increasing Number of Cohorts are Created

0 100 200 300 400 5000

10

20

30

40

50

60

70

Average No network neighbors

0 100 200 300 400 5000

2

4

6

8

10

12

Median No network neighbors

0 100 200 300 400 5000

50

100

150

200

250

300

350

Average No IPs

0 100 200 300 400 5000

2

4

6

8Median No IPs

Notes The horizontal axis denotes the number of cohorts By 500 cohorts the statistics have become stable

Figure 8 (Color online) Cumulative Frequency of Life Span ofUser Instances

100 101 102 103 104 105 1060

01

02

03

04

05

06

07

08

09

10

x = lifespan (sec)

F(x

)

End of June (mobile only)

larr 1 min larr 1 hour larr 1 day

Cumulative frequency of user instances

6 Study 1 Connecting Instances ofthe Same Individual

Returning to our motivating application as men-tioned previously an important task in mobile adver-tising is to be able to target a particular individual inthe face of the individualrsquos fragmentation into differ-ent user instances (for example as seen through dif-ferent bidding systems) Advertisers understand that

it may be impossible to find the same user with 100accuracy Nonetheless it often is important for themto reach the userrsquos various instances without target-ing too many others9 The GSN is designed in partbased on the hypothesis that different instances of thesame person would visit similar locations and there-fore they will be connected in the GSN In this sectionwe present results examining whether and to whatextent that is indeed the case Furthermore the resultsquantify to what extent the same user receives a highGSN-similarity to herself as compared to her otherneighbors

Three experimental settings are used to assess judg-ments that two users are the same The first two sim-ulate different user instances by splitting up the bidrequests of known user instances The last one is notsimulated it links users via the recently introducedidentifier for advertising (IFA) observed over severalRTB exchanges for a subset of device instances

1 Random We divide the IP address visits of a userrandomly across two simulated users Table 6 showsfor example how we can divide a userrsquos IP visits across

9 The astute reader may notice that if the other users targeted arevery similar to the target user the advertising may be well placeddespite not being shown to the exact target user That is not thesubject of this study Study 2 shows evidence that these two notionsare not independent when using the GSN

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers254 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

Table 6 Randomly Splitting User X rsquos IP VisitsCreating Two New Artificial Users X1 and X2

IP1 IP2 IP3

X 1 5 0X1 0 3 0X2 1 2 0

Table 7 The Volume of IFA Occurrence Within OurData Set

Volume

RTB 1 119401855RTB 2 113831962RTB 3 7891292

Table 8 The Frequency of Unique IFAs Visible AcrossRTB Exchanges

Frequency

RTB 1RTB 2 3061594 (1016)RTB 2RTB 3 451256 (169)RTB 3RTB 1 181417 (085)

two new users X1 and X2 for which we know thatthey are actually the same individual Results withRandom should be used only as a ceiling on perfor-mance because the results will be overly optimistic

2 Temporal A more realistic split-up of the trans-actions is a temporal split-up where first the middleday of Xrsquos transactions is determined (hence dynam-ically chosen for each user) All IP address visits upto this middle day are assigned to X1 and all trans-actions after the middle day are assigned to X2 Itwill only be applied to users who are seen on multi-ple days

3 Identifier for advertising IFA is an advertisingidentifier from Apple that users can change or asknot to be used for advertising and is seen as aprivacy-friendlier version of the controversial fixeddevice ID (the ldquoUDIDrdquo) The IFA is observed in threeexchanges with volumes shown in Table 7 for exam-ple 1940855 unique IFAs are seen in RTB 110 Whenwe observe the same IFA across different exchangeswe now know that two devices in different exchangesactually are the same The frequency of unique IFAsvisible across exchanges is shown in Table 8 For exam-ple a total of 3018223 IFAs are seen when we look atboth RTB 1 and RTB 2 with 306594 of these (1016)seen on both exchange 1 and exchange 2 What ifthere were not an IFA and the device would havehad a different identifier in each exchange To what

10 For confidentiality RTB 1 in this setting is not necessarily thesame as RTB 1 in sectsect4 or 7

extent are these two device instances then connectedin the GSN

With these different ldquogold standardrdquo methods toindicate whether or not two users are the samewe assess the ability of the geosimilarity metricsto link the same user determining how often twoinstances of the same person are connected in theGSN (sect61) Afterward we shall assess the differentGSN strength metricsrsquo ability to rank users determin-ing to what extent two instances of the same personexhibit strong geosimilarity as compared to the otherneighbors (sect62) For the random and temporal split-ups 200 samples are used For the IFA-based methodthe complete population is used for measuring thedegree of connectedness to the same user to analyzethe relative strength of the connected users 200 sam-ples are used

61 Same Individual Different ldquoScreensrdquoWe now present results on the GSN connectednessmeasured as how often two instances of the same userare connected in the GSN (in percentage) Note thatthe results summarized in Table 9 should be inter-preted in light of the limited number of days in build-ing the research data set which will limit the GSNconnectivity Nonetheless the relatively high connec-tivities observed even with this sample provide quitepromising results which lends considerable supportto the merit of the overall design

Random0 Using the random split two users thatcorrespond to the same individual are not connectedin only 9 of the neighborhoods Moreover in thesecases user X (whose IP visits are split up) usuallyhas only one or two entries and the random split-upresults in no visits for one of the two simulated users

Temporal0 The temporal split-up is more realisticthan the random one where we assume that a personuses simulated device 1 in the first half of the week(eg during the weekend) and simulated device 2 inthe second half of the week (eg during the week)

Users X1 and X2 are not connected in 32 of theneighborhoods Whereas the random split-up was toooptimistic the temporal one is too pessimistic as inreality there likely is some overlap in the time peri-ods of use of two different devices If user X visitsan IP address only in the first days of the week this

Table 9 To What Extent is the Same User Connectedto Herself

connected

Random 91Temporal 68IFAmdashRTB 1RTB 2 67IFAmdashRTB 2RTB 3 82IFAmdashRTB 3RTB 1 73

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 255

IP address will not occur in user X2rsquos transactions andreduces the chance to be connected to X1

IFA0 We look at all combinations of the three ex-changes with IFA identifiers For example there are45256 unique IFAs seen on both RTB 2 and RTB 3 In82 of these the two IFAs visit at least one same IP onboth exchanges and hence are connected in our GSN

The bottom line is that even with this limited slice ofonline behavior different instances of the same indi-vidual are connected in the GSN 70ndash80 of the timeEncouragingly the results from the simulated scenar-ios concur with the results from the real (IFA) sce-narios (including the aforementioned optimism andslight pessimism of the two simulated scenarios)Consistency over these different settings gives addi-tional confidence in the results suggesting that theGSN indeed holds promise for a high degree of suc-cess at targeting the same user across the fragmenta-tion of the online advertising ecosystem

62 Ranking User InstancesNext given that two user instances of the same indi-vidual are indeed connected we assess to what extentthe second instance of the same user is ranked highlyamong its network neighbors based on the weightof their linkage in the GSN We first will discuss themost general results using the best ranking methodand then will look across different ranking methods

We decompose the results based on the numberof network neighbors (NNs) since there are twoboundary cases that require special interpretationTables 10ndash14 show the proportions of the ego net-works that fall into three scenarios First a user mayhave only one NN In this scenario the ranking isperfect (trivially)mdashthe other instance of the same useris the only user instance given a nonzero score As

Table 10 In What Portion of Cases Is the Same UserConnected Depending on the Number ofNetwork NeighborsmdashRandom Split-up

Ego has only 1 NN 3104Ego has exactly 2 NNs 1905

Same user ranked first 8005Ego has more than 2 NNs 4901

Note For the scenario with two NNs the table reports thepercentage of cases where the other known instance of thesame user is ranked first (including ties)

Table 11 In What Portion of Cases Is the Same UserConnected Depending on the Number ofNetwork NeighborsmdashTemporal Split-up

Ego has only 1 NN 3008Ego has exactly 2 NNs 1807

Same user ranked first 8309Ego has more than 2 NNs 5005

Table 12 In What Portion of Cases Is the Same UserConnected Depending on the Number ofNetwork NeighborsmdashIFA RTB 1RTB 2Split-up

Ego has only 1 NN 4401Ego has exactly 2 NNs 1201

Same user ranked first 8201Ego has more than 2 NNs 4400

Table 13 In What Portion of Cases Is the Same UserConnected Depending on the Number ofNetwork NeighborsmdashIFA RTB 2RTB 3Split-up

Ego has only 1 NN 6109Ego has exactly 2 NNs 1907

Same user ranked first 7102Ego has more than 2 NNs 1804

Table 14 In What Portion of Cases Is the Same UserConnected Depending on the Number ofNetwork NeighborsmdashIFA RTB 3RTB 1Split-up

Ego has only 1 NN 5305Ego has exactly 2 NNs 2501

Same user ranked first 6304Ego has more than 2 NNs 2104

shown in the tables this scenario accounts for 30ndash60 of the ego networks (depending on the split-upmethod)

The second boundary case is when the ego has ex-actly two NNs This scenario accounts for 10ndash25of the cases depending on the split-up method Herethe tables also report whether the other known userinstance is ranked first (including ties) among thetwo neighbors The other known instance of the sameuser is ranked first 60ndash80 of the time dependingon the split-up scenario The IFA results are affectedstrongly by the particular RTB settingmdashthis may indi-cate that the publishers on the different RTBs have dif-ferent policies for passing the IFA Importantly theseresults are conservativemdashpossibly very conservativeThe other NNs may also be instances of the same userunbeknownst to usmdashfor example in the IFA casebecause an app does not pass the IFA to the RTB orbecause the other user is the same user on a differentdevice and thus does not share the IFA

The final scenario is when more than two NNsare present As shown in the tables this accounts for20ndash50 of the cases depending on the split-up sce-nario In this case we measure the percentile in theranking where the instance of the known same userfalls (see below) as well as the area under the receiveroperating characteristic curve (AUC) (Fawcett 2006)which is equivalent to the MannndashWhitneyndashWilcoxon(MWW) statistic The AUCMWW measures to what

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers256 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

Table 15 When There Are More Than Two Network Neighbors to What Extent Is the Known Same User StronglyConnected to Itself

AUC (if gt 2 NNs) () Average rank percentile (if gt 2 NNs) ()

Ranking metric R T RTB 12 RTB 23 RTB 31 R T RTB 12 RTB 23 RTB 31

1 78 64 68 59 55 29 48 34 43 542 59 58 67 58 54 45 53 34 44 553 79 69 74 61 61 29 44 29 40 484 86 75 69 61 59 24 40 32 41 505 83 70 71 59 63 26 44 31 43 476 85 74 71 60 65 25 41 31 42 457 86 76 73 62 64 24 40 29 40 468 83 70 72 59 65 26 44 30 43 459 85 74 71 60 67 25 41 30 42 44

10 79 69 69 61 63 29 44 32 41 4711 79 69 71 62 64 44 44 30 41 4612 81 68 68 60 62 28 46 34 42 4913 81 70 71 61 64 27 44 30 40 4614 74 67 66 61 48 34 47 35 41 6015 74 67 66 61 48 34 47 35 41 60

Note See the text for a discussion of ranking metrics

extent data points with label 1 are ranked higherthan those with label 0 In the results that follow(see Table 15) the AUC is measured for each samplewhere any user instance among the neighbors thatcorresponds to the same individual is labeled as 1and all others are labeled as 0 Thus in this settingthe AUC measures how well the neighbors are rankedby the likelihood of being the same individual11 Theaverage over all samples is reported If for a sampleall neighbors have the same score that AUC is setto 05 In addition to the factors noted above this alsowill tend to make the results conservative For exam-ple if the GSN links an ego user to a set of neighborsthat are all instances of the same individual in thisevaluation the AUC will be 05

The ranking measure is the percentile in the rank-ing at which the known-same-user instance is rankedwith lower values being better For example if thesame-user instance has the highest score (closest con-nection) among five neighbors the rank is 20 (15)Note again that this is quite conservative in thisexample the user is ranked at the top of the list butbecause there were only five neighbors the highestpossible score is 20 In case the known-same-userinstance has the same score as another neighbor theaverage rank is reported

As shown in Table 15 all AUC values exceed 05ranging up to 08 This means that in every casea known-same-user instance is likely to be rankedhigher than an instance not known to be the same

11 Technically in this setting the AUCMWW is the probability thata randomly selected neighbor that is in fact a known instance of thesame user is ranked more highly than a randomly selected neighborthat is not known to be an instance of the same user If the known-same-user instances are all ranked above the other-user instancesthe AUC = 100 If they are all below then the AUC = 0

user The frequency-based ranking metrics (4ndash9) per-form quite well in connecting instances of the sameuser strongly

Almost all metrics perform better than a randommodel with performances that even come close to thebest possible solution Consistently performing quitewell are the frequency-based metrics (metrics 3ndash9)with metric 7 (the minimum of the IDF weights ofthe shared locations) getting the most wins Not per-forming well is the very basic count metric as well asthe Boolean metric that provides a binary score onlyInteresting to note the Jaccard metrics do not performwell either It seems that penalizing users that visitmany IP addresses negatively affects the results anissue we will return to in the next study

7 Study 2 Does the GSN Select Userswith Similar Interests

The foregoing section addressed our paperrsquos firstclaimmdashthat the GSN design indeed links instances ofthe same user and links them more strongly thaninstances of other users In this section we turn to thepaperrsquos second claimmdashthat the GSN links users withsimilar interests We will measure similar interests bysimilarity in behavior accessing particular publisherson the Web and using mobile apps12

71 Connecting Users with Similar Interests

To What Extent Are the GSN Neighbors of Visi-tors to Particular Publishers Also Visitors to ThoseSame Publishers To assess whether the GSN con-nects users with similar interestsbehavior we select

12 From now on all user instances will be complete real user in-stances the simulated split-up scenarios (Random Temporal) werecreated specifically for the purpose of the same-user study

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 257

Table 16 Distribution of Publishers per RTB in Terms of thePercentage of Users in Our Data Set That Visit the GivenPublisher

Publisher of D

RTB 1Burstly 3004myYearbookcom 1062My Fitness Pal 0065TextMe Inc 0056GameResort 0025Flixster 0019247Sports 0015Daily Workout Apps LLC 0009Mobile Deluxe 0007YoYo Games Ltd 0004

RTB 2mworldnowcom 0032mtmzcom 0024mtopixcom 0014babycentercom 0015meetmecom 0006appsfacebookcom 0006itunesapplecom 0039appevitecom 0002beautyandskincarescom 0003mmajunkiecom 0001

RTB 3Conversion Exchange 5053Pinger Inc 2096Top Game Developer 4029PremiumEntertainmentApp-FamilyOriented-Android 0059Enflick 0053Clapfoot Games 0044FunPokes Inc 0018Talkatone 0015Sevenlogics Inc 0008A Star Software 0007

RTB 4Pinger PhonemdashiOSmdashConversation 1001Talkatone iPhone App 0007DiceWithBuddies 00139GAG Reader 0012Video Downloader Pro Lite 0016Rage Comics 0009Spades 3D Lite 0006Apalabrados 0003Relax Melodies HD 0002Celeb MemdashPhotoMaker 0000

a diverse collection of 40 publishers listed in Table 16A publisher can correspond to a mobile app a web-site or a set of appswebsites We selected both pub-lishers that are visited by many users and more nichepublishers with fewer visitors As can be seen fromTable 16 several types of publishers can be distin-guished the main groups are games (eg GameRe-sort Top Game Developer) and social networkingtools (eg appsfacebookcom Text Me Inc)

More specifically we create 200 samples for 40 pub-lishers 10 publishers for each of four different RTB

exchanges listed in Table 1613 The percentage ofunique users that visit each publisher in our completedata set is also shown (D) To create each samplean ldquoegordquo user is selected that has visited the pub-lisher in question (an ldquoego visitorrdquo) this ego visi-torrsquos GSN neighborhood is created Next we assesshow many of the egorsquos neighbors also visited thatsame publisher This will be compared with a base-line visit rate to produce lift and leverage values(Provost and Fawcett 2013)mdashspecifically how muchmore likely is it for the neighbors of ego visitors tovisit the publisher than would be expected by chanceWe consider three different setups that produce dif-ferent baselines

1 Considering all RTB exchanges to determine the base-line A user is considered a visitor for a given pub-lisher if she visits the publisher on any RTB ex-change (since some publishers are visible on differentexchanges) Hence the baseline visit rate is the num-ber of users (on any RTB) who visit that publisherdivided by the total number of users across all RTBexchanges This will provide optimistic lifts sincemost publishers appear only on one RTB exchangewhereas the baseline is measured over all exchanges

2 Considering one RTB exchange only to determine thebaseline A user is considered a visitor for a given pub-lisher and RTB if she visits the publisher on the sameRTB exchange This changes the baseline to the num-ber of users (on this RTB) who visit that publisherdivided by the total number of users on this RTBexchange The lift and leverage results will be worsethan in the first setting because the baseline will belarger Seeing that some publishers are seen on differ-ent RTB exchanges this setting is slightly pessimistic

3 Taking time of location visits into account In thethird setting we link users only if they visit an IPwithin the same time period This will be elaboratedon in sect73

Results are shown in Tables 17 and 18 The percent-age of publisher visitors in the neighborhood is givenas N This percentage is compared with the baselinepercentage of users that visit that publisher in thecomplete data D using two metrics the lift is the ratioof these numbers the leverage is the difference Liftis particularly useful for very small base rates (con-sider that an activity with a base rate of 30 cannotpossibly have a lift higher than about 3) Leverage ismore telling for larger base rates because it shows theabsolute improvement (here in percentage points)

When considering all NNs the lifts for all publish-ers are greater than 1 except for one Burstly on RTB 1has a lift of 097 (essentially no lift) but only when weconsider RTB 1 as the baseline This publisher already

13 For confidentiality the RTB numbering here is again differentfrom above

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers258 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

Table 17 RTBs 1 and 2 Percentage of Network Neighbors That Also Visit the Publisher

RTB 1

All NNs N1 N10

D N Lift Lev N1 Lift Lev N10 Lift Lev

Considering all RTB exchanges to determine baseline1 Burstly 3004 8063 2084 5059 47022 15054 44018 25061 8043 220572 myYearbookcom 1062 25077 15094 24015 44074 27067 43012 35016 21074 330543 My Fitness Pal 0065 4015 6038 305 3609 56077 36025 15055 23091 14094 TextMe Inc 0056 1805 33006 17094 39002 69075 38046 25018 45 240625 GameResort 0025 4093 19077 4068 18052 74026 18027 11086 47058 110626 Flixster 0019 1055 8015 1036 8014 42085 7095 1079 9042 1067 247Sports 0015 27057 188058 27042 77062 530093 77048 74053 509075 740388 Daily Workout Apps LLC 0009 1092 21047 1083 19023 215046 19014 11054 129027 110459 Mobile Deluxe 0007 1087 28067 108 23053 36009 23046 9063 14707 9056

10 YoYo Games Ltd 0004 3052 99058 3049 21057 609064 21053 13 367045 12096Considering this RTB exchange only to determine baseline

1 Burstly 8085 8063 0097 minus0022 47022 5033 38037 25061 2089 160762 myYearbookcom 4071 25077 5047 21006 44074 905 40003 35016 7046 300453 My Fitness Pal 1089 4015 2019 2025 3609 19048 35001 15055 8021 130654 TextMe Inc 1063 1805 11035 16087 39002 23094 37039 25018 15045 230555 GameResort 0073 4093 6078 402 18052 25049 17079 11086 16033 110146 Flixster 0055 1055 208 0099 8014 14071 7059 1079 3023 10247 247Sports 0043 27057 64073 27014 77062 182023 7702 74053 174097 74018 Daily Workout Apps LLC 0026 1092 7037 1066 19023 73095 18097 11054 44037 110289 Mobile Deluxe 0019 1087 9084 1068 23053 123088 23034 9063 5007 9044

10 YoYo Games Ltd 001 3052 34018 3042 21057 209025 21047 13 126012 1209

RTB 2

Considering all RTB exchanges to determine baseline1 mworldnowcom 0032 13027 40086 12095 54039 167042 54006 31065 97044 310332 mtmzcom 0024 6089 28086 6065 23008 9607 22084 10068 44075 100443 mtopixcom 0014 15007 104079 14093 53045 37106 5303 33075 234065 330614 babycentercom 0015 6087 47031 6072 43048 299056 43033 17074 122024 17065 meetmecom 0006 6002 10308 5097 54072 942093 54066 27085 47909 270796 appsfacebookcom 0006 301 48039 3003 9009 141097 9003 2076 43008 20697 itunesapplecom 0039 25077 65071 25038 51043 131012 51004 29049 75018 290098 appevitecom 0002 48028 21212043 48025 54076 21509068 54074 48028 21212043 480259 beautyandskincarescom 0003 18019 642087 18016 8048 299094 8046 1107 413049 11067

10 mmajunkiecom 0001 1036 201064 1035 26067 3195505 26066 13033 11977074 13033Considering this RTB exchange only to determine baseline

1 mworldnowcom 4034 13027 3006 8093 54039 12053 50005 31065 7029 270312 mtmzcom 3019 6089 2016 307 23008 7024 19089 10068 3035 70493 mtopixcom 1092 15007 7084 13015 53045 27081 51053 33075 17056 310834 babycentercom 1094 6087 3054 4093 43048 22042 41054 17074 9015 15085 meetmecom 0078 6002 7077 5025 54072 70057 53094 27085 35092 270076 appsfacebookcom 0086 301 3062 2024 9009 10063 8024 2076 3022 1097 itunesapplecom 5024 25077 4092 20053 51043 9081 46019 29049 5063 240258 appevitecom 0029 48028 165058 47098 54076 187082 54047 48028 165058 470989 beautyandskincarescom 0038 18019 48011 17081 8048 22045 8011 1107 30095 11032

10 mmajunkiecom 0009 1036 15009 1027 26067 296003 26058 13033 148001 13024

Notes Results are given for the complete data set (D) the complete neighborhood of a user who visited the given publisher (All NNs) and considering theneighborhood of closest 1 and 10 users (N1 and N10)

has a very high baseline and as we will see in thenext section when we consider only the top 1 and 10NNs the lifts go up to 5 and 3 respectively Burstlyprovides a generic platform for building apps and sois a weaker indication of user interest than many ofthe other publishers (fitness babies games)

All other lifts exceed one by substantial mar-gins and go up to even 2391 (Apalabrados onRTB 4 a Spanish-language social Scrabble-like game)

Obviously the conclusion that the NNs indeed aremore likely to visit the same publishers is highly sig-nificant by a simple sign test comparing the visita-tion rate within the geosimilarity neighborhood tothat of overall population either via lift or leveragep lt 0001 Thus NNs of publisher visitors are indeedmore likely also to visit the same websites the abso-lute values of the lifts and leverages suggest that theyare substantially more likely

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 259

Table 18 RTBs 3 and 4 Percentage of Network Neighbors That Also Visit the Publisher

RTB 3

All NNs N1 N10

D N Lift Lev N1 Lift Lev N10 Lift Lev

Considering all RTB exchanges to determine baseline1 Conversion Exchange 5053 34054 6024 29001 64071 11069 59017 50077 9018 450242 Pinger Inc 2096 46002 15053 43006 65012 21097 62015 51048 17037 480523 Top Game Developer 4029 47024 11002 42095 60 13099 55071 48097 11042 440684 PremiumEntertainmentApp-FamilyOriented-Android 0059 23079 40012 2302 20045 34049 19086 22025 37052 210665 Enflick 0053 11071 21095 11017 21057 40044 21004 12027 23 110736 Clapfoot Games 0044 8055 19032 8011 11054 26006 1101 13099 31061 130557 FunPokes Inc 0018 11011 61028 10093 50 275077 49082 15079 87008 150618 Talkatone 0015 13013 88051 12098 31011 209076 30096 17091 120078 170779 Sevenlogics Inc 0008 2071 32059 2063 5088 70074 508 8033 100022 8025

10 A Star Software 0007 3033 44086 3025 23008 311024 23 9009 122061 9002Considering this RTB exchange only to determine baseline

1 Conversion Exchange 11045 34054 3002 23009 64071 5065 53025 50077 4043 390322 Pinger Inc 6013 46002 705 39089 65012 10062 58098 51048 8039 450353 Top Game Developer 8088 47024 5032 38036 60 6076 51012 48097 5052 400094 PremiumEntertainmentApp-FamilyOriented-Android 1023 23079 19038 22057 20045 16066 19023 22025 18013 210025 Enflick 101 11071 1006 1006 21057 19054 20046 12027 11011 110166 Clapfoot Games 0092 8055 9033 7064 11054 12059 10062 13099 15027 130087 FunPokes Inc 0038 11011 2906 10074 50 133022 49062 15079 42007 150418 Talkatone 0031 13013 42076 12082 31011 101034 3008 17091 58035 170619 Sevenlogics Inc 0017 2071 15074 2054 5088 34017 5071 8033 48041 8016

10 A Star Software 0015 3033 21067 3017 23008 150036 22092 9009 59023 8094

RTB 4

Considering all RTB exchanges to determine baseline1 Pinger PhonemdashiOSmdashConversation 1001 71081 70085 7008 80077 79068 79076 70014 6902 690132 Talkatone iPhone App 0007 11038 17203 11031 29003 439068 28097 17002 257078 160963 DiceWithBuddies 0013 7041 59004 7028 18018 144092 18006 7069 61031 70574 9GAG Reader 0012 62013 497099 62 49009 39305 48097 49005 393017 480925 Video Downloader Pro Lite 0016 20003 123015 19087 22058 138084 22042 17093 110027 170776 Rage Comics 0009 609 8001 6081 13033 154086 13025 609 8001 60817 Spades 3D Lite 0006 1205 225076 12044 27027 492057 27022 1205 225076 120448 Apalabrados 0003 77042 21391071 77039 75 21316097 74097 79032 21450043 790299 Relax Melodies HD 0002 31058 21002029 31056 50 31170029 49098 31058 21002029 31056

10 Celeb MemdashPhotoMaker 0 3023 82107 3022 7014 11819048 7014 3023 82107 3022Considering this RTB exchange only to determine baseline

1 Pinger PhonemdashiOSmdashConversation 8051 71081 8044 6303 80077 9049 72026 70014 8024 610632 Talkatone iPhone App 0055 11038 20053 10082 29003 52038 28048 17002 30071 160473 DiceWithBuddies 0009 7041 82011 7032 18018 201054 18009 7069 85027 7064 9GAG Reader 0097 62013 63097 61016 49009 50055 48012 49005 50051 480085 Video Downloader Pro Lite 0044 20003 45036 19059 22058 51014 22014 17093 40062 170496 Rage Comics 0037 609 18079 6053 13033 36032 12097 609 18079 60537 Spades 3D Lite 0046 1205 2609 12004 27027 58068 26081 1205 2609 120048 Apalabrados 0021 77042 371093 77021 75 360031 74079 79032 381006 790119 Relax Melodies HD 0013 31058 238054 31045 50 377069 49087 31058 238054 31045

10 Celeb MemdashPhotoMaker 0003 3023 97089 3019 7014 216076 7011 3023 97089 3019

Notes Results are given for the complete data set (D) the complete neighborhood of a user who visited the given publisher (All NNs) and considering theneighborhood of closest 1 and 10 users (N1 and N10)

The very high lifts could be attributed to two phe-nomena They may provide some evidence that theGSN embeds a true social network as pointed out bythe good results for the social networking publishers(eg Apalabrados with a lift of 2391 appsevitecomwith a lift of 2212 and others) Friends visiting thesame social networking sites (here unusual ones)

would increase the lift and leverage over just havingdifferent instances of the same user visiting the samewebsites An alternative explanation is that some ofthese publishers are much more likely to be visitedacross different devices of the same user It is not clearwhether this is actually true However it argues forobtaining specific and broad data on which devices

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers260 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

are not the same user to shed further light on forexample any broad data that include a (anonymized)user identifier

From these results we can conclude that GSNneighbors indeed exhibit substantially similar pub-lisher visitation behavior Next we assess to whatextent more strongly connected NNs are more similarthan less strongly connected NNs

72 Ranking Users with Similar Interests

Are the More-Similar Geosimilarity Neighbors of aPublisherrsquos Visitors Even More Likely Also to Be Vis-itors to the Same Publisher To assess whether more-similar geosimilarity neighbors are even more likelyto share interests we compute the measures exactlyas in the previous study except instead of consider-ing all NNs only the 1 (N1) and 10 (N10) neighborswith the highest scores are considered (see Tables 17and 18) First we must consider which of the 15 met-rics that are defined to measure link strength to useTable 19 reports the rank aggregations of the differentmetrics as to which provide the best lifts Specificallyfor each of the 40 publishers each scoring metric pro-vides a lift and for that publisher the scoring metricscan be ranked (the best being ranked 1 etc) The rankaggregation is the average rank for a scoring metricacross all 40 publishers Thus if one metric providedthe best lift for all publishers it would get an aggre-gated rank of 1

To visualize the best metrics the metrics with anaverage ranking less than 8 are shown in boldfacein Table 19 The Jaccard metrics do not perform wellin this study either showing that penalizing the con-nection to users that visit many locations is not sen-sible Remember that the Jaccard metric is based onthe idea that users that visit many locations will havea higher chance to share some location with anotheruser However we have so many locations in totalthat visiting a couple more locations will increase onlymarginally this probability of sharing a location bychance A user that logs into more IP addresses andis more active should hence not be penalized andranked lower than other less active users Metric 3which sums the IDF scores of the shared locationsperforms quite well Given the operational efficiencyadvantage over the frequency-based metrics this met-ric is chosen to measure the strength of the links

As seen from the results in Tables 17 and 18 thelift and leverage values for the strongly connected (top 1and top 10) NNs are even higher sometimes astro-nomical (3170 for Relax Melodies HD (RTB 4)) Theseresults are highly significant by sign tests compar-ing either the lift or the leverage between the top-ranked neighbors and the entire cohort (p lt 0001) Theabsolute values again indicate that the users with the

strongest geosimilarity are often substantially morelikely to visit the same publisher

Thus we can conclude that not only does geosim-ilarity find users with similar interests it also ranksusers well by their likelihood of having similarinterests

An interesting follow-up question is whether thecharacteristics of a publisher are of importance for theresults For example are GSN neighbors more sim-ilar in terms of using the same social network appas compared to playing the same games To answerthis question the publishers are categorized into sixclasses funny social news games communicationand miscellaneous In Figure 9 the lifts of the pub-lishers are shown per category Some general trendscan be observed where publishers related to funnycontent have the highest median lift This could beexplained by the fact that users often forward funnycontent to their friends The social websitesapps fol-low next which seems to further demonstrate thatfriends are likely linked in the GSN However pub-lishers in the communication category perform not sowell compared with the other categories and ratherhigh variances are observed in the different cate-gories Please note that high variance is observedacross the categories therefore it is difficult to makewell-supported claims from these results

73 TimingAs a final evaluation we explore whether the inclu-sion of temporal similarity in location visitation ishelpful for defining GSN connections To this end weintroduce a setting where two users are connectedonly if they visit an IP address in the same timeperiod of a day Specifically we divide a day intodifferent time windowsmdashfor example two time peri-ods of 12 hoursmdashwhere users are connected only ifthey visit the same IP address in the same time periodof a day (which might be on different days) Practi-cally two digits are added to an IP which denote thestarting moment of the corresponding time periodWe report on time periods of 2 hours and of 12 hours(other periods yield similar results) We also includethe setting where users are connected only when theyvisit the same IP address on the same day in light ofthe result of Crandall et al (2010) that visiting thesame location at around the same time is indicativeof friendship

The resulting lifts are reported in Figures 10ndash12when considering respectively all NNs the top 1 NNand the top 10 NNs The results when not consider-ing the time period are repeated as well (All) lim-ited to the conservative case of only considering oneexchange (see above) The baseline (lift of one) isindicated with a red horizontal line lifts higher than1000 are shown as 1000 to limit the range to be able

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 261

Table 19 Rank Aggregation per Metric Over All Publishers (Across All RTBs)

Top 1 NN Top 10 NNs

1 Shared unique IPs 801 1 Shared Unique IPs 802252 BooleanmdashSharing an IP 80625 2 BooleanmdashSharing an IP 809133 Shared unique IPsmdashIDF weighted 70025 3 Shared unique IPsmdashIDF weighted 709754 Shared IP visitsmdashMin 70063 4 Shared IP visitsmdashMin 704135 Shared IP visitsmdashMax 7095 5 Shared IP visitsmdashMax 70256 Shared IP visitsmdashAvg 8005 6 Shared IP visitsmdashAvg 704887 Shared IP visitsmdashMinmdashIDF weighted 60625 7 Shared IP visitsmdashMinmdashIDF weighted 703758 Shared IP visitsmdashMaxmdashIDF weighted 70975 8 Shared IP visitsmdashMaxmdashIDF weighted 702259 Shared IP visitsmdashAvgmdashIDF weighted 70938 9 Shared IP visitsmdashAvgmdashIDF weighted 70525

10 Cosine similarity 80338 10 Cosine similarity 804511 Cosine similaritymdashIDF weighted 80213 11 Cosine similaritymdashIDF weighted 8086312 GSN Jaccard 80925 12 GSN Jaccard 913 GSN JaccardmdashIDF weighted 80325 13 GSN JaccardmdashIDF weighted 9013814 GSN Jaccard Freq 80425 14 GSN Jaccard Freq 80515 GSN Jaccard FreqmdashIDF weighted 80425 15 GSN Jaccard FreqmdashIDF weighted 805

to visualize the results Tables 20 and 21 show theaverage rank aggregation (see the discussion of rankaggregation above) in terms of lift per RTB and thep-values of a signed rank test respectively

When comparing the different temporal settingswe see that the ldquoDayrdquo variant performs best and lim-ited differences exist between the time periods of 2and 12 hours More important to observe from theseresults is that (overall) not including time in the GSNdesign outperforms all three temporal variants sub-stantially Only for RTB 3 and the Day version whereusers are connected only if they visit the same IPaddress on the same day are the lifts not signifi-cantly worse than having no time constraint at a 1level (although even here they are significantly worseat a 10 level) This shows that for finding userswith similar interests which locations they visit ismore important than when they visit them By includ-ing the time constraints connections between previ-ously connected users are lost simply because they

Figure 9 (Color online) Lifts of the Publishers (for All Four RTBs)Ranked According to the Median Lift Within the Category(Medians Shown by the Full Line)

visit locations in different time periods Although weuse time to define links metrics that include time todefine the strength of a link might improve the resultsfurther and we consider that an interesting issue forfuture research

8 ConclusionThis paper presents a new design for using geosim-ilarity to connect user instances in a geosimilarity-weighted network In the GSN users are connected ifthey share at least one observed location Various met-rics can be used to yield degrees of similarity We pre-sented intuitive arguments theory and prior researchthat suggest that geosimilarity and the GSN designshould link similar users We also provided strongempirical support Specifically Study 1 shows thatthe GSN indeed links different users instances cor-responding to the same individual and the geosimi-larity metrics rank user instances by their likelihoodof corresponding to the same individual Study 2shows that the GSN links users who have similarinterests as measured by their propensity to visitthe same publishersuse the same apps It also pro-vides strong evidence that the geosimilarity metricsrank user instances by their likelihood of havingsimilar interests Taken together the results providestrong support that the GSN links users with similarinterestsmdashin many cases because they actually repre-sent the same individuals

The main limitation of the research presented inthis paper is that the data are not sufficient to dis-tinguish whether the similarity in interests is solelybased on connecting instances of the same user Thisis an important avenue for future research This lim-itation notwithstanding the very strong results fromStudy 2 have important implications in either caseeither the network is indeed finding different indi-viduals with substantially similar interests or we

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers262 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

Figure 10 (Color online) Lifts Considering All Network Neighbors per RTB System (Each Panel) With and Without the Time Constraint Limited toOne Network (see sect71) for the 40 Publishers

RTB 3mdashAll NNs

0 5 10

Publisher

0

20

40

60

80

100

Lift

RTB 4mdashAll NNs

0 5 10

Publisher

0

200

400

600

800

1000

Lift

RTB 2mdashAll NNs

0 5 10

Publisher

0

200

400

600

800

1000

Lift

0 5 100

200

400

600

800

1000RTB 1mdashAll NNs

Publisher

Lift

All

TP2

TP12

Day

Note Lifts higher than 1000 are shown at 1000

Figure 11 (Color online) Lifts Considering the Closest 1 Network Neighbor per RTB System (Each Panel) With and Without the Time ConstraintLimited to One Network (see sect71) for the 40 Publishers

Lift

0

100

200

400

300

RTB 3mdashTop 1 NNs

0 5 10

Publisher

All

TP2

TP12

Day

0 5 100

200

400

600

800

1000RTB 1mdashTop 1 NNs

Publisher

Lift

RTB 2mdashTop 1 NNs

0 5 10

Publisher

0

200

400

600

800

1000

Lift

RTB 4mdashTop 1 NNs

0 5 10

Publisher

0

200

400

600

800

1000

Lift

Note Lifts higher than 1000 are shown at 1000

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 263

Figure 12 (Color online) Lifts Considering the Closest 10 Network Neighbors per RTB System (Each Panel) With and Without the Time ConstraintLimited to One Network (see sect71) for the 40 Publishers

RTB 2mdashTop 10 NNs

0 5 10

Publisher

0

200

400

600

800

1000

Lift

0 5 10

Publisher

Lift

0

50

100

150

0 5 10

Publisher

0

200

400

600

800

1000

Lift

0 5 100

200

400

600

800

1000

RTB 1mdashTop 10 NNs

RTB 3mdashTop 10 NNs RTB 4mdashTop 10 NNs

Publisher

Lift

All

TP2

TP12

Day

Note Lifts higher than 1000 are shown at 1000

have very strong additional support for the rankingresults of Study 1 as the neighbors with the strongestgeosimilarity have significantly higher affinity for thesame publishersapps

These results have broad and immediate manage-ment implications within our motivating applicationof mobile advertising As discussed above digitalmarketers would like to target the many fragmentedinstances of individuals in the digital advertising

Table 20 Rank Aggregations in Terms of Lift AveragedAcross All Publishers (per RTB)

All TP2 TP12 Day

RTB 1 101 209 309 201RTB 2 1 204 304 302RTB 3 101 3 4 109RTB 4 101 208 308 203

Note Linking based on time does not improve over linkingignoring time

Table 21 Evaluation of Including Time p-Values ofSigned Rank Test

All TP2 TP12 Day

RTB 1 mdash 0 0002 0 0002 0 0006RTB 2 mdash 0 0002 0 0002 0 0002RTB 3 mdash 0 0002 0 0002 00065RTB 4 mdash 0 0002 0 0002 0 0004

ecosystem as well as users who have similar intereststo particular ldquoseedrdquo users Location data such asIP addresses are readily available across the adver-tising ecosystem The most direct use of the GSN formobile ad targeting is to seed a targeting campaignwith users chosen to exhibit some characteristics ofinterest These could be chosen through any of themyriad of methods currently used in advertising TheGSN will directly allow the targeting of the othermembers of these usersrsquo geosimilarity cohorts whichwill include other instances of the same user othersimilar users and likely both So for example if a setof users has been identified to have brand affinity viavisits to a brandrsquos website the GSN cohorts of theseusers would provide an attractive avenue to expandthe reach of a campaign to target them (extendingtraditional retargeting) and also target users similarto the seeds (customer prospecting) In the researchdata set we see that different instances of the sameusers are very often connected in the GSN and thatthose connected in the GSN exhibit substantially sim-ilar interests

The GSN also could be used for (privacy-friendly)hyperlocal targetingmdashmeaning targeting people whofrequent a particular location without needing to storedata on the actual locations of the users as describedearlier IP addresses currently are used by somemarketers for coarse-grained demographic targetinginferring geography from IP address registration data

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers264 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

and then inferring demographics from the geographyThe GSN provides a complementary alternative it willconnect people who visit the same fine-grained loca-tions As described above in the couponing exampleif a campaign is seeded with customers of a localestablishment (eg via an online loyalty program)geosimilarity can target others who frequent the samelocations

An exciting potential future use of the GSN is theevaluation of marketing campaigns across differentchannels This has become an important challenge formobile advertizing (Lee 2014)14 For example I mightreceive an ad for a hot new product on my mobiledevice I may be interested in the product but Irsquomunlikely to buy this product on my mobile phoneRather Irsquoll buy the product using my laptop or PC Bylinking the screens of the same user we are now ableto provide a broader (and possibly much more robust)estimate of campaign success (eg conversion rate)across different channels by aggregating the successmetrics across GSN neighbors For example similar tohow a traditional ad campaign may runnot run anad in a controlled study across different cities agen-cies could targetnot target different (matched) usersand then look at the success rates in their respec-tive geosimilarity neighborhoods This should captureeffects of the same individual on different devices

As discussed at the outset advertisers need to becognizant of privacy concerns regarding the collec-tion storage and use of data This paper followswhat the FTC calls the ldquoprivacy by designrdquo approachBy design the technique does not need to store anydirect personal information about mobile users thereis no need for nonanonymized identifiers demo-graphics geographics psychographics etc In addi-tion the storage of indirect information about userscan be severely limited as well The method doesnot need the actual locationsmdashonly anonymized loca-tion ldquokeysrdquo So for example IP addresses can bereplaced with random numbers15 without affectingthe GSN performance (European Commission 2014Federal Trade Commission 2012) More technically atthe ldquoouter wallrdquo of the system or firm each deviceid can be irreversibly hashed to a random key Theonly requirement is that the same device be hashedto the same key if encountered again Similarly atthe ldquoouter wallrdquo of the system or firm every loca-tion also can be irreversibly hashed to a random keyThe GSN can be formed just the same with the ran-dom keys as with the actual locations If more privacy

14 httpwwwclickzcomclickznews233699680-of-marketers-will-run-cross-channel-marketing-campaigns-in-2014-study15 See both the FTCrsquos and the European Commissionrsquos 2010ndash2011privacy-by-design reports and the comments therein along withProvost et al (2009) for a thorough treatment of this topic

is desired hashing can be done irreversibly many toone and in that case it becomes impossible to asso-ciate definitively any particular location with any par-ticular deviceuser

Finally an aspect of this research that is importantfor managers but not typically covered in the predic-tive modeling literature is the notion of increasing thereach of a campaign For example as discussed abovepossibly the most straightforward use of the GSN isto select the same actor on different mobile devicesThis would allow us to expand the reach of a ldquoretar-getingrdquo campaign16

Increasing reach has important subtleties in the cur-rent advertising ecosystemmdashit is not just the otherside of the coin of increasing lift The reason is thatthere are many targeters in the onlinemobile adver-tising ecosystem all of whom are using the samedata retargeting data and demographicgeographicpsychographic data that are purchased from third-party data providers However these data are avail-able only on a subset of devices Thus all partieswho are using these data are competing for the samesometimes small set of devices Since large advertis-ers typically contract with multiple targeting firmswho take different strategies the effect is that theadvertiser is paying the firms to compete against eachother effectively raising the price in the auction andthereby raising their own cost of advertising17 Whatis more these will be the same devices that also willbe targeted for other campaigns because they are thedevices for which the targeters have data

However by connecting devices in a GSN we canexpand campaigns to devices for which there are noretargeting or third-party data available at all If thereis less competition for these devices we should beable to target them for a lower price Thus expand-ing reach has an implication for the cost-effectivenessof achieving a certain level of predictive performance(eg lift)

AcknowledgmentsThe authors thank the anonymous reviewers and edi-tors for very constructive suggestions which substantiallyimproved the paper This work was done while the firstauthor was at Coriolis Ventures and Everyscreen MediaThe authors extend our gratitude to Everyscreen MediaDstillery Lauren Moores Tom OrsquoReilly Vinayak Javaly andTina Eliassi-Rad for many discussions about mobile adver-tising and mobile ad targeting as well as technical assistance

16 Retargeting is to target browsers who have previously purchasedfrom the brand or who have taken some other indicative brandaction such as browsing the brandrsquos site Retargeting is consid-ered by many to be one of the most effective targeting strategies(although to our knowledge these conclusions are drawn based onassessing conversion rate rather than the influence of the advertis-ing please see Stitelman et al 2011)17 The latter is our conjecture

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 265

to obtain all of the relevant data This particular data set wasnot necessarily used in the development of any productionmodel used for mobile advertising The first author thanksNEC and Andre Meyer for faculty fellowships The authorsthank the Moore and Sloan Foundations for their generoussupport of the Moore-Sloan Data Science Environment atNew York University

ReferencesAgarwal R Gupta AK Kraut RE (2008) Editorial overviewmdashThe

interplay between digital and social networks Inform SystemsRes 19(3)243ndash252

Aral S Muchnik L Sundararajan A (2009) Distinguishing influence-based contagion from homophily-driven diffusion in dynamicnetworks Proc National Acad Sci USA 106(51)21544ndash21549

Bampo M Ewing MT Mather DR Stewart D Wallace M (2008)The effects of the social structure of digital networks on viralmarketing performance Inform Systems Res 19(3)273ndash290

Cho E Myers SA Leskovec J (2011) Friendship and mobility Usermovement in location-based social networks Proc 17th ACMSIGKDD Internat Conf Knowledge Discovery and Data MiningKDD rsquo11 (ACM New York) 1082ndash1090

Cortes C Pregibon D Volinsky CT (2001) Communities of interestHoffmann F Hand DJ Adams N Fisher D Guimaraes G edsAdvances in Intelligent Data Analysis Lecture Notes ComputSci Vol 2189 (Springer-Verlag Berlin Heidelberg) 105ndash114

Crandall DJ Backstrom L Cosley D Suri S Huttenlocher DKleinberg J (2010) Inferring social ties from geographic coinci-dences Proc National Acad Sci USA 107(52)22436ndash22441

de Montjoye YA Hidalgo CA Verleysen M Blondel VD (2013)Unique in the crowd The privacy bounds of human mobilitySci Rep 31ndash5

European Commission (2014) Data protection day 2014 Full speedon EU data protection reform Press release January 27httpeuropaeurapidpress-release_MEMO-14-60_enhtm

Fawcett T (2006) An introduction to ROC analysis Pattern Recogni-tion Lett 27(8)861ndash874

Federal Trade Commission (2012) Protecting consumer pri-vacy in an era of rapid change FTC Report March 2012httpswwwftcgovsitesdefaultfilesdocumentsreportsfederal-trade-commission-report-protecting-consumer-privacy-era-rapid-change-recommendations120326privacyreportpdf

Heidemann J Klier M Probst F (2010) Identifying key users inonline social networks A pagerank based approach InformSystems J 4801(December)157ndash160

Hill S Provost F (2003) The myth of the double-blind reviewAuthor identification using only citations SIGKDD Explorations5(2)179ndash184

Hill S Provost F Volinsky C (2006) Network-based marketingIdentifying likely adopters via consumer networks Statist Sci22(2)256ndash276

Hotho A Nuumlrnberger A Paass G (2005) A brief survey of text min-ing LDV Forum 20(1)19ndash62

Kerho S (2012) Mobile marketingmdashA new analytics frameworkwhat we have and what we need Presentation Marketing onthe Move Conference Wharton School University of PennsylvaniaPhiladelphia February 28

Kossinets G Watts DJ (2009) Origins of homophily in an evolvingsocial network Amer J Sociol 115(2)405ndash450

Lee J (2014) 80 of marketers will run cross-channel marketingcampaigns in 2014 ClickZ (March 31) httpwwwclickzcomclickznews233699680-of-marketers-will-run-cross-channel-marketing-campaigns-in-2014-study

Martens D Provost F (2011) Pseudo-social network targeting fromconsumer transaction data Working paper CEDER-11-05 SternSchool of Business New York University New York

McPherson M Smith-Lovin L Cook JM (2001) Birds of afeather Homophily in social networks Annual Rev Sociol 27415ndash444

Oinas-Kukkonen H Lyytinen K Yoo Y (2010) Social networks andinformation systems Ongoing and future research streamsJ Assoc Inform Systems 11(2)61ndash68

Pan W Aharony N Pentland A (2011) Composite social networkfor predicting mobile apps installation Burgard W Roth Deds Proc AAAI 2011 (AAAI Press San Francisco)

Perlich C Dalessandro B Raeder T Stitelman O Provost F (2014)Machine learning for targeted display advertising Transferlearning in action Machine Learn 95(1)103ndash127

Pogue D (2011) Wrapping up the apple location brouhaha NewYork Times (April 28) httppogueblogsnytimescom20110428wrapping-up-the-apple-location-brouhaha

Provost F Fawcett T (2013) Data Science for Business What You Needto Know About Data Mining and Data-analytic Thinking (OrsquoReillyMedia Inc Sebastopol CA)

Provost F Dalessandro B Hook R Zhang X Murray A (2009) Audi-ence selection for on-line brand advertising Privacy-friendlysocial network targeting KDD rsquo09 Proc 15th ACM SIGKDDInternat Conf Knowledge Discovery and Data Mining (ACMNew York) 707ndash716

Pubmatic (2010) Understanding real-time bidding (RTB) from thepublisherrsquos perspective Technical report Pubmatic SiliconValley CA

Quercia D Lathia N Calabrese F Di Lorenzo G Crowcroft J (2010)Recommending social events from mobile phone location dataProc 2010 IEEE Internat Conf Data Mining ICDM rsquo10 (IEEEComputer Society Washington DC) 971ndash976

Raeder T Stitelman O Dalessandro B Perlich C Provost F (2012)Design principles of massive robust prediction systems Proc18th ACM SIGKDD Internat Conf Knowledge Discovery and DataMining (ACM New York) 1357ndash1365

Stitelman O Dalessandro B Perlich C Provost F (2011) Estimatingthe effect of online display advertising on browser conversionData Mining and Audience Intelligence for Advertising ADKDD2011 (ACM New York) 8

Stitelman O Perlich C Dalessandro B Hook R Raeder T Provost F(2013) Using co-visitation networks for detecting large scaleonline display advertising exchange fraud Proc 19th ACMSIGKDD Internat Conf Knowledge Discovery and Data Mining(ACM New York) 1240ndash1248

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Page 6: Finding Similar Mobile Consumers with a Privacy-Friendly ...pages.stern.nyu.edu/~fprovost/Papers/PMM_ISR_2015.pdf · Provost, Martens, and Murray: Finding Similar Mobile Consumers

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 247

Figure 1 (Color online) Working of Real-Time Bidding

AdvertisersAd

Consumers

Publisher hasad slots for sale

At IP X is on website YBrand 1

51c

14c

33cBrand 2

Brand 3

3

Bids

Bid request

Loading of ad

2

with device type ZBidding is open 4

1

Notes When a user visits a Web page that sells ad slots through an RTB exchange (1) a bid request is created (2) after which several advertisers bid to allowto show an ad at that moment on that Web page to that user (3) The highest bidder is allowed to load a potentially personalized ad (4)

For this paper we address the data currently avail-able in the digital advertising ecosystem Specificallyvia RTB exchanges we can observe a massive numberof mobile users along with the data made availablefor advertisers to decide whether to bid on the userfor a particular campaign (see eg Pubmatic 2010)Two aspects of the data are crucial (1) Each instanceof a mobile user is associated with a key when thiskey is observed in the future we know that the futuredata item involves the same mobile device6 (2) Partof the data associated with each mobile device is alocation For this paper we consider that location tobe the (anonymized) IP address of the Wi-Fi network7

currently in use (although there are various sorts oflocation data) We observe the locations visited byeach user for the data records included in the researchdata set (described in more detail in sect51)

The mobile RTB ecosystem holds different infor-mation than for the typical online (desktop) worldwhich is important for our design The GSN designmeasures similarity in location visitation behaviorOne might wonder if other behavioral data can beobtained from the RTB systems such as website vis-itation data (Provost et al 2009 Raeder et al 2012)Table 1 shows which typical fields (besides the key)are visible on mobile devices over seven different RTBexchanges Let us consider the different data fields aspossible sources of behavioral data

First let us have a look at the fields that are avail-able everywhere the time and date of the bid request(Created) the RTB exchange (Network) the dimen-sions of the ad (Dims) the device type (Device) the

6 The notion of an instance is important Many instances of the samedevice may have different keys in the online advertising ecosystemAs discussed above connecting these is one focus of this research7 For this paper we will consider transient IP addresses that are notexplicitly removed such as those belonging to 3G or other carriernetworks (but not identified as such) to be noise in the data thatmust be accommodated by the similaritytargeting design Betterremoval techniques could increase the strength of the results pre-sented in the next sections

user agent the publisher and the IP address TheIP address (anonymized) is the location we use in ourGSN design Figures 2(a) and 2(b) show the distri-bution of the number of users seen at an IP addressand the number of IP addresses seen per user respec-tively (more information on this data set is providedlater in sect51) On average a user visits 166 IP ad-dresses and an IP address is visited on average bytwo users (over a time period of 10 days see sect51)Although in our data set some known 3G IP addresses(subranges) have already been removed Figure 2(a)shows that high-volume IP addresses are still presentSome of these IP addresses with a very large numberof unique users seen correspond to large institutionsthink for example of a university Wi-Fi address

Although the publisher data item is available for allbid requests we observe that most device instancesvisit only a single publisher (about 95 whereasless than 1 visit three or more publishers andonly 00009 are seen on 10 or more publishers)The publisher-specific Interactive Advertising Bureau(IAB) categorizations (Iabcats) suffer from the sameproblem (as well as being too coarse-grained to beuseful for identifying the same user in any event)The reason is that as mentioned above a mobiledevice has different identifiers when viewed throughdifferent channels in the mobile advertising ecosys-tem such as different RTB systems or even differentapps On the other hand a publisher has on average1366 users that visited it and publishers by and largeseem to be RTB specific Figures 3(a) and 3(b) showmore details of the distributions As a consequenceusing publisher-visitation data to connect users is notbroadly helpful because two user instances in dif-ferent RTB systems rarely share a publisher The IPaddress is always available and is passed consistentlythrough the different apps and RTB systems

The other consistently available fields (such as useragent device type dimensions) are always the samefor a single device These could in principle be used toimprove same-device finding but are only marginally

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers248 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

Table 1 Data Availability for Mobile Devices Across Different RTB Systems

RTB 1 () RTB 2 () RTB 3 () RTB 4 () RTB 5 () RTB 6 () RTB 7 ()

Created 100 100 100 100 100 100 100Network 100 100 100 100 100 100 100Dims 100 100 100 100 100 100 100ID 100 100 100 36 0 19 44Device 100 100 100 100 100 100 100User agent 100 100 100 100 100 100 100Iabcats 0 100 100 98 100 100 84Geo 100 16 100 100 100 100 100Location 0 4 56 42 1 20 100Referrer 8 0 0 17 0 6 3Publisher 100 100 100 100 100 100 100URL 100 0 0 5 93 0 0IP 100 100 100 100 100 100 100

Notes Specifically the table shows how often the following fields are provided time of bid request (Created) the RTB network identifier(Network) dimensions of the ad (Dims) identifier of the device (ID) device type (Device) the user agent IAB categorization (Iabcats)the IP-inferred city state or country (Geo) the location latitude and longitude (Location) the referrer from where the user comes(Referrer) the publisher of the Web page (Publisher) the URL and the IP address

helpful for finding the same user on different devicesor for similar-user finding (eg using Apple productsversus Android products) we do not take advantageof them in this paper instead focusing specificallyon assessing the ability of the geosimilarity design

Figure 2 Distribution of (a) Number of Unique User Instances Seen per IP Address and (b) Number of IP Addresses a User Instance Logs Into

(b) Number of IPs per ID

Number of IPs

Fre

quen

cy o

f occ

uren

ce

100

108

107

106

105

104

103

102

100

101

101 102 103 104 105 106 107 108

(a) Number of IDs per IP

Fre

quen

cy o

f occ

uren

ce

108

107

106

105

104

103

102

100

101

Number of IDs

100 101 102 103 104 105

Figure 3 Distribution of (a) Number of Unique User Instances Seen per Publisher and (b) Number of Publishers a User Instance Visits

(a) Number of IDs per publisher

Number of IDs

Fre

quen

cy o

f occ

uren

ce

100

104

103

102

100

101

101 102 103 104 105 106 107

(b) Number of publishers per ID

Fre

quen

cy o

f occ

uren

ce

108

107

106

105

104

103

102

100

101

Number of publishers

100 101 102 103 104

(as opposed to how to best engineer a same-user-finding system)

The URL field is missing in all but two RTB ex-changes The Geo field is often limited to the citystate and country as inferred from the IP address

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 249

Location data in the form of latitude and longitude(Location) are only available in some RTB systemsand the accuracy of the latitudelongitude data in thecurrent mobile ecosystem is questionable8 again mak-ing it not suitable for broad use

32 Entity and Link RepresentationWe represent each user instance by a ldquoprofilerdquo ofbehavior across locations where we define a locationprofile as a vector of visit frequencies This will neces-sarily require a sparse representation as we observe avery large total set of locations fortunately each indi-vidual is seen at only a small subset of the locationsFor the empirical results below the location profile isa sparse vector of frequencies of observed visitationsto locations (which can be weighted in the calculationof similarity as described below)

The structure of the GSN is the graph with mobiledevices as the nodes and links between the devicesas the edges The corresponding design decisionsinvolve the selection of which devices to link at alland the weights to place on the links The simplestlink representation places unweighted links betweenall pairs of devices that share at least one specific loca-tion More sophisticated weighting criteria are basedon two notions which are discussed next and illus-trated with a simple example in Figure 4 The loca-tion data are obtained by listening to RTB ecosystemswhere IP addresses of devices are broadcast (as shownpreviously in Figure 1)

1 Given two devices that share at least one loca-tion the weight of the link between them can bea function of the devicesrsquo location profiles Such mea-sures can include simply the number of shared loca-tions and the number of shared locations weightedby their visitation percentage in the profile Alterna-tively the weighting could be a measure of similaritybetween the two location profiles for example cosinesimilarity

2 As such Devices A B and C are connected in theGSN of Figure 4 since they all logged into the Wi-Fiof the Met Museum Devices B and C also loggedinto the public Wi-Fi at Times Square and thereforeshould be more strongly connected than A and B (or Band C) which share only one location

8 Analyses of the latitudelongitude data show various data prob-lems including flipped coordinates coordinates set equal to eachother many latitudinallongitudinal points at the centroids of com-mon geographic areas and many impossible or highly improba-ble coordinates Mobile ad industry insiders claim that the highprice paid for latitudelongitude-based hyperlocal targeting hasled to a significant amount of latitudelongitude fraud We havefound nothing written on this phenomenon although data fraudis not uncommon in the online advertising industry (Stitelmanet al 2013)

Figure 4 (Color online) Basic Illustration of the Creation of a GSNBased on Mobile Location Data Observed for Three MobileDevices (A B and C) and a Laptop (D)

Met MuseumA B

Geosocial network(GSN)

A C

D

B

C

D

Times Square

Apartment

Note Devices are connected in the GSN if they visited (logged into) the samelocation (IP address)

3 Links can be weighted according to the componentlocationsrsquo 4lack of5 popularity For example it mightbe argued that the link formed by sharing a loca-tion with a small number of other devices (eg myapartment) indicates a stronger geosimilarity thansharing a location with a massive number of otherdevices This intuition can be extended if two devicesspend a lot of time at such an unpopular locale(eg my apartment) then that should indicate a verystrong similarity If two devices have approximatelythe same distributional profile across several of thesesites (their ldquolocation profilerdquo) then they are eitherthe same person or close friendssoulmates In ourexample C and D are connected because they bothlogged into the Wi-Fi of an apartment whereas Band C are connected by logging into the public (pop-ular) Wi-Fi at Times Square Hence the connectionbetween C and D (which are likely devices belongingto roommates or belonging to the same user) shouldbe stronger than the connection between B and C(which are likely devices belonging to two differenttourists)

This notion of similarity can be incorporated tech-nically by (1) adapting notions from information re-trieval we can weight locations for a given deviceby their device-specific popularity (from the devicersquoslocation profile) divided by the (log of) the locationrsquosoverall popularity let us call that LF times IDF (locationfrequency times inverse device frequency) Then (2) thestrength of the link between two devices that sharea location would be a function of the correspondingLF times IDF scores

Once we have one or more GSNs there are variousways to take advantage of them for targeting or otherinference The simplest method for inference is sim-ply to target all of the geosimilar network neighborsAlternatively one could target the ldquoclosestrdquo neighborsbased on one or more notions of geosimilarity (as dis-

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers250 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

cussed above) These are the methods used for theempirical results presented below

4 Specifics of the GSN DesignAs described above in this design the strength of thelink between two individuals in the GSN is based onthe similarity in the distribution of the locations thatthey visit For this paperrsquos results locations will be(anonymized) IP addresses We now define differentsimilarity measures of varying complexity which willbe evaluated in the following sections

41 NotationThe location profile of a user is represented by a vec-tor Ex of length m where element i denotes the num-ber of visits to IP address IPi2 Ex isin m Vector Exbool isa binary vector of size m where element i denoteswhether the user has visited IPi For similarity com-putations both the Hadamard product () and theinner product 4 middot 5 are used and the minimum (min)operator is defined to operate componentwise on twovectors as well as on a vector and a scalar The max-imum (max) and average (avg) operators are similarTheir logic is defined in Equations (1)ndash(4) The weightof an IP address is defined by its popularity in anal-ogy to the inverse document frequency used in text

Table 2 Similarity Metrics Between Two Userrsquos Location Visit Distributions Ex1 and Ex2

Similarity metric Range d4 Ex11 Ex25

1 scount4 Ex11 Ex25= Exbool11 middot Exbool12 6015 1

2 sbool4 Ex11 Ex25= min4scount4 Ex11 Ex25115 80119 1

3 scount1W 4 Ex11 Ex25= 4 Ew Exbool115 middot Exbool12 6015 103

4 sfreq1min4 Ex11 Ex25= min4 Ex11 Ex251 6015 2

5 sfreq1max4 Ex11 Ex25= max4 Exbool12 Ex11 Exbool11 Ex251 6015 3

6 sfreq1 avg4 Ex11 Ex25= avg4 Exbool12 Ex11 Exbool11 Ex251 6015 205

7 sfreq1min1W 4 Ex11 Ex25= min4 Ex1 Ew1 Ex2 Ew51 6015 206

8 sfreq1max1W 4 Ex11 Ex25= max44 Exbool12 Ew5 Ex11 4 Exbool11 Ew5 Ex251 6015 309

9 sfreq1 avg1W 4 Ex11 Ex25= avg44 Exbool12 Ew5 Ex11 4 Exbool11 Ew5 Ex251 6015 3025

10 scosine4 Ex11 Ex25=Ex1 middot Ex2

Ex11 middot Ex21

60117 0074

11 scosine1W 4 Ex11 Ex25=4 Exbool11 Ew5 middot 4 Exbool12 Ew5

Exbool11 Ew1 middot Exbool12 Ew1

60117 0060

12 sJaccard4 Ex11 Ex25=scount4 Ex11 Ex25

max4 Exbool111 Exbool1251

60117 0033

13 sJaccard1W 4 Ex11 Ex25=scount1W 4 Ex11 Ex25

max4 Exbool11 Ew1 Exbool12 Ew51

60117 0026

14 sJaccard1 freq4 Ex11 Ex25=sfreq1min4 Ex11 Ex25

max4 Ex1 Exbool121 Ex2 Exbool1151

60117 0067

15 sJaccard1 freq1W 4 Ex11 Ex25=sfreq1min1W 4 Ex11 Ex25

max44 Ex1 Exbool125 Ew1 4 Ex2 Exbool115 Ew51

60117 0067

mining (Hotho et al 2005) the weight wi for IPi (theinverse device frequency) is defined as the logarithmof the total number of users n divided by the numberof (unique) users that visit that specific IP address ni

Ez= Ex Ey1 zi = xi times yi1 (1)

a= Ex middot Ey =sum

i=12m

4xi times yi51 (2)

Ez= min4Ex1 Ey51 zi = min4xi1yi51 (3)

Ez= min4Ex1a51 zi = min4xi1 a51 (4)

wi = log104nni50 (5)

42 Link Strength MetricsFifteen similarity metrics are defined in Table 2 andare illustrated by measuring the strength betweenthe location profiles of two example users Ex1 and Ex2shown in Equation (6) The first user visits IP1 threetimes and IP2 twice whereas the second one visits IP1twice and IP3 once Several variants of the same met-rics are considered As described above some onlytake into account the unique number of shared loca-tions whereas others take into account the frequencyof the visits to these locations and the inverse devicefrequency The latter are denoted by freq and W respectively

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 251

The first similarity metric scount counts the numberof shared locations Since the two example users shareonly IP1 this strength is 1 The IDF weighted versionscount1W sums the IDF values of the shared locationswhich is 13 for our simple example The most basicmetric is given by sbool which is 1 when a location isshared and 0 otherwise indicating whether two usersare neighbors or not

The sfreq1min sfreq1max and sfreq1avg metrics compare thefrequencies of the visits to the shared locations Theytake the minimum maximum and average respec-tively of the frequencies for each shared location andsum them The weighted versions sfreq1min1W sfreq1max1W and sfreq1avg1W do the same but weight each frequencywith the corresponding IDF value

The scosine metric measures the similarity by takingthe cosine of the angle between the two vectors TheJaccard metric is defined as the size of the intersec-tion over the size of the union In this case it countsthe number of shared locations but normalizes theseover the total number of locations both users havevisited Users that visit many locations will have ahigher chance to share some location with anotheruser so the Jaccard metric penalizes them For ourbasic example sJaccard is 1 (IP1) over 3 (IP1 IP2 IP3)Once more variants taking into account the frequencyand the IDF are defined as well

Ex1 = 63 2 071

Ex2 = 62 0 171 (6)

Ew = 6103 107 270

5 Experimental Setup51 Data SetThe data set D for the empirical results is 10 days ofanonymized advertising bid requests from mobile de-vices (smartphones tablets etc) observed across sevenRTB exchanges A total of 322770794 bid requestsare observed coming from 42437559 unique userinstances Over these 10 days we observe 41829088unique IP addresses

The number of unique user instances per RTBexchange is given in Table 3 The distributions of thenumber of bid requests per device type and operatingsystem are given in Tables 4 and 5 which show thatthe majority of bid requests come from smartphonesand the iOS operating system

A user (instance) corresponds to an exchange-assigned user ID One person on several deviceswill hence correspond to several users One deviceon several exchanges will often correspond to sev-eral users Also it might be that one ad exchangesometimes provides different user IDs to the same

Table 3 User Instances per RTB Exchange

Exchange Volume Percentage ()

RTB 1 311751979 6RTB 2 211591686 4RTB 3 1415661098 29RTB 4 2015011243 41RTB 5 212751015 5RTB 6 510551815 10RTB 7 212231418 4

Table 4 Distribution of Bid Requests per Device Type

Type Volume Percentage ()

Phone 29213581772 91Tablet 3010161334 9

Table 5 Distribution of Bid Requests per OperatingSystem

Type Volume Percentage ()

Android 12913151617 4001iOS 19219231648 5908Windows 5311529 002

device for example when the device is using differ-ent particular apps As described above an impor-tant task in online advertising is to be able to targetadvertisements to different instances of the same indi-vidual with reasonable probability regardless of thecomplexities of identification in the baroque onlineadvertising ecosystem Figure 5 visualizes a sample oflatitudes and longitudes broadcast by mobile devicesin our data set

52 SamplingFor the results in this paper we apply the GSN anal-ysis to data samples Each sample corresponds to acohortneighborhood and is built as follows A sin-gle (random) ldquoegordquo user is chosen user X aroundwhich a neighborhood is built Every neighbor of theego user must share at least one IP address so webegin by obtaining all (anonymized) IP addresses vis-ited by user X Next all other users that visited atleast one of these IP addresses (the neighbors) areadded The ego must have at least one neighbor to beincluded in this analysis Finally we obtain the com-plete location distribution of each of these neighborsWhen the next sample is chosen we ensure that noneof the users already included in the previous samplesis taken as the ego user (X) As such we ensure thatwe do not have repeated egondashneighbor dyads Theremay be shared neighbors of two different egos buttwo egos will never be immediate neighbors in thedata sample

The result of the sampling over 500 cohorts is summa-rized in Figure 6 on average an ego has 71 neighbors

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers252 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

Figure 5 Scatter Plots of Samples of Latitudes and Longitudes Broadcast by Mobile Devices

ndash130 ndash120 ndash110 ndash100 ndash90 ndash80 ndash70 ndash6020

25

30

35

40

45

50

55(a)

Long

Lat

ndash20 ndash10 0 10 20 30 40 50

Long

30

35

40

45

50

55

65

60

75

70

Lat

ndash200 ndash150 ndash100 ndash50 1000 50 150 200

Long

ndash100

ndash80

ndash60

ndash40

ndash20

0

40

20

100

60

80

Lat

4068ndash7406 ndash7404 ndash7402 ndash7400 ndash7394ndash7398 ndash7396 ndash7392 ndash7390

Long

4070

4072

4074

4076

4080

4078

4086

4082

4084

Lat

(b)

(c) (d)

Note Note that locations are plotted without reference to any map The figure gives a striking picture of population density across the world and makes uswonder what is going on with mobile devices in Antarctica

Figure 6 (Color online) The Distribution of the Number of Neighborsfor 500 Randomly Sampled Ego Users

0 500 1000 1500 2000 2500 3000 35000

50

100

150

200

250

300

350

400

450

No neighbors

Fre

quen

cy

the median being 1 The maximum number of neigh-bors is 3167 this user probably visited a publicWi-Fi that thousands of other users also logged intoWe chose 500 cohorts as we observe that the statis-

tics become quite stable as seen in Figure 7 Even200 cohorts seems to be enough to obtain stablesample statistics

The life span of user instances in this set of500 cohorts is shown by the cumulative frequency inFigure 8 Observe that only about 40 of the userinstances are seen at different hours About 30 ofthe user instances are seen only once in the sampledweek This illustrates that it is vital to understand thenotion of user instances These percentages are mis-leading if we interpret the user instances as individu-als or devices One individualdevice without stableidentification in the online advertising ecosystem willcreate many transient user instances which will driveup the percentage of user instances that are seen onlyonce or that are seen only within a short timeframeWe have not figured out how to determine or esti-mate reliably how many ldquounstablerdquo usersdevices arerepresented by the user instances in these data Wekeep all of the data for the analyses below (rather thanfiltering very short-lived user instances) to maintainrealism

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 253

Figure 7 Evolution of Several Sample Statistics as an Increasing Number of Cohorts are Created

0 100 200 300 400 5000

10

20

30

40

50

60

70

Average No network neighbors

0 100 200 300 400 5000

2

4

6

8

10

12

Median No network neighbors

0 100 200 300 400 5000

50

100

150

200

250

300

350

Average No IPs

0 100 200 300 400 5000

2

4

6

8Median No IPs

Notes The horizontal axis denotes the number of cohorts By 500 cohorts the statistics have become stable

Figure 8 (Color online) Cumulative Frequency of Life Span ofUser Instances

100 101 102 103 104 105 1060

01

02

03

04

05

06

07

08

09

10

x = lifespan (sec)

F(x

)

End of June (mobile only)

larr 1 min larr 1 hour larr 1 day

Cumulative frequency of user instances

6 Study 1 Connecting Instances ofthe Same Individual

Returning to our motivating application as men-tioned previously an important task in mobile adver-tising is to be able to target a particular individual inthe face of the individualrsquos fragmentation into differ-ent user instances (for example as seen through dif-ferent bidding systems) Advertisers understand that

it may be impossible to find the same user with 100accuracy Nonetheless it often is important for themto reach the userrsquos various instances without target-ing too many others9 The GSN is designed in partbased on the hypothesis that different instances of thesame person would visit similar locations and there-fore they will be connected in the GSN In this sectionwe present results examining whether and to whatextent that is indeed the case Furthermore the resultsquantify to what extent the same user receives a highGSN-similarity to herself as compared to her otherneighbors

Three experimental settings are used to assess judg-ments that two users are the same The first two sim-ulate different user instances by splitting up the bidrequests of known user instances The last one is notsimulated it links users via the recently introducedidentifier for advertising (IFA) observed over severalRTB exchanges for a subset of device instances

1 Random We divide the IP address visits of a userrandomly across two simulated users Table 6 showsfor example how we can divide a userrsquos IP visits across

9 The astute reader may notice that if the other users targeted arevery similar to the target user the advertising may be well placeddespite not being shown to the exact target user That is not thesubject of this study Study 2 shows evidence that these two notionsare not independent when using the GSN

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers254 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

Table 6 Randomly Splitting User X rsquos IP VisitsCreating Two New Artificial Users X1 and X2

IP1 IP2 IP3

X 1 5 0X1 0 3 0X2 1 2 0

Table 7 The Volume of IFA Occurrence Within OurData Set

Volume

RTB 1 119401855RTB 2 113831962RTB 3 7891292

Table 8 The Frequency of Unique IFAs Visible AcrossRTB Exchanges

Frequency

RTB 1RTB 2 3061594 (1016)RTB 2RTB 3 451256 (169)RTB 3RTB 1 181417 (085)

two new users X1 and X2 for which we know thatthey are actually the same individual Results withRandom should be used only as a ceiling on perfor-mance because the results will be overly optimistic

2 Temporal A more realistic split-up of the trans-actions is a temporal split-up where first the middleday of Xrsquos transactions is determined (hence dynam-ically chosen for each user) All IP address visits upto this middle day are assigned to X1 and all trans-actions after the middle day are assigned to X2 Itwill only be applied to users who are seen on multi-ple days

3 Identifier for advertising IFA is an advertisingidentifier from Apple that users can change or asknot to be used for advertising and is seen as aprivacy-friendlier version of the controversial fixeddevice ID (the ldquoUDIDrdquo) The IFA is observed in threeexchanges with volumes shown in Table 7 for exam-ple 1940855 unique IFAs are seen in RTB 110 Whenwe observe the same IFA across different exchangeswe now know that two devices in different exchangesactually are the same The frequency of unique IFAsvisible across exchanges is shown in Table 8 For exam-ple a total of 3018223 IFAs are seen when we look atboth RTB 1 and RTB 2 with 306594 of these (1016)seen on both exchange 1 and exchange 2 What ifthere were not an IFA and the device would havehad a different identifier in each exchange To what

10 For confidentiality RTB 1 in this setting is not necessarily thesame as RTB 1 in sectsect4 or 7

extent are these two device instances then connectedin the GSN

With these different ldquogold standardrdquo methods toindicate whether or not two users are the samewe assess the ability of the geosimilarity metricsto link the same user determining how often twoinstances of the same person are connected in theGSN (sect61) Afterward we shall assess the differentGSN strength metricsrsquo ability to rank users determin-ing to what extent two instances of the same personexhibit strong geosimilarity as compared to the otherneighbors (sect62) For the random and temporal split-ups 200 samples are used For the IFA-based methodthe complete population is used for measuring thedegree of connectedness to the same user to analyzethe relative strength of the connected users 200 sam-ples are used

61 Same Individual Different ldquoScreensrdquoWe now present results on the GSN connectednessmeasured as how often two instances of the same userare connected in the GSN (in percentage) Note thatthe results summarized in Table 9 should be inter-preted in light of the limited number of days in build-ing the research data set which will limit the GSNconnectivity Nonetheless the relatively high connec-tivities observed even with this sample provide quitepromising results which lends considerable supportto the merit of the overall design

Random0 Using the random split two users thatcorrespond to the same individual are not connectedin only 9 of the neighborhoods Moreover in thesecases user X (whose IP visits are split up) usuallyhas only one or two entries and the random split-upresults in no visits for one of the two simulated users

Temporal0 The temporal split-up is more realisticthan the random one where we assume that a personuses simulated device 1 in the first half of the week(eg during the weekend) and simulated device 2 inthe second half of the week (eg during the week)

Users X1 and X2 are not connected in 32 of theneighborhoods Whereas the random split-up was toooptimistic the temporal one is too pessimistic as inreality there likely is some overlap in the time peri-ods of use of two different devices If user X visitsan IP address only in the first days of the week this

Table 9 To What Extent is the Same User Connectedto Herself

connected

Random 91Temporal 68IFAmdashRTB 1RTB 2 67IFAmdashRTB 2RTB 3 82IFAmdashRTB 3RTB 1 73

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 255

IP address will not occur in user X2rsquos transactions andreduces the chance to be connected to X1

IFA0 We look at all combinations of the three ex-changes with IFA identifiers For example there are45256 unique IFAs seen on both RTB 2 and RTB 3 In82 of these the two IFAs visit at least one same IP onboth exchanges and hence are connected in our GSN

The bottom line is that even with this limited slice ofonline behavior different instances of the same indi-vidual are connected in the GSN 70ndash80 of the timeEncouragingly the results from the simulated scenar-ios concur with the results from the real (IFA) sce-narios (including the aforementioned optimism andslight pessimism of the two simulated scenarios)Consistency over these different settings gives addi-tional confidence in the results suggesting that theGSN indeed holds promise for a high degree of suc-cess at targeting the same user across the fragmenta-tion of the online advertising ecosystem

62 Ranking User InstancesNext given that two user instances of the same indi-vidual are indeed connected we assess to what extentthe second instance of the same user is ranked highlyamong its network neighbors based on the weightof their linkage in the GSN We first will discuss themost general results using the best ranking methodand then will look across different ranking methods

We decompose the results based on the numberof network neighbors (NNs) since there are twoboundary cases that require special interpretationTables 10ndash14 show the proportions of the ego net-works that fall into three scenarios First a user mayhave only one NN In this scenario the ranking isperfect (trivially)mdashthe other instance of the same useris the only user instance given a nonzero score As

Table 10 In What Portion of Cases Is the Same UserConnected Depending on the Number ofNetwork NeighborsmdashRandom Split-up

Ego has only 1 NN 3104Ego has exactly 2 NNs 1905

Same user ranked first 8005Ego has more than 2 NNs 4901

Note For the scenario with two NNs the table reports thepercentage of cases where the other known instance of thesame user is ranked first (including ties)

Table 11 In What Portion of Cases Is the Same UserConnected Depending on the Number ofNetwork NeighborsmdashTemporal Split-up

Ego has only 1 NN 3008Ego has exactly 2 NNs 1807

Same user ranked first 8309Ego has more than 2 NNs 5005

Table 12 In What Portion of Cases Is the Same UserConnected Depending on the Number ofNetwork NeighborsmdashIFA RTB 1RTB 2Split-up

Ego has only 1 NN 4401Ego has exactly 2 NNs 1201

Same user ranked first 8201Ego has more than 2 NNs 4400

Table 13 In What Portion of Cases Is the Same UserConnected Depending on the Number ofNetwork NeighborsmdashIFA RTB 2RTB 3Split-up

Ego has only 1 NN 6109Ego has exactly 2 NNs 1907

Same user ranked first 7102Ego has more than 2 NNs 1804

Table 14 In What Portion of Cases Is the Same UserConnected Depending on the Number ofNetwork NeighborsmdashIFA RTB 3RTB 1Split-up

Ego has only 1 NN 5305Ego has exactly 2 NNs 2501

Same user ranked first 6304Ego has more than 2 NNs 2104

shown in the tables this scenario accounts for 30ndash60 of the ego networks (depending on the split-upmethod)

The second boundary case is when the ego has ex-actly two NNs This scenario accounts for 10ndash25of the cases depending on the split-up method Herethe tables also report whether the other known userinstance is ranked first (including ties) among thetwo neighbors The other known instance of the sameuser is ranked first 60ndash80 of the time dependingon the split-up scenario The IFA results are affectedstrongly by the particular RTB settingmdashthis may indi-cate that the publishers on the different RTBs have dif-ferent policies for passing the IFA Importantly theseresults are conservativemdashpossibly very conservativeThe other NNs may also be instances of the same userunbeknownst to usmdashfor example in the IFA casebecause an app does not pass the IFA to the RTB orbecause the other user is the same user on a differentdevice and thus does not share the IFA

The final scenario is when more than two NNsare present As shown in the tables this accounts for20ndash50 of the cases depending on the split-up sce-nario In this case we measure the percentile in theranking where the instance of the known same userfalls (see below) as well as the area under the receiveroperating characteristic curve (AUC) (Fawcett 2006)which is equivalent to the MannndashWhitneyndashWilcoxon(MWW) statistic The AUCMWW measures to what

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers256 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

Table 15 When There Are More Than Two Network Neighbors to What Extent Is the Known Same User StronglyConnected to Itself

AUC (if gt 2 NNs) () Average rank percentile (if gt 2 NNs) ()

Ranking metric R T RTB 12 RTB 23 RTB 31 R T RTB 12 RTB 23 RTB 31

1 78 64 68 59 55 29 48 34 43 542 59 58 67 58 54 45 53 34 44 553 79 69 74 61 61 29 44 29 40 484 86 75 69 61 59 24 40 32 41 505 83 70 71 59 63 26 44 31 43 476 85 74 71 60 65 25 41 31 42 457 86 76 73 62 64 24 40 29 40 468 83 70 72 59 65 26 44 30 43 459 85 74 71 60 67 25 41 30 42 44

10 79 69 69 61 63 29 44 32 41 4711 79 69 71 62 64 44 44 30 41 4612 81 68 68 60 62 28 46 34 42 4913 81 70 71 61 64 27 44 30 40 4614 74 67 66 61 48 34 47 35 41 6015 74 67 66 61 48 34 47 35 41 60

Note See the text for a discussion of ranking metrics

extent data points with label 1 are ranked higherthan those with label 0 In the results that follow(see Table 15) the AUC is measured for each samplewhere any user instance among the neighbors thatcorresponds to the same individual is labeled as 1and all others are labeled as 0 Thus in this settingthe AUC measures how well the neighbors are rankedby the likelihood of being the same individual11 Theaverage over all samples is reported If for a sampleall neighbors have the same score that AUC is setto 05 In addition to the factors noted above this alsowill tend to make the results conservative For exam-ple if the GSN links an ego user to a set of neighborsthat are all instances of the same individual in thisevaluation the AUC will be 05

The ranking measure is the percentile in the rank-ing at which the known-same-user instance is rankedwith lower values being better For example if thesame-user instance has the highest score (closest con-nection) among five neighbors the rank is 20 (15)Note again that this is quite conservative in thisexample the user is ranked at the top of the list butbecause there were only five neighbors the highestpossible score is 20 In case the known-same-userinstance has the same score as another neighbor theaverage rank is reported

As shown in Table 15 all AUC values exceed 05ranging up to 08 This means that in every casea known-same-user instance is likely to be rankedhigher than an instance not known to be the same

11 Technically in this setting the AUCMWW is the probability thata randomly selected neighbor that is in fact a known instance of thesame user is ranked more highly than a randomly selected neighborthat is not known to be an instance of the same user If the known-same-user instances are all ranked above the other-user instancesthe AUC = 100 If they are all below then the AUC = 0

user The frequency-based ranking metrics (4ndash9) per-form quite well in connecting instances of the sameuser strongly

Almost all metrics perform better than a randommodel with performances that even come close to thebest possible solution Consistently performing quitewell are the frequency-based metrics (metrics 3ndash9)with metric 7 (the minimum of the IDF weights ofthe shared locations) getting the most wins Not per-forming well is the very basic count metric as well asthe Boolean metric that provides a binary score onlyInteresting to note the Jaccard metrics do not performwell either It seems that penalizing users that visitmany IP addresses negatively affects the results anissue we will return to in the next study

7 Study 2 Does the GSN Select Userswith Similar Interests

The foregoing section addressed our paperrsquos firstclaimmdashthat the GSN design indeed links instances ofthe same user and links them more strongly thaninstances of other users In this section we turn to thepaperrsquos second claimmdashthat the GSN links users withsimilar interests We will measure similar interests bysimilarity in behavior accessing particular publisherson the Web and using mobile apps12

71 Connecting Users with Similar Interests

To What Extent Are the GSN Neighbors of Visi-tors to Particular Publishers Also Visitors to ThoseSame Publishers To assess whether the GSN con-nects users with similar interestsbehavior we select

12 From now on all user instances will be complete real user in-stances the simulated split-up scenarios (Random Temporal) werecreated specifically for the purpose of the same-user study

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 257

Table 16 Distribution of Publishers per RTB in Terms of thePercentage of Users in Our Data Set That Visit the GivenPublisher

Publisher of D

RTB 1Burstly 3004myYearbookcom 1062My Fitness Pal 0065TextMe Inc 0056GameResort 0025Flixster 0019247Sports 0015Daily Workout Apps LLC 0009Mobile Deluxe 0007YoYo Games Ltd 0004

RTB 2mworldnowcom 0032mtmzcom 0024mtopixcom 0014babycentercom 0015meetmecom 0006appsfacebookcom 0006itunesapplecom 0039appevitecom 0002beautyandskincarescom 0003mmajunkiecom 0001

RTB 3Conversion Exchange 5053Pinger Inc 2096Top Game Developer 4029PremiumEntertainmentApp-FamilyOriented-Android 0059Enflick 0053Clapfoot Games 0044FunPokes Inc 0018Talkatone 0015Sevenlogics Inc 0008A Star Software 0007

RTB 4Pinger PhonemdashiOSmdashConversation 1001Talkatone iPhone App 0007DiceWithBuddies 00139GAG Reader 0012Video Downloader Pro Lite 0016Rage Comics 0009Spades 3D Lite 0006Apalabrados 0003Relax Melodies HD 0002Celeb MemdashPhotoMaker 0000

a diverse collection of 40 publishers listed in Table 16A publisher can correspond to a mobile app a web-site or a set of appswebsites We selected both pub-lishers that are visited by many users and more nichepublishers with fewer visitors As can be seen fromTable 16 several types of publishers can be distin-guished the main groups are games (eg GameRe-sort Top Game Developer) and social networkingtools (eg appsfacebookcom Text Me Inc)

More specifically we create 200 samples for 40 pub-lishers 10 publishers for each of four different RTB

exchanges listed in Table 1613 The percentage ofunique users that visit each publisher in our completedata set is also shown (D) To create each samplean ldquoegordquo user is selected that has visited the pub-lisher in question (an ldquoego visitorrdquo) this ego visi-torrsquos GSN neighborhood is created Next we assesshow many of the egorsquos neighbors also visited thatsame publisher This will be compared with a base-line visit rate to produce lift and leverage values(Provost and Fawcett 2013)mdashspecifically how muchmore likely is it for the neighbors of ego visitors tovisit the publisher than would be expected by chanceWe consider three different setups that produce dif-ferent baselines

1 Considering all RTB exchanges to determine the base-line A user is considered a visitor for a given pub-lisher if she visits the publisher on any RTB ex-change (since some publishers are visible on differentexchanges) Hence the baseline visit rate is the num-ber of users (on any RTB) who visit that publisherdivided by the total number of users across all RTBexchanges This will provide optimistic lifts sincemost publishers appear only on one RTB exchangewhereas the baseline is measured over all exchanges

2 Considering one RTB exchange only to determine thebaseline A user is considered a visitor for a given pub-lisher and RTB if she visits the publisher on the sameRTB exchange This changes the baseline to the num-ber of users (on this RTB) who visit that publisherdivided by the total number of users on this RTBexchange The lift and leverage results will be worsethan in the first setting because the baseline will belarger Seeing that some publishers are seen on differ-ent RTB exchanges this setting is slightly pessimistic

3 Taking time of location visits into account In thethird setting we link users only if they visit an IPwithin the same time period This will be elaboratedon in sect73

Results are shown in Tables 17 and 18 The percent-age of publisher visitors in the neighborhood is givenas N This percentage is compared with the baselinepercentage of users that visit that publisher in thecomplete data D using two metrics the lift is the ratioof these numbers the leverage is the difference Liftis particularly useful for very small base rates (con-sider that an activity with a base rate of 30 cannotpossibly have a lift higher than about 3) Leverage ismore telling for larger base rates because it shows theabsolute improvement (here in percentage points)

When considering all NNs the lifts for all publish-ers are greater than 1 except for one Burstly on RTB 1has a lift of 097 (essentially no lift) but only when weconsider RTB 1 as the baseline This publisher already

13 For confidentiality the RTB numbering here is again differentfrom above

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers258 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

Table 17 RTBs 1 and 2 Percentage of Network Neighbors That Also Visit the Publisher

RTB 1

All NNs N1 N10

D N Lift Lev N1 Lift Lev N10 Lift Lev

Considering all RTB exchanges to determine baseline1 Burstly 3004 8063 2084 5059 47022 15054 44018 25061 8043 220572 myYearbookcom 1062 25077 15094 24015 44074 27067 43012 35016 21074 330543 My Fitness Pal 0065 4015 6038 305 3609 56077 36025 15055 23091 14094 TextMe Inc 0056 1805 33006 17094 39002 69075 38046 25018 45 240625 GameResort 0025 4093 19077 4068 18052 74026 18027 11086 47058 110626 Flixster 0019 1055 8015 1036 8014 42085 7095 1079 9042 1067 247Sports 0015 27057 188058 27042 77062 530093 77048 74053 509075 740388 Daily Workout Apps LLC 0009 1092 21047 1083 19023 215046 19014 11054 129027 110459 Mobile Deluxe 0007 1087 28067 108 23053 36009 23046 9063 14707 9056

10 YoYo Games Ltd 0004 3052 99058 3049 21057 609064 21053 13 367045 12096Considering this RTB exchange only to determine baseline

1 Burstly 8085 8063 0097 minus0022 47022 5033 38037 25061 2089 160762 myYearbookcom 4071 25077 5047 21006 44074 905 40003 35016 7046 300453 My Fitness Pal 1089 4015 2019 2025 3609 19048 35001 15055 8021 130654 TextMe Inc 1063 1805 11035 16087 39002 23094 37039 25018 15045 230555 GameResort 0073 4093 6078 402 18052 25049 17079 11086 16033 110146 Flixster 0055 1055 208 0099 8014 14071 7059 1079 3023 10247 247Sports 0043 27057 64073 27014 77062 182023 7702 74053 174097 74018 Daily Workout Apps LLC 0026 1092 7037 1066 19023 73095 18097 11054 44037 110289 Mobile Deluxe 0019 1087 9084 1068 23053 123088 23034 9063 5007 9044

10 YoYo Games Ltd 001 3052 34018 3042 21057 209025 21047 13 126012 1209

RTB 2

Considering all RTB exchanges to determine baseline1 mworldnowcom 0032 13027 40086 12095 54039 167042 54006 31065 97044 310332 mtmzcom 0024 6089 28086 6065 23008 9607 22084 10068 44075 100443 mtopixcom 0014 15007 104079 14093 53045 37106 5303 33075 234065 330614 babycentercom 0015 6087 47031 6072 43048 299056 43033 17074 122024 17065 meetmecom 0006 6002 10308 5097 54072 942093 54066 27085 47909 270796 appsfacebookcom 0006 301 48039 3003 9009 141097 9003 2076 43008 20697 itunesapplecom 0039 25077 65071 25038 51043 131012 51004 29049 75018 290098 appevitecom 0002 48028 21212043 48025 54076 21509068 54074 48028 21212043 480259 beautyandskincarescom 0003 18019 642087 18016 8048 299094 8046 1107 413049 11067

10 mmajunkiecom 0001 1036 201064 1035 26067 3195505 26066 13033 11977074 13033Considering this RTB exchange only to determine baseline

1 mworldnowcom 4034 13027 3006 8093 54039 12053 50005 31065 7029 270312 mtmzcom 3019 6089 2016 307 23008 7024 19089 10068 3035 70493 mtopixcom 1092 15007 7084 13015 53045 27081 51053 33075 17056 310834 babycentercom 1094 6087 3054 4093 43048 22042 41054 17074 9015 15085 meetmecom 0078 6002 7077 5025 54072 70057 53094 27085 35092 270076 appsfacebookcom 0086 301 3062 2024 9009 10063 8024 2076 3022 1097 itunesapplecom 5024 25077 4092 20053 51043 9081 46019 29049 5063 240258 appevitecom 0029 48028 165058 47098 54076 187082 54047 48028 165058 470989 beautyandskincarescom 0038 18019 48011 17081 8048 22045 8011 1107 30095 11032

10 mmajunkiecom 0009 1036 15009 1027 26067 296003 26058 13033 148001 13024

Notes Results are given for the complete data set (D) the complete neighborhood of a user who visited the given publisher (All NNs) and considering theneighborhood of closest 1 and 10 users (N1 and N10)

has a very high baseline and as we will see in thenext section when we consider only the top 1 and 10NNs the lifts go up to 5 and 3 respectively Burstlyprovides a generic platform for building apps and sois a weaker indication of user interest than many ofthe other publishers (fitness babies games)

All other lifts exceed one by substantial mar-gins and go up to even 2391 (Apalabrados onRTB 4 a Spanish-language social Scrabble-like game)

Obviously the conclusion that the NNs indeed aremore likely to visit the same publishers is highly sig-nificant by a simple sign test comparing the visita-tion rate within the geosimilarity neighborhood tothat of overall population either via lift or leveragep lt 0001 Thus NNs of publisher visitors are indeedmore likely also to visit the same websites the abso-lute values of the lifts and leverages suggest that theyare substantially more likely

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 259

Table 18 RTBs 3 and 4 Percentage of Network Neighbors That Also Visit the Publisher

RTB 3

All NNs N1 N10

D N Lift Lev N1 Lift Lev N10 Lift Lev

Considering all RTB exchanges to determine baseline1 Conversion Exchange 5053 34054 6024 29001 64071 11069 59017 50077 9018 450242 Pinger Inc 2096 46002 15053 43006 65012 21097 62015 51048 17037 480523 Top Game Developer 4029 47024 11002 42095 60 13099 55071 48097 11042 440684 PremiumEntertainmentApp-FamilyOriented-Android 0059 23079 40012 2302 20045 34049 19086 22025 37052 210665 Enflick 0053 11071 21095 11017 21057 40044 21004 12027 23 110736 Clapfoot Games 0044 8055 19032 8011 11054 26006 1101 13099 31061 130557 FunPokes Inc 0018 11011 61028 10093 50 275077 49082 15079 87008 150618 Talkatone 0015 13013 88051 12098 31011 209076 30096 17091 120078 170779 Sevenlogics Inc 0008 2071 32059 2063 5088 70074 508 8033 100022 8025

10 A Star Software 0007 3033 44086 3025 23008 311024 23 9009 122061 9002Considering this RTB exchange only to determine baseline

1 Conversion Exchange 11045 34054 3002 23009 64071 5065 53025 50077 4043 390322 Pinger Inc 6013 46002 705 39089 65012 10062 58098 51048 8039 450353 Top Game Developer 8088 47024 5032 38036 60 6076 51012 48097 5052 400094 PremiumEntertainmentApp-FamilyOriented-Android 1023 23079 19038 22057 20045 16066 19023 22025 18013 210025 Enflick 101 11071 1006 1006 21057 19054 20046 12027 11011 110166 Clapfoot Games 0092 8055 9033 7064 11054 12059 10062 13099 15027 130087 FunPokes Inc 0038 11011 2906 10074 50 133022 49062 15079 42007 150418 Talkatone 0031 13013 42076 12082 31011 101034 3008 17091 58035 170619 Sevenlogics Inc 0017 2071 15074 2054 5088 34017 5071 8033 48041 8016

10 A Star Software 0015 3033 21067 3017 23008 150036 22092 9009 59023 8094

RTB 4

Considering all RTB exchanges to determine baseline1 Pinger PhonemdashiOSmdashConversation 1001 71081 70085 7008 80077 79068 79076 70014 6902 690132 Talkatone iPhone App 0007 11038 17203 11031 29003 439068 28097 17002 257078 160963 DiceWithBuddies 0013 7041 59004 7028 18018 144092 18006 7069 61031 70574 9GAG Reader 0012 62013 497099 62 49009 39305 48097 49005 393017 480925 Video Downloader Pro Lite 0016 20003 123015 19087 22058 138084 22042 17093 110027 170776 Rage Comics 0009 609 8001 6081 13033 154086 13025 609 8001 60817 Spades 3D Lite 0006 1205 225076 12044 27027 492057 27022 1205 225076 120448 Apalabrados 0003 77042 21391071 77039 75 21316097 74097 79032 21450043 790299 Relax Melodies HD 0002 31058 21002029 31056 50 31170029 49098 31058 21002029 31056

10 Celeb MemdashPhotoMaker 0 3023 82107 3022 7014 11819048 7014 3023 82107 3022Considering this RTB exchange only to determine baseline

1 Pinger PhonemdashiOSmdashConversation 8051 71081 8044 6303 80077 9049 72026 70014 8024 610632 Talkatone iPhone App 0055 11038 20053 10082 29003 52038 28048 17002 30071 160473 DiceWithBuddies 0009 7041 82011 7032 18018 201054 18009 7069 85027 7064 9GAG Reader 0097 62013 63097 61016 49009 50055 48012 49005 50051 480085 Video Downloader Pro Lite 0044 20003 45036 19059 22058 51014 22014 17093 40062 170496 Rage Comics 0037 609 18079 6053 13033 36032 12097 609 18079 60537 Spades 3D Lite 0046 1205 2609 12004 27027 58068 26081 1205 2609 120048 Apalabrados 0021 77042 371093 77021 75 360031 74079 79032 381006 790119 Relax Melodies HD 0013 31058 238054 31045 50 377069 49087 31058 238054 31045

10 Celeb MemdashPhotoMaker 0003 3023 97089 3019 7014 216076 7011 3023 97089 3019

Notes Results are given for the complete data set (D) the complete neighborhood of a user who visited the given publisher (All NNs) and considering theneighborhood of closest 1 and 10 users (N1 and N10)

The very high lifts could be attributed to two phe-nomena They may provide some evidence that theGSN embeds a true social network as pointed out bythe good results for the social networking publishers(eg Apalabrados with a lift of 2391 appsevitecomwith a lift of 2212 and others) Friends visiting thesame social networking sites (here unusual ones)

would increase the lift and leverage over just havingdifferent instances of the same user visiting the samewebsites An alternative explanation is that some ofthese publishers are much more likely to be visitedacross different devices of the same user It is not clearwhether this is actually true However it argues forobtaining specific and broad data on which devices

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers260 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

are not the same user to shed further light on forexample any broad data that include a (anonymized)user identifier

From these results we can conclude that GSNneighbors indeed exhibit substantially similar pub-lisher visitation behavior Next we assess to whatextent more strongly connected NNs are more similarthan less strongly connected NNs

72 Ranking Users with Similar Interests

Are the More-Similar Geosimilarity Neighbors of aPublisherrsquos Visitors Even More Likely Also to Be Vis-itors to the Same Publisher To assess whether more-similar geosimilarity neighbors are even more likelyto share interests we compute the measures exactlyas in the previous study except instead of consider-ing all NNs only the 1 (N1) and 10 (N10) neighborswith the highest scores are considered (see Tables 17and 18) First we must consider which of the 15 met-rics that are defined to measure link strength to useTable 19 reports the rank aggregations of the differentmetrics as to which provide the best lifts Specificallyfor each of the 40 publishers each scoring metric pro-vides a lift and for that publisher the scoring metricscan be ranked (the best being ranked 1 etc) The rankaggregation is the average rank for a scoring metricacross all 40 publishers Thus if one metric providedthe best lift for all publishers it would get an aggre-gated rank of 1

To visualize the best metrics the metrics with anaverage ranking less than 8 are shown in boldfacein Table 19 The Jaccard metrics do not perform wellin this study either showing that penalizing the con-nection to users that visit many locations is not sen-sible Remember that the Jaccard metric is based onthe idea that users that visit many locations will havea higher chance to share some location with anotheruser However we have so many locations in totalthat visiting a couple more locations will increase onlymarginally this probability of sharing a location bychance A user that logs into more IP addresses andis more active should hence not be penalized andranked lower than other less active users Metric 3which sums the IDF scores of the shared locationsperforms quite well Given the operational efficiencyadvantage over the frequency-based metrics this met-ric is chosen to measure the strength of the links

As seen from the results in Tables 17 and 18 thelift and leverage values for the strongly connected (top 1and top 10) NNs are even higher sometimes astro-nomical (3170 for Relax Melodies HD (RTB 4)) Theseresults are highly significant by sign tests compar-ing either the lift or the leverage between the top-ranked neighbors and the entire cohort (p lt 0001) Theabsolute values again indicate that the users with the

strongest geosimilarity are often substantially morelikely to visit the same publisher

Thus we can conclude that not only does geosim-ilarity find users with similar interests it also ranksusers well by their likelihood of having similarinterests

An interesting follow-up question is whether thecharacteristics of a publisher are of importance for theresults For example are GSN neighbors more sim-ilar in terms of using the same social network appas compared to playing the same games To answerthis question the publishers are categorized into sixclasses funny social news games communicationand miscellaneous In Figure 9 the lifts of the pub-lishers are shown per category Some general trendscan be observed where publishers related to funnycontent have the highest median lift This could beexplained by the fact that users often forward funnycontent to their friends The social websitesapps fol-low next which seems to further demonstrate thatfriends are likely linked in the GSN However pub-lishers in the communication category perform not sowell compared with the other categories and ratherhigh variances are observed in the different cate-gories Please note that high variance is observedacross the categories therefore it is difficult to makewell-supported claims from these results

73 TimingAs a final evaluation we explore whether the inclu-sion of temporal similarity in location visitation ishelpful for defining GSN connections To this end weintroduce a setting where two users are connectedonly if they visit an IP address in the same timeperiod of a day Specifically we divide a day intodifferent time windowsmdashfor example two time peri-ods of 12 hoursmdashwhere users are connected only ifthey visit the same IP address in the same time periodof a day (which might be on different days) Practi-cally two digits are added to an IP which denote thestarting moment of the corresponding time periodWe report on time periods of 2 hours and of 12 hours(other periods yield similar results) We also includethe setting where users are connected only when theyvisit the same IP address on the same day in light ofthe result of Crandall et al (2010) that visiting thesame location at around the same time is indicativeof friendship

The resulting lifts are reported in Figures 10ndash12when considering respectively all NNs the top 1 NNand the top 10 NNs The results when not consider-ing the time period are repeated as well (All) lim-ited to the conservative case of only considering oneexchange (see above) The baseline (lift of one) isindicated with a red horizontal line lifts higher than1000 are shown as 1000 to limit the range to be able

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 261

Table 19 Rank Aggregation per Metric Over All Publishers (Across All RTBs)

Top 1 NN Top 10 NNs

1 Shared unique IPs 801 1 Shared Unique IPs 802252 BooleanmdashSharing an IP 80625 2 BooleanmdashSharing an IP 809133 Shared unique IPsmdashIDF weighted 70025 3 Shared unique IPsmdashIDF weighted 709754 Shared IP visitsmdashMin 70063 4 Shared IP visitsmdashMin 704135 Shared IP visitsmdashMax 7095 5 Shared IP visitsmdashMax 70256 Shared IP visitsmdashAvg 8005 6 Shared IP visitsmdashAvg 704887 Shared IP visitsmdashMinmdashIDF weighted 60625 7 Shared IP visitsmdashMinmdashIDF weighted 703758 Shared IP visitsmdashMaxmdashIDF weighted 70975 8 Shared IP visitsmdashMaxmdashIDF weighted 702259 Shared IP visitsmdashAvgmdashIDF weighted 70938 9 Shared IP visitsmdashAvgmdashIDF weighted 70525

10 Cosine similarity 80338 10 Cosine similarity 804511 Cosine similaritymdashIDF weighted 80213 11 Cosine similaritymdashIDF weighted 8086312 GSN Jaccard 80925 12 GSN Jaccard 913 GSN JaccardmdashIDF weighted 80325 13 GSN JaccardmdashIDF weighted 9013814 GSN Jaccard Freq 80425 14 GSN Jaccard Freq 80515 GSN Jaccard FreqmdashIDF weighted 80425 15 GSN Jaccard FreqmdashIDF weighted 805

to visualize the results Tables 20 and 21 show theaverage rank aggregation (see the discussion of rankaggregation above) in terms of lift per RTB and thep-values of a signed rank test respectively

When comparing the different temporal settingswe see that the ldquoDayrdquo variant performs best and lim-ited differences exist between the time periods of 2and 12 hours More important to observe from theseresults is that (overall) not including time in the GSNdesign outperforms all three temporal variants sub-stantially Only for RTB 3 and the Day version whereusers are connected only if they visit the same IPaddress on the same day are the lifts not signifi-cantly worse than having no time constraint at a 1level (although even here they are significantly worseat a 10 level) This shows that for finding userswith similar interests which locations they visit ismore important than when they visit them By includ-ing the time constraints connections between previ-ously connected users are lost simply because they

Figure 9 (Color online) Lifts of the Publishers (for All Four RTBs)Ranked According to the Median Lift Within the Category(Medians Shown by the Full Line)

visit locations in different time periods Although weuse time to define links metrics that include time todefine the strength of a link might improve the resultsfurther and we consider that an interesting issue forfuture research

8 ConclusionThis paper presents a new design for using geosim-ilarity to connect user instances in a geosimilarity-weighted network In the GSN users are connected ifthey share at least one observed location Various met-rics can be used to yield degrees of similarity We pre-sented intuitive arguments theory and prior researchthat suggest that geosimilarity and the GSN designshould link similar users We also provided strongempirical support Specifically Study 1 shows thatthe GSN indeed links different users instances cor-responding to the same individual and the geosimi-larity metrics rank user instances by their likelihoodof corresponding to the same individual Study 2shows that the GSN links users who have similarinterests as measured by their propensity to visitthe same publishersuse the same apps It also pro-vides strong evidence that the geosimilarity metricsrank user instances by their likelihood of havingsimilar interests Taken together the results providestrong support that the GSN links users with similarinterestsmdashin many cases because they actually repre-sent the same individuals

The main limitation of the research presented inthis paper is that the data are not sufficient to dis-tinguish whether the similarity in interests is solelybased on connecting instances of the same user Thisis an important avenue for future research This lim-itation notwithstanding the very strong results fromStudy 2 have important implications in either caseeither the network is indeed finding different indi-viduals with substantially similar interests or we

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers262 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

Figure 10 (Color online) Lifts Considering All Network Neighbors per RTB System (Each Panel) With and Without the Time Constraint Limited toOne Network (see sect71) for the 40 Publishers

RTB 3mdashAll NNs

0 5 10

Publisher

0

20

40

60

80

100

Lift

RTB 4mdashAll NNs

0 5 10

Publisher

0

200

400

600

800

1000

Lift

RTB 2mdashAll NNs

0 5 10

Publisher

0

200

400

600

800

1000

Lift

0 5 100

200

400

600

800

1000RTB 1mdashAll NNs

Publisher

Lift

All

TP2

TP12

Day

Note Lifts higher than 1000 are shown at 1000

Figure 11 (Color online) Lifts Considering the Closest 1 Network Neighbor per RTB System (Each Panel) With and Without the Time ConstraintLimited to One Network (see sect71) for the 40 Publishers

Lift

0

100

200

400

300

RTB 3mdashTop 1 NNs

0 5 10

Publisher

All

TP2

TP12

Day

0 5 100

200

400

600

800

1000RTB 1mdashTop 1 NNs

Publisher

Lift

RTB 2mdashTop 1 NNs

0 5 10

Publisher

0

200

400

600

800

1000

Lift

RTB 4mdashTop 1 NNs

0 5 10

Publisher

0

200

400

600

800

1000

Lift

Note Lifts higher than 1000 are shown at 1000

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 263

Figure 12 (Color online) Lifts Considering the Closest 10 Network Neighbors per RTB System (Each Panel) With and Without the Time ConstraintLimited to One Network (see sect71) for the 40 Publishers

RTB 2mdashTop 10 NNs

0 5 10

Publisher

0

200

400

600

800

1000

Lift

0 5 10

Publisher

Lift

0

50

100

150

0 5 10

Publisher

0

200

400

600

800

1000

Lift

0 5 100

200

400

600

800

1000

RTB 1mdashTop 10 NNs

RTB 3mdashTop 10 NNs RTB 4mdashTop 10 NNs

Publisher

Lift

All

TP2

TP12

Day

Note Lifts higher than 1000 are shown at 1000

have very strong additional support for the rankingresults of Study 1 as the neighbors with the strongestgeosimilarity have significantly higher affinity for thesame publishersapps

These results have broad and immediate manage-ment implications within our motivating applicationof mobile advertising As discussed above digitalmarketers would like to target the many fragmentedinstances of individuals in the digital advertising

Table 20 Rank Aggregations in Terms of Lift AveragedAcross All Publishers (per RTB)

All TP2 TP12 Day

RTB 1 101 209 309 201RTB 2 1 204 304 302RTB 3 101 3 4 109RTB 4 101 208 308 203

Note Linking based on time does not improve over linkingignoring time

Table 21 Evaluation of Including Time p-Values ofSigned Rank Test

All TP2 TP12 Day

RTB 1 mdash 0 0002 0 0002 0 0006RTB 2 mdash 0 0002 0 0002 0 0002RTB 3 mdash 0 0002 0 0002 00065RTB 4 mdash 0 0002 0 0002 0 0004

ecosystem as well as users who have similar intereststo particular ldquoseedrdquo users Location data such asIP addresses are readily available across the adver-tising ecosystem The most direct use of the GSN formobile ad targeting is to seed a targeting campaignwith users chosen to exhibit some characteristics ofinterest These could be chosen through any of themyriad of methods currently used in advertising TheGSN will directly allow the targeting of the othermembers of these usersrsquo geosimilarity cohorts whichwill include other instances of the same user othersimilar users and likely both So for example if a setof users has been identified to have brand affinity viavisits to a brandrsquos website the GSN cohorts of theseusers would provide an attractive avenue to expandthe reach of a campaign to target them (extendingtraditional retargeting) and also target users similarto the seeds (customer prospecting) In the researchdata set we see that different instances of the sameusers are very often connected in the GSN and thatthose connected in the GSN exhibit substantially sim-ilar interests

The GSN also could be used for (privacy-friendly)hyperlocal targetingmdashmeaning targeting people whofrequent a particular location without needing to storedata on the actual locations of the users as describedearlier IP addresses currently are used by somemarketers for coarse-grained demographic targetinginferring geography from IP address registration data

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers264 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

and then inferring demographics from the geographyThe GSN provides a complementary alternative it willconnect people who visit the same fine-grained loca-tions As described above in the couponing exampleif a campaign is seeded with customers of a localestablishment (eg via an online loyalty program)geosimilarity can target others who frequent the samelocations

An exciting potential future use of the GSN is theevaluation of marketing campaigns across differentchannels This has become an important challenge formobile advertizing (Lee 2014)14 For example I mightreceive an ad for a hot new product on my mobiledevice I may be interested in the product but Irsquomunlikely to buy this product on my mobile phoneRather Irsquoll buy the product using my laptop or PC Bylinking the screens of the same user we are now ableto provide a broader (and possibly much more robust)estimate of campaign success (eg conversion rate)across different channels by aggregating the successmetrics across GSN neighbors For example similar tohow a traditional ad campaign may runnot run anad in a controlled study across different cities agen-cies could targetnot target different (matched) usersand then look at the success rates in their respec-tive geosimilarity neighborhoods This should captureeffects of the same individual on different devices

As discussed at the outset advertisers need to becognizant of privacy concerns regarding the collec-tion storage and use of data This paper followswhat the FTC calls the ldquoprivacy by designrdquo approachBy design the technique does not need to store anydirect personal information about mobile users thereis no need for nonanonymized identifiers demo-graphics geographics psychographics etc In addi-tion the storage of indirect information about userscan be severely limited as well The method doesnot need the actual locationsmdashonly anonymized loca-tion ldquokeysrdquo So for example IP addresses can bereplaced with random numbers15 without affectingthe GSN performance (European Commission 2014Federal Trade Commission 2012) More technically atthe ldquoouter wallrdquo of the system or firm each deviceid can be irreversibly hashed to a random key Theonly requirement is that the same device be hashedto the same key if encountered again Similarly atthe ldquoouter wallrdquo of the system or firm every loca-tion also can be irreversibly hashed to a random keyThe GSN can be formed just the same with the ran-dom keys as with the actual locations If more privacy

14 httpwwwclickzcomclickznews233699680-of-marketers-will-run-cross-channel-marketing-campaigns-in-2014-study15 See both the FTCrsquos and the European Commissionrsquos 2010ndash2011privacy-by-design reports and the comments therein along withProvost et al (2009) for a thorough treatment of this topic

is desired hashing can be done irreversibly many toone and in that case it becomes impossible to asso-ciate definitively any particular location with any par-ticular deviceuser

Finally an aspect of this research that is importantfor managers but not typically covered in the predic-tive modeling literature is the notion of increasing thereach of a campaign For example as discussed abovepossibly the most straightforward use of the GSN isto select the same actor on different mobile devicesThis would allow us to expand the reach of a ldquoretar-getingrdquo campaign16

Increasing reach has important subtleties in the cur-rent advertising ecosystemmdashit is not just the otherside of the coin of increasing lift The reason is thatthere are many targeters in the onlinemobile adver-tising ecosystem all of whom are using the samedata retargeting data and demographicgeographicpsychographic data that are purchased from third-party data providers However these data are avail-able only on a subset of devices Thus all partieswho are using these data are competing for the samesometimes small set of devices Since large advertis-ers typically contract with multiple targeting firmswho take different strategies the effect is that theadvertiser is paying the firms to compete against eachother effectively raising the price in the auction andthereby raising their own cost of advertising17 Whatis more these will be the same devices that also willbe targeted for other campaigns because they are thedevices for which the targeters have data

However by connecting devices in a GSN we canexpand campaigns to devices for which there are noretargeting or third-party data available at all If thereis less competition for these devices we should beable to target them for a lower price Thus expand-ing reach has an implication for the cost-effectivenessof achieving a certain level of predictive performance(eg lift)

AcknowledgmentsThe authors thank the anonymous reviewers and edi-tors for very constructive suggestions which substantiallyimproved the paper This work was done while the firstauthor was at Coriolis Ventures and Everyscreen MediaThe authors extend our gratitude to Everyscreen MediaDstillery Lauren Moores Tom OrsquoReilly Vinayak Javaly andTina Eliassi-Rad for many discussions about mobile adver-tising and mobile ad targeting as well as technical assistance

16 Retargeting is to target browsers who have previously purchasedfrom the brand or who have taken some other indicative brandaction such as browsing the brandrsquos site Retargeting is consid-ered by many to be one of the most effective targeting strategies(although to our knowledge these conclusions are drawn based onassessing conversion rate rather than the influence of the advertis-ing please see Stitelman et al 2011)17 The latter is our conjecture

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 265

to obtain all of the relevant data This particular data set wasnot necessarily used in the development of any productionmodel used for mobile advertising The first author thanksNEC and Andre Meyer for faculty fellowships The authorsthank the Moore and Sloan Foundations for their generoussupport of the Moore-Sloan Data Science Environment atNew York University

ReferencesAgarwal R Gupta AK Kraut RE (2008) Editorial overviewmdashThe

interplay between digital and social networks Inform SystemsRes 19(3)243ndash252

Aral S Muchnik L Sundararajan A (2009) Distinguishing influence-based contagion from homophily-driven diffusion in dynamicnetworks Proc National Acad Sci USA 106(51)21544ndash21549

Bampo M Ewing MT Mather DR Stewart D Wallace M (2008)The effects of the social structure of digital networks on viralmarketing performance Inform Systems Res 19(3)273ndash290

Cho E Myers SA Leskovec J (2011) Friendship and mobility Usermovement in location-based social networks Proc 17th ACMSIGKDD Internat Conf Knowledge Discovery and Data MiningKDD rsquo11 (ACM New York) 1082ndash1090

Cortes C Pregibon D Volinsky CT (2001) Communities of interestHoffmann F Hand DJ Adams N Fisher D Guimaraes G edsAdvances in Intelligent Data Analysis Lecture Notes ComputSci Vol 2189 (Springer-Verlag Berlin Heidelberg) 105ndash114

Crandall DJ Backstrom L Cosley D Suri S Huttenlocher DKleinberg J (2010) Inferring social ties from geographic coinci-dences Proc National Acad Sci USA 107(52)22436ndash22441

de Montjoye YA Hidalgo CA Verleysen M Blondel VD (2013)Unique in the crowd The privacy bounds of human mobilitySci Rep 31ndash5

European Commission (2014) Data protection day 2014 Full speedon EU data protection reform Press release January 27httpeuropaeurapidpress-release_MEMO-14-60_enhtm

Fawcett T (2006) An introduction to ROC analysis Pattern Recogni-tion Lett 27(8)861ndash874

Federal Trade Commission (2012) Protecting consumer pri-vacy in an era of rapid change FTC Report March 2012httpswwwftcgovsitesdefaultfilesdocumentsreportsfederal-trade-commission-report-protecting-consumer-privacy-era-rapid-change-recommendations120326privacyreportpdf

Heidemann J Klier M Probst F (2010) Identifying key users inonline social networks A pagerank based approach InformSystems J 4801(December)157ndash160

Hill S Provost F (2003) The myth of the double-blind reviewAuthor identification using only citations SIGKDD Explorations5(2)179ndash184

Hill S Provost F Volinsky C (2006) Network-based marketingIdentifying likely adopters via consumer networks Statist Sci22(2)256ndash276

Hotho A Nuumlrnberger A Paass G (2005) A brief survey of text min-ing LDV Forum 20(1)19ndash62

Kerho S (2012) Mobile marketingmdashA new analytics frameworkwhat we have and what we need Presentation Marketing onthe Move Conference Wharton School University of PennsylvaniaPhiladelphia February 28

Kossinets G Watts DJ (2009) Origins of homophily in an evolvingsocial network Amer J Sociol 115(2)405ndash450

Lee J (2014) 80 of marketers will run cross-channel marketingcampaigns in 2014 ClickZ (March 31) httpwwwclickzcomclickznews233699680-of-marketers-will-run-cross-channel-marketing-campaigns-in-2014-study

Martens D Provost F (2011) Pseudo-social network targeting fromconsumer transaction data Working paper CEDER-11-05 SternSchool of Business New York University New York

McPherson M Smith-Lovin L Cook JM (2001) Birds of afeather Homophily in social networks Annual Rev Sociol 27415ndash444

Oinas-Kukkonen H Lyytinen K Yoo Y (2010) Social networks andinformation systems Ongoing and future research streamsJ Assoc Inform Systems 11(2)61ndash68

Pan W Aharony N Pentland A (2011) Composite social networkfor predicting mobile apps installation Burgard W Roth Deds Proc AAAI 2011 (AAAI Press San Francisco)

Perlich C Dalessandro B Raeder T Stitelman O Provost F (2014)Machine learning for targeted display advertising Transferlearning in action Machine Learn 95(1)103ndash127

Pogue D (2011) Wrapping up the apple location brouhaha NewYork Times (April 28) httppogueblogsnytimescom20110428wrapping-up-the-apple-location-brouhaha

Provost F Fawcett T (2013) Data Science for Business What You Needto Know About Data Mining and Data-analytic Thinking (OrsquoReillyMedia Inc Sebastopol CA)

Provost F Dalessandro B Hook R Zhang X Murray A (2009) Audi-ence selection for on-line brand advertising Privacy-friendlysocial network targeting KDD rsquo09 Proc 15th ACM SIGKDDInternat Conf Knowledge Discovery and Data Mining (ACMNew York) 707ndash716

Pubmatic (2010) Understanding real-time bidding (RTB) from thepublisherrsquos perspective Technical report Pubmatic SiliconValley CA

Quercia D Lathia N Calabrese F Di Lorenzo G Crowcroft J (2010)Recommending social events from mobile phone location dataProc 2010 IEEE Internat Conf Data Mining ICDM rsquo10 (IEEEComputer Society Washington DC) 971ndash976

Raeder T Stitelman O Dalessandro B Perlich C Provost F (2012)Design principles of massive robust prediction systems Proc18th ACM SIGKDD Internat Conf Knowledge Discovery and DataMining (ACM New York) 1357ndash1365

Stitelman O Dalessandro B Perlich C Provost F (2011) Estimatingthe effect of online display advertising on browser conversionData Mining and Audience Intelligence for Advertising ADKDD2011 (ACM New York) 8

Stitelman O Perlich C Dalessandro B Hook R Raeder T Provost F(2013) Using co-visitation networks for detecting large scaleonline display advertising exchange fraud Proc 19th ACMSIGKDD Internat Conf Knowledge Discovery and Data Mining(ACM New York) 1240ndash1248

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Page 7: Finding Similar Mobile Consumers with a Privacy-Friendly ...pages.stern.nyu.edu/~fprovost/Papers/PMM_ISR_2015.pdf · Provost, Martens, and Murray: Finding Similar Mobile Consumers

Provost Martens and Murray Finding Similar Mobile Consumers248 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

Table 1 Data Availability for Mobile Devices Across Different RTB Systems

RTB 1 () RTB 2 () RTB 3 () RTB 4 () RTB 5 () RTB 6 () RTB 7 ()

Created 100 100 100 100 100 100 100Network 100 100 100 100 100 100 100Dims 100 100 100 100 100 100 100ID 100 100 100 36 0 19 44Device 100 100 100 100 100 100 100User agent 100 100 100 100 100 100 100Iabcats 0 100 100 98 100 100 84Geo 100 16 100 100 100 100 100Location 0 4 56 42 1 20 100Referrer 8 0 0 17 0 6 3Publisher 100 100 100 100 100 100 100URL 100 0 0 5 93 0 0IP 100 100 100 100 100 100 100

Notes Specifically the table shows how often the following fields are provided time of bid request (Created) the RTB network identifier(Network) dimensions of the ad (Dims) identifier of the device (ID) device type (Device) the user agent IAB categorization (Iabcats)the IP-inferred city state or country (Geo) the location latitude and longitude (Location) the referrer from where the user comes(Referrer) the publisher of the Web page (Publisher) the URL and the IP address

helpful for finding the same user on different devicesor for similar-user finding (eg using Apple productsversus Android products) we do not take advantageof them in this paper instead focusing specificallyon assessing the ability of the geosimilarity design

Figure 2 Distribution of (a) Number of Unique User Instances Seen per IP Address and (b) Number of IP Addresses a User Instance Logs Into

(b) Number of IPs per ID

Number of IPs

Fre

quen

cy o

f occ

uren

ce

100

108

107

106

105

104

103

102

100

101

101 102 103 104 105 106 107 108

(a) Number of IDs per IP

Fre

quen

cy o

f occ

uren

ce

108

107

106

105

104

103

102

100

101

Number of IDs

100 101 102 103 104 105

Figure 3 Distribution of (a) Number of Unique User Instances Seen per Publisher and (b) Number of Publishers a User Instance Visits

(a) Number of IDs per publisher

Number of IDs

Fre

quen

cy o

f occ

uren

ce

100

104

103

102

100

101

101 102 103 104 105 106 107

(b) Number of publishers per ID

Fre

quen

cy o

f occ

uren

ce

108

107

106

105

104

103

102

100

101

Number of publishers

100 101 102 103 104

(as opposed to how to best engineer a same-user-finding system)

The URL field is missing in all but two RTB ex-changes The Geo field is often limited to the citystate and country as inferred from the IP address

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 249

Location data in the form of latitude and longitude(Location) are only available in some RTB systemsand the accuracy of the latitudelongitude data in thecurrent mobile ecosystem is questionable8 again mak-ing it not suitable for broad use

32 Entity and Link RepresentationWe represent each user instance by a ldquoprofilerdquo ofbehavior across locations where we define a locationprofile as a vector of visit frequencies This will neces-sarily require a sparse representation as we observe avery large total set of locations fortunately each indi-vidual is seen at only a small subset of the locationsFor the empirical results below the location profile isa sparse vector of frequencies of observed visitationsto locations (which can be weighted in the calculationof similarity as described below)

The structure of the GSN is the graph with mobiledevices as the nodes and links between the devicesas the edges The corresponding design decisionsinvolve the selection of which devices to link at alland the weights to place on the links The simplestlink representation places unweighted links betweenall pairs of devices that share at least one specific loca-tion More sophisticated weighting criteria are basedon two notions which are discussed next and illus-trated with a simple example in Figure 4 The loca-tion data are obtained by listening to RTB ecosystemswhere IP addresses of devices are broadcast (as shownpreviously in Figure 1)

1 Given two devices that share at least one loca-tion the weight of the link between them can bea function of the devicesrsquo location profiles Such mea-sures can include simply the number of shared loca-tions and the number of shared locations weightedby their visitation percentage in the profile Alterna-tively the weighting could be a measure of similaritybetween the two location profiles for example cosinesimilarity

2 As such Devices A B and C are connected in theGSN of Figure 4 since they all logged into the Wi-Fiof the Met Museum Devices B and C also loggedinto the public Wi-Fi at Times Square and thereforeshould be more strongly connected than A and B (or Band C) which share only one location

8 Analyses of the latitudelongitude data show various data prob-lems including flipped coordinates coordinates set equal to eachother many latitudinallongitudinal points at the centroids of com-mon geographic areas and many impossible or highly improba-ble coordinates Mobile ad industry insiders claim that the highprice paid for latitudelongitude-based hyperlocal targeting hasled to a significant amount of latitudelongitude fraud We havefound nothing written on this phenomenon although data fraudis not uncommon in the online advertising industry (Stitelmanet al 2013)

Figure 4 (Color online) Basic Illustration of the Creation of a GSNBased on Mobile Location Data Observed for Three MobileDevices (A B and C) and a Laptop (D)

Met MuseumA B

Geosocial network(GSN)

A C

D

B

C

D

Times Square

Apartment

Note Devices are connected in the GSN if they visited (logged into) the samelocation (IP address)

3 Links can be weighted according to the componentlocationsrsquo 4lack of5 popularity For example it mightbe argued that the link formed by sharing a loca-tion with a small number of other devices (eg myapartment) indicates a stronger geosimilarity thansharing a location with a massive number of otherdevices This intuition can be extended if two devicesspend a lot of time at such an unpopular locale(eg my apartment) then that should indicate a verystrong similarity If two devices have approximatelythe same distributional profile across several of thesesites (their ldquolocation profilerdquo) then they are eitherthe same person or close friendssoulmates In ourexample C and D are connected because they bothlogged into the Wi-Fi of an apartment whereas Band C are connected by logging into the public (pop-ular) Wi-Fi at Times Square Hence the connectionbetween C and D (which are likely devices belongingto roommates or belonging to the same user) shouldbe stronger than the connection between B and C(which are likely devices belonging to two differenttourists)

This notion of similarity can be incorporated tech-nically by (1) adapting notions from information re-trieval we can weight locations for a given deviceby their device-specific popularity (from the devicersquoslocation profile) divided by the (log of) the locationrsquosoverall popularity let us call that LF times IDF (locationfrequency times inverse device frequency) Then (2) thestrength of the link between two devices that sharea location would be a function of the correspondingLF times IDF scores

Once we have one or more GSNs there are variousways to take advantage of them for targeting or otherinference The simplest method for inference is sim-ply to target all of the geosimilar network neighborsAlternatively one could target the ldquoclosestrdquo neighborsbased on one or more notions of geosimilarity (as dis-

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers250 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

cussed above) These are the methods used for theempirical results presented below

4 Specifics of the GSN DesignAs described above in this design the strength of thelink between two individuals in the GSN is based onthe similarity in the distribution of the locations thatthey visit For this paperrsquos results locations will be(anonymized) IP addresses We now define differentsimilarity measures of varying complexity which willbe evaluated in the following sections

41 NotationThe location profile of a user is represented by a vec-tor Ex of length m where element i denotes the num-ber of visits to IP address IPi2 Ex isin m Vector Exbool isa binary vector of size m where element i denoteswhether the user has visited IPi For similarity com-putations both the Hadamard product () and theinner product 4 middot 5 are used and the minimum (min)operator is defined to operate componentwise on twovectors as well as on a vector and a scalar The max-imum (max) and average (avg) operators are similarTheir logic is defined in Equations (1)ndash(4) The weightof an IP address is defined by its popularity in anal-ogy to the inverse document frequency used in text

Table 2 Similarity Metrics Between Two Userrsquos Location Visit Distributions Ex1 and Ex2

Similarity metric Range d4 Ex11 Ex25

1 scount4 Ex11 Ex25= Exbool11 middot Exbool12 6015 1

2 sbool4 Ex11 Ex25= min4scount4 Ex11 Ex25115 80119 1

3 scount1W 4 Ex11 Ex25= 4 Ew Exbool115 middot Exbool12 6015 103

4 sfreq1min4 Ex11 Ex25= min4 Ex11 Ex251 6015 2

5 sfreq1max4 Ex11 Ex25= max4 Exbool12 Ex11 Exbool11 Ex251 6015 3

6 sfreq1 avg4 Ex11 Ex25= avg4 Exbool12 Ex11 Exbool11 Ex251 6015 205

7 sfreq1min1W 4 Ex11 Ex25= min4 Ex1 Ew1 Ex2 Ew51 6015 206

8 sfreq1max1W 4 Ex11 Ex25= max44 Exbool12 Ew5 Ex11 4 Exbool11 Ew5 Ex251 6015 309

9 sfreq1 avg1W 4 Ex11 Ex25= avg44 Exbool12 Ew5 Ex11 4 Exbool11 Ew5 Ex251 6015 3025

10 scosine4 Ex11 Ex25=Ex1 middot Ex2

Ex11 middot Ex21

60117 0074

11 scosine1W 4 Ex11 Ex25=4 Exbool11 Ew5 middot 4 Exbool12 Ew5

Exbool11 Ew1 middot Exbool12 Ew1

60117 0060

12 sJaccard4 Ex11 Ex25=scount4 Ex11 Ex25

max4 Exbool111 Exbool1251

60117 0033

13 sJaccard1W 4 Ex11 Ex25=scount1W 4 Ex11 Ex25

max4 Exbool11 Ew1 Exbool12 Ew51

60117 0026

14 sJaccard1 freq4 Ex11 Ex25=sfreq1min4 Ex11 Ex25

max4 Ex1 Exbool121 Ex2 Exbool1151

60117 0067

15 sJaccard1 freq1W 4 Ex11 Ex25=sfreq1min1W 4 Ex11 Ex25

max44 Ex1 Exbool125 Ew1 4 Ex2 Exbool115 Ew51

60117 0067

mining (Hotho et al 2005) the weight wi for IPi (theinverse device frequency) is defined as the logarithmof the total number of users n divided by the numberof (unique) users that visit that specific IP address ni

Ez= Ex Ey1 zi = xi times yi1 (1)

a= Ex middot Ey =sum

i=12m

4xi times yi51 (2)

Ez= min4Ex1 Ey51 zi = min4xi1yi51 (3)

Ez= min4Ex1a51 zi = min4xi1 a51 (4)

wi = log104nni50 (5)

42 Link Strength MetricsFifteen similarity metrics are defined in Table 2 andare illustrated by measuring the strength betweenthe location profiles of two example users Ex1 and Ex2shown in Equation (6) The first user visits IP1 threetimes and IP2 twice whereas the second one visits IP1twice and IP3 once Several variants of the same met-rics are considered As described above some onlytake into account the unique number of shared loca-tions whereas others take into account the frequencyof the visits to these locations and the inverse devicefrequency The latter are denoted by freq and W respectively

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 251

The first similarity metric scount counts the numberof shared locations Since the two example users shareonly IP1 this strength is 1 The IDF weighted versionscount1W sums the IDF values of the shared locationswhich is 13 for our simple example The most basicmetric is given by sbool which is 1 when a location isshared and 0 otherwise indicating whether two usersare neighbors or not

The sfreq1min sfreq1max and sfreq1avg metrics compare thefrequencies of the visits to the shared locations Theytake the minimum maximum and average respec-tively of the frequencies for each shared location andsum them The weighted versions sfreq1min1W sfreq1max1W and sfreq1avg1W do the same but weight each frequencywith the corresponding IDF value

The scosine metric measures the similarity by takingthe cosine of the angle between the two vectors TheJaccard metric is defined as the size of the intersec-tion over the size of the union In this case it countsthe number of shared locations but normalizes theseover the total number of locations both users havevisited Users that visit many locations will have ahigher chance to share some location with anotheruser so the Jaccard metric penalizes them For ourbasic example sJaccard is 1 (IP1) over 3 (IP1 IP2 IP3)Once more variants taking into account the frequencyand the IDF are defined as well

Ex1 = 63 2 071

Ex2 = 62 0 171 (6)

Ew = 6103 107 270

5 Experimental Setup51 Data SetThe data set D for the empirical results is 10 days ofanonymized advertising bid requests from mobile de-vices (smartphones tablets etc) observed across sevenRTB exchanges A total of 322770794 bid requestsare observed coming from 42437559 unique userinstances Over these 10 days we observe 41829088unique IP addresses

The number of unique user instances per RTBexchange is given in Table 3 The distributions of thenumber of bid requests per device type and operatingsystem are given in Tables 4 and 5 which show thatthe majority of bid requests come from smartphonesand the iOS operating system

A user (instance) corresponds to an exchange-assigned user ID One person on several deviceswill hence correspond to several users One deviceon several exchanges will often correspond to sev-eral users Also it might be that one ad exchangesometimes provides different user IDs to the same

Table 3 User Instances per RTB Exchange

Exchange Volume Percentage ()

RTB 1 311751979 6RTB 2 211591686 4RTB 3 1415661098 29RTB 4 2015011243 41RTB 5 212751015 5RTB 6 510551815 10RTB 7 212231418 4

Table 4 Distribution of Bid Requests per Device Type

Type Volume Percentage ()

Phone 29213581772 91Tablet 3010161334 9

Table 5 Distribution of Bid Requests per OperatingSystem

Type Volume Percentage ()

Android 12913151617 4001iOS 19219231648 5908Windows 5311529 002

device for example when the device is using differ-ent particular apps As described above an impor-tant task in online advertising is to be able to targetadvertisements to different instances of the same indi-vidual with reasonable probability regardless of thecomplexities of identification in the baroque onlineadvertising ecosystem Figure 5 visualizes a sample oflatitudes and longitudes broadcast by mobile devicesin our data set

52 SamplingFor the results in this paper we apply the GSN anal-ysis to data samples Each sample corresponds to acohortneighborhood and is built as follows A sin-gle (random) ldquoegordquo user is chosen user X aroundwhich a neighborhood is built Every neighbor of theego user must share at least one IP address so webegin by obtaining all (anonymized) IP addresses vis-ited by user X Next all other users that visited atleast one of these IP addresses (the neighbors) areadded The ego must have at least one neighbor to beincluded in this analysis Finally we obtain the com-plete location distribution of each of these neighborsWhen the next sample is chosen we ensure that noneof the users already included in the previous samplesis taken as the ego user (X) As such we ensure thatwe do not have repeated egondashneighbor dyads Theremay be shared neighbors of two different egos buttwo egos will never be immediate neighbors in thedata sample

The result of the sampling over 500 cohorts is summa-rized in Figure 6 on average an ego has 71 neighbors

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers252 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

Figure 5 Scatter Plots of Samples of Latitudes and Longitudes Broadcast by Mobile Devices

ndash130 ndash120 ndash110 ndash100 ndash90 ndash80 ndash70 ndash6020

25

30

35

40

45

50

55(a)

Long

Lat

ndash20 ndash10 0 10 20 30 40 50

Long

30

35

40

45

50

55

65

60

75

70

Lat

ndash200 ndash150 ndash100 ndash50 1000 50 150 200

Long

ndash100

ndash80

ndash60

ndash40

ndash20

0

40

20

100

60

80

Lat

4068ndash7406 ndash7404 ndash7402 ndash7400 ndash7394ndash7398 ndash7396 ndash7392 ndash7390

Long

4070

4072

4074

4076

4080

4078

4086

4082

4084

Lat

(b)

(c) (d)

Note Note that locations are plotted without reference to any map The figure gives a striking picture of population density across the world and makes uswonder what is going on with mobile devices in Antarctica

Figure 6 (Color online) The Distribution of the Number of Neighborsfor 500 Randomly Sampled Ego Users

0 500 1000 1500 2000 2500 3000 35000

50

100

150

200

250

300

350

400

450

No neighbors

Fre

quen

cy

the median being 1 The maximum number of neigh-bors is 3167 this user probably visited a publicWi-Fi that thousands of other users also logged intoWe chose 500 cohorts as we observe that the statis-

tics become quite stable as seen in Figure 7 Even200 cohorts seems to be enough to obtain stablesample statistics

The life span of user instances in this set of500 cohorts is shown by the cumulative frequency inFigure 8 Observe that only about 40 of the userinstances are seen at different hours About 30 ofthe user instances are seen only once in the sampledweek This illustrates that it is vital to understand thenotion of user instances These percentages are mis-leading if we interpret the user instances as individu-als or devices One individualdevice without stableidentification in the online advertising ecosystem willcreate many transient user instances which will driveup the percentage of user instances that are seen onlyonce or that are seen only within a short timeframeWe have not figured out how to determine or esti-mate reliably how many ldquounstablerdquo usersdevices arerepresented by the user instances in these data Wekeep all of the data for the analyses below (rather thanfiltering very short-lived user instances) to maintainrealism

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 253

Figure 7 Evolution of Several Sample Statistics as an Increasing Number of Cohorts are Created

0 100 200 300 400 5000

10

20

30

40

50

60

70

Average No network neighbors

0 100 200 300 400 5000

2

4

6

8

10

12

Median No network neighbors

0 100 200 300 400 5000

50

100

150

200

250

300

350

Average No IPs

0 100 200 300 400 5000

2

4

6

8Median No IPs

Notes The horizontal axis denotes the number of cohorts By 500 cohorts the statistics have become stable

Figure 8 (Color online) Cumulative Frequency of Life Span ofUser Instances

100 101 102 103 104 105 1060

01

02

03

04

05

06

07

08

09

10

x = lifespan (sec)

F(x

)

End of June (mobile only)

larr 1 min larr 1 hour larr 1 day

Cumulative frequency of user instances

6 Study 1 Connecting Instances ofthe Same Individual

Returning to our motivating application as men-tioned previously an important task in mobile adver-tising is to be able to target a particular individual inthe face of the individualrsquos fragmentation into differ-ent user instances (for example as seen through dif-ferent bidding systems) Advertisers understand that

it may be impossible to find the same user with 100accuracy Nonetheless it often is important for themto reach the userrsquos various instances without target-ing too many others9 The GSN is designed in partbased on the hypothesis that different instances of thesame person would visit similar locations and there-fore they will be connected in the GSN In this sectionwe present results examining whether and to whatextent that is indeed the case Furthermore the resultsquantify to what extent the same user receives a highGSN-similarity to herself as compared to her otherneighbors

Three experimental settings are used to assess judg-ments that two users are the same The first two sim-ulate different user instances by splitting up the bidrequests of known user instances The last one is notsimulated it links users via the recently introducedidentifier for advertising (IFA) observed over severalRTB exchanges for a subset of device instances

1 Random We divide the IP address visits of a userrandomly across two simulated users Table 6 showsfor example how we can divide a userrsquos IP visits across

9 The astute reader may notice that if the other users targeted arevery similar to the target user the advertising may be well placeddespite not being shown to the exact target user That is not thesubject of this study Study 2 shows evidence that these two notionsare not independent when using the GSN

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers254 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

Table 6 Randomly Splitting User X rsquos IP VisitsCreating Two New Artificial Users X1 and X2

IP1 IP2 IP3

X 1 5 0X1 0 3 0X2 1 2 0

Table 7 The Volume of IFA Occurrence Within OurData Set

Volume

RTB 1 119401855RTB 2 113831962RTB 3 7891292

Table 8 The Frequency of Unique IFAs Visible AcrossRTB Exchanges

Frequency

RTB 1RTB 2 3061594 (1016)RTB 2RTB 3 451256 (169)RTB 3RTB 1 181417 (085)

two new users X1 and X2 for which we know thatthey are actually the same individual Results withRandom should be used only as a ceiling on perfor-mance because the results will be overly optimistic

2 Temporal A more realistic split-up of the trans-actions is a temporal split-up where first the middleday of Xrsquos transactions is determined (hence dynam-ically chosen for each user) All IP address visits upto this middle day are assigned to X1 and all trans-actions after the middle day are assigned to X2 Itwill only be applied to users who are seen on multi-ple days

3 Identifier for advertising IFA is an advertisingidentifier from Apple that users can change or asknot to be used for advertising and is seen as aprivacy-friendlier version of the controversial fixeddevice ID (the ldquoUDIDrdquo) The IFA is observed in threeexchanges with volumes shown in Table 7 for exam-ple 1940855 unique IFAs are seen in RTB 110 Whenwe observe the same IFA across different exchangeswe now know that two devices in different exchangesactually are the same The frequency of unique IFAsvisible across exchanges is shown in Table 8 For exam-ple a total of 3018223 IFAs are seen when we look atboth RTB 1 and RTB 2 with 306594 of these (1016)seen on both exchange 1 and exchange 2 What ifthere were not an IFA and the device would havehad a different identifier in each exchange To what

10 For confidentiality RTB 1 in this setting is not necessarily thesame as RTB 1 in sectsect4 or 7

extent are these two device instances then connectedin the GSN

With these different ldquogold standardrdquo methods toindicate whether or not two users are the samewe assess the ability of the geosimilarity metricsto link the same user determining how often twoinstances of the same person are connected in theGSN (sect61) Afterward we shall assess the differentGSN strength metricsrsquo ability to rank users determin-ing to what extent two instances of the same personexhibit strong geosimilarity as compared to the otherneighbors (sect62) For the random and temporal split-ups 200 samples are used For the IFA-based methodthe complete population is used for measuring thedegree of connectedness to the same user to analyzethe relative strength of the connected users 200 sam-ples are used

61 Same Individual Different ldquoScreensrdquoWe now present results on the GSN connectednessmeasured as how often two instances of the same userare connected in the GSN (in percentage) Note thatthe results summarized in Table 9 should be inter-preted in light of the limited number of days in build-ing the research data set which will limit the GSNconnectivity Nonetheless the relatively high connec-tivities observed even with this sample provide quitepromising results which lends considerable supportto the merit of the overall design

Random0 Using the random split two users thatcorrespond to the same individual are not connectedin only 9 of the neighborhoods Moreover in thesecases user X (whose IP visits are split up) usuallyhas only one or two entries and the random split-upresults in no visits for one of the two simulated users

Temporal0 The temporal split-up is more realisticthan the random one where we assume that a personuses simulated device 1 in the first half of the week(eg during the weekend) and simulated device 2 inthe second half of the week (eg during the week)

Users X1 and X2 are not connected in 32 of theneighborhoods Whereas the random split-up was toooptimistic the temporal one is too pessimistic as inreality there likely is some overlap in the time peri-ods of use of two different devices If user X visitsan IP address only in the first days of the week this

Table 9 To What Extent is the Same User Connectedto Herself

connected

Random 91Temporal 68IFAmdashRTB 1RTB 2 67IFAmdashRTB 2RTB 3 82IFAmdashRTB 3RTB 1 73

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 255

IP address will not occur in user X2rsquos transactions andreduces the chance to be connected to X1

IFA0 We look at all combinations of the three ex-changes with IFA identifiers For example there are45256 unique IFAs seen on both RTB 2 and RTB 3 In82 of these the two IFAs visit at least one same IP onboth exchanges and hence are connected in our GSN

The bottom line is that even with this limited slice ofonline behavior different instances of the same indi-vidual are connected in the GSN 70ndash80 of the timeEncouragingly the results from the simulated scenar-ios concur with the results from the real (IFA) sce-narios (including the aforementioned optimism andslight pessimism of the two simulated scenarios)Consistency over these different settings gives addi-tional confidence in the results suggesting that theGSN indeed holds promise for a high degree of suc-cess at targeting the same user across the fragmenta-tion of the online advertising ecosystem

62 Ranking User InstancesNext given that two user instances of the same indi-vidual are indeed connected we assess to what extentthe second instance of the same user is ranked highlyamong its network neighbors based on the weightof their linkage in the GSN We first will discuss themost general results using the best ranking methodand then will look across different ranking methods

We decompose the results based on the numberof network neighbors (NNs) since there are twoboundary cases that require special interpretationTables 10ndash14 show the proportions of the ego net-works that fall into three scenarios First a user mayhave only one NN In this scenario the ranking isperfect (trivially)mdashthe other instance of the same useris the only user instance given a nonzero score As

Table 10 In What Portion of Cases Is the Same UserConnected Depending on the Number ofNetwork NeighborsmdashRandom Split-up

Ego has only 1 NN 3104Ego has exactly 2 NNs 1905

Same user ranked first 8005Ego has more than 2 NNs 4901

Note For the scenario with two NNs the table reports thepercentage of cases where the other known instance of thesame user is ranked first (including ties)

Table 11 In What Portion of Cases Is the Same UserConnected Depending on the Number ofNetwork NeighborsmdashTemporal Split-up

Ego has only 1 NN 3008Ego has exactly 2 NNs 1807

Same user ranked first 8309Ego has more than 2 NNs 5005

Table 12 In What Portion of Cases Is the Same UserConnected Depending on the Number ofNetwork NeighborsmdashIFA RTB 1RTB 2Split-up

Ego has only 1 NN 4401Ego has exactly 2 NNs 1201

Same user ranked first 8201Ego has more than 2 NNs 4400

Table 13 In What Portion of Cases Is the Same UserConnected Depending on the Number ofNetwork NeighborsmdashIFA RTB 2RTB 3Split-up

Ego has only 1 NN 6109Ego has exactly 2 NNs 1907

Same user ranked first 7102Ego has more than 2 NNs 1804

Table 14 In What Portion of Cases Is the Same UserConnected Depending on the Number ofNetwork NeighborsmdashIFA RTB 3RTB 1Split-up

Ego has only 1 NN 5305Ego has exactly 2 NNs 2501

Same user ranked first 6304Ego has more than 2 NNs 2104

shown in the tables this scenario accounts for 30ndash60 of the ego networks (depending on the split-upmethod)

The second boundary case is when the ego has ex-actly two NNs This scenario accounts for 10ndash25of the cases depending on the split-up method Herethe tables also report whether the other known userinstance is ranked first (including ties) among thetwo neighbors The other known instance of the sameuser is ranked first 60ndash80 of the time dependingon the split-up scenario The IFA results are affectedstrongly by the particular RTB settingmdashthis may indi-cate that the publishers on the different RTBs have dif-ferent policies for passing the IFA Importantly theseresults are conservativemdashpossibly very conservativeThe other NNs may also be instances of the same userunbeknownst to usmdashfor example in the IFA casebecause an app does not pass the IFA to the RTB orbecause the other user is the same user on a differentdevice and thus does not share the IFA

The final scenario is when more than two NNsare present As shown in the tables this accounts for20ndash50 of the cases depending on the split-up sce-nario In this case we measure the percentile in theranking where the instance of the known same userfalls (see below) as well as the area under the receiveroperating characteristic curve (AUC) (Fawcett 2006)which is equivalent to the MannndashWhitneyndashWilcoxon(MWW) statistic The AUCMWW measures to what

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers256 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

Table 15 When There Are More Than Two Network Neighbors to What Extent Is the Known Same User StronglyConnected to Itself

AUC (if gt 2 NNs) () Average rank percentile (if gt 2 NNs) ()

Ranking metric R T RTB 12 RTB 23 RTB 31 R T RTB 12 RTB 23 RTB 31

1 78 64 68 59 55 29 48 34 43 542 59 58 67 58 54 45 53 34 44 553 79 69 74 61 61 29 44 29 40 484 86 75 69 61 59 24 40 32 41 505 83 70 71 59 63 26 44 31 43 476 85 74 71 60 65 25 41 31 42 457 86 76 73 62 64 24 40 29 40 468 83 70 72 59 65 26 44 30 43 459 85 74 71 60 67 25 41 30 42 44

10 79 69 69 61 63 29 44 32 41 4711 79 69 71 62 64 44 44 30 41 4612 81 68 68 60 62 28 46 34 42 4913 81 70 71 61 64 27 44 30 40 4614 74 67 66 61 48 34 47 35 41 6015 74 67 66 61 48 34 47 35 41 60

Note See the text for a discussion of ranking metrics

extent data points with label 1 are ranked higherthan those with label 0 In the results that follow(see Table 15) the AUC is measured for each samplewhere any user instance among the neighbors thatcorresponds to the same individual is labeled as 1and all others are labeled as 0 Thus in this settingthe AUC measures how well the neighbors are rankedby the likelihood of being the same individual11 Theaverage over all samples is reported If for a sampleall neighbors have the same score that AUC is setto 05 In addition to the factors noted above this alsowill tend to make the results conservative For exam-ple if the GSN links an ego user to a set of neighborsthat are all instances of the same individual in thisevaluation the AUC will be 05

The ranking measure is the percentile in the rank-ing at which the known-same-user instance is rankedwith lower values being better For example if thesame-user instance has the highest score (closest con-nection) among five neighbors the rank is 20 (15)Note again that this is quite conservative in thisexample the user is ranked at the top of the list butbecause there were only five neighbors the highestpossible score is 20 In case the known-same-userinstance has the same score as another neighbor theaverage rank is reported

As shown in Table 15 all AUC values exceed 05ranging up to 08 This means that in every casea known-same-user instance is likely to be rankedhigher than an instance not known to be the same

11 Technically in this setting the AUCMWW is the probability thata randomly selected neighbor that is in fact a known instance of thesame user is ranked more highly than a randomly selected neighborthat is not known to be an instance of the same user If the known-same-user instances are all ranked above the other-user instancesthe AUC = 100 If they are all below then the AUC = 0

user The frequency-based ranking metrics (4ndash9) per-form quite well in connecting instances of the sameuser strongly

Almost all metrics perform better than a randommodel with performances that even come close to thebest possible solution Consistently performing quitewell are the frequency-based metrics (metrics 3ndash9)with metric 7 (the minimum of the IDF weights ofthe shared locations) getting the most wins Not per-forming well is the very basic count metric as well asthe Boolean metric that provides a binary score onlyInteresting to note the Jaccard metrics do not performwell either It seems that penalizing users that visitmany IP addresses negatively affects the results anissue we will return to in the next study

7 Study 2 Does the GSN Select Userswith Similar Interests

The foregoing section addressed our paperrsquos firstclaimmdashthat the GSN design indeed links instances ofthe same user and links them more strongly thaninstances of other users In this section we turn to thepaperrsquos second claimmdashthat the GSN links users withsimilar interests We will measure similar interests bysimilarity in behavior accessing particular publisherson the Web and using mobile apps12

71 Connecting Users with Similar Interests

To What Extent Are the GSN Neighbors of Visi-tors to Particular Publishers Also Visitors to ThoseSame Publishers To assess whether the GSN con-nects users with similar interestsbehavior we select

12 From now on all user instances will be complete real user in-stances the simulated split-up scenarios (Random Temporal) werecreated specifically for the purpose of the same-user study

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 257

Table 16 Distribution of Publishers per RTB in Terms of thePercentage of Users in Our Data Set That Visit the GivenPublisher

Publisher of D

RTB 1Burstly 3004myYearbookcom 1062My Fitness Pal 0065TextMe Inc 0056GameResort 0025Flixster 0019247Sports 0015Daily Workout Apps LLC 0009Mobile Deluxe 0007YoYo Games Ltd 0004

RTB 2mworldnowcom 0032mtmzcom 0024mtopixcom 0014babycentercom 0015meetmecom 0006appsfacebookcom 0006itunesapplecom 0039appevitecom 0002beautyandskincarescom 0003mmajunkiecom 0001

RTB 3Conversion Exchange 5053Pinger Inc 2096Top Game Developer 4029PremiumEntertainmentApp-FamilyOriented-Android 0059Enflick 0053Clapfoot Games 0044FunPokes Inc 0018Talkatone 0015Sevenlogics Inc 0008A Star Software 0007

RTB 4Pinger PhonemdashiOSmdashConversation 1001Talkatone iPhone App 0007DiceWithBuddies 00139GAG Reader 0012Video Downloader Pro Lite 0016Rage Comics 0009Spades 3D Lite 0006Apalabrados 0003Relax Melodies HD 0002Celeb MemdashPhotoMaker 0000

a diverse collection of 40 publishers listed in Table 16A publisher can correspond to a mobile app a web-site or a set of appswebsites We selected both pub-lishers that are visited by many users and more nichepublishers with fewer visitors As can be seen fromTable 16 several types of publishers can be distin-guished the main groups are games (eg GameRe-sort Top Game Developer) and social networkingtools (eg appsfacebookcom Text Me Inc)

More specifically we create 200 samples for 40 pub-lishers 10 publishers for each of four different RTB

exchanges listed in Table 1613 The percentage ofunique users that visit each publisher in our completedata set is also shown (D) To create each samplean ldquoegordquo user is selected that has visited the pub-lisher in question (an ldquoego visitorrdquo) this ego visi-torrsquos GSN neighborhood is created Next we assesshow many of the egorsquos neighbors also visited thatsame publisher This will be compared with a base-line visit rate to produce lift and leverage values(Provost and Fawcett 2013)mdashspecifically how muchmore likely is it for the neighbors of ego visitors tovisit the publisher than would be expected by chanceWe consider three different setups that produce dif-ferent baselines

1 Considering all RTB exchanges to determine the base-line A user is considered a visitor for a given pub-lisher if she visits the publisher on any RTB ex-change (since some publishers are visible on differentexchanges) Hence the baseline visit rate is the num-ber of users (on any RTB) who visit that publisherdivided by the total number of users across all RTBexchanges This will provide optimistic lifts sincemost publishers appear only on one RTB exchangewhereas the baseline is measured over all exchanges

2 Considering one RTB exchange only to determine thebaseline A user is considered a visitor for a given pub-lisher and RTB if she visits the publisher on the sameRTB exchange This changes the baseline to the num-ber of users (on this RTB) who visit that publisherdivided by the total number of users on this RTBexchange The lift and leverage results will be worsethan in the first setting because the baseline will belarger Seeing that some publishers are seen on differ-ent RTB exchanges this setting is slightly pessimistic

3 Taking time of location visits into account In thethird setting we link users only if they visit an IPwithin the same time period This will be elaboratedon in sect73

Results are shown in Tables 17 and 18 The percent-age of publisher visitors in the neighborhood is givenas N This percentage is compared with the baselinepercentage of users that visit that publisher in thecomplete data D using two metrics the lift is the ratioof these numbers the leverage is the difference Liftis particularly useful for very small base rates (con-sider that an activity with a base rate of 30 cannotpossibly have a lift higher than about 3) Leverage ismore telling for larger base rates because it shows theabsolute improvement (here in percentage points)

When considering all NNs the lifts for all publish-ers are greater than 1 except for one Burstly on RTB 1has a lift of 097 (essentially no lift) but only when weconsider RTB 1 as the baseline This publisher already

13 For confidentiality the RTB numbering here is again differentfrom above

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers258 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

Table 17 RTBs 1 and 2 Percentage of Network Neighbors That Also Visit the Publisher

RTB 1

All NNs N1 N10

D N Lift Lev N1 Lift Lev N10 Lift Lev

Considering all RTB exchanges to determine baseline1 Burstly 3004 8063 2084 5059 47022 15054 44018 25061 8043 220572 myYearbookcom 1062 25077 15094 24015 44074 27067 43012 35016 21074 330543 My Fitness Pal 0065 4015 6038 305 3609 56077 36025 15055 23091 14094 TextMe Inc 0056 1805 33006 17094 39002 69075 38046 25018 45 240625 GameResort 0025 4093 19077 4068 18052 74026 18027 11086 47058 110626 Flixster 0019 1055 8015 1036 8014 42085 7095 1079 9042 1067 247Sports 0015 27057 188058 27042 77062 530093 77048 74053 509075 740388 Daily Workout Apps LLC 0009 1092 21047 1083 19023 215046 19014 11054 129027 110459 Mobile Deluxe 0007 1087 28067 108 23053 36009 23046 9063 14707 9056

10 YoYo Games Ltd 0004 3052 99058 3049 21057 609064 21053 13 367045 12096Considering this RTB exchange only to determine baseline

1 Burstly 8085 8063 0097 minus0022 47022 5033 38037 25061 2089 160762 myYearbookcom 4071 25077 5047 21006 44074 905 40003 35016 7046 300453 My Fitness Pal 1089 4015 2019 2025 3609 19048 35001 15055 8021 130654 TextMe Inc 1063 1805 11035 16087 39002 23094 37039 25018 15045 230555 GameResort 0073 4093 6078 402 18052 25049 17079 11086 16033 110146 Flixster 0055 1055 208 0099 8014 14071 7059 1079 3023 10247 247Sports 0043 27057 64073 27014 77062 182023 7702 74053 174097 74018 Daily Workout Apps LLC 0026 1092 7037 1066 19023 73095 18097 11054 44037 110289 Mobile Deluxe 0019 1087 9084 1068 23053 123088 23034 9063 5007 9044

10 YoYo Games Ltd 001 3052 34018 3042 21057 209025 21047 13 126012 1209

RTB 2

Considering all RTB exchanges to determine baseline1 mworldnowcom 0032 13027 40086 12095 54039 167042 54006 31065 97044 310332 mtmzcom 0024 6089 28086 6065 23008 9607 22084 10068 44075 100443 mtopixcom 0014 15007 104079 14093 53045 37106 5303 33075 234065 330614 babycentercom 0015 6087 47031 6072 43048 299056 43033 17074 122024 17065 meetmecom 0006 6002 10308 5097 54072 942093 54066 27085 47909 270796 appsfacebookcom 0006 301 48039 3003 9009 141097 9003 2076 43008 20697 itunesapplecom 0039 25077 65071 25038 51043 131012 51004 29049 75018 290098 appevitecom 0002 48028 21212043 48025 54076 21509068 54074 48028 21212043 480259 beautyandskincarescom 0003 18019 642087 18016 8048 299094 8046 1107 413049 11067

10 mmajunkiecom 0001 1036 201064 1035 26067 3195505 26066 13033 11977074 13033Considering this RTB exchange only to determine baseline

1 mworldnowcom 4034 13027 3006 8093 54039 12053 50005 31065 7029 270312 mtmzcom 3019 6089 2016 307 23008 7024 19089 10068 3035 70493 mtopixcom 1092 15007 7084 13015 53045 27081 51053 33075 17056 310834 babycentercom 1094 6087 3054 4093 43048 22042 41054 17074 9015 15085 meetmecom 0078 6002 7077 5025 54072 70057 53094 27085 35092 270076 appsfacebookcom 0086 301 3062 2024 9009 10063 8024 2076 3022 1097 itunesapplecom 5024 25077 4092 20053 51043 9081 46019 29049 5063 240258 appevitecom 0029 48028 165058 47098 54076 187082 54047 48028 165058 470989 beautyandskincarescom 0038 18019 48011 17081 8048 22045 8011 1107 30095 11032

10 mmajunkiecom 0009 1036 15009 1027 26067 296003 26058 13033 148001 13024

Notes Results are given for the complete data set (D) the complete neighborhood of a user who visited the given publisher (All NNs) and considering theneighborhood of closest 1 and 10 users (N1 and N10)

has a very high baseline and as we will see in thenext section when we consider only the top 1 and 10NNs the lifts go up to 5 and 3 respectively Burstlyprovides a generic platform for building apps and sois a weaker indication of user interest than many ofthe other publishers (fitness babies games)

All other lifts exceed one by substantial mar-gins and go up to even 2391 (Apalabrados onRTB 4 a Spanish-language social Scrabble-like game)

Obviously the conclusion that the NNs indeed aremore likely to visit the same publishers is highly sig-nificant by a simple sign test comparing the visita-tion rate within the geosimilarity neighborhood tothat of overall population either via lift or leveragep lt 0001 Thus NNs of publisher visitors are indeedmore likely also to visit the same websites the abso-lute values of the lifts and leverages suggest that theyare substantially more likely

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 259

Table 18 RTBs 3 and 4 Percentage of Network Neighbors That Also Visit the Publisher

RTB 3

All NNs N1 N10

D N Lift Lev N1 Lift Lev N10 Lift Lev

Considering all RTB exchanges to determine baseline1 Conversion Exchange 5053 34054 6024 29001 64071 11069 59017 50077 9018 450242 Pinger Inc 2096 46002 15053 43006 65012 21097 62015 51048 17037 480523 Top Game Developer 4029 47024 11002 42095 60 13099 55071 48097 11042 440684 PremiumEntertainmentApp-FamilyOriented-Android 0059 23079 40012 2302 20045 34049 19086 22025 37052 210665 Enflick 0053 11071 21095 11017 21057 40044 21004 12027 23 110736 Clapfoot Games 0044 8055 19032 8011 11054 26006 1101 13099 31061 130557 FunPokes Inc 0018 11011 61028 10093 50 275077 49082 15079 87008 150618 Talkatone 0015 13013 88051 12098 31011 209076 30096 17091 120078 170779 Sevenlogics Inc 0008 2071 32059 2063 5088 70074 508 8033 100022 8025

10 A Star Software 0007 3033 44086 3025 23008 311024 23 9009 122061 9002Considering this RTB exchange only to determine baseline

1 Conversion Exchange 11045 34054 3002 23009 64071 5065 53025 50077 4043 390322 Pinger Inc 6013 46002 705 39089 65012 10062 58098 51048 8039 450353 Top Game Developer 8088 47024 5032 38036 60 6076 51012 48097 5052 400094 PremiumEntertainmentApp-FamilyOriented-Android 1023 23079 19038 22057 20045 16066 19023 22025 18013 210025 Enflick 101 11071 1006 1006 21057 19054 20046 12027 11011 110166 Clapfoot Games 0092 8055 9033 7064 11054 12059 10062 13099 15027 130087 FunPokes Inc 0038 11011 2906 10074 50 133022 49062 15079 42007 150418 Talkatone 0031 13013 42076 12082 31011 101034 3008 17091 58035 170619 Sevenlogics Inc 0017 2071 15074 2054 5088 34017 5071 8033 48041 8016

10 A Star Software 0015 3033 21067 3017 23008 150036 22092 9009 59023 8094

RTB 4

Considering all RTB exchanges to determine baseline1 Pinger PhonemdashiOSmdashConversation 1001 71081 70085 7008 80077 79068 79076 70014 6902 690132 Talkatone iPhone App 0007 11038 17203 11031 29003 439068 28097 17002 257078 160963 DiceWithBuddies 0013 7041 59004 7028 18018 144092 18006 7069 61031 70574 9GAG Reader 0012 62013 497099 62 49009 39305 48097 49005 393017 480925 Video Downloader Pro Lite 0016 20003 123015 19087 22058 138084 22042 17093 110027 170776 Rage Comics 0009 609 8001 6081 13033 154086 13025 609 8001 60817 Spades 3D Lite 0006 1205 225076 12044 27027 492057 27022 1205 225076 120448 Apalabrados 0003 77042 21391071 77039 75 21316097 74097 79032 21450043 790299 Relax Melodies HD 0002 31058 21002029 31056 50 31170029 49098 31058 21002029 31056

10 Celeb MemdashPhotoMaker 0 3023 82107 3022 7014 11819048 7014 3023 82107 3022Considering this RTB exchange only to determine baseline

1 Pinger PhonemdashiOSmdashConversation 8051 71081 8044 6303 80077 9049 72026 70014 8024 610632 Talkatone iPhone App 0055 11038 20053 10082 29003 52038 28048 17002 30071 160473 DiceWithBuddies 0009 7041 82011 7032 18018 201054 18009 7069 85027 7064 9GAG Reader 0097 62013 63097 61016 49009 50055 48012 49005 50051 480085 Video Downloader Pro Lite 0044 20003 45036 19059 22058 51014 22014 17093 40062 170496 Rage Comics 0037 609 18079 6053 13033 36032 12097 609 18079 60537 Spades 3D Lite 0046 1205 2609 12004 27027 58068 26081 1205 2609 120048 Apalabrados 0021 77042 371093 77021 75 360031 74079 79032 381006 790119 Relax Melodies HD 0013 31058 238054 31045 50 377069 49087 31058 238054 31045

10 Celeb MemdashPhotoMaker 0003 3023 97089 3019 7014 216076 7011 3023 97089 3019

Notes Results are given for the complete data set (D) the complete neighborhood of a user who visited the given publisher (All NNs) and considering theneighborhood of closest 1 and 10 users (N1 and N10)

The very high lifts could be attributed to two phe-nomena They may provide some evidence that theGSN embeds a true social network as pointed out bythe good results for the social networking publishers(eg Apalabrados with a lift of 2391 appsevitecomwith a lift of 2212 and others) Friends visiting thesame social networking sites (here unusual ones)

would increase the lift and leverage over just havingdifferent instances of the same user visiting the samewebsites An alternative explanation is that some ofthese publishers are much more likely to be visitedacross different devices of the same user It is not clearwhether this is actually true However it argues forobtaining specific and broad data on which devices

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers260 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

are not the same user to shed further light on forexample any broad data that include a (anonymized)user identifier

From these results we can conclude that GSNneighbors indeed exhibit substantially similar pub-lisher visitation behavior Next we assess to whatextent more strongly connected NNs are more similarthan less strongly connected NNs

72 Ranking Users with Similar Interests

Are the More-Similar Geosimilarity Neighbors of aPublisherrsquos Visitors Even More Likely Also to Be Vis-itors to the Same Publisher To assess whether more-similar geosimilarity neighbors are even more likelyto share interests we compute the measures exactlyas in the previous study except instead of consider-ing all NNs only the 1 (N1) and 10 (N10) neighborswith the highest scores are considered (see Tables 17and 18) First we must consider which of the 15 met-rics that are defined to measure link strength to useTable 19 reports the rank aggregations of the differentmetrics as to which provide the best lifts Specificallyfor each of the 40 publishers each scoring metric pro-vides a lift and for that publisher the scoring metricscan be ranked (the best being ranked 1 etc) The rankaggregation is the average rank for a scoring metricacross all 40 publishers Thus if one metric providedthe best lift for all publishers it would get an aggre-gated rank of 1

To visualize the best metrics the metrics with anaverage ranking less than 8 are shown in boldfacein Table 19 The Jaccard metrics do not perform wellin this study either showing that penalizing the con-nection to users that visit many locations is not sen-sible Remember that the Jaccard metric is based onthe idea that users that visit many locations will havea higher chance to share some location with anotheruser However we have so many locations in totalthat visiting a couple more locations will increase onlymarginally this probability of sharing a location bychance A user that logs into more IP addresses andis more active should hence not be penalized andranked lower than other less active users Metric 3which sums the IDF scores of the shared locationsperforms quite well Given the operational efficiencyadvantage over the frequency-based metrics this met-ric is chosen to measure the strength of the links

As seen from the results in Tables 17 and 18 thelift and leverage values for the strongly connected (top 1and top 10) NNs are even higher sometimes astro-nomical (3170 for Relax Melodies HD (RTB 4)) Theseresults are highly significant by sign tests compar-ing either the lift or the leverage between the top-ranked neighbors and the entire cohort (p lt 0001) Theabsolute values again indicate that the users with the

strongest geosimilarity are often substantially morelikely to visit the same publisher

Thus we can conclude that not only does geosim-ilarity find users with similar interests it also ranksusers well by their likelihood of having similarinterests

An interesting follow-up question is whether thecharacteristics of a publisher are of importance for theresults For example are GSN neighbors more sim-ilar in terms of using the same social network appas compared to playing the same games To answerthis question the publishers are categorized into sixclasses funny social news games communicationand miscellaneous In Figure 9 the lifts of the pub-lishers are shown per category Some general trendscan be observed where publishers related to funnycontent have the highest median lift This could beexplained by the fact that users often forward funnycontent to their friends The social websitesapps fol-low next which seems to further demonstrate thatfriends are likely linked in the GSN However pub-lishers in the communication category perform not sowell compared with the other categories and ratherhigh variances are observed in the different cate-gories Please note that high variance is observedacross the categories therefore it is difficult to makewell-supported claims from these results

73 TimingAs a final evaluation we explore whether the inclu-sion of temporal similarity in location visitation ishelpful for defining GSN connections To this end weintroduce a setting where two users are connectedonly if they visit an IP address in the same timeperiod of a day Specifically we divide a day intodifferent time windowsmdashfor example two time peri-ods of 12 hoursmdashwhere users are connected only ifthey visit the same IP address in the same time periodof a day (which might be on different days) Practi-cally two digits are added to an IP which denote thestarting moment of the corresponding time periodWe report on time periods of 2 hours and of 12 hours(other periods yield similar results) We also includethe setting where users are connected only when theyvisit the same IP address on the same day in light ofthe result of Crandall et al (2010) that visiting thesame location at around the same time is indicativeof friendship

The resulting lifts are reported in Figures 10ndash12when considering respectively all NNs the top 1 NNand the top 10 NNs The results when not consider-ing the time period are repeated as well (All) lim-ited to the conservative case of only considering oneexchange (see above) The baseline (lift of one) isindicated with a red horizontal line lifts higher than1000 are shown as 1000 to limit the range to be able

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 261

Table 19 Rank Aggregation per Metric Over All Publishers (Across All RTBs)

Top 1 NN Top 10 NNs

1 Shared unique IPs 801 1 Shared Unique IPs 802252 BooleanmdashSharing an IP 80625 2 BooleanmdashSharing an IP 809133 Shared unique IPsmdashIDF weighted 70025 3 Shared unique IPsmdashIDF weighted 709754 Shared IP visitsmdashMin 70063 4 Shared IP visitsmdashMin 704135 Shared IP visitsmdashMax 7095 5 Shared IP visitsmdashMax 70256 Shared IP visitsmdashAvg 8005 6 Shared IP visitsmdashAvg 704887 Shared IP visitsmdashMinmdashIDF weighted 60625 7 Shared IP visitsmdashMinmdashIDF weighted 703758 Shared IP visitsmdashMaxmdashIDF weighted 70975 8 Shared IP visitsmdashMaxmdashIDF weighted 702259 Shared IP visitsmdashAvgmdashIDF weighted 70938 9 Shared IP visitsmdashAvgmdashIDF weighted 70525

10 Cosine similarity 80338 10 Cosine similarity 804511 Cosine similaritymdashIDF weighted 80213 11 Cosine similaritymdashIDF weighted 8086312 GSN Jaccard 80925 12 GSN Jaccard 913 GSN JaccardmdashIDF weighted 80325 13 GSN JaccardmdashIDF weighted 9013814 GSN Jaccard Freq 80425 14 GSN Jaccard Freq 80515 GSN Jaccard FreqmdashIDF weighted 80425 15 GSN Jaccard FreqmdashIDF weighted 805

to visualize the results Tables 20 and 21 show theaverage rank aggregation (see the discussion of rankaggregation above) in terms of lift per RTB and thep-values of a signed rank test respectively

When comparing the different temporal settingswe see that the ldquoDayrdquo variant performs best and lim-ited differences exist between the time periods of 2and 12 hours More important to observe from theseresults is that (overall) not including time in the GSNdesign outperforms all three temporal variants sub-stantially Only for RTB 3 and the Day version whereusers are connected only if they visit the same IPaddress on the same day are the lifts not signifi-cantly worse than having no time constraint at a 1level (although even here they are significantly worseat a 10 level) This shows that for finding userswith similar interests which locations they visit ismore important than when they visit them By includ-ing the time constraints connections between previ-ously connected users are lost simply because they

Figure 9 (Color online) Lifts of the Publishers (for All Four RTBs)Ranked According to the Median Lift Within the Category(Medians Shown by the Full Line)

visit locations in different time periods Although weuse time to define links metrics that include time todefine the strength of a link might improve the resultsfurther and we consider that an interesting issue forfuture research

8 ConclusionThis paper presents a new design for using geosim-ilarity to connect user instances in a geosimilarity-weighted network In the GSN users are connected ifthey share at least one observed location Various met-rics can be used to yield degrees of similarity We pre-sented intuitive arguments theory and prior researchthat suggest that geosimilarity and the GSN designshould link similar users We also provided strongempirical support Specifically Study 1 shows thatthe GSN indeed links different users instances cor-responding to the same individual and the geosimi-larity metrics rank user instances by their likelihoodof corresponding to the same individual Study 2shows that the GSN links users who have similarinterests as measured by their propensity to visitthe same publishersuse the same apps It also pro-vides strong evidence that the geosimilarity metricsrank user instances by their likelihood of havingsimilar interests Taken together the results providestrong support that the GSN links users with similarinterestsmdashin many cases because they actually repre-sent the same individuals

The main limitation of the research presented inthis paper is that the data are not sufficient to dis-tinguish whether the similarity in interests is solelybased on connecting instances of the same user Thisis an important avenue for future research This lim-itation notwithstanding the very strong results fromStudy 2 have important implications in either caseeither the network is indeed finding different indi-viduals with substantially similar interests or we

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers262 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

Figure 10 (Color online) Lifts Considering All Network Neighbors per RTB System (Each Panel) With and Without the Time Constraint Limited toOne Network (see sect71) for the 40 Publishers

RTB 3mdashAll NNs

0 5 10

Publisher

0

20

40

60

80

100

Lift

RTB 4mdashAll NNs

0 5 10

Publisher

0

200

400

600

800

1000

Lift

RTB 2mdashAll NNs

0 5 10

Publisher

0

200

400

600

800

1000

Lift

0 5 100

200

400

600

800

1000RTB 1mdashAll NNs

Publisher

Lift

All

TP2

TP12

Day

Note Lifts higher than 1000 are shown at 1000

Figure 11 (Color online) Lifts Considering the Closest 1 Network Neighbor per RTB System (Each Panel) With and Without the Time ConstraintLimited to One Network (see sect71) for the 40 Publishers

Lift

0

100

200

400

300

RTB 3mdashTop 1 NNs

0 5 10

Publisher

All

TP2

TP12

Day

0 5 100

200

400

600

800

1000RTB 1mdashTop 1 NNs

Publisher

Lift

RTB 2mdashTop 1 NNs

0 5 10

Publisher

0

200

400

600

800

1000

Lift

RTB 4mdashTop 1 NNs

0 5 10

Publisher

0

200

400

600

800

1000

Lift

Note Lifts higher than 1000 are shown at 1000

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 263

Figure 12 (Color online) Lifts Considering the Closest 10 Network Neighbors per RTB System (Each Panel) With and Without the Time ConstraintLimited to One Network (see sect71) for the 40 Publishers

RTB 2mdashTop 10 NNs

0 5 10

Publisher

0

200

400

600

800

1000

Lift

0 5 10

Publisher

Lift

0

50

100

150

0 5 10

Publisher

0

200

400

600

800

1000

Lift

0 5 100

200

400

600

800

1000

RTB 1mdashTop 10 NNs

RTB 3mdashTop 10 NNs RTB 4mdashTop 10 NNs

Publisher

Lift

All

TP2

TP12

Day

Note Lifts higher than 1000 are shown at 1000

have very strong additional support for the rankingresults of Study 1 as the neighbors with the strongestgeosimilarity have significantly higher affinity for thesame publishersapps

These results have broad and immediate manage-ment implications within our motivating applicationof mobile advertising As discussed above digitalmarketers would like to target the many fragmentedinstances of individuals in the digital advertising

Table 20 Rank Aggregations in Terms of Lift AveragedAcross All Publishers (per RTB)

All TP2 TP12 Day

RTB 1 101 209 309 201RTB 2 1 204 304 302RTB 3 101 3 4 109RTB 4 101 208 308 203

Note Linking based on time does not improve over linkingignoring time

Table 21 Evaluation of Including Time p-Values ofSigned Rank Test

All TP2 TP12 Day

RTB 1 mdash 0 0002 0 0002 0 0006RTB 2 mdash 0 0002 0 0002 0 0002RTB 3 mdash 0 0002 0 0002 00065RTB 4 mdash 0 0002 0 0002 0 0004

ecosystem as well as users who have similar intereststo particular ldquoseedrdquo users Location data such asIP addresses are readily available across the adver-tising ecosystem The most direct use of the GSN formobile ad targeting is to seed a targeting campaignwith users chosen to exhibit some characteristics ofinterest These could be chosen through any of themyriad of methods currently used in advertising TheGSN will directly allow the targeting of the othermembers of these usersrsquo geosimilarity cohorts whichwill include other instances of the same user othersimilar users and likely both So for example if a setof users has been identified to have brand affinity viavisits to a brandrsquos website the GSN cohorts of theseusers would provide an attractive avenue to expandthe reach of a campaign to target them (extendingtraditional retargeting) and also target users similarto the seeds (customer prospecting) In the researchdata set we see that different instances of the sameusers are very often connected in the GSN and thatthose connected in the GSN exhibit substantially sim-ilar interests

The GSN also could be used for (privacy-friendly)hyperlocal targetingmdashmeaning targeting people whofrequent a particular location without needing to storedata on the actual locations of the users as describedearlier IP addresses currently are used by somemarketers for coarse-grained demographic targetinginferring geography from IP address registration data

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers264 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

and then inferring demographics from the geographyThe GSN provides a complementary alternative it willconnect people who visit the same fine-grained loca-tions As described above in the couponing exampleif a campaign is seeded with customers of a localestablishment (eg via an online loyalty program)geosimilarity can target others who frequent the samelocations

An exciting potential future use of the GSN is theevaluation of marketing campaigns across differentchannels This has become an important challenge formobile advertizing (Lee 2014)14 For example I mightreceive an ad for a hot new product on my mobiledevice I may be interested in the product but Irsquomunlikely to buy this product on my mobile phoneRather Irsquoll buy the product using my laptop or PC Bylinking the screens of the same user we are now ableto provide a broader (and possibly much more robust)estimate of campaign success (eg conversion rate)across different channels by aggregating the successmetrics across GSN neighbors For example similar tohow a traditional ad campaign may runnot run anad in a controlled study across different cities agen-cies could targetnot target different (matched) usersand then look at the success rates in their respec-tive geosimilarity neighborhoods This should captureeffects of the same individual on different devices

As discussed at the outset advertisers need to becognizant of privacy concerns regarding the collec-tion storage and use of data This paper followswhat the FTC calls the ldquoprivacy by designrdquo approachBy design the technique does not need to store anydirect personal information about mobile users thereis no need for nonanonymized identifiers demo-graphics geographics psychographics etc In addi-tion the storage of indirect information about userscan be severely limited as well The method doesnot need the actual locationsmdashonly anonymized loca-tion ldquokeysrdquo So for example IP addresses can bereplaced with random numbers15 without affectingthe GSN performance (European Commission 2014Federal Trade Commission 2012) More technically atthe ldquoouter wallrdquo of the system or firm each deviceid can be irreversibly hashed to a random key Theonly requirement is that the same device be hashedto the same key if encountered again Similarly atthe ldquoouter wallrdquo of the system or firm every loca-tion also can be irreversibly hashed to a random keyThe GSN can be formed just the same with the ran-dom keys as with the actual locations If more privacy

14 httpwwwclickzcomclickznews233699680-of-marketers-will-run-cross-channel-marketing-campaigns-in-2014-study15 See both the FTCrsquos and the European Commissionrsquos 2010ndash2011privacy-by-design reports and the comments therein along withProvost et al (2009) for a thorough treatment of this topic

is desired hashing can be done irreversibly many toone and in that case it becomes impossible to asso-ciate definitively any particular location with any par-ticular deviceuser

Finally an aspect of this research that is importantfor managers but not typically covered in the predic-tive modeling literature is the notion of increasing thereach of a campaign For example as discussed abovepossibly the most straightforward use of the GSN isto select the same actor on different mobile devicesThis would allow us to expand the reach of a ldquoretar-getingrdquo campaign16

Increasing reach has important subtleties in the cur-rent advertising ecosystemmdashit is not just the otherside of the coin of increasing lift The reason is thatthere are many targeters in the onlinemobile adver-tising ecosystem all of whom are using the samedata retargeting data and demographicgeographicpsychographic data that are purchased from third-party data providers However these data are avail-able only on a subset of devices Thus all partieswho are using these data are competing for the samesometimes small set of devices Since large advertis-ers typically contract with multiple targeting firmswho take different strategies the effect is that theadvertiser is paying the firms to compete against eachother effectively raising the price in the auction andthereby raising their own cost of advertising17 Whatis more these will be the same devices that also willbe targeted for other campaigns because they are thedevices for which the targeters have data

However by connecting devices in a GSN we canexpand campaigns to devices for which there are noretargeting or third-party data available at all If thereis less competition for these devices we should beable to target them for a lower price Thus expand-ing reach has an implication for the cost-effectivenessof achieving a certain level of predictive performance(eg lift)

AcknowledgmentsThe authors thank the anonymous reviewers and edi-tors for very constructive suggestions which substantiallyimproved the paper This work was done while the firstauthor was at Coriolis Ventures and Everyscreen MediaThe authors extend our gratitude to Everyscreen MediaDstillery Lauren Moores Tom OrsquoReilly Vinayak Javaly andTina Eliassi-Rad for many discussions about mobile adver-tising and mobile ad targeting as well as technical assistance

16 Retargeting is to target browsers who have previously purchasedfrom the brand or who have taken some other indicative brandaction such as browsing the brandrsquos site Retargeting is consid-ered by many to be one of the most effective targeting strategies(although to our knowledge these conclusions are drawn based onassessing conversion rate rather than the influence of the advertis-ing please see Stitelman et al 2011)17 The latter is our conjecture

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 265

to obtain all of the relevant data This particular data set wasnot necessarily used in the development of any productionmodel used for mobile advertising The first author thanksNEC and Andre Meyer for faculty fellowships The authorsthank the Moore and Sloan Foundations for their generoussupport of the Moore-Sloan Data Science Environment atNew York University

ReferencesAgarwal R Gupta AK Kraut RE (2008) Editorial overviewmdashThe

interplay between digital and social networks Inform SystemsRes 19(3)243ndash252

Aral S Muchnik L Sundararajan A (2009) Distinguishing influence-based contagion from homophily-driven diffusion in dynamicnetworks Proc National Acad Sci USA 106(51)21544ndash21549

Bampo M Ewing MT Mather DR Stewart D Wallace M (2008)The effects of the social structure of digital networks on viralmarketing performance Inform Systems Res 19(3)273ndash290

Cho E Myers SA Leskovec J (2011) Friendship and mobility Usermovement in location-based social networks Proc 17th ACMSIGKDD Internat Conf Knowledge Discovery and Data MiningKDD rsquo11 (ACM New York) 1082ndash1090

Cortes C Pregibon D Volinsky CT (2001) Communities of interestHoffmann F Hand DJ Adams N Fisher D Guimaraes G edsAdvances in Intelligent Data Analysis Lecture Notes ComputSci Vol 2189 (Springer-Verlag Berlin Heidelberg) 105ndash114

Crandall DJ Backstrom L Cosley D Suri S Huttenlocher DKleinberg J (2010) Inferring social ties from geographic coinci-dences Proc National Acad Sci USA 107(52)22436ndash22441

de Montjoye YA Hidalgo CA Verleysen M Blondel VD (2013)Unique in the crowd The privacy bounds of human mobilitySci Rep 31ndash5

European Commission (2014) Data protection day 2014 Full speedon EU data protection reform Press release January 27httpeuropaeurapidpress-release_MEMO-14-60_enhtm

Fawcett T (2006) An introduction to ROC analysis Pattern Recogni-tion Lett 27(8)861ndash874

Federal Trade Commission (2012) Protecting consumer pri-vacy in an era of rapid change FTC Report March 2012httpswwwftcgovsitesdefaultfilesdocumentsreportsfederal-trade-commission-report-protecting-consumer-privacy-era-rapid-change-recommendations120326privacyreportpdf

Heidemann J Klier M Probst F (2010) Identifying key users inonline social networks A pagerank based approach InformSystems J 4801(December)157ndash160

Hill S Provost F (2003) The myth of the double-blind reviewAuthor identification using only citations SIGKDD Explorations5(2)179ndash184

Hill S Provost F Volinsky C (2006) Network-based marketingIdentifying likely adopters via consumer networks Statist Sci22(2)256ndash276

Hotho A Nuumlrnberger A Paass G (2005) A brief survey of text min-ing LDV Forum 20(1)19ndash62

Kerho S (2012) Mobile marketingmdashA new analytics frameworkwhat we have and what we need Presentation Marketing onthe Move Conference Wharton School University of PennsylvaniaPhiladelphia February 28

Kossinets G Watts DJ (2009) Origins of homophily in an evolvingsocial network Amer J Sociol 115(2)405ndash450

Lee J (2014) 80 of marketers will run cross-channel marketingcampaigns in 2014 ClickZ (March 31) httpwwwclickzcomclickznews233699680-of-marketers-will-run-cross-channel-marketing-campaigns-in-2014-study

Martens D Provost F (2011) Pseudo-social network targeting fromconsumer transaction data Working paper CEDER-11-05 SternSchool of Business New York University New York

McPherson M Smith-Lovin L Cook JM (2001) Birds of afeather Homophily in social networks Annual Rev Sociol 27415ndash444

Oinas-Kukkonen H Lyytinen K Yoo Y (2010) Social networks andinformation systems Ongoing and future research streamsJ Assoc Inform Systems 11(2)61ndash68

Pan W Aharony N Pentland A (2011) Composite social networkfor predicting mobile apps installation Burgard W Roth Deds Proc AAAI 2011 (AAAI Press San Francisco)

Perlich C Dalessandro B Raeder T Stitelman O Provost F (2014)Machine learning for targeted display advertising Transferlearning in action Machine Learn 95(1)103ndash127

Pogue D (2011) Wrapping up the apple location brouhaha NewYork Times (April 28) httppogueblogsnytimescom20110428wrapping-up-the-apple-location-brouhaha

Provost F Fawcett T (2013) Data Science for Business What You Needto Know About Data Mining and Data-analytic Thinking (OrsquoReillyMedia Inc Sebastopol CA)

Provost F Dalessandro B Hook R Zhang X Murray A (2009) Audi-ence selection for on-line brand advertising Privacy-friendlysocial network targeting KDD rsquo09 Proc 15th ACM SIGKDDInternat Conf Knowledge Discovery and Data Mining (ACMNew York) 707ndash716

Pubmatic (2010) Understanding real-time bidding (RTB) from thepublisherrsquos perspective Technical report Pubmatic SiliconValley CA

Quercia D Lathia N Calabrese F Di Lorenzo G Crowcroft J (2010)Recommending social events from mobile phone location dataProc 2010 IEEE Internat Conf Data Mining ICDM rsquo10 (IEEEComputer Society Washington DC) 971ndash976

Raeder T Stitelman O Dalessandro B Perlich C Provost F (2012)Design principles of massive robust prediction systems Proc18th ACM SIGKDD Internat Conf Knowledge Discovery and DataMining (ACM New York) 1357ndash1365

Stitelman O Dalessandro B Perlich C Provost F (2011) Estimatingthe effect of online display advertising on browser conversionData Mining and Audience Intelligence for Advertising ADKDD2011 (ACM New York) 8

Stitelman O Perlich C Dalessandro B Hook R Raeder T Provost F(2013) Using co-visitation networks for detecting large scaleonline display advertising exchange fraud Proc 19th ACMSIGKDD Internat Conf Knowledge Discovery and Data Mining(ACM New York) 1240ndash1248

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Page 8: Finding Similar Mobile Consumers with a Privacy-Friendly ...pages.stern.nyu.edu/~fprovost/Papers/PMM_ISR_2015.pdf · Provost, Martens, and Murray: Finding Similar Mobile Consumers

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 249

Location data in the form of latitude and longitude(Location) are only available in some RTB systemsand the accuracy of the latitudelongitude data in thecurrent mobile ecosystem is questionable8 again mak-ing it not suitable for broad use

32 Entity and Link RepresentationWe represent each user instance by a ldquoprofilerdquo ofbehavior across locations where we define a locationprofile as a vector of visit frequencies This will neces-sarily require a sparse representation as we observe avery large total set of locations fortunately each indi-vidual is seen at only a small subset of the locationsFor the empirical results below the location profile isa sparse vector of frequencies of observed visitationsto locations (which can be weighted in the calculationof similarity as described below)

The structure of the GSN is the graph with mobiledevices as the nodes and links between the devicesas the edges The corresponding design decisionsinvolve the selection of which devices to link at alland the weights to place on the links The simplestlink representation places unweighted links betweenall pairs of devices that share at least one specific loca-tion More sophisticated weighting criteria are basedon two notions which are discussed next and illus-trated with a simple example in Figure 4 The loca-tion data are obtained by listening to RTB ecosystemswhere IP addresses of devices are broadcast (as shownpreviously in Figure 1)

1 Given two devices that share at least one loca-tion the weight of the link between them can bea function of the devicesrsquo location profiles Such mea-sures can include simply the number of shared loca-tions and the number of shared locations weightedby their visitation percentage in the profile Alterna-tively the weighting could be a measure of similaritybetween the two location profiles for example cosinesimilarity

2 As such Devices A B and C are connected in theGSN of Figure 4 since they all logged into the Wi-Fiof the Met Museum Devices B and C also loggedinto the public Wi-Fi at Times Square and thereforeshould be more strongly connected than A and B (or Band C) which share only one location

8 Analyses of the latitudelongitude data show various data prob-lems including flipped coordinates coordinates set equal to eachother many latitudinallongitudinal points at the centroids of com-mon geographic areas and many impossible or highly improba-ble coordinates Mobile ad industry insiders claim that the highprice paid for latitudelongitude-based hyperlocal targeting hasled to a significant amount of latitudelongitude fraud We havefound nothing written on this phenomenon although data fraudis not uncommon in the online advertising industry (Stitelmanet al 2013)

Figure 4 (Color online) Basic Illustration of the Creation of a GSNBased on Mobile Location Data Observed for Three MobileDevices (A B and C) and a Laptop (D)

Met MuseumA B

Geosocial network(GSN)

A C

D

B

C

D

Times Square

Apartment

Note Devices are connected in the GSN if they visited (logged into) the samelocation (IP address)

3 Links can be weighted according to the componentlocationsrsquo 4lack of5 popularity For example it mightbe argued that the link formed by sharing a loca-tion with a small number of other devices (eg myapartment) indicates a stronger geosimilarity thansharing a location with a massive number of otherdevices This intuition can be extended if two devicesspend a lot of time at such an unpopular locale(eg my apartment) then that should indicate a verystrong similarity If two devices have approximatelythe same distributional profile across several of thesesites (their ldquolocation profilerdquo) then they are eitherthe same person or close friendssoulmates In ourexample C and D are connected because they bothlogged into the Wi-Fi of an apartment whereas Band C are connected by logging into the public (pop-ular) Wi-Fi at Times Square Hence the connectionbetween C and D (which are likely devices belongingto roommates or belonging to the same user) shouldbe stronger than the connection between B and C(which are likely devices belonging to two differenttourists)

This notion of similarity can be incorporated tech-nically by (1) adapting notions from information re-trieval we can weight locations for a given deviceby their device-specific popularity (from the devicersquoslocation profile) divided by the (log of) the locationrsquosoverall popularity let us call that LF times IDF (locationfrequency times inverse device frequency) Then (2) thestrength of the link between two devices that sharea location would be a function of the correspondingLF times IDF scores

Once we have one or more GSNs there are variousways to take advantage of them for targeting or otherinference The simplest method for inference is sim-ply to target all of the geosimilar network neighborsAlternatively one could target the ldquoclosestrdquo neighborsbased on one or more notions of geosimilarity (as dis-

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers250 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

cussed above) These are the methods used for theempirical results presented below

4 Specifics of the GSN DesignAs described above in this design the strength of thelink between two individuals in the GSN is based onthe similarity in the distribution of the locations thatthey visit For this paperrsquos results locations will be(anonymized) IP addresses We now define differentsimilarity measures of varying complexity which willbe evaluated in the following sections

41 NotationThe location profile of a user is represented by a vec-tor Ex of length m where element i denotes the num-ber of visits to IP address IPi2 Ex isin m Vector Exbool isa binary vector of size m where element i denoteswhether the user has visited IPi For similarity com-putations both the Hadamard product () and theinner product 4 middot 5 are used and the minimum (min)operator is defined to operate componentwise on twovectors as well as on a vector and a scalar The max-imum (max) and average (avg) operators are similarTheir logic is defined in Equations (1)ndash(4) The weightof an IP address is defined by its popularity in anal-ogy to the inverse document frequency used in text

Table 2 Similarity Metrics Between Two Userrsquos Location Visit Distributions Ex1 and Ex2

Similarity metric Range d4 Ex11 Ex25

1 scount4 Ex11 Ex25= Exbool11 middot Exbool12 6015 1

2 sbool4 Ex11 Ex25= min4scount4 Ex11 Ex25115 80119 1

3 scount1W 4 Ex11 Ex25= 4 Ew Exbool115 middot Exbool12 6015 103

4 sfreq1min4 Ex11 Ex25= min4 Ex11 Ex251 6015 2

5 sfreq1max4 Ex11 Ex25= max4 Exbool12 Ex11 Exbool11 Ex251 6015 3

6 sfreq1 avg4 Ex11 Ex25= avg4 Exbool12 Ex11 Exbool11 Ex251 6015 205

7 sfreq1min1W 4 Ex11 Ex25= min4 Ex1 Ew1 Ex2 Ew51 6015 206

8 sfreq1max1W 4 Ex11 Ex25= max44 Exbool12 Ew5 Ex11 4 Exbool11 Ew5 Ex251 6015 309

9 sfreq1 avg1W 4 Ex11 Ex25= avg44 Exbool12 Ew5 Ex11 4 Exbool11 Ew5 Ex251 6015 3025

10 scosine4 Ex11 Ex25=Ex1 middot Ex2

Ex11 middot Ex21

60117 0074

11 scosine1W 4 Ex11 Ex25=4 Exbool11 Ew5 middot 4 Exbool12 Ew5

Exbool11 Ew1 middot Exbool12 Ew1

60117 0060

12 sJaccard4 Ex11 Ex25=scount4 Ex11 Ex25

max4 Exbool111 Exbool1251

60117 0033

13 sJaccard1W 4 Ex11 Ex25=scount1W 4 Ex11 Ex25

max4 Exbool11 Ew1 Exbool12 Ew51

60117 0026

14 sJaccard1 freq4 Ex11 Ex25=sfreq1min4 Ex11 Ex25

max4 Ex1 Exbool121 Ex2 Exbool1151

60117 0067

15 sJaccard1 freq1W 4 Ex11 Ex25=sfreq1min1W 4 Ex11 Ex25

max44 Ex1 Exbool125 Ew1 4 Ex2 Exbool115 Ew51

60117 0067

mining (Hotho et al 2005) the weight wi for IPi (theinverse device frequency) is defined as the logarithmof the total number of users n divided by the numberof (unique) users that visit that specific IP address ni

Ez= Ex Ey1 zi = xi times yi1 (1)

a= Ex middot Ey =sum

i=12m

4xi times yi51 (2)

Ez= min4Ex1 Ey51 zi = min4xi1yi51 (3)

Ez= min4Ex1a51 zi = min4xi1 a51 (4)

wi = log104nni50 (5)

42 Link Strength MetricsFifteen similarity metrics are defined in Table 2 andare illustrated by measuring the strength betweenthe location profiles of two example users Ex1 and Ex2shown in Equation (6) The first user visits IP1 threetimes and IP2 twice whereas the second one visits IP1twice and IP3 once Several variants of the same met-rics are considered As described above some onlytake into account the unique number of shared loca-tions whereas others take into account the frequencyof the visits to these locations and the inverse devicefrequency The latter are denoted by freq and W respectively

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 251

The first similarity metric scount counts the numberof shared locations Since the two example users shareonly IP1 this strength is 1 The IDF weighted versionscount1W sums the IDF values of the shared locationswhich is 13 for our simple example The most basicmetric is given by sbool which is 1 when a location isshared and 0 otherwise indicating whether two usersare neighbors or not

The sfreq1min sfreq1max and sfreq1avg metrics compare thefrequencies of the visits to the shared locations Theytake the minimum maximum and average respec-tively of the frequencies for each shared location andsum them The weighted versions sfreq1min1W sfreq1max1W and sfreq1avg1W do the same but weight each frequencywith the corresponding IDF value

The scosine metric measures the similarity by takingthe cosine of the angle between the two vectors TheJaccard metric is defined as the size of the intersec-tion over the size of the union In this case it countsthe number of shared locations but normalizes theseover the total number of locations both users havevisited Users that visit many locations will have ahigher chance to share some location with anotheruser so the Jaccard metric penalizes them For ourbasic example sJaccard is 1 (IP1) over 3 (IP1 IP2 IP3)Once more variants taking into account the frequencyand the IDF are defined as well

Ex1 = 63 2 071

Ex2 = 62 0 171 (6)

Ew = 6103 107 270

5 Experimental Setup51 Data SetThe data set D for the empirical results is 10 days ofanonymized advertising bid requests from mobile de-vices (smartphones tablets etc) observed across sevenRTB exchanges A total of 322770794 bid requestsare observed coming from 42437559 unique userinstances Over these 10 days we observe 41829088unique IP addresses

The number of unique user instances per RTBexchange is given in Table 3 The distributions of thenumber of bid requests per device type and operatingsystem are given in Tables 4 and 5 which show thatthe majority of bid requests come from smartphonesand the iOS operating system

A user (instance) corresponds to an exchange-assigned user ID One person on several deviceswill hence correspond to several users One deviceon several exchanges will often correspond to sev-eral users Also it might be that one ad exchangesometimes provides different user IDs to the same

Table 3 User Instances per RTB Exchange

Exchange Volume Percentage ()

RTB 1 311751979 6RTB 2 211591686 4RTB 3 1415661098 29RTB 4 2015011243 41RTB 5 212751015 5RTB 6 510551815 10RTB 7 212231418 4

Table 4 Distribution of Bid Requests per Device Type

Type Volume Percentage ()

Phone 29213581772 91Tablet 3010161334 9

Table 5 Distribution of Bid Requests per OperatingSystem

Type Volume Percentage ()

Android 12913151617 4001iOS 19219231648 5908Windows 5311529 002

device for example when the device is using differ-ent particular apps As described above an impor-tant task in online advertising is to be able to targetadvertisements to different instances of the same indi-vidual with reasonable probability regardless of thecomplexities of identification in the baroque onlineadvertising ecosystem Figure 5 visualizes a sample oflatitudes and longitudes broadcast by mobile devicesin our data set

52 SamplingFor the results in this paper we apply the GSN anal-ysis to data samples Each sample corresponds to acohortneighborhood and is built as follows A sin-gle (random) ldquoegordquo user is chosen user X aroundwhich a neighborhood is built Every neighbor of theego user must share at least one IP address so webegin by obtaining all (anonymized) IP addresses vis-ited by user X Next all other users that visited atleast one of these IP addresses (the neighbors) areadded The ego must have at least one neighbor to beincluded in this analysis Finally we obtain the com-plete location distribution of each of these neighborsWhen the next sample is chosen we ensure that noneof the users already included in the previous samplesis taken as the ego user (X) As such we ensure thatwe do not have repeated egondashneighbor dyads Theremay be shared neighbors of two different egos buttwo egos will never be immediate neighbors in thedata sample

The result of the sampling over 500 cohorts is summa-rized in Figure 6 on average an ego has 71 neighbors

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers252 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

Figure 5 Scatter Plots of Samples of Latitudes and Longitudes Broadcast by Mobile Devices

ndash130 ndash120 ndash110 ndash100 ndash90 ndash80 ndash70 ndash6020

25

30

35

40

45

50

55(a)

Long

Lat

ndash20 ndash10 0 10 20 30 40 50

Long

30

35

40

45

50

55

65

60

75

70

Lat

ndash200 ndash150 ndash100 ndash50 1000 50 150 200

Long

ndash100

ndash80

ndash60

ndash40

ndash20

0

40

20

100

60

80

Lat

4068ndash7406 ndash7404 ndash7402 ndash7400 ndash7394ndash7398 ndash7396 ndash7392 ndash7390

Long

4070

4072

4074

4076

4080

4078

4086

4082

4084

Lat

(b)

(c) (d)

Note Note that locations are plotted without reference to any map The figure gives a striking picture of population density across the world and makes uswonder what is going on with mobile devices in Antarctica

Figure 6 (Color online) The Distribution of the Number of Neighborsfor 500 Randomly Sampled Ego Users

0 500 1000 1500 2000 2500 3000 35000

50

100

150

200

250

300

350

400

450

No neighbors

Fre

quen

cy

the median being 1 The maximum number of neigh-bors is 3167 this user probably visited a publicWi-Fi that thousands of other users also logged intoWe chose 500 cohorts as we observe that the statis-

tics become quite stable as seen in Figure 7 Even200 cohorts seems to be enough to obtain stablesample statistics

The life span of user instances in this set of500 cohorts is shown by the cumulative frequency inFigure 8 Observe that only about 40 of the userinstances are seen at different hours About 30 ofthe user instances are seen only once in the sampledweek This illustrates that it is vital to understand thenotion of user instances These percentages are mis-leading if we interpret the user instances as individu-als or devices One individualdevice without stableidentification in the online advertising ecosystem willcreate many transient user instances which will driveup the percentage of user instances that are seen onlyonce or that are seen only within a short timeframeWe have not figured out how to determine or esti-mate reliably how many ldquounstablerdquo usersdevices arerepresented by the user instances in these data Wekeep all of the data for the analyses below (rather thanfiltering very short-lived user instances) to maintainrealism

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 253

Figure 7 Evolution of Several Sample Statistics as an Increasing Number of Cohorts are Created

0 100 200 300 400 5000

10

20

30

40

50

60

70

Average No network neighbors

0 100 200 300 400 5000

2

4

6

8

10

12

Median No network neighbors

0 100 200 300 400 5000

50

100

150

200

250

300

350

Average No IPs

0 100 200 300 400 5000

2

4

6

8Median No IPs

Notes The horizontal axis denotes the number of cohorts By 500 cohorts the statistics have become stable

Figure 8 (Color online) Cumulative Frequency of Life Span ofUser Instances

100 101 102 103 104 105 1060

01

02

03

04

05

06

07

08

09

10

x = lifespan (sec)

F(x

)

End of June (mobile only)

larr 1 min larr 1 hour larr 1 day

Cumulative frequency of user instances

6 Study 1 Connecting Instances ofthe Same Individual

Returning to our motivating application as men-tioned previously an important task in mobile adver-tising is to be able to target a particular individual inthe face of the individualrsquos fragmentation into differ-ent user instances (for example as seen through dif-ferent bidding systems) Advertisers understand that

it may be impossible to find the same user with 100accuracy Nonetheless it often is important for themto reach the userrsquos various instances without target-ing too many others9 The GSN is designed in partbased on the hypothesis that different instances of thesame person would visit similar locations and there-fore they will be connected in the GSN In this sectionwe present results examining whether and to whatextent that is indeed the case Furthermore the resultsquantify to what extent the same user receives a highGSN-similarity to herself as compared to her otherneighbors

Three experimental settings are used to assess judg-ments that two users are the same The first two sim-ulate different user instances by splitting up the bidrequests of known user instances The last one is notsimulated it links users via the recently introducedidentifier for advertising (IFA) observed over severalRTB exchanges for a subset of device instances

1 Random We divide the IP address visits of a userrandomly across two simulated users Table 6 showsfor example how we can divide a userrsquos IP visits across

9 The astute reader may notice that if the other users targeted arevery similar to the target user the advertising may be well placeddespite not being shown to the exact target user That is not thesubject of this study Study 2 shows evidence that these two notionsare not independent when using the GSN

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers254 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

Table 6 Randomly Splitting User X rsquos IP VisitsCreating Two New Artificial Users X1 and X2

IP1 IP2 IP3

X 1 5 0X1 0 3 0X2 1 2 0

Table 7 The Volume of IFA Occurrence Within OurData Set

Volume

RTB 1 119401855RTB 2 113831962RTB 3 7891292

Table 8 The Frequency of Unique IFAs Visible AcrossRTB Exchanges

Frequency

RTB 1RTB 2 3061594 (1016)RTB 2RTB 3 451256 (169)RTB 3RTB 1 181417 (085)

two new users X1 and X2 for which we know thatthey are actually the same individual Results withRandom should be used only as a ceiling on perfor-mance because the results will be overly optimistic

2 Temporal A more realistic split-up of the trans-actions is a temporal split-up where first the middleday of Xrsquos transactions is determined (hence dynam-ically chosen for each user) All IP address visits upto this middle day are assigned to X1 and all trans-actions after the middle day are assigned to X2 Itwill only be applied to users who are seen on multi-ple days

3 Identifier for advertising IFA is an advertisingidentifier from Apple that users can change or asknot to be used for advertising and is seen as aprivacy-friendlier version of the controversial fixeddevice ID (the ldquoUDIDrdquo) The IFA is observed in threeexchanges with volumes shown in Table 7 for exam-ple 1940855 unique IFAs are seen in RTB 110 Whenwe observe the same IFA across different exchangeswe now know that two devices in different exchangesactually are the same The frequency of unique IFAsvisible across exchanges is shown in Table 8 For exam-ple a total of 3018223 IFAs are seen when we look atboth RTB 1 and RTB 2 with 306594 of these (1016)seen on both exchange 1 and exchange 2 What ifthere were not an IFA and the device would havehad a different identifier in each exchange To what

10 For confidentiality RTB 1 in this setting is not necessarily thesame as RTB 1 in sectsect4 or 7

extent are these two device instances then connectedin the GSN

With these different ldquogold standardrdquo methods toindicate whether or not two users are the samewe assess the ability of the geosimilarity metricsto link the same user determining how often twoinstances of the same person are connected in theGSN (sect61) Afterward we shall assess the differentGSN strength metricsrsquo ability to rank users determin-ing to what extent two instances of the same personexhibit strong geosimilarity as compared to the otherneighbors (sect62) For the random and temporal split-ups 200 samples are used For the IFA-based methodthe complete population is used for measuring thedegree of connectedness to the same user to analyzethe relative strength of the connected users 200 sam-ples are used

61 Same Individual Different ldquoScreensrdquoWe now present results on the GSN connectednessmeasured as how often two instances of the same userare connected in the GSN (in percentage) Note thatthe results summarized in Table 9 should be inter-preted in light of the limited number of days in build-ing the research data set which will limit the GSNconnectivity Nonetheless the relatively high connec-tivities observed even with this sample provide quitepromising results which lends considerable supportto the merit of the overall design

Random0 Using the random split two users thatcorrespond to the same individual are not connectedin only 9 of the neighborhoods Moreover in thesecases user X (whose IP visits are split up) usuallyhas only one or two entries and the random split-upresults in no visits for one of the two simulated users

Temporal0 The temporal split-up is more realisticthan the random one where we assume that a personuses simulated device 1 in the first half of the week(eg during the weekend) and simulated device 2 inthe second half of the week (eg during the week)

Users X1 and X2 are not connected in 32 of theneighborhoods Whereas the random split-up was toooptimistic the temporal one is too pessimistic as inreality there likely is some overlap in the time peri-ods of use of two different devices If user X visitsan IP address only in the first days of the week this

Table 9 To What Extent is the Same User Connectedto Herself

connected

Random 91Temporal 68IFAmdashRTB 1RTB 2 67IFAmdashRTB 2RTB 3 82IFAmdashRTB 3RTB 1 73

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 255

IP address will not occur in user X2rsquos transactions andreduces the chance to be connected to X1

IFA0 We look at all combinations of the three ex-changes with IFA identifiers For example there are45256 unique IFAs seen on both RTB 2 and RTB 3 In82 of these the two IFAs visit at least one same IP onboth exchanges and hence are connected in our GSN

The bottom line is that even with this limited slice ofonline behavior different instances of the same indi-vidual are connected in the GSN 70ndash80 of the timeEncouragingly the results from the simulated scenar-ios concur with the results from the real (IFA) sce-narios (including the aforementioned optimism andslight pessimism of the two simulated scenarios)Consistency over these different settings gives addi-tional confidence in the results suggesting that theGSN indeed holds promise for a high degree of suc-cess at targeting the same user across the fragmenta-tion of the online advertising ecosystem

62 Ranking User InstancesNext given that two user instances of the same indi-vidual are indeed connected we assess to what extentthe second instance of the same user is ranked highlyamong its network neighbors based on the weightof their linkage in the GSN We first will discuss themost general results using the best ranking methodand then will look across different ranking methods

We decompose the results based on the numberof network neighbors (NNs) since there are twoboundary cases that require special interpretationTables 10ndash14 show the proportions of the ego net-works that fall into three scenarios First a user mayhave only one NN In this scenario the ranking isperfect (trivially)mdashthe other instance of the same useris the only user instance given a nonzero score As

Table 10 In What Portion of Cases Is the Same UserConnected Depending on the Number ofNetwork NeighborsmdashRandom Split-up

Ego has only 1 NN 3104Ego has exactly 2 NNs 1905

Same user ranked first 8005Ego has more than 2 NNs 4901

Note For the scenario with two NNs the table reports thepercentage of cases where the other known instance of thesame user is ranked first (including ties)

Table 11 In What Portion of Cases Is the Same UserConnected Depending on the Number ofNetwork NeighborsmdashTemporal Split-up

Ego has only 1 NN 3008Ego has exactly 2 NNs 1807

Same user ranked first 8309Ego has more than 2 NNs 5005

Table 12 In What Portion of Cases Is the Same UserConnected Depending on the Number ofNetwork NeighborsmdashIFA RTB 1RTB 2Split-up

Ego has only 1 NN 4401Ego has exactly 2 NNs 1201

Same user ranked first 8201Ego has more than 2 NNs 4400

Table 13 In What Portion of Cases Is the Same UserConnected Depending on the Number ofNetwork NeighborsmdashIFA RTB 2RTB 3Split-up

Ego has only 1 NN 6109Ego has exactly 2 NNs 1907

Same user ranked first 7102Ego has more than 2 NNs 1804

Table 14 In What Portion of Cases Is the Same UserConnected Depending on the Number ofNetwork NeighborsmdashIFA RTB 3RTB 1Split-up

Ego has only 1 NN 5305Ego has exactly 2 NNs 2501

Same user ranked first 6304Ego has more than 2 NNs 2104

shown in the tables this scenario accounts for 30ndash60 of the ego networks (depending on the split-upmethod)

The second boundary case is when the ego has ex-actly two NNs This scenario accounts for 10ndash25of the cases depending on the split-up method Herethe tables also report whether the other known userinstance is ranked first (including ties) among thetwo neighbors The other known instance of the sameuser is ranked first 60ndash80 of the time dependingon the split-up scenario The IFA results are affectedstrongly by the particular RTB settingmdashthis may indi-cate that the publishers on the different RTBs have dif-ferent policies for passing the IFA Importantly theseresults are conservativemdashpossibly very conservativeThe other NNs may also be instances of the same userunbeknownst to usmdashfor example in the IFA casebecause an app does not pass the IFA to the RTB orbecause the other user is the same user on a differentdevice and thus does not share the IFA

The final scenario is when more than two NNsare present As shown in the tables this accounts for20ndash50 of the cases depending on the split-up sce-nario In this case we measure the percentile in theranking where the instance of the known same userfalls (see below) as well as the area under the receiveroperating characteristic curve (AUC) (Fawcett 2006)which is equivalent to the MannndashWhitneyndashWilcoxon(MWW) statistic The AUCMWW measures to what

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers256 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

Table 15 When There Are More Than Two Network Neighbors to What Extent Is the Known Same User StronglyConnected to Itself

AUC (if gt 2 NNs) () Average rank percentile (if gt 2 NNs) ()

Ranking metric R T RTB 12 RTB 23 RTB 31 R T RTB 12 RTB 23 RTB 31

1 78 64 68 59 55 29 48 34 43 542 59 58 67 58 54 45 53 34 44 553 79 69 74 61 61 29 44 29 40 484 86 75 69 61 59 24 40 32 41 505 83 70 71 59 63 26 44 31 43 476 85 74 71 60 65 25 41 31 42 457 86 76 73 62 64 24 40 29 40 468 83 70 72 59 65 26 44 30 43 459 85 74 71 60 67 25 41 30 42 44

10 79 69 69 61 63 29 44 32 41 4711 79 69 71 62 64 44 44 30 41 4612 81 68 68 60 62 28 46 34 42 4913 81 70 71 61 64 27 44 30 40 4614 74 67 66 61 48 34 47 35 41 6015 74 67 66 61 48 34 47 35 41 60

Note See the text for a discussion of ranking metrics

extent data points with label 1 are ranked higherthan those with label 0 In the results that follow(see Table 15) the AUC is measured for each samplewhere any user instance among the neighbors thatcorresponds to the same individual is labeled as 1and all others are labeled as 0 Thus in this settingthe AUC measures how well the neighbors are rankedby the likelihood of being the same individual11 Theaverage over all samples is reported If for a sampleall neighbors have the same score that AUC is setto 05 In addition to the factors noted above this alsowill tend to make the results conservative For exam-ple if the GSN links an ego user to a set of neighborsthat are all instances of the same individual in thisevaluation the AUC will be 05

The ranking measure is the percentile in the rank-ing at which the known-same-user instance is rankedwith lower values being better For example if thesame-user instance has the highest score (closest con-nection) among five neighbors the rank is 20 (15)Note again that this is quite conservative in thisexample the user is ranked at the top of the list butbecause there were only five neighbors the highestpossible score is 20 In case the known-same-userinstance has the same score as another neighbor theaverage rank is reported

As shown in Table 15 all AUC values exceed 05ranging up to 08 This means that in every casea known-same-user instance is likely to be rankedhigher than an instance not known to be the same

11 Technically in this setting the AUCMWW is the probability thata randomly selected neighbor that is in fact a known instance of thesame user is ranked more highly than a randomly selected neighborthat is not known to be an instance of the same user If the known-same-user instances are all ranked above the other-user instancesthe AUC = 100 If they are all below then the AUC = 0

user The frequency-based ranking metrics (4ndash9) per-form quite well in connecting instances of the sameuser strongly

Almost all metrics perform better than a randommodel with performances that even come close to thebest possible solution Consistently performing quitewell are the frequency-based metrics (metrics 3ndash9)with metric 7 (the minimum of the IDF weights ofthe shared locations) getting the most wins Not per-forming well is the very basic count metric as well asthe Boolean metric that provides a binary score onlyInteresting to note the Jaccard metrics do not performwell either It seems that penalizing users that visitmany IP addresses negatively affects the results anissue we will return to in the next study

7 Study 2 Does the GSN Select Userswith Similar Interests

The foregoing section addressed our paperrsquos firstclaimmdashthat the GSN design indeed links instances ofthe same user and links them more strongly thaninstances of other users In this section we turn to thepaperrsquos second claimmdashthat the GSN links users withsimilar interests We will measure similar interests bysimilarity in behavior accessing particular publisherson the Web and using mobile apps12

71 Connecting Users with Similar Interests

To What Extent Are the GSN Neighbors of Visi-tors to Particular Publishers Also Visitors to ThoseSame Publishers To assess whether the GSN con-nects users with similar interestsbehavior we select

12 From now on all user instances will be complete real user in-stances the simulated split-up scenarios (Random Temporal) werecreated specifically for the purpose of the same-user study

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 257

Table 16 Distribution of Publishers per RTB in Terms of thePercentage of Users in Our Data Set That Visit the GivenPublisher

Publisher of D

RTB 1Burstly 3004myYearbookcom 1062My Fitness Pal 0065TextMe Inc 0056GameResort 0025Flixster 0019247Sports 0015Daily Workout Apps LLC 0009Mobile Deluxe 0007YoYo Games Ltd 0004

RTB 2mworldnowcom 0032mtmzcom 0024mtopixcom 0014babycentercom 0015meetmecom 0006appsfacebookcom 0006itunesapplecom 0039appevitecom 0002beautyandskincarescom 0003mmajunkiecom 0001

RTB 3Conversion Exchange 5053Pinger Inc 2096Top Game Developer 4029PremiumEntertainmentApp-FamilyOriented-Android 0059Enflick 0053Clapfoot Games 0044FunPokes Inc 0018Talkatone 0015Sevenlogics Inc 0008A Star Software 0007

RTB 4Pinger PhonemdashiOSmdashConversation 1001Talkatone iPhone App 0007DiceWithBuddies 00139GAG Reader 0012Video Downloader Pro Lite 0016Rage Comics 0009Spades 3D Lite 0006Apalabrados 0003Relax Melodies HD 0002Celeb MemdashPhotoMaker 0000

a diverse collection of 40 publishers listed in Table 16A publisher can correspond to a mobile app a web-site or a set of appswebsites We selected both pub-lishers that are visited by many users and more nichepublishers with fewer visitors As can be seen fromTable 16 several types of publishers can be distin-guished the main groups are games (eg GameRe-sort Top Game Developer) and social networkingtools (eg appsfacebookcom Text Me Inc)

More specifically we create 200 samples for 40 pub-lishers 10 publishers for each of four different RTB

exchanges listed in Table 1613 The percentage ofunique users that visit each publisher in our completedata set is also shown (D) To create each samplean ldquoegordquo user is selected that has visited the pub-lisher in question (an ldquoego visitorrdquo) this ego visi-torrsquos GSN neighborhood is created Next we assesshow many of the egorsquos neighbors also visited thatsame publisher This will be compared with a base-line visit rate to produce lift and leverage values(Provost and Fawcett 2013)mdashspecifically how muchmore likely is it for the neighbors of ego visitors tovisit the publisher than would be expected by chanceWe consider three different setups that produce dif-ferent baselines

1 Considering all RTB exchanges to determine the base-line A user is considered a visitor for a given pub-lisher if she visits the publisher on any RTB ex-change (since some publishers are visible on differentexchanges) Hence the baseline visit rate is the num-ber of users (on any RTB) who visit that publisherdivided by the total number of users across all RTBexchanges This will provide optimistic lifts sincemost publishers appear only on one RTB exchangewhereas the baseline is measured over all exchanges

2 Considering one RTB exchange only to determine thebaseline A user is considered a visitor for a given pub-lisher and RTB if she visits the publisher on the sameRTB exchange This changes the baseline to the num-ber of users (on this RTB) who visit that publisherdivided by the total number of users on this RTBexchange The lift and leverage results will be worsethan in the first setting because the baseline will belarger Seeing that some publishers are seen on differ-ent RTB exchanges this setting is slightly pessimistic

3 Taking time of location visits into account In thethird setting we link users only if they visit an IPwithin the same time period This will be elaboratedon in sect73

Results are shown in Tables 17 and 18 The percent-age of publisher visitors in the neighborhood is givenas N This percentage is compared with the baselinepercentage of users that visit that publisher in thecomplete data D using two metrics the lift is the ratioof these numbers the leverage is the difference Liftis particularly useful for very small base rates (con-sider that an activity with a base rate of 30 cannotpossibly have a lift higher than about 3) Leverage ismore telling for larger base rates because it shows theabsolute improvement (here in percentage points)

When considering all NNs the lifts for all publish-ers are greater than 1 except for one Burstly on RTB 1has a lift of 097 (essentially no lift) but only when weconsider RTB 1 as the baseline This publisher already

13 For confidentiality the RTB numbering here is again differentfrom above

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers258 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

Table 17 RTBs 1 and 2 Percentage of Network Neighbors That Also Visit the Publisher

RTB 1

All NNs N1 N10

D N Lift Lev N1 Lift Lev N10 Lift Lev

Considering all RTB exchanges to determine baseline1 Burstly 3004 8063 2084 5059 47022 15054 44018 25061 8043 220572 myYearbookcom 1062 25077 15094 24015 44074 27067 43012 35016 21074 330543 My Fitness Pal 0065 4015 6038 305 3609 56077 36025 15055 23091 14094 TextMe Inc 0056 1805 33006 17094 39002 69075 38046 25018 45 240625 GameResort 0025 4093 19077 4068 18052 74026 18027 11086 47058 110626 Flixster 0019 1055 8015 1036 8014 42085 7095 1079 9042 1067 247Sports 0015 27057 188058 27042 77062 530093 77048 74053 509075 740388 Daily Workout Apps LLC 0009 1092 21047 1083 19023 215046 19014 11054 129027 110459 Mobile Deluxe 0007 1087 28067 108 23053 36009 23046 9063 14707 9056

10 YoYo Games Ltd 0004 3052 99058 3049 21057 609064 21053 13 367045 12096Considering this RTB exchange only to determine baseline

1 Burstly 8085 8063 0097 minus0022 47022 5033 38037 25061 2089 160762 myYearbookcom 4071 25077 5047 21006 44074 905 40003 35016 7046 300453 My Fitness Pal 1089 4015 2019 2025 3609 19048 35001 15055 8021 130654 TextMe Inc 1063 1805 11035 16087 39002 23094 37039 25018 15045 230555 GameResort 0073 4093 6078 402 18052 25049 17079 11086 16033 110146 Flixster 0055 1055 208 0099 8014 14071 7059 1079 3023 10247 247Sports 0043 27057 64073 27014 77062 182023 7702 74053 174097 74018 Daily Workout Apps LLC 0026 1092 7037 1066 19023 73095 18097 11054 44037 110289 Mobile Deluxe 0019 1087 9084 1068 23053 123088 23034 9063 5007 9044

10 YoYo Games Ltd 001 3052 34018 3042 21057 209025 21047 13 126012 1209

RTB 2

Considering all RTB exchanges to determine baseline1 mworldnowcom 0032 13027 40086 12095 54039 167042 54006 31065 97044 310332 mtmzcom 0024 6089 28086 6065 23008 9607 22084 10068 44075 100443 mtopixcom 0014 15007 104079 14093 53045 37106 5303 33075 234065 330614 babycentercom 0015 6087 47031 6072 43048 299056 43033 17074 122024 17065 meetmecom 0006 6002 10308 5097 54072 942093 54066 27085 47909 270796 appsfacebookcom 0006 301 48039 3003 9009 141097 9003 2076 43008 20697 itunesapplecom 0039 25077 65071 25038 51043 131012 51004 29049 75018 290098 appevitecom 0002 48028 21212043 48025 54076 21509068 54074 48028 21212043 480259 beautyandskincarescom 0003 18019 642087 18016 8048 299094 8046 1107 413049 11067

10 mmajunkiecom 0001 1036 201064 1035 26067 3195505 26066 13033 11977074 13033Considering this RTB exchange only to determine baseline

1 mworldnowcom 4034 13027 3006 8093 54039 12053 50005 31065 7029 270312 mtmzcom 3019 6089 2016 307 23008 7024 19089 10068 3035 70493 mtopixcom 1092 15007 7084 13015 53045 27081 51053 33075 17056 310834 babycentercom 1094 6087 3054 4093 43048 22042 41054 17074 9015 15085 meetmecom 0078 6002 7077 5025 54072 70057 53094 27085 35092 270076 appsfacebookcom 0086 301 3062 2024 9009 10063 8024 2076 3022 1097 itunesapplecom 5024 25077 4092 20053 51043 9081 46019 29049 5063 240258 appevitecom 0029 48028 165058 47098 54076 187082 54047 48028 165058 470989 beautyandskincarescom 0038 18019 48011 17081 8048 22045 8011 1107 30095 11032

10 mmajunkiecom 0009 1036 15009 1027 26067 296003 26058 13033 148001 13024

Notes Results are given for the complete data set (D) the complete neighborhood of a user who visited the given publisher (All NNs) and considering theneighborhood of closest 1 and 10 users (N1 and N10)

has a very high baseline and as we will see in thenext section when we consider only the top 1 and 10NNs the lifts go up to 5 and 3 respectively Burstlyprovides a generic platform for building apps and sois a weaker indication of user interest than many ofthe other publishers (fitness babies games)

All other lifts exceed one by substantial mar-gins and go up to even 2391 (Apalabrados onRTB 4 a Spanish-language social Scrabble-like game)

Obviously the conclusion that the NNs indeed aremore likely to visit the same publishers is highly sig-nificant by a simple sign test comparing the visita-tion rate within the geosimilarity neighborhood tothat of overall population either via lift or leveragep lt 0001 Thus NNs of publisher visitors are indeedmore likely also to visit the same websites the abso-lute values of the lifts and leverages suggest that theyare substantially more likely

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 259

Table 18 RTBs 3 and 4 Percentage of Network Neighbors That Also Visit the Publisher

RTB 3

All NNs N1 N10

D N Lift Lev N1 Lift Lev N10 Lift Lev

Considering all RTB exchanges to determine baseline1 Conversion Exchange 5053 34054 6024 29001 64071 11069 59017 50077 9018 450242 Pinger Inc 2096 46002 15053 43006 65012 21097 62015 51048 17037 480523 Top Game Developer 4029 47024 11002 42095 60 13099 55071 48097 11042 440684 PremiumEntertainmentApp-FamilyOriented-Android 0059 23079 40012 2302 20045 34049 19086 22025 37052 210665 Enflick 0053 11071 21095 11017 21057 40044 21004 12027 23 110736 Clapfoot Games 0044 8055 19032 8011 11054 26006 1101 13099 31061 130557 FunPokes Inc 0018 11011 61028 10093 50 275077 49082 15079 87008 150618 Talkatone 0015 13013 88051 12098 31011 209076 30096 17091 120078 170779 Sevenlogics Inc 0008 2071 32059 2063 5088 70074 508 8033 100022 8025

10 A Star Software 0007 3033 44086 3025 23008 311024 23 9009 122061 9002Considering this RTB exchange only to determine baseline

1 Conversion Exchange 11045 34054 3002 23009 64071 5065 53025 50077 4043 390322 Pinger Inc 6013 46002 705 39089 65012 10062 58098 51048 8039 450353 Top Game Developer 8088 47024 5032 38036 60 6076 51012 48097 5052 400094 PremiumEntertainmentApp-FamilyOriented-Android 1023 23079 19038 22057 20045 16066 19023 22025 18013 210025 Enflick 101 11071 1006 1006 21057 19054 20046 12027 11011 110166 Clapfoot Games 0092 8055 9033 7064 11054 12059 10062 13099 15027 130087 FunPokes Inc 0038 11011 2906 10074 50 133022 49062 15079 42007 150418 Talkatone 0031 13013 42076 12082 31011 101034 3008 17091 58035 170619 Sevenlogics Inc 0017 2071 15074 2054 5088 34017 5071 8033 48041 8016

10 A Star Software 0015 3033 21067 3017 23008 150036 22092 9009 59023 8094

RTB 4

Considering all RTB exchanges to determine baseline1 Pinger PhonemdashiOSmdashConversation 1001 71081 70085 7008 80077 79068 79076 70014 6902 690132 Talkatone iPhone App 0007 11038 17203 11031 29003 439068 28097 17002 257078 160963 DiceWithBuddies 0013 7041 59004 7028 18018 144092 18006 7069 61031 70574 9GAG Reader 0012 62013 497099 62 49009 39305 48097 49005 393017 480925 Video Downloader Pro Lite 0016 20003 123015 19087 22058 138084 22042 17093 110027 170776 Rage Comics 0009 609 8001 6081 13033 154086 13025 609 8001 60817 Spades 3D Lite 0006 1205 225076 12044 27027 492057 27022 1205 225076 120448 Apalabrados 0003 77042 21391071 77039 75 21316097 74097 79032 21450043 790299 Relax Melodies HD 0002 31058 21002029 31056 50 31170029 49098 31058 21002029 31056

10 Celeb MemdashPhotoMaker 0 3023 82107 3022 7014 11819048 7014 3023 82107 3022Considering this RTB exchange only to determine baseline

1 Pinger PhonemdashiOSmdashConversation 8051 71081 8044 6303 80077 9049 72026 70014 8024 610632 Talkatone iPhone App 0055 11038 20053 10082 29003 52038 28048 17002 30071 160473 DiceWithBuddies 0009 7041 82011 7032 18018 201054 18009 7069 85027 7064 9GAG Reader 0097 62013 63097 61016 49009 50055 48012 49005 50051 480085 Video Downloader Pro Lite 0044 20003 45036 19059 22058 51014 22014 17093 40062 170496 Rage Comics 0037 609 18079 6053 13033 36032 12097 609 18079 60537 Spades 3D Lite 0046 1205 2609 12004 27027 58068 26081 1205 2609 120048 Apalabrados 0021 77042 371093 77021 75 360031 74079 79032 381006 790119 Relax Melodies HD 0013 31058 238054 31045 50 377069 49087 31058 238054 31045

10 Celeb MemdashPhotoMaker 0003 3023 97089 3019 7014 216076 7011 3023 97089 3019

Notes Results are given for the complete data set (D) the complete neighborhood of a user who visited the given publisher (All NNs) and considering theneighborhood of closest 1 and 10 users (N1 and N10)

The very high lifts could be attributed to two phe-nomena They may provide some evidence that theGSN embeds a true social network as pointed out bythe good results for the social networking publishers(eg Apalabrados with a lift of 2391 appsevitecomwith a lift of 2212 and others) Friends visiting thesame social networking sites (here unusual ones)

would increase the lift and leverage over just havingdifferent instances of the same user visiting the samewebsites An alternative explanation is that some ofthese publishers are much more likely to be visitedacross different devices of the same user It is not clearwhether this is actually true However it argues forobtaining specific and broad data on which devices

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers260 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

are not the same user to shed further light on forexample any broad data that include a (anonymized)user identifier

From these results we can conclude that GSNneighbors indeed exhibit substantially similar pub-lisher visitation behavior Next we assess to whatextent more strongly connected NNs are more similarthan less strongly connected NNs

72 Ranking Users with Similar Interests

Are the More-Similar Geosimilarity Neighbors of aPublisherrsquos Visitors Even More Likely Also to Be Vis-itors to the Same Publisher To assess whether more-similar geosimilarity neighbors are even more likelyto share interests we compute the measures exactlyas in the previous study except instead of consider-ing all NNs only the 1 (N1) and 10 (N10) neighborswith the highest scores are considered (see Tables 17and 18) First we must consider which of the 15 met-rics that are defined to measure link strength to useTable 19 reports the rank aggregations of the differentmetrics as to which provide the best lifts Specificallyfor each of the 40 publishers each scoring metric pro-vides a lift and for that publisher the scoring metricscan be ranked (the best being ranked 1 etc) The rankaggregation is the average rank for a scoring metricacross all 40 publishers Thus if one metric providedthe best lift for all publishers it would get an aggre-gated rank of 1

To visualize the best metrics the metrics with anaverage ranking less than 8 are shown in boldfacein Table 19 The Jaccard metrics do not perform wellin this study either showing that penalizing the con-nection to users that visit many locations is not sen-sible Remember that the Jaccard metric is based onthe idea that users that visit many locations will havea higher chance to share some location with anotheruser However we have so many locations in totalthat visiting a couple more locations will increase onlymarginally this probability of sharing a location bychance A user that logs into more IP addresses andis more active should hence not be penalized andranked lower than other less active users Metric 3which sums the IDF scores of the shared locationsperforms quite well Given the operational efficiencyadvantage over the frequency-based metrics this met-ric is chosen to measure the strength of the links

As seen from the results in Tables 17 and 18 thelift and leverage values for the strongly connected (top 1and top 10) NNs are even higher sometimes astro-nomical (3170 for Relax Melodies HD (RTB 4)) Theseresults are highly significant by sign tests compar-ing either the lift or the leverage between the top-ranked neighbors and the entire cohort (p lt 0001) Theabsolute values again indicate that the users with the

strongest geosimilarity are often substantially morelikely to visit the same publisher

Thus we can conclude that not only does geosim-ilarity find users with similar interests it also ranksusers well by their likelihood of having similarinterests

An interesting follow-up question is whether thecharacteristics of a publisher are of importance for theresults For example are GSN neighbors more sim-ilar in terms of using the same social network appas compared to playing the same games To answerthis question the publishers are categorized into sixclasses funny social news games communicationand miscellaneous In Figure 9 the lifts of the pub-lishers are shown per category Some general trendscan be observed where publishers related to funnycontent have the highest median lift This could beexplained by the fact that users often forward funnycontent to their friends The social websitesapps fol-low next which seems to further demonstrate thatfriends are likely linked in the GSN However pub-lishers in the communication category perform not sowell compared with the other categories and ratherhigh variances are observed in the different cate-gories Please note that high variance is observedacross the categories therefore it is difficult to makewell-supported claims from these results

73 TimingAs a final evaluation we explore whether the inclu-sion of temporal similarity in location visitation ishelpful for defining GSN connections To this end weintroduce a setting where two users are connectedonly if they visit an IP address in the same timeperiod of a day Specifically we divide a day intodifferent time windowsmdashfor example two time peri-ods of 12 hoursmdashwhere users are connected only ifthey visit the same IP address in the same time periodof a day (which might be on different days) Practi-cally two digits are added to an IP which denote thestarting moment of the corresponding time periodWe report on time periods of 2 hours and of 12 hours(other periods yield similar results) We also includethe setting where users are connected only when theyvisit the same IP address on the same day in light ofthe result of Crandall et al (2010) that visiting thesame location at around the same time is indicativeof friendship

The resulting lifts are reported in Figures 10ndash12when considering respectively all NNs the top 1 NNand the top 10 NNs The results when not consider-ing the time period are repeated as well (All) lim-ited to the conservative case of only considering oneexchange (see above) The baseline (lift of one) isindicated with a red horizontal line lifts higher than1000 are shown as 1000 to limit the range to be able

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 261

Table 19 Rank Aggregation per Metric Over All Publishers (Across All RTBs)

Top 1 NN Top 10 NNs

1 Shared unique IPs 801 1 Shared Unique IPs 802252 BooleanmdashSharing an IP 80625 2 BooleanmdashSharing an IP 809133 Shared unique IPsmdashIDF weighted 70025 3 Shared unique IPsmdashIDF weighted 709754 Shared IP visitsmdashMin 70063 4 Shared IP visitsmdashMin 704135 Shared IP visitsmdashMax 7095 5 Shared IP visitsmdashMax 70256 Shared IP visitsmdashAvg 8005 6 Shared IP visitsmdashAvg 704887 Shared IP visitsmdashMinmdashIDF weighted 60625 7 Shared IP visitsmdashMinmdashIDF weighted 703758 Shared IP visitsmdashMaxmdashIDF weighted 70975 8 Shared IP visitsmdashMaxmdashIDF weighted 702259 Shared IP visitsmdashAvgmdashIDF weighted 70938 9 Shared IP visitsmdashAvgmdashIDF weighted 70525

10 Cosine similarity 80338 10 Cosine similarity 804511 Cosine similaritymdashIDF weighted 80213 11 Cosine similaritymdashIDF weighted 8086312 GSN Jaccard 80925 12 GSN Jaccard 913 GSN JaccardmdashIDF weighted 80325 13 GSN JaccardmdashIDF weighted 9013814 GSN Jaccard Freq 80425 14 GSN Jaccard Freq 80515 GSN Jaccard FreqmdashIDF weighted 80425 15 GSN Jaccard FreqmdashIDF weighted 805

to visualize the results Tables 20 and 21 show theaverage rank aggregation (see the discussion of rankaggregation above) in terms of lift per RTB and thep-values of a signed rank test respectively

When comparing the different temporal settingswe see that the ldquoDayrdquo variant performs best and lim-ited differences exist between the time periods of 2and 12 hours More important to observe from theseresults is that (overall) not including time in the GSNdesign outperforms all three temporal variants sub-stantially Only for RTB 3 and the Day version whereusers are connected only if they visit the same IPaddress on the same day are the lifts not signifi-cantly worse than having no time constraint at a 1level (although even here they are significantly worseat a 10 level) This shows that for finding userswith similar interests which locations they visit ismore important than when they visit them By includ-ing the time constraints connections between previ-ously connected users are lost simply because they

Figure 9 (Color online) Lifts of the Publishers (for All Four RTBs)Ranked According to the Median Lift Within the Category(Medians Shown by the Full Line)

visit locations in different time periods Although weuse time to define links metrics that include time todefine the strength of a link might improve the resultsfurther and we consider that an interesting issue forfuture research

8 ConclusionThis paper presents a new design for using geosim-ilarity to connect user instances in a geosimilarity-weighted network In the GSN users are connected ifthey share at least one observed location Various met-rics can be used to yield degrees of similarity We pre-sented intuitive arguments theory and prior researchthat suggest that geosimilarity and the GSN designshould link similar users We also provided strongempirical support Specifically Study 1 shows thatthe GSN indeed links different users instances cor-responding to the same individual and the geosimi-larity metrics rank user instances by their likelihoodof corresponding to the same individual Study 2shows that the GSN links users who have similarinterests as measured by their propensity to visitthe same publishersuse the same apps It also pro-vides strong evidence that the geosimilarity metricsrank user instances by their likelihood of havingsimilar interests Taken together the results providestrong support that the GSN links users with similarinterestsmdashin many cases because they actually repre-sent the same individuals

The main limitation of the research presented inthis paper is that the data are not sufficient to dis-tinguish whether the similarity in interests is solelybased on connecting instances of the same user Thisis an important avenue for future research This lim-itation notwithstanding the very strong results fromStudy 2 have important implications in either caseeither the network is indeed finding different indi-viduals with substantially similar interests or we

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers262 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

Figure 10 (Color online) Lifts Considering All Network Neighbors per RTB System (Each Panel) With and Without the Time Constraint Limited toOne Network (see sect71) for the 40 Publishers

RTB 3mdashAll NNs

0 5 10

Publisher

0

20

40

60

80

100

Lift

RTB 4mdashAll NNs

0 5 10

Publisher

0

200

400

600

800

1000

Lift

RTB 2mdashAll NNs

0 5 10

Publisher

0

200

400

600

800

1000

Lift

0 5 100

200

400

600

800

1000RTB 1mdashAll NNs

Publisher

Lift

All

TP2

TP12

Day

Note Lifts higher than 1000 are shown at 1000

Figure 11 (Color online) Lifts Considering the Closest 1 Network Neighbor per RTB System (Each Panel) With and Without the Time ConstraintLimited to One Network (see sect71) for the 40 Publishers

Lift

0

100

200

400

300

RTB 3mdashTop 1 NNs

0 5 10

Publisher

All

TP2

TP12

Day

0 5 100

200

400

600

800

1000RTB 1mdashTop 1 NNs

Publisher

Lift

RTB 2mdashTop 1 NNs

0 5 10

Publisher

0

200

400

600

800

1000

Lift

RTB 4mdashTop 1 NNs

0 5 10

Publisher

0

200

400

600

800

1000

Lift

Note Lifts higher than 1000 are shown at 1000

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 263

Figure 12 (Color online) Lifts Considering the Closest 10 Network Neighbors per RTB System (Each Panel) With and Without the Time ConstraintLimited to One Network (see sect71) for the 40 Publishers

RTB 2mdashTop 10 NNs

0 5 10

Publisher

0

200

400

600

800

1000

Lift

0 5 10

Publisher

Lift

0

50

100

150

0 5 10

Publisher

0

200

400

600

800

1000

Lift

0 5 100

200

400

600

800

1000

RTB 1mdashTop 10 NNs

RTB 3mdashTop 10 NNs RTB 4mdashTop 10 NNs

Publisher

Lift

All

TP2

TP12

Day

Note Lifts higher than 1000 are shown at 1000

have very strong additional support for the rankingresults of Study 1 as the neighbors with the strongestgeosimilarity have significantly higher affinity for thesame publishersapps

These results have broad and immediate manage-ment implications within our motivating applicationof mobile advertising As discussed above digitalmarketers would like to target the many fragmentedinstances of individuals in the digital advertising

Table 20 Rank Aggregations in Terms of Lift AveragedAcross All Publishers (per RTB)

All TP2 TP12 Day

RTB 1 101 209 309 201RTB 2 1 204 304 302RTB 3 101 3 4 109RTB 4 101 208 308 203

Note Linking based on time does not improve over linkingignoring time

Table 21 Evaluation of Including Time p-Values ofSigned Rank Test

All TP2 TP12 Day

RTB 1 mdash 0 0002 0 0002 0 0006RTB 2 mdash 0 0002 0 0002 0 0002RTB 3 mdash 0 0002 0 0002 00065RTB 4 mdash 0 0002 0 0002 0 0004

ecosystem as well as users who have similar intereststo particular ldquoseedrdquo users Location data such asIP addresses are readily available across the adver-tising ecosystem The most direct use of the GSN formobile ad targeting is to seed a targeting campaignwith users chosen to exhibit some characteristics ofinterest These could be chosen through any of themyriad of methods currently used in advertising TheGSN will directly allow the targeting of the othermembers of these usersrsquo geosimilarity cohorts whichwill include other instances of the same user othersimilar users and likely both So for example if a setof users has been identified to have brand affinity viavisits to a brandrsquos website the GSN cohorts of theseusers would provide an attractive avenue to expandthe reach of a campaign to target them (extendingtraditional retargeting) and also target users similarto the seeds (customer prospecting) In the researchdata set we see that different instances of the sameusers are very often connected in the GSN and thatthose connected in the GSN exhibit substantially sim-ilar interests

The GSN also could be used for (privacy-friendly)hyperlocal targetingmdashmeaning targeting people whofrequent a particular location without needing to storedata on the actual locations of the users as describedearlier IP addresses currently are used by somemarketers for coarse-grained demographic targetinginferring geography from IP address registration data

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers264 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

and then inferring demographics from the geographyThe GSN provides a complementary alternative it willconnect people who visit the same fine-grained loca-tions As described above in the couponing exampleif a campaign is seeded with customers of a localestablishment (eg via an online loyalty program)geosimilarity can target others who frequent the samelocations

An exciting potential future use of the GSN is theevaluation of marketing campaigns across differentchannels This has become an important challenge formobile advertizing (Lee 2014)14 For example I mightreceive an ad for a hot new product on my mobiledevice I may be interested in the product but Irsquomunlikely to buy this product on my mobile phoneRather Irsquoll buy the product using my laptop or PC Bylinking the screens of the same user we are now ableto provide a broader (and possibly much more robust)estimate of campaign success (eg conversion rate)across different channels by aggregating the successmetrics across GSN neighbors For example similar tohow a traditional ad campaign may runnot run anad in a controlled study across different cities agen-cies could targetnot target different (matched) usersand then look at the success rates in their respec-tive geosimilarity neighborhoods This should captureeffects of the same individual on different devices

As discussed at the outset advertisers need to becognizant of privacy concerns regarding the collec-tion storage and use of data This paper followswhat the FTC calls the ldquoprivacy by designrdquo approachBy design the technique does not need to store anydirect personal information about mobile users thereis no need for nonanonymized identifiers demo-graphics geographics psychographics etc In addi-tion the storage of indirect information about userscan be severely limited as well The method doesnot need the actual locationsmdashonly anonymized loca-tion ldquokeysrdquo So for example IP addresses can bereplaced with random numbers15 without affectingthe GSN performance (European Commission 2014Federal Trade Commission 2012) More technically atthe ldquoouter wallrdquo of the system or firm each deviceid can be irreversibly hashed to a random key Theonly requirement is that the same device be hashedto the same key if encountered again Similarly atthe ldquoouter wallrdquo of the system or firm every loca-tion also can be irreversibly hashed to a random keyThe GSN can be formed just the same with the ran-dom keys as with the actual locations If more privacy

14 httpwwwclickzcomclickznews233699680-of-marketers-will-run-cross-channel-marketing-campaigns-in-2014-study15 See both the FTCrsquos and the European Commissionrsquos 2010ndash2011privacy-by-design reports and the comments therein along withProvost et al (2009) for a thorough treatment of this topic

is desired hashing can be done irreversibly many toone and in that case it becomes impossible to asso-ciate definitively any particular location with any par-ticular deviceuser

Finally an aspect of this research that is importantfor managers but not typically covered in the predic-tive modeling literature is the notion of increasing thereach of a campaign For example as discussed abovepossibly the most straightforward use of the GSN isto select the same actor on different mobile devicesThis would allow us to expand the reach of a ldquoretar-getingrdquo campaign16

Increasing reach has important subtleties in the cur-rent advertising ecosystemmdashit is not just the otherside of the coin of increasing lift The reason is thatthere are many targeters in the onlinemobile adver-tising ecosystem all of whom are using the samedata retargeting data and demographicgeographicpsychographic data that are purchased from third-party data providers However these data are avail-able only on a subset of devices Thus all partieswho are using these data are competing for the samesometimes small set of devices Since large advertis-ers typically contract with multiple targeting firmswho take different strategies the effect is that theadvertiser is paying the firms to compete against eachother effectively raising the price in the auction andthereby raising their own cost of advertising17 Whatis more these will be the same devices that also willbe targeted for other campaigns because they are thedevices for which the targeters have data

However by connecting devices in a GSN we canexpand campaigns to devices for which there are noretargeting or third-party data available at all If thereis less competition for these devices we should beable to target them for a lower price Thus expand-ing reach has an implication for the cost-effectivenessof achieving a certain level of predictive performance(eg lift)

AcknowledgmentsThe authors thank the anonymous reviewers and edi-tors for very constructive suggestions which substantiallyimproved the paper This work was done while the firstauthor was at Coriolis Ventures and Everyscreen MediaThe authors extend our gratitude to Everyscreen MediaDstillery Lauren Moores Tom OrsquoReilly Vinayak Javaly andTina Eliassi-Rad for many discussions about mobile adver-tising and mobile ad targeting as well as technical assistance

16 Retargeting is to target browsers who have previously purchasedfrom the brand or who have taken some other indicative brandaction such as browsing the brandrsquos site Retargeting is consid-ered by many to be one of the most effective targeting strategies(although to our knowledge these conclusions are drawn based onassessing conversion rate rather than the influence of the advertis-ing please see Stitelman et al 2011)17 The latter is our conjecture

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 265

to obtain all of the relevant data This particular data set wasnot necessarily used in the development of any productionmodel used for mobile advertising The first author thanksNEC and Andre Meyer for faculty fellowships The authorsthank the Moore and Sloan Foundations for their generoussupport of the Moore-Sloan Data Science Environment atNew York University

ReferencesAgarwal R Gupta AK Kraut RE (2008) Editorial overviewmdashThe

interplay between digital and social networks Inform SystemsRes 19(3)243ndash252

Aral S Muchnik L Sundararajan A (2009) Distinguishing influence-based contagion from homophily-driven diffusion in dynamicnetworks Proc National Acad Sci USA 106(51)21544ndash21549

Bampo M Ewing MT Mather DR Stewart D Wallace M (2008)The effects of the social structure of digital networks on viralmarketing performance Inform Systems Res 19(3)273ndash290

Cho E Myers SA Leskovec J (2011) Friendship and mobility Usermovement in location-based social networks Proc 17th ACMSIGKDD Internat Conf Knowledge Discovery and Data MiningKDD rsquo11 (ACM New York) 1082ndash1090

Cortes C Pregibon D Volinsky CT (2001) Communities of interestHoffmann F Hand DJ Adams N Fisher D Guimaraes G edsAdvances in Intelligent Data Analysis Lecture Notes ComputSci Vol 2189 (Springer-Verlag Berlin Heidelberg) 105ndash114

Crandall DJ Backstrom L Cosley D Suri S Huttenlocher DKleinberg J (2010) Inferring social ties from geographic coinci-dences Proc National Acad Sci USA 107(52)22436ndash22441

de Montjoye YA Hidalgo CA Verleysen M Blondel VD (2013)Unique in the crowd The privacy bounds of human mobilitySci Rep 31ndash5

European Commission (2014) Data protection day 2014 Full speedon EU data protection reform Press release January 27httpeuropaeurapidpress-release_MEMO-14-60_enhtm

Fawcett T (2006) An introduction to ROC analysis Pattern Recogni-tion Lett 27(8)861ndash874

Federal Trade Commission (2012) Protecting consumer pri-vacy in an era of rapid change FTC Report March 2012httpswwwftcgovsitesdefaultfilesdocumentsreportsfederal-trade-commission-report-protecting-consumer-privacy-era-rapid-change-recommendations120326privacyreportpdf

Heidemann J Klier M Probst F (2010) Identifying key users inonline social networks A pagerank based approach InformSystems J 4801(December)157ndash160

Hill S Provost F (2003) The myth of the double-blind reviewAuthor identification using only citations SIGKDD Explorations5(2)179ndash184

Hill S Provost F Volinsky C (2006) Network-based marketingIdentifying likely adopters via consumer networks Statist Sci22(2)256ndash276

Hotho A Nuumlrnberger A Paass G (2005) A brief survey of text min-ing LDV Forum 20(1)19ndash62

Kerho S (2012) Mobile marketingmdashA new analytics frameworkwhat we have and what we need Presentation Marketing onthe Move Conference Wharton School University of PennsylvaniaPhiladelphia February 28

Kossinets G Watts DJ (2009) Origins of homophily in an evolvingsocial network Amer J Sociol 115(2)405ndash450

Lee J (2014) 80 of marketers will run cross-channel marketingcampaigns in 2014 ClickZ (March 31) httpwwwclickzcomclickznews233699680-of-marketers-will-run-cross-channel-marketing-campaigns-in-2014-study

Martens D Provost F (2011) Pseudo-social network targeting fromconsumer transaction data Working paper CEDER-11-05 SternSchool of Business New York University New York

McPherson M Smith-Lovin L Cook JM (2001) Birds of afeather Homophily in social networks Annual Rev Sociol 27415ndash444

Oinas-Kukkonen H Lyytinen K Yoo Y (2010) Social networks andinformation systems Ongoing and future research streamsJ Assoc Inform Systems 11(2)61ndash68

Pan W Aharony N Pentland A (2011) Composite social networkfor predicting mobile apps installation Burgard W Roth Deds Proc AAAI 2011 (AAAI Press San Francisco)

Perlich C Dalessandro B Raeder T Stitelman O Provost F (2014)Machine learning for targeted display advertising Transferlearning in action Machine Learn 95(1)103ndash127

Pogue D (2011) Wrapping up the apple location brouhaha NewYork Times (April 28) httppogueblogsnytimescom20110428wrapping-up-the-apple-location-brouhaha

Provost F Fawcett T (2013) Data Science for Business What You Needto Know About Data Mining and Data-analytic Thinking (OrsquoReillyMedia Inc Sebastopol CA)

Provost F Dalessandro B Hook R Zhang X Murray A (2009) Audi-ence selection for on-line brand advertising Privacy-friendlysocial network targeting KDD rsquo09 Proc 15th ACM SIGKDDInternat Conf Knowledge Discovery and Data Mining (ACMNew York) 707ndash716

Pubmatic (2010) Understanding real-time bidding (RTB) from thepublisherrsquos perspective Technical report Pubmatic SiliconValley CA

Quercia D Lathia N Calabrese F Di Lorenzo G Crowcroft J (2010)Recommending social events from mobile phone location dataProc 2010 IEEE Internat Conf Data Mining ICDM rsquo10 (IEEEComputer Society Washington DC) 971ndash976

Raeder T Stitelman O Dalessandro B Perlich C Provost F (2012)Design principles of massive robust prediction systems Proc18th ACM SIGKDD Internat Conf Knowledge Discovery and DataMining (ACM New York) 1357ndash1365

Stitelman O Dalessandro B Perlich C Provost F (2011) Estimatingthe effect of online display advertising on browser conversionData Mining and Audience Intelligence for Advertising ADKDD2011 (ACM New York) 8

Stitelman O Perlich C Dalessandro B Hook R Raeder T Provost F(2013) Using co-visitation networks for detecting large scaleonline display advertising exchange fraud Proc 19th ACMSIGKDD Internat Conf Knowledge Discovery and Data Mining(ACM New York) 1240ndash1248

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Page 9: Finding Similar Mobile Consumers with a Privacy-Friendly ...pages.stern.nyu.edu/~fprovost/Papers/PMM_ISR_2015.pdf · Provost, Martens, and Murray: Finding Similar Mobile Consumers

Provost Martens and Murray Finding Similar Mobile Consumers250 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

cussed above) These are the methods used for theempirical results presented below

4 Specifics of the GSN DesignAs described above in this design the strength of thelink between two individuals in the GSN is based onthe similarity in the distribution of the locations thatthey visit For this paperrsquos results locations will be(anonymized) IP addresses We now define differentsimilarity measures of varying complexity which willbe evaluated in the following sections

41 NotationThe location profile of a user is represented by a vec-tor Ex of length m where element i denotes the num-ber of visits to IP address IPi2 Ex isin m Vector Exbool isa binary vector of size m where element i denoteswhether the user has visited IPi For similarity com-putations both the Hadamard product () and theinner product 4 middot 5 are used and the minimum (min)operator is defined to operate componentwise on twovectors as well as on a vector and a scalar The max-imum (max) and average (avg) operators are similarTheir logic is defined in Equations (1)ndash(4) The weightof an IP address is defined by its popularity in anal-ogy to the inverse document frequency used in text

Table 2 Similarity Metrics Between Two Userrsquos Location Visit Distributions Ex1 and Ex2

Similarity metric Range d4 Ex11 Ex25

1 scount4 Ex11 Ex25= Exbool11 middot Exbool12 6015 1

2 sbool4 Ex11 Ex25= min4scount4 Ex11 Ex25115 80119 1

3 scount1W 4 Ex11 Ex25= 4 Ew Exbool115 middot Exbool12 6015 103

4 sfreq1min4 Ex11 Ex25= min4 Ex11 Ex251 6015 2

5 sfreq1max4 Ex11 Ex25= max4 Exbool12 Ex11 Exbool11 Ex251 6015 3

6 sfreq1 avg4 Ex11 Ex25= avg4 Exbool12 Ex11 Exbool11 Ex251 6015 205

7 sfreq1min1W 4 Ex11 Ex25= min4 Ex1 Ew1 Ex2 Ew51 6015 206

8 sfreq1max1W 4 Ex11 Ex25= max44 Exbool12 Ew5 Ex11 4 Exbool11 Ew5 Ex251 6015 309

9 sfreq1 avg1W 4 Ex11 Ex25= avg44 Exbool12 Ew5 Ex11 4 Exbool11 Ew5 Ex251 6015 3025

10 scosine4 Ex11 Ex25=Ex1 middot Ex2

Ex11 middot Ex21

60117 0074

11 scosine1W 4 Ex11 Ex25=4 Exbool11 Ew5 middot 4 Exbool12 Ew5

Exbool11 Ew1 middot Exbool12 Ew1

60117 0060

12 sJaccard4 Ex11 Ex25=scount4 Ex11 Ex25

max4 Exbool111 Exbool1251

60117 0033

13 sJaccard1W 4 Ex11 Ex25=scount1W 4 Ex11 Ex25

max4 Exbool11 Ew1 Exbool12 Ew51

60117 0026

14 sJaccard1 freq4 Ex11 Ex25=sfreq1min4 Ex11 Ex25

max4 Ex1 Exbool121 Ex2 Exbool1151

60117 0067

15 sJaccard1 freq1W 4 Ex11 Ex25=sfreq1min1W 4 Ex11 Ex25

max44 Ex1 Exbool125 Ew1 4 Ex2 Exbool115 Ew51

60117 0067

mining (Hotho et al 2005) the weight wi for IPi (theinverse device frequency) is defined as the logarithmof the total number of users n divided by the numberof (unique) users that visit that specific IP address ni

Ez= Ex Ey1 zi = xi times yi1 (1)

a= Ex middot Ey =sum

i=12m

4xi times yi51 (2)

Ez= min4Ex1 Ey51 zi = min4xi1yi51 (3)

Ez= min4Ex1a51 zi = min4xi1 a51 (4)

wi = log104nni50 (5)

42 Link Strength MetricsFifteen similarity metrics are defined in Table 2 andare illustrated by measuring the strength betweenthe location profiles of two example users Ex1 and Ex2shown in Equation (6) The first user visits IP1 threetimes and IP2 twice whereas the second one visits IP1twice and IP3 once Several variants of the same met-rics are considered As described above some onlytake into account the unique number of shared loca-tions whereas others take into account the frequencyof the visits to these locations and the inverse devicefrequency The latter are denoted by freq and W respectively

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 251

The first similarity metric scount counts the numberof shared locations Since the two example users shareonly IP1 this strength is 1 The IDF weighted versionscount1W sums the IDF values of the shared locationswhich is 13 for our simple example The most basicmetric is given by sbool which is 1 when a location isshared and 0 otherwise indicating whether two usersare neighbors or not

The sfreq1min sfreq1max and sfreq1avg metrics compare thefrequencies of the visits to the shared locations Theytake the minimum maximum and average respec-tively of the frequencies for each shared location andsum them The weighted versions sfreq1min1W sfreq1max1W and sfreq1avg1W do the same but weight each frequencywith the corresponding IDF value

The scosine metric measures the similarity by takingthe cosine of the angle between the two vectors TheJaccard metric is defined as the size of the intersec-tion over the size of the union In this case it countsthe number of shared locations but normalizes theseover the total number of locations both users havevisited Users that visit many locations will have ahigher chance to share some location with anotheruser so the Jaccard metric penalizes them For ourbasic example sJaccard is 1 (IP1) over 3 (IP1 IP2 IP3)Once more variants taking into account the frequencyand the IDF are defined as well

Ex1 = 63 2 071

Ex2 = 62 0 171 (6)

Ew = 6103 107 270

5 Experimental Setup51 Data SetThe data set D for the empirical results is 10 days ofanonymized advertising bid requests from mobile de-vices (smartphones tablets etc) observed across sevenRTB exchanges A total of 322770794 bid requestsare observed coming from 42437559 unique userinstances Over these 10 days we observe 41829088unique IP addresses

The number of unique user instances per RTBexchange is given in Table 3 The distributions of thenumber of bid requests per device type and operatingsystem are given in Tables 4 and 5 which show thatthe majority of bid requests come from smartphonesand the iOS operating system

A user (instance) corresponds to an exchange-assigned user ID One person on several deviceswill hence correspond to several users One deviceon several exchanges will often correspond to sev-eral users Also it might be that one ad exchangesometimes provides different user IDs to the same

Table 3 User Instances per RTB Exchange

Exchange Volume Percentage ()

RTB 1 311751979 6RTB 2 211591686 4RTB 3 1415661098 29RTB 4 2015011243 41RTB 5 212751015 5RTB 6 510551815 10RTB 7 212231418 4

Table 4 Distribution of Bid Requests per Device Type

Type Volume Percentage ()

Phone 29213581772 91Tablet 3010161334 9

Table 5 Distribution of Bid Requests per OperatingSystem

Type Volume Percentage ()

Android 12913151617 4001iOS 19219231648 5908Windows 5311529 002

device for example when the device is using differ-ent particular apps As described above an impor-tant task in online advertising is to be able to targetadvertisements to different instances of the same indi-vidual with reasonable probability regardless of thecomplexities of identification in the baroque onlineadvertising ecosystem Figure 5 visualizes a sample oflatitudes and longitudes broadcast by mobile devicesin our data set

52 SamplingFor the results in this paper we apply the GSN anal-ysis to data samples Each sample corresponds to acohortneighborhood and is built as follows A sin-gle (random) ldquoegordquo user is chosen user X aroundwhich a neighborhood is built Every neighbor of theego user must share at least one IP address so webegin by obtaining all (anonymized) IP addresses vis-ited by user X Next all other users that visited atleast one of these IP addresses (the neighbors) areadded The ego must have at least one neighbor to beincluded in this analysis Finally we obtain the com-plete location distribution of each of these neighborsWhen the next sample is chosen we ensure that noneof the users already included in the previous samplesis taken as the ego user (X) As such we ensure thatwe do not have repeated egondashneighbor dyads Theremay be shared neighbors of two different egos buttwo egos will never be immediate neighbors in thedata sample

The result of the sampling over 500 cohorts is summa-rized in Figure 6 on average an ego has 71 neighbors

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers252 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

Figure 5 Scatter Plots of Samples of Latitudes and Longitudes Broadcast by Mobile Devices

ndash130 ndash120 ndash110 ndash100 ndash90 ndash80 ndash70 ndash6020

25

30

35

40

45

50

55(a)

Long

Lat

ndash20 ndash10 0 10 20 30 40 50

Long

30

35

40

45

50

55

65

60

75

70

Lat

ndash200 ndash150 ndash100 ndash50 1000 50 150 200

Long

ndash100

ndash80

ndash60

ndash40

ndash20

0

40

20

100

60

80

Lat

4068ndash7406 ndash7404 ndash7402 ndash7400 ndash7394ndash7398 ndash7396 ndash7392 ndash7390

Long

4070

4072

4074

4076

4080

4078

4086

4082

4084

Lat

(b)

(c) (d)

Note Note that locations are plotted without reference to any map The figure gives a striking picture of population density across the world and makes uswonder what is going on with mobile devices in Antarctica

Figure 6 (Color online) The Distribution of the Number of Neighborsfor 500 Randomly Sampled Ego Users

0 500 1000 1500 2000 2500 3000 35000

50

100

150

200

250

300

350

400

450

No neighbors

Fre

quen

cy

the median being 1 The maximum number of neigh-bors is 3167 this user probably visited a publicWi-Fi that thousands of other users also logged intoWe chose 500 cohorts as we observe that the statis-

tics become quite stable as seen in Figure 7 Even200 cohorts seems to be enough to obtain stablesample statistics

The life span of user instances in this set of500 cohorts is shown by the cumulative frequency inFigure 8 Observe that only about 40 of the userinstances are seen at different hours About 30 ofthe user instances are seen only once in the sampledweek This illustrates that it is vital to understand thenotion of user instances These percentages are mis-leading if we interpret the user instances as individu-als or devices One individualdevice without stableidentification in the online advertising ecosystem willcreate many transient user instances which will driveup the percentage of user instances that are seen onlyonce or that are seen only within a short timeframeWe have not figured out how to determine or esti-mate reliably how many ldquounstablerdquo usersdevices arerepresented by the user instances in these data Wekeep all of the data for the analyses below (rather thanfiltering very short-lived user instances) to maintainrealism

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 253

Figure 7 Evolution of Several Sample Statistics as an Increasing Number of Cohorts are Created

0 100 200 300 400 5000

10

20

30

40

50

60

70

Average No network neighbors

0 100 200 300 400 5000

2

4

6

8

10

12

Median No network neighbors

0 100 200 300 400 5000

50

100

150

200

250

300

350

Average No IPs

0 100 200 300 400 5000

2

4

6

8Median No IPs

Notes The horizontal axis denotes the number of cohorts By 500 cohorts the statistics have become stable

Figure 8 (Color online) Cumulative Frequency of Life Span ofUser Instances

100 101 102 103 104 105 1060

01

02

03

04

05

06

07

08

09

10

x = lifespan (sec)

F(x

)

End of June (mobile only)

larr 1 min larr 1 hour larr 1 day

Cumulative frequency of user instances

6 Study 1 Connecting Instances ofthe Same Individual

Returning to our motivating application as men-tioned previously an important task in mobile adver-tising is to be able to target a particular individual inthe face of the individualrsquos fragmentation into differ-ent user instances (for example as seen through dif-ferent bidding systems) Advertisers understand that

it may be impossible to find the same user with 100accuracy Nonetheless it often is important for themto reach the userrsquos various instances without target-ing too many others9 The GSN is designed in partbased on the hypothesis that different instances of thesame person would visit similar locations and there-fore they will be connected in the GSN In this sectionwe present results examining whether and to whatextent that is indeed the case Furthermore the resultsquantify to what extent the same user receives a highGSN-similarity to herself as compared to her otherneighbors

Three experimental settings are used to assess judg-ments that two users are the same The first two sim-ulate different user instances by splitting up the bidrequests of known user instances The last one is notsimulated it links users via the recently introducedidentifier for advertising (IFA) observed over severalRTB exchanges for a subset of device instances

1 Random We divide the IP address visits of a userrandomly across two simulated users Table 6 showsfor example how we can divide a userrsquos IP visits across

9 The astute reader may notice that if the other users targeted arevery similar to the target user the advertising may be well placeddespite not being shown to the exact target user That is not thesubject of this study Study 2 shows evidence that these two notionsare not independent when using the GSN

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers254 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

Table 6 Randomly Splitting User X rsquos IP VisitsCreating Two New Artificial Users X1 and X2

IP1 IP2 IP3

X 1 5 0X1 0 3 0X2 1 2 0

Table 7 The Volume of IFA Occurrence Within OurData Set

Volume

RTB 1 119401855RTB 2 113831962RTB 3 7891292

Table 8 The Frequency of Unique IFAs Visible AcrossRTB Exchanges

Frequency

RTB 1RTB 2 3061594 (1016)RTB 2RTB 3 451256 (169)RTB 3RTB 1 181417 (085)

two new users X1 and X2 for which we know thatthey are actually the same individual Results withRandom should be used only as a ceiling on perfor-mance because the results will be overly optimistic

2 Temporal A more realistic split-up of the trans-actions is a temporal split-up where first the middleday of Xrsquos transactions is determined (hence dynam-ically chosen for each user) All IP address visits upto this middle day are assigned to X1 and all trans-actions after the middle day are assigned to X2 Itwill only be applied to users who are seen on multi-ple days

3 Identifier for advertising IFA is an advertisingidentifier from Apple that users can change or asknot to be used for advertising and is seen as aprivacy-friendlier version of the controversial fixeddevice ID (the ldquoUDIDrdquo) The IFA is observed in threeexchanges with volumes shown in Table 7 for exam-ple 1940855 unique IFAs are seen in RTB 110 Whenwe observe the same IFA across different exchangeswe now know that two devices in different exchangesactually are the same The frequency of unique IFAsvisible across exchanges is shown in Table 8 For exam-ple a total of 3018223 IFAs are seen when we look atboth RTB 1 and RTB 2 with 306594 of these (1016)seen on both exchange 1 and exchange 2 What ifthere were not an IFA and the device would havehad a different identifier in each exchange To what

10 For confidentiality RTB 1 in this setting is not necessarily thesame as RTB 1 in sectsect4 or 7

extent are these two device instances then connectedin the GSN

With these different ldquogold standardrdquo methods toindicate whether or not two users are the samewe assess the ability of the geosimilarity metricsto link the same user determining how often twoinstances of the same person are connected in theGSN (sect61) Afterward we shall assess the differentGSN strength metricsrsquo ability to rank users determin-ing to what extent two instances of the same personexhibit strong geosimilarity as compared to the otherneighbors (sect62) For the random and temporal split-ups 200 samples are used For the IFA-based methodthe complete population is used for measuring thedegree of connectedness to the same user to analyzethe relative strength of the connected users 200 sam-ples are used

61 Same Individual Different ldquoScreensrdquoWe now present results on the GSN connectednessmeasured as how often two instances of the same userare connected in the GSN (in percentage) Note thatthe results summarized in Table 9 should be inter-preted in light of the limited number of days in build-ing the research data set which will limit the GSNconnectivity Nonetheless the relatively high connec-tivities observed even with this sample provide quitepromising results which lends considerable supportto the merit of the overall design

Random0 Using the random split two users thatcorrespond to the same individual are not connectedin only 9 of the neighborhoods Moreover in thesecases user X (whose IP visits are split up) usuallyhas only one or two entries and the random split-upresults in no visits for one of the two simulated users

Temporal0 The temporal split-up is more realisticthan the random one where we assume that a personuses simulated device 1 in the first half of the week(eg during the weekend) and simulated device 2 inthe second half of the week (eg during the week)

Users X1 and X2 are not connected in 32 of theneighborhoods Whereas the random split-up was toooptimistic the temporal one is too pessimistic as inreality there likely is some overlap in the time peri-ods of use of two different devices If user X visitsan IP address only in the first days of the week this

Table 9 To What Extent is the Same User Connectedto Herself

connected

Random 91Temporal 68IFAmdashRTB 1RTB 2 67IFAmdashRTB 2RTB 3 82IFAmdashRTB 3RTB 1 73

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 255

IP address will not occur in user X2rsquos transactions andreduces the chance to be connected to X1

IFA0 We look at all combinations of the three ex-changes with IFA identifiers For example there are45256 unique IFAs seen on both RTB 2 and RTB 3 In82 of these the two IFAs visit at least one same IP onboth exchanges and hence are connected in our GSN

The bottom line is that even with this limited slice ofonline behavior different instances of the same indi-vidual are connected in the GSN 70ndash80 of the timeEncouragingly the results from the simulated scenar-ios concur with the results from the real (IFA) sce-narios (including the aforementioned optimism andslight pessimism of the two simulated scenarios)Consistency over these different settings gives addi-tional confidence in the results suggesting that theGSN indeed holds promise for a high degree of suc-cess at targeting the same user across the fragmenta-tion of the online advertising ecosystem

62 Ranking User InstancesNext given that two user instances of the same indi-vidual are indeed connected we assess to what extentthe second instance of the same user is ranked highlyamong its network neighbors based on the weightof their linkage in the GSN We first will discuss themost general results using the best ranking methodand then will look across different ranking methods

We decompose the results based on the numberof network neighbors (NNs) since there are twoboundary cases that require special interpretationTables 10ndash14 show the proportions of the ego net-works that fall into three scenarios First a user mayhave only one NN In this scenario the ranking isperfect (trivially)mdashthe other instance of the same useris the only user instance given a nonzero score As

Table 10 In What Portion of Cases Is the Same UserConnected Depending on the Number ofNetwork NeighborsmdashRandom Split-up

Ego has only 1 NN 3104Ego has exactly 2 NNs 1905

Same user ranked first 8005Ego has more than 2 NNs 4901

Note For the scenario with two NNs the table reports thepercentage of cases where the other known instance of thesame user is ranked first (including ties)

Table 11 In What Portion of Cases Is the Same UserConnected Depending on the Number ofNetwork NeighborsmdashTemporal Split-up

Ego has only 1 NN 3008Ego has exactly 2 NNs 1807

Same user ranked first 8309Ego has more than 2 NNs 5005

Table 12 In What Portion of Cases Is the Same UserConnected Depending on the Number ofNetwork NeighborsmdashIFA RTB 1RTB 2Split-up

Ego has only 1 NN 4401Ego has exactly 2 NNs 1201

Same user ranked first 8201Ego has more than 2 NNs 4400

Table 13 In What Portion of Cases Is the Same UserConnected Depending on the Number ofNetwork NeighborsmdashIFA RTB 2RTB 3Split-up

Ego has only 1 NN 6109Ego has exactly 2 NNs 1907

Same user ranked first 7102Ego has more than 2 NNs 1804

Table 14 In What Portion of Cases Is the Same UserConnected Depending on the Number ofNetwork NeighborsmdashIFA RTB 3RTB 1Split-up

Ego has only 1 NN 5305Ego has exactly 2 NNs 2501

Same user ranked first 6304Ego has more than 2 NNs 2104

shown in the tables this scenario accounts for 30ndash60 of the ego networks (depending on the split-upmethod)

The second boundary case is when the ego has ex-actly two NNs This scenario accounts for 10ndash25of the cases depending on the split-up method Herethe tables also report whether the other known userinstance is ranked first (including ties) among thetwo neighbors The other known instance of the sameuser is ranked first 60ndash80 of the time dependingon the split-up scenario The IFA results are affectedstrongly by the particular RTB settingmdashthis may indi-cate that the publishers on the different RTBs have dif-ferent policies for passing the IFA Importantly theseresults are conservativemdashpossibly very conservativeThe other NNs may also be instances of the same userunbeknownst to usmdashfor example in the IFA casebecause an app does not pass the IFA to the RTB orbecause the other user is the same user on a differentdevice and thus does not share the IFA

The final scenario is when more than two NNsare present As shown in the tables this accounts for20ndash50 of the cases depending on the split-up sce-nario In this case we measure the percentile in theranking where the instance of the known same userfalls (see below) as well as the area under the receiveroperating characteristic curve (AUC) (Fawcett 2006)which is equivalent to the MannndashWhitneyndashWilcoxon(MWW) statistic The AUCMWW measures to what

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers256 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

Table 15 When There Are More Than Two Network Neighbors to What Extent Is the Known Same User StronglyConnected to Itself

AUC (if gt 2 NNs) () Average rank percentile (if gt 2 NNs) ()

Ranking metric R T RTB 12 RTB 23 RTB 31 R T RTB 12 RTB 23 RTB 31

1 78 64 68 59 55 29 48 34 43 542 59 58 67 58 54 45 53 34 44 553 79 69 74 61 61 29 44 29 40 484 86 75 69 61 59 24 40 32 41 505 83 70 71 59 63 26 44 31 43 476 85 74 71 60 65 25 41 31 42 457 86 76 73 62 64 24 40 29 40 468 83 70 72 59 65 26 44 30 43 459 85 74 71 60 67 25 41 30 42 44

10 79 69 69 61 63 29 44 32 41 4711 79 69 71 62 64 44 44 30 41 4612 81 68 68 60 62 28 46 34 42 4913 81 70 71 61 64 27 44 30 40 4614 74 67 66 61 48 34 47 35 41 6015 74 67 66 61 48 34 47 35 41 60

Note See the text for a discussion of ranking metrics

extent data points with label 1 are ranked higherthan those with label 0 In the results that follow(see Table 15) the AUC is measured for each samplewhere any user instance among the neighbors thatcorresponds to the same individual is labeled as 1and all others are labeled as 0 Thus in this settingthe AUC measures how well the neighbors are rankedby the likelihood of being the same individual11 Theaverage over all samples is reported If for a sampleall neighbors have the same score that AUC is setto 05 In addition to the factors noted above this alsowill tend to make the results conservative For exam-ple if the GSN links an ego user to a set of neighborsthat are all instances of the same individual in thisevaluation the AUC will be 05

The ranking measure is the percentile in the rank-ing at which the known-same-user instance is rankedwith lower values being better For example if thesame-user instance has the highest score (closest con-nection) among five neighbors the rank is 20 (15)Note again that this is quite conservative in thisexample the user is ranked at the top of the list butbecause there were only five neighbors the highestpossible score is 20 In case the known-same-userinstance has the same score as another neighbor theaverage rank is reported

As shown in Table 15 all AUC values exceed 05ranging up to 08 This means that in every casea known-same-user instance is likely to be rankedhigher than an instance not known to be the same

11 Technically in this setting the AUCMWW is the probability thata randomly selected neighbor that is in fact a known instance of thesame user is ranked more highly than a randomly selected neighborthat is not known to be an instance of the same user If the known-same-user instances are all ranked above the other-user instancesthe AUC = 100 If they are all below then the AUC = 0

user The frequency-based ranking metrics (4ndash9) per-form quite well in connecting instances of the sameuser strongly

Almost all metrics perform better than a randommodel with performances that even come close to thebest possible solution Consistently performing quitewell are the frequency-based metrics (metrics 3ndash9)with metric 7 (the minimum of the IDF weights ofthe shared locations) getting the most wins Not per-forming well is the very basic count metric as well asthe Boolean metric that provides a binary score onlyInteresting to note the Jaccard metrics do not performwell either It seems that penalizing users that visitmany IP addresses negatively affects the results anissue we will return to in the next study

7 Study 2 Does the GSN Select Userswith Similar Interests

The foregoing section addressed our paperrsquos firstclaimmdashthat the GSN design indeed links instances ofthe same user and links them more strongly thaninstances of other users In this section we turn to thepaperrsquos second claimmdashthat the GSN links users withsimilar interests We will measure similar interests bysimilarity in behavior accessing particular publisherson the Web and using mobile apps12

71 Connecting Users with Similar Interests

To What Extent Are the GSN Neighbors of Visi-tors to Particular Publishers Also Visitors to ThoseSame Publishers To assess whether the GSN con-nects users with similar interestsbehavior we select

12 From now on all user instances will be complete real user in-stances the simulated split-up scenarios (Random Temporal) werecreated specifically for the purpose of the same-user study

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 257

Table 16 Distribution of Publishers per RTB in Terms of thePercentage of Users in Our Data Set That Visit the GivenPublisher

Publisher of D

RTB 1Burstly 3004myYearbookcom 1062My Fitness Pal 0065TextMe Inc 0056GameResort 0025Flixster 0019247Sports 0015Daily Workout Apps LLC 0009Mobile Deluxe 0007YoYo Games Ltd 0004

RTB 2mworldnowcom 0032mtmzcom 0024mtopixcom 0014babycentercom 0015meetmecom 0006appsfacebookcom 0006itunesapplecom 0039appevitecom 0002beautyandskincarescom 0003mmajunkiecom 0001

RTB 3Conversion Exchange 5053Pinger Inc 2096Top Game Developer 4029PremiumEntertainmentApp-FamilyOriented-Android 0059Enflick 0053Clapfoot Games 0044FunPokes Inc 0018Talkatone 0015Sevenlogics Inc 0008A Star Software 0007

RTB 4Pinger PhonemdashiOSmdashConversation 1001Talkatone iPhone App 0007DiceWithBuddies 00139GAG Reader 0012Video Downloader Pro Lite 0016Rage Comics 0009Spades 3D Lite 0006Apalabrados 0003Relax Melodies HD 0002Celeb MemdashPhotoMaker 0000

a diverse collection of 40 publishers listed in Table 16A publisher can correspond to a mobile app a web-site or a set of appswebsites We selected both pub-lishers that are visited by many users and more nichepublishers with fewer visitors As can be seen fromTable 16 several types of publishers can be distin-guished the main groups are games (eg GameRe-sort Top Game Developer) and social networkingtools (eg appsfacebookcom Text Me Inc)

More specifically we create 200 samples for 40 pub-lishers 10 publishers for each of four different RTB

exchanges listed in Table 1613 The percentage ofunique users that visit each publisher in our completedata set is also shown (D) To create each samplean ldquoegordquo user is selected that has visited the pub-lisher in question (an ldquoego visitorrdquo) this ego visi-torrsquos GSN neighborhood is created Next we assesshow many of the egorsquos neighbors also visited thatsame publisher This will be compared with a base-line visit rate to produce lift and leverage values(Provost and Fawcett 2013)mdashspecifically how muchmore likely is it for the neighbors of ego visitors tovisit the publisher than would be expected by chanceWe consider three different setups that produce dif-ferent baselines

1 Considering all RTB exchanges to determine the base-line A user is considered a visitor for a given pub-lisher if she visits the publisher on any RTB ex-change (since some publishers are visible on differentexchanges) Hence the baseline visit rate is the num-ber of users (on any RTB) who visit that publisherdivided by the total number of users across all RTBexchanges This will provide optimistic lifts sincemost publishers appear only on one RTB exchangewhereas the baseline is measured over all exchanges

2 Considering one RTB exchange only to determine thebaseline A user is considered a visitor for a given pub-lisher and RTB if she visits the publisher on the sameRTB exchange This changes the baseline to the num-ber of users (on this RTB) who visit that publisherdivided by the total number of users on this RTBexchange The lift and leverage results will be worsethan in the first setting because the baseline will belarger Seeing that some publishers are seen on differ-ent RTB exchanges this setting is slightly pessimistic

3 Taking time of location visits into account In thethird setting we link users only if they visit an IPwithin the same time period This will be elaboratedon in sect73

Results are shown in Tables 17 and 18 The percent-age of publisher visitors in the neighborhood is givenas N This percentage is compared with the baselinepercentage of users that visit that publisher in thecomplete data D using two metrics the lift is the ratioof these numbers the leverage is the difference Liftis particularly useful for very small base rates (con-sider that an activity with a base rate of 30 cannotpossibly have a lift higher than about 3) Leverage ismore telling for larger base rates because it shows theabsolute improvement (here in percentage points)

When considering all NNs the lifts for all publish-ers are greater than 1 except for one Burstly on RTB 1has a lift of 097 (essentially no lift) but only when weconsider RTB 1 as the baseline This publisher already

13 For confidentiality the RTB numbering here is again differentfrom above

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers258 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

Table 17 RTBs 1 and 2 Percentage of Network Neighbors That Also Visit the Publisher

RTB 1

All NNs N1 N10

D N Lift Lev N1 Lift Lev N10 Lift Lev

Considering all RTB exchanges to determine baseline1 Burstly 3004 8063 2084 5059 47022 15054 44018 25061 8043 220572 myYearbookcom 1062 25077 15094 24015 44074 27067 43012 35016 21074 330543 My Fitness Pal 0065 4015 6038 305 3609 56077 36025 15055 23091 14094 TextMe Inc 0056 1805 33006 17094 39002 69075 38046 25018 45 240625 GameResort 0025 4093 19077 4068 18052 74026 18027 11086 47058 110626 Flixster 0019 1055 8015 1036 8014 42085 7095 1079 9042 1067 247Sports 0015 27057 188058 27042 77062 530093 77048 74053 509075 740388 Daily Workout Apps LLC 0009 1092 21047 1083 19023 215046 19014 11054 129027 110459 Mobile Deluxe 0007 1087 28067 108 23053 36009 23046 9063 14707 9056

10 YoYo Games Ltd 0004 3052 99058 3049 21057 609064 21053 13 367045 12096Considering this RTB exchange only to determine baseline

1 Burstly 8085 8063 0097 minus0022 47022 5033 38037 25061 2089 160762 myYearbookcom 4071 25077 5047 21006 44074 905 40003 35016 7046 300453 My Fitness Pal 1089 4015 2019 2025 3609 19048 35001 15055 8021 130654 TextMe Inc 1063 1805 11035 16087 39002 23094 37039 25018 15045 230555 GameResort 0073 4093 6078 402 18052 25049 17079 11086 16033 110146 Flixster 0055 1055 208 0099 8014 14071 7059 1079 3023 10247 247Sports 0043 27057 64073 27014 77062 182023 7702 74053 174097 74018 Daily Workout Apps LLC 0026 1092 7037 1066 19023 73095 18097 11054 44037 110289 Mobile Deluxe 0019 1087 9084 1068 23053 123088 23034 9063 5007 9044

10 YoYo Games Ltd 001 3052 34018 3042 21057 209025 21047 13 126012 1209

RTB 2

Considering all RTB exchanges to determine baseline1 mworldnowcom 0032 13027 40086 12095 54039 167042 54006 31065 97044 310332 mtmzcom 0024 6089 28086 6065 23008 9607 22084 10068 44075 100443 mtopixcom 0014 15007 104079 14093 53045 37106 5303 33075 234065 330614 babycentercom 0015 6087 47031 6072 43048 299056 43033 17074 122024 17065 meetmecom 0006 6002 10308 5097 54072 942093 54066 27085 47909 270796 appsfacebookcom 0006 301 48039 3003 9009 141097 9003 2076 43008 20697 itunesapplecom 0039 25077 65071 25038 51043 131012 51004 29049 75018 290098 appevitecom 0002 48028 21212043 48025 54076 21509068 54074 48028 21212043 480259 beautyandskincarescom 0003 18019 642087 18016 8048 299094 8046 1107 413049 11067

10 mmajunkiecom 0001 1036 201064 1035 26067 3195505 26066 13033 11977074 13033Considering this RTB exchange only to determine baseline

1 mworldnowcom 4034 13027 3006 8093 54039 12053 50005 31065 7029 270312 mtmzcom 3019 6089 2016 307 23008 7024 19089 10068 3035 70493 mtopixcom 1092 15007 7084 13015 53045 27081 51053 33075 17056 310834 babycentercom 1094 6087 3054 4093 43048 22042 41054 17074 9015 15085 meetmecom 0078 6002 7077 5025 54072 70057 53094 27085 35092 270076 appsfacebookcom 0086 301 3062 2024 9009 10063 8024 2076 3022 1097 itunesapplecom 5024 25077 4092 20053 51043 9081 46019 29049 5063 240258 appevitecom 0029 48028 165058 47098 54076 187082 54047 48028 165058 470989 beautyandskincarescom 0038 18019 48011 17081 8048 22045 8011 1107 30095 11032

10 mmajunkiecom 0009 1036 15009 1027 26067 296003 26058 13033 148001 13024

Notes Results are given for the complete data set (D) the complete neighborhood of a user who visited the given publisher (All NNs) and considering theneighborhood of closest 1 and 10 users (N1 and N10)

has a very high baseline and as we will see in thenext section when we consider only the top 1 and 10NNs the lifts go up to 5 and 3 respectively Burstlyprovides a generic platform for building apps and sois a weaker indication of user interest than many ofthe other publishers (fitness babies games)

All other lifts exceed one by substantial mar-gins and go up to even 2391 (Apalabrados onRTB 4 a Spanish-language social Scrabble-like game)

Obviously the conclusion that the NNs indeed aremore likely to visit the same publishers is highly sig-nificant by a simple sign test comparing the visita-tion rate within the geosimilarity neighborhood tothat of overall population either via lift or leveragep lt 0001 Thus NNs of publisher visitors are indeedmore likely also to visit the same websites the abso-lute values of the lifts and leverages suggest that theyare substantially more likely

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 259

Table 18 RTBs 3 and 4 Percentage of Network Neighbors That Also Visit the Publisher

RTB 3

All NNs N1 N10

D N Lift Lev N1 Lift Lev N10 Lift Lev

Considering all RTB exchanges to determine baseline1 Conversion Exchange 5053 34054 6024 29001 64071 11069 59017 50077 9018 450242 Pinger Inc 2096 46002 15053 43006 65012 21097 62015 51048 17037 480523 Top Game Developer 4029 47024 11002 42095 60 13099 55071 48097 11042 440684 PremiumEntertainmentApp-FamilyOriented-Android 0059 23079 40012 2302 20045 34049 19086 22025 37052 210665 Enflick 0053 11071 21095 11017 21057 40044 21004 12027 23 110736 Clapfoot Games 0044 8055 19032 8011 11054 26006 1101 13099 31061 130557 FunPokes Inc 0018 11011 61028 10093 50 275077 49082 15079 87008 150618 Talkatone 0015 13013 88051 12098 31011 209076 30096 17091 120078 170779 Sevenlogics Inc 0008 2071 32059 2063 5088 70074 508 8033 100022 8025

10 A Star Software 0007 3033 44086 3025 23008 311024 23 9009 122061 9002Considering this RTB exchange only to determine baseline

1 Conversion Exchange 11045 34054 3002 23009 64071 5065 53025 50077 4043 390322 Pinger Inc 6013 46002 705 39089 65012 10062 58098 51048 8039 450353 Top Game Developer 8088 47024 5032 38036 60 6076 51012 48097 5052 400094 PremiumEntertainmentApp-FamilyOriented-Android 1023 23079 19038 22057 20045 16066 19023 22025 18013 210025 Enflick 101 11071 1006 1006 21057 19054 20046 12027 11011 110166 Clapfoot Games 0092 8055 9033 7064 11054 12059 10062 13099 15027 130087 FunPokes Inc 0038 11011 2906 10074 50 133022 49062 15079 42007 150418 Talkatone 0031 13013 42076 12082 31011 101034 3008 17091 58035 170619 Sevenlogics Inc 0017 2071 15074 2054 5088 34017 5071 8033 48041 8016

10 A Star Software 0015 3033 21067 3017 23008 150036 22092 9009 59023 8094

RTB 4

Considering all RTB exchanges to determine baseline1 Pinger PhonemdashiOSmdashConversation 1001 71081 70085 7008 80077 79068 79076 70014 6902 690132 Talkatone iPhone App 0007 11038 17203 11031 29003 439068 28097 17002 257078 160963 DiceWithBuddies 0013 7041 59004 7028 18018 144092 18006 7069 61031 70574 9GAG Reader 0012 62013 497099 62 49009 39305 48097 49005 393017 480925 Video Downloader Pro Lite 0016 20003 123015 19087 22058 138084 22042 17093 110027 170776 Rage Comics 0009 609 8001 6081 13033 154086 13025 609 8001 60817 Spades 3D Lite 0006 1205 225076 12044 27027 492057 27022 1205 225076 120448 Apalabrados 0003 77042 21391071 77039 75 21316097 74097 79032 21450043 790299 Relax Melodies HD 0002 31058 21002029 31056 50 31170029 49098 31058 21002029 31056

10 Celeb MemdashPhotoMaker 0 3023 82107 3022 7014 11819048 7014 3023 82107 3022Considering this RTB exchange only to determine baseline

1 Pinger PhonemdashiOSmdashConversation 8051 71081 8044 6303 80077 9049 72026 70014 8024 610632 Talkatone iPhone App 0055 11038 20053 10082 29003 52038 28048 17002 30071 160473 DiceWithBuddies 0009 7041 82011 7032 18018 201054 18009 7069 85027 7064 9GAG Reader 0097 62013 63097 61016 49009 50055 48012 49005 50051 480085 Video Downloader Pro Lite 0044 20003 45036 19059 22058 51014 22014 17093 40062 170496 Rage Comics 0037 609 18079 6053 13033 36032 12097 609 18079 60537 Spades 3D Lite 0046 1205 2609 12004 27027 58068 26081 1205 2609 120048 Apalabrados 0021 77042 371093 77021 75 360031 74079 79032 381006 790119 Relax Melodies HD 0013 31058 238054 31045 50 377069 49087 31058 238054 31045

10 Celeb MemdashPhotoMaker 0003 3023 97089 3019 7014 216076 7011 3023 97089 3019

Notes Results are given for the complete data set (D) the complete neighborhood of a user who visited the given publisher (All NNs) and considering theneighborhood of closest 1 and 10 users (N1 and N10)

The very high lifts could be attributed to two phe-nomena They may provide some evidence that theGSN embeds a true social network as pointed out bythe good results for the social networking publishers(eg Apalabrados with a lift of 2391 appsevitecomwith a lift of 2212 and others) Friends visiting thesame social networking sites (here unusual ones)

would increase the lift and leverage over just havingdifferent instances of the same user visiting the samewebsites An alternative explanation is that some ofthese publishers are much more likely to be visitedacross different devices of the same user It is not clearwhether this is actually true However it argues forobtaining specific and broad data on which devices

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers260 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

are not the same user to shed further light on forexample any broad data that include a (anonymized)user identifier

From these results we can conclude that GSNneighbors indeed exhibit substantially similar pub-lisher visitation behavior Next we assess to whatextent more strongly connected NNs are more similarthan less strongly connected NNs

72 Ranking Users with Similar Interests

Are the More-Similar Geosimilarity Neighbors of aPublisherrsquos Visitors Even More Likely Also to Be Vis-itors to the Same Publisher To assess whether more-similar geosimilarity neighbors are even more likelyto share interests we compute the measures exactlyas in the previous study except instead of consider-ing all NNs only the 1 (N1) and 10 (N10) neighborswith the highest scores are considered (see Tables 17and 18) First we must consider which of the 15 met-rics that are defined to measure link strength to useTable 19 reports the rank aggregations of the differentmetrics as to which provide the best lifts Specificallyfor each of the 40 publishers each scoring metric pro-vides a lift and for that publisher the scoring metricscan be ranked (the best being ranked 1 etc) The rankaggregation is the average rank for a scoring metricacross all 40 publishers Thus if one metric providedthe best lift for all publishers it would get an aggre-gated rank of 1

To visualize the best metrics the metrics with anaverage ranking less than 8 are shown in boldfacein Table 19 The Jaccard metrics do not perform wellin this study either showing that penalizing the con-nection to users that visit many locations is not sen-sible Remember that the Jaccard metric is based onthe idea that users that visit many locations will havea higher chance to share some location with anotheruser However we have so many locations in totalthat visiting a couple more locations will increase onlymarginally this probability of sharing a location bychance A user that logs into more IP addresses andis more active should hence not be penalized andranked lower than other less active users Metric 3which sums the IDF scores of the shared locationsperforms quite well Given the operational efficiencyadvantage over the frequency-based metrics this met-ric is chosen to measure the strength of the links

As seen from the results in Tables 17 and 18 thelift and leverage values for the strongly connected (top 1and top 10) NNs are even higher sometimes astro-nomical (3170 for Relax Melodies HD (RTB 4)) Theseresults are highly significant by sign tests compar-ing either the lift or the leverage between the top-ranked neighbors and the entire cohort (p lt 0001) Theabsolute values again indicate that the users with the

strongest geosimilarity are often substantially morelikely to visit the same publisher

Thus we can conclude that not only does geosim-ilarity find users with similar interests it also ranksusers well by their likelihood of having similarinterests

An interesting follow-up question is whether thecharacteristics of a publisher are of importance for theresults For example are GSN neighbors more sim-ilar in terms of using the same social network appas compared to playing the same games To answerthis question the publishers are categorized into sixclasses funny social news games communicationand miscellaneous In Figure 9 the lifts of the pub-lishers are shown per category Some general trendscan be observed where publishers related to funnycontent have the highest median lift This could beexplained by the fact that users often forward funnycontent to their friends The social websitesapps fol-low next which seems to further demonstrate thatfriends are likely linked in the GSN However pub-lishers in the communication category perform not sowell compared with the other categories and ratherhigh variances are observed in the different cate-gories Please note that high variance is observedacross the categories therefore it is difficult to makewell-supported claims from these results

73 TimingAs a final evaluation we explore whether the inclu-sion of temporal similarity in location visitation ishelpful for defining GSN connections To this end weintroduce a setting where two users are connectedonly if they visit an IP address in the same timeperiod of a day Specifically we divide a day intodifferent time windowsmdashfor example two time peri-ods of 12 hoursmdashwhere users are connected only ifthey visit the same IP address in the same time periodof a day (which might be on different days) Practi-cally two digits are added to an IP which denote thestarting moment of the corresponding time periodWe report on time periods of 2 hours and of 12 hours(other periods yield similar results) We also includethe setting where users are connected only when theyvisit the same IP address on the same day in light ofthe result of Crandall et al (2010) that visiting thesame location at around the same time is indicativeof friendship

The resulting lifts are reported in Figures 10ndash12when considering respectively all NNs the top 1 NNand the top 10 NNs The results when not consider-ing the time period are repeated as well (All) lim-ited to the conservative case of only considering oneexchange (see above) The baseline (lift of one) isindicated with a red horizontal line lifts higher than1000 are shown as 1000 to limit the range to be able

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 261

Table 19 Rank Aggregation per Metric Over All Publishers (Across All RTBs)

Top 1 NN Top 10 NNs

1 Shared unique IPs 801 1 Shared Unique IPs 802252 BooleanmdashSharing an IP 80625 2 BooleanmdashSharing an IP 809133 Shared unique IPsmdashIDF weighted 70025 3 Shared unique IPsmdashIDF weighted 709754 Shared IP visitsmdashMin 70063 4 Shared IP visitsmdashMin 704135 Shared IP visitsmdashMax 7095 5 Shared IP visitsmdashMax 70256 Shared IP visitsmdashAvg 8005 6 Shared IP visitsmdashAvg 704887 Shared IP visitsmdashMinmdashIDF weighted 60625 7 Shared IP visitsmdashMinmdashIDF weighted 703758 Shared IP visitsmdashMaxmdashIDF weighted 70975 8 Shared IP visitsmdashMaxmdashIDF weighted 702259 Shared IP visitsmdashAvgmdashIDF weighted 70938 9 Shared IP visitsmdashAvgmdashIDF weighted 70525

10 Cosine similarity 80338 10 Cosine similarity 804511 Cosine similaritymdashIDF weighted 80213 11 Cosine similaritymdashIDF weighted 8086312 GSN Jaccard 80925 12 GSN Jaccard 913 GSN JaccardmdashIDF weighted 80325 13 GSN JaccardmdashIDF weighted 9013814 GSN Jaccard Freq 80425 14 GSN Jaccard Freq 80515 GSN Jaccard FreqmdashIDF weighted 80425 15 GSN Jaccard FreqmdashIDF weighted 805

to visualize the results Tables 20 and 21 show theaverage rank aggregation (see the discussion of rankaggregation above) in terms of lift per RTB and thep-values of a signed rank test respectively

When comparing the different temporal settingswe see that the ldquoDayrdquo variant performs best and lim-ited differences exist between the time periods of 2and 12 hours More important to observe from theseresults is that (overall) not including time in the GSNdesign outperforms all three temporal variants sub-stantially Only for RTB 3 and the Day version whereusers are connected only if they visit the same IPaddress on the same day are the lifts not signifi-cantly worse than having no time constraint at a 1level (although even here they are significantly worseat a 10 level) This shows that for finding userswith similar interests which locations they visit ismore important than when they visit them By includ-ing the time constraints connections between previ-ously connected users are lost simply because they

Figure 9 (Color online) Lifts of the Publishers (for All Four RTBs)Ranked According to the Median Lift Within the Category(Medians Shown by the Full Line)

visit locations in different time periods Although weuse time to define links metrics that include time todefine the strength of a link might improve the resultsfurther and we consider that an interesting issue forfuture research

8 ConclusionThis paper presents a new design for using geosim-ilarity to connect user instances in a geosimilarity-weighted network In the GSN users are connected ifthey share at least one observed location Various met-rics can be used to yield degrees of similarity We pre-sented intuitive arguments theory and prior researchthat suggest that geosimilarity and the GSN designshould link similar users We also provided strongempirical support Specifically Study 1 shows thatthe GSN indeed links different users instances cor-responding to the same individual and the geosimi-larity metrics rank user instances by their likelihoodof corresponding to the same individual Study 2shows that the GSN links users who have similarinterests as measured by their propensity to visitthe same publishersuse the same apps It also pro-vides strong evidence that the geosimilarity metricsrank user instances by their likelihood of havingsimilar interests Taken together the results providestrong support that the GSN links users with similarinterestsmdashin many cases because they actually repre-sent the same individuals

The main limitation of the research presented inthis paper is that the data are not sufficient to dis-tinguish whether the similarity in interests is solelybased on connecting instances of the same user Thisis an important avenue for future research This lim-itation notwithstanding the very strong results fromStudy 2 have important implications in either caseeither the network is indeed finding different indi-viduals with substantially similar interests or we

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers262 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

Figure 10 (Color online) Lifts Considering All Network Neighbors per RTB System (Each Panel) With and Without the Time Constraint Limited toOne Network (see sect71) for the 40 Publishers

RTB 3mdashAll NNs

0 5 10

Publisher

0

20

40

60

80

100

Lift

RTB 4mdashAll NNs

0 5 10

Publisher

0

200

400

600

800

1000

Lift

RTB 2mdashAll NNs

0 5 10

Publisher

0

200

400

600

800

1000

Lift

0 5 100

200

400

600

800

1000RTB 1mdashAll NNs

Publisher

Lift

All

TP2

TP12

Day

Note Lifts higher than 1000 are shown at 1000

Figure 11 (Color online) Lifts Considering the Closest 1 Network Neighbor per RTB System (Each Panel) With and Without the Time ConstraintLimited to One Network (see sect71) for the 40 Publishers

Lift

0

100

200

400

300

RTB 3mdashTop 1 NNs

0 5 10

Publisher

All

TP2

TP12

Day

0 5 100

200

400

600

800

1000RTB 1mdashTop 1 NNs

Publisher

Lift

RTB 2mdashTop 1 NNs

0 5 10

Publisher

0

200

400

600

800

1000

Lift

RTB 4mdashTop 1 NNs

0 5 10

Publisher

0

200

400

600

800

1000

Lift

Note Lifts higher than 1000 are shown at 1000

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 263

Figure 12 (Color online) Lifts Considering the Closest 10 Network Neighbors per RTB System (Each Panel) With and Without the Time ConstraintLimited to One Network (see sect71) for the 40 Publishers

RTB 2mdashTop 10 NNs

0 5 10

Publisher

0

200

400

600

800

1000

Lift

0 5 10

Publisher

Lift

0

50

100

150

0 5 10

Publisher

0

200

400

600

800

1000

Lift

0 5 100

200

400

600

800

1000

RTB 1mdashTop 10 NNs

RTB 3mdashTop 10 NNs RTB 4mdashTop 10 NNs

Publisher

Lift

All

TP2

TP12

Day

Note Lifts higher than 1000 are shown at 1000

have very strong additional support for the rankingresults of Study 1 as the neighbors with the strongestgeosimilarity have significantly higher affinity for thesame publishersapps

These results have broad and immediate manage-ment implications within our motivating applicationof mobile advertising As discussed above digitalmarketers would like to target the many fragmentedinstances of individuals in the digital advertising

Table 20 Rank Aggregations in Terms of Lift AveragedAcross All Publishers (per RTB)

All TP2 TP12 Day

RTB 1 101 209 309 201RTB 2 1 204 304 302RTB 3 101 3 4 109RTB 4 101 208 308 203

Note Linking based on time does not improve over linkingignoring time

Table 21 Evaluation of Including Time p-Values ofSigned Rank Test

All TP2 TP12 Day

RTB 1 mdash 0 0002 0 0002 0 0006RTB 2 mdash 0 0002 0 0002 0 0002RTB 3 mdash 0 0002 0 0002 00065RTB 4 mdash 0 0002 0 0002 0 0004

ecosystem as well as users who have similar intereststo particular ldquoseedrdquo users Location data such asIP addresses are readily available across the adver-tising ecosystem The most direct use of the GSN formobile ad targeting is to seed a targeting campaignwith users chosen to exhibit some characteristics ofinterest These could be chosen through any of themyriad of methods currently used in advertising TheGSN will directly allow the targeting of the othermembers of these usersrsquo geosimilarity cohorts whichwill include other instances of the same user othersimilar users and likely both So for example if a setof users has been identified to have brand affinity viavisits to a brandrsquos website the GSN cohorts of theseusers would provide an attractive avenue to expandthe reach of a campaign to target them (extendingtraditional retargeting) and also target users similarto the seeds (customer prospecting) In the researchdata set we see that different instances of the sameusers are very often connected in the GSN and thatthose connected in the GSN exhibit substantially sim-ilar interests

The GSN also could be used for (privacy-friendly)hyperlocal targetingmdashmeaning targeting people whofrequent a particular location without needing to storedata on the actual locations of the users as describedearlier IP addresses currently are used by somemarketers for coarse-grained demographic targetinginferring geography from IP address registration data

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers264 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

and then inferring demographics from the geographyThe GSN provides a complementary alternative it willconnect people who visit the same fine-grained loca-tions As described above in the couponing exampleif a campaign is seeded with customers of a localestablishment (eg via an online loyalty program)geosimilarity can target others who frequent the samelocations

An exciting potential future use of the GSN is theevaluation of marketing campaigns across differentchannels This has become an important challenge formobile advertizing (Lee 2014)14 For example I mightreceive an ad for a hot new product on my mobiledevice I may be interested in the product but Irsquomunlikely to buy this product on my mobile phoneRather Irsquoll buy the product using my laptop or PC Bylinking the screens of the same user we are now ableto provide a broader (and possibly much more robust)estimate of campaign success (eg conversion rate)across different channels by aggregating the successmetrics across GSN neighbors For example similar tohow a traditional ad campaign may runnot run anad in a controlled study across different cities agen-cies could targetnot target different (matched) usersand then look at the success rates in their respec-tive geosimilarity neighborhoods This should captureeffects of the same individual on different devices

As discussed at the outset advertisers need to becognizant of privacy concerns regarding the collec-tion storage and use of data This paper followswhat the FTC calls the ldquoprivacy by designrdquo approachBy design the technique does not need to store anydirect personal information about mobile users thereis no need for nonanonymized identifiers demo-graphics geographics psychographics etc In addi-tion the storage of indirect information about userscan be severely limited as well The method doesnot need the actual locationsmdashonly anonymized loca-tion ldquokeysrdquo So for example IP addresses can bereplaced with random numbers15 without affectingthe GSN performance (European Commission 2014Federal Trade Commission 2012) More technically atthe ldquoouter wallrdquo of the system or firm each deviceid can be irreversibly hashed to a random key Theonly requirement is that the same device be hashedto the same key if encountered again Similarly atthe ldquoouter wallrdquo of the system or firm every loca-tion also can be irreversibly hashed to a random keyThe GSN can be formed just the same with the ran-dom keys as with the actual locations If more privacy

14 httpwwwclickzcomclickznews233699680-of-marketers-will-run-cross-channel-marketing-campaigns-in-2014-study15 See both the FTCrsquos and the European Commissionrsquos 2010ndash2011privacy-by-design reports and the comments therein along withProvost et al (2009) for a thorough treatment of this topic

is desired hashing can be done irreversibly many toone and in that case it becomes impossible to asso-ciate definitively any particular location with any par-ticular deviceuser

Finally an aspect of this research that is importantfor managers but not typically covered in the predic-tive modeling literature is the notion of increasing thereach of a campaign For example as discussed abovepossibly the most straightforward use of the GSN isto select the same actor on different mobile devicesThis would allow us to expand the reach of a ldquoretar-getingrdquo campaign16

Increasing reach has important subtleties in the cur-rent advertising ecosystemmdashit is not just the otherside of the coin of increasing lift The reason is thatthere are many targeters in the onlinemobile adver-tising ecosystem all of whom are using the samedata retargeting data and demographicgeographicpsychographic data that are purchased from third-party data providers However these data are avail-able only on a subset of devices Thus all partieswho are using these data are competing for the samesometimes small set of devices Since large advertis-ers typically contract with multiple targeting firmswho take different strategies the effect is that theadvertiser is paying the firms to compete against eachother effectively raising the price in the auction andthereby raising their own cost of advertising17 Whatis more these will be the same devices that also willbe targeted for other campaigns because they are thedevices for which the targeters have data

However by connecting devices in a GSN we canexpand campaigns to devices for which there are noretargeting or third-party data available at all If thereis less competition for these devices we should beable to target them for a lower price Thus expand-ing reach has an implication for the cost-effectivenessof achieving a certain level of predictive performance(eg lift)

AcknowledgmentsThe authors thank the anonymous reviewers and edi-tors for very constructive suggestions which substantiallyimproved the paper This work was done while the firstauthor was at Coriolis Ventures and Everyscreen MediaThe authors extend our gratitude to Everyscreen MediaDstillery Lauren Moores Tom OrsquoReilly Vinayak Javaly andTina Eliassi-Rad for many discussions about mobile adver-tising and mobile ad targeting as well as technical assistance

16 Retargeting is to target browsers who have previously purchasedfrom the brand or who have taken some other indicative brandaction such as browsing the brandrsquos site Retargeting is consid-ered by many to be one of the most effective targeting strategies(although to our knowledge these conclusions are drawn based onassessing conversion rate rather than the influence of the advertis-ing please see Stitelman et al 2011)17 The latter is our conjecture

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 265

to obtain all of the relevant data This particular data set wasnot necessarily used in the development of any productionmodel used for mobile advertising The first author thanksNEC and Andre Meyer for faculty fellowships The authorsthank the Moore and Sloan Foundations for their generoussupport of the Moore-Sloan Data Science Environment atNew York University

ReferencesAgarwal R Gupta AK Kraut RE (2008) Editorial overviewmdashThe

interplay between digital and social networks Inform SystemsRes 19(3)243ndash252

Aral S Muchnik L Sundararajan A (2009) Distinguishing influence-based contagion from homophily-driven diffusion in dynamicnetworks Proc National Acad Sci USA 106(51)21544ndash21549

Bampo M Ewing MT Mather DR Stewart D Wallace M (2008)The effects of the social structure of digital networks on viralmarketing performance Inform Systems Res 19(3)273ndash290

Cho E Myers SA Leskovec J (2011) Friendship and mobility Usermovement in location-based social networks Proc 17th ACMSIGKDD Internat Conf Knowledge Discovery and Data MiningKDD rsquo11 (ACM New York) 1082ndash1090

Cortes C Pregibon D Volinsky CT (2001) Communities of interestHoffmann F Hand DJ Adams N Fisher D Guimaraes G edsAdvances in Intelligent Data Analysis Lecture Notes ComputSci Vol 2189 (Springer-Verlag Berlin Heidelberg) 105ndash114

Crandall DJ Backstrom L Cosley D Suri S Huttenlocher DKleinberg J (2010) Inferring social ties from geographic coinci-dences Proc National Acad Sci USA 107(52)22436ndash22441

de Montjoye YA Hidalgo CA Verleysen M Blondel VD (2013)Unique in the crowd The privacy bounds of human mobilitySci Rep 31ndash5

European Commission (2014) Data protection day 2014 Full speedon EU data protection reform Press release January 27httpeuropaeurapidpress-release_MEMO-14-60_enhtm

Fawcett T (2006) An introduction to ROC analysis Pattern Recogni-tion Lett 27(8)861ndash874

Federal Trade Commission (2012) Protecting consumer pri-vacy in an era of rapid change FTC Report March 2012httpswwwftcgovsitesdefaultfilesdocumentsreportsfederal-trade-commission-report-protecting-consumer-privacy-era-rapid-change-recommendations120326privacyreportpdf

Heidemann J Klier M Probst F (2010) Identifying key users inonline social networks A pagerank based approach InformSystems J 4801(December)157ndash160

Hill S Provost F (2003) The myth of the double-blind reviewAuthor identification using only citations SIGKDD Explorations5(2)179ndash184

Hill S Provost F Volinsky C (2006) Network-based marketingIdentifying likely adopters via consumer networks Statist Sci22(2)256ndash276

Hotho A Nuumlrnberger A Paass G (2005) A brief survey of text min-ing LDV Forum 20(1)19ndash62

Kerho S (2012) Mobile marketingmdashA new analytics frameworkwhat we have and what we need Presentation Marketing onthe Move Conference Wharton School University of PennsylvaniaPhiladelphia February 28

Kossinets G Watts DJ (2009) Origins of homophily in an evolvingsocial network Amer J Sociol 115(2)405ndash450

Lee J (2014) 80 of marketers will run cross-channel marketingcampaigns in 2014 ClickZ (March 31) httpwwwclickzcomclickznews233699680-of-marketers-will-run-cross-channel-marketing-campaigns-in-2014-study

Martens D Provost F (2011) Pseudo-social network targeting fromconsumer transaction data Working paper CEDER-11-05 SternSchool of Business New York University New York

McPherson M Smith-Lovin L Cook JM (2001) Birds of afeather Homophily in social networks Annual Rev Sociol 27415ndash444

Oinas-Kukkonen H Lyytinen K Yoo Y (2010) Social networks andinformation systems Ongoing and future research streamsJ Assoc Inform Systems 11(2)61ndash68

Pan W Aharony N Pentland A (2011) Composite social networkfor predicting mobile apps installation Burgard W Roth Deds Proc AAAI 2011 (AAAI Press San Francisco)

Perlich C Dalessandro B Raeder T Stitelman O Provost F (2014)Machine learning for targeted display advertising Transferlearning in action Machine Learn 95(1)103ndash127

Pogue D (2011) Wrapping up the apple location brouhaha NewYork Times (April 28) httppogueblogsnytimescom20110428wrapping-up-the-apple-location-brouhaha

Provost F Fawcett T (2013) Data Science for Business What You Needto Know About Data Mining and Data-analytic Thinking (OrsquoReillyMedia Inc Sebastopol CA)

Provost F Dalessandro B Hook R Zhang X Murray A (2009) Audi-ence selection for on-line brand advertising Privacy-friendlysocial network targeting KDD rsquo09 Proc 15th ACM SIGKDDInternat Conf Knowledge Discovery and Data Mining (ACMNew York) 707ndash716

Pubmatic (2010) Understanding real-time bidding (RTB) from thepublisherrsquos perspective Technical report Pubmatic SiliconValley CA

Quercia D Lathia N Calabrese F Di Lorenzo G Crowcroft J (2010)Recommending social events from mobile phone location dataProc 2010 IEEE Internat Conf Data Mining ICDM rsquo10 (IEEEComputer Society Washington DC) 971ndash976

Raeder T Stitelman O Dalessandro B Perlich C Provost F (2012)Design principles of massive robust prediction systems Proc18th ACM SIGKDD Internat Conf Knowledge Discovery and DataMining (ACM New York) 1357ndash1365

Stitelman O Dalessandro B Perlich C Provost F (2011) Estimatingthe effect of online display advertising on browser conversionData Mining and Audience Intelligence for Advertising ADKDD2011 (ACM New York) 8

Stitelman O Perlich C Dalessandro B Hook R Raeder T Provost F(2013) Using co-visitation networks for detecting large scaleonline display advertising exchange fraud Proc 19th ACMSIGKDD Internat Conf Knowledge Discovery and Data Mining(ACM New York) 1240ndash1248

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Page 10: Finding Similar Mobile Consumers with a Privacy-Friendly ...pages.stern.nyu.edu/~fprovost/Papers/PMM_ISR_2015.pdf · Provost, Martens, and Murray: Finding Similar Mobile Consumers

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 251

The first similarity metric scount counts the numberof shared locations Since the two example users shareonly IP1 this strength is 1 The IDF weighted versionscount1W sums the IDF values of the shared locationswhich is 13 for our simple example The most basicmetric is given by sbool which is 1 when a location isshared and 0 otherwise indicating whether two usersare neighbors or not

The sfreq1min sfreq1max and sfreq1avg metrics compare thefrequencies of the visits to the shared locations Theytake the minimum maximum and average respec-tively of the frequencies for each shared location andsum them The weighted versions sfreq1min1W sfreq1max1W and sfreq1avg1W do the same but weight each frequencywith the corresponding IDF value

The scosine metric measures the similarity by takingthe cosine of the angle between the two vectors TheJaccard metric is defined as the size of the intersec-tion over the size of the union In this case it countsthe number of shared locations but normalizes theseover the total number of locations both users havevisited Users that visit many locations will have ahigher chance to share some location with anotheruser so the Jaccard metric penalizes them For ourbasic example sJaccard is 1 (IP1) over 3 (IP1 IP2 IP3)Once more variants taking into account the frequencyand the IDF are defined as well

Ex1 = 63 2 071

Ex2 = 62 0 171 (6)

Ew = 6103 107 270

5 Experimental Setup51 Data SetThe data set D for the empirical results is 10 days ofanonymized advertising bid requests from mobile de-vices (smartphones tablets etc) observed across sevenRTB exchanges A total of 322770794 bid requestsare observed coming from 42437559 unique userinstances Over these 10 days we observe 41829088unique IP addresses

The number of unique user instances per RTBexchange is given in Table 3 The distributions of thenumber of bid requests per device type and operatingsystem are given in Tables 4 and 5 which show thatthe majority of bid requests come from smartphonesand the iOS operating system

A user (instance) corresponds to an exchange-assigned user ID One person on several deviceswill hence correspond to several users One deviceon several exchanges will often correspond to sev-eral users Also it might be that one ad exchangesometimes provides different user IDs to the same

Table 3 User Instances per RTB Exchange

Exchange Volume Percentage ()

RTB 1 311751979 6RTB 2 211591686 4RTB 3 1415661098 29RTB 4 2015011243 41RTB 5 212751015 5RTB 6 510551815 10RTB 7 212231418 4

Table 4 Distribution of Bid Requests per Device Type

Type Volume Percentage ()

Phone 29213581772 91Tablet 3010161334 9

Table 5 Distribution of Bid Requests per OperatingSystem

Type Volume Percentage ()

Android 12913151617 4001iOS 19219231648 5908Windows 5311529 002

device for example when the device is using differ-ent particular apps As described above an impor-tant task in online advertising is to be able to targetadvertisements to different instances of the same indi-vidual with reasonable probability regardless of thecomplexities of identification in the baroque onlineadvertising ecosystem Figure 5 visualizes a sample oflatitudes and longitudes broadcast by mobile devicesin our data set

52 SamplingFor the results in this paper we apply the GSN anal-ysis to data samples Each sample corresponds to acohortneighborhood and is built as follows A sin-gle (random) ldquoegordquo user is chosen user X aroundwhich a neighborhood is built Every neighbor of theego user must share at least one IP address so webegin by obtaining all (anonymized) IP addresses vis-ited by user X Next all other users that visited atleast one of these IP addresses (the neighbors) areadded The ego must have at least one neighbor to beincluded in this analysis Finally we obtain the com-plete location distribution of each of these neighborsWhen the next sample is chosen we ensure that noneof the users already included in the previous samplesis taken as the ego user (X) As such we ensure thatwe do not have repeated egondashneighbor dyads Theremay be shared neighbors of two different egos buttwo egos will never be immediate neighbors in thedata sample

The result of the sampling over 500 cohorts is summa-rized in Figure 6 on average an ego has 71 neighbors

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers252 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

Figure 5 Scatter Plots of Samples of Latitudes and Longitudes Broadcast by Mobile Devices

ndash130 ndash120 ndash110 ndash100 ndash90 ndash80 ndash70 ndash6020

25

30

35

40

45

50

55(a)

Long

Lat

ndash20 ndash10 0 10 20 30 40 50

Long

30

35

40

45

50

55

65

60

75

70

Lat

ndash200 ndash150 ndash100 ndash50 1000 50 150 200

Long

ndash100

ndash80

ndash60

ndash40

ndash20

0

40

20

100

60

80

Lat

4068ndash7406 ndash7404 ndash7402 ndash7400 ndash7394ndash7398 ndash7396 ndash7392 ndash7390

Long

4070

4072

4074

4076

4080

4078

4086

4082

4084

Lat

(b)

(c) (d)

Note Note that locations are plotted without reference to any map The figure gives a striking picture of population density across the world and makes uswonder what is going on with mobile devices in Antarctica

Figure 6 (Color online) The Distribution of the Number of Neighborsfor 500 Randomly Sampled Ego Users

0 500 1000 1500 2000 2500 3000 35000

50

100

150

200

250

300

350

400

450

No neighbors

Fre

quen

cy

the median being 1 The maximum number of neigh-bors is 3167 this user probably visited a publicWi-Fi that thousands of other users also logged intoWe chose 500 cohorts as we observe that the statis-

tics become quite stable as seen in Figure 7 Even200 cohorts seems to be enough to obtain stablesample statistics

The life span of user instances in this set of500 cohorts is shown by the cumulative frequency inFigure 8 Observe that only about 40 of the userinstances are seen at different hours About 30 ofthe user instances are seen only once in the sampledweek This illustrates that it is vital to understand thenotion of user instances These percentages are mis-leading if we interpret the user instances as individu-als or devices One individualdevice without stableidentification in the online advertising ecosystem willcreate many transient user instances which will driveup the percentage of user instances that are seen onlyonce or that are seen only within a short timeframeWe have not figured out how to determine or esti-mate reliably how many ldquounstablerdquo usersdevices arerepresented by the user instances in these data Wekeep all of the data for the analyses below (rather thanfiltering very short-lived user instances) to maintainrealism

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 253

Figure 7 Evolution of Several Sample Statistics as an Increasing Number of Cohorts are Created

0 100 200 300 400 5000

10

20

30

40

50

60

70

Average No network neighbors

0 100 200 300 400 5000

2

4

6

8

10

12

Median No network neighbors

0 100 200 300 400 5000

50

100

150

200

250

300

350

Average No IPs

0 100 200 300 400 5000

2

4

6

8Median No IPs

Notes The horizontal axis denotes the number of cohorts By 500 cohorts the statistics have become stable

Figure 8 (Color online) Cumulative Frequency of Life Span ofUser Instances

100 101 102 103 104 105 1060

01

02

03

04

05

06

07

08

09

10

x = lifespan (sec)

F(x

)

End of June (mobile only)

larr 1 min larr 1 hour larr 1 day

Cumulative frequency of user instances

6 Study 1 Connecting Instances ofthe Same Individual

Returning to our motivating application as men-tioned previously an important task in mobile adver-tising is to be able to target a particular individual inthe face of the individualrsquos fragmentation into differ-ent user instances (for example as seen through dif-ferent bidding systems) Advertisers understand that

it may be impossible to find the same user with 100accuracy Nonetheless it often is important for themto reach the userrsquos various instances without target-ing too many others9 The GSN is designed in partbased on the hypothesis that different instances of thesame person would visit similar locations and there-fore they will be connected in the GSN In this sectionwe present results examining whether and to whatextent that is indeed the case Furthermore the resultsquantify to what extent the same user receives a highGSN-similarity to herself as compared to her otherneighbors

Three experimental settings are used to assess judg-ments that two users are the same The first two sim-ulate different user instances by splitting up the bidrequests of known user instances The last one is notsimulated it links users via the recently introducedidentifier for advertising (IFA) observed over severalRTB exchanges for a subset of device instances

1 Random We divide the IP address visits of a userrandomly across two simulated users Table 6 showsfor example how we can divide a userrsquos IP visits across

9 The astute reader may notice that if the other users targeted arevery similar to the target user the advertising may be well placeddespite not being shown to the exact target user That is not thesubject of this study Study 2 shows evidence that these two notionsare not independent when using the GSN

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers254 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

Table 6 Randomly Splitting User X rsquos IP VisitsCreating Two New Artificial Users X1 and X2

IP1 IP2 IP3

X 1 5 0X1 0 3 0X2 1 2 0

Table 7 The Volume of IFA Occurrence Within OurData Set

Volume

RTB 1 119401855RTB 2 113831962RTB 3 7891292

Table 8 The Frequency of Unique IFAs Visible AcrossRTB Exchanges

Frequency

RTB 1RTB 2 3061594 (1016)RTB 2RTB 3 451256 (169)RTB 3RTB 1 181417 (085)

two new users X1 and X2 for which we know thatthey are actually the same individual Results withRandom should be used only as a ceiling on perfor-mance because the results will be overly optimistic

2 Temporal A more realistic split-up of the trans-actions is a temporal split-up where first the middleday of Xrsquos transactions is determined (hence dynam-ically chosen for each user) All IP address visits upto this middle day are assigned to X1 and all trans-actions after the middle day are assigned to X2 Itwill only be applied to users who are seen on multi-ple days

3 Identifier for advertising IFA is an advertisingidentifier from Apple that users can change or asknot to be used for advertising and is seen as aprivacy-friendlier version of the controversial fixeddevice ID (the ldquoUDIDrdquo) The IFA is observed in threeexchanges with volumes shown in Table 7 for exam-ple 1940855 unique IFAs are seen in RTB 110 Whenwe observe the same IFA across different exchangeswe now know that two devices in different exchangesactually are the same The frequency of unique IFAsvisible across exchanges is shown in Table 8 For exam-ple a total of 3018223 IFAs are seen when we look atboth RTB 1 and RTB 2 with 306594 of these (1016)seen on both exchange 1 and exchange 2 What ifthere were not an IFA and the device would havehad a different identifier in each exchange To what

10 For confidentiality RTB 1 in this setting is not necessarily thesame as RTB 1 in sectsect4 or 7

extent are these two device instances then connectedin the GSN

With these different ldquogold standardrdquo methods toindicate whether or not two users are the samewe assess the ability of the geosimilarity metricsto link the same user determining how often twoinstances of the same person are connected in theGSN (sect61) Afterward we shall assess the differentGSN strength metricsrsquo ability to rank users determin-ing to what extent two instances of the same personexhibit strong geosimilarity as compared to the otherneighbors (sect62) For the random and temporal split-ups 200 samples are used For the IFA-based methodthe complete population is used for measuring thedegree of connectedness to the same user to analyzethe relative strength of the connected users 200 sam-ples are used

61 Same Individual Different ldquoScreensrdquoWe now present results on the GSN connectednessmeasured as how often two instances of the same userare connected in the GSN (in percentage) Note thatthe results summarized in Table 9 should be inter-preted in light of the limited number of days in build-ing the research data set which will limit the GSNconnectivity Nonetheless the relatively high connec-tivities observed even with this sample provide quitepromising results which lends considerable supportto the merit of the overall design

Random0 Using the random split two users thatcorrespond to the same individual are not connectedin only 9 of the neighborhoods Moreover in thesecases user X (whose IP visits are split up) usuallyhas only one or two entries and the random split-upresults in no visits for one of the two simulated users

Temporal0 The temporal split-up is more realisticthan the random one where we assume that a personuses simulated device 1 in the first half of the week(eg during the weekend) and simulated device 2 inthe second half of the week (eg during the week)

Users X1 and X2 are not connected in 32 of theneighborhoods Whereas the random split-up was toooptimistic the temporal one is too pessimistic as inreality there likely is some overlap in the time peri-ods of use of two different devices If user X visitsan IP address only in the first days of the week this

Table 9 To What Extent is the Same User Connectedto Herself

connected

Random 91Temporal 68IFAmdashRTB 1RTB 2 67IFAmdashRTB 2RTB 3 82IFAmdashRTB 3RTB 1 73

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 255

IP address will not occur in user X2rsquos transactions andreduces the chance to be connected to X1

IFA0 We look at all combinations of the three ex-changes with IFA identifiers For example there are45256 unique IFAs seen on both RTB 2 and RTB 3 In82 of these the two IFAs visit at least one same IP onboth exchanges and hence are connected in our GSN

The bottom line is that even with this limited slice ofonline behavior different instances of the same indi-vidual are connected in the GSN 70ndash80 of the timeEncouragingly the results from the simulated scenar-ios concur with the results from the real (IFA) sce-narios (including the aforementioned optimism andslight pessimism of the two simulated scenarios)Consistency over these different settings gives addi-tional confidence in the results suggesting that theGSN indeed holds promise for a high degree of suc-cess at targeting the same user across the fragmenta-tion of the online advertising ecosystem

62 Ranking User InstancesNext given that two user instances of the same indi-vidual are indeed connected we assess to what extentthe second instance of the same user is ranked highlyamong its network neighbors based on the weightof their linkage in the GSN We first will discuss themost general results using the best ranking methodand then will look across different ranking methods

We decompose the results based on the numberof network neighbors (NNs) since there are twoboundary cases that require special interpretationTables 10ndash14 show the proportions of the ego net-works that fall into three scenarios First a user mayhave only one NN In this scenario the ranking isperfect (trivially)mdashthe other instance of the same useris the only user instance given a nonzero score As

Table 10 In What Portion of Cases Is the Same UserConnected Depending on the Number ofNetwork NeighborsmdashRandom Split-up

Ego has only 1 NN 3104Ego has exactly 2 NNs 1905

Same user ranked first 8005Ego has more than 2 NNs 4901

Note For the scenario with two NNs the table reports thepercentage of cases where the other known instance of thesame user is ranked first (including ties)

Table 11 In What Portion of Cases Is the Same UserConnected Depending on the Number ofNetwork NeighborsmdashTemporal Split-up

Ego has only 1 NN 3008Ego has exactly 2 NNs 1807

Same user ranked first 8309Ego has more than 2 NNs 5005

Table 12 In What Portion of Cases Is the Same UserConnected Depending on the Number ofNetwork NeighborsmdashIFA RTB 1RTB 2Split-up

Ego has only 1 NN 4401Ego has exactly 2 NNs 1201

Same user ranked first 8201Ego has more than 2 NNs 4400

Table 13 In What Portion of Cases Is the Same UserConnected Depending on the Number ofNetwork NeighborsmdashIFA RTB 2RTB 3Split-up

Ego has only 1 NN 6109Ego has exactly 2 NNs 1907

Same user ranked first 7102Ego has more than 2 NNs 1804

Table 14 In What Portion of Cases Is the Same UserConnected Depending on the Number ofNetwork NeighborsmdashIFA RTB 3RTB 1Split-up

Ego has only 1 NN 5305Ego has exactly 2 NNs 2501

Same user ranked first 6304Ego has more than 2 NNs 2104

shown in the tables this scenario accounts for 30ndash60 of the ego networks (depending on the split-upmethod)

The second boundary case is when the ego has ex-actly two NNs This scenario accounts for 10ndash25of the cases depending on the split-up method Herethe tables also report whether the other known userinstance is ranked first (including ties) among thetwo neighbors The other known instance of the sameuser is ranked first 60ndash80 of the time dependingon the split-up scenario The IFA results are affectedstrongly by the particular RTB settingmdashthis may indi-cate that the publishers on the different RTBs have dif-ferent policies for passing the IFA Importantly theseresults are conservativemdashpossibly very conservativeThe other NNs may also be instances of the same userunbeknownst to usmdashfor example in the IFA casebecause an app does not pass the IFA to the RTB orbecause the other user is the same user on a differentdevice and thus does not share the IFA

The final scenario is when more than two NNsare present As shown in the tables this accounts for20ndash50 of the cases depending on the split-up sce-nario In this case we measure the percentile in theranking where the instance of the known same userfalls (see below) as well as the area under the receiveroperating characteristic curve (AUC) (Fawcett 2006)which is equivalent to the MannndashWhitneyndashWilcoxon(MWW) statistic The AUCMWW measures to what

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers256 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

Table 15 When There Are More Than Two Network Neighbors to What Extent Is the Known Same User StronglyConnected to Itself

AUC (if gt 2 NNs) () Average rank percentile (if gt 2 NNs) ()

Ranking metric R T RTB 12 RTB 23 RTB 31 R T RTB 12 RTB 23 RTB 31

1 78 64 68 59 55 29 48 34 43 542 59 58 67 58 54 45 53 34 44 553 79 69 74 61 61 29 44 29 40 484 86 75 69 61 59 24 40 32 41 505 83 70 71 59 63 26 44 31 43 476 85 74 71 60 65 25 41 31 42 457 86 76 73 62 64 24 40 29 40 468 83 70 72 59 65 26 44 30 43 459 85 74 71 60 67 25 41 30 42 44

10 79 69 69 61 63 29 44 32 41 4711 79 69 71 62 64 44 44 30 41 4612 81 68 68 60 62 28 46 34 42 4913 81 70 71 61 64 27 44 30 40 4614 74 67 66 61 48 34 47 35 41 6015 74 67 66 61 48 34 47 35 41 60

Note See the text for a discussion of ranking metrics

extent data points with label 1 are ranked higherthan those with label 0 In the results that follow(see Table 15) the AUC is measured for each samplewhere any user instance among the neighbors thatcorresponds to the same individual is labeled as 1and all others are labeled as 0 Thus in this settingthe AUC measures how well the neighbors are rankedby the likelihood of being the same individual11 Theaverage over all samples is reported If for a sampleall neighbors have the same score that AUC is setto 05 In addition to the factors noted above this alsowill tend to make the results conservative For exam-ple if the GSN links an ego user to a set of neighborsthat are all instances of the same individual in thisevaluation the AUC will be 05

The ranking measure is the percentile in the rank-ing at which the known-same-user instance is rankedwith lower values being better For example if thesame-user instance has the highest score (closest con-nection) among five neighbors the rank is 20 (15)Note again that this is quite conservative in thisexample the user is ranked at the top of the list butbecause there were only five neighbors the highestpossible score is 20 In case the known-same-userinstance has the same score as another neighbor theaverage rank is reported

As shown in Table 15 all AUC values exceed 05ranging up to 08 This means that in every casea known-same-user instance is likely to be rankedhigher than an instance not known to be the same

11 Technically in this setting the AUCMWW is the probability thata randomly selected neighbor that is in fact a known instance of thesame user is ranked more highly than a randomly selected neighborthat is not known to be an instance of the same user If the known-same-user instances are all ranked above the other-user instancesthe AUC = 100 If they are all below then the AUC = 0

user The frequency-based ranking metrics (4ndash9) per-form quite well in connecting instances of the sameuser strongly

Almost all metrics perform better than a randommodel with performances that even come close to thebest possible solution Consistently performing quitewell are the frequency-based metrics (metrics 3ndash9)with metric 7 (the minimum of the IDF weights ofthe shared locations) getting the most wins Not per-forming well is the very basic count metric as well asthe Boolean metric that provides a binary score onlyInteresting to note the Jaccard metrics do not performwell either It seems that penalizing users that visitmany IP addresses negatively affects the results anissue we will return to in the next study

7 Study 2 Does the GSN Select Userswith Similar Interests

The foregoing section addressed our paperrsquos firstclaimmdashthat the GSN design indeed links instances ofthe same user and links them more strongly thaninstances of other users In this section we turn to thepaperrsquos second claimmdashthat the GSN links users withsimilar interests We will measure similar interests bysimilarity in behavior accessing particular publisherson the Web and using mobile apps12

71 Connecting Users with Similar Interests

To What Extent Are the GSN Neighbors of Visi-tors to Particular Publishers Also Visitors to ThoseSame Publishers To assess whether the GSN con-nects users with similar interestsbehavior we select

12 From now on all user instances will be complete real user in-stances the simulated split-up scenarios (Random Temporal) werecreated specifically for the purpose of the same-user study

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 257

Table 16 Distribution of Publishers per RTB in Terms of thePercentage of Users in Our Data Set That Visit the GivenPublisher

Publisher of D

RTB 1Burstly 3004myYearbookcom 1062My Fitness Pal 0065TextMe Inc 0056GameResort 0025Flixster 0019247Sports 0015Daily Workout Apps LLC 0009Mobile Deluxe 0007YoYo Games Ltd 0004

RTB 2mworldnowcom 0032mtmzcom 0024mtopixcom 0014babycentercom 0015meetmecom 0006appsfacebookcom 0006itunesapplecom 0039appevitecom 0002beautyandskincarescom 0003mmajunkiecom 0001

RTB 3Conversion Exchange 5053Pinger Inc 2096Top Game Developer 4029PremiumEntertainmentApp-FamilyOriented-Android 0059Enflick 0053Clapfoot Games 0044FunPokes Inc 0018Talkatone 0015Sevenlogics Inc 0008A Star Software 0007

RTB 4Pinger PhonemdashiOSmdashConversation 1001Talkatone iPhone App 0007DiceWithBuddies 00139GAG Reader 0012Video Downloader Pro Lite 0016Rage Comics 0009Spades 3D Lite 0006Apalabrados 0003Relax Melodies HD 0002Celeb MemdashPhotoMaker 0000

a diverse collection of 40 publishers listed in Table 16A publisher can correspond to a mobile app a web-site or a set of appswebsites We selected both pub-lishers that are visited by many users and more nichepublishers with fewer visitors As can be seen fromTable 16 several types of publishers can be distin-guished the main groups are games (eg GameRe-sort Top Game Developer) and social networkingtools (eg appsfacebookcom Text Me Inc)

More specifically we create 200 samples for 40 pub-lishers 10 publishers for each of four different RTB

exchanges listed in Table 1613 The percentage ofunique users that visit each publisher in our completedata set is also shown (D) To create each samplean ldquoegordquo user is selected that has visited the pub-lisher in question (an ldquoego visitorrdquo) this ego visi-torrsquos GSN neighborhood is created Next we assesshow many of the egorsquos neighbors also visited thatsame publisher This will be compared with a base-line visit rate to produce lift and leverage values(Provost and Fawcett 2013)mdashspecifically how muchmore likely is it for the neighbors of ego visitors tovisit the publisher than would be expected by chanceWe consider three different setups that produce dif-ferent baselines

1 Considering all RTB exchanges to determine the base-line A user is considered a visitor for a given pub-lisher if she visits the publisher on any RTB ex-change (since some publishers are visible on differentexchanges) Hence the baseline visit rate is the num-ber of users (on any RTB) who visit that publisherdivided by the total number of users across all RTBexchanges This will provide optimistic lifts sincemost publishers appear only on one RTB exchangewhereas the baseline is measured over all exchanges

2 Considering one RTB exchange only to determine thebaseline A user is considered a visitor for a given pub-lisher and RTB if she visits the publisher on the sameRTB exchange This changes the baseline to the num-ber of users (on this RTB) who visit that publisherdivided by the total number of users on this RTBexchange The lift and leverage results will be worsethan in the first setting because the baseline will belarger Seeing that some publishers are seen on differ-ent RTB exchanges this setting is slightly pessimistic

3 Taking time of location visits into account In thethird setting we link users only if they visit an IPwithin the same time period This will be elaboratedon in sect73

Results are shown in Tables 17 and 18 The percent-age of publisher visitors in the neighborhood is givenas N This percentage is compared with the baselinepercentage of users that visit that publisher in thecomplete data D using two metrics the lift is the ratioof these numbers the leverage is the difference Liftis particularly useful for very small base rates (con-sider that an activity with a base rate of 30 cannotpossibly have a lift higher than about 3) Leverage ismore telling for larger base rates because it shows theabsolute improvement (here in percentage points)

When considering all NNs the lifts for all publish-ers are greater than 1 except for one Burstly on RTB 1has a lift of 097 (essentially no lift) but only when weconsider RTB 1 as the baseline This publisher already

13 For confidentiality the RTB numbering here is again differentfrom above

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers258 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

Table 17 RTBs 1 and 2 Percentage of Network Neighbors That Also Visit the Publisher

RTB 1

All NNs N1 N10

D N Lift Lev N1 Lift Lev N10 Lift Lev

Considering all RTB exchanges to determine baseline1 Burstly 3004 8063 2084 5059 47022 15054 44018 25061 8043 220572 myYearbookcom 1062 25077 15094 24015 44074 27067 43012 35016 21074 330543 My Fitness Pal 0065 4015 6038 305 3609 56077 36025 15055 23091 14094 TextMe Inc 0056 1805 33006 17094 39002 69075 38046 25018 45 240625 GameResort 0025 4093 19077 4068 18052 74026 18027 11086 47058 110626 Flixster 0019 1055 8015 1036 8014 42085 7095 1079 9042 1067 247Sports 0015 27057 188058 27042 77062 530093 77048 74053 509075 740388 Daily Workout Apps LLC 0009 1092 21047 1083 19023 215046 19014 11054 129027 110459 Mobile Deluxe 0007 1087 28067 108 23053 36009 23046 9063 14707 9056

10 YoYo Games Ltd 0004 3052 99058 3049 21057 609064 21053 13 367045 12096Considering this RTB exchange only to determine baseline

1 Burstly 8085 8063 0097 minus0022 47022 5033 38037 25061 2089 160762 myYearbookcom 4071 25077 5047 21006 44074 905 40003 35016 7046 300453 My Fitness Pal 1089 4015 2019 2025 3609 19048 35001 15055 8021 130654 TextMe Inc 1063 1805 11035 16087 39002 23094 37039 25018 15045 230555 GameResort 0073 4093 6078 402 18052 25049 17079 11086 16033 110146 Flixster 0055 1055 208 0099 8014 14071 7059 1079 3023 10247 247Sports 0043 27057 64073 27014 77062 182023 7702 74053 174097 74018 Daily Workout Apps LLC 0026 1092 7037 1066 19023 73095 18097 11054 44037 110289 Mobile Deluxe 0019 1087 9084 1068 23053 123088 23034 9063 5007 9044

10 YoYo Games Ltd 001 3052 34018 3042 21057 209025 21047 13 126012 1209

RTB 2

Considering all RTB exchanges to determine baseline1 mworldnowcom 0032 13027 40086 12095 54039 167042 54006 31065 97044 310332 mtmzcom 0024 6089 28086 6065 23008 9607 22084 10068 44075 100443 mtopixcom 0014 15007 104079 14093 53045 37106 5303 33075 234065 330614 babycentercom 0015 6087 47031 6072 43048 299056 43033 17074 122024 17065 meetmecom 0006 6002 10308 5097 54072 942093 54066 27085 47909 270796 appsfacebookcom 0006 301 48039 3003 9009 141097 9003 2076 43008 20697 itunesapplecom 0039 25077 65071 25038 51043 131012 51004 29049 75018 290098 appevitecom 0002 48028 21212043 48025 54076 21509068 54074 48028 21212043 480259 beautyandskincarescom 0003 18019 642087 18016 8048 299094 8046 1107 413049 11067

10 mmajunkiecom 0001 1036 201064 1035 26067 3195505 26066 13033 11977074 13033Considering this RTB exchange only to determine baseline

1 mworldnowcom 4034 13027 3006 8093 54039 12053 50005 31065 7029 270312 mtmzcom 3019 6089 2016 307 23008 7024 19089 10068 3035 70493 mtopixcom 1092 15007 7084 13015 53045 27081 51053 33075 17056 310834 babycentercom 1094 6087 3054 4093 43048 22042 41054 17074 9015 15085 meetmecom 0078 6002 7077 5025 54072 70057 53094 27085 35092 270076 appsfacebookcom 0086 301 3062 2024 9009 10063 8024 2076 3022 1097 itunesapplecom 5024 25077 4092 20053 51043 9081 46019 29049 5063 240258 appevitecom 0029 48028 165058 47098 54076 187082 54047 48028 165058 470989 beautyandskincarescom 0038 18019 48011 17081 8048 22045 8011 1107 30095 11032

10 mmajunkiecom 0009 1036 15009 1027 26067 296003 26058 13033 148001 13024

Notes Results are given for the complete data set (D) the complete neighborhood of a user who visited the given publisher (All NNs) and considering theneighborhood of closest 1 and 10 users (N1 and N10)

has a very high baseline and as we will see in thenext section when we consider only the top 1 and 10NNs the lifts go up to 5 and 3 respectively Burstlyprovides a generic platform for building apps and sois a weaker indication of user interest than many ofthe other publishers (fitness babies games)

All other lifts exceed one by substantial mar-gins and go up to even 2391 (Apalabrados onRTB 4 a Spanish-language social Scrabble-like game)

Obviously the conclusion that the NNs indeed aremore likely to visit the same publishers is highly sig-nificant by a simple sign test comparing the visita-tion rate within the geosimilarity neighborhood tothat of overall population either via lift or leveragep lt 0001 Thus NNs of publisher visitors are indeedmore likely also to visit the same websites the abso-lute values of the lifts and leverages suggest that theyare substantially more likely

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 259

Table 18 RTBs 3 and 4 Percentage of Network Neighbors That Also Visit the Publisher

RTB 3

All NNs N1 N10

D N Lift Lev N1 Lift Lev N10 Lift Lev

Considering all RTB exchanges to determine baseline1 Conversion Exchange 5053 34054 6024 29001 64071 11069 59017 50077 9018 450242 Pinger Inc 2096 46002 15053 43006 65012 21097 62015 51048 17037 480523 Top Game Developer 4029 47024 11002 42095 60 13099 55071 48097 11042 440684 PremiumEntertainmentApp-FamilyOriented-Android 0059 23079 40012 2302 20045 34049 19086 22025 37052 210665 Enflick 0053 11071 21095 11017 21057 40044 21004 12027 23 110736 Clapfoot Games 0044 8055 19032 8011 11054 26006 1101 13099 31061 130557 FunPokes Inc 0018 11011 61028 10093 50 275077 49082 15079 87008 150618 Talkatone 0015 13013 88051 12098 31011 209076 30096 17091 120078 170779 Sevenlogics Inc 0008 2071 32059 2063 5088 70074 508 8033 100022 8025

10 A Star Software 0007 3033 44086 3025 23008 311024 23 9009 122061 9002Considering this RTB exchange only to determine baseline

1 Conversion Exchange 11045 34054 3002 23009 64071 5065 53025 50077 4043 390322 Pinger Inc 6013 46002 705 39089 65012 10062 58098 51048 8039 450353 Top Game Developer 8088 47024 5032 38036 60 6076 51012 48097 5052 400094 PremiumEntertainmentApp-FamilyOriented-Android 1023 23079 19038 22057 20045 16066 19023 22025 18013 210025 Enflick 101 11071 1006 1006 21057 19054 20046 12027 11011 110166 Clapfoot Games 0092 8055 9033 7064 11054 12059 10062 13099 15027 130087 FunPokes Inc 0038 11011 2906 10074 50 133022 49062 15079 42007 150418 Talkatone 0031 13013 42076 12082 31011 101034 3008 17091 58035 170619 Sevenlogics Inc 0017 2071 15074 2054 5088 34017 5071 8033 48041 8016

10 A Star Software 0015 3033 21067 3017 23008 150036 22092 9009 59023 8094

RTB 4

Considering all RTB exchanges to determine baseline1 Pinger PhonemdashiOSmdashConversation 1001 71081 70085 7008 80077 79068 79076 70014 6902 690132 Talkatone iPhone App 0007 11038 17203 11031 29003 439068 28097 17002 257078 160963 DiceWithBuddies 0013 7041 59004 7028 18018 144092 18006 7069 61031 70574 9GAG Reader 0012 62013 497099 62 49009 39305 48097 49005 393017 480925 Video Downloader Pro Lite 0016 20003 123015 19087 22058 138084 22042 17093 110027 170776 Rage Comics 0009 609 8001 6081 13033 154086 13025 609 8001 60817 Spades 3D Lite 0006 1205 225076 12044 27027 492057 27022 1205 225076 120448 Apalabrados 0003 77042 21391071 77039 75 21316097 74097 79032 21450043 790299 Relax Melodies HD 0002 31058 21002029 31056 50 31170029 49098 31058 21002029 31056

10 Celeb MemdashPhotoMaker 0 3023 82107 3022 7014 11819048 7014 3023 82107 3022Considering this RTB exchange only to determine baseline

1 Pinger PhonemdashiOSmdashConversation 8051 71081 8044 6303 80077 9049 72026 70014 8024 610632 Talkatone iPhone App 0055 11038 20053 10082 29003 52038 28048 17002 30071 160473 DiceWithBuddies 0009 7041 82011 7032 18018 201054 18009 7069 85027 7064 9GAG Reader 0097 62013 63097 61016 49009 50055 48012 49005 50051 480085 Video Downloader Pro Lite 0044 20003 45036 19059 22058 51014 22014 17093 40062 170496 Rage Comics 0037 609 18079 6053 13033 36032 12097 609 18079 60537 Spades 3D Lite 0046 1205 2609 12004 27027 58068 26081 1205 2609 120048 Apalabrados 0021 77042 371093 77021 75 360031 74079 79032 381006 790119 Relax Melodies HD 0013 31058 238054 31045 50 377069 49087 31058 238054 31045

10 Celeb MemdashPhotoMaker 0003 3023 97089 3019 7014 216076 7011 3023 97089 3019

Notes Results are given for the complete data set (D) the complete neighborhood of a user who visited the given publisher (All NNs) and considering theneighborhood of closest 1 and 10 users (N1 and N10)

The very high lifts could be attributed to two phe-nomena They may provide some evidence that theGSN embeds a true social network as pointed out bythe good results for the social networking publishers(eg Apalabrados with a lift of 2391 appsevitecomwith a lift of 2212 and others) Friends visiting thesame social networking sites (here unusual ones)

would increase the lift and leverage over just havingdifferent instances of the same user visiting the samewebsites An alternative explanation is that some ofthese publishers are much more likely to be visitedacross different devices of the same user It is not clearwhether this is actually true However it argues forobtaining specific and broad data on which devices

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers260 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

are not the same user to shed further light on forexample any broad data that include a (anonymized)user identifier

From these results we can conclude that GSNneighbors indeed exhibit substantially similar pub-lisher visitation behavior Next we assess to whatextent more strongly connected NNs are more similarthan less strongly connected NNs

72 Ranking Users with Similar Interests

Are the More-Similar Geosimilarity Neighbors of aPublisherrsquos Visitors Even More Likely Also to Be Vis-itors to the Same Publisher To assess whether more-similar geosimilarity neighbors are even more likelyto share interests we compute the measures exactlyas in the previous study except instead of consider-ing all NNs only the 1 (N1) and 10 (N10) neighborswith the highest scores are considered (see Tables 17and 18) First we must consider which of the 15 met-rics that are defined to measure link strength to useTable 19 reports the rank aggregations of the differentmetrics as to which provide the best lifts Specificallyfor each of the 40 publishers each scoring metric pro-vides a lift and for that publisher the scoring metricscan be ranked (the best being ranked 1 etc) The rankaggregation is the average rank for a scoring metricacross all 40 publishers Thus if one metric providedthe best lift for all publishers it would get an aggre-gated rank of 1

To visualize the best metrics the metrics with anaverage ranking less than 8 are shown in boldfacein Table 19 The Jaccard metrics do not perform wellin this study either showing that penalizing the con-nection to users that visit many locations is not sen-sible Remember that the Jaccard metric is based onthe idea that users that visit many locations will havea higher chance to share some location with anotheruser However we have so many locations in totalthat visiting a couple more locations will increase onlymarginally this probability of sharing a location bychance A user that logs into more IP addresses andis more active should hence not be penalized andranked lower than other less active users Metric 3which sums the IDF scores of the shared locationsperforms quite well Given the operational efficiencyadvantage over the frequency-based metrics this met-ric is chosen to measure the strength of the links

As seen from the results in Tables 17 and 18 thelift and leverage values for the strongly connected (top 1and top 10) NNs are even higher sometimes astro-nomical (3170 for Relax Melodies HD (RTB 4)) Theseresults are highly significant by sign tests compar-ing either the lift or the leverage between the top-ranked neighbors and the entire cohort (p lt 0001) Theabsolute values again indicate that the users with the

strongest geosimilarity are often substantially morelikely to visit the same publisher

Thus we can conclude that not only does geosim-ilarity find users with similar interests it also ranksusers well by their likelihood of having similarinterests

An interesting follow-up question is whether thecharacteristics of a publisher are of importance for theresults For example are GSN neighbors more sim-ilar in terms of using the same social network appas compared to playing the same games To answerthis question the publishers are categorized into sixclasses funny social news games communicationand miscellaneous In Figure 9 the lifts of the pub-lishers are shown per category Some general trendscan be observed where publishers related to funnycontent have the highest median lift This could beexplained by the fact that users often forward funnycontent to their friends The social websitesapps fol-low next which seems to further demonstrate thatfriends are likely linked in the GSN However pub-lishers in the communication category perform not sowell compared with the other categories and ratherhigh variances are observed in the different cate-gories Please note that high variance is observedacross the categories therefore it is difficult to makewell-supported claims from these results

73 TimingAs a final evaluation we explore whether the inclu-sion of temporal similarity in location visitation ishelpful for defining GSN connections To this end weintroduce a setting where two users are connectedonly if they visit an IP address in the same timeperiod of a day Specifically we divide a day intodifferent time windowsmdashfor example two time peri-ods of 12 hoursmdashwhere users are connected only ifthey visit the same IP address in the same time periodof a day (which might be on different days) Practi-cally two digits are added to an IP which denote thestarting moment of the corresponding time periodWe report on time periods of 2 hours and of 12 hours(other periods yield similar results) We also includethe setting where users are connected only when theyvisit the same IP address on the same day in light ofthe result of Crandall et al (2010) that visiting thesame location at around the same time is indicativeof friendship

The resulting lifts are reported in Figures 10ndash12when considering respectively all NNs the top 1 NNand the top 10 NNs The results when not consider-ing the time period are repeated as well (All) lim-ited to the conservative case of only considering oneexchange (see above) The baseline (lift of one) isindicated with a red horizontal line lifts higher than1000 are shown as 1000 to limit the range to be able

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 261

Table 19 Rank Aggregation per Metric Over All Publishers (Across All RTBs)

Top 1 NN Top 10 NNs

1 Shared unique IPs 801 1 Shared Unique IPs 802252 BooleanmdashSharing an IP 80625 2 BooleanmdashSharing an IP 809133 Shared unique IPsmdashIDF weighted 70025 3 Shared unique IPsmdashIDF weighted 709754 Shared IP visitsmdashMin 70063 4 Shared IP visitsmdashMin 704135 Shared IP visitsmdashMax 7095 5 Shared IP visitsmdashMax 70256 Shared IP visitsmdashAvg 8005 6 Shared IP visitsmdashAvg 704887 Shared IP visitsmdashMinmdashIDF weighted 60625 7 Shared IP visitsmdashMinmdashIDF weighted 703758 Shared IP visitsmdashMaxmdashIDF weighted 70975 8 Shared IP visitsmdashMaxmdashIDF weighted 702259 Shared IP visitsmdashAvgmdashIDF weighted 70938 9 Shared IP visitsmdashAvgmdashIDF weighted 70525

10 Cosine similarity 80338 10 Cosine similarity 804511 Cosine similaritymdashIDF weighted 80213 11 Cosine similaritymdashIDF weighted 8086312 GSN Jaccard 80925 12 GSN Jaccard 913 GSN JaccardmdashIDF weighted 80325 13 GSN JaccardmdashIDF weighted 9013814 GSN Jaccard Freq 80425 14 GSN Jaccard Freq 80515 GSN Jaccard FreqmdashIDF weighted 80425 15 GSN Jaccard FreqmdashIDF weighted 805

to visualize the results Tables 20 and 21 show theaverage rank aggregation (see the discussion of rankaggregation above) in terms of lift per RTB and thep-values of a signed rank test respectively

When comparing the different temporal settingswe see that the ldquoDayrdquo variant performs best and lim-ited differences exist between the time periods of 2and 12 hours More important to observe from theseresults is that (overall) not including time in the GSNdesign outperforms all three temporal variants sub-stantially Only for RTB 3 and the Day version whereusers are connected only if they visit the same IPaddress on the same day are the lifts not signifi-cantly worse than having no time constraint at a 1level (although even here they are significantly worseat a 10 level) This shows that for finding userswith similar interests which locations they visit ismore important than when they visit them By includ-ing the time constraints connections between previ-ously connected users are lost simply because they

Figure 9 (Color online) Lifts of the Publishers (for All Four RTBs)Ranked According to the Median Lift Within the Category(Medians Shown by the Full Line)

visit locations in different time periods Although weuse time to define links metrics that include time todefine the strength of a link might improve the resultsfurther and we consider that an interesting issue forfuture research

8 ConclusionThis paper presents a new design for using geosim-ilarity to connect user instances in a geosimilarity-weighted network In the GSN users are connected ifthey share at least one observed location Various met-rics can be used to yield degrees of similarity We pre-sented intuitive arguments theory and prior researchthat suggest that geosimilarity and the GSN designshould link similar users We also provided strongempirical support Specifically Study 1 shows thatthe GSN indeed links different users instances cor-responding to the same individual and the geosimi-larity metrics rank user instances by their likelihoodof corresponding to the same individual Study 2shows that the GSN links users who have similarinterests as measured by their propensity to visitthe same publishersuse the same apps It also pro-vides strong evidence that the geosimilarity metricsrank user instances by their likelihood of havingsimilar interests Taken together the results providestrong support that the GSN links users with similarinterestsmdashin many cases because they actually repre-sent the same individuals

The main limitation of the research presented inthis paper is that the data are not sufficient to dis-tinguish whether the similarity in interests is solelybased on connecting instances of the same user Thisis an important avenue for future research This lim-itation notwithstanding the very strong results fromStudy 2 have important implications in either caseeither the network is indeed finding different indi-viduals with substantially similar interests or we

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers262 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

Figure 10 (Color online) Lifts Considering All Network Neighbors per RTB System (Each Panel) With and Without the Time Constraint Limited toOne Network (see sect71) for the 40 Publishers

RTB 3mdashAll NNs

0 5 10

Publisher

0

20

40

60

80

100

Lift

RTB 4mdashAll NNs

0 5 10

Publisher

0

200

400

600

800

1000

Lift

RTB 2mdashAll NNs

0 5 10

Publisher

0

200

400

600

800

1000

Lift

0 5 100

200

400

600

800

1000RTB 1mdashAll NNs

Publisher

Lift

All

TP2

TP12

Day

Note Lifts higher than 1000 are shown at 1000

Figure 11 (Color online) Lifts Considering the Closest 1 Network Neighbor per RTB System (Each Panel) With and Without the Time ConstraintLimited to One Network (see sect71) for the 40 Publishers

Lift

0

100

200

400

300

RTB 3mdashTop 1 NNs

0 5 10

Publisher

All

TP2

TP12

Day

0 5 100

200

400

600

800

1000RTB 1mdashTop 1 NNs

Publisher

Lift

RTB 2mdashTop 1 NNs

0 5 10

Publisher

0

200

400

600

800

1000

Lift

RTB 4mdashTop 1 NNs

0 5 10

Publisher

0

200

400

600

800

1000

Lift

Note Lifts higher than 1000 are shown at 1000

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 263

Figure 12 (Color online) Lifts Considering the Closest 10 Network Neighbors per RTB System (Each Panel) With and Without the Time ConstraintLimited to One Network (see sect71) for the 40 Publishers

RTB 2mdashTop 10 NNs

0 5 10

Publisher

0

200

400

600

800

1000

Lift

0 5 10

Publisher

Lift

0

50

100

150

0 5 10

Publisher

0

200

400

600

800

1000

Lift

0 5 100

200

400

600

800

1000

RTB 1mdashTop 10 NNs

RTB 3mdashTop 10 NNs RTB 4mdashTop 10 NNs

Publisher

Lift

All

TP2

TP12

Day

Note Lifts higher than 1000 are shown at 1000

have very strong additional support for the rankingresults of Study 1 as the neighbors with the strongestgeosimilarity have significantly higher affinity for thesame publishersapps

These results have broad and immediate manage-ment implications within our motivating applicationof mobile advertising As discussed above digitalmarketers would like to target the many fragmentedinstances of individuals in the digital advertising

Table 20 Rank Aggregations in Terms of Lift AveragedAcross All Publishers (per RTB)

All TP2 TP12 Day

RTB 1 101 209 309 201RTB 2 1 204 304 302RTB 3 101 3 4 109RTB 4 101 208 308 203

Note Linking based on time does not improve over linkingignoring time

Table 21 Evaluation of Including Time p-Values ofSigned Rank Test

All TP2 TP12 Day

RTB 1 mdash 0 0002 0 0002 0 0006RTB 2 mdash 0 0002 0 0002 0 0002RTB 3 mdash 0 0002 0 0002 00065RTB 4 mdash 0 0002 0 0002 0 0004

ecosystem as well as users who have similar intereststo particular ldquoseedrdquo users Location data such asIP addresses are readily available across the adver-tising ecosystem The most direct use of the GSN formobile ad targeting is to seed a targeting campaignwith users chosen to exhibit some characteristics ofinterest These could be chosen through any of themyriad of methods currently used in advertising TheGSN will directly allow the targeting of the othermembers of these usersrsquo geosimilarity cohorts whichwill include other instances of the same user othersimilar users and likely both So for example if a setof users has been identified to have brand affinity viavisits to a brandrsquos website the GSN cohorts of theseusers would provide an attractive avenue to expandthe reach of a campaign to target them (extendingtraditional retargeting) and also target users similarto the seeds (customer prospecting) In the researchdata set we see that different instances of the sameusers are very often connected in the GSN and thatthose connected in the GSN exhibit substantially sim-ilar interests

The GSN also could be used for (privacy-friendly)hyperlocal targetingmdashmeaning targeting people whofrequent a particular location without needing to storedata on the actual locations of the users as describedearlier IP addresses currently are used by somemarketers for coarse-grained demographic targetinginferring geography from IP address registration data

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers264 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

and then inferring demographics from the geographyThe GSN provides a complementary alternative it willconnect people who visit the same fine-grained loca-tions As described above in the couponing exampleif a campaign is seeded with customers of a localestablishment (eg via an online loyalty program)geosimilarity can target others who frequent the samelocations

An exciting potential future use of the GSN is theevaluation of marketing campaigns across differentchannels This has become an important challenge formobile advertizing (Lee 2014)14 For example I mightreceive an ad for a hot new product on my mobiledevice I may be interested in the product but Irsquomunlikely to buy this product on my mobile phoneRather Irsquoll buy the product using my laptop or PC Bylinking the screens of the same user we are now ableto provide a broader (and possibly much more robust)estimate of campaign success (eg conversion rate)across different channels by aggregating the successmetrics across GSN neighbors For example similar tohow a traditional ad campaign may runnot run anad in a controlled study across different cities agen-cies could targetnot target different (matched) usersand then look at the success rates in their respec-tive geosimilarity neighborhoods This should captureeffects of the same individual on different devices

As discussed at the outset advertisers need to becognizant of privacy concerns regarding the collec-tion storage and use of data This paper followswhat the FTC calls the ldquoprivacy by designrdquo approachBy design the technique does not need to store anydirect personal information about mobile users thereis no need for nonanonymized identifiers demo-graphics geographics psychographics etc In addi-tion the storage of indirect information about userscan be severely limited as well The method doesnot need the actual locationsmdashonly anonymized loca-tion ldquokeysrdquo So for example IP addresses can bereplaced with random numbers15 without affectingthe GSN performance (European Commission 2014Federal Trade Commission 2012) More technically atthe ldquoouter wallrdquo of the system or firm each deviceid can be irreversibly hashed to a random key Theonly requirement is that the same device be hashedto the same key if encountered again Similarly atthe ldquoouter wallrdquo of the system or firm every loca-tion also can be irreversibly hashed to a random keyThe GSN can be formed just the same with the ran-dom keys as with the actual locations If more privacy

14 httpwwwclickzcomclickznews233699680-of-marketers-will-run-cross-channel-marketing-campaigns-in-2014-study15 See both the FTCrsquos and the European Commissionrsquos 2010ndash2011privacy-by-design reports and the comments therein along withProvost et al (2009) for a thorough treatment of this topic

is desired hashing can be done irreversibly many toone and in that case it becomes impossible to asso-ciate definitively any particular location with any par-ticular deviceuser

Finally an aspect of this research that is importantfor managers but not typically covered in the predic-tive modeling literature is the notion of increasing thereach of a campaign For example as discussed abovepossibly the most straightforward use of the GSN isto select the same actor on different mobile devicesThis would allow us to expand the reach of a ldquoretar-getingrdquo campaign16

Increasing reach has important subtleties in the cur-rent advertising ecosystemmdashit is not just the otherside of the coin of increasing lift The reason is thatthere are many targeters in the onlinemobile adver-tising ecosystem all of whom are using the samedata retargeting data and demographicgeographicpsychographic data that are purchased from third-party data providers However these data are avail-able only on a subset of devices Thus all partieswho are using these data are competing for the samesometimes small set of devices Since large advertis-ers typically contract with multiple targeting firmswho take different strategies the effect is that theadvertiser is paying the firms to compete against eachother effectively raising the price in the auction andthereby raising their own cost of advertising17 Whatis more these will be the same devices that also willbe targeted for other campaigns because they are thedevices for which the targeters have data

However by connecting devices in a GSN we canexpand campaigns to devices for which there are noretargeting or third-party data available at all If thereis less competition for these devices we should beable to target them for a lower price Thus expand-ing reach has an implication for the cost-effectivenessof achieving a certain level of predictive performance(eg lift)

AcknowledgmentsThe authors thank the anonymous reviewers and edi-tors for very constructive suggestions which substantiallyimproved the paper This work was done while the firstauthor was at Coriolis Ventures and Everyscreen MediaThe authors extend our gratitude to Everyscreen MediaDstillery Lauren Moores Tom OrsquoReilly Vinayak Javaly andTina Eliassi-Rad for many discussions about mobile adver-tising and mobile ad targeting as well as technical assistance

16 Retargeting is to target browsers who have previously purchasedfrom the brand or who have taken some other indicative brandaction such as browsing the brandrsquos site Retargeting is consid-ered by many to be one of the most effective targeting strategies(although to our knowledge these conclusions are drawn based onassessing conversion rate rather than the influence of the advertis-ing please see Stitelman et al 2011)17 The latter is our conjecture

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 265

to obtain all of the relevant data This particular data set wasnot necessarily used in the development of any productionmodel used for mobile advertising The first author thanksNEC and Andre Meyer for faculty fellowships The authorsthank the Moore and Sloan Foundations for their generoussupport of the Moore-Sloan Data Science Environment atNew York University

ReferencesAgarwal R Gupta AK Kraut RE (2008) Editorial overviewmdashThe

interplay between digital and social networks Inform SystemsRes 19(3)243ndash252

Aral S Muchnik L Sundararajan A (2009) Distinguishing influence-based contagion from homophily-driven diffusion in dynamicnetworks Proc National Acad Sci USA 106(51)21544ndash21549

Bampo M Ewing MT Mather DR Stewart D Wallace M (2008)The effects of the social structure of digital networks on viralmarketing performance Inform Systems Res 19(3)273ndash290

Cho E Myers SA Leskovec J (2011) Friendship and mobility Usermovement in location-based social networks Proc 17th ACMSIGKDD Internat Conf Knowledge Discovery and Data MiningKDD rsquo11 (ACM New York) 1082ndash1090

Cortes C Pregibon D Volinsky CT (2001) Communities of interestHoffmann F Hand DJ Adams N Fisher D Guimaraes G edsAdvances in Intelligent Data Analysis Lecture Notes ComputSci Vol 2189 (Springer-Verlag Berlin Heidelberg) 105ndash114

Crandall DJ Backstrom L Cosley D Suri S Huttenlocher DKleinberg J (2010) Inferring social ties from geographic coinci-dences Proc National Acad Sci USA 107(52)22436ndash22441

de Montjoye YA Hidalgo CA Verleysen M Blondel VD (2013)Unique in the crowd The privacy bounds of human mobilitySci Rep 31ndash5

European Commission (2014) Data protection day 2014 Full speedon EU data protection reform Press release January 27httpeuropaeurapidpress-release_MEMO-14-60_enhtm

Fawcett T (2006) An introduction to ROC analysis Pattern Recogni-tion Lett 27(8)861ndash874

Federal Trade Commission (2012) Protecting consumer pri-vacy in an era of rapid change FTC Report March 2012httpswwwftcgovsitesdefaultfilesdocumentsreportsfederal-trade-commission-report-protecting-consumer-privacy-era-rapid-change-recommendations120326privacyreportpdf

Heidemann J Klier M Probst F (2010) Identifying key users inonline social networks A pagerank based approach InformSystems J 4801(December)157ndash160

Hill S Provost F (2003) The myth of the double-blind reviewAuthor identification using only citations SIGKDD Explorations5(2)179ndash184

Hill S Provost F Volinsky C (2006) Network-based marketingIdentifying likely adopters via consumer networks Statist Sci22(2)256ndash276

Hotho A Nuumlrnberger A Paass G (2005) A brief survey of text min-ing LDV Forum 20(1)19ndash62

Kerho S (2012) Mobile marketingmdashA new analytics frameworkwhat we have and what we need Presentation Marketing onthe Move Conference Wharton School University of PennsylvaniaPhiladelphia February 28

Kossinets G Watts DJ (2009) Origins of homophily in an evolvingsocial network Amer J Sociol 115(2)405ndash450

Lee J (2014) 80 of marketers will run cross-channel marketingcampaigns in 2014 ClickZ (March 31) httpwwwclickzcomclickznews233699680-of-marketers-will-run-cross-channel-marketing-campaigns-in-2014-study

Martens D Provost F (2011) Pseudo-social network targeting fromconsumer transaction data Working paper CEDER-11-05 SternSchool of Business New York University New York

McPherson M Smith-Lovin L Cook JM (2001) Birds of afeather Homophily in social networks Annual Rev Sociol 27415ndash444

Oinas-Kukkonen H Lyytinen K Yoo Y (2010) Social networks andinformation systems Ongoing and future research streamsJ Assoc Inform Systems 11(2)61ndash68

Pan W Aharony N Pentland A (2011) Composite social networkfor predicting mobile apps installation Burgard W Roth Deds Proc AAAI 2011 (AAAI Press San Francisco)

Perlich C Dalessandro B Raeder T Stitelman O Provost F (2014)Machine learning for targeted display advertising Transferlearning in action Machine Learn 95(1)103ndash127

Pogue D (2011) Wrapping up the apple location brouhaha NewYork Times (April 28) httppogueblogsnytimescom20110428wrapping-up-the-apple-location-brouhaha

Provost F Fawcett T (2013) Data Science for Business What You Needto Know About Data Mining and Data-analytic Thinking (OrsquoReillyMedia Inc Sebastopol CA)

Provost F Dalessandro B Hook R Zhang X Murray A (2009) Audi-ence selection for on-line brand advertising Privacy-friendlysocial network targeting KDD rsquo09 Proc 15th ACM SIGKDDInternat Conf Knowledge Discovery and Data Mining (ACMNew York) 707ndash716

Pubmatic (2010) Understanding real-time bidding (RTB) from thepublisherrsquos perspective Technical report Pubmatic SiliconValley CA

Quercia D Lathia N Calabrese F Di Lorenzo G Crowcroft J (2010)Recommending social events from mobile phone location dataProc 2010 IEEE Internat Conf Data Mining ICDM rsquo10 (IEEEComputer Society Washington DC) 971ndash976

Raeder T Stitelman O Dalessandro B Perlich C Provost F (2012)Design principles of massive robust prediction systems Proc18th ACM SIGKDD Internat Conf Knowledge Discovery and DataMining (ACM New York) 1357ndash1365

Stitelman O Dalessandro B Perlich C Provost F (2011) Estimatingthe effect of online display advertising on browser conversionData Mining and Audience Intelligence for Advertising ADKDD2011 (ACM New York) 8

Stitelman O Perlich C Dalessandro B Hook R Raeder T Provost F(2013) Using co-visitation networks for detecting large scaleonline display advertising exchange fraud Proc 19th ACMSIGKDD Internat Conf Knowledge Discovery and Data Mining(ACM New York) 1240ndash1248

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Page 11: Finding Similar Mobile Consumers with a Privacy-Friendly ...pages.stern.nyu.edu/~fprovost/Papers/PMM_ISR_2015.pdf · Provost, Martens, and Murray: Finding Similar Mobile Consumers

Provost Martens and Murray Finding Similar Mobile Consumers252 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

Figure 5 Scatter Plots of Samples of Latitudes and Longitudes Broadcast by Mobile Devices

ndash130 ndash120 ndash110 ndash100 ndash90 ndash80 ndash70 ndash6020

25

30

35

40

45

50

55(a)

Long

Lat

ndash20 ndash10 0 10 20 30 40 50

Long

30

35

40

45

50

55

65

60

75

70

Lat

ndash200 ndash150 ndash100 ndash50 1000 50 150 200

Long

ndash100

ndash80

ndash60

ndash40

ndash20

0

40

20

100

60

80

Lat

4068ndash7406 ndash7404 ndash7402 ndash7400 ndash7394ndash7398 ndash7396 ndash7392 ndash7390

Long

4070

4072

4074

4076

4080

4078

4086

4082

4084

Lat

(b)

(c) (d)

Note Note that locations are plotted without reference to any map The figure gives a striking picture of population density across the world and makes uswonder what is going on with mobile devices in Antarctica

Figure 6 (Color online) The Distribution of the Number of Neighborsfor 500 Randomly Sampled Ego Users

0 500 1000 1500 2000 2500 3000 35000

50

100

150

200

250

300

350

400

450

No neighbors

Fre

quen

cy

the median being 1 The maximum number of neigh-bors is 3167 this user probably visited a publicWi-Fi that thousands of other users also logged intoWe chose 500 cohorts as we observe that the statis-

tics become quite stable as seen in Figure 7 Even200 cohorts seems to be enough to obtain stablesample statistics

The life span of user instances in this set of500 cohorts is shown by the cumulative frequency inFigure 8 Observe that only about 40 of the userinstances are seen at different hours About 30 ofthe user instances are seen only once in the sampledweek This illustrates that it is vital to understand thenotion of user instances These percentages are mis-leading if we interpret the user instances as individu-als or devices One individualdevice without stableidentification in the online advertising ecosystem willcreate many transient user instances which will driveup the percentage of user instances that are seen onlyonce or that are seen only within a short timeframeWe have not figured out how to determine or esti-mate reliably how many ldquounstablerdquo usersdevices arerepresented by the user instances in these data Wekeep all of the data for the analyses below (rather thanfiltering very short-lived user instances) to maintainrealism

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 253

Figure 7 Evolution of Several Sample Statistics as an Increasing Number of Cohorts are Created

0 100 200 300 400 5000

10

20

30

40

50

60

70

Average No network neighbors

0 100 200 300 400 5000

2

4

6

8

10

12

Median No network neighbors

0 100 200 300 400 5000

50

100

150

200

250

300

350

Average No IPs

0 100 200 300 400 5000

2

4

6

8Median No IPs

Notes The horizontal axis denotes the number of cohorts By 500 cohorts the statistics have become stable

Figure 8 (Color online) Cumulative Frequency of Life Span ofUser Instances

100 101 102 103 104 105 1060

01

02

03

04

05

06

07

08

09

10

x = lifespan (sec)

F(x

)

End of June (mobile only)

larr 1 min larr 1 hour larr 1 day

Cumulative frequency of user instances

6 Study 1 Connecting Instances ofthe Same Individual

Returning to our motivating application as men-tioned previously an important task in mobile adver-tising is to be able to target a particular individual inthe face of the individualrsquos fragmentation into differ-ent user instances (for example as seen through dif-ferent bidding systems) Advertisers understand that

it may be impossible to find the same user with 100accuracy Nonetheless it often is important for themto reach the userrsquos various instances without target-ing too many others9 The GSN is designed in partbased on the hypothesis that different instances of thesame person would visit similar locations and there-fore they will be connected in the GSN In this sectionwe present results examining whether and to whatextent that is indeed the case Furthermore the resultsquantify to what extent the same user receives a highGSN-similarity to herself as compared to her otherneighbors

Three experimental settings are used to assess judg-ments that two users are the same The first two sim-ulate different user instances by splitting up the bidrequests of known user instances The last one is notsimulated it links users via the recently introducedidentifier for advertising (IFA) observed over severalRTB exchanges for a subset of device instances

1 Random We divide the IP address visits of a userrandomly across two simulated users Table 6 showsfor example how we can divide a userrsquos IP visits across

9 The astute reader may notice that if the other users targeted arevery similar to the target user the advertising may be well placeddespite not being shown to the exact target user That is not thesubject of this study Study 2 shows evidence that these two notionsare not independent when using the GSN

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers254 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

Table 6 Randomly Splitting User X rsquos IP VisitsCreating Two New Artificial Users X1 and X2

IP1 IP2 IP3

X 1 5 0X1 0 3 0X2 1 2 0

Table 7 The Volume of IFA Occurrence Within OurData Set

Volume

RTB 1 119401855RTB 2 113831962RTB 3 7891292

Table 8 The Frequency of Unique IFAs Visible AcrossRTB Exchanges

Frequency

RTB 1RTB 2 3061594 (1016)RTB 2RTB 3 451256 (169)RTB 3RTB 1 181417 (085)

two new users X1 and X2 for which we know thatthey are actually the same individual Results withRandom should be used only as a ceiling on perfor-mance because the results will be overly optimistic

2 Temporal A more realistic split-up of the trans-actions is a temporal split-up where first the middleday of Xrsquos transactions is determined (hence dynam-ically chosen for each user) All IP address visits upto this middle day are assigned to X1 and all trans-actions after the middle day are assigned to X2 Itwill only be applied to users who are seen on multi-ple days

3 Identifier for advertising IFA is an advertisingidentifier from Apple that users can change or asknot to be used for advertising and is seen as aprivacy-friendlier version of the controversial fixeddevice ID (the ldquoUDIDrdquo) The IFA is observed in threeexchanges with volumes shown in Table 7 for exam-ple 1940855 unique IFAs are seen in RTB 110 Whenwe observe the same IFA across different exchangeswe now know that two devices in different exchangesactually are the same The frequency of unique IFAsvisible across exchanges is shown in Table 8 For exam-ple a total of 3018223 IFAs are seen when we look atboth RTB 1 and RTB 2 with 306594 of these (1016)seen on both exchange 1 and exchange 2 What ifthere were not an IFA and the device would havehad a different identifier in each exchange To what

10 For confidentiality RTB 1 in this setting is not necessarily thesame as RTB 1 in sectsect4 or 7

extent are these two device instances then connectedin the GSN

With these different ldquogold standardrdquo methods toindicate whether or not two users are the samewe assess the ability of the geosimilarity metricsto link the same user determining how often twoinstances of the same person are connected in theGSN (sect61) Afterward we shall assess the differentGSN strength metricsrsquo ability to rank users determin-ing to what extent two instances of the same personexhibit strong geosimilarity as compared to the otherneighbors (sect62) For the random and temporal split-ups 200 samples are used For the IFA-based methodthe complete population is used for measuring thedegree of connectedness to the same user to analyzethe relative strength of the connected users 200 sam-ples are used

61 Same Individual Different ldquoScreensrdquoWe now present results on the GSN connectednessmeasured as how often two instances of the same userare connected in the GSN (in percentage) Note thatthe results summarized in Table 9 should be inter-preted in light of the limited number of days in build-ing the research data set which will limit the GSNconnectivity Nonetheless the relatively high connec-tivities observed even with this sample provide quitepromising results which lends considerable supportto the merit of the overall design

Random0 Using the random split two users thatcorrespond to the same individual are not connectedin only 9 of the neighborhoods Moreover in thesecases user X (whose IP visits are split up) usuallyhas only one or two entries and the random split-upresults in no visits for one of the two simulated users

Temporal0 The temporal split-up is more realisticthan the random one where we assume that a personuses simulated device 1 in the first half of the week(eg during the weekend) and simulated device 2 inthe second half of the week (eg during the week)

Users X1 and X2 are not connected in 32 of theneighborhoods Whereas the random split-up was toooptimistic the temporal one is too pessimistic as inreality there likely is some overlap in the time peri-ods of use of two different devices If user X visitsan IP address only in the first days of the week this

Table 9 To What Extent is the Same User Connectedto Herself

connected

Random 91Temporal 68IFAmdashRTB 1RTB 2 67IFAmdashRTB 2RTB 3 82IFAmdashRTB 3RTB 1 73

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 255

IP address will not occur in user X2rsquos transactions andreduces the chance to be connected to X1

IFA0 We look at all combinations of the three ex-changes with IFA identifiers For example there are45256 unique IFAs seen on both RTB 2 and RTB 3 In82 of these the two IFAs visit at least one same IP onboth exchanges and hence are connected in our GSN

The bottom line is that even with this limited slice ofonline behavior different instances of the same indi-vidual are connected in the GSN 70ndash80 of the timeEncouragingly the results from the simulated scenar-ios concur with the results from the real (IFA) sce-narios (including the aforementioned optimism andslight pessimism of the two simulated scenarios)Consistency over these different settings gives addi-tional confidence in the results suggesting that theGSN indeed holds promise for a high degree of suc-cess at targeting the same user across the fragmenta-tion of the online advertising ecosystem

62 Ranking User InstancesNext given that two user instances of the same indi-vidual are indeed connected we assess to what extentthe second instance of the same user is ranked highlyamong its network neighbors based on the weightof their linkage in the GSN We first will discuss themost general results using the best ranking methodand then will look across different ranking methods

We decompose the results based on the numberof network neighbors (NNs) since there are twoboundary cases that require special interpretationTables 10ndash14 show the proportions of the ego net-works that fall into three scenarios First a user mayhave only one NN In this scenario the ranking isperfect (trivially)mdashthe other instance of the same useris the only user instance given a nonzero score As

Table 10 In What Portion of Cases Is the Same UserConnected Depending on the Number ofNetwork NeighborsmdashRandom Split-up

Ego has only 1 NN 3104Ego has exactly 2 NNs 1905

Same user ranked first 8005Ego has more than 2 NNs 4901

Note For the scenario with two NNs the table reports thepercentage of cases where the other known instance of thesame user is ranked first (including ties)

Table 11 In What Portion of Cases Is the Same UserConnected Depending on the Number ofNetwork NeighborsmdashTemporal Split-up

Ego has only 1 NN 3008Ego has exactly 2 NNs 1807

Same user ranked first 8309Ego has more than 2 NNs 5005

Table 12 In What Portion of Cases Is the Same UserConnected Depending on the Number ofNetwork NeighborsmdashIFA RTB 1RTB 2Split-up

Ego has only 1 NN 4401Ego has exactly 2 NNs 1201

Same user ranked first 8201Ego has more than 2 NNs 4400

Table 13 In What Portion of Cases Is the Same UserConnected Depending on the Number ofNetwork NeighborsmdashIFA RTB 2RTB 3Split-up

Ego has only 1 NN 6109Ego has exactly 2 NNs 1907

Same user ranked first 7102Ego has more than 2 NNs 1804

Table 14 In What Portion of Cases Is the Same UserConnected Depending on the Number ofNetwork NeighborsmdashIFA RTB 3RTB 1Split-up

Ego has only 1 NN 5305Ego has exactly 2 NNs 2501

Same user ranked first 6304Ego has more than 2 NNs 2104

shown in the tables this scenario accounts for 30ndash60 of the ego networks (depending on the split-upmethod)

The second boundary case is when the ego has ex-actly two NNs This scenario accounts for 10ndash25of the cases depending on the split-up method Herethe tables also report whether the other known userinstance is ranked first (including ties) among thetwo neighbors The other known instance of the sameuser is ranked first 60ndash80 of the time dependingon the split-up scenario The IFA results are affectedstrongly by the particular RTB settingmdashthis may indi-cate that the publishers on the different RTBs have dif-ferent policies for passing the IFA Importantly theseresults are conservativemdashpossibly very conservativeThe other NNs may also be instances of the same userunbeknownst to usmdashfor example in the IFA casebecause an app does not pass the IFA to the RTB orbecause the other user is the same user on a differentdevice and thus does not share the IFA

The final scenario is when more than two NNsare present As shown in the tables this accounts for20ndash50 of the cases depending on the split-up sce-nario In this case we measure the percentile in theranking where the instance of the known same userfalls (see below) as well as the area under the receiveroperating characteristic curve (AUC) (Fawcett 2006)which is equivalent to the MannndashWhitneyndashWilcoxon(MWW) statistic The AUCMWW measures to what

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers256 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

Table 15 When There Are More Than Two Network Neighbors to What Extent Is the Known Same User StronglyConnected to Itself

AUC (if gt 2 NNs) () Average rank percentile (if gt 2 NNs) ()

Ranking metric R T RTB 12 RTB 23 RTB 31 R T RTB 12 RTB 23 RTB 31

1 78 64 68 59 55 29 48 34 43 542 59 58 67 58 54 45 53 34 44 553 79 69 74 61 61 29 44 29 40 484 86 75 69 61 59 24 40 32 41 505 83 70 71 59 63 26 44 31 43 476 85 74 71 60 65 25 41 31 42 457 86 76 73 62 64 24 40 29 40 468 83 70 72 59 65 26 44 30 43 459 85 74 71 60 67 25 41 30 42 44

10 79 69 69 61 63 29 44 32 41 4711 79 69 71 62 64 44 44 30 41 4612 81 68 68 60 62 28 46 34 42 4913 81 70 71 61 64 27 44 30 40 4614 74 67 66 61 48 34 47 35 41 6015 74 67 66 61 48 34 47 35 41 60

Note See the text for a discussion of ranking metrics

extent data points with label 1 are ranked higherthan those with label 0 In the results that follow(see Table 15) the AUC is measured for each samplewhere any user instance among the neighbors thatcorresponds to the same individual is labeled as 1and all others are labeled as 0 Thus in this settingthe AUC measures how well the neighbors are rankedby the likelihood of being the same individual11 Theaverage over all samples is reported If for a sampleall neighbors have the same score that AUC is setto 05 In addition to the factors noted above this alsowill tend to make the results conservative For exam-ple if the GSN links an ego user to a set of neighborsthat are all instances of the same individual in thisevaluation the AUC will be 05

The ranking measure is the percentile in the rank-ing at which the known-same-user instance is rankedwith lower values being better For example if thesame-user instance has the highest score (closest con-nection) among five neighbors the rank is 20 (15)Note again that this is quite conservative in thisexample the user is ranked at the top of the list butbecause there were only five neighbors the highestpossible score is 20 In case the known-same-userinstance has the same score as another neighbor theaverage rank is reported

As shown in Table 15 all AUC values exceed 05ranging up to 08 This means that in every casea known-same-user instance is likely to be rankedhigher than an instance not known to be the same

11 Technically in this setting the AUCMWW is the probability thata randomly selected neighbor that is in fact a known instance of thesame user is ranked more highly than a randomly selected neighborthat is not known to be an instance of the same user If the known-same-user instances are all ranked above the other-user instancesthe AUC = 100 If they are all below then the AUC = 0

user The frequency-based ranking metrics (4ndash9) per-form quite well in connecting instances of the sameuser strongly

Almost all metrics perform better than a randommodel with performances that even come close to thebest possible solution Consistently performing quitewell are the frequency-based metrics (metrics 3ndash9)with metric 7 (the minimum of the IDF weights ofthe shared locations) getting the most wins Not per-forming well is the very basic count metric as well asthe Boolean metric that provides a binary score onlyInteresting to note the Jaccard metrics do not performwell either It seems that penalizing users that visitmany IP addresses negatively affects the results anissue we will return to in the next study

7 Study 2 Does the GSN Select Userswith Similar Interests

The foregoing section addressed our paperrsquos firstclaimmdashthat the GSN design indeed links instances ofthe same user and links them more strongly thaninstances of other users In this section we turn to thepaperrsquos second claimmdashthat the GSN links users withsimilar interests We will measure similar interests bysimilarity in behavior accessing particular publisherson the Web and using mobile apps12

71 Connecting Users with Similar Interests

To What Extent Are the GSN Neighbors of Visi-tors to Particular Publishers Also Visitors to ThoseSame Publishers To assess whether the GSN con-nects users with similar interestsbehavior we select

12 From now on all user instances will be complete real user in-stances the simulated split-up scenarios (Random Temporal) werecreated specifically for the purpose of the same-user study

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 257

Table 16 Distribution of Publishers per RTB in Terms of thePercentage of Users in Our Data Set That Visit the GivenPublisher

Publisher of D

RTB 1Burstly 3004myYearbookcom 1062My Fitness Pal 0065TextMe Inc 0056GameResort 0025Flixster 0019247Sports 0015Daily Workout Apps LLC 0009Mobile Deluxe 0007YoYo Games Ltd 0004

RTB 2mworldnowcom 0032mtmzcom 0024mtopixcom 0014babycentercom 0015meetmecom 0006appsfacebookcom 0006itunesapplecom 0039appevitecom 0002beautyandskincarescom 0003mmajunkiecom 0001

RTB 3Conversion Exchange 5053Pinger Inc 2096Top Game Developer 4029PremiumEntertainmentApp-FamilyOriented-Android 0059Enflick 0053Clapfoot Games 0044FunPokes Inc 0018Talkatone 0015Sevenlogics Inc 0008A Star Software 0007

RTB 4Pinger PhonemdashiOSmdashConversation 1001Talkatone iPhone App 0007DiceWithBuddies 00139GAG Reader 0012Video Downloader Pro Lite 0016Rage Comics 0009Spades 3D Lite 0006Apalabrados 0003Relax Melodies HD 0002Celeb MemdashPhotoMaker 0000

a diverse collection of 40 publishers listed in Table 16A publisher can correspond to a mobile app a web-site or a set of appswebsites We selected both pub-lishers that are visited by many users and more nichepublishers with fewer visitors As can be seen fromTable 16 several types of publishers can be distin-guished the main groups are games (eg GameRe-sort Top Game Developer) and social networkingtools (eg appsfacebookcom Text Me Inc)

More specifically we create 200 samples for 40 pub-lishers 10 publishers for each of four different RTB

exchanges listed in Table 1613 The percentage ofunique users that visit each publisher in our completedata set is also shown (D) To create each samplean ldquoegordquo user is selected that has visited the pub-lisher in question (an ldquoego visitorrdquo) this ego visi-torrsquos GSN neighborhood is created Next we assesshow many of the egorsquos neighbors also visited thatsame publisher This will be compared with a base-line visit rate to produce lift and leverage values(Provost and Fawcett 2013)mdashspecifically how muchmore likely is it for the neighbors of ego visitors tovisit the publisher than would be expected by chanceWe consider three different setups that produce dif-ferent baselines

1 Considering all RTB exchanges to determine the base-line A user is considered a visitor for a given pub-lisher if she visits the publisher on any RTB ex-change (since some publishers are visible on differentexchanges) Hence the baseline visit rate is the num-ber of users (on any RTB) who visit that publisherdivided by the total number of users across all RTBexchanges This will provide optimistic lifts sincemost publishers appear only on one RTB exchangewhereas the baseline is measured over all exchanges

2 Considering one RTB exchange only to determine thebaseline A user is considered a visitor for a given pub-lisher and RTB if she visits the publisher on the sameRTB exchange This changes the baseline to the num-ber of users (on this RTB) who visit that publisherdivided by the total number of users on this RTBexchange The lift and leverage results will be worsethan in the first setting because the baseline will belarger Seeing that some publishers are seen on differ-ent RTB exchanges this setting is slightly pessimistic

3 Taking time of location visits into account In thethird setting we link users only if they visit an IPwithin the same time period This will be elaboratedon in sect73

Results are shown in Tables 17 and 18 The percent-age of publisher visitors in the neighborhood is givenas N This percentage is compared with the baselinepercentage of users that visit that publisher in thecomplete data D using two metrics the lift is the ratioof these numbers the leverage is the difference Liftis particularly useful for very small base rates (con-sider that an activity with a base rate of 30 cannotpossibly have a lift higher than about 3) Leverage ismore telling for larger base rates because it shows theabsolute improvement (here in percentage points)

When considering all NNs the lifts for all publish-ers are greater than 1 except for one Burstly on RTB 1has a lift of 097 (essentially no lift) but only when weconsider RTB 1 as the baseline This publisher already

13 For confidentiality the RTB numbering here is again differentfrom above

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers258 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

Table 17 RTBs 1 and 2 Percentage of Network Neighbors That Also Visit the Publisher

RTB 1

All NNs N1 N10

D N Lift Lev N1 Lift Lev N10 Lift Lev

Considering all RTB exchanges to determine baseline1 Burstly 3004 8063 2084 5059 47022 15054 44018 25061 8043 220572 myYearbookcom 1062 25077 15094 24015 44074 27067 43012 35016 21074 330543 My Fitness Pal 0065 4015 6038 305 3609 56077 36025 15055 23091 14094 TextMe Inc 0056 1805 33006 17094 39002 69075 38046 25018 45 240625 GameResort 0025 4093 19077 4068 18052 74026 18027 11086 47058 110626 Flixster 0019 1055 8015 1036 8014 42085 7095 1079 9042 1067 247Sports 0015 27057 188058 27042 77062 530093 77048 74053 509075 740388 Daily Workout Apps LLC 0009 1092 21047 1083 19023 215046 19014 11054 129027 110459 Mobile Deluxe 0007 1087 28067 108 23053 36009 23046 9063 14707 9056

10 YoYo Games Ltd 0004 3052 99058 3049 21057 609064 21053 13 367045 12096Considering this RTB exchange only to determine baseline

1 Burstly 8085 8063 0097 minus0022 47022 5033 38037 25061 2089 160762 myYearbookcom 4071 25077 5047 21006 44074 905 40003 35016 7046 300453 My Fitness Pal 1089 4015 2019 2025 3609 19048 35001 15055 8021 130654 TextMe Inc 1063 1805 11035 16087 39002 23094 37039 25018 15045 230555 GameResort 0073 4093 6078 402 18052 25049 17079 11086 16033 110146 Flixster 0055 1055 208 0099 8014 14071 7059 1079 3023 10247 247Sports 0043 27057 64073 27014 77062 182023 7702 74053 174097 74018 Daily Workout Apps LLC 0026 1092 7037 1066 19023 73095 18097 11054 44037 110289 Mobile Deluxe 0019 1087 9084 1068 23053 123088 23034 9063 5007 9044

10 YoYo Games Ltd 001 3052 34018 3042 21057 209025 21047 13 126012 1209

RTB 2

Considering all RTB exchanges to determine baseline1 mworldnowcom 0032 13027 40086 12095 54039 167042 54006 31065 97044 310332 mtmzcom 0024 6089 28086 6065 23008 9607 22084 10068 44075 100443 mtopixcom 0014 15007 104079 14093 53045 37106 5303 33075 234065 330614 babycentercom 0015 6087 47031 6072 43048 299056 43033 17074 122024 17065 meetmecom 0006 6002 10308 5097 54072 942093 54066 27085 47909 270796 appsfacebookcom 0006 301 48039 3003 9009 141097 9003 2076 43008 20697 itunesapplecom 0039 25077 65071 25038 51043 131012 51004 29049 75018 290098 appevitecom 0002 48028 21212043 48025 54076 21509068 54074 48028 21212043 480259 beautyandskincarescom 0003 18019 642087 18016 8048 299094 8046 1107 413049 11067

10 mmajunkiecom 0001 1036 201064 1035 26067 3195505 26066 13033 11977074 13033Considering this RTB exchange only to determine baseline

1 mworldnowcom 4034 13027 3006 8093 54039 12053 50005 31065 7029 270312 mtmzcom 3019 6089 2016 307 23008 7024 19089 10068 3035 70493 mtopixcom 1092 15007 7084 13015 53045 27081 51053 33075 17056 310834 babycentercom 1094 6087 3054 4093 43048 22042 41054 17074 9015 15085 meetmecom 0078 6002 7077 5025 54072 70057 53094 27085 35092 270076 appsfacebookcom 0086 301 3062 2024 9009 10063 8024 2076 3022 1097 itunesapplecom 5024 25077 4092 20053 51043 9081 46019 29049 5063 240258 appevitecom 0029 48028 165058 47098 54076 187082 54047 48028 165058 470989 beautyandskincarescom 0038 18019 48011 17081 8048 22045 8011 1107 30095 11032

10 mmajunkiecom 0009 1036 15009 1027 26067 296003 26058 13033 148001 13024

Notes Results are given for the complete data set (D) the complete neighborhood of a user who visited the given publisher (All NNs) and considering theneighborhood of closest 1 and 10 users (N1 and N10)

has a very high baseline and as we will see in thenext section when we consider only the top 1 and 10NNs the lifts go up to 5 and 3 respectively Burstlyprovides a generic platform for building apps and sois a weaker indication of user interest than many ofthe other publishers (fitness babies games)

All other lifts exceed one by substantial mar-gins and go up to even 2391 (Apalabrados onRTB 4 a Spanish-language social Scrabble-like game)

Obviously the conclusion that the NNs indeed aremore likely to visit the same publishers is highly sig-nificant by a simple sign test comparing the visita-tion rate within the geosimilarity neighborhood tothat of overall population either via lift or leveragep lt 0001 Thus NNs of publisher visitors are indeedmore likely also to visit the same websites the abso-lute values of the lifts and leverages suggest that theyare substantially more likely

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 259

Table 18 RTBs 3 and 4 Percentage of Network Neighbors That Also Visit the Publisher

RTB 3

All NNs N1 N10

D N Lift Lev N1 Lift Lev N10 Lift Lev

Considering all RTB exchanges to determine baseline1 Conversion Exchange 5053 34054 6024 29001 64071 11069 59017 50077 9018 450242 Pinger Inc 2096 46002 15053 43006 65012 21097 62015 51048 17037 480523 Top Game Developer 4029 47024 11002 42095 60 13099 55071 48097 11042 440684 PremiumEntertainmentApp-FamilyOriented-Android 0059 23079 40012 2302 20045 34049 19086 22025 37052 210665 Enflick 0053 11071 21095 11017 21057 40044 21004 12027 23 110736 Clapfoot Games 0044 8055 19032 8011 11054 26006 1101 13099 31061 130557 FunPokes Inc 0018 11011 61028 10093 50 275077 49082 15079 87008 150618 Talkatone 0015 13013 88051 12098 31011 209076 30096 17091 120078 170779 Sevenlogics Inc 0008 2071 32059 2063 5088 70074 508 8033 100022 8025

10 A Star Software 0007 3033 44086 3025 23008 311024 23 9009 122061 9002Considering this RTB exchange only to determine baseline

1 Conversion Exchange 11045 34054 3002 23009 64071 5065 53025 50077 4043 390322 Pinger Inc 6013 46002 705 39089 65012 10062 58098 51048 8039 450353 Top Game Developer 8088 47024 5032 38036 60 6076 51012 48097 5052 400094 PremiumEntertainmentApp-FamilyOriented-Android 1023 23079 19038 22057 20045 16066 19023 22025 18013 210025 Enflick 101 11071 1006 1006 21057 19054 20046 12027 11011 110166 Clapfoot Games 0092 8055 9033 7064 11054 12059 10062 13099 15027 130087 FunPokes Inc 0038 11011 2906 10074 50 133022 49062 15079 42007 150418 Talkatone 0031 13013 42076 12082 31011 101034 3008 17091 58035 170619 Sevenlogics Inc 0017 2071 15074 2054 5088 34017 5071 8033 48041 8016

10 A Star Software 0015 3033 21067 3017 23008 150036 22092 9009 59023 8094

RTB 4

Considering all RTB exchanges to determine baseline1 Pinger PhonemdashiOSmdashConversation 1001 71081 70085 7008 80077 79068 79076 70014 6902 690132 Talkatone iPhone App 0007 11038 17203 11031 29003 439068 28097 17002 257078 160963 DiceWithBuddies 0013 7041 59004 7028 18018 144092 18006 7069 61031 70574 9GAG Reader 0012 62013 497099 62 49009 39305 48097 49005 393017 480925 Video Downloader Pro Lite 0016 20003 123015 19087 22058 138084 22042 17093 110027 170776 Rage Comics 0009 609 8001 6081 13033 154086 13025 609 8001 60817 Spades 3D Lite 0006 1205 225076 12044 27027 492057 27022 1205 225076 120448 Apalabrados 0003 77042 21391071 77039 75 21316097 74097 79032 21450043 790299 Relax Melodies HD 0002 31058 21002029 31056 50 31170029 49098 31058 21002029 31056

10 Celeb MemdashPhotoMaker 0 3023 82107 3022 7014 11819048 7014 3023 82107 3022Considering this RTB exchange only to determine baseline

1 Pinger PhonemdashiOSmdashConversation 8051 71081 8044 6303 80077 9049 72026 70014 8024 610632 Talkatone iPhone App 0055 11038 20053 10082 29003 52038 28048 17002 30071 160473 DiceWithBuddies 0009 7041 82011 7032 18018 201054 18009 7069 85027 7064 9GAG Reader 0097 62013 63097 61016 49009 50055 48012 49005 50051 480085 Video Downloader Pro Lite 0044 20003 45036 19059 22058 51014 22014 17093 40062 170496 Rage Comics 0037 609 18079 6053 13033 36032 12097 609 18079 60537 Spades 3D Lite 0046 1205 2609 12004 27027 58068 26081 1205 2609 120048 Apalabrados 0021 77042 371093 77021 75 360031 74079 79032 381006 790119 Relax Melodies HD 0013 31058 238054 31045 50 377069 49087 31058 238054 31045

10 Celeb MemdashPhotoMaker 0003 3023 97089 3019 7014 216076 7011 3023 97089 3019

Notes Results are given for the complete data set (D) the complete neighborhood of a user who visited the given publisher (All NNs) and considering theneighborhood of closest 1 and 10 users (N1 and N10)

The very high lifts could be attributed to two phe-nomena They may provide some evidence that theGSN embeds a true social network as pointed out bythe good results for the social networking publishers(eg Apalabrados with a lift of 2391 appsevitecomwith a lift of 2212 and others) Friends visiting thesame social networking sites (here unusual ones)

would increase the lift and leverage over just havingdifferent instances of the same user visiting the samewebsites An alternative explanation is that some ofthese publishers are much more likely to be visitedacross different devices of the same user It is not clearwhether this is actually true However it argues forobtaining specific and broad data on which devices

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers260 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

are not the same user to shed further light on forexample any broad data that include a (anonymized)user identifier

From these results we can conclude that GSNneighbors indeed exhibit substantially similar pub-lisher visitation behavior Next we assess to whatextent more strongly connected NNs are more similarthan less strongly connected NNs

72 Ranking Users with Similar Interests

Are the More-Similar Geosimilarity Neighbors of aPublisherrsquos Visitors Even More Likely Also to Be Vis-itors to the Same Publisher To assess whether more-similar geosimilarity neighbors are even more likelyto share interests we compute the measures exactlyas in the previous study except instead of consider-ing all NNs only the 1 (N1) and 10 (N10) neighborswith the highest scores are considered (see Tables 17and 18) First we must consider which of the 15 met-rics that are defined to measure link strength to useTable 19 reports the rank aggregations of the differentmetrics as to which provide the best lifts Specificallyfor each of the 40 publishers each scoring metric pro-vides a lift and for that publisher the scoring metricscan be ranked (the best being ranked 1 etc) The rankaggregation is the average rank for a scoring metricacross all 40 publishers Thus if one metric providedthe best lift for all publishers it would get an aggre-gated rank of 1

To visualize the best metrics the metrics with anaverage ranking less than 8 are shown in boldfacein Table 19 The Jaccard metrics do not perform wellin this study either showing that penalizing the con-nection to users that visit many locations is not sen-sible Remember that the Jaccard metric is based onthe idea that users that visit many locations will havea higher chance to share some location with anotheruser However we have so many locations in totalthat visiting a couple more locations will increase onlymarginally this probability of sharing a location bychance A user that logs into more IP addresses andis more active should hence not be penalized andranked lower than other less active users Metric 3which sums the IDF scores of the shared locationsperforms quite well Given the operational efficiencyadvantage over the frequency-based metrics this met-ric is chosen to measure the strength of the links

As seen from the results in Tables 17 and 18 thelift and leverage values for the strongly connected (top 1and top 10) NNs are even higher sometimes astro-nomical (3170 for Relax Melodies HD (RTB 4)) Theseresults are highly significant by sign tests compar-ing either the lift or the leverage between the top-ranked neighbors and the entire cohort (p lt 0001) Theabsolute values again indicate that the users with the

strongest geosimilarity are often substantially morelikely to visit the same publisher

Thus we can conclude that not only does geosim-ilarity find users with similar interests it also ranksusers well by their likelihood of having similarinterests

An interesting follow-up question is whether thecharacteristics of a publisher are of importance for theresults For example are GSN neighbors more sim-ilar in terms of using the same social network appas compared to playing the same games To answerthis question the publishers are categorized into sixclasses funny social news games communicationand miscellaneous In Figure 9 the lifts of the pub-lishers are shown per category Some general trendscan be observed where publishers related to funnycontent have the highest median lift This could beexplained by the fact that users often forward funnycontent to their friends The social websitesapps fol-low next which seems to further demonstrate thatfriends are likely linked in the GSN However pub-lishers in the communication category perform not sowell compared with the other categories and ratherhigh variances are observed in the different cate-gories Please note that high variance is observedacross the categories therefore it is difficult to makewell-supported claims from these results

73 TimingAs a final evaluation we explore whether the inclu-sion of temporal similarity in location visitation ishelpful for defining GSN connections To this end weintroduce a setting where two users are connectedonly if they visit an IP address in the same timeperiod of a day Specifically we divide a day intodifferent time windowsmdashfor example two time peri-ods of 12 hoursmdashwhere users are connected only ifthey visit the same IP address in the same time periodof a day (which might be on different days) Practi-cally two digits are added to an IP which denote thestarting moment of the corresponding time periodWe report on time periods of 2 hours and of 12 hours(other periods yield similar results) We also includethe setting where users are connected only when theyvisit the same IP address on the same day in light ofthe result of Crandall et al (2010) that visiting thesame location at around the same time is indicativeof friendship

The resulting lifts are reported in Figures 10ndash12when considering respectively all NNs the top 1 NNand the top 10 NNs The results when not consider-ing the time period are repeated as well (All) lim-ited to the conservative case of only considering oneexchange (see above) The baseline (lift of one) isindicated with a red horizontal line lifts higher than1000 are shown as 1000 to limit the range to be able

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 261

Table 19 Rank Aggregation per Metric Over All Publishers (Across All RTBs)

Top 1 NN Top 10 NNs

1 Shared unique IPs 801 1 Shared Unique IPs 802252 BooleanmdashSharing an IP 80625 2 BooleanmdashSharing an IP 809133 Shared unique IPsmdashIDF weighted 70025 3 Shared unique IPsmdashIDF weighted 709754 Shared IP visitsmdashMin 70063 4 Shared IP visitsmdashMin 704135 Shared IP visitsmdashMax 7095 5 Shared IP visitsmdashMax 70256 Shared IP visitsmdashAvg 8005 6 Shared IP visitsmdashAvg 704887 Shared IP visitsmdashMinmdashIDF weighted 60625 7 Shared IP visitsmdashMinmdashIDF weighted 703758 Shared IP visitsmdashMaxmdashIDF weighted 70975 8 Shared IP visitsmdashMaxmdashIDF weighted 702259 Shared IP visitsmdashAvgmdashIDF weighted 70938 9 Shared IP visitsmdashAvgmdashIDF weighted 70525

10 Cosine similarity 80338 10 Cosine similarity 804511 Cosine similaritymdashIDF weighted 80213 11 Cosine similaritymdashIDF weighted 8086312 GSN Jaccard 80925 12 GSN Jaccard 913 GSN JaccardmdashIDF weighted 80325 13 GSN JaccardmdashIDF weighted 9013814 GSN Jaccard Freq 80425 14 GSN Jaccard Freq 80515 GSN Jaccard FreqmdashIDF weighted 80425 15 GSN Jaccard FreqmdashIDF weighted 805

to visualize the results Tables 20 and 21 show theaverage rank aggregation (see the discussion of rankaggregation above) in terms of lift per RTB and thep-values of a signed rank test respectively

When comparing the different temporal settingswe see that the ldquoDayrdquo variant performs best and lim-ited differences exist between the time periods of 2and 12 hours More important to observe from theseresults is that (overall) not including time in the GSNdesign outperforms all three temporal variants sub-stantially Only for RTB 3 and the Day version whereusers are connected only if they visit the same IPaddress on the same day are the lifts not signifi-cantly worse than having no time constraint at a 1level (although even here they are significantly worseat a 10 level) This shows that for finding userswith similar interests which locations they visit ismore important than when they visit them By includ-ing the time constraints connections between previ-ously connected users are lost simply because they

Figure 9 (Color online) Lifts of the Publishers (for All Four RTBs)Ranked According to the Median Lift Within the Category(Medians Shown by the Full Line)

visit locations in different time periods Although weuse time to define links metrics that include time todefine the strength of a link might improve the resultsfurther and we consider that an interesting issue forfuture research

8 ConclusionThis paper presents a new design for using geosim-ilarity to connect user instances in a geosimilarity-weighted network In the GSN users are connected ifthey share at least one observed location Various met-rics can be used to yield degrees of similarity We pre-sented intuitive arguments theory and prior researchthat suggest that geosimilarity and the GSN designshould link similar users We also provided strongempirical support Specifically Study 1 shows thatthe GSN indeed links different users instances cor-responding to the same individual and the geosimi-larity metrics rank user instances by their likelihoodof corresponding to the same individual Study 2shows that the GSN links users who have similarinterests as measured by their propensity to visitthe same publishersuse the same apps It also pro-vides strong evidence that the geosimilarity metricsrank user instances by their likelihood of havingsimilar interests Taken together the results providestrong support that the GSN links users with similarinterestsmdashin many cases because they actually repre-sent the same individuals

The main limitation of the research presented inthis paper is that the data are not sufficient to dis-tinguish whether the similarity in interests is solelybased on connecting instances of the same user Thisis an important avenue for future research This lim-itation notwithstanding the very strong results fromStudy 2 have important implications in either caseeither the network is indeed finding different indi-viduals with substantially similar interests or we

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers262 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

Figure 10 (Color online) Lifts Considering All Network Neighbors per RTB System (Each Panel) With and Without the Time Constraint Limited toOne Network (see sect71) for the 40 Publishers

RTB 3mdashAll NNs

0 5 10

Publisher

0

20

40

60

80

100

Lift

RTB 4mdashAll NNs

0 5 10

Publisher

0

200

400

600

800

1000

Lift

RTB 2mdashAll NNs

0 5 10

Publisher

0

200

400

600

800

1000

Lift

0 5 100

200

400

600

800

1000RTB 1mdashAll NNs

Publisher

Lift

All

TP2

TP12

Day

Note Lifts higher than 1000 are shown at 1000

Figure 11 (Color online) Lifts Considering the Closest 1 Network Neighbor per RTB System (Each Panel) With and Without the Time ConstraintLimited to One Network (see sect71) for the 40 Publishers

Lift

0

100

200

400

300

RTB 3mdashTop 1 NNs

0 5 10

Publisher

All

TP2

TP12

Day

0 5 100

200

400

600

800

1000RTB 1mdashTop 1 NNs

Publisher

Lift

RTB 2mdashTop 1 NNs

0 5 10

Publisher

0

200

400

600

800

1000

Lift

RTB 4mdashTop 1 NNs

0 5 10

Publisher

0

200

400

600

800

1000

Lift

Note Lifts higher than 1000 are shown at 1000

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 263

Figure 12 (Color online) Lifts Considering the Closest 10 Network Neighbors per RTB System (Each Panel) With and Without the Time ConstraintLimited to One Network (see sect71) for the 40 Publishers

RTB 2mdashTop 10 NNs

0 5 10

Publisher

0

200

400

600

800

1000

Lift

0 5 10

Publisher

Lift

0

50

100

150

0 5 10

Publisher

0

200

400

600

800

1000

Lift

0 5 100

200

400

600

800

1000

RTB 1mdashTop 10 NNs

RTB 3mdashTop 10 NNs RTB 4mdashTop 10 NNs

Publisher

Lift

All

TP2

TP12

Day

Note Lifts higher than 1000 are shown at 1000

have very strong additional support for the rankingresults of Study 1 as the neighbors with the strongestgeosimilarity have significantly higher affinity for thesame publishersapps

These results have broad and immediate manage-ment implications within our motivating applicationof mobile advertising As discussed above digitalmarketers would like to target the many fragmentedinstances of individuals in the digital advertising

Table 20 Rank Aggregations in Terms of Lift AveragedAcross All Publishers (per RTB)

All TP2 TP12 Day

RTB 1 101 209 309 201RTB 2 1 204 304 302RTB 3 101 3 4 109RTB 4 101 208 308 203

Note Linking based on time does not improve over linkingignoring time

Table 21 Evaluation of Including Time p-Values ofSigned Rank Test

All TP2 TP12 Day

RTB 1 mdash 0 0002 0 0002 0 0006RTB 2 mdash 0 0002 0 0002 0 0002RTB 3 mdash 0 0002 0 0002 00065RTB 4 mdash 0 0002 0 0002 0 0004

ecosystem as well as users who have similar intereststo particular ldquoseedrdquo users Location data such asIP addresses are readily available across the adver-tising ecosystem The most direct use of the GSN formobile ad targeting is to seed a targeting campaignwith users chosen to exhibit some characteristics ofinterest These could be chosen through any of themyriad of methods currently used in advertising TheGSN will directly allow the targeting of the othermembers of these usersrsquo geosimilarity cohorts whichwill include other instances of the same user othersimilar users and likely both So for example if a setof users has been identified to have brand affinity viavisits to a brandrsquos website the GSN cohorts of theseusers would provide an attractive avenue to expandthe reach of a campaign to target them (extendingtraditional retargeting) and also target users similarto the seeds (customer prospecting) In the researchdata set we see that different instances of the sameusers are very often connected in the GSN and thatthose connected in the GSN exhibit substantially sim-ilar interests

The GSN also could be used for (privacy-friendly)hyperlocal targetingmdashmeaning targeting people whofrequent a particular location without needing to storedata on the actual locations of the users as describedearlier IP addresses currently are used by somemarketers for coarse-grained demographic targetinginferring geography from IP address registration data

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers264 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

and then inferring demographics from the geographyThe GSN provides a complementary alternative it willconnect people who visit the same fine-grained loca-tions As described above in the couponing exampleif a campaign is seeded with customers of a localestablishment (eg via an online loyalty program)geosimilarity can target others who frequent the samelocations

An exciting potential future use of the GSN is theevaluation of marketing campaigns across differentchannels This has become an important challenge formobile advertizing (Lee 2014)14 For example I mightreceive an ad for a hot new product on my mobiledevice I may be interested in the product but Irsquomunlikely to buy this product on my mobile phoneRather Irsquoll buy the product using my laptop or PC Bylinking the screens of the same user we are now ableto provide a broader (and possibly much more robust)estimate of campaign success (eg conversion rate)across different channels by aggregating the successmetrics across GSN neighbors For example similar tohow a traditional ad campaign may runnot run anad in a controlled study across different cities agen-cies could targetnot target different (matched) usersand then look at the success rates in their respec-tive geosimilarity neighborhoods This should captureeffects of the same individual on different devices

As discussed at the outset advertisers need to becognizant of privacy concerns regarding the collec-tion storage and use of data This paper followswhat the FTC calls the ldquoprivacy by designrdquo approachBy design the technique does not need to store anydirect personal information about mobile users thereis no need for nonanonymized identifiers demo-graphics geographics psychographics etc In addi-tion the storage of indirect information about userscan be severely limited as well The method doesnot need the actual locationsmdashonly anonymized loca-tion ldquokeysrdquo So for example IP addresses can bereplaced with random numbers15 without affectingthe GSN performance (European Commission 2014Federal Trade Commission 2012) More technically atthe ldquoouter wallrdquo of the system or firm each deviceid can be irreversibly hashed to a random key Theonly requirement is that the same device be hashedto the same key if encountered again Similarly atthe ldquoouter wallrdquo of the system or firm every loca-tion also can be irreversibly hashed to a random keyThe GSN can be formed just the same with the ran-dom keys as with the actual locations If more privacy

14 httpwwwclickzcomclickznews233699680-of-marketers-will-run-cross-channel-marketing-campaigns-in-2014-study15 See both the FTCrsquos and the European Commissionrsquos 2010ndash2011privacy-by-design reports and the comments therein along withProvost et al (2009) for a thorough treatment of this topic

is desired hashing can be done irreversibly many toone and in that case it becomes impossible to asso-ciate definitively any particular location with any par-ticular deviceuser

Finally an aspect of this research that is importantfor managers but not typically covered in the predic-tive modeling literature is the notion of increasing thereach of a campaign For example as discussed abovepossibly the most straightforward use of the GSN isto select the same actor on different mobile devicesThis would allow us to expand the reach of a ldquoretar-getingrdquo campaign16

Increasing reach has important subtleties in the cur-rent advertising ecosystemmdashit is not just the otherside of the coin of increasing lift The reason is thatthere are many targeters in the onlinemobile adver-tising ecosystem all of whom are using the samedata retargeting data and demographicgeographicpsychographic data that are purchased from third-party data providers However these data are avail-able only on a subset of devices Thus all partieswho are using these data are competing for the samesometimes small set of devices Since large advertis-ers typically contract with multiple targeting firmswho take different strategies the effect is that theadvertiser is paying the firms to compete against eachother effectively raising the price in the auction andthereby raising their own cost of advertising17 Whatis more these will be the same devices that also willbe targeted for other campaigns because they are thedevices for which the targeters have data

However by connecting devices in a GSN we canexpand campaigns to devices for which there are noretargeting or third-party data available at all If thereis less competition for these devices we should beable to target them for a lower price Thus expand-ing reach has an implication for the cost-effectivenessof achieving a certain level of predictive performance(eg lift)

AcknowledgmentsThe authors thank the anonymous reviewers and edi-tors for very constructive suggestions which substantiallyimproved the paper This work was done while the firstauthor was at Coriolis Ventures and Everyscreen MediaThe authors extend our gratitude to Everyscreen MediaDstillery Lauren Moores Tom OrsquoReilly Vinayak Javaly andTina Eliassi-Rad for many discussions about mobile adver-tising and mobile ad targeting as well as technical assistance

16 Retargeting is to target browsers who have previously purchasedfrom the brand or who have taken some other indicative brandaction such as browsing the brandrsquos site Retargeting is consid-ered by many to be one of the most effective targeting strategies(although to our knowledge these conclusions are drawn based onassessing conversion rate rather than the influence of the advertis-ing please see Stitelman et al 2011)17 The latter is our conjecture

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 265

to obtain all of the relevant data This particular data set wasnot necessarily used in the development of any productionmodel used for mobile advertising The first author thanksNEC and Andre Meyer for faculty fellowships The authorsthank the Moore and Sloan Foundations for their generoussupport of the Moore-Sloan Data Science Environment atNew York University

ReferencesAgarwal R Gupta AK Kraut RE (2008) Editorial overviewmdashThe

interplay between digital and social networks Inform SystemsRes 19(3)243ndash252

Aral S Muchnik L Sundararajan A (2009) Distinguishing influence-based contagion from homophily-driven diffusion in dynamicnetworks Proc National Acad Sci USA 106(51)21544ndash21549

Bampo M Ewing MT Mather DR Stewart D Wallace M (2008)The effects of the social structure of digital networks on viralmarketing performance Inform Systems Res 19(3)273ndash290

Cho E Myers SA Leskovec J (2011) Friendship and mobility Usermovement in location-based social networks Proc 17th ACMSIGKDD Internat Conf Knowledge Discovery and Data MiningKDD rsquo11 (ACM New York) 1082ndash1090

Cortes C Pregibon D Volinsky CT (2001) Communities of interestHoffmann F Hand DJ Adams N Fisher D Guimaraes G edsAdvances in Intelligent Data Analysis Lecture Notes ComputSci Vol 2189 (Springer-Verlag Berlin Heidelberg) 105ndash114

Crandall DJ Backstrom L Cosley D Suri S Huttenlocher DKleinberg J (2010) Inferring social ties from geographic coinci-dences Proc National Acad Sci USA 107(52)22436ndash22441

de Montjoye YA Hidalgo CA Verleysen M Blondel VD (2013)Unique in the crowd The privacy bounds of human mobilitySci Rep 31ndash5

European Commission (2014) Data protection day 2014 Full speedon EU data protection reform Press release January 27httpeuropaeurapidpress-release_MEMO-14-60_enhtm

Fawcett T (2006) An introduction to ROC analysis Pattern Recogni-tion Lett 27(8)861ndash874

Federal Trade Commission (2012) Protecting consumer pri-vacy in an era of rapid change FTC Report March 2012httpswwwftcgovsitesdefaultfilesdocumentsreportsfederal-trade-commission-report-protecting-consumer-privacy-era-rapid-change-recommendations120326privacyreportpdf

Heidemann J Klier M Probst F (2010) Identifying key users inonline social networks A pagerank based approach InformSystems J 4801(December)157ndash160

Hill S Provost F (2003) The myth of the double-blind reviewAuthor identification using only citations SIGKDD Explorations5(2)179ndash184

Hill S Provost F Volinsky C (2006) Network-based marketingIdentifying likely adopters via consumer networks Statist Sci22(2)256ndash276

Hotho A Nuumlrnberger A Paass G (2005) A brief survey of text min-ing LDV Forum 20(1)19ndash62

Kerho S (2012) Mobile marketingmdashA new analytics frameworkwhat we have and what we need Presentation Marketing onthe Move Conference Wharton School University of PennsylvaniaPhiladelphia February 28

Kossinets G Watts DJ (2009) Origins of homophily in an evolvingsocial network Amer J Sociol 115(2)405ndash450

Lee J (2014) 80 of marketers will run cross-channel marketingcampaigns in 2014 ClickZ (March 31) httpwwwclickzcomclickznews233699680-of-marketers-will-run-cross-channel-marketing-campaigns-in-2014-study

Martens D Provost F (2011) Pseudo-social network targeting fromconsumer transaction data Working paper CEDER-11-05 SternSchool of Business New York University New York

McPherson M Smith-Lovin L Cook JM (2001) Birds of afeather Homophily in social networks Annual Rev Sociol 27415ndash444

Oinas-Kukkonen H Lyytinen K Yoo Y (2010) Social networks andinformation systems Ongoing and future research streamsJ Assoc Inform Systems 11(2)61ndash68

Pan W Aharony N Pentland A (2011) Composite social networkfor predicting mobile apps installation Burgard W Roth Deds Proc AAAI 2011 (AAAI Press San Francisco)

Perlich C Dalessandro B Raeder T Stitelman O Provost F (2014)Machine learning for targeted display advertising Transferlearning in action Machine Learn 95(1)103ndash127

Pogue D (2011) Wrapping up the apple location brouhaha NewYork Times (April 28) httppogueblogsnytimescom20110428wrapping-up-the-apple-location-brouhaha

Provost F Fawcett T (2013) Data Science for Business What You Needto Know About Data Mining and Data-analytic Thinking (OrsquoReillyMedia Inc Sebastopol CA)

Provost F Dalessandro B Hook R Zhang X Murray A (2009) Audi-ence selection for on-line brand advertising Privacy-friendlysocial network targeting KDD rsquo09 Proc 15th ACM SIGKDDInternat Conf Knowledge Discovery and Data Mining (ACMNew York) 707ndash716

Pubmatic (2010) Understanding real-time bidding (RTB) from thepublisherrsquos perspective Technical report Pubmatic SiliconValley CA

Quercia D Lathia N Calabrese F Di Lorenzo G Crowcroft J (2010)Recommending social events from mobile phone location dataProc 2010 IEEE Internat Conf Data Mining ICDM rsquo10 (IEEEComputer Society Washington DC) 971ndash976

Raeder T Stitelman O Dalessandro B Perlich C Provost F (2012)Design principles of massive robust prediction systems Proc18th ACM SIGKDD Internat Conf Knowledge Discovery and DataMining (ACM New York) 1357ndash1365

Stitelman O Dalessandro B Perlich C Provost F (2011) Estimatingthe effect of online display advertising on browser conversionData Mining and Audience Intelligence for Advertising ADKDD2011 (ACM New York) 8

Stitelman O Perlich C Dalessandro B Hook R Raeder T Provost F(2013) Using co-visitation networks for detecting large scaleonline display advertising exchange fraud Proc 19th ACMSIGKDD Internat Conf Knowledge Discovery and Data Mining(ACM New York) 1240ndash1248

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Page 12: Finding Similar Mobile Consumers with a Privacy-Friendly ...pages.stern.nyu.edu/~fprovost/Papers/PMM_ISR_2015.pdf · Provost, Martens, and Murray: Finding Similar Mobile Consumers

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 253

Figure 7 Evolution of Several Sample Statistics as an Increasing Number of Cohorts are Created

0 100 200 300 400 5000

10

20

30

40

50

60

70

Average No network neighbors

0 100 200 300 400 5000

2

4

6

8

10

12

Median No network neighbors

0 100 200 300 400 5000

50

100

150

200

250

300

350

Average No IPs

0 100 200 300 400 5000

2

4

6

8Median No IPs

Notes The horizontal axis denotes the number of cohorts By 500 cohorts the statistics have become stable

Figure 8 (Color online) Cumulative Frequency of Life Span ofUser Instances

100 101 102 103 104 105 1060

01

02

03

04

05

06

07

08

09

10

x = lifespan (sec)

F(x

)

End of June (mobile only)

larr 1 min larr 1 hour larr 1 day

Cumulative frequency of user instances

6 Study 1 Connecting Instances ofthe Same Individual

Returning to our motivating application as men-tioned previously an important task in mobile adver-tising is to be able to target a particular individual inthe face of the individualrsquos fragmentation into differ-ent user instances (for example as seen through dif-ferent bidding systems) Advertisers understand that

it may be impossible to find the same user with 100accuracy Nonetheless it often is important for themto reach the userrsquos various instances without target-ing too many others9 The GSN is designed in partbased on the hypothesis that different instances of thesame person would visit similar locations and there-fore they will be connected in the GSN In this sectionwe present results examining whether and to whatextent that is indeed the case Furthermore the resultsquantify to what extent the same user receives a highGSN-similarity to herself as compared to her otherneighbors

Three experimental settings are used to assess judg-ments that two users are the same The first two sim-ulate different user instances by splitting up the bidrequests of known user instances The last one is notsimulated it links users via the recently introducedidentifier for advertising (IFA) observed over severalRTB exchanges for a subset of device instances

1 Random We divide the IP address visits of a userrandomly across two simulated users Table 6 showsfor example how we can divide a userrsquos IP visits across

9 The astute reader may notice that if the other users targeted arevery similar to the target user the advertising may be well placeddespite not being shown to the exact target user That is not thesubject of this study Study 2 shows evidence that these two notionsare not independent when using the GSN

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers254 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

Table 6 Randomly Splitting User X rsquos IP VisitsCreating Two New Artificial Users X1 and X2

IP1 IP2 IP3

X 1 5 0X1 0 3 0X2 1 2 0

Table 7 The Volume of IFA Occurrence Within OurData Set

Volume

RTB 1 119401855RTB 2 113831962RTB 3 7891292

Table 8 The Frequency of Unique IFAs Visible AcrossRTB Exchanges

Frequency

RTB 1RTB 2 3061594 (1016)RTB 2RTB 3 451256 (169)RTB 3RTB 1 181417 (085)

two new users X1 and X2 for which we know thatthey are actually the same individual Results withRandom should be used only as a ceiling on perfor-mance because the results will be overly optimistic

2 Temporal A more realistic split-up of the trans-actions is a temporal split-up where first the middleday of Xrsquos transactions is determined (hence dynam-ically chosen for each user) All IP address visits upto this middle day are assigned to X1 and all trans-actions after the middle day are assigned to X2 Itwill only be applied to users who are seen on multi-ple days

3 Identifier for advertising IFA is an advertisingidentifier from Apple that users can change or asknot to be used for advertising and is seen as aprivacy-friendlier version of the controversial fixeddevice ID (the ldquoUDIDrdquo) The IFA is observed in threeexchanges with volumes shown in Table 7 for exam-ple 1940855 unique IFAs are seen in RTB 110 Whenwe observe the same IFA across different exchangeswe now know that two devices in different exchangesactually are the same The frequency of unique IFAsvisible across exchanges is shown in Table 8 For exam-ple a total of 3018223 IFAs are seen when we look atboth RTB 1 and RTB 2 with 306594 of these (1016)seen on both exchange 1 and exchange 2 What ifthere were not an IFA and the device would havehad a different identifier in each exchange To what

10 For confidentiality RTB 1 in this setting is not necessarily thesame as RTB 1 in sectsect4 or 7

extent are these two device instances then connectedin the GSN

With these different ldquogold standardrdquo methods toindicate whether or not two users are the samewe assess the ability of the geosimilarity metricsto link the same user determining how often twoinstances of the same person are connected in theGSN (sect61) Afterward we shall assess the differentGSN strength metricsrsquo ability to rank users determin-ing to what extent two instances of the same personexhibit strong geosimilarity as compared to the otherneighbors (sect62) For the random and temporal split-ups 200 samples are used For the IFA-based methodthe complete population is used for measuring thedegree of connectedness to the same user to analyzethe relative strength of the connected users 200 sam-ples are used

61 Same Individual Different ldquoScreensrdquoWe now present results on the GSN connectednessmeasured as how often two instances of the same userare connected in the GSN (in percentage) Note thatthe results summarized in Table 9 should be inter-preted in light of the limited number of days in build-ing the research data set which will limit the GSNconnectivity Nonetheless the relatively high connec-tivities observed even with this sample provide quitepromising results which lends considerable supportto the merit of the overall design

Random0 Using the random split two users thatcorrespond to the same individual are not connectedin only 9 of the neighborhoods Moreover in thesecases user X (whose IP visits are split up) usuallyhas only one or two entries and the random split-upresults in no visits for one of the two simulated users

Temporal0 The temporal split-up is more realisticthan the random one where we assume that a personuses simulated device 1 in the first half of the week(eg during the weekend) and simulated device 2 inthe second half of the week (eg during the week)

Users X1 and X2 are not connected in 32 of theneighborhoods Whereas the random split-up was toooptimistic the temporal one is too pessimistic as inreality there likely is some overlap in the time peri-ods of use of two different devices If user X visitsan IP address only in the first days of the week this

Table 9 To What Extent is the Same User Connectedto Herself

connected

Random 91Temporal 68IFAmdashRTB 1RTB 2 67IFAmdashRTB 2RTB 3 82IFAmdashRTB 3RTB 1 73

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 255

IP address will not occur in user X2rsquos transactions andreduces the chance to be connected to X1

IFA0 We look at all combinations of the three ex-changes with IFA identifiers For example there are45256 unique IFAs seen on both RTB 2 and RTB 3 In82 of these the two IFAs visit at least one same IP onboth exchanges and hence are connected in our GSN

The bottom line is that even with this limited slice ofonline behavior different instances of the same indi-vidual are connected in the GSN 70ndash80 of the timeEncouragingly the results from the simulated scenar-ios concur with the results from the real (IFA) sce-narios (including the aforementioned optimism andslight pessimism of the two simulated scenarios)Consistency over these different settings gives addi-tional confidence in the results suggesting that theGSN indeed holds promise for a high degree of suc-cess at targeting the same user across the fragmenta-tion of the online advertising ecosystem

62 Ranking User InstancesNext given that two user instances of the same indi-vidual are indeed connected we assess to what extentthe second instance of the same user is ranked highlyamong its network neighbors based on the weightof their linkage in the GSN We first will discuss themost general results using the best ranking methodand then will look across different ranking methods

We decompose the results based on the numberof network neighbors (NNs) since there are twoboundary cases that require special interpretationTables 10ndash14 show the proportions of the ego net-works that fall into three scenarios First a user mayhave only one NN In this scenario the ranking isperfect (trivially)mdashthe other instance of the same useris the only user instance given a nonzero score As

Table 10 In What Portion of Cases Is the Same UserConnected Depending on the Number ofNetwork NeighborsmdashRandom Split-up

Ego has only 1 NN 3104Ego has exactly 2 NNs 1905

Same user ranked first 8005Ego has more than 2 NNs 4901

Note For the scenario with two NNs the table reports thepercentage of cases where the other known instance of thesame user is ranked first (including ties)

Table 11 In What Portion of Cases Is the Same UserConnected Depending on the Number ofNetwork NeighborsmdashTemporal Split-up

Ego has only 1 NN 3008Ego has exactly 2 NNs 1807

Same user ranked first 8309Ego has more than 2 NNs 5005

Table 12 In What Portion of Cases Is the Same UserConnected Depending on the Number ofNetwork NeighborsmdashIFA RTB 1RTB 2Split-up

Ego has only 1 NN 4401Ego has exactly 2 NNs 1201

Same user ranked first 8201Ego has more than 2 NNs 4400

Table 13 In What Portion of Cases Is the Same UserConnected Depending on the Number ofNetwork NeighborsmdashIFA RTB 2RTB 3Split-up

Ego has only 1 NN 6109Ego has exactly 2 NNs 1907

Same user ranked first 7102Ego has more than 2 NNs 1804

Table 14 In What Portion of Cases Is the Same UserConnected Depending on the Number ofNetwork NeighborsmdashIFA RTB 3RTB 1Split-up

Ego has only 1 NN 5305Ego has exactly 2 NNs 2501

Same user ranked first 6304Ego has more than 2 NNs 2104

shown in the tables this scenario accounts for 30ndash60 of the ego networks (depending on the split-upmethod)

The second boundary case is when the ego has ex-actly two NNs This scenario accounts for 10ndash25of the cases depending on the split-up method Herethe tables also report whether the other known userinstance is ranked first (including ties) among thetwo neighbors The other known instance of the sameuser is ranked first 60ndash80 of the time dependingon the split-up scenario The IFA results are affectedstrongly by the particular RTB settingmdashthis may indi-cate that the publishers on the different RTBs have dif-ferent policies for passing the IFA Importantly theseresults are conservativemdashpossibly very conservativeThe other NNs may also be instances of the same userunbeknownst to usmdashfor example in the IFA casebecause an app does not pass the IFA to the RTB orbecause the other user is the same user on a differentdevice and thus does not share the IFA

The final scenario is when more than two NNsare present As shown in the tables this accounts for20ndash50 of the cases depending on the split-up sce-nario In this case we measure the percentile in theranking where the instance of the known same userfalls (see below) as well as the area under the receiveroperating characteristic curve (AUC) (Fawcett 2006)which is equivalent to the MannndashWhitneyndashWilcoxon(MWW) statistic The AUCMWW measures to what

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers256 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

Table 15 When There Are More Than Two Network Neighbors to What Extent Is the Known Same User StronglyConnected to Itself

AUC (if gt 2 NNs) () Average rank percentile (if gt 2 NNs) ()

Ranking metric R T RTB 12 RTB 23 RTB 31 R T RTB 12 RTB 23 RTB 31

1 78 64 68 59 55 29 48 34 43 542 59 58 67 58 54 45 53 34 44 553 79 69 74 61 61 29 44 29 40 484 86 75 69 61 59 24 40 32 41 505 83 70 71 59 63 26 44 31 43 476 85 74 71 60 65 25 41 31 42 457 86 76 73 62 64 24 40 29 40 468 83 70 72 59 65 26 44 30 43 459 85 74 71 60 67 25 41 30 42 44

10 79 69 69 61 63 29 44 32 41 4711 79 69 71 62 64 44 44 30 41 4612 81 68 68 60 62 28 46 34 42 4913 81 70 71 61 64 27 44 30 40 4614 74 67 66 61 48 34 47 35 41 6015 74 67 66 61 48 34 47 35 41 60

Note See the text for a discussion of ranking metrics

extent data points with label 1 are ranked higherthan those with label 0 In the results that follow(see Table 15) the AUC is measured for each samplewhere any user instance among the neighbors thatcorresponds to the same individual is labeled as 1and all others are labeled as 0 Thus in this settingthe AUC measures how well the neighbors are rankedby the likelihood of being the same individual11 Theaverage over all samples is reported If for a sampleall neighbors have the same score that AUC is setto 05 In addition to the factors noted above this alsowill tend to make the results conservative For exam-ple if the GSN links an ego user to a set of neighborsthat are all instances of the same individual in thisevaluation the AUC will be 05

The ranking measure is the percentile in the rank-ing at which the known-same-user instance is rankedwith lower values being better For example if thesame-user instance has the highest score (closest con-nection) among five neighbors the rank is 20 (15)Note again that this is quite conservative in thisexample the user is ranked at the top of the list butbecause there were only five neighbors the highestpossible score is 20 In case the known-same-userinstance has the same score as another neighbor theaverage rank is reported

As shown in Table 15 all AUC values exceed 05ranging up to 08 This means that in every casea known-same-user instance is likely to be rankedhigher than an instance not known to be the same

11 Technically in this setting the AUCMWW is the probability thata randomly selected neighbor that is in fact a known instance of thesame user is ranked more highly than a randomly selected neighborthat is not known to be an instance of the same user If the known-same-user instances are all ranked above the other-user instancesthe AUC = 100 If they are all below then the AUC = 0

user The frequency-based ranking metrics (4ndash9) per-form quite well in connecting instances of the sameuser strongly

Almost all metrics perform better than a randommodel with performances that even come close to thebest possible solution Consistently performing quitewell are the frequency-based metrics (metrics 3ndash9)with metric 7 (the minimum of the IDF weights ofthe shared locations) getting the most wins Not per-forming well is the very basic count metric as well asthe Boolean metric that provides a binary score onlyInteresting to note the Jaccard metrics do not performwell either It seems that penalizing users that visitmany IP addresses negatively affects the results anissue we will return to in the next study

7 Study 2 Does the GSN Select Userswith Similar Interests

The foregoing section addressed our paperrsquos firstclaimmdashthat the GSN design indeed links instances ofthe same user and links them more strongly thaninstances of other users In this section we turn to thepaperrsquos second claimmdashthat the GSN links users withsimilar interests We will measure similar interests bysimilarity in behavior accessing particular publisherson the Web and using mobile apps12

71 Connecting Users with Similar Interests

To What Extent Are the GSN Neighbors of Visi-tors to Particular Publishers Also Visitors to ThoseSame Publishers To assess whether the GSN con-nects users with similar interestsbehavior we select

12 From now on all user instances will be complete real user in-stances the simulated split-up scenarios (Random Temporal) werecreated specifically for the purpose of the same-user study

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 257

Table 16 Distribution of Publishers per RTB in Terms of thePercentage of Users in Our Data Set That Visit the GivenPublisher

Publisher of D

RTB 1Burstly 3004myYearbookcom 1062My Fitness Pal 0065TextMe Inc 0056GameResort 0025Flixster 0019247Sports 0015Daily Workout Apps LLC 0009Mobile Deluxe 0007YoYo Games Ltd 0004

RTB 2mworldnowcom 0032mtmzcom 0024mtopixcom 0014babycentercom 0015meetmecom 0006appsfacebookcom 0006itunesapplecom 0039appevitecom 0002beautyandskincarescom 0003mmajunkiecom 0001

RTB 3Conversion Exchange 5053Pinger Inc 2096Top Game Developer 4029PremiumEntertainmentApp-FamilyOriented-Android 0059Enflick 0053Clapfoot Games 0044FunPokes Inc 0018Talkatone 0015Sevenlogics Inc 0008A Star Software 0007

RTB 4Pinger PhonemdashiOSmdashConversation 1001Talkatone iPhone App 0007DiceWithBuddies 00139GAG Reader 0012Video Downloader Pro Lite 0016Rage Comics 0009Spades 3D Lite 0006Apalabrados 0003Relax Melodies HD 0002Celeb MemdashPhotoMaker 0000

a diverse collection of 40 publishers listed in Table 16A publisher can correspond to a mobile app a web-site or a set of appswebsites We selected both pub-lishers that are visited by many users and more nichepublishers with fewer visitors As can be seen fromTable 16 several types of publishers can be distin-guished the main groups are games (eg GameRe-sort Top Game Developer) and social networkingtools (eg appsfacebookcom Text Me Inc)

More specifically we create 200 samples for 40 pub-lishers 10 publishers for each of four different RTB

exchanges listed in Table 1613 The percentage ofunique users that visit each publisher in our completedata set is also shown (D) To create each samplean ldquoegordquo user is selected that has visited the pub-lisher in question (an ldquoego visitorrdquo) this ego visi-torrsquos GSN neighborhood is created Next we assesshow many of the egorsquos neighbors also visited thatsame publisher This will be compared with a base-line visit rate to produce lift and leverage values(Provost and Fawcett 2013)mdashspecifically how muchmore likely is it for the neighbors of ego visitors tovisit the publisher than would be expected by chanceWe consider three different setups that produce dif-ferent baselines

1 Considering all RTB exchanges to determine the base-line A user is considered a visitor for a given pub-lisher if she visits the publisher on any RTB ex-change (since some publishers are visible on differentexchanges) Hence the baseline visit rate is the num-ber of users (on any RTB) who visit that publisherdivided by the total number of users across all RTBexchanges This will provide optimistic lifts sincemost publishers appear only on one RTB exchangewhereas the baseline is measured over all exchanges

2 Considering one RTB exchange only to determine thebaseline A user is considered a visitor for a given pub-lisher and RTB if she visits the publisher on the sameRTB exchange This changes the baseline to the num-ber of users (on this RTB) who visit that publisherdivided by the total number of users on this RTBexchange The lift and leverage results will be worsethan in the first setting because the baseline will belarger Seeing that some publishers are seen on differ-ent RTB exchanges this setting is slightly pessimistic

3 Taking time of location visits into account In thethird setting we link users only if they visit an IPwithin the same time period This will be elaboratedon in sect73

Results are shown in Tables 17 and 18 The percent-age of publisher visitors in the neighborhood is givenas N This percentage is compared with the baselinepercentage of users that visit that publisher in thecomplete data D using two metrics the lift is the ratioof these numbers the leverage is the difference Liftis particularly useful for very small base rates (con-sider that an activity with a base rate of 30 cannotpossibly have a lift higher than about 3) Leverage ismore telling for larger base rates because it shows theabsolute improvement (here in percentage points)

When considering all NNs the lifts for all publish-ers are greater than 1 except for one Burstly on RTB 1has a lift of 097 (essentially no lift) but only when weconsider RTB 1 as the baseline This publisher already

13 For confidentiality the RTB numbering here is again differentfrom above

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers258 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

Table 17 RTBs 1 and 2 Percentage of Network Neighbors That Also Visit the Publisher

RTB 1

All NNs N1 N10

D N Lift Lev N1 Lift Lev N10 Lift Lev

Considering all RTB exchanges to determine baseline1 Burstly 3004 8063 2084 5059 47022 15054 44018 25061 8043 220572 myYearbookcom 1062 25077 15094 24015 44074 27067 43012 35016 21074 330543 My Fitness Pal 0065 4015 6038 305 3609 56077 36025 15055 23091 14094 TextMe Inc 0056 1805 33006 17094 39002 69075 38046 25018 45 240625 GameResort 0025 4093 19077 4068 18052 74026 18027 11086 47058 110626 Flixster 0019 1055 8015 1036 8014 42085 7095 1079 9042 1067 247Sports 0015 27057 188058 27042 77062 530093 77048 74053 509075 740388 Daily Workout Apps LLC 0009 1092 21047 1083 19023 215046 19014 11054 129027 110459 Mobile Deluxe 0007 1087 28067 108 23053 36009 23046 9063 14707 9056

10 YoYo Games Ltd 0004 3052 99058 3049 21057 609064 21053 13 367045 12096Considering this RTB exchange only to determine baseline

1 Burstly 8085 8063 0097 minus0022 47022 5033 38037 25061 2089 160762 myYearbookcom 4071 25077 5047 21006 44074 905 40003 35016 7046 300453 My Fitness Pal 1089 4015 2019 2025 3609 19048 35001 15055 8021 130654 TextMe Inc 1063 1805 11035 16087 39002 23094 37039 25018 15045 230555 GameResort 0073 4093 6078 402 18052 25049 17079 11086 16033 110146 Flixster 0055 1055 208 0099 8014 14071 7059 1079 3023 10247 247Sports 0043 27057 64073 27014 77062 182023 7702 74053 174097 74018 Daily Workout Apps LLC 0026 1092 7037 1066 19023 73095 18097 11054 44037 110289 Mobile Deluxe 0019 1087 9084 1068 23053 123088 23034 9063 5007 9044

10 YoYo Games Ltd 001 3052 34018 3042 21057 209025 21047 13 126012 1209

RTB 2

Considering all RTB exchanges to determine baseline1 mworldnowcom 0032 13027 40086 12095 54039 167042 54006 31065 97044 310332 mtmzcom 0024 6089 28086 6065 23008 9607 22084 10068 44075 100443 mtopixcom 0014 15007 104079 14093 53045 37106 5303 33075 234065 330614 babycentercom 0015 6087 47031 6072 43048 299056 43033 17074 122024 17065 meetmecom 0006 6002 10308 5097 54072 942093 54066 27085 47909 270796 appsfacebookcom 0006 301 48039 3003 9009 141097 9003 2076 43008 20697 itunesapplecom 0039 25077 65071 25038 51043 131012 51004 29049 75018 290098 appevitecom 0002 48028 21212043 48025 54076 21509068 54074 48028 21212043 480259 beautyandskincarescom 0003 18019 642087 18016 8048 299094 8046 1107 413049 11067

10 mmajunkiecom 0001 1036 201064 1035 26067 3195505 26066 13033 11977074 13033Considering this RTB exchange only to determine baseline

1 mworldnowcom 4034 13027 3006 8093 54039 12053 50005 31065 7029 270312 mtmzcom 3019 6089 2016 307 23008 7024 19089 10068 3035 70493 mtopixcom 1092 15007 7084 13015 53045 27081 51053 33075 17056 310834 babycentercom 1094 6087 3054 4093 43048 22042 41054 17074 9015 15085 meetmecom 0078 6002 7077 5025 54072 70057 53094 27085 35092 270076 appsfacebookcom 0086 301 3062 2024 9009 10063 8024 2076 3022 1097 itunesapplecom 5024 25077 4092 20053 51043 9081 46019 29049 5063 240258 appevitecom 0029 48028 165058 47098 54076 187082 54047 48028 165058 470989 beautyandskincarescom 0038 18019 48011 17081 8048 22045 8011 1107 30095 11032

10 mmajunkiecom 0009 1036 15009 1027 26067 296003 26058 13033 148001 13024

Notes Results are given for the complete data set (D) the complete neighborhood of a user who visited the given publisher (All NNs) and considering theneighborhood of closest 1 and 10 users (N1 and N10)

has a very high baseline and as we will see in thenext section when we consider only the top 1 and 10NNs the lifts go up to 5 and 3 respectively Burstlyprovides a generic platform for building apps and sois a weaker indication of user interest than many ofthe other publishers (fitness babies games)

All other lifts exceed one by substantial mar-gins and go up to even 2391 (Apalabrados onRTB 4 a Spanish-language social Scrabble-like game)

Obviously the conclusion that the NNs indeed aremore likely to visit the same publishers is highly sig-nificant by a simple sign test comparing the visita-tion rate within the geosimilarity neighborhood tothat of overall population either via lift or leveragep lt 0001 Thus NNs of publisher visitors are indeedmore likely also to visit the same websites the abso-lute values of the lifts and leverages suggest that theyare substantially more likely

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 259

Table 18 RTBs 3 and 4 Percentage of Network Neighbors That Also Visit the Publisher

RTB 3

All NNs N1 N10

D N Lift Lev N1 Lift Lev N10 Lift Lev

Considering all RTB exchanges to determine baseline1 Conversion Exchange 5053 34054 6024 29001 64071 11069 59017 50077 9018 450242 Pinger Inc 2096 46002 15053 43006 65012 21097 62015 51048 17037 480523 Top Game Developer 4029 47024 11002 42095 60 13099 55071 48097 11042 440684 PremiumEntertainmentApp-FamilyOriented-Android 0059 23079 40012 2302 20045 34049 19086 22025 37052 210665 Enflick 0053 11071 21095 11017 21057 40044 21004 12027 23 110736 Clapfoot Games 0044 8055 19032 8011 11054 26006 1101 13099 31061 130557 FunPokes Inc 0018 11011 61028 10093 50 275077 49082 15079 87008 150618 Talkatone 0015 13013 88051 12098 31011 209076 30096 17091 120078 170779 Sevenlogics Inc 0008 2071 32059 2063 5088 70074 508 8033 100022 8025

10 A Star Software 0007 3033 44086 3025 23008 311024 23 9009 122061 9002Considering this RTB exchange only to determine baseline

1 Conversion Exchange 11045 34054 3002 23009 64071 5065 53025 50077 4043 390322 Pinger Inc 6013 46002 705 39089 65012 10062 58098 51048 8039 450353 Top Game Developer 8088 47024 5032 38036 60 6076 51012 48097 5052 400094 PremiumEntertainmentApp-FamilyOriented-Android 1023 23079 19038 22057 20045 16066 19023 22025 18013 210025 Enflick 101 11071 1006 1006 21057 19054 20046 12027 11011 110166 Clapfoot Games 0092 8055 9033 7064 11054 12059 10062 13099 15027 130087 FunPokes Inc 0038 11011 2906 10074 50 133022 49062 15079 42007 150418 Talkatone 0031 13013 42076 12082 31011 101034 3008 17091 58035 170619 Sevenlogics Inc 0017 2071 15074 2054 5088 34017 5071 8033 48041 8016

10 A Star Software 0015 3033 21067 3017 23008 150036 22092 9009 59023 8094

RTB 4

Considering all RTB exchanges to determine baseline1 Pinger PhonemdashiOSmdashConversation 1001 71081 70085 7008 80077 79068 79076 70014 6902 690132 Talkatone iPhone App 0007 11038 17203 11031 29003 439068 28097 17002 257078 160963 DiceWithBuddies 0013 7041 59004 7028 18018 144092 18006 7069 61031 70574 9GAG Reader 0012 62013 497099 62 49009 39305 48097 49005 393017 480925 Video Downloader Pro Lite 0016 20003 123015 19087 22058 138084 22042 17093 110027 170776 Rage Comics 0009 609 8001 6081 13033 154086 13025 609 8001 60817 Spades 3D Lite 0006 1205 225076 12044 27027 492057 27022 1205 225076 120448 Apalabrados 0003 77042 21391071 77039 75 21316097 74097 79032 21450043 790299 Relax Melodies HD 0002 31058 21002029 31056 50 31170029 49098 31058 21002029 31056

10 Celeb MemdashPhotoMaker 0 3023 82107 3022 7014 11819048 7014 3023 82107 3022Considering this RTB exchange only to determine baseline

1 Pinger PhonemdashiOSmdashConversation 8051 71081 8044 6303 80077 9049 72026 70014 8024 610632 Talkatone iPhone App 0055 11038 20053 10082 29003 52038 28048 17002 30071 160473 DiceWithBuddies 0009 7041 82011 7032 18018 201054 18009 7069 85027 7064 9GAG Reader 0097 62013 63097 61016 49009 50055 48012 49005 50051 480085 Video Downloader Pro Lite 0044 20003 45036 19059 22058 51014 22014 17093 40062 170496 Rage Comics 0037 609 18079 6053 13033 36032 12097 609 18079 60537 Spades 3D Lite 0046 1205 2609 12004 27027 58068 26081 1205 2609 120048 Apalabrados 0021 77042 371093 77021 75 360031 74079 79032 381006 790119 Relax Melodies HD 0013 31058 238054 31045 50 377069 49087 31058 238054 31045

10 Celeb MemdashPhotoMaker 0003 3023 97089 3019 7014 216076 7011 3023 97089 3019

Notes Results are given for the complete data set (D) the complete neighborhood of a user who visited the given publisher (All NNs) and considering theneighborhood of closest 1 and 10 users (N1 and N10)

The very high lifts could be attributed to two phe-nomena They may provide some evidence that theGSN embeds a true social network as pointed out bythe good results for the social networking publishers(eg Apalabrados with a lift of 2391 appsevitecomwith a lift of 2212 and others) Friends visiting thesame social networking sites (here unusual ones)

would increase the lift and leverage over just havingdifferent instances of the same user visiting the samewebsites An alternative explanation is that some ofthese publishers are much more likely to be visitedacross different devices of the same user It is not clearwhether this is actually true However it argues forobtaining specific and broad data on which devices

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers260 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

are not the same user to shed further light on forexample any broad data that include a (anonymized)user identifier

From these results we can conclude that GSNneighbors indeed exhibit substantially similar pub-lisher visitation behavior Next we assess to whatextent more strongly connected NNs are more similarthan less strongly connected NNs

72 Ranking Users with Similar Interests

Are the More-Similar Geosimilarity Neighbors of aPublisherrsquos Visitors Even More Likely Also to Be Vis-itors to the Same Publisher To assess whether more-similar geosimilarity neighbors are even more likelyto share interests we compute the measures exactlyas in the previous study except instead of consider-ing all NNs only the 1 (N1) and 10 (N10) neighborswith the highest scores are considered (see Tables 17and 18) First we must consider which of the 15 met-rics that are defined to measure link strength to useTable 19 reports the rank aggregations of the differentmetrics as to which provide the best lifts Specificallyfor each of the 40 publishers each scoring metric pro-vides a lift and for that publisher the scoring metricscan be ranked (the best being ranked 1 etc) The rankaggregation is the average rank for a scoring metricacross all 40 publishers Thus if one metric providedthe best lift for all publishers it would get an aggre-gated rank of 1

To visualize the best metrics the metrics with anaverage ranking less than 8 are shown in boldfacein Table 19 The Jaccard metrics do not perform wellin this study either showing that penalizing the con-nection to users that visit many locations is not sen-sible Remember that the Jaccard metric is based onthe idea that users that visit many locations will havea higher chance to share some location with anotheruser However we have so many locations in totalthat visiting a couple more locations will increase onlymarginally this probability of sharing a location bychance A user that logs into more IP addresses andis more active should hence not be penalized andranked lower than other less active users Metric 3which sums the IDF scores of the shared locationsperforms quite well Given the operational efficiencyadvantage over the frequency-based metrics this met-ric is chosen to measure the strength of the links

As seen from the results in Tables 17 and 18 thelift and leverage values for the strongly connected (top 1and top 10) NNs are even higher sometimes astro-nomical (3170 for Relax Melodies HD (RTB 4)) Theseresults are highly significant by sign tests compar-ing either the lift or the leverage between the top-ranked neighbors and the entire cohort (p lt 0001) Theabsolute values again indicate that the users with the

strongest geosimilarity are often substantially morelikely to visit the same publisher

Thus we can conclude that not only does geosim-ilarity find users with similar interests it also ranksusers well by their likelihood of having similarinterests

An interesting follow-up question is whether thecharacteristics of a publisher are of importance for theresults For example are GSN neighbors more sim-ilar in terms of using the same social network appas compared to playing the same games To answerthis question the publishers are categorized into sixclasses funny social news games communicationand miscellaneous In Figure 9 the lifts of the pub-lishers are shown per category Some general trendscan be observed where publishers related to funnycontent have the highest median lift This could beexplained by the fact that users often forward funnycontent to their friends The social websitesapps fol-low next which seems to further demonstrate thatfriends are likely linked in the GSN However pub-lishers in the communication category perform not sowell compared with the other categories and ratherhigh variances are observed in the different cate-gories Please note that high variance is observedacross the categories therefore it is difficult to makewell-supported claims from these results

73 TimingAs a final evaluation we explore whether the inclu-sion of temporal similarity in location visitation ishelpful for defining GSN connections To this end weintroduce a setting where two users are connectedonly if they visit an IP address in the same timeperiod of a day Specifically we divide a day intodifferent time windowsmdashfor example two time peri-ods of 12 hoursmdashwhere users are connected only ifthey visit the same IP address in the same time periodof a day (which might be on different days) Practi-cally two digits are added to an IP which denote thestarting moment of the corresponding time periodWe report on time periods of 2 hours and of 12 hours(other periods yield similar results) We also includethe setting where users are connected only when theyvisit the same IP address on the same day in light ofthe result of Crandall et al (2010) that visiting thesame location at around the same time is indicativeof friendship

The resulting lifts are reported in Figures 10ndash12when considering respectively all NNs the top 1 NNand the top 10 NNs The results when not consider-ing the time period are repeated as well (All) lim-ited to the conservative case of only considering oneexchange (see above) The baseline (lift of one) isindicated with a red horizontal line lifts higher than1000 are shown as 1000 to limit the range to be able

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 261

Table 19 Rank Aggregation per Metric Over All Publishers (Across All RTBs)

Top 1 NN Top 10 NNs

1 Shared unique IPs 801 1 Shared Unique IPs 802252 BooleanmdashSharing an IP 80625 2 BooleanmdashSharing an IP 809133 Shared unique IPsmdashIDF weighted 70025 3 Shared unique IPsmdashIDF weighted 709754 Shared IP visitsmdashMin 70063 4 Shared IP visitsmdashMin 704135 Shared IP visitsmdashMax 7095 5 Shared IP visitsmdashMax 70256 Shared IP visitsmdashAvg 8005 6 Shared IP visitsmdashAvg 704887 Shared IP visitsmdashMinmdashIDF weighted 60625 7 Shared IP visitsmdashMinmdashIDF weighted 703758 Shared IP visitsmdashMaxmdashIDF weighted 70975 8 Shared IP visitsmdashMaxmdashIDF weighted 702259 Shared IP visitsmdashAvgmdashIDF weighted 70938 9 Shared IP visitsmdashAvgmdashIDF weighted 70525

10 Cosine similarity 80338 10 Cosine similarity 804511 Cosine similaritymdashIDF weighted 80213 11 Cosine similaritymdashIDF weighted 8086312 GSN Jaccard 80925 12 GSN Jaccard 913 GSN JaccardmdashIDF weighted 80325 13 GSN JaccardmdashIDF weighted 9013814 GSN Jaccard Freq 80425 14 GSN Jaccard Freq 80515 GSN Jaccard FreqmdashIDF weighted 80425 15 GSN Jaccard FreqmdashIDF weighted 805

to visualize the results Tables 20 and 21 show theaverage rank aggregation (see the discussion of rankaggregation above) in terms of lift per RTB and thep-values of a signed rank test respectively

When comparing the different temporal settingswe see that the ldquoDayrdquo variant performs best and lim-ited differences exist between the time periods of 2and 12 hours More important to observe from theseresults is that (overall) not including time in the GSNdesign outperforms all three temporal variants sub-stantially Only for RTB 3 and the Day version whereusers are connected only if they visit the same IPaddress on the same day are the lifts not signifi-cantly worse than having no time constraint at a 1level (although even here they are significantly worseat a 10 level) This shows that for finding userswith similar interests which locations they visit ismore important than when they visit them By includ-ing the time constraints connections between previ-ously connected users are lost simply because they

Figure 9 (Color online) Lifts of the Publishers (for All Four RTBs)Ranked According to the Median Lift Within the Category(Medians Shown by the Full Line)

visit locations in different time periods Although weuse time to define links metrics that include time todefine the strength of a link might improve the resultsfurther and we consider that an interesting issue forfuture research

8 ConclusionThis paper presents a new design for using geosim-ilarity to connect user instances in a geosimilarity-weighted network In the GSN users are connected ifthey share at least one observed location Various met-rics can be used to yield degrees of similarity We pre-sented intuitive arguments theory and prior researchthat suggest that geosimilarity and the GSN designshould link similar users We also provided strongempirical support Specifically Study 1 shows thatthe GSN indeed links different users instances cor-responding to the same individual and the geosimi-larity metrics rank user instances by their likelihoodof corresponding to the same individual Study 2shows that the GSN links users who have similarinterests as measured by their propensity to visitthe same publishersuse the same apps It also pro-vides strong evidence that the geosimilarity metricsrank user instances by their likelihood of havingsimilar interests Taken together the results providestrong support that the GSN links users with similarinterestsmdashin many cases because they actually repre-sent the same individuals

The main limitation of the research presented inthis paper is that the data are not sufficient to dis-tinguish whether the similarity in interests is solelybased on connecting instances of the same user Thisis an important avenue for future research This lim-itation notwithstanding the very strong results fromStudy 2 have important implications in either caseeither the network is indeed finding different indi-viduals with substantially similar interests or we

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers262 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

Figure 10 (Color online) Lifts Considering All Network Neighbors per RTB System (Each Panel) With and Without the Time Constraint Limited toOne Network (see sect71) for the 40 Publishers

RTB 3mdashAll NNs

0 5 10

Publisher

0

20

40

60

80

100

Lift

RTB 4mdashAll NNs

0 5 10

Publisher

0

200

400

600

800

1000

Lift

RTB 2mdashAll NNs

0 5 10

Publisher

0

200

400

600

800

1000

Lift

0 5 100

200

400

600

800

1000RTB 1mdashAll NNs

Publisher

Lift

All

TP2

TP12

Day

Note Lifts higher than 1000 are shown at 1000

Figure 11 (Color online) Lifts Considering the Closest 1 Network Neighbor per RTB System (Each Panel) With and Without the Time ConstraintLimited to One Network (see sect71) for the 40 Publishers

Lift

0

100

200

400

300

RTB 3mdashTop 1 NNs

0 5 10

Publisher

All

TP2

TP12

Day

0 5 100

200

400

600

800

1000RTB 1mdashTop 1 NNs

Publisher

Lift

RTB 2mdashTop 1 NNs

0 5 10

Publisher

0

200

400

600

800

1000

Lift

RTB 4mdashTop 1 NNs

0 5 10

Publisher

0

200

400

600

800

1000

Lift

Note Lifts higher than 1000 are shown at 1000

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 263

Figure 12 (Color online) Lifts Considering the Closest 10 Network Neighbors per RTB System (Each Panel) With and Without the Time ConstraintLimited to One Network (see sect71) for the 40 Publishers

RTB 2mdashTop 10 NNs

0 5 10

Publisher

0

200

400

600

800

1000

Lift

0 5 10

Publisher

Lift

0

50

100

150

0 5 10

Publisher

0

200

400

600

800

1000

Lift

0 5 100

200

400

600

800

1000

RTB 1mdashTop 10 NNs

RTB 3mdashTop 10 NNs RTB 4mdashTop 10 NNs

Publisher

Lift

All

TP2

TP12

Day

Note Lifts higher than 1000 are shown at 1000

have very strong additional support for the rankingresults of Study 1 as the neighbors with the strongestgeosimilarity have significantly higher affinity for thesame publishersapps

These results have broad and immediate manage-ment implications within our motivating applicationof mobile advertising As discussed above digitalmarketers would like to target the many fragmentedinstances of individuals in the digital advertising

Table 20 Rank Aggregations in Terms of Lift AveragedAcross All Publishers (per RTB)

All TP2 TP12 Day

RTB 1 101 209 309 201RTB 2 1 204 304 302RTB 3 101 3 4 109RTB 4 101 208 308 203

Note Linking based on time does not improve over linkingignoring time

Table 21 Evaluation of Including Time p-Values ofSigned Rank Test

All TP2 TP12 Day

RTB 1 mdash 0 0002 0 0002 0 0006RTB 2 mdash 0 0002 0 0002 0 0002RTB 3 mdash 0 0002 0 0002 00065RTB 4 mdash 0 0002 0 0002 0 0004

ecosystem as well as users who have similar intereststo particular ldquoseedrdquo users Location data such asIP addresses are readily available across the adver-tising ecosystem The most direct use of the GSN formobile ad targeting is to seed a targeting campaignwith users chosen to exhibit some characteristics ofinterest These could be chosen through any of themyriad of methods currently used in advertising TheGSN will directly allow the targeting of the othermembers of these usersrsquo geosimilarity cohorts whichwill include other instances of the same user othersimilar users and likely both So for example if a setof users has been identified to have brand affinity viavisits to a brandrsquos website the GSN cohorts of theseusers would provide an attractive avenue to expandthe reach of a campaign to target them (extendingtraditional retargeting) and also target users similarto the seeds (customer prospecting) In the researchdata set we see that different instances of the sameusers are very often connected in the GSN and thatthose connected in the GSN exhibit substantially sim-ilar interests

The GSN also could be used for (privacy-friendly)hyperlocal targetingmdashmeaning targeting people whofrequent a particular location without needing to storedata on the actual locations of the users as describedearlier IP addresses currently are used by somemarketers for coarse-grained demographic targetinginferring geography from IP address registration data

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers264 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

and then inferring demographics from the geographyThe GSN provides a complementary alternative it willconnect people who visit the same fine-grained loca-tions As described above in the couponing exampleif a campaign is seeded with customers of a localestablishment (eg via an online loyalty program)geosimilarity can target others who frequent the samelocations

An exciting potential future use of the GSN is theevaluation of marketing campaigns across differentchannels This has become an important challenge formobile advertizing (Lee 2014)14 For example I mightreceive an ad for a hot new product on my mobiledevice I may be interested in the product but Irsquomunlikely to buy this product on my mobile phoneRather Irsquoll buy the product using my laptop or PC Bylinking the screens of the same user we are now ableto provide a broader (and possibly much more robust)estimate of campaign success (eg conversion rate)across different channels by aggregating the successmetrics across GSN neighbors For example similar tohow a traditional ad campaign may runnot run anad in a controlled study across different cities agen-cies could targetnot target different (matched) usersand then look at the success rates in their respec-tive geosimilarity neighborhoods This should captureeffects of the same individual on different devices

As discussed at the outset advertisers need to becognizant of privacy concerns regarding the collec-tion storage and use of data This paper followswhat the FTC calls the ldquoprivacy by designrdquo approachBy design the technique does not need to store anydirect personal information about mobile users thereis no need for nonanonymized identifiers demo-graphics geographics psychographics etc In addi-tion the storage of indirect information about userscan be severely limited as well The method doesnot need the actual locationsmdashonly anonymized loca-tion ldquokeysrdquo So for example IP addresses can bereplaced with random numbers15 without affectingthe GSN performance (European Commission 2014Federal Trade Commission 2012) More technically atthe ldquoouter wallrdquo of the system or firm each deviceid can be irreversibly hashed to a random key Theonly requirement is that the same device be hashedto the same key if encountered again Similarly atthe ldquoouter wallrdquo of the system or firm every loca-tion also can be irreversibly hashed to a random keyThe GSN can be formed just the same with the ran-dom keys as with the actual locations If more privacy

14 httpwwwclickzcomclickznews233699680-of-marketers-will-run-cross-channel-marketing-campaigns-in-2014-study15 See both the FTCrsquos and the European Commissionrsquos 2010ndash2011privacy-by-design reports and the comments therein along withProvost et al (2009) for a thorough treatment of this topic

is desired hashing can be done irreversibly many toone and in that case it becomes impossible to asso-ciate definitively any particular location with any par-ticular deviceuser

Finally an aspect of this research that is importantfor managers but not typically covered in the predic-tive modeling literature is the notion of increasing thereach of a campaign For example as discussed abovepossibly the most straightforward use of the GSN isto select the same actor on different mobile devicesThis would allow us to expand the reach of a ldquoretar-getingrdquo campaign16

Increasing reach has important subtleties in the cur-rent advertising ecosystemmdashit is not just the otherside of the coin of increasing lift The reason is thatthere are many targeters in the onlinemobile adver-tising ecosystem all of whom are using the samedata retargeting data and demographicgeographicpsychographic data that are purchased from third-party data providers However these data are avail-able only on a subset of devices Thus all partieswho are using these data are competing for the samesometimes small set of devices Since large advertis-ers typically contract with multiple targeting firmswho take different strategies the effect is that theadvertiser is paying the firms to compete against eachother effectively raising the price in the auction andthereby raising their own cost of advertising17 Whatis more these will be the same devices that also willbe targeted for other campaigns because they are thedevices for which the targeters have data

However by connecting devices in a GSN we canexpand campaigns to devices for which there are noretargeting or third-party data available at all If thereis less competition for these devices we should beable to target them for a lower price Thus expand-ing reach has an implication for the cost-effectivenessof achieving a certain level of predictive performance(eg lift)

AcknowledgmentsThe authors thank the anonymous reviewers and edi-tors for very constructive suggestions which substantiallyimproved the paper This work was done while the firstauthor was at Coriolis Ventures and Everyscreen MediaThe authors extend our gratitude to Everyscreen MediaDstillery Lauren Moores Tom OrsquoReilly Vinayak Javaly andTina Eliassi-Rad for many discussions about mobile adver-tising and mobile ad targeting as well as technical assistance

16 Retargeting is to target browsers who have previously purchasedfrom the brand or who have taken some other indicative brandaction such as browsing the brandrsquos site Retargeting is consid-ered by many to be one of the most effective targeting strategies(although to our knowledge these conclusions are drawn based onassessing conversion rate rather than the influence of the advertis-ing please see Stitelman et al 2011)17 The latter is our conjecture

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 265

to obtain all of the relevant data This particular data set wasnot necessarily used in the development of any productionmodel used for mobile advertising The first author thanksNEC and Andre Meyer for faculty fellowships The authorsthank the Moore and Sloan Foundations for their generoussupport of the Moore-Sloan Data Science Environment atNew York University

ReferencesAgarwal R Gupta AK Kraut RE (2008) Editorial overviewmdashThe

interplay between digital and social networks Inform SystemsRes 19(3)243ndash252

Aral S Muchnik L Sundararajan A (2009) Distinguishing influence-based contagion from homophily-driven diffusion in dynamicnetworks Proc National Acad Sci USA 106(51)21544ndash21549

Bampo M Ewing MT Mather DR Stewart D Wallace M (2008)The effects of the social structure of digital networks on viralmarketing performance Inform Systems Res 19(3)273ndash290

Cho E Myers SA Leskovec J (2011) Friendship and mobility Usermovement in location-based social networks Proc 17th ACMSIGKDD Internat Conf Knowledge Discovery and Data MiningKDD rsquo11 (ACM New York) 1082ndash1090

Cortes C Pregibon D Volinsky CT (2001) Communities of interestHoffmann F Hand DJ Adams N Fisher D Guimaraes G edsAdvances in Intelligent Data Analysis Lecture Notes ComputSci Vol 2189 (Springer-Verlag Berlin Heidelberg) 105ndash114

Crandall DJ Backstrom L Cosley D Suri S Huttenlocher DKleinberg J (2010) Inferring social ties from geographic coinci-dences Proc National Acad Sci USA 107(52)22436ndash22441

de Montjoye YA Hidalgo CA Verleysen M Blondel VD (2013)Unique in the crowd The privacy bounds of human mobilitySci Rep 31ndash5

European Commission (2014) Data protection day 2014 Full speedon EU data protection reform Press release January 27httpeuropaeurapidpress-release_MEMO-14-60_enhtm

Fawcett T (2006) An introduction to ROC analysis Pattern Recogni-tion Lett 27(8)861ndash874

Federal Trade Commission (2012) Protecting consumer pri-vacy in an era of rapid change FTC Report March 2012httpswwwftcgovsitesdefaultfilesdocumentsreportsfederal-trade-commission-report-protecting-consumer-privacy-era-rapid-change-recommendations120326privacyreportpdf

Heidemann J Klier M Probst F (2010) Identifying key users inonline social networks A pagerank based approach InformSystems J 4801(December)157ndash160

Hill S Provost F (2003) The myth of the double-blind reviewAuthor identification using only citations SIGKDD Explorations5(2)179ndash184

Hill S Provost F Volinsky C (2006) Network-based marketingIdentifying likely adopters via consumer networks Statist Sci22(2)256ndash276

Hotho A Nuumlrnberger A Paass G (2005) A brief survey of text min-ing LDV Forum 20(1)19ndash62

Kerho S (2012) Mobile marketingmdashA new analytics frameworkwhat we have and what we need Presentation Marketing onthe Move Conference Wharton School University of PennsylvaniaPhiladelphia February 28

Kossinets G Watts DJ (2009) Origins of homophily in an evolvingsocial network Amer J Sociol 115(2)405ndash450

Lee J (2014) 80 of marketers will run cross-channel marketingcampaigns in 2014 ClickZ (March 31) httpwwwclickzcomclickznews233699680-of-marketers-will-run-cross-channel-marketing-campaigns-in-2014-study

Martens D Provost F (2011) Pseudo-social network targeting fromconsumer transaction data Working paper CEDER-11-05 SternSchool of Business New York University New York

McPherson M Smith-Lovin L Cook JM (2001) Birds of afeather Homophily in social networks Annual Rev Sociol 27415ndash444

Oinas-Kukkonen H Lyytinen K Yoo Y (2010) Social networks andinformation systems Ongoing and future research streamsJ Assoc Inform Systems 11(2)61ndash68

Pan W Aharony N Pentland A (2011) Composite social networkfor predicting mobile apps installation Burgard W Roth Deds Proc AAAI 2011 (AAAI Press San Francisco)

Perlich C Dalessandro B Raeder T Stitelman O Provost F (2014)Machine learning for targeted display advertising Transferlearning in action Machine Learn 95(1)103ndash127

Pogue D (2011) Wrapping up the apple location brouhaha NewYork Times (April 28) httppogueblogsnytimescom20110428wrapping-up-the-apple-location-brouhaha

Provost F Fawcett T (2013) Data Science for Business What You Needto Know About Data Mining and Data-analytic Thinking (OrsquoReillyMedia Inc Sebastopol CA)

Provost F Dalessandro B Hook R Zhang X Murray A (2009) Audi-ence selection for on-line brand advertising Privacy-friendlysocial network targeting KDD rsquo09 Proc 15th ACM SIGKDDInternat Conf Knowledge Discovery and Data Mining (ACMNew York) 707ndash716

Pubmatic (2010) Understanding real-time bidding (RTB) from thepublisherrsquos perspective Technical report Pubmatic SiliconValley CA

Quercia D Lathia N Calabrese F Di Lorenzo G Crowcroft J (2010)Recommending social events from mobile phone location dataProc 2010 IEEE Internat Conf Data Mining ICDM rsquo10 (IEEEComputer Society Washington DC) 971ndash976

Raeder T Stitelman O Dalessandro B Perlich C Provost F (2012)Design principles of massive robust prediction systems Proc18th ACM SIGKDD Internat Conf Knowledge Discovery and DataMining (ACM New York) 1357ndash1365

Stitelman O Dalessandro B Perlich C Provost F (2011) Estimatingthe effect of online display advertising on browser conversionData Mining and Audience Intelligence for Advertising ADKDD2011 (ACM New York) 8

Stitelman O Perlich C Dalessandro B Hook R Raeder T Provost F(2013) Using co-visitation networks for detecting large scaleonline display advertising exchange fraud Proc 19th ACMSIGKDD Internat Conf Knowledge Discovery and Data Mining(ACM New York) 1240ndash1248

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Page 13: Finding Similar Mobile Consumers with a Privacy-Friendly ...pages.stern.nyu.edu/~fprovost/Papers/PMM_ISR_2015.pdf · Provost, Martens, and Murray: Finding Similar Mobile Consumers

Provost Martens and Murray Finding Similar Mobile Consumers254 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

Table 6 Randomly Splitting User X rsquos IP VisitsCreating Two New Artificial Users X1 and X2

IP1 IP2 IP3

X 1 5 0X1 0 3 0X2 1 2 0

Table 7 The Volume of IFA Occurrence Within OurData Set

Volume

RTB 1 119401855RTB 2 113831962RTB 3 7891292

Table 8 The Frequency of Unique IFAs Visible AcrossRTB Exchanges

Frequency

RTB 1RTB 2 3061594 (1016)RTB 2RTB 3 451256 (169)RTB 3RTB 1 181417 (085)

two new users X1 and X2 for which we know thatthey are actually the same individual Results withRandom should be used only as a ceiling on perfor-mance because the results will be overly optimistic

2 Temporal A more realistic split-up of the trans-actions is a temporal split-up where first the middleday of Xrsquos transactions is determined (hence dynam-ically chosen for each user) All IP address visits upto this middle day are assigned to X1 and all trans-actions after the middle day are assigned to X2 Itwill only be applied to users who are seen on multi-ple days

3 Identifier for advertising IFA is an advertisingidentifier from Apple that users can change or asknot to be used for advertising and is seen as aprivacy-friendlier version of the controversial fixeddevice ID (the ldquoUDIDrdquo) The IFA is observed in threeexchanges with volumes shown in Table 7 for exam-ple 1940855 unique IFAs are seen in RTB 110 Whenwe observe the same IFA across different exchangeswe now know that two devices in different exchangesactually are the same The frequency of unique IFAsvisible across exchanges is shown in Table 8 For exam-ple a total of 3018223 IFAs are seen when we look atboth RTB 1 and RTB 2 with 306594 of these (1016)seen on both exchange 1 and exchange 2 What ifthere were not an IFA and the device would havehad a different identifier in each exchange To what

10 For confidentiality RTB 1 in this setting is not necessarily thesame as RTB 1 in sectsect4 or 7

extent are these two device instances then connectedin the GSN

With these different ldquogold standardrdquo methods toindicate whether or not two users are the samewe assess the ability of the geosimilarity metricsto link the same user determining how often twoinstances of the same person are connected in theGSN (sect61) Afterward we shall assess the differentGSN strength metricsrsquo ability to rank users determin-ing to what extent two instances of the same personexhibit strong geosimilarity as compared to the otherneighbors (sect62) For the random and temporal split-ups 200 samples are used For the IFA-based methodthe complete population is used for measuring thedegree of connectedness to the same user to analyzethe relative strength of the connected users 200 sam-ples are used

61 Same Individual Different ldquoScreensrdquoWe now present results on the GSN connectednessmeasured as how often two instances of the same userare connected in the GSN (in percentage) Note thatthe results summarized in Table 9 should be inter-preted in light of the limited number of days in build-ing the research data set which will limit the GSNconnectivity Nonetheless the relatively high connec-tivities observed even with this sample provide quitepromising results which lends considerable supportto the merit of the overall design

Random0 Using the random split two users thatcorrespond to the same individual are not connectedin only 9 of the neighborhoods Moreover in thesecases user X (whose IP visits are split up) usuallyhas only one or two entries and the random split-upresults in no visits for one of the two simulated users

Temporal0 The temporal split-up is more realisticthan the random one where we assume that a personuses simulated device 1 in the first half of the week(eg during the weekend) and simulated device 2 inthe second half of the week (eg during the week)

Users X1 and X2 are not connected in 32 of theneighborhoods Whereas the random split-up was toooptimistic the temporal one is too pessimistic as inreality there likely is some overlap in the time peri-ods of use of two different devices If user X visitsan IP address only in the first days of the week this

Table 9 To What Extent is the Same User Connectedto Herself

connected

Random 91Temporal 68IFAmdashRTB 1RTB 2 67IFAmdashRTB 2RTB 3 82IFAmdashRTB 3RTB 1 73

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 255

IP address will not occur in user X2rsquos transactions andreduces the chance to be connected to X1

IFA0 We look at all combinations of the three ex-changes with IFA identifiers For example there are45256 unique IFAs seen on both RTB 2 and RTB 3 In82 of these the two IFAs visit at least one same IP onboth exchanges and hence are connected in our GSN

The bottom line is that even with this limited slice ofonline behavior different instances of the same indi-vidual are connected in the GSN 70ndash80 of the timeEncouragingly the results from the simulated scenar-ios concur with the results from the real (IFA) sce-narios (including the aforementioned optimism andslight pessimism of the two simulated scenarios)Consistency over these different settings gives addi-tional confidence in the results suggesting that theGSN indeed holds promise for a high degree of suc-cess at targeting the same user across the fragmenta-tion of the online advertising ecosystem

62 Ranking User InstancesNext given that two user instances of the same indi-vidual are indeed connected we assess to what extentthe second instance of the same user is ranked highlyamong its network neighbors based on the weightof their linkage in the GSN We first will discuss themost general results using the best ranking methodand then will look across different ranking methods

We decompose the results based on the numberof network neighbors (NNs) since there are twoboundary cases that require special interpretationTables 10ndash14 show the proportions of the ego net-works that fall into three scenarios First a user mayhave only one NN In this scenario the ranking isperfect (trivially)mdashthe other instance of the same useris the only user instance given a nonzero score As

Table 10 In What Portion of Cases Is the Same UserConnected Depending on the Number ofNetwork NeighborsmdashRandom Split-up

Ego has only 1 NN 3104Ego has exactly 2 NNs 1905

Same user ranked first 8005Ego has more than 2 NNs 4901

Note For the scenario with two NNs the table reports thepercentage of cases where the other known instance of thesame user is ranked first (including ties)

Table 11 In What Portion of Cases Is the Same UserConnected Depending on the Number ofNetwork NeighborsmdashTemporal Split-up

Ego has only 1 NN 3008Ego has exactly 2 NNs 1807

Same user ranked first 8309Ego has more than 2 NNs 5005

Table 12 In What Portion of Cases Is the Same UserConnected Depending on the Number ofNetwork NeighborsmdashIFA RTB 1RTB 2Split-up

Ego has only 1 NN 4401Ego has exactly 2 NNs 1201

Same user ranked first 8201Ego has more than 2 NNs 4400

Table 13 In What Portion of Cases Is the Same UserConnected Depending on the Number ofNetwork NeighborsmdashIFA RTB 2RTB 3Split-up

Ego has only 1 NN 6109Ego has exactly 2 NNs 1907

Same user ranked first 7102Ego has more than 2 NNs 1804

Table 14 In What Portion of Cases Is the Same UserConnected Depending on the Number ofNetwork NeighborsmdashIFA RTB 3RTB 1Split-up

Ego has only 1 NN 5305Ego has exactly 2 NNs 2501

Same user ranked first 6304Ego has more than 2 NNs 2104

shown in the tables this scenario accounts for 30ndash60 of the ego networks (depending on the split-upmethod)

The second boundary case is when the ego has ex-actly two NNs This scenario accounts for 10ndash25of the cases depending on the split-up method Herethe tables also report whether the other known userinstance is ranked first (including ties) among thetwo neighbors The other known instance of the sameuser is ranked first 60ndash80 of the time dependingon the split-up scenario The IFA results are affectedstrongly by the particular RTB settingmdashthis may indi-cate that the publishers on the different RTBs have dif-ferent policies for passing the IFA Importantly theseresults are conservativemdashpossibly very conservativeThe other NNs may also be instances of the same userunbeknownst to usmdashfor example in the IFA casebecause an app does not pass the IFA to the RTB orbecause the other user is the same user on a differentdevice and thus does not share the IFA

The final scenario is when more than two NNsare present As shown in the tables this accounts for20ndash50 of the cases depending on the split-up sce-nario In this case we measure the percentile in theranking where the instance of the known same userfalls (see below) as well as the area under the receiveroperating characteristic curve (AUC) (Fawcett 2006)which is equivalent to the MannndashWhitneyndashWilcoxon(MWW) statistic The AUCMWW measures to what

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers256 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

Table 15 When There Are More Than Two Network Neighbors to What Extent Is the Known Same User StronglyConnected to Itself

AUC (if gt 2 NNs) () Average rank percentile (if gt 2 NNs) ()

Ranking metric R T RTB 12 RTB 23 RTB 31 R T RTB 12 RTB 23 RTB 31

1 78 64 68 59 55 29 48 34 43 542 59 58 67 58 54 45 53 34 44 553 79 69 74 61 61 29 44 29 40 484 86 75 69 61 59 24 40 32 41 505 83 70 71 59 63 26 44 31 43 476 85 74 71 60 65 25 41 31 42 457 86 76 73 62 64 24 40 29 40 468 83 70 72 59 65 26 44 30 43 459 85 74 71 60 67 25 41 30 42 44

10 79 69 69 61 63 29 44 32 41 4711 79 69 71 62 64 44 44 30 41 4612 81 68 68 60 62 28 46 34 42 4913 81 70 71 61 64 27 44 30 40 4614 74 67 66 61 48 34 47 35 41 6015 74 67 66 61 48 34 47 35 41 60

Note See the text for a discussion of ranking metrics

extent data points with label 1 are ranked higherthan those with label 0 In the results that follow(see Table 15) the AUC is measured for each samplewhere any user instance among the neighbors thatcorresponds to the same individual is labeled as 1and all others are labeled as 0 Thus in this settingthe AUC measures how well the neighbors are rankedby the likelihood of being the same individual11 Theaverage over all samples is reported If for a sampleall neighbors have the same score that AUC is setto 05 In addition to the factors noted above this alsowill tend to make the results conservative For exam-ple if the GSN links an ego user to a set of neighborsthat are all instances of the same individual in thisevaluation the AUC will be 05

The ranking measure is the percentile in the rank-ing at which the known-same-user instance is rankedwith lower values being better For example if thesame-user instance has the highest score (closest con-nection) among five neighbors the rank is 20 (15)Note again that this is quite conservative in thisexample the user is ranked at the top of the list butbecause there were only five neighbors the highestpossible score is 20 In case the known-same-userinstance has the same score as another neighbor theaverage rank is reported

As shown in Table 15 all AUC values exceed 05ranging up to 08 This means that in every casea known-same-user instance is likely to be rankedhigher than an instance not known to be the same

11 Technically in this setting the AUCMWW is the probability thata randomly selected neighbor that is in fact a known instance of thesame user is ranked more highly than a randomly selected neighborthat is not known to be an instance of the same user If the known-same-user instances are all ranked above the other-user instancesthe AUC = 100 If they are all below then the AUC = 0

user The frequency-based ranking metrics (4ndash9) per-form quite well in connecting instances of the sameuser strongly

Almost all metrics perform better than a randommodel with performances that even come close to thebest possible solution Consistently performing quitewell are the frequency-based metrics (metrics 3ndash9)with metric 7 (the minimum of the IDF weights ofthe shared locations) getting the most wins Not per-forming well is the very basic count metric as well asthe Boolean metric that provides a binary score onlyInteresting to note the Jaccard metrics do not performwell either It seems that penalizing users that visitmany IP addresses negatively affects the results anissue we will return to in the next study

7 Study 2 Does the GSN Select Userswith Similar Interests

The foregoing section addressed our paperrsquos firstclaimmdashthat the GSN design indeed links instances ofthe same user and links them more strongly thaninstances of other users In this section we turn to thepaperrsquos second claimmdashthat the GSN links users withsimilar interests We will measure similar interests bysimilarity in behavior accessing particular publisherson the Web and using mobile apps12

71 Connecting Users with Similar Interests

To What Extent Are the GSN Neighbors of Visi-tors to Particular Publishers Also Visitors to ThoseSame Publishers To assess whether the GSN con-nects users with similar interestsbehavior we select

12 From now on all user instances will be complete real user in-stances the simulated split-up scenarios (Random Temporal) werecreated specifically for the purpose of the same-user study

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 257

Table 16 Distribution of Publishers per RTB in Terms of thePercentage of Users in Our Data Set That Visit the GivenPublisher

Publisher of D

RTB 1Burstly 3004myYearbookcom 1062My Fitness Pal 0065TextMe Inc 0056GameResort 0025Flixster 0019247Sports 0015Daily Workout Apps LLC 0009Mobile Deluxe 0007YoYo Games Ltd 0004

RTB 2mworldnowcom 0032mtmzcom 0024mtopixcom 0014babycentercom 0015meetmecom 0006appsfacebookcom 0006itunesapplecom 0039appevitecom 0002beautyandskincarescom 0003mmajunkiecom 0001

RTB 3Conversion Exchange 5053Pinger Inc 2096Top Game Developer 4029PremiumEntertainmentApp-FamilyOriented-Android 0059Enflick 0053Clapfoot Games 0044FunPokes Inc 0018Talkatone 0015Sevenlogics Inc 0008A Star Software 0007

RTB 4Pinger PhonemdashiOSmdashConversation 1001Talkatone iPhone App 0007DiceWithBuddies 00139GAG Reader 0012Video Downloader Pro Lite 0016Rage Comics 0009Spades 3D Lite 0006Apalabrados 0003Relax Melodies HD 0002Celeb MemdashPhotoMaker 0000

a diverse collection of 40 publishers listed in Table 16A publisher can correspond to a mobile app a web-site or a set of appswebsites We selected both pub-lishers that are visited by many users and more nichepublishers with fewer visitors As can be seen fromTable 16 several types of publishers can be distin-guished the main groups are games (eg GameRe-sort Top Game Developer) and social networkingtools (eg appsfacebookcom Text Me Inc)

More specifically we create 200 samples for 40 pub-lishers 10 publishers for each of four different RTB

exchanges listed in Table 1613 The percentage ofunique users that visit each publisher in our completedata set is also shown (D) To create each samplean ldquoegordquo user is selected that has visited the pub-lisher in question (an ldquoego visitorrdquo) this ego visi-torrsquos GSN neighborhood is created Next we assesshow many of the egorsquos neighbors also visited thatsame publisher This will be compared with a base-line visit rate to produce lift and leverage values(Provost and Fawcett 2013)mdashspecifically how muchmore likely is it for the neighbors of ego visitors tovisit the publisher than would be expected by chanceWe consider three different setups that produce dif-ferent baselines

1 Considering all RTB exchanges to determine the base-line A user is considered a visitor for a given pub-lisher if she visits the publisher on any RTB ex-change (since some publishers are visible on differentexchanges) Hence the baseline visit rate is the num-ber of users (on any RTB) who visit that publisherdivided by the total number of users across all RTBexchanges This will provide optimistic lifts sincemost publishers appear only on one RTB exchangewhereas the baseline is measured over all exchanges

2 Considering one RTB exchange only to determine thebaseline A user is considered a visitor for a given pub-lisher and RTB if she visits the publisher on the sameRTB exchange This changes the baseline to the num-ber of users (on this RTB) who visit that publisherdivided by the total number of users on this RTBexchange The lift and leverage results will be worsethan in the first setting because the baseline will belarger Seeing that some publishers are seen on differ-ent RTB exchanges this setting is slightly pessimistic

3 Taking time of location visits into account In thethird setting we link users only if they visit an IPwithin the same time period This will be elaboratedon in sect73

Results are shown in Tables 17 and 18 The percent-age of publisher visitors in the neighborhood is givenas N This percentage is compared with the baselinepercentage of users that visit that publisher in thecomplete data D using two metrics the lift is the ratioof these numbers the leverage is the difference Liftis particularly useful for very small base rates (con-sider that an activity with a base rate of 30 cannotpossibly have a lift higher than about 3) Leverage ismore telling for larger base rates because it shows theabsolute improvement (here in percentage points)

When considering all NNs the lifts for all publish-ers are greater than 1 except for one Burstly on RTB 1has a lift of 097 (essentially no lift) but only when weconsider RTB 1 as the baseline This publisher already

13 For confidentiality the RTB numbering here is again differentfrom above

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers258 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

Table 17 RTBs 1 and 2 Percentage of Network Neighbors That Also Visit the Publisher

RTB 1

All NNs N1 N10

D N Lift Lev N1 Lift Lev N10 Lift Lev

Considering all RTB exchanges to determine baseline1 Burstly 3004 8063 2084 5059 47022 15054 44018 25061 8043 220572 myYearbookcom 1062 25077 15094 24015 44074 27067 43012 35016 21074 330543 My Fitness Pal 0065 4015 6038 305 3609 56077 36025 15055 23091 14094 TextMe Inc 0056 1805 33006 17094 39002 69075 38046 25018 45 240625 GameResort 0025 4093 19077 4068 18052 74026 18027 11086 47058 110626 Flixster 0019 1055 8015 1036 8014 42085 7095 1079 9042 1067 247Sports 0015 27057 188058 27042 77062 530093 77048 74053 509075 740388 Daily Workout Apps LLC 0009 1092 21047 1083 19023 215046 19014 11054 129027 110459 Mobile Deluxe 0007 1087 28067 108 23053 36009 23046 9063 14707 9056

10 YoYo Games Ltd 0004 3052 99058 3049 21057 609064 21053 13 367045 12096Considering this RTB exchange only to determine baseline

1 Burstly 8085 8063 0097 minus0022 47022 5033 38037 25061 2089 160762 myYearbookcom 4071 25077 5047 21006 44074 905 40003 35016 7046 300453 My Fitness Pal 1089 4015 2019 2025 3609 19048 35001 15055 8021 130654 TextMe Inc 1063 1805 11035 16087 39002 23094 37039 25018 15045 230555 GameResort 0073 4093 6078 402 18052 25049 17079 11086 16033 110146 Flixster 0055 1055 208 0099 8014 14071 7059 1079 3023 10247 247Sports 0043 27057 64073 27014 77062 182023 7702 74053 174097 74018 Daily Workout Apps LLC 0026 1092 7037 1066 19023 73095 18097 11054 44037 110289 Mobile Deluxe 0019 1087 9084 1068 23053 123088 23034 9063 5007 9044

10 YoYo Games Ltd 001 3052 34018 3042 21057 209025 21047 13 126012 1209

RTB 2

Considering all RTB exchanges to determine baseline1 mworldnowcom 0032 13027 40086 12095 54039 167042 54006 31065 97044 310332 mtmzcom 0024 6089 28086 6065 23008 9607 22084 10068 44075 100443 mtopixcom 0014 15007 104079 14093 53045 37106 5303 33075 234065 330614 babycentercom 0015 6087 47031 6072 43048 299056 43033 17074 122024 17065 meetmecom 0006 6002 10308 5097 54072 942093 54066 27085 47909 270796 appsfacebookcom 0006 301 48039 3003 9009 141097 9003 2076 43008 20697 itunesapplecom 0039 25077 65071 25038 51043 131012 51004 29049 75018 290098 appevitecom 0002 48028 21212043 48025 54076 21509068 54074 48028 21212043 480259 beautyandskincarescom 0003 18019 642087 18016 8048 299094 8046 1107 413049 11067

10 mmajunkiecom 0001 1036 201064 1035 26067 3195505 26066 13033 11977074 13033Considering this RTB exchange only to determine baseline

1 mworldnowcom 4034 13027 3006 8093 54039 12053 50005 31065 7029 270312 mtmzcom 3019 6089 2016 307 23008 7024 19089 10068 3035 70493 mtopixcom 1092 15007 7084 13015 53045 27081 51053 33075 17056 310834 babycentercom 1094 6087 3054 4093 43048 22042 41054 17074 9015 15085 meetmecom 0078 6002 7077 5025 54072 70057 53094 27085 35092 270076 appsfacebookcom 0086 301 3062 2024 9009 10063 8024 2076 3022 1097 itunesapplecom 5024 25077 4092 20053 51043 9081 46019 29049 5063 240258 appevitecom 0029 48028 165058 47098 54076 187082 54047 48028 165058 470989 beautyandskincarescom 0038 18019 48011 17081 8048 22045 8011 1107 30095 11032

10 mmajunkiecom 0009 1036 15009 1027 26067 296003 26058 13033 148001 13024

Notes Results are given for the complete data set (D) the complete neighborhood of a user who visited the given publisher (All NNs) and considering theneighborhood of closest 1 and 10 users (N1 and N10)

has a very high baseline and as we will see in thenext section when we consider only the top 1 and 10NNs the lifts go up to 5 and 3 respectively Burstlyprovides a generic platform for building apps and sois a weaker indication of user interest than many ofthe other publishers (fitness babies games)

All other lifts exceed one by substantial mar-gins and go up to even 2391 (Apalabrados onRTB 4 a Spanish-language social Scrabble-like game)

Obviously the conclusion that the NNs indeed aremore likely to visit the same publishers is highly sig-nificant by a simple sign test comparing the visita-tion rate within the geosimilarity neighborhood tothat of overall population either via lift or leveragep lt 0001 Thus NNs of publisher visitors are indeedmore likely also to visit the same websites the abso-lute values of the lifts and leverages suggest that theyare substantially more likely

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 259

Table 18 RTBs 3 and 4 Percentage of Network Neighbors That Also Visit the Publisher

RTB 3

All NNs N1 N10

D N Lift Lev N1 Lift Lev N10 Lift Lev

Considering all RTB exchanges to determine baseline1 Conversion Exchange 5053 34054 6024 29001 64071 11069 59017 50077 9018 450242 Pinger Inc 2096 46002 15053 43006 65012 21097 62015 51048 17037 480523 Top Game Developer 4029 47024 11002 42095 60 13099 55071 48097 11042 440684 PremiumEntertainmentApp-FamilyOriented-Android 0059 23079 40012 2302 20045 34049 19086 22025 37052 210665 Enflick 0053 11071 21095 11017 21057 40044 21004 12027 23 110736 Clapfoot Games 0044 8055 19032 8011 11054 26006 1101 13099 31061 130557 FunPokes Inc 0018 11011 61028 10093 50 275077 49082 15079 87008 150618 Talkatone 0015 13013 88051 12098 31011 209076 30096 17091 120078 170779 Sevenlogics Inc 0008 2071 32059 2063 5088 70074 508 8033 100022 8025

10 A Star Software 0007 3033 44086 3025 23008 311024 23 9009 122061 9002Considering this RTB exchange only to determine baseline

1 Conversion Exchange 11045 34054 3002 23009 64071 5065 53025 50077 4043 390322 Pinger Inc 6013 46002 705 39089 65012 10062 58098 51048 8039 450353 Top Game Developer 8088 47024 5032 38036 60 6076 51012 48097 5052 400094 PremiumEntertainmentApp-FamilyOriented-Android 1023 23079 19038 22057 20045 16066 19023 22025 18013 210025 Enflick 101 11071 1006 1006 21057 19054 20046 12027 11011 110166 Clapfoot Games 0092 8055 9033 7064 11054 12059 10062 13099 15027 130087 FunPokes Inc 0038 11011 2906 10074 50 133022 49062 15079 42007 150418 Talkatone 0031 13013 42076 12082 31011 101034 3008 17091 58035 170619 Sevenlogics Inc 0017 2071 15074 2054 5088 34017 5071 8033 48041 8016

10 A Star Software 0015 3033 21067 3017 23008 150036 22092 9009 59023 8094

RTB 4

Considering all RTB exchanges to determine baseline1 Pinger PhonemdashiOSmdashConversation 1001 71081 70085 7008 80077 79068 79076 70014 6902 690132 Talkatone iPhone App 0007 11038 17203 11031 29003 439068 28097 17002 257078 160963 DiceWithBuddies 0013 7041 59004 7028 18018 144092 18006 7069 61031 70574 9GAG Reader 0012 62013 497099 62 49009 39305 48097 49005 393017 480925 Video Downloader Pro Lite 0016 20003 123015 19087 22058 138084 22042 17093 110027 170776 Rage Comics 0009 609 8001 6081 13033 154086 13025 609 8001 60817 Spades 3D Lite 0006 1205 225076 12044 27027 492057 27022 1205 225076 120448 Apalabrados 0003 77042 21391071 77039 75 21316097 74097 79032 21450043 790299 Relax Melodies HD 0002 31058 21002029 31056 50 31170029 49098 31058 21002029 31056

10 Celeb MemdashPhotoMaker 0 3023 82107 3022 7014 11819048 7014 3023 82107 3022Considering this RTB exchange only to determine baseline

1 Pinger PhonemdashiOSmdashConversation 8051 71081 8044 6303 80077 9049 72026 70014 8024 610632 Talkatone iPhone App 0055 11038 20053 10082 29003 52038 28048 17002 30071 160473 DiceWithBuddies 0009 7041 82011 7032 18018 201054 18009 7069 85027 7064 9GAG Reader 0097 62013 63097 61016 49009 50055 48012 49005 50051 480085 Video Downloader Pro Lite 0044 20003 45036 19059 22058 51014 22014 17093 40062 170496 Rage Comics 0037 609 18079 6053 13033 36032 12097 609 18079 60537 Spades 3D Lite 0046 1205 2609 12004 27027 58068 26081 1205 2609 120048 Apalabrados 0021 77042 371093 77021 75 360031 74079 79032 381006 790119 Relax Melodies HD 0013 31058 238054 31045 50 377069 49087 31058 238054 31045

10 Celeb MemdashPhotoMaker 0003 3023 97089 3019 7014 216076 7011 3023 97089 3019

Notes Results are given for the complete data set (D) the complete neighborhood of a user who visited the given publisher (All NNs) and considering theneighborhood of closest 1 and 10 users (N1 and N10)

The very high lifts could be attributed to two phe-nomena They may provide some evidence that theGSN embeds a true social network as pointed out bythe good results for the social networking publishers(eg Apalabrados with a lift of 2391 appsevitecomwith a lift of 2212 and others) Friends visiting thesame social networking sites (here unusual ones)

would increase the lift and leverage over just havingdifferent instances of the same user visiting the samewebsites An alternative explanation is that some ofthese publishers are much more likely to be visitedacross different devices of the same user It is not clearwhether this is actually true However it argues forobtaining specific and broad data on which devices

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers260 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

are not the same user to shed further light on forexample any broad data that include a (anonymized)user identifier

From these results we can conclude that GSNneighbors indeed exhibit substantially similar pub-lisher visitation behavior Next we assess to whatextent more strongly connected NNs are more similarthan less strongly connected NNs

72 Ranking Users with Similar Interests

Are the More-Similar Geosimilarity Neighbors of aPublisherrsquos Visitors Even More Likely Also to Be Vis-itors to the Same Publisher To assess whether more-similar geosimilarity neighbors are even more likelyto share interests we compute the measures exactlyas in the previous study except instead of consider-ing all NNs only the 1 (N1) and 10 (N10) neighborswith the highest scores are considered (see Tables 17and 18) First we must consider which of the 15 met-rics that are defined to measure link strength to useTable 19 reports the rank aggregations of the differentmetrics as to which provide the best lifts Specificallyfor each of the 40 publishers each scoring metric pro-vides a lift and for that publisher the scoring metricscan be ranked (the best being ranked 1 etc) The rankaggregation is the average rank for a scoring metricacross all 40 publishers Thus if one metric providedthe best lift for all publishers it would get an aggre-gated rank of 1

To visualize the best metrics the metrics with anaverage ranking less than 8 are shown in boldfacein Table 19 The Jaccard metrics do not perform wellin this study either showing that penalizing the con-nection to users that visit many locations is not sen-sible Remember that the Jaccard metric is based onthe idea that users that visit many locations will havea higher chance to share some location with anotheruser However we have so many locations in totalthat visiting a couple more locations will increase onlymarginally this probability of sharing a location bychance A user that logs into more IP addresses andis more active should hence not be penalized andranked lower than other less active users Metric 3which sums the IDF scores of the shared locationsperforms quite well Given the operational efficiencyadvantage over the frequency-based metrics this met-ric is chosen to measure the strength of the links

As seen from the results in Tables 17 and 18 thelift and leverage values for the strongly connected (top 1and top 10) NNs are even higher sometimes astro-nomical (3170 for Relax Melodies HD (RTB 4)) Theseresults are highly significant by sign tests compar-ing either the lift or the leverage between the top-ranked neighbors and the entire cohort (p lt 0001) Theabsolute values again indicate that the users with the

strongest geosimilarity are often substantially morelikely to visit the same publisher

Thus we can conclude that not only does geosim-ilarity find users with similar interests it also ranksusers well by their likelihood of having similarinterests

An interesting follow-up question is whether thecharacteristics of a publisher are of importance for theresults For example are GSN neighbors more sim-ilar in terms of using the same social network appas compared to playing the same games To answerthis question the publishers are categorized into sixclasses funny social news games communicationand miscellaneous In Figure 9 the lifts of the pub-lishers are shown per category Some general trendscan be observed where publishers related to funnycontent have the highest median lift This could beexplained by the fact that users often forward funnycontent to their friends The social websitesapps fol-low next which seems to further demonstrate thatfriends are likely linked in the GSN However pub-lishers in the communication category perform not sowell compared with the other categories and ratherhigh variances are observed in the different cate-gories Please note that high variance is observedacross the categories therefore it is difficult to makewell-supported claims from these results

73 TimingAs a final evaluation we explore whether the inclu-sion of temporal similarity in location visitation ishelpful for defining GSN connections To this end weintroduce a setting where two users are connectedonly if they visit an IP address in the same timeperiod of a day Specifically we divide a day intodifferent time windowsmdashfor example two time peri-ods of 12 hoursmdashwhere users are connected only ifthey visit the same IP address in the same time periodof a day (which might be on different days) Practi-cally two digits are added to an IP which denote thestarting moment of the corresponding time periodWe report on time periods of 2 hours and of 12 hours(other periods yield similar results) We also includethe setting where users are connected only when theyvisit the same IP address on the same day in light ofthe result of Crandall et al (2010) that visiting thesame location at around the same time is indicativeof friendship

The resulting lifts are reported in Figures 10ndash12when considering respectively all NNs the top 1 NNand the top 10 NNs The results when not consider-ing the time period are repeated as well (All) lim-ited to the conservative case of only considering oneexchange (see above) The baseline (lift of one) isindicated with a red horizontal line lifts higher than1000 are shown as 1000 to limit the range to be able

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 261

Table 19 Rank Aggregation per Metric Over All Publishers (Across All RTBs)

Top 1 NN Top 10 NNs

1 Shared unique IPs 801 1 Shared Unique IPs 802252 BooleanmdashSharing an IP 80625 2 BooleanmdashSharing an IP 809133 Shared unique IPsmdashIDF weighted 70025 3 Shared unique IPsmdashIDF weighted 709754 Shared IP visitsmdashMin 70063 4 Shared IP visitsmdashMin 704135 Shared IP visitsmdashMax 7095 5 Shared IP visitsmdashMax 70256 Shared IP visitsmdashAvg 8005 6 Shared IP visitsmdashAvg 704887 Shared IP visitsmdashMinmdashIDF weighted 60625 7 Shared IP visitsmdashMinmdashIDF weighted 703758 Shared IP visitsmdashMaxmdashIDF weighted 70975 8 Shared IP visitsmdashMaxmdashIDF weighted 702259 Shared IP visitsmdashAvgmdashIDF weighted 70938 9 Shared IP visitsmdashAvgmdashIDF weighted 70525

10 Cosine similarity 80338 10 Cosine similarity 804511 Cosine similaritymdashIDF weighted 80213 11 Cosine similaritymdashIDF weighted 8086312 GSN Jaccard 80925 12 GSN Jaccard 913 GSN JaccardmdashIDF weighted 80325 13 GSN JaccardmdashIDF weighted 9013814 GSN Jaccard Freq 80425 14 GSN Jaccard Freq 80515 GSN Jaccard FreqmdashIDF weighted 80425 15 GSN Jaccard FreqmdashIDF weighted 805

to visualize the results Tables 20 and 21 show theaverage rank aggregation (see the discussion of rankaggregation above) in terms of lift per RTB and thep-values of a signed rank test respectively

When comparing the different temporal settingswe see that the ldquoDayrdquo variant performs best and lim-ited differences exist between the time periods of 2and 12 hours More important to observe from theseresults is that (overall) not including time in the GSNdesign outperforms all three temporal variants sub-stantially Only for RTB 3 and the Day version whereusers are connected only if they visit the same IPaddress on the same day are the lifts not signifi-cantly worse than having no time constraint at a 1level (although even here they are significantly worseat a 10 level) This shows that for finding userswith similar interests which locations they visit ismore important than when they visit them By includ-ing the time constraints connections between previ-ously connected users are lost simply because they

Figure 9 (Color online) Lifts of the Publishers (for All Four RTBs)Ranked According to the Median Lift Within the Category(Medians Shown by the Full Line)

visit locations in different time periods Although weuse time to define links metrics that include time todefine the strength of a link might improve the resultsfurther and we consider that an interesting issue forfuture research

8 ConclusionThis paper presents a new design for using geosim-ilarity to connect user instances in a geosimilarity-weighted network In the GSN users are connected ifthey share at least one observed location Various met-rics can be used to yield degrees of similarity We pre-sented intuitive arguments theory and prior researchthat suggest that geosimilarity and the GSN designshould link similar users We also provided strongempirical support Specifically Study 1 shows thatthe GSN indeed links different users instances cor-responding to the same individual and the geosimi-larity metrics rank user instances by their likelihoodof corresponding to the same individual Study 2shows that the GSN links users who have similarinterests as measured by their propensity to visitthe same publishersuse the same apps It also pro-vides strong evidence that the geosimilarity metricsrank user instances by their likelihood of havingsimilar interests Taken together the results providestrong support that the GSN links users with similarinterestsmdashin many cases because they actually repre-sent the same individuals

The main limitation of the research presented inthis paper is that the data are not sufficient to dis-tinguish whether the similarity in interests is solelybased on connecting instances of the same user Thisis an important avenue for future research This lim-itation notwithstanding the very strong results fromStudy 2 have important implications in either caseeither the network is indeed finding different indi-viduals with substantially similar interests or we

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers262 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

Figure 10 (Color online) Lifts Considering All Network Neighbors per RTB System (Each Panel) With and Without the Time Constraint Limited toOne Network (see sect71) for the 40 Publishers

RTB 3mdashAll NNs

0 5 10

Publisher

0

20

40

60

80

100

Lift

RTB 4mdashAll NNs

0 5 10

Publisher

0

200

400

600

800

1000

Lift

RTB 2mdashAll NNs

0 5 10

Publisher

0

200

400

600

800

1000

Lift

0 5 100

200

400

600

800

1000RTB 1mdashAll NNs

Publisher

Lift

All

TP2

TP12

Day

Note Lifts higher than 1000 are shown at 1000

Figure 11 (Color online) Lifts Considering the Closest 1 Network Neighbor per RTB System (Each Panel) With and Without the Time ConstraintLimited to One Network (see sect71) for the 40 Publishers

Lift

0

100

200

400

300

RTB 3mdashTop 1 NNs

0 5 10

Publisher

All

TP2

TP12

Day

0 5 100

200

400

600

800

1000RTB 1mdashTop 1 NNs

Publisher

Lift

RTB 2mdashTop 1 NNs

0 5 10

Publisher

0

200

400

600

800

1000

Lift

RTB 4mdashTop 1 NNs

0 5 10

Publisher

0

200

400

600

800

1000

Lift

Note Lifts higher than 1000 are shown at 1000

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 263

Figure 12 (Color online) Lifts Considering the Closest 10 Network Neighbors per RTB System (Each Panel) With and Without the Time ConstraintLimited to One Network (see sect71) for the 40 Publishers

RTB 2mdashTop 10 NNs

0 5 10

Publisher

0

200

400

600

800

1000

Lift

0 5 10

Publisher

Lift

0

50

100

150

0 5 10

Publisher

0

200

400

600

800

1000

Lift

0 5 100

200

400

600

800

1000

RTB 1mdashTop 10 NNs

RTB 3mdashTop 10 NNs RTB 4mdashTop 10 NNs

Publisher

Lift

All

TP2

TP12

Day

Note Lifts higher than 1000 are shown at 1000

have very strong additional support for the rankingresults of Study 1 as the neighbors with the strongestgeosimilarity have significantly higher affinity for thesame publishersapps

These results have broad and immediate manage-ment implications within our motivating applicationof mobile advertising As discussed above digitalmarketers would like to target the many fragmentedinstances of individuals in the digital advertising

Table 20 Rank Aggregations in Terms of Lift AveragedAcross All Publishers (per RTB)

All TP2 TP12 Day

RTB 1 101 209 309 201RTB 2 1 204 304 302RTB 3 101 3 4 109RTB 4 101 208 308 203

Note Linking based on time does not improve over linkingignoring time

Table 21 Evaluation of Including Time p-Values ofSigned Rank Test

All TP2 TP12 Day

RTB 1 mdash 0 0002 0 0002 0 0006RTB 2 mdash 0 0002 0 0002 0 0002RTB 3 mdash 0 0002 0 0002 00065RTB 4 mdash 0 0002 0 0002 0 0004

ecosystem as well as users who have similar intereststo particular ldquoseedrdquo users Location data such asIP addresses are readily available across the adver-tising ecosystem The most direct use of the GSN formobile ad targeting is to seed a targeting campaignwith users chosen to exhibit some characteristics ofinterest These could be chosen through any of themyriad of methods currently used in advertising TheGSN will directly allow the targeting of the othermembers of these usersrsquo geosimilarity cohorts whichwill include other instances of the same user othersimilar users and likely both So for example if a setof users has been identified to have brand affinity viavisits to a brandrsquos website the GSN cohorts of theseusers would provide an attractive avenue to expandthe reach of a campaign to target them (extendingtraditional retargeting) and also target users similarto the seeds (customer prospecting) In the researchdata set we see that different instances of the sameusers are very often connected in the GSN and thatthose connected in the GSN exhibit substantially sim-ilar interests

The GSN also could be used for (privacy-friendly)hyperlocal targetingmdashmeaning targeting people whofrequent a particular location without needing to storedata on the actual locations of the users as describedearlier IP addresses currently are used by somemarketers for coarse-grained demographic targetinginferring geography from IP address registration data

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers264 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

and then inferring demographics from the geographyThe GSN provides a complementary alternative it willconnect people who visit the same fine-grained loca-tions As described above in the couponing exampleif a campaign is seeded with customers of a localestablishment (eg via an online loyalty program)geosimilarity can target others who frequent the samelocations

An exciting potential future use of the GSN is theevaluation of marketing campaigns across differentchannels This has become an important challenge formobile advertizing (Lee 2014)14 For example I mightreceive an ad for a hot new product on my mobiledevice I may be interested in the product but Irsquomunlikely to buy this product on my mobile phoneRather Irsquoll buy the product using my laptop or PC Bylinking the screens of the same user we are now ableto provide a broader (and possibly much more robust)estimate of campaign success (eg conversion rate)across different channels by aggregating the successmetrics across GSN neighbors For example similar tohow a traditional ad campaign may runnot run anad in a controlled study across different cities agen-cies could targetnot target different (matched) usersand then look at the success rates in their respec-tive geosimilarity neighborhoods This should captureeffects of the same individual on different devices

As discussed at the outset advertisers need to becognizant of privacy concerns regarding the collec-tion storage and use of data This paper followswhat the FTC calls the ldquoprivacy by designrdquo approachBy design the technique does not need to store anydirect personal information about mobile users thereis no need for nonanonymized identifiers demo-graphics geographics psychographics etc In addi-tion the storage of indirect information about userscan be severely limited as well The method doesnot need the actual locationsmdashonly anonymized loca-tion ldquokeysrdquo So for example IP addresses can bereplaced with random numbers15 without affectingthe GSN performance (European Commission 2014Federal Trade Commission 2012) More technically atthe ldquoouter wallrdquo of the system or firm each deviceid can be irreversibly hashed to a random key Theonly requirement is that the same device be hashedto the same key if encountered again Similarly atthe ldquoouter wallrdquo of the system or firm every loca-tion also can be irreversibly hashed to a random keyThe GSN can be formed just the same with the ran-dom keys as with the actual locations If more privacy

14 httpwwwclickzcomclickznews233699680-of-marketers-will-run-cross-channel-marketing-campaigns-in-2014-study15 See both the FTCrsquos and the European Commissionrsquos 2010ndash2011privacy-by-design reports and the comments therein along withProvost et al (2009) for a thorough treatment of this topic

is desired hashing can be done irreversibly many toone and in that case it becomes impossible to asso-ciate definitively any particular location with any par-ticular deviceuser

Finally an aspect of this research that is importantfor managers but not typically covered in the predic-tive modeling literature is the notion of increasing thereach of a campaign For example as discussed abovepossibly the most straightforward use of the GSN isto select the same actor on different mobile devicesThis would allow us to expand the reach of a ldquoretar-getingrdquo campaign16

Increasing reach has important subtleties in the cur-rent advertising ecosystemmdashit is not just the otherside of the coin of increasing lift The reason is thatthere are many targeters in the onlinemobile adver-tising ecosystem all of whom are using the samedata retargeting data and demographicgeographicpsychographic data that are purchased from third-party data providers However these data are avail-able only on a subset of devices Thus all partieswho are using these data are competing for the samesometimes small set of devices Since large advertis-ers typically contract with multiple targeting firmswho take different strategies the effect is that theadvertiser is paying the firms to compete against eachother effectively raising the price in the auction andthereby raising their own cost of advertising17 Whatis more these will be the same devices that also willbe targeted for other campaigns because they are thedevices for which the targeters have data

However by connecting devices in a GSN we canexpand campaigns to devices for which there are noretargeting or third-party data available at all If thereis less competition for these devices we should beable to target them for a lower price Thus expand-ing reach has an implication for the cost-effectivenessof achieving a certain level of predictive performance(eg lift)

AcknowledgmentsThe authors thank the anonymous reviewers and edi-tors for very constructive suggestions which substantiallyimproved the paper This work was done while the firstauthor was at Coriolis Ventures and Everyscreen MediaThe authors extend our gratitude to Everyscreen MediaDstillery Lauren Moores Tom OrsquoReilly Vinayak Javaly andTina Eliassi-Rad for many discussions about mobile adver-tising and mobile ad targeting as well as technical assistance

16 Retargeting is to target browsers who have previously purchasedfrom the brand or who have taken some other indicative brandaction such as browsing the brandrsquos site Retargeting is consid-ered by many to be one of the most effective targeting strategies(although to our knowledge these conclusions are drawn based onassessing conversion rate rather than the influence of the advertis-ing please see Stitelman et al 2011)17 The latter is our conjecture

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 265

to obtain all of the relevant data This particular data set wasnot necessarily used in the development of any productionmodel used for mobile advertising The first author thanksNEC and Andre Meyer for faculty fellowships The authorsthank the Moore and Sloan Foundations for their generoussupport of the Moore-Sloan Data Science Environment atNew York University

ReferencesAgarwal R Gupta AK Kraut RE (2008) Editorial overviewmdashThe

interplay between digital and social networks Inform SystemsRes 19(3)243ndash252

Aral S Muchnik L Sundararajan A (2009) Distinguishing influence-based contagion from homophily-driven diffusion in dynamicnetworks Proc National Acad Sci USA 106(51)21544ndash21549

Bampo M Ewing MT Mather DR Stewart D Wallace M (2008)The effects of the social structure of digital networks on viralmarketing performance Inform Systems Res 19(3)273ndash290

Cho E Myers SA Leskovec J (2011) Friendship and mobility Usermovement in location-based social networks Proc 17th ACMSIGKDD Internat Conf Knowledge Discovery and Data MiningKDD rsquo11 (ACM New York) 1082ndash1090

Cortes C Pregibon D Volinsky CT (2001) Communities of interestHoffmann F Hand DJ Adams N Fisher D Guimaraes G edsAdvances in Intelligent Data Analysis Lecture Notes ComputSci Vol 2189 (Springer-Verlag Berlin Heidelberg) 105ndash114

Crandall DJ Backstrom L Cosley D Suri S Huttenlocher DKleinberg J (2010) Inferring social ties from geographic coinci-dences Proc National Acad Sci USA 107(52)22436ndash22441

de Montjoye YA Hidalgo CA Verleysen M Blondel VD (2013)Unique in the crowd The privacy bounds of human mobilitySci Rep 31ndash5

European Commission (2014) Data protection day 2014 Full speedon EU data protection reform Press release January 27httpeuropaeurapidpress-release_MEMO-14-60_enhtm

Fawcett T (2006) An introduction to ROC analysis Pattern Recogni-tion Lett 27(8)861ndash874

Federal Trade Commission (2012) Protecting consumer pri-vacy in an era of rapid change FTC Report March 2012httpswwwftcgovsitesdefaultfilesdocumentsreportsfederal-trade-commission-report-protecting-consumer-privacy-era-rapid-change-recommendations120326privacyreportpdf

Heidemann J Klier M Probst F (2010) Identifying key users inonline social networks A pagerank based approach InformSystems J 4801(December)157ndash160

Hill S Provost F (2003) The myth of the double-blind reviewAuthor identification using only citations SIGKDD Explorations5(2)179ndash184

Hill S Provost F Volinsky C (2006) Network-based marketingIdentifying likely adopters via consumer networks Statist Sci22(2)256ndash276

Hotho A Nuumlrnberger A Paass G (2005) A brief survey of text min-ing LDV Forum 20(1)19ndash62

Kerho S (2012) Mobile marketingmdashA new analytics frameworkwhat we have and what we need Presentation Marketing onthe Move Conference Wharton School University of PennsylvaniaPhiladelphia February 28

Kossinets G Watts DJ (2009) Origins of homophily in an evolvingsocial network Amer J Sociol 115(2)405ndash450

Lee J (2014) 80 of marketers will run cross-channel marketingcampaigns in 2014 ClickZ (March 31) httpwwwclickzcomclickznews233699680-of-marketers-will-run-cross-channel-marketing-campaigns-in-2014-study

Martens D Provost F (2011) Pseudo-social network targeting fromconsumer transaction data Working paper CEDER-11-05 SternSchool of Business New York University New York

McPherson M Smith-Lovin L Cook JM (2001) Birds of afeather Homophily in social networks Annual Rev Sociol 27415ndash444

Oinas-Kukkonen H Lyytinen K Yoo Y (2010) Social networks andinformation systems Ongoing and future research streamsJ Assoc Inform Systems 11(2)61ndash68

Pan W Aharony N Pentland A (2011) Composite social networkfor predicting mobile apps installation Burgard W Roth Deds Proc AAAI 2011 (AAAI Press San Francisco)

Perlich C Dalessandro B Raeder T Stitelman O Provost F (2014)Machine learning for targeted display advertising Transferlearning in action Machine Learn 95(1)103ndash127

Pogue D (2011) Wrapping up the apple location brouhaha NewYork Times (April 28) httppogueblogsnytimescom20110428wrapping-up-the-apple-location-brouhaha

Provost F Fawcett T (2013) Data Science for Business What You Needto Know About Data Mining and Data-analytic Thinking (OrsquoReillyMedia Inc Sebastopol CA)

Provost F Dalessandro B Hook R Zhang X Murray A (2009) Audi-ence selection for on-line brand advertising Privacy-friendlysocial network targeting KDD rsquo09 Proc 15th ACM SIGKDDInternat Conf Knowledge Discovery and Data Mining (ACMNew York) 707ndash716

Pubmatic (2010) Understanding real-time bidding (RTB) from thepublisherrsquos perspective Technical report Pubmatic SiliconValley CA

Quercia D Lathia N Calabrese F Di Lorenzo G Crowcroft J (2010)Recommending social events from mobile phone location dataProc 2010 IEEE Internat Conf Data Mining ICDM rsquo10 (IEEEComputer Society Washington DC) 971ndash976

Raeder T Stitelman O Dalessandro B Perlich C Provost F (2012)Design principles of massive robust prediction systems Proc18th ACM SIGKDD Internat Conf Knowledge Discovery and DataMining (ACM New York) 1357ndash1365

Stitelman O Dalessandro B Perlich C Provost F (2011) Estimatingthe effect of online display advertising on browser conversionData Mining and Audience Intelligence for Advertising ADKDD2011 (ACM New York) 8

Stitelman O Perlich C Dalessandro B Hook R Raeder T Provost F(2013) Using co-visitation networks for detecting large scaleonline display advertising exchange fraud Proc 19th ACMSIGKDD Internat Conf Knowledge Discovery and Data Mining(ACM New York) 1240ndash1248

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Page 14: Finding Similar Mobile Consumers with a Privacy-Friendly ...pages.stern.nyu.edu/~fprovost/Papers/PMM_ISR_2015.pdf · Provost, Martens, and Murray: Finding Similar Mobile Consumers

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 255

IP address will not occur in user X2rsquos transactions andreduces the chance to be connected to X1

IFA0 We look at all combinations of the three ex-changes with IFA identifiers For example there are45256 unique IFAs seen on both RTB 2 and RTB 3 In82 of these the two IFAs visit at least one same IP onboth exchanges and hence are connected in our GSN

The bottom line is that even with this limited slice ofonline behavior different instances of the same indi-vidual are connected in the GSN 70ndash80 of the timeEncouragingly the results from the simulated scenar-ios concur with the results from the real (IFA) sce-narios (including the aforementioned optimism andslight pessimism of the two simulated scenarios)Consistency over these different settings gives addi-tional confidence in the results suggesting that theGSN indeed holds promise for a high degree of suc-cess at targeting the same user across the fragmenta-tion of the online advertising ecosystem

62 Ranking User InstancesNext given that two user instances of the same indi-vidual are indeed connected we assess to what extentthe second instance of the same user is ranked highlyamong its network neighbors based on the weightof their linkage in the GSN We first will discuss themost general results using the best ranking methodand then will look across different ranking methods

We decompose the results based on the numberof network neighbors (NNs) since there are twoboundary cases that require special interpretationTables 10ndash14 show the proportions of the ego net-works that fall into three scenarios First a user mayhave only one NN In this scenario the ranking isperfect (trivially)mdashthe other instance of the same useris the only user instance given a nonzero score As

Table 10 In What Portion of Cases Is the Same UserConnected Depending on the Number ofNetwork NeighborsmdashRandom Split-up

Ego has only 1 NN 3104Ego has exactly 2 NNs 1905

Same user ranked first 8005Ego has more than 2 NNs 4901

Note For the scenario with two NNs the table reports thepercentage of cases where the other known instance of thesame user is ranked first (including ties)

Table 11 In What Portion of Cases Is the Same UserConnected Depending on the Number ofNetwork NeighborsmdashTemporal Split-up

Ego has only 1 NN 3008Ego has exactly 2 NNs 1807

Same user ranked first 8309Ego has more than 2 NNs 5005

Table 12 In What Portion of Cases Is the Same UserConnected Depending on the Number ofNetwork NeighborsmdashIFA RTB 1RTB 2Split-up

Ego has only 1 NN 4401Ego has exactly 2 NNs 1201

Same user ranked first 8201Ego has more than 2 NNs 4400

Table 13 In What Portion of Cases Is the Same UserConnected Depending on the Number ofNetwork NeighborsmdashIFA RTB 2RTB 3Split-up

Ego has only 1 NN 6109Ego has exactly 2 NNs 1907

Same user ranked first 7102Ego has more than 2 NNs 1804

Table 14 In What Portion of Cases Is the Same UserConnected Depending on the Number ofNetwork NeighborsmdashIFA RTB 3RTB 1Split-up

Ego has only 1 NN 5305Ego has exactly 2 NNs 2501

Same user ranked first 6304Ego has more than 2 NNs 2104

shown in the tables this scenario accounts for 30ndash60 of the ego networks (depending on the split-upmethod)

The second boundary case is when the ego has ex-actly two NNs This scenario accounts for 10ndash25of the cases depending on the split-up method Herethe tables also report whether the other known userinstance is ranked first (including ties) among thetwo neighbors The other known instance of the sameuser is ranked first 60ndash80 of the time dependingon the split-up scenario The IFA results are affectedstrongly by the particular RTB settingmdashthis may indi-cate that the publishers on the different RTBs have dif-ferent policies for passing the IFA Importantly theseresults are conservativemdashpossibly very conservativeThe other NNs may also be instances of the same userunbeknownst to usmdashfor example in the IFA casebecause an app does not pass the IFA to the RTB orbecause the other user is the same user on a differentdevice and thus does not share the IFA

The final scenario is when more than two NNsare present As shown in the tables this accounts for20ndash50 of the cases depending on the split-up sce-nario In this case we measure the percentile in theranking where the instance of the known same userfalls (see below) as well as the area under the receiveroperating characteristic curve (AUC) (Fawcett 2006)which is equivalent to the MannndashWhitneyndashWilcoxon(MWW) statistic The AUCMWW measures to what

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers256 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

Table 15 When There Are More Than Two Network Neighbors to What Extent Is the Known Same User StronglyConnected to Itself

AUC (if gt 2 NNs) () Average rank percentile (if gt 2 NNs) ()

Ranking metric R T RTB 12 RTB 23 RTB 31 R T RTB 12 RTB 23 RTB 31

1 78 64 68 59 55 29 48 34 43 542 59 58 67 58 54 45 53 34 44 553 79 69 74 61 61 29 44 29 40 484 86 75 69 61 59 24 40 32 41 505 83 70 71 59 63 26 44 31 43 476 85 74 71 60 65 25 41 31 42 457 86 76 73 62 64 24 40 29 40 468 83 70 72 59 65 26 44 30 43 459 85 74 71 60 67 25 41 30 42 44

10 79 69 69 61 63 29 44 32 41 4711 79 69 71 62 64 44 44 30 41 4612 81 68 68 60 62 28 46 34 42 4913 81 70 71 61 64 27 44 30 40 4614 74 67 66 61 48 34 47 35 41 6015 74 67 66 61 48 34 47 35 41 60

Note See the text for a discussion of ranking metrics

extent data points with label 1 are ranked higherthan those with label 0 In the results that follow(see Table 15) the AUC is measured for each samplewhere any user instance among the neighbors thatcorresponds to the same individual is labeled as 1and all others are labeled as 0 Thus in this settingthe AUC measures how well the neighbors are rankedby the likelihood of being the same individual11 Theaverage over all samples is reported If for a sampleall neighbors have the same score that AUC is setto 05 In addition to the factors noted above this alsowill tend to make the results conservative For exam-ple if the GSN links an ego user to a set of neighborsthat are all instances of the same individual in thisevaluation the AUC will be 05

The ranking measure is the percentile in the rank-ing at which the known-same-user instance is rankedwith lower values being better For example if thesame-user instance has the highest score (closest con-nection) among five neighbors the rank is 20 (15)Note again that this is quite conservative in thisexample the user is ranked at the top of the list butbecause there were only five neighbors the highestpossible score is 20 In case the known-same-userinstance has the same score as another neighbor theaverage rank is reported

As shown in Table 15 all AUC values exceed 05ranging up to 08 This means that in every casea known-same-user instance is likely to be rankedhigher than an instance not known to be the same

11 Technically in this setting the AUCMWW is the probability thata randomly selected neighbor that is in fact a known instance of thesame user is ranked more highly than a randomly selected neighborthat is not known to be an instance of the same user If the known-same-user instances are all ranked above the other-user instancesthe AUC = 100 If they are all below then the AUC = 0

user The frequency-based ranking metrics (4ndash9) per-form quite well in connecting instances of the sameuser strongly

Almost all metrics perform better than a randommodel with performances that even come close to thebest possible solution Consistently performing quitewell are the frequency-based metrics (metrics 3ndash9)with metric 7 (the minimum of the IDF weights ofthe shared locations) getting the most wins Not per-forming well is the very basic count metric as well asthe Boolean metric that provides a binary score onlyInteresting to note the Jaccard metrics do not performwell either It seems that penalizing users that visitmany IP addresses negatively affects the results anissue we will return to in the next study

7 Study 2 Does the GSN Select Userswith Similar Interests

The foregoing section addressed our paperrsquos firstclaimmdashthat the GSN design indeed links instances ofthe same user and links them more strongly thaninstances of other users In this section we turn to thepaperrsquos second claimmdashthat the GSN links users withsimilar interests We will measure similar interests bysimilarity in behavior accessing particular publisherson the Web and using mobile apps12

71 Connecting Users with Similar Interests

To What Extent Are the GSN Neighbors of Visi-tors to Particular Publishers Also Visitors to ThoseSame Publishers To assess whether the GSN con-nects users with similar interestsbehavior we select

12 From now on all user instances will be complete real user in-stances the simulated split-up scenarios (Random Temporal) werecreated specifically for the purpose of the same-user study

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 257

Table 16 Distribution of Publishers per RTB in Terms of thePercentage of Users in Our Data Set That Visit the GivenPublisher

Publisher of D

RTB 1Burstly 3004myYearbookcom 1062My Fitness Pal 0065TextMe Inc 0056GameResort 0025Flixster 0019247Sports 0015Daily Workout Apps LLC 0009Mobile Deluxe 0007YoYo Games Ltd 0004

RTB 2mworldnowcom 0032mtmzcom 0024mtopixcom 0014babycentercom 0015meetmecom 0006appsfacebookcom 0006itunesapplecom 0039appevitecom 0002beautyandskincarescom 0003mmajunkiecom 0001

RTB 3Conversion Exchange 5053Pinger Inc 2096Top Game Developer 4029PremiumEntertainmentApp-FamilyOriented-Android 0059Enflick 0053Clapfoot Games 0044FunPokes Inc 0018Talkatone 0015Sevenlogics Inc 0008A Star Software 0007

RTB 4Pinger PhonemdashiOSmdashConversation 1001Talkatone iPhone App 0007DiceWithBuddies 00139GAG Reader 0012Video Downloader Pro Lite 0016Rage Comics 0009Spades 3D Lite 0006Apalabrados 0003Relax Melodies HD 0002Celeb MemdashPhotoMaker 0000

a diverse collection of 40 publishers listed in Table 16A publisher can correspond to a mobile app a web-site or a set of appswebsites We selected both pub-lishers that are visited by many users and more nichepublishers with fewer visitors As can be seen fromTable 16 several types of publishers can be distin-guished the main groups are games (eg GameRe-sort Top Game Developer) and social networkingtools (eg appsfacebookcom Text Me Inc)

More specifically we create 200 samples for 40 pub-lishers 10 publishers for each of four different RTB

exchanges listed in Table 1613 The percentage ofunique users that visit each publisher in our completedata set is also shown (D) To create each samplean ldquoegordquo user is selected that has visited the pub-lisher in question (an ldquoego visitorrdquo) this ego visi-torrsquos GSN neighborhood is created Next we assesshow many of the egorsquos neighbors also visited thatsame publisher This will be compared with a base-line visit rate to produce lift and leverage values(Provost and Fawcett 2013)mdashspecifically how muchmore likely is it for the neighbors of ego visitors tovisit the publisher than would be expected by chanceWe consider three different setups that produce dif-ferent baselines

1 Considering all RTB exchanges to determine the base-line A user is considered a visitor for a given pub-lisher if she visits the publisher on any RTB ex-change (since some publishers are visible on differentexchanges) Hence the baseline visit rate is the num-ber of users (on any RTB) who visit that publisherdivided by the total number of users across all RTBexchanges This will provide optimistic lifts sincemost publishers appear only on one RTB exchangewhereas the baseline is measured over all exchanges

2 Considering one RTB exchange only to determine thebaseline A user is considered a visitor for a given pub-lisher and RTB if she visits the publisher on the sameRTB exchange This changes the baseline to the num-ber of users (on this RTB) who visit that publisherdivided by the total number of users on this RTBexchange The lift and leverage results will be worsethan in the first setting because the baseline will belarger Seeing that some publishers are seen on differ-ent RTB exchanges this setting is slightly pessimistic

3 Taking time of location visits into account In thethird setting we link users only if they visit an IPwithin the same time period This will be elaboratedon in sect73

Results are shown in Tables 17 and 18 The percent-age of publisher visitors in the neighborhood is givenas N This percentage is compared with the baselinepercentage of users that visit that publisher in thecomplete data D using two metrics the lift is the ratioof these numbers the leverage is the difference Liftis particularly useful for very small base rates (con-sider that an activity with a base rate of 30 cannotpossibly have a lift higher than about 3) Leverage ismore telling for larger base rates because it shows theabsolute improvement (here in percentage points)

When considering all NNs the lifts for all publish-ers are greater than 1 except for one Burstly on RTB 1has a lift of 097 (essentially no lift) but only when weconsider RTB 1 as the baseline This publisher already

13 For confidentiality the RTB numbering here is again differentfrom above

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers258 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

Table 17 RTBs 1 and 2 Percentage of Network Neighbors That Also Visit the Publisher

RTB 1

All NNs N1 N10

D N Lift Lev N1 Lift Lev N10 Lift Lev

Considering all RTB exchanges to determine baseline1 Burstly 3004 8063 2084 5059 47022 15054 44018 25061 8043 220572 myYearbookcom 1062 25077 15094 24015 44074 27067 43012 35016 21074 330543 My Fitness Pal 0065 4015 6038 305 3609 56077 36025 15055 23091 14094 TextMe Inc 0056 1805 33006 17094 39002 69075 38046 25018 45 240625 GameResort 0025 4093 19077 4068 18052 74026 18027 11086 47058 110626 Flixster 0019 1055 8015 1036 8014 42085 7095 1079 9042 1067 247Sports 0015 27057 188058 27042 77062 530093 77048 74053 509075 740388 Daily Workout Apps LLC 0009 1092 21047 1083 19023 215046 19014 11054 129027 110459 Mobile Deluxe 0007 1087 28067 108 23053 36009 23046 9063 14707 9056

10 YoYo Games Ltd 0004 3052 99058 3049 21057 609064 21053 13 367045 12096Considering this RTB exchange only to determine baseline

1 Burstly 8085 8063 0097 minus0022 47022 5033 38037 25061 2089 160762 myYearbookcom 4071 25077 5047 21006 44074 905 40003 35016 7046 300453 My Fitness Pal 1089 4015 2019 2025 3609 19048 35001 15055 8021 130654 TextMe Inc 1063 1805 11035 16087 39002 23094 37039 25018 15045 230555 GameResort 0073 4093 6078 402 18052 25049 17079 11086 16033 110146 Flixster 0055 1055 208 0099 8014 14071 7059 1079 3023 10247 247Sports 0043 27057 64073 27014 77062 182023 7702 74053 174097 74018 Daily Workout Apps LLC 0026 1092 7037 1066 19023 73095 18097 11054 44037 110289 Mobile Deluxe 0019 1087 9084 1068 23053 123088 23034 9063 5007 9044

10 YoYo Games Ltd 001 3052 34018 3042 21057 209025 21047 13 126012 1209

RTB 2

Considering all RTB exchanges to determine baseline1 mworldnowcom 0032 13027 40086 12095 54039 167042 54006 31065 97044 310332 mtmzcom 0024 6089 28086 6065 23008 9607 22084 10068 44075 100443 mtopixcom 0014 15007 104079 14093 53045 37106 5303 33075 234065 330614 babycentercom 0015 6087 47031 6072 43048 299056 43033 17074 122024 17065 meetmecom 0006 6002 10308 5097 54072 942093 54066 27085 47909 270796 appsfacebookcom 0006 301 48039 3003 9009 141097 9003 2076 43008 20697 itunesapplecom 0039 25077 65071 25038 51043 131012 51004 29049 75018 290098 appevitecom 0002 48028 21212043 48025 54076 21509068 54074 48028 21212043 480259 beautyandskincarescom 0003 18019 642087 18016 8048 299094 8046 1107 413049 11067

10 mmajunkiecom 0001 1036 201064 1035 26067 3195505 26066 13033 11977074 13033Considering this RTB exchange only to determine baseline

1 mworldnowcom 4034 13027 3006 8093 54039 12053 50005 31065 7029 270312 mtmzcom 3019 6089 2016 307 23008 7024 19089 10068 3035 70493 mtopixcom 1092 15007 7084 13015 53045 27081 51053 33075 17056 310834 babycentercom 1094 6087 3054 4093 43048 22042 41054 17074 9015 15085 meetmecom 0078 6002 7077 5025 54072 70057 53094 27085 35092 270076 appsfacebookcom 0086 301 3062 2024 9009 10063 8024 2076 3022 1097 itunesapplecom 5024 25077 4092 20053 51043 9081 46019 29049 5063 240258 appevitecom 0029 48028 165058 47098 54076 187082 54047 48028 165058 470989 beautyandskincarescom 0038 18019 48011 17081 8048 22045 8011 1107 30095 11032

10 mmajunkiecom 0009 1036 15009 1027 26067 296003 26058 13033 148001 13024

Notes Results are given for the complete data set (D) the complete neighborhood of a user who visited the given publisher (All NNs) and considering theneighborhood of closest 1 and 10 users (N1 and N10)

has a very high baseline and as we will see in thenext section when we consider only the top 1 and 10NNs the lifts go up to 5 and 3 respectively Burstlyprovides a generic platform for building apps and sois a weaker indication of user interest than many ofthe other publishers (fitness babies games)

All other lifts exceed one by substantial mar-gins and go up to even 2391 (Apalabrados onRTB 4 a Spanish-language social Scrabble-like game)

Obviously the conclusion that the NNs indeed aremore likely to visit the same publishers is highly sig-nificant by a simple sign test comparing the visita-tion rate within the geosimilarity neighborhood tothat of overall population either via lift or leveragep lt 0001 Thus NNs of publisher visitors are indeedmore likely also to visit the same websites the abso-lute values of the lifts and leverages suggest that theyare substantially more likely

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 259

Table 18 RTBs 3 and 4 Percentage of Network Neighbors That Also Visit the Publisher

RTB 3

All NNs N1 N10

D N Lift Lev N1 Lift Lev N10 Lift Lev

Considering all RTB exchanges to determine baseline1 Conversion Exchange 5053 34054 6024 29001 64071 11069 59017 50077 9018 450242 Pinger Inc 2096 46002 15053 43006 65012 21097 62015 51048 17037 480523 Top Game Developer 4029 47024 11002 42095 60 13099 55071 48097 11042 440684 PremiumEntertainmentApp-FamilyOriented-Android 0059 23079 40012 2302 20045 34049 19086 22025 37052 210665 Enflick 0053 11071 21095 11017 21057 40044 21004 12027 23 110736 Clapfoot Games 0044 8055 19032 8011 11054 26006 1101 13099 31061 130557 FunPokes Inc 0018 11011 61028 10093 50 275077 49082 15079 87008 150618 Talkatone 0015 13013 88051 12098 31011 209076 30096 17091 120078 170779 Sevenlogics Inc 0008 2071 32059 2063 5088 70074 508 8033 100022 8025

10 A Star Software 0007 3033 44086 3025 23008 311024 23 9009 122061 9002Considering this RTB exchange only to determine baseline

1 Conversion Exchange 11045 34054 3002 23009 64071 5065 53025 50077 4043 390322 Pinger Inc 6013 46002 705 39089 65012 10062 58098 51048 8039 450353 Top Game Developer 8088 47024 5032 38036 60 6076 51012 48097 5052 400094 PremiumEntertainmentApp-FamilyOriented-Android 1023 23079 19038 22057 20045 16066 19023 22025 18013 210025 Enflick 101 11071 1006 1006 21057 19054 20046 12027 11011 110166 Clapfoot Games 0092 8055 9033 7064 11054 12059 10062 13099 15027 130087 FunPokes Inc 0038 11011 2906 10074 50 133022 49062 15079 42007 150418 Talkatone 0031 13013 42076 12082 31011 101034 3008 17091 58035 170619 Sevenlogics Inc 0017 2071 15074 2054 5088 34017 5071 8033 48041 8016

10 A Star Software 0015 3033 21067 3017 23008 150036 22092 9009 59023 8094

RTB 4

Considering all RTB exchanges to determine baseline1 Pinger PhonemdashiOSmdashConversation 1001 71081 70085 7008 80077 79068 79076 70014 6902 690132 Talkatone iPhone App 0007 11038 17203 11031 29003 439068 28097 17002 257078 160963 DiceWithBuddies 0013 7041 59004 7028 18018 144092 18006 7069 61031 70574 9GAG Reader 0012 62013 497099 62 49009 39305 48097 49005 393017 480925 Video Downloader Pro Lite 0016 20003 123015 19087 22058 138084 22042 17093 110027 170776 Rage Comics 0009 609 8001 6081 13033 154086 13025 609 8001 60817 Spades 3D Lite 0006 1205 225076 12044 27027 492057 27022 1205 225076 120448 Apalabrados 0003 77042 21391071 77039 75 21316097 74097 79032 21450043 790299 Relax Melodies HD 0002 31058 21002029 31056 50 31170029 49098 31058 21002029 31056

10 Celeb MemdashPhotoMaker 0 3023 82107 3022 7014 11819048 7014 3023 82107 3022Considering this RTB exchange only to determine baseline

1 Pinger PhonemdashiOSmdashConversation 8051 71081 8044 6303 80077 9049 72026 70014 8024 610632 Talkatone iPhone App 0055 11038 20053 10082 29003 52038 28048 17002 30071 160473 DiceWithBuddies 0009 7041 82011 7032 18018 201054 18009 7069 85027 7064 9GAG Reader 0097 62013 63097 61016 49009 50055 48012 49005 50051 480085 Video Downloader Pro Lite 0044 20003 45036 19059 22058 51014 22014 17093 40062 170496 Rage Comics 0037 609 18079 6053 13033 36032 12097 609 18079 60537 Spades 3D Lite 0046 1205 2609 12004 27027 58068 26081 1205 2609 120048 Apalabrados 0021 77042 371093 77021 75 360031 74079 79032 381006 790119 Relax Melodies HD 0013 31058 238054 31045 50 377069 49087 31058 238054 31045

10 Celeb MemdashPhotoMaker 0003 3023 97089 3019 7014 216076 7011 3023 97089 3019

Notes Results are given for the complete data set (D) the complete neighborhood of a user who visited the given publisher (All NNs) and considering theneighborhood of closest 1 and 10 users (N1 and N10)

The very high lifts could be attributed to two phe-nomena They may provide some evidence that theGSN embeds a true social network as pointed out bythe good results for the social networking publishers(eg Apalabrados with a lift of 2391 appsevitecomwith a lift of 2212 and others) Friends visiting thesame social networking sites (here unusual ones)

would increase the lift and leverage over just havingdifferent instances of the same user visiting the samewebsites An alternative explanation is that some ofthese publishers are much more likely to be visitedacross different devices of the same user It is not clearwhether this is actually true However it argues forobtaining specific and broad data on which devices

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers260 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

are not the same user to shed further light on forexample any broad data that include a (anonymized)user identifier

From these results we can conclude that GSNneighbors indeed exhibit substantially similar pub-lisher visitation behavior Next we assess to whatextent more strongly connected NNs are more similarthan less strongly connected NNs

72 Ranking Users with Similar Interests

Are the More-Similar Geosimilarity Neighbors of aPublisherrsquos Visitors Even More Likely Also to Be Vis-itors to the Same Publisher To assess whether more-similar geosimilarity neighbors are even more likelyto share interests we compute the measures exactlyas in the previous study except instead of consider-ing all NNs only the 1 (N1) and 10 (N10) neighborswith the highest scores are considered (see Tables 17and 18) First we must consider which of the 15 met-rics that are defined to measure link strength to useTable 19 reports the rank aggregations of the differentmetrics as to which provide the best lifts Specificallyfor each of the 40 publishers each scoring metric pro-vides a lift and for that publisher the scoring metricscan be ranked (the best being ranked 1 etc) The rankaggregation is the average rank for a scoring metricacross all 40 publishers Thus if one metric providedthe best lift for all publishers it would get an aggre-gated rank of 1

To visualize the best metrics the metrics with anaverage ranking less than 8 are shown in boldfacein Table 19 The Jaccard metrics do not perform wellin this study either showing that penalizing the con-nection to users that visit many locations is not sen-sible Remember that the Jaccard metric is based onthe idea that users that visit many locations will havea higher chance to share some location with anotheruser However we have so many locations in totalthat visiting a couple more locations will increase onlymarginally this probability of sharing a location bychance A user that logs into more IP addresses andis more active should hence not be penalized andranked lower than other less active users Metric 3which sums the IDF scores of the shared locationsperforms quite well Given the operational efficiencyadvantage over the frequency-based metrics this met-ric is chosen to measure the strength of the links

As seen from the results in Tables 17 and 18 thelift and leverage values for the strongly connected (top 1and top 10) NNs are even higher sometimes astro-nomical (3170 for Relax Melodies HD (RTB 4)) Theseresults are highly significant by sign tests compar-ing either the lift or the leverage between the top-ranked neighbors and the entire cohort (p lt 0001) Theabsolute values again indicate that the users with the

strongest geosimilarity are often substantially morelikely to visit the same publisher

Thus we can conclude that not only does geosim-ilarity find users with similar interests it also ranksusers well by their likelihood of having similarinterests

An interesting follow-up question is whether thecharacteristics of a publisher are of importance for theresults For example are GSN neighbors more sim-ilar in terms of using the same social network appas compared to playing the same games To answerthis question the publishers are categorized into sixclasses funny social news games communicationand miscellaneous In Figure 9 the lifts of the pub-lishers are shown per category Some general trendscan be observed where publishers related to funnycontent have the highest median lift This could beexplained by the fact that users often forward funnycontent to their friends The social websitesapps fol-low next which seems to further demonstrate thatfriends are likely linked in the GSN However pub-lishers in the communication category perform not sowell compared with the other categories and ratherhigh variances are observed in the different cate-gories Please note that high variance is observedacross the categories therefore it is difficult to makewell-supported claims from these results

73 TimingAs a final evaluation we explore whether the inclu-sion of temporal similarity in location visitation ishelpful for defining GSN connections To this end weintroduce a setting where two users are connectedonly if they visit an IP address in the same timeperiod of a day Specifically we divide a day intodifferent time windowsmdashfor example two time peri-ods of 12 hoursmdashwhere users are connected only ifthey visit the same IP address in the same time periodof a day (which might be on different days) Practi-cally two digits are added to an IP which denote thestarting moment of the corresponding time periodWe report on time periods of 2 hours and of 12 hours(other periods yield similar results) We also includethe setting where users are connected only when theyvisit the same IP address on the same day in light ofthe result of Crandall et al (2010) that visiting thesame location at around the same time is indicativeof friendship

The resulting lifts are reported in Figures 10ndash12when considering respectively all NNs the top 1 NNand the top 10 NNs The results when not consider-ing the time period are repeated as well (All) lim-ited to the conservative case of only considering oneexchange (see above) The baseline (lift of one) isindicated with a red horizontal line lifts higher than1000 are shown as 1000 to limit the range to be able

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 261

Table 19 Rank Aggregation per Metric Over All Publishers (Across All RTBs)

Top 1 NN Top 10 NNs

1 Shared unique IPs 801 1 Shared Unique IPs 802252 BooleanmdashSharing an IP 80625 2 BooleanmdashSharing an IP 809133 Shared unique IPsmdashIDF weighted 70025 3 Shared unique IPsmdashIDF weighted 709754 Shared IP visitsmdashMin 70063 4 Shared IP visitsmdashMin 704135 Shared IP visitsmdashMax 7095 5 Shared IP visitsmdashMax 70256 Shared IP visitsmdashAvg 8005 6 Shared IP visitsmdashAvg 704887 Shared IP visitsmdashMinmdashIDF weighted 60625 7 Shared IP visitsmdashMinmdashIDF weighted 703758 Shared IP visitsmdashMaxmdashIDF weighted 70975 8 Shared IP visitsmdashMaxmdashIDF weighted 702259 Shared IP visitsmdashAvgmdashIDF weighted 70938 9 Shared IP visitsmdashAvgmdashIDF weighted 70525

10 Cosine similarity 80338 10 Cosine similarity 804511 Cosine similaritymdashIDF weighted 80213 11 Cosine similaritymdashIDF weighted 8086312 GSN Jaccard 80925 12 GSN Jaccard 913 GSN JaccardmdashIDF weighted 80325 13 GSN JaccardmdashIDF weighted 9013814 GSN Jaccard Freq 80425 14 GSN Jaccard Freq 80515 GSN Jaccard FreqmdashIDF weighted 80425 15 GSN Jaccard FreqmdashIDF weighted 805

to visualize the results Tables 20 and 21 show theaverage rank aggregation (see the discussion of rankaggregation above) in terms of lift per RTB and thep-values of a signed rank test respectively

When comparing the different temporal settingswe see that the ldquoDayrdquo variant performs best and lim-ited differences exist between the time periods of 2and 12 hours More important to observe from theseresults is that (overall) not including time in the GSNdesign outperforms all three temporal variants sub-stantially Only for RTB 3 and the Day version whereusers are connected only if they visit the same IPaddress on the same day are the lifts not signifi-cantly worse than having no time constraint at a 1level (although even here they are significantly worseat a 10 level) This shows that for finding userswith similar interests which locations they visit ismore important than when they visit them By includ-ing the time constraints connections between previ-ously connected users are lost simply because they

Figure 9 (Color online) Lifts of the Publishers (for All Four RTBs)Ranked According to the Median Lift Within the Category(Medians Shown by the Full Line)

visit locations in different time periods Although weuse time to define links metrics that include time todefine the strength of a link might improve the resultsfurther and we consider that an interesting issue forfuture research

8 ConclusionThis paper presents a new design for using geosim-ilarity to connect user instances in a geosimilarity-weighted network In the GSN users are connected ifthey share at least one observed location Various met-rics can be used to yield degrees of similarity We pre-sented intuitive arguments theory and prior researchthat suggest that geosimilarity and the GSN designshould link similar users We also provided strongempirical support Specifically Study 1 shows thatthe GSN indeed links different users instances cor-responding to the same individual and the geosimi-larity metrics rank user instances by their likelihoodof corresponding to the same individual Study 2shows that the GSN links users who have similarinterests as measured by their propensity to visitthe same publishersuse the same apps It also pro-vides strong evidence that the geosimilarity metricsrank user instances by their likelihood of havingsimilar interests Taken together the results providestrong support that the GSN links users with similarinterestsmdashin many cases because they actually repre-sent the same individuals

The main limitation of the research presented inthis paper is that the data are not sufficient to dis-tinguish whether the similarity in interests is solelybased on connecting instances of the same user Thisis an important avenue for future research This lim-itation notwithstanding the very strong results fromStudy 2 have important implications in either caseeither the network is indeed finding different indi-viduals with substantially similar interests or we

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers262 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

Figure 10 (Color online) Lifts Considering All Network Neighbors per RTB System (Each Panel) With and Without the Time Constraint Limited toOne Network (see sect71) for the 40 Publishers

RTB 3mdashAll NNs

0 5 10

Publisher

0

20

40

60

80

100

Lift

RTB 4mdashAll NNs

0 5 10

Publisher

0

200

400

600

800

1000

Lift

RTB 2mdashAll NNs

0 5 10

Publisher

0

200

400

600

800

1000

Lift

0 5 100

200

400

600

800

1000RTB 1mdashAll NNs

Publisher

Lift

All

TP2

TP12

Day

Note Lifts higher than 1000 are shown at 1000

Figure 11 (Color online) Lifts Considering the Closest 1 Network Neighbor per RTB System (Each Panel) With and Without the Time ConstraintLimited to One Network (see sect71) for the 40 Publishers

Lift

0

100

200

400

300

RTB 3mdashTop 1 NNs

0 5 10

Publisher

All

TP2

TP12

Day

0 5 100

200

400

600

800

1000RTB 1mdashTop 1 NNs

Publisher

Lift

RTB 2mdashTop 1 NNs

0 5 10

Publisher

0

200

400

600

800

1000

Lift

RTB 4mdashTop 1 NNs

0 5 10

Publisher

0

200

400

600

800

1000

Lift

Note Lifts higher than 1000 are shown at 1000

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 263

Figure 12 (Color online) Lifts Considering the Closest 10 Network Neighbors per RTB System (Each Panel) With and Without the Time ConstraintLimited to One Network (see sect71) for the 40 Publishers

RTB 2mdashTop 10 NNs

0 5 10

Publisher

0

200

400

600

800

1000

Lift

0 5 10

Publisher

Lift

0

50

100

150

0 5 10

Publisher

0

200

400

600

800

1000

Lift

0 5 100

200

400

600

800

1000

RTB 1mdashTop 10 NNs

RTB 3mdashTop 10 NNs RTB 4mdashTop 10 NNs

Publisher

Lift

All

TP2

TP12

Day

Note Lifts higher than 1000 are shown at 1000

have very strong additional support for the rankingresults of Study 1 as the neighbors with the strongestgeosimilarity have significantly higher affinity for thesame publishersapps

These results have broad and immediate manage-ment implications within our motivating applicationof mobile advertising As discussed above digitalmarketers would like to target the many fragmentedinstances of individuals in the digital advertising

Table 20 Rank Aggregations in Terms of Lift AveragedAcross All Publishers (per RTB)

All TP2 TP12 Day

RTB 1 101 209 309 201RTB 2 1 204 304 302RTB 3 101 3 4 109RTB 4 101 208 308 203

Note Linking based on time does not improve over linkingignoring time

Table 21 Evaluation of Including Time p-Values ofSigned Rank Test

All TP2 TP12 Day

RTB 1 mdash 0 0002 0 0002 0 0006RTB 2 mdash 0 0002 0 0002 0 0002RTB 3 mdash 0 0002 0 0002 00065RTB 4 mdash 0 0002 0 0002 0 0004

ecosystem as well as users who have similar intereststo particular ldquoseedrdquo users Location data such asIP addresses are readily available across the adver-tising ecosystem The most direct use of the GSN formobile ad targeting is to seed a targeting campaignwith users chosen to exhibit some characteristics ofinterest These could be chosen through any of themyriad of methods currently used in advertising TheGSN will directly allow the targeting of the othermembers of these usersrsquo geosimilarity cohorts whichwill include other instances of the same user othersimilar users and likely both So for example if a setof users has been identified to have brand affinity viavisits to a brandrsquos website the GSN cohorts of theseusers would provide an attractive avenue to expandthe reach of a campaign to target them (extendingtraditional retargeting) and also target users similarto the seeds (customer prospecting) In the researchdata set we see that different instances of the sameusers are very often connected in the GSN and thatthose connected in the GSN exhibit substantially sim-ilar interests

The GSN also could be used for (privacy-friendly)hyperlocal targetingmdashmeaning targeting people whofrequent a particular location without needing to storedata on the actual locations of the users as describedearlier IP addresses currently are used by somemarketers for coarse-grained demographic targetinginferring geography from IP address registration data

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers264 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

and then inferring demographics from the geographyThe GSN provides a complementary alternative it willconnect people who visit the same fine-grained loca-tions As described above in the couponing exampleif a campaign is seeded with customers of a localestablishment (eg via an online loyalty program)geosimilarity can target others who frequent the samelocations

An exciting potential future use of the GSN is theevaluation of marketing campaigns across differentchannels This has become an important challenge formobile advertizing (Lee 2014)14 For example I mightreceive an ad for a hot new product on my mobiledevice I may be interested in the product but Irsquomunlikely to buy this product on my mobile phoneRather Irsquoll buy the product using my laptop or PC Bylinking the screens of the same user we are now ableto provide a broader (and possibly much more robust)estimate of campaign success (eg conversion rate)across different channels by aggregating the successmetrics across GSN neighbors For example similar tohow a traditional ad campaign may runnot run anad in a controlled study across different cities agen-cies could targetnot target different (matched) usersand then look at the success rates in their respec-tive geosimilarity neighborhoods This should captureeffects of the same individual on different devices

As discussed at the outset advertisers need to becognizant of privacy concerns regarding the collec-tion storage and use of data This paper followswhat the FTC calls the ldquoprivacy by designrdquo approachBy design the technique does not need to store anydirect personal information about mobile users thereis no need for nonanonymized identifiers demo-graphics geographics psychographics etc In addi-tion the storage of indirect information about userscan be severely limited as well The method doesnot need the actual locationsmdashonly anonymized loca-tion ldquokeysrdquo So for example IP addresses can bereplaced with random numbers15 without affectingthe GSN performance (European Commission 2014Federal Trade Commission 2012) More technically atthe ldquoouter wallrdquo of the system or firm each deviceid can be irreversibly hashed to a random key Theonly requirement is that the same device be hashedto the same key if encountered again Similarly atthe ldquoouter wallrdquo of the system or firm every loca-tion also can be irreversibly hashed to a random keyThe GSN can be formed just the same with the ran-dom keys as with the actual locations If more privacy

14 httpwwwclickzcomclickznews233699680-of-marketers-will-run-cross-channel-marketing-campaigns-in-2014-study15 See both the FTCrsquos and the European Commissionrsquos 2010ndash2011privacy-by-design reports and the comments therein along withProvost et al (2009) for a thorough treatment of this topic

is desired hashing can be done irreversibly many toone and in that case it becomes impossible to asso-ciate definitively any particular location with any par-ticular deviceuser

Finally an aspect of this research that is importantfor managers but not typically covered in the predic-tive modeling literature is the notion of increasing thereach of a campaign For example as discussed abovepossibly the most straightforward use of the GSN isto select the same actor on different mobile devicesThis would allow us to expand the reach of a ldquoretar-getingrdquo campaign16

Increasing reach has important subtleties in the cur-rent advertising ecosystemmdashit is not just the otherside of the coin of increasing lift The reason is thatthere are many targeters in the onlinemobile adver-tising ecosystem all of whom are using the samedata retargeting data and demographicgeographicpsychographic data that are purchased from third-party data providers However these data are avail-able only on a subset of devices Thus all partieswho are using these data are competing for the samesometimes small set of devices Since large advertis-ers typically contract with multiple targeting firmswho take different strategies the effect is that theadvertiser is paying the firms to compete against eachother effectively raising the price in the auction andthereby raising their own cost of advertising17 Whatis more these will be the same devices that also willbe targeted for other campaigns because they are thedevices for which the targeters have data

However by connecting devices in a GSN we canexpand campaigns to devices for which there are noretargeting or third-party data available at all If thereis less competition for these devices we should beable to target them for a lower price Thus expand-ing reach has an implication for the cost-effectivenessof achieving a certain level of predictive performance(eg lift)

AcknowledgmentsThe authors thank the anonymous reviewers and edi-tors for very constructive suggestions which substantiallyimproved the paper This work was done while the firstauthor was at Coriolis Ventures and Everyscreen MediaThe authors extend our gratitude to Everyscreen MediaDstillery Lauren Moores Tom OrsquoReilly Vinayak Javaly andTina Eliassi-Rad for many discussions about mobile adver-tising and mobile ad targeting as well as technical assistance

16 Retargeting is to target browsers who have previously purchasedfrom the brand or who have taken some other indicative brandaction such as browsing the brandrsquos site Retargeting is consid-ered by many to be one of the most effective targeting strategies(although to our knowledge these conclusions are drawn based onassessing conversion rate rather than the influence of the advertis-ing please see Stitelman et al 2011)17 The latter is our conjecture

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 265

to obtain all of the relevant data This particular data set wasnot necessarily used in the development of any productionmodel used for mobile advertising The first author thanksNEC and Andre Meyer for faculty fellowships The authorsthank the Moore and Sloan Foundations for their generoussupport of the Moore-Sloan Data Science Environment atNew York University

ReferencesAgarwal R Gupta AK Kraut RE (2008) Editorial overviewmdashThe

interplay between digital and social networks Inform SystemsRes 19(3)243ndash252

Aral S Muchnik L Sundararajan A (2009) Distinguishing influence-based contagion from homophily-driven diffusion in dynamicnetworks Proc National Acad Sci USA 106(51)21544ndash21549

Bampo M Ewing MT Mather DR Stewart D Wallace M (2008)The effects of the social structure of digital networks on viralmarketing performance Inform Systems Res 19(3)273ndash290

Cho E Myers SA Leskovec J (2011) Friendship and mobility Usermovement in location-based social networks Proc 17th ACMSIGKDD Internat Conf Knowledge Discovery and Data MiningKDD rsquo11 (ACM New York) 1082ndash1090

Cortes C Pregibon D Volinsky CT (2001) Communities of interestHoffmann F Hand DJ Adams N Fisher D Guimaraes G edsAdvances in Intelligent Data Analysis Lecture Notes ComputSci Vol 2189 (Springer-Verlag Berlin Heidelberg) 105ndash114

Crandall DJ Backstrom L Cosley D Suri S Huttenlocher DKleinberg J (2010) Inferring social ties from geographic coinci-dences Proc National Acad Sci USA 107(52)22436ndash22441

de Montjoye YA Hidalgo CA Verleysen M Blondel VD (2013)Unique in the crowd The privacy bounds of human mobilitySci Rep 31ndash5

European Commission (2014) Data protection day 2014 Full speedon EU data protection reform Press release January 27httpeuropaeurapidpress-release_MEMO-14-60_enhtm

Fawcett T (2006) An introduction to ROC analysis Pattern Recogni-tion Lett 27(8)861ndash874

Federal Trade Commission (2012) Protecting consumer pri-vacy in an era of rapid change FTC Report March 2012httpswwwftcgovsitesdefaultfilesdocumentsreportsfederal-trade-commission-report-protecting-consumer-privacy-era-rapid-change-recommendations120326privacyreportpdf

Heidemann J Klier M Probst F (2010) Identifying key users inonline social networks A pagerank based approach InformSystems J 4801(December)157ndash160

Hill S Provost F (2003) The myth of the double-blind reviewAuthor identification using only citations SIGKDD Explorations5(2)179ndash184

Hill S Provost F Volinsky C (2006) Network-based marketingIdentifying likely adopters via consumer networks Statist Sci22(2)256ndash276

Hotho A Nuumlrnberger A Paass G (2005) A brief survey of text min-ing LDV Forum 20(1)19ndash62

Kerho S (2012) Mobile marketingmdashA new analytics frameworkwhat we have and what we need Presentation Marketing onthe Move Conference Wharton School University of PennsylvaniaPhiladelphia February 28

Kossinets G Watts DJ (2009) Origins of homophily in an evolvingsocial network Amer J Sociol 115(2)405ndash450

Lee J (2014) 80 of marketers will run cross-channel marketingcampaigns in 2014 ClickZ (March 31) httpwwwclickzcomclickznews233699680-of-marketers-will-run-cross-channel-marketing-campaigns-in-2014-study

Martens D Provost F (2011) Pseudo-social network targeting fromconsumer transaction data Working paper CEDER-11-05 SternSchool of Business New York University New York

McPherson M Smith-Lovin L Cook JM (2001) Birds of afeather Homophily in social networks Annual Rev Sociol 27415ndash444

Oinas-Kukkonen H Lyytinen K Yoo Y (2010) Social networks andinformation systems Ongoing and future research streamsJ Assoc Inform Systems 11(2)61ndash68

Pan W Aharony N Pentland A (2011) Composite social networkfor predicting mobile apps installation Burgard W Roth Deds Proc AAAI 2011 (AAAI Press San Francisco)

Perlich C Dalessandro B Raeder T Stitelman O Provost F (2014)Machine learning for targeted display advertising Transferlearning in action Machine Learn 95(1)103ndash127

Pogue D (2011) Wrapping up the apple location brouhaha NewYork Times (April 28) httppogueblogsnytimescom20110428wrapping-up-the-apple-location-brouhaha

Provost F Fawcett T (2013) Data Science for Business What You Needto Know About Data Mining and Data-analytic Thinking (OrsquoReillyMedia Inc Sebastopol CA)

Provost F Dalessandro B Hook R Zhang X Murray A (2009) Audi-ence selection for on-line brand advertising Privacy-friendlysocial network targeting KDD rsquo09 Proc 15th ACM SIGKDDInternat Conf Knowledge Discovery and Data Mining (ACMNew York) 707ndash716

Pubmatic (2010) Understanding real-time bidding (RTB) from thepublisherrsquos perspective Technical report Pubmatic SiliconValley CA

Quercia D Lathia N Calabrese F Di Lorenzo G Crowcroft J (2010)Recommending social events from mobile phone location dataProc 2010 IEEE Internat Conf Data Mining ICDM rsquo10 (IEEEComputer Society Washington DC) 971ndash976

Raeder T Stitelman O Dalessandro B Perlich C Provost F (2012)Design principles of massive robust prediction systems Proc18th ACM SIGKDD Internat Conf Knowledge Discovery and DataMining (ACM New York) 1357ndash1365

Stitelman O Dalessandro B Perlich C Provost F (2011) Estimatingthe effect of online display advertising on browser conversionData Mining and Audience Intelligence for Advertising ADKDD2011 (ACM New York) 8

Stitelman O Perlich C Dalessandro B Hook R Raeder T Provost F(2013) Using co-visitation networks for detecting large scaleonline display advertising exchange fraud Proc 19th ACMSIGKDD Internat Conf Knowledge Discovery and Data Mining(ACM New York) 1240ndash1248

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Page 15: Finding Similar Mobile Consumers with a Privacy-Friendly ...pages.stern.nyu.edu/~fprovost/Papers/PMM_ISR_2015.pdf · Provost, Martens, and Murray: Finding Similar Mobile Consumers

Provost Martens and Murray Finding Similar Mobile Consumers256 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

Table 15 When There Are More Than Two Network Neighbors to What Extent Is the Known Same User StronglyConnected to Itself

AUC (if gt 2 NNs) () Average rank percentile (if gt 2 NNs) ()

Ranking metric R T RTB 12 RTB 23 RTB 31 R T RTB 12 RTB 23 RTB 31

1 78 64 68 59 55 29 48 34 43 542 59 58 67 58 54 45 53 34 44 553 79 69 74 61 61 29 44 29 40 484 86 75 69 61 59 24 40 32 41 505 83 70 71 59 63 26 44 31 43 476 85 74 71 60 65 25 41 31 42 457 86 76 73 62 64 24 40 29 40 468 83 70 72 59 65 26 44 30 43 459 85 74 71 60 67 25 41 30 42 44

10 79 69 69 61 63 29 44 32 41 4711 79 69 71 62 64 44 44 30 41 4612 81 68 68 60 62 28 46 34 42 4913 81 70 71 61 64 27 44 30 40 4614 74 67 66 61 48 34 47 35 41 6015 74 67 66 61 48 34 47 35 41 60

Note See the text for a discussion of ranking metrics

extent data points with label 1 are ranked higherthan those with label 0 In the results that follow(see Table 15) the AUC is measured for each samplewhere any user instance among the neighbors thatcorresponds to the same individual is labeled as 1and all others are labeled as 0 Thus in this settingthe AUC measures how well the neighbors are rankedby the likelihood of being the same individual11 Theaverage over all samples is reported If for a sampleall neighbors have the same score that AUC is setto 05 In addition to the factors noted above this alsowill tend to make the results conservative For exam-ple if the GSN links an ego user to a set of neighborsthat are all instances of the same individual in thisevaluation the AUC will be 05

The ranking measure is the percentile in the rank-ing at which the known-same-user instance is rankedwith lower values being better For example if thesame-user instance has the highest score (closest con-nection) among five neighbors the rank is 20 (15)Note again that this is quite conservative in thisexample the user is ranked at the top of the list butbecause there were only five neighbors the highestpossible score is 20 In case the known-same-userinstance has the same score as another neighbor theaverage rank is reported

As shown in Table 15 all AUC values exceed 05ranging up to 08 This means that in every casea known-same-user instance is likely to be rankedhigher than an instance not known to be the same

11 Technically in this setting the AUCMWW is the probability thata randomly selected neighbor that is in fact a known instance of thesame user is ranked more highly than a randomly selected neighborthat is not known to be an instance of the same user If the known-same-user instances are all ranked above the other-user instancesthe AUC = 100 If they are all below then the AUC = 0

user The frequency-based ranking metrics (4ndash9) per-form quite well in connecting instances of the sameuser strongly

Almost all metrics perform better than a randommodel with performances that even come close to thebest possible solution Consistently performing quitewell are the frequency-based metrics (metrics 3ndash9)with metric 7 (the minimum of the IDF weights ofthe shared locations) getting the most wins Not per-forming well is the very basic count metric as well asthe Boolean metric that provides a binary score onlyInteresting to note the Jaccard metrics do not performwell either It seems that penalizing users that visitmany IP addresses negatively affects the results anissue we will return to in the next study

7 Study 2 Does the GSN Select Userswith Similar Interests

The foregoing section addressed our paperrsquos firstclaimmdashthat the GSN design indeed links instances ofthe same user and links them more strongly thaninstances of other users In this section we turn to thepaperrsquos second claimmdashthat the GSN links users withsimilar interests We will measure similar interests bysimilarity in behavior accessing particular publisherson the Web and using mobile apps12

71 Connecting Users with Similar Interests

To What Extent Are the GSN Neighbors of Visi-tors to Particular Publishers Also Visitors to ThoseSame Publishers To assess whether the GSN con-nects users with similar interestsbehavior we select

12 From now on all user instances will be complete real user in-stances the simulated split-up scenarios (Random Temporal) werecreated specifically for the purpose of the same-user study

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 257

Table 16 Distribution of Publishers per RTB in Terms of thePercentage of Users in Our Data Set That Visit the GivenPublisher

Publisher of D

RTB 1Burstly 3004myYearbookcom 1062My Fitness Pal 0065TextMe Inc 0056GameResort 0025Flixster 0019247Sports 0015Daily Workout Apps LLC 0009Mobile Deluxe 0007YoYo Games Ltd 0004

RTB 2mworldnowcom 0032mtmzcom 0024mtopixcom 0014babycentercom 0015meetmecom 0006appsfacebookcom 0006itunesapplecom 0039appevitecom 0002beautyandskincarescom 0003mmajunkiecom 0001

RTB 3Conversion Exchange 5053Pinger Inc 2096Top Game Developer 4029PremiumEntertainmentApp-FamilyOriented-Android 0059Enflick 0053Clapfoot Games 0044FunPokes Inc 0018Talkatone 0015Sevenlogics Inc 0008A Star Software 0007

RTB 4Pinger PhonemdashiOSmdashConversation 1001Talkatone iPhone App 0007DiceWithBuddies 00139GAG Reader 0012Video Downloader Pro Lite 0016Rage Comics 0009Spades 3D Lite 0006Apalabrados 0003Relax Melodies HD 0002Celeb MemdashPhotoMaker 0000

a diverse collection of 40 publishers listed in Table 16A publisher can correspond to a mobile app a web-site or a set of appswebsites We selected both pub-lishers that are visited by many users and more nichepublishers with fewer visitors As can be seen fromTable 16 several types of publishers can be distin-guished the main groups are games (eg GameRe-sort Top Game Developer) and social networkingtools (eg appsfacebookcom Text Me Inc)

More specifically we create 200 samples for 40 pub-lishers 10 publishers for each of four different RTB

exchanges listed in Table 1613 The percentage ofunique users that visit each publisher in our completedata set is also shown (D) To create each samplean ldquoegordquo user is selected that has visited the pub-lisher in question (an ldquoego visitorrdquo) this ego visi-torrsquos GSN neighborhood is created Next we assesshow many of the egorsquos neighbors also visited thatsame publisher This will be compared with a base-line visit rate to produce lift and leverage values(Provost and Fawcett 2013)mdashspecifically how muchmore likely is it for the neighbors of ego visitors tovisit the publisher than would be expected by chanceWe consider three different setups that produce dif-ferent baselines

1 Considering all RTB exchanges to determine the base-line A user is considered a visitor for a given pub-lisher if she visits the publisher on any RTB ex-change (since some publishers are visible on differentexchanges) Hence the baseline visit rate is the num-ber of users (on any RTB) who visit that publisherdivided by the total number of users across all RTBexchanges This will provide optimistic lifts sincemost publishers appear only on one RTB exchangewhereas the baseline is measured over all exchanges

2 Considering one RTB exchange only to determine thebaseline A user is considered a visitor for a given pub-lisher and RTB if she visits the publisher on the sameRTB exchange This changes the baseline to the num-ber of users (on this RTB) who visit that publisherdivided by the total number of users on this RTBexchange The lift and leverage results will be worsethan in the first setting because the baseline will belarger Seeing that some publishers are seen on differ-ent RTB exchanges this setting is slightly pessimistic

3 Taking time of location visits into account In thethird setting we link users only if they visit an IPwithin the same time period This will be elaboratedon in sect73

Results are shown in Tables 17 and 18 The percent-age of publisher visitors in the neighborhood is givenas N This percentage is compared with the baselinepercentage of users that visit that publisher in thecomplete data D using two metrics the lift is the ratioof these numbers the leverage is the difference Liftis particularly useful for very small base rates (con-sider that an activity with a base rate of 30 cannotpossibly have a lift higher than about 3) Leverage ismore telling for larger base rates because it shows theabsolute improvement (here in percentage points)

When considering all NNs the lifts for all publish-ers are greater than 1 except for one Burstly on RTB 1has a lift of 097 (essentially no lift) but only when weconsider RTB 1 as the baseline This publisher already

13 For confidentiality the RTB numbering here is again differentfrom above

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers258 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

Table 17 RTBs 1 and 2 Percentage of Network Neighbors That Also Visit the Publisher

RTB 1

All NNs N1 N10

D N Lift Lev N1 Lift Lev N10 Lift Lev

Considering all RTB exchanges to determine baseline1 Burstly 3004 8063 2084 5059 47022 15054 44018 25061 8043 220572 myYearbookcom 1062 25077 15094 24015 44074 27067 43012 35016 21074 330543 My Fitness Pal 0065 4015 6038 305 3609 56077 36025 15055 23091 14094 TextMe Inc 0056 1805 33006 17094 39002 69075 38046 25018 45 240625 GameResort 0025 4093 19077 4068 18052 74026 18027 11086 47058 110626 Flixster 0019 1055 8015 1036 8014 42085 7095 1079 9042 1067 247Sports 0015 27057 188058 27042 77062 530093 77048 74053 509075 740388 Daily Workout Apps LLC 0009 1092 21047 1083 19023 215046 19014 11054 129027 110459 Mobile Deluxe 0007 1087 28067 108 23053 36009 23046 9063 14707 9056

10 YoYo Games Ltd 0004 3052 99058 3049 21057 609064 21053 13 367045 12096Considering this RTB exchange only to determine baseline

1 Burstly 8085 8063 0097 minus0022 47022 5033 38037 25061 2089 160762 myYearbookcom 4071 25077 5047 21006 44074 905 40003 35016 7046 300453 My Fitness Pal 1089 4015 2019 2025 3609 19048 35001 15055 8021 130654 TextMe Inc 1063 1805 11035 16087 39002 23094 37039 25018 15045 230555 GameResort 0073 4093 6078 402 18052 25049 17079 11086 16033 110146 Flixster 0055 1055 208 0099 8014 14071 7059 1079 3023 10247 247Sports 0043 27057 64073 27014 77062 182023 7702 74053 174097 74018 Daily Workout Apps LLC 0026 1092 7037 1066 19023 73095 18097 11054 44037 110289 Mobile Deluxe 0019 1087 9084 1068 23053 123088 23034 9063 5007 9044

10 YoYo Games Ltd 001 3052 34018 3042 21057 209025 21047 13 126012 1209

RTB 2

Considering all RTB exchanges to determine baseline1 mworldnowcom 0032 13027 40086 12095 54039 167042 54006 31065 97044 310332 mtmzcom 0024 6089 28086 6065 23008 9607 22084 10068 44075 100443 mtopixcom 0014 15007 104079 14093 53045 37106 5303 33075 234065 330614 babycentercom 0015 6087 47031 6072 43048 299056 43033 17074 122024 17065 meetmecom 0006 6002 10308 5097 54072 942093 54066 27085 47909 270796 appsfacebookcom 0006 301 48039 3003 9009 141097 9003 2076 43008 20697 itunesapplecom 0039 25077 65071 25038 51043 131012 51004 29049 75018 290098 appevitecom 0002 48028 21212043 48025 54076 21509068 54074 48028 21212043 480259 beautyandskincarescom 0003 18019 642087 18016 8048 299094 8046 1107 413049 11067

10 mmajunkiecom 0001 1036 201064 1035 26067 3195505 26066 13033 11977074 13033Considering this RTB exchange only to determine baseline

1 mworldnowcom 4034 13027 3006 8093 54039 12053 50005 31065 7029 270312 mtmzcom 3019 6089 2016 307 23008 7024 19089 10068 3035 70493 mtopixcom 1092 15007 7084 13015 53045 27081 51053 33075 17056 310834 babycentercom 1094 6087 3054 4093 43048 22042 41054 17074 9015 15085 meetmecom 0078 6002 7077 5025 54072 70057 53094 27085 35092 270076 appsfacebookcom 0086 301 3062 2024 9009 10063 8024 2076 3022 1097 itunesapplecom 5024 25077 4092 20053 51043 9081 46019 29049 5063 240258 appevitecom 0029 48028 165058 47098 54076 187082 54047 48028 165058 470989 beautyandskincarescom 0038 18019 48011 17081 8048 22045 8011 1107 30095 11032

10 mmajunkiecom 0009 1036 15009 1027 26067 296003 26058 13033 148001 13024

Notes Results are given for the complete data set (D) the complete neighborhood of a user who visited the given publisher (All NNs) and considering theneighborhood of closest 1 and 10 users (N1 and N10)

has a very high baseline and as we will see in thenext section when we consider only the top 1 and 10NNs the lifts go up to 5 and 3 respectively Burstlyprovides a generic platform for building apps and sois a weaker indication of user interest than many ofthe other publishers (fitness babies games)

All other lifts exceed one by substantial mar-gins and go up to even 2391 (Apalabrados onRTB 4 a Spanish-language social Scrabble-like game)

Obviously the conclusion that the NNs indeed aremore likely to visit the same publishers is highly sig-nificant by a simple sign test comparing the visita-tion rate within the geosimilarity neighborhood tothat of overall population either via lift or leveragep lt 0001 Thus NNs of publisher visitors are indeedmore likely also to visit the same websites the abso-lute values of the lifts and leverages suggest that theyare substantially more likely

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 259

Table 18 RTBs 3 and 4 Percentage of Network Neighbors That Also Visit the Publisher

RTB 3

All NNs N1 N10

D N Lift Lev N1 Lift Lev N10 Lift Lev

Considering all RTB exchanges to determine baseline1 Conversion Exchange 5053 34054 6024 29001 64071 11069 59017 50077 9018 450242 Pinger Inc 2096 46002 15053 43006 65012 21097 62015 51048 17037 480523 Top Game Developer 4029 47024 11002 42095 60 13099 55071 48097 11042 440684 PremiumEntertainmentApp-FamilyOriented-Android 0059 23079 40012 2302 20045 34049 19086 22025 37052 210665 Enflick 0053 11071 21095 11017 21057 40044 21004 12027 23 110736 Clapfoot Games 0044 8055 19032 8011 11054 26006 1101 13099 31061 130557 FunPokes Inc 0018 11011 61028 10093 50 275077 49082 15079 87008 150618 Talkatone 0015 13013 88051 12098 31011 209076 30096 17091 120078 170779 Sevenlogics Inc 0008 2071 32059 2063 5088 70074 508 8033 100022 8025

10 A Star Software 0007 3033 44086 3025 23008 311024 23 9009 122061 9002Considering this RTB exchange only to determine baseline

1 Conversion Exchange 11045 34054 3002 23009 64071 5065 53025 50077 4043 390322 Pinger Inc 6013 46002 705 39089 65012 10062 58098 51048 8039 450353 Top Game Developer 8088 47024 5032 38036 60 6076 51012 48097 5052 400094 PremiumEntertainmentApp-FamilyOriented-Android 1023 23079 19038 22057 20045 16066 19023 22025 18013 210025 Enflick 101 11071 1006 1006 21057 19054 20046 12027 11011 110166 Clapfoot Games 0092 8055 9033 7064 11054 12059 10062 13099 15027 130087 FunPokes Inc 0038 11011 2906 10074 50 133022 49062 15079 42007 150418 Talkatone 0031 13013 42076 12082 31011 101034 3008 17091 58035 170619 Sevenlogics Inc 0017 2071 15074 2054 5088 34017 5071 8033 48041 8016

10 A Star Software 0015 3033 21067 3017 23008 150036 22092 9009 59023 8094

RTB 4

Considering all RTB exchanges to determine baseline1 Pinger PhonemdashiOSmdashConversation 1001 71081 70085 7008 80077 79068 79076 70014 6902 690132 Talkatone iPhone App 0007 11038 17203 11031 29003 439068 28097 17002 257078 160963 DiceWithBuddies 0013 7041 59004 7028 18018 144092 18006 7069 61031 70574 9GAG Reader 0012 62013 497099 62 49009 39305 48097 49005 393017 480925 Video Downloader Pro Lite 0016 20003 123015 19087 22058 138084 22042 17093 110027 170776 Rage Comics 0009 609 8001 6081 13033 154086 13025 609 8001 60817 Spades 3D Lite 0006 1205 225076 12044 27027 492057 27022 1205 225076 120448 Apalabrados 0003 77042 21391071 77039 75 21316097 74097 79032 21450043 790299 Relax Melodies HD 0002 31058 21002029 31056 50 31170029 49098 31058 21002029 31056

10 Celeb MemdashPhotoMaker 0 3023 82107 3022 7014 11819048 7014 3023 82107 3022Considering this RTB exchange only to determine baseline

1 Pinger PhonemdashiOSmdashConversation 8051 71081 8044 6303 80077 9049 72026 70014 8024 610632 Talkatone iPhone App 0055 11038 20053 10082 29003 52038 28048 17002 30071 160473 DiceWithBuddies 0009 7041 82011 7032 18018 201054 18009 7069 85027 7064 9GAG Reader 0097 62013 63097 61016 49009 50055 48012 49005 50051 480085 Video Downloader Pro Lite 0044 20003 45036 19059 22058 51014 22014 17093 40062 170496 Rage Comics 0037 609 18079 6053 13033 36032 12097 609 18079 60537 Spades 3D Lite 0046 1205 2609 12004 27027 58068 26081 1205 2609 120048 Apalabrados 0021 77042 371093 77021 75 360031 74079 79032 381006 790119 Relax Melodies HD 0013 31058 238054 31045 50 377069 49087 31058 238054 31045

10 Celeb MemdashPhotoMaker 0003 3023 97089 3019 7014 216076 7011 3023 97089 3019

Notes Results are given for the complete data set (D) the complete neighborhood of a user who visited the given publisher (All NNs) and considering theneighborhood of closest 1 and 10 users (N1 and N10)

The very high lifts could be attributed to two phe-nomena They may provide some evidence that theGSN embeds a true social network as pointed out bythe good results for the social networking publishers(eg Apalabrados with a lift of 2391 appsevitecomwith a lift of 2212 and others) Friends visiting thesame social networking sites (here unusual ones)

would increase the lift and leverage over just havingdifferent instances of the same user visiting the samewebsites An alternative explanation is that some ofthese publishers are much more likely to be visitedacross different devices of the same user It is not clearwhether this is actually true However it argues forobtaining specific and broad data on which devices

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers260 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

are not the same user to shed further light on forexample any broad data that include a (anonymized)user identifier

From these results we can conclude that GSNneighbors indeed exhibit substantially similar pub-lisher visitation behavior Next we assess to whatextent more strongly connected NNs are more similarthan less strongly connected NNs

72 Ranking Users with Similar Interests

Are the More-Similar Geosimilarity Neighbors of aPublisherrsquos Visitors Even More Likely Also to Be Vis-itors to the Same Publisher To assess whether more-similar geosimilarity neighbors are even more likelyto share interests we compute the measures exactlyas in the previous study except instead of consider-ing all NNs only the 1 (N1) and 10 (N10) neighborswith the highest scores are considered (see Tables 17and 18) First we must consider which of the 15 met-rics that are defined to measure link strength to useTable 19 reports the rank aggregations of the differentmetrics as to which provide the best lifts Specificallyfor each of the 40 publishers each scoring metric pro-vides a lift and for that publisher the scoring metricscan be ranked (the best being ranked 1 etc) The rankaggregation is the average rank for a scoring metricacross all 40 publishers Thus if one metric providedthe best lift for all publishers it would get an aggre-gated rank of 1

To visualize the best metrics the metrics with anaverage ranking less than 8 are shown in boldfacein Table 19 The Jaccard metrics do not perform wellin this study either showing that penalizing the con-nection to users that visit many locations is not sen-sible Remember that the Jaccard metric is based onthe idea that users that visit many locations will havea higher chance to share some location with anotheruser However we have so many locations in totalthat visiting a couple more locations will increase onlymarginally this probability of sharing a location bychance A user that logs into more IP addresses andis more active should hence not be penalized andranked lower than other less active users Metric 3which sums the IDF scores of the shared locationsperforms quite well Given the operational efficiencyadvantage over the frequency-based metrics this met-ric is chosen to measure the strength of the links

As seen from the results in Tables 17 and 18 thelift and leverage values for the strongly connected (top 1and top 10) NNs are even higher sometimes astro-nomical (3170 for Relax Melodies HD (RTB 4)) Theseresults are highly significant by sign tests compar-ing either the lift or the leverage between the top-ranked neighbors and the entire cohort (p lt 0001) Theabsolute values again indicate that the users with the

strongest geosimilarity are often substantially morelikely to visit the same publisher

Thus we can conclude that not only does geosim-ilarity find users with similar interests it also ranksusers well by their likelihood of having similarinterests

An interesting follow-up question is whether thecharacteristics of a publisher are of importance for theresults For example are GSN neighbors more sim-ilar in terms of using the same social network appas compared to playing the same games To answerthis question the publishers are categorized into sixclasses funny social news games communicationand miscellaneous In Figure 9 the lifts of the pub-lishers are shown per category Some general trendscan be observed where publishers related to funnycontent have the highest median lift This could beexplained by the fact that users often forward funnycontent to their friends The social websitesapps fol-low next which seems to further demonstrate thatfriends are likely linked in the GSN However pub-lishers in the communication category perform not sowell compared with the other categories and ratherhigh variances are observed in the different cate-gories Please note that high variance is observedacross the categories therefore it is difficult to makewell-supported claims from these results

73 TimingAs a final evaluation we explore whether the inclu-sion of temporal similarity in location visitation ishelpful for defining GSN connections To this end weintroduce a setting where two users are connectedonly if they visit an IP address in the same timeperiod of a day Specifically we divide a day intodifferent time windowsmdashfor example two time peri-ods of 12 hoursmdashwhere users are connected only ifthey visit the same IP address in the same time periodof a day (which might be on different days) Practi-cally two digits are added to an IP which denote thestarting moment of the corresponding time periodWe report on time periods of 2 hours and of 12 hours(other periods yield similar results) We also includethe setting where users are connected only when theyvisit the same IP address on the same day in light ofthe result of Crandall et al (2010) that visiting thesame location at around the same time is indicativeof friendship

The resulting lifts are reported in Figures 10ndash12when considering respectively all NNs the top 1 NNand the top 10 NNs The results when not consider-ing the time period are repeated as well (All) lim-ited to the conservative case of only considering oneexchange (see above) The baseline (lift of one) isindicated with a red horizontal line lifts higher than1000 are shown as 1000 to limit the range to be able

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 261

Table 19 Rank Aggregation per Metric Over All Publishers (Across All RTBs)

Top 1 NN Top 10 NNs

1 Shared unique IPs 801 1 Shared Unique IPs 802252 BooleanmdashSharing an IP 80625 2 BooleanmdashSharing an IP 809133 Shared unique IPsmdashIDF weighted 70025 3 Shared unique IPsmdashIDF weighted 709754 Shared IP visitsmdashMin 70063 4 Shared IP visitsmdashMin 704135 Shared IP visitsmdashMax 7095 5 Shared IP visitsmdashMax 70256 Shared IP visitsmdashAvg 8005 6 Shared IP visitsmdashAvg 704887 Shared IP visitsmdashMinmdashIDF weighted 60625 7 Shared IP visitsmdashMinmdashIDF weighted 703758 Shared IP visitsmdashMaxmdashIDF weighted 70975 8 Shared IP visitsmdashMaxmdashIDF weighted 702259 Shared IP visitsmdashAvgmdashIDF weighted 70938 9 Shared IP visitsmdashAvgmdashIDF weighted 70525

10 Cosine similarity 80338 10 Cosine similarity 804511 Cosine similaritymdashIDF weighted 80213 11 Cosine similaritymdashIDF weighted 8086312 GSN Jaccard 80925 12 GSN Jaccard 913 GSN JaccardmdashIDF weighted 80325 13 GSN JaccardmdashIDF weighted 9013814 GSN Jaccard Freq 80425 14 GSN Jaccard Freq 80515 GSN Jaccard FreqmdashIDF weighted 80425 15 GSN Jaccard FreqmdashIDF weighted 805

to visualize the results Tables 20 and 21 show theaverage rank aggregation (see the discussion of rankaggregation above) in terms of lift per RTB and thep-values of a signed rank test respectively

When comparing the different temporal settingswe see that the ldquoDayrdquo variant performs best and lim-ited differences exist between the time periods of 2and 12 hours More important to observe from theseresults is that (overall) not including time in the GSNdesign outperforms all three temporal variants sub-stantially Only for RTB 3 and the Day version whereusers are connected only if they visit the same IPaddress on the same day are the lifts not signifi-cantly worse than having no time constraint at a 1level (although even here they are significantly worseat a 10 level) This shows that for finding userswith similar interests which locations they visit ismore important than when they visit them By includ-ing the time constraints connections between previ-ously connected users are lost simply because they

Figure 9 (Color online) Lifts of the Publishers (for All Four RTBs)Ranked According to the Median Lift Within the Category(Medians Shown by the Full Line)

visit locations in different time periods Although weuse time to define links metrics that include time todefine the strength of a link might improve the resultsfurther and we consider that an interesting issue forfuture research

8 ConclusionThis paper presents a new design for using geosim-ilarity to connect user instances in a geosimilarity-weighted network In the GSN users are connected ifthey share at least one observed location Various met-rics can be used to yield degrees of similarity We pre-sented intuitive arguments theory and prior researchthat suggest that geosimilarity and the GSN designshould link similar users We also provided strongempirical support Specifically Study 1 shows thatthe GSN indeed links different users instances cor-responding to the same individual and the geosimi-larity metrics rank user instances by their likelihoodof corresponding to the same individual Study 2shows that the GSN links users who have similarinterests as measured by their propensity to visitthe same publishersuse the same apps It also pro-vides strong evidence that the geosimilarity metricsrank user instances by their likelihood of havingsimilar interests Taken together the results providestrong support that the GSN links users with similarinterestsmdashin many cases because they actually repre-sent the same individuals

The main limitation of the research presented inthis paper is that the data are not sufficient to dis-tinguish whether the similarity in interests is solelybased on connecting instances of the same user Thisis an important avenue for future research This lim-itation notwithstanding the very strong results fromStudy 2 have important implications in either caseeither the network is indeed finding different indi-viduals with substantially similar interests or we

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers262 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

Figure 10 (Color online) Lifts Considering All Network Neighbors per RTB System (Each Panel) With and Without the Time Constraint Limited toOne Network (see sect71) for the 40 Publishers

RTB 3mdashAll NNs

0 5 10

Publisher

0

20

40

60

80

100

Lift

RTB 4mdashAll NNs

0 5 10

Publisher

0

200

400

600

800

1000

Lift

RTB 2mdashAll NNs

0 5 10

Publisher

0

200

400

600

800

1000

Lift

0 5 100

200

400

600

800

1000RTB 1mdashAll NNs

Publisher

Lift

All

TP2

TP12

Day

Note Lifts higher than 1000 are shown at 1000

Figure 11 (Color online) Lifts Considering the Closest 1 Network Neighbor per RTB System (Each Panel) With and Without the Time ConstraintLimited to One Network (see sect71) for the 40 Publishers

Lift

0

100

200

400

300

RTB 3mdashTop 1 NNs

0 5 10

Publisher

All

TP2

TP12

Day

0 5 100

200

400

600

800

1000RTB 1mdashTop 1 NNs

Publisher

Lift

RTB 2mdashTop 1 NNs

0 5 10

Publisher

0

200

400

600

800

1000

Lift

RTB 4mdashTop 1 NNs

0 5 10

Publisher

0

200

400

600

800

1000

Lift

Note Lifts higher than 1000 are shown at 1000

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 263

Figure 12 (Color online) Lifts Considering the Closest 10 Network Neighbors per RTB System (Each Panel) With and Without the Time ConstraintLimited to One Network (see sect71) for the 40 Publishers

RTB 2mdashTop 10 NNs

0 5 10

Publisher

0

200

400

600

800

1000

Lift

0 5 10

Publisher

Lift

0

50

100

150

0 5 10

Publisher

0

200

400

600

800

1000

Lift

0 5 100

200

400

600

800

1000

RTB 1mdashTop 10 NNs

RTB 3mdashTop 10 NNs RTB 4mdashTop 10 NNs

Publisher

Lift

All

TP2

TP12

Day

Note Lifts higher than 1000 are shown at 1000

have very strong additional support for the rankingresults of Study 1 as the neighbors with the strongestgeosimilarity have significantly higher affinity for thesame publishersapps

These results have broad and immediate manage-ment implications within our motivating applicationof mobile advertising As discussed above digitalmarketers would like to target the many fragmentedinstances of individuals in the digital advertising

Table 20 Rank Aggregations in Terms of Lift AveragedAcross All Publishers (per RTB)

All TP2 TP12 Day

RTB 1 101 209 309 201RTB 2 1 204 304 302RTB 3 101 3 4 109RTB 4 101 208 308 203

Note Linking based on time does not improve over linkingignoring time

Table 21 Evaluation of Including Time p-Values ofSigned Rank Test

All TP2 TP12 Day

RTB 1 mdash 0 0002 0 0002 0 0006RTB 2 mdash 0 0002 0 0002 0 0002RTB 3 mdash 0 0002 0 0002 00065RTB 4 mdash 0 0002 0 0002 0 0004

ecosystem as well as users who have similar intereststo particular ldquoseedrdquo users Location data such asIP addresses are readily available across the adver-tising ecosystem The most direct use of the GSN formobile ad targeting is to seed a targeting campaignwith users chosen to exhibit some characteristics ofinterest These could be chosen through any of themyriad of methods currently used in advertising TheGSN will directly allow the targeting of the othermembers of these usersrsquo geosimilarity cohorts whichwill include other instances of the same user othersimilar users and likely both So for example if a setof users has been identified to have brand affinity viavisits to a brandrsquos website the GSN cohorts of theseusers would provide an attractive avenue to expandthe reach of a campaign to target them (extendingtraditional retargeting) and also target users similarto the seeds (customer prospecting) In the researchdata set we see that different instances of the sameusers are very often connected in the GSN and thatthose connected in the GSN exhibit substantially sim-ilar interests

The GSN also could be used for (privacy-friendly)hyperlocal targetingmdashmeaning targeting people whofrequent a particular location without needing to storedata on the actual locations of the users as describedearlier IP addresses currently are used by somemarketers for coarse-grained demographic targetinginferring geography from IP address registration data

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers264 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

and then inferring demographics from the geographyThe GSN provides a complementary alternative it willconnect people who visit the same fine-grained loca-tions As described above in the couponing exampleif a campaign is seeded with customers of a localestablishment (eg via an online loyalty program)geosimilarity can target others who frequent the samelocations

An exciting potential future use of the GSN is theevaluation of marketing campaigns across differentchannels This has become an important challenge formobile advertizing (Lee 2014)14 For example I mightreceive an ad for a hot new product on my mobiledevice I may be interested in the product but Irsquomunlikely to buy this product on my mobile phoneRather Irsquoll buy the product using my laptop or PC Bylinking the screens of the same user we are now ableto provide a broader (and possibly much more robust)estimate of campaign success (eg conversion rate)across different channels by aggregating the successmetrics across GSN neighbors For example similar tohow a traditional ad campaign may runnot run anad in a controlled study across different cities agen-cies could targetnot target different (matched) usersand then look at the success rates in their respec-tive geosimilarity neighborhoods This should captureeffects of the same individual on different devices

As discussed at the outset advertisers need to becognizant of privacy concerns regarding the collec-tion storage and use of data This paper followswhat the FTC calls the ldquoprivacy by designrdquo approachBy design the technique does not need to store anydirect personal information about mobile users thereis no need for nonanonymized identifiers demo-graphics geographics psychographics etc In addi-tion the storage of indirect information about userscan be severely limited as well The method doesnot need the actual locationsmdashonly anonymized loca-tion ldquokeysrdquo So for example IP addresses can bereplaced with random numbers15 without affectingthe GSN performance (European Commission 2014Federal Trade Commission 2012) More technically atthe ldquoouter wallrdquo of the system or firm each deviceid can be irreversibly hashed to a random key Theonly requirement is that the same device be hashedto the same key if encountered again Similarly atthe ldquoouter wallrdquo of the system or firm every loca-tion also can be irreversibly hashed to a random keyThe GSN can be formed just the same with the ran-dom keys as with the actual locations If more privacy

14 httpwwwclickzcomclickznews233699680-of-marketers-will-run-cross-channel-marketing-campaigns-in-2014-study15 See both the FTCrsquos and the European Commissionrsquos 2010ndash2011privacy-by-design reports and the comments therein along withProvost et al (2009) for a thorough treatment of this topic

is desired hashing can be done irreversibly many toone and in that case it becomes impossible to asso-ciate definitively any particular location with any par-ticular deviceuser

Finally an aspect of this research that is importantfor managers but not typically covered in the predic-tive modeling literature is the notion of increasing thereach of a campaign For example as discussed abovepossibly the most straightforward use of the GSN isto select the same actor on different mobile devicesThis would allow us to expand the reach of a ldquoretar-getingrdquo campaign16

Increasing reach has important subtleties in the cur-rent advertising ecosystemmdashit is not just the otherside of the coin of increasing lift The reason is thatthere are many targeters in the onlinemobile adver-tising ecosystem all of whom are using the samedata retargeting data and demographicgeographicpsychographic data that are purchased from third-party data providers However these data are avail-able only on a subset of devices Thus all partieswho are using these data are competing for the samesometimes small set of devices Since large advertis-ers typically contract with multiple targeting firmswho take different strategies the effect is that theadvertiser is paying the firms to compete against eachother effectively raising the price in the auction andthereby raising their own cost of advertising17 Whatis more these will be the same devices that also willbe targeted for other campaigns because they are thedevices for which the targeters have data

However by connecting devices in a GSN we canexpand campaigns to devices for which there are noretargeting or third-party data available at all If thereis less competition for these devices we should beable to target them for a lower price Thus expand-ing reach has an implication for the cost-effectivenessof achieving a certain level of predictive performance(eg lift)

AcknowledgmentsThe authors thank the anonymous reviewers and edi-tors for very constructive suggestions which substantiallyimproved the paper This work was done while the firstauthor was at Coriolis Ventures and Everyscreen MediaThe authors extend our gratitude to Everyscreen MediaDstillery Lauren Moores Tom OrsquoReilly Vinayak Javaly andTina Eliassi-Rad for many discussions about mobile adver-tising and mobile ad targeting as well as technical assistance

16 Retargeting is to target browsers who have previously purchasedfrom the brand or who have taken some other indicative brandaction such as browsing the brandrsquos site Retargeting is consid-ered by many to be one of the most effective targeting strategies(although to our knowledge these conclusions are drawn based onassessing conversion rate rather than the influence of the advertis-ing please see Stitelman et al 2011)17 The latter is our conjecture

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 265

to obtain all of the relevant data This particular data set wasnot necessarily used in the development of any productionmodel used for mobile advertising The first author thanksNEC and Andre Meyer for faculty fellowships The authorsthank the Moore and Sloan Foundations for their generoussupport of the Moore-Sloan Data Science Environment atNew York University

ReferencesAgarwal R Gupta AK Kraut RE (2008) Editorial overviewmdashThe

interplay between digital and social networks Inform SystemsRes 19(3)243ndash252

Aral S Muchnik L Sundararajan A (2009) Distinguishing influence-based contagion from homophily-driven diffusion in dynamicnetworks Proc National Acad Sci USA 106(51)21544ndash21549

Bampo M Ewing MT Mather DR Stewart D Wallace M (2008)The effects of the social structure of digital networks on viralmarketing performance Inform Systems Res 19(3)273ndash290

Cho E Myers SA Leskovec J (2011) Friendship and mobility Usermovement in location-based social networks Proc 17th ACMSIGKDD Internat Conf Knowledge Discovery and Data MiningKDD rsquo11 (ACM New York) 1082ndash1090

Cortes C Pregibon D Volinsky CT (2001) Communities of interestHoffmann F Hand DJ Adams N Fisher D Guimaraes G edsAdvances in Intelligent Data Analysis Lecture Notes ComputSci Vol 2189 (Springer-Verlag Berlin Heidelberg) 105ndash114

Crandall DJ Backstrom L Cosley D Suri S Huttenlocher DKleinberg J (2010) Inferring social ties from geographic coinci-dences Proc National Acad Sci USA 107(52)22436ndash22441

de Montjoye YA Hidalgo CA Verleysen M Blondel VD (2013)Unique in the crowd The privacy bounds of human mobilitySci Rep 31ndash5

European Commission (2014) Data protection day 2014 Full speedon EU data protection reform Press release January 27httpeuropaeurapidpress-release_MEMO-14-60_enhtm

Fawcett T (2006) An introduction to ROC analysis Pattern Recogni-tion Lett 27(8)861ndash874

Federal Trade Commission (2012) Protecting consumer pri-vacy in an era of rapid change FTC Report March 2012httpswwwftcgovsitesdefaultfilesdocumentsreportsfederal-trade-commission-report-protecting-consumer-privacy-era-rapid-change-recommendations120326privacyreportpdf

Heidemann J Klier M Probst F (2010) Identifying key users inonline social networks A pagerank based approach InformSystems J 4801(December)157ndash160

Hill S Provost F (2003) The myth of the double-blind reviewAuthor identification using only citations SIGKDD Explorations5(2)179ndash184

Hill S Provost F Volinsky C (2006) Network-based marketingIdentifying likely adopters via consumer networks Statist Sci22(2)256ndash276

Hotho A Nuumlrnberger A Paass G (2005) A brief survey of text min-ing LDV Forum 20(1)19ndash62

Kerho S (2012) Mobile marketingmdashA new analytics frameworkwhat we have and what we need Presentation Marketing onthe Move Conference Wharton School University of PennsylvaniaPhiladelphia February 28

Kossinets G Watts DJ (2009) Origins of homophily in an evolvingsocial network Amer J Sociol 115(2)405ndash450

Lee J (2014) 80 of marketers will run cross-channel marketingcampaigns in 2014 ClickZ (March 31) httpwwwclickzcomclickznews233699680-of-marketers-will-run-cross-channel-marketing-campaigns-in-2014-study

Martens D Provost F (2011) Pseudo-social network targeting fromconsumer transaction data Working paper CEDER-11-05 SternSchool of Business New York University New York

McPherson M Smith-Lovin L Cook JM (2001) Birds of afeather Homophily in social networks Annual Rev Sociol 27415ndash444

Oinas-Kukkonen H Lyytinen K Yoo Y (2010) Social networks andinformation systems Ongoing and future research streamsJ Assoc Inform Systems 11(2)61ndash68

Pan W Aharony N Pentland A (2011) Composite social networkfor predicting mobile apps installation Burgard W Roth Deds Proc AAAI 2011 (AAAI Press San Francisco)

Perlich C Dalessandro B Raeder T Stitelman O Provost F (2014)Machine learning for targeted display advertising Transferlearning in action Machine Learn 95(1)103ndash127

Pogue D (2011) Wrapping up the apple location brouhaha NewYork Times (April 28) httppogueblogsnytimescom20110428wrapping-up-the-apple-location-brouhaha

Provost F Fawcett T (2013) Data Science for Business What You Needto Know About Data Mining and Data-analytic Thinking (OrsquoReillyMedia Inc Sebastopol CA)

Provost F Dalessandro B Hook R Zhang X Murray A (2009) Audi-ence selection for on-line brand advertising Privacy-friendlysocial network targeting KDD rsquo09 Proc 15th ACM SIGKDDInternat Conf Knowledge Discovery and Data Mining (ACMNew York) 707ndash716

Pubmatic (2010) Understanding real-time bidding (RTB) from thepublisherrsquos perspective Technical report Pubmatic SiliconValley CA

Quercia D Lathia N Calabrese F Di Lorenzo G Crowcroft J (2010)Recommending social events from mobile phone location dataProc 2010 IEEE Internat Conf Data Mining ICDM rsquo10 (IEEEComputer Society Washington DC) 971ndash976

Raeder T Stitelman O Dalessandro B Perlich C Provost F (2012)Design principles of massive robust prediction systems Proc18th ACM SIGKDD Internat Conf Knowledge Discovery and DataMining (ACM New York) 1357ndash1365

Stitelman O Dalessandro B Perlich C Provost F (2011) Estimatingthe effect of online display advertising on browser conversionData Mining and Audience Intelligence for Advertising ADKDD2011 (ACM New York) 8

Stitelman O Perlich C Dalessandro B Hook R Raeder T Provost F(2013) Using co-visitation networks for detecting large scaleonline display advertising exchange fraud Proc 19th ACMSIGKDD Internat Conf Knowledge Discovery and Data Mining(ACM New York) 1240ndash1248

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Page 16: Finding Similar Mobile Consumers with a Privacy-Friendly ...pages.stern.nyu.edu/~fprovost/Papers/PMM_ISR_2015.pdf · Provost, Martens, and Murray: Finding Similar Mobile Consumers

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 257

Table 16 Distribution of Publishers per RTB in Terms of thePercentage of Users in Our Data Set That Visit the GivenPublisher

Publisher of D

RTB 1Burstly 3004myYearbookcom 1062My Fitness Pal 0065TextMe Inc 0056GameResort 0025Flixster 0019247Sports 0015Daily Workout Apps LLC 0009Mobile Deluxe 0007YoYo Games Ltd 0004

RTB 2mworldnowcom 0032mtmzcom 0024mtopixcom 0014babycentercom 0015meetmecom 0006appsfacebookcom 0006itunesapplecom 0039appevitecom 0002beautyandskincarescom 0003mmajunkiecom 0001

RTB 3Conversion Exchange 5053Pinger Inc 2096Top Game Developer 4029PremiumEntertainmentApp-FamilyOriented-Android 0059Enflick 0053Clapfoot Games 0044FunPokes Inc 0018Talkatone 0015Sevenlogics Inc 0008A Star Software 0007

RTB 4Pinger PhonemdashiOSmdashConversation 1001Talkatone iPhone App 0007DiceWithBuddies 00139GAG Reader 0012Video Downloader Pro Lite 0016Rage Comics 0009Spades 3D Lite 0006Apalabrados 0003Relax Melodies HD 0002Celeb MemdashPhotoMaker 0000

a diverse collection of 40 publishers listed in Table 16A publisher can correspond to a mobile app a web-site or a set of appswebsites We selected both pub-lishers that are visited by many users and more nichepublishers with fewer visitors As can be seen fromTable 16 several types of publishers can be distin-guished the main groups are games (eg GameRe-sort Top Game Developer) and social networkingtools (eg appsfacebookcom Text Me Inc)

More specifically we create 200 samples for 40 pub-lishers 10 publishers for each of four different RTB

exchanges listed in Table 1613 The percentage ofunique users that visit each publisher in our completedata set is also shown (D) To create each samplean ldquoegordquo user is selected that has visited the pub-lisher in question (an ldquoego visitorrdquo) this ego visi-torrsquos GSN neighborhood is created Next we assesshow many of the egorsquos neighbors also visited thatsame publisher This will be compared with a base-line visit rate to produce lift and leverage values(Provost and Fawcett 2013)mdashspecifically how muchmore likely is it for the neighbors of ego visitors tovisit the publisher than would be expected by chanceWe consider three different setups that produce dif-ferent baselines

1 Considering all RTB exchanges to determine the base-line A user is considered a visitor for a given pub-lisher if she visits the publisher on any RTB ex-change (since some publishers are visible on differentexchanges) Hence the baseline visit rate is the num-ber of users (on any RTB) who visit that publisherdivided by the total number of users across all RTBexchanges This will provide optimistic lifts sincemost publishers appear only on one RTB exchangewhereas the baseline is measured over all exchanges

2 Considering one RTB exchange only to determine thebaseline A user is considered a visitor for a given pub-lisher and RTB if she visits the publisher on the sameRTB exchange This changes the baseline to the num-ber of users (on this RTB) who visit that publisherdivided by the total number of users on this RTBexchange The lift and leverage results will be worsethan in the first setting because the baseline will belarger Seeing that some publishers are seen on differ-ent RTB exchanges this setting is slightly pessimistic

3 Taking time of location visits into account In thethird setting we link users only if they visit an IPwithin the same time period This will be elaboratedon in sect73

Results are shown in Tables 17 and 18 The percent-age of publisher visitors in the neighborhood is givenas N This percentage is compared with the baselinepercentage of users that visit that publisher in thecomplete data D using two metrics the lift is the ratioof these numbers the leverage is the difference Liftis particularly useful for very small base rates (con-sider that an activity with a base rate of 30 cannotpossibly have a lift higher than about 3) Leverage ismore telling for larger base rates because it shows theabsolute improvement (here in percentage points)

When considering all NNs the lifts for all publish-ers are greater than 1 except for one Burstly on RTB 1has a lift of 097 (essentially no lift) but only when weconsider RTB 1 as the baseline This publisher already

13 For confidentiality the RTB numbering here is again differentfrom above

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers258 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

Table 17 RTBs 1 and 2 Percentage of Network Neighbors That Also Visit the Publisher

RTB 1

All NNs N1 N10

D N Lift Lev N1 Lift Lev N10 Lift Lev

Considering all RTB exchanges to determine baseline1 Burstly 3004 8063 2084 5059 47022 15054 44018 25061 8043 220572 myYearbookcom 1062 25077 15094 24015 44074 27067 43012 35016 21074 330543 My Fitness Pal 0065 4015 6038 305 3609 56077 36025 15055 23091 14094 TextMe Inc 0056 1805 33006 17094 39002 69075 38046 25018 45 240625 GameResort 0025 4093 19077 4068 18052 74026 18027 11086 47058 110626 Flixster 0019 1055 8015 1036 8014 42085 7095 1079 9042 1067 247Sports 0015 27057 188058 27042 77062 530093 77048 74053 509075 740388 Daily Workout Apps LLC 0009 1092 21047 1083 19023 215046 19014 11054 129027 110459 Mobile Deluxe 0007 1087 28067 108 23053 36009 23046 9063 14707 9056

10 YoYo Games Ltd 0004 3052 99058 3049 21057 609064 21053 13 367045 12096Considering this RTB exchange only to determine baseline

1 Burstly 8085 8063 0097 minus0022 47022 5033 38037 25061 2089 160762 myYearbookcom 4071 25077 5047 21006 44074 905 40003 35016 7046 300453 My Fitness Pal 1089 4015 2019 2025 3609 19048 35001 15055 8021 130654 TextMe Inc 1063 1805 11035 16087 39002 23094 37039 25018 15045 230555 GameResort 0073 4093 6078 402 18052 25049 17079 11086 16033 110146 Flixster 0055 1055 208 0099 8014 14071 7059 1079 3023 10247 247Sports 0043 27057 64073 27014 77062 182023 7702 74053 174097 74018 Daily Workout Apps LLC 0026 1092 7037 1066 19023 73095 18097 11054 44037 110289 Mobile Deluxe 0019 1087 9084 1068 23053 123088 23034 9063 5007 9044

10 YoYo Games Ltd 001 3052 34018 3042 21057 209025 21047 13 126012 1209

RTB 2

Considering all RTB exchanges to determine baseline1 mworldnowcom 0032 13027 40086 12095 54039 167042 54006 31065 97044 310332 mtmzcom 0024 6089 28086 6065 23008 9607 22084 10068 44075 100443 mtopixcom 0014 15007 104079 14093 53045 37106 5303 33075 234065 330614 babycentercom 0015 6087 47031 6072 43048 299056 43033 17074 122024 17065 meetmecom 0006 6002 10308 5097 54072 942093 54066 27085 47909 270796 appsfacebookcom 0006 301 48039 3003 9009 141097 9003 2076 43008 20697 itunesapplecom 0039 25077 65071 25038 51043 131012 51004 29049 75018 290098 appevitecom 0002 48028 21212043 48025 54076 21509068 54074 48028 21212043 480259 beautyandskincarescom 0003 18019 642087 18016 8048 299094 8046 1107 413049 11067

10 mmajunkiecom 0001 1036 201064 1035 26067 3195505 26066 13033 11977074 13033Considering this RTB exchange only to determine baseline

1 mworldnowcom 4034 13027 3006 8093 54039 12053 50005 31065 7029 270312 mtmzcom 3019 6089 2016 307 23008 7024 19089 10068 3035 70493 mtopixcom 1092 15007 7084 13015 53045 27081 51053 33075 17056 310834 babycentercom 1094 6087 3054 4093 43048 22042 41054 17074 9015 15085 meetmecom 0078 6002 7077 5025 54072 70057 53094 27085 35092 270076 appsfacebookcom 0086 301 3062 2024 9009 10063 8024 2076 3022 1097 itunesapplecom 5024 25077 4092 20053 51043 9081 46019 29049 5063 240258 appevitecom 0029 48028 165058 47098 54076 187082 54047 48028 165058 470989 beautyandskincarescom 0038 18019 48011 17081 8048 22045 8011 1107 30095 11032

10 mmajunkiecom 0009 1036 15009 1027 26067 296003 26058 13033 148001 13024

Notes Results are given for the complete data set (D) the complete neighborhood of a user who visited the given publisher (All NNs) and considering theneighborhood of closest 1 and 10 users (N1 and N10)

has a very high baseline and as we will see in thenext section when we consider only the top 1 and 10NNs the lifts go up to 5 and 3 respectively Burstlyprovides a generic platform for building apps and sois a weaker indication of user interest than many ofthe other publishers (fitness babies games)

All other lifts exceed one by substantial mar-gins and go up to even 2391 (Apalabrados onRTB 4 a Spanish-language social Scrabble-like game)

Obviously the conclusion that the NNs indeed aremore likely to visit the same publishers is highly sig-nificant by a simple sign test comparing the visita-tion rate within the geosimilarity neighborhood tothat of overall population either via lift or leveragep lt 0001 Thus NNs of publisher visitors are indeedmore likely also to visit the same websites the abso-lute values of the lifts and leverages suggest that theyare substantially more likely

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 259

Table 18 RTBs 3 and 4 Percentage of Network Neighbors That Also Visit the Publisher

RTB 3

All NNs N1 N10

D N Lift Lev N1 Lift Lev N10 Lift Lev

Considering all RTB exchanges to determine baseline1 Conversion Exchange 5053 34054 6024 29001 64071 11069 59017 50077 9018 450242 Pinger Inc 2096 46002 15053 43006 65012 21097 62015 51048 17037 480523 Top Game Developer 4029 47024 11002 42095 60 13099 55071 48097 11042 440684 PremiumEntertainmentApp-FamilyOriented-Android 0059 23079 40012 2302 20045 34049 19086 22025 37052 210665 Enflick 0053 11071 21095 11017 21057 40044 21004 12027 23 110736 Clapfoot Games 0044 8055 19032 8011 11054 26006 1101 13099 31061 130557 FunPokes Inc 0018 11011 61028 10093 50 275077 49082 15079 87008 150618 Talkatone 0015 13013 88051 12098 31011 209076 30096 17091 120078 170779 Sevenlogics Inc 0008 2071 32059 2063 5088 70074 508 8033 100022 8025

10 A Star Software 0007 3033 44086 3025 23008 311024 23 9009 122061 9002Considering this RTB exchange only to determine baseline

1 Conversion Exchange 11045 34054 3002 23009 64071 5065 53025 50077 4043 390322 Pinger Inc 6013 46002 705 39089 65012 10062 58098 51048 8039 450353 Top Game Developer 8088 47024 5032 38036 60 6076 51012 48097 5052 400094 PremiumEntertainmentApp-FamilyOriented-Android 1023 23079 19038 22057 20045 16066 19023 22025 18013 210025 Enflick 101 11071 1006 1006 21057 19054 20046 12027 11011 110166 Clapfoot Games 0092 8055 9033 7064 11054 12059 10062 13099 15027 130087 FunPokes Inc 0038 11011 2906 10074 50 133022 49062 15079 42007 150418 Talkatone 0031 13013 42076 12082 31011 101034 3008 17091 58035 170619 Sevenlogics Inc 0017 2071 15074 2054 5088 34017 5071 8033 48041 8016

10 A Star Software 0015 3033 21067 3017 23008 150036 22092 9009 59023 8094

RTB 4

Considering all RTB exchanges to determine baseline1 Pinger PhonemdashiOSmdashConversation 1001 71081 70085 7008 80077 79068 79076 70014 6902 690132 Talkatone iPhone App 0007 11038 17203 11031 29003 439068 28097 17002 257078 160963 DiceWithBuddies 0013 7041 59004 7028 18018 144092 18006 7069 61031 70574 9GAG Reader 0012 62013 497099 62 49009 39305 48097 49005 393017 480925 Video Downloader Pro Lite 0016 20003 123015 19087 22058 138084 22042 17093 110027 170776 Rage Comics 0009 609 8001 6081 13033 154086 13025 609 8001 60817 Spades 3D Lite 0006 1205 225076 12044 27027 492057 27022 1205 225076 120448 Apalabrados 0003 77042 21391071 77039 75 21316097 74097 79032 21450043 790299 Relax Melodies HD 0002 31058 21002029 31056 50 31170029 49098 31058 21002029 31056

10 Celeb MemdashPhotoMaker 0 3023 82107 3022 7014 11819048 7014 3023 82107 3022Considering this RTB exchange only to determine baseline

1 Pinger PhonemdashiOSmdashConversation 8051 71081 8044 6303 80077 9049 72026 70014 8024 610632 Talkatone iPhone App 0055 11038 20053 10082 29003 52038 28048 17002 30071 160473 DiceWithBuddies 0009 7041 82011 7032 18018 201054 18009 7069 85027 7064 9GAG Reader 0097 62013 63097 61016 49009 50055 48012 49005 50051 480085 Video Downloader Pro Lite 0044 20003 45036 19059 22058 51014 22014 17093 40062 170496 Rage Comics 0037 609 18079 6053 13033 36032 12097 609 18079 60537 Spades 3D Lite 0046 1205 2609 12004 27027 58068 26081 1205 2609 120048 Apalabrados 0021 77042 371093 77021 75 360031 74079 79032 381006 790119 Relax Melodies HD 0013 31058 238054 31045 50 377069 49087 31058 238054 31045

10 Celeb MemdashPhotoMaker 0003 3023 97089 3019 7014 216076 7011 3023 97089 3019

Notes Results are given for the complete data set (D) the complete neighborhood of a user who visited the given publisher (All NNs) and considering theneighborhood of closest 1 and 10 users (N1 and N10)

The very high lifts could be attributed to two phe-nomena They may provide some evidence that theGSN embeds a true social network as pointed out bythe good results for the social networking publishers(eg Apalabrados with a lift of 2391 appsevitecomwith a lift of 2212 and others) Friends visiting thesame social networking sites (here unusual ones)

would increase the lift and leverage over just havingdifferent instances of the same user visiting the samewebsites An alternative explanation is that some ofthese publishers are much more likely to be visitedacross different devices of the same user It is not clearwhether this is actually true However it argues forobtaining specific and broad data on which devices

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers260 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

are not the same user to shed further light on forexample any broad data that include a (anonymized)user identifier

From these results we can conclude that GSNneighbors indeed exhibit substantially similar pub-lisher visitation behavior Next we assess to whatextent more strongly connected NNs are more similarthan less strongly connected NNs

72 Ranking Users with Similar Interests

Are the More-Similar Geosimilarity Neighbors of aPublisherrsquos Visitors Even More Likely Also to Be Vis-itors to the Same Publisher To assess whether more-similar geosimilarity neighbors are even more likelyto share interests we compute the measures exactlyas in the previous study except instead of consider-ing all NNs only the 1 (N1) and 10 (N10) neighborswith the highest scores are considered (see Tables 17and 18) First we must consider which of the 15 met-rics that are defined to measure link strength to useTable 19 reports the rank aggregations of the differentmetrics as to which provide the best lifts Specificallyfor each of the 40 publishers each scoring metric pro-vides a lift and for that publisher the scoring metricscan be ranked (the best being ranked 1 etc) The rankaggregation is the average rank for a scoring metricacross all 40 publishers Thus if one metric providedthe best lift for all publishers it would get an aggre-gated rank of 1

To visualize the best metrics the metrics with anaverage ranking less than 8 are shown in boldfacein Table 19 The Jaccard metrics do not perform wellin this study either showing that penalizing the con-nection to users that visit many locations is not sen-sible Remember that the Jaccard metric is based onthe idea that users that visit many locations will havea higher chance to share some location with anotheruser However we have so many locations in totalthat visiting a couple more locations will increase onlymarginally this probability of sharing a location bychance A user that logs into more IP addresses andis more active should hence not be penalized andranked lower than other less active users Metric 3which sums the IDF scores of the shared locationsperforms quite well Given the operational efficiencyadvantage over the frequency-based metrics this met-ric is chosen to measure the strength of the links

As seen from the results in Tables 17 and 18 thelift and leverage values for the strongly connected (top 1and top 10) NNs are even higher sometimes astro-nomical (3170 for Relax Melodies HD (RTB 4)) Theseresults are highly significant by sign tests compar-ing either the lift or the leverage between the top-ranked neighbors and the entire cohort (p lt 0001) Theabsolute values again indicate that the users with the

strongest geosimilarity are often substantially morelikely to visit the same publisher

Thus we can conclude that not only does geosim-ilarity find users with similar interests it also ranksusers well by their likelihood of having similarinterests

An interesting follow-up question is whether thecharacteristics of a publisher are of importance for theresults For example are GSN neighbors more sim-ilar in terms of using the same social network appas compared to playing the same games To answerthis question the publishers are categorized into sixclasses funny social news games communicationand miscellaneous In Figure 9 the lifts of the pub-lishers are shown per category Some general trendscan be observed where publishers related to funnycontent have the highest median lift This could beexplained by the fact that users often forward funnycontent to their friends The social websitesapps fol-low next which seems to further demonstrate thatfriends are likely linked in the GSN However pub-lishers in the communication category perform not sowell compared with the other categories and ratherhigh variances are observed in the different cate-gories Please note that high variance is observedacross the categories therefore it is difficult to makewell-supported claims from these results

73 TimingAs a final evaluation we explore whether the inclu-sion of temporal similarity in location visitation ishelpful for defining GSN connections To this end weintroduce a setting where two users are connectedonly if they visit an IP address in the same timeperiod of a day Specifically we divide a day intodifferent time windowsmdashfor example two time peri-ods of 12 hoursmdashwhere users are connected only ifthey visit the same IP address in the same time periodof a day (which might be on different days) Practi-cally two digits are added to an IP which denote thestarting moment of the corresponding time periodWe report on time periods of 2 hours and of 12 hours(other periods yield similar results) We also includethe setting where users are connected only when theyvisit the same IP address on the same day in light ofthe result of Crandall et al (2010) that visiting thesame location at around the same time is indicativeof friendship

The resulting lifts are reported in Figures 10ndash12when considering respectively all NNs the top 1 NNand the top 10 NNs The results when not consider-ing the time period are repeated as well (All) lim-ited to the conservative case of only considering oneexchange (see above) The baseline (lift of one) isindicated with a red horizontal line lifts higher than1000 are shown as 1000 to limit the range to be able

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 261

Table 19 Rank Aggregation per Metric Over All Publishers (Across All RTBs)

Top 1 NN Top 10 NNs

1 Shared unique IPs 801 1 Shared Unique IPs 802252 BooleanmdashSharing an IP 80625 2 BooleanmdashSharing an IP 809133 Shared unique IPsmdashIDF weighted 70025 3 Shared unique IPsmdashIDF weighted 709754 Shared IP visitsmdashMin 70063 4 Shared IP visitsmdashMin 704135 Shared IP visitsmdashMax 7095 5 Shared IP visitsmdashMax 70256 Shared IP visitsmdashAvg 8005 6 Shared IP visitsmdashAvg 704887 Shared IP visitsmdashMinmdashIDF weighted 60625 7 Shared IP visitsmdashMinmdashIDF weighted 703758 Shared IP visitsmdashMaxmdashIDF weighted 70975 8 Shared IP visitsmdashMaxmdashIDF weighted 702259 Shared IP visitsmdashAvgmdashIDF weighted 70938 9 Shared IP visitsmdashAvgmdashIDF weighted 70525

10 Cosine similarity 80338 10 Cosine similarity 804511 Cosine similaritymdashIDF weighted 80213 11 Cosine similaritymdashIDF weighted 8086312 GSN Jaccard 80925 12 GSN Jaccard 913 GSN JaccardmdashIDF weighted 80325 13 GSN JaccardmdashIDF weighted 9013814 GSN Jaccard Freq 80425 14 GSN Jaccard Freq 80515 GSN Jaccard FreqmdashIDF weighted 80425 15 GSN Jaccard FreqmdashIDF weighted 805

to visualize the results Tables 20 and 21 show theaverage rank aggregation (see the discussion of rankaggregation above) in terms of lift per RTB and thep-values of a signed rank test respectively

When comparing the different temporal settingswe see that the ldquoDayrdquo variant performs best and lim-ited differences exist between the time periods of 2and 12 hours More important to observe from theseresults is that (overall) not including time in the GSNdesign outperforms all three temporal variants sub-stantially Only for RTB 3 and the Day version whereusers are connected only if they visit the same IPaddress on the same day are the lifts not signifi-cantly worse than having no time constraint at a 1level (although even here they are significantly worseat a 10 level) This shows that for finding userswith similar interests which locations they visit ismore important than when they visit them By includ-ing the time constraints connections between previ-ously connected users are lost simply because they

Figure 9 (Color online) Lifts of the Publishers (for All Four RTBs)Ranked According to the Median Lift Within the Category(Medians Shown by the Full Line)

visit locations in different time periods Although weuse time to define links metrics that include time todefine the strength of a link might improve the resultsfurther and we consider that an interesting issue forfuture research

8 ConclusionThis paper presents a new design for using geosim-ilarity to connect user instances in a geosimilarity-weighted network In the GSN users are connected ifthey share at least one observed location Various met-rics can be used to yield degrees of similarity We pre-sented intuitive arguments theory and prior researchthat suggest that geosimilarity and the GSN designshould link similar users We also provided strongempirical support Specifically Study 1 shows thatthe GSN indeed links different users instances cor-responding to the same individual and the geosimi-larity metrics rank user instances by their likelihoodof corresponding to the same individual Study 2shows that the GSN links users who have similarinterests as measured by their propensity to visitthe same publishersuse the same apps It also pro-vides strong evidence that the geosimilarity metricsrank user instances by their likelihood of havingsimilar interests Taken together the results providestrong support that the GSN links users with similarinterestsmdashin many cases because they actually repre-sent the same individuals

The main limitation of the research presented inthis paper is that the data are not sufficient to dis-tinguish whether the similarity in interests is solelybased on connecting instances of the same user Thisis an important avenue for future research This lim-itation notwithstanding the very strong results fromStudy 2 have important implications in either caseeither the network is indeed finding different indi-viduals with substantially similar interests or we

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers262 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

Figure 10 (Color online) Lifts Considering All Network Neighbors per RTB System (Each Panel) With and Without the Time Constraint Limited toOne Network (see sect71) for the 40 Publishers

RTB 3mdashAll NNs

0 5 10

Publisher

0

20

40

60

80

100

Lift

RTB 4mdashAll NNs

0 5 10

Publisher

0

200

400

600

800

1000

Lift

RTB 2mdashAll NNs

0 5 10

Publisher

0

200

400

600

800

1000

Lift

0 5 100

200

400

600

800

1000RTB 1mdashAll NNs

Publisher

Lift

All

TP2

TP12

Day

Note Lifts higher than 1000 are shown at 1000

Figure 11 (Color online) Lifts Considering the Closest 1 Network Neighbor per RTB System (Each Panel) With and Without the Time ConstraintLimited to One Network (see sect71) for the 40 Publishers

Lift

0

100

200

400

300

RTB 3mdashTop 1 NNs

0 5 10

Publisher

All

TP2

TP12

Day

0 5 100

200

400

600

800

1000RTB 1mdashTop 1 NNs

Publisher

Lift

RTB 2mdashTop 1 NNs

0 5 10

Publisher

0

200

400

600

800

1000

Lift

RTB 4mdashTop 1 NNs

0 5 10

Publisher

0

200

400

600

800

1000

Lift

Note Lifts higher than 1000 are shown at 1000

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 263

Figure 12 (Color online) Lifts Considering the Closest 10 Network Neighbors per RTB System (Each Panel) With and Without the Time ConstraintLimited to One Network (see sect71) for the 40 Publishers

RTB 2mdashTop 10 NNs

0 5 10

Publisher

0

200

400

600

800

1000

Lift

0 5 10

Publisher

Lift

0

50

100

150

0 5 10

Publisher

0

200

400

600

800

1000

Lift

0 5 100

200

400

600

800

1000

RTB 1mdashTop 10 NNs

RTB 3mdashTop 10 NNs RTB 4mdashTop 10 NNs

Publisher

Lift

All

TP2

TP12

Day

Note Lifts higher than 1000 are shown at 1000

have very strong additional support for the rankingresults of Study 1 as the neighbors with the strongestgeosimilarity have significantly higher affinity for thesame publishersapps

These results have broad and immediate manage-ment implications within our motivating applicationof mobile advertising As discussed above digitalmarketers would like to target the many fragmentedinstances of individuals in the digital advertising

Table 20 Rank Aggregations in Terms of Lift AveragedAcross All Publishers (per RTB)

All TP2 TP12 Day

RTB 1 101 209 309 201RTB 2 1 204 304 302RTB 3 101 3 4 109RTB 4 101 208 308 203

Note Linking based on time does not improve over linkingignoring time

Table 21 Evaluation of Including Time p-Values ofSigned Rank Test

All TP2 TP12 Day

RTB 1 mdash 0 0002 0 0002 0 0006RTB 2 mdash 0 0002 0 0002 0 0002RTB 3 mdash 0 0002 0 0002 00065RTB 4 mdash 0 0002 0 0002 0 0004

ecosystem as well as users who have similar intereststo particular ldquoseedrdquo users Location data such asIP addresses are readily available across the adver-tising ecosystem The most direct use of the GSN formobile ad targeting is to seed a targeting campaignwith users chosen to exhibit some characteristics ofinterest These could be chosen through any of themyriad of methods currently used in advertising TheGSN will directly allow the targeting of the othermembers of these usersrsquo geosimilarity cohorts whichwill include other instances of the same user othersimilar users and likely both So for example if a setof users has been identified to have brand affinity viavisits to a brandrsquos website the GSN cohorts of theseusers would provide an attractive avenue to expandthe reach of a campaign to target them (extendingtraditional retargeting) and also target users similarto the seeds (customer prospecting) In the researchdata set we see that different instances of the sameusers are very often connected in the GSN and thatthose connected in the GSN exhibit substantially sim-ilar interests

The GSN also could be used for (privacy-friendly)hyperlocal targetingmdashmeaning targeting people whofrequent a particular location without needing to storedata on the actual locations of the users as describedearlier IP addresses currently are used by somemarketers for coarse-grained demographic targetinginferring geography from IP address registration data

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers264 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

and then inferring demographics from the geographyThe GSN provides a complementary alternative it willconnect people who visit the same fine-grained loca-tions As described above in the couponing exampleif a campaign is seeded with customers of a localestablishment (eg via an online loyalty program)geosimilarity can target others who frequent the samelocations

An exciting potential future use of the GSN is theevaluation of marketing campaigns across differentchannels This has become an important challenge formobile advertizing (Lee 2014)14 For example I mightreceive an ad for a hot new product on my mobiledevice I may be interested in the product but Irsquomunlikely to buy this product on my mobile phoneRather Irsquoll buy the product using my laptop or PC Bylinking the screens of the same user we are now ableto provide a broader (and possibly much more robust)estimate of campaign success (eg conversion rate)across different channels by aggregating the successmetrics across GSN neighbors For example similar tohow a traditional ad campaign may runnot run anad in a controlled study across different cities agen-cies could targetnot target different (matched) usersand then look at the success rates in their respec-tive geosimilarity neighborhoods This should captureeffects of the same individual on different devices

As discussed at the outset advertisers need to becognizant of privacy concerns regarding the collec-tion storage and use of data This paper followswhat the FTC calls the ldquoprivacy by designrdquo approachBy design the technique does not need to store anydirect personal information about mobile users thereis no need for nonanonymized identifiers demo-graphics geographics psychographics etc In addi-tion the storage of indirect information about userscan be severely limited as well The method doesnot need the actual locationsmdashonly anonymized loca-tion ldquokeysrdquo So for example IP addresses can bereplaced with random numbers15 without affectingthe GSN performance (European Commission 2014Federal Trade Commission 2012) More technically atthe ldquoouter wallrdquo of the system or firm each deviceid can be irreversibly hashed to a random key Theonly requirement is that the same device be hashedto the same key if encountered again Similarly atthe ldquoouter wallrdquo of the system or firm every loca-tion also can be irreversibly hashed to a random keyThe GSN can be formed just the same with the ran-dom keys as with the actual locations If more privacy

14 httpwwwclickzcomclickznews233699680-of-marketers-will-run-cross-channel-marketing-campaigns-in-2014-study15 See both the FTCrsquos and the European Commissionrsquos 2010ndash2011privacy-by-design reports and the comments therein along withProvost et al (2009) for a thorough treatment of this topic

is desired hashing can be done irreversibly many toone and in that case it becomes impossible to asso-ciate definitively any particular location with any par-ticular deviceuser

Finally an aspect of this research that is importantfor managers but not typically covered in the predic-tive modeling literature is the notion of increasing thereach of a campaign For example as discussed abovepossibly the most straightforward use of the GSN isto select the same actor on different mobile devicesThis would allow us to expand the reach of a ldquoretar-getingrdquo campaign16

Increasing reach has important subtleties in the cur-rent advertising ecosystemmdashit is not just the otherside of the coin of increasing lift The reason is thatthere are many targeters in the onlinemobile adver-tising ecosystem all of whom are using the samedata retargeting data and demographicgeographicpsychographic data that are purchased from third-party data providers However these data are avail-able only on a subset of devices Thus all partieswho are using these data are competing for the samesometimes small set of devices Since large advertis-ers typically contract with multiple targeting firmswho take different strategies the effect is that theadvertiser is paying the firms to compete against eachother effectively raising the price in the auction andthereby raising their own cost of advertising17 Whatis more these will be the same devices that also willbe targeted for other campaigns because they are thedevices for which the targeters have data

However by connecting devices in a GSN we canexpand campaigns to devices for which there are noretargeting or third-party data available at all If thereis less competition for these devices we should beable to target them for a lower price Thus expand-ing reach has an implication for the cost-effectivenessof achieving a certain level of predictive performance(eg lift)

AcknowledgmentsThe authors thank the anonymous reviewers and edi-tors for very constructive suggestions which substantiallyimproved the paper This work was done while the firstauthor was at Coriolis Ventures and Everyscreen MediaThe authors extend our gratitude to Everyscreen MediaDstillery Lauren Moores Tom OrsquoReilly Vinayak Javaly andTina Eliassi-Rad for many discussions about mobile adver-tising and mobile ad targeting as well as technical assistance

16 Retargeting is to target browsers who have previously purchasedfrom the brand or who have taken some other indicative brandaction such as browsing the brandrsquos site Retargeting is consid-ered by many to be one of the most effective targeting strategies(although to our knowledge these conclusions are drawn based onassessing conversion rate rather than the influence of the advertis-ing please see Stitelman et al 2011)17 The latter is our conjecture

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 265

to obtain all of the relevant data This particular data set wasnot necessarily used in the development of any productionmodel used for mobile advertising The first author thanksNEC and Andre Meyer for faculty fellowships The authorsthank the Moore and Sloan Foundations for their generoussupport of the Moore-Sloan Data Science Environment atNew York University

ReferencesAgarwal R Gupta AK Kraut RE (2008) Editorial overviewmdashThe

interplay between digital and social networks Inform SystemsRes 19(3)243ndash252

Aral S Muchnik L Sundararajan A (2009) Distinguishing influence-based contagion from homophily-driven diffusion in dynamicnetworks Proc National Acad Sci USA 106(51)21544ndash21549

Bampo M Ewing MT Mather DR Stewart D Wallace M (2008)The effects of the social structure of digital networks on viralmarketing performance Inform Systems Res 19(3)273ndash290

Cho E Myers SA Leskovec J (2011) Friendship and mobility Usermovement in location-based social networks Proc 17th ACMSIGKDD Internat Conf Knowledge Discovery and Data MiningKDD rsquo11 (ACM New York) 1082ndash1090

Cortes C Pregibon D Volinsky CT (2001) Communities of interestHoffmann F Hand DJ Adams N Fisher D Guimaraes G edsAdvances in Intelligent Data Analysis Lecture Notes ComputSci Vol 2189 (Springer-Verlag Berlin Heidelberg) 105ndash114

Crandall DJ Backstrom L Cosley D Suri S Huttenlocher DKleinberg J (2010) Inferring social ties from geographic coinci-dences Proc National Acad Sci USA 107(52)22436ndash22441

de Montjoye YA Hidalgo CA Verleysen M Blondel VD (2013)Unique in the crowd The privacy bounds of human mobilitySci Rep 31ndash5

European Commission (2014) Data protection day 2014 Full speedon EU data protection reform Press release January 27httpeuropaeurapidpress-release_MEMO-14-60_enhtm

Fawcett T (2006) An introduction to ROC analysis Pattern Recogni-tion Lett 27(8)861ndash874

Federal Trade Commission (2012) Protecting consumer pri-vacy in an era of rapid change FTC Report March 2012httpswwwftcgovsitesdefaultfilesdocumentsreportsfederal-trade-commission-report-protecting-consumer-privacy-era-rapid-change-recommendations120326privacyreportpdf

Heidemann J Klier M Probst F (2010) Identifying key users inonline social networks A pagerank based approach InformSystems J 4801(December)157ndash160

Hill S Provost F (2003) The myth of the double-blind reviewAuthor identification using only citations SIGKDD Explorations5(2)179ndash184

Hill S Provost F Volinsky C (2006) Network-based marketingIdentifying likely adopters via consumer networks Statist Sci22(2)256ndash276

Hotho A Nuumlrnberger A Paass G (2005) A brief survey of text min-ing LDV Forum 20(1)19ndash62

Kerho S (2012) Mobile marketingmdashA new analytics frameworkwhat we have and what we need Presentation Marketing onthe Move Conference Wharton School University of PennsylvaniaPhiladelphia February 28

Kossinets G Watts DJ (2009) Origins of homophily in an evolvingsocial network Amer J Sociol 115(2)405ndash450

Lee J (2014) 80 of marketers will run cross-channel marketingcampaigns in 2014 ClickZ (March 31) httpwwwclickzcomclickznews233699680-of-marketers-will-run-cross-channel-marketing-campaigns-in-2014-study

Martens D Provost F (2011) Pseudo-social network targeting fromconsumer transaction data Working paper CEDER-11-05 SternSchool of Business New York University New York

McPherson M Smith-Lovin L Cook JM (2001) Birds of afeather Homophily in social networks Annual Rev Sociol 27415ndash444

Oinas-Kukkonen H Lyytinen K Yoo Y (2010) Social networks andinformation systems Ongoing and future research streamsJ Assoc Inform Systems 11(2)61ndash68

Pan W Aharony N Pentland A (2011) Composite social networkfor predicting mobile apps installation Burgard W Roth Deds Proc AAAI 2011 (AAAI Press San Francisco)

Perlich C Dalessandro B Raeder T Stitelman O Provost F (2014)Machine learning for targeted display advertising Transferlearning in action Machine Learn 95(1)103ndash127

Pogue D (2011) Wrapping up the apple location brouhaha NewYork Times (April 28) httppogueblogsnytimescom20110428wrapping-up-the-apple-location-brouhaha

Provost F Fawcett T (2013) Data Science for Business What You Needto Know About Data Mining and Data-analytic Thinking (OrsquoReillyMedia Inc Sebastopol CA)

Provost F Dalessandro B Hook R Zhang X Murray A (2009) Audi-ence selection for on-line brand advertising Privacy-friendlysocial network targeting KDD rsquo09 Proc 15th ACM SIGKDDInternat Conf Knowledge Discovery and Data Mining (ACMNew York) 707ndash716

Pubmatic (2010) Understanding real-time bidding (RTB) from thepublisherrsquos perspective Technical report Pubmatic SiliconValley CA

Quercia D Lathia N Calabrese F Di Lorenzo G Crowcroft J (2010)Recommending social events from mobile phone location dataProc 2010 IEEE Internat Conf Data Mining ICDM rsquo10 (IEEEComputer Society Washington DC) 971ndash976

Raeder T Stitelman O Dalessandro B Perlich C Provost F (2012)Design principles of massive robust prediction systems Proc18th ACM SIGKDD Internat Conf Knowledge Discovery and DataMining (ACM New York) 1357ndash1365

Stitelman O Dalessandro B Perlich C Provost F (2011) Estimatingthe effect of online display advertising on browser conversionData Mining and Audience Intelligence for Advertising ADKDD2011 (ACM New York) 8

Stitelman O Perlich C Dalessandro B Hook R Raeder T Provost F(2013) Using co-visitation networks for detecting large scaleonline display advertising exchange fraud Proc 19th ACMSIGKDD Internat Conf Knowledge Discovery and Data Mining(ACM New York) 1240ndash1248

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Page 17: Finding Similar Mobile Consumers with a Privacy-Friendly ...pages.stern.nyu.edu/~fprovost/Papers/PMM_ISR_2015.pdf · Provost, Martens, and Murray: Finding Similar Mobile Consumers

Provost Martens and Murray Finding Similar Mobile Consumers258 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

Table 17 RTBs 1 and 2 Percentage of Network Neighbors That Also Visit the Publisher

RTB 1

All NNs N1 N10

D N Lift Lev N1 Lift Lev N10 Lift Lev

Considering all RTB exchanges to determine baseline1 Burstly 3004 8063 2084 5059 47022 15054 44018 25061 8043 220572 myYearbookcom 1062 25077 15094 24015 44074 27067 43012 35016 21074 330543 My Fitness Pal 0065 4015 6038 305 3609 56077 36025 15055 23091 14094 TextMe Inc 0056 1805 33006 17094 39002 69075 38046 25018 45 240625 GameResort 0025 4093 19077 4068 18052 74026 18027 11086 47058 110626 Flixster 0019 1055 8015 1036 8014 42085 7095 1079 9042 1067 247Sports 0015 27057 188058 27042 77062 530093 77048 74053 509075 740388 Daily Workout Apps LLC 0009 1092 21047 1083 19023 215046 19014 11054 129027 110459 Mobile Deluxe 0007 1087 28067 108 23053 36009 23046 9063 14707 9056

10 YoYo Games Ltd 0004 3052 99058 3049 21057 609064 21053 13 367045 12096Considering this RTB exchange only to determine baseline

1 Burstly 8085 8063 0097 minus0022 47022 5033 38037 25061 2089 160762 myYearbookcom 4071 25077 5047 21006 44074 905 40003 35016 7046 300453 My Fitness Pal 1089 4015 2019 2025 3609 19048 35001 15055 8021 130654 TextMe Inc 1063 1805 11035 16087 39002 23094 37039 25018 15045 230555 GameResort 0073 4093 6078 402 18052 25049 17079 11086 16033 110146 Flixster 0055 1055 208 0099 8014 14071 7059 1079 3023 10247 247Sports 0043 27057 64073 27014 77062 182023 7702 74053 174097 74018 Daily Workout Apps LLC 0026 1092 7037 1066 19023 73095 18097 11054 44037 110289 Mobile Deluxe 0019 1087 9084 1068 23053 123088 23034 9063 5007 9044

10 YoYo Games Ltd 001 3052 34018 3042 21057 209025 21047 13 126012 1209

RTB 2

Considering all RTB exchanges to determine baseline1 mworldnowcom 0032 13027 40086 12095 54039 167042 54006 31065 97044 310332 mtmzcom 0024 6089 28086 6065 23008 9607 22084 10068 44075 100443 mtopixcom 0014 15007 104079 14093 53045 37106 5303 33075 234065 330614 babycentercom 0015 6087 47031 6072 43048 299056 43033 17074 122024 17065 meetmecom 0006 6002 10308 5097 54072 942093 54066 27085 47909 270796 appsfacebookcom 0006 301 48039 3003 9009 141097 9003 2076 43008 20697 itunesapplecom 0039 25077 65071 25038 51043 131012 51004 29049 75018 290098 appevitecom 0002 48028 21212043 48025 54076 21509068 54074 48028 21212043 480259 beautyandskincarescom 0003 18019 642087 18016 8048 299094 8046 1107 413049 11067

10 mmajunkiecom 0001 1036 201064 1035 26067 3195505 26066 13033 11977074 13033Considering this RTB exchange only to determine baseline

1 mworldnowcom 4034 13027 3006 8093 54039 12053 50005 31065 7029 270312 mtmzcom 3019 6089 2016 307 23008 7024 19089 10068 3035 70493 mtopixcom 1092 15007 7084 13015 53045 27081 51053 33075 17056 310834 babycentercom 1094 6087 3054 4093 43048 22042 41054 17074 9015 15085 meetmecom 0078 6002 7077 5025 54072 70057 53094 27085 35092 270076 appsfacebookcom 0086 301 3062 2024 9009 10063 8024 2076 3022 1097 itunesapplecom 5024 25077 4092 20053 51043 9081 46019 29049 5063 240258 appevitecom 0029 48028 165058 47098 54076 187082 54047 48028 165058 470989 beautyandskincarescom 0038 18019 48011 17081 8048 22045 8011 1107 30095 11032

10 mmajunkiecom 0009 1036 15009 1027 26067 296003 26058 13033 148001 13024

Notes Results are given for the complete data set (D) the complete neighborhood of a user who visited the given publisher (All NNs) and considering theneighborhood of closest 1 and 10 users (N1 and N10)

has a very high baseline and as we will see in thenext section when we consider only the top 1 and 10NNs the lifts go up to 5 and 3 respectively Burstlyprovides a generic platform for building apps and sois a weaker indication of user interest than many ofthe other publishers (fitness babies games)

All other lifts exceed one by substantial mar-gins and go up to even 2391 (Apalabrados onRTB 4 a Spanish-language social Scrabble-like game)

Obviously the conclusion that the NNs indeed aremore likely to visit the same publishers is highly sig-nificant by a simple sign test comparing the visita-tion rate within the geosimilarity neighborhood tothat of overall population either via lift or leveragep lt 0001 Thus NNs of publisher visitors are indeedmore likely also to visit the same websites the abso-lute values of the lifts and leverages suggest that theyare substantially more likely

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 259

Table 18 RTBs 3 and 4 Percentage of Network Neighbors That Also Visit the Publisher

RTB 3

All NNs N1 N10

D N Lift Lev N1 Lift Lev N10 Lift Lev

Considering all RTB exchanges to determine baseline1 Conversion Exchange 5053 34054 6024 29001 64071 11069 59017 50077 9018 450242 Pinger Inc 2096 46002 15053 43006 65012 21097 62015 51048 17037 480523 Top Game Developer 4029 47024 11002 42095 60 13099 55071 48097 11042 440684 PremiumEntertainmentApp-FamilyOriented-Android 0059 23079 40012 2302 20045 34049 19086 22025 37052 210665 Enflick 0053 11071 21095 11017 21057 40044 21004 12027 23 110736 Clapfoot Games 0044 8055 19032 8011 11054 26006 1101 13099 31061 130557 FunPokes Inc 0018 11011 61028 10093 50 275077 49082 15079 87008 150618 Talkatone 0015 13013 88051 12098 31011 209076 30096 17091 120078 170779 Sevenlogics Inc 0008 2071 32059 2063 5088 70074 508 8033 100022 8025

10 A Star Software 0007 3033 44086 3025 23008 311024 23 9009 122061 9002Considering this RTB exchange only to determine baseline

1 Conversion Exchange 11045 34054 3002 23009 64071 5065 53025 50077 4043 390322 Pinger Inc 6013 46002 705 39089 65012 10062 58098 51048 8039 450353 Top Game Developer 8088 47024 5032 38036 60 6076 51012 48097 5052 400094 PremiumEntertainmentApp-FamilyOriented-Android 1023 23079 19038 22057 20045 16066 19023 22025 18013 210025 Enflick 101 11071 1006 1006 21057 19054 20046 12027 11011 110166 Clapfoot Games 0092 8055 9033 7064 11054 12059 10062 13099 15027 130087 FunPokes Inc 0038 11011 2906 10074 50 133022 49062 15079 42007 150418 Talkatone 0031 13013 42076 12082 31011 101034 3008 17091 58035 170619 Sevenlogics Inc 0017 2071 15074 2054 5088 34017 5071 8033 48041 8016

10 A Star Software 0015 3033 21067 3017 23008 150036 22092 9009 59023 8094

RTB 4

Considering all RTB exchanges to determine baseline1 Pinger PhonemdashiOSmdashConversation 1001 71081 70085 7008 80077 79068 79076 70014 6902 690132 Talkatone iPhone App 0007 11038 17203 11031 29003 439068 28097 17002 257078 160963 DiceWithBuddies 0013 7041 59004 7028 18018 144092 18006 7069 61031 70574 9GAG Reader 0012 62013 497099 62 49009 39305 48097 49005 393017 480925 Video Downloader Pro Lite 0016 20003 123015 19087 22058 138084 22042 17093 110027 170776 Rage Comics 0009 609 8001 6081 13033 154086 13025 609 8001 60817 Spades 3D Lite 0006 1205 225076 12044 27027 492057 27022 1205 225076 120448 Apalabrados 0003 77042 21391071 77039 75 21316097 74097 79032 21450043 790299 Relax Melodies HD 0002 31058 21002029 31056 50 31170029 49098 31058 21002029 31056

10 Celeb MemdashPhotoMaker 0 3023 82107 3022 7014 11819048 7014 3023 82107 3022Considering this RTB exchange only to determine baseline

1 Pinger PhonemdashiOSmdashConversation 8051 71081 8044 6303 80077 9049 72026 70014 8024 610632 Talkatone iPhone App 0055 11038 20053 10082 29003 52038 28048 17002 30071 160473 DiceWithBuddies 0009 7041 82011 7032 18018 201054 18009 7069 85027 7064 9GAG Reader 0097 62013 63097 61016 49009 50055 48012 49005 50051 480085 Video Downloader Pro Lite 0044 20003 45036 19059 22058 51014 22014 17093 40062 170496 Rage Comics 0037 609 18079 6053 13033 36032 12097 609 18079 60537 Spades 3D Lite 0046 1205 2609 12004 27027 58068 26081 1205 2609 120048 Apalabrados 0021 77042 371093 77021 75 360031 74079 79032 381006 790119 Relax Melodies HD 0013 31058 238054 31045 50 377069 49087 31058 238054 31045

10 Celeb MemdashPhotoMaker 0003 3023 97089 3019 7014 216076 7011 3023 97089 3019

Notes Results are given for the complete data set (D) the complete neighborhood of a user who visited the given publisher (All NNs) and considering theneighborhood of closest 1 and 10 users (N1 and N10)

The very high lifts could be attributed to two phe-nomena They may provide some evidence that theGSN embeds a true social network as pointed out bythe good results for the social networking publishers(eg Apalabrados with a lift of 2391 appsevitecomwith a lift of 2212 and others) Friends visiting thesame social networking sites (here unusual ones)

would increase the lift and leverage over just havingdifferent instances of the same user visiting the samewebsites An alternative explanation is that some ofthese publishers are much more likely to be visitedacross different devices of the same user It is not clearwhether this is actually true However it argues forobtaining specific and broad data on which devices

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers260 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

are not the same user to shed further light on forexample any broad data that include a (anonymized)user identifier

From these results we can conclude that GSNneighbors indeed exhibit substantially similar pub-lisher visitation behavior Next we assess to whatextent more strongly connected NNs are more similarthan less strongly connected NNs

72 Ranking Users with Similar Interests

Are the More-Similar Geosimilarity Neighbors of aPublisherrsquos Visitors Even More Likely Also to Be Vis-itors to the Same Publisher To assess whether more-similar geosimilarity neighbors are even more likelyto share interests we compute the measures exactlyas in the previous study except instead of consider-ing all NNs only the 1 (N1) and 10 (N10) neighborswith the highest scores are considered (see Tables 17and 18) First we must consider which of the 15 met-rics that are defined to measure link strength to useTable 19 reports the rank aggregations of the differentmetrics as to which provide the best lifts Specificallyfor each of the 40 publishers each scoring metric pro-vides a lift and for that publisher the scoring metricscan be ranked (the best being ranked 1 etc) The rankaggregation is the average rank for a scoring metricacross all 40 publishers Thus if one metric providedthe best lift for all publishers it would get an aggre-gated rank of 1

To visualize the best metrics the metrics with anaverage ranking less than 8 are shown in boldfacein Table 19 The Jaccard metrics do not perform wellin this study either showing that penalizing the con-nection to users that visit many locations is not sen-sible Remember that the Jaccard metric is based onthe idea that users that visit many locations will havea higher chance to share some location with anotheruser However we have so many locations in totalthat visiting a couple more locations will increase onlymarginally this probability of sharing a location bychance A user that logs into more IP addresses andis more active should hence not be penalized andranked lower than other less active users Metric 3which sums the IDF scores of the shared locationsperforms quite well Given the operational efficiencyadvantage over the frequency-based metrics this met-ric is chosen to measure the strength of the links

As seen from the results in Tables 17 and 18 thelift and leverage values for the strongly connected (top 1and top 10) NNs are even higher sometimes astro-nomical (3170 for Relax Melodies HD (RTB 4)) Theseresults are highly significant by sign tests compar-ing either the lift or the leverage between the top-ranked neighbors and the entire cohort (p lt 0001) Theabsolute values again indicate that the users with the

strongest geosimilarity are often substantially morelikely to visit the same publisher

Thus we can conclude that not only does geosim-ilarity find users with similar interests it also ranksusers well by their likelihood of having similarinterests

An interesting follow-up question is whether thecharacteristics of a publisher are of importance for theresults For example are GSN neighbors more sim-ilar in terms of using the same social network appas compared to playing the same games To answerthis question the publishers are categorized into sixclasses funny social news games communicationand miscellaneous In Figure 9 the lifts of the pub-lishers are shown per category Some general trendscan be observed where publishers related to funnycontent have the highest median lift This could beexplained by the fact that users often forward funnycontent to their friends The social websitesapps fol-low next which seems to further demonstrate thatfriends are likely linked in the GSN However pub-lishers in the communication category perform not sowell compared with the other categories and ratherhigh variances are observed in the different cate-gories Please note that high variance is observedacross the categories therefore it is difficult to makewell-supported claims from these results

73 TimingAs a final evaluation we explore whether the inclu-sion of temporal similarity in location visitation ishelpful for defining GSN connections To this end weintroduce a setting where two users are connectedonly if they visit an IP address in the same timeperiod of a day Specifically we divide a day intodifferent time windowsmdashfor example two time peri-ods of 12 hoursmdashwhere users are connected only ifthey visit the same IP address in the same time periodof a day (which might be on different days) Practi-cally two digits are added to an IP which denote thestarting moment of the corresponding time periodWe report on time periods of 2 hours and of 12 hours(other periods yield similar results) We also includethe setting where users are connected only when theyvisit the same IP address on the same day in light ofthe result of Crandall et al (2010) that visiting thesame location at around the same time is indicativeof friendship

The resulting lifts are reported in Figures 10ndash12when considering respectively all NNs the top 1 NNand the top 10 NNs The results when not consider-ing the time period are repeated as well (All) lim-ited to the conservative case of only considering oneexchange (see above) The baseline (lift of one) isindicated with a red horizontal line lifts higher than1000 are shown as 1000 to limit the range to be able

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 261

Table 19 Rank Aggregation per Metric Over All Publishers (Across All RTBs)

Top 1 NN Top 10 NNs

1 Shared unique IPs 801 1 Shared Unique IPs 802252 BooleanmdashSharing an IP 80625 2 BooleanmdashSharing an IP 809133 Shared unique IPsmdashIDF weighted 70025 3 Shared unique IPsmdashIDF weighted 709754 Shared IP visitsmdashMin 70063 4 Shared IP visitsmdashMin 704135 Shared IP visitsmdashMax 7095 5 Shared IP visitsmdashMax 70256 Shared IP visitsmdashAvg 8005 6 Shared IP visitsmdashAvg 704887 Shared IP visitsmdashMinmdashIDF weighted 60625 7 Shared IP visitsmdashMinmdashIDF weighted 703758 Shared IP visitsmdashMaxmdashIDF weighted 70975 8 Shared IP visitsmdashMaxmdashIDF weighted 702259 Shared IP visitsmdashAvgmdashIDF weighted 70938 9 Shared IP visitsmdashAvgmdashIDF weighted 70525

10 Cosine similarity 80338 10 Cosine similarity 804511 Cosine similaritymdashIDF weighted 80213 11 Cosine similaritymdashIDF weighted 8086312 GSN Jaccard 80925 12 GSN Jaccard 913 GSN JaccardmdashIDF weighted 80325 13 GSN JaccardmdashIDF weighted 9013814 GSN Jaccard Freq 80425 14 GSN Jaccard Freq 80515 GSN Jaccard FreqmdashIDF weighted 80425 15 GSN Jaccard FreqmdashIDF weighted 805

to visualize the results Tables 20 and 21 show theaverage rank aggregation (see the discussion of rankaggregation above) in terms of lift per RTB and thep-values of a signed rank test respectively

When comparing the different temporal settingswe see that the ldquoDayrdquo variant performs best and lim-ited differences exist between the time periods of 2and 12 hours More important to observe from theseresults is that (overall) not including time in the GSNdesign outperforms all three temporal variants sub-stantially Only for RTB 3 and the Day version whereusers are connected only if they visit the same IPaddress on the same day are the lifts not signifi-cantly worse than having no time constraint at a 1level (although even here they are significantly worseat a 10 level) This shows that for finding userswith similar interests which locations they visit ismore important than when they visit them By includ-ing the time constraints connections between previ-ously connected users are lost simply because they

Figure 9 (Color online) Lifts of the Publishers (for All Four RTBs)Ranked According to the Median Lift Within the Category(Medians Shown by the Full Line)

visit locations in different time periods Although weuse time to define links metrics that include time todefine the strength of a link might improve the resultsfurther and we consider that an interesting issue forfuture research

8 ConclusionThis paper presents a new design for using geosim-ilarity to connect user instances in a geosimilarity-weighted network In the GSN users are connected ifthey share at least one observed location Various met-rics can be used to yield degrees of similarity We pre-sented intuitive arguments theory and prior researchthat suggest that geosimilarity and the GSN designshould link similar users We also provided strongempirical support Specifically Study 1 shows thatthe GSN indeed links different users instances cor-responding to the same individual and the geosimi-larity metrics rank user instances by their likelihoodof corresponding to the same individual Study 2shows that the GSN links users who have similarinterests as measured by their propensity to visitthe same publishersuse the same apps It also pro-vides strong evidence that the geosimilarity metricsrank user instances by their likelihood of havingsimilar interests Taken together the results providestrong support that the GSN links users with similarinterestsmdashin many cases because they actually repre-sent the same individuals

The main limitation of the research presented inthis paper is that the data are not sufficient to dis-tinguish whether the similarity in interests is solelybased on connecting instances of the same user Thisis an important avenue for future research This lim-itation notwithstanding the very strong results fromStudy 2 have important implications in either caseeither the network is indeed finding different indi-viduals with substantially similar interests or we

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers262 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

Figure 10 (Color online) Lifts Considering All Network Neighbors per RTB System (Each Panel) With and Without the Time Constraint Limited toOne Network (see sect71) for the 40 Publishers

RTB 3mdashAll NNs

0 5 10

Publisher

0

20

40

60

80

100

Lift

RTB 4mdashAll NNs

0 5 10

Publisher

0

200

400

600

800

1000

Lift

RTB 2mdashAll NNs

0 5 10

Publisher

0

200

400

600

800

1000

Lift

0 5 100

200

400

600

800

1000RTB 1mdashAll NNs

Publisher

Lift

All

TP2

TP12

Day

Note Lifts higher than 1000 are shown at 1000

Figure 11 (Color online) Lifts Considering the Closest 1 Network Neighbor per RTB System (Each Panel) With and Without the Time ConstraintLimited to One Network (see sect71) for the 40 Publishers

Lift

0

100

200

400

300

RTB 3mdashTop 1 NNs

0 5 10

Publisher

All

TP2

TP12

Day

0 5 100

200

400

600

800

1000RTB 1mdashTop 1 NNs

Publisher

Lift

RTB 2mdashTop 1 NNs

0 5 10

Publisher

0

200

400

600

800

1000

Lift

RTB 4mdashTop 1 NNs

0 5 10

Publisher

0

200

400

600

800

1000

Lift

Note Lifts higher than 1000 are shown at 1000

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 263

Figure 12 (Color online) Lifts Considering the Closest 10 Network Neighbors per RTB System (Each Panel) With and Without the Time ConstraintLimited to One Network (see sect71) for the 40 Publishers

RTB 2mdashTop 10 NNs

0 5 10

Publisher

0

200

400

600

800

1000

Lift

0 5 10

Publisher

Lift

0

50

100

150

0 5 10

Publisher

0

200

400

600

800

1000

Lift

0 5 100

200

400

600

800

1000

RTB 1mdashTop 10 NNs

RTB 3mdashTop 10 NNs RTB 4mdashTop 10 NNs

Publisher

Lift

All

TP2

TP12

Day

Note Lifts higher than 1000 are shown at 1000

have very strong additional support for the rankingresults of Study 1 as the neighbors with the strongestgeosimilarity have significantly higher affinity for thesame publishersapps

These results have broad and immediate manage-ment implications within our motivating applicationof mobile advertising As discussed above digitalmarketers would like to target the many fragmentedinstances of individuals in the digital advertising

Table 20 Rank Aggregations in Terms of Lift AveragedAcross All Publishers (per RTB)

All TP2 TP12 Day

RTB 1 101 209 309 201RTB 2 1 204 304 302RTB 3 101 3 4 109RTB 4 101 208 308 203

Note Linking based on time does not improve over linkingignoring time

Table 21 Evaluation of Including Time p-Values ofSigned Rank Test

All TP2 TP12 Day

RTB 1 mdash 0 0002 0 0002 0 0006RTB 2 mdash 0 0002 0 0002 0 0002RTB 3 mdash 0 0002 0 0002 00065RTB 4 mdash 0 0002 0 0002 0 0004

ecosystem as well as users who have similar intereststo particular ldquoseedrdquo users Location data such asIP addresses are readily available across the adver-tising ecosystem The most direct use of the GSN formobile ad targeting is to seed a targeting campaignwith users chosen to exhibit some characteristics ofinterest These could be chosen through any of themyriad of methods currently used in advertising TheGSN will directly allow the targeting of the othermembers of these usersrsquo geosimilarity cohorts whichwill include other instances of the same user othersimilar users and likely both So for example if a setof users has been identified to have brand affinity viavisits to a brandrsquos website the GSN cohorts of theseusers would provide an attractive avenue to expandthe reach of a campaign to target them (extendingtraditional retargeting) and also target users similarto the seeds (customer prospecting) In the researchdata set we see that different instances of the sameusers are very often connected in the GSN and thatthose connected in the GSN exhibit substantially sim-ilar interests

The GSN also could be used for (privacy-friendly)hyperlocal targetingmdashmeaning targeting people whofrequent a particular location without needing to storedata on the actual locations of the users as describedearlier IP addresses currently are used by somemarketers for coarse-grained demographic targetinginferring geography from IP address registration data

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers264 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

and then inferring demographics from the geographyThe GSN provides a complementary alternative it willconnect people who visit the same fine-grained loca-tions As described above in the couponing exampleif a campaign is seeded with customers of a localestablishment (eg via an online loyalty program)geosimilarity can target others who frequent the samelocations

An exciting potential future use of the GSN is theevaluation of marketing campaigns across differentchannels This has become an important challenge formobile advertizing (Lee 2014)14 For example I mightreceive an ad for a hot new product on my mobiledevice I may be interested in the product but Irsquomunlikely to buy this product on my mobile phoneRather Irsquoll buy the product using my laptop or PC Bylinking the screens of the same user we are now ableto provide a broader (and possibly much more robust)estimate of campaign success (eg conversion rate)across different channels by aggregating the successmetrics across GSN neighbors For example similar tohow a traditional ad campaign may runnot run anad in a controlled study across different cities agen-cies could targetnot target different (matched) usersand then look at the success rates in their respec-tive geosimilarity neighborhoods This should captureeffects of the same individual on different devices

As discussed at the outset advertisers need to becognizant of privacy concerns regarding the collec-tion storage and use of data This paper followswhat the FTC calls the ldquoprivacy by designrdquo approachBy design the technique does not need to store anydirect personal information about mobile users thereis no need for nonanonymized identifiers demo-graphics geographics psychographics etc In addi-tion the storage of indirect information about userscan be severely limited as well The method doesnot need the actual locationsmdashonly anonymized loca-tion ldquokeysrdquo So for example IP addresses can bereplaced with random numbers15 without affectingthe GSN performance (European Commission 2014Federal Trade Commission 2012) More technically atthe ldquoouter wallrdquo of the system or firm each deviceid can be irreversibly hashed to a random key Theonly requirement is that the same device be hashedto the same key if encountered again Similarly atthe ldquoouter wallrdquo of the system or firm every loca-tion also can be irreversibly hashed to a random keyThe GSN can be formed just the same with the ran-dom keys as with the actual locations If more privacy

14 httpwwwclickzcomclickznews233699680-of-marketers-will-run-cross-channel-marketing-campaigns-in-2014-study15 See both the FTCrsquos and the European Commissionrsquos 2010ndash2011privacy-by-design reports and the comments therein along withProvost et al (2009) for a thorough treatment of this topic

is desired hashing can be done irreversibly many toone and in that case it becomes impossible to asso-ciate definitively any particular location with any par-ticular deviceuser

Finally an aspect of this research that is importantfor managers but not typically covered in the predic-tive modeling literature is the notion of increasing thereach of a campaign For example as discussed abovepossibly the most straightforward use of the GSN isto select the same actor on different mobile devicesThis would allow us to expand the reach of a ldquoretar-getingrdquo campaign16

Increasing reach has important subtleties in the cur-rent advertising ecosystemmdashit is not just the otherside of the coin of increasing lift The reason is thatthere are many targeters in the onlinemobile adver-tising ecosystem all of whom are using the samedata retargeting data and demographicgeographicpsychographic data that are purchased from third-party data providers However these data are avail-able only on a subset of devices Thus all partieswho are using these data are competing for the samesometimes small set of devices Since large advertis-ers typically contract with multiple targeting firmswho take different strategies the effect is that theadvertiser is paying the firms to compete against eachother effectively raising the price in the auction andthereby raising their own cost of advertising17 Whatis more these will be the same devices that also willbe targeted for other campaigns because they are thedevices for which the targeters have data

However by connecting devices in a GSN we canexpand campaigns to devices for which there are noretargeting or third-party data available at all If thereis less competition for these devices we should beable to target them for a lower price Thus expand-ing reach has an implication for the cost-effectivenessof achieving a certain level of predictive performance(eg lift)

AcknowledgmentsThe authors thank the anonymous reviewers and edi-tors for very constructive suggestions which substantiallyimproved the paper This work was done while the firstauthor was at Coriolis Ventures and Everyscreen MediaThe authors extend our gratitude to Everyscreen MediaDstillery Lauren Moores Tom OrsquoReilly Vinayak Javaly andTina Eliassi-Rad for many discussions about mobile adver-tising and mobile ad targeting as well as technical assistance

16 Retargeting is to target browsers who have previously purchasedfrom the brand or who have taken some other indicative brandaction such as browsing the brandrsquos site Retargeting is consid-ered by many to be one of the most effective targeting strategies(although to our knowledge these conclusions are drawn based onassessing conversion rate rather than the influence of the advertis-ing please see Stitelman et al 2011)17 The latter is our conjecture

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 265

to obtain all of the relevant data This particular data set wasnot necessarily used in the development of any productionmodel used for mobile advertising The first author thanksNEC and Andre Meyer for faculty fellowships The authorsthank the Moore and Sloan Foundations for their generoussupport of the Moore-Sloan Data Science Environment atNew York University

ReferencesAgarwal R Gupta AK Kraut RE (2008) Editorial overviewmdashThe

interplay between digital and social networks Inform SystemsRes 19(3)243ndash252

Aral S Muchnik L Sundararajan A (2009) Distinguishing influence-based contagion from homophily-driven diffusion in dynamicnetworks Proc National Acad Sci USA 106(51)21544ndash21549

Bampo M Ewing MT Mather DR Stewart D Wallace M (2008)The effects of the social structure of digital networks on viralmarketing performance Inform Systems Res 19(3)273ndash290

Cho E Myers SA Leskovec J (2011) Friendship and mobility Usermovement in location-based social networks Proc 17th ACMSIGKDD Internat Conf Knowledge Discovery and Data MiningKDD rsquo11 (ACM New York) 1082ndash1090

Cortes C Pregibon D Volinsky CT (2001) Communities of interestHoffmann F Hand DJ Adams N Fisher D Guimaraes G edsAdvances in Intelligent Data Analysis Lecture Notes ComputSci Vol 2189 (Springer-Verlag Berlin Heidelberg) 105ndash114

Crandall DJ Backstrom L Cosley D Suri S Huttenlocher DKleinberg J (2010) Inferring social ties from geographic coinci-dences Proc National Acad Sci USA 107(52)22436ndash22441

de Montjoye YA Hidalgo CA Verleysen M Blondel VD (2013)Unique in the crowd The privacy bounds of human mobilitySci Rep 31ndash5

European Commission (2014) Data protection day 2014 Full speedon EU data protection reform Press release January 27httpeuropaeurapidpress-release_MEMO-14-60_enhtm

Fawcett T (2006) An introduction to ROC analysis Pattern Recogni-tion Lett 27(8)861ndash874

Federal Trade Commission (2012) Protecting consumer pri-vacy in an era of rapid change FTC Report March 2012httpswwwftcgovsitesdefaultfilesdocumentsreportsfederal-trade-commission-report-protecting-consumer-privacy-era-rapid-change-recommendations120326privacyreportpdf

Heidemann J Klier M Probst F (2010) Identifying key users inonline social networks A pagerank based approach InformSystems J 4801(December)157ndash160

Hill S Provost F (2003) The myth of the double-blind reviewAuthor identification using only citations SIGKDD Explorations5(2)179ndash184

Hill S Provost F Volinsky C (2006) Network-based marketingIdentifying likely adopters via consumer networks Statist Sci22(2)256ndash276

Hotho A Nuumlrnberger A Paass G (2005) A brief survey of text min-ing LDV Forum 20(1)19ndash62

Kerho S (2012) Mobile marketingmdashA new analytics frameworkwhat we have and what we need Presentation Marketing onthe Move Conference Wharton School University of PennsylvaniaPhiladelphia February 28

Kossinets G Watts DJ (2009) Origins of homophily in an evolvingsocial network Amer J Sociol 115(2)405ndash450

Lee J (2014) 80 of marketers will run cross-channel marketingcampaigns in 2014 ClickZ (March 31) httpwwwclickzcomclickznews233699680-of-marketers-will-run-cross-channel-marketing-campaigns-in-2014-study

Martens D Provost F (2011) Pseudo-social network targeting fromconsumer transaction data Working paper CEDER-11-05 SternSchool of Business New York University New York

McPherson M Smith-Lovin L Cook JM (2001) Birds of afeather Homophily in social networks Annual Rev Sociol 27415ndash444

Oinas-Kukkonen H Lyytinen K Yoo Y (2010) Social networks andinformation systems Ongoing and future research streamsJ Assoc Inform Systems 11(2)61ndash68

Pan W Aharony N Pentland A (2011) Composite social networkfor predicting mobile apps installation Burgard W Roth Deds Proc AAAI 2011 (AAAI Press San Francisco)

Perlich C Dalessandro B Raeder T Stitelman O Provost F (2014)Machine learning for targeted display advertising Transferlearning in action Machine Learn 95(1)103ndash127

Pogue D (2011) Wrapping up the apple location brouhaha NewYork Times (April 28) httppogueblogsnytimescom20110428wrapping-up-the-apple-location-brouhaha

Provost F Fawcett T (2013) Data Science for Business What You Needto Know About Data Mining and Data-analytic Thinking (OrsquoReillyMedia Inc Sebastopol CA)

Provost F Dalessandro B Hook R Zhang X Murray A (2009) Audi-ence selection for on-line brand advertising Privacy-friendlysocial network targeting KDD rsquo09 Proc 15th ACM SIGKDDInternat Conf Knowledge Discovery and Data Mining (ACMNew York) 707ndash716

Pubmatic (2010) Understanding real-time bidding (RTB) from thepublisherrsquos perspective Technical report Pubmatic SiliconValley CA

Quercia D Lathia N Calabrese F Di Lorenzo G Crowcroft J (2010)Recommending social events from mobile phone location dataProc 2010 IEEE Internat Conf Data Mining ICDM rsquo10 (IEEEComputer Society Washington DC) 971ndash976

Raeder T Stitelman O Dalessandro B Perlich C Provost F (2012)Design principles of massive robust prediction systems Proc18th ACM SIGKDD Internat Conf Knowledge Discovery and DataMining (ACM New York) 1357ndash1365

Stitelman O Dalessandro B Perlich C Provost F (2011) Estimatingthe effect of online display advertising on browser conversionData Mining and Audience Intelligence for Advertising ADKDD2011 (ACM New York) 8

Stitelman O Perlich C Dalessandro B Hook R Raeder T Provost F(2013) Using co-visitation networks for detecting large scaleonline display advertising exchange fraud Proc 19th ACMSIGKDD Internat Conf Knowledge Discovery and Data Mining(ACM New York) 1240ndash1248

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Page 18: Finding Similar Mobile Consumers with a Privacy-Friendly ...pages.stern.nyu.edu/~fprovost/Papers/PMM_ISR_2015.pdf · Provost, Martens, and Murray: Finding Similar Mobile Consumers

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 259

Table 18 RTBs 3 and 4 Percentage of Network Neighbors That Also Visit the Publisher

RTB 3

All NNs N1 N10

D N Lift Lev N1 Lift Lev N10 Lift Lev

Considering all RTB exchanges to determine baseline1 Conversion Exchange 5053 34054 6024 29001 64071 11069 59017 50077 9018 450242 Pinger Inc 2096 46002 15053 43006 65012 21097 62015 51048 17037 480523 Top Game Developer 4029 47024 11002 42095 60 13099 55071 48097 11042 440684 PremiumEntertainmentApp-FamilyOriented-Android 0059 23079 40012 2302 20045 34049 19086 22025 37052 210665 Enflick 0053 11071 21095 11017 21057 40044 21004 12027 23 110736 Clapfoot Games 0044 8055 19032 8011 11054 26006 1101 13099 31061 130557 FunPokes Inc 0018 11011 61028 10093 50 275077 49082 15079 87008 150618 Talkatone 0015 13013 88051 12098 31011 209076 30096 17091 120078 170779 Sevenlogics Inc 0008 2071 32059 2063 5088 70074 508 8033 100022 8025

10 A Star Software 0007 3033 44086 3025 23008 311024 23 9009 122061 9002Considering this RTB exchange only to determine baseline

1 Conversion Exchange 11045 34054 3002 23009 64071 5065 53025 50077 4043 390322 Pinger Inc 6013 46002 705 39089 65012 10062 58098 51048 8039 450353 Top Game Developer 8088 47024 5032 38036 60 6076 51012 48097 5052 400094 PremiumEntertainmentApp-FamilyOriented-Android 1023 23079 19038 22057 20045 16066 19023 22025 18013 210025 Enflick 101 11071 1006 1006 21057 19054 20046 12027 11011 110166 Clapfoot Games 0092 8055 9033 7064 11054 12059 10062 13099 15027 130087 FunPokes Inc 0038 11011 2906 10074 50 133022 49062 15079 42007 150418 Talkatone 0031 13013 42076 12082 31011 101034 3008 17091 58035 170619 Sevenlogics Inc 0017 2071 15074 2054 5088 34017 5071 8033 48041 8016

10 A Star Software 0015 3033 21067 3017 23008 150036 22092 9009 59023 8094

RTB 4

Considering all RTB exchanges to determine baseline1 Pinger PhonemdashiOSmdashConversation 1001 71081 70085 7008 80077 79068 79076 70014 6902 690132 Talkatone iPhone App 0007 11038 17203 11031 29003 439068 28097 17002 257078 160963 DiceWithBuddies 0013 7041 59004 7028 18018 144092 18006 7069 61031 70574 9GAG Reader 0012 62013 497099 62 49009 39305 48097 49005 393017 480925 Video Downloader Pro Lite 0016 20003 123015 19087 22058 138084 22042 17093 110027 170776 Rage Comics 0009 609 8001 6081 13033 154086 13025 609 8001 60817 Spades 3D Lite 0006 1205 225076 12044 27027 492057 27022 1205 225076 120448 Apalabrados 0003 77042 21391071 77039 75 21316097 74097 79032 21450043 790299 Relax Melodies HD 0002 31058 21002029 31056 50 31170029 49098 31058 21002029 31056

10 Celeb MemdashPhotoMaker 0 3023 82107 3022 7014 11819048 7014 3023 82107 3022Considering this RTB exchange only to determine baseline

1 Pinger PhonemdashiOSmdashConversation 8051 71081 8044 6303 80077 9049 72026 70014 8024 610632 Talkatone iPhone App 0055 11038 20053 10082 29003 52038 28048 17002 30071 160473 DiceWithBuddies 0009 7041 82011 7032 18018 201054 18009 7069 85027 7064 9GAG Reader 0097 62013 63097 61016 49009 50055 48012 49005 50051 480085 Video Downloader Pro Lite 0044 20003 45036 19059 22058 51014 22014 17093 40062 170496 Rage Comics 0037 609 18079 6053 13033 36032 12097 609 18079 60537 Spades 3D Lite 0046 1205 2609 12004 27027 58068 26081 1205 2609 120048 Apalabrados 0021 77042 371093 77021 75 360031 74079 79032 381006 790119 Relax Melodies HD 0013 31058 238054 31045 50 377069 49087 31058 238054 31045

10 Celeb MemdashPhotoMaker 0003 3023 97089 3019 7014 216076 7011 3023 97089 3019

Notes Results are given for the complete data set (D) the complete neighborhood of a user who visited the given publisher (All NNs) and considering theneighborhood of closest 1 and 10 users (N1 and N10)

The very high lifts could be attributed to two phe-nomena They may provide some evidence that theGSN embeds a true social network as pointed out bythe good results for the social networking publishers(eg Apalabrados with a lift of 2391 appsevitecomwith a lift of 2212 and others) Friends visiting thesame social networking sites (here unusual ones)

would increase the lift and leverage over just havingdifferent instances of the same user visiting the samewebsites An alternative explanation is that some ofthese publishers are much more likely to be visitedacross different devices of the same user It is not clearwhether this is actually true However it argues forobtaining specific and broad data on which devices

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers260 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

are not the same user to shed further light on forexample any broad data that include a (anonymized)user identifier

From these results we can conclude that GSNneighbors indeed exhibit substantially similar pub-lisher visitation behavior Next we assess to whatextent more strongly connected NNs are more similarthan less strongly connected NNs

72 Ranking Users with Similar Interests

Are the More-Similar Geosimilarity Neighbors of aPublisherrsquos Visitors Even More Likely Also to Be Vis-itors to the Same Publisher To assess whether more-similar geosimilarity neighbors are even more likelyto share interests we compute the measures exactlyas in the previous study except instead of consider-ing all NNs only the 1 (N1) and 10 (N10) neighborswith the highest scores are considered (see Tables 17and 18) First we must consider which of the 15 met-rics that are defined to measure link strength to useTable 19 reports the rank aggregations of the differentmetrics as to which provide the best lifts Specificallyfor each of the 40 publishers each scoring metric pro-vides a lift and for that publisher the scoring metricscan be ranked (the best being ranked 1 etc) The rankaggregation is the average rank for a scoring metricacross all 40 publishers Thus if one metric providedthe best lift for all publishers it would get an aggre-gated rank of 1

To visualize the best metrics the metrics with anaverage ranking less than 8 are shown in boldfacein Table 19 The Jaccard metrics do not perform wellin this study either showing that penalizing the con-nection to users that visit many locations is not sen-sible Remember that the Jaccard metric is based onthe idea that users that visit many locations will havea higher chance to share some location with anotheruser However we have so many locations in totalthat visiting a couple more locations will increase onlymarginally this probability of sharing a location bychance A user that logs into more IP addresses andis more active should hence not be penalized andranked lower than other less active users Metric 3which sums the IDF scores of the shared locationsperforms quite well Given the operational efficiencyadvantage over the frequency-based metrics this met-ric is chosen to measure the strength of the links

As seen from the results in Tables 17 and 18 thelift and leverage values for the strongly connected (top 1and top 10) NNs are even higher sometimes astro-nomical (3170 for Relax Melodies HD (RTB 4)) Theseresults are highly significant by sign tests compar-ing either the lift or the leverage between the top-ranked neighbors and the entire cohort (p lt 0001) Theabsolute values again indicate that the users with the

strongest geosimilarity are often substantially morelikely to visit the same publisher

Thus we can conclude that not only does geosim-ilarity find users with similar interests it also ranksusers well by their likelihood of having similarinterests

An interesting follow-up question is whether thecharacteristics of a publisher are of importance for theresults For example are GSN neighbors more sim-ilar in terms of using the same social network appas compared to playing the same games To answerthis question the publishers are categorized into sixclasses funny social news games communicationand miscellaneous In Figure 9 the lifts of the pub-lishers are shown per category Some general trendscan be observed where publishers related to funnycontent have the highest median lift This could beexplained by the fact that users often forward funnycontent to their friends The social websitesapps fol-low next which seems to further demonstrate thatfriends are likely linked in the GSN However pub-lishers in the communication category perform not sowell compared with the other categories and ratherhigh variances are observed in the different cate-gories Please note that high variance is observedacross the categories therefore it is difficult to makewell-supported claims from these results

73 TimingAs a final evaluation we explore whether the inclu-sion of temporal similarity in location visitation ishelpful for defining GSN connections To this end weintroduce a setting where two users are connectedonly if they visit an IP address in the same timeperiod of a day Specifically we divide a day intodifferent time windowsmdashfor example two time peri-ods of 12 hoursmdashwhere users are connected only ifthey visit the same IP address in the same time periodof a day (which might be on different days) Practi-cally two digits are added to an IP which denote thestarting moment of the corresponding time periodWe report on time periods of 2 hours and of 12 hours(other periods yield similar results) We also includethe setting where users are connected only when theyvisit the same IP address on the same day in light ofthe result of Crandall et al (2010) that visiting thesame location at around the same time is indicativeof friendship

The resulting lifts are reported in Figures 10ndash12when considering respectively all NNs the top 1 NNand the top 10 NNs The results when not consider-ing the time period are repeated as well (All) lim-ited to the conservative case of only considering oneexchange (see above) The baseline (lift of one) isindicated with a red horizontal line lifts higher than1000 are shown as 1000 to limit the range to be able

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 261

Table 19 Rank Aggregation per Metric Over All Publishers (Across All RTBs)

Top 1 NN Top 10 NNs

1 Shared unique IPs 801 1 Shared Unique IPs 802252 BooleanmdashSharing an IP 80625 2 BooleanmdashSharing an IP 809133 Shared unique IPsmdashIDF weighted 70025 3 Shared unique IPsmdashIDF weighted 709754 Shared IP visitsmdashMin 70063 4 Shared IP visitsmdashMin 704135 Shared IP visitsmdashMax 7095 5 Shared IP visitsmdashMax 70256 Shared IP visitsmdashAvg 8005 6 Shared IP visitsmdashAvg 704887 Shared IP visitsmdashMinmdashIDF weighted 60625 7 Shared IP visitsmdashMinmdashIDF weighted 703758 Shared IP visitsmdashMaxmdashIDF weighted 70975 8 Shared IP visitsmdashMaxmdashIDF weighted 702259 Shared IP visitsmdashAvgmdashIDF weighted 70938 9 Shared IP visitsmdashAvgmdashIDF weighted 70525

10 Cosine similarity 80338 10 Cosine similarity 804511 Cosine similaritymdashIDF weighted 80213 11 Cosine similaritymdashIDF weighted 8086312 GSN Jaccard 80925 12 GSN Jaccard 913 GSN JaccardmdashIDF weighted 80325 13 GSN JaccardmdashIDF weighted 9013814 GSN Jaccard Freq 80425 14 GSN Jaccard Freq 80515 GSN Jaccard FreqmdashIDF weighted 80425 15 GSN Jaccard FreqmdashIDF weighted 805

to visualize the results Tables 20 and 21 show theaverage rank aggregation (see the discussion of rankaggregation above) in terms of lift per RTB and thep-values of a signed rank test respectively

When comparing the different temporal settingswe see that the ldquoDayrdquo variant performs best and lim-ited differences exist between the time periods of 2and 12 hours More important to observe from theseresults is that (overall) not including time in the GSNdesign outperforms all three temporal variants sub-stantially Only for RTB 3 and the Day version whereusers are connected only if they visit the same IPaddress on the same day are the lifts not signifi-cantly worse than having no time constraint at a 1level (although even here they are significantly worseat a 10 level) This shows that for finding userswith similar interests which locations they visit ismore important than when they visit them By includ-ing the time constraints connections between previ-ously connected users are lost simply because they

Figure 9 (Color online) Lifts of the Publishers (for All Four RTBs)Ranked According to the Median Lift Within the Category(Medians Shown by the Full Line)

visit locations in different time periods Although weuse time to define links metrics that include time todefine the strength of a link might improve the resultsfurther and we consider that an interesting issue forfuture research

8 ConclusionThis paper presents a new design for using geosim-ilarity to connect user instances in a geosimilarity-weighted network In the GSN users are connected ifthey share at least one observed location Various met-rics can be used to yield degrees of similarity We pre-sented intuitive arguments theory and prior researchthat suggest that geosimilarity and the GSN designshould link similar users We also provided strongempirical support Specifically Study 1 shows thatthe GSN indeed links different users instances cor-responding to the same individual and the geosimi-larity metrics rank user instances by their likelihoodof corresponding to the same individual Study 2shows that the GSN links users who have similarinterests as measured by their propensity to visitthe same publishersuse the same apps It also pro-vides strong evidence that the geosimilarity metricsrank user instances by their likelihood of havingsimilar interests Taken together the results providestrong support that the GSN links users with similarinterestsmdashin many cases because they actually repre-sent the same individuals

The main limitation of the research presented inthis paper is that the data are not sufficient to dis-tinguish whether the similarity in interests is solelybased on connecting instances of the same user Thisis an important avenue for future research This lim-itation notwithstanding the very strong results fromStudy 2 have important implications in either caseeither the network is indeed finding different indi-viduals with substantially similar interests or we

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers262 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

Figure 10 (Color online) Lifts Considering All Network Neighbors per RTB System (Each Panel) With and Without the Time Constraint Limited toOne Network (see sect71) for the 40 Publishers

RTB 3mdashAll NNs

0 5 10

Publisher

0

20

40

60

80

100

Lift

RTB 4mdashAll NNs

0 5 10

Publisher

0

200

400

600

800

1000

Lift

RTB 2mdashAll NNs

0 5 10

Publisher

0

200

400

600

800

1000

Lift

0 5 100

200

400

600

800

1000RTB 1mdashAll NNs

Publisher

Lift

All

TP2

TP12

Day

Note Lifts higher than 1000 are shown at 1000

Figure 11 (Color online) Lifts Considering the Closest 1 Network Neighbor per RTB System (Each Panel) With and Without the Time ConstraintLimited to One Network (see sect71) for the 40 Publishers

Lift

0

100

200

400

300

RTB 3mdashTop 1 NNs

0 5 10

Publisher

All

TP2

TP12

Day

0 5 100

200

400

600

800

1000RTB 1mdashTop 1 NNs

Publisher

Lift

RTB 2mdashTop 1 NNs

0 5 10

Publisher

0

200

400

600

800

1000

Lift

RTB 4mdashTop 1 NNs

0 5 10

Publisher

0

200

400

600

800

1000

Lift

Note Lifts higher than 1000 are shown at 1000

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 263

Figure 12 (Color online) Lifts Considering the Closest 10 Network Neighbors per RTB System (Each Panel) With and Without the Time ConstraintLimited to One Network (see sect71) for the 40 Publishers

RTB 2mdashTop 10 NNs

0 5 10

Publisher

0

200

400

600

800

1000

Lift

0 5 10

Publisher

Lift

0

50

100

150

0 5 10

Publisher

0

200

400

600

800

1000

Lift

0 5 100

200

400

600

800

1000

RTB 1mdashTop 10 NNs

RTB 3mdashTop 10 NNs RTB 4mdashTop 10 NNs

Publisher

Lift

All

TP2

TP12

Day

Note Lifts higher than 1000 are shown at 1000

have very strong additional support for the rankingresults of Study 1 as the neighbors with the strongestgeosimilarity have significantly higher affinity for thesame publishersapps

These results have broad and immediate manage-ment implications within our motivating applicationof mobile advertising As discussed above digitalmarketers would like to target the many fragmentedinstances of individuals in the digital advertising

Table 20 Rank Aggregations in Terms of Lift AveragedAcross All Publishers (per RTB)

All TP2 TP12 Day

RTB 1 101 209 309 201RTB 2 1 204 304 302RTB 3 101 3 4 109RTB 4 101 208 308 203

Note Linking based on time does not improve over linkingignoring time

Table 21 Evaluation of Including Time p-Values ofSigned Rank Test

All TP2 TP12 Day

RTB 1 mdash 0 0002 0 0002 0 0006RTB 2 mdash 0 0002 0 0002 0 0002RTB 3 mdash 0 0002 0 0002 00065RTB 4 mdash 0 0002 0 0002 0 0004

ecosystem as well as users who have similar intereststo particular ldquoseedrdquo users Location data such asIP addresses are readily available across the adver-tising ecosystem The most direct use of the GSN formobile ad targeting is to seed a targeting campaignwith users chosen to exhibit some characteristics ofinterest These could be chosen through any of themyriad of methods currently used in advertising TheGSN will directly allow the targeting of the othermembers of these usersrsquo geosimilarity cohorts whichwill include other instances of the same user othersimilar users and likely both So for example if a setof users has been identified to have brand affinity viavisits to a brandrsquos website the GSN cohorts of theseusers would provide an attractive avenue to expandthe reach of a campaign to target them (extendingtraditional retargeting) and also target users similarto the seeds (customer prospecting) In the researchdata set we see that different instances of the sameusers are very often connected in the GSN and thatthose connected in the GSN exhibit substantially sim-ilar interests

The GSN also could be used for (privacy-friendly)hyperlocal targetingmdashmeaning targeting people whofrequent a particular location without needing to storedata on the actual locations of the users as describedearlier IP addresses currently are used by somemarketers for coarse-grained demographic targetinginferring geography from IP address registration data

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers264 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

and then inferring demographics from the geographyThe GSN provides a complementary alternative it willconnect people who visit the same fine-grained loca-tions As described above in the couponing exampleif a campaign is seeded with customers of a localestablishment (eg via an online loyalty program)geosimilarity can target others who frequent the samelocations

An exciting potential future use of the GSN is theevaluation of marketing campaigns across differentchannels This has become an important challenge formobile advertizing (Lee 2014)14 For example I mightreceive an ad for a hot new product on my mobiledevice I may be interested in the product but Irsquomunlikely to buy this product on my mobile phoneRather Irsquoll buy the product using my laptop or PC Bylinking the screens of the same user we are now ableto provide a broader (and possibly much more robust)estimate of campaign success (eg conversion rate)across different channels by aggregating the successmetrics across GSN neighbors For example similar tohow a traditional ad campaign may runnot run anad in a controlled study across different cities agen-cies could targetnot target different (matched) usersand then look at the success rates in their respec-tive geosimilarity neighborhoods This should captureeffects of the same individual on different devices

As discussed at the outset advertisers need to becognizant of privacy concerns regarding the collec-tion storage and use of data This paper followswhat the FTC calls the ldquoprivacy by designrdquo approachBy design the technique does not need to store anydirect personal information about mobile users thereis no need for nonanonymized identifiers demo-graphics geographics psychographics etc In addi-tion the storage of indirect information about userscan be severely limited as well The method doesnot need the actual locationsmdashonly anonymized loca-tion ldquokeysrdquo So for example IP addresses can bereplaced with random numbers15 without affectingthe GSN performance (European Commission 2014Federal Trade Commission 2012) More technically atthe ldquoouter wallrdquo of the system or firm each deviceid can be irreversibly hashed to a random key Theonly requirement is that the same device be hashedto the same key if encountered again Similarly atthe ldquoouter wallrdquo of the system or firm every loca-tion also can be irreversibly hashed to a random keyThe GSN can be formed just the same with the ran-dom keys as with the actual locations If more privacy

14 httpwwwclickzcomclickznews233699680-of-marketers-will-run-cross-channel-marketing-campaigns-in-2014-study15 See both the FTCrsquos and the European Commissionrsquos 2010ndash2011privacy-by-design reports and the comments therein along withProvost et al (2009) for a thorough treatment of this topic

is desired hashing can be done irreversibly many toone and in that case it becomes impossible to asso-ciate definitively any particular location with any par-ticular deviceuser

Finally an aspect of this research that is importantfor managers but not typically covered in the predic-tive modeling literature is the notion of increasing thereach of a campaign For example as discussed abovepossibly the most straightforward use of the GSN isto select the same actor on different mobile devicesThis would allow us to expand the reach of a ldquoretar-getingrdquo campaign16

Increasing reach has important subtleties in the cur-rent advertising ecosystemmdashit is not just the otherside of the coin of increasing lift The reason is thatthere are many targeters in the onlinemobile adver-tising ecosystem all of whom are using the samedata retargeting data and demographicgeographicpsychographic data that are purchased from third-party data providers However these data are avail-able only on a subset of devices Thus all partieswho are using these data are competing for the samesometimes small set of devices Since large advertis-ers typically contract with multiple targeting firmswho take different strategies the effect is that theadvertiser is paying the firms to compete against eachother effectively raising the price in the auction andthereby raising their own cost of advertising17 Whatis more these will be the same devices that also willbe targeted for other campaigns because they are thedevices for which the targeters have data

However by connecting devices in a GSN we canexpand campaigns to devices for which there are noretargeting or third-party data available at all If thereis less competition for these devices we should beable to target them for a lower price Thus expand-ing reach has an implication for the cost-effectivenessof achieving a certain level of predictive performance(eg lift)

AcknowledgmentsThe authors thank the anonymous reviewers and edi-tors for very constructive suggestions which substantiallyimproved the paper This work was done while the firstauthor was at Coriolis Ventures and Everyscreen MediaThe authors extend our gratitude to Everyscreen MediaDstillery Lauren Moores Tom OrsquoReilly Vinayak Javaly andTina Eliassi-Rad for many discussions about mobile adver-tising and mobile ad targeting as well as technical assistance

16 Retargeting is to target browsers who have previously purchasedfrom the brand or who have taken some other indicative brandaction such as browsing the brandrsquos site Retargeting is consid-ered by many to be one of the most effective targeting strategies(although to our knowledge these conclusions are drawn based onassessing conversion rate rather than the influence of the advertis-ing please see Stitelman et al 2011)17 The latter is our conjecture

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 265

to obtain all of the relevant data This particular data set wasnot necessarily used in the development of any productionmodel used for mobile advertising The first author thanksNEC and Andre Meyer for faculty fellowships The authorsthank the Moore and Sloan Foundations for their generoussupport of the Moore-Sloan Data Science Environment atNew York University

ReferencesAgarwal R Gupta AK Kraut RE (2008) Editorial overviewmdashThe

interplay between digital and social networks Inform SystemsRes 19(3)243ndash252

Aral S Muchnik L Sundararajan A (2009) Distinguishing influence-based contagion from homophily-driven diffusion in dynamicnetworks Proc National Acad Sci USA 106(51)21544ndash21549

Bampo M Ewing MT Mather DR Stewart D Wallace M (2008)The effects of the social structure of digital networks on viralmarketing performance Inform Systems Res 19(3)273ndash290

Cho E Myers SA Leskovec J (2011) Friendship and mobility Usermovement in location-based social networks Proc 17th ACMSIGKDD Internat Conf Knowledge Discovery and Data MiningKDD rsquo11 (ACM New York) 1082ndash1090

Cortes C Pregibon D Volinsky CT (2001) Communities of interestHoffmann F Hand DJ Adams N Fisher D Guimaraes G edsAdvances in Intelligent Data Analysis Lecture Notes ComputSci Vol 2189 (Springer-Verlag Berlin Heidelberg) 105ndash114

Crandall DJ Backstrom L Cosley D Suri S Huttenlocher DKleinberg J (2010) Inferring social ties from geographic coinci-dences Proc National Acad Sci USA 107(52)22436ndash22441

de Montjoye YA Hidalgo CA Verleysen M Blondel VD (2013)Unique in the crowd The privacy bounds of human mobilitySci Rep 31ndash5

European Commission (2014) Data protection day 2014 Full speedon EU data protection reform Press release January 27httpeuropaeurapidpress-release_MEMO-14-60_enhtm

Fawcett T (2006) An introduction to ROC analysis Pattern Recogni-tion Lett 27(8)861ndash874

Federal Trade Commission (2012) Protecting consumer pri-vacy in an era of rapid change FTC Report March 2012httpswwwftcgovsitesdefaultfilesdocumentsreportsfederal-trade-commission-report-protecting-consumer-privacy-era-rapid-change-recommendations120326privacyreportpdf

Heidemann J Klier M Probst F (2010) Identifying key users inonline social networks A pagerank based approach InformSystems J 4801(December)157ndash160

Hill S Provost F (2003) The myth of the double-blind reviewAuthor identification using only citations SIGKDD Explorations5(2)179ndash184

Hill S Provost F Volinsky C (2006) Network-based marketingIdentifying likely adopters via consumer networks Statist Sci22(2)256ndash276

Hotho A Nuumlrnberger A Paass G (2005) A brief survey of text min-ing LDV Forum 20(1)19ndash62

Kerho S (2012) Mobile marketingmdashA new analytics frameworkwhat we have and what we need Presentation Marketing onthe Move Conference Wharton School University of PennsylvaniaPhiladelphia February 28

Kossinets G Watts DJ (2009) Origins of homophily in an evolvingsocial network Amer J Sociol 115(2)405ndash450

Lee J (2014) 80 of marketers will run cross-channel marketingcampaigns in 2014 ClickZ (March 31) httpwwwclickzcomclickznews233699680-of-marketers-will-run-cross-channel-marketing-campaigns-in-2014-study

Martens D Provost F (2011) Pseudo-social network targeting fromconsumer transaction data Working paper CEDER-11-05 SternSchool of Business New York University New York

McPherson M Smith-Lovin L Cook JM (2001) Birds of afeather Homophily in social networks Annual Rev Sociol 27415ndash444

Oinas-Kukkonen H Lyytinen K Yoo Y (2010) Social networks andinformation systems Ongoing and future research streamsJ Assoc Inform Systems 11(2)61ndash68

Pan W Aharony N Pentland A (2011) Composite social networkfor predicting mobile apps installation Burgard W Roth Deds Proc AAAI 2011 (AAAI Press San Francisco)

Perlich C Dalessandro B Raeder T Stitelman O Provost F (2014)Machine learning for targeted display advertising Transferlearning in action Machine Learn 95(1)103ndash127

Pogue D (2011) Wrapping up the apple location brouhaha NewYork Times (April 28) httppogueblogsnytimescom20110428wrapping-up-the-apple-location-brouhaha

Provost F Fawcett T (2013) Data Science for Business What You Needto Know About Data Mining and Data-analytic Thinking (OrsquoReillyMedia Inc Sebastopol CA)

Provost F Dalessandro B Hook R Zhang X Murray A (2009) Audi-ence selection for on-line brand advertising Privacy-friendlysocial network targeting KDD rsquo09 Proc 15th ACM SIGKDDInternat Conf Knowledge Discovery and Data Mining (ACMNew York) 707ndash716

Pubmatic (2010) Understanding real-time bidding (RTB) from thepublisherrsquos perspective Technical report Pubmatic SiliconValley CA

Quercia D Lathia N Calabrese F Di Lorenzo G Crowcroft J (2010)Recommending social events from mobile phone location dataProc 2010 IEEE Internat Conf Data Mining ICDM rsquo10 (IEEEComputer Society Washington DC) 971ndash976

Raeder T Stitelman O Dalessandro B Perlich C Provost F (2012)Design principles of massive robust prediction systems Proc18th ACM SIGKDD Internat Conf Knowledge Discovery and DataMining (ACM New York) 1357ndash1365

Stitelman O Dalessandro B Perlich C Provost F (2011) Estimatingthe effect of online display advertising on browser conversionData Mining and Audience Intelligence for Advertising ADKDD2011 (ACM New York) 8

Stitelman O Perlich C Dalessandro B Hook R Raeder T Provost F(2013) Using co-visitation networks for detecting large scaleonline display advertising exchange fraud Proc 19th ACMSIGKDD Internat Conf Knowledge Discovery and Data Mining(ACM New York) 1240ndash1248

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Page 19: Finding Similar Mobile Consumers with a Privacy-Friendly ...pages.stern.nyu.edu/~fprovost/Papers/PMM_ISR_2015.pdf · Provost, Martens, and Murray: Finding Similar Mobile Consumers

Provost Martens and Murray Finding Similar Mobile Consumers260 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

are not the same user to shed further light on forexample any broad data that include a (anonymized)user identifier

From these results we can conclude that GSNneighbors indeed exhibit substantially similar pub-lisher visitation behavior Next we assess to whatextent more strongly connected NNs are more similarthan less strongly connected NNs

72 Ranking Users with Similar Interests

Are the More-Similar Geosimilarity Neighbors of aPublisherrsquos Visitors Even More Likely Also to Be Vis-itors to the Same Publisher To assess whether more-similar geosimilarity neighbors are even more likelyto share interests we compute the measures exactlyas in the previous study except instead of consider-ing all NNs only the 1 (N1) and 10 (N10) neighborswith the highest scores are considered (see Tables 17and 18) First we must consider which of the 15 met-rics that are defined to measure link strength to useTable 19 reports the rank aggregations of the differentmetrics as to which provide the best lifts Specificallyfor each of the 40 publishers each scoring metric pro-vides a lift and for that publisher the scoring metricscan be ranked (the best being ranked 1 etc) The rankaggregation is the average rank for a scoring metricacross all 40 publishers Thus if one metric providedthe best lift for all publishers it would get an aggre-gated rank of 1

To visualize the best metrics the metrics with anaverage ranking less than 8 are shown in boldfacein Table 19 The Jaccard metrics do not perform wellin this study either showing that penalizing the con-nection to users that visit many locations is not sen-sible Remember that the Jaccard metric is based onthe idea that users that visit many locations will havea higher chance to share some location with anotheruser However we have so many locations in totalthat visiting a couple more locations will increase onlymarginally this probability of sharing a location bychance A user that logs into more IP addresses andis more active should hence not be penalized andranked lower than other less active users Metric 3which sums the IDF scores of the shared locationsperforms quite well Given the operational efficiencyadvantage over the frequency-based metrics this met-ric is chosen to measure the strength of the links

As seen from the results in Tables 17 and 18 thelift and leverage values for the strongly connected (top 1and top 10) NNs are even higher sometimes astro-nomical (3170 for Relax Melodies HD (RTB 4)) Theseresults are highly significant by sign tests compar-ing either the lift or the leverage between the top-ranked neighbors and the entire cohort (p lt 0001) Theabsolute values again indicate that the users with the

strongest geosimilarity are often substantially morelikely to visit the same publisher

Thus we can conclude that not only does geosim-ilarity find users with similar interests it also ranksusers well by their likelihood of having similarinterests

An interesting follow-up question is whether thecharacteristics of a publisher are of importance for theresults For example are GSN neighbors more sim-ilar in terms of using the same social network appas compared to playing the same games To answerthis question the publishers are categorized into sixclasses funny social news games communicationand miscellaneous In Figure 9 the lifts of the pub-lishers are shown per category Some general trendscan be observed where publishers related to funnycontent have the highest median lift This could beexplained by the fact that users often forward funnycontent to their friends The social websitesapps fol-low next which seems to further demonstrate thatfriends are likely linked in the GSN However pub-lishers in the communication category perform not sowell compared with the other categories and ratherhigh variances are observed in the different cate-gories Please note that high variance is observedacross the categories therefore it is difficult to makewell-supported claims from these results

73 TimingAs a final evaluation we explore whether the inclu-sion of temporal similarity in location visitation ishelpful for defining GSN connections To this end weintroduce a setting where two users are connectedonly if they visit an IP address in the same timeperiod of a day Specifically we divide a day intodifferent time windowsmdashfor example two time peri-ods of 12 hoursmdashwhere users are connected only ifthey visit the same IP address in the same time periodof a day (which might be on different days) Practi-cally two digits are added to an IP which denote thestarting moment of the corresponding time periodWe report on time periods of 2 hours and of 12 hours(other periods yield similar results) We also includethe setting where users are connected only when theyvisit the same IP address on the same day in light ofthe result of Crandall et al (2010) that visiting thesame location at around the same time is indicativeof friendship

The resulting lifts are reported in Figures 10ndash12when considering respectively all NNs the top 1 NNand the top 10 NNs The results when not consider-ing the time period are repeated as well (All) lim-ited to the conservative case of only considering oneexchange (see above) The baseline (lift of one) isindicated with a red horizontal line lifts higher than1000 are shown as 1000 to limit the range to be able

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 261

Table 19 Rank Aggregation per Metric Over All Publishers (Across All RTBs)

Top 1 NN Top 10 NNs

1 Shared unique IPs 801 1 Shared Unique IPs 802252 BooleanmdashSharing an IP 80625 2 BooleanmdashSharing an IP 809133 Shared unique IPsmdashIDF weighted 70025 3 Shared unique IPsmdashIDF weighted 709754 Shared IP visitsmdashMin 70063 4 Shared IP visitsmdashMin 704135 Shared IP visitsmdashMax 7095 5 Shared IP visitsmdashMax 70256 Shared IP visitsmdashAvg 8005 6 Shared IP visitsmdashAvg 704887 Shared IP visitsmdashMinmdashIDF weighted 60625 7 Shared IP visitsmdashMinmdashIDF weighted 703758 Shared IP visitsmdashMaxmdashIDF weighted 70975 8 Shared IP visitsmdashMaxmdashIDF weighted 702259 Shared IP visitsmdashAvgmdashIDF weighted 70938 9 Shared IP visitsmdashAvgmdashIDF weighted 70525

10 Cosine similarity 80338 10 Cosine similarity 804511 Cosine similaritymdashIDF weighted 80213 11 Cosine similaritymdashIDF weighted 8086312 GSN Jaccard 80925 12 GSN Jaccard 913 GSN JaccardmdashIDF weighted 80325 13 GSN JaccardmdashIDF weighted 9013814 GSN Jaccard Freq 80425 14 GSN Jaccard Freq 80515 GSN Jaccard FreqmdashIDF weighted 80425 15 GSN Jaccard FreqmdashIDF weighted 805

to visualize the results Tables 20 and 21 show theaverage rank aggregation (see the discussion of rankaggregation above) in terms of lift per RTB and thep-values of a signed rank test respectively

When comparing the different temporal settingswe see that the ldquoDayrdquo variant performs best and lim-ited differences exist between the time periods of 2and 12 hours More important to observe from theseresults is that (overall) not including time in the GSNdesign outperforms all three temporal variants sub-stantially Only for RTB 3 and the Day version whereusers are connected only if they visit the same IPaddress on the same day are the lifts not signifi-cantly worse than having no time constraint at a 1level (although even here they are significantly worseat a 10 level) This shows that for finding userswith similar interests which locations they visit ismore important than when they visit them By includ-ing the time constraints connections between previ-ously connected users are lost simply because they

Figure 9 (Color online) Lifts of the Publishers (for All Four RTBs)Ranked According to the Median Lift Within the Category(Medians Shown by the Full Line)

visit locations in different time periods Although weuse time to define links metrics that include time todefine the strength of a link might improve the resultsfurther and we consider that an interesting issue forfuture research

8 ConclusionThis paper presents a new design for using geosim-ilarity to connect user instances in a geosimilarity-weighted network In the GSN users are connected ifthey share at least one observed location Various met-rics can be used to yield degrees of similarity We pre-sented intuitive arguments theory and prior researchthat suggest that geosimilarity and the GSN designshould link similar users We also provided strongempirical support Specifically Study 1 shows thatthe GSN indeed links different users instances cor-responding to the same individual and the geosimi-larity metrics rank user instances by their likelihoodof corresponding to the same individual Study 2shows that the GSN links users who have similarinterests as measured by their propensity to visitthe same publishersuse the same apps It also pro-vides strong evidence that the geosimilarity metricsrank user instances by their likelihood of havingsimilar interests Taken together the results providestrong support that the GSN links users with similarinterestsmdashin many cases because they actually repre-sent the same individuals

The main limitation of the research presented inthis paper is that the data are not sufficient to dis-tinguish whether the similarity in interests is solelybased on connecting instances of the same user Thisis an important avenue for future research This lim-itation notwithstanding the very strong results fromStudy 2 have important implications in either caseeither the network is indeed finding different indi-viduals with substantially similar interests or we

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers262 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

Figure 10 (Color online) Lifts Considering All Network Neighbors per RTB System (Each Panel) With and Without the Time Constraint Limited toOne Network (see sect71) for the 40 Publishers

RTB 3mdashAll NNs

0 5 10

Publisher

0

20

40

60

80

100

Lift

RTB 4mdashAll NNs

0 5 10

Publisher

0

200

400

600

800

1000

Lift

RTB 2mdashAll NNs

0 5 10

Publisher

0

200

400

600

800

1000

Lift

0 5 100

200

400

600

800

1000RTB 1mdashAll NNs

Publisher

Lift

All

TP2

TP12

Day

Note Lifts higher than 1000 are shown at 1000

Figure 11 (Color online) Lifts Considering the Closest 1 Network Neighbor per RTB System (Each Panel) With and Without the Time ConstraintLimited to One Network (see sect71) for the 40 Publishers

Lift

0

100

200

400

300

RTB 3mdashTop 1 NNs

0 5 10

Publisher

All

TP2

TP12

Day

0 5 100

200

400

600

800

1000RTB 1mdashTop 1 NNs

Publisher

Lift

RTB 2mdashTop 1 NNs

0 5 10

Publisher

0

200

400

600

800

1000

Lift

RTB 4mdashTop 1 NNs

0 5 10

Publisher

0

200

400

600

800

1000

Lift

Note Lifts higher than 1000 are shown at 1000

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 263

Figure 12 (Color online) Lifts Considering the Closest 10 Network Neighbors per RTB System (Each Panel) With and Without the Time ConstraintLimited to One Network (see sect71) for the 40 Publishers

RTB 2mdashTop 10 NNs

0 5 10

Publisher

0

200

400

600

800

1000

Lift

0 5 10

Publisher

Lift

0

50

100

150

0 5 10

Publisher

0

200

400

600

800

1000

Lift

0 5 100

200

400

600

800

1000

RTB 1mdashTop 10 NNs

RTB 3mdashTop 10 NNs RTB 4mdashTop 10 NNs

Publisher

Lift

All

TP2

TP12

Day

Note Lifts higher than 1000 are shown at 1000

have very strong additional support for the rankingresults of Study 1 as the neighbors with the strongestgeosimilarity have significantly higher affinity for thesame publishersapps

These results have broad and immediate manage-ment implications within our motivating applicationof mobile advertising As discussed above digitalmarketers would like to target the many fragmentedinstances of individuals in the digital advertising

Table 20 Rank Aggregations in Terms of Lift AveragedAcross All Publishers (per RTB)

All TP2 TP12 Day

RTB 1 101 209 309 201RTB 2 1 204 304 302RTB 3 101 3 4 109RTB 4 101 208 308 203

Note Linking based on time does not improve over linkingignoring time

Table 21 Evaluation of Including Time p-Values ofSigned Rank Test

All TP2 TP12 Day

RTB 1 mdash 0 0002 0 0002 0 0006RTB 2 mdash 0 0002 0 0002 0 0002RTB 3 mdash 0 0002 0 0002 00065RTB 4 mdash 0 0002 0 0002 0 0004

ecosystem as well as users who have similar intereststo particular ldquoseedrdquo users Location data such asIP addresses are readily available across the adver-tising ecosystem The most direct use of the GSN formobile ad targeting is to seed a targeting campaignwith users chosen to exhibit some characteristics ofinterest These could be chosen through any of themyriad of methods currently used in advertising TheGSN will directly allow the targeting of the othermembers of these usersrsquo geosimilarity cohorts whichwill include other instances of the same user othersimilar users and likely both So for example if a setof users has been identified to have brand affinity viavisits to a brandrsquos website the GSN cohorts of theseusers would provide an attractive avenue to expandthe reach of a campaign to target them (extendingtraditional retargeting) and also target users similarto the seeds (customer prospecting) In the researchdata set we see that different instances of the sameusers are very often connected in the GSN and thatthose connected in the GSN exhibit substantially sim-ilar interests

The GSN also could be used for (privacy-friendly)hyperlocal targetingmdashmeaning targeting people whofrequent a particular location without needing to storedata on the actual locations of the users as describedearlier IP addresses currently are used by somemarketers for coarse-grained demographic targetinginferring geography from IP address registration data

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers264 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

and then inferring demographics from the geographyThe GSN provides a complementary alternative it willconnect people who visit the same fine-grained loca-tions As described above in the couponing exampleif a campaign is seeded with customers of a localestablishment (eg via an online loyalty program)geosimilarity can target others who frequent the samelocations

An exciting potential future use of the GSN is theevaluation of marketing campaigns across differentchannels This has become an important challenge formobile advertizing (Lee 2014)14 For example I mightreceive an ad for a hot new product on my mobiledevice I may be interested in the product but Irsquomunlikely to buy this product on my mobile phoneRather Irsquoll buy the product using my laptop or PC Bylinking the screens of the same user we are now ableto provide a broader (and possibly much more robust)estimate of campaign success (eg conversion rate)across different channels by aggregating the successmetrics across GSN neighbors For example similar tohow a traditional ad campaign may runnot run anad in a controlled study across different cities agen-cies could targetnot target different (matched) usersand then look at the success rates in their respec-tive geosimilarity neighborhoods This should captureeffects of the same individual on different devices

As discussed at the outset advertisers need to becognizant of privacy concerns regarding the collec-tion storage and use of data This paper followswhat the FTC calls the ldquoprivacy by designrdquo approachBy design the technique does not need to store anydirect personal information about mobile users thereis no need for nonanonymized identifiers demo-graphics geographics psychographics etc In addi-tion the storage of indirect information about userscan be severely limited as well The method doesnot need the actual locationsmdashonly anonymized loca-tion ldquokeysrdquo So for example IP addresses can bereplaced with random numbers15 without affectingthe GSN performance (European Commission 2014Federal Trade Commission 2012) More technically atthe ldquoouter wallrdquo of the system or firm each deviceid can be irreversibly hashed to a random key Theonly requirement is that the same device be hashedto the same key if encountered again Similarly atthe ldquoouter wallrdquo of the system or firm every loca-tion also can be irreversibly hashed to a random keyThe GSN can be formed just the same with the ran-dom keys as with the actual locations If more privacy

14 httpwwwclickzcomclickznews233699680-of-marketers-will-run-cross-channel-marketing-campaigns-in-2014-study15 See both the FTCrsquos and the European Commissionrsquos 2010ndash2011privacy-by-design reports and the comments therein along withProvost et al (2009) for a thorough treatment of this topic

is desired hashing can be done irreversibly many toone and in that case it becomes impossible to asso-ciate definitively any particular location with any par-ticular deviceuser

Finally an aspect of this research that is importantfor managers but not typically covered in the predic-tive modeling literature is the notion of increasing thereach of a campaign For example as discussed abovepossibly the most straightforward use of the GSN isto select the same actor on different mobile devicesThis would allow us to expand the reach of a ldquoretar-getingrdquo campaign16

Increasing reach has important subtleties in the cur-rent advertising ecosystemmdashit is not just the otherside of the coin of increasing lift The reason is thatthere are many targeters in the onlinemobile adver-tising ecosystem all of whom are using the samedata retargeting data and demographicgeographicpsychographic data that are purchased from third-party data providers However these data are avail-able only on a subset of devices Thus all partieswho are using these data are competing for the samesometimes small set of devices Since large advertis-ers typically contract with multiple targeting firmswho take different strategies the effect is that theadvertiser is paying the firms to compete against eachother effectively raising the price in the auction andthereby raising their own cost of advertising17 Whatis more these will be the same devices that also willbe targeted for other campaigns because they are thedevices for which the targeters have data

However by connecting devices in a GSN we canexpand campaigns to devices for which there are noretargeting or third-party data available at all If thereis less competition for these devices we should beable to target them for a lower price Thus expand-ing reach has an implication for the cost-effectivenessof achieving a certain level of predictive performance(eg lift)

AcknowledgmentsThe authors thank the anonymous reviewers and edi-tors for very constructive suggestions which substantiallyimproved the paper This work was done while the firstauthor was at Coriolis Ventures and Everyscreen MediaThe authors extend our gratitude to Everyscreen MediaDstillery Lauren Moores Tom OrsquoReilly Vinayak Javaly andTina Eliassi-Rad for many discussions about mobile adver-tising and mobile ad targeting as well as technical assistance

16 Retargeting is to target browsers who have previously purchasedfrom the brand or who have taken some other indicative brandaction such as browsing the brandrsquos site Retargeting is consid-ered by many to be one of the most effective targeting strategies(although to our knowledge these conclusions are drawn based onassessing conversion rate rather than the influence of the advertis-ing please see Stitelman et al 2011)17 The latter is our conjecture

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 265

to obtain all of the relevant data This particular data set wasnot necessarily used in the development of any productionmodel used for mobile advertising The first author thanksNEC and Andre Meyer for faculty fellowships The authorsthank the Moore and Sloan Foundations for their generoussupport of the Moore-Sloan Data Science Environment atNew York University

ReferencesAgarwal R Gupta AK Kraut RE (2008) Editorial overviewmdashThe

interplay between digital and social networks Inform SystemsRes 19(3)243ndash252

Aral S Muchnik L Sundararajan A (2009) Distinguishing influence-based contagion from homophily-driven diffusion in dynamicnetworks Proc National Acad Sci USA 106(51)21544ndash21549

Bampo M Ewing MT Mather DR Stewart D Wallace M (2008)The effects of the social structure of digital networks on viralmarketing performance Inform Systems Res 19(3)273ndash290

Cho E Myers SA Leskovec J (2011) Friendship and mobility Usermovement in location-based social networks Proc 17th ACMSIGKDD Internat Conf Knowledge Discovery and Data MiningKDD rsquo11 (ACM New York) 1082ndash1090

Cortes C Pregibon D Volinsky CT (2001) Communities of interestHoffmann F Hand DJ Adams N Fisher D Guimaraes G edsAdvances in Intelligent Data Analysis Lecture Notes ComputSci Vol 2189 (Springer-Verlag Berlin Heidelberg) 105ndash114

Crandall DJ Backstrom L Cosley D Suri S Huttenlocher DKleinberg J (2010) Inferring social ties from geographic coinci-dences Proc National Acad Sci USA 107(52)22436ndash22441

de Montjoye YA Hidalgo CA Verleysen M Blondel VD (2013)Unique in the crowd The privacy bounds of human mobilitySci Rep 31ndash5

European Commission (2014) Data protection day 2014 Full speedon EU data protection reform Press release January 27httpeuropaeurapidpress-release_MEMO-14-60_enhtm

Fawcett T (2006) An introduction to ROC analysis Pattern Recogni-tion Lett 27(8)861ndash874

Federal Trade Commission (2012) Protecting consumer pri-vacy in an era of rapid change FTC Report March 2012httpswwwftcgovsitesdefaultfilesdocumentsreportsfederal-trade-commission-report-protecting-consumer-privacy-era-rapid-change-recommendations120326privacyreportpdf

Heidemann J Klier M Probst F (2010) Identifying key users inonline social networks A pagerank based approach InformSystems J 4801(December)157ndash160

Hill S Provost F (2003) The myth of the double-blind reviewAuthor identification using only citations SIGKDD Explorations5(2)179ndash184

Hill S Provost F Volinsky C (2006) Network-based marketingIdentifying likely adopters via consumer networks Statist Sci22(2)256ndash276

Hotho A Nuumlrnberger A Paass G (2005) A brief survey of text min-ing LDV Forum 20(1)19ndash62

Kerho S (2012) Mobile marketingmdashA new analytics frameworkwhat we have and what we need Presentation Marketing onthe Move Conference Wharton School University of PennsylvaniaPhiladelphia February 28

Kossinets G Watts DJ (2009) Origins of homophily in an evolvingsocial network Amer J Sociol 115(2)405ndash450

Lee J (2014) 80 of marketers will run cross-channel marketingcampaigns in 2014 ClickZ (March 31) httpwwwclickzcomclickznews233699680-of-marketers-will-run-cross-channel-marketing-campaigns-in-2014-study

Martens D Provost F (2011) Pseudo-social network targeting fromconsumer transaction data Working paper CEDER-11-05 SternSchool of Business New York University New York

McPherson M Smith-Lovin L Cook JM (2001) Birds of afeather Homophily in social networks Annual Rev Sociol 27415ndash444

Oinas-Kukkonen H Lyytinen K Yoo Y (2010) Social networks andinformation systems Ongoing and future research streamsJ Assoc Inform Systems 11(2)61ndash68

Pan W Aharony N Pentland A (2011) Composite social networkfor predicting mobile apps installation Burgard W Roth Deds Proc AAAI 2011 (AAAI Press San Francisco)

Perlich C Dalessandro B Raeder T Stitelman O Provost F (2014)Machine learning for targeted display advertising Transferlearning in action Machine Learn 95(1)103ndash127

Pogue D (2011) Wrapping up the apple location brouhaha NewYork Times (April 28) httppogueblogsnytimescom20110428wrapping-up-the-apple-location-brouhaha

Provost F Fawcett T (2013) Data Science for Business What You Needto Know About Data Mining and Data-analytic Thinking (OrsquoReillyMedia Inc Sebastopol CA)

Provost F Dalessandro B Hook R Zhang X Murray A (2009) Audi-ence selection for on-line brand advertising Privacy-friendlysocial network targeting KDD rsquo09 Proc 15th ACM SIGKDDInternat Conf Knowledge Discovery and Data Mining (ACMNew York) 707ndash716

Pubmatic (2010) Understanding real-time bidding (RTB) from thepublisherrsquos perspective Technical report Pubmatic SiliconValley CA

Quercia D Lathia N Calabrese F Di Lorenzo G Crowcroft J (2010)Recommending social events from mobile phone location dataProc 2010 IEEE Internat Conf Data Mining ICDM rsquo10 (IEEEComputer Society Washington DC) 971ndash976

Raeder T Stitelman O Dalessandro B Perlich C Provost F (2012)Design principles of massive robust prediction systems Proc18th ACM SIGKDD Internat Conf Knowledge Discovery and DataMining (ACM New York) 1357ndash1365

Stitelman O Dalessandro B Perlich C Provost F (2011) Estimatingthe effect of online display advertising on browser conversionData Mining and Audience Intelligence for Advertising ADKDD2011 (ACM New York) 8

Stitelman O Perlich C Dalessandro B Hook R Raeder T Provost F(2013) Using co-visitation networks for detecting large scaleonline display advertising exchange fraud Proc 19th ACMSIGKDD Internat Conf Knowledge Discovery and Data Mining(ACM New York) 1240ndash1248

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Page 20: Finding Similar Mobile Consumers with a Privacy-Friendly ...pages.stern.nyu.edu/~fprovost/Papers/PMM_ISR_2015.pdf · Provost, Martens, and Murray: Finding Similar Mobile Consumers

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 261

Table 19 Rank Aggregation per Metric Over All Publishers (Across All RTBs)

Top 1 NN Top 10 NNs

1 Shared unique IPs 801 1 Shared Unique IPs 802252 BooleanmdashSharing an IP 80625 2 BooleanmdashSharing an IP 809133 Shared unique IPsmdashIDF weighted 70025 3 Shared unique IPsmdashIDF weighted 709754 Shared IP visitsmdashMin 70063 4 Shared IP visitsmdashMin 704135 Shared IP visitsmdashMax 7095 5 Shared IP visitsmdashMax 70256 Shared IP visitsmdashAvg 8005 6 Shared IP visitsmdashAvg 704887 Shared IP visitsmdashMinmdashIDF weighted 60625 7 Shared IP visitsmdashMinmdashIDF weighted 703758 Shared IP visitsmdashMaxmdashIDF weighted 70975 8 Shared IP visitsmdashMaxmdashIDF weighted 702259 Shared IP visitsmdashAvgmdashIDF weighted 70938 9 Shared IP visitsmdashAvgmdashIDF weighted 70525

10 Cosine similarity 80338 10 Cosine similarity 804511 Cosine similaritymdashIDF weighted 80213 11 Cosine similaritymdashIDF weighted 8086312 GSN Jaccard 80925 12 GSN Jaccard 913 GSN JaccardmdashIDF weighted 80325 13 GSN JaccardmdashIDF weighted 9013814 GSN Jaccard Freq 80425 14 GSN Jaccard Freq 80515 GSN Jaccard FreqmdashIDF weighted 80425 15 GSN Jaccard FreqmdashIDF weighted 805

to visualize the results Tables 20 and 21 show theaverage rank aggregation (see the discussion of rankaggregation above) in terms of lift per RTB and thep-values of a signed rank test respectively

When comparing the different temporal settingswe see that the ldquoDayrdquo variant performs best and lim-ited differences exist between the time periods of 2and 12 hours More important to observe from theseresults is that (overall) not including time in the GSNdesign outperforms all three temporal variants sub-stantially Only for RTB 3 and the Day version whereusers are connected only if they visit the same IPaddress on the same day are the lifts not signifi-cantly worse than having no time constraint at a 1level (although even here they are significantly worseat a 10 level) This shows that for finding userswith similar interests which locations they visit ismore important than when they visit them By includ-ing the time constraints connections between previ-ously connected users are lost simply because they

Figure 9 (Color online) Lifts of the Publishers (for All Four RTBs)Ranked According to the Median Lift Within the Category(Medians Shown by the Full Line)

visit locations in different time periods Although weuse time to define links metrics that include time todefine the strength of a link might improve the resultsfurther and we consider that an interesting issue forfuture research

8 ConclusionThis paper presents a new design for using geosim-ilarity to connect user instances in a geosimilarity-weighted network In the GSN users are connected ifthey share at least one observed location Various met-rics can be used to yield degrees of similarity We pre-sented intuitive arguments theory and prior researchthat suggest that geosimilarity and the GSN designshould link similar users We also provided strongempirical support Specifically Study 1 shows thatthe GSN indeed links different users instances cor-responding to the same individual and the geosimi-larity metrics rank user instances by their likelihoodof corresponding to the same individual Study 2shows that the GSN links users who have similarinterests as measured by their propensity to visitthe same publishersuse the same apps It also pro-vides strong evidence that the geosimilarity metricsrank user instances by their likelihood of havingsimilar interests Taken together the results providestrong support that the GSN links users with similarinterestsmdashin many cases because they actually repre-sent the same individuals

The main limitation of the research presented inthis paper is that the data are not sufficient to dis-tinguish whether the similarity in interests is solelybased on connecting instances of the same user Thisis an important avenue for future research This lim-itation notwithstanding the very strong results fromStudy 2 have important implications in either caseeither the network is indeed finding different indi-viduals with substantially similar interests or we

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers262 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

Figure 10 (Color online) Lifts Considering All Network Neighbors per RTB System (Each Panel) With and Without the Time Constraint Limited toOne Network (see sect71) for the 40 Publishers

RTB 3mdashAll NNs

0 5 10

Publisher

0

20

40

60

80

100

Lift

RTB 4mdashAll NNs

0 5 10

Publisher

0

200

400

600

800

1000

Lift

RTB 2mdashAll NNs

0 5 10

Publisher

0

200

400

600

800

1000

Lift

0 5 100

200

400

600

800

1000RTB 1mdashAll NNs

Publisher

Lift

All

TP2

TP12

Day

Note Lifts higher than 1000 are shown at 1000

Figure 11 (Color online) Lifts Considering the Closest 1 Network Neighbor per RTB System (Each Panel) With and Without the Time ConstraintLimited to One Network (see sect71) for the 40 Publishers

Lift

0

100

200

400

300

RTB 3mdashTop 1 NNs

0 5 10

Publisher

All

TP2

TP12

Day

0 5 100

200

400

600

800

1000RTB 1mdashTop 1 NNs

Publisher

Lift

RTB 2mdashTop 1 NNs

0 5 10

Publisher

0

200

400

600

800

1000

Lift

RTB 4mdashTop 1 NNs

0 5 10

Publisher

0

200

400

600

800

1000

Lift

Note Lifts higher than 1000 are shown at 1000

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 263

Figure 12 (Color online) Lifts Considering the Closest 10 Network Neighbors per RTB System (Each Panel) With and Without the Time ConstraintLimited to One Network (see sect71) for the 40 Publishers

RTB 2mdashTop 10 NNs

0 5 10

Publisher

0

200

400

600

800

1000

Lift

0 5 10

Publisher

Lift

0

50

100

150

0 5 10

Publisher

0

200

400

600

800

1000

Lift

0 5 100

200

400

600

800

1000

RTB 1mdashTop 10 NNs

RTB 3mdashTop 10 NNs RTB 4mdashTop 10 NNs

Publisher

Lift

All

TP2

TP12

Day

Note Lifts higher than 1000 are shown at 1000

have very strong additional support for the rankingresults of Study 1 as the neighbors with the strongestgeosimilarity have significantly higher affinity for thesame publishersapps

These results have broad and immediate manage-ment implications within our motivating applicationof mobile advertising As discussed above digitalmarketers would like to target the many fragmentedinstances of individuals in the digital advertising

Table 20 Rank Aggregations in Terms of Lift AveragedAcross All Publishers (per RTB)

All TP2 TP12 Day

RTB 1 101 209 309 201RTB 2 1 204 304 302RTB 3 101 3 4 109RTB 4 101 208 308 203

Note Linking based on time does not improve over linkingignoring time

Table 21 Evaluation of Including Time p-Values ofSigned Rank Test

All TP2 TP12 Day

RTB 1 mdash 0 0002 0 0002 0 0006RTB 2 mdash 0 0002 0 0002 0 0002RTB 3 mdash 0 0002 0 0002 00065RTB 4 mdash 0 0002 0 0002 0 0004

ecosystem as well as users who have similar intereststo particular ldquoseedrdquo users Location data such asIP addresses are readily available across the adver-tising ecosystem The most direct use of the GSN formobile ad targeting is to seed a targeting campaignwith users chosen to exhibit some characteristics ofinterest These could be chosen through any of themyriad of methods currently used in advertising TheGSN will directly allow the targeting of the othermembers of these usersrsquo geosimilarity cohorts whichwill include other instances of the same user othersimilar users and likely both So for example if a setof users has been identified to have brand affinity viavisits to a brandrsquos website the GSN cohorts of theseusers would provide an attractive avenue to expandthe reach of a campaign to target them (extendingtraditional retargeting) and also target users similarto the seeds (customer prospecting) In the researchdata set we see that different instances of the sameusers are very often connected in the GSN and thatthose connected in the GSN exhibit substantially sim-ilar interests

The GSN also could be used for (privacy-friendly)hyperlocal targetingmdashmeaning targeting people whofrequent a particular location without needing to storedata on the actual locations of the users as describedearlier IP addresses currently are used by somemarketers for coarse-grained demographic targetinginferring geography from IP address registration data

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers264 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

and then inferring demographics from the geographyThe GSN provides a complementary alternative it willconnect people who visit the same fine-grained loca-tions As described above in the couponing exampleif a campaign is seeded with customers of a localestablishment (eg via an online loyalty program)geosimilarity can target others who frequent the samelocations

An exciting potential future use of the GSN is theevaluation of marketing campaigns across differentchannels This has become an important challenge formobile advertizing (Lee 2014)14 For example I mightreceive an ad for a hot new product on my mobiledevice I may be interested in the product but Irsquomunlikely to buy this product on my mobile phoneRather Irsquoll buy the product using my laptop or PC Bylinking the screens of the same user we are now ableto provide a broader (and possibly much more robust)estimate of campaign success (eg conversion rate)across different channels by aggregating the successmetrics across GSN neighbors For example similar tohow a traditional ad campaign may runnot run anad in a controlled study across different cities agen-cies could targetnot target different (matched) usersand then look at the success rates in their respec-tive geosimilarity neighborhoods This should captureeffects of the same individual on different devices

As discussed at the outset advertisers need to becognizant of privacy concerns regarding the collec-tion storage and use of data This paper followswhat the FTC calls the ldquoprivacy by designrdquo approachBy design the technique does not need to store anydirect personal information about mobile users thereis no need for nonanonymized identifiers demo-graphics geographics psychographics etc In addi-tion the storage of indirect information about userscan be severely limited as well The method doesnot need the actual locationsmdashonly anonymized loca-tion ldquokeysrdquo So for example IP addresses can bereplaced with random numbers15 without affectingthe GSN performance (European Commission 2014Federal Trade Commission 2012) More technically atthe ldquoouter wallrdquo of the system or firm each deviceid can be irreversibly hashed to a random key Theonly requirement is that the same device be hashedto the same key if encountered again Similarly atthe ldquoouter wallrdquo of the system or firm every loca-tion also can be irreversibly hashed to a random keyThe GSN can be formed just the same with the ran-dom keys as with the actual locations If more privacy

14 httpwwwclickzcomclickznews233699680-of-marketers-will-run-cross-channel-marketing-campaigns-in-2014-study15 See both the FTCrsquos and the European Commissionrsquos 2010ndash2011privacy-by-design reports and the comments therein along withProvost et al (2009) for a thorough treatment of this topic

is desired hashing can be done irreversibly many toone and in that case it becomes impossible to asso-ciate definitively any particular location with any par-ticular deviceuser

Finally an aspect of this research that is importantfor managers but not typically covered in the predic-tive modeling literature is the notion of increasing thereach of a campaign For example as discussed abovepossibly the most straightforward use of the GSN isto select the same actor on different mobile devicesThis would allow us to expand the reach of a ldquoretar-getingrdquo campaign16

Increasing reach has important subtleties in the cur-rent advertising ecosystemmdashit is not just the otherside of the coin of increasing lift The reason is thatthere are many targeters in the onlinemobile adver-tising ecosystem all of whom are using the samedata retargeting data and demographicgeographicpsychographic data that are purchased from third-party data providers However these data are avail-able only on a subset of devices Thus all partieswho are using these data are competing for the samesometimes small set of devices Since large advertis-ers typically contract with multiple targeting firmswho take different strategies the effect is that theadvertiser is paying the firms to compete against eachother effectively raising the price in the auction andthereby raising their own cost of advertising17 Whatis more these will be the same devices that also willbe targeted for other campaigns because they are thedevices for which the targeters have data

However by connecting devices in a GSN we canexpand campaigns to devices for which there are noretargeting or third-party data available at all If thereis less competition for these devices we should beable to target them for a lower price Thus expand-ing reach has an implication for the cost-effectivenessof achieving a certain level of predictive performance(eg lift)

AcknowledgmentsThe authors thank the anonymous reviewers and edi-tors for very constructive suggestions which substantiallyimproved the paper This work was done while the firstauthor was at Coriolis Ventures and Everyscreen MediaThe authors extend our gratitude to Everyscreen MediaDstillery Lauren Moores Tom OrsquoReilly Vinayak Javaly andTina Eliassi-Rad for many discussions about mobile adver-tising and mobile ad targeting as well as technical assistance

16 Retargeting is to target browsers who have previously purchasedfrom the brand or who have taken some other indicative brandaction such as browsing the brandrsquos site Retargeting is consid-ered by many to be one of the most effective targeting strategies(although to our knowledge these conclusions are drawn based onassessing conversion rate rather than the influence of the advertis-ing please see Stitelman et al 2011)17 The latter is our conjecture

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 265

to obtain all of the relevant data This particular data set wasnot necessarily used in the development of any productionmodel used for mobile advertising The first author thanksNEC and Andre Meyer for faculty fellowships The authorsthank the Moore and Sloan Foundations for their generoussupport of the Moore-Sloan Data Science Environment atNew York University

ReferencesAgarwal R Gupta AK Kraut RE (2008) Editorial overviewmdashThe

interplay between digital and social networks Inform SystemsRes 19(3)243ndash252

Aral S Muchnik L Sundararajan A (2009) Distinguishing influence-based contagion from homophily-driven diffusion in dynamicnetworks Proc National Acad Sci USA 106(51)21544ndash21549

Bampo M Ewing MT Mather DR Stewart D Wallace M (2008)The effects of the social structure of digital networks on viralmarketing performance Inform Systems Res 19(3)273ndash290

Cho E Myers SA Leskovec J (2011) Friendship and mobility Usermovement in location-based social networks Proc 17th ACMSIGKDD Internat Conf Knowledge Discovery and Data MiningKDD rsquo11 (ACM New York) 1082ndash1090

Cortes C Pregibon D Volinsky CT (2001) Communities of interestHoffmann F Hand DJ Adams N Fisher D Guimaraes G edsAdvances in Intelligent Data Analysis Lecture Notes ComputSci Vol 2189 (Springer-Verlag Berlin Heidelberg) 105ndash114

Crandall DJ Backstrom L Cosley D Suri S Huttenlocher DKleinberg J (2010) Inferring social ties from geographic coinci-dences Proc National Acad Sci USA 107(52)22436ndash22441

de Montjoye YA Hidalgo CA Verleysen M Blondel VD (2013)Unique in the crowd The privacy bounds of human mobilitySci Rep 31ndash5

European Commission (2014) Data protection day 2014 Full speedon EU data protection reform Press release January 27httpeuropaeurapidpress-release_MEMO-14-60_enhtm

Fawcett T (2006) An introduction to ROC analysis Pattern Recogni-tion Lett 27(8)861ndash874

Federal Trade Commission (2012) Protecting consumer pri-vacy in an era of rapid change FTC Report March 2012httpswwwftcgovsitesdefaultfilesdocumentsreportsfederal-trade-commission-report-protecting-consumer-privacy-era-rapid-change-recommendations120326privacyreportpdf

Heidemann J Klier M Probst F (2010) Identifying key users inonline social networks A pagerank based approach InformSystems J 4801(December)157ndash160

Hill S Provost F (2003) The myth of the double-blind reviewAuthor identification using only citations SIGKDD Explorations5(2)179ndash184

Hill S Provost F Volinsky C (2006) Network-based marketingIdentifying likely adopters via consumer networks Statist Sci22(2)256ndash276

Hotho A Nuumlrnberger A Paass G (2005) A brief survey of text min-ing LDV Forum 20(1)19ndash62

Kerho S (2012) Mobile marketingmdashA new analytics frameworkwhat we have and what we need Presentation Marketing onthe Move Conference Wharton School University of PennsylvaniaPhiladelphia February 28

Kossinets G Watts DJ (2009) Origins of homophily in an evolvingsocial network Amer J Sociol 115(2)405ndash450

Lee J (2014) 80 of marketers will run cross-channel marketingcampaigns in 2014 ClickZ (March 31) httpwwwclickzcomclickznews233699680-of-marketers-will-run-cross-channel-marketing-campaigns-in-2014-study

Martens D Provost F (2011) Pseudo-social network targeting fromconsumer transaction data Working paper CEDER-11-05 SternSchool of Business New York University New York

McPherson M Smith-Lovin L Cook JM (2001) Birds of afeather Homophily in social networks Annual Rev Sociol 27415ndash444

Oinas-Kukkonen H Lyytinen K Yoo Y (2010) Social networks andinformation systems Ongoing and future research streamsJ Assoc Inform Systems 11(2)61ndash68

Pan W Aharony N Pentland A (2011) Composite social networkfor predicting mobile apps installation Burgard W Roth Deds Proc AAAI 2011 (AAAI Press San Francisco)

Perlich C Dalessandro B Raeder T Stitelman O Provost F (2014)Machine learning for targeted display advertising Transferlearning in action Machine Learn 95(1)103ndash127

Pogue D (2011) Wrapping up the apple location brouhaha NewYork Times (April 28) httppogueblogsnytimescom20110428wrapping-up-the-apple-location-brouhaha

Provost F Fawcett T (2013) Data Science for Business What You Needto Know About Data Mining and Data-analytic Thinking (OrsquoReillyMedia Inc Sebastopol CA)

Provost F Dalessandro B Hook R Zhang X Murray A (2009) Audi-ence selection for on-line brand advertising Privacy-friendlysocial network targeting KDD rsquo09 Proc 15th ACM SIGKDDInternat Conf Knowledge Discovery and Data Mining (ACMNew York) 707ndash716

Pubmatic (2010) Understanding real-time bidding (RTB) from thepublisherrsquos perspective Technical report Pubmatic SiliconValley CA

Quercia D Lathia N Calabrese F Di Lorenzo G Crowcroft J (2010)Recommending social events from mobile phone location dataProc 2010 IEEE Internat Conf Data Mining ICDM rsquo10 (IEEEComputer Society Washington DC) 971ndash976

Raeder T Stitelman O Dalessandro B Perlich C Provost F (2012)Design principles of massive robust prediction systems Proc18th ACM SIGKDD Internat Conf Knowledge Discovery and DataMining (ACM New York) 1357ndash1365

Stitelman O Dalessandro B Perlich C Provost F (2011) Estimatingthe effect of online display advertising on browser conversionData Mining and Audience Intelligence for Advertising ADKDD2011 (ACM New York) 8

Stitelman O Perlich C Dalessandro B Hook R Raeder T Provost F(2013) Using co-visitation networks for detecting large scaleonline display advertising exchange fraud Proc 19th ACMSIGKDD Internat Conf Knowledge Discovery and Data Mining(ACM New York) 1240ndash1248

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Page 21: Finding Similar Mobile Consumers with a Privacy-Friendly ...pages.stern.nyu.edu/~fprovost/Papers/PMM_ISR_2015.pdf · Provost, Martens, and Murray: Finding Similar Mobile Consumers

Provost Martens and Murray Finding Similar Mobile Consumers262 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

Figure 10 (Color online) Lifts Considering All Network Neighbors per RTB System (Each Panel) With and Without the Time Constraint Limited toOne Network (see sect71) for the 40 Publishers

RTB 3mdashAll NNs

0 5 10

Publisher

0

20

40

60

80

100

Lift

RTB 4mdashAll NNs

0 5 10

Publisher

0

200

400

600

800

1000

Lift

RTB 2mdashAll NNs

0 5 10

Publisher

0

200

400

600

800

1000

Lift

0 5 100

200

400

600

800

1000RTB 1mdashAll NNs

Publisher

Lift

All

TP2

TP12

Day

Note Lifts higher than 1000 are shown at 1000

Figure 11 (Color online) Lifts Considering the Closest 1 Network Neighbor per RTB System (Each Panel) With and Without the Time ConstraintLimited to One Network (see sect71) for the 40 Publishers

Lift

0

100

200

400

300

RTB 3mdashTop 1 NNs

0 5 10

Publisher

All

TP2

TP12

Day

0 5 100

200

400

600

800

1000RTB 1mdashTop 1 NNs

Publisher

Lift

RTB 2mdashTop 1 NNs

0 5 10

Publisher

0

200

400

600

800

1000

Lift

RTB 4mdashTop 1 NNs

0 5 10

Publisher

0

200

400

600

800

1000

Lift

Note Lifts higher than 1000 are shown at 1000

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 263

Figure 12 (Color online) Lifts Considering the Closest 10 Network Neighbors per RTB System (Each Panel) With and Without the Time ConstraintLimited to One Network (see sect71) for the 40 Publishers

RTB 2mdashTop 10 NNs

0 5 10

Publisher

0

200

400

600

800

1000

Lift

0 5 10

Publisher

Lift

0

50

100

150

0 5 10

Publisher

0

200

400

600

800

1000

Lift

0 5 100

200

400

600

800

1000

RTB 1mdashTop 10 NNs

RTB 3mdashTop 10 NNs RTB 4mdashTop 10 NNs

Publisher

Lift

All

TP2

TP12

Day

Note Lifts higher than 1000 are shown at 1000

have very strong additional support for the rankingresults of Study 1 as the neighbors with the strongestgeosimilarity have significantly higher affinity for thesame publishersapps

These results have broad and immediate manage-ment implications within our motivating applicationof mobile advertising As discussed above digitalmarketers would like to target the many fragmentedinstances of individuals in the digital advertising

Table 20 Rank Aggregations in Terms of Lift AveragedAcross All Publishers (per RTB)

All TP2 TP12 Day

RTB 1 101 209 309 201RTB 2 1 204 304 302RTB 3 101 3 4 109RTB 4 101 208 308 203

Note Linking based on time does not improve over linkingignoring time

Table 21 Evaluation of Including Time p-Values ofSigned Rank Test

All TP2 TP12 Day

RTB 1 mdash 0 0002 0 0002 0 0006RTB 2 mdash 0 0002 0 0002 0 0002RTB 3 mdash 0 0002 0 0002 00065RTB 4 mdash 0 0002 0 0002 0 0004

ecosystem as well as users who have similar intereststo particular ldquoseedrdquo users Location data such asIP addresses are readily available across the adver-tising ecosystem The most direct use of the GSN formobile ad targeting is to seed a targeting campaignwith users chosen to exhibit some characteristics ofinterest These could be chosen through any of themyriad of methods currently used in advertising TheGSN will directly allow the targeting of the othermembers of these usersrsquo geosimilarity cohorts whichwill include other instances of the same user othersimilar users and likely both So for example if a setof users has been identified to have brand affinity viavisits to a brandrsquos website the GSN cohorts of theseusers would provide an attractive avenue to expandthe reach of a campaign to target them (extendingtraditional retargeting) and also target users similarto the seeds (customer prospecting) In the researchdata set we see that different instances of the sameusers are very often connected in the GSN and thatthose connected in the GSN exhibit substantially sim-ilar interests

The GSN also could be used for (privacy-friendly)hyperlocal targetingmdashmeaning targeting people whofrequent a particular location without needing to storedata on the actual locations of the users as describedearlier IP addresses currently are used by somemarketers for coarse-grained demographic targetinginferring geography from IP address registration data

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers264 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

and then inferring demographics from the geographyThe GSN provides a complementary alternative it willconnect people who visit the same fine-grained loca-tions As described above in the couponing exampleif a campaign is seeded with customers of a localestablishment (eg via an online loyalty program)geosimilarity can target others who frequent the samelocations

An exciting potential future use of the GSN is theevaluation of marketing campaigns across differentchannels This has become an important challenge formobile advertizing (Lee 2014)14 For example I mightreceive an ad for a hot new product on my mobiledevice I may be interested in the product but Irsquomunlikely to buy this product on my mobile phoneRather Irsquoll buy the product using my laptop or PC Bylinking the screens of the same user we are now ableto provide a broader (and possibly much more robust)estimate of campaign success (eg conversion rate)across different channels by aggregating the successmetrics across GSN neighbors For example similar tohow a traditional ad campaign may runnot run anad in a controlled study across different cities agen-cies could targetnot target different (matched) usersand then look at the success rates in their respec-tive geosimilarity neighborhoods This should captureeffects of the same individual on different devices

As discussed at the outset advertisers need to becognizant of privacy concerns regarding the collec-tion storage and use of data This paper followswhat the FTC calls the ldquoprivacy by designrdquo approachBy design the technique does not need to store anydirect personal information about mobile users thereis no need for nonanonymized identifiers demo-graphics geographics psychographics etc In addi-tion the storage of indirect information about userscan be severely limited as well The method doesnot need the actual locationsmdashonly anonymized loca-tion ldquokeysrdquo So for example IP addresses can bereplaced with random numbers15 without affectingthe GSN performance (European Commission 2014Federal Trade Commission 2012) More technically atthe ldquoouter wallrdquo of the system or firm each deviceid can be irreversibly hashed to a random key Theonly requirement is that the same device be hashedto the same key if encountered again Similarly atthe ldquoouter wallrdquo of the system or firm every loca-tion also can be irreversibly hashed to a random keyThe GSN can be formed just the same with the ran-dom keys as with the actual locations If more privacy

14 httpwwwclickzcomclickznews233699680-of-marketers-will-run-cross-channel-marketing-campaigns-in-2014-study15 See both the FTCrsquos and the European Commissionrsquos 2010ndash2011privacy-by-design reports and the comments therein along withProvost et al (2009) for a thorough treatment of this topic

is desired hashing can be done irreversibly many toone and in that case it becomes impossible to asso-ciate definitively any particular location with any par-ticular deviceuser

Finally an aspect of this research that is importantfor managers but not typically covered in the predic-tive modeling literature is the notion of increasing thereach of a campaign For example as discussed abovepossibly the most straightforward use of the GSN isto select the same actor on different mobile devicesThis would allow us to expand the reach of a ldquoretar-getingrdquo campaign16

Increasing reach has important subtleties in the cur-rent advertising ecosystemmdashit is not just the otherside of the coin of increasing lift The reason is thatthere are many targeters in the onlinemobile adver-tising ecosystem all of whom are using the samedata retargeting data and demographicgeographicpsychographic data that are purchased from third-party data providers However these data are avail-able only on a subset of devices Thus all partieswho are using these data are competing for the samesometimes small set of devices Since large advertis-ers typically contract with multiple targeting firmswho take different strategies the effect is that theadvertiser is paying the firms to compete against eachother effectively raising the price in the auction andthereby raising their own cost of advertising17 Whatis more these will be the same devices that also willbe targeted for other campaigns because they are thedevices for which the targeters have data

However by connecting devices in a GSN we canexpand campaigns to devices for which there are noretargeting or third-party data available at all If thereis less competition for these devices we should beable to target them for a lower price Thus expand-ing reach has an implication for the cost-effectivenessof achieving a certain level of predictive performance(eg lift)

AcknowledgmentsThe authors thank the anonymous reviewers and edi-tors for very constructive suggestions which substantiallyimproved the paper This work was done while the firstauthor was at Coriolis Ventures and Everyscreen MediaThe authors extend our gratitude to Everyscreen MediaDstillery Lauren Moores Tom OrsquoReilly Vinayak Javaly andTina Eliassi-Rad for many discussions about mobile adver-tising and mobile ad targeting as well as technical assistance

16 Retargeting is to target browsers who have previously purchasedfrom the brand or who have taken some other indicative brandaction such as browsing the brandrsquos site Retargeting is consid-ered by many to be one of the most effective targeting strategies(although to our knowledge these conclusions are drawn based onassessing conversion rate rather than the influence of the advertis-ing please see Stitelman et al 2011)17 The latter is our conjecture

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 265

to obtain all of the relevant data This particular data set wasnot necessarily used in the development of any productionmodel used for mobile advertising The first author thanksNEC and Andre Meyer for faculty fellowships The authorsthank the Moore and Sloan Foundations for their generoussupport of the Moore-Sloan Data Science Environment atNew York University

ReferencesAgarwal R Gupta AK Kraut RE (2008) Editorial overviewmdashThe

interplay between digital and social networks Inform SystemsRes 19(3)243ndash252

Aral S Muchnik L Sundararajan A (2009) Distinguishing influence-based contagion from homophily-driven diffusion in dynamicnetworks Proc National Acad Sci USA 106(51)21544ndash21549

Bampo M Ewing MT Mather DR Stewart D Wallace M (2008)The effects of the social structure of digital networks on viralmarketing performance Inform Systems Res 19(3)273ndash290

Cho E Myers SA Leskovec J (2011) Friendship and mobility Usermovement in location-based social networks Proc 17th ACMSIGKDD Internat Conf Knowledge Discovery and Data MiningKDD rsquo11 (ACM New York) 1082ndash1090

Cortes C Pregibon D Volinsky CT (2001) Communities of interestHoffmann F Hand DJ Adams N Fisher D Guimaraes G edsAdvances in Intelligent Data Analysis Lecture Notes ComputSci Vol 2189 (Springer-Verlag Berlin Heidelberg) 105ndash114

Crandall DJ Backstrom L Cosley D Suri S Huttenlocher DKleinberg J (2010) Inferring social ties from geographic coinci-dences Proc National Acad Sci USA 107(52)22436ndash22441

de Montjoye YA Hidalgo CA Verleysen M Blondel VD (2013)Unique in the crowd The privacy bounds of human mobilitySci Rep 31ndash5

European Commission (2014) Data protection day 2014 Full speedon EU data protection reform Press release January 27httpeuropaeurapidpress-release_MEMO-14-60_enhtm

Fawcett T (2006) An introduction to ROC analysis Pattern Recogni-tion Lett 27(8)861ndash874

Federal Trade Commission (2012) Protecting consumer pri-vacy in an era of rapid change FTC Report March 2012httpswwwftcgovsitesdefaultfilesdocumentsreportsfederal-trade-commission-report-protecting-consumer-privacy-era-rapid-change-recommendations120326privacyreportpdf

Heidemann J Klier M Probst F (2010) Identifying key users inonline social networks A pagerank based approach InformSystems J 4801(December)157ndash160

Hill S Provost F (2003) The myth of the double-blind reviewAuthor identification using only citations SIGKDD Explorations5(2)179ndash184

Hill S Provost F Volinsky C (2006) Network-based marketingIdentifying likely adopters via consumer networks Statist Sci22(2)256ndash276

Hotho A Nuumlrnberger A Paass G (2005) A brief survey of text min-ing LDV Forum 20(1)19ndash62

Kerho S (2012) Mobile marketingmdashA new analytics frameworkwhat we have and what we need Presentation Marketing onthe Move Conference Wharton School University of PennsylvaniaPhiladelphia February 28

Kossinets G Watts DJ (2009) Origins of homophily in an evolvingsocial network Amer J Sociol 115(2)405ndash450

Lee J (2014) 80 of marketers will run cross-channel marketingcampaigns in 2014 ClickZ (March 31) httpwwwclickzcomclickznews233699680-of-marketers-will-run-cross-channel-marketing-campaigns-in-2014-study

Martens D Provost F (2011) Pseudo-social network targeting fromconsumer transaction data Working paper CEDER-11-05 SternSchool of Business New York University New York

McPherson M Smith-Lovin L Cook JM (2001) Birds of afeather Homophily in social networks Annual Rev Sociol 27415ndash444

Oinas-Kukkonen H Lyytinen K Yoo Y (2010) Social networks andinformation systems Ongoing and future research streamsJ Assoc Inform Systems 11(2)61ndash68

Pan W Aharony N Pentland A (2011) Composite social networkfor predicting mobile apps installation Burgard W Roth Deds Proc AAAI 2011 (AAAI Press San Francisco)

Perlich C Dalessandro B Raeder T Stitelman O Provost F (2014)Machine learning for targeted display advertising Transferlearning in action Machine Learn 95(1)103ndash127

Pogue D (2011) Wrapping up the apple location brouhaha NewYork Times (April 28) httppogueblogsnytimescom20110428wrapping-up-the-apple-location-brouhaha

Provost F Fawcett T (2013) Data Science for Business What You Needto Know About Data Mining and Data-analytic Thinking (OrsquoReillyMedia Inc Sebastopol CA)

Provost F Dalessandro B Hook R Zhang X Murray A (2009) Audi-ence selection for on-line brand advertising Privacy-friendlysocial network targeting KDD rsquo09 Proc 15th ACM SIGKDDInternat Conf Knowledge Discovery and Data Mining (ACMNew York) 707ndash716

Pubmatic (2010) Understanding real-time bidding (RTB) from thepublisherrsquos perspective Technical report Pubmatic SiliconValley CA

Quercia D Lathia N Calabrese F Di Lorenzo G Crowcroft J (2010)Recommending social events from mobile phone location dataProc 2010 IEEE Internat Conf Data Mining ICDM rsquo10 (IEEEComputer Society Washington DC) 971ndash976

Raeder T Stitelman O Dalessandro B Perlich C Provost F (2012)Design principles of massive robust prediction systems Proc18th ACM SIGKDD Internat Conf Knowledge Discovery and DataMining (ACM New York) 1357ndash1365

Stitelman O Dalessandro B Perlich C Provost F (2011) Estimatingthe effect of online display advertising on browser conversionData Mining and Audience Intelligence for Advertising ADKDD2011 (ACM New York) 8

Stitelman O Perlich C Dalessandro B Hook R Raeder T Provost F(2013) Using co-visitation networks for detecting large scaleonline display advertising exchange fraud Proc 19th ACMSIGKDD Internat Conf Knowledge Discovery and Data Mining(ACM New York) 1240ndash1248

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Page 22: Finding Similar Mobile Consumers with a Privacy-Friendly ...pages.stern.nyu.edu/~fprovost/Papers/PMM_ISR_2015.pdf · Provost, Martens, and Murray: Finding Similar Mobile Consumers

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 263

Figure 12 (Color online) Lifts Considering the Closest 10 Network Neighbors per RTB System (Each Panel) With and Without the Time ConstraintLimited to One Network (see sect71) for the 40 Publishers

RTB 2mdashTop 10 NNs

0 5 10

Publisher

0

200

400

600

800

1000

Lift

0 5 10

Publisher

Lift

0

50

100

150

0 5 10

Publisher

0

200

400

600

800

1000

Lift

0 5 100

200

400

600

800

1000

RTB 1mdashTop 10 NNs

RTB 3mdashTop 10 NNs RTB 4mdashTop 10 NNs

Publisher

Lift

All

TP2

TP12

Day

Note Lifts higher than 1000 are shown at 1000

have very strong additional support for the rankingresults of Study 1 as the neighbors with the strongestgeosimilarity have significantly higher affinity for thesame publishersapps

These results have broad and immediate manage-ment implications within our motivating applicationof mobile advertising As discussed above digitalmarketers would like to target the many fragmentedinstances of individuals in the digital advertising

Table 20 Rank Aggregations in Terms of Lift AveragedAcross All Publishers (per RTB)

All TP2 TP12 Day

RTB 1 101 209 309 201RTB 2 1 204 304 302RTB 3 101 3 4 109RTB 4 101 208 308 203

Note Linking based on time does not improve over linkingignoring time

Table 21 Evaluation of Including Time p-Values ofSigned Rank Test

All TP2 TP12 Day

RTB 1 mdash 0 0002 0 0002 0 0006RTB 2 mdash 0 0002 0 0002 0 0002RTB 3 mdash 0 0002 0 0002 00065RTB 4 mdash 0 0002 0 0002 0 0004

ecosystem as well as users who have similar intereststo particular ldquoseedrdquo users Location data such asIP addresses are readily available across the adver-tising ecosystem The most direct use of the GSN formobile ad targeting is to seed a targeting campaignwith users chosen to exhibit some characteristics ofinterest These could be chosen through any of themyriad of methods currently used in advertising TheGSN will directly allow the targeting of the othermembers of these usersrsquo geosimilarity cohorts whichwill include other instances of the same user othersimilar users and likely both So for example if a setof users has been identified to have brand affinity viavisits to a brandrsquos website the GSN cohorts of theseusers would provide an attractive avenue to expandthe reach of a campaign to target them (extendingtraditional retargeting) and also target users similarto the seeds (customer prospecting) In the researchdata set we see that different instances of the sameusers are very often connected in the GSN and thatthose connected in the GSN exhibit substantially sim-ilar interests

The GSN also could be used for (privacy-friendly)hyperlocal targetingmdashmeaning targeting people whofrequent a particular location without needing to storedata on the actual locations of the users as describedearlier IP addresses currently are used by somemarketers for coarse-grained demographic targetinginferring geography from IP address registration data

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile Consumers264 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

and then inferring demographics from the geographyThe GSN provides a complementary alternative it willconnect people who visit the same fine-grained loca-tions As described above in the couponing exampleif a campaign is seeded with customers of a localestablishment (eg via an online loyalty program)geosimilarity can target others who frequent the samelocations

An exciting potential future use of the GSN is theevaluation of marketing campaigns across differentchannels This has become an important challenge formobile advertizing (Lee 2014)14 For example I mightreceive an ad for a hot new product on my mobiledevice I may be interested in the product but Irsquomunlikely to buy this product on my mobile phoneRather Irsquoll buy the product using my laptop or PC Bylinking the screens of the same user we are now ableto provide a broader (and possibly much more robust)estimate of campaign success (eg conversion rate)across different channels by aggregating the successmetrics across GSN neighbors For example similar tohow a traditional ad campaign may runnot run anad in a controlled study across different cities agen-cies could targetnot target different (matched) usersand then look at the success rates in their respec-tive geosimilarity neighborhoods This should captureeffects of the same individual on different devices

As discussed at the outset advertisers need to becognizant of privacy concerns regarding the collec-tion storage and use of data This paper followswhat the FTC calls the ldquoprivacy by designrdquo approachBy design the technique does not need to store anydirect personal information about mobile users thereis no need for nonanonymized identifiers demo-graphics geographics psychographics etc In addi-tion the storage of indirect information about userscan be severely limited as well The method doesnot need the actual locationsmdashonly anonymized loca-tion ldquokeysrdquo So for example IP addresses can bereplaced with random numbers15 without affectingthe GSN performance (European Commission 2014Federal Trade Commission 2012) More technically atthe ldquoouter wallrdquo of the system or firm each deviceid can be irreversibly hashed to a random key Theonly requirement is that the same device be hashedto the same key if encountered again Similarly atthe ldquoouter wallrdquo of the system or firm every loca-tion also can be irreversibly hashed to a random keyThe GSN can be formed just the same with the ran-dom keys as with the actual locations If more privacy

14 httpwwwclickzcomclickznews233699680-of-marketers-will-run-cross-channel-marketing-campaigns-in-2014-study15 See both the FTCrsquos and the European Commissionrsquos 2010ndash2011privacy-by-design reports and the comments therein along withProvost et al (2009) for a thorough treatment of this topic

is desired hashing can be done irreversibly many toone and in that case it becomes impossible to asso-ciate definitively any particular location with any par-ticular deviceuser

Finally an aspect of this research that is importantfor managers but not typically covered in the predic-tive modeling literature is the notion of increasing thereach of a campaign For example as discussed abovepossibly the most straightforward use of the GSN isto select the same actor on different mobile devicesThis would allow us to expand the reach of a ldquoretar-getingrdquo campaign16

Increasing reach has important subtleties in the cur-rent advertising ecosystemmdashit is not just the otherside of the coin of increasing lift The reason is thatthere are many targeters in the onlinemobile adver-tising ecosystem all of whom are using the samedata retargeting data and demographicgeographicpsychographic data that are purchased from third-party data providers However these data are avail-able only on a subset of devices Thus all partieswho are using these data are competing for the samesometimes small set of devices Since large advertis-ers typically contract with multiple targeting firmswho take different strategies the effect is that theadvertiser is paying the firms to compete against eachother effectively raising the price in the auction andthereby raising their own cost of advertising17 Whatis more these will be the same devices that also willbe targeted for other campaigns because they are thedevices for which the targeters have data

However by connecting devices in a GSN we canexpand campaigns to devices for which there are noretargeting or third-party data available at all If thereis less competition for these devices we should beable to target them for a lower price Thus expand-ing reach has an implication for the cost-effectivenessof achieving a certain level of predictive performance(eg lift)

AcknowledgmentsThe authors thank the anonymous reviewers and edi-tors for very constructive suggestions which substantiallyimproved the paper This work was done while the firstauthor was at Coriolis Ventures and Everyscreen MediaThe authors extend our gratitude to Everyscreen MediaDstillery Lauren Moores Tom OrsquoReilly Vinayak Javaly andTina Eliassi-Rad for many discussions about mobile adver-tising and mobile ad targeting as well as technical assistance

16 Retargeting is to target browsers who have previously purchasedfrom the brand or who have taken some other indicative brandaction such as browsing the brandrsquos site Retargeting is consid-ered by many to be one of the most effective targeting strategies(although to our knowledge these conclusions are drawn based onassessing conversion rate rather than the influence of the advertis-ing please see Stitelman et al 2011)17 The latter is our conjecture

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 265

to obtain all of the relevant data This particular data set wasnot necessarily used in the development of any productionmodel used for mobile advertising The first author thanksNEC and Andre Meyer for faculty fellowships The authorsthank the Moore and Sloan Foundations for their generoussupport of the Moore-Sloan Data Science Environment atNew York University

ReferencesAgarwal R Gupta AK Kraut RE (2008) Editorial overviewmdashThe

interplay between digital and social networks Inform SystemsRes 19(3)243ndash252

Aral S Muchnik L Sundararajan A (2009) Distinguishing influence-based contagion from homophily-driven diffusion in dynamicnetworks Proc National Acad Sci USA 106(51)21544ndash21549

Bampo M Ewing MT Mather DR Stewart D Wallace M (2008)The effects of the social structure of digital networks on viralmarketing performance Inform Systems Res 19(3)273ndash290

Cho E Myers SA Leskovec J (2011) Friendship and mobility Usermovement in location-based social networks Proc 17th ACMSIGKDD Internat Conf Knowledge Discovery and Data MiningKDD rsquo11 (ACM New York) 1082ndash1090

Cortes C Pregibon D Volinsky CT (2001) Communities of interestHoffmann F Hand DJ Adams N Fisher D Guimaraes G edsAdvances in Intelligent Data Analysis Lecture Notes ComputSci Vol 2189 (Springer-Verlag Berlin Heidelberg) 105ndash114

Crandall DJ Backstrom L Cosley D Suri S Huttenlocher DKleinberg J (2010) Inferring social ties from geographic coinci-dences Proc National Acad Sci USA 107(52)22436ndash22441

de Montjoye YA Hidalgo CA Verleysen M Blondel VD (2013)Unique in the crowd The privacy bounds of human mobilitySci Rep 31ndash5

European Commission (2014) Data protection day 2014 Full speedon EU data protection reform Press release January 27httpeuropaeurapidpress-release_MEMO-14-60_enhtm

Fawcett T (2006) An introduction to ROC analysis Pattern Recogni-tion Lett 27(8)861ndash874

Federal Trade Commission (2012) Protecting consumer pri-vacy in an era of rapid change FTC Report March 2012httpswwwftcgovsitesdefaultfilesdocumentsreportsfederal-trade-commission-report-protecting-consumer-privacy-era-rapid-change-recommendations120326privacyreportpdf

Heidemann J Klier M Probst F (2010) Identifying key users inonline social networks A pagerank based approach InformSystems J 4801(December)157ndash160

Hill S Provost F (2003) The myth of the double-blind reviewAuthor identification using only citations SIGKDD Explorations5(2)179ndash184

Hill S Provost F Volinsky C (2006) Network-based marketingIdentifying likely adopters via consumer networks Statist Sci22(2)256ndash276

Hotho A Nuumlrnberger A Paass G (2005) A brief survey of text min-ing LDV Forum 20(1)19ndash62

Kerho S (2012) Mobile marketingmdashA new analytics frameworkwhat we have and what we need Presentation Marketing onthe Move Conference Wharton School University of PennsylvaniaPhiladelphia February 28

Kossinets G Watts DJ (2009) Origins of homophily in an evolvingsocial network Amer J Sociol 115(2)405ndash450

Lee J (2014) 80 of marketers will run cross-channel marketingcampaigns in 2014 ClickZ (March 31) httpwwwclickzcomclickznews233699680-of-marketers-will-run-cross-channel-marketing-campaigns-in-2014-study

Martens D Provost F (2011) Pseudo-social network targeting fromconsumer transaction data Working paper CEDER-11-05 SternSchool of Business New York University New York

McPherson M Smith-Lovin L Cook JM (2001) Birds of afeather Homophily in social networks Annual Rev Sociol 27415ndash444

Oinas-Kukkonen H Lyytinen K Yoo Y (2010) Social networks andinformation systems Ongoing and future research streamsJ Assoc Inform Systems 11(2)61ndash68

Pan W Aharony N Pentland A (2011) Composite social networkfor predicting mobile apps installation Burgard W Roth Deds Proc AAAI 2011 (AAAI Press San Francisco)

Perlich C Dalessandro B Raeder T Stitelman O Provost F (2014)Machine learning for targeted display advertising Transferlearning in action Machine Learn 95(1)103ndash127

Pogue D (2011) Wrapping up the apple location brouhaha NewYork Times (April 28) httppogueblogsnytimescom20110428wrapping-up-the-apple-location-brouhaha

Provost F Fawcett T (2013) Data Science for Business What You Needto Know About Data Mining and Data-analytic Thinking (OrsquoReillyMedia Inc Sebastopol CA)

Provost F Dalessandro B Hook R Zhang X Murray A (2009) Audi-ence selection for on-line brand advertising Privacy-friendlysocial network targeting KDD rsquo09 Proc 15th ACM SIGKDDInternat Conf Knowledge Discovery and Data Mining (ACMNew York) 707ndash716

Pubmatic (2010) Understanding real-time bidding (RTB) from thepublisherrsquos perspective Technical report Pubmatic SiliconValley CA

Quercia D Lathia N Calabrese F Di Lorenzo G Crowcroft J (2010)Recommending social events from mobile phone location dataProc 2010 IEEE Internat Conf Data Mining ICDM rsquo10 (IEEEComputer Society Washington DC) 971ndash976

Raeder T Stitelman O Dalessandro B Perlich C Provost F (2012)Design principles of massive robust prediction systems Proc18th ACM SIGKDD Internat Conf Knowledge Discovery and DataMining (ACM New York) 1357ndash1365

Stitelman O Dalessandro B Perlich C Provost F (2011) Estimatingthe effect of online display advertising on browser conversionData Mining and Audience Intelligence for Advertising ADKDD2011 (ACM New York) 8

Stitelman O Perlich C Dalessandro B Hook R Raeder T Provost F(2013) Using co-visitation networks for detecting large scaleonline display advertising exchange fraud Proc 19th ACMSIGKDD Internat Conf Knowledge Discovery and Data Mining(ACM New York) 1240ndash1248

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Page 23: Finding Similar Mobile Consumers with a Privacy-Friendly ...pages.stern.nyu.edu/~fprovost/Papers/PMM_ISR_2015.pdf · Provost, Martens, and Murray: Finding Similar Mobile Consumers

Provost Martens and Murray Finding Similar Mobile Consumers264 Information Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS

and then inferring demographics from the geographyThe GSN provides a complementary alternative it willconnect people who visit the same fine-grained loca-tions As described above in the couponing exampleif a campaign is seeded with customers of a localestablishment (eg via an online loyalty program)geosimilarity can target others who frequent the samelocations

An exciting potential future use of the GSN is theevaluation of marketing campaigns across differentchannels This has become an important challenge formobile advertizing (Lee 2014)14 For example I mightreceive an ad for a hot new product on my mobiledevice I may be interested in the product but Irsquomunlikely to buy this product on my mobile phoneRather Irsquoll buy the product using my laptop or PC Bylinking the screens of the same user we are now ableto provide a broader (and possibly much more robust)estimate of campaign success (eg conversion rate)across different channels by aggregating the successmetrics across GSN neighbors For example similar tohow a traditional ad campaign may runnot run anad in a controlled study across different cities agen-cies could targetnot target different (matched) usersand then look at the success rates in their respec-tive geosimilarity neighborhoods This should captureeffects of the same individual on different devices

As discussed at the outset advertisers need to becognizant of privacy concerns regarding the collec-tion storage and use of data This paper followswhat the FTC calls the ldquoprivacy by designrdquo approachBy design the technique does not need to store anydirect personal information about mobile users thereis no need for nonanonymized identifiers demo-graphics geographics psychographics etc In addi-tion the storage of indirect information about userscan be severely limited as well The method doesnot need the actual locationsmdashonly anonymized loca-tion ldquokeysrdquo So for example IP addresses can bereplaced with random numbers15 without affectingthe GSN performance (European Commission 2014Federal Trade Commission 2012) More technically atthe ldquoouter wallrdquo of the system or firm each deviceid can be irreversibly hashed to a random key Theonly requirement is that the same device be hashedto the same key if encountered again Similarly atthe ldquoouter wallrdquo of the system or firm every loca-tion also can be irreversibly hashed to a random keyThe GSN can be formed just the same with the ran-dom keys as with the actual locations If more privacy

14 httpwwwclickzcomclickznews233699680-of-marketers-will-run-cross-channel-marketing-campaigns-in-2014-study15 See both the FTCrsquos and the European Commissionrsquos 2010ndash2011privacy-by-design reports and the comments therein along withProvost et al (2009) for a thorough treatment of this topic

is desired hashing can be done irreversibly many toone and in that case it becomes impossible to asso-ciate definitively any particular location with any par-ticular deviceuser

Finally an aspect of this research that is importantfor managers but not typically covered in the predic-tive modeling literature is the notion of increasing thereach of a campaign For example as discussed abovepossibly the most straightforward use of the GSN isto select the same actor on different mobile devicesThis would allow us to expand the reach of a ldquoretar-getingrdquo campaign16

Increasing reach has important subtleties in the cur-rent advertising ecosystemmdashit is not just the otherside of the coin of increasing lift The reason is thatthere are many targeters in the onlinemobile adver-tising ecosystem all of whom are using the samedata retargeting data and demographicgeographicpsychographic data that are purchased from third-party data providers However these data are avail-able only on a subset of devices Thus all partieswho are using these data are competing for the samesometimes small set of devices Since large advertis-ers typically contract with multiple targeting firmswho take different strategies the effect is that theadvertiser is paying the firms to compete against eachother effectively raising the price in the auction andthereby raising their own cost of advertising17 Whatis more these will be the same devices that also willbe targeted for other campaigns because they are thedevices for which the targeters have data

However by connecting devices in a GSN we canexpand campaigns to devices for which there are noretargeting or third-party data available at all If thereis less competition for these devices we should beable to target them for a lower price Thus expand-ing reach has an implication for the cost-effectivenessof achieving a certain level of predictive performance(eg lift)

AcknowledgmentsThe authors thank the anonymous reviewers and edi-tors for very constructive suggestions which substantiallyimproved the paper This work was done while the firstauthor was at Coriolis Ventures and Everyscreen MediaThe authors extend our gratitude to Everyscreen MediaDstillery Lauren Moores Tom OrsquoReilly Vinayak Javaly andTina Eliassi-Rad for many discussions about mobile adver-tising and mobile ad targeting as well as technical assistance

16 Retargeting is to target browsers who have previously purchasedfrom the brand or who have taken some other indicative brandaction such as browsing the brandrsquos site Retargeting is consid-ered by many to be one of the most effective targeting strategies(although to our knowledge these conclusions are drawn based onassessing conversion rate rather than the influence of the advertis-ing please see Stitelman et al 2011)17 The latter is our conjecture

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 265

to obtain all of the relevant data This particular data set wasnot necessarily used in the development of any productionmodel used for mobile advertising The first author thanksNEC and Andre Meyer for faculty fellowships The authorsthank the Moore and Sloan Foundations for their generoussupport of the Moore-Sloan Data Science Environment atNew York University

ReferencesAgarwal R Gupta AK Kraut RE (2008) Editorial overviewmdashThe

interplay between digital and social networks Inform SystemsRes 19(3)243ndash252

Aral S Muchnik L Sundararajan A (2009) Distinguishing influence-based contagion from homophily-driven diffusion in dynamicnetworks Proc National Acad Sci USA 106(51)21544ndash21549

Bampo M Ewing MT Mather DR Stewart D Wallace M (2008)The effects of the social structure of digital networks on viralmarketing performance Inform Systems Res 19(3)273ndash290

Cho E Myers SA Leskovec J (2011) Friendship and mobility Usermovement in location-based social networks Proc 17th ACMSIGKDD Internat Conf Knowledge Discovery and Data MiningKDD rsquo11 (ACM New York) 1082ndash1090

Cortes C Pregibon D Volinsky CT (2001) Communities of interestHoffmann F Hand DJ Adams N Fisher D Guimaraes G edsAdvances in Intelligent Data Analysis Lecture Notes ComputSci Vol 2189 (Springer-Verlag Berlin Heidelberg) 105ndash114

Crandall DJ Backstrom L Cosley D Suri S Huttenlocher DKleinberg J (2010) Inferring social ties from geographic coinci-dences Proc National Acad Sci USA 107(52)22436ndash22441

de Montjoye YA Hidalgo CA Verleysen M Blondel VD (2013)Unique in the crowd The privacy bounds of human mobilitySci Rep 31ndash5

European Commission (2014) Data protection day 2014 Full speedon EU data protection reform Press release January 27httpeuropaeurapidpress-release_MEMO-14-60_enhtm

Fawcett T (2006) An introduction to ROC analysis Pattern Recogni-tion Lett 27(8)861ndash874

Federal Trade Commission (2012) Protecting consumer pri-vacy in an era of rapid change FTC Report March 2012httpswwwftcgovsitesdefaultfilesdocumentsreportsfederal-trade-commission-report-protecting-consumer-privacy-era-rapid-change-recommendations120326privacyreportpdf

Heidemann J Klier M Probst F (2010) Identifying key users inonline social networks A pagerank based approach InformSystems J 4801(December)157ndash160

Hill S Provost F (2003) The myth of the double-blind reviewAuthor identification using only citations SIGKDD Explorations5(2)179ndash184

Hill S Provost F Volinsky C (2006) Network-based marketingIdentifying likely adopters via consumer networks Statist Sci22(2)256ndash276

Hotho A Nuumlrnberger A Paass G (2005) A brief survey of text min-ing LDV Forum 20(1)19ndash62

Kerho S (2012) Mobile marketingmdashA new analytics frameworkwhat we have and what we need Presentation Marketing onthe Move Conference Wharton School University of PennsylvaniaPhiladelphia February 28

Kossinets G Watts DJ (2009) Origins of homophily in an evolvingsocial network Amer J Sociol 115(2)405ndash450

Lee J (2014) 80 of marketers will run cross-channel marketingcampaigns in 2014 ClickZ (March 31) httpwwwclickzcomclickznews233699680-of-marketers-will-run-cross-channel-marketing-campaigns-in-2014-study

Martens D Provost F (2011) Pseudo-social network targeting fromconsumer transaction data Working paper CEDER-11-05 SternSchool of Business New York University New York

McPherson M Smith-Lovin L Cook JM (2001) Birds of afeather Homophily in social networks Annual Rev Sociol 27415ndash444

Oinas-Kukkonen H Lyytinen K Yoo Y (2010) Social networks andinformation systems Ongoing and future research streamsJ Assoc Inform Systems 11(2)61ndash68

Pan W Aharony N Pentland A (2011) Composite social networkfor predicting mobile apps installation Burgard W Roth Deds Proc AAAI 2011 (AAAI Press San Francisco)

Perlich C Dalessandro B Raeder T Stitelman O Provost F (2014)Machine learning for targeted display advertising Transferlearning in action Machine Learn 95(1)103ndash127

Pogue D (2011) Wrapping up the apple location brouhaha NewYork Times (April 28) httppogueblogsnytimescom20110428wrapping-up-the-apple-location-brouhaha

Provost F Fawcett T (2013) Data Science for Business What You Needto Know About Data Mining and Data-analytic Thinking (OrsquoReillyMedia Inc Sebastopol CA)

Provost F Dalessandro B Hook R Zhang X Murray A (2009) Audi-ence selection for on-line brand advertising Privacy-friendlysocial network targeting KDD rsquo09 Proc 15th ACM SIGKDDInternat Conf Knowledge Discovery and Data Mining (ACMNew York) 707ndash716

Pubmatic (2010) Understanding real-time bidding (RTB) from thepublisherrsquos perspective Technical report Pubmatic SiliconValley CA

Quercia D Lathia N Calabrese F Di Lorenzo G Crowcroft J (2010)Recommending social events from mobile phone location dataProc 2010 IEEE Internat Conf Data Mining ICDM rsquo10 (IEEEComputer Society Washington DC) 971ndash976

Raeder T Stitelman O Dalessandro B Perlich C Provost F (2012)Design principles of massive robust prediction systems Proc18th ACM SIGKDD Internat Conf Knowledge Discovery and DataMining (ACM New York) 1357ndash1365

Stitelman O Dalessandro B Perlich C Provost F (2011) Estimatingthe effect of online display advertising on browser conversionData Mining and Audience Intelligence for Advertising ADKDD2011 (ACM New York) 8

Stitelman O Perlich C Dalessandro B Hook R Raeder T Provost F(2013) Using co-visitation networks for detecting large scaleonline display advertising exchange fraud Proc 19th ACMSIGKDD Internat Conf Knowledge Discovery and Data Mining(ACM New York) 1240ndash1248

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved

Page 24: Finding Similar Mobile Consumers with a Privacy-Friendly ...pages.stern.nyu.edu/~fprovost/Papers/PMM_ISR_2015.pdf · Provost, Martens, and Murray: Finding Similar Mobile Consumers

Provost Martens and Murray Finding Similar Mobile ConsumersInformation Systems Research 26(2) pp 243ndash265 copy 2015 INFORMS 265

to obtain all of the relevant data This particular data set wasnot necessarily used in the development of any productionmodel used for mobile advertising The first author thanksNEC and Andre Meyer for faculty fellowships The authorsthank the Moore and Sloan Foundations for their generoussupport of the Moore-Sloan Data Science Environment atNew York University

ReferencesAgarwal R Gupta AK Kraut RE (2008) Editorial overviewmdashThe

interplay between digital and social networks Inform SystemsRes 19(3)243ndash252

Aral S Muchnik L Sundararajan A (2009) Distinguishing influence-based contagion from homophily-driven diffusion in dynamicnetworks Proc National Acad Sci USA 106(51)21544ndash21549

Bampo M Ewing MT Mather DR Stewart D Wallace M (2008)The effects of the social structure of digital networks on viralmarketing performance Inform Systems Res 19(3)273ndash290

Cho E Myers SA Leskovec J (2011) Friendship and mobility Usermovement in location-based social networks Proc 17th ACMSIGKDD Internat Conf Knowledge Discovery and Data MiningKDD rsquo11 (ACM New York) 1082ndash1090

Cortes C Pregibon D Volinsky CT (2001) Communities of interestHoffmann F Hand DJ Adams N Fisher D Guimaraes G edsAdvances in Intelligent Data Analysis Lecture Notes ComputSci Vol 2189 (Springer-Verlag Berlin Heidelberg) 105ndash114

Crandall DJ Backstrom L Cosley D Suri S Huttenlocher DKleinberg J (2010) Inferring social ties from geographic coinci-dences Proc National Acad Sci USA 107(52)22436ndash22441

de Montjoye YA Hidalgo CA Verleysen M Blondel VD (2013)Unique in the crowd The privacy bounds of human mobilitySci Rep 31ndash5

European Commission (2014) Data protection day 2014 Full speedon EU data protection reform Press release January 27httpeuropaeurapidpress-release_MEMO-14-60_enhtm

Fawcett T (2006) An introduction to ROC analysis Pattern Recogni-tion Lett 27(8)861ndash874

Federal Trade Commission (2012) Protecting consumer pri-vacy in an era of rapid change FTC Report March 2012httpswwwftcgovsitesdefaultfilesdocumentsreportsfederal-trade-commission-report-protecting-consumer-privacy-era-rapid-change-recommendations120326privacyreportpdf

Heidemann J Klier M Probst F (2010) Identifying key users inonline social networks A pagerank based approach InformSystems J 4801(December)157ndash160

Hill S Provost F (2003) The myth of the double-blind reviewAuthor identification using only citations SIGKDD Explorations5(2)179ndash184

Hill S Provost F Volinsky C (2006) Network-based marketingIdentifying likely adopters via consumer networks Statist Sci22(2)256ndash276

Hotho A Nuumlrnberger A Paass G (2005) A brief survey of text min-ing LDV Forum 20(1)19ndash62

Kerho S (2012) Mobile marketingmdashA new analytics frameworkwhat we have and what we need Presentation Marketing onthe Move Conference Wharton School University of PennsylvaniaPhiladelphia February 28

Kossinets G Watts DJ (2009) Origins of homophily in an evolvingsocial network Amer J Sociol 115(2)405ndash450

Lee J (2014) 80 of marketers will run cross-channel marketingcampaigns in 2014 ClickZ (March 31) httpwwwclickzcomclickznews233699680-of-marketers-will-run-cross-channel-marketing-campaigns-in-2014-study

Martens D Provost F (2011) Pseudo-social network targeting fromconsumer transaction data Working paper CEDER-11-05 SternSchool of Business New York University New York

McPherson M Smith-Lovin L Cook JM (2001) Birds of afeather Homophily in social networks Annual Rev Sociol 27415ndash444

Oinas-Kukkonen H Lyytinen K Yoo Y (2010) Social networks andinformation systems Ongoing and future research streamsJ Assoc Inform Systems 11(2)61ndash68

Pan W Aharony N Pentland A (2011) Composite social networkfor predicting mobile apps installation Burgard W Roth Deds Proc AAAI 2011 (AAAI Press San Francisco)

Perlich C Dalessandro B Raeder T Stitelman O Provost F (2014)Machine learning for targeted display advertising Transferlearning in action Machine Learn 95(1)103ndash127

Pogue D (2011) Wrapping up the apple location brouhaha NewYork Times (April 28) httppogueblogsnytimescom20110428wrapping-up-the-apple-location-brouhaha

Provost F Fawcett T (2013) Data Science for Business What You Needto Know About Data Mining and Data-analytic Thinking (OrsquoReillyMedia Inc Sebastopol CA)

Provost F Dalessandro B Hook R Zhang X Murray A (2009) Audi-ence selection for on-line brand advertising Privacy-friendlysocial network targeting KDD rsquo09 Proc 15th ACM SIGKDDInternat Conf Knowledge Discovery and Data Mining (ACMNew York) 707ndash716

Pubmatic (2010) Understanding real-time bidding (RTB) from thepublisherrsquos perspective Technical report Pubmatic SiliconValley CA

Quercia D Lathia N Calabrese F Di Lorenzo G Crowcroft J (2010)Recommending social events from mobile phone location dataProc 2010 IEEE Internat Conf Data Mining ICDM rsquo10 (IEEEComputer Society Washington DC) 971ndash976

Raeder T Stitelman O Dalessandro B Perlich C Provost F (2012)Design principles of massive robust prediction systems Proc18th ACM SIGKDD Internat Conf Knowledge Discovery and DataMining (ACM New York) 1357ndash1365

Stitelman O Dalessandro B Perlich C Provost F (2011) Estimatingthe effect of online display advertising on browser conversionData Mining and Audience Intelligence for Advertising ADKDD2011 (ACM New York) 8

Stitelman O Perlich C Dalessandro B Hook R Raeder T Provost F(2013) Using co-visitation networks for detecting large scaleonline display advertising exchange fraud Proc 19th ACMSIGKDD Internat Conf Knowledge Discovery and Data Mining(ACM New York) 1240ndash1248

Dow

nloa

ded

from

info

rms

org

by [

128

122

185

232]

on

10 N

ovem

ber

2015

at 1

410

Fo

r pe

rson

al u

se o

nly

all

righ

ts r

eser

ved