Final PrivacyCon Webcast Deck (without notes)


Post on 18-Jul-2020


TRANSCRIPT

Page 1: Final PrivacyCon Webcast Deck (without notes) · • Data brokers can tell when you're sick, tired and depressed -- and sell this information. [CNN ’14] • Google Apps for Ed used

Remarks
Commissioner Julie Brill

Session 3: Big Data and Algorithms: Transparency Tools Revealing Data Discrimination

Michael Carl Tschantz
University of California, Berkeley

Anupam Datta
Carnegie Mellon University

Automated Experiments on Ad Privacy Settings

Co-author: Amit Datta (Carnegie Mellon University)

AdFisher
Information Flow Experiments on Ad Privacy Settings

Michael Carl Tschantz
International Computer Science Institute

Anupam Datta
Carnegie Mellon University

Joint work with Amit Datta, CMU


[Figure: the ad ecosystem. Labels: Web browsing, Advertisements, Ad settings, Inferences, Edits, Ad ecosystem.]

AdFisher

[Figure: experimental design. Labels: Experimental group, Control group; Experimental treatment, Control treatment; Ad Ecosystem; Measurements; Significance testing: Is there a difference?; P-value.]

Contribution: The rigor of experimental science
• Causal effects
• Statistical significance
• Automation
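The significance-testing step can be illustrated with a permutation test. This is a minimal sketch, not AdFisher's implementation: it assumes each browser agent yields one numeric measurement (for example, how often a given ad was seen) and uses a difference in group means as the test statistic. The function name and the sample counts are hypothetical.

```python
import random

def permutation_p_value(experimental, control, rounds=10000, seed=0):
    """Estimate a one-sided p-value for mean(experimental) > mean(control)."""
    rng = random.Random(seed)
    observed = sum(experimental) / len(experimental) - sum(control) / len(control)
    pooled = list(experimental) + list(control)
    k = len(experimental)
    at_least_as_extreme = 0
    for _ in range(rounds):
        rng.shuffle(pooled)  # randomly relabel which agents got the treatment
        diff = sum(pooled[:k]) / k - sum(pooled[k:]) / (len(pooled) - k)
        if diff >= observed:
            at_least_as_extreme += 1
    # Add-one smoothing keeps the estimate strictly above zero.
    return (at_least_as_extreme + 1) / (rounds + 1)

# Hypothetical per-agent ad counts, for illustration only.
experimental_counts = [9, 11, 10, 12, 8, 13, 10, 11]
control_counts = [2, 1, 3, 0, 2, 1, 2, 3]
p = permutation_p_value(experimental_counts, control_counts)
```

Because the treatment labels are shuffled at random, a small p-value says the observed difference is unlikely under the hypothesis that the treatment had no effect.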

Discrimination

[Figure: the ad ecosystem. Labels: Web browsing, Advertisements, Ad settings, Ad ecosystem.]

• Set the gender bit to female or male
• Browse websites related to finding a new job
• Significant difference in ads on a news website (p < 0.000006)

Discrimination Explanation

[Figure: bar chart with a 0% to 100% axis, comparing Female and Male for two ads: "$200k+ Jobs ‐ Execs Only" and "Find Next $200k+ Job". Data labels: 1816, 311, 36, 7.]

Fails the 80% rule for disparate impact
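The 80% rule compares the rate at which two groups receive an outcome and flags a ratio below 0.8. The sketch below is an illustration under stated assumptions: the extracted chart does not preserve which bar each data label belongs to, so it assumes 1816 and 311 are impressions of the first ad for the male and female groups, and that both groups have the same size (the group size used here is hypothetical and cancels out of the ratio).

```python
def impact_ratio(rate_a, rate_b):
    """Ratio of the lower rate to the higher one (1.0 means parity)."""
    low, high = sorted((rate_a, rate_b))
    return low / high if high > 0 else 1.0

# Assumptions: label-to-bar mapping and equal group sizes (see lead-in).
group_size = 500  # hypothetical number of agents per group
male_rate = 1816 / group_size    # impressions per agent
female_rate = 311 / group_size   # impressions per agent

ratio = impact_ratio(female_rate, male_rate)
fails_80_percent_rule = ratio < 0.8
```

Under these assumptions the ratio is roughly 0.17, far below the 0.8 threshold.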

Open questions

• How widespread?
• Who is responsible?

[Figure, built up over six slides: who could be responsible? Inputs: Female users, Male users, The Barrett Group, Other advertisers, Websites. System: Google. Output: ads shown. Annotations added on successive slides:]

• Show to males
• Show to high earners; High earners are male
• Show to females; Show to both
• Clicking; Not clicking; Show to both

Summary

• AdFisher: Rigorous experimental design
  – Causal effects
  – Statistical significance
  – Automation
• Found gender‐based discrimination
• Open questions:
  – How widespread?
  – How to assign responsibility?

More Information

• http://www.cs.cmu.edu/~mtschant/ife/

• M.C. Tschantz, A. Datta, A. Datta, and J.M. Wing. A methodology for information flow experiments. CSF 2015.

• A. Datta, M.C. Tschantz, and A. Datta. Automated Experiments on Ad Privacy Settings: A Tale of Opacity, Choice, and Discrimination. PETS 2015.

Roxana Geambasu
Columbia University

Sunlight: Fine-grained Targeting Detection at Scale with Statistical Confidence

Co-authors: Mathias Lecuyer, Riley Spahn, Yannis Spiliopoulos, Augustin Chaintreau, Daniel Hsu (Columbia University)

Sunlight: Web Transparency at Scale

Mathias Lecuyer, Riley Spahn, Yannis Spiliopoulos, Augustin Chaintreau, Roxana Geambasu, and Daniel Hsu

Columbia University

http://columbia.github.io/sunlight/

Example: Gmail Ads

Emails (subject & text):
E1  Vacation    I’m going on vacation to travel.
E2  Homosexual  Gay, lesbian, homosexual.
E3  Pregnant    I’m pregnant. I’m having a baby.
E4  Unemployed  I’m unemployed.
E5  Ford        I want to buy a car, maybe a Ford.

Ads (title, url & text):
Ad1  Ralph Lauren Online Shop | www.ralphlauren.com | The official Site for Ralph Lauren Apparel, Accessories & More
Ad2  Cedars Hotel Loughborough | www.thecedarshotel.com | 36 Bedrooms, Restaurant, Bar, Free WiFi, Parking, Best Rates


Did you know?

• Data brokers can tell when you're sick, tired and depressed -- and sell this information. [CNN ’14]

• Google Apps for Ed used institutional emails to target ads in personal accounts. [SafeGov’14]

• Credit companies are looking into using Facebook data to decide loans. [CNN’13]

It’s not just Gmail...

The data-driven web

• The web is a complex and opaque ecosystem driven by massive collection and monetization of personal data.

● Who has what data?
● What’s it used for?
● Are the uses good or bad for us?

● End-users and privacy watchdogs (e.g., FTC) are equally blind.

Our research

• Build transparency and oversight tools that increase users’ awareness and society’s oversight over web services’ use of personal data.

• Timeline:
  • 2014: XRay, the first targeting detection tool; it reveals targeting through correlation [USENIX Security’14].
  • 2015: Sunlight, a second-generation, more robust tool; it reveals the causes of targeting at scale and with statistical justification [CCS’15].
  • Ongoing: DataObservatory, the first tool to reveal personalization on arbitrary web pages.
  • Ongoing: Hubble, a transparency tool based on end-user information.

Ph.D. students: Mathias Lecuyer, Riley Spahn, Yannis Spiliopoulos

Faculty: Augustin Chaintreau, Daniel Hsu, Arvind Narayanan, Roxana Geambasu

Sunlight

Generic and broadly applicable system that detects personal data use for targeting and personalization. Reveals which data (e.g., emails) triggers which outputs (e.g., ads).

● Key idea: correlate inputs with outputs based on observations from profiles with differentiated inputs.

Sunlight is precise, scalable, and works with many services. We tested it for Gmail ads, ads on arbitrary websites, recommendations on Amazon & YouTube, and prices on travel websites.

Example

main account: E1 (Vacation), E2 (Homosexual), E3 (Pregnant); ad shown: Ad1 (Ralph Lauren Online Shop)

shadow account 1: E1, E2
shadow account 2: E1, E3
shadow account 3: E2, E3

Ad1 appears in two of the shadow accounts.

targeting prediction: E3 → Ad1

data collection: service-specific, with browser automation

targeting analysis: service-agnostic, with Sunlight
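The shadow-account example can be worked through in a few lines. This is a simplified sketch of the correlation idea, not Sunlight's algorithm: it scores each email by how much more often the ad appears in accounts holding that email than in accounts without it. It assumes that shadow accounts 2 and 3 are the two that showed Ad1 (the extracted slide does not say which two); the function name is hypothetical.

```python
def targeting_scores(account_inputs, accounts_with_ad):
    """Score each input by rate(ad | input present) - rate(ad | input absent)."""
    inputs = sorted(set().union(*account_inputs.values()))
    scores = {}
    for item in inputs:
        with_item = [a for a, held in account_inputs.items() if item in held]
        without = [a for a, held in account_inputs.items() if item not in held]
        rate_with = sum(a in accounts_with_ad for a in with_item) / len(with_item)
        rate_without = (
            sum(a in accounts_with_ad for a in without) / len(without)
            if without else 0.0
        )
        scores[item] = rate_with - rate_without
    return scores

account_inputs = {
    "main": {"E1", "E2", "E3"},
    "shadow1": {"E1", "E2"},
    "shadow2": {"E1", "E3"},
    "shadow3": {"E2", "E3"},
}
# Assumption: these are the accounts in which Ad1 was observed.
accounts_with_ad1 = {"main", "shadow2", "shadow3"}

scores = targeting_scores(account_inputs, accounts_with_ad1)
prediction = max(scores, key=scores.get)
```

Only E3 is present in every account that saw Ad1 and absent from the one that did not, so it receives the top score.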

Transparency solutions

[Figure: the transparency stack.
Users: end-users, privacy watchdogs (e.g., FTC, journalists)
Transparency tools (built by us, others): GmailAd-Observatory, AdsOnWeb-Observatory, AMZN, YouTube recomm., Data-Observatory, ...
Transparency infrastructures: Sunlight (generic, scalable, and justifiable targeting detection)
Between tools and Sunlight: input/output observations; targeting predictions {inputs->output}]

Sunlight goals

Genericity
We assume that a small set of inputs is used to produce each output. Our goal is to discover the correct input combination.

Scalability
Detect targeting of many outputs on many inputs w/ limited resources.

Precision
Targeting predictions must be statistically justified. Our goal is to detect as many true predictions as possible.

The scalability challenge

[Figure: main account (E1, E2, E3; ad A1) and shadow accounts 1 (E1, E2), 2 (E1, E3), and 3 (E2, E3); A1 appears in two of the shadow accounts.]

• To detect targeting on combinations of the inputs, will we need shadow profiles for all combinations?

Scalable targeting detection

• Theorem: Under sparsity assumptions, for any ε > 0 there exists an algorithm that requires C × log(N) accounts to correctly identify the inputs of a targeted output with probability (1 − ε), where N is the number of inputs.

• Key insight: rely on sparsity properties (like compressed sensing).

• Sunlight supports several sparse detection algorithms, including sparse regressions with Lasso.
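To make the sparse-regression idea concrete, here is a small, self-contained Lasso solved by coordinate descent. It is an illustrative sketch and not Sunlight's code: the account count, number of inputs, penalty `lam`, and the noise-free output that depends on a single input are all assumptions chosen for the example.

```python
import random

def soft_threshold(value, lam):
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0

def lasso(X, y, lam=0.05, sweeps=200):
    """Coordinate descent for (1/2m)*||y - Xb||^2 + lam*||b||_1 on centered data."""
    m, n = len(X), len(X[0])
    means = [sum(row[j] for row in X) / m for j in range(n)]
    y_mean = sum(y) / m
    Xc = [[row[j] - means[j] for j in range(n)] for row in X]
    resid = [v - y_mean for v in y]  # residual when all coefficients are zero
    scale = [sum(Xc[i][j] ** 2 for i in range(m)) / m for j in range(n)]
    beta = [0.0] * n
    for _ in range(sweeps):
        for j in range(n):
            if scale[j] == 0.0:
                continue  # a constant column carries no information
            rho = sum(Xc[i][j] * resid[i] for i in range(m)) / m + scale[j] * beta[j]
            new = soft_threshold(rho, lam) / scale[j]
            delta = new - beta[j]
            if delta:
                for i in range(m):
                    resid[i] -= Xc[i][j] * delta
                beta[j] = new
    return beta

# Hypothetical experiment: 32 accounts, 8 inputs assigned at random,
# and an ad that appears exactly when input 3 is present.
rng = random.Random(0)
num_inputs, num_accounts, targeted = 8, 32, 3
X = [[rng.randint(0, 1) for _ in range(num_inputs)] for _ in range(num_accounts)]
y = [row[targeted] for row in X]

beta = lasso(X, y)
predicted = max(range(num_inputs), key=lambda j: abs(beta[j]))
```

The L1 penalty pushes the coefficients of irrelevant inputs toward zero, which is what lets far fewer accounts than input combinations suffice when only a few inputs matter.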

Justifiable targeting predictions

• Sparse algorithms only guarantee asymptotic correctness of the targeting predictions.

• We need correctness assessment for each targeting prediction.

• Solution: hypothesis testing.
  • Provides quantification of statistical significance of each targeting association (a p-value).
  • The p-value gives a knob for the precision/recall tradeoff.
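One way to attach a p-value to a single targeting prediction is a one-sided Fisher exact test on held-out accounts. The slide does not name a specific test, so this choice is an assumption, and the counts below are hypothetical.

```python
from math import comb

def fisher_one_sided(ad_with_input, no_ad_with_input,
                     ad_without_input, no_ad_without_input):
    """P-value that the ad is at least this concentrated on accounts with the input."""
    a, b, c, d = ad_with_input, no_ad_with_input, ad_without_input, no_ad_without_input
    n = a + b + c + d
    row = a + b  # accounts holding the input
    col = a + c  # accounts that saw the ad
    # Hypergeometric upper tail: k ad-seeing accounts among those holding the input.
    tail = sum(comb(col, k) * comb(n - col, row - k)
               for k in range(a, min(row, col) + 1))
    return tail / comb(n, row)

# Hypothetical testing set: 10 accounts with the input (9 saw the ad),
# 10 accounts without it (1 saw the ad).
p_value = fisher_one_sided(9, 1, 1, 9)
```

Lowering the p-value threshold keeps only the best-supported predictions (higher precision), while raising it admits more of them (higher recall).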

Architecture

[Figure: the Sunlight pipeline. A transparency tool (e.g., GmailAdObservatory) supplies input/output observations to Sunlight and receives justifiable targeting predictions & p-values.]

1. Split observations: training set and testing set
2. Scalable Targeting Prediction: produces putative targeting predictions
3. Prediction Hypothesis Testing: produces targeting predictions & p-values
4. Multiple Test Correction: produces justifiable targeting predictions & p-values
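The final pipeline stage corrects for testing many predictions at once. The slide does not say which correction is used; the sketch below assumes the Holm-Bonferroni procedure as one standard option, with hypothetical p-values.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        # The k-th smallest p-value is compared against alpha / (m - k).
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # all larger p-values are retained as well
    return reject

# Hypothetical p-values for four targeting predictions.
decisions = holm_bonferroni([0.001, 0.04, 0.03, 0.2])
```

Here only the first prediction survives: 0.03 would pass an uncorrected 0.05 threshold but not the corrected one.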


What we get in the end

If during data collection we randomly assign our inputs independently of any other variable, Sunlight’s associations will have a causal interpretation (not just correlation).

However, Sunlight cannot explain how this targeting happens.

E.g.: What player in the ecosystem is responsible? Is it a human intervention or an algorithmic decision? Is it intentional or not?

Transparency tools

[Figure: the transparency stack. End-users and privacy watchdogs (e.g., FTC, journalists) use transparency tools (built by us, others): GmailAd-Observatory, AdsOnWeb-Observatory, AMZN, YouTube recomm., Data-Observatory, ... Transparency infrastructures: Sunlight (generic, scalable, and justifiable targeting detection). Between tools and Sunlight: input/output observations; targeting predictions {inputs->output}.]

Tool 1: GmailAdObservatory

● Service to study targeting of Gmail ads on users’ emails.
  ○ Meant for researchers and journalists.

● How it works:
  ○ Researcher supplies a set of emails.
  ○ GmailAdObservatory uses a set of Gmail accounts to send emails to a separate set of Gmail accounts (the shadows).
  ○ It then collects ads periodically.
  ○ Uses Sunlight to detect targeting for each collected ad.

● We ran a 33-day pilot study and we found violations of Google privacy statements.

Google privacy FAQ

Privacy, Transparency and Choice
[...]

Only ads classified as Family-safe are displayed in Gmail. We are careful about the types of content we serve ads against. For example, Google may block certain ads from running next to an email about catastrophic news. We will also not target ads based on sensitive information, such as race, religion, sexual orientation, health, or sensitive financial categories.

http://support.google.com/mail/answer/6603

“We will also not target ads based on sensitive information, such as race, religion, sexual orientation, health, or sensitive financial categories.”

Notice the extremely low in-context impressions -- the most obscure form of targeting.

Tool 2: DataObservatory (work in progress)

● Discovers personalization on arbitrary websites without any a-priori specification of targeted outputs.

● How it works (in progress!):
  ○ Visits a website from the vantage point of multiple user profiles with differentiated inputs.
  ○ Compares various versions of each page by comparing DOM trees.
  ○ Uses Sunlight to detect how differences are targeted on the inputs.
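The page-comparison step can be sketched with the Python standard library. This is not DataObservatory's implementation: it assumes a deliberately simple notion of DOM comparison, reducing each page to a set of (tag path, text) pairs and reporting what appears in only one version. The class and function names and the two page snippets are hypothetical, and void elements such as `<br>` are not handled specially.

```python
from html.parser import HTMLParser

class PathCollector(HTMLParser):
    """Collect (tag path, text) pairs for every non-empty text node."""
    def __init__(self):
        super().__init__()
        self.stack = []
        self.leaves = set()

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the matching open tag.
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.leaves.add(("/".join(self.stack), text))

def dom_leaves(html):
    parser = PathCollector()
    parser.feed(html)
    parser.close()
    return parser.leaves

def dom_diff(page_a, page_b):
    """Return the text nodes unique to each version of the page."""
    a, b = dom_leaves(page_a), dom_leaves(page_b)
    return a - b, b - a

# Hypothetical versions of one page seen by two differently configured profiles.
page_1 = "<html><body><h1>Hotels</h1><p class='price'>$120</p></body></html>"
page_2 = "<html><body><h1>Hotels</h1><p class='price'>$95</p></body></html>"
only_1, only_2 = dom_diff(page_1, page_2)
```

Each differing node can then be treated as an output whose presence across profiles is analyzed for targeting.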

Ex: Personalization on Booking.com

[Figure: two versions of a Booking.com page, labeled New York, NY and Berlin, Germany.]

Summary

• We are building the first generic and broadly applicable transparency tools that enable oversight at scale.
  • Sunlight reveals the causes of targeting from controlled experiments with many inputs.
  • DataObservatory reveals personalization on arbitrary pages.

• Tools can be used to study complex targeting phenomena.
  • E.g.: ad targeting, price tuning, personalization based on tracking, cross-device targeting, remote fingerprint-based tracking, how children are targeted, etc.

• Open challenge: avoid the pitfalls of controlled experiments.


http://www.cs.columbia.edu/~yannis/stable/booking_com_us_ger_LA_feb01-feb02_exp/Visualization.html

NOTE: This is very much in-progress work, but the demo illustrates the kinds of functionality the DataObservatory will provide.

Demo page

Daniel Hsu
Columbia University

Discovering Unwarranted Associations in Data-Driven Applications with the FairTest Testing Toolkit

Co-authors: Vaggelis Atlidakis, Roxana Geambasu (Columbia University); Florian Tramèr, Jean-Pierre Hubaux, Huang Lin (École Polytechnique Fédérale de Lausanne); Ari Juels (Cornell Tech)

FairTest: discovering unwarranted associations in data‐driven applications

Florian Tramèr#, Vaggelis Atlidakis*, Roxana Geambasu*, Daniel Hsu*, Jean‐Pierre Hubaux#, Mathias Humbert#, Ari Juels@, Huang Lin#

#École Polytechnique Fédérale de Lausanne, *Columbia University, @Cornell Tech


“Unfair” associations + consequences

These are software bugs: need to actively test for them and fix them (i.e., debug) in data‐driven applications…just as with functionality, performance, and reliability bugs.

Limits of preventative measures

What doesn’t work:
• Hide protected attributes from data‐driven application.
• Aim for statistical parity w.r.t. protected classes and service output.

Foremost challenge is to even detect these unwarranted associations.

FairTest: a testing suite for data‐driven apps

• Finds context‐specific associations between protected variables and application outputs
• Bug report ranks findings by assoc. strength and affected pop. size

[Figure: User inputs (location, click, …) feed the data‐driven application, which produces application outputs (prices, tags, …). FairTest takes protected vars. (race, gender, …), context vars. (zip code, job, …), and explanatory vars. (qualifications, …), and produces an association bug report for the developer.]
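The idea of a ranked, context-specific report can be sketched as follows. This is not FairTest's API or its association metric: it assumes a binary output, uses the gap in positive-output rates between protected groups as the association strength, and takes the contexts as given rather than discovering them. All names and records are hypothetical.

```python
from collections import defaultdict

def association_report(records):
    """Rank contexts by association strength, then by affected population size."""
    by_context = defaultdict(lambda: defaultdict(list))
    for r in records:
        by_context[r["context"]][r["protected"]].append(r["output"])
    report = []
    for context, groups in by_context.items():
        rates = {g: sum(vals) / len(vals) for g, vals in groups.items()}
        strength = max(rates.values()) - min(rates.values())
        size = sum(len(vals) for vals in groups.values())
        report.append((context, strength, size))
    report.sort(key=lambda row: (row[1], row[2]), reverse=True)
    return report

def make(context, protected, outputs):
    return [{"context": context, "protected": protected, "output": o} for o in outputs]

# Hypothetical data: the output differs by group in zip A but not in zip B.
records = (
    make("zip A", "f", [1, 0, 0, 0]) + make("zip A", "m", [1, 1, 1, 0])
    + make("zip B", "f", [1, 1, 0, 0]) + make("zip B", "m", [1, 0, 1, 0])
)
report = association_report(records)
```

An association that is invisible or weak over the whole population can still be strong inside one context, which is why the report is broken down this way.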

A data‐driven approach

Core of FairTest is based on statistical machine learning

[Figure: Data (ideally sampled from relevant user population) is split into training data and test data. FairTest steps: find context‐specific associations; statistically validate associations.]

Statistical machine learning internals:
• top‐down spatial partitioning algorithm
• confidence intervals for assoc. metrics
• …
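As an illustration of a confidence interval for an association metric, the sketch below computes a normal-approximation (Wald) interval for the difference between two groups' positive-output rates. The choice of metric and interval is an assumption, not necessarily what FairTest uses, and the counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def rate_difference_ci(pos_a, n_a, pos_b, n_b, confidence=0.95):
    """Wald confidence interval for (rate of group A) - (rate of group B)."""
    p_a, p_b = pos_a / n_a, pos_b / n_b
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    diff = p_a - p_b
    return diff - z * se, diff + z * se

# Hypothetical test data: 30 of 100 positive outputs in group A, 50 of 100 in group B.
low, high = rate_difference_ci(30, 100, 50, 100)
```

An interval that excludes zero, as this one does, indicates the association is unlikely to be a sampling artifact.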

Example: health care application

Predictor of whether patient will visit hospital again in next year (from winner of 2012 Heritage Health Prize Competition)

[Figure: age, gender, # emergencies, … → Hospital re‐admission predictor → Will patient be re‐admitted?]

FairTest’s finding: significant contexts exhibiting strong association between age and prediction error rate.

Association may translate to quantifiable harms (e.g., if app is used to adjust insurance premiums)!

Example: Berkeley graduate admissions

Admission into UC Berkeley graduate programs (Bickel, Hammel, and O’Connell, 1975)

[Figure: age, gender, GPA, … → Graduate admissions committees → Admit applicant?]

Bickel et al.’s (and also FairTest’s) findings: gender bias in admissions at university level, but mostly gone after conditioning on department

FairTest helps developers understand & evaluate potential association bugs.
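The effect of conditioning on department can be reproduced with a toy table. The numbers below are invented for illustration and are not the Bickel et al. data: men mostly apply to a department with a high admit rate, women to one with a low admit rate.

```python
def admit_rate(rows, gender, department=None):
    """Admission rate for a gender, optionally restricted to one department."""
    admitted = applied = 0
    for dept, g, n_admitted, n_applied in rows:
        if g == gender and (department is None or dept == department):
            admitted += n_admitted
            applied += n_applied
    return admitted / applied

# Hypothetical counts: (department, gender, admitted, applied).
rows = [
    ("X", "men", 80, 100), ("X", "women", 18, 20),
    ("Y", "men", 6, 20),   ("Y", "women", 35, 100),
]

overall_gap = admit_rate(rows, "men") - admit_rate(rows, "women")
gap_by_department = {
    d: admit_rate(rows, "men", d) - admit_rate(rows, "women", d) for d in ("X", "Y")
}
```

At the university level men appear strongly favored, yet within each department the gap is small and runs the other way, which is why an explanatory variable such as department matters when judging an association.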

Closing remarks

• Other applications studied using FairTest (http://arxiv.org/abs/1510.02377):
  • Image tagger based on deep learning (on ImageNet data)
  • Simple movie recommender system (on MovieLens data)
  • Simulation of Staples’ pricing system

• Other features in FairTest:
  • Exploratory studies (e.g., find image tags with offensive associations)
  • Adaptive data analysis (preliminary) – i.e., statistical validity with data re‐use
  • Integration with SciPy library

Developers need better statistical training and tools to make better statistical decisions and applications.

Discussion of Session 3

Discussants:
• Dan Salsburg, Federal Trade Commission
• James C. Cooper, George Mason University School of Law
• Deirdre K. Mulligan, University of California, Berkeley

Presenters:
• Michael Carl Tschantz, University of California, Berkeley & Anupam Datta, Carnegie Mellon University
• Roxana Geambasu, Columbia University
• Daniel Hsu, Columbia University