Do You Trust Your Recommender? An Exploration of Privacy and Trust in Recommender Systems
Dan Frankowski, Dan Cosley, Shilad Sen, Tony Lam, Loren Terveen, John Riedl
University of Minnesota
CDT Spring Research Forum, 2007
Story: Finding “Subversives”
“...few things tell you as much about a person as the books he chooses to read.”
– Tom Owad, applefritter.com
Session Outline
- Exposure: undesired access to a person's information
  - Privacy risks
  - Preserving privacy
- Bias and Sabotage: manipulating a trusted system to manipulate users of that system
Why Do I Care?
- As a businessperson
  - The nearest competitor is one click away
  - Lose your customers' trust, and they will leave
  - Lose your credibility, and they will ignore you
- As a person
  - Let's not build Big Brother
Risk of Exposure in One Slide
Private dataset (YOU) + public dataset (YOU) + algorithms = your private data, linked!
Seems bad. How can privacy be preserved?
movielens.org
- Started ~1995
- Users rate movies ½ to 5 stars
- Users get recommendations
- Private: no one outside GroupLens can see a user's ratings

Anonymized Dataset
- Released 2003
- Ratings, some demographic data, but no identifiers
- Intended for research
- Public: anyone can download

movielens.org Forums
- Started June 2005
- Users talk about movies
- Public: on the web, no login needed to read

Can forum users be identified in our anonymized dataset?
Research Questions
- RQ1: RISKS OF DATASET RELEASE: What are the risks to user privacy when releasing a dataset?
- RQ2: ALTERING THE DATASET: How can dataset owners alter the dataset they release to preserve user privacy?
- RQ3: SELF DEFENSE: How can users protect their own privacy?
Motivation: Privacy Loss
- MovieLens forum users did not agree to reveal their ratings
- Anonymized ratings + public forum data = privacy violation?
- More generally: dataset 1 + dataset 2 = privacy risk? What kinds of datasets? What kinds of risks?
Vulnerable Datasets
We talk about datasets from a sparse relation space, which:
- Relates people to items
- Is sparse (few relations per person out of the possible relations)
- Has a large space of items

       i1   i2   i3   ...
  p1   X
  p2        X
  p3             X
  ...
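To make the structure concrete, here is a minimal sketch (not from the talk; data and names are hypothetical) of a sparse relation space stored as (person, item) pairs:

```python
# Minimal sketch (hypothetical data): a sparse relation space as (person, item) pairs.
from collections import defaultdict

relations = {("p1", "i1"), ("p2", "i2"), ("p3", "i3")}

# Index items by person for quick lookup.
items_by_person = defaultdict(set)
for person, item in relations:
    items_by_person[person].add(item)

# Sparsity: relations present vs. all possible (person, item) pairs.
people = {p for p, _ in relations}
items = {i for _, i in relations}
print(f"{len(relations) / (len(people) * len(items)):.0%} of possible pairs present")
```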
Example Sparse Relation Spaces
- Customer purchase data from Target
- Songs played from iTunes
- Articles edited in Wikipedia
- Books/albums/beers/... mentioned by bloggers or on forums
- Research papers cited in a paper (or review)
- Groceries bought at Safeway
- ...

We look at movie ratings and forum mentions, but there are many sparse relation spaces.
Risks of Re-identification
- Re-identification is matching a user across two datasets using some linking information (e.g., name and address, or movie mentions)
- Re-identifying to an identified dataset (e.g., one with names and addresses, or social security numbers) can result in severe privacy loss
Story: Finding Medical Records (Sweeney 2002)
- The former Governor of Massachusetts
- 87% of people in the 1990 U.S. census were identifiable by ZIP code, birth date, and gender!
The Rebus Form
Anonymized medical data + identified public data (e.g., voter rolls) = the Governor's medical records!
Related Work
- Anonymizing datasets: k-anonymity (Sweeney 2002)
- Privacy-preserving data mining (Verykios et al. 2004; Agrawal et al. 2000; ...)
- Privacy-preserving recommender systems (Polat et al. 2003; Berkovsky et al. 2005; Ramakrishnan et al. 2001)
- Text mining of user comments and opinions (Drenner et al. 2006; Dave et al. 2003; Pang et al. 2002)
RQ1: Risks of Dataset Release
What are the risks to user privacy when releasing a dataset?
RESULT: a 1-identification rate of 31%
- Ignores rating values entirely!
- Can do even better if text analysis produces rating values
- Rarely-rated items were more identifying
Glorious Linking Assumption
- People mostly talk about things they know => people tend to have rated what they mentioned
- Measured P(u rated m | u mentioned m), averaged over all forum users: 0.82
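As a sketch of how this could be measured (the data layout is assumed, not the paper's code), average each forum user's fraction of mentioned items that they also rated:

```python
# Minimal sketch (assumed data layout): estimate P(u rated m | u mentioned m),
# averaged over forum users.
ratings = {"alice": {"Fargo", "Heat"}, "bob": {"Alien"}}     # user -> movies rated
mentions = {"alice": {"Fargo", "Brazil"}, "bob": {"Alien"}}  # user -> movies mentioned

per_user = [
    len(m & ratings.get(u, set())) / len(m)
    for u, m in mentions.items() if m
]
print(sum(per_user) / len(per_user))  # 0.75 here; the talk measured 0.82
```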
Algorithm Idea
- Start with all users
- Users who rated a popular item: a large set
- Users who rated a rarely rated item: a small set
- Users who rated both: smaller still; intersecting the rater sets of enough mentioned items can narrow the candidates to a single user (see the sketch below)
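A minimal sketch of the intersection idea, with hypothetical data (not the paper's implementation):

```python
# Minimal sketch (hypothetical data) of set-intersection re-identification:
# intersect the rater sets of each item a forum user mentioned.
raters = {
    "popular": {1, 2, 3, 4, 5},  # item -> anonymized rating-user ids
    "rare": {3, 7},
}

def candidates(mentioned):
    """Users who rated every mentioned item; a singleton set is a 1-identification."""
    sets = [raters[m] for m in mentioned if m in raters]
    return set.intersection(*sets) if sets else set()

print(candidates(["popular", "rare"]))  # {3}: this forum user is 1-identified
```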
[Chart: Probability of 1-identification vs. algorithm. X-axis: # mentions bin (and # users in bin), from 1 (25 users) to >64 (11 users); y-axis: probability of 1-identification, 0 to 1. Series: ExactRating, FuzzyRating, Scoring, TF-IDF, Set Intersection.]
•At >=16 mentions we often 1-identify
•More mentions => better re-identification
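The chart compares several matchers without giving their formulas. As an illustration only, a rarity-weighted scorer in the TF-IDF spirit might look like the sketch below; the weighting is assumed, not the paper's exact algorithm:

```python
import math

# Illustrative sketch (assumed weighting): score each rating user against a
# forum user's mentions, weighting rarely-rated items more heavily.
raters = {"popular": {1, 2, 3, 4, 5}, "rare": {3, 7}}
NUM_USERS = 8

def idf(item):
    # Rare items get higher weight, echoing "rarely-rated items were more identifying".
    return math.log(NUM_USERS / len(raters[item]))

def score(user, mentioned):
    return sum(idf(m) for m in mentioned if m in raters and user in raters[m])

scores = {u: score(u, ["popular", "rare"]) for u in range(1, NUM_USERS + 1)}
print(max(scores, key=scores.get))  # user 3 scores highest
```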
RQ2: Altering the Dataset
How can dataset owners alter the dataset they release to preserve user privacy?
- Perturbation: change rating values. Oops: Scoring doesn't need values
- Generalization: group items (e.g., by genre). The dataset becomes less useful
- Suppression: hide data. IDEA: release a ratings dataset suppressing all "rarely-rated" items (sketched below)
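A minimal sketch of database-level suppression, assuming a simple rating-count threshold for "rarely rated" (the study's actual cutoff may differ):

```python
# Minimal sketch (assumed threshold): suppress rarely-rated items
# before releasing a ratings dataset.
from collections import Counter

ratings = [("u1", "rare"), ("u2", "popular"), ("u3", "popular"), ("u4", "popular")]
counts = Counter(item for _, item in ratings)
THRESHOLD = 2  # hypothetical: items with fewer ratings count as "rarely rated"

released = [(u, i) for u, i in ratings if counts[i] >= THRESHOLD]
print(released)  # the lone rating of "rare" is suppressed from the release
```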
[Chart: Database-level suppression curves. X-axis: fraction of items suppressed, 0 to 1; y-axis: fraction of users 1-identified, 0 to 0.6.]
•Drop 88% of items to protect current users against 1-identification
•88% of items => 28% of ratings
RQ3: Self Defense
How can users protect their own privacy?
- Similar to RQ2, but now per-user
- A user can change ratings or mentions; we focus on mentions
- A user can perturb, generalize, or suppress; as before, we study suppression (sketched below)
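A minimal sketch of per-user suppression, assuming a rarest-first heuristic since rarely-rated items are the most identifying (the paper's exact procedure may differ):

```python
# Minimal sketch (assumed heuristic): a user suppresses their rarest
# mentions first.
item_popularity = {"rare": 2, "mid": 40, "popular": 5000}  # hypothetical rating counts

def suppress(mentions, fraction):
    """Drop the given fraction of a user's mentions, rarest items first."""
    keep = len(mentions) - int(len(mentions) * fraction)
    by_popularity = sorted(mentions, key=lambda m: item_popularity[m], reverse=True)
    return by_popularity[:keep]

print(suppress(["rare", "mid", "popular"], 0.34))  # ['popular', 'mid']
```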
[Chart: User-level suppression curves. X-axis: fraction of user mentions (per user) suppressed, 0 to 0.5; y-axis: fraction of users 1-identified, 0 to 1.]
•Suppressing 20% of mentions dropped 1-identification somewhat, but not entirely
•Suppressing more than 20% is not reasonable for a user
Another Strategy: Misdirection
- What if users mention items they did NOT rate? This might misdirect a re-identification algorithm
- Create a misdirection list of items; each user takes an unrated item from the list and mentions it, repeating until they are not identified (sketched below)
- What makes a good misdirection list? Remember: rarely-rated items are identifying
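A minimal sketch of that loop; identify() is a hypothetical stand-in for any re-identification matcher, such as the set-intersection sketch above:

```python
# Minimal sketch (hypothetical helper names): add unrated items from a
# misdirection list to a user's public mentions until the matcher no
# longer 1-identifies them.
def misdirect(user_mentions, user_rated, misdirection_list, identify):
    mentions = list(user_mentions)
    for item in misdirection_list:
        if len(identify(mentions)) != 1:
            break  # no longer uniquely identified
        if item not in user_rated and item not in mentions:
            mentions.append(item)
    return mentions
```

Which items belong on misdirection_list is exactly what the next chart explores.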
[Chart: User 1-identification vs. number of misdirecting mentions. X-axis: # misdirecting mentions, 0 to 20; y-axis: fraction of users 1-identified, 0 to 0.35. Series: Rare, rated>=1; Rare, rated>=16; Rare, rated>=1024; Rare, rated>=8192; Popular.]
•Rarely-rated items don't misdirect!
•Popular items do better, though 1-identification isn't zero
•Better to misdirect into a large crowd
•Rarely-rated items are identifying; popular items are misdirecting
Exposure: What Have We Learned?
REAL RISK
- Re-identification can lead to loss of privacy
- We found substantial risk of re-identification in our sparse relation space
- There are a lot of sparse relation spaces, and we're probably in more and more of them that are available electronically

HARD TO PRESERVE PRIVACY
- The dataset owner had to suppress much of the dataset to protect privacy
- Users had to suppress a lot to protect their privacy
- Users could misdirect somewhat with popular items
Advice: Keep Your Customers' Trust
- Share data rarely. Remember the governor: (ZIP + birth date + gender) is not anonymous
- Reduce exposure. Example: Google will anonymize search data older than 24 months
AOL: 650K users, 20M queries
- Data wants to be free: government subpoenas, research, commerce
- People do not know the risks
- AOL was text; this is items
- NY Times: user 4417749 searched for "dog that urinates on everything."
Discussion #1: Exposure
Examples of sparse relation spaces?
Examples of re-identification risks?
How to preserve privacy?