Privacy Challenges and Solutions for Data Sharing
Aris Gkoulalas-Divanis
Smarter Cities Technology Centre IBM Research, Ireland
June 8, 2015
§ Introduction / Motivation for data privacy
§ Focus Application – Privacy Preserving Medical Data Publishing
  – Electronic Medical Records (EMR) and their use in research
  – Privacy threats and their effectiveness
  – Privacy models for relational data (demographics)
  – Privacy models for transaction/set-valued data (diagnosis codes)
  – Policy-based anonymization model
  – Rule-based anonymization model
  – Anonymization of RT (relational-transaction) datasets
  – SECRETA anonymization toolkit
§ Summary
Content
IBM Research-Ireland
§ Individuals' data are increasingly shared
  § Netflix published movie ratings of 500K subscribers
  § AOL published 20M search query terms of 658K web users
  § TomTom sold customers' location (GPS) data to the Dutch police
  § eMERGE consortium published patient data related to genome-wide association studies to biorepositories (dbGaP)
  § Orange provided call information about its mobile subscribers, as part of the D4D challenge on mobile phone data (http://www.d4d.orange.com/en/home)
§ Benefits of data sharing
  § Personalization (e.g., Netflix's data mining contest aimed to improve movie recommendations based on personal preferences)
  § Marketing (e.g., Tesco made £53M last year from selling shopping patterns to retailers and manufacturers, such as Nestlé and Unilever)
  § Social benefits (e.g., promoting medical research studies, improving traffic management, etc.)
Data sharing
Data sharing must guarantee privacy and accommodate utility
§ A popular data sharing scenario (data publishing)
data owners → data publisher (trusted) → data recipient (untrusted)
§ Threats to data privacy § Identity disclosure § Sensitive information disclosure § Membership disclosure § Inferential disclosure
§ Data utility requirements § Minimal data distortion (general purpose use) § Support of specific applications / workloads
(e.g., building accurate predictive models, GWAS, etc.)
Original data
Released data
Privacy Preserving Medical Data Publishing
How can we share medical data in a way that protects patients’ privacy while supporting research studies?
Electronic Medical Records (EMR)
§ Relational data
  – Registration and demographic data
§ Transaction (set-valued) data
  – Billing information
  – ICD codes* are represented as numbers (up to 5 digits) and denote signs, findings, and causes of injury or disease**
§ Sequential data
  – DNA
§ Text data
  – Clinical notes

* International Statistical Classification of Diseases and Related Health Problems
** Centers for Medicare & Medicaid Services - https://www.cms.gov/icd9providerdiagnosticcodes/
Electronic Medical Records (EMR)

Name   YOB   ICD            DNA   Clinical notes
Jim    1955  493.00, 185    C…T   (doc1)
Mary   1943  185, 157.3     A…G   (doc2)
Mary   1943  493.01         C…G   (doc3)
Carol  1965  493.02         C…G   (doc4)
Anne   1973  157.9, 493.03  G…C   (doc5)
Anne   1973  157.3          A…T   (doc6)
EMR data use in analytics
§ Statistical analysis
  – Correlation between YOB and ICD code 185 (Malignant neoplasm of prostate)
§ Querying
§ Clustering
  – Control epidemics*
§ Classification
  – Predict domestic violence**
§ Association rule mining
  – Formulate a government policy on hypertension management***
    IF age in [43,48] AND smoke=yes AND exercise=no AND drink=yes THEN hypertension=yes (sup=2.9%; conf=26%)
Electronic Medical Records

Name   YOB   ICD             DNA
Jim 1955 493.00, 493.01 C…T
Mary 1943 185 A…G
Mary 1943 493.01, 493.02 C…G
Carol 1965 493.02, 157.9 C…G
Anne 1973 157.9, 157.3 G…C
Anne 1973 157.3 A…T
* Tildesley et al. Impact of spatial clustering on disease transmission and optimal control, PNAS, 2010. ** Reis et al. Longitudinal Histories as Predictors of Future Diagnoses of Domestic Abuse: Modelling Study, BMJ: British Medical Journal, 2011 *** Chae et al. Data mining approach to policy analysis in a health insurance domain. Int. J. of Med. Inf., 2001
Need for privacy
§ Why do we need privacy in medical data sharing?
§ If privacy is breached, there are consequences to patients:
  § Emotional and economic embarrassment
  § 62% of individuals worry their EMRs will not remain confidential*
  § 35% expressed privacy concerns regarding the publishing of their data to dbGaP**
  § Opting out or providing fake data → difficulty conducting statistically powered studies

* Health Confidence Survey 2008, Employee Benefit Research Institute
** Ludman et al. Glad You Asked: Participants' Opinions of Re-Consent for dbGaP Data Submission. Journal of Empirical Research on Human Research Ethics, 2010.
Need for privacy
§ If privacy is breached, there are consequences to organizations
  § Legal → HIPAA, EU legislation (95/46/EC, 2002/58/EC, 2009/136/EC, etc.)
  § Financial → the average cost of a single data breach is $5.85M in the US and $4.74M in Germany; these countries have the highest per-capita cost*. Healthcare and Education are the most heavily regulated industries.

* Ponemon Institute Research Report – 2014 Cost of Data Breach Study: Global Analysis.
Protecting data privacy: data masking / removal of identifiers
§ Removing / masking direct identifiers
data owners → data publisher (trusted) → data recipient (untrusted)
1. Locate the direct identifiers (attributes that uniquely identify an individual), such as SSN, Patient ID, Phone number, etc.
2. Remove or mask them from the data prior to data publishing
Original data
De-identified data
Name           Search Query Terms
John Doe       Harry Potter, King's Speech
Thelma Arnold  Hand tremors, bipolar, dry mouth, effect of nicotine on the body
Protecting data privacy: data masking / removal of identifiers
§ Masking or removal of direct identifiers is not sufficient!
data owners → data publisher (trusted) → data recipient (untrusted)
§ Main types of threats to data privacy
§ Identity disclosure § Sensitive information disclosure § Inferential disclosure
Original data
Released data
External data
Background Knowledge
Privacy Threats: Identity disclosure
§ Identity disclosure in relational data (e.g., patients' demographics)
  – Individuals are linked to their published records based on quasi-identifiers (attributes that in combination can identify an individual)
De-identified data:
Age  Postcode  Sex
20   NW10      M
45   NW15      M
22   NW30      M
50   NW25      F

External data:
Name  Age  Postcode  Sex
Greg  20   NW10      M
Jim   45   NW15      M
Jack  22   NW30      M
Anne  50   NW25      F
87% of US citizens can be identified by {DOB, gender, 5-digit ZIP code}*
* Sweeney. k-anonymity: a model for protecting privacy. IJUFKS, 2002.
§ Identity disclosure in transaction data (e.g., diagnosis codes)
§ Identity disclosure in sharing patients' diagnosis codes

Identified EMR data:
ID    ICD
Jim   333.4
Mary  401.0, 401.1
Anne  401.0, 401.2, 401.3

Released EMR Data:
ICD                  DNA
333.4                CT…A
401.0, 401.1         AC…T
401.0, 401.2, 401.3  GC…C

Mary is diagnosed with benign essential hypertension (ICD code 401.1) → the second record belongs to her → all her diagnosis codes are revealed

§ Disclosure based on diagnosis codes* → a general problem for other medical terminologies (e.g., ICD-10, used in the EU) → sharing data susceptible to the attack is against legislation

* Loukides et al. The Disclosure of Diagnosis Codes Can Breach Research Participants' Privacy. JAMIA, 2010.
Real-world identity disclosure cases involving medical data
§ Group Insurance Commission data × voter list of Cambridge, MA → William Weld, former Governor of MA
§ Chicago Homicide database × Social Security Death Index → 35% of murder victims
§ Adverse Drug Reaction database × public obituaries → 26-year-old girl who died from a drug
Issuing attacks on medical datasets
§ Two-step attack using publicly available voter registration lists and hospital discharge summaries*
voter(name, ..., zip, dob, sex)
summary(zip, dob, sex, diagnoses)
release(diagnoses, DNA)
voter list & discharge summary → privacy breach
87% of US citizens can be identified by {dob, gender, ZIP code}

* Sweeney. k-anonymity: a model for protecting privacy. IJUFKS, 2002.
Issuing attacks on medical datasets
§ One-step attack using EMRs*
EMR(name, ..., diagnoses)
release(…, diagnoses, DNA)

* Loukides et al. The Disclosure of Diagnosis Codes Can Breach Research Participants' Privacy. JAMIA, 2010.
Case study: Evaluating the effectiveness of the insider's attack
§ Insider's attack
de-identified EMR (ID, ..., diagnoses)
VNEC(…, diagnoses, DNA)
§ De-identified / masked EMR population from VUMC
  – Population: 1.2M records (patients) from Vanderbilt
  – A unique random number for ID
§ VNEC: a de-identified / masked EMR sample
  – 2,762 records (patients) derived from the population
  – Patients from VNEC were involved in a study (GWAS) of the Native Electrical Conduction of the heart
  – Patients' EMRs were to be deposited into dbGaP
  – Data would be made available to support other studies (GWAS)
Case study: Evaluating the effectiveness of the insider's attack
§ Vanderbilt's EMR – VNEC dataset linkage on ICD codes
§ We assume that all ICD codes are used to issue an attack
§ 96.5% of patients are susceptible to identity disclosure
[Figure: % of re-identified sample vs. distinguishability (log scale) – the number of times a set of ICD codes appears in the population (support count in the data mining literature)]
Case study: Evaluating the effectiveness of the insider's attack
§ Vanderbilt's EMR – VNEC dataset linkage on ICD codes
§ A random subset of the ICD codes can be used in the attack
§ Knowing a random combination of 2 ICD codes can lead to unique re-identification
[Figure: % of re-identifiable sample vs. distinguishability (log scale) – the number of times a set of ICD codes appears in the population (equivalent to support count in the data mining literature); curves for 1 ICD code and for combinations of 2, 3, and 10 ICD codes]
Case study: Evaluating the effectiveness of the insider's attack
§ VNEC dataset linkage on ICD codes – hospital discharge records
§ All ICD codes associated with a patient for a single visit
§ It is difficult to know ICD codes that span visits when public discharge summaries are used
§ 46% of patients in VNEC are uniquely re-identifiable
[Figure: number of times a set of ICD codes appears in VNEC (support count in the data mining literature)]
Privacy Threats: Sensitive information disclosure
§ Individuals are associated with sensitive information
§ Sensitive information disclosure can occur without identity disclosure

Background knowledge:
Name  Age  Postcode  Sex
Greg  20   NW10      M

De-identified data (Disease is the Sensitive Attribute, SA):
Age  Postcode  YOB   Disease
20   NW10      1981  HIV
20   NW10      1987  HIV
20   NW10      1998  HIV
20   NW10      1970  HIV

Identified EMR data:
ID    ICD
Jim   401.0, 401.1, 295
Mary  401.0, 401.1, 303, 295

Released EMR Data:
ID    ICD                      DNA
Jim   401.0, 401.1, 295       C…A
Mary  401.0, 401.1, 303, 295  A…T

Mary is diagnosed with 401.0 and 401.1 → she has Schizophrenia (ICD code 295)
Sensitive information disclosure in Netflix movie rating sharing
§ 100M dated ratings from 480K users on 18K movies
§ Data mining contest ($1M prize) to improve movie recommendation based on personal preferences
§ Movies reveal political, religious, and sexual beliefs and need protection according to the Video Privacy Protection Act
§ "Anonymized":
  • De-identification
  • Sampling, date modification, rating suppression
  • Movie title and year published in full
§ Researchers inferred movie ratings of subscribers*
  • Data were linked with IMDb w.r.t. ratings and/or dates
A lawsuit was filed; Netflix settled it: "We will find new ways to collaborate with researchers"

* Narayanan et al. Robust De-anonymization of Large Sparse Datasets. IEEE Symposium on Security and Privacy '08.
Privacy Threats: Inferential disclosure
§ Sensitive knowledge patterns are exposed by data mining
  – "75% of patients visit the same physician more than 4 times" (from stream data collected by health monitoring systems)
  – "60% of the white males > 50 suffer from diabetes" (from electronic medical records, drug orders & costs)
§ Business rivals can harm data publishers, and insurance, pharmaceutical & marketing companies can harm data owners*
  – Unsolicited advertisement
  – Customer discrimination

* G. Das and N. Zhang. Privacy risks in health databases from aggregate disclosure. PETRA, 2009.
Anonymization of demographics
§ k-anonymity principle*
  – Each record in a relational table T should have the same values over the quasi-identifiers as at least k−1 other records in T
  – These records collectively form a k-anonymous group
  – k-anonymity protects from identity disclosure
    § Protects data from linkage to external sources (triangulation attacks)
    § The probability that an individual is correctly associated with their record is at most 1/k

External data:
Name  Age  Postcode  Sex
Greg  40   NW10      M
Jim   45   NW15      M
Jack  22   NW30      M
Anne  50   NW25      F

2-anonymous data:
Age  Postcode  Sex
4*   NW1*      M
4*   NW1*      M
*    NW*       *
*    NW*       *

* Sweeney. Achieving k-anonymity privacy protection using generalization and suppression. IJUFKS, 2002.
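The group-size condition above is easy to check mechanically: count how often each combination of QID values occurs. A minimal sketch in Python (the helper name `is_k_anonymous` is illustrative, not from the talk):

```python
from collections import Counter

def is_k_anonymous(records, qids, k):
    """True if every combination of QID values occurs in at least k records."""
    counts = Counter(tuple(r[a] for a in qids) for r in records)
    return all(c >= k for c in counts.values())

# The 2-anonymous table from the slide: two groups of two records each.
table = [
    {"Age": "4*", "Postcode": "NW1*", "Sex": "M"},
    {"Age": "4*", "Postcode": "NW1*", "Sex": "M"},
    {"Age": "*",  "Postcode": "NW*",  "Sex": "*"},
    {"Age": "*",  "Postcode": "NW*",  "Sex": "*"},
]
print(is_k_anonymous(table, ["Age", "Postcode", "Sex"], 2))  # True
```

The same table fails for k = 3, since each group holds only two records.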
§ k-anonymity
Pros:
§ A baseline model
§ Intuitive
§ Has been implemented in many real-world systems
§ Follows & impacts privacy legislation
Cons:
§ Does not protect against sensitive information disclosure
§ Requires the data owner to specify the quasi-identifiers (QIDs) and the k-value
Attack on k-anonymous data
§ Homogeneity attack*
  – All sensitive values in a k-anonymous group are the same → sensitive information disclosure

2-anonymous data:
Age  Postcode  Disease
4*   NW1*      HIV
4*   NW1*      HIV
5*   NW*       Ovarian Cancer
5*   NW*       Flu

External data:
Name  Age  Postcode
Greg  40   NW10

The attacker is confident that Greg suffers from HIV.

* Machanavajjhala et al. l-Diversity: Privacy Beyond k-Anonymity. ICDE 2006.
l-diversity principle for demographics
§ l-diversity*
  – A relational table is l-diverse if all groups of records with the same values over the quasi-identifiers (QID groups) contain no less than l "well-represented" values for the sensitive attribute (SA)
  – Distinct l-diversity: l "well-represented" → l distinct values

6-anonymous group:
Age  Postcode  Disease
4*   NW1*      HIV
4*   NW1*      HIV
4*   NW1*      HIV
4*   NW1*      HIV
4*   NW1*      Flu
4*   NW1*      Cancer

Three distinct values, but the probability of "HIV" being disclosed is ~0.67
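The distinct-diversity check can be sketched as follows (hypothetical helper name). Note how it accepts the 6-anonymous group above even though HIV dominates it, which is exactly the weakness the slide points out:

```python
from collections import defaultdict

def is_distinct_l_diverse(records, qids, sa, l):
    """True if every QID group contains at least l distinct SA values."""
    groups = defaultdict(set)
    for r in records:
        groups[tuple(r[a] for a in qids)].add(r[sa])
    return all(len(values) >= l for values in groups.values())

# The 6-anonymous group from the slide: 3 distinct diseases, so it passes
# distinct 3-diversity even though P(HIV) = 4/6 within the group.
group = [{"Age": "4*", "Postcode": "NW1*", "Disease": d}
         for d in ["HIV", "HIV", "HIV", "HIV", "Flu", "Cancer"]]
print(is_distinct_l_diverse(group, ["Age", "Postcode"], "Disease", 3))  # True
```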
Further improvements over l-diversity
§ Sensitive values may not need the same level of protection → (a,k)-anonymity [1]
§ l-diversity is difficult to achieve when the SA values are skewed → t-closeness [2]
§ l-diversity does not consider semantic similarity of SA values → (e,m)-anonymity [3], range diversity [4]
§ Can patients decide the level of protection for their SA values? → personalized privacy [5]

[1] Wong et al. (alpha, k)-anonymity: an enhanced k-anonymity model for privacy preserving data publishing. KDD 2006.
[2] Li et al. t-Closeness: Privacy Beyond k-Anonymity and l-Diversity. ICDE 2007.
[3] Li et al. Preservation of proximity privacy in publishing numerical sensitive data. SIGMOD 2008.
[4] Loukides et al. Preventing range disclosure in k-anonymised data. Expert Syst. Appl. 2011.
[5] Xiao et al. Personalized privacy preservation. SIGMOD 2006.
Partition-based algorithms for k-anonymity
§ Main idea of partition-based algorithms
  – A record projected over the QIDs is treated as a multidimensional point
  – A subspace (hyper-rectangle) that contains at least k points can form a k-anonymous group → multidimensional global recoding
  – How to partition the space?
    § One attribute at a time – which one to use?
    § How to split the selected attribute?

Age  Sex  Disease
20   M    HIV
23   F    HIV
25   M    Obesity
27   F    HIV
28   F    Cancer
29   F    Obesity
Mondrian algorithm
Mondrian(D, k)*
§ Attribute selection: find the QID attribute Q with the largest domain
§ Attribute split: find the median µ of Q
§ Create subspace S with all records of D whose value in Q is less than µ
§ Create subspace S′ with all records of D whose value in Q is at least µ
§ Recursive execution: if |S| ≥ k and |S′| ≥ k, return Mondrian(S, k) ∪ Mondrian(S′, k)
§ Else return D (generalized as a single group)

* LeFevre et al. Mondrian multidimensional k-anonymity. ICDE, 2006.
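A minimal sketch of the recursion above, restricted to a single numeric QID for brevity (the real algorithm also handles multiple attributes and categorical domains). Each final partition is generalized to the range of its observed values, which may be tighter than the hyper-rectangles shown on the slides:

```python
def mondrian(ages, k):
    """Strict Mondrian on one numeric QID: split at the median while both
    halves keep at least k records; otherwise emit one k-anonymous group
    generalized to the range of its values."""
    ages = sorted(ages)
    median = ages[len(ages) // 2]
    left = [a for a in ages if a < median]
    right = [a for a in ages if a >= median]
    if len(left) >= k and len(right) >= k:
        return mondrian(left, k) + mondrian(right, k)
    return [(ages[0], ages[-1])]  # no allowable cut: one group

print(mondrian([20, 23, 25, 27, 28, 29], 2))  # [(20, 25), (27, 29)]
```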
Example of applying Mondrian (k = 2)

Original data:
Age  Sex  Disease
20   M    HIV
23   F    HIV
25   M    Obesity
27   F    HIV
28   F    Cancer
29   F    Obesity

2-anonymous data:
Age      Sex    Disease
[20-26]  {M,F}  HIV
[20-26]  {M,F}  HIV
[20-26]  {M,F}  Obesity
[27-29]  F      HIV
[27-29]  F      Cancer
[27-29]  F      Obesity
Other works on partition-based algorithms
§ R-tree based algorithm [1]
§ Optimized partitioning for intended tasks [2]
  – Classification
  – Regression
  – Query answering
§ Algorithms for disk-resident data [3]
§ Extensions to prevent sensitive information disclosure [4]

[1] Iwuchukwu et al. K-anonymization as spatial indexing: toward scalable and incremental anonymization. VLDB, 2007.
[2] LeFevre et al. Workload-aware anonymization. KDD, 2006.
[3] LeFevre et al. Workload-aware anonymization techniques for large-scale datasets. TODS, 2008.
[4] Loukides et al. Preventing range disclosure in k-anonymised data. Expert Syst. Appl. 2011.
Clustering-based anonymization algorithms
§ Main idea of clustering-based anonymization
  1. Create clusters containing at least k records with "similar" values over the QIDs (seed selection, similarity measurement, stopping criterion)
  2. Anonymize the records in each cluster separately (local recoding and/or suppression)
Clustering-based anonymization algorithms
§ All these heuristics attempt to improve data utility
  – Seed selection (Random, Furthest-first): clusters need to be separated
  – Similarity measurement: clusters need to contain "similar" values
  – Stopping criterion (Size-based, Quality-based): clusters should not be "too" large
Bottom-up clustering algorithm*
§ Each record is selected as a seed to start a cluster
§ While there exists a group G s.t. |G| < k:
  § For each group G s.t. |G| < k:
    § Find the group G′ s.t. NCP(G ∪ G′) is minimal, and merge G and G′
§ For each group G s.t. |G| > 2×k:
  § Split G into ⌊|G| / k⌋ groups s.t. each group has at least k records
§ Generalize the QID values in each group
§ Return all groups

* Xu et al. Utility-Based Anonymization Using Local Recoding. KDD, 2006.
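The merge loop above can be sketched for one numeric QID, using the value range of a cluster as a stand-in for NCP. This is only an illustration: the splitting step for oversized groups is omitted, and the input is assumed to hold more than k records:

```python
def ncp(cluster):
    """Proxy for NCP on a 1-D numeric cluster: the range of its values."""
    return max(cluster) - min(cluster)

def bottom_up(values, k):
    """Merge undersized clusters into the neighbor that minimizes NCP,
    then generalize each cluster to its value range."""
    clusters = [[v] for v in values]          # every record starts as a seed
    while any(len(c) < k for c in clusters):
        g = next(c for c in clusters if len(c) < k)
        clusters.remove(g)
        best = min(clusters, key=lambda c: ncp(g + c))  # cheapest merge
        best.extend(g)
    return [(min(c), max(c)) for c in clusters]

print(bottom_up([20, 23, 25, 27, 28, 29], 2))  # [(20, 23), (25, 27), (28, 29)]
```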
Example of Bottom-up clustering algorithm (k = 2)

Original data:
Age  Sex  Disease
20   M    HIV
23   F    HIV
25   M    Obesity
27   F    HIV
28   F    Cancer
29   F    Obesity

Anonymized data:
Age      Sex  Disease
[20-25]  M    HIV
[20-25]  M    Obesity
[23-27]  F    HIV
[23-27]  F    HIV
[28-29]  F    Cancer
[28-29]  F    Obesity
Example of top-down clustering algorithm (k = 2)

Original data:
Age  Sex  Disease
20   M    HIV
23   F    HIV
25   M    Obesity
27   F    HIV
28   F    Cancer
29   F    Obesity

Anonymized data:
Age      Sex    Disease
[20-25]  {M,F}  HIV
[20-25]  {M,F}  HIV
[20-25]  {M,F}  Obesity
[27-29]  F      HIV
[27-29]  F      Cancer
[27-29]  F      Obesity
Other works on clustering-based anonymization
§ Constant factor approximation algorithms*
  – Publish only the cluster centers along with radius information
§ Combine partitioning with clustering for efficiency**

* Aggarwal et al. Achieving anonymity via clustering. ACM Trans. on Algorithms, 2010.
** Loukides et al. Preventing range disclosure in k-anonymised data. Expert Syst. Appl. 2011.
Preventing identity disclosure from diagnosis codes: Suppression
§ Suppression removes items or records from the data prior to releasing the data
§ Suppress ICD codes* appearing in less than a certain percentage of patient records
  – Intuition: such ICD codes can act as quasi-identifiers

Identified EMR data:
ID    ICD
Mary  401.0, 401.1
Anne  401.0, 401.3

Released EMR Data:
ICD           DNA
401.0, 401.1  AC…T
401.0, 401.3  GC…C

* Vinterbo et al. Hiding information by cell suppression. AMIA Annual Symposium '01.
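A sketch of the support-threshold suppression described above (the helper name is illustrative): codes whose support falls below the threshold are dropped from every record.

```python
from collections import Counter

def suppress_rare_codes(records, min_fraction):
    """Remove every ICD code that appears in fewer than min_fraction of the
    patient records (such rare codes can act as quasi-identifiers)."""
    n = len(records)
    support = Counter(code for r in records for code in set(r))
    keep = {code for code, s in support.items() if s / n >= min_fraction}
    return [[code for code in r if code in keep] for r in records]

records = [["401.0", "401.1"], ["401.0", "401.3"],
           ["401.0"], ["401.0", "401.1"]]
# 401.3 appears in 1/4 of the records and is suppressed at a 50% threshold.
print(suppress_rare_codes(records, 0.5))
```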
§ We had to suppress diagnosis codes appearing in less than 25% of the records in VNEC to prevent re-identification
… doing so we were left with only 5 out of ~6000 ICD codes! *
Code suppression – a case study using Vanderbilt's EMR data

5-digit ICD-9 code → 3-digit ICD-9 code → ICD-9 section
401.1 – Benign essential hypertension → 401 – Essential hypertension → Hypertensive disease
780.79 – Other malaise and fatigue → 780 → Symptoms
729.5 – Pain in limb → 729 – Other disorders of soft tissues → Rheumatism, excluding the back
789.0 – Abdominal pain → 789 – Other abdomen/pelvis symptoms → Symptoms
786.5 – Chest pain → 786 – Respiratory system symptoms → Symptoms

* Loukides, Gkoulalas-Divanis, Malin. Anonymization of Electronic Medical Records for Validating Genome-Wide Association Studies. PNAS 2010.
Preventing identity disclosure in EMR data: Generalization
§ Generalization replaces items with more general ones (usually with the help of a domain hierarchy)
§ Domain hierarchy: 5-digit ICD codes → 3-digit ICD codes → Sections → Chapters → Any
§ Generalize ICD codes to their 3-digit representation: 401.1 (benign essential hypertension) → 401 (essential hypertension)

Identified EMR data:
ID    ICD
Mary  401.0, 401.1
Anne  401.0, 401.3

Released EMR Data:
ICD  DNA
401  AC…T
401  GC…C
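The first hierarchy step is a simple truncation. A sketch (a simplification: it handles only plain numeric ICD-9 codes, not E/V codes or the deeper Section/Chapter levels):

```python
def to_3_digit(code):
    """Truncate a 5-digit ICD-9 code (e.g. '401.1') to its 3-digit category."""
    return code.split(".")[0]

record = ["401.0", "401.1"]
# Both codes collapse into the single 3-digit category 401.
print(sorted({to_3_digit(c) for c in record}))  # ['401']
```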
Code generalization – a case study using Vanderbilt's EMR data
§ Generalizing ICD codes from VNEC (5-digit → 3-digit)*
§ 95% of the patients remain re-identifiable
[Figure: % of re-identified sample vs. distinguishability (log scale), with no suppression and with suppression at 5%, 15%, and 25%; marked values include 96.5%, 75.0%, and 25.6%]

* Loukides et al. The Disclosure of Diagnosis Codes Can Breach Research Participants' Privacy. JAMIA, 2010.
Complete k-anonymity & km-anonymity
§ Complete k-anonymity: knowing that an individual is associated with any itemset, an attacker should not be able to associate this individual with fewer than k transactions

Original data:             2-complete anonymous data:
ICD            DNA         ICD  DNA
401.0, 401.1   AC…T        401  AC…T
401.2, 401.3   GC…C        401  GC…C
401.0, 401.1   CC…A        401  CC…A
401.4, 401.3   CA…T        401  CA…T

§ km-anonymity: knowing that an individual is associated with any m-itemset, an attacker should not be able to associate this individual with fewer than k transactions

Original data:             2²-anonymous data:
ICD            DNA         ICD           DNA
401.0, 401.1   AC…T        401.0, 401.1  AC…T
401.2, 401.3   GC…C        401, 401.3    GC…C
401.0, 401.1   CC…A        401.0, 401.1  CC…A
401.4, 401.3   CA…T        401, 401.3    CA…T
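The k^m guarantee can be verified by brute force: enumerate every m-itemset occurring in the data and check its support. This is feasible only for small data and is shown purely for illustration:

```python
from collections import Counter
from itertools import combinations

def is_km_anonymous(transactions, k, m):
    """Every m-itemset occurring in the data must be supported by
    at least k transactions."""
    support = Counter()
    for t in transactions:
        for itemset in combinations(sorted(set(t)), m):
            support[itemset] += 1
    return all(s >= k for s in support.values())

original = [{"401.0", "401.1"}, {"401.2", "401.3"},
            {"401.0", "401.1"}, {"401.4", "401.3"}]
anonymized = [{"401.0", "401.1"}, {"401", "401.3"},
              {"401.0", "401.1"}, {"401", "401.3"}]
print(is_km_anonymous(original, 2, 2))    # False: {401.2, 401.3} is unique
print(is_km_anonymous(anonymized, 2, 2))  # True: every pair occurs twice
```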
Applicability of complete k-anonymity and km-anonymity to medical data
§ Limited in the specification of privacy requirements
  – They assume too powerful attackers: all m-itemsets (combinations of m diagnosis codes) need protection
  – but… medical data publishers have detailed privacy requirements (privacy constraints); e.g., attackers may know who is diagnosed with abc or defgh, yet these models protect all 5-itemsets instead of the 2 itemsets
§ Explore a small number of possible generalizations
§ Do not take into account utility requirements
Policy-based anonymization model
§ Policy-based anonymization for ICD codes*
  § Global anonymization model
  § Models both generalization and suppression
  § Each original ICD code is replaced by a unique set of ICD codes – no need for generalization hierarchies

ICD codes: 493.00, 493.01, 296.01, 296.02, 174.01
Anonymized codes: (493.00, 493.01), (296.01, 296.02), ( ) = Φ
  – A generalized ICD code, e.g. (493.00, 493.01), is interpreted as 493.00, or 493.01, or both
  – A suppressed ICD code, ( ) = Φ, is not released

* Loukides, Gkoulalas-Divanis, Malin. Anonymization of Electronic Medical Records for Validating Genome-Wide Association Studies. PNAS '10.
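Applying such a set-based policy is a per-record lookup. A sketch with a hypothetical policy mapping (codes map either to a generalized set or to None for suppression; names are illustrative):

```python
def apply_policy(records, policy):
    """Replace each ICD code by its generalized set; None means suppressed.
    Codes missing from the policy are released unchanged."""
    out = []
    for r in records:
        anon = set()
        for code in r:
            target = policy.get(code, (code,))
            if target is not None:
                anon.add(tuple(sorted(target)))
        out.append(sorted(anon))
    return out

policy = {
    "493.00": ("493.00", "493.01"),
    "493.01": ("493.00", "493.01"),
    "296.01": ("296.01", "296.02"),
    "296.02": ("296.01", "296.02"),
    "174.01": None,                      # suppressed: not released
}
# 174.01 disappears; 493.00 is released as the set (493.00, 493.01).
print(apply_policy([{"493.00", "174.01"}], policy))
```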
Policy-based anonymization: Privacy model
§ Data publishers specify the diagnosis codes that need protection
§ Privacy Model: knowing that an individual is associated with one or more specific itemsets (privacy constraints), an attacker should not be able to associate this individual with fewer than k transactions
§ Privacy Policy: the set of all specified privacy constraints
§ Privacy is achieved when every privacy constraint is supported by at least k transactions in the published data, or does not appear at all

Original data:              Anonymized data:
ICD            DNA          ICD                    DNA
401.0, 401.1   AC…T         401.0, 401.1           AC…T
401.2, 401.3   GC…C         (401.2, 401.4), 401.3  GC…C
401.0, 401.1   CC…A         401.0, 401.1           CC…A
401.4, 401.3   CA…T         (401.2, 401.4), 401.3  CA…T
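The protection condition above can be verified directly. In this sketch a generalized item such as (401.2, 401.4) is modeled as the set of codes it may stand for, which is a simplifying assumption for illustration:

```python
def privacy_satisfied(transactions, privacy_policy, k):
    """Each privacy constraint (an itemset) must be supported by at least
    k transactions, or not appear at all."""
    for constraint in privacy_policy:
        support = sum(1 for t in transactions if constraint <= t)
        if 0 < support < k:
            return False
    return True

original = [{"401.0", "401.1"}, {"401.2", "401.3"},
            {"401.0", "401.1"}, {"401.4", "401.3"}]
# (401.2, 401.4) is modeled as containing both codes it may represent.
anonymized = [{"401.0", "401.1"}, {"401.2", "401.4", "401.3"},
              {"401.0", "401.1"}, {"401.2", "401.4", "401.3"}]
policy = [{"401.0", "401.1"}, {"401.2", "401.3"}]
print(privacy_satisfied(original, policy, 2))    # False: {401.2, 401.3} supported once
print(privacy_satisfied(anonymized, policy, 2))  # True
```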
Policy-based anonymization: Data utility considerations
§ Utility Constraints: the published data must remain as useful as the original data for conducting a GWAS on a disease or trait → the number of cases and controls in a GWAS must be preserved
§ Supporting utility constraints: ICD codes from the utility policy are generalized together, e.g., (296.00, 296.01); a larger part of the solution space is searched than when using domain generalization hierarchies
Policy-based anonymization: Measuring information loss
§ Utility Loss (UL): a measure to quantify the level of information loss incurred by anonymization
  § Captures the uncertainty introduced when interpreting an anonymized item: it grows with the number of items mapped to a generalized item, scaled by a weight (semantic closeness) and the fraction of affected transactions
  § Customizable
  § Favors (493.01) over (493.01, 493.02)
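As a hedged sketch of the functional form suggested above (the exact formula in the PNAS paper may differ), the penalty can grow exponentially with the number of codes mapped together, scaled by the semantic-closeness weight and the fraction of affected transactions:

```python
def utility_loss(generalized_item, weight, support, n_records):
    """Hypothetical UL-style penalty: (2^|item| - 1) possible non-empty
    interpretations, weighted and scaled by the affected fraction."""
    size = len(generalized_item)
    return (2 ** size - 1) * weight * (support / n_records)

# The smaller generalized item incurs less loss, as the slide says:
print(utility_loss(("493.01",), 1.0, 2, 4))           # 0.5
print(utility_loss(("493.01", "493.02"), 1.0, 2, 4))  # 1.5
```

This reproduces the stated preference: (493.01) scores lower loss than (493.01, 493.02) for the same weight and support.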
Policy-based anonymization algorithms
§ Goal: anonymize medical records so that
  – privacy is guaranteed
  – utility is high → many GWAS are supported simultaneously
  – the incurred information loss is minimal
§ Challenging optimization problem
  – NP-hard
  – feasibility depends on the constraints
§ Heuristic algorithms
  – Utility-Guided Anonymization of Clinical Profiles (UGACLIP)
  – Clustering-Based Anonymization (CBA)
[Table comparing UGACLIP and CBA on efficiency, scalability, and utility]
Anonymization algorithms: UGACLIP
§ Sketch of UGACLIP (PNAS)
Input: EMRs, Privacy Policy, Utility Policy, k
Output: Anonymized EMRs
While the Privacy Policy is not satisfied:
  Select the privacy constraint p that corresponds to the most patients
  While p is not protected:
    Select the ICD code i in p that corresponds to the fewest patients
    If i can be anonymized according to the Utility Policy:
      generalize i to (i, i′)
    Else:
      suppress each unprotected ICD code in p
§ Considers privacy constraints in a certain order
§ Protects a privacy constraint by set-based anonymization: generalization when the Utility Policy is satisfied, otherwise suppression
Anonymization algorithms: UGACLIP (example, k = 2)
Privacy Policy: {296.00, 296.01, 296.02}
Utility Policy: (296.00, 296.01)

EMR data:                          Anonymized EMR data:
ICD                     DNA        ICD                       DNA
296.00, 296.01, 296.02  CT…A       (296.00, 296.01), 296.02  CT…A
295.00, 295.01, 295.02  AC…T       295.00, 295.01, 295.02    AC…T
296.00, 296.02          GC…C       (296.00, 296.01), 296.02  GC…C

The data is protected: {296.00, 296.01, 296.02} appears 2 times.
The data remains useful for GWAS on bipolar disorder: associations between (296.00, 296.01) and DNA region CT…A are preserved.
Anonymization algorithms: CBA
§ Sketch of CBA
  § Retrieve from the Privacy Policy the ICD codes that need the least protection
  § Start from singleton clusters and gradually merge clusters of codes (driven by UL) that can be anonymized according to the utility policy with minimal UL
  § If the ICD codes are still not protected, suppress no more ICD codes than required, until the privacy requirements are met
Example: p1 = {i1}, p2 = {i5, i6}, p3 = {i3, i4}, k = 3
Case Study: EMRs from Vanderbilt University Medical Center
§ Datasets
  – VNEC: 2,762 de-identified EMRs from Vanderbilt, involved in a GWAS
  – VNECkc: a subset of VNEC for which we know which diseases are controls for others
  – BIOVU: all 79,087 de-identified EMRs from Vanderbilt's biobank (the largest dataset in the medical data privacy literature)*
§ Methods
  – UGACLIP and CBA
  – ACLIP (the state-of-the-art method; it does not take the utility policy into account)

* Loukides, Gkoulalas-Divanis. Utility-aware anonymization of diagnosis codes. IEEE TITB 2011.
UGACLIP & CBA: First algorithms to offer data utility in GWAS
Setting: k = 5, protecting single visits of patients, 18 GWAS-related diseases*
§ The result of ACLIP (the best competitor, which uses no utility constraints) is useless for validating GWAS
§ UGACLIP preserves 11 out of 18 GWAS
§ CBA preserves 14 out of 18 GWAS simultaneously

* Manolio et al. A HapMap harvest of insights into the genetics of common disease. J Clinic. Inv. 2008. (diseases related to all GWAS reported in Manolio)
Utility beyond GWAS
§ Supporting clinical case counts in addition to GWAS
  – learn the number of patients with sets of codes appearing in ≥10% of the records
  – useful for epidemiology and data mining applications
§ On VNECkc, queries can be estimated accurately (ARE = |est − act| / act, with ARE < 1.25), comparable to ACLIP
§ Anonymized data can support both GWAS and studies on clinical case counts
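The Average Relative Error reported here averages |est − act| / act over a query workload. A minimal sketch with hypothetical query answers:

```python
def average_relative_error(actual, estimated):
    """ARE over a query workload: mean of |est - act| / act."""
    return sum(abs(e - a) / a for a, e in zip(actual, estimated)) / len(actual)

# Hypothetical case-count queries: actual vs. estimated answers on anonymized data.
print(average_relative_error([10, 20], [8, 25]))  # 0.225
```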
Anonymizing BIOVU (79K EMRs)
§ Supporting clinical case counts in BIOVU
§ Very low error in query answering (Average Relative Error < 1)
§ All EMRs in the VUMC biobank can be anonymized and remain useful
Rule-based anonymization
§ Our approach*: we use PS-rules to express protection requirements against both identity and sensitive information disclosure
§ A PS-rule I → J (I: public items, J: sensitive items) is protected if:
  – at least k records support I (preventing identity disclosure)
  – at most c × 100% of the records that support I also support J (preventing sensitive information disclosure)

Contributions
§ Rule-based privacy model
  – more flexible and general than existing models
  – intuitive and able to capture real-world privacy requirements
§ Three anonymization algorithms
  – effective (better data utility and protection than the state of the art)
  – efficient (efficient rule-checking strategies, sampling, etc.)

* G. Loukides, A. Gkoulalas-Divanis, J. Shao. Anonymizing transaction data to eliminate sensitive inferences. DEXA '10 (extended to KAIS).
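Checking whether a single PS-rule is protected follows directly from the two conditions above (hypothetical helper name; transactions as sets of items):

```python
def ps_rule_protected(transactions, I, J, k, c):
    """PS-rule I -> J is protected if at least k transactions support I,
    and at most a fraction c of those also support J."""
    sup_I = [t for t in transactions if I <= t]
    if len(sup_I) < k:
        return False                      # identity disclosure risk
    sup_IJ = sum(1 for t in sup_I if J <= t)
    return sup_IJ / len(sup_I) <= c       # sensitive disclosure bound

data = [{"a", "b", "j"}, {"a", "b"}, {"b", "j"}, {"a", "b"}]
# 3 transactions support {a}; 1 of them also holds the sensitive item j.
print(ps_rule_protected(data, {"a"}, {"j"}, 3, 0.5))  # True
```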
Rule-based anonymization
§ An example of offering PS-rule based anonymization*

PS-rules: a → j; c d → g; d → h i

Original dataset:
Name  Diagnosis codes
Mary  a b c d g h i j
Bob   e f h i
Tom   d g j
Anne  e f g h
Brad  a b d e i
Jim   c f j

2²-anonymous dataset (hierarchy-based):
Name  Diagnosis codes
Mary  (a,b,c) (d,e,f) g h i j
Bob   (d,e,f) h i
Tom   (d,e,f) g j
Anne  (d,e,f) g h
Brad  (a,b,c) (d,e,f) i
Jim   (a,b,c) (d,e,f) j

Anonymous dataset based on the PS-rules model:
Name  Diagnosis codes
Mary  a (b,c) d g h i j
Bob   e f h i
Tom   d g j
Anne  e f g h
Brad  a (b,c) d e i
Jim   (b,c) f j

✓ Support of fine-grained, flexible privacy requirements
✓ Privacy protection from both identity and sensitive information disclosure
✓ General privacy model that incorporates existing privacy models
✓ PS-rules can be automatically discovered and specified

* Grigorios Loukides, Aris Gkoulalas-Divanis, Jianhua Shao. Efficient and flexible anonymization of transaction data. Knowledge and Information Systems (KAIS), 36(1), pp. 153-210, 2013.
Experimental results: Setup
§ Datasets: BMS1 and BMS2 contain click-stream data; POS contains sales transaction data
§ Methods: Tree-based and Sample-based vs. Baseline (no pruning) and Apriori Anonymization*
§ Evaluation
  – Is the anonymized data useful in aggregate query answering?
  – How efficient are the algorithms?

* M. Terrovitis, N. Mamoulis, P. Kalnis. Privacy-preserving anonymization of set-valued data. PVLDB '08.
Data Utility: uniform privacy requirements
§ Same protection as Apriori for identity disclosure, and additionally thwarts sensitive information disclosure § Our algorithms offer many times more accurate query answering
BMS2 dataset, k=5,p=2 (all 2-itemsets need protection from identity disclosure)
Data Utility: detailed privacy requirements
§ Apriori cannot take the detailed privacy requirements into account and overdistorts data § Our algorithms protect data no more than necessary to satisfy these requirements, achieving much higher data utility
BMS2 dataset, k=5,p=2, type 2-1 rules of varying number and rules of other types
Efficiency
Sample-based is the fastest and most scalable; Apriori is the slowest
BMS2 dataset, k=5,p=2, 5K type 2-1 rules
Synthetic data of varying |D| and |I|, k=5,p=2, 5K rules of type 2-1
Anonymization of RT-datasets (e.g., demographics + diagnoses codes)
Privacy Threat: Attackers know some relational attribute values (e.g., demographics) plus some sensitive items (e.g., diagnoses) for an individual.
* G. Poulis, G. Loukides, A. Gkoulalas-Divanis, S. Skiadopoulos. Anonymizing data with relational and transactional attributes. PKDD '13.
SECRETA anonymization tool
SECRETA*,**: System for Evaluating and Comparing RElational and Transaction Anonymization algorithms.
* G. Poulis, A. Gkoulalas-Divanis, G. Loukides, C. Tryfonopoulos, S. Skiadopoulos. SECRETA: A system for evaluating and comparing relational and transaction anonymization algorithms. EDBT '14.
** G. Poulis, G. Loukides, A. Gkoulalas-Divanis, S. Skiadopoulos. Anonymizing data with relational and transactional attributes. PKDD '13.

✓ Evaluates algorithms for relational, transaction, and RT-dataset anonymization
✓ Integrates 9 popular anonymization algorithms and 3 bounding methods for combining them
✓ R: supports Incognito, Cluster, Top-down, and Full subtree bottom-up
✓ T: supports COAT, PCTA, Apriori, LRA, and VPA
✓ Supports two modes of operation: Evaluation and Comparison
§ Explained the need for privacy in medical data sharing
§ Presented the state-of-the-art in privacy-preserving medical data publishing to support intended analyses
§ Elaborated on the policy-based anonymization model, which allows data publishers to specify detailed privacy and utility constraints for the data anonymization process
§ Discussed methods for anonymizing data of co-existing data types, and introduced the SECRETA anonymization tool
Summary
Thank you! Questions?
Internship Opportunities @ IBM
§ 10-15 paid internships per year @ DRL (out of ~100 applications)
§ ~5 unpaid internships
§ Internship duration: 3-4 months; start date: flexible
§ Positions are advertised in December and filled as soon as possible
§ Each candidate identifies an RSM @ DRL to work with and a project that is of mutual interest to the applicant and to IBM
§ The candidate submits a 1-2 page document on the project that they will be involved in if accepted
§ At the end of the internship the candidate gives a talk to the lab on their accomplishments
§ Need more information? Email me at: arisdiva@ie.ibm.com