Microdata Sharing Via Pseudonymization
David Galindo, Computer Science Department, University of Malaga
Eric R. Verheul, PWC Netherlands & University of Nijmegen
UNECE Work session on statistical data confidentiality
Manchester, 2007 December 18th
20-06-20062
Motivation
Microdata on individuals is essential for empirical research
Releasing it directly compromises the privacy of those individuals
Goal: to build privacy-preserving microdata sharing systems through pseudonymization
Problem statement
Suppliers own confidential microdata on individuals: (id1,D(id1)),…,(idn,D(idn))
Researchers want to correlate microdata from different Suppliers
Example: A Researcher wants to find out the correlation between drug prescription (Chemists) and traffic accidents (Insurers)
Question: How to enable Researchers to correlate microdata without having access to sensitive information?
Framework
[Figure: Chemists hold (id1,DataChm(id1)),…,(idn,DataChm(idn)) and Insurers hold (idm,DataIns(idm)),…,(idt,DataIns(idt)). The Researcher says "I want to correlate"; the Suppliers ask "Maybe de-identified data?"]
Supplying de-identified data
[Figure: Chemists supply DataChm(id1),…,DataChm(idn) and Insurers supply DataIns(idm),…,DataIns(idt), in both cases without identifiers.]
If Suppliers de-identify the data by:
- removing the identifier field
- applying Statistical Disclosure Control (SDC) mechanisms
no sensitive information is leaked, but…
Matching is not possible!
Pseudonymizing data via TTPs
Solution 1: a Trusted Third Party replaces real identifiers by random identifiers (pseudonyms)
[Table: id1 → P(id1), …, idl → P(idl), where each P(id) is random.]
This table is known only to the TTP
[Figure: the Researcher receives (P(idm),DataIns(idm)),…,(P(idt),DataIns(idt)) from the Insurers and (P(id1),DataChm(id1)),…,(P(idn),DataChm(idn)) from the Chemists. Records with equal pseudonyms can be matched.]
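Solution 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name `TablePseudonymizer`, the 128-bit pseudonym length and the sample records are all hypothetical and do not come from the slides.

```python
import secrets

class TablePseudonymizer:
    """Hypothetical TTP that keeps a secret table id -> random pseudonym."""

    def __init__(self):
        self._table = {}  # must stay secret: it links every id to its pseudonym

    def pseudonym(self, ident):
        # Draw a fresh random pseudonym the first time an id is seen,
        # and reuse it afterwards so that matching stays possible.
        if ident not in self._table:
            self._table[ident] = secrets.token_hex(16)
        return self._table[ident]

ttp = TablePseudonymizer()

# Made-up sample microdata held by the two Suppliers.
chemists = {"id1": "drug A", "id2": "drug B"}
insurers = {"id2": "accident", "id3": "no claim"}

# What the Researcher receives: pseudonyms instead of identifiers.
pn_chem = {ttp.pseudonym(i): d for i, d in chemists.items()}
pn_ins = {ttp.pseudonym(i): d for i, d in insurers.items()}

# Matching on equal pseudonyms, without ever seeing an identifier.
matches = {p: (pn_chem[p], pn_ins[p]) for p in pn_chem.keys() & pn_ins.keys()}
```

The table grows with the number of identifiers and has to be kept secret in full, which is the storage cost a table-based TTP incurs.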
Pseudonymizing data via TTPs (II)
Advantages:
- Unconditional security (w.r.t. pseudonymization)
- Matching is possible
Drawback: TTP must store a huge table secretly
Solution 2: Use a block cipher (Enc(K,·),Dec(K,·)), and then P(id)=Enc(K,id)
Advantage: Only the key K must be stored secretly
Drawbacks:
- Security is not unconditional
- Different Researchers might not have the same access rights
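To make P(id)=Enc(K,id) concrete, here is a toy sketch. The slides do not name a cipher, so the code below assumes a 4-round Feistel network with HMAC-SHA256 as round function, chosen only because it needs nothing beyond the Python standard library; a real deployment would use a vetted block cipher. The names `enc`, `dec`, `BLOCK`, `ROUNDS` and the sample key are hypothetical.

```python
import hashlib
import hmac

BLOCK = 32   # block size in bytes (two 16-byte halves); an arbitrary choice
ROUNDS = 4   # number of Feistel rounds; illustrative only

def _xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def _round(key: bytes, rnd: int, half: bytes) -> bytes:
    # Keyed round function: HMAC-SHA256 truncated to half a block.
    return hmac.new(key, bytes([rnd]) + half, hashlib.sha256).digest()[:BLOCK // 2]

def enc(key: bytes, ident: str) -> bytes:
    """P(id) = Enc(K, id). Deterministic, so equal ids give equal pseudonyms."""
    raw = ident.encode("utf-8")
    if len(raw) > BLOCK - 1:
        raise ValueError("identifier too long for one block")
    # One length byte, the identifier, then zero padding up to BLOCK bytes.
    block = bytes([len(raw)]) + raw + bytes(BLOCK - 1 - len(raw))
    left, right = block[:BLOCK // 2], block[BLOCK // 2:]
    for rnd in range(ROUNDS):
        left, right = right, _xor(left, _round(key, rnd, right))
    return left + right

def dec(key: bytes, pnym: bytes) -> str:
    """Dec(K, P(id)) = id: undo the rounds in reverse order."""
    left, right = pnym[:BLOCK // 2], pnym[BLOCK // 2:]
    for rnd in reversed(range(ROUNDS)):
        left, right = _xor(right, _round(key, rnd, left)), left
    block = left + right
    return block[1:1 + block[0]].decode("utf-8")

K = bytes.fromhex("00112233445566778899aabbccddeeff")  # sample key, not secret
p = enc(K, "id42")
```

Only `K` has to be stored: any pseudonym can be recomputed with `enc` or inverted with `dec`, so no table is needed.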
Pseudonymizing data via TTPs (III)
[Figure: one Researcher holds (P(idm),DataIns(idm)),…,(P(id*),DataIns(id*)),…,(P(idt),DataIns(idt)); another holds (P(id1),DataChm(id1)),…,(P(id*),DataChm(id*)),…,(P(idn),DataChm(idn)). They are not allowed to match Chemists and Insurers data, but id* has the same pseudonym P(id*) in both databases: "We share and win!"]
Pseudonymizing data via TTPs (IV)
Solution 3: Allocate a different key Ki for every Researcher Ri
Pseudonyms are destination-dependent: P(id,Ri)=Enc(Ki,id)
[Figure: R2 holds (P(idm,R2),DataIns(idm)),…,(P(id*,R2),DataIns(id*)),…,(P(idt,R2),DataIns(idt)); R1 holds (P(id1,R1),DataChm(id1)),…,(P(id*,R1),DataChm(id*)),…,(P(idn,R1),DataChm(idn)). P(id*,R1) and P(id*,R2) look unrelated.]
Pseudonymizing data via TTPs (V)
Advantage: Disallowed matching among malicious Researchers is prevented
Drawback: TTP must be on-line to perform sensitive operations (pseudonymization and matching)
Let’s see why…
Pseudonymization with symmetric encryption
Supplying pseudonymized data:
- Supplier Sj sends datablocks D(id1),…,D(idl) to Researcher Ri
- Sj sends the identities id1,…,idl in the same order to the TTP
- TTP sends the list P(id,Ri)=Enc(Ki,id) to Ri
- Ri forms the pseudonymized database (P(id1,Ri),D(id1)),…,(P(idl,Ri),D(idl))
Pseudonymization with symmetric encryption
Matching Ri and Rd pseudonymized databases:
- Ri sends to Rd the data D(id1,i),…,D(idl,i)
- Ri sends to the TTP the pseudonyms P(id1,Ri),…,P(idl,Ri)
- TTP decrypts Dec(Ki,P(id,Ri))=id and encrypts P(id,Rd)=Enc(Kd,id). The result is sent to Rd
- Rd matches the pseudonymized databases (P(id1,Rd),D(id1,i)),…,(P(idl,Rd),D(idl,i)) and (P(id1,Rd),D(id1,d)),…,(P(idm,Rd),D(idm,d))
As a result, the TTP is a bottleneck in the system
[Figure: Rd's database (P(idm,Rd),D(idm,Rd)),…,(P(id*,Rd),D(id*,Rd)),…,(P(idt,Rd),D(idt,Rd)) next to Ri's database (P(id1,Ri),D(id1,Ri)),…,(P(id*,Ri),D(id*,Ri)),…,(P(idn,Ri),D(idn,Ri)).]
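The supply and matching steps with per-Researcher keys can be simulated end to end. The sketch below is an assumption-laden toy: identifiers are modelled as integers below 2^128, Enc/Dec is a 4-round Feistel network over HMAC-SHA256 (a stand-in, since the slides name no cipher), and the class `TTP`, its methods and the sample records are hypothetical.

```python
import hashlib
import hmac
import secrets

MASK64 = (1 << 64) - 1

def _f(key, rnd, half):
    # Keyed round function on 64-bit halves.
    msg = bytes([rnd]) + half.to_bytes(8, "big")
    return int.from_bytes(hmac.new(key, msg, hashlib.sha256).digest()[:8], "big")

def enc(key, block):
    left, right = block >> 64, block & MASK64
    for rnd in range(4):
        left, right = right, left ^ _f(key, rnd, right)
    return (left << 64) | right

def dec(key, block):
    left, right = block >> 64, block & MASK64
    for rnd in reversed(range(4)):
        left, right = right ^ _f(key, rnd, left), left
    return (left << 64) | right

class TTP:
    """Hypothetical on-line TTP holding one secret key Ki per Researcher Ri."""

    def __init__(self, researchers):
        self._keys = {r: secrets.token_bytes(32) for r in researchers}

    def pseudonymize(self, ids, researcher):
        # Supply step: P(id,Ri) = Enc(Ki, id), returned in the order received.
        return [enc(self._keys[researcher], i) for i in ids]

    def translate(self, pnyms, src, dst):
        # Matching step: Dec(Ki, P(id,Ri)) = id, then P(id,Rd) = Enc(Kd, id).
        return [enc(self._keys[dst], dec(self._keys[src], p)) for p in pnyms]

ttp = TTP(["Ri", "Rd"])

# Made-up Supplier data: integer identifiers mapped to datablocks.
chem = {101: "drug A", 102: "drug B", 103: "drug C"}   # supplied to Ri
ins = {102: "accident", 103: "no claim", 104: "accident"}  # supplied to Rd

db_i = dict(zip(ttp.pseudonymize(list(chem), "Ri"), chem.values()))
db_d = dict(zip(ttp.pseudonymize(list(ins), "Rd"), ins.values()))

# Matching: Ri sends its data to Rd and its pseudonyms to the TTP, same order.
translated = ttp.translate(list(db_i), "Ri", "Rd")
db_i_at_d = dict(zip(translated, db_i.values()))
matched = {p: (db_i_at_d[p], db_d[p]) for p in db_i_at_d.keys() & db_d.keys()}
```

Every call to `pseudonymize` and `translate` needs the secret keys, so both operations have to go through the TTP while it is on-line.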
Pseudonymization using public key crypto
Let $G = \langle g \rangle$ be a group of prime order $p$. Let $H: \{0,1\}^* \to G$ be a hash function
TTP assigns a secret key $x_i \in \mathbb{Z}_p$ to Researcher Ri
$P(\mathrm{id}, R_i) = H(\mathrm{id})^{x_i}$
Supplying pseudonymized data from Sj to Ri:
- Supplier Sj and Researcher Ri jointly compute the pseudonymized database {P(id,Ri),D(id)}
- TTP allocates pseudonymizing keys $(\mu, \nu) \in \mathbb{Z}_p \times \mathbb{Z}_p$ such that $\mu \cdot \nu = x_i$; $\mu$ is sent to Sj, $\nu$ is sent to Ri
- Sj computes and sends $H(\mathrm{id}_1)^{\mu}, \dots, H(\mathrm{id}_l)^{\mu}$ to Ri
- Ri computes $(H(\mathrm{id})^{\mu})^{\nu} = H(\mathrm{id})^{x_i} = P(\mathrm{id}, R_i)$
- Ri forms the pseudonymized database (P(id1,Ri),D(id1)),…,(P(idl,Ri),D(idl))
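The split-key computation can be checked numerically. The following is a toy sketch, not the authors' implementation: it assumes the group is the subgroup of squares modulo a safe prime `P = 2*Q + 1` (so `Q` plays the role of the prime group order called p on the slide), uses a roughly 61-bit modulus that is far too small for real use, and hashes into the group by squaring a SHA-256 digest. The helpers `is_prime`, `find_safe_prime` and `H` are hypothetical names, and `pow(mu, -1, Q)` requires Python 3.8 or later.

```python
import hashlib
import secrets

def is_prime(n):
    """Miller-Rabin test; these bases are deterministic for n below 3.3e24."""
    bases = (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37)
    if n < 2:
        return False
    for b in bases:
        if n % b == 0:
            return n == b
    d, s = n - 1, 0
    while d % 2 == 0:
        d //= 2
        s += 1
    for a in bases:
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(s - 1):
            x = x * x % n
            if x == n - 1:
                break
        else:
            return False
    return True

def find_safe_prime(start):
    """Return (p, q) with q prime, q >= start, and p = 2q + 1 prime."""
    q = start | 1
    while not (is_prime(q) and is_prime(2 * q + 1)):
        q += 2
    return 2 * q + 1, q

P, Q = find_safe_prime(1 << 60)  # toy size, for illustration only

def H(ident):
    """Toy hash into the order-Q subgroup of squares modulo P."""
    digest = int.from_bytes(hashlib.sha256(ident.encode()).digest(), "big")
    return pow(2 + digest % (P - 3), 2, P)

# TTP: secret key x_i for Researcher Ri, split as mu * nu = x_i (mod Q).
x_i = 1 + secrets.randbelow(Q - 1)
mu = 1 + secrets.randbelow(Q - 1)   # sent to Supplier Sj
nu = x_i * pow(mu, -1, Q) % Q       # sent to Researcher Ri

ids = ["id1", "id2", "id3"]
blinded = [pow(H(i), mu, P) for i in ids]   # Sj sends H(id)^mu to Ri
pnyms = [pow(b, nu, P) for b in blinded]    # Ri obtains H(id)^(x_i)
```

Neither share determines `x_i` by itself, since `mu` is uniform and `nu` is derived from it; the Supplier and the Researcher each perform one exponentiation per identifier, and the TTP only hands out the two shares beforehand.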
Pseudonymization with public key crypto (II)
Matching Ri and Rd pseudonymized databases:
- This can be done by Ri and Rd with a 1-round interactive protocol, provided certain keys are obtained off-line from the TTP
- Neither Ri nor Rd learns its pseudonymizing key xi, xd, even if they collude
- Rd only learns D(id,Ri) for ids in the intersection
- Security is based on the Decision Diffie-Hellman assumption
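[Figure: Rd's database ($H(\mathrm{id}_m)^{x_d}$,D(idm,Rd)),…,($H(\mathrm{id}^*)^{x_d}$,D(id*,Rd)),…,($H(\mathrm{id}_t)^{x_d}$,D(idt,Rd)) next to Ri's database ($H(\mathrm{id}_1)^{x_i}$,D(id1,Ri)),…,($H(\mathrm{id}^*)^{x_i}$,D(id*,Ri)),…,($H(\mathrm{id}_n)^{x_i}$,D(idn,Ri)).]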
Pseudonymization with public key crypto (III)
Advantages:
- Matching is possible
- Disallowed matching among malicious Researchers is prevented
- TTP is not a bottleneck (only delivers off-line crypto keys)
Drawbacks:
- Suppliers must collaborate for every pseudonymization
- Interactive protocols (on-line communication)
Advanced setting
Properties
- Suppliers and Accumulators are assumed Honest-But-Curious
- Researchers are assumed Malicious
- Accumulators’ intersection and union operations are non-interactive
- Two levels of pseudonymization corresponding to the different levels of trust
- It uses ‘composite bilinear groups’
Governance
The use of these protocols is governed, from a functional perspective, by a Regulatory Privacy Body (RPB). A strict licensing infrastructure will be enforced by the RPB, describing:
- Which parties are allowed to perform what protocols with each other
- What kind of data can be exchanged
- Which subsets of identities or pseudonyms are allowed as input to the protocols
Thanks!