Re-identification of de-identified PHI date elements


Post on 05-Dec-2014

DESCRIPTION

Presented in the Late Breaking Research Abstracts - Machine Learning in Relation to EMRs session at the American Medical Informatics Association (AMIA) 2013 Annual Symposium on 11/20/2013

TRANSCRIPT

Page 1: Re-identification of de-identified PHI date elements

Inadvertent disclosure of protected health information (PHI) in randomly shifted date elements for de-identification

Tomasz Adamusiak MD PhD

7omasz

Page 2: Re-identification of de-identified PHI date elements

There is a high probability that some patients in your de-identified data sets can have their dates re-identified on subsequent releases if you randomly shift dates each time.

Page 3: Re-identification of de-identified PHI date elements

Two methods for de-identification according to HIPAA Privacy Rule

• Expert determination § 164.514(b)(1)

• Safe harbor § 164.514(b)(2)

• Removal of dates -> data useless for research

• Date shifting ≠ removal (not safe harbor)

Page 4: Re-identification of de-identified PHI date elements

Patient: John Doe

De-identified data set 1

Date of birth randomly shifted by +/- 31 days

Time

Page 5: Re-identification of de-identified PHI date elements

The same patient in multiple de-identified data sets

Date of birth randomly shifted by +/- 31 days

Time

Non-random interval 2*31+1 = 63 days

Page 6: Re-identification of de-identified PHI date elements

Time

Can you guess when the real DOB is?

Page 7: Re-identification of de-identified PHI date elements

Time

Can you guess when the real DOB is?

Page 8: Re-identification of de-identified PHI date elements

Time

In fact we only need two extremes
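The point about the two extremes can be illustrated with a short sketch. It assumes an attacker can link one patient's shifted dates across releases and knows the shift window of +/- 31 days; the function name `candidate_dob_range` and the example date are hypothetical, not taken from the slides.

```python
from datetime import date, timedelta

MAX_SHIFT_DAYS = 31  # assumed shift window of +/- 31 days, as in the slides


def candidate_dob_range(observed, max_shift_days=MAX_SHIFT_DAYS):
    """Return the (earliest, latest) true date consistent with every observed shifted date."""
    window = timedelta(days=max_shift_days)
    # The true date cannot be more than max_shift_days before the latest observation,
    # nor more than max_shift_days after the earliest observation.
    earliest = max(observed) - window
    latest = min(observed) + window
    return earliest, latest


# Hypothetical patient whose date of birth was shifted by -31, +4 and +31 days
# in three separate releases.
true_dob = date(1970, 6, 15)
releases = [true_dob + timedelta(days=s) for s in (-31, 4, 31)]

low, high = candidate_dob_range(releases)
print(low, high)  # both ends coincide once the two extreme shifts have been observed
```

When the earliest and latest observations are 62 days apart, the range collapses to a single date. Observations that are closer together still narrow the range, which is why every additional release leaks information.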

Page 9: Re-identification of de-identified PHI date elements

This probability can be estimated with the binomial distribution

$$\Pr = 2 \binom{n}{2} p^{2} (1 - p)^{n-2}$$

p – probability of shift to one of the extremes, e.g., 1/62

n – number of releases of data
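The estimate above, read as 2 * C(n, 2) * p^2 * (1 - p)^(n - 2), can be evaluated directly. The sketch below is only an illustration: the function name `both_extremes_probability` is hypothetical, and the default p = 1/62 is the example value from the slide.

```python
from math import comb


def both_extremes_probability(n, p=1 / 62):
    """Evaluate 2 * C(n, 2) * p^2 * (1 - p)^(n - 2) for n releases.

    p is the probability of a shift landing on one particular extreme.
    """
    if n < 2:
        return 0.0  # both extremes cannot appear in fewer than two releases
    return 2 * comb(n, 2) * p**2 * (1 - p) ** (n - 2)


single_patient_two_releases = both_extremes_probability(2)
per_hundred_ten_releases = 100 * both_extremes_probability(10)

print(f"one patient, two releases: {single_patient_two_releases:.4f}")
print(f"expected patients per hundred, ten releases: {per_hundred_ten_releases:.1f}")
```

With these inputs the values come out at roughly 0.0005 for one patient and two releases, and about two patients per hundred for ten releases.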

Page 10: Re-identification of de-identified PHI date elements

Results

• For a single patient and two data releases the risk is relatively low (0.0005)

• For a hundred patients and ten releases, on average two patients can be re-identified

• Larger data sets and more releases mean higher risk
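As a rough cross-check of these figures, the process can also be simulated. The sketch below assumes each release draws an independent, uniformly distributed whole-day shift between -31 and +31 (the slides do not state the distribution), and counts how often both extreme shifts turn up for the same patient. The function name `both_extremes_rate` is hypothetical. Because this model has 63 equally likely shifts, its results can differ slightly from an estimate that uses p = 1/62.

```python
import random


def both_extremes_rate(n_releases, trials, max_shift=31, seed=0):
    """Fraction of simulated patients whose releases contain both extreme shifts."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = 0
    for _ in range(trials):
        # One independent random shift per release for this simulated patient.
        shifts = {rng.randint(-max_shift, max_shift) for _ in range(n_releases)}
        if -max_shift in shifts and max_shift in shifts:
            hits += 1
    return hits / trials


rate = both_extremes_rate(n_releases=10, trials=50_000)
print(f"expected re-identified patients per 100: {100 * rate:.2f}")
```

Raising `n_releases` in this sketch shows the rate climbing quickly, in line with the last bullet.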

Page 11: Re-identification of de-identified PHI date elements

Conclusions

• Stop using random shifts immediately

• Evaluate the risk of disclosure for already released data

• Use a non-random value for the shift (e.g., [SSN digit +1] x 31).
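A minimal sketch of the non-random shift suggested in the last bullet follows. The slides only give the expression (SSN digit + 1) x 31; which digit is used and the direction of the shift are not specified, so the forward shift and the function names `fixed_shift_days` and `shift_date` are assumptions made for illustration.

```python
from datetime import date, timedelta


def fixed_shift_days(ssn_digit, unit_days=31):
    """Per-patient shift in days: (digit + 1) * 31, as in the slide's example."""
    if not 0 <= ssn_digit <= 9:
        raise ValueError("ssn_digit must be a single digit from 0 to 9")
    return (ssn_digit + 1) * unit_days


def shift_date(original, ssn_digit):
    """Shift a date forward by the patient's fixed offset (direction is an assumption)."""
    return original + timedelta(days=fixed_shift_days(ssn_digit))


# The same patient receives the same shifted date in every release,
# so repeated releases reveal no new extremes.
release_1 = shift_date(date(2000, 1, 1), ssn_digit=0)
release_2 = shift_date(date(2000, 1, 1), ssn_digit=0)
print(release_1, release_2)
```

Because the offset depends only on a stable patient attribute, intervals between one patient's events are preserved and the shifted dates do not vary between releases.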

Page 12: Re-identification of de-identified PHI date elements

Thank you

• Mary Shimoyama PhD