Setting the Stage: How De-Identification Came into U.S. Law, and Why the Debate Matters Today
Professor Peter Swire
Ohio State University/Future of Privacy Forum
FPF Conference on DeIdentification
National Press Club
December 5, 2011
Overview
• U.S. history: Census, federal agency statistics, & HIPAA
• Why Deidentification (DeID) matters today
  – The debate – it works or it doesn’t
  – Three threat models
  – Analogy to law enforcement
• Big picture – useful for many tasks, even with the limits shown by scientists
Census, Statistics & DeID
• Many years of Census experience
  – Highly useful data
  – Deidentified
• Periodic opposition to mandatory reporting
• Needed strong confidentiality promises
  – Suppress small cell size
    • Only home in a census tract
  – Fuzz data
  – Strict rules against release even for national security purposes
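The "suppress small cell size" rule above can be illustrated with a minimal sketch. It assumes a hypothetical threshold of 5 and toy tract labels; neither comes from the slides, and actual Census disclosure rules are more involved.

```python
from collections import Counter

def suppress_small_cells(records, key, min_count=5):
    """Count records per cell; replace counts below min_count with None.

    min_count=5 is an illustrative assumption, not a Census rule.
    """
    counts = Counter(key(r) for r in records)
    return {cell: (n if n >= min_count else None) for cell, n in counts.items()}

# Hypothetical toy data: (census_tract, housing_type).
# tract_B contains the "only home in a census tract".
records = [("tract_A", "home")] * 7 + [("tract_B", "home")]

table = suppress_small_cells(records, key=lambda r: r[0])
print(table)
```

The single home in tract_B is withheld from the published table, while the count for tract_A is released unchanged.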
Federal Agency Statistics
• Codification in Confidential Information Protection & Statistical Efficiency Act of 2002 (CIPSEA)
  – Good history by Sylvester & Lohr
• Basic rule: if collect data for statistical purposes, use only for statistical purposes, don’t ReID
• Funny thing: same culture & practice for years in private sector polling (Gallup-style) and market research
• Many years of practice here
• Perhaps a basic guideline going forward?
HIPAA
• 1999-2000 regs informed by Sweeney research
• Safe harbor – delete a lot of specified data fields
• Expert (I pushed for this) – where statistical basis, can achieve DeID based on risk, not safe harbor
• Data use agreements – release for research, with enforceable promise not to ReID
• In short:
  – If scrubbed enough, can release publicly
  – If scrubbed less, then enforceable promise not to ReID
Why It Matters Today
• Now data mining far beyond specialized researchers
  – The Internet (commercial since only 1993) gives me access to data
  – Storage & processing on my laptop > mainframe of 25 years ago
  – Search is way better
  – The erosion of practical obscurity – “they” really may figure out who “we” are
The Debate is Joined
• Ohm (and others) draw on Sweeney-type research
  – DeID likely to lead to ReID
• Yakowitz (and others) respond
  – Benefits of public data enormous
  – Practical risk/harm from ReID low
• Anonymization creates huge risks or low risks?
• Worth doing anonymization/DeID at all?
• Today’s conference to shed light on this …
Threat Models – Which Attackers?
• Three types of attackers on “anonymized” data:
  – Insiders “peeping”
  – Outside hackers intruding
  – The public who doesn’t get into the database
• DeID often effective for first two
• Ohm/Yakowitz debate primarily on the third
Insiders Peeping
• Swire 2009 Peeping article, at peterswire.net
• Threat: employee or employee of sub-contractor sees the data and “peeps”
  – Sees celebrity information - Clooney
  – Sees information about friend/family/ex
  – Sees information to create harm (ID theft, blackmail)
• Anonymization useful part of anti-peeping strategy
  – Employee doesn’t search or stumble upon Clooney
  – Employee may lack tools to do Sweeney-type analysis
  – Audit logs catch employees who try
  – Give employees access to statistical data, not PII
Outside Hackers
• Hacker may intrude for a short while
  – Anonymization may prevent “ah hah” – Clooney
• Hacker may download database
  – If so, then hacker becomes similar to the public
  – May or may not be good at Sweeney-type tricks
  – May be focused on specific types of information, and not try to ReID
• Less-than-perfect DeID may substantially reduce incidence of ReID
Re-ID by “The Public”
• So, masking may help against some threats
• The debate, though, is whether “the public” (i.e., the experts) can ReID
• Sweeney & other research provides startling & important results of ReID
  – Can everything be ReIdentified?
ReID & 2 Famous Studies
• Date of birth, zip, & gender -> 80%+ unique
  – Yes
  – BUT, DOB is off-the-charts different
    • Gender – splits population in half
    • DOB = 366 (days) x 80 (years) = over 25,000 cells
    • Moral – DOB ridiculously strong to ReID
• Netflix: can ReID over 60% of movie reviews
  – BUT, takes known IMDb reviewers and matches to Netflix
  – Can ReID a lot, but not a big effect
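The cell-count arithmetic in the date-of-birth bullet can be checked with a short sketch. The day, year, and gender figures are taken from the slide; the zip code population of 30,000 is a hypothetical value chosen only for illustration, and the even-spread average is a rough simplification rather than a uniqueness estimate.

```python
DAYS = 366     # from the slide: possible birthdays, including Feb 29
YEARS = 80     # from the slide: span of birth years considered
GENDERS = 2    # from the slide: gender splits the population in half

dob_cells = DAYS * YEARS             # distinct dates of birth
cells_per_zip = dob_cells * GENDERS  # (DOB, gender) cells within one zip code

def avg_people_per_cell(zip_population):
    """Average residents per (DOB, gender) cell if spread evenly (a simplification)."""
    return zip_population / cells_per_zip

ASSUMED_ZIP_POPULATION = 30_000  # hypothetical, not from the slides

print(dob_cells)       # 29280
print(cells_per_zip)   # 58560
print(avg_people_per_cell(ASSUMED_ZIP_POPULATION))
```

With far more cells than residents in the assumed zip code, the average occupancy falls below one person per cell, which is consistent with the slide's point that DOB contributes much more to ReID than gender does.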
Law Enforcement Analogy
• So, is ReID generally easy or hard, useful or useless?
• Consider cop with a bunch of clues (male, tall, red hair, etc.)
  – Enough to ReID? No
  – Helpful to ReID? Yes
  – A matter of how much legwork, analysis, extra data is available and accurate
  – Very big range for difficulty of finding the suspect
  – Same is true for ability of “the public” to ReID, to name the suspect
Conclusion
• Issue matters today -- more data potentially available to “the public”
• History of useful anonymization in statistics
  – If collect data for statistical purposes, use only for statistical purposes, store that way, don’t ReID
• DeID helps against insider & hacker threats
• ReID by “the public” varies widely in the effort needed to find the “suspect”
• Our conference today to help policymakers learn where DeID likely to be most useful